## Seminar week 2: Fun with Word Embeddings

Today we're gonna play with word embeddings: train our own little embeddings, load one from the gensim model zoo and use it to visualize text corpora.

This whole thing is gonna happen on top of a text dataset: `quora.txt`, with one question per line.

__Requirements:__ `pip install --upgrade nltk gensim bokeh scikit-learn`, but only if you're running locally.

In [1]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw


In [2]:
import numpy as np

with open("./quora.txt", encoding="utf-8") as file:
    data = list(file)

data[50]

"What TV shows or books help you read people's body language?\n"

__Tokenization:__ a typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with all the punctuation and smileys attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__ - a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.

In [3]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']
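To see what the tokenizer buys us over a plain `str.split`, here's a minimal sketch using only the standard library. The regex `\w+|[^\w\s]+` is an assumption about how `WordPunctTokenizer` behaves (a token is a run of word characters or a run of punctuation). It reproduces the output above for this sentence, but treat it as an illustration rather than nltk's exact implementation.

```python
import re

sentence = "What TV shows or books help you read people's body language?\n"

# naive whitespace split: punctuation stays glued to the words
naive_tokens = sentence.split()

# regex split: a token is either a run of word characters or a run of punctuation
regex_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

print(naive_tokens[-3:])
print(regex_tokens[-6:])
```

With the naive split, `language?` and `language` would end up as two different "words" in the vocabulary, which is exactly what we want to avoid.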


In [4]:
# TASK: lowercase everything and extract tokens with tokenizer.
# data_tok should be a list of lists of tokens for each line in data.

data_tok = [tokenizer.tokenize(row.lower()) for row in data]

In [6]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fastText, which builds word vectors out of character n-grams.

The choice is huge, so let's start someplace small: __gensim__ is another NLP library that features many vector-based models including word2vec.

In [5]:
from gensim.models import Word2Vec
model = Word2Vec(data_tok,
                 vector_size=32,  # embedding vector size
                 min_count=5,  # consider words that occurred at least 5 times
                 window=5).wv  # define context as a 5-word window around the target word
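The `window` argument decides which words count as context for a target word. Here's a minimal sketch of that idea, assuming a fixed symmetric window. The helper `context_pairs` is hypothetical and not part of gensim, whose actual sampling differs in details.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        left = max(0, i - window)
        right = min(len(tokens), i + window + 1)
        for j in range(left, right):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = context_pairs(["what", "are", "some", "ways"], window=1)
for target, context in pairs:
    print(target, "->", context)
```

A word2vec-style model is trained so that a target word's vector is good at predicting the words it gets paired with here.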

In [6]:
# now you can get word vectors !
model.get_vector('anything')

array([-1.5236349 ,  1.2860825 ,  1.2162362 ,  2.4818227 ,  1.8691998 ,
        2.1207948 ,  1.2753237 , -5.0275455 ,  1.1994302 ,  2.6479285 ,
       -2.1651874 ,  2.7422156 ,  4.7689395 ,  0.97529227,  3.4458208 ,
       -0.40487108,  0.46728992, -0.4599355 , -0.61598766, -1.7746505 ,
       -2.8115013 ,  0.6620162 , -0.41575834, -0.86920214, -0.21429607,
       -3.5330222 ,  0.46978557, -0.51837933,  0.6393923 ,  0.5872444 ,
       -1.3635228 , -0.18616243], dtype=float32)

In [8]:
# or query similar words directly. Go play with it!
model.most_similar('wine')

[('tea', 0.9188472032546997),
 ('bread', 0.9113323092460632),
 ('beer', 0.9043233394622803),
 ('orange', 0.9010516405105591),
 ('chocolate', 0.900094747543335),
 ('rice', 0.8967821598052979),
 ('cheese', 0.8953024744987488),
 ('vodka', 0.8952997326850891),
 ('fruit', 0.8863333463668823),
 ('beans', 0.8837348818778992)]
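`most_similar` ranks words by cosine similarity between their vectors. Here is a minimal numpy sketch of that ranking on made-up 3D vectors. The toy vocabulary and the helper `most_similar_toy` are hypothetical and only stand in for the real model.

```python
import numpy as np

# hand-made vectors, NOT taken from any trained model
toy_vectors = {
    "wine":   np.array([0.9, 0.1, 0.0]),
    "beer":   np.array([0.8, 0.2, 0.1]),
    "tea":    np.array([0.6, 0.5, 0.0]),
    "laptop": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_toy(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(toy_vectors[word], vec))
              for other, vec in toy_vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = most_similar_toy("wine")
print(neighbors)
```

Cosine similarity ignores vector length and only looks at direction, which is why the scores in the output above all fall between -1 and 1.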

### Using a pre-trained model

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no sms required, promise).

In [9]:
import gensim.downloader as api
model = api.load('glove-twitter-100')



In [10]:
# query similar words again, this time with the pre-trained model. Go play with it!
model.most_similar('wine')

[('beer', 0.8526083827018738),
 ('bottle', 0.7869870662689209),
 ('drink', 0.7679266333580017),
 ('tasting', 0.7584347724914551),
 ('coffee', 0.7540070414543152),
 ('drinks', 0.7457992434501648),
 ('drinking', 0.7210016846656799),
 ('beers', 0.7155925631523132),
 ('whiskey', 0.7105889320373535),
 ('wines', 0.708965539932251)]

In [11]:
model.most_similar(positive=["coder", "money"], negative=["brain"])

[('broker', 0.5820155739784241),
 ('bonuses', 0.5424473285675049),
 ('banker', 0.5385112762451172),
 ('designer', 0.5197198390960693),
 ('merchandising', 0.4964233338832855),
 ('treet', 0.4922019839286804),
 ('shopper', 0.4920562207698822),
 ('part-time', 0.4912828207015991),
 ('freelance', 0.4843311905860901),
 ('aupair', 0.4796452522277832)]
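With `positive` and `negative` lists, `most_similar` does vector arithmetic: roughly, it adds the unit-normalized positive vectors, subtracts the negative ones, and looks for the nearest words by cosine similarity, skipping the query words themselves. A sketch on hand-made 2D vectors, where the vocabulary and the helper `analogy_toy` are hypothetical:

```python
import numpy as np

# hand-made vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0, 1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([1.0, 0.8]),
    "apple":  np.array([-1.0, 0.0]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy_toy(positive, negative, topn=1):
    """Find words closest (by cosine) to sum(positive) - sum(negative)."""
    query = sum(unit(toy_vectors[w]) for w in positive) \
        - sum(unit(toy_vectors[w]) for w in negative)
    query = unit(query)
    scores = [(word, float(query @ unit(vec)))
              for word, vec in toy_vectors.items()
              if word not in positive and word not in negative]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = analogy_toy(positive=["king", "woman"], negative=["man"])
print(result)
```

Real embeddings are much noisier than this toy setup, so analogies like the `coder + money - brain` one above tend to give plausible neighborhoods rather than one clean answer.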

### Visualizing word vectors

One way to see if our vectors are any good is to plot them. Thing is, those vectors are in 30D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use them to plot the 1000 most frequent words.

In [12]:
words = model.index_to_key[:1000]

print(words[::100])

['<user>', '_', 'please', 'apa', 'justin', 'text', 'hari', 'playing', 'once', 'sei']


In [13]:
# for each word, compute its vector with model
word_vectors = np.stack([model.get_vector(word) for word in words], axis=0)

In [14]:
word_vectors.shape

(1000, 100)

In [15]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), 100)
assert np.isfinite(word_vectors).all()

#### Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to decompose the object-feature matrix $X$ into two smaller matrices, $W$ and $\hat W$, by minimizing the _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
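The objective above can be checked numerically. Here's a minimal numpy sketch on random stand-in data (not the word vectors), using the fact that for a centered $X$ the top $d$ right singular vectors give an optimal $W$, with $\hat W = W^T$. The helper names `pca_matrices` and `reconstruction_error` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # n=200 samples, m=5 features
X = X - X.mean(axis=0)                                   # PCA expects centered data

def pca_matrices(X, d):
    """Return W (m x d) and W_hat (d x m) built from the top-d right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:d].T      # direct transformation: original space -> d dims
    W_hat = Vt[:d]    # reverse transformation: d dims -> original space
    return W, W_hat

def reconstruction_error(X, d):
    """Squared error ||(X W) W_hat - X||^2 for a d-dimensional projection."""
    W, W_hat = pca_matrices(X, d)
    return float(np.sum(((X @ W) @ W_hat - X) ** 2))

errors = [reconstruction_error(X, d) for d in range(1, 6)]
print(errors)
```

The error shrinks as $d$ grows and drops to (numerically) zero once $d = m$, since nothing is thrown away at that point.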



In [16]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance
pca = PCA(n_components=2)
word_vectors_pca = pca.fit_transform(word_vectors)

scaler = StandardScaler()
word_vectors_pca_scaled = scaler.fit_transform(word_vectors_pca)

In [17]:
assert word_vectors_pca_scaled.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca_scaled.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca_scaled.std(0))) < 1e-2, "points must have unit variance"
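What `StandardScaler` does here boils down to subtracting the per-column mean and dividing by the per-column standard deviation. A numpy sketch on random stand-in data (the array `points` is made up, it is not the real PCA output):

```python
import numpy as np

rng = np.random.default_rng(42)
# two columns with the same mean but very different spread
points = rng.normal(loc=3.0, scale=[5.0, 0.5], size=(1000, 2))

# standardize each column: zero mean, unit variance
points_scaled = (points - points.mean(axis=0)) / points.std(axis=0)

print(points_scaled.mean(axis=0), points_scaled.std(axis=0))
```

After scaling, both axes live on the same scale, so neither of them dominates the plot.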

#### Let's draw it!

In [18]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxilirary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [19]:
draw_vectors(word_vectors_pca_scaled[:, 0], word_vectors_pca_scaled[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with t-SNE
PCA is nice but it's strictly linear and thus only able to capture coarse high-level structure of the data.

If we instead want to focus on keeping neighboring points near, we could use TSNE, which is itself an embedding method. Here you can read __[more on TSNE](https://distill.pub/2016/misread-tsne/)__.
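To get a feel for what "keeping neighboring points near" means, here's a rough numpy sketch of the two kinds of similarities t-SNE compares: Gaussian-based ones in the original space and heavy-tailed Student-t ones in the 2D map. It's a simplification under stated assumptions: a single fixed `sigma` instead of per-point bandwidths tuned via perplexity, random stand-in data, no optimization loop, and helper names that are made up for this example.

```python
import numpy as np

def squared_distances(points):
    """Pairwise squared Euclidean distances, shape (n, n)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def gaussian_affinities(points, sigma=1.0):
    """p_{j|i}: how likely point i picks j as a neighbor in the original space."""
    logits = -squared_distances(points) / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)              # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def student_t_affinities(points_2d):
    """q_{ij}: heavy-tailed similarity between points in the 2D map."""
    weights = 1.0 / (1.0 + squared_distances(points_2d))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()

rng = np.random.default_rng(0)
high_dim = rng.normal(size=(6, 100))   # stand-in for word vectors
low_dim = rng.normal(size=(6, 2))      # stand-in for a (random) 2D layout

n = len(high_dim)
P = gaussian_affinities(high_dim, sigma=10.0)
P_sym = (P + P.T) / (2 * n)            # symmetrized joint probabilities
Q = student_t_affinities(low_dim)

# KL(P || Q) over off-diagonal pairs: the quantity t-SNE tries to make small
mask = ~np.eye(n, dtype=bool)
kl = float(np.sum(P_sym[mask] * np.log(P_sym[mask] / Q[mask])))
print(kl)
```

t-SNE moves the 2D points around to shrink this divergence, which penalizes placing true neighbors far apart much more than placing distant points close together.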

In [21]:
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# map word vectors onto 2d plane with TSNE. hint: don't panic, it may take a minute or two to fit.
word_vectors_tsne = TSNE(n_components=2).fit_transform(word_vectors)

# normalize them just like with PCA: zero mean and unit variance
scaler = StandardScaler()
word_vectors_tsne_scaled = scaler.fit_transform(word_vectors_tsne)

__Now what?__
* Try running TSNE on more data, not just the 1000 most frequent words
* See what other embeddings are there in the model zoo: `gensim.downloader.info()`
* Take a look at [FastText](https://github.com/facebookresearch/fastText) embeddings