In [None]:
# gensim is the package to use for word2vec word embeddings
# don't try to code word2vec yourself !
import gensim

# Word embedding with word2vec

In this notebook, we will look at word2vec, a technique for word embedding.

The goal of word2vec is to map each word of the vocabulary found in a text to a vector of fixed dimension. The main motivation is that a model cannot work on raw words directly: they must first be encoded as numbers, and the simplest encoding, one-hot vectors, has as many dimensions as there are words in the vocabulary and carries no notion of similarity between words.

There are two ways to train a word2vec model. The first is continuous bag of words (CBOW). In the CBOW setting, the text is preprocessed to build pairs made of:
- a small set (e.g. of size 5) of neighboring words in the text
- one other neighboring word in the text.

A neural network is then trained to predict this neighboring word from the other nearby words.

Example: text = 'the cat climbed the tree'
The bags of words and the word to predict could be:
- 'the', 'climbed' -> 'cat'
- 'cat', 'the' -> 'climbed'
- 'climbed', 'tree' -> 'the'
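
The preprocessing described above can be sketched in a few lines of plain Python. This is only an illustration, assuming a symmetric window of one word on each side; `cbow_pairs` and `window` are hypothetical names, not part of gensim.

```python
def cbow_pairs(tokens, window=1):
    """Return (context_words, target_word) pairs for CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


text = "the cat climbed the tree"
for context, target in cbow_pairs(text.split()):
    print(context, "->", target)
```

Words at the edges of the text simply get a shorter context in this sketch.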

To do so, the network learns an embedding, i.e. a function which maps each word to a continuous vector of fixed dimension (e.g. 64). The embedding is followed by one dense layer, and the network outputs a softmax over the words of the vocabulary.
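
A forward pass through such a network can be sketched with NumPy. This is a toy illustration with randomly initialized weights, not gensim's implementation (which by default uses negative sampling rather than a full softmax); the names `E`, `W` and `cbow_forward` are assumptions made for the example, and averaging the context vectors is one common choice.

```python
import numpy as np

vocab = ["the", "cat", "climbed", "tree"]
word_to_index = {w: i for i, w in enumerate(vocab)}
embedding_dim = 64

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))  # embedding table
W = rng.normal(scale=0.1, size=(embedding_dim, len(vocab)))  # dense output layer


def cbow_forward(context_words):
    """Return a probability for every vocabulary word given the context."""
    ids = [word_to_index[w] for w in context_words]
    hidden = E[ids].mean(axis=0)         # average the context embeddings
    logits = hidden @ W                  # dense layer: one score per word
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


probs = cbow_forward(["the", "climbed"])
print(probs.shape, probs.sum())
```

Training would then adjust `E` and `W` so that the probability of the true neighboring word increases.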

The second way is skip-gram: this time, given a single word as context, the network tries to predict the words that are likely to occur around it.
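
The skip-gram training pairs can be sketched in the same spirit. The function below is a hypothetical helper (`skipgram_pairs` is not a gensim function) and assumes a window of one word on each side.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center_word, context_word) pairs for skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


print(skipgram_pairs("the cat climbed the tree".split()))
```

Each pair is one training example: the network sees the center word and must predict the context word.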



#### Below we load the data: we will work on the IMDB reviews dataset (if you have a dataset of interest, try it! word2vec is extremely robust)

In [None]:
from gensim.test.utils import datapath
from gensim import utils

from keras.datasets import imdb

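# Only the training split is used; the test split is discarded.
(x_train, y_train), _ = imdb.load_data()

# load_data() shifts every word index by 3 (its default index_from=3),
# which leaves the lowest indices free for the special tokens below.
word_to_id = imdb.get_word_index()
word_to_id = {k: (v + 3) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

# Reverse mapping: integer id -> word.
id_to_word = {value: key for key, value in word_to_id.items()}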

#### What is x_train ? What is y_train ? In your opinion, why doesn't it look like text ?

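`x_train` holds one list of integers per review, and `y_train` holds the sentiment label of each review (0 for negative, 1 for positive). Keras provides the text already tokenized, i.e. the words have been encoded as integers. This is the right **input** format for an embedding layer. In our case, however, we will feed word2vec with texts, so we need to convert the integers back to words!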

#### What is id_to_word dictionary ?

It maps each integer back to the word it encodes. A few extra tokens (`<PAD>`, `<START>`, `<UNK>`, `<UNUSED>`) exist; we will see how they are useful in the next exercises.
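
As an illustration of how such a mapping is used, the sketch below decodes one encoded review back into words. The small `toy_id_to_word` dictionary and the `decode` helper are made up for the example; the real `id_to_word` built above is much larger.

```python
toy_id_to_word = {0: "<PAD>", 1: "<START>", 2: "<UNK>", 3: "<UNUSED>",
                  4: "the", 5: "movie", 6: "was", 7: "great"}


def decode(encoded_review, id_to_word):
    """Map a list of integer ids back to a list of words."""
    # Ids missing from the dictionary fall back to the unknown-word token.
    return [id_to_word.get(i, "<UNK>") for i in encoded_review]


print(decode([1, 4, 5, 6, 7, 99], toy_id_to_word))
```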

#### Create a list `sentences` which contains, for each review in x_train, the corresponding list of words (strings). This is the name used by the cells below.
Print some elements to check.

In [None]:
sentences = []


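#### gensim's Word2Vec takes an iterable of tokenized sentences that can be traversed several times. A plain list such as `sentences` qualifies, but a one-shot generator does not, and for corpora that do not fit in memory the usual approach is an iterator abstraction, as done below. Iterate through this object. Do you understand what `yield` does?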

In [None]:
class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        for line in sentences:
            # each element of `sentences` is already a list of tokens
            yield line

corpus = MyCorpus()

# iterate !
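
To see what `yield` changes, it can help to compare a generator, which is exhausted after one pass, with a class defining `__iter__`, which starts a fresh pass each time it is iterated. The sketch below uses a hypothetical `toy_sentences` list and `ToyCorpus` class that mirror the cell above.

```python
toy_sentences = [["the", "cat"], ["the", "tree"]]


def one_shot():
    """Generator function: each call returns an iterator usable only once."""
    for sentence in toy_sentences:
        yield sentence


class ToyCorpus:
    """Restartable iterable: every `for` loop calls __iter__ again."""

    def __iter__(self):
        for sentence in toy_sentences:
            # Hand back one sentence, then pause until the next one is requested.
            yield sentence


gen = one_shot()
first_pass = list(gen)
second_pass = list(gen)       # empty: the generator is already exhausted

corpus_demo = ToyCorpus()
pass_one = list(corpus_demo)
pass_two = list(corpus_demo)  # full again: __iter__ creates a new generator

print(len(first_pass), len(second_pass), len(pass_one), len(pass_two))
```

This matters here because training makes several passes over the corpus.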

#### Below we run the word2vec model on our corpus

In [None]:
import gensim.models
model = gensim.models.Word2Vec(sentences=corpus)

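#### Below we print the vocabulary, i.e. the set of words kept by the model (by default, gensim drops words that occur fewer than 5 times in the corpus).

In [None]:
# gensim >= 4.0 exposes the vocabulary as `key_to_index`; on gensim 3.x use `model.wv.vocab` instead
print(model.wv.key_to_index)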

#### The model.wv object has a `most_similar` method which, given a word in the vocabulary, gives you the most similar words according to the computed embedding. Try it !
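
Similarity between word vectors is measured here with cosine similarity. The sketch below reimplements the idea on a handful of hand-made 2-dimensional vectors; the vectors and the `most_similar_sketch` function are invented for illustration and do not come from a trained model.

```python
import numpy as np

toy_vectors = {
    "good": np.array([1.0, 0.1]),
    "great": np.array([0.9, 0.2]),
    "bad": np.array([-1.0, 0.1]),
    "tree": np.array([0.0, 1.0]),
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar_sketch(word, vectors, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar_sketch("good", toy_vectors))
```

With the trained model, the equivalent call would be along the lines of `model.wv.most_similar('good')`, provided the word is in the vocabulary.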

#### Below is an example of arithmetic on the word embeddings (meaningful because the hidden layer of word2vec has no non-linear activation). Add yours!

In [None]:
# similar_by_vector lives on the KeyedVectors object (model.wv), not on the model itself
print(model.wv.similar_by_vector(model.wv['king'] - model.wv['man'] + model.wv['woman']))
print(model.wv.similar_by_vector(model.wv['queen'] + model.wv['man'] - model.wv['woman']))

#### We will now classify the reviews based on their embeddings. Write a function `compute_mean_embedding` which, given a list of words, returns the mean of the embeddings of these words (always check that each word is in the vocabulary).

In [None]:
def compute_mean_embedding(l):
    """
    l: list of words
    """


#### Construct a numpy array `mean_vectors` which contains the mean embedding of each sentence in sentences.

In [None]:
# Now classification using mean vector !
import numpy as np
mean_vectors = []
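
One possible shape for this computation is sketched below on a toy dictionary of 2-dimensional vectors. The names `toy_embeddings`, `toy_reviews` and `mean_embedding_sketch` are hypothetical, and returning a zero vector when no word is known is an assumption of the sketch, not a requirement of the exercise.

```python
import numpy as np

toy_embeddings = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}


def mean_embedding_sketch(words, embeddings, dim=2):
    """Average the vectors of the words that have an embedding."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        # No known word at all: fall back to the zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)


toy_reviews = [["good", "movie"], ["good", "unknownword"], ["unknownword"]]
toy_mean_vectors = np.array(
    [mean_embedding_sketch(review, toy_embeddings) for review in toy_reviews]
)
print(toy_mean_vectors)
```

With the trained model, the same membership test and lookup would presumably go through `model.wv` (`w in model.wv`, `model.wv[w]`).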


#### Train a logistic regression with 10-fold cross-validation on `mean_vectors` to predict the class of each review, which is given in y_train. What is the mean test accuracy?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
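
The cross-validation loop itself can be sketched on synthetic data. Here `X_demo` and `y_demo` are random stand-ins for `mean_vectors` and `y_train` (an assumption made so that the example runs on its own), and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in for `mean_vectors` and `y_train` (assumption for the demo).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

fold_accuracies = []
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X_demo):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])  # fit on 9 folds
    predictions = clf.predict(X_demo[test_idx])    # evaluate on the held-out fold
    fold_accuracies.append(accuracy_score(y_demo[test_idx], predictions))

print(np.mean(fold_accuracies))
```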


#### What do you think of this accuracy estimation ? Is there some data leakage somewhere ? How would you do otherwise ?
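
One possible line of answer, offered as a sketch rather than the intended solution: the embedding was trained on all reviews, including those later used as test folds, so the test text has already influenced the features. A stricter protocol refits the embedding inside each fold, on the training reviews only. In the toy code below, `fit_embedding` is a placeholder that merely records the words seen (it stands in for word2vec training), and `kfold_indices` and `toy_corpus` are hypothetical names.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous folds."""
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        start = k * fold_size
        stop = n_samples if k == n_splits - 1 else start + fold_size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx


def fit_embedding(train_sentences):
    """Placeholder for word2vec training: here, just the set of words seen."""
    return {word for sentence in train_sentences for word in sentence}


toy_corpus = [["good", "movie"], ["bad", "movie"],
              ["great", "plot"], ["awful", "plot"]]

for train_idx, test_idx in kfold_indices(len(toy_corpus), n_splits=2):
    # The embedding is refit inside the loop, on the training folds only.
    embedding = fit_embedding([toy_corpus[i] for i in train_idx])
    unseen = [w for i in test_idx for w in toy_corpus[i] if w not in embedding]
    print("words never seen during embedding training:", unseen)
```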

#### Possible continuation: use the word2vec embedding as initialization for an embedding layer to perform the same classification task.