# Getting started with Word2Vec in Gensim and making it work!

The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in a similar context.
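To make the "company it keeps" idea concrete, here is a minimal sketch in plain Python. It is not how Word2Vec works internally; it only collects the neighbors of each word in a tiny made-up corpus and measures how much two neighbor sets overlap. The sentences, the `neighbor_sets` helper and the `jaccard` helper are all assumptions introduced for illustration.

```python
from collections import defaultdict

# Toy corpus, made up for illustration only.
sentences = [
    "i was shocked by the news".split(),
    "i was appalled by the news".split(),
    "i was astonished by the news".split(),
    "we ate pizza for dinner".split(),
]

def neighbor_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    neighbors = defaultdict(set)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            neighbors[word].update(sentence[lo:i] + sentence[i + 1:hi])
    return neighbors

def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

neighbors = neighbor_sets(sentences)
similar_pair = jaccard(neighbors["shocked"], neighbors["appalled"])
unrelated_pair = jaccard(neighbors["shocked"], neighbors["pizza"])
print(similar_pair, unrelated_pair)
```

Words used in the same contexts end up with heavily overlapping neighbor sets, while unrelated words share few or no neighbors. Word2Vec learns dense vectors that capture this kind of signal instead of comparing sets directly.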

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work!

**Note:** Model performance is a combination of two things:
  1. Your input data
  2. Your parameter settings

Note that the training algorithms in the `gensim` package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

## Imports

Our dataset is in CSV form, so we'll use `pandas` to read it. `gensim` is the other obvious import, as it contains the Word2Vec implementation we'll use.

In [2]:
import pandas as pd
import gensim 

## Dataset 
Next is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset. This dataset has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review. You can download the OpinRank Word2Vec dataset here.

To avoid confusion, while gensim’s word2vec tutorial says that you need to pass it a sequence of sentences as its input, you can always pass it a whole review as a sentence (i.e. a much larger size of text), and it should not make much of a difference. 

Now, let's take a closer look at this data below by printing the first line. You can see that this is a pretty hefty review.

In [3]:
import io
import requests
url = "https://raw.githubusercontent.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/master/bbc-text.csv"
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
# df = pd.read_csv('bbc-text.csv')  # use this line instead if you've downloaded the dataset directly

In [4]:
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [5]:
print('Number of news items: {}'.format(len(df)))

Number of news items: 2225


### Converting the dataset into the right format

Now that we've had a sneak peek at our dataset, we need to convert it to the right format so that we can pass it on to the Word2Vec model.

**Question 1:** Read the documentation of the `gensim.models.word2vec.Word2Vec` class (at https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec). What is the form of data it expects you to provide to the `sentences` parameter?

**Answer 1:** `sentences` expects an iterable of iterables of tokens. In simple terms, this is a list of documents, where each document is a list of strings (other options are iterables that read data lazily from disk, the network, etc.).
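The two input shapes described in Answer 1 can be sketched as follows. This is a hedged illustration: `LineCorpus` is a hypothetical helper class (not part of Gensim), and the file it reads is a temporary file created just for the demo. The corpus is typically iterated more than once (for the vocabulary and then for training), so the sketch re-opens the file on every iteration instead of using a one-shot generator.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable: yields one token list per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every call lets the corpus be iterated many times.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:
                    yield tokens

# In-memory form: a list of documents, each a list of string tokens.
documents_in_memory = [
    ["tv", "future", "in", "the", "hands"],
    ["worldcom", "boss", "left"],
]

# Lazy form: write a tiny hypothetical file, then stream it from disk.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("TV future in the hands\nWorldCom boss left\n")

corpus = LineCorpus(tmp.name)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same documents again
os.remove(tmp.name)
```

Either `documents_in_memory` or an object like `corpus` has the "iterable of iterables of tokens" shape; the lazy form matters mostly when the data does not fit in memory.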

Since text preprocessing is not the point of this exercise, simple preprocessing will do. You can write it yourself, or use existing code.

**Question 2:** The `gensim.utils` module contains a function that can be used to easily preprocess documents into the desired format. Examine the documentation of the module at https://radimrehurek.com/gensim/utils.html and find the appropriate function.

**Answer 2:** The `gensim.utils.simple_preprocess(doc)` function accepts an input document as a `str` object, does some basic preprocessing such as tokenization and lowercasing, and returns a list of tokens (words).
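If you would rather write the preprocessing yourself, a rough plain-Python approximation might look like the sketch below. It is not Gensim's implementation: `rough_preprocess` and `TOKEN_PATTERN` are hypothetical names, and the `min_len=2` / `max_len=15` bounds are assumed to mirror the defaults documented for `simple_preprocess`.

```python
import re

# Runs of letters only: digits, punctuation and underscores act as separators.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")

def rough_preprocess(doc, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short or long ones."""
    tokens = TOKEN_PATTERN.findall(doc.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

example_tokens = rough_preprocess("The quick-brown fox, a 42-year-old!")
print(example_tokens)
```

The single-letter token `a` and the number `42` are dropped, and everything else is lowercased and split on punctuation.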

**Question 3:** Implement the following method to convert our dataframe dataset to the required format.

In [6]:
def dataframe(df):
    """This method converts the data from a pandas.DataFrame object to the format gensim.Word2Vec expects."""
    return [
        gensim.utils.simple_preprocess(doc)
        for doc in df['text']
    ]

In [7]:
documents = dataframe(df)

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the documents that we read in the previous step (the `documents`).

So, we are essentially passing on a list of lists, where each list within the main list contains the tokens from one news item. Word2Vec uses all these tokens to internally create a vocabulary, that is, a set of unique words.

After building the vocabulary, we just need to call `train(...)` to start training the Word2Vec model. Training on our small (2225 documents) BBC News dataset should take very little time.

As we've [seen in the lecture](https://docs.google.com/presentation/d/1EXOBaV7rg_KQXpEJU9XnQkkxHOT50PV2ATeau5tdjRY/edit?usp=sharing), behind the scenes we are actually training a simple neural network with a single hidden layer. But, we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. 
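Why are the hidden-layer weights "essentially the word vectors"? A minimal sketch, assuming the input is a one-hot vector over the vocabulary: multiplying a one-hot vector by the weight matrix simply selects one row of that matrix. The three-word vocabulary, the 2-dimensional weights and the `matvec_row` helper are made up for illustration.

```python
def matvec_row(one_hot, weights):
    """Multiply a one-hot row vector by a (vocab_size x hidden_size) matrix."""
    hidden_size = len(weights[0])
    return [
        sum(one_hot[i] * weights[i][j] for i in range(len(weights)))
        for j in range(hidden_size)
    ]

vocab = ["hotel", "room", "clean"]
weights = [
    [0.1, 0.2],  # row for "hotel"
    [0.3, 0.4],  # row for "room"
    [0.5, 0.6],  # row for "clean"
]

one_hot_room = [0, 1, 0]  # one-hot encoding of "room"
hidden = matvec_row(one_hot_room, weights)
print(hidden)
```

The hidden-layer activation for "room" is exactly the second row of `weights`, which is why looking up a word's vector amounts to reading one row of the learned matrix.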

### Understanding some of the parameters

To train the model we need to set some parameters. Let's first recall what the most important ones mean.

Use the [documentation of gensim.Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) and the [lecture slides](https://docs.google.com/presentation/d/1EXOBaV7rg_KQXpEJU9XnQkkxHOT50PV2ATeau5tdjRY/edit?usp=sharing) to answer the following questions:


**Question 4:** What is the meaning of the `size` parameter? To what part of the model architecture does it relate?

**Answer 4:** `size` determines the size of the single hidden layer of neurons in the neural network we will train on the data. As a result, it is also the dimension of the dense vectors used to represent each token or word. (In Gensim 4.0 and later, this parameter is named `vector_size`.) If you have very limited data, then `size` should be relatively small. If you have lots of data, it's good to experiment with various sizes. Values between 50 and 300 are very common and have proved to work well for various applications.

**Question 5:** Does the `window` parameter mean we will look at a moving window of size `window` and use the word in its center as the target word, and all other words as neighboring words?

**Answer 5:** No. As the documentation of the parameter says, it determines the maximum distance between the target word and its neighboring words. A word that lies more than `window` positions to the left or right of the target word is not treated as one of its neighbors.

So, setting `window=5` does NOT mean we will look at the center word of a window of size 5; rather, we will take up to 5 words to the left of the center word and up to 5 words to the right of it as neighboring words.

In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
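The meaning of `window` as a maximum distance on each side can be sketched in a few lines. This is an illustration only: `context_pairs` is a hypothetical helper, and real implementations may additionally shrink the effective window at random for each target word, which this sketch does not do.

```python
def context_pairs(tokens, window):
    """Yield (target, neighbor) pairs for neighbors at most `window` positions away."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield target, tokens[j]

tokens = ["the", "hotel", "room", "was", "very", "clean", "and", "quiet"]
neighbors_of_was = [n for t, n in context_pairs(tokens, window=2) if t == "was"]
print(neighbors_of_was)
```

With `window=2`, the target word `was` gets two neighbors on the left and two on the right, so the full span covers up to 5 tokens, not 2.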

Two other important parameters are:

#### `min_count`
Minimum frequency count of words. The model ignores words that do not satisfy the `min_count` threshold. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
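As a rough sketch of what a frequency threshold like `min_count` does, the snippet below counts tokens across a toy corpus and keeps only those that occur often enough. `build_vocab` and the toy documents are hypothetical and only illustrate the filtering idea, not Gensim's internal vocabulary structure.

```python
from collections import Counter

def build_vocab(documents, min_count):
    """Count every token and keep only those seen at least `min_count` times."""
    counts = Counter(token for doc in documents for token in doc)
    return {token: n for token, n in counts.items() if n >= min_count}

toy_documents = [
    ["hotel", "room", "clean"],
    ["hotel", "staff", "friendly"],
    ["room", "clean", "quiet"],
]

vocab = build_vocab(toy_documents, min_count=2)
print(sorted(vocab))
```

Tokens such as `staff` or `quiet`, which appear only once, are dropped and would get no vector of their own.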

#### `workers`
This parameter determines how many worker threads are used behind the scenes to train the model.

Let's initialize the `Word2Vec` model with some sensible default parameter values:

In [12]:
model = gensim.models.Word2Vec(documents, size=100, window=10, min_count=2, workers=10)

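With the `model` object now initialized, we can train it using the following command. Note that because we passed `documents` to the constructor, Gensim has already built the vocabulary and run an initial round of training, so this call continues training for 10 more epochs: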

In [13]:
model.train(
    sentences=documents,
    total_examples=len(documents),
    epochs=10,
)

(6435793, 8222420)

**Note:** To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either `total_examples` (count of sentences) or `total_words` (count of raw words in sentences) MUST be provided.
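To see why the total amount of training data must be known up front, here is a minimal sketch of a linear learning-rate schedule. It is an illustration, not Gensim's code: `linear_alpha` is a hypothetical helper, the starting values 0.025 and 0.0001 are assumptions, and `examples_seen` is a made-up point in training.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training is done."""
    return alpha - (alpha - min_alpha) * progress

total_examples, epochs = 2225, 10
examples_seen = 11125  # hypothetical point halfway through training

# Progress can only be computed if the total amount of work is known in advance.
progress = examples_seen / (total_examples * epochs)
current_alpha = linear_alpha(progress)
print(progress, current_alpha)
```

Without `total_examples` (or `total_words`) there would be no denominator for `progress`, so the schedule could not tell how far along training is.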

## Using a trained model to find similar words

After training, the resulting embedding is represented by a `gensim.models.keyedvectors.Word2VecKeyedVectors` object ([see documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors)) that you can access as the `wv` attribute of the model object, in our case as `model.wv`.

This is the object that should be used to make queries of the resulting embedding.

**Question 6:** The `gensim.models.keyedvectors.Word2VecKeyedVectors` class has a method that can help you find similar words to some input word.

**6.1 - What is the name of this method?**

**6.2 - What is the name of the parameter that determines how many similar words will be returned?**

**6.3 - Find the 8 most similar words to "terror" (see the definition of `w1` below).**

In [20]:
w1 = "terror"

**Answer 6.1:**

`Word2VecKeyedVectors.most_similar()`. In our case it can be called as `model.wv.most_similar()`.

**Answer 6.2:** 

`topn`

**Answer 6.3:**

In [21]:
model.wv.most_similar(positive=w1, topn=8)

[('terrorism', 0.8812843561172485),
 ('suspects', 0.8728781938552856),
 ('laws', 0.8391991853713989),
 ('detention', 0.82683926820755),
 ('liberties', 0.7959276437759399),
 ('detained', 0.794581413269043),
 ('detainees', 0.7931122779846191),
 ('arrest', 0.7869716286659241)]

That looks pretty good, right?

We can also provide several positive examples!

**Question 7:** Find the 6 words most similar to "france" AND "germany" (see how the `positive_words` variable is defined below and use it).

In [25]:
positive_words = ["france", "germany"]

**Answer 7:**

In [26]:
model.wv.most_similar(positive=positive_words, topn=6)

[('australia', 0.7740954160690308),
 ('sweden', 0.7739661931991577),
 ('italy', 0.75986647605896),
 ('argentina', 0.7227758169174194),
 ('korea', 0.7216360569000244),
 ('netherlands', 0.7166856527328491)]

**Question 8:** How do the results differ when we use two Asian countries as our positive words?

(see how the `positive_words` variable is defined below and use it)

In [35]:
positive_words = ["vietnam", "china"]

**Answer 8:**

In [36]:
model.wv.most_similar(positive=positive_words, topn=6)

[('sedans', 0.7861303091049194),
 ('chile', 0.7666130661964417),
 ('creek', 0.7617331743240356),
 ('automobiles', 0.7466710805892944),
 ('mill', 0.7374764680862427),
 ('siberia', 0.7351670265197754)]

That's nice. You can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered as related.

**Question 9.1:** Let's first look at the 10 most related words to "afghanistan" and "iraq".

(see how the `positive_words` variable is defined below and use it)

In [54]:
positive_words = ["afghanistan", "iraq"]

**Answer 9.1**:

In [55]:
model.wv.most_similar(positive=positive_words)

[('war', 0.7471005916595459),
 ('reform', 0.719367504119873),
 ('administration', 0.7119519114494324),
 ('troops', 0.7085784077644348),
 ('pakistan', 0.6957085132598877),
 ('surrendering', 0.6812551021575928),
 ('allies', 0.6811525225639343),
 ('debates', 0.6729350090026855),
 ('ukraine', 0.6714586019515991),
 ('conflict', 0.6648493409156799)]

**Question 9.2:** Let's see how results change when we add "terror" as a negative word. What interesting word remains in the list but drops in rank? What new words appear?

(see how the `positive_words` and `negative_words` variables are defined below and use them)

In [56]:
positive_words = ["afghanistan", "iraq"]
negative_words = ["terror"]

**Answer 9.2**:

In [57]:
model.wv.most_similar(positive=positive_words, negative=negative_words)

[('pakistan', 0.6353394985198975),
 ('economic', 0.6186375021934509),
 ('administration', 0.6080954670906067),
 ('turkish', 0.5843637585639954),
 ('negative', 0.5838034152984619),
 ('stability', 0.580738365650177),
 ('war', 0.5729011297225952),
 ('liberia', 0.5663363337516785),
 ('unemployment', 0.5603222846984863),
 ('trade', 0.5568909645080566)]
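One way to picture how positive and negative words combine is sketched below: average the unit-length vectors of the positive words together with the negated unit-length vectors of the negative words, then rank the remaining words by cosine similarity to that query. This is a simplified illustration under stated assumptions, not Gensim's implementation; the 2-dimensional vectors are made up, and the local `most_similar` function is a hypothetical stand-in.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    signed = [unit(vectors[w]) for w in positive]
    signed += [[-x for x in unit(vectors[w])] for w in negative]
    query = [sum(col) / len(signed) for col in zip(*signed)]
    exclude = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-d vectors: first axis ~ "conflict", second axis ~ "economy".
toy_vectors = {
    "iraq": [0.9, 0.3],
    "afghanistan": [0.8, 0.4],
    "terror": [1.0, 0.0],
    "war": [0.95, 0.2],
    "economic": [0.5, 0.8],
    "trade": [0.1, 1.0],
}

plain = most_similar(toy_vectors, ["afghanistan", "iraq"])
filtered = most_similar(toy_vectors, ["afghanistan", "iraq"], negative=["terror"])
print([w for w, _ in plain])
print([w for w, _ in filtered])
```

In this toy setup, subtracting `terror` pushes the query away from the "conflict" direction, so `war` drops in rank and economy-related words move up, which is the same kind of shift seen in the real output above.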

### Similarity between two words in the vocabulary

**Question 10:** The `gensim.models.keyedvectors.Word2VecKeyedVectors` class has a method that can help you find the normalized similarity score between two words.

**10.1 - What is the name of this method?**

**10.2 - What measure of vector similarity does it use?**

**10.3 - What is the highest possible similarity score? Demonstrate it.**

**10.4 - Find the similarity between "france" and "germany". Compare it to the similarity between "france" and "sudan". Which pair do you expect to be more similar? By what degree?**

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

**Answer 10.1:** `Word2VecKeyedVectors.similarity()`. Can be accessed in our case with `model.wv.similarity`.

**Answer 10.2:** Cosine similarity.  You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).

**Answer 10.3:**

In [63]:
# similarity between two identical words
model.wv.similarity(w1="dirty", w2="dirty")

1.0000001

If you compute the similarity between two identical words, the score will be 1.0 (or very, very close to it; the `1.0000001` above is a floating-point rounding artifact). This is the highest possible score, since cosine similarity always lies in the range [-1.0, 1.0].
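For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on made-up vectors; `cosine_similarity` is a local helper written for this example, not the Gensim method.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same, opposite, orthogonal)
```

Identical directions score (approximately) 1, opposite directions score (approximately) -1, and perpendicular vectors score 0, so in general the measure spans -1 to 1.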

**Answer 10.4:**

In [64]:
model.wv.similarity(w1="france",w2="germany")

0.6285567

In [65]:
model.wv.similarity(w1="france",w2="sudan")

0.40872124

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [68]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["france","germany","sudan"])

'sudan'

In [70]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["economy","treasury","soccer"])

'soccer'
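A plausible way to find the odd one out, sketched here under stated assumptions, is to average the unit-length vectors of all the words and pick the word whose vector is least similar to that average. The vectors below are made up, and `odd_one_out` is a hypothetical helper rather than the Gensim method itself.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    mean = [sum(col) / len(words) for col in zip(*units.values())]
    return min(words, key=lambda w: sum(x * y for x, y in zip(units[w], mean)))

# Made-up 2-d vectors for illustration only.
toy_vectors = {
    "france": [0.9, 0.1],
    "germany": [0.8, 0.2],
    "sudan": [0.2, 0.9],
}

outlier = odd_one_out(toy_vectors, ["france", "germany", "sudan"])
print(outlier)
```

Because two of the three toy vectors point in nearly the same direction, the mean sits close to them and the third word stands out.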

## Representing documents based on trained word embeddings

Once we have a trained model for word embeddings, we can represent entire documents as vectors in the same space. There are several ways to do so, but the most common one, and an extremely strong baseline, is simply averaging the word vectors of all the words in a document to get a document vector.

**Question 11:** Implement the `document_to_vector(document, model)` method below.

*Note:* You'll have to somehow handle the fact that not all words in the dataset were included in our vocabulary!

*Hint:* Use `np.average`.

**Answer 11:**

In [87]:
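import gensim
import numpy as np

def document_to_vector(document, model):
    """Represent a document as the average of its in-vocabulary word vectors."""
    token_list = gensim.utils.simple_preprocess(document)
    vector_list = []
    for token in token_list:
        try:
            vector_list.append(model.wv[token])
        except KeyError:
            # Skip tokens that are not in the model's vocabulary.
            pass
    if not vector_list:
        # No known tokens at all: fall back to a zero vector of the right size.
        return np.zeros(model.wv.vector_size)
    return np.average(vector_list, axis=0)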

We can now use this method to examine the vector representation of the first document in our corpus:

In [90]:
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [91]:
df['text'].iloc[0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

In [89]:
vec0 = document_to_vector(df['text'].iloc[0], model)
print(vec0.shape)

(100,)


In [92]:
print(vec0)

[ 0.3736611  -0.06766405 -0.49864665 -0.12934841  0.07735138  0.46799487
  0.48925743  0.06766327 -0.44355062  0.5480065   0.03922173  0.65703917
 -0.1093204  -0.11587837  0.06965002  0.63740337  0.391286   -0.33929914
 -0.41782418 -0.29526746 -0.01245262 -0.01923249 -0.42284146  0.2264081
 -0.44572663  0.08350075 -0.7117216  -0.28296927 -0.48120835 -0.37991342
  0.3782996  -0.09798248  0.28279737 -0.81057364  0.57862514 -0.08224098
  0.10575825  0.18886563  0.20654963  0.16025494  0.8921339  -0.22214389
  0.18034428 -0.07066693  0.31765705  0.12881042  0.20420377 -0.23285232
 -0.12589552  0.7712561   0.48479965  0.19197053  0.21331042 -0.20917788
  0.53909165  0.36398512 -0.45529592 -0.38396904  0.71978474  0.15105893
  0.11681143 -0.5689564  -0.44588482  0.05243408 -0.17566644  0.2621075
  0.06959043 -0.01958802  0.21099374 -0.24675322  0.40552518 -0.1680388
 -0.26236477  0.10751045 -0.16701369 -0.05243683  0.14572123 -0.14670472
  0.5065245  -0.17632087  0.06368297 -0.00894601 -0.63

## Using Word2Vec-induced document representation for classification

That's it! From here on the code to use this representation for some task on our data is trivial!

Let's see how to do this (no questions here, just read and make sure you understand the code):

In [96]:
df['w2v'] = df['text'].apply(document_to_vector, model=model)

In [97]:
df.head()

Unnamed: 0,category,text,w2v
0,tech,tv future in the hands of viewers with home th...,"[0.3736611, -0.06766405, -0.49864665, -0.12934..."
1,business,worldcom boss left books alone former worldc...,"[-0.28780502, -0.42619047, 0.0740801, -0.27487..."
2,sport,tigers wary of farrell gamble leicester say ...,"[-0.34710944, 0.21615139, -0.020872245, 0.1819..."
3,sport,yeading face newcastle in fa cup premiership s...,"[-0.67314976, 0.10055729, 0.10629929, 0.167987..."
4,entertainment,ocean s twelve raids box office ocean s twelve...,"[-0.021189053, -0.44900036, 0.30422804, -0.222..."


In [101]:
from sklearn.preprocessing import LabelEncoder
lblenc = LabelEncoder()

In [102]:
df['lbl'] = lblenc.fit_transform(df['category'])

In [106]:
lblenc.classes_

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype=object)
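Conceptually, the label encoding above sorts the distinct category names and replaces each name with its position in that sorted list. The sketch below reproduces that idea in plain Python; it is an illustration of the mapping, not scikit-learn's implementation, and the short `categories` list is made up.

```python
categories = ["tech", "business", "sport", "sport", "entertainment", "politics"]

# Sorted unique class names; the index of each name becomes its integer label.
classes = sorted(set(categories))
to_label = {name: index for index, name in enumerate(classes)}
labels = [to_label[name] for name in categories]

print(classes)
print(labels)
```

Here `tech` maps to 4 and `business` to 0, because the classes are ordered alphabetically.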

We now have a column for the way we want to represent our data, and another one for our encoded labels:

In [103]:
df.head()

Unnamed: 0,category,text,w2v,lbl
0,tech,tv future in the hands of viewers with home th...,"[0.3736611, -0.06766405, -0.49864665, -0.12934...",4
1,business,worldcom boss left books alone former worldc...,"[-0.28780502, -0.42619047, 0.0740801, -0.27487...",0
2,sport,tigers wary of farrell gamble leicester say ...,"[-0.34710944, 0.21615139, -0.020872245, 0.1819...",3
3,sport,yeading face newcastle in fa cup premiership s...,"[-0.67314976, 0.10055729, 0.10629929, 0.167987...",3
4,entertainment,ocean s twelve raids box office ocean s twelve...,"[-0.021189053, -0.44900036, 0.30422804, -0.222...",1


Let's generate `X` and `y` the way `sklearn` expects them:

In [93]:
from sklearn.model_selection import train_test_split

In [116]:
X = np.stack(df['w2v'].values)
y = df['lbl'].values

And split them:

In [117]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [118]:
X_train.shape

(1490, 100)
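The shape above can be checked with a little arithmetic. A sketch, assuming the number of test samples is `test_size * n_samples` rounded up and the training set receives the remainder:

```python
import math

n_samples = 2225   # number of news items in the dataset
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test count is rounded up, and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

This gives 1490 training rows, each a 100-dimensional document vector, and 735 test rows.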

We can now train any classifier over our data:

In [119]:
from sklearn.ensemble import RandomForestClassifier

In [120]:
rndforest = RandomForestClassifier(n_estimators=50, max_depth=8)

In [122]:
rndforest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=8, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [124]:
y_pred = rndforest.predict(X_test)

In [123]:
from sklearn.metrics import classification_report

We get some pretty good results! As an interesting exercise, compare them with the results over a bag-of-words representation.

Do you expect it to be better or worse?

What does it mean about our learned Word2Vec representation if performance is roughly the same?

In [126]:
print(classification_report(y_test, y_pred, target_names=lblenc.classes_))

               precision    recall  f1-score   support

     business       0.94      0.93      0.94       181
entertainment       0.95      0.91      0.93       129
     politics       0.90      0.96      0.93       125
        sport       1.00      0.96      0.98       158
         tech       0.95      0.99      0.97       142

     accuracy                           0.95       735
    macro avg       0.95      0.95      0.95       735
 weighted avg       0.95      0.95      0.95       735
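To read the report, it helps to recall how the per-class columns are defined. The sketch below computes precision, recall and F1 for one class from toy labels; `per_class_scores` is a hypothetical helper, and the toy label lists are made up (they are not taken from the results above).

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy labels, made up for illustration.
y_true_toy = ["sport", "sport", "tech", "tech", "tech", "politics"]
y_pred_toy = ["sport", "tech", "tech", "tech", "politics", "politics"]

sport_scores = per_class_scores(y_true_toy, y_pred_toy, "sport")
tech_scores = per_class_scores(y_true_toy, y_pred_toy, "tech")
print(sport_scores, tech_scores)
```

Precision asks "of everything predicted as this class, how much was right?", recall asks "of everything that truly is this class, how much was found?", and F1 is the harmonic mean of the two. The `support` column is simply the number of true examples of each class in the test set.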

