# Getting started with Word2Vec in Gensim and making it work!

The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in a similar context.

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work!

**Note:** Model performance is a combination of two things:
  1. Your input data
  2. Your parameter settings

Note that the training algorithms in the `gensim` package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

## Imports

Our dataset is in CSV form, so we'll use `pandas` to read it. `gensim` is the other obvious import, as it contains the Word2Vec implementation we'll use.

In [None]:
import pandas as pd
import gensim 

## Dataset 
Next is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset. This dataset has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review.

To avoid confusion, while gensim’s word2vec tutorial says that you need to pass it a sequence of sentences as its input, you can always pass it a whole review as a sentence (i.e. a much larger size of text), and it should not make much of a difference. 

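Now, let's load the data and take a closer look at it by printing the first few rows. Note that the cells below actually load a small BBC News dataset (`bbc-text.csv`), in which each row holds one news item with its category and text.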

In [None]:
import io
import requests
url = "https://raw.githubusercontent.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/master/bbc-text.csv"
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
# df = pd.read_csv('bbc-text.csv')  # use this line instead if you've downloaded the dataset directly

In [None]:
df.head()

In [None]:
print('Number of news items: {}'.format(len(df)))

### Converting the dataset into the right format

Now that we've had a sneak peek at our dataset, we need to convert it to the right format so that we can pass it on to the Word2Vec model.

**Question 1:** Read the documentation of the `gensim.models.word2vec.Word2Vec` class (at https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec). What is the form of data it expects you to provide to the `sentences` parameter?

**Answer 1:** 

Since text preprocessing is not the point of this exercise, simple preprocessing will do. You can write it yourself, or use existing code.

**Question 2:** The `gensim.utils` module contains a function that can be used to easily preprocess documents into the desired format. Examine the documentation of the module at https://radimrehurek.com/gensim/utils.html and find the appropriate function.

**Answer 2:**

**Question 3:** Implement the following method to convert our dataframe dataset to the required format.

In [None]:
def dataframe(df):
    """This method converts the data from a pandas.DataFrame object to the format gensim.Word2Vec expects."""
    # implement me!

In [None]:
documents = dataframe(df)

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the documents that we read in the previous step (the `documents`).

So, we are essentially passing on a list of lists, where each list within the main list contains the tokens of a single document. Word2Vec uses all these tokens to internally create a vocabulary, i.e. a set of unique words.
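To make the "list of lists" shape concrete, here is a minimal sketch on a tiny hand-written corpus. The `toy_documents` name and its contents are made up for illustration and are not taken from the dataset. It only collects the unique tokens, which is roughly what the vocabulary amounts to before any frequency filtering.

```python
# Hypothetical toy corpus: each inner list holds the tokens of one document.
toy_documents = [
    ["the", "economy", "grew", "last", "year"],
    ["the", "team", "won", "the", "match"],
]

# The vocabulary is the set of unique tokens across all documents.
vocabulary = sorted({token for doc in toy_documents for token in doc})

print(len(vocabulary))
print(vocabulary)
```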

After building the vocabulary, we just need to call `train(...)` to start training the Word2Vec model. Training on our small BBC News dataset (a little over 2,200 documents) should take very little time.

As we've [seen in the lecture](https://docs.google.com/presentation/d/1EXOBaV7rg_KQXpEJU9XnQkkxHOT50PV2ATeau5tdjRY/edit?usp=sharing), behind the scenes we are actually training a simple neural network with a single hidden layer. But, we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. 
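The claim that the hidden-layer weights are the word vectors can be illustrated with a tiny hand-made example. The sketch below assumes a one-hot input and a hidden layer without a non-linearity; the vocabulary and the numbers in `hidden_weights` are invented for illustration, not learned.

```python
# Hypothetical 3-word vocabulary and a 3x2 hidden-layer weight matrix
# (one row per word, one column per hidden unit). The numbers are made up.
vocab = ["france", "germany", "soccer"]
hidden_weights = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
]

def one_hot(word):
    """Return the one-hot input vector for `word`."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_activation(word):
    """Multiply the one-hot input by the weight matrix (no non-linearity)."""
    x = one_hot(word)
    n_hidden = len(hidden_weights[0])
    return [
        sum(x[i] * hidden_weights[i][j] for i in range(len(vocab)))
        for j in range(n_hidden)
    ]

# The product simply selects the row of the matrix that belongs to the word.
print(hidden_activation("germany"))
```

Because the input is one-hot, the multiplication is just a row lookup, which is why the rows of the weight matrix can be read off directly as word vectors.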

### Understanding some of the parameters

To train the model we need to set some parameters. Let's first recall what the most important ones mean.

Use the [documentation of gensim.Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) and the [lecture slides](https://docs.google.com/presentation/d/1EXOBaV7rg_KQXpEJU9XnQkkxHOT50PV2ATeau5tdjRY/edit?usp=sharing) to answer the following questions:


**Question 4:** What is the meaning of the `size` parameter (named `vector_size` in gensim 4.0 and later)? To what part of the model architecture does it relate?

**Answer 4:**

**Question 5:** Does the `window` parameter mean we will look at a moving window the size of `window` and use the word in its center as the target word, and all other words as neighboring words?

**Answer 5:**

Two other important parameters are:

#### `min_count`
Minimum frequency count of words. The model ignores words that do not satisfy the `min_count`. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
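As a rough sketch of what this filtering amounts to, the snippet below applies the same rule by hand to a made-up corpus: a word is kept only if it occurs at least `min_count` times in total. The corpus and variable names are hypothetical, and this is not how gensim implements it internally.

```python
from collections import Counter

# Hypothetical toy corpus, one token list per document.
toy_documents = [
    ["hotel", "room", "clean"],
    ["hotel", "room", "noisy"],
    ["hotel", "staff", "friendly"],
]
min_count = 2  # same role as the gensim parameter, applied by hand here

# Count every token over the whole corpus, then drop the rare ones.
counts = Counter(token for doc in toy_documents for token in doc)
kept = {token for token, n in counts.items() if n >= min_count}

print(sorted(kept))
```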

#### `workers`
This parameter determines how many worker threads are used behind the scenes to train the model.

Let's initialize the `Word2Vec` model with some sensible default parameter values:

In [None]:
# Note: gensim 4.0 and later renamed the `size` parameter to `vector_size`.
model = gensim.models.Word2Vec(documents, size=100, window=10, min_count=2, workers=10)

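With the `model` object now initialized (passing `documents` to the constructor already builds the vocabulary and runs an initial round of training), we can continue training it using the following command: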

In [None]:
model.train(
    sentences=documents,
    total_examples=len(documents),
    epochs=10,
)

**Note:** To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either `total_examples` (count of sentences) or `total_words` (count of raw words in sentences) MUST be provided.
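The following is a minimal sketch of what a linear decay from `alpha` to `min_alpha` looks like, and why the total amount of work has to be known in advance. The function name `linear_alpha` and the corpus size are hypothetical; the default values are chosen to mirror gensim's documented defaults, but the code is an illustration rather than gensim's implementation.

```python
def linear_alpha(examples_seen, total_examples, epochs,
                 alpha=0.025, min_alpha=0.0001):
    """Linearly interpolate from alpha to min_alpha by training progress."""
    # Progress can only be computed if the total amount of work is known.
    progress = examples_seen / (total_examples * epochs)
    progress = min(progress, 1.0)
    return alpha - (alpha - min_alpha) * progress

total_examples, epochs = 2000, 10  # hypothetical corpus size and epoch count
for seen in (0, 10000, 20000):
    print(seen, round(linear_alpha(seen, total_examples, epochs), 5))
```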

## Using a trained model to find similar words

After training, the resulting embedding is represented by a `gensim.models.keyedvectors.Word2VecKeyedVectors` object ([see documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors)) that you can access as the `wv` attribute of the model object, in our case as `model.wv`. (In gensim 4.0 and later this class is simply called `KeyedVectors`.)

This is the object that should be used to make queries of the resulting embedding.

**Question 6:** The `gensim.models.keyedvectors.Word2VecKeyedVectors` class has a method that can help you find similar words to some input word.

**6.1 - What is the name of this method?**

**6.2 - What is the name of the parameter that determines how many similar words will be returned?**

**6.3 - Find the 8 most similar words to "terror" (see the definition of `w1` below).**

In [None]:
w1 = "terror"

**Answer 6.1:**



**Answer 6.2:** 



**Answer 6.3:**

That looks pretty good, right?

We can also provide several positive examples!

**Question 7:** Find the 6 words most similar to "france" AND "germany" (see how the `positive_words` variable is defined below and use it).

In [None]:
positive_words = ["france", "germany"]

**Answer 7:**

**Question 8:** How do the results differ when we use two Asian countries as our positive words?

(see how the `positive_words` variable is defined below and use it)

In [None]:
positive_words = ["vietnam", "china"]

**Answer 8:**

That's nice. You can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered as related.

**Question 9.1:** Let's first look at the 10 most related words to "afghanistan" and "iraq".

(see how the `positive_words` variable is defined below and use it)

In [None]:
positive_words = ["afghanistan", "iraq"]

**Answer 9.1**:

**Question 9.2:** Let's see how results change when we add "terror" as a negative word. What interesting word remains in the list but drops in rank? What new words appear?

(see how the `positive_words` and `negative_words` variables are defined below and use them)

In [None]:
positive_words = ["afghanistan", "iraq"]
negative_words = ["terror"]

**Answer 9.2**:

### Similarity between two words in the vocabulary

**Question 10:** The `gensim.models.keyedvectors.Word2VecKeyedVectors` class has a method that can help you find the normalized similarity score between two words.

**10.1 - What is the name of this method?**

**10.2 - What measure of vector similarity does it use?**

**10.3 - What is the highest possible similarity score? Demonstrate it.**

**10.4 - Find the similarity between "france" and "germany". Compare it to the similarity between "france" and "sudan". Which pair do you expect to be more similar? By what degree?**

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

**Answer 10.1:**

**Answer 10.2:**

**Answer 10.3:**

In [None]:
# similarity between two identical words


If you compute the similarity between two identical words, the score will be 1.0 (or very close to it). This is the maximum possible value, since the cosine similarity score always lies in the range [-1.0, 1.0].
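For reference, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The sketch below uses plain Python and made-up vectors (no trained model involved) to show the values at the extremes: identical vectors, opposite vectors, and orthogonal vectors.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v = [1.0, 2.0, 3.0]
print(cosine_similarity(v, v))                    # identical vectors
print(cosine_similarity(v, [-x for x in v]))      # opposite vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```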

**Answer 10.4:**

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [None]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["france","germany","sudan"])

In [None]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["economy","treasury","soccer"])
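One plausible way to pick the odd one out is to average the length-normalized vectors of all the words and return the word that is least similar to that average. The sketch below implements this idea on made-up 2-dimensional vectors; `odd_one_out` and `toy_vectors` are hypothetical names, and the approach should be read as an assumption about the general technique rather than a description of gensim's exact code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(word_vectors):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = {w: unit(v) for w, v in word_vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # Dot product of unit vectors equals their cosine similarity.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical embedding: two words point one way, the third another way.
toy_vectors = {
    "france": [0.9, 0.1],
    "germany": [0.8, 0.2],
    "sudan": [0.1, 0.9],
}
print(odd_one_out(toy_vectors))
```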

## Representing documents based on trained word embeddings

Once we have a trained model for word embeddings, we can represent entire documents as vectors in the same space. There are several ways to do so, but the most common - and an extremely strong baseline - is simply averaging the word vectors of all the words in a document to get a document vector.
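The averaging idea can be sketched without any trained model. In the snippet below a plain dict with invented 2-dimensional vectors stands in for the embedding, and plain Python is used instead of NumPy to keep it dependency-free. The names `toy_embedding` and `average_vector` are hypothetical, and returning `None` when no token is known is just one possible design choice.

```python
# Hypothetical toy embedding: a plain dict standing in for a trained model.
toy_embedding = {
    "good": [1.0, 0.0],
    "hotel": [0.0, 1.0],
    "stay": [1.0, 1.0],
}

def average_vector(tokens, embedding):
    """Average the vectors of known tokens; return None if none are known."""
    known = [embedding[t] for t in tokens if t in embedding]
    if not known:
        return None
    # zip(*known) iterates over the vectors dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]

# "zzz" is not in the embedding, so it is skipped.
print(average_vector(["good", "hotel", "zzz"], toy_embedding))
```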

**Question 11:** Implement the `document_to_vector(document, model)` method below.

*Note:* You'll have to somehow handle the fact that not all words in the dataset were included in our vocabulary!

*Hint:* Use `np.average`.

**Answer 11:**

In [None]:
import gensim
import numpy as np

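def document_to_vector(document, model):
    """Return the average of the word vectors of the in-vocabulary words in `document`."""
    # implement me!
    raise NotImplementedError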

We can now use this method to examine the vector representation of the first document in our corpus:

In [None]:
df.head()

In [None]:
df['text'].iloc[0]

In [None]:
vec0 = document_to_vector(df['text'].iloc[0], model)
print(vec0.shape)

In [None]:
print(vec0)

## Using Word2Vec-induced document representation for classification

That's it! From here on the code to use this representation for some task on our data is trivial!

Let's see how to do this (no questions here, just read and make sure you understand the code):

In [None]:
df['w2v'] = df['text'].apply(document_to_vector, model=model)

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
lblenc = LabelEncoder()

In [None]:
df['lbl'] = lblenc.fit_transform(df['category'])

In [None]:
lblenc.classes_

We now have a column for the way we want to represent our data, and another one for our encoded labels:

In [None]:
df.head()

Let's generate `X` and `y` the way `sklearn` expects them:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = np.stack(df['w2v'].values)
y = df['lbl'].values

And split them:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
X_train.shape

We can now train any classifier over our data:

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rndforest = RandomForestClassifier(n_estimators=50, max_depth=8)

In [None]:
rndforest.fit(X_train, y_train)

In [None]:
y_pred = rndforest.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

We get some pretty good results! As an interesting exercise, compare them with the results over a bag-of-words representation.

Do you expect it to be better or worse?

What does it mean about our learned Word2Vec representation if performance is roughly the same?

In [None]:
print(classification_report(y_test, y_pred, target_names=lblenc.classes_))
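As a starting point for the bag-of-words comparison, here is a minimal sketch of what such a representation looks like: each document becomes a vector of term counts over a fixed vocabulary. The toy corpus and the `bag_of_words` helper are hypothetical and written in plain Python; for the real dataset a library vectorizer would likely be more practical.

```python
from collections import Counter

# Hypothetical toy corpus, one token list per document.
toy_documents = [
    ["stocks", "fell", "stocks", "rose"],
    ["team", "won", "match"],
]
vocabulary = sorted({t for doc in toy_documents for t in doc})

def bag_of_words(tokens, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

X_bow = [bag_of_words(doc, vocabulary) for doc in toy_documents]
print(vocabulary)
print(X_bow)
```

Unlike the averaged word vectors, the length of each row here grows with the vocabulary size, and words with similar meaning get unrelated columns.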