# Programming assignment. Week 2. Vector models.

This assignment is graded by your `submission.json` which is created by a cell below. To generate your submission file you need to fill the values of **GRADING_ANSWER** instance fields.

You can press **"Submit Assignment"** at any time to submit partial progress.

In [339]:
from submit import Answer


GRADING_ANSWER = Answer()
GRADING_ANSWER.submit()

In this programming assignment you will work with different embeddings. You will work with `gensim` library that provides the access and cinvenient usage of word2vec and fasttext models. On the sentiment task you will check how the vector embeddings work. 
Good luck!

In [340]:
# First, load the data for sentiment task and prepare the data.
# Please, note, that these cells may have quite long runtime.

import pandas as pd
import numpy as np


data = pd.read_csv('sentiment.csv', index_col=0)
data.head()

Unnamed: 0,sentiment,review
0,1,With all this stuff going down at the moment w...
1,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,0,The film starts with a manager (Nicholas Bell)...
3,0,It must be assumed that those who praised this...
4,1,Superbly trashy and wondrously unpretentious 8...


In [341]:
import re


tag_regexp = re.compile("<[^>]*>")
regex = re.compile("[A-Za-z-]+")

def words_only(text, regex=regex):
    text = re.sub(tag_regexp, '', text)
    text = re.sub('\s+', ' ',text)
    text = re.sub(r'\\','', text)
    text = text.lower().strip()
    try:
        return " ".join(regex.findall(text))
    except:
        return ""

data['cleaned_review'] = data['review'].apply(words_only)
data['tokenized'] = data['cleaned_review'].apply(lambda x: x.split())

data.head()

Unnamed: 0,sentiment,review,cleaned_review,tokenized
0,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,..."
1,1,"\The Classic War of the Worlds\"" by Timothy Hi...",the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot..."
2,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas..."
3,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai..."
4,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious s...,"[superbly, trashy, and, wondrously, unpretenti..."


In [342]:
# Split the data on train and test

from sklearn.model_selection import train_test_split
from sklearn.metrics import *


X_train, X_test, y_train, y_test = train_test_split(data.tokenized, data.sentiment, test_size=0.2, random_state = 5)
X_train[:5]

7751    [back, when, alec, baldwin, and, kim, basinger...
4154    [i, too, was, quite, astonished, to, see, how,...
3881    [i, saw, this, film, for, the, very, first, ti...
9238    [i, think, a, great, many, viewers, missed, en...
5210    [this, is, a, taut, suspenseful, masterpiece, ...
Name: tokenized, dtype: object

In [343]:
import nltk
from nltk.corpus import stopwords


nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [344]:
data.head()

Unnamed: 0,sentiment,review,cleaned_review,tokenized
0,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,..."
1,1,"\The Classic War of the Worlds\"" by Timothy Hi...",the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot..."
2,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas..."
3,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai..."
4,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious s...,"[superbly, trashy, and, wondrously, unpretenti..."


## BoW

#### Bag of words

Implement *bag-of-words* representation. To create this transformation, follow the steps:
1. Find *N* most popular words in train corpus and numerate them. Do not count words which are in STOPWORDS. Now we have a dictionary of the most popular words.

2. For each review in the corpora create a zero vector with the dimension equals to *N*.
3. For each text in the corpora iterate over words which are in the dictionary and increase by 1 the corresponding coordinate.

In [345]:
dict_words = {}

for _, tokens in X_train.iteritems():
    for token in tokens:
        if token not in STOPWORDS:
            cnt = dict_words.setdefault(token, 1) + 1
            dict_words[token] = cnt

In [346]:
from collections import OrderedDict
from operator import itemgetter


sorted_dict_words = OrderedDict(sorted(dict_words.items(), key = itemgetter(1), reverse=True))

In [347]:
DICT_SIZE = 500

words_counts = list(sorted_dict_words.values())[:DICT_SIZE]
WORDS_TO_INDEX = list(sorted_dict_words)[:DICT_SIZE]

In [348]:
def BoW(words, words_to_index, dict_size):
    """
        words: a list of words
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = [words.count(word) for word in words_to_index]
 
    return result_vector

In [349]:
from scipy import sparse as sp_sparse


X_train_bow = sp_sparse.vstack([sp_sparse.csr_matrix(BoW(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_test_bow = sp_sparse.vstack([sp_sparse.csr_matrix(BoW(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])

print('X_train shape ', X_train_bow.shape)
print('X_test shape ', X_test_bow.shape)

X_train shape  (8000, 500)
X_test shape  (2000, 500)


**Task 1.1 (0.5 points).** For the 5th row in *X_train_bow* find how many non-zero elements it has. 

**Hint** Do not forget that indexes start with 0 and the first row has index 0.

In [350]:
ROW_IDX = 5

non_zeros = sum([1 for x in range(0, DICT_SIZE) if X_train_bow[ROW_IDX-1, x] != 0])

In [351]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q1 = non_zeros

**Task 1.2 (0.5 points)** 
Train (on train set) `RandomForestClassifier` from `sklearn.ensemble` with `n_estimators = 300` and `random_state=5` and `max_depth = 5`. What is the accuracy score on test set? 

In [352]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


rf = RandomForestClassifier(n_estimators = 300, random_state = 5, max_depth = 5)
rf.fit(X_train_bow, y_train)

RandomForestClassifier(max_depth=5, n_estimators=300, random_state=5)

In [353]:
y_pred = rf.predict(X_test_bow)
accuracy = accuracy_score(y_test, y_pred)

In [354]:
accuracy

0.7715

In [355]:
## GRADED PART, DO NOT CHANGE!
GRADING_ANSWER.Q2 = accuracy

## TF-IDF
Now vectorize your texts using `TfidfVectorizer` from `sklearn.feature_extraction.text`.
Pass `STOPWORDS` as `stop_words` and set `max_fetures = 500`.

**Task 2 (0.5 points).** Train (on train set) `RandomForestClassifier` from `sklearn.ensemble` with `n_estimators = 300`, `random_state=5`, and `max_depth = 5` on tf-idf embeddings. What is the accuracy score on test set? 

In [356]:
from sklearn.feature_extraction.text import TfidfVectorizer


def identity_tokenizer(text):
    return text

tfidfvectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words=STOPWORDS, max_features=500, lowercase=False)

In [357]:
X_train_tfidf = tfidfvectorizer.fit_transform(X_train.to_list())
X_test_tfidf = tfidfvectorizer.fit_transform(X_test.to_list())

In [358]:
rf = RandomForestClassifier(n_estimators = 300, random_state = 5, max_depth = 5)
rf.fit(X_train_tfidf, y_train)

RandomForestClassifier(max_depth=5, n_estimators=300, random_state=5)

In [359]:
y_pred = rf.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)

In [360]:
accuracy

0.5955

In [361]:
## GRADED PART, DO NOT CHANGE!
GRADING_ANSWER.Q3 = accuracy

## Distributional embeddings

Let us use a few pre-trained distributional embedding models to analyze the performance of the classifier:

*   ```word2vec```
*   ```fastText```

We will use the [```Gensim```](https://radimrehurek.com/gensim_3.8.3/) library for python which provides a wide range of useful functions and pre-trained embedding models. 

### Vector models. ```word2vec```



In this assignment you are going to work with the pretrined model. The file `GoogleNews-vectors-negative300.bin` **is already located in the root directory**. 

You may also download it using the code below, or you may directly download it from [here](https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz).

In [365]:
# CODE FOR DOWNLOADING PRETRAINED MODEL [local usage only]
!brew install wget
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!gzip -d GoogleNews-vectors-negative300.bin.gz

/bin/sh: 1: brew: not found
--2021-07-20 16:25:34--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving proxy-coursera-apps.org (proxy-coursera-apps.org)... 10.0.0.170
Connecting to proxy-coursera-apps.org (proxy-coursera-apps.org)|10.0.0.170|:3128... connected.
Proxy tunneling failed: ForbiddenUnable to establish SSL connection.
gzip: GoogleNews-vectors-negative300.bin.gz: No such file or directory


**Task 3.1 (0.5 points)**

Load word2vec model using `gensim.models.KeyedVectors.load_word2vec_format`. How many words in model vocabulary start with `a` or `A`? (In the answer write the sum of vocabulary words wich start with `a` and with `A`).



In [363]:
from gensim import models
## YOUR CODE HERE

## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q4 = 0 ## YOUR CODE HERE

**Task 3.2 (0.5 points)**
Select all words from the vocabulary, which start from `z` or `Z`. Among these words which is the most similar to the word `park`?

Use `sklearn.metrics.pairwise.cosine_similarity` as a similarity measure.


In [364]:
from sklearn.metrics.pairwise import cosine_similarity
import collections


## YOUR CODE HERE
words_counts = ['a', 'b'] ## YOUR CODE HERE
words_counts.most_common(5)

AttributeError: 'list' object has no attribute 'most_common'

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q5 = 0 ## YOUR CODE HERE

**Task 3.3 (0.5 points)** 
Compute top-5 most frequent words in `X_train` without counting words from `STOPWORDS`.
In answer write top-5 words in one string separating words by spaces.

**Example answer:** `"cat dom apple home house"`.

In [None]:
## YOUR CODE HERE
words_counts = ## YOUR CODE HERE
answer = ' '.join([x[0] for x in words_counts.most_common(5)])
answer

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q6 = 0 ## YOUR CODE HERE

**Task 3.4 (0.5 points)**. With word2vec embeddings you can perform different linear operations. For example, you can check which of the words do not belong in the sequence (or does not match), or what vector we get if we sum of the `king` and the `woman` vectors and subtract the vector `man`. For the model above, do the following operations: 

1) what word from these: `London Warsaw Peru Kiev`
does not match; 

2) take the word from step 1 and make linear operation `Moscow + <your word> - Russia`; 

3) for the word from step 2 write the score for most simliar one. Round the answer up to 2 digits after the decimal point.

In [None]:
# YOUR CODE HERE

## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q7 = 0 ## YOUR CODE HERE

**Task 3.5 (0.5 point).** Add information in the pre-trained word2vec model. We have already pre-trained word2vec model, but what if we want to add some new information inside and increse the occurance of some words. You can update the parameters using the gensim interface, you may find more information [here](https://radimrehurek.com/gensim/models/keyedvectors.html). Note, that KeyedVectors does not working, you can update only the full model.  

First, take the tokenized sentences from sentiment data and train Word2Vec model. Set parameters: vector_size=100, window=2, min_count=1, workers=1. Write the most similar word for the woman vector for trained model and its similarity score (round the answer up to 4 digits after the decimal point). 

Answer format is a string with a word and the corresponding similarity score separated by space.

Example answer: `"girl 0.5123"`

In [None]:
sentences = [line for line in data.tokenized]

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q8 = 'girl 0.5123' ## YOUR CODE HERE

**Task 3.6 (0.5 point)**. Next, we will learn how to update parameters of the model with new sentences. 


Let's take some data from Alice’s Adventures in Wonderland by Lewis Carroll and add new texts in the model. First, load the data and prepare:

In [None]:
import nltk
import re
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

with open("alice.txt", 'r', encoding='utf-8') as f:
    text = f.read()

text = re.sub('\n', ' ', text)
sents = nltk.sent_tokenize(text.lower())

punct = '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~„“«»†*—/\-‘’'
clean_sents = []

for sent in sents:
    s = [w.lower().strip(punct) for w in sent.split()]
    clean_sents.append(s)

print(clean_sents[:2])

You need to save the current model in the right format, load again the full model and add new texts in it. Run with default parameters 5 epochs. After, again check the most similar words for `woman` and write the word and score. Round the answer up to 4 digits after the decimal point.

In [None]:
from gensim.models import word2vec
# YOUR CODE HERE

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q9 = 0 ## YOUR CODE HERE

### Pooling methods

Once we have a word embedding model, how can we get from word embeddings to document embeddings?

For this, there are several *pooling* strategies that define the way to aggregate the embeddings of each word in a document by taking an element-wise **average**, **minimum**, **maximum** or **sum**:

*   **Mean-pooling** (**average-pooling**)
*   **Min-pooling**
*   **Max-pooling**
*   **Sum-pooling**

It is very common in practice to use mean-pooled document embeddings for the downstream task.

**Task 4.1 (1 point).**
Using the model build sentence embedder, which computes the sentence vector as the mean vector of its word vectors (**mean-pooling**). Use zero vectors for out of vocabulary words.

What is the embedding for the sentence `'the cat sat on the mat'`?
Tokenize the sentence splitting it by spaces.

In the asnwer write the mean of its components.

In [None]:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = ## YOUR CODE HERE

    def fit(self, X, y):
        return self

    def transform(self, X):
        # X is a list of tokenized sentences 
        # example:
        # X = [['the','cat,'sat','on','the','mat'],
        #      ['the','dog,'lies','on','the','sofe']]
        
        ### YOUR CODE HERE

In [None]:
vectorizer = MeanEmbeddingVectorizer(model)
cat_vector = ### YOUR CODE HERE
print(cat_vector.shape)

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q10 = 0 ## YOUR CODE HERE

**Task 4.2 (1 point).** Using `MeanEmbeddingVectorizer` vectorize tokenized reviews. Than train (on train set) `RandomForestClassifier` from `sklearn.ensemble` with `n_estimators = 300` and `random_state=5`.
What is the accuracy score on test set? 

In [None]:
### YOUR CODE HERE

pred = ## YOUR CODE HERE

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q11 = 0 ## YOUR CODE HERE

**Task 4.3 (1 point)**
Transform `MeanEmbeddingVectorizer` for into MinEmbeddingVectorizer, 
MaxEmbeddingVectorizer, and SumEmbeddingVectorizer. 

For each of the vectorizers:


1) vectorize the data


2) train (on train set) `RandomForestClassifier` from `sklearn.ensemble` with `n_estimators = 300` and `random_state=5`.


3) compute accuracy score on test set

What method among three is the best? The answer should be of three: "min", "max", "sum".



In [None]:
### YOU CODE HERE

In [None]:
class EmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = model[model.index_to_key[0]].shape[0]

    def fit(self, X, y):
        return self

    def transform(self, X, type_ = 'min'):
        # X is a list of tokenized sentences 
        # example:
        # X = [['the','cat,'sat','on','the','mat'],
        #      ['the','dog,'lies','on','the','sofe']]
        
        ### YOUR CODE HERE

In [None]:
results = []
for type_ in ['min', 'max', 'sum']:
  ### YOUR CODE HERE
  results.append([type_, accuracy])
    
results

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q12 = 0 ## YOUR ANSWER

**Task 4.4 (0.5 points)** What is the accuracy score for this method?


In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q13 = 0 ## YOUR ACCURACY SCORE HERE

## Text classification with ```fastText```

Let us have a look at how we can use the ```fastText``` library for text classification. Specifically, the [library](https://github.com/facebookresearch/fastText#text-classification) can be used to train superrvised text classifiers, for example for sentiment analysis in a single command:

```./fasttext supervised -input train.txt -output model```, where ```train.txt``` is the train set with annotated examples for the task, and ```model``` is a name of your model that will be saved in 2 files: ```model.bin``` and ```model.vec```.

**Data format**: 
The train file should be in a specified format, containing a training sentence or document per line along with the labels.

In [None]:
## Install fasttext if you work in Colab (otherwise it is already installed)

# !wget https://github.com/facebookresearch/fastText/archive/v0.2.0.zip
# !unzip v0.2.0.zip
# %cd fastText-0.2.0
# !make
# !pip install .

Let us prepare the data in the required format as follows.

In [None]:
X = [sentence.strip() for sentence in data.cleaned_review]
y = [str(label) for label in data.sentiment]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 5)

with open('ft_train.txt', 'w') as outfile:
    for i in range(len(X_train)):
        outfile.write('__label__' + y_train[i] + ' '+ X_train[i] + '\n')
    
with open('ft_test.txt', 'w') as outfile:
    for i in range(len(X_test)):
        outfile.write('__label__' + y_test[i] + ' ' + X_test[i] + '\n')

print("Files are written")

In [None]:
!head -n 5 ft_train.txt

### ```fastText```

Let us have a look at how we can apply a pre-trained fastText model that was trained over different text corpora: Wikipedia, UMBC Webbase corpus and statmt.org news dataset (16B tokens in total!).

**Task 5.1 (1 point).** Train fasttext model! Set default parapmeters, but train it for 20 epochs! 

How can we now evaluate the model now? Enter the code to test the model on the ```ft_test.txt``` file with the help of the ```fastText``` library. Write Precision (P@1) and recall (R@1) **scores**. Round the answer up to 3 digits after the decimal point. 

In [None]:
# <YOUR CODE HERE>

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q14 = 0 ## YOUR PRECISION VALUE HERE
GRADING_ANSWER.Q15 = 0 ## YOUR RECALL VALUE HERE

We can also load the model and use the python code to make predictions.

In [None]:
import fasttext

## save trained model
classifier.save_model("sentiment_model.bin")
## load our trained classifier
sentiment_ft = fasttext.load_model("sentiment_model.bin")

**Task 5.2. (1 point)** Write a code to make a prediction with the classifier. As answer write the label (1 or 0) and the confidence score (Round the answer up to 3 digits after the decimal point.).

In [None]:
review = "This was such a great film! I am so lucky to watch it."

# <YOUR CODE HERE>

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
GRADING_ANSWER.Q16 = '' ## YOUR LABEL HERE
GRADING_ANSWER.Q17 = 0 ## YOUR CONFIDENCE SCORE HERE

❗️Remember to **run the first code cell again** before submitting the solution.

### Extra. Ungraded. Extrinsic evaluation

Train a few embedding models using various text corpora from the ```Gensim``` library. Set different hyperparameters for training, and compare the models by fixing the sentiment classifier hyperparameters. Analyze the performance of the classifiers trained over your embedding models. 