# Sentiment analysis with Word2Vec

### Notebook objectives:
- Convert words to vectors with Word2Vec
- Use the word representations given by Word2Vec as input to an RNN

<hr>


# The data


Let's first load the data.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, because the data may not fit in RAM. For that reason, **you should start with 10% of the sentences** and check that your computer handles it. If it does not, rerun with a lower percentage.


In [1]:
from NLPmoviereviews.data import load_data

X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=10)



# Baseline model

It is always good to have a very simple model to compare your own model against, to be sure you are doing better than a trivial algorithm.

Our baseline can be to predict the most frequent label in `y_train`. If the dataset is balanced, the baseline accuracy is 1/n, where n is the number of classes (2 here). The cell below checks how balanced the labels of the test set are.

In [2]:
import pandas as pd
pd.Series(y_test).value_counts()

1    1270
0    1230
dtype: int64

In [3]:
# The two classes are almost balanced (1270 vs 1230), so the baseline is taken as 1/2
baseline_accuracy = 1 / 2
print(f'Baseline accuracy on the test set : {baseline_accuracy:.2f}')

Baseline accuracy on the test set : 0.50
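
As a minimal sketch of the majority-class baseline described above, the snippet below picks the most frequent label of a training set and scores that constant prediction on a test set. The function name `majority_baseline_accuracy` and the toy training labels are assumptions made for illustration; only the test counts (1270 ones, 1230 zeros) come from the output above.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Toy training labels (an assumption: the real y_train is not shown here)
y_train_toy = [1, 1, 0, 1, 0]
# Test labels rebuilt from the counts printed above: 1270 ones, 1230 zeros
y_test_counts = [1] * 1270 + [0] * 1230

accuracy = majority_baseline_accuracy(y_train_toy, y_test_counts)
print(f"Majority-class baseline accuracy: {accuracy:.3f}")
```

If the majority label of the real `y_train` were 1, this would give 1270 / 2500 = 0.508 on the test set, which is close to the 1/2 used above.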


# Trained Word2Vec - Transfer Learning


List all the pre-trained models available through the gensim downloader:

In [4]:
from gensim.models import Word2Vec
import gensim.downloader as api
print(list(api.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


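Load one of the pre-trained embedding spaces.

The `glove-wiki-gigaword-50` model is a good candidate to start with because it is small (65 MB). Note that it is a GloVe model rather than a Word2Vec one, but gensim loads it as the same kind of `KeyedVectors` object.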


In [5]:
word2vec_transfer = api.load('glove-wiki-gigaword-50')

Check the size of the vocabulary and the dimension of the embedding space.

In [6]:
len(word2vec_transfer.key_to_index)

400000

In [7]:
# Dimension of each word vector
word2vec_transfer.vector_size

50

Embed `X_train` and `X_test`, so that each sentence is represented by exactly 200 vectors (`maxlen=200`):

In [8]:
import numpy as np
from NLPmoviereviews.utilities import padding

In [9]:
X_train_pad = padding(word2vec_transfer, X_train, maxlen=200)
X_test_pad = padding(word2vec_transfer, X_test, maxlen=200)

In [10]:
X_train_pad.shape, y_train.shape, X_test_pad.shape, y_test.shape

((2500, 200, 50), (2500,), (2500, 200, 50), (2500,))
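
The internals of the `padding` helper are not shown in this notebook. As a rough sketch of what such a step can look like, the snippet below embeds lists of words and zero-pads them to a fixed length. The function `embed_and_pad`, the toy embedding dictionary, and the choices to drop unknown words, pad at the end, and truncate at the end are all assumptions, not the actual implementation of `padding`.

```python
import numpy as np


def embed_and_pad(sentences, embeddings, vector_size, maxlen):
    """Turn lists of words into a (n_sentences, maxlen, vector_size) array."""
    out = np.zeros((len(sentences), maxlen, vector_size), dtype="float32")
    for i, sentence in enumerate(sentences):
        # Keep only the words that have a vector, then truncate to maxlen
        vectors = [embeddings[w] for w in sentence if w in embeddings][:maxlen]
        if vectors:
            # Real vectors first; the remaining timesteps stay at zero
            out[i, :len(vectors)] = np.array(vectors)
    return out


# Hypothetical 3-dimensional embedding space standing in for the real one
toy_embeddings = {
    "good": np.array([1.0, 0.0, 0.5]),
    "bad": np.array([-1.0, 0.0, 0.5]),
    "movie": np.array([0.0, 1.0, 0.0]),
}
toy_sentences = [
    ["good", "movie"],
    ["bad", "unknownword", "movie", "bad", "movie"],
]

padded = embed_and_pad(toy_sentences, toy_embeddings, vector_size=3, maxlen=4)
print(padded.shape)
```

The same three-dimensional layout, (sentences, words, embedding dimension), is what the shapes printed above show for `X_train_pad` and `X_test_pad`.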

☝️ To be sure that it worked, let's check the following for `X_train_pad` and `X_test_pad`:
- they are numpy arrays
- they are 3-dimensional
- the last dimension is the size of the embedding space (you can get it with `word2vec_transfer.vector_size`)
- the first dimension is the number of sentences in `X_train` and `X_test`, respectively

In [13]:
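# TEST
for X in [X_train_pad, X_test_pad]:
    assert isinstance(X, np.ndarray)  # numpy array
    assert X.ndim == 3  # (sentences, words, embedding dimension)
    assert X.shape[-1] == word2vec_transfer.vector_size

# One row per input sentence
assert X_train_pad.shape[0] == len(X_train)
assert X_test_pad.shape[0] == len(X_test)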

Initialize a model and fit it on the embedded (and padded) data. Evaluate it on the test set and compare it with the baseline accuracy.

In [14]:
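from tensorflow.keras import models, layers
from tensorflow.keras.callbacks import EarlyStopping

model = models.Sequential()
# Ignore timesteps whose 50 features are all equal to 0 (the padded positions)
model.add(layers.Masking(mask_value=0, input_shape=(200, 50)))
# Recurrent layer that summarises each sentence in a 20-dimensional vector
model.add(layers.LSTM(20))
model.add(layers.Dense(10, activation="relu"))
# Single sigmoid unit: predicted probability that the label is 1
model.add(layers.Dense(1, activation="sigmoid"))

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])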
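
The `Masking` layer above skips a timestep when all of its features are equal to `mask_value`. The NumPy sketch below reproduces that rule on a toy sentence to show which timesteps would be kept; the 3-dimensional vectors are made up for illustration, and the sketch assumes, as the model above does, that padded positions are filled with zeros.

```python
import numpy as np

# One toy "sentence": 2 real word vectors followed by 2 zero-padded timesteps
sample = np.array([
    [0.2, -0.1, 0.4],
    [0.0, 0.3, 0.0],  # contains zeros but is not all-zero, so it is kept
    [0.0, 0.0, 0.0],  # padding
    [0.0, 0.0, 0.0],  # padding
])
mask_value = 0

# A timestep is kept when at least one feature differs from mask_value
keep = np.any(sample != mask_value, axis=-1)
print(keep)
print(f"Timesteps seen by the LSTM: {int(keep.sum())}")
```

One consequence of this rule is that a real word whose vector happened to be exactly zero in every dimension would also be skipped, which is very unlikely with trained embeddings.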

In [15]:
# Stop once the validation loss has not improved for 4 epochs, keeping the best weights
es = EarlyStopping(patience=4, restore_best_weights=True, verbose=1)

# Fit the model on the train data
history = model.fit(X_train_pad, y_train,
                    validation_split=0.2,
                    epochs=20,
                    batch_size=32,
                    verbose=0,
                    callbacks=[es])
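
To make the `patience=4` setting more concrete, here is a simplified pure-Python sketch of the stopping rule. It assumes the callback monitors the validation loss with no minimum improvement threshold (the Keras defaults); the function `early_stopping_epochs` and the loss values are made up for illustration and are not results from this notebook.

```python
def early_stopping_epochs(val_losses, patience=4):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training goes through every epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch (0-indexed)
toy_val_losses = [0.70, 0.60, 0.65, 0.66, 0.64, 0.70, 0.50]

stop_epoch, best_epoch = early_stopping_epochs(toy_val_losses, patience=4)
print(f"Stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

In this toy run training stops at epoch 5, before the lower loss at epoch 6 is ever reached, which illustrates the trade-off behind the choice of `patience`.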

In [16]:
res = model.evaluate(X_test_pad, y_test, verbose=0)

print(f'The accuracy evaluated on the test set is of {res[1]*100:.3f}%')

The accuracy evaluated on the test set is of 72.440%
