# Getting Started 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_train = pd.read_csv('../Datasets/imdb_review/train_data.csv')
df_train.head()

In [None]:
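def test_casing(x):
    # str.isupper() is True only when every cased character is upper case,
    # so check character by character to detect any upper-case letter.
    if any(ch.isupper() for ch in x):
        print('Contains Upper Case')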


_ = df_train.SentimentText.apply(test_casing)

**Observations**

1. Contains single letters
2. All lower case

# Data Pre-processing

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
STOPWORDS = set(stopwords.words("english"))  # a set gives fast membership tests
lemmatizer = WordNetLemmatizer()

In [None]:
def clean_text(x):
    """
    Remove stopwords and single-character tokens, then lemmatize the remaining words.
    """
    # split() with no argument splits on any whitespace and drops empty tokens,
    # so the stopword and length checks see clean tokens.
    return " ".join([
        lemmatizer.lemmatize(each_token) for each_token in x.split()
        if each_token not in STOPWORDS and len(each_token) > 1
    ])

In [None]:
df_train['SentimentTextCleaned'] = df_train.SentimentText.apply(clean_text)
df_train.head()

In [None]:
# !pip install gensim==4.0.0

## Word Embedding Generation

In [None]:
import gensim
print(gensim.__version__)

# Prepare the data
sentences = [word_tokenize(each_text) for each_text in list(df_train['SentimentTextCleaned'])]
print(sentences[:5])

**Comments**

Gensim expects its input as a list of token lists, in the following format:

```
[
    [ token 1, token 2, token 3, ... ],   # the text in one row of a dataframe, or one sentence in a document
    [ token 2, token 31, token 12, ... ], # the token numbers are arbitrary
    ...
    [ token 1, token 16, token 91, ... ]
]
```
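
As a minimal sketch of that format, the snippet below builds the nested list from two made-up cleaned reviews. It uses plain `str.split` as a stand-in for `word_tokenize` so that it runs without NLTK data; the example texts are hypothetical.

```python
# Two made-up cleaned reviews (hypothetical data, for illustration only)
texts = ["great movie loved acting", "boring plot weak acting"]

# str.split stands in for word_tokenize here; each text becomes one list of tokens
sentences = [text.split() for text in texts]

print(sentences)
```

Each inner list corresponds to one review, which is the shape passed as `sentences` to `Word2Vec`.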

In [None]:
from gensim.models import Word2Vec

# Train the Word2Vec SkipGram model
w2v = Word2Vec(sentences=sentences, 
               vector_size=100, 
               window=5,
               max_vocab_size=10000,
               min_count=2,  
               sg=1)

In [None]:
vocabulary_size = len(w2v.wv)
vocabulary_size

In [None]:
vocabulary = list(w2v.wv.key_to_index.keys())
vocab_word_vec = w2v.wv[vocabulary]
vocab_word_vec

**Comments**

```w2v.wv['any word']``` returns the 100-dimensional vector for that word as an array, provided the word is in the vocabulary.

In [None]:
vocab_word_vec.shape

In [None]:
vocabulary[:5]

In [None]:
w2v.wv.most_similar('cartoon')

**Comment**

Similarity here is cosine similarity: the cosine of the angle between two word vectors. It ranges from -1 to 1, and a higher value means the two vectors point in more similar directions.
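
To make the cosine similarity computation concrete, here is a small sketch in plain Python. The helper `cosine_similarity` and the toy 3-dimensional vectors are made up for illustration; they are not real embeddings from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (made up, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1, regardless of their lengths.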

In [None]:
from sklearn.decomposition import PCA


def plot_similarity_PCA(model, word_vector, vocabulary):
    pca = PCA(n_components=2)
    result = pca.fit_transform(word_vector)
    print(result.shape)
    plt.scatter(
        result[:, 0], # column (dimension) 1
        result[:, 1], # column (dimension) 2
        color='b'
    )
    # annotation or printing words in the plot
    for i, word in enumerate(vocabulary):
        plt.annotate(word, xy=(result[i, 0], result[i, 1]))
    plt.show()

In [None]:
plt.figure(figsize=[10, 10])
plot_similarity_PCA(w2v, w2v.wv[vocabulary[:100]], vocabulary[:100])

**Comment**
 
This visualization is for demonstration only. 

In [None]:
# Fill back with word index from vocabulary

w2v.wv.key_to_index.items()

In [None]:
df_train['SentimentTextTokenized'] = df_train['SentimentTextCleaned'].apply(word_tokenize)
df_train.head()

In [None]:
def map_vocab_index(tokenized_review):
    """
    Map each token to its index in the Word2Vec vocabulary and return the list of indices.
    Tokens that are not in the vocabulary are skipped.
    For example: if the index of the word 'film' is 1 and 'movie' is 2,
    then for a text like ['film', 'movie'], the output of this function will be [1, 2]
    """
    vocab_index_mapped_review = [
        w2v.wv.key_to_index[each_token] for each_token in tokenized_review
        if each_token in w2v.wv.key_to_index
    ]
    return vocab_index_mapped_review


# TEST
map_vocab_index(['movie', 'film', 'great'])

In [None]:
df_train['SentimentTextVocabIndexed'] = df_train['SentimentTextTokenized'].apply(map_vocab_index)
df_train.head()

# Train Test Split

In [None]:
feature_column = 'SentimentTextVocabIndexed'
target_column = 'Sentiment'

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_train[feature_column].values, 
                                                    df_train[target_column].values)

# Standardising Input Text

We will be using a Deep Learning model for predicting sentiment. 

The Embedding Layer from the Keras API is used as layer 0 (the first layer) to pass word embeddings into a neural network (NN).

More information on the Embedding Layer: https://keras.io/api/layers/core_layers/embedding/

In order to feed texts into an Embedding Layer, every input text (a sequence of token indices) must have the same fixed length.

Example: 
1. Sentence 1 - [3, 6, 21, 67, 32]
2. Sentence 2 - [1, 2, 6, 12, 34, 45, 7, 8]
3. Sentence 3 - [21, 2, 31]

Suppose every sentence must have a length of 5.

Applying padding and truncation at the end of each sequence [pad_sequences from Keras API] then gives:

1. Sentence 1 - [3, 6, 21, 67, 32]
2. Sentence 2 - [1, 2, 6, 12, 34]
3. Sentence 3 - [21, 2, 31, 0, 0]

The ```padding``` and ```truncating``` arguments each accept ```'pre'``` or ```'post'```, which controls whether zeros are added, or tokens are cut off, at the start or at the end of a sequence. Both default to ```'pre'```, so the result above requires ```'post'``` for both.
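
The padding and truncation of the three example sentences can be sketched in plain Python. The helper `pad_post` is hypothetical (it is not part of Keras) and only mimics what `pad_sequences` does when both `padding` and `truncating` are set to `'post'`.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then right-pad it with `value`."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])                  # post-truncation
        trimmed += [value] * (maxlen - len(trimmed))  # post-padding
        padded.append(trimmed)
    return padded


sentences = [
    [3, 6, 21, 67, 32],
    [1, 2, 6, 12, 34, 45, 7, 8],
    [21, 2, 31],
]
result = pad_post(sentences, maxlen=5)
print(result)
```

The first sentence is unchanged, the second loses its last three tokens, and the third gains two trailing zeros.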

A length analysis of the reviews in this dataset therefore helps in choosing that fixed length.

In [None]:
# Length Analysis

import seaborn as sns
sns.boxplot(data=list(df_train['SentimentTextVocabIndexed'].apply(len)))

**Observation**

Most reviews are about 100 to 175 tokens long. Let us set the maximum length of a review to 200.
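
One way to sanity-check a cut-off such as 200 is to measure what fraction of reviews fit within it without truncation. The sketch below does this for a few candidate lengths; the helper `coverage` is hypothetical and the list of lengths is made up, not taken from the dataset.

```python
def coverage(lengths, max_len):
    """Fraction of sequences that fit within max_len without truncation."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Made-up review lengths (hypothetical; the real values come from the dataframe)
lengths = [80, 120, 150, 160, 175, 190, 210, 400]

for candidate in (100, 200, 300):
    print(candidate, coverage(lengths, candidate))
```

With the notebook's data, `lengths` would be `df_train['SentimentTextVocabIndexed'].apply(len)`.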

# Deep Learning Model Training

## Declare constants

In [None]:
embedding_dim = 100 # as we had set while training the word2vec model
max_len = 200  # Length of input - All input should have the same length - if length over 200, the input will be truncated at 200.
vocabulary_size, embedding_dim, max_len

## Make paddings

In [None]:
from keras.preprocessing.sequence import pad_sequences

X_train_padded = pad_sequences(X_train, maxlen=max_len, truncating='post', padding='post')
X_test_padded = pad_sequences(X_test, maxlen=max_len, truncating='post', padding='post')

## Create Model

Here we have used an Embedding layer along with Flatten and Dense Layers.

In the Embedding layer, you specify the vocabulary size, the embedding dimension, and the length of the input. Additionally, the weights are initialized with the word vectors (word embeddings), and ```trainable=False``` keeps them frozen during training. This transfers the weights learned by the Skip-Gram model into the NN.
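
One detail worth checking in this setup: `pad_sequences` pads with 0, and index 0 in `key_to_index` also belongs to a real word, so the Embedding layer cannot tell padding apart from that word. A common workaround, sketched here with a toy vocabulary, is to shift every word index by one and reserve row 0 of the weight matrix for padding. The names `toy_key_to_index`, `toy_vectors` and `shifted_index` are hypothetical, and this is an assumption about how one might adapt the notebook, not part of the original pipeline.

```python
# Toy stand-ins for w2v.wv.key_to_index and the learned vectors (made-up values)
toy_key_to_index = {"movie": 0, "film": 1, "great": 2}
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
embedding_dim = 2

# Shift every index by one so that 0 is free to mean "padding"
shifted_index = {word: idx + 1 for word, idx in toy_key_to_index.items()}

# Row 0 is an all-zero padding vector; row i + 1 holds the vector of word i
embedding_matrix = [[0.0] * embedding_dim] + toy_vectors

# Out-of-vocabulary tokens are skipped, as in map_vocab_index
review = ["film", "great", "unknown"]
indexed = [shifted_index[tok] for tok in review if tok in shifted_index]
print(indexed, len(embedding_matrix))
```

With this shift, the Embedding layer's `input_dim` would become the vocabulary size plus one, matching the extra padding row.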

In [None]:
# Refer - https://keras.io/api/layers/core_layers/embedding/

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, # Vocab size, i.e. 1 + max vocab index
                    output_dim=embedding_dim, # Dimensionality of word embeddings
                    input_length=max_len,
                    weights=[vocab_word_vec],
                    trainable=False
                    ))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid')) # sigmoid since binary classification
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc']
              )
model.summary()

In [None]:
model.fit(X_train_padded, 
          y_train,
          epochs=5,
          validation_split=0.2
)

**Comment**

Experiment with different configurations and plot the train/val loss.
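
When comparing configurations, the per-epoch losses are the thing to look at. In Keras, `model.fit` returns a History object whose `history` attribute is a dictionary of per-epoch lists (with keys such as `loss` and `val_loss`). The sketch below works on such a dictionary; the numbers are made up for illustration and the helper `best_epoch` is hypothetical.

```python
def best_epoch(history_dict):
    """Return the 1-based epoch with the lowest validation loss."""
    val_loss = history_dict["val_loss"]
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1


# Made-up losses for five epochs (not results from this notebook)
history_dict = {
    "loss": [0.69, 0.55, 0.42, 0.31, 0.22],
    "val_loss": [0.68, 0.58, 0.52, 0.54, 0.60],
}

epoch = best_epoch(history_dict)
# Validation loss rising while training loss keeps falling suggests overfitting
print("lowest validation loss at epoch", epoch)
```

The same two lists can be passed to `plt.plot` to draw the train/val loss curves.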

## Predictions

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_pred = (model.predict(X_test_padded) > 0.5).astype("int32")
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))
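
For reference, the four scores printed above can be derived from the counts of true/false positives and negatives. The sketch below is a plain-Python illustration with made-up labels; the helper `binary_scores` is hypothetical and assumes 0/1 labels with 1 as the positive class.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class).

    Assumes at least one actual positive and one predicted positive,
    otherwise the divisions below would fail.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Made-up labels and predictions (not results from this notebook)
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0, 1, 0]
scores = binary_scores(y_true_toy, y_pred_toy)
print(scores)
```

Precision looks at how many predicted positives are correct, while recall looks at how many actual positives were found; F1 is their harmonic mean.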

## LSTM Implementation

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.

In [None]:
from keras.layers import LSTM

lstm_model = Sequential()
lstm_model.add(
    Embedding(
        input_dim=vocabulary_size,  # Vocab size, i.e. 1 + max vocab index
        output_dim=embedding_dim,  # Dimensionality of word embeddings
        input_length=max_len,
        weights=[vocab_word_vec],
        trainable=False))
lstm_model.add(LSTM(50))
lstm_model.add(Dense(32, activation='relu'))
lstm_model.add(Dense(
    1, activation='sigmoid'))  # sigmoid since binary classification
lstm_model.compile(optimizer='rmsprop',
                   loss='binary_crossentropy',
                   metrics=['acc'])
lstm_model.summary()

In [None]:
lstm_model.fit(X_train_padded, y_train, epochs=5, validation_split=0.2)

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_pred = (lstm_model.predict(X_test_padded) > 0.5).astype("int32")
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))