# Hi!

This notebook gives a beginner's guide to NLP and performs sentiment analysis on the IMDB movie reviews dataset using a recurrent neural network.

# What is NLP?

There are different levels of tasks in NLP, from speech processing to semantic interpretation and discourse processing. The goal of NLP is to be able to design algorithms to allow computers to "understand" natural language in order to perform some task.

# What is Sentiment Analysis?

Sentiment analysis (also called sentiment classification) is the task of looking at a piece of text and telling whether someone likes or dislikes the thing they are talking about. It is one of the most important building blocks in NLP and is used in many applications.

Importing Dependencies

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

from nltk.tokenize import word_tokenize
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from keras.initializers import Constant
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings('ignore')
sns.set()

Importing Data

In [None]:
imdb = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
imdb.head()

Distribution of target variable

In [None]:
imdb.sentiment.value_counts()

# How to represent words?

The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words.

One common solution is WordNet, which is available through NLTK. NLTK is the Swiss Army knife of NLP: it has a lot of basic functions but is not terribly good at any one of them. WordNet works like a thesaurus, containing lists of synonym sets and hypernyms ('is a' relationships), so it makes very fine distinctions between the senses of a word. But it has a few problems: it is built with human labour, it is subjective, and it can't compute accurate word similarities.

Well, how about one-hot vectors? They have two drawbacks: a language has a lot of words, so the vectors become very large, and there is no notion of similarity between them.
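
To make the "no notion of similarity" point concrete, here is a minimal sketch with a three-word toy vocabulary (the vocabulary and the `cosine` helper are assumptions made for illustration, not part of this notebook's pipeline). Every pair of distinct one-hot vectors is orthogonal, so "good" looks just as unrelated to "great" as it does to "terrible".

```python
import numpy as np

# Toy vocabulary (hypothetical, for illustration only).
vocab = ["good", "great", "terrible"]

# Each word becomes one row of the identity matrix.
identity = np.eye(len(vocab))
one_hot = {word: identity[i] for i, word in enumerate(vocab)}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Distinct one-hot vectors never share a non-zero position.
print(cosine(one_hot["good"], one_hot["great"]))     # 0.0
print(cosine(one_hot["good"], one_hot["terrible"]))  # 0.0
```

Note also that the vector length equals the vocabulary size, which is why this representation grows so quickly on real text.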

How about representing a word using its context?

When a word is used in a text, its context is the set of words that appear nearby. If you can sense that 'oh no, that's the wrong word to use there', you understand the meaning of the word. This is the idea of Distributional Semantics: a word's meaning is given by the contexts in which it appears.
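
As a rough sketch of what "context" means in practice, the snippet below counts the words that appear within a fixed window around each word. The function name `context_counts`, the window size, and the example sentence are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def context_counts(tokens, window=2):
    """For every word, count the words appearing within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[word][tokens[j]] += 1
    return counts

tokens = "the movie was good but the ending was bad".split()
counts = context_counts(tokens, window=1)

print(counts["good"])  # neighbours of "good"
print(counts["bad"])   # neighbours of "bad"
```

Words that end up with similar context counts over a large corpus are the ones a distributional model treats as similar.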

So that leads us to representing words in a better way using Word Embeddings. 

# What are Word Embeddings?

We will build a dense vector for each word, chosen so that it's similar to vectors of words that appear in the same context.

There are 2 types of algorithms used for word embedding:

The first set are count-based and rely on matrix factorization (e.g. LSA, HAL). While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as word analogy, indicating a suboptimal vector space structure. The other set of methods are shallow window-based (e.g. the skip-gram and the CBOW models), which learn word embeddings by making predictions in local context windows. These models demonstrate the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of the global co-occurrence statistics.

How about we combine them? But how do we do that?

We could use ratios of co-occurrence probabilities to encode meaning components. This is the approach taken by GloVe.

# GloVe Embedding

The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence.
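
Written out, this objective looks roughly as follows. The symbols are taken from the original GloVe formulation and are not defined elsewhere in this notebook: $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size. The bias $b_i$ absorbs the normalising term $\log X_i$ that turns a count into a probability.

$$
w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}
$$

The vectors are fitted by weighted least squares, where the weighting function $f$ limits the influence of very rare and very frequent pairs:

$$
J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

This notebook does not train GloVe itself; it only loads vectors that were already trained this way.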

Factoring sentences into words

In [None]:
corpus = []
for text in imdb['review']:
    words = [word.lower() for word in word_tokenize(text)] 
    corpus.append(words)

In [None]:
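# Note: len(corpus) is the number of reviews, not the number of distinct words.
# It is reused below as the vocabulary size cap for the Tokenizer and the embedding matrix.
num_words = len(corpus)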
print(num_words)

Splitting data to train(80%) and test(20%)

In [None]:
train_size = int(imdb.shape[0] * 0.8)
X_train = imdb.review[:train_size]
y_train = imdb.sentiment[:train_size]

X_test = imdb.review[train_size:]
y_test = imdb.sentiment[train_size:]

Tokenizing the words and padding for equal input dimensions

In [None]:
tokenizer = Tokenizer(num_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train, maxlen=128, truncating='post', padding='post')

In [None]:
X_train[0], len(X_train[0])

In [None]:
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=128, truncating='post', padding='post')

In [None]:
X_test[0], len(X_test[0])
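
To see what `truncating='post'` and `padding='post'` do to a single review, here is a pure-Python sketch of the same behaviour. `pad_post` is a hypothetical helper written for illustration; the notebook itself relies on Keras' `pad_sequences`.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut extra items off the end, then append `value` until the length is `maxlen`."""
    truncated = sequence[:maxlen]
    return truncated + [value] * (maxlen - len(truncated))

# A short review gets zeros appended at the end.
print(pad_post([5, 8, 2], maxlen=5))           # [5, 8, 2, 0, 0]

# A long review loses its last tokens.
print(pad_post([5, 8, 2, 9, 4, 7], maxlen=5))  # [5, 8, 2, 9, 4]
```

With `maxlen=128`, any review longer than 128 tokens keeps only its opening words, so sentiment expressed near the end of a long review never reaches the model.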

In [None]:
word_index = tokenizer.word_index
print("Number of unique words: {}".format(len(word_index)))

Glove Embedding

Task 1: Build a dictionary that maps every word in the pre-trained GloVe embeddings to its vector

In [None]:
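embedding = {}
# Each line holds a word followed by the 100 components of its vector.
# The `with` block closes the file automatically.
with open("/kaggle/input/glovetwitter27b100dtxt/glove.twitter.27B.100d.txt", encoding="utf-8") as file:
    for line in file:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], dtype='float32')
        embedding[word] = vectors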

Task 2: Build a matrix holding the GloVe vector of each word in the IMDB vocabulary

In [None]:
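# Row i holds the GloVe vector of the word with tokenizer index i.
# Words that are missing from GloVe keep an all-zero row.
embedding_matrix = np.zeros((num_words, 100))
for i, word in tokenizer.index_word.items():
    if i < num_words:  # valid row indices are 0 .. num_words - 1
        vector = embedding.get(word)
        if vector is not None:
            embedding_matrix[i] = vector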
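
Because words without a GloVe vector are left as zero rows, it can be useful to check how much of the vocabulary is actually covered. The sketch below is one way to do that; `embedding_coverage` and the toy dictionaries are assumptions for illustration, standing in for `tokenizer.index_word` and the `embedding` dictionary.

```python
import numpy as np

def embedding_coverage(index_word, embedding, max_index):
    """Fraction of the words with index < max_index that have a pre-trained vector."""
    words = [word for i, word in index_word.items() if i < max_index]
    if not words:
        return 0.0
    found = sum(1 for word in words if word in embedding)
    return found / len(words)

# Toy stand-ins (hypothetical) for tokenizer.index_word and the GloVe dictionary.
toy_index_word = {1: "the", 2: "movie", 3: "cinematography", 4: "zzzz"}
toy_embedding = {
    "the": np.zeros(100),
    "movie": np.zeros(100),
    "cinematography": np.zeros(100),
}

print(embedding_coverage(toy_index_word, toy_embedding, max_index=5))  # 0.75
```

In the notebook the equivalent call would be `embedding_coverage(tokenizer.index_word, embedding, num_words)`. A low value would suggest that many review words are fed to the model as all-zero vectors.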

# Modelling

Text is a sequence of words, and sequential data has some problems when it comes to modelling:

##### Problem #1: Can't model long-term dependencies

For example, consider the sentence "France is where I grew up, I can speak really good _____". To predict the blank word accurately, we need information from much earlier in the sentence.

##### Problem #2: Counts don't preserve order

For example, consider the sentences "The food is bad, not good at all" and "The food is good, not bad at all". They contain exactly the same words but mean opposite things, so word order is necessary to avoid losing sequential information.
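
A quick check shows why counts alone cannot tell these two sentences apart. The `bag_of_words` helper below is a hypothetical, simplified tokenizer written for this illustration.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    """Lowercase the sentence and count each word, ignoring punctuation and order."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

negative = "The food is bad, not good at all"
positive = "The food is good, not bad at all"

# Both sentences produce identical counts despite opposite meanings.
print(bag_of_words(negative) == bag_of_words(positive))  # True
```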

##### Problem #3: Parameters don't share information

For example, consider the sentences "I took the cat out this morning" and "This morning, I took the cat out". If each position in the input has its own parameters, the things we learn about one part of the sequence won't transfer when the same words appear elsewhere in the sequence.

##### Problem #4: Variable-length input

How to approach this?

### Recurrent Neural Networks
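
A recurrent network reads the sequence one step at a time and carries a hidden state forward, reusing the same weights at every step. The sketch below is a plain (vanilla) RNN cell in NumPy with randomly initialised weights. It is a simplification for illustration only: the sizes and names are assumptions, and the models later in this notebook use Keras LSTM layers, which add gating on top of this idea.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # hypothetical sizes

# One set of weights, shared by every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_forward(inputs):
    """Run a vanilla RNN over `inputs` (shape: steps x input_dim); return the last hidden state."""
    h = np.zeros(hidden_dim)
    for x_t in inputs:
        # The new state depends on the current input and the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

short_sequence = rng.normal(size=(2, input_dim))
long_sequence = rng.normal(size=(7, input_dim))

# Sequences of different lengths map to a state of the same size.
print(rnn_forward(short_sequence).shape, rnn_forward(long_sequence).shape)
```

The loop processes inputs in order, shares parameters across positions, and accepts any number of steps, which addresses the order, parameter-sharing, and variable-length problems listed above. Long-term dependencies remain hard for a plain RNN, which is one motivation for gated cells such as the LSTM.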

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

##### Creating a Base Model

In [None]:
model = Sequential()

model.add(Embedding(input_dim=num_words, output_dim=100, 
                    embeddings_initializer=Constant(embedding_matrix), 
                    input_length=128, trainable=False))
model.add(LSTM(100, dropout=0.1))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
history = model.fit(X_train, y_train, epochs=5, batch_size=2048, validation_data=(X_test, y_test))

In [None]:
plt.figure(figsize=(16,5))
epochs = range(1, len(history.history['accuracy'])+1)
plt.plot(epochs, history.history['loss'], label='Training Loss', color='red')
plt.plot(epochs, history.history['val_loss'], 'b', label='Validation Loss')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(16,5))
epochs = range(1, len(history.history['accuracy'])+1)
plt.plot(epochs, history.history['accuracy'], label='Training Accuracy', color='red')
plt.plot(epochs, history.history['val_accuracy'], 'b', label='Validation Accuracy')
plt.legend()
plt.show()

##### Deeper n Deeper

In [None]:
model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=100, 
                    embeddings_initializer=Constant(embedding_matrix), 
                    input_length=128, trainable=False))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(256, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
history = model.fit(X_train, y_train, epochs=10, batch_size=1024, validation_data=(X_test, y_test))

In [None]:
plt.figure(figsize=(16,5))
epochs = range(1, len(history.history['accuracy'])+1)
plt.plot(epochs, history.history['loss'], label='Training Loss', color='red')
plt.plot(epochs, history.history['val_loss'], 'b', label='Validation Loss')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(16,5))
epochs = range(1, len(history.history['accuracy'])+1)
plt.plot(epochs, history.history['accuracy'], label='Training Accuracy', color='red')
plt.plot(epochs, history.history['val_accuracy'], 'b', label='Validation Accuracy')
plt.legend()
plt.show()

In [None]:
validation_sentence = ['This movie was not good at all. It had some good parts like the acting was pretty good but the story was not impressing at all.']
validation_sentence_tokened = tokenizer.texts_to_sequences(validation_sentence)
validation_sentence_padded = pad_sequences(validation_sentence_tokened, maxlen=128, 
                                    truncating='post', padding='post')
print(validation_sentence[0])
print("Probability of Positive: {}".format(model.predict(validation_sentence_padded)[0]))

Even though the sentence contains words like 'good' and 'impressing', the overall review is negative, and the model predicted this correctly (almost perfectly), with only a 1.9% chance of being positive.

In [None]:
validation_sentence = ['It had some bad parts like the storyline although the actors performed really well and that is why overall I enjoyed it.']
validation_sentence_tokened = tokenizer.texts_to_sequences(validation_sentence)
validation_sentence_padded = pad_sequences(validation_sentence_tokened, maxlen=128, 
                                    truncating='post', padding='post')
print(validation_sentence[0])
print("Probability of Positive: {}".format(model.predict(validation_sentence_padded)[0]))

This is a neutral review, and the model predicted a 50.5% chance of being positive, which matches that neutrality.

In [None]:
validation_sentence = ['I can watch this movie forever just because of the beauty in its cinematography.']
validation_sentence_tokened = tokenizer.texts_to_sequences(validation_sentence)
validation_sentence_padded = pad_sequences(validation_sentence_tokened, maxlen=128, 
                                    truncating='post', padding='post')
print(validation_sentence[0])
print("Probability of Positive: {}".format(model.predict(validation_sentence_padded)[0]))

This is a positive review expressing great love for the movie, and the model predicted it correctly with a 90.9% chance of being positive.

Scope of Improvements: 

As you can see, the model does not generalise well and starts overfitting after 9 epochs. Here is an example:

In [None]:
validation_sentence = ['What can I say? It was so astonishing that I dont have any words for it.']
validation_sentence_tokened = tokenizer.texts_to_sequences(validation_sentence)
validation_sentence_padded = pad_sequences(validation_sentence_tokened, maxlen=128, 
                                    truncating='post', padding='post')
print(validation_sentence[0])
print("Probability of Positive: {}".format(model.predict(validation_sentence_padded)[0]))
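
One simple way to locate the point where overfitting begins is to find the epoch with the lowest validation loss. The sketch below does this for a list of per-epoch losses. `best_epoch` is a hypothetical helper, and the numbers are made up for illustration; they are not results from this notebook.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch at which the validation loss is lowest."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

# Made-up values for illustration only, not results from this notebook.
example_val_loss = [0.62, 0.51, 0.44, 0.41, 0.43, 0.47]

print(best_epoch(example_val_loss))  # 4
```

In the notebook this could be applied as `best_epoch(history.history['val_loss'])`. Keras also provides an `EarlyStopping` callback that stops training once the monitored metric stops improving, which may be a more convenient option.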

# References

MIT: Introduction to Deep Learning: 6.S191

Stanford: Natural Language Processing with Deep Learning: CS224N

##### I highly recommend going through these amazing courses available for free.
