# **SENTIMENT CLASSIFIER**

Sentiment classification is a core task in natural language processing. This notebook trains a CNN + Bi-LSTM model on phrases extracted from the Rotten Tomatoes movie review dataset and classifies each phrase into one of 5 sentiment classes.

Importing the essential libraries

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import string
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import label_binarize
from sklearn.utils.class_weight import compute_sample_weight
from itertools import count


import tensorflow as tf
from tensorflow import keras
# Import everything from tensorflow.keras so that a single Keras implementation is used.
# The unused Tokenizer and pad_sequences imports are dropped; padding is done in preprocess_df.
from tensorflow.keras.layers import Dense,Embedding,Bidirectional,Dropout,SpatialDropout1D,GlobalMaxPool1D,LSTM,BatchNormalization,Conv1D,MaxPool1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers

Loading the dataset, which can be downloaded from https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only/data

In [None]:
train = pd.read_csv("../input/movie-review-sentiment-analysis-kernels-only/train.tsv.zip",sep="\t")
test = pd.read_csv("../input/movie-review-sentiment-analysis-kernels-only/test.tsv.zip",sep="\t")

In [None]:
train.head()

In [None]:
sns.countplot(x=train['Sentiment'])  # pass the column as x; recent seaborn versions require keywords
plt.title("Number of phrases per sentiment class")

The **create_vocabulary(df)** function defined below creates an indexed vocabulary from the lemmatized tokens of words present in the dataframe passed to it.

In [None]:
def create_vocabulary(df):
    counter = count(2)  # index 0 reserved for padding, index 1 for UNK token
    vocabulary = dict()
    lemmatizer = WordNetLemmatizer()
    for k in df['Phrase']:
        tokens = k.lower().split(" ")
        for token in tokens:
            lemmatoken = lemmatizer.lemmatize(token)
            if lemmatoken in vocabulary:
                continue
            vocabulary[lemmatoken] = next(counter)
    # Indices start at 2, so the largest index is len(vocabulary) + 1.
    print("Largest vocabulary index: {}".format(max(vocabulary.values())))
    return vocabulary
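
To make the indexing scheme concrete, here is a minimal sketch on two toy phrases. It is an illustration only: `build_index` is a hypothetical stand-in for `create_vocabulary` that takes a plain list of phrases and skips lemmatization (its `normalize` argument defaults to the identity), so it runs without NLTK or pandas.

```python
from itertools import count


def build_index(phrases, normalize=lambda token: token):
    """Assign each distinct token an index, starting at 2.

    Index 0 is left free for padding and index 1 for unknown tokens,
    mirroring the convention used by create_vocabulary.
    """
    counter = count(2)
    vocab = {}
    for phrase in phrases:
        for token in phrase.lower().split(" "):
            key = normalize(token)
            if key not in vocab:
                vocab[key] = next(counter)
    return vocab


toy_vocab = build_index(["A good movie", "a bad movie"])
print(toy_vocab)  # {'a': 2, 'good': 3, 'movie': 4, 'bad': 5}
```

Tokens are indexed in order of first appearance, and repeated tokens ("a", "movie") keep the index they were given the first time.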

The **preprocess_df(df, vocabulary, max_sentence_length)** function defined below preprocesses the dataframe before it is passed to the model defined later on. It converts the sentiment labels into one-hot vectors over the 5 classes. It also converts each phrase into a fixed-length sequence of vocabulary indices for its lemmatized tokens, using 1 for tokens missing from the vocabulary and 0 for padding.

In [None]:
def preprocess_df(df, vocabulary, max_sentence_length):
    X = []
    # The test set has no 'Sentiment' column, so Y is None there.
    # to_numpy() avoids the extra xarray dependency that to_xarray() would need.
    Y = label_binarize(df['Sentiment'].to_numpy(), classes=[0, 1, 2, 3, 4]) if 'Sentiment' in df else None
    lemmatizer = WordNetLemmatizer()
    for _, row in df.iterrows():
        tokens = row['Phrase'].lower().split(" ")
        vocab_tokens = []
        # Emit exactly max_sentence_length indices: longer phrases are truncated, shorter ones padded.
        for i in range(max_sentence_length):
            try:
                vocab_tokens.append(vocabulary.get(lemmatizer.lemmatize(tokens[i]), 1))  # 1 : UNK token
            except IndexError:
                vocab_tokens.append(0)  # 0 : padding token
        X.append(vocab_tokens)
    return np.asarray(X), Y
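
The padding and truncation behaviour can be seen on a toy example. The sketch below is a simplified, hypothetical helper (`encode_phrase`) that mirrors the inner loop of `preprocess_df` without lemmatization; the vocabulary indices are made up for illustration.

```python
def encode_phrase(phrase, vocabulary, max_sentence_length, pad_id=0, unk_id=1):
    """Map a phrase to exactly max_sentence_length vocabulary indices."""
    tokens = phrase.lower().split(" ")[:max_sentence_length]   # truncate long phrases
    ids = [vocabulary.get(token, unk_id) for token in tokens]  # unseen tokens -> UNK
    ids += [pad_id] * (max_sentence_length - len(ids))         # right-pad short phrases
    return ids


toy_vocabulary = {"a": 2, "good": 3, "movie": 4}  # made-up indices for illustration

short = encode_phrase("A good film", toy_vocabulary, max_sentence_length=5)
long_ = encode_phrase("a good good good good movie", toy_vocabulary, max_sentence_length=5)
print(short)  # [2, 3, 1, 0, 0]
print(long_)  # [2, 3, 3, 3, 3]
```

"film" is not in the toy vocabulary, so it maps to 1; the second phrase has six tokens, so its last token is dropped.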

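Calling the functions defined above to build the vocabulary and encode the training set. Every phrase is padded or truncated to 52 tokens.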

In [None]:
vocabulary = create_vocabulary(train)
X, Y = preprocess_df(train, vocabulary, 52)
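
The sequence length of 52 is hard-coded in the call above. One way to sanity-check such a value is to look at the longest phrase in tokens, using the same whitespace split as `preprocess_df`. The sketch below uses made-up phrases and a hypothetical helper name; on the real data the input would be `train['Phrase']`.

```python
def longest_phrase_length(phrases):
    """Return the token count of the longest phrase, splitting on single spaces."""
    return max(len(phrase.lower().split(" ")) for phrase in phrases)


toy_phrases = ["A good movie", "bad", "not a very good movie at all"]
print(longest_phrase_length(toy_phrases))  # 7
```

If the chosen length is smaller than this value, the longest phrases lose their trailing tokens during encoding.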

Train-validation split (20% of the training data is held out for validation)

In [None]:
train_X,x_valid,train_Y,y_valid = train_test_split(X,Y,test_size=0.2,random_state=42)

Defining the **model**. It uses 1D CNN layers followed by a few Bi-LSTM layers and ends with dense layers.

In [None]:
model = keras.models.Sequential()
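# input_dim must exceed the largest index in use; derive it from the vocabulary instead of hard-coding it.
model.add(keras.layers.Embedding(input_dim=max(vocabulary.values()) + 1, output_dim=10, mask_zero=True))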

model.add(Conv1D(128,3,activation='relu',padding='same'))
model.add(MaxPool1D(2))
model.add(BatchNormalization())
model.add(Dropout(0.3))

model.add(Conv1D(128,3,activation='relu',padding='same'))
model.add(MaxPool1D(2))
model.add(BatchNormalization())
model.add(Dropout(0.3))

model.add(Conv1D(128,3,activation='relu',padding='same'))
model.add(MaxPool1D(2))
model.add(BatchNormalization())
model.add(Dropout(0.15))

model.add(Conv1D(64,3,activation='relu',padding='same'))
model.add(MaxPool1D(2))
model.add(BatchNormalization())
model.add(Dropout(0.15))

model.add(Conv1D(64,3,activation='relu',padding='same'))
model.add(MaxPool1D(2))
model.add(BatchNormalization())
model.add(Dropout(0.15))

model.add(Bidirectional(LSTM(1280,recurrent_dropout=0.5,dropout=0.2,return_sequences=True)))
model.add(Bidirectional(LSTM(640,recurrent_dropout=0.5,dropout=0.2,return_sequences=True)))
model.add(Bidirectional(LSTM(640,recurrent_dropout=0.5,dropout=0.2,return_sequences=True)))
model.add(Bidirectional(LSTM(320,recurrent_dropout=0.5,dropout=0.2,return_sequences=True)))

model.add(GlobalMaxPool1D())

model.add(Dense(64,activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.15))

model.add(Dense(32,activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.15))

model.add(Dense(32,activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.15))

model.add(Dense(5,activation='softmax'))

model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['acc'])
model.summary()
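
It can help to trace how the sequence length shrinks through the five convolution blocks above. The sketch below is plain arithmetic, not Keras code, and rests on two assumptions: the input length is 52 (the value passed to `preprocess_df`), and `MaxPool1D(2)` uses its default stride (equal to the pool size) with 'valid' padding. The `padding='same'` convolutions leave the length unchanged.

```python
def pooled_length(length, pool_size=2):
    """Output length of 1D max pooling with stride == pool_size and 'valid' padding."""
    return (length - pool_size) // pool_size + 1


length = 52  # sequence length passed to preprocess_df
lengths = [length]
for _ in range(5):  # five Conv1D + MaxPool1D blocks
    length = pooled_length(length)
    lengths.append(length)

print(lengths)  # [52, 26, 13, 6, 3, 1]
```

Under these assumptions the Bi-LSTM layers would receive a sequence of a single time step, which limits how much the recurrent layers can contribute. Using fewer pooling layers or a longer input would leave more steps for them to process.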

Fitting the model

In [None]:
history=model.fit(x=train_X, y=train_Y, batch_size=256, epochs=15,validation_data=(x_valid,y_valid))

Plotting the loss and accuracy with respect to epochs

In [None]:
def plot_graphs(history, metric):
    # Plot the training and validation curves for one metric.
    plt.plot(history.history[metric])
    plt.plot(history.history["val_" + metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, "val_" + metric])
    plt.show()

plot_graphs(history, 'acc')
plot_graphs(history, 'loss')

Preprocessing the test data

In [None]:
test_X,test_Y = preprocess_df(test, vocabulary, 52)

Predicting on the test dataset and converting the results into the submission format

In [None]:
predictions = model.predict(x=np.asarray(test_X))

prediction_results = pd.concat([test,
                                pd.DataFrame([np.argmax(k) for k in predictions], columns=['Sentiment'])],
                               axis=1)
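
The conversion from softmax outputs to class labels picks, for each row, the index of the largest probability. A vectorized equivalent of the list comprehension is `np.argmax` with `axis=1`, sketched here on made-up probabilities rather than real model output.

```python
import numpy as np

# Made-up softmax outputs for three phrases over the 5 sentiment classes.
toy_predictions = np.array([
    [0.05, 0.10, 0.60, 0.20, 0.05],
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.20, 0.50],
])

labels = np.argmax(toy_predictions, axis=1)  # index of the largest probability per row
print(labels.tolist())  # [2, 0, 4]
```

Because the class indices 0 to 4 are the sentiment labels themselves, the argmax result can be written to the `Sentiment` column directly.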

Submitting results

In [None]:
submission = prediction_results[['PhraseId', 'Sentiment']]
submission.to_csv('submission.csv', index=False)