This notebook practises a typical deep learning approach to sentiment analysis. Exploratory data analysis (EDA) is omitted; only the modelling part is presented.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from bs4 import BeautifulSoup #to extract words from HTML documents

import string
from keras.utils import to_categorical
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Dense,Dropout,Embedding,LSTM,GlobalMaxPooling1D
from keras.callbacks import EarlyStopping
from keras.losses import categorical_crossentropy
from keras.optimizers import Adam
from keras.models import Sequential

# set random seeds for NumPy (used by scikit-learn) and for TensorFlow, which runs in the background for Keras
np.random.seed(514)
tf.random.set_seed(514)

# load data
train = pd.read_csv("/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip", sep="\t")
test = pd.read_csv("/kaggle/input/sentiment-analysis-on-movie-reviews/test.tsv.zip", sep="\t")

Take a first look at both the training and testing data.

In [None]:
# train data
print(f"The shape of training data is {train.shape}.")
print(train.head())
# test data
print(f"The shape of testing data is {test.shape}.")
print(test.head())

First prepare the lists of words that will be fed to our model. Here we extract lemmatized words in lower case without punctuation.

In [None]:
# lower-case, strip punctuation and split into words, then lemmatize word by word
# (WordNetLemmatizer.lemmatize expects a single word, not a whole phrase)
punct_table = str.maketrans('', '', string.punctuation)

def to_lemmas(phrase):
    words = phrase.lower().translate(punct_table).split()
    return [lemmatizer.lemmatize(word) for word in words]

train_seq = train['Phrase'].apply(to_lemmas).tolist()
test_seq = test['Phrase'].apply(to_lemmas).tolist()

print(f"Length of sequences - training data: {len(train_seq)}, testing data: {len(test_seq)}.")
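To see what the cleaning steps do to a single phrase, here is a minimal standard-library sketch. The helper name `clean_and_split` and the example phrase are made up for illustration, and lemmatization is left out because it needs the WordNet corpus.

```python
import string

# translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_and_split(phrase):
    """Lower-case a phrase, strip punctuation, and split it on whitespace."""
    return phrase.lower().translate(PUNCT_TABLE).split()

# hypothetical phrase, not taken from the dataset
tokens = clean_and_split("The plot, sadly, doesn't hold up!")
print(tokens)
```

Note that the apostrophe is removed too, so "doesn't" becomes the single token "doesnt".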

Apply one-hot encoding to the target, and perform a train/validation split that holds out 20% of the training data for validation.

In [None]:
y_target = to_categorical(train.Sentiment.values)
num_classes = y_target.shape[1]

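# stratify on the class labels and fix random_state so the split is reproducible
X_train, X_val, y_train, y_val = train_test_split(
    train_seq, y_target, test_size=0.2, stratify=train.Sentiment.values, random_state=514)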
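The two steps above can be sketched in plain Python on toy labels. This is only an illustration of the idea, not how Keras or scikit-learn implement it; the function names, the toy labels and the seed are assumptions.

```python
import random
from collections import defaultdict

def one_hot(labels, num_classes):
    """Return one row per label with a single 1 at the label's index."""
    return [[1 if k == label else 0 for k in range(num_classes)]
            for label in labels]

def stratified_split(labels, val_fraction=0.2, seed=514):
    """Split indices so each class keeps about the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_fraction)  # validation share of this class
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

toy_labels = [0] * 10 + [2] * 20 + [4] * 5   # toy class labels
train_idx, val_idx = stratified_split(toy_labels)
print(one_hot([0, 2], 3))
print(len(train_idx), len(val_idx))
```

Each class contributes 20% of its own rows to the validation part, so rare classes are not lost by chance.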

Prepare a set containing the words in the training data. If the set is reasonably small (e.g. < 20000 words) we will use all of them to create a Tokenizer.

In [None]:
# collect the vocabulary and record the longest sequence length (in words) in the training data
unique_words = set()
len_max = 0
for sent in X_train:
    unique_words.update(sent)
    len_max = max(len_max, len(sent))

print(f"Number of unique words in training set = {len(unique_words)}.")

As discussed, we set up a Keras word tokenizer sized to the full training vocabulary and fit it on the training sequences. We then pad the sequences to the maximum length found in the training data; validation or test sequences longer than that are truncated. We use pre-padding because the LSTM layers here are unidirectional, so the zeros go at the front and the real words end up at the final time steps.

In [None]:
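# create tokenizer
# note: Keras keeps only the num_words - 1 most frequent words, so the single rarest word is dropped
tokenizer = Tokenizer(num_words=len(unique_words))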
tokenizer.fit_on_texts(X_train)

# transform the word sequences to numerical vectors
X_train = tokenizer.texts_to_sequences(X_train)
X_val = tokenizer.texts_to_sequences(X_val)
X_test = tokenizer.texts_to_sequences(test_seq)

# padding
X_train = pad_sequences(X_train, maxlen=len_max, padding="pre", truncating="pre")
X_val = pad_sequences(X_val, maxlen=len_max, padding="pre", truncating="pre")
X_test = pad_sequences(X_test, maxlen=len_max, padding="pre", truncating="pre")
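The following plain-Python sketch mimics, under simplifying assumptions, what the tokenizer and `pad_sequences` do in the cell above: words are ranked by frequency starting at index 1, unseen words are dropped, and index 0 is used for padding at the front. The function names and toy sentences are hypothetical, and the `num_words` cut-off is not modelled.

```python
from collections import Counter

def build_word_index(token_lists):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, word_index):
    """Map tokens to indices, dropping words unseen during fitting."""
    return [word_index[t] for t in tokens if t in word_index]

def pad_pre(seq, maxlen):
    """Left-pad with 0 and, if too long, drop tokens from the front (maxlen > 0)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# toy training sentences, already split into words
train_tokens = [["good", "film"], ["not", "a", "good", "film", "at", "all"]]
word_index = build_word_index(train_tokens)

short = pad_pre(encode(["a", "bad", "film"], word_index), maxlen=4)
long = pad_pre(encode(train_tokens[1], word_index), maxlen=4)
print(word_index)
print(short, long)
```

The word "bad" never appeared in the toy training data, so it simply disappears from the encoded sequence; the six-word sentence loses its first two words when cut to four.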

Finally we build our network for sentiment analysis. We stack 2 LSTM layers, which are able to capture long-term dependencies. Both return full sequences, and a GlobalMaxPooling1D layer then takes the maximum of each feature over the time steps. We pass the result through 2 Dense layers, the last of which gives the probabilities of all the target classes.

In [None]:
D = 50 #embedding dimensionality
dropout_rate = 0.5

# Define early stopping that will be used as callback
early_stopping = EarlyStopping(min_delta=0.01, mode='max', monitor='val_accuracy', patience=3)
callback = [early_stopping]

# add layers to model
model = Sequential()
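# word indices start at 1 and index 0 is reserved for padding, so size the embedding as vocabulary size + 1
model.add(Embedding(len(tokenizer.word_index) + 1, D, input_length=len_max))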
model.add(LSTM(128, dropout=dropout_rate, recurrent_dropout=dropout_rate, return_sequences=True))
model.add(LSTM(64, dropout=dropout_rate, recurrent_dropout=dropout_rate, return_sequences=True))
model.add(GlobalMaxPooling1D())
model.add(Dense(50, activation='relu'))
model.add(Dropout(dropout_rate))
model.add(Dense(num_classes,activation='softmax'))

# Adam with a learning rate of 0.001 (the Keras default); `learning_rate` replaces the deprecated `lr` argument
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

# fit the model until the early stopping criterion is met (or 20 epochs have run)
r = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=256, callbacks=callback)
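Two pieces of the cell above are easy to misread: what the global max pooling reduces over, and when the early-stopping callback fires. The sketch below models both in plain Python. It is a simplified approximation rather than Keras's actual code, and the function names and toy numbers are assumptions.

```python
def global_max_pool(sequence_output):
    """Max over the time axis: one value per feature, whatever the length.

    sequence_output is a list of time steps, each a list of feature values.
    """
    return [max(feature_values) for feature_values in zip(*sequence_output)]

def early_stop_epoch(val_accuracy, min_delta=0.01, patience=3):
    """Return the 1-based epoch at which training would stop, else None."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc - min_delta > best:   # gain larger than min_delta: improvement
            best = acc
            wait = 0
        else:                        # too small a gain: count a stalled epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

# three time steps with two features each (toy values)
pooled = global_max_pool([[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]])

# toy validation accuracies: gains after epoch 2 stay below min_delta
stop_at = early_stop_epoch([0.50, 0.60, 0.605, 0.608, 0.609, 0.70])
print(pooled, stop_at)
```

With `min_delta=0.01`, a gain of less than one percentage point in validation accuracy counts as no improvement, so training can stop while the accuracy is still creeping upward.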

Plot the training and validation losses against the number of epochs. Finally, predict on the testing data and prepare the corresponding submission file.

In [None]:
counts = range(1, len(r.history['loss'])+1)
plt.plot(counts, r.history['loss'], 'r-')
plt.plot(counts, r.history['val_loss'], 'b-')
plt.legend(['Training Loss', 'Validation Loss'])
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

sub_file = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/sampleSubmission.csv',sep=',')
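# predict_classes was removed from newer TensorFlow/Keras releases; take the argmax of the predicted probabilities
sub_file.Sentiment = np.argmax(model.predict(X_test), axis=-1)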
sub_file.to_csv('submission.csv',index=False)
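The class written to the submission file is the index of the largest softmax probability in each row. A minimal sketch with made-up probabilities (the helper name `predict_class` is hypothetical):

```python
def predict_class(probabilities):
    """Return the index of the largest probability in one output row."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

# two toy rows of softmax outputs over five classes
rows = [
    [0.05, 0.10, 0.60, 0.20, 0.05],
    [0.70, 0.10, 0.10, 0.05, 0.05],
]
labels = [predict_class(row) for row in rows]
print(labels)
```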