# Sentiment Analysis for Movie Reviews - Starter Code

This Jupyter notebook was created as part of an academic activity for the [COC891-Deep Learning](http://www.coc.ufrj.br/pt/disciplinas/catalogo/8673-coc-891-deep-learning) course taught by Professor Alexandre Evsukoff at PEC/Coppe/UFRJ. Videos from past offerings of this course can be found [here](http://www.coc.ufrj.br/pt/sistema-online2/598-cursos-online/9417-cursos-online-disciplina-coc891-deep-learning).

The activity consists of a [competition](https://www.kaggle.com/c/pec-dl-202101) between students to create deep learning models that predict the sentiment (positive or negative) of movie reviews.

This Starter Code presents the data and two baseline models to help students get started with the problem. At the end, a submission file is created that can be sent directly to the competition grader. You may turn the GPU on for faster processing, but remember that you have a limited budget for it.

In [None]:
# Importing Libraries
import numpy as np 
import pandas as pd
import re
import os
import random
from collections import Counter
from tqdm import tqdm

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore')

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
tqdm.pandas()

# 1. Read and Explore Data

In [None]:
df_train = pd.read_csv('../input/dl-analise-sentimento/train.csv')
df_test = pd.read_csv('../input/dl-analise-sentimento/test.csv')

In [None]:
n_train_samples = len(df_train)
print('Train samples: ', n_train_samples, '\tTest samples: ', len(df_test))

In [None]:
df_train.head()

In [None]:
df_test.head()

Note that the test set doesn't have a sentiment column (the target). Our model has to be built on the train dataset, and its predictions on the test dataset are what we submit to the competition's grader.

In [None]:
y = df_train['sentiment'] #save targets
X_train = df_train.drop('sentiment', axis=1)
X_test = df_test
#Concatenate train and test so that each transformation is applied only once.
X = pd.concat([X_train,X_test]) 
print(X.shape)

In [None]:
# Count label distribution
y.value_counts()

The train dataset is almost evenly split between positive and negative reviews.

In [None]:
# Now, let's see the distribution of the number of words per sample
len_of_review = X['review'].apply(lambda x : len(x.split(' ')))
plt.figure(figsize=(10, 6))
plt.hist(len_of_review, 50)
plt.xlabel('Length of reviews')
plt.ylabel('Number of samples')
plt.title('Number of Words distribution')
plt.show()

In [None]:
word_counter = Counter(" ".join(X["review"]).split())

In [None]:
print('Found ', len(word_counter), ' distinct words!')

print('List of most common words in reviews: \n')
word_counter.most_common(50)

As we can see, there are a few duplicate entries ('the', 'The') because the count is case sensitive, lots of "stop words", and some leftover HTML tags.

# 2. Preprocessing Text

In this section we'll preprocess the text to clean it. 

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
STOP = stopwords.words('english') #List of common English stop words

In [None]:
print('Sample of most common words in english (stop words): \n', random.sample(STOP,50))

Here we convert the text to lowercase and remove all HTML tags using a regex.
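As a minimal sketch of this cleaning step on a made-up review (the sample string and the clean helper are illustrative, not taken from the dataset), the snippet below applies the same lowercase conversion and the same regex. It also shows a side effect worth knowing about: replacing a tag with an empty string glues the neighbouring words together, whereas replacing it with a space keeps them apart.

```python
import re

TAG_PATTERN = re.compile(r'<[^>]*>')  # same pattern as in the notebook cell

def clean(review, replacement=''):
    """Lowercase a review and replace every HTML tag with `replacement`."""
    return TAG_PATTERN.sub(replacement, review.lower())

sample = 'Great film!<br /><br />Loved it.'  # made-up example, not from the data

glued = clean(sample)        # tags removed, neighbouring words joined
spaced = clean(sample, ' ')  # tags replaced by a space instead

print(repr(glued))
print(repr(spaced))
```

Whether the glued words matter depends on the tokenizer used afterwards, so this is something to experiment with rather than a required change.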

In [None]:
# put into lowercase
X['review'] = X['review'].str.lower()
# Strip HTML tags
X['review'] = X['review'].progress_apply(lambda review : re.sub('<[^>]*>', '', review))

For some kinds of algorithms, it helps to remove stop words and apply stemming to the text. Below we create a new column with the result of this processing step.

In [None]:
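# remove stop words and do stemming
# save to another field to preserve original data for later models.
stemmer = PorterStemmer()  # create the stemmer once instead of once per word
stop_set = set(STOP)       # set membership tests are much faster than list lookups
X['review_processed'] = X['review'].progress_apply(lambda review: ' '.join(
    [stemmer.stem(word) for word in word_tokenize(review) if word not in stop_set]))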

In [None]:
word_counter_processed = Counter(" ".join(X["review_processed"]).split())
print('Found ', len(word_counter_processed), ' distinct words!')
print('List of most common words in reviews after preprocessing: \n')
word_counter_processed.most_common(50)

# 3. First Try: Traditional ML Model with Bag of Words

Our simplest model is based on a concept called Bag of Words (BoW). We'll create a matrix in which rows represent reviews and columns represent distinct words. Each cell of this matrix counts the number of occurrences of a word in a review. As you can imagine, this is a very sparse matrix.

To accomplish this we'll use a class from scikit-learn, [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) (please see the docs for details on its parameters).

For the BoW model, we'll use the text with stop words removed and stemming applied (the 'review_processed' column).

In [None]:
count_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=2)
BoW = count_vectorizer.fit_transform(X['review_processed'])

In [None]:
BoW.shape

In [None]:
# Split train and test
BoW_train = BoW[:n_train_samples]
BoW_test = BoW[n_train_samples:]
print(BoW_train.shape, BoW_test.shape)

Our BoW model generated 692,081 distinct "words". Since we set ngram_range=(1, 2), bigrams are included in the model as distinct words, so it is no surprise that the number of columns is larger than the number of distinct words we counted before.
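The effect of ngram_range and min_df on the number of columns can be sketched on a toy corpus (the three documents and the n_features helper are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good movie', 'bad movie', 'good good film']  # made-up reviews

def n_features(**kwargs):
    """Number of columns CountVectorizer produces on `docs` with the given settings."""
    return CountVectorizer(**kwargs).fit_transform(docs).shape[1]

n_unigrams = n_features(ngram_range=(1, 1))            # single words only
n_with_bigrams = n_features(ngram_range=(1, 2))        # single words plus word pairs
n_filtered = n_features(ngram_range=(1, 2), min_df=2)  # drop terms seen in fewer than 2 reviews

print(n_unigrams, n_with_bigrams, n_filtered)
```

Adding bigrams grows the vocabulary, while min_df=2 prunes every term that appears in only one document, which on a real corpus removes many rare word pairs.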

Below we use [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) from scikit-learn to keep the N columns that score highest against our labels under the ANOVA F-test (f_classif).
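As a rough illustration of what the selector does, the sketch below builds a tiny made-up feature matrix in which only the first column separates the two classes, then asks SelectKBest for the single best column. The data and the toy_ names are assumptions for the example, not part of the competition data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Made-up data: column 0 separates the classes, columns 1 and 2 barely do.
X_toy = np.array([[5, 1, 2],
                  [4, 0, 2],
                  [0, 1, 2],
                  [1, 0, 3]], dtype='float32')
y_toy = np.array([1, 1, 0, 0])

toy_selector = SelectKBest(f_classif, k=1)  # keep the column with the highest F-score
X_toy_selected = toy_selector.fit_transform(X_toy, y_toy)

kept = toy_selector.get_support()  # boolean mask over the original columns
print(kept)
print(X_toy_selected.shape)
```

The boolean mask returned by get_support() tells us which original columns survived, which is handy for inspecting the selected n-grams.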

In [None]:
N_columns = 20000
selector = SelectKBest(f_classif, k=min(N_columns, BoW_train.shape[1]))
selector.fit(BoW_train, y)
BoW_train = selector.transform(BoW_train).astype('float32')
BoW_test = selector.transform(BoW_test).astype('float32')
print('New data shapes: ', BoW_train.shape, BoW_test.shape)

In [None]:
# Split data for train / validation 
BoW_train, BoW_valid, y_train, y_valid = train_test_split(BoW_train, y, test_size=0.25, random_state=42)

Finally, we train a gradient boosting classifier on our BoW data and score it on the validation split.
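The two scores we report, accuracy and ROC-AUC, measure different things. The sketch below uses made-up labels and probabilities to show that accuracy depends on the 0.5 decision threshold, while ROC-AUC depends only on how the predicted probabilities rank positives against negatives. The by-hand version ignores ties for simplicity.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
p_pos = np.array([0.6, 0.7, 0.8, 0.9])  # made-up predicted P(positive)

# Accuracy needs hard labels, here obtained with a 0.5 threshold
acc = accuracy_score(y_true, (p_pos > 0.5).astype(int))

# ROC-AUC works directly on the scores
auc = roc_auc_score(y_true, p_pos)

# ROC-AUC by hand: fraction of (positive, negative) pairs ranked correctly
pos, neg = p_pos[y_true == 1], p_pos[y_true == 0]
auc_by_hand = np.mean([p > n for p in pos for n in neg])

print(acc, auc, auc_by_hand)
```

Here every review is predicted as positive, so accuracy is poor, yet the ranking is perfect and ROC-AUC is at its maximum.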

In [None]:
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, random_state=42).fit(BoW_train, y_train)
predictions = gbt.predict_proba(BoW_valid)
print('Accuracy ', accuracy_score(y_valid, np.argmax(predictions,axis=1)), 'ROC-AUC', roc_auc_score(y_valid, predictions[:,1]))

# 4. RNN Model

Now we'll try to build a more sophisticated model using deep learning. We'll use Keras to build a bidirectional LSTM model. Most of the code below was borrowed from [here](https://keras.io/api/datasets/imdb/).

Unlike the previous algorithm, here we'll use the less preprocessed version of the reviews, because an RNN can take advantage of every word in context, in all of its forms, including stop words.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
#Max number of tokens to keep per review.
#Larger values mean more processing time. Smaller values mean loss of information.
#Remember the first histogram we built?
maxlen = 200 

#Using same number of distinct words as previous algorithm
tokenizer = Tokenizer(num_words=N_columns)

#Using review, not review_processed
tokenizer.fit_on_texts(X['review'])
tokenized_data = tokenizer.texts_to_sequences(X['review'])

X_t = pad_sequences(tokenized_data, maxlen=maxlen)

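The tokenizer assigns an integer ID to each token (word). pad_sequences then puts all reviews into a fixed-length format: with its default settings, shorter reviews are left-padded with zeros and longer ones keep only their last maxlen tokens.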
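To see what these two steps do without running Keras, here is a hand-rolled sketch in plain Python and NumPy. The helpers build_word_index, to_sequences and pad_pre are hypothetical names written for this illustration. They are not the Keras implementation and skip details such as punctuation filtering; they only mimic the frequency-ranked IDs, the num_words cutoff, and the default "pre" padding and truncation.

```python
from collections import Counter
import numpy as np

def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by IDs, dropping unknown words and IDs >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` IDs (assumes maxlen >= 1)."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]
        if kept:
            out[row, -len(kept):] = kept
    return out

texts = ['the movie was great', 'the movie was bad', 'the plot']  # made-up reviews
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
padded = pad_pre(sequences, maxlen=2)

print(word_index)
print(sequences)
print(padded)
```

Note that the rare words ('great', 'bad', 'plot') disappear because of the num_words cutoff, and that truncation drops the beginning of a long review rather than its end.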

In [None]:
print('===>This review: \n ', X.iloc[0]['review'], '\n===>Was tokenized to:\n', X_t[0,:], '\nSize of tokens: ', X_t[0,:].shape)

In [None]:
#Split data back into train and test
X_t_train = X_t[:n_train_samples,:]
X_t_test = X_t[n_train_samples:,:]
#Split train into train and validation
X_t_train, X_t_valid, y_train, y_valid = train_test_split(X_t_train, y, test_size=0.25, random_state=42)
print(X_t_train.shape, X_t_valid.shape, X_t_test.shape)

Below we define the architecture of our RNN model.
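As a rough sanity check on the size of the network defined in the next cell, the arithmetic below estimates its number of trainable parameters. It assumes standard LSTM layers with biases (four gates, each with input weights, recurrent weights and a bias vector) and the values used in this notebook (20,000 tokens, 128-dimensional embeddings, 64 units). The lstm_params helper is a hypothetical name for this sketch.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * ((input_dim + units) * units + units)

vocab_size, embed_dim, units = 20000, 128, 64  # values used in this notebook

embedding = vocab_size * embed_dim            # one vector per token ID
bilstm_1 = 2 * lstm_params(embed_dim, units)  # forward + backward LSTM on the embeddings
bilstm_2 = 2 * lstm_params(2 * units, units)  # its input is the concatenated 2 x 64 outputs
dense = 2 * units + 1                         # weights plus bias of the sigmoid unit

total = embedding + bilstm_1 + bilstm_2 + dense
print(embedding, bilstm_1, bilstm_2, dense, total)
```

If the assumptions hold, the total should agree with what model.summary() reports. Most of the parameters sit in the embedding layer, which is one reason pretrained embeddings are an attractive option.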

In [None]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(N_columns, 128)(inputs)
# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()

Now we train our model for 3 epochs.

In [None]:
model.compile("adam", "binary_crossentropy", metrics=["accuracy", "AUC"])
model.fit(X_t_train, y_train, batch_size=32, epochs=3, validation_data=(X_t_valid, y_valid))

As you can see, our RNN model improves on the ROC-AUC score of the previous model. This is the metric we care about in the competition. So next, we'll use it to create predictions on the test file and submit the results to the grader.

# 5. Create submission file

In [None]:
test_preds = model.predict(X_t_test)

In [None]:
test_preds[:10]

In [None]:
df_submission = pd.DataFrame({'id': df_test['id'].values, 'sentiment': test_preds.T[0]})
df_submission.head()

In [None]:
#Save to file
#After saving (committing) this notebook (button in the upper right corner) you may submit to the competition
df_submission.to_csv('submission.csv', index=False)

# 6. Suggestions to improve your score

* Try to use pretrained embeddings (GloVe, word2vec, fastText) instead of training them from scratch in your deep learning models. See [here](https://keras.io/examples/nlp/pretrained_word_embeddings/).
* Try other hyperparameters for your models (maximum number of columns, range of n-grams, maximum length of sentences).
* Try other combinations of preprocessing steps. We assumed it's not a good idea to remove stop words for the RNN. Is that true? We put the data in lowercase. Is that better?
* You can create other handcrafted features, such as "number of words written entirely in CAPS LOCK". Your predictions could combine this kind of feature with the others.
* Try to ensemble and/or stack models: https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
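As one possible starting point for the handcrafted-feature suggestion, here is a minimal sketch that counts words written entirely in capital letters. The helper name count_caps_words and the sample sentence are made up for illustration, and the definition of a "caps word" (two or more consecutive A-Z letters) is an assumption you may want to change. Keep in mind that this feature must be computed on the original text, before the lowercase step applied earlier in this notebook.

```python
import re

# Assumption: a "caps word" is a run of two or more ASCII capital letters,
# so single letters such as "I" or "A" are not counted.
CAPS_WORD = re.compile(r'\b[A-Z]{2,}\b')

def count_caps_words(text):
    """Number of words in `text` written entirely in capital letters."""
    return len(CAPS_WORD.findall(text))

sample = 'This movie was AWFUL and SO boring. I left early.'  # made-up example
n_caps = count_caps_words(sample)
print(n_caps)

# Hypothetical usage on the raw (not lowercased) reviews:
# df_train['n_caps'] = df_train['review'].apply(count_caps_words)
```

A feature like this could be appended to the BoW matrix or fed to the network as an extra input alongside the token sequence.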

In [None]:
print('Done!')