# Sentiment analysis comparison: text features, LSTM and RoBERTa

0. [Introduction](#0)

1. [Preparation](#1)

    1.1 [Packages](#1.1)
    
    1.2 [Data](#1.2)
    
2. [Text features](#2)

3. [Deep Learning](#3)

    3.1 [LSTM](#3.1)
    
    3.2 [RoBERTa](#3.2)

## 0. Introduction <a id=0></a>

Sentiment analysis is an NLP task that aims to classify a text by the sentiment it conveys, also known as its *polarity* (positive, neutral or negative). A typical business-oriented application is the analysis of product reviews and customer feedback.

The dataset which we investigate contains tens of thousands of Amazon reviews, which have been labeled as positive or negative by looking at the score given by users. We show different approaches to the problem of sorting them into the correct class based on the content of the review, using both text-feature extraction and deep learning.

## 1. Preparation <a id=1></a>

### 1.1 Packages <a id=1.1></a>

In [None]:
!pip install -Uqq fastbook

!pip install git+https://github.com/ohmeow/blurr/
!pip install fsspec==2021.6.0
!pip uninstall -y torchaudio
!pip uninstall -y transformers
!pip install transformers==4.6.1

In [None]:
import warnings
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score, precision_recall_fscore_support
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

import nltk
from nltk.corpus import stopwords
import spacy
import langid

import plotly
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf

init_notebook_mode(connected=True)
cf.set_config_file(sharing='public',theme='white',offline=True)

import fastbook
fastbook.setup_book()

from fastai.text.all import *
from fastbook import *
    
warnings.filterwarnings(action='ignore', category=UserWarning)

### 1.2 Data <a id=1.2></a>

In [None]:
train_df = pd.read_csv('../input/amazon-reviews/train.csv', header=None, names=['label', 'title', 'text'])
test_df = pd.read_csv('../input/amazon-reviews/test.csv', header=None, names=['label', 'title', 'text'])

The two dataframes have three columns:

`label` - Target variable with two categorical levels: 1 if the review is negative (1/2 stars rating); 2 if the review is positive (4/5 stars rating).

`title` - Heading of the review.

`text` - Body of the review.

Since the original dataset is huge, time and memory constraints lead us to **restrict to a random subset of 50000 rows from `train_df` and 10000 rows from `test_df`**, which will be our training and validation sets respectively. We sample these subsets randomly so that both are perfectly balanced. We also merge the `title` and `text` features into a single `text` column.

In [None]:
train_df['title'] = train_df['title'].fillna('')
test_df['title'] = test_df['title'].fillna('')

train_len = 50000
test_len = 10000
rs = 42

df = pd.concat([train_df.loc[train_df['label'] == 1].sample(train_len//2, random_state=rs),
                train_df.loc[train_df['label'] == 2].sample(train_len//2, random_state=rs),
                test_df.loc[test_df['label'] == 1].sample(test_len//2, random_state=rs),
                test_df.loc[test_df['label'] == 2].sample(test_len//2, random_state=rs)]).reset_index(drop=True)
df['text'] = df['title'] + '. ' + df['text']
df.drop('title', axis=1, inplace=True)
df.head()

In [None]:
print(f'Label counts - training set:\n{df[:train_len].label.value_counts()}')
print(f'\nLabel counts - validation set:\n{df[train_len:].label.value_counts()}')

## 2. Text features <a id=2></a>

We first try classifying reviews using **text vectorization**: we split the text into tokens and, for each token in the corpus, we count how many times it appears in a review. The counts for each token become the numerical features which we feed to the model. As tokens we will use not only single words, aka 1-grams, but also 2-grams and 3-grams, i.e. sequences of contiguous words of length 2 and 3 respectively.
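
To make the notion of n-gram concrete, here is a minimal pure-Python sketch of how 1-, 2- and 3-grams can be extracted and counted for a single review. The helper `extract_ngrams` and the whitespace tokenizer are assumptions made for illustration; `CountVectorizer` uses its own tokenization rules.

```python
from collections import Counter

def extract_ngrams(text, n_min=1, n_max=3):
    """Return all contiguous word n-grams of length n_min..n_max."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

review = "great book great price"
ngram_counts = Counter(extract_ngrams(review))
print(ngram_counts)
```

Each distinct n-gram becomes one feature, which is why the vocabulary grows so quickly once 2-grams and 3-grams are included.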

In [None]:
cv = CountVectorizer(ngram_range=(1,3))
cv.fit(df['text'])
print(f"Number of n-grams in the corpus: {len(cv.vocabulary_)}")

We see that our corpus contains more than four million n-grams (for n = 1,2,3). Which are the most frequent ones?

In [None]:
fts = cv.get_feature_names()
freq = cv.transform(df['text'])
gram_counts = np.array(freq.sum(0)).squeeze()

gram_counts_df = pd.DataFrame({'n-gram': [fts[i] for i in gram_counts.argsort()],
                               'count': sorted(gram_counts)})

px.bar(gram_counts_df[-20:], y='n-gram', x='count', title='Most frequent n-grams', orientation='h', height=600)

The 1-grams that appear most frequently in the corpus are unlikely to be particularly meaningful for the classification task at hand. This problem is usually tackled in two ways: we can either remove **stopwords** (i.e. the most frequent words in the language the review is written in, such as pronouns and conjunctions), or normalize the frequency counts based on how often each n-gram appears across the documents in the corpus, e.g. by using **tf-idf** vectorization, which down-weights tokens that occur in most reviews, whether positive or negative.
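
As a rough illustration of the tf-idf weighting, the sketch below computes a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, on a toy corpus. This is the smoothed variant that scikit-learn documents as its default, to the best of my knowledge; the final L2 normalization of each row is left out, and the three documents are made up.

```python
import math

docs = [
    "the book is great",
    "the book is awful",
    "the plot is great",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def doc_freq(token):
    """Number of documents that contain the token at least once."""
    return sum(token in doc for doc in tokenized)

def smoothed_idf(token):
    """Smoothed inverse document frequency (no normalization)."""
    return math.log((1 + n_docs) / (1 + doc_freq(token))) + 1

def tfidf(token, doc):
    """Raw term count in the document times the smoothed idf."""
    return doc.count(token) * smoothed_idf(token)

for tok in ["the", "great", "awful"]:
    print(tok, round(smoothed_idf(tok), 3))
```

A token that occurs in every document ("the") gets the smallest weight, while a rarer one ("awful") is boosted.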

Before trying any of these approaches, let us consider as baseline a simple multinomial naive Bayes model which uses the n-gram frequencies (n=1,2,3) from count vectorization as features.
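
For intuition about what the baseline does with the counts, here is a small sketch of a multinomial naive Bayes classifier with Laplace smoothing written in plain Python. The functions `train_nb` and `predict_nb` and the toy reviews are illustrative assumptions, not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with Laplace smoothing."""
    token_counts = defaultdict(Counter)  # label -> token -> count
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        token_counts[label].update(text.split())
    vocab = {tok for counts in token_counts.values() for tok in counts}
    model = {}
    for label, counts in token_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(labels))
        log_lik = {tok: math.log((counts[tok] + alpha) / total) for tok in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, text):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        # Tokens never seen during training are ignored.
        scores[label] = log_prior + sum(log_lik[t] for t in text.split() if t in log_lik)
    return max(scores, key=scores.get)

# Toy data using the same label convention as the dataset (1 = negative, 2 = positive).
toy_texts = ["great book", "great great read", "awful book", "boring awful plot"]
toy_labels = [2, 2, 1, 1]
nb = train_nb(toy_texts, toy_labels)
print(predict_nb(nb, "great plot"), predict_nb(nb, "awful read"))
```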

In [None]:
cv_model = make_pipeline(CountVectorizer(ngram_range=(1,3)), MultinomialNB())

# Keep the targets 1-D: scikit-learn estimators expect y of shape (n_samples,)
x_train, y_train = df['text'][:train_len], df['label'][:train_len].values
x_test, y_test = df['text'][train_len:], df['label'][train_len:].values

cv_model.fit(x_train, y_train)
preds = cv_model.predict(x_test)

print("Model: CountVectorizer + MultinomialNB")
print(f"Number of features: {len(cv_model[0].vocabulary_)}")
print("Accuracy: {:.4f}\n".format(accuracy_score(y_test, preds)))
mat = confusion_matrix(y_test, preds)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False, xticklabels=cv_model.classes_, yticklabels=cv_model.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')

print(classification_report(y_test, preds, labels=cv_model.classes_))

Let's check now how tf-idf vectorization performs.

In [None]:
tfidf_model = make_pipeline(TfidfVectorizer(ngram_range=(1,3)), MultinomialNB())

tfidf_model.fit(x_train, y_train)
preds = tfidf_model.predict(x_test)

print("Model: TfidfVectorizer + MultinomialNB")
print(f"Number of features: {len(tfidf_model[0].vocabulary_)}")
print("Accuracy: {:.4f}\n".format(accuracy_score(y_test, preds)))
mat = confusion_matrix(y_test, preds)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False, xticklabels=tfidf_model.classes_, yticklabels=tfidf_model.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')

print(classification_report(y_test, preds, labels=tfidf_model.classes_))

The two types of vectorization give very similar results. We can try removing stopwords before vectorizing: to do so, we need to know which languages are present in the corpus of reviews. The most frequent n-grams all come from English, but there might also be reviews written in other languages. We use the `langid` library to predict the language of each review in our dataset.

In [None]:
langs = [langid.classify(s)[0] for s in df['text']]
Counter(langs)

Non-English reviews appear to be a very small minority, so we'll simply drop such samples from our dataset.

In [None]:
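is_en = np.array([l == 'en' for l in langs])
# Recompute the split point: rows dropped from the training part shift the train/validation boundary
train_len = int(is_en[:train_len].sum())
df = df[is_en].reset_index(drop=True)

x_train, y_train = df['text'][:train_len], df['label'][:train_len].values
x_test, y_test = df['text'][train_len:], df['label'][train_len:].values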

print(f'Label counts - training set:\n{df[:train_len].label.value_counts()}')
print(f'\nLabel counts - validation set:\n{df[train_len:].label.value_counts()}')

Let us now try removing English stopwords from the corpus of reviews.

In [None]:
nltk.download('stopwords')
sw = stopwords.words('english')

cv = CountVectorizer(ngram_range=(1,3), stop_words=frozenset(sw))
cv.fit(df['text'])
print(f"Number of ngrams in the corpus w/o stopwords: {len(cv.vocabulary_)}")

fts = cv.get_feature_names()
freq = cv.transform(df['text'])
gram_counts = np.array(freq.sum(0)).squeeze()

gram_counts_df = pd.DataFrame({'n-gram': [fts[i] for i in gram_counts.argsort()],
                               'count': sorted(gram_counts)})

px.bar(gram_counts_df[-20:], y='n-gram', x='count', title='Most frequent n-grams w/o stopwords',
       orientation='h', height=600)

The resulting features are now much more likely to be meaningful to sentiment analysis for the reviewed products, in particular when considering 2-grams and 3-grams.

In [None]:
cv2 = CountVectorizer(ngram_range=(2,3), stop_words=frozenset(sw))
cv2.fit(df['text'])
fts2 = cv2.get_feature_names()

freq_pos = cv2.transform(df.loc[df['label'] == 2, 'text'])
gram_counts_pos = np.array(freq_pos.sum(0)).squeeze()

gram_counts_pos_df = pd.DataFrame({'n-gram': [fts2[i] for i in gram_counts_pos.argsort()],
                               'count': sorted(gram_counts_pos)})

px.bar(gram_counts_pos_df[-20:], y='n-gram', x='count', title='Most frequent 2/3-grams w/o stopwords - Positive reviews',
       orientation='h', height=600)

In [None]:
freq_neg = cv2.transform(df.loc[df['label'] == 1, 'text'])
gram_counts_neg = np.array(freq_neg.sum(0)).squeeze()
gram_counts_neg_df = pd.DataFrame({'n-gram': [fts2[i] for i in gram_counts_neg.argsort()],
                               'count': sorted(gram_counts_neg)})

px.bar(gram_counts_neg_df[-20:], y='n-gram', x='count', title='Most frequent 2/3-grams w/o stopwords - Negative reviews',
       orientation='h', height=600)

We need to be careful when removing stopwords. For instance, we see from the charts that 'would recommend' is a frequent 2-gram in both types of reviews, since 'not' is currently included in our stopword list and therefore 'would recommend' and 'would not recommend' become the same token! This is clearly something we do not want to happen: let us fix the problem by modifying the stopword list.
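
The collision can be reproduced with a few lines of plain Python. The sketch assumes, as `CountVectorizer` does, that stopwords are removed before n-grams are built; the helper name and the tiny stopword sets are hypothetical.

```python
def bigrams_without_stopwords(text, stop_words):
    """Drop stopwords first, then build 2-grams from the remaining tokens."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

toy_stopwords = {"i", "this", "not"}       # hypothetical list that contains 'not'
fixed_stopwords = toy_stopwords - {"not"}  # same list with the negation kept

pos = "I would recommend this"
neg = "I would not recommend this"

print(bigrams_without_stopwords(pos, toy_stopwords))
print(bigrams_without_stopwords(neg, toy_stopwords))
print(bigrams_without_stopwords(neg, fixed_stopwords))
```

With 'not' in the stopword list both sentences yield the same 2-gram; once it is kept, the negative review produces distinct features.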

In [None]:
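# Keep negations as tokens: remove negated contractions ('don', "don't", 'aren', ...) from the stopword list.
# The [12:] slice skips earlier matches that merely end in 'n' ('an', 'in', 'then', ...); it depends on the NLTK list order.
for a in [l for l in sw if l.endswith('n') or l.endswith("n't")][12:]:
    sw.remove(a)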
sw.remove('no')
sw.remove('nor')
sw.remove('not')

In [None]:
cvwosw_model = make_pipeline(CountVectorizer(ngram_range=(1,3), stop_words=frozenset(sw)), MultinomialNB())

cvwosw_model.fit(x_train, y_train)
preds = cvwosw_model.predict(x_test)

print("Model: CountVectorizer + MultinomialNB w/o stopwords")
print(f"Number of features: {len(cvwosw_model[0].vocabulary_)}")
print("Accuracy: {:.4f}\n".format(accuracy_score(y_test, preds)))
mat = confusion_matrix(y_test, preds)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False, xticklabels=cvwosw_model.classes_, yticklabels=cvwosw_model.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')

print(classification_report(y_test, preds, labels=cvwosw_model.classes_))

In [None]:
tfidfwosw_model = make_pipeline(TfidfVectorizer(ngram_range=(1,3), stop_words=frozenset(sw)), MultinomialNB())

tfidfwosw_model.fit(x_train, y_train)
preds = tfidfwosw_model.predict(x_test)

print("Model: TfidfVectorizer + MultinomialNB w/o stopwords")
print(f"Number of features: {len(tfidfwosw_model[0].vocabulary_)}")
print("Accuracy: {:.4f}\n".format(accuracy_score(y_test, preds)))
mat = confusion_matrix(y_test, preds)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False, xticklabels=tfidfwosw_model.classes_, yticklabels=tfidfwosw_model.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')

print(classification_report(y_test, preds, labels=tfidfwosw_model.classes_))

Vectorization using all n-grams in the original corpus still provides a model with better accuracy. However, removing stopwords has the advantage of reducing the number of features. To this end, another approach which can be tried is **lemmatization**, i.e. substituting all words in the corpus with their lemma, grouping together all inflected forms of the same word.
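
To show why lemmatization shrinks the vocabulary, here is a toy sketch based on a hand-written lemma table. The table `toy_lemmas` and the example sentences are assumptions for illustration only; spaCy derives lemmas from a trained pipeline, not from a fixed dictionary like this one.

```python
# Hypothetical lemma table mapping inflected forms to their lemma.
toy_lemmas = {"loved": "love", "loves": "love", "loving": "love",
              "books": "book", "was": "be", "is": "be", "are": "be"}

def lemmatize(text):
    """Replace each token by its lemma when the table knows it."""
    return " ".join(toy_lemmas.get(tok, tok) for tok in text.lower().split())

corpus = ["She loved the books", "He loves the book", "The books are loved"]
raw_vocab = {tok for doc in corpus for tok in doc.lower().split()}
lemma_vocab = {tok for doc in corpus for tok in lemmatize(doc).split()}
print(len(raw_vocab), len(lemma_vocab))
```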

In [None]:
spcy = spacy.load('en_core_web_sm')

x_train_lemm = x_train.apply(lambda x: ' '.join([w.lemma_ for w in spcy(x)]))
x_test_lemm = x_test.apply(lambda x: ' '.join([w.lemma_ for w in spcy(x)]))

In [None]:
cvlemm_model = make_pipeline(CountVectorizer(ngram_range=(1,3)), MultinomialNB())

cvlemm_model.fit(x_train_lemm, y_train)
preds = cvlemm_model.predict(x_test_lemm)

print("Model: CountVectorizer + MultinomialNB lemmatized")
print(f"Number of features: {len(cvlemm_model[0].vocabulary_)}")
print("Accuracy: {:.4f}\n".format(accuracy_score(y_test, preds)))
mat = confusion_matrix(y_test, preds)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False, xticklabels=cvlemm_model.classes_, yticklabels=cvlemm_model.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')

print(classification_report(y_test, preds, labels=cvlemm_model.classes_))

Lemmatizing gives better results than removing stopwords, and it further reduces the number of features. However, the accuracy is still lower than with the original text.

## 3. Deep Learning <a id=3></a>

Neural networks have proved to be extremely effective tools for NLP tasks. They generally provide significant improvements compared to standard ML models based on text features, even with minimal preprocessing (tokenization and numericalization of tokens is often enough). We will train two different models: one based on a Long Short-Term Memory (LSTM) recurrent neural network architecture and one using a state-of-the-art transformer architecture. We will use PyTorch + fastai for the implementation, together with the Hugging Face Transformers library.

### 3.1 LSTM <a id=3.1></a>

We use the `AWD-LSTM` architecture by [Merity et al.](https://arxiv.org/pdf/1708.02182.pdf)

We first fine-tune the **language model** on our corpus (its task is to predict the next token in the text based on the previous ones), before using it to build a classifier (*transfer learning*). Preprocessing is done by tokenizing with spaCy and adding special tokens for capitalization, repetitions, beginning of strings, etc.
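
The next-token objective can be sketched by pairing each window of the token stream with the same window shifted by one position. The helper `lm_pairs` and the short stream are illustrative assumptions; `xxbos` and `xxmaj` imitate the fastai-style special tokens for beginning of text and capitalization.

```python
def lm_pairs(tokens, seq_len):
    """Split a token stream into (input, target) windows for next-token prediction."""
    pairs = []
    for start in range(0, len(tokens) - 1, seq_len):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + seq_len + 1]  # same window shifted by one token
        if len(x) == len(y):                       # drop a ragged final window
            pairs.append((x, y))
    return pairs

stream = ["xxbos", "xxmaj", "great", "book", ".", "xxmaj", "loved", "it", "."]
pairs = lm_pairs(stream, seq_len=4)
for x, y in pairs:
    print(x, "->", y)
```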

In [None]:
lm_dblock = DataBlock(blocks=TextBlock.from_df('text', is_lm=True),
                   get_x=ColReader('text'),
                   splitter=IndexSplitter(range(train_len, len(df))))
lm_dls = lm_dblock.dataloaders(df, bs=32, seq_len=72)

lm_dls.show_batch(dataloaders=lm_dls, max_n=4)

Fastai provides a very convenient *learning rate finder* to help choose a good learning rate. The neural network is then trained using the **1cycle** policy: each call to `fit_one_cycle` runs a single cycle spanning all of its epochs, with a *warmup phase*, where the learning rate is gradually increased up to `lr_max`, followed by an *annealing phase*, where it decreases again to a very small value.
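
A minimal sketch of such a schedule, with cosine warmup and cosine annealing over the fraction `pct` of training completed within one `fit_one_cycle` call, might look as follows. The function names are hypothetical, and the defaults (`pct_start=0.25`, `div=25`, `div_final=1e5`) are assumptions chosen to resemble what I believe fastai uses.

```python
import math

def cos_interp(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return start + (end - start) * (1 - math.cos(math.pi * pct)) / 2

def one_cycle_lr(pct, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate at training progress pct in [0, 1]."""
    if pct < pct_start:
        # Warmup: rise from lr_max / div to lr_max.
        return cos_interp(lr_max / div, lr_max, pct / pct_start)
    # Annealing: fall from lr_max to lr_max / div_final.
    return cos_interp(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.25, 0.5, 1.0):
    print(f"{pct:.2f} -> {one_cycle_lr(pct, lr_max=1e-2):.2e}")
```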

In [None]:
lm_learn = language_model_learner(lm_dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, perplexity])
lm_learn.lr_find()

In [None]:
lm_learn.fit_one_cycle(3, lr_max=1e-2)

We save the weights in the body of the RNN and then move on to building the classifier.

In [None]:
lm_learn.save_encoder('lstm_finetuned')

clas_dblock = DataBlock(blocks=(TextBlock.from_df('text', vocab=lm_dls.vocab), CategoryBlock),
                        get_x=ColReader('text'), get_y=ColReader('label'),
                        splitter=IndexSplitter(range(train_len, len(df))))

clas_dls = clas_dblock.dataloaders(df, bs=32, seq_len=72, dl_type=SortedDL)

clas_dls.show_batch(dls=clas_dls, max_n=5)

We load the language model weights in our classifier and then look for the best learning rate.

In [None]:
clas_learn = text_classifier_learner(clas_dls, AWD_LSTM, seq_len=72, metrics=accuracy)
clas_learn.load_encoder('lstm_finetuned')
clas_learn.freeze()

clas_learn.lr_find()

In order to train the classifier it is often a good choice to **gradually unfreeze** the layers of the NN, starting from training just the head (which at the moment contains random weights).
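
The unfreezing schedule can be pictured with a toy object that only tracks which layer groups are trainable. The class `ToyLearner` and the group names are hypothetical; the `freeze_to(n)` semantics (groups before index `n` are frozen, negative `n` counts from the end) follow my understanding of fastai's behavior.

```python
class ToyLearner:
    """Tracks which layer groups are trainable; no real weights involved."""

    def __init__(self, group_names):
        self.group_names = list(group_names)
        self.trainable = [True] * len(self.group_names)

    def freeze_to(self, n):
        # Freeze every group before index n; negative n counts from the end.
        cut = n if n >= 0 else len(self.group_names) + n
        self.trainable = [i >= cut for i in range(len(self.group_names))]

    def freeze(self):
        self.freeze_to(-1)  # only the last group (the head) trains

    def unfreeze(self):
        self.freeze_to(0)   # every group trains

    def trainable_groups(self):
        return [g for g, t in zip(self.group_names, self.trainable) if t]

learner = ToyLearner(["embedding", "rnn_1", "rnn_2", "rnn_3", "head"])
schedule = []
for step in (learner.freeze, lambda: learner.freeze_to(-2),
             lambda: learner.freeze_to(-3), learner.unfreeze):
    step()
    schedule.append(learner.trainable_groups())
    print(schedule[-1])
```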

In [None]:
clas_learn.fit_one_cycle(1, lr_max=3e-3)

In [None]:
clas_learn.freeze_to(-2)
clas_learn.lr_find()

In [None]:
clas_learn.fit_one_cycle(1, lr_max=1e-4)

In [None]:
clas_learn.freeze_to(-3)
clas_learn.lr_find()

In [None]:
clas_learn.fit_one_cycle(1, lr_max=3e-5)

We now unfreeze all remaining layers and train for a few more epochs with discriminative learning rates (lower lr for early layers, higher lr for later ones).
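
One plausible way to turn a range such as `slice(1e-6, 1e-4)` into per-group learning rates is geometric spacing, sketched below. The function name and the choice of five groups are assumptions; as far as I know fastai spreads the range in a similar multiplicative fashion, but this is not its code.

```python
def discriminative_lrs(lr_min, lr_max, n_groups):
    """Geometrically spaced learning rates, lowest for the earliest group."""
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]

lrs = discriminative_lrs(1e-6, 1e-4, n_groups=5)
print([f"{lr:.1e}" for lr in lrs])
```

Early layers, which hold general language knowledge from pretraining, change slowly, while the head adapts faster to the classification task.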

In [None]:
clas_learn.unfreeze()
clas_learn.lr_find()

In [None]:
clas_learn.fit_one_cycle(3, lr_max=slice(1e-6, 1e-4))

In [None]:
interp = ClassificationInterpretation.from_learner(clas_learn)

interp.print_classification_report()
interp.plot_confusion_matrix()

The LSTM model reaches an accuracy of 91.3%, significantly higher than the baseline obtained with text features, and with much more balanced precision and recall across the two classes. It is worth noting that the validation set contains wrongly labeled samples, as we can see by analyzing the top losses.
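
The idea behind inspecting top losses is to rank validation samples by their individual cross-entropy loss: confident predictions that disagree with the label rise to the top, and some of them turn out to be labeling errors. The sketch below uses made-up probabilities and a hypothetical helper `top_losses`.

```python
import math

def top_losses(probs_positive, labels, k=2):
    """Return the k (loss, index) pairs with the largest cross-entropy loss.

    probs_positive[i] is the predicted probability of the positive class (label 2).
    """
    losses = []
    for i, (p, y) in enumerate(zip(probs_positive, labels)):
        p_true = p if y == 2 else 1 - p  # probability assigned to the true label
        losses.append((-math.log(p_true), i))
    return sorted(losses, reverse=True)[:k]

# Hypothetical predictions; sample 3 is confidently "positive" but labeled negative.
probs = [0.95, 0.10, 0.60, 0.99]
labels = [2, 1, 2, 1]
worst = top_losses(probs, labels)
print(worst)
```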

In [None]:
interp.plot_top_losses(k=8)

### 3.2 RoBERTa <a id=3.2></a>

Let's see whether a transformer NN can further improve results. We use [RoBERTa](https://arxiv.org/abs/1907.11692), which is based on the BERT architecture; I highly recommend the [blurr library](https://github.com/ohmeow/blurr) for integrating Hugging Face Transformers with fastai.

In [None]:
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification
from blurr.data.all import *
from blurr.modeling.all import *

pretrained_model_name = "roberta-base"
model_cls = AutoModelForSequenceClassification
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)

In [None]:
dblock = DataBlock(blocks=(HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock), 
                   get_x=ColReader('text'), get_y=ColReader('label'),
                   splitter=IndexSplitter(range(train_len, len(df))))

dls = dblock.dataloaders(df, bs=4, dl_type=SortedDL)

dls.show_batch(dataloaders=dls)

In [None]:
roberta_model = HF_BaseModelWrapper(hf_model)

roberta_learn = Learner(dls, 
                roberta_model,
                opt_func=partial(Adam, decouple_wd=True),
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy,
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter)

roberta_learn.freeze()

roberta_learn.lr_find()

In [None]:
roberta_learn.fit_one_cycle(1, lr_max=3e-4)

In [None]:
roberta_learn.freeze_to(-2)
roberta_learn.lr_find()

In [None]:
roberta_learn.fit_one_cycle(1, lr_max=3e-6)

In [None]:
roberta_learn.unfreeze()
roberta_learn.fit_one_cycle(2, lr_max=slice(1e-7, 1e-5))

In [None]:
preds = roberta_learn.blurr_predict(dls.valid.items)

pred_proc = np.array([int(dls.categorize.decode(a)) for i in list(zip(*preds))[1] for a in i])

true_y = dls.valid.items['label'].values

print(classification_report(true_y, pred_proc))

Fine-tuning a pretrained transformer architecture leads to an accuracy of **96.5%**, much better than what we achieved with an LSTM RNN.