# Sentiment analysis of text data

This lab focuses on classifying natural language data. We'll use the [Movie Review Polarity Dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/), which contains film reviews annotated with a label that classifies each one as positive or negative. The task is to build a classifier that predicts the polarity of new (unseen) reviews.

The usual workflow for building and deploying a classifier is depicted in the image below:

![Text Classification workflow](https://developers.google.com/machine-learning/guides/text-classification/images/Workflow.png)

First we prepare the computing environment by importing the necessary libraries:

In [1]:
import os
import random
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
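import sklearn
import sklearn.feature_extraction.text  # a bare `import sklearn` is not guaranteed to load this submodule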
from sklearn.model_selection import GridSearchCV


The installed [NLTK](https://www.nltk.org/) package doesn't include the data it needs, which must be downloaded separately as described in the [documentation](https://www.nltk.org/data.html). The full list of available corpora is on the [NLTK website](https://www.nltk.org/nltk_data/).

In our case we need the stopword lists and the `punkt` tokeniser models (used later by `word_tokenize`), which can be downloaded as follows:

In [2]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\samue\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\samue\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Collect and load data

The dataset is in the zip archive `data/review_polarity.zip`. Inside the archive, reviews are stored as individual files within the `txt_sentoken` directory; each file holds a single review and sits in the `pos` or `neg` subdirectory according to its classification. The function below loads the dataset into a pandas dataframe:

In [3]:
from zipfile import ZipFile
import re


def load_dataset_archive(ziparch, seed=None, encoding='utf-8'):
    """Load the Movie Review Polarity Dataset from the given zip archive.
    For the description of the data see <http://www.cs.cornell.edu/people/pabo/movie-review-data/>
    """
    data = []
    with ZipFile(ziparch, 'r') as myzip:
        for fi in myzip.infolist():
            if not fi.is_dir():
                m = re.search(r'/(neg|pos)/(\w+)\.txt$', fi.filename)
                if m:
                    row = {'id': m.group(2), 'Text': myzip.read(fi).decode(
                        encoding), 'Label': 0 if m.group(1) == 'neg' else 1}
                    data.append(row)

    # shuffle data to avoid order biases
    random.seed(seed)
    random.shuffle(data)
    return pd.DataFrame.from_records(data, columns=['id', 'Text', 'Label'], index='id')
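
To see what the loader extracts from each archive entry, here is a minimal sketch of the file-name pattern on its own. The first two paths are built from review ids that appear in the sample output further down; the `README` entry is a hypothetical non-review file added only to show that it is skipped.

```python
import re

# Same pattern as in the loader: polarity directory, then the review id
pattern = re.compile(r'/(neg|pos)/(\w+)\.txt$')

# Illustrative archive entries; the README path is hypothetical
names = [
    'txt_sentoken/neg/cv887_5306.txt',
    'txt_sentoken/pos/cv895_21022.txt',
    'txt_sentoken/README',
]

parsed = []
for name in names:
    m = pattern.search(name)
    if m:  # entries that are not reviews are ignored
        parsed.append((m.group(2), 0 if m.group(1) == 'neg' else 1))

print(parsed)
```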


In [4]:
dataset = load_dataset_archive('data/review_polarity.zip')
print(dataset.info())
dataset.sample(10)


<class 'pandas.core.frame.DataFrame'>
Index: 1999 entries, cv113_24354 to cv221_2695
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    1999 non-null   object
 1   Label   1999 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 46.9+ KB
None


id,Text,Label
cv887_5306,"this talky , terribly-plotted thriller stars a...",0
cv664_4264,warner brothers has scored another marketing c...,0
cv895_21022,the soldiers of three kings have taken their c...,1
cv244_22935,"i should have known , damn it , i should have ...",0
cv332_17997,there is a rule when it comes to movies . \na ...,0
cv880_29800,for those who associate italian cinema with fe...,1
cv090_0042,"warning : anyone offended by blatant , leering...",1
cv190_27052,good films are hard to find these days . \ngre...,1
cv816_13655,"in my review of there's something about mary ,...",1
cv126_28821,"plot : a separated , glamorous , hollywood cou...",0


## Feature extraction

To use ML techniques we need to transform the text into a set of features; we can use the infrastructure provided by [scikit-learn](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature). This example uses the [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), but you can also try the [Tf–idf term weighting](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) based extractor.

To limit the number of features we can pass the `max_features` parameter to the `CountVectorizer` constructor (the set of all features might be unmanageable).

We can ignore stopwords by using the `stop_words` parameter; below we'll use the list from the downloaded NLTK corpus.

N-grams can be considered by specifying the `ngram_range` parameter, e.g. `(1, 3)` uses n-grams of length 1, 2, and 3.

In [5]:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    stop_words=nltk.corpus.stopwords.words('english'),
    max_features=100000,
    ngram_range=(1, 3)
)

%time fmatrix = vectorizer.fit_transform(dataset['Text'])

print(fmatrix.shape)


Wall time: 8.39 s
(1999, 100000)


The default tokeniser is the regular expression `(?u)\b\w\w+\b`, but we can also use one of the [NLTK tokenisers](https://www.nltk.org/api/nltk.tokenize.html):

In [6]:
small_vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    stop_words=nltk.corpus.stopwords.words('english'),
    max_features=1000,
    ngram_range=(1, 3),
    tokenizer=nltk.tokenize.word_tokenize
)

small_matrix = small_vectorizer.fit_transform(dataset['Text'])

print(small_matrix.shape)


(1999, 1000)
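
The behaviour of the default token pattern can be checked with the standard `re` module alone. The sketch below applies it to a made-up sentence: tokens need at least two word characters, and punctuation acts as a separator, so contractions and hyphenated words are split.

```python
import re

# Default CountVectorizer token pattern, as quoted above
default_pattern = r"(?u)\b\w\w+\b"

# Hypothetical review fragment
text = "i didn't like it , a 5-star flop !"

tokens = re.findall(default_pattern, text)
print(tokens)
```

An NLTK tokeniser such as `word_tokenize` handles these cases differently, which is one reason the two vectorisers can end up with different vocabularies.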


The cells below are not meant to be re-run in one go: some of the searches take hours.

## Classification and Evaluation

Below you'll find an example using the [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) classifier with the `sag` solver, tuning the regularisation parameter `C` by grid search:


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

base_estimator = LogisticRegression(solver='sag')

param_grid = {'C': [0.01, 0.05, 0.25, 0.5, 1]}

clf = GridSearchCV(base_estimator, param_grid=param_grid)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    # features: fmatrix
    # target: dataset['Label']
    %time clf.fit(fmatrix, dataset['Label'])

pd.concat([pd.DataFrame(clf.cv_results_["params"]), pd.DataFrame(
    clf.cv_results_["mean_test_score"], columns=["Accuracy"])], axis=1)


Wall time: 31.2 s


Unnamed: 0,C,Accuracy
0,0.01,0.835912
1,0.05,0.84192
2,0.25,0.84192
3,0.5,0.84042
4,1.0,0.839921
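
Besides the full results table, a fitted `GridSearchCV` exposes the winning configuration directly. The following is a minimal sketch on synthetic data (generated with `make_classification`, so the scores are unrelated to the review dataset); the `toy_` names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the feature matrix and labels
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

toy_grid = {'C': [0.01, 0.05, 0.25, 0.5, 1]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=toy_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)  # the parameter combination with the best mean CV score
print(search.best_score_)   # that mean cross-validated accuracy
```

With the default `refit=True`, `search.best_estimator_` is already retrained on all the data passed to `fit`.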


### Train with whole dataset

Once you have selected the parameter, you can prepare the model for classifying unseen data. Usually you prepare the model for deployment by training on the whole dataset (beware of overfitting, though).

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    %time lr_full = LogisticRegression(C=1, solver='sag').fit(fmatrix, dataset['Label'])


Wall time: 1.52 s


Trained models can be saved for later deployment using Python serialisation libraries. The scikit-learn documentation suggests using [joblib](https://scikit-learn.org/stable/modules/model_persistence.html):

In [None]:
import joblib

joblib.dump(lr_full, 'my_lr_classifier.joblib')
joblib.dump(vectorizer, 'my_full_vectorizer.joblib')
lr_copy = joblib.load('my_lr_classifier.joblib')
lr_copy


LogisticRegression(C=1, solver='sag')
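
Since the classifier is only usable together with the vectoriser it was trained with, one option is to bundle both into a scikit-learn pipeline and save a single file. This is a sketch of that alternative, not what the notebook does; the toy texts, labels, and file name are made up.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set
toy_texts = ["great film", "wonderful acting", "terrible plot", "boring and bad"]
toy_labels = [1, 1, 0, 0]

# Vectoriser and classifier fitted as one object
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_texts, toy_labels)

# Round trip through joblib in a temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_pipeline.joblib')
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text and predicts like the original
same = list(restored.predict(toy_texts)) == list(toy_pipeline.predict(toy_texts))
print(same)
```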

## Using the model for prediction

To classify a new instance, its features must be aligned with the ones used for training. To this end, use the `transform` method of the corresponding vectoriser (the `fit` phase is the one where the features are selected):

In [None]:
new_example = vectorizer.transform(['this is not a drill'])
lr_full.predict(new_example)

array([0], dtype=int64)

Let's have a look at the features of the new data. To understand which features are in the example we need to consider the list of feature names in the vectoriser (the method `get_feature_names_out()`).

In [None]:
features = vectorizer.get_feature_names_out()
for row, col in zip(*new_example.nonzero()):
    print('{} ({},{})={} '.format(
        features[col], row, col, new_example[row, col]))


drill (0,23230)=1 


## Try a different classifier

Select a different classifier and verify whether you can obtain a better accuracy. With textual data [naïve Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) and [support vector machine (SVM)](https://scikit-learn.org/stable/modules/svm.html#svm) are often used, but you can also train and use [deep learning models](https://developers.google.com/machine-learning/guides/text-classification/step-4).

Comment on your experiments.

### Naive Bayes

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB

In [None]:
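# X: features that the classification is based on
# y: target value (the classes)
# GaussianNB needs a dense array, so the sparse matrix is converted here;
# note that this is memory-hungry for a 1999 x 100000 matrix
X = fmatrix.toarray()
y = dataset['Label']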
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)


In [None]:
model = GaussianNB()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.73

In [None]:
param_grid = {'alpha': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]}


base_estimator = MultinomialNB()
clf = GridSearchCV(base_estimator, param_grid=param_grid)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    # features: X_train
    # target: y_train
    %time clf.fit(X_train, y_train)

pd.concat([pd.DataFrame(clf.cv_results_["params"]), pd.DataFrame(
    clf.cv_results_["mean_test_score"], columns=["Accuracy"])], axis=1)

Wall time: 1min 12s


Unnamed: 0,alpha,Accuracy
0,0.4,0.797186
1,0.5,0.796517
2,0.6,0.797184
3,0.7,0.799191
4,0.8,0.799855
5,0.9,0.801188
6,1.0,0.802524


In [None]:
model = MultinomialNB(alpha=0.9)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.804

In [None]:
base_estimator = BernoulliNB()
clf = GridSearchCV(base_estimator, param_grid=param_grid)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    %time clf.fit(X_train, y_train)

pd.concat([pd.DataFrame(clf.cv_results_["params"]), pd.DataFrame(
    clf.cv_results_["mean_test_score"], columns=["Accuracy"])], axis=1)

Wall time: 1min 49s


Unnamed: 0,alpha,Accuracy
0,0.4,0.803851
1,0.5,0.802517
2,0.6,0.799179
3,0.7,0.793177
4,0.8,0.790508
5,0.9,0.787842
6,1.0,0.783175


In [None]:
model = BernoulliNB(alpha=0.5)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8
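
Unlike `GaussianNB`, the multinomial and Bernoulli variants accept the sparse matrix produced by the vectoriser, so the dense conversion is not needed for them. A minimal sketch on a made-up corpus (texts and labels are placeholders):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature corpus
toy_texts = ["great great film", "wonderful film", "terrible film", "boring terrible plot"]
toy_labels = [1, 1, 0, 0]

# The count matrix stays sparse all the way through
toy_X = CountVectorizer().fit_transform(toy_texts)
nb = MultinomialNB(alpha=1.0).fit(toy_X, toy_labels)

toy_pred = nb.predict(toy_X)
print(issparse(toy_X), toy_pred)
```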

### SVM

In [None]:
#conda update scikit-learn
from sklearn import svm
from sklearn.model_selection import GridSearchCV

In [None]:
# Grid search over C, gamma and kernel (gamma is ignored by the linear kernel)

base_estimator = svm.SVC()

param_grid = {'C': [0.1, 1, 5, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'linear', 'sigmoid']}

clf = GridSearchCV(base_estimator, param_grid=param_grid)  # RandomizedSearchCV can be used to save time
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    %time clf.fit(X_train, y_train)

pd.concat([pd.DataFrame(clf.cv_results_["params"]), pd.DataFrame(
    clf.cv_results_["mean_test_score"], columns=["Accuracy"])], axis=1)

Wall time: 7h 37min 56s


Unnamed: 0,C,gamma,kernel,Accuracy
0,0.1,1.0,rbf,0.499666
1,0.1,1.0,linear,0.835229
2,0.1,1.0,sigmoid,0.493645
3,0.1,0.1,rbf,0.499666
4,0.1,0.1,linear,0.835229
5,0.1,0.1,sigmoid,0.507672
6,0.1,0.01,rbf,0.50301
7,0.1,0.01,linear,0.835229
8,0.1,0.01,sigmoid,0.58037
9,0.1,0.001,rbf,0.61041


In [None]:
model_svm = svm.SVC(C=1, kernel='linear', gamma=1) 
model_svm.fit(X_train, y_train)
model_svm.score(X_test, y_test)

0.836
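
The comment in the grid-search cell mentions randomised search as a way to save time. The sketch below shows the idea on synthetic data (so the scores say nothing about the reviews): instead of all 60 combinations, only `n_iter` of them are sampled and evaluated. The values of `n_iter` and `cv` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training split
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=0)

param_distributions = {'C': [0.1, 1, 5, 10, 100],
                       'gamma': [1, 0.1, 0.01, 0.001],
                       'kernel': ['rbf', 'linear', 'sigmoid']}

# Sample 8 of the 60 possible combinations
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=8, cv=3, random_state=0)
search.fit(X_toy, y_toy)

n_tried = len(search.cv_results_['params'])
print(n_tried, search.best_params_)
```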

### Deep Learning Models

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
# Choose the classifier
base_estimator = MLPClassifier(max_iter=2000, random_state=42)

# NOTE: despite its name, n_feat is the number of training samples (1499),
# not the number of features; it is used as the width of each hidden layer
n_feat = len(X_train)
param_grid = {'hidden_layer_sizes': [(n_feat), (n_feat, n_feat), (n_feat, n_feat, n_feat)],
              'activation': ['relu'], 'solver': ['adam']}
clf = GridSearchCV(base_estimator, param_grid)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    %time clf.fit(X_train, y_train)

pd.concat([pd.DataFrame(clf.cv_results_["params"]), pd.DataFrame(
    clf.cv_results_["mean_test_score"], columns=["Accuracy"])], axis=1)

Wall time: 5h 59s


Unnamed: 0,activation,hidden_layer_sizes,solver,Accuracy
0,relu,1499,adam,0.813229
1,relu,"(1499, 1499)",adam,0.794584
2,relu,"(1499, 1499, 1499)",adam,0.807857


In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Choose the classifier
base_estimator = MLPClassifier(max_iter=100, random_state=42)

# NOTE: n_feat is the number of training samples, used as the hidden layer width
n_feat = len(X_train)
param_grid = {'hidden_layer_sizes': [n_feat], 'activation': ['relu'],
              'solver': ['adam'], 'alpha': [1e-1, 1e-3, 1e-5]}
clf = RandomizedSearchCV(base_estimator, param_grid)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    %time clf.fit(X_train, y_train)

pd.concat([pd.DataFrame(clf.cv_results_["params"]), pd.DataFrame(
    clf.cv_results_["mean_test_score"], columns=["Accuracy"])], axis=1)

Wall time: 8h 24min 52s




Unnamed: 0,solver,hidden_layer_sizes,alpha,activation,Accuracy
0,adam,1499,0.1,relu,0.814542
1,adam,1499,0.001,relu,0.777055
2,adam,1499,1e-05,relu,0.827235


In [None]:
model_mlp = MLPClassifier(max_iter=2000, random_state=42, hidden_layer_sizes= (n_feat), activation= 'relu', solver= 'adam', alpha=0.00001)
model_mlp.fit(X_train, y_train)
model_mlp.score(X_test, y_test)

0.804

As in the logistic regression example, the features were not standardised here, to keep the experiments comparable. Scaling the features often helps SVM and MLP models, however, so it is worth trying for better results.
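
For count features, one scaler that keeps the matrix sparse is `MaxAbsScaler`, which divides each column by its largest absolute value. This is only one possible choice, sketched on a tiny made-up count matrix; it was not part of the experiments above.

```python
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical count matrix: 2 documents, 3 features
counts = csr_matrix([[0, 2, 4],
                     [1, 0, 8]])

# Each column is divided by its maximum absolute value (1, 2 and 8 here)
scaled = MaxAbsScaler().fit_transform(counts)

print(issparse(scaled))
print(scaled.toarray())
```

In a real experiment the scaler should be fitted on the training split only and then applied to the test split.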