# [Sentiment Analysis](https://www.kaggle.com/c/sentiment-analysis-pmr3508)

[PMR3508](https://uspdigital.usp.br/jupiterweb/obterDisciplina?sgldis=PMR3508) - Machine Learning and Pattern Recognition

Professor Fabio Gagliardi Cozman

PMR3508-2020-83 - [Vitor Gratiere Torres](https://github.com/vitorgt/PMR3508)

The goal of this assignment is to classify each IMDb review as either positive or negative. The two classes are complementary, so this is a binary classification problem.

My analysis takes the following steps:

1. [Import data and python modules](#1.-Import-data-and-python-modules) (Acquisition)
1. [Preprocessing](#2.-Preprocessing)
    1. [Clean text and vectorize](#2.1.-Clean-text-and-vectorize) (Preprocessing)
    1. [Embed words](#2.2.-Embed-words) (Representation)
1. [Modelling](#3.-Modelling)
    1. [Models Definition and Hyperparameters Search](#3.1.-Models-Definition-and-Hyperparameters-Search)
        1. [Logistic Regression](#3.1.1.-Logistic-Regression)
        1. [K-nearest Neighbors](#3.1.2.-K-nearest-Neighbors)
        1. [Multi-layer Perceptron - 1 Hidden Layer](#3.1.3.-Multi-layer-Perceptron---1-Hidden-Layer)
        1. [Multi-layer Perceptron - 2 Hidden Layers](#3.1.4.-Multi-layer-Perceptron---2-Hidden-Layers)
        1. [PyTorch Multi-layer Perceptron - 2 Hidden Layers](#3.1.5.-PyTorch-Multi-layer-Perceptron---2-Hidden-Layers)
    1. [Metric Comparison](#3.2.-Metric-Comparison)
1. [Submission](#4.-Submission)

# 1. Import data and python modules

In [1]:
import time
import random

# Data handling
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from skopt.space import Integer, Real

# Text modules
import re
import string
from ftfy import fix_text

# Embeddings
from gensim.models.doc2vec import Doc2Vec

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from skorch import NeuralNetClassifier
import torch
from torch import optim
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV

# Metrics
from sklearn.metrics import roc_auc_score

In [2]:
XYtrain = pd.read_csv("sentiment-analysis-pmr3508/data_train.csv")
XYtest = pd.read_csv("sentiment-analysis-pmr3508/data_test1.csv")
Xsubmit = pd.read_csv("sentiment-analysis-pmr3508/data_test2_X.csv")

print("Training data shape:", XYtrain.shape)
print(XYtrain.describe())

print("\nTest data shape:", XYtest.shape)
print(XYtest.describe())

print("\nSubmission data shape:", Xsubmit.shape)
print(Xsubmit.describe())

XYtrain.sample(5)

Training data shape: (24984, 2)
          positive
count  24984.00000
mean       0.49988
std        0.50001
min        0.00000
25%        0.00000
50%        0.00000
75%        1.00000
max        1.00000

Test data shape: (12492, 2)
           positive
count  12492.000000
mean       0.495037
std        0.499995
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000

Submission data shape: (12493, 1)
                                                   review
count                                               12493
unique                                              12436
top     Footprints is a very interesting movie that is...
freq                                                    2


|       | review | positive |
|------:|:-------|---------:|
| 11369 | Rock n' roll is a messy business and DiG! demo... | 1 |
| 6382 | Hunky Geordie Robson Green is Owen Springer, a... | 1 |
| 8766 | It was easy for Sir Richard Attenborough to ma... | 1 |
| 20369 | Yes, this is one of THOSE movies, so terrible,... | 0 |
| 5205 | This movie is incredibly realistic and I feel ... | 1 |


# 2. Preprocessing

This dataset differs from the previous ones: missing values (```NAs```, ```NaNs``` or ```Nulls```) are unlikely, but this time we need to watch for duplicate rows.

In [3]:
print("Training data shape initially:", XYtrain.shape)
XYtrain = XYtrain.drop_duplicates(keep="first")
print("Training data shape without duplicates:", XYtrain.shape)

print("\nTest data shape initially:", XYtest.shape)
XYtest = XYtest.drop_duplicates(keep="first")
print("Test data shape without duplicates:", XYtest.shape)

Training data shape initially: (24984, 2)
Training data shape without duplicates: (24888, 2)

Test data shape initially: (12492, 2)
Test data shape without duplicates: (12441, 2)


In [4]:
Xtrain = XYtrain.loc[:, "review"]
Ytrain = XYtrain.loc[:, "positive"]

Xtest = XYtest.loc[:, "review"]
Ytest = XYtest.loc[:, "positive"]

Xsubmit = Xsubmit.iloc[:, 0]

del XYtrain
del XYtest

print("Xtrain", Xtrain.shape, "Ytrain", Ytrain.shape)
print("Xtest", Xtest.shape, "Ytest", Ytest.shape)
print("Xsubmit", Xsubmit.shape)

Xtrain (24888,) Ytrain (24888,)
Xtest (12441,) Ytest (12441,)
Xsubmit (12493,)


```Xtrain```, ```Ytrain```, ```Xtest```, ```Ytest``` and ```Xsubmit``` are all ```pandas.Series``` now.

## 2.1. Clean text and vectorize

We were given the function below to clean and prepare the dataset. Cleaning is an important step in NLP; we could use NLTK or spaCy, but we prefer a manual approach here.

In [5]:
def preptext(txt):
    # removing tags
    txt = txt.replace("<br />", " ")
    # fixing Mojibakes (See https://pypi.org/project/ftfy/)
    txt = fix_text(txt)
    # converting case
    txt = txt.lower()
    # removing punctuation
    txt = txt.translate(str.maketrans("", "", string.punctuation))
    # removing spaced em dashes (ASCII hyphens were already stripped above)
    txt = txt.replace(" — ", " ")
    # replacing digits with a tag
    txt = re.sub(r"\d+", " <number> ", txt)
    # removing double spaces
    txt = re.sub(" +", " ", txt)
    return txt


def clean_split(data):
    # let's apply this function to clean the dataset
    data = data.apply(preptext)

    # now I need a vector of words, I'll split those strings
    data = data.apply(lambda x: x.split())

    print(data.sample(3))
    return data


Xtrain = clean_split(Xtrain)
Xtest = clean_split(Xtest)
Xsubmit = clean_split(Xsubmit)

14177    [i, didnt, enjoy, this, movie, at, allfor, one...
17087    [im, starting, to, write, this, review, during...
3717     [stanley, and, iris, show, the, triumph, of, t...
Name: review, dtype: object
7477     [i, have, been, wanting, to, see, this, since,...
7643     [i, like, movies, that, show, real, people, am...
12166    [there, have, been, some, great, television, m...
Name: review, dtype: object
11688    [the, <number>, s, were, overrun, by, all, tho...
11757    [i, love, this, young, people, trapped, in, a,...
537      [as, an, avid, gone, with, the, wind, fan, i, ...
Name: review, dtype: object
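
The first sample above contains the token ```allfor```, which suggests that deleting punctuation outright can fuse neighbouring words when a review has no space after a period or an ellipsis. Below is a minimal sketch of one possible alternative, assuming it is acceptable to replace punctuation with a space instead of removing it. ```strip_punct_keep_gaps``` is a hypothetical helper written for illustration; it is not part of the function we were given, and the sample sentence is made up.

```python
import re
import string

# Map every ASCII punctuation character to a space instead of deleting it.
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})


def strip_punct_keep_gaps(txt):
    """Hypothetical variant of the punctuation step in preptext."""
    txt = txt.lower().translate(_PUNCT_TO_SPACE)
    # collapse the runs of spaces introduced by the replacement
    return re.sub(" +", " ", txt).strip()


# Made-up sentence with an ellipsis that is not followed by a space.
sample = "I didn't enjoy this movie at all...for one thing, it drags."

# Same deletion strategy as preptext: "all...for" collapses into "allfor".
deleted = sample.lower().translate(str.maketrans("", "", string.punctuation))
print(deleted.split())

# Replacement strategy: "all" and "for" stay separate tokens.
print(strip_punct_keep_gaps(sample).split())
```

The trade-off is that contractions are split as well (```didn't``` becomes ```didn``` and ```t```), so whether this helps depends on how the embedding model was trained.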


## 2.2. Embed words

Doc2Vec maps each review, i.e. a variable-length list of tokens, to a point in a continuous vector space whose dimension is much lower than that of a one-dimension-per-word representation: $f:\; \text{review} \to \mathbb{R}^{50}$.

We'll use a pre-trained Doc2Vec model.

In [6]:
d2v = Doc2Vec.load("sentiment-analysis-pmr3508/doc2vec")

print("Embedding Xtrain")
srt = time.perf_counter()
Xtrain = Xtrain.apply(d2v.infer_vector, steps=20)
Xtrain = pd.DataFrame(Xtrain.to_list())
end = time.perf_counter()
print("Xtrain shape:", Xtrain.shape)
print(f"Embedding Xtrain done in {end - srt:.1f}s\n")

print("Embedding Xtest")
srt = time.perf_counter()
Xtest = Xtest.apply(d2v.infer_vector, steps=20)
Xtest = pd.DataFrame(Xtest.to_list())
end = time.perf_counter()
print("Xtest shape:", Xtest.shape)
print(f"Embedding Xtest done in {end - srt:.1f}s\n")

print("Embedding Xsubmit")
srt = time.perf_counter()
Xsubmit = Xsubmit.apply(d2v.infer_vector, steps=20)
Xsubmit = pd.DataFrame(Xsubmit.to_list())
end = time.perf_counter()
print("Xsubmit shape:", Xsubmit.shape)
print(f"Embedding Xsubmit done in {end - srt:.1f}s\n")

Xtrain.sample(5)

Embedding Xtrain
Xtrain shape: (24888, 50)
Embedding Xtrain done in 199.1s

Embedding Xtest
Xtest shape: (12441, 50)
Embedding Xtest done in 84.7s

Embedding Xsubmit
Xsubmit shape: (12493, 50)
Embedding Xsubmit done in 97.7s



|       | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 |
|------:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|:---:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 1096 | -0.389007 | -1.906648 | 0.083847 | 0.601308 | -1.903552 | 1.436674 | -0.111065 | 0.992137 | -1.178109 | -1.134489 | ... | -0.365865 | -0.257926 | -1.978419 | -0.264564 | -0.235605 | 0.146907 | -0.171853 | -0.113333 | -0.369105 | 0.213336 |
| 197 | -0.650514 | -1.132547 | -0.72638 | 0.003158 | -0.450859 | -0.095051 | 0.066853 | -0.460648 | 0.34948 | -0.882087 | ... | 0.466829 | -1.123515 | 0.117008 | 0.294727 | -0.206961 | -0.576789 | -0.197853 | -0.341755 | 0.696647 | -1.143242 |
| 11430 | -0.063845 | -0.122038 | -1.091817 | 1.574319 | 0.432568 | 0.186428 | -1.155819 | -0.077032 | -0.112485 | 0.720673 | ... | 1.413528 | 0.637412 | 0.067488 | 0.206915 | -1.191482 | -0.216786 | 0.352696 | -1.397749 | -0.363585 | 1.34943 |
| 17845 | -0.706304 | 0.159149 | -0.510835 | -0.191042 | -0.47192 | -0.068397 | 0.996231 | -1.246378 | -0.089533 | 1.793827 | ... | 0.210131 | -0.408126 | -1.939433 | -0.766466 | 1.58631 | -0.604824 | -0.154632 | -0.068213 | -0.40983 | -1.208645 |
| 23523 | -0.536427 | 0.590435 | 0.024463 | 0.828699 | 0.303688 | -0.334226 | 0.431376 | -0.314037 | -0.556991 | -0.641372 | ... | 0.329647 | -0.803746 | 0.206624 | 0.371842 | -0.856415 | -1.156844 | -0.259034 | -0.114242 | 0.588103 | 0.373455 |
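
One portability note on the cell above: it passes ```steps=20``` to ```infer_vector```. To the best of my knowledge, gensim 4.x renamed this keyword to ```epochs``` and no longer accepts ```steps```, so the call would raise a ```TypeError``` there. The sketch below is one way to stay compatible with both spellings; ```infer_doc_vector``` is a hypothetical helper, not part of gensim or of the original notebook.

```python
def infer_doc_vector(model, tokens, n_epochs=20):
    """Call infer_vector with whichever keyword the installed gensim accepts.

    `model` is assumed to be a loaded Doc2Vec model and `tokens` a list of
    strings, as produced by clean_split.
    """
    try:
        # keyword name used by gensim 4.x
        return model.infer_vector(tokens, epochs=n_epochs)
    except TypeError:
        # fall back for older releases that only accept `steps`
        return model.infer_vector(tokens, steps=n_epochs)


# Possible usage with the objects defined in this notebook:
#   Xtrain = Xtrain.apply(lambda toks: infer_doc_vector(d2v, toks))
```

Catching ```TypeError``` is a blunt instrument, since it would also swallow a ```TypeError``` raised for another reason (for example, tokens that are not strings); pinning the gensim version is the stricter option.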


Data is ready!

# 3. Modelling

## 3.1. Models Definition and Hyperparameters Search

Now we'll search for the best hyperparameters of each classifier, maximizing the Area Under the Receiver Operating Characteristic Curve (ROC AUC) computed from prediction scores. The scikit-learn models use ```sklearn.model_selection.RandomizedSearchCV```; the PyTorch model uses ```skopt.BayesSearchCV```.
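
To make the metric concrete: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition on four made-up points. ```pairwise_auc``` is a hypothetical helper for illustration only; the notebook itself relies on scikit-learn's ```roc_auc_score```.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 (positive, negative) pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

Because only the ranking of the scores matters, the metric should be fed predicted probabilities rather than hard 0/1 labels. This pairwise version is quadratic in the number of samples, so it is useful for intuition, not for datasets of this size.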

### 3.1.1. Logistic Regression

In [7]:
%%time

# Model definition
logreg = LogisticRegression(solver="liblinear")

# Hyperparameters
logreg_hp = dict(
    # Inverse of regularization strength; must be a positive float.
    # Like in support vector machines, smaller values specify
    # stronger regularization.
    C=np.linspace(0.01, 10, 100),
    # Used to specify the norm used in the penalization.
    penalty=["l1", "l2"],
)

# Research
logreg_researcher = RandomizedSearchCV(
    logreg,
    logreg_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=50,
    n_jobs=-1,
)
logreg_results = logreg_researcher.fit(Xtrain, Ytrain)

# Result
print(logreg_results.best_params_, logreg_results.best_score_)

{'penalty': 'l2', 'C': 0.5145454545454545} 0.8817682434170644
CPU times: user 834 ms, sys: 391 ms, total: 1.23 s
Wall time: 19.1 s


### 3.1.2. K-nearest Neighbors

In [8]:
%%time

# Model definition
knn = KNeighborsClassifier()

# Hyperparameters
knn_hp = dict(
    # Number of neighbors to use.
    n_neighbors=np.arange(151, 210, 1),
)

# Research
knn_researcher = RandomizedSearchCV(
    knn,
    knn_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=50,
    n_jobs=-1,
)
knn_results = knn_researcher.fit(Xtrain, Ytrain)

# Result
print(knn_results.best_params_, knn_results.best_score_)

{'n_neighbors': 207} 0.8647431582264684
CPU times: user 759 ms, sys: 82 ms, total: 841 ms
Wall time: 19min 16s


### 3.1.3. Multi-layer Perceptron - 1 Hidden Layer

In [9]:
%%time

# Model definition
mlp_1h = MLPClassifier(early_stopping=True)

# Hyperparameters
mlp_1h_hp = dict(
    # The ith element represents the number of neurons in the ith
    # hidden layer.
    hidden_layer_sizes=[(2 ** i,) for i in np.arange(6, 12)],
    # L2 penalty (regularization term) parameter.
    alpha=loguniform(0.000001, 0.1),
)

# Research
mlp_1h_researcher = RandomizedSearchCV(
    mlp_1h,
    mlp_1h_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=30,
    n_jobs=-1,
)
mlp_1h_results = mlp_1h_researcher.fit(Xtrain, Ytrain)

# Result
print(mlp_1h_results.best_params_, mlp_1h_results.best_score_)

{'alpha': 5.182449896642826e-06, 'hidden_layer_sizes': (256,)} 0.8895135969653997
CPU times: user 28.7 s, sys: 1min 11s, total: 1min 39s
Wall time: 3min 30s


### 3.1.4. Multi-layer Perceptron - 2 Hidden Layers

In [10]:
%%time

# Model definition
mlp_2h = MLPClassifier(early_stopping=True)

# Hyperparameters
mlp_2h_hp = dict(
    # The ith element represents the number of neurons in the ith
    # hidden layer.
    hidden_layer_sizes=[
        (2 ** i, 2 ** j)
        for j in np.arange(6, 10)
        for i in np.arange(6, 10)
    ],
    # L2 penalty (regularization term) parameter.
    alpha=loguniform(0.000001, 0.1),
)

# Research
mlp_2h_researcher = RandomizedSearchCV(
    mlp_2h,
    mlp_2h_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=30,
    n_jobs=-1,
)
mlp_2h_results = mlp_2h_researcher.fit(Xtrain, Ytrain)

# Result
print(mlp_2h_results.best_params_, mlp_2h_results.best_score_)

{'alpha': 4.485364848374194e-06, 'hidden_layer_sizes': (256, 64)} 0.8896468326732914
CPU times: user 17.9 s, sys: 24.7 s, total: 42.6 s
Wall time: 5min 22s


### 3.1.5. PyTorch Multi-layer Perceptron - 2 Hidden Layers

In [11]:
%%time

# Model definition
class MLPNet(nn.Module):
    def __init__(self, hidden1_dim=512, hidden2_dim=64, p=0.25):
        super(MLPNet, self).__init__()

        # 50 -> hidden1_dim -> hidden2_dim -> 2
        self.fc1 = nn.Linear(50, hidden1_dim)
        self.fc2 = nn.Linear(hidden1_dim, hidden2_dim)
        self.fc3 = nn.Linear(hidden2_dim, 2)

        self.dropout = nn.Dropout(p)

    def forward(self, X, **kwargs):
        fc_out = F.relu(self.fc1(X))
        fc_out = self.dropout(fc_out)

        fc_out = F.relu(self.fc2(fc_out))
        fc_out = self.dropout(fc_out)

        fc_out = self.fc3(fc_out)
        soft_out = F.softmax(fc_out, dim=-1)

        return soft_out


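# skorch wrapper: pass the module class and the target device, so that
# skorch builds the network (module__* parameters re-instantiate it anyway)
# and moves it to `device`; without `device=`, skorch defaults to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_2h = NeuralNetClassifier(
    MLPNet,
    device=device,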
    max_epochs=20,
    lr=1e-4,
    optimizer=optim.Adam,
    optimizer__weight_decay=1e-4,
    train_split=False,
    verbose=0,
    iterator_train__shuffle=True,
)

# Hyperparameters
torch_2h_hp = dict(
    module__hidden1_dim=Integer(256, 2048),
    module__hidden2_dim=Integer(64, 1024),
    module__p=Real(0.1, 0.75, prior="uniform"),
    optimizer__weight_decay=Real(1e-10, 1e-2, prior="log-uniform"),
)

# Research
torch_2h_researcher = BayesSearchCV(
    torch_2h,
    torch_2h_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=30,
    n_jobs=-1,
)
torch_2h_results = torch_2h_researcher.fit(
    Xtrain.to_numpy(np.float32), Ytrain.to_numpy(np.int64)
)

# Result
print(torch_2h_results.best_params_, torch_2h_results.best_score_)

OrderedDict([('module__hidden1_dim', 1476), ('module__hidden2_dim', 518), ('module__p', 0.47278526786037134), ('optimizer__weight_decay', 6.087811802931341e-09)]) 0.8934223721123139
CPU times: user 5min 7s, sys: 1min 56s, total: 7min 4s
Wall time: 28min 12s


## 3.2. Metric Comparison

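The scores below are computed on the held-out test set (```data_test1.csv```), which was not used during the hyperparameter search.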

In [12]:
srt = time.perf_counter()
logreg_test_score = roc_auc_score(
    Ytest, logreg_results.predict_proba(Xtest)[:, 1]
)
end = time.perf_counter()
print(f"Logistic Regression: {logreg_test_score:.5f}")
print(f"It took {end - srt:.1f}s\n")

srt = time.perf_counter()
knn_test_score = roc_auc_score(
    Ytest, knn_results.predict_proba(Xtest)[:, 1]
)
end = time.perf_counter()
print(f"K-nearest Neighbors: {knn_test_score:.5f}")
print(f"It took {end - srt:.1f}s\n")

srt = time.perf_counter()
mlp_1h_test_score = roc_auc_score(
    Ytest, mlp_1h_results.predict_proba(Xtest)[:, 1]
)
end = time.perf_counter()
print(f"Multi-layer Perceptron 1 Hidden: {mlp_1h_test_score:.5f}")
print(f"It took {end - srt:.1f}s\n")

srt = time.perf_counter()
mlp_2h_test_score = roc_auc_score(
    Ytest, mlp_2h_results.predict_proba(Xtest)[:, 1]
)
end = time.perf_counter()
print(f"Multi-layer Perceptron 2 Hidden: {mlp_2h_test_score:.5f}")
print(f"It took {end - srt:.1f}s\n")

srt = time.perf_counter()
torch_2h_test_score = roc_auc_score(
    Ytest,
    torch_2h_results.predict_proba(Xtest.to_numpy(np.float32))[:, 1],
)
end = time.perf_counter()
print(
    f"PyTorch Multi-layer Perceptron 2 Hidden: {torch_2h_test_score:.5f}"
)
print(f"It took {end - srt:.1f}s\n")

Logistic Regression: 0.88221
It took 0.0s

K-nearest Neighbors: 0.86408
It took 46.6s

Multi-layer Perceptron 1 Hidden: 0.89149
It took 0.0s

Multi-layer Perceptron 2 Hidden: 0.88714
It took 0.1s

PyTorch Multi-layer Perceptron 2 Hidden: 0.89708
It took 0.5s
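
To see the numbers side by side, the sketch below collects the cross-validation AUC reported in section 3.1 and the test AUC printed above, then ranks the models by test AUC. The values are copied by hand from this notebook's outputs (CV scores rounded to five decimals); the dictionary and variable names are my own.

```python
# model name -> (best CV AUC from the search, AUC on the test set)
scores = {
    "Logistic Regression": (0.88177, 0.88221),
    "K-nearest Neighbors": (0.86474, 0.86408),
    "MLP 1 Hidden": (0.88951, 0.89149),
    "MLP 2 Hidden": (0.88965, 0.88714),
    "PyTorch MLP 2 Hidden": (0.89342, 0.89708),
}

# Sort by test AUC, best first.
ranked = sorted(scores.items(), key=lambda kv: kv[1][1], reverse=True)

for name, (cv_auc, test_auc) in ranked:
    print(
        f"{name:<22} cv={cv_auc:.5f} test={test_auc:.5f} "
        f"diff={test_auc - cv_auc:+.5f}"
    )

best_model = ranked[0][0]
print("Best on test:", best_model)
```

The PyTorch network ranks first on both the CV and the test score, and the gap between CV and test AUC stays within a few thousandths for every model.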



# 4. Submission

In [13]:
Ysubmit = torch_2h_results.predict_proba(Xsubmit.to_numpy(np.float32))[:, 1]
Ysubmit = pd.DataFrame({"positive": Ysubmit})
Ysubmit.to_csv("submission.csv", index=True, index_label="Id")
Ysubmit

|       | positive |
|------:|---------:|
| 0 | 0.000494 |
| 1 | 0.490494 |
| 2 | 0.002213 |
| 3 | 0.136872 |
| 4 | 0.023931 |
| ... | ... |
| 12488 | 0.123713 |
| 12489 | 0.316657 |
| 12490 | 0.005438 |
| 12491 | 0.099295 |
