# [Sentiment Analysis](https://www.kaggle.com/c/sentiment-analysis-pmr3508)

[PMR3508](https://uspdigital.usp.br/jupiterweb/obterDisciplina?sgldis=PMR3508) - Machine Learning and Pattern Recognition

Professor Fabio Gagliardi Cozman

PMR3508-2020-83 - [Vitor Gratiere Torres](https://github.com/vitorgt/PMR3508)

The goal of this assignment is to classify an IMDb review as positive or negative (the two classes are complementary).

My analysis takes the following steps:

1. [Import data and python modules](#1.-Import-data-and-python-modules) (Acquisition)
1. [Preprocessing](#2.-Preprocessing)
    1. [Clean text and vectorize](#2.1.-Clean-text-and-vectorize) (Preprocessing)
    1. [Embed words](#2.2.-Embed-words) (Representation)
1. [Modelling](#3.-Modelling)
    1. [Models Definition and Hyperparameters Search](#3.1.-Models-Definition-and-Hyperparameters-Search)
        1. [Logistic Regression](#3.1.1.-Logistic-Regression)
        1. [K-nearest Neighbors](#3.1.2.-K-nearest-Neighbors)
        1. [Multi-layer Perceptron - 1 Hidden Layer](#3.1.3.-Multi-layer-Perceptron---1-Hidden-Layer)
        1. [Multi-layer Perceptron - 2 Hidden Layers](#3.1.4.-Multi-layer-Perceptron---2-Hidden-Layers)
        1. [PyTorch Multi-layer Perceptron - 2 Hidden Layers](#3.1.5.-PyTorch-Multi-layer-Perceptron---2-Hidden-Layers)
    1. [Metric Comparison](#3.2.-Metric-Comparison)
1. [Submission](#4.-Submission)

# 1. Import data and python modules

In [None]:
import time
import random

# Data handling
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from skopt.space import Integer, Real

# Text modules
import re
import string
from ftfy import fix_text

# Embeddings
from gensim.models.doc2vec import Doc2Vec

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from skorch import NeuralNetClassifier
import torch
from torch import optim
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV

# Metrics
from sklearn.metrics import roc_auc_score

In [None]:
XYtrain = pd.read_csv("sentiment-analysis-pmr3508/data_train.csv")
# Kaggle path: "../input/sentiment-analysis-pmr3508/data_train.csv"
XYtest = pd.read_csv("sentiment-analysis-pmr3508/data_test1.csv")
Xsubmit = pd.read_csv("sentiment-analysis-pmr3508/data_test2_X.csv")

print("Training data shape:", XYtrain.shape)
print(XYtrain.describe())

print("\nTest data shape:", XYtest.shape)
print(XYtest.describe())

print("\nSubmission data shape:", Xsubmit.shape)
print(Xsubmit.describe())

XYtrain.sample(5)

# 2. Preprocessing

This dataset differs from the previous ones: it is unlikely to contain ```NAs```, ```NaNs``` or ```Nulls```, so this time we look for duplicates instead.

In [None]:
print("Training data shape initially:", XYtrain.shape)
XYtrain = XYtrain.drop_duplicates(keep="first")
print("Training data shape without duplicates:", XYtrain.shape)

print("\nTest data shape initially:", XYtest.shape)
XYtest = XYtest.drop_duplicates(keep="first")
print("Test data shape without duplicates:", XYtest.shape)

print("\nSubmission data shape:", Xsubmit.shape)
# Submission rows are kept even when duplicated: every row needs a
# prediction, and the row position later becomes the submission Id.
print("Submission duplicates (kept):", Xsubmit.duplicated().sum())
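
To see what ```drop_duplicates(keep="first")``` does, here is a minimal sketch on a made-up frame (the rows below are hypothetical, not taken from the competition data). ```duplicated()``` flags every repeat after the first occurrence, and the index of the surviving rows is left untouched.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the behaviour
toy = pd.DataFrame(
    {
        "review": ["great movie", "awful plot", "great movie", "fine"],
        "positive": [1, 0, 1, 1],
    }
)

# True for each row that repeats an earlier row in full
n_dup = int(toy.duplicated(keep="first").sum())

# Keep the first occurrence of each distinct row
deduped = toy.drop_duplicates(keep="first")

print("duplicated rows:", n_dup)
print("index kept:", deduped.index.tolist())
```

Because the original index is preserved, ```reset_index(drop=True)``` can be called afterwards if a contiguous index is needed.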

In [None]:
Xtrain = XYtrain.loc[:, "review"]
Ytrain = XYtrain.loc[:, "positive"]

Xtest = XYtest.loc[:, "review"]
Ytest = XYtest.loc[:, "positive"]

Xsubmit = Xsubmit.iloc[:, 0]

del XYtrain
del XYtest

print("Xtrain", Xtrain.shape, "Ytrain", Ytrain.shape)
print("Xtest", Xtest.shape, "Ytest", Ytest.shape)
print("Xsubmit", Xsubmit.shape)

They are all ```pandas.Series``` now.

## 2.1. Clean text and vectorize

We were given the function below to clean and prepare the dataset. Cleaning is an important step in NLP; we could use NLTK or spaCy, but we prefer a manual approach here.

In [None]:
def preptext(txt):
    # removing tags
    txt = txt.replace("<br />", " ")
    # fixing Mojibakes (See https://pypi.org/project/ftfy/)
    txt = fix_text(txt)
    # converting case
    txt = txt.lower()
    # removing punctuation
    txt = txt.translate(str.maketrans("", "", string.punctuation))
    # removing spaced em dashes (not covered by string.punctuation)
    txt = txt.replace(" — ", " ")
    # replacing digits with a tag
    txt = re.sub(r"\d+", " <number> ", txt)
    # removing double spaces
    txt = re.sub(" +", " ", txt)
    return txt


def clean_split(data):
    # let's apply this function to clean the dataset
    data = data.apply(preptext)

    # now I need a vector of words, I'll split those strings
    data = data.apply(lambda x: x.split())

    print(data.sample(3))
    return data


Xtrain = clean_split(Xtrain)
Xtest = clean_split(Xtest)
Xsubmit = clean_split(Xsubmit)
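
As a quick check of what the cleaning steps produce, the sketch below applies the same sequence of operations to a made-up review. It is a simplified stand-in for ```preptext```: the ```fix_text``` call is left out so that the example only needs the standard library, and the name ```simple_prep``` and the sample string are hypothetical.

```python
import re
import string


def simple_prep(txt):
    # Same steps as preptext, minus the ftfy mojibake fix
    txt = txt.replace("<br />", " ")
    txt = txt.lower()
    # Strip ASCII punctuation
    txt = txt.translate(str.maketrans("", "", string.punctuation))
    # Drop em dashes surrounded by spaces (written as an escape here)
    txt = txt.replace(" \u2014 ", " ")
    # Replace each run of digits with a tag
    txt = re.sub(r"\d+", " <number> ", txt)
    # Collapse repeated spaces
    txt = re.sub(" +", " ", txt)
    return txt


sample = "Loved it!<br />Watched it 3 times, 10/10."
tokens = simple_prep(sample).split()
print(tokens)
```

Note that ```10/10``` collapses to a single ```<number>``` token, because the slash is removed before the digits are replaced.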

## 2.2. Embed words

Doc2Vec maps each review (a list of tokens) from a sparse space with one dimension per vocabulary word to a dense, continuous vector space of much lower dimension: $f:\; \text{review} \to \mathbb{R}^{50}$.

We'll use a pre-trained Doc2Vec model.

In [None]:
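def embed(txt, model):
    # Inference is stochastic; uncomment to make it reproducible.
    # model.random.seed(42)
    # Note: gensim >= 4.0 renamed the `steps` argument to `epochs`.
    x = model.infer_vector(txt, steps=20)
    return x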


d2v = Doc2Vec.load("sentiment-analysis-pmr3508/doc2vec")

In [None]:
print("Embedding Xtrain")
srt = time.perf_counter()
Xtrain = Xtrain.apply(embed, model=d2v)
Xtrain = pd.DataFrame(Xtrain.to_list())
end = time.perf_counter()
print("Xtrain shape:", Xtrain.shape)
print(f"Embedding Xtrain done in {end - srt:.1f}s\n")

print("Embedding Xtest")
srt = time.perf_counter()
Xtest = Xtest.apply(embed, model=d2v)
Xtest = pd.DataFrame(Xtest.to_list())
end = time.perf_counter()
print("Xtest shape:", Xtest.shape)
print(f"Embedding Xtest done in {end - srt:.1f}s\n")

print("Embedding Xsubmit")
srt = time.perf_counter()
Xsubmit = Xsubmit.apply(embed, model=d2v)
Xsubmit = pd.DataFrame(Xsubmit.to_list())
end = time.perf_counter()
print("Xsubmit shape:", Xsubmit.shape)
print(f"Embedding Xsubmit done in {end - srt:.1f}s\n")

Xtrain.sample(5)

Data is ready!

# 3. Modelling

## 3.1. Models Definition and Hyperparameters Search

Using ```sklearn.model_selection.RandomizedSearchCV``` (and ```skopt.BayesSearchCV``` for the PyTorch model), we search for the hyperparameters of each classifier that maximize the Area Under the Receiver Operating Characteristic Curve (ROC AUC), computed from prediction scores with 2-fold cross-validation.
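
For intuition about the metric: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it by brute force over all positive/negative pairs. The function name ```pairwise_auc``` and the toy scores are illustrative assumptions; in this notebook the metric itself comes from ```scoring="roc_auc"``` and ```roc_auc_score```.

```python
def pairwise_auc(y_true, y_score):
    # Split the scores by true class
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]

    # Count the pairs where the positive example is ranked higher
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(toy_labels, toy_scores)
print(auc)
```

Only the ordering of the scores matters, which is why the models are evaluated on ```predict_proba``` output rather than on hard class labels.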

### 3.1.1. Logistic Regression

In [None]:
%%time

# Model definition
logreg = LogisticRegression(solver="liblinear")

# Hyperparameters
logreg_hp = dict(
    # Inverse of regularization strength; must be a positive float.
    # Like in support vector machines, smaller values specify
    # stronger regularization.
    # The grid starts above 0 because C=0 is not a valid value.
    C=np.linspace(0.1, 10, 100),
    # Used to specify the norm used in the penalization.
    penalty=["l1", "l2"],
)

# Research
logreg_researcher = RandomizedSearchCV(
    logreg,
    logreg_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=50,
    n_jobs=-1,
)
logreg_results = logreg_researcher.fit(Xtrain, Ytrain)

# Result
print(logreg_results.best_params_, logreg_results.best_score_)

### 3.1.2. K-nearest Neighbors

In [None]:
%%time

# Model definition
knn = KNeighborsClassifier()

# Hyperparameters
knn_hp = dict(
    # Number of neighbors to use.
    n_neighbors=np.arange(151, 210, 1),
)

# Research
knn_researcher = RandomizedSearchCV(
    knn,
    knn_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=50,
    n_jobs=-1,
)
knn_results = knn_researcher.fit(Xtrain, Ytrain)

# Result
print(knn_results.best_params_, knn_results.best_score_)

### 3.1.3. Multi-layer Perceptron - 1 Hidden Layer

In [None]:
%%time

# Model definition
mlp_1h = MLPClassifier(early_stopping=True)

# Hyperparameters
mlp_1h_hp = dict(
    # The ith element represents the number of neurons in the ith
    # hidden layer.
    hidden_layer_sizes=[(2 ** i,) for i in np.arange(6, 12)],
    # L2 penalty (regularization term) parameter.
    alpha=loguniform(0.000001, 0.1),
)

# Research
mlp_1h_researcher = RandomizedSearchCV(
    mlp_1h,
    mlp_1h_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=30,
    n_jobs=-1,
)
mlp_1h_results = mlp_1h_researcher.fit(Xtrain, Ytrain)

# Result
print(mlp_1h_results.best_params_, mlp_1h_results.best_score_)

### 3.1.4. Multi-layer Perceptron - 2 Hidden Layers

In [None]:
%%time

# Model definition
mlp_2h = MLPClassifier(early_stopping=True)

# Hyperparameters
mlp_2h_hp = dict(
    # The ith element represents the number of neurons in the ith
    # hidden layer.
    hidden_layer_sizes=[
        (2 ** i, 2 ** j)
        for j in np.arange(6, 10)
        for i in np.arange(6, 10)
    ],
    # L2 penalty (regularization term) parameter.
    alpha=loguniform(0.000001, 0.1),
)

# Research
mlp_2h_researcher = RandomizedSearchCV(
    mlp_2h,
    mlp_2h_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=30,
    n_jobs=-1,
)
mlp_2h_results = mlp_2h_researcher.fit(Xtrain, Ytrain)

# Result
print(mlp_2h_results.best_params_, mlp_2h_results.best_score_)

### 3.1.5. PyTorch Multi-layer Perceptron - 2 Hidden Layers

In [None]:
%%time

# Model definition
class MLPNet(nn.Module):
    def __init__(self, hidden1_dim=512, hidden2_dim=64, p=0.25):
        super().__init__()

        # 50 -> hidden1_dim -> hidden2_dim -> 2
        self.fc1 = nn.Linear(50, hidden1_dim)
        self.fc2 = nn.Linear(hidden1_dim, hidden2_dim)
        self.fc3 = nn.Linear(hidden2_dim, 2)

        self.dropout = nn.Dropout(p)

    def forward(self, X, **kwargs):
        fc_out = F.relu(self.fc1(X))
        fc_out = self.dropout(fc_out)

        fc_out = F.relu(self.fc2(fc_out))
        fc_out = self.dropout(fc_out)

        fc_out = self.fc3(fc_out)
        soft_out = F.softmax(fc_out, dim=-1)

        return soft_out


# skorch wrapper
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_2h = NeuralNetClassifier(
    MLPNet,
    max_epochs=20,
    lr=1e-4,
    optimizer=optim.Adam,
    optimizer__weight_decay=1e-4,
    train_split=False,
    verbose=0,
    iterator_train__shuffle=True,
    # skorch moves the module and the data to this device
    device=device,
)

# Hyperparameters
torch_2h_hp = dict(
    module__hidden1_dim=Integer(256, 2048),
    module__hidden2_dim=Integer(64, 1024),
    module__p=Real(0.1, 0.75, prior="uniform"),
    optimizer__weight_decay=Real(1e-10, 1e-2, prior="log-uniform"),
)

# Research
torch_2h_researcher = BayesSearchCV(
    torch_2h,
    torch_2h_hp,
    scoring="roc_auc",
    cv=2,
    n_iter=30,
    n_jobs=-1,
)
torch_2h_results = torch_2h_researcher.fit(
    Xtrain.to_numpy(np.float32), Ytrain.to_numpy(np.int64)
)

# Result
print(torch_2h_results.best_params_, torch_2h_results.best_score_)
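
To make the architecture concrete, the following NumPy sketch reproduces the inference-time forward pass of ```MLPNet``` (dropout acts as the identity at inference) with random, untrained weights. The weights, the batch size and the helper names are assumptions for illustration only; it is not a substitute for the PyTorch module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes follow the MLPNet defaults: 50 -> 512 -> 64 -> 2
in_dim, hidden1_dim, hidden2_dim, n_classes = 50, 512, 64, 2

# Random, untrained weights (assumption: illustration only)
W1 = rng.normal(scale=0.05, size=(in_dim, hidden1_dim))
b1 = np.zeros(hidden1_dim)
W2 = rng.normal(scale=0.05, size=(hidden1_dim, hidden2_dim))
b2 = np.zeros(hidden2_dim)
W3 = rng.normal(scale=0.05, size=(hidden2_dim, n_classes))
b3 = np.zeros(n_classes)


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    # Subtract the row maximum for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(X):
    h1 = relu(X @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)


# A batch of 4 made-up 50-dimensional embeddings
X = rng.normal(size=(4, in_dim))
probs = forward(X)
print(probs.shape)
print(probs.sum(axis=1))
```

Each row of the output sums to 1, and column 1 holds the probability assigned to the positive class, which is what ```predict_proba(...)[:, 1]``` selects.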

## 3.2. Metric Comparison

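The models are compared on the held-out ```Test``` set (```data_test1.csv```), which was not used during the hyperparameter search.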

In [None]:
logreg_test_score = roc_auc_score(
    Ytest, logreg_results.predict_proba(Xtest)[:, 1]
)
print(f"Logistic Regression: {logreg_test_score:.5f}")

knn_test_score = roc_auc_score(
    Ytest, knn_results.predict_proba(Xtest)[:, 1]
)
print(f"K-nearest Neighbors: {knn_test_score:.5f}")

mlp_1h_test_score = roc_auc_score(
    Ytest, mlp_1h_results.predict_proba(Xtest)[:, 1]
)
print(f"Multi-layer Perceptron 1 Hidden: {mlp_1h_test_score:.5f}")

mlp_2h_test_score = roc_auc_score(
    Ytest, mlp_2h_results.predict_proba(Xtest)[:, 1]
)
print(f"Multi-layer Perceptron 2 Hidden: {mlp_2h_test_score:.5f}")

# skorch expects a float32 NumPy array here, as in the fit call
torch_2h_test_score = roc_auc_score(
    Ytest,
    torch_2h_results.predict_proba(Xtest.to_numpy(np.float32))[:, 1],
)
print(f"PyTorch Multi-layer Perceptron 2 Hidden: {torch_2h_test_score:.5f}")

# 4. Submission

In [None]:
# skorch expects a float32 NumPy array here, as in the fit call
Ysubmit = torch_2h_results.predict_proba(Xsubmit.to_numpy(np.float32))[:, 1]
Ysubmit = pd.DataFrame({"positive": Ysubmit})
Ysubmit.to_csv("submission.csv", index=True, index_label="Id")
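
Before uploading, it can help to confirm the file layout. The sketch below writes a small frame the same way as above into an in-memory buffer and reads it back. The probability values are placeholders, not model output, and the expected column layout (```Id```, ```positive```) is taken from the ```to_csv``` call above rather than from the competition rules.

```python
import io

import pandas as pd

# Placeholder probabilities, not real predictions
placeholder = pd.DataFrame({"positive": [0.91, 0.08, 0.55]})

# Write with the same options as the submission cell
buffer = io.StringIO()
placeholder.to_csv(buffer, index=True, index_label="Id")

# Read the file back and inspect its layout
buffer.seek(0)
check = pd.read_csv(buffer)

print(check.columns.tolist())
print(check["Id"].tolist())
```

The same read-back check can be run on ```submission.csv``` itself, together with a comparison of its row count against the number of rows in ```data_test2_X.csv```.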