# Sentiment Analysis on Movie Reviews

## Final solution

The experiments showed that the Transformer model achieves the best results on this task with our dataset, so we use it as our final solution.
Validation also showed that the best results come from fine-tuning the pre-trained model *distilbert-base-uncased*.
The model starts to overfit after only a few epochs, possibly because it is very complex relative to the small amount of training data we have.
For the final version, we therefore train for only 4 epochs on all of our training data and increase the dropout from 0.1 (the default) to 0.2.
We then use this model to produce the final output.

<br/>

This solution scored 0.69352 (accuracy) in the official Kaggle competition, which would have placed it 5th on the official leaderboard among more than 850 submissions.
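
For reference, accuracy is simply the fraction of phrases whose predicted sentiment matches the true label. The sketch below illustrates the computation on made-up labels; the `accuracy` helper and the toy data are assumptions for illustration and are not part of the competition tooling.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels on the 0-4 sentiment scale (made up for illustration).
y_true = [2, 3, 1, 4, 2, 0, 2, 3]
y_pred = [2, 3, 2, 4, 2, 1, 2, 2]

toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)  # 5 of 8 predictions match -> 0.625
```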



This section is designed to be independent of the rest of the notebook, so it can be run on its own.



In [None]:
! pip install datasets transformers

In [None]:
import transformers
import torch
import pandas as pd
from huggingface_hub import notebook_login

model_checkpoint_final = "distilbert-base-uncased"

class MovieReviewDataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings and labels so they can be indexed as a PyTorch dataset."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
      
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item


# Log in to the Hugging Face Hub.
# Copy and paste the following token into the form that pops up:
# 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
# The token grants read-only access
notebook_login()  

The following cell was used to train our final model.
Because training has already been done, the cell is commented out. You can skip directly to the next cell, where we download the model from the Hugging Face Hub and use it to produce the final predictions.

In [None]:
# from sklearn.model_selection import train_test_split 
# from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForSequenceClassification
# import pandas as pd
# import numpy as np

# # Load dataset
# train_url = "train.tsv"
# train_data = pd.read_csv(train_url, sep = '\t')

# X_train_final = train_data['Phrase'].tolist()           # training features
# y_train_final = np.array(train_data['Sentiment'])         # training labels

# # Model config label
# id2label = {
#   0: 0, 
#   1: 1,
#   2: 2,
#   3: 3,
#   4: 4
# }

# # Model creation
# model_final = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_final, num_labels=5, dropout=0.2, id2label=id2label)
# tokenizer_final = AutoTokenizer.from_pretrained(model_checkpoint_final, use_fast=True)

# # Tokenize the training phrases into DistilBERT vocabulary ids
# X_train_tokenized_final = tokenizer_final(X_train_final, truncation=True)

# # Training dataset
# train_dataset_final = MovieReviewDataset(X_train_tokenized_final, y_train_final)

# # Training hyperparameters
# batch_size_final = 16
# epochs_final = 4      # 15
# learning_rate = 2e-5
# weight_decay = 0.02
# gradient_accumulation_steps = 1

# args_final = TrainingArguments(
#     f"{model_checkpoint_final}-finetuned-final-nlp-1-epoch-default-dropout",
#     save_strategy = "epoch",
#     learning_rate = learning_rate,
#     per_device_train_batch_size = batch_size_final,
#     per_device_eval_batch_size = batch_size_final * 4,
#     gradient_accumulation_steps = gradient_accumulation_steps,
#     num_train_epochs = epochs_final,
#     weight_decay = weight_decay,
#     do_eval = False,
#     push_to_hub = True
# )

# trainer_final = Trainer(
#     model_final,
#     args_final,
#     train_dataset=train_dataset_final,
#     tokenizer=tokenizer_final
# )

# trainer_final.train()
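
The tokenizer call in the training cell uses `truncation=True` but no padding, so the encoded phrases have different lengths. When a tokenizer is passed to `Trainer` and no data collator is given, the phrases are padded per batch to the length of the longest one in that batch. The pure-Python sketch below only illustrates this idea: `pad_batch` is a hypothetical helper, not the library's implementation, the token ids are illustrative, and the pad id of 0 is an assumption about the DistilBERT vocabulary.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two tokenized phrases of different lengths (illustrative ids).
batch = [[101, 2023, 102], [101, 1037, 2204, 3185, 102]]
padded = pad_batch(batch)

print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a single global maximum length keeps batches of short phrases small, which matters here because many phrases in the dataset are only a few words long.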

Here we download the model that we fine-tuned from the Hugging Face Hub and produce our final output.
We also make sure that the model and the data are on the same device; otherwise the pipeline would raise an error.

<br/>
Before running the predictions, we need to put the model in eval mode to disable the dropout layers.
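
The effect of eval mode on dropout can be seen on a standalone `torch.nn.Dropout` layer. This is a minimal sketch using a toy tensor rather than the fine-tuned model; the layer, the input, and the seed are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)  # make the random dropout mask reproducible

dropout = torch.nn.Dropout(p=0.2)
x = torch.ones(1000)

dropout.train()   # training mode: each element is zeroed with probability p
train_out = dropout(x)

dropout.eval()    # eval mode: dropout acts as the identity
eval_out = dropout(x)

# Fraction of elements dropped in training mode (close to p for a large input).
print((train_out == 0).float().mean().item())
# In eval mode the input passes through unchanged.
print(torch.equal(eval_out, x))
```

In training mode the surviving elements are scaled by 1 / (1 - p), so the output differs from the input even where nothing is dropped. This is why predictions made without calling `eval()` would have a random component.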

<br/>

Predicting on the whole test set takes about 7 minutes on a GPU and much longer on a CPU.


In [None]:
import csv
from transformers import TextClassificationPipeline, AutoTokenizer, AutoModelForSequenceClassification

device_model = "cuda:0" if torch.cuda.is_available() else "cpu"
device_pipe = 0 if torch.cuda.is_available() else -1

# Load dataset
test_url = "../input/dataset/test.tsv"

test_data = pd.read_csv(test_url, sep = '\t')
X_test_phrases = test_data['Phrase'].tolist()
X_test_phrase_ids = test_data['PhraseId'].tolist()

tokenizer = AutoTokenizer.from_pretrained(f"gianfrancodemarco/{model_checkpoint_final}-finetuned-final-nlp")
model = AutoModelForSequenceClassification.from_pretrained(f"gianfrancodemarco/{model_checkpoint_final}-finetuned-final-nlp")

# We need to set the model in eval mode in order to disable dropout layers. Otherwise, predictions would have a random component.
model.eval()

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=device_pipe)

# About 7 mins (on GPU)
predictions = pipe(X_test_phrases)

output = [{'PhraseId': phrase_id, 'Sentiment': prediction['label']} for (phrase_id, prediction) in zip(X_test_phrase_ids, predictions)]


with open('./output.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, ['PhraseId', 'Sentiment'])
    dict_writer.writeheader()
    dict_writer.writerows(output)
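
Before submitting, it can be worth checking that the file has the expected shape. The sketch below assumes the format written above (a `PhraseId,Sentiment` header and labels from 0 to 4); `check_submission` and the sample rows are hypothetical and are not part of the original notebook.

```python
import csv
import io

VALID_LABELS = {"0", "1", "2", "3", "4"}


def check_submission(csv_text, expected_ids):
    """Return a list of problems found in the submission text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["PhraseId", "Sentiment"]:
        return [f"unexpected header: {reader.fieldnames}"]

    rows = list(reader)
    problems = []
    # Every test phrase must appear exactly once, in the original order.
    if [row["PhraseId"] for row in rows] != [str(i) for i in expected_ids]:
        problems.append("PhraseId column does not match the test set")
    # Every prediction must be one of the five sentiment classes.
    bad = [row["PhraseId"] for row in rows if row["Sentiment"] not in VALID_LABELS]
    if bad:
        problems.append(f"invalid Sentiment for PhraseId(s): {bad}")
    return problems


# Made-up sample rows for illustration.
sample = "PhraseId,Sentiment\n1,2\n2,3\n"
problems = check_submission(sample, [1, 2])
print(problems)
```

To check the real file, the same function could be called with the contents of `output.csv` and the `X_test_phrase_ids` list.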



Author : Gianfranco Demarco

Author : Francesco Ranieri
