# Introduction

* In this notebook we use HuggingFace transformers to fine-tune a pre-trained model for sentiment analysis.
* The data consist of shoppers' reviews (free text) and ratings (scores from 1 to 5) of women's apparel from an e-commerce shop.
* The task is to predict the rating from a given free-text review.

# Install Packages

In [None]:
!pip install transformers
!pip install fast_ml==3.68
!pip install datasets

# Import Packages

In [None]:
import numpy as np
import pandas as pd
from fast_ml.model_development import train_valid_test_split
from transformers import Trainer, TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch import nn
from torch.nn.functional import softmax
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
import datasets

In [None]:
# ensure that Accelerator is set to "GPU"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (f'Device Available: {DEVICE}')

# Data Preparation

In [None]:
df = pd.read_csv('/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)
df.head()

* We keep only the `Review Text` and `Rating` columns, as these are the only columns relevant to this task.
* The `Review Text` column serves as the input variable to the model and the `Rating` column is our target variable.
* The `Rating` column is label encoded, i.e. each rating string is mapped to an integer class id.
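
As a rough illustration of what label encoding does, the sketch below reproduces the idea in plain Python: the distinct rating strings are sorted and each one is assigned its position as an integer id. The `ratings` list and the `label2id_demo` name are made up for this example; the notebook itself relies on scikit-learn's `LabelEncoder`.

```python
# Hypothetical sample of rating strings in the same format used in this notebook
ratings = ["5 Stars", "1 Star", "3 Stars", "5 Stars", "2 Stars", "4 Stars"]

# Sort the distinct classes and map each one to its index
classes = sorted(set(ratings))
label2id_demo = {label: idx for idx, label in enumerate(classes)}

# Replace every rating string with its integer id
encoded = [label2id_demo[r] for r in ratings]

print(classes)  # ['1 Star', '2 Stars', '3 Stars', '4 Stars', '5 Stars']
print(encoded)  # [4, 0, 2, 4, 1, 3]
```

Because the classes are sorted, class id `0` corresponds to `1 Star` and class id `4` to `5 Stars`.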

In [None]:
df_reviews = df.loc[:, ['Review Text', 'Rating']].dropna()
df_reviews['Rating'] = df_reviews['Rating'].apply(lambda x: f'{x} Stars' if x != 1 else f'{x} Star')
df_reviews.head()

In [None]:
le = LabelEncoder()
df_reviews['Rating'] = le.fit_transform(df_reviews['Rating'])
df_reviews.head()

In [None]:
print (le.classes_)

* We split the data into train, validation and test sets in the ratio of 80%, 10% and 10% respectively.
* The reviews and ratings are converted into lists, where each item in a list corresponds to one review or rating.
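
To make the 80/10/10 split concrete, here is a minimal sketch in plain Python. It is not the `fast_ml` implementation; the helper `split_80_10_10` and the toy `rows` are hypothetical and only show the idea of shuffling once and cutting the data at the 80% and 90% marks.

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into 80% / 10% / 10% parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Hypothetical toy data: (review text, encoded rating) pairs
rows = [(f"review {i}", i % 5) for i in range(100)]
train_rows, valid_rows, test_rows = split_80_10_10(rows)

# Convert each part into parallel lists of texts and labels
demo_train_texts = [text for text, _ in train_rows]
demo_train_labels = [label for _, label in train_rows]

print(len(train_rows), len(valid_rows), len(test_rows))  # 80 10 10
```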

In [None]:
(train_texts, train_labels,
 val_texts, val_labels,
 test_texts, test_labels) = train_valid_test_split(df_reviews, target = 'Rating', train_size=0.8, valid_size=0.1, test_size=0.1)

train_texts = train_texts['Review Text'].to_list()
train_labels = train_labels.to_list()
val_texts = val_texts['Review Text'].to_list()
val_labels = val_labels.to_list()
test_texts = test_texts['Review Text'].to_list()
test_labels = test_labels.to_list()

print ('Sample training data')
print ('')
print (train_texts[0:5])
print ('')
print ('')
print ('Sample target variable')
print ('')
print (train_labels[0:5])

* We initialize the `distilbert-base-uncased` tokenizer and encode the train, validation and test data.

In [None]:
class DataLoader(torch.utils.data.Dataset):
    def __init__(self, sentences=None, labels=None):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
        
        if bool(sentences):
            self.encodings = self.tokenizer(self.sentences,
                                            truncation = True,
                                            padding = True)
        
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        
        if self.labels is None:
            item['labels'] = None
        else:
            item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.sentences)
    
    
    def encode(self, x):
        return self.tokenizer(x, return_tensors = 'pt').to(DEVICE) 
        
        

In [None]:
train_dataset = DataLoader(train_texts, train_labels)
val_dataset = DataLoader(val_texts, val_labels)
test_dataset = DataLoader(test_texts, test_labels)

* This is what the data look like after passing through the `DataLoader` class. This is the data format accepted by the HuggingFace transformer model that we will be using.
* Each item is a dictionary consisting of 3 key-value pairs:
- `input_ids`: a tensor of integers, where each integer is the id of one token from the original sentence. The tokenizer step has split the sentence into tokens and mapped each token to its integer id. The first token, `101`, is the start-of-sentence token (`[CLS]`) and `102` is the end-of-sentence token (`[SEP]`). Notice the many trailing zeros: they are padding that was applied to the sentences at the tokenizer step so that all sequences have the same length.
- `attention_mask`: a tensor of binary values. Each position of the `attention_mask` corresponds to the token at the same position in `input_ids`. `1` indicates that the token at that position should be attended to and `0` indicates that the token at that position is padding.
- `labels`: the target label, i.e. the encoded rating.
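
The relationship between padding and the attention mask can be sketched without the real tokenizer. The example below assumes a tiny made-up vocabulary and splits on whitespace, which is much simpler than the WordPiece tokenization DistilBERT actually uses; only the special ids `0` (`[PAD]`), `101` (`[CLS]`) and `102` (`[SEP]`) match the real vocabulary. `vocab` and `encode_batch` are hypothetical names.

```python
# Hypothetical toy vocabulary; the word ids are invented for illustration
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "love": 1, "this": 2, "dress": 3, "too": 4, "small": 5}

def encode_batch(sentences):
    """Wrap each sentence in [CLS]/[SEP], then pad to the longest sequence."""
    ids = [[vocab["[CLS]"]] + [vocab[w] for w in s.split()] + [vocab["[SEP]"]]
           for s in sentences]
    max_len = max(len(seq) for seq in ids)
    # Pad with the [PAD] id and mark real tokens with 1, padding with 0
    input_ids = [seq + [vocab["[PAD]"]] * (max_len - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = encode_batch(["love this dress", "too small"])
print(batch["input_ids"])       # [[101, 1, 2, 3, 102], [101, 4, 5, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
```

The shorter sentence receives one trailing `0` in `input_ids`, and the matching position in `attention_mask` is `0` so that the model ignores it.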

In [None]:
train_dataset.__getitem__(0)

Below is the original text before tokenization.

In [None]:
train_texts[0]

Let's set up the evaluation metrics.  
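
The metrics in this notebook are computed with `average = 'macro'`, which means the score is calculated per class and then averaged with equal weight for every class. The sketch below works through that calculation in plain Python on a made-up example; `macro_scores`, `y_true` and `y_pred` are hypothetical names and are not part of the `datasets` API.

```python
def macro_scores(y_true, y_pred):
    """Per-class precision, recall and F1, averaged with equal class weights."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {"precision": sum(precisions) / n,
            "recall": sum(recalls) / n,
            "f1": sum(f1s) / n}

# Hypothetical imbalanced example: four items of class 0, two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

scores = macro_scores(y_true, y_pred)
accuracy_demo = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores)
print(accuracy_demo)
```

Here accuracy is 5/6 while macro recall is only 0.75, because the single mistake falls on the smaller class and every class counts equally in the macro average.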

In [None]:
f1 = datasets.load_metric('f1')
accuracy = datasets.load_metric('accuracy')
precision = datasets.load_metric('precision')
recall = datasets.load_metric('recall')

def compute_metrics(eval_pred):
    metrics_dict = {}
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    metrics_dict.update(f1.compute(predictions = predictions, references = labels, average = 'macro'))
    metrics_dict.update(accuracy.compute(predictions = predictions, references = labels))
    metrics_dict.update(precision.compute(predictions = predictions, references = labels, average = 'macro'))
    metrics_dict.update(recall.compute(predictions = predictions, references = labels, average = 'macro'))

    return metrics_dict

In [None]:
id2label = {idx:label for idx, label in enumerate(le.classes_)}
label2id = {label:idx for idx, label in enumerate(le.classes_)}

config = AutoConfig.from_pretrained('distilbert-base-uncased',
                                    num_labels = 5,
                                    id2label = id2label,
                                    label2id = label2id)
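# from_config() would only build the architecture with randomly initialized weights;
# from_pretrained() loads the pre-trained checkpoint that we want to fine-tune.
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                           config = config)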

In [None]:
print (f'id2label: {id2label}')
print (f'label2id: {label2id}')

In [None]:
config

In [None]:
model

## Model Training

In [None]:
training_args = TrainingArguments(
    output_dir='/kaggle/working/results',
    num_train_epochs=1,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.05,
    report_to='none',
    evaluation_strategy='steps',
    logging_dir='/kaggle/working/logs',
    logging_steps=50)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics)

trainer.train()


Let's evaluate the trained model on the test data. The output contains the unnormalized scores (logits), the ground-truth label ids and the evaluation metrics.
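
The unnormalized scores can be turned into class probabilities with a softmax over the class dimension, and the predicted class is the position of the largest score. The sketch below shows this in plain Python on invented numbers; `softmax_row` and `demo_logits` are hypothetical and stand in for one row each of `test_results.predictions`.

```python
import math

def softmax_row(logits):
    """Numerically stable softmax over one row of unnormalized scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical unnormalized scores for two reviews over the 5 rating classes
demo_logits = [[2.0, 1.0, 0.1, -1.0, 0.5],
               [-0.5, 0.2, 0.1, 3.0, 1.0]]

demo_probs = [softmax_row(row) for row in demo_logits]
# The predicted class id is the index of the highest probability in each row
demo_pred_ids = [max(range(len(row)), key=row.__getitem__) for row in demo_probs]

print(demo_pred_ids)  # [0, 3]
```

Softmax does not change the ranking of the scores, so taking the argmax of the raw scores gives the same predicted class.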

In [None]:
test_results = trainer.predict(test_dataset)

print ('Predictions, unnormalized score')
print (test_results.predictions)
print ('')
print ('Ground Truth Labels')
print (test_results.label_ids)
print ('')
print ('Metrics')
print (test_results.metrics)

In [None]:
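# map class ids back to rating strings (this is the id -> label direction)
id2label_mapper = model.config.id2label
# normalize the scores over the class dimension
proba = softmax(torch.from_numpy(test_results.predictions), dim = -1)
pred = [id2label_mapper[i] for i in torch.argmax(proba, dim = -1).numpy()]
actual = [id2label_mapper[i] for i in test_results.label_ids]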
class_report = classification_report(actual, pred, output_dict = True)
pd.DataFrame(class_report)

# Save Model

In [None]:
trainer.save_model('/kaggle/working/sentiment_model')