<a href="https://colab.research.google.com/github/ritwickban/Sentiment-Classification/blob/main/Sentiment_Classification_on_SST2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification on SST2 data set
This notebook is taken from the Huggingface website (https://huggingface.co/docs/transformers/tasks/sequence_classification), modified for CSCI5541 and used for as a template for the homework assignment. Today we will be:
1. Finetuning [Roberta](https://huggingface.co/distilbert-base-uncased) on the [SST2](https://huggingface.co/datasets/imdb) dataset to determine whether a given text is positive or negative.
2. Using the finetuned model for inference.

### 1. Load dataset

In [38]:
%%capture
# Use this only after you check everything is being loaded properly

# First install necessary libraries
# Exclamation marks for shell commands
! pip install transformers datasets evaluate scikit-learn
! pip install accelerate -U

In [39]:
import torch

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print(f'device: {device}')

device: cuda:0


We will be using the SST2 dataset from the Hugging Face Datasets library:

In [40]:
from datasets import load_dataset

sst2 = load_dataset('sst2')

The dataset is separated into three sections: "train," "validation," and "test". We'll use the data in the "train" section for training, and you'll use the data in the "validation" section to evaluate your model. (The "test" data is labeled differently, so we will be using it to predict and check our model performance.)

There are two fields in this dataset:

- `text`: the review text.
- `label`: a value that is either `0` for a negative review or `1` for a positive sentiment.

In [41]:
sst2['train'].num_rows

67349

Then take a look at an example:

In [42]:
sst2['train'][1]
#sst2['train'][1]

{'idx': 1, 'sentence': 'contains no wit , only labored gags ', 'label': 0}

### 2. Preprocess

The next step is to load a tokenizer to preprocess the `text` field.
A tokenizer converts text to a sequence of tokens, creating a numerical representation of the text.

In [43]:
from transformers import AutoTokenizer

distilbert_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
bert_cased_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
#T5_base_tokenizer = AutoTokenizer.from_pretrained('t5-base')

text = 'Hello everyone! ! antidisestablishmentarianism'

for tokenizer in [distilbert_tokenizer, bert_cased_tokenizer, roberta_tokenizer]:
  print(f'\n\n{tokenizer.name_or_path}')
  vocab = {v: k for k, v in tokenizer.vocab.items()}
  tokenized_text = tokenizer(text)
  print([vocab[id] for id in tokenized_text['input_ids']])

tokenizer = roberta_tokenizer



distilbert-base-uncased
['[CLS]', 'hello', 'everyone', '!', '!', 'anti', '##dis', '##est', '##ab', '##lish', '##ment', '##arian', '##ism', '[SEP]']


bert-base-cased
['[CLS]', 'Hello', 'everyone', '!', '!', 'anti', '##dis', '##esta', '##b', '##lish', '##ment', '##arian', '##ism', '[SEP]']


roberta-base
['<s>', 'Hello', 'Ġeveryone', '!', 'Ġ!', 'Ġant', 'idis', 'establishment', 'arian', 'ism', '</s>']


Here we will be using Roberta tokenizer, as going forward that will be our model of choice. It is necessary to have the same tokenizer as our model would expect the input to be in a certain way, Using the same tokenizer, helps in that regard. Creating a preprocessing function to tokenize `text`. we will specify how to deal with varying input lengths here using the max_length, truncation, and/or padding arguments. (Default is to not truncate or pad. Max length is determined by model.)

https://huggingface.co/docs/transformers/pad_truncation

In [44]:
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True)

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. We will speed up `map` by setting `batched=True` to process multiple elements of the dataset at once.

In [45]:
tokenized_sst2 = sst2.map(preprocess_function, batched=True)

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Now we will create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximium length in the tokenzation process.

In [46]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

### 3. Create evaluation method

Including a metric during training is often helpful for evaluating our model's performance (otherwise, it just prints the loss). Therefore, we will load a evaluation method with the Hugging Face [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, we will load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric:

In [47]:
import evaluate

# Proportion of correct predictions among the total number of cases processed
accuracy = evaluate.load('accuracy')

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy:

In [48]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Our `compute_metrics` function is set up. Now,  -- we'll need it later when you set up our training.

### 4. Train

Before we start training your model, we have to create a map of the expected ids to their labels with `id2label` and `label2id`.

In [49]:
labels = ['NEGATIVE', 'POSITIVE']
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in id2label.items()}

print('id2label:', id2label)
print('label2id:', label2id)

id2label: {0: 'NEGATIVE', 1: 'POSITIVE'}
label2id: {'NEGATIVE': 0, 'POSITIVE': 1}


Next, we will be using the Trainer class, which is wrapper code that abstracts away the details of training and evaluation. It is optimized for training Hugging Face Transformers and makes it easier for us to train models without writing much code.

[Here is where I learnt about Transformers](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer)<br/>
[Another Transformers tutorial that helped](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)

In [50]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# This automodel class gives us the model with pretrained weights + a sequence classification head
# We specify how many labels we need so that the model has the correct number of outputs
# We specify id2label/label2id so that the model understands the label associated with each output
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(labels), id2label=id2label, label2id=label2id
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Defining our training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the accuracy and save the training checkpoint.
2. Passing the training arguments to our [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call the [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune our model.

In [51]:
# https://huggingface.co/transformers/v4.4.2/main_classes/trainer.html#trainingarguments
training_args = TrainingArguments(
    output_dir='Ritwickban/Roberta_SST2',
    learning_rate=2e-5,
    per_device_train_batch_size=49,
    per_device_eval_batch_size=49,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)


In [52]:
tokenized_sst2['validation']

Dataset({
    features: ['idx', 'sentence', 'label', 'input_ids', 'attention_mask'],
    num_rows: 872
})

In [53]:
import time as time
import math
from tqdm import tqdm

Here, we are importing the required libraries for our model to run.

In [54]:
!pip install wandb



In [55]:
import wandb
wandb.login()



True

Importing the library Weights and Biases, to log the metrics for our model training and evaluation parts of our model

In [56]:
class MyTrainer(Trainer):
  def _inner_training_loop(self, batch_size=None, args=None, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None):
        number_of_epochs = args.num_train_epochs
        #number_of_train_batches = sst2['train'].num_rows/args.per_device_train_batch_size
        start = time.time()
        train_acc=[]
        eval_acc=[]
        criterion = torch.nn.CrossEntropyLoss().to(device)
        self.optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
        self.scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer , 1, gamma=0.9)
        train_dataloader = self.get_train_dataloader()
        eval_dataloader = self.get_eval_dataloader()
        # train
        wandb.init(project="sst2", name="ritwick",
        config=
        {"learning_rate":2e-5 ,
      "architecture": "roberta-base",
      "dataset": "SST2",
      "epochs": 2})
        for epoch in range(number_of_epochs):
            train_loss_per_epoch = 0
            train_acc_per_epoch = 0
            with tqdm(train_dataloader , unit="batch") as training_epoch:
                training_epoch.set_description(f"Training Epoch {epoch}")
                for step, inputs in enumerate(training_epoch):
                    inputs = inputs.to(device)
                    labels = inputs['labels']
                    # forward pass
                    self.optimizer.zero_grad()
                    output = model(**inputs) # TODO Implement by yourself
                    # get the loss
                    loss = criterion(output['logits'], labels) # TODO Implement by yourself
                    train_loss_per_epoch += loss.item()
                    #calculate gradients
                    loss.backward()
                    #update weights
                    self.optimizer.step()
                    train_acc_per_epoch += (output['logits'].argmax(1) == labels).sum().item()
                    wandb.log({"train_acc": train_acc_per_epoch, "Train_loss": train_loss_per_epoch})
                    # adjust the learning rate
            self.scheduler.step()
            train_loss_per_epoch /= len(train_dataloader)
            train_acc_per_epoch /= (len(train_dataloader)*batch_size)
            #wandb.log({"train_acc": train_acc_per_epoch, "Train_loss": train_loss_per_epoch})

            eval_loss_per_epoch = 0
            eval_acc_per_epoch = 0
            # evaluate on validation set
            with torch.no_grad():
                with tqdm(eval_dataloader , unit="batch") as eval_epoch:
                    eval_epoch.set_description(f"Evaluation Epoch {epoch}")
                    for e_step, e_inputs in enumerate(eval_epoch):
                        e_inputs = e_inputs.to(device)
                        e_labels = e_inputs['labels']
                        e_output = model(**e_inputs)
                        loss = criterion(e_output['logits'], e_labels)
                        eval_loss_per_epoch += loss.item()
                        eval_acc_per_epoch += (e_output['logits'].argmax(1) == e_labels).sum().item()
                        wandb.log({"Eval_acc": eval_acc_per_epoch, "Eval_loss": eval_loss_per_epoch})
            eval_loss_per_epoch /= len(eval_dataloader)
            eval_acc_per_epoch /= (len(eval_dataloader)*batch_size)
            wandb.log({"Eval_acc": eval_acc_per_epoch, "Eval_loss": eval_loss_per_epoch})

            print(f'\tTrain Loss: {train_loss_per_epoch:.3f} | Train Acc: {train_acc_per_epoch*100:.2f}%')
            print(f'\tEval Loss: {eval_loss_per_epoch:.3f} | Eval Acc: {eval_acc_per_epoch*100:.2f}%')

        print(f'Time: {(time.time()-start)/60:.3f} minutes')

Defining the custom trainer, to modify the inner training loop based on our requirements. and inheriting the rest of the trainer() class from Hugging face library. Then leveraging the train() from the trainer class to train our model.

In [57]:
mytrainer = MyTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_sst2['train'],
    eval_dataset=tokenized_sst2['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [58]:
mytrainer.train()

VBox(children=(Label(value='0.002 MB of 0.002 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
Eval_acc,▁▂▂▃▃▃▄▄▅▅▅▆▆▆▇▇██▁▁▂▂▃▃▃▄▄▅▅▅▆▆▇▇▇██▁
Eval_loss,▁▂▂▃▃▃▄▄▄▄▅▅▅▆▆▇▇▇▁▁▂▂▃▃▄▄▄▄▄▅▅▆▆▆▇██▁
Train_loss,▁▂▂▃▃▃▄▄▅▅▅▆▆▆▇▇▇▇██▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄
train_acc,▁▁▂▂▂▃▃▃▄▄▄▅▅▆▆▆▇▇▇█▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▆▇▇▇█

0,1
Eval_acc,0.92517
Eval_loss,0.18058
Train_loss,127.85197
train_acc,65163.0


Training Epoch 0:   0%|          | 0/1375 [00:00<?, ?batch/s]You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Training Epoch 0: 100%|██████████| 1375/1375 [08:07<00:00,  2.82batch/s]
Evaluation Epoch 0: 100%|██████████| 18/18 [00:02<00:00,  7.24batch/s]


	Train Loss: 0.197 | Train Acc: 92.39%
	Eval Loss: 0.162 | Eval Acc: 93.31%


Training Epoch 1: 100%|██████████| 1375/1375 [08:05<00:00,  2.83batch/s]
Evaluation Epoch 1: 100%|██████████| 18/18 [00:02<00:00,  7.23batch/s]

	Train Loss: 0.094 | Train Acc: 96.74%
	Eval Loss: 0.181 | Eval Acc: 92.63%
Time: 16.423 minutes





In [59]:
from transformers import AutoModelForSequenceClassification

model = model
evaluation_results_trainer = mytrainer.evaluate(tokenized_sst2["validation"])
evaluation_results_trainer

{'eval_loss': 0.1812105029821396,
 'eval_accuracy': 0.9369266055045872,
 'eval_runtime': 2.5086,
 'eval_samples_per_second': 347.609,
 'eval_steps_per_second': 7.175}

In [61]:
count=0
i=0
model_name="Roberta_SST2_Ritwick"
tokenizer = tokenizer
model = model
validation=tokenized_sst2['validation']
while count!=10:
  text=validation['sentence'][i]
  inputs = tokenizer(text,return_tensors="pt")
  with torch.no_grad():
    inputs = inputs.to(device)
    logits = model(**inputs).logits
  predicted_class_id = logits.argmax().item()
  if predicted_class_id!=validation['label'][i]:
    print(text)
    print('Confidence score:',torch.nn.functional.softmax(logits,dim=1))
    print('Predict:',model.config.id2label[predicted_class_id],"->Actual:",model.config.id2label[validation['label'][i]])
    count+=1
  i+=1

we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity . 
Confidence score: tensor([[0.6872, 0.3128]], device='cuda:0')
Predict: NEGATIVE ->Actual: POSITIVE
holden caulfield did it better . 
Confidence score: tensor([[0.4957, 0.5043]], device='cuda:0')
Predict: POSITIVE ->Actual: NEGATIVE
the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . 
Confidence score: tensor([[0.3453, 0.6547]], device='cuda:0')
Predict: POSITIVE ->Actual: NEGATIVE
fresnadillo 's dark and jolting images have a way of plying into your subconscious like the nightmare you had a week ago that wo n't go away . 
Confidence score: tensor([[0.6718, 0.3282]], device='cuda:0')
Predict: NEGATIVE ->Actual: POSITIVE
you wo n't like roger , but you will quickly recognize him . 
Confidence score: tensor([[0.0042, 0.9958]], device='cuda:0')
Predict: POSITIVE ->Actual: NEGATIVE
if steven soderbergh 's ` solaris ' is a failure it is a glorious failure

For a more in-depth example of how to finetune a model for text classification, I refererred the following sources:</br>
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb).

## Inference

Great, now that we've finetuned a model, we can use it for inference!

We can use the test dataset to run inference on our model and training:

In [62]:
text=sst2['test']['sentence']

In [63]:
import pandas as pd

In [64]:
df=pd.DataFrame(sst2['test'],columns=['sentence','label'])
df.head()

Unnamed: 0,sentence,label
0,uneasy mishmash of styles and genres .,-1
1,this film 's relationship to actual tension is...,-1
2,"by the end of no such thing the audience , lik...",-1
3,director rob marshall went out gunning to make...,-1
4,lathan and diggs have considerable personal ch...,-1


To try out your finetuned model for inference lets use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiating a `pipeline` for sentiment analysis with your model, and pass our text to it:

In [65]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=device)
#classifier(text)

In [66]:
df['prediction']=classifier(text)
df.head()

Unnamed: 0,sentence,label,prediction
0,uneasy mishmash of styles and genres .,-1,"{'label': 'NEGATIVE', 'score': 0.9868481755256..."
1,this film 's relationship to actual tension is...,-1,"{'label': 'NEGATIVE', 'score': 0.9963800311088..."
2,"by the end of no such thing the audience , lik...",-1,"{'label': 'POSITIVE', 'score': 0.8997088074684..."
3,director rob marshall went out gunning to make...,-1,"{'label': 'POSITIVE', 'score': 0.9586588144302..."
4,lathan and diggs have considerable personal ch...,-1,"{'label': 'POSITIVE', 'score': 0.9986604452133..."


In [67]:
def parse_prediction(prediction):
    sentiment = prediction['label']
    score = prediction['score']
    return sentiment, score

# Apply the function to the prediction column and assign results to new columns
df[['Sentiment', 'Score']] = df['prediction'].apply(lambda x: pd.Series(parse_prediction(x)))


In [None]:
df.head()

In [68]:
df['label'].unique()

array([-1])

In [69]:
from google.colab import files
df.to_csv("Annotated.csv")
files.download("Annotated.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Here, lets have the predictions on the basis of the sentences in our test dataset, and then append it to our test dataset, and have the predictions along with it, hence we can annotate and figure out errors our model is making

## Here are few resources that helped in completing this project!
### Model/Dateset Cards in Huggingface (Documentation)

Markdown files with information on how to use the model/dataset and other relevant data (metadata, potential limitations, etc.)

Looking for models/datasets to use:<br/>
https://huggingface.co/models<br/>
https://huggingface.co/datasets

More information:<br/>
https://huggingface.co/docs/hub/model-cards<br/>
https://huggingface.co/docs/hub/datasets-cards

Templates:<br/>
https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md<br/>
https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md
