<a href="https://colab.research.google.com/github/ritwickban/Sentiment-Classification/blob/main/Sentiment_Classification_on_SST2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification on SST2 data set
This notebook is taken from the Huggingface website (https://huggingface.co/docs/transformers/tasks/sequence_classification), modified for CSCI5541 and used for as a template for the homework assignment. Today we will be:
1. Finetuning [Roberta](https://huggingface.co/distilbert-base-uncased) on the [SST2](https://huggingface.co/datasets/imdb) dataset to determine whether a given text is positive or negative.
2. Using the finetuned model for inference.

## Pre-Training and Fine-Tuning
Pre-training:
- Pre-training is the process of training a neural network on a large dataset **before** fine-tuning it for a specific task, i.e. training "from scratch."
- Pre-training allows the network to learn general linguistic features and representations that can be useful for many different tasks

Fine-Tuning:
- Fine-tuning is the process of training a pre-trained model on a new (and almost always smaller) dataset/task
- Improves task-specific performance because the model learns to specialize
- Saves time and computing resources since the model doesn't need to learn everything from scratch

<a href="https://aclanthology.org/N19-1423.pdf">Image source</a>

## Let's start fine-tuning!
### 1. Load dataset

In [21]:
%%capture
# Use this only after you check everything is being loaded properly

# First install necessary libraries
# Exclamation marks for shell commands
! pip install transformers datasets evaluate scikit-learn
! pip install accelerate -U

In [22]:
import torch

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print(f'device: {device}')

device: cuda:0


We will be using the IMDb (Internet Movie Database) dataset from the 🤗 Datasets library:

In [5]:
from datasets import load_dataset

sst2 = load_dataset('sst2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

The dataset is separated into three sections: "train," "test," and "unsupervised." You'll use the data in the "train" section for training, and you'll use the data in the "test" section to evaluate your model. (The "unsupervised" data is unlabeled, so we will not be using it.)

There are two fields in this dataset:

- `text`: the movie review text.
- `label`: a value that is either `0` for a negative review or `1` for a positive review.

In [6]:
sst2

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})

Then take a look at an example:

In [7]:
sst2['train'][1]
#sst2['train'][1]

{'idx': 1, 'sentence': 'contains no wit , only labored gags ', 'label': 0}

### 2. Preprocess

The next step is to load a tokenizer to preprocess the `text` field.
A tokenizer converts text to a sequence of tokens, creating a numerical representation of the text.
Notice how there are multiple ways to tokenize text. Make sure to use the right tokenizer for your model.

In [8]:
from transformers import AutoTokenizer

distilbert_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
bert_cased_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
#T5_base_tokenizer = AutoTokenizer.from_pretrained('t5-base')

text = 'Hello everyone! ! antidisestablishmentarianism'

for tokenizer in [distilbert_tokenizer, bert_cased_tokenizer, roberta_tokenizer]:
  print(f'\n\n{tokenizer.name_or_path}')
  vocab = {v: k for k, v in tokenizer.vocab.items()}
  tokenized_text = tokenizer(text)
  print([vocab[id] for id in tokenized_text['input_ids']])

tokenizer = roberta_tokenizer

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



distilbert-base-uncased
['[CLS]', 'hello', 'everyone', '!', '!', 'anti', '##dis', '##est', '##ab', '##lish', '##ment', '##arian', '##ism', '[SEP]']


bert-base-cased
['[CLS]', 'Hello', 'everyone', '!', '!', 'anti', '##dis', '##esta', '##b', '##lish', '##ment', '##arian', '##ism', '[SEP]']


roberta-base
['<s>', 'Hello', 'Ġeveryone', '!', 'Ġ!', 'Ġant', 'idis', 'establishment', 'arian', 'ism', '</s>']


Create a preprocessing function to tokenize `text`. You can specify how to deal with varying input lengths here using the max_length, truncation, and/or padding arguments. (Default is to not truncate or pad. Max length is determined by model.)

https://huggingface.co/docs/transformers/pad_truncation

In [9]:
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True)

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once.

In [10]:
tokenized_sst2 = sst2.map(preprocess_function, batched=True)

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]

Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximium length in the tokenzation process.

In [11]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

### 3. Create evaluation method

Including a metric during training is often helpful for evaluating your model's performance (otherwise, it just prints the loss). You can quickly load a evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [12]:
import evaluate

# Proportion of correct predictions among the total number of cases processed
accuracy = evaluate.load('accuracy')

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy:

In [13]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your `compute_metrics` function is ready to go now -- you'll need it later when you set up your training.

### 4. Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`.

In [14]:
labels = ['NEGATIVE', 'POSITIVE']
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in id2label.items()}

print('id2label:', id2label)
print('label2id:', label2id)

id2label: {0: 'NEGATIVE', 1: 'POSITIVE'}
label2id: {'NEGATIVE': 0, 'POSITIVE': 1}


Next, we will be using the Trainer class, which is wrapper code that abstracts away the details of training and evaluation. It is optimized for training 🤗 Transformers and makes it easier for us to train models without writing much code.

[More info on Trainers](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer)<br/>
[Trainer Tutorial](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)

In [15]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# This automodel class gives us the model with pretrained weights + a sequence classification head
# We specify how many labels we need so that the model has the correct number of outputs
# We specify id2label/label2id so that the model understands the label associated with each output
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(labels), id2label=id2label, label2id=label2id
)

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model (may take some time).

In [16]:
# https://huggingface.co/transformers/v4.4.2/main_classes/trainer.html#trainingarguments
training_args = TrainingArguments(
    output_dir='Ritwick_sst2_roberta_base',
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)


In [17]:
tokenized_sst2['validation']

Dataset({
    features: ['idx', 'sentence', 'label', 'input_ids', 'attention_mask'],
    num_rows: 872
})

In [18]:
import time as time
import math
from tqdm import tqdm

In [25]:
class MyTrainer(Trainer):
  def _inner_training_loop(self, batch_size=None, args=None, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None):
        number_of_epochs = args.num_train_epochs
        start = time.time()
        train_acc=[]
        eval_acc=[]
        criterion = torch.nn.CrossEntropyLoss().to(device)
        self.optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
        self.scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer , 1, gamma=0.9)
        train_dataloader = self.get_train_dataloader()
        eval_dataloader = self.get_eval_dataloader()
        max_steps = math.ceil(args.num_train_epochs * len(train_dataloader))
        for epoch in range(number_of_epochs):
            train_loss_per_epoch = 0
            train_acc_per_epoch = 0
            with tqdm(train_dataloader , unit="batch") as training_epoch:
                training_epoch.set_description(f"Training Epoch {epoch}")
                for step, inputs in enumerate(training_epoch):
                    inputs = inputs.to(device)
                    labels = inputs['labels']
                    # forward pass
                    self.optimizer.zero_grad()
                    output = model(**inputs) # TODO Implement by yourself
                    # get the loss
                    loss = criterion(output['logits'], labels) # TODO Implement by yourself
                    train_loss_per_epoch += loss.item()
                    #calculate gradients
                    loss.backward()
                    #update weights
                    self.optimizer.step()
                    train_acc_per_epoch += (output['logits'].argmax(1) == labels).sum().item()
                    # adjust the learning rate
            self.scheduler.step()
            train_loss_per_epoch /= len(train_dataloader)
            train_acc_per_epoch /= (len(train_dataloader)*batch_size)

            eval_loss_per_epoch = 0
            eval_acc_per_epoch = 0
            # evaluate on validation set
            with torch.no_grad():
                with tqdm(eval_dataloader , unit="batch") as eval_epoch:
                    eval_epoch.set_description(f"Evaluation Epoch {epoch}")
                    for e_step, e_inputs in enumerate(eval_epoch):
                        e_inputs = e_inputs.to(device)
                        e_labels = e_inputs['labels']
                        e_output = model(**e_inputs)
                        loss = criterion(e_output['logits'], e_labels)
                        eval_loss_per_epoch += loss.item()
                        eval_acc_per_epoch += (e_output['logits'].argmax(1) == e_labels).sum().item()

            eval_loss_per_epoch /= len(eval_dataloader)
            eval_acc_per_epoch /= (len(eval_dataloader)*batch_size)

            print(f'\tTrain Loss: {train_loss_per_epoch:.3f} | Train Acc: {train_acc_per_epoch*100:.2f}%')
            print(f'\tEval Loss: {eval_loss_per_epoch:.3f} | Eval Acc: {eval_acc_per_epoch*100:.2f}%')

        print(f'Time: {(time.time()-start)/60:.3f} minutes')

In [26]:
mytrainer = MyTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_sst2['train'],
    eval_dataset=tokenized_sst2['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [27]:
mytrainer.train()

Training Epoch 0: 100%|██████████| 2105/2105 [08:01<00:00,  4.38batch/s]
Evaluation Epoch 0: 100%|██████████| 28/28 [00:02<00:00, 12.10batch/s]


	Train Loss: 0.043 | Train Acc: 98.54%
	Eval Loss: 0.210 | Eval Acc: 90.62%


Training Epoch 1: 100%|██████████| 2105/2105 [08:01<00:00,  4.37batch/s]
Evaluation Epoch 1: 100%|██████████| 28/28 [00:02<00:00, 12.11batch/s]

	Train Loss: 0.031 | Train Acc: 98.94%
	Eval Loss: 0.245 | Eval Acc: 90.85%
Time: 16.115 minutes





<Tip>

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) applies dynamic padding by default when you pass `tokenizer` to it. In this case, you actually don't need to specify a data collator explicitly because we're not using any special data collation logic.


<Tip>

For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb).

</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

In [28]:
text=sst2['test']['sentence']

In [29]:
import pandas as pd

In [30]:
df=pd.DataFrame(sst2['test'],columns=['sentence','label'])
df.head()

Unnamed: 0,sentence,label
0,uneasy mishmash of styles and genres .,-1
1,this film 's relationship to actual tension is...,-1
2,"by the end of no such thing the audience , lik...",-1
3,director rob marshall went out gunning to make...,-1
4,lathan and diggs have considerable personal ch...,-1


The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for sentiment analysis with your model, and pass your text to it:

In [31]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=device)
#classifier(text)

In [32]:
df['prediction']=classifier(text)
df.head()

Unnamed: 0,sentence,label,prediction
0,uneasy mishmash of styles and genres .,-1,"{'label': 'NEGATIVE', 'score': 0.9983795881271..."
1,this film 's relationship to actual tension is...,-1,"{'label': 'NEGATIVE', 'score': 0.9974343180656..."
2,"by the end of no such thing the audience , lik...",-1,"{'label': 'POSITIVE', 'score': 0.9701967239379..."
3,director rob marshall went out gunning to make...,-1,"{'label': 'NEGATIVE', 'score': 0.6235690116882..."
4,lathan and diggs have considerable personal ch...,-1,"{'label': 'POSITIVE', 'score': 0.9991191029548..."


In [33]:
def parse_prediction(prediction):
    sentiment = prediction['label']
    score = prediction['score']
    return sentiment, score

# Apply the function to the prediction column and assign results to new columns
df[['Sentiment', 'Score']] = df['prediction'].apply(lambda x: pd.Series(parse_prediction(x)))


You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [34]:
df.head()

Unnamed: 0,sentence,label,prediction,Sentiment,Score
0,uneasy mishmash of styles and genres .,-1,"{'label': 'NEGATIVE', 'score': 0.9983795881271...",NEGATIVE,0.99838
1,this film 's relationship to actual tension is...,-1,"{'label': 'NEGATIVE', 'score': 0.9974343180656...",NEGATIVE,0.997434
2,"by the end of no such thing the audience , lik...",-1,"{'label': 'POSITIVE', 'score': 0.9701967239379...",POSITIVE,0.970197
3,director rob marshall went out gunning to make...,-1,"{'label': 'NEGATIVE', 'score': 0.6235690116882...",NEGATIVE,0.623569
4,lathan and diggs have considerable personal ch...,-1,"{'label': 'POSITIVE', 'score': 0.9991191029548...",POSITIVE,0.999119


In [75]:
inputs = tokenizer(text, return_tensors='pt',padding=True).to(device)  # 'pt' means your tokenizer will return a pytorch tensor

Pass your inputs to the model and return the `logits`:

In [76]:
from transformers import AutoModelForSequenceClassification

with torch.no_grad():
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [79]:
predicted_class_id = logits.argmax().item() # is the first output or second output bigger? get ID of bigger output
try:
    predicted_label = model.config.id2label[predicted_class_id]
except KeyError:
    print("The predicted class ID is not in the id2label dictionary.")

The predicted class ID is not in the id2label dictionary.


## Model/Dateset Cards in Huggingface (Documentation)

Markdown files with information on how to use the model/dataset and other relevant data (metadata, potential limitations, etc.)

Looking for models/datasets to use:<br/>
https://huggingface.co/models<br/>
https://huggingface.co/datasets

More information:<br/>
https://huggingface.co/docs/hub/model-cards<br/>
https://huggingface.co/docs/hub/datasets-cards

Templates:<br/>
https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md<br/>
https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md
