#  Assignment 2 - Transfer Learning and Data Augmentation 💬

Welcome to the **second assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **Tianhao Dai**
> - ✉️ Email: **tianhao.dai@epfl.ch**
> - 🪪 SCIPER: **369945**

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;color:#424242;">

## **Assignment Description**
- In the first part of this assignment, you will need to implement training (finetuning) and evaluation of a pre-trained language model ([RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)) on a **Sentiment Analysis (SA)** task, which aims to determine whether a product review's emotional tone is positive or negative.

- For part-2, following the first finetuning task, you will need to identify the shortcuts (i.e. some salient or toxic features) that the model learnt for the specific task.

- For part-3, you are supposed to annotate 80 randomly assigned new datapoints as ground-truth labels. Additionally, the cross annotation should be conducted by another one or two annotators, and you will learn about how to calculate the agreement statistics as a significant characteristic reflecting the quality of a collected dataset.

- For part-4, since the human annotation is quite time- and effort-consuming, there are plenty of ways to get silver-labels from automatic labeling to augment the dataset scale, e.g., paraphrasing each text input in different words without changing its meaning. You will use a [T5](https://huggingface.co/docs/transformers/en/model_doc/t5) paraphrase model to expand the training data of sentiment analysis, and evaluate the improvement of data augmentation.

For Part 1 and Part 2, you will need to complete the code in the corresponding `.py` files (`sa.py` for Part 1, `shortcut.py` for Part 2). You will be provided with the function descriptions and detailed instructions about the code snippets you need to write.


### Table of Contents
- **PART 1: Sentiment Analysis (33 pts)**
    - 1.1 Dataset Processing (10 pts)
    - 1.2 Model Training and Evaluation (18 pts)
    - 1.3 Fine-Grained Validation (5 pts)
- **PART 2: Identify Model Shortcuts (22 pts)**
    - 2.1 N-gram Pattern Extraction (6 pts)
    - 2.2 Distill Potentially Useful Patterns (8 pts)
    - 2.3 Case Study (8 pts)
- **PART 3: Annotate New Data (25 pts)**
    - 3.1 Write an Annotation Guideline (5 pts)
    - 3.2 Annotate Your Datapoints with Partner(s) (8 pts)
    - 3.3 Agreement Measure (12 pts)
- **PART 4: Data Augmentation (20 pts)**
    - 4.1 Data Augmentation with Paraphrasing (15 pts)
    - 4.2 Retrain RoBERTa Model with Data Augmentation (5 pts)
    
### Deliverables

- ✅ This jupyter notebook: `assignment2.ipynb`
- ✅ `sa.py` and `shortcut.py` files
- ✅ Checkpoints for RoBERTa models finetuned on original and augmented SA training data (Part 1 and Part 4), including:
    - `models/lr1e-05-warmup0.3/`
    - `models/lr2e-05-warmup0.3/`
    - `models/augmented/lr1e-05-warmup0.3/`
- ✅ Model prediction results on each domain data (Part 1.3 Fine-Grained Validation): `predictions/`
- ✅ Cross-annotated new SA data (Part 3), including:
    - `data/<your_assigned_dataset_id>-<your_sciper_number>.jsonl`
    - `data/<your_assigned_dataset_id>-<your_partner_sciper_number>.jsonl`
    - (for group of 3) `data/<your_assigned_dataset_id>-<your_second_partner_sciper_number>.jsonl`
- ✅ Paraphrase-augmented SA training data (Part 4), including:
    - `data/augmented_train_sa.jsonl`
- ✅ `./tensorboard` directory with logs for all trained/finetuned models, including:
    - `tensorboard/part1_lr1e-05/`
    - `tensorboard/part1_lr2e-05/`
    - `tensorboard/part4_lr1e-05/`

### How to implement this assignment

Please read the following points carefully. All the information on how to read, implement and submit your assignment is explained in detail below:

1. For this assignment, you will need to implement and fill in the missing code snippets for both the **Jupyter Notebook `assignment2.ipynb`** and the **`sa.py`**, **`shortcut.py`** python files.

2. Along with the above files, you additionally need to upload the model files under the **`models/`** dir for the following models:
    - finetuned RoBERTa models on original SA training data (PART 1)  
    - finetuned RoBERTa model on augmented SA training data (PART 4)

3. You also need to upload model prediction results in Part 1.3 Fine-Grained Validation, saved in **`predictions/`**.

4. You also need to upload new data files under the **`data/`** dir (along with our already provided data), including:
    - new SA data with your and your partner's annotations (Part 3)
    - paraphrase-augmented SA training data (Part 4)

5. Finally, you will need to log your training using Tensorboard. Please follow the instructions in the `README.md` of the **``tensorboard/``** directory.

**Note**: Large files such as model checkpoints and logs should be pushed to the repository with Git LFS. Training the models on a GPU can also speed up the process; we recommend using Colab's free GPU service for this. A tutorial on how to use Git LFS and Colab can be found [here](https://github.com/epfl-nlp/cs-552-modern-nlp/blob/main/Exercises/tutorials.md).
    
</div>

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Environment Setup**

### **Option 1: creating your own environment**

```
conda create --name mnlp-a2 python=3.10
conda activate mnlp-a2
pip install -r requirements.txt
```

**Note**: If some package versions in our suggested environment do not work, feel free to try other package versions suitable for your computer, but remember to update ``requirements.txt`` and explain the environment changes in your notebook (no penalty for this if necessary).

### **Option 2: using Google Colab**
If you are using a Google Colab notebook for this assignment, you will need to run a few commands to set up the environment on Google Colab, as shown below:
    
</div>

In [1]:
# This cell makes sure modules are auto-loaded when you change external python files
%load_ext autoreload
%autoreload 2

In [2]:
# If you are working in Colab, then consider mounting your assignment folder to your drive
from google.colab import drive
drive.mount('/content/drive')

# Direct to your assignment folder.
%cd /content/drive/MyDrive/a2-2024-thdai2000

Mounted at /content/drive
/content/drive/MyDrive/a2-2024-thdai2000


Install packages that are not included in the Colab base environment:

In [3]:
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0" # limiting to one GPU

# Install dependencies
!pip install -r requirements.txt

Collecting torch==2.1.0 (from -r requirements.txt (line 1))
  Downloading torch-2.1.0-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 670.2/670.2 MB 1.5 MB/s eta 0:00:00
Collecting transformers==4.38.1 (from -r requirements.txt (line 2))
  Downloading transformers-4.38.1-py3-none-any.whl (8.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.5/8.5 MB 3.0 MB/s eta 0:00:00
Collecting jsonlines==4.0.0 (from -r requirements.txt (line 12))
  Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.1.0->-r requirements.txt (line 1))
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 2.1 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.1.0->-

In [4]:
import numpy as np
import jsonlines
import random
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# TODO: Enter your Sciper number
SCIPER = '369945'
seed = int(SCIPER)
torch.backends.cudnn.deterministic = True

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x7bc38048bf30>

In [5]:
# Check the availability of a GPU (proceed only if it prints 'Good to go!')
if torch.cuda.is_available():
  print('Good to go!')
else:
  print('Please set GPU via Edit -> Notebook Settings.')

Good to go!


<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">
    
# PART 1: Sentiment Analysis (33 pts)

In this part, we will finetune a pretrained language model (RoBERTa) on a sentiment analysis (SA) task.

> Specifically, we will focus on a binary sentiment classification task for multi-domain product reviews. It requires the model to **classify a given paragraph of review by its sentiment polarity (positive or negative)**.

</div>

### Load Training Dataset (`train_sa.jsonl`)

**You can run the following cell to take a first look at your data**. Each data sample is a Python dictionary, which consists of the following components:
- input review (*'review'*): a natural language sentence or a paragraph commenting about a product.
- domain (*'domain'*): describing the type of product being reviewed.
- label of sentiment (*'label'*): indicating whether the review states positive or negative views about the product.
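As a rough illustration of this record layout, the sketch below parses a few JSON lines with the standard library and tallies labels per domain. The three records are made-up placeholders that only mimic the fields listed above (they are not taken from `train_sa.jsonl`), and `count_labels_by_domain` is a hypothetical helper name.

```python
import io
import json
from collections import Counter

# Made-up records that only mimic the field layout described above.
toy_jsonl = io.StringIO(
    '{"review": "Works well.", "domain": "electronics", "label": "positive"}\n'
    '{"review": "Broke in a week.", "domain": "electronics", "label": "negative"}\n'
    '{"review": "A dull read.", "domain": "books", "label": "negative"}\n'
)


def count_labels_by_domain(lines):
    """Count (domain, label) pairs in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        counts[(sample["domain"], sample["label"])] += 1
    return counts


label_counts = count_labels_by_domain(toy_jsonl)
for (domain, label), count in sorted(label_counts.items()):
    print(domain, label, count)
```

A tally like this is a quick way to check whether the domains and labels are balanced before training.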

In [6]:
data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
with jsonlines.open(data_train_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        if sid % 200 == 0:
            print(sample)

{'review': "THis book was horrible.  If it was possible to rate it lower than one star i would have.  I am an avid reader and picked this book up after my mom had gotten it from a friend.  I read half of it, suffering from a headache the entire time, and then got to the part about the relationship the 13 year old boy had with a 33 year old man and i lit this book on fire.  One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes.  THis book wasted my life", 'domain': 'books', 'label': 'negative'}
{'review': 'Sphere by Michael Crichton is an excellant novel. This was certainly the hardest to put down of all of the Crichton novels that I have read. The story revolves around a man named Norman Johnson. Johnson is a phycologist. He travels with 4 other civilans to a remote location in the Pacific Ocean to help the Navy in a top secret misssion. They quickly learn that under the ocean is a half mile long sp

In [7]:
# We use the following pretrained tokenizer and model
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 🎯 Q1.1: **Dataset Processing (10 pts)**

Our first step is to construct a PyTorch Dataset for the SA task. Specifically, we will need to implement **tokenization** and **padding** using a HuggingFace pre-trained tokenizer.
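To illustrate what padding and truncation do to a tokenized review, here is a minimal pure-Python sketch. It is not the `SADataset` implementation: `pad_or_truncate` is a hypothetical helper, the token ids are invented, and `pad_id` is a placeholder that would in practice come from `tokenizer.pad_token_id`. A real tokenizer also keeps its special start and end tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)         # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad    # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Invented token ids; pad_id=1 is only a placeholder value.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=1)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=4, pad_id=1)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, so it should be passed to the model together with the input ids.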

**TODO🔻: Complete `SADataset` class following the instructions in `sa.py`, and test by running the following cell.**

In [8]:
from sa import SADataset
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
dataset = SADataset("data/train_sa.jsonl", tokenizer)

Building SA Dataset...


0it [00:00, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (725 > 512). Running this sequence through the model will result in indexing errors
1600it [00:06, 259.26it/s]


In [9]:
from testA2 import test_SADataset
test_SADataset(dataset)

SADataset test correct ✅


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 🎯 Q1.2: **Model Training and Evaluation (18 pts)**

Next, we will implement the training and evaluation process to finetune the model.

- For training: you will need to calculate the **loss** and update the model weights using the **Adam optimizer**. Additionally, we add a **learning rate scheduler** to adapt the learning rate over the whole training process.

- For evaluation: you will need to compute the **confusion matrix** and **F1 scores** to assess the model performance.
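As a sketch of the evaluation metrics, the code below derives per-class F1 and Macro-F1 from a confusion matrix stored as nested lists. `f1_per_class` is a hypothetical helper and not the `compute_metrics()` required in `sa.py`; the matrix is the epoch-1 validation result reported later in this notebook for `learning_rate = 1e-5`. The formula gives the same F1 whether rows hold the true labels or the predictions, since false positives and false negatives enter symmetrically.

```python
def f1_per_class(confusion):
    """Per-class F1 for a square confusion matrix given as nested lists."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        row_total = sum(confusion[k])
        col_total = sum(confusion[r][k] for r in range(n))
        denom = row_total + col_total  # equals 2*TP + FP + FN
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


confusion = [[2996, 204], [458, 2742]]
f1_scores = f1_per_class(confusion)
macro_f1 = sum(f1_scores) / len(f1_scores)
print(f"F1: ({f1_scores[0] * 100:.2f}%, {f1_scores[1] * 100:.2f}%) "
      f"| Macro-F1: {macro_f1 * 100:.2f}%")
```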

**TODO🔻: Complete the `compute_metrics()`, `train()` and `evaluate()` functions following the instructions in the `sa.py` file. You can test `compute_metrics()` by running the following cell.**

In [10]:
from sa import compute_metrics, train, evaluate

from testA2 import test_compute_metrics
test_compute_metrics(compute_metrics)

compute_metric test correct ✅


#### **Start Training and Validation!**

TODO🔻: (1) [coding question] Train the model with the following two different learning rates (other hyperparameters should be kept consistent).

> A. learning_rate = 1e-5

> B. learning_rate = 2e-5

**Note:** *Each training will take ~7-10 minutes using a T4 Colab GPU.*
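To see what a warmup setting means in terms of optimizer steps, the sketch below works through the arithmetic for the configuration used in this notebook (1600 training reviews, batch size 8, 4 epochs, `warmup_percent = 0.3`). It assumes a linear warmup followed by a linear decay, the shape provided by `get_linear_schedule_with_warmup` in `transformers`; the scheduler actually implemented in `sa.py` may differ, and `linear_warmup_decay` is a hypothetical helper.

```python
def linear_warmup_decay(step, total_steps, warmup_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 back to 0 at the last step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps_per_epoch = 1600 // 8              # 200 optimizer steps per epoch
total_steps = steps_per_epoch * 4        # 4 epochs
warmup_steps = round(0.3 * total_steps)  # warmup_percent = 0.3

for step in (0, 120, 240, 520, 800):
    lr = 1e-5 * linear_warmup_decay(step, total_steps, warmup_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers 240 of the 800 steps, so the peak learning rate is only reached partway through the second epoch.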

In [11]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

batch_size = 8
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
test_dataset = SADataset("data/test_sa.jsonl", tokenizer)  # added: to load the test dataset

Building SA Dataset...


1it [00:00,  4.32it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1055 > 512). Running this sequence through the model will result in indexing errors
6400it [00:17, 367.57it/s] 


In [13]:
learning_rate = 1e-5  # play around with this hyperparameter

train(dataset,
      test_dataset,
      model,
      device,
      batch_size,
      epochs,
      learning_rate,
      warmup_percent,
      max_grad_norm,
      model_save_root='models/',
      tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

Training:   0%|          | 0/200 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Training: 100%|██████████| 200/200 [00:19<00:00, 10.10it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.47it/s]


Epoch: 0 | Training Loss: 0.694 | Validation Loss: 0.687
Epoch 0 SA Validation:
Confusion Matrix:
[[1536, 1664], [1102, 2098]]
F1: (52.62%, 60.27%) | Macro-F1: 56.45%
Model Saved!


Training: 100%|██████████| 200/200 [00:16<00:00, 12.11it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 45.95it/s]


Epoch: 1 | Training Loss: 0.475 | Validation Loss: 0.295
Epoch 1 SA Validation:
Confusion Matrix:
[[2996, 204], [458, 2742]]
F1: (90.05%, 89.23%) | Macro-F1: 89.64%
Model Saved!


Training: 100%|██████████| 200/200 [00:19<00:00, 10.37it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.32it/s]


Epoch: 2 | Training Loss: 0.256 | Validation Loss: 0.395
Epoch 2 SA Validation:
Confusion Matrix:
[[2946, 254], [371, 2829]]
F1: (90.41%, 90.05%) | Macro-F1: 90.23%
Model Saved!


Training: 100%|██████████| 200/200 [00:17<00:00, 11.21it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.67it/s]

Epoch: 3 | Training Loss: 0.167 | Validation Loss: 0.450
Epoch 3 SA Validation:
Confusion Matrix:
[[3035, 165], [510, 2690]]
F1: (89.99%, 88.85%) | Macro-F1: 89.42%





In [14]:
# train with 2e-5
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)  # re-initialize the model
model.to(device)

learning_rate = 2e-5  # play around with this hyperparameter

train(dataset,
      test_dataset,
      model,
      device,
      batch_size,
      epochs,
      learning_rate,
      warmup_percent,
      max_grad_norm,
      model_save_root='models/',
      tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training: 100%|██████████| 200/200 [00:15<00:00, 12.72it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.63it/s]


Epoch: 0 | Training Loss: 0.677 | Validation Loss: 0.377
Epoch 0 SA Validation:
Confusion Matrix:
[[3014, 186], [682, 2518]]
F1: (87.41%, 85.30%) | Macro-F1: 86.36%
Model Saved!


Training: 100%|██████████| 200/200 [00:16<00:00, 12.34it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.33it/s]


Epoch: 1 | Training Loss: 0.400 | Validation Loss: 0.342
Epoch 1 SA Validation:
Confusion Matrix:
[[2767, 433], [217, 2983]]
F1: (89.49%, 90.18%) | Macro-F1: 89.83%
Model Saved!


Training: 100%|██████████| 200/200 [00:16<00:00, 11.91it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.41it/s]


Epoch: 2 | Training Loss: 0.224 | Validation Loss: 0.508
Epoch 2 SA Validation:
Confusion Matrix:
[[3058, 142], [629, 2571]]
F1: (88.80%, 86.96%) | Macro-F1: 87.88%


Training: 100%|██████████| 200/200 [00:15<00:00, 13.02it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 45.88it/s]


Epoch: 3 | Training Loss: 0.117 | Validation Loss: 0.524
Epoch 3 SA Validation:
Confusion Matrix:
[[2761, 439], [226, 2974]]
F1: (89.25%, 89.94%) | Macro-F1: 89.60%


TODO🔻: (2) [textual question] compare and discuss the results.

- Which learning rate is better? Explain your answers.

The learning rate of 1e-5 is better than 2e-5 because it overfits the training set less and potentially generalizes better.

From the loss plot, the validation loss of 2e-5 is lower than that of 1e-5 at the end of the first epoch (0.377 vs. 0.687), but from the second epoch onward the validation loss of 1e-5 is consistently lower. Both models start to overfit after two epochs, but the rise in validation loss is less pronounced for 1e-5 (0.295 to 0.450) than for 2e-5 (0.342 to 0.524), which suggests that 1e-5 generalizes better to unseen data.

However, the F1 scores and Macro-F1 of the two settings do not differ significantly. The model trained with 1e-5 has the higher Macro-F1 at the end of the third epoch (90.23\% vs. 87.88\%), but both models reach a Macro-F1 of around 90\%.

To establish whether the performance difference is real, further analysis, such as runs with different seeds and hypothesis testing, is needed.
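The comparison can be summarized programmatically. The sketch below copies the validation losses and Macro-F1 scores from the two training logs above and finds the best epoch for each learning rate; `best_epoch` is a hypothetical helper and epochs are numbered from 0 as in the logs.

```python
# Validation results copied from the two training logs above (epochs 0 to 3).
val_loss = {
    "1e-5": [0.687, 0.295, 0.395, 0.450],
    "2e-5": [0.377, 0.342, 0.508, 0.524],
}
macro_f1 = {
    "1e-5": [56.45, 89.64, 90.23, 89.42],
    "2e-5": [86.36, 89.83, 87.88, 89.60],
}


def best_epoch(values, higher_is_better):
    """Index of the best value in a per-epoch list."""
    pick = max if higher_is_better else min
    return pick(range(len(values)), key=values.__getitem__)


summary = {}
for lr in val_loss:
    summary[lr] = {
        "best_loss_epoch": best_epoch(val_loss[lr], higher_is_better=False),
        "best_f1_epoch": best_epoch(macro_f1[lr], higher_is_better=True),
        "best_f1": max(macro_f1[lr]),
    }
    print(lr, summary[lr])
```

On these logs the validation loss is lowest at epoch 1 for both runs, while the best Macro-F1 is 90.23% (epoch 2) for 1e-5 and 89.83% (epoch 1) for 2e-5, a gap of 0.4 points measured with a single seed.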

## 🎯 Q1.3: **Fine-Grained Validation (5 pts)**

TODO🔻: (1) [coding question] Using the model checkpoint trained with the first learning_rate setting (lr=1e-5), check the model performance on each domain subset of the validation set. You should report **the validation loss**, **confusion matrix**, **F1 scores** and **Macro-F1 on each domain**.

In [15]:
# Split the test sets into subsets with different domains
# Save the subsets under 'data/'
# Replace "..." with your code
domain_data = {}
with jsonlines.open("data/test_sa.jsonl", mode="r") as reader:
    for sample in reader.iter():
        if sample['domain'] not in domain_data.keys():
            domain_data[sample['domain']] = []
        domain_data[sample['domain']].append(sample)

for domain, samples in domain_data.items():
    with jsonlines.open("data/test_sa_"+domain+".jsonl", mode="w") as writer:
        for sd in samples:
            writer.write(sd)

In [16]:
learning_rate = 1e-5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained('./models/lr1e-05-warmup0.3')
model = RobertaForSequenceClassification.from_pretrained('./models/lr1e-05-warmup0.3')
model.to(device)

results_save_dir = 'predictions/'

# Evaluate and save prediction results in each domain
# Replace "..." with your code
for domain in domain_data.keys():

    domain_dataset = SADataset("data/test_sa_"+domain+".jsonl", tokenizer)
    dev_loss, confusion, f1_pos, f1_neg = evaluate(domain_dataset,
                                                   model,
                                                   device,
                                                   batch_size=8,
                                                   use_labels=True,
                                                   result_save_file='predictions/test_'+domain+'.jsonl')
    macro_f1 = (f1_pos + f1_neg) / 2

    print(f'Domain: {domain}')
    print(f'Validation Loss: {dev_loss:.3f}')
    print(f'Confusion Matrix:')
    print(confusion)
    print(f'F1: ({f1_pos*100:.2f}%, {f1_neg*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building SA Dataset...


0it [00:00, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1055 > 512). Running this sequence through the model will result in indexing errors
1600it [00:03, 515.35it/s]
Evaluation: 100%|██████████| 200/200 [00:04<00:00, 43.86it/s]


Domain: books
Validation Loss: 0.376
Confusion Matrix:
[[739, 61], [90, 710]]
F1: (90.73%, 90.39%) | Macro-F1: 90.56%
Building SA Dataset...


1600it [00:02, 643.15it/s]
Evaluation: 100%|██████████| 200/200 [00:04<00:00, 46.90it/s]


Domain: dvd
Validation Loss: 0.502
Confusion Matrix:
[[721, 79], [118, 682]]
F1: (87.98%, 87.38%) | Macro-F1: 87.68%
Building SA Dataset...


1600it [00:01, 1009.74it/s]
Evaluation: 100%|██████████| 200/200 [00:04<00:00, 45.42it/s]


Domain: electronics
Validation Loss: 0.359
Confusion Matrix:
[[736, 64], [82, 718]]
F1: (90.98%, 90.77%) | Macro-F1: 90.87%
Building SA Dataset...


1600it [00:01, 1258.40it/s]
Evaluation: 100%|██████████| 200/200 [00:04<00:00, 47.71it/s]


Domain: housewares
Validation Loss: 0.342
Confusion Matrix:
[[750, 50], [81, 719]]
F1: (91.97%, 91.65%) | Macro-F1: 91.81%


TODO🔻: (2) [textual question] compare and discuss the results.

**Questions:**
- On which domain does the model perform the best? the worst?
- Give some possible explanations of why the model's best-performed domain is easier, and why the model's worst-performed domain is more challenging. Use some examples to support your explanations.

**Note:** To find examples for supporting your discussion, save the model prediction results on each domain under the `predictions/` folder, by specifying the `result_save_file` parameter in the *evaluate* function.

On housewares, the model performs the best, with Macro-F1 being 91.81\%. On dvd, the model performs the worst, and the Macro-F1 is 87.68\%.

One possible factor behind the performance gap is review length. The mean length of housewares reviews (90 words, split by spaces) is significantly lower than that of dvd reviews (165 words, split by spaces), with a t-test p-value of 1.43e-52. Truncation therefore discards much more information from dvd reviews, making it harder for the model to understand the context and predict correctly in the dvd domain. Moreover, because the tokenizer in our implementation works at the subword level, a review takes up more tokens than words, so the truncated input covers even less of the original text.
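A length comparison of this kind can be sketched with the standard library alone. The code below measures review length by whitespace splitting and computes a two-sample t statistic, assuming Welch's unequal-variance form (the exact test variant used for the p-value above is not specified). The toy reviews are placeholders, not data from the assignment, and the p-value step is omitted because it needs a t-distribution CDF, for example from SciPy.

```python
import math
import statistics


def whitespace_length(review):
    """Number of space-separated words in a review."""
    return len(review.split())


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Placeholder reviews built from repeated dummy words.
short_reviews = [" ".join(["w"] * k) for k in (2, 4, 4, 6)]
long_reviews = [" ".join(["w"] * k) for k in (8, 10, 10, 12)]

short_lengths = [whitespace_length(r) for r in short_reviews]
long_lengths = [whitespace_length(r) for r in long_reviews]
t_stat = welch_t(short_lengths, long_lengths)
print(f"mean lengths: {statistics.mean(short_lengths)} vs "
      f"{statistics.mean(long_lengths)}, t = {t_stat:.3f}")
```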

Another potential factor is the different ways of expressing sentiment. In the housewares domain, users comment on the product directly and explicitly. For example,

*"Great thing to own and very useful for cooking small quantity of foods. I use it a lot"*

explicitly shows positive sentiment, while

*"FYI, those of you with square burners will not like this one. It is not stable because of the design, which is clearly meant for round burners."*

shows the opposite. In the dvd domain, sentiment is much more complex and implicit, often directed not at the product itself but at the movie's plot, subjective feelings, personal memories, etc. For instance, the negative sentiment in

*"I had to watch it in my 8th grade PE class and it is very informative and somewhat graphic. It shows the scariness of it and makes you eat for one thing! You wont ever forget it"*

is not obvious, so the model mistakenly predicts it as positive. The model might also have been confused by the shortcut feature "I hate ..." in this implicitly positive review:

*"Look at my review for the movie soundtrack.  I had a full vinyl copy of the voiced movie, but my mom sold it.  I hate that fact, but there isn't a thing I can do about it.  I wish I could get one back"*.


I also find that some samples in the dvd domain are WRONGLY ANNOTATED, which also hurts the measured model performance. For example, the sample

*"This is one of my favorite romantic comedies.  It is funny, touching, and has some of my favorite actors in it.  And it doesn't only center around the main characters but has a great ensemble cast of very interesting people!"* is labeled negative, which is clearly incorrect.

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# PART 2: Identify Model Shortcuts (22 pts)

In this part, we aim to find the shortcut features learnt by the sentiment analysis model we trained in Part 1. We will be using the model checkpoint trained with `learning rate=1e-5`.

</div>

## 🎯 Q2.1: **N-gram Pattern Extraction (6 pts)**
We hypothesize that `n-gram`s could be the potential shortcut features learnt by the SA model. An `n-gram` is defined as a sequence of n consecutive words appearing in a natural language sentence or paragraph.

Thus, we aim to extract n-grams that may serve as key indicators of the polarity of a review's sentiment, for example:

>- **Review 1**: This book was **horrible**. If it was possible to rate it **lower than one star** I would have.
>- **Review 2**: **Excellent** book, **highly recommended**. Helps to put a realistic perspective on millionaires.

For Review 1, the `1-gram "horrible"` and the `4-gram "lower than one star"` serve as two key indicators of negative sentiment. For Review 2, the `1-gram "excellent"` and the `2-gram "highly recommended"` clearly indicate positive sentiment.

TODO🔻: (1) [coding question] Complete `ngram_extraction()` function in `shortcut.py` file.

The returned *ngrams* is a **list** of dictionaries. The `n-th` **dictionary** corresponds to the `n-grams` (n=1, 2, 3, 4).

The keys of each dictionary should be the **unique n-gram strings** that appear in the reviews, and the value of each n-gram key records the frequency of positive/negative predictions **made by the model** when the n-gram appears in the review, i.e., `[#positive_predictions, #negative_predictions]`.

> Example: **`ngrams`[0]['horrible'][0]** should return the number of positive predictions made by the model when the 1-gram token 'horrible' appears in the given review, i.e., the first entry of \[#positive_predictions, #negative_predictions\].

**Note:** (1) Sequences that contain punctuation should NOT be counted as n-grams (e.g. `it is great .` is NOT a 4-gram, but `it is great` is a 3-gram); (2) Stop-words should NOT be counted as 1-grams, but can appear in longer n-gram sequences (e.g. `is` is NOT a 1-gram token, but `it is great` can be a 3-gram token).
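The two rules in the note can be sketched on a pre-tokenized review as follows. This is not the `ngram_extraction()` required in `shortcut.py`: the helper names are hypothetical, `STOP_WORDS` is a tiny illustrative set rather than a full stop-word list, a token counts as punctuation only if every character is in `string.punctuation`, and each n-gram is counted once per review (an assumption about how "frequency" is meant).

```python
import string
from collections import defaultdict

STOP_WORDS = {"it", "is", "the", "a", "this"}  # tiny illustrative set


def is_punct(token):
    """True if the token consists only of punctuation characters."""
    return all(ch in string.punctuation for ch in token)


def ngrams_in(tokens, n, stop_words=STOP_WORDS):
    """Set of valid n-grams in one tokenized review."""
    found = set()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if any(is_punct(t) for t in window):
            continue  # rule 1: no punctuation inside an n-gram
        if n == 1 and window[0] in stop_words:
            continue  # rule 2: stop-words are not 1-grams
        found.add(" ".join(window))
    return found


def count_predictions(samples, max_n=4):
    """counts[n-1][gram] = [#positive_predictions, #negative_predictions]."""
    counts = [defaultdict(lambda: [0, 0]) for _ in range(max_n)]
    for tokens, prediction in samples:
        column = 0 if prediction == "positive" else 1
        for n in range(1, max_n + 1):
            for gram in ngrams_in(tokens, n):
                counts[n - 1][gram][column] += 1
    return counts


samples = [
    (["it", "is", "great", "."], "positive"),
    (["it", "is", "horrible"], "negative"),
]
toy_ngrams = count_predictions(samples)
print(dict(toy_ngrams[1]))
```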

## 🎯 Q2.2: **Distill Potentially Useful Patterns (8 pts)**

TODO🔻: (2) [coding question] For each group of n-grams (n=1,2,3,4), find and **print** the **top-100 n-gram sequences** with the **greatest frequency of appearance**, which could contain frequent semantic features and would be used as our feature list.

In [17]:
from shortcut import ngram_extraction

In [18]:
# all your saved model prediction results from 1.3 Fine-Grained Validation
prediction_files = ['./predictions/test_'+domain+'.jsonl' for domain in domain_data.keys()]

# TODO: Define your tokenizer
tokenizer = RobertaTokenizer.from_pretrained('FacebookAI/roberta-base')
ngrams = ngram_extraction(prediction_files, tokenizer)

top_100 = {}
for n, counts in enumerate(ngrams):
    # TODO: find top-100 n-grams (n=1,2,3 or 4) associated with the greatest frequency of appearance
    top_100_freq = [elem[0] for elem in sorted(counts.items(), key=lambda item: sum(item[1]), reverse=True)][:100]

    print(f'Top-100 most frequent {n+1}-grams:')
    print(top_100_freq)

    top_100[n] = top_100_freq

100%|██████████| 1600/1600 [00:10<00:00, 158.96it/s]
100%|██████████| 1600/1600 [00:07<00:00, 203.77it/s]
100%|██████████| 1600/1600 [00:05<00:00, 279.33it/s]
100%|██████████| 1600/1600 [00:03<00:00, 452.68it/s]


Top-100 most frequent 1-grams:
["'s", "'t", 'one', 'book', 'like', 'would', 'good', 'movie', 'get', 'time', 'great', 'well', 'even', 'use', 'much', 'film', 'really', 'first', 'also', 'read', 'j', '...', 'way', 'work', 'many', "'ve", 'better', 'could', ').', 'k', 'new', 'two', 'b', 'people', '2', 'make', "'m", 'l', 'little', 'back', 'story', 'see', 'r', 'vd', 'love', 'g', 'man', 'think', 'buy', 'h', 'never', 'us', 'w', 'know', 'best', 'still', 'c', 'years', 'p', 'used', 'bought', '3', '--', 'want', 'life', 'made', 'go', 'product', 'set', 'another', 'quality', 'got', 'find', '1', 'ing', 'e', 'thing', 'n', 'found', 'say', 'ever', 'end', 'bad', 'every', 'able', 'old', 'right', 'something', 'going', 'since', 'v', 'need', 'sound', 'money', '5', 'er', 'easy', 'f', 'long', 'books']
Top-100 most frequent 2-grams:
['of the', 'in the', "it 's", 'is a', 'and the', 'on the', 'to the', 'it is', 'i have', 'if you', 'this is', "don 't", 'this book', 'to be', 'for the', 'with the', 'and i', 'i was', 'i

**Among each type of top-100 frequent n-grams above**, we aim to further find out the n-grams which **most likely** lead to *positive*/*negative* predictions (positive/negative shortcut features).

TODO🔻: (3) [coding&text question] Design **two different methods to re-rank** the top-100 n-grams to extract shortcut features. For each method, you should extract **1** feature from each n-gram group (n=1, 2, 3, 4) for positive and for negative predictions (1\*4\*2=8 features in total per method).

Explain each of your design choices in natural language, and compare which method finds more reasonable patterns.
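To make the idea of re-ranking concrete, the sketch below applies two possible criteria to made-up `[#positive_predictions, #negative_predictions]` counts: the share of positive predictions and the raw number of positive predictions. The counts and function names are hypothetical and only illustrate how the two criteria can disagree.

```python
# Hypothetical [#positive_predictions, #negative_predictions] counts.
toy_counts = {
    "of the": [900, 850],
    "a great": [120, 10],
    "waste of": [5, 60],
}


def rank_by_positive_ratio(counts):
    """Order n-grams by the share of positive predictions, highest first."""
    return sorted(counts, key=lambda g: counts[g][0] / sum(counts[g]), reverse=True)


def rank_by_positive_count(counts):
    """Order n-grams by the raw number of positive predictions, highest first."""
    return sorted(counts, key=lambda g: counts[g][0], reverse=True)


by_ratio = rank_by_positive_ratio(toy_counts)
by_count = rank_by_positive_count(toy_counts)
print("by ratio:", by_ratio)
print("by count:", by_count)
```

With these toy numbers, the ratio criterion puts "a great" first, while the raw count is dominated by the frequent but uninformative "of the".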


In [19]:
# TODO: [Method 1] find top-1 positive and negative patterns

for n in range(1,5):
    print(str(n)+"-gram:")
    grams_top_100 = [item for item in ngrams[n-1].items() if item[0] in top_100[n-1]]
    sorted_by_percentage = sorted(grams_top_100, key=lambda item: item[1][0]/sum(item[1]), reverse=True)
    # print('top-10:', sorted_by_percentage[:10])
    # print('last-10', sorted_by_percentage[-10:])
    pos = sorted_by_percentage[0][0]
    neg = sorted_by_percentage[-1][0]
    print('positive pattern:', pos)
    print('negative pattern:', neg)

# TODO: [Explanation of Method 1]
# This method calculates the percentage of occurrences of each feature in positive predictions out of all occurrences and sorts them from high to low. The highest is the feature most indicative of positivity, and the lowest is the feature most indicative of negativity.

1-gram:
positive pattern: easy
negative pattern: bad
2-gram:
positive pattern: a great
negative pattern: didn 't
3-gram:
positive pattern: easy to use
negative pattern: cu isin art
4-gram:
positive pattern: i highly recommend this
negative pattern: do not buy this


In [20]:
# TODO: [Method 2] find top-1 positive and negative patterns
for n in range(1,5):
    print(str(n)+"-gram:")
    grams_top_100 = [item for item in ngrams[n-1].items() if item[0] in top_100[n-1]]
    sorted_by_prob = sorted(grams_top_100, key=lambda item: item[1][0], reverse=True)
    # print('top-10:', sorted_by_prob[:10])
    # print('last-10', sorted_by_prob[-10:])
    pos = sorted_by_prob[0][0]
    neg = sorted_by_prob[-1][0]
    print('positive pattern:', pos)
    print('negative pattern:', neg)

# TODO: [Explanation of Method 2]
# This method directly sorts the features based on the occurrences of each feature in positive predictions, from most to least. The ones that appear the most are considered to be most relevant to positive predictions, and conversely, the ones that appear the least are related to negative predictions.

1-gram:
positive pattern: 's
negative pattern: bad
2-gram:
positive pattern: of the
negative pattern: am azon
3-gram:
positive pattern: this is a
negative pattern: cu isin art
4-gram:
positive pattern: this is a great
negative pattern: do not buy this


TODO🔻: Compare and discuss the results from two methods above.

The first method finds more reasonable patterns than the second for 1-grams, 2-grams and 3-grams.

In the first method, the 1-grams "easy" and "bad" are good indicators of positive and negative sentiment, respectively. For 2-grams and 3-grams, "a great" and "easy to use" are also in line with positive sentiment, whereas the link between "didn 't" or "cu isin art" and negative sentiment is less clear.

In the second method, the top-1 features for 1-grams, 2-grams and 3-grams are not as indicative as those of the first method: "'s", "of the" and "this is a" are simply frequent phrases, because ranking by raw positive count favors n-grams that are common overall. For 4-grams, both methods return the same negative feature, "do not buy this", and their positive features ("i highly recommend this" for the first method, "this is a great" for the second) are both good indicators.
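
The difference between the two rankings can be reproduced on a toy example. The sketch below uses made-up counts (not taken from the predictions in this notebook) and assumes the same layout as `ngrams[n-1]`, i.e. each n-gram maps to a pair of (positive predictions, negative predictions). The helper names are hypothetical.

```python
# Hypothetical counts: n-gram -> (positive predictions, negative predictions).
toy_counts = {
    "the": (900, 850),
    "great": (300, 20),
    "easy": (120, 5),
    "bad": (10, 200),
    "waste": (2, 90),
}

def rank_by_ratio(counts):
    """Method 1: share of positive predictions among all occurrences."""
    return sorted(counts, key=lambda g: counts[g][0] / sum(counts[g]), reverse=True)

def rank_by_positive_count(counts):
    """Method 2: raw number of positive predictions."""
    return sorted(counts, key=lambda g: counts[g][0], reverse=True)

ratio_ranking = rank_by_ratio(toy_counts)
count_ranking = rank_by_positive_count(toy_counts)

# First element = most positive candidate, last element = most negative candidate.
print("ratio:", ratio_ranking[0], "/", ratio_ranking[-1])
print("count:", count_ranking[0], "/", count_ranking[-1])
```

With these counts the ratio ranking puts "easy" first, while the raw-count ranking puts the frequent but neutral "the" first, which mirrors the "'s" and "of the" results of Method 2.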

## 🎯 Q2.3: **Case Study (8 pts)**

TODO🔻: Among the shortcut features you found in 2.1, find out **4 representative** cases (pair of `\[review, n-gram feature\]`) where the shortcut feature **will lead to a wrong prediction**.

For example, the 1-gram feature "excellent" has been considered as a shortcut for *positive* sentiment, while the ground-truth label of the given review containing "excellent" is *negative*.

**Questions:**
- Based on your case study, do you detect any limitations of the n-gram patterns?
- Which type of n-gram (1/2/3/4-gram) pattern is more robust to be used for sentiment prediction shortcut and why?

In [21]:
# TODO: you can fill your code for finding cases here

# uncomment the code below to print all the samples from which I select 4 representative cases in the next cell

# domains = ['books', 'dvd', 'electronics', 'housewares']
# for domain in domains:
#     with jsonlines.open('./predictions/test_{}.jsonl'.format(domain)) as reader:
#         for sample in reader.iter():
#             # select a few shortcut features to find wrong predictions
#             if 'easy' in sample['review'].lower() and sample['label']=='negative' and sample['prediction']=='positive':
#                 print([sample['review'], 'easy'])
#                 continue
#             if 'bad' in sample['review'].lower() and sample['label']=='positive' and sample['prediction']=='negative':
#                 print([sample['review'], 'bad'])
#                 continue
#             if 'a great' in sample['review'].lower() and sample['label']=='negative' and sample['prediction']=='positive':
#                 print([sample['review'], 'a great'])
#                 continue
#             if 'easy to use' in sample['review'].lower() and sample['label']=='negative' and sample['prediction']=='positive':
#                 print([sample['review'], 'easy to use'])
#                 continue
#             if 'i highly recommend this' in sample['review'].lower() and sample['label']=='negative' and sample['prediction']=='positive':
#                 print([sample['review'], 'i highly recommend this'])
#                 continue

In [22]:
# print the most representative cases
print('\nCase1: negative label, positive prediction')
print(["Well, I guess I should've read more reviews of this DVD. It is STRICTLY for beginners. I don't sweat and my heart rate barely goes up (if I do breath of fire it helps). When I want an easy day I'll pop this in and do both sessions. Very simple. But I'm sure it's good for a beginner--a BEGINNING BEGINNER. Sara Ivanhoe is soothing though, and encouraging. But seriously, this is a very easy workout", 'easy'])
print('\nCase2: positive label, negative prediction')
print(['No, he doesn\'t. Angel Heart centers around a down in the dumps private detective, Harry Angel and Lou Cyphere, obviously the devil. Well, not obvious to me at first...ha! The only bad part about this movie is the fuss people made over it. The voodoo dance scenes were apparently fairly in keeping with tradition. Voodoo practioners had a problem with this movie centering around a "devil" character since there is no devil figure in voodoo. However...the devil character has very little attachment to the voodoo practices, it just happens to be how the devil got involved with Johnney (the missing crooner). Mickey Rourke is stellar in this movie..Bonet is enchanting...Deniro is flawless. Worth owning!', 'bad'])
print('\nCase3: negative label, positive prediction')
print(['I am on my third machine now, and it too is leaking.  Dont get me wrong, the coffee tastes really good, always fresh.  But after a while the machine starts to leak when you are making a cup of coffee.  I guess the machine only cost 25 bucks now, so if you dont mind only using it for a month it is a great buy', 'a great'])
print('\nCase4: negative label, positive prediction')
print(['I really love my Garim GPS 12. It is easy to use and move between screens and features. The Summit is not as easy to use. It is not as convenient to identify way points, does not work with mapping software and its display is to simple. It was not a step up but sideways. I use it in conjunction with my GPS 12 mostly just to keep the track log', 'easy to use'])


Case1: negative label, positive prediction
["Well, I guess I should've read more reviews of this DVD. It is STRICTLY for beginners. I don't sweat and my heart rate barely goes up (if I do breath of fire it helps). When I want an easy day I'll pop this in and do both sessions. Very simple. But I'm sure it's good for a beginner--a BEGINNING BEGINNER. Sara Ivanhoe is soothing though, and encouraging. But seriously, this is a very easy workout", 'easy']

Case2: positive label, negative prediction
['No, he doesn\'t. Angel Heart centers around a down in the dumps private detective, Harry Angel and Lou Cyphere, obviously the devil. Well, not obvious to me at first...ha! The only bad part about this movie is the fuss people made over it. The voodoo dance scenes were apparently fairly in keeping with tradition. Voodoo practioners had a problem with this movie centering around a "devil" character since there is no devil figure in voodoo. However...the devil character has very little attachment 

TODO🔻: (Write your case study discussions and answers to the questions here.)

N-gram patterns are limited as shortcut features because they do not capture the complete meaning of a review and may therefore mislead the model. Some reviews compare two products and praise another product in order to criticize the one the reviewer bought, so they contain both positive and negative n-gram features (Case 4). Other reviews mention some good aspects of a product but are negative overall (Case 3). In both situations, features of both sentiments can be picked up.

The 4-gram pattern is the most robust shortcut feature. Empirically, as the output of the second-to-last cell shows, no 4-gram pattern leads to a wrong prediction, and most of the wrong predictions are associated with 1-grams and 2-grams. This can be attributed to the locality of n-grams: 1-grams such as "easy" and "bad" and 2-grams such as "a great" carry limited semantic information and can often lead to wrong predictions. For example, a 1-gram cannot include a preceding negation such as "not", which would reverse its meaning.
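
The negation problem can be made concrete with a small check. This is only a sketch: `is_negated`, the `NEGATIONS` set and the window size are hypothetical choices, and the simple regex tokenization differs from the tokenizer used to build the n-grams above.

```python
import re

# Illustrative list of negation cues; a real list would need to be longer.
NEGATIONS = {"not", "no", "never", "don't", "doesn't", "didn't", "isn't", "wasn't"}

def is_negated(review, feature, window=3):
    """Return True if some occurrence of `feature` has a negation cue
    among the `window` tokens that precede it."""
    tokens = re.findall(r"[a-z']+", review.lower())
    feat = feature.lower().split()
    for i in range(len(tokens) - len(feat) + 1):
        if tokens[i:i + len(feat)] == feat:
            context = tokens[max(0, i - window):i]
            if NEGATIONS & set(context):
                return True
    return False

negated = is_negated("The Summit is not as easy to use.", "easy to use")
plain = is_negated("It is easy to use and move between screens.", "easy to use")
print(negated, plain)
```

On the two sentences from Case 4, the first occurrence of "easy to use" is flagged as negated and the second is not, which is exactly the distinction a bare n-gram match cannot make.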

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Part 3: Annotate New Data (25 pts)**

In this part, you will **annotate** the gold labels of some **new** SA data samples, and measure the degree of **agreement** between your and **one or two partners'** annotations.
    
</div>

## 🎯 Q3.1: **Write an Annotation Guideline (5 pts)**

TODO🔻: Imagine that you are going to assign this annotation task to a crowdsourcing worker who is not at all familiar with computer science and NLP. Think about how you would explain the task to them so that they can do a decent job. Write an annotation guideline for such a worker.

**Note:** You should come up with your own guideline without the help of your partner(s) in later Part 3.2

Your job is to first read the **whole** review, and then use your own judgement to label each review as 'positive' or 'negative'. Here are some general principles:
1. For reviews in the electronics and housewares domains, always focus on the sentiment towards **the product itself**. In other words, you should label the review according to how the user comments on the quality, functionality and cost-effectiveness of the product. You **do not need to** consider any other attributes.
2. For reviews in the books and dvd domains, besides the aspects mentioned above, you **can** also consider **what the user thinks of the content** of the book or the dvd. However, your evaluation should not depend on attributes that are irrelevant to the product, such as the tone of the story or the sentiment of the characters in a novel.
3. Where positive and negative sentiment both appear, pick the one that is more salient.
4. Where sentiment is not explicitly expressed, annotate negative.
5. Apply these principles consistently.

Here is an example for your reference.

*"This is a great set if you are interested in knowing Bret \"Hitman\" Hart the man.  However, the matches leave much to be desired.  It seems that Bret focused more on choosing matches that were sentimental to him as opposed to what would be most memorable to fans.  Overall, I would still recommend this set because it is a first rate documentary.  However, if you want more wrestling and less documentary, you might want to spring for the Undertaker DVD"*

This is a review in the dvd domain. It has mixed sentiments, but negative sentiment prevails. Therefore, you should annotate "negative".

## 🎯 Q3.2: **Annotate Your Datapoints with Partner(s) (8 pts)**

TODO🔻: Annotate 80 datapoints (20 in each domain of "books", "dvd", "electronics" and "housewares") assigned to you and your partner(s), by editing the value of the key **"label"** in each datapoint. You and your partner(s) should annotate **independently of each other**, i.e., each of you provide your own 80 annotations.

Please find your assigned annotation dataset **ID** and **your partner(s)** according to this [list](https://docs.google.com/spreadsheets/d/1hOwBUb8XE8fitYa4hlAwq8mARZe3ZsL4/edit?usp=sharing&ouid=108194779329215429936&rtpof=true&sd=true). Your annotation dataset can be found [here](https://drive.google.com/drive/folders/1IHXU_v3PDGbZG6r9T5LdjKJkHQ351Mb4?usp=sharing).

**Name your annotated file as `<your_assigned_dataset_id>-<your_sciper_number>.jsonl`.**

**You should also submit your partner's annotated file `<assigned_dataset_id>-<your_partner_sciper_number>.jsonl`.**

## 🎯 Q3.3: **Agreement Measure (12 pts)**

TODO🔻: Based on your and your partner's annotations in 3.2, calculate the [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-kappa) or [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) (if you are in a group of three students) between the annotators on **each domain** and **across all domains**.

**Note:** Cohen's Kappa or Krippendorff's Alpha interpretation

0: No Agreement

0 ~ 0.2: Slight Agreement

0.2 ~ 0.4: Fair Agreement

0.4 ~ 0.6: Moderate Agreement

0.6 ~ 0.8: Substantial Agreement

0.8 ~ 1.0: Near Perfect Agreement

1.0: Perfect Agreement

**Questions:**
- What is the overall degree of agreement between you and your partner(s) according to the above interpretation of score ranges?
- In which domains do disagreements between you and your partner(s) happen most and least frequently? Give some examples to explain why that is the case.
- Are there possible ways to address the disagreements between annotators?

In [23]:
# Fill your code for calculating agreement scores here.
from sklearn.metrics import cohen_kappa_score

label_to_id = {'positive': 1, 'negative': 0}

with jsonlines.open("./data/32-369945.jsonl") as reader:
    my_labels = []
    my_books_labels = []
    my_dvd_labels = []
    my_electronics_labels = []
    my_housewares_labels = []
    for sample in reader.iter():
        label = label_to_id[sample['label']]
        my_labels.append(label)
        if sample['domain'] == 'books':
            my_books_labels.append(label)
        if sample['domain'] == 'dvd':
            my_dvd_labels.append(label)
        if sample['domain'] == 'electronics':
            my_electronics_labels.append(label)
        if sample['domain'] == 'housewares':
            my_housewares_labels.append(label)
    assert len(my_labels) == len(my_books_labels) + len(my_dvd_labels) + len(my_electronics_labels) + len(my_housewares_labels)

with jsonlines.open("./data/32-316842.jsonl") as reader:
    other_labels = []
    other_books_labels = []
    other_dvd_labels = []
    other_electronics_labels = []
    other_housewares_labels = []
    for sample in reader.iter():
        label = label_to_id[sample['label']]
        other_labels.append(label)
        if sample['domain'] == 'books':
            other_books_labels.append(label)
        if sample['domain'] == 'dvd':
            other_dvd_labels.append(label)
        if sample['domain'] == 'electronics':
            other_electronics_labels.append(label)
        if sample['domain'] == 'housewares':
            other_housewares_labels.append(label)
    assert len(other_labels) == len(other_books_labels) + len(other_dvd_labels) + len(other_electronics_labels) + len(other_housewares_labels)

print("The overall Cohen's Kappa is:", cohen_kappa_score(my_labels, other_labels))
print("The Cohen's Kappa on books domain is:", cohen_kappa_score(my_books_labels, other_books_labels))
print("The Cohen's Kappa on dvd domain is:", cohen_kappa_score(my_dvd_labels, other_dvd_labels))
print("The Cohen's Kappa on electronics domain is:", cohen_kappa_score(my_electronics_labels, other_electronics_labels))
print("The Cohen's Kappa on housewares domain is:", cohen_kappa_score(my_housewares_labels, other_housewares_labels))

The overall Cohen's Kappa is: 0.8903107861060329
The Cohen's Kappa on books domain is: 0.0
The Cohen's Kappa on dvd domain is: 0.8863636363636364
The Cohen's Kappa on electronics domain is: 1.0
The Cohen's Kappa on housewares domain is: 1.0


(Write your answers to the questions here.)

**Statement**: I contacted Arun Navaranjan in my adjacent group and used her annotation (SCIPER: 316842) for my analysis.

Here is an example of annotation disagreement.

*"I love the Abs Diet itself so I was expecting great things from the workout. It was pretty good, but the moves are nothing new. I was disappointed that the \"Advanced Total Body Circuit Training\" was actually just 12 squats, 12 lunges, and some push-ups and that was it. Don't get me wrong I felt it the next day, but only after I did the routine twice. The exercises are definitely effective, but they're nothing spectacular or new. Overall it's a great workout, but as you get a little stronger you can expect to have to do the routine more than once or add weight in order for it to be effective."*

This is a mixed-sentiment case in the dvd domain. I annotated it as positive because I thought the user expressed a positive sentiment in general (according to my principle 3), while my partner labeled it as negative, potentially due to the user's complaint about the intensity of the workout. This could be attributed to our different annotation principles when dealing with difficult examples. Such cases of mixed or vague sentiment are common in the dvd and books domains, leading to more annotation disagreement (and also to the poorer model performance analyzed in Part 1.3).

Possible ways to address disagreement are:
1. Design a better annotation guideline, with an emphasis on how to address difficult cases.
2. Qualify annotators according to their language competency, domain knowledge, etc.
3. Define the task more precisely, such as introducing aspects in sentiment analysis (it should largely address the annotation disagreement in mixed sentiment cases), thus making the analysis more fine-grained.
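
A Kappa of 0.0 in the books domain next to an overall score of about 0.89 (near perfect agreement on the scale above) can look surprising. One possible explanation, which is an assumption and not verified against the actual label files, is that Cohen's Kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, collapses to 0 when one annotator uses a single label for the whole domain, even if the annotators disagree on only one item. The sketch below uses hypothetical labels to show the effect.

```python
# Hypothetical labels for 20 reviews (NOT the real annotations):
# annotator A marks everything positive, annotator B differs on one review.
annotator_a = [1] * 20
annotator_b = [1] * 19 + [0]
n = len(annotator_a)

# Observed agreement p_o: fraction of items with identical labels.
observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n

# Chance agreement p_e: probability that both pick the same label at random,
# given each annotator's own label frequencies.
p_a_pos = sum(annotator_a) / n
p_b_pos = sum(annotator_b) / n
expected = p_a_pos * p_b_pos + (1 - p_a_pos) * (1 - p_b_pos)

kappa = (observed - expected) / (1 - expected)
print(f"p_o={observed:.2f}, p_e={expected:.2f}, kappa={kappa:.2f}")
```

Here the annotators agree on 95% of the items, but the chance agreement is also 95%, so Kappa is 0. Reporting the raw agreement rate next to Kappa for each domain would make such cases easier to interpret.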

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Part 4: Data Augmentation (20 pts)**

We only used 20% of the whole dataset for training, which might limit the model performance. In the final part, we will try to enlarge the training set by **data augmentation**.

Specifically, we will **`Rephrase`** some current training samples using a pretrained paraphraser, so that the paraphrased synthetic samples preserve the meaning of the originals while changing their surface form.

You can use the pretrained T5 paraphraser [here](https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base).

</div>

## 🎯 Q4.1: **Data Augmentation with Paraphrasing (15 pts)**
TODO🔻: Implement functions named `get_paraphrase_batch` and `get_paraphrase_dataset` with the details in the below two blocks.

In [24]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# get the given pretrained paraphrase model and the corresponding tokenizer (https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base)
paraphrase_tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
paraphrase_model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def get_paraphrase_batch(
    model,
    tokenizer,
    input_samples,
    n,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=256,
    device='cuda:0'):
    '''
    Input
      model: paraphraser
      tokenizer: paraphrase tokenizer
      input_samples: a batch (list) of real samples to be paraphrased
      n: number of paraphrases to get for each input sample
      for other parameters, please refer to:
          https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig
    Output:
      synthetic_samples: a list of paraphrased samples (n consecutive paraphrases per input sample)
    '''

    # TODO: implement paraphrasing on a batch of input samples
    input_samples_add_prefix = [f'paraphrase: {sample}' for sample in input_samples]
    input_ids = tokenizer(
        input_samples_add_prefix,
        return_tensors="pt",
        padding="longest",
        max_length=max_length,
        truncation=True,
    ).input_ids.to(device)

    outputs = model.generate(
        input_ids, temperature=temperature, repetition_penalty=repetition_penalty,
        num_return_sequences=n, no_repeat_ngram_size=no_repeat_ngram_size,
        num_beams=n, num_beam_groups=n,
        max_length=max_length, diversity_penalty=diversity_penalty
    )

    synthetic_samples = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return synthetic_samples

In [25]:
from tqdm import tqdm  # added

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
BATCH_SIZE = 8
N_PARAPHRASE = 2

def get_paraphrase_dataset(model, tokenizer, data_path, batch_size, n_paraphrase):
    '''
    Input
      model: paraphrase model
      tokenizer: paraphrase tokenizer
      data_path: path to the `jsonl` file of training data
      batch_size: number of input samples to be paraphrased in one batch
      n_paraphrase: number of paraphrased sequences for each sample
    Output:
      paraphrase_dataset: a list of all paraphrase samples. Do not include the original training data.
    '''
    # Count the samples up front so that the final, possibly smaller, batch is flushed.
    with open(data_path, 'r') as reader:
        line_count = sum(1 for _ in reader)

    paraphrase_dataset = []
    with jsonlines.open(data_path, "r") as reader:

        # TODO: get paraphrases for the whole training dataset using get_paraphrase_batch
        batch_domain = []
        batch_label = []
        batch_text = []
        for sample_id, sample in enumerate(tqdm(reader.iter())):
            # The n paraphrases of each input are returned consecutively,
            # so the domain and label are repeated n_paraphrase times in the same order.
            batch_domain += [sample['domain']] * n_paraphrase
            batch_label += [sample['label']] * n_paraphrase
            batch_text.append(sample['review'])
            if len(batch_text) == batch_size or sample_id == line_count - 1:
                # Use the function arguments rather than the notebook-level globals.
                paraphrase_batch_text = get_paraphrase_batch(
                    model, tokenizer, batch_text, n_paraphrase, device=model.device)
                assert len(paraphrase_batch_text) == len(batch_domain) == len(batch_label)
                for text, domain, label in zip(paraphrase_batch_text, batch_domain, batch_label):
                    paraphrase_dataset.append({'review': text, 'domain': domain, 'label': label})
                batch_domain = []
                batch_label = []
                batch_text = []

    return paraphrase_dataset

**Note:** Running the paraphrasing takes ~20-30 minutes on a T4 Colab GPU, though the exact running time depends on your implementation.
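
The batching and label bookkeeping can also be written without counting the lines of the file first. The sketch below is one possible alternative, not the implementation used in this notebook: `batched` and `fake_paraphrase_batch` are hypothetical helpers, and the latter only stands in for the real paraphraser so that the example runs without a model. It assumes, as in `get_paraphrase_batch`, that the `n` paraphrases of each input are returned consecutively.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items; the last list may be shorter."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch

def fake_paraphrase_batch(texts, n):
    """Hypothetical stand-in for the paraphraser: n outputs per input, grouped by input."""
    return [f"{text} (v{k})" for text in texts for k in range(n)]

n_paraphrase = 2
samples = [{"review": f"review {i}", "domain": "books", "label": "positive"} for i in range(5)]

augmented = []
for batch in batched(samples, batch_size=2):
    outputs = fake_paraphrase_batch([s["review"] for s in batch], n_paraphrase)
    # Output k belongs to input k // n_paraphrase because outputs are grouped by input.
    for k, text in enumerate(outputs):
        source = batch[k // n_paraphrase]
        augmented.append({"review": text, "domain": source["domain"], "label": source["label"]})

print(len(augmented))
```

Because the helper simply yields whatever is left at the end, the last batch of one sample is handled without a special case.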

In [26]:
paraphrase_dataset = get_paraphrase_dataset(paraphrase_model, paraphrase_tokenizer, data_train_path, BATCH_SIZE, N_PARAPHRASE)

1600it [10:08,  2.63it/s]


In [27]:
# Original training dataset
with jsonlines.open(data_train_path, "r") as reader:
    origin_data = [dt for dt in reader.iter()]

all_data = origin_data + paraphrase_dataset

# Write all the original and paraphrased data samples into training dataset
augmented_data_train_path = os.path.join(data_dir, 'augmented_train_sa.jsonl')
with jsonlines.open(augmented_data_train_path, "w") as writer:
    writer.write_all(all_data)

# Each original sample contributes itself plus N_PARAPHRASE paraphrases
assert len(all_data) == (1 + N_PARAPHRASE) * len(origin_data)

## 🎯 Q4.2: **Retrain RoBERTa Model with Data Augmentation (5 pts)**
TODO🔻: Retrain the sentiment analysis model with the augmented (original+paraphrased), larger dataset :)

**Note:** *Training on the augmented data will take about 15 minutes using a T4 Colab GPU.*

In [28]:
# Re-train a RoBERTa SA model on the augmented training dataset
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

learning_rate = 1e-5
batch_size = 8
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
augmented_dataset = SADataset('./data/augmented_train_sa.jsonl', tokenizer)

train(augmented_dataset,
      test_dataset,
      model,
      device,
      batch_size,
      epochs,
      learning_rate,
      warmup_percent,
      max_grad_norm,
      model_save_root='models/augmented/',
      tensorboard_path="./tensorboard/part4_lr{}".format(learning_rate))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Building SA Dataset...


0it [00:00, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (725 > 512). Running this sequence through the model will result in indexing errors
4800it [00:03, 1278.64it/s]
Training: 100%|██████████| 600/600 [00:45<00:00, 13.25it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.55it/s]


Epoch: 0 | Training Loss: 0.642 | Validation Loss: 0.349
Epoch 0 SA Validation:
Confusion Matrix:
[[2901, 299], [474, 2726]]
F1: (88.24%, 87.58%) | Macro-F1: 87.91%
Model Saved!


Training: 100%|██████████| 600/600 [00:46<00:00, 12.87it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.55it/s]


Epoch: 1 | Training Loss: 0.368 | Validation Loss: 0.264
Epoch 1 SA Validation:
Confusion Matrix:
[[2863, 337], [230, 2970]]
F1: (90.99%, 91.29%) | Macro-F1: 91.14%
Model Saved!


Training: 100%|██████████| 600/600 [00:45<00:00, 13.24it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.70it/s]


Epoch: 2 | Training Loss: 0.233 | Validation Loss: 0.361
Epoch 2 SA Validation:
Confusion Matrix:
[[2959, 241], [317, 2883]]
F1: (91.38%, 91.18%) | Macro-F1: 91.28%
Model Saved!


Training: 100%|██████████| 600/600 [00:44<00:00, 13.34it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.87it/s]

Epoch: 3 | Training Loss: 0.123 | Validation Loss: 0.523
Epoch 3 SA Validation:
Confusion Matrix:
[[2901, 299], [289, 2911]]
F1: (90.80%, 90.83%) | Macro-F1: 90.81%





TODO🔻: Discuss your results by answering the following questions

- Compare the performances of models in Part 1 and Part 4. Does the data augmentation help with the performance and why (give possible reasons)?
- No matter whether the data augmentation helps or not, list **three** possible ways to improve our current data augmentation method.

(Write your answers to the questions here.)

Data augmentation did help the performance. As the loss plot shows, with all hyperparameters held fixed, the evaluation loss on the augmented data is lower than on the non-augmented data at the end of the 1st epoch (the 600th training step) and continues to decrease until the end of the 2nd epoch. It is not until the 3rd epoch that it starts to increase due to over-fitting. A possible reason is that the paraphrased sentences enrich the feature space that the model learns from. They not only increase linguistic diversity, making the model robust to different expressions of similar meaning, but also keep the model from over-fitting to incidental patterns in the small dataset, such as n-gram shortcuts, which leads to better generalizability.
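
When comparing against Part 1, it can help to recompute the F1 values directly from the logged confusion matrices. The sketch below does this for the Epoch 1 matrix from the training log above. The helper `f1_scores` is hypothetical and not part of `sa.py`. Per-class F1 is the same whether rows are read as gold labels or as predictions, so no assumption about the orientation is needed.

```python
def f1_scores(confusion):
    """Per-class F1 for a square confusion matrix given as nested lists."""
    n_classes = len(confusion)
    scores = []
    for c in range(n_classes):
        tp = confusion[c][c]
        row_total = sum(confusion[c])
        col_total = sum(confusion[r][c] for r in range(n_classes))
        # F1 = 2*TP / (2*TP + FP + FN), and 2*TP + FP + FN = row_total + col_total.
        scores.append(2 * tp / (row_total + col_total))
    return scores

epoch1 = [[2863, 337], [230, 2970]]  # confusion matrix logged for Epoch 1
per_class = f1_scores(epoch1)
macro_f1 = sum(per_class) / len(per_class)
print([f"{score:.2%}" for score in per_class], f"{macro_f1:.2%}")
```

This reproduces the logged values for that epoch (90.99% and 91.29% per class, 91.14% macro-F1), so the same function can be applied to the Part 1 confusion matrices for a like-for-like comparison.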

Three possible ways to improve the current data augmentation method:
1. Introduce human supervision. Some paraphrased sentences, though grammatically correct, may deviate from the original sentence in semantic meaning and pragmatic use, thus making the augmented data less realistic and creating a bottleneck in model performance. One could recruit some annotators to manually paraphrase some reviews or to select the best model-paraphrased sentences, or even train another model to filter out poorly paraphrased samples.
2. Increase the paraphrasing model size. By increasing the number of parameters and training on a larger dataset, the model could potentially generate higher-quality paraphrased sentences.
3. Fine-tune the paraphrasing model on a task-specific dataset. One could fine-tune the pre-trained T5 model on a task-specific dataset, such as product reviews, while taking care to avoid data leakage. The fine-tuning tasks could be masked language modeling, next sentence prediction, etc.
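
The automatic filtering mentioned in point 1 could start from something as simple as a lexical-overlap check. The sketch below is only an illustration under strong assumptions: token overlap is a crude proxy for semantic similarity, the thresholds are arbitrary, and `jaccard` and `keep_paraphrase` are hypothetical helpers. A learned similarity model would be a more faithful filter.

```python
def jaccard(a, b):
    """Token-set overlap between two texts, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def keep_paraphrase(original, paraphrase, low=0.2, high=0.9):
    """Keep paraphrases that are neither near-copies nor unrelated.
    The thresholds are illustrative, not tuned."""
    return low <= jaccard(original, paraphrase) <= high

original = "this blender is easy to use and very quiet"
candidates = [
    "this blender is easy to use and very quiet",  # verbatim copy, adds nothing
    "the blender is quiet and simple to use",      # reworded, meaning preserved
    "i never received my order",                   # unrelated, likely label noise
]
kept = [c for c in candidates if keep_paraphrase(original, c)]
print(kept)
```

Only the reworded candidate survives: the copy is dropped for being too similar and the unrelated sentence for sharing no tokens with the original.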

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

### **5 Upload Your Notebook, Data and Models**

Please upload your filled jupyter notebook in your GitHub Classroom repository, **with all cells run and output results shown**.

**Note:** We are **not** responsible for re-running the cells in your notebook.

Please also submit all your **datasets** **(annotated and augmented)**, as well as **all your trained models** in Part 1 and Part 4, in your GitHub Classroom repository.
    
</div>