#  Assignment 2 - Transfer Learning and Data Augmentation 💬

Welcome to the **second assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **< Xavier Nal >**
> - ✉️ Email: **< xavier.nal >@epfl.ch**
> - 🪪 SCIPER: **288275**

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;">

## **Assignment Description**
- In the first part of this assignment, you will need to implement training (fine-tuning) and evaluation of a pre-trained language model ([DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) ), on natural language inference (NLI) task for recognizing textual entailment (RTE).

- Following the first finetuning task, you will need to identify the shortcut (i.e. some salient or toxic features) that the model learnt for the specific task. 

- For part-3, you are supposed to annotate 100 randomly assigned test datapoints as ground-truth labels. Additionally, the cross annotation should be conducted by another one or two annotators, and you will learn about how to calculate the agreement statistics as a significant characteristic reflecting the quality of a collected dataset.

- For part-4, since the human annotation is quite time- and effort-consuming, there are plenty of ways to get silver-labels from automatic labeling to augment the dataset scale. We provide the reference to some simple methods (EDA and Back Translation) but you are encouraged to explore other advanced mechanisms. You will evaluate the improvement of your model performance by using your data augmentation method.

For each part, you will need to complete the code in the corresponding `.py` files (`nli.py` for Part-1, `shortcut.py` for Part-2, `eda.py` for Part-4). You will be provided with the function descriptions and detailed instructions about the code snippet you need to write.


### Table of Contents
- **[PART 1: Model Finetuning for NLI](#1)**
    - [1.1 Data Processing](#11)
    - [1.2 Model Training and Evaluation](#12)
- **[PART 2: Identify Model Shortcut](#2)**
    - [2.1 Word-Pair Pattern Extraction](#21)
    - [2.2 Distill Potentially Useful Patterns](#22)
    - [2.3 Case Study](#23)
- **[PART 3: Annotate New Data](#3)**
    - [3.1 Write an Annotation Guideline](#31)
    - [3.2 Annotate Your 100 Datapoints with Partner(s)](#32)
    - [3.3 Agreement Measure](#33)
    - [3.4 Robustness Check](#34)
- **[PART 4: Data Augmentation](#4)**
    
### Deliverables

- ✅ This jupyter notebook
- ✅ `nli.py` file
- ✅ `shortcut.py` file
- ✅ Finetuned DistilBERT models for NLI task (Part 1 and Part 4)
- ✅ Annotated and cross-annotated data files (Part 3)
- ✅ New dataset from data augmentation (Part 4)

</div>

### Google Colab Setup
If you are using Google Colab notebook for this assignment, you will need to run a few commands to set up our environment on Google Colab. If you are running this notebook on a local machine you can skip this section.

Run the following cell to mount your Google Drive. Follow the popped window, sign in to your Google account. (The same account you used to store this notebook!)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now first click the 4th left-side bar (named Files), then click the 2nd bar popped under Files column (named Refresh), under "/drive/MyDrive/" find the Assignment 2 folder that you uploaded to your Google Drive, copy its path and fill it in below. If everything is working correctly, then running the folowing cell should print the filenames from the assignment:

```
['Assignment2.ipynb', 'requirements.txt', 'runs', 'predictions', 'nli_data', 'testA2.py', 'nli.py', 'shortcut.py']
```

In [2]:
import os
# TODO: Fill in the path where you download the Assignment folder into
ROOT_PATH = "/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2" # Replace with your directory to A2 folder
print(os.listdir(ROOT_PATH))
"choices"
print(ROOT_PATH)

['nli_data', 'testA2.py', 'runs', '__pycache__', 'eda.py', 'nli.py', 'shortcut.py', 'requirements.txt', 'runs_augmented_model', 'Assignment2.ipynb']
/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2


Before we start, we also need to run some boilerplate code to set up our environment, same as previous assignments. You'll need to rerun this setup code each time you start the notebook.

In [3]:
ROOT_PATH_pip = "/content/drive/MyDrive/'Colab Notebooks'/mnlp/a2-xav-nal/A2"
!pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
requirements = ROOT_PATH_pip + "/requirements.txt"
print(requirements)
!pip install -r {requirements}

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu116
Collecting torch==1.13.1+cu116
  Downloading https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp39-cp39-linux_x86_64.whl (1977.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 GB[0m [31m820.1 kB/s[0m eta [36m0:00:00[0m
[?25hCollecting torchvision==0.14.1+cu116
  Downloading https://download.pytorch.org/whl/cu116/torchvision-0.14.1%2Bcu116-cp39-cp39-linux_x86_64.whl (24.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.2/24.2 MB[0m [31m54.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting torchaudio==0.13.1
  Downloading https://download.pytorch.org/whl/cu116/torchaudio-0.13.1%2Bcu116-cp39-cp39-linux_x86_64.whl (4.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.2/4.2 MB[0m [31m100.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: torch, torchvi


Run this cell to load the autoreload extension. This allows us to edit .py source files, and re-import them into the notebook for a seamless editing and debugging experience.

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
from copy import deepcopy
import numpy as np 
from tqdm import tqdm
import jsonlines
import sys
import time
import random

import torch
import torch.utils.data
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_constant_schedule_with_warmup
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

Once you have successfully mounted your Google Drive and located the path to this assignment, run the following cell to allow us to import from the `.py` files of this assignment. If it works correctly, it should print the message:

```
Hello A2!
```

In [6]:
sys.path.append(ROOT_PATH)

from testA2 import hello_A2
hello_A2()

Hello A2!


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Note that if CUDA is not enabled, `torch.cuda.is_available()` will return False and this notebook will fallback to CPU mode.

In [7]:
if torch.cuda.is_available():
  print('Good to go!')
else:
  print('Please set GPU via Edit -> Notebook Settings.')

Good to go!


### Local Setup
If you skip Google Colab setup, you still need to fill in the path where you download the Assignment folder, and install required packages.

In [None]:
ROOT_PATH = "../A2" # Replace with your directory to A2 folder

In [None]:
requirements = ROOT_PATH + "/requirements.txt"
print(requirements)
!pip install -r {requirements}

../A2/requirements.txt
[31mERROR: Could not open requirements file: [Errno 2] No such file or directory: '../A2/requirements.txt'[0m[31m
[0m

In [None]:
with open('../A2/requirements.txt', 'r') as file:
    data = file.read()
print(data)

FileNotFoundError: ignored

In [None]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
from copy import deepcopy
import numpy as np 
from tqdm import tqdm
import jsonlines
import sys
import time, os
import random

import torch
import torch.utils.data
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_constant_schedule_with_warmup
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

<a name="1"></a>
## **PART 1: Finetuning DistilBERT for NLI**
---

### **What is the NLI task?🧐**
> Given a pair of sentences, denoted as a "premise" sentence and a "hypothesis" sentence, NLI (or RTE) aims to determine their logical relationship, i.e. whether they are logically follow (entailment), unfollow (contradiction) or are undetermined (neutral) to each other.

> Defined as a machine learning task, NLI can be considered as a 3-classes (entailment, contradiction, or neutral) classification task, with a sentence-pair input ("hypothesis" and “premise”).

> **You can run the following cell to have the first glance at your data**. Each data sample is a python dictionary, which consists of following components:
- premise sentence (*'premise'*), 
- hypothesis sentence (*'hypothesis'*) 
- domain (*'domain'*): describing the topic of premise and hypothesis sentences (e.g., government regulations, telephone talks, etc.)
- label (*'label'*): indicating the logical relation between premise and hypothesis (i.e., entailment, contradiction, or neutral).

In [8]:
# If you use Google Colab, then data_dir = 'GOOGLE_DRIVE_PATH/nli_data'
data_dir = ROOT_PATH+'/nli_data'
data_dev_path = os.path.join(data_dir, 'dev_in_domain.jsonl')
with jsonlines.open(data_dev_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        print('premise', sample['premise'])
        print('hypothesis', sample['hypothesis'])
        print('label', sample['label'])
        print('\n')
        if sid == 10:
            break

premise The new rights are nice enough
hypothesis Everyone really likes the newest benefits 
label neutral


premise This site includes a list of all award winners and a searchable database of Government Executive articles.
hypothesis The Government Executive articles housed on the website are not able to be searched.
label contradiction


premise uh i don't know i i have mixed emotions about him uh sometimes i like him but at the same times i love to see somebody beat him
hypothesis I like him for the most part, but would still enjoy seeing someone beat him.
label entailment


premise yeah i i think my favorite restaurant is always been the one closest  you know the closest as long as it's it meets the minimum criteria you know of good food
hypothesis My favorite restaurants are always at least a hundred miles away from my house. 
label contradiction


premise i don't know um do you do a lot of camping
hypothesis I know exactly.
label contradiction


premise well that would be a help 

In [9]:
# Enter enter your Sciper number
SCIPER = '288275'
seed = int(SCIPER)

In [10]:
print('Your random seed is: ', seed)

Your random seed is:  288275


In [11]:
# We use the following pretrained tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

### **1.1 Dataset Processing**
Our first step is to load datasets for NLI task by constructing a Pytorch Dataset. Specifically, we will need to implement tokenization and padding with a HuggingFace pre-trained tokenizer.

**Complete `NLIDataset` class following the instructions in `nli.py`, and test by running the following cell.**

In [None]:

from nli import NLIDataset
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
dataset = NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Building NLI Dataset...


9815it [00:08, 1121.64it/s]


In [None]:
from testA2 import test_NLIDataset
test_NLIDataset(dataset)

NLIDataset test correct ✅


### **1.2 Model Training and Evaluation**
Next, we will implement the training and evaluation process to finetune the model. For model training, you will need to calculate the loss and update the model weights by update the optimizer. Additionally, we add a learning rate schedular to adopt an adaptive learning rate during the whole training process. 

For evaluation, you will need to compute accuracy and F1 scores to assess the model performance. 

**Complete the `compute_metric()`, `train()` and `evaluate()` functions following the instructions in the `nli.py` file, you can test compute_metric() by running the following cell.**

In [16]:
from nli import evaluate, train, compute_metrics
from testA2 import test_compute_metrics
test_compute_metrics(compute_metrics)

compute_metric test correct ✅


#### **Start Training and Validation!**

Try the following different hyperparameter settings, compare and discuss the results. (Other hyperparameters should not be changed.)

> A. learning_rate 2e-5

> B. learning_rate 5e-5

**Note:** *Each training will take about 1 hour using a GPU, please keep your computer and notebook active during the training.*

**Questions: Which learning rate is better? Explain your answers.**

In [None]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")



model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model.to(device)

train_dataset = NLIDataset(ROOT_PATH+"/nli_data/train.jsonl", tokenizer)
dev_dataset = NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

batch_size = 16
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
model_save_root = ROOT_PATH+'/runs/'

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Building NLI Dataset...


98176it [01:12, 1347.87it/s]


Building NLI Dataset...


9815it [00:07, 1390.13it/s]


In [None]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model_lr5 = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model_lr5.to(device)

learning_rate = 5e-5 # play around with this hyperparameter

train(train_dataset, dev_dataset, model_lr5, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

 path save repo /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/runs/lr5e-05-warmup0.3


Training: 100%|██████████| 6136/6136 [04:06<00:00, 24.88it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.77it/s] 


Epoch: 0 | Training Loss: 0.949 | Validation Loss: 0.737
Epoch 0 NLI Validation:
Accuracy: 68.29% | F1: (73.30%, 62.55%, 68.02%) | Macro-F1: 67.95%
Model Saved at epoch 0


Training: 100%|██████████| 6136/6136 [04:01<00:00, 25.46it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.58it/s] 


Epoch: 1 | Training Loss: 0.647 | Validation Loss: 0.598
Epoch 1 NLI Validation:
Accuracy: 75.33% | F1: (78.91%, 70.85%, 75.94%) | Macro-F1: 75.23%
Model Saved at epoch 1


Training: 100%|██████████| 6136/6136 [04:00<00:00, 25.46it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.41it/s] 


Epoch: 2 | Training Loss: 0.486 | Validation Loss: 0.588
Epoch 2 NLI Validation:
Accuracy: 76.58% | F1: (80.63%, 71.63%, 76.55%) | Macro-F1: 76.27%
Model Saved at epoch 2


Training: 100%|██████████| 6136/6136 [04:00<00:00, 25.47it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.50it/s] 

Epoch: 3 | Training Loss: 0.316 | Validation Loss: 0.695
Epoch 3 NLI Validation:
Accuracy: 75.96% | F1: (79.44%, 70.44%, 77.24%) | Macro-F1: 75.71%





In [None]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

learning_rate = 2e-5 # play around with this hyperparameter

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model_lr2 = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model_lr2.to(device)

train(train_dataset, dev_dataset, model_lr2, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

 path save repo /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/runs/lr2e-05-warmup0.3


Training: 100%|██████████| 6136/6136 [04:03<00:00, 25.20it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.40it/s] 


Epoch: 0 | Training Loss: 0.949 | Validation Loss: 0.737
Epoch 0 NLI Validation:
Accuracy: 68.29% | F1: (73.30%, 62.55%, 68.02%) | Macro-F1: 67.95%
Model Saved at epoch 0


Training: 100%|██████████| 6136/6136 [04:01<00:00, 25.46it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.29it/s] 


Epoch: 1 | Training Loss: 0.647 | Validation Loss: 0.598
Epoch 1 NLI Validation:
Accuracy: 75.33% | F1: (78.91%, 70.85%, 75.94%) | Macro-F1: 75.23%
Model Saved at epoch 1


Training: 100%|██████████| 6136/6136 [04:01<00:00, 25.40it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.33it/s] 


Epoch: 2 | Training Loss: 0.486 | Validation Loss: 0.588
Epoch 2 NLI Validation:
Accuracy: 76.58% | F1: (80.63%, 71.63%, 76.55%) | Macro-F1: 76.27%
Model Saved at epoch 2


Training: 100%|██████████| 6136/6136 [04:01<00:00, 25.43it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.57it/s] 

Epoch: 3 | Training Loss: 0.316 | Validation Loss: 0.695
Epoch 3 NLI Validation:
Accuracy: 75.96% | F1: (79.44%, 70.44%, 77.24%) | Macro-F1: 75.71%





### **Fine-Grained Validation**

Use the model checkpoint saved under the first hyperparameter setting (learning_rate 2e-5) in 1.4, check the model performance on each domain subsets of the validation set, report the validation loss, accuracy, F1 scores and Macro-F1 on each domain, compare and discuss the results.

**Questions: On which domain does the model perform the best? the worst? Give some possible explanations of why the model's best-performed domain is easier, and why the model's worst-performed domain is more challenging. Use some examples to support your explanations.**

**Note:** To find examples for supporting your discussion, save the model prediction results on each domain under the './predictions/' folder, by specifying the *result_save_file* of the *evaluate* function.

In [None]:
batch_size = 16
learning_rate = 2e-5
warmup_percent = 0.3
# checkpoint = ROOT_PATH+'/runs/lr{}-warmup{}/model_epoch2.pt'.format(learning_rate, warmup_percent)

# Split the validation sets into subsets with different domains
# Save the subsets under './nli_data/'
# Replace "..." with your code
...

# Define the set of domains
domains = set(["fiction", "government", "slate", "telephone", "travel"])

# Define the path to the output directory
output_dir = os.path.join(ROOT_PATH, "nli_data")

# Initialize the output files for each domain
output_files = {domain: os.path.join(output_dir, f"dev_{domain}.jsonl") for domain in domains}
for file in output_files.values():
    with open(file, "w"):
        pass

# Iterate over the input samples and write them to the appropriate output file
with jsonlines.open(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", "r") as reader:
    for sample in tqdm(reader.iter()):
        # print(sample)
        domain = sample['domain']
        output_file = output_files.get(domain)
        with jsonlines.open(output_file, "a") as writer:
          writer.write(sample)




9815it [00:25, 377.69it/s]


In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
epoch = 2
learning_rate = 2e-5
warmup_percent = 0.3
repo =  ROOT_PATH+f'/runs/lr{learning_rate}-warmup{warmup_percent}'

# load tokenizer
tokenizer_checkpoint = os.path.join(repo, 'tokenizer_epoch{}.pt'.format(epoch))
tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_checkpoint)

# load params model
model_checkpoint = os.path.join(repo, 'model_epoch{}.pt'.format(epoch))
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
state_dict = torch.load(model_checkpoint)
model.load_state_dict(state_dict)
model.to(device)

In [None]:


# Initialize the output files prediction for each domain
domains = set(["fiction", "government", "slate", "telephone", "travel"])
output_dir = os.path.join(ROOT_PATH, "nli_data/predictions")
output_files = {domain: os.path.join(output_dir, f"dev_pred_{domain}.jsonl") for domain in domains}
for file in output_files.values():
    with open(file, "w"):
        pass

for domain in ["fiction", "government", "slate", "telephone", "travel"]:
    
    # Evaluate and save prediction results in each domain
    # Replace "..." with your code
    data_files= f"/nli_data/dev_{domain}.jsonl"
    dev_domain_dataset = NLIDataset(ROOT_PATH + data_files, tokenizer)

    result_save_file = f'/nli_data/predictions/dev_pred_{domain}.json'
    dev_loss, acc, f1_ent, f1_neu, f1_con = evaluate(dev_domain_dataset, model, device,  batch_size, result_save_file= ROOT_PATH + result_save_file)
    
    macro_f1 = (f1_ent + f1_neu + f1_con) / 3
    
    print(f'Domain: {domain}')
    print(f'Validation Loss: {dev_loss:.3f} | Accuracy: {acc*100:.2f}%')
    print(f'F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building NLI Dataset...


1973it [00:01, 1935.53it/s]
Evaluation: 100%|██████████| 124/124 [00:01<00:00, 116.84it/s]


Domain: fiction
Validation Loss: 0.596 | Accuracy: 76.43%
F1: (79.45%, 71.58%, 77.38%) | Macro-F1: 76.14%
Building NLI Dataset...


1945it [00:01, 1171.79it/s]
Evaluation: 100%|██████████| 122/122 [00:01<00:00, 97.64it/s]


Domain: government
Validation Loss: 0.472 | Accuracy: 81.54%
F1: (85.26%, 77.73%, 80.86%) | Macro-F1: 81.28%
Building NLI Dataset...


1955it [00:01, 1367.75it/s]
Evaluation: 100%|██████████| 123/123 [00:01<00:00, 100.46it/s]


Domain: slate
Validation Loss: 0.696 | Accuracy: 71.30%
F1: (75.71%, 64.80%, 72.19%) | Macro-F1: 70.90%
Building NLI Dataset...


1966it [00:01, 1407.34it/s]
Evaluation: 100%|██████████| 123/123 [00:01<00:00, 86.78it/s]


Domain: telephone
Validation Loss: 0.614 | Accuracy: 75.18%
F1: (79.50%, 68.97%, 76.05%) | Macro-F1: 74.84%
Building NLI Dataset...


1976it [00:02, 913.36it/s] 
Evaluation: 100%|██████████| 124/124 [00:01<00:00, 98.84it/s] 

Domain: travel
Validation Loss: 0.551 | Accuracy: 78.80%
F1: (83.76%, 75.13%, 76.79%) | Macro-F1: 78.56%





The model performs better on governement domain ( 81.54% of accuracy)

## **Task2: Identify Shortcuts**

We aim to find some shortcuts that the model in 1.4 (under the first hyperparameter setting) has learned.

### **2.1 Word-Pair Pattern Extraction**

We consider to exatrct simple word-pair patterns that the model may have learned from the NLI data. 

For this, we assume that a pair of words that occur in a premise-hypothesis sentence pair (one occurs in premise and the other occurs in hypothesis) may serve as a key indicator of the logical relationship between the premise and hypothesis sentences. For example:

>- Premise: Consider the United States Postal Service.
>- Hypothesis: Forget the United States Postal Service.

Here the word-pair "consider" and "forget" determine that the premise and hypothesis have a *contradiction* relationship, so (consider, forget) --> *contradiction* might be a good pattern to learn.

**Note:** 
- We do not consider the naive word pair patterns where the word from premise and the word from hypothesis are identical, e.g., (service, service) got from the above premise-hypothesis sentence pair.
- We do not consider stop words neither, punctuations and words that contain special prefix '##', e.g., '##s' in the pattern extraction.

**Complete `word_pair_extraction()` function in `shortcut.py` file.**

The keys of the returned dictionary *word_pairs* should be **different word-pairs** appered in premise-hypothesis sentence pairs, i.e., (a word from the premise, a word from the hypothesis).

The value of a word-pair key records the counts of entailment, neutral and contradiction predictions **made by the model** when the word-pair occurs, i.e., \[#entailment_predictions, #neutral_predictions,  #contradiction_predictions\].

**Note:** Remember to remove naive word pairs (i.e., premise word identical to hypothesis word), stop_words, puntuations and words with special prefix '##' out of consideration.

In [13]:
import itertools
import jsonlines
import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words.append('uh')

import string
puncs = string.punctuation
import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words.append('uh')

import string
puncs = string.punctuation

epoch = 2
learning_rate = 2e-5
warmup_percent = 0.3
repo =  ROOT_PATH+f'/runs/lr{learning_rate}-warmup{warmup_percent}'
# load tokenizer
tokenizer_checkpoint = os.path.join(repo, 'tokenizer_epoch{}.pt'.format(epoch))
tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_checkpoint)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **2.2 Distill Potentially Useful Patterns**

Find and print the **top-100** word-pairs that are associated with the **largest total number** of model predictions, which might contain frequently used patterns.

In [None]:
from shortcut import word_pair_extraction

In [14]:
# all your saved model prediction results in 1.2 Fine-Grained Validation
folder_path = ROOT_PATH + "/nli_data/predictions"
prediction_files = []

# Iterate over all files in the directory
for filename in os.listdir(folder_path):
    prediction_files.append(os.path.join(folder_path, filename))

print(prediction_files)

['/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_fiction.json', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_government.json', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_government.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_slate.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_fiction.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_travel.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_slate.json', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.json', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-n

In [15]:
word_pairs = word_pair_extraction(prediction_files, tokenizer)

print('word_pairs', len(word_pairs))

files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_fiction.json
files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_government.json
files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_government.jsonl
files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_slate.jsonl
files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_fiction.jsonl
files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_travel.jsonl
files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.jsonl
files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_slate.json
files /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.json
files /content/drive/MyDriv

In [52]:
word_pairs[('like', 'never')]

[3, 1, 17]

In [16]:
import collections
# {"entailment": 0, "neutral": 1, "contradiction": 2}

sum_word_pairs = {}

# Compute occurences
for pair, freq in word_pairs.items():
    total_freq = sum(freq)
    sum_word_pairs[pair] = total_freq


# find top-100 word-pairs associated with the largest total number of model predictions
top_100_pairs = collections.Counter(sum_word_pairs).most_common(100)
# print(top_100_pairs)

# to add again the value information
top_100_pairs_labels = {}
for key, freq in top_100_pairs:
  label = word_pairs[key]
  top_100_pairs_labels[key] = label


In [17]:
# find top-100 word-pairs associated with the largest total number of model predictions
top_100_freq_pairs = top_100_pairs_labels

print(top_100_freq_pairs)

{('services', 'legal'): [24, 18, 19], ('legal', 'services'): [21, 11, 18], ('know', 'time'): [8, 24, 14], ('postal', 'service'): [16, 13, 13], ('service', 'postal'): [14, 11, 12], ('know', 'people'): [16, 16, 4], ('know', 'money'): [13, 16, 7], ('know', 'like'): [18, 15, 3], ('know', 'lot'): [10, 21, 4], ('know', 'watch'): [11, 12, 12], ('yeah', 'like'): [9, 11, 13], ('know', 'make'): [5, 15, 11], ('know', 'get'): [6, 13, 10], ('would', 'could'): [7, 15, 6], ('like', 'think'): [16, 10, 1], ('last', 'years'): [13, 6, 8], ('know', 'never'): [1, 0, 26], ('yeah', 'one'): [7, 6, 14], ('year', 'last'): [10, 6, 10], ('yeah', 'time'): [5, 8, 13], ('like', 'lot'): [13, 10, 2], ('know', 'would'): [2, 17, 6], ('know', 'think'): [9, 2, 14], ('know', 'go'): [6, 5, 14], ('one', 'people'): [11, 9, 4], ('test', 'toxicity'): [8, 8, 8], ('yeah', 'never'): [5, 1, 18], ('yeah', 'get'): [6, 13, 5], ('first', 'mail'): [6, 6, 11], ('last', 'year'): [11, 5, 7], ('know', 'children'): [5, 11, 7], ('ca', 'da'): 

**Among the top-100 frequent word-pairs above**, find out the **top-5** word-pairs whose occurances **most likely** lead to *entailment* predictions (entailment patterns), and the **top-5** word-pairs whose occurances **most likely** lead to *contradiction* predictions (contradiction patterns).

**Explain your rules for finding these word pairs.**

In [18]:
# find top-5 entailment and contradiction patterns
top_5_entailment = sorted(top_100_pairs_labels.items(), key=lambda x: x[1][0], reverse=True)[:5]
top_5_contradict = sorted(top_100_pairs_labels.items(), key=lambda x: x[1][2], reverse=True)[:5]

print("Entailment Patterns:")
print(top_5_entailment)
print("Contradiction Patterns:")
print(top_5_contradict)

Entailment Patterns:
[(('services', 'legal'), [24, 18, 19]), (('legal', 'services'), [21, 11, 18]), (('know', 'like'), [18, 15, 3]), (('information', 'state'), [17, 0, 0]), (('postal', 'service'), [16, 13, 13])]
Contradiction Patterns:
[(('know', 'never'), [1, 0, 26]), (('services', 'legal'), [24, 18, 19]), (('legal', 'services'), [21, 11, 18]), (('yeah', 'never'), [5, 1, 18]), (('like', 'never'), [3, 1, 17])]


### **2.3 Case Study**

Find out and study **4 representative** cases where the pattern that you have found in 2.2 **fails**, e.g., the premise-hypothesis sentence pair contains ('good', 'bad'), but has an *entailment* gold label.

**Based on your case study, explain the limitations of the word-pair patterns.**

In [12]:
# all your saved model prediction results in 1.2 Fine-Grained Validation

# ROOT_PATH = '../A2'
ROOT_PATH = "/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2"
folder_path = ROOT_PATH + "/nli_data/predictions"
prediction_files = []

# Iterate over all files in the directory
for filename in os.listdir(folder_path):
    prediction_files.append(os.path.join(folder_path, filename))

print(prediction_files)

['/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_fiction.json', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_government.json', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_government.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_slate.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_fiction.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_travel.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.jsonl', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_slate.json', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.json', '/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-n

In [19]:
# looking for contradiction label on the pair ('know', 'never'), [1, 0, 26])
"""
Example of a the patern (('know', 'never'), [1, 0, 26]) is most of the time contradiction but  have one entailment
"""
for file in prediction_files:
    with jsonlines.open(file) as reader:
        # Loop through all predictions in the file
        for idx, sample in enumerate(reader.iter()):
            # Check if premise contains the pair words
            if 'know' in sample['premise'] and 'know' not in sample['hypothesis'] \
                and 'never' in sample['hypothesis'] and 'never' not in sample['premise'] \
                and sample['label'] == 'entailment':

                print('file', file)
                print('index', idx)
                print('Premise:', sample['premise'])
                print('Hypothesis:', sample['hypothesis'])
                print('Label:', sample['label'])
                print('prediction:', sample['prediction'])
                print('\n')

file /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.json
index 546
Premise: and uh my daughter gets irate when i when i do that because you know she's a teenager
Hypothesis: My daughter's a teenager and so she gets mad whenever I do that.
Label: entailment
prediction: entailment




In [20]:
# looking for contradiction label on the pair (('know', 'like'), [18, 15, 3])
"""
Example of a the patern (('know', 'like'), [18, 15, 3]) is most of the time entailment but  have one entailment
"""
for file in prediction_files:
    with jsonlines.open(file) as reader:
        # Loop through all predictions in the file
        for idx, sample in enumerate(reader.iter()):
            # Check if premise contains "may" and hypothesis contains "might"
            if 'know' in sample['premise'] and 'like' in sample['hypothesis'] and sample['label'] == 'contradiction' :
                print('file', file)
                print('index', idx)
                print('Premise:', sample['premise'])
                print('Hypothesis:', sample['hypothesis'])
                print('Label:', sample['label'])
                print('prediction:', sample['prediction'])
                print('\n')

file /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.json
index 766
Premise: and the same is true of the drug hangover you know if you
Hypothesis: It's nothing like a drug hangover.
Label: contradiction
prediction: contradiction


file /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.json
index 1101
Premise: um-hum yeah i know what that's like uh-huh
Hypothesis: I have no idea what that is like.
Label: contradiction
prediction: contradiction


file /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/nli_data/predictions/dev_pred_telephone.json
index 1209
Premise: yeah right right yeah i know i uh i remember my college days  and having to do that too
Hypothesis: I remember that when I went to college we didn't have anything like that.
Label: contradiction
prediction: contradiction




The word-pair pattern approach can be effective in detecting entailment and contradiction relationships in certain cases. However, there are limitations to this approach. For example, in the case of the word-pair pattern ('services', 'legal'), it may suggest a strong entailment relationship (top5 frequency for entailment), but it is also pattern word pair in the top5 for contradiction.

Similarly, the word-pair pattern ('know', 'like') may suggest a entailment relationship, but in some cases, it may not hold true. In our case there is 3 case with contradiction relationship. 

These examples demonstrate the limitations of  word-pair patterns for detecting entailment and contradiction relationships. To improve the accuracy of these predictions, it is important to take into account the context.

## **Task3: Annotate New Data**

To check the robustness of developed model, **some additional sets of test data** are collected, which contain NLI samples that are out of the domains of the training and validation data.

However, the test data does not have gold labels of the relationships between premise and hypothesis sentences, i.e., all the labels are marked as *hidden*. **We consider to annotate the data by ourselves.**

### **3.1 Write an Annotation Guideline**

Imagine that you are going to assign this annotation task to a crowdsourcing worker, who is completely not familiar with computer science and NLP. Think about how you are going to explain this annotation task to him in order to guide him do a decent job. Write an annotation guideline for such a worker who are going to do this task for you.

**Note:** You should come up with your own guideline without the help of your partner(s) in later Task 3.2

(Write your annotation guideline here.)

#### Guideline

You will have the data to annotate under the form on one premise sentence follow by an hypothesis.
Your role is to labelised to say the link between the premise and the hypothesis. In detail is to determine whether the hypothesis is supported, contradicted or can't be inderred from the premise. 


1. Read and understand the premise and the hypothesis carefully

2. Now determine the link between the hypothesis and the premise:
- does the hypothesis contradicted the premise?
- does they say incompatible information?
- Note: seek for the presence of negation in one sentence, seek for the presence of words that have contradictoire meaning (start/finish, at time/late, old/young, ...
-> then the label is "contradiction"
2.1 if the sentence is not contradictoire, does the hypothesis support the premise?
- if the premise is the only knowledge you have, can you determine the hypothesis? 
- does all the information provide by the hypothesis come from the premise?
- does the hypothesis repeat what say the premise without adding information?
- does the hypothesis clearly implied by the premise?
write label "entailment"
2.2 if the sentence is not contradictoire and not support by the premise,it may be an example of 'neutral' 
- does the hypothesis speak about another thing than the premise?
- is the hypothesis not directly support by the premise?
write 'neutral' 



**studen note:** the choice has been made in the guideline to pay attention at contradiction before entailment is because he found easier to repart contradiction than clear support between the premise and the hypothesis

### **3.2 Annotate Your 100 Datapoints with Partner(s)**

Annotate your 100 test datapoints with your partner(s), by editing the value of the key "label_student1", "label_student2" and "label_student3" (if you are in a group of three students) in each datapoint.

**Note:** 
- You can download the assigned annotation file (`<your-testset-id>.jsonl`) by [this link](https://drive.google.com/drive/folders/146ExExmpnSUayu6ArGiN5gQzCPJp0myB?usp=share_link)
- Please find your annotation partner according to the "Student Pairing List for A2 Task3" shared on Ed.

**Name your annotated file as `<index>-<sciper_number>.jsonl`.** 

For example, if you get `01.jsonl` to annotate, you should name your deliverable as `01-<your_sciper_number>.jsonl`.

In [None]:
# code to see the data
number = 90
data_dir = ROOT_PATH+'/nli_data'
data_dev_path = os.path.join(data_dir, '28-288275.jsonl')
with jsonlines.open(data_dev_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        if sid > number:
            print('index: ',sid)
            print('premise:    ', sample['premise'])
            print('hypothesis: ',sample['hypothesis'])
            print('label:      ', sample['label_student2'])
            print('\n')
        if sid > number + 1:
            break

index:  91
premise:     As Indianapolis Center continued searching for the aircraft, two managers and the controller responsible for American 77 looked to the west and southwest along the flight's projected path, not east-where the aircraft was now heading.
hypothesis:  They did not locate the plane until it crashed.
label:       neutral


index:  92
premise:     But most important, it must have a managerial system capable of coordinating these elements on an ongoing basis.
hypothesis:  A competent managerial system that is capable of consistently coordinating these elements is a necessity.
label:       entailment


index:  93
premise:     Some within the Pentagon argued in the 1990s that the alert sites should be eliminated entirely.
hypothesis:  Alert sites are costing the government a lot of money.
label:       neutral


index:  94
premise:     Our approach to the problem was to use operations research techniques and computer simulations of demand to explore the appropriate inventor

In [None]:
# code to label the data
#ROOT_PATH = "/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2" 
data_dir = ROOT_PATH+'/nli_data'
data_dev_path = os.path.join(data_dir, '28-288275.jsonl')

my_dict = {0: "entailment", 1: "neutral", 2: "contradiction"}

with jsonlines.open(data_dev_path) as reader:
    data = list(reader)
new_label = '0'

for i, d in enumerate(data):
    if new_label == 'q':
        break
        
    if d['label_student2'] == 'unknown':
        while True:
            print(' i: ', i)
            print('premise:', d['premise'])
            print('hypothesis:',d['hypothesis'])
            print('label:',d['label_student2'])
            
            new_label = input('Enter label')

            if new_label == 'q':
                break            
            elif new_label.isdigit() and 0 <= int(new_label) < 3:
                new_label = int(new_label)
                data_1[i]['label_student3'] = my_dict[new_label]
                break
         

with jsonlines.open(data_dev_path, mode='w') as writer:
    writer.write_all(data)
    

### **3.3 Agreement Measure**

Based on your and your partner's annotations on the 100 test datapoints in 3.2, calculate the [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-kappa) or [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) (if you are in a group of three students) between the annotators. Discuss the agreement measure results.

**Note:** Cohen's Kappa or Krippendorff's Alpha interpretation

0: No Agreement

0 ~ 0.2: Slight Agreement

0.2 ~ 0.4: Fair Agreement

0.4 ~ 0.6: Moderate Agreement

0.6 ~ 0.8: Substantial Agreement

0.8 ~ 1.0: Near Perfect Agreement

1.0: Perfect Agreement

> **Questions**: What is your interpretation of Cohen's Kappa or Krippendorff's Alpha value according to the above mapping? Which kind of disagreements are most frequently happen between you and your partner(s), i.e., *entailment* vs. *neutral*, *entailment* vs. *contradiction*, or *neutral* vs. *contradiction*? For the second question, give some examples to explain why that is the case. Are there possible ways to address the disagrrements between two annotators?

In [None]:
# fill your code here

In [None]:
def check_disagreement(label_studen_1, label_studen_2):
    """
    input: label_student_1, label_student_2: string
    output: number
    0: error
    1: for entailment vs. neutral
    2: for entailment vs. contradiction
    3: for neutral vs. contradiction
    """
    result = 0
    
    if label_studen_1 == "entailment":
        
        if label_studen_2 == "neutral":
            result = 1
        elif label_studen_2 == "contradiction":
            result = 2
        else:
            print('error the label are the same',label_studen_1, label_studen_2)
        
    elif label_studen_1 == "neutral":
        
        if label_studen_2 == "entailment":
            result = 1
        elif label_studen_2 == "contradiction":
            result = 3
        else:
            print('error the label are the same',label_studen_1, label_studen_2)
        
    elif label_studen_1 == "contradiction":
        
        if label_studen_2 == "entailment":
            result = 2
        elif label_studen_2 == "neutral":
            result = 3
        else:
            print('error the label are the same',label_studen_1, label_studen_2)

            
    return result
        

In [None]:
# code to check the number of commun labl
# ROOT_PATH = "/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2" 
ROOT_PATH = "../A2" # Replace with your directory to A2 folder
data_dir = ROOT_PATH+'/nli_data'
data_dev_path = os.path.join(data_dir, '28-351555.jsonl')
data_dev_path_2 = os.path.join(data_dir, '28-288275.jsonl')

my_dict = {0: "entailment", 1: "neutral", 2: "contradiction"}
my_dict_inv = {"entailment": 0, "neutral":1 , "contradiction":2}

# 1: for entailment vs. neutral
# 2: for entailment vs. contradiction
# 3: for neutral vs. contradiction

entvsneu = 0
entvscon = 0
neuvscon = 0


with jsonlines.open(data_dev_path) as reader:
    data_1 = list(reader)

with jsonlines.open(data_dev_path_2) as reader:
    data_2 = list(reader)

input = '0'
same = 0
not_same= 0
diff = []
ind_diff = []
for i, d in enumerate(data_1):
    if d['label_student1'] == data_2[i]['label_student2']:
        #print(i)
        same += 1
    else:
        not_same += 1
        disagreement = check_disagreement(d['label_student1'], data_2[i]['label_student2'])
        
        if disagreement == 0:
            print('error disagrement')
        # entailment vs. neutral
        elif disagreement == 1:
            entvsneu += 1
        # entailment vs. contradiction
        elif disagreement == 2:
            entvscon += 1
        # 3: for neutral vs. contradiction
        elif disagreement == 3:
            neuvscon += 1
        diff.append(d)
        ind_diff.append(i)


print('same:', same, 'not_same', not_same)
print('entailment vs. neutral', entvsneu)
print('entailment vs. contradiction', entvscon)
print('neutral vs. contradiction', neuvscon)
print('indices of label differents:', ind_diff)

There are 35 disagreements between my partner and me. The most frequent disagreement is about entailment versus neutrality. One possible reason for this is the multiple meanings of certain words, as well as the fact that neither my partner nor I have English as our native language. Therefore, sometimes we do not completely understand the meaning of a sentence. The disagreement between entailment and contradiction occurs the least, and when it does happen, it is usually due to an unintentional mistake or lack of attention. We are able to reach an agreement quickly in such cases.

Disagrement entailment vs neutral:
- index:  48
- premise:     That was the favorite part of the story!
- hypothesis:  That was the best part of the story!
- label student 1:  neutral   
- label student 2:  entailment

In the given example, there is a disagreement between two students on whether the hypothesis entails the premise. Although the two sentences convey similar meanings, there is a subtle difference in the words "favorite" and "best". "Favorite" refers to something that is preferred or liked the most, while "best" refers to something that is of the highest quality or excellence, which conduct to put the label neutral for the first student.

Disagrement contradiction vs neutral:
- premise:     Into adulthood, what books do you like now?
- hypothesis:  Why don't you read books, now that you're an adult?
- label:       contradiction
- label2:      neutral

In the example above, at first, I labeled it as neutral, reasoning that the premise and the hypothesis were asking about different things: the type of books or if the person reads books at all. However, my partner pointed out that the premise implies that the person has already started reading books, and thus the hypothesis contradicts this by implying that the person doesn't read books as an adult. This example shows the importance of carefully reading and analyzing the context of the given sentences to accurately identify the correct label.

How to do better?

In summary, most of our labeling disagreements were due to a lack of deep understanding of the sentences. Employing the services of two English-speaking workers to label the data could improve the Cohen's Kappa or Krippendorff's Alpha values, as they have a better understanding of the meaning of the sentences and common expressions. The remaining the disagreements were caused by inattention errors. In such cases, having multiple workers label the same data and taking results with the highest frequency can remove such errors. However, this would increase the cost of human labeling.

In [None]:
# code see in detail the disagrement
# ROOT_PATH = "/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2" 
ROOT_PATH = "../A2" # Replace with your directory to A2 folder
data_dir = ROOT_PATH+'/nli_data'
#data_dev_path_fin = os.path.join(data_dir, 'commun_agrre.jsonl')

new_label = 0

for i in ind_diff:
        
    print(i)
    print('index: ',i)
    print('premise:    ', data_1[i]['premise'])
    print('hypothesis: ', data_1[i]['hypothesis'])
    print('label:      ', data_1[i]['label_student1'])
    print('label2:     ', data_2[i]['label_student2'])

In [53]:
# agreement on disagreement values 
ground_truth = {
    0:'entailment',
    1:'neutral', 4:'entailment', 
    13:'entailment', 17:'entailment', 
    19:'entailment', 22:'neutral', 
    32:'entailment', 36:'entailment', 
    41:'entailment', 42:'neutral', 
    43:'neutral', 48:'entailment', 
    51:'contradiction', 53:'contradiction', 
    54:'contradiction', 61:'entailment', 
    62:'neutral', 63:'contradiction', 
    66:'entailment', 71:'entailment', 
    72:'neutral', 74:'contradiction',
    75:'neutral', 78:'neutral',
    80:'neutral', 81:'contradiction',
    83:'neutral', 86:'neutral',
    87:'contradiction', 90:'entailment',
    91:'neutral', 93:'neutral',
    94:'neutral', 96:'neutral'
    
}

In [55]:
# code to put all the information in the file 28-288275.jsonl
ROOT_PATH = "/content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2" 
# ROOT_PATH = "../A2" # Replace with your directory to A2 folder
data_dir = ROOT_PATH+'/nli_data'
data_dev_path = os.path.join(data_dir, '28-351555.jsonl')
data_dev_path_2 = os.path.join(data_dir, '28-288275.jsonl')

my_dict = {0: "entailment", 1: "neutral", 2: "contradiction"}
my_dict_inv = {"entailment": 0, "neutral":1 , "contradiction":2}


with jsonlines.open(data_dev_path) as reader_1:
    data_1 = list(reader_1)

with jsonlines.open(data_dev_path_2) as reader_2:
    data_2 = list(reader_2)


for i, d in enumerate(data_2):
    # print('index', i)
    d['label_student1'] = data_1[i]['label_student1']
    
    if d['label_student2'] == data_1[i]['label_student1']:
        d['label'] = d['label_student2']
 
    else:
        # print('disagrement index',i)
        d['label'] = ground_truth[i]
        
    # print(i,d['label'])
# write the label student 1 and label values in the document 28-288275.jsonl
with jsonlines.open(data_dev_path_2, mode='w') as writer:
    writer.write_all(data_2)

0 entailment
1 neutral
2 entailment
3 contradiction
4 entailment
5 entailment
6 contradiction
7 entailment
8 contradiction
9 neutral
10 contradiction
11 contradiction
12 contradiction
13 entailment
14 contradiction
15 entailment
16 entailment
17 entailment
18 neutral
19 entailment
20 entailment
21 entailment
22 neutral
23 neutral
24 neutral
25 contradiction
26 entailment
27 contradiction
28 neutral
29 neutral
30 contradiction
31 neutral
32 entailment
33 contradiction
34 neutral
35 contradiction
36 entailment
37 contradiction
38 contradiction
39 entailment
40 contradiction
41 entailment
42 neutral
43 neutral
44 entailment
45 contradiction
46 entailment
47 entailment
48 entailment
49 contradiction
50 entailment
51 contradiction
52 entailment
53 contradiction
54 contradiction
55 entailment
56 neutral
57 contradiction
58 neutral
59 entailment
60 entailment
61 entailment
62 neutral
63 contradiction
64 entailment
65 entailment
66 entailment
67 entailment
68 entailment
69 contradiction
70 ent

### **3.4 Robustness Check**

Take into account both your and your partner's annotations, determine the final labels of the 100 test datapoints, by editing the value of the key "label" in each of your datapoint.

Evaluate the performance of your developed model in 1.4 (still under the first hyperparameter setting) on your annotated 100 test datapoints, and compare with the model performance on the validation set.

> **Question**: Do you think that your developed model has a good robuestness of handling out-of-domain NLI predictions?

In [None]:
# fill your code here

In [50]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
epoch = 2
learning_rate = 2e-5
warmup_percent = 0.3
repo =  ROOT_PATH+f'/runs/lr{learning_rate}-warmup{warmup_percent}'

# load tokenizer
tokenizer_checkpoint = os.path.join(repo, 'tokenizer_epoch{}.pt'.format(epoch))
tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_checkpoint)

# load params model
model_checkpoint = os.path.join(repo, 'model_epoch{}.pt'.format(epoch))
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
state_dict = torch.load(model_checkpoint)
model.load_state_dict(state_dict)
_ = model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

In [59]:
################### Results on own data annotation #########################333
# Initialize the output files prediction for each domain
output_dir = os.path.join(ROOT_PATH, "nli_data/predictions")
output_file = os.path.join(output_dir, f"pred_own_annotation.jsonl")
batch_size = 50

# Evaluate and save prediction results
data_annotation = f"/nli_data/28-288275.jsonl"
data_annotation_dataset = NLIDataset(ROOT_PATH + data_annotation, tokenizer)

result_save_file = f'/nli_data/predictions/dev_pred_data_annotation.json'
dev_loss, acc, f1_ent, f1_neu, f1_con = evaluate(data_annotation_dataset, model, device,  batch_size, result_save_file= ROOT_PATH + result_save_file)

macro_f1 = (f1_ent + f1_neu + f1_con) / 3

print(f'Own annotation data')
print(f'Validation Loss: {dev_loss:.3f} | Accuracy: {acc*100:.2f}%')
print(f'F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building NLI Dataset...


100it [00:00, 1317.66it/s]
Evaluation: 100%|██████████| 2/2 [00:00<00:00, 27.51it/s]

Own annotation data
Validation Loss: 0.770 | Accuracy: 76.00%
F1: (81.01%, 59.26%, 83.58%) | Macro-F1: 74.62%





The model achieve hight F1 and accuracy score, so the model seems to be good to predict data outside its dataset.However the test need to be done with more than 100 new data to have a good conclusion.

## **Task4: Data Augmentation**

Finally, we consider to use a data augmentation method to create more training data, and use the augmented data to improve the model performance. The data augmentation method we are going to use is [EDA](https://aclanthology.org/D19-1670/).

### **4.1 EDA: Easy Data Augmentation algorithm for Text**

For this section, we will need to implement the most simple data augmentation techniques on textual sentences, including **SR** (Synonym Replacement), **RD** (Random Deletion), **RS** (Random Swap), **RI** (Random Insertion). 

You should complete all the functions in `eda.py` script, and you can test them with a simple testcase by running the following cell.

- **Synonym Replacement (SR)**
> In Synonym Replacement, we randomly replace some words in the sentence with their synonyms.

You can test whether you get the synonyms right and see an example with synonym replacement.

In [56]:
from eda import get_synonyms
from testA2 import test_get_synonyms

test_get_synonyms(get_synonyms)

The synonyms for the word "task" are:  ['chore', 'tax', 'undertaking', 'job', 'project', 'labor']


In [57]:
from eda import synonym_replacement

print(f" Example of Synonym Replacement: {synonym_replacement('hey man how are you doing',3)}")

 Example of Synonym Replacement: hey humankind how are you doing


- **Random Deletion (RD)**

> In Random Deletion, we randomly delete a word if a uniformly generated number between 0 and 1 is smaller than a pre-defined threshold. This allows for a random deletion of some words of the sentence.

In [59]:
from eda import random_deletion

print(f" Example of Random Deletion: {random_deletion('hey man how are you doing', p=0.3, max_deletion_n=3)}")

 Example of Random Deletion: hey how are you doing


- **Random Swap (RS)**
> In Random Swap, we randomly swap the order of two words in a sentence.

In [60]:
from eda import swap_word

print(f" Example of Random Swap: {swap_word('hey man how are you doing')}")

 Example of Random Swap: hey man how are doing you


- **Random Insertion (RI)**
> Finally, in Random Insertion, we randomly insert synonyms of a word at a random position.
> Data augmentation operations should not change the true label of a sentence, as that would introduce unnecessary noise into the data. Inserting a synonym of a word in a sentence, opposed to a random word, is more likely to be relevant to the context and retain the original label of the sentence.

In [61]:
from eda import random_insertion

print(f" Example of Random Insertion: {random_insertion('hey man how are you doing', n=2)}")

 Example of Random Insertion: hey man how are you doing perform


### **4.2 Augment Your Model**

Combine all the functions you have implemented in 4.1, you can come up with your own data augmentation pipeline with various p and n ;)

Next step is to expand the training data you used in Task1, re-train your model in 1.4 on your augmented data, and re-evaluate its performance on both the given validation set as well as on your manually annotated 100 test datapoints. 

Discuss the improvements that your data augmentation brings to your model. ***Include some examples of old vs. new model predictions to demonstrate the improvements.***

**Warning: In terms of data size and training time control, we stipulate that your augmented training data should not be larger than 100M.** (Currently the training data train.jsonl is about 25M.)

In [62]:
def aug(sent,n,p):
    print(f" Original Sentence : {sent}")
    print(f" SR Augmented Sentence : {synonym_replacement(sent, n)}")
    print(f" RD Augmented Sentence : {random_deletion(sent, p, n)}")
    print(f" RS Augmented Sentence : {swap_word(sent)}")
    print(f" RI Augmented Sentence : {random_insertion(sent,n)}")
    
aug('hey man how are you doing', p=0.2, n=2)

 Original Sentence : hey man how are you doing
 SR Augmented Sentence : hey gentlemans gentleman how are you doing
 RD Augmented Sentence : hey man how are you doing
 RS Augmented Sentence : hey doing how are you man
 RI Augmented Sentence : hey man how are be you doing


- Augment training dataset and Re-train your model
> Notes: you can decide on your own how much data you want to augment. But there are two pitfalls: i) by EDA, more augmentation means more noises, which not necessarily increases the performance; ii) more data means longer training time. Please balance your data scale and GPU time ;) 

In [65]:
# Iterate over the input samples and write them to the appropriate output file
with jsonlines.open(ROOT_PATH+"/nli_data/train.jsonl", "r") as reader:
    for sample in tqdm(reader.iter()):
        augment_sample_data(sample, 'train_augmented.jsonl' )

98176it [04:43, 346.83it/s]


In [64]:
# function to get 4 augmented sentence from a same premise, we augmented just 50% of the training dataset
def augment_sample_data(example, output_file):
    augmented_data = []
    augmentation = random.choice([0, 1])
        
    premise = example['premise']
    hypothesis = example['hypothesis']
    domain = example['domain']
    label = example['label']
    
    # init the augmented data
    original_data = {'premise': premise, 'hypothesis':hypothesis, 'domain':domain ,'label': label}
    augmented_data.append(original_data)

    # just augment the half of the data
    if (augmentation % 2) == 0:
      augmented_example_sr = {'premise': premise, 'hypothesis':hypothesis, 'domain':domain ,'label': label}
      augmented_example_rd = {'premise': premise, 'hypothesis':hypothesis, 'domain':domain ,'label': label}
      augmented_example_rs = {'premise': premise, 'hypothesis':hypothesis, 'domain':domain ,'label': label}
      augmented_example_ri = {'premise': premise, 'hypothesis':hypothesis, 'domain':domain ,'label': label}
      

      # get the augmented data
      augmented_example_sr['premise'] = synonym_replacement(premise, 2)
      augmented_example_rd['premise'] = random_deletion(premise, 0.2, 2)
      augmented_example_rs['premise'] = swap_word(premise)
      augmented_example_ri['premise'] = random_insertion(premise, 2)

      
      augmented_data.append(augmented_example_sr)
      augmented_data.append(augmented_example_rd)
      augmented_data.append(augmented_example_rs)
      augmented_data.append(augmented_example_ri)


    with jsonlines.open(ROOT_PATH+"/nli_data/"+ output_file, "a") as writer:
      writer.write(sample)
    

In [61]:
SCIPER = '288275'
seed = int(SCIPER)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


In [18]:
train_dataset = NLIDataset(ROOT_PATH+"/nli_data/train_augmented.jsonl", tokenizer)
dev_dataset = NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

Building NLI Dataset...


98176it [01:13, 1340.14it/s]


Building NLI Dataset...


9815it [00:07, 1387.82it/s]


In [21]:
batch_size = 16
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
model_save_root = ROOT_PATH+'/runs_augmented_model/'

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model_lr2 = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model_lr2.to(device)

learning_rate = 2e-5

train(train_dataset, dev_dataset, model_lr2, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

 path save repo /content/drive/MyDrive/Colab Notebooks/mnlp/a2-xav-nal/A2/runs_augmented_model/lr2e-05-warmup0.3


Training: 100%|██████████| 6136/6136 [04:03<00:00, 25.25it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.29it/s] 


Epoch: 0 | Training Loss: 0.949 | Validation Loss: 0.737
Epoch 0 NLI Validation:
Accuracy: 68.29% | F1: (73.30%, 62.55%, 68.02%) | Macro-F1: 67.95%
Model Saved at epoch 0


Training: 100%|██████████| 6136/6136 [04:01<00:00, 25.37it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.67it/s] 


Epoch: 1 | Training Loss: 0.647 | Validation Loss: 0.598
Epoch 1 NLI Validation:
Accuracy: 75.33% | F1: (78.91%, 70.85%, 75.94%) | Macro-F1: 75.23%
Model Saved at epoch 1


Training: 100%|██████████| 6136/6136 [04:03<00:00, 25.25it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.59it/s] 


Epoch: 2 | Training Loss: 0.486 | Validation Loss: 0.588
Epoch 2 NLI Validation:
Accuracy: 76.58% | F1: (80.63%, 71.63%, 76.55%) | Macro-F1: 76.27%
Model Saved at epoch 2


Training: 100%|██████████| 6136/6136 [04:01<00:00, 25.38it/s]
Evaluation: 100%|██████████| 614/614 [00:06<00:00, 98.86it/s] 

Epoch: 3 | Training Loss: 0.316 | Validation Loss: 0.695
Epoch 3 NLI Validation:
Accuracy: 75.96% | F1: (79.44%, 70.44%, 77.24%) | Macro-F1: 75.71%





In [60]:
################### Results on own data annotation #########################333
# Initialize the output files prediction for each domain
output_dir = os.path.join(ROOT_PATH, "nli_data/predictions")
output_file = os.path.join(output_dir, f"pred_own_annotation.jsonl")
batch_size = 50

# Evaluate and save prediction results
data_annotation = f"/nli_data/28-288275.jsonl"
data_annotation_dataset = NLIDataset(ROOT_PATH + data_annotation, tokenizer)

result_save_file = f'/nli_data/predictions/dev_pred_data_annotation.json'
dev_loss, acc, f1_ent, f1_neu, f1_con = evaluate(data_annotation_dataset, model_lr2, device,  batch_size, result_save_file= ROOT_PATH + result_save_file)

macro_f1 = (f1_ent + f1_neu + f1_con) / 3

print(f'Own annotation data')
print(f'Validation Loss: {dev_loss:.3f} | Accuracy: {acc*100:.2f}%')
print(f'F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building NLI Dataset...


100it [00:00, 1294.85it/s]
Evaluation: 100%|██████████| 2/2 [00:00<00:00, 27.62it/s]

Own annotation data
Validation Loss: 0.803 | Accuracy: 76.00%
F1: (80.49%, 57.69%, 84.85%) | Macro-F1: 74.34%





Unfortunately, all three models trained here are producing the same errors and metrics without any improvement despite changing learning rate and data augmentation. It may be an error in the training function. However, most of the time cases, data augmentation can significantly enhance the learning of the data, and the model is still able to produce consistent and accurate predictions.


Student Note: Due to problem with git lfs I put all the three model in the folder models

### **5 Upload Your Notebook, Data and Models**

Please **rename** your filled jupyter notebook as **your Sciper number** and upload it to your GitHub Classroom repository, **with all cells run and output results shown**.

**Note:** We are **not** responsible for re-running the cells in your notebook.

Please also submit all your processed (e.g., anotated and augmented) datasets, as well as all your trained models in Task 1 and Task 4, in your GitHub Classroom repository.

The datasets and models that you need to submit include:

**1. The best model checkpoint you trained in the Section 1.2 "Start Training and Validation!"**

**2. The best model prediction results in the Section 1.2 "Fine-Grained Validation"**

**3. Your annotated test dataset in the Section 3.2 "Annotate Your 100 Datapoints with Partner(s)"**

**4. Your augmented training data and best model checkpoint in the Section 4.2 "Augment Your Model"**

**Note:** You may need to use [GitHub LFS](https://edstem.org/eu/courses/379/discussion/27240) for submitting large files.