# Assignment 3: Transformers

In this homework, we will practise using the **🤗 Hugging Face** libraries for transformers and datasets for natural language processing. Follow the directions in each section to complete the expected parts. Take into account the following guidelines:

*   **Assignment Due Date:** <b><font color='red'>1401.03.09</font></b> 23:59:00
*   We always recommend co-operation and discussion in groups for assignments. However, each student has to finish all the questions by him/herself. If our matching system identifies any sort of copying, you'll be responsible for consequences.

*   The items you need to answer are highlighted in bold SeaGreen and the coding parts you need to implement are denoted by:
```
#######################
# Your implementation #
#######################
```
*   If you have any questions about this assignment, please do not hesitate to contact us, but first check the provided documentation links for Hugging Face and PyTorch.
*   This code requires some dependencies of the Colab platform. You are free to use any other platform that supports Jupyter notebooks, but there is no guarantee that the code snippets will work in your setting without modification.
*   You can double click on collapsed code cells to expand them.
*   <b><font color='red'>When you are ready to submit, please follow the instructions at the end of this notebook.</font></b>



In [None]:
#@title Import and Install Essential Packages

%pip install transformers tokenizers datasets

# Enable the following line if you want to see error stack in GPU mode
# %env CUDA_LAUNCH_BLOCKING=1

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np
import pandas as pd

from IPython.display import clear_output

clear_output()
print("Done!")


## Part 1: Take advantage of Transformers power

Previously, you tackled SNLI, a popular dataset for natural language inference, with recurrent neural networks. In this part, you will fine-tune an encoder-only transformer model on the same dataset and task.

In order to complete this section:
*   You should follow each step below and complete each block.
*   Run the model for **at least** one epoch.
*   Evaluate your model on the **test set** using accuracy metrics **at least after each 5000 optimization steps**.
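
As a minimal sketch of the accuracy computation, assume the evaluation loop hands over the model logits and the gold labels as NumPy arrays. The helper name `accuracy_from_logits` and the toy arrays are illustrative assumptions, not part of the assignment code:

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the gold label."""
    predictions = np.argmax(logits, axis=-1)
    return float((predictions == labels).mean())

# Toy logits for four examples and three classes (illustrative values only).
toy_logits = np.array([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.3, 0.2, 0.9],
                       [1.1, 0.4, 0.2]])
toy_labels = np.array([0, 1, 2, 1])

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

The same argmax-then-compare logic could be wrapped in a metrics callback, which would typically return a dictionary such as `{"accuracy": value}`.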

Hints: 
*   You can import the dataset using any technique that works best for you, although the SNLI dataset is available in the Hugging Face hub, and you can find instructions for loading the dataset using the `datasets` library [here](https://huggingface.co/datasets/snli).
*   If your code's runtime exceeds one hour, you can downscale the training corpus using the `Dataset.train_test_split` method of the `datasets` library.
*   You can use any strategy to fine-tune the model; however, we recommend the Trainer API because it will make your life easier.
*   You may also find the [Practical NLP Tutorial Notebooks](https://teias-courses.github.io/nlp00/schedule), which are available on the course schedule page, particularly useful in completing this section.

In [None]:
#@title Arguments
model_name = "roberta-base" #@param {type:"string"}
max_sequence_length = 256 #@param {type:"integer"}
output_directory = "roberta-snli-finetuned" #@param {type:"string"}
per_device_train_batch_size =  64#@param {type:"integer"}
per_device_eval_batch_size =  64#@param {type:"integer"}
learning_rate = 5e-5 #@param {type:"number"}
num_train_epochs = 2 #@param {type:"integer"}
evaluation_strategy = "steps" #@param {type:"string"}
logging_steps =  5000#@param {type:"integer"}


In [None]:
#@title Import Packages

#######################
# Your implementation #
#######################


clear_output()
print("Done!")

In [None]:
#@title Load the Dataset

#######################
# Your implementation #
#######################


clear_output()
dataset

In [None]:
#@title Load the Model

#######################
# Your implementation #
#######################


clear_output()
print("Done!")

In [None]:
#@title Tokenization

#######################
# Your implementation #
#######################


clear_output()
tokenized_dataset

In [None]:
#@title Compute Metrics
def compute_metrics(eval_preds):
    #######################
    # Your implementation #
    #######################


In [None]:
#@title Training Arguments

#######################
# Your implementation #
#######################


In [None]:
#@title Train the Model

#######################
# Your implementation #
#######################


In [None]:
#@title Plot the Confusion Matrix on Test Set


Answer the following questions:


1. Plot the confusion matrix. Compare the results to those of your prior RNN-based models.

2. Using the plot, analyze the model output. Where does the model perform badly?

3. Explain why you think the model performs poorly in this area.
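
For reference, the counts behind a confusion matrix can be sketched as below, assuming integer class labels and three classes. The function name and the toy label lists are hypothetical. Plotting is left out, and only the matrix itself is built:

```python
import numpy as np

def confusion_matrix_counts(labels, predictions, num_classes=3):
    """Rows index the gold label, columns index the predicted label."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for gold, pred in zip(labels, predictions):
        matrix[gold, pred] += 1
    return matrix

# Made-up gold labels and predictions for six examples.
toy_gold = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_matrix = confusion_matrix_counts(toy_gold, toy_pred)
print(toy_matrix)
```

Off-diagonal cells show which pairs of classes are confused with each other, which is the information the questions above ask you to interpret.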



**Answer The Question Here**

## Part 2: Custom Loss Function

When employing transformers (especially in research), it is common to want to tweak some aspects of the model's functionality. Suppose that the default cross entropy loss does not work well in your setting. You may consider using the focal loss, which applies a modulating term to the cross entropy loss in order to focus learning on hard, misclassified examples. 

Formally, the focal loss adds a factor $(1-p)^{\gamma}$ to the standard cross entropy criterion, where $p$ is the probability the model assigns to the true class and $\alpha$ is a scalar weight. Setting $\gamma > 0$ reduces the relative loss for well-classified examples ($p > 0.5$), putting more focus on hard, misclassified examples.

\begin{equation}
\mathrm{FL}(p) = -\alpha (1-p)^{\gamma} \log (p)
\end{equation}
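
To get a feel for the modulating factor, the following sketch evaluates the formula for a single example at a few values of $p$. It is plain arithmetic on scalars, not the batched PyTorch implementation the assignment asks for, and the helper names are assumptions made for illustration:

```python
import math

def cross_entropy_term(p):
    """Cross entropy for one example whose true class has probability p."""
    return -math.log(p)

def focal_term(p, alpha=1.0, gamma=2.0):
    """Focal loss for one example whose true class has probability p."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)

for p in (0.9, 0.5, 0.1):
    ce = cross_entropy_term(p)
    fl = focal_term(p)
    print(f"p={p:.1f}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")
```

With $\alpha = 1$ the ratio FL/CE is exactly $(1-p)^{\gamma}$, so for $\gamma = 2$ an easy example with $p = 0.9$ keeps only 1% of its cross entropy loss, while a hard example with $p = 0.1$ keeps 81%.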

In order to complete this section:

- Complete the following loss function. You can use the sanity check block to compare your answer to ours.
- Change the training pipeline to use your custom loss function instead of the cross entropy.

Hints:
- Understanding the `tensor.gather` operation will probably reduce the amount of code you write.
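
To illustrate what a gather along the class dimension does, here is a small sketch using NumPy's `np.take_along_axis`, which behaves analogously for this use case. The probability values are made up, and the PyTorch call in the comment is the assumed counterpart:

```python
import numpy as np

# Made-up class probabilities for three examples and three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.3, 0.5]])
labels = np.array([0, 2, 1])

# Pick one column per row, namely the column given by that row's label.
# Assumed PyTorch counterpart: probs.gather(1, labels.unsqueeze(1)).squeeze(1)
picked = np.take_along_axis(probs, labels[:, None], axis=1).squeeze(1)
print(picked)
```

Each entry of `picked` is the probability assigned to the true class of the corresponding row, which is the quantity $p$ in the formula above.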

In [None]:
#@title Custom Loss Function Implementation
def focal_loss(logits, labels, alpha=1.0, gamma=2):
  #######################
  # Your implementation #
  #######################


In [None]:
#@title Loss Function Sanity Check
labels = torch.tensor([0, 1, 2, 1, 2])
logits = torch.tensor([[1., 0, 0],
                       [1., 0, 0],
                       [0, 0, 1.],
                       [1., 0, 0],
                       [0, 0, 1.]])
loss = focal_loss(logits, labels)
assert (torch.abs(loss - 0.449) < 0.01), "Wrong implementation. Correct: {}. Yours: {}".format(0.449, loss) 

In [None]:
#@title Using the Custom Function in Your Training Pipeline

#######################
# Your implementation #
#######################



In [None]:
#@title Train the Model

#######################
# Your implementation #
#######################



In [None]:
#@title Plot the Confusion Matrix on Test Set


Answer the following questions:

1. Plot the confusion matrix and compare the results with the previous section.

2. What areas does the new model perform better (or worse) in, and why?

**Answer The Question Here**

## Part 3: Model Analysis and Interpretation

Now that you are familiar with fine-tuning transformer models and have obtained acceptable results with one, it is time to interpret our model. We can look at a transformer as a black-box function that takes an input in the form of string(s) and outputs a probability. As models increase in complexity, it gets more difficult to pinpoint the reason behind their behavior. However, it is paramount to strive to understand this, as models often behave in unexpected ways, which might jeopardize our intent in using them. 

As a starting point in model analysis, we will look into [Erasure](https://arxiv.org/pdf/1612.08220.pdf). Erasure is an early, yet effective method to find the sections of input that contribute to the model output, enabling us to pinpoint the tokens that effectively determine model behavior. 

For this task, we will implement a simplified version of Erasure using our previous transformer model fine-tuned on the SNLI task. 

To ease your work, we provide step-by-step instructions for implementing Erasure. Read the steps below carefully, as you are required to implement them. 



1.   Prepare two inputs for the SNLI task. (Example: "He is a firefighter" and "He has worked in the same department for 20 years"). 
2.   Feed your inputs to your fine-tuned transformer model and document the probability of one of the outputs. (Example: Label: 1, Probability: 0.76)
3.   Perturb your input by looping over the tokens in each sentence. In each step, remove one token, feed the perturbed inputs to the model, and document the change in the probability of the most probable output. 
4.   The change in probability is the contribution of that token to the model output. 

Note that a contribution can be either negative or positive; a negative contribution indicates that the word actually lowers the probability of an output.
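
The loop in steps 3 and 4 can be sketched on a toy example. Here `toy_probability` is a hypothetical stand-in for the fine-tuned model (a real implementation would tokenize both sentences and read the probability from the model's softmax output), and the word weights are invented purely to make the arithmetic visible:

```python
def toy_probability(tokens):
    """Hypothetical stand-in for the model: probability of the target label."""
    weights = {"not": -0.30, "happy": 0.25, "is": 0.05}
    score = 0.5 + sum(weights.get(tok, 0.0) for tok in tokens)
    return min(max(score, 0.0), 1.0)

def erasure_importances(tokens, prob_fn):
    """Importance of token i = p(full input) - p(input without token i)."""
    base = prob_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]  # drop exactly one token
        importances.append(base - prob_fn(perturbed))
    return importances

toy_tokens = ["she", "is", "not", "happy"]
toy_importances = erasure_importances(toy_tokens, toy_probability)
for tok, imp in zip(toy_tokens, toy_importances):
    print(f"{tok:>6s}  {imp:+.2f}")
```

In this toy setup, removing "not" raises the probability, so its contribution is negative, while removing "happy" lowers it, so its contribution is positive.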

With this information in mind, write your code below. 



In [None]:
# compute token importances towards predicting a target label
def compute_importances(text1, text2, target_label_index=0):
  """
  Computes importance scores for tokens in the input texts (premise and hypothesis). The target model is assumed to
  be global. If you want to change the function signature and pass the model as an argument,
  remember to fix the function calls in the visualization cells below.

  Args:
    text1: raw text, premise of the SNLI task
    text2: raw text, hypothesis of the SNLI task
    target_label_index: the rank of target label in the top predictions (i-th most probable label)

  Returns:
    tokens: tokenized input texts, concatenated. (list of str)
    importances: list of importance values for each token. (length = length of tokens)

  """
  #######################
  # Your implementation #
  #######################

  return tokens, importances

Great job! Now that you have written a function that computes importances, it is time to do some initial testing. To ease your work, we provide a function for visualizing the importances.

In [None]:
# @title Visualize importance
from matplotlib import pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as clr
from pylab import rcParams

from IPython.display import HTML

def visualize_attention(sentences, score_lists, color_maps='RdYlGn', rtl=False,
                        alpha=0.5, font_size=14, token_sep=' ', sentence_sep='<br/><br/>'):

  if isinstance(color_maps, str):
    color_maps = [color_maps] * len(sentences)

  span_sentences, style_sentences = [], []

  for s, tokens in enumerate(sentences):

    scores = score_lists[s]
    cmap = plt.get_cmap(color_maps[s])
    
    max_value = max(abs(min(scores)), abs(max(scores)))
    normer = clr.Normalize(vmin=-max_value/alpha, vmax=max_value/alpha)
    colors = [clr.to_hex(cmap(normer(x))) for x in scores]

    if len(tokens) != len(colors):
        raise ValueError("number of tokens and colors don't match")

    style_elems, span_elems = [], []
    for i in range(len(tokens)):
        style_elems.append(f'.c{s}-{i} {{ background-color: {colors[i]}; }}')
        span_elems.append(f'<span class="c{s}-{i}">{tokens[i]} </span>')

    span_sentences.append(token_sep.join(span_elems))
    style_sentences.append(' '.join(style_elems))
    text_dir = 'rtl' if rtl else 'ltr'

  return HTML(f"""<html><head><link href="https://fonts.googleapis.com/css?family=Roboto+Mono&display=swap" rel="stylesheet">
               <style>span {{ font-family: "Roboto Mono", monospace; font-size: {font_size}px; padding: 2px}} {' '.join(style_sentences)}</style>
               </head><body><p dir="{text_dir}">{sentence_sep.join(span_sentences)}</p></body></html>""")

Test your code by running the code below. 

In [None]:
#@title Implementation verification


# @title visualize importance

colormap = "RdYlGn" #@param ["RdYlGn", "bwr_r"]
alpha = 0.8 #@param {type:"number"}
right_to_left = False #@param {type:"boolean"}

your_tokens, your_importances = compute_importances('My brother plays football.','But my sister prefers basketball.', 0)
display(visualize_attention([your_tokens], [your_importances], color_maps=colormap, rtl=right_to_left, alpha=alpha))

Great job! Now that we have all our tools ready, let's run some tests and answer some questions. Run the code below to visualize your answers.

In [None]:
#@title Visualize importances

premise = "Football is the most popular sport in the world." #@param {type:"string"}
hypothesis = "Basketball is the most watched sport in the world" #@param {type:"string"}
target_label_index = 0 #@param {type:"integer"}

colormap = "RdYlGn" #@param ["RdYlGn", "bwr_r"]
alpha = 0.8 #@param {type:"number"}
right_to_left = False #@param {type:"boolean"}

tokens, importances = compute_importances(premise, hypothesis, target_label_index)
display(visualize_attention([tokens], [importances], color_maps=colormap, rtl=right_to_left, alpha=alpha))

Having visualized the example (or any other example you wish), answer the questions below. 



1.   Which words have the highest impact on the output of the model (either positive or negative)? For the top 3 words with the highest impact, explain why you think this is the case.
2.   Are there any words that you think should have had a higher impact? If so, explain why. If you can't find any, feel free to change the example and look for such words in another one, but make sure to mention that example too. 



**Type Your Answer Here**

Having seen a few examples and their analysis, let's think about this method more broadly. Answer the questions below. 



1.   Using the toolset above, find an example that yields counterintuitive results. For example, the model pays attention to words that you normally would not, or ignores words that you think are important. This might take a bit of testing to find. Provide your example here and explain why you think this happens. Are there any other words in the example that might be taking attention away from the important words?
2.   Having thought about the previous question, what do you think are the shortcomings of this method? Hint: Think about independence and Hindsight Bias.

3.   How do you think the aforementioned shortcomings can be addressed? Hint: Think about methods that don't require erasing a token. Explain your answer in detail. 



**Type Your Answer Here**

## Part 4: Few-Shot Learning

In a real-life setting, we often do not have access to enough data that correctly corresponds to the classes we might see at inference time. This is where few-shot learning (FSL) comes in. FSL basically consists of training the model on only a small sample of examples, using various techniques so that what it learns generalizes to inference time. 

In this part of your homework, we are going to implement two simple few-shot learners to get you acquainted with how this works. 

### Few-Shot Learning With the Custom Loss

A common approach to few-shot learning is to define a new loss function with excellent generalization capabilities. Focal loss, as explained above, is a great choice when we have either class imbalance or a low number of samples for each class. Thus, we are going to use a model trained with focal loss in this section. 

First, we need to change our data so that there are only a handful of samples for each class. Additionally, we are going to make our classes imbalanced and see how the focal loss performs. 

To start, complete the function below so that, given an SNLI instance as a `datasets` object (or a Pandas DataFrame; this is your choice), the dataset is reduced until 100 examples of class 0, 100 examples of class 1, and 32 examples of class 2 remain. Note that the default for `class_2` in the signature below is 16, so pass `class_2=32` explicitly to get this split. This ensures that your dataset is both in a few-shot setting and imbalanced. 
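
The idea of per-class subsampling can be sketched on a plain list of dictionaries. This is an illustration under stated assumptions, not the expected solution: the helper name `subsample_per_class`, the toy records, and the small quotas are all made up, and a real implementation would operate on the `datasets` or Pandas object you chose:

```python
import random
from collections import defaultdict

def subsample_per_class(examples, per_class, seed=0):
    """Keep at most per_class[label] randomly chosen examples of each label."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    reduced = []
    for label, quota in per_class.items():
        pool = by_label[label]
        reduced.extend(rng.sample(pool, min(quota, len(pool))))
    rng.shuffle(reduced)  # avoid having the classes in contiguous blocks
    return reduced

# Toy data: 50 examples for each of three labels (made-up placeholders).
toy_examples = [{"premise": f"p{i}", "hypothesis": f"h{i}", "label": i % 3}
                for i in range(150)]
toy_reduced = subsample_per_class(toy_examples, {0: 10, 1: 10, 2: 3})
print(len(toy_reduced))
```

Fixing the random seed keeps the reduced set identical across runs, which makes the later comparison between models easier to interpret.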

In [None]:
def reduce_dataset(dataset, class_0=100, class_1=100, class_2=16):
  ######################
  #Implement your Code Here
  ######################
  return reduced_dataset

######################
#Test your function Here 
######################

Great, now that you have reduced the dataset, let's see if a model can learn from it.
Write code below that initializes an instance of a regular RoBERTa or BERT model, trains it using focal loss, and tests it on the SNLI test set. 

Note that this is not much different from Part 2, so feel free to copy your code from there if you find that easier. 

In [None]:
######################
#Implement your Code Here
######################

Good job, now that you have trained and tested your model (only on a handful of data), we can observe how our model performs in a low-shot setting with imbalanced data.

First, draw a confusion matrix for the performance of our current model for better analysis of the two models.

In [None]:
######################
#Implement your Code Here
######################

Now answer the questions below.



1.   Which class performs the worst? Which class performs the best? What do you think is the reason behind this?
2.   Compare the performances of the model trained using the few-shot setting and the model trained earlier. Which model performs better? Does the other model perform reasonably well too? 
3.   How do you think the focal loss helps with imbalanced data and few-shot learning? 



**Answer The Questions Here**

### Transfer Learning

Another approach to few-shot (and zero-shot) learning is to take another model, previously trained on a similar task (such as a classification task that is close to our current task), and briefly fine-tune it to cater to our needs. 

In this section, we are going to make use of this. The tasks you have to carry out are as follows. 



1.   Go to [Hugging Face Hub](https://huggingface.co/models) and identify a model pre-trained/fine-tuned on a task that is similar to our current task. Needless to say, this task can't be SNLI itself as we are trying to do few-shot learning.
2.   Load the model that you have identified.
3.   Fine-tune the model with the proportions Class 0 = 100, Class 1 = 100, Class 2 = 100. 
4.   Test the model

Hint: Make sure that the classification head either fits or is replaced. Look at how pipeline classification works so you can get a better understanding of how to formulate this problem.
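
One way the head can fail to "fit" is when the chosen model uses the same classes in a different order. The sketch below shows how predictions could be remapped in that case. Both label orders are hypothetical and given for illustration only, so read the real ones from the dataset card and from the chosen checkpoint's configuration:

```python
# Hypothetical label orders, for illustration only.
target_labels = ["entailment", "neutral", "contradiction"]  # order used by our task
source_labels = ["contradiction", "neutral", "entailment"]  # order used by the chosen model

# Map each source class index to the index of the same class name in our task.
source_to_target = {i: target_labels.index(name)
                    for i, name in enumerate(source_labels)}

def remap_predictions(source_predictions):
    """Translate class indices predicted by the source model into our label ids."""
    return [source_to_target[p] for p in source_predictions]

remapped = remap_predictions([0, 1, 2, 2])
print(remapped)
```

If the class sets themselves differ (for example, a two-class head), remapping is not enough and the head has to be replaced and trained on the few-shot data.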



In [None]:
######################
#Implement your Code Here
######################

Good job. Now that you have trained, and tested this new approach, answer the questions below. 



1.   Explain how you think your chosen model, and the task it was trained on, is similar to our current task.
2.   How do you think the fact above helps the model generalize? 
3.   Compare the transfer learning model and the direct fine-tuning model in terms of performance. Which model performs better? What do you think is the reason behind this? 
4.   Note that the two approaches stated here are very simple methods for few-shot learning and may not work very well. What do you think are some other methods for applying few-shot learning to NLP tasks? Name and explain at least 2 other models. Google is your friend on this one. 




**Answer The Question Here**

# Fill Your Information

In [None]:
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = "" #@param {type:"string"}
student_name = "" #@param {type:"string"}

print("your student id:", student_id)
print("your name:", student_name)


from pathlib import Path

ASSIGNMENT_PATH = Path('asg04')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)

# Make Submission (Run the cells)

In [None]:
#@title Make submission
! pip install -U --quiet PyDrive > /dev/null
! pip install -U --quiet jdatetime > /dev/null

# ! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz 


import os
import time
import yaml
import json
import jdatetime

from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

asg_name = 'NLP_Assignment_3_Transformers'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
    Jupyter.notebook.save_checkpoint();
});
'''
# repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'nlp_asg03__%s__%s.zip'%(student_id, student_name.lower().replace(' ',  '_'))

sub_info = {
    'student_id': student_id,
    'student_name': student_name, 
    'dateime': str(jdatetime.date.today()),
    'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))

Javascript(script_save)

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name) 

! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null

print("##########################################")
print("Done! Submission created. Please download it using the cell below!")

In [None]:
drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']

In [None]:
files.download(submission_file_name)