# Homework 4

In this homework, we will practise using the **🤗 Hugging Face** libraries for transformers and datasets for natural language processing. Follow the directions in each section to complete the expected parts. Take into account the following guidelines:

*   **Assignment Due Date:** <b><font color='red'>1401.06.xx</font></b> 23:59:00
*   We always recommend co-operation and discussion in groups for assignments. However, each student has to answer all the questions on their own. If our matching system identifies any sort of copying, you will be responsible for the consequences.

*   The items you need to answer are highlighted in bold SeaGreen and the coding parts you need to implement are denoted by:
```
#######################
# Your implementation #
#######################
```
*   If you have any questions about this assignment, please do not hesitate to contact us, but check the provided documentation links for Hugging Face and PyTorch first.
*   This code depends on some features of the Colab platform. You are free to use any other platform that supports Jupyter notebooks, but there is no guarantee that the code snippets will work in your setting without modification.
*   You can double click on collapsed code cells to expand them.
*   <b><font color='red'>When you are ready to submit, please follow the instructions at the end of this notebook.</font></b>



In [None]:
#@title Import and Install Essential Packages

%pip install transformers tokenizers datasets

# Enable the following line if you want to see error stack in GPU mode
# %env CUDA_LAUNCH_BLOCKING=1

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np
import pandas as pd

from IPython.display import clear_output

clear_output()
print("Done!")

Done!


## Part 1: Advanced Interpretability

In the previous homework, we had the opportunity to work with Erasure, a simple and effective interpretation method that works by erasing the tokens one by one and observing the change in the probability of a selected label. We also had an early peek at more advanced forms of interpretation. In this section, we are going to work with some of the newer forms of saliency and discuss their advantages and disadvantages.
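
As a reminder of how Erasure works, here is a minimal sketch in plain Python. The scoring function `toy_positive_probability` and its word list are hypothetical stand-ins for a real classifier, used only so that the example runs without a model.

```python
def erasure_saliency(tokens, prob_fn):
    """Saliency of each token = drop in probability when that token is removed."""
    base = prob_fn(tokens)
    return [base - prob_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


# Hypothetical stand-in for a classifier: the "positive" probability grows
# with the share of words found in a tiny hand-written lexicon.
POSITIVE_WORDS = {"love", "great"}


def toy_positive_probability(tokens):
    if not tokens:
        return 0.5
    hits = sum(token in POSITIVE_WORDS for token in tokens)
    return 0.5 + 0.5 * hits / len(tokens)


toy_tokens = ["i", "love", "books"]
toy_saliency = erasure_saliency(toy_tokens, toy_positive_probability)
for token, score in zip(toy_tokens, toy_saliency):
    print(f"{token}: {score:+.3f}")
```

With a real model, `prob_fn` would run one forward pass per erased token, which is the main cost of this method.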

The good news is that you are not required to train any models in this section as we will provide the required model to you. 

Let's start by implementing and analyzing a form of interpretation called attention-based saliency, or attention-based interpretation.

## Part 1.1: Attention-Based Interpretation

As you already know, Transformer models rely heavily on the attention mechanism to make their inferences. More specifically, they use a mechanism called self-attention, in which the representation of each token is built by weighing the "importance" of the other tokens in the same sentence. The final representation is then used for classification.

One way to interpret the behavior of our model is to use these attention scores: a higher attention score means that the model prioritizes the token associated with that score.
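
To make the notion of an attention score concrete, the sketch below computes scaled dot-product attention weights for a few toy vectors in plain Python. The vectors and function names are made up for illustration, and the learned query/key projections that each head of a real model applies are omitted.

```python
import math


def softmax(xs):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return a [num_queries][num_keys] matrix of attention weights."""
    d = len(keys[0])
    weights = []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights


# Toy 2-dimensional vectors for three tokens (made-up values).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = attention_weights(toy_vectors, toy_vectors)
for row in toy_weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1, so a row can be read as how one token distributes its attention over all tokens in the sentence.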

To start things off, run the cell below to load a multilingual sentiment analysis BERT model. We have already configured this model to ease your work.

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, BertConfig
Config = BertConfig.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
tokenizer = BertTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = BertForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment", config = Config)

To give an explanation of the loaded model, we first briefly describe what it does. 

This is a classic sentiment analysis model that assigns an input to one of five classes, labeled 0 to 4, with 0 meaning extremely negative and 4 meaning extremely positive (these correspond to the model's 1-star to 5-star ratings).

Complete the function below such that, given a model, a tokenizer, a raw text, and a label, it returns an output containing the loss, logits, and attentions. Note that you have to perform an extra step for the model to return the attentions.

In [None]:
def return_output(model, tokenizer, raw_text, label) : 
  outputs = None
  ####################
  #Implement Your Code Here
  ####################
  return outputs
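# Keep the result in `outputs`: a later cell passes it to calculate_attention_scores.
outputs = return_output(model = model, tokenizer = tokenizer, raw_text = 'i love books', label = 4)
print(outputs)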

Great, now that you have your outputs and attentions, answer the questions below. 

1 - Note that the attentions returned by the model have a length of $X$. First, specify the value of $X$, then explain what it represents.

2 - Note that each element of the attentions itself has a length of $Y$. First, specify the value of $Y$, then explain what it represents.

3 - Note that the length of the final attention scores is greater than the length of our sentence. Why do you think this happens? Is this length always greater than the length of the provided sentence?

*** Type Your Answer Here ***

Great, now that you have an understanding of the attentions returned by the model, it is time to calculate the saliency for each input. To ease your work, we provide a simple algorithm that you can follow.

1 - Get the output of a sentence from the function you have written above. 

2 - Retrieve the attentions from the final layer of the model.

3 - Retrieve the attentions from every attention head of the final layer.

4 - Look at the attention scores with respect to the CLS token; we will use the attentions of this token to calculate our final scores.

5 - For every attention head, find the scores of each token with respect to the CLS token. 

6 - Average across attention heads to get the attention score for each token.
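
The head-averaging in steps 4 to 6 can be sketched on toy data as follows. The nested-list layout `[head][query][key]`, the function name, and the numbers are assumptions made for this sketch; real model outputs are tensors and may carry extra dimensions that you need to handle yourself.

```python
def cls_attention_scores(last_layer_attention, cls_index=0):
    """Average, over heads, the attention that the [CLS] query pays to each token.

    last_layer_attention: nested list indexed as [head][query_token][key_token].
    """
    num_heads = len(last_layer_attention)
    num_tokens = len(last_layer_attention[0][cls_index])
    scores = [0.0] * num_tokens
    for head in last_layer_attention:
        cls_row = head[cls_index]  # attention from [CLS] to every token
        for j, weight in enumerate(cls_row):
            scores[j] += weight / num_heads
    return scores


# Two hypothetical heads over three tokens ([CLS], "hello", [SEP]); made-up values.
toy_attention = [
    [[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.4, 0.1, 0.5], [0.6, 0.2, 0.2], [0.2, 0.2, 0.6]],
]
toy_scores = cls_attention_scores(toy_attention)
print([round(s, 3) for s in toy_scores])
```

Because each head's [CLS] row sums to 1, the averaged scores also sum to 1, which is a quick sanity check for your own implementation.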

Having this information, complete the function below such that given an output instance and an input text, attention scores are calculated for each token. 

In [None]:
def calculate_attention_scores(outputs, raw_text) : 
  '''
  outputs: an output instance received from the model
  raw_text: raw text that is fed into the model 

  returns : 
  attention_scores: A list of scores, the same length as the list of tokens 
  list_of_tokens = A list of tokens, tokenized
  '''
  ##############
  #Implement Your Code Here
  ##############
  return attention_scores, list_of_tokens

Great job, if you have managed to get the attention scores, it means that you have a deep understanding of transformer models. 

Now, as is our tradition, it is time to visualize the output. Run the code below to prepare visualization.

In [None]:
# @title Visualize importance
from matplotlib import pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as clr
from pylab import rcParams

from IPython.display import HTML

def visualize_attention(sentences, score_lists, color_maps='RdYlGn', rtl=False,
                        alpha=0.5, font_size=14, token_sep=' ', sentence_sep='<br/><br/>'):

  if type(color_maps) is str:
    color_maps = [color_maps] * len(sentences)

  span_sentences, style_sentences = [], []

  for s, tokens in enumerate(sentences):

    scores = score_lists[s]
    cmap = plt.get_cmap(color_maps[s])  # cm.get_cmap was removed in Matplotlib 3.9
    
    max_value = max(abs(min(scores)), abs(max(scores)))
    normer = clr.Normalize(vmin=-max_value/alpha, vmax=max_value/alpha)
    colors = [clr.to_hex(cmap(normer(x))) for x in scores]

    if len(tokens) != len(colors):
        raise ValueError("number of tokens and colors don't match")

    style_elems, span_elems = [], []
    for i in range(len(tokens)):
        style_elems.append(f'.c{s}-{i} {{ background-color: {colors[i]}; }}')
        span_elems.append(f'<span class="c{s}-{i}">{tokens[i]} </span>')

    span_sentences.append(token_sep.join(span_elems))
    style_sentences.append(' '.join(style_elems))
    text_dir = 'rtl' if rtl else 'ltr'

  return HTML(f"""<html><head><link href="https://fonts.googleapis.com/css?family=Roboto+Mono&display=swap" rel="stylesheet">
               <style>span {{ font-family: "Roboto Mono", monospace; font-size: {font_size}px; padding: 2px}} {' '.join(style_sentences)}</style>
               </head><body><p dir="{text_dir}">{sentence_sep.join(span_sentences)}</p></body></html>""")

Now that we have our visualization function ready, let's test our method. Run the cell below to visualize a pre-processed example that we have made for you.

In [None]:
#@title Visualize importances

raw_text = "i love books" #@param {type:"string"}

colormap = "bwr_r" #@param ["RdYlGn", "bwr_r"]
alpha = 0.8 #@param {type:"number"}
right_to_left = False #@param {type:"boolean"}

display(visualize_attention([['[CLS]', 'i' ,'love', 'books', '[SEP]']], [[0.06659328, 0.15865228, 0.31966418, 0.04583241, 0.40925785]], color_maps=colormap, rtl=right_to_left, alpha=alpha))

Great. Now, in order to test your code, run the cell below. The output must be very close to the output of the above code, as we are running the same model.

Note that if you get a vastly different result, there might be a mistake in your code (or in ours). 

In [None]:
attention_scores, tokens = calculate_attention_scores(outputs, 'i love books')
colormap = "bwr_r" #@param ["RdYlGn", "bwr_r"]
alpha = 0.8 #@param {type:"number"}
right_to_left = False #@param {type:"boolean"}

display(visualize_attention([tokens], [attention_scores], color_maps=colormap, rtl=right_to_left, alpha=alpha))

Congratulations, you have successfully deployed a simple attention-based interpretation method.

You should have a fairly deep understanding of transformer models by now. To test your knowledge, answer the questions below. 

1 - Take into consideration the task that our model performs, what kind of words does the model usually attend to when making decisions? Try a few examples and explain why this is the case. 

2 - Note that the [SEP] token is heavily attended to despite not being in the sentence in many cases. Hypothesize why this is the case. Explain your hypothesis. 

3 - Compare our new method to the previous method of erasure, what are the differences? What are the advantages of this new approach? Can you think of any disadvantages too? 

*** Type Your Answer Here *** 

## Part 2: Do Pre-trained Language Models Understand Syntactic Information?

In this section, we want to see if language models trained on a big corpus can tell us anything about a language's syntax. To do this, we use a POS tagging dataset to determine whether our model can recognize the POS tags of individual words in sentences.

First, run the following cells to download the dataset and corresponding labels, and import essential libraries. 


In [None]:
#@title Import the Libraries

%pip install transformers

from IPython.display import clear_output
from tqdm.notebook import tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from transformers import AutoModel, AutoTokenizer

import numpy as np
import pandas as pd

torch.manual_seed(1)
clear_output()
print("Done!")

Done!


In [None]:
#@title Download and Read the Dataset

!gdown 1-SF1kN-Ojv2XyLL8OwaO8CGmZVRT9NPK
!unzip dataset.zip

import os

PREFIX_PATH = '/content/dataset/en-ud-'

with open(PREFIX_PATH + "train.txt") as f:
    train_sentences = [line.strip().split() for line in f.readlines()]
with open(PREFIX_PATH + "test.txt") as f:
    test_sentences = [line.strip().split() for line in f.readlines()]

with open(PREFIX_PATH + "train.pos") as f:
    train_labels = [line.strip().split() for line in f.readlines()]
with open(PREFIX_PATH + "test.pos") as f:
    test_labels = [line.strip().split() for line in f.readlines()]
with open(PREFIX_PATH + "dev.pos") as f:
    dev_labels = [line.strip().split() for line in f.readlines()]

# take a fraction of the data
train_sentences = train_sentences[:round(len(train_sentences)*0.1)]
test_sentences = test_sentences[:round(len(test_sentences)*0.1)]
train_labels = train_labels[:round(len(train_labels)*0.1)]
test_labels = test_labels[:round(len(test_labels)*0.1)]

unique_labels = list(set.union(*[set(l) for l in train_labels + test_labels + dev_labels]))
label2index = dict()
for label in unique_labels:
    label2index[label] = label2index.get(label, len(label2index))

train_labels = [[label2index[l] for l in labels] for labels in train_labels]
test_labels = [[label2index[l] for l in labels] for labels in test_labels]

clear_output()
print(f"unique_labels: {unique_labels}")
print(f"train_sentences[0]: {train_sentences[0]}")
print(f"train_labels[0]: {train_labels[0]}")


unique_labels: ['SYM', 'PROPN', 'ADJ', 'VERB', 'NUM', 'DET', 'PART', 'ADV', 'PRON', 'INTJ', 'PUNCT', 'CCONJ', 'AUX', 'ADP', 'SCONJ', 'NOUN', 'X']
train_sentences[0]: ['Al', '-', 'Zaman', ':', 'American', 'forces', 'killed', 'Shaikh', 'Abdullah', 'al', '-', 'Ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'Qaim', ',', 'near', 'the', 'Syrian', 'border', '.']
train_labels[0]: [1, 10, 1, 10, 2, 15, 3, 1, 1, 1, 10, 1, 10, 5, 15, 13, 5, 15, 13, 5, 15, 13, 1, 10, 13, 5, 2, 15, 10]


## Part 2.1: Evaluate Quality of Representations for POS Tagging

To accomplish this, we need a dataset and a linguistic property to investigate. A probing setup also requires an auxiliary classifier: a simple linear classifier that takes a word representation and maps it to the label space. Finally, we need a pre-trained language model to study, which will be `roberta-base`.

**Note:** To begin, use your favourite deep learning framework to initialise the model, tokenizer, loss function, and optimizer. Keep in mind that your classifier only requires a single linear transformation layer.

In [None]:
#@title Create the Classifier

model_name = 'roberta-base'  #@param {type:"string"}

model =                     # Your implementation
tokenizer =                 # Your implementation

embedding_dim =             # Your implementation
num_labels = len(label2index)

# define classifier's architecture

#######################
# Your implementation #
#######################

def build_classifier(embedding_dim, num_labels):
    classifier =            # Your implementation
    criterion =             # Your implementation
    optimizer =             # Your implementation
    return classifier, criterion, optimizer

# build classifier
classifier, criterion, optimizer = build_classifier(embedding_dim, num_labels)

print(classifier)

With the prepared language model and tokenizer in hand, your objective in this section is to tokenize and encode sentences. Remember that each word only requires one representation, even if some words are split into several subtokens. Take the representation of the final subtoken in each word to solve this problem.

**Note:** To implement this section, use the `word_ids()` method of the encoding returned by a fast tokenizer (`PreTrainedTokenizerFast`).
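
As a small sketch of the last-subtoken selection, the function below works on a plain list shaped like the result of `word_ids()`: one entry per subtoken, with `None` for special tokens. The function name and the toy alignment are hypothetical; no tokenizer is called here.

```python
def last_subtoken_indices(word_ids):
    """Map each word index to the position of its last subtoken.

    word_ids: one entry per subtoken; None marks special tokens.
    """
    last_position = {}
    for position, word_id in enumerate(word_ids):
        if word_id is not None:
            # Later subtokens of the same word overwrite earlier ones.
            last_position[word_id] = position
    return [last_position[w] for w in sorted(last_position)]


# Hypothetical alignment: word 0 -> 2 subtokens, word 1 -> 1, word 2 -> 3.
toy_word_ids = [None, 0, 0, 1, 2, 2, 2, None]
toy_positions = last_subtoken_indices(toy_word_ids)
print(toy_positions)  # [2, 3, 6]
```

The returned positions can then be used to index the hidden states of one sentence, giving one vector per word. Note that a word removed by truncation has no subtoken left and would be missing from the result.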

In [None]:
#@title Tokenize Sentences

#######################
# Your implementation #
#######################

print(text_sentences[0], text_representations[0])

We will train the classifier on the linguistic annotations now that we have everything we need to run our experiment. In this section, you must finish the training loop using the representations that you prepared in the previous phase. Remember to train only the classification head and not the weights of the language model itself.

In [None]:
#@title Training Loop

def train(classifier, criterion, optimizer, inputs, labels, num_epochs=2):
  epoch_pbar = tqdm(range(num_epochs))
  for epoch in epoch_pbar:

    #######################
    # Your implementation #
    #######################

    # Update the progress bar once per epoch (inside the loop)
    epoch_pbar.set_postfix(total_loss=total_loss, average_loss=total_loss/total_count)

  return average_loss, accuracy

train(classifier, criterion, optimizer, text_representations, train_labels)

Now that you have the classifier ready, you need to evaluate the quality of the representations for POS tagging. To implement this part, generate representations for the tokens using the pre-trained model (i.e., the last hidden state). Use the generated representations to measure the accuracy of the classifier when predicting POS tags. In this experiment, we use the accuracy on this task as a measure of how well language models understand words' POS tags.
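
Accuracy here is counted per word over all sentences, not per sentence. A minimal sketch in plain Python, with a hypothetical helper name and made-up label ids:

```python
def token_accuracy(predicted, gold):
    """Fraction of words whose predicted tag id equals the gold tag id."""
    correct = 0
    total = 0
    for pred_sentence, gold_sentence in zip(predicted, gold):
        if len(pred_sentence) != len(gold_sentence):
            raise ValueError("each sentence needs one prediction per word")
        correct += sum(p == g for p, g in zip(pred_sentence, gold_sentence))
        total += len(gold_sentence)
    return correct / total if total else 0.0


# Made-up predictions and gold labels for two short sentences.
toy_predicted = [[1, 2, 3], [0, 0]]
toy_gold = [[1, 2, 0], [0, 1]]
toy_accuracy = token_accuracy(toy_predicted, toy_gold)
print(toy_accuracy)  # 0.6
```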

Q: Analyze the accuracy of your classifier. Can you identify whether the pre-trained model recognises POS tags?

*** Type Your Answer Here *** 

## Part 2.2: Performance of Different Layers

One of the major questions in neural network interpretability is how information is organized in different parts of the deep model, such as its layers. Utilize your implementations in previous sections to train different classifiers on each layer and evaluate those classifiers on the test set.

Q: Which layer has the best representations for classifying POS tags? Explain why you think this layer works better for this specific task. Do deeper layers' representations work best in all tasks?

*** Type Your Answer Here *** 

## Part 2.3: Dependency on Classifier's Architecture

Do you think that the probing accuracy is dependent on the probing model? Create a simple experiment to support your explanation.

Explain your experiment and setup and justify your method.

*** Type Your Answer Here *** 

## Part 2.4: Control Task

In this experiment, we find out how much of the performance in previous experiments was due to the syntactic information learned by the model and how much was due to the probe classifier. We use a method suggested by Hewitt and Liang (https://arxiv.org/pdf/1909.03368.pdf), in which we make a control task that has nothing to do with the POS tagging task and do the same thing on it. Then, we measure the layers' selectivity, which is the difference between how well they worked on the POS task and how well they worked on the control task. If a layer has learned a lot about the POS task in particular, it should be much better at the POS task than the control task; that is, it should have high selectivity. 
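
Selectivity is a per-layer subtraction, as the sketch below shows. The function name is hypothetical and the accuracy values are placeholders chosen for illustration, not measurements.

```python
def selectivity(pos_accuracies, control_accuracies):
    """Per-layer selectivity = POS-task accuracy minus control-task accuracy."""
    if len(pos_accuracies) != len(control_accuracies):
        raise ValueError("need one accuracy per layer for both tasks")
    return [p - c for p, c in zip(pos_accuracies, control_accuracies)]


# Placeholder accuracies for two layers (NOT real results).
toy_pos_accuracies = [0.9, 0.8]
toy_control_accuracies = [0.5, 0.7]
toy_selectivity = selectivity(toy_pos_accuracies, toy_control_accuracies)
print([round(s, 3) for s in toy_selectivity])
```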

What you should implement:

1. Make a new dataset that is similar to the original POS tagging dataset, except that each word is given a random POS tag and the tags are still spread out based on how often they were used in the original dataset. 
2. Run the same experiments in previous sections on the control task dataset.
3. Report the selectivity for each layer of the pre-trained model on the POS tagging dataset.
4. For more information you can read the Hewitt and Liang paper.
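
One possible way to build the control labels is sketched below. It assumes, following Hewitt and Liang, that the random tag is drawn once per word type and reused for every occurrence of that word; the function name, the seed argument, and the toy data are hypothetical.

```python
import random
from collections import Counter


def make_control_labels(sentences, labels, seed=0):
    """Assign each word type a random tag drawn from the empirical tag distribution."""
    rng = random.Random(seed)
    tag_counts = Counter(tag for sentence_tags in labels for tag in sentence_tags)
    tags = sorted(tag_counts)
    weights = [tag_counts[tag] for tag in tags]

    word_to_tag = {}
    control_labels = []
    for sentence in sentences:
        row = []
        for word in sentence:
            if word not in word_to_tag:
                # Sample once per word type, proportionally to tag frequency.
                word_to_tag[word] = rng.choices(tags, weights=weights, k=1)[0]
            row.append(word_to_tag[word])
        control_labels.append(row)
    return control_labels


# Toy data with string tags; integer tag ids work the same way.
toy_sentences = [["the", "cat", "saw", "the", "dog"], ["the", "dog", "ran"]]
toy_labels = [["DET", "NOUN", "VERB", "DET", "NOUN"], ["DET", "NOUN", "VERB"]]
toy_control = make_control_labels(toy_sentences, toy_labels, seed=0)
print(toy_control)
```

Fixing the seed keeps the control task reproducible, so every layer is probed against the same random labels.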



Q: Why is selectivity a better way to measure how much knowledge a language model stores? Compare each layer's selectivity to the accuracy of the original probe classifier.

*** Type Your Answer Here ***

## Fill your information (Run the cell)

In [None]:
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = "" #@param {type:"string"}
student_name = "" #@param {type:"string"}

print("your student id:", student_id)
print("your name:", student_name)


from pathlib import Path

ASSIGNMENT_PATH = Path('asg01')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)

## Make Submission (Run the cell)

In [None]:
#@title Make submission
! pip install -U --quiet PyDrive > /dev/null
! pip install -U --quiet jdatetime > /dev/null

# ! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz 


import os
import time
import yaml
import json
import jdatetime

from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

asg_name = 'NLP_Assignment_4'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
    Jupyter.notebook.save_checkpoint();
});
'''
# repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'dl_asg01__%s__%s.zip'%(student_id, student_name.lower().replace(' ',  '_'))

sub_info = {
    'student_id': student_id,
    'student_name': student_name, 
    'dateime': str(jdatetime.date.today()),
    'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))

Javascript(script_save)

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name) 

! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null

print("##########################################")
print("Done! Submission created. Please download it using the cell below!")

In [None]:
drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']

In [None]:
files.download(submission_file_name)
