In this problem set, we'll take a deep dive into language models.

Once again, you're free to run the notebook in your own environment, but I strongly recommend using Google Colab. You can upload this notebook to Google Colab by following the steps below.

1. Open [colab.research.google.com](https://colab.research.google.com)
2. Click on the upload tab
3. Upload the .ipynb file by choosing the right file from your local disk


**Submission instructions**

1. When you're ready to submit, save the notebook as `QTM340-PS3-Firstname-Lastname.ipynb`; for example, if your name is Harry Potter, save the file as `QTM340-PS3-Harry-Potter.ipynb`. You can do this in Google Colab by editing the filename and then choosing File --> Download --> .ipynb

2. Upload this file to Canvas.

**Objective**: In this notebook, you'll learn the following in a classification task:

a. To use bag of words representation as predictors (1 point)

b. To use static word representations as predictors (2 points)

c. To use contextual word representations as predictors (3 points)

d. To explain the strengths and weaknesses of each model (2 points)

Our task is to classify research papers into categories. We'll use the dataset hosted by [huggingface](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021).

## 0. Setup

Install all the required packages.

In [None]:
%%bash

pip install datasets
pip install transformers
pip install sentencepiece

Let's get all the libraries imported first.

In [None]:
from datasets import load_dataset
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, f1_score, confusion_matrix

import torch
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Now download the dataset and clean it up.


**Note** This may take a couple of minutes the first time you run it because the data has to be downloaded.

In [None]:
def convert2label (x):
  best_cat = x[0].split()[0]
  return best_cat.split ('.')[0]

required_cats = ['math', 'cs', 'astro-ph', 'physics', 'quant-ph']
dataset = load_dataset("gfissore/arxiv-abstracts-2021", split='train')
dataset = dataset.remove_columns (column_names=['submitter',
                                                'authors',
                                                'journal-ref',
                                                'doi',
                                                'report-no',
                                                'comments',
                                                'versions'])
df_dataset = pd.DataFrame(dataset)
df_dataset["cat"] = df_dataset.categories.apply (lambda x:convert2label (x))
original_df = df_dataset.copy (deep=True)
df_dataset = original_df.query ('cat in @required_cats')

# randomly pick 1500 examples
df_dataset = df_dataset.sample (n=1500, random_state=42)

You now have two variables of interest: `original_df`, which contains all the examples in the dataset, and `df_dataset`, which contains only the examples that belong to a fixed set of categories (as defined in `required_cats`).
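
To make the label cleanup concrete, here is a small sketch of what `convert2label` does to a few category entries. The sample inputs are made up for illustration and are not taken from the dataset; the function body is the same as in the cell above.

```python
def convert2label(x):
    # x is a list of category strings; keep the first listed category
    best_cat = x[0].split()[0]
    # Drop the subcategory after the dot, e.g. "math.CO" -> "math"
    return best_cat.split(".")[0]

# Hypothetical inputs, made up for illustration
examples = [["math.CO cs.DM"], ["astro-ph"], ["cs.CL", "stat.ML"], ["hep-th"]]
labels = [convert2label(x) for x in examples]
print(labels)

# Only labels listed in required_cats survive the query step
required_cats = ["math", "cs", "astro-ph", "physics", "quant-ph"]
kept = [label for label in labels if label in required_cats]
print(kept)
```

Only the first listed category counts, so a paper cross-listed under several areas still gets exactly one label.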

Next, we'll create a train (80%), validate (10%) and test (10%) split for our dataset.

In [None]:
# Split df_dataset into train, validate and test dataframes
train_df, test_df = train_test_split (df_dataset,
                                      train_size=0.9,
                                      random_state=42)

train_df, val_df = train_test_split (train_df,
                                     train_size=80/90,
                                     random_state=42)
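
The two-step split above may look odd at first (why `80/90`?), so here is a quick sketch of the arithmetic. It only multiplies the fractions and does not call scikit-learn; the row counts assume the 1500 sampled examples divide without rounding effects.

```python
from fractions import Fraction

first_split = Fraction(9, 10)    # train_size=0.9 in the first call
second_split = Fraction(80, 90)  # train_size=80/90 in the second call

train_frac = first_split * second_split      # kept in both splits
val_frac = first_split * (1 - second_split)  # held out in the second split
test_frac = 1 - first_split                  # held out in the first split
print(train_frac, val_frac, test_frac)

n_examples = 1500  # the sample size drawn earlier
print([int(f * n_examples) for f in (train_frac, val_frac, test_frac)])
```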

## 1. Bag of Words classification

We'll turn each title into bag-of-words features.
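
If the idea of a bag of words is new, the following hand-rolled sketch may help. It is a simplified stand-in for `CountVectorizer`, not the same implementation (the real vectorizer tokenizes differently and applies filters such as `min_df` and `max_df`), and the titles are made up for illustration.

```python
from collections import Counter

# Toy titles, made up for illustration
titles = ["Deep learning for galaxy classification",
          "Learning quantum error correction",
          "Galaxy formation and dark matter"]

# Lowercase and split on whitespace (a simplification of real tokenization)
tokenized = [t.lower().split() for t in titles]
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

X_toy = [bow_vector(doc, vocab) for doc in tokenized]
print(vocab)
for row in X_toy:
    print(row)
```

Each document becomes one row with one column per vocabulary word, which is the shape of input the classifier expects.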

In [None]:
# Initialize a vectorizer and classifier
vectorizer = CountVectorizer (input="content",
                              lowercase=True,
                              min_df=5,
                              max_df=0.75,
                              max_features=1000)
classifier = LogisticRegression (penalty="l2",
                                 C=0.1,
                                 max_iter=1000)

# Fit the vectorizer on the entire dataset;
# effectively, this line builds the vocabulary (the set of features)
vectorizer.fit (df_dataset["title"])

# Get the labels
y_train = train_df["cat"].values
y_val = val_df["cat"].values
y_test = test_df["cat"].values

# Get the bag-of-words representation for each document
X_bow_train = vectorizer.transform (train_df["title"])
X_bow_val = vectorizer.transform (val_df["title"])
X_bow_test = vectorizer.transform (test_df["title"])

# Now, let's fit the model
classifier.fit (X_bow_train, y_train)

# Use the trained classifier to do predictions
yhat_bow_val = classifier.predict (X_bow_val)

# Get the accuracy of the classifier
print (f"Accuracy in %: {100*accuracy_score (y_val, yhat_bow_val):.2f}")

# Get the classification report
print ("Classification report")
print (classification_report (y_val, yhat_bow_val))

**Sanity check** The bag-of-words features are quite predictive of the type of paper (60% accuracy); in comparison, a majority-class classifier -- one that predicts "math" for all examples -- will perform at 33% accuracy.
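
A majority-class baseline like the one mentioned above can be computed in a few lines. This is a sketch on made-up labels, so the number it prints says nothing about the real dataset; scikit-learn's `DummyClassifier(strategy="most_frequent")` offers the same behavior if you prefer a library route.

```python
from collections import Counter

# Hypothetical labels, made up for illustration
y_train_toy = ["math", "math", "cs", "math", "physics", "cs"]
y_val_toy = ["math", "cs", "math", "quant-ph"]

# The baseline always predicts the most frequent training label
majority_label, _ = Counter(y_train_toy).most_common(1)[0]
baseline_preds = [majority_label] * len(y_val_toy)
baseline_acc = sum(p == t for p, t in zip(baseline_preds, y_val_toy)) / len(y_val_toy)
print(majority_label, baseline_acc)
```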

**Your turn!**

Q1. Adapt the code above to find the best regularization hyperparameter (highest accuracy) of the classifier. Report the optimal parameter and write 2-3 sentences to interpret the optimal regularization parameter [0.5 points]

You'll tune the following parameter:

- C: The inverse of the regularization strength (smaller values mean stronger regularization). Try all the values from the set {0.001, 0.01, 0.1, 1.0, 10.0, 100.0}

Note that you'll have to calculate the accuracy on the validation set (not on the test set). You can learn about regularization [here](https://en.wikipedia.org/wiki/Regularization_(mathematics)) and about how it's controlled in scikit-learn's API documentation of logistic regression [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
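
To build some intuition for `C` before tuning it, here is a sketch on synthetic data (made up for illustration, unrelated to the arXiv dataset). It shows only how the size of the learned weights changes with `C`; it does not select a best value, which is still your job on the validation set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, made up for illustration: 200 points, 5 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

coef_norms = {}
for C in [0.001, 1.0, 100.0]:
    # The default penalty is L2; C is the inverse regularization strength
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_toy, y_toy)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<7} ||w|| = {coef_norms[C]:.3f}")
```

Small values of `C` shrink the weights toward zero, while large values let the model fit the training data more closely.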

In [None]:
# Your code in this cell.



Q2. For the best classifier from Q1, report the top 10 and the bottom 10 features for each class that are most and least predictive of the label, respectively. Give a brief explanation for why you see these features at the top and bottom. [0.5 points]

You can obtain the top 10 features by sorting them based on the coefficients learned by the classifier.
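
The sorting step can be sketched as follows. The feature names and weights here are hypothetical; in the notebook, the names would come from `vectorizer.get_feature_names_out()` and the weights from one row of `classifier.coef_` (rows follow the order of `classifier.classes_`).

```python
import numpy as np

# Hypothetical feature names and weights for one class, made up for illustration
feature_names = np.array(["algebra", "galaxy", "network", "proof", "quantum", "theorem"])
class_coefs = np.array([1.2, -0.9, -0.1, 0.8, -1.5, 2.0])

k = 2
order = np.argsort(class_coefs)          # indices sorted by ascending weight
bottom_k = feature_names[order[:k]]      # most negative weights
top_k = feature_names[order[::-1][:k]]   # most positive weights
print("top:", list(top_k))
print("bottom:", list(bottom_k))
```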

In [None]:
# Your code in this cell



## 2. Classification using type embeddings

We'll now learn an embedding for each word and then use these embeddings as features in the classification model. The embeddings are learned using [doc2vec](https://cs.stanford.edu/~quocle/paragraph_vector.pdf), a variation of word2vec that learns embeddings sensitive to the topic, or to some label, of every sentence.


We'll learn the parameters of the embedding model (i.e. word embeddings) from the abstracts and then construct the document embedding for the titles.

In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=100,
                                      min_count=5,
                                      epochs=15)

def read_corpus(iterable, tokens_only=False):
  for i, line in enumerate(iterable):
    tokens = gensim.utils.simple_preprocess(line)
    if tokens_only:
      yield tokens
    else:
      # For training data, add tags
      yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# Create the corpus in each split
train_corpus_abstracts = list(read_corpus(train_df["abstract"].values))

train_corpus_titles = list(read_corpus(train_df["title"].values, tokens_only=True))
val_corpus_titles = list(read_corpus(val_df["title"].values, tokens_only=True))
test_corpus_titles = list(read_corpus(test_df["title"].values, tokens_only=True))

model.build_vocab(train_corpus_abstracts)
model.train(train_corpus_abstracts,
            total_examples=model.corpus_count,
            epochs=model.epochs)

Now use `model.infer_vector` to get the vector representation of any document

**Your turn!**

Q1. Get the document vectors for every document in the train set to form the training matrix. Similarly construct the validation matrix and test matrix from documents in the validation and test corpus, respectively. [0.5 points]

The following example shows how to use the `model.infer_vector` function, which returns a single vector for the entire sequence.

In [None]:
vector = model.infer_vector(["physics", "is", "awesome"])
print (vector)
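
Each call returns one vector per document, so a feature matrix is just those vectors stacked as rows. Here is a sketch of the stacking pattern only; `fake_infer_vector` is a hypothetical stand-in for `model.infer_vector` so that the example runs without a trained model.

```python
import numpy as np

def fake_infer_vector(tokens, dim=4):
    """Hypothetical stand-in for model.infer_vector: returns a dim-sized vector."""
    # Deterministic toy values (the token count repeated dim times)
    return np.full(dim, float(len(tokens)))

toy_corpus = [["physics", "is", "awesome"],
              ["graph", "theory"],
              ["dark", "matter", "halo", "mass"]]

# One row per document, one column per embedding dimension
toy_matrix = np.vstack([fake_infer_vector(doc) for doc in toy_corpus])
print(toy_matrix.shape)
```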

In [None]:
import numpy as np
def corpus2staticmat (corpus:list, training=False) -> np.ndarray:
  """ The function will take a corpus i.e. a collection of documents
      and get the embedding for each document.

  :params:
  corpus (list): The corpus is in the form of a list. Every item
                 in the list is a document. If the training flag is set,
                 then a document contains two properties: words and tags;
                 if the flag is not set, then the document is simply
                 a list of words.

  training (bool): A boolean flag that indicates whether the data
                   is training or non-training data

  :returns:
  embeddings (np.array): The embeddings for each document are
                         rows in a matrix.
  """

  embeddings = None
  # Write your code below

  return embeddings

In [None]:
X_static_train = corpus2staticmat (train_corpus_titles, training=False)
X_static_val = corpus2staticmat (val_corpus_titles, training=False)
X_static_test = corpus2staticmat (test_corpus_titles, training=False)

**Your turn!**

Q2. Find the best classifier using the embeddings features. Once again, you'll find the best hyperparameter (based on accuracy) for vector size. [0.5 points]

The vector size is a parameter for the following function `gensim.models.doc2vec.Doc2Vec`.

- vector_size: Try values from the following set {25, 50, 100, 200}

In [None]:
# Now, let's fit the model
classifier.fit (X_static_train, y_train)

# Use the trained classifier to do predictions
y_static_val = classifier.predict (X_static_val)

# Get the accuracy of the classifier
print (f"Accuracy in %: {100*accuracy_score (y_val, y_static_val):.2f}")

# Get the classification report
print ("Classification report")
print (classification_report (y_val, y_static_val))

**Sanity check**: I get 58% accuracy on the validation set using 100-dimensional features, which isn't bad considering I only trained the doc2vec model for 15 epochs. There is also scope for improvement, especially in categories that are rare.

In [None]:
# Your code below for tuning the vector size parameter



Q3. Compare the best classifier using just the bag-of-words features and the classifier using doc2vec features. Which one is better in terms of accuracy? Briefly explain why. [1 point]

Your answer here:





## 3. Using contextual embeddings

Now we'll use the embeddings from a variation of BERT as features to the classifier.

The variation we'll use is called [SciBERT](https://arxiv.org/abs/1903.10676), which is the BERT model trained on scientific data such as research papers.

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased', output_hidden_states=True)

model.eval()

Once you have the SciBERT model loaded, you can get the contextual embeddings for any sentence in a number of ways. One way is to take the embedding for the [CLS] token from the last layer using the `last_hidden_state` attribute of the output.

Note: In general, if you want the embeddings at any hidden layer, you can use the `hidden_states` attribute of the output (available because we set `output_hidden_states=True`). It contains the token embeddings at every layer, ordered from the bottom layer to the topmost layer; the first entry is the output of the embedding layer.

Here's how to get the embeddings for the CLS token in any sequence.

In [None]:
with torch.no_grad():
  text = "Our paper measures the effect of eating ice-cream on happiness"
  encoded_input = tokenizer(text, return_tensors='pt')
  output = model(**encoded_input)

  # The [CLS] token is added at the start of the sentence,
  # which you can access by the token position 0
  # (the first zero is because we have only one sentence)
  print (output.last_hidden_state[0,0,:])
  print (output.last_hidden_state[0,0,:].size())

The above code should print the embedding output and the size of the embedding.

**Your turn!**

Q1. Adapt the code above to get the contextual embeddings for all the examples in train, validate and test sets [1 point]

You have to be careful with BERT-like models because they break if the input text exceeds 512 wordpieces after tokenization, so you want to set the following parameters when you call the tokenizer on the sequence:

- max_length to 512
- truncation to True
- padding to True
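
To see what these settings accomplish, here is a toy sketch in plain Python. `pad_and_truncate` is a hypothetical helper written for illustration, not part of the `transformers` API, and the token ids are made up. The real tokenizer also keeps special tokens such as [SEP] in place when it truncates, which this sketch ignores.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy stand-in for what truncation=True and padding=True accomplish."""
    # Truncation: cut every sequence down to at most max_length ids
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one left in the batch
    longest = max(len(ids) for ids in truncated)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0
    mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return padded, mask

# Hypothetical token ids, made up for illustration
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 102]]
padded, mask = pad_and_truncate(batch, max_length=5)
print(padded)
print(mask)
```

After this step every sequence in the batch has the same length, which is what lets a batch be packed into a single tensor.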

In [None]:
from tqdm import tqdm

def corpus2contextualmat (corpus, batch_size=32):
  embeddings = None
  # Your code below
  return embeddings

Now let's call the method that gives us the contextual embeddings as follows.


Note: This could take some time. Transformer models run fast on GPUs, but here we'll end up running everything on the CPU offered by the Colab server.

It takes roughly 7-8 minutes to run the cell below.

In [None]:
X_contextual_train = corpus2contextualmat (train_df["title"].values)
X_contextual_val = corpus2contextualmat (val_df["title"].values)
X_contextual_test = corpus2contextualmat (test_df["title"].values)

**Your turn!**

Q2. Report the accuracy by using the contextual embeddings of the titles. [0.5 points]

In [None]:
# Your code below


Q3. Instead of taking the contextual embeddings from the final layer, get the embeddings from the last 4 layers, average them and use them as features in the classifier. [1 point]

As mentioned, you can access the embeddings from individual layers using the `hidden_states` property of the output.
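
The layer-averaging idea can be sketched with NumPy arrays standing in for the model output. Everything below is a made-up stand-in: the real `hidden_states` holds torch tensors, and the 13 entries assume a 12-layer model plus the embedding output. With torch, `torch.stack(...).mean(dim=0)` plays the same role as the NumPy calls here.

```python
import numpy as np

# Hypothetical stand-in for output.hidden_states: one array per entry,
# each of shape (batch, tokens, hidden)
rng = np.random.default_rng(0)
n_entries, batch, tokens, hidden = 13, 2, 6, 8
hidden_states = tuple(rng.normal(size=(batch, tokens, hidden))
                      for _ in range(n_entries))

last_layers = 4
stacked = np.stack(hidden_states[-last_layers:])  # (4, batch, tokens, hidden)
averaged = stacked.mean(axis=0)                   # (batch, tokens, hidden)
cls_features = averaged[:, 0, :]                  # position 0 holds [CLS]
print(stacked.shape, averaged.shape, cls_features.shape)
```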

In [None]:
def corpus2contextualmat_averagedlayers (corpus, last_layers=4):
  """ Take the contextual embedding of any word as the average of the
      embeddings of the word from the last `last_layers` layers
      (4 by default).
  """
  embeddings = None
  # Your code below
  return embeddings

In [None]:
X_contextualaverage_train = corpus2contextualmat_averagedlayers (train_df["title"].values, last_layers=4)
X_contextualaverage_val = corpus2contextualmat_averagedlayers (val_df["title"].values, last_layers=4)
X_contextualaverage_test = corpus2contextualmat_averagedlayers (test_df["title"].values, last_layers=4)

Q4. Report the accuracy of the model with the features constructed above [0.5 points]

In [None]:
# Your code below



## 4. Testing on unseen data

Now you have three competing classifiers:

(a) The most optimized classifier that uses bag-of-words features to predict the type of paper

(b) The most optimized classifier that uses static word embeddings to predict the type of paper

(c) The most optimized classifier that uses contextual word embeddings to predict the type of paper

**Your turn!**

Q1. List 5 examples from the validation set that were misclassified by each of the classifiers. Explain in brief why the classifiers got the examples correct or incorrect. [0.5 points]

In answering the above question, you may want to think about the strengths and weaknesses of each of the classifiers.

Q2. Among the 3 competing classifiers, pick the one that has the highest accuracy. Use the classifier's output on the validation set to identify the true label that is misclassified the most. Report what it is misclassified as and explain in 2-3 sentences why this might be the case. [1 point]
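
One way to find the most-confused label is to count the (true, predicted) pairs that disagree. Here is a sketch on made-up labels and predictions; `confusion_matrix` from scikit-learn (already imported in the setup) gives the same counts as a table.

```python
from collections import Counter

# Hypothetical labels and predictions, made up for illustration
y_true_toy = ["math", "cs", "physics", "physics", "cs", "physics", "math"]
y_pred_toy = ["math", "math", "astro-ph", "astro-ph", "cs", "math", "math"]

# Count every (true, predicted) pair where the prediction is wrong
errors = Counter((t, p) for t, p in zip(y_true_toy, y_pred_toy) if t != p)

# The true label with the most errors overall
errors_per_true = Counter(t for t, _ in errors.elements())
worst_true, n_missed = errors_per_true.most_common(1)[0]

# Among that label's errors, the prediction it is most often mistaken for
pairs_for_worst = {pred: n for (true, pred), n in errors.items() if true == worst_true}
confused_with = max(pairs_for_worst, key=pairs_for_worst.get)
print(worst_true, n_missed, confused_with, pairs_for_worst[confused_with])
```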

Q3. Report the accuracy and F1 score of all the competing classifiers. [0.5 points]
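
Since this is a multi-class task, the F1 score needs an averaging scheme. The sketch below computes a macro-averaged F1 by hand on made-up labels, just to show what the number means. In the notebook you would more likely call `f1_score(y_true, y_pred, average="macro")`; the default `average="binary"` does not apply to multi-class labels. Whether macro, micro, or weighted averaging is the right choice is up to you to justify.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); use 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels and predictions, made up for illustration
y_true_toy = ["math", "math", "cs", "cs"]
y_pred_toy = ["math", "cs", "cs", "cs"]
print(macro_f1(y_true_toy, y_pred_toy))
```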




## 5. Extra credit [1 point]


Create the best classifier. You should use logistic regression but are free to try any other way to improve the performance of your classifier. Some suggestions include:

- Combine all features from the best models
- Add additional features
- Get pretrained embeddings from other models
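
For the first suggestion, combining features usually means placing the feature blocks side by side so that each document keeps one row. Here is a sketch with made-up dense arrays. Note that `CountVectorizer` returns a sparse matrix, so in the notebook you would either convert it with `.toarray()` first or use `scipy.sparse.hstack`.

```python
import numpy as np

# Hypothetical feature blocks for the same 3 documents, made up for illustration
X_bow_toy = np.array([[1, 0, 2], [0, 1, 0], [1, 1, 0]], dtype=float)
X_emb_toy = np.array([[0.1, -0.3], [0.7, 0.2], [-0.5, 0.4]])

# Rows must line up: row i of each block describes the same document
X_combined = np.hstack([X_bow_toy, X_emb_toy])
print(X_combined.shape)
```

Because the blocks can live on very different scales, it may be worth checking whether rescaling them helps the classifier.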

You should briefly explain what you did in building the classifier. Submissions will be evaluated on data other than the data provided to you, and the best 3 submissions will get the extra credit.