<a href="https://colab.research.google.com/github/teias-courses/nlp99/blob/gh-pages/assignments/NLP_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Assignment #02 - NLP with tight hands!
Deep Learning / Spring 1400, Khatam University

---

**Please pay attention to these notes:**
<br><br>


- **Assignment Due:** <b><font color='red'>1400.01.18</font></b> 23:59:00
- If you need any additional information, please review the assignment page on the course website.
- The items you need to answer are highlighted in <font color="SeaGreen">**bold SeaGreen**</font> and the coding parts you need to implement are denoted by:
```
# ------------------
# Put your implementation here     
# ------------------
```
- We always recommend co-operation and discussion in groups for assignments. However, **each student has to finish all the questions by him/herself**. If our matching system identifies any sort of copying, you'll be responsible for consequences.
- Students who audit this course should submit their assignments like other students to be qualified for attending the rest of the sessions.
- If you have any questions about this assignment, feel free to drop us a line. You may also post your questions on the course Microsoft Teams channel.
- You must run this notebook on the Google Colab platform, because it depends on the Google Colab VM for some of its dependencies.
- You can double click on collapsed code cells to expand them.
- <b><font color='red'>When you are ready to submit, please follow the instructions at the end of this notebook.</font></b>


<br>



# Introduction
In this assignment, we are going to get a taste of what NLP was like before the introduction of RNNs! The rules are simple: no **RNN** or **Transformer** architectures are allowed! But you are allowed to use everything else!

We will go through all different kinds of standard NLP tasks one by one. For each task we chose a well known relevant dataset. We also provide a very basic baseline for each task, so you can compare the performance of your model.

Two of these tasks will be evaluated right here in this notebook, but for the other three, you have to submit your results to a Kaggle competition. Are you able to secure the first place in the leaderboard? Let's see :) 

<b><font color='red'>There are no questions for you in this notebook, and you have freedom in your implementations, but note that you will have to defend your proposed method and ideas in a video call with one of the TAs.</font></b>

Let's begin! Run the following cell:


In [None]:
# @title Install, Load libs

!pip install --upgrade scikit-learn

import numpy as np
import pandas as pd
import torch, torchtext
import spacy
from sklearn.metrics import top_k_accuracy_score, f1_score, accuracy_score
from scipy.stats import pearsonr

import pickle
import re, os, string, typing, gc, json, random
from collections import Counter

from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

from IPython.display import clear_output

clear_output()
print('Done!')

# Yelp (Text Classification - Sentiment Analysis)

**Sentiment Analysis** is the process of detecting positive or negative sentiment in text. Sentiment analysis models focus on polarity (positive, negative, neutral) but also on feelings and emotions (angry, happy, sad, etc), urgency (urgent, not urgent) and even intentions (interested v. not interested). If polarity precision is important to you, you might consider expanding your polarity categories to include:

*   Very positive
*   Positive
*   Neutral
*   Negative
*   Very negative

In this assignment we focus only on binary sentiment analysis, so each sample is either positive or negative. Here we use **Yelp**, which is one of the most well-known datasets for sentiment analysis. Note that the dataset is available in the `torchtext` module, so we don't need to download it explicitly.

## Technical Notes:

* You can use any data for model training, except Yelp test set which is used for final evaluation.

* Your model should be compatible with our `predict_dataset` function, which returns the predicted sentiment for every example in the dataset. Alternatively, you can implement your own version of `predict_dataset`, as long as it returns the same output without shuffling the data.
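
The no-shuffling requirement matters because predictions are compared with labels by position. A minimal sketch of order-preserving batched prediction follows; `predict_in_order` and `toy_model` are hypothetical names used only for illustration, and the keyword rule merely stands in for a trained model.

```python
from typing import Callable, List, Sequence


def predict_in_order(predict_batch: Callable[[Sequence[str]], List[int]],
                     examples: Sequence[str],
                     batch_size: int = 2) -> List[int]:
    """Run `predict_batch` over consecutive slices so that output i matches example i."""
    predictions: List[int] = []
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]  # slices are taken in order, never shuffled
        predictions.extend(predict_batch(batch))
    return predictions


# Hypothetical stand-in for a trained model: class 1 if "good" appears, else class 0.
def toy_model(batch: Sequence[str]) -> List[int]:
    return [1 if "good" in text else 0 for text in batch]


reviews = ["good food", "bad service", "really good", "awful", "good value"]
preds = predict_in_order(toy_model, reviews)
print(preds)
```

Because batches are consecutive slices, the last batch may be smaller than `batch_size`, but the output length always equals the number of examples.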


## Our Baseline

In [None]:
#@title Load GloVe embeddings

EMBEDDING_DIM = 100 #@param {type:"integer"}
glove = torchtext.vocab.GloVe(name='6B', dim=EMBEDDING_DIM)

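word2id = glove.stoi
# <unk> gets the first appended row and <pad> the second, matching the order
# in which the baseline model appends its (unk, pad) embedding vectors.
UNK_TOKEN_ID = len(word2id)
word2id['<unk>'] = UNK_TOKEN_ID
PAD_TOKEN_ID = len(word2id)
word2id['<pad>'] = PAD_TOKEN_ID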

def convert_to_ids(tokens, pad=True, truncate=True, maxlen=32):
  ids = [word2id[token] if token in word2id else UNK_TOKEN_ID for token in tokens]
  if pad:
   ids = ids + [PAD_TOKEN_ID] * (maxlen-len(ids))
  if truncate:
   ids = ids[:maxlen]
  return ids

clear_output()
print('Done!')

In [None]:
#@title Load dataset

BATCH_SIZE = 128 #@param {type:"integer"}
MAXLEN = 64 #@param {type:"integer"}

class SentimentDataset(torch.utils.data.Dataset):
  def __init__(self, split_name='train', maxlen=160, label_map=None, num_examples=None):
    super().__init__()

    raw_dataset = torchtext.datasets.YelpReviewPolarity(split=split_name)
    tokenizer = spacy.load('en')

    # 'train', 'valid', 'test'
    if not num_examples:
      num_examples = len(raw_dataset)

    self.examples = []
    raw_data_iter = iter(raw_dataset)
    for i in range(num_examples):
      l, s = next(raw_data_iter)
      tokens = [t.text for t in tokenizer(s.lower(), disable=['parser','tagger','ner'])]
      self.examples.append((l, tokens))
    
    self.maxlen = maxlen
    self.label_map = label_map

  def __len__(self):
    return len(self.examples)

  def __getitem__(self, index):
    label, tokens = self.examples[index]
    token_ids = convert_to_ids(tokens, maxlen=self.maxlen)
    
    label_id = self.label_map[label]
    return (torch.tensor(token_ids),
            torch.tensor(label_id))

label_map = {1:0, 2:1}

yelp_train = SentimentDataset(split_name='train', 
                              maxlen=MAXLEN, 
                              label_map=label_map,
                              num_examples=55_000)

yelp_train, yelp_dev = torch.utils.data.random_split(yelp_train, [50_000, 5000])

yelp_test = SentimentDataset(split_name='test', 
                             maxlen=MAXLEN, 
                             label_map=label_map)

train_loader = torch.utils.data.DataLoader(yelp_train,
                                           batch_size=BATCH_SIZE,
                                           shuffle=True)

clear_output()
print('Done!')

In [None]:
#@title Define baseline model

class SentimentBaseline(torch.nn.Module):
  def __init__(self, word_embeddings, num_classes):
    super().__init__()
    
    embedding_dim = word_embeddings.dim

    self.embedding = torch.nn.Embedding(len(word2id), embedding_dim, padding_idx=PAD_TOKEN_ID)
    self.embedding.weight.data.copy_(torch.cat([
        word_embeddings.vectors,
        torch.rand_like(word_embeddings.vectors[:1]),  # unk
        torch.zeros_like(word_embeddings.vectors[:1])  # pad
      ], dim=0))

    self.classifier = torch.nn.Linear(embedding_dim, num_classes, bias=False)

  def forward(self, sent_toks):
    sent_emb = self.embedding(sent_toks)
    sent_mean = (sent_emb).mean(dim=1)
    logits = self.classifier(sent_mean)
    return logits

model = SentimentBaseline(word_embeddings=glove, num_classes=2).cuda()

for p in model.embedding.parameters():
  p.requires_grad = False

In [None]:
# @title Get predictions on dataset

@torch.no_grad()
def predict_dataset(model, dataset):
  data_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)
  all_labels, all_preds = [], []

  for sentence, labels in data_loader:
    logits = model(sentence.cuda())
    preds = torch.argmax(logits, dim=1)
    all_preds.append(preds)
    all_labels.append(labels)

  return (torch.cat(all_labels, dim=0).cpu().numpy(),
          torch.cat(all_preds, dim=0).cpu().numpy())

# test_labels, test_preds = predict_dataset(model, yelp_test)

In [None]:
#@title Training loop

EPOCHS = 5   #@param {type: "integer"}

optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss() 

for epoch in range(EPOCHS):

  train_losses, train_corrects, train_total = [], 0, 0
  progress = tqdm(train_loader,  desc=f'EPOCH {epoch+1}', mininterval=0.5)

  for i, (sentences, labels) in enumerate(progress):  # iterate over mini-batches

    optimizer.zero_grad()

    logits = model(sentences.cuda())
    loss = criterion(logits, labels.cuda())
    loss.backward()
    optimizer.step()

    preds = logits.argmax(dim=-1)
    train_corrects += (preds == labels.cuda()).sum().item()
    train_total += len(sentences)

    progress.set_postfix(loss=loss.item())
    train_losses.append(loss.item())

  dev_labels, dev_preds = predict_dataset(model, yelp_dev)
  progress.set_postfix(train_loss=sum(train_losses)/len(train_losses),
                       train_acc=train_corrects/train_total,
                       dev_acc=accuracy_score(dev_labels, dev_preds))
  progress.close()

## Your Model(s)

In [None]:
# ------------------
# Put your implementation here (Multi Cell)    
# ------------------

## Submit/Evaluate Model

In [None]:
test_labels, test_preds = predict_dataset(model, yelp_test)
accuracy = accuracy_score(test_labels, test_preds)
print(f'Accuracy on test set: {accuracy*100:.1f}')

# STS (Text Regression - Semantic Textual Similarity)

**Semantic Textual Similarity (STS)** measures how similar two pieces of text are in meaning. In the dataset used here, each sentence pair is assigned a similarity score from 0 to 5.

For this task, we use the STS-B dataset from the GLUE benchmark. You can take a look [here](http://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark) to find more relevant information, including a table with different baselines for comparison.
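
The baseline defined later in this section scores a sentence pair by rescaling cosine similarity from [-1, 1] to [0, 5] with `2.5 * (cos + 1)`. A minimal pure-Python sketch of that mapping, using made-up 2-dimensional vectors in place of averaged GloVe embeddings (the function names are hypothetical):

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_to_sts_score(cos: float) -> float:
    """Linearly rescale a cosine in [-1, 1] to a similarity score in [0, 5]."""
    return 2.5 * (cos + 1.0)


# Toy "sentence vectors" chosen only to hit the two ends and the middle of the scale.
same = cosine_to_sts_score(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
orthogonal = cosine_to_sts_score(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
opposite = cosine_to_sts_score(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
print(same, orthogonal, opposite)
```

Note that unrelated (orthogonal) vectors land in the middle of the scale rather than at the bottom, which is one reason a learned regressor may do better than this fixed mapping.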

## Technical Notes:

* You can use any data for model training, except STS test set which is used for final evaluation.

* Your model should be compatible with our `predict_dataset` function, which returns the predicted score for every example in the dataset. Alternatively, you can implement your own version of `predict_dataset`, as long as it returns the same output without shuffling the data.

* <b><font color='red'>After local evaluation, submit (upload) your results to [this competition](https://www.kaggle.com/t/40245eaca8ab4b19befe06f18af12b2f) and compete with your classmates.</font></b> 
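
Local evaluation in this section reports the Pearson and Spearman correlations between gold and predicted scores, computed with `scipy.stats`. The sketch below re-implements both in plain Python purely to show what is being measured; the helper names are hypothetical and the numbers are made up. Spearman is computed here as Pearson on average ranks.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.0, 1.0, 2.0, 3.0]
pred = [0.0, 1.0, 4.0, 9.0]  # monotonic but non-linear in the gold scores
print(f"pearson={pearson(gold, pred):.4f} spearman={spearman(gold, pred):.4f}")
```

Predictions that preserve the ordering of the gold scores reach a perfect Spearman score even when Pearson is below 1, so the two metrics can disagree.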

In [None]:
# @title Download dataset
!wget https://dl.fbaipublicfiles.com/glue/data/STS-B.zip
!unzip /content/STS-B.zip
clear_output()

## Our Baseline

In [None]:
#@title Load GloVe embeddings

EMBEDDING_DIM = 100 #@param {type:"integer"}
glove = torchtext.vocab.GloVe(name='6B', dim=EMBEDDING_DIM)

word2id = glove.stoi
PAD_TOKEN_ID = len(word2id)
word2id['<unk>'] = len(word2id)
UNK_TOKEN_ID = len(word2id)
word2id['<pad>'] = len(word2id)

def convert_to_ids(tokens, pad=True, truncate=True, maxlen=32):
  ids = [word2id[token] if token in word2id else UNK_TOKEN_ID for token in tokens]
  if pad:
   ids = ids + [PAD_TOKEN_ID] * (maxlen-len(ids))
  if truncate:
   ids = ids[:maxlen]
  return ids

clear_output()
print('Done!')

In [None]:
#@title Load dataset

BATCH_SIZE = 128 #@param {type:"integer"}
MAXLEN = 64 #@param {type:"integer"}

class STSDataset(torch.utils.data.Dataset):
  def __init__(self, split_path=None, maxlen=160, num_examples=None):
    super().__init__()

    df = pd.read_csv(split_path, error_bad_lines=False, sep='\t').dropna()
    num_examples = num_examples or len(df)
    df = df.iloc[:num_examples]
    
    tokenizer = spacy.load('en')

    self.examples = []

    for i, row in df.iterrows():
      score = row['score']
      s1, s2 = row['sentence1'], row['sentence2']
      tokens1 = [t.text for t in tokenizer(s1.lower(), disable=['parser','tagger','ner'])]
      tokens2 = [t.text for t in tokenizer(s2.lower(), disable=['parser','tagger','ner'])]
      
      self.examples.append((score, tokens1, tokens2))
    
    self.maxlen = maxlen

  def __len__(self):
    return len(self.examples)

  def __getitem__(self, index):
    score, tokens1, tokens2 = self.examples[index]
    token_ids1 = convert_to_ids(tokens1, maxlen=self.maxlen)
    token_ids2 = convert_to_ids(tokens2, maxlen=self.maxlen)
    
    return (torch.tensor(token_ids1),
            torch.tensor(token_ids2),
            torch.tensor(score))


sts_train = STSDataset(split_path='/content/STS-B/train.tsv', 
                       maxlen=MAXLEN)

sts_test = STSDataset(split_path='/content/STS-B/dev.tsv',
                      maxlen=MAXLEN)

train_loader = torch.utils.data.DataLoader(sts_train,
                                           batch_size=BATCH_SIZE,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(sts_test,
                                          batch_size=BATCH_SIZE,
                                          shuffle=False)  # keep evaluation order fixed
print("Done!")

In [None]:
#@title Define baseline model

import torch
from torch.autograd import Variable

class STSBaseline(torch.nn.Module):
  def __init__(self, word_embeddings):
    super().__init__()

    embedding_dim = word_embeddings.dim

    self.embedding = torch.nn.Embedding(len(word2id), embedding_dim, padding_idx=PAD_TOKEN_ID)
    self.embedding.weight.data.copy_(torch.cat([
      word_embeddings.vectors,
      torch.rand_like(word_embeddings.vectors[:1]),
      torch.zeros_like(word_embeddings.vectors[:1])], dim=0))
    
    self.cosine = torch.nn.CosineSimilarity(dim=-1)

  def forward(self, sent1_toks, sent2_toks):
    sent1_emb = self.embedding(sent1_toks)
    sent2_emb = self.embedding(sent2_toks)

    sent1_mean = (sent1_emb).mean(dim=1)
    sent2_mean = (sent2_emb).mean(dim=1)

    score = 2.5 * (self.cosine(sent1_mean, sent2_mean) + 1)
    
    return score

model = STSBaseline(glove).cuda()

for p in model.embedding.parameters():
  p.requires_grad = False

In [None]:
# @title Get predictions on dataset

@torch.no_grad()
def predict_dataset(model, dataset):
  data_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)
  all_scores, all_preds = [], []

  for s1, s2, scores in data_loader:
    preds = model(s1.cuda(), s2.cuda())
    all_preds.append(preds)
    all_scores.append(scores)

  return (torch.cat(all_scores, dim=0).cpu().numpy(),
          torch.cat(all_preds, dim=0).cpu().numpy())

# test_scores, test_preds = predict_dataset(model, sts_test)

## Your Model(s)

In [None]:
# ------------------
# Put your implementation here (Multi Cell)    
# ------------------

## Submit/Evaluate Model

In [None]:
#@title Pearson correlation
from scipy.stats import pearsonr

test_scores, test_preds = predict_dataset(model, sts_test)
pearson_coef = pearsonr(test_scores, test_preds)[0]
print(f'Pearson correlation coefficient: {pearson_coef:.4f}')

In [None]:
#@title Spearman correlation
from scipy.stats import spearmanr

test_scores, test_preds = predict_dataset(model, sts_test)
spearman_coef = spearmanr(test_scores, test_preds)[0]
print(f'Spearman correlation coefficient: {spearman_coef:.4f}')

In [None]:
# @title Scatter plot (real, predicted scores)
import matplotlib.pyplot as plt

plt.plot(test_scores, test_preds, '.')
plt.xlabel('Real similarity scores')
plt.ylabel('Predicted scores')

plt.show()

In [None]:
# @title Make kaggle submission file
from google.colab import files

with open('sts_submission.csv', 'w') as f:
  S = 'id,score\n'
  for i, score in enumerate(test_preds):
    S += f'{i},{score}\n'
  f.write(S[:-1])

files.download('sts_submission.csv')

# SNLI (Text Classification - Natural Language Inference)

**Natural language inference (NLI)** is a fundamental NLP task, investigating the entailment relationship between two texts.

The **SNLI corpus** is a collection of 570k human-written English sentence pairs, manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI). Take a look [here](https://nlp.stanford.edu/projects/snli/) to find more relevant information.
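
Some SNLI pairs have no majority label and carry `-` as their gold label; the dataset loader in this section keeps only the three real classes. A minimal sketch of that filtering and label encoding, with made-up sentence pairs and a hypothetical `encode_labels` helper (the label ids follow the `label_map` used by the baseline):

```python
from typing import List, Tuple

LABEL_MAP = {"contradiction": 0, "neutral": 1, "entailment": 2}


def encode_labels(rows: List[Tuple[str, str, str]]) -> List[Tuple[str, str, int]]:
    """Drop rows whose gold label is not one of the three classes and encode the rest."""
    return [(premise, hypothesis, LABEL_MAP[gold])
            for gold, premise, hypothesis in rows
            if gold in LABEL_MAP]


# Made-up rows in (gold_label, sentence1, sentence2) order.
rows = [
    ("entailment", "A man plays a guitar.", "A person makes music."),
    ("-", "Two dogs run.", "The animals are tired."),
    ("contradiction", "A woman is sleeping.", "A woman is running."),
    ("neutral", "A child smiles.", "The child got a present."),
]
encoded = encode_labels(rows)
print([label for _, _, label in encoded])
```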

## Technical Notes:

* You can use any data for model training, except SNLI test set which is used for final evaluation.

* Your model should be compatible with our `predict_dataset` function, which returns the predicted class for every example in the dataset (`numpy array with shape: (NUM_EXAMPLES,)`). Alternatively, you can implement your own version of `predict_dataset`, as long as it returns the same output without shuffling the data.

* <b><font color='red'>After local evaluation, submit (upload) your results to [this competition](https://www.kaggle.com/t/f394320514e24db3b668982d44164510) and compete with your classmates.</font></b> 
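
The submission cell at the end of this section writes one `id,label` row per test example, in dataset order. As a rough sketch of that layout (the `build_submission` helper is hypothetical, and the exact format the competition accepts should be checked on its Kaggle page):

```python
import csv
import io
from typing import Sequence


def build_submission(predictions: Sequence[int],
                     header: Sequence[str] = ("id", "label")) -> str:
    """Return CSV text with a header row and one (index, prediction) row per example."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for example_id, label in enumerate(predictions):
        writer.writerow([example_id, label])
    return buffer.getvalue()


csv_text = build_submission([2, 0, 1])
print(csv_text)
```

Since the row id is just the position in the prediction array, any shuffling before prediction would silently misalign the submission.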

In [None]:
#@title Download dataset
!wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip
!unzip snli_1.0.zip
clear_output()

## Our Baseline

In [None]:
#@title Load GloVe embeddings

EMBEDDING_DIM = 100 #@param {type:"integer"}
glove = torchtext.vocab.GloVe(name='6B', dim=EMBEDDING_DIM)

word2id = glove.stoi
PAD_TOKEN_ID = len(word2id)
word2id['<unk>'] = len(word2id)
UNK_TOKEN_ID = len(word2id)
word2id['<pad>'] = len(word2id)

def convert_to_ids(tokens, pad=True, truncate=True, maxlen=32):
  ids = [word2id[token] if token in word2id else UNK_TOKEN_ID for token in tokens]
  if pad:
   ids = ids + [PAD_TOKEN_ID] * (maxlen-len(ids))
  if truncate:
   ids = ids[:maxlen]
  return ids

clear_output()
print('Done!')

In [None]:
#@title Load dataset

BATCH_SIZE = 128 #@param {type:"integer"}
MAXLEN = 64 #@param {type:"integer"}

class SNLIDataset(torch.utils.data.Dataset):
  def __init__(self, split_path=None, maxlen=160, label_map=None, num_examples=None):
    super().__init__()
    
    df = pd.read_csv(split_path, sep="\t").dropna(subset=["gold_label", "sentence1", "sentence2"])
    df = df[(df['gold_label'] == 'contradiction')|(df['gold_label'] == 'neutral')|(df['gold_label'] == 'entailment')]
    num_examples = num_examples or len(df)
    df = df.iloc[:num_examples]
    
    tokenizer = spacy.load('en')

    self.examples = []

    for i, row in df.iterrows():
      label = row['gold_label']
      s1, s2 = row['sentence1'], row['sentence2']
      tokens1 = [t.text for t in tokenizer(s1.lower(), disable=['parser','tagger','ner'])]
      tokens2 = [t.text for t in tokenizer(s2.lower(), disable=['parser','tagger','ner'])]
      
      self.examples.append((label, tokens1, tokens2))
    
    self.maxlen = maxlen
    self.label_map = label_map

  def __len__(self):
    return len(self.examples)

  def __getitem__(self, index):
    label, tokens1, tokens2 = self.examples[index]
    token_ids1 = convert_to_ids(tokens1, maxlen=self.maxlen)
    token_ids2 = convert_to_ids(tokens2, maxlen=self.maxlen)
    
    label_id = self.label_map[label]
    return (torch.tensor(token_ids1),
            torch.tensor(token_ids2),
            torch.tensor(label_id))


label_map = {'contradiction':0,
             'neutral':1,
             'entailment':2}

snli_train = SNLIDataset(split_path='/content/snli_1.0/snli_1.0_train.txt', 
                        maxlen=MAXLEN, 
                        label_map=label_map,
                        num_examples=50000)

snli_dev = SNLIDataset(split_path='/content/snli_1.0/snli_1.0_dev.txt', 
                        maxlen=MAXLEN, 
                        label_map=label_map,
                        num_examples=5000)

snli_test = SNLIDataset(split_path='/content/snli_1.0/snli_1.0_test.txt', 
                        maxlen=MAXLEN, 
                        label_map=label_map)

train_loader = torch.utils.data.DataLoader(snli_train,
                                           batch_size=BATCH_SIZE,
                                           shuffle=True)
print("Done!")

In [None]:
#@title Define baseline model

class NLIBaseline(torch.nn.Module):
  def __init__(self, word_embeddings, num_classes=3):
    super().__init__()

    embedding_dim = word_embeddings.dim

    self.embedding = torch.nn.Embedding(len(word2id), embedding_dim, padding_idx=PAD_TOKEN_ID)
    self.embedding.weight.data.copy_(torch.cat([
      word_embeddings.vectors,
      torch.rand_like(word_embeddings.vectors[:1]),
      torch.zeros_like(word_embeddings.vectors[:1])], dim=0))

    self.predictor = torch.nn.Linear(embedding_dim*2, num_classes, bias=False)

  def forward(self, sent1_toks, sent2_toks):
    sent1_emb = self.embedding(sent1_toks)
    sent2_emb = self.embedding(sent2_toks)
    
    sent1_mean = (sent1_emb).mean(dim=1)
    sent2_mean = (sent2_emb).mean(dim=1)
    
    conc = torch.cat([sent1_mean, sent2_mean], dim=-1)
    logits = self.predictor(conc)
    return logits


model = NLIBaseline(glove, num_classes=3).cuda()

for p in model.embedding.parameters():
  p.requires_grad = False

In [None]:
# @title Get predictions on dataset

@torch.no_grad()
def predict_dataset(model, dataset):
  data_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)
  all_labels, all_preds = [], []

  for sent1, sent2, label in data_loader:
    logits = model(sent1.cuda(), sent2.cuda())
    preds = torch.argmax(logits, dim=1)
    all_preds.append(preds)
    all_labels.append(label)

  return (torch.cat(all_labels, dim=0).cpu().numpy(),
          torch.cat(all_preds, dim=0).cpu().numpy())

# test_labels, test_preds = predict_dataset(model, snli_test)

In [None]:
#@title Training loop

LR = 0.01 #@param {type: "number"}
EPOCHS = 10   #@param {type: "integer"}

optimizer = torch.optim.Adam(model.parameters(), lr=LR) 
criterion = torch.nn.CrossEntropyLoss() 

for epoch in range(EPOCHS):

  train_losses, train_corrects, train_total = [], 0, 0
  progress = tqdm(train_loader,  desc=f'EPOCH {epoch+1}', mininterval=0.5)

  for i, (sent1, sent2, label) in enumerate(progress):  # iterate over mini-batches

    optimizer.zero_grad()

    logits = model(sent1.cuda(), sent2.cuda())
    loss = criterion(logits, label.cuda())
    loss.backward()
    optimizer.step()

    preds = logits.argmax(dim=-1)
    train_corrects += (preds == label.cuda()).sum().item()
    train_total += len(sent1)

    progress.set_postfix(loss=loss.item())
    train_losses.append(loss.item())

  dev_labels, dev_preds = predict_dataset(model, snli_dev)
  progress.set_postfix(train_loss=sum(train_losses)/len(train_losses),
                       train_acc=train_corrects/train_total,
                       dev_acc=accuracy_score(dev_labels, dev_preds))
  progress.close()

## Your Model(s)

In [None]:
# ------------------
# Put your implementation here (Multi Cell)    
# ------------------

## Submit/Evaluate Model

In [None]:
test_labels, test_preds = predict_dataset(model, snli_test)
accuracy = accuracy_score(test_labels, test_preds)
print(f'Accuracy on test set: {accuracy*100:.1f}')

In [None]:
# @title Make kaggle submission file
from google.colab import files

with open('snli_submission.csv', 'w') as f:
  S = 'id,label\n'
  for i, label in enumerate(test_preds):
    S += f'{i},{label}\n'
  f.write(S[:-1])

files.download('snli_submission.csv')

# UDPoS (Sequence Labeling - Part of Speech Tagging)

**Part of Speech tagging (PoS tagging)** is the task of tagging a word in a text with its part of speech. A part of speech is a category of words with similar grammatical properties. Common English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.

**Universal Dependencies (UD)** is a framework for cross-linguistic grammatical annotation, which contains more than 100 treebanks in over 60 languages. In this assignment we are targeting the English dataset (which is also available in `torchtext`). Look [here](https://universaldependencies.org/u/pos/index.html) for more information.

## Technical Notes:

* You can use any data for model training, except English UDPoS test set which is used for final evaluation.

* Your model should be compatible with our `predict_dataset` function, which returns the predicted PoS tag for every token of every example in the dataset. Alternatively, you can implement your own version of `predict_dataset`, as long as it returns the same output without shuffling the data.
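
This task is evaluated with macro-averaged F1 over real tokens only, so padded positions (label `-1` in the baseline) must be excluded before scoring. The notebook uses `sklearn.metrics.f1_score` for this; the plain-Python sketch below only illustrates the computation, with a hypothetical `macro_f1` helper and made-up label ids.

```python
from typing import Sequence

PAD_TARGET_LABEL = -1  # same padding label value as in the baseline


def macro_f1(labels: Sequence[int], preds: Sequence[int],
             pad_label: int = PAD_TARGET_LABEL) -> float:
    """Unweighted mean of per-class F1, ignoring positions whose gold label is padding."""
    pairs = [(gold, pred) for gold, pred in zip(labels, preds) if gold != pad_label]
    classes = sorted({gold for gold, _ in pairs} | {pred for _, pred in pairs})
    f1_scores = []
    for cls in classes:
        tp = sum(1 for gold, pred in pairs if gold == cls and pred == cls)
        fp = sum(1 for gold, pred in pairs if gold != cls and pred == cls)
        fn = sum(1 for gold, pred in pairs if gold == cls and pred != cls)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


gold_tags = [0, 0, 1, 2, -1, -1]  # last two positions are padding
pred_tags = [0, 1, 1, 2, 0, 0]    # predictions at padded positions are ignored
score = macro_f1(gold_tags, pred_tags)
print(f"{score:.4f}")
```

Because every class counts equally, rare tags influence macro F1 as much as frequent ones, unlike plain accuracy.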

## Our Baseline

In [None]:
#@title Load GloVe embeddings

EMBEDDING_DIM = 100 #@param {type:"integer"}
glove = torchtext.vocab.GloVe(name='6B', dim=EMBEDDING_DIM)

word2id = glove.stoi
PAD_TOKEN_ID = len(word2id)
word2id['<unk>'] = len(word2id)
UNK_TOKEN_ID = len(word2id)
word2id['<pad>'] = len(word2id)

def convert_to_ids(tokens, pad=True, truncate=True, maxlen=32):
  ids = [word2id[token] if token in word2id else UNK_TOKEN_ID for token in tokens]
  if pad:
   ids = ids + [PAD_TOKEN_ID] * (maxlen-len(ids))
  if truncate:
   ids = ids[:maxlen]
  return ids

clear_output()
print('Done!')

In [None]:
#@title Load dataset

from collections import Counter

BATCH_SIZE = 128 #@param {type:"integer"}
MAXLEN = 64 #@param {type:"integer"}
PAD_TARGET_LABEL = -1


class PosDataset(torch.utils.data.Dataset):
  def __init__(self, split_name='train', maxlen=160, label_map=None):
    super().__init__()
    self.label_map = label_map

    raw_dataset = torchtext.datasets.UDPOS(split=split_name)

    self.examples = list(raw_dataset)
    self.maxlen = maxlen

  def __len__(self):
    return len(self.examples)

  def __getitem__(self, index):
    tokens, udpos, _ = self.examples[index]
    token_ids = convert_to_ids(tokens, maxlen=self.maxlen)
    label_ids = [self.label_map[label] for label in udpos][:self.maxlen]
    label_ids = label_ids + [PAD_TARGET_LABEL] * (self.maxlen - len(label_ids))
    return (torch.tensor(token_ids),
            torch.tensor(label_ids))
    
def get_label_map(dataset):
  vocab = {}
  maxlen = 0
  counter, label2id = Counter(), {}
  for tokens, udpos1, udpos2 in dataset:
    counter.update(udpos1)
    maxlen = max(maxlen, len(tokens))
  print(maxlen)
  for label, _ in counter.most_common():
    label2id[label] = label2id.get(label) or len(label2id)
  return label2id

pos_train = torchtext.datasets.UDPOS(split='train')
label_map = get_label_map(pos_train)

pos_train = PosDataset('train', maxlen=MAXLEN, label_map=label_map)
pos_dev = PosDataset('valid', maxlen=MAXLEN, label_map=label_map)
pos_test = PosDataset('test', maxlen=MAXLEN, label_map=label_map)

train_loader = torch.utils.data.DataLoader(pos_train,
                                           batch_size=BATCH_SIZE,
                                           shuffle=True)
clear_output()

In [None]:
# @title Define baseline model

class PosBaseline(torch.nn.Module):
  def __init__(self, word_embeddings, num_classes):
    super().__init__()

    embedding_dim = word_embeddings.dim

    self.embedding = torch.nn.Embedding(len(word2id), embedding_dim, padding_idx=PAD_TOKEN_ID)
    self.embedding.weight.data.copy_(torch.cat([
      word_embeddings.vectors,
      torch.rand_like(word_embeddings.vectors[:1]),
      torch.zeros_like(word_embeddings.vectors[:1])], dim=0))
    
    self.classifier = torch.nn.Linear(embedding_dim, num_classes)

  def forward(self, token_ids):
    token_embs = self.embedding(token_ids)
    logits = self.classifier(token_embs)
    return logits

NUM_CLASSES = len(label_map)
model = PosBaseline(glove, num_classes=NUM_CLASSES).cuda()

for p in model.embedding.parameters():
  p.requires_grad = False

In [None]:
# @title Get predictions on dataset

@torch.no_grad()
def predict_dataset(model, dataset):

  all_preds, all_labels = [], []
  test_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)
  
  for ids, labels in test_loader:
    logits = model(ids.cuda())
    preds = logits.argmax(dim=-1)
    all_preds.append(preds[labels != PAD_TARGET_LABEL])
    all_labels.append(labels[labels != PAD_TARGET_LABEL])

  return (torch.cat(all_labels, dim=0).cpu().numpy(),
          torch.cat(all_preds, dim=0).cpu().numpy())

all_labels, all_preds = predict_dataset(model, pos_test)

In [None]:
#@title Training loop

EPOCHS = 10   #@param {type: "integer"}

criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_TARGET_LABEL)
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(EPOCHS):
  progress = tqdm(train_loader, desc=f'EPOCH {epoch+1}')
  
  train_losses, train_corrects, train_total = [], 0, 0
  for token_ids, labels in progress:
    
    optimizer.zero_grad()

    labels = labels.cuda().flatten()
    logits = model(token_ids.cuda())
    logits = logits.view(-1, logits.shape[-1])
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()

    preds = logits.argmax(dim=-1)
    train_corrects += (preds[labels != PAD_TARGET_LABEL] == labels[labels != PAD_TARGET_LABEL]).sum().item()
    train_total += len(labels[labels != PAD_TARGET_LABEL])

    progress.set_postfix(loss=loss.item())
    train_losses.append(loss.item())

  dev_labels, dev_preds = predict_dataset(model, pos_dev)

  progress.set_postfix(train_loss=sum(train_losses)/len(train_losses),
                       train_acc=train_corrects/train_total,
                       dev_acc=accuracy_score(dev_labels, dev_preds),
                       dev_macro_f1=f1_score(dev_labels, dev_preds, average='macro'))
  progress.close()

## Your Model(s)

In [None]:
# ------------------
# Put your implementation here (Multi Cell)    
# ------------------

## Submit/Evaluate Model

In [None]:
test_labels, test_preds = predict_dataset(model, pos_test)
macro_f1 = f1_score(test_labels, test_preds, average='macro')
print(f'Macro F1-score on test set: {macro_f1*100:.1f}')

Macro F1-score on test set: 81.3


# SQuAD (Question Answering)

**Question Answering (QA)** is the task of answering a question (source: nlpprogress :D)! Most current QA datasets frame the task as reading comprehension, where the question is about a paragraph or document and the answer is often a span in the document.

The **Stanford Question Answering Dataset (SQuAD)** is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text (a span) from the corresponding reading passage. 

To make it easier, we use only the questions with single-word answers in this assignment. Take a look [here](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for more information on SQuAD. You can also read [this paper](https://www.aclweb.org/anthology/K17-1028/) to get a clue on how to tackle the problem without RNNs.

## Technical Notes:

* The final expected output for our simplified task is the index of the answer token in the passage (or the probability of each token being the correct answer). The task is therefore token-based, and the tokenization should be done on our side to make test results comparable on Kaggle. 

* Our baseline uses GloVe embeddings and their vocabulary (mapping tokens to ids) to create input tensors. If you want to change that, pass your own `transform` function (in place of our `convert_to_ids`) to the `SquadDataset` class.

* You can use any data for model training, except SQuAD 1.1 dev set which is considered as test set. And that's because the official test set of SQuAD is not publicly available.

* Your model should be compatible with our `predict_dataset` function, which returns the model's predictions for all tokens in a dataset with shape `(NUM_EXAMPLES, MAX_PASSAGE_LENGTH)`. Alternatively, you can implement your own version of `predict_dataset`, as long as it returns that output without shuffling the data.

* <b><font color='red'>After local evaluation, submit (upload) your results to [this competition](https://www.kaggle.com/t/726281a9a5a34e6586c58790f6ebbe26) and compete with your classmates.</font></b>
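
SQuAD gives each answer as a character offset into the passage, while this task expects a token index. A minimal sketch of that conversion is shown below, assuming a simple regex tokenizer instead of the spaCy tokenizer used by the notebook; `tokenize_with_offsets` and `answer_token_index` are hypothetical helpers, and applying the length limit to the token position is an assumption of this sketch.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int]]:
    """Split into word and punctuation tokens, keeping each token's start offset."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def answer_token_index(passage: str, answer_start: int, max_passage_len: int) -> int:
    """Index of the token starting at `answer_start`, or -1 if none is found
    within the first `max_passage_len` tokens."""
    for index, (_, offset) in enumerate(tokenize_with_offsets(passage)):
        if index >= max_passage_len:
            break
        if offset == answer_start:
            return index
    return -1


passage = "The tower was finished in 1889."
start = passage.index("1889")  # character offset, as stored in SQuAD
print(answer_token_index(passage, start, max_passage_len=256))
```

An offset that falls in the middle of a token, or beyond the truncated passage, yields `-1`, which is the kind of example a dataset loader would have to skip.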

## Our Baseline

In [None]:
#@title Load GloVe embeddings

EMBEDDING_DIM = 100 #@param {type:"integer"}
glove = torchtext.vocab.GloVe(name='6B', dim=EMBEDDING_DIM)

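word2id = glove.stoi
# Append the special tokens in the same order as the embedding rows
# built in the baseline model: <unk> first, then <pad>.
UNK_TOKEN_ID = len(word2id)
word2id['<unk>'] = UNK_TOKEN_ID
PAD_TOKEN_ID = len(word2id)
word2id['<pad>'] = PAD_TOKEN_ID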

def convert_to_ids(tokens, pad=True, truncate=True, maxlen=32):
  # map each token to its id, falling back to the unknown-token id
  ids = [word2id.get(token, UNK_TOKEN_ID) for token in tokens]
  if pad:
    ids = ids + [PAD_TOKEN_ID] * (maxlen - len(ids))
  if truncate:
    ids = ids[:maxlen]
  return ids

clear_output()
print('Done!')

In [None]:
#@title Load dataset

BATCH_SIZE = 128 #@param {type:"integer"}
MAX_QUESTION_LEN = 24 #@param {type:"integer"}
MAX_PASSAGE_LEN = 256 #@param {type:"integer"}
tokenizer = spacy.load('en')

class SquadDataset(torch.utils.data.Dataset):
  def __init__(self, split_name='train', transform=convert_to_ids,
               max_question_len=24, max_passage_len=256):
    super().__init__()

    raw_dataset = torchtext.datasets.SQuAD1(split=split_name)

    self.max_question_len = max_question_len
    self.max_passage_len = max_passage_len
    self.transform = transform
    self.qa_pairs = []

    for passage, question, ans_texts, ans_starts in tqdm(raw_dataset, desc='TOKENIZING'):
      
      answer_start = -1

      # check if the question has a single token answer
      for ans_text, ans_start in zip(ans_texts, ans_starts):
        if len(tokenizer(ans_text, disable=['parser','tagger','ner'])) == 1:
          answer_start = ans_start
          break
      else:
        continue

      passage_tokens = tokenizer(passage.lower(), disable=['parser','tagger','ner'])
      p_tokens = [token.text for token in passage_tokens]
      ans_idx = -1

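      # find index of answer token; note that token.idx is a character offset,
      # so the max_passage_len check below bounds the answer's character position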
      for i, token in enumerate(passage_tokens):
        if token.idx == answer_start and token.idx < self.max_passage_len:
          ans_idx = i
          break
      else:
        continue

      q_tokens = [token.text for token in tokenizer(question.lower(), disable=['parser','tagger','ner'])]
      self.qa_pairs.append((p_tokens, q_tokens, ans_idx))

  def __len__(self):
    return len(self.qa_pairs)

  def __getitem__(self, index):
    p_tokens, q_tokens, ans_idx = self.qa_pairs[index]
    p_ids = self.transform(p_tokens, maxlen=self.max_passage_len) 
    q_ids = self.transform(q_tokens, maxlen=self.max_question_len)

    return (torch.tensor(p_ids),
            torch.tensor(q_ids),
            torch.tensor(ans_idx))
   
squad_train = SquadDataset(split_name='train',
                           max_question_len=MAX_QUESTION_LEN,
                           max_passage_len=MAX_PASSAGE_LEN)
dev_size = len(squad_train) // 10
train_size = len(squad_train) - dev_size

squad_train, squad_dev = torch.utils.data.random_split(squad_train, [train_size, dev_size])

squad_test = SquadDataset(split_name='dev',
                          max_question_len=MAX_QUESTION_LEN,
                          max_passage_len=MAX_PASSAGE_LEN)

train_loader = torch.utils.data.DataLoader(squad_train,
                                           batch_size=BATCH_SIZE,
                                           shuffle=True,
                                           pin_memory=True)
clear_output()
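
The answer lookup in `SquadDataset` relies on each token knowing the character offset at which it starts (spaCy's `token.idx`), which is compared with the answer start offset that SQuAD provides. The sketch below mimics that idea with a simple regex tokenizer instead of spaCy; `tokenize_with_offsets` and `find_answer_token` are hypothetical helpers written for illustration only.

```python
import re


def tokenize_with_offsets(text):
    """Split text into word/punctuation tokens, keeping each start offset."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def find_answer_token(passage, answer_start):
    """Return the index of the token starting at `answer_start`, or -1."""
    for i, (_, start) in enumerate(tokenize_with_offsets(passage.lower())):
        if start == answer_start:
            return i
    return -1


passage = "Tehran is the capital of Iran."
answer_start = passage.index("Iran")  # character offset, as SQuAD provides
idx = find_answer_token(passage, answer_start)
tokens = [tok for tok, _ in tokenize_with_offsets(passage.lower())]
print(idx, tokens[idx])  # 5 iran
```

If no token starts exactly at the given offset (for example because the tokenizer splits the text differently), the lookup fails and the example is skipped, as in the dataset class above.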

In [None]:
#@title Define baseline model

class SquadBaseline(torch.nn.Module):
  def __init__(self, word_embeddings):
    super().__init__()

    num_units = word_embeddings.dim

    self.embedding = torch.nn.Embedding(len(word2id), word_embeddings.dim, padding_idx=PAD_TOKEN_ID)
    self.embedding.weight.data.copy_(torch.cat([
        word_embeddings.vectors,
        torch.rand_like(word_embeddings.vectors[:1]),  # unk
        torch.zeros_like(word_embeddings.vectors[:1])  # pad
      ], dim=0))

    self.p_linear = torch.nn.Linear(num_units, num_units)
    self.q_linear = torch.nn.Linear(num_units, num_units)

    self.classifier = torch.nn.Linear(num_units, 1)
  
  def forward(self, p_ids, q_ids):
    p_embs = self.embedding(p_ids)
    q_embs = self.embedding(q_ids)

    p_features = torch.nn.functional.relu(self.p_linear(p_embs))
    q_features = torch.nn.functional.relu(self.q_linear(q_embs))

    q_reduced = q_features.mean(dim=1).unsqueeze(1)
    merged = q_reduced.repeat(1, p_features.shape[1], 1) - p_features
    logits = self.classifier(merged)
    return logits.squeeze(-1)

model = SquadBaseline(glove).cuda()

for p in model.embedding.parameters():
  p.requires_grad = False
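
To make the scoring step of the baseline concrete, here is a minimal pure-Python sketch of what happens for a single example after the features are computed: the question features are averaged, each passage token's features are subtracted from that average, and a linear layer turns the difference into one logit per token. It deliberately skips the embedding lookup, the linear+ReLU projections, and batching; `mean_pool`, `score_passage`, and the toy numbers are illustrative assumptions.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def score_passage(p_features, q_features, weights, bias=0.0):
    """One logit per passage token: w . (mean(q) - p_i) + b."""
    q_reduced = mean_pool(q_features)
    logits = []
    for p_vec in p_features:
        merged = [q - p for q, p in zip(q_reduced, p_vec)]
        logits.append(sum(w * m for w, m in zip(weights, merged)) + bias)
    return logits


# Toy 2-dimensional features for a 3-token passage and a 2-token question.
p_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
q_features = [[1.0, 1.0], [3.0, 1.0]]
weights = [1.0, -1.0]  # stand-in for the learned classifier weights

logits = score_passage(p_features, q_features, weights)
print(logits)  # [0.0, 2.0, 1.0]
```

Because the question is reduced to a single averaged vector, this baseline ignores word order in the question, which is one obvious place to look for improvements.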

In [None]:
# @title Get predictions on dataset

@torch.no_grad()
def predict_dataset(model, dataset):
  data_loader = torch.utils.data.DataLoader(dataset,
                                            batch_size=BATCH_SIZE,
                                            shuffle=False)
  outputs, labels = [], []
  
  for p_ids, q_ids, indexes in data_loader:
    logits = model(p_ids.cuda(), q_ids.cuda())
    outputs.append(logits)
    labels.append(indexes)

  return (torch.cat(labels, dim=0).cpu().numpy(),
          torch.cat(outputs, dim=0).cpu().numpy())

# labels, outputs = predict_dataset(model, squad_dev)

In [None]:
#@title Training loop
EPOCHS = 10 #@param {type: "integer"}

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(EPOCHS):
  progress = tqdm(train_loader, desc=f'EPOCH {epoch+1}')
  
  train_losses, train_corrects, train_total = [], 0, 0
  for p_ids, q_ids, indexes in progress:
    
    optimizer.zero_grad()

    logits = model(p_ids.cuda(), q_ids.cuda())
    loss = criterion(logits, indexes.cuda())
    loss.backward()
    optimizer.step()

    preds = logits.argmax(dim=-1)
    train_corrects += (preds == indexes.cuda()).int().sum().item()
    train_total += p_ids.shape[0]
    progress.set_postfix(loss=loss.item())
    train_losses.append(loss.item())

  progress.set_postfix(train_loss=sum(train_losses)/len(train_losses))
  
  labels, outputs = predict_dataset(model, squad_dev)

  progress.set_postfix(train_loss=sum(train_losses)/len(train_losses),
                       train_acc=train_corrects/train_total,
                       val_acc=top_k_accuracy_score(labels, outputs, labels=list(range(MAX_PASSAGE_LEN)), k=1),
                       top_5_val_acc=top_k_accuracy_score(labels, outputs, labels=list(range(MAX_PASSAGE_LEN)), k=5))
  progress.close()
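
The training loop treats answer selection as classification over passage positions: the logits of one passage are normalized with a softmax, and the loss is the negative log-probability of the true answer index, $L = -\log \mathrm{softmax}(z)_a$. The sketch below works through that arithmetic for one example in plain Python; `position_cross_entropy` is a hypothetical helper, not a replacement for `torch.nn.CrossEntropyLoss`.

```python
import math


def position_cross_entropy(logits, answer_idx):
    """Softmax over passage positions, then negative log-likelihood."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[answer_idx]), probs


logits = [0.0, 2.0, 1.0]  # one score per passage token
loss, probs = position_cross_entropy(logits, answer_idx=1)
predicted = max(range(len(logits)), key=logits.__getitem__)
print(predicted)       # 1
print(round(loss, 4))  # 0.4076
```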

## Your Model(s)

In [None]:
# ------------------
# Put your implementation here (Multi Cell)    
# ------------------

## Submit/Evaluate Model

In [None]:
#@title Evaluation
test_labels, test_preds = predict_dataset(model, squad_test)
accuracy = top_k_accuracy_score(test_labels, test_preds, labels=list(range(MAX_PASSAGE_LEN)), k=1)
accuracy_5 = top_k_accuracy_score(test_labels, test_preds, labels=list(range(MAX_PASSAGE_LEN)), k=5)

print(f'Accuracy on test set: {accuracy*100:.2f}')
print(f'Top-5 Accuracy on test set: {accuracy_5*100:.2f}')
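
For intuition about the two reported metrics, the following sketch computes top-k accuracy by hand: an example counts as correct if its true answer index is among the k highest-scoring positions. `top_k_accuracy` is a simplified, hypothetical stand-in; its handling of tied scores may differ from scikit-learn's `top_k_accuracy_score`.

```python
def top_k_accuracy(labels, scores, k=1):
    """Fraction of examples whose true index is among the k highest scores."""
    hits = 0
    for label, row in zip(labels, scores):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


labels = [2, 0, 1]
scores = [
    [0.1, 0.2, 0.7],  # true index ranked 1st
    [0.3, 0.6, 0.1],  # true index ranked 2nd
    [0.5, 0.1, 0.4],  # true index ranked 3rd
]
top1 = top_k_accuracy(labels, scores, k=1)  # 1 of 3 examples correct
top2 = top_k_accuracy(labels, scores, k=2)  # 2 of 3 examples correct
print(top1, top2)
```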

In [None]:
# @title Make kaggle submission file
from google.colab import files

test_preds = np.argmax(test_preds, axis=-1)

with open('squad_submission.csv', 'w') as f:
  S = 'id,label\n'
  for i, label in enumerate(test_preds):
    S += f'{i},{label}\n'
  f.write(S[:-1])

files.download('squad_submission.csv')
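
If you prefer not to build the file contents as one string, the same `id,label` layout can be produced with Python's standard `csv` module. This is only a sketch: `write_submission` is a hypothetical helper, and unlike the cell above it leaves a trailing newline at the end of the file.

```python
import csv
import io


def write_submission(predictions, stream):
    """Write an `id,label` header followed by one row per prediction."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["id", "label"])
    for i, label in enumerate(predictions):
        writer.writerow([i, int(label)])


# StringIO stands in for open('squad_submission.csv', 'w', newline='').
buffer = io.StringIO()
write_submission([12, 0, 7], buffer)
print(buffer.getvalue())
```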

# Submission

Congratulations! You finished the assignment & you're ready to submit your work. Please follow the instructions:

1. Check and review your answers. Make sure all of the cell outputs are what you want. 
2. Select File > Save.
3. **Fill in your information** and run the cell below.
4. Run the **Make Submission** cell. It may take several minutes, and it may ask you for your credentials.
5. Run the **Download Submission** cell to obtain your submission as a zip file.
6. Grab the downloaded file (`nlp_asg02__xx__xx.zip`) and hand it in on Microsoft Teams.

## Fill your information (Run the cell)

In [None]:
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = "" #@param {type:"string"}
student_name = "" #@param {type:"string"}

print("your student id:", student_id)
print("your name:", student_name)


from pathlib import Path

ASSIGNMENT_PATH = Path('asg02')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)

## Make Submission (Run the cell)

In [None]:
#@title Make submission
! pip install -U --quiet PyDrive > /dev/null
! pip install -U --quiet jdatetime > /dev/null

# ! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz 


import os
import time
import yaml
import json
import jdatetime

from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

asg_name = 'NLP_Assignment_2'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
    Jupyter.notebook.save_checkpoint();
});
'''
# repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'nlp_asg02__%s__%s.zip'%(student_id, student_name.lower().replace(' ',  '_'))

sub_info = {
    'student_id': student_id,
    'student_name': student_name, 
    'dateime': str(jdatetime.date.today()),
    'asg_name': asg_name
}
# write the submission metadata and close the file before zipping
with open('info.json', 'w') as info_file:
  json.dump(sub_info, info_file)

Javascript(script_save)

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name) 

! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null

print("##########################################")
print("Done! Submission created. Please download it using the cell below!")

In [None]:
drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']

In [None]:
files.download(submission_file_name)