### Coursework coding instructions (please also see full coursework spec)

Please choose either Task 1 or Task 2. You should write your report about one task only.

For the task you choose you will need to do two approaches:
  - Approach 1, which can use pre-trained embeddings / models
  - Approach 2, which should not use any pre-trained embeddings or models
We should be able to run both approaches from the same colab file.

#### Running your code:
  - Your models should run automatically when running your colab file without further intervention
  - For each task you should automatically output the performance of both models
  - Your code should automatically download any libraries required
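
One way to meet the automatic-download requirement is to check which packages are absent before installing anything. The sketch below is only an illustration and not part of the coursework template; `missing_packages` and `install_missing` are hypothetical helper names, and the code assumes each pip package name equals its import name.

```python
import importlib.util
import subprocess
import sys

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

def install_missing(import_names):
    """Install each missing package with pip (assumes pip name == import name)."""
    for name in missing_packages(import_names):
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])

# Only reports what is absent; nothing is installed by this line.
print(missing_packages(["json", "transformers"]))
```

Calling `install_missing(["transformers"])` at the top of the notebook would then install the package only when it is not already available.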

#### Structure of your code:
  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required
  - Otherwise there are no restrictions on what you can do in your code

#### Documentation:
  - You are expected to produce a .README file summarising how you have approached both tasks

#### Reproducibility:
  - Your .README file should explain how to replicate the different experiments mentioned in your report

Good luck! We are really looking forward to seeing your reports and your model code!

In [None]:
# You will need to download any word embeddings required for your code, e.g.:

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# For any packages that Colab does not provide automatically you will also need to install these below, e.g.:

#! pip install torch

--2021-02-21 12:17:44--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-02-21 12:17:44--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-02-21 12:17:45--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
# Download the task dataset

!wget https://www.cs.rochester.edu/u/nhossain/humicroedit/semeval-2020-task-7-data.zip
!unzip semeval-2020-task-7-data.zip
!rm semeval-2020-task-7-data.zip

--2021-02-21 13:58:10--  https://www.cs.rochester.edu/u/nhossain/humicroedit/semeval-2020-task-7-data.zip
Resolving www.cs.rochester.edu (www.cs.rochester.edu)... 192.5.53.208
Connecting to www.cs.rochester.edu (www.cs.rochester.edu)|192.5.53.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 919538 (898K) [application/zip]
Saving to: ‘semeval-2020-task-7-data.zip’


2021-02-21 13:58:11 (2.14 MB/s) - ‘semeval-2020-task-7-data.zip’ saved [919538/919538]

Archive:  semeval-2020-task-7-data.zip
   creating: data/
   creating: data/task-1/
  inflating: data/task-1/.DS_Store   
  inflating: data/task-1/dev.csv     
  inflating: data/task-1/train.csv   
  inflating: data/.DS_Store          
   creating: data/task-2/
  inflating: data/task-2/.DS_Store   
  inflating: data/task-2/dev.csv     
  inflating: data/task-2/train.csv   


In [None]:
# Imports

import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset, random_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import torch.optim as optim
import codecs
import tqdm

In [None]:
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

In [None]:
# Load data
train_df = pd.read_csv('data/task-2/train.csv')
test_df = pd.read_csv('data/task-2/dev.csv')

In [None]:
# Number of epochs
epochs = 10

# Proportion of training data for train compared to dev
train_proportion = 0.8
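
As a small illustration of how a proportion like this turns into concrete split sizes, the sketch below rounds the training share and gives the remainder to dev, so the two sizes always add up to the full dataset. `split_sizes` is a hypothetical helper name, not part of the template.

```python
def split_sizes(n_examples, train_proportion):
    """Return (n_train, n_dev) so that the two sizes sum to n_examples."""
    n_train = round(n_examples * train_proportion)
    return n_train, n_examples - n_train

print(split_sizes(1000, 0.8))  # (800, 200)
```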


#### BERT

##### Make BERT work

#### Approach 1: Using pre-trained representations

In [None]:
# We define our training loop
def train(train_iter, dev_iter, model, number_epoch):
    """
    Training loop for the model, which calls on eval to evaluate after each epoch
    """

    print("Training model.")

    for epoch in range(1, number_epoch+1):
        
        model.train()
        
        epoch_loss = 0
        epoch_correct = 0
        no_observations = 0  # Observations used for training so far

        for batch in train_iter:
            input_ids, attention_mask, token_type_ids, target = batch
            input_ids, attention_mask, token_type_ids, target = input_ids.to(device), attention_mask.to(device), token_type_ids.to(device), target.to(device)

            # for RNN:
            # model.batch_size = target.shape[0]
            # no_observations = no_observations + target.shape[0]
            # model.hidden = model.init_hidden()

            # for BERT
            input_ids = input_ids.squeeze(1)
            attention_mask = attention_mask.squeeze(1)
            token_type_ids = token_type_ids.squeeze(1)
            predictions = model(input_ids = input_ids, attention_mask = attention_mask, token_type_ids=token_type_ids).logits
            loss = loss_fn(predictions, target)

            correct, __ = model_performance(np.argmax(predictions.detach().cpu().numpy(), axis=1), target.detach().cpu().numpy())

            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # added T-C
            optimizer.step()
            scheduler.step()  # added T-C
            optimizer.zero_grad() # added T-C

            epoch_loss += loss.item()*target.shape[0]
            epoch_correct += correct
            no_observations = no_observations + target.shape[0]

        valid_loss, valid_acc, __, __ = eval(dev_iter, model)
        
        epoch_loss, epoch_acc = epoch_loss / no_observations, epoch_correct / no_observations
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.2f} | Train Accuracy: {epoch_acc:.2f} | \
          Val. Loss: {valid_loss:.2f} | Val. Accuracy: {valid_acc:.2f} |')

In [None]:
# We evaluate performance on our dev set
def eval(data_iter, model):
    """
    Evaluating model performance on the dev set
    """
    model.eval()
    epoch_loss = 0
    epoch_correct = 0
    pred_all = []
    trg_all = []
    no_observations = 0

    with torch.no_grad():
        for batch in data_iter:
            input_ids, attention_mask, token_type_ids, target = batch
            input_ids, attention_mask, token_type_ids, target = input_ids.to(device), attention_mask.to(device), token_type_ids.to(device), target.to(device)

            # for RNN:
            # model.batch_size = target.shape[0]
            # no_observations = no_observations + target.shape[0]
            # model.hidden = model.init_hidden()

            # for BERT
            input_ids = input_ids.squeeze(1)
            attention_mask = attention_mask.squeeze(1)
            token_type_ids = token_type_ids.squeeze(1)
            predictions = model(input_ids = input_ids, attention_mask = attention_mask, token_type_ids=token_type_ids).logits
            loss = loss_fn(predictions, target)

            # We count the correct predictions (argmax over the class logits)
            pred, trg = predictions.detach().cpu().numpy(), target.detach().cpu().numpy()
            correct, __ = model_performance(np.argmax(pred, axis=1), trg)

            epoch_loss += loss.item()*target.shape[0]
            no_observations = no_observations + target.shape[0]
            epoch_correct += correct
            pred_all.extend(pred)
            trg_all.extend(trg)

    return epoch_loss/no_observations, epoch_correct/no_observations, np.array(pred_all), np.array(trg_all)
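
The loops above multiply each batch loss by the batch size before dividing by the number of observations. Since `nn.CrossEntropyLoss` returns the batch mean by default, this gives a per-example average even when the last batch is smaller. A minimal sketch of the arithmetic, with `epoch_average` as a hypothetical helper name and made-up numbers:

```python
def epoch_average(batch_means, batch_sizes):
    """Size-weighted average of per-batch mean losses."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)

# Two batches: 64 examples with mean loss 1.0, then 16 examples with mean loss 2.0.
print(epoch_average([1.0, 2.0], [64, 16]))  # 1.2, whereas the unweighted mean would be 1.5
```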

In [None]:
# How we print the model performance
def model_performance(output, target, print_output=False):
    """
    Returns the number of correct predictions and the accuracy for a batch,
    i.e. if you get 8/10 right, this returns (8, 0.8)
    """

    correct_answers = (output == target)
    correct = sum(correct_answers)
    acc = np.true_divide(correct,len(output))

    if print_output:
        print(f'| Acc: {acc:.2f} ')

    return correct, acc
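
To make the two return values concrete, here is a plain-Python sketch of the same calculation on made-up labels. `accuracy_summary` is a hypothetical stand-in that works on lists rather than arrays.

```python
def accuracy_summary(predictions, targets):
    """Return (number of correct predictions, accuracy)."""
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct, correct / len(targets)

preds   = [1, 1, 2, 0, 1, 2, 1, 1, 0, 2]
targets = [1, 1, 2, 0, 1, 2, 1, 0, 1, 2]
print(accuracy_summary(preds, targets))  # (8, 0.8)
```

The training loop accumulates the first value across batches and divides by the total number of observations at the end of the epoch.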

In [None]:
# To create our vocab
def create_vocab(data):
    """
    Creating a corpus of all the tokens used
    """
    tokenized_corpus = [] # Let us put the tokenized corpus in a list

    for sentence in data:

        tokenized_sentence = []

        for token in sentence.split(' '): # simplest tokenization: split on spaces

            tokenized_sentence.append(token)

        tokenized_corpus.append(tokenized_sentence)

    # Create single list of all vocabulary
    vocabulary = []  # Let us put all the tokens (mostly words) appearing in the vocabulary in a list

    for sentence in tokenized_corpus:

        for token in sentence:

            if token not in vocabulary:

                vocabulary.append(token)

    return vocabulary, tokenized_corpus
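
Checking membership in a list gets slow as the vocabulary grows. A possible alternative, sketched below under the assumption that first-seen order should be preserved, keeps the tokens in a dict so each lookup is constant time. `build_vocab` is a hypothetical name and not part of the template.

```python
def build_vocab(sentences):
    """Return (vocabulary in first-seen order, tokenized corpus)."""
    token_to_id = {}          # dicts preserve insertion order (Python 3.7+)
    tokenized_corpus = []
    for sentence in sentences:
        tokens = sentence.split(' ')
        tokenized_corpus.append(tokens)
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return list(token_to_id), tokenized_corpus

vocab, corpus = build_vocab(["a b a", "b c"])
print(vocab)   # ['a', 'b', 'c']
print(corpus)  # [['a', 'b', 'a'], ['b', 'c']]
```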

In [None]:
# Used for collating our observations into minibatches:
def collate_fn_padd(batch):
    '''
    We add padding to our minibatches and create tensors for our model
    '''
    input_ids = [t1 for t1, t2, t3, l in batch]
    att_mask = [t2 for t1, t2, t3, l in batch]
    token_type_ids = [t3 for t1, t2, t3, l in batch]
    batch_labels = [l for t1, t2, t3, l in batch]
    batch_features_len = [t1.shape[-1] for t1, t2, t3, l in batch]

    #batch_labels = [l for f, l in batch]
    #batch_features = [f for f, l in batch]
    #batch_features_len = [len(f) for f, l in batch]

    input_ids_tensor = torch.zeros((len(batch), max(batch_features_len))).long()
    att_mask_tensor = torch.zeros((len(batch), max(batch_features_len))).long()
    token_type_tensor = torch.zeros((len(batch), max(batch_features_len))).long()

    #for idx, (seq, seqlen) in enumerate(zip(batch_features, batch_features_len)):
    #    seq_tensor[idx, :seqlen] = torch.LongTensor(seq)

    for idx, (seq, seqlen) in enumerate(zip(input_ids, batch_features_len)):
        input_ids_tensor[idx, :seqlen] = seq.view(-1)
    for idx, (seq, seqlen) in enumerate(zip(att_mask, batch_features_len)):
        att_mask_tensor[idx, :seqlen] = seq.view(-1)
    for idx, (seq, seqlen) in enumerate(zip(token_type_ids, batch_features_len)):
        token_type_tensor[idx, :seqlen] = seq.view(-1)

    batch_labels = torch.LongTensor(batch_labels)

    return input_ids_tensor, att_mask_tensor, token_type_tensor, batch_labels

# We create a Dataset so we can create minibatches
class Task2Dataset(Dataset):

    def __init__(self, tokenized_data, labels):
        self.tokenized_data = tokenized_data
        self.y_train = labels

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, item):
        return self.tokenized_data[item]['input_ids'], self.tokenized_data[item]['attention_mask'], self.tokenized_data[item]['token_type_ids'], self.y_train[item]
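
The padding step in `collate_fn_padd` can be pictured without tensors: every sequence in a batch is right-padded with zeros up to the longest sequence in that batch. The following is a plain-list sketch of that idea only; `pad_batch` is a hypothetical helper and the mask it builds is an illustration, not the notebook's code.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each sequence to the batch maximum and build a matching 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

padded, mask = pad_batch([[5, 6, 7], [8]])
print(padded)  # [[5, 6, 7], [8, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]
```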

In [None]:

class BiLSTM_classification(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, batch_size, device):
        super(BiLSTM_classification, self).__init__()
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.device = device
        self.batch_size = batch_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2label = nn.Linear(hidden_dim * 2, 3)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we dont have any hidden state.
        # Refer to the Pytorch documentation to see exactly why they have this dimensionality.
        # The axes semantics are (num_layers * num_directions, minibatch_size, hidden_dim)
        return torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device), \
               torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device)

    def forward(self, sentence):
        embedded = self.embedding(sentence)
        embedded = embedded.permute(1, 0, 2)

        lstm_out, self.hidden = self.lstm(
            embedded.view(len(embedded), self.batch_size, self.embedding_dim), self.hidden)

        out = self.hidden2label(lstm_out[-1])
        return out

In [None]:
# added T-C
def remove_invalid_headline(vectorized_seqs):
  valid_seq = []
  valid_idx = []
  for idx, seq in enumerate(vectorized_seqs):
    if len(seq) > 3:
      valid_seq.append(seq)
      valid_idx.append(idx)
  return valid_seq, valid_idx

In [None]:
import re

def replace_word(sentence, new_word):
  # The word to edit is marked as <word/>; match non-greedily so only that span is captured
  match = re.search(r"<(.+?)/>", sentence)
  if match is None:
    return sentence  # no marked word: return the headline unchanged
  return sentence.replace(match.group(0), new_word)
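
A short usage sketch of the substitution on a made-up headline (not taken from the dataset). `apply_edit` is a hypothetical, self-contained variant that uses a non-greedy pattern and replaces only the first marked span.

```python
import re

# The word to be edited is marked as <word/> in the headline.
EDIT_SPAN = re.compile(r"<(.+?)/>")

def apply_edit(headline, new_word):
    """Replace the first <word/> span with new_word; leave other text untouched."""
    # A function is used as the replacement so backslashes in new_word are kept literally.
    return EDIT_SPAN.sub(lambda match: new_word, headline, count=1)

print(apply_edit("The <cat/> sat on the mat", "dog"))  # The dog sat on the mat
```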

In [None]:
def replace_word_dataset(sentence_dataset, word_dataset):
  new_dataset = []
  for i in range (len(sentence_dataset)):
    new_sentence = replace_word(sentence_dataset[i], word_dataset[i])
    new_dataset.append(new_sentence)
  return new_dataset

In [None]:
def concat_dataset(dataset1, dataset2):
  dataset = []
  for i in range(len(dataset2)):
    # [CLS] is added directly during the tokenization
    dataset.append(dataset1[i] + ' [SEP] ' + dataset2[i] + ' [SEP]')
  return dataset

In [None]:
## Approach 1 code, using functions defined above:

# We set our training data and test data

training_data1 = train_df['original1']
training_edit1 = train_df['edit1']
test_data1 = test_df['original1']
test_edit1 = test_df['edit1']

training_data2 = train_df['original2']
training_edit2 = train_df['edit2']
test_data2 = test_df['original2']
test_edit2 = test_df['edit2']

# We replace with the edited word
edit_training_data1 = replace_word_dataset(training_data1, training_edit1)
edit_test_data1 = replace_word_dataset(test_data1, test_edit1)
edit_training_data2 = replace_word_dataset(training_data2, training_edit2)
edit_test_data2 = replace_word_dataset(test_data2, test_edit2)


In [None]:
%pip install transformers
import transformers
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')





In [None]:
def tokenized_dataset(dataset1, dataset2):
  tokenized = []
  for i in range(len(dataset1)):
    encoding = tokenizer.encode_plus(dataset1[i], dataset2[i], max_length=55, truncation=True, return_token_type_ids=True, padding=True, return_attention_mask=True, return_tensors='pt')
    tokenized.append(encoding)
  return tokenized

In [None]:
tokenized_training = tokenized_dataset(edit_training_data1, edit_training_data2)
tokenized_test = tokenized_dataset(edit_test_data1, edit_test_data2)

In [None]:
tokenized_training[0]

{'input_ids': tensor([[  101,  1000,  4962,  8292, 12789,  2078,  1010,  2197,  8033,  2006,
          1996,  4231,  1010,  8289,  2012,  6445,  1000,   102,  1000,  4962,
          8292, 12789,  2078,  1010,  2197, 19748,  2006,  1996,  4231,  1010,
         17727,  2890, 16989,  3064,  2012,  6445,  1000,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0]])}
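
The encoding printed above follows the sentence-pair layout `[CLS] A [SEP] B [SEP]` plus padding. The sketch below rebuilds that layout by hand from two lists of token ids, to show how `token_type_ids` and `attention_mask` line up with `input_ids`. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and padding ids visible in the output; `encode_pair` is a hypothetical helper, and truncation is deliberately left out.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special ids as seen in the printed encoding

def encode_pair(ids_a, ids_b, max_length):
    """Lay out two token-id lists as [CLS] A [SEP] B [SEP] and pad to max_length."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("pair is longer than max_length (truncation not handled here)")
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,  # padding gets segment 0
        "attention_mask": attention_mask + [0] * n_pad,  # padding is masked out
    }

print(encode_pair([7, 8], [9], max_length=8))
```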

In [None]:
BATCH_SIZE = 64
EPOCHS = 10

import transformers
from transformers import AdamW, get_linear_schedule_with_warmup
from transformers import BertForSequenceClassification
# We initialise BERT model for classification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
print("Model initialised.")
model.to(device)

# train_and_dev pairs each tokenized headline pair (input ids, attention mask, token type ids) with its label
train_and_dev = Task2Dataset(tokenized_training, train_df['label'])

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(train_and_dev,
                                           (train_examples,
                                            dev_examples))


train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)

print("Dataloaders created.")

loss_fn = nn.CrossEntropyLoss()
loss_fn = loss_fn.to(device)

# optimizer = torch.optim.Adam(model.parameters())

no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
#optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=1e-5)

# added T-C =====================
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)  # AdamW decouples weight decay; correct_bias=False skips Adam's bias correction
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=len(train_loader) * EPOCHS
)
### =====================

train(train_loader, dev_loader, model, EPOCHS)  # same constant the scheduler uses for its total step count

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Model initialised.
Dataloaders created.
Training model.
| Epoch: 01 | Train Loss: 0.97 | Train Accuracy: 0.45 |           Val. Loss: 0.96 | Val. Accuracy: 0.44 |
| Epoch: 02 | Train Loss: 0.96 | Train Accuracy: 0.47 |           Val. Loss: 0.95 | Val. Accuracy: 0.48 |
| Epoch: 03 | Train Loss: 0.92 | Train Accuracy: 0.56 |           Val. Loss: 0.97 | Val. Accuracy: 0.50 |
| Epoch: 04 | Train Loss: 0.81 | Train Accuracy: 0.65 |           Val. Loss: 1.07 | Val. Accuracy: 0.50 |
| Epoch: 05 | Train Loss: 0.70 | Train Accuracy: 0.71 |           Val. Loss: 1.14 | Val. Accuracy: 0.52 |
| Epoch: 06 | Train Loss: 0.59 | Train Accuracy: 0.75 |           Val. Loss: 1.29 | Val. Accuracy: 0.52 |
| Epoch: 07 | Train Loss: 0.51 | Train Accuracy: 0.79 |           Val. Loss: 1.40 | Val. Accuracy: 0.52 |
| Epoch: 08 | Train Loss: 0.45 | Train Accuracy: 0.82 |           Val. Loss: 1.56 | Val. Accuracy: 0.51 |
| Epoch: 09 | Train Loss: 0.40 | Train Accuracy: 0.84 |           Val. Loss: 1.64 | Val. Accurac
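
The learning rate in this run is scaled by a linear schedule that is stepped once per batch. As a sketch of the shape of that schedule (my understanding of the usual linear warmup and decay rule, not code copied from the library), the multiplier applied to the base learning rate can be written as follows; `linear_schedule_factor` is a hypothetical name.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# With no warmup, the factor starts at 1 and reaches 0 at the final step.
print([linear_schedule_factor(s, 0, 100) for s in (0, 50, 100)])  # [1.0, 0.5, 0.0]
```

With `num_warmup_steps=0`, the learning rate is therefore highest at the first step and shrinks steadily towards zero by the end of training.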

#### Approach 2: No pre-trained representations

In [None]:
train_and_dev = train_df['edit1']

training_data, dev_data, training_y, dev_y = train_test_split(train_df['edit1'], train_df['label'],
                                                                        test_size=(1-train_proportion),
                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
naive_model = MultinomialNB().fit(train_counts, training_y)

# Train predictions
predicted_train = naive_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate on dev:
train_and_dev_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(train_and_dev_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = naive_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")

correct, acc = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
correct, acc = model_performance(predicted, dev_y, True)
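
To show what the TF-IDF step does to raw counts, here is a hand-rolled sketch. It assumes the smoothed weighting that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. `tfidf_rows` is a hypothetical helper that works on token lists and returns one dict per document.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {token: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}
    rows = []
    for doc in tokenized_docs:
        weights = {tok: count * idf[tok] for tok, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows

rows = tfidf_rows([["a", "b"], ["a"]])
print(rows[0])  # "b" is rarer than "a", so it gets the larger weight
```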

#### Baseline for task 2

In [None]:
# Baseline for the task
pred_baseline = torch.zeros(len(dev_y)) + 1  # 1 is most common class
print("\nBaseline performance:")
correct, acc = model_performance(pred_baseline, torch.tensor(dev_y.values), True)
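
Rather than hard-coding the most common class, the majority label could be derived from the training labels. The sketch below shows this on made-up labels; `majority_baseline_accuracy` is a hypothetical helper and not part of the template.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Predict the most frequent training label for every evaluation example."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in eval_labels if label == majority)
    return majority, correct / len(eval_labels)

print(majority_baseline_accuracy([1, 1, 2, 0, 1], [1, 2, 1, 1]))  # (1, 0.75)
```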