### Coursework coding instructions (please also see full coursework spec)

Please choose either Task 1 or Task 2. Your report should cover one task only.

For the task you choose, you will need to implement two approaches:
  - Approach 1, which can use pre-trained embeddings / models
  - Approach 2, which should not use any pre-trained embeddings or models

We should be able to run both approaches from the same Colab file.

#### Running your code:
  - Your models should run automatically when running your colab file without further intervention
  - For each task you should automatically output the performance of both models
  - Your code should automatically download any libraries required

#### Structure of your code:
  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required
  - Otherwise there are no restrictions on what you can do in your code

#### Documentation:
  - You are expected to produce a README file summarising how you have approached both tasks

#### Reproducibility:
  - Your README file should explain how to replicate the different experiments mentioned in your report

Good luck! We are really looking forward to seeing your reports and your model code!

In [None]:
# You will need to download any word embeddings required for your code, e.g.:

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# For any packages that Colab does not provide automatically, you will also need to install these below, e.g.:

#! pip install torch

--2021-02-19 14:39:25--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-02-19 14:39:25--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-02-19 14:39:25--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

glove.6

In [None]:
# Download the task dataset

!wget https://www.cs.rochester.edu/u/nhossain/humicroedit/semeval-2020-task-7-data.zip
!unzip semeval-2020-task-7-data.zip
!rm semeval-2020-task-7-data.zip

In [None]:
%pip install tqdm

Collecting tqdm
  Downloading https://files.pythonhosted.org/packages/d9/13/f3f815bb73804a8af9cfbb6f084821c037109108885f46131045e8cf044e/tqdm-4.57.0-py2.py3-none-any.whl (72kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.57.0
You should consider upgrading via the 'pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Imports

import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset, random_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import torch.optim as optim
import codecs
import tqdm

In [2]:
# Setting random seed and device
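import random

SEED = 1

# Seed every random number generator the notebook relies on
random.seed(SEED)  # random.sample is used later for the train/dev split
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True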

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

In [3]:
# Load data
train_df = pd.read_csv('data/task-2/train.csv')
test_df = pd.read_csv('data/task-2/dev.csv')

In [4]:
# Number of epochs
epochs = 10

# Proportion of training data for train compared to dev
train_proportion = 0.9


#### BERT

##### Make BERT work

#### Approach 1: Using pre-trained representations

In [5]:
# We define our training loop
def train_regression(train_iter, dev_iter, model, number_epoch, input_ids_sentence1, attention_mark_sentence1, input_ids_sentence2, attention_mark_sentence2, dev_dataset, epsilon):
    """
    Training loop for the model, which calls on eval to evaluate after each epoch
    """

    
    print("Training model.")

    for epoch in range(1, number_epoch+1):

        model.train()
        epoch_loss = 0
        epoch_sse = 0
        no_observations = 0  # Observations used for training so far

        for batch in train_iter:

            input_ids, attention_mask, target = batch
            target = target.type(torch.FloatTensor)
            input_ids, attention_mask, target = input_ids.to(device), attention_mask.to(device), target.to(device)
            input_ids = input_ids.squeeze(1)
            attention_mask = attention_mask.squeeze(1)

            model.batch_size = target.shape[0]
            no_observations = no_observations + target.shape[0]
            # for RNN:
            # model.hidden = model.init_hidden()
            predictions = model(input_ids = input_ids, attention_mask = attention_mask).squeeze(1)
            optimizer.zero_grad()

            loss = loss_fn(predictions, target)

            sse, __ = model_performance_regression(predictions.detach().cpu().numpy(), target.detach().cpu().numpy())

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse

        valid_loss, valid_mse, valid_acc, __, __ = eval_regression(dev_iter, model, input_ids_sentence1, attention_mark_sentence1, input_ids_sentence2, attention_mark_sentence2, dev_dataset, epsilon)
        epoch_loss, epoch_mse = epoch_loss / no_observations, epoch_sse / no_observations
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.2f} | Train MSE: {epoch_mse:.2f} | Train RMSE: {epoch_mse**0.5:.2f} | \
        Val. Loss: {valid_loss:.2f} | Val. MSE: {valid_mse:.2f} |  Val. RMSE: {valid_mse**0.5:.2f} |  Val. Acc: {valid_acc:.2f} |')

In [6]:
# We define our training loop
def train(train_iter, dev_iter, model, number_epoch):
    """
    Training loop for the model, which calls on eval to evaluate after each epoch
    """

    print("Training model.")

    for epoch in range(1, number_epoch+1):
        
        model.train()
        
        epoch_loss = 0
        epoch_correct = 0
        no_observations = 0  # Observations used for training so far
        for batch in train_iter:

            if model.__class__.__name__ == 'BERT_regression':

              input_ids1, attention_mask1, input_ids2, attention_mask2, target = batch
              # target = target.type(torch.FloatTensor)
              input_ids1, attention_mask1, input_ids2, attention_mask2, target = input_ids1.to(device), attention_mask1.to(device), input_ids2.to(device), attention_mask2.to(device), target.to(device)
              input_ids1 = input_ids1.squeeze(1)
              attention_mask1 = attention_mask1.squeeze(1)
              input_ids2 = input_ids2.squeeze(1)
              attention_mask2 = attention_mask2.squeeze(1)

              predictions = model(input_ids1, attention_mask1, input_ids2, attention_mask2).squeeze(1)
          
            else:

              feature, target = batch
              feature, target = feature.to(device), target.to(device)
              # target = target.type(torch.FloatTensor)
              model.batch_size = target.shape[0]
              model.hidden = model.init_hidden()
              predictions = model(feature).squeeze(1)

            no_observations = no_observations + target.shape[0]
            # for RNN:
            # model.batch_size = target.shape[0]
            
            optimizer.zero_grad()
            loss = loss_fn(predictions, target)

            correct, __ = model_performance(np.argmax(predictions.detach().cpu().numpy(), axis=1), target.detach().cpu().numpy())

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()*target.shape[0]
            epoch_correct += correct

        valid_loss, valid_acc, __, __ = eval(dev_iter, model)

        epoch_loss, epoch_acc = epoch_loss / no_observations, epoch_correct / no_observations
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.2f} | Train Accuracy: {epoch_acc:.2f} | \
        Val. Loss: {valid_loss:.2f} | Val. Accuracy: {valid_acc:.2f} |')

In [7]:
# We evaluate performance on our dev set
def eval_regression(data_iter, model, input_ids_sentence1, attention_mark_sentence1, input_ids_sentence2, attention_mark_sentence2, dev_dataset, epsilon):
    """
    Evaluating model performance on the dev set
    """
    model.eval()
    epoch_loss = 0
    epoch_sse = 0
    pred_all = []
    trg_all = []
    no_observations = 0

    with torch.no_grad():
        for batch in data_iter:
            input_ids, attention_mask, target = batch
            target = target.type(torch.FloatTensor)
            input_ids, attention_mask, target = input_ids.to(device), attention_mask.to(device), target.to(device)
            input_ids = input_ids.squeeze(1)
            attention_mask = attention_mask.squeeze(1)
            
            model.batch_size = target.shape[0]
            no_observations = no_observations + target.shape[0]
            # for RNN:
            # model.hidden = model.init_hidden()

            predictions = model(input_ids = input_ids, attention_mask = attention_mask).squeeze(1)
            loss = loss_fn(predictions, target)

            # We get the mse
            pred, trg = predictions.detach().cpu().numpy(), target.detach().cpu().numpy()
            sse, __ = model_performance_regression(pred, trg)

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse
            pred_all.extend(pred)
            trg_all.extend(trg)
          
        prediction_data_sentence1, prediction_data_sentence2 = get_predictions(input_ids_sentence1, attention_mark_sentence1, input_ids_sentence2, attention_mark_sentence2)
        acc = get_accuracy(dev_dataset, prediction_data_sentence1, prediction_data_sentence2, epsilon)

    return epoch_loss/no_observations, epoch_sse/no_observations, acc, np.array(pred_all), np.array(trg_all)

In [8]:
# We evaluate performance on our dev set
def eval(data_iter, model):
    """
    Evaluating model performance on the dev set
    """
    model.eval()
    epoch_loss = 0
    epoch_correct = 0
    pred_all = []
    trg_all = []
    no_observations = 0

    with torch.no_grad():
        for batch in data_iter:

            if model.__class__.__name__ == 'BERT_regression':

              input_ids1, attention_mask1, input_ids2, attention_mask2, target = batch
              # target = target.type(torch.FloatTensor)
              input_ids1, attention_mask1, input_ids2, attention_mask2, target = input_ids1.to(device), attention_mask1.to(device), input_ids2.to(device), attention_mask2.to(device), target.to(device)
              input_ids1 = input_ids1.squeeze(1)
              attention_mask1 = attention_mask1.squeeze(1)
              input_ids2 = input_ids2.squeeze(1)
              attention_mask2 = attention_mask2.squeeze(1)

              predictions = model(input_ids1, attention_mask1, input_ids2, attention_mask2).squeeze(1)
          
            else:

              feature, target = batch
              feature, target = feature.to(device), target.to(device)
              # target = target.type(torch.FloatTensor)
              # for RNN: set the batch size first, then reset the hidden state once
              model.batch_size = target.shape[0]
              model.hidden = model.init_hidden()
              predictions = model(feature).squeeze(1)

            no_observations = no_observations + target.shape[0]
            loss = loss_fn(predictions, target)

            # We get the number of correct predictions
            pred, trg = predictions.detach().cpu().numpy(), target.detach().cpu().numpy()
            correct, __ = model_performance(np.argmax(pred, axis=1), trg)

            epoch_loss += loss.item()*target.shape[0]
            epoch_correct += correct
            pred_all.extend(pred)
            trg_all.extend(trg)

    return epoch_loss/no_observations, epoch_correct/no_observations, np.array(pred_all), np.array(trg_all)

In [9]:
def get_predictions(input_ids_sentence1, attention_mark_sentence1, input_ids_sentence2, attention_mark_sentence2):
  batch_size = 64
  model.eval()
  prediction_data1 = []
  prediction_data2 = []
  for i in range(0, len(input_ids_sentence1) // batch_size - 1):
      prediction_data1 += model(input_ids = input_ids_sentence1[i*batch_size:(i+1)*batch_size], attention_mask = attention_mark_sentence1[i*batch_size:(i+1)*batch_size]).detach().squeeze(1).tolist()
      prediction_data2 += model(input_ids = input_ids_sentence2[i*batch_size:(i+1)*batch_size], attention_mask = attention_mark_sentence2[i*batch_size:(i+1)*batch_size]).detach().squeeze(1).tolist()
  prediction_data1 += model(input_ids = input_ids_sentence1[(len(input_ids_sentence1) // batch_size - 1)*batch_size:], attention_mask = attention_mark_sentence1[(len(input_ids_sentence1) // batch_size - 1)*batch_size:]).detach().squeeze(1).tolist()
  prediction_data2 += model(input_ids = input_ids_sentence2[(len(input_ids_sentence1) // batch_size - 1)*batch_size:], attention_mask = attention_mark_sentence2[(len(input_ids_sentence1) // batch_size - 1)*batch_size:]).detach().squeeze(1).tolist()
  return prediction_data1, prediction_data2
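
The slicing in `get_predictions` can also be expressed as a generic chunking helper. The following is a minimal pure-Python sketch, assuming plain lists instead of tensors; `iter_batches` is a hypothetical name that does not appear in the notebook. Unlike the loop above, which folds the remainder into the last full batch, this version lets the final batch be smaller than `batch_size`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Five toy items split into batches of two; the last batch holds the remainder
batches = list(iter_batches(list(range(5)), batch_size=2))
print(batches)
```

Every item appears in exactly one batch, so concatenating the per-batch predictions preserves the original order.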

In [10]:
def get_accuracy(dataset, predicted1, predicted2, epsilon):
  dataset['predictedGrade1'] = predicted1
  dataset['predictedGrade2'] = predicted2
  label = []
  for i in range(len(predicted1)):
    diff = predicted1[i] - predicted2[i]
    if diff > epsilon:
      label.append(1)
    elif diff < -epsilon:
      label.append(2)
    else:
      label.append(0)
  dataset['predictedLabel'] = label
  dataset['correct'] = (dataset['trueLabel'] == dataset['predictedLabel'])
  # The mean of a boolean column is the share of correct rows; this also works when all rows are correct or all are wrong
  return dataset['correct'].mean()
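
The labelling rule inside `get_accuracy` can be illustrated on its own. This is a minimal pure-Python sketch, assuming a symmetric tolerance band of width `epsilon` around zero; the helper name `label_from_grades` and the sample grades are hypothetical.

```python
def label_from_grades(grade1, grade2, epsilon=1e-4):
    """Return 1 if the first grade is higher, 2 if the second is higher, 0 for a tie."""
    diff = grade1 - grade2
    if diff > epsilon:
        return 1
    if diff < -epsilon:
        return 2
    return 0


# Hypothetical predicted grades for three headline pairs
pairs = [(1.2, 0.4), (0.3, 0.9), (0.5, 0.5)]
labels = [label_from_grades(g1, g2) for g1, g2 in pairs]
print(labels)
```

Comparing against `-epsilon` in the second branch is what makes the tie label reachable for differences inside the band.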

In [11]:
# How we print the model performance
def model_performance_regression(output, target, print_output=False):
    """
    Returns SSE and MSE per batch (printing the MSE and the RMSE)
    """

    sq_error = (output - target)**2

    sse = np.sum(sq_error)
    mse = np.mean(sq_error)
    rmse = np.sqrt(mse)

    if print_output:
        print(f'| MSE: {mse:.2f} | RMSE: {rmse:.2f} |')

    return sse, mse
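
The training and evaluation loops accumulate the per-batch SSE and divide by the total number of observations, rather than averaging the per-batch MSE values. The sketch below, in pure Python with hypothetical numbers, shows why this matters when the last batch is smaller than the others.

```python
# Hypothetical (prediction, target) pairs split into two uneven batches
batches = [
    [(1.0, 1.5), (2.0, 1.0)],
    [(0.5, 0.5)],
]

total_sse = 0.0
total_obs = 0
batch_mses = []
for batch in batches:
    sq_errors = [(pred - trg) ** 2 for pred, trg in batch]
    total_sse += sum(sq_errors)
    total_obs += len(batch)
    batch_mses.append(sum(sq_errors) / len(batch))

# Exact MSE over all observations, as computed in the loops above
epoch_mse = total_sse / total_obs
# Unweighted mean of batch MSEs, which differs when batch sizes differ
mean_of_batch_mses = sum(batch_mses) / len(batch_mses)

print(f"epoch MSE: {epoch_mse:.4f}, mean of batch MSEs: {mean_of_batch_mses:.4f}")
```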

In [12]:
# How we print the model performance
def model_performance(output, target, print_output=False):
    """
    Returns the number of correct predictions and the accuracy for a batch, i.e. if you get 8/10 right, this returns (8, 0.8)
    """

    correct_answers = (output == target)
    correct = sum(correct_answers)
    acc = np.true_divide(correct,len(output))

    if print_output:
        print(f'| Acc: {acc:.2f} ')

    return correct, acc
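
In the classification loops, `model_performance` receives the argmax of the model outputs rather than the raw scores. A minimal pure-Python sketch of that step, assuming plain lists in place of NumPy arrays; `accuracy_from_logits` and the sample scores are hypothetical.

```python
def accuracy_from_logits(logits, targets):
    """Pick the highest-scoring class per row, then count matches with the targets."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct, correct / len(targets)


# Hypothetical scores for three examples over the labels 0, 1 and 2
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
targets = [1, 0, 1]
correct, acc = accuracy_from_logits(logits, targets)
print(correct, round(acc, 2))
```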

In [13]:
# To create our vocab
def create_vocab(data):
    """
    Creating a corpus of all the tokens used
    """
    tokenized_corpus = [] # Let us put the tokenized corpus in a list

    for sentence in data:

        tokenized_sentence = []

        for token in sentence.split(' '): # simplest tokenization: split on single spaces

            tokenized_sentence.append(token)

        tokenized_corpus.append(tokenized_sentence)

    # Create single list of all vocabulary
    vocabulary = []  # Let us put all the tokens (mostly words) appearing in the vocabulary in a list

    for sentence in tokenized_corpus:

        for token in sentence:

            if token not in vocabulary:
                vocabulary.append(token)

    return vocabulary, tokenized_corpus

In [14]:
# Used for collating our observations into minibatches:
def collate_fn_padd(batch):
    '''
    We add padding to our minibatches and create tensors for our model
    '''

    batch_labels = [l for f, l in batch]
    batch_features = [f for f, l in batch]

    batch_features_len = [len(f) for f, l in batch]

    seq_tensor = torch.zeros((len(batch), max(batch_features_len))).long()

    for idx, (seq, seqlen) in enumerate(zip(batch_features, batch_features_len)):
        seq_tensor[idx, :seqlen] = torch.LongTensor(seq)

    batch_labels = torch.LongTensor(batch_labels)

    return seq_tensor, batch_labels

# We create a Dataset so we can create minibatches
class Task1Dataset(Dataset):

    def __init__(self, train_data, labels):
        self.x_train = train_data
        self.y_train = labels

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, item):
        return self.x_train[item]['input_ids'], self.x_train[item]['attention_mask'], self.y_train[item]

# We create a Dataset so we can create minibatches
class Task2Dataset(Dataset):

    def __init__(self, train_data, labels):
        self.x_train = train_data
        self.y_train = labels

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, item):
        return self.x_train[item], self.y_train[item]

class Task2Dataset_BERT(Dataset):

    def __init__(self, train1_data, train2_data, labels):
        self.x1_train = train1_data
        self.x2_train = train2_data
        self.y_train = labels

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, item):
        return self.x1_train[item]['input_ids'], self.x1_train[item]['attention_mask'], self.x2_train[item]['input_ids'], self.x2_train[item]['attention_mask'], self.y_train[item]
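
The padding performed by `collate_fn_padd` earlier in this cell can be illustrated without tensors. This is a minimal pure-Python sketch, assuming 0 is the padding ID as in the zero-initialised `seq_tensor`; `pad_batch` and the token IDs are hypothetical.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Three hypothetical token-ID sequences of different lengths
batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(batch)
print(padded)
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps the batch tensors as small as possible.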

In [15]:
%pip install transformers
import transformers
from transformers import BertTokenizer, BertPreTrainedModel, BertModel
bert_model = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(bert_model)



In [48]:
class BERT_regression(BertPreTrainedModel):

  def __init__(self, config):
    super().__init__(config)

    # BERT model
    self.bert = BertModel(config)
    
    # Classification layer
    self.final_layer = torch.nn.Sequential(torch.nn.Dropout(0.2),
                                           torch.nn.Linear(3 * config.hidden_size , 3))
    
  def forward(
    self,
    input_ids1=None,
    attention_mask1=None,
    input_ids2=None,
    attention_mask2=None):
 
    outputs1 = self.bert(
        input_ids1,
        attention_mask=attention_mask1)
    
    outputs2 = self.bert(
        input_ids2,
        attention_mask=attention_mask2
    )

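    # Element-wise absolute difference between the two pooled sentence embeddings
    # (named abs_diff to avoid shadowing the built-in abs)
    abs_diff = torch.abs(outputs1[1] - outputs2[1])

    outputs = torch.cat((outputs1[1], outputs2[1], abs_diff), 1)

    # Classification over the three labels (0, 1, 2)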

    out = self.final_layer(outputs)
    
    return out
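
The final linear layer takes `3 * config.hidden_size` inputs because the two sentence embeddings are concatenated with their element-wise absolute difference. A minimal pure-Python sketch of that feature construction, assuming tiny hypothetical vectors in place of BERT's pooled outputs; `pair_features` is an illustrative name.

```python
def pair_features(u, v):
    """Concatenate u, v and |u - v| into a single feature vector."""
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return u + v + abs_diff


hidden_size = 4  # hypothetical; the real value comes from the BERT config
u = [0.1, -0.2, 0.3, 0.0]
v = [0.4, -0.2, -0.1, 0.5]
features = pair_features(u, v)
print(len(features))
```

The difference term is symmetric in `u` and `v`, while the concatenation order still tells the classifier which sentence came first.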

In [36]:
import re

def replace_word(sentence, new_word):
  # Non-greedy match, so only the first <word/> span is captured
  search = re.search(r"<(.+?)/>", sentence)
  word_to_replace = search.group(0)
  sentence = sentence.replace(word_to_replace, new_word)
  return sentence
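
As an illustration of the substitution, here is a minimal sketch that uses `re.sub` instead of the search-then-replace approach above. The function name `apply_edit` and the sample headline are hypothetical; the `<word/>` markup follows the pattern matched by `replace_word`.

```python
import re


def apply_edit(headline, new_word):
    """Swap the first <.../> marked span for new_word."""
    # A callable replacement avoids treating backslashes in new_word as escapes
    return re.sub(r"<.+?/>", lambda _match: new_word, headline, count=1)


headline = "Senate votes to <repeal/> the bill"  # hypothetical example
edited = apply_edit(headline, "tickle")
print(edited)
```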

In [37]:
def replace_word_dataset(sentence_dataset, word_dataset):
  new_dataset = []
  for i in range (len(sentence_dataset)):
    new_sentence = replace_word(sentence_dataset[i], word_dataset[i])
    new_dataset.append(new_sentence)
  return new_dataset

In [38]:
# We set our training data and test data

training_data1 = train_df['original1']
training_edit1 = train_df['edit1']
test_data1 = test_df['original1']
test_edit1 = test_df['edit1']

training_data2 = train_df['original2']
training_edit2 = train_df['edit2']
test_data2 = test_df['original2']
test_edit2 = test_df['edit2']

# We replace the marked word with the edited word
edit_training_data1 = replace_word_dataset(training_data1, training_edit1)
edit_test_data1 = replace_word_dataset(test_data1, test_edit1)
edit_training_data2 = replace_word_dataset(training_data2, training_edit2)
edit_test_data2 = replace_word_dataset(test_data2, test_edit2)

In [39]:
# # We create a new dataframe with the editted sentences, their grade and the label

# # Get the total dataset

# editted_data = {'sentence1': edit_training_data1 ,
#         'sentence2': edit_training_data2,
#         'meanGrade1': train_df['meanGrade1'].values,
#         'meanGrade2': train_df['meanGrade2'].values,
#         'trueLabel': train_df['label'].values}

# editted_df = pd.DataFrame(editted_data, columns = ['sentence1', 'sentence2', 'meanGrade1', 'meanGrade2', 'trueLabel'])

In [40]:
# # We create only one column of sentences for regression

# def transform_editted_to_regression(dataset):
#   data = {'sentence': [*dataset['sentence1'], *dataset['sentence2']],
#         'meanGrade': [*dataset['meanGrade1'], *dataset['meanGrade2']]}

#   df = pd.DataFrame(data, columns = ['sentence', 'meanGrade'])
#   df = df.drop_duplicates()
#   return df

In [41]:
# TODO : work on max length
def tokenized_dataset(dataset):
  tokenized = []
  for i in range(len(dataset)):
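    # padding='max_length' replaces the deprecated pad_to_max_length; truncation keeps every encoding at exactly max_length tokens
    encoding = tokenizer.encode_plus(dataset[i], max_length=30, return_token_type_ids=False, padding='max_length', truncation=True, return_attention_mask=True, return_tensors='pt')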
    tokenized.append(encoding)
  return tokenized

In [42]:
def get_input_ids_attention_mask(dataset):
  input_ids = []
  attention_mask = []
  tokenized = tokenized_dataset(dataset)
  for i in range(len(dataset)):
    input_ids.append(tokenized[i]['input_ids'].tolist()[0])
    attention_mask.append(tokenized[i]['attention_mask'].tolist()[0])
  return torch.LongTensor(input_ids).to(device), torch.LongTensor(attention_mask).to(device)

In [43]:
def delete_sentence_present_in_validation_dataset(training, validation):
    cond = training['sentence'].isin(validation['sentence'])
    training.drop(training[cond].index, inplace = True)
    return training

In [44]:
tokenized_training1 = tokenized_dataset(edit_training_data1)
tokenized_training2 = tokenized_dataset(edit_training_data2)



In [45]:
def get_dataset_and_inverse(labels, tokenized_training1, tokenized_training2):
  inverse_labels = []
  for label in labels:
    if label == 1:
      inverse_labels.append(2)
    elif label == 2:
      inverse_labels.append(1)
    else:
      inverse_labels.append(0)
  total_labels = [*labels, *inverse_labels]
  left_tokenized = [*tokenized_training1, *tokenized_training2]
  right_tokenized = [*tokenized_training2, *tokenized_training1]
  return total_labels, left_tokenized, right_tokenized
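
`get_dataset_and_inverse` doubles the data by presenting every pair in both orders and swapping labels 1 and 2 for the reversed copy. A minimal pure-Python sketch of the same idea, assuming strings in place of tokenized encodings; `augment_with_swaps` and `SWAPPED` are hypothetical names.

```python
# Label 0 (tie) stays unchanged; labels 1 and 2 trade places when the pair is reversed
SWAPPED = {0: 0, 1: 2, 2: 1}


def augment_with_swaps(labels, left, right):
    """Append a reversed copy of every pair together with its swapped label."""
    all_labels = list(labels) + [SWAPPED[label] for label in labels]
    all_left = list(left) + list(right)
    all_right = list(right) + list(left)
    return all_labels, all_left, all_right


aug_labels, aug_left, aug_right = augment_with_swaps([1, 0], ["a1", "b1"], ["a2", "b2"])
print(aug_labels, aug_left, aug_right)
```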

In [46]:
import random

def get_train_and_eval(tokenized_training1, tokenized_training2, train_examples):
  # A set makes the membership test in the loop below O(1) per lookup
  train_index = set(random.sample(range(0, len(train_df['label'].values)), train_examples))
  train_tokenized_training1 = []
  train_tokenized_training2 = []
  train_labels = []

  eval_tokenized_training1 = []
  eval_tokenized_training2 = []
  eval_labels = []

  for i in range(len(train_df['label'].values)):
    if i in train_index:
      train_tokenized_training1.append(tokenized_training1[i])
      train_tokenized_training2.append(tokenized_training2[i])
      train_labels.append(train_df['label'].values[i])
    else:
      eval_tokenized_training1.append(tokenized_training1[i])
      eval_tokenized_training2.append(tokenized_training2[i])
      eval_labels.append(train_df['label'].values[i])
  
  return train_tokenized_training1, train_tokenized_training2, train_labels, eval_tokenized_training1, eval_tokenized_training2, eval_labels
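
The split in `get_train_and_eval` amounts to sampling a set of training indices and sending every other index to the evaluation side. A minimal sketch using only the standard library; `split_indices` and its `seed` parameter are hypothetical and shown only to make the split repeatable.

```python
import random


def split_indices(n_items, n_train, seed=1):
    """Sample n_train distinct indices for training; the rest go to evaluation."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    train_idx = set(rng.sample(range(n_items), n_train))
    eval_idx = [i for i in range(n_items) if i not in train_idx]
    return sorted(train_idx), eval_idx


train_idx, eval_idx = split_indices(n_items=10, n_train=9)
print(len(train_idx), len(eval_idx))
```

Splitting by index before the pairs are duplicated in both orders keeps a pair and its reversed copy on the same side of the split.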

In [49]:
BATCH_SIZE = 64

import transformers
from transformers import BertForSequenceClassification
# We initialise BERT model for classification

model = BERT_regression.from_pretrained(bert_model)
print("Model initialised.")
model.to(device)

# 'feature' is a list of lists, each containing embedding IDs for word tokens

train_examples = round(len(train_df['label'].values)*train_proportion)
# dev_examples = len(train_and_dev) - train_examples

train_tokenized_training1, train_tokenized_training2, train_labels, eval_tokenized_training1, eval_tokenized_training2, eval_labels = get_train_and_eval(tokenized_training1, tokenized_training2, train_examples)

train_total_labels, train_left_tokenized, train_right_tokenized = get_dataset_and_inverse(train_labels, train_tokenized_training1, train_tokenized_training2)
eval_total_labels, eval_left_tokenized, eval_right_tokenized = get_dataset_and_inverse(eval_labels, eval_tokenized_training1, eval_tokenized_training2)

train_dataset = Task2Dataset_BERT(train_left_tokenized, train_right_tokenized, train_total_labels)
dev_dataset = Task2Dataset_BERT(eval_left_tokenized, eval_right_tokenized, eval_total_labels)

# train_dataset, dev_dataset = random_split(train_and_dev,
#                                            (train_examples,
#                                             dev_examples))


train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE) #, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE) #, collate_fn=collate_fn_padd)

print("Dataloaders created.")

loss_fn = nn.CrossEntropyLoss()
loss_fn = loss_fn.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

train(train_loader, dev_loader, model, epochs)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BERT_regression: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BERT_regression from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BERT_regression from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BERT_regression were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['final_layer.

Model initialised.
Dataloaders created.
Training model.
| Epoch: 01 | Train Loss: 0.97 | Train Accuracy: 0.45 |         Val. Loss: 0.98 | Val. Accuracy: 0.47 |
| Epoch: 02 | Train Loss: 0.95 | Train Accuracy: 0.48 |         Val. Loss: 0.96 | Val. Accuracy: 0.52 |
| Epoch: 03 | Train Loss: 0.87 | Train Accuracy: 0.60 |         Val. Loss: 0.91 | Val. Accuracy: 0.59 |
| Epoch: 04 | Train Loss: 0.71 | Train Accuracy: 0.70 |         Val. Loss: 0.89 | Val. Accuracy: 0.62 |
| Epoch: 05 | Train Loss: 0.57 | Train Accuracy: 0.77 |         Val. Loss: 0.89 | Val. Accuracy: 0.64 |
| Epoch: 06 | Train Loss: 0.45 | Train Accuracy: 0.81 |         Val. Loss: 0.89 | Val. Accuracy: 0.66 |
| Epoch: 07 | Train Loss: 0.38 | Train Accuracy: 0.84 |         Val. Loss: 0.94 | Val. Accuracy: 0.64 |
| Epoch: 08 | Train Loss: 0.32 | Train Accuracy: 0.87 |         Val. Loss: 1.03 | Val. Accuracy: 0.63 |
| Epoch: 09 | Train Loss: 0.25 | Train Accuracy: 0.91 |         Val. Loss: 1.08 | Val. Accuracy: 0.62 |
| Epoch:

In [None]:
from sklearn.model_selection import train_test_split

BATCH_SIZE = 64
EPSILON = 10**(-4)
model = BERT_regression.from_pretrained(bert_model)
print("Model initialised.")
model.to(device)

# We provide the model with our embeddings
# model.embedding.weight.data.copy_(torch.from_numpy(wvecs))

train_df, dev_df = train_test_split(editted_df, test_size=1 - train_proportion)

regression_train_df = transform_editted_to_regression(train_df)
regression_dev_df = transform_editted_to_regression(dev_df)
regression_train_df = delete_sentence_present_in_validation_dataset(regression_train_df, regression_dev_df)
print('Nb sentences regression_train_df: ' + str(regression_train_df.shape[0]))
print('Nb sentences regression_dev_df: ' + str(regression_dev_df.shape[0]))

train_feature = tokenized_dataset(regression_train_df['sentence'].values)
dev_feature = tokenized_dataset(regression_dev_df['sentence'].values)

# 'feature' is a list of lists, each containing embedding IDs for word tokens
train_dataset = Task1Dataset(train_feature, regression_train_df['meanGrade'].values)
dev_dataset = Task1Dataset(dev_feature, regression_dev_df['meanGrade'].values)

# train_examples_regression = round(len(train_and_dev)*train_proportion)
# dev_examples_regression = len(train_and_dev) - train_examples

# train_dataset, dev_dataset = random_split(train_and_dev,
#                                            (train_examples,
#                                             dev_examples))


train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE) #, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE) #, collate_fn=collate_fn_padd)

print("Dataloaders created.")

loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

# optimizer = torch.optim.Adam(model.parameters())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Get element to test result on dev_dataset
dev_sentence1_input_ids, dev_sentence1_attention_mask = get_input_ids_attention_mask(dev_df['sentence1'].values)
dev_sentence2_input_ids, dev_sentence2_attention_mask = get_input_ids_attention_mask(dev_df['sentence2'].values)

train_regression(train_loader, dev_loader, model, epochs, dev_sentence1_input_ids, dev_sentence1_attention_mask, dev_sentence2_input_ids, dev_sentence2_attention_mask, dev_df, EPSILON)

In [None]:
BATCH_SIZE = 64

import transformers
from transformers import BertForSequenceClassification
# We initialise BERT model for classification

model = BERT_regression.from_pretrained(bert_model, num_labels=3)
print("Model initialised.")
model.to(device)

# 'feature' is a list of lists, each containing embedding IDs for word tokens
train_and_dev = Task2Dataset(tokenized_training, train_df['label'])

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(train_and_dev,
                                           (train_examples,
                                            dev_examples))


train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE) #, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE) #, collate_fn=collate_fn_padd)

print("Dataloaders created.")

loss_fn = nn.CrossEntropyLoss()
loss_fn = loss_fn.to(device)

# optimizer = torch.optim.Adam(model.parameters())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

train(train_loader, dev_loader, model, epochs)

#### Approach 2: No pre-trained representations

In [50]:
%%bash
URL="https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2"

for split in "train" "valid" "test"; do
  if [ ! -f "${split}.txt" ]; then
    echo "Downloading ${split}.txt"
    wget -q "${URL}/${split}.txt"
    # Remove empty lines
    sed -i '/^ *$/d' "${split}.txt"
    # Remove article titles starting with = and ending with =
    sed -i '/^ *= .* = $/d' "${split}".txt
  fi
done

Downloading train.txt
Downloading valid.txt
Downloading test.txt


In [51]:
class Vocabulary(object):
  """Data structure representing the vocabulary of a corpus."""
  def __init__(self, window_size=2):
    self.window_size = window_size

    # Mapping from tokens to integers
    self._word2idx = {}

    # Reverse-mapping from integers to tokens
    self.idx2word = []

    # Pairs of words according to the window size
    self.idx_pairs = []

    # 0-padding token
    # TODO: keep or remove
    self.add_word('<pad>')
    # sentence start
    # TODO: keep or remove
    self.add_word('<s>')
    # sentence end
    # TODO: keep or remove
    self.add_word('</s>')
    # Unknown words
    self.add_word('<unk>')
    # add separator
    self.add_word('<sep>')

    self._unk_idx = self._word2idx['<unk>']

  def word2idx(self, word):
    """Returns the integer ID of the word or <unk> if not found."""
    return self._word2idx.get(word, self._unk_idx)

  def add_word(self, word):
    """Adds the `word` into the vocabulary."""
    if word not in self._word2idx:
      self.idx2word.append(word)
      self._word2idx[word] = len(self.idx2word) - 1

  def build_from_file(self, fname):
    """Builds a vocabulary from a given corpus file."""
    with open(fname) as f:
      for line in f:
        words = line.strip().split()
        for word in words:
          self.add_word(word)
  
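  def get_idx_pairs(self, sentence, id):
    """Returns the (centre, context) index pairs for the word at position `id`."""
    idx_pairs = []
    for w in range(-self.window_size, self.window_size+1):
      id_context_word = id + w
      # skip positions outside the sentence and the centre word itself
      if id_context_word < 0 or id_context_word >= len(sentence) or w == 0:
        continue
      idx_pairs.append(self.convert_words_to_idxs([sentence[id], sentence[id_context_word]]))
    return idx_pairs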
  
  def build_idx_pairs(self, fname):
    """Builds the pairs of indices from a given corpus file."""
    with open(fname) as f:
      for line in f:
        sentence = line.strip().split()
        for id in range(0, len(sentence)):
          # accumulate the pairs of every word, not only the last one
          self.idx_pairs += self.get_idx_pairs(sentence, id)

  def convert_idxs_to_words(self, idxs):
    """Converts a list of indices to words."""
    return ' '.join(self.idx2word[idx] for idx in idxs)

  def convert_words_to_idxs(self, words):
    """Converts a list of words to a list of indices."""
    return [self.word2idx(w) for w in words]

  def __len__(self):
    """Returns the size of the vocabulary."""
    return len(self.idx2word)
  
  def __repr__(self):
    return "Vocabulary with {} items".format(self.__len__())
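
The pair extraction that `get_idx_pairs` is meant to perform can be illustrated on a toy sentence. The following is a minimal sketch, not the notebook's implementation: `skipgram_pairs` is a hypothetical helper that works on raw tokens rather than vocabulary IDs, and pairs each centre word with every neighbour at most `window_size` positions away.

```python
def skipgram_pairs(sentence, window_size=2):
    """Return (centre, context) token pairs within the given window."""
    pairs = []
    for centre_pos, centre in enumerate(sentence):
        # clamp the window to the sentence boundaries
        lo = max(0, centre_pos - window_size)
        hi = min(len(sentence), centre_pos + window_size + 1)
        for context_pos in range(lo, hi):
            if context_pos != centre_pos:
                pairs.append((centre, sentence[context_pos]))
    return pairs


# Toy sentence for illustration only.
toy_pairs = skipgram_pairs(["the", "cat", "sat", "down"], window_size=1)
print(toy_pairs)
```

With a window of 1, each interior word contributes two pairs and each boundary word contributes one, so a four-word sentence yields six pairs.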

In [52]:
# Create the vocabulary based on train.txt of WikiText-2
vocab = Vocabulary(window_size=2)
vocab.build_from_file('train.txt')
vocab.build_idx_pairs('train.txt')

In [53]:
# Parameters
vocabulary_size = len(vocab)
embedding_dims = 10
num_epochs = 100
learning_rate = 0.001
nb_words_to_sample = 5

# Matrices
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True, device=device)
W2 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True, device=device)

In [54]:
def one_hot_encoder(word):
  array = torch.zeros(vocabulary_size)
  id = vocab.word2idx(word)
  array[id] = 1.0
  return array

In [55]:
# Compute word embeddings with negative sampling

from torch.autograd import Variable
import torch.nn.functional as F

for epoch in range(num_epochs):
  epoch_loss = 0
  percentage = 0
  count = 0
  for data, target in vocab.idx_pairs:
      count += 1
      if round(count/len(vocab.idx_pairs)*100) > percentage:
        percentage = round(count/len(vocab.idx_pairs)*100)
        print(str(percentage) + '% of epoch ' + str(epoch))
      # idx_pairs holds integer IDs, while one_hot_encoder expects a word,
      # so map each ID back to its token first
      x_var = one_hot_encoder(vocab.idx2word[data]).float().to(device)
      y_pos_var = one_hot_encoder(vocab.idx2word[target]).float().to(device)

      # draw the negative samples uniformly at random from the vocabulary
      neg_sample = np.random.choice(vocabulary_size, size=nb_words_to_sample)
      y_neg = []
      for id_neg_sample in neg_sample:
        word = vocab.idx2word[id_neg_sample]
        y_neg.append(one_hot_encoder(word))
      # shape: (nb_words_to_sample, vocabulary_size)
      y_neg_var = torch.stack(y_neg).float().to(device)

      x_emb = torch.matmul(W1, x_var) 
      y_pos_emb = torch.matmul(W2, y_pos_var)
      y_neg_emb = torch.matmul(W2, y_neg_var.transpose(0,1))

      # get positive sample score
      pos_loss = F.logsigmoid(torch.matmul(x_emb, y_pos_emb))
        
      # get negsample score
      neg_loss = F.logsigmoid(-1 * torch.matmul(x_emb, y_neg_emb))
      exp_neg_loss = torch.mean(neg_loss)
        
      loss = - (pos_loss + nb_words_to_sample * exp_neg_loss)
      epoch_loss += loss.item()
        
      # propagate the error
      loss.backward()
        
      # gradient descent
      W1.data -= learning_rate * W1.grad.data
      W2.data -= learning_rate * W2.grad.data

      # zero out gradient accumulation
      W1.grad.data.zero_()
      W2.grad.data.zero_()
        
  print(f'Loss at epoch {epoch}: {epoch_loss/len(vocab.idx_pairs)}')


1% of epoch 0
2% of epoch 0
3% of epoch 0
4% of epoch 0
5% of epoch 0
6% of epoch 0


KeyboardInterrupt: ignored
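
The loss in the training cell above has the form of the skip-gram negative-sampling objective: the negative of log σ(x·y_pos) plus the sum over sampled words of log σ(−x·y_neg), since the mean multiplied by `nb_words_to_sample` equals the sum. Below is a minimal pure-Python sketch of that formula on made-up 2-dimensional vectors. `sgns_loss`, `log_sigmoid` and `dot` are hypothetical helpers introduced here for illustration and are not part of the notebook.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(centre, positive, negatives):
    """Negative-sampling loss for one (centre, context) pair."""
    pos_term = log_sigmoid(dot(centre, positive))
    neg_term = sum(log_sigmoid(-dot(centre, neg)) for neg in negatives)
    return -(pos_term + neg_term)


# Illustrative vectors only; not taken from the trained embeddings.
centre = [1.0, 0.0]
aligned = sgns_loss(centre, [2.0, 0.0], [[-2.0, 0.0]])
opposed = sgns_loss(centre, [-2.0, 0.0], [[2.0, 0.0]])
print(aligned < opposed)
```

The loss is small when the centre vector points towards its true context and away from the sampled words, and large in the opposite configuration.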

In [None]:
# Get a clean dataset

In [56]:
def concat_datasets(dataset1, dataset2):
  concat_dataset = []
  for i in range (len(dataset1)):
    new_sentence = dataset1[i] + " <sep> " + dataset2[i]
    # concat_dataset.append(new_sentence)
    concat_dataset.append(new_sentence.lower())
  return pd.Series(concat_dataset)

In [57]:
def tokenized_corpus(concat_dataset):
  # 'corpus' avoids shadowing the function name inside its own body
  corpus = []
  for sentence in concat_dataset:
    tokenized_sentence = []
    for token in sentence.split(' '):
      tokenized_sentence.append(token)
    corpus.append(tokenized_sentence)
  return corpus

In [58]:
def get_vectorized_seq(tokenized_data):
  vectorized_seqs = []
  for seq in tokenized_data:
    vectorized_seq = []
    for tok in seq:
        vectorized_seq.append(vocab.word2idx(tok))
    vectorized_seqs.append(vectorized_seq)
  return vectorized_seqs
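
The mapping from tokens to integer IDs used by `get_vectorized_seq` can be sketched on a toy vocabulary. This is a simplified illustration under stated assumptions: `build_word2idx` and `vectorize` are hypothetical helpers, the special tokens are a subset of those in `Vocabulary`, and any token not seen while building the vocabulary falls back to `<unk>`.

```python
def build_word2idx(sentences, specials=("<pad>", "<unk>", "<sep>")):
    """Assign consecutive integer IDs, reserving the first IDs for special tokens."""
    word2idx = {}
    for token in specials:
        word2idx.setdefault(token, len(word2idx))
    for sentence in sentences:
        for token in sentence.split():
            word2idx.setdefault(token, len(word2idx))
    return word2idx


def vectorize(sentence, word2idx):
    """Lowercase, split on whitespace, and map unknown tokens to <unk>."""
    unk = word2idx["<unk>"]
    return [word2idx.get(token, unk) for token in sentence.lower().split()]


# Toy data for illustration only.
toy_vocab = build_word2idx(["the cat sat"])
ids = vectorize("The dog sat <sep> the cat", toy_vocab)
print(ids)
```

Here "dog" was never added to the toy vocabulary, so it is the only token that receives the `<unk>` ID.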

In [59]:
# We set our training data and test data
training_data1 = train_df['original1']
training_edit1 = train_df['edit1']
test_data1 = test_df['original1']
test_edit1 = test_df['edit1']

training_data2 = train_df['original2']
training_edit2 = train_df['edit2']
test_data2 = test_df['original2']
test_edit2 = test_df['edit2']

# We replace the original word with the edited word
edit_training_data1 = replace_word_dataset(training_data1, training_edit1)
edit_test_data1 = replace_word_dataset(test_data1, test_edit1)
edit_training_data2 = replace_word_dataset(training_data2, training_edit2)
edit_test_data2 = replace_word_dataset(test_data2, test_edit2)

# We concatenate both parts
training_data = concat_datasets(edit_training_data1, edit_training_data2)
test_data = concat_datasets(edit_test_data1, edit_test_data2)

# # Creating word vectors
# training_vocab, training_tokenized_corpus = create_vocab(training_data)
# test_vocab, test_tokenized_corpus = create_vocab(test_data)

# # Creating joint vocab from test and train:
# joint_vocab, joint_tokenized_corpus = create_vocab(pd.concat([training_data, test_data]))

# print("Vocab created.")

In [60]:
# Get tokenized datasets
# TODO: preprocessing?

tokenized_training_data = tokenized_corpus(training_data.values)
tokenized_test_data = tokenized_corpus(test_data.values)

# Get vectorized seq
vectorized_training_data = get_vectorized_seq(tokenized_training_data)

In [61]:
INPUT_DIM = vocabulary_size
BATCH_SIZE = 32
EMBEDDING_DIM = embedding_dims

In [62]:
class BiLSTM_classification(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, batch_size, device):
        super(BiLSTM_classification, self).__init__()
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.device = device
        self.batch_size = batch_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2label = nn.Linear(hidden_dim * 2, 3)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly why they have this dimensionality.
        # The axes semantics are (num_layers * num_directions, minibatch_size, hidden_dim)
        return torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device), \
               torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device)

    def forward(self, sentence):
        embedded = self.embedding(sentence)
        embedded = embedded.permute(1, 0, 2)

        lstm_out, self.hidden = self.lstm(
            embedded.view(len(embedded), self.batch_size, self.embedding_dim), self.hidden)

        out = self.hidden2label(lstm_out[-1])
        return out
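
The embedding layer above reserves index 0 as `padding_idx`, so batches of variable-length ID sequences are expected to be padded with 0 up to a common length. The sketch below shows one way to do this in plain Python. `pad_batch` is a hypothetical helper and is not the notebook's `collate_fn_padd`, whose exact behaviour is defined elsewhere.

```python
def pad_batch(sequences, pad_idx=0):
    """Right-pad every sequence with pad_idx to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


# Toy ID sequences for illustration only.
padded, lengths = pad_batch([[5, 7, 9], [4], [8, 6]])
print(padded)
print(lengths)
```

Keeping the original lengths alongside the padded batch is useful because, with right-padding, the last time step of a shorter sequence is a padding position rather than its final word.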

In [63]:
model = BiLSTM_classification(EMBEDDING_DIM, 50, INPUT_DIM, BATCH_SIZE, device)
print("Model initialised.")

model.to(device)
# We provide the model with our embeddings
model.embedding.weight.data.copy_(W1.transpose(0,1))

feature = vectorized_training_data

# 'feature' is a list of lists, each containing embedding IDs for word tokens
train_and_dev = Task2Dataset(feature, train_df['label'].values)

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(train_and_dev,
                                           (train_examples,
                                            dev_examples))


train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)

print("Dataloaders created.")

loss_fn = nn.CrossEntropyLoss()
loss_fn = loss_fn.to(device)

optimizer = torch.optim.Adam(model.parameters())

Model initialised.
torch.Size([33234, 10])
Dataloaders created.


In [65]:
train(train_loader, dev_loader, model, epochs)

Training model.
| Epoch: 01 | Train Loss: 0.97 | Train Accuracy: 0.44 |         Val. Loss: 0.95 | Val. Accuracy: 0.47 |
| Epoch: 02 | Train Loss: 0.96 | Train Accuracy: 0.45 |         Val. Loss: 0.96 | Val. Accuracy: 0.46 |
| Epoch: 03 | Train Loss: 0.96 | Train Accuracy: 0.44 |         Val. Loss: 0.96 | Val. Accuracy: 0.43 |
| Epoch: 04 | Train Loss: 0.96 | Train Accuracy: 0.44 |         Val. Loss: 0.96 | Val. Accuracy: 0.43 |
| Epoch: 05 | Train Loss: 0.96 | Train Accuracy: 0.44 |         Val. Loss: 0.96 | Val. Accuracy: 0.47 |


KeyboardInterrupt: ignored

In [None]:
train_and_dev = train_df['edit1']

training_data, dev_data, training_y, dev_y = train_test_split(train_df['edit1'], train_df['label'],
                                                                        test_size=(1-train_proportion),
                                                                        random_state=42)

# We train a Naive Bayes model on TF-IDF features
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
naive_model = MultinomialNB().fit(train_counts, training_y)

# Train predictions
predicted_train = naive_model.predict(train_counts)

# Calculate TF-IDF using train and dev, and validate on dev:
train_and_dev_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(train_and_dev_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = naive_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")

sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)

#### Baseline for task 2

In [None]:
# Baseline for the task
pred_baseline = torch.zeros(len(dev_y)) + 1  # 1 is most common class
print("\nBaseline performance:")
sse, mse = model_performance(pred_baseline, torch.tensor(dev_y.values), True)
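
A majority-class baseline like the one above can also be derived from the label counts instead of hard-coding the class. The following is a minimal sketch on made-up labels: `majority_baseline` is a hypothetical helper, and it reports plain accuracy rather than calling the notebook's `model_performance`.

```python
from collections import Counter


def majority_baseline(train_labels, eval_labels):
    """Predict the most frequent training label for every evaluation example."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    predictions = [majority_label] * len(eval_labels)
    correct = sum(p == y for p, y in zip(predictions, eval_labels))
    return majority_label, correct / len(eval_labels)


# Toy labels for illustration only; not the coursework data.
label, accuracy = majority_baseline([1, 1, 2, 0, 1], [1, 2, 1, 1])
print(label, accuracy)
```

Any trained model should beat this accuracy on the dev set to be considered useful.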