<a href="https://colab.research.google.com/github/marektopolewski/ic-nlp-cw1/blob/master/NLP_Task2_Roberta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Coursework coding instructions (please also see full coursework spec)

Please choose if you want to do either Task 1 or Task 2. You should write your report about one task only.

For the task you choose you will need to do two approaches:
  - Approach 1, which can use use pre-trained embeddings / models
  - Approach 2, which should not use any pre-trained embeddings or models
We should be able to run both approaches from the same colab file

#### Running your code:
  - Your models should run automatically when running your colab file without further intervention
  - For each task you should automatically output the performance of both models
  - Your code should automatically download any libraries required

#### Structure of your code:
  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required
  - Otherwise there are no restrictions on what you can do in your code

#### Documentation:
  - You are expected to produce a .README file summarising how you have approached both tasks

#### Reproducibility:
  - Your .README file should explain how to replicate the different experiments mentioned in your report

Good luck! We are really looking forward to seeing your reports and your model code!

In [1]:
# You will need to download any word embeddings required for your code, e.g.:

# !wget -q http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip
# !rm glove.6B.zip

!wget -q https://www.cs.rochester.edu/u/nhossain/humicroedit/semeval-2020-task-7-data.zip
!unzip -q semeval-2020-task-7-data.zip
!rm semeval-2020-task-7-data.zip
!rm -r data/task-1/

!wget https://www.cs.rochester.edu/u/nhossain/funlines/semeval-2020-task-7-extra-training-data.zip
!unzip -q semeval-2020-task-7-extra-training-data.zip
!rm semeval-2020-task-7-extra-training-data.zip
!mv semeval-2020-task-7-extra-training-data/task-2 data/funlines
!rm -r semeval-2020-task-7-extra-training-data

# For any packages that Colab does not provide auotmatically you will also need to install these below, e.g.:

! pip -q install transformers

--2021-02-24 17:08:41--  https://www.cs.rochester.edu/u/nhossain/funlines/semeval-2020-task-7-extra-training-data.zip
Resolving www.cs.rochester.edu (www.cs.rochester.edu)... 192.5.53.208
Connecting to www.cs.rochester.edu (www.cs.rochester.edu)|192.5.53.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 431136 (421K) [application/zip]
Saving to: ‘semeval-2020-task-7-extra-training-data.zip’


2021-02-24 17:08:41 (5.17 MB/s) - ‘semeval-2020-task-7-extra-training-data.zip’ saved [431136/431136]

[K     |████████████████████████████████| 1.8MB 17.2MB/s 
[K     |████████████████████████████████| 890kB 48.8MB/s 
[K     |████████████████████████████████| 3.2MB 52.4MB/s 
[?25h  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone


In [2]:
# Imports

import torch
import torch.nn as nn
from torch.nn.utils.clip_grad import clip_grad_norm_
from sklearn.feature_extraction.text import TfidfTransformer
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset, random_split, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import torch.optim as optim
import codecs
from tqdm import tqdm
from transformers import RobertaModel, RobertaTokenizer
import matplotlib.pyplot as plt

In [3]:
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
print(device)

cuda:0


In [4]:
# Number of epochs
EPOCHS = 10
BATCH_SIZE = 16
LEARNING_RATE = 2e-05
ADAM_EPSILON = 1e-8
GRAD_CLIP = 5

# Model params
context_types = ['none', 'masked', 'original']
CONTEXT_TYPE = context_types[2]
CLASSIFICATION_HEAD = True

# Proportion of training data for train compared to dev
TRAINING_RATIO = 0.8
AUG_RANDOM_FLIP = True

# Pre-trained models
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
transformer = RobertaModel.from_pretrained("roberta-base", output_hidden_states=True)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=898823.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=481.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=501200538.0, style=ProgressStyle(descri…




#### Approach 1: Using pre-trained representations

In [5]:
def get_span_mask(span, sent_len):
    bsz = span.shape[0]
    index_tensor = (
        torch.tensor(list(range(sent_len)), device=device)
        .unsqueeze(0)
        .expand(bsz, sent_len)
    )
    start_index, end_index = span.split(1, dim=-1)
    start_index = start_index + 1
    end_index = end_index - 1

    start_mask = (index_tensor - start_index) >= 0
    end_mask = (index_tensor - end_index) <= 0
    span_mask = start_mask & end_mask

    return span_mask

In [13]:
########################################### LOAD DATA ##############################################
import re

# We create a Dataset so we can create minibatches
class Task2Dataset(Dataset):

    def __init__(self, df):
      cust_tokenize = lambda x: tokenizer.convert_tokens_to_ids(tokenizer.tokenize(x))

      self.x_edit_1 = [cust_tokenize(x) for x in df.edit_sentence1.tolist()]
      self.x_masked_1 = [cust_tokenize(x) for x in df.masked1.tolist()]
      self.y_grade_1 = df.meanGrade1.tolist()

      self.x_edit_2 = [cust_tokenize(x) for x in df.edit_sentence2.tolist()]
      self.x_masked_2 = [cust_tokenize(x) for x in df.masked2.tolist()]
      self.y_grade_2 = df.meanGrade2.tolist()

      self.x_original = [cust_tokenize(x) for x in df.original.tolist()]
      self.y_label = df.label.tolist()

    def __len__(self):
        return len(self.y_label)

    def __getitem__(self, item):
        flip = np.random.choice([True, False]) if AUG_RANDOM_FLIP else False
        label = self.y_label[item]
        return { 'x_edit_1':      getattr(self, f'x_edit_{2 if flip else 1}')[item],
                 'x_masked_1':    getattr(self, f'x_masked_{2 if flip else 1}')[item],
                 'y_grade_1':     getattr(self, f'y_grade_{2 if flip else 1}')[item],
                 'x_edit_2':      getattr(self, f'x_edit_{1 if flip else 2}')[item],
                 'x_masked_2':    getattr(self, f'x_masked_{1 if flip else 2}')[item],
                 'y_grade_2':     getattr(self, f'y_grade_{1 if flip else 2}')[item],
                 'x_original':    self.x_original[item],
                 'y_label':       3 - label if flip and label != 0 else label }

def preprocess_data(df):
  print('[preprocess] started... ', end='')

  # create sentences by replacing <word/> words
  for row_idx, row in df.iterrows():
    original = row['original1'].strip()
    orig_word = re.search("<(.*)/>", original)
    orig_seq = f'{tokenizer.sep_token} {orig_word} {tokenizer.sep_token}'
    df.loc[row_idx, 'original'] = re.sub("<.*/>", orig_seq, original)
    for i in ['1', '2']:
      edit_seq = f'{tokenizer.sep_token} {row[f"edit{i}"]} {tokenizer.sep_token}'
      mask_seq = f'{tokenizer.sep_token} {tokenizer.mask_token} {tokenizer.sep_token}'
      df.loc[row_idx, f'edit_sentence{i}'] = re.sub("<.*/>", edit_seq, original)
      df.loc[row_idx, f'masked{i}'] = re.sub("<.*/>", mask_seq, original)

  # keep only the relevant columns  
  cols_to_drop = list(df.columns)
  for i in ['1', '2']:
    cols_to_drop.remove(f'edit_sentence{i}')
    cols_to_drop.remove(f'masked{i}')
    cols_to_drop.remove(f'meanGrade{i}')
  cols_to_drop.remove('original')
  cols_to_drop.remove('label')
  df = df.drop(columns=cols_to_drop)

  print('done.')
  return df

def pad_column(col):
  max_len = max([len(x) for x in col])
  for k, v in enumerate(col):
    col[k] += [tokenizer.pad_token_id] * (max_len - len(v))
  return col

def pad_batch(batch):
  x_edit_1 = pad_column([row['x_edit_1'] for row in batch])
  x_masked_1 = pad_column([row['x_masked_1'] for row in batch])
  y_grade_1 = [row['y_grade_1'] for row in batch]
  x_edit_2 = pad_column([row['x_edit_2'] for row in batch])
  x_masked_2 = pad_column([row['x_masked_2'] for row in batch])
  y_grade_2 = [row['y_grade_2'] for row in batch]
  x_original = pad_column([row['x_original'] for row in batch])
  y_label = [row['y_label'] for row in batch]
  return {'x_edit_1': x_edit_1, 'x_masked_1': x_masked_1, 'y_grade_1': y_grade_1,
          'x_edit_2': x_edit_2, 'x_masked_2': x_masked_2, 'y_grade_2': y_grade_2,
          'x_original': x_original, 'y_label': y_label}


# Load data from CSV
train_df = pd.read_csv('data/task-2/train.csv')
funlines_df = pd.read_csv('data/funlines/train_funlines.csv')
test_df = pd.read_csv('data/task-2/dev.csv')
print('Data loaded from CSV.')

# Preprocess the data -> creates 2x (edit_sentence, masked_sentence, grade) + original + label
funlines_df = preprocess_data(funlines_df)
train_df = preprocess_data(train_df)
dataframe = pd.concat([train_df, funlines_df])
dataset = Task2Dataset(dataframe)
print('Dataset created.')

# Perform the train/eval split
train_size = round(len(dataset) * TRAINING_RATIO)
eval_size = len(dataset) - train_size
train_dataset, eval_dataset = random_split(dataset, (train_size, eval_size))

# Create data loaders
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, collate_fn=pad_batch)
eval_loader = DataLoader(eval_dataset, batch_size=BATCH_SIZE, collate_fn=pad_batch)
print('Train and evaluation split data loaders generated.')

Data loaded from CSV.
[preprocess] started... done.
[preprocess] started... done.
Dataset created.
Train and evaluation split data loaders generated.


In [19]:
# DEFINE MODEL

class ColBERT(nn.Module):

  def __init__(self, tokenizer, transformer, context=False, classification=False):
    super(ColBERT, self).__init__()
    self.tokenizer = tokenizer
    self.transformer = transformer
    self.context = context                # combine edits with context
    self.classification = classification  # use classfication head

    hid_size = transformer.config.hidden_size
    feature_size = 4

    if not self.classification:
      self.head = nn.Linear(hid_size * (feature_size if self.context else 1), 1)
    else:
      self.head = nn.Sequential(
          nn.Dropout(0.2),
          nn.Linear(hid_size * feature_size, 3))

    if self.context and self.classification:
      self.pool = nn.Linear(hid_size * feature_size * 2, hid_size * feature_size)


  def forward(self, edit1, context1, edit2, context2):
    if self.context:
      if self.classification:
        return self._classify_context(edit1, context1, edit2, context2)
      else:
        return self._regress_context(edit1, context1, edit2, context2)
    else:
      if self.classification:
        return self._classify(edit1, edit2)
      else:
        return self._regress(edit1, edit2)      
    raise Exception('Invalid model config')

  def _regress_context(self, edit1, context1, edit2, context2):
    edit1_emb, edit2_emb = self._embed(edit1), self._embed(edit2)
    context1_emb, context2_emb = self._embed(context1), self._embed(context2)
    h1 = self._make_feature(edit1_emb, context1_emb)
    h2 = self._make_feature(edit2_emb, context2_emb)
    grade1 = self.head(h1).squeeze()
    grade2 = self.head(h2).squeeze()
    return self._soft_argmax(grade1, grade2), grade1, grade2
  
  def _regress(self, edit1, edit2):
    edit1_emb = self._embed(edit1)
    edit2_emb = self._embed(edit2)
    grade1 = self.head(edit1_emb).squeeze()
    grade2 = self.head(edit2_emb).squeeze()
    return self._soft_argmax(grade1, grade2), grade1, grade2
  
  def _classify_context(self, edit1, context1, edit2, context2):
    edit1_emb = self._embed(edit1)
    edit2_emb = self._embed(edit2)
    context1_emb = self._embed(context1)
    context2_emb = self._embed(context2)
    h1 = self._make_feature(edit1_emb, context1_emb)
    h2 = self._make_feature(edit2_emb, context2_emb)
    h = self.pool(torch.cat([h1, h2], -1))
    return self.head(h), None, None
    
  def _classify(self, edit1, edit2):
    edit1_emb = self._embed(edit1)
    edit2_emb = self._embed(edit2)
    h = self._make_feature(edit1_emb, edit2_emb)
    return self.head(h), None, None
  
  def _embed(self, inp):
    inp_mask = (inp != self.tokenizer.pad_token_id) & (inp != self.tokenizer.sep_token_id)
    sep_mask = inp == self.tokenizer.sep_token_id
        
    outputs = self.transformer(inp, attention_mask=inp_mask)
    last_hidden_state = outputs.last_hidden_state

    span = torch.nonzero(input=sep_mask, as_tuple=True)[1].view(-1, 2)
    span_mask = get_span_mask(span=span, sent_len=inp.shape[-1])

    out = self._pool(last_hidden_state, span_mask)

    return out
  
  @staticmethod
  def _make_feature(u, v):
    return torch.cat([u, v, (u - v).abs(), u * v], -1)

  @staticmethod
  def _pool(sequence, mask):
    if len(sequence.shape) == 2:
      return sequence
    if mask is None:
      mask = torch.ones(sequence.shape[:2], device=device)
    if len(mask.size()) < 3:
      mask = mask.unsqueeze(dim=-1)
    pad_mask = mask == 0
    sequence = sequence.masked_fill(pad_mask, 0)
    seq_emb = sequence.sum(dim=1) / mask.sum(dim=1).float()
    return seq_emb
  
  @staticmethod
  def _soft_argmax(u, v, threshold=None):
    "Assign 0 instead of 1 (for u) or 2 (for v) if difference below threshold"
    argmax = torch.argmax(torch.stack([u, v], -1), dim=-1) + 1
    if not threshold:
      return argmax
    diff_mask = (u - v).abs() <= threshold
    return argmax.masked_fill(diff_mask, 0)

In [9]:
# How we print the model performance
def model_performance(output, target, size, print_output=False):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    correct_answers = (output == target)
    correct = sum(correct_answers)
    acc = np.true_divide(correct, size)

    if print_output:
        print(f'| Acc: {acc:.2f} ')

    return correct, acc

In [10]:
# We evaluate performance on our dev set
def eval(data_iter, model, context_type='none', classification=False):
    """
    Evaluating model performance on the dev set
    """
    model.eval()
    epoch_loss = 0
    epoch_correct = 0
    pred_all = []
    trg_all = []
    no_observations = 0

    with torch.no_grad():
        for batch in data_iter:
  
          # Extract relevant data fields and cast to tensors
          edit1 = torch.LongTensor(batch['x_edit_1']).to(device)
          edit2 = torch.LongTensor(batch['x_edit_2']).to(device)
          y_label_true = torch.LongTensor(batch['y_label']).to(device)
          if context_type == 'masked':
            context1 = torch.LongTensor(batch['x_masked_1']).to(device)
            context2 = torch.LongTensor(batch['x_masked_2']).to(device)
          elif context_type == 'original':
            context1 = context2 = torch.LongTensor(batch['x_original']).to(device)
          else:
            context1 = context2 = None
    
          if classification:
            y_grade1_true = y_grade2_true = None
          else:
            y_grade1_true = torch.FloatTensor(batch['y_grade_1']).to(device)
            y_grade2_true = torch.FloatTensor(batch['y_grade_2']).to(device)
      
          batch_len = len(y_label_true)
          optimizer.zero_grad()

          # Forward pass
          y_label_pred, y_grade1_pred, y_grade2_pred = model(edit1, context1, edit2, context2)
    
          # Calculate regression loss
          if not classification:
            if batch_len == 1:
              y_grade1_pred = y_grade1_pred.unsqueeze(-1)
              y_grade2_pred = y_grade2_pred.unsqueeze(-1)
            loss_fn = nn.MSELoss(reduction="sum")
            loss1 = loss_fn(y_grade1_pred, y_grade1_true)
            loss2 = loss_fn(y_grade2_pred, y_grade2_true)
            loss = (loss1 + loss2)
            label_preds = y_label_pred.detach().cpu().numpy()
          # Calculate classification loss
          else:
            loss_fn = nn.CrossEntropyLoss(reduction="sum")
            loss = loss_fn(y_label_pred, y_label_true)
            label_preds = torch.argmax(y_label_pred.detach().cpu(), dim=-1).numpy()

          # Logging
          correct, __ = model_performance(label_preds, y_label_true.detach().cpu().numpy(), batch_len)
          epoch_loss += loss.item()
          epoch_correct += correct
          no_observations += batch_len

    return epoch_loss/no_observations, epoch_correct/no_observations, np.array(pred_all), np.array(trg_all)

In [17]:
# We define our training loop
def train(train_iter, dev_iter, model, optimizer, number_epoch=10, context_type='none',
          classification=False):
  """
  Training loop for the model, which calls on eval to evaluate after each epoch
  """

  print(f"Training model - classification {classification}, context {context_type}")
  train_losses, valid_losses = [], []
  train_accs, valid_accs = [], []

  for epoch in range(1, number_epoch+1):
    
    model.train()
    
    epoch_loss = 0
    epoch_correct = 0
    no_observations = 0  # Observations used for training so far

    for batch in train_iter:

      # Extract relevant data fields and cast to tensors
      edit1 = torch.LongTensor(batch['x_edit_1']).to(device)
      edit2 = torch.LongTensor(batch['x_edit_2']).to(device)
      y_label_true = torch.LongTensor(batch['y_label']).to(device)
      if context_type == 'masked':
        context1 = torch.LongTensor(batch['x_masked_1']).to(device)
        context2 = torch.LongTensor(batch['x_masked_2']).to(device)
      elif context_type == 'original':
        context1 = context2 = torch.LongTensor(batch['x_original']).to(device)
      else:
        context1 = context2 = None

      if classification:
        y_grade1_true = y_grade2_true = None
      else:
        y_grade1_true = torch.FloatTensor(batch['y_grade_1']).to(device)
        y_grade2_true = torch.FloatTensor(batch['y_grade_2']).to(device)
  
      batch_len = len(y_label_true)
      optimizer.zero_grad()

      # Forward pass
      y_label_pred, y_grade1_pred, y_grade2_pred = model(edit1, context1, edit2, context2)

      # Calculate regression loss
      if not classification:
        if batch_len == 1:
          y_grade1_pred = y_grade1_pred.unsqueeze(-1)
          y_grade2_pred = y_grade2_pred.unsqueeze(-1)
        loss_fn = nn.MSELoss(reduction="sum")
        loss1 = loss_fn(y_grade1_pred, y_grade1_true)
        loss2 = loss_fn(y_grade2_pred, y_grade2_true)
        loss = (loss1 + loss2)
        label_preds = y_label_pred.detach().cpu().numpy()
      # Calculate classification loss
      else:
        loss_fn = nn.CrossEntropyLoss(reduction="sum")
        loss = loss_fn(y_label_pred, y_label_true)
        label_preds = torch.argmax(y_label_pred.detach().cpu(), dim=-1).numpy()

      # Backward pass
      loss.backward()
      sum_grad = 0
      clip_grad_norm_(model.parameters(), GRAD_CLIP)
      optimizer.step()

      # Logging
      correct, __ = model_performance(label_preds, y_label_true.detach().cpu().numpy(), batch_len)
      epoch_loss += loss.item()
      epoch_correct += correct
      no_observations += batch_len
  
    valid_loss, valid_acc, __, __ = eval(dev_iter, model, context_type, classification)
    valid_losses.append(valid_loss)
    valid_accs.append(valid_acc)

    epoch_loss, epoch_acc = epoch_loss / no_observations, epoch_correct / no_observations
    train_losses.append(epoch_loss)
    train_accs.append(epoch_acc)
  
    print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.2f} | Train Accuracy: {epoch_acc:.2f} | \
    Val. Loss: {valid_loss:.2f} | Val. Accuracy: {valid_acc:.2f} |')
  
  fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
  ax[0].plot(train_losses, label='Training losses')
  ax[0].plot(valid_losses, label='Validation losses')
  ax[0].legend()
  ax[1].plot(train_accs, label='Training accuracies')
  ax[1].plot(valid_accs, label='Validation accuracies')
  ax[0].legend()
  fig.show()

In [None]:
# Regression, no context
#| Epoch: 10 | Train Loss: 0.37 | Train Accuracy: 0.57 |     Val. Loss: 0.59 | Val. Accuracy: 0.62 |

# Classification, no context
#| Epoch: 05 | Train Loss: 0.62 | Train Accuracy: 0.76 |     Val. Loss: 0.93 | Val. Accuracy: 0.62 |


### Run tests

for context_type in context_types[1:]:
  for classification in [False, True]:
    transformer.init_weights()
    model = ColBERT(tokenizer, transformer, context_type != 'none', classification).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, eps=ADAM_EPSILON)
    train(train_loader, eval_loader, model, optimizer, EPOCHS, context_type, classification)

Training model - classification False, context masked
| Epoch: 01 | Train Loss: 0.89 | Train Accuracy: 0.46 |     Val. Loss: 0.69 | Val. Accuracy: 0.54 |
| Epoch: 02 | Train Loss: 0.67 | Train Accuracy: 0.54 |     Val. Loss: 0.69 | Val. Accuracy: 0.56 |
| Epoch: 03 | Train Loss: 0.52 | Train Accuracy: 0.61 |     Val. Loss: 0.65 | Val. Accuracy: 0.59 |
| Epoch: 04 | Train Loss: 0.41 | Train Accuracy: 0.64 |     Val. Loss: 0.54 | Val. Accuracy: 0.59 |
| Epoch: 05 | Train Loss: 0.34 | Train Accuracy: 0.67 |     Val. Loss: 0.52 | Val. Accuracy: 0.61 |
| Epoch: 06 | Train Loss: 0.28 | Train Accuracy: 0.69 |     Val. Loss: 0.47 | Val. Accuracy: 0.61 |
| Epoch: 07 | Train Loss: 0.25 | Train Accuracy: 0.71 |     Val. Loss: 0.62 | Val. Accuracy: 0.63 |
| Epoch: 08 | Train Loss: 0.22 | Train Accuracy: 0.73 |     Val. Loss: 0.44 | Val. Accuracy: 0.63 |
| Epoch: 09 | Train Loss: 0.19 | Train Accuracy: 0.74 |     Val. Loss: 0.44 | Val. Accuracy: 0.64 |
| Epoch: 10 | Train Loss: 0.16 | Train Accurac

```
Training model - classification False, context none
| Epoch: 01 | Train Loss: 0.87 | Train Accuracy: 0.46 |     Val. Loss: 0.84 | Val. Accuracy: 0.50 |
| Epoch: 02 | Train Loss: 0.62 | Train Accuracy: 0.55 |     Val. Loss: 0.60 | Val. Accuracy: 0.55 |
| Epoch: 03 | Train Loss: 0.47 | Train Accuracy: 0.61 |     Val. Loss: 0.54 | Val. Accuracy: 0.58 |
| Epoch: 04 | Train Loss: 0.37 | Train Accuracy: 0.65 |     Val. Loss: 0.54 | Val. Accuracy: 0.60 |
| Epoch: 05 | Train Loss: 0.30 | Train Accuracy: 0.68 |     Val. Loss: 0.47 | Val. Accuracy: 0.62 |
| Epoch: 06 | Train Loss: 0.24 | Train Accuracy: 0.71 |     Val. Loss: 0.45 | Val. Accuracy: 0.63 |
| Epoch: 07 | Train Loss: 0.20 | Train Accuracy: 0.73 |     Val. Loss: 0.45 | Val. Accuracy: 0.64 |
| Epoch: 08 | Train Loss: 0.16 | Train Accuracy: 0.76 |     Val. Loss: 0.50 | Val. Accuracy: 0.65 |
| Epoch: 09 | Train Loss: 0.13 | Train Accuracy: 0.77 |     Val. Loss: 0.42 | Val. Accuracy: 0.67 |
| Epoch: 10 | Train Loss: 0.11 | Train Accuracy: 0.79 |     Val. Loss: 0.42 | Val. Accuracy: 0.68 |
Training model - classification True, context none
| Epoch: 01 | Train Loss: 1.02 | Train Accuracy: 0.44 |     Val. Loss: 0.96 | Val. Accuracy: 0.47 |
| Epoch: 02 | Train Loss: 0.98 | Train Accuracy: 0.48 |     Val. Loss: 0.96 | Val. Accuracy: 0.47 |
| Epoch: 03 | Train Loss: 0.91 | Train Accuracy: 0.58 |     Val. Loss: 0.97 | Val. Accuracy: 0.55 |
| Epoch: 04 | Train Loss: 0.78 | Train Accuracy: 0.68 |     Val. Loss: 0.95 | Val. Accuracy: 0.56 |
| Epoch: 05 | Train Loss: 0.67 | Train Accuracy: 0.74 |     Val. Loss: 1.11 | Val. Accuracy: 0.58 |
| Epoch: 06 | Train Loss: 0.58 | Train Accuracy: 0.78 |     Val. Loss: 1.12 | Val. Accuracy: 0.57 |
| Epoch: 07 | Train Loss: 0.49 | Train Accuracy: 0.81 |     Val. Loss: 1.24 | Val. Accuracy: 0.59 |
| Epoch: 08 | Train Loss: 0.41 | Train Accuracy: 0.84 |     Val. Loss: 1.27 | Val. Accuracy: 0.56 |
| Epoch: 09 | Train Loss: 0.34 | Train Accuracy: 0.87 |     Val. Loss: 1.49 | Val. Accuracy: 0.51 |
| Epoch: 10 | Train Loss: 0.26 | Train Accuracy: 0.90 |     Val. Loss: 1.64 | Val. Accuracy: 0.54 |
```

In [None]:
 # To create our vocab
def create_vocab(data):
    """
    Creating a corpus of all the tokens used
    """
    tokenized_corpus = [] # Let us put the tokenized corpus in a list

    for sentence in data:

        tokenized_sentence = []

        for token in sentence.split(' '): # simplest split is

            tokenized_sentence.append(token)

        tokenized_corpus.append(tokenized_sentence)

    # Create single list of all vocabulary
    vocabulary = []  # Let us put all the tokens (mostly words) appearing in the vocabulary in a list

    for sentence in tokenized_corpus:

        for token in sentence:

            if token not in vocabulary:

                if True:
                    vocabulary.append(token)

    return vocabulary, tokenized_corpus

In [None]:
# Used for collating our observations into minibatches:
def collate_fn_padd(batch):
    '''
    We add padding to our minibatches and create tensors for our model
    '''

    batch_labels = [l for f, l in batch]
    batch_features = [f for f, l in batch]

    batch_features_len = [len(f) for f, l in batch]

    seq_tensor = torch.zeros((len(batch), max(batch_features_len))).long()

    for idx, (seq, seqlen) in enumerate(zip(batch_features, batch_features_len)):
        seq_tensor[idx, :seqlen] = torch.LongTensor(seq)

    batch_labels = torch.LongTensor(batch_labels)

    return seq_tensor, batch_labels

# We create a Dataset so we can create minibatches
class Task2Dataset(Dataset):

    def __init__(self, train_data, labels):
        self.x_train = train_data
        self.y_train = labels

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, item):
        return self.x_train[item], self.y_train[item]

In [None]:

class BiLSTM_classification(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, batch_size, device):
        super(BiLSTM_classification, self).__init__()
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.device = device
        self.batch_size = batch_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2label = nn.Linear(hidden_dim * 2, 3)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we dont have any hidden state.
        # Refer to the Pytorch documentation to see exactly why they have this dimensionality.
        # The axes semantics are (num_layers * num_directions, minibatch_size, hidden_dim)
        return torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device), \
               torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device)

    def forward(self, sentence):
        embedded = self.embedding(sentence)
        embedded = embedded.permute(1, 0, 2)

        lstm_out, self.hidden = self.lstm(
            embedded.view(len(embedded), self.batch_size, self.embedding_dim), self.hidden)

        out = self.hidden2label(lstm_out[-1])
        return out

In [None]:
## Approach 1 code, using functions defined above:

# We set our training data and test data
training_data = train_df['original1']
test_data = test_df['original1']

# Creating word vectors
training_vocab, training_tokenized_corpus = create_vocab(training_data)
test_vocab, test_tokenized_corpus = create_vocab(test_data)

# Creating joint vocab from test and train:
joint_vocab, joint_tokenized_corpus = create_vocab(pd.concat([training_data, test_data]))

print("Vocab created.")

# We create representations for our tokens
wvecs = [] # word vectors
word2idx = [] # word2index
idx2word = []

# This is a large file, it will take a while to load in the memory!
with codecs.open('glove.6B.100d.txt', 'r','utf-8') as f:
  index = 1
  for line in f.readlines():
    # Ignore the first line - first line typically contains vocab, dimensionality
    if len(line.strip().split()) > 3:
      word = line.strip().split()[0]
      if word in joint_vocab:
          (word, vec) = (word,
                     list(map(float,line.strip().split()[1:])))
          wvecs.append(vec)
          word2idx.append((word, index))
          idx2word.append((index, word))
          index += 1


wvecs = np.array(wvecs)
word2idx = dict(word2idx)
idx2word = dict(idx2word)

vectorized_seqs = [[word2idx[tok] for tok in seq if tok in word2idx] for seq in training_tokenized_corpus]

INPUT_DIM = len(word2idx)
EMBEDDING_DIM = 100
BATCH_SIZE = 32

model = BiLSTM_classification(EMBEDDING_DIM, 50, INPUT_DIM, BATCH_SIZE, device)
print("Model initialised.")

model.to(device)
# We provide the model with our embeddings
model.embedding.weight.data.copy_(torch.from_numpy(wvecs))

feature = vectorized_seqs

# 'feature' is a list of lists, each containing embedding IDs for word tokens
train_and_dev = Task2Dataset(feature, train_df['label'])

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(train_and_dev,
                                           (train_examples,
                                            dev_examples))


train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)

print("Dataloaders created.")

loss_fn = nn.CrossEntropyLoss()
loss_fn = loss_fn.to(device)

optimizer = torch.optim.Adam(model.parameters())

train(train_loader, dev_loader, model, epochs)

Vocab created.
Model initialised.
Dataloaders created.
Training model.
| Epoch: 01 | Train Loss: 0.96 | Train Accuracy: 0.44 |         Val. Loss: 0.98 | Val. Accuracy: 0.43 |
| Epoch: 02 | Train Loss: 0.96 | Train Accuracy: 0.45 |         Val. Loss: 0.97 | Val. Accuracy: 0.45 |
| Epoch: 03 | Train Loss: 0.96 | Train Accuracy: 0.45 |         Val. Loss: 0.97 | Val. Accuracy: 0.43 |
| Epoch: 04 | Train Loss: 0.95 | Train Accuracy: 0.46 |         Val. Loss: 0.98 | Val. Accuracy: 0.43 |
| Epoch: 05 | Train Loss: 0.94 | Train Accuracy: 0.52 |         Val. Loss: 0.99 | Val. Accuracy: 0.46 |
| Epoch: 06 | Train Loss: 0.87 | Train Accuracy: 0.60 |         Val. Loss: 1.02 | Val. Accuracy: 0.48 |
| Epoch: 07 | Train Loss: 0.80 | Train Accuracy: 0.62 |         Val. Loss: 1.05 | Val. Accuracy: 0.49 |
| Epoch: 08 | Train Loss: 0.75 | Train Accuracy: 0.65 |         Val. Loss: 1.16 | Val. Accuracy: 0.47 |
| Epoch: 09 | Train Loss: 0.70 | Train Accuracy: 0.66 |         Val. Loss: 1.18 | Val. Accuracy: 

#### Approach 2: No pre-trained representations

In [None]:
train_and_dev = train_df['edit1']

training_data, dev_data, training_y, dev_y = train_test_split(train_df['edit1'], train_df['label'],
                                                                        test_size=(1-train_proportion),
                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
naive_model = MultinomialNB().fit(train_counts, training_y)

# Train predictions
predicted_train = naive_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = naive_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")

sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)


Train performance:
| Acc: 0.73 

Dev performance:
| Acc: 0.52 


#### Baseline for task 2

In [None]:
# Baseline for the task
pred_baseline = torch.zeros(len(dev_y)) + 1  # 1 is most common class
print("\nBaseline performance:")
sse, mse = model_performance(pred_baseline, torch.tensor(dev_y.values), True)


Baseline performance:
| Acc: 0.45 
