# Advanced Natural Language Engineering - Assignment 1

This assignment enters the Microsoft Research Sentence Completion Challenge (MRSCC; Zweig and Burges, 2011), which requires a system to predict which of five candidate words is most likely to complete a sentence.

Four methods are compared in this challenge:

1. Unigram model
2. Bigram model
3. PyTorch embeddings CBOW model
4. Transformer models (BERT/RoBERTa)
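
As a rough illustration of the two count-based baselines in the list, the sketch below scores candidate words by raw frequency (unigram) and by how often they follow the preceding word (bigram). The toy corpus, the function names `unigram_choice` and `bigram_choice`, and the absence of smoothing are all assumptions made for illustration; this is not the notebook's own implementation.

```python
from collections import Counter

# Toy corpus standing in for the tokenised training novels (hypothetical data).
corpus = "the man held a sheet of paper and the man read the sheet".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def unigram_choice(candidates):
    """Pick the candidate with the highest corpus frequency."""
    return max(candidates, key=lambda w: unigram_counts[w])

def bigram_choice(previous_word, candidates):
    """Pick the candidate that most often follows previous_word."""
    return max(candidates, key=lambda w: bigram_counts[(previous_word, w)])

candidates = ["supply", "parcel", "sign", "sheet", "chorus"]
print(unigram_choice(candidates))
print(bigram_choice("a", candidates))
```

A unigram model ignores the sentence entirely, so it always prefers the most frequent candidate; the bigram model uses one word of left context.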

### Loading challenge data

For this challenge we are provided with:

1.   A training corpus of 19th century novels data (522 files)
2.   1040 sentences with one missing word and 5 options to choose from

This dataset was constructed from Project Gutenberg data. Seed sentences were selected from five of Sir Arthur Conan Doyle’s Sherlock Holmes novels, and imposter words were then suggested with the aid of a language model trained on over 500 19th-century novels.

The strategy for competing in this challenge is to create training and validation data from the complete corpus, and to use these to make predictions on the unseen MRSCC challenge data.

In [None]:
%%capture
!sudo apt-get install libdb++-dev
%env BERKELEYDB_DIR=/usr
!pip3 install bsddb3
!pip install gutenberg
!pip install nltk
!pip install pytorch-lightning
!pip install "ray[tune]"

In [None]:
%%capture
import csv
import operator
import os
import random
import re
import shutil
import string
import tempfile

import nltk
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
from nltk import word_tokenize
from nltk import word_tokenize as tokenize
from nltk.corpus import stopwords
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.utilities.cloud_io import load as pl_load
from ray import tune
from ray.tune import CLIReporter
from ray.tune.integration.pytorch_lightning import TuneReportCallback
from ray.tune.schedulers import ASHAScheduler
from sklearn.model_selection import train_test_split

nltk.download('punkt')
nltk.download('stopwords')

In [None]:
import os
import random
mrscc_dir = '/content/drive/MyDrive/university/2021/ANLE/lab2resources/sentence-completion'

def get_train_val(training_dir=mrscc_dir,split=0.99):
    filenames=os.listdir(training_dir)
    n=len(filenames)
    print("There are {} files in the training directory: {}".format(n,training_dir))
    random.seed(7) #if you want the same random split every time
    random.shuffle(filenames)
    index=int(n*split)
    return(filenames[:index],filenames[index:])

trainingdir=os.path.join(mrscc_dir,"Holmes_Training_Data/")
training,testing=get_train_val(trainingdir)

There are 522 files in the training directory: /content/drive/MyDrive/university/2021/ANLE/lab2resources/sentence-completion/Holmes_Training_Data/


In [None]:
def processfiles(files, training_dir, filter="Conan Doyle"):
  # keep every file that mentions the filter string, plus every 15th file as a sample of the rest
  texts = []
  for i, afile in enumerate(files):
      try:
          with open(os.path.join(training_dir,afile)) as instream:
            text = instream.read()
          if re.search(filter, text, re.IGNORECASE) or i%15==0:
            print("sherlock found at {}".format(i))
            texts.append(strip_headers(text).strip())
      except UnicodeDecodeError:
          print("UnicodeDecodeError processing {}: ignoring rest of file".format(afile))
  return texts

In [None]:
texts = processfiles(training, trainingdir)

sherlock found at 0
UnicodeDecodeError processing TNGLW10.TXT: ignoring rest of file
sherlock found at 15
sherlock found at 30
sherlock found at 45
sherlock found at 60
sherlock found at 75
sherlock found at 88
UnicodeDecodeError processing HHOHG10.TXT: ignoring rest of file
sherlock found at 90
sherlock found at 104
sherlock found at 105
sherlock found at 115
sherlock found at 120
sherlock found at 128
UnicodeDecodeError processing TBTAS10.TXT: ignoring rest of file
sherlock found at 133
sherlock found at 135
sherlock found at 150
sherlock found at 165
sherlock found at 172
sherlock found at 180
sherlock found at 183
UnicodeDecodeError processing DTROY10.TXT: ignoring rest of file
sherlock found at 195
UnicodeDecodeError processing WTSLW10.TXT: ignoring rest of file
sherlock found at 210
UnicodeDecodeError processing PHIL410.TXT: ignoring rest of file
UnicodeDecodeError processing KRSON10.TXT: ignoring rest of file
UnicodeDecodeError processing HFDTR10.TXT: ignoring rest of file
sherl

In [None]:
len(texts)

50

In [None]:
print(texts[0])

This etext was prepared with the use of Calera WordScan Plus 2.0
Donated by Calera

         

                               THE
                          CERTAIN HOUR

                      (Dizain des Poetes)


                               By
                       JAMES BRANCH CABELL





        "Criticism, whatever may be its
        pretensions, never does more than to
        define the impression which is made upon
        it at a certain moment by a work wherein
        the writer himself noted the impression
        of the world which he received at a
        certain hour."


                            NEW YORK
                   ROBERT M. McBRIDE & COMPANY
                              1916




            Copyright, 1916. by Robert M. McBride &
            Copyright, 1915, by McBride, Nast & Co.
       Copyright, 1914, by the Sewanee Review Quarterly
       Copyright, 1913, by John Adams Thayer Corporation
        Copyright, 1912, by Argonaut Publishing Company
        

Load questions

In [None]:
questions=pd.read_csv(os.path.join(mrscc_dir,"testing_data.csv"))
answers=pd.read_csv(os.path.join(mrscc_dir,"test_answer.csv"))
choices = ['a','b','c','d','e']
questions.rename(columns={'a)':'a','b)':'b','c)':'c','d)':'d','e)':'e'}, inplace=True)
word_answers, question_with_answer = [], []
for index,row in questions.iterrows():
  answer = answers.iloc[index].answer
  word_answers.append(row[answer])
  # plain string replacement: the blank is a literal, so no regex is needed
  question_with_answer.append(row.question.replace("_____",row[answer]))
questions['answer'] = word_answers
questions['question_with_answer'] = question_with_answer
questions.head()

Unnamed: 0,id,question,a,b,c,d,e,answer,question_with_answer
0,1,I have it from the same source that you are bo...,crying,instantaneously,residing,matched,walking,residing,I have it from the same source that you are bo...
1,2,It was furnished partly as a sitting and partl...,daintily,privately,inadvertently,miserably,comfortably,daintily,It was furnished partly as a sitting and partl...
2,3,"As I descended , my old ally , the _____ , cam...",gods,moon,panther,guard,country-dance,guard,"As I descended , my old ally , the guard , cam..."
3,4,"We got off , _____ our fare , and the trap rat...",rubbing,doubling,paid,naming,carrying,paid,"We got off , paid our fare , and the trap ratt..."
4,5,"He held in his hand a _____ of blue paper , sc...",supply,parcel,sign,sheet,chorus,sheet,"He held in his hand a sheet of blue paper , sc..."


In [None]:
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
# weight each answer equally: an answer with k WordNet synsets adds 1/k to the part of speech of each synset
pos_totals = {}
for index,row in questions.iterrows():
  synsets = wn.synsets(row.answer)
  for s in synsets:
    pos_totals[s.pos()] = pos_totals.get(s.pos(), 0)+(1/len(synsets))

In [None]:
pos_totals

{'a': 44.57175535440123,
 'n': 377.01444745445554,
 'r': 29.411485697178282,
 's': 133.35888017132697,
 'v': 428.643431322629}

Process data into context -> target pairs

In [None]:
def processfiles(all_texts, questions=questions, config={"stop":True, "window_size":4}):
  window = config['window_size']
  vocab = set()
  contexts,targets=[],[]
  stop = set(stopwords.words('english') + list(string.punctuation))
  for text in all_texts:
    if config['stop']:
      tokenized_text = [i for i in word_tokenize(text.lower()) if i not in stop]
    else:
      tokenized_text = [i for i in word_tokenize(text.lower())]
    vocab.update(tokenized_text)
    for i in range(window, len(tokenized_text) - window - 1):
      contexts.append(tokenized_text[i-window:i] + tokenized_text[i+1:i+window+1])
      targets.append(tokenized_text[i])
  train = pd.DataFrame()
  train['contexts']=contexts
  train['targets']=targets
  # naively handle out-of-vocab errors by adding question text to vocab
  for i,row in questions.iterrows():
    if config['stop']:
      question_tokens = [t for t in word_tokenize(row.question.lower()) if t not in stop]
    else:
      question_tokens = word_tokenize(row.question.lower())
    vocab.update(question_tokens)
    vocab.update(list(row[choices]))
  return train, vocab

In [None]:
train, vocab = processfiles(texts,config={"stop":True, "window_size":4})

In [None]:
train.iloc[100].contexts

['death', 'spouse', 'hath', 'chance', 'whereby', 'decays', 'thing', 'save']

In [None]:
len(train)

1860555
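
To make the context -> target construction concrete, here is a minimal sketch on a toy sentence. The helper name `context_windows` and the example tokens are hypothetical. Unlike the cell above, the loop here runs up to `len(tokens) - window`, so the last position with a full window is also included.

```python
def context_windows(tokens, window):
    """Yield (context, target) pairs with `window` tokens on each side."""
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        yield context, tokens[i]

# Hypothetical tokenised sentence.
tokens = ["holmes", "lit", "his", "pipe", "and", "smiled"]
pairs = list(context_windows(tokens, window=2))
for context, target in pairs:
    print(context, "->", target)
```

Each training example therefore has exactly `2 * window` context words, which is what lets the training `DataLoader` stack them into fixed-size batches.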

# PyTorch Lightning

In [None]:
import pytorch_lightning as pl
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

In [None]:
class PLDataset(Dataset):

  def __init__(self, data: pd.DataFrame, vocab: dict):
    self.data = data
    self.vocab = vocab

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index: int):
    row = self.data.iloc[index]
    context = row.contexts
    target = row.targets
    return {'context_ids':torch.tensor([self.vocab[w] for w in context], dtype=torch.long),
            'target_id':torch.tensor(self.vocab[target], dtype=torch.long)}

In [None]:
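# map each vocabulary word to an integer id (needed by the datasets and the data module below)
word_to_ix = {word: i for i, word in enumerate(vocab)}
test = PLDataset(train.head(), word_to_ix)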
test.__getitem__(0)

{'context_ids': tensor([23138, 16568, 62447, 37748, 10694, 36938, 43498, 37748]),
 'target_id': tensor(24159)}

In [None]:
class PLTestDataset(Dataset):

  def __init__(self, data: pd.DataFrame, vocab: dict, window: int=4):
    self.data = data
    self.vocab = vocab
    self.window = window
    self.stop = set(stopwords.words('english') + list(string.punctuation))

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index: int, target="_____"):
    row = self.data.iloc[index]
    question = row.question
    answer = row.answer.lower()
    question_tokens = [i for i in word_tokenize(question.lower()) if i not in self.stop]
    window_left,window_right = self.window,self.window
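    context = []  # stays empty if the blank is not found
    for i,word in enumerate(question_tokens):
      if word == target:
        if i<window_left:
          window_right = window_right+(window_left-1)
        if i>(len(question_tokens)-window_right):
          window_left = window_left+(len(question_tokens)-i)
        # clamp the left edge: a negative start index would wrap around and silently drop the left context
        left_start = max(0, i-window_left+1)
        context = question_tokens[left_start:i]+question_tokens[i+1:i+1+window_right]
        break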
    return {'context_ids':torch.tensor([self.vocab[w] for w in context], dtype=torch.long),
            'target_id':torch.tensor(self.vocab[answer], dtype=torch.long)}

In [None]:
test = PLTestDataset(questions.head(), word_to_ix)
test.__getitem__(1)

{'context_ids': tensor([12303, 50716,  7986, 47943, 63047, 34226]),
 'target_id': tensor(5015)}

In [None]:
class PLDataModule(pl.LightningDataModule):

  def __init__(self, train_data, test_data, batch_size=16, vocab=word_to_ix, window=4):
    super().__init__()
    print(len(train_data))
    self.train_data = train_data
    self.test_data = test_data
    self.batch_size = batch_size
    self.vocab = vocab
    self.window = window

  def setup(self, stage=None):
    self.train_dataset = PLDataset(
        self.train_data,
        self.vocab
    )
    self.test_dataset = PLTestDataset(
        self.test_data,
        self.vocab,
        self.window
    )

  def train_dataloader(self):
    return DataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        shuffle=True,
        num_workers=2
    )

  def val_dataloader(self):
    return DataLoader(self.test_dataset,batch_size=1,num_workers=2)
  def test_dataloader(self):
    return DataLoader(self.test_dataset,batch_size=1,num_workers=2)

In [None]:
class CBOWModel(pl.LightningModule):

  def __init__(self, config, vocab):
    super().__init__()
    self.config = config
    self.vocab = vocab
    self.embeddings = nn.Embedding(num_embeddings=config['vocab_size'],embedding_dim=config['embedding_dim'])
    self.linear = nn.Linear(in_features=config['embedding_dim'],out_features=config['vocab_size'])
    torch.nn.init.xavier_normal_(self.linear.weight)
    self.accuracy = pl.metrics.Accuracy()
    self.loss_function = nn.NLLLoss()


  def forward(self, inputs, target=None):
    embeds = torch.mean(self.embeddings(inputs), dim=1)
    # print(embeds.shape)
    logits = self.linear(embeds)
    # print(logits.shape)
    out = F.log_softmax(logits, dim=1)
    loss = 0
    if target is not None:   
      loss = self.loss_function(out, target)
    return loss, logits

  def training_step(self, batch, batch_index):
    context_ids = batch['context_ids']
    target_id = batch['target_id']
    loss, outputs = self(context_ids, target_id)
    self.log("train loss ", loss, prog_bar = True, logger=True)
    return {"loss":loss}

  def validation_step(self, batch, batch_index):
    context_ids = batch['context_ids']
    target_id = batch['target_id']
    loss, outputs = self(context_ids, target_id)
    self.log("validation loss ", loss, prog_bar = True, logger=True)
    return {"val_loss": loss, "val_outputs": outputs}

  def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
    all_outputs = [x["val_outputs"] for x in outputs]
    preds=[]
    # the validation loader uses batch_size=1 without shuffling, so output k holds the logits for question k
    for output,(_,row) in zip(all_outputs, questions.iterrows()):
      choice_ids = [self.vocab[row[c]] for c in choices]
      choice_logits = [float(output[0, id]) for id in choice_ids]
      preds.append(np.argmax(np.array(choice_logits)))

    total,correct=0,0
    for answer,pred in zip(answers.answer, preds):
      total+=1
      if answer==choices[pred]:
        correct+=1
    print(f"test accuracy {correct/total}")
    self.log("ptl/val_loss", avg_loss)
    self.log("ptl/val_accuracy", correct/total)

  def test_step(self, batch, batch_index):
    pass
  
  def training_epoch_end(self, outputs):
    pass

  def configure_optimizers(self):
    optimizer = optim.AdamW(self.parameters(), lr=self.config['lr'])
    return optimizer
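
The forward pass of the model above averages the context embeddings, applies one linear layer, and takes a log-softmax before the negative log-likelihood loss. The sketch below reproduces those steps in plain Python on a hypothetical 3-word vocabulary, so the arithmetic can be followed by hand. The function name `cbow_log_probs` and all numbers are illustrative assumptions, not values from the trained model.

```python
import math

def cbow_log_probs(context_ids, embeddings, weights, bias):
    """Average the context embeddings, apply a linear layer, then log-softmax."""
    dim = len(embeddings[0])
    mean = [sum(embeddings[i][d] for i in context_ids) / len(context_ids)
            for d in range(dim)]
    logits = [sum(w * m for w, m in zip(row, mean)) + b
              for row, b in zip(weights, bias)]
    # log-softmax, shifting by the maximum logit for numerical stability
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_norm for l in logits]

# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one row per output word
bias = [0.0, 0.0, 0.0]

log_probs = cbow_log_probs([0, 2], embeddings, weights, bias)
loss = -log_probs[1]  # negative log-likelihood if word 1 is the target
print(log_probs, loss)
```

Because log-softmax subtracts the same constant from every logit, ranking words by logits or by log-probabilities gives the same order. This is why returning the raw logits from `forward` is sufficient for choosing among candidates.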


Standalone

In [None]:
config = {
  "lr": 2e-5,
  "batch_size": 128,
  "embedding_dim":256,
  "vocab_size":len(vocab),
  "n_epochs":6,
  "stop":False
}
print("Training set size: {}".format(len(train)))
print("Vocab set size: {}".format(len(vocab)))
word_to_ix = {word: i for i, word in enumerate(vocab)}
model = CBOWModel(config, vocab=word_to_ix)
data_module = PLDataModule(train, questions, batch_size=config['batch_size'],vocab=word_to_ix)
data_module.setup()
trainer = pl.Trainer(max_epochs=config['n_epochs'],gpus=1,progress_bar_refresh_rate=100)
trainer.fit(model, data_module)

Training set size: 1860555
Vocab set size: 67724
1860555


test accuracy 0.20096153846153847
test accuracy 0.2701923076923077
test accuracy 0.2721153846153846
test accuracy 0.2721153846153846
test accuracy 0.2701923076923077
test accuracy 0.2701923076923077
test accuracy 0.27115384615384613

1

# Hyperparameter Tuning with Ray Tune

In [None]:
callback = TuneReportCallback({
    "accuracy": "ptl/val_accuracy",
    "loss": "ptl/val_loss",
}, on="validation_end")

In [None]:
def train_tune(config, gpus=0):
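  train, vocab = processfiles(texts, config=config)
  word_to_ix = {word: i for i, word in enumerate(vocab)}
  # the vocabulary depends on the sampled "stop" setting, so size the model from this trial's vocab
  config = {**config, "vocab_size": len(vocab)}
  print("Training set size: {}".format(len(train)))
  model = CBOWModel(config,vocab=word_to_ix)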
  data_module = PLDataModule(train, questions, vocab=word_to_ix, batch_size=config['batch_size'])
  print("Steps per epoch {}".format(len(train)/config['batch_size']))
  data_module.setup()
  trainer = pl.Trainer(max_epochs=5,gpus=config["n_gpus"],progress_bar_refresh_rate=1000,
                       logger=TensorBoardLogger(save_dir=tune.get_trial_dir(), name="", version="."),
                       callbacks=[callback])
  trainer.fit(model, data_module)

In [None]:
def tune_cbow(config, num_samples=3, gpus_per_trial=0):
  scheduler = ASHAScheduler(
      metric='accuracy',
      mode='max',
      grace_period=3,
      reduction_factor=2)

  reporter = CLIReporter(
      parameter_columns=["lr", "batch_size", "embedding_dim", 'stop', "window_size"],
      metric_columns=["loss", "accuracy", "training_iteration"])

  trainable = tune.with_parameters(
      train_tune,
      gpus=config["n_gpus"])
  analysis = tune.run(
      trainable,
      resources_per_trial={
          "cpu": 1,
          "gpu": config["n_gpus"]
      },
      config=config,
      scheduler=scheduler,
      progress_reporter=reporter,
      num_samples=num_samples,
      name="tune_cbow")
  return analysis

In [None]:
config = {
  "lr": tune.choice([2e-6,2e-5,2e-4]),
  "batch_size": 64,
  "embedding_dim":tune.choice([64,128,256]),
  "vocab_size":len(vocab),
  "n_epochs":20,
  "stop":tune.choice([True, False]),
  "window_size":tune.choice([2,3,4,5,10]),
  "n_gpus":1
}
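
Each `tune.choice` entry in the search space above draws one value uniformly at random per trial. The sketch below is a plain-Python stand-in for that sampling, useful for seeing how little of the space a handful of trials covers. The names `search_space` and `sample_trial` are hypothetical, and it does not call Ray.

```python
import random

# Hypothetical plain-Python mirror of the tunable entries in the search space.
search_space = {
    "lr": [2e-6, 2e-5, 2e-4],
    "embedding_dim": [64, 128, 256],
    "stop": [True, False],
    "window_size": [2, 3, 4, 5, 10],
}

def sample_trial(space, rng):
    """Draw one value per hyperparameter, uniformly and independently."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)
trials = [sample_trial(search_space, rng) for _ in range(10)]

# Count the distinct configurations in the full grid.
n_combinations = 1
for values in search_space.values():
    n_combinations *= len(values)
print(n_combinations)  # 3 * 3 * 2 * 5 = 90
```

With 10 samples, at most 10 of the 90 combinations are tried, so the result is a coarse random search, not an exhaustive grid.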

In [None]:
import numpy as np
analysis = tune_cbow(config, num_samples=10)

2021-04-16 10:27:50,249 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265


== Status ==
Memory usage on this node: 3.9/12.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 96.000: None | Iter 48.000: None | Iter 24.000: None | Iter 12.000: None | Iter 6.000: None | Iter 3.000: None
Resources requested: 1/2 CPUs, 1/1 GPUs, 0.0/7.32 GiB heap, 0.0/2.54 GiB objects (0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/tune_cbow
Number of trials: 1/10 (1 RUNNING)
+--------------------+----------+-------+--------+--------------+-----------------+--------+---------------+
| Trial name         | status   | loc   |     lr |   batch_size |   embedding_dim | stop   |   window_size |
|--------------------+----------+-------+--------+--------------+-----------------+--------+---------------|
| _inner_679c6_00000 | RUNNING  |       | 0.0002 |           64 |             256 | False  |             3 |
+--------------------+----------+-------+--------+--------------+-----------------+--------+---------------+


[2m[36m(pid=883)[0m Training set size: 4294517


[2m[36m(pid=883)[0m GPU available: True, used: True
[2m[36m(pid=883)[0m TPU available: False, using: 0 TPU cores
[2m[36m(pid=883)[0m LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[2m[36m(pid=883)[0m 2021-04-16 10:28:44.101381: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


[2m[36m(pid=883)[0m Validation sanity check: 0it [00:00, ?it/s]Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]


[2m[36m(pid=883)[0m 
[2m[36m(pid=883)[0m   | Name          | Type      | Params
[2m[36m(pid=883)[0m --------------------------------------------
[2m[36m(pid=883)[0m 0 | embeddings    | Embedding | 17.4 M
[2m[36m(pid=883)[0m 1 | linear        | Linear    | 17.5 M
[2m[36m(pid=883)[0m 2 | accuracy      | Accuracy  | 0     
[2m[36m(pid=883)[0m 3 | loss_function | NLLLoss   | 0     
[2m[36m(pid=883)[0m --------------------------------------------
[2m[36m(pid=883)[0m 34.8 M    Trainable params
[2m[36m(pid=883)[0m 0         Non-trainable params
[2m[36m(pid=883)[0m 34.8 M    Total params
[2m[36m(pid=883)[0m 139.357   Total estimated model params size (MB)


[2m[36m(pid=883)[0m test accuracy 0.16923076923076924
Epoch 0:   0%|          | 0/68142 [00:00<?, ?it/s] 
Epoch 0:   1%|▏         | 1000/68142 [00:20<23:22, 47.86it/s, loss=7.13, v_num=., validation loss =11.20, train loss =7.530]
Epoch 0:   1%|▏         | 1000/68142 [00:40<44:45, 25.00it/s, loss=7.13, v_num=., validation loss =11.20, train loss =7.530]
Epoch 0:   3%|▎         | 2000/68142 [00:40<22:27, 49.10it/s, loss=6.8, v_num=., validation loss =11.20, train loss =6.900] 
Epoch 0:   3%|▎         | 2000/68142 [01:00<33:04, 33.33it/s, loss=6.8, v_num=., validation loss =11.20, train loss =6.900]
Epoch 0:   4%|▍         | 3000/68142 [01:00<21:51, 49.68it/s, loss=6.71, v_num=., validation loss =11.20, train loss =6.400]
Epoch 0:   6%|▌         | 4000/68142 [01:19<21:21, 50.04it/s, loss=6.55, v_num=., validation loss =11.20, train loss =6.850]
Epoch 0:   6%|▌         | 4000/68142 [01:30<24:03, 44.44it/s, loss=6.55, v_num=., validation loss =11.20, train loss =6.850]
Epoch 0:   7%|▋  

KeyboardInterrupt: ignored

In [None]:
%load_ext tensorboard
%tensorboard --logdir ~/ray_results

In [None]:
test = torch.tensor([ word_to_ix['went'], word_to_ix['city'], word_to_ix['walking'], word_to_ix['streets'], word_to_ix['capital'], word_to_ix['building']])
test.shape

In [None]:
loss, log_probs = model(torch.unsqueeze(test, dim=0))

In [None]:
torch.argmax(log_probs)
ix_to_word = dict((v,k) for k,v in word_to_ix.items())
ix_to_word[int(torch.argmax(log_probs))]
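
The cell above takes the arg-max over the whole vocabulary. For the challenge only the five candidate words matter, so the scores can be restricted to their ids before choosing. A minimal sketch, with a hypothetical helper `pick_choice` and made-up scores:

```python
def pick_choice(scores, choice_ids):
    """Return the letter whose vocabulary id has the highest score."""
    return max(choice_ids, key=lambda letter: scores[choice_ids[letter]])

# Hypothetical scores over a six-word vocabulary, indexed by word id.
scores = [0.1, 2.3, -0.7, 1.9, 0.0, 3.2]
# Hypothetical mapping from choice letter to word id; id 5 is not a candidate.
choice_ids = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
print(pick_choice(scores, choice_ids))
```

Here the highest-scoring word overall (id 5) is ignored because it is not among the candidates, and choice `b` is selected.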

# Test Data - (MRSCC Data)

In [None]:
class question:
    
    def __init__(self, aline, lm):
        self.sentence=aline[1]
        self.choices = ["a", "b", "c", "d", "e"]
        self.word_choices = {index:word for index,word in zip(self.choices,aline[2:])}
        self.model = lm

    def add_answer(self,fields):
        self.answer=fields[1]

    def get_window_context(self,sent_tokens,window_left, window_right,target="_____"):
        stop = set(stopwords.words('english') + list(string.punctuation))
        tokens = [i for i in word_tokenize(sent_tokens.lower()) if i not in stop]

        for i,token in enumerate(tokens):
            if token==target:
                # widen the opposite side when the blank is near either end of the sentence
                if i<window_left:
                    window_right = window_right+(window_left-1)
                if i>(len(tokens)-window_right):
                    window_left = window_left+(len(tokens)-i)
                # return only once the blank is found; skip the blank itself and
                # clamp the left edge so a negative index cannot wrap around
                return tokens[max(0, i-window_left+1):i]+tokens[i+1:i+1+window_right]
        # blank not found in the sentence
        return []

    def predict(self, window=2):
        # context words on either side of the blank
        context = self.get_window_context(self.sentence, window, window)
        context = torch.tensor([word_to_ix[w] for w in context])
        _, log_probs = self.model(torch.unsqueeze(context, dim=0), target=None)
        # get rid of extra dimension
        log_probs = torch.squeeze(log_probs)
        # which of the 5 word choices has the highest score given this context?
        # first convert words to ids
        choice_ids = {index:word_to_ix[word] for index,word in self.word_choices.items() if word in word_to_ix}
        # look up each choice's score under the model
        choice_probs = {index:float(log_probs[id]) for index, id in choice_ids.items()}
        # choose max prediction
        prediction = max(choice_probs, key=choice_probs.get)
        return prediction
        
    def predict_and_score(self):
        #compare prediction according to method with the correct answer
        #return 1 or 0 accordingly
        prediction=self.predict()
        if prediction == self.answer:
            return 1
        else:
            return 0
      
          

In [None]:
class scc_reader:
    
    def __init__(self, model, qs=questions, ans=answers):
        self.qs=qs
        self.ans=ans
        self.model = model
        self.read_files()
   
    def read_files(self):
        #create a question instance for each row of the questions table
        self.questions=[question(self.qs.iloc[i], self.model) for i in range(len(self.qs))]
        #add answers to questions so predictions can be checked
        for i,q in enumerate(self.questions):
            q.add_answer(self.ans.iloc[i])

    def get_field(self,field):
        # the question class defines no get_field method, so read the attribute directly
        return [getattr(q, field) for q in self.questions]
    
    def predict(self):
        return [q.predict() for q in self.questions]
    
    def predict_and_score(self):
        scores=[q.predict_and_score() for q in self.questions]
        return sum(scores)/len(scores)

In [None]:
SCC = scc_reader(model=model)

In [None]:
SCC.predict_and_score()

In [None]:
t = torch.squeeze(log_probs)

In [None]:
questions.head()

In [None]:
float(t[0])