In [41]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [42]:
import numpy as np
import torch

# CASE STUDY: POS Tagging!

Now let's dive into an example that is more relevant to NLP and to your HW3: part-of-speech tagging! We will build up the code step by step until you can process the POS data into tensors and then train a simple model on it.
The code we are building up to forms the basis of the code in the homework assignment.

To start, we'll need some data to train and evaluate on. First download the train and dev POS data `twitter_train.pos` and `twitter_dev.pos` into the same directory as this notebook.

We will now be introducing three new components which are vital to training (NLP) models:
1. a `Vocabulary` object which converts from tokens/labels to integers. This part should also be able to handle padding so that batches can be easily created.
2. a `Dataset` object which takes in the data file and produces data tensors
3. a `DataLoader` object which takes data tensors from `Dataset` and batches them

### `Vocabulary`

Next, we need to get our data into Python and in a form that is usable by PyTorch. For text data this typically entails building a `Vocabulary`  of all of the words, then mapping words to integers corresponding to their place in the sorted vocabulary. This can be done as follows:

In [43]:
class Vocabulary():
    """ Object holding vocabulary and mappings
    Args:
        word_list: ``list`` A list of words. Words assumed to be unique.
        add_unk_token: ``bool`` Whether to create a token for unknown words.
    """
    def __init__(self, word_list, add_unk_token=False):
        # create special tokens for padding and unknown words
        self.pad_token = '<pad>'
        self.unk_token = '<unk>' if add_unk_token else None

        self.special_tokens = [self.pad_token]
        if self.unk_token:
            self.special_tokens += [self.unk_token]

        self.word_list = word_list
        
        # maps from the token ID to the token
        self.id_to_token = self.word_list + self.special_tokens
        # maps from the token to its token ID
        self.token_to_id = {token: id for id, token in
                            enumerate(self.id_to_token)}
        
    def __len__(self):
        """ Returns size of vocabulary """
        return len(self.token_to_id)
    
    @property
    def pad_token_id(self):
        return self.map_token_to_id(self.pad_token)
        
    def map_token_to_id(self, token: str):
        """ Maps a single token to its token ID """
        return self.token_to_id[token]

    def map_id_to_token(self, id: int):
        """ Maps a single token ID to its token """
        return self.id_to_token[id]

    def map_tokens_to_ids(self, tokens: list, max_length: int = None):
        """ Maps a list of tokens to a list of token IDs """
        # truncate extra tokens and pad to `max_length`
        if max_length:
            tokens = tokens[:max_length]
            tokens = tokens + [self.pad_token]*(max_length-len(tokens))
        return [self.map_token_to_id(token) for token in tokens]

    def map_ids_to_tokens(self, ids: list, filter_padding=True):
        """ Maps a list of token IDs to a list of token """
        tokens = [self.map_id_to_token(id) for id in ids]
        if filter_padding:
            tokens = [t for t in tokens if t != self.pad_token]
        return tokens
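
To make the padding and truncation behaviour concrete, here is a minimal sketch that mirrors `map_tokens_to_ids` on a toy word list. The words and the standalone function are illustrative stand-ins, not part of the notebook; the ID layout (regular words first, special tokens appended at the end) follows the class above.

```python
# Condensed stand-in for the Vocabulary class above, using a hypothetical toy word list.
# As in the class, regular words come first and the special tokens are appended last.
words = ['cat', 'sat', 'the']
id_to_token = words + ['<pad>']
token_to_id = {token: idx for idx, token in enumerate(id_to_token)}

def map_tokens_to_ids(tokens, max_length=None):
    """Truncate or pad to max_length (if given), then look up each token ID."""
    if max_length:
        tokens = tokens[:max_length]
        tokens = tokens + ['<pad>'] * (max_length - len(tokens))
    return [token_to_id[token] for token in tokens]

short_ids = map_tokens_to_ids(['the', 'cat'], max_length=4)
long_ids = map_tokens_to_ids(['the', 'cat', 'sat', 'the', 'cat'], max_length=4)
print(short_ids)  # [2, 0, 3, 3] -> padded with the <pad> ID
print(long_ids)   # [2, 0, 1, 2] -> truncated to four tokens
```

Two things are worth noticing: the padding ID is the last index of the vocabulary rather than 0, and a token that is missing from the vocabulary raises a `KeyError` unless an unknown token is added and handled.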

### `Dataset`

Next, we need a way to efficiently read in the data file and to process it into tensors. PyTorch provides an easy way to do this using the `torch.utils.data.Dataset` class. We will be creating our own class which inherits from this class. 

Helpful link: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

A custom `Dataset` class must implement three functions: 

- `__init__`: The init function is run once when instantiating the `Dataset` object.
- `__len__`: The len function returns the number of data points in our dataset.
- `__getitem__`: The getitem function returns a sample from the dataset given the index of the sample. The output of this part should be a dictionary of (mostly) PyTorch tensors.
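
As a minimal sketch of these three functions, the toy class below stores pre-computed token IDs and labels in memory. The class name `ToyDataset` and the sample data are hypothetical and only serve to illustrate the interface.

```python
import torch

class ToyDataset(torch.utils.data.Dataset):
    """Hypothetical in-memory dataset of (token_ids, label) pairs."""

    def __init__(self, samples):
        # runs once, when the object is created
        self._samples = samples

    def __len__(self):
        # number of data points
        return len(self._samples)

    def __getitem__(self, index):
        # return one sample as a dictionary of tensors
        token_ids, label = self._samples[index]
        return {
            'token_ids': torch.tensor(token_ids, dtype=torch.long),
            'label': torch.tensor([label], dtype=torch.long),
        }

toy_dataset = ToyDataset([([4, 2, 7], 1), ([5, 1], 0)])
print(len(toy_dataset))
print(toy_dataset[1])
```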

In [82]:
class WikiDataset(torch.utils.data.Dataset):
    def __init__(self, data_path, testData: int):
        self._dataset = []
        self.isTestData = testData
        # read the dataset file, extracting tokens and tags
        with open(data_path, 'r', encoding='utf-8') as f:
            for i,line in enumerate(f):
                if(i==0): continue
                if(self.isTestData == 1 and i > 250): continue
                if(self.isTestData == 0 and i <= 250): continue
                sentence_tokens = []
                elements = line.strip().split(',')
                #sentence
                sentence = elements[3].split(' ')
                for word in sentence:
                    clean_word = word.replace(".", "").replace("(","").replace(")","")
                    sentence_tokens.append(clean_word)
                #Origin Page
                origin_page = elements[0]
                #Destination Page
                destination_page = elements[1]
                #WikiData Labels
                labels = elements[4] if self.isTestData!=2 else None
                self._dataset.append({'sentence_tokens': sentence_tokens, 'origin_page': [origin_page], 'destination_page': [destination_page],'label': [labels]})
        
        # initialize empty vocabularies (they are set later via set_vocab)
        self.sentence_token_vocab = None
        self.origin_page_vocab = None
        self.destination_page_vocab = None
        self.label_vocab = None

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, item: int):
        # get the sample corresponding to the index
        instance = self._dataset[item]
        
        # check the vocabulary has been set
        assert self.sentence_token_vocab is not None
        assert self.origin_page_vocab is not None
        assert self.destination_page_vocab is not None
        assert self.label_vocab is not None
        
        # Convert inputs to tensors, then return
        return self.tensorize(instance['sentence_tokens'],instance['origin_page'], instance['destination_page'],instance['label'])
    
    def tensorize(self, sentence_tokens, origin_page, destination_page, cls, max_length=None):
        # map the sentence tokens into their ID form
        sentence_token_ids = self.sentence_token_vocab.map_tokens_to_ids(sentence_tokens, max_length)
        tensor_dict = {'sentence_token_ids': torch.LongTensor(sentence_token_ids)}
        # map origin_page
        origin_page_map = self.origin_page_vocab.map_tokens_to_ids(origin_page)
        tensor_dict['origin_page_id'] = torch.LongTensor(origin_page_map)
        # map destination_page
        destination_page_map = self.destination_page_vocab.map_tokens_to_ids(destination_page)
        tensor_dict['destination_page_id'] = torch.LongTensor(destination_page_map)
        # map class
        if self.isTestData != 2:
            label_map = self.label_vocab.map_tokens_to_ids(cls)
            tensor_dict['label'] = torch.LongTensor(label_map)
        return tensor_dict
        
    def get_sentence_tokens_list(self):
        s_tokens = [token for d in self._dataset for token in d['sentence_tokens']]
        return sorted(set(s_tokens))
    
    def get_origin_pages_list(self):
        o_pages = [op for d in self._dataset for op in d['origin_page']]
        return sorted(set(o_pages))
    
    def get_destination_pages_list(self):
        d_pages = [dp for d in self._dataset for dp in d['destination_page']]
        return sorted(set(d_pages))

    def get_classes_list(self):
        assert self.isTestData != 2
        clss = [c for d in self._dataset for c in d['label']]
        return sorted(set(clss))

    def set_vocab(self, sentence_token_vocab: Vocabulary, origin_page_vocab: Vocabulary, destination_page_vocab: Vocabulary, label_vocab: Vocabulary):
        self.sentence_token_vocab = sentence_token_vocab
        self.origin_page_vocab = origin_page_vocab
        self.destination_page_vocab = destination_page_vocab
        self.label_vocab = label_vocab
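
One caveat about the reader above: `line.strip().split(',')` splits on every comma, including commas inside a sentence. Whether the data file contains such commas is not shown here, so the following is only a sketch under the assumption that the file uses standard CSV quoting. It uses Python's built-in `csv` module on a made-up two-line file with the same column layout (origin page, destination page, an unused column, sentence, label).

```python
import csv
import io

# Hypothetical file contents; the real data file is not shown in this notebook.
raw_line = 'Rome,Italy,x,"Rome, the capital, is in Italy.",P17'
raw_file = io.StringIO('origin,destination,other,sentence,label\n' + raw_line + '\n')

reader = csv.reader(raw_file)
next(reader)              # skip the header row
rows = list(reader)

naive = raw_line.split(',')

print(rows[0][3])   # the full sentence, commas preserved
print(rows[0][4])   # the label column
print(naive[3])     # the naive split cuts the sentence at its first comma
```

If the sentences in the file never contain commas, both approaches give the same columns.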

Now let's create `Dataset` objects for our training, validation, and test sets!
A key step here is creating the `Vocabulary` objects for these datasets.
We will use the sentence tokens, origin pages, and destination pages from all three sets to initialize one `Vocabulary` per input.
We will also use the list of labels in the training and validation sets to initialize a `Vocabulary` over the labels.

In [112]:
train_dataset = WikiDataset('/content/drive/Shared drives/CS272/data/extended_datasets/roman_history_training_dedup.csv', testData = 0)
valdn_dataset = WikiDataset('/content/drive/Shared drives/CS272/data/extended_datasets/roman_history_training_dedup.csv', testData = 1)
test_dataset = WikiDataset('/content/drive/Shared drives/CS272/data/extended_datasets/roman_history_training_dedup.csv', testData = 2)

# Get list of sentence tokens, origin pages, destination pages, and classes in train set
train_sentence_token_list = train_dataset.get_sentence_tokens_list()
train_origin_page_list = train_dataset.get_origin_pages_list()
train_destination_page_list = train_dataset.get_destination_pages_list()
train_class_list = train_dataset.get_classes_list()

# Get list of sentence tokens, origin pages, destination pages, and classes in valdn set
valdn_sentence_token_list = valdn_dataset.get_sentence_tokens_list()
valdn_origin_page_list = valdn_dataset.get_origin_pages_list()
valdn_destination_page_list = valdn_dataset.get_destination_pages_list()
valdn_class_list = valdn_dataset.get_classes_list()

# Get list of sentence tokens, origin pages, destination pages in test set
test_sentence_token_list = test_dataset.get_sentence_tokens_list()
test_origin_page_list = test_dataset.get_origin_pages_list()
test_destination_page_list = test_dataset.get_destination_pages_list()

#Create Vocabulary
sentence_token_vocab = Vocabulary(sorted(set(train_sentence_token_list + valdn_sentence_token_list + test_sentence_token_list)), add_unk_token=False)
origin_page_vocab = Vocabulary(sorted(set(train_origin_page_list + valdn_origin_page_list + test_origin_page_list)), add_unk_token=False)
destination_page_vocab = Vocabulary(sorted(set(train_destination_page_list + valdn_destination_page_list + test_destination_page_list)), add_unk_token=False)
label_vocab = Vocabulary(sorted(set(train_class_list + valdn_class_list)), add_unk_token=False)

# Update the train/valdn/test set with vocabulary.
train_dataset.set_vocab(sentence_token_vocab, origin_page_vocab, destination_page_vocab, label_vocab)
valdn_dataset.set_vocab(sentence_token_vocab, origin_page_vocab, destination_page_vocab, label_vocab)
test_dataset.set_vocab(sentence_token_vocab, origin_page_vocab, destination_page_vocab, label_vocab)

In [113]:
print(f'Size of training set: {len(train_dataset)}')
print(f'Size of validation set: {len(valdn_dataset)}')
print(f'Size of test set: {len(test_dataset)}')
print(f'sentence_token_vocab: {len(sentence_token_vocab)}')
print(f'origin_page_vocab: {len(origin_page_vocab)}')
print(f'destination_page_vocab: {len(destination_page_vocab)}')

Size of training set: 12099
Size of validation set: 250
Size of test set: 12349
sentence_token_vocab: 27054
origin_page_vocab: 5676
destination_page_vocab: 1880


Let's print out one data point of the tensorized data and see what it looks like:

In [105]:
instance = train_dataset[0]
print(instance)

sentence_tokens = train_dataset.sentence_token_vocab.map_ids_to_tokens(instance['sentence_token_ids'])
origin_page = train_dataset.origin_page_vocab.map_ids_to_tokens(instance['origin_page_id'])
destination_page = train_dataset.destination_page_vocab.map_ids_to_tokens(instance['destination_page_id'])
cls = train_dataset.label_vocab.map_ids_to_tokens(instance['label'])
print()
print(f'Tokens: {sentence_tokens}')
print(f'OriginPage:   {origin_page}')
print(f'DestinationPage:   {destination_page}')
print(f'Class:   {cls}')

{'sentence_token_ids': tensor([ 6385, 13964, 14205,  4638, 14581, 17537, 15517, 13147, 15587,  3405,
        12189, 17205,  9694, 17591,  7921, 12189, 17206, 12189, 13683, 11301,
        11986, 17537,  9788,  8368,  3871, 17537, 10472, 12189, 15061, 11676,
        16586, 17612, 18030,  6703, 12010, 17612, 17537,  6702, 17310, 17612,
        17537, 13681, 12560, 14150, 15746, 15587, 17537, 14867,  5012, 10590]), 'origin_page_id': tensor([2943]), 'destination_page_id': tensor([906]), 'label': tensor([1])}

Tokens: ['It', 'extends', 'from', 'Dobruja', 'in', 'the', 'northeastern', 'corner', 'of', 'Bulgaria', 'and', 'southeastern', 'Romania', 'through', 'Moldova', 'and', 'southern', 'and', 'eastern', 'Ukraine', 'across', 'the', 'Russian', 'Northern', 'Caucasus', 'the', 'Southern', 'and', 'lower', 'Volga', 'regions', 'to', 'western', 'Kazakhstan', 'adjacent', 'to', 'the', 'Kazakh', 'steppe', 'to', 'the', 'east', 'both', 'forming', 'part', 'of', 'the', 'larger', 'Eurasian', 'Steppe']
OriginPa

In [106]:
print(len(train_dataset.sentence_token_vocab))

19315


### `DataLoader`

At this point our data is in a tensor, and we can create context windows using only PyTorch operations.
Now we need a way to generate batches of data for training and evaluation.
To do this, we will wrap our `Dataset` objects in a `torch.utils.data.DataLoader` object, which will automatically batch datapoints.

In [107]:
batch_size = 1
print(f'Setting batch_size to be {batch_size}')

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size)
valdn_dataloader = torch.utils.data.DataLoader(valdn_dataset, batch_size)

Setting batch_size to be 1
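
With `batch_size = 1`, sentences of different lengths never have to share a batch. For larger batches the default collation cannot stack tensors of different lengths, so the sentences need padding first. Below is a minimal sketch of one way to do this with a custom `collate_fn` and `torch.nn.utils.rnn.pad_sequence`. The helper name `make_collate_fn`, the toy instances, and the padding ID are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def make_collate_fn(pad_id):
    """Build a collate function that pads 'sentence_token_ids' with pad_id."""
    def collate(instances):
        batch = {}
        for key in instances[0]:
            tensors = [instance[key] for instance in instances]
            if key == 'sentence_token_ids':
                # variable-length sequences: pad to the longest one in the batch
                batch[key] = pad_sequence(tensors, batch_first=True, padding_value=pad_id)
            else:
                # fixed-size fields can simply be stacked
                batch[key] = torch.stack(tensors)
        return batch
    return collate

# Toy instances shaped like the dictionaries returned by the dataset
instances = [
    {'sentence_token_ids': torch.tensor([5, 6, 7]), 'label': torch.tensor([1])},
    {'sentence_token_ids': torch.tensor([8]), 'label': torch.tensor([0])},
]
loader = torch.utils.data.DataLoader(instances, batch_size=2,
                                     collate_fn=make_collate_fn(pad_id=9))
batch = next(iter(loader))
print(batch['sentence_token_ids'])
print(batch['label'].shape)
```

Note that a model which reads the encoding at the last position would then see padding for the shorter sentences, so it would also need masking or packed sequences before larger batches are useful.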


Now let's do one iteration over our training set to see what a batch looks like:

In [108]:
for i,batch in enumerate(train_dataloader):
    print(batch, '\n')
    print(f'Size of classes: {batch["label"].size()}')
    if(i==2):
        break

{'sentence_token_ids': tensor([[ 6385, 13964, 14205,  4638, 14581, 17537, 15517, 13147, 15587,  3405,
         12189, 17205,  9694, 17591,  7921, 12189, 17206, 12189, 13683, 11301,
         11986, 17537,  9788,  8368,  3871, 17537, 10472, 12189, 15061, 11676,
         16586, 17612, 18030,  6703, 12010, 17612, 17537,  6702, 17310, 17612,
         17537, 13681, 12560, 14150, 15746, 15587, 17537, 14867,  5012, 10590]]), 'origin_page_id': tensor([[2943]]), 'destination_page_id': tensor([[906]]), 'label': tensor([[1]])} 

Size of classes: torch.Size([1, 1])
{'sentence_token_ids': tensor([[11684,  6989, 16294,  2454, 19149, 16839,  3057, 14858, 19297, 16840,
         14710, 11934, 15758, 13900,     0,  3057, 12827, 14581,  8015, 17140,
         15449, 17537, 12827, 15587,  7723, 12189, 12978, 13068, 12308, 17537,
         12187, 12683, 15587, 17537, 14783, 15587,  7665]]), 'origin_page_id': tensor([[4090]]), 'destination_page_id': tensor([[560]]), 'label': tensor([[1]])} 

Size of classes: t

## Model

Now that we can read in the data, it is time to build our model.
We will build a very simple LSTM-based tagger! Note that this is pretty similar to the code in `simple_tagger.py` in your homework, but with a lot of things hardcoded.

Useful links:
- Embedding Layer: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
- LSTMs: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- Linear Layer: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html?highlight=linear#torch.nn.Linear

In [109]:
device = "cuda"

In [110]:
class LSTMClassifier(torch.nn.Module):
    def __init__(self, sentence_token_vocab, origin_page_vocab, destination_page_vocab, label_vocab):
        super(LSTMClassifier, self).__init__()
        self.sentence_token_vocab = sentence_token_vocab
        self.origin_page_vocab = origin_page_vocab
        self.destination_page_vocab = destination_page_vocab
        self.label_vocab = label_vocab
        self.num_classes = len(label_vocab) - 1          ##because of 1 special token in vocab
        # self.input_size=50
        self.hidden_size=25
        self.num_layers=1
        self.embedding_dim = 50
        
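        # Initialize random embeddings of size 50. The table has one row per entry of all three vocabularies,
        # but forward() does not offset the page IDs, so a page and a sentence token with the same ID share a row.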
        self._embeddings = torch.nn.Embedding(len(sentence_token_vocab)+len(origin_page_vocab)+len(destination_page_vocab), self.embedding_dim)
        #print('Input dim: ', len(sentence_token_vocab)+len(origin_page_vocab)+len(destination_page_vocab))
        
        # Initialize a single-layer bidirectional LSTM encoder
        self._encoder = torch.nn.LSTM(input_size=self.embedding_dim, hidden_size=self.hidden_size, num_layers=self.num_layers, bidirectional=True)
        
        # Initialize two linear layers which project from the biLSTM output (2*hidden_size) to the number of classes
        self._fc1 = torch.nn.Linear(in_features=2*self.hidden_size, out_features=self.hidden_size)
        self._fc2 = torch.nn.Linear(self.hidden_size, self.num_classes)

        # Loss is the mean squared error between the softmax outputs and the one-hot gold labels
        self.loss = torch.nn.MSELoss()

    def forward(self, sentence_token_ids, origin_page_id, destination_page_id, label=None):
        # Create mask over all the positions where the input is padded
        #mask = token_ids != self.token_vocab.pad_token_id
        
        # Combine all the features: prepend the origin and destination page IDs to the sentence token IDs,
        # giving a tensor of shape (batch_size, 2 + seq_len)
        input_ = torch.cat((origin_page_id, destination_page_id, sentence_token_ids), dim=1)

        # move the inputs to the device the model runs on
        input_ = input_.to(device)
        
        # Embed inputs and move the sequence dimension first: (seq_len, batch_size, embedding_dim)
        embeddings = self._embeddings(input_).permute(1, 0, 2)
        #print(embeddings.shape)
        
        # Feed embeddings through LSTM; encoder_outputs has shape (seq_len, batch_size, 2*hidden_size)
        encoder_outputs, (h_n, c_n) = self._encoder(embeddings)
        encoder_outputs = (encoder_outputs.permute(1, 0, 2))[:,-1,:]       # keep only the encoding at the last position
        #print(encoder_outputs.shape)
        
        # Project output of LSTM through linear layer to get logits
        fc1_outputs = self._fc1(encoder_outputs)
        #print(fc1_outputs.shape)
        fc2_outputs = self._fc2(fc1_outputs)
        #print(fc2_outputs.shape)
        
        # Get the maximum score for each position as the predicted tag
        final_outputs = torch.nn.functional.softmax(fc2_outputs, dim=-1)
        preds = self.inverseOneHot(final_outputs)
        output_dict = {
            'predicted_labels':  preds
        }
        # Compute loss and accuracy if gold tags are provided
        if label is not None:
            label = label.to(device)
            target_labels = torch.Tensor(self.oneHotEncode(label)).to(device)
            #print(final_outputs.shape, target_labels.shape)
            loss = self.loss(final_outputs, target_labels)
            output_dict['loss'] = loss
            self.getAccuracy(target_labels, final_outputs, output_dict)

        return output_dict
    
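    def oneHotEncode(self, label):
        # label has shape (batch_size, 1) and holds label IDs in [0, num_classes).
        # The pad token is appended last in the Vocabulary, so real label IDs need no offset.
        oneHot = np.zeros((label.shape[0],self.num_classes))
        for i,l in enumerate(label):
            oneHot[i][int(l[0])] = 1
        oneHot = oneHot.tolist()
        return oneHot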
    
    def inverseOneHot(self, one_hot_labels):
        lbls = []
        for i in range(one_hot_labels.shape[0]):
            lbls.append(torch.argmax(one_hot_labels[i]))
        lbls = torch.Tensor(lbls)
        return lbls
        
    
    def getAccuracy(self, target_labels, final_outputs, output_dict):
        correct = torch.Tensor([1 if target_labels[i,torch.argmax(final_outputs[i]).to(int).item()].to(int).item()==1 else 0 for i in range(target_labels.shape[0])]) # 1's in positions where pred matches gold
        #correct *= mask # zero out positions where mask is zero
        output_dict['accuracy'] = (torch.sum(correct)/target_labels.shape[0]).item()
        
        #per label accuracies: for each predicted label, count the predictions and how many were correct
        # (the dictionaries are created once, before the loop, so counts accumulate over the whole batch)
        output_dict['label_accuracy'] = {}
        output_dict['label_count'] = {}
        for i in range(target_labels.shape[0]):
            cls = torch.argmax(final_outputs[i]).to(int).item()
            if cls not in output_dict['label_accuracy']:
                output_dict['label_accuracy'][cls] = 0.0
            if cls not in output_dict['label_count']:
                output_dict['label_count'][cls] = 0.0
            
            output_dict['label_count'][cls] += 1
            
            if target_labels[i,cls]==1:
                output_dict['label_accuracy'][cls] += 1
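
The model above trains with mean squared error between softmax probabilities and one-hot labels. For classification, cross-entropy on the raw logits is a common alternative. The sketch below compares the two on random toy logits; it is an illustration under assumed shapes (a batch of 4 and 10 classes), not the notebook's method.

```python
import torch

torch.manual_seed(0)
num_classes = 10
logits = torch.randn(4, num_classes)        # raw scores, shape (batch_size, num_classes)
labels = torch.tensor([1, 0, 3, 9])         # gold class indices, shape (batch_size,)

# Cross-entropy takes raw logits and class indices; it applies log-softmax internally.
ce_loss = torch.nn.CrossEntropyLoss()(logits, labels)

# The approach used in the class above: MSE between softmax probabilities and one-hot targets.
probs = torch.softmax(logits, dim=-1)
one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
mse_loss = torch.nn.MSELoss()(probs, one_hot)

# Predictions are the same either way, since softmax does not change the argmax.
preds = logits.argmax(dim=-1)
print(ce_loss.item(), mse_loss.item(), preds.tolist())
```

With cross-entropy there is no need to build one-hot targets by hand, and the softmax is only needed when probabilities are reported.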

### Training

The training script essentially follows the same pattern that we used for the linear model above. However, we have also added an evaluation step and (commented-out) code for saving model checkpoints.

In [111]:
from tqdm import tqdm

################################
# Setup
################################
# Create model
model = LSTMClassifier(sentence_token_vocab=sentence_token_vocab, origin_page_vocab=origin_page_vocab, destination_page_vocab=destination_page_vocab,label_vocab=label_vocab)
if device=="cuda":
    model = model.cuda()

# Initialize optimizer.
# Note: The learning rate is an important hyperparameter to tune
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

################################
# Training and Evaluation!
################################
num_epochs = 10
# best_dev_loss = float('inf')

for epoch in range(num_epochs):
    print('\nEpoch', epoch)
    # Training loop
    model.train() # THIS PART IS VERY IMPORTANT TO SET BEFORE TRAINING
    train_loss = 0
    train_acc = 0
    label_acc, label_count = {}, {}
    for batch in train_dataloader:
        batch_size = batch['sentence_token_ids'].size(0)
        optimizer.zero_grad()
        output_dict = model(**batch)
        loss = output_dict['loss']
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()*batch_size
        accuracy = output_dict['accuracy']
        train_acc += accuracy*batch_size
        label_acc = {k: label_acc.get(k, 0) + output_dict['label_accuracy'].get(k, 0) for k in set(label_acc) | set(output_dict['label_accuracy'])}
        label_count = {k: label_count.get(k, 0) + output_dict['label_count'].get(k, 0) for k in set(label_count) | set(output_dict['label_count'])}
    train_loss /= len(train_dataset)
    train_acc /= len(train_dataset)
    print(f'Train loss {train_loss} accuracy {train_acc}')
#     for k in label_count:
#         print(f'{k} = {train_dataset.label_vocab.map_ids_to_tokens([k])} label accuracy {label_acc[k]/label_count[k]}')
    
    # Evaluation loop
    model.eval() # THIS PART IS VERY IMPORTANT TO SET BEFORE EVALUATION
    dev_loss = 0
    dev_acc = 0
    dev_label_acc, dev_label_count = {}, {}
    with torch.no_grad(): # gradients are not needed for evaluation
        for batch in valdn_dataloader:
            batch_size = batch['sentence_token_ids'].size(0)
            output_dict = model(**batch)
            dev_loss += output_dict['loss'].item()*batch_size
            dev_acc += output_dict['accuracy']*batch_size
            dev_label_acc = {k: dev_label_acc.get(k, 0) + output_dict['label_accuracy'].get(k, 0) for k in set(dev_label_acc) | set(output_dict['label_accuracy'])}
            dev_label_count = {k: dev_label_count.get(k, 0) + output_dict['label_count'].get(k, 0) for k in set(dev_label_count) | set(output_dict['label_count'])}
    dev_loss /= len(valdn_dataset)
    dev_acc /= len(valdn_dataset)
    print(f'Dev loss {dev_loss} accuracy {dev_acc}')
    for k in dev_label_count:
        print(f'{k} = {valdn_dataset.label_vocab.map_ids_to_tokens([k])} label accuracy {dev_label_acc[k]/dev_label_count[k]}')
    
#     # Save best model
#     if dev_loss < best_dev_loss:
#         print('Best so far')
#         torch.save(model, 'model.pt')
#         best_dev_loss = dev_loss


Epoch 0
Train loss 0.06613933239150539 accuracy 0.46813806837039496
Dev loss 0.05982429902930744 accuracy 0.496
0 = ['P131'] label accuracy 0.6163522012578616
1 = ['P17'] label accuracy 0.0
2 = ['P19'] label accuracy 0.0
3 = ['P20'] label accuracy 0.24
5 = ['P276'] label accuracy 0.3
7 = ['P39'] label accuracy 0.47619047619047616
8 = ['P47'] label accuracy 0.5
9 = ['P710'] label accuracy 0.0

Epoch 1
Train loss 0.05269091696956827 accuracy 0.5814802522402921
Dev loss 0.05559320368650788 accuracy 0.58
0 = ['P131'] label accuracy 0.773109243697479
1 = ['P17'] label accuracy 0.3333333333333333
2 = ['P19'] label accuracy 0.16666666666666666
3 = ['P20'] label accuracy 0.21428571428571427
4 = ['P27'] label accuracy 0.5
5 = ['P276'] label accuracy 0.46153846153846156
6 = ['P361'] label accuracy 0.5
7 = ['P39'] label accuracy 0.6875
8 = ['P47'] label accuracy 0.4444444444444444
9 = ['P710'] label accuracy 0.42857142857142855

Epoch 2
Train loss 0.042813045565303916 accuracy 0.6755725190839694
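
A note on reading the per-label numbers above: the counts are keyed by the predicted label, so each "label accuracy" is the fraction of predictions of that label that were correct, which is usually called precision. Labels that were never predicted do not appear at all. The sketch below reproduces that bookkeeping on made-up predictions and also computes recall for comparison; the label IDs are hypothetical.

```python
from collections import Counter

# Hypothetical predicted and gold label IDs for six examples
predicted = [0, 0, 1, 2, 0, 1]
gold      = [0, 1, 1, 2, 0, 0]

pred_count = Counter(predicted)   # how often each label was predicted
gold_count = Counter(gold)        # how often each label is the gold label
correct = Counter(p for p, g in zip(predicted, gold) if p == g)

precision, recall = {}, {}
for label in sorted(set(predicted) | set(gold)):
    # fraction of predictions of this label that were right (what the loop above prints)
    precision[label] = correct[label] / pred_count[label] if pred_count[label] else float('nan')
    # fraction of gold examples of this label that were found
    recall[label] = correct[label] / gold_count[label] if gold_count[label] else float('nan')

print(precision)
print(recall)
```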

In [None]:
test_batch_size = 1
test_sampler = torch.utils.data.SequentialSampler(test_dataset)
test_dataloader = torch.utils.data.DataLoader(test_dataset, sampler= test_sampler,batch_size= test_batch_size)

# Evaluation loop
model.eval() # THIS PART IS VERY IMPORTANT TO SET BEFORE EVALUATION
with open('extended_datasets/predictions/sustainability_test_predictions.csv', mode='w') as f:
    for batch in test_dataloader:
        #print(batch)
        batch_size = batch['sentence_token_ids'].size(0)
        output_dict = model(**batch)
        #print(output_dict)
        test_wiki_labels = test_dataset.label_vocab.map_ids_to_tokens(output_dict['predicted_labels'].to(torch.int))
        for p in test_wiki_labels:
            f.write(p)
            f.write('\n')
    

## Loading Trained Models

Loading a pretrained model can be done easily. To learn more about saving/loading models see https://pytorch.org/tutorials/beginner/saving_loading_models.html

In [None]:
model = torch.load('model.pt')
if torch.cuda.is_available():
    model = model.cuda()
def oneHotToLabel(oneHot: list):
    return np.argmax(np.array(oneHot))
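
The cell above loads a whole pickled model object. The tutorial linked above also describes saving only the parameters via `state_dict()`, which is less tied to the exact class definition. A minimal sketch with a stand-in `torch.nn.Linear` model and a temporary file follows; the file name and the toy model are assumptions for illustration.

```python
import os
import tempfile
import torch

toy_model = torch.nn.Linear(4, 2)   # stand-in for a trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_state.pt')   # hypothetical file name

    # save only the parameters
    torch.save(toy_model.state_dict(), path)

    # rebuild the architecture first, then load the parameters into it
    restored = torch.nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location='cpu'))
    restored.eval()

same = all(torch.equal(a, b) for a, b in
           zip(toy_model.state_dict().values(), restored.state_dict().values()))
print(same)
```

For the classifier in this notebook, the same pattern would require constructing the model with the four vocabularies before calling `load_state_dict`.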

## Feed in your own sentences!

In [None]:
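sentence = 'i want to eat a pizza .'.lower().split()
# the vocabulary was built without an <unk> token, so drop words it has never seen
sentence = [w for w in sentence if w in train_dataset.sentence_token_vocab.token_to_id]

# the model also expects an origin page and a destination page;
# as placeholders, reuse the pages (and label) of the first training example
example = train_dataset._dataset[0]

# convert inputs to a tensor dictionary
tensor_dict = train_dataset.tensorize(sentence, example['origin_page'], example['destination_page'], example['label'])

# unsqueeze first dimension so batch size is 1
tensor_dict = {name: tensor.unsqueeze(0) for name, tensor in tensor_dict.items()}
tensor_dict['label'] = None
print(tensor_dict)

# feed through model
output_dict = model(**tensor_dict)

# get predicted label ID
pred_label = int(output_dict['predicted_labels'][0].item())
print(pred_label)

# convert label ID to label name
print(train_dataset.label_vocab.map_id_to_token(pred_label))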

## Conclusion

You've now seen at a high level how to create neural networks for NLP.
You've also now seen the components that go around a model (e.g. training loops, data processing).
Setting up these components in a flexible way can be tricky for NLP, as there are many issues that you have to take care of like padding, different vocabularies, etc.
For example, how would you build upon this code to load in pre-trained embeddings, or use character embeddings?

That's why there exist many libraries that take care of these boilerplate components so that you can focus on modeling.
One of these libraries is [allennlp](https://allennlp.org/), and if you have time, I encourage you to take a look at it. 
It builds upon PyTorch so everything you've learned here is applicable.
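
Returning to the question about pre-trained embeddings, one possible starting point is `torch.nn.Embedding.from_pretrained`. The sketch below builds a weight matrix whose rows follow the vocabulary order, copies in vectors for the words that have one, and leaves small random values for the rest. The toy vocabulary and the three-dimensional vectors are made up for illustration; real vectors would be read from a file.

```python
import torch

torch.manual_seed(0)

# Toy vocabulary in ID order, with the padding token appended last
id_to_token = ['cat', 'dog', 'the', '<pad>']
# Hypothetical pre-trained vectors; 'dog' is deliberately missing
pretrained = {'cat': [0.1, 0.2, 0.3], 'the': [0.4, 0.5, 0.6]}
embedding_dim = 3

# start from small random values, then overwrite the rows we have vectors for
weights = torch.randn(len(id_to_token), embedding_dim) * 0.1
for idx, token in enumerate(id_to_token):
    if token in pretrained:
        weights[idx] = torch.tensor(pretrained[token])

pad_id = id_to_token.index('<pad>')
weights[pad_id] = 0.0   # keep the padding vector at zero

# freeze=False lets the vectors be fine-tuned during training
embedding = torch.nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=pad_id)
vectors = embedding(torch.tensor([0, pad_id]))
print(vectors)
```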