# Introduction to PyTorch

PyTorch is a Python package for performing tensor computation, automatic differentiation, and dynamically defining neural networks. It makes it particularly easy to accelerate model training with a GPU. In recent years it has gained a large following in the NLP community.


## Installing PyTorch

Instructions for installing PyTorch can be found on the home page of their website: <http://pytorch.org/>. The PyTorch developers recommend using the `conda` package manager to install the library (in my experience `pip` works fine as well).

One thing to be aware of is that the package name differs depending on whether or not you intend to use a GPU. If you do plan on using a GPU, you will need to install CUDA and cuDNN before installing PyTorch. Detailed instructions can be found at NVIDIA's website: <https://docs.nvidia.com/cuda/>. When this tutorial was written, the supported CUDA versions were 7.5, 8, and 9; check the PyTorch website for the current list.


## PyTorch Basics

The PyTorch API is designed to very closely resemble NumPy. The central object for performing computation is the `Tensor`, which is PyTorch's version of NumPy's `array`.

In [1]:
import numpy as np
import torch

In [2]:
# Create an uninitialized 3 x 2 array (the values are whatever was in memory)
np.ndarray((3, 2))

array([[2.68156159e+154, 2.68156159e+154],
       [2.23901381e-314, 2.23902153e-314],
       [2.23862805e-314, 0.00000000e+000]])

In [3]:
# Create an uninitialized 3 x 2 Tensor (the values are whatever was in memory)
torch.Tensor(3, 2)

tensor([[2.6589e+23, 1.0187e-11],
        [1.0514e-05, 1.1039e-05],
        [1.0978e-05, 4.1291e-05]])
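
Both constructors above allocate memory without initializing it, which is why the printed values look arbitrary. When you want well-defined contents, factory functions such as `torch.zeros`, `torch.ones`, and `torch.rand` are a common alternative. A minimal sketch (the seed value is an arbitrary choice, used only for reproducibility):

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the random tensor is reproducible

zeros = torch.zeros(3, 2)    # 3 x 2 tensor filled with 0.0
ones = torch.ones(3, 2)      # 3 x 2 tensor filled with 1.0
uniform = torch.rand(3, 2)   # 3 x 2 tensor of samples drawn from U[0, 1)

print(zeros)
print(ones)
print(uniform.shape)
```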

All of the basic arithmetic operations are supported.

In [4]:
a = torch.Tensor([1,2])
b = torch.Tensor([3,4])
print('a + b:', a + b)
print('a - b:', a - b)
print('a * b:', a * b)
print('a / b:', a / b)

a + b: tensor([4., 6.])
a - b: tensor([-2., -2.])
a * b: tensor([3., 8.])
a / b: tensor([0.3333, 0.5000])
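
Note that `*` multiplies entry by entry rather than computing a dot product. The sketch below contrasts the two and also shows a scalar being broadcast across a tensor; the variable names are illustrative only.

```python
import torch

a = torch.tensor([1., 2.])
b = torch.tensor([3., 4.])

elementwise = a * b      # multiplies matching entries: [1*3, 2*4]
dot = torch.dot(a, b)    # sums the elementwise products: 1*3 + 2*4
shifted = a + 10         # the scalar is broadcast to every entry

print(elementwise.tolist())  # [3.0, 8.0]
print(dot.item())            # 11.0
print(shifted.tolist())      # [11.0, 12.0]
```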


Indexing/slicing also behaves the same.

In [5]:
a = torch.randint(0, 10, (4, 4))
print('a:', a, '\n')

# Slice using ranges
print('a[2:, :]', a[2:, :], '\n')

# Can count backwards using negative indices
print('a[:, -1]', a[:, -1])

a: tensor([[9, 5, 7, 4],
        [9, 7, 8, 0],
        [5, 8, 8, 2],
        [0, 7, 8, 6]]) 

a[2:, :] tensor([[5, 8, 8, 2],
        [0, 7, 8, 6]]) 

a[:, -1] tensor([4, 0, 2, 6])


Resizing and reshaping tensors is also quite simple:

In [6]:
print('Reshape tensor into a single row:')
a = torch.randint(0, 10, (3, 3))

print(f'Before size: {a.size()}')
print(a, '\n')

a = a.view(1, 9)
print(f'After size: {a.size()}')
print(a)

Reshape tensor into a single row:
Before size: torch.Size([3, 3])
tensor([[4, 2, 8],
        [9, 0, 4],
        [6, 5, 1]]) 

After size: torch.Size([1, 9])
tensor([[4, 2, 8, 9, 0, 4, 6, 5, 1]])
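
Note that `view(1, 9)` keeps two dimensions (one row of nine entries). To get a tensor with a single dimension, one option is `view(-1)`, where `-1` asks PyTorch to infer the length. A small sketch using `torch.arange` for predictable contents:

```python
import torch

a = torch.arange(9).view(3, 3)  # 3 x 3 tensor holding 0..8

row = a.view(1, 9)   # still 2-dimensional: one row, nine columns
flat = a.view(-1)    # 1-dimensional; -1 lets PyTorch infer the length

print(row.size())    # torch.Size([1, 9])
print(flat.size())   # torch.Size([9])
```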


Changing a `Tensor` to and from an `array` is also quite simple:

In [7]:
# Tensor from array
arr = np.array([1,2])
torch.from_numpy(arr)

tensor([1, 2])

In [8]:
# Tensor to array
t = torch.Tensor([1, 2])
t.numpy()

array([1., 2.], dtype=float32)
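
One detail worth knowing: `torch.from_numpy` does not copy the data, so the tensor and the array share memory. The sketch below shows the effect and contrasts it with `torch.tensor`, which makes a copy.

```python
import numpy as np
import torch

arr = np.array([1, 2])
shared = torch.from_numpy(arr)  # shares memory with `arr`, no copy is made
copied = torch.tensor(arr)      # torch.tensor() copies the data instead

arr[0] = 100                    # modify the NumPy array in place

print(shared.tolist())  # [100, 2]  (the change is visible through the tensor)
print(copied.tolist())  # [1, 2]    (the copy is unaffected)
```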

Moving `Tensor`s to the GPU is also quite simple:

In [9]:
t = torch.Tensor([1, 2]) # on CPU
if torch.cuda.is_available():
    t = t.cuda() # on GPU
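
A common alternative to calling `.cuda()` directly is to pick a `torch.device` once and pass it around, so the same code runs with or without a GPU. A minimal sketch of that pattern:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

t = torch.tensor([1., 2.]).to(device)  # no-op if the tensor is already there
u = torch.ones(2, device=device)       # create directly on the target device

# Operands of an operation must live on the same device
print((t + u).device)
```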

## Automatic Differentiation

Helpful link: https://pytorch.org/tutorials/beginner/basics/autograd_tutorial.html

Derivatives and gradients are critical to a large number of machine learning algorithms. One of the key benefits of PyTorch is that these can be computed automatically.

We'll demonstrate this using the following example. Suppose we have some data $x$ and $y$, and want to fit a model:
$$ \hat{y} = mx + b $$
by minimizing the loss function:
$$ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 $$


In [10]:
# Data
x = torch.tensor([1.,  2,  3,  4])  # requires_grad = False by default
y = torch.tensor([0., -1, -2, -3])

# Initialize parameters
m = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

# Define regression function
y_hat = m * x + b
print(y_hat)
# Define loss
loss = torch.mean(0.5 * (y - y_hat)**2)

loss.backward() # Backprop the gradients of the loss w.r.t other variables

tensor([1.5903, 2.2638, 2.9373, 3.6109], grad_fn=<AddBackward0>)


If you look at the $x$ and $y$ values, you can see that the perfect values for our parameters are $m = -1$ and $b = 1$.

To obtain the gradient of $L$ w.r.t. $m$ and $b$ you need only run:

In [11]:
# Gradients
print('Gradients:')
print('dL/dm: %0.4f' % m.grad)
print('dL/db: %0.4f' % b.grad)

Gradients:
dL/dm: 12.3434
dL/db: 4.1006
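
As a sanity check, the gradients can also be derived by hand. Since the code averages the loss over the data points, $\partial L / \partial m$ is the mean of $(\hat{y} - y)\,x$ and $\partial L / \partial b$ is the mean of $(\hat{y} - y)$. The sketch below recomputes the example with fresh random parameters and compares the two; the `manual_*` names are illustrative.

```python
import torch

x = torch.tensor([1., 2, 3, 4])
y = torch.tensor([0., -1, -2, -3])
m = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

y_hat = m * x + b
loss = torch.mean(0.5 * (y - y_hat) ** 2)
loss.backward()

# Hand-derived gradients of the mean loss:
#   dL/dm = mean((y_hat - y) * x),  dL/db = mean(y_hat - y)
with torch.no_grad():
    residual = y_hat - y
    manual_dm = torch.mean(residual * x)
    manual_db = torch.mean(residual)

print(torch.allclose(m.grad, manual_dm))
print(torch.allclose(b.grad, manual_db))
```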


## Training Models

While automatic differentiation is in itself a useful feature, it can be quite tedious to keep track of all of the different parameters and gradients for more complicated models. In order to make life simple, PyTorch defines a `torch.nn.Module` class which handles all of these details for you. To paraphrase the PyTorch documentation, this is the base class for all neural network modules, and whenever you define a model it should be a subclass of this class.

There are two main methods you need to implement for a `Module` subclass:
- `__init__`: Called when the object is created. Used to define parameters, layers, etc.
- `forward`: Computes the model's output from its inputs. It runs whenever the model is called.

Here is an example implementation of the simple linear model given above:

In [12]:
import torch.nn as nn

class LinearModel(nn.Module):
    
    def __init__(self):
        """This method is called when you instantiate a new LinearModel object.
        
        You should use it to define the parameters/layers of your model.
        """
        # Whenever you define a new nn.Module you should start the __init__()
        # method with the following line. Remember to replace `LinearModel` 
        # with whatever you are calling your model.
        super(LinearModel, self).__init__()
        
        # Now we define the parameters used by the model.
        self.m = torch.nn.Parameter(torch.rand(1))
        self.b = torch.nn.Parameter(torch.rand(1))
    
    def forward(self, x):
        """This method computes the output of the model.
        
        Args:
            x: The input data.
        """
        return self.m * x + self.b

# Initialize model
model = LinearModel()

# Example forward pass. Note that we use model(x) not model.forward(x) !!! 
y_hat = model(x)
print(x, y_hat)

tensor([1., 2., 3., 4.]) tensor([0.9696, 1.0945, 1.2193, 1.3442], grad_fn=<AddBackward0>)
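
Anything wrapped in `nn.Parameter` and assigned as an attribute is registered with the module automatically. The sketch below lists the registered parameters with `named_parameters()`; `TinyLinear` is a stand-in defined here only so the snippet runs on its own.

```python
import torch
import torch.nn as nn

class TinyLinear(nn.Module):  # stand-in for the LinearModel above
    def __init__(self):
        super().__init__()
        self.m = nn.Parameter(torch.rand(1))
        self.b = nn.Parameter(torch.rand(1))

    def forward(self, x):
        return self.m * x + self.b

tiny = TinyLinear()
# Each registered parameter is reported with its attribute name
for name, param in tiny.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
```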


To train this model we need to pick an optimizer such as SGD, Adadelta, Adam, etc. There are many options in `torch.optim`. When initializing an optimizer, the first argument is the collection of parameters you want optimized. To obtain all of the trainable parameters of a model you can call the `nn.Module.parameters()` method. For example, the following code initializes an SGD optimizer for the model defined above:

In [13]:
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

Training is done in a loop. The general structure is:

1. Clear the gradients.
2. Evaluate the model.
3. Calculate the loss.
4. Backpropagate.
5. Perform an optimization step.
6. (Once in a while) Print monitoring metrics.

For example, we can train our linear model by running:

In [14]:
import time

for i in range(5001):
    optimizer.zero_grad()
    y_hat = model(x) # calling model() calls the forward function
    loss = torch.mean(0.5 * (y - y_hat)**2)
    loss.backward()
    optimizer.step()
    if i % 1000 == 0:
        time.sleep(1) # DO NOT INCLUDE THIS IN YOUR CODE !!! Only for demo.
        print(f'Iteration {i} - Loss: {loss.item():0.6f}')

Iteration 0 - Loss: 4.320335
Iteration 1000 - Loss: 0.000000
Iteration 2000 - Loss: 0.000000
Iteration 3000 - Loss: 0.000000
Iteration 4000 - Loss: 0.000000
Iteration 5000 - Loss: 0.000000
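
The first step of the loop matters because `backward()` adds to any gradient already stored on a parameter instead of overwriting it. A minimal sketch of that accumulation, using a single hypothetical parameter `w`:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(2 * w).sum().backward()
first = w.grad.clone()         # d(2w)/dw = 2

(2 * w).sum().backward()       # gradients were not cleared in between
accumulated = w.grad.clone()   # 2 + 2 = 4: the new gradient was added

w.grad.zero_()                 # manual reset of this one gradient
print(first.item(), accumulated.item(), w.grad.item())  # 2.0 4.0 0.0
```

Calling `optimizer.zero_grad()` clears the gradients of every parameter the optimizer manages, so you do not have to reset them one by one.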


Observe that the model's predictions now match the targets and that the final parameters are what we expect:

In [15]:
print(model(x), y)

tensor([-5.9605e-08, -1.0000e+00, -2.0000e+00, -3.0000e+00],
       grad_fn=<AddBackward0>) tensor([ 0., -1., -2., -3.])


In [16]:
print('Final parameters:')
print('m: %0.2f' % model.m)
print('b: %0.2f' % model.b)

Final parameters:
m: -1.00
b: 1.00


# CASE STUDY: POS Tagging!

Now let's dive into an example that is more relevant to NLP and to your HW3: part-of-speech tagging! We will build up code to the point where you can process the POS data into tensors and then train a simple model on it.
The code we are building up to forms the basis of the code in the homework assignment.

To start, we'll need some data to train and evaluate on. First download the train and dev POS data `twitter_train.pos` and `twitter_dev.pos` into the same directory as this notebook.

In [17]:
print("First data point:")
with open('twitter_train.pos', 'r') as f:
    for line in f:
        line = line.strip()
        print('\t', line)
        if line == '':
            break

First data point:
	 @paulwalk	X	None	None
	 It	PRON	None	None
	 's	VERB	None	None
	 the	DET	None	None
	 view	NOUN	None	None
	 from	ADP	None	None
	 where	ADV	None	None
	 I	PRON	None	None
	 'm	VERB	None	None
	 living	VERB	None	None
	 for	ADP	None	None
	 two	NUM	None	None
	 weeks	NOUN	None	None
	 .	.	None	None
	 Empire	NOUN	None	None
	 State	NOUN	None	None
	 Building	NOUN	None	None
	 =	X	None	None
	 ESB	NOUN	None	None
	 .	.	None	None
	 Pretty	ADV	None	None
	 bad	ADJ	None	None
	 storm	NOUN	None	None
	 here	ADV	None	None
	 last	ADJ	None	None
	 evening	NOUN	None	None
	 .	.	None	None
	 


We will now be introducing three new components which are vital to training (NLP) models:
1. a `Vocabulary` object which converts from tokens/labels to integers. This part should also be able to handle padding so that batches can be easily created.
2. a `Dataset` object which takes in the data file and produces data tensors
3. a `DataLoader` object which takes data tensors from `Dataset` and batches them

### `Vocabulary`

Next, we need to get our data into Python and in a form that is usable by PyTorch. For text data this typically entails building a `Vocabulary`  of all of the words, then mapping words to integers corresponding to their place in the sorted vocabulary. This can be done as follows:

In [18]:
class Vocabulary():
    """ Object holding vocabulary and mappings
    Args:
        word_list: ``list`` A list of words. Words assumed to be unique.
        add_unk_token: ``bool`` Whether to create a token for unknown tokens.
    """
    def __init__(self, word_list, add_unk_token=False):
        # create special tokens for padding and unknown words
        self.pad_token = '<pad>'
        self.unk_token = '<unk>' if add_unk_token else None

        self.special_tokens = [self.pad_token]
        if self.unk_token:
            self.special_tokens += [self.unk_token]

        self.word_list = word_list
        
        # maps from the token ID to the token
        self.id_to_token = self.word_list + self.special_tokens
        # maps from the token to its token ID
        self.token_to_id = {token: id for id, token in
                            enumerate(self.id_to_token)}
        
    def __len__(self):
        """ Returns size of vocabulary """
        return len(self.token_to_id)
    
    @property
    def pad_token_id(self):
        return self.map_token_to_id(self.pad_token)
        
    def map_token_to_id(self, token: str):
        """ Maps a single token to its token ID """
        if token not in self.token_to_id:
            token = self.unk_token
        return self.token_to_id[token]

    def map_id_to_token(self, id: int):
        """ Maps a single token ID to its token """
        return self.id_to_token[id]

    def map_tokens_to_ids(self, tokens: list, max_length: int = None):
        """ Maps a list of tokens to a list of token IDs """
        # truncate extra tokens and pad to `max_length`
        if max_length:
            tokens = tokens[:max_length]
            tokens = tokens + [self.pad_token]*(max_length-len(tokens))
        return [self.map_token_to_id(token) for token in tokens]

    def map_ids_to_tokens(self, ids: list, filter_padding=True):
        """ Maps a list of token IDs to a list of token """
        tokens = [self.map_id_to_token(id) for id in ids]
        if filter_padding:
            tokens = [t for t in tokens if t != self.pad_token]
        return tokens

Let's create a vocabulary with a small number of words:

In [19]:
word_list = ['i', 'like', 'dogs', '!']
vocab = Vocabulary(word_list, add_unk_token=True)

In [20]:
print('map from the token "i" to its token ID, then back again')
token_id = vocab.map_token_to_id('i')
print(token_id)
print(vocab.map_id_to_token(token_id))

map from the token "i" to its token ID, then back again
0
i


In [21]:
print('what about a token not in our vocabulary like "you"?')
token_id = vocab.map_token_to_id('you')
print(token_id)
print(vocab.map_id_to_token(token_id))

what about a token not in our vocabulary like "you"?
5
<unk>


In [22]:
token_ids = vocab.map_tokens_to_ids(['i', 'like', 'dogs', '!'], max_length=10)
print("mapping a sequence of tokens: \'['i', 'like', 'dogs', '!']\'")
print(token_ids)
print(vocab.map_ids_to_tokens(token_ids, filter_padding=False))

mapping a sequence of tokens: '['i', 'like', 'dogs', '!']'
[0, 1, 2, 3, 4, 4, 4, 4, 4, 4]
['i', 'like', 'dogs', '!', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
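
The padding and truncation step inside `map_tokens_to_ids` can also be looked at in isolation. The helper below is a hypothetical standalone version of that logic (it is not part of the `Vocabulary` class) showing both the padding and the truncation case:

```python
def pad_or_truncate(tokens, max_length, pad_token='<pad>'):
    """Hypothetical helper mirroring the logic in map_tokens_to_ids."""
    tokens = tokens[:max_length]  # drop tokens beyond max_length
    # fill the remaining positions (if any) with the padding token
    return tokens + [pad_token] * (max_length - len(tokens))

padded = pad_or_truncate(['i', 'like', 'dogs', '!'], max_length=6)
truncated = pad_or_truncate(['i', 'like', 'dogs', '!'], max_length=2)

print(padded)     # ['i', 'like', 'dogs', '!', '<pad>', '<pad>']
print(truncated)  # ['i', 'like']
```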


### `Dataset`

Next, we need a way to efficiently read in the data file and to process it into tensors. PyTorch provides an easy way to do this using the `torch.utils.data.Dataset` class. We will be creating our own class which inherits from this class. 

Helpful link: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

A custom `Dataset` class must implement three methods:

- `__init__`: Runs once when instantiating the `Dataset` object.
- `__len__`: Returns the number of data points in our dataset.
- `__getitem__`: Returns a sample from the dataset given the index of the sample. The output should be a dictionary of (mostly) PyTorch tensors.
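
Before the full POS dataset, here is a toy sketch of the three methods on their own. `SquaresDataset` is a hypothetical example (not part of the homework code) whose item `i` is the pair `(i, i**2)`:

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy dataset (hypothetical): item i is the pair (i, i**2)."""

    def __init__(self, n):
        self._n = n  # number of data points

    def __len__(self):
        return self._n

    def __getitem__(self, index):
        if not 0 <= index < self._n:
            raise IndexError(index)
        value = torch.tensor(float(index))
        # return a dictionary of tensors, as the POS dataset does
        return {'x': value, 'y': value ** 2}

squares = SquaresDataset(5)
print(len(squares))  # 5
print(squares[3])    # {'x': tensor(3.), 'y': tensor(9.)}
```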

In [23]:
class TwitterPOSDataset(torch.utils.data.Dataset):
    def __init__(self, data_path, max_length=30):
        self._max_length = max_length
        self._dataset = []
    
        # read the dataset file, extracting tokens and tags
        with open(data_path, 'r') as f:
            tokens, tags = [], []
            for line in f:
                elements = line.strip().split('\t')
                # empty line means end of sentence
                if elements == [""]:
                    self._dataset.append({'tokens': tokens, 'tags': tags})
                    tokens, tags = [], []
                else:
                    tokens.append(elements[0].lower())
                    tags.append(elements[1])
        
        # initialize an empty vocabulary
        self.token_vocab = None
        self.tag_vocab = None

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, item: int):
        # get the sample corresponding to the index
        instance = self._dataset[item]
        
        # check the vocabulary has been set
        assert self.token_vocab is not None
        assert self.tag_vocab is not None
        
        # Convert inputs to tensors, then return
        return self.tensorize(instance['tokens'], instance['tags'], self._max_length)
    
    def tensorize(self, tokens, tags=None, max_length=None):
        # map the tokens and tags into their ID form
        token_ids = self.token_vocab.map_tokens_to_ids(tokens, max_length)
        tensor_dict = {'token_ids': torch.LongTensor(token_ids)}
        if tags:
            tag_ids = self.tag_vocab.map_tokens_to_ids(tags, max_length)
            tensor_dict['tag_ids'] = torch.LongTensor(tag_ids)
        return tensor_dict
        
    def get_tokens_list(self):
        """ Returns sorted list of unique tokens in dataset """
        tokens = [token for d in self._dataset for token in d['tokens']]
        return sorted(set(tokens))

    def get_tags_list(self):
        """ Returns sorted list of unique tags in dataset """
        tags = [tag for d in self._dataset for tag in d['tags']]
        return sorted(set(tags))

    def set_vocab(self, token_vocab: Vocabulary, tag_vocab: Vocabulary):
        self.token_vocab = token_vocab
        self.tag_vocab = tag_vocab

Now let's create `Dataset` objects for our training and validation sets!
A key step here is creating the `Vocabulary` for these datasets.
We will use the list of words in the training set to initialize a `Vocabulary` object over the input words.
We will also use the list of tags to initialize a `Vocabulary` over the tags.

In [24]:
train_dataset = TwitterPOSDataset('twitter_train.pos')
dev_dataset = TwitterPOSDataset('twitter_dev.pos')

# Get list of tokens and tags seen in training set and use to create Vocabulary
token_list = train_dataset.get_tokens_list()
tag_list = train_dataset.get_tags_list()

token_vocab = Vocabulary(token_list, add_unk_token=True)
tag_vocab = Vocabulary(tag_list)

# Update the train/dev set with vocabulary. Notice we created the vocabulary using the training set
train_dataset.set_vocab(token_vocab, tag_vocab)
dev_dataset.set_vocab(token_vocab, tag_vocab)

In [25]:
print(f'Size of training set: {len(train_dataset)}')
print(f'Size of validation set: {len(dev_dataset)}')

Size of training set: 379
Size of validation set: 112


Let's print out one data point of the tensorized data and see what it looks like:

In [26]:
instance = train_dataset[2]
print(instance)

tokens = train_dataset.token_vocab.map_ids_to_tokens(instance['token_ids'])
tags = train_dataset.tag_vocab.map_ids_to_tokens(instance['tag_ids'])
print()
print(f'Tokens: {tokens}')
print(f'Tags:   {tags}')

{'token_ids': tensor([ 361, 1276, 2102, 2087,   90, 2267, 1276,   88, 1093,  576, 2095, 2370,
        2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370,
        2370, 2370, 2370, 2370, 2370, 2370]), 'tag_ids': tensor([11,  8, 10,  5, 10,  3,  8, 10, 10, 10,  3, 12, 12, 12, 12, 12, 12, 12,
        12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12])}

Tokens: ['@miss_soto', 'i', 'think', 'that', "'s", 'when', 'i', "'m", 'gonna', 'be', 'there']
Tags:   ['X', 'PRON', 'VERB', 'DET', 'VERB', 'ADV', 'PRON', 'VERB', 'VERB', 'VERB', 'ADV']


### `DataLoader`

At this point each data point has been converted into tensors.
Now we need a way to generate batches of data for training and evaluation.
To do this, we will wrap our `Dataset` objects in a `torch.utils.data.DataLoader` object, which will automatically batch datapoints.

In [27]:
batch_size = 3
print(f'Setting batch_size to be {batch_size}')

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size)
dev_dataloader = torch.utils.data.DataLoader(dev_dataset, batch_size)

Setting batch_size to be 3
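
As a standalone sketch of what `DataLoader` does, the toy example below batches a plain list of dictionaries (any object with `__len__` and `__getitem__` can serve as a map-style dataset). The data and the key name `ids` are made up for illustration, and `shuffle=True` is shown because it is commonly used for training data.

```python
import torch
from torch.utils.data import DataLoader

# Seven toy items, each a dictionary holding a length-2 tensor
toy_data = [{'ids': torch.tensor([i, i + 1])} for i in range(7)]

loader = DataLoader(toy_data, batch_size=3, shuffle=True)

batch_shapes = []
for batch in loader:
    # The default collate function stacks the per-item tensors along a new
    # first (batch) dimension, key by key.
    batch_shapes.append(tuple(batch['ids'].shape))

print(batch_shapes)  # [(3, 2), (3, 2), (1, 2)]: the last batch is smaller
```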


Now let's do one iteration over our training set to see what a batch looks like:

In [28]:
for batch in train_dataloader:
    print(batch, '\n')
    print(f'Size of tag_ids: {batch["tag_ids"].size()}')
    break

{'token_ids': tensor([[ 376, 1318,   90, 2089, 2217, 1053, 2268, 1276,   88, 1431, 1033, 2175,
         2259,  118,  916, 1997,  663,  228,  934,  118, 1748,  561, 2018, 1164,
         1391,  943,  118, 2370, 2370, 2370],
        [1939,  617, 2067, 2141,  162, 1394, 1022,  721, 2141, 1550, 1576,  128,
         2089,  484,  817,  944, 1001,  499, 1258,   32, 2370, 2370, 2370, 2370,
         2370, 2370, 2370, 2370, 2370, 2370],
        [ 361, 1276, 2102, 2087,   90, 2267, 1276,   88, 1093,  576, 2095, 2370,
         2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370, 2370,
         2370, 2370, 2370, 2370, 2370, 2370]]), 'tag_ids': tensor([[11,  8, 10,  5,  6,  2,  3,  8, 10, 10,  2,  7,  6,  0,  6,  6,  6, 11,
          6,  0,  3,  1,  6,  3,  1,  6,  0, 12, 12, 12],
        [ 1,  6,  6,  6,  7, 10,  7,  6,  6,  6,  6,  0,  5,  5,  6,  6, 10,  6,
         11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
        [11,  8, 10,  5, 10,  3,  8, 10, 10, 10,  3, 12, 12, 12, 12, 12

## Model

Now that we can read in the data, it is time to build our model.
We will build a very simple LSTM-based tagger! Note that this is pretty similar to the code in `simple_tagger.py` in your homework, but with a lot of things hardcoded.

Useful links:
- Embedding Layer: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
- LSTMs: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- Linear Layer: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html

In [29]:
class SimpleTagger(torch.nn.Module):
    def __init__(self, token_vocab, tag_vocab):
        super(SimpleTagger, self).__init__()
        self.token_vocab = token_vocab
        self.tag_vocab = tag_vocab
        self.num_tags = len(self.tag_vocab)
        
        # Initialize random embeddings of size 50 for each word in your token vocabulary
        self._embeddings = torch.nn.Embedding(len(token_vocab), 50)
        
        # Initialize a single-layer bidirectional LSTM encoder
        self._encoder = torch.nn.LSTM(input_size=50, hidden_size=25, num_layers=1, bidirectional=True)
        
        # Initialize a Linear layer which projects from the LSTM output size
        # (2 x 25, one hidden state per direction) to the number of tags
        self._tag_projection = torch.nn.Linear(in_features=50, out_features=len(self.tag_vocab))

        # Loss will be a Cross Entropy Loss over the tags (except the padding token)
        self.loss = torch.nn.CrossEntropyLoss(ignore_index=self.tag_vocab.pad_token_id)

    def forward(self, token_ids, tag_ids=None):
        # Create mask over all the positions where the input is padded
        mask = token_ids != self.token_vocab.pad_token_id
        
        # Embed Inputs
        embeddings = self._embeddings(token_ids).permute(1, 0, 2)
        # Feed embeddings through LSTM
        encoder_outputs = self._encoder(embeddings)[0].permute(1, 0, 2)
        # Project output of LSTM through linear layer to get logits
        tag_logits = self._tag_projection(encoder_outputs)
        # Get the maximum score for each position as the predicted tag
        pred_tag_ids = torch.max(tag_logits, dim=-1)[1]

        output_dict = {
            'pred_tag_ids': pred_tag_ids,
            'tag_logits': tag_logits,
            'tag_probs': torch.nn.functional.softmax(tag_logits, dim=-1) # convert logits to probs
        }
        # Compute loss and accuracy if gold tags are provided
        if tag_ids is not None:
            loss = self.loss(tag_logits.view(-1, self.num_tags), tag_ids.view(-1))
            output_dict['loss'] = loss
            
            correct = pred_tag_ids == tag_ids # 1's in positions where pred matches gold
            correct *= mask # zero out positions where mask is zero
            output_dict['accuracy'] = torch.sum(correct)/torch.sum(mask)

        return output_dict
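
To make the tensor shapes in `forward` easier to follow, the sketch below pushes random token IDs through the same three kinds of layers and prints the shape at each stage. The sizes are illustrative, and it uses `batch_first=True` as one alternative to the `permute` calls above; it is not the homework implementation.

```python
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size, num_tags = 3, 30, 100, 13  # illustrative sizes

embed = nn.Embedding(vocab_size, 50)
# batch_first=True keeps tensors as (batch, seq, feature), so no permute is needed
lstm = nn.LSTM(input_size=50, hidden_size=25, bidirectional=True, batch_first=True)
project = nn.Linear(2 * 25, num_tags)  # forward + backward hidden states

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
embedded = embed(token_ids)           # (3, 30, 50)
encoded, _ = lstm(embedded)           # (3, 30, 50): 25 features per direction
logits = project(encoded)             # (3, 30, 13)
predictions = logits.argmax(dim=-1)   # (3, 30): highest-scoring tag per position

print(embedded.shape, encoded.shape, logits.shape, predictions.shape)
```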

## Training

The training script essentially follows the same pattern that we used for the linear model above. However, we have also added an evaluation step and code for saving model checkpoints.

In [30]:
from tqdm import tqdm

################################
# Setup
################################
# Create model and move it to the GPU when one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleTagger(token_vocab=token_vocab, tag_vocab=tag_vocab).to(device)

# Initialize optimizer.
# Note: The learning rate is an important hyperparameter to tune
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

################################
# Training and Evaluation!
################################
num_epochs = 10
best_dev_loss = float('inf')

for epoch in range(num_epochs):
    print('\nEpoch', epoch)
    # Training loop
    model.train() # THIS PART IS VERY IMPORTANT TO SET BEFORE TRAINING
    train_loss = 0
    train_acc = 0
    for batch in train_dataloader:
        # Inputs must live on the same device as the model
        batch = {key: value.to(device) for key, value in batch.items()}
        batch_size = batch['token_ids'].size(0)
        optimizer.zero_grad()
        output_dict = model(**batch)
        loss = output_dict['loss']
        loss.backward()
        optimizer.step()

        train_loss += loss.item()*batch_size
        accuracy = output_dict['accuracy']
        train_acc += accuracy*batch_size
    train_loss /= len(train_dataset)
    train_acc /= len(train_dataset)
    print(f'Train loss {train_loss} accuracy {train_acc}')

    # Evaluation loop
    model.eval() # THIS PART IS VERY IMPORTANT TO SET BEFORE EVALUATION
    dev_loss = 0
    dev_acc = 0
    with torch.no_grad(): # gradients are not needed for evaluation
        for batch in dev_dataloader:
            batch = {key: value.to(device) for key, value in batch.items()}
            batch_size = batch['token_ids'].size(0)
            output_dict = model(**batch)
            dev_loss += output_dict['loss'].item()*batch_size
            dev_acc += output_dict['accuracy']*batch_size
    dev_loss /= len(dev_dataset)
    dev_acc /= len(dev_dataset)
    print(f'Dev loss {dev_loss} accuracy {dev_acc}')
    
    # Save best model
    if dev_loss < best_dev_loss:
        print('Best so far')
        torch.save(model, 'model.pt')
        best_dev_loss = dev_loss


Epoch 0
Train loss 2.2184017755110848 accuracy 0.2823981046676636
Dev loss 1.8892717074070657 accuracy 0.40769562125205994
Best so far

Epoch 1
Train loss 1.6247831381719786 accuracy 0.5223523378372192
Dev loss 1.407148138220821 accuracy 0.5866950750350952
Best so far

Epoch 2
Train loss 1.2275152156095077 accuracy 0.6613854169845581
Dev loss 1.155178641634328 accuracy 0.6573066711425781
Best so far

Epoch 3
Train loss 0.9917180402926845 accuracy 0.7219552397727966
Dev loss 1.0195192662732941 accuracy 0.6941012740135193
Best so far

Epoch 4
Train loss 0.8328757446485333 accuracy 0.7617589235305786
Dev loss 0.9338890106550285 accuracy 0.7289190292358398
Best so far

Epoch 5
Train loss 0.7124635504858475 accuracy 0.7966110706329346
Dev loss 0.8751283083111048 accuracy 0.7454296350479126
Best so far

Epoch 6
Train loss 0.6149178453202613 accuracy 0.8256787657737732
Dev loss 0.8340495005249977 accuracy 0.7545357942581177
Best so far

Epoch 7
Train loss 0.5332277208016227 accuracy 0.851418

## Loading Trained Models

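Loading a saved model is straightforward. Because `torch.save(model, ...)` pickles the whole model object, the `SimpleTagger` class must be defined before loading, and on PyTorch 2.6 or later you need to pass `weights_only=False` to `torch.load`. To learn more about saving/loading models see https://pytorch.org/tutorials/beginner/saving_loading_models.html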

In [31]:
model = torch.load('model.pt')
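
The tutorial linked above also describes saving only the model's `state_dict` (its parameters) rather than the whole object. A minimal sketch of that pattern, using a small `nn.Linear` as a stand-in for a trained model and a temporary file with a made-up name:

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(4, 2)  # stand-in for a trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'weights.pt')  # hypothetical file name
    torch.save(net.state_dict(), path)          # save only the parameters

    restored = nn.Linear(4, 2)                  # rebuild the architecture first
    restored.load_state_dict(torch.load(path))  # then copy the weights in

restored.eval()  # switch to evaluation mode before inference
print(torch.equal(net.weight, restored.weight))  # True
```

With this approach the file contains only tensors, so loading it requires constructing the model in code first.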

## Feed in your own sentences!

In [32]:
sentence = 'i want to eat a pizza .'.lower().split()

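# convert sentence to tensor dictionary
tensor_dict = train_dataset.tensorize(sentence)

# unsqueeze first dimension so batch size is 1
tensor_dict['token_ids'] = tensor_dict['token_ids'].unsqueeze(0)
print(tensor_dict)

# feed through model, with the inputs on the same device as the model
model_device = next(model.parameters()).device
tensor_dict = {key: value.to(model_device) for key, value in tensor_dict.items()}
output_dict = model(**tensor_dict)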

# get predicted tag IDs
pred_tag_ids = output_dict['pred_tag_ids'].squeeze().tolist()
print(pred_tag_ids)

# convert tag IDs to tag names
print(model.tag_vocab.map_ids_to_tokens(pred_tag_ids))

{'token_ids': tensor([[1276, 2236, 2128,  901,  449, 2371,  118]])}
[8, 10, 9, 10, 5, 6, 0]
['PRON', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', '.']


## Conclusion

You've now seen at a high level how to create neural networks for NLP.
You've also now seen the components that go around a model (e.g. training loops, data processing).
Setting up these components in a flexible way can be tricky for NLP, as there are many issues that you have to take care of, like padding, different vocabularies, etc.
For example, how would you build upon this code to load in pre-trained embeddings, or use character embeddings?
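
As one possible starting point for the pre-trained embeddings question, `nn.Embedding.from_pretrained` builds an embedding layer from an existing weight matrix. In the sketch below the matrix is random and merely stands in for vectors read from a pre-trained embedding file; the sizes and the padding ID are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, dim, pad_id = 6, 4, 4  # illustrative sizes and padding ID

# Stand-in for vectors loaded from a pre-trained embedding file (random here)
pretrained = torch.randn(vocab_size, dim)

# freeze=False lets the vectors be fine-tuned during training
embeddings = nn.Embedding.from_pretrained(pretrained, freeze=False, padding_idx=pad_id)

token_ids = torch.tensor([[0, 1, 2, 3, 4, 4]])
print(embeddings(token_ids).shape)  # torch.Size([1, 6, 4])
```

In practice each row of the matrix would have to line up with the token ID assigned by your `Vocabulary`.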

That's why there exist many libraries that take care of these boilerplate components so that you can focus on modeling.
One of these libraries is [AllenNLP](https://allennlp.org/), and if you have time, I encourage you to take a look at it.
It builds upon PyTorch so everything you've learned here is applicable.