In [1]:
import numpy as np
import torch

# CASE STUDY: POS Tagging!

Now let's dive into an example that is more relevant to NLP and to your HW3: part-of-speech tagging! We will build up the code step by step until you can process the POS data into tensors and then train a simple model on it.
The code we are building up to forms the basis of the code in the homework assignment.

To start, we'll need some data to train and evaluate on. First download the train and dev POS data `twitter_train.pos` and `twitter_dev.pos` into the same directory as this notebook.

We will now be introducing three new components which are vital to training (NLP) models:
1. a `Vocabulary` object which converts from tokens/labels to integers. This part should also be able to handle padding so that batches can be easily created.
2. a `Dataset` object which takes in the data file and produces data tensors
3. a `DataLoader` object which takes data tensors from `Dataset` and batches them

### `Vocabulary`

Next, we need to get our data into Python and into a form that is usable by PyTorch. For text data, this typically entails building a `Vocabulary` of all of the words, then mapping each word to an integer corresponding to its place in the sorted vocabulary. This can be done as follows:

In [2]:
class Vocabulary():
    """ Object holding vocabulary and mappings
    Args:
        word_list: ``list`` A list of words. Words assumed to be unique.
        add_unk_token: ``bool`` Whether to create a token for unknown tokens.
    """
    def __init__(self, word_list, add_unk_token=False):
        # create special tokens for padding and unknown words
        self.pad_token = '<pad>'
        self.unk_token = '<unk>' if add_unk_token else None

        self.special_tokens = [self.pad_token]
        if self.unk_token:
            self.special_tokens += [self.unk_token]

        self.word_list = word_list
        
        # maps from the token ID to the token
        self.id_to_token = self.word_list + self.special_tokens
        # maps from the token to its token ID
        self.token_to_id = {token: id for id, token in
                            enumerate(self.id_to_token)}
        
    def __len__(self):
        """ Returns size of vocabulary """
        return len(self.token_to_id)
    
    @property
    def pad_token_id(self):
        return self.map_token_to_id(self.pad_token)
        
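    def map_token_to_id(self, token: str):
        """ Maps a single token to its token ID (unseen tokens map to the unknown token) """
        if token not in self.token_to_id:
            if self.unk_token is None:
                # without an unknown token there is nothing to fall back on
                raise KeyError(f'Unknown token {token!r} and no unk token is set')
            token = self.unk_token
        return self.token_to_id[token]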

    def map_id_to_token(self, id: int):
        """ Maps a single token ID to its token """
        return self.id_to_token[id]

    def map_tokens_to_ids(self, tokens: list, max_length: int = None):
        """ Maps a list of tokens to a list of token IDs """
        # truncate extra tokens and pad to `max_length`
        if max_length:
            tokens = tokens[:max_length]
            tokens = tokens + [self.pad_token]*(max_length-len(tokens))
        return [self.map_token_to_id(token) for token in tokens]

    def map_ids_to_tokens(self, ids: list, filter_padding=True):
        """ Maps a list of token IDs to a list of tokens """
        tokens = [self.map_id_to_token(id) for id in ids]
        if filter_padding:
            tokens = [t for t in tokens if t != self.pad_token]
        return tokens
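
To make the mapping concrete, here is a minimal standalone sketch of the same idea using plain dictionaries. The helper names `build_vocab` and `encode` are hypothetical and are not part of the `Vocabulary` class above; they only illustrate how unseen words fall back to `<unk>` and how short sequences are padded.

```python
def build_vocab(words, pad_token='<pad>', unk_token='<unk>'):
    """Map each unique word to an integer ID; special tokens are placed last."""
    id_to_token = sorted(set(words)) + [pad_token, unk_token]
    token_to_id = {token: idx for idx, token in enumerate(id_to_token)}
    return token_to_id, id_to_token


def encode(tokens, token_to_id, max_length, pad_token='<pad>', unk_token='<unk>'):
    """Truncate or pad `tokens` to `max_length`, then map every token to its ID."""
    tokens = tokens[:max_length]
    tokens = tokens + [pad_token] * (max_length - len(tokens))
    return [token_to_id.get(t, token_to_id[unk_token]) for t in tokens]


token_to_id, id_to_token = build_vocab(['the', 'cat', 'sat', 'the'])
# 'dog' was never seen, so it maps to <unk>; two <pad> IDs fill up to length 5
ids = encode(['the', 'dog', 'sat'], token_to_id, max_length=5)
print(ids)  # [2, 4, 1, 3, 3]
```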

### `Dataset`

Next, we need a way to efficiently read in the data file and to process it into tensors. PyTorch provides an easy way to do this using the `torch.utils.data.Dataset` class. We will be creating our own class which inherits from this class. 

Helpful link: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

A custom `Dataset` class must implement three functions: 

- `__init__`: The init function is run once when instantiating the `Dataset` object.
- `__len__`: The len function returns the number of data points in our dataset.
- `__getitem__`: The getitem function returns a sample from the dataset given the index of the sample. The output of this function should be a dictionary of (mostly) PyTorch tensors.

In [3]:
class WikiDataset(torch.utils.data.Dataset):
    def __init__(self, data_path):
        self._dataset = []
    
        # read the dataset file, extracting the tokens and the label of each row
        with open(data_path, 'r', encoding='utf-8') as f:
            for i, line in enumerate(f):
                if i == 0:
                    continue  # skip the header row
                # NOTE: a plain split(',') assumes that no field contains a comma
                elements = line.strip().split(',')
                # the sentence is in the fourth column; strip periods and parentheses
                tokens = []
                for word in elements[3].split(' '):
                    clean_word = word.replace(".", "").replace("(", "").replace(")", "")
                    tokens.append(clean_word)
                # the label is in the second-to-last column
                self._dataset.append({'tokens': tokens, 'label': [elements[-2]]})
        
        # initialize an empty vocabulary
        self.token_vocab = None
        self.label_vocab = None

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, item: int):
        # get the sample corresponding to the index
        instance = self._dataset[item]
        
        # check the vocabulary has been set
        assert self.token_vocab is not None
        assert self.label_vocab is not None
        
        # Convert inputs to tensors, then return
        return self.tensorize(instance['tokens'], instance['label'])
    
    def tensorize(self, tokens, cls=None, max_length=None):
        # map the tokens and tags into their ID form
        token_ids = self.token_vocab.map_tokens_to_ids(tokens, max_length)
        tensor_dict = {'token_ids': torch.LongTensor(token_ids)}
        if cls:
            label_map = self.label_vocab.map_tokens_to_ids(cls)
            tensor_dict['label'] = torch.LongTensor(label_map)
        return tensor_dict
        
    def get_tokens_list(self):
        """ Returns sorted list of the unique tokens in dataset """
        tokens = [token for d in self._dataset for token in d['tokens']]
        return sorted(set(tokens))

    def get_classes_list(self):
        """ Returns sorted list of the unique labels in dataset """
        clss = [c for d in self._dataset for c in d['label']]
        return sorted(set(clss))

    def set_vocab(self, token_vocab: Vocabulary, label_vocab: Vocabulary):
        self.token_vocab = token_vocab
        self.label_vocab = label_vocab
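
The reader above splits each line on commas, which breaks as soon as a sentence itself contains a comma. The following is a hedged sketch of a more robust alternative built on Python's standard `csv` module. The column positions (text in column 3, label in the second-to-last column) are taken from the class above; the helper name `read_rows` and the sample header are made up for illustration, and the sketch skips the punctuation stripping for brevity.

```python
import csv
import io


def read_rows(file_obj, text_col=3, label_col=-2):
    """Yield (tokens, label) pairs from a CSV file object, skipping the header."""
    reader = csv.reader(file_obj)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) <= text_col:
            continue  # ignore blank or malformed rows
        yield row[text_col].split(), row[label_col]


# Hypothetical one-row file; the quoted sentence contains a comma
sample = io.StringIO(
    'id,subject,object,sentence,relation,score\n'
    '1,A,B,"It is a school, in Montgomery",P131,0.9\n'
)
rows = list(read_rows(sample))
print(rows)
```

With `line.split(',')` the quoted sentence would be cut in two and the label column would shift, whereas `csv.reader` keeps the quoted field together.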

Now let's create `Dataset` objects for our training and validation sets!
A key step here is creating the `Vocabulary` for these datasets.
We will use the list of words in the training set to initialize a `Vocabulary` object over the input words. 
We will also use the list of tags to initialize a `Vocabulary` over the tags.

In [4]:
train_dataset = WikiDataset('basketball_links_results_training_dedup.csv')
# NOTE: the dev set is loaded from the same file as the training set, so the
# "dev" numbers reported below are not measured on held-out data
dev_dataset = WikiDataset('basketball_links_results_training_dedup.csv')

# Get list of tokens and tags seen in training set and use to create Vocabulary
token_list = train_dataset.get_tokens_list()
class_list = train_dataset.get_classes_list()

token_vocab = Vocabulary(token_list, add_unk_token=True)
label_vocab = Vocabulary(class_list, add_unk_token=True)

# Update the train/dev set with vocabulary. Notice we created the vocabulary using the training set
train_dataset.set_vocab(token_vocab, label_vocab)
dev_dataset.set_vocab(token_vocab, label_vocab)

In [5]:
print(f'Size of training set: {len(train_dataset)}')
print(f'Size of validation set: {len(dev_dataset)}')

Size of training set: 2683
Size of validation set: 2683


Let's print out one data point of the tensorized data and see what it looks like:

In [6]:
instance = train_dataset[0]
print(instance)

tokens = train_dataset.token_vocab.map_ids_to_tokens(instance['token_ids'])
cls = train_dataset.label_vocab.map_ids_to_tokens(instance['label'])
print()
print(f'Tokens: {tokens}')
print(f'Class:   {cls}')

{'token_ids': tensor([2294, 5427, 6330, 6171, 5305, 6062, 5717, 4757, 6330, 2957,  849, 3743,
        1599]), 'label': tensor([61])}

Tokens: ['It', 'is', 'the', 'sole', 'high', 'school', 'operated', 'by', 'the', 'Montgomery', 'Area', 'School', 'District']
Class:   ['P5353']


### `DataLoader`

At this point our data is in a tensor, and we can create context windows using only PyTorch operations.
Now we need a way to generate batches of data for training and evaluation.
To do this, we will wrap our `Dataset` objects in a `torch.utils.data.DataLoader` object, which will automatically batch datapoints.

In [7]:
batch_size = 1
print(f'Setting batch_size to be {batch_size}')

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size)
dev_dataloader = torch.utils.data.DataLoader(dev_dataset, batch_size)

Setting batch_size to be 1
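
With `batch_size = 1` every batch holds a single sequence, so no padding is needed. For larger batches the sequences must be padded to a common length before they can be stacked. One possible way to do that, sketched here under assumptions, is a custom `collate_fn` built on `torch.nn.utils.rnn.pad_sequence`. The helper name `make_collate_fn`, the toy data, and the pad ID `99` are hypothetical; in the notebook the pad ID would come from `token_vocab.pad_token_id`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def make_collate_fn(pad_token_id):
    """Build a collate function that pads 'token_ids' to the longest sequence in the batch."""
    def collate(batch):
        token_ids = pad_sequence([item['token_ids'] for item in batch],
                                 batch_first=True, padding_value=pad_token_id)
        labels = torch.stack([item['label'] for item in batch])
        return {'token_ids': token_ids, 'label': labels}
    return collate


# Toy stand-in for a Dataset: two sequences of different lengths
toy = [
    {'token_ids': torch.LongTensor([5, 6, 7]), 'label': torch.LongTensor([1])},
    {'token_ids': torch.LongTensor([8]), 'label': torch.LongTensor([0])},
]
loader = torch.utils.data.DataLoader(toy, batch_size=2,
                                     collate_fn=make_collate_fn(pad_token_id=99))
batch = next(iter(loader))
print(batch['token_ids'])  # the second row is padded with 99
```

Note that a model which reads the encoding at the last position would then see padding for the shorter sequences, so it would also need a mask or the true sequence lengths.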


Now let's do one iteration over our training set to see what a batch looks like:

In [8]:
for i,batch in enumerate(train_dataloader):
    print(batch, '\n')
    print(f'Size of classes: {batch["label"].size()}')
    if(i==2):
        break

{'token_ids': tensor([[2294, 5427, 6330, 6171, 5305, 6062, 5717, 4757, 6330, 2957,  849, 3743,
         1599]]), 'label': tensor([[61]])} 

Size of classes: torch.Size([1, 1])
{'token_ids': tensor([[2584, 1077, 5427, 6330, 5711, 5945, 6372, 5288, 4595, 4699, 5630, 2737,
         4642, 4532, 5811]]), 'label': tensor([[12]])} 

Size of classes: torch.Size([1, 1])
{'token_ids': tensor([[229]]), 'label': tensor([[21]])} 

Size of classes: torch.Size([1, 1])


## Model

Now that we can read in the data, it is time to build our model.
We will build a very simple LSTM-based tagger! Note that this is pretty similar to the code in `simple_tagger.py` in your homework, but with a lot of things hardcoded.

Useful links:
- Embedding Layer: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
- LSTMs: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- Linear Layer: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html?highlight=linear#torch.nn.Linear
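
Before reading the model, it may help to see what `torch.nn.LSTM` returns, since the forward pass indexes its result with `[0]`. The sketch below uses the same sizes as the model (embedding size 50, hidden size 25, bidirectional) on random input; the variable names are illustrative only. It also shows one alternative sentence representation, assumed here rather than taken from the notebook: concatenating the final hidden states of the two directions.

```python
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=50, hidden_size=25, num_layers=1, bidirectional=True)

x = torch.randn(7, 1, 50)  # (seq_len, batch, input_size)
# The LSTM returns a tuple: per-position outputs, then the final (hidden, cell) states
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([7, 1, 50]): forward and backward halves concatenated
print(h_n.shape)     # torch.Size([2, 1, 25]): one final state per direction

# The forward direction finishes at the last position, the backward one at the first
forward_final = output[-1, :, :25]
backward_final = output[0, :, 25:]

# Alternative sentence representation: both final states side by side
sentence_repr = torch.cat([h_n[0], h_n[1]], dim=-1)  # shape (1, 50)
```

Taking `output[-1]` alone, as the model does, gives the full forward summary but a backward half that has only read the last token.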

In [9]:
class SimpleTagger(torch.nn.Module):
    def __init__(self, token_vocab, label_vocab):
        super(SimpleTagger, self).__init__()
        self.token_vocab = token_vocab
        self.label_vocab = label_vocab
        # the label vocabulary also holds 2 special tokens (<pad>, <unk>) that are not classes
        self.num_classes = len(label_vocab) - 2
        self.input_size = 50
        self.hidden_size = 25
        self.num_layers = 1
        self.embedding_dim = 50
        
        # Initialize random embeddings of size 50 for each word in your token vocabulary
        self._embeddings = torch.nn.Embedding(len(token_vocab), self.embedding_dim)
        
        # Initialize a single-layer bidirectional LSTM encoder
        self._encoder = torch.nn.LSTM(input_size=self.embedding_dim, hidden_size=self.hidden_size,
                                      num_layers=self.num_layers, bidirectional=True)
        
        # Initialize two Linear layers which project from the bidirectional hidden state
        # (2 * hidden_size) down to the number of classes
        self._fc1 = torch.nn.Linear(in_features=2*self.hidden_size, out_features=self.hidden_size)
        self._fc2 = torch.nn.Linear(self.hidden_size, self.num_classes)

        # Loss is the mean squared error between the predicted class probabilities
        # and the one-hot gold labels
        self.loss = torch.nn.MSELoss()

    def forward(self, token_ids, label=None):
        # Run on whichever device the model's parameters live on
        device = self._embeddings.weight.device
        token_ids = token_ids.to(device)

        # Embed inputs: (batch, seq_len) -> (batch, seq_len, embedding_dim), then
        # permute to (seq_len, batch, embedding_dim), the layout the LSTM expects
        embeddings = self._embeddings(token_ids).permute(1, 0, 2)

        # The LSTM returns (outputs, (h_n, c_n)); [0] selects the per-position outputs.
        # Permute back to (batch, seq_len, 2*hidden_size)
        encoder_outputs = self._encoder(embeddings)[0].permute(1, 0, 2)
        # Use the encoding at the last position as the sentence representation
        encoder_outputs = encoder_outputs[:, -1, :]

        # Project the LSTM output through the linear layers to get one score per class
        fc1_outputs = self._fc1(encoder_outputs)
        fc2_outputs = self._fc2(fc1_outputs)
        # Convert the scores to class probabilities
        final_outputs = torch.nn.functional.softmax(fc2_outputs, dim=-1)

        output_dict = {'predicted_label': final_outputs}

        # Compute loss and accuracy if gold labels are provided
        if label is not None:
            label = label.to(device)
            target_labels = torch.Tensor(self.oneHotEncode(label)).to(device)
            output_dict['loss'] = self.loss(final_outputs, target_labels)

            # A prediction is correct when the most probable class is the gold class
            predictions = final_outputs.argmax(dim=-1)
            correct = (predictions == target_labels.argmax(dim=-1)).float()
            output_dict['accuracy'] = correct.mean()

        return output_dict
    
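    def oneHotEncode(self, label):
        """ Converts labels of shape (batch, 1) into a (batch, num_classes) one-hot list """
        oneHot = np.zeros((label.shape[0], self.num_classes))
        for i, l in enumerate(label):
            # class IDs from the label vocabulary are 0-based, so index directly
            oneHot[i][int(l[0])] = 1
        return oneHot.tolist()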
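
The model above applies a softmax and then a mean squared error loss against one-hot targets. A common alternative for classification is cross-entropy computed directly on the raw scores. The sketch below is an illustration with made-up numbers, not the notebook's method: `torch.nn.CrossEntropyLoss` takes unnormalized scores (logits) and integer class indices, so neither the softmax nor the one-hot encoding is needed.

```python
import torch

# Two examples, three classes: raw scores straight from the last linear layer
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])  # gold class index for each example

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)

# The same quantity by hand: mean negative log-probability of the gold class
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()
print(loss.item(), manual.item())
```

If the model were switched to this loss, `forward` would pass `fc2_outputs` and `label.squeeze(-1)` to it instead of the softmax probabilities and the one-hot targets.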

## Training

The training script essentially follows the same pattern that we used for the linear model above. However, we have also added an evaluation step and code for saving model checkpoints.

In [10]:
from tqdm import tqdm

################################
# Setup
################################
# Create model
model = SimpleTagger(token_vocab=token_vocab, label_vocab=label_vocab)
if torch.cuda.is_available():
    model = model.cuda()

# Initialize optimizer.
# Note: The learning rate is an important hyperparameter to tune
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

################################
# Training and Evaluation!
################################
num_epochs = 10
best_dev_loss = float('inf')

for epoch in range(num_epochs):
    print('\nEpoch', epoch)
    # Training loop
    model.train() # THIS PART IS VERY IMPORTANT TO SET BEFORE TRAINING
    train_loss = 0
    train_acc = 0
    for batch in train_dataloader:
        batch_size = batch['token_ids'].size(0)
        optimizer.zero_grad()
        output_dict = model(**batch)
        loss = output_dict['loss']
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()*batch_size
        accuracy = output_dict['accuracy']
        train_acc += accuracy*batch_size
    train_loss /= len(train_dataset)
    train_acc /= len(train_dataset)
    print(f'Train loss {train_loss} accuracy {train_acc}')
    
    # Evaluation loop
    model.eval() # THIS PART IS VERY IMPORTANT TO SET BEFORE EVALUATION
    dev_loss = 0
    dev_acc = 0
    with torch.no_grad(): # gradients are not needed during evaluation
        for batch in dev_dataloader:
            batch_size = batch['token_ids'].size(0)
            output_dict = model(**batch)
            dev_loss += output_dict['loss'].item()*batch_size
            dev_acc += output_dict['accuracy']*batch_size
    dev_loss /= len(dev_dataset)
    dev_acc /= len(dev_dataset)
    print(f'Dev loss {dev_loss} accuracy {dev_acc}')
    
    # Save best model
    if dev_loss < best_dev_loss:
        print('Best so far')
        torch.save(model, 'model.pt')
        best_dev_loss = dev_loss


Epoch 0
Train loss 0.009017906530820468 accuracy 0.44353336095809937
Dev loss 0.007407454932819451 accuracy 0.5393216609954834
Best so far

Epoch 1
Train loss 0.007190224206696131 accuracy 0.5672754645347595
Dev loss 0.0062656072959730415 accuracy 0.6306373476982117
Best so far

Epoch 2
Train loss 0.006187825436581479 accuracy 0.6407006978988647
Dev loss 0.005438254338092773 accuracy 0.6865448951721191
Best so far

Epoch 3
Train loss 0.005405245733449477 accuracy 0.6861721873283386
Dev loss 0.004799660144017199 accuracy 0.7204621434211731
Best so far

Epoch 4
Train loss 0.004775789404262164 accuracy 0.7144986987113953
Dev loss 0.004329323236068431 accuracy 0.752888560295105
Best so far

Epoch 5
Train loss 0.004275633770945493 accuracy 0.7517703771591187
Dev loss 0.003977741912409904 accuracy 0.7707789540290833
Best so far

Epoch 6
Train loss 0.004106253581638715 accuracy 0.7562429904937744
Dev loss 0.0037885960976154604 accuracy 0.7815877795219421
Best so far

Epoch 7
Train loss 0.003

In [13]:
print('classes:', class_list)
print(model.num_classes)
# torch.device() requires an argument; pick the GPU when one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

classes: ['P1056', 'P106', 'P108', 'P115', 'P118', 'P123', 'P1268', 'P1269', 'P127', 'P131', 'P1313', 'P1344', 'P1346', 'P1365', 'P1366', 'P137', 'P1376', 'P138', 'P1416', 'P150', 'P155', 'P156', 'P159', 'P166', 'P17', 'P172', 'P178', 'P179', 'P1830', 'P1889', 'P19', 'P190', 'P20', 'P22', 'P2354', 'P2499', 'P2500', 'P2596', 'P27', 'P276', 'P279', 'P286', 'P30', 'P31', 'P3373', 'P3450', 'P36', 'P360', 'P361', 'P3842', 'P39', 'P40', 'P413', 'P449', 'P463', 'P466', 'P47', 'P495', 'P5125', 'P527', 'P530', 'P5353', 'P54', 'P551', 'P6087', 'P641', 'P647', 'P664', 'P69', 'P740', 'P7888', 'P793', 'P840', 'P937']
74


## Loading Trained Models

Loading a pretrained model can be done easily. To learn more about saving/loading models see https://pytorch.org/tutorials/beginner/saving_loading_models.html

In [42]:
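# NOTE: the checkpoint holds the whole pickled model, so recent PyTorch releases,
# where `weights_only` defaults to True, need `weights_only=False` to load it
model = torch.load('model.pt', weights_only=False)
if torch.cuda.is_available():
    model = model.cuda()
model.eval()  # switch to evaluation mode before making predictions

def oneHotToLabel(oneHot: list):
    """ Returns the index of the largest entry, i.e. the predicted class index """
    return np.argmax(np.array(oneHot))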
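
The saving/loading tutorial linked above also describes a second approach: saving only the parameters (the `state_dict`) and loading them into a freshly constructed model. The sketch below illustrates this with a small stand-in `torch.nn.Linear` model and a temporary file; both are placeholders, and for this notebook the fresh model would be a `SimpleTagger` built with the same vocabularies.

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 2)  # stand-in for the trained model
path = os.path.join(tempfile.mkdtemp(), 'weights.pt')

# Save only the parameters, not the pickled Python object
torch.save(model.state_dict(), path)

# Rebuild the architecture, then load the saved parameters into it
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to evaluation mode before inference
```

A checkpoint saved this way does not depend on the class definition being importable under the same name, at the cost of having to reconstruct the model yourself.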

## Feed in your own sentences!

In [43]:
sentence = 'i want to eat a pizza .'.lower().split()

# convert sentence to tensor dictionary
tensor_dict = train_dataset.tensorize(sentence)

# unsqueeze first dimension so batch size is 1
tensor_dict['token_ids'] = tensor_dict['token_ids'].unsqueeze(0)
tensor_dict['label'] = None
print(tensor_dict)

# feed through model
output_dict = model(**tensor_dict)

# get the predicted probability for each class
pred_label = output_dict['predicted_label'].squeeze().tolist()
print(len(pred_label))

# print the index of the most probable class
print(oneHotToLabel(pred_label))

{'token_ids': tensor([[3149, 4439, 4151, 4439, 2248, 4439, 4439]], device='cpu'), 'label': None}
torch.Size([1, 7])
torch.Size([7, 1, 50])
torch.Size([1, 50])
torch.Size([1, 25])
torch.Size([1, 6379])
6379
936


## Conclusion

You've now seen at a high level how to create neural networks for NLP.
You've also now seen the components that go around a model (e.g. training loops, data processing).
Setting up these components in a flexible way can be tricky for NLP, as there are many issues that you have to take care of, like padding, different vocabularies, etc.
For example, how would you build upon this code to load in pre-trained embeddings, or use character embeddings?

That's why there exist many libraries that take care of these boilerplate components so that you can focus on modeling.
One of these libraries is [allennlp](https://allennlp.org/), and if you have time, I encourage you to take a look at it. 
It builds upon PyTorch so everything you've learned here is applicable.