<a href="https://colab.research.google.com/github/yxyyxy93/UGENT_NLP_lab3_-wordembedding/blob/master/NLP_lab3_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab session 3: Word embedding

This lab covers word embedding as seen in the theory lectures (DL lecture 5).

General instructions:
- Complete the code where needed
- Provide answers to questions only in the cell where indicated
- **Do not alter the evaluation cells** (`## evaluation`) in any way as they are needed for the partly automated evaluation process

## **Embedding; the Steroids for NLP!**

Pre-trained embeddings have brought NLP a long way. Most recent methods include word embeddings in their pipeline to obtain state-of-the-art performance. `Word2vec` is among the most famous methods to efficiently create word embeddings and has been around since 2013. `Word2vec` has two different model architectures, namely `Skip-gram` and `CBOW`. `Skip-gram` was explained in more detail in the theory lecture, and today we will play with `CBOW`. We will train our own little embeddings and use them to visualize text corpora. In the last part, we will download and use other pretrained embeddings to build a Part-of-Speech (PoS) tagging model.

<img src="http://3g1o5q2sqh3w32ohtj4dwggw.wpengine.netdna-cdn.com/wp-content/uploads/2012/08/steroids-before-and-after-480x321.jpg" alt="img" width="512px"/>



In [1]:
# import necessary packages
import random
import math
import numpy as np

from random import shuffle
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [0]:
# for reproducibility

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## 1. Data preparation

As always, let's first prepare the data. We shall use the `text8` dataset, which offers cleaned English Wikipedia text. The data is clean UTF-8 and all characters are lower-cased with valid encodings.

In [3]:
!wget "http://mattmahoney.net/dc/text8.zip" -O text8.zip
!unzip -o text8.zip
!rm text8.zip
!head -c 1b text8 # print first bytes of text8 data

--2020-04-28 06:39:37--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.75
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.75|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2020-04-28 06:40:21 (703 KB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  text8.zip
  inflating: text8                   
 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the b

In [0]:
# read text8
with open('text8', 'r') as input_file:
    text = input_file.read()

### Tokenization
We first chop our text into pieces using NLTK's `WordPunctTokenizer`:

In [5]:
from nltk.tokenize import WordPunctTokenizer

tknzr = WordPunctTokenizer()
tokenized_text = tknzr.tokenize(text)

print(tokenized_text[0:20])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']


### Build dictionary
In this step, we convert each word to a unique id. We can define vocabulary trimming rules, which specify whether certain words should remain in the vocabulary, be trimmed away, or be handled differently. In the following, we limit our vocabulary size to `vocab_size` words and replace the remaining tokens with `UNK`:

In [0]:
def get_data(text, vocab_size = None):
    
    word_counts = Counter(text)
    
    sorted_token = sorted(word_counts, key=word_counts.get, reverse=True) # sort by frequency
    
    if vocab_size: # keep most frequent words
        sorted_token = sorted_token[:vocab_size-1] 
    
    sorted_token.insert(0, 'UNK') # reserve 0 for UNK
    
    id_to_token = {k: w for k, w in enumerate(sorted_token)}
    token_to_id = {w: k for k, w in id_to_token.items()}
    
    # tokenize words in vocab and replace rest with UNK
    tokenized_ids = [token_to_id[w] if w in token_to_id else 0 for w in text]

    return tokenized_ids, id_to_token, token_to_id

In [7]:
tokenized_ids, id_to_token, token_to_id = get_data(tokenized_text)
print('-' * 50)
print('Number of unique tokens: {}'.format(len(id_to_token)))
print('-' * 50)
print("tokenized text: {}".format(tokenized_text[0:20]))
print('-' * 50)
print("tokenized ids: {}".format(tokenized_ids[0:20]))

--------------------------------------------------
Number of unique tokens: 253855
--------------------------------------------------
tokenized text: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']
--------------------------------------------------
tokenized ids: [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156, 128, 742, 477, 10572, 134, 1, 27350, 2, 1, 103]
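
To see the effect of `vocab_size` without processing the whole corpus, here is a minimal sketch on a toy token list. `build_vocab` is a hypothetical helper that mirrors the logic of `get_data` above; it is not part of the lab code, and the toy sentence is made up for illustration.

```python
from collections import Counter

def build_vocab(tokens, vocab_size=None):
    """Hypothetical helper mirroring get_data: id 0 is reserved for UNK."""
    counts = Counter(tokens)
    # most_common sorts by descending frequency; ties keep first-seen order
    ordered = [tok for tok, _ in counts.most_common()]
    if vocab_size:
        ordered = ordered[:vocab_size - 1]  # leave one slot for UNK
    ordered.insert(0, 'UNK')
    token_to_id = {tok: idx for idx, tok in enumerate(ordered)}
    # every token outside the vocabulary is mapped to id 0 (UNK)
    ids = [token_to_id.get(tok, 0) for tok in tokens]
    return ids, token_to_id

toy_tokens = 'the cat sat on the mat the cat slept'.split()
toy_ids, toy_vocab = build_vocab(toy_tokens, vocab_size=3)
print(toy_vocab)  # {'UNK': 0, 'the': 1, 'cat': 2}
print(toy_ids)    # [1, 2, 0, 0, 1, 0, 1, 2, 0]
```

With `vocab_size=3`, only the two most frequent words keep their own id; all other tokens collapse onto `UNK`.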


### Generate samples
 
The `CBOW` model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words). The training data thus comprises pairs of `(context_window, target_word)`, for which the model should predict the `target_word` based on the `context_window` words.

Considering a simple sentence, __the quick brown fox jumps over the lazy dog__, with a `context_window` of size 1, we have examples like __([quick, fox], brown)__, __([the, brown], quick)__, __([the, dog], lazy)__ and so on. 

<img src="https://cdn-images-1.medium.com/max/800/1*UVe8b6CWYykcxbBOR6uCfg.png" alt="img" width="400px"/>
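
The fox example can be reproduced at the word level with a few lines of Python. This is only a sketch: `cbow_pairs` is a hypothetical helper, and unlike the padded generator required in this lab it simply skips positions that lack a full window.

```python
def cbow_pairs(words, window_size=1):
    """Yield (context, target) pairs for positions that have a full window."""
    for i in range(window_size, len(words) - window_size):
        # window_size words before the target, then window_size words after it
        context = words[i - window_size:i] + words[i + 1:i + 1 + window_size]
        yield context, words[i]

sentence = 'the quick brown fox jumps over the lazy dog'.split()
pairs = list(cbow_pairs(sentence, window_size=1))
print(pairs[0])   # (['the', 'brown'], 'quick')
print(pairs[1])   # (['quick', 'fox'], 'brown')
print(pairs[-1])  # (['the', 'dog'], 'lazy')
```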



Now let us convert our tokenized text from `tokenized_ids` into `(context_window, target_word)` pairs.

You should loop over the `tokenized_ids` and build a __generator__ which yields a target word of length 1 and surrounding context of length (2 $\times$ `window_size`) where we take `window_size` words before and after the target word in our corpus. Remember to pad context words with zeroes to a fixed length if needed.

In [0]:
def generate_sample(tknzd_ids, window_size = 5):
    for index, target in enumerate(tknzd_ids):
    ############### for student ################ 
        if index - window_size < 0:
            context_window = [0] * (window_size - index) + tknzd_ids[0: index]
        else:
            context_window = tknzd_ids[index - window_size: index]
        if index + window_size + 1 >= len(tknzd_ids):
            context_window += tknzd_ids[index + 1:] + [0] * (index + window_size + 1 - len(tknzd_ids))
        else:
            context_window += tknzd_ids[index + 1: index + window_size + 1]

        print(context_window)  # debug output; remove before training on the full corpus
    ############################################
        yield context_window, target

In [49]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_gen = generate_sample([11, 12, 13, 14, 15], 2)
dummy_example = list(dummy_gen)

assert isinstance(dummy_example[0], tuple), "Is it a pair?" 
assert len(dummy_example[0][0]) == 4, "Context length should be 2 * window_size"
assert dummy_example[0][1] == 11, "Did you return the correct target word?"
assert dummy_example[0][0][0] == dummy_example[0][0][1]==0, "Did you add 0 pads where needed?"
assert len(dummy_example[0]) == len(dummy_example[-1]), "Length of all instances should be the same due to the padding"
assert dummy_example[0][0] == [0, 0, 12, 13], "Did you consider contexts before and after the target word?" 

print('Well done!')

[0, 0, 12, 13]
[0, 11, 13, 14]
[11, 12, 14, 15]
[12, 13, 15, 0]
[13, 14, 0, 0]
Well done!


To train our model faster, it is a good idea to batchify our data. For your convenience, we implemented it for you: 

In [0]:
def batch_gen(tknzd_ids, batch_size = 4,  window_size = 5):
    
    # shuffle(tknzd_ids) # shuffle is in place and does not return anything
    
    single_gen = generate_sample(tknzd_ids, window_size) # get sample generator
    
    while True:
        try: 
            # The end of iterations is indicated by an exception 
            context_batch = np.zeros([batch_size, window_size * 2], dtype=np.int32)
            target_batch = np.zeros([batch_size], dtype=np.int32)
            for index in range(batch_size):
                context_batch[index], target_batch[index] = next(single_gen)
            yield context_batch, target_batch
        except StopIteration:
            break

In [56]:
dummy_batches = batch_gen([11, 12, 13, 14, 15, 16, 17, 18], batch_size=4, window_size=2)

print("First batch:\n", next(dummy_batches))
print('-' * 50)
print("Second batch:\n", next(dummy_batches))

[0, 0, 12, 13]
[0, 11, 13, 14]
[11, 12, 14, 15]
[12, 13, 15, 16]
First batch:
 (array([[ 0,  0, 12, 13],
       [ 0, 11, 13, 14],
       [11, 12, 14, 15],
       [12, 13, 15, 16]], dtype=int32), array([11, 12, 13, 14], dtype=int32))
--------------------------------------------------
[13, 14, 16, 17]
[14, 15, 17, 18]
[15, 16, 18, 0]
[16, 17, 0, 0]
Second batch:
 (array([[13, 14, 16, 17],
       [14, 15, 17, 18],
       [15, 16, 18,  0],
       [16, 17,  0,  0]], dtype=int32), array([15, 16, 17, 18], dtype=int32))


## 2. CBOW Model

We now leverage PyTorch to build our CBOW model. The inputs are the context words, which are first converted into one-hot vectors and then projected into word vectors. The word vectors are obtained from an embedding matrix ($W$), which holds the distributed feature vector associated with each word in the vocabulary. In the code below, `reset_parameters` initializes this matrix with Kaiming-uniform initialization.

Next, the projected words are averaged (so the order of the context words is ignored), and the averaged vector is multiplied with a second embedding matrix ($W'$), which defines the so-called context embeddings and projects the CBOW representation back to the one-hot space to match the target word. (Note: in the theory, this is introduced as the linear output layer, with dimensions equal to those of the transpose of the embedding matrix.) We then apply a log-softmax to the resulting scores to predict the most probable target word given the input context.

We match the predicted word with the actual target word, compute the loss by leveraging the cross entropy loss and perform back-propagation with each iteration to update the embedding-matrix in the process.

<img src="https://cdn-images-1.medium.com/freeze/max/1000/1*uATTt40gbJ1HJQgIqE-VPA.png?q=20" alt="img" width="512px"/>
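
As a compact summary of the description above, the forward pass for one training example can be written as follows. The notation is an assumption made for this sketch: $V$ is the vocabulary size, $N$ the embedding dimension, $m$ the window size, $x_{c_j} \in \{0,1\}^V$ the one-hot vector of the $j$-th context word, and $t$ the index of the target word, with $W \in \mathbb{R}^{N \times V}$ and $W' \in \mathbb{R}^{V \times N}$.

$$h = \frac{1}{2m} \sum_{j=1}^{2m} W x_{c_j}, \qquad z = W' h, \qquad \log p(i \mid \text{context}) = z_i - \log \sum_{k=1}^{V} e^{z_k}$$

$$\mathcal{L} = -\log p(t \mid \text{context})$$

Since $W x_{c_j}$ simply selects column $c_j$ of $W$, the one-hot multiplication is equivalent to a table lookup.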



### Question-1

- How could we modify the `CBOW` architecture to consider the order and position of the context words?  

**<font color=blue><<< Replace the average and hidden layer multiplication operation with a RNN architecture to predict the output>>></font>**

Now, complete the CBOW class below, following the instructions in the comments.

In [0]:
class CBOW(nn.Module):

    def __init__(self, embedding_dim=100, vocab_size=10000):
        super(CBOW, self).__init__()
        
        self.vocab_size = vocab_size
        
        # use nn.Parameter to define the two matrices W and W' from above, 
        # thus one for word (W) and one for context (W') embeddings:
        # self.embed_in = ...  # word embedding
        # self.embed_out = ... # context embedding
        ############### for student ################
        self.embed_in = nn.Parameter(torch.zeros(embedding_dim, vocab_size))
        self.embed_out = nn.Parameter(torch.zeros(vocab_size, embedding_dim))

        ############################################
        
        self.reset_parameters()
            
    
    def reset_parameters(self):
        # Initialize parameters
        nn.init.kaiming_uniform_(self.embed_in, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.embed_out, a=math.sqrt(5))
    
    def get_word_embedding(self):
        return self.embed_in
    
    def get_context_embedding(self):
        return self.embed_out
    
    
    def forward(self, inps):
        """
        Convert given indices to log-probabilities. 
        Follow these steps:
        1) convert the inputs' word indices to one-hot vectors
        2) project the one-hot vectors to their embedding (use F.linear, do *NOT* use nn.Embedding)
        3) calculate the mean of the embedded vectors
        4) project back with the context embedding matrix 
        5) calculate the log-probability (with F.log_softmax)
                
        :argument:
            inps (tensor): Tensor of word indices
        
        :return:
            log-probabilities of words
        """
        ############### for student ################
        # embed_in(N, V) * one_hot(V) = embedding(N)
        # 1) convert the inputs' word indices to one-hot vectors
        # n = inps.shape[1]
        # one_hot = torch.zeros(n, self.vocab_size)
        # one_hot[torch.arange(n), inps] = 1
        
        one_hot = torch.nn.functional.one_hot(inps, self.vocab_size)
        one_hot = one_hot.to(dtype=torch.float32)
        print(one_hot.shape)
        # print(self.embed_in.shape)
        # print(one_hot.dtype)
        # print(self.embed_in.dtype)

        # 2) project the one-hot vectors to their embedding (use F.linear, do *NOT* use nn.Embedding)
        embedding = F.linear(one_hot, self.embed_in)
        print(embedding.shape)

        # 3) calculate the mean of the embedded vectors
        if len(embedding.shape) > 2:
            embedding_ave =  torch.mean(embedding, dim=1, keepdim=False)
        else:
            embedding_ave =  torch.mean(embedding, dim=0, keepdim=True)
        print(embedding_ave.shape)

        # 4) project back with the context embedding matrix 
        Out_put = F.linear(embedding_ave, self.embed_out)
        print(Out_put.shape)
        
        # 5) calculate the log-probability (with F.log_softmax)
        log_probs = F.log_softmax(Out_put, dim=-1)  # normalize over the vocabulary dimension
        print(log_probs)

        ############################################
        return log_probs

In [0]:
# one_hot = torch.randn((2, 4, 10))
# print(one_hot.shape)
# embed_in = torch.randn((20, 10))
# print(embed_in.shape)
# temp = F.linear(one_hot, embed_in)
# print(temp.shape)

In [197]:
dummy_model = CBOW(20, 10)
dummy_inps2 = torch.tensor([[6, 7, 9, 0], [1, 2, 3, 4]], dtype=torch.long)
dummy_pred1 = dummy_model(dummy_inps2)

torch.Size([2, 4, 10])
torch.Size([2, 4, 20])
torch.Size([2, 20])
torch.Size([2, 10])
tensor([[-2.2789, -2.3172, -2.3153, -2.3168, -2.3121, -2.2767, -2.2632, -2.2728,
         -2.3450, -2.3312],
        [-2.3766, -2.3987, -2.2271, -2.3053, -2.2720, -2.3562, -2.3239, -2.3105,
         -2.2142, -2.2587]], grad_fn=<LogSoftmaxBackward>)




In [198]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = CBOW(20, 10)
dummy_inps1 = torch.tensor([[6, 7, 9, 0]], dtype=torch.long)
dummy_inps2 = torch.tensor([[6, 7, 9, 0], [1, 2, 3, 4]], dtype=torch.long)
dummy_pred1 = dummy_model(dummy_inps1)
dummy_pred2 = dummy_model(dummy_inps2)

assert isinstance(dummy_model.embed_in, nn.Parameter), "Use nn.Parameter for embed_in"
assert isinstance(dummy_model.embed_out, nn.Parameter), "Use nn.Parameter for embed_out"
assert dummy_model.embed_in.shape == torch.Size([20, 10]), "param_in shape is not correct"
assert dummy_model.embed_out.shape == torch.Size([10, 20]), "param_out shape is not correct"
assert dummy_pred1.shape == torch.Size([1,10]), "Prediction shape is not correct"
assert dummy_pred2.shape == torch.Size([2,10]), "Prediction shape is not correct"
assert dummy_pred1.grad_fn.__class__.__name__ == 'LogSoftmaxBackward', "softmax layer?"

print('Well done!')

torch.Size([1, 4, 10])
torch.Size([1, 4, 20])
torch.Size([1, 20])
torch.Size([1, 10])
tensor([[-2.2139, -2.3625, -2.2115, -2.3181, -2.3558, -2.3497, -2.3327, -2.2591,
         -2.2980, -2.3394]], grad_fn=<LogSoftmaxBackward>)
torch.Size([2, 4, 10])
torch.Size([2, 4, 20])
torch.Size([2, 20])
torch.Size([2, 10])
tensor([[-2.2139, -2.3625, -2.2115, -2.3181, -2.3558, -2.3497, -2.3327, -2.2591,
         -2.2980, -2.3394],
        [-2.2885, -2.2856, -2.2546, -2.2930, -2.3903, -2.3146, -2.2616, -2.3209,
         -2.3831, -2.2448]], grad_fn=<LogSoftmaxBackward>)
Well done!




### Train Model

Before jumping into the training part, we need to define some hyper-parameters:

In [0]:
# embedding hyper-parameters

EMBED_DIM = 100
WINDOW_SIZE = 5
BATCH_SIZE = 128
VOCAB_SIZE = 10_000

EPOCHS = 1 # to make things faster in this basic setup
interval = 100

In [0]:
# get data

tokenized_ids, id_to_token, _ = get_data(tokenized_text, VOCAB_SIZE)

Now we define our main training loop. Please implement the typical steps for training:
- Reset all gradients
- Compute output and loss value
- Perform back-propagation
- Update the network’s parameters
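
The four steps listed above can be sketched on a toy model. Everything here is an assumption made for illustration (the tiny linear model, the random data, the learning rate); it shows the order of the calls, not the CBOW training itself.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy stand-in for the model: 4 input features, 3 classes (both hypothetical)
toy_model = nn.Sequential(nn.Linear(4, 3), nn.LogSoftmax(dim=1))
toy_criterion = nn.NLLLoss()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.1)

toy_inputs = torch.randn(8, 4)            # one fixed batch of 8 examples
toy_targets = torch.randint(0, 3, (8,))   # class index per example

losses = []
for step in range(20):
    toy_optimizer.zero_grad()                        # 1) reset all gradients
    log_probs = toy_model(toy_inputs)                # 2) compute output ...
    loss = toy_criterion(log_probs, toy_targets)     #    ... and loss value
    loss.backward()                                  # 3) back-propagation
    toy_optimizer.step()                             # 4) update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss on this fixed batch should go down
```

Note that `loss.item()` is stored in a separate list, so the loss tensor itself is never reused as an accumulator.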

In [0]:
model = CBOW(EMBED_DIM, VOCAB_SIZE)
model = model.to(device)

criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters())

loss_history = []

for e in range(EPOCHS):
    
    batches = batch_gen(tokenized_ids, batch_size=BATCH_SIZE, window_size=WINDOW_SIZE)
    total_loss = 0.0
    
    for iteration, (context, target) in enumerate(batches):
        
        # Step 1. Prepare the inputs to be passed to the model (wrap integer indices in tensors)
        # Step 2. Recall that torch *accumulates* gradients. Before passing a
        #         new instance, you need to zero out the gradients from the old instance
        # Step 3. Run the forward pass, getting predicted target words log probabilities
        # Step 4. Compute your loss function. 
        # Step 5. Do the backward pass and update the parameters
        
        ############### for student ################















        ############################################
        
        total_loss += loss.item()
        
        if iteration % interval == 0:
            print('Epoch:{}/{},\tIteration:{},\tLoss:{}'.format(e, EPOCHS, iteration, total_loss / interval), end = "\r", flush = True)
            loss_history.append(total_loss / interval)
            total_loss = 0.0

In [0]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

assert loss_history[-1] < 6.5

print('Well done!')

### Nearest words

So far, we have trained the __CBOW__ model successfully; now it is time to explore it further. In this part, we want to find the $k$ nearest words to a given word, i.e., those nearby in the vector space.

<img src="https://i0.wp.com/i.imgur.com/IeZt839.png" alt="img" width="480px"/>



Define a helper function to retrieve the corresponding vector for a given word:

In [0]:
# be sure jupyter session is not terminated!
# use token_to_id to retrieve the index

def get_vector(embedding, word):
    """
    :argument:
        embedding (matrix): embedding matrix 
        word (str): The given input
    :return:
        word-vector for a given word
    """
    ############### for student ################










    ############################################

In [0]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

embedding = model.embed_in.data

assert get_vector(embedding, 'the').shape == torch.Size([100, 1]), "vector size should be (embed_dim, 1)"
assert np.allclose(embedding[:,(0,)].data.cpu().numpy(), get_vector(embedding, 'UNK').data.cpu().numpy()), "Do you retrieve correct vector?"
print('Well done!')

Define a function to return the list of $k$ most similar words, e.g., based on `cosine-similarity`, to a given word:

In [0]:
def most_similar_words(embedding, word, k=1):
    """
    return k similar (based on cosine similarity) items
    :argument:
        embedding (matrix): embedding matrix 
        word (str): The given input
        k (int): The number of similar items    
    :return:
        list of k similar items
    """
    most_similar = []
    x = get_vector(embedding, word) # shape: (embed_dim, 1)
    # ...
    # most_similar = ...
    ############### for student ################


    # hint: use F.cosine_similarity
    # transpose if needed


    ############################################
    return most_similar
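
As a framework-free illustration of the ranking idea, the sketch below computes cosine similarities in plain Python and returns the top $k$ neighbours, excluding the query word itself. The helper names and the 2-d vectors are hypothetical; in the lab, the vectors are columns of the embedding matrix and `F.cosine_similarity` does the same computation on tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(vectors, word, k=1):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Hypothetical 2-d vectors, made up for illustration only
toy_vectors = {
    'king':  [1.0, 0.9],
    'queen': [0.9, 1.0],
    'apple': [-1.0, 0.2],
    'pear':  [-0.9, 0.3],
}
print(top_k_similar(toy_vectors, 'king', k=2))  # ['queen', 'pear']
```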

In [0]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

embedding = model.embed_in.data

dummy_list = most_similar_words(embedding, "mutual", 3)
s1 = F.cosine_similarity(get_vector(embedding, dummy_list[0]).T, get_vector(embedding, "mutual").T)
s2 = F.cosine_similarity(get_vector(embedding, dummy_list[1]).T, get_vector(embedding, "mutual").T)
s3 = F.cosine_similarity(get_vector(embedding, dummy_list[2]).T, get_vector(embedding, "mutual").T)

assert len(dummy_list) == 3, "return k nearest words"
assert s1.data.cpu().numpy()[0] >= s2.data.cpu().numpy()[0], "first item should have higher probablity to the given word"
assert s2.data.cpu().numpy()[0] >= s3.data.cpu().numpy()[0], "second item should have higher probability"
assert s1.data.cpu().numpy()[0] != 1 , "Similarity score of one means you return the word itself"

print('Well done!')

### Linear projection


The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.


<img src="https://hackernoon.com/hn-images/1*ZFqnPuxa1PtUece-OHBoTA.png" alt="img" width="512px"/>

Under the hood, it attempts to decompose an object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing the *mean squared error*:

$$\min_{W, \hat{W}} \ \ \|(X W) \hat{W} - X\|^2_2 $$

with
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
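
One minimizer of this objective is known in closed form: take the top $d$ right singular vectors of the centered $X$ as the columns of $W$ and set $\hat{W} = W^\top$. The NumPy sketch below checks this on random toy data (the data and sizes are assumptions for illustration) and also shows the zero-mean, unit-variance normalization of the projected points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # n=200 samples, m=5 dimensions (toy data)
X = X - X.mean(axis=0)                 # PCA assumes a centered matrix

d = 2
# Rows of Vt are the principal directions, sorted by decreasing singular value
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:d].T                           # direct transformation, shape (m, d)
W_hat = W.T                            # reverse transformation, shape (d, m)

Z = X @ W                              # projected data, shape (n, d)
error = np.linalg.norm(Z @ W_hat - X) ** 2   # squared reconstruction error

# Standardize the projection: zero mean and unit variance per component
Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0)
print(Z.shape, error)
```

The reconstruction error equals the sum of the squared singular values that were discarded, which is why keeping the largest ones is the best choice.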


In [0]:
from sklearn.decomposition import PCA

# Map word vectors onto a 2D plane with PCA. Use the good old sklearn API (fit, transform).
# Finally, normalize the mapped vectors, to make sure they have zero mean and unit variance 

# word_vectors = ...
# ...
# word_vectors_pca = ...  # normalized vectors

############### for student ################







############################################

In [0]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2D vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

print('Well done')

In [0]:
# !pip install bokeh

import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [0]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=list(id_to_token.values()))

### Visualizing neighbors with t-SNE
PCA is nice but it's strictly linear and thus only able to capture coarse high-level structure of the data.

If we instead want to focus on keeping neighboring points close together, we can use t-SNE, which is itself an embedding method. Here you can read __[more on t-SNE](https://distill.pub/2016/misread-tsne/)__.

In [0]:
from sklearn.manifold import TSNE

# Map word vectors onto a 2d plane with TSNE. (Hint: use verbose=100 to see what it's doing.)
# Normalize them just like with PCA into word_tsne

# ...
# word_tsne = ...

############### for student ################




############################################

In [0]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=list(id_to_token.values()))

## 3. POS tagging task

The embeddings by themselves are nice to have, but the main objective of course is to solve a particular (NLP) task. Further, so far we have trained our own embedding from a given corpus, but often it is beneficial to use existing word embeddings.

Now, let's use embeddings to train a simple Part of Speech (PoS) tagging model, using pretrained word embeddings. We shall use [50d glove word vectors](https://nlp.stanford.edu/projects/glove/) for the rest of this section.

Before jumping into our neural POS tagger, it is better to set up a baseline to give us an intuition of how the neural model performs compared to other models. The baseline model is the [Conditional-Random-Field (CRF)](https://en.wikipedia.org/wiki/Conditional_random_field) (also discussed in lecture `NLP_03_PoS_tagging_and_NER_20201`), which is a discriminative sequence labelling model. The evaluation is done on a 10% sample of the Penn Treebank (which is offered through NLTK).

Download data from `nltk` repository and split it into test (20%) and training (80%) sets:

In [0]:
import nltk

# download necessary packages from nltk
nltk.download('treebank')
nltk.download('universal_tagset')

tagged_sentence = nltk.corpus.treebank.tagged_sents(tagset='universal')
print("Number of Tagged Sentences ", len(tagged_sentence))
print(tagged_sentence[0])

In [0]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(tagged_sentence, test_size=0.20, random_state=42)

print("Train size: {}".format(len(train)))
print("Test size: {}".format(len(test)))

### Setup a baseline

In [0]:
def features(sentence, index):
    """
    Return hand designed features for a given word
    :argument:
        sentence: tokenized sentence [w1, w2, ...] 
        index: index of the word    
    :return:
        a feature set for given word
    """

    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        ############### for student ################














        ############################################
    }
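
As an illustration of the kind of hand-designed features that are commonly tried for PoS tagging, here is a small sketch. `extra_features` and the feature names are hypothetical, and whether any of these features actually helps on this dataset is an empirical question that the code does not answer.

```python
def extra_features(sentence, index):
    """Hypothetical additional features for the word at `index`."""
    word = sentence[index]
    return {
        'prefix-2': word[:2],                 # first two characters
        'suffix-2': word[-2:],                # last two characters
        'suffix-3': word[-3:],                # last three characters
        'is_all_caps': word.isupper(),
        # digits only, once thousands separators and decimal points are removed
        'is_numeric': word.replace(',', '').replace('.', '').isdigit(),
        'has_hyphen': '-' in word,
    }

toy_feats = extra_features(['Prices', 'rose', '3.5', '%'], 2)
print(toy_feats)
```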

### Question-2

- Suggest about 6 more features that could improve the above feature set and add them to the code above. After running the model with these features: which features worked best, and how much did your new features help in improving the model?

**<font color=blue><<< INSERT ANSWER HERE >>></font>**

In [0]:
def transform2feature_label(tagged_sentence):
    X, y = [], []
 
    for tagged in tagged_sentence:
        X.append([features([w for w, t in tagged], i) for i in range(len(tagged))])
        y.append([tagged[i][1] for i in range(len(tagged))])
    
    return X,y

In [0]:
X_train, y_train = transform2feature_label(train)
X_test, y_test = transform2feature_label(test)

In [0]:
X_train[0][0]

In [0]:
# install crf-classifier

!pip install sklearn-crfsuite

In [0]:
import sklearn_crfsuite


# fit crfsuite classifier on train data
############### for student ################




############################################

print("Accuracy:", crf.score(X_test, y_test))

### Build neural model 

Now it's time to build our Neural PoS-tagger. The model we want to play with is a bi-directional LSTM on top of pretrained word embeddings. First, we prepare the embedding part and then go into the model itself:

In [0]:
# download glove 50d

!wget "https://www.dropbox.com/s/lc3yjhmovq7nyp5/glove6b50dtxt.zip?dl=1" -O glove6b50dtxt.zip
!unzip -o glove6b50dtxt.zip
!rm glove6b50dtxt.zip

In [0]:
GLOVE_PATH = 'glove.6B.50d.txt'

We build two dictionaries for mapping words and tags to unique ids, which we need later on:

In [0]:
word_to_id = {}
tag_to_id = {}

for sentence in tagged_sentence:
    for word, pos_tag in sentence:
        if word not in word_to_id.keys():
            word_to_id[word] = len(word_to_id)
        if pos_tag not in tag_to_id.keys():
            tag_to_id[pos_tag] = len(tag_to_id)
            
word_vocab_size = len(word_to_id)
tag_vocab_size = len(tag_to_id)

print("Unique words: {}".format(word_vocab_size))
print("Unique tags: {}".format(tag_vocab_size))

We created a wrapper around the embedding module to isolate it from the other parts of the model. This module loads word vectors from a file and assigns them as the weights of the corresponding embedding layer.

Create an embedding layer (this time use `nn.Embedding`), and assign the pretrained embeddings to its `weight` field. In this exercise, you can continue to finetune the embeddings while training the end task; no need to freeze them: this means the pre-trained embeddings serve as a smart initialization of the embedding layer.
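
One way to obtain a trainable embedding layer from a given weight matrix is `nn.Embedding.from_pretrained` with `freeze=False`. The sketch below uses a made-up 4 x 3 matrix instead of GloVe; it is an illustration of the PyTorch call, not necessarily the solution intended by the lab.

```python
import torch
import torch.nn as nn

# Hypothetical pretrained matrix: 4 words, 3 dimensions
pretrained = torch.tensor([[0.0, 0.0, 0.0],
                           [0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6],
                           [0.7, 0.8, 0.9]])

# freeze=False keeps the weights trainable, so they are fine-tuned with the task
toy_embed = nn.Embedding.from_pretrained(pretrained, freeze=False)

ids = torch.tensor([1, 3], dtype=torch.long)
print(toy_embed(ids))                  # rows 1 and 3 of the matrix
print(toy_embed.weight.requires_grad)  # True
```

With the default `freeze=True`, the weights would be excluded from gradient updates and stay at their pretrained values.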

In [0]:
class PretrainedEmbeddings(nn.Module):
    def __init__(self, filename, word_to_id, dim_embedding):
        super(PretrainedEmbeddings, self).__init__()
        
        wordvectors = self.load_word_vectors(filename, word_to_id, dim_embedding)
        # self.embed = ...
        ############### for student ################




        ############################################

    def forward(self, inputs):
        return self.embed(inputs)
    
    def load_word_vectors(self, filename, word_to_id, dim_embedding):
        wordvectors = torch.zeros(len(word_to_id), dim_embedding)
        with open(filename, 'r') as file:
            for line in file.readlines():
                data = line.split(' ')
                word = data[0]
                vector = data[1:]
                if word in word_to_id.keys():
                    wordvectors[word_to_id[word],:] = torch.Tensor([float(x) for x in vector])
        
        return wordvectors

In [0]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = PretrainedEmbeddings(GLOVE_PATH, word_to_id, 50)
dummy_inps = torch.tensor([0, 4, 3, 5, 9], dtype=torch.long)

assert dummy_model.embed.weight.shape == torch.Size([word_vocab_size, 50]), "embedding shape is not correct"
assert dummy_model(dummy_inps).shape == torch.Size([5, 50]), "word embedding shape is not correct"
assert np.allclose(dummy_model.embed.weight.detach().numpy()[0], [0] * 50), "Load weights from glove?"
assert np.allclose(dummy_model.embed.weight.detach().numpy()[714], [0] * 50), "Are you sure you load from glove correctly?"

print('Well done')

Let’s now define the model. Here’s what we need:

- We’ll need an embedding layer that computes a word vector for each word in a given sentence
- We’ll need a bidirectional-LSTM layer to incorporate context from both directions  (reshape the embedding since `nn.LSTM` needs 3-dimensional inputs)
- After the LSTM Layer we need a Linear layer that picks the appropriate POS tag (note that this layer is applied to each element of the sequence).
- Apply the LogSoftmax to calculate the log probabilities from the resulting scores.
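
The tensor shapes along this pipeline can be traced with a small sketch. The sizes are toy values and the random `embeds` tensor stands in for the output of the embedding layer; all names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, emb_dim, hidden_dim, n_tags = 5, 8, 6, 4   # toy sizes (hypothetical)

embeds = torch.randn(seq_len, emb_dim)               # stand-in for embedded words
lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True)
hidden2tag = nn.Linear(hidden_dim * 2, n_tags)

# nn.LSTM expects (seq_len, batch, features) by default; batch size is 1 here
lstm_out, _ = lstm(embeds.view(seq_len, 1, emb_dim))  # (seq_len, 1, 2 * hidden_dim)
scores = hidden2tag(lstm_out.view(seq_len, -1))       # (seq_len, n_tags)
tag_log_probs = F.log_softmax(scores, dim=1)          # normalize over the tags

print(lstm_out.shape, tag_log_probs.shape)
```

The factor 2 in the LSTM output size comes from concatenating the forward and backward hidden states.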

Complete the forward pass of the POSTagger model: 

In [0]:
class POSTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, word_to_id, tag_to_id, embedding_file_path):
        super(POSTagger, self).__init__()
        
        self.embed = PretrainedEmbeddings(embedding_file_path, word_to_id, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim * 2, len(tag_to_id))
        
    def forward(self, sentence):
        ############### for student ################





        ############################################
        return tag_scores

In [0]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = POSTagger(50, 50, word_to_id, tag_to_id, GLOVE_PATH)
dummy_inps = torch.tensor([0, 4, 3, 5, 9], dtype=torch.long)

assert dummy_model(dummy_inps).grad_fn.__class__.__name__ == 'LogSoftmaxBackward', "softmax layer?"
assert dummy_model(dummy_inps).shape == torch.Size([5, len(tag_to_id)]), "The output has wrong shape! Probably you need some reshaping!"

print("Well done!")

Perfect! Now train your model:

In [0]:
# Training start

model = POSTagger(50, 64, word_to_id, tag_to_id, GLOVE_PATH)
model = model.to(device)
criterion = nn.NLLLoss()
optimizer = optim.AdamW(model.parameters())

accuracy_list = []
loss_list = []

interval = round(len(train) / 100.)
EPOCHS = 6
e_interval = round(EPOCHS / 10.)

for e in range(EPOCHS):
    acc = 0 
    total_loss = 0.0  # running sum of per-sentence losses for this epoch
    
    model.train()
    
    for i, sentence_tag in enumerate(train):
        
        sentence = [word_to_id[s[0]] for s in sentence_tag]
        sentence = torch.tensor(sentence, dtype=torch.long)
        sentence = sentence.to(device)
        targets = [tag_to_id[s[1]] for s in sentence_tag]
        targets = torch.tensor(targets, dtype=torch.long)
        targets = targets.to(device)
        
        model.zero_grad()
        
        tag_scores = model(sentence)
        
        loss = criterion(tag_scores, targets)
        
        loss.backward()
        
        optimizer.step()
        
        # accumulate in a separate variable; `loss` is overwritten every iteration
        total_loss += loss.item()
        
        _, indices = torch.max(tag_scores, 1)

        acc += torch.mean((targets == indices).float())
        
        if i % interval == 0:
            print("Epoch {} Running;\t{}% Complete".format(e + 1, i / interval), end = "\r", flush = True)
    
    loss = total_loss / len(train)
    acc = acc / len(train)
    loss_list.append(float(loss))
    accuracy_list.append(float(acc))
    
    if (e + 1) % e_interval == 0:
        print("Epoch {} Completed,\tLoss {}\tAccuracy: {}".format(e + 1, np.mean(loss_list[-e_interval:]), np.mean(accuracy_list[-e_interval:])))

So far, so good! It's time to test our classifier. Complete the evaluation part. Compute accuracy on the test data:

In [0]:
def evaluate(model, data):

    model.eval()
    
    acc = 0.0
    
    # calculate accuracy based on predictions
    ############### for student ################












    ############################################
    
    return score
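
Accuracy over tagged sentences can be averaged in two ways, and the two can differ when sentences have different lengths. The plain-Python sketch below contrasts them on made-up tags; the function names are hypothetical, and the lab text does not state which variant it expects.

```python
def token_accuracy(gold, predicted):
    """Fraction of correctly tagged tokens over all sentences."""
    correct = sum(g == p for gs, ps in zip(gold, predicted) for g, p in zip(gs, ps))
    total = sum(len(gs) for gs in gold)
    return correct / total

def sentence_averaged_accuracy(gold, predicted):
    """Mean of the per-sentence accuracies, as accumulated in the training loop."""
    per_sentence = [sum(g == p for g, p in zip(gs, ps)) / len(gs)
                    for gs, ps in zip(gold, predicted)]
    return sum(per_sentence) / len(per_sentence)

# Hypothetical tags for two sentences of different length
gold = [['DET', 'NOUN', 'VERB', 'ADP'], ['NOUN', 'VERB']]
pred = [['DET', 'NOUN', 'VERB', 'ADP'], ['NOUN', 'NOUN']]

print(token_accuracy(gold, pred))              # 5 of 6 tokens correct
print(sentence_averaged_accuracy(gold, pred))  # (1.0 + 0.5) / 2 = 0.75
```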
    
        

In [0]:
score = evaluate(model, test)
print("Accuracy:", score)

assert score > 0.96, "accuracy should be above 96%"
assert score < 1.00, "accuracy should be less than 100%!"

print('Well done!')

### Question-3

- Whether to fine-tune the pre-trained embeddings, how many epochs to train (and whether to use 'early stopping'), and whether to apply regularization are all hyperparameters that should be tuned properly on a validation set. We did not do this here. It is therefore hard to make strong claims about the model at this point. However, as a quick test, please train the POS model with the same settings, but with a standard randomly initialized embedding layer instead of the pretrained embeddings. What do you observe compared to the CRF baseline and compared to the GloVe initialization? (Note: for your final code in `POSTagger`, please make sure it again loads the pretrained embeddings.)

**<font color=blue><<< INSERT ANSWER HERE >>></font>**

### Acknowledgment

If you received help or feedback from fellow students, please acknowledge that here. We count on your academic honesty:

... ...