# LSTM
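- in an LSTM, information passes through three gates:
    - forget gate (sigmoid over the previous hidden state & the current input $x^{\langle t \rangle}$): decides what to drop from the cell state
    - input gate (sigmoid gate plus tanh candidate values): decides what to write to the cell state
    - output gate (sigmoid gate applied to the tanh of the cell state): produces the new hidden state
- LSTMs handle long sequences better than plain RNNs because the additive cell-state update mitigates (but does not fully remove) vanishing gradients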
- https://www.youtube.com/watch?v=uSdku8Q3d0A
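
To make the gate description above concrete, here is a minimal NumPy sketch of a single LSTM time step. The names `lstm_step`, `W` and `b`, the stacking order of the gates inside `W`, and the toy sizes are assumptions chosen for illustration; this is not the Trax implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative sketch).

    W has shape (4 * n_hidden, n_hidden + n_input), b has shape (4 * n_hidden,).
    The four row blocks of W are assumed to be: forget, input, candidate, output.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])            # forget gate: what to keep from c_prev
    i = sigmoid(z[n:2 * n])        # input gate: how much of the candidate to write
    g = np.tanh(z[2 * n:3 * n])    # candidate cell values
    o = sigmoid(z[3 * n:4 * n])    # output gate: what to expose as hidden state
    c_t = f * c_prev + i * g       # additive cell-state update
    h_t = o * np.tanh(c_t)         # new hidden state
    return h_t, c_t

# Run the cell over a toy sequence of 5 random inputs
rng = np.random.default_rng(0)
n_input, n_hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

The additive update `c_t = f * c_prev + i * g` is the part usually credited with helping gradients survive over long sequences.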

## LSTM Math
Some symbols look funny...🤔
![lstm_math.png](img/lstm_math.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/ANbcf/lstm-equations-optional)
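
In case the image does not render, the standard LSTM formulation can be written out as follows. The notation ($W_*$, $b_*$, $\sigma$ for the sigmoid, $\odot$ for element-wise multiplication) is the common textbook one and is an assumption here; it may differ from the symbols used in the course material.

$$
\begin{aligned}
f^{\langle t \rangle} &= \sigma\left(W_f\,[h^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f\right) \\
i^{\langle t \rangle} &= \sigma\left(W_i\,[h^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_i\right) \\
\tilde{c}^{\langle t \rangle} &= \tanh\left(W_c\,[h^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c\right) \\
o^{\langle t \rangle} &= \sigma\left(W_o\,[h^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o\right) \\
c^{\langle t \rangle} &= f^{\langle t \rangle} \odot c^{\langle t-1 \rangle} + i^{\langle t \rangle} \odot \tilde{c}^{\langle t \rangle} \\
h^{\langle t \rangle} &= o^{\langle t \rangle} \odot \tanh\left(c^{\langle t \rangle}\right)
\end{aligned}
$$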


# Named Entity Recognition (NER)
- scan text fast for desired info
- find & extract defined entities (e.g. names, offensive vocab)

## NER types
- geographical (Germany)
- organizations (EOS)
- geopolitical (German)
- time (June 2021)
- artifacts (...)
- persons (Angela Merkel)


Labeled Sentence:
*Sebastian (PER) is travelling to Erfurt (GEO) next week (TIME).*

## Examples Applied NER
- efficient search engines (our confluence needs NER 😁)
- recommendation engines (may not work for all recommendation use cases)
- first level customer communication (call/chat) (e.g. forward customers to the best agent for their request)
- trading bot (1. evaluate scraped content with NER, 2. trade accordingly)

![acc.png](img/acc.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/lecture/odcLM/computing-accuracy)
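
As a small worked example of accuracy on padded sequences, the sketch below counts correct predictions only at non-pad positions. The toy `labels` and `predicted` arrays are made up for illustration; the pad id 35180 matches the value used later in these notes.

```python
import numpy as np

PAD = 35180  # pad id, as used later in these notes

# Toy batch: 2 sentences, the first one padded to length 5
labels = np.array([[0, 1, 0, PAD, PAD],
                   [3, 10, 0, 0, 2]])
predicted = np.array([[0, 1, 2, 0, 0],
                      [3, 0, 0, 0, 2]])

mask = labels != PAD                      # True for real tokens only
correct = (predicted == labels) & mask    # correct predictions on real tokens
accuracy = correct.sum() / mask.sum()
print(accuracy)  # 0.75 (6 correct out of 8 real tokens)
```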

# Assignment 3

In [1]:
import trax 
from trax import layers as tl
import os 
import numpy as np
import pandas as pd


from w3 import get_params, get_vocab
import random as rnd

# set random seeds to make this notebook easier to replicate (deprecated)
#trax.supervised.trainer_lib.init_random_number_generators(33)
# https://github.com/google/trax/issues/920

In [2]:
vocab, tag_map = get_vocab('w3data/large/words.txt', 'w3data/large/tags.txt')
t_sentences, t_labels, t_size = get_params(vocab, tag_map, 'w3data/large/train/sentences.txt', 'w3data/large/train/labels.txt')
v_sentences, v_labels, v_size = get_params(vocab, tag_map, 'w3data/large/val/sentences.txt', 'w3data/large/val/labels.txt')
test_sentences, test_labels, test_size = get_params(vocab, tag_map, 'w3data/large/test/sentences.txt', 'w3data/large/test/labels.txt')

### Looking at one training example

In [3]:
print(open('w3data/large/train/sentences.txt', 'r').readline(),
      t_sentences[0],
      t_labels[0], 
      sep="\n")

Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 9, 15, 1, 16, 17, 18, 19, 20, 21]
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]


In [4]:
print(tag_map)

{'O': 0, 'B-geo': 1, 'B-gpe': 2, 'B-per': 3, 'I-geo': 4, 'B-org': 5, 'I-org': 6, 'B-tim': 7, 'B-art': 8, 'I-art': 9, 'I-per': 10, 'I-gpe': 11, 'I-tim': 12, 'B-nat': 13, 'B-eve': 14, 'I-eve': 15, 'I-nat': 16}


In [5]:
tag_map["B-geo"], tag_map["B-gpe"]

(1, 2)

The `tag_map` maps each tag to an integer id. The example sentence above contains
- two *geographical entities - geo* (London, Iraq) and 
- one *geopolitical entity - gpe* (British). 
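
To read the encoded labels of the training example, the `tag_map` can be inverted. The sketch below copies the sentence, labels and `tag_map` from the outputs above; `index_to_tag` is a helper name introduced only for this illustration.

```python
tag_map = {'O': 0, 'B-geo': 1, 'B-gpe': 2, 'B-per': 3, 'I-geo': 4, 'B-org': 5,
           'I-org': 6, 'B-tim': 7, 'B-art': 8, 'I-art': 9, 'I-per': 10,
           'I-gpe': 11, 'I-tim': 12, 'B-nat': 13, 'B-eve': 14, 'I-eve': 15,
           'I-nat': 16}

# Invert the mapping: integer id -> tag name
index_to_tag = {index: tag for tag, index in tag_map.items()}

tokens = ("Thousands of demonstrators have marched through London to protest "
          "the war in Iraq and demand the withdrawal of British troops from "
          "that country .").split()
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]

# Keep only tokens that are tagged as part of an entity
entities = [(token, index_to_tag[label])
            for token, label in zip(tokens, labels) if label != tag_map['O']]
print(entities)
# [('London', 'B-geo'), ('Iraq', 'B-geo'), ('British', 'B-gpe')]
```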

Other abbreviations:
* org: organization
* per: person 
* tim: time indicator
* art: artifact
* eve: event
* nat: natural phenomenon
* O: outside, i.e. a token that is not part of any named entity

Note that the tags distinguish between `B`-eginning tokens, which start an entity, & `I`-nside tokens, which continue it:

Examples:

- *\"EOS (`B-org`) Solutions (`I-org`) is a great company.\"*

- *\"We are Borg (`B-org`). Resistance is futile.\"*
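
The B/I distinction makes it possible to merge consecutive tokens back into multi-word entities. Below is a minimal sketch of that grouping; `group_entities` is a hypothetical helper and not part of the assignment code, and it simply drops an `I-` tag that does not continue a matching `B-` tag.

```python
def group_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continue the entity that is currently open
            current_tokens.append(token)
        else:
            # 'O' or an inconsistent I- tag: close the open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["EOS", "Solutions", "is", "a", "great", "company", "."]
tags = ["B-org", "I-org", "O", "O", "O", "O", "O"]
print(group_entities(tokens, tags))  # [('EOS Solutions', 'org')]
```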

In [6]:
def data_generator(batch_size: int, x, y, pad: int, shuffle=False, verbose=False):
    """This generator creates batches of data to be fed into the NN.

    Parameters
    ----------
    batch_size : int
        integer describing the batch size
        
    x : List
        list containing sentences where words are represented as integers
        
    y : List
        list containing tags associated with the sentences
        
    pad : int
        an integer representing a pad character
        
    shuffle : bool, optional
        Shuffle the data order, by default False
        
    verbose : bool, optional
        Print information during runtime, by default False

    Yields
    -------
    Tuple (X,Y)
        - X padded sentences, np.array(), shape = (batch_size, max_len)
        - Y tags associated with the sentences in X , np.array(), shape = (batch_size, max_len)
    """

    # count the number of examples in x
    num_lines = len(x)

    # create a list with the indexes of the examples that can be shuffled
    lines_index = [*range(num_lines)]

    # "Everyday I'm shuffling..."
    if shuffle:
        rnd.shuffle(lines_index)

    index = 0  # tracks current location in x, y
    
    while True:
        
        # Temporary lists to store the raw x & y data for this batch
        buffer_x,  buffer_y = [0] * batch_size, [0] * batch_size  
        
        max_len = 0 
        for i in range(batch_size):
            if index >= num_lines:
                index = 0
                if shuffle:
                    rnd.shuffle(lines_index)

            # Get current position & store the x & y values in buffer_x, buffer_y
            buffer_x[i],buffer_y[i] = x[lines_index[index]],y[lines_index[index]]

            # Get max len for later padding
            lenx = len(buffer_x[i])
            if lenx > max_len:
                max_len = lenx  
            index += 1

        # create X,Y, NumPy arrays of size (batch_size, max_len) 'full' of pad value
        X = np.full((batch_size, max_len), pad)
        Y = np.full((batch_size, max_len), pad)

        # copy values from lists to NumPy arrays. Use the buffered values
        for i in range(batch_size):
            
            # Get the example (sentence as a tensor) & labels
            # in buffer_x, buffer_y at the i index
            x_i, y_i = buffer_x[i], buffer_y[i]

            # Walk through each word in x_i
            # & store word & label in x_i, y_i at position j into X,Y
            for j in range(len(x_i)):   
                X[i, j], Y[i, j] = x_i[j], y_i[j]

        if verbose:
            print("index=", index)
        yield ((X, Y))

In [7]:
# Testing the data generator
batch_size = 5
mini_sentences = t_sentences[0: 8]
mini_labels = t_labels[0: 8]

dg = data_generator(batch_size, 
                    mini_sentences, 
                    mini_labels, 
                    vocab["<PAD>"], 
                    shuffle=False, 
                    verbose=False)
X1, Y1 = next(dg)
X2, Y2 = next(dg)

print(Y1.shape, X1.shape, Y2.shape, X2.shape)
print(X1[0][:], "\n", Y1[0][:])

(5, 30) (5, 30) (5, 30) (5, 30)
[    0     1     2     3     4     5     6     7     8     9    10    11
    12    13    14     9    15     1    16    17    18    19    20    21
 35180 35180 35180 35180 35180 35180] 
 [    0     0     0     0     0     0     1     0     0     0     0     0
     1     0     0     0     0     0     2     0     0     0     0     0
 35180 35180 35180 35180 35180 35180]
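
The padding step of the generator can be isolated into a few lines. The sketch below is a simplified stand-in, assuming `pad_batch` as a hypothetical helper name; it uses slice assignment instead of the element-wise inner loop but fills the array in the same way.

```python
import numpy as np

def pad_batch(sequences, pad):
    """Right-pad a list of integer sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq  # copy the sequence, keep pad values after it
    return batch

batch = pad_batch([[0, 1, 2], [3, 4], [5]], pad=35180)
print(batch)
# [[    0     1     2]
#  [    3     4 35180]
#  [    5 35180 35180]]
```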


In [8]:
def NER(vocab_size=35181, d_model=50, tags=tag_map):
    '''
      Input: 
        vocab_size - integer containing the size of the vocabulary
        d_model - integer describing the embedding size (also used as the number of LSTM units)
        tags - dictionary mapping tag names to integer ids
      Output:
        model - a trax serial model
    '''
    model = tl.Serial(
      tl.Embedding(vocab_size=vocab_size, d_feature=d_model), # Embedding layer
      tl.LSTM(d_model),                                       # LSTM layer with d_model units
      tl.Dense(len(tags)),                                    # Dense layer with len(tags) units
      tl.LogSoftmax()                                         # LogSoftmax layer
      )
    return model

In [9]:
from trax.supervised import training

rnd.seed(33)

batch_size = 64

# Create training data, mask pad id=35180 for training.
train_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, t_sentences, t_labels, vocab['<PAD>'], True),
    id_to_mask=vocab['<PAD>'])

# Create validation data, mask pad id=35180 for evaluation.
eval_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, v_sentences, v_labels, vocab['<PAD>'], True),
    id_to_mask=vocab['<PAD>'])
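
As far as I understand `add_loss_weights`, it turns each `(inputs, targets)` batch into `(inputs, targets, weights)`, with a weight of 0 wherever the target equals `id_to_mask`, so that padded positions do not contribute to the loss. The NumPy sketch below only illustrates that behaviour; `with_loss_weights` and `toy_batches` are hypothetical names, and this is not Trax's implementation.

```python
import numpy as np

PAD = 35180  # pad id used in these notes

def with_loss_weights(generator, id_to_mask):
    """Append a weight array that is 0.0 at masked targets and 1.0 elsewhere."""
    for inputs, targets in generator:
        weights = (targets != id_to_mask).astype(np.float32)
        yield inputs, targets, weights

def toy_batches():
    # One toy batch with a single sentence of two real tokens and one pad
    yield np.array([[4, 7, PAD]]), np.array([[0, 1, PAD]])

inputs, targets, weights = next(with_loss_weights(toy_batches(), PAD))
print(weights)  # [[1. 1. 0.]]
```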

In [23]:
def train_model(NER, train_generator, eval_generator, train_steps=1, output_dir='w3model'):
    '''
    Input: 
        NER - the model you are building
        train_generator - The data generator for training examples
        eval_generator - The data generator for validation examples,
        train_steps - number of training steps
        output_dir - folder to save your model
    Output:
        training_loop - a trax supervised training Loop
    '''
    train_task = training.TrainTask(
      train_generator, 
      loss_layer = tl.CrossEntropyLoss(), 
      optimizer = trax.optimizers.adam.Adam(learning_rate=0.01),
    )

    eval_task = trax.supervised.training.EvalTask(
      labeled_data = eval_generator, 
      metrics = [tl.CrossEntropyLoss(), tl.Accuracy()],
      n_eval_batches = 10
    )

    training_loop = trax.supervised.training.Loop(
        NER, 
        train_task,
        eval_tasks = eval_task, 
        output_dir = output_dir)

    training_loop.run(n_steps = train_steps)

    return training_loop

In [26]:
train_steps = 100            
#!rm -f 'w3model/model.pkl.gz'

# Train the model
training_loop = train_model(NER(), train_generator, eval_generator, train_steps, output_dir="w3model")


Step      1: Total number of trainable weights: 1780117
Step      1: Ran 1 train steps in 1.69 secs
Step      1: train CrossEntropyLoss |  1.93273330
Step      1: eval  CrossEntropyLoss |  1.23505280
Step      1: eval          Accuracy |  0.84667937

Step    100: Ran 99 train steps in 39.54 secs
Step    100: train CrossEntropyLoss |  0.49246120
Step    100: eval  CrossEntropyLoss |  0.25739905
Step    100: eval          Accuracy |  0.93541647
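
The reported number of trainable weights can be reproduced by hand. The sketch below assumes that the LSTM layer uses a single weight matrix on the concatenated input and hidden state, producing four gate pre-activations plus a bias; under that assumption the total matches the 1780117 from the log above.

```python
vocab_size, d_model, n_units, n_tags = 35181, 50, 50, 17

embedding = vocab_size * d_model                        # one vector per vocabulary entry
lstm = (d_model + n_units) * 4 * n_units + 4 * n_units  # 4 gates: weights + biases
dense = n_units * n_tags + n_tags                       # weights + biases
total = embedding + lstm + dense

print(embedding, lstm, dense, total)
# 1759050 20200 867 1780117
```

Almost all weights sit in the embedding layer, so the vocabulary size dominates the model size.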


In [27]:
# loading in a pretrained model..
model = NER()
model.init(trax.shapes.ShapeDtype((1, 1), dtype=np.int32))

# Load the pretrained model
model.init_from_file('w3model/model.pkl.gz', weights_only=True)

((array([[-0.06427982, -0.23895922, -0.07353285, ..., -0.0216987 ,
           0.23409638, -0.11087177],
         [-0.01760263, -0.37041888,  0.52372843, ..., -0.01245554,
          -0.18482512, -0.04075526],
         [-0.03994921, -0.21592025,  0.19810744, ...,  0.09780945,
           0.06051806, -0.13982528],
         ...,
         [-0.03431689,  0.04886205, -0.08553553, ...,  0.06023186,
           0.08740468,  0.07686337],
         [-0.12607062,  0.21226813, -0.15920283, ...,  0.18507613,
           0.05739477,  0.178495  ],
         [ 0.10810461, -0.02375236,  0.21169613, ...,  0.22548877,
           0.10661811, -0.01324451]], dtype=float32),
  (((), ((), ())),
   ((array([[ 1.4821516e-01,  1.8639611e-01, -1.7188908e-01, ...,
             -2.5087228e-01, -2.8236269e-03, -3.3412048e-01],
            [-2.8480816e-01,  3.2045119e-02,  2.3914267e-01, ...,
             -5.1432490e-01, -6.4350647e-01, -5.0259823e-01],
            [ 5.7419825e-01, -1.9105087e-01,  7.7752713e-03, ...,
    

In [28]:
def evaluate_prediction(pred, labels, pad):
    """
    Inputs:
        pred: prediction array with shape 
            (num examples, max sentence length in batch, num of classes)
        labels: array of size (batch_size, seq_len)
        pad: integer representing pad character
    Outputs:
        accuracy: float
    """

    outputs = np.argmax(pred, axis=-1)  # tag index with the highest score for each token
    print("outputs shape:", outputs.shape)

    mask = labels != pad
    print("mask shape:", mask.shape, "mask[0][20:30]:", mask[0][20:30])

    # count correct predictions at non-pad positions only
    accuracy = np.sum((outputs == labels) & mask) / np.sum(mask)

    return accuracy

In [29]:
x, y = next(data_generator(len(test_sentences), test_sentences, test_labels, vocab['<PAD>']))

In [30]:
accuracy = evaluate_prediction(model(x), y, vocab['<PAD>'])
print("accuracy: ", accuracy)

outputs shape: (7194, 70)
mask shape: (7194, 70) mask[0][20:30]: [ True  True  True False False False False False False False]
accuracy:  0.9355262


In [31]:
def predict(sentence, model, vocab, tag_map):
    """Return one predicted tag name per space-separated token of `sentence`."""
    # Map tokens to vocabulary ids; unknown tokens fall back to 'UNK'
    s = [vocab[token] if token in vocab else vocab['UNK'] for token in sentence.split(' ')]
    # The model expects an integer batch of shape (1, sentence_length)
    batch_data = np.array([s]).astype(int)
    output = model(batch_data)
    # Most likely tag index for every token
    outputs = np.argmax(output, axis=2)
    # Relies on tag_map listing its tags in index order (0, 1, 2, ...)
    labels = list(tag_map.keys())
    pred = [labels[idx] for idx in outputs[0]]
    return pred

In [34]:
sentence = "Sebastians current house in Germany is beautiful. But his future house in Spain will be much better."

s = [vocab[token] if token in vocab else vocab['UNK'] for token in sentence.split(' ')]
predictions = predict(sentence, model, vocab, tag_map)
for x,y in zip(sentence.split(' '), predictions):
    if y != 'O':
        print(x,y)

Germany B-geo
Spain B-geo
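
Note that `predict` splits on spaces only, so `beautiful.` and `better.` stay fused with their periods and will most likely be mapped to `UNK`, whereas the training sentence shown earlier has the final ` .` as a separate token. A minimal sketch of a tokenizer that separates punctuation is shown below; `simple_tokenize` is a hypothetical helper, and whether it matches the tokenization of the full training data is an assumption.

```python
import re

def simple_tokenize(sentence):
    """Split into words (optionally with an apostrophe part) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

sentence = ("Sebastians current house in Germany is beautiful. "
            "But his future house in Spain will be much better.")

print(sentence.split(' ')[6])          # beautiful.
print(simple_tokenize(sentence)[6:8])  # ['beautiful', '.']
```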
