## LAB 3 : LSTMs and Sequences

#### Data Science course offered by Pavlos Protopapas

TAs : Patrick Ohiomoba, Srivatsan Srinivasan

In this lab, we look at LSTMs and their power in modeling sequences. We work through two exercises: (a) using LSTMs to tag named entities in a sentence (such as persons and geographic locations), and (b) learning sequence-to-sequence (seq-to-seq) models to perform simple addition of numbers with up to 3 digits. Beyond the modeling itself, both exercises show how data is parsed into formats suitable as inputs and outputs of Keras models.

*  For a greater understanding of LSTMs and how they are different from simple RNNs, please refer to this blog. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
*  For a greater understanding of GRUs and how they are different from simple RNNs, please refer to this blog. https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be

## EXERCISE 1 : ENTITY TAGGING IN SENTENCES

The entity tags in this dataset follow the IOB (inside-outside-beginning) tagging format:
https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
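To make the IOB scheme concrete, here is a minimal sketch that collapses a tagged token sequence into entity spans. The helper name `iob_to_spans` is hypothetical (it is not part of the lab code), and the sample tokens simply reuse tag names that appear later in this notebook.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["the", "European", "Union", "angered", "Iran"]
tags = ["O", "B-org", "I-org", "O", "B-geo"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

Here `B-org` followed by `I-org` yields the single entity "European Union", while the lone `B-geo` yields "Iran".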

### Load Data

In [0]:
#RUN THIS ONLY IF YOU RUN ON GOOGLE COLAB
from google.colab import files
uploaded = files.upload()


Saving ner_dataset.csv to ner_dataset.csv


In [0]:
import pandas as pd
import io
df2 = pd.read_csv(io.BytesIO(uploaded['ner_dataset.csv']),encoding='latin1')
#df2 = pd.read_csv('ner_dataset.csv',encoding='latin1')

In [0]:
# Forward-fill missing values (fillna(method="ffill") is deprecated in recent pandas)
df2 = df2.ffill()
df2.tail(10)
words = list(set(df2["Word"].values))
words.append("ENDPAD")
tags = list(set(df2["Tag"].values))
n_words, n_tags = len(words), len(tags)
print('Number of Words : ', len(words), ' and number of tags : ', len(tags))

Number of Words :  35179  and number of tags :  17


### Preprocessing

In the preprocessing step, we group the dataset into sentences. Within each sentence, every token is mapped to a triplet (Word, POS, Tag), where Tag is what we intend to learn.

As an example, the first sentence looks like this after preprocessing:

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]
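The grouping step can be sketched without pandas. This is a simplified illustration, assuming (as the forward-fill in the loading cell suggests) that only the first row of each sentence carries a sentence label; the function name `group_into_sentences` and the sample rows are hypothetical.

```python
def group_into_sentences(rows):
    """Forward-fill the sentence label and group rows into sentences."""
    sentences = {}  # dicts preserve insertion order in Python 3.7+
    current_label = None
    for label, word, pos, tag in rows:
        if label is not None:
            current_label = label  # a new sentence starts here
        sentences.setdefault(current_label, []).append((word, pos, tag))
    return list(sentences.values())


# Hypothetical rows: (sentence label or None, word, POS, tag).
rows = [
    ("Sentence: 1", "London", "NNP", "B-geo"),
    (None, "protests", "NNS", "O"),
    ("Sentence: 2", "British", "JJ", "B-gpe"),
    (None, "troops", "NNS", "O"),
]
grouped = group_into_sentences(rows)
print(grouped)
```

Each element of `grouped` is one sentence, given as a list of (Word, POS, Tag) triplets.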

In [0]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            # No sentence with this number: we have run out of sentences.
            return None

In [0]:
getter = SentenceGetter(df2)
print(getter.get_next())
sentences = getter.sentences

max_len = 50
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]


Let us pad the sequences so that they all have the same length.

In [0]:
from keras.preprocessing.sequence import pad_sequences
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words - 1)

y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
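What the padding call does can be sketched in plain Python. This is an illustration rather than the Keras implementation: the helper `pad_post` is hypothetical, and it assumes the Keras default `truncating="pre"`, under which sequences longer than `maxlen` lose their first items.

```python
def pad_post(sequences, maxlen, value):
    """Pad at the end up to `maxlen`; keep only the last `maxlen` items of long sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # mimics truncating="pre"
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


padded = pad_post([[4, 7], [1, 2, 3, 4, 5, 6]], maxlen=4, value=0)
print(padded)
```

The short sequence becomes `[4, 7, 0, 0]` and the long one is cut to `[3, 4, 5, 6]`. In the lab, the fill value for words is the index of "ENDPAD" and the fill value for tags is the index of "O".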

In [0]:
from keras.utils import to_categorical

y = [to_categorical(i, num_classes=n_tags) for i in y]
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)

### Model

Let us build a bidirectional LSTM here. 

<b>Discussion</b>: Why is a bidirectional LSTM useful for this sequence tagging exercise?


Before building an LSTM, we need to learn a Keras wrapper called TimeDistributed, which applies an operation independently at every timestep of a sequence.

#### TimeDistributed Example

TimeDistributed is a wrapper that applies a layer to every timestep of its input. For instance, if a feedforward layer converts a 10-dim vector to a 5-dim vector, wrapping it in TimeDistributed converts an input of shape batch_size \* sentence_len \* vector_len (=10) to an output of shape batch_size \* sentence_len \* output_len (=5).

In [0]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

model = Sequential()
# Input shape: batch_size * time_steps * input_vector_dim (fed to Dense).
# Output shape: batch_size * time_steps * output_vector_dim.
# Here Dense converts each 5-dim input vector to an 8-dim vector.
model.add(TimeDistributed(Dense(8), input_shape=(3, 5)))
input_array = np.random.randint(10, size=(1,3,5))
print("Shape of input : ", input_array.shape)
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
print("Shape of output : ", output_array.shape)


Shape of input :  (1, 3, 5)
Shape of output :  (1, 3, 8)
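The effect of the wrapper can also be checked with plain NumPy. The sketch below is not how Keras implements TimeDistributed; it only shows that applying one shared dense layer at every timestep gives the same result as a single matrix product over the last axis. The weights `W` and bias `b` are random, hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3, 5))   # batch_size * time_steps * input_dim
W = rng.normal(size=(5, 8))      # weights of one shared dense layer
b = rng.normal(size=(8,))        # bias of the same layer

# Apply the same dense layer separately at every timestep...
per_step = np.stack([x[:, t, :] @ W + b for t in range(x.shape[1])], axis=1)
# ...which matches one matrix product over the last axis.
all_at_once = x @ W + b

print(per_step.shape)
print(np.allclose(per_step, all_at_once))
```

The key point is that the same `W` and `b` are reused at every timestep, so the number of parameters does not depend on the sequence length.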


In [0]:
import numpy as np
from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words, output_dim=50, input_length=max_len)(input)
model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)  # softmax output layer

model = Model(input, out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=5, validation_split=0.1, verbose=1)

Train on 38846 samples, validate on 4317 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Let us pick a sentence from the test set and look at what our model has learned.

In [0]:
i = 2318
p = model.predict(np.array([X_te[i]]))
t = [tags[k] for k in np.argmax(y_te[i], axis=-1)]
p = np.argmax(p, axis=-1)
print("{:15} ({:5}): {}".format("Word", "True", "Pred"))
for w, true, pred in zip(X_te[i], t, p[0]):
    print("{:15}: ({:5}): {}".format(words[w], true, tags[pred]))

Word            (True ): Pred
Iran           : (B-geo): B-geo
angered        : (O    ): O
Washington     : (B-geo): B-geo
and            : (O    ): O
the            : (O    ): O
European       : (B-org): B-org
Union          : (I-org): I-org
by             : (O    ): O
resuming       : (O    ): O
uranium        : (O    ): O
conversion     : (O    ): O
this           : (O    ): O
week           : (O    ): O
after          : (O    ): O
rejecting      : (O    ): O
an             : (O    ): O
EU             : (B-org): B-org
offer          : (O    ): O
of             : (O    ): O
political      : (O    ): O
and            : (O    ): O
economic       : (O    ): O
incentives     : (O    ): O
in             : (O    ): O
return         : (O    ): O
for            : (O    ): O
giving         : (O    ): O
up             : (O    ): O
its            : (O    ): O
nuclear        : (O    ): O
program        : (O    ): O
.              : (O    ): O
ENDPAD         : (O    ): O
ENDPAD         : (O    ): 

## EXERCISE 2 : LEARNING TO ADD TWO NUMBERS (<1000) USING LSTMs



In [0]:
from __future__ import print_function
from keras.models import Sequential
from keras import layers
from keras.layers import Dense, RepeatVector, TimeDistributed
import numpy as np
from six.moves import range


#### Encode and decode sequences 

What is one-hot encoding?

One-hot encoding is an indicator encoding for discrete tokens. It removes "ordinality" effects: integer codes would suggest that some tokens are larger than, or closer to, others, whereas one-hot vectors treat all tokens as equally distinct.

As an example, the one-hot encodings of the characters 0-9 are:

*  0 - [1,0,0,0,0,0,0,0,0,0]
*  1 - [0,1,0,0,0,0,0,0,0,0]
*  ...
*  9 - [0,0,0,0,0,0,0,0,0,1]

This is the same idea as the dummy variables you use for categorical variables in regression models.
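The ordinality point can be illustrated with a few lines of plain Python. The helpers `one_hot` and `squared_distance` are hypothetical names used only for this sketch.

```python
def one_hot(index, size):
    """Return a list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# With raw integer codes, 9 looks "farther" from 0 than 1 does.
print(abs(9 - 0), abs(1 - 0))
# With one-hot vectors, every pair of distinct digits is equally far apart.
d_09 = squared_distance(one_hot(0, 10), one_hot(9, 10))
d_01 = squared_distance(one_hot(0, 10), one_hot(1, 10))
print(d_09, d_01)
```

Both one-hot distances equal 2, so the encoding carries no notion that the character 9 is "bigger" than the character 1.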

In [0]:
class CharacterTable(object):
    """Encode strings to one-hot arrays and decode them back."""
    def __init__(self, chars):
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One-hot encode string C into an array of shape (num_rows, len(chars))."""
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        """Map one-hot rows (or class indices if calc_argmax=False) back to a string."""
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[i] for i in x)


Let us set some hyperparameters and create our character set.

In [0]:
TRAINING_SIZE = 50000
DIGITS = 3
MAXOUTPUTLEN = DIGITS + 1
MAXLEN = DIGITS + 1 + DIGITS

chars = '0123456789+ '
ctable = CharacterTable(chars)

Let us write some functions to generate and preprocess training examples; the inline comments explain the procedure. Broadly, we randomly generate two numbers (< 1000), put a + sign between them, pad the resulting string with spaces to length 7, and one-hot encode it.

In [0]:
def return_random_digit():
  return np.random.choice(list('0123456789'))  
  
def generate_number():
  num_digits = np.random.randint(1, DIGITS + 1)  
  return int(''.join( return_random_digit()
                      for i in range(num_digits)))

def data_generate(num_examples):
  questions = []
  expected = []
  seen = set()
  print('Generating data...')
  while len(questions) < num_examples:
      a, b = generate_number(), generate_number()  
      # Skip pairs that were already generated (a+b and b+a count as the same pair)
      key = tuple(sorted((a, b)))
      if key in seen:
          continue
      seen.add(key)
      # Pad the data with spaces such that it is always MAXLEN.
      q = '{}+{}'.format(a, b)
      query = q + ' ' * (MAXLEN - len(q))
      ans = str(a + b)
      # Answers can be of maximum size DIGITS + 1.
      ans += ' ' * (DIGITS + 1 - len(ans))
      questions.append(query)
      expected.append(ans)
  print('Total addition questions:', len(questions))
  return questions, expected


def encode_examples(questions,answers):
  # np.bool was removed from recent NumPy releases; the builtin bool is equivalent here
  x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=bool)
  y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=bool)
  for i, sentence in enumerate(questions):
      x[i] = ctable.encode(sentence, MAXLEN)
  for i, sentence in enumerate(answers):
      y[i] = ctable.encode(sentence, DIGITS + 1)

  indices = np.arange(len(y))
  np.random.shuffle(indices)
  return x[indices],y[indices]
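As a worked example of the string formatting described above, the sketch below builds the padded question and answer for one pair of numbers. The helper `format_example` is a hypothetical name, and it uses `str.ljust` as a shorthand for the manual space padding in the lab code.

```python
def format_example(a, b, digits=3):
    """Return the padded question and answer strings for a + b."""
    maxlen = digits + 1 + digits           # two operands plus the '+' sign
    question = '{}+{}'.format(a, b).ljust(maxlen)
    answer = str(a + b).ljust(digits + 1)  # a sum can have one extra digit
    return question, answer


question, answer = format_example(12, 345)
print(repr(question), repr(answer))
```

This prints `'12+345 '` and `'357 '`: the question is always 7 characters long and the answer always 4, so every example encodes to arrays of shape (7, 12) and (4, 12).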

Generate the data and make a train/validation split. Also, let us take some time to visualize the data and interpret the dimensions.

In [0]:
q,a = data_generate(TRAINING_SIZE)
x,y = encode_examples(q,a)
split_at = len(x) - len(x) // 10
x_train, x_val, y_train, y_val = x[:split_at], x[split_at:],y[:split_at],y[split_at:]


print('Training Data shape:')
print('X : ', x_train.shape)
print('Y : ', y_train.shape)

print('Sample Question(in decoded form) : ', ctable.decode(x_train[0]),'Sample Output : ', ctable.decode(y_train[0]))

Generating data...
Total addition questions: 50000
Training Data shape:
X :  (45000, 7, 12)
Y :  (45000, 4, 12)
Sample Question(in decoded form) :  10+27   Sample Output :  37  


### Model

In the model, we follow the sequence-to-sequence approach: we build an encoder RNN and feed its final output to a decoder RNN.

To make it easy to create multiple copies of that encoder output, one for each decoder timestep, here is a useful Keras layer that we need to learn.

#### RepeatVector Example 

RepeatVector repeats its input vector a specified number of times. The shape changes from batch_size \* number of elements to batch_size \* number of repetitions \* number of elements.

In [0]:
model = Sequential()
#converts from 1*10 to 1*6
model.add(Dense(6, input_dim=10))
print(model.output_shape)
#converts from 1*6 to 1*3*6
model.add(RepeatVector(3))
print(model.output_shape) 
input_array = np.random.randint(1000, size=(1, 10))
print("Shape of input : ", input_array.shape)
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
print("Shape of output : ", output_array.shape)
# note: `None` in the printed model.output_shape values is the batch dimension
print('Input : ', input_array[0])
print('Output : ', output_array[0])

(None, 6)
(None, 3, 6)
Shape of input :  (1, 10)
Shape of output :  (1, 3, 6)
Input :  [874 891  48 263  93 654 648 101 700 986]
Output :  [[-345.1526     30.735317  -69.681915  357.6878    351.5093   -131.6774  ]
 [-345.1526     30.735317  -69.681915  357.6878    351.5093   -131.6774  ]
 [-345.1526     30.735317  -69.681915  357.6878    351.5093   -131.6774  ]]
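The same reshaping can be reproduced in NumPy, which may help when reasoning about shapes. This is only an equivalent array operation, not the Keras implementation; the input values are made up.

```python
import numpy as np

v = np.array([[1.0, 2.0, 3.0]])  # batch_size * number of elements = (1, 3)
# Insert a time axis, then copy the vector 4 times along it.
repeated = np.repeat(v[:, np.newaxis, :], 4, axis=1)

print(repeated.shape)
print(repeated[0])
```

The result has shape (1, 4, 3), and every one of the 4 rows is the same vector, just as the three output rows above are identical.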


In [0]:
#Hyperparams
RNN = layers.LSTM
HIDDEN_SIZE = 128
BATCH_SIZE = 128
LAYERS = 1

print('Build model...')
model = Sequential()
#ENCODING
model.add(RNN(HIDDEN_SIZE, input_shape=(MAXLEN, len(chars))))
model.add(RepeatVector(MAXOUTPUTLEN))
#DECODING
for _ in range(LAYERS):    
    model.add(RNN(HIDDEN_SIZE, return_sequences=True))

model.add(TimeDistributed(layers.Dense(len(chars), activation='softmax')))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 128)               72192     
_________________________________________________________________
repeat_vector_7 (RepeatVecto (None, 4, 128)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 4, 128)            131584    
_________________________________________________________________
time_distributed_4 (TimeDist (None, 4, 12)             1548      
Total params: 205,324
Trainable params: 205,324
Non-trainable params: 0
_________________________________________________________________


#### Train and validate

After training for 20 epochs, we check how the model performs on 20 random addition examples from the validation set.

In [0]:
# Train the model and show predictions against the validation
# dataset.
for iteration in range(1, 2):
    print()  
    model.fit(x_train, y_train,
              batch_size=BATCH_SIZE,
              epochs=20,
              validation_data=(x_val, y_val))
    # Select 20 samples from the validation set at random so we can visualize
    # errors.
    print('Finished iteration ', iteration)
    numcorrect = 0
    numtotal = 20
    
    for i in range(numtotal):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
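        # predict_classes was removed from recent Keras releases;
        # taking the argmax of the softmax output gives the same class indices
        preds = np.argmax(model.predict(rowx, verbose=0), axis=-1)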
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=False)
        print('Question', q, end=' ')
        print('True', correct, end=' ')
        print('Guess', guess, end=' ')
        if guess == correct :
          print('Good job')
          numcorrect += 1
        else:
          print('Fail')
         
        # Running score so far, as a percentage of all numtotal examples
        print('The model scored ', numcorrect*100/numtotal,' % in its test.')
        


Train on 45000 samples, validate on 5000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Finished iteration  1
Question 892+14  True 906  Guess 906  Good job
The model scored  5.0  % in its test.
Question 322+454 True 776  Guess 776  Good job
The model scored  10.0  % in its test.
Question 394+381 True 775  Guess 785  Fail
The model scored  10.0  % in its test.
Question 27+892  True 919  Guess 919  Good job
The model scored  15.0  % in its test.
Question 39+43   True 82   Guess 82   Good job
The model scored  20.0  % in its test.
Question 33+481  True 514  Guess 514  Good job
The model scored  25.0  % in its test.
Question 86+427  True 513  Guess 513  Good job
The model scored  30.0  % in its test.
Question 641+3   True 644  Guess 644  Good job
The model scored  35.0  % in its test.
Question 506+7
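Note that the check above counts an answer as correct only if every character matches, whereas the Keras `accuracy` metric for this model is, as far as the per-timestep softmax output suggests, averaged over individual characters. The difference can be sketched as follows; the function names are hypothetical and the two answers are taken from the sample output above.

```python
def exact_match(truths, guesses):
    """Fraction of answers that are right in every position."""
    return sum(t == g for t, g in zip(truths, guesses)) / len(truths)


def char_accuracy(truths, guesses):
    """Fraction of individual characters (padding included) that match."""
    pairs = [(tc, gc) for t, g in zip(truths, guesses) for tc, gc in zip(t, g)]
    return sum(tc == gc for tc, gc in pairs) / len(pairs)


# Two padded answers: one fully right, one wrong by a single digit.
truths = ['906 ', '775 ']
guesses = ['906 ', '785 ']
print(exact_match(truths, guesses), char_accuracy(truths, guesses))
```

Here the exact-match score is 0.5 while the character-level accuracy is 0.875, so a high training accuracy does not by itself mean that most sums are fully correct.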

### EXERCISE

1. Try changing the hyperparameters, using other RNN types or more layers, and check whether increasing the number of epochs helps.

2. Try reversing the operands of examples from the validation set and check whether the model has learned the commutative property of addition. Try printing the hidden layer output for two commutative inputs and check whether the learned hidden representations are the same or similar. Do we expect this to be true? If so, why? If not, why not? You can access a layer by index with model.layers, and layer.output gives the output of that layer.

3. (TAKE-HOME, cannot be completed within the class) Try doing addition in the RNN model the same way we do it by hand. Reverse the order of the digits and, at each time step, input two digits (units in the first time step, tens in the second time step, etc.).
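As a possible starting point for exercises 2 and 3, here are two small string helpers. They are only a sketch: the names `swap_operands` and `digit_pairs` are hypothetical, and zero-padding the shorter number in exercise 3 is an assumption, not something the lab prescribes.

```python
def swap_operands(question, maxlen=7):
    """Turn a space-padded 'a+b' into 'b+a', padded to the same length."""
    a, b = question.strip().split('+')
    return '{}+{}'.format(b, a).ljust(maxlen)


def digit_pairs(a, b, digits=3):
    """Pair up the digits of a and b, units first, zero-padding the shorter number."""
    a_rev = str(a).zfill(digits)[::-1]
    b_rev = str(b).zfill(digits)[::-1]
    return list(zip(a_rev, b_rev))


print(repr(swap_operands('27+892 ')))
print(digit_pairs(27, 892))
```

The first call gives `'892+27 '`, which can be encoded with the same character table as the original question. The second gives `[('7', '2'), ('2', '9'), ('0', '8')]`, one pair of digits per timestep, starting with the units.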
