# RECURRENT LAYERS

* How to create embeddings that convey the meaning of variable-length phrases and sentences
* Sentences of different lengths naively produce vectors of different lengths, which makes vector comparison tricky
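A minimal sketch of the problem, assuming made-up 3-dimensional embeddings (the `toy_embeddings` values are hypothetical, not taken from a trained model). Concatenating word vectors gives a length that depends on the sentence, whereas averaging always gives a vector of the embedding size:

```python
import numpy as np

# Hypothetical 3-dimensional word embeddings, for illustration only
toy_embeddings = {
    'this':  np.array([0.1, 0.2, 0.3]),
    'movie': np.array([0.4, 0.5, 0.6]),
    'rocks': np.array([0.7, 0.8, 0.9]),
    'great': np.array([1.0, 1.1, 1.2]),
}

short_sentence = ['great', 'movie']
long_sentence = ['this', 'movie', 'rocks']

# Concatenation: the vector length depends on the number of words
concat_short = np.concatenate([toy_embeddings[w] for w in short_sentence])
concat_long = np.concatenate([toy_embeddings[w] for w in long_sentence])
print(concat_short.shape, concat_long.shape)

# Averaging: the vector length is always the embedding size
mean_short = np.mean([toy_embeddings[w] for w in short_sentence], axis=0)
mean_long = np.mean([toy_embeddings[w] for w in long_sentence], axis=0)
print(mean_short.shape, mean_long.shape)
```

Only the averaged vectors can be compared directly, for example with a dot product.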

# Averaging word embeddings

* very effective and powerful for capturing complex relationships between words
* when we average word embeddings, average shapes/curves remain
* similar words have similarities to their shapes (see image in the book)   
<br/>   

* In the following example, the reviews' vector representations capture statistical information such that positive and negative embeddings cluster together

In [None]:
# RUN THIS CODE AT THE END OF THE PREVIOUS CHAPTER'S NOTEBOOK

from collections import Counter
import numpy as np

norms = np.sum(weights_0_1 * weights_0_1, axis=1)
norms.resize(norms.shape[0], 1)
normed_weights = weights_0_1 * norms

def make_sentence_vector(words):
    indices = list(map(lambda x: vocab_index.get(x), \
              filter(lambda x: x in vocab_index, words)))
    return np.mean(normed_weights[indices], axis=0)

reviews2vectors = list()
for review in tokens:
    reviews2vectors.append(make_sentence_vector(review))
reviews2vectors = np.array(reviews2vectors)

def most_similar_reviews(review):
    v = make_sentence_vector(review)
    
    scores = Counter()
    for index, value in enumerate(reviews2vectors.dot(v)):
        scores[index] = value
    
    most_similar = list()
    for index, score in scores.most_common(3):
        most_similar.append(raw_reviews[index][0:80])
    
    return most_similar

print(most_similar_reviews(['boring', 'awful']))
print()
print(most_similar_reviews(['nice', 'good']))

# OUTPUT:
#
#  ['comment this movie is impossible  is terrible  very improbable  bad interpretati', 
#   'horrible waste of time   bad acting  plot  directing  this is the most boring mo',
#   'this movie stinks  the stench resembles bad cowpies that sat in the sun too long']
#
#  ['this is actually one of my favorite films  i would recommend that everyone watch',
#   'this movie is terrible but it has some good effects ', 
#   'malcolm mcdowell has not had too many good movies lately and this is no differen']

## How the network uses embeddings

* It detects the curves that have correlation with a target label
* **Bag of words**: sentence embeddings are a **sum or average** of the characteristics of their words
* **Pros**:
  * if a sequence has repeating patterns, the sentence vector retains the most dominant patterns across the word vectors being summed
* **Cons**:
  * if a sequence is too long, the sentence vector will average out to a **straight line** (vector of near-0s)
  * **order** becomes irrelevant (ex. "Yankees defeat Red Sox" vs "Red Sox defeat Yankees")
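The loss of word order can be checked directly. This is a small sketch with hypothetical word vectors (the values in `vectors` are invented for illustration): both orderings of the sentence average to the same vector.

```python
import numpy as np

# Hypothetical word vectors, for illustration only
vectors = {
    'yankees': np.array([1.0, 0.0, 2.0]),
    'defeat':  np.array([0.0, 3.0, 1.0]),
    'red':     np.array([2.0, 1.0, 0.0]),
    'sox':     np.array([1.0, 1.0, 1.0]),
}

def bag_of_words(words):
    """Average the word vectors; word order plays no role."""
    return np.mean([vectors[w] for w in words], axis=0)

a = bag_of_words(['yankees', 'defeat', 'red', 'sox'])
b = bag_of_words(['red', 'sox', 'defeat', 'yankees'])

# Both sentences map to the same embedding
print(np.allclose(a, b))  # True
```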

# RNN: Learning word order in embeddings

* **Identity matrix**: a square matrix of 0s with 1s on the diagonal
```
[1, 0, 0] 
[0, 1, 0] 
[0, 0, 1]
```
* Multiplied with any vector, it returns the original vector   


In [None]:
import numpy as np

this  = np.array([ 2,  4,  6])
movie = np.array([10, 10, 10])
rocks = np.array([ 1,  1,  1])

identity = np.eye(3)

print(identity)
print(this + movie + rocks)
print( (this.dot(identity) + movie).dot(identity) + rocks )

* Instead of directly summing or averaging, we multiply the running result by a matrix before adding each new word vector. For the sentence `this movie rocks`:

  * multiply the `this` vector by the matrix
  * add the output to the `movie` vector
  * multiply the output by the matrix
  * add the output to the `rocks` vector

```
(this.dot(identity) + movie).dot(identity) + rocks
```
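With the identity matrix, these steps reduce to a plain sum. A sketch of what changes with another matrix, assuming a hypothetical transition matrix `halving` (chosen only for illustration, not learned): earlier words pass through the matrix more often, so the result depends on word order.

```python
import numpy as np

this  = np.array([ 2.,  4.,  6.])
movie = np.array([10., 10., 10.])
rocks = np.array([ 1.,  1.,  1.])

def encode(word_vectors, transition):
    """Multiply the running state by the matrix, then add the next word."""
    state = word_vectors[0]
    for vec in word_vectors[1:]:
        state = state.dot(transition) + vec
    return state

identity = np.eye(3)
halving = 0.5 * np.eye(3)  # hypothetical non-identity transition matrix

# Identity: same as a plain sum, so word order is lost
same_as_sum = np.allclose(encode([this, movie, rocks], identity),
                          this + movie + rocks)
print(same_as_sum)

# Non-identity: the two word orders give different sentence vectors
forward = encode([this, movie, rocks], halving)
backward = encode([rocks, movie, this], halving)
print(forward)
print(backward)
```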

## Learning transition matrices

* Which matrix to use as transition? The network learns it:
  * learn useful word vectors
  * learn useful modifications to the transition matrices
* The method:
  * creating a sentence embedding
  * using it to predict
  * then modifying the parts that formed the sentence embedding to make this prediction more accurate
* Transition matrix starts as an identity matrix, then it is modified:
  * during training, we backpropagate gradients into it and update it to help the network make better predictions


#### 3 sets of weights

In [None]:
import numpy as np

def softmax(x_):
    x = np.atleast_2d(x_)
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

# Dictionary of word embeddings
word_vects = {}
word_vects['yankees'] = np.array([[0., 0., 0.]])
word_vects['bears']   = np.array([[0., 0., 0.]])
word_vects['braves']  = np.array([[0., 0., 0.]])
word_vects['red']     = np.array([[0., 0., 0.]])
word_vects['sox']     = np.array([[0., 0., 0.]])
word_vects['lose']    = np.array([[0., 0., 0.]])
word_vects['defeat']  = np.array([[0., 0., 0.]])
word_vects['beat']    = np.array([[0., 0., 0.]])
word_vects['tie']     = np.array([[0., 0., 0.]])

# Identity matrix (transition weights)
identity = np.eye(3)

# Classification layer:
# a weight matrix to predict the next word given a sentence vector of length 3
sent2output = np.random.rand(3, len(word_vects))


#### Forward propagation

In [None]:
# Creating a sentence embedding
layer_0 = word_vects['red']
layer_1 = layer_0.dot(identity) + word_vects['sox']
layer_2 = layer_1.dot(identity) + word_vects['defeat']

# Predicting over all vocabulary
pred = softmax(layer_2.dot(sent2output))
print(pred)

#### Backpropagation

* Once `layer_2_delta` is generated, we backpropagate it twice:
  * once across the identity matrix to create `layer_1_delta`
  * and once directly to `word_vects['defeat']`

In [None]:
# The target: the one-hot vector for 'yankees'
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0])

# Generating layer_2_delta

pred_delta    = pred - y
layer_2_delta = pred_delta.dot(sent2output.T)  # sent2output: weight matrix

# Backpropagating layer_2_delta across word_vects

defeat_delta  = layer_2_delta * 1   # can ignore the '1' (see chap. 11)
layer_1_delta = layer_2_delta.dot(identity.T)
sox_delta     = layer_1_delta * 1   # idem
layer_0_delta = layer_1_delta.dot(identity.T)

alpha = 0.01
word_vects['red']    -= alpha * layer_0_delta
word_vects['sox']    -= alpha * sox_delta
word_vects['defeat'] -= alpha * defeat_delta

# Backpropagating layer_2_delta across the identity matrix

identity    -= alpha * np.outer(layer_0, layer_1_delta)
identity    -= alpha * np.outer(layer_1, layer_2_delta)
sent2output -= alpha * np.outer(layer_2, pred_delta)


#### Training on a toy corpus: bAbI dataset

* synthetically generated question-answer corpus to teach machines how to answer simple questions about an environment
* contains a variety of simple statements and questions
* each question is followed by the correct answer
* we train the network to finish each sentence when given one or more starting words

In [None]:
# In the terminal
# wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-1.tar.gz
# tar -xvf tasks_1-20_v1-1.tar.gz

import sys
import random
import math
from collections import Counter
import numpy as np

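with open('/Users/macbook/code/_dataset_thespermwhale/tasksv11/en/qa1_single-supporting-fact_train.txt', 'r') as f:
    raw = f.readlines()  # list of lines; string methods are applied per line below

sentences = list()
for line in raw[0:1000]:
    # lowercase, drop the newline, split on spaces, drop the leading line number
    sentences.append(line.strip().lower().replace('\n', '').split(' ')[1:])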

for sentence in sentences[0:3]:
    print(sentence)
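To see what this tokenization yields without downloading the data, here is a sketch on three in-memory lines. The `sample_raw` lines are assumed to mirror the file layout: a line number, then either a statement or a question followed by a tab-separated answer and supporting-fact index.

```python
# Sample lines assumed to mirror the layout of the bAbI qa1 file
sample_raw = [
    '1 Mary moved to the bathroom.\n',
    '2 John went to the hallway.\n',
    '3 Where is Mary? \tbathroom\t1\n',
]

sample_sentences = []
for line in sample_raw:
    # lowercase, drop the newline, split on spaces, drop the line number
    sample_sentences.append(line.lower().replace('\n', '').split(' ')[1:])

for sentence in sample_sentences:
    print(sentence)
```

Punctuation stays attached to the words, and the answer to a question ends up in a single token together with its tabs.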


## Complete example

In [1]:
import sys
import os
import re
import itertools
import numpy as np

#### Loading and encoding data

In [2]:
with open('/Users/macbook/code/_dataset_thespermwhale/tasksv11/en/qa1_single-supporting-fact_train.txt', 'r') as f:
    raw = f.readlines()

sentences = list()
for line in raw[0:100]:
    sentences.append(line.strip().lower().replace('\n', '').split(' ')[1:])
        
vocab = set()
for sentence in sentences:
    for word in sentence:
        vocab.add(word)
vocab = list(vocab)

vocab_index = {}
for index, word in enumerate(vocab):
    vocab_index[word] = index

print(f'Vocab {len(vocab_index)} Sentences {len(sentences)}')

Vocab 44 Sentences 100


#### Network parameters

In [3]:
def sentence2sequence(sentence):
    sequence = np.array([vocab_index.get(word) for word in sentence if len(word) > 0], dtype=int)
    return sequence

def softmax(x):
    '''Used to predict'''
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

np.random.seed(1)
alpha = 0.001
iterations = 30000
embed_size = 10

word_embeddings = (np.random.rand(len(vocab), embed_size) - 0.5) * 0.1  # word embeddings
sentence_embedding = np.zeros(embed_size)     # initial (empty) sentence embedding
recurrent_embedding = np.eye(embed_size)   # transition matrix: embedding -> embedding (starts as identity)
decoder = (np.random.rand(embed_size, len(vocab)) - 0.5) * 0.1  # weight matrix: embedding -> output weights
one_hot = np.eye(len(vocab))     # utility matrix: one-hot lookups (for the loss function)


#### PREDICT: Forward propagation and prediction with arbitrary length

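* same procedure as before: summing embeddings through a transition matrix called `recurrent_embedding` (initialized to the identity), starting from the empty sentence embedding `sentence_embedding` (initialized to all 0s)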
* only difference: `layers` is a new way to forward propagate 
  * we can't use static layers anymore
  * we need more forward propagations if the length of `sent` is larger
  * we append new layers to the list based on the number of forward propagations needed
* instead of predicting only the last word, we make a prediction `layer['pred']` at every step, based on the embedding generated by the previous words
  * more efficient than doing forward propagation from the beginning for each new prediction

In [4]:
def predict(sequence):
    
    layers = list()  # one per word in the sequence: 'pred' + 'hidden'
    layer = {}
    layer['hidden'] = sentence_embedding
    layers.append(layer)
    #print('START:\n', layer['hidden'])
    
    preds = list()  # forward propagation
    loss = 0
    
    for target_i in range(len(sequence)):
        layer = {}
        
        # predict next word using softmax: hidden * weights
        layer['pred'] = softmax(layers[-1]['hidden'].dot(decoder))
        
        loss += -np.log(layer['pred'][sequence[target_i]])
        
        # generates the next hidden state
        layer['hidden'] = layers[-1]['hidden'].dot(recurrent_embedding) \
                        + word_embeddings[sequence[target_i]]
        
        layers.append(layer)
        
     #   print(f'TOKEN: {sequence[target_i]}')
        for key, value in layer.items():
            print(key, ':\n', value)
     #   print(f'LOSS: {loss}')
    
    return layers, loss

#### COMPARE: Backpropagation with arbitrary length

In [5]:
def backpropagation(layers):
    
    for layer_idx in reversed(range(len(layers))):  # backpropagates
        layer = layers[layer_idx]
        target = sequence[layer_idx-1]
        print('-----', layer_idx, target)
        
        if (layer_idx > 0):
            layer['output_delta'] = layer['pred'] - one_hot[target]
            print('output_delta:', layer['output_delta'].shape, len(vocab_index))
            print(layer['output_delta'])
            
            new_hidden_delta = layer['output_delta'].dot(decoder.transpose())
            print('new_hidden_delta:', new_hidden_delta.shape)
            print(new_hidden_delta)
            
            if (layer_idx == len(layers)-1):  # last layer: no gradient from a later hidden state
                layer['hidden_delta'] = new_hidden_delta
            else:
                layer['hidden_delta'] = new_hidden_delta \
                                        + layers[layer_idx+1]['hidden_delta'] \
                                                    .dot(recurrent_embedding.transpose())
            print('hidden_delta:', layer['hidden_delta'].shape)
            print(layer['hidden_delta'])
                
        else:  # first layer
            layer['hidden_delta'] = layers[layer_idx+1]['hidden_delta'] \
                                                    .dot(recurrent_embedding.transpose())

#### LEARN: update weights

In [6]:
def weight_update(sequence, layers, sentence_embedding, decoder, word_embeddings, recurrent_embedding):
    
    sentence_embedding -= layers[0]['hidden_delta'] * alpha / float(len(sequence))  # updating weights
    
    for layer_idx, layer in enumerate(layers[1:]):
        print('---', layer_idx)
        
        decoder -= np.outer(layers[layer_idx]['hidden'], \
                            layer['output_delta']) \
                 * alpha / float(len(sequence))
        print('decoder:', decoder.shape)
        print(decoder)
        
        embed_idx = sequence[layer_idx]
        word_embeddings[embed_idx] -= layers[layer_idx]['hidden_delta'] \
                                      * alpha / float(len(sequence)) 
        print('word_embedding:', word_embeddings.shape, word_embeddings[embed_idx].shape)
        print(word_embeddings[embed_idx].shape)
        
        recurrent_embedding -= np.outer(layers[layer_idx]['hidden'], \
                                     layer['hidden_delta']) \
                               * alpha / float(len(sequence))
        print('recurrent_embedding:', recurrent_embedding.shape)
        print(recurrent_embedding)

#### Training

* **perplexity**: 
  * measures the **difference between two probability distributions**: the network's prediction and the true distribution
  * in this example, the perfect probability distribution assigns 100% to the correct term and 0% elsewhere
  * computed as the probability of the correct label (word), passed through a log function, negated, and exponentiated ($e^x$); here the negated log is averaged over the words of the sequence before exponentiating
  * **high** when the probability distributions don't match, **low** (close to 1) when they do match
  * decreasing perplexity is a good thing
  * means the network is learning to predict probabilities that match the data
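A minimal sketch of that computation. The helper `perplexity` is a name introduced here for illustration; it takes the probabilities the network assigned to the correct words and mirrors `np.exp(loss / len(sequence))` from the training loop.

```python
import numpy as np

def perplexity(correct_word_probs):
    """exp of the mean negative log-probability of the correct words."""
    probs = np.asarray(correct_word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(probs))))

perfect = perplexity([1.0, 1.0, 1.0])     # all probability on the correct word
uniform = perplexity([0.25, 0.25, 0.25])  # uniform guess over 4 words
mixed = perplexity([0.9, 0.5, 0.1])       # one poorly predicted word

print(perfect, uniform, mixed)
```

A perfect prediction gives a perplexity of 1, and a uniform guess over `n` words gives a perplexity of about `n`.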

In [None]:
for iteration in range(iterations):  # forward
    
    sequence = sentence2sequence(sentences[iteration % len(sentences)][1:])
    print('----------------------------')
    print('SEQUENCE: ', sequence)
    
    print('-- @predict() --')
    layers, loss = predict(sequence)                                                          # predict
    print('-- @backpropagation() --')
    backpropagation(layers)                                                                   # compare
    print('-- @weight_update() --')
    weight_update(sequence, layers, sentence_embedding, decoder, word_embeddings, recurrent_embedding)   # learn
    
    if (iteration % 5000 == 0) or (iteration == iterations-1):
        perplexity = np.exp(loss / len(sequence))
        print(f'{iteration:5} Perplexity {perplexity:.5f}')

#### Testing

In [None]:
seq_index = 4
l, _ = predict(sentence2sequence(sentences[seq_index]))
print(sentences[seq_index])

for i, each_layer in enumerate(l[1:-1]):
    prev_input = sentences[seq_index][i]  # renamed to avoid shadowing the built-in input()
    target = sentences[seq_index][i+1]
    pred = vocab[each_layer['pred'].argmax()]
    print(f'Prev input: {prev_input}' + (' ' * (12 - len(prev_input)))  + \
          f' Target: {target}'   + (' ' * (15 - len(target))) + \
          f' Pred: {pred}')
