# Amphi 6 - Recurrent Neural Networks

# 1. First Example: Named Entity Recognition

Recurrent Neural Networks have many applications in:

- Natural Language Processing (Speech Recognition, Text Generation, Sentiment Classification, Translation, Named Entity Recognition)
- Bioinformatics (DNA sequence analysis)
- Video activity recognition
- Time series

## 1.1 The Problem

In the Named Entity Recognition problem, we want to identify the named entities in a sentence or paragraph. An example input and output are given below:

Input: `ha noi is the capital of vietnam` (it can be in speech or text form)

Output: (1, 1, 0, 0, 0, 0, 1)

## 1.2 Notation

- The $i$-th example (data point) is denoted by $x^{(i)}, y^{(i)}$.
- The $t^{th}$ word of $x^{(i)}$ is denoted by $x^{(i)<t>}$. So if $x^{(1)} = $`ha noi is the capital of vietnam`, it can be rewritten as
    $$x^{(1)} = (x^{(1)<1>}, x^{(1)<2>}, \ldots, x^{(1)<7>})$$ where $x^{(1)<1>} = $`ha`, $\ldots, x^{(1)<7>} = $ `vietnam`,
and
    $$y^{(1)} = (y^{(1)<1>}, y^{(1)<2>}, \ldots, y^{(1)<7>})$$ where $y^{(1)<1>} = 1, \ldots, y^{(1)<7>} =1$.
- We denote the length of $x^{(i)}$ by $T_x^{(i)}$. In this example $T_x^{(1)} = 7$. Similarly, $T_y^{(i)}$ is the length of $y^{(i)}$. So $T_y^{(1)} = T_x^{(1)} = 7$ in the example.

**Vocabulary**

We introduce a **vocabulary**, which is a list of all possible words (in some context). For example, 
<center>
$V = $`['a', 'abs', ..., 'capital', ..., 'is', ..., 'vietnam', ..., 'zebra']`
</center>

Suppose that the positions of those words in the vocabulary are
<center>
    [1, 2, ..., 3001, ..., 7645, ..., 16999, ..., 20000]
</center>

then, using one-hot encoding, one can rewrite
$$
x^{(i)<j>} = \mathbf I_{pos} = \begin{pmatrix}
0 \\
\ldots\\
0 \\
1 \\
0 \\
\ldots\\
0
\end{pmatrix}
$$

where $pos$, the position of the 1, is the index of the associated word in the vocabulary. This is a $D$-dimensional vector, where $D$ is the size of the vocabulary.

With our example, we have:
$$
x^{(1)<3>} = \mathbf I_{7645}, x^{(1)<7>} = \mathbf I_{16999}
$$
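The one-hot encoding above can be sketched in a few lines. This is only an illustration: the seven-word `toy_vocabulary` and the helper `one_hot` are hypothetical (the example in the text assumes a 20000-word vocabulary), and positions are 0-based, as is usual in Python.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted alphabetically; the real one would hold D = 20000 words.
toy_vocabulary = ['capital', 'ha', 'is', 'noi', 'of', 'the', 'vietnam']
word_to_index = {word: idx for idx, word in enumerate(toy_vocabulary)}

def one_hot(word, word_to_index):
    """Return the D-dimensional one-hot vector of `word` (0-based position)."""
    vector = np.zeros(len(word_to_index), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

sentence = 'ha noi is the capital of vietnam'.split()
x = np.array([one_hot(word, word_to_index) for word in sentence])
print(x.shape)   # (T_x, D)
print(x[2])      # one-hot vector of the third word, 'is'
```

Each row of `x` plays the role of one $x^{(1)<t>}$; with a real vocabulary the rows would simply be much longer.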

We also denote the target output dimension by $K$. In our example, $K=1$, since the output at each position is a single probability (a real number between 0 and 1).

## 1.3 Problems with FNN/CNN

- Inputs and outputs can have different lengths in different examples.
- They do not share features learned across different positions in the text.

## 1.4 RNN Model for Named Entity Recognition

<img src="F3.png" width="900">

We introduce latent variables
$$
a^{(i)<0>}, \ldots, a^{(i)<T_x^{(i)}>}
$$

which are vectors of dimension $A$.

For short, we drop the superscript $(i)$ (the index of the example) and write
$$
a^{<0>}, \ldots, a^{<T_x>}
$$

$a^{<t>}$ summarizes what has been learned from the current input $x^{<t>}$ and from its own previous value $a^{<t-1>}$, so it acts as a history of the word stream. This history is then combined with the next input $x^{<t+1>}$ to predict the next output $y^{<t+1>}$.


The model supposes the following relation between $a^{<t>}, x^{<t>}, y^{<t>}$:

$$
a^{<t>} = g_a (W_{a}a^{<t-1>} + W_{x}x^{<t>} + b_a)
$$

$$
\hat y^{<t>} = g_y(W_{y}a^{<t>} + b_y)
$$

where $W_{a}, W_{x}, W_{y}$ are $A\times A$, $A\times D$ and $K\times A$ matrices, $b_a, b_y$ are $A$-dimensional and $K$-dimensional vectors, and $g_a, g_y$ are activation functions. $g_a$ is usually the $\tanh$ function, while $g_y$ depends on the output. None of $W_{a}, W_{x}, W_{y}, b_a, b_y, g_a, g_y$ depends on $t$.

We write
$$
\mathbf W_a = \begin{pmatrix} 
W_a^{1,1} & \ldots & W_a^{1,A} & W_x^{1,1} & \ldots & W_x^{1,D}\\ 
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
W_a^{A,1} & \ldots & W_a^{A,A} & W_x^{A,1} & \ldots & W_x^{A,D} 
\end{pmatrix}
$$

or
$$
\mathbf W_a = (W_a | W_x)
$$

and
$$
\mathbf W_y = W_y
$$

then the relations can be rewritten
$$
a^{<t>} = g_a (\mathbf W_{a}[a^{<t-1>}, x^{<t>}] + b_a)
$$

$$
\hat y^{<t>} = g_y(\mathbf W_{y}a^{<t>} + b_y)
$$

Therefore, $\mathbf W_a$ is an $A\times (A+D)$ matrix and $\mathbf W_y$ is a $K \times A$ matrix.
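A minimal NumPy sketch of one time step may help to check the shapes. The sizes `A`, `D`, `K`, the random weights and the helper `rnn_step` are assumptions made for illustration, with $g_a = \tanh$ and $g_y$ taken to be the sigmoid; the sketch also checks that the stacked form $(W_a | W_x)[a^{<t-1>}, x^{<t>}]$ gives the same result as $W_a a^{<t-1>} + W_x x^{<t>}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A, D, K = 4, 6, 1                  # hypothetical small sizes for illustration

W_a = rng.normal(size=(A, A))      # acts on a^{<t-1>}
W_x = rng.normal(size=(A, D))      # acts on x^{<t>}
W_y = rng.normal(size=(K, A))
b_a = np.zeros(A)
b_y = np.zeros(K)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(a_prev, x_t):
    """One time step: returns (a^{<t>}, y_hat^{<t>}) with g_a = tanh, g_y = sigmoid."""
    a_t = np.tanh(W_a @ a_prev + W_x @ x_t + b_a)
    y_hat_t = sigmoid(W_y @ a_t + b_y)
    return a_t, y_hat_t

# Stacked form: (W_a | W_x) applied to the concatenation [a^{<t-1>}, x^{<t>}].
W_stacked = np.hstack([W_a, W_x])  # shape (A, A + D)

a_prev = rng.normal(size=A)
x_t = np.zeros(D)
x_t[2] = 1.0                       # a one-hot input
a_t, y_hat_t = rnn_step(a_prev, x_t)
a_t_stacked = np.tanh(W_stacked @ np.concatenate([a_prev, x_t]) + b_a)

print(W_stacked.shape)
print(np.allclose(a_t, a_t_stacked))
```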

## 1.5 The Loss Function

The loss function is defined as usual: for example, MSE for regression, and binary or categorical cross-entropy for classification. In our example, we can define the loss at step $t$:

$$
L^{<t>}(y^{<t>}, \hat y^{<t>}) = -y^{<t>}\log \hat y^{<t>} - (1-y^{<t>})\log (1-\hat y^{<t>}) 
$$

and the overall loss
$$
L(y, \hat y) = \sum_{t=1}^{T_x} L^{<t>} (y^{<t>}, \hat y^{<t>})
$$

We want to find $\mathbf W_a, \mathbf W_y, b_a, b_y$ so that this quantity is small. Given a training set, we minimize the sum of these overall losses over the set. This can be done with various optimizers, with gradients computed by **backpropagation through time**.
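As a small numerical illustration of the overall loss, the sketch below sums the binary cross-entropy over the time steps of one sequence. The function name, the clipping constant `eps` and the two prediction vectors are assumptions chosen for the example; only the labels come from the `ha noi ...` sentence above.

```python
import numpy as np

def sequence_binary_crossentropy(y, y_hat, eps=1e-12):
    """Sum over t of -y log(y_hat) - (1 - y) log(1 - y_hat)."""
    y = np.asarray(y, dtype=float)
    # Clip to avoid log(0) when a prediction is exactly 0 or 1.
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    per_step = -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)
    return per_step.sum()

y_true = [1, 1, 0, 0, 0, 0, 1]                 # labels of the "ha noi ..." example
y_good = [0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.9]   # hypothetical confident predictions
y_flat = [0.5] * 7                             # hypothetical uninformative predictions

loss_good = sequence_binary_crossentropy(y_true, y_good)
loss_flat = sequence_binary_crossentropy(y_true, y_flat)
print(loss_good, loss_flat)
```

The confident, correct predictions give the smaller loss, which is the behaviour the optimizer exploits.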

## 1.6 Simple RNN

Simple RNN is a simpler version of RNN where $A = K$, $\mathbf W_y = \mathbf {Id} $, $b_y = 0$, i.e.,

$$
a^{<t>} = g_a(\mathbf W_{a}[a^{<t-1>}, x^{<t>}] + b_a)
$$

and
$$
\hat y^{<t>} = g_y(a^{<t>})
$$

Usually, $g_a$ is $\tanh$ and $g_y$ depends on the problem ($id$, $sigmoid$, $softmax$).

<img src="F14.png" width=600>

The RNN of section 1.4 can be viewed as a Simple RNN followed by a dense layer with an activation.

## 1.7 Implementation

The vocabulary = English alphabet = [a, b, c, $\ldots$, z]

Input: Some text sequence like `azmbnckedsafkasdjfhasdl`

Named-entity rule (for example): a pattern of two vowels between two consonants, like `baec`

In [1]:
import string
import numpy as np
np.random.seed(1)

VOCABULARY = string.ascii_lowercase
VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvxyz"  # 20 letters: 'w' is not in this list, so it never appears in the generated text
LENGTH_LOWER = 50
LENGTH_UPPER = 50 + 1
VOCABULARY_SIZE = len(VOCABULARY)
VOWEL_SIZE = len(VOWELS)
CONSONANT_SIZE = len(CONSONANTS)
NUMERIZER = {'c': 0, 'v': 0, 'C': 1, 'V': 1}

**Generate random text**

In [2]:
def generateRandomText(vowelProba = 0.3):
    randomLength = np.random.randint(LENGTH_LOWER, LENGTH_UPPER)
    randomText = ""
    randomPattern = ""
    for i in range(randomLength):
        alpha = np.random.binomial(1, vowelProba)
        if alpha == 1:
            randomLetterIndex = np.random.randint(0, VOWEL_SIZE)
            randomLetter = VOWELS[randomLetterIndex]
            randomPattern += "v"
        else:
            randomLetterIndex = np.random.randint(0, CONSONANT_SIZE)
            randomLetter = CONSONANTS[randomLetterIndex]
            randomPattern += "c"
        randomText += randomLetter
    return randomText, randomPattern

In [3]:
text, pattern = generateRandomText()
text, pattern

('qetcryinsmbeeuatlfxgpvmbiimgzatknbcfrjokpgrabunaui',
 'cvccccvccccvvvvcccccccccvvcccvccccccccvccccvcvcvvv')

**Get output (1 for named entity, 0 for others)**

In [4]:
def getNameEntityOutput(pattern):
    # Mark every non-overlapping 'cvvc' occurrence in upper case
    newPattern = pattern.replace('cvvc', 'CVVC')
    # Then extend to occurrences that share a consonant with an already marked one
    newPattern2 = newPattern.replace('CvvC', 'CVVC').replace('Cvvc', 'CVVC')
    while newPattern2 != newPattern:
        newPattern = newPattern2
        newPattern2 = newPattern.replace('CvvC', 'CVVC').replace('Cvvc', 'CVVC')
    # Upper-case letters map to 1 (entity), lower-case letters to 0
    return [NUMERIZER[letter] for letter in newPattern2]

In [5]:
output = getNameEntityOutput(pattern)
print(output)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
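As a cross-check of the labelling rule, the same labels can be obtained with a regular-expression lookahead, which finds overlapping occurrences directly. This is an alternative sketch, not the function used in the rest of the notebook; the name `label_cvvc` is hypothetical, and it is expected to agree with `getNameEntityOutput` on patterns made of lower-case `c` and `v` only.

```python
import re

def label_cvvc(pattern):
    """Mark with 1 every position covered by a 'cvvc' occurrence, overlaps included."""
    labels = [0] * len(pattern)
    # The lookahead (?=cvvc) matches at every start position, even when
    # two occurrences share a consonant, as in 'cvvcvvc'.
    for match in re.finditer(r'(?=cvvc)', pattern):
        start = match.start()
        labels[start:start + 4] = [1, 1, 1, 1]
    return labels

print(label_cvvc('ccvvcvc'))   # one occurrence, starting at position 1
print(label_cvvc('cvvcvvc'))   # two occurrences sharing the middle consonant
print(label_cvvc('cvcvvvc'))   # no occurrence: three vowels in a row do not count
```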


**Generate corpus**

In [6]:
def generateCorpus_1(size = 1000):
    inputs = []
    outputs = []
    for i in range(size):
        text, pattern = generateRandomText()
        inputs.append(text)
        outputs.append(getNameEntityOutput(pattern))
    return inputs, outputs

In [7]:
X_as_text, Y = generateCorpus_1(1000)
print(X_as_text[:10])
print(Y[:10])

['dkbiahvrcikzsptehlxrrndbpczaynorezntspfsdbaenaxsfj', 'rjymvatuuvtsyqqqkqvihpbkiuzjdneaqpvzfiuvdmgeoehsnv', 'enjgoaxbjhyyvpiaikcmensxvqvyqbtnhaqotrkdiviaezofev', 'bijnoukiikuzjhmujyxrtupjotrnkuobysbduypqbedkryuecj', 'jeeufdialaasgajbjkfdziibsuaygbncheihgbfqvhpckcogyo', 'fcpaaxjezncjarztazakypczxkyorobcmzlupnytoirctxdlvv', 'opedhocxgmkbyeedohnxkaflmogqyipzbizpmfmciglyfpmurp', 'duajzehikuogcffuzpuougfyivyarbclyxjhdgrzasknritprf', 'ghrmafamftlcfmlueebcitfknulifntnutotxtuqcpomljmypm', 'xieatieyshixrtcdzzoqunseuycjzbqrironqigxxkdgggiuyc']
[[0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0

**Encode the letters**

In [8]:
ENCODER = {}

for idx, letter in enumerate(VOCABULARY):
    ENCODER[letter] = [0] * VOCABULARY_SIZE
    ENCODER[letter][idx] = 1
    
for k, V in sorted(ENCODER.items(), key = lambda X: X[0]):
    print(k, V)

a [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
b [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
c [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
d [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
e [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
f [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
g [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
h [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
i [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
j [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
k [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
l [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
m [0, 0, 0, 0, 0, 0, 0, 0, 0

**Encode the texts**

In [9]:
def encodeText(text):
    result = []
    for letter in text:
        result.append(ENCODER[letter])
    return np.array(result)

In [10]:
encode_example = encodeText(X_as_text[0])
print(X_as_text[0])
print(encode_example)
print(encode_example.shape)

dkbiahvrcikzsptehlxrrndbpczaynorezntspfsdbaenaxsfj
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(50, 26)


In [11]:
def generateCorpus(size = 1000):
    raw_inputs = []
    inputs = []
    outputs = []
    for i in range(size):
        text, pattern = generateRandomText()
        raw_inputs.append(text)
        inputs.append(encodeText(text))
        outputs.append(getNameEntityOutput(pattern))
    return raw_inputs, np.array(inputs), np.array(outputs).reshape(len(outputs), LENGTH_LOWER, 1)

In [12]:
X_as_text, X, Y = generateCorpus(10000)
X_test_as_text, X_test, Y_test = generateCorpus(2000)
print(X_as_text[:10])
print(X_test_as_text[:10])
print("Training set shape:")
print(X.shape)
print(Y.shape)
print("Test set shape:")
print(X_test.shape)
print(Y_test.shape)

['lpcjxmacdrvzuojeodbjrkdqagacrcimjlxdqijtiyhcnhgbez', 'ysjhhkldffkorivqaxixyuvvujnaujejzfngndxsefyvvmadoe', 'yulemiolaqyiiphqsytfreudjiuuknvvoekohgssupzijsvefo', 'kebmibgaqrcxvzlofiorjuxoedzkfsidyeeexekjvzvmxbtaaf', 'bmkallqvotxeqbomyuxxoviuyejfgkozoeoogtkeokomvaxksc', 'sozamnoatupfohdtqbquomzepgfrsgbcumcfajeemtaeggzjdd', 'uoxadzeubbkaehgbvettunzoqrbposqbeseosibgmesavuzxrv', 'ovtbkinyqfsoschaofmyhejnagtooxyltxkliditledqbtdgni', 'vuyciomahobmmiohcdopugonkphsiojiauqszfexxjuvhrucqh', 'lksljutqccblqfxkdoauiuevbojcfvduedoozhoenvixosrgpy']
['exxsoeavaqgtgvlmxmxnfqiygsitplotipqqgotsgzrugbzubs', 'xibvbuqvhtdmttmnpxqxrfnuilvupivqxxooquiotypmvihmgf', 'xenknalodjbegycubktjhinapdpradcnoeprlilohcfshbgezv', 'yuetgabogeojyxhkrbsljeclthgbevlphrhvnsukjoqjfknmda', 'sviverciioqdapsbuotrxonoitoglnjbgjfvhtsozyybdiarbn', 'acdornixkmqmkuzacuutkacuqqjqsdsiyecdabnbjihhdsfkub', 'foockjxdsizbrgszmyigtkhxnuadhvxntbpclhqgvaionaorol', 'qyqarajadybnucmolyiavvnerrtsuqftvnensctvhujklezkip', 'zxvkipephfclpiacmrvedunez

**Simple RNN**

In [13]:
from keras.models import Sequential
from keras.layers import Dense, Activation, SimpleRNN
from keras.utils import plot_model
from sklearn.metrics import precision_score, recall_score

input_shape = (None, VOCABULARY_SIZE)

model = Sequential()
model.add(SimpleRNN(units = 256, input_shape = input_shape, activation='tanh', return_sequences = True))
model.add(Dense(units = 1, activation = 'sigmoid'))
model.summary()

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_1 (SimpleRNN)     (None, None, 256)         72448     
_________________________________________________________________
dense_1 (Dense)              (None, None, 1)           257       
Total params: 72,705
Trainable params: 72,705
Non-trainable params: 0
_________________________________________________________________
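The parameter counts in the summary can be checked by hand: a SimpleRNN layer with $A$ units and $D$-dimensional inputs has $A \times A + A \times D + A$ parameters, and a dense layer with $K$ units on top of it has $K \times A + K$. The helper names below are hypothetical; the arithmetic is the point.

```python
def simple_rnn_param_count(units, input_dim):
    """Recurrent weights (units x units) + input weights (units x input_dim) + biases."""
    return units * units + units * input_dim + units

def dense_param_count(units, input_dim):
    """Weights (units x input_dim) + biases."""
    return units * input_dim + units

rnn_params = simple_rnn_param_count(units=256, input_dim=26)
dense_params = dense_param_count(units=1, input_dim=256)
print(rnn_params, dense_params, rnn_params + dense_params)   # 72448 257 72705
```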


In [14]:
# Remember: install pydot and graphviz first
import os
os.environ["PATH"] += os.pathsep + 'C:/Programs/release/bin'

plot_model(model, to_file='F13_2.png', show_shapes=True)

<img src="F13_2.png">

In [15]:
import tensorflow as tf

def as_keras_metric(method):
    import functools
    from keras import backend as K
    import tensorflow as tf
    @functools.wraps(method)
    def wrapper(self, args, **kwargs):
        """ Wrapper for turning tensorflow metrics into keras metrics """
        value, update_op = method(self, args, **kwargs)
        K.get_session().run(tf.local_variables_initializer())
        with tf.control_dependencies([update_op]):
            value = tf.identity(value)
        return value
    return wrapper

#precision = as_keras_metric(tf.metrics.precision)
#recall = as_keras_metric(tf.metrics.recall)
auc = as_keras_metric(tf.metrics.auc)

In [16]:
from keras.optimizers import SGD

mySGD = SGD(lr = 0.1, momentum = 0.9)
model.compile(loss = "binary_crossentropy", optimizer = mySGD, metrics = ["accuracy", auc])

In [17]:
history = model.fit(X, Y, epochs = 50, batch_size = 128, verbose = 2, validation_split = 0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/50
 - 6s - loss: 0.4217 - acc: 0.8231 - auc: 0.5932 - val_loss: 0.3865 - val_acc: 0.8221 - val_auc: 0.6926
Epoch 2/50
 - 7s - loss: 0.3900 - acc: 0.8213 - auc: 0.7165 - val_loss: 0.3854 - val_acc: 0.8181 - val_auc: 0.7302
Epoch 3/50
 - 7s - loss: 0.3881 - acc: 0.8217 - auc: 0.7371 - val_loss: 0.3842 - val_acc: 0.8256 - val_auc: 0.7420
Epoch 4/50
 - 8s - loss: 0.3866 - acc: 0.8227 - auc: 0.7454 - val_loss: 0.3871 - val_acc: 0.8273 - val_auc: 0.7479
Epoch 5/50
 - 7s - loss: 0.3850 - acc: 0.8249 - auc: 0.7498 - val_loss: 0.3828 - val_acc: 0.8248 - val_auc: 0.7516
Epoch 6/50
 - 7s - loss: 0.3836 - acc: 0.8272 - auc: 0.7530 - val_loss: 0.3805 - val_acc: 0.8301 - val_auc: 0.7543
Epoch 7/50
 - 7s - loss: 0.3805 - acc: 0.8279 - auc: 0.7554 - val_loss: 0.3759 - val_acc: 0.8335 - val_auc: 0.7566
Epoch 8/50
 - 7s - loss: 0.3747 - acc: 0.8315 - auc: 0.7578 - val_loss: 0.3697 - val_acc: 0.8303 - val_auc: 0.7591
Epoch 9/50
 - 7s - loss: 0.3648 

In [18]:
score_test = model.evaluate(X_test, Y_test)
print(score_test)

[0.31134568071365354, 0.8941700048446656, 0.8317164134979248]


In [19]:
Y_pred_proba = model.predict(X_test)
Y_pred = model.predict_classes(X_test)
print(X_test_as_text[:10])
print("Prediction as probability for first test data:")
print(Y_pred_proba[0].reshape(Y_test.shape[1]))
print("True and predicted values")

['exxsoeavaqgtgvlmxmxnfqiygsitplotipqqgotsgzrugbzubs', 'xibvbuqvhtdmttmnpxqxrfnuilvupivqxxooquiotypmvihmgf', 'xenknalodjbegycubktjhinapdpradcnoeprlilohcfshbgezv', 'yuetgabogeojyxhkrbsljeclthgbevlphrhvnsukjoqjfknmda', 'sviverciioqdapsbuotrxonoitoglnjbgjfvhtsozyybdiarbn', 'acdornixkmqmkuzacuutkacuqqjqsdsiyecdabnbjihhdsfkub', 'foockjxdsizbrgszmyigtkhxnuadhvxntbpclhqgvaionaorol', 'qyqarajadybnucmolyiavvnerrtsuqftvnensctvhujklezkip', 'zxvkipephfclpiacmrvedunezuuemgkhheecikkclnryrpfusx', 'pajzvtiuilbgorjvifrbikdpvkctotafztpflljaudxqzekolo']
Prediction as probability for first test data:
[0.00662487 0.04238252 0.07605126 0.07756739 0.2867158  0.81349176
 0.01670473 0.03427134 0.05120999 0.13458757 0.10910339 0.20737395
 0.11546207 0.04964935 0.08184267 0.04454813 0.0252463  0.08957209
 0.05117536 0.01705522 0.05111629 0.11633451 0.3681083  0.03131708
 0.05376703 0.04352481 0.2444832  0.05293031 0.08188481 0.06153117
 0.3853891  0.0195616  0.02128983 0.00500048 0.27767462 0.04827879
 0.1035429

In [20]:
for i, (trueVal, predVal) in enumerate(zip(Y_test[:10], Y_pred[:10])):
    print(str(i) + ":")
    print(trueVal.reshape(50))
    print(predVal.reshape(50))

0:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
1:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 0 0 0 0]
2:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
3:
[1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
4:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0

**Adapt probability threshold**

In [21]:
def predict_classes_with_threshold(model, X, threshold = 0.5):
    Y_proba = model.predict(X)
    return (Y_proba > threshold).astype(int)

In [22]:
Y_pred_new = predict_classes_with_threshold(model, X_test, 0.33)
for i, (trueVal, predVal) in enumerate(zip(Y_test[:10], Y_pred_new[:10])):
    print(str(i) + ":")
    print(trueVal.reshape(50))
    print(predVal.reshape(50))

0:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0]
1:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1
 1 1 0 0 0 0 0 0 0 0 0 0 0]
2:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
3:
[1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0]
4:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0
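Lowering the threshold trades precision for recall. The sketch below shows the effect on a small made-up example; the labels and probabilities are hypothetical and are not taken from the model above, and `precision_recall` is an illustrative helper written with NumPy only (the `precision_score` and `recall_score` functions imported earlier from scikit-learn could be used on flattened arrays instead).

```python
import numpy as np

def precision_recall(y_true, y_proba, threshold):
    """Precision and recall of the positive class after thresholding y_proba."""
    y_true = np.asarray(y_true).ravel()
    y_pred = (np.asarray(y_proba).ravel() > threshold).astype(int)
    true_pos = int(np.sum((y_pred == 1) & (y_true == 1)))
    pred_pos = int(np.sum(y_pred == 1))
    actual_pos = int(np.sum(y_true == 1))
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical labels and probabilities, for illustration only.
y_true = [0, 1, 1, 1, 1, 0, 0, 0]
y_proba = [0.40, 0.20, 0.45, 0.70, 0.60, 0.35, 0.05, 0.10]

for threshold in (0.5, 0.33):
    print(threshold, precision_recall(y_true, y_proba, threshold))
```

On this toy data the lower threshold recovers more of the true positives at the cost of some false positives, which is the same trade-off visible in the predictions above.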

**Observe**

- The model seems to have learnt the rules "after 0,1,1, the next label should be 1; after 0,1,1,1, the next label should be 1."
- If the model could also learn the rules "before 1,1,0, the label should be 1; before 1,1,1,0, the label should be 1", it would do better. This motivates a model that can also learn "from the future". We will discuss bidirectional RNNs later.

# 2. Other Examples

## 2.1 Text Generation

Input: of length $T_x = 0$ or 1.

Output: of some length $T_y > 0$

This is an example of **one-to-many** models.

<img src="F6.png" width=600>

## 2.2 Sentiment Classification

Input: of length $T_x > 0$

Output: a real number for example, so $T_y = 1$; or categorical variable, so $T_y = K$, a constant.

This is an example of **many-to-one** models.

<img src="F4.png" width=600>

## 2.3 Machine Translation

In general, $T_x \neq T_y$.

This is an example of **many-to-many** models, like the named entity recognition model.

The model can be described as follows:

<img src="F7.png" width=600>

This is known as the **encoder-decoder** model. The first "half" is called the **encoder**, while the second half is the **decoder**.

# 3. Another Example: Language Modelling

## 3.1 Notion

Language modelling aims to build a probability distribution over sequences of words. For example, given the text "The Earth rotates around the", what should the next word be? Once trained, a model can predict something like:
<center>
$P($ `Sun` | `The Earth rotates around the` $) = 0.64$
</center>

while
<center>
$P($ `son` | `The Earth rotates around the` $) = 10^{-4}$
</center>    

In applications such as **speech to text**, if we hear a sound like "sun" or "son" after "The Earth rotates around the", the model helps us choose the word that best fits the context. In our example, "sun" is the better choice.

In general, language modelling answers the question: given a sequence $y^{<1>}, \ldots, y^{<T_y>}$, what is the probability of observing $y^{<1>}, \ldots, y^{<T_y>}$ consecutively?
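One standard way to make this question concrete is the chain rule of probability, which factors the probability of the whole sequence into next-word conditionals. This is a general identity, stated here as background rather than as part of the original notes:

$$
\mathbf P(y^{<1>}, \ldots, y^{<T_y>}) = \prod_{t=1}^{T_y} \mathbf P(y^{<t>} | y^{<1>}, \ldots, y^{<t-1>})
$$

For $T_y = 3$ this reads $\mathbf P(y^{<1>}, y^{<2>}, y^{<3>}) = \mathbf P(y^{<1>}) \, \mathbf P(y^{<2>} | y^{<1>}) \, \mathbf P(y^{<3>} | y^{<1>}, y^{<2>})$, so a model that predicts the next word given the previous ones is enough to score a whole sequence.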

## 3.2 Model

- Training set: Large corpus of text in a language
- We build a model to predict the probability of the first word. Let $a_1$ be of dimension $A$ and $y_1$ of dimension $D$ (the size of the vocabulary). We have
$$
a_1 = g_a(b_a)
$$
$$
y_1 = g_y(\mathbf W_y a_1 + b_y)
$$

This corresponds to the following model with $a_0 = 0, x_1 = 0$.

<img src="F9.png" width=200>

- Choose $g_y$ to be the softmax function. The model can then be used to estimate the probability of each word in the vocabulary.
- To predict the next words, we use $x^{<t>} = y^{<t-1>}$:

<img src="F11.png" width=600>

- Using softmax as $g_y$ at each $t$, the prediction becomes a probability distribution of
$$
\mathbf P(\cdot | y^{<1>}, \ldots, y^{<t-1>})
$$

- The loss function for each component

$$
L^{<t>}(y^{<t>}, \hat y^{<t>}) = -\sum_{j=1}^{D} y_j^{<t>}\log \hat y_j^{<t>} 
$$

where $j=1, \ldots, D$ denotes the coordinate indices of the $D-$dimensional vectors $y^{<t>}$.

- The overall loss function
$$
L(y, \hat y) = \sum_{t=1}^{T_y} L^{<t>} (y^{<t>}, \hat y^{<t>})
$$
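A small numerical sketch of this loss, assuming one-hot targets and softmax outputs. The 4-symbol vocabulary, the probabilities and the function name are made up for illustration. With one-hot targets, the overall loss equals minus the log of the probability that the model assigns to the whole target sequence.

```python
import numpy as np

def sequence_categorical_crossentropy(y, y_hat, eps=1e-12):
    """Sum over t and j of -y_j^{<t>} log(y_hat_j^{<t>}); y is one-hot, shape (T_y, D)."""
    y = np.asarray(y, dtype=float)
    # Clip to avoid log(0) for classes predicted with probability 0.
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0)
    return float(-(y * np.log(y_hat)).sum())

# Hypothetical 3-step sequence over a 4-symbol vocabulary.
y = np.eye(4)[[0, 2, 1]]                       # one-hot targets: symbols 0, 2, 1
y_hat = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.20, 0.50, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])   # softmax outputs, each row sums to 1

loss = sequence_categorical_crossentropy(y, y_hat)
sequence_probability = 0.70 * 0.50 * 0.25      # probabilities of the target symbols
print(loss, -np.log(sequence_probability))
```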

## 3.3 Special Cases

- Punctuation: in many situations, punctuation marks are treated as words and included in the vocabulary.

- Unknown words: the text may contain words that are not in the vocabulary. We can map them to a special token such as `<UNK>`, or to a category token such as `<NAME_ENTITY>`.
    
**Example:** "Barnaby Marmaduke sets world record." -> `<UNK> <UNK> sets world record <PUNC>` or `<NAME_ENTITY> <NAME_ENTITY> sets world record <PUNC>`.


## 3.4 Sampling a Sequence from the Trained Model

- Use the probabilities output by the RNN to randomly sample a word for that time step, $\hat{y}^{<t>}$.
- Then pass the sampled word as the input to the next time step.

<img src="F12.png" width=600>

## 3.5 Implementation

We will experiment with Vietnamese text, modelled at the character level.

**Load data**

In [47]:
import pandas as pd
import re

GOOD_LEN = 300
DATA_SIZE = 10000
VIETNAMESE_LETTERS = '[^a-zđăâêôơưàằầèềìòồờùừỳáắấéếíóốớúứýảẳẩẻểỉỏổởủửỷãẵẫẽễĩõỗỡũữỹạặậẹệịọộợụựỵ ]+'

df = pd.read_csv("VN.csv", sep="\t\t\t\t", header=None, engine='python', encoding = 'utf-8').values
counter = 0
lines = []
while len(lines) < DATA_SIZE:
    textLine = df[counter][0].lower()
    textLine = re.sub(VIETNAMESE_LETTERS, '', textLine)
    if len(textLine) >= GOOD_LEN:
        lines.append(textLine[:GOOD_LEN])
    counter += 1
        
len(lines), lines[35], len(lines[35])

(10000,
 'tên họ của đao ba khách rất quái dị gọi là cung thần xuân nghe nói y tinh thông hơn mười loại binh khí khác nhau tình hình thực tế ra sao trừ phi gặp được người đã quá chiêu động thủ với y bằng không e là không thể khảo cứu được từ tử lăng thầm nhủ trong số mặt nạ mà lỗ diệu tử làm ra đã có một tấm ',
 300)

**Training data: Encode $y$ as integer**

In [48]:
vocabulary = {}
vocabulary_list = []
counter = 0
lines_encode_int = []

for line in lines:
    line_encode_int = []
    for char in line:
        if char not in vocabulary:
            vocabulary[char] = counter
            vocabulary_list.append(char)
            counter += 1
        line_encode_int.append(vocabulary[char])
    lines_encode_int.append(line_encode_int)

In [49]:
print(vocabulary)
vocalen = len(vocabulary)
print(vocalen)

{'ý': 49, 'ẹ': 86, 'ẽ': 81, 'z': 93, 'f': 89, 'ú': 68, 'ỳ': 80, 'ỹ': 87, 'ỗ': 47, 'ồ': 59, 'ắ': 44, 'ỉ': 67, 'đ': 17, 'ợ': 57, 'ử': 64, 'ọ': 30, 'ằ': 78, 'x': 55, 'c': 12, 'm': 29, 'r': 5, 'ề': 15, 'i': 11, 'ổ': 56, 'â': 48, 't': 4, 'ủ': 39, 'a': 16, 'ì': 6, 'k': 26, 'ỵ': 91, 'n': 7, 'ữ': 46, 'ỡ': 63, 'ơ': 61, 'ă': 34, 'ẩ': 74, 'v': 50, 'e': 69, 'ế': 58, 'ộ': 51, 'j': 92, 'ệ': 54, 'w': 88, 'ò': 71, 'é': 84, 'à': 21, 'ã': 65, 'l': 20, 'ù': 60, 'y': 14, 'ả': 40, 'ấ': 38, 'ự': 13, 'ẻ': 82, 'ĩ': 72, 'd': 23, 'õ': 77, 'ẫ': 66, 'ẵ': 90, 'g': 31, 'ố': 52, 'ư': 32, 'u': 1, 'è': 73, 'ở': 42, 'o': 19, 'ỷ': 85, 's': 41, 'ớ': 25, 'á': 2, 'ể': 36, 'ũ': 83, 'p': 10, 'ậ': 9, 'ạ': 18, 'í': 75, 'ễ': 24, ' ': 3, 'b': 43, 'ỏ': 79, 'ó': 27, 'ê': 35, 'ứ': 37, 'ị': 62, 'ừ': 22, 'ờ': 33, 'ụ': 45, 'h': 8, 'ẳ': 70, 'ặ': 76, 'ầ': 28, 'ô': 53, 'q': 0}
94


In [50]:
print(lines_encode_int[35])
print(len(lines_encode_int[35]))

[4, 35, 7, 3, 8, 30, 3, 12, 39, 16, 3, 17, 16, 19, 3, 43, 16, 3, 26, 8, 2, 12, 8, 3, 5, 38, 4, 3, 0, 1, 2, 11, 3, 23, 62, 3, 31, 30, 11, 3, 20, 21, 3, 12, 1, 7, 31, 3, 4, 8, 28, 7, 3, 55, 1, 48, 7, 3, 7, 31, 8, 69, 3, 7, 27, 11, 3, 14, 3, 4, 11, 7, 8, 3, 4, 8, 53, 7, 31, 3, 8, 61, 7, 3, 29, 32, 33, 11, 3, 20, 19, 18, 11, 3, 43, 11, 7, 8, 3, 26, 8, 75, 3, 26, 8, 2, 12, 3, 7, 8, 16, 1, 3, 4, 6, 7, 8, 3, 8, 6, 7, 8, 3, 4, 8, 13, 12, 3, 4, 58, 3, 5, 16, 3, 41, 16, 19, 3, 4, 5, 22, 3, 10, 8, 11, 3, 31, 76, 10, 3, 17, 32, 57, 12, 3, 7, 31, 32, 33, 11, 3, 17, 65, 3, 0, 1, 2, 3, 12, 8, 11, 35, 1, 3, 17, 51, 7, 31, 3, 4, 8, 39, 3, 50, 25, 11, 3, 14, 3, 43, 78, 7, 31, 3, 26, 8, 53, 7, 31, 3, 69, 3, 20, 21, 3, 26, 8, 53, 7, 31, 3, 4, 8, 36, 3, 26, 8, 40, 19, 3, 12, 37, 1, 3, 17, 32, 57, 12, 3, 4, 22, 3, 4, 64, 3, 20, 34, 7, 31, 3, 4, 8, 28, 29, 3, 7, 8, 39, 3, 4, 5, 19, 7, 31, 3, 41, 52, 3, 29, 76, 4, 3, 7, 18, 3, 29, 21, 3, 20, 47, 3, 23, 11, 54, 1, 3, 4, 64, 3, 20, 21, 29, 3, 5, 16, 3, 17, 65, 

In [51]:
maxlen = max([len(line) for line in lines_encode_int])
print(maxlen)

300


**Training data: Encode $y$ as vector**

In [52]:
lines_encode_vector = np.zeros((DATA_SIZE, GOOD_LEN, vocalen))
for i, line in enumerate(lines_encode_int):
    for j, char in enumerate(line):
        lines_encode_vector[i][j][char] = 1
        
lines_encode_vector.shape

(10000, 300, 94)

In [53]:
lines_encode_vector[35]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [54]:
lines_encode_vector[35][0]

array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0.])

**Decoding**

In [55]:
def decodeChar(char):
    """
        Char is a vector
    """
    return vocabulary_list[np.argmax(char)]

decodeChar([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

'z'

In [56]:
def decodeLine(line):
    return "".join(decodeChar(char) for char in line)

decodeLine(lines_encode_vector[35])

'tên họ của đao ba khách rất quái dị gọi là cung thần xuân nghe nói y tinh thông hơn mười loại binh khí khác nhau tình hình thực tế ra sao trừ phi gặp được người đã quá chiêu động thủ với y bằng không e là không thể khảo cứu được từ tử lăng thầm nhủ trong số mặt nạ mà lỗ diệu tử làm ra đã có một tấm '

In [83]:
def decodeInt(line):
    return "".join([vocabulary_list[charInt] for charInt in line])

decodeInt([0, 1, 2, 3, 4, 5, 6, 7, 8])

'quá trình'

**The model**

In [58]:
from keras.models import Sequential
from keras.layers import Dense, Activation, SimpleRNN
from keras.utils import plot_model
from sklearn.metrics import precision_score, recall_score

input_shape_2 = (GOOD_LEN, vocalen)

model2 = Sequential()
model2.add(SimpleRNN(units = 256, input_shape = input_shape_2, activation='tanh', return_sequences = True))
model2.add(Dense(units = vocalen, activation = 'softmax'))
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_3 (SimpleRNN)     (None, 300, 256)          89856     
_________________________________________________________________
dense_3 (Dense)              (None, 300, 94)           24158     
Total params: 114,014
Trainable params: 114,014
Non-trainable params: 0
_________________________________________________________________


In [59]:
from keras.optimizers import SGD

mySGD = SGD(lr = 0.2, momentum = 0.99)
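# The output layer is a softmax over the vocabulary, so the matching loss is the categorical cross-entropy of section 3.2
model2.compile(loss = "categorical_crossentropy", optimizer = mySGD, metrics = ["accuracy"])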

In [60]:
Y2 = lines_encode_vector[:]
X2 = np.zeros(Y2.shape)
X2[:,1:,:] = Y2[:,:-1,:]
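The shift above implements $x^{<t>} = y^{<t-1>}$ with $x^{<1>} = 0$. A toy example makes the alignment visible; the batch of one sequence over a 3-symbol vocabulary is hypothetical, and the names `targets` and `inputs` stand in for `Y2` and `X2`.

```python
import numpy as np

# Hypothetical toy batch: 1 sequence, 4 time steps, vocabulary of 3 symbols.
targets = np.eye(3)[[2, 0, 1, 2]][np.newaxis, :, :]   # shape (1, 4, 3)

inputs = np.zeros(targets.shape)
inputs[:, 1:, :] = targets[:, :-1, :]                 # x^{<t>} = y^{<t-1>}

print(targets[0].argmax(axis=1))   # symbols to predict at each step
print(inputs[0])                   # first row is all zeros: x^{<1>} = 0
```

At every step the network sees the previous target symbol and is asked to predict the current one; the last target is never used as an input.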

In [61]:
history = model2.fit(X2, Y2, epochs = 2, batch_size = 128, verbose = 1, validation_split = 0.1)

Train on 9000 samples, validate on 1000 samples
Epoch 1/2
Epoch 2/2


**Prediction**

In [62]:
Y2_pred = model2.predict(X2[:1])
print(Y2_pred[:,:,:5])

[[[0.00861543 0.01550654 0.01278425 0.06880084 0.0234017 ]
  [0.00249593 0.01863731 0.01107734 0.32687682 0.04443014]
  [0.00120255 0.02026371 0.00577814 0.3738823  0.04816699]
  ...
  [0.00078103 0.00605488 0.00247188 0.7388783  0.01416358]
  [0.00349976 0.0124269  0.00795586 0.07511744 0.13161837]
  [0.00390961 0.02486826 0.00817277 0.06053606 0.03643496]]]


In [64]:
Wax, Waa, ba = model2.get_layer(index=0).get_weights()
Wax.shape, Waa.shape, ba.shape

((94, 256), (256, 256), (256,))

In [65]:
Wya, by = model2.get_layer(index=1).get_weights()
Wya.shape, by.shape

((256, 94), (94,))

**Verification**

The prediction is done by:

$$
a^{<t>} = g_a(\mathbf W_{a}[a^{<t-1>}, x^{<t>}] + b_a)
$$

and
$$
\hat y^{<t>} = g_y(\mathbf W_{y}a^{<t>} + b_y)
$$

where $g_a$ is $\tanh$ and $g_y$ is softmax. We will verify this against `Y2_pred`, computed by `model2.predict`.

In [66]:
new_a = np.tanh(np.dot(Wax.T, X2[0,0]) + np.dot(Waa.T, np.zeros(256)) + ba)
print(new_a.shape)
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

new_y = softmax(np.dot(Wya.T, new_a) + by)
print(new_y.shape)

(256,)
(94,)


`new_y` ($\hat y^{<1>}$ of the first example) should be identical to `model2.predict(X2)[0,0,:]`:

In [67]:
np.round(Y2_pred[:,0,:] - new_y, 5)

array([[-0.,  0.,  0.,  0.,  0., -0.,  0.,  0.,  0., -0., -0., -0., -0.,
        -0.,  0.,  0., -0., -0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0., -0., -0.,  0.,  0., -0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        -0., -0., -0., -0.,  0., -0., -0.,  0.,  0.,  0.,  0.,  0., -0.,
        -0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0., -0.,  0., -0.,  0., -0.,  0.,  0.,  0.,
        -0., -0.,  0., -0.,  0.,  0.,  0., -0., -0., -0.,  0.,  0.,  0.,
        -0.,  0.,  0.]])

In [68]:
new_a = np.tanh(np.dot(Wax.T, X2[0,1]) + np.dot(Waa.T, new_a) + ba)
new_y = softmax(np.dot(Wya.T, new_a) + by)
np.round(Y2_pred[:,1,:] - new_y, 5)

array([[-0., -0., -0.,  0., -0., -0., -0., -0.,  0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,  0., -0.,
        -0., -0., -0., -0., -0.,  0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0.,  0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0.,  0.,  0., -0., -0., -0., -0.,
        -0., -0., -0., -0.,  0., -0., -0.,  0.,  0.,  0., -0., -0., -0.,
        -0., -0., -0.]])

**Sampling**

In [69]:
def passState(x, a_prev, Waa, Wax, Wya, ba, by):
    a = np.tanh(np.dot(Wax.T, x) + np.dot(Waa.T, a_prev) + ba)
    y = softmax(np.dot(Wya.T, a) + by)
    return y, a


a_prev = np.zeros(256)
x = np.zeros(vocalen)
generated_text = []

np.random.seed(270)
for t in range(GOOD_LEN):
    y, a = passState(x, a_prev, Waa, Wax, Wya, ba, by)
    a_prev = a[:]
    idx = np.random.choice(range(vocalen), p = y.ravel())
    x = np.zeros(vocalen)
    x[idx] = 1
    generated_text.append(vocabulary_list[idx])

"".join(generated_text)

'ôm ât thh ni iấờ taệả ghỉ  tả tcúâ  ại nưđq  hn t ht ó ôấqến ưện  gệư gsư g no  ég ng r ềăn trgcbmn  min  aừ  t x nto ciycgảẵ nảuhỹã g nhờc  lc ư ặưe êiếư u àrầ ạà  g  lnà ỡ l ệhimàýsuạm hn  ãn ccuảntmi ôàhư oịấ ign aà  hpnh lyh  â kớnị nô hưưnhp   hòđ đơ càlt uòỗê nếồảnhsạ cci hlwẫ toi ớhừưgwàâỡ t '
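
The same sampling loop is repeated after every round of training, so it can be convenient to wrap it in a helper. The sketch below is one possible way to do that; the name `sample_indices` and the small random weights are assumptions made so the example runs on its own, and the weights use the same shape convention as `Wax`, `Waa`, `Wya` above (applied through their transpose).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_indices(Wax, Waa, ba, Wya, by, n_steps, seed=0):
    """Sample n_steps symbol indices from a basic RNN, feeding each sample back as the next input."""
    rng = np.random.RandomState(seed)
    vocab_size, n_hidden = Wax.shape
    a = np.zeros(n_hidden)           # initial hidden state
    x = np.zeros(vocab_size)         # all-zero first input, as in the loop above
    indices = []
    for _ in range(n_steps):
        a = np.tanh(Wax.T @ x + Waa.T @ a + ba)
        y = softmax(Wya.T @ a + by)
        idx = rng.choice(vocab_size, p=y)
        x = np.zeros(vocab_size)     # one-hot encode the sampled symbol
        x[idx] = 1.0
        indices.append(idx)
    return indices

# Small random weights standing in for a trained model (hypothetical sizes).
rng = np.random.RandomState(1)
V, H = 5, 8
weights = (rng.randn(V, H) * 0.1, rng.randn(H, H) * 0.1, np.zeros(H),
           rng.randn(H, V) * 0.1, np.zeros(V))
sampled = sample_indices(*weights, n_steps=10, seed=270)
print(sampled)
```

With the notebook's variables, the text would then be recovered by mapping each index through `vocabulary_list`.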

**Train more**

In [70]:
history = model2.fit(X2, Y2, epochs = 8, batch_size = 128, verbose = 1, validation_split = 0.1)

Train on 9000 samples, validate on 1000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


In [74]:
Wax, Waa, ba = model2.get_layer(index=0).get_weights()
Wya, by = model2.get_layer(index=1).get_weights()

a_prev = np.zeros(256)
x = np.zeros(vocalen)
generated_text = []
np.random.seed(270)

for t in range(GOOD_LEN):
    y, a = passState(x, a_prev, Waa, Wax, Wya, ba, by)
    a_prev = a[:]
    idx = np.random.choice(range(vocalen), p = y.ravel())
    x = np.zeros(vocalen)
    x[idx] = 1
    generated_text.append(vocabulary_list[idx])

"".join(generated_text)

'sg bnn nàng nớư nhưếc khính kring tràn cải trận đánh duợc và kệt ngệờ kăờ g nài ơê thi tay thiảng hì long cà trám nhưipng mơé nguầẽ tà hào nhưiic vữệ gaệ tu lanh ch lu đựi vry bong sổu khừ thí từn sà dà bể mìng tiôn lư thàng lànc guểi vi biich ci táng tiãu lưi qỏủ nhồệ hạà họnh nóéh thư loiềtc êẩ tr'

**Train more (2)**

In [75]:
history = model2.fit(X2, Y2, epochs = 20, batch_size = 128, verbose = 1, validation_split = 0.1)

Train on 9000 samples, validate on 1000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [76]:
Wax, Waa, ba = model2.get_layer(index=0).get_weights()
Wya, by = model2.get_layer(index=1).get_weights()

a_prev = np.zeros(256)
x = np.zeros(vocalen)
generated_text = []
np.random.seed(270)

for t in range(GOOD_LEN):
    y, a = passState(x, a_prev, Waa, Wax, Wya, ba, by)
    a_prev = a[:]
    idx = np.random.choice(range(vocalen), p = y.ravel())
    x = np.zeros(vocalen)
    x[idx] = 1
    generated_text.append(vocabulary_list[idx])

"".join(generated_text)

'mộ bia này chể mề húo doợn tầu cẽ trọn yổi tranh thi lựm qma kiou lúc kào ghi chén cể nhưbng tay khi lành đý tráy hiện phâa ịuhxu íỏ m thông chương việt ấu tiên lưi m tiên vno bàng bơ đệu nhạk nói ở hện bảnh đònh dừ đý tho làng hìn khôi nghây đắn tram thẳ lên tuòng híộh cân can cạí hiủn dại gỷ síu t'

**Train more (3)**

In [77]:
mySGD = SGD(lr = 0.5, momentum = 0.99)
model2.compile(loss = "binary_crossentropy", optimizer = mySGD, metrics = ["accuracy"])
history = model2.fit(X2, Y2, epochs = 20, batch_size = 128, verbose = 1, validation_split = 0.1)

Train on 9000 samples, validate on 1000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [78]:
Wax, Waa, ba = model2.get_layer(index=0).get_weights()
Wya, by = model2.get_layer(index=1).get_weights()

a_prev = np.zeros(256)
x = np.zeros(vocalen)
generated_text = []
np.random.seed(270)

for t in range(GOOD_LEN):
    y, a = passState(x, a_prev, Waa, Wax, Wya, ba, by)
    a_prev = a[:]
    idx = np.random.choice(range(vocalen), p = y.ravel())
    x = np.zeros(vocalen)
    x[idx] = 1
    generated_text.append(vocabulary_list[idx])

"".join(generated_text)

'ừau ra hạng thiếu long nếu toán vẫn cạ hãi tra nà nhận bợá hiệm trương lệ không xa cả trược hay bbi tiểu thư nhìn nhất càng íu kì ồợ dianh huynh tặc rồi ấu thiệu lạng thiếu nó giọi sồu dận tìm ta tứ phó bản thúc cả tiểu thư phương ta cũng chúp cầu thèi đã động quốc nữm nhân ca thểm thên mưởng đãy nh'

**Train more (4)**

In [79]:
mySGD = SGD(lr = 1, momentum = 0.999)
model2.compile(loss = "binary_crossentropy", optimizer = mySGD, metrics = ["accuracy"])
history = model2.fit(X2, Y2, epochs = 20, batch_size = 128, verbose = 1, validation_split = 0.1)

Train on 9000 samples, validate on 1000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [82]:
Wax, Waa, ba = model2.get_layer(index=0).get_weights()
Wya, by = model2.get_layer(index=1).get_weights()

a_prev = np.zeros(256)
x = np.zeros(vocalen)
generated_text = []
np.random.seed(270)

for t in range(GOOD_LEN):
    y, a = passState(x, a_prev, Waa, Wax, Wya, ba, by)
    a_prev = a[:]
    idx = np.random.choice(range(vocalen), p = y.ravel())
    x = np.zeros(vocalen)
    x[idx] = 1
    generated_text.append(vocabulary_list[idx])

"".join(generated_text)

'ào sự thần hai gã đã từng trên còn thanh kì thiên hạ lại quả là thảm thân miễn tỏi của thủ nhiên dừng hôn hư trực ra phóng lét muối tay cảm thấy tính khỏi quả chân ta thiệu tự bành gã lời nhưng lừng ngọc được ý rồi tay thanh âm ra tại sao ngươi chu nơi đã đối trúch nắm chúng ta cứỉ hai tay gão mặt t'

**Some observations**
<table>
    <tr><th>After ... epochs (cumulative)</th><th>Observation</th></tr>
    <tr>
        <td>2</td>
        <td>Words of more than 7 letters occur, and so do two consecutive spaces. The text respects letter frequencies but does not seem to form correct words.</td>
    </tr>
    <tr>
        <td>10</td>
        <td>Words are mostly between 2 and 6 letters long (as in Vietnamese), two consecutive spaces rarely or never occur, but few words are correct.</td>
    </tr>
    <tr>
        <td>30</td>
        <td>More correct words appear.</td>
    </tr>
    <tr>
        <td>50</td>
        <td>More and more correct words appear, including complex combinations ("tiểu thư").</td>
    </tr>
    <tr>
        <td>70</td>
        <td>Almost all words are correct. Very few incorrect words remain (for example "trúch"). More complex combinations appear ("hai gã", "thiên hạ", "cảm thấy", "tại sao ngươi", "chúng ta", "hai tay").</td>
    </tr>
</table>

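The longer we train, the longer the dependencies between timesteps that the model captures: first letter frequencies, then word lengths, then whole words, and finally combinations of words.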
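
The observations in the table were made by eye. A rough way to quantify two of them (word lengths and consecutive spaces) is sketched below; the function name `text_stats` and the toy input string are assumptions for illustration, and the same function could be applied to each generated string above.

```python
def text_stats(text):
    """Return simple statistics for eyeballing the quality of generated text."""
    words = text.split()
    lengths = [len(w) for w in words]
    n = len(lengths)
    return {
        "n_words": n,
        "max_word_len": max(lengths) if n else 0,
        # Share of words longer than 7 letters, the threshold used in the table.
        "share_long_words": sum(l > 7 for l in lengths) / n if n else 0.0,
        # Number of (non-overlapping) occurrences of two consecutive spaces.
        "double_spaces": text.count("  "),
    }

sample = "ha noi  is the capital of vietnam"
print(text_stats(sample))
```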

# 4. Variation of RNN Models

## 4.1 Limitation of Basic RNN Models

A long-term dependency is the dependence of the output at some timestep on inputs from much earlier timesteps.

<img src="F15.png" width=600>

If the network is very deep (for an unrolled RNN, if the sequence is long), the gradient from an output $y$ has a hard time propagating back to affect the weights of earlier layers (earlier timesteps), because of **vanishing** and **exploding** gradients.

So it is difficult for a basic RNN to learn long-term dependencies.
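
The effect can be illustrated with a deliberately simplified sketch. It ignores the $\tanh$ derivative and assumes the recurrent matrix is a scaled orthogonal matrix $W = sQ$ (both are assumptions made for illustration, not properties of a trained network). Under these assumptions, going back one timestep multiplies the gradient norm by exactly $s$, so after $T$ steps the norm is scaled by $s^T$.

```python
import numpy as np

def backprop_norms(scale, n_hidden=16, n_steps=50, seed=0):
    """Norm of a gradient vector after each step back in time, for W = scale * Q."""
    rng = np.random.RandomState(seed)
    # Random orthogonal matrix Q obtained from a QR decomposition.
    Q, _ = np.linalg.qr(rng.randn(n_hidden, n_hidden))
    W = scale * Q
    grad = rng.randn(n_hidden)
    grad /= np.linalg.norm(grad)      # start from a unit-norm gradient
    norms = []
    for _ in range(n_steps):
        grad = W.T @ grad             # one step back in time (linear part only)
        norms.append(np.linalg.norm(grad))
    return norms

print(backprop_norms(0.5)[-1])   # shrinks like 0.5 ** 50 (vanishing)
print(backprop_norms(1.5)[-1])   # grows like 1.5 ** 50 (exploding)
```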

## 4.2 RNN Unit in a Basic RNN Model

<img src="F16.png" width=600>

In a basic RNN model, we usually have a variable $a$ that changes its state after each time step. Each input $x$ influences the state of $a$ by:

$$
a^{<t>} = \tanh(W_{a}a^{<t-1>} + W_{x}x^{<t>} + b_a)
$$

(here we suppose the activation used for $a$ is $\tanh$)

This is illustrated in the figure above: two arrows, one from the previous $a$ and one from the new $x$, meet in a box where the $\tanh$ activation is applied. The output of the activation is then used:

- for predicting $y$ at timestep $t$: $\hat y^{<t>} = g_y(W_y a^{<t>} + b_y)$.

- as the new $a$, to be transferred to timestep $t+1$.

## 4.3 Gated Recurrent Unit (GRU) 
### 4.3.1 Simple version

Gated Recurrent Unit is a model that is similar to basic RNN, but instead of using $a^{<t>} = \tanh(W_a a^{<t-1>}+ W_x x^{<t>} + b_a)$ to transfer to the following timestep, it uses a combination of $a^{<t-1>}$ and this new value $\tanh(W_a a^{<t-1>}+ W_x x^{<t>} + b_a)$.

We use the notation $c^{<t>}$ to avoid confusion with basic RNN.

For each timestep, GRU will have a previous value $c^{<t-1>}$ (standing for "cell state" or memory state). The transformation is as follows:

- Calculate a candidate for the new cell state
$$
\tilde c^{<t>} = \tanh(W_c c^{<t-1>} + W_{xc} x^{<t>} + b_c)
$$

- Calculate a coefficient in [0,1] that represents the weight of the candidate $\tilde c^{<t>}$ in the new cell state
$$
\Gamma_u = \sigma(W_u c^{<t-1>} + W_{xu} x^{<t>} + b_u)
$$

$\Gamma_u = 0$ means the candidate has no effect on the new cell state.
(For example, if we meet an unimportant word that does not affect the sentence's grammar, the current state is simply the previous state.)

$\Gamma_u = 1$ means the past cell states have no effect on the present cell state. (For example, if we meet a "." punctuation mark, we move on to a new sentence, and the grammar of the previous sentence no longer affects the current state.)

- The new cell state is a combination of the previous cell state and the new candidate
$$
c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1 - \Gamma_u) * c^{<t-1>}
$$

where * stands for point-wise multiplication.

- Prediction of output is calculated as
$$
\hat y^{<t>} = g_y (W_y c^{<t>} + b_y)
$$


**Summary**
$$
\tilde c^{<t>} = \tanh(W_c c^{<t-1>} + W_{xc} x^{<t>} + b_c)
$$

$$
\Gamma_u = \sigma(W_u c^{<t-1>} + W_{xu} x^{<t>} + b_u)
$$

$$
c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1 - \Gamma_u) * c^{<t-1>}
$$

$$
\hat y^{<t>} = g_y (W_y c^{<t>} + b_y)
$$
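
The update rules of the simple GRU can be sketched in NumPy as follows. The weights are random placeholders and the function name `gru_simple_step` is an assumption; matrices are applied as $W c$, following the equations (not the transposed Keras layout used in the earlier cells). The two calls at the end illustrate the extreme cases of the gate by using a very negative or very positive gate bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_simple_step(x, c_prev, W_c, W_xc, b_c, W_u, W_xu, b_u):
    """One timestep of the simple GRU: returns the new cell state and the update gate."""
    c_tilde = np.tanh(W_c @ c_prev + W_xc @ x + b_c)     # candidate
    gamma_u = sigmoid(W_u @ c_prev + W_xu @ x + b_u)     # update gate in [0, 1]
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev     # element-wise mix
    return c, gamma_u

rng = np.random.RandomState(0)
n_x, n_c = 4, 3
x, c_prev = rng.randn(n_x), rng.randn(n_c)
W_c, W_u = rng.randn(n_c, n_c), rng.randn(n_c, n_c)
W_xc, W_xu = rng.randn(n_c, n_x), rng.randn(n_c, n_x)
b_c = np.zeros(n_c)

# Very negative gate bias: Gamma_u is close to 0, the previous state is carried over.
c_keep, _ = gru_simple_step(x, c_prev, W_c, W_xc, b_c, W_u, W_xu, np.full(n_c, -100.0))
# Very positive gate bias: Gamma_u is close to 1, the state is replaced by the candidate.
c_new, _ = gru_simple_step(x, c_prev, W_c, W_xc, b_c, W_u, W_xu, np.full(n_c, 100.0))

print(np.allclose(c_keep, c_prev))
print(np.allclose(c_new, np.tanh(W_c @ c_prev + W_xc @ x + b_c)))
```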





### 4.3.2 Full Version

In the full version, we do not use $c^{<t-1>}$ directly to calculate the candidate $\tilde c^{<t>}$, but a proportion of it, controlled by a second gate $\Gamma_r$ (the reset, or relevance, gate).

$$
\Gamma_r = \sigma (W_r c^{<t-1>} + W_{xr} x^{<t>} + b_r)
$$

$$
\tilde c^{<t>} = \tanh(W_c (\Gamma_r * c^{<t-1>}) + W_{xc} x^{<t>} + b_c)
$$

**Summary**
$$
\Gamma_r = \sigma (W_r c^{<t-1>} + W_{xr} x^{<t>} + b_r)
$$

$$
\Gamma_u = \sigma(W_u c^{<t-1>} + W_{xu} x^{<t>} + b_u)
$$

$$
\tilde c^{<t>} = \tanh(W_c (\Gamma_r * c^{<t-1>}) + W_{xc} x^{<t>} + b_c)
$$

$$
c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1 - \Gamma_u) * c^{<t-1>}
$$

$$
\hat y^{<t>} = g_y (W_y c^{<t>} + b_y)
$$

<img src="F19.png" width=600>
<center>
    In this figure, $h$ corresponds to $c$ (the state written $a$ in the basic RNN), $r$ to $\Gamma_r$ and $z$ to $\Gamma_u$
</center>
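
A minimal NumPy sketch of the full GRU, run over a short random sequence, is given below. The parameter dictionary, its key names and the helper `init_params` are assumptions made for this illustration; the update itself follows the equations of this section, with matrices applied as $W c$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_full_step(x, c_prev, p):
    """One timestep of the full GRU; p is a dict of weights and biases."""
    gamma_r = sigmoid(p["W_r"] @ c_prev + p["W_xr"] @ x + p["b_r"])   # reset gate
    gamma_u = sigmoid(p["W_u"] @ c_prev + p["W_xu"] @ x + p["b_u"])   # update gate
    # The candidate only sees the part of c_prev let through by the reset gate.
    c_tilde = np.tanh(p["W_c"] @ (gamma_r * c_prev) + p["W_xc"] @ x + p["b_c"])
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

def init_params(n_x, n_c, seed=0):
    """Random placeholder parameters (hypothetical initialisation)."""
    rng = np.random.RandomState(seed)
    p = {}
    for gate in ("r", "u", "c"):
        p["W_" + gate] = rng.randn(n_c, n_c) * 0.1
        p["W_x" + gate] = rng.randn(n_c, n_x) * 0.1
        p["b_" + gate] = np.zeros(n_c)
    return p

n_x, n_c, T = 4, 3, 6
params = init_params(n_x, n_c)
xs = np.random.RandomState(1).randn(T, n_x)

c = np.zeros(n_c)
for x in xs:
    c = gru_full_step(x, c, params)
print(c)
```

Starting from a zero state, every $c^{<t>}$ is a convex combination of values in $(-1, 1)$, so the state stays bounded however long the sequence is.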

## 4.4 Long Short-Term Memory (LSTM)

In LSTM, at each timestep, we consider 2 variables:

- Cell state $c^{<t>}$

- Output state $a^{<t>}$

In GRU, we can consider $a^{<t>} = c^{<t>}$.

The updates in step $t$ are:

$$
\Gamma_f = \sigma(W_f a^{<t-1>} + W_{xf} x^{<t>} + b_f)
$$

$$
\Gamma_u = \sigma(W_u a^{<t-1>} + W_{xu} x^{<t>} + b_u)
$$

$$
\Gamma_o = \sigma(W_o a^{<t-1>} + W_{xo} x^{<t>} + b_o)
$$

$$
\tilde {c}^{<t>} = \tanh (W_c a^{<t-1>} + W_{xc} x^{<t>} + b_c)
$$

$$
c^{<t>} = \Gamma_f * c^{<t-1>} + \Gamma_u * \tilde c^{<t>}
$$

$$
a^{<t>} = \Gamma_o * \tanh c^{<t>}
$$

$$
\hat y^{<t>} = g_y (W_y a^{<t>} + b_y)
$$
 
Whereas GRU uses one gate (simple version) or two gates (full version), LSTM uses three gates:

- The forget gate, represented by the coefficient $\Gamma_f$. It controls how much of the previous cell state is kept. If $\Gamma_f = 0$, no memory from the past is kept.
- The update gate, represented by the coefficient $\Gamma_u$. It gives the proportion of the candidate $\tilde c^{<t>}$ that contributes to $c^{<t>}$.
- The output gate, represented by the coefficient $\Gamma_o$. It gives the proportion of $\tanh c^{<t>}$ that is output in $a^{<t>}$.

**Remark**

Sometimes in practice, we use the **hard_sigmoid** function, a piecewise-linear approximation that is cheaper to compute, instead of sigmoid.

<img src="F20.png" width=400></img>
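
As a sketch, the hard sigmoid can be written as a clipped straight line. The definition below, $\max(0, \min(1, 0.2x + 0.5))$, is an assumption: it is the form used by the Keras 2 backend as far as we know, and other libraries or later Keras versions may use a different slope, so the version in use should be checked.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_sigmoid(z):
    # Assumed definition: a line through (0, 0.5) with slope 0.2, clipped to [0, 1].
    return np.clip(0.2 * z + 0.5, 0.0, 1.0)

z = np.array([-5.0, -2.5, 0.0, 2.5, 5.0])
print(hard_sigmoid(z))
print(np.round(sigmoid(z), 3))
```

Both functions agree at $0$ and saturate towards $0$ and $1$, but the hard version reaches the bounds exactly at a finite input.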

## 4.5 Example - Implementation of LSTM

In [97]:
from keras.models import Sequential
from keras.layers import Dense, Activation, SimpleRNN, LSTM
from keras.utils import plot_model
from sklearn.metrics import precision_score, recall_score

input_shape_3 = (GOOD_LEN, vocalen)

model3 = Sequential()
model3.add(LSTM(units = 256, input_shape = input_shape_3, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, unit_forget_bias=True, return_sequences=True))
model3.add(Dense(units = vocalen, activation = 'softmax'))
model3.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 300, 256)          359424    
_________________________________________________________________
dense_8 (Dense)              (None, 300, 94)           24158     
Total params: 383,582
Trainable params: 383,582
Non-trainable params: 0
_________________________________________________________________


In [98]:
from keras.optimizers import Adam

myAdam = Adam(lr = 0.02)
model3.compile(loss = "binary_crossentropy", optimizer = myAdam, metrics = ["accuracy"])
history3 = model3.fit(X2, Y2, epochs = 20, batch_size = 128, verbose = 1, validation_split = 0.1)

Train on 9000 samples, validate on 1000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [146]:
units = 256

W = model3.layers[0].get_weights()[0]
U = model3.layers[0].get_weights()[1]
b = model3.layers[0].get_weights()[2]

W_u = W[:, :units]
W_f = W[:, units: units * 2]
W_c = W[:, units * 2: units * 3]
W_o = W[:, units * 3:]

U_u = U[:, :units]
U_f = U[:, units: units * 2]
U_c = U[:, units * 2: units * 3]
U_o = U[:, units * 3:]

b_u = b[:units]
b_f = b[units: units * 2]
b_c = b[units * 2: units * 3]
b_o = b[units * 3:]

W_y, b_y = model3.layers[1].get_weights()

In [149]:
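def sigmoid(x):
    # Logistic sigmoid. Note that model3 was built with recurrent_activation='hard_sigmoid',
    # so the gates computed here only approximate those used inside the Keras layer.
    return 1/(1 + np.exp(-x))

def passLSTMState(x, a_prev, c_prev, W_u, W_f, W_c, W_o, U_u, U_f, U_c, U_o, b_u, b_f, b_c, b_o, W_y, b_y):
    # One LSTM timestep: W_* act on the input x, U_* on the previous output state a_prev.
    Gamma_f = sigmoid(np.dot(U_f.T, a_prev) + np.dot(W_f.T, x) + b_f)  # forget gate
    Gamma_u = sigmoid(np.dot(U_u.T, a_prev) + np.dot(W_u.T, x) + b_u)  # update gate
    Gamma_o = sigmoid(np.dot(U_o.T, a_prev) + np.dot(W_o.T, x) + b_o)  # output gate
    c_tilde = np.tanh(np.dot(U_c.T, a_prev) + np.dot(W_c.T, x) + b_c)  # candidate cell state
    c = Gamma_f * c_prev + Gamma_u * c_tilde                           # new cell state
    a = Gamma_o * np.tanh(c)                                           # new output state
    y = softmax(np.dot(W_y.T, a) + b_y)                                # next-character distribution
    return y, a, c, [Gamma_f, Gamma_u, Gamma_o, c_tilde]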

In [152]:
a_prev = np.zeros(256)
c_prev = np.zeros(256)
x = np.zeros(vocalen)
generated_text = []
np.random.seed(123)

for t in range(GOOD_LEN):
    y, a, c, otherParams = passLSTMState(x, a_prev, c_prev, W_u, W_f, W_c, W_o, U_u, U_f, U_c, U_o, b_u, b_f, b_c, b_o, W_y, b_y)
    a_prev = a[:]
    c_prev = c[:]
    idx = np.random.choice(range(vocalen), p = y.ravel())
    x = np.zeros(vocalen)
    x[idx] = 1
    generated_text.append(vocabulary_list[idx])

"".join(generated_text)

'lưu khí chàng của chúng ta hai phủ nhân khó mà xảy ra phải đối diện rất đết chiến làm cung xin cho người đồng thiết là vĩ thế nào lại vui lưu bà khiến cho ta không cảm giác gốn này thế nhưng do tinh thang đi luôn giọng nói vô độ trn đế phủ quỷ tiên sinh rằng nguồn trang nuốt đầu suy tiếp vào sách ti'

**Save and reload model**

In [153]:
model3.save('LSTM-after-20.steps.hdf5')

In [155]:
from keras.models import load_model
savedModel3 = load_model('LSTM-after-20.steps.hdf5')

In [157]:
savedModel3.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 300, 256)          359424    
_________________________________________________________________
dense_8 (Dense)              (None, 300, 94)           24158     
Total params: 383,582
Trainable params: 383,582
Non-trainable params: 0
_________________________________________________________________


# 5. Bidirectional RNN

The motivation for bidirectional RNNs is that an output at timestep $t$ may depend not only on the inputs of previous timesteps but also on those of later timesteps, as we have seen in named entity recognition.

<img src="F21.png">

It uses two types of state variables: forward cell states (denoted for example by $a^{<t>}$) and backward cell states (denoted by $a'^{<t>}$).

## 5.1 Example with Simple Bidirectional RNN

At each timestep

**For forward cell states**
$$
a^{<t>} = g_a (W_a a^{<t-1>} + W_{ax} x^{<t>} + b_a) 
$$

**For backward cell states**
$$
a'^{<t>} = g_{a'} (W_{a'} a'^{<t+1>} + W_{a'x} x^{<t>} + b_{a'})
$$

**Prediction**
$$
\hat y^{<t>} = g_y (W_{ya} a^{<t>} + W_{ya'} a'^{<t>} + b_y)
$$
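
The three equations can be sketched in NumPy as below, assuming $\tanh$ for $g_a$ and $g_{a'}$ and softmax for $g_y$. The parameter names (`W_ab` for $W_{a'}$, and so on), the sizes and the random weights are assumptions for illustration. The forward states are computed from left to right and the backward states from right to left, and each prediction combines both.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def birnn_forward(xs, p):
    """Forward pass of a simple bidirectional RNN over a sequence xs of shape (T, n_x)."""
    T = len(xs)
    n_a = p["b_a"].shape[0]
    fwd = np.zeros((T, n_a))
    bwd = np.zeros((T, n_a))
    a = np.zeros(n_a)
    for t in range(T):                        # left to right: uses a^{<t-1>}
        a = np.tanh(p["W_a"] @ a + p["W_ax"] @ xs[t] + p["b_a"])
        fwd[t] = a
    a_b = np.zeros(n_a)
    for t in reversed(range(T)):              # right to left: uses a'^{<t+1>}
        a_b = np.tanh(p["W_ab"] @ a_b + p["W_abx"] @ xs[t] + p["b_ab"])
        bwd[t] = a_b
    ys = np.array([softmax(p["W_ya"] @ fwd[t] + p["W_yab"] @ bwd[t] + p["b_y"])
                   for t in range(T)])
    return ys, fwd, bwd

rng = np.random.RandomState(0)
n_x, n_a, n_y, T = 4, 3, 2, 5
params = {
    "W_a": rng.randn(n_a, n_a) * 0.5, "W_ax": rng.randn(n_a, n_x) * 0.5, "b_a": np.zeros(n_a),
    "W_ab": rng.randn(n_a, n_a) * 0.5, "W_abx": rng.randn(n_a, n_x) * 0.5, "b_ab": np.zeros(n_a),
    "W_ya": rng.randn(n_y, n_a), "W_yab": rng.randn(n_y, n_a), "b_y": np.zeros(n_y),
}
xs = rng.randn(T, n_x)
ys, fwd, bwd = birnn_forward(xs, params)
print(ys.shape)
```

Changing only the last input leaves the first forward state untouched but changes the first backward state, which is how the first prediction gets access to later words.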

## 5.2 Example with Bidirectional LSTM

<img src="F22.png" width=400></img>

At each timestep

**Forward cell states**
$$
\Gamma_f = \sigma(W_f a^{<t-1>} + W_{xf} x^{<t>} + b_f)
$$

$$
\Gamma_u = \sigma(W_u a^{<t-1>} + W_{xu} x^{<t>} + b_u)
$$

$$
\Gamma_o = \sigma(W_o a^{<t-1>} + W_{xo} x^{<t>} + b_o)
$$

$$
\tilde {c}^{<t>} = \tanh (W_c a^{<t-1>} + W_{xc} x^{<t>} + b_c)
$$

$$
c^{<t>} = \Gamma_f * c^{<t-1>} + \Gamma_u * \tilde c^{<t>}
$$

$$
a^{<t>} = \Gamma_o * \tanh c^{<t>}
$$

**Backward cell states**

$$
\Gamma'_f = \sigma(W'_f a'^{<t+1>} + W'_{xf} x^{<t>} + b'_f)
$$

$$
\Gamma'_u = \sigma(W'_u a'^{<t+1>} + W'_{xu} x^{<t>} + b'_u)
$$

$$
\Gamma'_o = \sigma(W'_o a'^{<t+1>} + W'_{xo} x^{<t>} + b'_o)
$$

$$
\tilde {c'}^{<t>} = \tanh (W'_c a'^{<t+1>} + W'_{xc} x^{<t>} + b'_c)
$$

$$
c'^{<t>} = \Gamma'_f * c'^{<t+1>} + \Gamma'_u * \tilde {c'}^{<t>}
$$

$$
a'^{<t>} = \Gamma'_o * \tanh c'^{<t>}
$$

**Prediction**

$$
\hat y^{<t>} = g_y (W_y a^{<t>} + W'_y a'^{<t>} + b_y)
$$
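
A compact NumPy sketch of the two passes is given below. It assumes the LSTM update of section 4.4 with the logistic sigmoid for the gates; the helper names (`lstm_step`, `run_lstm`, `init_lstm`), the sizes and the random weights are assumptions for illustration. The same step function is run over the sequence in both directions with separate parameters, and the two output states are concatenated. Applying one matrix to the concatenation $[a^{<t>}; a'^{<t>}]$ is equivalent to the sum $W_y a^{<t>} + W'_y a'^{<t>}$ in the prediction formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, a_prev, c_prev, p):
    """One LSTM timestep; p holds W_g, W_xg, b_g for g in {f, u, o, c}."""
    gamma_f = sigmoid(p["W_f"] @ a_prev + p["W_xf"] @ x + p["b_f"])
    gamma_u = sigmoid(p["W_u"] @ a_prev + p["W_xu"] @ x + p["b_u"])
    gamma_o = sigmoid(p["W_o"] @ a_prev + p["W_xo"] @ x + p["b_o"])
    c_tilde = np.tanh(p["W_c"] @ a_prev + p["W_xc"] @ x + p["b_c"])
    c = gamma_f * c_prev + gamma_u * c_tilde
    a = gamma_o * np.tanh(c)
    return a, c

def run_lstm(xs, p, n_a, reverse=False):
    """Output states for every timestep, reading xs left to right or right to left."""
    T = len(xs)
    order = range(T - 1, -1, -1) if reverse else range(T)
    a, c = np.zeros(n_a), np.zeros(n_a)
    states = np.zeros((T, n_a))
    for t in order:
        a, c = lstm_step(xs[t], a, c, p)
        states[t] = a
    return states

def init_lstm(n_x, n_a, rng):
    """Random placeholder parameters (hypothetical initialisation)."""
    p = {}
    for g in "fuoc":
        p["W_" + g] = rng.randn(n_a, n_a) * 0.1
        p["W_x" + g] = rng.randn(n_a, n_x) * 0.1
        p["b_" + g] = np.zeros(n_a)
    return p

rng = np.random.RandomState(0)
n_x, n_a, T = 4, 3, 5
xs = rng.randn(T, n_x)
p_fwd, p_bwd = init_lstm(n_x, n_a, rng), init_lstm(n_x, n_a, rng)

fwd = run_lstm(xs, p_fwd, n_a)                  # a^{<t>}
bwd = run_lstm(xs, p_bwd, n_a, reverse=True)    # a'^{<t>}
features = np.concatenate([fwd, bwd], axis=1)   # [a^{<t>}; a'^{<t>}] for each t
print(features.shape)
```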

# References

[1] http://viet.jnlp.org/download-du-lieu-tu-vung-corpus, *Vietnamese Corpus* 

[2] http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[3] I. Goodfellow, Y. Bengio, A. Courville, *Deep Learning*

[4] A. Ng, K. Katanforoosh, Y. Bensouda Mourri, *Sequence Models* (course on coursera.org)

[5] R. Atienza, *Advanced Deep Learning with Keras*