# Recurrent neural network
by <b style="font-size:14px" align="left"> Ievgen Terpil </b>

### Table of Contents

1. Neural networks for sentiment analysis
2. Recurrent neural network
    1. LSTM
    2. Pre-trained embeddings
    3. Bidirectional and stacked LSTM
3. Attention mechanism

---

## 1. Neural networks for sentiment analysis

<img width="90%" src="https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/workshop/img/MLP.png?raw=true" />

In [1]:
e = 32
h_1 = 128
h_2 = 128

In [2]:
MAX_NB_WORDS = 20000
MAX_SEQUENCE_LENGTH = 30

In [3]:
from keras.models import Sequential
from keras.layers import Dense
from keras.models import model_from_json
from keras.layers import Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.callbacks import ModelCheckpoint, EarlyStopping

Using Theano backend.


In [41]:
mlp = Sequential()
mlp.add(Embedding(MAX_NB_WORDS, e, input_length=MAX_SEQUENCE_LENGTH))
mlp.add(Flatten())
mlp.add(Dense(h_1, activation='relu'))
mlp.add(Dense(h_2, activation='relu'))
mlp.add(Dense(1, activation='sigmoid'))
mlp.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

<img width="80%" src="http://adilmoujahid.com/images/activation.png"/>

In [5]:
mlp.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 30, 32)        640000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 960)           0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 128)           123008      flatten_1[0][0]                  
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 128)           16512       dense_1[0][0]                    
___________________________________________________________________________________________
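
The parameter counts in the summary can be reproduced by hand. A minimal sketch, assuming the usual rules (one vector per vocabulary index for the embedding, one weight per input-output pair plus one bias per output for a dense layer); the helpers `embedding_params` and `dense_params` are made up for illustration and are not Keras functions:

```python
# Hypothetical helpers for illustration; they are not part of Keras.
def embedding_params(vocab_size, dim):
    # One dense vector of length `dim` per word index.
    return vocab_size * dim

def dense_params(n_inputs, n_outputs):
    # Weight matrix plus one bias per output unit.
    return n_inputs * n_outputs + n_outputs

MAX_NB_WORDS = 20000
MAX_SEQUENCE_LENGTH = 30
e, h_1, h_2 = 32, 128, 128

# Flatten concatenates the 30 word vectors of size 32 into one vector.
flattened = MAX_SEQUENCE_LENGTH * e

print(embedding_params(MAX_NB_WORDS, e))  # 640000
print(flattened)                          # 960
print(dense_params(flattened, h_1))       # 123008
print(dense_params(h_1, h_2))             # 16512
```

By the same rule the final `Dense(1)` layer, which is cut off in the output above, would hold 128 + 1 = 129 parameters. Most of the weights sit in the embedding matrix.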

---

In [6]:
import pickle
import re
import nltk

import pandas as pd
import numpy as np

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

Switch on full-text mode, so that pandas shows the whole review instead of truncating it:

In [7]:
pd.options.display.max_colwidth = -1

## Load data

In [8]:
train_data = pd.read_csv('../data/movie_reviews.csv')
test_data = pd.read_csv('../data/test.csv')

In [9]:
train_data.shape

(152610, 2)

In [10]:
test_data.shape

(10660, 2)

---

## Preprocessing

In [11]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()

In [12]:
negatives = {
    "didn't": "didn_`_t",
    "couldn't": "couldn_`_t",
    "can't": "can_`_t",
    "don't": "don_`_t",
    "wouldn't": "wouldn_`_t",
    "doesn't": "doesn_`_t",
    "wasn't": "wasn_`_t",
    "weren't": "weren_`_t",
    "shouldn't":"shouldn_`_t",
    "isn't": "isn_`_t",
    "aren't": "aren_`_t",
}

def preprocess(text):
    text = text.lower()
    text = text.replace('<br />', ' ')
    text = ' '.join(tweet_tokenizer.tokenize(text))
    for k, v in negatives.items():
        text = text.replace(k, v)
    return text

In [13]:
preprocess("""A sentimental crowd-pleaser, well-directed by Le McCarey, this tale about a priest (Bing Crosby) assigned to a problematic parish was so popular that Paramount reteamed the same players for The Bells of St. Mary's.""")

"a sentimental crowd-pleaser , well-directed by le mccarey , this tale about a priest ( bing crosby ) assigned to a problematic parish was so popular that paramount reteamed the same players for the bells of st . mary's ."

In [14]:
train_data.loc[:, 'prep_text'] = train_data['text'].map(preprocess)
test_data.loc[:, 'prep_text'] = test_data['text'].map(preprocess)

### Padding data

The Keras [Embedding layer](https://keras.io/layers/embeddings/) turns positive integers (indexes) into dense vectors of fixed size, so the text has to be converted in two steps:

* 1) First, convert the words to indexes
* 2) Then pad the index sequences to the same length

['not bad', 'movie is bad'] -> [[0, 4, 10], [2, 3, 4]] -> [[0.25, 0.1], [0.6, -0.2]]
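
The two steps can be sketched in plain Python. This is a simplified stand-in for what `Tokenizer` and `pad_sequences` do, not their actual implementation: the function names are hypothetical, ties between equally frequent words are broken alphabetically (an assumption made here for reproducibility), and padding and truncation happen at the front, which is the Keras default.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by frequency; index 0 is reserved for padding.
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index):
    # Words that are not in the index are dropped.
    return [word_index[w] for w in text.split() if w in word_index]

def pad(sequence, maxlen):
    # Keep the last `maxlen` items and left-pad with zeros.
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ['not bad', 'movie is bad']
word_index = build_word_index(texts)
sequences = [to_sequence(t, word_index) for t in texts]
padded = [pad(s, maxlen=4) for s in sequences]

print(word_index)  # {'bad': 1, 'is': 2, 'movie': 3, 'not': 4}
print(padded)      # [[0, 0, 4, 1], [0, 3, 2, 1]]
```

Every row now has the same length, so the batch can be fed to the Embedding layer, which looks up one vector per index.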

#### 1. String -> Int vector

In [15]:
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS, filters='"#$%&()*+-/:;<=>@[\\]^{|}~\t\n,.')
tokenizer.fit_on_texts(train_data['prep_text'])
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 132209 unique tokens.


In [16]:
nb_words = min(MAX_NB_WORDS, len(word_index))

In [17]:
# save tokenizer (the `with` block closes the file automatically)
with open('tokenizer', 'wb') as ofile:
    pickle.dump(tokenizer, ofile)

In [18]:
sequences_train = tokenizer.texts_to_sequences(train_data['prep_text'])
sequences_test = tokenizer.texts_to_sequences(test_data['prep_text'])

In [19]:
tokenizer.texts_to_sequences(['not bad movie'])

[[20, 82, 17]]

#### 2. Padding

In [20]:
padded_sequences_train = pad_sequences(sequences_train, maxlen=MAX_SEQUENCE_LENGTH)
padded_sequences_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)

In [21]:
def padding(text):
    return pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=MAX_SEQUENCE_LENGTH)

In [22]:
padding('not bad movie')

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 20, 82, 17]], dtype=int32)

In [23]:
x_train = padded_sequences_train
x_test = padded_sequences_test
y_train = train_data['label']
y_test = test_data['label']

In [35]:
labels = to_categorical(np.asarray(y_train))
print('Shape of data tensor:', x_train.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (152610, 30)
Shape of label tensor: (152610, 2)


---

### Train

In [42]:
%%time
mlp.fit(x_train, y_train, 
                     nb_epoch=3,
                     batch_size=128,
                     verbose=2,
                     validation_data=(x_test, y_test))

Train on 152610 samples, validate on 10660 samples
Epoch 1/3
10s - loss: 0.4742 - acc: 0.7654 - val_loss: 0.4395 - val_acc: 0.7891
Epoch 2/3
10s - loss: 0.3431 - acc: 0.8455 - val_loss: 0.4455 - val_acc: 0.7925
Epoch 3/3
11s - loss: 0.1881 - acc: 0.9228 - val_loss: 0.5636 - val_acc: 0.7985
CPU times: user 41.5 s, sys: 1.85 s, total: 43.4 s
Wall time: 33.1 s


<keras.callbacks.History at 0x12d937e80>
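
In the log above the training loss keeps falling while `val_loss` rises after the first epoch, which suggests the MLP starts to overfit almost immediately. `EarlyStopping` is imported at the top of the notebook but never used; the rule behind it can be sketched in plain Python. This is an illustrative sketch, not the Keras implementation, and `best_epoch_with_patience` is a hypothetical helper.

```python
def best_epoch_with_patience(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float('inf')
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values copied from the MLP training log above
mlp_val_losses = [0.4395, 0.4455, 0.5636]
print(best_epoch_with_patience(mlp_val_losses, patience=1))  # (2, 1)
```

Note that `val_acc` still creeps up (0.7891 to 0.7985) while `val_loss` gets worse, so which metric to monitor is a judgment call.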

In [43]:
def predict(model, text):
    return model.predict(padding(preprocess(text)))[0][0]

In [44]:
predict(mlp, 'awesome film')

0.94094473

In [45]:
predict(mlp, 'this film is bad')

0.050300647

In [46]:
predict(mlp, 'this film is not bad')

0.2203566

In [105]:
predict(mlp, "i think the movie is good")

0.53700453

In [106]:
predict(mlp, "i don't think the movie is good")

0.43518016

## 2. Recurrent neural network

<img width="90%" src="https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/workshop/img/RNN.png?raw=true" />

### LSTM
[Long Short Term Memory networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They work tremendously well on a large variety of problems, and are now widely used.


<table>
    <tr>
        <th><b>Classic RNN</b></th>
        <th><b>LSTM</b></th>
    </tr>
    <tr>
        <td><img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png"></td>
        <td><img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png"></td>
    </tr>
</table>

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

#### LSTM unit
<img width="60%" src="https://cdn-images-1.medium.com/max/1600/1*laH0_xXEkFE0lKJu54gkFQ.png">
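
What happens inside one LSTM unit at a single timestep can be written out in a few lines. The sketch below follows the standard formulation from the blog post linked above (input, forget and output gates plus a candidate cell value) with plain Python lists; it is not how Keras lays out its weights, it ignores dropout, and the names `affine`, `lstm_step` and `params` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(x, h, W, U, b):
    # Computes W.x + U.h + b row by row, using plain lists.
    return [
        sum(w * x_k for w, x_k in zip(W_row, x))
        + sum(u * h_k for u, h_k in zip(U_row, h))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    # params[name] is a (W, U, b) triple for each of the four blocks.
    i = [sigmoid(z) for z in affine(x, h_prev, *params['input'])]
    f = [sigmoid(z) for z in affine(x, h_prev, *params['forget'])]
    o = [sigmoid(z) for z in affine(x, h_prev, *params['output'])]
    g = [math.tanh(z) for z in affine(x, h_prev, *params['candidate'])]
    # New cell state: keep part of the old memory, add part of the candidate.
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    # New hidden state: a gated view of the cell state.
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c

# Toy sizes: 2 input features, 1 hidden unit, all weights zero.
zero_block = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero_block for name in ('input', 'forget', 'output', 'candidate')}

h, c = lstm_step(x=[1.0, -1.0], h_prev=[0.0], c_prev=[2.0], params=params)
print(c)  # [1.0]: with zero weights every gate is 0.5, so half the memory is kept
print(h)
```

The cell state `c` is only scaled by the forget gate and added to, which is what lets information survive over many timesteps.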

---

In [148]:
from keras.layers import LSTM

In [163]:
lstm = Sequential()
lstm.add(Embedding(nb_words, e, input_length=MAX_SEQUENCE_LENGTH))
lstm.add(LSTM(128, dropout_U=0.2, dropout_W=0.2))
lstm.add(Dropout(0.2))
lstm.add(Dense(1, activation='sigmoid'))
lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [164]:
lstm.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_17 (Embedding)         (None, 30, 32)        640000      embedding_input_12[0][0]         
____________________________________________________________________________________________________
lstm_17 (LSTM)                   (None, 128)           82432       embedding_17[0][0]               
____________________________________________________________________________________________________
dropout_12 (Dropout)             (None, 128)           0           lstm_17[0][0]                    
____________________________________________________________________________________________________
dense_18 (Dense)                 (None, 1)             129         dropout_12[0][0]                 
Total params: 722,561
Trainable params: 722,561
Non-trainable params: 0
___________________
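
The `82432` parameters of the LSTM layer follow from its structure: four blocks (three gates and the cell candidate), each with input weights, recurrent weights and a bias. A minimal sketch of that arithmetic, where `lstm_params` is a hypothetical helper and not a Keras function:

```python
def lstm_params(input_dim, units):
    # Four blocks (input, forget and output gates, cell candidate),
    # each with input weights, recurrent weights and a bias vector.
    return 4 * (input_dim * units + units * units + units)

e, units = 32, 128
embedding = 20000 * e        # one vector per word index
dense = units * 1 + 1        # final Dense(1) layer

print(lstm_params(e, units))                      # 82432
print(embedding + lstm_params(e, units) + dense)  # 722561
```

The same formula with 300-dimensional inputs gives 219,648, the value reported for the word2vec model later in the notebook.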

In [165]:
%%time
lstm.fit(x_train, y_train, 
                     nb_epoch=4,
                     batch_size=128,
                     verbose=2,
                     validation_data=(x_test, y_test))

Train on 152610 samples, validate on 10660 samples
Epoch 1/4
183s - loss: 0.4854 - acc: 0.7612 - val_loss: 0.4330 - val_acc: 0.7985
Epoch 2/4
188s - loss: 0.3934 - acc: 0.8196 - val_loss: 0.3991 - val_acc: 0.8179
Epoch 3/4
188s - loss: 0.3511 - acc: 0.8413 - val_loss: 0.4012 - val_acc: 0.8189
Epoch 4/4
193s - loss: 0.3177 - acc: 0.8583 - val_loss: 0.4114 - val_acc: 0.8191
CPU times: user 13min 51s, sys: 2min 27s, total: 16min 18s
Wall time: 12min 42s


<keras.callbacks.History at 0x24418c3c8>

In [166]:
predict(lstm, 'awesome film')

0.93790126

In [167]:
predict(lstm, 'this film is bad')

0.057851244

In [168]:
predict(lstm, 'this film is not bad')

0.46220678

In [169]:
predict(lstm, "i think the movie is good")

0.86614245

In [170]:
predict(lstm, "i don't think the movie is good")

0.47535789

### Pre-trained embeddings

You can download Google’s pre-trained word2vec model (300-dimensional vectors trained on Google News) [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). It’s **1.5GB**!

In [85]:
from gensim.models import KeyedVectors

In [86]:
word2vec_google = KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin.gz', binary=True)
word2vec_google.init_sims(replace=True)

In [87]:
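def get_embedding(word2vec_model, word):
    # Return the normalized word2vec vector, or zeros for unknown words.
    try:
        return word2vec_model.word_vec(word)
    except KeyError:
        return np.zeros(word2vec_model.syn0norm.shape[1])


# Row i holds the vector of the word with tokenizer index i;
# row 0 stays all-zero because index 0 is reserved for padding.
embedding_weights_google = np.zeros((nb_words, word2vec_google.syn0norm.shape[1]))
for word, i in word_index.items():
    if i >= nb_words:  # outside the vocabulary kept by the tokenizer
        continue
    embedding_weights_google[i] = get_embedding(word2vec_google, word)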
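
The important detail in the cell above is the alignment: row `i` of the weight matrix must hold the vector of the word whose tokenizer index is `i`. A toy sketch with a plain dictionary standing in for the word2vec model (the vectors are made-up numbers and `build_embedding_matrix` is a hypothetical helper):

```python
def build_embedding_matrix(word_index, vectors, nb_words, dim):
    # Row i holds the vector of the word with index i; row 0 (padding)
    # and rows of words missing from `vectors` stay all-zero.
    matrix = [[0.0] * dim for _ in range(nb_words)]
    for word, i in word_index.items():
        if i >= nb_words:
            continue  # outside the vocabulary kept by the tokenizer
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

toy_vectors = {'bad': [0.1, -0.7], 'movie': [0.4, 0.2]}  # made-up numbers
toy_word_index = {'bad': 1, 'movie': 2, 'blargh': 3, 'rare': 4}

matrix = build_embedding_matrix(toy_word_index, toy_vectors, nb_words=4, dim=2)
for row in matrix:
    print(row)
# [0.0, 0.0]    index 0: padding
# [0.1, -0.7]   index 1: 'bad'
# [0.4, 0.2]    index 2: 'movie'
# [0.0, 0.0]    index 3: 'blargh' has no pre-trained vector
```

Words without a pre-trained vector end up as zero rows, and with `trainable=False` they stay that way during training.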

In [91]:
lstm_word2vec = Sequential()
lstm_word2vec.add(Embedding(nb_words, 300,
                        mask_zero=True,
                        weights=[embedding_weights_google],
                        trainable=False))
lstm_word2vec.add(LSTM(128, dropout_U=0.2, dropout_W=0.2))
lstm_word2vec.add(Dropout(0.2))
lstm_word2vec.add(Dense(1, activation='sigmoid'))
lstm_word2vec.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [92]:
lstm_word2vec.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_6 (Embedding)          (None, None, 300)     6000000     embedding_input_6[0][0]          
____________________________________________________________________________________________________
lstm_4 (LSTM)                    (None, 128)           219648      embedding_6[0][0]                
____________________________________________________________________________________________________
dropout_4 (Dropout)              (None, 128)           0           lstm_4[0][0]                     
____________________________________________________________________________________________________
dense_10 (Dense)                 (None, 1)             129         dropout_4[0][0]                  
Total params: 6,219,777
Trainable params: 219,777
Non-trainable params: 6,000,000
_________

In [93]:
%%time
lstm_word2vec.fit(x_train, y_train, 
                     nb_epoch=20,
                     batch_size=128,
                     verbose=2,
                     validation_data=(x_test, y_test))

Train on 152610 samples, validate on 10660 samples
Epoch 1/20
277s - loss: 0.5380 - acc: 0.7234 - val_loss: 0.4980 - val_acc: 0.7534
Epoch 2/20
307s - loss: 0.4899 - acc: 0.7556 - val_loss: 0.4465 - val_acc: 0.7923
Epoch 3/20
285s - loss: 0.4663 - acc: 0.7703 - val_loss: 0.4539 - val_acc: 0.7800
Epoch 4/20
301s - loss: 0.4498 - acc: 0.7829 - val_loss: 0.4128 - val_acc: 0.8089
Epoch 5/20
281s - loss: 0.4366 - acc: 0.7909 - val_loss: 0.4013 - val_acc: 0.8129
Epoch 6/20
267s - loss: 0.4248 - acc: 0.7981 - val_loss: 0.3970 - val_acc: 0.8172
Epoch 7/20
267s - loss: 0.4140 - acc: 0.8044 - val_loss: 0.3958 - val_acc: 0.8153
Epoch 8/20
266s - loss: 0.4055 - acc: 0.8093 - val_loss: 0.3744 - val_acc: 0.8303
Epoch 9/20
278s - loss: 0.3970 - acc: 0.8146 - val_loss: 0.3817 - val_acc: 0.8264
Epoch 10/20
289s - loss: 0.3894 - acc: 0.8189 - val_loss: 0.3784 - val_acc: 0.8277
Epoch 11/20
284s - loss: 0.3825 - acc: 0.8218 - val_loss: 0.3633 - val_acc: 0.8371
Epoch 12/20
284s - loss: 0.3770 - acc: 0.8258

<keras.callbacks.History at 0x28d56a828>

In [94]:
predict(lstm_word2vec, 'this film is bad')

0.017051214

In [95]:
predict(lstm_word2vec, 'this film is not bad')

0.54443789

In [134]:
predict(lstm_word2vec, "i think the movie is good")

0.89198178

In [141]:
predict(lstm_word2vec, "i don't think the movie is bad")

0.17547244

#### Save model

In [111]:
# save model weights
lstm.save_weights("lstm.h5")

In [112]:
# serialize model to JSON
lstm_json = lstm.to_json()
with open("lstm.json", "w") as json_file:
    json_file.write(lstm_json)

### Bidirectional and stacked LSTM

<img width="90%" src="https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/workshop/img/BiRNN.png?raw=true" />

In [113]:
from keras.layers import Bidirectional

In [114]:
blstm = Sequential()
blstm.add(Embedding(nb_words, e, input_length=MAX_SEQUENCE_LENGTH))
blstm.add(Bidirectional(LSTM(128, dropout_U=0.2, dropout_W=0.2)))
blstm.add(Dropout(0.2))
blstm.add(Dense(1, activation='sigmoid'))
blstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [115]:
blstm.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_7 (Embedding)          (None, 30, 32)        640000      embedding_input_7[0][0]          
____________________________________________________________________________________________________
bidirectional_1 (Bidirectional)  (None, 256)           164864      embedding_7[0][0]                
____________________________________________________________________________________________________
dropout_5 (Dropout)              (None, 256)           0           bidirectional_1[0][0]            
____________________________________________________________________________________________________
dense_11 (Dense)                 (None, 1)             257         dropout_5[0][0]                  
Total params: 805,121
Trainable params: 805,121
Non-trainable params: 0
___________________


<img width="90%" src="https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/workshop/img/stacked.png?raw=true" />

In [119]:
slstm = Sequential()
slstm.add(Embedding(nb_words, e, input_length=MAX_SEQUENCE_LENGTH))
slstm.add(LSTM(128, dropout_U=0.2, dropout_W=0.2, return_sequences=True))
slstm.add(Dropout(0.2))
slstm.add(LSTM(128, dropout_U=0.2, dropout_W=0.2))
slstm.add(Dropout(0.2))
slstm.add(Dense(1, activation='sigmoid'))
slstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [120]:
slstm.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_10 (Embedding)         (None, 30, 32)        640000      embedding_input_10[0][0]         
____________________________________________________________________________________________________
lstm_10 (LSTM)                   (None, 30, 128)       82432       embedding_10[0][0]               
____________________________________________________________________________________________________
dropout_9 (Dropout)              (None, 30, 128)       0           lstm_10[0][0]                    
____________________________________________________________________________________________________
lstm_11 (LSTM)                   (None, 128)           131584      dropout_9[0][0]                  
___________________________________________________________________________________________
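
The data flow of both variants can be shown with a toy recurrent cell. The sketch below is not an LSTM: `toy_cell` just blends each input into a running state (and therefore needs inputs of the same size as the state), and all function names are hypothetical. It only illustrates why the bidirectional output is twice as wide and why the first layer of a stack has to return the whole sequence.

```python
def toy_cell(x, state):
    # Stand-in for an LSTM step: blend the input into the running state.
    return [0.5 * s + 0.5 * x_k for s, x_k in zip(state, x)]

def run_rnn(inputs, units, return_sequences=False, go_backwards=False):
    state = [0.0] * units
    outputs = []
    for x in (reversed(inputs) if go_backwards else inputs):
        state = toy_cell(x, state)
        outputs.append(state)
    # Either one state per timestep, or only the final state.
    return outputs if return_sequences else outputs[-1]

def bidirectional(inputs, units):
    forward = run_rnn(inputs, units)
    backward = run_rnn(inputs, units, go_backwards=True)
    return forward + backward  # concatenation: 2 * units values

def stacked(inputs, units):
    # The first layer passes its full sequence on to the second layer.
    first = run_rnn(inputs, units, return_sequences=True)
    return run_rnn(first, units)

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 timesteps, 2 features

print(bidirectional(sequence, units=2))  # [0.625, 0.75, 0.625, 0.375]
print(len(run_rnn(sequence, units=2, return_sequences=True)))  # 3
print(len(stacked(sequence, units=2)))   # 2
```

This matches the shapes in the summaries above: `(None, 256)` for the bidirectional layer (two 128-unit LSTMs, hence 2 × 82432 = 164864 parameters) and `(None, 30, 128)` for the first layer of the stack.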

## 3. Attention mechanism

<img width="90%" src="https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/workshop/img/attention.png?raw=true" />

In [159]:
from keras import backend as K, initializations
from keras.layers import Layer
class AttentionLayer(Layer):
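    '''
    Attention layer: scores every timestep with exp(tanh(x . Uw + b)),
    zeroes the masked (padded) steps and normalizes the scores over time.
    Returns one weight per timestep, shape (batch, timesteps, 1).
    '''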

    def __init__(self, init='glorot_uniform', **kwargs):
        super(AttentionLayer, self).__init__(**kwargs)
        self.supports_masking = True
        self.init = initializations.get(init)

    def build(self, input_shape):
        self.Uw = self.init((input_shape[-1],))
        self.b = self.init((input_shape[1],))
        self.trainable_weights = [self.Uw, self.b]
        super(AttentionLayer, self).build(input_shape)

    def compute_mask(self, input, mask):
        return mask

    def call(self, x, mask=None):
        multData = K.exp(K.tanh(K.dot(x, self.Uw) + self.b))
        if mask is not None:
            multData = mask * multData
        output = multData / (K.sum(multData, axis=1) + K.epsilon())[:, None]
        return K.reshape(output, (output.shape[0], output.shape[1], 1))

    def get_output_shape_for(self, input_shape):
        newShape = list(input_shape)
        newShape[-1] = 1
        return tuple(newShape)
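
The computation in `call` can be followed on a single toy example in plain Python. This sketch mirrors the formula of the layer above for one sequence (no batch dimension); the function names and the numbers are made up, and the constant `1e-7` stands in for `K.epsilon()`.

```python
import math

def attention_weights(states, u, b, mask=None):
    # One score per timestep: exp(tanh(h_t . u + b_t)).
    scores = [
        math.exp(math.tanh(sum(h_k * u_k for h_k, u_k in zip(h_t, u)) + b_t))
        for h_t, b_t in zip(states, b)
    ]
    if mask is not None:
        # Padded timesteps get a score of zero.
        scores = [m * s for m, s in zip(mask, scores)]
    total = sum(scores) + 1e-7
    return [s / total for s in scores]

def weighted_sum(states, weights):
    # Document vector: attention-weighted sum of the timestep states.
    dim = len(states[0])
    return [sum(w * h_t[k] for w, h_t in zip(weights, states)) for k in range(dim)]

states = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # 3 timesteps, 2 features
u = [1.0, -1.0]                                # plays the role of Uw
b = [0.0, 0.0, 0.0]                            # one bias per timestep
mask = [0, 1, 1]                               # first timestep is padding

weights = attention_weights(states, u, b, mask)
print(weights)
print(weighted_sum(states, weights))
```

The padded step gets weight 0 and the remaining weights sum to (almost exactly) 1. Since `Uw` has one entry per feature and `b` one entry per timestep, the layer applied to the `(30, 256)` bidirectional output has 256 + 30 = 286 parameters.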

In [160]:
from keras.models import Model
from keras.layers import Input, merge, Lambda

In [161]:
wordsInputs = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='words_input')
emb = Embedding(nb_words, e, mask_zero=True)(wordsInputs)
word_rnn = Bidirectional(LSTM(128, dropout_U=0.2, dropout_W=0.2, return_sequences=True))(emb)
attention = AttentionLayer()(word_rnn)
doc_emb = merge([word_rnn, attention], mode=lambda x: x[1] * x[0], output_shape=lambda x: x[0])
doc_emb = Lambda(lambda x: K.sum(x, axis=1), output_shape=lambda x: (x[0], x[2]))(doc_emb)
output = Dense(1, activation="sigmoid")(doc_emb)

model = Model(input=[wordsInputs], output=[output])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [162]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
words_input (InputLayer)         (None, 30)            0                                            
____________________________________________________________________________________________________
embedding_16 (Embedding)         (None, 30, 32)        640000      words_input[0][0]                
____________________________________________________________________________________________________
bidirectional_5 (Bidirectional)  (None, 30, 256)       164864      embedding_16[0][0]               
____________________________________________________________________________________________________
attentionlayer_4 (AttentionLayer (None, 30, 1)         286         bidirectional_5[0][0]            
___________________________________________________________________________________________