# Recurrent neural network
by <b style="font-size:14px" align="left"> Ievgen Terpil </b>

### Table of Contents

1. Neural networks overview
2. Recurrent neural network
    1. LSTM
    2. Pre-trained embeddings
    3. Bidirectional and stacked LSTM
3. Attention mechanism

---

## 1. Neural networks for sentiment analysis

<img width="90%" src="https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/workshop/img/MLP.png?raw=true" />

---

In [2]:
DATASETS = ["data/test_imdb.csv", "data/train_imdb.csv", "data/test_rt_en.csv", "data/train_rt_en.csv"]

In [3]:
import nltk
import numpy
import pickle
import re
numpy.random.seed(42)

from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from itertools import chain
from collections import Counter

from sklearn.model_selection import train_test_split


from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

Using Theano backend.


Switch on full-text display mode:

In [4]:
# Show the full text of each cell; newer pandas releases expect None here instead of -1
pd.options.display.max_colwidth = -1

## Load data

In [5]:
datasets = []
for dataset in DATASETS:    
    datasets.append(pd.read_csv(dataset, sep="|"))

In [6]:
whole_data = pd.concat(datasets)

In [7]:
for data in datasets:
    print(data.shape)

(10000, 3)
(40000, 3)
(19798, 3)
(79190, 3)


In [11]:
# train_test_split returns the `train_size` portion first, so the 10% split is bound to test_data
test_data, train_data = train_test_split(whole_data, train_size=0.1, random_state=42)

In [12]:
test_data.shape

(14898, 3)

In [13]:
train_data.shape

(134090, 3)

---

## Preprocessing

In [42]:
MAX_NB_WORDS = 20000
MAX_SEQUENCE_LENGTH = 30

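Lowercase the text, replace `<br />` tags with spaces, and re-tokenize with NLTK's `TweetTokenizer`. Punctuation is filtered later by the Keras `Tokenizer`.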

In [43]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()

In [44]:
def preprocess(text):
    text = text.lower()
    text = text.replace('<br />', ' ')
    text = ' '.join(tweet_tokenizer.tokenize(text))
    return text

In [45]:
preprocess("""A sentimental crowd-pleaser, well-directed by Le McCarey, this tale about a priest (Bing Crosby) assigned to a problematic parish was so popular that Paramount reteamed the same players for The Bells of St. Mary's.""")

"a sentimental crowd-pleaser , well-directed by le mccarey , this tale about a priest ( bing crosby ) assigned to a problematic parish was so popular that paramount reteamed the same players for the bells of st . mary's ."

In [46]:
train_data['prep_text'] = train_data['text'].map(preprocess)
test_data['prep_text'] = test_data['text'].map(preprocess)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


### Padding data

The Keras [Embedding layer](https://keras.io/layers/embeddings/) turns positive integers (indexes) into dense vectors of fixed size, so the text is prepared in two steps:

* 1) Convert words to integer indexes
* 2) Pad the index sequences to a fixed length

['not bad', 'movie is bad'] -> [[0, 4, 10], [2, 3, 4]] -> [[0.25, 0.1], [0.6, -0.2]]
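The two steps above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: `build_word_index` and `to_padded` are hypothetical helpers, the alphabetical tie-break between equally frequent words is an assumption, and the left-side padding and truncation mirror the defaults of `pad_sequences`.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer; more frequent words get smaller indexes."""
    counts = Counter(word for text in texts for word in text.split())
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Index 0 is reserved for padding, so real words start at 1.
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}


def to_padded(text, word_index, maxlen):
    """Convert text to indexes, keep the last `maxlen` of them, pad on the left with 0."""
    seq = [word_index[w] for w in text.split() if w in word_index]
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


texts = ['not bad', 'movie is bad']
word_index = build_word_index(texts)
padded = [to_padded(t, word_index, maxlen=3) for t in texts]
print(word_index)
print(padded)
```

The shorter text gets a leading zero so both rows have the same length, which is what the Embedding layer needs as input.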

#### 1. String -> Int vector

In [47]:
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS, filters='"#$%&()*+-/:;<=>@[\\]^{|}~\t\n,.')
tokenizer.fit_on_texts(train_data['prep_text'])
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 117664 unique tokens.


In [48]:
nb_words = min(MAX_NB_WORDS, len(word_index))

In [49]:
# Save the fitted tokenizer; the `with` block closes the file automatically
with open('tokenizer', 'wb') as ofile:
    pickle.dump(tokenizer, ofile)
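To reuse the same word indexes at prediction time, the pickled tokenizer has to be loaded back. A minimal round-trip sketch follows; the dictionary `fitted` is only a stand-in for the fitted Keras tokenizer, and the temporary path is an assumption for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer: any picklable object round-trips the same way.
fitted = {'word_index': {'bad': 1, 'film': 2, 'not': 3}}

path = os.path.join(tempfile.mkdtemp(), 'tokenizer')

# Write in binary mode, as in the cell above.
with open(path, 'wb') as ofile:
    pickle.dump(fitted, ofile)

# Read it back in binary mode.
with open(path, 'rb') as ifile:
    restored = pickle.load(ifile)

print(restored == fitted)
```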

#### 2. Padding

In [50]:
sequences_train = tokenizer.texts_to_sequences(train_data['prep_text'])
sequences_test = tokenizer.texts_to_sequences(test_data['prep_text'])

padded_sequences_train = pad_sequences(sequences_train, maxlen=MAX_SEQUENCE_LENGTH)
padded_sequences_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)

In [51]:
def padding(text):
    return pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=MAX_SEQUENCE_LENGTH)

In [52]:
x_train = padded_sequences_train
x_test = padded_sequences_test
y_train = train_data['label']
y_test = test_data['label']

---

## Model

In [58]:
EMBEDDING_DIM = 32

In [59]:
from keras.models import Sequential
from keras.layers import Dense
from keras.models import model_from_json
from keras.layers import Flatten, Dropout
from keras.layers import LSTM, Bidirectional
from keras.layers.embeddings import Embedding
from keras.callbacks import ModelCheckpoint, TensorBoard, EarlyStopping

In [60]:
labels = to_categorical(np.asarray(train_data['label']))
print('Shape of data tensor:', padded_sequences_train.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (134090, 30)
Shape of label tensor: (134090, 2)


In [73]:
mlp = Sequential()
mlp.add(Embedding(nb_words, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
mlp.add(Flatten())
mlp.add(Dense(128, activation='relu'))
mlp.add(Dense(128, activation='relu'))
mlp.add(Dense(1, activation='sigmoid'))
mlp.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'fmeasure', 'precision', 'recall'])

In [70]:
mlp.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_7 (Embedding)          (None, 30, 32)        640000      embedding_input_7[0][0]          
____________________________________________________________________________________________________
flatten_7 (Flatten)              (None, 960)           0           embedding_7[0][0]                
____________________________________________________________________________________________________
dense_15 (Dense)                 (None, 128)           123008      flatten_7[0][0]                  
____________________________________________________________________________________________________
dense_16 (Dense)                 (None, 128)           16512       dense_15[0][0]                   
___________________________________________________________________________________________
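The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic with the values used in this notebook; `embedding_params` and `dense_params` are hypothetical helper names, and the output layer size is taken from the model definition.

```python
def embedding_params(vocab_size, dim):
    # One `dim`-sized vector per word index.
    return vocab_size * dim


def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out


nb_words, embedding_dim, seq_len = 20000, 32, 30
flat = seq_len * embedding_dim  # Flatten turns (30, 32) into a single vector

counts = {
    'embedding': embedding_params(nb_words, embedding_dim),
    'dense_1': dense_params(flat, 128),
    'dense_2': dense_params(128, 128),
    'output': dense_params(128, 1),
}
print(flat, counts)
```

Most of the weights sit in the embedding table, so the vocabulary size dominates the size of this model.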

In [71]:
%%time
mlp.fit(x_train, y_train, 
                     nb_epoch=2,
                     batch_size=128,
                     verbose=1,
                     validation_data=(x_test, y_test))

Train on 134090 samples, validate on 14898 samples
Epoch 1/2
Epoch 2/2
CPU times: user 27.4 s, sys: 1.97 s, total: 29.4 s
Wall time: 26.6 s


<keras.callbacks.History at 0x122a81828>

In [72]:
def predict(model, text):
    return model.predict(padding(preprocess(text)))[0][0]

In [83]:
predict(mlp, 'awesome film')

0.91119802

In [97]:
predict(mlp, 'bad film')

0.22034332

In [98]:
predict(mlp, 'not bad film')

0.19305691

## 2. Recurrent neural network

<img width="90%" src="https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/workshop/img/RNN.png?raw=true" />

### LSTM
[Long Short Term Memory networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/), usually just called "LSTMs", are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.


<table>
    <tr>
        <th><b>Classic RNN</b></th>
        <th><b>LSTM</b></th>
    </tr>
    <tr>
        <td><img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png"></td>
        <td><img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png"></td>
    </tr>
</table>

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

#### LSTM unit
<img width="60%" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png">
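To make the gating idea concrete, here is a minimal NumPy sketch of one standard LSTM time step, looped over a 30-step sequence. It is an illustration under assumptions, not the Keras implementation: the function name `lstm_step`, the gate ordering inside the stacked matrices, and the random weights are all made up for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    # Assumed gate order in the stacked matrices: input, forget, candidate, output.
    i = sigmoid(z[0 * units:1 * units])   # how much new information to write
    f = sigmoid(z[1 * units:2 * units])   # how much of the old cell state to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # how much of the cell state to expose
    c = f * c_prev + i * g                # updated cell state (long-term memory)
    h = o * np.tanh(c)                    # updated hidden state (step output)
    return h, c


rng = np.random.RandomState(0)
input_dim, units, steps = 32, 128, 30
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x in rng.normal(size=(steps, input_dim)):
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape)
```

The cell state `c` is only scaled and added to at each step, which is what lets information survive across many time steps.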

---

In [86]:
lstm = Sequential()
lstm.add(Embedding(nb_words, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
lstm.add(LSTM(128, dropout_U=0.2, dropout_W=0.2))
lstm.add(Dropout(0.2))
lstm.add(Dense(1, activation='sigmoid'))
lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'fmeasure', 'precision', 'recall'])

In [87]:
lstm.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_9 (Embedding)          (None, 30, 32)        640000      embedding_input_9[0][0]          
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 128)           82432       embedding_9[0][0]                
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 128)           0           lstm_1[0][0]                     
____________________________________________________________________________________________________
dense_21 (Dense)                 (None, 1)             129         dropout_1[0][0]                  
Total params: 722,561
Trainable params: 722,561
Non-trainable params: 0
___________________
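The LSTM parameter count in the summary follows from the four gate blocks, each with input weights, recurrent weights, and a bias. A short check, using a hypothetical helper `lstm_params`; the second call uses the 300-dimensional input of the pre-trained embedding model later in this notebook.

```python
def lstm_params(input_dim, units):
    # Four blocks (input, forget, candidate, output), each with
    # input weights, recurrent weights and a bias vector.
    return 4 * (units * input_dim + units * units + units)


small = lstm_params(32, 128)    # trainable 32-dimensional embeddings
large = lstm_params(300, 128)   # 300-dimensional word2vec embeddings

# Embedding table + LSTM + Dense(1) for the model above.
total_small = 20000 * 32 + small + (128 * 1 + 1)
print(small, large, total_small)
```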

In [88]:
%%time
lstm.fit(x_train, y_train, 
                     nb_epoch=2,
                     batch_size=128,
                     verbose=1,
                     validation_data=(x_test, y_test))

Train on 134090 samples, validate on 14898 samples
Epoch 1/2
Epoch 2/2
CPU times: user 7min 14s, sys: 1min 16s, total: 8min 30s
Wall time: 6min 40s


<keras.callbacks.History at 0x12e136358>

In [90]:
predict(lstm, 'awesome film')

0.78056192

In [99]:
predict(lstm, 'bad film')

0.16125944

In [101]:
predict(lstm, 'not bad film')

0.11650348

### Pre-trained embeddings

In [103]:
from gensim.models import KeyedVectors

In [104]:
word2vec_google = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
word2vec_google.init_sims(replace=True)

In [105]:
def get_embedding(word2vec_model, word):
    try:
        return word2vec_model.word_vec(word)
    except KeyError:
        return np.zeros(word2vec_model.syn0norm.shape[1])


embedding_weights_google = np.zeros((nb_words, word2vec_google.syn0norm.shape[1]))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_weights_google[i] = get_embedding(word2vec_google, word)
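The same loop can be tried on toy data to see what happens to words that have no pre-trained vector. This is a sketch under assumptions: `toy_vectors`, the tiny `word_index`, and the `missing` list are made up for illustration and stand in for the word2vec model and the real tokenizer output.

```python
import numpy as np

# Hypothetical two-dimensional "pre-trained" vectors.
toy_vectors = {'bad': np.array([0.6, -0.2]), 'film': np.array([0.1, 0.3])}
# Hypothetical tokenizer output; index 0 is reserved for padding.
word_index = {'bad': 1, 'film': 2, 'mccarey': 3, 'rare': 4}
max_nb_words, dim = 4, 2

weights = np.zeros((max_nb_words, dim))
missing = []
for word, i in word_index.items():
    if i >= max_nb_words:
        continue  # outside the kept vocabulary, never looked up by the model
    if word in toy_vectors:
        weights[i] = toy_vectors[word]
    else:
        missing.append(word)  # row stays all zeros

print(weights)
print(missing)
```

Tracking the out-of-vocabulary words this way shows how much of the vocabulary actually benefits from the pre-trained vectors.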

In [109]:
lstm = Sequential()
lstm.add(Embedding(nb_words, 300,
                        mask_zero=True,
                        weights=[embedding_weights_google],
                        trainable=False))
lstm.add(LSTM(128, dropout_U=0.2, dropout_W=0.2))
lstm.add(Dropout(0.2))
lstm.add(Dense(1, activation='sigmoid'))
lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'fmeasure', 'precision', 'recall'])

In [110]:
lstm.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_11 (Embedding)         (None, None, 300)     6000000     embedding_input_11[0][0]         
____________________________________________________________________________________________________
lstm_2 (LSTM)                    (None, 128)           219648      embedding_11[0][0]               
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 128)           0           lstm_2[0][0]                     
____________________________________________________________________________________________________
dense_22 (Dense)                 (None, 1)             129         dropout_2[0][0]                  
Total params: 6,219,777
Trainable params: 219,777
Non-trainable params: 6,000,000
_________

In [111]:
%%time
lstm.fit(x_train, y_train, 
                     nb_epoch=2,
                     batch_size=128,
                     verbose=1,
                     validation_data=(x_test, y_test))

Train on 134090 samples, validate on 14898 samples
Epoch 1/2
Epoch 2/2
CPU times: user 14min 53s, sys: 2min 33s, total: 17min 26s
Wall time: 11min 34s


<keras.callbacks.History at 0x28bce8438>

In [112]:
predict(lstm, 'awesome film')

0.9068529

In [113]:
predict(lstm, 'bad film')

0.17987305

In [122]:
predict(lstm, 'not bad film')

0.14775775

#### Save model

In [128]:
# save model weights
lstm.save_weights("lstm.h5")

In [129]:
# serialize model to JSON
lstm_json = lstm.to_json()
with open("lstm.json", "w") as json_file:
    json_file.write(lstm_json)

### Bidirectional and stacked LSTM

<img width="90%" src="https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/workshop/img/BiRNN.png?raw=true" />
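The bidirectional idea in the figure can be sketched with a plain tanh RNN in NumPy: one pass reads the sequence left to right, a second pass with its own weights reads it right to left, and the two final states are concatenated. The helper `rnn_last_state`, the sizes, and the random weights are assumptions for illustration; concatenation matches the default merge mode of the Keras `Bidirectional` wrapper.

```python
import numpy as np


def rnn_last_state(xs, W, U):
    """Run a simple tanh RNN over `xs` and return the final hidden state."""
    h = np.zeros(U.shape[0])
    for x in xs:
        h = np.tanh(W @ x + U @ h)
    return h


rng = np.random.RandomState(0)
input_dim, units, steps = 4, 3, 5
xs = rng.normal(size=(steps, input_dim))

# Separate weights for each direction.
W_f, U_f = rng.normal(size=(units, input_dim)), rng.normal(size=(units, units))
W_b, U_b = rng.normal(size=(units, input_dim)), rng.normal(size=(units, units))

forward = rnn_last_state(xs, W_f, U_f)         # reads first word to last
backward = rnn_last_state(xs[::-1], W_b, U_b)  # reads last word to first
merged = np.concatenate([forward, backward])

print(merged.shape)
```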