# Practical 5.2 Modeling Text

# Sequence classification model

In [1]:
from __future__ import print_function

## 1. Character-level sentiment classification

### IMDB user review data set

We will use the character sequences of IMDB text reviews to predict whether a review is positive (class label = 1) or negative (class label = 0). Download the data set from https://storage.googleapis.com/trl_data/imdb_dataset.zip, then run Practical 5.1 to preprocess the data.

In [2]:
import os
import sys
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 100
import re
import nltk

DATA_PATH = 'data'
EMBEDDING_PATH = 'embedding'
MODEL_PATH = 'model'

## 2. Read preprocessed data

In [3]:
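# read the stored character-level vocabulary index
# (the file holds a pickled Python object, so recent NumPy versions need allow_pickle=True)

np_indices_char = np.load(os.path.join(DATA_PATH,'indices_char.npy'), allow_pickle=True)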

import collections

indices_char = collections.OrderedDict()
vocab = np_indices_char.item()  # unwrap the stored Python object once
for i in range(len(vocab)):
    indices_char[i] = vocab[i]

# reverse mapping: character -> index
char_indices = dict((c, i) for i, c in indices_char.items())
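
As a quick sanity check on the two mappings, an index sequence can be decoded back into text. The following is a minimal sketch with a hypothetical toy vocabulary; the real `indices_char` comes from Practical 5.1. Treating index 0 as a padding symbol is an assumption here, not something the stored vocabulary confirms.

```python
import collections

# Hypothetical toy vocabulary standing in for the one loaded from indices_char.npy
toy_indices_char = collections.OrderedDict(enumerate(['<pad>', 'g', 'o', 'd', ' ']))
toy_char_indices = dict((c, i) for i, c in toy_indices_char.items())

def encode(text, char_to_index):
    """Map each character to its integer index."""
    return [char_to_index[c] for c in text]

def decode(indices, index_to_char, pad_index=0):
    """Map indices back to characters, skipping the assumed padding index."""
    return ''.join(index_to_char[i] for i in indices if i != pad_index)

# Two padding positions followed by the encoded word
encoded = [0, 0] + encode('good', toy_char_indices)
decoded = decode(encoded, toy_indices_char)
print(encoded, decoded)
```

In the notebook, the same `decode` helper could be applied to a row of `X_train` with `indices_char` to confirm that preprocessing produced readable reviews.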

In [4]:
X_train = np.load(os.path.join(DATA_PATH,'X_train_char.npy'))
y_train = np.load(os.path.join(DATA_PATH,'y_train_char.npy'))

X_valid = np.load(os.path.join(DATA_PATH,'X_valid_char.npy'))
y_valid = np.load(os.path.join(DATA_PATH,'y_valid_char.npy'))

In [5]:
# here we use only a smaller subset to train our model
# the original set consists of 25,000 reviews

X_train = X_train[:5000]
y_train = y_train[:5000]

X_valid = X_valid[5000:6000]
y_valid = y_valid[5000:6000]

## 3. Character-level Recurrent Neural Networks (RNN) model

In [6]:
from keras.models import Model
from keras.layers import Dense, Input, Dropout
from keras.layers import LSTM, Lambda
import tensorflow as tf
import keras.callbacks

Using TensorFlow backend.


In [7]:
num_chars = len(char_indices)
max_sequence_length = 500
rnn_dim = 32
batch_size = 64
epochs = 5

In [8]:
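def binarize(x, sz=num_chars):
    # one-hot encode integer character indices: (batch, seq_len) -> (batch, seq_len, sz)
    # tf.cast is used instead of tf.to_float, which is not available in TensorFlow 2
    return tf.cast(tf.one_hot(x, sz, on_value=1, off_value=0, axis=-1), tf.float32)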

In [9]:
def binarize_outshape(in_shape):
    return in_shape[0], in_shape[1], num_chars
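
To make the shape change concrete, here is a minimal NumPy sketch of the same one-hot step outside of TensorFlow. The function name `one_hot_numpy` and the toy batch are illustrative assumptions; the notebook itself performs this step inside the `Lambda` layer.

```python
import numpy as np

def one_hot_numpy(x, num_classes):
    """NumPy analogue of the one-hot step: (batch, seq_len) -> (batch, seq_len, num_classes)."""
    # Row i of the identity matrix is the one-hot vector for index i
    return np.eye(num_classes, dtype=np.float32)[x]

# Hypothetical batch: 2 sequences of length 3 over a 4-symbol vocabulary
toy_batch = np.array([[0, 2, 1],
                      [3, 3, 0]])
encoded = one_hot_numpy(toy_batch, num_classes=4)
print(encoded.shape)   # one extra trailing axis of size num_classes
print(encoded[0, 1])   # one-hot vector for index 2
```

With the notebook's settings, the same transformation turns a `(batch, 500)` integer array into a `(batch, 500, num_chars)` float array, which is the shape that `binarize_outshape` reports to Keras.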

### LSTM model (Keras sequential model)

In [10]:
from keras.models import Sequential
from keras.layers import Dense, Lambda
from keras.layers import LSTM

model = Sequential()
model.add(Lambda(binarize, output_shape=binarize_outshape, name='char_embedding',
                 input_shape=(max_sequence_length,), dtype='int32'))
model.add(LSTM(rnn_dim, name='lstm_layer'))
model.add(Dense(1 , name='prediction_layer', activation='sigmoid'))

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
char_embedding (Lambda)      (None, 500, 71)           0         
_________________________________________________________________
lstm_layer (LSTM)            (None, 32)                13312     
_________________________________________________________________
prediction_layer (Dense)     (None, 1)                 33        
Total params: 13,345
Trainable params: 13,345
Non-trainable params: 0
_________________________________________________________________
None
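
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel, and a bias) and takes the input dimension 71 from the `char_embedding` output shape above; the helper names are illustrative.

```python
def lstm_param_count(input_dim, units):
    # 4 gates (input, forget, cell candidate, output), each with
    # an input kernel, a recurrent kernel, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

lstm_params = lstm_param_count(input_dim=71, units=32)   # 13312
dense_params = dense_param_count(input_dim=32, units=1)  # 33
print(lstm_params, dense_params, lstm_params + dense_params)
```

The `Lambda` layer contributes no parameters, so the one-hot "embedding" is fixed and only the LSTM and the output layer are trained.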


#### Compile model

In [11]:
model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['accuracy'])

#### Train model

In [12]:
model.fit(X_train, y_train, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=epochs)

Train on 5000 samples, validate on 1000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f2418f370f0>

### LSTM model (Keras functional API)

This is the same model architecture, built with the more modular Keras functional API (and compiled here with the Adam optimizer instead of RMSprop).

In [13]:
# construct architecture
input_layer = Input(shape=(max_sequence_length, ), name='input_layer', dtype='int32')
char_embedding = Lambda(binarize, output_shape=binarize_outshape,name='char_embedding')(input_layer)
lstm_layer = LSTM(rnn_dim, name='lstm_layer')(char_embedding)
output_layer = Dense(1, name='prediction_layer', activation='sigmoid')(lstm_layer)

# define the model from its input and output tensors
lstm_model = Model(inputs=input_layer, outputs=output_layer)
lstm_model.summary()

# compile model
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_layer (InputLayer)     (None, 500)               0         
_________________________________________________________________
char_embedding (Lambda)      (None, 500, 71)           0         
_________________________________________________________________
lstm_layer (LSTM)            (None, 32)                13312     
_________________________________________________________________
prediction_layer (Dense)     (None, 1)                 33        
Total params: 13,345
Trainable params: 13,345
Non-trainable params: 0
_________________________________________________________________


In [14]:
lstm_model.fit(X_train, y_train, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=epochs)

Train on 5000 samples, validate on 1000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f24151a0048>

#### Discussion

Discuss the result of model training. What could be the reason this model does not converge?
Try adding more layers (Dropout, Dense), adding more data, or changing hyperparameters. Does it help?
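
When judging whether a model has learned anything, it helps to compare its validation accuracy against a trivial baseline that always predicts the most frequent class. The following is a minimal sketch; the helper name and the toy labels are hypothetical and do not come from the IMDB data.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    y = np.asarray(y).astype(int).ravel()
    counts = np.bincount(y, minlength=2)   # counts[0] negatives, counts[1] positives
    return counts.max() / float(counts.sum())

# Hypothetical labels, for illustration only
toy_labels = np.array([1, 0, 1, 1, 0, 1, 0, 1])
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

In the notebook, `majority_baseline_accuracy(y_valid)` gives the number to beat: a validation accuracy that stays close to it suggests the network is not extracting a useful signal.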

In [16]:
# construct architecture
input_layer = Input(shape=(max_sequence_length, ), name='input_layer', dtype='int32')
char_embedding = Lambda(binarize, output_shape=binarize_outshape,name='char_embedding')(input_layer)
lstm_layer = LSTM(rnn_dim, name='lstm_layer')(char_embedding)
output = Dropout(0.5)(lstm_layer)
output = Dense(128, activation='relu')(output)
output = Dropout(0.5)(output)
output_layer = Dense(1, name='prediction_layer', activation='sigmoid')(output)

# define the model from its input and output tensors
lstm_model = Model(inputs=input_layer, outputs=output_layer)
lstm_model.summary()

# compile model
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_layer (InputLayer)     (None, 500)               0         
_________________________________________________________________
char_embedding (Lambda)      (None, 500, 71)           0         
_________________________________________________________________
lstm_layer (LSTM)            (None, 32)                13312     
_________________________________________________________________
dropout_3 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               4224      
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
prediction_layer (Dense)     (None, 1)                 129       
Total para

In [17]:
lstm_model.fit(X_train, y_train, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=epochs)

Train on 5000 samples, validate on 1000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f2405d7d550>

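In this setup (a single 32-unit LSTM over one-hot characters, trained on 5,000 reviews for 5 epochs), the model simply cannot capture a high-level abstraction such as sentiment polarity from character sequences.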

Think about how sentiment polarity is conveyed in this type of text review.
What factors could play an important role in capturing the sentiment of the corresponding text?

Can we do better with shorter text?
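
One way to explore this question is to cut each padded sequence down to a shorter window before training. The sketch below is only an illustration: the helper name is hypothetical, and whether the informative characters sit at the start or the end of each row depends on how Practical 5.1 padded and truncated the reviews, which is not shown here.

```python
import numpy as np

def shorten_sequences(X, new_length, keep='tail'):
    """Keep only the last (or first) new_length positions of each padded sequence."""
    if keep == 'tail':
        return X[:, -new_length:]
    return X[:, :new_length]

# Hypothetical stand-in for X_train: 2 sequences of length 6
toy_X = np.arange(12).reshape(2, 6)
short_tail = shorten_sequences(toy_X, 3)
short_head = shorten_sequences(toy_X, 3, keep='head')
print(short_tail)
print(short_head)
```

If this is applied to `X_train` and `X_valid`, the model's input shape (`max_sequence_length`) has to be changed to the new length as well.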