## Selecting a deep learning model with MLP

The next step in our research is deep learning. At this stage we try a multilayer perceptron (**MLP**) model built with Keras.

#### Import libraries and custom scripts, define constants

In [2]:
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

import tensorflow.keras.backend as K

from tensorflow.keras.preprocessing.text import text_to_word_sequence
from sklearn.model_selection import train_test_split


import re

In [3]:
import os,sys,inspect
currentdir=os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir=os.path.dirname(currentdir)
sys.path.insert(0,parentdir)
from src import preprocessing

In [4]:
# Define constants
RANDOM_STATE = 11
TARGET_METRIC = 'f1'
TEST_SIZE = 0.15


#### Loading the data

In [5]:
# import & display data
data = pd.read_csv('../../data/IMDB_Dataset.csv')
data['sentiment'] = data['sentiment'].replace({'positive' : 1, 'negative' : 0})
data = data.drop_duplicates()
data.head()

| | review | sentiment |
|---|---|---|
| 0 | `One of the other reviewers has mentioned that ...` | 1 |
| 1 | `A wonderful little production. <br /><br />The...` | 1 |
| 2 | `I thought this was a wonderful way to spend ti...` | 1 |
| 3 | `Basically there's a family where a little boy ...` | 0 |
| 4 | `Petter Mattei's "Love in the Time of Money" is...` | 1 |


#### Split the data into training, validation and test sets

In [6]:
X = data.review
y = data.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=TEST_SIZE, 
                                                    random_state=RANDOM_STATE, 
                                                    stratify = y)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, 
                                                    y_train,
                                                    test_size=TEST_SIZE, 
                                                    random_state=RANDOM_STATE, 
                                                    stratify = y_train)
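
Because the same `TEST_SIZE` is applied twice, the validation set is 15% of the remaining 85% of the data, not 15% of the whole dataset. The sketch below reproduces the sample counts that appear later in the training logs (35,822 training and 6,322 validation samples). It assumes that, for a fractional `test_size`, `train_test_split` rounds the held-out part up and leaves the remainder for training; the helper name `split_sizes` is ours and not part of the project.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up and the training part
    receives whatever is left.
    """
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

TEST_SIZE = 0.15

# Rows left after the first split (training + validation), taken from
# the counts printed in the training logs: 35822 + 6322.
n_train_plus_valid = 35822 + 6322
n_train, n_valid = split_sizes(n_train_plus_valid, TEST_SIZE)
print(n_train, n_valid)

# Share of the full dataset that ends up in each subset.
train_fraction = (1 - TEST_SIZE) ** 2          # about 72%
valid_fraction = (1 - TEST_SIZE) * TEST_SIZE   # about 13%
test_fraction = TEST_SIZE                      # 15%
print(round(train_fraction, 4), round(valid_fraction, 4), test_fraction)
```

If a validation set of exactly 15% of all rows were wanted, the second call would need a larger fraction (0.15 / 0.85).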

#### Preprocessing Data

For pre-processing and vectorization, we reuse the approach from the previous stage that achieved the best performance.

For pre-processing:
* the removal of html-tags,
* the separation of numbers and letters.  

For vectorization we use `TfidfVectorizer(ngram_range=(1,2))`, here limited to the 20,000 most frequent unigrams and bigrams via `max_features`.

In [7]:
MAX_FEATURES = 20000

vectorizer = TfidfVectorizer(ngram_range=(1,2), preprocessor=preprocessing.preprocessing_text, max_features=MAX_FEATURES)

vectorizer.fit(X_train)
X_train_features = vectorizer.transform(X_train)
X_valid_features = vectorizer.transform(X_valid)
X_test_features = vectorizer.transform(X_test)
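
The `preprocessing.preprocessing_text` function lives in the project's `src` package and is not shown in this notebook. As a rough illustration only, a preprocessor covering the two steps listed above (removing HTML tags, separating numbers from letters) might look like the sketch below. The name `preprocess_text_sketch` and the regular expressions are assumptions, not the project's actual code.

```python
import re

def preprocess_text_sketch(text):
    """Hypothetical stand-in for preprocessing.preprocessing_text."""
    # Replace HTML tags such as <br /> with a space.
    text = re.sub(r'<[^>]+>', ' ', text)
    # Insert a space between a digit and a following letter ("10out" -> "10 out").
    text = re.sub(r'(\d)([A-Za-z])', r'\1 \2', text)
    # Insert a space between a letter and a following digit ("of10" -> "of 10").
    text = re.sub(r'([A-Za-z])(\d)', r'\1 \2', text)
    # Lowercase explicitly (see the note below the block).
    return text.lower()

sample = "10out of10<br />Fine"
print(preprocess_text_sketch(sample))
```

Lowercasing is done inside the function because, to our understanding, passing a callable as `preprocessor` replaces the vectorizer's built-in lowercasing and accent-stripping step.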

#### Creating the model
Define a function that creates a simple MLP model.

In [8]:
def mlp_model(layers, units, dropout_rate, input_shape):
    model = Sequential()
    model.add(Dropout(rate=dropout_rate, input_shape=input_shape))
    for _ in range(layers-1):
        model.add(Dense(units=units, activation='relu'))
        model.add(Dropout(rate=dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    return model
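
Note that the loop runs `layers - 1` times, so with `layers=1` no hidden layer is added and `units` has no effect: the network reduces to dropout followed by a single sigmoid unit. The sketch below counts the trainable parameters this architecture implies, which can be compared with the `model.summary()` output of the runs in this notebook. The helper name `mlp_param_count` is ours.

```python
def mlp_param_count(layers, units, input_dim):
    """Count weights and biases of the network built by mlp_model.

    Dropout layers have no parameters; each Dense layer has
    fan_in * fan_out weights plus fan_out biases.
    """
    total = 0
    fan_in = input_dim
    for _ in range(layers - 1):          # hidden Dense layers
        total += fan_in * units + units
        fan_in = units
    total += fan_in * 1 + 1              # final sigmoid unit
    return total

print(mlp_param_count(layers=1, units=32, input_dim=20000))  # 20001
print(mlp_param_count(layers=1, units=64, input_dim=20000))  # 20001
print(mlp_param_count(layers=2, units=64, input_dim=20000))  # 1280129
```

The identical count for `units=32` and `units=64` at `layers=1` means that runs differing only in `units` train the same architecture, so any difference between them comes from random initialization and dropout noise.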

In [9]:
# F1 metric computed with Keras backend ops (evaluated per batch)
def get_f1(y_true, y_pred): 
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val
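
To make the arithmetic in `get_f1` easier to follow, here is a plain-Python sketch of the same computation on a small hand-made example. The function name `f1_from_probabilities` and the sample values are illustrative assumptions. Thresholding at 0.5 stands in for `K.round` on sigmoid outputs.

```python
def f1_from_probabilities(y_true, y_prob, eps=1e-7):
    """F1 for binary labels, mirroring the steps of get_f1."""
    # Round probabilities to hard 0/1 predictions.
    y_pred = [1 if p > 0.5 else 0 for p in y_prob]
    true_positives = sum(t * p for t, p in zip(y_true, y_pred))
    possible_positives = sum(y_true)
    predicted_positives = sum(y_pred)
    precision = true_positives / (predicted_positives + eps)
    recall = true_positives / (possible_positives + eps)
    return 2 * precision * recall / (precision + recall + eps)

# Made-up example: one true positive, one false negative,
# one false positive, one true negative.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.1]
print(round(f1_from_probabilities(y_true, y_prob), 4))  # 0.5
```

Keras evaluates a metric function like `get_f1` on each batch and reports the average over batches, so the logged `get_f1` and `val_get_f1` values may differ slightly from an F1 score computed once over the whole set, for example with the `f1_score` function imported from scikit-learn above.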

Define a function that trains the model.

In [10]:
def train_model(data,
                learning_rate=1e-3,
                epochs=100,
                batch_size=128,
                layers=2,
                units=32,
                dropout_rate=0.3,
                filepath='model.hdf5',
                log_dir='logs/mlp1/fit'):
    # Get the data.
    (x_train, train_labels), (x_val, val_labels) = data

    # Create model instance.
    K.clear_session()
    model = mlp_model(layers=layers,
                      units=units,
                      dropout_rate=dropout_rate,
                      input_shape=x_train.shape[1:])

    # Compile model with learning parameters.
    loss = 'binary_crossentropy'
    optimizer = Adam(lr=learning_rate)
    model.compile(optimizer=optimizer, loss=loss, metrics=[get_f1])

    # Create callbacks: TensorBoard logging, early stopping on validation
    # loss, and checkpointing of the best model seen so far.
    tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)
    early_stop = EarlyStopping(monitor='val_loss',
                               patience=3)
    cp_callback = ModelCheckpoint(filepath=filepath,
                                  save_best_only=True,
                                  verbose=1)
    callbacks = [cp_callback, early_stop, tensorboard_callback]
    
    model.summary()
    
    # Train and validate model.
    history = model.fit(
            x_train,
            train_labels,
            epochs=epochs,
            callbacks=callbacks,
            validation_data=(x_val, val_labels),
            verbose=2,  # Logs once per epoch.
            batch_size=batch_size)

    return history
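
With `patience=3`, training is expected to stop after three consecutive epochs without an improvement in `val_loss`, while `ModelCheckpoint(save_best_only=True)` keeps the weights from the best epoch on disk. The following plain-Python sketch imitates that rule on a made-up loss sequence. It is a simplified model of the callback (no `min_delta`, no weight restoring), and the function name is ours.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if the patience limit is never reached.
    """
    best_loss = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses, not taken from the runs in this notebook.
made_up_losses = [0.60, 0.50, 0.40, 0.41, 0.42, 0.43, 0.30]
stop_epoch, best_epoch = simulate_early_stopping(made_up_losses)
print(stop_epoch, best_epoch)  # 6 3
```

In this example training stops at epoch 6 and never sees the better loss at epoch 7. The checkpoint file would hold the weights from epoch 3, whereas the in-memory model would hold the weights from epoch 6, so the saved file is the one to load for evaluation.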

#### Train different models and save the results

Model1:  
* learning_rate=1e-3,
* epochs=100,
* batch_size=128,
* layers=1,
* units=32,
* dropout_rate=0.3.


In [11]:
result1 = train_model(((X_train_features, y_train), (X_valid_features, y_valid)),
                      filepath='models/model1.hdf5',
                      layers=1,
                      log_dir='logs/mlp_5_5/model1')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 20001     
Total params: 20,001
Trainable params: 20,001
Non-trainable params: 0
_________________________________________________________________
Train on 35822 samples, validate on 6322 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.61387, saving model to models/model1.hdf5
 - 15s - loss: 0.6527 - get_f1: 0.7689 - val_loss: 0.6139 - val_get_f1: 0.8632
Epoch 2/100

Epoch 00002: val_loss improved from 0.61387 to 0.55469, sa

Model2:  
* learning_rate=1e-3,
* epochs=100,
* batch_size=128,
* layers=1,
* units=64,
* dropout_rate=0.3.


In [12]:
result2 = train_model(((X_train_features,y_train), (X_valid_features, y_valid)),
                      units = 64,
                      layers=1,
                      filepath='models/model2.hdf5',
                      log_dir='logs/mlp_5_5/model2')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 20001     
Total params: 20,001
Trainable params: 20,001
Non-trainable params: 0
_________________________________________________________________
Train on 35822 samples, validate on 6322 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.61410, saving model to models/model2.hdf5
 - 15s - loss: 0.6527 - get_f1: 0.8214 - val_loss: 0.6141 - val_get_f1: 0.8592
Epoch 2/100

Epoch 00002: val_loss improved from 0.61410 to 0.55457, saving model to models/model2.hdf5
 - 14s - loss: 0.5820 - get_f1: 0.8594 - val_loss: 0.5546 - val_get_f1: 0.8714
Epoch 3/100

Epoch 00003: val_loss improved from 0.55457 to 0.50892, saving model to models/model2.hdf5
 - 15s - loss: 0.52

Model3:  
* learning_rate=5e-3,
* epochs=100,
* batch_size=128,
* layers=1,
* units=32,
* dropout_rate=0.3.

In [14]:
result3 = train_model(((X_train_features,y_train), (X_valid_features, y_valid)),
                      learning_rate=5e-3,
                      units = 32,
                      layers=1,
                      filepath='models/model3.hdf5',
                      log_dir='logs/mlp_5_5/model3')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 20001     
Total params: 20,001
Trainable params: 20,001
Non-trainable params: 0
_________________________________________________________________
Train on 35822 samples, validate on 6322 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.44230, saving model to models/model3.hdf5
 - 15s - loss: 0.5469 - get_f1: 0.8411 - val_loss: 0.4423 - val_get_f1: 0.8835
Epoch 2/100

Epoch 00002: val_loss improved from 0.44230 to 0.35710, saving model to models/model3.hdf5
 - 14s - loss: 0.3937 - get_f1: 0.8852 - val_loss: 0.3571 - val_get_f1: 0.8965
Epoch 3/100

Epoch 00003: val_loss improved from 0.35710 to 0.31404, saving model to models/model3.hdf5
 - 15s - loss: 0.32

Model4:  
* learning_rate=1e-2,
* epochs=100,
* batch_size=128,
* layers=1,
* units=32,
* dropout_rate=0.3.

In [15]:
result4 = train_model(((X_train_features,y_train), (X_valid_features, y_valid)),
                      learning_rate=1e-2,
                      units = 32,
                      layers=1,
                      filepath='models/model4.hdf5',
                      log_dir='logs/mlp_5_5/model4')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 20001     
Total params: 20,001
Trainable params: 20,001
Non-trainable params: 0
_________________________________________________________________
Train on 35822 samples, validate on 6322 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.35482, saving model to models/model4.hdf5
 - 14s - loss: 0.4753 - get_f1: 0.8335 - val_loss: 0.3548 - val_get_f1: 0.8968
Epoch 2/100

Epoch 00002: val_loss improved from 0.35482 to 0.28818, saving model to models/model4.hdf5
 - 14s - loss: 0.3142 - get_f1: 0.8998 - val_loss: 0.2882 - val_get_f1: 0.9102
Epoch 3/100

Epoch 00003: val_loss improved from 0.28818 to 0.25876, saving model to models/model4.hdf5
 - 14s - loss: 0.26

Model5:  
* learning_rate=5e-3,
* epochs=100,
* batch_size=128,
* layers=1,
* units=64,
* dropout_rate=0.3.

In [16]:
result5 = train_model(((X_train_features,y_train), (X_valid_features, y_valid)),
                      learning_rate=5e-3,
                      units = 64,
                      layers=1,
                      filepath='models/model5.hdf5',
                      log_dir='logs/mlp_5_5/model5')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 20001     
Total params: 20,001
Trainable params: 20,001
Non-trainable params: 0
_________________________________________________________________
Train on 35822 samples, validate on 6322 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.44277, saving model to models/model5.hdf5
 - 15s - loss: 0.5468 - get_f1: 0.8407 - val_loss: 0.4428 - val_get_f1: 0.8834
Epoch 2/100

Epoch 00002: val_loss improved from 0.44277 to 0.35755, saving model to models/model5.hdf5
 - 14s - loss: 0.3944 - get_f1: 0.8841 - val_loss: 0.3576 - val_get_f1: 0.8965
Epoch 3/100

Epoch 00003: val_loss improved from 0.35755 to 0.31447, saving model to models/model5.hdf5
 - 14s - loss: 0.33

Model6:  
* learning_rate=5e-3,
* epochs=100,
* batch_size=128,
* layers=2,
* units=64,
* dropout_rate=0.3.

In [17]:
result6 = train_model(((X_train_features,y_train), (X_valid_features, y_valid)),
                      learning_rate=5e-3,
                      units = 64,
                      layers=2,
                      filepath='models/model6.hdf5',
                      log_dir='logs/mlp_5_5/model6')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 64)                1280064   
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________
Train on 35822 samples, validate on 6322 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.22292, saving model to models/model6.hdf5
 - 20s - loss: 0.3259 - get_f1: 0.8557 - val_loss: 0.2229 - val_get_f1: 0.9122
Epoch 2/100

Epoch 00002: val_loss i

Model7:  
* learning_rate=5e-3,
* epochs=100,
* batch_size=128,
* layers=2,
* units=64,
* dropout_rate=0.6.

In [18]:
result7 = train_model(((X_train_features,y_train), (X_valid_features, y_valid)),
                      learning_rate=5e-3,
                      units = 64,
                      layers=2,
                      dropout_rate=0.6,
                      filepath='models/model7.hdf5',
                      log_dir='logs/mlp_5_5/model7')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 64)                1280064   
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________
Train on 35822 samples, validate on 6322 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.25987, saving model to models/model7.hdf5
 - 20s - loss: 0.4104 - get_f1: 0.8184 - val_loss: 0.2599 - val_get_f1: 0.9007
Epoch 2/100

Epoch 00002: val_loss i

Model8:  
* learning_rate=1e-3,
* epochs=100,
* batch_size=128,
* layers=1,
* units=32,
* dropout_rate=0.1.

In [21]:
result8 = train_model(((X_train_features, y_train), (X_valid_features, y_valid)),
                      filepath='models/model8.hdf5',
                      layers=1,
                      dropout_rate=0.1,
                      units=32,
                      learning_rate=1e-3,
                      log_dir='logs/mlp_5_5/model8')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 20001     
Total params: 20,001
Trainable params: 20,001
Non-trainable params: 0
_________________________________________________________________
Train on 35822 samples, validate on 6322 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.60705, saving model to models/model8.hdf5
 - 15s - loss: 0.6489 - get_f1: 0.8203 - val_loss: 0.6070 - val_get_f1: 0.8616
Epoch 2/100

Epoch 00002: val_loss improved from 0.60705 to 0.54405, saving model to models/model8.hdf5
 - 15s - loss: 0.5729 - get_f1: 0.8635 - val_loss: 0.5440 - val_get_f1: 0.8702
Epoch 3/100

Epoch 00003: val_loss improved from 0.54405 to 0.49639, saving model to models/model8.hdf5
 - 14s - loss: 0.51
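
Each `train_model` call returns a Keras `History` object whose `history` attribute is a dictionary of per-epoch lists, which makes it possible to compare the runs programmatically. The sketch below picks the epoch with the lowest `val_loss` from such a dictionary. The helper name `best_validation_epoch` is ours, and the example dictionary contains only the first two epochs of Model4 as they appear in the log above, not the complete run.

```python
def best_validation_epoch(history):
    """Return (epoch, metrics) for the epoch with the lowest val_loss.

    `history` is a dict of per-epoch lists, as in History.history.
    """
    losses = history['val_loss']
    idx = min(range(len(losses)), key=losses.__getitem__)
    return idx + 1, {name: values[idx] for name, values in history.items()}

# First two epochs of Model4, copied from the training log above.
model4_partial = {
    'loss': [0.4753, 0.3142],
    'get_f1': [0.8335, 0.8998],
    'val_loss': [0.3548, 0.2882],
    'val_get_f1': [0.8968, 0.9102],
}
epoch, metrics = best_validation_epoch(model4_partial)
print(epoch, metrics['val_loss'], metrics['val_get_f1'])  # 2 0.2882 0.9102
```

In the notebook the same helper could be applied to `result1.history` through `result8.history` to tabulate the best epoch of every run.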