## The best MLP model for the IMDB dataset

The next step is the final retraining of the best MLP model, which was selected in `imdb_mlp_select_structure.ipynb`.
For the retraining, the model is fit on the training and validation subsets combined. It is trained for 25 epochs.
This is the epoch at which the loss function on the validation subset reached its minimum.
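
Picking the epoch with the lowest validation loss can be sketched as follows. This is a minimal illustration, assuming the per-epoch validation losses are available as a plain list; the values and the helper name `best_epoch` are made up for the example and are not taken from the actual selection run.

```python
# Hypothetical per-epoch validation losses (illustrative values only).
val_losses = [0.62, 0.48, 0.39, 0.33, 0.31, 0.32, 0.34]


def best_epoch(losses):
    """Return the 1-based epoch number with the lowest loss."""
    best_index = min(range(len(losses)), key=lambda i: losses[i])
    return best_index + 1


print(best_epoch(val_losses))
```

With a Keras `History` object, the list would typically come from `history.history['val_loss']`.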

**The model selected in the previous step:**  
* `learning_rate=1e-4`,
* `epochs=1000`,
* `batch_size=128`,
* `layers=3`,
* `units=32`,
* `dropout_rate=0.6`.

#### Import libraries and custom scripts, define constants

In [33]:
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model


import tensorflow.keras.backend as K

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

import re
import pickle
import json
from datetime import date

In [38]:
# pip show tensorflow

In [39]:
import os,sys,inspect
currentdir=os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir=os.path.dirname(currentdir)
sys.path.insert(0,parentdir)
from src import preprocessing

In [40]:
# define constants
RANDOM_STATE = 11
TARGET_METRIC = 'f1'
TEST_SIZE = 0.15
MAX_FEATURES = 50000  # vocabulary size cap; matches max_features in the fitted vectorizer output


#### Loading the data

In [41]:
# import & display data
data = pd.read_csv('../../data/IMDB_Dataset.csv')
data['sentiment'] = data['sentiment'].replace({'positive' : 1, 'negative' : 0})
data = data.drop_duplicates()
data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


#### Split the data into training and test sets

In [42]:
X = data.review
y = data.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=TEST_SIZE, 
                                                    random_state=RANDOM_STATE, 
                                                    stratify = y)
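
The effect of `stratify` is that each class keeps roughly the same share in both subsets. The sketch below illustrates the idea in plain Python on toy labels; the helper `stratified_split` is hypothetical and is not how scikit-learn implements `train_test_split`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) where each class contributes ~test_size of its items to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels: 100 positive and 100 negative reviews.
labels = [1] * 100 + [0] * 100
train_idx, test_idx = stratified_split(labels, test_size=0.15, seed=11)
print(Counter(labels[i] for i in test_idx))
```

Both classes end up with the same number of test items, so a metric such as F1 is measured on a test set with the same class balance as the training data.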


#### Preprocessing Data

For preprocessing and vectorization, we reuse the approach that gave the best performance in the previous stage.

Preprocessing:
* removal of HTML tags,
* separation of numbers and letters.

For vectorization we use `TfidfVectorizer(ngram_range=(1,2))`.
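
The two preprocessing steps can be sketched with regular expressions. The real implementation lives in `src/preprocessing.py` and is not shown here, so the function below (`sketch_preprocess_text`, a hypothetical name) only illustrates what such a preprocessor might do.

```python
import re


def sketch_preprocess_text(text):
    """Illustrative preprocessor: strip HTML tags, then split digit/letter runs."""
    # Replace HTML tags such as <br /> with a space.
    text = re.sub(r'<[^>]+>', ' ', text)
    # Insert a space at every boundary between a digit and a letter.
    text = re.sub(r'(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)', ' ', text)
    # Collapse the extra whitespace introduced above.
    return re.sub(r'\s+', ' ', text).strip()


print(sketch_preprocess_text("Great<br /><br />movie, 10out of10"))
```

A callable of this shape can be passed to `TfidfVectorizer` through its `preprocessor` argument, which is how `preprocessing.preprocessing_text` is used in the next cell.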

In [45]:

mlp_final_preproc = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1,2), 
                             preprocessor=preprocessing.preprocessing_text, 
                             max_features=MAX_FEATURES)),
])

mlp_final_preproc.fit(X_train, y_train)



Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=50000, min_df=1,
        ngram_range=(1, 2), norm='l2',
        preprocessor=<function prep...f=False, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, use_idf=True, vocabulary=None))])

#### Creating the model
Define the model 

In [46]:
# F1 metric built from Keras backend ops
def get_f1(y_true, y_pred): 
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val
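
To make the arithmetic in `get_f1` concrete, here is a plain-Python mirror of the same steps on four toy predictions. It is a sketch for illustration only: `EPSILON` stands in for `K.epsilon()` (assumed to be the usual default of `1e-7`), and the function name is hypothetical.

```python
EPSILON = 1e-7  # stand-in for K.epsilon(); assumed default value


def f1_from_probabilities(y_true, y_prob):
    """Mirror get_f1: round to hard labels, then combine precision and recall."""
    true_positives = sum(round(t * p) for t, p in zip(y_true, y_prob))
    possible_positives = sum(round(t) for t in y_true)
    predicted_positives = sum(round(p) for p in y_prob)
    precision = true_positives / (predicted_positives + EPSILON)
    recall = true_positives / (possible_positives + EPSILON)
    return 2 * (precision * recall) / (precision + recall + EPSILON)


y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.4, 0.7]  # the third positive is missed (0.4 rounds to 0)
print(round(f1_from_probabilities(y_true, y_prob), 3))
```

Here precision is 2/2 and recall is 2/3, so F1 is approximately 0.8.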

In [47]:
def mlp_model(layers=3, units=32, dropout_rate=0.6, input_shape=(MAX_FEATURES, ), learning_rate=1e-4 ):
    model = Sequential()
    model.add(Dropout(rate=dropout_rate, input_shape=input_shape))
    for _ in range(layers-1):
        model.add(Dense(units=units, activation='relu'))
        model.add(Dropout(rate=dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    loss = 'binary_crossentropy'
    optimizer = Adam(lr=learning_rate)
    model.compile(optimizer=optimizer, loss=loss, metrics=[get_f1])
    
    return model

In [48]:
K.clear_session()
final_mlp_model = mlp_model()
final_mlp_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 50000)             0         
_________________________________________________________________
dense (Dense)                (None, 32)                1600032   
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                1056      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 1,601,121
Trainable params: 1,601,121
Non-trainable params: 0
_________________________________________________________________
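
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, and `Dropout` layers have no parameters. A short sketch of that arithmetic, using the layer sizes shown above:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


MAX_FEATURES = 50000
UNITS = 32

layer_params = [
    dense_params(MAX_FEATURES, UNITS),  # dense
    dense_params(UNITS, UNITS),         # dense_1
    dense_params(UNITS, 1),             # dense_2
]
print(layer_params, sum(layer_params))
```

Almost all of the parameters sit in the first layer, which maps the 50,000 TF-IDF features to 32 units.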


#### Train the model

Let's train the final MLP model and save it to the destination folder: *`../service/model/mlp/`*

In [50]:
folder_path = '../service/model/mlp/'

In [51]:
epochs = 25
checkpoint_path = os.path.join(folder_path, 'mlp_final.hdf5')  # avoids the doubled '/' in the path
log_dir = 'logs/mlp_final'

cp_callback = ModelCheckpoint(
    filepath=checkpoint_path, 
    verbose=2, 
    save_weights_only=False,
    period=epochs)  # save once, after the final epoch

tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=0)

final_mlp_model.fit(mlp_final_preproc.transform(X_train),
                    y_train,
                    epochs=epochs,
                    verbose=2, 
                    callbacks=[cp_callback, tensorboard_callback],
                    batch_size=128)

Epoch 1/25
 - 45s - loss: 0.6877 - get_f1: 0.5027
Epoch 2/25
 - 46s - loss: 0.6456 - get_f1: 0.7147
Epoch 3/25
 - 43s - loss: 0.5587 - get_f1: 0.7898
Epoch 4/25
 - 44s - loss: 0.4682 - get_f1: 0.8313
Epoch 5/25
 - 60s - loss: 0.4027 - get_f1: 0.8581
Epoch 6/25
 - 56s - loss: 0.3568 - get_f1: 0.8716
Epoch 7/25
 - 47s - loss: 0.3218 - get_f1: 0.8838
Epoch 8/25
 - 48s - loss: 0.2961 - get_f1: 0.8941
Epoch 9/25
 - 45s - loss: 0.2805 - get_f1: 0.8998
Epoch 10/25
 - 44s - loss: 0.2664 - get_f1: 0.9060
Epoch 11/25
 - 45s - loss: 0.2539 - get_f1: 0.9100
Epoch 12/25
 - 47s - loss: 0.2455 - get_f1: 0.9140
Epoch 13/25
 - 50s - loss: 0.2370 - get_f1: 0.9164
Epoch 14/25
 - 48s - loss: 0.2315 - get_f1: 0.9184
Epoch 15/25
 - 46s - loss: 0.2221 - get_f1: 0.9216
Epoch 16/25
 - 47s - loss: 0.2195 - get_f1: 0.9224
Epoch 17/25
 - 46s - loss: 0.2091 - get_f1: 0.9257
Epoch 18/25
 - 49s - loss: 0.2039 - get_f1: 0.9270
Epoch 19/25
 - 48s - loss: 0.1998 - get_f1: 0.9285
Epoch 20/25
 - 50s - loss: 0.1987 - get_

<tensorflow.python.keras.callbacks.History at 0x220dc31a308>

#### Evaluate the model on the test subset

In [65]:
loss, f1_score = final_mlp_model.evaluate(mlp_final_preproc.transform(X_test),  y_test, verbose=2)
print("F1 score with the best MLP model for test dataset: {:5.2f}%".format(100*f1_score))

 - 3s - loss: 0.2121 - get_f1: 0.9141
F1 score with the best MLP model for test dataset: 91.41%
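
One caveat when reading this number: Keras typically reports a function metric such as `get_f1` as the average of its per-batch values, and that average can differ from the F1 computed over the whole test set at once. The toy example below, with made-up labels and a hypothetical batch size of 4, shows how far the two can drift apart.

```python
def f1(y_true, y_pred):
    """F1 from hard 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy data: the first batch is classified perfectly, the second one badly.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
BATCH_SIZE = 4

batch_scores = [
    f1(y_true[i:i + BATCH_SIZE], y_pred[i:i + BATCH_SIZE])
    for i in range(0, len(y_true), BATCH_SIZE)
]
batch_mean_f1 = sum(batch_scores) / len(batch_scores)
global_f1 = f1(y_true, y_pred)
print(batch_mean_f1, global_f1)
```

On a large, balanced test set the gap is usually small, but a dataset-level score computed from thresholded predictions is the safer number to report.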


#### Save the preprocessing pipeline and model metadata to disk

In [55]:
# save the preprocessing to disk with pickle
mlp_preproc_file_name = folder_path +  "mlp_preproc.pkl"  

with open(mlp_preproc_file_name, 'wb') as file:  
    pickle.dump(mlp_final_preproc, file)

In [66]:
# save the model metadata

metadata_to_model = {
    'vectorizer' : str(mlp_final_preproc.steps[0][1]),
    'model_type': 'MLP NN model',
    'author': 'Tatsiana Drabysheuskaya',
    'date' : str(date.today()),
    'training_data' : 'https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews',
    'metrics_test_data_set': {
        'f1_score': "{:5.2f}%".format(100*f1_score)
    }
    }


metadata_file_name = folder_path + "mlp_model.json"  
    
with open(metadata_file_name, 'w') as file:
    json_string = json.dumps(metadata_to_model, default=lambda o: o.__dict__, sort_keys=True, indent=2)
    file.write(json_string)
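
A consumer of the metadata file only needs the standard `json` module to read it back. The sketch below writes a reduced, illustrative metadata dictionary to a temporary directory and reloads it; the temporary path is an assumption made for the example and is not the project's `folder_path`.

```python
import json
import os
import tempfile

# Reduced metadata for illustration; the real dictionary has more fields.
metadata = {
    'model_type': 'MLP NN model',
    'metrics_test_data_set': {'f1_score': '91.41%'},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mlp_model.json')
    with open(path, 'w') as file:
        json.dump(metadata, file, sort_keys=True, indent=2)
    with open(path) as file:
        loaded = json.load(file)

print(loaded['metrics_test_data_set']['f1_score'])
```

Because every value is stored as a string or a nested dictionary, the file round-trips without any custom decoding.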

In [67]:
# save a few test samples 
test_samples_file_name = folder_path + "mlp_modelTest_samples.csv"  

test_df = pd.DataFrame({'review': X_test, 'sentiment': y_test})
test_df.sample(n=8, random_state=RANDOM_STATE).to_csv(test_samples_file_name) 

In [31]:
folder_path = '../service/model/mlp/'
mlp_preproc_file_name = folder_path +  "mlp_preproc.pkl" 
checkpoint_path = os.path.join(folder_path, 'mlp_final.hdf5')  # avoids the doubled '/' in the path


### Load and check the model

In [40]:
mpl_model_loaded = load_model(checkpoint_path, custom_objects={'get_f1' : get_f1} )


In [39]:
with open(mlp_preproc_file_name, 'rb') as file:
    mlp_preproc_loaded = pickle.load(file)

In [41]:
loss, f1_score = mpl_model_loaded.evaluate(mlp_preproc_loaded.transform(X_test),  y_test, verbose=2)

 - 3s - loss: 0.2121 - get_f1: 0.9141
