## Explore the best model

The best model was chosen in the previous notebook (`model_selection.ipynb`).

The **best** model is 
    
`SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf', max_iter=-1, probability=False, random_state=11, shrinking=True, tol=0.001, verbose=False)`
                      
`'f1_cv': 0.9162920983650459`
`'f1_test': 0.913498035559699`
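
For context, the `rbf` kernel in the configuration above scores the similarity of two feature vectors as `exp(-gamma * ||x - z||^2)`, here with `gamma=0.1`. The following is a minimal pure-Python sketch of that formula. The function name `rbf_kernel_value` and the toy vectors are illustrative assumptions, not code from this project.

```python
import math

def rbf_kernel_value(x, z, gamma=0.1):
    """Return exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Identical vectors have similarity 1; similarity decays as distance grows.
same = rbf_kernel_value([1.0, 0.0], [1.0, 0.0])
near = rbf_kernel_value([1.0, 0.0], [0.0, 0.0])
far = rbf_kernel_value([3.0, 4.0], [0.0, 0.0])
print(same, near, far)
```

A larger `gamma` makes the similarity fall off faster, so each training review influences a smaller neighbourhood.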

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, KFold

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
import datetime

import re

import pickle

import joblib


In [11]:
import os,sys,inspect
currentdir=os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir=os.path.dirname(currentdir)
sys.path.insert(0,parentdir)
from src import preprocessing

In [5]:
# define constants
RANDOM_STATE = 11
NUMBER_K_FOLD = 5
TARGET_METRIC = 'f1'
TEST_SIZE = 0.3
N_JOBS = 4

In [6]:
# import & display data
data = pd.read_csv('../../data/IMDB_Dataset.csv')
data['sentiment'] = data['sentiment'].replace({'positive' : 1, 'negative' : 0})
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [8]:
data_for_train = data.drop_duplicates()
data_for_train.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [9]:
X = data_for_train.review
y = data_for_train.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=TEST_SIZE, 
                                                    random_state=RANDOM_STATE, 
                                                    stratify = y)
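
`stratify=y` asks `train_test_split` to keep the share of positive and negative reviews roughly the same in the train and test parts. The idea can be illustrated without scikit-learn. The sketch below is a toy under stated assumptions (a hypothetical helper `stratified_split` and made-up labels) and is not how scikit-learn implements it.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=11):
    """Split indices so each class sends about test_size of its items to the test part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 60 positive and 40 negative examples.
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```

Because each class is sampled separately, the test part keeps the 60/40 balance of the full label list.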

In [13]:
best_model = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1,2), preprocessor=preprocessing.preprocessing_text)),
    ('clf', svm.SVC(kernel='rbf', C=10, gamma=0.1, random_state=RANDOM_STATE)),
])
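
The `preprocessor` argument of `TfidfVectorizer` takes a callable that maps a raw string to a cleaned string. It replaces the vectorizer's built-in string preprocessing, so lowercasing has to happen inside that callable if it is wanted. The actual `preprocessing.preprocessing_text` lives in `src` and is not shown in this notebook. The sketch below is only a guess at what such a function might do (dropping the `<br />` tags visible in the reviews, lowercasing, collapsing whitespace), under the hypothetical name `example_preprocess`.

```python
import re

def example_preprocess(text):
    """Hypothetical cleaner: strip HTML tags, lowercase, collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)   # replace tags such as <br /> with a space
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)       # collapse runs of whitespace
    return text.strip()

cleaned = example_preprocess("A wonderful little production. <br /><br />The filming ...")
print(cleaned)
```

A function like this would be passed as `TfidfVectorizer(preprocessor=example_preprocess)`. Tokenization and n-gram generation still happen inside the vectorizer.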


In [14]:
best_model.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2',
        preprocessor=<function prepr...f',
  max_iter=-1, probability=False, random_state=11, shrinking=True,
  tol=0.001, verbose=False))])

In [15]:
f1_score(y_test, best_model.predict(X_test))

0.913498035559699
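
For reference, F1 is the harmonic mean of precision and recall for the positive class. scikit-learn documents the call as `f1_score(y_true, y_pred)`. For this binary score, swapping the two arguments only exchanges precision and recall, so the value is unchanged. The following hand-rolled sketch shows this on made-up labels; `binary_f1` is a hypothetical helper, not part of scikit-learn.

```python
def binary_f1(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
score = binary_f1(y_true, y_pred)
swapped = binary_f1(y_pred, y_true)
print(score, swapped)
```

Here precision is 0.6 and recall is 0.75; after swapping the arguments the two trade places and the harmonic mean stays the same.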

### Serializing our best model

In [20]:
metadata_to_model = {
    'vectorizer': 'TF-IDF vectorizer with ngram_range=(1,2)',
    'model_type': 'SVM with rbf kernel',
    'author': 'Tatsiana Drabysheuskaya',
    'date': '23-04-2020',
    'training_data': 'https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews',
    'metrics_cross_validation': {
        'f1_score': '0.916'
    }
}



In [21]:
data_to_save = {
    'model' : best_model,
    'metadata' : metadata_to_model
}

In [22]:
# save the model to disk with pickle
model_file_name = "../service/model/model_sentiment_prediction.pkl"  

with open(model_file_name, 'wb') as file:  
    pickle.dump(data_to_save, file)

In [23]:
# save the model to disk with joblib
filename = '../service/model/model_sentiment_prediction.sav'
joblib.dump(data_to_save, filename)

['../service/model/model_sentiment_prediction.sav']
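
The saved object is a plain dict with a `'model'` and a `'metadata'` entry, so whatever loads it needs to unpack both. The following is a minimal round-trip sketch of that idea. The helpers `save_bundle` and `load_bundle` are hypothetical names, and a small dict stands in for the fitted pipeline so the example runs without scikit-learn.

```python
import os
import pickle
import tempfile

def save_bundle(path, model, metadata):
    """Pickle a dict with the same two keys the notebook uses: 'model' and 'metadata'."""
    with open(path, 'wb') as file:
        pickle.dump({'model': model, 'metadata': metadata}, file)

def load_bundle(path):
    """Load the dict back and check that both expected keys are present."""
    with open(path, 'rb') as file:
        bundle = pickle.load(file)
    missing = {'model', 'metadata'} - set(bundle)
    if missing:
        raise KeyError(f"bundle is missing keys: {sorted(missing)}")
    return bundle['model'], bundle['metadata']

# Stand-in for the fitted Pipeline (assumption made for this sketch only).
stand_in_model = {'kind': 'placeholder for the fitted Pipeline'}
metadata = {'model_type': 'SVM with rbf kernel', 'author': 'Tatsiana Drabysheuskaya'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_sentiment_prediction.pkl')
    save_bundle(path, stand_in_model, metadata)
    loaded_model, loaded_metadata = load_bundle(path)

print(loaded_metadata['model_type'])
```

With the real file, unpickling needs scikit-learn and the `src.preprocessing` module to be importable, because pickle stores the `preprocessing_text` function by reference rather than by value. Pickle files should only be loaded from trusted sources.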

In [None]:
# pad every run of digits with spaces; assumes `review` already holds a raw review string
review = re.sub(r'(\d+)', r' \1 ', review)
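
The substitution in the cell above wraps every run of digits in spaces, which separates numbers from adjacent punctuation or letters before tokenization. A small self-contained demonstration follows; the helper name `pad_digits` and the sample sentence are made up for illustration.

```python
import re

def pad_digits(text):
    """Surround every run of digits with a space on each side."""
    return re.sub(r'(\d+)', r' \1 ', text)

padded = pad_digits("rated 10/10, shot in 1999")
tokens = padded.split()
print(repr(padded))
print(tokens)
```

The extra spaces this leaves behind are harmless once the text is split on whitespace.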