## Explore the best model

The best model was chosen in the previous notebook (model_selection.ipynb).

The best model is:
SVC(C=10, break_ties=False, cache_size=200, class_weight=None,
    coef0=0.0, decision_function_shape='ovr', degree=3,
    gamma=0.1, kernel='rbf', max_iter=-1, probability=False,
    random_state=11, shrinking=True, tol=0.001,
    verbose=False)
                      
'f1_cv': 0.9162920983650459,
'f1_test': 0.913498035559699

In [2]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, KFold

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

import re

import pickle

import joblib

In [3]:
# define constants
RANDOM_STATE = 11
NUMBER_K_FOLD = 5
TARGET_METRIC = 'f1'
TEST_SIZE = 0.3
N_JOBS = 4

In [4]:
# import & display data
data = pd.read_csv('data/IMDB_Dataset.csv')
data['sentiment'] = data['sentiment'].replace({'positive' : 1, 'negative' : 0})
data.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. `<br /><br />`The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [5]:
# preprocessing functions

def clean_html_string(review):
    return re.sub(r'<br.*?>', ' ', review)

def split_digit_letters_string(review):
    review = re.sub(r'(\d+)', r' \1 ', review)
    return re.sub(r'_+', r' ', review)

def preprocessing_text(review):
    return clean_html_string(split_digit_letters_string(review.lower()))
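
A quick way to see what these helpers do is to run them on one string. The sketch below repeats the three functions so that it runs on its own; the sample review is made up for illustration and does not come from the dataset.

```python
import re

def clean_html_string(review):
    # Replace <br>, <br/> and <br /> tags with a single space.
    return re.sub(r'<br.*?>', ' ', review)

def split_digit_letters_string(review):
    # Surround every run of digits with spaces, then turn underscores into spaces.
    review = re.sub(r'(\d+)', r' \1 ', review)
    return re.sub(r'_+', r' ', review)

def preprocessing_text(review):
    return clean_html_string(split_digit_letters_string(review.lower()))

# Hypothetical review text, chosen to exercise every rule.
sample = "Great_movie!<br /><br />Rated 10out of 10"
cleaned = preprocessing_text(sample)
print(cleaned.split())
```

The text is lowercased, the `<br />` tags and the underscore become spaces, and `10out` is split into `10` and `out`. Punctuation is left in place at this stage.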

In [13]:
# copy the de-duplicated frame so the assignment below works on an independent
# DataFrame and does not trigger pandas' SettingWithCopyWarning
data_for_train = data.drop_duplicates().copy()
data_for_train['review'] = data_for_train['review'].apply(preprocessing_text)
data_for_train.head()


| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | 1 |
| 1 | a wonderful little production. the filming t... | 1 |
| 2 | i thought this was a wonderful way to spend ti... | 1 |
| 3 | basically there's a family where a little boy ... | 0 |
| 4 | petter mattei's "love in the time of money" is... | 1 |


In [14]:
X = data_for_train.review
y = data_for_train.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=TEST_SIZE, 
                                                    random_state=RANDOM_STATE, 
                                                    stratify = y)
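
The `stratify = y` argument asks for a split in which both parts keep the class proportions of `y`. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not scikit-learn's implementation: the helper `stratified_split` and the toy label list are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Split indices so each class contributes `test_size` of its rows to the test part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 60 positive and 40 negative reviews (made-up numbers).
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=11)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```

Because each class is split separately, the share of positive labels in the test part matches the share in the full label list.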

In [16]:
best_model = pipeline_best = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1,2))),
    ('clf', svm.SVC(kernel='rbf', C=10, gamma=0.1, random_state=RANDOM_STATE)),
])
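
With `ngram_range=(1,2)` the vectorizer builds features from single words and from pairs of adjacent words. The sketch below approximates that extraction in plain Python, assuming scikit-learn's default token pattern (words of two or more characters). The helper `word_ngrams` is hypothetical, and the real vectorizer additionally computes TF-IDF weights.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, ngram_range=(1, 2)):
    """Approximate the word n-grams a vectorizer would extract from `text`."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

features = word_ngrams("a wonderful little production")
print(features)
```

The one-letter word `a` is dropped by the token pattern, so the bigrams are built from the remaining three words only.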


In [None]:
best_model.fit(X_train, y_train)

In [None]:
# f1_score expects (y_true, y_pred) in that order
f1_score(y_test, best_model.predict(X_test))
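
For reference, the binary F1 score is the harmonic mean of precision and recall on the positive class. The sketch below computes it by hand on made-up labels. The helper `f1_binary` is hypothetical and only mirrors the definition; it is not scikit-learn's implementation.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = f1_binary(y_true, y_pred)
print(score)
```

Swapping the two arguments swaps precision and recall, which leaves the binary F1 value unchanged. scikit-learn's `f1_score` still documents the order as `(y_true, y_pred)`, and other metrics and averaging modes are not symmetric, so keeping that order is the safer habit.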

### Serializing the best model

In [None]:
# save the model to disk with pickle
model_file_name = "app_predict/webapp/model_sentiment_prediction.pkl"  

with open(model_file_name, 'wb') as file:  
    pickle.dump(best_model, file)

In [None]:
# save the model to disk with joblib
filename = 'app_predict/webapp/model_sentiment_prediction.sav'
joblib.dump(best_model, filename)
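
A saved model is only useful if it can be read back, so a round trip is worth checking. The sketch below shows the pickle round trip with a small dictionary standing in for the fitted pipeline and a temporary directory instead of the `app_predict/webapp/` path. Both substitutions are assumptions made so that the example runs without the trained model.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object behaves the same way.
stand_in_model = {"kernel": "rbf", "C": 10, "gamma": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_sentiment_prediction.pkl")

    # Write the object to disk ...
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # ... and read it back, as the web app would do at start-up.
    with open(path, "rb") as file:
        restored_model = pickle.load(file)

print(restored_model == stand_in_model)
```

With the real pipeline, the restored object exposes `predict` like the original, and the `.sav` file can be read back with `joblib.load(filename)`. Only pickle files from a trusted source should be loaded, and the loading environment should preferably use the same scikit-learn version that saved the model.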