### Models

In the first section I use undersampling to address the class imbalance. \
Only the training set is undersampled, so the model is validated on data that resembles the real-world class distribution. \
In the second section I train the Naive Bayes model. \
Finally, I try a few self-written comments to inspect the predicted probabilities.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import re
import math

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split,GridSearchCV,StratifiedKFold
import pickle

In [3]:
data = pd.read_csv("Data.csv")
data.head(10)

Unnamed: 0.1,Unnamed: 0,overall,verified,reviewText,summary,Binary_Overall
0,0,5.0,True,This is the best novel I have read in 2 or 3 y...,A star is born,1.0
1,1,3.0,True,"Pages and pages of introspection, in the style...",A stream of consciousness novel,0.0
2,2,5.0,False,This is the kind of novel to read when you hav...,I'm a huge fan of the author and this one did ...,1.0
3,3,5.0,False,What gorgeous language! What an incredible wri...,The most beautiful book I have ever read!,1.0
4,4,3.0,True,I was taken in by reviews that compared this b...,A dissenting view--In part.,0.0
5,5,4.0,True,I read this probably 50 years ago in my youth ...,Above average mystery,1.0
6,6,5.0,True,I read every Perry mason book voraciously. Fin...,Lam is cool!,1.0
7,7,5.0,True,I love this series of Bertha and Lamb.. Great...,Five Stars,1.0
8,8,5.0,True,Great read!,Five Stars,1.0
9,9,4.0,False,"Crows Can't Count, A.A. Fair\n\nMr. Harry Shar...",A Fast and Far Moving Adventure,1.0


In [4]:
# random subsample of 100,000 rows (no random_state is set, so the rows differ between runs)
data = data.sample(100000)
data

Unnamed: 0.1,Unnamed: 0,overall,verified,reviewText,summary,Binary_Overall
2436804,2437241,3.0,True,The Taotronics system works great. It paired a...,Less than expected,0.0
1920593,1920927,5.0,False,"Working good,\nThanks",Five Stars,1.0
1039206,1039384,5.0,True,Worked fine.,Cheap and works fine,1.0
5322231,5323644,5.0,True,GREAT--WOULD LIKE TO HAVE ANOTHER,Five Stars,1.0
2638961,2639441,5.0,True,this product is all it made out to be and more...,execellent,1.0
...,...,...,...,...,...,...
1777054,1777348,5.0,True,Nice pigtails worked out real well,Good stuff,1.0
6395290,6397164,3.0,False,Zmodo no longer offers phone technical support...,Zmodo no longer offers phone technical support,0.0
4753447,4754606,5.0,True,I'm glad to finally have a case that won't fal...,Perfect! Deep grooves.,1.0
372805,372873,5.0,True,When the guys at Besbuy tell you that to repla...,"Inexpensive, not Cheap",1.0


In [5]:
data.Binary_Overall.value_counts()

1.0    81051
0.0    18949
Name: Binary_Overall, dtype: int64

In [6]:
import nltk

In [7]:
X = data.reviewText
Y = data.Binary_Overall

In [8]:
X = X.replace(r"\n", ' ', regex = True)

In [9]:
# default split: 75% train / 25% test
xtrain, xtest, y_train, y_test = train_test_split(X, Y)
print('train size:',len(xtrain))
print('test size:',len(xtest))

train size: 75000
test size: 25000


In [10]:
from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(sampling_strategy='majority')

# fit_resample expects a 2D feature array, hence the reshape
X_train_us, y_train_us = undersampler.fit_resample(xtrain.values.reshape(-1, 1), y_train)

print('Training set composition:')
print(y_train_us.value_counts())

print('\nTest set composition:')
print(y_test.value_counts())

Training set composition:
0.0    14188
1.0    14188
Name: Binary_Overall, dtype: int64

Test set composition:
1.0    20239
0.0     4761
Name: Binary_Overall, dtype: int64
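
To make the resampling step concrete, here is a minimal plain-Python sketch of what random undersampling amounts to for two classes. It is an illustration under assumptions, not imblearn's actual implementation: the helper name `undersample_majority`, the toy reviews, and the fixed seed are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample_majority(texts, labels, seed=0):
    """Randomly drop rows from the larger class until both classes
    have as many rows as the smaller one (hypothetical helper)."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Keep every row of the smallest class, a random subset of the others.
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(idxs if len(idxs) == n_min else rng.sample(idxs, n_min))
    kept.sort()

    return [texts[i] for i in kept], [labels[i] for i in kept]


# Toy data with roughly the 80/20 imbalance seen above (made-up reviews).
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1.0] * 8 + [0.0] * 2

bal_texts, bal_labels = undersample_majority(toy_texts, toy_labels)
print(Counter(bal_labels))
```

Only the training rows go through this step; the test set keeps its original class ratio, which is why the two compositions printed above differ.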


-------------------------------------

In [11]:
from nltk.corpus import stopwords 
# requires a one-time nltk.download('stopwords')
stop_words = stopwords.words('english')

In [12]:
# model with vectorizer

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

vectorizer=TfidfVectorizer(stop_words=stop_words,strip_accents='unicode')

model=Pipeline([('vect',vectorizer),('classifier',MultinomialNB())])

# note: integer max_df values are absolute document counts, not proportions
params={
    'classifier__alpha':[0.1,0.5,1],
    'vect__max_df':[1,100],
    'vect__ngram_range':[(1,2),(2,2),(1,1)],
}

GS_CV=GridSearchCV(model,params,cv=5,n_jobs=-1)

GS_CV.fit(X_train_us.ravel(), y_train_us);

print('best score:',GS_CV.best_score_)
print('best params:',GS_CV.best_params_)

best score: 0.7605022895691884
best params: {'classifier__alpha': 0.5, 'vect__max_df': 100, 'vect__ngram_range': (1, 2)}
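
The `max_df` values in the grid are integers, and `TfidfVectorizer` treats an integer `max_df` as an absolute document count and a float as a proportion of documents. The plain-Python sketch below mimics that cut-off rule on a toy corpus; the helper `filter_by_max_df` and the toy documents are hypothetical and only illustrate the rule, they are not scikit-learn code.

```python
from collections import Counter


def filter_by_max_df(docs, max_df):
    """Return the vocabulary kept under a max_df cut-off (hypothetical helper).

    int   -> a term may appear in at most `max_df` documents
    float -> a term may appear in at most `max_df * len(docs)` documents
    """
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))

    limit = max_df if isinstance(max_df, int) else max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if df <= limit)


toy_docs = ["great book", "great read", "boring book", "great plot"]

print(filter_by_max_df(toy_docs, 1))    # terms seen in a single document only
print(filter_by_max_df(toy_docs, 1.0))  # every term is kept
```

Under this rule, `max_df=1` keeps only terms that occur in a single document, and `max_df=100` drops any term found in more than 100 training documents. If proportions were intended, float values such as `0.5` or `1.0` would be the way to express them.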


In [13]:
from sklearn.metrics import accuracy_score
pred = GS_CV.predict(xtest.values.astype("U"))
print('Accuracy score in test: ', accuracy_score(y_test, pred))

Accuracy score in test:  0.69944


In [14]:
from sklearn.metrics import confusion_matrix,f1_score
confusion_matrix(y_test,pred)

array([[ 4047,   714],
       [ 6800, 13439]], dtype=int64)

In [15]:
# F1 for the positive class (label 1.0)
f1_score(y_test, pred)

0.7815189578971854
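
The accuracy and F1 values above can be re-derived by hand from the confusion matrix, which also makes the per-class behaviour visible. The sketch below uses only the four counts reported above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are mine.

```python
# Confusion matrix reported above: rows are true classes (0, 1),
# columns are predicted classes (0, 1).
tn, fp = 4047, 714
fn, tp = 6800, 13439
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # how many predicted positives are correct
recall_pos = tp / (tp + fn)      # how many true positives are found
recall_neg = tn / (tn + fp)      # how many true negatives are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy of always predicting the majority class (1) on this test set.
majority_baseline = (tp + fn) / total

print(f"accuracy:          {accuracy:.5f}")
print(f"precision (pos):   {precision_pos:.5f}")
print(f"recall (pos):      {recall_pos:.5f}")
print(f"recall (neg):      {recall_neg:.5f}")
print(f"F1 (pos):          {f1_pos:.5f}")
print(f"majority baseline: {majority_baseline:.5f}")
```

On these counts the test accuracy sits below the majority-class baseline, while recall on the negative class is high. That pattern is plausible for a model trained on a balanced set and evaluated on an imbalanced one, so accuracy alone is probably not the most informative metric here.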

----------------------

In [39]:
texto_prueba=['Could be better']
k = GS_CV.predict(texto_prueba)
print('Comment:',texto_prueba,'\n')
print("overall:" , k)
print(' proba:',GS_CV.predict_proba(texto_prueba).max())

Comment: ['Could be better'] 

overall: [1.]
 proba: 0.5556632480809894


In [40]:
texto_prueba=['awsome']
k = GS_CV.predict(texto_prueba)
print('Comment:',texto_prueba,'\n')
print("overall:" , k)
print(' proba:',GS_CV.predict_proba(texto_prueba).max())

Comment: ['awsome'] 

overall: [1.]
 proba: 0.6989922812752908


In [41]:
import pickle
with open('model_binary.pkl', 'wb') as f_model:
    pickle.dump(GS_CV, f_model)
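
For completeness, here is a minimal sketch of the round trip: writing an object with `pickle.dump` and reading it back with `pickle.load`. A small dictionary stands in for the fitted `GS_CV` object, and a temporary directory is used so nothing is left on disk; both are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted search object; any picklable Python object
# behaves the same way for the purpose of this sketch.
stand_in_model = {"alpha": 0.5, "ngram_range": (1, 2)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_binary.pkl")

    # Save, exactly as in the cell above.
    with open(path, "wb") as f_model:
        pickle.dump(stand_in_model, f_model)

    # Load. Only unpickle files you trust: loading can execute arbitrary code.
    with open(path, "rb") as f_model:
        restored = pickle.load(f_model)

print(restored == stand_in_model)
```

With the real `model_binary.pkl`, the loaded object should expose `predict` and `predict_proba` just like `GS_CV`, provided it is loaded in an environment with a compatible scikit-learn version.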