# Machine Learning Model

In this notebook, we train models to predict review sentiment. I chose the following models:

- Multinomial Naive Bayes: commonly used for text data because it applies Bayes' theorem directly to word counts.
- Logistic Regression: a simple yet powerful classification model. Because this is a multiclass classification problem, I use One-vs-Rest Logistic Regression.
- Random Forest: one of the more robust machine learning models, included because it might cope better with our imbalanced classification problem.
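
To make the first choice more concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive smoothing. It is not scikit-learn's implementation; the toy reviews, the labels, and the helper names `train_multinomial_nb` and `predict` are made up for illustration. The `alpha` argument plays the same role as the `clf__alpha` values tuned later in the notebook.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Collect per-class word counts and turn them into smoothed log probabilities."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Additive smoothing: alpha keeps unseen words from zeroing the likelihood.
        log_like = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior; unknown words are ignored."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(scores, key=scores.get)

# Hypothetical toy data: 2 = positive, 0 = negative.
docs = ["love this dress", "love love it", "poor fit", "poor quality fabric"]
labels = [2, 2, 0, 0]
nb = train_multinomial_nb(docs, labels, alpha=1.0)
prediction = predict(nb, "love the fabric")
print(prediction)
```

A smaller `alpha` trusts the observed counts more, while a larger one pulls all word probabilities toward uniform, which is presumably why several values are searched below.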

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style(style ='whitegrid')

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score, roc_curve, auc, roc_auc_score 
from sklearn.metrics import balanced_accuracy_score, log_loss

In [2]:
import pickle
# Load the preprocessed data; the context manager closes the file afterwards.
with open('sentiment_words.sav', 'rb') as f:
    df_sentiment = pickle.load(f)

In [3]:
df_sentiment.head()

Unnamed: 0,alpha_title,stem_title,lemma_title,alpha_review,stem_review,lemma_review,alpha_combination,stem_combination,lemma_combination,Sentiment
0,some major design flaws,major design flaw,major design flaw,i had such high hopes for this dress and reall...,high hope dress realli want work initi order p...,high hope dress really wanted work initially o...,some major design flaws i had such high hopes ...,major design flaw high hope dress realli want ...,major design flaw high hope dress really wante...,1
1,my favorite buy,favorit buy,favorite buy,i love love love this jumpsuit it s fun fl...,love love love jumpsuit fun flirti fabul everi...,love love love jumpsuit fun flirty fabulous ev...,my favorite buy i love love love this jumps...,favorit buy love love love jumpsuit fun flirti...,favorite buy love love love jumpsuit fun flirt...,2
2,flattering shirt,flatter shirt,flatter shirt,this shirt is very flattering to all due to th...,shirt flatter due adjust front tie perfect len...,shirt flatter due adjustable front tie perfect...,flattering shirt this shirt is very flattering...,flatter shirt shirt flatter due adjust front t...,flatter shirt shirt flatter due adjustable fro...,2
3,not for the very petite,petit,petite,i love tracy reese dresses but this one is no...,love traci rees dress one petit feet tall usua...,love tracy reese dress one petite foot tall us...,not for the very petite i love tracy reese dre...,petit love traci rees dress one petit feet tal...,petite love tracy reese dress one petite foot ...,0
4,cagrcoal shimmer fun,cagrcoal shimmer fun,cagrcoal shimmer fun,i aded this in my basket at hte last mintue to...,ade basket hte last mintu see would look like ...,aded basket hte last mintue see would look lik...,cagrcoal shimmer fun i aded this in my basket ...,cagrcoal shimmer fun ade basket hte last mintu...,cagrcoal shimmer fun aded basket hte last mint...,2


In [4]:
X = df_sentiment.drop('Sentiment', axis=1)
y = df_sentiment['Sentiment']

alpha_title_train, alpha_title_test, alpha_title_y_train, alpha_title_y_test = train_test_split(X['alpha_title'], y, test_size=0.2, stratify=y, random_state=0)
stem_title_train, stem_title_test, stem_title_y_train, stem_title_y_test = train_test_split(X['stem_title'], y, test_size=0.2, stratify=y, random_state=0)
lemma_title_train, lemma_title_test, lemma_title_y_train, lemma_title_y_test = train_test_split(X['lemma_title'], y, test_size=0.2, stratify=y, random_state=0)

alpha_review_train, alpha_review_test, alpha_review_y_train, alpha_review_y_test = train_test_split(X['alpha_review'], y, test_size=0.2, stratify=y, random_state=0)
stem_review_train, stem_review_test, stem_review_y_train, stem_review_y_test = train_test_split(X['stem_review'], y, test_size=0.2, stratify=y, random_state=0)
lemma_review_train, lemma_review_test, lemma_review_y_train, lemma_review_y_test = train_test_split(X['lemma_review'], y, test_size=0.2, stratify=y, random_state=0)

alpha_combination_train, alpha_combination_test, alpha_combination_y_train, alpha_combination_y_test = train_test_split(X['alpha_combination'], y, test_size=0.2, stratify=y, random_state=0)
stem_combination_train, stem_combination_test, stem_combination_y_train, stem_combination_y_test = train_test_split(X['stem_combination'], y, test_size=0.2, stratify=y, random_state=0)
lemma_combination_train, lemma_combination_test, lemma_combination_y_train, lemma_combination_y_test = train_test_split(X['lemma_combination'], y, test_size=0.2, stratify=y, random_state=0)
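
The splits above pass `stratify=y`, which keeps the class proportions roughly equal in the training and test parts. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not scikit-learn's algorithm, and `stratified_split` and the 70/20/10 label mix are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label, idx in by_class.items():
        rng.shuffle(idx)                       # shuffle within each class
        n_test = round(len(idx) * test_size)   # take the same fraction per class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 70 positive, 20 neutral, 10 negative.
labels = [2] * 70 + [1] * 20 + [0] * 10
train_idx, test_idx = stratified_split(labels)
train_mix = Counter(labels[i] for i in train_idx)
test_mix = Counter(labels[i] for i in test_idx)
print(train_mix, test_mix)
```

Since all nine calls use the same `y`, `test_size`, and `random_state`, they should select the same rows each time, so the nine target splits are expected to be identical.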

In [5]:
X_train = [alpha_title_train, stem_title_train, lemma_title_train,
           alpha_review_train, stem_review_train, lemma_review_train,
           alpha_combination_train, stem_combination_train, lemma_combination_train]
y_train = [alpha_title_y_train, stem_title_y_train, lemma_title_y_train,
           alpha_review_y_train, stem_review_y_train, lemma_review_y_train,
           alpha_combination_y_train, stem_combination_y_train, lemma_combination_y_train]

X_test = [alpha_title_test, stem_title_test, lemma_title_test,
           alpha_review_test, stem_review_test, lemma_review_test,
           alpha_combination_test, stem_combination_test, lemma_combination_test]
y_test = [alpha_title_y_test, stem_title_y_test, lemma_title_y_test,
           alpha_review_y_test, stem_review_y_test, lemma_review_y_test,
           alpha_combination_y_test, stem_combination_y_test, lemma_combination_y_test]

train_score = []
test_score = []
index_name = []

In [6]:
def calc_train_error(X_train, y_train, model):
    """Score a fitted model on the training data (ROC AUC, F1, balanced accuracy, log loss)."""
    predictions = model.predict(X_train)
    predictProba = model.predict_proba(X_train)
    ROC_AUC_Score_macro = roc_auc_score(y_train, predictProba, multi_class='ovr', average='macro')
    ROC_AUC_Score_weighted = roc_auc_score(y_train, predictProba, multi_class='ovr', average='weighted')
    f1_macro = f1_score(y_train, predictions, average ='macro')
    f1_weighted = f1_score(y_train, predictions, average ='weighted')
    accuracy = balanced_accuracy_score(y_train, predictions)
    logloss = log_loss(y_train, predictProba)
    return{
        'ROC AUC Macro Train' : ROC_AUC_Score_macro,
        'ROC AUC Weighted Train' : ROC_AUC_Score_weighted,
        'F1 Macro Train' : f1_macro,
        'F1 Weighted Train' : f1_weighted,
        'Balanced Accuracy Score Train': accuracy,
        'Log Loss Train' : logloss
    }
def calc_validation_error(X_test, y_test, model):
    """Score a fitted model on the held-out test data with the same metrics."""
    predictions = model.predict(X_test)
    predictProba = model.predict_proba(X_test)
    ROC_AUC_Score_macro = roc_auc_score(y_test, predictProba, multi_class='ovr', average='macro')
    ROC_AUC_Score_weighted = roc_auc_score(y_test, predictProba, multi_class='ovr', average='weighted')
    f1_macro = f1_score(y_test, predictions, average ='macro')
    f1_weighted = f1_score(y_test, predictions, average ='weighted')
    accuracy = balanced_accuracy_score(y_test, predictions)
    logloss = log_loss(y_test,predictProba)
    return{
        'ROC AUC Macro Test' : ROC_AUC_Score_macro,
        'ROC AUC Weighted Test' : ROC_AUC_Score_weighted,
        'F1 Macro Test' : f1_macro,
        'F1 Weighted Test' : f1_weighted,
        'Balanced Accuracy Score Test': accuracy,
        'Log Loss Test' : logloss
    }
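
Balanced accuracy is used as the grid-search score throughout this notebook. It is the mean of the per-class recalls, so a model that only predicts the majority class is not rewarded. The sketch below recomputes it by hand on made-up labels; the helper names and toy data are assumptions for illustration only.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced labels: 8 positive (2), 1 neutral (1), 1 negative (0).
y_true = [2] * 8 + [1, 0]
y_pred = [2] * 10  # a model that always predicts the majority class

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, round(balanced, 3))  # 0.8 0.333
```

With three classes, a score near 0.33 corresponds to this kind of majority-class guessing, which gives a baseline for reading the scores reported below.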

In [7]:
pipeline_multiNB = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB())])

tuned_parameters_nb = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vect__max_features' : [10000, 15000, 20000, 25000, 30000, None],
    'clf__alpha': [1, 1e-1, 1e-2]
}

gs_multiNB = GridSearchCV(pipeline_multiNB, tuned_parameters_nb, cv=5, n_jobs=-1, scoring='balanced_accuracy')

In [8]:
%%time
model = '_multi_nb'
print('Multinomial Naive Bayes\n')
for x, y, x_test, y_true in zip(X_train, y_train, X_test, y_test):
    gs = gs_multiNB.fit(x, y)
    print('Feature : ', x.name)
    print('Best Score : ', gs.best_score_)
    print('Best Params : ', gs.best_params_)
    y_score = gs.best_estimator_.predict(x_test)
    eval_score = balanced_accuracy_score(y_true, y_score)
    print('Balanced Accuracy Score on Test Data : ', eval_score)
    train_score.append(calc_train_error(x, y, gs.best_estimator_))
    test_score.append(calc_validation_error(x_test, y_true, gs.best_estimator_))
    index_name.append(x.name+model)
    print('\n')

Multinomial Naive Bayes

Feature :  alpha_title
Best Score :  0.6082904828217878
Best Params :  {'clf__alpha': 0.1, 'vect__max_features': 10000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.6125669646419161


Feature :  stem_title
Best Score :  0.5499422518199724
Best Params :  {'clf__alpha': 0.01, 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5415160580943491


Feature :  lemma_title
Best Score :  0.5454794060174709
Best Params :  {'clf__alpha': 0.01, 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.537544707862692


Feature :  alpha_review
Best Score :  0.6549110594199856
Best Params :  {'clf__alpha': 1, 'vect__max_features': 10000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.6813790004373798


Feature :  stem_review
Best Score :  0.6189173882724794
Best Params :  {'clf__alpha': 1, 'vect__max_features': 10000, 'vect__ngram_range

In [9]:
pipeline_multiNB_tfidf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters_nb_tfidf = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vect__max_features' : [10000, 15000, 20000, 25000, 30000, None],
    'clf__alpha': [1, 1e-1, 1e-2]
}

gs_multiNB_tfidf = GridSearchCV(pipeline_multiNB_tfidf, tuned_parameters_nb_tfidf, cv=5, n_jobs=-1, scoring='balanced_accuracy')

In [10]:
%%time
model = '_multi_nb_tfidf'
print('Multinomial Naive Bayes\n')
for x, y, x_test, y_true in zip(X_train, y_train, X_test, y_test):
    gs = gs_multiNB_tfidf.fit(x, y)
    print('Feature : ', x.name)
    print('Best Score : ', gs.best_score_)
    print('Best Params : ', gs.best_params_)
    y_score = gs.best_estimator_.predict(x_test)
    eval_score = balanced_accuracy_score(y_true, y_score)
    print('Balanced Accuracy Score on Test Data : ', eval_score)
    train_score.append(calc_train_error(x, y, gs.best_estimator_))
    test_score.append(calc_validation_error(x_test, y_true, gs.best_estimator_))
    index_name.append(x.name+model)
    print('\n')

Multinomial Naive Bayes

Feature :  alpha_title
Best Score :  0.5770521099771455
Best Params :  {'clf__alpha': 0.01, 'vect__max_features': 20000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5831539724936139


Feature :  stem_title
Best Score :  0.5348789565680047
Best Params :  {'clf__alpha': 0.01, 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.535527860551703


Feature :  lemma_title
Best Score :  0.5291431100444619
Best Params :  {'clf__alpha': 0.01, 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5319296196268565


Feature :  alpha_review
Best Score :  0.5374323931645669
Best Params :  {'clf__alpha': 0.1, 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5410863787085661


Feature :  stem_review
Best Score :  0.4743522154779944
Best Params :  {'clf__alpha': 0.1, 'vect__max_features': 15000, 'vect__ngram_

In [11]:
pipeline_logReg = Pipeline([('vect', CountVectorizer()),
                     ('clf', LogisticRegression(multi_class='ovr'))])

tuned_parameters_lr = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vect__max_features' : [10000, 15000, 20000, 25000, 30000, None],
    'clf__solver': ['sag', 'saga', 'lbfgs'],
    # 'none' disables regularization; scikit-learn 1.2+ expects None instead of the string.
    'clf__penalty': ['l2', 'none'],
    'clf__max_iter' : [100, 200, 300, 400, 500]
}

gs_logReg = GridSearchCV(pipeline_logReg, tuned_parameters_lr, cv=5, n_jobs=-1, scoring='balanced_accuracy')

In [12]:
%%time
model = '_logReg_OVR'
print('Logistic Regression OVR\n')
for x, y, x_test, y_true in zip(X_train, y_train, X_test, y_test):
    gs = gs_logReg.fit(x, y)
    print('Feature : ', x.name)
    print('Best Score : ', gs.best_score_)
    print('Best Params : ', gs.best_params_)
    y_score = gs.best_estimator_.predict(x_test)
    eval_score = balanced_accuracy_score(y_true, y_score)
    print('Balanced Accuracy Score on Test Data : ', eval_score)
    train_score.append(calc_train_error(x, y, gs.best_estimator_))
    test_score.append(calc_validation_error(x_test, y_true, gs.best_estimator_))
    index_name.append(x.name+model)
    print('\n')

Logistic Regression OVR





Feature :  alpha_title
Best Score :  0.5747583843907813
Best Params :  {'clf__max_iter': 400, 'clf__penalty': 'none', 'clf__solver': 'sag', 'vect__max_features': 10000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5681698010328352




STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Feature :  stem_title
Best Score :  0.5487856383076776
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'lbfgs', 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5523027421003811





Feature :  lemma_title
Best Score :  0.5484133521010106
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'lbfgs', 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5485244403780699






Feature :  alpha_review
Best Score :  0.5779432844783153
Best Params :  {'clf__max_iter': 300, 'clf__penalty': 'none', 'clf__solver': 'saga', 'vect__max_features': 25000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5931432139281592






Feature :  stem_review
Best Score :  0.5642023578041405
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'sag', 'vect__max_features': 10000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5676085921600521






Feature :  lemma_review
Best Score :  0.5561915319925543
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'l2', 'clf__solver': 'saga', 'vect__max_features': 20000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5574722801634712






Feature :  alpha_combination
Best Score :  0.6054689909176146
Best Params :  {'clf__max_iter': 300, 'clf__penalty': 'none', 'clf__solver': 'saga', 'vect__max_features': None, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.619404473447632






Feature :  stem_combination
Best Score :  0.5965388597953925
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'saga', 'vect__max_features': 30000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5844512050260895






Feature :  lemma_combination
Best Score :  0.5955359503898247
Best Params :  {'clf__max_iter': 200, 'clf__penalty': 'none', 'clf__solver': 'saga', 'vect__max_features': 20000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5819102446985355


Wall time: 21h 47min 52s
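
The long wall time is easier to understand once the search space is counted. The sketch below multiplies out the grid from `tuned_parameters_lr`, assuming the grid search tries every combination once per fold. The names `cv_folds` and `n_feature_columns` are introduced here for illustration, and the final refit per feature column is not counted.

```python
from itertools import product

# Grid values copied from tuned_parameters_lr.
grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vect__max_features': [10000, 15000, 20000, 25000, 30000, None],
    'clf__solver': ['sag', 'saga', 'lbfgs'],
    'clf__penalty': ['l2', 'none'],
    'clf__max_iter': [100, 200, 300, 400, 500],
}
cv_folds = 5            # cv=5 in the grid search
n_feature_columns = 9   # nine text columns are searched in the loop

# Every combination of grid values is one candidate pipeline.
n_candidates = sum(1 for _ in product(*grid.values()))
fits_per_column = n_candidates * cv_folds
total_fits = fits_per_column * n_feature_columns
print(n_candidates, fits_per_column, total_fits)  # 540 2700 24300
```

Trimming the grid, for example by searching fewer `max_iter` values, would reduce this count proportionally.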


In [13]:
pipeline_logReg_tfidf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression(multi_class='ovr'))])

tuned_parameters_lr_tfidf = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vect__max_features' : [10000, 15000, 20000, 25000, 30000, None],
    'clf__solver': ['sag', 'saga', 'lbfgs'],
    # 'none' disables regularization; scikit-learn 1.2+ expects None instead of the string.
    'clf__penalty': ['l2', 'none'],
    'clf__max_iter' : [100, 200, 300, 400, 500]
}

gs_logReg_tfidf = GridSearchCV(pipeline_logReg_tfidf, tuned_parameters_lr_tfidf, cv=5, n_jobs=-1, scoring='balanced_accuracy')

In [14]:
%%time
model = '_logReg_OVR_tfidf'
print('Logistic Regression OVR\n')
for x, y, x_test, y_true in zip(X_train, y_train, X_test, y_test):
    gs = gs_logReg_tfidf.fit(x, y)
    print('Feature : ', x.name)
    print('Best Score : ', gs.best_score_)
    print('Best Params : ', gs.best_params_)
    y_score = gs.best_estimator_.predict(x_test)
    eval_score = balanced_accuracy_score(y_true, y_score)
    print('Balanced Accuracy Score on Test Data : ', eval_score)
    train_score.append(calc_train_error(x, y, gs.best_estimator_))
    test_score.append(calc_validation_error(x_test, y_true, gs.best_estimator_))
    index_name.append(x.name+model)
    print('\n')

Logistic Regression OVR



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Feature :  alpha_title
Best Score :  0.5788747999737006
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'lbfgs', 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5743128108388432





Feature :  stem_title
Best Score :  0.5499655634048717
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'lbfgs', 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5586168484568196





Feature :  lemma_title
Best Score :  0.5441449702881307
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'lbfgs', 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5529234204495901


Feature :  alpha_review
Best Score :  0.584235169403945
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'lbfgs', 'vect__max_features': None, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5824070351515189


Feature :  stem_review
Best Score :  0.562311945657589
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'lbfgs', 'vect__max_features': None, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5577800455719555


Feature :  lemma_review
Best Score :  0.5585882843801571
Best Params :  {'clf__max_iter': 100, 'clf__penalty': 'none', 'clf__solver': 'lbfgs', 'vect__max_features': None, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Sc

In [15]:
pipeline_RandomForest = Pipeline([('vect', CountVectorizer()),
                     ('clf', RandomForestClassifier(random_state=0, n_jobs=-1))])

tuned_parameters_randomforest = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vect__max_features' : [10000, 15000, 20000, 25000, 30000, None],
    'clf__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'clf__n_estimators' : [100, 200, 300]
}

gs_RandomForest = GridSearchCV(pipeline_RandomForest, tuned_parameters_randomforest, cv=5, n_jobs=-1, scoring='balanced_accuracy')

In [16]:
%%time
model = '_RandomForest'
print('Random Forest\n')
for x, y, x_test, y_true in zip(X_train, y_train, X_test, y_test):
    gs = gs_RandomForest.fit(x, y)
    print('Feature : ', x.name)
    print('Best Score : ', gs.best_score_)
    print('Best Params : ', gs.best_params_)
    y_score = gs.best_estimator_.predict(x_test)
    eval_score = balanced_accuracy_score(y_true, y_score)
    print('Balanced Accuracy Score on Test Data : ', eval_score)
    train_score.append(calc_train_error(x, y, gs.best_estimator_))
    test_score.append(calc_validation_error(x_test, y_true, gs.best_estimator_))
    index_name.append(x.name+model)
    print('\n')

Random Forest

Feature :  alpha_title
Best Score :  0.5597220131710363
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 300, 'vect__max_features': 10000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5660137070170789






Feature :  stem_title
Best Score :  0.5506976855608879
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 300, 'vect__max_features': 10000, 'vect__ngram_range': (1, 1)}
Balanced Accuracy Score on Test Data :  0.5496177928747392


Feature :  lemma_title
Best Score :  0.5441184442229801
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 200, 'vect__max_features': 10000, 'vect__ngram_range': (1, 1)}
Balanced Accuracy Score on Test Data :  0.5460195519498926






Feature :  alpha_review
Best Score :  0.4151530389610049
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.4058631774178408






Feature :  stem_review
Best Score :  0.44991463872891774
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 300, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.45356287061545225






Feature :  lemma_review
Best Score :  0.4468911422428796
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.443633512157439






Feature :  alpha_combination
Best Score :  0.4286152661544932
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.4367303787085879






Feature :  stem_combination
Best Score :  0.4780002665034835
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.48629819628696597


Feature :  lemma_combination
Best Score :  0.4699570247566217
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 300, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.4764233665231914


Wall time: 1d 19h 7min 42s


In [17]:
pipeline_RandomForest_tfidf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),        
                     ('clf', RandomForestClassifier(random_state=0, n_jobs=-1))])

tuned_parameters_randomforest_tfidf = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vect__max_features' : [10000, 15000, 20000, 25000, 30000, None],
    'clf__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'clf__n_estimators' : [100, 200, 300]
}

gs_RandomForest_tfidf = GridSearchCV(pipeline_RandomForest_tfidf, tuned_parameters_randomforest_tfidf, cv=5, n_jobs=-1, scoring='balanced_accuracy')

In [None]:
%%time
model = '_RandomForest_tfidf'
print('Random Forest\n')
for x, y, x_test, y_true in zip(X_train[:6], y_train[:6], X_test[:6], y_test[:6]):
    gs = gs_RandomForest_tfidf.fit(x, y)
    print('Feature : ', x.name)
    print('Best Score : ', gs.best_score_)
    print('Best Params : ', gs.best_params_)
    y_score = gs.best_estimator_.predict(x_test)
    eval_score = balanced_accuracy_score(y_true, y_score)
    print('Balanced Accuracy Score on Test Data : ', eval_score)
    train_score.append(calc_train_error(x, y, gs.best_estimator_))
    test_score.append(calc_validation_error(x_test, y_true, gs.best_estimator_))
    index_name.append(x.name+model)
    print('\n')

Random Forest

Feature :  alpha_title
Best Score :  0.5447798273872929
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 300, 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5513781232186681


Feature :  stem_title
Best Score :  0.5426439715099719
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 300, 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.536194301936983






Feature :  lemma_title
Best Score :  0.5344687461702133
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 300, 'vect__max_features': 15000, 'vect__ngram_range': (1, 2)}
Balanced Accuracy Score on Test Data :  0.5393656317504402






Feature :  alpha_review
Best Score :  0.40921258425890716
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.41449757596172415


Feature :  stem_review
Best Score :  0.4387364229641836
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.4383309315234085






Feature :  lemma_review
Best Score :  0.4340025345423328
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.42831427720815324




In [22]:
%%time
model = '_RandomForest_tfidf'
print('Random Forest\n')
for x, y, x_test, y_true in zip(X_train[6:], y_train[6:], X_test[6:], y_test[6:]):
    gs = gs_RandomForest_tfidf.fit(x, y)
    print('Feature : ', x.name)
    print('Best Score : ', gs.best_score_)
    print('Best Params : ', gs.best_params_)
    y_score = gs.best_estimator_.predict(x_test)
    eval_score = balanced_accuracy_score(y_true, y_score)
    print('Balanced Accuracy Score on Test Data : ', eval_score)
    train_score.append(calc_train_error(x, y, gs.best_estimator_))
    test_score.append(calc_validation_error(x_test, y_true, gs.best_estimator_))
    index_name.append(x.name+model)
    print('\n')

Random Forest

Feature :  alpha_combination
Best Score :  0.43157368560101333
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.43790278970512486


Feature :  stem_combination
Best Score :  0.47379821911529857
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 100, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.4749618536630851


Feature :  lemma_combination
Best Score :  0.46781469248798146
Best Params :  {'clf__max_depth': None, 'clf__n_estimators': 200, 'vect__max_features': 10000, 'vect__ngram_range': (2, 2)}
Balanced Accuracy Score on Test Data :  0.4671397445208094


Wall time: 16h 4min 28s


In [23]:
import pickle

# Persist the collected scores and row labels for later analysis.
for obj, path in [(train_score, 'train_score.sav'),
                  (test_score, 'test_score.sav'),
                  (index_name, 'index_name.sav')]:
    with open(path, 'wb') as f:
        pickle.dump(obj, f)