# Baseline Modeling #

We are approaching the problem as a supervised classification problem. We will begin by using Binomial Logistic Regression to construct a model since this algorithm typically performs well as a baseline and is regarded as a good starting point. For a helpful overview of Logistic Regression (as well as recommended additional reading) see [here](https://machinelearningmastery.com/logistic-regression-for-machine-learning/).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.model_selection
import nltk
from nltk.corpus import stopwords

In [2]:
# Import data as DataFrame
reviews = pd.read_csv('/Users/dwalkerpage/Documents/Data_Science/Springboard/Projects/springboard/Capstone_Projects/Capstone_Project_1/Capstone_Project_1_Data/restaurant_reviews_final.csv')

## 1. Clean Corpus of Terms ##

In [3]:
reviews.shape

(4043449, 3)

In [139]:
# Identify the numbers and proportions of the two categories (positive reviews and negative reviews) in our dataset

print('Number of Positive Reviews:', len(reviews[reviews.stars > 3.0]))
print('Rounded Proportion of Positive Reviews:', round(len(reviews[reviews.stars > 3.0])/len(reviews), 2))
print()
print('Number of Negative Reviews:', len(reviews[reviews.stars <= 3.0]))
print('Rounded Proportion of Negative Reviews:', round(len(reviews[reviews.stars <= 3.0])/len(reviews), 2))

Number of Positive Reviews: 2644830
Rounded Proportion of Positive Reviews: 0.65

Number of Negative Reviews: 1398619
Rounded Proportion of Negative Reviews: 0.35


To save time and computational resources, we will work with a large random sample (1,000,000 reviews) of our dataset. Because the sample is large and drawn at random, we can be reasonably confident that our results will also apply to the full dataset. Determining the extent to which they do in fact carry over to the full dataset could be a fruitful direction for future work.

In [4]:
# Take large sample from data
reviews_sample = reviews.sample(n=1000000, random_state=7)

In [140]:
# Identify the numbers and proportions of the two categories (positive reviews and negative reviews) in
# the sample of our dataset

print('Number of Positive Reviews:', len(reviews_sample[reviews_sample.stars > 3.0]))
print('Rounded Proportion of Positive Reviews:', round(len(reviews_sample[reviews_sample.stars > 3.0])/len(reviews_sample), 2))
print()
print('Number of Negative Reviews:', len(reviews_sample[reviews_sample.stars <= 3.0]))
print('Rounded Proportion of Negative Reviews:', round(len(reviews_sample[reviews_sample.stars <= 3.0])/len(reviews_sample), 2))

Number of Positive Reviews: 654478
Rounded Proportion of Positive Reviews: 0.65

Number of Negative Reviews: 345522
Rounded Proportion of Negative Reviews: 0.35


In [5]:
import string

In [6]:
# Define function to remove punctuation from a string
# From here: https://stackoverflow.com/questions/33047818/remove-punctuation-for-each-row-in-a-pandas-data-frame?noredirect=1&lq=1

def remove_punctuation(s):
    '''Removes punctuation from a string'''
    s = ''.join([i for i in s if i not in set(string.punctuation)])
    return s

In [7]:
# Load NLTK's list of stop words

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

In [8]:
# Define function to remove stop words, using NLTK's list of stop words

def remove_stopwords(s):
    '''Removes stopwords from a string'''
    s = ' '.join([word for word in s.split() if word not in stop_words])
    return s

In [9]:
# Define function to clean corpus of review data

def clean_corpus(df):
    '''Makes review text lowercase, removes punctuation, removes stop words'''
    df['text'] = df['text'].str.lower().apply(remove_punctuation).apply(remove_stopwords)
    return df
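
As a small illustration of what this cleaning pipeline does to a review, here is a minimal sketch on toy data. The DataFrame `toy` and the short stop word set `toy_stop_words` are hypothetical stand-ins (the notebook itself uses NLTK's full English list), and the helper functions are re-declared so the example runs on its own.

```python
import string

import pandas as pd

# Tiny hypothetical stop word list; the notebook uses NLTK's full English list
toy_stop_words = {'the', 'was', 'and', 'we', 'a', 'it'}

def remove_punctuation(s):
    '''Removes punctuation from a string'''
    return ''.join(ch for ch in s if ch not in string.punctuation)

def remove_stopwords(s, stop_words=toy_stop_words):
    '''Removes stop words from a string'''
    return ' '.join(word for word in s.split() if word not in stop_words)

# Hypothetical review text
toy = pd.DataFrame({'text': ["The food was GREAT, and we loved it!", "Wasn't busy."]})

# Same order of operations as clean_corpus: lowercase, strip punctuation, drop stop words
toy['text'] = toy['text'].str.lower().apply(remove_punctuation).apply(remove_stopwords)
print(toy['text'].tolist())
```

Because punctuation is stripped before the stop word lookup, a contraction such as "wasn't" becomes "wasnt" and no longer matches a stop word entry spelled with an apostrophe, so it stays in the text.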

In [10]:
%%time
# Clean corpus of terms in reviews

clean_reviews_sample = reviews_sample.copy()
clean_reviews_sample = clean_corpus(clean_reviews_sample)

CPU times: user 7min 21s, sys: 2.09 s, total: 7min 23s
Wall time: 7min 24s


In [11]:
reviews_sample.head()

| | business_id | stars | text |
|---|---|---|---|
| 3229215 | ja01cHy1xqUB9DQ1r1OYKQ | 5.0 | Great place! Awesome atmosphere. Had the att... |
| 54352 | wJY74R0zAgjxvBf-d4gm9g | 1.0 | Me and my fiancé came here for drinks. We were... |
| 3020909 | r1k3JVOrfF4vJJUWJrF8uQ | 5.0 | Their ad says everybody loves Showmars and I w... |
| 286600 | mOMeDQB8NjdBTTzKtikAYg | 5.0 | Great place ! Great atmosphere! It's already m... |
| 3836550 | XcQKsEUlh1W0R4iXbTA1Yg | 5.0 | This really is a perfect little eatery that I ... |


In [12]:
clean_reviews_sample.reset_index(drop=True, inplace=True)

In [13]:
clean_reviews_sample.head()

| | business_id | stars | text |
|---|---|---|---|
| 0 | ja01cHy1xqUB9DQ1r1OYKQ | 5.0 | great place awesome atmosphere attic chard fan... |
| 1 | wJY74R0zAgjxvBf-d4gm9g | 1.0 | fiancé came drinks seated outside wasnt busy w... |
| 2 | r1k3JVOrfF4vJJUWJrF8uQ | 5.0 | ad says everybody loves showmars wholeheartedl... |
| 3 | mOMeDQB8NjdBTTzKtikAYg | 5.0 | great place great atmosphere already second ti... |
| 4 | XcQKsEUlh1W0R4iXbTA1Yg | 5.0 | really perfect little eatery love visiting hap... |


## 2. Vectorize Review Text ##

In [14]:
%%time
# Add column of sentiment labels to DataFrame

# Initialize empty column in DataFrame
clean_reviews_sample['sentiment_label'] = np.nan

# Add sentiment labels to the column
for i in range(len(clean_reviews_sample)):
    if clean_reviews_sample['stars'].iloc[i] >= 4.0:
        clean_reviews_sample.at[i, 'sentiment_label'] = 1
    else:
        clean_reviews_sample.at[i, 'sentiment_label'] = 0

CPU times: user 24 s, sys: 52.9 ms, total: 24.1 s
Wall time: 24.1 s
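
The row-by-row loop above takes about 24 seconds on one million rows. A vectorized comparison should produce the same labels much faster. The sketch below uses a small hypothetical DataFrame `toy` in place of `clean_reviews_sample`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for clean_reviews_sample (hypothetical data)
toy = pd.DataFrame({'stars': [5.0, 1.0, 3.0, 4.0, 2.0]})

# Vectorized labeling: 1.0 for reviews with 4 or more stars, 0.0 otherwise
toy['sentiment_label'] = (toy['stars'] >= 4.0).astype(float)

# Equivalent alternative using np.where
labels_where = np.where(toy['stars'] >= 4.0, 1.0, 0.0)

print(toy)
```

Assuming star ratings are whole numbers, the `>= 4.0` threshold used here selects the same reviews as the `> 3.0` threshold used earlier when counting positive reviews.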


In [18]:
clean_reviews_sample.to_csv('clean_reviews_sample.csv')

In [12]:
clean_reviews_sample = pd.read_csv('clean_reviews_sample.csv', index_col=0)

In [13]:
clean_reviews_sample.head()

| | business_id | stars | text | sentiment_label |
|---|---|---|---|---|
| 0 | ja01cHy1xqUB9DQ1r1OYKQ | 5.0 | great place awesome atmosphere attic chard fan... | 1.0 |
| 1 | wJY74R0zAgjxvBf-d4gm9g | 1.0 | fiancé came drinks seated outside wasnt busy w... | 0.0 |
| 2 | r1k3JVOrfF4vJJUWJrF8uQ | 5.0 | ad says everybody loves showmars wholeheartedl... | 1.0 |
| 3 | mOMeDQB8NjdBTTzKtikAYg | 5.0 | great place great atmosphere already second ti... | 1.0 |
| 4 | XcQKsEUlh1W0R4iXbTA1Yg | 5.0 | really perfect little eatery love visiting hap... | 1.0 |


In [20]:
clean_reviews_sample.shape

(1000000, 4)

In [24]:
clean_reviews_sample.drop(index=294610, inplace=True)
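
The notebook does not record why the row at index 294610 is dropped. One plausible explanation (an assumption, not something the notebook confirms) is that a review consisting only of punctuation and stop words becomes an empty string after cleaning, and an empty string is read back from CSV as a missing value, which the vectorizer cannot process. Under that assumption, a more general approach would be to drop every row with missing text. The sketch below uses hypothetical toy data and an in-memory buffer in place of the CSV file.

```python
import io

import pandas as pd

# Hypothetical cleaned data; the second review is empty after cleaning
toy = pd.DataFrame({'text': ['great place', '', 'awful service'],
                    'sentiment_label': [1.0, 0.0, 0.0]})

# Round-trip through CSV, as the notebook does with clean_reviews_sample.csv
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# The empty string comes back as NaN
n_missing = int(reloaded['text'].isna().sum())

# Drop all rows with missing text instead of hard-coding a single index
cleaned = reloaded.dropna(subset=['text'])
print(n_missing, len(cleaned))
```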

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
# Define function to construct x & y values for model construction

def make_xy(clean_reviews_sample, vectorizer=None):  
    '''Make x & y values for model construction'''
    if vectorizer is None:
        vectorizer = CountVectorizer()
    x = vectorizer.fit_transform(clean_reviews_sample.text)
    x = x.tocsc()  # some versions of sklearn return COO format
    y = np.asarray([i for i in clean_reviews_sample['sentiment_label'].values])
    return x, y, vectorizer

In [25]:
%%time

x, y, vect = make_xy(clean_reviews_sample)

CPU times: user 42.8 s, sys: 1.65 s, total: 44.5 s
Wall time: 43.8 s
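
To make the output of `make_xy` more concrete, the following sketch shows what CountVectorizer produces for three short hypothetical documents: one row per document, one column per vocabulary term, and each entry the number of times the term occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned review texts
toy_docs = ['great place great atmosphere', 'terrible service', 'great service']

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_docs)  # sparse document-term matrix

# Vocabulary terms in column order (columns are assigned alphabetically)
vocab = sorted(toy_vectorizer.vocabulary_)

# Dense view, for display only; the full-size matrix should stay sparse
dense = toy_x.toarray()

print(vocab)
print(dense)
```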


In [157]:
# Save vectorizer object for later use
# See here: https://www.kaggle.com/mattwills8/fit-transform-and-save-tfidfvectorizer

import pickle

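# Use a context manager so the file is closed once the vectorizer is written
with open('Yelp_Sentiment_CountVectorizer.sav', 'wb') as f:
    pickle.dump(vect, f)

# To load vectorizer for future use input:
# with open('Yelp_Sentiment_CountVectorizer.sav', 'rb') as f:
#     variable_name = pickle.load(f)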

## 3. Logistic Regression Model ##

In [26]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [27]:
# Define function to construct logistic regression models

def logistic_regression_model(features,
                              labels,
                              test_size=0.25,
                              random_state=7,
                              stratify=None,
                              classifier=LogisticRegression,
                              solver='liblinear',
                              C=1.0,
                              penalty='l2'):
    '''Performs train_test_split, constructs a classifier object, and fits the classifier.
    Allows for specification of stratify in train_test_split, and the values of the C and penalty
    parameters in the classifier object.'''
    # Split the data into a training and test set
    xtrain, xtest, ytrain, ytest = train_test_split(features, labels, test_size=test_size, random_state=random_state, stratify=stratify)
    
    # Construct classifier object (LogisticRegression unless another class is passed in)
    logreg_clf = classifier(solver=solver, C=C, penalty=penalty)
    
    # Fit (train) the model on the training data.
    # xtrain data are the features that are being used for the classification.
    # ytrain data are the labels used to classify data points.
    logreg_clf.fit(xtrain, ytrain)
    
    return xtrain, xtest, ytrain, ytest, logreg_clf

### Model Version 1: No Stratify in train_test_split + l2 penalty parameter in model object ###

In [84]:
%%time

xtrain1a, xtest1a, ytrain1a, ytest1a, logreg_clf1a = logistic_regression_model(x, y, test_size=0.3, random_state=7)

CPU times: user 31min 16s, sys: 13.9 s, total: 31min 30s
Wall time: 5min 33s


In [85]:
# Print the accuracy scores for the training and testing data. Accuracy is the fraction of classifications that are correct.
# predict uses the model to classify the feature data (xtrain/xtest), which generates predicted y-values/labels,
# and accuracy_score compares these predicted y-values against the true label data.

print('The training accuracy score is: {}'.format(accuracy_score(ytrain1a, logreg_clf1a.predict(xtrain1a))))
print ('The test accuracy score is: {}'.format(accuracy_score(ytest1a, logreg_clf1a.predict(xtest1a))))

The training accuracy score is: 0.9276171428571428
The test accuracy score is: 0.8953333333333333


In [86]:
# Classification reports with multiple performance metrics for training data and test data

print('Classification Report for Training Data:')
print(classification_report(ytrain1a, logreg_clf1a.predict(xtrain1a)))
print()
print ('Classification Report for Testing Data:')
print(classification_report(ytest1a, logreg_clf1a.predict(xtest1a)))

Classification Report for Training Data:
              precision    recall  f1-score   support

         0.0       0.92      0.86      0.89    241702
         1.0       0.93      0.96      0.95    458298

    accuracy                           0.93    700000
   macro avg       0.93      0.91      0.92    700000
weighted avg       0.93      0.93      0.93    700000


Classification Report for Testing Data:
              precision    recall  f1-score   support

         0.0       0.87      0.81      0.84    103820
         1.0       0.91      0.94      0.92    196180

    accuracy                           0.90    300000
   macro avg       0.89      0.88      0.88    300000
weighted avg       0.89      0.90      0.89    300000



The model performs fairly well. The gap between the training and test accuracy is only about 0.03, which suggests that the model is *not* overfitting badly, but it would be interesting to see if we can narrow this gap even more and improve model performance. Accordingly, we will now see if we can tune the regularization parameter $C$ to improve the performance of the model. We will do this using scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) tool. First, however, we will perform k-fold cross-validation on the basic model constructed above, without parameter tuning, to obtain a baseline cross-validation score.
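
For reference, the role of $C$ can be read off the objective that scikit-learn documents for binary logistic regression with the 'l2' penalty. The notation here is introduced only for illustration and is not used elsewhere in this notebook: $w$ is the coefficient vector, $b$ the intercept, $x_i$ the feature vector of review $i$, $y_i \in \{-1, 1\}$ its label, and $n$ the number of training reviews.

$$\min_{w,\, b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^T w + b\right)\right)\right)$$

Since $C$ multiplies the data-fit term, it acts as an inverse regularization strength: smaller values of $C$ give the penalty term more relative weight and therefore regularize more strongly. With the 'l1' penalty, the term $\frac{1}{2} w^T w$ is replaced by $\|w\|_1$.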

In [44]:
from sklearn.model_selection import KFold

In [45]:
# Define function to calculate cross-validation score using a specified scoring function

def cv_score(clf, x, y, score_func=accuracy_score):
    result = 0
    nfold = 5
    for train, test in KFold(nfold).split(x): # split data into train/test groups, 5 times
        clf.fit(x[train], y[train]) # fit model on training data
        result += score_func(y[test], clf.predict(x[test])) # evaluate score function on held-out data
    return result / nfold # average the scores

In [75]:
%%time
# Compute cv_score on the basic logistic regression model used above without tuning the parameter

score = cv_score(logreg_clf1a, xtrain1a, ytrain1a)
score

CPU times: user 2h 16min 24s, sys: 44.7 s, total: 2h 17min 9s
Wall time: 22min 57s


0.8938642857142858
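
Note that `cv_score` refits the classifier passed to it in place, so after the call above `logreg_clf1a` holds the fit from the last fold rather than the fit on the full training set. As a hedged alternative, scikit-learn's `cross_val_score` clones the estimator for each fold and leaves the original object untouched. The sketch below uses synthetic data from `make_classification` as a hypothetical stand-in for the review features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical, not the review data)
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=7)

clf = LogisticRegression(solver='liblinear')

# cross_val_score fits a clone of clf on each fold, so clf itself stays unfitted
fold_scores = cross_val_score(clf, X_toy, y_toy, cv=KFold(n_splits=5), scoring='accuracy')
mean_score = fold_scores.mean()

print(fold_scores)
print(mean_score)
```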

In [59]:
from sklearn.model_selection import GridSearchCV

In [76]:
# Use scikit-learn's GridSearchCV tool to find the optimal value for the model parameter C

# Initialize grid of parameter values to search over: 10^-5, 10^-4, ..., 10^4
Cs = np.power(10.0, np.arange(-5, 5))

# Set up hyperparameter grid for grid search
param_grid = {'C': Cs}

# Instantiate logistic regression classifier: gridsearch_logreg1
gridsearch_logreg1 = LogisticRegression(solver='liblinear')

In [77]:
# Instantiate GridSearchCV object: gridsearch_logreg1_cv
gridsearch_logreg1_cv = GridSearchCV(gridsearch_logreg1, param_grid, cv=5)

In [78]:
%%time
# fit GridSearchCV object to training data
gridsearch_logreg1_cv.fit(xtrain1a, ytrain1a)



CPU times: user 17h 1min 11s, sys: 5min 26s, total: 17h 6min 37s
Wall time: 2h 51min 53s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': array([1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02,
       1.e+03, 1.e+04])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [79]:
# Print the tuned parameter and score
print("Tuned Logistic Regression Parameter: {}".format(gridsearch_logreg1_cv.best_params_)) 
print("Best score is {}".format(gridsearch_logreg1_cv.best_score_))

Tuned Logistic Regression Parameter: {'C': 0.1}
Best score is 0.8972485714285714


The grid search found that $C = 0.1$ gives a slightly better cross-validated accuracy (about 0.897) than the default $C = 1.0$ of the LogisticRegression class (about 0.894 in our cross-validation above). We can now use the value for $C$ provided by the grid search in our model construction to see how it influences our performance metrics.
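
Beyond `best_params_` and `best_score_`, a fitted GridSearchCV object also stores the mean cross-validated score for every candidate in `cv_results_`, which shows how sensitive the model is to $C$. The following is a minimal sketch on synthetic data (a hypothetical stand-in for the review features) with a smaller grid than the one used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the review data)
X_toy, y_toy = make_classification(n_samples=400, n_features=20, random_state=7)

# Smaller grid than in the notebook, to keep the sketch fast
param_grid = {'C': np.power(10.0, np.arange(-3, 3))}

search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
search.fit(X_toy, y_toy)

# Mean cross-validated accuracy for every candidate value of C
for params, mean_acc in zip(search.cv_results_['params'], search.cv_results_['mean_test_score']):
    print('C = {:g}: mean CV accuracy = {:.3f}'.format(params['C'], mean_acc))
```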

In [81]:
%%time

xtrain1b, xtest1b, ytrain1b, ytest1b, logreg_clf1b = logistic_regression_model(x, y, test_size=0.3, random_state=7, C=0.1)

CPU times: user 4min 49s, sys: 2.96 s, total: 4min 52s
Wall time: 54.2 s


In [83]:
# Print the accuracy scores for the training and testing data.

print('The training accuracy score is: {}'.format(accuracy_score(ytrain1b, logreg_clf1b.predict(xtrain1b))))
print ('The test accuracy score is: {}'.format(accuracy_score(ytest1b, logreg_clf1b.predict(xtest1b))))

The training accuracy score is: 0.9094328571428572
The test accuracy score is: 0.8981866666666667


After tuning the $C$ parameter, the accuracy scores for the training and test data are closer together: about 0.91 and 0.90, respectively, as opposed to 0.93 and 0.90 previously. The test accuracy also rose slightly, from about 0.895 to about 0.898. We have thus improved the model.

Below, we will see if we can construct an even better Logistic Regression model by varying the stratify parameter in the train_test_split, and the penalty parameter in the classifier object.

### Model Version 2: No Stratify in train_test_split + l1 penalty parameter in model object ###

In [87]:
%%time

xtrain2a, xtest2a, ytrain2a, ytest2a, logreg_clf2a = logistic_regression_model(x, y, test_size=0.3, random_state=7, penalty='l1')

CPU times: user 31.1 s, sys: 1.23 s, total: 32.3 s
Wall time: 29.8 s


In [88]:
# Print the accuracy scores for the training and testing data.

print('The training accuracy score is: {}'.format(accuracy_score(ytrain2a, logreg_clf2a.predict(xtrain2a))))
print ('The test accuracy score is: {}'.format(accuracy_score(ytest2a, logreg_clf2a.predict(xtest2a))))

The training accuracy score is: 0.91546
The test accuracy score is: 0.8954766666666667


In [89]:
# Classification reports with multiple performance metrics for training data and test data

print('Classification Report for Training Data:')
print(classification_report(ytrain2a, logreg_clf2a.predict(xtrain2a)))
print()
print ('Classification Report for Testing Data:')
print(classification_report(ytest2a, logreg_clf2a.predict(xtest2a)))

Classification Report for Training Data:
              precision    recall  f1-score   support

         0.0       0.91      0.84      0.87    241702
         1.0       0.92      0.95      0.94    458298

    accuracy                           0.92    700000
   macro avg       0.91      0.90      0.90    700000
weighted avg       0.92      0.92      0.91    700000


Classification Report for Testing Data:
              precision    recall  f1-score   support

         0.0       0.88      0.81      0.84    103820
         1.0       0.90      0.94      0.92    196180

    accuracy                           0.90    300000
   macro avg       0.89      0.88      0.88    300000
weighted avg       0.89      0.90      0.89    300000



With the 'l1' penalty, the gap between the training and test accuracy scores narrows to about 0.02, as opposed to about 0.03 with 'l2', while the test accuracy is essentially unchanged (about 0.895 in both cases). By this measure, the model is slightly improved. It will still be interesting to see if we can improve this performance by tuning the $C$ parameter.
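
One general property of the 'l1' penalty that may help explain the smaller gap is that it tends to drive many coefficients to exactly zero, which yields a sparser model. This is a known property of L1 regularization rather than something measured in this notebook. The sketch below compares the number of nonzero coefficients under the two penalties on synthetic data (a hypothetical stand-in for the review features).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with 5 informative features and 45 noise features (hypothetical)
X_toy, y_toy = make_classification(n_samples=500, n_features=50, n_informative=5,
                                   n_redundant=0, random_state=7)

# Fit the same model with each penalty and count the nonzero coefficients
nonzero_counts = {}
for penalty in ('l1', 'l2'):
    clf = LogisticRegression(solver='liblinear', penalty=penalty, C=0.1)
    clf.fit(X_toy, y_toy)
    nonzero_counts[penalty] = int(np.count_nonzero(clf.coef_))

print(nonzero_counts)
```

On the fitted review models, the same check would be `np.count_nonzero(logreg_clf1a.coef_)` versus `np.count_nonzero(logreg_clf2a.coef_)`.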

In [105]:
# Use scikit-learn's GridSearchCV tool to find the optimal value for the model parameter C

# Initialize grid of parameter values to search over: 10^-5, 10^-4, ..., 10^4
Cs = np.power(10.0, np.arange(-5, 5))

# Set up hyperparameter grid for grid search
param_grid = {'C': Cs}

# Instantiate logistic regression classifier: gridsearch_logreg2
gridsearch_logreg2 = LogisticRegression(solver='liblinear', penalty='l1')

In [106]:
# Instantiate GridSearchCV object: gridsearch_logreg2_cv
gridsearch_logreg2_cv = GridSearchCV(gridsearch_logreg2, param_grid, cv=5)

In [107]:
%%time
# fit GridSearchCV object to training data
gridsearch_logreg2_cv.fit(xtrain2a, ytrain2a)

CPU times: user 18min 20s, sys: 28.9 s, total: 18min 49s
Wall time: 16min 49s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l1',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': array([1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02,
       1.e+03, 1.e+04])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [110]:
# Print the tuned parameter and score
print("Tuned Logistic Regression Parameter: {}".format(gridsearch_logreg2_cv.best_params_)) 
print("Best score is {}".format(gridsearch_logreg2_cv.best_score_))

Tuned Logistic Regression Parameter: {'C': 0.1}
Best score is 0.8959185714285715


In [96]:
%%time

xtrain2b, xtest2b, ytrain2b, ytest2b, logreg_clf2b = logistic_regression_model(x, y, test_size=0.3, random_state=7, C=0.1, penalty='l1')

CPU times: user 19 s, sys: 2.54 s, total: 21.6 s
Wall time: 19.3 s


In [97]:
# Print the accuracy scores for the training and testing data.

print('The training accuracy score is: {}'.format(accuracy_score(ytrain2b, logreg_clf2b.predict(xtrain2b))))
print ('The test accuracy score is: {}'.format(accuracy_score(ytest2b, logreg_clf2b.predict(xtest2b))))

The training accuracy score is: 0.8997585714285714
The test accuracy score is: 0.8964666666666666


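After tuning the $C$ parameter, the accuracy scores for the training and test data are even closer, and in fact almost identical. Both scores round to 0.90, and there is only about a 0.003 difference between them. Measured by the gap between training and test accuracy, this is the best model performance so far, although the test accuracy itself (about 0.896) is marginally below that of the tuned Version 1 model (about 0.898).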

Let us now see how varying the 'stratify' parameter in the train_test_split affects model performance.
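
Before looking at the results, it may help to see what the stratify parameter does. When the labels are passed as `stratify`, train_test_split preserves the class proportions in both splits; without it, the proportions can drift slightly by chance. The sketch below uses hypothetical toy labels with the same 65/35 balance as the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 65/35 class balance (hypothetical data)
y_toy = np.array([1] * 650 + [0] * 350)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Plain random split
_, _, _, ytest_plain = train_test_split(X_toy, y_toy, test_size=0.3, random_state=7)

# Stratified split: class proportions are preserved in train and test
_, _, _, ytest_strat = train_test_split(X_toy, y_toy, test_size=0.3, random_state=7,
                                        stratify=y_toy)

print('Positive share without stratify:', ytest_plain.mean())
print('Positive share with stratify:   ', ytest_strat.mean())
```

With a sample as large as ours, a plain random split already lands very close to the overall proportions, so the effect of stratifying is expected to be small.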

### Model Version 3: Stratify in train_test_split + l2 penalty parameter in model object ###

In [99]:
%%time

xtrain3a, xtest3a, ytrain3a, ytest3a, logreg_clf3a = logistic_regression_model(x, y, test_size=0.3, random_state=7, stratify=y)

CPU times: user 34min 15s, sys: 27.4 s, total: 34min 43s
Wall time: 6min 48s




In [100]:
# Print the accuracy scores for the training and testing data.

print('The training accuracy score is: {}'.format(accuracy_score(ytrain3a, logreg_clf3a.predict(xtrain3a))))
print ('The test accuracy score is: {}'.format(accuracy_score(ytest3a, logreg_clf3a.predict(xtest3a))))

The training accuracy score is: 0.9259514285714285
The test accuracy score is: 0.89606


In [101]:
# Classification reports with multiple performance metrics for training data and test data

print('Classification Report for Training Data:')
print(classification_report(ytrain3a, logreg_clf3a.predict(xtrain3a)))
print()
print ('Classification Report for Testing Data:')
print(classification_report(ytest3a, logreg_clf3a.predict(xtest3a)))

Classification Report for Training Data:
              precision    recall  f1-score   support

         0.0       0.92      0.86      0.89    241865
         1.0       0.93      0.96      0.94    458135

    accuracy                           0.93    700000
   macro avg       0.92      0.91      0.92    700000
weighted avg       0.93      0.93      0.93    700000


Classification Report for Testing Data:
              precision    recall  f1-score   support

         0.0       0.87      0.82      0.84    103657
         1.0       0.91      0.94      0.92    196343

    accuracy                           0.90    300000
   macro avg       0.89      0.88      0.88    300000
weighted avg       0.90      0.90      0.90    300000



Note, the model performance is about the same as with the first version, with a 0.03 gap between the training and test accuracy scores.

### Model Version 4: Stratify in train_test_split + l1 penalty parameter in model object ###

In [102]:
%%time

xtrain4a, xtest4a, ytrain4a, ytest4a, logreg_clf4a = logistic_regression_model(x, y, test_size=0.3, random_state=7, stratify=y, penalty='l1')

CPU times: user 29.3 s, sys: 1.25 s, total: 30.6 s
Wall time: 27.1 s


In [103]:
# Print the accuracy scores for the training and testing data.

print('The training accuracy score is: {}'.format(accuracy_score(ytrain4a, logreg_clf4a.predict(xtrain4a))))
print ('The test accuracy score is: {}'.format(accuracy_score(ytest4a, logreg_clf4a.predict(xtest4a))))

The training accuracy score is: 0.9152942857142857
The test accuracy score is: 0.8960966666666667


In [104]:
# Classification reports with multiple performance metrics for training data and test data

print('Classification Report for Training Data:')
print(classification_report(ytrain4a, logreg_clf4a.predict(xtrain4a)))
print()
print ('Classification Report for Testing Data:')
print(classification_report(ytest4a, logreg_clf4a.predict(xtest4a)))

Classification Report for Training Data:
              precision    recall  f1-score   support

         0.0       0.91      0.84      0.87    241865
         1.0       0.92      0.95      0.94    458135

    accuracy                           0.92    700000
   macro avg       0.91      0.90      0.90    700000
weighted avg       0.91      0.92      0.91    700000


Classification Report for Testing Data:
              precision    recall  f1-score   support

         0.0       0.88      0.81      0.84    103657
         1.0       0.91      0.94      0.92    196343

    accuracy                           0.90    300000
   macro avg       0.89      0.88      0.88    300000
weighted avg       0.90      0.90      0.90    300000



Again, we find that using 'l1' as the penalty value narrows the gap between the training and test accuracy scores relative to 'l2' (about 0.02 versus 0.03), with essentially the same test accuracy. Let us see if tuning the $C$ parameter will improve performance more than in Version 2 above, where we used 'l1' but did not stratify the train_test_split.

In [111]:
# Use scikit-learn's GridSearchCV tool to find the optimal value for the model parameter C

# Initialize grid of parameter values to search over: 10^-5, 10^-4, ..., 10^4
Cs = np.power(10.0, np.arange(-5, 5))

# Set up hyperparameter grid for grid search
param_grid = {'C': Cs}

# Instantiate logistic regression classifier: gridsearch_logreg4
gridsearch_logreg4 = LogisticRegression(solver='liblinear', penalty='l1')

In [112]:
# Instantiate GridSearchCV object: gridsearch_logreg4_cv
gridsearch_logreg4_cv = GridSearchCV(gridsearch_logreg4, param_grid, cv=5)

In [113]:
%%time
# fit GridSearchCV object to training data
gridsearch_logreg4_cv.fit(xtrain4a, ytrain4a)

CPU times: user 19min 35s, sys: 31.7 s, total: 20min 6s
Wall time: 18min 8s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l1',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': array([1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02,
       1.e+03, 1.e+04])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [114]:
# Print the tuned parameter and score
print("Tuned Logistic Regression Parameter: {}".format(gridsearch_logreg4_cv.best_params_)) 
print("Best score is {}".format(gridsearch_logreg4_cv.best_score_))

Tuned Logistic Regression Parameter: {'C': 0.1}
Best score is 0.8954414285714286


In [28]:
%%time

xtrain4b, xtest4b, ytrain4b, ytest4b, logreg_clf4b = logistic_regression_model(x, y, test_size=0.3, random_state=7, stratify=y, C=0.1, penalty='l1')

CPU times: user 20.3 s, sys: 1.29 s, total: 21.6 s
Wall time: 19.1 s


In [29]:
# Print the accuracy scores for the training and testing data.

print('The training accuracy score is: {}'.format(accuracy_score(ytrain4b, logreg_clf4b.predict(xtrain4b))))
print ('The test accuracy score is: {}'.format(accuracy_score(ytest4b, logreg_clf4b.predict(xtest4b))))

The training accuracy score is: 0.8995841422630604
The test accuracy score is: 0.8967866666666666


In [117]:
# For comparison, print the accuracy scores for the tuned Version 2 model again.

print('The training accuracy score is: {}'.format(accuracy_score(ytrain2b, logreg_clf2b.predict(xtrain2b))))
print ('The test accuracy score is: {}'.format(accuracy_score(ytest2b, logreg_clf2b.predict(xtest2b))))

The training accuracy score is: 0.8997585714285714
The test accuracy score is: 0.8964666666666666


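The gap between the training and test accuracy scores (about 0.0028) is slightly smaller than the gap between the scores in Version 2 of our model (about 0.0033), and the test accuracy is marginally higher (0.8968 versus 0.8965). The differences are very small, but measured by the gap between training and test accuracy, Version 4 performs the best of the variations we have tested.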

In [30]:
# Classification reports with multiple performance metrics for training data and test data

print('Classification Report for Training Data:')
print(classification_report(ytrain4b, logreg_clf4b.predict(xtrain4b)))
print()
print ('Classification Report for Testing Data:')
print(classification_report(ytest4b, logreg_clf4b.predict(xtest4b)))

Classification Report for Training Data:
              precision    recall  f1-score   support

         0.0       0.89      0.81      0.85    241865
         1.0       0.90      0.95      0.93    458134

    accuracy                           0.90    699999
   macro avg       0.90      0.88      0.89    699999
weighted avg       0.90      0.90      0.90    699999


Classification Report for Testing Data:
              precision    recall  f1-score   support

         0.0       0.89      0.81      0.84    103656
         1.0       0.90      0.95      0.92    196344

    accuracy                           0.90    300000
   macro avg       0.89      0.88      0.88    300000
weighted avg       0.90      0.90      0.90    300000



In [153]:
# Save final tuned model for later use
# See here: https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
# And here: https://www.geeksforgeeks.org/saving-a-machine-learning-model/

import pickle

filename = 'logreg_Yelp_sentiment_classifier.sav'

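# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(logreg_clf4b, f)

# To load model for future use input:
# with open(filename, 'rb') as f:
#     variable_name = pickle.load(f)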
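
To classify new reviews later, both the saved vectorizer and the saved model are needed, and new text has to go through the vectorizer's `transform` (not `fit_transform`) so that it is mapped onto the vocabulary the model was trained on. The sketch below illustrates the round trip with a tiny hypothetical training set and temporary files. The file names and data are placeholders, not the notebook's actual artifacts.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set, already cleaned (lowercase, no punctuation)
train_texts = ['great food great service', 'awful food rude service',
               'great atmosphere', 'awful experience']
train_labels = [1, 0, 1, 0]

toy_vect = CountVectorizer()
toy_clf = LogisticRegression(solver='liblinear')
toy_clf.fit(toy_vect.fit_transform(train_texts), train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    vect_path = os.path.join(tmp_dir, 'toy_vectorizer.sav')
    clf_path = os.path.join(tmp_dir, 'toy_classifier.sav')

    # Save both objects
    with open(vect_path, 'wb') as f:
        pickle.dump(toy_vect, f)
    with open(clf_path, 'wb') as f:
        pickle.dump(toy_clf, f)

    # Load them back, as a later session would
    with open(vect_path, 'rb') as f:
        loaded_vect = pickle.load(f)
    with open(clf_path, 'rb') as f:
        loaded_clf = pickle.load(f)

# New reviews should be cleaned the same way as the training text, then transformed
new_x = loaded_vect.transform(['great food', 'awful service'])
predictions = loaded_clf.predict(new_x)
print(predictions)
```

Pickled scikit-learn objects are generally only reliable when loaded with the same library version that saved them, so it is worth recording the version alongside the files.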

## 4. Possible Future Developments ##
* Use stemming or lemmatization
* Build a model where features are bigrams/pairs of words instead of individual words
* Use a different vectorizer (e.g. tf-idf vectorizer, binary vectorizer)
* Add additional features, such as length of review, to help with label prediction
* Use different min_df and max_df values in the vectorization
* Try doing the train/test split *before* doing vectorization, so that the vocabulary is learned from the training data only
* Use different algorithms to classify the reviews (e.g. Naive Bayes, tree-based models such as RandomForest or XGBoost, SVMs, Neural Nets)
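
Several of these ideas can be combined using scikit-learn's Pipeline. The following is only a rough sketch under stated assumptions, not a tested improvement: the texts and labels are hypothetical toy data, and the parameter values (`ngram_range=(1, 2)`, `C=10.0`) are arbitrary placeholders. It splits the raw text before vectorizing, uses a tf-idf vectorizer, and includes bigrams as features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical cleaned reviews and sentiment labels
texts = ['great food friendly staff', 'awful food rude staff',
         'loved the atmosphere', 'terrible service never again',
         'great service great value', 'rude waiter awful experience',
         'friendly staff loved it', 'terrible food awful value']
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, so the test reviews never influence the vocabulary
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=7, stratify=labels)

pipeline = Pipeline([
    # Unigrams and bigrams, weighted by tf-idf
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('logreg', LogisticRegression(solver='liblinear', penalty='l1', C=10.0)),
])

# fit learns the vocabulary and the model from the training text only
pipeline.fit(text_train, y_train)
test_accuracy = pipeline.score(text_test, y_test)
print(test_accuracy)
```

A pipeline like this can also be passed directly to GridSearchCV, in which case the vectorizer is refit inside each cross-validation fold.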