# Create Models	50	

Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use an 80/20 training/testing split for your data). Adjust the parameters of the models to make them more accurate. If your dataset is large enough to require stochastic gradient descent, using only a linear kernel is fine. That is, the SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe.

In [1]:
import pandas as pd
import numpy as ny
import seaborn as sb
import matplotlib.pyplot as plt
import plotly.express as px
import os 
import sklearn.model_selection as ms
from sklearn.linear_model import LogisticRegression
import sklearn.linear_model as lm
from sklearn import metrics as mt

from sklearn.metrics import (accuracy_score,brier_score_loss, precision_score, recall_score,
                             f1_score)

In [2]:
!pip3 install pickle5
import socket
import pickle5 as pickle

is_rohit=socket.gethostname()=='Rohits-MacBook-Pro.local'
is_blake=socket.gethostname()=='BJH-ML-machine'

if(is_rohit):
    # the built-in open() does not expand "~", so resolve the home directory explicitly
    data_dir = os.path.expanduser('~/Documents/yelp_datasets')
    with open(os.path.join(data_dir, 'df_business_eda.pickle'), "rb") as f:
      pick_data = pickle.load(f)
      pick_data.to_pickle(os.path.join(data_dir, 'df_business_eda_proto4.pickle'))

    df_business_eda = pd.read_pickle(os.path.join(data_dir, 'df_business_eda_proto4.pickle'))
    
if(is_blake):
    df_business_eda = pd.read_pickle("df_business_eda.pickle")



In [3]:
df_business_eda

Unnamed: 0,rating_category,stars,Beauty & Spas,BusinessAcceptsCreditCards,Restaurants,is_open,parking_lot,city_state,zip3,RestaurantsPriceRange2,text,checkin_count,useful_count,cool_count,funny_count,review_count,review_word_count
1,2,5.0,False,False,False,True,False,Scottsdale_AZ,852,,,9.0,4,1,2,4,121.000000
3,0,2.5,False,True,False,False,True,North Las Vegas_NV,890,4.0,,3.0,1,0,0,3,26.666667
4,2,4.5,False,True,False,True,False,Mesa_AZ,852,,,1.0,11,3,6,26,86.962963
5,2,4.5,False,True,False,True,False,Gilbert_AZ,852,,4.0,39.0,25,2,6,38,81.690476
6,1,3.5,False,True,False,True,True,Las Vegas_NV,891,1.0,22.0,328.0,50,22,18,81,111.097561
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209380,0,2.0,False,True,True,True,True,Phoenix_AZ,850,2.0,32.0,253.0,43,27,26,106,103.944954
209382,2,4.5,False,False,False,True,False,Las Vegas_NV,891,,9.0,168.0,73,30,10,124,143.593985
209384,2,5.0,False,True,False,True,False,Tempe _AZ,852,,,9.0,1,0,0,5,65.200000
209386,2,5.0,False,True,False,True,False,Las Vegas_NV,891,,11.0,8.0,89,49,52,217,100.977679


In [4]:
categorical_cols=df_business_eda.select_dtypes(include=['object']).columns

df_business_hot=df_business_eda

for col in categorical_cols:
    dummies=pd.get_dummies(df_business_hot[col], dummy_na=True, prefix=col)
    df_business_hot=df_business_hot.\
        drop(col,axis=1).\
    merge(
        dummies,
        how='left',
        left_index=True,
        right_index=True
        )
    
df_business_hot=df_business_hot.fillna(False)
df_business_hot.head()

Unnamed: 0,rating_category,stars,Beauty & Spas,BusinessAcceptsCreditCards,Restaurants,is_open,parking_lot,RestaurantsPriceRange2,text,checkin_count,...,zip3_928,zip3_930,zip3_940,zip3_952,zip3_953,zip3_959,zip3_967,zip3_981,zip3_nan,zip3_nan.1
1,2,5.0,False,False,False,True,False,False,False,9,...,0,0,0,0,0,0,0,0,0,0
3,0,2.5,False,True,False,False,True,4,False,3,...,0,0,0,0,0,0,0,0,0,0
4,2,4.5,False,True,False,True,False,False,False,1,...,0,0,0,0,0,0,0,0,0,0
5,2,4.5,False,True,False,True,False,False,4,39,...,0,0,0,0,0,0,0,0,0,0
6,1,3.5,False,True,False,True,True,1,22,328,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Note the encoding: 0 = stars above 3.5, 1 = stars of 3.5 or below
df_business_hot["above_average"] = ny.where(df_business_hot["stars"] > 3.5, 0, 1)
y=df_business_hot.rating_category
y2 = df_business_hot["above_average"]
X=df_business_hot.drop(["rating_category", "stars","above_average"], axis=1)

In [6]:
num_cv_iterations = 5
num_instances = len(y)
cv_object = ms.ShuffleSplit(n_splits=num_cv_iterations,
                         random_state=123,
                         test_size  = 0.2)
                         
print(cv_object)

ShuffleSplit(n_splits=5, random_state=123, test_size=0.2, train_size=None)


In [7]:
def cv_train(model,x,y):
    
    iter_num=0

    for train_indices, test_indices in cv_object.split(x,y): 
        #print(f'{train_indices},{test_indices}')

        X_train = x.iloc[train_indices]
        y_train = y.iloc[train_indices]

        X_test = x.iloc[test_indices]
        y_test = y.iloc[test_indices]

        model.fit(X_train,y_train)  # train object
        y_hat = model.predict(X_test) # get test set predictions

        conf = mt.confusion_matrix(y_test,y_hat)
        
        print(f"====Iteration {iter_num} ====")
        print(f"confusion matrix:\n {conf}")
        print("\tAccuracy: %1.3f" % accuracy_score(y_test, y_hat))
        print("\tPrecision: %1.3f" % precision_score(y_test, y_hat, average="macro"))
        print("\tRecall: %1.3f" % recall_score(y_test, y_hat, average="macro"))
        print("\tF1: %1.3f\n" % f1_score(y_test, y_hat, average="macro"))
        iter_num+=1
        
#cv_train(linear_model, X, y2)

# Logistic Regression

## 3 Classes

In [8]:
linear_model = LogisticRegression(penalty='l2', C=1.0, class_weight=None, solver='liblinear' )

We tried all available solvers (‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’); liblinear gave the best accuracy.
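
The solver comparison itself is not shown in the notebook. The following is a minimal sketch of how such a comparison might be run, assuming a synthetic binary dataset from `make_classification` as a stand-in for the Yelp features (`X_demo`, `y_demo`, and `solver_scores` are illustrative names, not part of the original code). It reuses the same 5-split, 80/20 `ShuffleSplit` scheme.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the real feature matrix (assumption, not the Yelp data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=2, random_state=123
)

# Same splitting scheme as the notebook: 5 shuffled 80/20 splits.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=123)

solver_scores = {}
for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    model = LogisticRegression(penalty="l2", C=1.0, solver=solver, max_iter=1000)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide convergence warnings in this demo
        scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    solver_scores[solver] = scores.mean()

# Print solvers from best to worst mean accuracy.
for solver, score in sorted(solver_scores.items(), key=lambda kv: -kv[1]):
    print(f"{solver:>10}: mean accuracy = {score:.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `X` and `y` (or `y2`); the ranking of solvers on synthetic data says nothing about their ranking on the Yelp data.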

In [9]:
cv_train(linear_model, X, y)

====Iteration 0 ====
confusion matrix:
 [[1194  716 2140]
 [ 610 1343 3668]
 [ 445  698 9164]]
	Accuracy: 0.586
	Precision: 0.543
	Recall: 0.474
	F1: 0.475

====Iteration 1 ====
confusion matrix:
 [[1219  690 2166]
 [ 691 1294 3837]
 [ 427  662 8992]]
	Accuracy: 0.576
	Precision: 0.537
	Recall: 0.471
	F1: 0.468

====Iteration 2 ====
confusion matrix:
 [[1226  678 2161]
 [ 643 1332 3718]
 [ 480  648 9092]]
	Accuracy: 0.583
	Precision: 0.543
	Recall: 0.475
	F1: 0.474

====Iteration 3 ====
confusion matrix:
 [[1159  691 2235]
 [ 636 1310 3740]
 [ 430  668 9109]]
	Accuracy: 0.580
	Precision: 0.539
	Recall: 0.469
	F1: 0.467

====Iteration 4 ====
confusion matrix:
 [[1254  668 2159]
 [ 664 1313 3626]
 [ 459  659 9176]]
	Accuracy: 0.588
	Precision: 0.546
	Recall: 0.478
	F1: 0.478



## 2 Class

In [10]:
cv_train(linear_model, X, y2)

====Iteration 0 ====
confusion matrix:
 [[7587 2720]
 [3507 6164]]
	Accuracy: 0.688
	Precision: 0.689
	Recall: 0.687
	F1: 0.687

====Iteration 1 ====
confusion matrix:
 [[7448 2633]
 [3575 6322]]
	Accuracy: 0.689
	Precision: 0.691
	Recall: 0.689
	F1: 0.688

====Iteration 2 ====
confusion matrix:
 [[7493 2727]
 [3490 6268]]
	Accuracy: 0.689
	Precision: 0.690
	Recall: 0.688
	F1: 0.688

====Iteration 3 ====
confusion matrix:
 [[7557 2650]
 [3527 6244]]
	Accuracy: 0.691
	Precision: 0.692
	Recall: 0.690
	F1: 0.689

====Iteration 4 ====
confusion matrix:
 [[7666 2628]
 [3505 6179]]
	Accuracy: 0.693
	Precision: 0.694
	Recall: 0.691
	F1: 0.691



## Logistic Regression - Stochastic Gradient Descent

## 3 Class

In [11]:
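# loss="log" fits logistic regression with SGD (renamed to "log_loss" in scikit-learn 1.1 and later)
log_model = lm.SGDClassifier(n_jobs=-1, loss="log")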

In [12]:
cv_train(log_model, X, y)

====Iteration 0 ====
confusion matrix:
 [[2100  831 1119]
 [1605 1416 2600]
 [1819  829 7659]]
	Accuracy: 0.559
	Precision: 0.505
	Recall: 0.505
	F1: 0.490

====Iteration 1 ====
confusion matrix:
 [[1807  603 1665]
 [1335  919 3568]
 [ 948  529 8604]]
	Accuracy: 0.567
	Precision: 0.504
	Recall: 0.485
	F1: 0.465

====Iteration 2 ====
confusion matrix:
 [[ 933 1607 1525]
 [ 656 1448 3589]
 [1155  994 8071]]
	Accuracy: 0.523
	Precision: 0.437
	Recall: 0.425
	F1: 0.420

====Iteration 3 ====
confusion matrix:
 [[1528 1214 1343]
 [1583 1994 2109]
 [1320 2016 6871]]
	Accuracy: 0.520
	Precision: 0.464
	Recall: 0.466
	F1: 0.465

====Iteration 4 ====
confusion matrix:
 [[2587  727  767]
 [2460 1611 1532]
 [3216 1141 5937]]
	Accuracy: 0.507
	Precision: 0.499
	Recall: 0.499
	F1: 0.472



## 2 Class

In [13]:
cv_train(log_model, X, y2)

====Iteration 0 ====
confusion matrix:
 [[8806 1501]
 [6176 3495]]
	Accuracy: 0.616
	Precision: 0.644
	Recall: 0.608
	F1: 0.587

====Iteration 1 ====
confusion matrix:
 [[6310 3771]
 [2820 7077]]
	Accuracy: 0.670
	Precision: 0.672
	Recall: 0.670
	F1: 0.670

====Iteration 2 ====
confusion matrix:
 [[7883 2337]
 [4130 5628]]
	Accuracy: 0.676
	Precision: 0.681
	Recall: 0.674
	F1: 0.672

====Iteration 3 ====
confusion matrix:
 [[2420 7787]
 [1471 8300]]
	Accuracy: 0.537
	Precision: 0.569
	Recall: 0.543
	F1: 0.493

====Iteration 4 ====
confusion matrix:
 [[5320 4974]
 [3825 5859]]
	Accuracy: 0.560
	Precision: 0.561
	Recall: 0.561
	F1: 0.559



# Support Vector Machine

In [14]:
# loss="hinge" fits a linear support vector machine with SGD
svm_model = lm.SGDClassifier(n_jobs=-1, loss="hinge")

## 3 Class

In [15]:
cv_train(svm_model, X, y)

====Iteration 0 ====
confusion matrix:
 [[ 880  950 2220]
 [ 752  718 4151]
 [1696 1290 7321]]
	Accuracy: 0.446
	Precision: 0.347
	Recall: 0.352
	F1: 0.339

====Iteration 1 ====
confusion matrix:
 [[1806 1343  926]
 [3507  959 1356]
 [4368  954 4759]]
	Accuracy: 0.377
	Precision: 0.386
	Recall: 0.360
	F1: 0.343

====Iteration 2 ====
confusion matrix:
 [[ 116 2874 1075]
 [ 102 2966 2625]
 [ 140 2683 7397]]
	Accuracy: 0.525
	Precision: 0.446
	Recall: 0.424
	F1: 0.388

====Iteration 3 ====
confusion matrix:
 [[ 420  960 2705]
 [ 293 2178 3215]
 [ 276 2946 6985]]
	Accuracy: 0.480
	Precision: 0.441
	Recall: 0.390
	F1: 0.380

====Iteration 4 ====
confusion matrix:
 [[1334 1818  929]
 [ 947 2308 2348]
 [1914 1992 6388]]
	Accuracy: 0.502
	Precision: 0.452
	Recall: 0.453
	F1: 0.452



## 2 Class

In [16]:
cv_train(svm_model, X, y2)

====Iteration 0 ====
confusion matrix:
 [[5187 5120]
 [1774 7897]]
	Accuracy: 0.655
	Precision: 0.676
	Recall: 0.660
	F1: 0.648

====Iteration 1 ====
confusion matrix:
 [[6665 3416]
 [3094 6803]]
	Accuracy: 0.674
	Precision: 0.674
	Recall: 0.674
	F1: 0.674

====Iteration 2 ====
confusion matrix:
 [[6089 4131]
 [2619 7139]]
	Accuracy: 0.662
	Precision: 0.666
	Recall: 0.664
	F1: 0.661

====Iteration 3 ====
confusion matrix:
 [[7070 3137]
 [4360 5411]]
	Accuracy: 0.625
	Precision: 0.626
	Recall: 0.623
	F1: 0.622

====Iteration 4 ====
confusion matrix:
 [[7458 2836]
 [4246 5438]]
	Accuracy: 0.646
	Precision: 0.647
	Recall: 0.643
	F1: 0.642
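
The SGD-based runs above swing noticeably between splits (for example, the 3-class hinge accuracy ranges from 0.377 to 0.525), while the liblinear results are stable. One common cause of such instability is features on very different scales, although that has not been verified for this dataset. The sketch below, which uses synthetic data and an artificially inflated column as assumptions, shows how a `StandardScaler` could be placed in front of `SGDClassifier` so that both variants can be compared under the same splits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption, not the Yelp data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=123
)
# Inflate one column so the features live on very different scales,
# loosely mimicking raw count columns next to 0/1 dummy columns.
X_demo[:, 0] *= 1000.0

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=123)

raw_svm = SGDClassifier(loss="hinge", random_state=123)
# The scaler is fit on each training split only, then applied to the test split.
scaled_svm = make_pipeline(
    StandardScaler(), SGDClassifier(loss="hinge", random_state=123)
)

raw_scores = cross_val_score(raw_svm, X_demo, y_demo, cv=cv)
scaled_scores = cross_val_score(scaled_svm, X_demo, y_demo, cv=cv)

print(f"unscaled: mean={raw_scores.mean():.3f}, std={raw_scores.std():.3f}")
print(f"scaled:   mean={scaled_scores.mean():.3f}, std={scaled_scores.std():.3f}")
```

Because the pipeline exposes `fit` and `predict`, it could be passed to `cv_train` in place of `svm_model` or `log_model` without changing that function.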



# Model Advantages	10

Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.
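
Training time is not measured anywhere in the notebook. As a rough sketch of how the efficiency question could be approached, the following times one fit of each model type on a single 80/20 split. The data is synthetic and the names `candidates` and `timings` are illustrative assumptions; timings on the real dataset will differ.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features (assumption, not the Yelp data).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=20, n_informative=8, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

candidates = {
    "LogisticRegression (liblinear)": LogisticRegression(solver="liblinear"),
    "SGDClassifier (hinge)": SGDClassifier(loss="hinge", random_state=123),
}

timings = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    accuracy = model.score(X_test, y_test)  # accuracy on the held-out 20%
    timings[name] = (fit_seconds, accuracy)
    print(f"{name}: fit time = {fit_seconds:.4f}s, accuracy = {accuracy:.3f}")
```

A single timed fit is noisy; repeating the measurement over several splits would give a more reliable comparison.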

# Interpret Feature Importance	30

Use the weights from logistic regression to interpret the importance of different features for the classification task. Explain your interpretation in detail. Why do you think some variables are more important?
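
One possible starting point, sketched here on synthetic data, is to standardize the features, fit the logistic regression, and rank the entries of `coef_` by absolute value. Standardizing first is an assumption of this sketch (the notebook fits on unscaled features); without it, weight magnitudes also reflect each feature's units. The feature names are hypothetical placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption, not the Yelp data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=123
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

# Put every feature on the same scale so weight magnitudes are comparable.
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
model.fit(X_scaled, y_demo)

# For a binary problem coef_ has shape (1, n_features).
weights = model.coef_[0]
order = np.argsort(-np.abs(weights))  # largest absolute weight first
ranked = [(feature_names[i], float(weights[i])) for i in order]

for name, weight in ranked:
    print(f"{name:>10}: {weight:+.3f}")
```

A positive weight raises the log-odds of the class labeled 1 as the feature increases, and a negative weight lowers it. For the 3-class target, `coef_` would have one row of weights per class.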

# Interpret Support Vectors	10

Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model, then analyze the support vectors from the subsampled dataset.
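
The subsampling approach described above can be sketched as follows. This is a minimal illustration on synthetic data, with the subsample size of 500, the linear kernel, and the scaling step all chosen as assumptions for the demo rather than taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features (assumption, not the Yelp data).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, n_informative=5, random_state=123
)

# Draw a random subsample without replacement so SVC trains quickly.
rng = np.random.default_rng(123)
subsample_idx = rng.choice(len(y_demo), size=500, replace=False)
X_sub = StandardScaler().fit_transform(X_demo[subsample_idx])
y_sub = y_demo[subsample_idx]

svc = SVC(kernel="linear", C=1.0)
svc.fit(X_sub, y_sub)

# n_support_ holds the number of support vectors for each class.
print("support vectors per class:", dict(zip(svc.classes_, svc.n_support_)))
print("share of subsample used as support vectors:",
      round(len(svc.support_) / len(y_sub), 3))

# support_ indexes into the subsample; map it back to rows of the full data.
sv_rows = subsample_idx[svc.support_]
print("first few original row positions:", sv_rows[:5])
```

Mapping `support_` back to the original rows makes it possible to inspect which instances sit on or inside the margin. A large share of support vectors would suggest that the classes overlap heavily in feature space.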