# Create Models	50	

Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use. That is, the SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe. 

In [21]:
import pandas as pd
import numpy as ny
import seaborn as sb
import matplotlib.pyplot as plt
import plotly.express as px
import os 
import sklearn.model_selection as ms
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

In [5]:
df_business_eda = pd.read_pickle("~/Documents/yelp_datasets/df_business_eda.pickle")

ValueError: unsupported pickle protocol: 5

In [14]:
!pip3 install pickle5
import socket
import pickle5 as pickle

is_rohit=socket.gethostname()=='Rohits-MacBook-Pro.local'

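if(is_rohit):
    # open() does not expand "~" (pandas does), so resolve the home directory explicitly
    with open(os.path.expanduser('~/Documents/yelp_datasets/df_business_eda.pickle'), "rb") as f:
      # Load the protocol-5 pickle via pickle5, then re-save it so pd.read_pickle can load it
      pick_data = pickle.load(f)
      pick_data.to_pickle('~/Documents/yelp_datasets/df_business_eda_proto4.pickle')

    df_business_eda = pd.read_pickle("~/Documents/yelp_datasets/df_business_eda_proto4.pickle")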



In [16]:
categorical_cols=df_business_eda.select_dtypes(include=['object']).columns

df_business_hot=df_business_eda

for col in categorical_cols:
    dummies=pd.get_dummies(df_business_hot[col], dummy_na=True, prefix=col)
    df_business_hot=df_business_hot.\
        drop(col,axis=1).\
    merge(
        dummies,
        how='left',
        left_index=True,
        right_index=True
        )
    
df_business_hot=df_business_hot.fillna(False)
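
The loop above encodes one categorical column at a time and merges the dummies back on the index. A possible shorthand is to pass the column list to `pd.get_dummies` directly. The sketch below assumes a hypothetical toy frame (`toy`, with made-up `city` and `review_count` columns) rather than the Yelp data, so it only illustrates the call shape.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_business_eda
toy = pd.DataFrame({
    "city": ["Austin", None, "Boston"],
    "review_count": [10, 3, 7],
})

# Encode the listed columns in one call; the original column is dropped and
# dummy_na=True adds a <column>_nan indicator for missing values
toy_hot = pd.get_dummies(toy, columns=["city"], dummy_na=True)

print(toy_hot.columns.tolist())
```

Because missing values get their own indicator column, the encoded columns contain no NaN to fill afterwards.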

In [23]:
y=df_business_hot.rating_category
X=df_business_hot.drop(["rating_category", "stars"], axis=1)

x_train,x_test,y_train,y_test=ms.train_test_split(X,y,test_size=0.2)
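
The split above is unseeded, so each run draws different rows, and class proportions in the two parts can drift. A minimal sketch of a reproducible, stratified 80/20 split follows. The data is synthetic and the seed `42` is an arbitrary assumption, not a value from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, three classes with 20% / 30% / 50% shares
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 200 + [1] * 300 + [2] * 500)

# stratify keeps the class shares equal in both parts; random_state fixes the draw
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train class shares:", train_share)
print("test class shares: ", test_share)
```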

In [39]:
#printing shapes of testing and training sets :
print("shape of original dataset :", df_business_hot.shape)
print("shape of input - training set", x_train.shape)
print("shape of output - training set", y_train.shape)
print("shape of input - testing set", x_test.shape)
print("shape of output - testing set", y_test.shape)

shape of original dataset : (99887, 276)
shape of input - training set (79909, 274)
shape of output - training set (79909,)
shape of input - testing set (19978, 274)
shape of output - testing set (19978,)


# Logistic Regression

## Stochastic Gradient Descent 

In [40]:
num_cv_iterations = 5
num_instances = len(y)
cv_object = ms.ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)
                         
print(cv_object)

ShuffleSplit(n_splits=5, random_state=None, test_size=0.2, train_size=None)


In [47]:
linear_model = LogisticRegression(penalty='l2', C=1.0, class_weight=None, solver='liblinear' )

In [52]:
iter_num=0

for train_indices, test_indices in cv_object.split(X,y): 
    #print(f'{train_indices},{test_indices}')

    # Fold-specific label names keep the hold-out y_train / y_test intact
    X_train = X.iloc[train_indices]
    y_train_cv = y.iloc[train_indices]
    
    X_test = X.iloc[test_indices]
    y_test_cv = y.iloc[test_indices]
    
    linear_model.fit(X_train,y_train_cv)  # train object
    y_hat = linear_model.predict(X_test) # get test set predictions

    acc = mt.accuracy_score(y_test_cv,y_hat)
    conf = mt.confusion_matrix(y_test_cv,y_hat)
    print(f"====Iteration Multinomial {iter_num} ====")
    print(f"accuracy: {acc}" )
    print(f"confusion matrix:\n {conf}")
    iter_num+=1

====Iteration Multinomial 0 ====
accuracy: 0.5829912904194614
confusion matrix:
 [[1198  716 2184]
 [ 650 1324 3568]
 [ 448  765 9125]]
====Iteration Multinomial 1 ====
accuracy: 0.5780358394233657
confusion matrix:
 [[1179  712 2182]
 [ 666 1316 3690]
 [ 461  719 9053]]
====Iteration Multinomial 2 ====
accuracy: 0.582390629692662
confusion matrix:
 [[1201  716 2104]
 [ 623 1260 3825]
 [ 447  628 9174]]
====Iteration Multinomial 3 ====
accuracy: 0.5806387025728301
confusion matrix:
 [[1187  702 2239]
 [ 661 1351 3636]
 [ 453  687 9062]]
====Iteration Multinomial 4 ====
accuracy: 0.5803383722094304
confusion matrix:
 [[1162  677 2212]
 [ 666 1331 3718]
 [ 447  664 9101]]
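
In iteration 0 the largest class accounts for 10,338 of the 19,978 test rows (about 52%), so an accuracy near 0.58 is best read against a majority-class baseline. The sketch below shows one way to get both numbers from the same `ShuffleSplit` folds with `cross_val_score`. It runs on synthetic binary data, so the printed values say nothing about the Yelp results. The seed is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic, imbalanced stand-in for the real feature matrix
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, weights=[0.7, 0.3], random_state=0,
)

# A fixed random_state makes both estimators see identical folds
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

model_scores = cross_val_score(
    LogisticRegression(solver="liblinear"), X_demo, y_demo, cv=cv, scoring="accuracy"
)
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print(f"model    mean accuracy: {model_scores.mean():.3f}")
print(f"baseline mean accuracy: {baseline_scores.mean():.3f}")
```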


## Solver comparison: of ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, and ‘saga’, liblinear gave the best accuracy
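
The comparison itself is not shown in this notebook. A minimal sketch of how such a sweep could be run is below, assuming synthetic binary data, a single hold-out split, and `max_iter=1000`. None of these settings are taken from the original experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# sag and saga converge much faster on standardized features;
# the scaler is fit on the training part only
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

solver_accuracy = {}
for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    clf = LogisticRegression(solver=solver, max_iter=1000, random_state=0)
    clf.fit(X_tr_s, y_tr)
    solver_accuracy[solver] = clf.score(X_te_s, y_te)

# Report solvers from best to worst hold-out accuracy
for solver, acc in sorted(solver_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{solver:10s} {acc:.4f}")
```

All five solvers minimize a regularized logistic loss, so large gaps between them usually point to convergence problems rather than a genuinely better model.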

In [58]:
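# Binary target. Note the encoding: businesses with stars > 3.5 get 0 and all others get 1,
# so despite its name, above_average == 1 marks businesses rated 3.5 stars or lower.
df_business_hot["above_average"] = ny.where(df_business_hot["stars"] > 3.5, 0, 1)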
y2 = df_business_hot["above_average"]
y2.head()

1    0
3    1
4    0
5    0
6    1
Name: above_average, dtype: int64

In [59]:
iter_num=0

for train_indices, test_indices in cv_object.split(X,y2): 
    #print(f'{train_indices},{test_indices}')

    # Fold-specific label names keep the hold-out y_train / y_test intact
    X_train = X.iloc[train_indices]
    y_train_cv = y2.iloc[train_indices]
    
    X_test = X.iloc[test_indices]
    y_test_cv = y2.iloc[test_indices]
    
    linear_model.fit(X_train,y_train_cv)  # train object
    y_hat = linear_model.predict(X_test) # get test set predictions

    acc = mt.accuracy_score(y_test_cv,y_hat)
    conf = mt.confusion_matrix(y_test_cv,y_hat)
    print(f"====Iteration binomial {iter_num} ====")
    print(f"accuracy: {acc}" )
    print(f"confusion matrix:\n {conf}")
    iter_num+=1

[12469 94674 54688 ... 73487 63555 32288],[96690 98284 47916 ... 11102 63554 41631]
====Iteration binomial 0 ====
accuracy: 0.6886074682150365
confusion matrix:
 [[7485 2816]
 [3405 6272]]
[31147 66474 77168 ... 22976 52624 13079],[75090 33918 18365 ... 19861 23800   706]
====Iteration binomial 1 ====
accuracy: 0.6905596155771349
confusion matrix:
 [[7637 2582]
 [3600 6159]]
[ 8664 81834  3331 ... 41722 92419 69920],[78553 76033 87061 ... 69360 93809 47401]
====Iteration binomial 2 ====
accuracy: 0.6896586244869356
confusion matrix:
 [[7487 2701]
 [3499 6291]]
[56300 87165 45577 ... 31605 56718 50796],[23934   420 79881 ... 41419 24499  3861]
====Iteration binomial 3 ====
accuracy: 0.6863049354289719
confusion matrix:
 [[7419 2751]
 [3516 6292]]
[73373 86422 76661 ... 10940 46648   966],[40759 50430 85307 ... 77239 81788 25146]
====Iteration binomial 4 ====
accuracy: 0.6944639103013315
confusion matrix:
 [[7608 2640]
 [3464 6266]]


In [38]:
# `linear_model` is bound to the LogisticRegression instance above, so import SGDClassifier directly
from sklearn.linear_model import SGDClassifier

# loss="log" selects logistic regression; scikit-learn 1.1+ names this loss "log_loss"
logistic = SGDClassifier(n_jobs=-1, loss="log")
logistic.fit(x_train,y_train)

y_pred = logistic.predict(x_test)
print('Logistic Accuracy: {:.2f}'.format(mt.accuracy_score(y_test, y_pred)))
print('Logistic Precision: {:.2f}'.format(mt.precision_score(y_test, y_pred, average='micro')))
print('Logistic classification_report: ')
class_report=mt.classification_report(y_test, y_pred, output_dict=True)
pd.DataFrame(class_report)

# Support Vector Machine

## Stochastic Gradient Descent 

In [None]:
import sklearn
import numpy as np
from sklearn import linear_model
import sklearn.metrics as metrics


svm = linear_model.SGDClassifier(n_jobs=-1, loss="hinge")
svm.fit(x_train,y_train)

In [None]:
y_pred = svm.predict(x_test)
print('SVM Accuracy: {:.2f}'.format(metrics.accuracy_score(y_test, y_pred)))
print('SVM Precision: {:.2f}'.format(metrics.precision_score(y_test, y_pred, average='micro')))
print('SVM classification_report: ')
class_report=metrics.classification_report(y_test, y_pred, output_dict=True)
pd.DataFrame(class_report)
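
`SGDClassifier` is sensitive to feature scale, and the cell above fits it on unscaled features. One option is to put a `StandardScaler` in front of it in a pipeline. The sketch below assumes synthetic data with one deliberately inflated column. Whether the Yelp features need scaling depends on their actual ranges, which are not shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=3000, n_features=20, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_demo[:, 0] *= 1000.0  # put one feature on a much larger scale than the rest

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data only, then trains a
# linear SVM (hinge loss) with stochastic gradient descent
svm_sgd = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", random_state=0),
)
svm_sgd.fit(X_tr, y_tr)

scaled_accuracy = svm_sgd.score(X_te, y_te)
print(f"SVM (SGD, scaled) accuracy: {scaled_accuracy:.2f}")
```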

## Linear [too slow]

In [None]:
#https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/
#following this beautiful guide

from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(x_train, y_train)

y_pred = svclassifier.predict(x_test)  # hold-out features that match y_test
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

## POLY [too slow]

In [None]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='poly', degree=8)
svclassifier.fit(x_train, y_train)

y_pred = svclassifier.predict(x_test)  # predict on the test set so y_pred lines up with y_test
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

## Gaussian Kernel (RBF) [too slow]

In [None]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='rbf')
svclassifier.fit(x_train, y_train)

y_pred = svclassifier.predict(x_test)  # predict on the test set so y_pred lines up with y_test
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

## Sigmoid Kernel [too slow]

In [None]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='sigmoid')
svclassifier.fit(x_train, y_train)

y_pred = svclassifier.predict(x_test)  # predict on the test set so y_pred lines up with y_test
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Model Advantages	10

Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.
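
One way to back the efficiency discussion with numbers is to time `fit` for each model on the same data. The sketch below assumes a small synthetic dataset, so the absolute times will not match the Yelp data. Kernel `SVC` fit time grows at least quadratically with the number of samples, which is consistent with the "too slow" notes above.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=50, n_informative=10, random_state=0
)

candidates = {
    "LogisticRegression (liblinear)": LogisticRegression(solver="liblinear"),
    "SGDClassifier (hinge)": SGDClassifier(loss="hinge", random_state=0),
    "SVC (linear kernel)": SVC(kernel="linear"),
}

# Wall-clock training time per model on identical data
fit_seconds = {}
for name, clf in candidates.items():
    start = time.perf_counter()
    clf.fit(X_demo, y_demo)
    fit_seconds[name] = time.perf_counter() - start

for name, seconds in fit_seconds.items():
    print(f"{name:32s} {seconds:.3f} s")
```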

# Interpret Feature Importance	30

Use the weights from logistic regression to interpret the importance of different features for the classification task. Explain your interpretation in detail. Why do you think some variables are more important?
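
A minimal sketch of ranking features by logistic regression weight is below. It assumes standardized synthetic features with hypothetical names (`feature_0`, ...). Weights are only comparable across features when the features share a scale, which is why the sketch standardizes first.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# shuffle=False keeps the 3 informative columns first, followed by noise columns
X_arr, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]  # hypothetical names
X_demo = pd.DataFrame(StandardScaler().fit_transform(X_arr), columns=feature_names)

clf = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# For a binary target, coef_ has one row: one weight per feature
weights = pd.Series(clf.coef_[0], index=feature_names)

# Order by absolute weight; the sign gives the direction of the effect on class 1
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)
```

For a multiclass target, `coef_` has one row per class, so the same ranking would be repeated per row.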

# Interpret Support Vectors	10

Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model— then analyze the support vectors from the subsampled dataset.
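
The subsampling step could look like the sketch below, which draws 1,000 rows without replacement, fits a linear `SVC`, and inspects the support vectors. The data is synthetic, and the subsample size and seed are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=20000, n_features=10, n_informative=4, random_state=0
)

# Draw a random subsample small enough for SVC to train quickly
rng = np.random.default_rng(0)
subset = rng.choice(len(X_demo), size=1000, replace=False)
X_sub = StandardScaler().fit_transform(X_demo[subset])
y_sub = y_demo[subset]

svc = SVC(kernel="linear").fit(X_sub, y_sub)

# support_ holds row positions within the subsample, n_support_ the count per class
sv_labels = y_sub[svc.support_]
sv_share = len(svc.support_) / len(X_sub)

print("support vectors per class:", dict(zip(svc.classes_, svc.n_support_)))
print(f"share of subsample kept as support vectors: {sv_share:.2f}")
```

A large share of support vectors suggests heavily overlapping classes, since every point on or inside the margin becomes a support vector.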