# Create Models	50	

Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use an 80/20 training/testing split for your data). Adjust the parameters of the models to make them more accurate. If the size of your dataset requires stochastic gradient descent, using only a linear kernel is fine. That is, SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe.
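
One possible way to adjust the parameters is a small cross-validated grid search. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the real features, and the names `X_demo`, `y_demo`, the grid values, and the added `StandardScaler` step are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and labels (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

# 80/20 split, stratified so both sets keep the same class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# SGD is sensitive to feature scale, so scale inside the pipeline
pipe = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=0))

# Hypothetical grid: regularization strength and penalty type
param_grid = {
    "sgdclassifier__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "sgdclassifier__penalty": ["l2", "l1"],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: {:.2f}".format(search.best_score_))
print("held-out accuracy: {:.2f}".format(search.score(X_te, y_te)))
```

The same grid should apply to the logistic loss by changing the `loss` argument. Only the training portion is used for the search, so the 20% test set stays untouched until the final score.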

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import plotly.express as px
import os 
import sklearn.model_selection as ms

In [None]:
df_business_eda = pd.read_pickle("df_business_eda.pickle")

In [None]:
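# One-hot encode every object-dtype (categorical) column
categorical_cols=df_business_eda.select_dtypes(include=['object']).columns

# Work on a copy so the EDA frame is left untouched
df_business_hot=df_business_eda.copy()

for col in categorical_cols:
    # dummy_na=True adds an indicator column for missing values
    dummies=pd.get_dummies(df_business_hot[col], dummy_na=True, prefix=col)
    # Swap the original column for its dummy columns, aligned on the index
    df_business_hot=df_business_hot.\
        drop(col,axis=1).\
        merge(
            dummies,
            how='left',
            left_index=True,
            right_index=True
        )

# Fill any remaining missing values with False
df_business_hot=df_business_hot.fillna(False)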

In [None]:
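# Target labels and feature matrix (the target and stars are excluded from the features)
y=df_business_hot.rating_category
x=df_business_hot.drop(["rating_category", "stars"], axis=1)

# 80/20 training/testing split
x_train,x_test,y_train,y_test=ms.train_test_split(x,y,test_size=0.2)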

In [None]:
# Print the shapes of the training and testing sets
print("shape of original dataset:", df_business_hot.shape)
print("shape of input - training set:", x_train.shape)
print("shape of output - training set:", y_train.shape)
print("shape of input - testing set:", x_test.shape)
print("shape of output - testing set:", y_test.shape)

# Logistic Regression

## Stochastic Gradient Descent

In [None]:
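from sklearn import linear_model
import sklearn.metrics as metrics

# Logistic regression trained with SGD. The loss is named "log_loss" in
# scikit-learn >= 1.1; use loss="log" on older versions.
logistic = linear_model.SGDClassifier(n_jobs=-1, loss="log_loss")
logistic.fit(x_train,y_train)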

y_pred = logistic.predict(x_test)
print('Logistic Accuracy: {:.2f}'.format(metrics.accuracy_score(y_test, y_pred)))
print('Logistic Precision: {:.2f}'.format(metrics.precision_score(y_test, y_pred, average='micro')))
print('Logistic classification_report: ')
class_report=metrics.classification_report(y_test, y_pred, output_dict=True)
pd.DataFrame(class_report)

# Support Vector Machine

## Stochastic Gradient Descent

In [None]:
import sklearn
import numpy as np
from sklearn import linear_model
import sklearn.metrics as metrics


svm = linear_model.SGDClassifier(n_jobs=-1, loss="hinge")
svm.fit(x_train,y_train)

In [None]:
y_pred = svm.predict(x_test)
print('SVM Accuracy: {:.2f}'.format(metrics.accuracy_score(y_test, y_pred)))
print('SVM Precision: {:.2f}'.format(metrics.precision_score(y_test, y_pred, average='micro')))
print('SVM classification_report: ')
class_report=metrics.classification_report(y_test, y_pred, output_dict=True)
pd.DataFrame(class_report)

## Linear [too slow]

In [None]:
#https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/
#following this beautiful guide

from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(x_train, y_train)

y_pred = svclassifier.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

## Polynomial Kernel [too slow]

In [None]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='poly', degree=8)
svclassifier.fit(x_train, y_train)

# Predict on the test set so the predictions line up with y_test
y_pred = svclassifier.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

## Gaussian Kernel (RBF) [too slow]

In [None]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='rbf')
svclassifier.fit(x_train, y_train)

# Predict on the test set so the predictions line up with y_test
y_pred = svclassifier.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

## Sigmoid Kernel [too slow]

In [None]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='sigmoid')
svclassifier.fit(x_train, y_train)

# Predict on the test set so the predictions line up with y_test
y_pred = svclassifier.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Model Advantages	10

Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.
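
To ground the comparison of accuracy against training time, one option is to time each `fit` call directly. The sketch below is a minimal illustration on synthetic data, not the notebook's dataset: `X_demo`, `y_demo`, and the `candidates` dictionary are hypothetical names, and the timings it prints depend on the machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=30, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Two ways of fitting a linear SVM: SGD on the hinge loss vs. the SVC solver
candidates = {
    "SGD (hinge loss)": SGDClassifier(loss="hinge", random_state=0),
    "SVC (linear kernel)": SVC(kernel="linear"),
}

results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    # Store (training time in seconds, test-set accuracy)
    results[name] = (fit_seconds, model.score(X_te, y_te))
    print("{}: fit in {:.3f}s, accuracy {:.2f}".format(name, *results[name]))
```

Repeating the loop with a larger `n_samples` gives a rough picture of how each training time grows with dataset size.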

# Interpret Feature Importance	30

Use the weights from logistic regression to interpret the importance of different features for the classification task. Explain your interpretation in detail. Why do you think some variables are more important?
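
A common way to read the weights is to standardize the features first, so the coefficients are on a comparable scale, and then rank them by absolute value. The sketch below assumes a binary target and synthetic data. It uses `LogisticRegression` for brevity, although a fitted `SGDClassifier` exposes `coef_` in the same way. The feature names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
# Hypothetical feature names; in the notebook these would be the column names
feature_names = ["feature_{}".format(i) for i in range(X_demo.shape[1])]

# Standardize so that weight magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# For a binary target, coef_ has shape (1, n_features)
weights = pd.DataFrame({"feature": feature_names, "weight": model.coef_[0]})
weights["abs_weight"] = weights["weight"].abs()
weights = weights.sort_values("abs_weight", ascending=False).reset_index(drop=True)
print(weights)
```

With more than two classes, `coef_` has one row per class, so each row would need to be ranked separately. The sign of a weight shows the direction of the effect on the log-odds; the magnitude is only meaningful relative to other features on the same scale.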

# Interpret Support Vectors	10

Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model, then analyze the support vectors from the subsampled dataset.
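
The subsampling step could look roughly like the sketch below. It is a hedged illustration on synthetic data: `X_demo`, `y_demo`, the subsample size of 500, and the linear kernel are all assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the full dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, n_informative=4, random_state=0
)

# Draw a random subsample without replacement so SVC trains quickly
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X_demo), size=500, replace=False)
X_sub, y_sub = X_demo[sample_idx], y_demo[sample_idx]

svc = SVC(kernel="linear").fit(X_sub, y_sub)

# n_support_ counts support vectors per class; support_ holds their row
# positions within the subsample
print("support vectors per class:", dict(zip(svc.classes_, svc.n_support_)))
print("share of subsample used as support vectors: {:.1%}".format(
    len(svc.support_) / len(X_sub)
))

# Map the support vectors back to their rows in the full dataset
original_rows = sample_idx[svc.support_]
```

Looking up `original_rows` in the original data frame lets the support vectors be compared with the rest of the data, for example by checking their class balance or feature distributions.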