#### Create Models (50 POINTS)
Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use. That is, the SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe. 

#### Model Advantages (10 POINTS)
Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.

#### Interpret Feature Importance (30 POINTS)
Use the weights from logistic regression to interpret the importance of different features for the classification task. Explain your interpretation in detail. Why do you think some variables are more important?

#### Interpret Support Vectors (10 POINTS)
Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model— then analyze the support vectors from the subsampled dataset.

# Mini-Lab

#### Alan Abadzic, John Girard, Eric Laigaie, Garrett Shankel

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("NY_Listings_Validated.csv")

### Create Models

In [4]:
# First, we need to create our classification column. Any rentals with a Review Scores Rating above 90 receive "A", else
# recieve "Not A".

def categorise(row):  
    if row['Review Scores Rating'] > 89:
        return 'A'
    else:
        return 'Not A'
    return 'IDK'

df['Grade'] = df.apply(lambda row: categorise(row), axis=1)

#df['Grade'].value_counts()

A        27589
Not A    16436
Name: Grade, dtype: int64

In [60]:
# Filter to only useful columns
data = df[['Host Response Rate', 'Host Is Superhost', 'Host total listings count', 'City', 'Room type',
          'Accommodates', 'Bathrooms', 'Bedrooms', 'Price', 'Minimum nights', 'Maximum nights', 'Availability 365',
          'Number of reviews', 'Reviews per month', 'Grade']]


# One-hot Encode
city_one_hot = pd.get_dummies(data['City'])
room_one_hot = pd.get_dummies(data['Room type'])

data = data.drop('City',axis = 1)
data = data.drop('Room type',axis = 1)

data = data.join(city_one_hot)
data = data.join(room_one_hot)


# Map boolean to integer
data["Host Is Superhost"] = data["Host Is Superhost"].astype(int)


# Scale Data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

grade = data['Grade']
to_scale = data.drop("Grade", axis = 1)

data = scaler.fit_transform(to_scale)
data = pd.DataFrame(data)
data['Grade'] = grade

In [61]:
# Split into 80/20 train/test split.
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2)

print("Train: " + str(len(train)) + ", Test: " + str(len(test)))

train_y = train['Grade']
train_x = train.drop('Grade', axis=1)
test_y = train['Grade']
test_x = train.drop('Grade', axis=1)

Train: 35220, Test: 8805


In [122]:
# Logistic Regression Model
from sklearn import metrics as mt
from sklearn.linear_model import LogisticRegression

# Testing different penalties with saga
print("Testing penalties with saga solver")
clf = LogisticRegression(max_iter=1000, penalty='none', solver='saga').fit(train_x, train_y)
predictions = clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
print('None: ' + str(acc) + '%')

clf = LogisticRegression(max_iter=1000, penalty='l1', solver='saga').fit(train_x, train_y)
predictions = clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
print('l1: ' + str(acc) + '%')

clf = LogisticRegression(max_iter=1000, penalty='l2', solver='saga').fit(train_x, train_y)
predictions = clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
print('l2: ' + str(acc) + '%')

clf = LogisticRegression(max_iter=1000, penalty='elasticnet', solver='saga', l1_ratio=0.5).fit(train_x, train_y)
predictions = clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
print('Elastic Net: ' + str(acc) + '%' + '\n')

print('Testing C values with Elastic Net penalty')
clf = LogisticRegression(max_iter=1000, penalty='elasticnet', solver='saga', l1_ratio=0.5, C=0.05).fit(train_x, train_y)
predictions = clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
print('0.05: ' + str(acc) + '%')

clf = LogisticRegression(max_iter=1000, penalty='elasticnet', solver='saga', l1_ratio=0.5, C=.5).fit(train_x, train_y)
predictions = clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
print('0.5: ' + str(acc) + '%')

clf = LogisticRegression(max_iter=1000, penalty='elasticnet', solver='saga', l1_ratio=0.5, C=1).fit(train_x, train_y)
predictions = clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
print('1: ' + str(acc) + '%')

clf = LogisticRegression(max_iter=1000, penalty='elasticnet', solver='saga', l1_ratio=0.5, C=100).fit(train_x, train_y)
predictions = clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
print('100: ' + str(acc) + '%')

Testing penalties with saga solver
None: 71.41%
l1: 71.37%
l2: 71.35%
Elastic Net: 71.5%

Testing C values with Elastic Net penalty
0.05: 70.44%
0.5: 71.41%
1: 71.5%
100: 71.4%


In [124]:
lr_clf = LogisticRegression(max_iter=1000, penalty='elasticnet', solver='saga', l1_ratio=0.5).fit(train_x, train_y)
predictions = lr_clf.predict(test_x)
acc = round((mt.accuracy_score(predictions, test_y)) * 100,2)
conf = mt.confusion_matrix(predictions, test_y)


weights = lr_clf.coef_.T # take transpose to make a column vector
variable_names = data.columns
for coef, name in zip(weights,variable_names):
    #print(name, 'has weight of', coef[0])

SyntaxError: unexpected EOF while parsing (<ipython-input-124-aca92513dac2>, line 10)

In [114]:
from sklearn.svm import SVC

# train the model just as before
svm_clf = SVC(C=0.5, kernel='rbf', degree=3, gamma='auto').fit(train_x, train_y)
predictions = svm_clf.predict(test_x)
acc = mt.accuracy_score(predictions, test_y)
print('accuracy:', acc )

accuracy: 0.6311186825667234
[[21992 12912]
 [   80   236]]
