## Airbnb DC Hosting Helper ##

## 5_modeling ##

### Summary ###

In this notebook, I explore different modeling techniques to see which one best fits my data.

Import libraries and read in data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics, preprocessing
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score, auc,
                             classification_report, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

In [2]:
np.random.seed(100)

In [3]:
pd.set_option('display.max_columns', 300)

In [4]:
pd.set_option('display.max_rows', 300)

In [19]:
df = pd.read_csv('../data/cleaned_numerical_df.csv')

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3652 entries, 0 to 3651
Data columns (total 48 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Unnamed: 0                      3652 non-null   int64  
 1   id                              3652 non-null   int64  
 2   name                            3652 non-null   object 
 3   description                     3609 non-null   object 
 4   neighborhood_overview           2796 non-null   object 
 5   host_id                         3652 non-null   int64  
 6   host_about                      2538 non-null   object 
 7   host_response_time              3652 non-null   object 
 8   host_response_rate              3652 non-null   float64
 9   host_acceptance_rate            3652 non-null   float64
 10  host_is_superhost               3652 non-null   int64  
 11  host_has_profile_pic            3652 non-null   int64  
 12  host_identity_verified          36

Drop unnecessary columns for modeling.

In [21]:
df.drop(columns = ['Unnamed: 0', 'id', 'name', 'description', 'neighborhood_overview',
                   'host_id', 'host_about', 'latitude_x', 'longitude_x', 'amenities'
                  ], inplace=True)

Get dummies for categorical columns

In [22]:
df = pd.get_dummies(df,columns = ['host_response_time', 'neighbourhood_cleansed', 'room_type'], drop_first=True)
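
With `drop_first=True`, one category per column is dropped and becomes the baseline that the remaining dummy columns are compared against. A minimal sketch on a toy frame (the `toy` data is made up for illustration and is not taken from the listings file):

```python
import pandas as pd

# Toy frame; the values are illustrative, not rows from the listings data.
toy = pd.DataFrame(
    {'room_type': ['Entire home/apt', 'Private room', 'Shared room', 'Private room']}
)

# drop_first=True drops the first category in sorted order ('Entire home/apt'),
# so a row with zeros in every remaining dummy column belongs to that baseline.
toy_dummies = pd.get_dummies(toy, columns=['room_type'], drop_first=True)
dummy_columns = list(toy_dummies.columns)
print(dummy_columns)
```

This is why the output below has no `room_type_Entire home/apt` column: that category is presumably the baseline for `room_type`.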

In [23]:
df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,availability_30,availability_60,availability_90,availability_365,instant_bookable,calculated_host_listings_count,historic site,museum,metro,music venue,perfomring arts venue,college and university,food,nightlife spot,outdoors and recreation,government building,clothing store,popular,days_being_host,days_since_first_review,days_since_last_review,host_response_time_unknown,host_response_time_within a day,host_response_time_within a few hours,host_response_time_within an hour,"neighbourhood_cleansed_Brookland, Brentwood, Langdon","neighbourhood_cleansed_Capitol Hill, Lincoln Park","neighbourhood_cleansed_Capitol View, Marshall Heights, Benning Heights","neighbourhood_cleansed_Cathedral Heights, McLean Gardens, Glover Park","neighbourhood_cleansed_Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace","neighbourhood_cleansed_Colonial Village, Shepherd Park, North Portal Estates","neighbourhood_cleansed_Columbia Heights, Mt. 
Pleasant, Pleasant Plains, Park View","neighbourhood_cleansed_Congress Heights, Bellevue, Washington Highlands","neighbourhood_cleansed_Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights","neighbourhood_cleansed_Douglas, Shipley Terrace","neighbourhood_cleansed_Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street","neighbourhood_cleansed_Dupont Circle, Connecticut Avenue/K Street","neighbourhood_cleansed_Eastland Gardens, Kenilworth","neighbourhood_cleansed_Edgewood, Bloomingdale, Truxton Circle, Eckington","neighbourhood_cleansed_Fairfax Village, Naylor Gardens, Hillcrest, Summit Park","neighbourhood_cleansed_Friendship Heights, American University Park, Tenleytown","neighbourhood_cleansed_Georgetown, Burleith/Hillandale","neighbourhood_cleansed_Hawthorne, Barnaby Woods, Chevy Chase",neighbourhood_cleansed_Historic Anacostia,"neighbourhood_cleansed_Howard University, Le Droit Park, Cardozo/Shaw","neighbourhood_cleansed_Ivy City, Arboretum, Trinidad, Carver Langston","neighbourhood_cleansed_Kalorama Heights, Adams Morgan, Lanier Heights","neighbourhood_cleansed_Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill","neighbourhood_cleansed_Mayfair, Hillbrook, Mahaning Heights","neighbourhood_cleansed_Near Southeast, Navy Yard","neighbourhood_cleansed_North Cleveland Park, Forest Hills, Van Ness","neighbourhood_cleansed_North Michigan Park, Michigan Park, University Heights","neighbourhood_cleansed_River Terrace, Benning, Greenway, Dupont Park","neighbourhood_cleansed_Shaw, Logan Circle","neighbourhood_cleansed_Sheridan, Barry Farm, Buena Vista","neighbourhood_cleansed_Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point","neighbourhood_cleansed_Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir","neighbourhood_cleansed_Takoma, Brightwood, Manor Park","neighbourhood_cleansed_Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort 
Dupont","neighbourhood_cleansed_Union Station, Stanton Park, Kingman Park","neighbourhood_cleansed_West End, Foggy Bottom, GWU","neighbourhood_cleansed_Woodland/Fort Stanton, Garfield Heights, Knox Hill","neighbourhood_cleansed_Woodridge, Fort Lincoln, Gateway",room_type_Hotel room,room_type_Private room,room_type_Shared room
0,0.8,0.75,0,1,1,1,1.0,1.0,1.0,55.0,2,365,2.0,365.0,1,31,61,336,0,2,1,2,1,0,3,10,25,5,28,21,5,0,4610,2576,180,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,1.0,0.93,0,1,1,2,1.0,1.0,2.0,97.0,7,200,7.0,1125.0,9,20,50,140,0,43,2,1,0,0,1,40,18,4,29,6,0,1,4437,1912,16,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,1.0,0.35,0,1,1,4,1.0,1.0,2.0,185.0,2,180,2.0,180.0,17,47,76,76,0,2,36,28,2,15,46,44,50,50,45,48,47,0,4346,2223,19,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1.0,0,1,1,4,1.0,1.0,3.0,125.0,1,365,1.0,1125.0,12,42,72,347,1,4,5,0,0,1,4,17,48,27,44,24,2,0,4347,3929,23,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,1.0,0,1,1,2,1.5,1.0,1.0,61.0,1,365,1.0,1125.0,19,49,79,354,1,4,5,0,0,1,4,17,48,27,44,24,2,0,4347,2121,28,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


Set testing and training groups

In [24]:
X = df.drop(columns=['popular'])

y = df['popular']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y )
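
As a rough illustration of what `stratify = y` does, the sketch below splits a synthetic, imbalanced target and checks that both halves keep the same share of positives. The toy data and the explicit `random_state` are assumptions added here; the notebook itself relies on the global `np.random.seed(100)`, which makes the split depend on cell execution order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 80 zeros and 20 ones (illustrative only).
X_toy = pd.DataFrame({'feature': np.arange(100)})
y_toy = pd.Series([0] * 80 + [1] * 20, name='popular')

# stratify keeps the class ratio the same in both splits.
# random_state is an addition for this sketch; it fixes the shuffle explicitly.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=100
)

train_share = y_tr.mean()  # share of positives in the training split
test_share = y_te.mean()   # share of positives in the testing split
print(train_share, test_share)
```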

In [25]:
ss = StandardScaler()

X_train_sc = ss.fit_transform(X_train)

X_test_sc = ss.transform(X_test)
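
Fitting the scaler on the training set only, as above, keeps test-set statistics out of the model. If cross-validation is added later, one way to keep the same guarantee inside each fold is to wrap the scaler and the model in a `Pipeline`. This is a sketch on synthetic data, not something the notebook runs; `X_demo`, `y_demo`, and `scaled_logreg` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=100)

# The scaler is re-fit on the training portion of every fold,
# so the held-out portion never influences the mean or standard deviation.
scaled_logreg = Pipeline([
    ('ss', StandardScaler()),
    ('logreg', LogisticRegression()),
])

cv_scores = cross_val_score(scaled_logreg, X_demo, y_demo, cv=5)
print(cv_scores.mean())
```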

In [26]:
def pipe(model):
    """Fit a default-parameter classifier on the scaled training data and print its accuracy."""
    # instantiate the model class with default hyperparameters
    model = model()

    # fit to scaled data
    model.fit(X_train_sc, y_train)

    # print mean accuracy on the training and testing sets
    print(f'{model} training score: {model.score(X_train_sc, y_train)}')
    print(f'{model} testing score: {model.score(X_test_sc, y_test)}')

In [27]:
pipe(LogisticRegression)

LogisticRegression() training score: 0.8357064622124863
LogisticRegression() testing score: 0.8466593647316539


In [28]:
pipe(DecisionTreeClassifier)

DecisionTreeClassifier() training score: 1.0
DecisionTreeClassifier() testing score: 0.8138006571741512


In [29]:
pipe(GradientBoostingClassifier)

GradientBoostingClassifier() training score: 0.9087258123402702
GradientBoostingClassifier() testing score: 0.8630887185104053


In [30]:
pipe(AdaBoostClassifier)

AdaBoostClassifier() training score: 0.8678349762687112
AdaBoostClassifier() testing score: 0.8466593647316539


In [31]:
pipe(RandomForestClassifier)

RandomForestClassifier() training score: 1.0
RandomForestClassifier() testing score: 0.8532311062431545


In [32]:
pipe(SVC)

SVC() training score: 0.8587075575027382
SVC() testing score: 0.8510405257393209


In [47]:
pipe(BernoulliNB)

BernoulliNB() training score: 0.8079591091639284
BernoulliNB() testing score: 0.8324205914567361


In [34]:
pipe(KNeighborsClassifier)

KNeighborsClassifier() training score: 0.8532311062431545
KNeighborsClassifier() testing score: 0.7973713033953997


In [33]:
logreg = LogisticRegression()

logreg.fit(X_train_sc, y_train)

predictions = logreg.predict(X_test_sc)

print(f'training score: {logreg.score(X_train_sc, y_train)}')
print(f'testing score: {logreg.score(X_test_sc, y_test)}')

training score: 0.8357064622124863
testing score: 0.8466593647316539


In [44]:
sorted(set(zip(logreg.coef_[0], X.columns)), reverse=True)

[(1.1061461623865914, 'days_since_first_review'),
 (0.9224258202238735, 'host_is_superhost'),
 (0.5017011279325907, 'accommodates'),
 (0.2910542683249673, 'minimum_nights'),
 (0.28130804748358, 'host_acceptance_rate'),
 (0.18161039229625672, 'host_has_profile_pic'),
 (0.14774008691026486, 'neighbourhood_cleansed_West End, Foggy Bottom, GWU'),
 (0.14772158131632482, 'neighbourhood_cleansed_Brookland, Brentwood, Langdon'),
 (0.14268857522242176,
  'neighbourhood_cleansed_Woodridge, Fort Lincoln, Gateway'),
 (0.10095969126110649, 'maximum_nights'),
 (0.0955933455900523, 'availability_90'),
 (0.09546067200988377, 'host_identity_verified'),
 (0.09519088470296674, 'neighbourhood_cleansed_Eastland Gardens, Kenilworth'),
 (0.07834970059334948, 'host_response_rate'),
 (0.0768194492434986,
  'neighbourhood_cleansed_Mayfair, Hillbrook, Mahaning Heights'),
 (0.07314851560072703,
  'neighbourhood_cleansed_Friendship Heights, American University Park, Tenleytown'),
 (0.0713956343510424, 'college and

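As days_since_first_review increases by 1 scaled unit (one standard deviation, since the features were standardized), the log-odds of a listing being popular increase by about 1.11, holding the other features constant.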

In [45]:
sorted(set(zip(np.exp(logreg.coef_[0]), X.columns ) ), reverse=True)


[(3.0226869742470863, 'days_since_first_review'),
 (2.515384866375245, 'host_is_superhost'),
 (1.651528343424749, 'accommodates'),
 (1.337837184163135, 'minimum_nights'),
 (1.324861661382376, 'host_acceptance_rate'),
 (1.199146905983806, 'host_has_profile_pic'),
 (1.1592115629454387, 'neighbourhood_cleansed_West End, Foggy Bottom, GWU'),
 (1.159190111245453, 'neighbourhood_cleansed_Brookland, Brentwood, Langdon'),
 (1.1533705575608024,
  'neighbourhood_cleansed_Woodridge, Fort Lincoln, Gateway'),
 (1.1062320500458507, 'maximum_nights'),
 (1.1003115264690373, 'availability_90'),
 (1.1001655538830868, 'host_identity_verified'),
 (1.0998687832154215, 'neighbourhood_cleansed_Eastland Gardens, Kenilworth'),
 (1.081500794018528, 'host_response_rate'),
 (1.0798470915815115,
  'neighbourhood_cleansed_Mayfair, Hillbrook, Mahaning Heights'),
 (1.0758903115458327,
  'neighbourhood_cleansed_Friendship Heights, American University Park, Tenleytown'),
 (1.074006055513004, 'college and university'),


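As days_since_first_review increases by 1 scaled unit, the odds of the listing being popular are multiplied by about 3.02, holding the other features constant. This is a change in odds, not in probability.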
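
The 3.02 above is exp(1.106), so it multiplies the odds rather than the probability; how far the probability moves depends on where it starts. A small worked sketch, where the baseline probabilities and the helper name `shift_probability` are illustrative assumptions and not values from the data:

```python
import numpy as np

# Coefficient for days_since_first_review, copied from the output above.
coef = 1.1061461623865914
odds_ratio = np.exp(coef)  # about 3.02

def shift_probability(p, odds_ratio):
    """Probability after multiplying the odds p / (1 - p) by odds_ratio."""
    new_odds = odds_ratio * p / (1 - p)
    return new_odds / (1 + new_odds)

# Hypothetical baseline probabilities, chosen only to show the effect.
for p in (0.1, 0.5, 0.9):
    print(p, round(shift_probability(p, odds_ratio), 3))
```

The same odds ratio gives a different change in probability at each baseline, which is why "3.02 times more likely" overstates what the coefficient says.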

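* Reminder: these are increases in scaled units, which is a caveat for interpreting the results. StandardScaler centers each feature to a mean of 0 and scales it to a standard deviation of 1, so a 1-unit increase here means a one-standard-deviation increase in the original feature.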
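
To read a coefficient in the original units instead, one option is to divide it by the standard deviation that `StandardScaler` stored for that feature (`scale_`). The sketch below checks this on synthetic data; `make_classification` stands in for the listings, and the `_demo` and `_raw` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features, rescaled so each column has a different spread.
X_raw, y_demo = make_classification(n_samples=300, n_features=5, random_state=100)
X_raw = X_raw * np.array([1.0, 10.0, 100.0, 0.5, 1000.0])

ss_demo = StandardScaler()
X_sc = ss_demo.fit_transform(X_raw)
logreg_demo = LogisticRegression().fit(X_sc, y_demo)

# Coefficient per standard deviation -> coefficient per raw unit.
coef_raw = logreg_demo.coef_[0] / ss_demo.scale_
# The intercept absorbs the centering that the scaler applied.
intercept_raw = logreg_demo.intercept_[0] - np.sum(coef_raw * ss_demo.mean_)

# Both parameterizations give the same log-odds for every row.
log_odds_scaled = logreg_demo.decision_function(X_sc)
log_odds_raw = X_raw @ coef_raw + intercept_raw
print(np.allclose(log_odds_scaled, log_odds_raw))
```

For the model above, the analogous step would be `logreg.coef_[0] / ss.scale_`, which gives the change in log-odds per single day rather than per standard deviation of days.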

In [60]:
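# collect the true labels and the logistic regression predictions side by side
test = pd.DataFrame(y_test)

test['preds'] = predictions

# despite the column name, this is the signed error rather than an accuracy:
# 0 = correct, 1 = false negative (popular, predicted not), -1 = false positive
test['accuracy'] = test['popular'] - test['preds']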

In [63]:
# keep only the misclassified rows
new = test[test['accuracy'] != 0]

new.shape

(140, 3)

In [64]:
new

Unnamed: 0,popular,preds,accuracy
956,1,0,1
1180,0,1,-1
1132,1,0,1
955,0,1,-1
867,1,0,1
350,0,1,-1
852,0,1,-1
655,1,0,1
498,1,0,1
2327,1,0,1
