More about the data: http://insideairbnb.com/copenhagen/

In [46]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [26]:
df = pd.read_csv('data/listings_CPH.csv')

In [27]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13805,13806,13807,13808,13809,13810,13811,13812,13813,13814
id,6983,26057,26473,29118,31094,32379,32841,33680,37159,38499,...,638839422729041798,630779770351807878,632474141549608786,646726550705810749,646941499450912133,653494030951422457,647809303952891559,650819220455514228,658612163688161695,648436253362373119
name,Copenhagen 'N Livin',Lovely house - most attractive area,City Centre Townhouse Sleeps 1-10 persons,Best Location in Cool Istedgade,"Beautiful, spacious, central, renovated Penthouse","165 m2 artist flat on Vesterbro, with 2 bathr...",Cozy flat for Adults/Quiet for kids,Best location on Vesterbro/Cph,Unique space on greatest location,0 min. from everything in Cph.,...,"T31B, 1. A 1 bedroom apartment in Rødovre",Skøn lejlighed i Hvidovre,3 Room apartment. 8 min walk from CPH Airport,Ny rummelig lejlighed med tilhørende kat.,Big beautiful and charming apartment,Top-floor Villa Apartment in the Heart of Hell...,Dejligt lille hus med flere hyggekroge ude og ...,Lejlighed i Storkøbenhavn. 13 minutter fra cen...,Cosy apartment with a great view in Copenhagen,Big Bedroom connected with a large living room
host_id,16774,109777,112210,125230,129976,140105,142143,145671,160390,122489,...,424093643,60240725,76731355,54229471,141288846,465189427,4862421,134647873,256903668,141288846
host_name,Simon,Kari,Julia,Nana,Ebbe,Lise,Anders & Maria,Mette,Jeanette,Christina,...,Seneca,Henrik,Jørgen,Morten,Tanja,Martine,Kathrine,Tim,Sam,Tanja
neighbourhood_group,,,,,,,,,,,...,,,,,,,,,,
neighbourhood,Nrrebro,Indre By,Indre By,Vesterbro-Kongens Enghave,Vesterbro-Kongens Enghave,Vesterbro-Kongens Enghave,sterbro,Vesterbro-Kongens Enghave,Indre By,Indre By,...,Vanlse,Valby,Amager st,Brnshj-Husum,Bispebjerg,sterbro,Valby,Valby,Brnshj-Husum,Bispebjerg
latitude,55.68641,55.69307,55.67602,55.67023,55.666602,55.672638,55.71176,55.66631,55.68547,55.684288,...,55.67286,55.6387,55.6309,55.739028,55.730481,55.73284,55.667782,55.659536,55.730971,55.73142
longitude,12.54741,12.57649,12.5754,12.55504,12.555283,12.552493,12.57091,12.54555,12.56543,12.573019,...,12.4561,12.49824,12.64248,12.487433,12.521243,12.57237,12.463285,12.474726,12.487993,12.52177
room_type,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,...,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Entire home/apt,Private room
price,898,2600,3250,725,1954,1280,617,1000,2916,1900,...,490,1000,845,856,1050,1250,450,400,850,420


### Multi Class Classification
Predict whether the price of a rental is ‘low’, ‘medium’ or ‘high’. We again leave it up to you to decide how to define the three classes. For example, ‘low’ could mean that the price of the rental is below the 33rd percentile of prices in the city, ‘medium’ a price between the 33rd and 66th percentiles, and ‘high’ a price above the 66th percentile. You may also use any other thresholds to define the classes. There is no benchmark here: try to build the best classifier you can, considering all the metrics we looked at (F1 score, confusion matrix, etc.).

Let's use the suggested method, with the class boundaries assigned as in the code that follows:
- price <= 33rd percentile is classified as low
- 33rd percentile < price <= 66th percentile is classified as medium
- price > 66th percentile is classified as high

In [28]:
# calculate the 33rd and 66th percentiles of the prices
quantile_66 = df['price'].quantile(0.66)
quantile_33 = df['price'].quantile(0.33)

# prepare the target variable (price class)
df['price_class']=(df['price'].apply(lambda x: 'high' if x > quantile_66 else ('medium' if x > quantile_33 else 'low'))).astype(str)
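The same binning can be written with `pd.cut`, which makes the bin edges explicit. The sketch below is only an illustration: the `prices` series is hypothetical toy data, not the listings file, and the right-closed bins are chosen to match the `>` comparisons in the cell above.

```python
import pandas as pd

# Hypothetical toy prices (an assumption, not read from listings_CPH.csv)
prices = pd.Series([400, 450, 617, 725, 898, 1000, 1280, 1900, 2600, 3250], name="price")

# 33rd and 66th percentiles of the toy prices
q33, q66 = prices.quantile([0.33, 0.66])

# pd.cut uses right-closed bins by default:
# (-inf, q33] -> low, (q33, q66] -> medium, (q66, inf) -> high
price_class = pd.cut(
    prices,
    bins=[-float("inf"), q33, q66, float("inf")],
    labels=["low", "medium", "high"],
)

print(price_class.value_counts().sort_index())
```

Unlike the string column produced by `apply`, the result is an ordered categorical, so the classes keep the order low, medium, high when sorted or tabulated.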

In [29]:
# prepare the data
# choose which columns to use as features in the classifier
featured_columns = ['neighbourhood', 'room_type', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365', 'calculated_host_listings_count']

# one-hot encode the categorical columns and fill missing values with 0
features = pd.get_dummies(df[featured_columns], columns=['neighbourhood', 'room_type']).fillna(0)

# standardize the remaining numeric columns
std_columns = ['minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365', 'calculated_host_listings_count']
std_features = features.copy()
std_features[std_columns] = (std_features[std_columns] - std_features[std_columns].mean()) / std_features[std_columns].std()
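The cell above standardizes with the mean and standard deviation of the whole dataset. A common alternative, sketched here under assumptions, is to estimate those statistics on the training rows only and reuse them on the test rows. The `toy` frame is hypothetical stand-in data. `StandardScaler` divides by the population standard deviation, so its output differs slightly from the pandas `.std()` version above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for two of the numeric feature columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, size=100),
    "number_of_reviews": rng.integers(0, 200, size=100),
})

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # mean and std estimated on the training rows only
test_scaled = scaler.transform(toy_test)        # the same statistics are reused on the test rows

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```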

In [30]:
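# split the data into training and test sets (80/20 split)
# note: the unstandardized `features` are passed here, so `std_features` from the
# previous cell is not used by the models below
x_train, x_test, y_train, y_test = model_selection.train_test_split(features, df['price_class'], test_size=0.2, random_state=1)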

This is a multiclass classification problem, so models that are binary by construction, such as logistic regression and support vector machines, do not apply directly.

However, we can split the multiclass problem into several binary classification problems, each of which can be handled by a logistic regression or SVM model.

First, let's look at logistic regression. Here we can use a method called One-vs-Rest (OvR).
- we split the problem into 3 binary classification problems:
    - High vs [Medium, Low]
    - Medium vs [High, Low]
    - Low vs [High, Medium]

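One binary classifier is trained per split, and the class whose classifier gives the highest score is predicted. This is easily implemented with LogisticRegression by setting the multi_class parameter to 'ovr'.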

In [31]:
lr = LogisticRegression(multi_class='ovr', solver='liblinear')
lr.fit(x_train, y_train)
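The One-vs-Rest scheme described above can also be made explicit with `OneVsRestClassifier`, which wraps any binary estimator. This is a sketch on hypothetical synthetic data from `make_classification`, not the notebook's own setup. Recent scikit-learn releases deprecate the `multi_class` parameter of `LogisticRegression`, so the wrapper may be the more durable spelling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical synthetic 3-class data standing in for the listing features
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr.fit(X, y)

# One fitted binary classifier per class
print(len(ovr.estimators_))

# The predicted class is the one whose binary classifier scores highest
scores = ovr.decision_function(X)
manual_pred = ovr.classes_[scores.argmax(axis=1)]
print((manual_pred == ovr.predict(X)).all())
```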

In [33]:
# evaluate the model on the test set
y_pred_lr = lr.predict(x_test)

print("accuracy score: %f" % accuracy_score(y_test, y_pred_lr))
print("F1-score: %f" % f1_score(y_test, y_pred_lr, average='micro'))
print("confusion matrix:")
print(confusion_matrix(y_test, y_pred_lr))

accuracy score: 0.498371
F1-score: 0.498371
confusion matrix:
[[415 103 317]
 [ 92 460 403]
 [283 188 502]]
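The accuracy and the F1-score above are identical because micro-averaged F1 reduces to accuracy when every sample has exactly one label. Per-class and macro-averaged F1 are more informative, and they can be recovered from the confusion matrix alone. The sketch below assumes scikit-learn's convention: rows are true classes, columns are predictions, and labels are in sorted order (high, low, medium).

```python
import numpy as np

# Confusion matrix reported above for logistic regression.
# Assumption: rows = true class, columns = predicted class, sorted label order.
labels = ["high", "low", "medium"]
cm = np.array([[415, 103, 317],
               [ 92, 460, 403],
               [283, 188, 502]])

tp = np.diag(cm)                  # correctly classified samples per class
precision = tp / cm.sum(axis=0)   # column sums = number of predictions per class
recall = tp / cm.sum(axis=1)      # row sums = number of true samples per class
f1_per_class = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()    # equals micro-averaged F1 in this setting
macro_f1 = f1_per_class.mean()    # unweighted mean of the per-class F1 scores

for name, f1 in zip(labels, f1_per_class):
    print(f"{name}: F1 = {f1:.3f}")
print(f"accuracy / micro F1 = {accuracy:.6f}")
print(f"macro F1 = {macro_f1:.3f}")
```

With the fitted model at hand, `f1_score(y_test, y_pred_lr, average='macro')` or `average=None` should give the same quantities directly.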


In [40]:
svc = SVC(decision_function_shape='ovo')

In [41]:
svc.fit(x_train, y_train)
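According to the scikit-learn documentation, `SVC` always trains one-vs-one classifiers for multiclass problems. `decision_function_shape` only controls the shape of the array returned by `decision_function`. The sketch below illustrates this on hypothetical synthetic data with 4 classes, because with 3 classes both shapes happen to have 3 columns.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical synthetic data with 4 classes, so the two shapes differ
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

svc_ovo = SVC(decision_function_shape="ovo").fit(X, y)
svc_ovr = SVC(decision_function_shape="ovr").fit(X, y)

ovo_shape = svc_ovo.decision_function(X).shape  # one column per pair of classes
ovr_shape = svc_ovr.decision_function(X).shape  # one column per class
print(ovo_shape, ovr_shape)

# With default settings the predicted labels do not depend on this option
same_predictions = (svc_ovo.predict(X) == svc_ovr.predict(X)).all()
print(same_predictions)
```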

In [44]:
# evaluate the model on the test set
y_pred_svc = svc.predict(x_test)

print("accuracy score: %f" % accuracy_score(y_test, y_pred_svc))
print("F1-score: %f" % f1_score(y_test, y_pred_svc, average='micro'))
print("confusion matrix:")
print(confusion_matrix(y_test, y_pred_svc))

accuracy score: 0.412233
F1-score: 0.412233
confusion matrix:
[[378 345 112]
 [222 616 117]
 [293 535 145]]


In [51]:
dt=DecisionTreeClassifier()

In [52]:
dt.fit(x_train, y_train)

In [53]:
# evaluate the model on the test set
y_pred_dt = dt.predict(x_test)

print("accuracy score: %f" % accuracy_score(y_test, y_pred_dt))
print("F1-score: %f" % f1_score(y_test, y_pred_dt, average='micro'))
print("confusion matrix:")
print(confusion_matrix(y_test, y_pred_dt))

accuracy score: 0.452045
F1-score: 0.452045
confusion matrix:
[[386 152 297]
 [196 489 270]
 [308 291 374]]
