# AirBNB NYC Property Clustering - Preprocessing

## Springboard Data Science Track - Third Capstone - Travis Martin

In [88]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

In [89]:
df_NYC = pd.read_csv('D://Springboard/ThirdCapstone/RawData/EDA_df.csv')
df_NYC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16242 entries, 0 to 16241
Data columns (total 89 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       16242 non-null  float64
 1   log_price                16242 non-null  float64
 2   property_type            16242 non-null  object 
 3   accommodates             16242 non-null  float64
 4   bathrooms                16242 non-null  float64
 5   bed_type                 16242 non-null  object 
 6   cancellation_policy      16242 non-null  object 
 7   cleaning_fee             16242 non-null  int64  
 8   host_has_profile_pic     16242 non-null  int64  
 9   host_identity_verified   16242 non-null  int64  
 10  host_response_rate       16242 non-null  float64
 11  instant_bookable         16242 non-null  int64  
 12  latitude                 16242 non-null  float64
 13  longitude                16242 non-null  float64
 14  neighbourhood         

In [90]:
print(df_NYC['neighbourhood'].nunique())
print(df_NYC['zipcode'].nunique())

187
183


In order to run KMeans clustering on our dataset, we need to ensure that all columns are numerical. We'll then scale these measures during the actual modeling step to ensure that those with larger ranges don't have an outsized impact on the model.
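To make the scaling concern concrete, here is a minimal sketch on toy data (not the capstone dataset). It rescales each column to the [0, 1] range by hand, which is what `MinMaxScaler` does with its default settings for columns with a nonzero range. The helper name `min_max_scale` and the toy values are assumptions made for illustration only.

```python
import pandas as pd

# Toy data, not the capstone dataset: two features on very different scales.
toy = pd.DataFrame({
    "accommodates": [1.0, 2.0, 4.0, 8.0],
    "number_of_reviews": [0.0, 50.0, 200.0, 400.0],
})

def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1] (hypothetical helper for illustration)."""
    col_min = frame.min()
    col_range = frame.max() - col_min  # assumes every column has a nonzero range
    return (frame - col_min) / col_range

scaled = min_max_scale(toy)
print(scaled)
```

Before scaling, Euclidean distances between rows would be driven almost entirely by `number_of_reviews`; after scaling, both columns span the same range and contribute comparably.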

From the .info() output above, we see that we still have several "object" datatype columns, which contain our categorical variables. We need to convert these to numeric columns (0/1 dummy variables or integer codes) and remove the originals.

One such column is 'neighbourhood', which contains 187 unique neighbourhood names. That is close to the number of unique 'zipcode' values (183), and both columns indicate geographic position. Rather than increase dimensionality with 187 new dummy columns, we'll drop 'neighbourhood' and rely solely on 'zipcode'.
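One way to sanity-check that the two columns carry overlapping information is to count how many distinct neighbourhoods appear within each zipcode. The sketch below uses invented toy rows rather than the capstone data, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy rows, not the capstone data.
toy = pd.DataFrame({
    "neighbourhood": ["Harlem", "Harlem", "Chelsea", "West Village", "Astoria"],
    "zipcode": [10027, 10027, 10011, 10011, 11102],
})

# Number of distinct neighbourhoods observed within each zipcode.
hoods_per_zip = toy.groupby("zipcode")["neighbourhood"].nunique()

# Share of zipcodes that map to exactly one neighbourhood.
share_single = (hoods_per_zip == 1).mean()

print(hoods_per_zip)
print(share_single)
```

A share close to 1 would suggest that 'zipcode' already pins down most of the neighbourhood information; a much lower share would suggest that dropping 'neighbourhood' loses more detail.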

'id' is a unique identifier and carries no information that is useful for clustering, so we can drop this column as well.

In [91]:
# Drop the redundant geographic column and the row identifier in one call
df_NYC.drop(columns=['neighbourhood', 'id'], inplace=True)

In [92]:
df_NYC['property_type'].value_counts()

Apartment             14187
House                   912
Loft                    339
Townhouse               297
Condominium             267
Other                   110
Timeshare                35
Guest suite              21
Bed & Breakfast          21
Guesthouse                9
Bungalow                  9
Boutique hotel            6
Boat                      6
Serviced apartment        6
Vacation home             5
Villa                     4
In-law                    3
Earth House               1
Yurt                      1
Cabin                     1
Hostel                    1
Chalet                    1
Name: property_type, dtype: int64

In [93]:
# Fold every property type with 35 or fewer listings into the existing 'Other' category
rare_types = ['Timeshare', 'Guest suite', 'Bed & Breakfast', 'Guesthouse', 'Bungalow',
              'Boutique hotel', 'Boat', 'Serviced apartment', 'Vacation home', 'Villa',
              'In-law', 'Earth House', 'Yurt', 'Cabin', 'Hostel', 'Chalet']
df_NYC.loc[df_NYC['property_type'].isin(rare_types), 'property_type'] = 'Other'

df_NYC['property_type'].value_counts()

Apartment      14187
House            912
Loft             339
Townhouse        297
Condominium      267
Other            240
Name: property_type, dtype: int64
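The rare property types were listed by hand here. As a possible alternative, the same kind of grouping can be derived from the counts themselves. The sketch below runs on a toy column, and the cutoff `MIN_COUNT` is a hypothetical value chosen only for the illustration, not a threshold taken from this notebook.

```python
import pandas as pd

# Toy column, not the capstone data.
types = pd.Series(
    ["Apartment"] * 6 + ["House"] * 3 + ["Other", "Boat", "Yurt"],
    name="property_type",
)

MIN_COUNT = 3  # hypothetical cutoff, for illustration only

counts = types.value_counts()
rare = counts[counts < MIN_COUNT].index

# Keep frequent labels as they are; replace rare ones with 'Other'.
collapsed = types.where(~types.isin(rare), "Other")
print(collapsed.value_counts())
```

Deriving the list from `value_counts()` avoids retyping category names if the data is refreshed, at the cost of making the grouping depend on the chosen cutoff.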

In [94]:
df_NYC = pd.get_dummies(data=df_NYC, columns=['property_type', 'borough'], drop_first=True)
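To illustrate what `drop_first=True` does in the call above, here is a small sketch on a toy 'borough' column (the rows are invented). With `drop_first=True`, the first category in sorted order gets no column of its own and is represented by zeros in all remaining dummy columns.

```python
import pandas as pd

# Toy rows, not the capstone data.
toy = pd.DataFrame({"borough": ["Bronx", "Brooklyn", "Manhattan", "Brooklyn"]})

full = pd.get_dummies(toy, columns=["borough"])
reduced = pd.get_dummies(toy, columns=["borough"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Here the 'Bronx' row is all zeros in `reduced`, so one column is saved per encoded variable without losing any information.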

In [95]:
df_NYC['bed_type'].value_counts()

Real Bed         15994
Pull-out Sofa       98
Futon               90
Airbed              38
Couch               22
Name: bed_type, dtype: int64

In [96]:
# Binary flag: 1 for 'Real Bed', 0 for every other bed type
df_NYC['bed_type'] = (df_NYC['bed_type'] == 'Real Bed').astype(int)
df_NYC.rename(columns={"bed_type": "real_bed"}, inplace=True)
df_NYC['real_bed'].value_counts()

1    15994
0      248
Name: real_bed, dtype: int64

In [97]:
df_NYC['cancellation_policy'].value_counts()

strict             8284
flexible           4014
moderate           3934
super_strict_30       9
super_strict_60       1
Name: cancellation_policy, dtype: int64

In [98]:
# Ordinal encoding: flexible (-1) < moderate (0) < strict (1).
# The two rare super_strict levels (10 listings in total) are folded into 'strict'.
policy_codes = {'flexible': -1, 'moderate': 0, 'strict': 1,
                'super_strict_30': 1, 'super_strict_60': 1}
df_NYC['cancellation_policy'] = df_NYC['cancellation_policy'].map(policy_codes).astype(int)

df_NYC['cancellation_policy'].value_counts()

 1    8294
-1    4014
 0    3934
Name: cancellation_policy, dtype: int64

In [99]:
df_NYC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16242 entries, 0 to 16241
Data columns (total 94 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   log_price                  16242 non-null  float64
 1   accommodates               16242 non-null  float64
 2   bathrooms                  16242 non-null  float64
 3   real_bed                   16242 non-null  int32  
 4   cancellation_policy        16242 non-null  int32  
 5   cleaning_fee               16242 non-null  int64  
 6   host_has_profile_pic       16242 non-null  int64  
 7   host_identity_verified     16242 non-null  int64  
 8   host_response_rate         16242 non-null  float64
 9   instant_bookable           16242 non-null  int64  
 10  latitude                   16242 non-null  float64
 11  longitude                  16242 non-null  float64
 12  number_of_reviews          16242 non-null  float64
 13  review_scores_rating       16242 non-null  flo

We now have what we need: every column is of type int or type float. We can proceed to the modeling step, where we will scale our data, run the KMeans clustering model, and evaluate the results.
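Rather than reading the .info() output by eye, the dtype check can also be done programmatically. The sketch below is one possible way to do it, shown on a toy frame; the helper `non_numeric_columns` and the column `example_text_col` are hypothetical names introduced for this illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame, not the capstone data; one column is deliberately left as text.
toy = pd.DataFrame({
    "log_price": [4.5, 5.0],
    "real_bed": [1, 0],
    "example_text_col": ["a", "b"],
})

def non_numeric_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return [col for col in frame.columns if not is_numeric_dtype(frame[col])]

print(non_numeric_columns(toy))
print(non_numeric_columns(toy.drop(columns=["example_text_col"])))
```

An empty list from such a check on the preprocessed frame would confirm that no text columns remain before scaling and clustering.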