# Multivariate K-nearest Neighbors

We'll be working with a dataset from October 3, 2015 of AirBnB listings in Washington, D.C. Each row in the dataset is a specific listing that's available for rent on AirBnB in the Washington, D.C. area. Here are the column descriptions:
- `host_response_rate`: the response rate of the host 
- `host_acceptance_rate`: share of requests to the host that convert to rentals
- `host_listings_count`: number of other listings the host has 
- `latitude`: latitude component of the geographic coordinates
- `longitude`: longitude component of the geographic coordinates
- `city`: the city where the living space is located
- `zipcode`: the zip code where the living space is located
- `state`: the state where the living space is located
- `accommodates`: the number of guests the rental can accommodate 
- `room_type`: the type of living space (Private room, Shared room or Entire home/apt)
- `bedrooms`: number of bedrooms included in the rental 
- `bathrooms`: number of bathrooms included in the rental 
- `beds`: number of beds included in the rental 
- `price`: nightly price for the rental 
- `cleaning_fee`: additional fee used for cleaning the living space after the guest leaves 
- `security_deposit`: refundable security deposit, in case of damages 
- `minimum_nights`: minimum number of nights a guest can stay for the rental 
- `maximum_nights`: maximum number of nights a guest can stay for the rental 
- `number_of_reviews`: number of reviews that previous guests have left

In [19]:
import pandas as pd
import numpy as np
from scipy.spatial import distance

In [7]:
dc_listings = pd.read_csv('dc_airbnb.csv')

### Exploring the Data

In [8]:
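# Shuffle the rows reproducibly
np.random.seed(1)
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
# Strip thousands separators and dollar signs as literal characters, then convert to float
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')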
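
One thing to watch when stripping the dollar sign: `$` is also a regular-expression anchor, and whether `Series.str.replace` treats its pattern as a regex by default depends on the pandas version. The sketch below uses only the standard library and a made-up price string (not a row from the dataset) to show how regex and literal replacement differ.

```python
import re

# Hypothetical value in the same format as the `price` column.
raw_price = "$1,200.00"

# As a regex, a bare "$" matches the end of the string, so nothing visible is removed.
anchored = re.sub("$", "", raw_price)

# Plain string replacement treats "," and "$" as literal characters.
cleaned = raw_price.replace(",", "").replace("$", "")
price = float(cleaned)

print(anchored)  # $1,200.00
print(price)     # 1200.0
```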

In [9]:
dc_listings.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
574,100%,100%,1,2,Private room,1.0,1.0,1.0,125.0,,$300.00,1,4,149,38.913548,-77.031981,Washington,20009,DC
1593,87%,100%,2,2,Private room,1.0,1.5,1.0,85.0,$15.00,,1,30,49,38.953431,-77.030695,Washington,20011,DC
3091,100%,,1,1,Private room,1.0,0.5,1.0,50.0,,,1,1125,1,38.933491,-77.029679,Washington,20010,DC
420,58%,51%,480,2,Entire home/apt,1.0,1.0,1.0,209.0,$150.00,,4,730,2,38.904054,-77.051991,Washington,20037,DC
808,100%,95%,3,12,Entire home/apt,5.0,2.0,5.0,215.0,$135.00,$100.00,2,1825,34,38.906118,-76.988873,Washington,20002,DC


In [10]:
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), object(8)

### Removing features

We remove the "room_type", "city", and "state" columns, which contain non-numerical values.

We remove the "latitude", "longitude", and "zipcode" columns, which contain numerical but non-ordinal values.

We also remove the "host_response_rate", "host_acceptance_rate", and "host_listings_count" columns, as they do not directly describe the living space or the listing itself.

In [12]:
cols_to_drop = ['room_type','city','state','latitude','longitude','zipcode','host_response_rate','host_acceptance_rate','host_listings_count']
dc_listings = dc_listings.drop(labels = cols_to_drop, axis=1)

In [13]:
dc_listings.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews
574,2,1.0,1.0,1.0,125.0,,$300.00,1,4,149
1593,2,1.0,1.5,1.0,85.0,$15.00,,1,30,49
3091,1,1.0,0.5,1.0,50.0,,,1,1125,1
420,2,1.0,1.0,1.0,209.0,$150.00,,4,730,2
808,12,5.0,2.0,5.0,215.0,$135.00,$100.00,2,1825,34


### Handling missing values

In [14]:
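# cleaning_fee and security_deposit are missing for a large share of rows, so drop both columns
dc_listings = dc_listings.drop(labels=['cleaning_fee','security_deposit'],axis=1)
# Drop the few remaining rows that still contain a missing value
dc_listings = dc_listings.dropna(axis=0)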

In [17]:
dc_listings.isnull().sum()

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64

In [15]:
dc_listings.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
574,2,1.0,1.0,1.0,125.0,1,4,149
1593,2,1.0,1.5,1.0,85.0,1,30,49
3091,1,1.0,0.5,1.0,50.0,1,1125,1
420,2,1.0,1.0,1.0,209.0,4,730,2
808,12,5.0,2.0,5.0,215.0,2,1825,34


In [16]:
dc_listings.shape

(3671, 8)

### Normalizing columns

In [18]:
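# Standardize every column to mean 0 and standard deviation 1
normalized_listings = (dc_listings - dc_listings.mean())/(dc_listings.std())
# Keep the target column in its original units (dollars)
normalized_listings['price'] = dc_listings['price']
normalized_listings.head(5)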

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
574,-0.596544,-0.249467,-0.439151,-0.546858,125.0,-0.341375,-0.016604,4.57965
1593,-0.596544,-0.249467,0.412923,-0.546858,85.0,-0.341375,-0.016603,1.159275
3091,-1.095499,-0.249467,-1.291226,-0.546858,50.0,-0.341375,-0.016573,-0.482505
420,-0.596544,-0.249467,-0.439151,-0.546858,209.0,0.487635,-0.016584,-0.448301
808,4.393004,4.507903,1.264998,2.829956,215.0,-0.065038,-0.016553,0.646219
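
To make the normalization step concrete, here is a minimal sketch of the same z-score calculation on a single column, using only the standard library. The five values are illustrative, not the full column, so the resulting scores will not match the table above. `statistics.stdev` is used because it computes the sample standard deviation, which is also what `DataFrame.std()` does by default.

```python
from statistics import mean, stdev

# Hypothetical values for a single column such as `accommodates`.
accommodates = [2, 2, 1, 2, 12]

col_mean = mean(accommodates)
col_std = stdev(accommodates)  # sample standard deviation (divides by n - 1)

# Subtract the mean and divide by the standard deviation, value by value.
z_scores = [(x - col_mean) / col_std for x in accommodates]

print([round(z, 3) for z in z_scores])
# After scaling, the column has mean 0 and standard deviation 1 (up to rounding).
print(round(mean(z_scores), 6), round(stdev(z_scores), 6))
```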


### Multivariate Euclidean distance

In [20]:
first = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth = normalized_listings.iloc[4][['accommodates', 'bathrooms']]
first_fifth_distance = distance.euclidean(first, fifth)
first_fifth_distance

5.272543124668404
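
As a sanity check, the same distance can be worked out by hand: square the difference in each feature, sum the squares, and take the square root. The sketch below does this with the standard library, using the rounded `accommodates` and `bathrooms` values shown for the first and fifth rows in the normalized table, so it agrees with the result above only to about four decimal places.

```python
import math

# Normalized (accommodates, bathrooms) values copied from the head() output.
first = (-0.596544, -0.439151)
fifth = (4.393004, 1.264998)

# Squared difference per feature, then the square root of their sum.
squared_diffs = [(a - b) ** 2 for a, b in zip(first, fifth)]
manual_distance = math.sqrt(sum(squared_diffs))

print(round(manual_distance, 4))  # 5.2725
```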

### Fitting a model and making predictions

In [25]:
from sklearn.neighbors import KNeighborsRegressor

# Instantiate the machine learning model
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

In [23]:
# Split full dataset into train and test sets

train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]

In [26]:
features = train_df[['accommodates','bathrooms']]
target = train_df['price']

# Fit the model to the data
knn.fit(features, target)

KNeighborsRegressor(algorithm='brute', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')

In [31]:
# Use the model to make predictions
predictions = knn.predict(test_df[['accommodates','bathrooms']])
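
For intuition about what the regressor does with `weights='uniform'` and Euclidean distance, here is a rough sketch of the idea: find the k training rows closest to the query and average their prices. The function name `knn_predict` and the toy rows are hypothetical. This is not how scikit-learn is implemented internally, and it may break ties between equally distant rows differently.

```python
import math

def knn_predict(train_features, train_prices, query, k=5):
    """Average the prices of the k training rows closest to `query`."""
    distances = [
        (math.dist(row, query), price)
        for row, price in zip(train_features, train_prices)
    ]
    # Sort by distance only; rows at equal distance keep their original order.
    distances.sort(key=lambda pair: pair[0])
    nearest_prices = [price for _, price in distances[:k]]
    return sum(nearest_prices) / len(nearest_prices)

# Hypothetical normalized (accommodates, bathrooms) rows and nightly prices.
train_features = [(-0.6, -0.4), (-0.6, 0.4), (-1.1, -1.3),
                  (-0.6, -0.4), (4.4, 1.3), (0.4, -0.4)]
train_prices = [125.0, 85.0, 50.0, 209.0, 215.0, 150.0]

prediction = knn_predict(train_features, train_prices, query=(-0.6, -0.4), k=3)
print(prediction)
```

With `k=3`, the query matches two training rows exactly and the third neighbor is the row that differs only in `bathrooms`, so the prediction is the mean of those three prices.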

### Calculating MSE using Scikit-Learn

In [32]:
from sklearn.metrics import mean_squared_error

In [35]:
two_features_mse = mean_squared_error(test_df['price'], predictions)
two_features_mse

15184.425164960181

In [37]:
two_features_rmse = two_features_mse**(1/2)
two_features_rmse

123.2250995737483
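
For reference, MSE is the mean of the squared differences between actual and predicted values, and RMSE is its square root, which puts the error back in the units of the target (dollars here). A minimal sketch with made-up prices, using only the standard library:

```python
import math

# Hypothetical actual and predicted nightly prices.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 230.0]

# Square each error so positive and negative misses do not cancel out.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(mse, rmse)
```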

### Training the model using more features

In [39]:
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train_df[features], train_df['price'])
four_predictions = knn.predict(test_df[features])

In [40]:
four_mse = mean_squared_error(test_df['price'], four_predictions)
four_mse

14044.065665529011

In [41]:
four_rmse = four_mse ** (1/2)
four_rmse

118.50766078836006

### Training the model using all the features 

In [42]:
features = train_df.columns.tolist()
features.remove('price')
knn.fit(train_df[features], train_df['price'])
all_features_predictions = knn.predict(test_df[features])

In [43]:
all_features_mse = mean_squared_error(test_df['price'], all_features_predictions)
all_features_mse

15392.625392491465

In [44]:
all_features_rmse = all_features_mse ** (1/2)
all_features_rmse

124.06701976146387
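
To compare the three runs side by side, the RMSE values reported above can be collected in one place. The dictionary below simply restates those outputs, rounded to two decimals. On this particular train/test split the four-feature model has the lowest error, and using every column does slightly worse than the two-feature model.

```python
# RMSE values copied from the outputs above, rounded to two decimals.
rmse_by_feature_set = {
    "two features": 123.23,
    "four features": 118.51,
    "all features": 124.07,
}

# Pick the feature set with the smallest RMSE.
best = min(rmse_by_feature_set, key=rmse_by_feature_set.get)

# Print the runs from lowest to highest error.
for name, rmse in sorted(rmse_by_feature_set.items(), key=lambda item: item[1]):
    print(f"{name}: {rmse:.2f}")
print("lowest RMSE:", best)  # lowest RMSE: four features
```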