# Introduction to K-Nearest Neighbors

We'll be working with a dataset from October 3, 2015 on AirBnB listings from Washington, D.C. Each row in the dataset is a specific listing that's available for rent on AirBnB in the Washington, D.C. area. Here are the column descriptions:
- `host_response_rate`: the response rate of the host 
- `host_acceptance_rate`: percentage of requests to the host that convert to rentals
- `host_listings_count`: number of other listings the host has 
- `latitude`: latitude component of the geographic coordinates
- `longitude`: longitude component of the geographic coordinates
- `city`: the city in which the living space is located
- `zipcode`: the zip code in which the living space is located
- `state`: the state in which the living space is located
- `accommodates`: the number of guests the rental can accommodate 
- `room_type`: the type of living space (Private room, Shared room or Entire home/apt)
- `bedrooms`: number of bedrooms included in the rental 
- `bathrooms`: number of bathrooms included in the rental 
- `beds`: number of beds included in the rental 
- `price`: nightly price for the rental 
- `cleaning_fee`: additional fee used for cleaning the living space after the guest leaves 
- `security_deposit`: refundable security deposit, in case of damages 
- `minimum_nights`: minimum number of nights a guest can stay for the rental 
- `maximum_nights`: maximum number of nights a guest can stay for the rental 
- `number_of_reviews`: number of reviews that previous guests have left

In [7]:
import pandas as pd
import numpy as np

In [2]:
dc_listings=pd.read_csv('dc_airbnb.csv')

### Exploring the Data

In [5]:
dc_listings.iloc[0]

host_response_rate                  92%
host_acceptance_rate                91%
host_listings_count                  26
accommodates                          4
room_type               Entire home/apt
bedrooms                              1
bathrooms                             1
beds                                  2
price                           $160.00
cleaning_fee                    $115.00
security_deposit                $100.00
minimum_nights                        1
maximum_nights                     1125
number_of_reviews                     0
latitude                          38.89
longitude                      -77.0028
city                         Washington
zipcode                           20003
state                                DC
Name: 0, dtype: object

In [6]:
dc_listings.columns

Index(['host_response_rate', 'host_acceptance_rate', 'host_listings_count',
       'accommodates', 'room_type', 'bedrooms', 'bathrooms', 'beds', 'price',
       'cleaning_fee', 'security_deposit', 'minimum_nights', 'maximum_nights',
       'number_of_reviews', 'latitude', 'longitude', 'city', 'zipcode',
       'state'],
      dtype='object')

### Euclidean Distance

In [8]:
# Suppose our own living space accommodates 3 people;
# compare it with the first listing in the dataset
first_distance = np.abs(3 - dc_listings.iloc[0]['accommodates'])
first_distance

1
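
The absolute difference computed above is the one-feature case of Euclidean distance. As a reminder of the general definition (the symbols $q$, $p$ and $n$ are introduced here for illustration and do not appear in the notebook), for two observations $q$ and $p$ described by $n$ numeric features:

$$
d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
$$

With a single feature, `accommodates`, this reduces to

$$
d(q, p) = \sqrt{(q_1 - p_1)^2} = |q_1 - p_1|
$$

so for our 3-guest space and the first listing, which accommodates 4, the distance is $|3 - 4| = 1$.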

### Finding the distance for all observations

In [10]:
def distance(row):
    # Absolute difference between our 3-guest space and this listing's capacity
    return np.abs(3 - row['accommodates'])

In [11]:
dc_listings['distance'] = dc_listings.apply(distance, axis=1)

In [12]:
dc_listings['distance'].value_counts()

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64

In [14]:
dc_listings[dc_listings["distance"] == 0]["accommodates"].head(10)

26    3
34    3
36    3
40    3
44    3
45    3
48    3
65    3
66    3
71    3
Name: accommodates, dtype: int64
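
The row-wise `apply` call works, but the same distances can also be computed on the whole column at once. The following is a minimal sketch on a small made-up frame (`toy_listings` is hypothetical and stands in for `dc_listings`), showing that the two approaches give the same values:

```python
import pandas as pd

# Hypothetical toy data standing in for dc_listings (not the real dataset).
toy_listings = pd.DataFrame({"accommodates": [4, 3, 1, 6, 3]})

# Row-wise version, mirroring the apply-based approach used in the notebook.
row_wise = toy_listings.apply(lambda row: abs(3 - row["accommodates"]), axis=1)

# Vectorized version: subtract on the whole column, then take absolute values.
vectorized = (toy_listings["accommodates"] - 3).abs()

print(vectorized.tolist())                       # [1, 0, 2, 3, 0]
print(row_wise.tolist() == vectorized.tolist())  # True
```

On a dataset of a few thousand rows either form is fine; the vectorized form tends to matter more as the data grows.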

### Randomizing and sorting

In [19]:
np.random.permutation(10)

array([4, 9, 2, 3, 8, 1, 0, 6, 5, 7])

In [16]:
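np.random.seed(1)  # fix the seed so the shuffle is reproducible

# Shuffle the rows first so that listings tied on distance are not kept in file order
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
dc_listings = dc_listings.sort_values('distance')
dc_listings['distance'].head()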

577     0
2166    0
3631    0
71      0
1011    0
Name: distance, dtype: int64

In [17]:
dc_listings.head()

| | host_response_rate | host_acceptance_rate | host_listings_count | accommodates | room_type | bedrooms | bathrooms | beds | price | cleaning_fee | security_deposit | minimum_nights | maximum_nights | number_of_reviews | latitude | longitude | city | zipcode | state | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 577 | 98% | 52% | 49 | 3 | Private room | 1.0 | 1.0 | 2.0 | $185.00 | | | 2 | 14 | 1 | 38.908356 | -77.028146 | Washington | 20005 | DC | 0 |
| 2166 | 100% | 89% | 2 | 3 | Entire home/apt | 1.0 | 1.0 | 1.0 | $180.00 | | $100.00 | 1 | 14 | 10 | 38.905808 | -77.000012 | Washington | 20002 | DC | 0 |
| 3631 | 98% | 52% | 49 | 3 | Entire home/apt | 1.0 | 1.0 | 2.0 | $175.00 | | | 3 | 14 | 1 | 38.889065 | -76.993576 | Washington | 20003 | DC | 0 |
| 71 | 100% | 94% | 1 | 3 | Entire home/apt | 1.0 | 1.0 | 1.0 | $128.00 | $40.00 | | 1 | 1125 | 9 | 38.87996 | -77.006491 | Washington | 20003 | DC | 0 |
| 1011 | | | 1 | 3 | Entire home/apt | 0.0 | 1.0 | 1.0 | $115.00 | | | 1 | 1125 | 0 | 38.907382 | -77.035075 | Washington | 20005 | DC | 0 |
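
Many listings share the same distance, so the shuffle decides which of the tied rows reach the top after sorting. The sketch below illustrates the shuffle-then-sort pattern on a made-up frame (`toy` is hypothetical). It uses `DataFrame.sample(frac=1)` for the shuffle, which is an alternative to the `np.random.permutation` indexing used in the notebook, not the notebook's own method:

```python
import pandas as pd

# Hypothetical toy frame (not the real dataset): three listings tie at distance 0.
toy = pd.DataFrame(
    {"distance": [0, 2, 0, 1, 0], "price": [185.0, 90.0, 180.0, 140.0, 175.0]}
)

# sample(frac=1) returns every row in random order; random_state makes it repeatable.
shuffled = toy.sample(frac=1, random_state=1)

# Sorting by distance groups the ties together. Which tied rows come first
# depends on the shuffled order rather than on their position in the file.
ranked = shuffled.sort_values("distance")

print(ranked)
```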


### Cleaning the price column

In [20]:
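# Remove thousands separators and dollar signs, then convert to float.
# regex=False treats '$' as a literal character, not an end-of-string anchor.
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollar = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollar.astype('float')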

In [21]:
dc_listings['price'].head()

577     185.0
2166    180.0
3631    175.0
71      128.0
1011    115.0
Name: price, dtype: float64
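
The two replacements can also be folded into a single step with a regular-expression character class. This is a minimal sketch on made-up price strings (`raw_prices` is hypothetical), offered as an alternative rather than as the notebook's approach:

```python
import pandas as pd

# Hypothetical price strings in the same "$1,234.00" style as the dataset.
raw_prices = pd.Series(["$160.00", "$1,250.00", "$85.00"])

# Inside a character class, "$" is a literal dollar sign, so "[$,]"
# matches either a dollar sign or a comma.
cleaned = raw_prices.str.replace(r"[$,]", "", regex=True).astype(float)

print(cleaned.tolist())  # [160.0, 1250.0, 85.0]
```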

### Finding the mean

In [22]:
mean_price = dc_listings.iloc[0:5]['price'].mean()
mean_price

156.6

Based on the average price of five other listings that accommodate 3 people, we should charge $156.60 per night for a guest to stay at our living space.
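
The figure comes from averaging the five prices at the top of the sorted listings:

$$
\frac{185 + 180 + 175 + 128 + 115}{5} = \frac{783}{5} = 156.6
$$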

### Writing a function to make predictions

In [24]:
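# Reload the data, clean the price column, and shuffle the rows
dc_listings = pd.read_csv('dc_airbnb.csv')
# regex=False treats ',' and '$' as literal characters
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]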

In [25]:
def predict_price(new_listing):
    # new_listing: number of guests the new living space accommodates
    dc_listing_copy = dc_listings.copy()
    dc_listing_copy['distance'] = dc_listing_copy['accommodates'].apply(lambda x: np.abs(x - new_listing))
    dc_listing_copy = dc_listing_copy.sort_values('distance')
    # Predict with the mean price of the 5 nearest listings
    predicted_price = dc_listing_copy.iloc[0:5]['price'].mean()

    return predicted_price

In [26]:
acc_one = predict_price(1)
acc_one

62.2

In [27]:
acc_two = predict_price(2)
acc_two

98.4

In [28]:
acc_four = predict_price(4)
acc_four

154.8
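
`predict_price` reads the global `dc_listings` and always averages 5 neighbors. One possible generalization, sketched here under assumptions, takes the frame and the number of neighbors as arguments. The name `predict_price_knn`, its parameters `k` and `seed`, and the `toy` frame are all hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd


def predict_price_knn(listings, accommodates, k=5, seed=1):
    """Hypothetical helper: mean price of the k listings closest in `accommodates`."""
    # Shuffle so that listings tied on distance are not taken in file order.
    rng = np.random.default_rng(seed)
    shuffled = listings.iloc[rng.permutation(len(listings))]
    # Distance on a single feature is the absolute difference.
    ranked = shuffled.assign(
        distance=(shuffled["accommodates"] - accommodates).abs()
    ).sort_values("distance", kind="mergesort")  # stable sort keeps the shuffled tie order
    return ranked["price"].head(k).mean()


# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame(
    {
        "accommodates": [1, 2, 2, 3, 4, 4, 6],
        "price": [50, 80, 90, 120, 150, 170, 300],
    }
)

print(predict_price_knn(toy, 4, k=2))  # 160.0 (the two listings that accommodate 4)
print(predict_price_knn(toy, 1, k=1))  # 50.0
```

Passing the frame explicitly makes the function easier to test, and exposing `k` makes it possible to compare predictions for different neighborhood sizes.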