<center><h1>Introduction to K-Nearest Neighbors</h1></center>

## 1. Introduction. 

At its core, data science helps us make sense of the massive world of information all around us — a world that's far too complex to study directly by ourselves. Data is the record of everything that's going on and what we should learn from it. The real value of all this information is what it means.

Machine learning helps us discover patterns in data, which is where meaning lives. When we can see what the data means, we can make predictions about the future.

In this lesson, we'll explore machine learning with a technique called "K-Nearest Neighbors."

We'll use a dataset of AirBnB rental rates to identify similar rates in one area for competing AirBnB units and make predictions for ideal rates to maximize profit. You'll need to be comfortable programming in Python, and you'll need to be familiar with the NumPy and pandas libraries.

Here are a few takeaways you can expect from this lesson:

- The basics of the machine learning workflow
- How the K-Nearest Neighbors algorithm works
- The role of Euclidean distance in machine learning

Now, let's get to know our dataset.


## 2. Introduction to the data. 

While AirBnB doesn't release any data on the listings in their marketplace, a separate group named Inside AirBnB has extracted data on a sample of the listings for many of the major cities on the website.

In this lesson, we'll be working with one of their datasets. To make the dataset less cumbersome to work with, we've removed many of the columns in the original dataset and renamed the file as dc_airbnb.csv. Here are the columns we kept:

- `host_response_rate`: the response rate of the host
- `host_acceptance_rate`: percentage of requests to the host that convert to rentals
- `host_listings_count`: number of the host's other listings
- `latitude`: latitude of the geographic coordinates
- `longitude`: longitude of the geographic coordinates
- `city`: the city of the rental
- `zipcode`: the zip code of the rental
- `state`: the state of the rental
- `accommodates`: the number of guests the rental can accommodate
- `room_type`: the type of rental (Private room, Shared room, or Entire home/apt)
- `bedrooms`: number of bedrooms included in the rental
- `bathrooms`: number of bathrooms included in the rental
- `beds`: number of beds included in the rental
- `price`: nightly price for the rental
- `cleaning_fee`: additional fee for cleaning the rental after the guest leaves
- `security_deposit`: refundable security deposit, in case of damages
- `minimum_nights`: minimum number of nights a guest can stay at the rental
- `maximum_nights`: maximum number of nights a guest can stay at the rental
- `number_of_reviews`: number of reviews that previous guests have left

Let's read the dataset into pandas and become more familiar with it.


### Exercise

- Write code that does the following:
    - Read dc_airbnb.csv into a DataFrame named dc_listings.
    - Use the print function to display the first row in dc_listings.


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
dc_listings = pd.read_csv("dc_airbnb.csv")

In [2]:
dc_listings

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3718,100%,60%,1,4,Entire home/apt,1.0,1.0,2.0,$135.00,$45.00,$400.00,3,60,19,38.885492,-76.987765,Washington,20003,DC
3719,100%,50%,1,2,Private room,1.0,2.0,1.0,$79.00,,,3,365,36,38.889401,-76.986646,Washington,20003,DC
3720,100%,100%,2,6,Entire home/apt,2.0,1.0,3.0,$275.00,$100.00,$500.00,2,2147483647,12,38.889533,-77.001010,Washington,20003,DC
3721,88%,100%,1,2,Entire home/apt,1.0,1.0,1.0,$179.00,$25.00,,2,21,48,38.890815,-77.002283,Washington,20002,DC


In [12]:
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   host_response_rate    3289 non-null   object 
 1   host_acceptance_rate  3109 non-null   object 
 2   host_listings_count   3723 non-null   int64  
 3   accommodates          3723 non-null   int64  
 4   room_type             3723 non-null   object 
 5   bedrooms              3702 non-null   float64
 6   bathrooms             3696 non-null   float64
 7   beds                  3712 non-null   float64
 8   price                 3723 non-null   object 
 9   cleaning_fee          2335 non-null   object 
 10  security_deposit      1426 non-null   object 
 11  minimum_nights        3723 non-null   int64  
 12  maximum_nights        3723 non-null   int64  
 13  number_of_reviews     3723 non-null   int64  
 14  latitude              3723 non-null   float64
 15  longitude            

## 3. K-nearest neighbors. 

Here's the strategy we wanted to use:

- Find a few similar listings
- Calculate the average nightly rental price of these listings
- Set the average price as the price for our listing

The k-nearest neighbors algorithm is similar to this strategy. Here's an overview:

<img src="figs/3.1-m139.svg"/>

There are two things we need to unpack in more detail:

- The similarity metric
- How to choose the k value

In this lesson, we'll define what similarity metric we're going to use. Then, we'll implement the k-nearest neighbors algorithm and use it to suggest a price for a new, unpriced listing. We'll use a k value of 5 in this lesson.
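As a rough preview of where we're headed, the strategy can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `knn_predict` and the toy values are hypothetical (they are not taken from `dc_airbnb.csv`), and the absolute difference stands in for the similarity metric in the one-feature case.

```python
import numpy as np

def knn_predict(feature_values, target_values, query, k=5):
    """Average the targets of the k observations closest to `query`."""
    feature_values = np.asarray(feature_values, dtype=float)
    target_values = np.asarray(target_values, dtype=float)
    # One-feature distance: absolute difference from the query value.
    distances = np.abs(feature_values - query)
    # Positions of the k smallest distances (stable sort keeps input order for ties).
    nearest = np.argsort(distances, kind="stable")[:k]
    return target_values[nearest].mean()

# Hypothetical toy data: number of guests accommodated and nightly price.
toy_accommodates = [2, 3, 3, 4, 6, 3, 2, 8]
toy_prices = [80.0, 100.0, 120.0, 150.0, 300.0, 110.0, 90.0, 400.0]

suggested = knn_predict(toy_accommodates, toy_prices, query=3, k=5)
print(suggested)
```

The rest of the lesson builds the same idea step by step with pandas on the real dataset.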


## 4. Euclidean Distance. 

The similarity metric works by comparing a fixed set of numerical features (another word for attributes) between two observations, or living spaces in our case. When trying to predict a continuous value, like price, a commonly used similarity metric is Euclidean distance. Here's the general formula for Euclidean distance:

\begin{equation}
d = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \ldots + (q_n-p_n)^2}
\end{equation}

where $q_1$ to $q_n$ represent the feature values for one observation and $p_1$ to $p_n$ represent the feature values for the other observation. Here's a diagram that breaks down the Euclidean distance between the first two observations in the dataset using only the `host_listings_count`, `accommodates`, `bedrooms`, `bathrooms`, and `beds` columns:

![fig1](figs/4.1-m139.svg)

![fig2](figs/4.2-m139.svg)
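The same calculation can be reproduced in NumPy as a quick check. The sketch below copies the five feature values for the first two rows from the DataFrame output shown earlier; the variable names are only illustrative.

```python
import numpy as np

# Feature order: host_listings_count, accommodates, bedrooms, bathrooms, beds
first_listing = np.array([26, 4, 1.0, 1.0, 2.0])   # row 0 of dc_listings
second_listing = np.array([1, 6, 3.0, 3.0, 3.0])   # row 1 of dc_listings

# Square each per-feature difference, sum them, then take the square root.
squared_diffs = (first_listing - second_listing) ** 2
distance = np.sqrt(squared_diffs.sum())

print(squared_diffs)
print(distance)
```

The squared differences sum to 638, so the distance is $\sqrt{638}$, roughly 25.26. Notice that `host_listings_count` contributes 625 of that total, so it dominates the result.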


In this lesson, we'll use just one feature to keep things simple as you become familiar with the machine learning workflow. Since we're only using one feature, this is the univariate case. The formula for the univariate case is $d = \sqrt{(q_1 - p_1)^2}$



The square root and the squared power cancel, and the formula simplifies to $d = \vert q_1 - p_1 \vert$
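If it helps to see the simplification numerically, here is a small check that the two forms agree. The function name `univariate_distance` and the sample pairs are hypothetical, chosen only for illustration.

```python
import math

def univariate_distance(q1, p1):
    """Euclidean distance for a single feature: the absolute difference."""
    return abs(q1 - p1)

# Compare the full formula with the simplified one on a few sample pairs.
for q1, p1 in [(3, 4), (3, 6), (3, 1), (3, 3)]:
    full_formula = math.sqrt((q1 - p1) ** 2)
    assert math.isclose(full_formula, univariate_distance(q1, p1))
    print(q1, p1, univariate_distance(q1, p1))
```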

The living space that we want to rent can accommodate three people. Let's first calculate the distance, using just the `accommodates` feature, between the first living space in the dataset and our own.

### Exercise

1. Calculate the Euclidean distance between our living space, which can accommodate three people, and the first living space in the dc_listings DataFrame.
1. Assign the result to first_distance and display the value using the print function.

In [11]:
print(dc_listings.index)

RangeIndex(start=0, stop=3723, step=1)


In [9]:
print(dc_listings.loc[dc_listings['accommodates'] == 3, 'accommodates'].head().values[0])

3


In [10]:
our_accommodates = 3  # our living space accommodates three people
first_distance = np.abs(dc_listings.loc[0, 'accommodates'] - our_accommodates)
print(first_distance)

1


## 5. Calculate Distance for All Observations. 

The Euclidean distance between the first row in the dc_listings DataFrame and our own living space is 1.

How do we know if this is high or low? If you look at the Euclidean distance equation itself, the lowest value you can achieve is 0. This happens when the value for the feature is exactly the same for both observations you're comparing: if $p_1 = q_1$, then $d = |q_1 - p_1| = 0$. The closer to 0 the distance is, the more similar the living spaces are.

If we want to calculate the Euclidean distance between each living space in the dataset and a living space that accommodates 8 people, here's a preview of what that would look like.

![fig3](figs/5.1-m139.svg)

Then, we can rank the existing living spaces by ascending distance values, the proxy for similarity.
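A minimal sketch of that ranking step, assuming a small hypothetical DataFrame rather than the real dataset and a target value of 3 (the number of people our own listing accommodates):

```python
import pandas as pd

# Hypothetical toy listings; only the accommodates column matters here.
toy = pd.DataFrame({"accommodates": [4, 6, 1, 2, 8, 3]})
target = 3  # number of people our own listing accommodates

# Distance of every listing from the target, in one vectorized expression.
toy["distance"] = (toy["accommodates"] - target).abs()

# Rank by ascending distance; mergesort is stable, so ties keep their original order.
ranked = toy.sort_values("distance", kind="mergesort")
print(ranked)
```

The most similar listings end up at the top of `ranked`, which is what lets us later take the first few rows as the nearest neighbors.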


### Exercise

1. Calculate the distance between each value in the accommodates column from dc_listings and the value 3, which is the number of people our listing accommodates:
    - Use the apply method to calculate the absolute difference between each value in accommodates and 3 and return a new Series containing the distance values.
1. Assign the distance values to the distance column.
1. Use the Series method value_counts and the print function to display the unique value counts for the distance column.

In [13]:
dc_listings['distance'] =  3 - dc_listings['accommodates']

In [14]:
print(dc_listings['distance'].head())

0   -1
1   -3
2    2
3    1
4   -1
Name: distance, dtype: int64


In [None]:
dc_listings['distance'] = dc_listings['distance'].apply(np.abs)

In [17]:
print(dc_listings['distance'].value_counts())
print(dc_listings['distance'].value_counts().sort_index())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64
0      461
1     2294
2      503
3      279
4       35
5       73
6       17
7       22
8        7
9       12
10       2
11       4
12       6
13       8
Name: distance, dtype: int64


## 6. Randomizing and Sorting. 

It looks like there are quite a few living spaces (461, to be precise) that can accommodate three people just like ours. This means the five "nearest neighbors" we select after sorting will all have a distance value of zero.

If we sort by the distance column and then select the first 5 living spaces, we would be biasing the result toward the ordering of the dataset.

In [18]:
print(dc_listings[dc_listings["distance"] == 0]["accommodates"])

26      3
34      3
36      3
40      3
44      3
       ..
3675    3
3697    3
3707    3
3714    3
3722    3
Name: accommodates, Length: 461, dtype: int64


Instead, let's randomize the ordering of the dataset and then sort the DataFrame by the distance column. This way, all of the living spaces that accommodate the same number of people will still be at the top of the DataFrame, but they will be in random order across the first 461 rows.

We have set a random seed, so we can perform answer-checking on our end.
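To make the shuffle-then-sort idea concrete, here is a sketch on a hypothetical toy DataFrame. It assumes a NumPy `Generator` created with `np.random.default_rng(1)`; the seed value and all variable names are illustrative. Note that a `Generator` has to be seeded through its own constructor, because `np.random.seed` only affects the legacy `np.random.*` functions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four listings tie with a distance of 0 from our value of 3.
toy = pd.DataFrame({"accommodates": [3, 3, 3, 3, 4, 2, 6]})
toy["distance"] = (toy["accommodates"] - 3).abs()

# Seeding the generator makes the shuffle repeatable from run to run.
rng = np.random.default_rng(1)
shuffled = toy.loc[rng.permutation(toy.index.to_numpy())]

# A stable sort keeps the shuffled order among rows that share a distance.
ranked = shuffled.sort_values("distance", kind="mergesort")
print(ranked)
```

The tied rows still occupy the top of the DataFrame, but their relative order now comes from the shuffle rather than from the original file order.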

### Exercise

1. Randomize the order of the rows in `dc_listings`:
    - Use the np.random.permutation() function to return a NumPy array of shuffled index values.
    - Use the DataFrame method loc[] to return a new DataFrame containing the shuffled order.
    - Assign the new DataFrame back to dc_listings.
1. After randomization, sort dc_listings by the distance column, and assign back to dc_listings.
1. Display the first 10 values in the price column using the print function.

In [19]:
np.random.seed(1)

In [20]:
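# Note: default_rng() creates a new, unseeded generator, so np.random.seed(1) above does not affect it.
# Pass a seed, e.g. np.random.default_rng(1), if the shuffle needs to be repeatable.
dc_listings = dc_listings.loc[np.random.default_rng().permutation(dc_listings.index)]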
dc_listings = dc_listings.sort_values(by='distance')

In [21]:
print(dc_listings.loc[:, 'price'].head(10))

3349    $105.00
899     $200.00
45      $100.00
1855    $100.00
625     $150.00
3037    $139.00
3546    $200.00
934     $220.00
2257    $129.00
1065     $99.00
Name: price, dtype: object


## 7. Average price. 

Before we can select the five most similar living spaces and compute the average price, we need to clean the price column.

Right now, the price column contains comma characters (,) and dollar sign characters ($), and it is a text column instead of a numeric column. We need to remove these characters and convert the entire column to the float datatype. Then, we can calculate the average price.
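Here is a small sketch of that cleaning step on a hypothetical Series. The values `"$160.00"` and `"$50.00"` follow the format seen in the dataset, while `"$1,200.00"` is made up to show the comma case. It uses a single regular expression instead of two separate replacements.

```python
import pandas as pd

# Hypothetical price strings in the same format as the price column.
raw_prices = pd.Series(["$160.00", "$1,200.00", "$50.00"])

# The character class [$,] matches either a dollar sign or a comma.
cleaned = raw_prices.str.replace(r"[$,]", "", regex=True).astype(float)

print(cleaned)
print(cleaned.mean())
```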

### Exercise

1. Remove the commas (,) and dollar sign characters ($) from the price column:
    - Use the str accessor so we can apply string methods to each value in the column, followed by the string method replace to replace all comma characters with the empty string: `stripped_commas = dc_listings['price'].str.replace(',', '')`
    - Repeat to remove the dollar sign characters.
1. Convert the new Series object containing the cleaned values to the float datatype and assign back to the price column in dc_listings.
1. Calculate the mean of the first five values in the price column and assign to mean_price.
1. Use the `print` function to display `mean_price`.

In [22]:
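# Raw string avoids an invalid-escape warning for "\$"; the pattern strips "$" and "," characters.
dc_listings['price'] = dc_listings['price'].str.replace(r'[\$,]', '', regex=True).astype(float)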

In [23]:
mean_price = dc_listings['price'].head().mean()
print(mean_price)

131.0


## 8. Function to Make Predictions. 

Congrats! You've just made your first prediction! Based on the average price of other listings that accommodate three people, we should charge **131.0** dollars per night for a guest to stay at our living space.

Let's write a more general function that can suggest the optimal price for other values of the accommodates column.

The dc_listings DataFrame has information specific to our living space (e.g., the distance column).

To save time, we've reset the dc_listings DataFrame to a clean slate and only kept the data cleaning and randomization we did since those weren't unique to the prediction we were making for our living space.

### Exercise

1. Write a function named predict_price that can use the k-nearest neighbors machine learning technique to calculate the suggested price for any value for accommodates. This function should do the following:
    - Take in a single parameter, new_listing, that describes the number of people the new listing accommodates.
    (We've added code that assigns dc_listings to a new DataFrame named temp_df. We used the pandas.DataFrame.copy() method, so temp_df is an independent copy of the data instead of just a reference to dc_listings.)
    - Calculate the distance between each value in the accommodates column and the new_listing value that was passed in. Assign the resulting Series object to the distance column in temp_df.
    - Sort temp_df by the distance column and select the first five values in the price column. Don't randomize the ordering of temp_df.
    - Calculate the mean of these five values and use that as the return value for the entire predict_price function.

1. Use the predict_price function to suggest a price for living spaces of different sizes:
    - For a living space that accommodates 1 person, assign the suggested price to acc_one.
    - For a living space that accommodates 2 people, assign the suggested price to acc_two.
    - For a living space that accommodates 4 people, assign the suggested price to acc_four.

In [24]:
dc_listings = pd.read_csv("dc_airbnb.csv")
dc_listings['price'] = dc_listings['price'].str.replace(r'[\$,]', '', regex=True).astype(float)
dc_listings = dc_listings.loc[np.random.default_rng().permutation(dc_listings.index)]

In [25]:
def predict_price(new_listing):
    temp_df = dc_listings.copy()
    temp_df['distance'] = (temp_df['accommodates'] - new_listing).abs()
    return temp_df.sort_values(by='distance')['price'].head().mean()    

In [26]:
acc_one = predict_price(1)
acc_two = predict_price(2)
acc_four = predict_price(4)

In [27]:
print(acc_one)
print(acc_two)
print(acc_four)

65.6
74.8
155.8


## 9. Next Steps. 
In this lesson, we explored the problem of predicting the optimal listing price for an AirBnB rental based on the price of similar listings on the site. We worked through the core of the machine learning workflow, from selecting a feature to making predictions. To explore the basics of machine learning, we limited ourselves to only using one feature (the univariate case) and a fixed k value of 5.
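As a small illustration of what relaxing the fixed k might look like, the sketch below adds a `k` parameter to the prediction function. The name `predict_price_k`, its signature, and the toy data are hypothetical; this is one possible variation, not the lesson's own solution.

```python
import pandas as pd

def predict_price_k(listings, new_listing, k=5):
    """Mean price of the k listings whose accommodates value is closest to new_listing."""
    temp_df = listings.copy()
    temp_df["distance"] = (temp_df["accommodates"] - new_listing).abs()
    # Stable sort so tied listings keep the (ideally pre-shuffled) order of `listings`.
    nearest = temp_df.sort_values("distance", kind="mergesort").head(k)
    return nearest["price"].mean()

# Hypothetical toy listings, not taken from dc_airbnb.csv.
toy_listings = pd.DataFrame({
    "accommodates": [1, 2, 2, 3, 4, 4, 6],
    "price": [50.0, 70.0, 80.0, 100.0, 150.0, 160.0, 300.0],
})

for k in (1, 3, 5):
    print(k, predict_price_k(toy_listings, new_listing=4, k=k))
```

Running it for several values of k shows how the suggested price shifts as more, and less similar, neighbors are averaged in.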

In the next lesson, we'll learn how to evaluate a model's performance.
