# K-Nearest-Neighbors (KNN)

## Introduction

In this exercise, we will analyze the price per night of Airbnb listings in Copenhagen. The dataset comes from the [Inside Airbnb](http://insideairbnb.com/get-the-data.html) website, and the CSV files have already been downloaded to the `data` subfolder of the exercise folder.

Imagine you have a flat and you want to rent it on Airbnb. You'll probably want to choose the right price so you can maximize your revenue, right?

Our goal today is to __predict the price per night using the KNN regression algorithm__.

## Data Exploration

- Read the file 'listings-copenhague.csv'
- Display the first 3 rows of the listing

# KNN with one feature

As you can see, several features (columns) are available. To simplify this exercise, we will first focus on a single column: __bedrooms__. Our goal is to find the K most similar apartments in terms of number of bedrooms and then calculate the mean price per night of these K apartments.

So, in this exercise, we will proceed as follows:
1. Calculate the "distance" between our flat (2 bedrooms) and each flat in the dataset:
    
    __d = |q1 - p1|__ where:
    - __q1__: number of bedrooms in the target flat
    - __p1__: number of bedrooms in the flat we compare it to
2. Randomize the dataset (so the result does not depend on the order of the data)
3. Sort the dataset by distance
4. Keep the first 5 flats (= __5 nearest neighbors__, since the dataset is sorted by distance)
5. Calculate the mean price of these 5 flats
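
The five steps above can be sketched on a small made-up table. The DataFrame `toy`, its values, and the variable names are assumptions for illustration only; the exercise itself uses the Copenhagen listings.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not taken from the Airbnb dataset).
toy = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3, 2, 4, 2, 1, 2, 3],
    "price": [500.0, 800.0, 900.0, 1200.0, 700.0,
              2000.0, 1000.0, 450.0, 850.0, 1100.0],
})

our_bedrooms = 2
k = 5

# 1. Distance between our flat and every flat (one feature: absolute difference).
toy["distance"] = (toy["bedrooms"] - our_bedrooms).abs()

# 2. Shuffle the rows so ties are not resolved by file order.
rng = np.random.default_rng(1)
shuffled = toy.iloc[rng.permutation(len(toy))]

# 3. Sort by distance (a stable sort keeps the shuffled order among ties).
nearest = shuffled.sort_values("distance", kind="mergesort")

# 4. Keep the k nearest neighbors, then 5. average their prices.
predicted_price = nearest.head(k)["price"].mean()
print(predicted_price)
```

In this toy table exactly five flats have 2 bedrooms, so the shuffle does not change the result; with more ties than K, it decides which of the equally close flats are kept.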


## Euclidean Distance 

So, we will start by computing the [distance](https://en.wikipedia.org/wiki/Euclidean_distance) between each example and our 2-bedroom flat.

As a quick reminder, in this table, the distances to the 1st, 2nd and 3rd flats would be 1, 2 and 4 respectively.

- Create a new column named "distance" and compute the Euclidean distance between each example and __our 2-bedroom apartment__
- Display the result in a bar plot (distance on the X-axis and number of apartments at each distance on the Y-axis)
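
With a single feature, the Euclidean distance reduces to an absolute difference. The snippet below is a minimal sketch of counting the flats at each distance, using a made-up `bedrooms` series rather than the real column.

```python
import pandas as pd

# Made-up bedroom counts, for illustration only.
bedrooms = pd.Series([1, 2, 2, 3, 2, 4, 0, 1], name="bedrooms")

# sqrt((q1 - p1)^2) is the same as |q1 - p1| when there is one feature.
distance = (bedrooms - 2).abs()

# Number of flats at each distance, ordered by distance.
counts = distance.value_counts().sort_index()
print(counts)
```

If matplotlib is installed, `counts.plot.bar()` should draw the corresponding bar plot.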

## Randomize and sort

- Use `np.random.seed(1)` so the test is repeatable
- Shuffle the rows of the dataframe so that ties between equally close neighbors are broken at random
- Sort the dataframe by distance
- Display the first 5 results (__k = 5__) and make sure all the distances are equal to 0
- __Hint__: `np.random.permutation()`, `loc[]`
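
One way to combine the two hinted tools is sketched below, on a made-up frame `df` that already has a "distance" column (both the frame and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Made-up frame with a precomputed "distance" column (illustration only).
df = pd.DataFrame({
    "distance": [1, 0, 2, 0, 0, 1, 0, 0, 3],
    "price": [500.0, 800.0, 2000.0, 900.0, 700.0,
              450.0, 1000.0, 850.0, 2500.0],
})

np.random.seed(1)  # same shuffle on every run

# np.random.permutation returns the index labels in random order;
# .loc then reorders the rows to follow that order.
shuffled = df.loc[np.random.permutation(df.index)]

# Sort by distance and keep the k = 5 closest rows.
nearest = shuffled.sort_values("distance").head(5)
print(nearest)
```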

## Clean the data

Congratulations, you are almost done. All that is left is to compute the mean of the prices. But we have a problem:

```python
price1 = copenhagen_listing["price"][0]
price2 = copenhagen_listing["price"][1]
price1 + price2
```

What happened? The values in the "price" column are stored as strings, so the "+" operator concatenates them instead of adding them. To fix this:
- Replace each value in the "price" column with the same value without the "$" and the commas
- Convert the type from string to float
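
A minimal sketch of this cleaning step, on a made-up series of price strings (the values are assumptions; only the "$1,234.00" style mirrors the description above):

```python
import pandas as pd

# Made-up price strings (illustration only).
prices = pd.Series(["$1,200.00", "$850.00", "$650.00"], name="price")

# Before cleaning, "+" concatenates the two strings.
concatenated = prices[0] + prices[1]
print(concatenated)

# Remove "$" and "," as literal characters, then convert to float.
cleaned = (
    prices.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
          .astype(float)
)
total = cleaned[0] + cleaned[1]
print(total)
```

Passing `regex=False` matters here because "$" is a special character in regular expressions.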

```python
price1 = copenhagen_listing["price"][0]
price2 = copenhagen_listing["price"][1]
price1 + price2
```

```
2758.0
```

Is it working now? OK, let's finish this!

## Mean price

Now that we have our 5 nearest neighbors, we can estimate the price of our apartment by calculating the mean price of these 5 neighbors.

- Calculate the mean price of the five nearest values
- Display the result.

# Multivariate K-nearest Neighbors

Good job, you just made your first prediction based on the K (= 5) nearest neighbors. It's a big step! However, this model is not perfect. In the following sections, we will work on two aspects to improve it:
- How many neighbors should we use? (Choose the value of K.)
- Which features should we include in the model? (Choose the relevant columns.)

To do that, we will now use the Scikit-Learn library, which does what we have done so far in just a few lines. With this library, it's always the same 4 steps:
1. Instantiate the machine learning model we want to use (e.g. KNN regressor: class KNeighborsRegressor)
2. Use the data to train the model
3. Use the model to make predictions
4. Evaluate how accurate these predictions are

### Feature scaling

Features can have very different ranges of values. KNN measures similarity with the Euclidean distance, so a feature with a broad range of values dominates the distance and the other features barely count. To avoid this, all features should be normalized so that each one contributes approximately proportionately to the final distance (see [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) for more information).

- Z-score normalize the feature columns and store them in a new dataframe `normalized_listing`. Then add the 'price' column, without normalizing it!
- Display the first 3 rows of this new dataframe
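
Z-score normalization replaces each value x with (x - mean) / std, computed per column. The sketch below applies it to a made-up `listing` frame; the frame, its values, and `feature_cols` are assumptions for illustration.

```python
import pandas as pd

# Made-up listings (illustration only).
listing = pd.DataFrame({
    "accommodates": [2, 4, 6, 2, 8],
    "bedrooms": [1, 2, 3, 1, 4],
    "price": [500.0, 900.0, 1500.0, 450.0, 2200.0],
})

feature_cols = ["accommodates", "bedrooms"]

# z = (x - mean) / std, computed column by column.
features = listing[feature_cols]
normalized_listing = (features - features.mean()) / features.std()

# The target stays in its original units.
normalized_listing["price"] = listing["price"]
print(normalized_listing.head(3))
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, so the values can differ slightly from tools that use the population standard deviation.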

### KNeighborsRegressor

- Instantiate and train your model on the whole dataframe (use __k = 5__ and the columns 'bedrooms' and 'accommodates')

__Hint__: check the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)

### Make predictions

- Add a new column "predicted_price" to the dataframe with the price predicted by your 2-feature model
- What price would be predicted for a flat:
    - with bedrooms = 2 and accommodates = 2
    - with bedrooms = 3 and accommodates = 3
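
The sketch below shows one possible shape for the fit and predict steps with `KNeighborsRegressor`, on made-up data. It assumes the model is trained on z-scored features, in which case new flats need to be scaled with the same mean and standard deviation before prediction. The `listing` frame and its values are hypothetical.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Made-up listings (illustration only).
listing = pd.DataFrame({
    "bedrooms":     [1, 1, 2, 2, 2, 3, 3, 4, 1, 2],
    "accommodates": [2, 2, 4, 3, 4, 6, 5, 8, 1, 2],
    "price": [500.0, 550.0, 900.0, 800.0, 950.0,
              1500.0, 1400.0, 2200.0, 400.0, 700.0],
})
feature_cols = ["bedrooms", "accommodates"]

# Keep the training mean and std so new flats can be scaled the same way.
mean = listing[feature_cols].mean()
std = listing[feature_cols].std()
X = (listing[feature_cols] - mean) / std
y = listing["price"]

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)

# Predictions for the training rows themselves.
listing["predicted_price"] = knn.predict(X)

# Two new flats, given in raw units and then scaled like the training data.
new_flats = pd.DataFrame({"bedrooms": [2, 3], "accommodates": [2, 3]})
new_prices = knn.predict((new_flats - mean) / std)
print(new_prices)
```

Each prediction is the mean price of the 5 nearest training rows, so it always lies between the lowest and highest training price.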

# Improve your model by selecting the right features and K

Well done, now you understand how to make predictions based on chosen columns! It's a huge step, but it's only the beginning of your job as a Data Scientist. Using what you have learned today about the KNN algorithm, experiment with the features of this dataset to make the best possible predictions.

- Create a new model and train it with all the columns (except the price, of course) and K = 10
- Compare its score with the score of the previous model

0.2147789867226768

# Optimize your model

- Plot the score against the number of features used (fix K to 10)
- Plot the score against the value of K, up to 20 neighbors (keep the columns "accommodates", "bedrooms" and "beds")
- __Note__: This exercise can take a while to compute (alternative: work on a subset of the dataframe)
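
A possible loop for the second plot is sketched below on synthetic data. It assumes that "score" means the R² value returned by the estimator's `score` method, and it evaluates on the same rows used for fitting, as the exercise does up to this point. The data and variable names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (illustration only): price loosely tied to the features.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "bedrooms": rng.integers(0, 5, size=n),
    "beds": rng.integers(1, 7, size=n),
})
y = 300 * X["accommodates"] + 200 * X["bedrooms"] + rng.normal(0, 300, size=n)

# R^2 score for K = 1..20, computed on the rows the model was fitted on.
ks = range(1, 21)
scores = []
for k in ks:
    model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    scores.append(model.score(X, y))

for k, s in zip(ks, scores):
    print(k, round(s, 3))
```

With matplotlib installed, `plt.plot(ks, scores)` would turn the two lists into the requested curve.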

**By looking at what the 1-NN model does, explain why there is clearly a problem with the way we evaluate the performance of our model.**