# Exercise - KNeighborsRegressor - Bike Sharing - Solution

### Introducing the exercise

This exercise uses the **bike sharing** dataset. The meaning of every feature, as well as the target, is described in the **Readme.txt** file. The goal is to predict the total count of rental bikes using the KNN regressor and the OLS linear regression implemented in sklearn.

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn-neighbors-kneighborsregressor

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn-linear-model-linearregression

The code implemented below reads the **day.csv** file located in the **bike-sharing-dataset** folder and stores the data in a **pandas** DataFrame. Some of the columns are omitted. The goal of predicting the count can be achieved by following the steps below.

* One of the OLS assumptions is that the target depends linearly on each feature. To check whether this condition is satisfied, create separate plots of **cnt** versus each of the six features. Choose two that show a linear, or close to linear, relationship with the target.
* Let these two features be the inputs and **cnt** be the target.
* Create an 80:20 train-test split.
* Having the features scaled is an essential part of working with the KNN algorithm. Make sure that the two features are scaled properly. You can use sklearn's StandardScaler() class.
* Create instances of the KNeighborsRegressor and LinearRegression classes. For the KNeighborsRegressor model, try to find the number of neighbors that works best.
* Fit both models to the training data.
* Make predictions on the test data.
* For each of the models, make a plot of the true test values versus the predicted test values. A perfect model would place every point on a 45-degree line. On both figures, plot a 45-degree line for reference.
* Return the R-squared value for both models. You can find out how to do that on the pages of the two regression algorithms. What is the R-squared value of a perfect model? Based on that, which of the two models performed better?

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn-preprocessing-standardscaler

Have fun!

### Import the relevant libraries

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

### Load the dataset and display the inputs

In [None]:
# A forward slash in the path works on Windows, macOS, and Linux
columns_to_drop = ['instant', 'dteday', 'season', 'yr', 'mnth',
                   'holiday', 'weekday', 'weathersit', 'workingday']

data = pd.read_csv('bike-sharing-dataset/day.csv').drop(columns_to_drop, axis = 1)

In [None]:
data

### Identify the features that bear a linear relation with the target

In [None]:
sns.set()

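# Plot cnt against each of the six remaining features, one figure per feature
for feature in data.columns.drop('cnt'):
    plt.figure()
    sns.scatterplot(data = data, x = feature, y = 'cnt');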

### Based on the results from above, define the inputs and the target. Create a train-test split

In [None]:
inputs = data[['casual', 'registered']]
target = data['cnt']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(inputs, target, 
                                                    test_size=0.2, 
                                                    random_state=365)

### Scale the training and test data to zero mean and unit variance

In [None]:
scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
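
As a quick sanity check on the scaling step, the sketch below uses synthetic data (the array `x_demo` and its values are made up for illustration and are not taken from **day.csv**). It suggests that a scaler fitted on the training split gives exactly zero mean and unit variance on that split, while the transformed test split is only approximately standardized, because it reuses the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales (assumption)
rng = np.random.default_rng(365)
x_demo = np.column_stack([rng.normal(800, 600, 500),
                          rng.normal(3600, 1500, 500)])

x_tr, x_te = train_test_split(x_demo, test_size=0.2, random_state=365)

demo_scaler = StandardScaler()
x_tr_scaled = demo_scaler.fit_transform(x_tr)  # statistics learned from the training split only
x_te_scaled = demo_scaler.transform(x_te)      # the same statistics reused on the test split

# Training split: mean 0 and standard deviation 1 up to floating-point error
print(x_tr_scaled.mean(axis=0), x_tr_scaled.std(axis=0))
# Test split: close to, but not exactly, 0 and 1
print(x_te_scaled.mean(axis=0), x_te_scaled.std(axis=0))
```

Fitting the scaler on the training split alone keeps information about the test split out of the preprocessing step.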

### Create the KNN and Linear regression models. Make predictions on the test data

In [None]:
reg_knn = KNeighborsRegressor(n_neighbors = 2)
reg_knn.fit(x_train_scaled, y_train)
y_test_pred_knn = reg_knn.predict(x_test_scaled)
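
The value `n_neighbors = 2` above is one possible choice. One way to search for the number of neighbors more systematically is cross-validation on the training data. The sketch below is a minimal illustration under stated assumptions: the data (`x_demo`, `y_demo`) is synthetic, and the candidate range of 1 to 20 neighbors and the 5 folds are arbitrary choices, not values from the exercise.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with two features (assumption)
rng = np.random.default_rng(365)
x_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3 * x_demo[:, 0] + 2 * x_demo[:, 1] + rng.normal(0, 1, 300)

# Scaling sits inside the pipeline so that it is refitted on every fold
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())

# make_pipeline names each step after its lowercased class name
param_grid = {'kneighborsregressor__n_neighbors': list(range(1, 21))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(x_demo, y_demo)

best_k = search.best_params_['kneighborsregressor__n_neighbors']
print(best_k, search.best_score_)
```

With the notebook's data, the same search could be run on `x_train` and `y_train`, leaving the test split untouched until the final evaluation.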

In [None]:
reg_lin = LinearRegression()
reg_lin.fit(x_train_scaled, y_train)
y_test_pred_lin = reg_lin.predict(x_test_scaled)
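
In the UCI bike sharing data, `cnt` is described as the total of the casual and registered rentals, so the target here is an exact linear function of the two chosen inputs. The sketch below uses synthetic data (all `*_demo` names and values are made up for illustration) to show what that implies for a linear regression: the fit is exact up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the two inputs (assumption: values are made up)
rng = np.random.default_rng(365)
casual_demo = rng.integers(0, 3000, size=200)
registered_demo = rng.integers(0, 7000, size=200)
x_demo = np.column_stack([casual_demo, registered_demo]).astype(float)

# The target is the exact sum of the two inputs
y_demo = casual_demo + registered_demo

x_demo_scaled = StandardScaler().fit_transform(x_demo)
lin_demo = LinearRegression().fit(x_demo_scaled, y_demo)

r2_demo = lin_demo.score(x_demo_scaled, y_demo)
print(r2_demo)         # equal to 1 up to floating-point error
print(lin_demo.coef_)  # each coefficient matches that feature's standard deviation
```

This is worth keeping in mind when comparing the two models: a near-perfect linear fit on these inputs reflects how the target is constructed rather than a general advantage of OLS over KNN.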

### Plot the predicted values from both models versus the true values

In [None]:
# Endpoints of the 45-degree reference line (predicted value = true value)
x = [50, 8000]
y = [50, 8000]

In [None]:
sns.set()

plt.scatter(y_test, y_test_pred_knn)
plt.plot(x, y)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('KNN Regressor');

In [None]:
sns.set()

plt.scatter(y_test, y_test_pred_lin)
plt.plot(x, y)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('Linear Regression');

### Calculate the R-squared value for both models

In [None]:
reg_knn.score(x_test_scaled, y_test), reg_lin.score(x_test_scaled, y_test)
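
For regressors, the `score` method returns the coefficient of determination, R-squared = 1 - SS_res / SS_tot. The sketch below computes it by hand on a tiny made-up example (the arrays `y_true` and `y_pred` are illustrative, not model output) and compares the result with `sklearn.metrics.r2_score`. It also shows that predictions identical to the true values give a value of exactly 1.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values (assumption, for illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)
r2_perfect = r2_score(y_true, y_true)           # perfect predictions

print(r2_manual, r2_sklearn, r2_perfect)
```

The closer a model's value is to 1, the more of the variance in the target it explains, so the model with the higher test-set value performed better here.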