# Cross Validation 

## 1. Introduction

In this exercise, we will walk through the whole machine learning development workflow. A common mistake (one we made intentionally during the previous days) is to learn the parameters of a prediction function and test it on the same data. What's wrong with that? A model evaluated on the data it was fitted to can look accurate and still perform poorly on real data it has never seen before, so the score tells us little about how well it generalizes.

There are several techniques to avoid this. We could set aside part of the training data and use it to evaluate the model trained on the rest of the data (the __Holdout Method__). But by reducing the training data, we risk missing patterns in the data set and increasing the error. __K-Fold cross validation__ helps us solve this problem.
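
As a rough illustration of the Holdout Method, the sketch below uses scikit-learn's `train_test_split` on synthetic, noise-free data. The data, the 25% test size and the variable names are assumptions made for this example only; they are not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data (hypothetical): y = 2x + 1
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

# Hold out 25% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# R^2 computed on the held-out rows only
holdout_r2 = model.score(X_test, y_test)
print(len(X_train), len(X_test), holdout_r2)
```

Because the toy data is perfectly linear, the held-out score is essentially perfect here. With real data, the score depends on which rows happen to land in the held-out part, which is the weakness K-Fold cross validation addresses.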

In K-Fold cross validation, we split our data into k separate "folds". The Holdout Method is then repeated k times, so that each time one of the k folds serves as the test subset and the other (k-1) folds are used together as the training set.
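
To make the splitting concrete, here is a minimal sketch that prints the indices `KFold` produces for six toy samples and three folds. The sample array and the name `kf_demo` are chosen for this illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(6)  # six toy samples, identified by their index

kf_demo = KFold(n_splits=3)  # no shuffling, so the folds are contiguous
splits = list(kf_demo.split(samples))

for i, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {i}: train={train_idx}, test={test_idx}")
```

Each sample ends up in the test subset exactly once and in the training set the other two times.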

__Note__ that this method does not depend on the model. In this example, we will use it with a Linear Regression, but you could use it with any method you want (KNN, Logistic Regression, ...).

The following image illustrates this algorithm.
<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" style="width:50%;">

The general workflow for applying Cross Validation is always the same:
1. Instantiate the scikit-learn model you want to use (LinearRegression for today's exercise);
2. Instantiate the KFold class with the parameters you want;
3. Use the cross_val_score() function to measure the performance of your model.
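
The three steps above can be sketched on synthetic data as follows. The data, the number of folds and the names `X_demo`, `y_demo`, `model`, `folds` and `scores` are assumptions for illustration; the exercise below uses the housing data and different parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical): a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3 * X_demo.ravel() + rng.normal(0, 1, size=100)

model = LinearRegression()                                  # step 1
folds = KFold(n_splits=4, shuffle=True, random_state=0)     # step 2
scores = cross_val_score(model, X_demo, y_demo, cv=folds)   # step 3

# One score per fold; without a scoring argument, regressors are scored with R^2
print(scores)
```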

## 2. Data exploration

In this section, we will analyse the sale prices of houses in Iowa. We will try to predict the final sale price from different criteria.

- Read the data from the file "saleprice_housing.txt" (__do not forget__ to specify the delimiter as "\t" for tabulation)
- Display the first 5 lines
- Generate 3 diagrams (scatter plots):
    1. The first one represents the "Garage Area" on the X-axis and the "SalePrice" on the Y-axis
    2. The second one represents the "Gr Liv Area" on the X-axis and the "SalePrice" on the Y-axis
    3. And the last one represents the "Overall Cond" on the X-axis and the "SalePrice" on the Y-axis

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data = pd.read_csv("data/saleprice_housing.txt", delimiter="\t")

data

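| Unnamed: 0 | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | | IR1 | Lvl | ... | 0 | | | | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | | Reg | Lvl | ... | 0 | | MnPrv | | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | | IR1 | Lvl | ... | 0 | | | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | | Reg | Lvl | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | | IR1 | Lvl | ... | 0 | | MnPrv | | 0 | 3 | 2010 | WD | Normal | 189900 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2925 | 2926 | 923275080 | 80 | RL | 37.0 | 7937 | Pave | | IR1 | Lvl | ... | 0 | | GdPrv | | 0 | 3 | 2006 | WD | Normal | 142500 |
| 2926 | 2927 | 923276100 | 20 | RL | | 8885 | Pave | | IR1 | Low | ... | 0 | | MnPrv | | 0 | 6 | 2006 | WD | Normal | 131000 |
| 2927 | 2928 | 923400125 | 85 | RL | 62.0 | 10441 | Pave | | Reg | Lvl | ... | 0 | | MnPrv | Shed | 700 | 7 | 2006 | WD | Normal | 132000 |
| 2928 | 2929 | 924100070 | 20 | RL | 77.0 | 10010 | Pave | | Reg | Lvl | ... | 0 | | | | 0 | 4 | 2006 | WD | Normal | 170000 |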


<details>
  <summary>Hint</summary>

Calling `plot(..., kind='scatter')` on a DataFrame will solve part of the problem! Try to do this one without looking at the solution!
</details>
<details>
  <summary>Hint 2</summary>

You could use matplotlib's `add_subplot` method to create 3 charts on the same figure!
</details>
<details>
  <summary>View solution</summary>

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# The file is tab-separated, so the delimiter must be given explicitly
data = pd.read_csv("data/saleprice_housing.txt", delimiter="\t")

# print() is needed here: only the last expression of a cell is displayed
print(data.head())
target = "SalePrice"

# One figure with three stacked subplots
fig = plt.figure(figsize=(7, 15))
ax1 = fig.add_subplot(3, 1, 1)
ax2 = fig.add_subplot(3, 1, 2)
ax3 = fig.add_subplot(3, 1, 3)

# One scatter plot per feature, always against the sale price
data.plot(x="Garage Area", y=target, ax=ax1, kind="scatter")
data.plot(x="Gr Liv Area", y=target, ax=ax2, kind="scatter")
data.plot(x="Overall Cond", y=target, ax=ax3, kind="scatter")
plt.show()
```
</details>

## 3. Linear Regression with Scikit-Learn

- Use the LinearRegression() model from scikit-learn with the column "Gr Liv Area" as the training feature and "SalePrice" as the target.
- Plot the fitted line on the corresponding scatter plot.
- What would be the price of a house with a "Gr Liv Area" equal to 2000?

In [1]:
# TODO: your code here !

<details>
  <summary>Hint</summary>

We saw this in yesterday's lecture! You can find a corresponding example in the lecture slides.
</details>
<details>
  <summary>View solution</summary>

```python
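from sklearn.linear_model import LinearRegression

# Fit on a single feature; scikit-learn expects a 2D input, hence the double brackets
lr = LinearRegression()
lr.fit(data[["Gr Liv Area"]], data["SalePrice"])

# Fitted line: y = a1 * x + a0
a0 = lr.intercept_
a1 = lr.coef_[0]  # coef_ is an array with one entry per feature
x = np.linspace(0, 5000, num=1000)
y = a1 * x + a0

plt.scatter(data["Gr Liv Area"], data["SalePrice"])
plt.plot(x, y, "red")
plt.show()

# Predicted price for a living area of 2000
gr_liv_area = 2000
price = a1 * gr_liv_area + a0
price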
```
</details>
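
As a side note, the manual formula `a1 * x + a0` and the model's own `predict()` method should give the same value. The sketch below checks this on toy data; the column names `area` and `price` and all the numbers are hypothetical and unrelated to the housing file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values): price = 100 * area + 5000
toy = pd.DataFrame({"area": [500.0, 1000.0, 1500.0, 2500.0]})
toy["price"] = 100 * toy["area"] + 5000

toy_model = LinearRegression().fit(toy[["area"]], toy["price"])

# Manual formula: coef_ is an array with one entry per feature
manual = toy_model.coef_[0] * 2000 + toy_model.intercept_

# Same prediction through predict(), which expects a 2D input
new_house = pd.DataFrame({"area": [2000.0]})
predicted = toy_model.predict(new_house)[0]

print(manual, predicted)
```

Using `predict()` avoids rewriting the formula by hand when the model has more than one feature.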

## 4. Cross Validation with Scikit-Learn 

Well done, you made a prediction (based on one feature). Now we will apply Cross Validation to measure the performance of our model.
- Create a `KFold` instance with the following parameters:
    - 5 folds;
    - shuffle = True;
    - random_state = 1 (so that the splits, and therefore the results, are reproducible);
    - assign this object to the variable `kf`.

In [2]:
# TODO: your code here !

<details>
  <summary>Hint</summary>

Check [this](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) out 
</details>
<details>
  <summary>View solution</summary>

```python
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=1)
```
</details>
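
To see what `random_state` buys us, the sketch below builds the same shuffled 5-fold split twice and compares the resulting test indices. The helper `fold_test_indices` and the ten toy items are hypothetical names introduced for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

items = np.arange(10)  # ten toy samples

def fold_test_indices(random_state):
    """Return the test indices of each fold for a shuffled 5-fold split."""
    splitter = KFold(n_splits=5, shuffle=True, random_state=random_state)
    return [test_idx.tolist() for _, test_idx in splitter.split(items)]

first_run = fold_test_indices(random_state=1)
second_run = fold_test_indices(random_state=1)

print(first_run)
print(first_run == second_run)  # same seed, same folds
```

With `shuffle=True` and no fixed `random_state`, the folds can change from one run to the next, and so can the scores computed on them.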

- Create a new instance of the class `LinearRegression` and assign it to the variable `lr`.

In [3]:
# TODO: your code here !

<details>
  <summary>Hint</summary>

"Instantiate" means creating an object from a class. It's only one line of code!
</details>
<details>
  <summary>View solution</summary>

```python
lr = LinearRegression()
```
</details>

- Use the `cross_val_score()` function to cross-validate your model on the K folds:
    - use the `LinearRegression` instance `lr` you just created;
    - use the column "Gr Liv Area" as the training feature and the column "SalePrice" as the target;
    - use the scoring parameter with the value "neg_mean_squared_error";
    - pass `kf` as the `cv` parameter;
    - the function returns an array with the negated MSE values (one for each fold);
    - assign the result to the variable `mses`.

In [4]:
# TODO: your code here !

<details>
  <summary>Hint</summary>
    
You have already been [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)!
</details>
<details>
  <summary>View solution</summary>

```python
mses = cross_val_score(lr, data[["Gr Liv Area"]], data['SalePrice'], scoring='neg_mean_squared_error', cv=kf)
```
</details>
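
A word on the sign: scikit-learn scorers follow a "higher is better" convention, so `"neg_mean_squared_error"` returns the MSE multiplied by -1. The sketch below shows this on synthetic data; the data and the names `X_toy`, `y_toy` and `neg_mses` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical): a noisy linear relationship
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(60, 1))
y_toy = 2 * X_toy.ravel() + rng.normal(0, 1, size=60)

neg_mses = cross_val_score(
    LinearRegression(), X_toy, y_toy,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)

print(neg_mses)    # every value is <= 0
print(-neg_mses)   # the actual MSE of each fold
```

This is why the next step takes the absolute value before the square root.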

- Compute the square root of the absolute value of each MSE and assign the result to the variable `rmses`.
- Compute the mean of the RMSEs and assign it to the variable `avg_rmses`.
- Compute the standard deviation of the RMSEs and assign it to the variable `std_rmses`.
- Print your results (`avg_rmses` and `std_rmses`).

In [5]:
# TODO: your code here !

<details>
  <summary>Hint</summary>
    
Here, we simply ask you to use numpy and its square root, mean and standard deviation functions! Nothing more!
</details>
<details>
  <summary>View solution</summary>

```python
rmses = np.sqrt(np.absolute(mses))
avg_rmses = np.mean(rmses)
std_rmses = np.std(rmses)
print(avg_rmses)
print(std_rmses)
```
</details>
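
The arithmetic is easy to check by hand on made-up numbers. In the sketch below, `toy_mses` holds hypothetical negated MSE values, not results from the housing data.

```python
import numpy as np

# Hypothetical scores, as returned with scoring="neg_mean_squared_error"
toy_mses = np.array([-4.0, -9.0, -16.0])

toy_rmses = np.sqrt(np.abs(toy_mses))   # square roots of 4, 9 and 16
toy_avg = np.mean(toy_rmses)            # (2 + 3 + 4) / 3
toy_std = np.std(toy_rmses)             # population standard deviation (ddof=0)

print(toy_rmses, toy_avg, toy_std)
```

The RMSEs are 2, 3 and 4, their mean is 3, and their standard deviation is the square root of 2/3, roughly 0.816. Note that `np.std` divides by the number of values by default, not by that number minus one.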

## 5. Explore different values for K

Well done! You just calculated the standard deviation and the mean of your Root Mean Squared Error (RMSE). But what does that mean? To answer this question, let's compare the `avg_rmses` and `std_rmses` for different values of K (number of folds).

- Using a for loop `for k in num_folds:`, compute the `avg_rmses` and `std_rmses` for each k in num_folds = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 100, 1000].
- For each iteration, print the result with the following command: `print(str(k), "folds: ", "AVG RMSE: ", str(avg_rmses), "STD RMSE: ", str(std_rmses))`.
- __Hint__: do not hesitate to copy and paste from the previous exercise...

In [6]:
# TODO: your code here !

<details>
  <summary>Hint</summary>
    
We want you to generalize what you have just done for one value of K. Now we want to loop over different values of K, and repeat the exact same steps as you did up to here!
</details>
<details>
  <summary>View solution</summary>

```python
num_folds = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 100, 1000]
stock_avg_rmses = []
stock_std_rmses = []
for k in num_folds:
    kf = KFold(k, shuffle = True, random_state = 1)
    lr = LinearRegression()
    mses = cross_val_score(lr, data[["Gr Liv Area"]], data['SalePrice'], scoring='neg_mean_squared_error', cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmses = np.mean(rmses)
    std_rmses = np.std(rmses)
    stock_std_rmses.append(std_rmses)
    stock_avg_rmses.append(avg_rmses)
    print(str(k), "folds: ", "AVG RMSE: ", str(avg_rmses), "STD RMSE: ", str(std_rmses))
```
</details>

It seems that as k becomes bigger, the average RMSE decreases but the standard deviation increases. In an ideal world, we would like both `avg_rmses` and `std_rmses` to be as small as possible. So, we have to choose a k that offers a good compromise between a small `avg_rmses` and a small `std_rmses`.

- Plot the average RMSE and the standard deviation of the RMSE against the number of folds to visualize this trade-off.

In [7]:
# TODO: your code here !

<details>
  <summary>Hint</summary>
    
To plot a vector, you can use `plt.plot(x, y)` where x and y are lists of x-coordinates and y-coordinates for a 2D-representation.
</details>
<details>
  <summary>View solution</summary>

```python
plt.plot(num_folds, stock_std_rmses, label="STD RMSE")
plt.plot(num_folds, stock_avg_rmses, label="AVG RMSE")
plt.legend()
plt.show()
```
</details>
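
One plausible reason for the growing spread is that each test fold gets smaller as k grows, so every per-fold RMSE is computed from fewer rows. The sketch below only counts rows per test fold. `n_samples = 3000` is a hypothetical round number, not the real size of the housing data, and `fold_sizes` is a name introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 3000  # hypothetical round number, not the real row count
placeholder = np.zeros((n_samples, 1))  # KFold only needs the number of rows

fold_sizes = {}
for k in [3, 5, 100, 1000]:
    splitter = KFold(n_splits=k, shuffle=True, random_state=1)
    sizes = [len(test_idx) for _, test_idx in splitter.split(placeholder)]
    fold_sizes[k] = (min(sizes), max(sizes))
    print(f"k={k}: each test fold holds between {min(sizes)} and {max(sizes)} rows")
```

Under this assumption, 1000 folds leave only 3 rows per test fold, so a single unusual house can move that fold's RMSE a lot.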