<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

# The Learning Curve

This exercise will get you more familiar with the concept of the learning curve, and how to use it appropriately. Let's start by opening our data.

## Load and prepare the data:

We will work with an IODP dataset that I assembled myself. It contains porosity measured from logs as well as various information on the cores. Open the `core_data.csv` dataset in your `Datasets` folder. To make your life a little easier, I have already cleaned the dataset for you: there are no duplicates or null values, and the data has been scaled using a `MinMaxScaler` (except for our target variable). Don't take my word for it: quickly explore the data to convince yourself that this is the case.

Now do the following using a `random_state` value of 42:
1. Create a `y` target variable that contains only the `Porosity (vol%)` values, and a feature set (`X`) that contains all of the other features
2. Split `X` and `y` into `X_train`/`y_train` (70% of the data) and `X_test`/`y_test` (30% of the data)
3. Further split `X_train`/`y_train` into `X_train`/`y_train` (80% of the original `X_train`/`y_train`) and `X_val`/`y_val` (20% of the original `X_train`/`y_train`)

In the end, you should have 6 variables: `X_train`, `X_val`, `X_test`, `y_train`, `y_val`, `y_test`.
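
The two-stage split can be sketched as below. This is only an illustration on a synthetic stand-in table, not the real `core_data.csv`: the `feature_a` and `feature_b` columns and the 1000-row size are made up, and only the `Porosity (vol%)` column name comes from the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for core_data.csv (assumption: the feature columns
# and the number of rows are invented for this sketch).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.random(1000),
    "feature_b": rng.random(1000),
    "Porosity (vol%)": rng.random(1000) * 80,
})

# 1. Target and feature set
y = df["Porosity (vol%)"]
X = df.drop(columns="Porosity (vol%)")

# 2. 70% train / 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Split the training part again: 80% train / 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Note that the second call overwrites `X_train`/`y_train`, so the validation set is carved out of the training data only and never overlaps with the test set.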

In [1]:
from nbta.utils import download_data
download_data(id='13NdioEz4vdjsz00IbIpwXR-6KvO1O2R3')

100% [..........................................................................] 2275646 / 2275646

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Plotting the learning curve

Time to put what you have learned in practice! We will assess how well a simple `LinearRegression` model applies to our dataset by monitoring the learning curve. For this, you will need to do the following:

1. Create a loop over `m`, running from `1` up to the number of samples in your `X_train` **divided by 10 and rounded down to the nearest integer** (to make the whole thing a bit faster!). In concrete terms, the loop should run `281` times.
2. At each loop iteration do the following: 
   a. train a `LinearRegression` based on the `X_train[0:m*10]` and `y_train[0:m*10]` data points
   b. predict the `y_train[0:m*10]` based on the `X_train[0:m*10]` and calculate a `RMSE_score`. Save this value in a `train_rmse` list.
   c. predict `y_pred` based on the entire `X_val` data and calculate a `RMSE_score` based on `y_pred` and `y_val`. Save this in a `val_rmse` list.
3. Plot the two curves (`train_rmse` and `val_rmse`) against the number of training samples to see the learning curve: because your lists are sorted by increasing number of `X_train` samples, all you need to do is plot the values of the lists on the y-axis and the position of each item (multiplied by 10 to get the number of samples) on the x-axis.

**Recommendation:** You will need to do this exercise multiple times in this notebook, so you might want to consider writing two functions to do so. Why two, and not one function? Well, points 1. and 2. above take a long time to compute, so you don't want to have to repeat them too often. On the other hand, point 3. (the plot) is quick, and you may decide to zoom in on some of the areas of the plot. So I recommend doing this:

* Write one function (mine is called `calculate_learning_curves`) that will take care of points 1. and 2. above. You would pass a model to the function (for instance, `LinearRegression`) and the data (`X_train`, `X_val`, `y_train`, `y_val`); it would calculate the two lists (`train_rmse` and `val_rmse`) and return them.
* Write a second function (mine is called `plot_learning_curves`) to actually plot the learning curve in your notebook. This function would be passed the values of `train_rmse` and `val_rmse` as inputs
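
One possible shape for these two functions is sketched below. It is not the reference solution: the `step` and `axis_limits` parameters are my own additions, the demo data is synthetic, and the x-axis is expressed in number of training samples.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def calculate_learning_curves(model, X_train, X_val, y_train, y_val, step=10):
    """Return (train_rmse, val_rmse) for training subsets of growing size."""
    train_rmse, val_rmse = [], []
    n_steps = len(X_train) // step  # rounded down, as in the exercise
    for m in range(1, n_steps + 1):
        n = m * step
        X_sub, y_sub = X_train[:n], y_train[:n]
        fitted = clone(model).fit(X_sub, y_sub)  # fresh, unfitted copy each time
        train_rmse.append(np.sqrt(mean_squared_error(y_sub, fitted.predict(X_sub))))
        val_rmse.append(np.sqrt(mean_squared_error(y_val, fitted.predict(X_val))))
    return train_rmse, val_rmse


def plot_learning_curves(train_rmse, val_rmse, step=10, axis_limits=None):
    """Plot both curves; axis_limits is an optional [xmin, xmax, ymin, ymax]."""
    sizes = np.arange(1, len(train_rmse) + 1) * step
    fig, ax = plt.subplots()
    ax.plot(sizes, train_rmse, label="train")
    ax.plot(sizes, val_rmse, label="validation")
    ax.set_xlabel("Number of training samples")
    ax.set_ylabel("RMSE")
    if axis_limits is not None:
        ax.axis(axis_limits)  # zoom without recomputing the curves
    ax.legend()
    return ax


# Synthetic demo data (assumption: NOT the porosity dataset)
rng = np.random.default_rng(42)
X_demo = rng.random((500, 3))
y_demo = X_demo @ np.array([10.0, -5.0, 2.0]) + rng.normal(0, 1, 500)
X_tr, X_va, y_tr, y_va = X_demo[:400], X_demo[400:], y_demo[:400], y_demo[400:]

train_rmse, val_rmse = calculate_learning_curves(
    LinearRegression(), X_tr, X_va, y_tr, y_va
)
ax = plot_learning_curves(train_rmse, val_rmse)
```

Because the expensive part lives in `calculate_learning_curves`, zooming is just another call such as `plot_learning_curves(train_rmse, val_rmse, axis_limits=[0, 200, 0, 20])`. The sketch slices with `X_train[:n]`, which is fine for NumPy arrays; with pandas objects, `X_train.iloc[:n]` makes the positional intent explicit.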

**Questions for you to answer:**
* How does the curve look in general?
* How does the curve look when you zoom in to (0, 200) on the x-axis and (0, 20) on the y-axis? You can use `plt.axis([0, 200, 0, 20])` to change the limits of your axes (but there are other ways to do that too).
* How does the curve look at the end of training (`plt.axis([2500, 2800, 10, 12])`)? What can you conclude about the suitability of a linear regression for this task?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

Now let's see what these learning curves look like for the simple linear model:

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

#### How to interpret the curve above:

<details>
<summary> 💡 Make your own interpretations before checking this  </summary>
    
<li> Notice that for the training data, the error is 0 at first. This is because we have only a few datapoints in our training set, so we can always fit a linear model through them.</li>
<li> However, it is clear that the linear model grossly overfits the training data, because the error on the validation set is much higher. Our problem is of course that we don't have enough data at this stage (remember, I am talking about the beginning of the curve here, with only a few datapoints in our dataset).</li>
<li> As we increase the number of points, the dataset becomes noisier and the linear model can no longer fit each instance perfectly. Instead, it finds the best compromise between all of the training instances (an instance is what we call a datapoint). In other words, the algorithm generalizes better with an increasing number of instances. As a consequence, the RMSE of the training curve increases (now we are not fitting all of our instances) but, importantly, the error on the validation curve is decreasing.</li>
<li> There is a little bit of randomness in how these two curves evolve because it depends on the nature of each new instance, but at roughly 2500 instances we can see that the errors on the validation (orange) and training (blue) curves are very close and remain so: we seem to have convergence of our training algorithm.</li>
<li> We deduce that the algorithm is fit for purpose, because the error is relatively low (RMSE of 11.2) and both the validation and training sets have similar errors (not too much overfitting). This is not to say that another algorithm cannot do better...</li>
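
The zero training error at the very start of the curve can be reproduced on a toy problem. The sketch below uses synthetic data with a single feature (an assumption, not the porosity dataset): with two training points a straight line passes exactly through both, while with 100 noisy points it cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic 1-feature data: linear trend plus noise
rng = np.random.default_rng(42)
X_all = rng.random((200, 1))
y_all = 3.0 * X_all[:, 0] + rng.normal(0, 1.0, 200)
X_val_demo, y_val_demo = X_all[100:], y_all[100:]  # held-out half


def rmse_pair(n):
    """Train on the first n points; return (train RMSE, validation RMSE)."""
    model = LinearRegression().fit(X_all[:n], y_all[:n])
    train = np.sqrt(mean_squared_error(y_all[:n], model.predict(X_all[:n])))
    val = np.sqrt(mean_squared_error(y_val_demo, model.predict(X_val_demo)))
    return train, val


train_small, val_small = rmse_pair(2)    # line through 2 points: exact fit
train_large, val_large = rmse_pair(100)  # compromise across 100 noisy points

print(f"n=2:   train={train_small:.3g}, val={val_small:.3g}")
print(f"n=100: train={train_large:.3g}, val={val_large:.3g}")
```

With two points the training RMSE is zero up to floating-point error, and the validation RMSE is typically much larger than with 100 points, which is the overfitting pattern described above.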

</details>

Now let's introduce a new algorithm, and see how this one will perform on our dataset.

# Testing KNNs

Plot the learning curve for our dataset using a `KNeighborsRegressor` with `n_neighbors=1`. If you elected to write your code as functions, this should be very easy to do (just swap `LinearRegression` for `KNeighborsRegressor` as your algorithm). Otherwise you will need to do some copy-pasting and rewriting of the code.

Plot the entire range of the learning curve. What do you conclude from this plot? 

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

#### How to interpret the curves above:
<details>
<summary> 💡 Make your own interpretations before checking this  </summary>
    
<li> You can see that the validation error decreases at first with the number of samples, but then oscillates at a high RMSE. In addition, the training error is zero across the entire training range: this is a feature of KNNs with `k=1`. Because we predict based on the nearest neighbor, each datapoint of the train set is its own nearest neighbor when we predict the train set, so the prediction is exact and the error is zero.</li>

In other words, we overfit the dataset with the `KNeighborsRegressor` algorithm. Can we do better? Yes! What we need is to tune one simple hyperparameter...
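
A quick way to convince yourself of the zero training error is the minimal sketch below. It uses purely random synthetic data (an assumption, unrelated to the porosity dataset), so there is nothing to learn, and yet the training RMSE with `n_neighbors=1` still vanishes.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: the target is pure noise, independent of the features
rng = np.random.default_rng(42)
X_demo = rng.random((300, 3))
y_demo = rng.normal(50, 10, 300)
X_tr, X_va, y_tr, y_va = X_demo[:200], X_demo[200:], y_demo[:200], y_demo[200:]

knn = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

# Each training point is its own nearest neighbor, so it predicts itself
train_rmse_k1 = np.sqrt(mean_squared_error(y_tr, knn.predict(X_tr)))
# Unseen points get the value of some other noisy neighbor
val_rmse_k1 = np.sqrt(mean_squared_error(y_va, knn.predict(X_va)))

print(train_rmse_k1, val_rmse_k1)
```

The model memorizes the training set perfectly while the validation error stays large, which is overfitting in its purest form.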

</details>


# Choosing a more appropriate `k`

Clearly, our problem is that we are overfitting with `k=1`. So, what we need to do is find the best value of `k` for our dataset. To do this, we will use the elbow method. Plot a curve of **RMSE** for a `KNeighborsRegressor` fitted with values of `k` between `1` and `30`. Store the best value of `k` in a variable named `best_k`, and test your result.
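
The scan over `k` could look like the sketch below. It runs on synthetic data and assumes the RMSE is measured on a validation set; `k_at_min` is a hypothetical name, and taking the minimum is only a simple proxy for reading the elbow off the plot by eye.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic non-linear data (assumption: NOT the porosity dataset)
rng = np.random.default_rng(42)
X_demo = rng.random((600, 2))
y_demo = np.sin(6 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(0, 0.3, 600)
X_tr, X_va, y_tr, y_va = X_demo[:450], X_demo[450:], y_demo[:450], y_demo[450:]

k_values = list(range(1, 31))  # k = 1 .. 30
val_rmse_per_k = []
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_va, knn.predict(X_va)))
    val_rmse_per_k.append(rmse)

# k with the lowest validation RMSE on this synthetic data
k_at_min = k_values[int(np.argmin(val_rmse_per_k))]
print(k_at_min)
```

Plotting `val_rmse_per_k` against `k_values` gives the curve on which the elbow is read; the value found here applies to the synthetic data only and says nothing about the right `best_k` for the exercise.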

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('best_k',
                         best_k = best_k,
)

result.write()
print(result.check())

# Plot the learning curve for the best `k`

Now plot the learning curve for the KNN again, but this time when you create your `KNeighborsRegressor`, set the `n_neighbors` hyperparameter to `best_k`. Plot the entire range of the learning curve.


What do you conclude from this plot? 

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

#### How to interpret the curves above:
<details>
<summary> 💡 Make your own interpretations before checking this  </summary>
    
<li> Setting `k` to 9 has made a massive difference: we now see that the training error is converging with the validation error, as expected</li>
<li> The variance of the model is much lower, which means we have reached a good bias/variance tradeoff</li>   
<li> Tweaking this simple hyperparameter has also resulted in an RMSE for the validation set below 10, which is better than the `LinearRegression` model!</li>

</details>


## How many samples did we need?

The learning curve can also indicate how many samples are needed, and help us decide if we need to acquire more data. Based on your curve, what is roughly the minimum number of samples needed to obtain a flat validation score? Save your answer in a variable named `min_samples`.
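
Reading the plateau off the plot by eye is perfectly fine here. If you prefer a rule of thumb in code, the sketch below shows one possible heuristic: `first_flat_size` is a hypothetical helper, and the 5% tolerance and 10-point tail are arbitrary choices of mine. It returns the smallest training size after which the validation RMSE stays within the tolerance of its final level.

```python
import numpy as np


def first_flat_size(val_rmse, step=10, tolerance=0.05, tail=10):
    """Smallest training size after which val_rmse stays near its plateau.

    The plateau is the mean of the last `tail` points; "near" means within
    a relative `tolerance` of it. Returns None if the curve never settles.
    """
    val_rmse = np.asarray(val_rmse, dtype=float)
    plateau = val_rmse[-tail:].mean()
    within = np.abs(val_rmse - plateau) <= tolerance * plateau
    outside = np.flatnonzero(~within)  # positions still outside the band
    if outside.size == 0:
        return step  # flat from the very first point
    first_idx = outside[-1] + 1  # first position after the last outlier
    if first_idx >= len(val_rmse):
        return None  # the final point itself is outside the band
    return int((first_idx + 1) * step)


# Synthetic validation curve decaying towards an RMSE of 10 (made-up numbers)
sizes = np.arange(1, 101) * 10
synthetic_val_rmse = 10 + 20 * np.exp(-sizes / 100)
estimate = first_flat_size(synthetic_val_rmse)
print(estimate)
```

The result depends heavily on the tolerance chosen, so treat it as a rough guide rather than a precise threshold.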


In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('min_samples',
                         min_samples = min_samples,
)

result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.