#### Note: Please make your own copy of this notebook to run and execute, thank you!

1. Go to the menu tab on the top left corner
2. Click on "File"
3. Under the File tab menu click on "Save a copy in Drive..."


# Introduction to Model Validation








Model validation is the process of verifying how well a model works with data that wasn't used to train it. This allows us to identify problems and determine whether that particular model is useful for making predictions on new data. Model validation techniques assign each model a score, which allows us to compare performance between models. Our goal is to choose a model that will perform as well as possible on unseen data. We say that such a model is generalizable.

[Model Validation Glossary](https://docs.google.com/document/d/1Hd2kF7rMN-cY6ohzt291nO3I4W6LUPmKJfPVXIwzshs/edit):

Provided as a list of terms and definitions used in the Model Validation lesson to help keep track of the content.

# Example of Model Evaluation - Final Exam Scores

You've decided to tutor students in your math class and you want to predict who might need extra help preparing for their final exam. You only have time to tutor two students, so you want to make a model that does the best job of predicting who is likely to score the lowest on their final. 


You think maybe you can predict a student's final exam score based on how many hours they spend studying per week. You ask three friends who have already completed the class for their hours per week spent studying and final exam score. Jasmine spends 8 hours per week and scored a 95, George spends 6 hours per week and scored a 75, and Anne spends 6.5 hours per week and scored an 80. You think you have a perfect model!

Exam score = 10 * hours per week + 15

You decide to test your model, so you find your friend Xavier, who tells you he only spends 2 hours per week studying but scored a 78 on the final exam. By testing your model on data that wasn't included in the model fitting, you performed model validation! In this case, the predicted score for Xavier (10 * 2 + 15 = 35) was much lower than his actual score. The results of your model validation let you know that your model isn't capturing the relationship as perfectly as you initially thought. You will want to take this into consideration when deciding whether to use this model to predict future exam scores (unseen data).
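
The check on Xavier can be written out in a few lines of Python. This is only a sketch of the toy example from the text: the function name `predict_exam_score` and the dictionaries are made up for illustration, and the numbers are the ones quoted above.

```python
def predict_exam_score(hours_per_week):
    """Toy model from the text: exam score = 10 * hours per week + 15."""
    return 10 * hours_per_week + 15

# Friends whose data was used to come up with the model: (hours per week, actual score)
fitting_data = {"Jasmine": (8, 95), "George": (6, 75), "Anne": (6.5, 80)}

# Friend who was not used to build the model (our validation data)
validation_data = {"Xavier": (2, 78)}

def residuals(students):
    """Return actual score minus predicted score for each student."""
    return {
        name: actual - predict_exam_score(hours)
        for name, (hours, actual) in students.items()
    }

fit_residuals = residuals(fitting_data)
validation_residuals = residuals(validation_data)

print(f"Residuals on the fitting data: {fit_residuals}")
print(f"Residuals on the validation data: {validation_residuals}")
```

The residuals are all zero on the three friends the model was built from, while Xavier's residual is 43 points. A perfect fit on the fitting data says very little about performance on new data.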


In the lesson, we'll talk more about how we use model validation techniques for supervised learning models to simulate the model's performance on unseen data. Specifically, we will look at how we might use evaluation metrics to quantify our model's performance, why a model might underperform due to the data and the model representation, and how we can validate and tune our models for maximum performance.


# Evaluation Metrics

















Let's start by taking a look at some evaluation metrics that we can use to quantify our model's performance: R-Squared, Adjusted R-Squared, Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error. These metrics are all appropriate for regression models, which predict continuous values. Other types of models, such as classification models (for example, identifying a fruit as an apple or an orange), use different metrics.



Below is a simple dataset that we'll use for our calculations. It contains actual apple weights in grams ($y_{i}$) and apple weights that were predicted by a model ($\hat y_{i}$).

$$ 
y_{i} = [197, 183, 185, 192, 178] 
$$
$$
\hat y_{i} = [196, 181, 187, 192, 179] 
$$

We'll enter these values into NumPy arrays so that we can make calculations on them using Python.

In [0]:
# Import the Python NumPy Data Science Library
import numpy as np

# True values
y_i = np.array([197, 183, 185, 192, 178]) 

# Our model's predictions
y_i_hat = np.array([196, 181, 187, 192, 179])

### R-squared

[How to calculate R-squared](https://www.youtube.com/watch?v=w2FKXOa0HGA) (7:40)

In this tutorial, David Longstreet from the statisticsfun Youtube channel will walk through a demo describing how R-squared is calculated with some helpful visuals.



R-squared is often used as the default scoring metric for many implementations of linear regression models. The R-squared value measures how much better our model does at explaining the variance in the data compared to a horizontal line drawn through the mean of the data. The result is expressed as a proportion, with values typically ranging from 0 (none of the variance is explained) to 1 (all of the variance is explained).

Variance in our data comes from two sources. The first is the actual trend in the data, such as saying that the circumference of an apple is equal to $\pi $ times its diameter. The second source of variance is noise: apples are not perfect spheres, so their circumferences won't be predicted exactly by this formula. You can expect all real-world data to contain some amount of random variation, otherwise called noise.


Although we generally like to see R-squared values that are as high as possible, we want to make sure that our model is only describing the underlying trend in the data and not the random noise. A model that described both the underlying trend and the noise would have a high R-squared but it wouldn't be the most generalizable model. We'll talk more about this in the section on Bias and Variance but for now, let's start by looking at how R-squared is calculated.

$$
R^2 = 1- \frac{RSS}{TSS}
$$


The total sum of squares (TSS) describes the total variance in the data. In this equation we calculate, for each data point, how much the measured value ($y_{i}$) deviates from the average of all measured values ($\bar y$), square the difference, and sum the results. For our set of apple weights, $\bar y = 187$.

$$
TSS = {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}
$$

**Example:**

$$
TSS = (197-187)^{2} + (183-187)^{2} + (185-187)^{2} + (192-187)^{2} + (178-187)^{2}
$$

$$
 = (10)^{2} + (-4)^{2} + (-2)^{2} + (5)^{2} + (-9)^{2}
$$

$$
= 226
$$

The residual sum of squares (RSS) is the amount of variance that remains unexplained by the model. This calculation looks very similar to the total sum of squares, except instead of summing the squared difference of each measured value and the average value ($\bar y$), we are summing the squared difference of each measured value and the model's predicted value for that data point ($\hat y_{i}$). The values of ($y_{i}-\hat y_{i}$) are called residuals.

$$
RSS = {\sum _{i=1}^{n}(y_{i}-{\hat {y_{i}}})^{2}}
$$

**Example:**

$$
RSS = (197-196)^{2} + (183-181)^{2} + (185-187)^{2} + (192-192)^{2} + (178-179)^{2}
$$

$$
 = (1)^{2} + (2)^{2} + (-2)^{2} + (0)^{2} + (-1)^{2}
$$

$$
= 10
$$

From the TSS and RSS, we can now calculate R-squared.

**Example:**
$$
R^2 = 1- \frac{10}{226}
$$
$$
 \approx 1- 0.04
$$
$$
 = 0.96
$$

Our R-squared value is 0.96, or we could say that our model is explaining 96% of the variance in the data. Is this a good value? That depends on your data. For something like predicting the weights of apples, we would probably be pretty happy with this number. However, if we were modeling data that we would expect to be very low in noise, such as the diameter of an optical component in a telescope, we might not think this R-squared was acceptable.

If a model explains all the variance in the data, the RSS would be 0 and R-squared would equal 1. Here is the calculation for our apple dataset where all the variance is explained by the model, that is to say, all of the predicted values equal the actual values.

**Example:**

$$
RSS_{best} = (197-197)^{2} + (183-183)^{2} + (185-185)^{2} + (192-192)^{2} + (178-178)^{2}
$$

$$
 = (0)^{2} + (0)^{2} + (0)^{2} + (0)^{2} + (0)^{2}
$$

$$
= 0
$$

$$
R_{best}^2 = 1- \frac{0}{226}
$$
$$
 = 1
$$

An R-squared value of zero is telling you that your model isn't doing any better at explaining the variance in the data than just guessing the mean. In this case, the RSS equals the TSS and the R-squared would equal 0. Technically you could have a negative R-squared value, but if your model performs that poorly, you should probably just replace it with a model that predicts the mean every time. Therefore an R-squared of zero is the lowest we consider from a practical standpoint. 

**Example:**

$$
RSS_{worst} = (197-187)^{2} + (183-187)^{2} + (185-187)^{2} + (192-187)^{2} + (178-187)^{2}
$$

$$
 = (10)^{2} + (-4)^{2} + (-2)^{2} + (5)^{2} + (-9)^{2}
$$

$$
= 226
$$


$$
R_{worst}^2 = 1- \frac{226}{226}
$$
$$
 = 0
$$
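
To see how R-squared can drop below zero, here is a small sketch using the same apple weights. The "bad" predictions are hypothetical: they are simply the true weights in reverse order, chosen so that the predictions are worse than guessing the mean every time.

```python
import numpy as np

# Actual apple weights from the text
y_true = np.array([197, 183, 185, 192, 178])

# Hypothetical bad predictions (an assumption for illustration):
# the true weights in reverse order
y_bad = y_true[::-1]

# Same TSS and RSS definitions as above
tss = np.sum((y_true - y_true.mean()) ** 2)
rss = np.sum((y_true - y_bad) ** 2)

r2_bad = 1 - rss / tss

print(f"TSS = {tss}, RSS = {rss}, R-squared = {r2_bad:.2f}")
```

Because the RSS (884) is larger than the TSS (226), the ratio is greater than one and R-squared comes out negative.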

If we had many more data points, we would probably want to have a function to calculate R-squared for us. We could write our own:

In [0]:
def my_r2_score(y_true, y_predicted):
  # Calculate total sum of squares
  tss = np.sum((y_true - np.mean(y_true))**2)
  
  # Calculate residual sum of squares
  rss = np.sum((y_true - y_predicted)**2)

  # Calculate R-squared from RSS and TSS 
  my_r2 = (1 - rss/tss).round(2)

  return(my_r2)

Now let's test our custom R-Squared Score on our data.

In [0]:
# Call our function
r2 = my_r2_score(y_i, y_i_hat)

print(r2)

Or we could use a function from the [sklearn metrics module](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).

In [0]:
from sklearn.metrics import r2_score

# Built-in round() works whether sklearn returns a NumPy float or a plain Python float
r2_apple = round(r2_score(y_i, y_i_hat), 2)

print(r2_apple)

### Adjusted R-squared


The trouble with R-squared is that it never decreases, and almost always increases, as more features are added to the model, even when those features are uninformative. As we will learn later in the lesson, more features are not always better! We can account for R-squared's tendency to increase as new features are added by using the adjusted R-squared value. The adjusted R-squared adds a penalty term to the calculation for adding additional features. Adjusted R-squared only increases if the added feature improves the model more than would be expected by chance. The adjusted R-squared will decrease if the new feature improves the model less than would be expected by chance. In the equation below, n = number of data points and p = the number of features.

$$
R^2_{adj} = R^2 - \frac{(1-R^2) p}{n-p-1}
$$

We have five data points (n = 5) and for this calculation, let's say that our model included three features (p = 3) 

**Example:**

$$
R^2_{adj} = 0.96 - \frac{(1-0.96)*3}{5-3-1}
$$

$$
 = 0.84
$$

What's happening here? We're subtracting an amount as an adjustment to our R-squared value. This adjustment becomes larger the more features we have and smaller the more datapoints we have. It makes sense that this will give us information about whether a new feature is useful to our model given the amount of data that we have. 
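
The adjustment is simple enough to wrap in a helper. The sketch below uses a hypothetical function name, `adjusted_r2`, and implements the formula exactly as written above. The second call assumes a made-up dataset of 50 points, purely to show the penalty shrinking as n grows.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n data points and p features."""
    if n - p - 1 <= 0:
        raise ValueError("Need more data points than features plus one (n > p + 1)")
    return r2 - (1 - r2) * p / (n - p - 1)

# Values from the worked example: R-squared = 0.96, n = 5, p = 3
small_sample = adjusted_r2(0.96, n=5, p=3)

# Same R-squared and feature count, but with a hypothetical 50 data points
large_sample = adjusted_r2(0.96, n=50, p=3)

print(f"n = 5:  adjusted R-squared = {small_sample:.2f}")
print(f"n = 50: adjusted R-squared = {large_sample:.2f}")
```

With only five data points the three features cost 0.12; with fifty data points the same three features cost far less.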

[Root mean squared error](https://www.youtube.com/watch?v=zMFdb__sUpw): (6:42)

The next three error metrics all calculate the typical error associated with the model. This short video from Khan Academy describes one of these metrics, the root mean squared error (RMSE).

**Note:** The video uses $n-1$ to calculate from sample data whereas $n$ is used in populations.

### Mean Absolute Error (MAE)

The mean absolute error is calculated by summing the absolute values of the residuals ($y_{i}-\hat y_{i}$ values) and dividing by n = the number of data points. We use the absolute value so that errors in opposite directions don't cancel one another out. The units of mean absolute error are the same as the units of the target. 

$$
MAE = {{\frac {1}{n}}\sum _{i=1}^{n}|y_{i}-{\hat {y_{i}}}|}
$$

MAE calculation for our apple dataset

**Example:**

$$
MAE = \frac{1}{5} *[ |197-196| + |183-181| + |185-187| + |192-192| + |178-179| ]
$$
$$
= 1.2
$$

In [0]:
# Confirm our results with the sklearn function
from sklearn.metrics import mean_absolute_error

# Built-in round() works whether sklearn returns a NumPy float or a plain Python float
MAE = round(mean_absolute_error(y_i, y_i_hat), 2)

print(MAE)

Our dataset has an MAE of 1.2 grams.

### Mean Squared Error (MSE)

The mean squared error metric is calculated similarly to MAE, except that we square the residuals rather than taking their absolute values. You might also notice that this equation is equal to the RSS divided by the number of data points. In our R-squared calculation, we normalized the RSS using the TSS so that R-squared would range from zero to one. Here we instead divide the RSS by the number of data points, which gives us the average squared error of our predictions.

$$
MSE = {{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-{\hat {y_{i}}})^{2}}
$$

MSE calculation for our apple dataset

$$
MSE = \frac {1}{5}*[(197-196)^{2} + (183-181)^{2} + (185-187)^{2} + (192-192)^{2} + (178-179)^{2}]
$$

$$
 = \frac {1}{5}*[(1)^{2} + (2)^{2} + (-2)^{2} + (0)^{2} + (-1)^{2}]
$$

$$
= 2
$$

In [0]:
# Confirm our results with the sklearn function
from sklearn.metrics import mean_squared_error

# Built-in round() works whether sklearn returns a NumPy float or a plain Python float
MSE = round(mean_squared_error(y_i, y_i_hat), 2)

print(MSE)

Our dataset has an MSE of 2 squared-grams. If you're thinking that the meaning of a squared-gram is not very intuitive - keep reading! 

### Root Mean Squared Error (RMSE)

You may have looked at the results of the MSE calculation and wondered what a squared-gram means. The root mean squared error addresses the issue of interpretability by taking the square root of the MSE. This means that the units of RMSE will be the same as the units of our target. RMSE is often favored over MSE or MAE because of its relationship to the standard deviation (RMSE is also called the standard deviation of the residuals) and its interpretability.

$$
RMSE = \sqrt{{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-{\hat {y_{i}}})^{2}}
$$

RMSE calculation for our apple dataset

**Example:**

$$
RMSE = \sqrt{\frac {1}{5}*[(197-196)^{2} + (183-181)^{2} + (185-187)^{2} + (192-192)^{2} + (178-179)^{2}]}
$$

$$
 = \sqrt{\frac {1}{5}*[(1)^{2} + (2)^{2} + (-2)^{2} + (0)^{2} + (-1)^{2}]}
$$

$$
= \sqrt{2} \approx 1.41
$$


Let's code up our custom RMSE.

In [0]:
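# Let's write our own RMSE function on top of sklearn's mean_squared_error
# (sklearn 1.4 and later also provide sklearn.metrics.root_mean_squared_error)
def my_rmse(y_true, y_predicted):

  # Take the square root of sklearn's MSE, then round to two decimal places
  # with the built-in round(), which works on NumPy and plain Python floats
  rmse_calc = round(mean_squared_error(y_true, y_predicted)**0.5, 2)

  return rmse_calc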

Now we'll verify our function based on the prior data we calculated.

In [0]:
# Use our function to calculate RMSE
RMSE = my_rmse(y_i, y_i_hat)
print(RMSE)

Our dataset has an RMSE of approximately 1.41 grams. This is much more intuitive than 2 squared-grams.
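
As a cross-check, the three error metrics can also be computed directly from the residuals with NumPy, without sklearn. This is a minimal sketch that follows the MAE, MSE, and RMSE formulas given above. The array names are local to this example.

```python
import numpy as np

# Actual and predicted apple weights from the text
y_true = np.array([197, 183, 185, 192, 178])
y_pred = np.array([196, 181, 187, 192, 179])

# Residuals: actual minus predicted
residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # mean absolute error, in grams
mse = np.mean(residuals ** 2)      # mean squared error, in squared-grams
rmse = np.sqrt(mse)                # root mean squared error, in grams

print(f"MAE = {mae:.2f} g, MSE = {mse:.2f} g^2, RMSE = {rmse:.2f} g")
```

Note that RMSE (about 1.41 g) is larger than MAE (1.2 g) here. Squaring gives the larger residuals more weight, so RMSE is never smaller than MAE on the same data.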

# Final Exam Scores Continued

Your idea for modeling final exam score based on hours spent studying didn't seem very promising when you validated it, but you think that maybe you can improve it by collecting data on more students and including more features. 






Here is the data you have on 20 students from the previous year:

- average_quiz: average score on quizzes 
- homework_hours: hours spent on homework per week
- final_exam: grade on the final exam 
- shoe_size: student's shoe size
- siblings: number of siblings
- ice_cream: favorite ice cream flavor; 1 = chocolate, 2 = pistachio


Taking a look at the possible features, your intuition tells you that the average quiz score and hours spent on homework per week might be good predictors of how the student will do on the final. But what about a model that includes the number of siblings? Maybe an older sibling would help students with study tips. And although it's possible that students who prefer pistachio ice cream are naturally better test takers, this probably isn't a meaningful feature for our model. Including features that don't correlate well with what we are trying to predict can lead to our model fitting noise in the data. Shoe size is another feature that probably doesn't correlate with final exam score. This feature is also problematic because it could be an unintended proxy for whether the student is male or female. From an ethics standpoint, we really don't want to build a model that predicts, for example, that female students are likely to score higher than male students. We therefore want to try to leave these types of features out. In this exercise, we're going to build models based on different sets of features and use model validation techniques to compare their performance.

In [0]:
# Import the Pandas Data Science Library
import pandas as pd

In [0]:
# Upload and inspect data
grades = pd.read_csv('https://raw.githubusercontent.com/openlearningbeta/Student-Dataset/master/student_dataset.csv')

# You can change the 5 to a 20 if you want to see all the rows of data
grades.head(5)

In [0]:
# Separate out our target values
y = grades.pop('final_exam')
X = grades

We'll start off by building a multivariate linear regression model that predicts final exam scores based on average quiz score and hours per week spent studying. We'll call this model "model A".


In [0]:
from sklearn.linear_model import LinearRegression

# Use the average_quiz and homework_hours features
X_A = X[['average_quiz', 'homework_hours']]

# Instantiate a linear regression object
model_A = LinearRegression()

# Fit to the data
model_A.fit(X_A, y)

# Calculate the predicted y values 
predicted_grades = model_A.predict(X_A)

We'll check our model using RMSE as our error metric. Remember that RMSE is the standard deviation of your residuals. You think that you can accurately identify which students will need help on their upcoming final if your model has a standard deviation of residuals that is below 2 points. 

In [0]:
# Calculate RMSE using the my_rmse function we wrote
model_A_RMSE = my_rmse(y, predicted_grades)

print(f' The RMSE of model A is {model_A_RMSE} points')

The RMSE of your model is 1.46 points, which is lower (and that's a good thing) than the 2 points you were hoping for. 

# Training and Testing our Model

We mentioned in the introduction that our goal is for our model to perform as well as possible on unseen data. In our example, how can we ensure before deploying our model that it will do a good job of predicting the test scores of future students?


The answer is that before we do any model fitting, we split our data into a training and a test set. The training data is used to fit the model. The test set is sometimes also called the holdout set because we set this data aside and don’t touch it until the end of our model fitting process. This is the best way to simulate unseen data. Evaluating how our model performs on the test set is how we ensure that we are building a model that is generalizable. 

sklearn has a handy function called [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for splitting the data randomly into training and test sets. 

In [0]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

# See what our splits look like
print(f'The training features matrix has a shape of {X_train.shape}')
print(f'The test features matrix has a shape of {X_test.shape}')

We can take a look at which indices are assigned to the training set and which are assigned to the test set.

In [0]:
print(f'Training set indices: {list(X_train.index)}')
print(f'Test set indices: {list(X_test.index)}')

Model A looked promising with an RMSE below the two-point threshold that we defined. Let's re-fit the model on the training data and check the model's performance on the test data to be sure. 


In [0]:
# Use the average_quiz and homework_hours features
X_train_A = X_train[['average_quiz', 'homework_hours']]
X_test_A = X_test[['average_quiz', 'homework_hours']]

# Fit our model on the training data
model_A.fit(X_train_A, y_train)

# Use our model to predict y-values for training and test sets
model_A_predictedytrain = model_A.predict(X_train_A)
model_A_predictedytest = model_A.predict(X_test_A)

# Calculate error metrics (RMSE) for training and test sets
train_RMSE = my_rmse(y_train, model_A_predictedytrain)
test_RMSE = my_rmse(y_test, model_A_predictedytest)

print(f'The train RMSE for model A is {train_RMSE} points and the test RMSE is {test_RMSE} points')

Both the training and test errors are under two points. 

# Causes of Error






























Why do some models perform better than others and what causes differences between our training and testing results? We mentioned in the introduction that we want to create a model that does the best at predicting on unseen data. In order for this to be the case, we need two things:

1. Ensure our model captures the underlying trend in the data
2. Ensure our model is not capturing random noise in the data

In the following section, we'll first walk through some examples of how a model either cannot learn from the data or is capturing noise. Next, we'll analyze whether the model is actually learning by comparing the training and testing results over time, and then see how the number and size of our features affect the ability of the model to learn the data accurately.

### Errors Due to Bias and Variance

You will often hear models discussed in terms of their bias and variance and the closely related concepts of overfitting and underfitting. Bias in machine learning can come from a few possible sources and in this case, we are talking about bias in the way that we fit our model. This is distinct from sampling bias which is when we use a non-representative sample to train our model. We'll talk a little more about sampling bias in the summary section. 

[Overfitting and Underfitting](https://www.youtube.com/watch?time_continue=30&v=xj4PlXMsN-Y): (5:07)

This first video from Udacity provides a conceptual overview of what it means to overfit or underfit. The example here uses a classification problem, but the same concepts apply to regression.

[Bias and Variance](https://www.youtube.com/watch?v=EuBBz3bI-aA&t=285s): (6:35)

In the second video, Josh Starmer from StatQuest will illustrate a high bias and high variance model for some sample data and talk about overfitting and underfitting specific to regression.


Bias describes how closely the model predicts the true values. A model that does a good job of capturing the actual trend in the data will have a lower bias than one that does not. A model that doesn't describe the underlying trend in the data is going to perform poorly on both our training data as well as unseen data. Imagine that you have data where the actual relationship between x and y is $ y = x^2$, but you attempt to fit it using the line $y = x$. Your model has a high bias because it isn't capturing the true quadratic relationship. However since the model will do similarly poorly on any dataset, it is a low variance model. We call this problem **underfitting**.  
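
The $y = x^2$ thought experiment can be reproduced with a few lines of NumPy. This is a sketch under stated assumptions: the data is synthetic and noise-free, and instead of forcing the line $y = x$ we let `np.polyfit` find the best possible straight line, which still cannot follow the curve.

```python
import numpy as np

# Synthetic, noise-free data where the true relationship is y = x^2
x = np.linspace(-3, 3, 50)
y = x ** 2

def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# High-bias model: the best-fitting straight line (degree 1)
line_coeffs = np.polyfit(x, y, deg=1)
rmse_line = rmse(y, np.polyval(line_coeffs, x))

# Model with the right form: a quadratic (degree 2)
quad_coeffs = np.polyfit(x, y, deg=2)
rmse_quad = rmse(y, np.polyval(quad_coeffs, x))

print(f"RMSE of the straight line: {rmse_line:.2f}")
print(f"RMSE of the quadratic:     {rmse_quad:.2e}")
```

Even on the data it was fit to, the straight line leaves a large error, while the quadratic's error is essentially zero. Poor performance on the training data itself is the signature of underfitting.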

Let's see if we can identify an underfit model. Model A had an RMSE of 1.27 points on test data. But maybe getting students to report their time spent studying is difficult. Could we fit the data as well just by using average quiz score?


In [0]:
# Instantiate a new model object, model B
model_B = LinearRegression()

# Use average_quiz as the only feature
X_train_B = X_train[['average_quiz']]
X_test_B = X_test[['average_quiz']]

# Fit model B using training data with a single feature
model_B.fit(X_train_B, y_train)

# Use model B to predict y-values for training and test sets
model_B_predictedytrain = model_B.predict(X_train_B)
model_B_predictedytest = model_B.predict(X_test_B)

# Calculate error metrics (RMSE) for training and test sets
train_RMSE = my_rmse(y_train, model_B_predictedytrain)
test_RMSE = my_rmse(y_test, model_B_predictedytest)

print(f'The train RMSE for model B is {train_RMSE} points and the test RMSE is {test_RMSE} points')


Yikes, the RMSE got much worse for both the training and test sets compared with model A. This is a characteristic of an underfit model. If your model is underfitting, you should consider adding additional features. 

Variance describes the amount the model would change if a different set of training data was used. A model that is capturing the noise in the data has a high variance. This is because noise is random and the noise from a different set of data would look completely different. Imagine that your data has an actual relationship of $ y = x$ and there's some noise in your measurements. Instead of fitting it with the line $y = x$, you find a function that produces a squiggly line that passes directly through every data point. This model is going to do a great job describing that one particular training data set. ***But,*** if you use this model to predict on unseen data with a different pattern of noise, it will perform far worse than expected. Since every dataset would be fit with a differently shaped squiggly line using this model, it is a high variance model. A high variance model will have a low bias - it's capturing the trend in the data, it's just doing it a little too well. We call this problem **overfitting**.
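
The squiggly-line scenario can be sketched with synthetic data as well. Everything here is an assumption for illustration: the noise level, the random seed, the ten training points, and the use of a degree-9 polynomial (which has enough freedom to pass through all ten points) as the "squiggly" model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Training data: true relationship y = x plus random noise
x_train = np.linspace(0, 1, 10)
y_train = x_train + rng.normal(scale=0.1, size=x_train.size)

# Unseen data: same relationship, different x locations, different noise
x_new = np.linspace(0.05, 0.95, 10)
y_new = x_new + rng.normal(scale=0.1, size=x_new.size)

def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

results = {}
for degree in (1, 9):
    # Fit a polynomial of the given degree to the training data only
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = {
        "train": rmse(y_train, np.polyval(coeffs, x_train)),
        "new": rmse(y_new, np.polyval(coeffs, x_new)),
    }
    print(f"degree {degree}: train RMSE = {results[degree]['train']:.4f}, "
          f"unseen RMSE = {results[degree]['new']:.4f}")
```

The degree-9 fit drives the training error to nearly zero because it passes through every training point. Its error on the unseen points is typically noticeably larger than its training error, which is the gap that signals overfitting.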


Let's see if we can identify an overfit model. Model A has an RMSE of 1.27 points on test data. But let's be greedy! Using fewer features in model B didn't work very well. I wonder if we can make a better model by adding in more features...


In [0]:
# Instantiate a new model object
model_C = LinearRegression()

# Use every feature
X_train_C = X_train
X_test_C = X_test

# Fit model C on the training data
model_C.fit(X_train_C, y_train)

# Use model C to predict y-values for training and test sets
model_C_predictedytrain = model_C.predict(X_train_C)
model_C_predictedytest = model_C.predict(X_test_C)

# Calculate error metrics (RMSE) for training and test sets
train_RMSE = my_rmse(y_train, model_C_predictedytrain)
test_RMSE = my_rmse(y_test, model_C_predictedytest)

print(f'The train RMSE for model C is {train_RMSE} points and the test RMSE is {test_RMSE} points')

The training RMSE went down, but the test RMSE went up. This is characteristic of an overfit model. If your model is overfitting, you should fix it by removing features.

Now, you may argue that the test RMSE for model C didn't increase by very much. It's good to remember that all other things being equal, simpler is better! Simpler models are at less risk for overfitting and require less computational time. A good rule of thumb is to choose the simplest model that has an error within one standard error of the model with the lowest error. When we talk about model fine tuning later in this lesson, we'll demonstrate how to find the standard error of your error metric. 

### Learning Curves

Now that we have seen how a model can under or overfit, let's see how our training and testing results change over time when we give our models more data by watching some example videos offered by Udacity and Coursera:

*   [Learning Curves](https://www.youtube.com/watch?v=02soMdTCMQo): (3:02)

*   [Learning Curves](https://www.coursera.org/lecture/machine-learning/learning-curves-Kont7): (11:53)




### Sampling Density and the Curse of Dimensionality

So what causes our model to under or overfit? We know that a high bias model is one that has too few parameters and features to capture the relationship in the data, but why might too many features and parameters cause issues?

It is important to realize these models ***find correlations*** in our data, but this is ***not the same as causation***. In statistics, this is why we require large samples of data to verify we are obtaining valid results. To help answer this question in more detail, the following two videos from Udacity will explain the curse of dimensionality and provide an illustration of why we need exponentially more data for each feature we add to our model.

**Note:** although the videos mention a supervised machine learning algorithm called K-Nearest Neighbors (KNN), this curse, unfortunately, applies to all machine learning algorithms:

*  [Curse of Dimensionality](https://www.youtube.com/watch?v=QZ0DtNFdDko) (3:02)
*  [Curse of Dimensionality Two](https://www.youtube.com/watch?v=OyPcbeiwps8) (7:09)

As we decide how many features to use in our model, a useful thought experiment is to consider the sampling density. Sampling density tells us how much data we have per dimension (feature) we add to the model. The formula for sampling density is given below, where N is the number of data points and D is the number of features (or dimensionality).


$$
\text{Sampling Density} = N^{1/D}
$$

In our dataset, we have 20 data points. If we include only two features in our model, we have around 4.5 data points per dimension. If we increase the number of features in the model to 5, we are down to fewer than two data points per dimension. When data is sparse, it becomes more and more difficult for our machine learning model to find meaningful relationships. This dilemma is known as the curse of dimensionality.


In [0]:
# Change the denominator value of the exponent to find the density with different numbers of features
sampling_density = (20)**(1/2)
print (sampling_density)
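
Rather than editing the exponent by hand, a short loop can tabulate the sampling density for every feature count from one to five. This is a small sketch: the dictionary name `densities` is made up for this example, and 20 is the number of students in our dataset.

```python
n_points = 20  # number of students in our dataset

# Sampling density N^(1/D) for D = 1 to 5 features
densities = {d: n_points ** (1 / d) for d in range(1, 6)}

for d, density in densities.items():
    print(f"{d} feature(s): {density:.2f} data points per dimension")
```

The density falls quickly: from 20 points per dimension with one feature to fewer than two with five features.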

### Model Complexity

As you may have noticed there is a trade-off in the simplicity or complexity of a model given a fixed size of data. If it is too simple, our model cannot learn about the relationship in the data and misrepresents it. However, if our model is too complex, we need more data to learn the underlying relationship. Otherwise, it is very common for a model to infer relationships that might not actually exist.

What this means for us is that each model will have some number of features that results in the best performance. Using fewer features and model parameters will result in an underfit model. Using more features and parameters will result in overfitting. You will often hear this described as the bias-variance trade-off. This occurs because as the complexity of the model grows, the bias tends to decrease and the variance tends to increase. The good news is that model validation techniques allow us to identify the ideal number of features.


A secondary reason we may want to eliminate features from our model is that each feature adds another dimension to our model and with it, the run time on our fit increases. This can be an important consideration when working with big data or resource intensive models such as neural nets. 

In [0]:
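from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# How does the run time of this fit change with differing numbers of features or samples?
# Separate variable names keep the exam data stored in X and y from being overwritten
X_timing, y_timing = make_regression(n_samples = 10000, n_features = 100)
timing_model = LinearRegression()

# %timeit is an IPython/Jupyter magic command, so this line only runs inside a notebook
%timeit timing_model.fit(X_timing, y_timing)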

# Model Selection and Fine Tuning







 











Earlier we looked at how to spot an underperforming model and what the root causes might be. In this section, we'll learn how to split our training data to create a validation set, which we can use to verify how well our model is generalizing, and then use a parameter tuning algorithm called grid search to help automate the selection of model parameters going forward.

### Adding a Validation Set

[Cross-Validation](https://www.youtube.com/watch?v=sFO2ff-gTh0) (6:06)

The following video from Udacity will start you off with some perspective on why we use training and test sets and what it means to perform cross-validation. 




So far we've been using data that's split into training and testing sets. An even better way to validate models is to further split the training data into training and validation sets. 



Recall the training set is what we use to fit the model. So why might we want to split it further? The validation data is what we use to check how the model is performing as we tune its parameters. Validation scores allow us to compare models and check for things like overfitting and underfitting so that we can choose what we think is the best model. 



The validation score affects which model we select, but since we ultimately want our model to perform on data that has nothing to do with the model training or selection, we maintain a test (or holdout set) as a final check of performance. Once we have finished tuning our model parameters and select what we think is the best option, we use the test set to confirm the generalizability of the model. 
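
A three-way split can be sketched with nothing more than a shuffled list of row indices. The sizes below (10 training, 5 validation, 5 test rows out of 20) and the seed are assumptions chosen for illustration. Calling sklearn's `train_test_split` twice, first to hold out the test set and then to split the remainder, would be another way to get a similar result.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is repeatable

n_rows = 20  # number of students in our dataset

# Shuffle the row indices once, then slice them into three groups
shuffled = rng.permutation(n_rows)
test_idx = shuffled[:5]          # held out until the very end
validation_idx = shuffled[5:10]  # used to compare and tune models
train_idx = shuffled[10:]        # used to fit the models

print(f"Training indices:   {sorted(train_idx.tolist())}")
print(f"Validation indices: {sorted(validation_idx.tolist())}")
print(f"Test indices:       {sorted(test_idx.tolist())}")
```

Every row lands in exactly one of the three groups, so no data point is used both to fit a model and to score it.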

### K-fold Cross-Validation


While having a training and a validation set is a good start, it's possible that the model fit and validation error could differ significantly from one random split to the next. In order to reduce the variance in our validation error, we use k-fold cross-validation. In k-fold cross-validation, we divide the training data into k folds and repeat the validation process k times, each time holding out a different fold for validation and training on the rest. We then take the average of the resulting scores.

[Cross-Validation](https://www.youtube.com/watch?v=fSytzGwwBVw) (6:04)

This video from Josh Starmer at StatQuest explains why we use k-fold cross-validation and illustrates how the training data is divided into folds.

Let's look at an example split using indices for our training set of exam data:

In [0]:
# Print the list of training indices
print(f'Training set indices: {list(X_train.index)}')

If we performed 3-fold cross-validation on this data, our splits would look like this:

Fold 1:  

*   Validation indices (first 1/3 of the list): 5, 11, 3, 18, 16 
*   Training indices:  13, 2, 9, 19, 4, 12, 7, 10, 14, 6

Fold 2:  

*   Validation indices (second 1/3 of the list):  13, 2, 9, 19, 4
*   Training indices:  5, 11, 3, 18, 16, 12, 7, 10, 14, 6

Fold 3:  

*   Validation indices (third 1/3 of the list): 12, 7, 10, 14, 6
*   Training indices:  5, 11, 3, 18, 16, 13, 2, 9, 19, 4
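
The folds listed above can be reproduced with sklearn's `KFold` splitter. The sketch below assumes the training indices appear in the order implied by the three folds; the list `train_indices` is typed in by hand for illustration rather than read from `X_train`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Index order assumed from the folds listed above
train_indices = [5, 11, 3, 18, 16, 13, 2, 9, 19, 4, 12, 7, 10, 14, 6]

# shuffle is False by default, so each fold is a consecutive block of the list
kfold = KFold(n_splits=3)

folds = []
for fold_number, (train_pos, val_pos) in enumerate(kfold.split(np.array(train_indices)), start=1):
    # KFold returns positions within the list, so map them back to the original indices
    training = [train_indices[i] for i in train_pos]
    validation = [train_indices[i] for i in val_pos]
    folds.append((training, validation))
    print(f'Fold {fold_number}: validation {validation}, training {training}')
```

Every index is used for validation exactly once and for training in the other two folds.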


We're now going to perform three-fold cross-validation on model A. This model had the lowest RMSE when we used just the test set for confirmation, but cross-validation is a more rigorous validation technique so we want to use this to make sure this is really our best model.

We'll use sklearn's [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) function for our k-fold cross-validation. This function performs both the splitting of the data and the scoring of the model.

Our inputs to the cross_validate function are:

*   model_A: our model
*   X_train[['average_quiz', 'homework_hours']]: X values for training model A
*   y_train: y values for training model A
*   scoring = 'neg_mean_squared_error': sklearn scorers follow a "higher is better" convention, so the mean squared error is returned as a negative number. We'll calculate RMSE from it later (newer versions of sklearn also offer 'neg_root_mean_squared_error')
*   cv = 3: Perform 3-fold cross-validation
*   return_train_score = True: the function returns the validation errors by default. We're telling it to return the training errors also

In [0]:
from sklearn.model_selection import cross_validate

model_A_errors = cross_validate(model_A, X_train[['average_quiz', 'homework_hours']], y_train, scoring = 'neg_mean_squared_error', cv=3, 
                            return_train_score = True)

print(model_A_errors)


The function has returned a dictionary whose keys include train_score and test_score. Each holds an array of three negative mean squared errors, one per fold (test_score contains the validation scores). You can see that these errors vary a bit from fold to fold. We now convert each one to an RMSE and average them to get our final result.

In [0]:
# Calculate root mean squared error (RMSE) for each training fold
train_RMSE = [((-error)**0.5).round(1) for error in model_A_errors['train_score']]

# Average the RMSEs
train_error = np.mean(train_RMSE).round(2)

# Calculate RMSE for each test fold
validation_RMSE = [((-error)**0.5).round(1) for error in model_A_errors['test_score']]

# Average the RMSEs
validation_error = np.mean(validation_RMSE).round(2)

# We'll look at the average RMSE and also the standard error of our result
print(f'The validation RMSE for model A is {validation_error} and the training RMSE is {train_error}')
print(f'The standard error of the validation RMSE is {(np.std(validation_RMSE)/np.sqrt(len(validation_RMSE))).round(2)}')

Model A has a validation RMSE of 1.87. Let's apply the same code to model C so that we can compare the two.

In [0]:
# Same thing for model C
model_C_errors = cross_validate(model_C, X_train, y_train, scoring = 'neg_mean_squared_error', 
                                cv=3, return_train_score = True)

# Calculate RMSE for each training fold
train_RMSE = [((-error)**0.5).round(1) for error in model_C_errors['train_score']]

# Average RMSE
train_error = np.mean(train_RMSE).round(2)

# Calculate RMSE for each test fold
validation_RMSE = [((-error)**0.5).round(1) for error in model_C_errors['test_score']]

# Average RMSE
validation_error = np.mean(validation_RMSE).round(2)

print(f'The validation RMSE for model C is {validation_error} and the training RMSE is {train_error}')
print(f'The standard error of the validation RMSE is {(np.std(validation_RMSE)/np.sqrt(len(validation_RMSE))).round(2)}')

Model C has a validation RMSE of 2.4. Based on our k-fold cross-validation, model A is the better model because it has a lower validation RMSE (1.87).
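
Since the same steps were applied to both models, they could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `cv_rmse_summary` and the synthetic data are assumptions for illustration. Unlike the cells above, it does not round each fold's RMSE before averaging, so results may differ slightly in the last digit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


def cv_rmse_summary(model, X, y, cv=3):
    """Return mean training RMSE, mean validation RMSE, and the
    standard error of the validation RMSE from k-fold cross-validation."""
    scores = cross_validate(model, X, y, scoring='neg_mean_squared_error',
                            cv=cv, return_train_score=True)
    # Scores are negative MSEs, so flip the sign before taking the square root
    train_rmse = np.sqrt(-scores['train_score'])
    validation_rmse = np.sqrt(-scores['test_score'])
    standard_error = np.std(validation_rmse) / np.sqrt(len(validation_rmse))
    return train_rmse.mean(), validation_rmse.mean(), standard_error


# Synthetic stand-in data (an assumption; not the exam dataset)
X_demo, y_demo = make_regression(n_samples=60, n_features=2, noise=3.0, random_state=0)

train_error, validation_error, standard_error = cv_rmse_summary(LinearRegression(), X_demo, y_demo)
print(f'Training RMSE: {train_error:.2f}, validation RMSE: {validation_error:.2f}')
print(f'Standard error of the validation RMSE: {standard_error:.2f}')
```

With a helper like this, comparing two models becomes two function calls rather than two copies of the same cell.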

**Making Predictions**

We've verified our choice of model A using 3-fold cross-validation. We can now use this model to get our final score on the test data.

In [0]:
# Predict using model A
model_A_predictedytest = model_A.predict(X_test_A)

# Calculate the error metric (RMSE) for the test set
test_RMSE = my_rmse(y_test, model_A_predictedytest)

print(f'The test RMSE is {test_RMSE} points')

### Grid Search

[Grid search](https://scikit-learn.org/stable/modules/grid_search.html) is a powerful tool that allows us to define which hyperparameters we want to tune and which values to try for each. Hyperparameters are inputs to a model that control how the model is fit. Linear regression doesn't have many hyperparameters to adjust (some types of models have many more), but let's say we wanted to try both True and False for the fit_intercept parameter and both True and False for the normalize parameter. (Note that normalize was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell below requires an older version of scikit-learn.) For each combination of hyperparameter values, the grid search will create a model and score it using k-fold cross-validation. The grid search object has many available attributes which contain information about the fit. Here we're going to look at the attribute that contains the best hyperparameter values. You will see another example of grid search as part of the student project on Medical Costs.

In [0]:
from sklearn.model_selection import GridSearchCV

# Instantiate a model object to use in the grid search 
gridsearch_model = LinearRegression()

# Dictionary of which hyperparameters to test and which values to use
parameters = [{'fit_intercept': [True, False], 'normalize': [True, False]}]

# Instantiate a grid search object for our linear regression model
# Use all possible combinations of values in the hyperparameter grid
# Perform 3-fold CV
# Use negative MSE for scoring
# Return training scores in addition to validation scores which are returned by default
# At the end of the search, refit the estimator using the best parameters that the grid search found
search = GridSearchCV(estimator = gridsearch_model,  param_grid = parameters, cv = 3, scoring = 'neg_mean_squared_error', 
          return_train_score = True, refit = True)

# Fit the data using Grid Search object
search.fit(X_train, y_train)

# Print the best combination of hyperparameters from our parameters grid
print(search.best_params_)

Using False for fit_intercept and True for normalize gave our best score for this model.
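
Because the normalize parameter was removed from `LinearRegression` in scikit-learn 1.2, the cell above will not run on recent versions. The sketch below shows the same grid search pattern using two hyperparameters that exist in current releases, `fit_intercept` and `positive`. It runs on synthetic data, so the names `X_demo`, `y_demo` and `demo_search` are assumptions for illustration and the best parameters it reports say nothing about the exam data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a nonzero intercept (bias) so fit_intercept matters
X_demo, y_demo = make_regression(n_samples=90, n_features=3, noise=5.0,
                                 bias=20.0, random_state=0)

# Two values for each of two hyperparameters gives 2 x 2 = 4 combinations
param_grid = {'fit_intercept': [True, False], 'positive': [True, False]}

demo_search = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=3,
                           scoring='neg_mean_squared_error', return_train_score=True)
demo_search.fit(X_demo, y_demo)

print(f"Combinations tried: {len(demo_search.cv_results_['params'])}")
print(f'Best hyperparameters: {demo_search.best_params_}')
print(f'Best validation RMSE: {(-demo_search.best_score_) ** 0.5:.2f}')
```

The `cv_results_` attribute holds the scores for every combination, which is useful when you want to see how close the runners-up were.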

### Regularization

One extra step we can take to help tune our model is regularization. While the name might sound fancy, it is simply a way of keeping the model weights (parameters) small, which makes the model simpler and helps prevent overfitting.

For example, instead of using an ordinary regression model we can use ***Regularized Regression***, which adds a penalty to the loss function based on the size of the coefficients. As a result, the estimated coefficients of some features will shrink. With Lasso and Elastic net, some coefficients may even be reduced to exactly zero, which eliminates those features from the model. Ridge regression shrinks coefficients toward zero but generally does not eliminate them. The strength of the penalty is usually called lambda (in sklearn it's called alpha), and it determines how much the coefficients shrink and how many are eliminated. Ridge regression, Lasso regression, and Elastic net are commonly used regularized regression estimators.
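
To make the shrinking effect concrete, the sketch below fits ordinary, Ridge, and Lasso regression to synthetic data in which only 3 of 10 features carry information. The data and the alpha values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 10 features, only 3 of which influence the target
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=10.0, random_state=0)

# alpha values are arbitrary illustrations of the penalty strength
models = {'ordinary': LinearRegression(), 'ridge': Ridge(alpha=10.0), 'lasso': Lasso(alpha=5.0)}

zero_counts = {}
coef_norms = {}
for name, estimator in models.items():
    estimator.fit(X_demo, y_demo)
    # Count coefficients that are exactly zero (eliminated features)
    zero_counts[name] = int(np.sum(estimator.coef_ == 0))
    # Overall size of the coefficient vector
    coef_norms[name] = float(np.linalg.norm(estimator.coef_))
    print(f'{name}: {zero_counts[name]} coefficients equal to zero, '
          f'coefficient norm {coef_norms[name]:.1f}')
```

Try increasing or decreasing alpha to see how the number of eliminated features changes.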

Which features do we want to eliminate? Features that don't correlate well with the target should be eliminated, as they don't provide us with information and including them could result in overfitting. Another common reason is that a feature can be derived from another feature or features in the dataset. This is known as multicollinearity. For example, the diameter of an apple is highly correlated with its circumference. Adding one of them to the model might give you new information about the apple's weight, but using both is redundant and could result in a model with unstable coefficients. Regularized regression can help to eliminate these types of features from the model.
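
A quick way to spot this kind of redundancy is to check the correlation between features. The sketch below uses made-up apple measurements (the sizes and noise level are assumptions) to show how close to 1 the correlation between diameter and circumference would be.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical apple diameters in centimeters
diameter_cm = rng.normal(loc=7.5, scale=0.5, size=200)

# Circumference is pi times the diameter, plus a little measurement noise
circumference_cm = np.pi * diameter_cm + rng.normal(scale=0.1, size=200)

# Pearson correlation between the two features
correlation = np.corrcoef(diameter_cm, circumference_cm)[0, 1]
print(f'Correlation between diameter and circumference: {correlation:.3f}')
```

When two features are this strongly correlated, keeping only one of them loses very little information.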

# Summary

In this lesson, we covered topics related to validating a model. These techniques are all aimed at choosing the model that will have the best performance on unseen data. We talked first about what metrics are typically used to evaluate regression models. We then discussed how the number of features included in the model can cause overfitting (a high variance model) or underfitting (a high bias model). We discussed how cross-validation techniques allow us to compare models and how we can use them to find the ideal number of features. We also introduced the concepts of grid search and regularization as tools for model fine-tuning, which you will see more of in the mini-project.

***Some final words on model performance***

What if you do your due diligence with model validation and then find out that your model doesn't perform as you hoped in the real world? How might the test data differ from unseen data?

When we apply machine learning algorithms to data, we are making some assumptions about that data. One assumption is that the data is representative of our population. If the sample isn't representative of our population, we might have a bias in our sample and our sampling methods should be examined.

For example, what if you had an idea to predict a person's likelihood of developing breast cancer but you only used male subjects to train your model? If you tried to deploy such a model, you would probably find that it vastly underpredicts breast cancer rates for the entire adult population because men are much less likely to develop this type of cancer than women. Although this sort of model may seem ridiculous, this type of bias isn't uncommon. It was only in the past few years that the NIH started requesting that grant applicants include [both male and female subjects in their animal studies](https://www.nature.com/news/policy-nih-to-balance-sex-in-cell-and-animal-studies-1.15195).

**Reflections:**

*   What do you think are possible unintended consequences of AI?

*   Give an example of a decision that an AI could make that may have serious consequences. Do you think such dire situations should be determined by AI or by humans?