### UBC Extended Learning
#### Instructor: Socorro Dominguez
#### Module 03

- [ ] Split a dataset into train and test sets using the train_test_split function.
- [ ] Explain the difference between train, validation, test, and "deployment" data.
- [ ] Identify the difference between training error, validation error, and test error.
- [ ] Explain cross-validation and use cross_val_score() and cross_validate() to calculate cross-validation error.
- [ ] Explain overfitting, underfitting, and the fundamental tradeoff.
- [ ] State the golden rule and identify the scenarios when it's violated.

#### Test/train split


- If you were studying from practice exams, what would you do to get an idea of how you will perform on the real one?
- Idea:
    - Goal is to *learn the underlying concepts* rather than *memorize the questions/answers*
        1. Learn from some practice exams / questions that you do again and again (training data)
        2. Write a *new* practice exam that you have set aside and never looked at (test data)
- This strategy will give you an idea of how well you will generalize



- With a sufficiently powerful model, we can fit the training data arbitrarily well (lower and lower training error)
- What we're actually trying to achieve is good performance on **new data that our model was never trained on**
- We can train our model on just a subset of our whole dataset (**train set**) and evaluate its performance on the rest (**test set**)
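
A minimal sketch of such a split with scikit-learn's `train_test_split`. The iris dataset, the decision tree, the 80/20 split, and the `random_state` value are assumptions chosen for illustration, not part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (assumption): 150 rows, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as the test set.
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Fit on the train set only.
model = DecisionTreeClassifier(random_state=123)
model.fit(X_train, y_train)

# Accuracy on data the model has seen vs. data it has never seen.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train accuracy: {train_score:.3f}")
print(f"test accuracy:  {test_score:.3f}")
```

The test accuracy is the number that estimates how well the model generalizes; the train accuracy only says how well it fits what it has already seen.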

<img src="images/test_set.png" style="width: 700px;"/>

![](images/overfitting_training.png)

#### Overfitting & Underfitting

- Imagine you only had to study one mock exam, and you knew the real exam would contain exactly the same questions

- Later on, if you had to take a different exam, would you feel just as confident?

![](images/overfitting.png)
- Data has two components: signal (pattern) + noise
- Example: predicting house prices from # of bedrooms, area, age, etc.
    - Signal: degree to which these features influence the price
    - Noise: random variation, or variation due to unknown features
- When the model is fitting the noise, it is overfitting
- **Overfitting** is when the model corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably. The model is memorizing noise.
- **Underfitting** is when a model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data.
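
One way to see both behaviours is to vary model complexity and compare training error with test error. The sketch below assumes synthetic data from `make_classification` (with `flip_y` standing in for noise) and decision trees of different depths; these choices are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): flip_y assigns random labels to about 20%
# of the rows, which plays the role of noise.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# max_depth=1 is a very simple model; max_depth=None lets the tree grow
# until it fits every training row.
results = {}
for depth in [1, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_error = 1 - tree.score(X_train, y_train)
    test_error = 1 - tree.score(X_test, y_test)
    results[depth] = (train_error, test_error)
    print(f"max_depth={depth}: train error={train_error:.3f}, test error={test_error:.3f}")
```

A large gap between training and test error for the unrestricted tree is the signature of overfitting, while a high error on both sets for the shallowest tree points towards underfitting.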

#### Sampling bias

- Remember that the model just learns patterns in the data you give it.
- If our dataset isn't representative of the type of data we want our model to handle afterwards, it is not going to perform well.
- Most machine learning models fail dramatically for **out of distribution** data that is very different from what we've trained on.

#### Example
Scenario 1: You studied very hard for a test and were very comfortable with the mock exams. But on the day of the test you are very nervous and don't perform well.

Scenario 2: You never studied, and you partied the night before the test. You barely made it to the test, but it was extremely easy, and the few questions you didn't know were multiple choice and luck was on your side. You achieve a perfect score.

If you only take that **one** test, is it enough to really measure your knowledge?

Essentially **all** datasets are biased one way or another. Train yourself to think of different ways
that they could be biased and different reasons why. Understanding the conditions under which your model
will work and fail is essential when deploying it in the real world.

#### Cross-validation
- We wish to evaluate our model on a dataset with little sampling bias, so that we know how it will perform in the real world
- Smaller datasets are far more likely to suffer from sampling bias
- This creates a conflict:
    - We want to have a big test set to get a realistic estimate of our model's performance without sampling bias
    - We want to have a big training set so that our model can learn without sampling bias
- How can we reconcile these goals?
- **One answer:** cross-validation. Even if any single fold suffers from sampling bias, at least we are averaging the results from many
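
A short sketch of how this might look with `cross_val_score()` and `cross_validate()`. The iris dataset, the depth-3 decision tree, and `cv=5` are assumptions for illustration. Cross-validation is run on the training portion only, so the test set stays untouched.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset and split (assumptions).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

model = DecisionTreeClassifier(max_depth=3, random_state=123)

# cross_val_score returns one validation score per fold.
scores = cross_val_score(model, X_train, y_train, cv=5)
print("validation score per fold:", scores)
print("mean validation score:", scores.mean())

# cross_validate returns a dict that can also include the training
# score of each fold, plus fit and score times.
cv_results = cross_validate(model, X_train, y_train, cv=5, return_train_score=True)
print("mean train score:     ", cv_results["train_score"].mean())
print("mean validation score:", cv_results["test_score"].mean())
```

Note that `cross_validate` labels the validation scores `test_score`, even though they come from the validation folds rather than from a held-out test set.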


<img src="images/cross_validation.png" style="width: 1000px;"/>

#### The Golden Rule

Even though the test score is what we care about most:

**THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.**

#### Train / validation / test split

**Question**: Have you noticed anything wrong with what we are doing? Should we trust our cross-validation score as a true estimate of performance on out-of-sample data?

We have been using the cross-validation score to select hyperparameters. In a way, this means the validation data in each fold has been used to train our model. As a result, the chosen hyperparameters are tuned to this particular dataset, and the cross-validation score will be overly optimistic.

**Solution**:
1. Split the dataset into a train/test set
2. Perform grid search with cross-validation on the training set. In each fold, the training set is split further into a train set and a validation set
3. Using the hyperparameters that obtained the best cross-validation score, retrain the model on the entire training set (not just training folds within this training set)
4. Evaluate the model on the test set to get a true estimate of the out-of-sample error
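
The four steps above can be sketched with scikit-learn's `GridSearchCV`. The dataset, the decision tree, and the `max_depth` grid are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Step 2: grid search with 5-fold cross-validation on the training set only.
param_grid = {"max_depth": [1, 2, 3, 5, 10]}  # hypothetical grid
search = GridSearchCV(DecisionTreeClassifier(random_state=123), param_grid, cv=5)

# Step 3: with the default refit=True, fit() also retrains the best
# configuration on the entire training set.
search.fit(X_train, y_train)
print("best hyperparameters:", search.best_params_)
print("best cross-validation score:", search.best_score_)

# Step 4: evaluate the refitted model once on the test set.
test_score = search.score(X_test, y_test)
print("test score:", test_score)
```

The test set is touched only in the final line, after all hyperparameter choices have been made.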

<img src="images/cross_validation_with_test.png" style="width: 500px;"/>


To avoid breaking the golden rule:
- Do any EDA after splitting, using only the training set.
- Do any data preprocessing after splitting.
- In Module 5 we will talk about pipelines, which help with this too.
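
As a small illustration of preprocessing after splitting, the sketch below fits a `StandardScaler` on the training set only and then reuses the learned statistics on the test set. The dataset and the choice of scaler are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first, so the test rows cannot influence the preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

scaler = StandardScaler()

# The means and standard deviations are learned from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)

# The test rows are transformed with the training statistics; fit is
# never called on the test data.
X_test_scaled = scaler.transform(X_test)

print("means learned from the train set:", np.round(scaler.mean_, 3))
```

Fitting the scaler on the full dataset before splitting would let information about the test rows leak into training, which is exactly what the golden rule forbids.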