# Machine Learning Fundamentals - Cumulative Lab

## Introduction

In this cumulative lab, you will work through an end-to-end machine learning workflow, focusing on the fundamental concepts of machine learning theory and processes. The main emphasis is on modeling theory (not EDA or preprocessing), so we will skip over some of the data visualization and data preparation steps that you would take in an actual modeling process.

## Objectives

You will be able to:

* Recall the purpose of a train-test split
* Practice performing a train-test split
* Recall the difference between bias and variance
* Practice identifying bias and variance in model performance
* Practice applying strategies to minimize bias and variance
* Practice selecting a final model and evaluating it on a holdout set

## Your Task: Build a Model to Predict Blood Pressure

![stethoscope sitting on a case](images/stethoscope.jpg)

<span>Photo by <a href="https://unsplash.com/@marceloleal80?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Marcelo Leal</a> on <a href="https://unsplash.com/s/photos/blood-pressure?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

### Business and Data Understanding

Hypertension (high blood pressure) is a treatable condition, but measuring blood pressure requires specialized equipment that most people do not have at home.

The question, then, is ***can we predict blood pressure using just a scale and a tape measure***? These measuring tools, which individuals are more likely to have at home, might be able to flag individuals with an increased risk of hypertension.

[Researchers in Brazil](https://doi.org/10.1155/2014/637635) collected data from several hundred college students in order to answer this question. We will be specifically using the data they collected from female students.

The measurements we have are:

* Age (age in years)
* BMI (body mass index, weight in kilograms divided by the square of height in meters)
* WC (waist circumference in centimeters)
* HC (hip circumference in centimeters)
* WHR (waist-hip ratio)
* SBP (systolic blood pressure)

The chart below describes various blood pressure values:

<a title="Ian Furst, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Hypertension_ranges_chart.png"><img width="512" alt="Hypertension ranges chart" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8b/Hypertension_ranges_chart.png/512px-Hypertension_ranges_chart.png"></a>

### Requirements

#### 1. Perform a Train-Test Split

Load the data into a dataframe using pandas, separate the features (`X`) from the target (`y`), and use the `train_test_split` function to separate data into training and test sets.

#### 2. Build and Evaluate a First Simple Model

Using the `LinearRegression` model and `mean_squared_error` function from scikit-learn, build and evaluate a simple linear regression model using the training data. Also, use `cross_val_score` to simulate unseen data, without actually using the holdout test set.

#### 3. Use `PolynomialFeatures` to Reduce Underfitting

#### 4. Use Regularization to Reduce Overfitting

#### 5. Evaluate a Final Model on the Test Set

## 1. Perform a Train-Test Split

Before looking at the text below, try to remember: why is a train-test split the *first* step in a machine learning process?

.

.

.

A machine learning (predictive) workflow fundamentally emphasizes creating *a model that will perform well on unseen data*. We will hold out a subset of our original data as the "test" set that will stand in for truly unseen data that the model will encounter in the future.

We make this separation as the first step for two reasons:

1. Most importantly, we are avoiding *leakage* of information from the test set into the training set. Leakage can lead to inflated metrics, since the model has information about the "unseen" data that it won't have about real unseen data. This is why we always want to fit our transformers and models on the training data only, not the full dataset.
2. Also, we want to make sure the code we have written will actually work on unseen data. If we are able to transform our test data and evaluate it with our final model, that's a good sign that the same process will work for future data as well.
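To make the leakage point concrete, here is a minimal sketch that uses only the standard library. The waist-circumference values and the `standardize` helper are hypothetical and exist only for illustration. The idea is that scaling parameters are learned from the training subset alone and then reused, unchanged, on the test subset.

```python
from statistics import mean, stdev

# Hypothetical waist-circumference values (cm), already split into train and test
train_wc = [88.0, 86.0, 72.0, 89.0, 81.0]
test_wc = [112.0, 96.0]

# "Fit" step: learn the scaling parameters from the training data only
train_mean = mean(train_wc)
train_std = stdev(train_wc)


def standardize(values, center, scale):
    """Apply previously learned parameters; nothing is re-estimated here."""
    return [(v - center) / scale for v in values]


# "Transform" step: reuse the training parameters for both subsets
train_scaled = standardize(train_wc, train_mean, train_std)
test_scaled = standardize(test_wc, train_mean, train_std)

# Leaky alternative (avoid): parameters estimated from train and test together
leaky_mean = mean(train_wc + test_wc)

print(train_mean, leaky_mean)
```

The two means differ because the leaky version has already "seen" the test values, which is exactly the information a model would not have about future data.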

### Loading the Data

In the cell below, we import the pandas library and open the full dataset for you. It has already been formatted and subsetted down to the relevant columns.

In [None]:
# Run this cell without changes
import pandas as pd
df = pd.read_csv("data/blood_pressure.csv", index_col=0)
df

In [1]:
# __SOLUTION__
import pandas as pd
df = pd.read_csv("data/blood_pressure.csv", index_col=0)
df

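| | Age | bmi | wc | hc | whr | SBP |
|---|---|---|---|---|---|---|
| 0 | 31 | 28.76 | 88 | 101 | 87 | 128.00 |
| 1 | 21 | 27.59 | 86 | 110 | 78 | 123.33 |
| 2 | 23 | 22.45 | 72 | 104 | 69 | 90.00 |
| 3 | 24 | 28.16 | 89 | 108 | 82 | 126.67 |
| 4 | 20 | 25.05 | 81 | 108 | 75 | 120.00 |
| ... | ... | ... | ... | ... | ... | ... |
| 219 | 21 | 45.15 | 112 | 132 | 85 | 157.00 |
| 220 | 24 | 37.89 | 96 | 124 | 77 | 124.67 |
| 221 | 37 | 33.24 | 104 | 108 | 96 | 126.67 |
| 222 | 28 | 35.68 | 103 | 130 | 79 | 114.67 |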


### Identifying Features and Target

Once the data is loaded into a pandas dataframe, the next step is identifying which columns represent features and which column represents the target.

Recall that in this instance, we are trying to predict systolic blood pressure.

In the cell below, assign `X` to be the features and `y` to be the target. Remember that `X` should **NOT** contain the target.

In [None]:
# Replace None with appropriate code

X = None
y = None

X

In [2]:
# __SOLUTION__
X = df.drop("SBP", axis=1)
y = df["SBP"]

X

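| | Age | bmi | wc | hc | whr |
|---|---|---|---|---|---|
| 0 | 31 | 28.76 | 88 | 101 | 87 |
| 1 | 21 | 27.59 | 86 | 110 | 78 |
| 2 | 23 | 22.45 | 72 | 104 | 69 |
| 3 | 24 | 28.16 | 89 | 108 | 82 |
| 4 | 20 | 25.05 | 81 | 108 | 75 |
| ... | ... | ... | ... | ... | ... |
| 219 | 21 | 45.15 | 112 | 132 | 85 |
| 220 | 24 | 37.89 | 96 | 124 | 77 |
| 221 | 37 | 33.24 | 104 | 108 | 96 |
| 222 | 28 | 35.68 | 103 | 130 | 79 |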


Make sure the assert statements pass before moving on to the next step:

In [None]:
# Run this cell without changes

# X should be a 2D matrix with 224 rows and 5 columns
assert X.shape == (224, 5)

# y should be a 1D array with 224 values
assert y.shape == (224,)

In [3]:
# __SOLUTION__

# X should be a 2D matrix with 224 rows and 5 columns
assert X.shape == (224, 5)

# y should be a 1D array with 224 values
assert y.shape == (224,)

### Performing Train-Test Split

In the cell below, import `train_test_split` from scikit-learn ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).

Then create variables `X_train`, `X_test`, `y_train`, and `y_test` using `train_test_split` with `X`, `y`, and `random_state=42`.
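When neither `test_size` nor `train_size` is specified, `train_test_split` holds out 25% of the rows. The sketch below shows the general idea of a shuffled holdout split with the standard library only. `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, and it will not select the same rows as scikit-learn even with the same seed. It only illustrates where the expected subset sizes come from.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.25, seed=42):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = math.ceil(test_size * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(224))  # stand-in for the 224 rows of the dataset
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))  # 168 56
```

Fixing the seed plays the same role as `random_state=42`: rerunning the cell yields the same partition, so results stay comparable between runs.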

In [None]:
# Replace None with appropriate code

# Import the relevant function
None

# Create train and test data using random_state=42
None, None, None, None = None

In [4]:
# __SOLUTION__

# Import the relevant function
from sklearn.model_selection import train_test_split

# Create train and test data using random_state=42
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Make sure that the assert statements pass:

In [None]:
# Run this cell without changes

assert X_train.shape == (168, 5)
assert X_test.shape == (56, 5)

assert y_train.shape == (168,)
assert y_test.shape == (56,)

In [5]:
# __SOLUTION__

assert X_train.shape == (168, 5)
assert X_test.shape == (56, 5)

assert y_train.shape == (168,)
assert y_test.shape == (56,)

## 2. Build and Evaluate a First Simple Model

For our baseline model, or first simple model (FSM), we'll use `LinearRegression` from scikit-learn ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)).

### Instantiating the Model

In the cell below, instantiate a `LinearRegression` model and assign it to the variable `baseline_model`.

In [None]:
# Replace None with appropriate code

# Import the relevant class
None

# Instantiate a linear regression model
baseline_model = None

In [6]:
# __SOLUTION__

# Import the relevant class
from sklearn.linear_model import LinearRegression

# Instantiate a linear regression model
baseline_model = LinearRegression()

Make sure the assert passes:

In [None]:
# Run this cell without changes

# baseline_model should be a linear regression model
assert type(baseline_model) == LinearRegression

In [7]:
# __SOLUTION__

# baseline_model should be a linear regression model
assert type(baseline_model) == LinearRegression

If you are getting the type of `baseline_model` as `abc.ABCMeta`, make sure you actually invoked the constructor of the linear regression class with `()`.

If you are getting `NameError: name 'LinearRegression' is not defined`, make sure you have the correct import statement.
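The `abc.ABCMeta` symptom can be reproduced without scikit-learn. In this small sketch, `HypotheticalModel` is a made-up stand-in for an estimator class. Assigning the class itself, without parentheses, gives an object whose type is the metaclass, while calling the class gives an instance whose type is the class.

```python
from abc import ABC


class HypotheticalModel(ABC):
    """Made-up stand-in for an estimator class whose metaclass is ABCMeta."""


not_instantiated = HypotheticalModel  # the class itself (parentheses missing)
instantiated = HypotheticalModel()    # an instance of the class

print(type(not_instantiated).__name__)  # ABCMeta
print(type(instantiated).__name__)      # HypotheticalModel
```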

### Fitting and Evaluating the Model on the Full Training Set

In the cell below, fit the model on `X_train` and `y_train`:

In [None]:
# Your code here

In [8]:
# __SOLUTION__
baseline_model.fit(X_train, y_train)

LinearRegression()

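Then, evaluate the model using root mean squared error (RMSE). To do this, first import the `mean_squared_error` function from scikit-learn ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)). Then pass in both the actual and predicted y values, along with `squared=False` (to get the RMSE rather than MSE). Note that the `squared` parameter was deprecated in scikit-learn 1.4 in favor of a separate `root_mean_squared_error` function, so newer versions may warn or raise an error on this call; taking the square root of the MSE gives the same number.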

In [None]:
# Replace None with appropriate code

# Import the relevant function
None

# Generate predictions using baseline_model and X_train
y_pred_baseline = None

# Evaluate using mean_squared_error with squared=False
baseline_rmse = None
baseline_rmse

In [9]:
# __SOLUTION__

# Import the relevant function
from sklearn.metrics import mean_squared_error

# Generate predictions using baseline_model and X_train
y_pred_baseline = baseline_model.predict(X_train)

# Evaluate using mean_squared_error with squared=False
baseline_rmse = mean_squared_error(y_train, y_pred_baseline, squared=False)
baseline_rmse

13.404369445571641

Your RMSE calculation should be around 13.4:

In [None]:
# Run this cell without changes
assert round(baseline_rmse, 1) == 13.4

In [10]:
# __SOLUTION__
assert round(baseline_rmse, 1) == 13.4

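RMSE is expressed in the same units as the target, so this means that on the *training* data, our predictions are typically off by about 13 mmHg (with larger errors weighted more heavily than they would be in a plain average).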
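As a reminder of what the metric computes, here is a minimal hand-rolled sketch with made-up blood pressure values. The `rmse` helper and the numbers are illustrative assumptions, not values from the dataset. The mean absolute error is included only for comparison.

```python
import math


def rmse(y_true, y_pred):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted systolic blood pressure values (mmHg)
y_true = [128.0, 123.0, 90.0, 127.0]
y_pred = [125.0, 127.0, 90.0, 122.0]

example_rmse = rmse(y_true, y_pred)
example_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(example_rmse, 2))  # 3.54
print(example_mae)             # 3.0
```

Because the errors are squared before averaging, the RMSE is never smaller than the mean absolute error, and a few large misses raise it noticeably.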

But what about on *unseen* data?

To stand in for truly unseen data without touching `X_test` or `y_test` yet (so that our modeling decisions are not tuned to this particular split), let's use cross-validation.

### Fitting and Evaluating the Model with Cross Validation

In the cell below, import `cross_val_score` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)) and call it with `baseline_model`, `X_train`, and `y_train`.

scikit-learn scorers follow the convention that higher values are better, so error metrics are exposed in negated form. That is why you'll need to use `scoring="neg_root_mean_squared_error"`, which returns the RMSE values with their signs flipped to negative. We then take the average and negate it at the end, so the number is directly comparable to the RMSE number above.
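The sketch below mimics that sign convention with a hand-written k-fold loop. It is not how `cross_val_score` is implemented: the "model" simply predicts the mean of the folds it was fit on, the `y` values are hypothetical, and `k_fold_neg_rmse` is an illustrative helper that assumes the number of rows is divisible by `k`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of paired actual and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def k_fold_neg_rmse(y, k=5):
    """Return one negated RMSE per fold for a predict-the-mean model."""
    fold_size = len(y) // k  # simplifying assumption: len(y) is divisible by k
    scores = []
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        y_val = y[start:stop]          # fold held out for validation
        y_fit = y[:start] + y[stop:]   # remaining folds used for fitting
        prediction = sum(y_fit) / len(y_fit)
        scores.append(-rmse(y_val, [prediction] * len(y_val)))
    return scores


# Hypothetical systolic blood pressure values (mmHg)
y = [128.0, 123.0, 90.0, 127.0, 120.0, 157.0, 125.0, 127.0, 115.0, 110.0]

neg_scores = k_fold_neg_rmse(y, k=5)
cv_rmse = -(sum(neg_scores) / len(neg_scores))  # average, then flip the sign back

print(neg_scores)
print(cv_rmse)
```

Every per-fold score is zero or negative, and negating their mean recovers an ordinary, positive RMSE.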

In [None]:
# Replace None with appropriate code

# Import the relevant function
None

# Get the cross validated scores for our baseline model
baseline_cv = None

# Display the average of the cross-validated scores
baseline_cv_rmse = -(baseline_cv.mean())
baseline_cv_rmse

In [11]:
# __SOLUTION__

# Import the relevant function
from sklearn.model_selection import cross_val_score

# Get the cross validated scores for our baseline model
baseline_cv = cross_val_score(baseline_model, X_train, y_train, scoring="neg_root_mean_squared_error")

# Display the average of the cross-validated scores
baseline_cv_rmse = -(baseline_cv.mean())
baseline_cv_rmse

13.797218918749715

The averaged RMSE for the cross-validated scores should be around 13.8:

In [None]:
# Run this cell without changes

assert round(baseline_cv_rmse, 1) == 13.8

In [12]:
# __SOLUTION__

assert round(baseline_cv_rmse, 1) == 13.8

### Analysis of Baseline Model

So, we got an RMSE of about 13.4 on the training data and about 13.8 on the cross-validated (validation) data; the holdout test set has not been used yet. RMSE is a form of *error*, so lower is better: the model performs somewhat better on the training data than on the validation data.

Referring back to the chart above, both errors mean that on average we would expect to mix up someone with stage 1 vs. stage 2 hypertension, but not someone with normal blood pressure vs. critical hypertension. So it appears that the features we have might be predictive enough to be useful.

Are we overfitting? Underfitting?

.

.

.

The RMSE values for the training data and the cross-validated (validation) data are fairly close to each other, so we are probably not overfitting too much, if at all.
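One rough way to quantify "fairly close" is to express the gap relative to the training error, using the two RMSE values reported above. The variable names are illustrative, and there is no universal threshold for how large a gap signals overfitting. This is only a sketch of the comparison.

```python
train_rmse = 13.404369445571641       # training RMSE reported above
validation_rmse = 13.797218918749715  # cross-validated RMSE reported above

gap = validation_rmse - train_rmse
relative_gap = gap / train_rmse

print(f"gap: {gap:.2f} mmHg ({relative_gap:.1%} of the training RMSE)")
# gap: 0.39 mmHg (2.9% of the training RMSE)
```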

It seems like our model has some room for improvement, but without further investigation it's impossible to know whether we are underfitting, or we are simply missing the features that we would need to make predictions with less error. (For example, we don't know anything about the diets of these study participants, and we know that diet can influence blood pressure.)

In the next step, we'll attempt to reduce underfitting by applying some polynomial features transformations to the data.