## Table of Contents

<a href="#Model-Training"><font size="+0.5">Model Training</font></a>
* Fitting Models
* Predicting with Models

<a href="#Multivariable-Models"><font size="+0.5">Multivariable Models</font></a>
* Fitting models where X is two-dimensional

<a href="#Evaluation"><font size="+0.5">Evaluation</font></a>
* Cross Validation
* Regression Error

<a href="#Regularisation-and-Tuning"><font size="+0.5">Regularisation and Tuning</font></a>
* Ridge Regression
* Lasso Regression

---
<center><h1><font size=6>Supervised Learning</font></h1></center>
    
Supervised learning covers machine learning tasks that aim to make predictions by training a model. Models are said to be "supervised" because each record has a label telling the model what the true value to be predicted is. Data is labelled when it contains both the thing we want to predict (the target) and the values used to predict it (features). There are a number of different models that allow us to make predictions, each with different advantages and disadvantages.
    
We will look at two different supervised problems: regression and classification.

<center><h1><font size=7>Regression</font></h1></center>

## Learning Objectives

- Understand what regression is and when it is used
- Understand how to implement regression using **`sklearn`**
- Be able to build multi-variable models
- Be able to evaluate regression model performance
- Understand what regularization is and how to use it
- Be able to understand and change the values of model hyperparameters

Regression is used when we train a model to predict a continuous variable, that is, a numerical value. We have one variable we are trying to predict, the target, and other feature variables we are going to use to make that prediction.

In machine learning we use the same regression models and theory that are used in statistics; however, the emphasis is slightly different. In statistics we generally perform regression in order to explore our data and obtain characteristic quantities and coefficients which tell us about our data. In machine learning we largely use regression to predict values; this makes it a results-led area, where we are trying to optimise our models to get the best predictions we can.

In supervised learning we have two important values, the **true** value of a target variable $y_i$, meaning the value of an individual data point, and the **predicted** value $\hat{y_i}$, meaning the value our model produces for a given input. 

> The difference between these two values is the error (or residual) of our model: how far the model's prediction is from the true value. The purpose of our different machine learning techniques is to make this error value, $$\epsilon_i = y_i - \hat{y_i}$$ as small as possible.
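
As a small illustration of the residual definition, the sketch below computes $\epsilon_i$ for a handful of points. The numbers are made up for the example and are not taken from the bikes data.

```python
import numpy as np

# Hypothetical true and predicted daily counts (made-up values).
y_true_example = np.array([120.0, 95.0, 143.0, 110.0])
y_hat_example = np.array([115.0, 100.0, 150.0, 110.0])

# Residual for each data point: epsilon_i = y_i - y_hat_i
residuals = y_true_example - y_hat_example
print(residuals)
```

A positive residual means the model predicted too low, a negative residual means it predicted too high.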

There are a range of ways to measure this $\epsilon$, each suited to different situations. This chapter will mostly stick with simple linear regression models; however, the programming and design techniques shown are applicable to a wide range of regression methods.

We are going to use the TfL "bikes" data set to predict the **`count`** of people hiring bikes on a day. Although the **`count`** data is integer, we are going to assume it can be represented by a continuous floating point number.

Before we get started training regression models, we are going to briefly prepare our data by filling in missing values and scaling a few selected columns. 

## Data Preparation

In [3]:
import pandas
print(pandas.__version__)

0.20.1


In [1]:
# Import libraries
import pandas as pd 
from sklearn.preprocessing import RobustScaler
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [5]:
# Import data
bikes_filepath = '../../data/bikes.csv'
bikes_imported_data = pd.read_csv(filepath_or_buffer=bikes_filepath, delimiter=",")

# Clean missing data
bikes_clean_data = bikes_imported_data.interpolate(method='linear')
bikes_clean_data = bikes_clean_data.bfill()

# Set our target variable as the count
y = bikes_clean_data["count"].to_numpy()

# Create scaler object
rb_scaler = RobustScaler()

# We are just going to use numerical data to begin with
# Drop our count and the multicollinear feel_temperature
# (include is passed as a list, which older versions of pandas require)
numerical_data = bikes_clean_data.select_dtypes(include=['number'])
numerical_data = numerical_data.drop(columns=["count", "feel_temperature"])

# Scale the data.
scaled_features = rb_scaler.fit_transform(numerical_data)
scaled_data = pd.DataFrame(scaled_features, columns=numerical_data.columns)

scaled_data.head()

---
# Model Training
We are first going to train a simple linear model that uses one variable to predict another, a univariable model. This will be in the form: $$ f(x) = a_1 x + a_0 $$ Where $a_1$ is the gradient and $a_0$ is the intercept. For a univariable model $x$ is one variable and is therefore a vector.

A univariable regression model has one target variable that we are trying to predict, and one feature variable we are using to make that prediction. If $n$ is the number of features used to predict the target variable then this model has $n = 1$.

A multivariable regression model has one target variable to be predicted and multiple feature variables. Using the same formulation as above, a multivariable model has $n > 1$. We will often use this type of model to make predictions, as multiple variables often give us more information about patterns, producing a better prediction.

We are going to use the Ordinary Least Squares method of regression in this first example. For an interactive guide through how this method works you can look [here](http://setosa.io/ev/ordinary-least-squares-regression/), and for a deeper dive into the maths behind the method you can look [here](https://people.revoledu.com/kardi/tutorial/Regression/OLS.html).
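
For a single feature, the least squares gradient and intercept have a simple closed form: $a_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $a_0 = \bar{y} - a_1 \bar{x}$. The sketch below checks this against **`sklearn`** on made-up data; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration: y is roughly 2x + 1 plus a little noise.
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least squares estimates for one feature.
x_centred = x_toy - x_toy.mean()
y_centred = y_toy - y_toy.mean()
a_1 = np.sum(x_centred * y_centred) / np.sum(x_centred ** 2)
a_0 = y_toy.mean() - a_1 * x_toy.mean()

# The same fit using sklearn (the features must be 2D, hence the reshape).
toy_model = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)

print("By hand:", a_1, a_0)
print("sklearn:", toy_model.coef_[0], toy_model.intercept_)
```

Both approaches should give the same gradient and intercept up to floating point rounding.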

Let's first use the **`humidity`** attribute to predict the value of the **`count`** of bikes hired. We train a model by creating the model object, then calling the **`fit(X, y)`** method.

We need our data to be in the $X$, $y$ format in order for our model to use it. The $R^2$ measure is a commonly used metric to see how well a regression performs. More formally, the "coefficient of determination" quantifies the proportion of the variance in the target data that our model explains using the feature data.

The coefficient is defined generally as $$ R^2 = \frac{\text{explained variation}}{\text{total variation}}$$ We therefore have a maximum possible value for $R^2$ of 1, which corresponds to the model being able to predict all the variation in the data. Bad models will produce a value of $R^2 \sim 0$, and very bad models, which predict worse than always guessing the mean of the target, will have $R^2 < 0$.


A more in-depth explanation of this metric can be found [here.](https://en.wikipedia.org/wiki/Coefficient_of_determination)
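
To see where negative values come from, it can help to compute the score by hand. The sketch below uses the form $R^2 = 1 - SS_{res} / SS_{tot}$, which is what **`r2_score`** evaluates, on made-up numbers; `r2_by_hand` is a helper written for this example, not part of **`sklearn`**.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and two sets of predictions.
y_example = np.array([3.0, 5.0, 7.0, 9.0])
good_pred = np.array([2.8, 5.1, 7.2, 8.9])
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])  # predicts the opposite trend


def r2_by_hand(y_true, y_pred):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


print("Good model:", r2_by_hand(y_example, good_pred), r2_score(y_example, good_pred))
print("Bad model: ", r2_by_hand(y_example, bad_pred), r2_score(y_example, bad_pred))
```

The second set of predictions is further from the true values than simply predicting the mean would be, so its score falls below zero.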

In order to evaluate our model we are going to need to split our data into training and test sets just like we did at the end of Chapter 1 - Data Preparation. 

It is important that we remember the shapes of our data structures when working with models. Before we pass our data into the model let's check what it looks like.

In [None]:
# Selecting only the humidity attribute for this example.
X_humidity = scaled_data["humidity"].to_numpy()
print("X_humidity shape: ", X_humidity.shape)
print(X_humidity[:10])

For our feature data we need a 2D matrix; however, it is currently a vector, not a matrix. We can use **numpy** to reshape the vector into a 2D matrix with the `.reshape(-1, 1)` method.

In [None]:
X_humidity = X_humidity.reshape(-1, 1)
print("X_humidity new shape: ", X_humidity.shape)
print(X_humidity[:10])

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Set our training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X_humidity, y, test_size=0.2, random_state=123)

# Create model object
linear_model_univariate = LinearRegression()

# Fit model
linear_model_univariate.fit(X_train, y_train);

We have now fitted our model on the training data, so we can make predictions based on other data, the test data. All **`sklearn`** supervised models come with a **`predict()`** method, which produces the $\hat{y_i}$ data.

In [None]:
# Predict values using our model
y_pred_humidity = linear_model_univariate.predict(X_test)

# Show the first 10 predictions
print(y_pred_humidity[:10])

Our regression model produces coefficients, and these can be shown easily. It is important to remember that we have produced this model on scaled data: the coefficients are still informative, but they cannot be used directly with the original pre-scaled data.
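
If coefficients in the original units are needed, one option is to undo the scaling algebraically. The sketch below assumes a single feature scaled with **`RobustScaler`**, which computes $x_{scaled} = (x - \text{center}) / \text{scale}$; the humidity and count values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

# Made-up humidity percentages and counts, for illustration only.
raw_humidity = np.array([[40.0], [55.0], [60.0], [75.0], [90.0]])
toy_count = np.array([900.0, 760.0, 700.0, 560.0, 410.0])

# Fit a model on the scaled feature, as in this chapter.
toy_scaler = RobustScaler()
scaled_humidity = toy_scaler.fit_transform(raw_humidity)
scaled_model = LinearRegression().fit(scaled_humidity, toy_count)

# Substitute x_scaled = (x - center_) / scale_ into f(x) = a_1 * x_scaled + a_0
# to express the gradient and intercept in the original units.
centre = toy_scaler.center_[0]
scale = toy_scaler.scale_[0]
a_1_raw = scaled_model.coef_[0] / scale
a_0_raw = scaled_model.intercept_ - scaled_model.coef_[0] * centre / scale

# Check against a model fitted directly on the unscaled data.
raw_model = LinearRegression().fit(raw_humidity, toy_count)
print(a_1_raw, raw_model.coef_[0])
print(a_0_raw, raw_model.intercept_)
```

The converted values should match the coefficients of a model fitted directly on the unscaled feature.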

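Instead of fitting and then predicting in separate statements, you can chain the two calls, because **`fit()`** returns the fitted model: `linear_model_univariate.fit(X_train, y_train).predict(X_test)`. Note that **`sklearn`** regression models do not provide a **`fit_predict`** method; that method is provided by other kinds of estimator, such as clustering models.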

In [None]:
# Showing coefficients learnt from the model.
print("Humidity model coefficient: \n", linear_model_univariate.coef_[0].round(2))
print("Humidity model intercept: \n", linear_model_univariate.intercept_.round(2))

# Calculate R^2 value using the true and predicted values of y test.
r2_value = r2_score(y_test, y_pred_humidity)

print("Humidity R^2 value: \n", r2_value)

The value of the model coefficient corresponds to $a_1$ and the intercept coefficient to $a_0$.

From our $R^2$ value above we can see that our simple model explains some of the variance in **`count`**; however, most of it is unexplained, so there is a lot of room for improvement in this model!

In [None]:
# Plot the scaled data against the daily count.
plt.scatter(X_test, y_test, color="navy", label="Data")
plt.plot(X_test, y_pred_humidity, color='red', linewidth=2.5, 
         label="Regression model:\nCount = {:.1f}{}Humidity + {:.1f}"\
                 .format(linear_model_univariate.coef_[0], r"$\times$", 
                     linear_model_univariate.intercept_))
plt.title("Humidity Predicting Count")
plt.xlabel("Standardised Humidity")
plt.ylabel("Count")
plt.legend();

---
# Multivariable Models

Our model above is able to predict the count to some extent. Our graph shows it captures the trend, but our target is more complicated than being only a result of the day's humidity. We shall take into account some other factors, the **`real_temperature`** and **`wind_speed`**. We could also put into the model some of the categorical features that we encoded in the previous chapter.

The form that the model will take is $$f(X) = a_3 X_3 + a_2 X_2 + a_1 X_1 + a_0$$ You can see from this each $a_{1-3}$ corresponds to a gradient for each of the different variables given to the model with $a_0$ as the intercept again. For each new feature we put into our model we will get another $a_i$ coefficient associated.
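
To make the link between this formula and a fitted model concrete, the sketch below rebuilds the predictions from **`coef_`** and **`intercept_`** by hand. The data is synthetic and the gradients (4, -2, 0.5) and intercept (10) are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three features and no noise (made-up values).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 4.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 0.5 * X_toy[:, 2] + 10.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ holds one gradient per feature column; intercept_ is a_0.
manual_pred = X_toy @ toy_model.coef_ + toy_model.intercept_

print("Gradients:", toy_model.coef_.round(2))
print("Intercept:", round(toy_model.intercept_, 2))
print("Matches predict():", np.allclose(manual_pred, toy_model.predict(X_toy)))
```

Each entry of `coef_` lines up with the feature column in the same position of the input matrix.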

From a model building and programming perspective, multi-variable models are very similar to univariate ones: we use the same functions, and only the data we input into the model changes. However, as we increase the number of variables involved in our model, our results become harder to visualise.

> Let's use the three attributes to predict the bike hire count. Do we think this will increase or decrease our $R^2$ score?

In [None]:
# Using all of the columns from the scaled data.
X = scaled_data.to_numpy()

# Split the training and test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Create model object
linear_model_multi_variable = LinearRegression()

# Fit model
linear_model_multi_variable.fit(X_train, y_train);

In [None]:
# Generate predictions from data.
y_pred = linear_model_multi_variable.predict(X_test)

In [None]:
# Calculate R^2 value using the true and predicted values of y_test.
r2_value = r2_score(y_test, y_pred)

print("Multivariable R2 value: \n", r2_value)

Using the other attributes in the data set has allowed us to improve our model significantly; it is now able to explain the majority of the variance.

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 1:</font></b> 

<p> Re-run the previous multi-variable model, this time change the value of the <b>random_state=</b> argument in <b>train_test_split()</b> function. What happens to your $R^2$ values?
</p> </div>

In [None]:
# Write your code below


---
# Evaluation

We have thus far seen one way of measuring how our regression model has performed, the $R^2$ value. This is, however, only one of many different methods. We will explore some that use the error $\epsilon$ between **predicted** and **true** values explicitly.

We also want to consider whether our method of evaluating the performance of a model is consistent, or whether it will change easily when we use slightly different data.

## Cross Validation

What we saw in Exercise 1 is that our evaluation of our regression performance is influenced heavily by the data we put into it. This is a problem especially when we have smaller data sets, and is a result of the model variance discussed in the previous chapter. Taking only one split of the data means our whole evaluation is dependent on that one split. If the split randomly puts mostly high **`count`** values in the training data, our predictions will be higher than they should be.

Cross validation allows us to use our data to create a range of different data sets; we can then use these different data sets together to get a more general understanding of our model. This method is often called K-fold cross validation.

The training data is divided into $K$ equally sized partitions ("folds"). From this collection of partitions, we remove one partition to be the test set, and the other $K-1$ folds are the training set. The model is then trained on the $K-1$ folds and evaluated on the remaining fold. This is then repeated with each of the $K$ partitions being the one removed. There are therefore $K$ different models that are trained and tested on different subsets of the data. We can then get a score or metric measurement of our model's performance for each.

<img src="../../images/crossval.png"  width="750" height="600" alt="A visualisation of a data set being split into four, four different ways, each with a different split being used as a test set">
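
The procedure described above can be written out as an explicit loop. The sketch below uses **`KFold`** from **`sklearn.model_selection`** on synthetic data (not the bikes data) and scores each fold with $R^2$; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 1.5 * X_toy[:, 1] + rng.normal(scale=0.5, size=100)

# Split into 5 folds; each fold is the test set exactly once.
k_fold = KFold(n_splits=5)
manual_scores = []
for train_index, test_index in k_fold.split(X_toy):
    # Train a fresh model on the other K-1 folds.
    fold_model = LinearRegression().fit(X_toy[train_index], y_toy[train_index])
    # Evaluate it on the fold that was held out.
    fold_pred = fold_model.predict(X_toy[test_index])
    manual_scores.append(r2_score(y_toy[test_index], fold_pred))

# cross_val_score with an integer cv performs the same unshuffled split
# for a regression model, so the scores should agree.
sklearn_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring="r2")

print(np.round(manual_scores, 3))
print(np.allclose(manual_scores, sklearn_scores))
```

Writing the loop by hand is rarely necessary, but it shows that every fold gets its own freshly trained model.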

Let's see what scores we get from 5 different partitions of the data for our previously used model.

In [None]:
from sklearn.model_selection import cross_val_score

# Set the chosen K
K = 5

# Create model object
linear_model = LinearRegression()

# Generate cross-validation scores on the full data; cross_val_score does the splitting.
cv_scores = cross_val_score(estimator=linear_model, X=X, y=y, cv=K, scoring="r2")

print(cv_scores)

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 2:</font></b> 

<p> 
Copying the code above, change the $K$ value to 10. What happens to the scores? Calculate the average of <b>cv_scores</b> using <b>.mean()</b>. What happens when $K=2$? What does this tell us about our model and how well it is generalised?
</p> </div>

In [None]:
# Write your code here


It's clear that how we evaluate the model, including what value of $K$ we use, will greatly impact how our model appears to perform. For example, if we use $K = 100$ on our data set of approximately 600 records, we would only have about 6 records in each fold, which clearly is not a representative sample to test our model against.

We can also make predictions using the **`cross_val_predict()`** function. This will give us a prediction for every point, as each point is in the test fold exactly once.

In [None]:
from sklearn.model_selection import cross_val_predict

# Predict the test values using cross validation.
y_pred = cross_val_predict(estimator=linear_model, X=X, y=y, cv=K)

# We can then compare our predicted and true values using a plot.
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(y, y_pred, edgecolors=(0, 0, 0))
ax.plot([0, y.max()], [0, y.max()], 'k--', lw=4)
ax.set_xlabel('Measured Count')
ax.set_ylabel('Predicted Count')
ax.set_title(r"True and Predicted $y_i$")
plt.show()

Cross validation can inform us about how well our model is predicting based on subsets of the data. But if we want a true evaluation of its performance we need to use a held out test set.

## Regression Error

As alluded to previously, there are other ways to measure the performance of a regression model based on the error between the **true** and **predicted** values.

### Mean Absolute Error

If we take the error as $\epsilon_i = y_i - \hat{y_i}$ then it makes intuitive sense that we want to have a small average error. However, with this definition the error is positive when the prediction is below $y_i$ and negative when it is above, so when we take the mean the positive and negative errors can cancel and the average could become zero. One method for handling this issue is to take the absolute value of the error, which means we equally take into account the values that are predicted higher and lower than the **true** value.

This is shown below:

$$MAE = \frac{\sum_{i=1}^{n}| y_i - \hat{y_i} |}{n}$$

And we can implement this metric either explicitly using predictions as in the "Model Training" section, or by using the cross validation **`scoring=`**  argument with **`neg_mean_absolute_error`**. 
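
A small worked example of the explicit calculation, using made-up values, shows why the absolute value matters: the raw errors below average to zero even though every prediction is wrong.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true values and predictions.
y_example = np.array([100.0, 150.0, 200.0, 250.0])
pred_example = np.array([110.0, 140.0, 220.0, 230.0])

# Raw errors: positive and negative values cancel when averaged.
errors = y_example - pred_example
mean_error = errors.mean()

# Mean absolute error, by hand and with sklearn.
mae_by_hand = np.abs(errors).mean()
mae_sklearn = mean_absolute_error(y_example, pred_example)

print("Errors:", errors)
print("Mean error:", mean_error)
print("MAE by hand:", mae_by_hand)
print("MAE sklearn:", mae_sklearn)
```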

Below is an example of MAE using cross validation. The **`sklearn`** scoring convention is that higher scores are better, so it reports the negative of the mean absolute error; we are just going to multiply the output by $-1$. This is related to an idea in machine learning called the "loss function" for a model, something we want to minimise. The "loss function" is used heavily in neural networks, but is beyond the scope of this introductory course.

In [None]:
# Using the X_humidity data as a comparison with the multivariable model.
K = 5

# Generate the cross validation scores using the scoring metric.
cv_scores = cross_val_score(estimator=linear_model, 
                            X=X, y=y, cv=K, 
                            scoring="neg_mean_absolute_error")

cv_scores_humidity = cross_val_score(estimator=linear_model, 
                                     X=X_humidity, y=y, cv=K, 
                                     scoring="neg_mean_absolute_error")


print("All three features MAE: \n", cv_scores.round(3)*-1)
print("Mean MAE across the folds: \n\t", cv_scores.mean().round(3)*-1)
print("Standard Deviation MAE across folds: \n\t", cv_scores.std().round(3))
print("\nHumidity MAE: \n", cv_scores_humidity.round(3)*-1)
print("Mean MAE across the folds: \n\t", cv_scores_humidity.mean().round(3)*-1)
print("Standard Deviation MAE across folds: \n\t", cv_scores_humidity.std().round(3))

We can see that the MAE for the univariate model is significantly higher than that of the multi-variable model. 

### Mean Squared Error

What the MAE has shown is that it is important to have a metric which generates non-negative values from the error. This prevents positive and negative values cancelling out. The Mean Squared Error does this by squaring the value of the error. The effect of this squaring is to weight larger errors more significantly. The MSE is a useful metric as it is closely related to how Ordinary Least Squares regression is calculated: the model is fitted by minimising the MSE value.

> The mean square error is given by $$MSE = \frac{\sum_{i=1}^{n}( y_i - \hat{y_i} )^2}{n}$$
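
The sketch below illustrates how squaring weights larger errors more heavily. Both sets of made-up predictions have the same MAE, but the set with one large error has a much higher MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true values and two sets of predictions.
y_example = np.array([100.0, 100.0, 100.0, 100.0])
even_errors = np.array([110.0, 90.0, 110.0, 90.0])      # four errors of size 10
one_big_error = np.array([100.0, 100.0, 100.0, 140.0])  # one error of size 40

# Store (MAE, MSE) for each set of predictions.
results = {}
for name, pred in [("even", even_errors), ("one big", one_big_error)]:
    mae = mean_absolute_error(y_example, pred)
    mse = mean_squared_error(y_example, pred)
    results[name] = (mae, mse)
    print(name, "MAE:", mae, "MSE:", mse)
```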

We can use the MSE in exactly the same way we used the MAE, programming-wise, using the cross validation scoring method. By taking the square root of the MSE we get the Root Mean Squared Error, which is on the same scale as the MAE. In addition, we can directly calculate the error score for either method by importing the scoring function from `sklearn.metrics` as shown below.

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 3:</font></b> 

<p> 
Using a train test split with 80% training data on <b>X</b> and <b>y</b>, fit a model and predict the values of <b>y_pred</b>. Use the function <b>mean_squared_error</b> to calculate the MSE value. Then calculate the RMSE using <b>numpy.sqrt()</b>.

</p> </div>

In [None]:
from sklearn.metrics import mean_squared_error
# Write your code here


---
# Regularisation and Tuning

> So far we have only used a simple linear model which uses least squares optimisation. The model minimises the sum of the squared residuals in the form $$ L = min( \sum_{i=1}^{n} (y_i - \hat{y_i})^2 ) $$ where $\hat{y} = ax + b$. This results in values of $a$ and $b$ which form our model.

When our regression line fits our training data well, it has an $L$ of near zero, which means the model has low bias. If, however, the model does not fit the testing data, it has high variance, as it is unable to predict new values.

However, we can improve our model's ability to predict by introducing a small amount of bias to the model. This will help to decrease the variance in the model. There are two similar methods of doing regularisation that we will discuss, ridge regression and lasso regression. By introducing regularisation into the model the sensitivity to the training data is decreased making the model more robust. This is done by adding a penalty term into the equation shown above.

## Ridge Regression

The first method of regularisation we will discuss is used in Ridge regression. Ridge regression penalises high value slopes in a model, which lowers the values of some parameters. This is the bias that is introduced to the model in exchange for lower variance. 

> The function minimised in ridge regression is given below; there is an added term dependent on the gradients (slopes) of each variable in the model $$ L2 = min( \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \alpha^2 \times \sum_{j=1}^{m}(a_j)^2) $$
where $i$ is the index of the data point, $n$ is the total number of data points, $j$ is the index of the attribute, $m$ is the number of attributes, and $a_j$ is the gradient of variable $j$.

What our new model will do is minimise the sum of the squared residuals and also now the sum of the squared gradients. But there is a parameter required to determine how much weight the sum of the squared gradients gets. This is the $\alpha$ term, which can take a value from $0 \rightarrow \infty $. This is the first of many variables we will discuss that will impact our model. We call them *hyperparameters*: they are values we give to the model that are neither the data, nor learnt from the data. The hyperparameters can be altered, or *tuned*, in order to improve model performance.
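
To get a feel for the effect of the penalty before using it on the bikes data, the sketch below fits ridge models with a few values of `alpha` on synthetic data and compares the coefficients with an unpenalised fit. The data and the alpha values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with three informative features (made-up values).
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(60, 3))
y_toy = 5.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1] - 4.0 * X_toy[:, 2] + rng.normal(size=60)

# Unpenalised least squares fit for comparison.
ols_coefs = LinearRegression().fit(X_toy, y_toy).coef_
print("OLS:", ols_coefs.round(3))

# Increasing alpha puts more weight on the penalty term,
# so the sum of squared gradients shrinks.
coef_sizes = []
for toy_alpha in [0.1, 10, 1000]:
    ridge_coefs = Ridge(alpha=toy_alpha).fit(X_toy, y_toy).coef_
    coef_sizes.append(np.sum(ridge_coefs ** 2))
    print("alpha =", toy_alpha, ridge_coefs.round(3))
```

The coefficients shrink towards zero as `alpha` grows, which is the bias being traded for lower variance.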

In [None]:
from sklearn.linear_model import Ridge

# Set a value for alpha
alpha_value = 100

# Create model object
ridge_model = Ridge(alpha=alpha_value)

# Do some cross validation to get a feel for what the model performance is.
cv_scores = cross_val_score(estimator=ridge_model, X=X, y=y, cv=5, scoring="neg_mean_absolute_error")

print("Value of α:", alpha_value)
print("K-fold MAE values: \n", -1*cv_scores)
print("Mean value across K-folds: \n", -1*cv_scores.mean())

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 4:</font></b> 

<p>

Copying the code from the cell above, change the value of <b>alpha_value</b> in order to improve the mean MAE value. What is the best value of the mean MAE you can achieve?

</p> </div>

In [None]:
# Write your code here


The code below shows, for our example, how the weight of each feature changes with the value of $\alpha$.

In [None]:
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Create the model object
ridge_model = Ridge()

# Generate different values for alpha.
alphas = np.logspace(-6, 6, 200)

# Train the model with different regularisation strengths.
coefs = []
MSEs = []
for each_value in alphas:
    ridge_model.set_params(alpha=each_value)
    ridge_model.fit(X_train, y_train)
    coefs.append(ridge_model.coef_)
    y_pred = ridge_model.predict(X_test)
    MSEs.append(mean_squared_error(y_test, y_pred))
    
# Plot parameters of the model as they change with alpha.
fig, ax = plt.subplots(figsize=(8,6))
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel(r'$\alpha$')
plt.ylabel('Feature weights')
plt.title('Ridge variable coefficients compared to MSE')
plt.axis('tight')
labels = scaled_data.columns.to_list()

ax2 = ax.twinx()
ax2.plot(alphas, MSEs, color="red")
ax2.set_ylabel("MSE", color="red")
ax2.tick_params(axis='y', labelcolor="red")
labels.append("MSE")
fig.legend(labels, loc=(0.13, 0.4));

In the plot above we can see that the size of each weight in the regression changes significantly with the value of $\alpha$ and that the relationship is not linear. Picking the right value of $\alpha$ is important for getting a good fit to our data. When $\alpha$ is small it will not affect the feature weights; when it is high the feature weights will go towards zero. We can see that the MSE increases with $\alpha$ here.

## Lasso Regression

Lasso regression works in a similar manner to Ridge regression. There is an additional term added to the minimised function which takes into account the sum of the gradient coefficients. However, instead of squaring each coefficient, its absolute value is used.

> The equation for Lasso regression loss is given as $$ L1 = min( \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \alpha^2 \times \sum_{j=1}^{m}|a_j| ) $$

As we can see this is largely the same as ridge regression; however, using the absolute value instead of the square gives lasso regression a special property. In ridge regression, if a variable is not significant in a regression, the value of its gradient becomes small, but it is unable to go completely to zero. In Lasso regression, however, the non-useful variables can have their coefficients go completely to zero. This is an extremely useful property which helps reduce variance in a model with large numbers of potentially low prediction quality variables.
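
This difference can be seen on a small synthetic example. In the sketch below only the first two of four features influence the target; the data and the choice of `alpha=0.5` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: features 0 and 1 matter, features 2 and 3 are pure noise.
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Fit both regularised models with the same (arbitrary) alpha.
lasso_coefs = Lasso(alpha=0.5).fit(X_toy, y_toy).coef_
ridge_coefs = Ridge(alpha=0.5).fit(X_toy, y_toy).coef_

print("Lasso coefficients:", lasso_coefs)
print("Ridge coefficients:", ridge_coefs)
print("Lasso coefficients equal to zero:", np.sum(lasso_coefs == 0))
print("Ridge coefficients equal to zero:", np.sum(ridge_coefs == 0))
```

Lasso sets the coefficients of the two noise features to exactly zero, while ridge leaves them small but non-zero.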

Below is an example of Lasso regression, using a prediction and train/test split.

In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

alpha_value = 1000

lasso_model = Lasso(alpha=alpha_value)

# Split the training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

# Train model
lasso_model.fit(X_train, y_train)

# Predict values
y_pred = lasso_model.predict(X_test)

# Get the MAE score from the test and prediction data.
MAE_value = mean_absolute_error(y_test, y_pred)

print("Value of α:", alpha_value)
print("MAE score: \n", MAE_value)

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 5:</font></b> 

<p>

Copying the code above, change the value of <b>alpha_value</b> in order to improve the mean MAE value. Which gets a better MAE value, Ridge or Lasso? Should we compare a single test set to an average cross validation score? Why or why not? It is tedious to change the <b>alpha_value</b> by hand, what could we do to make this better?

</p> </div>

In [None]:
# Write your code here


Below is a demonstration of how the different parameter weights change with the value of $\alpha$.

In [None]:
# Create the model object
lasso_model = Lasso()

# set a range of alpha values
alphas = np.logspace(-6, 6, 200)

# Train the model with different regularisation strengths
coefs = []
MSEs = []
for each_value in alphas:
    lasso_model.set_params(alpha=each_value)
    lasso_model.fit(X_train, y_train)
    coefs.append(lasso_model.coef_)
    y_pred = lasso_model.predict(X_test)
    MSEs.append(mean_squared_error(y_test, y_pred))

# Plot parameters of the model as they change with alpha.
fig, ax = plt.subplots(figsize=(8,6))
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel(r'$\alpha$')
plt.ylabel('Feature weights')
plt.title('Lasso variable coefficients compared to MSE')
plt.axis('tight')
labels = scaled_data.columns.to_list()

ax2 = ax.twinx()
ax2.plot(alphas, MSEs, color="red")
ax2.set_ylabel("MSE", color="red")
ax2.tick_params(axis='y', labelcolor="red")
labels.append("MSE")
fig.legend(labels, loc=(0.13, 0.4));

As with the ridge regression we can see that the value of $\alpha$ significantly impacts the weights of the different coefficients in the model, so we should take care and choose our $\alpha$ value empirically. We can see here that the different features reach zero at different values of $\alpha$, which lets lasso remove unhelpful features entirely, unlike ridge regularisation.

The value of $\alpha$ is just one example of a *hyperparameter*; its value impacts the model's performance. Each type of model will have different hyperparameters and it is therefore important to understand how they should be used.


## Summary


At the end of this chapter you should now:

- Understand univariate and multi-variable regression
- Know how to predict new values
- Have multiple ways of quantifying the quality of our regression
- Be able to use cross validation
- Be able to use regularised regression models
- Understand what hyperparameters are and that we can tune them


We can now load in our data, clean and prepare it, then use it from start to finish to make machine learning predictions!

<div class="alert alert-block alert-success">
<b><font size="4"> Next Chapter: Classification</font> </b> 
<p> 
The next chapter will discuss classification, the prediction of categorical values. We will use census data to predict whether or not an individual's salary is $\geq50K$. The chapter will also explain classification evaluation and more advanced model optimization. 
</p>
</div>