# Supervised Learning

![](https://s-media-cache-ak0.pinimg.com/564x/fe/aa/1a/feaa1a16a315823b2d9ad24da7eccdaf.jpg)

In [3]:
# %load utils/imports.py
import numpy as np
import pandas as pd

from utils import *
from utils.plotting import *

from utils.styles import *

from IPython.display import IFrame

# Ordinary Least Squares

## Why Regression?

There are a few major uses for regression analysis:

1. correlation analysis - determining if $X$ is correlated with $y$.
1. forecasting an effect - predicting if a change in $X$ will also change $y$.
1. trend forecasting - determining if changes in $X$ are causing $y$ to trend in a certain direction.
1. influence analysis - determining the strength of relationships between two or more variables or correlated relationships between one or more independent and one dependent variable.

## A Visual Introduction

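One of the simplest ways to fit a model to a set of training data is linear regression, which fits a straight line to the dataset. Ordinary least squares chooses the line that minimizes the sum of squared vertical distances between the points and the line.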

In [4]:
IFrame(src='http://setosa.io/ev/ordinary-least-squares-regression/', width='100%', height=800)

## A Gentle Introduction with Python

In Scikit-learn this can be accomplished using the `LinearRegression` linear model. In these models your input features are typically referred to as $X$ and your target is $y$.

To begin, let's create a trivial example to show how this works. We will create a set of training points `(10, 10)`, `(20, 20)`, `(30, 30)` and a set of test points that fall on the same line. `LinearRegression` should find and fit this line!

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

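def lin_reg(X_train, y_train, X_test, y_test, graph=True, normalize=False):
    """Fit LinearRegression on the training data and report test-set metrics."""
    # Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2.
    # On newer versions, drop it and scale features beforehand instead.
    regr = LinearRegression(normalize=normalize)
    regr.fit(X_train, y_train)
    predictions = regr.predict(X_test)
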
    if graph:
        # Graphing with Plotly via Cufflinks
        df = pd.concat([X_test, y_test], axis=1)
        df.columns = ['X','y']
        df.set_index('y').iplot(
            mode="markers",
            title="Estimation with LinearRegression",
            bestfit=True,
            colors=['#F7CD94'],
            bestfit_colors=['#F794AA'],
            error_type='data')
    
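    # Report test-set error metrics and the fitted slope(s)
    mse = mean_squared_error(y_test, predictions)
    print('Mean Squared Error: {:.2}'.format(mse))
    print('Root Mean Squared Error: {:.2}'.format(np.sqrt(mse)))
    # regr.score returns the coefficient of determination (R^2) on the test data
    print('Variance Score: {:.2}'.format(regr.score(X_test, y_test)))
    print('Coefficients:', regr.coef_)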
    
    return regr

X_train = pd.DataFrame([10, 20, 30])
y_train = pd.DataFrame([10, 20, 30])
X_test = pd.DataFrame([15, 25, 35])
y_test = pd.DataFrame([15, 25, 35])

model = lin_reg(X_train, y_train, X_test, y_test)



Mean Squared Error: 0.0
Root Mean Squared Error: 0.0
Variance Score: 1.0
Coefficients: [[ 1.]]



### Analysis

As we can see, the line was a perfect fit for our test data points. Notice that we printed a few other pieces of information here:

#### Mean Squared Error (MSE)

$$\textrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (prediction_i - actual_i)^2$$

MSE is a measure of the error between the predicted points and the actual results. Essentially this means that for each predicted point we compute the Euclidean distance from it to the actual point, square that value, sum all the squared values together, and finally divide by the number of points to get the mean or average.

For this simple example, we had no error. This is not typical of real datasets!
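
To make the formula concrete, here is a minimal sketch that computes MSE by hand with NumPy. The values are made up for illustration (they are not from this notebook's data), and `mse_by_hand` is a hypothetical helper name. The result should agree with `mean_squared_error` from scikit-learn for the same inputs.

```python
import numpy as np

def mse_by_hand(actual, prediction):
    """Mean of the squared differences between predictions and actual values."""
    actual = np.asarray(actual, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return np.mean((prediction - actual) ** 2)

# Illustrative values only (not from the notebook's dataset)
actual = [15, 25, 35]
prediction = [16, 24, 38]

# Errors are 1, -1, 3 -> squared 1, 1, 9 -> mean is 11 / 3
print(mse_by_hand(actual, prediction))
```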

#### Root Mean Squared Error (RMSE)

$$\textrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (prediction_i - actual_i)^2}$$

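RMSE is simply the square root of the MSE. The practical difference from MSE is that RMSE is expressed in the same units as the target, which makes it easier to interpret. Like MSE, it punishes large errors severely because each error is squared before averaging.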
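
As a small sketch of how squaring lets a single large error dominate, we can compare two made-up error vectors that have the same total absolute error. The helper name `rmse` and the numbers are assumptions for illustration only.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error from an array of (prediction - actual) errors."""
    errors = np.asarray(errors, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

spread_out = [1, 1, 1, 1]  # four small errors, total absolute error 4
one_big = [0, 0, 0, 4]     # one large error, total absolute error 4

print(rmse(spread_out))  # sqrt(4 / 4)  = 1.0
print(rmse(one_big))     # sqrt(16 / 4) = 2.0
```

Even though both vectors miss by the same total amount, the one containing a single large miss scores twice as badly.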

#### Root Mean Squared Logarithmic Error (RMSLE)

$$\textrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(prediction_i + 1) - \log(actual_i + 1))^2}$$

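RMSLE is very similar to RMSE, but compares the natural logs of (prediction + 1) and (actual + 1) instead of the raw values. This results in a metric that measures relative rather than absolute error, and that penalizes an under-predicted estimate more heavily than an over-predicted estimate of the same absolute size.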
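
The asymmetry is easy to check numerically. The sketch below uses a hypothetical helper `rmsle` and made-up values: with an actual value of 100, we under-predict and over-predict by the same amount (50) and compare the scores.

```python
import numpy as np

def rmsle(actual, prediction):
    """Root mean squared logarithmic error; inputs must be greater than -1."""
    actual = np.asarray(actual, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    # np.log1p(x) computes log(x + 1)
    return np.sqrt(np.mean((np.log1p(prediction) - np.log1p(actual)) ** 2))

actual = [100]
under = rmsle(actual, [50])   # under-predict by 50
over = rmsle(actual, [150])   # over-predict by 50

print(under, over)  # the under-prediction gets the larger score
```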

#### Variance Score ($R^2$)

The variance score is also known as the coefficient of determination, typically denoted $R^2$ and pronounced R-squared. One advantage of $R^2$ is that it typically falls between 0 and 1, although it can be negative when a model fits worse than always predicting the mean, for example on held-out data or in a model without an intercept. Typically, the closer to 1 you are, the better your model predicts the data, which makes it easier to interpret than MSE. The downside is that it is harder to tell how far predictions deviate, on average, from the actual values in the dataset.

$R^2$ is a useful and intuitive metric, but it can be [misleading](http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit). Low scores are not inherently bad and high scores cannot be presumed to be good. You should generally evaluate $R^2$ values in conjunction with residual plots, other model statistics, and knowledge of your dataset.

To understand what $R^2$ represents it's good to know what baseline you are comparing against. This baseline is the best possible guess _without_ the aid of a model. This value is known as the _expected_ value, and you would know it as the _mean_. $R^2$ is based on the following three _(S)um of (S)quares_:

$$SST = \sum_{i=1}^{n} (y_i-\bar{y})^2$$
$$SSR = \sum_{i=1}^{n} (\hat{y}_i-\bar{y})^2$$
$$SSE = \sum_{i=1}^{n} (y_i-\hat{y}_i)^2$$

![](assets/total-regression-error-sum-of-squares.png)

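These are explained as:
* $SST$ = Total Sum of Squared Deviations in $y$ from its mean $\bar{y}$
* $SSR$ = Sum of Squares due to Regression (deviations of the predictions $\hat{y}_i$ from the mean $\bar{y}$)
* $SSE$ = Sum of Squared Residuals (deviations of the actual values $y_i$ from the predictions $\hat{y}_i$)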

Finally, $R^2$ is defined as

$$R^2 = 1 - \frac{SSE}{SST}$$

So a good way of thinking about $R^2$ is as the fraction of the _total_ variance in the data that your model explains.
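
Here is a minimal sketch of these quantities on a small made-up dataset (not the notebook's data). It fits a least-squares line with `np.polyfit` rather than `LinearRegression`, purely to keep the example self-contained. It also shows a deliberately bad set of "predictions" to illustrate how $R^2$ can go negative.

```python
import numpy as np

# Illustrative data: roughly linear with a little scatter
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares line with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)  # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)      # sum of squared residuals

r2 = 1 - sse / sst
print(sst, ssr, sse, r2)

# For a least-squares fit with an intercept, scored on its own training data,
# SST = SSR + SSE holds up to floating-point error.
print(np.isclose(sst, ssr + sse))

# Predictions that are worse than always guessing the mean give a negative R^2
bad_predictions = y[::-1]
r2_bad = 1 - np.sum((y - bad_predictions) ** 2) / sst
print(r2_bad)
```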

#### Coefficients

Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable while holding other predictors in the model constant. This statistical control that regression provides is important because it isolates the role of one variable from all of the others in the model.

The key to understanding the coefficients is to think of them as slopes, and they’re often called slope coefficients. In our example above we can see that there is one coefficient for our single feature and it has a value of $1.0$. This indicates that for every change of $1$ in $X$ we get a change of $1$ in $y$, which is exactly what our line shows.

Coefficients can be useful in assessing which of your features have the most profound effect on your model. Coefficients that have a higher absolute value typically indicate a greater influence on your model while those with a lower absolute value influence your model less - provided that they are of the same scale!
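
The caveat about scale matters more than it might seem. In the hypothetical sketch below, the same length feature is recorded once in metres and once in centimetres. The names and values are made up for illustration, and `np.polyfit` stands in for `LinearRegression` to keep the example short.

```python
import numpy as np

# Hypothetical feature: the same lengths in two different units
length_m = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * length_m + 1.0      # exact linear relationship, no noise
length_cm = length_m * 100.0  # same information, different scale

def fit_slope(feature, target):
    """Slope of the least-squares line target ~ slope * feature + intercept."""
    slope, _intercept = np.polyfit(feature, target, deg=1)
    return slope

slope_m = fit_slope(length_m, y)
slope_cm = fit_slope(length_cm, y)
print(slope_m, slope_cm)  # approximately 2.0 and 0.02
```

The feature is exactly as influential in both fits, yet its coefficient shrinks by a factor of 100, so raw coefficient sizes are only comparable between features measured on the same scale.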

Interpreting coefficients can be quite tricky though, and more detailed explanations are beyond the scope of this session. If you are interested in more, here are a few resources to help you along:

* [Interpreting Regression Coefficients](http://www.theanalysisfactor.com/interpreting-regression-coefficients/)
* [How to Interpret Regression Analysis Results: P-values and Coefficients](http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients)
* [Common Mistakes in Interpretation of Regression Coefficients](https://www.ma.utexas.edu/users/mks/statmistakes/regressioncoeffs.html)

#### QUIZ

1. What values are we looking for when we consider $SSE$? What is the best value we could potentially have?
1. What is the best value we could have for $R^2$ ?
1. What’s the primary difference between these two values?

## Introducing the Inevitable Errors

Let's try another example that adds a little noise to our input values so we can see how this affects the model.

In [35]:
def make_some_noise(seed):
    np.random.seed(seed)
    return pd.DataFrame(np.random.randint(0, 7, size=(3, 1)))

X_noise = make_some_noise(1)
X_train_noisy = X_train + X_noise
X_test_noisy = X_test + X_noise

_ = lin_reg(X_train_noisy, y_train, X_test_noisy, y_test)



Mean Squared Error: 0.6
Root Mean Squared Error: 0.77
Variance Score: 0.99
Coefficients: [[ 1.04395604]]


### Analysis
We can see that with a little noise on our inputs we still get good results, but we are no longer able to make perfect predictions. This is an important point because noise is very common in real data. It is possible to have noise on your input feature values, on your target values, and often on both!
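
The example above put noise on the input features. As a rough sketch of the other case, the snippet below adds Gaussian noise to the target values instead. The data, the noise level, and the seed are all assumptions chosen for illustration, and `np.polyfit` is used in place of our `lin_reg` helper so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

x = np.arange(10, 110, 10, dtype=float)  # 10, 20, ..., 100
y_clean = x.copy()                       # true relationship: y = x
y_noisy = y_clean + rng.normal(loc=0.0, scale=2.0, size=x.shape)

# Fit a least-squares line to the noisy targets
slope, intercept = np.polyfit(x, y_noisy, deg=1)
residuals = y_noisy - (slope * x + intercept)
mse = np.mean(residuals ** 2)

print('slope:', slope)
print('training MSE:', mse)
```

The fitted slope should land close to the true value of 1, but the error is no longer zero, just as we saw with noisy inputs.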

Noise can come from many places. A few common examples include:
- Sensor error - sensor readings aren't always perfect.
- Malicious error - intentionally bad datasets.
- Transcription error - someone makes a mistake recording data.
- Unmodeled influences - you may not account for all features or you may account for the wrong features. For example, if you only look at price you can't make a great prediction about how stocks will move in the market.