# Establishing a Baseline

For most human endeavors, it helps to begin by establishing a baseline against which you can track performance over time. For instance, if you have decided to lose weight, your baseline would be your starting weight. You could also record the number and type of calories you eat for a few weeks before beginning a diet. Similar baselines can be made for lifting weights, running, learning data science, etc. Without a baseline, tracking progress becomes more difficult.

### No previous baseline was established

In our previous work with the housing dataset, we never established a baseline. We went straight into modeling with a single-variable linear regression. Let's read in the housing dataset and reproduce this result.

In [None]:
import pandas as pd
import numpy as np
housing = pd.read_csv('../data/housing_sample.csv')
housing.head()

In [None]:
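# Use a single feature to predict the sale price
X = housing[['GrLivArea']]
y = housing['SalePrice']

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)
# score returns R^2, computed here on the same data used for fitting
lr.score(X, y)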

## How to establish a baseline?

A typical baseline is going to be a very simple model that we can build without much thought. A common simple model for supervised regression is predicting the same value for each observation, such as the mean or median.

### A baseline is built into linear regression

Predicting the mean of the target for every observation yields an $R^2$ of exactly 0 when scored on the same data. Recall that $R^2 = 1 - \frac{SSE_{model}}{SSE_{mean}}$. If the model itself is the mean, then $SSE_{model} = SSE_{mean}$, the ratio equals 1, and $R^2 = 1 - 1 = 0$.
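The arithmetic can be checked by hand. The following is a minimal sketch that uses a small made-up target array (`y_demo` is hypothetical, not the housing data) and computes both sums of squared errors directly with NumPy.

```python
import numpy as np

# Hypothetical target values, made up purely for illustration
y_demo = np.array([100.0, 150.0, 200.0, 250.0, 300.0])

# The "model" predicts the mean of y for every observation
mean_pred = np.full_like(y_demo, y_demo.mean())

# Sum of squared errors of the model and of the mean baseline
sse_model = np.sum((y_demo - mean_pred) ** 2)
sse_mean = np.sum((y_demo - y_demo.mean()) ** 2)

# The two sums are identical, so the ratio is 1 and R^2 is 0
r2_mean_baseline = 1 - sse_model / sse_mean
print(r2_mean_baseline)
```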

## The `dummy` module

Scikit-learn provides a formal estimator for building a baseline. The `DummyRegressor`, found in the `dummy` module, defaults to predicting the mean of the training target for each observation. The input data `X` is completely ignored by the `fit` method.

In [None]:
from sklearn.dummy import DummyRegressor
dr = DummyRegressor()
dr.fit(X, y)
dr.predict(X)

Let's verify that the mean of the sale price is what the `DummyRegressor` predicts.

In [None]:
y.mean()

Use the `score` method to verify that $R^2$ is 0.

In [None]:
dr.score(X, y)
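To see that the input features really play no role, the sketch below fits a `DummyRegressor` on two unrelated feature arrays with the same target and compares the predictions. The arrays `X_a`, `X_b`, and `y_demo` are made up for illustration. It also shows the `strategy='median'` option, in case the median is a more suitable baseline than the default mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Two unrelated, hypothetical feature arrays paired with the same target
X_a = np.array([[1.0], [2.0], [3.0], [4.0]])
X_b = np.array([[-50.0], [0.0], [999.0], [7.0]])
y_demo = np.array([10.0, 20.0, 30.0, 100.0])

# Default strategy: predict the mean of the training target
pred_a = DummyRegressor().fit(X_a, y_demo).predict(X_a)
pred_b = DummyRegressor().fit(X_b, y_demo).predict(X_b)

# Alternative strategy: predict the median of the training target
median_pred = DummyRegressor(strategy='median').fit(X_a, y_demo).predict(X_a)

print(pred_a)       # every entry is the mean of y_demo
print(pred_b)       # identical to pred_a even though the features differ
print(median_pred)  # every entry is the median of y_demo
```

With a skewed target such as the one above, the mean and median baselines differ noticeably, so it can be worth scoring both.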

### A negative $R^2$ means you have a terrible model

If you build a model that has a negative value for $R^2$, its squared error is larger than that of simply guessing the mean. You have built a terrible model that cannot beat the baseline.
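As a small illustration, the sketch below scores a deliberately bad set of predictions with `r2_score` from `sklearn.metrics`. The values are made up: the "predictions" are just the true values in reverse order, so they are further from the truth than the mean is.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and deliberately bad predictions (reversed order)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bad_pred = y_true[::-1].copy()

# Baseline: guess the mean for every observation
mean_pred = np.full_like(y_true, y_true.mean())

r2_bad = r2_score(y_true, bad_pred)    # SSE of 40 against a baseline SSE of 10
r2_mean = r2_score(y_true, mean_pred)  # the baseline itself

print(r2_bad)   # negative: worse than guessing the mean
print(r2_mean)  # 0 for the mean baseline
```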

### Slowly build more complex models

When doing machine learning, it's good practice to begin with a simple baseline model. Then, slowly and incrementally build more complex models while recording the score so that you can track progress.
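One way to record scores as models grow more complex is to keep them in a dictionary keyed by a short description. The sketch below assumes synthetic data generated with NumPy (not the housing dataset) and scores each model on its training data, as the earlier cells do. The names `candidates` and `scores` are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends on two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(size=200)

# Each candidate pairs an estimator with the features it is allowed to use
candidates = {
    'baseline (mean)': (DummyRegressor(), X_demo),
    'one feature': (LinearRegression(), X_demo[:, [0]]),
    'two features': (LinearRegression(), X_demo),
}

# Fit each model and record its R^2 so that progress can be compared
scores = {}
for name, (model, features) in candidates.items():
    model.fit(features, y_demo)
    scores[name] = model.score(features, y_demo)

for name, r2 in scores.items():
    print(f'{name}: {r2:.3f}')
```

Each new model is judged against the baseline and against the simpler model before it, which makes it easy to tell whether the added complexity is paying off.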

## Exercises

### Problem 1
<span  style="color:green; font-size:16px">Use the DummyRegressor on other numeric features and verify that r-squared is 0.</span>