## What We'll Do In This Notebook

Now is your chance to put it all together. There'll be no lecturing in this notebook. After you run the code to load in the data, you're up.

There are some data cleaning techniques you'll need for this notebook that we did not explicitly touch on in our lectures. Feel free to follow along with Notebook 7 - Predicting California Housing Prices, as you work through this notebook.

In addition, you may want to look at how to make more sophisticated pipelines than what we've done up to this point. In Notebook 7 we detail how to make a proper pipeline for transforming both numerical and categorical data. Up to this point, our pipelines have only dealt with one or the other.
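
As a rough illustration of a pipeline that handles numerical and categorical columns together, here is a minimal sketch using `ColumnTransformer`. The toy data frame, its column names (`rooms`, `age`, `zone`) and the choice of `StandardScaler`, `OneHotEncoder` and `LinearRegression` are all assumptions made for the example. They are not taken from Notebook 7 or from the Boston data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rooms": rng.normal(6, 1, size=100),
    "age": rng.uniform(0, 100, size=100),
    "zone": rng.choice(["a", "b", "c"], size=100),
})
toy_y = 5 * toy["rooms"] - 0.05 * toy["age"] + rng.normal(0, 1, size=100)

# Scale the numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["rooms", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["zone"]),
])

# Chain the preprocessing with a regression model
pipe = Pipeline([
    ("preprocess", preprocess),
    ("reg", LinearRegression()),
])

pipe.fit(toy, toy_y)
toy_pred = pipe.predict(toy)
```

Because the transformers live inside the pipeline, they are refit on whatever training data the pipeline is given, which keeps holdout data out of the scaling and encoding steps.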


Your goal is to take the given features and predict the median value of owner-occupied homes in $1000's. This is denoted as `MEDV` in the data and stored in `y` below. In the code chunk with `print(boston['DESCR'])` you can read through a description of the data to get a sense of the problem.

Try to build a model with the lowest average root mean squared error (RMSE) across the splits of a cross-validation. Feel free to use any of the techniques we've covered up to this point, but nothing else. It's okay if you don't finish this in class or even in one sitting. The point is to expose you to a somewhat realistic modeling process.
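
For reference, averaging the root mean squared error over cross-validation splits might look like the sketch below. It runs on synthetic data so that it stands alone. The names `X_demo` and `y_demo`, the five folds, the `random_state` value and the use of `LinearRegression` are all assumptions for illustration. Swap in your own features, target and model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical synthetic regression data (not the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Five shuffled folds; the seed only makes the split reproducible
kfold = KFold(n_splits=5, shuffle=True, random_state=440)
rmses = np.zeros(5)

for i, (train_idx, holdout_idx) in enumerate(kfold.split(X_demo)):
    # Fit on the training folds only
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Score on the held-out fold
    pred = model.predict(X_demo[holdout_idx])
    rmses[i] = np.sqrt(mean_squared_error(y_demo[holdout_idx], pred))

# The quantity to minimize: the mean of the per-fold RMSEs
avg_cv_rmse = rmses.mean()
print("RMSE per fold:", np.round(rmses, 3))
print("Average CV RMSE:", round(avg_cv_rmse, 3))
```

Comparing candidate models on the same `KFold` object keeps the comparison fair, since every model is then scored on identical splits.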

In [1]:
# Import packages

## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

In [2]:
# import the boston housing data

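# First get the load data function from sklearn
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs scikit-learn < 1.2
from sklearn.datasets import load_boston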

In [3]:
boston = load_boston()

# The features are stored in X
X = boston['data']

# The target is stored in y
y = boston['target']

# Here it is as a DataFrame so you can see what the
# data looks like
boston_df = pd.DataFrame(X, columns=boston['feature_names'])
boston_df['MEDV'] = y

In [4]:
print(boston['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu