# Overview
This report presents the Python code and explanations for Project 1, Predicting Boston Housing Prices. It was created in a Jupyter notebook and uses Python 2.7 features.

The module, boston_housing.py, can be executed individually without this notebook.

# 1) Statistical Analysis and Data Exploration

The code snippet below calculates the following characteristics of the data set:

* Number of data points and features
* Minimum, maximum, mean and median of the housing prices
* Standard deviation of the housing prices

The data, which is built into the sklearn library, originally comes from the StatLib library maintained at Carnegie Mellon University. You can read more about the data here:

* https://archive.ics.uci.edu/ml/datasets/Housing

############################################################
# LOAD LIBRARIES
############################################################

import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor

def load_data():
    """Load the Boston dataset."""

    boston = datasets.load_boston()
    return boston

def explore_city_data(city_data):
    """Calculate the Boston housing statistics."""

    # Get the labels and features from the housing data
    housing_prices = city_data.target
    housing_features = city_data.data
    
    # Calculate features per project requirements
    print 'Size of data (number of houses): ' + str(len(housing_prices))
    print 'Number of features: ' + str(housing_features.shape[1])
    
    # Note: Median value of owner-occupied homes in $1000's. See
    # https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names
    print 'Minimum price: $' + str(np.min(housing_prices)) + 'K'
    print 'Maximum price: $' + str(np.max(housing_prices)) + 'K'
    print 'Mean price: $' + str(round(np.mean(housing_prices),1)) + 'K'
    print 'Median price: $' + str(round(np.median(housing_prices),1)) + 'K'
    print 'Standard deviation: $' + str(round(np.std(housing_prices),2)) + 'K'
    
# Load data
city_data = load_data()

# Explore the data
explore_city_data(city_data)

Size of data (number of houses): 506
Number of features: 13
Minimum price: $5.0K
Maximum price: $50.0K
Mean price: $22.5K
Median price: $21.2K
Standard deviation: $9.19K


# 2) Evaluating Model Performance
## Measure of Model Performance

Since this is a regression analysis (and not classification), we need metrics that measure prediction error on continuous outputs. The following performance metrics satisfying this requirement are available in scikit-learn:

* Mean absolute error (MAE)
* Mean squared error (MSE)
* Median absolute error

Of these, the median absolute error was selected because it is robust to outliers: a few very large errors barely move the median, whereas they can dominate the MAE and especially the MSE.

Note that R-squared is also available for regression; however, it is not a valid error metric in the context of the supplied assignment.
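
To illustrate the effect of an outlier on each metric, the sketch below computes all three with NumPy on a small set of hypothetical prices (not taken from the Boston data) in which one prediction is badly off. The values and variable names are assumptions made for illustration; scikit-learn's `mean_absolute_error`, `mean_squared_error` and `median_absolute_error` in `sklearn.metrics` compute the same quantities.

```python
import numpy as np

# Hypothetical true and predicted prices in $1000's (illustrative only).
# The last prediction is a deliberate outlier.
y_true = np.array([20.0, 22.0, 19.0, 25.0, 21.0])
y_pred = np.array([21.0, 21.0, 20.0, 24.0, 41.0])

# Absolute errors: [1, 1, 1, 1, 20]
abs_err = np.abs(y_true - y_pred)

mae = np.mean(abs_err)         # mean absolute error
mse = np.mean(abs_err ** 2)    # mean squared error
medae = np.median(abs_err)     # median absolute error

print('MAE: %.1f, MSE: %.1f, median absolute error: %.1f' % (mae, mse, medae))
```

In this example the single outlier pulls the MAE to 4.8 and the MSE to 80.8, while the median absolute error stays at 1.0.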

## Splitting into Training & Testing Data
For all machine learning examples, we want to separate the data into training and test datasets. We use the former to fit the model and the latter to evaluate its performance.

A model that is fit and evaluated on the same data can hide high variance error, which is a result of overfitting the dataset. In other words, it can make great predictions on the data it has already seen but poor predictions when faced with unseen data. Holding out a test set exposes this problem.
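
A minimal sketch of this idea, assuming a recent scikit-learn where `train_test_split` lives in `sklearn.model_selection` (older releases exposed it in `sklearn.cross_validation`). The data is a synthetic noisy sine curve, not the Boston data, and every variable name is chosen for illustration only.

```python
import numpy as np
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve with one feature.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Hold out 30% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unrestricted tree is free to memorize the training data.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

train_error = median_absolute_error(y_train, model.predict(X_train))
test_error = median_absolute_error(y_test, model.predict(X_test))

print('Training error: %.3f' % train_error)
print('Testing error: %.3f' % test_error)
```

The training error alone would suggest a near-perfect model; only the held-out test error shows how the model behaves on data it has not seen.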

## Grid Search & Cross Validation
Grid search is a technique for tuning the hyperparameters (e.g. maximum depth, C) used in machine learning algorithms. It iterates over every combination of the supplied hyperparameter values and determines which combination yields the best performance.

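Cross-validation is a technique where part of the data is held out before training begins and is later used to evaluate the performance of the model. In k-fold cross-validation, the data is split into k folds and each fold takes a turn as the held-out set, so every data point is used for both training and evaluation.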
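
The rotation of held-out data in k-fold cross-validation can be sketched with plain NumPy index arrays. This is an illustration of the idea under assumed values (10 samples, 3 folds), not the splitting code scikit-learn uses internally.

```python
import numpy as np

n_samples, k = 10, 3              # illustrative values
indices = np.arange(n_samples)

# Split the sample indices into k roughly equal folds.
folds = np.array_split(indices, k)

splits = []
for i in range(k):
    # Fold i is held out for evaluation; the remaining folds form the training set.
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train_idx, test_idx))

for train_idx, test_idx in splits:
    print('train: %s  test: %s' % (train_idx.tolist(), test_idx.tolist()))
```

Each sample lands in the held-out fold exactly once, so the averaged score reflects the whole data set rather than one arbitrary split.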

Note that by default, grid search uses 3-fold cross-validation. Cross-validation is useful when combined with grid search to ensure that our selected hyperparameters do not result in a model that overfits the data.
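
As a hedged sketch of how the two fit together, the example below tunes `max_depth` of a `DecisionTreeRegressor` with `GridSearchCV`, scoring by median absolute error. It assumes a recent scikit-learn (`sklearn.model_selection`) and synthetic data rather than the Boston data. The `cv=3` argument is passed explicitly because newer scikit-learn releases use a different default, and the depth range of 1 to 10 is an assumption.

```python
import numpy as np
from sklearn.metrics import make_scorer, median_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve with one feature.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Lower error is better, so the scorer negates the metric internally.
scorer = make_scorer(median_absolute_error, greater_is_better=False)

# Candidate hyperparameter values to try.
param_grid = {'max_depth': list(range(1, 11))}

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid, scoring=scorer, cv=3)
search.fit(X, y)

best_depth = search.best_params_['max_depth']
print('Best max_depth: %d' % best_depth)
print('Cross-validated score (negated error): %.3f' % search.best_score_)
```

Because the scorer is built with `greater_is_better=False`, `best_score_` is reported as a negative number; its absolute value is the cross-validated median absolute error.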

# 3) Analyzing Model Performance
## Trends of Training & Testing Error
As the training size increases, the training and testing errors generally converge to the same absolute error. For example, the two errors nearly converge to an error of about 2 in the image below.

<img src='learning-graph-example.png'>
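
A learning curve of this kind can be approximated by refitting the model on progressively larger slices of the training set. The sketch below is one way to do that under stated assumptions: synthetic data instead of the Boston data, a fixed `max_depth` of 3, and hand-picked training sizes. It is not the plotting code from `boston_housing.py`.

```python
import numpy as np
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve with one feature.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Training set sizes to evaluate (illustrative choice).
sizes = [20, 50, 100, 150, len(X_train)]
train_errors, test_errors = [], []

for n in sizes:
    # Fit on the first n training samples only.
    model = DecisionTreeRegressor(max_depth=3, random_state=0)
    model.fit(X_train[:n], y_train[:n])
    train_errors.append(
        median_absolute_error(y_train[:n], model.predict(X_train[:n])))
    # Always evaluate on the full test set.
    test_errors.append(
        median_absolute_error(y_test, model.predict(X_test)))

for n, tr, te in zip(sizes, train_errors, test_errors):
    print('n=%3d  train error=%.3f  test error=%.3f' % (n, tr, te))
```

Plotting `train_errors` and `test_errors` against `sizes` gives the two curves of a learning graph.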

## Decision Tree Regressor
With a max depth of 1, the fully trained decision tree regressor suffers from high bias (underfitting): the training and test errors both converge to the same high value.
<img src='learning-graph-example-depth-1.png'>
With a max depth of 10, the model suffers from high variance (overfitting): the training error is close to zero while the test error remains high.
<img src='learning-graph-example-depth-10.png'>

## Model Complexity
The training set error decreases steadily toward zero as the model complexity increases; however, the test set error plateaus at a steady-state value with increasing complexity.

From just this graph, a max depth of 5 appears to balance underfitting against overfitting. This is the point where the test error starts to plateau, just before the training and test errors diverge.

<img src='model-complexity.png'>
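
The same comparison can be made numerically by sweeping `max_depth` and recording both errors. The sketch below assumes synthetic data and a depth range of 1 to 10. Picking the depth with the lowest test error is only one simple selection rule and is not necessarily how the depth above was chosen, which was read from the graph.

```python
import numpy as np
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve with one feature.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

depths = list(range(1, 11))
train_errors, test_errors = [], []

for depth in depths:
    # Deeper trees are more complex models.
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_errors.append(
        median_absolute_error(y_train, model.predict(X_train)))
    test_errors.append(
        median_absolute_error(y_test, model.predict(X_test)))

# Simple selection rule (assumption): the depth with the lowest test error.
best_depth = depths[int(np.argmin(test_errors))]

for depth, tr, te in zip(depths, train_errors, test_errors):
    print('depth=%2d  train error=%.3f  test error=%.3f' % (depth, tr, te))
print('Depth with lowest test error: %d' % best_depth)
```

On a single train/test split the lowest-error depth can shift with the random seed, which is one reason to prefer a cross-validated grid search for the final choice.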

# 4) Model Prediction
Running the GridSearchCV command several times for:

House: [11.95, 0.0, 18.1, 0, 0.659, 5.609, 90.0, 1.385, 24, 680.0, 20.2, 332.09, 12.13]

yields a prediction between \$20K and \$21K, with optimal models having a max depth between 5 and 7. Regardless of the depth selected, the prediction appears reasonable: it lies close to both the median (\$21.2K) and the mean (\$22.5K), well within one standard deviation (\$9.19K) of the mean.

In other words, the feature values of the provided home correspond to an average-priced home.
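
As a quick sanity check, the predicted range can be expressed in standard deviations from the mean, using the statistics reported in section 1. The variable names below are chosen for illustration.

```python
# Statistics reported in section 1 (in $1000's).
mean_price = 22.5
std_price = 9.19

# Lower and upper ends of the predicted price range (in $1000's).
predictions = [20.0, 21.0]

# Distance of each prediction from the mean, in standard deviations.
z_scores = [(p - mean_price) / std_price for p in predictions]

for p, z in zip(predictions, z_scores):
    print('Prediction $%.1fK is %.2f standard deviations from the mean' % (p, z))
```

Both ends of the range fall roughly 0.16 to 0.27 standard deviations below the mean, which supports describing the home as average-priced.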