# Overview
This report presents the Python code and accompanying explanations for Project 1, Predicting Boston Housing Prices. It was written in a Jupyter notebook and targets Python 2.7.

The module, boston_housing.py, can also be run on its own, without this notebook.

# Step 1: Statistical Analysis and Data Exploration

The code snippet below calculates the following characteristics of the data set:

* Number of data points and features
* Minimum, maximum, mean, and median housing prices
* Standard deviation of the housing prices

The data, which is built into the sklearn library, originally comes from the StatLib library maintained at Carnegie Mellon University. You can read more about the data here:

* https://archive.ics.uci.edu/ml/datasets/Housing

In [43]:
############################################################
# LOAD LIBRARIES
############################################################

import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor

def load_data():
    """Load the Boston dataset."""

    boston = datasets.load_boston()
    return boston

def explore_city_data(city_data):
    """Calculate the Boston housing statistics."""

    # Get the labels and features from the housing data
    housing_prices = city_data.target
    housing_features = city_data.data
    
    # Calculate features per project requirements
    print 'Size of data (number of houses): ' + str(len(housing_prices))
    print 'Number of features: ' + str(housing_features.shape[1])
    
    # Note: Median value of owner-occupied homes in $1000's. See
    # https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names
    print 'Minimum price: $' + str(np.min(housing_prices)) + 'K'
    print 'Maximum price: $' + str(np.max(housing_prices)) + 'K'
    print 'Mean price: $' + str(round(np.mean(housing_prices),1)) + 'K'
    print 'Median price: $' + str(round(np.median(housing_prices),1)) + 'K'
    print 'Standard deviation: $' + str(round(np.std(housing_prices),2)) + 'K'
    
# Load data
city_data = load_data()

# Explore the data
explore_city_data(city_data)

Size of data (number of houses): 506
Number of features: 13
Minimum price: $5.0K
Maximum price: $50.0K
Mean price: $22.5K
Median price: $21.2K
Standard deviation: $9.19K
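
One detail behind the last statistic: `np.std` divides by N by default (the population form, `ddof=0`), not by N - 1 (the sample form). The sketch below illustrates the difference in plain Python. The `std_dev` helper and the four prices are hypothetical and are not part of boston_housing.py.

```python
import math

def std_dev(values, ddof=0):
    """Standard deviation; ddof=0 mirrors the np.std default (divide by N)."""
    n = len(values)
    mean = sum(values) / float(n)
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))

prices = [5.0, 21.2, 22.5, 50.0]      # hypothetical prices in $1000's
population_std = std_dev(prices)          # same convention as np.std(prices)
sample_std = std_dev(prices, ddof=1)      # same convention as np.std(prices, ddof=1)
```

With 506 houses the two conventions differ only slightly, but the gap grows as the sample shrinks.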


# Step 2: Partition the Data Set
The train_test_split function from sklearn's cross_validation module partitions the data into training and testing sets. (Later scikit-learn releases moved this function to sklearn.model_selection.)

In [65]:
from sklearn.cross_validation import train_test_split

def split_data(city_data):
    """Randomly shuffle the sample set. Divide it into 70 percent training and 30 percent testing data."""

    # Get the features and labels from the Boston housing data
    X, y = city_data.data, city_data.target
    
    # Split the data into training and test sets (70-30 split)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    return X_train, y_train, X_test, y_test

# Load data
city_data = load_data()

# Training/Test dataset split
X_train, y_train, X_test, y_test = split_data(city_data)

# Print out metrics for splitting training/test data set for verification
print 'Size of X training set:' + str(X_train.shape) 
print 'Size of y training set:' + str(y_train.shape)
print 'Size of X test set:' + str(X_test.shape) 
print 'Size of y test set:' + str(y_test.shape)
print 'Ratio (training set): ' + str(round(float(X_train.shape[0])/(X_test.shape[0] + X_train.shape[0]),2))
print 'Ratio (test set): ' + str(round(float(X_test.shape[0])/(X_test.shape[0] + X_train.shape[0]),2))

Size of X training set:(354, 13)
Size of y training set:(354,)
Size of X test set:(152, 13)
Size of y test set:(152,)
Ratio (training set): 0.7
Ratio (test set): 0.3
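
To make the split sizes above less of a black box, here is a minimal sketch of a shuffle-and-cut split in plain Python. It is not scikit-learn's implementation: `split_indices` is a hypothetical helper, the shuffle order will differ from `random_state=0` in sklearn, and it assumes the test share is rounded up, which is consistent with the 354/152 split printed above.

```python
import math
import random

def split_indices(n_samples, test_size=0.3, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    n_test = int(math.ceil(test_size * n_samples))  # round the test share up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # seeded, so repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(506, test_size=0.3)
print('train: %d, test: %d' % (len(train_idx), len(test_idx)))
# train: 354, test: 152
```

Fixing the seed plays the same role as `random_state`: the same rows land in the test set on every run, so results stay comparable.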


# Step 3: Performance Metrics

Since this is a regression and not a classification model, both the mean squared error (MSE) and coefficient of determination (R^2) can be useful metrics. 

I've chosen to use R^2 since it is a standardized form of the MSE.
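
The link between the two metrics can be checked by hand: R^2 equals one minus the MSE divided by the variance of the true labels. The sketch below uses plain Python and made-up numbers. The `mse` and `r_squared` helpers are illustrative stand-ins, not the sklearn functions.

```python
def mse(y_true, y_pred):
    """Mean squared error between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / float(len(y_true))

def r_squared(y_true, y_pred):
    """R^2 = 1 - MSE / Var(y_true), using the population variance."""
    mean_true = sum(y_true) / float(len(y_true))
    variance = sum((t - mean_true) ** 2 for t in y_true) / float(len(y_true))
    return 1.0 - mse(y_true, y_pred) / variance

y_true = [20.0, 25.0, 30.0, 35.0]   # hypothetical prices in $1000's
y_pred = [22.0, 24.0, 29.0, 37.0]   # hypothetical model output

print(mse(y_true, y_pred))          # 2.5
print(r_squared(y_true, y_pred))    # approximately 0.92
```

Because the MSE is divided by the label variance, R^2 has no units, which makes it easier to compare across data sets than the raw MSE.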

In [None]:
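from sklearn.metrics import r2_score

def performance_metric(label, prediction):
    """Calculate and return the appropriate error performance metric."""

    # R^2 compares the model's squared error with that of always predicting
    # the mean label; 1.0 is a perfect fit
    return r2_score(label, prediction)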

# Evaluating Model Performance

* Which measure of model performance is best to use for predicting Boston housing data and analyzing the errors? 
* Why do you think this measurement is most appropriate? Why might the other measurements not be appropriate here?
* Why is it important to split the Boston housing data into training and testing data? What happens if you do not do this?
* What does grid search do and why might you want to use it?
* Why is cross validation useful and why might we use it with grid search?

## 3) Analyzing Model Performance

* Look at all learning curve graphs provided. What is the general trend of training and testing error as training size increases?
* Look at the learning curves for the decision tree regressor with max depth 1 and 10 (first and last learning curve graphs). When the model is fully trained does it suffer from either high bias/underfitting or high variance/overfitting?
* Look at the model complexity graph. How do the training and test error relate to increasing model complexity? Based on this relationship, which model (max depth) best generalizes the dataset and why?

## 4) Model Prediction

* The model predicts a housing price, and the model parameters (max depth) found by grid search are reported in detail. Note: because the code involves a small amount of randomization, it is recommended to run the program several times to identify the most common/reasonable price and model complexity.
* Compare prediction to earlier statistics and make a case if you think it is a valid model.

# Questions and Report Structure

## 1) Statistical Analysis and Data Exploration

* Number of data points (houses)?
* Number of features?
* Minimum and maximum housing prices?
* Mean and median Boston housing prices?
* Standard deviation?

# Statistical Analysis and Data Exploration





# Analyzing Model Performance

# Model Prediction

# 2) Evaluating Model Performance
## Measure of Model Performance

Since this is a regression analysis (and not classification), we need metrics that can evaluate prediction errors for continuous outputs. The following performance metrics that satisfy this requirement are available in scikit-learn:

* Mean absolute error (MAE)
* Mean squared error (MSE)
* Median absolute error

From the above, the median absolute error was selected since it is resistant to the presence of outliers. 

Note that R-squared is also available for regression; however, it is not a valid error metric in the context of the supplied assignment.
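
To illustrate the outlier argument, the sketch below computes all three error metrics on a handful of made-up prices where one prediction is badly wrong. It uses plain Python. The data and the `median` helper are hypothetical and stand in for the scikit-learn metric functions.

```python
def median(values):
    """Middle value of a list (mean of the two middle values if even length)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

y_true = [21.0, 24.0, 19.0, 30.0, 50.0]   # hypothetical prices in $1000's
y_pred = [20.0, 22.0, 20.0, 28.0, 20.0]   # last prediction misses by 30

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]   # [1, 2, 1, 2, 30]

mae = sum(abs_errors) / len(abs_errors)                     # 7.2
mse = sum(e ** 2 for e in abs_errors) / len(abs_errors)     # 182.0
medae = median(abs_errors)                                  # 2.0
```

The single large miss dominates the MSE and pulls the MAE well above a typical error, while the median absolute error still reports the typical miss of 2.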

## Splitting into Training & Testing Data
For all machine learning examples, we want to separate the data into training and test datasets. We use the former to fit the model and the latter to evaluate its performance.

A model that is fitted and evaluated on the same data can hide high variance error, which is a result of overfitting the dataset. In other words, it may make great predictions on the data it has already seen but poor predictions when faced with unseen data. A held-out test set is what exposes this.
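
A deliberately extreme sketch of this effect follows: a "model" that simply memorizes its training data. It scores perfectly on the data it has seen and poorly on data it has not. Everything here (the `fit_lookup` helper and the rooms-to-price numbers) is hypothetical and only meant to illustrate the idea.

```python
def fit_lookup(X, y):
    """'Train' by memorizing every (feature, price) pair."""
    table = dict(zip(X, y))
    fallback = sum(y) / float(len(y))      # mean price for unseen inputs
    return lambda x: table.get(x, fallback)

def mean_abs_error(model, X, y):
    """Average absolute difference between predictions and labels."""
    return sum(abs(model(x) - t) for x, t in zip(X, y)) / float(len(y))

# Hypothetical data: number of rooms -> price in $1000's
X_train, y_train = [4, 5, 6, 7], [15.0, 20.0, 24.0, 33.0]
X_test, y_test = [3, 8], [12.0, 40.0]

model = fit_lookup(X_train, y_train)
train_error = mean_abs_error(model, X_train, y_train)   # 0.0
test_error = mean_abs_error(model, X_test, y_test)      # 14.0
```

Judged on the training data alone, this model looks flawless. Only the held-out test set shows that it has learned nothing that generalizes.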

## Grid Search & Cross Validation
Grid search is a technique for tuning the hyperparameters (e.g. maximum depth, C) of machine learning algorithms. It iterates over the candidate combinations of hyperparameters and reports the combination that yields the best performance.

Cross-validation is a technique in which part of the data is held out before training begins and later used to evaluate the performance of the model. In k-fold cross-validation this is repeated k times, with a different fold held out each time, and the scores are averaged.

Note that, by default, grid search in the scikit-learn version used here runs 3-fold cross-validation. Combining the two means that each hyperparameter combination is scored on data it was not trained on, which helps ensure that the selected hyperparameters do not result in a model that overfits the data.
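
The interplay between the two can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: a 1-D k-nearest-neighbour regressor stands in for the decision tree, its `k` plays the role of the hyperparameter, and the folds are interleaved for simplicity. All function names and the data are hypothetical.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    for fold in range(n_folds):
        test_idx = [i for i in range(n_samples) if i % n_folds == fold]
        train_idx = [i for i in range(n_samples) if i % n_folds != fold]
        yield train_idx, test_idx

def knn_predict(X_train, y_train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / float(k)

def cv_error(X, y, k, n_folds=3):
    """Mean absolute error of a k-NN model, averaged over the held-out folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(X), n_folds):
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        errors = [abs(knn_predict(X_tr, y_tr, X[i], k) - y[i]) for i in test_idx]
        fold_errors.append(sum(errors) / float(len(errors)))
    return sum(fold_errors) / n_folds

def grid_search(X, y, candidates, n_folds=3):
    """Score every candidate with cross-validation; return the lowest-error one."""
    scores = {k: cv_error(X, y, k, n_folds) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical data: a noisy linear trend
X = [float(i) for i in range(12)]
y = [2.0 * x + (1.0 if i % 2 else -1.0) for i, x in enumerate(X)]

best_k, scores = grid_search(X, y, candidates=[1, 2, 4])
```

Because every candidate is scored only on folds it was not fitted to, a candidate that merely memorizes its training folds gains no advantage in the comparison.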

# 3) Analyzing Model Performance
## Trends of Training & Testing Error
Across all of the learning curves, the training and testing errors tend to converge toward the same absolute error as the training size increases. The curve for a maximum depth of 5, for example, shows this trend.

<img src='learning-graph-example.png'>

## Decision Tree Regressor
Look at the learning curves for the decision tree regressor with max depth 1 and 10 (first and last learning curve graphs). When the model is fully trained does it suffer from either high bias/underfitting or high variance/overfitting?

## Model Complexity
Look at the model complexity graph. How do the training and test error relate to increasing model complexity? Based on this relationship, which model (max depth) best generalizes the dataset and why?