# Project: Predicting Boston Housing Prices
## Machine Learning Engineer Nanodegree
The goal of this project is to develop a machine learning model that predicts Boston housing prices from information such as the number of rooms, the neighborhood poverty level, and the student-teacher ratio.

### Import data 
Before the model can be developed and trained, the data must be imported into the program. The file 'housing.csv', included with the project, contains housing price data collected in the Boston area. The block of code below imports the data and prepares it for processing.

In [2]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.cross_validation import ShuffleSplit

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)
    
# Success
print "Boston housing dataset has {} data points with {} variables each.".format(*data.shape)

Boston housing dataset has 489 data points with 4 variables each.


### Data Exploration
To give a general idea of what the data looks like, the following statistics are calculated for the house prices:
 * Minimum price
 * Maximum price
 * Mean price
 * Median price
 * Standard deviation of the prices
 

In [3]:
# Calculate minimum price of the data
minimum_price = min(prices)

# Calculate maximum price of the data
maximum_price = max(prices)

# Calculate mean price of the data
mean_price = np.mean(prices)

# Calculate median price of the data
median_price = np.median(prices)

# Calculate standard deviation of prices of the data
std_price = np.std(prices)

# Show the calculated statistics
print "Statistics for Boston housing dataset:\n"
print "Minimum price: ${:,.2f}".format(minimum_price)
print "Maximum price: ${:,.2f}".format(maximum_price)
print "Mean price: ${:,.2f}".format(mean_price)
print "Median price ${:,.2f}".format(median_price)
print "Standard deviation of prices: ${:,.2f}".format(std_price)

Statistics for Boston housing dataset:

Minimum price: $105,000.00
Maximum price: $1,024,800.00
Mean price: $454,342.94
Median price $438,900.00
Standard deviation of prices: $165,171.13
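
One detail is worth knowing when reading the standard deviation above: `np.std` computes the population standard deviation (`ddof=0`) by default, whereas `pandas.Series.std` defaults to the sample standard deviation (`ddof=1`). The sketch below uses a small made-up sample, not the housing data, to show the difference. It is written for Python 3, unlike the Python 2 cells in this notebook, and the variable names are illustrative.

```python
import numpy as np

# Made-up sample for illustration only (not taken from housing.csv)
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population standard deviation: divide by n (NumPy's default, ddof=0)
population_std = np.std(sample)

# Sample standard deviation: divide by n - 1 (ddof=1)
sample_std = np.std(sample, ddof=1)

# Ratio between the two estimates for a dataset of 489 points
ratio_489 = np.sqrt(489.0 / 488.0)

print("Population std (ddof=0): {:.4f}".format(population_std))
print("Sample std (ddof=1): {:.4f}".format(sample_std))
print("Ratio for n = 489: {:.6f}".format(ratio_489))
```

With 489 data points the two estimates differ by roughly 0.1%, so the choice barely changes the figure reported above.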


----

# Developing a Model
In this section of the project, a model is developed to predict house prices from the data.

## r2_score
The r2 score is the coefficient of determination, which measures how well the model's predictions fit the observed data. The best possible score is 1, and the score can be negative when a model fits worse than always predicting the mean of the target.
* A score of 0 (0%) indicates that the model explains none of the variability of the response data around its mean.
* A score of 1 (100%) indicates that the model explains all of the variability of the response data around its mean.
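
As a minimal sketch of the definition, R^2 can be computed as `1 - SS_res / SS_tot`, where `SS_res` is the sum of squared prediction errors and `SS_tot` is the sum of squared deviations of the true values from their mean. The helper `r_squared` and the sample values below are hypothetical illustrations, not part of the project code, and the sketch is written for Python 3. It covers three reference cases: a perfect prediction, always predicting the mean, and a prediction that is worse than the mean.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / float(len(y_true))
    # Sum of squared errors between true and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Sum of squared deviations of the true values from their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up target values for illustration
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(y_true, y_true)                      # predictions match exactly
baseline = r_squared(y_true, [3.0] * 5)                  # always predict the mean
worse = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])     # predictions reversed

print("Perfect model: {:.1f}".format(perfect))     # 1.0
print("Mean baseline: {:.1f}".format(baseline))    # 0.0
print("Reversed model: {:.1f}".format(worse))      # -3.0
```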

In [4]:
# Import 'r2_score'
from sklearn.metrics import r2_score 

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    
    # Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true, y_predict)
    
    # Return the score
    return score

### Quick test for the performance_metric function
Assume that a dataset contains five data points and a model made the following predictions for the target variable:

| True Value | Prediction |
| :-------------: | :--------: |
| 3.0 | 2.5 |
| -0.5 | 0.0 |
| 2.0 | 2.1 |
| 7.0 | 7.8 |
| 4.2 | 5.3 |

The code cell below uses the `performance_metric` function to calculate this model's coefficient of determination.

In [5]:
# Calculate the performance of this model
score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print "Model has a coefficient of determination, R^2, of {:.3f}.".format(score)

Model has a coefficient of determination, R^2, of 0.923.


The result is 0.923, so the model explains about 92.3% of the variability in the true values; this is not bad!
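
The score can be checked by hand from the five data points in the table. The following sketch, written for Python 3 with NumPy, computes the two sums of squares that make up R^2 and is intended only as a cross-check of the value reported by `performance_metric`; the variable names are illustrative.

```python
import numpy as np

# True values and predictions from the table above
y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5, 0.0, 2.1, 7.8, 5.3])

# Residual sum of squares: squared prediction errors
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared deviations from the mean of the true values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot

print("SS_res = {:.3f}".format(ss_res))  # 2.360
print("SS_tot = {:.3f}".format(ss_tot))  # 30.592
print("R^2    = {:.3f}".format(r2))      # 0.923
```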

### Implementation: Shuffle and Split Data
* The data must be split into two sets (training and testing) to train and validate the model. If most of the data is used for training, only a limited amount is left to test and validate the model. If the test set is too big, there is not enough data to train the model. Thus, choosing a good ratio between training and testing data is crucial.
* For this data, 30% of the data will be used for model validation, and the rest will be used for training.

In [6]:
# Import 'train_test_split'
from sklearn.cross_validation import train_test_split

# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.3, random_state=42)

# Success
print "Training and testing split was successful."

Training and testing split was successful.
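
To make the 70/30 split concrete, the following sketch performs a shuffle-and-split by hand with NumPy on 489 row indices. It illustrates the idea only: `manual_train_test_split` is a hypothetical helper, rounding the test count up is an assumption of this sketch, and it is not claimed to reproduce the exact partition that `train_test_split` returns for the same seed. It is written for Python 3.

```python
import numpy as np

def manual_train_test_split(n_samples, test_size, seed):
    """Return shuffled (train_idx, test_idx) index arrays."""
    rng = np.random.RandomState(seed)
    # Shuffle the row indices so the split does not depend on file order
    shuffled = rng.permutation(n_samples)
    # Number of held-out rows, rounded up (assumption of this sketch)
    n_test = int(np.ceil(test_size * n_samples))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = manual_train_test_split(489, test_size=0.3, seed=42)

print("Training points: {}".format(len(train_idx)))  # 342
print("Testing points: {}".format(len(test_idx)))    # 147
```

The index arrays could then be used to select the matching rows of the features and prices, so that every data point lands in exactly one of the two subsets.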


-----

## Evaluating Model Performance
In this final section of the project, a model is constructed with `fit_model`, and the optimized model is used to make predictions on each client's feature set.

### Implementation: Fitting a Model
* In this project, the **decision tree algorithm** is used to develop a model that predicts the housing price.
* ShuffleSplit is used to create the cross-validation sets: 10 shuffled splits, each holding out 20% of the data for validation.
* Decision Tree Regressor is the estimator that is trained.
* To produce an accurate and robust model, its parameters must be tuned so that it generalizes to unseen data. GridSearchCV is used to try a range of parameter values and select the best-performing one.
* In this case, the fit_model function determines the best max_depth for the model.

In [8]:
# Import 'make_scorer', 'DecisionTreeRegressor', and 'GridSearchCV'
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.grid_search import GridSearchCV

def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a 
        decision tree regressor trained on the input data [X, y]. """
    
    # Create cross-validation sets from the training data
    # sklearn version 0.18: ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None)
    # sklearn version 0.17: ShuffleSplit(n, n_iter=10, test_size=0.1, train_size=None, random_state=None)
    cv_sets = ShuffleSplit(X.shape[0], n_iter = 10, test_size = 0.20, random_state = 0)

    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()

    # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': [1,2,3,4,5,6,7,8,9,10]}

    # Transform 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)

    # Create the grid search cv object --> GridSearchCV()
    # Pass 'regressor', 'params', 'scoring_fnc', and 'cv_sets' as the
    # (estimator, param_grid, scoring, cv) arguments respectively.
    grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_
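
The imports in this notebook target scikit-learn 0.17. From version 0.18 onward, `ShuffleSplit`, `GridSearchCV`, and `train_test_split` live in `sklearn.model_selection`, and `ShuffleSplit` takes `n_splits` instead of `n` and `n_iter`. The sketch below shows what an equivalent function might look like under that newer API. The name `fit_model_modern`, the fixed `random_state` on the regressor, and the synthetic data are assumptions made for illustration; the sketch is written for Python 3 and does not use the housing data.

```python
import numpy as np
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

def fit_model_modern(X, y):
    """Grid search over 'max_depth' for a decision tree regressor."""
    # 10 shuffled splits, each holding out 20% of the rows for validation
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    # random_state is fixed here only to make the sketch repeatable
    regressor = DecisionTreeRegressor(random_state=0)
    params = {'max_depth': list(range(1, 11))}
    scoring_fnc = make_scorer(r2_score)
    grid = GridSearchCV(estimator=regressor, param_grid=params,
                        scoring=scoring_fnc, cv=cv_sets)
    grid.fit(X, y)
    return grid.best_estimator_

# Synthetic regression data (three features), for illustration only
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

best_model = fit_model_modern(X_demo, y_demo)
best_depth = best_model.get_params()['max_depth']
print("Selected max_depth on the synthetic data: {}".format(best_depth))
```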

### Optimal max_depth
The code below uses the fit_model function to find the optimal value of max_depth for the model.

In [9]:
# Fit the model to the full dataset using grid search
reg = fit_model(features, prices)

# Produce the value for 'max_depth'
print "Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth'])

Parameter 'max_depth' is 4 for the optimal model.


### Predicting Selling Prices
The table below lists the client data for which prices are predicted using the trained machine learning model. Run the code section below to see the results.

| Feature | Client 1 | Client 2 | Client 3 |
| :---: | :---: | :---: | :---: |
| Total number of rooms in home | 5 rooms | 4 rooms | 8 rooms |
| Neighborhood poverty level (as %) | 17% | 32% | 3% |
| Student-teacher ratio of nearby schools | 15-to-1 | 22-to-1 | 12-to-1 |


In [10]:
# Produce a matrix for client data
client_data = [[5, 17, 15], # Client 1
               [4, 32, 22], # Client 2
               [8, 3, 12]]  # Client 3

# Show predictions
for i, price in enumerate(reg.predict(client_data)):
    print "Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price)

Predicted selling price for Client 1's home: $408,800.00
Predicted selling price for Client 2's home: $231,253.45
Predicted selling price for Client 3's home: $938,053.85
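
A quick way to sanity-check these predictions is to compare them with the summary statistics from the Data Exploration section. The sketch below hard-codes the figures printed earlier in this notebook and expresses each prediction as a number of standard deviations from the mean price. The variable names are illustrative, and the sketch is written for Python 3.

```python
# Summary statistics printed in the Data Exploration section
minimum_price = 105000.00
maximum_price = 1024800.00
mean_price = 454342.94
std_price = 165171.13

# Predicted selling prices printed above (Clients 1 to 3)
predictions = [408800.00, 231253.45, 938053.85]

z_scores = []
in_range_flags = []
for i, price in enumerate(predictions):
    # Is the prediction inside the range of prices seen in the data?
    in_range = minimum_price <= price <= maximum_price
    # Distance from the mean price, in standard deviations
    z = (price - mean_price) / std_price
    in_range_flags.append(in_range)
    z_scores.append(z)
    print("Client {}: ${:,.2f}, within observed range: {}, {:+.2f} std from mean"
          .format(i + 1, price, in_range, z))
```

All three predictions fall inside the observed price range. Client 3's prediction sits nearly three standard deviations above the mean, which is consistent with a home that has many rooms, a low neighborhood poverty level, and a low student-teacher ratio.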
