# Connect Intensive - Machine Learning Nanodegree
# Lesson 2: Predicting Automobile Gas Mileage 

## Objectives
  - Use [NumPy](https://docs.scipy.org/doc/numpy/reference/routines.statistics.html) to compute statistics
  - Split a dataset into training and testing sets using scikit-learn's [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) 
  - Build a [decision tree regressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) model using scikit-learn
  - Use [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to perform an exhaustive search over specified parameter values for an estimator
  




## Background

The purpose of this project is to use a decision tree regressor to predict automobile gas mileage. The dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).


**Target variable**
 - mpg: The vehicle’s gas mileage in miles per gallon

**Predictor variables**
 - displacement: The vehicle’s engine displacement
 - horsepower: The vehicle’s horsepower
 - weight: The vehicle’s weight





## Import the necessary libraries and dataset

In [1]:
import numpy as np
import pandas as pd
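try:
    from sklearn.cross_validation import ShuffleSplit  # scikit-learn releases before 0.20
except ImportError:
    from sklearn.model_selection import ShuffleSplit   # cross_validation was removed in 0.20
# Note: the model_selection version of ShuffleSplit takes n_splits
# instead of a sample count and n_iter, so the constructor call differs too.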

car_data = pd.read_csv('car_mpg_data.csv')
mpg = car_data['mpg']
predictor_variables = car_data.drop('mpg', axis = 1)
    
print("Rows: {} Columns: {} ".format(*car_data.shape))

Rows: 392 Columns: 4 




## Use NumPy to learn more about the dataset by calculating basic statistics

In [2]:
mpg_min = np.min(mpg)
mpg_max = np.max(mpg)
mean = np.mean(mpg)
median = np.median(mpg)
# np.std defaults to the population standard deviation (ddof=0)
std = np.std(mpg)

# Values are rounded to the nearest whole number for display
print("Minimum MPG: {:,.0f}".format(mpg_min))
print("Maximum MPG: {:,.0f}".format(mpg_max))
print("Mean MPG: {:,.0f}".format(mean))
print("Median MPG: {:,.0f}".format(median))
print("MPG standard deviation: {:,.0f}".format(std))

Minimum MPG: 9
Maximum MPG: 47
Mean MPG: 23
Median MPG: 23
MPG standard deviation: 8
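
`np.std` computes the population standard deviation by default (`ddof=0`), while pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`). The sketch below illustrates the difference. The mileage values are made up for illustration and are not taken from `car_mpg_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical mileage values, for illustration only
sample_mpg = pd.Series([18.0, 15.0, 26.0, 31.0, 24.0])

population_std = np.std(sample_mpg.values)       # ddof=0: divide by n
sample_std = np.std(sample_mpg.values, ddof=1)   # ddof=1: divide by n - 1
pandas_std = sample_mpg.std()                    # pandas defaults to ddof=1

print("Population std (NumPy default): {:.3f}".format(population_std))
print("Sample std (NumPy, ddof=1):     {:.3f}".format(sample_std))
print("Sample std (pandas default):    {:.3f}".format(pandas_std))
```

With 392 rows the two estimates are nearly identical, so the choice matters little here, but it can matter for small samples.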


## Import the r2_score function for evaluating the model


In [3]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """Return the coefficient of determination (R^2) between true and predicted values."""
    score = r2_score(y_true, y_predict)
    return score
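
For intuition about what `r2_score` measures, R^2 equals 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares the result with scikit-learn. The true and predicted values are hypothetical, chosen only to keep the arithmetic small.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted values, for illustration only
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: squared error of the predictions
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
# Total sum of squares: squared error of always predicting the mean
ss_tot = np.sum((y_true_demo - np.mean(y_true_demo)) ** 2)

manual_r2 = 1.0 - ss_res / ss_tot
library_r2 = r2_score(y_true_demo, y_pred_demo)

print("Manual R^2:  {:.4f}".format(manual_r2))
print("sklearn R^2: {:.4f}".format(library_r2))
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.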

## Import the train_test_split function for dividing the dataset into training and testing sets

In [4]:
# train_test_split lives in sklearn.model_selection as of scikit-learn 0.18
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(predictor_variables, mpg, test_size = .1, random_state = 1)

print("Predictor variables: displacement, horsepower, weight\n")
print("X_train: rows:  {},  columns: {}".format(*X_train.shape))
print("X_test:  rows:  {},  columns: {}\n".format(*X_test.shape))
print("---------------------------------------------------------\n")
print("Target variable: mpg\n")
print("y_train: rows:  {},  columns: 1".format(*y_train.shape))
print("y_test:  rows:  {},  columns: 1".format(*y_test.shape))

Predictor variables: displacement, horsepower, weight

X_train: rows:  352,  columns: 3
X_test:  rows:  40,  columns: 3

---------------------------------------------------------

Target variable: mpg

y_train: rows:  352,  columns: 1
y_test:  rows:  40,  columns: 1
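
The 40/352 split follows from how `train_test_split` sizes the test set: 10% of 392 rows is 39.2, which is rounded up to 40, and the remaining 352 rows go to training. A minimal sketch with placeholder arrays (not the lesson's data, and assuming a scikit-learn release that provides `sklearn.model_selection`) reproduces these sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 392  # same row count as the lesson's dataset

# Placeholder arrays standing in for the predictors and the target
X_demo = np.arange(n_rows * 3).reshape(n_rows, 3)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1)

expected_test_rows = int(np.ceil(0.1 * n_rows))
print("Expected test rows: {}".format(expected_test_rows))
print("Train shape: {}, test shape: {}".format(X_tr.shape, X_te.shape))
```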


## Use the grid search technique to train the decision tree algorithm 

In [5]:
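# Import 'DecisionTreeRegressor', 'make_scorer', 'GridSearchCV', and 'ShuffleSplit'
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
# sklearn.grid_search and sklearn.cross_validation were deprecated in scikit-learn 0.18;
# ShuffleSplit is imported from model_selection here so the n_splits signature below applies.
from sklearn.model_selection import GridSearchCV, ShuffleSplit

def fit_model(X, y):
    """Tune max_depth of a decision tree regressor by grid search and return the best model."""

    # Five random splits, each holding out 25% of the rows for validation
    cross_validation_sets = ShuffleSplit(n_splits = 5, test_size = 0.25, random_state = 1)

    reg_model = DecisionTreeRegressor()

    # Candidate tree depths: 1 through 10
    max_depth = {'max_depth': list(range(1, 11))}

    # Wrap performance_metric (R^2) as a scorer; higher scores are better
    scorer = make_scorer(performance_metric, greater_is_better = True)

    grid = GridSearchCV(estimator = reg_model, param_grid = max_depth, scoring = scorer, cv = cross_validation_sets)

    grid = grid.fit(X, y)

    # The estimator refit on all of X and y using the best max_depth
    return grid.best_estimator_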
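
`fit_model` returns only the best estimator, but the fitted `GridSearchCV` object also records how every candidate depth scored. The sketch below shows one way to inspect those results. It uses synthetic stand-in data and hypothetical names (`X_demo`, `y_demo`, `demo_grid`) rather than the lesson's dataset, and assumes a scikit-learn release that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical, not car_mpg_data.csv)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 40 - 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 1, size=200)

demo_grid = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': list(range(1, 11))},
    scoring=make_scorer(r2_score),
    cv=ShuffleSplit(n_splits=5, test_size=0.25, random_state=1),
)
demo_grid.fit(X_demo, y_demo)

print("Best max_depth: {}".format(demo_grid.best_params_['max_depth']))
print("Best mean validation R^2: {:.3f}".format(demo_grid.best_score_))

# Mean validation score for each candidate depth
results = demo_grid.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print("max_depth={:2d}  mean R^2={:.3f}".format(params['max_depth'], score))
```

Looking at the full table of scores, rather than only the winner, shows whether the best depth is clearly better or only marginally ahead of its neighbours.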



## Fit the model to the training data

In [6]:
reg = fit_model(X_train, y_train)

## Make predictions using the model

In [7]:
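# Each row lists displacement, horsepower, and weight, in the training column order
car_specs = [[100, 100, 2000], [400, 200, 4000]]

# Use a distinct loop variable so the 'mpg' Series defined earlier is not overwritten
for i, predicted_mpg in enumerate(reg.predict(car_specs)):
    print("Car {}'s predicted MPG: {:,.0f}".format(i + 1, predicted_mpg))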

Car 1's predicted MPG: 29
Car 2's predicted MPG: 15
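
The held-out test set is a natural way to check how well a fitted model generalizes. Assuming the `reg`, `X_test`, and `y_test` objects from the cells above, that check would be `performance_metric(y_test, reg.predict(X_test))`. The sketch below shows the same idea end to end on synthetic data; the weight-to-mileage relationship and all names in it are made up for illustration, so the printed scores say nothing about the lesson's dataset.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: mileage falls as weight rises, plus noise (hypothetical)
rng = np.random.RandomState(0)
weight = rng.uniform(1500, 5000, size=300)
X_demo = weight.reshape(-1, 1)
y_demo = 45 - 0.006 * weight + rng.normal(0, 2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1)

demo_reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Compare the fit on seen data with the fit on unseen data
train_r2 = r2_score(y_tr, demo_reg.predict(X_tr))
test_r2 = r2_score(y_te, demo_reg.predict(X_te))

print("Training R^2: {:.3f}".format(train_r2))
print("Testing R^2:  {:.3f}".format(test_r2))
```

A testing score far below the training score would suggest the tree is overfitting, which is what tuning `max_depth` is meant to limit.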
