# Machine Learning Engineer Nanodegree
## Model Evaluation & Validation
## Project: Predicting Boston Housing Prices

Welcome to the first project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

## Getting Started
In this project, you will evaluate the performance and predictive power of a model that has been trained and tested on data collected from homes in suburbs of Boston, Massachusetts. A model trained on this data that is seen as a *good fit* could then be used to make certain predictions about a home — in particular, its monetary value. This model would prove to be invaluable for someone like a real estate agent who could make use of such information on a daily basis.

The dataset for this project originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Housing). The Boston housing data was collected in 1978 and each of the 506 entries represents aggregated data about 14 features for homes from various suburbs in Boston, Massachusetts. For the purposes of this project, the following preprocessing steps have been applied to the dataset:
- 16 data points have an `'MEDV'` value of 50.0. These data points likely contain **missing or censored values** and have been removed.
- 1 data point has an `'RM'` value of 8.78. This data point can be considered an **outlier** and has been removed.
- The features `'RM'`, `'LSTAT'`, `'PTRATIO'`, and `'MEDV'` are essential. The remaining **non-relevant features** have been excluded.
- The feature `'MEDV'` has been **multiplicatively scaled** to account for 35 years of market inflation.

Run the code cell below to load the Boston housing dataset, along with a few of the necessary Python libraries required for this project. You will know the dataset loaded successfully if the size of the dataset is reported.

In [73]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display
#import seaborn as sns
import matplotlib.pyplot as plt

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline



# Load the Boston housing dataset
data = pd.read_csv('data.csv',  index_col='ID')

    
# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

Boston housing dataset has 100000 data points with 12 variables each.


In [74]:
data['Date'] = pd.to_datetime(data['Date'], format="%Y-%m-%d")
data = data.set_index('Date')
#data['Date'] = pd.to_datetime(data['Date'], format="%Y-%m-%d").apply(lambda x:x.date().strftime('%y'))
data_test = data[98818:]
data_train = data[:98818]
#data_train['Price'] = data_train['Price'].astype(int).apply(lambda x: int(x/50000))
data_train['Price'] = data_train['Price'].astype(int).apply(lambda x: np.around(np.log(x), 1))
#data_train.loc[data_train['Price'] > 30, 'Price'] = 31

prices = data_train['Price']
features = data_train.drop('Price', axis = 1)
features_test = data_test.drop('Price', axis = 1)
print(data.index.dtype)


print(len(data_train))
print(data_train.isnull().any())
#data_train = data_train.dropna()
print(len(data_train))
print(data_train.isnull().any())
#display(data_train)
print(data_test.isnull().any())
display(data_train)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


datetime64[ns]
98818
Price                False
Postcode              True
Property_Type        False
Old_New              False
Duration             False
Street               False
Locality              True
Town                 False
District             False
County               False
PPD_Category_Type    False
dtype: bool
98818
Price                False
Postcode              True
Property_Type        False
Old_New              False
Duration             False
Street               False
Locality              True
Town                 False
District             False
County               False
PPD_Category_Type    False
dtype: bool
Price                 True
Postcode              True
Property_Type        False
Old_New              False
Duration             False
Street               False
Locality              True
Town                 False
District             False
County               False
PPD_Category_Type    False
dtype: bool


Date,Price,Postcode,Property_Type,Old_New,Duration,Street,Locality,Town,District,County,PPD_Category_Type
1995-01-01,11.7,SW17 9QF,T,N,F,196 CROWBOROUGH ROAD,LONDON,LONDON,WANDSWORTH,GREATER LONDON,A
1995-01-03,10.6,EN1 1DN,F,N,L,"7 POYNTER ROAD, FLAT B",,ENFIELD,ENFIELD,GREATER LONDON,A
1995-01-03,10.3,DE22 3SE,T,N,F,69 WOLFA STREET,DERBY,DERBY,DERBY,DERBYSHIRE,A
1995-01-03,11.2,W2 6HD,F,N,L,"READING HOUSE, HALLFIELD ESTATE, FLAT 32",LONDON,LONDON,CITY OF WESTMINSTER,GREATER LONDON,A
1995-01-03,9.9,M28 3LB,T,N,L,29 ALFRED STREET,WORSLEY,MANCHESTER,SALFORD,GREATER MANCHESTER,A
1995-01-03,11.4,BN15 8LJ,D,N,F,206 BRIGHTON ROAD,LANCING,LANCING,ADUR,WEST SUSSEX,A
1995-01-03,10.4,FY4 2LW,S,N,F,"50A, BELVERE AVENUE",BLACKPOOL,BLACKPOOL,BLACKPOOL,BLACKPOOL,A
1995-01-03,11.5,TW5 0QN,S,N,F,33 ALDERNEY AVENUE,HOUNSLOW,HOUNSLOW,HOUNSLOW,GREATER LONDON,A
1995-01-04,11.7,BN6 9RH,T,N,F,"YEOMANS, WEST FURLONG LANE",HURSTPIERPOINT,HASSOCKS,MID SUSSEX,WEST SUSSEX,A
1995-01-04,11.4,RG40 1RH,S,N,F,49 BEAN OAK ROAD,WOKINGHAM,WOKINGHAM,WOKINGHAM,WOKINGHAM,A
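
The cell above stores prices as natural logarithms rounded to one decimal place and triggers a `SettingWithCopyWarning`. The following is a minimal sketch of the same transformation on a tiny made-up frame (`raw`, `train`, `LogPrice`, and `Recovered` are hypothetical names, and the prices are invented). It assumes that taking an explicit `.copy()` of the slice is an acceptable way to avoid the warning, and it shows how much precision the rounding costs when the log is inverted with `np.exp`.

```python
import numpy as np
import pandas as pd

# Hypothetical raw prices; the real notebook reads them from data.csv
raw = pd.DataFrame({"Price": [120000, 40000, 29500, 73000]})

# An explicit copy makes the slice independent of `raw`,
# so assigning a new column does not raise SettingWithCopyWarning
train = raw.iloc[:3].copy()

# Log-transform and round to one decimal, as in the cell above
train["LogPrice"] = np.log(train["Price"]).round(1)

# Invert the transform to get back to the price scale
train["Recovered"] = np.exp(train["LogPrice"])

# Rounding the log to 0.1 shifts it by at most 0.05,
# so the recovered price is off by at most exp(0.05) - 1 (about 5%)
rel_error = (train["Recovered"] - train["Price"]).abs() / train["Price"]
print(train)
print("Largest relative error:", rel_error.max())
```

Because the target is rounded in log space, any prediction converted back with `np.exp` inherits this relative error on top of the model's own error.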


In [75]:
print(pd.__version__)

0.19.2


## Data Exploration
In the first section of this project, you will make a cursory investigation of the Boston housing data and provide your observations. Familiarizing yourself with the data through an exploratory process is a fundamental practice that helps you better understand and justify your results.

Since the main goal of this project is to construct a working model which has the capability of predicting the value of houses, we will need to separate the dataset into **features** and the **target variable**. The **features**, `'RM'`, `'LSTAT'`, and `'PTRATIO'`, give us quantitative information about each data point. The **target variable**, `'MEDV'`, will be the variable we seek to predict. These are stored in `features` and `prices`, respectively.

### Implementation: Calculate Statistics
For your very first coding implementation, you will calculate descriptive statistics about the Boston housing prices. Since `numpy` has already been imported for you, use this library to perform the necessary calculations. These statistics will be extremely important later on to analyze various prediction results from the constructed model.

In the code cell below, you will need to implement the following:
- Calculate the minimum, maximum, mean, median, and standard deviation of `'MEDV'`, which is stored in `prices`.
  - Store each calculation in its respective variable.
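
As an illustration of the statistics listed above, here is a small sketch that uses `numpy` directly on a made-up array (`values` and `stats` are hypothetical names, and the numbers are invented). One detail worth keeping in mind: `np.std` defaults to the population standard deviation (`ddof=0`), whereas the pandas `Series.std` method defaults to the sample standard deviation (`ddof=1`), so the two can give slightly different results on the same data.

```python
import numpy as np

# Hypothetical log-price values, not taken from the dataset
values = np.array([9.2, 11.7, 11.8, 12.0, 16.0])

stats = {
    "min": np.min(values),
    "max": np.max(values),
    "mean": np.mean(values),
    "median": np.median(values),
    "std_population": np.std(values),       # ddof=0, numpy's default
    "std_sample": np.std(values, ddof=1),   # ddof=1, matches pandas' default
}

for name, value in stats.items():
    print("{}: {:.4f}".format(name, value))
```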

In [76]:
p_train = prices

# TODO: Minimum price of the data
minimum_price = p_train.min()

# TODO: Maximum price of the data
maximum_price = p_train.max()

# TODO: Mean price of the data
mean_price = p_train.mean()

# TODO: Median price of the data
median_price = p_train.median()

# TODO: Standard deviation of prices of the data
std_price = p_train.std()

# Show the calculated statistics
print("Statistics for Boston housing dataset:\n")
print("Minimum price: ${:,.2f}".format(minimum_price))
print("Maximum price: ${:,.2f}".format(maximum_price))
print("Mean price: ${:,.2f}".format(mean_price))
print("Median price: ${:,.2f}".format(median_price))
print("Standard deviation of prices: ${:,.2f}".format(std_price))


Statistics for Boston housing dataset:

Minimum price: $9.20
Maximum price: $16.00
Mean price: $11.72
Median price: $11.80
Standard deviation of prices: $0.77


In [77]:
p_train = prices
print(p_train.dtype)
#p_bins = range(minimum_price,maximum_price,50000)
#print(p_bins)
#display(pd.cut(p_train, bins=p_bins))
#p_train = pd.cut(p_train, bins=p_bins)
print(p_train.value_counts().sort_values())
#p_train[p_train > 30] = 31
#display(p_train)
#print p_train.value_counts().sort_values()

float64
15.7       1
15.8       1
15.9       2
15.4       3
16.0       3
15.6       3
9.2        4
15.5       4
15.0       5
15.3       6
15.2       7
14.8      11
14.9      12
15.1      13
14.7      15
14.5      29
14.6      29
14.4      34
14.3      47
9.3       47
14.2      66
13.9      75
14.1      75
14.0      91
9.4       92
9.5      131
13.8     159
9.6      175
9.8      193
9.7      197
        ... 
10.2     683
13.1     898
10.3     951
13.0    1073
10.4    1178
12.9    1314
10.5    1729
12.8    1843
10.6    1883
12.7    1994
10.7    2182
12.6    2289
12.5    2308
10.8    2705
10.9    2711
11.1    3488
11.0    3493
11.2    3715
11.3    3885
12.2    3994
11.5    4045
11.6    4101
12.3    4290
11.4    4510
11.9    4674
12.4    4761
12.1    5021
11.7    5974
11.8    5987
12.0    6014
Name: Price, dtype: int64


In [78]:
#f_train = features.drop(['Street', 'Locality', 'Town', 'District', 'County'], axis=1)
#f_train = features.drop(['Street'], axis=1)
f_train = features
#f_train['Date'] = pd.to_datetime(f_train['Date'], format="%Y-%m-%d").apply(lambda x:x.date().strftime('%y'))
#display(f_train.head(10))
#print len(f_train['Postcode'].unique())
#print len(f_train['Postcode'])
#f_train = f_train.apply(str, axis=1)
f_train['Postcode'] = f_train['Postcode'].apply(lambda x : str(x).split()[0])
#print len(f_train['Postcode'].unique())
#print len(f_train['Postcode'])
display(f_train.tail(10))

Date,Postcode,Property_Type,Old_New,Duration,Street,Locality,Town,District,County,PPD_Category_Type
2015-12-23,W11,O,N,L,"6 PORTLAND ROAD, GROUND FLOOR",,LONDON,KENSINGTON AND CHELSEA,GREATER LONDON,B
2015-12-23,M19,S,N,L,16 WESTDALE GARDENS,,MANCHESTER,MANCHESTER,GREATER MANCHESTER,A
2015-12-23,PE27,O,N,F,"KINGS HALL, ST IVES BUSINESS PARK, SUITE 1A",,ST IVES,HUNTINGDONSHIRE,CAMBRIDGESHIRE,B
2015-12-23,WF4,D,Y,F,6 WILLOWBROOK MANOR,HORBURY,WAKEFIELD,WAKEFIELD,WEST YORKSHIRE,A
2015-12-23,IG3,F,N,L,"69A, WESTWOOD ROAD, FLAT 1",,ILFORD,REDBRIDGE,GREATER LONDON,A
2015-12-23,BS49,S,N,F,137 HIGH STREET,YATTON,BRISTOL,NORTH SOMERSET,NORTH SOMERSET,A
2015-12-23,E2,F,N,L,"72B, CHESHIRE STREET, FIRST FLOOR FLAT",,LONDON,TOWER HAMLETS,GREATER LONDON,A
2015-12-24,SE1,F,Y,L,"55 UPPER GROUND, APARTMENT 1205",,LONDON,SOUTHWARK,GREATER LONDON,B
2015-12-24,NW1,F,N,L,"MARYS COURT, 4, PALGRAVE GARDENS, APARTMENT 42",,LONDON,CITY OF WESTMINSTER,GREATER LONDON,A
2015-12-31,EX8,F,N,L,"THE MOORINGS, MAER LANE, FLAT 10",,EXMOUTH,EAST DEVON,DEVON,B
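
The cell above reduces each postcode to its first part with `str(x).split()[0]`. A side effect is that missing postcodes become the literal string `'nan'`. The sketch below, on a made-up series (`postcodes`, `via_str`, and `outward` are hypothetical names), contrasts that approach with the vectorised `.str` accessor, which leaves missing values as missing. Whether missing postcodes should stay missing or become their own category is a modelling choice this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical postcodes with one missing entry
postcodes = pd.Series(["SW17 9QF", "EN1 1DN", np.nan, "W2 6HD"])

# Approach used in the cell above: NaN is first converted to the string 'nan'
via_str = postcodes.apply(lambda x: str(x).split()[0])

# Vectorised alternative: NaN stays NaN
outward = postcodes.str.split().str[0]

print(via_str.tolist())
print(outward.tolist())
```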


In [79]:
#ts1 = (data_train['Date'] - pd.datetime(1995,1,1)).dt.year
#ts1 = data_train['Date'].dt.year - 1994
#ts = pd.DataFrame(data_train['Price']/(data_train['Date'].dt.year - 1994)).set_index(data_train['Date'])
#ts = pd.DataFrame(data_train['Price']).set_index(data_train['Date'])
#plt.plot(ts)
#display(ts1)
data_train = data[:98818]
data_train['Price'] = data_train['Price'].astype(int).apply(lambda x: np.around(np.log(x), 1))
t = data_train['Price']
display(t)
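# window='Y' is rejected with the ValueError shown below: an offset window must be
# a fixed-width frequency (for example '365D'), and a calendar-year window is not accepted
t.rolling(window='Y')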

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Date
1995-01-01    2.4
1995-01-03    2.3
1995-01-03    2.3
1995-01-03    2.4
1995-01-03    2.2
1995-01-03    2.4
1995-01-03    2.3
1995-01-03    2.4
1995-01-04    2.4
1995-01-04    2.4
1995-01-04    2.4
1995-01-04    2.4
1995-01-04    2.4
1995-01-04    2.4
1995-01-04    2.5
1995-01-05    2.4
1995-01-05    2.3
1995-01-05    2.4
1995-01-05    2.4
1995-01-06    2.3
1995-01-06    2.3
1995-01-06    2.3
1995-01-06    2.4
1995-01-06    2.3
1995-01-06    2.4
1995-01-06    2.3
1995-01-06    2.3
1995-01-06    2.3
1995-01-06    2.2
1995-01-06    2.2
             ... 
2015-12-22    2.5
2015-12-22    2.5
2015-12-22    2.4
2015-12-22    2.4
2015-12-22    2.4
2015-12-22    2.6
2015-12-22    2.5
2015-12-22    2.5
2015-12-22    2.5
2015-12-22    2.5
2015-12-22    2.5
2015-12-22    2.4
2015-12-22    2.4
2015-12-22    2.6
2015-12-22    2.5
2015-12-22    2.4
2015-12-22    2.4
2015-12-22    2.5
2015-12-23    2.4
2015-12-23    2.5
2015-12-23    2.3
2015-12-23    2.5
2015-12-23    2.6
2015-12-23    2.6
2015-

ValueError: passed window Y in not compat with a datetimelike index
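
The `ValueError` above comes from passing a calendar-year window to `rolling`. Below is a minimal sketch of two alternatives on a made-up series (`t_demo`, `rolling_mean`, and `yearly_mean` are hypothetical names, and the values are invented): a fixed-width `'365D'` window, which requires a sorted datetime index, and a plain per-year average via `groupby`. Which of the two the original analysis intended is not stated, so both are shown.

```python
import pandas as pd

# Hypothetical log prices indexed by sale date (index must be sorted)
idx = pd.to_datetime(["1995-01-01", "1995-06-01", "1996-03-01", "1997-01-15"])
t_demo = pd.Series([11.7, 10.6, 11.2, 11.4], index=idx)

# Trailing mean over the 365 days up to and including each sale
rolling_mean = t_demo.rolling(window="365D").mean()

# Average per calendar year
yearly_mean = t_demo.groupby(t_demo.index.year).mean()

print(rolling_mean)
print(yearly_mean)
```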

In [None]:
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
d_f = defaultdict(LabelEncoder)
#d_p = LabelEncoder()

# Encoding the variable
f_train_le = f_train.apply(lambda x: d_f[x.name].fit_transform(x))
#p_train_le = d_p.fit_transform(p_train)

# Inverse the encoded
f_train_de = f_train_le.apply(lambda x: d_f[x.name].inverse_transform(x))
#p_train_de = d_p.inverse_transform(p_train_le)

# Using the dictionary to label future data
#df.apply(lambda x: d[x.name].transform(x))

display(f_train.head(10))
display(f_train_le.head(10))
display(f_train_de.head(10))
#display(p_train.head(10))
#display(p_train_le[:10])
#display(p_train_de[:10])
print(d_f)
#print d_p
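
The commented line about labelling future data can be sketched as follows. This is an illustration on two tiny made-up frames (`train_demo`, `test_demo`, and `encoders` are hypothetical names), not the notebook's actual data. It assumes the encoders are fitted once on the training columns and then reused with `transform`, so the same category always maps to the same integer. Note that `transform` raises a `ValueError` when it meets a label that was not present during fitting.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and test frames with categorical columns
train_demo = pd.DataFrame({"Property_Type": ["T", "F", "D", "S"],
                           "Duration": ["F", "L", "F", "F"]})
test_demo = pd.DataFrame({"Property_Type": ["S", "T"],
                          "Duration": ["L", "F"]})

# One LabelEncoder per column, created on first access
encoders = defaultdict(LabelEncoder)

# Fit on the training data only
train_codes = train_demo.apply(lambda col: encoders[col.name].fit_transform(col))

# Reuse the fitted encoders so the codes agree with training
test_codes = test_demo.apply(lambda col: encoders[col.name].transform(col))

# A label unseen during fitting cannot be encoded
try:
    encoders["Property_Type"].transform(["O"])
    unseen_failed = False
except ValueError:
    unseen_failed = True

print(train_codes)
print(test_codes)
print("Unseen label rejected:", unseen_failed)
```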

In [13]:
from time import time
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print("Trained model in {:.4f} seconds".format(end - start))

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier and returns the weighted F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print("Made predictions in {:.4f} seconds.".format(end - start))
    # pos_label only applies when average='binary', so it is omitted here;
    # f1_score is imported from sklearn.metrics in a later cell
    return f1_score(target, y_pred, average='weighted')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))

In [14]:
display(f_train_le)
display(p_train)

NameError: name 'f_train_le' is not defined

In [15]:
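# sklearn.cross_validation was deprecated in scikit-learn 0.18 and later removed;
# the same helpers live in sklearn.model_selection
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split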
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

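# p_train_le is never defined (its encoding step is commented out above),
# so split on the log-price target p_train instead
X_train, X_test, y_train, y_test = train_test_split(f_train_le, p_train, test_size=0.1, random_state=22)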

clf_A = DecisionTreeClassifier(random_state=0)
clf_B = KNeighborsClassifier()
clf_C = GaussianNB()
clf_D = RandomForestClassifier(random_state=0, n_estimators=20)
#regressor1 = DecisionTreeRegressor(random_state=20)
regressor2 = RandomForestRegressor(random_state=20, n_estimators=20)
#regressor3 = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3, max_features='sqrt',
#                                               min_samples_leaf=15, min_samples_split=10, loss='huber')

#train_predict(clf_A, X_train, y_train, X_test, y_test)
#train_predict(clf_B, X_train, y_train, X_test, y_test)
##train_predict(clf_C, X_train, y_train, X_test, y_test)
#train_predict(clf_D, X_train, y_train, X_test, y_test)

#regressor1.fit(X_train, y_train)

#print "R^2 score on training :",
#print regressor1.score(X_train, y_train)
#print "R^2 score on test     :",
#print regressor1.score(X_test, y_test)
#print "RMSE score on test   :",
#print mean_squared_error(y_test, regressor1.predict(X_test), sample_weight=None, multioutput='uniform_average')

regressor2.fit(X_train, y_train)

print("R^2 score on training :", regressor2.score(X_train, y_train))
print("R^2 score on test     :", regressor2.score(X_test, y_test))
# mean_squared_error returns the MSE; take the square root to report the RMSE
print("RMSE score on test   :", np.sqrt(mean_squared_error(y_test, regressor2.predict(X_test))))

#regressor3.fit(X_train, y_train)

#print "R^2 score on training :",
#print regressor3.score(X_train, y_train)
#print "R^2 score on test     :",
#print regressor3.score(X_test, y_test)
#print "RMSE score on test   :",
#print mean_squared_error(y_test, regressor3.predict(X_test), sample_weight=None, multioutput='uniform_average')

SyntaxError: Missing parentheses in call to 'print' (<ipython-input-15-255a3d995b8b>, line 40)
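
For reference, the split, fit, and score steps from the cell above can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the arrays `X_demo` and `y_demo` are randomly generated, the names are hypothetical, and `train_test_split` is imported from `sklearn.model_selection`. The sketch also reports both the mean squared error and its square root, since `mean_squared_error` on its own returns the MSE rather than the RMSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic integer-coded features and a noisy log-price-like target
rng = np.random.RandomState(22)
X_demo = rng.randint(0, 50, size=(200, 3))
y_demo = 10.0 + 0.05 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Hold out 10% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=22)

model = RandomForestRegressor(n_estimators=20, random_state=20)
model.fit(X_tr, y_tr)

mse = mean_squared_error(y_te, model.predict(X_te))
rmse = np.sqrt(mse)

print("R^2 score on training:", model.score(X_tr, y_tr))
print("R^2 score on test    :", model.score(X_te, y_te))
print("MSE on test          :", mse)
print("RMSE on test         :", rmse)
```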

### Making Predictions
Once a model has been trained on a given set of data, it can be used to make predictions on new sets of input data. In the case of a *decision tree regressor*, the model has learned *what the best questions to ask about the input data are*, and can respond with a prediction for the **target variable**. You can use these predictions to gain information about data where the value of the target variable is unknown, such as data the model was not trained on.
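
To make the idea of "questions" concrete, here is a minimal sketch with a depth-1 decision tree fitted on six made-up points (`X`, `y`, `tree`, `rules`, and `preds` are hypothetical names). It assumes a scikit-learn version that provides `sklearn.tree.export_text`, which prints the learned split as a readable rule; the tree then predicts values for inputs it has not seen.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Two well-separated groups: small x maps to 10, large x maps to 20
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([10.0, 10.0, 10.0, 20.0, 20.0, 20.0])

# A single split is enough to separate the two groups
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# The learned "question" as a text rule
rules = export_text(tree, feature_names=["x"])
print(rules)

# Predictions for inputs that were not in the training data
preds = tree.predict(np.array([[2.5], [11.5]]))
print(preds)
```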

**Answer:**

In [196]:
f_test = features_test
f_test = f_test.drop(['Street'], axis=1)
f_test['Postcode'] = f_test['Postcode'].apply(lambda x : str(x).split()[0])
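# NOTE: fit_transform re-fits every encoder on the test data, so the integer codes need not
# match the ones learned from the training data; d_f[x.name].transform(x) would reuse the
# training mapping, but it raises ValueError for labels that were not seen during training
f_test_le = f_test.apply(lambda x: d_f[x.name].fit_transform(x))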

display(f_test_le.head(10))

predict_test_le = regressor2.predict(f_test_le)

predict_test_le = np.exp(predict_test_le)
display(predict_test_le)


ID,Date,Postcode,Property_Type,Old_New,Duration,Locality,Town,District,County,PPD_Category_Type
98818,0,674,1,0,1,368,345,243,82,0
98819,1,392,2,0,0,0,211,148,101,1
98820,1,622,4,0,0,197,444,12,67,0
98821,1,77,0,1,1,209,42,22,38,0
98822,1,188,4,0,0,201,374,232,29,0
98823,1,82,1,0,1,0,445,311,100,0
98824,1,320,1,0,1,0,197,203,37,0
98825,1,270,3,0,0,0,440,308,88,0
98826,1,315,4,0,0,0,349,41,101,0
98827,1,630,1,0,1,0,225,245,37,0


array([  1.71542288e+04,   2.96558565e+05,   5.40364937e+05, ...,
         4.65096412e+05,   7.97837026e+09,   1.46507194e+07])

In [None]:
# Produce a matrix for client data
client_data = [[5, 17, 15], # Client 1
               [4, 32, 22], # Client 2
               [8, 3, 12]]  # Client 3

# Show predictions
# NOTE: `reg` must be a fitted regressor; it is not defined in the cells above
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price))

**Answer:**