# Project: Predicting Median House Price in Boston


In this project, you will model the median home value (MEDV) across U.S. Census tracts in Boston. The target is a continuous numeric value, and the predictors are numeric features (one of them, CHAS, is a binary indicator), so this is a regression problem.

In [3]:
# Let us import some libraries
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
import numpy as np
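# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook needs an older scikit-learn release to run as written.
from sklearn.datasets import load_boston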

In [4]:
# loading the data
boston = load_boston()

X = pd.DataFrame(boston.data,
                 columns=boston.feature_names)
y = pd.DataFrame(boston.target,
                 columns=['MEDV'])

print(boston['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

## 1. Clean Up Data and Perform Exploratory Data Analysis


The Boston data comes from the scikit-learn datasets module, so it should already be in a clean format. Nevertheless, we should always perform exploratory data analysis before modeling in detail.



In [5]:
# Exploratory data analysis:
# Include: total nulls, data types, shape, summary statistics, and the number of unique values for each column in the dataset.
# You may use plots to describe any relationships between different columns.
df = pd.DataFrame(boston.data)
print("[INFO] df type : {}".format(type(df)))
print("[INFO] df shape: {}".format(df.shape))
df.columns = boston.feature_names
print(df.head())
print(pd.isnull(df).any()) # To find if a column in our dataset has missing values

[INFO] df type : <class 'pandas.core.frame.DataFrame'>
[INFO] df shape: (506, 13)
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  
CRIM       False
ZN         False
INDUS      False
CHAS       False
NOX        False
RM         False
AGE        False
DIS        False
RAD        False
TAX        False
PTRATIO    False
B          False
LSTAT      False
dtype: bool
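The cell above covers the shape and the null check. The remaining items named in its comment (data types, unique counts, summary statistics) could be gathered along the following lines. This is a minimal sketch: `summarize_columns` is a hypothetical helper, and the `toy` frame holds made-up values standing in for `df` so that the block runs on its own.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, and number of unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_null": frame.isnull().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a value
    })


# Toy stand-in for the feature table (values are invented for illustration).
toy = pd.DataFrame({
    "CRIM": [0.006, 0.027, 0.027, np.nan],
    "CHAS": [0.0, 0.0, 1.0, 0.0],
    "RM": [6.575, 6.421, 7.185, 6.998],
})

summary = summarize_columns(toy)
print(summary)
print(toy.describe())  # count, mean, std, min, quartiles, max per column
```

In the notebook itself, calling `summarize_columns(df)` and `df.describe()` would give the same overview for the 13 Boston features.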


## 2. Build a Lasso regression model using Scikit-Learn

Use the tools in scikit-learn to build a Lasso regression model that predicts our target variable, MEDV. Use 10-fold cross-validation when building the model and score the predictions. What do these results tell us?


In [6]:
df["PRICE"] = boston.target  # Add the target (MEDV) to the dataframe as PRICE
print(df.head())
X = df.drop("PRICE", axis=1)  # Features: every column except the target
Y = df["PRICE"]               # Target: median home value
print(X.shape)
print(Y.shape)

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  PRICE  
0     15.3  396.90   4.98   24.0  
1     17.8  396.90   9.14   21.6  
2     17.8  392.83   4.03   34.7  
3     18.7  394.63   2.94   33.4  
4     18.7  396.90   5.33   36.2  
(506, 13)
(506,)


In [7]:
# Lasso regression model with 10-fold cross-validation
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score

# Rescale every feature to [0, 1]; the later model cells reuse scaled_X.
# Caveat: fitting the scaler on all rows before cross-validation lets each
# held-out fold influence the scaling.
scaler = MinMaxScaler()
scaled_X = scaler.fit_transform(X)

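model = Lasso()  # default regularization strength, alpha=1.0
# An integer cv gives consecutive, unshuffled folds for a regressor.
fold_scores = cross_val_score(model, scaled_X, Y, cv=10, scoring='r2')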
mean_fold_score = np.mean(fold_scores)
print('Determination Coefficient for 10 folds:', fold_scores)
print('Mean Determination Coefficient for test:', mean_fold_score)

Determination Coefficient for 10 folds: [ 0.07012894  0.17924575 -1.71633797 -0.68829972 -0.17881512 -0.93935171
 -0.12930186 -0.04526711 -3.62267359 -0.28227388]
Mean Determination Coefficient for test: -0.7352946268305928
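Most of the fold scores above are negative. That is possible because R² is defined as 1 - SS_res / SS_tot, where SS_tot measures the error of always predicting the mean of the held-out targets. A model that does worse than that constant baseline on a fold gets a score below zero. The sketch below illustrates this with made-up numbers; `r2_by_hand` is a hypothetical helper written only to mirror the formula.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])  # made-up targets

# Baseline: always predict the mean of the held-out targets.
mean_baseline = np.full_like(y_true, y_true.mean())
# A constant prediction that sits far from the data.
far_off = np.full_like(y_true, 50.0)


def r2_by_hand(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


r2_mean = r2_score(y_true, mean_baseline)  # 0.0: no better than the mean
r2_far = r2_score(y_true, far_off)         # -5.0: worse than the mean
print(r2_mean, r2_far)
print(r2_by_hand(y_true, far_off))
```

Two settings in the cell above plausibly contribute to the low scores, though neither is verified here: `Lasso()` uses `alpha=1.0`, which is strong regularization for features scaled to [0, 1], and `cv=10` splits the rows in their stored order without shuffling.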


## 3. Build a Random forest regression model using Scikit-Learn

Use the tools in scikit-learn to build a random forest regression model that predicts our target variable, MEDV. Use 10-fold cross-validation when building the model and score the predictions. What do these results tell us?

In [8]:
# Random forest regression model with 10-fold cross-validation
from sklearn.ensemble import RandomForestRegressor
folds = 10
model = RandomForestRegressor()  # no random_state, so scores vary between runs
fold_scores = cross_val_score(model, scaled_X, Y, cv=folds, scoring='r2')
mean_fold_score = np.mean(fold_scores)
print('Determination Coefficient for 10 folds:', fold_scores)
print('Mean Determination Coefficient for test:', mean_fold_score)

Determination Coefficient for 10 folds: [ 0.67829511  0.80752137  0.32182459  0.79386394  0.83466287  0.71769526
  0.56889053  0.33908733 -0.33674329  0.23654659]
Mean Determination Coefficient for test: 0.49616442975676495
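The fold scores above range from about -0.34 to 0.83, so the mean alone hides a lot of spread, and because the forest is built without a `random_state` the numbers will change on every run. A minimal sketch of a reproducible variant follows. It runs on synthetic data from `make_regression` rather than the Boston table, and the seed, `n_estimators=50`, and the shuffled `KFold` are illustrative assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (200 rows, 13 features); not the Boston dataset.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)


def forest_cv_scores(seed):
    """10-fold R^2 scores with both the forest and the fold split seeded."""
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    return cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")


scores_a = forest_cv_scores(seed=42)
scores_b = forest_cv_scores(seed=42)  # same seed, so the same scores

print("mean R^2: %.3f, std: %.3f" % (scores_a.mean(), scores_a.std()))
print("runs identical:", np.allclose(scores_a, scores_b))
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.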


## 4. Build a Support Vector Machine (SVM) model using Scikit-Learn

Use the tools in scikit-learn to build a Support Vector Machine (SVM) regression model that predicts our target variable, MEDV. Use 10-fold cross-validation when building the model and score the predictions. What do these results tell us?

In [18]:
# Support Vector Machine (SVM) regression model with 10-fold cross-validation
from sklearn.svm import SVR
svr_model = SVR()  # support vector regressor with the default RBF kernel
fold_scores = cross_val_score(svr_model, scaled_X, Y, cv=10, scoring='r2')
mean_fold_score = np.mean(fold_scores)
print('Determination Coefficient for 10 folds:', fold_scores)
print('Mean Determination Coefficient for test:', mean_fold_score)

Determination Coefficient for 10 folds: [ 0.70232793  0.55446441  0.18962265 -0.02953406  0.39341342  0.06795413
  0.48776984  0.2366922  -0.31115425  0.18903015]
Mean Determination Coefficient for test: 0.24805864179054
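The three models above were each scored on `scaled_X`, which was scaled using all rows before the folds were split, and on unshuffled folds. One way to put them on an equal and slightly stricter footing is to wrap the scaler and each estimator in a pipeline, so the scaler is refit on the training part of every fold, and to share one shuffled `KFold` object. The sketch below does this on synthetic data from `make_regression`, not the Boston table. The `alpha=0.1`, `n_estimators=50`, and seeds are illustrative assumptions, and the scores it prints say nothing about how the models rank on the Boston data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic stand-in data (200 rows, 13 features); not the Boston dataset.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

# One shuffled splitter shared by all models, so every model sees the same folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Scaling lives inside each pipeline and is refit on each training split.
models = {
    "lasso": make_pipeline(MinMaxScaler(), Lasso(alpha=0.1)),
    "random_forest": make_pipeline(
        MinMaxScaler(), RandomForestRegressor(n_estimators=50, random_state=0)
    ),
    "svr": make_pipeline(MinMaxScaler(), SVR()),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    results[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

To apply the same pattern in this notebook, the unscaled `X` and `Y` would replace `X_demo` and `y_demo`.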
