# Machine Learning Project: Boston Housing Median House Value Prediction

## Problem Statement
The aim is to accurately predict the median housing values using machine learning algorithms.

## Evaluation Metrics
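R-squared is the evaluation metric for this regression task. It is preferred here over RMSE or MSE because it is scale-free and easy to interpret: it is the proportion of variance in the target that the model explains, and it is commonly reported for regression models.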
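
To make the metric concrete, here is a minimal sketch of r-squared computed by hand as 1 - SS_res / SS_tot. The function name `r_squared` and the predictions are hypothetical and only for illustration; the true values are the first five MEDV entries shown later in this notebook. In practice, `sklearn.metrics.r2_score` provides the same quantity.

```python
def r_squared(y_true, y_pred):
    """Return the coefficient of determination, 1 - SS_res / SS_tot.

    Undefined (division by zero) when y_true is constant.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# First five MEDV values from the dataset.
y_true = [24.0, 21.6, 34.7, 33.4, 36.2]
# Hypothetical model output, made up for illustration.
y_pred = [25.0, 22.0, 33.0, 32.0, 35.0]
# Baseline that always predicts the mean of the targets.
baseline = [sum(y_true) / len(y_true)] * len(y_true)

print(r_squared(y_true, y_pred))    # close to 1: most variance explained
print(r_squared(y_true, baseline))  # 0.0: no better than the mean
```

A perfect model scores 1, a model no better than predicting the mean scores 0, and a worse model can score below 0, which is what makes the value easy to compare across targets with different units.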

## Project Layout

The following steps represent the project layout (active hyperlinks that navigate the notebook). Although the steps are listed sequentially, the process is iterative: earlier steps are revisited as the problem becomes better understood.

* [Step 1](#step1): Retrieve Data
* [Step 2](#step2): Exploratory Data Analysis: Clean & Explore
* [Step 3](#step3): Prepare & Transform
* [Step 4](#step4): Develop & Train Model(s)
    * [Model 1](#benchmark):  Benchmark model
    * [Model 2](#stepwise):  Stepwise Model
    * [Model 3](#step4_MobileNet):  Multi-linear Regression    
* [Step 5](#step5): Validate & Evaluate Results
* [Step 6](#step6): Deployment Discussion

---
<a id='step1'></a>
## Step 1: Retrieve Data
Importing Datasets & Packages


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style="whitegrid")

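#load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
#so this import requires scikit-learn < 1.2
from sklearn.datasets import load_boston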

#suppressing warnings
import warnings
warnings.filterwarnings("ignore")

Obtaining the feature and target data from the Boston dataset.

In [2]:
boston = load_boston()

#Loading our feature data
df = pd.DataFrame(boston.data, columns = boston.feature_names)

#Getting our target data
df['MEDV'] = boston.target
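
On scikit-learn 1.2 and later, `load_boston` is no longer available. The sketch below shows one way the same table could be rebuilt from the raw whitespace-separated file. It rests on assumptions that are not part of this notebook: that the raw file comes from the original StatLib source (`http://lib.stat.cmu.edu/datasets/boston`), that it starts with 22 lines of free-text header, and that each record of 14 values is wrapped over consecutive lines. The names `parse_boston_raw` and `COLUMNS` are hypothetical.

```python
# Column order as listed in the dataset description, with the target last.
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def parse_boston_raw(text, header_lines=22):
    """Parse the raw file contents into a list of 14-value rows.

    Assumes a free-text header of `header_lines` lines, followed by
    whitespace-separated numbers where each record wraps across lines.
    """
    body = text.splitlines()[header_lines:]
    # Flatten every numeric token, ignoring how the lines are wrapped.
    values = [float(tok) for line in body for tok in line.split()]
    n = len(COLUMNS)
    if len(values) % n != 0:
        raise ValueError("value count is not a multiple of {}".format(n))
    # Regroup the flat list into one row per record.
    return [values[i:i + n] for i in range(0, len(values), n)]


# Two records in the wrapped layout (values match the first rows of df.head()).
sample = """\
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30
  396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80
  396.90   9.14  21.60
"""
rows = parse_boston_raw(sample, header_lines=0)
print(dict(zip(COLUMNS, rows[0])))
```

With the full file contents in `text`, `pd.DataFrame(parse_boston_raw(text), columns=COLUMNS)` would give a frame with the same columns as `df` above.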

In [3]:
df.head()

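| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |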


---
<a id='step2'></a>
## Step 2: Exploratory Data Analysis: Clean & Explore

In [4]:
#inspecting missing values and data types

print("Missing Data - there are this many occurrences: {}".format(df.isna().sum().sum()))
df.info()

Missing Data - there are this many occurrences: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB
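
The single total printed above would not show where gaps sit if a dataset had any. As a sketch of a per-column view, the helper below (the name `missing_summary` is hypothetical) reports the missing count, percentage, and dtype for each column. It is demonstrated on a toy frame with one deliberately missing value, since the Boston data itself has none.

```python
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts, percentages, and dtypes."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
        "dtype": frame.dtypes.astype(str),
    })
    # Columns with the most gaps come first.
    return summary.sort_values("missing", ascending=False)


# Toy frame with one gap, for illustration only.
toy = pd.DataFrame({"RM": [6.575, None, 7.185], "MEDV": [24.0, 21.6, 34.7]})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the notebook's frame should show zero missing values in every column, consistent with the `df.info()` output.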


In [5]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu