# Data Cleaning  
In this notebook, we will prepare the data for a machine learning algorithm. We will develop a preprocessing pipeline that does the following: 
1. Adds useful attributes
2. Deals with missing values
3. Encodes any non-numerical data
4. Rescales our features to an appropriate range and variance for our algorithm

In [18]:
import os
import joblib

# graphing
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
import numpy as np


In [6]:
# import our stratified training set from the ExploratoryAnalysis notebook
housing = pd.read_pickle("StratifiedTrainingSet.pkl")

# only keep the attributes we're interested in
attributes = []

In [8]:
"""
Imputing Missing Numerical Data

For missing values of any attribute, we will use an imputer to fill in that attribute's median.
The median is only defined for numerical values, so we drop ocean_proximity first.
The only attribute with missing values here is total_bedrooms, but the solution is general.
"""

# import and create an imputer object
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

# drop our categorical attribute
housing_num = housing.drop(columns=["ocean_proximity"])

# 'fit' the imputer to the data (given our strategy, it calculates the median)
imputer.fit(housing_num)

# check that the imputer calculated the median
print(imputer.statistics_)
print(housing_num.median().values)

[-1.1851e+02  3.4260e+01  2.9000e+01  2.1195e+03  4.3300e+02  1.1640e+03
  4.0800e+02  3.5409e+00  1.7950e+05]
[-1.1851e+02  3.4260e+01  2.9000e+01  2.1195e+03  4.3300e+02  1.1640e+03
  4.0800e+02  3.5409e+00  1.7950e+05]


In [9]:
# use the imputer to transform the data and fill missing data.
# returns a numpy array
X = imputer.transform(housing_num)

# convert back to pandas dataframe
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
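
As a small illustration of the imputation step above, the following sketch runs the same `SimpleImputer(strategy="median")` on a made-up two-column frame. The `toy` data and the `toy_`-prefixed names are hypothetical and are not part of the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data with one missing total_bedrooms value
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 4.0, 9.0],
})

toy_imputer = SimpleImputer(strategy="median")

# fit_transform learns the per-column medians and fills the gaps in one step;
# it returns a numpy array, so we wrap it back into a DataFrame
toy_filled = pd.DataFrame(
    toy_imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(toy_imputer.statistics_)  # medians: 25.0 and 4.0 (NaN is ignored)
print(toy_filled)
```

The missing entry is replaced by 4.0, the median of the three observed `total_bedrooms` values.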

In [10]:
"""
Encoding Categorical Data

We will encode our ocean_proximity text data as numbers.
Since there are only 5 categories, and they have no natural order (they are not "adjacent"
to one another in the way integers are), we will use one-hot encoding.
This converts our 1 categorical variable into 5 binary variables. Each data entry has a value of 1
in exactly one of the resulting columns, and 0 in all others.
"""
from sklearn.preprocessing import OneHotEncoder

# get the column of categorical data
housing_cat = housing[["ocean_proximity"]]

# create encoder object
cat_encoder = OneHotEncoder()

# fit the encoder and transform the data in one step, using fit_transform
# (the result is a SciPy sparse matrix, hence the .toarray() call below)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

print(housing_cat_1hot.toarray()[:10])
print(cat_encoder.categories_)

[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]]
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]
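
To make the column order of the encoding concrete, here is a minimal sketch on a made-up column. The `toy_cat` data is hypothetical, and `handle_unknown="ignore"` is an optional setting that this notebook does not use; it is shown only to illustrate what happens when a category appears that was not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy column with three distinct categories
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]}
)

# handle_unknown="ignore" is an assumption for this sketch, not the notebook's setting
toy_encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix; toarray() makes it dense for printing
toy_1hot = toy_encoder.fit_transform(toy_cat).toarray()

print(toy_encoder.categories_)  # categories are sorted, which fixes the column order
print(toy_1hot)

# a category unseen during fit is encoded as a row of zeros under "ignore"
unseen = toy_encoder.transform(
    pd.DataFrame({"ocean_proximity": ["NEAR OCEAN"]})
).toarray()
print(unseen)
```

With the default `handle_unknown="error"`, the last call would raise an error instead of returning zeros.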


In [13]:
"""
Feature Scaling

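We will standardize the data (subtract each column's mean and divide by its standard
deviation), since many of our attributes have long positive tails (i.e. many positive
outliers) and standardization is less affected by outliers than min-max scaling.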
"""
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit the scaler to the data (finds mean and variance)
scaler.fit(housing_tr)

# print calculated values for each column
print(scaler.mean_)
print(scaler.var_)

# transform the data
housing_scaled = scaler.transform(housing_tr)

[-1.19575834e+02  3.56395773e+01  2.86531008e+01  2.62272832e+03
  5.33998123e+02  1.41979082e+03  4.97060380e+02  3.87558937e+00
  2.06990921e+05]
[4.00720174e+00 4.57101322e+00 1.58114157e+02 4.57272746e+06
 1.68778972e+05 1.24468040e+06 1.41157604e+05 3.62861320e+00
 1.33863769e+10]
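
The effect of a long positive tail on the two scaling approaches can be sketched on a made-up column. The `toy` values below are hypothetical; `MinMaxScaler` is used only for comparison and is not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# hypothetical single feature with one large positive outlier
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# standardization: z = (x - mean) / std, giving zero mean and unit variance
standardized = StandardScaler().fit_transform(toy)

# the same result computed by hand (StandardScaler uses the population std, ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# min-max scaling maps the column onto [0, 1], so the outlier defines the upper end
min_maxed = MinMaxScaler().fit_transform(toy)

print(standardized.ravel())
print(min_maxed.ravel())
```

In the min-max result, the four ordinary values are squeezed close to 0 because the single outlier takes the value 1. The standardized values are not confined to a fixed interval.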


# Creating the Pipeline  
Using a scikit-learn pipeline, we will combine and automate all the preprocessing steps that we have designed so far.  
1. Create new attributes (rooms per household, bedrooms per room, population per household)
2. Replace missing values with the median of that column
3. Encode the ocean_proximity categorical variable using the one-hot method
4. Standardize all numerical data

In [14]:
# create a transformer class we can use to add our new attributes
from sklearn.base import BaseEstimator, TransformerMixin

# store the column indices we will use to make new attributes
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

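# note: do not add *args or **kwargs to __init__; BaseEstimator derives
# get_params()/set_params() from the explicit __init__ arguments
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # accept y so the transformer follows the scikit-learn fit(X, y) convention
        return self # nothing else to do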
    def transform(self, X):
        rooms_per_household = X[:,rooms_ix]/X[:,households_ix]
        bedrooms_per_room = X[:,bedrooms_ix]/X[:,rooms_ix]
        population_per_household = X[:,population_ix]/X[:,households_ix]
        
        return np.c_[X, rooms_per_household, bedrooms_per_room, population_per_household]
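
The arithmetic inside `transform` can be checked by hand on a single made-up row. This sketch assumes the column order longitude, latitude, housing median age, total rooms, total bedrooms, population, households, which is what the indices 3 to 6 above imply; the values themselves are hypothetical.

```python
import numpy as np

# same column indices as in the transformer above
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

# hypothetical row: longitude, latitude, age, rooms, bedrooms, population, households
row = np.array([[-122.0, 37.0, 30.0, 1000.0, 200.0, 500.0, 250.0]])

rooms_per_household = row[:, rooms_ix] / row[:, households_ix]            # 1000 / 250
bedrooms_per_room = row[:, bedrooms_ix] / row[:, rooms_ix]                # 200 / 1000
population_per_household = row[:, population_ix] / row[:, households_ix]  # 500 / 250

# np.c_ appends the three new attributes as extra columns on the right
extended = np.c_[row, rooms_per_household, bedrooms_per_room, population_per_household]

print(extended.shape)
print(extended[0, -3:])
```

Because the transformer indexes with `X[:, ix]`, it expects a numpy array rather than a DataFrame, which is why it sits after the imputer in the numerical pipeline.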

In [15]:
# create a pipeline for our numerical attributes only
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

In [19]:
# combine the numerical pipeline with the categorical one (just one-hot encoding)
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs)
])

['PreprocessingPipeline.pkl']
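
The output above resembles the list of file names that `joblib.dump` returns, but the cell that fits and saves the pipeline is not shown here. The following is a minimal sketch of that workflow under stated assumptions: it uses a hypothetical toy frame, a reduced pipeline without the attribute adder, and a temporary file named `ToyPipeline.pkl` rather than the notebook's actual file.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical toy data with two numerical columns and one categorical column
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 4.0, 9.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

# reduced stand-in for the notebook's pipeline (no CombinedAttributesAdder)
toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
toy_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["total_rooms", "total_bedrooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

# fitting learns the medians, means, variances and categories from the data
toy_prepared = toy_pipeline.fit_transform(toy)
print(toy_prepared.shape)  # 2 scaled columns + 3 one-hot columns

# save the fitted pipeline and load it back; dump returns the written file names
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ToyPipeline.pkl")
    saved_files = joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# the restored pipeline reuses the learned statistics without refitting
restored_prepared = restored.transform(toy)
```

Saving the pipeline after fitting means the statistics learned from the training set can later be applied unchanged to new data.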