# Data Cleaning  
In this notebook, we will prepare the data for a machine learning algorithm. We will develop a preprocessing pipeline that does the following:
1. Adds useful attributes
2. Deals with missing values
3. Encodes any non-numerical data
4. Rescales our features to an appropriate range and variance for our algorithm
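
As a rough sketch of how steps 2 to 4 could fit together, the example below chains an imputer, a scaler, and a one-hot encoder with scikit-learn's `Pipeline` and `ColumnTransformer`. The `toy` DataFrame and the names `num_attribs`, `cat_attribs`, `num_pipeline`, and `full_pipeline` are made up for illustration and are not the notebook's actual data or variables. Step 1 (adding attributes) is left out because it depends on which attributes are chosen.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the housing training set
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"],
})

num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numerical columns: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Apply the numerical pipeline and the one-hot encoder to their own columns
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

toy_prepared = full_pipeline.fit_transform(toy)

# 2 scaled numerical columns + 3 one-hot columns (one per category present)
print(toy_prepared.shape)
```

The cells below work through each of these steps separately on the real training set.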

In [3]:
import joblib
import os

# graphing
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
import numpy as np

HOUSING_PATH = os.path.join("datasets","housing")

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [4]:
# load our stratified training set, saved from the ExploratoryAnalysis notebook
housing = joblib.load("StratifiedTrainingSet.pkl")

In [5]:
"""
Imputing Missing Numerical Data
"""
# for missing values of any attribute, we will use an imputer to fill in the median value
# this only works for numerical values, so we will drop ocean_proximity
# the only attribute that is missing values here is total_bedrooms, but our solution is general

# import and create an imputer object
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

# drop our categorical attribute
housing_num = housing.drop("ocean_proximity",axis=1)

# 'fit' the imputer to the data (given our strategy, it calculates the median)
imputer.fit(housing_num)

# check that the imputer calculated the median
print(imputer.statistics_)
print(housing_num.median().values)

[-1.1851e+02  3.4260e+01  2.9000e+01  2.1195e+03  4.3300e+02  1.1640e+03
  4.0800e+02  3.5409e+00  1.7950e+05]
[-1.1851e+02  3.4260e+01  2.9000e+01  2.1195e+03  4.3300e+02  1.1640e+03
  4.0800e+02  3.5409e+00  1.7950e+05]


In [6]:
# use the imputer to transform the data and fill missing data.
# returns a numpy array
X = imputer.transform(housing_num)

# convert back to pandas dataframe
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
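
The same fit/transform pattern can be checked on a small, made-up example. The `toy_num` DataFrame below is hypothetical and only mimics the shape of the housing data. It shows the missing entry being replaced by the column median and confirms that no missing values remain afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with one missing bedroom count
toy_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

toy_imputer = SimpleImputer(strategy="median")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
toy_filled = pd.DataFrame(
    toy_imputer.fit_transform(toy_num),
    columns=toy_num.columns,
    index=toy_num.index,
)

# The missing value becomes the median of the observed values 129, 190, 235
print(toy_filled.loc[1, "total_bedrooms"])  # 190.0

# Total count of missing values after imputation
print(toy_filled.isna().sum().sum())  # 0
```

Because the medians are stored in `statistics_`, the fitted imputer can later be applied to new data with `transform` without recomputing them.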

In [7]:
"""
Encoding Categorical Data

We will encode our ocean_proximity text data as numbers.
Since there are only 5 categories, and they have no natural order (integer codes would wrongly imply that
categories with nearby codes are similar), we will use one-hot encoding.
This will convert our 1 categorical variable into 5 binary variables. Each data entry will have a value of 1
in exactly one of the resulting columns, and 0 in all others.
"""
from sklearn.preprocessing import OneHotEncoder

# get the column of categorical data
housing_cat = housing[["ocean_proximity"]]

# create encoder object
cat_encoder = OneHotEncoder()

# fit the encoder to the data and transform the data at the same time, using fit_transform
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

print(housing_cat_1hot.toarray()[:10])
print(cat_encoder.categories_)

[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]]
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]
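
A minimal sketch of the same encoding on made-up data is shown below; `toy_cat` and `unseen` are hypothetical. It illustrates two details: `fit_transform` returns a SciPy sparse matrix by default (hence the `.toarray()` call above), and passing `handle_unknown="ignore"`, an optional setting not used in this notebook, makes the encoder output all zeros for a category it did not see during fitting instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct categories
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]}
)

# handle_unknown="ignore" is an assumption made for this sketch
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_1hot = toy_encoder.fit_transform(toy_cat)  # sparse matrix by default

# Categories are stored in sorted order; each one maps to a column
print(toy_encoder.categories_[0])
print(toy_1hot.toarray())

# A category that was absent during fitting is encoded as a row of zeros
unseen = pd.DataFrame({"ocean_proximity": ["ISLAND"]})
print(toy_encoder.transform(unseen).toarray())
```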


In [None]:
"""
Feature Scaling

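We will standardize the data (subtract each attribute's mean and divide by its standard deviation),
since many of our attributes have long right tails (i.e. many large positive outliers), and standardization
is less affected by outliers than min-max scaling.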
"""
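
One way this step might look is sketched below with scikit-learn's `StandardScaler`. The `toy_feature` array is made up for illustration, and the choice of `StandardScaler` is an assumption, since the notebook does not yet show its scaling code. The sketch checks that the scaler matches the manual formula (x - mean) / std and that the result has zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical skewed feature: most values are small, one is a large outlier
toy_feature = np.array([[1.0], [2.0], [3.0], [4.0], [90.0]])

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy_feature)

# StandardScaler computes (x - mean) / std per column,
# using the population standard deviation (ddof=0), like np.std's default
manual = (toy_feature - toy_feature.mean(axis=0)) / toy_feature.std(axis=0)
print(np.allclose(toy_scaled, manual))

# After scaling, the column has mean ~0 and standard deviation ~1
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```

As with the imputer, the scaler should be fitted on the training set only, and the fitted object then reused to transform any later data.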
