### Preprocessing

In [1]:
import pandas as pd

If we build a preprocessing tool instead of a one-off script that writes out a preprocessed data file, we'll be able to change things on the fly during the model build process and speed up iterations.

In this Notebook, we'll do some feature engineering and missing value imputation. This will include:
* Filtering features
* Encoding missing values with a unique value (for that column)
* Encoding categorical features for use in common Machine Learning algorithms

### Create Preprocessor Object

Object-oriented Programming (OOP) is helpful here because it will allow us to keep information derived from the training data (attributes) and the code to perform the preprocessing (methods) in the same place.

We'll create a "preprocessor" object (class) from the training data that can be applied to the test data as well as any new data in production. We'll use the 'fit' and 'transform' paradigm to ensure we only learn information from the training data (not from the test data).

In [2]:
class preprocessor:
    
    def __init__(self, cols_to_filter=None):
        
        self.cols_to_filter = cols_to_filter
        
    def fit(self, X, y=None):
        """learn any information from the training data we may need to transform the test data"""
        
        #learn from the training data and return the class itself
        #allows you to chain fit and transform methods like
        # > p = preprocessor()
        # > p.fit(X).transform(X)
        
        return self
    
    def transform(self, X, y=None):
        """transform the training or test data"""
        #transform the training or test data based on class attributes learned in the `fit` step
        
        #placeholder: nothing to transform yet, so work on a copy of the input
        X_new = X.copy()
        
        return X_new

Note the unused y=None argument. This will be important later when we use this in a Scikit-Learn pipeline.
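To see why the unused y=None matters, here is a minimal sketch of how a pipeline-style runner drives its steps. It assumes the runner hands both X and y to every step's `fit`, which is how Scikit-Learn's `Pipeline` calls its transformers. `DropColumns` and `run_steps` are hypothetical names used only for illustration; they are not part of Scikit-Learn or of this notebook's modules.

```python
import pandas as pd


class DropColumns:
    """Hypothetical stand-in for the preprocessor: drops the configured columns."""

    def __init__(self, cols_to_filter=None):
        self.cols_to_filter = cols_to_filter or []

    def fit(self, X, y=None):
        # y is accepted but ignored, so callers that always pass it do not fail
        return self

    def transform(self, X, y=None):
        return X.drop(self.cols_to_filter, axis=1)


def run_steps(steps, X, y):
    """Hypothetical runner: fit each step on (X, y), then pass the transformed X on."""
    for step in steps:
        X = step.fit(X, y).transform(X)
    return X


X = pd.DataFrame({"parcelid": [1, 2], "bedroomcnt": [3.0, 4.0]})
y = pd.Series([0.01, -0.02])

out = run_steps([DropColumns(["parcelid"])], X, y)
print(list(out.columns))  # ['bedroomcnt']
```

If `fit` were declared as `fit(self, X)`, the call `step.fit(X, y)` would raise a `TypeError`, which is the failure the y=None default avoids.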

We will also write a function called 'get_data'.

In [3]:
import sys
import inspect

sys.path.insert(0, './modules')

#read in the new function
from helpers import get_data
print(inspect.getsource(get_data))

def get_data(dset):
    
    """Create the training dataset (2016) or the test dataset (2017)

    Keyword arguments:
    dset -- a string in {train, test}
    
    Returns:
    a tuple of pandas dataframe (X) and pandas series (y)
    """
    
    year = {'train':2016, 'test':2017}[dset]
    
    train = read_in_dataset('train_{0}'.format(year))
    properties = read_in_dataset('properties_{0}'.format(year))
    merged = merge_dataset(train, properties)
    
    if dset == 'train':
        merged = filter_duplicate_parcels(merged)
    
    y = merged.pop('logerror')
    return merged, y



Notice that if you pass the argument dset='train', we get the 2016 data, and if you pass the argument dset='test', we get the 2017 data. The 2017 data was not originally included in the Zillow Home Value Prediction competition.

In [8]:
train_X, train_y = get_data(dset='train')



In [9]:
train_X.head()

Unnamed: 0,parcelid,transactiondate,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,11016594,2016-01-01,1.0,,,2.0,3.0,,4.0,2.0,...,,,122754.0,360170.0,2015.0,237416.0,6735.88,,,60371070000000.0
1,14366692,2016-01-01,,,,3.5,4.0,,,3.5,...,,,346458.0,585529.0,2015.0,239071.0,10153.02,,,
2,12098116,2016-01-01,1.0,,,3.0,2.0,,4.0,3.0,...,,,61994.0,119906.0,2015.0,57912.0,11484.48,,,60374640000000.0
3,12643413,2016-01-02,1.0,,,2.0,2.0,,4.0,2.0,...,,,171518.0,244880.0,2015.0,73362.0,3048.74,,,60372960000000.0
4,14432541,2016-01-02,,,,2.5,4.0,,,2.5,...,2.0,,169574.0,434551.0,2015.0,264977.0,5488.96,,,60590420000000.0


In [10]:
class preprocessor:
    
    def __init__(self, cols_to_filter=None):
        
        self.cols_to_filter = cols_to_filter
        
    def fit(self, X, y=None):
        """learn any information from the training data we may need to transform the test data"""
        
        #learn from the training data and return the class itself
        #allows you to chain fit and transform methods like
        
        # > p = preprocessor()
        # > p.fit(X).transform(X)
        
        return self
    
    def transform(self, X, y=None):
        """transform the training or test data"""
        #transform the training or test data based on class attributes learned in the `fit` step
        
        X_new = X.drop(self.cols_to_filter, axis=1)
        
        return X_new

In [11]:
p=preprocessor(cols_to_filter = ['parcelid'])
p.transform(train_X).head()

Unnamed: 0,transactiondate,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,2016-01-01,1.0,,,2.0,3.0,,4.0,2.0,,...,,,122754.0,360170.0,2015.0,237416.0,6735.88,,,60371070000000.0
1,2016-01-01,,,,3.5,4.0,,,3.5,,...,,,346458.0,585529.0,2015.0,239071.0,10153.02,,,
2,2016-01-01,1.0,,,3.0,2.0,,4.0,3.0,,...,,,61994.0,119906.0,2015.0,57912.0,11484.48,,,60374640000000.0
3,2016-01-02,1.0,,,2.0,2.0,,4.0,2.0,,...,,,171518.0,244880.0,2015.0,73362.0,3048.74,,,60372960000000.0
4,2016-01-02,,,,2.5,4.0,,,2.5,,...,2.0,,169574.0,434551.0,2015.0,264977.0,5488.96,,,60590420000000.0


### Deal with Datetime Columns

We'll encode the datetime variable as month and year, disregarding the day because it won't be included in the data we'll be scoring.
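As a small standalone illustration of the month/year encoding, the sketch below uses the pandas `.dt` accessor on a toy frame. The three dates are made up for the example, and the `_month` / `_year` suffixes are just one possible naming choice.

```python
import pandas as pd

# toy data: made-up dates, not rows from the real dataset
df = pd.DataFrame({"transactiondate": ["2016-01-01", "2016-03-15", "2017-12-31"]})

# parse once, then pull the parts out with the vectorized .dt accessor
dates = pd.to_datetime(df["transactiondate"])
df["transactiondate_month"] = dates.dt.month
df["transactiondate_year"] = dates.dt.year

# the original string column (and with it the day) is no longer needed
df = df.drop("transactiondate", axis=1)
print(df)
```

The `.dt` accessor gives the same values as `.apply(lambda x: x.month)` without calling a Python function per row.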

In [14]:
class preprocessor:
    
    def __init__(self, cols_to_filter=None, datecols=None):
        
        self.cols_to_filter = cols_to_filter
        self.datecols = datecols
        
    def fit(self, X, y=None):
        """learn any information from the training data we may need to transform the test data"""
        
        #learn from the training data and return the class itself
        #allows you to chain fit and transform methods like
        
        # > p = preprocessor()
        # > p.fit(X).transform(X)
        
        return self
    
    def transform(self, X, y=None):
        """transform the training or test data"""
        #transform the training or test data based on class attributes learned in the `fit` step
        
        X_new = X.drop(self.cols_to_filter, axis=1)
        
        if self.datecols:
            for x in self.datecols:
                X_new[x + 'month'] = pd.to_datetime(X_new[x]).apply(lambda x: x.month)
                X_new[x + '_year'] = pd.to_datetime(X_new[x]).apply(lambda x: x.year)
                X_new = X_new.drop(x, axis=1)
                
        #return after the loop so every date column is processed,
        #and so a frame is still returned when datecols is None
        return X_new
        

In [15]:
p=preprocessor(cols_to_filter = ['parcelid'], datecols=['transactiondate'])
train_X_transformed = p.transform(train_X)

In [16]:
train_X_transformed.head()

Unnamed: 0,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,finishedfloor1squarefeet,...,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock,transactiondatemonth,transactiondate_year
0,1.0,,,2.0,3.0,,4.0,2.0,,,...,122754.0,360170.0,2015.0,237416.0,6735.88,,,60371070000000.0,1,2016
1,,,,3.5,4.0,,,3.5,,,...,346458.0,585529.0,2015.0,239071.0,10153.02,,,,1,2016
2,1.0,,,3.0,2.0,,4.0,3.0,,,...,61994.0,119906.0,2015.0,57912.0,11484.48,,,60374640000000.0,1,2016
3,1.0,,,2.0,2.0,,4.0,2.0,,,...,171518.0,244880.0,2015.0,73362.0,3048.74,,,60372960000000.0,1,2016
4,,,,2.5,4.0,,,2.5,,,...,169574.0,434551.0,2015.0,264977.0,5488.96,,,60590420000000.0,1,2016


### Define an Imputation Strategy

An easy strategy is to make the educated assumption that all the numeric variables are non-negative and to encode missing values with -1.

In [17]:
train_X.loc[:, train_X.isna().sum() > 0].min()

airconditioningtypeid                    1
architecturalstyletypeid                 2
basementsqft                           100
buildingclasstypeid                      4
buildingqualitytypeid                    1
calculatedbathnbr                        1
decktypeid                              66
finishedfloor1squarefeet                44
calculatedfinishedsquarefeet             2
finishedsquarefeet12                     2
finishedsquarefeet13                  1056
finishedsquarefeet15                   560
finishedsquarefeet50                    44
finishedsquarefeet6                    257
fireplacecnt                             1
fullbathcnt                              1
garagecarcnt                             0
garagetotalsqft                          0
hashottuborspa                        True
heatingorsystemtypeid                    1
lotsizesquarefeet                      167
poolcnt                                  1
poolsizesum                             28
pooltypeid1

Because none of these features has negative values, we can impute missing values with -1.
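The same check-then-fill reasoning can be sketched on a toy frame. The values below are made up; only the column names are borrowed from the dataset.

```python
import pandas as pd

# toy data with missing values; the numbers are illustrative only
df = pd.DataFrame({
    "poolcnt": [1.0, None, 1.0],
    "garagecarcnt": [0.0, 2.0, None],
})

# check the assumption first: -1 is only a safe sentinel
# if no observed numeric value is negative (min() skips NaN)
numeric = df.select_dtypes(include="number")
assert (numeric.min() >= 0).all()

# -1 now marks "missing" without colliding with a real value
filled = df.fillna(-1)
print(filled)
```

If the check failed for some column, a different sentinel (or a separate missing-indicator column) would be needed for that column.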

In [18]:
class preprocessor:
    
    def __init__(self, cols_to_filter=None, datecols=None):
        
        self.cols_to_filter = cols_to_filter
        self.datecols = datecols
        
    def fit(self, X, y=None):
        """learn any information from the training data we may need to transform the test data"""
        
        #learn from the training data and return the class itself
        #allows you to chain fit and transform methods like
        
        # > p = preprocessor()
        # > p.fit(X).transform(X)
        
        return self
    
    def transform(self, X, y=None):
        """transform the training or test data"""
        #transform the training or test data based on class attributes learned in the `fit` step
        
        # filter
        X_new = X.drop(self.cols_to_filter, axis=1)
        # fill NA
        X_new = X_new.fillna(-1)
        
        if self.datecols:
            for x in self.datecols:
                X_new[x + 'month'] = pd.to_datetime(X_new[x]).apply(lambda x: x.month)
                X_new[x + '_year'] = pd.to_datetime(X_new[x]).apply(lambda x: x.year)
                X_new = X_new.drop(x, axis=1)
                
        #return after the loop so every date column is processed,
        #and so a frame is still returned when datecols is None
        return X_new
        

In [19]:
p=preprocessor(cols_to_filter = ['parcelid'], datecols=['transactiondate'])
train_X_transformed = p.transform(train_X)

In [20]:
train_X_transformed.head()

Unnamed: 0,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,finishedfloor1squarefeet,...,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock,transactiondatemonth,transactiondate_year
0,1.0,-1.0,-1.0,2.0,3.0,-1.0,4.0,2.0,-1.0,-1.0,...,122754.0,360170.0,2015.0,237416.0,6735.88,-1,-1.0,60371070000000.0,1,2016
1,-1.0,-1.0,-1.0,3.5,4.0,-1.0,-1.0,3.5,-1.0,-1.0,...,346458.0,585529.0,2015.0,239071.0,10153.02,-1,-1.0,-1.0,1,2016
2,1.0,-1.0,-1.0,3.0,2.0,-1.0,4.0,3.0,-1.0,-1.0,...,61994.0,119906.0,2015.0,57912.0,11484.48,-1,-1.0,60374640000000.0,1,2016
3,1.0,-1.0,-1.0,2.0,2.0,-1.0,4.0,2.0,-1.0,-1.0,...,171518.0,244880.0,2015.0,73362.0,3048.74,-1,-1.0,60372960000000.0,1,2016
4,-1.0,-1.0,-1.0,2.5,4.0,-1.0,-1.0,2.5,-1.0,-1.0,...,169574.0,434551.0,2015.0,264977.0,5488.96,-1,-1.0,60590420000000.0,1,2016


In [21]:
assert all(train_X_transformed.isna().sum() == 0) 

### Encoding Categorical/Discrete Features

Most ML algorithms don't handle categorical features well, so we need to encode them as real numbers. We have a couple of options. We could encode each "level" of the categorical feature as an integer, effectively mapping the discrete space to an integer space. This works OK for non-linear algorithms, but not for any that assume linearity, because the integers impose an arbitrary order and spacing on the levels. To be more flexible during modeling, we can instead encode each categorical variable with a set of binary features ("dummy coding").
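One practical wrinkle with dummy coding is that the training and test data may not contain the same levels. The sketch below shows one way to keep the columns aligned: learn the dummy column names from the training data and reindex the test data to them. The `heating` column and its values are hypothetical, and `reindex` is used here as a compact alternative to adding and selecting columns by hand.

```python
import pandas as pd

# hypothetical categorical column; the levels are made up for illustration
train = pd.DataFrame({"heating": ["central", "floor", None]})
test = pd.DataFrame({"heating": ["central", "solar"]})

# "fit": learn the dummy columns from the training data only
train_d = pd.get_dummies(train, columns=["heating"], dummy_na=True)
train_cols = train_d.columns

# "transform": dummy code the test data, then align to the training columns.
# Levels missing from the test data become all-zero columns; levels unseen
# in training ("solar") are dropped.
test_d = pd.get_dummies(test, columns=["heating"], dummy_na=True)
test_d = test_d.reindex(columns=train_cols, fill_value=0)

print(list(test_d.columns))
```

Aligning to the training columns means a model fit on the training data always sees the same features in the same order.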

In [26]:
class preprocessor:
    
    def __init__(self, cols_to_filter=None, datecols=None):
        
        self.cols_to_filter = cols_to_filter
        self.datecols = datecols
        self.was_fit = False
        
    def fit(self, X, y=None):
        """learn any information from the training data we may need to transform the test data"""
        
        # learn from the training data and return the class itself.
        # allows you to chain fit and transform methods like
        
        # > p = preprocessor()
        # > p.fit(X).transform(X)
        
        self.was_fit = True
        
        # filter
        X_new = X.drop(self.cols_to_filter, axis=1)
        
        categorical_features = X_new.dtypes[X_new.dtypes == 'object'].index
        self.categorical_features = [x for x in categorical_features if 'date' not in x]
        
        dummied = pd.get_dummies(X_new, columns=self.categorical_features, dummy_na=True)
        
        self.colnames = dummied.columns
        del dummied
        
        return self
    
    def transform(self, X, y=None):
        """transform the training or test data"""
        # transform the training or test based on class attributes learned in the `fit` step
        
        if not self.was_fit:
            raise RuntimeError("need to fit preprocessor first")
            
        #filter
        X_new = X.drop(self.cols_to_filter, axis=1)
        
        # dummy code
        X_new = pd.get_dummies(X_new, columns=self.categorical_features, dummy_na=True)
        # add any dummy columns seen in training but absent from this data
        newcols = set(self.colnames) - set(X_new.columns)
        for x in newcols:
            X_new[x] = 0
            
        # keep only the training columns, in the training order
        X_new = X_new[self.colnames]
        
        # fill NA after dummy code
        X_new = X_new.fillna(-1)
        
        if self.datecols:
            for x in self.datecols:
                X_new[x + '_month'] = pd.to_datetime(X_new[x]).apply(lambda x: x.month)
                X_new[x + '_year'] = pd.to_datetime(X_new[x]).apply(lambda x: x.year)
                X_new = X_new.drop(x, axis=1)
                
        return X_new
        
    def fit_transform(self, X, y=None):
        """fit and transform wrapper method, used for sklearn pipeline"""
        
        return self.fit(X).transform(X)
        
        

In [27]:
p=preprocessor(cols_to_filter = ['rawcensustractandblock', 'censustractandblock', 'propertyzoningdesc', 'regionidneighborhood', 'regionidzip', 'parcelid'], datecols=['transactiondate'])

p.fit(train_X)

<__main__.preprocessor at 0x16c69c690>

In [30]:
train_X_transformed = p.transform(train_X)

In [31]:
train_X_transformed.head()

Unnamed: 0,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,finishedfloor1squarefeet,...,propertycountylandusecode_73,propertycountylandusecode_8800,propertycountylandusecode_96,propertycountylandusecode_nan,fireplaceflag_True,fireplaceflag_nan,taxdelinquencyflag_Y,taxdelinquencyflag_nan,transactiondate_month,transactiondate_year
0,1.0,-1.0,-1.0,2.0,3.0,-1.0,4.0,2.0,-1.0,-1.0,...,0,0,0,0,0,1,0,1,1,2016
1,-1.0,-1.0,-1.0,3.5,4.0,-1.0,-1.0,3.5,-1.0,-1.0,...,0,0,0,0,0,1,0,1,1,2016
2,1.0,-1.0,-1.0,3.0,2.0,-1.0,4.0,3.0,-1.0,-1.0,...,0,0,0,0,0,1,0,1,1,2016
3,1.0,-1.0,-1.0,2.0,2.0,-1.0,4.0,2.0,-1.0,-1.0,...,0,0,0,0,0,1,0,1,1,2016
4,-1.0,-1.0,-1.0,2.5,4.0,-1.0,-1.0,2.5,-1.0,-1.0,...,0,0,0,0,0,1,0,1,1,2016


In [32]:
assert all(train_X_transformed.isna().sum() == 0)