# Preprocessing

This is where _feature engineering_ and _missing value imputation_ take place. Preprocessing should:

1. convert raw data into the form the model expects
2. be a static 1:1 mapping of inputs to outputs
3. not depend on the size of the input; it should produce the same result for one record as it does for 1000
4. not drop or reorder any records
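
The four requirements above can be checked mechanically. The sketch below is one hypothetical way to do that: `preprocess` and `check_preprocess_contract` are illustrative names that do not appear elsewhere in this notebook, the toy frame is made up, and the constant fill with -1 simply mirrors the imputation used later on.

```python
import pandas as pd


def preprocess(raw_df):
    # Hypothetical stand-in for the real preprocessing step:
    # a constant fill works row by row, so it satisfies all four points.
    return raw_df.fillna(-1)


def check_preprocess_contract(raw_df):
    full = preprocess(raw_df)

    # Point 4: no records dropped or reordered.
    assert len(full) == len(raw_df)
    assert full.index.equals(raw_df.index)

    # Points 2 and 3: processing one record alone gives the same
    # result as processing it as part of the whole frame.
    for position in range(len(raw_df)):
        single = preprocess(raw_df.iloc[[position]])
        assert single.equals(full.iloc[[position]])
    return True


# Made-up data, for illustration only.
toy_df = pd.DataFrame({'LotFrontage': [72.0, None, 89.0],
                       'LotArea': [8640, 12936, 12898]})
contract_holds = check_preprocess_contract(toy_df)
```

A check like this fails as soon as a step starts using statistics of the whole input (a column mean, for example), which is exactly what point 3 rules out.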

In [2]:
#import anything we might need
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
#load in data
observed_df = pd.read_csv( 'train.csv' )

We are not going to delete anything for this model.

In [4]:
#split dataset containing target values
observation_train , observation_test = train_test_split( 
    observed_df , random_state = 25 )
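
As a quick sanity check on the split, the sketch below uses a small made-up frame (not `train.csv`) to show that `train_test_split` with its default settings holds out a quarter of the rows, loses none of them, and returns the same partition whenever `random_state` is fixed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for observed_df; the values are invented for illustration.
toy_df = pd.DataFrame({'Id': range(1, 101), 'SalePrice': range(100, 200)})

toy_train, toy_test = train_test_split(toy_df, random_state=25)

# The default test_size is 0.25, so 100 rows split into 75 / 25.
split_sizes = (len(toy_train), len(toy_test))

# Every original row lands in exactly one of the two parts.
recombined_ids = sorted(toy_train['Id'].tolist() + toy_test['Id'].tolist())

# Re-running with the same random_state reproduces the same partition.
toy_train_again, _ = train_test_split(toy_df, random_state=25)
same_partition = toy_train.index.equals(toy_train_again.index)
```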

### Deal With Missing Values

We will, however, impute missing values with -1.

In [7]:
observation_train = observation_train.fillna( -1 )

In [9]:
observation_test = observation_test.fillna( -1 )
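
Filling with a constant keeps the imputation independent of how many rows are passed in, which is the third point in the list at the top. The toy comparison below is only an illustration (the frame and its values are made up): a constant fill gives a record the same value whether it is processed alone or in a batch, whereas a mean computed from the input does not.

```python
import pandas as pd

# Made-up data, for illustration only.
toy_df = pd.DataFrame({'LotFrontage': [70.0, None, 90.0]})
missing_row = toy_df.iloc[[1]]  # the single record with a missing value

# Constant fill: same answer in a batch and alone.
constant_batch = toy_df.fillna(-1).loc[1, 'LotFrontage']
constant_alone = missing_row.fillna(-1).loc[1, 'LotFrontage']

# Mean fill computed from whatever was passed in: the batch yields 80.0,
# but a lone record has no other values to average, so it stays NaN.
mean_batch = toy_df.fillna(toy_df.mean()).loc[1, 'LotFrontage']
mean_alone = missing_row.fillna(missing_row.mean()).loc[1, 'LotFrontage']
```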

Confirm no missing values remain (no error message means the assertion passed).

In [12]:
assert all( observation_train.isna().sum() == 0 )

In [13]:
assert all( observation_test.isna().sum() == 0 )

### Encode Categorical Features

In [14]:
observ_train = pd.get_dummies( observation_train , columns = observation_train.dtypes[ 
    observation_train.dtypes == 'object'].index , dummy_na = True )

In [15]:
observ_test = pd.get_dummies( observation_test , columns = observation_test.dtypes[ 
    observation_test.dtypes == 'object'].index , dummy_na = True )

Now we have a lot more columns because `get_dummies` creates one column for every level of each categorical feature.
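
To see where the extra columns come from, here is a small made-up example (the frame is not taken from `train.csv`). It also shows a side effect of encoding after the fill: the missing values are already -1 by this point, so they become a `_-1` level of their own and the `_nan` column added by `dummy_na = True` stays all zeros.

```python
import pandas as pd

# Made-up data; dtype='object' mirrors the string columns selected above.
toy_df = pd.DataFrame({
    'LotArea': [8640, 12936, 12898],
    'SaleType': pd.Series(['WD', None, 'New'], dtype='object'),
})

# Same order of operations as the cells above: fill first, then encode.
toy_filled = toy_df.fillna(-1)
toy_encoded = pd.get_dummies(toy_filled, columns=['SaleType'], dummy_na=True)

# One column per level, plus the _nan column requested by dummy_na.
encoded_columns = set(toy_encoded.columns)

# The filled row is counted under SaleType_-1, not SaleType_nan.
filled_count = int(toy_encoded['SaleType_-1'].sum())
nan_count = int(toy_encoded['SaleType_nan'].sum())
```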

In [16]:
observ_train.head()

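| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | SaleType_Oth | SaleType_WD | SaleType_nan | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | SaleCondition_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1276 | 1277 | 60 | -1.0 | 12936 | 6 | 6 | 1972 | 1972 | 0.0 | 593 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1217 | 1218 | 20 | 72.0 | 8640 | 8 | 5 | 2009 | 2009 | 72.0 | 936 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1036 | 1037 | 20 | 89.0 | 12898 | 9 | 5 | 2007 | 2008 | 70.0 | 1022 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1320 | 1321 | 20 | 70.0 | 8400 | 6 | 3 | 1957 | 1957 | 0.0 | 189 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 80 | 81 | 60 | 100.0 | 13000 | 6 | 6 | 1968 | 1968 | 576.0 | 448 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |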


Make sure there are no NAs.

In [17]:
assert all( observ_train.isna().sum() == 0 )

In [18]:
assert all( observ_test.isna().sum() == 0 )
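
One thing these checks do not cover: because the train and test frames are encoded separately, a level that occurs in only one of them produces a column the other lacks. A common remedy, sketched here on made-up data rather than taken from this notebook, is to reindex the test columns to match the training columns and fill the absent ones with 0.

```python
import pandas as pd

# Made-up data: 'Oth' and 'New' appear only in train, 'COD' only in test.
toy_train = pd.DataFrame({'SaleType': ['WD', 'New', 'Oth']})
toy_test = pd.DataFrame({'SaleType': ['WD', 'COD']})

toy_train_enc = pd.get_dummies(toy_train, columns=['SaleType'])
toy_test_enc = pd.get_dummies(toy_test, columns=['SaleType'])

# Encoded separately, the two frames disagree on their columns.
columns_match_before = list(toy_train_enc.columns) == list(toy_test_enc.columns)

# Keep exactly the training columns: levels unseen in training
# (SaleType_COD) are dropped, levels missing from test are added as 0.
toy_test_aligned = toy_test_enc.reindex(columns=toy_train_enc.columns,
                                        fill_value=0)
columns_match_after = list(toy_train_enc.columns) == list(toy_test_aligned.columns)
```

Reindexing only changes columns, so the number and order of records stay as they were.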