# Preprocessing

This is where _feature engineering_ and _missing value imputation_ take place. A few points to hit:

1. should convert raw data into the form the model expects
2. should be a static 1:1 mapping of inputs to outputs
3. should not depend on the size of the input; should produce the same result for one record as it does for 1,000
4. should not drop or reorder any records
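
As a rough illustration of points 2 to 4, the sketch below contrasts a fixed-value imputation with one that depends on the batch it sees. The function names and the toy frame are hypothetical and only meant to show the idea, not the preprocessing used later in this notebook.

```python
import pandas as pd

def preprocess_static(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical static step: fills with a fixed sentinel, never drops or reorders rows."""
    return raw.fillna(-1)

def preprocess_batch_dependent(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical counter-example: the fill value is computed from the batch itself."""
    return raw.fillna(raw.mean())

# Toy numeric frame (made-up values) with one missing entry
batch = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0],
    "LotArea": [8450.0, 9600.0, 11250.0],
})
one_row = batch.iloc[[1]]  # the same record, sent on its own

static_full = preprocess_static(batch)
static_one = preprocess_static(one_row)
dependent_full = preprocess_batch_dependent(batch)
dependent_one = preprocess_batch_dependent(one_row)

# Static mapping: the record looks the same alone or inside a larger batch
print(static_full.loc[1].equals(static_one.loc[1]))
# Batch-dependent mapping: the result changes with the size of the input
print(dependent_full.loc[1].equals(dependent_one.loc[1]))
```

The second check comes out different because the column mean of a single record with a missing value is itself missing, so nothing gets filled.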

In [22]:
#import anything we might need
import pandas as pd
from sklearn.model_selection import train_test_split

In [23]:
#load in data
observed_df = pd.read_csv( 'train.csv' )

We are not going to delete anything for this model.

In [24]:
#split dataset containing target values
observation_train , observation_test = train_test_split( 
    observed_df , random_state = 25 )

### Deal With Missing Values

However, we are going to impute missing values with -1.

In [25]:
observation_train = observation_train.fillna( -1 )

In [26]:
observation_test = observation_test.fillna( -1 )
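
One side effect worth seeing on a small scale: filling every column with -1 also puts -1 into the text columns, where it later behaves like just another category level. The toy frame below is made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame (made-up values): one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0],
    "Alley": pd.Series(["Grvl", None, "Pave"], dtype=object),
})

# Same imputation as above: every missing value becomes -1
toy_filled = toy.fillna(-1)

# When the categorical column is one-hot encoded, -1 shows up as its own level
toy_encoded = pd.get_dummies(toy_filled, columns=["Alley"])
print(toy_encoded.columns.tolist())
```

This is where a column such as `Alley_-1` comes from once the categorical features are encoded.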

Confirm no missing values remain (no error message appears). Nice!

In [27]:
assert all( observation_train.isna().sum() == 0 )

In [28]:
assert all( observation_test.isna().sum() == 0 )

### Encode Categorical Features

In [29]:
observ_train = pd.get_dummies( observation_train , columns = observation_train.dtypes[ 
    observation_train.dtypes == 'object'].index , dummy_na = True )

In [30]:
observ_test = pd.get_dummies( observation_test , columns = observation_test.dtypes[ 
    observation_test.dtypes == 'object'].index , dummy_na = True )
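
Because the two frames are encoded separately, a level that appears in only one of them produces a column the other lacks. A minimal sketch of one possible way to line them up, assuming the training columns are treated as the reference; the toy frames and variable names are hypothetical and this step is not part of the notebook above.

```python
import pandas as pd

# Toy frames (made-up values): 'Grvl' occurs in train but not in test
train = pd.DataFrame({"LotArea": [8450, 9600, 11250], "Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"LotArea": [9550, 14260], "Street": ["Pave", "Pave"]})

train_encoded = pd.get_dummies(train, columns=["Street"])
test_encoded = pd.get_dummies(test, columns=["Street"])

# Reindex test to the training columns; levels missing from test become all-zero columns,
# and any test-only columns would be discarded
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded.columns.tolist())
print(test_aligned.columns.tolist())
```

Reindexing changes only the columns, so no records are dropped or reordered.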

Now we have a lot more columns, because get_dummies returned a column for every level of each categorical feature.

In [31]:
observ_train.columns.tolist()

['Id',
 'MSSubClass',
 'LotFrontage',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'MoSold',
 'YrSold',
 'SalePrice',
 'MSZoning_C (all)',
 'MSZoning_FV',
 'MSZoning_RH',
 'MSZoning_RL',
 'MSZoning_RM',
 'MSZoning_nan',
 'Street_Grvl',
 'Street_Pave',
 'Street_nan',
 'Alley_-1',
 'Alley_Grvl',
 'Alley_Pave',
 'Alley_nan',
 'LotShape_IR1',
 'LotShape_IR2',
 'LotShape_IR3',
 'LotShape_Reg',
 'LotShape_nan',
 'LandContour_Bnk',
 'LandContour_HLS',
 'LandContour_Low',
 'LandContour_Lvl',
 'LandContour_nan',
 'Utilities_AllPub',
 'Utilities_nan',
 'LotConfig_C

Make sure there are no NAs.

In [32]:
assert all( observ_train.isna().sum() == 0 )

In [33]:
assert all( observ_test.isna().sum() == 0 )

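However, we do have encoded features whose column names include "nan", which is bad: they come from dummy_na = True, and since the missing values were already imputed with -1, these columns are all zeros and carry no information.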
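
The effect can be reproduced on a toy column with no missing values, which mimics the state of the data after imputation. The frame and variable names below are hypothetical. The sketch shows one possible fix, switching `dummy_na` off, and is not necessarily what the notebook does next.

```python
import pandas as pd

# Toy column (made-up values) with nothing missing
filled = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# dummy_na=True adds a "_nan" indicator column even when no NaN is present
with_nan_col = pd.get_dummies(filled, columns=["Street"], dummy_na=True)
# dummy_na=False (the default) leaves it out
without_nan_col = pd.get_dummies(filled, columns=["Street"], dummy_na=False)

nan_cols = [c for c in with_nan_col.columns if c.endswith("_nan")]
print(nan_cols)
print(int(with_nan_col[nan_cols].sum().sum()))
print(without_nan_col.columns.tolist())
```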