# Engineer
In this notebook, we engineer our features as follows:
* Convert numeric data to categorical
    * `YrSold`
    * `MoSold`
    * `MSSubClass`
* Convert categorical data with a natural ordering (Likert-scale-type data) into ordinal data, e.g., `GarageQual`: NoGarage->0, Po->1, Fa->2, TA->3, Gd->4, Ex->5
* Create new features
    * Derived, e.g., `NumFloors`, the number of floors in a property
    * Indicators (booleans), e.g., `IsPUD`, whether a property is in a planned unit development (PUD)
* Collapse the number of categories for categorical features on a case-by-case basis, e.g., `OverallQual` has a scale of 1-10: 1-3 are simplified to "bad", 4-6 to "normal", and 7-10 to "good"
* Dummify (one-hot encode) categorical features

In [325]:
%load_ext autoreload
%autoreload 2


In [326]:
import pandas as pd
import data_dict
import features

In [327]:
df = pd.read_csv("../data/cleaned.csv")
# Drop the stray index column that read_csv loads as 'Unnamed: 0'
df = df.drop(columns='Unnamed: 0')
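
An `Unnamed: 0` column usually appears when a CSV was written with pandas' default `index=True` and then read back without `index_col`. How `cleaned.csv` was actually produced is not shown here, so the following is only a sketch of that mechanism on a made-up frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned housing data (values are illustrative).
toy = pd.DataFrame({"PID": [101, 102], "SalePrice": [126000, 139500]})

# Writing with the default index=True stores the row index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Reading it back turns that unnamed column into 'Unnamed: 0'.
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# Option 1: drop the column after reading, as this notebook does.
dropped = reloaded.drop(columns="Unnamed: 0")

# Option 2: tell read_csv that the first column is the index.
buffer.seek(0)
indexed = pd.read_csv(buffer, index_col=0)
print(list(indexed.columns))
```

Either option leaves only the `PID` and `SalePrice` columns; which one to prefer depends on whether downstream notebooks expect the extra column to be present in the file.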

## Convert numerical data to categorical
`YrSold`, `MoSold`, and `MSSubClass` are categorical variables, but they are encoded as integers in the dataset.

In [328]:
# NOTE: We treat MoSold and YrSold as categorical because they take only 12 (Jan->Dec) and 5 (2006->2010) values, respectively.
# This might not be appropriate if the dataset covered many years of sales
df['YrSold'] = df['YrSold'].astype(str)
df['MoSold'] = df.MoSold.map(data_dict.convert_mosold)
df['MSSubClass'] = df['MSSubClass'].astype(str)
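
The conversion above can be illustrated on a toy frame. The contents of `data_dict.convert_mosold` are not shown in this notebook, so the month dictionary below is a hypothetical stand-in; only the use of `Series.map` and `astype(str)` mirrors the cell above.

```python
import pandas as pd

# Hypothetical stand-in for data_dict.convert_mosold (the real mapping is not shown).
convert_mosold = {
    1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
    7: "Jul", 8: "Aug", 9: "Sep", 10: "Oct", 11: "Nov", 12: "Dec",
}

# Illustrative rows; the values are made up.
toy = pd.DataFrame(
    {"MoSold": [3, 12, 7], "YrSold": [2006, 2010, 2008], "MSSubClass": [20, 60, 120]}
)

# Integers become labels, so later encoding treats them as categories, not magnitudes.
toy["MoSold"] = toy["MoSold"].map(convert_mosold)
toy["YrSold"] = toy["YrSold"].astype(str)
toy["MSSubClass"] = toy["MSSubClass"].astype(str)

print(toy.dtypes)
```

After the conversion all three columns hold strings, so a model no longer reads `MSSubClass` 120 as "six times" `MSSubClass` 20.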

## Convert categorical to ordinal

In [329]:
# Overall features
# NO OP: These are already ordinalized

In [330]:
# Exterior features
df['ExterQual'] = df.ExterQual.map(data_dict.convert_exterqual)
df['ExterCond'] = df.ExterCond.map(data_dict.convert_extercond)

In [331]:
# Basement features
df['BsmtQual'] = df.BsmtQual.map(data_dict.convert_bsmtqual)
df['BsmtCond'] = df.BsmtCond.map(data_dict.convert_bsmtcond)
df['BsmtExposure'] = df.BsmtExposure.map(data_dict.convert_bsmtexposure)
df['BsmtFinType1'] = df.BsmtFinType1.map(data_dict.convert_bsmtfintype)
df['BsmtFinType2'] = df.BsmtFinType2.map(data_dict.convert_bsmtfintype)

In [332]:
# Home Interior features
df['Functional'] = df.Functional.map(data_dict.convert_functional)
df['FireplaceQu'] = df.FireplaceQu.map(data_dict.convert_fireplacequ)
df['HeatingQC'] = df.HeatingQC.map(data_dict.convert_heatingqc)
df['KitchenQual'] = df.KitchenQual.map(data_dict.convert_kitchenqual)

In [333]:
# Land features
df['LandSlope'] = df.LandSlope.map(data_dict.convert_landslope)
df['LotShape'] = df.LotShape.map(data_dict.convert_lotshape)

In [334]:
# Garage features
df['GarageCond'] = df.GarageCond.map(data_dict.convert_garagecond)
df['GarageQual'] = df.GarageQual.map(data_dict.convert_garagequal)

In [335]:
# Road features
df['Street'] = df.Street.map(data_dict.convert_street)
df['PavedDrive'] = df.PavedDrive.map(data_dict.convert_paveddrive)
df['Alley'] = df.Alley.map(data_dict.convert_alley)

In [336]:
# Other features
df['Utilities'] = df.Utilities.map(data_dict.convert_utilities)
df['PoolQC'] = df.PoolQC.map(data_dict.convert_poolqc)
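
Each of the ordinal conversions above follows the same pattern. As a minimal sketch, the `GarageQual` mapping from the introduction can be written as a dictionary and applied with `Series.map`. The actual `data_dict.convert_garagequal` is not shown, so the dictionary here is an assumption based on that example.

```python
import pandas as pd

# Mapping taken from the GarageQual example in the introduction.
# The actual data_dict.convert_garagequal is not shown, so this dict is an assumption.
convert_garagequal = {"NoGarage": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Illustrative values only.
toy = pd.DataFrame({"GarageQual": ["TA", "NoGarage", "Ex", "Fa"]})
toy["GarageQual"] = toy["GarageQual"].map(convert_garagequal)

# Series.map returns NaN for labels missing from the dict, so this check
# catches typos or unexpected categories early.
assert toy["GarageQual"].notna().all()

print(toy["GarageQual"].tolist())  # [3, 0, 5, 2]
```

The `notna` check is worth keeping near any dictionary-based mapping, because an unmapped label fails silently otherwise.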

## New Features

#### Indicators

In [337]:
# See whether a property is in a PUD from the dwelling type
df['IsPUD'] = df.MSSubClass.map(data_dict.get_pud_indicator)
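
A sketch of how such an indicator might be derived is shown below. The body of `data_dict.get_pud_indicator` is not shown in this notebook, so both the function and the set of PUD codes are assumptions, the latter based on the `MSSubClass` descriptions in the Ames data dictionary. The codes are strings because `MSSubClass` was converted with `astype(str)` earlier.

```python
import pandas as pd

# Assumed PUD dwelling-type codes (from the Ames data dictionary descriptions).
PUD_CODES = {"120", "150", "160", "180"}


def get_pud_indicator(ms_subclass: str) -> int:
    """Return 1 if the dwelling-type code denotes a PUD, else 0 (hypothetical helper)."""
    return int(ms_subclass in PUD_CODES)


# Illustrative values only.
toy = pd.DataFrame({"MSSubClass": ["20", "120", "60", "160"]})
toy["IsPUD"] = toy["MSSubClass"].map(get_pud_indicator)

print(toy["IsPUD"].tolist())  # [0, 1, 0, 1]
```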

In [338]:
# TODO LotShape isRegular

#### Other new features

In [339]:
# Get number of floors for the property from the dwelling type
df['NumFloors'] = df.MSSubClass.map(data_dict.get_num_floors)

In [340]:
# Combine full and half baths into a single count, weighting each half bath at 0.25
df['BsmtAllBaths'] = df['BsmtFullBath'] + df['BsmtHalfBath'] * 0.25
df['AbvGrdBaths'] = df['FullBath'] + df['HalfBath'] * 0.25
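
As a small worked example of the bath totals, the sketch below applies the same 0.25 half-bath weight to two made-up rows. The constant name `HALF_BATH_WEIGHT` is introduced here for illustration only.

```python
import pandas as pd

HALF_BATH_WEIGHT = 0.25  # same weight as in the cell above; the name is hypothetical

# Illustrative rows: 2 full + 1 half bath, and 1 full + 0 half baths.
toy = pd.DataFrame({"FullBath": [2, 1], "HalfBath": [1, 0]})
toy["AbvGrdBaths"] = toy["FullBath"] + toy["HalfBath"] * HALF_BATH_WEIGHT

print(toy["AbvGrdBaths"].tolist())  # [2.25, 1.0]
```

Keeping the weight in a named constant makes it easy to try alternatives (for instance 0.5) and compare model fit.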

## Collapse Categorical Features
Reduce the number of levels for categorical features with many categories, such as Likert-scale-type features.

In [341]:
df['Collapse_MSSubClass'] = df.MSSubClass.map(data_dict.collapse_mssubclass)
# TODO: ask the team about other candidates: (*Qual, *Cond) -> *QualCond, Year -> Age, IsNew
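
The `OverallQual` example from the introduction (1-3 "bad", 4-6 "normal", 7-10 "good") is not implemented in this notebook. One possible way to collapse it is sketched below with `pd.cut`. The column name `Collapse_OverallQual` is hypothetical and simply follows the `Collapse_MSSubClass` naming.

```python
import pandas as pd

# Illustrative ratings covering each boundary of the 1-10 scale.
toy = pd.DataFrame({"OverallQual": [1, 3, 4, 6, 7, 10]})

# Right-inclusive bins: (0, 3] -> bad, (3, 6] -> normal, (6, 10] -> good.
toy["Collapse_OverallQual"] = pd.cut(
    toy["OverallQual"],
    bins=[0, 3, 6, 10],
    labels=["bad", "normal", "good"],
)

print(toy["Collapse_OverallQual"].astype(str).tolist())
# ['bad', 'bad', 'normal', 'normal', 'good', 'good']
```

`pd.cut` returns an ordered categorical, so the collapsed feature can later be either dummified or mapped back to integers.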

## Housekeeping
Check that every feature has been labeled with its data type.

In [342]:
# Test that we have all features labeled by type of data
features.check_features(df)

86 #columns == 86 #features+PID+SalePrice
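
The implementation of `features.check_features` is not shown. A check of this kind might look like the sketch below, which compares the DataFrame's columns against per-type feature lists. Only `features.categorical` is known to exist (it is used later for dummification); the `ordinal`, `numeric`, and `NON_FEATURES` lists and the function body are assumptions.

```python
import pandas as pd

# Hypothetical feature lists. Only `categorical` is known to exist in
# features.py; `ordinal`, `numeric`, and NON_FEATURES are assumptions.
categorical = ["MSSubClass", "MoSold"]
ordinal = ["GarageQual"]
numeric = ["GrLivArea"]
NON_FEATURES = ["PID", "SalePrice"]


def check_features(df: pd.DataFrame) -> bool:
    """Return True if every column carries exactly one type label."""
    labeled = categorical + ordinal + numeric + NON_FEATURES
    unlabeled = sorted(set(df.columns) - set(labeled))  # columns without a label
    absent = sorted(set(labeled) - set(df.columns))     # labels without a column
    has_duplicates = len(labeled) != len(set(labeled))  # a name listed twice
    print(f"{df.shape[1]} columns, {len(labeled)} labels")
    if unlabeled or absent:
        print(f"unlabeled: {unlabeled}, absent: {absent}")
    return not (unlabeled or absent or has_duplicates)


# One illustrative row; the values are made up.
toy = pd.DataFrame(
    [[1, 100000, "20", "Jan", 3, 856]],
    columns=["PID", "SalePrice", "MSSubClass", "MoSold", "GarageQual", "GrLivArea"],
)
print(check_features(toy))                      # every column is labeled
print(check_features(toy.assign(NumFloors=1)))  # NumFloors has no label yet
```

Comparing sets rather than counts alone also catches the case where one feature is missing and another is listed twice, which a count comparison would not reveal.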


In [343]:
# Manually check that these are the same
features.get_features(df)

| | features | df.columns |
|---|---|---|
| 0 | 1stFlrSF | 1stFlrSF |
| 1 | 2ndFlrSF | 2ndFlrSF |
| 2 | 3SsnPorch | 3SsnPorch |
| 3 | AbvGrdBaths | AbvGrdBaths |
| 4 | Alley | Alley |
| ... | ... | ... |
| 81 | Utilities | Utilities |
| 82 | WoodDeckSF | WoodDeckSF |
| 83 | YearBuilt | YearBuilt |
| 84 | YearRemodAdd | YearRemodAdd |


## Dummify
One-hot encode the categorical features, dropping the first level of each feature so that the dummy columns are not redundant.

In [344]:
dummies = pd.get_dummies(df, columns=features.categorical, prefix=features.categorical, drop_first=True)
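
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame with a single categorical column. The category labels are borrowed from the `SaleCondition` column names in the output; the rows themselves are illustrative.

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame(
    {
        "SaleCondition": ["Normal", "Partial", "Family", "Normal"],
        "GrLivArea": [856, 1049, 1001, 1039],
    }
)

encoded = pd.get_dummies(
    toy, columns=["SaleCondition"], prefix=["SaleCondition"], drop_first=True
)

# Levels are sorted (Family, Normal, Partial) and the first one is dropped,
# so a row with zeros in both dummy columns means "Family".
print(list(encoded.columns))
# ['GrLivArea', 'SaleCondition_Normal', 'SaleCondition_Partial']
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters for linear models with an intercept; tree-based models are generally indifferent to it.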

In [345]:
dummies

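| | PID | GrLivArea | SalePrice | LotFrontage | LotArea | Street | Alley | LotShape | Utilities | LandSlope | ... | SaleType_Oth | SaleType_VWD | SaleType_WD | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | Collapse_MSSubClass_Split | Collapse_MSSubClass_Traditional |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 909176150 | 856 | 126000 | 66.0 | 7890 | 2 | 0 | 4 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 905476230 | 1049 | 139500 | 42.0 | 4235 | 2 | 0 | 4 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 911128020 | 1001 | 124900 | 60.0 | 6060 | 2 | 0 | 4 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 535377150 | 1039 | 114000 | 80.0 | 8146 | 2 | 0 | 4 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 534177230 | 1665 | 227000 | 70.0 | 8400 | 2 | 0 | 4 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2572 | 903205040 | 952 | 121000 | 66.0 | 8854 | 2 | 0 | 4 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2573 | 905402060 | 1733 | 139600 | 74.0 | 13680 | 2 | 0 | 3 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2574 | 909275030 | 2002 | 145000 | 82.0 | 6270 | 2 | 0 | 4 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2575 | 907192040 | 1842 | 217500 | 66.0 | 8826 | 2 | 0 | 4 | 4 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |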


## Save outputs

In [346]:
df.to_csv("../data/engineered.csv")

In [347]:
dummies.to_csv("../data/engineered_encoded.csv")