# Engineer
Here, we engineer our features as follows:
* Convert numeric data to categorical
    * YrSold
    * MoSold
    * MSSubClass
* Convert categorical data with orderings (Likert-scale-type data) into ordinal data
    * E.g., `GarageQual`: NoGarage->0, Po->1, Fa->2, TA->3, Gd->4, Ex->5
* Create new features
    * Derived, e.g., `NumFloors` in a property
    * Indicators (booleans), e.g., `IsPUD`, whether a property is in a PUD
* Collapse the number of categories for categorical features on a case-by-case basis
    * E.g., `OverallQual` has a scale of 1-10: 1-3 are simplified to "bad", 4-6 to "normal", and 7-10 to "good"
* Dummify categorical features

In [602]:
%load_ext autoreload
%autoreload 2



In [603]:
import pandas as pd
import numpy as np
import data_dict
import features

In [604]:
df = pd.read_csv("../data/cleaned.csv")
df.drop('Unnamed: 0', inplace=True, axis=1)

## Convert numerical data to categorical
`YrSold`, `MoSold`, and `MSSubClass` are categorical but are encoded as integers in the dataset.

In [605]:
# NOTE: We treat MoSold and YrSold as categorical because they take only 12 (Jan->Dec) and 5 (2006->2010) distinct values, respectively.
# This might not be appropriate if we had many years of sales data in our dataset
df['YrSold'] = df['YrSold'].astype(str)
df['MoSold'] = df.MoSold.map(data_dict.convert_mosold)
df['MSSubClass'] = df['MSSubClass'].astype(str)

# Because we write out to a csv, we need to make sure that YrSold and MSSubClass are read back as categorical in other files.
# So we prepend a string to these columns
df['YrSold'] = df['YrSold'].apply(lambda x: "Yr_" + x)
df['MSSubClass'] = df['MSSubClass'].apply(lambda x: "Dwelling_" + x)
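
To illustrate why the prefix is needed, here is a minimal sketch on a toy frame (the values are illustrative, not taken from `cleaned.csv`). It shows that casting to `str` alone does not survive a CSV round trip, whereas a non-numeric prefix does.

```python
import io
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
raw = pd.DataFrame({"YrSold": [2006, 2010], "MSSubClass": [20, 120]})

# Casting to str alone: read_csv infers integers again on the way back in.
plain = raw.astype(str)
plain_back = pd.read_csv(io.StringIO(plain.to_csv(index=False)))

# Prepending a non-numeric prefix keeps the columns string-valued.
prefixed = pd.DataFrame({
    "YrSold": "Yr_" + raw["YrSold"].astype(str),
    "MSSubClass": "Dwelling_" + raw["MSSubClass"].astype(str),
})
prefixed_back = pd.read_csv(io.StringIO(prefixed.to_csv(index=False)))

print(plain_back.dtypes)     # numeric columns again
print(prefixed_back.dtypes)  # string-valued columns
```

An alternative would be to pass an explicit `dtype` to `pd.read_csv` in every downstream file; the prefix avoids having to remember that.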

## Convert categorical to ordinal

In [606]:
# Overall features
# NO OP: These are already ordinalized

In [607]:
# Exterior features
df['ExterQual'] = df.ExterQual.map(data_dict.convert_exterqual)
df['ExterCond'] = df.ExterCond.map(data_dict.convert_extercond)

In [608]:
# Basement features
df['BsmtQual'] = df.BsmtQual.map(data_dict.convert_bsmtqual)
df['BsmtCond'] = df.BsmtCond.map(data_dict.convert_bsmtcond)
df['BsmtExposure'] = df.BsmtExposure.map(data_dict.convert_bsmtexposure)
df['BsmtFinType1'] = df.BsmtFinType1.map(data_dict.convert_bsmtfintype)
df['BsmtFinType2'] = df.BsmtFinType2.map(data_dict.convert_bsmtfintype)

In [609]:
# Home Interior features
df['Functional'] = df.Functional.map(data_dict.convert_functional)
df['FireplaceQu'] = df.FireplaceQu.map(data_dict.convert_fireplacequ)
df['HeatingQC'] = df.HeatingQC.map(data_dict.convert_heatingqc)
df['KitchenQual'] = df.KitchenQual.map(data_dict.convert_kitchenqual)

In [610]:
# Land features
df['LandSlope'] = df.LandSlope.map(data_dict.convert_landslope)
df['LotShape'] = df.LotShape.map(data_dict.convert_lotshape)

In [611]:
# Garage features
df['GarageCond'] = df.GarageCond.map(data_dict.convert_garagecond)
df['GarageQual'] = df.GarageQual.map(data_dict.convert_garagequal)

In [612]:
# Road features
df['Street'] = df.Street.map(data_dict.convert_street)
df['PavedDrive'] = df.PavedDrive.map(data_dict.convert_paveddrive)
df['Alley'] = df.Alley.map(data_dict.convert_alley)

In [613]:
# Other features
df['Utilities'] = df.Utilities.map(data_dict.convert_utilities)
df['PoolQC'] = df.PoolQC.map(data_dict.convert_poolqc)
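
The mappings themselves live in `data_dict` and are not shown here. As a rough sketch of the pattern, assuming a plain dictionary that follows the `GarageQual` example from the introduction (the name `garagequal_scale` is hypothetical, and the real `convert_garagequal` may differ), `Series.map` turns labels into ordinal codes and leaves any label missing from the mapping as `NaN`:

```python
import pandas as pd

# Hypothetical mapping, following the GarageQual example in the introduction.
# The real mappings in data_dict may be defined differently.
garagequal_scale = {"NoGarage": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

toy = pd.Series(["TA", "NoGarage", "Ex", "Typo"], name="GarageQual")
ordinal = toy.map(garagequal_scale)

# Labels missing from the mapping become NaN, so list them before modelling.
unmapped = toy[ordinal.isna()].unique().tolist()
print(ordinal.tolist())
print(unmapped)
```

Checking for unmapped labels after each conversion is a cheap way to catch misspelled or unexpected categories.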

## New Features
These are new columns derived from existing data. Creating them does not necessarily mean dropping the source columns.

In [614]:
# See whether a property is in a PUD from the dwelling type
df['IsPUD'] = df.MSSubClass.map(data_dict.get_pud_indicator)

In [615]:
# TODO LotShape isRegular

In [616]:
# Get number of floors for the property from the dwelling type
df['NumFloors'] = df.MSSubClass.map(data_dict.get_num_floors)
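
As an illustration of how an indicator can be derived from the prefixed dwelling type, here is a sketch of a PUD lookup. The function name `get_pud_indicator_sketch` is hypothetical, and the set of PUD codes (120, 150, 160, 180) is an assumption based on the Ames data dictionary; the actual helper in `data_dict` may be implemented differently.

```python
import pandas as pd

# Assumption: PUD dwelling codes as listed in the Ames data dictionary.
PUD_CODES = {"120", "150", "160", "180"}

def get_pud_indicator_sketch(dwelling: str) -> int:
    """Return 1 if a prefixed label such as 'Dwelling_120' denotes a PUD, else 0."""
    code = dwelling.split("_")[-1]  # strip the "Dwelling_" prefix
    return int(code in PUD_CODES)

toy = pd.DataFrame({"MSSubClass": ["Dwelling_20", "Dwelling_120", "Dwelling_160"]})
toy["IsPUD"] = toy["MSSubClass"].map(get_pud_indicator_sketch)
print(toy)
```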

## Collapse/Combine Features
* Collapse: Use smaller scales for Likert scale-type categorical features (and drop the larger scale feature)
* Combine: Convert multiple features into a new feature (and drop the others)

Use smaller scales. Features are prefixed with `Collapse_`.

In [617]:
df['Collapse_MSSubClass'] = df.MSSubClass.map(data_dict.collapse_mssubclass)
df.drop(['MSSubClass'], inplace=True, axis=1)
# TODO ask team others: (*Qual, *Cond)->*QualCond, Year->Age, IsNew,
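
The `OverallQual` example from the introduction (1-3 "bad", 4-6 "normal", 7-10 "good") is not applied in this notebook, but it could be sketched with `pd.cut` as below. The column name `Collapse_OverallQual` is hypothetical and simply follows the prefix convention.

```python
import pandas as pd

toy = pd.DataFrame({"OverallQual": [1, 3, 4, 6, 7, 10]})

# Bins are right-inclusive: (0, 3] -> bad, (3, 6] -> normal, (6, 10] -> good.
toy["Collapse_OverallQual"] = pd.cut(
    toy["OverallQual"], bins=[0, 3, 6, 10], labels=["bad", "normal", "good"]
)
print(toy)
```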

Combine multiple features. Features are prefixed with `Combine_`.

In [618]:
# Bathrooms
df['Combine_BathroomsBsmt'] = df['BsmtFullBath'] + df['BsmtHalfBath']*.5
df['Combine_BathroomsAbvGrd'] = df['FullBath'] + df['HalfBath']*0.5
df.drop(['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath'], inplace=True, axis=1)

In [619]:
# Age (since remodeling, if property was remodeled)
df['tempYrSold'] = df['YrSold'].apply(lambda y: (y[3:])).astype(int)
df['tempWasRenovatedAfterBuilding'] = df.apply(lambda x: x['YearRemodAdd'] > x['YearBuilt'], axis=1)
df['Combine_Age'] = df.apply(lambda x: (x['tempYrSold'] - x['YearRemodAdd']) if x['tempWasRenovatedAfterBuilding'] else (x['tempYrSold'] - x['YearBuilt']),
                             axis=1)
# Check
df[['Combine_Age', 'tempYrSold', 'YearBuilt', 'YearRemodAdd']]

Unnamed: 0,Combine_Age,tempYrSold,YearBuilt,YearRemodAdd
0,60,2010,1939,1950
1,25,2009,1984,1984
2,0,2007,1930,2007
3,6,2009,1900,2003
4,8,2009,2001,2001
...,...,...,...,...
2572,59,2009,1916,1950
2573,54,2009,1955,1955
2574,57,2007,1949,1950
2575,7,2007,2000,2000


In [620]:
df.drop(['tempYrSold', 'tempWasRenovatedAfterBuilding'], inplace=True, axis=1)
df.drop(['YearBuilt', 'YearRemodAdd'], inplace=True, axis=1)
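
The row-wise `apply` calls used for `Combine_Age` can also be written without temporary columns. The following is a sketch of a vectorized equivalent using `np.where`, run on the first five rows of the check table above; it is an alternative formulation, not the code the notebook ran.

```python
import numpy as np
import pandas as pd

# First five rows of the check table above, with YrSold in its prefixed form.
toy = pd.DataFrame({
    "YrSold": ["Yr_2010", "Yr_2009", "Yr_2007", "Yr_2009", "Yr_2009"],
    "YearBuilt": [1939, 1984, 1930, 1900, 2001],
    "YearRemodAdd": [1950, 1984, 2007, 2003, 2001],
})

yr_sold = toy["YrSold"].str[3:].astype(int)  # strip the "Yr_" prefix
was_remodeled = toy["YearRemodAdd"] > toy["YearBuilt"]

# Age since remodeling if the property was remodeled, else age since construction.
toy["Combine_Age"] = np.where(
    was_remodeled, yr_sold - toy["YearRemodAdd"], yr_sold - toy["YearBuilt"]
)
print(toy["Combine_Age"].tolist())
```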

## Drop Other Features
Drop features that EDA has identified as
* causes of multicollinearity
* having low explanatory power (statistically insignificant, i.e., high p-values)

In [621]:
# TODO drop features
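
The features to drop are not decided yet. As one possible way to screen for multicollinearity, the sketch below flags pairs of numeric columns whose absolute correlation exceeds a threshold. The synthetic data and the 0.9 cutoff are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.9  # NaN (masked) entries compare as False
]
print(high_pairs)
```

Pairwise correlation only catches two-way redundancy; dependence among three or more features needs a different check.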

## Add Log Target (SalePrice)
From EDA, we found that the distributions of features are skewed when plotted against `SalePrice`, so we also add a log-transformed target, `LogSalePrice`.

In [622]:
df['LogSalePrice'] = np.log(df.SalePrice)
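
A model fit on `LogSalePrice` predicts on the log scale, so predictions need to be mapped back with `np.exp`. A minimal sketch of the round trip, using three `SalePrice` values that appear in this dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"SalePrice": [126000, 139500, 124900]})
toy["LogSalePrice"] = np.log(toy["SalePrice"])

# np.exp inverts np.log, returning values on the original dollar scale.
recovered = np.exp(toy["LogSalePrice"])
print(recovered.round(2).tolist())
```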

## Housekeeping
Check that we organized every feature

In [623]:
# Test that we have all features labeled by type of data
features.check_features(df)

81 #columns == 80 #features+PID+SalePrice


In [624]:
# Manually check that these are the same
features.check_features2(df)

Unnamed: 0,df.columns,features
0,1stFlrSF,1stFlrSF
1,2ndFlrSF,2ndFlrSF
2,3SsnPorch,3SsnPorch
3,Alley,AbvGrdBaths
4,BedroomAbvGr,Alley
...,...,...
76,TotRmsAbvGrd,TotalBsmtSF
77,TotalBsmtSF,Utilities
78,Utilities,WoodDeckSF
79,WoodDeckSF,YrSold


## Dummify
Dummify the categorical features

In [625]:
dummies = pd.get_dummies(df, columns = features.get_categorical_features(),
                         prefix = features.get_categorical_features(), drop_first=True)
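
To show what `drop_first=True` does, here is a small sketch on a toy `MoSold` column. The first level in sorted order is dropped and becomes the baseline, so a feature with k levels yields k-1 dummy columns.

```python
import pandas as pd

toy = pd.DataFrame({"MoSold": ["March", "February", "April", "March"]})

# Levels sort as April, February, March; April is dropped as the baseline.
encoded = pd.get_dummies(toy, columns=["MoSold"], drop_first=True)
print(encoded.columns.tolist())
```

A row belonging to the baseline level (April here) is encoded as all zeros across the remaining dummy columns.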

In [626]:
dummies

Unnamed: 0,PID,GrLivArea,SalePrice,LotFrontage,LotArea,Street,Alley,LotShape,Utilities,LandSlope,...,MoSold_December,MoSold_February,MoSold_January,MoSold_July,MoSold_June,MoSold_March,MoSold_May,MoSold_November,MoSold_October,MoSold_September
0,909176150,856,126000,66.0,7890,2,0,4,4,3,...,0,0,0,0,0,1,0,0,0,0
1,905476230,1049,139500,42.0,4235,2,0,4,4,3,...,0,1,0,0,0,0,0,0,0,0
2,911128020,1001,124900,60.0,6060,2,0,4,4,3,...,0,0,0,0,0,0,0,1,0,0
3,535377150,1039,114000,80.0,8146,2,0,4,4,3,...,0,0,0,0,0,0,1,0,0,0
4,534177230,1665,227000,70.0,8400,2,0,4,4,3,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2572,903205040,952,121000,66.0,8854,2,0,4,4,3,...,0,0,0,0,0,0,1,0,0,0
2573,905402060,1733,139600,74.0,13680,2,0,3,4,3,...,0,0,0,0,1,0,0,0,0,0
2574,909275030,2002,145000,82.0,6270,2,0,4,4,3,...,0,0,0,0,0,0,0,0,0,0
2575,907192040,1842,217500,66.0,8826,2,0,4,4,3,...,0,0,0,1,0,0,0,0,0,0


Save outputs

In [627]:
df.to_csv("../data/engineered.csv")

In [628]:
dummies.to_csv("../data/engineered_encoded.csv")
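
Because `to_csv` writes the index by default, files saved this way gain an `Unnamed: 0` column when read back, which is why the first cell of this notebook drops that column. A sketch of the behavior on a toy frame, and of reading it back with `index_col=0` instead:

```python
import io
import pandas as pd

toy = pd.DataFrame({"PID": [909176150, 905476230], "SalePrice": [126000, 139500]})

# Default to_csv writes the index as an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Reading with index_col=0 absorbs that column back into the index.
restored = pd.read_csv(io.StringIO(toy.to_csv()), index_col=0)

print(with_index.columns.tolist())
print(restored.columns.tolist())
```

Passing `index=False` to `to_csv` would avoid the extra column altogether, but downstream files that already drop `Unnamed: 0` would then need to change.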