# Feature Extraction
In this step, we create new features to make the ML model more effective. First, let's look at the existing features:
* SalePrice: The property's sale price in dollars. This is the target variable to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

The goal is to improve the model's score. A new feature can also lower the score, so each one is checked individually: add the feature, re-score the model, and drop the feature if the score does not improve.
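
The one-by-one check can be sketched as a small keep-or-drop helper. This is a minimal illustration rather than the notebook's own code: `try_feature`, `score_fn`, and `toy_score` are hypothetical names, and a plain dict of column lists stands in for the DataFrame.

```python
from typing import Callable, Dict, List, Tuple

Columns = Dict[str, List[float]]


def try_feature(columns: Columns, name: str, values: List[float],
                score_fn: Callable[[Columns], float],
                baseline: float) -> Tuple[bool, float]:
    """Add a candidate column and keep it only if the score improves.

    Returns (kept, score); score is the baseline when the feature is dropped.
    """
    columns[name] = values
    new_score = score_fn(columns)
    if new_score > baseline:
        return True, new_score
    del columns[name]  # roll back the candidate feature
    return False, baseline


# Hypothetical scorer for demonstration only: it rewards a column named
# "useful" and penalizes a column named "noise".
def toy_score(columns: Columns) -> float:
    bonus = 0.05 if "useful" in columns else 0.0
    penalty = 0.01 if "noise" in columns else 0.0
    return 0.80 + bonus - penalty


cols: Columns = {"LotArea": [8450.0, 9600.0]}
kept_noise, score_1 = try_feature(cols, "noise", [1.0, 2.0], toy_score, 0.80)
kept_useful, score_2 = try_feature(cols, "useful", [3.0, 4.0], toy_score, score_1)
```

Note that this sketch treats a tie as "no improvement", whereas `compare_score` in this notebook reports a tie as a success.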

In [1]:
from _utils._sklearn_models import *

In [2]:
def get_score(df):
    X, y = prepare_dataframe(df, 'SalePrice')

    test_model = LinearRegressionSklearn(X, y)
    test_model.init_model()
    
    score = test_model.get_scores()
    
    return score[score[0] == 'r2'][1].values[0] # r2 score

In [3]:
def compare_score(first_score=None, second_score=None):
    # Scores may arrive as strings, so compare them numerically.
    return "Success" if float(second_score) >= float(first_score) else "Failed"
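
The scores printed in this notebook appear in quotes, which suggests that `get_score` returns them as strings. Assuming that is the case, the sketch below shows how a string comparison can disagree with a numeric one. The two values are made up for illustration.

```python
# Made-up scores, held as strings the way the notebook prints them.
first, second = "-0.5", "-0.2"

# Lexicographic comparison: "-0.5" sorts after "-0.2" because "5" > "2".
string_winner = max([first, second])

# Numeric comparison: -0.2 is the larger (better) R² value.
numeric_winner = max([first, second], key=float)
```

Converting with `float()` before comparing avoids this kind of mismatch.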

In [4]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

In [5]:
df = pd.read_parquet('datasets/encode/encoded.parquet')

In [6]:
df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60.0,RL,65.0,8450.0,0,Reg,Lvl,0,Inside,Gtl,CollgCr,Norm,0,1Fam,2Story,7.0,5.0,2003.0,2003.0,Gable,0,VinylSd,VinylSd,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706.0,Unf,0.0,150.0,856.0,0,Ex,1,SBrkr,856.0,854.0,0.0,1710.0,1.0,0.0,2.0,1.0,3.0,1.0,Gd,8.0,Typ,0.0,Attchd,2003.0,RFn,2.0,548.0,TA,TA,Y,0.0,61.0,0.0,0.0,0.0,0.0,0.0,2.0,2008.0,WD,Normal,208500.0
1,20.0,RL,80.0,9600.0,0,Reg,Lvl,0,FR2,Gtl,Rare,Feedr,0,1Fam,1Story,6.0,8.0,1976.0,1976.0,Gable,0,MetalSd,MetalSd,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978.0,Unf,0.0,284.0,1262.0,0,Ex,1,SBrkr,1262.0,0.0,0.0,1262.0,0.0,1.0,2.0,0.0,3.0,1.0,TA,6.0,Typ,1.0,Attchd,1976.0,RFn,2.0,460.0,TA,TA,Y,298.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,WD,Normal,181500.0
2,60.0,RL,68.0,11250.0,0,IR1,Lvl,0,Inside,Gtl,CollgCr,Norm,0,1Fam,2Story,7.0,5.0,2001.0,2002.0,Gable,0,VinylSd,VinylSd,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486.0,Unf,0.0,434.0,920.0,0,Ex,1,SBrkr,920.0,866.0,0.0,1786.0,1.0,0.0,2.0,1.0,3.0,1.0,Gd,6.0,Typ,1.0,Attchd,2001.0,RFn,2.0,608.0,TA,TA,Y,0.0,42.0,0.0,0.0,0.0,0.0,0.0,9.0,2008.0,WD,Normal,223500.0
3,70.0,RL,60.0,9550.0,0,IR1,Lvl,0,Corner,Gtl,Crawfor,Norm,0,1Fam,2Story,7.0,5.0,1915.0,1970.0,Gable,0,Wd Sdng,Wd Shng,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216.0,Unf,0.0,540.0,756.0,0,Gd,1,SBrkr,961.0,756.0,0.0,1717.0,1.0,0.0,1.0,0.0,3.0,1.0,Gd,7.0,Typ,1.0,Detchd,1998.0,Unf,3.0,642.0,TA,TA,Y,0.0,35.0,272.0,0.0,0.0,0.0,0.0,2.0,2006.0,WD,Abnorml,140000.0
4,60.0,RL,84.0,14260.0,0,IR1,Lvl,0,FR2,Gtl,NoRidge,Norm,0,1Fam,2Story,8.0,5.0,2000.0,2000.0,Gable,0,VinylSd,VinylSd,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655.0,Unf,0.0,490.0,1145.0,0,Ex,1,SBrkr,1145.0,1053.0,0.0,2198.0,1.0,0.0,2.0,1.0,4.0,1.0,Gd,9.0,Typ,1.0,Attchd,2000.0,RFn,3.0,836.0,TA,TA,Y,192.0,84.0,0.0,0.0,0.0,0.0,0.0,12.0,2008.0,WD,Normal,250000.0


In [7]:
df.shape

(1427, 74)

In [8]:
score = get_score(df)

In [9]:
score

'0.8324483504916538'
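
The value above is an R² score. How the `_utils` helper computes it is not shown here, so the following is only a sketch of the textbook definition, R² = 1 - SS_res / SS_tot, with the hypothetical name `r2_score_manual` and toy numbers.

```python
from typing import Sequence


def r2_score_manual(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy values, not taken from the housing data.
example_r2 = r2_score_manual([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
perfect_r2 = r2_score_manual([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

A score of 1.0 means perfect predictions, and a model that always predicts the mean scores 0.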

## Creating Features

### Home Age Feature

In [10]:
dataset_year = 2016

In [11]:
df['YearBuilt'].head()

0    2003.0
1    1976.0
2    2001.0
3    1915.0
4    2000.0
Name: YearBuilt, dtype: float64

In [12]:
df['HomeAge'] = dataset_year - df['YearBuilt']

In [16]:
compare_score(score, second_score=get_score(df))

'Failed'

In [17]:
df.drop('HomeAge', axis=1, inplace=True)
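
One plausible reason `HomeAge` does not help, assuming `LinearRegressionSklearn` wraps an ordinary linear regression: `HomeAge` is an exact linear function of `YearBuilt`, so it carries no information the model does not already have. The sketch below checks this on the five `YearBuilt` values printed above. `pearson` is a helper written only for this illustration.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


dataset_year = 2016
year_built = [2003.0, 1976.0, 2001.0, 1915.0, 2000.0]  # first five rows
home_age = [dataset_year - y for y in year_built]

# The correlation comes out as -1 (up to floating-point rounding).
corr = pearson(year_built, home_age)
```

A correlation of -1 means the new column is redundant for a linear model. Features built from non-linear combinations, such as ratios or categories, have a better chance of adding signal.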

### Total Square Footage

In [61]:
df['TotalSF'] = df['GrLivArea'] + df['TotalBsmtSF']

In [19]:
compare_score(score, second_score=get_score(df))

'Failed'

In [20]:
df.drop('TotalSF', axis=1, inplace=True)

### Seasonality

In [21]:
df['MoSold'].head()

0     2.0
1     5.0
2     9.0
3     2.0
4    12.0
Name: MoSold, dtype: float64

In [22]:
season = {
    'Winter': [12, 1, 2],
    'Spring': [3, 4, 5], 
    'Summer':[6, 7, 8], 
    'Fall':[9, 10, 11]
}

In [23]:
def select_season(month, season: dict=season):
    for s in season:
        if month in season.get(s):
            return s

In [24]:
select_season(11)

'Fall'
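
As an alternative to scanning the season lists on every call, the mapping can be inverted once into a month-to-season lookup. This is a sketch; `month_to_season` is a hypothetical name, and the `season` dict is repeated so the block runs on its own.

```python
season = {
    'Winter': [12, 1, 2],
    'Spring': [3, 4, 5],
    'Summer': [6, 7, 8],
    'Fall': [9, 10, 11],
}

# Invert the mapping once: month number -> season name.
month_to_season = {
    month: name for name, months in season.items() for month in months
}

# Float months such as 12.0 hash and compare equal to the int 12,
# so the lookup also works on float values like those stored in MoSold.
example = [month_to_season.get(m) for m in [2.0, 5.0, 9.0, 2.0, 12.0]]
```

In pandas, a dict like this is typically applied with `Series.map`. Casting `MoSold` to an integer type first would be the cautious choice.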

In [25]:
df['Season'] = df['MoSold'].apply(lambda x: select_season(x))

In [26]:
df[['MoSold', 'Season']].head()

|   | MoSold | Season |
|---|--------|--------|
| 0 | 2.0    | Winter |
| 1 | 5.0    | Spring |
| 2 | 9.0    | Fall   |
| 3 | 2.0    | Winter |
| 4 | 12.0   | Winter |


In [27]:
compare_score(first_score=score, second_score=get_score(df))

'Success'

### Lot Ratio

In [34]:
df['LotRatio'] = df['LotArea'] / df['LotFrontage']
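
A ratio feature is undefined when the denominator is zero or missing. Whether `LotFrontage` contains such values in the encoded dataset is not shown, so the guard below is only a precaution. `lot_ratio` is a hypothetical per-row helper.

```python
import math
from typing import Optional


def lot_ratio(lot_area: float, lot_frontage: Optional[float]) -> float:
    """Lot area per foot of street frontage; NaN if frontage is missing or zero."""
    if lot_frontage is None or math.isnan(lot_frontage) or lot_frontage == 0:
        return math.nan
    return lot_area / lot_frontage


first_row = lot_ratio(8450.0, 65.0)   # values from row 0 of df.head()
no_frontage = lot_ratio(9600.0, 0.0)  # hypothetical row with zero frontage
```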

In [35]:
compare_score(first_score=score, second_score=get_score(df))

'Failed'

In [36]:
score = get_score(df)
score

'0.8382459700143732'

### Quality Score

In [38]:
df[['OverallQual', 'OverallCond']].head()

|   | OverallQual | OverallCond |
|---|-------------|-------------|
| 0 | 7.0         | 5.0         |
| 1 | 6.0         | 8.0         |
| 2 | 7.0         | 5.0         |
| 3 | 7.0         | 5.0         |
| 4 | 8.0         | 5.0         |


In [39]:
df['OverallQual_scaled'] = (df['OverallQual'] - df['OverallQual'].min()) / (df['OverallQual'].max() - df['OverallQual'].min())

In [40]:
df['OverallCond_scaled'] = (df['OverallCond'] - df['OverallCond'].min()) / (df['OverallCond'].max() - df['OverallCond'].min())

In [41]:
df['QualityScore'] = ((df['OverallQual_scaled'] + df['OverallCond_scaled']) / 2) * 10

In [42]:
df[['OverallQual', 'OverallCond', 'QualityScore']].head()

|   | OverallQual | OverallCond | QualityScore |
|---|-------------|-------------|--------------|
| 0 | 7.0         | 5.0         | 5.833333     |
| 1 | 6.0         | 8.0         | 7.152778     |
| 2 | 7.0         | 5.0         | 5.833333     |
| 3 | 7.0         | 5.0         | 5.833333     |
| 4 | 8.0         | 5.0         | 6.388889     |
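
The `QualityScore` values above can be reproduced by hand. The sketch below assumes `OverallQual` spans 1 to 10 and `OverallCond` spans 1 to 9 in this dataset. These ranges are inferred from the printed scores, not read from the data, and `min_max` and `quality_score` are hypothetical helper names.

```python
def min_max(value: float, lo: float, hi: float) -> float:
    """Scale value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def quality_score(qual: float, cond: float,
                  qual_range=(1.0, 10.0), cond_range=(1.0, 9.0)) -> float:
    """Mean of the two scaled ratings, expressed on a 0-10 scale.

    The default ranges are assumptions inferred from the printed output.
    """
    scaled = (min_max(qual, *qual_range) + min_max(cond, *cond_range)) / 2
    return scaled * 10


row0 = round(quality_score(7.0, 5.0), 6)  # OverallQual=7, OverallCond=5
row1 = round(quality_score(6.0, 8.0), 6)  # OverallQual=6, OverallCond=8
```

For row 0, the scaled values are 6/9 and 4/8, and their mean times 10 gives about 5.83, matching the table.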


In [43]:
df.drop(['OverallQual_scaled', 'OverallCond_scaled'], axis=1, inplace=True)

In [46]:
compare_score(first_score=score, second_score=get_score(df))

'Failed'

In [47]:
df.drop(['QualityScore'], axis=1, inplace=True)

### Porch Area

In [48]:
df[['OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch']].head()

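|   | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch |
|---|-------------|---------------|-----------|-------------|
| 0 | 61.0        | 0.0           | 0.0       | 0.0         |
| 1 | 0.0         | 0.0           | 0.0       | 0.0         |
| 2 | 42.0        | 0.0           | 0.0       | 0.0         |
| 3 | 35.0        | 272.0         | 0.0       | 0.0         |
| 4 | 84.0        | 0.0           | 0.0       | 0.0         |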


In [49]:
df['PorchSF'] = df['OpenPorchSF'] + df['EnclosedPorch'] + df['3SsnPorch'] + df['ScreenPorch']

In [51]:
compare_score(first_score=score, second_score=get_score(df))

'Failed'

In [52]:
df.drop(['PorchSF'], axis=1, inplace=True)

### Total Bathrooms

In [53]:
df[['FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath']].head()

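|   | FullBath | HalfBath | BsmtFullBath | BsmtHalfBath |
|---|----------|----------|--------------|--------------|
| 0 | 2.0      | 1.0      | 1.0          | 0.0          |
| 1 | 2.0      | 0.0      | 0.0          | 1.0          |
| 2 | 2.0      | 1.0      | 1.0          | 0.0          |
| 3 | 1.0      | 0.0      | 1.0          | 0.0          |
| 4 | 2.0      | 1.0      | 1.0          | 0.0          |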


In [54]:
df['TotalBath'] = df['FullBath'] + df['HalfBath'] + df['BsmtFullBath'] + df['BsmtHalfBath']
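
The sum above counts a half bath the same as a full bath. A variation that could be tried, which this notebook does not use, is to weight half baths by 0.5. `total_bath` and `half_weight` are hypothetical names, and whether the weighting helps this model is untested.

```python
def total_bath(full: float, half: float, bsmt_full: float, bsmt_half: float,
               half_weight: float = 1.0) -> float:
    """Bathroom count; half baths are multiplied by half_weight."""
    return full + bsmt_full + half_weight * (half + bsmt_half)


# Row 0 of the table above: FullBath=2, HalfBath=1, BsmtFullBath=1, BsmtHalfBath=0.
plain_sum = total_bath(2.0, 1.0, 1.0, 0.0)                   # as computed in the notebook
weighted = total_bath(2.0, 1.0, 1.0, 0.0, half_weight=0.5)   # hypothetical variation
```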

In [58]:
compare_score(first_score=score, second_score=get_score(df))

'Failed'

In [59]:
df.drop('TotalBath', axis=1, inplace=True)

### Lot Coverage

In [62]:
df['TotalSF'] = df['GrLivArea'] + df['TotalBsmtSF']  # recreate: TotalSF was dropped earlier
df['LotCoverage'] = df['TotalSF'] / df['LotArea']

In [66]:
compare_score(first_score=score, second_score=get_score(df))

'Failed'

In [67]:
df.drop(['LotCoverage', 'TotalSF'] , axis=1, inplace=True)

## Conclusion

In [68]:
get_score(df)

'0.8382459700143732'

In [69]:
df.to_parquet('datasets/featured/featured.parquet')

<div class="alert alert-block alert-info"> <b>Final:</b> With the new features, the R² score increased from <b>0.8324</b> to <b>0.8382</b>. </div>