
# **Feature Engineering**

This notebook covers the following feature engineering steps:

1. Missing values
2. Temporal variables
3. Categorical variables: remove rare labels
4. Standardise the variables so that they share a common scale

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns', None)


In [2]:
df=pd.read_csv('https://raw.githubusercontent.com/saikrishna232/Advanced-House-Price-Prediction/main/Data.csv',index_col=0)

# **Handling Missing Values**

**Missing Values for Categorical Features**

In [24]:
#Replacing the missing values with a new label
Cat_Nan=[i for i in df.columns if df[i].dtypes=='O' and df[i].isnull().sum()>0]
for i in Cat_Nan:
  df[i]=np.where(df[i].isnull(),'Missing',df[i])
df[Cat_Nan].isnull().sum()

Series([], dtype: float64)
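
As a minimal sketch of the same idea on a made-up frame (the column names are borrowed from the dataset, the values are invented), the replacement can also be written with `fillna`. Only object-typed columns with gaps are touched, so the numeric column keeps its `NaN`:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Alley": pd.Series(["Grvl", np.nan, "Pave", np.nan], dtype=object),
    "LotArea": [8450, 9600, np.nan, 9550],
})

# Object-typed columns that contain at least one missing value
cat_nan = [c for c in toy.columns if toy[c].dtype == "O" and toy[c].isnull().any()]

# Replace NaN with an explicit 'Missing' label
toy[cat_nan] = toy[cat_nan].fillna("Missing")

# Number of gaps left in the categorical columns
missing_left = int(toy[cat_nan].isnull().sum().sum())
```

Treating missingness as its own label keeps every row and lets a model learn whether the absence of a value carries information.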

**Numerical Features**

In [22]:
Num_Nan=[i for i in df.columns if df[i].dtypes!='O' and df[i].isnull().sum()>0]
for i in Num_Nan:
  print(i,np.round(df[i].isnull().mean()*100,3),'% missing values')

LotFrontage 17.74 % missing values
MasVnrArea 0.548 % missing values
GarageYrBlt 5.548 % missing values


In [35]:
#Replacing the missing values with the median, since these features contain outliers.
for i in Num_Nan:
  median_val=df[i].median()
  df[i]=df[i].fillna(median_val)
df[Num_Nan].isnull().sum()

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64
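
The choice of the median over the mean can be illustrated with a small sketch. The values below are invented: four typical frontages, one extreme outlier, and one gap to fill.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only
frontage = pd.Series([60.0, 65.0, 70.0, 75.0, 300.0, np.nan], name="LotFrontage")

mean_val = frontage.mean()      # pulled upward by the single outlier
median_val = frontage.median()  # stays close to the typical values

# Fill the gap with the median and keep the result in a new Series
filled = frontage.fillna(median_val)
```

Here the mean is 114.0 while the median is 70.0, so the imputed value stays inside the range where most observations lie.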

**Temporal Features**
> Converting the year columns into ages (years elapsed at the time of sale) by subtracting them from `YrSold`




In [38]:
temporal_cols=[i for i in df.columns if 'Yr' in i or 'Year' in i]
temporal_cols.remove('YrSold')
temporal_cols

['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']

In [39]:
for i in temporal_cols:
  df[i]=df['YrSold']-df[i]
df[temporal_cols]

Id,YearBuilt,YearRemodAdd,GarageYrBlt
1,5,5,5.0
2,31,31,31.0
3,7,6,7.0
4,91,36,8.0
5,8,8,8.0
...,...,...,...
1456,8,7,8.0
1457,32,22,32.0
1458,69,4,69.0
1459,60,14,60.0
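
The conversion can be sketched without a loop on a small frame. The rows below are illustrative values, and `year_cols` and `ages` are hypothetical names introduced for this example:

```python
import pandas as pd

# Illustrative rows; column names follow the dataset
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YearRemodAdd": [2003, 1976, 1970],
    "YrSold": [2008, 2007, 2006],
})

year_cols = ["YearBuilt", "YearRemodAdd"]

# rsub computes YrSold - column for every row (right-hand subtraction)
ages = toy[year_cols].rsub(toy["YrSold"], axis=0)

# Sanity check: a negative age would mean a year recorded after the sale
has_negative_age = bool((ages < 0).any().any())
```

A check like `has_negative_age` is a cheap way to spot rows where a build or remodel year was recorded after the sale year.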


**Continuous Numerical Variables**

Since some skewness was observed in the distributions of the continuous variables, a log transformation is applied to them. Columns that contain zeros are excluded because the logarithm of 0 is undefined.


In [42]:
Cont_Cols=[i for i in df.columns if 0 not in df[i].unique() and df[i].dtypes!='O' and len(df[i].unique())>25]
Cont_Cols

['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']

In [43]:
for i in Cont_Cols:
  df[i]=np.log(df[i])
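
A rough sketch of what the transformation does, using invented right-skewed prices: the skewness drops after taking logs, and `np.exp` maps log-scale values back to the original units. This matters because `SalePrice` itself is log-transformed here, so predictions from a model trained on this data would be on the log scale.

```python
import numpy as np
import pandas as pd

# Invented right-skewed prices for illustration only
price = pd.Series([100_000, 120_000, 130_000, 150_000, 755_000], dtype=float)

log_price = np.log(price)

# Sample skewness before and after the transformation
skew_before = price.skew()
skew_after = log_price.skew()

# np.exp undoes np.log, recovering the original units
recovered = np.exp(log_price)
```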

**Handling Rare Categorical Features**
> Category labels that appear in 1% or less of the observations are grouped under a single `Rare` label.



In [45]:
Cat_Cols=[i for i in df.columns if df[i].dtypes=='O']

In [54]:
for i in Cat_Cols:
  temp=df.groupby(i)['SalePrice'].count()/len(df)
  temp_df=temp[temp>0.01].index
  df[i]=np.where(df[i].isin(temp_df),df[i],'Rare')

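
The grouping can be sketched on a single invented column. This version uses `value_counts(normalize=True)` for the label shares instead of a `groupby` on `SalePrice`; the 1% threshold is the same.

```python
import numpy as np
import pandas as pd

# Invented data: 200 rows, 'Stone' appears once (0.5% of rows)
toy = pd.DataFrame({"Foundation": ["PConc"] * 120 + ["CBlock"] * 79 + ["Stone"]})

# Share of rows per label
freq = toy["Foundation"].value_counts(normalize=True)

# Labels covering more than 1% of rows are kept; the rest become 'Rare'
frequent = freq[freq > 0.01].index
toy["Foundation"] = np.where(toy["Foundation"].isin(frequent), toy["Foundation"], "Rare")

counts = toy["Foundation"].value_counts().to_dict()
```

Pooling sparse labels this way avoids learning an encoding from only a handful of rows.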

# **Feature Encoding**

In [63]:
#Ordinal encoding: labels are ranked by their mean SalePrice (lowest mean -> 0)
for feature in Cat_Cols:
    labels_ordered=df.groupby([feature])['SalePrice'].mean().sort_values().index
    labels_ordered={k:i for i,k in enumerate(labels_ordered,0)}
    df[feature]=df[feature].map(labels_ordered)
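
To make the ordering concrete, here is a small sketch with invented values. `ordered` and `mapping` are hypothetical names for the two intermediate steps of the loop above.

```python
import pandas as pd

# Invented data; SalePrice is assumed to be on the log scale already
toy = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex", "TA", "Gd", "Ex"],
    "SalePrice": [11.5, 12.0, 12.8, 11.7, 12.2, 13.0],
})

# Labels sorted by mean target, lowest first
ordered = toy.groupby("ExterQual")["SalePrice"].mean().sort_values().index

# Lowest-mean label -> 0, next -> 1, and so on
mapping = {label: rank for rank, label in enumerate(ordered)}

toy["ExterQual"] = toy["ExterQual"].map(mapping)
```

Note that `map` returns `NaN` for any label missing from `mapping`, so a label that appears only in new data would need separate handling.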

# **Feature Scaling**

In [74]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaled_cols=df.drop(['SalePrice'],axis=1)
scaler.fit(scaled_cols)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [75]:
scaler.transform(scaled_cols)

array([[ 0.07337496,  0.35904396, -0.07586857, ...,  0.13877749,
        -0.0116685 ,  0.18301409],
       [-0.87256276,  0.35904396,  0.57242366, ..., -0.61443862,
        -0.0116685 ,  0.18301409],
       [ 0.07337496,  0.35904396,  0.06500658, ...,  0.13877749,
        -0.0116685 ,  0.18301409],
       ...,
       [ 0.30985939,  0.35904396, -0.02820043, ...,  1.64520971,
        -0.0116685 ,  0.18301409],
       [-0.87256276,  0.35904396,  0.06500658, ...,  1.64520971,
        -0.0116685 ,  0.18301409],
       [-0.87256276,  0.35904396,  0.3709213 , ...,  0.13877749,
        -0.0116685 ,  0.18301409]])
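
With its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below reproduces that arithmetic with NumPy alone on an invented matrix, so the effect is easy to verify by hand:

```python
import numpy as np

# Invented two-column matrix with very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 6000.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# Z-scores: every column ends up with mean 0 and standard deviation 1
Z = (X - mean) / std
```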

In [79]:
#Rebuild a DataFrame from the scaled array and re-attach the unscaled target
scaled_df=pd.DataFrame(scaler.transform(scaled_cols),columns=scaled_cols.columns)
Final_data=pd.concat([scaled_df,df['SalePrice'].reset_index(drop=True)],axis=1)
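
The `reset_index(drop=True)` call matters because `pd.concat` with `axis=1` aligns rows by index label. A minimal sketch with invented values shows what would happen without it. The names `features` and `target` are hypothetical.

```python
import pandas as pd

# Target indexed by Id starting at 1, as after read_csv(..., index_col=0)
target = pd.Series([12.2, 12.1, 12.3],
                   index=pd.Index([1, 2, 3], name="Id"),
                   name="SalePrice")

# A frame rebuilt from a NumPy array gets a fresh 0..n-1 index
features = pd.DataFrame({"LotArea": [0.1, -0.2, 0.4]})

# Labels 0..2 and 1..3 only partly overlap, so rows are shifted and padded
misaligned = pd.concat([features, target], axis=1)

# Dropping the Id index makes both objects share the labels 0..2
aligned = pd.concat([features, target.reset_index(drop=True)], axis=1)
```

In this example `misaligned` has four rows with two `NaN` cells, while `aligned` has the expected three complete rows.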

In [80]:
Final_data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,0.073375,0.359044,-0.075869,-0.133231,0.064238,0.244717,-0.65704,-0.111168,0.02618,-0.485795,-0.225716,0.551913,0.045928,0.101885,0.12422,1.227931,0.651479,-0.5172,-1.043259,-0.869941,-0.52024,-0.134652,1.092673,1.116401,0.481584,0.514104,1.052302,0.321564,1.060705,0.61896,0.094944,-0.590555,1.006001,0.575425,0.337745,-0.288653,-0.944591,-0.459303,0.141339,0.891179,0.263813,0.282021,-0.80357,1.161852,-0.120242,0.52926,1.10781,-0.241061,0.789741,1.227585,0.163779,-0.211454,0.735994,0.91221,0.2445,-0.951226,-0.94419,0.586606,-1.008328,0.318475,0.311725,0.351,0.259467,0.304008,0.289745,-0.752176,0.216503,-0.359325,-0.116339,-0.270208,-0.068692,-0.069409,0.437409,0.189185,-0.087688,-1.599111,0.138777,-0.011669,0.183014,12.247694
1,-0.872563,0.359044,0.572424,0.113442,0.064238,0.244717,-0.65704,-0.111168,0.02618,1.41826,-0.225716,0.041271,-1.550606,0.101885,0.12422,-0.255544,-0.071836,2.179628,-0.183465,0.390141,-0.52024,-0.134652,-0.856383,-0.926137,-0.634468,-0.57075,-0.689604,0.321564,-0.649543,0.61896,0.094944,2.220999,-0.136559,1.171992,0.337745,-0.288653,-0.641228,0.466465,0.141339,0.891179,0.263813,0.282021,0.418585,-0.795163,-0.120242,-0.381846,-0.819964,3.948809,0.789741,-0.761621,0.163779,-0.211454,-0.771091,-0.318683,0.2445,0.600495,0.526229,0.586606,0.073805,0.318475,0.311725,-0.060731,0.259467,0.304008,0.289745,1.626195,-0.704483,-0.359325,-0.116339,-0.270208,-0.068692,-0.069409,0.437409,0.189185,-0.087688,-0.48911,-0.614439,-0.011669,0.183014,12.109011
2,0.073375,0.359044,0.065007,0.420061,0.064238,0.244717,0.872909,-0.111168,0.02618,-0.485795,-0.225716,0.551913,0.045928,0.101885,0.12422,1.227931,0.651479,-0.5172,-0.977121,-0.821476,-0.52024,-0.134652,1.092673,1.116401,0.481584,0.325915,1.052302,0.321564,1.060705,0.61896,0.094944,0.34663,1.006001,0.092907,0.337745,-0.288653,-0.301643,-0.313369,0.141339,0.891179,0.263813,0.282021,-0.57656,1.189351,-0.120242,0.659675,1.10781,-0.241061,0.789741,1.227585,0.163779,-0.211454,0.735994,-0.318683,0.2445,0.600495,0.526229,0.586606,-0.925087,0.318475,0.311725,0.631726,0.259467,0.304008,0.289745,-0.752176,-0.070361,-0.359325,-0.116339,-0.270208,-0.068692,-0.069409,0.437409,0.189185,-0.087688,0.990891,0.138777,-0.011669,0.183014,12.317167
3,0.309859,0.359044,-0.325778,0.103347,0.064238,0.244717,0.872909,-0.111168,0.02618,0.466233,-0.225716,0.89234,0.045928,0.101885,0.12422,1.227931,0.651479,-0.5172,1.800676,0.632464,-0.52024,-0.134652,-1.506069,-0.634346,-0.634468,-0.57075,-0.689604,0.321564,-1.504667,-0.655627,2.405256,-0.590555,-0.136559,-0.499274,0.337745,-0.288653,-0.06167,-0.687324,0.141339,-0.151386,0.263813,0.282021,-0.439287,0.937276,-0.120242,0.541511,1.10781,-0.241061,-1.026041,-0.761621,0.163779,-0.211454,0.735994,0.296763,0.2445,0.600495,1.261438,-1.008264,-0.883467,-0.801942,1.650307,0.790804,0.259467,0.304008,0.289745,-0.752176,-0.176048,4.092524,-0.116339,-0.270208,-0.068692,-0.069409,0.437409,0.189185,-0.087688,-1.599111,-1.367655,-0.011669,-3.302211,11.849398
4,0.073375,0.359044,0.724756,0.878409,0.064238,0.244717,0.872909,-0.111168,0.02618,1.41826,-0.225716,1.913623,0.045928,0.101885,0.12422,1.227931,1.374795,-0.5172,-0.944052,-0.724547,-0.52024,-0.134652,1.092673,1.116401,0.481584,1.366489,1.052302,0.321564,1.060705,0.61896,0.094944,1.283814,1.006001,0.463568,0.337745,-0.288653,-0.174865,0.19968,0.141339,0.891179,0.263813,0.282021,0.112267,1.617877,-0.120242,1.282191,1.10781,-0.241061,0.789741,1.227585,1.390023,-0.211454,0.735994,1.527656,0.2445,0.600495,0.526229,0.586606,-0.883467,0.318475,1.650307,1.698485,0.259467,0.304008,0.289745,0.780197,0.56376,-0.359325,-0.116339,-0.270208,-0.068692,-0.069409,0.437409,0.189185,-0.087688,2.100892,0.138777,-0.011669,0.183014,12.429216


In [82]:
Final_data.to_csv('Cleaned_Data')