**Problem Statement**: Predict the sale price of a property/house given a set of parameters describing it.

**Approach**

1.   First, we analyse the data through Exploratory Data Analysis (EDA) to derive useful insights. These insights guide the later preprocessing/cleaning stage.

2.   Next, we perform the actual preprocessing/cleaning steps to bring the data into a format the Machine Learning model can train on. This is necessary because the model can only work with specific kinds of data (for example, numeric values without missing entries), so the raw data has to be converted into that format.

3.   Finally, we train the Machine Learning model to predict the SalePrice of a property given some input parameters.

***IMPORTING REQUIRED LIBRARIES***

In [48]:
import os
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

warnings.filterwarnings('ignore')

Reading the data into a Dataframe

In [49]:
data = pd.read_csv('train1.csv', index_col = 0)

Taking a deep copy of the data so as to not modify the original data

In [50]:
data1 = data.copy(deep = True)
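
To see why the deep copy matters, here is a minimal sketch on a made-up frame (the `toy` values are illustrative only and are not read from `train1.csv`). A change made to the copy should leave the original untouched:

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({'LotArea': [8450, 9600], 'Street': ['Pave', 'Pave']})

toy_copy = toy.copy(deep=True)   # independent copy of the data and the index
toy_copy.loc[0, 'LotArea'] = 0   # modify only the copy

original_value = toy.loc[0, 'LotArea']
copied_value = toy_copy.loc[0, 'LotArea']
print(original_value, copied_value)
```

Working on `data1` therefore keeps `data` available as an unmodified reference.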
data2 = data.copy(deep = True)

Display the first 5 rows of the dataframe

In [51]:
data1.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


Display the last 5 rows of the dataframe

In [52]:
data1.tail()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125
1460,20,RL,75.0,9937,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,6,2008,WD,Normal,147500


Display all the columns of the dataframe

In [53]:
data1.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

Display the selected column only

In [56]:
data1['MSSubClass']#.head()

Id
1       60
2       20
3       60
4       70
5       60
        ..
1456    60
1457    20
1458    70
1459    20
1460    20
Name: MSSubClass, Length: 1461, dtype: int64

Display the data of multiple selected columns

In [58]:
data1[['MSSubClass', 'SaleType', 'Utilities']]

Id,MSSubClass,SaleType,Utilities
1,60,WD,AllPub
2,20,WD,AllPub
3,60,WD,AllPub
4,70,WD,AllPub
5,60,WD,AllPub
...,...,...,...
1456,60,WD,AllPub
1457,20,WD,AllPub
1458,70,WD,AllPub
1459,20,WD,AllPub


Display the Index of the Dataframe

In [57]:
data1.index

Int64Index([   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
            ...
            1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460],
           dtype='int64', name='Id', length=1461)

Create a new column based on existing columns

In [59]:
data1['Age'] = data1['YrSold'] - data1['YearBuilt']

In [60]:
data1['Age']

Id
1        5
2       31
3        7
4       91
5        8
        ..
1456     8
1457    32
1458    69
1459    60
1460    43
Name: Age, Length: 1461, dtype: int64

Select particular rows by index label using the "loc" indexer. Id 3 comes back as two rows because the raw data contains duplicate rows.

In [61]:
data.loc[3]

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
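
`loc` returns a Series when the requested label is unique and a DataFrame when the label repeats in the index. A minimal sketch on a made-up frame (the `toy` values and the repeated Id are assumptions for illustration):

```python
import pandas as pd

# Toy frame with a repeated index label (Id 3 appears twice)
toy = pd.DataFrame(
    {'LotArea': [8450, 9600, 11250, 11250]},
    index=pd.Index([1, 2, 3, 3], name='Id'),
)

unique_row = toy.loc[1]      # label occurs once  -> Series
repeated_rows = toy.loc[3]   # label occurs twice -> DataFrame with 2 rows

print(type(unique_row).__name__, type(repeated_rows).__name__)
```

Getting a DataFrame back from a single-label lookup is a quick hint that the index is not unique.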


Select row with Id "3" and column name "LotArea" and "Street"

In [62]:
data1.loc[3, ['LotArea', 'Street']]

Id,LotArea,Street
3,11250,Pave
3,11250,Pave


Select all rows with column name "LotArea" and "Street"

In [63]:
data1.loc[:, ['LotArea', 'Street']]

Id,LotArea,Street
1,8450,Pave
2,9600,Pave
3,11250,Pave
4,9550,Pave
5,14260,Pave
...,...,...
1456,7917,Pave
1457,13175,Pave
1458,9042,Pave
1459,9717,Pave


Select a single value by integer position (zero-based row and column) using the "iloc" indexer

In [64]:
data1.iloc[2, 2]

68.0
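
The difference between the two indexers can be sketched on a small made-up frame (the `toy` values are illustrative): `iloc` counts positions from zero, while `loc` uses the index labels and column names.

```python
import pandas as pd

# Toy frame indexed by Id, mirroring the first three columns of the data
toy = pd.DataFrame(
    {
        'MSSubClass': [60, 20, 60],
        'MSZoning': ['RL', 'RL', 'RL'],
        'LotFrontage': [65.0, 80.0, 68.0],
    },
    index=pd.Index([1, 2, 3], name='Id'),
)

by_position = toy.iloc[2, 2]           # third row, third column (zero-based)
by_label = toy.loc[3, 'LotFrontage']   # row labelled Id 3, column 'LotFrontage'

print(by_position, by_label)
```

Both lookups hit the same cell here because the Id labels start at 1 while positions start at 0.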

In [65]:
data1.head(6)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,2,2008,WD,Normal,208500,5
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,,,,0,5,2007,WD,Normal,181500,31
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,,,,0,9,2008,WD,Normal,223500,7
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,,,,0,2,2006,WD,Abnorml,140000,91
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,,,,0,12,2008,WD,Normal,250000,8
6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,...,,MnPrv,Shed,700,10,2009,WD,Normal,143000,16


Conditional selection of rows using a boolean mask

In [66]:
data1[data1['LotArea'] >= 5000]

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,2,2008,WD,Normal,208500,5
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,,,,0,5,2007,WD,Normal,181500,31
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,,,,0,9,2008,WD,Normal,223500,7
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,,,,0,2,2006,WD,Abnorml,140000,91
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,,,,0,12,2008,WD,Normal,250000,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,8,2007,WD,Normal,175000,8
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,,MnPrv,,0,2,2010,WD,Normal,210000,32
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,,GdPrv,Shed,2500,5,2010,WD,Normal,266500,69
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,4,2010,WD,Normal,142125,60


In [67]:
data1[data1['LotShape'] == 'IR1']

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,,,,0,9,2008,WD,Normal,223500,7
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,,,,0,2,2006,WD,Abnorml,140000,91
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,,,,0,12,2008,WD,Normal,250000,8
6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,...,,MnPrv,Shed,700,10,2009,WD,Normal,143000,16
8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,...,,,Shed,350,11,2009,WD,Normal,200000,36
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1430,20,RL,,12546,Pave,,IR1,Lvl,AllPub,Corner,...,,,,0,4,2007,WD,Normal,182900,26
1432,120,RL,,4928,Pave,,IR1,Lvl,AllPub,Inside,...,,,,0,10,2009,WD,Normal,143750,33
1434,60,RL,93.0,10261,Pave,,IR1,Lvl,AllPub,Inside,...,,,,0,5,2008,WD,Normal,186500,8
1441,70,RL,79.0,11526,Pave,,IR1,Bnk,AllPub,Inside,...,,,,0,9,2008,WD,Normal,191000,86


Multiple conditions combined with "&" (each condition wrapped in parentheses)

In [68]:
data1[(data1['LotShape'] == 'IR1') & (data1['LotArea'] >= 50000)]

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
54,20,RL,68.0,50271,Pave,,IR1,Low,AllPub,Inside,...,,,,0,11,2006,WD,Normal,385000,25
336,190,RL,,164660,Grvl,,IR1,HLS,AllPub,Corner,...,,,Shed,700,8,2008,WD,Normal,228950,43
452,20,RL,62.0,70761,Pave,,IR1,Low,AllPub,Inside,...,,,,0,12,2006,WD,Normal,280000,31
458,20,RL,,53227,Pave,,IR1,Low,AllPub,CulDSac,...,,,,0,3,2008,WD,Normal,256000,54
1397,20,RL,,57200,Pave,,IR1,Bnk,AllPub,Inside,...,,,,0,6,2010,WD,Normal,160000,62
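
Boolean masks can also be combined with `|` (or) and negated with `~`. A minimal sketch on made-up rows (the `toy` frame is hypothetical); the parentheses are needed because `&` and `|` bind more tightly than the comparison operators:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'LotShape': ['IR1', 'Reg', 'IR1', 'IR2'],
        'LotArea': [50271, 8450, 9550, 70000],
    },
    index=pd.Index([54, 1, 4, 99], name='Id'),
)

is_ir1 = toy['LotShape'] == 'IR1'
is_large = toy['LotArea'] >= 50000

both = toy[is_ir1 & is_large]          # rows meeting both conditions
either = toy[is_ir1 | is_large]        # rows meeting at least one condition
neither = toy[~(is_ir1 | is_large)]    # rows meeting none of them

print(list(both.index), list(either.index), list(neither.index))
```

Naming the masks first, as done here, tends to keep longer filters readable.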


Conditional selection can help weed out outliers or incorrect data

In [69]:
data1[data1['Age'] > 100]

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
186,75,RM,90.0,22950,Pave,,IR2,Lvl,AllPub,Inside,...,,GdPrv,,0,6,2006,WD,Normal,475000,114
243,50,RM,63.0,5000,Pave,,Reg,Lvl,AllPub,Corner,...,,,,0,4,2006,WD,Normal,79000,106
305,75,RM,87.0,18386,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,5,2008,WD,Normal,295000,128
391,50,RL,50.0,8405,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,,MnPrv,,0,4,2008,WD,Normal,119000,108
489,190,RL,60.0,10800,Pave,,Reg,Lvl,AllPub,Corner,...,,,,0,5,2006,ConLD,Normal,160000,106
521,190,RL,60.0,10800,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,,,,0,8,2008,WD,Normal,106250,108
584,75,RM,75.0,13500,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,7,2008,WD,Normal,325000,115
631,70,RM,50.0,9000,Pave,Grvl,Reg,Lvl,AllPub,Corner,...,,MnPrv,,0,6,2006,WD,Abnorml,124000,126
654,50,RM,60.0,10320,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,,MnPrv,,0,6,2008,WD,Normal,135000,102
677,70,RM,60.0,9600,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,,,,0,5,2006,WD,Normal,87000,106
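
A fixed cutoff such as `Age > 100` is one way to pick out unusual rows. Another common rule of thumb, shown here only as a sketch and not as the method used in this notebook, flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The `ages` values below are made up:

```python
import pandas as pd

# Hypothetical ages; 128 is deliberately far from the rest
ages = pd.Series([5, 31, 7, 8, 16, 3, 36, 25, 12, 128], name='Age')

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # values below this are flagged
upper = q3 + 1.5 * iqr   # values above this are flagged

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

Whether a flagged value is an error or simply a very old house still has to be judged from the data itself.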


Get statistics of the dataframe i.e. mean, median, mode, correlation, variance, standard deviation and more

In [70]:
data1.mean(numeric_only = True)

MSSubClass           57.043121
LotFrontage          70.021613
LotArea           10526.568789
OverallQual           6.108145
OverallCond           5.574949
YearBuilt          1971.417522
YearRemodAdd       1984.881588
MasVnrArea          103.399174
BsmtFinSF1          447.819986
BsmtFinSF2           46.423682
BsmtUnfSF           564.778919
TotalBsmtSF        1059.022587
1stFlrSF           1162.201916
2ndFlrSF            349.606434
LowQualFinSF          5.840520
GrLivArea          1517.648871
BsmtFullBath          0.430527
BsmtHalfBath          0.058179
FullBath              1.566735
HalfBath              0.383299
BedroomAbvGr          2.865845
KitchenAbvGr          1.047228
TotRmsAbvGrd          6.521561
Fireplaces            0.617385
GarageYrBlt        1978.597685
GarageCars            1.772758
GarageArea          474.438741
WoodDeckSF           94.457221
OpenPorchSF          46.863107
EnclosedPorch        22.270363
3SsnPorch             3.626283
ScreenPorch          14.924709
PoolArea

In [71]:
data1.median(numeric_only = True)

MSSubClass           50.0
LotFrontage          69.0
LotArea            9500.0
OverallQual           6.0
OverallCond           5.0
YearBuilt          1973.0
YearRemodAdd       1994.0
MasVnrArea            0.0
BsmtFinSF1          387.0
BsmtFinSF2            0.0
BsmtUnfSF           473.0
TotalBsmtSF         992.0
1stFlrSF           1086.0
2ndFlrSF              0.0
LowQualFinSF          0.0
GrLivArea          1466.0
BsmtFullBath          0.0
BsmtHalfBath          0.0
FullBath              2.0
HalfBath              0.0
BedroomAbvGr          3.0
KitchenAbvGr          1.0
TotRmsAbvGrd          6.0
Fireplaces            1.0
GarageYrBlt        1980.0
GarageCars            2.0
GarageArea          480.0
WoodDeckSF            0.0
OpenPorchSF          25.0
EnclosedPorch         0.0
3SsnPorch             0.0
ScreenPorch           0.0
PoolArea              0.0
MiscVal               0.0
MoSold                6.0
YrSold             2008.0
SalePrice        163000.0
Age                  35.0
dtype: float64

In [84]:
data1.mode()

,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
0,20,RL,60.0,7200,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,Gd,MnPrv,Shed,0,6,2009,WD,Normal,140000,1


In [72]:
data1.std(numeric_only = True)

MSSubClass          42.399166
LotFrontage         24.264170
LotArea           9978.398335
OverallQual          1.383591
OverallCond          1.111904
YearBuilt           30.117957
YearRemodAdd        20.621815
MasVnrArea         179.807006
BsmtFinSF1         456.225704
BsmtFinSF2         161.237175
BsmtUnfSF          441.816636
TotalBsmtSF        437.345408
1stFlrSF           386.438405
2ndFlrSF           437.525242
LowQualFinSF        48.606667
GrLivArea          525.591053
BsmtFullBath         0.519614
BsmtHalfBath         0.239941
FullBath             0.551923
HalfBath             0.502972
BedroomAbvGr         0.816653
KitchenAbvGr         0.221671
TotRmsAbvGrd         1.629894
Fireplaces           0.645914
GarageYrBlt         24.666557
GarageCars           0.745101
GarageArea         213.167067
WoodDeckSF         125.298904
OpenPorchSF         66.336846
EnclosedPorch       61.769458
3SsnPorch           30.455221
ScreenPorch         55.565312
PoolArea            40.163610
MiscVal   

In [73]:
data1.var(numeric_only = True)

MSSubClass       1.797689e+03
LotFrontage      5.887499e+02
LotArea          9.956843e+07
OverallQual      1.914324e+00
OverallCond      1.236331e+00
YearBuilt        9.070913e+02
YearRemodAdd     4.252593e+02
MasVnrArea       3.233056e+04
BsmtFinSF1       2.081419e+05
BsmtFinSF2       2.599743e+04
BsmtUnfSF        1.952019e+05
TotalBsmtSF      1.912710e+05
1stFlrSF         1.493346e+05
2ndFlrSF         1.914283e+05
LowQualFinSF     2.362608e+03
GrLivArea        2.762460e+05
BsmtFullBath     2.699990e-01
BsmtHalfBath     5.757175e-02
FullBath         3.046187e-01
HalfBath         2.529812e-01
BedroomAbvGr     6.669217e-01
KitchenAbvGr     4.913786e-02
TotRmsAbvGrd     2.656555e+00
Fireplaces       4.172044e-01
GarageYrBlt      6.084390e+02
GarageCars       5.551752e-01
GarageArea       4.544020e+04
WoodDeckSF       1.569982e+04
OpenPorchSF      4.400577e+03
EnclosedPorch    3.815466e+03
3SsnPorch        9.275205e+02
ScreenPorch      3.087504e+03
PoolArea         1.613116e+03
MiscVal   

In [74]:
data1.dtypes

MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
                  ...   
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Age                int64
Length: 81, dtype: object

In [75]:
data1.skew(numeric_only = True)

MSSubClass        1.408282
LotFrontage       2.167952
LotArea          12.207001
OverallQual       0.211145
OverallCond       0.699951
YearBuilt        -0.613119
YearRemodAdd     -0.503730
MasVnrArea        2.680841
BsmtFinSF1        1.666116
BsmtFinSF2        4.260865
BsmtUnfSF         0.931311
TotalBsmtSF       1.539384
1stFlrSF          1.380760
2ndFlrSF          0.800098
LowQualFinSF      9.014533
GrLivArea         1.359763
BsmtFullBath      0.573321
BsmtHalfBath      4.073444
FullBath          0.040390
HalfBath          0.673828
BedroomAbvGr      0.205781
KitchenAbvGr      4.450652
TotRmsAbvGrd      0.677647
Fireplaces        0.641404
GarageYrBlt      -0.654648
GarageCars       -0.342022
GarageArea        0.180645
WoodDeckSF        1.536496
OpenPorchSF       2.354891
EnclosedPorch     3.058898
3SsnPorch         9.934743
ScreenPorch       4.150322
PoolArea         14.833510
MiscVal          24.475863
MoSold            0.209596
YrSold            0.097290
SalePrice         1.873260
A

In [76]:
data1.select_dtypes(include = 'number').corr()

,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice,Age
MSSubClass,1.0,-0.38697,-0.140204,0.029215,-0.05864,0.022406,0.036155,0.020924,-0.070292,-0.065841,...,-0.007651,-0.011567,-0.043641,-0.026597,0.008024,-0.006859,-0.019816,-0.018077,-0.087284,-0.023162
LotFrontage,-0.38697,1.0,0.426891,0.251937,-0.058561,0.126156,0.090753,0.195539,0.236045,0.050986,...,0.15281,0.005554,0.07231,0.041695,0.206251,0.004648,0.015889,0.004589,0.353502,-0.125763
LotArea,-0.140204,0.426891,1.0,0.105455,-0.00648,0.014888,0.0142,0.105924,0.213482,0.111224,...,0.084328,-0.019506,0.022131,0.043657,0.077601,0.038359,0.002714,-0.01376,0.263512,-0.015467
OverallQual,0.029215,0.251937,0.105455,1.0,-0.09209,0.570123,0.551546,0.41175,0.240979,-0.060263,...,0.307266,-0.107878,0.022725,0.061846,0.064677,-0.030169,0.071688,-0.028658,0.790975,-0.570446
OverallCond,-0.05864,-0.058561,-0.00648,-0.09209,1.0,-0.375386,0.067433,-0.125485,-0.047632,0.04078,...,-0.033467,0.066376,0.020872,0.058315,-0.001962,0.068289,-0.005177,0.045877,-0.078669,0.376783
YearBuilt,0.022406,0.126156,0.014888,0.570123,-0.375386,1.0,0.59764,0.319811,0.251461,-0.050106,...,0.186465,-0.390106,0.034848,-0.04895,0.004621,-0.031227,0.019572,-0.015426,0.523201,-0.999038
YearRemodAdd,0.036155,0.090753,0.0142,0.551546,0.067433,0.59764,1.0,0.185019,0.129753,-0.06691,...,0.225122,-0.19983,0.047059,-0.035263,0.005781,-0.008842,0.02449,0.03641,0.509081,-0.595063
MasVnrArea,0.020924,0.195539,0.105924,0.41175,-0.125485,0.319811,0.185019,1.0,0.270786,-0.071834,...,0.12898,-0.108869,0.014236,0.056217,0.011911,-0.029063,-0.003799,-0.006784,0.480492,-0.319572
BsmtFinSF1,-0.070292,0.236045,0.213482,0.240979,-0.047632,0.251461,0.129753,0.270786,1.0,-0.052076,...,0.109382,-0.103667,0.028895,0.060804,0.139774,0.004678,-0.012205,0.014152,0.387234,-0.250428
BsmtFinSF2,-0.065841,0.050986,0.111224,-0.060263,0.04078,-0.050106,-0.06691,-0.071834,-0.052076,1.0,...,0.003035,0.035386,-0.030853,0.090027,0.041769,0.005184,-0.014849,0.033595,-0.012058,0.051498


Display and remove the duplicate rows in the Dataframe

In [77]:
data1[data1.duplicated()]

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,2,2008,WD,Normal,208500,5
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,,,,0,5,2007,WD,Normal,181500,31
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,,,,0,9,2008,WD,Normal,223500,7
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,,,,0,2,2006,WD,Abnorml,140000,91
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,,,,0,12,2008,WD,Normal,250000,8
6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,...,,MnPrv,Shed,700,10,2009,WD,Normal,143000,16
7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,8,2007,WD,Normal,307000,3
8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,...,,,Shed,350,11,2009,WD,Normal,200000,36
9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,4,2008,WD,Abnorml,129900,77
10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,...,,,,0,1,2008,WD,Normal,118000,69


In [78]:
data1.drop_duplicates(keep = 'first', inplace = True)

In [79]:
data1[data1.duplicated()]

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
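
How `duplicated()` and `drop_duplicates(keep = 'first')` interact can be sketched on a tiny made-up frame (the `toy` values are hypothetical). By default only the column values are compared, and every repeat after the first occurrence is marked:

```python
import pandas as pd

# Third row repeats the first row's values
toy = pd.DataFrame(
    {'LotArea': [8450, 9600, 8450], 'Street': ['Pave', 'Pave', 'Pave']},
    index=pd.Index([1, 2, 1], name='Id'),
)

dup_mask = toy.duplicated()                   # True for later repeats only
deduped = toy.drop_duplicates(keep='first')   # keeps the first occurrence

print(dup_mask.tolist(), len(deduped))
```

After dropping, a second call to `duplicated()` finds nothing, which matches the empty result shown above.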


Sort the dataframe in ascending or descending order based on a particular column

In [80]:
low_to_high_price = data1.sort_values('SalePrice', ascending = True)

In [81]:
low_to_high_price.head(15)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Age
496,30,C (all),60.0,7879,Pave,,Reg,Lvl,AllPub,Inside,...,,GdWo,,0,11,2009,WD,Abnorml,34900,89
917,20,C (all),50.0,9000,Pave,,Reg,Lvl,AllPub,Inside,...,,,,0,10,2006,WD,Abnorml,35311,57
969,50,RM,50.0,5925,Pave,,Reg,Lvl,AllPub,Inside,...,,GdWo,,0,5,2009,WD,Abnorml,37900,99
534,20,RL,50.0,5000,Pave,,Reg,Low,AllPub,Inside,...,,,,0,1,2007,WD,Normal,39300,61
31,70,C (all),50.0,8500,Pave,Pave,Reg,Lvl,AllPub,Inside,...,,MnPrv,,0,7,2008,WD,Normal,40000,88
711,30,RL,56.0,4130,Pave,,IR1,Lvl,AllPub,Inside,...,,,,0,7,2008,WD,Normal,52000,73
1338,30,RM,153.0,4118,Pave,Grvl,IR1,Bnk,AllPub,Corner,...,,,,0,3,2006,WD,Normal,52500,65
1326,30,RM,40.0,3636,Pave,,Reg,Lvl,AllPub,Inside,...,,MnPrv,,0,1,2008,WD,Normal,55000,86
706,190,RM,70.0,5600,Pave,,Reg,Lvl,AllPub,Inside,...,,,Othr,3500,7,2010,WD,Normal,55000,80
813,20,C (all),66.0,8712,Grvl,,Reg,Bnk,AllPub,Inside,...,,,Shed,54,6,2010,WD,Alloca,55993,58


Group the data by specific columns and aggregate each group with a statistical function (mean, median, sum, ...)

In [82]:
data1.groupby(['LotShape']).mean(numeric_only = True)  # or median/sum

LotShape,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice,Age
IR1,50.322245,76.034921,11905.686071,6.43659,5.559252,1980.02079,1988.650728,130.434238,523.790021,52.669439,...,52.336798,16.550936,4.769231,18.259875,3.767152,74.074844,6.471933,2007.733888,206279.326403,27.713098
IR2,52.073171,76.5,23733.658537,6.731707,5.560976,1985.268293,1996.02439,126.243902,468.146341,100.365854,...,71.170732,2.268293,7.073171,34.634146,0.0,23.902439,6.365854,2007.853659,239833.365854,22.585366
IR3,41.0,138.428571,41338.2,6.8,4.8,1987.8,1995.8,79.6,1114.1,82.0,...,64.6,0.0,0.0,14.7,48.0,0.0,6.2,2007.4,216036.5,19.6
Reg,60.926936,67.007109,8878.805889,5.892039,5.594329,1966.054526,1982.312977,88.220637,396.107961,40.920393,...,42.738277,25.930207,2.610687,12.491821,1.89313,27.954198,6.254089,2007.8506,164960.461287,41.796074


In [83]:
data1.groupby(['LotShape', 'Street']).mean(numeric_only = True)  # or median/sum

LotShape,Street,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice,Age
IR1,Grvl,190.0,,164660.0,5.0,6.0,1965.0,1965.0,0.0,1249.0,147.0,...,0.0,0.0,0.0,0.0,0.0,700.0,8.0,2008.0,228950.0,43.0
IR1,Pave,50.03125,76.034921,11587.447917,6.439583,5.558333,1980.052083,1988.7,130.707113,522.279167,52.472917,...,52.445833,16.585417,4.779167,18.297917,3.775,72.770833,6.46875,2007.733333,206232.095833,27.68125
IR2,Grvl,90.0,110.0,8472.0,5.0,5.0,1963.0,1963.0,0.0,104.0,712.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2010.0,110000.0,47.0
IR2,Pave,51.125,75.16,24115.2,6.775,5.575,1985.825,1996.85,129.4,477.25,85.075,...,72.95,2.325,7.25,35.5,0.0,24.5,6.4,2007.8,243079.2,21.975
IR3,Pave,41.0,138.428571,41338.2,6.8,4.8,1987.8,1995.8,79.6,1114.1,82.0,...,64.6,0.0,0.0,14.7,48.0,0.0,6.2,2007.4,216036.5,19.6
Reg,Grvl,40.0,79.25,18421.5,4.75,4.5,1960.0,1963.75,82.5,493.75,0.0,...,78.75,0.0,0.0,65.75,0.0,153.5,6.0,2008.0,110548.25,48.0
Reg,Pave,61.01862,66.94881,8836.997809,5.897043,5.599124,1966.081051,1982.394304,88.245865,395.680175,41.099671,...,42.580504,26.043812,2.622125,12.258488,1.901424,27.404162,6.255203,2007.849945,165198.849945,41.768894
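
When only one column is of interest, several statistics can be computed per group in a single call with `agg`. A minimal sketch on made-up rows (the `toy` frame is hypothetical and much smaller than the real data):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'LotShape': ['IR1', 'IR1', 'Reg', 'Reg'],
        'SalePrice': [223500, 140000, 208500, 181500],
    }
)

# One row per LotShape, one column per statistic
summary = toy.groupby('LotShape')['SalePrice'].agg(['mean', 'median', 'count'])
print(summary)
```

Including `count` next to the mean helps spot groups, such as the single-row `IR1`/`Grvl` group above, whose averages rest on very few houses.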


Get general info about the dataframe, i.e. data types, non-null counts, statistical measures and more, with just a couple of lines of code!

In [83]:
data1.info()

In [83]:
data1.describe()

In [32]:
data1.isnull().sum()

MSSubClass         0
MSZoning           0
LotFrontage      257
LotArea            0
Street             0
                ... 
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Age                0
Length: 81, dtype: int64
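
Raw counts are easier to compare as shares of the total number of rows. A minimal sketch, assuming a made-up frame (`toy` is illustrative only): taking the mean of the boolean mask gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        'LotFrontage': [65.0, np.nan, 68.0, np.nan],
        'Alley': [None, None, None, 'Grvl'],
        'LotArea': [8450, 9600, 11250, 9550],
    }
)

# Fraction of missing entries per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns that are mostly empty may call for a different treatment than columns with only a few gaps.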

In [19]:
num = data1.select_dtypes(include = 'number')
cat = data1.select_dtypes(include = 'object')

In [None]:
for i in num:
    num.boxplot(column = i, patch_artist = True, notch = True)
    plt.ylabel(i)
    plt.show()

In [None]:
sns.set_style('whitegrid')
for j in num:
    sns.histplot(data1[j], kde = True, color = 'red')  # histplot replaces the deprecated distplot
    plt.show()

In [None]:
sns.scatterplot(x = data1['GarageArea'], y = data1['SalePrice'])

In [None]:
sns.scatterplot(x = data1['YrSold'], y = data1['SalePrice'])

In [None]:
for i in cat:
  sns.countplot(x = data1[i], palette = "Spectral")
  plt.show()

In [None]:
sns.barplot(x = data1['LotShape'], y = data1['SalePrice'])

In [None]:
sns.barplot(x = data1['BldgType'], y = data1['SalePrice'], ci = 0)

In [None]:
x = data1.drop(['SalePrice'], axis = 1)
y = data1['SalePrice']

In [None]:
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state = 69)

In [None]:
train_num = train_x.select_dtypes(include = 'number')
train_cat = train_x.select_dtypes(include = 'object')

test_num = test_x.select_dtypes(include = 'number')
test_cat = test_x.select_dtypes(include = 'object')

In [None]:
print('Missing values before imputation \n', train_cat.isnull().sum())
train_cat.fillna(train_cat.mode().loc[0], inplace = True)
print('\n')
print('Missing values after imputation \n', train_cat.isnull().sum())

Missing values before imputation 
 MSZoning            0
Street              0
Alley            1030
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          6
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           26
BsmtCond           26
BsmtExposure       27
BsmtFinType1       26
BsmtFinType2       27
Heating             0
HeatingQC           0
CentralAir          0
Electrical          0
KitchenQual         0
Functional          0
FireplaceQu       517
GarageType         62
GarageFinish       62
GarageQual         62
GarageCond         62
PavedDrive          0
PoolQC           1090
Fence             884
MiscFeature      1059
SaleType            0
SaleCondition       0
dtype: int64


Miss

In [None]:
print('Missing values before imputation \n', train_num.isnull().sum())
train_num.fillna(train_num.median(), inplace = True)
print('\n')
print('Missing values after imputation \n', train_num.isnull().sum())

Missing values before imputation 
 MSSubClass         0
LotFrontage      188
LotArea            0
OverallQual        0
OverallCond        0
YearBuilt          0
YearRemodAdd       0
MasVnrArea         6
BsmtFinSF1         0
BsmtFinSF2         0
BsmtUnfSF          0
TotalBsmtSF        0
1stFlrSF           0
2ndFlrSF           0
LowQualFinSF       0
GrLivArea          0
BsmtFullBath       0
BsmtHalfBath       0
FullBath           0
HalfBath           0
BedroomAbvGr       0
KitchenAbvGr       0
TotRmsAbvGrd       0
Fireplaces         0
GarageYrBlt       62
GarageCars         0
GarageArea         0
WoodDeckSF         0
OpenPorchSF        0
EnclosedPorch      0
3SsnPorch          0
ScreenPorch        0
PoolArea           0
MiscVal            0
MoSold             0
YrSold             0
dtype: int64


Missing values after imputation 
 MSSubClass       0
LotFrontage      0
LotArea          0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
MasVnrArea       0
BsmtFinS

In [None]:
print('Missing values before imputation \n', test_cat.isnull().sum())
test_cat.fillna(train_cat.mode().loc[0], inplace = True)
print('\n')
print('Missing values after imputation \n', test_cat.isnull().sum())

Missing values before imputation 
 MSZoning           0
Street             0
Alley            339
LotShape           0
LandContour        0
Utilities          0
LotConfig          0
LandSlope          0
Neighborhood       0
Condition1         0
Condition2         0
BldgType           0
HouseStyle         0
RoofStyle          0
RoofMatl           0
Exterior1st        0
Exterior2nd        0
MasVnrType         2
ExterQual          0
ExterCond          0
Foundation         0
BsmtQual          11
BsmtCond          11
BsmtExposure      11
BsmtFinType1      11
BsmtFinType2      11
Heating            0
HeatingQC          0
CentralAir         0
Electrical         1
KitchenQual        0
Functional         0
FireplaceQu      173
GarageType        19
GarageFinish      19
GarageQual        19
GarageCond        19
PavedDrive         0
PoolQC           363
Fence            295
MiscFeature      347
SaleType           0
SaleCondition      0
dtype: int64


Missing values after imputation 
 MSZoning     

In [None]:
print('Missing values before imputation \n', test_num.isnull().sum())
test_num.fillna(train_num.median(), inplace = True)
print('\n')
print('Missing values after imputation \n', test_num.isnull().sum())

Missing values before imputation 
 MSSubClass        0
LotFrontage      71
LotArea           0
OverallQual       0
OverallCond       0
YearBuilt         0
YearRemodAdd      0
MasVnrArea        2
BsmtFinSF1        0
BsmtFinSF2        0
BsmtUnfSF         0
TotalBsmtSF       0
1stFlrSF          0
2ndFlrSF          0
LowQualFinSF      0
GrLivArea         0
BsmtFullBath      0
BsmtHalfBath      0
FullBath          0
HalfBath          0
BedroomAbvGr      0
KitchenAbvGr      0
TotRmsAbvGrd      0
Fireplaces        0
GarageYrBlt      19
GarageCars        0
GarageArea        0
WoodDeckSF        0
OpenPorchSF       0
EnclosedPorch     0
3SsnPorch         0
ScreenPorch       0
PoolArea          0
MiscVal           0
MoSold            0
YrSold            0
dtype: int64


Missing values after imputation 
 MSSubClass       0
LotFrontage      0
LotArea          0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
MasVnrArea       0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUn
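The four imputation cells above share one pattern: statistics come from the training split and are applied to both splits. A minimal sketch of that pattern as a single helper follows. The name `impute_from_train` and the toy frames are hypothetical, and the helper assumes there is at least one numeric and one non-numeric column.

```python
import numpy as np
import pandas as pd


def impute_from_train(train_df, test_df):
    """Fill NaNs in both frames using statistics learned from train_df only."""
    num_cols = train_df.select_dtypes(include='number').columns
    cat_cols = train_df.columns.difference(num_cols)

    # Median for numeric columns, most frequent value for the rest.
    fill_values = train_df[num_cols].median().to_dict()
    fill_values.update(train_df[cat_cols].mode().iloc[0].to_dict())

    # fillna with a dict fills each column with its own value.
    return train_df.fillna(fill_values), test_df.fillna(fill_values)


# Toy frames standing in for train_x / test_x (values are made up).
toy_train = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0],
    'GarageType': ['Attchd', 'Attchd', None, 'Detchd'],
})
toy_test = pd.DataFrame({
    'LotFrontage': [np.nan, 65.0],
    'GarageType': [None, 'Detchd'],
})

toy_train_filled, toy_test_filled = impute_from_train(toy_train, toy_test)
print(toy_test_filled)
```

Returning new frames instead of using `inplace = True` keeps the original splits available for comparison.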

In [None]:
train_x1 = pd.concat([train_num, train_cat], axis = 1)
test_x1 = pd.concat([test_num, test_cat], axis = 1) 

In [None]:
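# Note: the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2.
# The encoder is fit on every column of train_x1, so the numeric columns are one-hot encoded as well.
encoder = OneHotEncoder(sparse = False, handle_unknown = 'ignore')
encoder.fit(train_x1)
train_x1 = pd.DataFrame(encoder.transform(train_x1), columns = encoder.get_feature_names_out())
test_x1 = pd.DataFrame(encoder.transform(test_x1), columns = encoder.get_feature_names_out())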



In [None]:
scaler = StandardScaler()
scaler.fit(train_x1)
train_x1 = pd.DataFrame(scaler.transform(train_x1), columns = train_x1.columns)
test_x1 = pd.DataFrame(scaler.transform(test_x1), columns = test_x1.columns)

In [None]:
model1 = LinearRegression()
model2 = DecisionTreeRegressor(random_state = 69)
model3 = RandomForestRegressor(random_state = 69)
model4 = KNeighborsRegressor()

In [None]:
model1.fit(train_x1, train_y)
pred1 = model1.predict(test_x1)

In [None]:
# mean_absolute_error is already imported at the top of the notebook
mae1 = mean_absolute_error(test_y, pred1)

In [None]:
mae1

24605.462528804826

In [None]:
model2.fit(train_x1, train_y)
pred2 = model2.predict(test_x1)

mae2 = mean_absolute_error(test_y, pred2)

mae2

32311.671232876713

In [None]:
model3.fit(train_x1, train_y)
pred3 = model3.predict(test_x1)

mae3 = mean_absolute_error(test_y, pred3)

mae3

22463.731561643835

In [None]:
model4.fit(train_x1, train_y)
pred4 = model4.predict(test_x1)

mae4 = mean_absolute_error(test_y, pred4)

mae4

81638.39287671233

The baseline mean absolute error (MAE) is 24605.46 for Linear Regression, 32311.67 for the Decision Tree Regressor, 22463.73 for the Random Forest Regressor and 81638.39 for the KNN Regressor. Since MAE is an error measure, lower is better. Next we will apply feature selection to try to reduce it!
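The fit, predict and score steps repeated for each model can be collected into one loop. The sketch below assumes a hypothetical helper named `compare_mae` and uses synthetic data so that it runs on its own. Its numbers are unrelated to the housing results above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def compare_mae(models, X_train, y_train, X_test, y_test):
    """Fit each model on the training split and return {name: test MAE}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = mean_absolute_error(y_test, model.predict(X_test))
    return scores


# Synthetic, roughly linear regression data (not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=0),
    'Random Forest': RandomForestRegressor(random_state=0),
    'KNN': KNeighborsRegressor(),
}
scores = compare_mae(models, X_tr, y_tr, X_te, y_te)

# Print models from lowest (best) to highest MAE.
for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f'{name}: {mae:.3f}')
```

Keeping the scores in a dictionary keyed by model name avoids reusing variables such as `mae1` or `pred1` across experiments.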

In [None]:
nan_feat = [cname for cname in data2.columns if data2[cname].isnull().sum() >= 1]

# Fraction of missing values per feature (e.g. 0.18 means 18% of the rows are NaN)
for i in nan_feat:
    print(i, np.round(data2[i].isnull().mean(), 2))

LotFrontage 0.18
Alley 0.94
MasVnrType 0.01
MasVnrArea 0.01
BsmtQual 0.03
BsmtCond 0.03
BsmtExposure 0.03
BsmtFinType1 0.03
BsmtFinType2 0.03
Electrical 0.0
FireplaceQu 0.47
GarageType 0.06
GarageYrBlt 0.06
GarageFinish 0.06
GarageQual 0.06
GarageCond 0.06
PoolQC 1.0
Fence 0.81
MiscFeature 0.96


In [None]:
cat_data = data2.select_dtypes(include = 'object')
pd.crosstab(cat_data['Street'], columns = 'counts', normalize = True)

col_0,counts
Street,Unnamed: 1_level_1
Grvl,0.00411
Pave,0.99589


In [None]:
list1 = []
for i in cat_data.columns:
    list1.append((i, pd.crosstab(cat_data[i], columns = 'counts', normalize = True)))
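The per-column crosstabs collected in `list1` can be reduced to one number per feature: the share of rows taken by its most frequent category. A value close to 1, such as 0.99589 for `Street`, means the feature is nearly constant. The sketch below uses a hypothetical helper `dominant_share`, a made-up toy frame, and an arbitrary 0.9 cutoff that is an assumption rather than a value taken from this notebook.

```python
import pandas as pd


def dominant_share(df):
    """Share of non-missing rows taken by the most frequent value in each column."""
    return pd.Series({
        # value_counts sorts in descending order, so the first entry is the largest share
        col: df[col].value_counts(normalize=True).iloc[0]
        for col in df.columns
    }).sort_values(ascending=False)


# Toy categorical frame (values are made up).
toy_cat = pd.DataFrame({
    'Street': ['Pave'] * 9 + ['Grvl'],
    'LotShape': ['Reg'] * 6 + ['IR1'] * 4,
})

shares = dominant_share(toy_cat)
print(shares)

# Hypothetical cutoff: flag columns where one category covers at least 90% of rows.
near_constant = shares[shares >= 0.9].index.tolist()
print(near_constant)
```

Like `pd.crosstab`, `value_counts` ignores missing values by default, so the shares are computed over non-missing rows only.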

In [None]:
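# Restrict to numeric columns explicitly; recent pandas versions raise an error on object columns
corr = data2.select_dtypes(include = 'number').corr()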
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
MSSubClass,1.0,-0.386347,-0.139781,0.032628,-0.059316,0.02785,0.040581,0.022936,-0.069836,-0.065649,-0.140759,-0.238518,-0.251758,0.307886,0.046474,0.074853,0.003491,-0.002333,0.131608,0.177354,-0.023438,0.281721,0.04038,-0.045569,0.085072,-0.04011,-0.098672,-0.012579,-0.0061,-0.012037,-0.043825,-0.02603,0.008283,-0.007683,-0.013585,-0.021407,-0.084284
LotFrontage,-0.386347,1.0,0.426095,0.251646,-0.059213,0.123349,0.088866,0.193458,0.233633,0.0499,0.132644,0.392075,0.457181,0.080177,0.038469,0.402797,0.100949,-0.007234,0.198769,0.053532,0.26317,-0.006069,0.352096,0.266639,0.07025,0.285691,0.344997,0.088521,0.151972,0.0107,0.070029,0.041383,0.206167,0.003368,0.0112,0.00745,0.351799
LotArea,-0.139781,0.426095,1.0,0.105806,-0.005636,0.014228,0.013788,0.10416,0.214103,0.11117,-0.002618,0.260833,0.299475,0.050986,0.004779,0.263116,0.158155,0.048046,0.126031,0.014259,0.11969,-0.017784,0.190015,0.271364,-0.024947,0.154871,0.180403,0.171698,0.084774,-0.01834,0.020423,0.04316,0.077672,0.038068,0.001205,-0.014261,0.263843
OverallQual,0.032628,0.251646,0.105806,1.0,-0.091932,0.572323,0.550684,0.411876,0.239666,-0.059119,0.308159,0.537808,0.476224,0.295493,-0.030429,0.593007,0.111098,-0.04015,0.5506,0.273458,0.101676,-0.183882,0.427452,0.396765,0.547766,0.600671,0.562022,0.238923,0.308819,-0.113937,0.030371,0.064886,0.065166,-0.031406,0.070815,-0.027347,0.790982
OverallCond,-0.059316,-0.059213,-0.005636,-0.091932,1.0,-0.375983,0.073741,-0.128101,-0.046231,0.040229,-0.136841,-0.171098,-0.144203,0.028942,0.025494,-0.079686,-0.054942,0.117821,-0.194149,-0.060769,0.01298,-0.087001,-0.057583,-0.02382,-0.324297,-0.185758,-0.151521,-0.003334,-0.032589,0.070356,0.025504,0.054811,-0.001985,0.068777,-0.003511,0.04395,-0.077856
YearBuilt,0.02785,0.123349,0.014228,0.572323,-0.375983,1.0,0.592855,0.315707,0.249503,-0.049107,0.14904,0.391452,0.281986,0.010308,-0.183784,0.19901,0.187599,-0.038162,0.468271,0.242656,-0.070651,-0.1748,0.095589,0.147716,0.825667,0.53785,0.478954,0.22488,0.188686,-0.387268,0.031355,-0.050364,0.00495,-0.034383,0.012398,-0.013618,0.522897
YearRemodAdd,0.040581,0.088866,0.013788,0.550684,0.073741,0.592855,1.0,0.179618,0.128451,-0.067759,0.181133,0.291066,0.240379,0.140024,-0.062419,0.287389,0.11947,-0.012337,0.439046,0.183331,-0.040581,-0.149598,0.19174,0.112581,0.642277,0.420622,0.3716,0.205726,0.226298,-0.193919,0.045286,-0.03874,0.005829,-0.010286,0.02149,0.035743,0.507101
MasVnrArea,0.022936,0.193458,0.10416,0.411876,-0.128101,0.315707,0.179618,1.0,0.264736,-0.072319,0.114442,0.363936,0.344501,0.174561,-0.069071,0.390857,0.08531,0.026673,0.276833,0.201444,0.102821,-0.03761,0.280682,0.24907,0.252691,0.364204,0.373066,0.159718,0.125703,-0.110204,0.018796,0.061466,0.011723,-0.029815,-0.005965,-0.008201,0.477493
BsmtFinSF1,-0.069836,0.233633,0.214103,0.239666,-0.046231,0.249503,0.128451,0.264736,1.0,-0.050117,-0.495251,0.522396,0.445863,-0.137079,-0.064503,0.208171,0.649212,0.067418,0.058543,0.004262,-0.107355,-0.081007,0.044316,0.260011,0.153484,0.224054,0.29697,0.204306,0.111761,-0.102303,0.026451,0.062021,0.140491,0.003571,-0.015727,0.014359,0.38642
BsmtFinSF2,-0.065649,0.0499,0.11117,-0.059119,0.040229,-0.049107,-0.067759,-0.072319,-0.050117,1.0,-0.209294,0.10481,0.097117,-0.09926,0.014807,-0.00964,0.158678,0.070948,-0.076444,-0.032148,-0.015728,-0.040751,-0.035227,0.046921,-0.088011,-0.038264,-0.018227,0.067898,0.003093,0.036543,-0.029993,0.088871,0.041709,0.00494,-0.015211,0.031706,-0.011378


In [None]:
low_var_list = ['Alley', 'YrSold', 'PoolQC', 'MiscFeature', 'MiscVal', 'GarageYrBlt', 'YearBuilt', 'MoSold', 
            '1stFlrSF', '2ndFlrSF', 'LotArea', 'YearRemodAdd', 'Street', 'Utilities', 'LandSlope', 
            'Condition2', 'RoofMatl', 'Heating', 'GarageCond']
data2.drop(low_var_list, axis = 1, inplace = True)

In [None]:
x1 = data2.drop(['SalePrice'], axis = 1)
y1 = data2['SalePrice']

In [None]:
train_x4, test_x4, train_y4, test_y4 = train_test_split(x1, y1, random_state = 69)

In [None]:
train_num1 = train_x4.select_dtypes(include = 'number')
train_cat1 = train_x4.select_dtypes(include = 'object')

test_num1 = test_x4.select_dtypes(include = 'number')
test_cat1 = test_x4.select_dtypes(include = 'object')

In [None]:
print('Missing values before imputation \n', train_num1.isnull().sum())
train_num1.fillna(train_num1.median(), inplace = True)
print('\n')
print('Missing values after imputation \n', train_num1.isnull().sum())


Missing values before imputation 
 MSSubClass         0
LotFrontage      188
OverallQual        0
OverallCond        0
MasVnrArea         6
BsmtFinSF1         0
BsmtFinSF2         0
BsmtUnfSF          0
TotalBsmtSF        0
LowQualFinSF       0
GrLivArea          0
BsmtFullBath       0
BsmtHalfBath       0
FullBath           0
HalfBath           0
BedroomAbvGr       0
KitchenAbvGr       0
TotRmsAbvGrd       0
Fireplaces         0
GarageCars         0
GarageArea         0
WoodDeckSF         0
OpenPorchSF        0
EnclosedPorch      0
3SsnPorch          0
ScreenPorch        0
PoolArea           0
dtype: int64


Missing values after imputation 
 MSSubClass       0
LotFrontage      0
OverallQual      0
OverallCond      0
MasVnrArea       0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
TotRmsAbvGrd     0
Firepla

In [None]:
print('Missing values before imputation \n', train_cat1.isnull().sum())
train_cat1.fillna(train_cat1.mode().loc[0], inplace = True)
print('\n')
print('Missing values after imputation \n', train_cat1.isnull().sum())

Missing values before imputation 
 MSZoning           0
LotShape           0
LandContour        0
LotConfig          0
Neighborhood       0
Condition1         0
BldgType           0
HouseStyle         0
RoofStyle          0
Exterior1st        0
Exterior2nd        0
MasVnrType         6
ExterQual          0
ExterCond          0
Foundation         0
BsmtQual          26
BsmtCond          26
BsmtExposure      27
BsmtFinType1      26
BsmtFinType2      27
HeatingQC          0
CentralAir         0
Electrical         0
KitchenQual        0
Functional         0
FireplaceQu      517
GarageType        62
GarageFinish      62
GarageQual        62
PavedDrive         0
Fence            884
SaleType           0
SaleCondition      0
dtype: int64


Missing values after imputation 
 MSZoning         0
LotShape         0
LandContour      0
LotConfig        0
Neighborhood     0
Condition1       0
BldgType         0
HouseStyle       0
RoofStyle        0
Exterior1st      0
Exterior2nd      0
MasVnrType    

In [None]:
print('Missing values before imputation \n', test_cat1.isnull().sum())
test_cat1.fillna(train_cat1.mode().loc[0], inplace = True)
print('\n')
print('Missing values after imputation \n', test_cat1.isnull().sum())


Missing values before imputation 
 MSZoning           0
LotShape           0
LandContour        0
LotConfig          0
Neighborhood       0
Condition1         0
BldgType           0
HouseStyle         0
RoofStyle          0
Exterior1st        0
Exterior2nd        0
MasVnrType         2
ExterQual          0
ExterCond          0
Foundation         0
BsmtQual          11
BsmtCond          11
BsmtExposure      11
BsmtFinType1      11
BsmtFinType2      11
HeatingQC          0
CentralAir         0
Electrical         1
KitchenQual        0
Functional         0
FireplaceQu      173
GarageType        19
GarageFinish      19
GarageQual        19
PavedDrive         0
Fence            295
SaleType           0
SaleCondition      0
dtype: int64


Missing values after imputation 
 MSZoning         0
LotShape         0
LandContour      0
LotConfig        0
Neighborhood     0
Condition1       0
BldgType         0
HouseStyle       0
RoofStyle        0
Exterior1st      0
Exterior2nd      0
MasVnrType    

In [None]:
print('Missing values before imputation \n', test_num1.isnull().sum())
test_num1.fillna(train_num1.median(), inplace = True)
print('\n')
print('Missing values after imputation \n', test_num1.isnull().sum())

Missing values before imputation 
 MSSubClass        0
LotFrontage      71
OverallQual       0
OverallCond       0
MasVnrArea        2
BsmtFinSF1        0
BsmtFinSF2        0
BsmtUnfSF         0
TotalBsmtSF       0
LowQualFinSF      0
GrLivArea         0
BsmtFullBath      0
BsmtHalfBath      0
FullBath          0
HalfBath          0
BedroomAbvGr      0
KitchenAbvGr      0
TotRmsAbvGrd      0
Fireplaces        0
GarageCars        0
GarageArea        0
WoodDeckSF        0
OpenPorchSF       0
EnclosedPorch     0
3SsnPorch         0
ScreenPorch       0
PoolArea          0
dtype: int64


Missing values after imputation 
 MSSubClass       0
LotFrontage      0
OverallQual      0
OverallCond      0
MasVnrArea       0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
TotRmsAbvGrd     0
Fireplaces       0
GarageCars     

In [None]:
train_x5 = pd.concat([train_num1, train_cat1], axis = 1)
test_x5 = pd.concat([test_num1, test_cat1], axis = 1) 

In [None]:
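# Note: the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2.
# The encoder is fit on every column of train_x5, so the numeric columns are one-hot encoded as well.
encoder1 = OneHotEncoder(sparse = False, handle_unknown = 'ignore')
encoder1.fit(train_x5)
train_x5 = pd.DataFrame(encoder1.transform(train_x5), columns = encoder1.get_feature_names_out())
test_x5 = pd.DataFrame(encoder1.transform(test_x5), columns = encoder1.get_feature_names_out())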



In [None]:
scaler1 = StandardScaler()
scaler1.fit(train_x5)
train_x5 = pd.DataFrame(scaler1.transform(train_x5), columns = train_x5.columns)
test_x5 = pd.DataFrame(scaler1.transform(test_x5), columns = test_x5.columns)

In [None]:
model5 = LinearRegression()
model6 = DecisionTreeRegressor(random_state = 69)
model7 = RandomForestRegressor(random_state = 69)
model8 = KNeighborsRegressor()

In [None]:
model5.fit(train_x5, train_y4)
pred1 = model5.predict(test_x5)

In [None]:
# mean_absolute_error is already imported at the top of the notebook
mae1 = mean_absolute_error(test_y4, pred1)

In [None]:
mae1

2.5146330042665444e+16

In [None]:
model6.fit(train_x5, train_y4)
pred2 = model6.predict(test_x5)

mae2 = mean_absolute_error(test_y4, pred2)

mae2

31779.45205479452

In [None]:
model7.fit(train_x5, train_y4)
pred3 = model7.predict(test_x5)

mae3 = mean_absolute_error(test_y4, pred3)

mae3

22207.57888584475

In [None]:
model8.fit(train_x5, train_y4)
pred4 = model8.predict(test_x5)

mae4 = mean_absolute_error(test_y4, pred4)

mae4

60116.882739726025