### Problem Statement (Kaggle link <a href='https://www.kaggle.com/c/house-prices-advanced-regression-techniques'>here</a>)

#### Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

#### Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
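
As a rough illustration of the metric described above, the sketch below computes the RMSE between log prices with NumPy. The function name `rmse_of_logs` and the example prices are made up for illustration; this is not Kaggle's scoring code.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# A prediction that is 10% too high costs the same on a cheap house
# as on an expensive one, because only the ratio matters after the log.
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([1_000_000], [1_100_000])
print(cheap, expensive)
```

scikit-learn's `mean_squared_log_error` uses `log(1 + x)` rather than `log(x)`, and returns the squared error, so its square root is very close to this value at house-price magnitudes.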

#### Submission File Format
The file should contain a header and have the following format:

Id,SalePrice <br>
1461,169000.1 <br>
1462,187724.1233 <br>
1463,175221 <br>
etc.
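
A minimal sketch of producing a file in this layout with pandas is shown below. The three rows reuse the sample values above; writing to an in-memory buffer instead of a file is only for demonstration.

```python
import io
import pandas as pd

# Sample Ids and prices taken from the format example above.
submission = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [169000.1, 187724.1233, 175221.0],
})

buffer = io.StringIO()
# index=False keeps the DataFrame index out of the file,
# so the header is exactly "Id,SalePrice".
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```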

### Solution

In [62]:
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
import seaborn as sns
import matplotlib.pyplot as plt
from preprocessing.nv import NVUtil
InteractiveShell.ast_node_interactivity='all'

In [63]:
# Kaggle - house price advanced regression techniques
train_df = pd.read_csv(r"D:\sanooj\datascience\data\house-prices-advanced-regression-techniques\train.csv")
test_df = pd.read_csv(r"D:\sanooj\datascience\data\house-prices-advanced-regression-techniques\test.csv")

In [64]:
train_df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [65]:
test_df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,6,2006,WD,Normal
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2006,WD,Abnorml
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,9,2006,WD,Abnorml
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


In [66]:
## Do null value treatment on both train and test
excluded_columns_train,selected_columns_train,numerical_columns_train,categorical_columns_train,train_df = NVUtil.nv_treatment(train_df, 30)
excluded_columns_test,selected_columns_test,numerical_columns_test,categorical_columns_test,test_df = NVUtil.nv_treatment(test_df, 30)


Before NV treatment the stats are as below
Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
Before NV treatment the stats are as below
Id               0
MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 76, dtype: int64

Before NV treatment the stats are as below
Id                 0
MSSubClass         0
MSZoning           4
LotFrontage      227
LotArea            0
                ... 
MiscVal            0
MoSold             0
YrSold             0
SaleType           1
SaleCondition      0
Length: 80, dtype: int64
Before NV treatment the stats are as below
Id               0
MSSubClass       0
MSZoning         0
LotFrontage      0
LotA
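
The source of `NVUtil.nv_treatment` is not shown in this notebook. Judging only from the output above (columns with many nulls disappear, remaining nulls become 0), one plausible reading is: drop columns whose share of nulls exceeds the threshold, then fill what is left. The sketch below is a hypothetical stand-in under that assumption; the function name, the median and mode fill choices, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd

def null_value_treatment_sketch(df, max_null_pct):
    """Hypothetical stand-in for NVUtil.nv_treatment; not the author's code."""
    null_pct = df.isnull().mean() * 100
    # Columns with more than max_null_pct percent nulls are dropped.
    excluded = null_pct[null_pct > max_null_pct].index.tolist()
    kept = df.drop(columns=excluded).copy()
    selected = kept.columns.tolist()
    numerical = kept.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in selected if c not in numerical]
    # Assumed fill strategy: median for numbers, most frequent value otherwise.
    for col in numerical:
        kept[col] = kept[col].fillna(kept[col].median())
    for col in categorical:
        kept[col] = kept[col].fillna(kept[col].mode().iloc[0])
    return excluded, selected, numerical, categorical, kept

toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "MSZoning": ["RL", "RL", None, "RM"],
})
excluded, selected, numerical, categorical, cleaned = null_value_treatment_sketch(toy, 30)
print(excluded)
print(cleaned)
```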

In [67]:
train_df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,...,0,0,0,0,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,Reg,Lvl,AllPub,Inside,...,112,0,0,0,0,4,2010,WD,Normal,142125


In [68]:
test_df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,120,0,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,IR1,Lvl,AllPub,Corner,...,36,0,0,0,0,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,IR1,Lvl,AllPub,Inside,...,34,0,0,0,0,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,IR1,Lvl,AllPub,Inside,...,36,0,0,0,0,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,IR1,HLS,AllPub,Inside,...,82,0,0,144,0,0,1,2010,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,0,6,2006,WD,Normal
1455,2916,160,RM,21.0,1894,Pave,Reg,Lvl,AllPub,Inside,...,24,0,0,0,0,0,4,2006,WD,Abnorml
1456,2917,20,RL,160.0,20000,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,0,9,2006,WD,Abnorml
1457,2918,85,RL,62.0,10441,Pave,Reg,Lvl,AllPub,Inside,...,32,0,0,0,0,700,7,2006,WD,Normal


In [69]:
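## Label Encoders
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integer codes.
# Train and test use separately fitted encoders, so a code means the same
# thing in both frames only when the column has the same categories in each.
for i in categorical_columns_train:
    le = LabelEncoder()
    le.fit(train_df[i])
    train_df[i] = le.transform(train_df[i])

for i in categorical_columns_test:
    le = LabelEncoder()
    le.fit(test_df[i])
    test_df[i] = le.transform(test_df[i])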

LabelEncoder()
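
Because the encoders above are fitted on train and test independently, the same category can receive different codes in the two frames whenever their category sets differ. One possible alternative, sketched here on made-up data, is to fit a single encoder per column on the values from both frames. The frame names `train_toy` and `test_toy` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames whose category sets differ ("FV" only in train, "RH" only in test).
train_toy = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV"]})
test_toy = pd.DataFrame({"MSZoning": ["RM", "RH", "RL"]})

for col in ["MSZoning"]:
    le = LabelEncoder()
    # Fit on the union of values so both frames share one mapping.
    le.fit(pd.concat([train_toy[col], test_toy[col]]))
    train_toy[col] = le.transform(train_toy[col])
    test_toy[col] = le.transform(test_toy[col])

print(train_toy["MSZoning"].tolist(), test_toy["MSZoning"].tolist())
```

With this approach "RL" and "RM" map to the same integers in both frames, which is what a model trained on one frame and applied to the other relies on.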

In [70]:
train_df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,3,65.0,8450,1,3,3,0,4,...,0,0,0,0,0,2,2008,8,4,208500
1,2,20,3,80.0,9600,1,3,3,0,2,...,0,0,0,0,0,5,2007,8,4,181500
2,3,60,3,68.0,11250,1,0,3,0,4,...,0,0,0,0,0,9,2008,8,4,223500
3,4,70,3,60.0,9550,1,0,3,0,0,...,272,0,0,0,0,2,2006,8,0,140000
4,5,60,3,84.0,14260,1,0,3,0,2,...,0,0,0,0,0,12,2008,8,4,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,3,62.0,7917,1,3,3,0,4,...,0,0,0,0,0,8,2007,8,4,175000
1456,1457,20,3,85.0,13175,1,3,3,0,4,...,0,0,0,0,0,2,2010,8,4,210000
1457,1458,70,3,66.0,9042,1,3,3,0,4,...,0,0,0,0,2500,5,2010,8,4,266500
1458,1459,20,3,68.0,9717,1,3,3,0,4,...,112,0,0,0,0,4,2010,8,4,142125


In [71]:
test_df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,2,80.0,11622,1,3,3,0,4,...,0,0,0,120,0,0,6,2010,8,4
1,1462,20,3,81.0,14267,1,0,3,0,0,...,36,0,0,0,0,12500,6,2010,8,4
2,1463,60,3,74.0,13830,1,0,3,0,4,...,34,0,0,0,0,0,3,2010,8,4
3,1464,60,3,78.0,9978,1,0,3,0,4,...,36,0,0,0,0,0,6,2010,8,4
4,1465,120,3,43.0,5005,1,0,1,0,4,...,82,0,0,144,0,0,1,2010,8,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,4,21.0,1936,1,3,3,0,4,...,0,0,0,0,0,0,6,2006,8,4
1455,2916,160,4,21.0,1894,1,3,3,0,4,...,24,0,0,0,0,0,4,2006,8,0
1456,2917,20,3,160.0,20000,1,3,3,0,4,...,0,0,0,0,0,0,9,2006,8,0
1457,2918,85,3,62.0,10441,1,3,3,0,4,...,32,0,0,0,0,700,7,2006,8,4


In [72]:
x_train.shape, y_train.shape

((1022, 75), (1022,))

In [73]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error,mean_absolute_error
import math

y = train_df['SalePrice']
x = train_df.drop('SalePrice',axis=1)

### Model 1 (89, 182941)

In [74]:
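transformed_y = np.log(y)

x_train, x_test,y_train,y_test = train_test_split(x,transformed_y,test_size=0.3)

lr = LinearRegression()
lr.fit(x_train,y_train)
result = lr.predict(x_test)
result = np.exp(result)  # back from the log scale to the price scale

# Caution: y_test is still log(SalePrice) here while result is in prices,
# so both metrics compare mismatched scales; np.exp(y_test) would match result.
mean_squared_log_error(y_test, result)
mean_absolute_error(y_test, result)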

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

89.61040631492111

182941.30273533592
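
The very large values above are what one would expect when log-scale targets are compared with price-scale predictions. The sketch below, on synthetic data rather than the Kaggle set, shows an evaluation where both sides are converted back to prices first, next to the mismatched comparison. The feature, coefficients, and variable names are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_log_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=400)  # synthetic living-area feature
price = 50_000 * np.exp(0.0005 * area + rng.normal(0, 0.1, size=400))

X = area.reshape(-1, 1)
X_train, X_test, y_train_log, y_test_log = train_test_split(
    X, np.log(price), test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train_log)
pred_price = np.exp(model.predict(X_test))  # predictions back on the price scale
true_price = np.exp(y_test_log)             # undo the log on the targets too

msle = mean_squared_log_error(true_price, pred_price)
mae = mean_absolute_error(true_price, pred_price)
# Comparing log targets with price predictions inflates the error.
mismatched = mean_squared_log_error(y_test_log, pred_price)
print(np.sqrt(msle), mae, mismatched)
```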

### Model 2 (0.031, 21557)

In [76]:
x_train, x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

lr = LinearRegression()
lr.fit(x_train,y_train)
result = lr.predict(x_test)

mean_squared_log_error(y_test, result)
mean_absolute_error(y_test, result)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

0.031029530037837507

21557.45914505658

### Model 3 (89, 179192)

In [77]:
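transformed_y = np.log(y)

x_train, x_test,y_train,y_test = train_test_split(x,transformed_y,test_size=0.2)

lr = LinearRegression()
lr.fit(x_train,y_train)
result = lr.predict(x_test)
result = np.exp(result)  # back from the log scale to the price scale

# Caution: y_test is still log(SalePrice) here while result is in prices,
# so both metrics compare mismatched scales; np.exp(y_test) would match result.
mean_squared_log_error(y_test, result)
mean_absolute_error(y_test, result)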

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

89.67865621409074

179192.1787365064

### Model 4 (89, 174620)

In [79]:
x_train.describe()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
count,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,...,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0
mean,735.984589,56.288527,3.041952,69.756849,10435.001712,0.995719,1.965753,2.778253,0.000856,3.011986,...,46.388699,22.123288,2.451199,16.36387,2.893836,50.037671,6.34589,2007.798801,7.513699,3.751712
std,422.568068,40.991731,0.61469,22.012518,8899.448474,0.065316,1.399789,0.713181,0.02926,1.625586,...,67.083403,61.188474,22.43076,58.681575,40.736249,551.945856,2.704571,1.320759,1.570339,1.120808
min,1.0,20.0,0.0,21.0,1300.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,0.0,0.0
25%,368.75,20.0,3.0,60.0,7572.75,1.0,0.0,3.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,8.0,4.0
50%,732.0,50.0,3.0,69.0,9548.5,1.0,3.0,3.0,0.0,4.0,...,24.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,8.0,4.0
75%,1101.25,70.0,3.0,79.0,11600.0,1.0,3.0,3.0,0.0,4.0,...,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,8.0,4.0
max,1460.0,190.0,4.0,313.0,215245.0,1.0,3.0,3.0,1.0,4.0,...,547.0,552.0,304.0,480.0,738.0,15500.0,12.0,2010.0,8.0,5.0


#### Take the 60 columns with the highest variance

In [103]:
high_var_columns = x.var().sort_values(ascending=False)[0:60].index

In [104]:
x[high_var_columns]

Unnamed: 0,LotArea,GrLivArea,MiscVal,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,2ndFlrSF,Id,1stFlrSF,GarageArea,...,GarageFinish,GarageCars,ExterCond,Foundation,LandContour,ExterQual,BsmtCond,Fireplaces,MSZoning,MasVnrType
0,8450,1710,0,706,150,856,854,1,856,548,...,1,2,4,2,3,2,3,0,3,1
1,9600,1262,0,978,284,1262,0,2,1262,460,...,1,2,4,1,3,3,3,1,3,2
2,11250,1786,0,486,434,920,866,3,920,608,...,1,2,4,2,3,2,3,1,3,1
3,9550,1717,0,216,540,756,756,4,961,642,...,2,3,4,0,3,3,1,1,3,2
4,14260,2198,0,655,490,1145,1053,5,1145,836,...,1,3,4,2,3,2,3,1,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,7917,1647,0,0,953,953,694,1456,953,460,...,1,2,4,2,3,3,3,1,3,2
1456,13175,2073,0,790,589,1542,0,1457,2073,500,...,2,2,4,1,3,3,3,2,3,3
1457,9042,2340,2500,275,877,1152,1152,1458,1188,252,...,1,1,2,4,3,0,1,2,3,2
1458,9717,1078,0,49,0,1078,0,1459,1078,240,...,2,1,4,1,3,3,3,0,3,2
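
Note that `Id` appears among the high-variance columns above: variance depends on a column's units and range, so a row identifier can rank highly without carrying any information about price. A small sketch of the same selection with the identifier removed first is shown below; the toy values are copied from a few rows of the table above and the cut-off of one column is only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 1456],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 7917],
    "Fireplaces": [0, 1, 1, 1, 1, 1],
})

# Same ranking as above, but with the row identifier left out first.
features = toy.drop(columns=["Id"])
top_columns = features.var().sort_values(ascending=False).head(1).index
print(list(top_columns))
```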


In [105]:
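transformed_y = np.log(y)

x_train, x_test,y_train,y_test = train_test_split(x[high_var_columns],transformed_y,test_size=0.3)

lr = LinearRegression()
lr.fit(x_train,y_train)
result = lr.predict(x_test)
result = np.exp(result)  # back from the log scale to the price scale

# Caution: y_test is still log(SalePrice) here while result is in prices,
# so both metrics compare mismatched scales; np.exp(y_test) would match result.
mean_squared_log_error(y_test, result)
mean_absolute_error(y_test, result)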

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

89.1691384239556

174620.0187395167

In [107]:
# Go with this model for now and run it on test_df
final_result = lr.predict(test_df[high_var_columns])
final_result

array([11.6980958 , 11.89855987, 12.04201142, ..., 12.01096238,
       11.72686807, 12.40315428])

In [108]:
final_result = pow(math.e,final_result)
final_result

array([120342.34078897, 147054.6943308 , 169738.01121027, ...,
       164548.78609279, 123855.15667337, 243568.69192411])

In [109]:
final_result.shape

(1459,)

In [111]:
test_df['prediction'] = final_result

In [115]:
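# The submission header must be Id,SalePrice, so rename the prediction column.
test_df[['Id','prediction']].rename(columns={'prediction':'SalePrice'}).to_csv('house_price_predictions.csv',index=False)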

### TODO - Try with StandardScaler
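
One way this TODO might be approached is to chain scikit-learn's `StandardScaler` and `LinearRegression` in a pipeline, so the scaling learned on the training data is reused at prediction time. The sketch below uses synthetic features on very different scales; the data, coefficients, and variable names are assumptions, not results from the Kaggle set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(1300, 20000, 200),  # LotArea-like scale
    rng.integers(0, 4, 200),        # Fireplaces-like scale
])
# Synthetic log-price target that is linear in both features.
y_log = 11 + 0.00002 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.05, 200)

# The scaler is fitted inside the pipeline, then applied before the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y_log)

scaled = model.named_steps["standardscaler"].transform(X)
pred_price = np.exp(model.predict(X))  # back to the price scale
print(scaled.mean(axis=0), scaled.std(axis=0))
```

For plain least-squares regression, standardizing features does not change the predictions; it mainly makes coefficients comparable and matters more once a regularized model such as `Ridge` or `Lasso` is tried.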