Link https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

- The original problem description is on Kaggle at the link above. SalePrice is the outcome variable (the cleaned file used below names this column SalesPrice). Use a clean version of the data that has been treated for null values.
- Find the number of categorical and continuous variables.
- Some observations contain null values in SalesPrice. Drop those records from the analysis.
- Divide the data into training and test sets in a 70/30 ratio with seed = 1.
- Build models to estimate SalePrice, excluding Id as a feature: linear regression, Lasso, Ridge and Elastic Net. Calculate R2 and RMSE for each.
- Take the log of the sale price. Does the R2 score improve?
- Try a model with polynomial terms of degree 2.
- Try PCA. How many principal components are required to retain 99% of the variance?
- Try feature selection: find the 10 most significant features of the dataset.
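
The PCA item in the list above asks how many components retain 99% of the variance. A minimal sketch of one way to answer that is shown here on synthetic data; the array `X_demo` and its shape are assumptions made for illustration and are not the housing data.

```python
import numpy as np
from sklearn import decomposition, preprocessing

# Synthetic stand-in (assumption): 20 observed columns driven by 5 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 20))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(300, 20))

# PCA is scale-sensitive, so standardize the columns first.
X_scaled = preprocessing.StandardScaler().fit_transform(X_demo)

# Fit with all components, then accumulate the explained-variance ratios.
pca = decomposition.PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Index of the first cumulative value that reaches 0.99, converted to a count.
n_components_99 = int(np.searchsorted(cumulative, 0.99) + 1)
print("components needed for 99% variance:", n_components_99)
```

scikit-learn also accepts a fraction directly, as in `decomposition.PCA(n_components=0.99)`, and exposes the chosen count as `n_components_` after fitting.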


In [10]:
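import numpy as np
import pandas as pd
# Import only the scikit-learn submodules used below instead of a wildcard import.
from sklearn import linear_model, metrics, model_selection, pipeline, preprocessing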

In [8]:
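df = pd.read_csv("/data/kaggle/data_combined_cleaned.csv")
# dropna() drops rows with a null in any column, which includes rows with a missing SalesPrice.
df = df.dropna()
# Id is a row identifier, not a feature.
df = df.drop(columns="Id")
df.head()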

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,LotConfig,LandSlope,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalesPrice
0,60,RL,65.0,8450,Pave,,Reg,Lvl,Inside,Gtl,...,0,,,,0,2,2008,WD,Normal,208500.0
1,20,RL,80.0,9600,Pave,,Reg,Lvl,FR2,Gtl,...,0,,,,0,5,2007,WD,Normal,181500.0
2,60,RL,68.0,11250,Pave,,IR1,Lvl,Inside,Gtl,...,0,,,,0,9,2008,WD,Normal,223500.0
3,70,RL,60.0,9550,Pave,,IR1,Lvl,Corner,Gtl,...,0,,,,0,2,2006,WD,Abnorml,140000.0
4,60,RL,84.0,14260,Pave,,IR1,Lvl,FR2,Gtl,...,0,,,,0,12,2008,WD,Normal,250000.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Data columns (total 80 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non
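
The dtype listing above can be condensed into the counts that the task asks for. The sketch below is one possible way to do it, run on a tiny made-up frame; the helper `count_column_kinds` and the sample values are hypothetical, only the column names are borrowed from the dataset.

```python
import pandas as pd

# Tiny stand-in frame (assumption): names come from the dataset, values are made up.
demo = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "Street": ["Pave", "Pave", "Grvl"],
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "SalesPrice": [208500.0, 181500.0, 223500.0],
})

def count_column_kinds(frame, target):
    """Return (n_categorical, n_numeric) among the feature columns."""
    features = frame.drop(columns=[target])
    # Treat every non-numeric column as categorical.
    categorical = features.select_dtypes(exclude="number").columns
    numeric = features.select_dtypes(include="number").columns
    return len(categorical), len(numeric)

n_cat, n_num = count_column_kinds(demo, target="SalesPrice")
print("categorical:", n_cat, "numeric:", n_num)
```

A dtype-based count treats integer-coded categories such as MSSubClass as numeric, so the split may need manual adjustment for columns like that.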

In [12]:
target = "SalesPrice"

# Model the log of the price, as the task suggests.
y = np.log(df[target])
X = df.drop(columns=target)

# One-hot encode the categorical columns, dropping the first level of each.
X_dummy = pd.get_dummies(X, drop_first=True)

# 70/30 train/test split with seed 1.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_dummy.values, y, test_size=0.3, random_state=1)

pipe = pipeline.Pipeline([
    #("poly", preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.LinearRegression())
])

pipe.fit(X_train, y_train)

# Keep a handle on the fitted estimator so its coefficients can be inspected later.
est = pipe.named_steps["est"]

y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)

print("r2 score on train", metrics.r2_score(y_train, y_train_pred))
print("r2 score on test", metrics.r2_score(y_test, y_test_pred))

# The target is log-transformed, so RMSE is in log-price units.
print("rmse on train", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
print("rmse score on test", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

r2 score on train 0.9494518297088844
r2 score on test -3.0715943247509893e+22
rmse on train 0.08606817747710854
rmse score on test 76190488351.17424


In [16]:
summary = pd.DataFrame({"feature": X_dummy.columns, "coefficient": est.coef_})
summary["mag"] = np.abs(summary.coefficient)
# Show the 10 largest coefficients by magnitude.
summary.sort_values("mag", ascending=False).head(10)

Unnamed: 0,feature,coefficient,mag
15,GrLivArea,1.501695e+11,1.501695e+11
13,2ndFlrSF,-1.233520e+11,1.233520e+11
12,1stFlrSF,-1.112376e+11,1.112376e+11
219,GarageFinish_None,-1.072428e+11,1.072428e+11
134,Exterior2nd_CBlock,8.844505e+10,8.844505e+10
120,Exterior1st_CBlock,-8.844505e+10,8.844505e+10
224,GarageQual_None,7.930425e+10,7.930425e+10
229,GarageCond_None,7.930320e+10,7.930320e+10
166,BsmtCond_None,6.149525e+10,6.149525e+10
218,GarageType_None,-5.136464e+10,5.136464e+10
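
Coefficients near 1e11 with offsetting signs, on columns such as GrLivArea, 1stFlrSF and 2ndFlrSF or the Garage*_None dummies, are a pattern that is consistent with linearly dependent feature columns. That is a likely explanation, not something this notebook verifies. The sketch below uses made-up floor areas (an assumption) to show how such a dependency appears as a rank deficiency, and that an L2 penalty keeps the coefficients bounded.

```python
import numpy as np
from sklearn import linear_model

# Made-up data (assumption): the third column is the exact sum of the first two.
rng = np.random.default_rng(1)
first_floor = rng.integers(500, 1500, size=50).astype(float)
second_floor = rng.integers(0, 1000, size=50).astype(float)
living_area = first_floor + second_floor
X_demo = np.column_stack([first_floor, second_floor, living_area])
y_demo = 0.001 * living_area + rng.normal(0, 0.05, size=50)

# A rank below the column count means the columns are linearly dependent.
rank = np.linalg.matrix_rank(X_demo)
print("columns:", X_demo.shape[1], "rank:", rank)

# Ridge adds an L2 penalty, which keeps the coefficients small and finite.
ridge = linear_model.Ridge(alpha=1.0).fit(X_demo, y_demo)
print("ridge coefficients:", ridge.coef_)
```

Running `np.linalg.matrix_rank` on the scaled training matrix would be one way to check whether the same thing happens with the housing features.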


In [23]:
target = "SalesPrice"

# Model the log of the price, as the task suggests.
y = np.log(df[target])
X = df.drop(columns=target)

# One-hot encode the categorical columns, dropping the first level of each.
X_dummy = pd.get_dummies(X, drop_first=True)

# 70/30 train/test split with seed 1.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_dummy.values, y, test_size=0.3, random_state=1)

pipe = pipeline.Pipeline([
    #("poly", preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    # The L1 penalty shrinks coefficients and sets many of them to exactly zero.
    ("est", linear_model.Lasso(alpha=0.005, random_state=1))
])

pipe.fit(X_train, y_train)

# Keep a handle on the fitted estimator so its coefficients can be inspected later.
est = pipe.named_steps["est"]

y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)

print("r2 score on train", metrics.r2_score(y_train, y_train_pred))
print("r2 score on test", metrics.r2_score(y_test, y_test_pred))

# The target is log-transformed, so RMSE is in log-price units.
print("rmse on train", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
print("rmse score on test", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

r2 score on train 0.9119727410198006
r2 score on test 0.8835797880453871
rmse on train 0.11357911391215857
rmse score on test 0.14833137474267682
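
The task list also names Ridge and Elastic Net, which fit the same pipeline shape as the Lasso cell above. The sketch below runs both on synthetic regression data; the data and the `alpha` and `l1_ratio` values are placeholder assumptions and have not been tuned for the housing data.

```python
import numpy as np
from sklearn import (datasets, linear_model, metrics, model_selection,
                     pipeline, preprocessing)

# Synthetic stand-in for the housing features (assumption).
X_demo, y_demo = datasets.make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# Placeholder hyperparameters (assumption).
candidates = {
    "ridge": linear_model.Ridge(alpha=1.0),
    "elastic net": linear_model.ElasticNet(alpha=0.005, l1_ratio=0.5, random_state=1),
}

scores = {}
for name, estimator in candidates.items():
    # Same two-step pipeline as above: scale, then fit the estimator.
    model = pipeline.Pipeline([
        ("scaler", preprocessing.StandardScaler()),
        ("est", estimator),
    ])
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    r2 = metrics.r2_score(y_te, pred)
    rmse = np.sqrt(metrics.mean_squared_error(y_te, pred))
    scores[name] = (r2, rmse)
    print(name, "r2 on test", r2, "rmse on test", rmse)
```

With the housing data, `X_tr`, `X_te`, `y_tr` and `y_te` would be replaced by the split from the cell above, and the penalties would normally be chosen by cross-validation.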


In [24]:
summary = pd.DataFrame({"feature": X_dummy.columns, "coefficient": est.coef_})
summary["mag"] = np.abs(summary.coefficient)
# Show the 10 largest coefficients by magnitude.
summary.sort_values("mag", ascending=False).head(10)

Unnamed: 0,feature,coefficient,mag
15,GrLivArea,0.095050,0.095050
3,OverallQual,0.089564,0.089564
25,GarageCars,0.050550,0.050550
5,YearBuilt,0.035904,0.035904
4,OverallCond,0.029188,0.029188
70,Neighborhood_NridgHt,0.026866,0.026866
60,Neighborhood_Crawfor,0.026328,0.026328
39,MSZoning_RM,-0.021550,0.021550
90,Condition2_PosN,-0.021206,0.021206
16,BsmtFullBath,0.020075,0.020075
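
Ranking standardized Lasso coefficients, as in the table above, is one way to pick the 10 most significant features. A different technique, univariate selection with `SelectKBest` and `f_regression`, is sketched below for comparison. It runs on synthetic data with made-up column names (an assumption) and scores each feature on its own, so it will not necessarily agree with the Lasso ranking.

```python
import pandas as pd
from sklearn import datasets, feature_selection

# Synthetic stand-in with hypothetical column names (assumption).
X_arr, y_demo = datasets.make_regression(
    n_samples=200, n_features=30, n_informative=10, noise=5.0, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])

# Score each feature independently with a univariate F-test and keep the top 10.
selector = feature_selection.SelectKBest(
    score_func=feature_selection.f_regression, k=10)
selector.fit(X_demo, y_demo)

top_features = X_demo.columns[selector.get_support()].tolist()
ranking = (pd.Series(selector.scores_, index=X_demo.columns)
           .sort_values(ascending=False)
           .head(10))
print(ranking)
```

On the housing data the selector would be fit on the one-hot encoded training matrix, so that the reported names line up with `X_dummy.columns`.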
