# Mini Project: Regression with the Kaggle housing dataset using PyCaret for automated ML

This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. 

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
 
![image](https://user-images.githubusercontent.com/43855029/156053760-007e3d08-3472-47e5-ba96-c07d8d3fa325.png)

_**Project description:**_

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. 


For simplicity, I downloaded the data for you and put it here:
https://github.com/vuminhtue/SMU_Data_Science_workflow_R/tree/master/data/Kaggle_house_prices


## 10.1 Understand the data

There are 4 files in this folder: 
- train.csv: the training data, with 1460 rows and 81 columns. The last column, "**SalePrice**", is the output and holds a continuous value
- test.csv: the test data, with 1459 rows and 80 columns. Note: there is no "**SalePrice**" column
- data_description.txt: contains information on all columns
- sample_submission.csv: a template showing how to save the model predictions for upload to the Kaggle competition

**Objective:**
- We will use the **train.csv** data to create the actual train/test sets and apply several algorithms to find the best ML algorithm for this data
- Once the model is built and trained, apply it to **test.csv** and write the output in the format of sample_submission.csv
- Write all analyses in Rmd format.
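
The second objective, writing predictions in the layout of sample_submission.csv, could be sketched as follows. This is only an illustration: it assumes the two-column `Id`, `SalePrice` layout used by this competition, and the ids and prices below are placeholder values, not model output.

```python
import io

import pandas as pd

# Placeholder ids and predictions (hypothetical values for illustration only)
test_ids = [1461, 1462, 1463]
predictions = [169000.0, 187000.0, 183500.0]

# One row per test house: its Id and the predicted SalePrice
submission = pd.DataFrame({"Id": test_ids, "SalePrice": predictions})

# Write to an in-memory buffer here; in practice pass a file name such as
# "submission.csv" to to_csv instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters here: without it pandas writes the row index as an extra unnamed first column.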


## Step 1: Load data from Kaggle housing dataset

In [2]:
import pandas as pd
import numpy as np

from pycaret.regression import *

# train_test_split creates the hold-out split used in Step 2
from sklearn.model_selection import train_test_split



In [3]:
df_train = pd.read_csv("https://raw.githubusercontent.com/vuminhtue/SMU_Data_Science_workflow_R/master/data/Kaggle_house_prices/train.csv")

In [4]:
df_test = pd.read_csv("https://raw.githubusercontent.com/vuminhtue/SMU_Data_Science_workflow_R/master/data/Kaggle_house_prices/test.csv")

## Step 2: Select variables

- There are 80 input columns (79 explanatory variables plus "Id"). Using all of them risks collinearity, so we work with a subset.
- For simplicity, a small subset such as "OverallQual","OverallCond","YearBuilt","1stFlrSF","FullBath","GarageCars","SalePrice" could be selected. The cells below keep all numerical columns instead.
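
One way to look for collinearity among candidate inputs is to inspect their pairwise correlations. The sketch below uses a small synthetic frame as a stand-in for the housing data (the column names are borrowed from the dataset, the values are made up), and the 0.9 threshold is an arbitrary assumption rather than a rule.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are made up)
rng = np.random.default_rng(0)
n = 200
first_floor = rng.normal(1100, 300, n)
toy = pd.DataFrame({
    "1stFlrSF": first_floor,
    # Nearly a copy of 1stFlrSF, so the two columns are strongly collinear
    "TotalBsmtSF": 0.95 * first_floor + rng.normal(0, 30, n),
    "OverallQual": rng.integers(1, 11, n),
    "SalePrice": 100 * first_floor + rng.normal(0, 20000, n),
})

# Absolute pairwise correlations between the input columns only
features = toy.drop(columns="SalePrice")
corr = features.corr().abs()

# Keep the strict upper triangle so each pair is listed once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Pairs whose correlation exceeds the (assumed) 0.9 threshold
collinear_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.9
]
print(collinear_pairs)
```

For each flagged pair, dropping one of the two columns is a common first remedy.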



In [5]:
# Split the columns by dtype: numerical columns are used for modelling below
numerical = df_train.select_dtypes(exclude=['object'])
categorical = df_train.select_dtypes(include=['object'])
numerical.columns

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [6]:
# Hold out 30% of the rows of train.csv as an internal test set
d_train, d_test = train_test_split(numerical, random_state=100, test_size=0.3)
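
As a quick sanity check on a 70/30 split like this one, the sketch below repeats it on a synthetic frame with the same number of rows as train.csv (1460) and confirms that the two parts are disjoint and together cover every row. The frame and its values are stand-ins, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with as many rows as train.csv
toy = pd.DataFrame({
    "Id": np.arange(1, 1461),
    "SalePrice": np.linspace(50000, 500000, 1460),
})

# Same arguments as the split on the real data
toy_train, toy_test = train_test_split(toy, random_state=100, test_size=0.3)

# No row should appear in both parts
overlap = set(toy_train["Id"]) & set(toy_test["Id"])
print(len(toy_train), len(toy_test), len(overlap))
```

Fixing `random_state` makes the split reproducible, so later cells see the same rows on every run.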

In [7]:
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [10]:
# Initialize the PyCaret experiment on the training split, with SalePrice as the target
setup_df = setup(data=d_train, target='SalePrice', html=False, silent=True, verbose=False)

In [9]:
# Train and cross-validate the available regressors and keep the 3 best
best = compare_models(n_select=3)

Results reported before the run was interrupted, each model listed once, sorted by R2:

| ID | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| rf | Random Forest Regressor | 19119.39 | 1129643000.0 | 32118.74 | 0.8148 | 0.153 | 0.1105 | 0.146 |
| gbr | Gradient Boosting Regressor | 18511.77 | 1136253000.0 | 31292.85 | 0.8134 | 0.1442 | 0.1048 | 0.063 |
| et | Extra Trees Regressor | 21044.51 | 1196369000.0 | 33637.15 | 0.8043 | 0.1646 | 0.1202 | 0.137 |
| ridge | Ridge Regression | 20159.34 | 1150912000.0 | 31968.75 | 0.8 | 0.2003 | 0.1163 | 1.113 |
| en | Elastic Net | 22277.06 | 1341069000.0 | 35127.81 | 0.7765 | 0.1771 | 0.1297 | 0.725 |
| br | Bayesian Ridge | 23743.83 | 1474366000.0 | 37181.31 | 0.7525 | 0.1892 | 0.1378 | 0.027 |
| omp | Orthogonal Matching Pursuit | 20984.82 | 1448410000.0 | 35509.15 | 0.735 | 0.1659 | 0.124 | 0.019 |
| ada | AdaBoost Regressor | 26597.48 | 1612824000.0 | 39367.64 | 0.734 | 0.2022 | 0.165 | 0.051 |
| llar | Lasso Least Angle Regression | 21576.28 | 1435693000.0 | 35433.47 | 0.7332 | 0.1703 | 0.125 | 0.027 |
| lasso | Lasso Regression | 21924.45 | 1458205000.0 | 35774.62 | 0.7293 | 0.174 | 0.1272 | 1.12 |
| lr | Linear Regression | 22718.53 | 1470588000.0 | 36155.16 | 0.7286 | 0.1872 | 0.1317 | 1.813 |
| huber | Huber Regressor | 26994.72 | 1717115000.0 | 40687.3 | 0.7059 | 0.2041 | 0.1582 | 0.036 |
| dt | Decision Tree Regressor | 30045.6 | 2382721000.0 | 46798.79 | 0.612 | 0.2397 | 0.1739 | 0.022 |
| knn | K Neighbors Regressor | 33143.34 | 2535591000.0 | 49851.21 | 0.5879 | 0.2375 | 0.1878 | 0.032 |
| par | Passive Aggressive Regressor | 43333.87 | 7481203000.0 | 70352.98 | -0.2989 | 0.3412 | 0.2386 | 0.019 |
| lar | Least Angle Regression | 1.345243e+25 | 3.943071e+52 | 6.416816e+25 | -6.754242e+42 | 30.6839 | 8.870882e+19 | 0.029 |


KeyboardInterrupt: 
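
To make the columns of the comparison table easier to read, here is a minimal sketch of how MAE, MSE, RMSE, R2, RMSLE and MAPE can be computed with NumPy. The helper `regression_metrics` and the four example prices are hypothetical, and the formulas are the textbook definitions, which may differ in small details from what PyCaret does internally.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Textbook regression metrics for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    log_err = np.log1p(y_pred) - np.log1p(y_true)      # error on the log scale
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1 - ss_res / ss_tot,
        "RMSLE": np.sqrt(np.mean(log_err ** 2)),
        "MAPE": np.mean(np.abs(err) / np.abs(y_true)),
    }


# Made-up sale prices and predictions, each off by 10,000
y_true = [200000, 150000, 300000, 250000]
y_pred = [210000, 140000, 290000, 260000]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MAE and RMSE are in the units of the target (dollars), while RMSLE and MAPE are relative measures, which is why they stay small even when the dollar errors are in the tens of thousands.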
