**Table of contents**<a id='toc0_'></a>    
- [Baseline](#toc1_1_)    
  - [Predict logarithm](#toc1_2_)    
  - [Choose top 10 features](#toc1_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

To run baseline experiments I will use PyCaret, as it makes this sort of thing easier at the beginning. PyCaret is an AutoML library, i.e. it automates typical ML tasks: data scaling, null handling, one-hot encoding, the train-test split, and model training are all handled by the same library.

For a full-on classification tutorial, I recommend checking [their tutorials](https://www.pycaret.org/tutorials/html/CLF101.html). 

In [11]:
import pandas as pd
import plotly.express as px

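# Import only the two functions this notebook uses instead of a wildcard import
from pycaret.regression import compare_models, setup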

In [2]:
df = pd.read_csv("../data/raw/train.csv")

## <a id='toc1_1_'></a>[Baseline](#toc0_)

In [6]:
experiment = setup(data=df, target='SalePrice', session_id=123) 

|   | Description | Value |
| --- | --- | --- |
| 0 | Session id | 123 |
| 1 | Target | SalePrice |
| 2 | Target type | Regression |
| 3 | Original data shape | (1460, 81) |
| 4 | Transformed data shape | (1460, 279) |
| 5 | Transformed train set shape | (1021, 279) |
| 6 | Transformed test set shape | (439, 279) |
| 7 | Numeric features | 37 |
| 8 | Categorical features | 43 |
| 9 | Rows with missing values | 100.0% |


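From the setup summary and PyCaret's defaults, we see that:
- all null values are imputed, numeric features with the mean and categorical features with the mode
- 10-fold cross-validation is used, so the evaluation metrics are averaged over folds and are more reliable than a single split
- categorical features with 25 or fewer categories are one-hot encoded; features with more categories go through a separate encoder instead (PyCaret's `encoding_method`, which in PyCaret 3 appears to default to a target-style encoder rather than plain label encoding)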
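To make the imputation and encoding bullets concrete, here is a minimal pandas sketch of mean/mode imputation followed by one-hot encoding. It only illustrates the idea: the toy frame and its values are made up, and this is not PyCaret's actual pipeline.

```python
import pandas as pd

# Toy data (invented values), one numeric and one categorical column with gaps
toy = pd.DataFrame({
    "LotFrontage": [60.0, None, 80.0, 70.0],
    "GarageFinish": ["Unf", "Fin", None, "Unf"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.columns.difference(numeric_cols)

imputed = toy.copy()
# Numeric features: fill gaps with the column mean
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].mean())
# Categorical features: fill gaps with the most frequent value (mode)
for col in categorical_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

# One-hot encode the categorical columns; new columns are named <column>_<category>
encoded = pd.get_dummies(imputed, columns=list(categorical_cols))
print(encoded)
```

The `<column>_<category>` naming is also why encoded feature names such as `GarageFinish_Unf` show up later in the feature importances.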

In [8]:
model = compare_models() # This is as easy as it gets

|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gbr | Gradient Boosting Regressor | 17276.3461 | 847357928.118 | 28313.9245 | 0.8723 | 0.1377 | 0.1 | 0.921 |
| lightgbm | Light Gradient Boosting Machine | 17701.1514 | 1019922835.2337 | 31002.5633 | 0.8484 | 0.1449 | 0.102 | 0.945 |
| rf | Random Forest Regressor | 19042.9144 | 1117550843.2233 | 32426.1601 | 0.8345 | 0.1545 | 0.1114 | 1.525 |
| et | Extra Trees Regressor | 18873.1767 | 1163678875.5551 | 32786.948 | 0.8327 | 0.1513 | 0.1095 | 1.584 |
| ada | AdaBoost Regressor | 25963.6459 | 1417400709.1888 | 37051.1993 | 0.7868 | 0.2056 | 0.1679 | 0.836 |
| llar | Lasso Least Angle Regression | 18774.5227 | 1406152263.2702 | 34868.7879 | 0.7845 | 0.167 | 0.1124 | 0.578 |
| ridge | Ridge Regression | 20081.3998 | 1526032661.7945 | 36561.0633 | 0.7668 | 0.2087 | 0.1213 | 0.349 |
| en | Elastic Net | 21115.3002 | 1782962549.6151 | 38958.876 | 0.7382 | 0.1739 | 0.1218 | 0.562 |
| omp | Orthogonal Matching Pursuit | 22617.0267 | 1808808421.5105 | 39503.146 | 0.7344 | 0.1856 | 0.1343 | 0.575 |
| lasso | Lasso Regression | 20570.3408 | 1877626177.9343 | 40440.0086 | 0.6892 | 0.1946 | 0.1249 | 0.473 |


In [9]:
model

Initial thoughts:
- Gradient boosting models (GBR, LightGBM) perform best, followed by the tree ensembles (Random Forest, Extra Trees) and some linear models (Lasso Least Angle, Ridge)

Questions:
- Okay but why? And what are the training scores?
- What features are most important?
- What is the submission score for this model? (We wouldn't have this in a real-life scenario)

In [17]:
px.histogram(x=model.feature_importances_, y=model.feature_names_in_).update_yaxes(categoryorder='total ascending').update_layout(height=2_000)

As I expected, most features are unimportant. Most of the performance comes from the OverallQual feature, then square-feet features.
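One way to put a number on "most features are unimportant" is to look at the cumulative share of importance covered by the top features. The sketch below is only an illustration: `cumulative_share` is a hypothetical helper, and the importance values are invented, not taken from the model above.

```python
def cumulative_share(importances):
    """Return (feature, cumulative share of total importance), most important first."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    running = 0.0
    shares = []
    for name, value in ranked:
        running += value
        shares.append((name, running / total))
    return shares


# Invented importances, for illustration only
toy_importances = {
    "GrLivArea": 0.25,
    "OverallQual": 0.5,
    "TotalBsmtSF": 0.125,
    "PoolArea": 0.125,
}

shares = cumulative_share(toy_importances)
for name, share in shares:
    print(f"{name:<12} {share:.3f}")
```

With the real model, the same idea could be applied to `dict(zip(model.feature_names_in_, model.feature_importances_))`.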

> Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

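This suggests that we could predict the logarithm of the house price instead of the house price directly. That way we train on roughly the same scale the competition metric uses, where relative errors on cheap and expensive houses count equally. Note that PyCaret's `transform_target=True`, used in the next section, applies a power transform (Yeo-Johnson by default in PyCaret 3, as far as I know) rather than a plain logarithm, so it is an approximation of this idea.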
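To see why the log makes cheap and expensive houses count equally, here is a small sketch of the metric as the quote describes it: RMSE between the log of the prediction and the log of the observed price. The function name and the prices are my own illustration, not Kaggle's code.

```python
import math


def rmse_of_logs(actual, predicted):
    """RMSE between log(predicted) and log(actual), as described in the quote above."""
    squared_errors = [
        (math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Overshooting by 10% costs the same on a cheap and on an expensive house,
# even though the absolute errors are 10,000 and 50,000
cheap_error = rmse_of_logs([100_000], [110_000])
expensive_error = rmse_of_logs([500_000], [550_000])
print(cheap_error, expensive_error)
```

Both errors come out as roughly `log(1.1)`, because a difference of logs only depends on the ratio between prediction and truth.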

## <a id='toc1_2_'></a>[Predict logarithm](#toc0_)

In [18]:
experiment = setup(data=df, target='SalePrice', session_id=123, transform_target=True) 

|   | Description | Value |
| --- | --- | --- |
| 0 | Session id | 123 |
| 1 | Target | SalePrice |
| 2 | Target type | Regression |
| 3 | Original data shape | (1460, 81) |
| 4 | Transformed data shape | (1460, 279) |
| 5 | Transformed train set shape | (1021, 279) |
| 6 | Transformed test set shape | (439, 279) |
| 7 | Numeric features | 37 |
| 8 | Categorical features | 43 |
| 9 | Rows with missing values | 100.0% |


In [19]:
model = compare_models()

|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gbr | Gradient Boosting Regressor | 17276.3652 | 913754745.6682 | 29505.4466 | 0.8652 | 0.1363 | 0.0966 | 0.978 |
| lightgbm | Light Gradient Boosting Machine | 17583.7265 | 945393828.1244 | 29735.392 | 0.8624 | 0.1391 | 0.0994 | 0.58 |
| rf | Random Forest Regressor | 19212.4509 | 1178677119.133 | 33263.5075 | 0.8285 | 0.1534 | 0.1091 | 1.252 |
| et | Extra Trees Regressor | 19234.3504 | 1224647550.2572 | 33770.3099 | 0.8227 | 0.1499 | 0.1065 | 1.815 |
| ada | AdaBoost Regressor | 24418.0172 | 1397201132.0592 | 36822.8107 | 0.7903 | 0.181 | 0.1374 | 0.76 |
| dt | Decision Tree Regressor | 27861.3533 | 2036059508.0529 | 44280.1915 | 0.6921 | 0.222 | 0.1603 | 0.362 |
| knn | K Neighbors Regressor | 31414.7438 | 2408383648.3254 | 48373.9598 | 0.6336 | 0.2356 | 0.1798 | 0.376 |
| dummy | Dummy Regressor | 56215.4821 | 6807981247.1675 | 81902.496 | -0.0419 | 0.4058 | 0.3279 | 0.335 |
| br | Bayesian Ridge | 19740.0895 | 8453507506.8693 | 55153.4783 | -0.2565 | 0.1584 | 0.1143 | 0.406 |
| ridge | Ridge Regression | 20087.8355 | 9250155922.015 | 56507.1839 | -0.3793 | 0.1656 | 0.1181 | 0.365 |


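The R2 score of the best model got a bit worse (0.8723 to 0.8652 for GBR), but that's not necessarily a bad thing: RMSLE, which is the metric the competition actually scores, improved slightly (0.1377 to 0.1363). The ultimate goal is not JUST the best model, but a model that reflects reality and is suited to our needs.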
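R2 and RMSLE can rank two models differently, because R2 is dominated by large absolute errors while RMSLE looks at relative errors. The sketch below shows this on invented prices and two hypothetical models; it says nothing about the actual models in this notebook.

```python
import math


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


def rmsle(actual, predicted):
    """RMSE between log(predicted) and log(actual)."""
    squared = [(math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Invented prices: three ordinary houses and one expensive one
actual = [100_000, 200_000, 300_000, 1_000_000]
# Hypothetical model A: perfect, except 30% too low on the expensive house
model_a = [100_000, 200_000, 300_000, 700_000]
# Hypothetical model B: 30% too high on every ordinary house, perfect on the expensive one
model_b = [130_000, 260_000, 390_000, 1_000_000]

r2_a, r2_b = r2_score(actual, model_a), r2_score(actual, model_b)
rmsle_a, rmsle_b = rmsle(actual, model_a), rmsle(actual, model_b)
print(f"A: R2={r2_a:.4f} RMSLE={rmsle_a:.4f}")
print(f"B: R2={r2_b:.4f} RMSLE={rmsle_b:.4f}")
```

Model A has the lower (worse) R2 but also the lower (better) RMSLE, so a small drop in R2 does not by itself mean the model got worse on the competition metric.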

In [20]:
px.histogram(x=model.feature_importances_, y=model.feature_names_in_).update_yaxes(categoryorder='total ascending').update_layout(height=2_000)

Top features are mostly the same, with some changes in order.

## <a id='toc1_3_'></a>[Choose top 10 features](#toc0_)

Although this is a Kaggle competition, in real life you wouldn't collect data unless it had a measurable effect on the model. That's why the next question is: How good will the model be with just 10 features?

In [21]:
features = pd.Series(model.feature_importances_, index=model.feature_names_in_)
top_10_features = features.sort_values(ascending=False)[:10].index.to_list()
top_10_features

['OverallQual',
 'GrLivArea',
 'TotalBsmtSF',
 'GarageCars',
 'YearBuilt',
 'BsmtFinSF1',
 'GarageFinish_Unf',
 'OverallCond',
 '1stFlrSF',
 'CentralAir']

In [26]:
top_10_original = ['OverallQual',
 'GrLivArea',
 'TotalBsmtSF',
 'GarageCars',
 'YearBuilt',
 'BsmtFinSF1',
 'GarageFinish',
 'OverallCond',
 '1stFlrSF',
 'CentralAir']
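The list above was written by hand, replacing the one-hot name `GarageFinish_Unf` with its source column `GarageFinish`. A small helper could do that mapping automatically. This is just a sketch: `to_original_columns` is a hypothetical function, it assumes encoded names follow the `<column>_<category>` pattern, and `original_columns` here is a shortened stand-in for `df.columns`.

```python
def to_original_columns(encoded_names, original_columns):
    """Map transformed feature names back to the raw columns they came from."""
    originals = []
    for name in encoded_names:
        if name in original_columns:
            match = name
        else:
            # One-hot columns are assumed to be named <column>_<category>;
            # prefer the longest matching column name to avoid partial matches
            candidates = [c for c in original_columns if name.startswith(c + "_")]
            if not candidates:
                raise ValueError(f"No original column found for {name!r}")
            match = max(candidates, key=len)
        if match not in originals:
            originals.append(match)
    return originals


# Shortened stand-in for list(df.columns)
original_columns = [
    "OverallQual", "OverallCond", "GrLivArea", "TotalBsmtSF", "BsmtFinSF1",
    "1stFlrSF", "GarageCars", "GarageType", "GarageFinish", "YearBuilt", "CentralAir",
]
encoded_top_10 = [
    "OverallQual", "GrLivArea", "TotalBsmtSF", "GarageCars", "YearBuilt",
    "BsmtFinSF1", "GarageFinish_Unf", "OverallCond", "1stFlrSF", "CentralAir",
]

mapped = to_original_columns(encoded_top_10, original_columns)
print(mapped)
```

If two encoded features come from the same raw column, the helper keeps that column only once, so the result can be shorter than the input list.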

In [29]:
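experiment = setup(
    data=df[top_10_original + ['SalePrice']],  # subset the DataFrame to the top 10 raw columns
    target='SalePrice', session_id=123, transform_target=True,
    # keep_features=top_10_features is not used: as far as I can tell it only protects columns
    # from being dropped by preprocessing; it does not restrict the model to them
)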

|   | Description | Value |
| --- | --- | --- |
| 0 | Session id | 123 |
| 1 | Target | SalePrice |
| 2 | Target type | Regression |
| 3 | Original data shape | (1460, 11) |
| 4 | Transformed data shape | (1460, 13) |
| 5 | Transformed train set shape | (1021, 13) |
| 6 | Transformed test set shape | (439, 13) |
| 7 | Numeric features | 8 |
| 8 | Categorical features | 2 |
| 9 | Rows with missing values | 5.5% |


In [30]:
model = compare_models()

|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gbr | Gradient Boosting Regressor | 18565.7643 | 955728702.4117 | 30088.3113 | 0.8544 | 0.1472 | 0.1053 | 0.113 |
| lightgbm | Light Gradient Boosting Machine | 19050.3748 | 1076515344.2083 | 31819.6182 | 0.8425 | 0.1511 | 0.1092 | 0.154 |
| et | Extra Trees Regressor | 19658.1046 | 1098603780.4682 | 31861.0855 | 0.8378 | 0.153 | 0.1124 | 0.144 |
| rf | Random Forest Regressor | 19419.1853 | 1168984004.3756 | 32903.7238 | 0.8297 | 0.1545 | 0.1114 | 0.24 |
| ada | AdaBoost Regressor | 25216.371 | 1454402106.9898 | 37489.2033 | 0.7805 | 0.1862 | 0.1415 | 0.088 |
| dt | Decision Tree Regressor | 26329.3488 | 1806888809.0821 | 41888.5385 | 0.7201 | 0.2155 | 0.1545 | 0.051 |
| knn | K Neighbors Regressor | 30733.1099 | 2396152385.3653 | 48519.4767 | 0.6301 | 0.2315 | 0.1794 | 0.065 |
| dummy | Dummy Regressor | 56215.4821 | 6807981247.1675 | 81902.496 | -0.0419 | 0.4058 | 0.3279 | 0.039 |
| omp | Orthogonal Matching Pursuit | 41314.9626 | 7333581390.4055 | 73850.5213 | -0.0977 | 0.296 | 0.24 | 0.038 |
| par | Passive Aggressive Regressor | 38428.4049 | 11566787403.2308 | 80266.7559 | -0.6798 | 0.2751 | 0.22 | 0.039 |


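The linear models now perform terribly (negative R2 for Orthogonal Matching Pursuit and Passive Aggressive Regressor). One reading is that their feature importances must've been quite different from the ones GBR had. However, Bayesian Ridge and Ridge already had negative R2 in the previous run with all features, so the target transform probably plays a part as well.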

For the GBR algorithm, using only the top 10 features gives almost the same results as using all of them (R2 0.8544 vs. 0.8652, RMSLE 0.1472 vs. 0.1363). This means we can now focus on our top 10 features for further feature engineering and perhaps come back to the rest of the features later.