<a href="https://colab.research.google.com/github/tushar2704/Data-Science-Master/blob/main/Copy_of_Kaggle_Workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Unleashing the mystery of Kaggle
- How does it all work?
- Feature Engineering
- Parameter Tuning
- Ensembling & Stacking

## How does it all work - Should I trust the public leaderboard?

- Each Kaggle competition has a public and a private leaderboard. The public leaderboard is scored on only part of the test dataset; the private leaderboard is scored on the remaining part and is revealed at the end of the competition.
- You can find how Kaggle calculates the public and private leaderboards [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard).
- If the competition has a large training set and a public test set that is small relative to the private test set, you can easily overfit the public test set. In this case, you **should not** trust the public leaderboard.
- If the training set and test set are collected from different time frames, you **must** trust the public leaderboard, because cross-validation on the training set cannot reflect the shift between the two periods.

![img](https://s3.amazonaws.com/nycdsabt01/s2-4.png)

### CV or LB?
- **TRUST YOUR CV!**
- Typical question on smaller datasets: 
 - “I’m doing proper cross-validation and see improvements on my CV score, but public leaderboard is so random and does not correlate at all!”
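- What top Kagglers pick most of the time:
 - Final Submission = $X*CV + (1-X)*LB$, that is, weigh the CV score and the public leaderboard score together when choosing what to submit. Typically $X=0.5$ is OK.
- Trusting CV is a hard thing to do.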
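One way to read the blending rule above is as a ranking score for candidate submissions. The sketch below is only an illustration: the candidate names and scores are made up, and it assumes a lower-is-better metric such as RMSLE.

```python
# Hypothetical candidates: (name, local CV score, public LB score); lower is better.
candidates = [
    ("gbm_v1", 0.1290, 0.1350),
    ("gbm_v2", 0.1275, 0.1410),
    ("stack_v1", 0.1260, 0.1335),
]

def blended_score(cv, lb, x=0.5):
    """Weighted mix of the CV and public LB scores: x*CV + (1-x)*LB."""
    return x * cv + (1 - x) * lb

# Rank the candidates by blended score and keep the best one.
ranked = sorted(candidates, key=lambda c: blended_score(c[1], c[2]))
best_name = ranked[0][0]
print(best_name)
```

Setting `x=1.0` ignores the leaderboard entirely, which is the "trust your CV" extreme.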

## Step 1: Preprocess
- We will be using the [housing price prediction](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition to illustrate the workflow. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
- To work with a Kaggle competition in Google Colab, you need a Kaggle account. Click the avatar in the top right corner -> My Account -> Create New API Token to download a `kaggle.json` file.
- Open the file and copy and paste the username and key into the following code chunk.

In [None]:
import os
os.environ['KAGGLE_USERNAME'] = "xxxxxxx" # username from the json file
os.environ['KAGGLE_KEY'] = "xxxxxxxxxxxxxxxxxxx" # key from the json file

- Use the Kaggle API to download the dataset directly from Kaggle.

In [None]:
!kaggle competitions download -c house-prices-advanced-regression-techniques

In [None]:
!pwd

In [None]:
!ls

- Google Colab runs [Docker](https://www.youtube.com/watch?v=t9YuqwGYUUg&feature=youtu.be) containers behind the scenes, which means you will lose your data if you disconnect from the server.
- An easy solution is to mount your Google Drive to the Docker container. You can also save the preprocessed data files to avoid rerunning the same code each time.
- Uncomment and run the following cell to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

- Uncomment and run the following code chunk if you want to work from your own Drive.

In [None]:
# import os
# os.chdir('/content/drive/My Drive')

In [None]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 100)

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

train_df.head()

In [None]:
# Save the 'Id' column
train_ID = train_df['Id']
test_ID = test_df['Id']

# Now drop the 'Id' column since we cannot use it as a feature to train our model.
train_df.drop("Id", axis = 1, inplace = True)
test_df.drop("Id", axis = 1, inplace = True)

In [None]:
y_train = train_df['SalePrice']
X_train = train_df.drop('SalePrice', axis=1)
X_test = test_df.copy()

- Delete the dataframes that you do not need anymore to save computer RAM.

In [None]:
del train_df, test_df

In [None]:
print(X_train.shape)
print(X_test.shape)

- Combine training and test dataframes before feature engineering.
- **This is not always the correct way.**
 - For categorical features, this is fine because you want to avoid categories that appear only in the test set, which would give the two sets different dimensions after dummifying.
 - If you want to perform any transformation (normalization, standardization, etc.) on the numerical features, you should **[fit on the training set and transform the test set.](https://stats.stackexchange.com/a/174865)**
 - The same applies to how you perform cross-validation. See Section 7.10.2 of [ESL](https://web.stanford.edu/~hastie/Papers/ESLII.pdf).
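The "fit on train, transform test" rule can be sketched with plain NumPy. The data below is synthetic and the variable names are illustrative; scikit-learn's `StandardScaler` follows the same pattern with its `fit` and `transform` methods.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for one numerical feature in the training and test sets.
train_feature = rng.normal(loc=1500.0, scale=400.0, size=200)
test_feature = rng.normal(loc=1600.0, scale=450.0, size=100)

# Fit: the statistics come from the training set only.
mu = train_feature.mean()
sigma = train_feature.std()

# Transform: reuse the training statistics on both sets.
train_scaled = (train_feature - mu) / sigma
test_scaled = (test_feature - mu) / sigma

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

The scaled test feature will generally not have mean 0 and standard deviation 1, and that is expected: the test set is treated as unseen data.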

In [None]:
all_data = pd.concat([X_train, X_test], ignore_index=True)
all_data.shape

## Step 2: Feature Engineering - most creative aspect of data science

### Categorical  features
- Nearly always need some treatment
- High cardinality can create very sparse data

#### One-hot encoding
- One-of-K encoding on an array of length K
- Basic method: used with most linear algorithms
- Dropping the first column avoids collinearity
 - Encoding gender as two variables, **is_male** and **is_female**, produces two features that are perfectly negatively correlated
- Encode only categories appearing 3+ times
 - Reduces the training feature space with (almost) no loss of info.
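The "3+ times" rule could be sketched as follows. The toy column and the `Rare` label are assumptions made for illustration; the threshold is the one named in the list above.

```python
import pandas as pd

# Toy column; the counts are illustrative, not the real dataset counts.
df = pd.DataFrame(
    {"Neighborhood": ["CollgCr"] * 4 + ["Veenker"] * 3 + ["Blueste", "NPkVill"]}
)

min_count = 3  # keep categories that appear at least 3 times
counts = df["Neighborhood"].value_counts()
frequent = counts[counts >= min_count].index

# Collapse every other category into a single 'Rare' level before encoding.
df["Neighborhood"] = df["Neighborhood"].where(
    df["Neighborhood"].isin(frequent), "Rare"
)

encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
```

Four original categories become three levels, and `drop_first=True` leaves two dummy columns.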

In [None]:
for c in all_data.columns:
    if all_data[c].dtype == 'object':
        print(c, len(all_data[c].value_counts()))

In [None]:
one_hot_df = pd.get_dummies(all_data, drop_first=True, dummy_na=True)
one_hot_df.head()

- To view the documentation of a function, put `?` in front of it; `??` also shows the source code when it is available.

In [None]:
??pd.get_dummies

#### Label encoding
- Give every category of a categorical variable a unique numerical ID
- Useful for non-linear tree-based algorithms
- Does not increase dimensionality

In [None]:
from sklearn.preprocessing import LabelEncoder

label_df = all_data.copy()

for c in label_df.columns:
    if label_df[c].dtype == 'object':
        le = LabelEncoder()
        # Need to convert the column type to string in order to encode missing values
        label_df[c] = le.fit_transform(label_df[c].astype(str))

In [None]:
label_df.head()

### Ordinal Features

- Label encoding is good in general; however, some of the features are ordinal in nature.
- For example, we usually consider Excellent > Good > Average/Typical > Fair > Poor
- We can construct a dictionary like the following and map it to those columns:
  ```python
  {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa':2, 'Po':1}
  ```
 
- You need a different dictionary for columns with different levels.

In [None]:
ord_cols = ['ExterQual', 'ExterCond','BsmtCond','HeatingQC', 'KitchenQual', 
           'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC']
ord_dic = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa':2, 'Po':1}

In [None]:
ord_df = X_train.copy()

for col in ord_cols:
    ord_df[col] = ord_df[col].map(lambda x: ord_dic.get(x, 0))
ord_df.head()
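Columns with different levels need their own dictionary. A minimal sketch, assuming `BsmtExposure` is ordered Gd > Av > Mn > No (that ordering and the toy rows are assumptions, not taken from the cells above):

```python
import pandas as pd

# Quality-style columns share one scale; BsmtExposure has its own levels.
quality_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
exposure_map = {"Gd": 4, "Av": 3, "Mn": 2, "No": 1}  # assumed ordering

col_maps = {
    "KitchenQual": quality_map,
    "BsmtExposure": exposure_map,
}

toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", None],
    "BsmtExposure": ["No", "Gd", "Mn", None],
})

for col, mapping in col_maps.items():
    # Unmapped or missing values fall back to 0, as in the cell above.
    toy[col] = toy[col].map(mapping).fillna(0).astype(int)

print(toy)
```

Keeping one `col_maps` dictionary makes it easy to add further ordinal columns without touching the loop.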

#### Interactions
- If interactions are natural for a problem, create them explicitly: ML models only approximate them, which is sub-optimal.
 - Start from interactions that make sense intuitively. 
 - Winners usually find something that most people struggle to see in data. **Not many people look at the data at all!**
 
|  GarageCond |   GarageType   | GarageCond * GarageType  |
| ------------|:--------------:| -----:|
|  Ex  | 2Types | Ex * 2Types |
|  Ex  | CarPort| Ex * CarPort|
|  TA  | Basement| TA * Basement|
|  Fa  | BuiltIn | Fa * BuiltIn |
 
 
- Test your method with all explicitly created possible 2-way interactions if you have enough computing power
- This is especially useful when dealing with **anonymous data** (column name unknown)
- If 2-way interactions help – go even further (3-way, 4-way, ...)
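A brute-force pass over all 2-way categorical interactions might look like the sketch below. The toy rows mirror the table above; the `Street` column and the `_x_` naming scheme are added here purely for illustration.

```python
from itertools import combinations

import pandas as pd

toy = pd.DataFrame({
    "GarageCond": ["Ex", "Ex", "TA", "Fa"],
    "GarageType": ["2Types", "CarPort", "Basement", "BuiltIn"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

cat_cols = ["GarageCond", "GarageType", "Street"]

# Build every 2-way interaction by concatenating the two category labels.
for a, b in combinations(cat_cols, 2):
    toy[f"{a}_x_{b}"] = toy[a] + "_" + toy[b]

print(toy["GarageCond_x_GarageType"].tolist())
```

The new columns are ordinary categorical features, so they can be one-hot or label encoded like any other. The number of pairs grows quadratically with the number of columns, which is why computing power matters here.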

**How to deal with NAs depends on the situation. An NA is itself a piece of information! Usually a separate category is enough.**

### Numerical features
Feature transformations to consider:
- Scaling - min/max, N(0,1), root/power scaling, log scaling, Box-Cox, quantiles.
- Rounding (too much precision might be noise!)
- Interactions {+,-,*,/}
 - Since area-related features are very important in determining house prices, we can add one more feature: the total area of the basement, first floor, and second floor of each house
 - `all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']`
- **Tree methods are almost invariant to scaling**
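Two of the transformations listed above, log scaling and rounding, can be sketched on a small array. The values are illustrative (the last one is a made-up outlier), and the final check shows why a monotonic transform leaves tree splits largely unaffected: the ordering of the values is unchanged.

```python
import numpy as np

# Illustrative right-skewed lot areas; the last value is a made-up outlier.
lot_area = np.array([8450.0, 9600.0, 11250.0, 9550.0, 14260.0, 200000.0])

log_area = np.log1p(lot_area)     # log scaling compresses the long right tail
rounded = np.round(lot_area, -2)  # rounding to the nearest 100 drops fine detail

# A monotonic transform keeps the ordering of the values.
same_order = np.array_equal(np.argsort(lot_area), np.argsort(log_area))
print(same_order)
```

`np.expm1` inverts `np.log1p`, which is handy when the target itself is log-transformed and predictions need to be mapped back.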

## Step 3: Parameter Tuning

#### Basic approach: apply grid search on all parameter space
- Zero effort and no supervision
- Enormous parameter space
- Very time consuming
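To get a feel for how quickly the grid grows, the sketch below counts the combinations in a hypothetical gradient boosting grid. The parameter values are illustrative; scikit-learn's `GridSearchCV` would fit every one of these combinations on every fold.

```python
from itertools import product

# Hypothetical grid for a gradient boosting model; the values are illustrative.
param_grid = {
    "learning_rate": [0.005, 0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7, 9],
    "min_samples_leaf": [5, 10, 15],
    "n_estimators": [500, 1000, 3000],
    "max_features": ["sqrt", 0.5, 1.0],
}

combos = list(product(*param_grid.values()))
n_folds = 5
n_fits = len(combos) * n_folds  # one model fit per combination per fold
print(len(combos), n_fits)
```

Even this modest grid needs 432 combinations and 2160 model fits with 5-fold CV.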

#### Expert approach: experience + intuition + resources at hand
1. Pick a set of parameters from a Kaggle kernel, or reuse the golden parameters from your previous competition
2. Start with the parameters that don't affect the others too much
 - e.g. the learning rate $\eta$ in boosting methods doesn't influence the tuning of the other parameters (from my experience)
 - `max_depth`, `min_samples_split` and `min_samples_leaf` in random forest are highly correlated with each other
3. Iteratively tune the parameters that control overfitting/underfitting
 - If it helps on CV, keep tuning it. Stop after the CV score converges.
 - You can use the public leaderboard as your K+1 fold to confirm the improvement.
4. Go back to step 2 and stop when you are satisfied with the result and won't regret not working harder.
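The loop in steps 2 to 4 amounts to tuning one parameter at a time and repeating until nothing improves. The sketch below shows only the control flow: `cv_score` is a made-up stand-in for a real cross-validation run, and the search space is hypothetical.

```python
# Stand-in for a real cross-validation run; lower is better.
# The formula is made up so that the example runs instantly.
def cv_score(params):
    return (
        (params["max_depth"] - 5) ** 2 * 0.001
        + (params["min_samples_leaf"] - 10) ** 2 * 0.0001
        + 0.12
    )

search_space = {
    "max_depth": [3, 4, 5, 6, 7],
    "min_samples_leaf": [5, 10, 15, 20],
}

params = {"max_depth": 3, "min_samples_leaf": 20}  # starting point, e.g. from a kernel
best = cv_score(params)

improved = True
while improved:  # step 4: repeat until a full pass brings no improvement
    improved = False
    for name, values in search_space.items():  # steps 2-3: one parameter at a time
        for v in values:
            trial = {**params, name: v}
            score = cv_score(trial)
            if score < best - 1e-9:
                params, best, improved = trial, score, True

print(params, round(best, 4))
```

With a real model, `cv_score` would wrap something like a K-fold evaluation, and each call would be expensive, which is why the order in which parameters are tuned matters.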


#### [Bayesian optimization method](https://github.com/fmfn/BayesianOptimization/blob/master/examples/visualization.ipynb): trade-off between expert and grid search approach
- Zero effort and no supervision
- Search space is narrowed based on previous iterations' results (mimics expert decisions)
- Time consuming (still)
- Easy to integrate with sklearn cross validation function. See [examples](https://github.com/fmfn/BayesianOptimization/blob/master/examples/sklearn_example.py) here.

#### Golden rule: finding the optimal configuration is rarely a good time investment!

## Step 4: Ensemble & Stacking

### [Ensembling by voting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier)
```
1111111100 = 80% accuracy 
0111011101 = 70% accuracy 
1000101111 = 60% accuracy
```
**Majority Vote**
```
1111111101 = 90% accuracy
```
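The majority vote above can be reproduced in a few lines. The sketch assumes the ground truth is all ones, which is what the stated accuracies imply.

```python
truth = "1111111111"  # assumed ground truth, implied by the accuracies above
predictions = ["1111111100", "0111011101", "1000101111"]

def accuracy(pred, truth):
    """Fraction of positions where the prediction matches the truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# For each position, predict 1 when at least two of the three models say 1.
majority = "".join(
    "1" if sum(int(p[i]) for p in predictions) >= 2 else "0"
    for i in range(len(truth))
)

print([accuracy(p, truth) for p in predictions])
print(majority, accuracy(majority, truth))
```

The vote helps here because the three models make their mistakes in mostly different positions; highly correlated models would gain much less.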

### Ensembling by averaging
- Let’s say we have N predictions from N different models: $y_1, y_2, ... , y_N$
- We want to make a single prediction using weighted average: $\beta_1*y_1+\beta_2*y_2+...+\beta_N*y_N$
- How do we find the best beta coefficients?
- A very common mistake is to select weights based on leaderboard feedback
 - **inefficient & prone to leaderboard overfitting**
- Solve the problem using CV predictions with optimization algorithms 
 - $optim(\beta_1*y_1+\beta_2*y_2+...+\beta_N*y_N)$ with starting weights $\beta_i=1/N$
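A minimal sketch of the weight search, using `scipy.optimize.minimize` on synthetic out-of-fold predictions. The data, the RMSE objective, and the constraints (weights between 0 and 1 that sum to 1) are assumptions made for this example; the text above only prescribes the starting weights $\beta_i=1/N$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y_true = rng.normal(size=300)
# Synthetic out-of-fold predictions from three models with different noise levels.
oof_preds = np.column_stack(
    [y_true + rng.normal(scale=s, size=300) for s in (0.3, 0.5, 0.8)]
)

def rmse(weights):
    """RMSE of the weighted blend of the out-of-fold predictions."""
    blend = oof_preds @ weights
    return np.sqrt(np.mean((y_true - blend) ** 2))

n_models = oof_preds.shape[1]
start = np.full(n_models, 1.0 / n_models)  # beta_i = 1/N

result = minimize(
    rmse,
    start,
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_models,  # assumed: non-negative weights
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},  # assumed: sum to 1
)
weights = result.x
print(weights.round(3), rmse(start), rmse(weights))
```

Because the weights are fitted on CV predictions rather than on leaderboard feedback, the public leaderboard stays available as an independent check.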

### Stacked Generalization

The procedure for 5-fold stacking may be described as follows:

1. Split the total training set into two disjoint sets (here train and holdout)

2. Train several base models on the first part (train)

3. Predict these base models on the second part (holdout)

4. Repeat steps 1-3 five times, once per fold, so that every training row gets a holdout prediction. Use the holdout predictions as the inputs and the correct responses (target variable) as the outputs to train a higher-level learner called the meta-model.


- For the test set, we could either average the test-set predictions of the base models fitted on each fold, or refit each base model on the whole training set and then predict. Generally speaking, either way is fine because the test set is never used during training.
- If we ran 10 models using the same procedure, our meta model will have 10 input features.
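The procedure can be sketched with scikit-learn's `KFold`. This is not the helper script used later in the notebook: the synthetic data, the two base models, and the choice to average test predictions over folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(50, 5))  # stand-in for the test set

base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=0)]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

oof = np.zeros((len(X), len(base_models)))  # holdout predictions become meta features
test_meta = np.zeros((len(X_new), len(base_models)))

for j, model in enumerate(base_models):
    for train_idx, hold_idx in kf.split(X):  # step 1: train/holdout split
        model.fit(X[train_idx], y[train_idx])  # step 2: fit on the train part
        oof[hold_idx, j] = model.predict(X[hold_idx])  # step 3: predict the holdout
        # Average this model's test predictions over the five folds.
        test_meta[:, j] += model.predict(X_new) / kf.get_n_splits()

meta_model = LinearRegression().fit(oof, y)  # step 4: train the meta-model
final_pred = meta_model.predict(test_meta)
print(oof.shape, final_pred.shape)
```

Each base model contributes exactly one column to `oof`, so two base models give the meta-model two input features.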

![img](https://s3.amazonaws.com/nycdsabt01/stacking.jpg)

Borrowed from [Faron](https://www.kaggle.com/getting-started/18153#post103381)

- As a quick note, one should try a few diverse models. In my experience, a good stacking solution is often composed of at least:
 - 2 or 3 GBMs/XGBs/LightGBMs (one with low depth, one with medium and one with high)
 - 1 or 2 Random Forests (again as diverse as possible–one low depth, one high)
 - 1 linear model**!**

- Download helper Python scripts.

In [None]:
!wget https://gist.githubusercontent.com/hellozeyu/12f669f6a2f0ca4e228ed783c3937cfd/raw/5cfc1e6269bc6dc5b2e676c9970f6c95c4339ca4/preprocess.py

In [None]:
!wget https://gist.githubusercontent.com/hellozeyu/c0183dc498841f3a913284b8aac42c35/raw/b84cee92a97097683c9ea846e88071e4f107f7d5/stacking.py

In [None]:
from sklearn.linear_model import ElasticNet, LinearRegression as lr
from sklearn.ensemble import GradientBoostingRegressor as gbr, RandomForestRegressor as rfr
from preprocess import impute # impute is a helper function defined under preprocess.py

- Impute all the missing values of the dataframe

In [None]:
all_data = impute(all_data)

In [None]:
all_data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,TotalSF
0,60,RL,65.0,8450,Pave,,Reg,Lvl,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706.0,Unf,0.0,150.0,856.0,GasA,Ex,Y,SBrkr,856,854,0,1710,1.0,0.0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2.0,548.0,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,2566.0
1,20,RL,80.0,9600,Pave,,Reg,Lvl,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978.0,Unf,0.0,284.0,1262.0,GasA,Ex,Y,SBrkr,1262,0,0,1262,0.0,1.0,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2.0,460.0,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,2524.0
2,60,RL,68.0,11250,Pave,,IR1,Lvl,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486.0,Unf,0.0,434.0,920.0,GasA,Ex,Y,SBrkr,920,866,0,1786,1.0,0.0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2.0,608.0,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,2706.0
3,70,RL,60.0,9550,Pave,,IR1,Lvl,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216.0,Unf,0.0,540.0,756.0,GasA,Gd,Y,SBrkr,961,756,0,1717,1.0,0.0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3.0,642.0,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,2473.0
4,60,RL,84.0,14260,Pave,,IR1,Lvl,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655.0,Unf,0.0,490.0,1145.0,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1.0,0.0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3.0,836.0,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,3343.0


- Encode the ordinal features based on the condition.

In [None]:
for col in all_data.columns:
    if col in ord_cols:
        all_data[col] = all_data[col].map(lambda x: ord_dic.get(x, 0))

- One-hot encode all the other categorical features.

In [None]:
all_data = pd.get_dummies(all_data)

In [None]:
train_index = len(X_train)
X_train = all_data.iloc[:train_index, :]
X_test = all_data.iloc[train_index:, :]

In [None]:
from stacking import stacking_regression # stacking_regression is a helper function defined under stacking.py
from sklearn.metrics import mean_squared_error

In [None]:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(np.log(y), np.log(y_pred)))

In [None]:
models = [
    # linear model, ElasticNet = lasso + ridge
    ElasticNet(random_state=0),
    
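    # conservative random forest model
    rfr(random_state=0,
        n_estimators=1000, max_depth=6,  max_features='sqrt'),
    
    # aggressive random forest model
    # max_features=1.0 uses all features, which is what 'auto' meant for
    # regressors before it was removed from scikit-learn
    rfr(random_state=0, 
        n_estimators=1000, max_depth=9,  max_features=1.0),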
    
    # conservative gbm model
    gbr(random_state=0, learning_rate = 0.005, max_features='sqrt',
        min_samples_leaf=15, min_samples_split=10, 
        n_estimators=3000, max_depth=3),
    
    # aggressive gbm model
    gbr(random_state = 0, learning_rate = 0.01, max_features='sqrt',
        min_samples_leaf=10, min_samples_split=5, 
        n_estimators = 1000, max_depth = 9)
    ]

meta_model = lr()  # the 'normalize' argument has been removed from recent scikit-learn releases

In [None]:
%%time
final_prediction = stacking_regression(models, meta_model, X_train, y_train, X_test,
                               transform_target=np.log1p, transform_pred = np.expm1, 
                               metric=rmsle, verbose=1)

metric: [rmsle]

model 0: [ElasticNet]
    ----
    MEAN:   [0.19267646]

model 1: [RandomForestRegressor]
    ----
    MEAN:   [0.16326278]

model 2: [RandomForestRegressor]
    ----
    MEAN:   [0.14649338]

model 3: [GradientBoostingRegressor]
    ----
    MEAN:   [0.12821168]

model 4: [GradientBoostingRegressor]
    ----
    MEAN:   [0.12726764]

CPU times: user 1min 9s, sys: 257 ms, total: 1min 9s
Wall time: 1min 9s


In [None]:
submission_df = pd.DataFrame()
submission_df['Id'] = test_ID
submission_df['SalePrice'] = final_prediction 
submission_df.to_csv('stacking.csv', index=False)

In [None]:
!head stacking.csv

Id,SalePrice
1461,125359.34370744755
1462,149091.1608909535
1463,174055.74797125038
1464,181243.35887323445
1465,177494.68099706247
1466,168575.32684119936
1467,160012.00198952647
1468,160982.77986826166
1469,169473.06798496895


In [None]:
!kaggle competitions submit -c house-prices-advanced-regression-techniques -m "stacking model" -f stacking.csv

**Having more models than necessary in ensemble may hurt.**


- Let's say we have a library of trained models. Usually a greedy forward approach works well:
 - Start with an ensemble of a few well-performing models
 - Loop through every other model in the library and add it to the current ensemble
 - Keep the best-performing ensemble configuration
 - Repeat until the metric converges
- If you are using linear regression as the meta model, make sure you have **diverse/uncorrelated** first layer models

- During each loop iteration it is wise to consider only a subset of the library models, which can work as regularization for model selection.

- Repeating the procedure a few times and bagging the results reduces the risk of overfitting through model selection.
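The greedy forward approach can be sketched on a synthetic library of out-of-fold predictions. The data, the RMSE metric, and the simple-average ensemble are assumptions for this example; it also starts from the single best model instead of a few, and it omits the subsetting and bagging refinements mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=500)
# Hypothetical library of out-of-fold predictions, one column per model.
noise_levels = [0.4, 0.5, 0.6, 1.5, 2.0]
library = np.column_stack(
    [y_true + rng.normal(scale=s, size=500) for s in noise_levels]
)

def rmse(pred):
    return np.sqrt(np.mean((y_true - pred) ** 2))

# Start from the single best model, then keep adding whichever model helps most.
n_models = library.shape[1]
selected = [int(np.argmin([rmse(library[:, j]) for j in range(n_models)]))]
best = rmse(library[:, selected].mean(axis=1))

while True:
    candidates = [j for j in range(n_models) if j not in selected]
    scores = {j: rmse(library[:, selected + [j]].mean(axis=1)) for j in candidates}
    if not scores:
        break
    j_best = min(scores, key=scores.get)
    if scores[j_best] >= best:  # stop when the metric no longer improves
        break
    selected.append(j_best)
    best = scores[j_best]

print(selected, round(best, 4))
```

The noisiest models are left out of the final ensemble, which illustrates why having more models than necessary may hurt.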

- R users can use the `caretStack` function from the [caretEnsemble](https://github.com/zachmayer/caretEnsemble) package directly. There is a nice tutorial [here](https://machinelearningmastery.com/machine-learning-ensembles-with-r/).

- The `stackedEnsemble()` function from the [H2O package](https://h2o-release.s3.amazonaws.com/h2o/rel-ueno/2/docs-website/h2o-docs/data-science/stacked-ensembles.html) is also a good choice. The downside is that it only takes H2O models as input.

## Success formula (personal opinion)

50% - feature engineering

30% - model diversity

10% - luck

10% - proper ensembling
 - Voting
 - Averaging
 - Stacking