# Table of Contents

<a id="table-of-contents"></a>
1. [Introduction](#introduction)
2. [Reference](#reference)
3. [Preparation](#preparation)
4. [Base Regression Model](#base_model)
5. [Base Regression Model and Label Encoding](#base_label_encode)
    * 5.1. [Log Continuous Features and Label Encoding](#modified_model_1)
    * 5.2. [Log Continuous Features and Log Label Encoding](#modified_model_2)
    * 5.3. [Continuous Features, Label Encoding and Min on Continuous Features](#modified_model_3)
    * 5.4. [Continuous Features, Label Encoding and Log Min on Continuous Features](#modified_model_4)
    * 5.5. [Continuous Features, Label Encoding and Max on Continuous Features](#modified_model_5)
    * 5.6. [Continuous Features, Label Encoding and Log Max on Continuous Features](#modified_model_6)
    * 5.7. [Continuous Features, Label Encoding and Sum on Continuous Features](#modified_model_7)
    * 5.8. [Continuous Features, Label Encoding and Log Sum on Continuous Features](#modified_model_8)
    * 5.9. [Continuous Features, Label Encoding and Multiplication on Continuous Features](#modified_model_9)
    * 5.10. [Continuous Features, Label Encoding and Log Multiplication on Continuous Features](#modified_model_10)
    * 5.11. [Continuous Features, Label Encoding and Prorate on Continuous Features](#modified_model_11)
    * 5.12. [Continuous Features, Label Encoding and Log Prorate on Continuous Features](#modified_model_12)
    * 5.13. [Exponential Continuous Features and Label Encoding](#modified_model_23)
    * 5.14. [Boxcox Continuous Features and Label Encoding](#modified_model_24)
6. [Base Regression Model and Target Encoding](#base_target_encode)
    * 6.1. [Continuous Features and Mean Encoding](#modified_model_13)
    * 6.2. [Continuous Features and Min Encoding](#modified_model_14)
    * 6.3. [Continuous Features and Max Encoding](#modified_model_15)
    * 6.4. [Continuous Features and Median Encoding](#modified_model_19)
    * 6.5. [Continuous Features and Standard Deviation Encoding](#modified_model_20)
    * 6.6. [Continuous Features and Skewness Encoding](#modified_model_21)
    * 6.7. [Continuous Features, Mean, Median, Min, Max and Std Dev Encoding](#modified_model_16)
7. [Base Regression Model and Categorical Encoding](#base_cat_encode)
    * 7.1. [Continuous Features and Percentage Encoding](#modified_model_17)
    * 7.2. [Continuous Features and Sum Categorical Encoding](#modified_model_18)
    * 7.3. [Continuous Features and Multiplication Categorical Encoding](#modiefied_model_22)
8. [Summary](#summary)

[back to top](#table-of-contents)
<a id="introduction"></a>
# 1. Introduction

The purpose of this notebook is to try as many feature engineering ideas as possible, without any parameter tuning, to see the impact of each set of new features. The base model is LightGBM. Each section starts with a short summary of the features used by the model.

[back to top](#table-of-contents)
<a id="reference"></a>
# 2. Reference
This notebook builds on the following great notebooks:
* [February Solution - Stratified KFolds](https://www.kaggle.com/gpreda/february-solution-stratified-kfolds) by [Gabriel Preda](https://www.kaggle.com/gpreda)
* [TPS Feb 2021 with LGBMRegressor](https://www.kaggle.com/tunguz/tps-feb-2021-with-lgbmregressor) by [Bojan Tunguz](https://www.kaggle.com/tunguz)

[back to top](#table-of-contents)
<a id="preparation"></a>
# 3. Preparation
Load the packages, define helper functions to cross-validate the training dataset, and set the number of cross-validation folds to 5.

In [None]:
import warnings
import numpy as np
import pandas as pd
import lightgbm
import datetime
from sklearn import datasets
from sklearn import model_selection
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.preprocessing import LabelEncoder
from scipy.stats import boxcox
from tqdm.notebook import tqdm

warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', 0)
n_fold = 5
result = pd.DataFrame(columns=['Model', 'RMSE'])

def create_stratified_folds_for_regression(data_df, n_splits=5):
    """
    @param data_df: training data to split into stratified K folds for a continuous target value
    @param n_splits: number of splits
    @return: the training data with a column with kfold id
    """
    data_df['kfold'] = -1
    # randomize the data
    data_df = data_df.sample(frac=1, random_state=42).reset_index(drop=True)
    # calculate the number of bins with a Sturges-style rule: floor(1 + log2(n_rows))
    num_bins = int(np.floor(1 + np.log2(len(data_df))))
    # the bin index acts as the class label that StratifiedKFold uses to
    # distribute the target values evenly over the folds
    data_df.loc[:, "bins"] = pd.cut(pd.to_numeric(data_df['target'], downcast="signed"), bins=num_bins, labels=False)
    # the rows were already shuffled above, so no random_state is passed here
    # (recent scikit-learn versions raise an error for random_state without shuffle=True)
    kf = model_selection.StratifiedKFold(n_splits=n_splits)
    
    # set the fold id as a new column in the train data
    for f, (t_, v_) in enumerate(kf.split(X=data_df, y=data_df.bins.values)):
        data_df.loc[v_, 'kfold'] = f
    
    # drop the bins column (no longer needed)
    data_df = data_df.drop("bins", axis=1)
    
    return data_df

def cv_evaluation_regression(df, n_fold):
    oof = np.zeros(len(df))  # one out-of-fold prediction per row of the frame passed in
    for fold in tqdm(range(n_fold)):
        val_ind = df[df.kfold == fold].index
        train = df[df.kfold != fold].reset_index(drop=True)
        valid = df[df.kfold == fold].reset_index(drop=True)

        features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]

        X_train = train[features]
        y_train = train['target']
        X_valid = valid[features]
        y_valid = valid['target']

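        model = lightgbm.LGBMRegressor(random_state=42, objective='regression', metric='rmse')
        # no early stopping is used, so the eval_set is not needed; the fit-time
        # `verbose` argument is also no longer accepted by newer LightGBM releases
        model.fit(X_train, y_train)
        preds = model.predict(valid[features])
        mse = mean_squared_error(y_valid, preds)
        
        oof[val_ind] = preds

        print(f'LGBM fold: {fold}: RMSE:{np.sqrt(mse)}')
    
    # score the out-of-fold predictions against the frame passed in, not the global train_df
    mse_oof = mean_squared_error(df['target'], oof)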
    
    print(f'LGBM Overall RMSE:{np.sqrt(mse_oof)}')
    return np.sqrt(mse_oof)
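
As a quick check of the binning rule in `create_stratified_folds_for_regression`, the sketch below evaluates `floor(1 + log2(n))` for two row counts. It uses only the standard library; `sturges_bins` is a hypothetical helper name, and 300,000 is assumed here as the number of training rows.

```python
import math

def sturges_bins(n_rows: int) -> int:
    """Number of target bins used for stratification: floor(1 + log2(n_rows))."""
    return int(math.floor(1 + math.log2(n_rows)))

for n in (1_000, 300_000):
    print(n, sturges_bins(n))
```

The bin count grows only logarithmically with the number of rows, so even a large training set is stratified over a small number of target ranges.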

[back to top](#table-of-contents)
<a id="base_model"></a>
# 4. Base Regression Model
Features that are used:
* Continuous features
* Label encoding on categorical features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
new_row = {'Model':'4. Base Regression Model', 'RMSE': cv_evaluation_regression(train_df, 5)}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="base_label_encode"></a>
# 5. Base Regression Model and Label Encoding
<a id="modified_model_1"></a>
## 5.1. Log Continuous Features and Label Encoding
Features that are used:
* Log on all continuous features
* Label encoding on all categorical features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df[cont_features] = np.log(train_df[cont_features])

new_row = {'Model':'5.1. Log Continuous Features and Label Encoding', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_2"></a>
## 5.2. Log Continuous Features and Log Label Encoding
Features that are used:
* Log on all continuous features
* Log label encoding on all categorical features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df[cont_features] = np.log(train_df[cont_features])
train_df[cat_features] = np.log(train_df[cat_features])

new_row = {'Model':'5.2. Log Continuous Features and Log Label Encoding', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)
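
One thing to keep in mind with log label encoding: `LabelEncoder` assigns integer codes starting at 0, so the logarithm of the first category of each feature is negative infinity. The sketch below, on a made-up three-category column, shows the effect next to `np.log1p`, which is only one possible alternative (an assumption for illustration, not what the cell above uses).

```python
import numpy as np

codes = np.array([0, 1, 2])  # label codes of a toy three-category feature

with np.errstate(divide="ignore"):  # silence the divide-by-zero warning for log(0)
    logged = np.log(codes)

shifted = np.log1p(codes)  # log(1 + code) keeps every value finite

print(logged)
print(shifted)
```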

[back to top](#table-of-contents)
<a id="modified_model_3"></a>
## 5.3. Continuous Features, Label Encoding and Min on Continuous Features
Features that are used:
* Continuous features 
* Label encoding on categorical features
* New Features:
    * Min of all continuous features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df['cont_min'] = train_df[cont_features].min(axis=1)

new_row = {'Model':'5.3. Continuous Features, Label Encoding and Min on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)
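
The row-wise aggregates in this and the following subsections all rely on `axis=1`. A minimal sketch on a made-up frame (column names and values are illustrative only) shows what that computes:

```python
import pandas as pd

toy = pd.DataFrame({"cont0": [0.1, 0.9], "cont1": [0.4, 0.2], "cont2": [0.7, 0.5]})

# axis=1 aggregates across the columns of each row rather than down each column
toy_min = toy.min(axis=1)
toy_max = toy.max(axis=1)
toy_sum = toy.sum(axis=1)

print(pd.DataFrame({"min": toy_min, "max": toy_max, "sum": toy_sum}))
```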

[back to top](#table-of-contents)
<a id="modified_model_4"></a>
## 5.4. Continuous Features, Label Encoding and Log Min on Continuous Features
Features that are used:
* Continuous features 
* Label encoding on categorical features
* New Features:
    * Log min of all continuous features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df['cont_min'] = train_df[cont_features].min(axis=1)
train_df['cont_min'] = np.log(train_df['cont_min'])

new_row = {'Model':'5.4. Continuous Features, Label Encoding and Log Min on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_5"></a>
## 5.5. Continuous Features, Label Encoding and Max on Continuous Features
Features that are used:
* Continuous features 
* Label encoding on categorical features
* New Features:
    * Max of all continuous features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df['cont_max'] = train_df[cont_features].max(axis=1)

new_row = {'Model':'5.5. Continuous Features, Label Encoding and Max on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_6"></a>
## 5.6. Continuous Features, Label Encoding and Log Max on Continuous Features
Features that are used:
* Continuous features 
* Label encoding on categorical features
* New Features:
    * Log max of all continuous features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df['cont_max'] = train_df[cont_features].max(axis=1)
train_df['cont_max'] = np.log(train_df['cont_max'])

new_row = {'Model':'5.6. Continuous Features, Label Encoding and Log Max on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_7"></a>
## 5.7. Continuous Features, Label Encoding and Sum on Continuous Features
Features that are used:
* Continuous features 
* Label encoding on categorical features
* New Features:
    * Sum of all continuous features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df['cont_sum'] = train_df[cont_features].sum(axis=1)

new_row = {'Model':'5.7. Continuous Features, Label Encoding and Sum on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_8"></a>
## 5.8. Continuous Features, Label Encoding and Log Sum on Continuous Features
Features that are used:
* Continuous features 
* Label encoding on categorical features
* New Features:
    * Log sum of all continuous features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df['cont_sum'] = train_df[cont_features].sum(axis=1)
train_df['cont_sum'] = np.log(train_df['cont_sum'])

new_row = {'Model':'5.8. Continuous Features, Label Encoding and Log Sum on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_9"></a>
## 5.9. Continuous Features, Label Encoding and Multiplication on Continuous Features
Features that are used:
* Continuous features
* Label encoding on categorical features
* New Features:
    * Product of all continuous features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
# row-wise product of all continuous features
train_df['cont_multiply'] = train_df[cont_features].prod(axis=1)
    
new_row = {'Model':'5.9. Continuous Features, Label Encoding and Multiplication on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_10"></a>
## 5.10. Continuous Features, Label Encoding and Log Multiplication on Continuous Features
Features that are used:
* Continuous features
* Label encoding on categorical features
* New Features:
    * Log of the product of all continuous features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
# log of the row-wise product of all continuous features
train_df['cont_multiply'] = np.log(train_df[cont_features].prod(axis=1))

new_row = {'Model':'5.10. Continuous Features, Label Encoding and Log Multiplication on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)
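
Since log(a * b) = log(a) + log(b), the log of the product can also be computed as a sum of logs, which avoids forming a very small intermediate product when many values lie between 0 and 1. A minimal sketch with made-up positive values (the numbers are illustrative, not taken from the dataset):

```python
import math

row = [0.2, 0.5, 0.9, 0.05]  # toy stand-ins for one row of continuous features

# both expressions require strictly positive inputs
log_of_product = math.log(math.prod(row))
sum_of_logs = sum(math.log(v) for v in row)

print(log_of_product, sum_of_logs)
```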

[back to top](#table-of-contents)
<a id="modified_model_11"></a>
## 5.11. Continuous Features, Label Encoding and Prorate on Continuous Features
Features that are used:
* Continuous features
* Label encoding on categorical features
* New Features:
    * Prorate of all continuous features (each value divided by the sum of the continuous features in its row)

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df['cont_sum'] = train_df[cont_features].sum(axis=1)
for col in cont_features:
    train_df[col+'new'] = train_df[col] / train_df['cont_sum']
train_df = train_df.drop('cont_sum', axis=1)

new_row = {'Model':'5.11. Continuous Features, Label Encoding and Prorate on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)
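
To make the prorate feature concrete, here is a small sketch on a made-up two-row frame: each continuous value is divided by the sum of its row, so the new columns of a row add up to 1. Column names and values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"cont0": [0.2, 0.5], "cont1": [0.3, 0.5], "cont2": [0.5, 1.0]})
cont_cols = ["cont0", "cont1", "cont2"]

# divide every column by the row sum (axis=0 aligns the sums with the row index)
shares = toy[cont_cols].div(toy[cont_cols].sum(axis=1), axis=0).add_suffix("new")

print(shares)
print(shares.sum(axis=1))  # each row of shares sums to 1, up to floating-point error
```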

[back to top](#table-of-contents)
<a id="modified_model_12"></a>
## 5.12. Continuous Features, Label Encoding and Log Prorate on Continuous Features
Features that are used:
* Continuous features
* Label encoding on categorical features
* New Features:
    * Log prorate of all continuous features (log of each value divided by the sum of the continuous features in its row)

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df['cont_sum'] = train_df[cont_features].sum(axis=1)
for col in cont_features:
    train_df[col+'new'] = np.log(train_df[col] / train_df['cont_sum'])
train_df = train_df.drop('cont_sum', axis=1)

new_row = {'Model':'5.12. Continuous Features, Label Encoding and Log Prorate on Continuous Features', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_23"></a>
## 5.13. Exponential Continuous Features and Label Encoding
Features that are used:
* Exponential on all continuous features
* Label encoding on all categorical features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
train_df[cont_features] = np.exp(train_df[cont_features])

new_row = {'Model':'5.13. Exponential Continuous Features and Label Encoding', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_24"></a>
## 5.14. Boxcox Continuous Features and Label Encoding
Features that are used:
* Box-Cox transform with lambda = 0 on all continuous features (applied to `x + 1`, so equivalent to `log(1 + x)`)
* Label encoding on all categorical features

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
for col in cont_features:
    train_df[col] = boxcox(train_df[col]+1, 0)
    
new_row = {'Model':'5.14. Boxcox Continuous Features and Label Encoding', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)
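
For reference, the Box-Cox transform is `(x^lambda - 1) / lambda` for a non-zero lambda and `log(x)` for lambda = 0, so applying it with lambda = 0 to `x + 1` is the same as `log(1 + x)`. The sketch below checks this numerically with a hand-written `box_cox` helper, which is a hypothetical standard-library stand-in for calling `scipy.stats.boxcox` with a fixed lambda.

```python
import math

def box_cox(x: float, lam: float) -> float:
    """Box-Cox transform of a positive value x for a fixed lambda."""
    if lam == 0:
        return math.log(x)
    # x**lam - 1 written as expm1(lam * log(x)) to keep precision for tiny lam
    return math.expm1(lam * math.log(x)) / lam

x = 0.75
print(box_cox(x + 1, 0), math.log1p(x))  # the lambda = 0 case is log(1 + x)
print(box_cox(x + 1, 1e-6))              # a small lambda approaches the same value
```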

[back to top](#table-of-contents)
<a id="base_target_encode"></a>
# 6. Base Regression Model and Target Encoding
<a id="modified_model_13"></a>
## 6.1. Continuous Features and Mean Encoding
Features that are used:
* Continuous features
* Mean encoding on categorical features

In [None]:
def cv_evaluation_regression_mod(df, n_fold):
    oof = np.zeros(len(df))  # one out-of-fold prediction per row of the frame passed in
    for fold in tqdm(range(n_fold)):
        val_ind = df[df.kfold == fold].index
        train = df[df.kfold != fold].reset_index(drop=True)
        valid = df[df.kfold == fold].reset_index(drop=True)

        features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]
        
        for col in cat_features:
            encode = train.groupby(col)['target'].mean()
            train[col] = train[col].map(encode)
            valid[col] = valid[col].map(encode)

        X_train = train[features]
        y_train = train['target']
        X_valid = valid[features]
        y_valid = valid['target']

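        model = lightgbm.LGBMRegressor(random_state=42, objective='regression', metric='rmse')
        # no early stopping is used, so the eval_set is not needed; the fit-time
        # `verbose` argument is also no longer accepted by newer LightGBM releases
        model.fit(X_train, y_train)
        preds = model.predict(valid[features])
        mse = mean_squared_error(y_valid, preds)
        
        oof[val_ind] = preds

        print(f'LGBM fold: {fold}: RMSE:{np.sqrt(mse)}')
    
    # score the out-of-fold predictions against the frame passed in, not the global train_df
    mse_oof = mean_squared_error(df['target'], oof)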
    
    print(f'LGBM Overall RMSE:{np.sqrt(mse_oof)}')
    return np.sqrt(mse_oof)

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
new_row = {'Model':'6.1. Continuous Features and Mean Encoding', 'RMSE': cv_evaluation_regression_mod(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)
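
The key point of the fold-wise encoding above is that the category means are computed on the training part only and then mapped onto the validation part. The sketch below reproduces that step on a made-up frame; it also shows that a category absent from the training part maps to NaN, which LightGBM handles as a missing value. All names and values are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"cat0": ["A", "A", "B", "B"], "target": [1.0, 3.0, 10.0, 20.0]})
valid = pd.DataFrame({"cat0": ["A", "B", "C"]})  # "C" never occurs in train

# per-category target mean, learned from the training part only
encode = train.groupby("cat0")["target"].mean()

train_encoded = train["cat0"].map(encode)
valid_encoded = valid["cat0"].map(encode)

print(valid_encoded.tolist())
```

Computing the statistic inside each fold keeps the validation targets out of the encoding, so the reported RMSE is not inflated by target leakage.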

[back to top](#table-of-contents)
<a id="modified_model_14"></a>
## 6.2. Continuous Features and Min Encoding
Features that are used:
* Continuous features
* Minimum encoding on categorical features

In [None]:
def cv_evaluation_regression_mod(df, n_fold):
    oof = np.zeros(len(df))  # one out-of-fold prediction per row of the frame passed in
    for fold in tqdm(range(n_fold)):
        val_ind = df[df.kfold == fold].index
        train = df[df.kfold != fold].reset_index(drop=True)
        valid = df[df.kfold == fold].reset_index(drop=True)

        features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]
        
        for col in cat_features:
            encode = train.groupby(col)['target'].min()
            train[col] = train[col].map(encode)
            valid[col] = valid[col].map(encode)

        X_train = train[features]
        y_train = train['target']
        X_valid = valid[features]
        y_valid = valid['target']

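        model = lightgbm.LGBMRegressor(random_state=42, objective='regression', metric='rmse')
        # no early stopping is used, so the eval_set is not needed; the fit-time
        # `verbose` argument is also no longer accepted by newer LightGBM releases
        model.fit(X_train, y_train)
        preds = model.predict(valid[features])
        mse = mean_squared_error(y_valid, preds)
        
        oof[val_ind] = preds

        print(f'LGBM fold: {fold}: RMSE:{np.sqrt(mse)}')
    
    # score the out-of-fold predictions against the frame passed in, not the global train_df
    mse_oof = mean_squared_error(df['target'], oof)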
    
    print(f'LGBM Overall RMSE:{np.sqrt(mse_oof)}')
    return np.sqrt(mse_oof)

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
new_row = {'Model':'6.2. Continuous Features and Min Encoding', 'RMSE': cv_evaluation_regression_mod(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_15"></a>
## 6.3. Continuous Features and Max Encoding
Features that are used:
* Continuous features
* Maximum encoding on categorical features

In [None]:
def cv_evaluation_regression_mod(df, n_fold):
    oof = np.zeros(len(df))  # one out-of-fold prediction per row of the frame passed in
    for fold in tqdm(range(n_fold)):
        val_ind = df[df.kfold == fold].index
        train = df[df.kfold != fold].reset_index(drop=True)
        valid = df[df.kfold == fold].reset_index(drop=True)

        features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]
        
        for col in cat_features:
            encode = train.groupby(col)['target'].max()
            train[col] = train[col].map(encode)
            valid[col] = valid[col].map(encode)

        X_train = train[features]
        y_train = train['target']
        X_valid = valid[features]
        y_valid = valid['target']

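        model = lightgbm.LGBMRegressor(random_state=42, objective='regression', metric='rmse')
        # no early stopping is used, so the eval_set is not needed; the fit-time
        # `verbose` argument is also no longer accepted by newer LightGBM releases
        model.fit(X_train, y_train)
        preds = model.predict(valid[features])
        mse = mean_squared_error(y_valid, preds)
        
        oof[val_ind] = preds

        print(f'LGBM fold: {fold}: RMSE:{np.sqrt(mse)}')
    
    # score the out-of-fold predictions against the frame passed in, not the global train_df
    mse_oof = mean_squared_error(df['target'], oof)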
    
    print(f'LGBM Overall RMSE:{np.sqrt(mse_oof)}')
    return np.sqrt(mse_oof)

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
new_row = {'Model':'6.3. Continuous Features and Max Encoding', 'RMSE': cv_evaluation_regression_mod(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_19"></a>
## 6.4. Continuous Features and Median Encoding
Features that are used:
* Continuous features
* Median encoding on categorical features

In [None]:
def cv_evaluation_regression_mod(df, n_fold):
    oof = np.zeros(len(df))  # one out-of-fold prediction per row of the frame passed in
    for fold in tqdm(range(n_fold)):
        val_ind = df[df.kfold == fold].index
        train = df[df.kfold != fold].reset_index(drop=True)
        valid = df[df.kfold == fold].reset_index(drop=True)

        features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]
        
        for col in cat_features:
            encode = train.groupby(col)['target'].median()
            train[col] = train[col].map(encode)
            valid[col] = valid[col].map(encode)

        X_train = train[features]
        y_train = train['target']
        X_valid = valid[features]
        y_valid = valid['target']

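        model = lightgbm.LGBMRegressor(random_state=42, objective='regression', metric='rmse')
        # no early stopping is used, so the eval_set is not needed; the fit-time
        # `verbose` argument is also no longer accepted by newer LightGBM releases
        model.fit(X_train, y_train)
        preds = model.predict(valid[features])
        mse = mean_squared_error(y_valid, preds)
        
        oof[val_ind] = preds

        print(f'LGBM fold: {fold}: RMSE:{np.sqrt(mse)}')
    
    # score the out-of-fold predictions against the frame passed in, not the global train_df
    mse_oof = mean_squared_error(df['target'], oof)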
    
    print(f'LGBM Overall RMSE:{np.sqrt(mse_oof)}')
    return np.sqrt(mse_oof)

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
new_row = {'Model':'6.4. Continuous Features and Median Encoding', 'RMSE': cv_evaluation_regression_mod(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_20"></a>
## 6.5. Continuous Features and Standard Deviation Encoding
Features that are used:
* Continuous features
* Standard Deviation encoding on categorical features

In [None]:
def cv_evaluation_regression_mod(df, n_fold):
    oof = np.zeros(len(df))  # one out-of-fold prediction per row of the frame passed in
    for fold in tqdm(range(n_fold)):
        val_ind = df[df.kfold == fold].index
        train = df[df.kfold != fold].reset_index(drop=True)
        valid = df[df.kfold == fold].reset_index(drop=True)

        features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]
        
        for col in cat_features:
            encode = train.groupby(col)['target'].std()
            train[col] = train[col].map(encode)
            valid[col] = valid[col].map(encode)

        X_train = train[features]
        y_train = train['target']
        X_valid = valid[features]
        y_valid = valid['target']

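        model = lightgbm.LGBMRegressor(random_state=42, objective='regression', metric='rmse')
        # no early stopping is used, so the eval_set is not needed; the fit-time
        # `verbose` argument is also no longer accepted by newer LightGBM releases
        model.fit(X_train, y_train)
        preds = model.predict(valid[features])
        mse = mean_squared_error(y_valid, preds)
        
        oof[val_ind] = preds

        print(f'LGBM fold: {fold}: RMSE:{np.sqrt(mse)}')
    
    # score the out-of-fold predictions against the frame passed in, not the global train_df
    mse_oof = mean_squared_error(df['target'], oof)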
    
    print(f'LGBM Overall RMSE:{np.sqrt(mse_oof)}')
    return np.sqrt(mse_oof)

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
new_row = {'Model':'6.5. Continuous Features and Standard Deviation Encoding', 'RMSE': cv_evaluation_regression_mod(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_21"></a>
## 6.6. Continuous Features and Skewness Encoding
Features that are used:
* Continuous features
* Skewness encoding on categorical features
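
As a minimal sketch of what skewness encoding produces, the toy example below (hypothetical values, not taken from the competition data) replaces each category with the sample skewness of `target` within that category. Note that pandas returns `NaN` for groups with fewer than three rows, so very rare categories end up as missing values after the mapping.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    'cat0': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'target': [1.0, 2.0, 3.0, 10.0, 4.0, 5.0, 6.0, 7.0],
})

# Sample skewness of the target within each category.
skew_by_cat = toy.groupby('cat0')['target'].skew()

# Replace each row's category with its group's skewness.
toy['cat0_skew'] = toy['cat0'].map(skew_by_cat)

print(toy)
```

Category `A` has a long right tail and gets a positive value, the symmetric category `B` gets zero, and the single-row category `C` becomes `NaN`.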

In [None]:
def cv_evaluation_regression_mod(df, n_fold):
    oof = np.zeros(len(df))  # one out-of-fold prediction per row
    for fold in tqdm(range(n_fold)):
        val_ind = df[df.kfold == fold].index
        train = df[df.kfold != fold].reset_index(drop=True)
        valid = df[df.kfold == fold].reset_index(drop=True)

        features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]
        
        for col in cat_features:
            encode = train.groupby(col)['target'].skew()
            train[col] = train[col].map(encode)
            valid[col] = valid[col].map(encode)

        X_train = train[features]
        y_train = train['target']
        X_valid = valid[features]
        y_valid = valid['target']

        model = lightgbm.LGBMRegressor(random_state=42, objective='regression', metric='rmse')
        model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
        preds = model.predict(valid[features])
        mse = mean_squared_error(y_valid, preds)
        
        oof[val_ind] = preds

        print(f'LGBM fold: {fold}: RMSE:{np.sqrt(mse)}')
    
    mse_oof = mean_squared_error(df['target'], oof)
    
    print(f'LGBM Overall RMSE:{np.sqrt(mse_oof)}')
    return np.sqrt(mse_oof)

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
new_row = {'Model':'6.6. Continuous Features and Skewness Encoding', 'RMSE': cv_evaluation_regression_mod(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_16"></a>
## 6.7. Continuous Features, Mean, Median, Min, Max and Std Dev Encoding
Features that are used:
* Continuous features
* Mean encoding on categorical features
* New Features:
    * Minimum encoding on categorical features
    * Maximum encoding on categorical features
    * Median encoding on categorical features
    * Standard deviation encoding on categorical features
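
The combination of several target statistics can also be sketched with a single `groupby(...).agg(...)` call. The helper below, `add_target_stats`, is a hypothetical name introduced only for this illustration, and the data is a toy frame. As in the notebook's approach, the statistics are fitted on the training fold only and then applied to the validation fold.

```python
import pandas as pd

# Toy folds for illustration only (hypothetical values).
train = pd.DataFrame({
    'cat0': ['A', 'A', 'B', 'B'],
    'target': [1.0, 3.0, 10.0, 20.0],
})
valid = pd.DataFrame({'cat0': ['A', 'B', 'C']})


def add_target_stats(train, valid, col,
                     stats=('mean', 'median', 'min', 'max', 'std')):
    """Hypothetical helper: add one encoded column per statistic."""
    # Fit the per-category statistics on the training fold only.
    table = train.groupby(col)['target'].agg(list(stats))
    table.columns = [f'{col}_{s}' for s in stats]
    # Look the statistics up for every row of both folds.
    return train.join(table, on=col), valid.join(table, on=col)


train_enc, valid_enc = add_target_stats(train, valid, 'cat0')
print(valid_enc)
```

Category `C` does not occur in the training fold, so all of its encoded columns are `NaN` in the validation fold.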

In [None]:
def cv_evaluation_regression_mod(df, n_fold):
    oof = np.zeros(len(df))  # one out-of-fold prediction per row
    for fold in tqdm(range(n_fold)):
        val_ind = df[df.kfold == fold].index
        train = df[df.kfold != fold].reset_index(drop=True)
        valid = df[df.kfold == fold].reset_index(drop=True)

        # Add min, max, median and std-dev target encodings as new columns,
        # fitted on the training fold only.
        for col in cat_features:
            encode = train.groupby(col)['target'].min()
            train[col+'min'] = train[col].map(encode)
            valid[col+'min'] = valid[col].map(encode)

        for col in cat_features:
            encode = train.groupby(col)['target'].max()
            train[col+'max'] = train[col].map(encode)
            valid[col+'max'] = valid[col].map(encode)

        for col in cat_features:
            encode = train.groupby(col)['target'].median()
            train[col+'med'] = train[col].map(encode)
            valid[col+'med'] = valid[col].map(encode)

        for col in cat_features:
            encode = train.groupby(col)['target'].std()
            train[col+'std_dev'] = train[col].map(encode)
            valid[col+'std_dev'] = valid[col].map(encode)

        # Replace the original categorical columns with their mean encoding.
        for col in cat_features:
            encode = train.groupby(col)['target'].mean()
            train[col] = train[col].map(encode)
            valid[col] = valid[col].map(encode)

        # Build the feature list after encoding so the new columns are included.
        features = [feature for feature in train.columns if feature not in ['id', 'target', 'kfold']]

        X_train = train[features]
        y_train = train['target']
        X_valid = valid[features]
        y_valid = valid['target']

        model = lightgbm.LGBMRegressor(random_state=42, objective='regression', metric='rmse')
        model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
        preds = model.predict(valid[features])
        mse = mean_squared_error(y_valid, preds)
        
        oof[val_ind] = preds

        print(f'LGBM fold: {fold}: RMSE:{np.sqrt(mse)}')
    
    mse_oof = mean_squared_error(df['target'], oof)
    
    print(f'LGBM Overall RMSE:{np.sqrt(mse_oof)}')
    return np.sqrt(mse_oof)

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
new_row = {'Model':'6.7. Continuous Features, Mean, Median, Min, Max and Std Dev Encoding', 'RMSE': cv_evaluation_regression_mod(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="base_cat_encode"></a>
# 7. Base Regression Model and Categorical Encoding
<a id="modified_model_17"></a>
## 7.1. Continuous Features and Percentage Categorical Encoding
Features that are used:
* Continuous features
* Categorical encoding based on the percentage (relative frequency) of each category
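
A minimal sketch of percentage (frequency) encoding on a toy column is shown below; the values are hypothetical. It uses `value_counts(normalize=True)`, which gives the same ratios as dividing the counts by the number of rows.

```python
import pandas as pd

# Toy column for illustration only (hypothetical values).
toy = pd.DataFrame({'cat0': ['A', 'A', 'B', 'B', 'C']})

# Share of rows that fall into each category.
freq = toy['cat0'].value_counts(normalize=True)

# Replace each category with its share.
toy['cat0_freq'] = toy['cat0'].map(freq)

print(toy)
```

One property worth keeping in mind: categories with equal frequency (`A` and `B` here, both 0.4) receive the same encoded value, so the model can no longer tell them apart.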

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
for col in cat_features:
    rate_df = train_df[col].value_counts() / len(train_df)
    train_df[col] = train_df[col].map(rate_df)
    
new_row = {'Model':'7.1. Continuous Features and Percentage Categorical Encoding', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_18"></a>
## 7.2. Continuous Features and Sum Categorical Encoding
Features that are used:
* Continuous features
* Categorical encoding based on the percentage (relative frequency) of each category
* New Features:
    * Sum of encoded categorical features
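
The sum feature can be sketched on a toy frame with two categorical columns (hypothetical values): each column is frequency-encoded first, and the encoded values are then added row by row.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    'cat0': ['A', 'A', 'A', 'B'],
    'cat1': ['X', 'Y', 'Y', 'Y'],
})
cat_cols = ['cat0', 'cat1']

# Frequency-encode every categorical column in place.
for col in cat_cols:
    freq = toy[col].value_counts(normalize=True)
    toy[col] = toy[col].map(freq)

# Row-wise sum of the encoded columns.
toy['sum_cat'] = toy[cat_cols].sum(axis=1)

print(toy)
```

Rows made up of common categories get a larger sum, but different category combinations can produce the same value, as the first and last rows do here.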

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
for col in cat_features:
    rate_df = train_df[col].value_counts() / len(train_df)
    train_df[col] = train_df[col].map(rate_df)
train_df['sum_cat'] = train_df[cat_features].sum(axis=1)

new_row = {'Model':'7.2. Continuous Features and Sum Categorical Encoding', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="modified_model_22"></a>
## 7.3. Continuous Features and Multiplication Categorical Encoding
Features that are used:
* Continuous features
* Categorical encoding based on the percentage (relative frequency) of each category
* New Features:
    * Multiplication of encoded categorical features
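
As a small sketch, the row-wise product of the encoded columns can be built either by multiplying one column at a time or with `DataFrame.prod(axis=1)`; both give the same values. The frame below is a toy example with hypothetical, already frequency-encoded values.

```python
import numpy as np
import pandas as pd

# Hypothetical, already frequency-encoded columns (toy values).
toy = pd.DataFrame({
    'cat0': [0.75, 0.75, 0.75, 0.25],
    'cat1': [0.25, 0.75, 0.75, 0.75],
})
cat_cols = ['cat0', 'cat1']

# Loop form: start from 1 and multiply in one column at a time.
loop_product = pd.Series(1.0, index=toy.index)
for col in cat_cols:
    loop_product = loop_product * toy[col]

# Vectorized form: row-wise product across the same columns.
vector_product = toy[cat_cols].prod(axis=1)

same = np.allclose(loop_product, vector_product)
print(vector_product.tolist(), same)
```

If the categorical features were independent, this product would estimate how common a row's full category combination is, so small values flag rare combinations.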

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
train_df = create_stratified_folds_for_regression(train_df)
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])
    
for col in cat_features:
    rate_df = train_df[col].value_counts() / len(train_df)
    train_df[col] = train_df[col].map(rate_df)
train_df['cont_multiply'] = train_df[cat_features].prod(axis=1)
    
new_row = {'Model':'7.3. Continuous Features and Multiplication Categorical Encoding', 'RMSE': cv_evaluation_regression(train_df, 5)}
result = pd.concat([result, pd.DataFrame([new_row])], ignore_index=True)

[back to top](#table-of-contents)
<a id="summary"></a>
# 8. Summary

**Observations:**
- `6.4. Continuous Features and Median Encoding` achieves the best score (lowest RMSE) of all the feature engineering approaches tried.
- Many of the feature engineering approaches fail to beat the baseline model's RMSE of `0.846526`.
- Only `6 models` beat the baseline, and each by only a small margin.
- Overall, these engineered features may not help the model improve its RMSE; they appear to add noise rather than useful signal.
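
The comparison against the baseline can be sketched as below. The baseline RMSE is the value quoted in the observations; the rows in `result_demo` are placeholders invented for this illustration and are not the notebook's results. It is assumed that the real `result` frame has the same `Model` and `RMSE` columns.

```python
import pandas as pd

BASELINE_RMSE = 0.846526  # baseline value quoted in the observations

# Placeholder rows for illustration only; these are NOT the notebook's results.
result_demo = pd.DataFrame({
    'Model': ['variant_a', 'variant_b', 'variant_c'],
    'RMSE': [0.8460, 0.8470, 0.8465],
})

# Negative delta means the model beats the baseline (lower RMSE is better).
result_demo['delta'] = result_demo['RMSE'] - BASELINE_RMSE
beats_baseline = result_demo[result_demo['delta'] < 0].sort_values('RMSE')
n_better = len(beats_baseline)

print(beats_baseline)
print(f'{n_better} model(s) beat the baseline')
```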

The current results for all models are shown below:

In [None]:
result.sort_values('RMSE')