Thank you for reading this notebook. I'm new to Kaggle and machine-learning algorithms, and this is my second competition after TPS-January. I didn't use any special techniques, just the GBDT libraries I found to be common on Kaggle (LightGBM, XGBoost, and CatBoost). In this notebook I wrote down the basic workflow I used in this competition. I don't suppose it will interest those who are already familiar with Kaggle, but I would appreciate it if you could read it and give me some advice. I'd also be glad if this notebook helps other beginners.

**Import modules and dataset →**

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import optuna
from tqdm.notebook import tqdm
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold
from lightgbm import LGBMRegressor, plot_importance

train = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2021/train.csv")
test = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2021/test.csv")

cont_features = [f for f in train.columns.tolist() if f.startswith('cont')]
cat_features = [f for f in train.columns.tolist() if f.startswith('cat')]
features = cat_features + cont_features
data = train[features]
target = train['target']

all_data = pd.concat([data, test])
```

# Feature Engineering

I did some light feature engineering.
Histograms of the cont features show multiple components. For instance, cont1 has 7 discrete peaks, as shown below. I thought these characteristics could be used as additional features.
So, I tried `sklearn.mixture.GaussianMixture` to divide each feature into several groups [Ref: [Notebooks of TPS-Jan. by Dave E](https://www.kaggle.com/davidedwards1/jan21-tabplayground-nn-final-fewer-features)].

See also https://scikit-learn.org/stable/modules/mixture.html#gmm for Gaussian Mixture Models.
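
To show what the grouping does on a single feature, here is a minimal sketch on synthetic data. The three peak positions and the sample sizes are made up for illustration and are not taken from the competition data; only `GaussianMixture` and its `means_init` argument are the same as in my code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D feature with three well-separated peaks (illustrative values).
rng = np.random.default_rng(0)
x = np.concatenate([
    rng.normal(0.2, 0.03, 500),
    rng.normal(0.5, 0.03, 500),
    rng.normal(0.8, 0.03, 500),
]).reshape(-1, 1)

# Peak positions read off a histogram by eye are passed as initial means.
means_init = np.array([0.2, 0.5, 0.8])[:, None]
gmm = GaussianMixture(n_components=3, means_init=means_init, random_state=0).fit(x)

# Each sample gets the index of the component it most likely belongs to.
labels = gmm.predict(x)
print(gmm.means_.ravel())
print(np.bincount(labels))
```

The predicted label is a small integer per row, which is what I store in the `*_gmm` columns.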

The scatter plots below show the cont-feature values against the target, colored by the GMM groups.
The histograms at the bottom also show the GMM groups.

```python
fig, ax = plt.subplots(5, 3, figsize=(14, 24))
for i, feature in enumerate(cont_features):
    plt.subplot(5, 3, i+1)
    sns.histplot(all_data[feature][::100], 
                 color="blue", 
                 kde=True, 
                 bins=100)
    plt.xlabel(feature, fontsize=9)
plt.show()
```

```python
inits = [[0.3, 0.5, 0.7, 0.9], 
         [0.039, 0.093, 0.24, 0.29, 0.35, 0.42, 0.49, 0.56, 0.62, 0.66, 0.76],
         [0.176, 0.322, 0.416, 0.495, 0.548, 0.618, 0.707, 0.937],
         [0.2, 0.35, 0.44, 0.59, 0.75, 0.83],
         [0.28, 0.31, 0.42, 0.5, 0.74, 0.85],
         [0.25, 0.38, 0.43, 0.58, 0.75, 0.9],
         [0.34, 0.48, 0.7, 0.88],
         [0.25, 0.29, 0.35, 0.48, 0.61, 0.68, 0.78, 0.9],
         [0.11, 0.2, 0.3, 0.35, 0.45, 0.6, 0.76, 0.9],
         [0.22, 0.32, 0.38, 0.44, 0.53, 0.63, 0.71, 0.81, 0.87],
         [0.19, 0.27, 0.37, 0.46, 0.56, 0.61, 0.71, 0.86],
         [0.23, 0.35, 0.52, 0.7, 0.84],
         [0.27, 0.32, 0.35, 0.49, 0.63, 0.7, 0.79, 0.88],
         [0.22, 0.29, 0.35, 0.4, 0.47, 0.58, 0.68, 0.72, 0.8]]
gmms = []
for feature, init in zip(cont_features, inits):
    X_ = np.array(all_data[feature].tolist()).reshape(-1, 1)
    means_init = np.array(init)[:,None]
    gmm_ = GaussianMixture(n_components=len(init),
                           means_init=means_init,
                           random_state=0).fit(X_)
    gmms.append(gmm_)
    preds = gmm_.predict(X_)
    all_data[f'{feature}_gmm'] = preds
    train[f'{feature}_gmm'] = preds[:len(train)]
    test[f'{feature}_gmm'] = preds[len(train):]
```

```python
fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(cont_features):
    plt.subplot(5, 3, i+1)
    sns.scatterplot(x=feature, 
                    y="target", 
                    data=train[::150], 
                    hue=f'{feature}_gmm', 
                    palette='muted')
    plt.xlabel(feature, fontsize=9)
plt.show()
```

```python
fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(cont_features):
    plt.subplot(5, 3, i+1)
    sns.histplot(x=feature, 
                 data=train[::150], 
                 hue=f'{feature}_gmm', 
                 kde=True, 
                 bins=100, 
                 palette='muted')
    plt.xlabel(feature, fontsize=9)
plt.show()
```

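For each GMM group, I calculated the mean and standard deviation of the feature, and added each value's standardized deviation from its group mean (a z-score within the group) as a new feature.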
```python
for feature in cont_features:
    mu = all_data.groupby(f'{feature}_gmm')[feature].transform("mean")
    sigma = all_data.groupby(f'{feature}_gmm')[feature].transform("std")
    
    train[f'{feature}_gmm_dev'] = (train[feature] - mu[:len(train)])/sigma[:len(train)]
    test[f'{feature}_gmm_dev'] = (test[feature] - mu[len(train):])/sigma[len(train):]
```
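
As a small check of what the `*_gmm_dev` columns contain, here is the same calculation on a toy DataFrame. The column names `cont` and `cont_gmm` and the six values are hypothetical, chosen only so that the result is easy to verify by hand.

```python
import pandas as pd

# Toy data: two groups of three values each (hypothetical numbers).
df = pd.DataFrame({
    "cont": [0.10, 0.12, 0.14, 0.50, 0.54, 0.58],
    "cont_gmm": [0, 0, 0, 1, 1, 1],
})

# transform() returns one value per row, aligned with the original rows.
grouped = df.groupby("cont_gmm")["cont"]
mu = grouped.transform("mean")
sigma = grouped.transform("std")  # sample standard deviation (ddof=1)

# Standardized deviation of each value from its own group's mean.
df["cont_gmm_dev"] = (df["cont"] - mu) / sigma
print(df)
```

Within each group the new column comes out as roughly -1, 0, and 1, so values are now measured relative to the peak they belong to rather than on the original scale.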

For categorical features, I used label-encoding (`sklearn.preprocessing.LabelEncoder`).
```python
for feature in cat_features:
    le = LabelEncoder()
    le.fit(train[feature])
    train[feature] = le.transform(train[feature])
    test[feature] = le.transform(test[feature])
    
features = [col for col in train.columns.to_list() if col not in ['id','target']]
```
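
One thing to be careful about: `LabelEncoder.transform` raises a `ValueError` for labels it did not see during `fit`. I am not claiming this happens in this dataset, but if the test set ever had a category missing from train, fitting on both sets together would avoid the error. A minimal sketch with made-up categories:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category columns; "D" appears only in the test set.
train_cat = pd.Series(["A", "B", "A", "C"])
test_cat = pd.Series(["B", "D"])

# Fitting on train alone fails when an unseen label shows up in test.
le = LabelEncoder().fit(train_cat)
try:
    le.transform(test_cat)
    unseen_error = False
except ValueError:
    unseen_error = True

# Fitting on the concatenation covers every label in both sets.
le_all = LabelEncoder().fit(pd.concat([train_cat, test_cat]))
train_codes = le_all.transform(train_cat)
test_codes = le_all.transform(test_cat)
print(unseen_error, train_codes, test_codes)
```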

# Hyperparameter Tuning


I learned that hyperparameter tuning is necessary to improve scores.
Here is an example of tuning LightGBM with Optuna.
I don't really know which parameters to tune or what ranges to use (I don't even know what each parameter means 😥). Please let me know if I'm missing the point.

```python
from lightgbm import early_stopping

def objective(trial, data=train[features], target=target):
    # Hold out 20% of the training data to score each trial.
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.2, random_state=41)
    param = {
        'metric': 'rmse', 
        'random_state': 41,
        'n_estimators': 20000,
        'learning_rate': 0.01,
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-2, 100., log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.01, 1.0),
        'subsample': trial.suggest_categorical('subsample', [0.2,0.3,0.4,0.5,0.6,0.7,0.8,1.0]),
        'subsample_freq': trial.suggest_int('subsample_freq', 1, 20),
        'max_depth': trial.suggest_categorical('max_depth', [-1,30,100,300]),
        'num_leaves' : trial.suggest_int('num_leaves', 2, 500),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 200),
        'min_child_weight': trial.suggest_float('min_child_weight', 1e-3, 10, log=True),
        'cat_smooth' : trial.suggest_int('cat_smooth', 1, 100)
    }
    
    model = LGBMRegressor(**param)
    # Recent LightGBM releases take early stopping as a callback, not a fit() keyword.
    model.fit(train_x, train_y, eval_set=[(test_x, test_y)],
              callbacks=[early_stopping(stopping_rounds=100, verbose=False)])

    preds = model.predict(test_x)
    # RMSE computed explicitly; the `squared` argument is gone from recent scikit-learn.
    rmse = np.sqrt(mean_squared_error(test_y, preds))
    
    return rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
print('Best trial:', study.best_params)
```

For the hyperparameters of the 3 GBDTs, see also the official documents below.
- https://lightgbm.readthedocs.io/en/latest/Parameters.html 
- https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.plotting 
- https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list

I also found that code examples for Optuna are available on the official GitHub page:
https://github.com/optuna/optuna/tree/master/examples


```python
study.trials_dataframe()
optuna.visualization.plot_param_importances(study)
optuna.visualization.plot_parallel_coordinate(study)
optuna.visualization.plot_contour(study, params=['num_leaves', 'max_depth', 'subsample', 'min_child_samples', 'colsample_bytree'])
optuna.visualization.plot_optimization_history(study)
```
The code above illustrates how the optimization proceeds.\
For the output graphs, have a look at [Hamza's notebook](https://www.kaggle.com/hamzaghanmi/lgbm-hyperparameter-tuning-using-optuna), for instance, and the official documentation:
https://optuna.readthedocs.io/en/stable/reference/visualization/

# Training
- I found that training with different random seeds and averaging the predictions improves PB scores [[Apolo's notebook](https://www.kaggle.com/shkanda/random-seed-averaging-lgb-xgb)]. Is this a kind of ensemble?

- I also found that a small learning rate can improve PB scores [[szdr's notebook](https://www.kaggle.com/shogosuzuki/0-69713-lightgbm-with-small-learning-rate)]. So I didn't tune the learning rate with Optuna.

The code looks like this. I used [Tawara's notebook](https://www.kaggle.com/ttahara/tps-feb-2021-3gbdts-ensemble-baseline) as a guide.

```python
from lightgbm import early_stopping, log_evaluation

NUM_FOLDS = 10
seed_list = [0,1,2]

# Accumulators for the seed-averaged test and out-of-fold predictions.
test_pred = np.zeros(len(test))
val_pred = np.zeros(len(train))

for seed in tqdm(seed_list):
    tmp_test_pred = np.zeros(len(test))
    tmp_val_pred = np.zeros(len(train))
    kf = KFold(n_splits=NUM_FOLDS, shuffle=True, random_state=seed)
    for f, (train_idx, val_idx) in tqdm(enumerate(kf.split(train[features], target))):
        print("*" * 20)
        print(f"Seed-#{seed};  Fold-#{f}")
        train_x, val_x = train.iloc[train_idx][features], train.iloc[val_idx][features]
        train_y, val_y = target.iloc[train_idx], target.iloc[val_idx]

        model = LGBMRegressor(metric = 'rmse',
                              random_state=seed, 
                              learning_rate = 0.002,
                              n_estimators = 20000,
                              **study.best_params)
        # Early stopping and logging are passed as callbacks; recent LightGBM
        # releases no longer accept them as fit() keywords.
        model.fit(train_x, train_y, eval_set=[(val_x, val_y)],
                  callbacks=[early_stopping(stopping_rounds=100),
                             log_evaluation(period=5000)])
    
        temp_oof = model.predict(val_x)
        temp_test = model.predict(test[features])

        tmp_test_pred += temp_test
        tmp_val_pred[val_idx] = temp_oof
        print(np.sqrt(mean_squared_error(val_y, temp_oof)))
    
    print("*" * 20)
    print(f"Seed-#{seed}\n{np.sqrt(mean_squared_error(target, tmp_val_pred))}")
    val_pred += tmp_val_pred
    test_pred += tmp_test_pred / NUM_FOLDS

val_pred /= len(seed_list)
test_pred /= len(seed_list)
print("*" * 20)
print(np.sqrt(mean_squared_error(target, val_pred)))
```
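
To convince myself that seed averaging behaves like a small ensemble, I find a toy comparison helpful. The sketch below uses scikit-learn's `GradientBoostingRegressor` on synthetic data as a stand-in for LightGBM (an assumption made to keep it fast and self-contained), with `subsample=0.5` so that the random seed actually changes the model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Train the same model with different seeds; row subsampling makes them differ.
seeds = [0, 1, 2]
preds = []
for seed in seeds:
    model = GradientBoostingRegressor(n_estimators=50, subsample=0.5, random_state=seed)
    model.fit(X_tr, y_tr)
    preds.append(model.predict(X_val))

single_rmses = [rmse(y_val, p) for p in preds]
averaged_rmse = rmse(y_val, np.mean(preds, axis=0))
print(single_rmses, averaged_rmse)
```

The RMSE of the averaged prediction can never exceed the mean of the individual RMSEs (this follows from the triangle inequality), so averaging over seeds is an ensemble of equally weighted models that differ only in their randomness.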

# Ensemble
Ensembling different models was necessary to improve scores [e.g., [Somayyeh Gholami's notebook](https://www.kaggle.com/somayyehgholami/comparative-method-tabular-feb-301)].
I repeated the above process with other GBDT models and ensembled them.

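```python
# Predictions saved from earlier runs of each model.
lgbm_1 = pd.read_csv("../input/tps2-submissions/submission1.csv")
lgbm_2 = pd.read_csv("../input/tps2-submissions/submission2.csv")
xgb_1 = pd.read_csv("../input/tps2-submissions/submission3.csv")
cat_1 = pd.read_csv("../input/tps2-submissions/submission4.csv")

# Template for the output file (path assumed to sit next to train.csv and test.csv).
sample_submission = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2021/sample_submission.csv")

models = [lgbm_1, lgbm_2, xgb_1, cat_1]
weights = [10., 5., 2., 1.]
sample_submission.target = 0.0

# Weighted average; dividing by sum(weights) normalizes the weights to 1.
for model, weight in zip(models, weights):
    sample_submission.target += weight * model.target / sum(weights)

sample_submission.to_csv('sub-ensemble.csv', index=False)
```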

The weights were chosen arbitrarily.
How can I determine appropriate ensemble weights quantitatively?
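
One approach I have seen suggested is to fit the weights on out-of-fold predictions (like `val_pred` in the training section) by minimizing the validation error. The sketch below assumes such predictions were saved for every model, which my code above does not do for the other GBDTs. The arrays are synthetic stand-ins, and `scipy.optimize.minimize` with SLSQP is just one possible optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-ins: a target and three models' out-of-fold predictions
# with different noise levels (hypothetical data).
rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)
oof = np.column_stack([y_true + rng.normal(scale=s, size=1000)
                       for s in (0.5, 0.7, 1.0)])

def blend_mse(w):
    # Mean squared error of the weighted blend of the models.
    return np.mean((oof @ w - y_true) ** 2)

n_models = oof.shape[1]
result = minimize(
    blend_mse,
    x0=np.full(n_models, 1.0 / n_models),          # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_models,                # non-negative weights
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},  # sum to 1
)
weights = result.x

single_rmse = np.sqrt(np.mean((oof - y_true[:, None]) ** 2, axis=0))
blended_rmse = np.sqrt(blend_mse(weights))
print(weights, single_rmse, blended_rmse)
```

Because the weights are fitted on predictions for rows the models were not trained on, they are less likely to simply reward the most overfitted model. They can still overfit the validation data a little, so I would treat the result as a starting point.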

## Tip for beginners like me

A Kaggle notebook goes idle if there is no interaction for roughly 40 minutes. When it goes idle, the running calculation is cancelled. Because some calculations, such as an Optuna study, take more than 40 minutes, this frustrated me.
In such a situation, I used "Save & Run All".
Although the outputs of each cell don't show up during the run, all the output can be seen after the run finishes.
In the meantime I can't check the outputs, so it's useful to post the RMSEs to Slack.
Note that the total elapsed time must be within 9 hours. Otherwise, nothing will be saved.

https://api.slack.com/messaging/webhooks

Is there a good way to run a long calculation?

```python
import json
import urllib.request

def post_slack(message):
    url = 'https://hooks.slack.com/services/<<YOUR_SLACK_URL>>'
    headers = {'Content-Type': 'application/json'}
    data = {"channel": "#general",
            "username": "webhookbot", 
            "text": message, 
            "icon_emoji": ":ghost:"}

    req = urllib.request.Request(url, json.dumps(data).encode(), headers)
    with urllib.request.urlopen(req) as res:
        body = res.read()

# Example call, meant to be placed inside the training loop where `f` and `rmse` are defined.
post_slack(f"Fold-{f} finished.\nRMSE: {rmse}")
```

### Thank you very much for reading!