# Problem definition

From description:

"The dataset is used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the category on an eCommerce product given various attributes about the listing. Although the features are anonymized, they have properties relating to real-world features."


See notebooks using R:

1. [Finding the best pre-processing configuration and predictive models based on the original data](https://www.kaggle.com/gomes555/tps-may2021-r-eda-tidymodels-workflowsets/)
2. [Create the DAE dataset and fit models on the DAE data](https://www.kaggle.com/gomes555/tps-may2021-r-dae-keras)
3. [Stacking all](https://www.kaggle.com/gomes555/tps-may2021-r-tidymodels-stacks/)

Notebooks using Python:

1. **LightGBM sequential (step-wise) tuning with Optuna's LightGBM Tuner**
2. [LightGBM tuning with Optuna TPE (Tree-structured Parzen Estimator)](https://www.kaggle.com/gomes555/tps-may2021-optuna-lightgbm-tpe/)
3. [LightGBM one-vs-rest tuning with Optuna's step-wise LightGBM Tuner](https://www.kaggle.com/gomes555/tps-may2021-optuna-tuner-one-x-rest/)
4. [LightGBM pseudo-label tuning with Optuna's LightGBM Tuner](https://www.kaggle.com/gomes555/tps-may2021-lightgbm-pseudolabel/)
5. [Stacking All](https://www.kaggle.com/gomes555/tps-may2021-stacking)

All notebooks will be public, and suggestions and criticism are very welcome!


<br>

<p align="right"><span style="color:firebrick">Don't forget to upvote if you liked the notebook! <i class="fas fa-hand-peace"></i></span> </p>

# Dependencies

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

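# Optuna's LightGBM Tuner: lgb.train() below tunes hyperparameters step-wise
# instead of fitting a single model with fixed parameters.
import optuna.integration.lightgbm as lgb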

import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances
from optuna.integration import LightGBMPruningCallback

from tqdm import tqdm

In [None]:
train=pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test=pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
sub=pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')

# Prepare data

In [None]:
conditions = [
    (train.target == "Class_1"),
    (train.target == "Class_2"),
    (train.target == "Class_3"),
    (train.target == "Class_4")
]
choices = [0, 1, 2, 3]
train["target"] = np.select(conditions, choices)
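The same encoding can also be written as a single dictionary lookup. This is a minimal sketch, assuming the labels are exactly `Class_1` to `Class_4`; the small `demo` frame is a made-up stand-in for `train`. One reason to consider it: `np.select` falls back to its default value of 0 for any row that matches no condition, so an unexpected label would silently become class 0, whereas `map` leaves it as `NaN` where it can be caught.

```python
import pandas as pd

# Hypothetical stand-in for the competition's train.target column.
demo = pd.DataFrame({"target": ["Class_1", "Class_3", "Class_2", "Class_4"]})

# Map "Class_1".."Class_4" to 0..3, the integer labels the multiclass objective expects.
class_to_int = {f"Class_{i + 1}": i for i in range(4)}
encoded = demo["target"].map(class_to_int)

# A label outside the mapping becomes NaN, so fail loudly instead of guessing.
assert encoded.notna().all(), "found a label outside Class_1..Class_4"
encoded = encoded.astype("int64")

print(encoded.tolist())  # [0, 2, 1, 3]
```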

In [None]:
X_test = test.drop(['id'], axis=1)
X = train.drop(['id', 'target'], axis=1)
y = train.target

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=42)

In [None]:
dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val)

In [None]:
params = {
    "objective": "multiclass",
    "num_class": 4,
    "metric": "multi_logloss",
    "verbosity": -1,
    "boosting_type": "gbdt",
    'learning_rate': 0.02,
    'random_state': 314
    }
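The `multi_logloss` metric set above is the mean negative log-probability assigned to the true class. The sketch below computes it in plain NumPy to make the definition concrete; the function name and the clipping constant `eps` are illustrative assumptions, and LightGBM's own implementation may clip differently.

```python
import numpy as np

def multi_logloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability of the true class (illustrative helper)."""
    # Clip to avoid log(0) when a model is fully confident and wrong.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    rows = np.arange(len(y_true))
    # Pick, for each row, the probability predicted for its true class.
    true_class_proba = proba[rows, np.asarray(y_true)]
    return float(-np.mean(np.log(true_class_proba)))

# Toy data: four rows, one per class, scored by a model that always predicts 25% each.
y_true = np.array([0, 1, 2, 3])
uniform = np.full((4, 4), 0.25)

print(multi_logloss(y_true, uniform))  # equals log(4), about 1.386
```

A uniform guess over four classes scores log(4), which is a useful baseline: any model worth keeping should come in below that on the validation set.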

In [None]:
booster = lgb.train(params, 
                    dtrain, valid_sets=dval,
                    verbose_eval=0,
                    early_stopping_rounds=70
                   )

In [None]:
booster.params

In [None]:
y_pred = booster.predict(X_test, num_iteration=booster.best_iteration)

In [None]:
sub=pd.concat([
    test.id,
    pd.DataFrame(y_pred, columns = ['Class_1', 'Class_2', 'Class_3', 'Class_4'])
], axis=1)

sub.to_csv('lgbm_tuner.csv', index=False)

# DAE 

The DAE (denoising autoencoder) features used here were produced by the notebook built with the keras library in R: <https://www.kaggle.com/gomes555/tps-may2021-r-dae-keras>

In [None]:
dae_train=pd.read_csv('../input/tps-may2021-r-dae-keras/daeta_train.csv')
dae_test=pd.read_csv('../input/tps-may2021-r-dae-keras/daeta_test.csv')

In [None]:
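# Extract the class digit from "Class_<n>" and shift to 0-based labels.
# Raw string avoids the invalid-escape warning; expand=False returns a Series.
dae_train["target"] = dae_train.target.str.extract(r"(\d)", expand=False).astype("int64") - 1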

In [None]:
X_test = dae_test.drop(['id'], axis=1)
X = dae_train.drop(['id', 'target'], axis=1)
y = dae_train.target

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=42)

In [None]:
dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val)

In [None]:
booster = lgb.train(params, 
                    dtrain, valid_sets=dval,
                    verbose_eval=0,
                    early_stopping_rounds=70
                   )

In [None]:
booster.params

In [None]:
y_pred_dae = booster.predict(X_test.drop('target', axis=1), num_iteration=booster.best_iteration)

In [None]:
sub=pd.concat([
    test.id,
    pd.DataFrame(y_pred_dae, columns = ['Class_1', 'Class_2', 'Class_3', 'Class_4'])
], axis=1)

sub.to_csv('lgbm_tuner_dae.csv', index=False)

# Blending

In [None]:
# Equal-weight average of the two models' class probabilities.
y_blend = (y_pred + y_pred_dae) / 2

In [None]:
sub=pd.concat([
    test.id,
    pd.DataFrame(y_blend, columns = ['Class_1', 'Class_2', 'Class_3', 'Class_4'])
], axis=1)

sub.to_csv('lgbm_tuner_blend.csv', index=False)
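The blend above gives both models equal weight. If one model validates better, a weighted average is a natural extension. The following is only a sketch: the `blend` helper, its `weights` argument, and the toy probability arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def blend(preds, weights=None):
    """Weighted average of several (n_rows, n_classes) probability arrays."""
    stacked = np.stack(preds)  # shape: (n_models, n_rows, n_classes)
    if weights is None:
        weights = np.ones(len(stacked))  # default: equal weights
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so rows still sum to 1
    # Contract the model axis: result has shape (n_rows, n_classes).
    return np.tensordot(weights, stacked, axes=1)

# Toy predictions from two hypothetical models on two rows.
p_a = np.array([[0.7, 0.1, 0.1, 0.1], [0.2, 0.4, 0.2, 0.2]])
p_b = np.array([[0.5, 0.3, 0.1, 0.1], [0.2, 0.2, 0.4, 0.2]])

equal = blend([p_a, p_b])              # same as (p_a + p_b) / 2
weighted = blend([p_a, p_b], [3, 1])   # 75% model A, 25% model B
print(equal)
print(weighted)
```

Because the weights are normalised, each blended row remains a valid probability distribution. The weights themselves are best chosen on held-out data rather than on the leaderboard.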