# Problem definition

From the competition description:

"The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the category on an eCommerce product given various attributes about the listing. Although the features are anonymized, they have properties relating to real-world features."


See notebooks using R:

1. [Finding the best pre-processing configuration and predictive models based on the original data](https://www.kaggle.com/gomes555/tps-may2021-r-eda-tidymodels-workflowsets/)
2. [Create DAE dataset and fit models in DAE data](https://www.kaggle.com/gomes555/tps-may2021-r-dae-keras)
3. [Stacking all](https://www.kaggle.com/gomes555/tps-may2021-r-tidymodels-stacks/)

Notebooks using Python:

1. [LightGBM sequential tuning with Optuna step-wise by LightGBM Tuner](https://www.kaggle.com/gomes555/tps-may2021-optuna-lightgbm-tuner)
2. [LightGBM tuning with Optuna TPE (Tree-structured Parzen Estimator)](https://www.kaggle.com/gomes555/tps-may2021-optuna-lightgbm-tpe/)
3. **LightGBM tuning one-vs-rest with Optuna step-wise by LightGBM Tuner** (this notebook)
4. [LightGBM tuning with pseudo-labels and Optuna Tuner](https://www.kaggle.com/gomes555/tps-may2021-lightgbm-pseudolabel/)
5. [Stacking all](https://www.kaggle.com/gomes555/tps-may2021-stacking)

All notebooks will be public; suggestions and criticism are very welcome!


<br>

<p align="right"><span style="color:firebrick">Don't forget to upvote if you liked the notebook! <i class="fas fa-hand-peace"></i></span> </p>

# Load dependencies

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

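# Optuna's LightGBM Tuner: used in place of lightgbm so that lgb.train()
# also tunes hyperparameters step-wise
import optuna.integration.lightgbm as lgb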

import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances
from optuna.integration import LightGBMPruningCallback

from tqdm import tqdm

In [None]:
train = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')

# Prepare data for one-vs-rest

In [None]:
# Every binary problem uses the same features; only the target changes
X1 = train.drop(['target', 'id'], axis=1).copy()
X2 = train.drop(['target', 'id'], axis=1).copy()
X3 = train.drop(['target', 'id'], axis=1).copy()
X4 = train.drop(['target', 'id'], axis=1).copy()

# Binary targets: 1 if the row belongs to the class, 0 otherwise
y1 = np.where(train.target=="Class_1", 1, 0)
y2 = np.where(train.target=="Class_2", 1, 0)
y3 = np.where(train.target=="Class_3", 1, 0)
y4 = np.where(train.target=="Class_4", 1, 0)

# Hold out 10% for validation; the shared random_state gives all four problems the same row split
X1_train, X1_val, y1_train, y1_val = train_test_split(X1, y1, test_size=0.10, random_state=42)
X2_train, X2_val, y2_train, y2_val = train_test_split(X2, y2, test_size=0.10, random_state=42)
X3_train, X3_val, y3_train, y3_val = train_test_split(X3, y3, test_size=0.10, random_state=42)
X4_train, X4_val, y4_train, y4_val = train_test_split(X4, y4, test_size=0.10, random_state=42)
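
The four hand-written target definitions above follow one pattern: for each class, rows of that class become 1 and all other rows become 0. A minimal, dependency-free sketch of that pattern on a toy label list follows; the helper name `one_vs_rest_targets` and the toy labels are illustrative assumptions, not part of the notebook.

```python
def one_vs_rest_targets(labels, classes):
    """Return {class_name: [1 if label == class_name else 0, ...]}."""
    return {
        cls: [1 if label == cls else 0 for label in labels]
        for cls in classes
    }


# Toy labels standing in for train.target (assumed values, for illustration only)
toy_labels = ["Class_2", "Class_1", "Class_4", "Class_2", "Class_3"]
classes = ["Class_1", "Class_2", "Class_3", "Class_4"]

targets = one_vs_rest_targets(toy_labels, classes)
for cls in classes:
    print(cls, targets[cls])
```

Since every row belongs to exactly one class, the binary targets for a given row sum to 1 across the four classes.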

# Models

In [None]:
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "learning_rate": 0.02,
    "random_state": 314,
}
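
The `binary_logloss` metric set above is the mean negative log-likelihood of the true labels under the predicted probabilities. As a rough illustration, here is a plain-Python sketch of that formula; the function name `binary_logloss`, the clipping constant `eps`, and the toy numbers are assumptions made for this example, not LightGBM internals.

```python
import math


def binary_logloss(y_true, y_prob, eps=1e-15):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Toy example: confident correct predictions give a lower loss than uncertain ones
confident = binary_logloss([1, 0, 1], [0.9, 0.1, 0.8])
uncertain = binary_logloss([1, 0, 1], [0.5, 0.5, 0.5])
print(round(confident, 4), round(uncertain, 4))
```

Predicting 0.5 for every row always yields a loss of ln 2, which is a handy baseline when reading the validation scores of each one-vs-rest model.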

In [None]:
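# LightGBM Tuner tunes hyperparameters step-wise for the Class_1-vs-rest problem.
# Each trial stops once validation log loss has not improved for 70 rounds.
# verbose_eval and early_stopping_rounds are LightGBM 3.x arguments; LightGBM 4 expects callbacks instead.
booster1 = lgb.train(params, 
                     lgb.Dataset(X1_train, label=y1_train),
                     valid_sets=lgb.Dataset(X1_val, label=y1_val),
                     verbose_eval=0,
                     early_stopping_rounds=70)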
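
`early_stopping_rounds=70` means training halts once the validation metric has gone 70 consecutive rounds without improving, and the round with the best score is kept as the best iteration. The sketch below mimics that rule on a made-up loss curve; `best_iteration_with_patience` and the loss values are hypothetical, and this is a simplification rather than LightGBM's actual implementation.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_round, stopped_round), both 1-based, for a lower-is-better metric."""
    best_loss = float("inf")
    best_round = 0
    for round_no, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_no
        elif round_no - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here
            return best_round, round_no
    # Ran out of rounds before the patience was exhausted
    return best_round, len(val_losses)


# Made-up validation losses: improvement stalls after round 3
losses = [0.60, 0.55, 0.52, 0.53, 0.54, 0.53, 0.55]
best_round, stopped_round = best_iteration_with_patience(losses, patience=3)
print(best_round, stopped_round)
```

In the notebook, `booster1.best_iteration` plays the role of `best_round` when predicting on the test set.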

In [None]:
y1_pred = booster1.predict(test.drop('id', axis=1), num_iteration=booster1.best_iteration)

In [None]:
booster2 = lgb.train(params, 
                     lgb.Dataset(X2_train, label=y2_train),
                     valid_sets=lgb.Dataset(X2_val, label=y2_val),
                     verbose_eval=0,
                     early_stopping_rounds=70)

In [None]:
y2_pred = booster2.predict(test.drop('id', axis=1), num_iteration=booster2.best_iteration)

In [None]:
booster3 = lgb.train(params, 
                     lgb.Dataset(X3_train, label=y3_train),
                     valid_sets=lgb.Dataset(X3_val, label=y3_val),
                     verbose_eval=0,
                     early_stopping_rounds=70)

In [None]:
y3_pred = booster3.predict(test.drop('id', axis=1), num_iteration=booster3.best_iteration)

In [None]:
booster4 = lgb.train(params, 
                     lgb.Dataset(X4_train, label=y4_train),
                     valid_sets=lgb.Dataset(X4_val, label=y4_val),
                     verbose_eval=0,
                     early_stopping_rounds=70)

In [None]:
y4_pred = booster4.predict(test.drop('id', axis=1), num_iteration=booster4.best_iteration)

# Submission

In [None]:
sub = pd.DataFrame({
    'id': test.id,
    'Class_1': y1_pred, 
    'Class_2': y2_pred,
    'Class_3': y3_pred,
    'Class_4': y4_pred
})

sub.to_csv('lgbm_tuner_one_x_rest.csv', index=False)
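
Because the four one-vs-rest models are fitted independently, the four probabilities in a row are not guaranteed to sum to 1. If a downstream step expects a proper probability distribution per row, one simple option is to divide each row by its sum. A minimal sketch with made-up numbers follows; the helper `normalize_rows` and the toy predictions are assumptions for illustration, and the notebook itself writes the raw predictions.

```python
def normalize_rows(rows):
    """Divide each row of non-negative scores by its sum so the row sums to 1."""
    normalized = []
    for row in rows:
        total = sum(row)
        if total == 0:
            # Degenerate row: fall back to a uniform distribution
            normalized.append([1.0 / len(row)] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


# Made-up one-vs-rest outputs for two test rows (Class_1 .. Class_4)
raw_preds = [
    [0.10, 0.60, 0.20, 0.15],  # sums to 1.05
    [0.05, 0.50, 0.25, 0.10],  # sums to 0.90
]
probs = normalize_rows(raw_preds)
for row in probs:
    print([round(p, 3) for p in row])
```

Row normalization keeps the ranking of classes within each row unchanged; it only rescales the values.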