# Description

This notebook explains how to create a very simple submission with LightGBM parameter optimization.

Notebook plan:
1. Modules import.
2. Utils.
3. LGBM parameters tuning and modeling.
4. Full model training.

## 1. Modules

We use only the standard set of modules: numpy, pandas, matplotlib, lightgbm and sklearn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_validate


import lightgbm as lgbm

import gc
from pathlib import Path
gc.enable()

In [None]:
import warnings 
warnings.filterwarnings('ignore')

## 2. Utils

This simple utility reduces the memory usage of a dataframe by downcasting column types, which helps speed up the code when the dataset is large.

In [None]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                # Downcasting to float16 is disabled here; floats are stored as float32 when they fit.
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
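
To illustrate the idea behind `reduce_mem_usage`, here is a minimal sketch on a toy dataframe. The column names and values are made up for illustration, and the helper `smallest_int_dtype` is hypothetical (it is not part of the notebook); it picks the narrowest integer type whose range covers the column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; every column starts as a 64-bit type.
toy = pd.DataFrame({
    "flag": np.array([0, 1, 1, 0], dtype=np.int64),
    "count": np.array([10, 2000, 30000, 5], dtype=np.int64),
    "value": np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float64),
})

before = toy.memory_usage(index=False).sum()


def smallest_int_dtype(c_min, c_max):
    """Hypothetical helper: narrowest signed integer dtype covering [c_min, c_max]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        # Inclusive bounds: a value equal to the type's limit still fits.
        if info.min <= c_min and c_max <= info.max:
            return dtype
    return np.int64


for col in ["flag", "count"]:
    toy[col] = toy[col].astype(smallest_int_dtype(toy[col].min(), toy[col].max()))

# Floats are stored as float32, as in the function above.
toy["value"] = toy["value"].astype(np.float32)

after = toy.memory_usage(index=False).sum()
print(toy.dtypes)
print(before, after)
```

Here `flag` fits in int8 and `count` in int16, so the column data shrinks from 96 to 28 bytes.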

## 3. LGBM parameters tuning and modeling.

All machine learning algorithms have parameters that can seriously affect the final score. To test the LGBM parameters, we use a small amount of data to speed up this simple study.

The first cell creates the data frames used for training and grid search optimization.

In [None]:
data_dir = Path('../input/tabular-playground-series-oct-2021/')

# this constant sets how many rows are used in the experiments
small_amount = 500000

df_train = pd.read_csv(
    data_dir / "train.csv",
    index_col='id',
    nrows=small_amount,
)

df_train = reduce_mem_usage(df_train)

FEATURES = df_train.columns[:-1]
TARGET = df_train.columns[-1]

X = df_train.loc[:, FEATURES]
y = df_train.loc[:, TARGET]

seed = 0
fold = 5
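
The `seed` constant is defined above but is not passed to anything in this notebook, and an integer `cv` with a classifier gives stratified folds without shuffling. One possible way to use both constants, shown here as an assumption rather than the notebook's method, is a shuffled `StratifiedKFold`. The toy labels below are synthetic.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

seed = 0
fold = 5

# Synthetic stand-in for the real data: 100 rows, 30 of them positive.
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

# Shuffled, reproducible folds that keep the class ratio in every fold.
cv = StratifiedKFold(n_splits=fold, shuffle=True, random_state=seed)

fold_sizes = []
fold_positives = []
for train_idx, valid_idx in cv.split(X_toy, y_toy):
    fold_sizes.append(len(valid_idx))
    fold_positives.append(int(y_toy[valid_idx].sum()))

print(fold_sizes, fold_positives)
```

The resulting `cv` object could then be passed as `cv=cv` wherever the notebook passes `cv=fold`.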

The second cell contains example code showing how to train the LGBM model and display the results.

In [None]:
'''model_lgbm = lgbm.LGBMClassifier(
    num_iterations=100,
    objective = "binary",
    num_leaves= 31,
    feature_pre_filter = False
    )
def score(X, y, model_lgbm, cv):
    scoring = ["roc_auc"]
    scores = cross_validate(
        model_lgbm, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
    scores = pd.DataFrame(scores).T
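    # Compute both statistics from the fold columns only: inside one assign
    # call, a later lambda would also see the newly created mean column.
    return scores.assign(
        mean = scores.mean(axis=1),
        std = scores.std(axis=1),
    )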

scores = score(X, y, model_lgbm, cv=fold)
display(scores)
'''
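
As an illustration of what `score` returns, the sketch below builds the same kind of table on synthetic data. `LogisticRegression` and `make_classification` are stand-ins chosen to keep the example light; they are assumptions, not part of the notebook. One detail worth knowing: within a single `assign` call, later callables see columns created by earlier ones, so this sketch computes `mean` and `std` from the fold columns before attaching them.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data and a stand-in estimator (assumptions, not the notebook's setup).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

raw = cross_validate(
    model, X_toy, y_toy, scoring=["roc_auc"], cv=5, return_train_score=True
)

# One row per metric, one column per fold.
per_fold = pd.DataFrame(raw).T

# Compute the statistics from the fold columns first, then attach them.
table = per_fold.assign(
    mean=per_fold.mean(axis=1),
    std=per_fold.std(axis=1),
)
print(table.loc[["test_roc_auc", "train_roc_auc"]])
```

The rows `fit_time` and `score_time` are also present in the table; the two ROC AUC rows are usually the ones to compare, since a large gap between train and test scores hints at overfitting.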

This cell contains the grid search code. We tune only a few parameters:
*  max_depth
*  num_iterations

All parameters are described [here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

In [None]:
'''
def score(X, y, model_lgbm, cv):
    scoring = ["roc_auc"]
    scores = cross_validate(
        model_lgbm, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
    scores = pd.DataFrame(scores).T
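    # Compute both statistics from the fold columns only: inside one assign
    # call, a later lambda would also see the newly created mean column.
    return scores.assign(
        mean = scores.mean(axis=1),
        std = scores.std(axis=1),
    )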

test_roc_auc_row = []

for num_iter in range(90, 301, 40):
    for max_d in range (4, 8, 1):
        model_lgbm = lgbm.LGBMClassifier(
        num_iterations=num_iter,
        objective = "binary",
        feature_pre_filter = False,
        max_depth = max_d
        )

        res = {}
        res['num_iter'] = num_iter
        res['max_depth'] = max_d
        scores = score(X, y, model_lgbm, cv=fold)
        res['test_roc_auc'] = scores.loc['test_roc_auc','mean']
        print(num_iter, max_d, res['test_roc_auc'])

        test_roc_auc_row.append(res)
'''
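
Before running the search on a large sample, it can help to count how much work the grid implies. The sketch below only enumerates the combinations used in the loops above; no model is trained.

```python
from itertools import product

# Same ranges as the nested loops in the grid search cell.
num_iter_values = list(range(90, 301, 40))
max_depth_values = list(range(4, 8))
fold = 5

grid = list(product(num_iter_values, max_depth_values))
total_fits = len(grid) * fold  # each combination is fitted once per fold

print(num_iter_values)        # [90, 130, 170, 210, 250, 290]
print(max_depth_values)       # [4, 5, 6, 7]
print(len(grid), total_fits)  # 24 120
```

With 24 combinations and 5 folds, the search fits 120 models, which is why a reduced number of rows is used for this step.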

In [None]:
'''test_roc_auc_row'''

Attention!

* For 10000 rows, num_iter = 210 with max_depth = 2 gives the maximum score of 0.837904.
* For 100000 rows, num_iter = 370 with max_depth = 3 gives the maximum score of 0.850794.

The final results depend on the size of your training dataset.

In [None]:
'''df = pd.DataFrame(test_roc_auc_row)
df.sort_values(by='test_roc_auc', ascending=False).head(10)'''

## 4. Full model training

Train the final model with the selected parameters and evaluate it with cross-validation.

In [None]:
# LGBMClassifier is accessed through the lgbm alias imported in section 1
model_lgbm = lgbm.LGBMClassifier(
    num_iterations=290,
    objective = "binary",
    feature_pre_filter = False,
    max_depth = 7
    )
def score(X, y, model_lgbm, cv):
    scoring = ["roc_auc"]
    scores = cross_validate(
        model_lgbm, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
    scores = pd.DataFrame(scores).T
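    # Compute both statistics from the fold columns only: inside one assign
    # call, a later lambda would also see the newly created mean column.
    return scores.assign(
        mean = scores.mean(axis=1),
        std = scores.std(axis=1),
    )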

scores = score(X, y, model_lgbm, cv=fold)
display(scores)

In [None]:
model_lgbm.fit(X, y, eval_metric='auc')
X_test = pd.read_csv(data_dir / "test.csv", index_col='id')

y_pred_lgbm = pd.Series(
    model_lgbm.predict_proba(X_test)[:, 1],
    index=X_test.index,
    name=TARGET,
)
y_pred_lgbm.to_csv("submission.csv")
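
A quick way to see what `to_csv` writes for a named Series with a named index is to try it on placeholder values. The ids, the probabilities and the name `target` below are hypothetical; in the notebook the name comes from `TARGET` and the index from `test.csv`.

```python
import io

import pandas as pd

# Placeholder ids and probabilities, for illustration only.
toy_index = pd.Index([1000000, 1000001, 1000002], name="id")
toy_pred = pd.Series([0.25, 0.5, 0.75], index=toy_index, name="target")

# Write to an in-memory buffer instead of a file to inspect the result.
buffer = io.StringIO()
toy_pred.to_csv(buffer)
lines = buffer.getvalue().splitlines()

print(lines[0])  # id,target
print(lines[1])  # 1000000,0.25
```

The index name becomes the first header column and the Series name the second, so the file has one id and one predicted probability per row.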

# Discussion

1. After applying the optimization procedure on a 100,000-row sample of the data, the score increased by 0.012, which is a meaningful gain for this task.
2. Across the parameter grid, the score varies by about 0.02.
3. Training on all the data with the optimized parameters gives a score of 0.85212 (+0.002).