# Introduction

**Surprise:** in the tabular-data Kaggle competition "MoA", boosting models perform worse than other models (LogReg, neural networks, etc.), and sometimes even worse than a constant prediction.

**Challenge:** Practice tuning boosting params on this "hard" dataset. 
I.e. try to find the params that work best for each target, for all targets, or for a subgroup of targets. This seems to be of general use, not only for this particular competition. In particular, build a list of good boosting params that can be tried in future tasks; that may speed up tuning: first try the params from the list, then tune around the best one.  
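
The "first try params from the list, then tune around the best one" idea can be sketched as a two-stage search. This is only an illustration: `tune_from_candidates`, `toy_score`, `n_local` and `rel_step` are hypothetical names and values, and the toy scoring function stands in for a real cross-validated log loss (lower is better).

```python
import random

def tune_from_candidates(score_fn, candidates, n_local=20, rel_step=0.3, seed=0):
    """Stage 1: score every candidate. Stage 2: random search around the best one.

    score_fn(params) -> float, lower is better (e.g. a cross-validated log loss).
    """
    rng = random.Random(seed)
    scored = [(score_fn(p), i) for i, p in enumerate(candidates)]
    best_score, best_idx = min(scored)
    best_params = dict(candidates[best_idx])

    for _ in range(n_local):
        trial = {}
        for name, value in best_params.items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                trial[name] = value  # leave categorical params untouched
                continue
            new_value = value * (1.0 + rng.uniform(-rel_step, rel_step))
            # integer params (e.g. num_leaves) stay integer and at least 1
            trial[name] = max(1, round(new_value)) if isinstance(value, int) else new_value
        score = score_fn(trial)
        if score < best_score:
            best_params, best_score = trial, score
    return best_params, best_score

# Toy stand-in for a CV score; a real one would train and evaluate a model.
def toy_score(p):
    return (p["learning_rate"] - 0.05) ** 2 + 1e-4 * (p["num_leaves"] - 30) ** 2

candidates = [
    {"learning_rate": 0.1, "num_leaves": 31, "boosting_type": "dart"},
    {"learning_rate": 0.01, "num_leaves": 24, "boosting_type": "gbdt"},
]
best_params, best_score = tune_from_candidates(toy_score, candidates)
print(best_params, best_score)
```

In the notebook setting, `score_fn` could wrap the negated mean of `cross_val_score(..., scoring='neg_log_loss')`; the local search never returns a result worse than the best entry of the list.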

Everyone is welcome to share advice, experiments, anything... (Please note that the goal is NOT this particular competition (MoA) but getting the general picture, so tricks specific to this competition may be of less interest.) 


**More details:**

There are two simple baselines for each target: predict a constant (the target mean), or predict with LogReg (with tuned C). 
We can use them as benchmarks. 
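
For reference, the constant baseline has a closed form. If a target has positive fraction $p$ and every sample is assigned the same predicted probability $q$, the log loss is

$$
L(q) = -\left[\, p \ln q + (1-p)\ln(1-q) \,\right],
$$

which is minimized at $q = p$. The best constant prediction therefore scores

$$
L(p) = -\left[\, p \ln p + (1-p)\ln(1-p) \,\right].
$$

A model that scores worse than $L(p)$ has extracted less information than the target mean alone provides.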

Boosting performs poorly not only on targets with extreme class imbalance, so imbalance is not the only reason. 

**Surprises:**

1) LightGBM with default params performs WORSE THAN CONSTANT on many targets.

2) Currently LightGBM with tuned params is worse than LogReg (tuned C) on ALL 206 targets, at least for the params I have considered. Can it beat LogReg, at least on some targets? 


**Further points to think on:**

1) Try to understand why this dataset is so hard for boosting. Maybe the data were prepared with LogReg in some way? We can create artificial data generated by a LogReg model and compare boosting and LogReg on it; if the results are similar to those on the current dataset, that might be evidence. 

2) We can try to find different params for different targets, as well as one set of params for all targets together; these are two different tasks. It is more natural to consider some group of targets, for example those with moderate 

3) Try to propose optimization schemes that lead to good params as fast as possible.

4) Try to benchmark optimization packages: hyperopt, Optuna, etc. 
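
For point 1, artificial data "generated by LogReg" could look like the sketch below. It is written with the standard library only so that it runs anywhere; the function name `make_logistic_data`, the Gaussian features and the intercept value are assumptions made for illustration, not a description of how the MoA data were produced.

```python
import math
import random

def make_logistic_data(n_samples=1000, n_features=20, intercept=-3.0, seed=0):
    """Draw labels from a logistic model: P(y=1 | x) = sigmoid(intercept + w . x)."""
    rng = random.Random(seed)
    # Scale weights so the logit variance does not grow with n_features.
    weights = [rng.gauss(0.0, 1.0) / math.sqrt(n_features) for _ in range(n_features)]
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        logit = intercept + sum(w * x for w, x in zip(weights, row))
        prob = 1.0 / (1.0 + math.exp(-logit))
        X.append(row)
        y.append(1 if rng.random() < prob else 0)
    return X, y, weights

# A negative intercept makes positives rare, mimicking imbalanced targets.
X_syn, y_syn, w_syn = make_logistic_data(n_samples=500, n_features=10, seed=0)
print("positive fraction:", sum(y_syn) / len(y_syn))
```

The returned lists can be converted with `np.asarray` and passed to the same cross-validation comparison used for the real targets; if boosting loses to LogReg on such data by a similar margin, that would support the hypothesis.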



# Current findings

0) Detailed results of the experiments can be found at https://docs.google.com/spreadsheets/d/1fw6aqngYtwfVQTHbTpoGN4S8cUs-nGSM_w1fc3HpFbA/edit?usp=sharing

1) LightGBM DART mode ( model = lgb.LGBMClassifier(**{ 'boosting_type': 'dart'}) )
works much better than default params on all targets. Still, on many targets it is worse than constant.

2) The dependence on learning rate seems less smooth than I would expect, e.g. changing LR 0.1 -> 0.100028 sometimes gives a disproportionate improvement. 

3) "Tokyo1" params (see values below) do not look very good so far; testing is to be continued. 
Analysis: the params themselves are good, but the improvement over LogReg seems to come from clipping, not from boosting. CV score is around 0.016* (without clipping), while LogReg gets 0.015*. The good news: at least with early stopping,
results are always BETTER than constant prediction. Maybe about 150 iterations without early stopping would give the same result; will check later. 
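
To see why clipping alone can improve log loss, consider the toy example below. The clipping bounds `low=0.001` and `high=0.999` and the sample predictions are assumptions chosen for illustration; they are not the values used in the "Tokyo1" notebook.

```python
import math

def binary_log_loss(y_true, y_prob):
    """Mean binary log loss; probabilities must lie strictly between 0 and 1."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        for y, p in zip(y_true, y_prob)
    ]
    return sum(losses) / len(losses)

def clip_probs(y_prob, low=0.001, high=0.999):
    # low/high are illustrative assumptions, not values from the notebook
    return [min(max(p, low), high) for p in y_prob]

y_true = [0, 0, 0, 1]
raw = [0.01, 0.02, 0.01, 1e-6]  # the last prediction is confidently wrong

print("raw    :", binary_log_loss(y_true, raw))
print("clipped:", binary_log_loss(y_true, clip_probs(raw)))
```

A single overconfident mistake dominates the unclipped loss, so bounding the predictions helps regardless of how good the underlying model is. That is why a clipped leaderboard score says little about the boosting params themselves.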


Some previous experiments can be found at:
https://www.kaggle.com/alexandervc/lgb-worse-than-constant-dart-rules-optim-params



# What this is about 

In the current notebook we compare boosting params.

This is a draft version; it will be updated. 

# Technical remarks

1. Small data preprocessing: the control group is always assigned 1 (so cp_type is constant). Doses -> 0,1 and durations -> 0,0.5,1

2. Simple comparison of params is done via cross-validation:
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state= 0 )

3. Advanced comparisons will average several CV runs (to be done). 
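
Averaging several CV runs (point 3) could be organized as in the sketch below. `average_repeated_cv` and `toy_cv` are hypothetical helpers: `cv_fn(seed)` is assumed to return the fold scores of one shuffled CV run, and the toy version only simulates such scores.

```python
import random
from statistics import mean, pstdev

def average_repeated_cv(cv_fn, seeds):
    """Run cv_fn once per seed; return mean and std of the per-run mean scores.

    cv_fn(seed) -> list of fold scores for one CV run shuffled with that seed.
    """
    run_means = [mean(cv_fn(seed)) for seed in seeds]
    # pstdev is the population std, the same convention as np.std
    return mean(run_means), pstdev(run_means)

# Toy stand-in: simulated neg-log-loss fold scores around -0.02.
def toy_cv(seed):
    rng = random.Random(seed)
    return [-0.02 + rng.uniform(-0.001, 0.001) for _ in range(3)]

avg, spread = average_repeated_cv(toy_cv, seeds=range(5))
print("mean over runs:", avg, "std over runs:", spread)
```

With real models, `cv_fn` could build `StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)` and return `list(cross_val_score(...))`. The spread across seeds shows how large a difference between two param sets must be before it is distinguishable from split noise.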


# Other notebooks with boosting and their results

"Tokyo1":
https://www.kaggle.com/code1110/moa-lgb-seed-average  
0.01925 - Analysis: the params are good, but the improvement over LogReg seems to come from clipping, not from boosting. CV score is around 0.016* (without clipping), while LogReg gets 0.015*. The good news: at least with early stopping,
results are always BETTER than constant prediction. 

 
"Tokyo2" https://www.kaggle.com/sishihara/moa-lgbm-benchmark 
0.02038 
params = {    'num_leaves': 24, 'max_depth': 5,'objective': 'binary',     'learning_rate': 0.01 }

https://www.kaggle.com/nroman/moa-lightgbm-206-models
v2: Label encoding of categorical features. CV: 0.01627, LB: 0.02040
https://www.kaggle.com/fchmiel/xgboost-baseline-multilabel-classification

https://www.kaggle.com/pavelvpster/moa-lgb-optuna  
0.01994
 
https://www.kaggle.com/namanj27/catboost-moa-eda-starter 
0.02149

https://www.kaggle.com/swarajshinde/mechanisms-of-action-moa-eda-lightgbm-baseline
0.02037

https://www.kaggle.com/senkin13/moa-lightgbm-starter-with-nonscored-meta-feature
0.02063
https://www.kaggle.com/hetarthchopra/xgboost-catboost-ensemble-baseline-solution
0.01985
https://www.kaggle.com/mannsingh/simple-xgboost-model-for-beginners
0.02098
https://www.kaggle.com/acapricorni/moa-stacking-nn-lgbm
0.01980
https://www.kaggle.com/demetrypascal/catboost-and-logreg

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load data and do small preprocessing

In [None]:
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import datetime
import time

import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/kaggle/input/lish-moa/train_features.csv',index_col = 0)  
df0 = df.copy()
df['cp_type'] = df['cp_type'].map({'trt_cp':1.0, 'ctl_vehicle':1.0}) # Forget about control group  
df['cp_dose'] = df['cp_dose'].map({'D1':0.0, 'D2':1.0})
df['cp_time'] = df['cp_time'].map({24:0.0, 48: .5 , 72:1.0})
X = df.copy()
X_save = X.copy()
df_test = pd.read_csv('/kaggle/input/lish-moa/test_features.csv',index_col = 0)
df0_test = df_test.copy()
df_test['cp_type'] = df_test['cp_type'].map({'trt_cp':1.0, 'ctl_vehicle':1.0}) # Same mapping as for train: control group is assigned to 1
df_test['cp_dose'] = df_test['cp_dose'].map({'D1':0.0, 'D2':1.0})
df_test['cp_time'] = df_test['cp_time'].map({24:0.0, 48: .5 , 72:1.0})

y = pd.read_csv('/kaggle/input/lish-moa/train_targets_scored.csv',index_col = 0 )
y_save = y.copy()
print(y.iloc[:3,:2])
df

In [None]:
y_save.sum(axis = 0 ).sort_values(ascending = False)


# Show LGB default params

In [None]:
model = lgb.LGBMClassifier(**{}) # {'boosting_type': 'dart'})
model.get_params()

# target_name = 'dopamine_receptor_antagonist'

This target is one of the most difficult to predict: all models give predictions near the constant.
Nevertheless, the class imbalance for this target is lower than for most other targets. It contains 424 "1". 


In [None]:
rs = 0
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state= rs )

target_name = 'dopamine_receptor_antagonist'
y = y_save[target_name]
print('Target:', target_name, ' ,  count non-zero:', y.sum() )
print()

## Compare different LGB params, constant prediction, LogReg 

In [None]:
%%time 
y = y_save[target_name]
print('Target:', target_name, ' ,  count non-zero:', y.sum() )
print()

print('Predict by constant ' )
print('Logloss: ', np.round( -log_loss( y, np.ones_like(y)*y.mean() ) ,5 ) )
print()

model = lgb.LGBMClassifier(**{}) # {'boosting_type': 'dart'})
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('LGB default params - WORSE than constant!!!')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )  
print()

model = lgb.LGBMClassifier(**{ 'boosting_type': 'dart'})
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('LGB DART - not bad')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   np.round(cv_score ,5),
     np.round(time.time()-t0, 1), 'seconds passed' )
print()

model = lgb.LGBMClassifier(**{ 'boosting_type': 'dart', 'learning_rate':0.100028})
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('LGB DART , learning_rate=0.100028 - small improvement')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()

model = lgb.LGBMClassifier(**{ 'boosting_type': 'dart', 'learning_rate':0.100028,
 'max_depth':15, 'n_estimators':122, 'num_leaves':27, 'reg_alpha':0.0009} )
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('Best params')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()

model = LogisticRegression(C = 0.003)
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('Logreg')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()

    


## "Tokyo1" params 

Seems better than Dart on many targets, still mostly worse than LogReg

based on  https://www.kaggle.com/code1110/moa-lgb-seed-average

In [None]:
params = {
    'n_estimators': 50,
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'max_depth': 3,
    'learning_rate': 0.08,
    'subsample': 0.72,
    'subsample_freq': 4,
    'feature_fraction': 0.4,
    'lambda_l1': 1,
    'lambda_l2': 1,
    'seed': 217, # SEED,
    #'early_stopping_rounds': 40,
    }    
params["metric"] = "binary_logloss" 

model = lgb.LGBMClassifier(**params )
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('params "Tokyo1"')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()


In [None]:
params = {
    'n_estimators': 50,
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'max_depth': 3,
    'learning_rate': 0.08,
    'subsample': 0.72,
    'subsample_freq': 4,
    'feature_fraction': 0.4,
    'lambda_l1': 1,
    'lambda_l2': 1,
    'seed': 217, # SEED,
    #'early_stopping_rounds': 40,
    }    
params["metric"] = "binary_logloss" 

t00 = time.time()
df_stat = pd.DataFrame()
for n_estimators in range(10,300,10): #[10,20,30,40,50,60,70,80,90,100,110,120,130,150,170,200]:
    params["n_estimators"] = n_estimators 
    model = lgb.LGBMClassifier(**params )
    t0 = time.time()
    cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
    #print('n_estimators params', n_estimators)
    print('N_est',n_estimators, 'Logloss', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
          np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds' )
    #print()
    row_id = len(df_stat)+1  # renamed from `id` to avoid shadowing the built-in
    df_stat.loc[row_id,'n_estimators'] = n_estimators
    df_stat.loc[row_id,'Logloss'] = np.round( np.mean(cv_score), 5)
    df_stat.loc[row_id,'Logloss Std'] = np.round( np.std(cv_score), 7)
    
seconds_passed_total = time.time()-t00
print(np.round(seconds_passed_total,1), np.round(seconds_passed_total/60,1) , np.round(seconds_passed_total/3600,1) , 
'seconds, minutes, hours, passed total' )
df_stat    


In [None]:
df_stat.describe()

In [None]:
plt.plot(df_stat['n_estimators'].values, df_stat['Logloss'].values, '*-'  )
plt.grid()
plt.title('LogLoss')
plt.xlabel("n_estimators")
plt.show()
plt.plot(df_stat['n_estimators'].values, df_stat['Logloss Std'].values, '*-'  )
plt.grid()
plt.title('Std')
plt.xlabel("n_estimators")
plt.show()
df_stat.describe()


## "Tokyo2" params 

In [None]:
params = {
    'num_leaves': 24,
    'max_depth': 5,
    'objective': 'binary',
    'learning_rate': 0.01
}

model = lgb.LGBMClassifier(**params )
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('params "Tokyo2"')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()


In [None]:
t00 = time.time()
df_stat = pd.DataFrame()
for n_estimators in range(10,300,10): #[10,20,30,40,50,60,70,80,90,100,110,120,130,150,170,200]:
    params["n_estimators"] = n_estimators 
    model = lgb.LGBMClassifier(**params )
    t0 = time.time()
    cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
    #print('n_estimators params', n_estimators)
    print('N_est',n_estimators, 'Logloss', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
          np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds' )
    #print()
    row_id = len(df_stat)+1  # renamed from `id` to avoid shadowing the built-in
    df_stat.loc[row_id,'n_estimators'] = n_estimators
    df_stat.loc[row_id,'Logloss'] = np.round( np.mean(cv_score), 5)
    df_stat.loc[row_id,'Logloss Std'] = np.round( np.std(cv_score), 7)
    
seconds_passed_total = time.time()-t00
print(np.round(seconds_passed_total,1), np.round(seconds_passed_total/60,1) , np.round(seconds_passed_total/3600,1) , 
'seconds, minutes, hours, passed total' )
df_stat    


In [None]:
plt.plot(df_stat['n_estimators'].values, df_stat['Logloss'].values, '*-'  )
plt.grid()
plt.title('LogLoss')
plt.xlabel("n_estimators")
plt.show()
plt.plot(df_stat['n_estimators'].values, df_stat['Logloss Std'].values, '*-'  )
plt.grid()
plt.title('Std')
plt.xlabel("n_estimators")
plt.show()
df_stat.describe()

# target_name = 'jak_inhibitor'

## Example where Dart is worse than constant 

That would be typical if the class imbalance were very high. However, in this example 93 samples are equal to 1, which is not so low: for many other targets with an even lower number of positives, LGB is better than constant. 
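
To put the positive counts in perspective, the sketch below computes the log loss of the best constant prediction from the number of positives. The total row count `N_ROWS = 23814` is an assumption about the size of the training set and is not stated in this notebook; `constant_log_loss` is a hypothetical helper name.

```python
import math

def constant_log_loss(n_pos, n_total):
    """Log loss obtained by predicting the positive fraction p for every sample."""
    p = n_pos / n_total
    if p in (0.0, 1.0):
        return 0.0  # a degenerate target is predicted perfectly by a constant
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

N_ROWS = 23814  # assumed number of training rows, not taken from the notebook

for name, n_pos in [("jak_inhibitor", 93), ("dopamine_receptor_antagonist", 424)]:
    print(name, n_pos, constant_log_loss(n_pos, N_ROWS))
```

Note that the cells in this notebook print the negated value (the `neg_log_loss` convention). The rarer the positives, the smaller the constant baseline, so on rare targets even a small amount of overfitting is enough to push a model below it.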

In [None]:
%%time 
rs = 0
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state= rs )

target_name = 'jak_inhibitor'
y = y_save[target_name]
print('Target:', target_name, ' ,  count non-zero:', y.sum() )
print()

print('Predict by constant ' )
print('Logloss: ', np.round( -log_loss( y, np.ones_like(y)*y.mean() ) ,5 ) )
print()

model = lgb.LGBMClassifier(**{}) # {'boosting_type': 'dart'})
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('LGB default params - WORSE than constant!!!')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )  
print()

model = lgb.LGBMClassifier(**{ 'boosting_type': 'dart'})
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('LGB DART - again WORSE than constant!!!')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   np.round(cv_score ,5),
     np.round(time.time()-t0, 1), 'seconds passed' )
print()

model = lgb.LGBMClassifier(**{ 'boosting_type': 'dart', 'learning_rate':0.100028})
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('LGB DART , learning_rate=0.100028 - small change of LR, but big improvement , but  again WORSE than constant!!!')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()

model = lgb.LGBMClassifier(**{ 'boosting_type': 'dart', 'learning_rate':0.100028,
 'max_depth':15, 'n_estimators':122, 'num_leaves':27, 'reg_alpha':0.0009} )
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('Best previous params - again WORSE than constant!!! ')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()

model = LogisticRegression(C = 0.02)
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('Logreg')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()

    


In [None]:
params = {
    'n_estimators': 130,
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'max_depth': 3,
    'learning_rate': 0.08,
    'subsample': 0.72,
    'subsample_freq': 4,
    'feature_fraction': 0.4,
    'lambda_l1': 1,
    'lambda_l2': 1,
    'seed': 217, # SEED,
    #'early_stopping_rounds': 40,
    }    
params["metric"] = "binary_logloss" 
model = lgb.LGBMClassifier(**params )
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('params "Tokyo1_130"')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()


params = {
    'num_leaves': 24,
    'max_depth': 5,
    'objective': 'binary',
    'learning_rate': 0.01
}
params["n_estimators"] = 150
model = lgb.LGBMClassifier(**params )
t0 = time.time()
cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
print('params "Tokyo2_150"')
print('Logloss CV3', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
      np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds passed' )
print()


## Tokyo1 plot

In [None]:
params = {
    'n_estimators': 130,
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'max_depth': 3,
    'learning_rate': 0.08,
    'subsample': 0.72,
    'subsample_freq': 4,
    'feature_fraction': 0.4,
    'lambda_l1': 1,
    'lambda_l2': 1,
    'seed': 217, # SEED,
    #'early_stopping_rounds': 40,
    }    
params["metric"] = "binary_logloss" 

t00 = time.time()
df_stat = pd.DataFrame()
for n_estimators in range(10,300,10): #[10,20,30,40,50,60,70,80,90,100,110,120,130,150,170,200]:
    params["n_estimators"] = n_estimators 
    model = lgb.LGBMClassifier(**params )
    t0 = time.time()
    cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
    #print('n_estimators params', n_estimators)
    print('N_est',n_estimators, 'Logloss', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
          np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds' )
    #print()
    row_id = len(df_stat)+1  # renamed from `id` to avoid shadowing the built-in
    df_stat.loc[row_id,'n_estimators'] = n_estimators
    df_stat.loc[row_id,'Logloss'] = np.round( np.mean(cv_score), 5)
    df_stat.loc[row_id,'Logloss Std'] = np.round( np.std(cv_score), 7)
    
seconds_passed_total = time.time()-t00
print(np.round(seconds_passed_total,1), np.round(seconds_passed_total/60,1) , np.round(seconds_passed_total/3600,1) , 
'seconds, minutes, hours, passed total' )
df_stat    


In [None]:
plt.plot(df_stat['n_estimators'].values, df_stat['Logloss'].values, '*-'  )
plt.grid()
plt.title('LogLoss')
plt.show()
plt.plot(df_stat['n_estimators'].values, df_stat['Logloss Std'].values, '*-'  )
plt.grid()
plt.title('Std')
plt.show()
df_stat.describe()

## Tokyo2 plot

In [None]:
params = {
    'num_leaves': 24,
    'max_depth': 5,
    'objective': 'binary',
    'learning_rate': 0.01
}

t00 = time.time()
df_stat = pd.DataFrame()
for n_estimators in range(10,300,10): #[10,20,30,40,50,60,70,80,90,100,110,120,130,150,170,200]:
    params["n_estimators"] = n_estimators 
    model = lgb.LGBMClassifier(**params )
    t0 = time.time()
    cv_score = cross_val_score(model, X, y, cv=skf,   scoring='neg_log_loss' )
    #print('n_estimators params', n_estimators)
    print('N_est',n_estimators, 'Logloss', np.round( np.mean(cv_score), 5), 'Std', np.round( np.std(cv_score),5),'Folds:',   
          np.round(cv_score ,5), np.round(time.time()-t0, 1), 'seconds' )
    #print()
    row_id = len(df_stat)+1  # renamed from `id` to avoid shadowing the built-in
    df_stat.loc[row_id,'n_estimators'] = n_estimators
    df_stat.loc[row_id,'Logloss'] = np.round( np.mean(cv_score), 5)
    df_stat.loc[row_id,'Logloss Std'] = np.round( np.std(cv_score), 7)
    
seconds_passed_total = time.time()-t00
print(np.round(seconds_passed_total,1), np.round(seconds_passed_total/60,1) , np.round(seconds_passed_total/3600,1) , 
'seconds, minutes, hours, passed total' )
df_stat    


In [None]:
plt.plot(df_stat['n_estimators'].values, df_stat['Logloss'].values, '*-'  )
plt.grid()
plt.title('LogLoss')
plt.xlabel("n_estimators")
plt.show()
plt.plot(df_stat['n_estimators'].values, df_stat['Logloss Std'].values, '*-'  )
plt.grid()
plt.title('Std')
plt.xlabel("n_estimators")
plt.show()
df_stat.describe()