TPS June 2021 extends the [TPS May 2021](http://www.kaggle.com/c/tabular-playground-series-may-2021) multiclass classification problem: there are more samples and anonymous features, and the number of target classes increases from 4 to 9. Since we already broke down the modelling process fundamentally last month, this time we will tackle the problem with Optuna to automate the hyperparameter tuning. This solution also tries to use the free GPU accelerator provided by Kaggle, for practice.

Based on the background above, you will see a solution that favours a simple ML workflow and low computation cost, ready to be deployed for different problems. The presentation may be raw, but I will keep it that way to show how the result improves gradually.

Workflow:
1. Data Exploration
2. Data Preprocessing
3. Feature Engineering
4. Feature Selection
5. Base Models
6. Stacking

# Preparation

## Imports

In [None]:
# Essentials
import numpy as np
import pandas as pd
import datetime
import random

# Plots
import seaborn as sns
import matplotlib.pyplot as plt

# Models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor, ExtraTreesClassifier, StackingClassifier
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import RidgeClassifier, RidgeCV
from sklearn.linear_model import ElasticNet, ElasticNetCV, LogisticRegressionCV, LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import StackingCVClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import Pool, CatBoostClassifier

# Stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# Misc
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold, cross_val_score, validation_curve
from sklearn.metrics import log_loss, confusion_matrix
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.calibration import CalibratedClassifierCV

pd.set_option('display.max_columns', None)

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")
pd.options.display.max_seq_items = 8000
pd.options.display.max_rows = 8000

import os
os.listdir("../input/")

## Read data

In [None]:
# Read in the dataset as a dataframe
train = pd.read_csv("../input/tabular-playground-series-jun-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-jun-2021/test.csv")
submission = pd.read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")

#train.info()
#test.info()
#submission.info()

## Split datasets

In [None]:
# Split features and labels
train_labels = train['target'].reset_index(drop=True)
train_features = train.drop(['id','target'], axis=1)
test_features = test.drop(['id'], axis=1)
train_labels.head()

In [None]:
del train
del test

# Data Exploration

## Target distribution

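As observed, "Class 6" and "Class 8" each make up about 26% of the targets in the training set, so these two classes are balanced against each other while together covering roughly half of the samples; the remaining 7 classes share the other half.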

In [None]:
'''
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))
#Check the new distribution 
sns.histplot(train_labels.sort_values(), color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="Target")
ax.set(title="Target distribution")
sns.despine(trim=True, left=True)
plt.show()
'''

In [None]:
#train_labels.value_counts(normalize=True)
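
The class shares quoted in this section can be checked with `value_counts(normalize=True)`. A minimal sketch on a toy label series (the labels below are made up for illustration and are not the competition data):

```python
import pandas as pd

# Toy labels, made up for illustration; the real ones live in train_labels
toy_labels = pd.Series(
    ["Class_6", "Class_8", "Class_6", "Class_2",
     "Class_8", "Class_6", "Class_8", "Class_1"]
)

# Share of each class, largest first
class_share = toy_labels.value_counts(normalize=True)
print(class_share)
```

Comparing each share against 1/9 (about 11%) gives a quick feel for how far a class is from a perfectly even split.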

## Features EDA

No specific pattern is observed in this case.

In [None]:
'''
# visualising some more outliers in the data values
fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(12, 120))
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)
sns.color_palette("husl", 8)
for i, feature in enumerate(list(train_features), 1):
    plt.subplot(len(list(train_features)), 3, i)
    sns.boxplot(x=feature, y=train_labels, hue=train_labels, palette='Blues', data=train_features)
        
    plt.xlabel('{}'.format(feature), size=15,labelpad=12.5)
    plt.ylabel('Target', size=15, labelpad=12.5)
    
    for j in range(2):
        plt.tick_params(axis='x', labelsize=12)
        plt.tick_params(axis='y', labelsize=12)
    
    plt.legend(loc='best', prop={'size': 10})
        
plt.show()
'''

## Correlation

Filter by RF feature importance first when the number of features is too large.

The top features selected by RF importance show no significant correlation with each other.

In [None]:
'''
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

rf_model = rf.fit(train_features, train_labels)
#rf_pred = rf_model.predict_proba(test_features)

forest_importances = pd.Series(rf.feature_importances_, index=train_features.columns)
top_feat = forest_importances.sort_values(ascending = False).head(20)
top_feat

train_features[top_feat.index]
'''

In [None]:
'''
corr = train_features[top_feat.index].corr()
#corr
#corr = train.corr()
plt.subplots(figsize=(15,12))
sns.heatmap(corr, vmax=0.9, cmap="Blues", square=True)
'''

### Further exploration for high correlation to target

The most important feature by RF is feature_54, but visually its standalone correlation with the target is insignificant.

In [None]:
'''
# train was deleted above, so use the split features and labels instead
data = pd.concat([train_features['feature_54'], train_labels], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='feature_54', y="target", data=data)
#fig.axis(ymin=0, ymax=800000);
'''

# Data Preprocessing

Ordinal encoding for the features, fitted on train and test together so that both share the same codes.

In [None]:

encoder = OrdinalEncoder()
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
all_encoded = encoder.fit_transform(pd.concat([train_features, test_features]))
train_features_encoded = all_encoded[0:len(train_features)]
test_features_encoded = all_encoded[len(train_features):]
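
Fitting the encoder on the stacked train and test rows means a value that appears only in the test set still receives a code. A small sketch of that idea with toy frames (`toy_train` and `toy_test` are hypothetical names with made-up values):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames, made up for illustration
toy_train = pd.DataFrame({"feature_0": [0, 3, 7], "feature_1": [1, 1, 5]})
toy_test = pd.DataFrame({"feature_0": [3, 9], "feature_1": [5, 2]})

# Fit on the stacked frames so values seen only in test (9 and 2) still get a code
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)
toy_encoded = OrdinalEncoder().fit_transform(toy_all)

# Split back by row position
toy_train_enc = toy_encoded[:len(toy_train)]
toy_test_enc = toy_encoded[len(toy_train):]
print(toy_train_enc)
print(toy_test_enc)
```

This is acceptable in a competition setting where the test features are available up front; in a production pipeline the encoder would normally be fitted on training data only.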


In [None]:
del train_features
del test_features
del all_encoded

No outliers or missing values observed from EDA.

## Recombine datasets

No treatment is needed in this case.

# Feature Engineering

Since the features are anonymous and numerous, the feature space can become very large if we adopt brute-force interaction operations. This would significantly increase the computational cost, hence feature engineering of this sort is not considered in this problem.
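
To put a number on that growth, here is a quick count of how many product-style interaction features would be generated from the 75 features this notebook works with, assuming we only multiply distinct features together:

```python
from math import comb

n_features = 75  # feature count stated in the PCA section of this notebook

pairwise = comb(n_features, 2)  # products of two distinct features
triples = comb(n_features, 3)   # products of three distinct features
print(pairwise, triples)
```

Even the pairwise case multiplies the column count by roughly 37, before any ratios or differences are added.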

## PCA

Since there are 75 features, a dimension reduction technique may help. I have tried PCA, but the result is not satisfactory. This is intuitive given the low correlation between features shown in the EDA, and the almost identical contributions from all the principal components.
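
Why low correlation makes PCA unhelpful can be seen on synthetic data. The sketch below uses independent random features (a stand-in, not the competition data): after standardising, every principal component explains about the same share of variance, so no small subset of components summarises the data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 2000 rows, 10 independent features
X_toy = rng.normal(size=(2000, 10))
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Share of variance explained by each principal component
evr = PCA().fit(X_toy).explained_variance_ratio_
print(evr.round(3))  # each value sits close to 1/10
```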

In [None]:
'''
X=train_features
# Standardize
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Create principal components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Convert to dataframe
component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names)

X_pca.head()
'''

In [None]:
'''
loadings = pd.DataFrame(
    pca.components_.T,  # transpose the matrix of loadings
    columns=component_names,  # so the columns are the principal components
    index=X.columns,  # and the rows are the original features
)
loadings
'''

In [None]:
'''
def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 0.1)
    )
    # Cumulative Variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure
    fig.set(figwidth=8, dpi=100)
    return axs

# Look at explained variance
plot_variance(pca);
'''

In [None]:
'''
def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X_pca, train_labels, discrete_features=False)
mi_scores
'''

In [None]:
'''
miindex = mi_scores.index[mi_scores.values>0]
miload = loadings[miindex]
'''

In [None]:
'''
train_features_pca = pd.DataFrame(data = np.matmul(train_features,miload))
train_features_pca.columns=miindex
'''

In [None]:
'''
test_features_pca = pd.DataFrame(data = np.matmul(test_features,miload))
test_features_pca.columns=miindex
'''

## Recreate training and test sets

No treatment is needed in this case.

# Feature Selection

# Base Models

## Optuna try

In [None]:

# Optuna for parameter search
!pip install -q optuna

import optuna
import pickle


Transform the target into numbers that match the class numbers, shifted to start from zero.

In [None]:

def class_to_num(classes):
    return [int(word[-1]) for word in classes]

#def num_to_class(nums):
 #   return ['Class_' + str(num) for num in nums]

#class type array starts from zero
train_labels_num = np.array(class_to_num(train_labels))-1
train_labels_num
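
For reference, a sketch of the same mapping together with its inverse, which is handy when predictions need to be turned back into class names. This variant splits on the underscore instead of reading the last character, which is an assumption on my side so that it would also cope with multi-digit class numbers:

```python
import numpy as np

def class_to_num(classes):
    # "Class_3" -> 2 (zero-based)
    return np.array([int(name.split("_")[-1]) - 1 for name in classes])

def num_to_class(nums):
    # 2 -> "Class_3"
    return ["Class_" + str(int(num) + 1) for num in nums]

labels = ["Class_1", "Class_9", "Class_6"]
encoded = class_to_num(labels)
print(encoded)
print(num_to_class(encoded))
```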


In [None]:
del train_labels

### Light GBM

In [None]:
'''


params = {
    
    
    'reg_lambda': 405.6123975349561, 
     'reg_alpha': 0.09452256681364866, 
     'colsample_bytree': 0.31486263497374173, 
     'subsample': 0.7281301644169369,
     'learning_rate': 0.01, 
     'num_leaves': 135,
     'min_child_samples': 489,
     'max_depth': 29
}
'''

In [None]:
'''
X=train_features_encoded
y=train_labels_num

params_lgbm = params
params_lgbm['boosting_type'] = 'gbdt'
params_lgbm['device'] = 'gpu'
params_lgbm ['objective'] = 'multiclass'
params_lgbm ['num_class'] = 9

params_lgbm ['metric'] = 'multi_logloss'
params_lgbm ['verbosity'] = -1
params_lgbm ['n_estimators']= 100000
#params_lgbm["cat_feature"] = cat_features

name = 'lightgbm_3seed_5fold'
k=5
seed_list=[0,1,2]
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof = np.zeros((len(X),9))
test_preds_list = []
score_list = []
fold=1
  
splits = list(kf.split(X,y))
fold = 1
for train_idx, val_idx in splits:
  X_train, X_val = X[train_idx], X[val_idx]
  y_train, y_val = y[train_idx], y[val_idx]

  val_preds_list = []

  for seed in seed_list:
    
    # fit and run model
    params_lgbm['random_state'] = seed
    
    model = LGBMClassifier(**params_lgbm)
    
    model.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val,y_val)],
              early_stopping_rounds=100,
              eval_names=['train','val'],verbose=200)

    
    val_preds_list.append(model.predict_proba(X_val))
    test_preds_list.append(model.predict_proba(test_features_encoded))
    
  oof[val_idx] = np.mean(val_preds_list,axis=0)
  score = log_loss(y_val, oof[val_idx])
  print(f"fold: {fold},log_loss: {score}")
  score_list.append(score)
  # print(f"fold: {fold}, class0 tr %: {y_train.value_counts()[0]/len(y_train)}, class0 val %: {y_val.value_counts()[0]/len(y_val)} ")
  fold +=1
  
cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_list,axis=0)


file_name_oof = name +"_oof.txt"
file_name_test = name + "_test.csv"
with open(file_name_oof, "wb") as fp:
      pickle.dump(oof, fp)

#files.download(file_name_oof)
submission.iloc[:,1:] = pd.DataFrame(preds).values
submission.to_csv(file_name_test,index=None)
#files.download(file_name_test) 
'''
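
The loop above follows a pattern that is reused for every base model: for each fold, train one model per seed, average their validation probabilities into an out-of-fold (OOF) array, and score the fold. A compact, CPU-only sketch of that pattern, using a small random forest on synthetic data as a stand-in for the GPU boosters:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class problem, a stand-in for the encoded competition data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=5,
                                   n_classes=3, random_state=0)
n_classes = 3
seed_list = [0, 1, 2]
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

oof = np.zeros((len(X_toy), n_classes))
fold_scores = []
for train_idx, val_idx in kf.split(X_toy, y_toy):
    # One model per seed on the same fold, then average their probabilities
    fold_preds = []
    for seed in seed_list:
        model = RandomForestClassifier(n_estimators=20, random_state=seed)
        model.fit(X_toy[train_idx], y_toy[train_idx])
        fold_preds.append(model.predict_proba(X_toy[val_idx]))
    oof[val_idx] = np.mean(fold_preds, axis=0)
    fold_scores.append(log_loss(y_toy[val_idx], oof[val_idx], labels=[0, 1, 2]))

print(f"CV log loss: {np.mean(fold_scores):.4f}")
```

Because every row lands in exactly one validation fold, `oof` ends up fully populated and can later serve as a training feature for the stacking model.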

### Light GBM tuning

In [None]:
'''
# for the fixed learning rate, use the opt n iterations and tune the tree hyperparameters
def objective(trial, X=train_features_encoded, y=train_labels_num):
  """
  """
  param_space = {
               'device':'gpu',  # Use GPU acceleration
               'boosting_type': 'gbdt',
               'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 1e3),
               'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 1e3),
               'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1 , 1.0),
               'subsample': trial.suggest_float('subsample', 0.1, 1.0),
                #'subsample_freq': trial.suggest_int('subsample_freq', 1, 10),
               'learning_rate': trial.suggest_loguniform('learning_rate', 1e-2, 1e-2),
               'num_leaves': trial.suggest_int("num_leaves", 31,256),
               'min_child_samples': trial.suggest_int('min_child_samples', 1, 500),
               'max_depth':trial.suggest_int('max_depth',3,127),
              #'min_split_gain': trial.suggest_float('min_split_gain', 0.0, 0.005),
              #'class_weight':trial.suggest_categorical('class_weight',['balanced',None]),
               'n_estimators':100000,
               'objective':'multiclass',
               'metric':'multi_logloss'
                }
            
  #X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=.1,random_state=2021,stratify=y)
  k=5
  seed_list=[0]
  random_seed=0
  kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
  oof = np.zeros((len(X),9))
  score_list = []
  fold=1
  
  splits = list(kf.split(X,y))
  for train_idx, val_idx in splits:
    X_train, X_val = X[train_idx,:], X[val_idx,:]
    y_train, y_val = y[train_idx], y[val_idx]
  
    val_preds_list = []
  
    for seed in seed_list:
      # fit and run model
      param_space['random_state'] = seed

      model = LGBMClassifier(**param_space)
      model.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val,y_val)],
                early_stopping_rounds=100,
                eval_names=['train','val'],verbose=0)
    
      val_preds_list.append(model.predict_proba(X_val))
     #test_preds_list.append(model.predict_proba(X_test)[:,1])
    
    oof[val_idx] = np.mean(val_preds_list,axis=0)
    score = log_loss(y_val, oof[val_idx])
    print(f"fold: {fold},logloss: {score}")
    score_list.append(score)
    fold +=1
  
  cv_logloss = np.mean(score_list)
  
  return cv_logloss
 '''

In [None]:
'''
%%time

study = optuna.create_study(direction='minimize')
study.optimize(objective,n_trials= 20)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
'''
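
The regularisation terms in the search space above are sampled on a log scale. The NumPy sketch below only illustrates what a log-uniform draw looks like over the same bounds; it is not Optuna's sampler, which also uses the results of earlier trials when proposing values.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-3, 1e3  # same bounds as the reg_lambda range above

# Uniform in log space, then exponentiate: each decade gets equal probability mass
samples = np.exp(rng.uniform(np.log(low), np.log(high), size=10000))

share_below_one = np.mean(samples < 1.0)
print(round(share_below_one, 2))  # about half, since 1.0 is the log-scale midpoint
```

A plain uniform draw over the same interval would almost never propose values below 1, which is why a log scale suits parameters spanning several orders of magnitude.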

In [None]:
#study.best_params

### XGBoost

In [None]:
import xgboost

In [None]:
'''
params =  {
 'lambda': 1.916220456301414, 
 'alpha': 7.860684965705271, 
 'colsample_bytree': 0.39793959188267636, 
 'colsample_bynode': 0.35770691759121553,
 'colsample_bylevel': 0.43340183901358953, 
 'subsample': 0.639573806625875, 
 'eta': 0.01,
 'grow_policy': 'depthwise', 
 'max_depth': 10, 
 'min_child_weight': 112,
 'max_bin': 339, 
 'deterministic_histogram': False}
'''


In [None]:
'''
X=train_features_encoded
y=train_labels_num

params_xgb = params
params_xgb["tree_method"] = "gpu_hist"
params_xgb["predictor"] = 'gpu_predictor'
params_xgb["objective"] = 'multi:softprob'
params_xgb["num_class"] = 9
params_xgb["eval_metric"] ='mlogloss'

name = 'xgboost_3seed_5fold'
k=5
seed_list=[0,1,2]
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof = np.zeros((len(X),9))
test_preds_list = []
score_list = []
fold=1
  
splits = list(kf.split(X,y))
fold = 1
for train_idx, val_idx in splits:
  X_train, X_val = X[train_idx], X[val_idx]
  y_train, y_val = y[train_idx], y[val_idx]

  val_preds_list = []

  for seed in seed_list:
    
    # fit and run model
    params_xgb['seed'] = seed
    
    dtrain = xgboost.DMatrix(data=X_train, label=y_train)
    dval = xgboost.DMatrix(data=X_val, label=y_val)
    dtest = xgboost.DMatrix(data=test_features_encoded)
    
    model = xgboost.train(params_xgb, dtrain,\
                       evals=[(dtrain,'train'),(dval,'val')],\
                       verbose_eval=False,
                       early_stopping_rounds=100,
                       num_boost_round=100000)
    
    

    
    val_preds_list.append(model.predict(dval))
    test_preds_list.append(model.predict(dtest))
    
  oof[val_idx] = np.mean(val_preds_list,axis=0)
  score = log_loss(y_val, oof[val_idx])
  print(f"fold: {fold},log_loss: {score}")
  score_list.append(score)
  # print(f"fold: {fold}, class0 tr %: {y_train.value_counts()[0]/len(y_train)}, class0 val %: {y_val.value_counts()[0]/len(y_val)} ")
  fold +=1
  
cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_list,axis=0)


file_name_oof = name + "_oof.txt"
file_name_test = name + "_test.csv"
with open(file_name_oof, "wb") as fp:
      pickle.dump(oof, fp)

#files.download(file_name_oof)

submission.iloc[:,1:] = pd.DataFrame(preds).values
submission.to_csv(file_name_test,index=None)
#files.download(file_name_test) 
'''

### XGBoost tuning

In [None]:
'''
# for the fixed learning rate, use the opt n iterations and tune the tree hyperparameters
def objective(trial,X=train_features_encoded, y=train_labels_num):
  """
  """
  param_space = { 
               'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
                'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.9),
                'colsample_bynode': trial.suggest_float('colsample_bynode', 0.1, 0.9),
                'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.1, 0.9),
                'subsample': trial.suggest_float('subsample', 0.1, 0.9),
                'eta':trial.suggest_float('eta', 1e-2, 1e-2),
                'grow_policy': trial.suggest_categorical("grow_policy", ['depthwise','lossguide']),
                'max_depth': trial.suggest_int('max_depth',2,25),
                'seed': 0,
                'min_child_weight': trial.suggest_int('min_child_weight', 0, 300),
                'max_bin': trial.suggest_int('max_bin', 256, 512),
                'deterministic_histogram':trial.suggest_categorical('deterministic_histogram',[False]),
               "tree_method" : "gpu_hist",
                "predictor" : 'gpu_predictor',
                "objective" : 'multi:softprob',
                 "num_class":9
                }
            
  #X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=.1,random_state=2021,stratify=y)
  k=5
  seed_list=[0]
  random_seed=0
  kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
  oof = np.zeros((len(X),9))
  score_list = []
  fold=1
  
  splits = list(kf.split(X,y))
  for train_idx, val_idx in splits:
    X_train, X_val = X[train_idx,:], X[val_idx,:]
    y_train, y_val = y[train_idx], y[val_idx]
  
    val_preds_list = []
  
    for seed in seed_list:
      # fit and run model
      param_space['seed'] = seed
      dtrain = xgboost.DMatrix(data=X_train, label=y_train)
      dval = xgboost.DMatrix(data=X_val, label=y_val)
      #dtest = xgboost.DMatrix(data=test_features_encoded)
      xgboost.set_config(verbosity=0)

      
      model = xgboost.train(param_space, dtrain,\
                       evals=[(dtrain,'train'),(dval,'val')],\
                       verbose_eval=False,
                       early_stopping_rounds=100,
                       num_boost_round=100000)
    
    

    
      val_preds_list.append(model.predict(dval))
     #test_preds_list.append(model.predict_proba(X_test)[:,1])
    
    oof[val_idx] = np.mean(val_preds_list,axis=0)
    score = log_loss(y_val, oof[val_idx])
    #print(f"fold: {fold},logloss: {score}")
    score_list.append(score)
    fold +=1
  
  cv_logloss = np.mean(score_list)
  
  return cv_logloss
  '''

In [None]:
'''
study = optuna.create_study(direction='minimize')
study.optimize(objective,n_trials= 20)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
'''

### Catboost

In [None]:
#cat_features = np.arange(0,train_features_encoded.shape[1]).tolist()

In [None]:
'''


params = {
    'learning_rate': 0.010516504167628355, 
     'depth': 10,
     'l2_leaf_reg': 15.358647811187538,
     'random_strength': 2.9499283334899307, 
     'border_count': 254,
     'grow_policy': 'SymmetricTree', 
     'min_data_in_leaf': 206        
}
'''

In [None]:
'''
X=train_features_encoded.astype(int)
y= train_labels_num


params_cb = params

params_cb ["loss_function"] = 'MultiClass'
params_cb ["od_wait"] = 100
params_cb ["od_type"] = 'Iter'
#params_cb ["max_ctr_complexity"] = 15
params_cb ["task_type"] = "GPU"
params_cb["cat_features"] = cat_features



name = 'catboost_3seeds_5fold'
k=5
seed_list=[0,1,2]
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof = np.zeros((len(X),9))
test_preds_list = []
score_list = []
fold=1
  
splits = list(kf.split(X,y))
fold = 1
for train_idx, val_idx in splits:
  X_train, X_val = X[train_idx], X[val_idx]
  y_train, y_val = y[train_idx], y[val_idx]

  val_preds_list = []

  for seed in seed_list:
    
    # fit and run model
    params_cb['random_state'] = seed
        
    model = CatBoostClassifier(**params_cb,
            iterations=100000,
            use_best_model=True,
)

    model.fit(X_train,y=y_train,
              embedding_features=None,
              use_best_model=True,
               early_stopping_rounds=100,
              eval_set=[(X_val,y_val)],
              verbose=500)
    

    
    val_preds_list.append(model.predict_proba(X_val))
    test_preds_list.append(model.predict_proba(test_features_encoded.astype(int)))
    
  oof[val_idx] = np.mean(val_preds_list,axis=0)
  score = log_loss(y_val, oof[val_idx])
  print(f"fold: {fold},log_loss: {score}")
  score_list.append(score)
  # print(f"fold: {fold}, class0 tr %: {y_train.value_counts()[0]/len(y_train)}, class0 val %: {y_val.value_counts()[0]/len(y_val)} ")
  fold +=1
  
cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_list,axis=0)


file_name_oof = name + "_oof.txt"
file_name_test = name + "_test.csv"
with open(file_name_oof, "wb") as fp:
      pickle.dump(oof, fp)

#files.download(file_name_oof)
submission.iloc[:,1:] = pd.DataFrame(preds).values
submission.to_csv(file_name_test,index=None)
#files.download(file_name_test) 
'''

### Catboost tuning

In [None]:
'''
# for the fixed learning rate, use the opt n iterations and tune the tree hyperparameters
def objective(trial,X=train_features_encoded.astype(int),y= train_labels_num):
  """
  """
 

  param_space = {
        "od_type" : "Iter",
        "od_wait" : 100,
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-2, 1e-1),
         "depth": trial.suggest_int("depth", 1, 10),
        "l2_leaf_reg": trial.suggest_loguniform('l2_leaf_reg', 1e-4, 1e3),
        "random_strength": trial.suggest_float("random_strength",0,3),
        # "bagging_temperature": trial.suggest_int("bagging_temperature",0,100),
        "border_count": trial.suggest_int("border_count",254,254),
        "grow_policy":trial.suggest_categorical("grow_policy",["Depthwise","SymmetricTree","Lossguide"]),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 20, 300)

        }
            
  #X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=.1,random_state=2021,stratify=y)
  k=5
  seed_list=[0]
  random_seed=0
  kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
  oof = np.zeros((len(X),9))
  score_list = []
  fold=1
  
  splits = list(kf.split(X,y))
  for train_idx, val_idx in splits:
    X_train, X_val = X[train_idx,:], X[val_idx,:]
    y_train, y_val = y[train_idx], y[val_idx]

    #if fold > 1:break

  
    val_preds_list = []
  
    
    for seed in seed_list:
      # fit and run model
      param_space['random_state'] = seed
      param_space ["loss_function"] = 'MultiClass'

      param_space["cat_features"] = cat_features

      model = CatBoostClassifier(**param_space,
                                task_type="GPU",
                                 iterations=100000,
                                 use_best_model=True)
      
      model.fit(X_train,y=y_train,
              embedding_features=None,
              use_best_model=True,
                early_stopping_rounds=100,
              eval_set=[(X_val,y_val)],
              verbose=500)
    
    
      val_preds_list.append(model.predict_proba(X_val))
     #test_preds_list.append(model.predict_proba(X_test)[:,1])
    
    oof[val_idx] = np.mean(val_preds_list,axis=0)
    score = log_loss(y_val, oof[val_idx])
    print(f"fold: {fold},logloss: {score}")
    score_list.append(score)
    fold +=1
  
  cv_logloss = np.mean(score_list)
  
  return cv_logloss
  '''

In [None]:
'''
study = optuna.create_study(direction='minimize')
study.optimize(objective,n_trials= 20)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
'''

### Logistic Regression

In [None]:
'''
encoder = OneHotEncoder()
# Stack train and test rows (DataFrame.append was removed in pandas 2.0)
all_encoded = encoder.fit_transform(np.vstack([train_features_encoded, test_features_encoded]))
#X = all_encoded[0:len(X)]
#X_test = all_encoded[len(X):]
train_features_onehot = all_encoded.tocsr()[0:len(train_features_encoded)]
test_features_onehot = all_encoded.tocsr()[len(train_features_encoded):]
'''
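
A toy version of the one-hot step, assuming small made-up ordinal codes. It shows that fitting on the stacked rows gives train and test the same column layout, and that the sparse result can be sliced by row once it is in CSR format:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy ordinal-encoded arrays, made up for illustration
toy_train_enc = np.array([[0, 1], [1, 0], [2, 1]])
toy_test_enc = np.array([[1, 1], [3, 0]])

# Fit on the stacked rows so both sets share one column layout
stacked = np.vstack([toy_train_enc, toy_test_enc])
onehot = OneHotEncoder().fit_transform(stacked).tocsr()

# Row slicing recovers the two sets
toy_train_oh = onehot[:len(toy_train_enc)]
toy_test_oh = onehot[len(toy_train_enc):]
print(toy_train_oh.shape, toy_test_oh.shape)
```

The first column has four distinct values and the second has two, so both slices end up with six columns.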

In [None]:
'''
params = { 
     'C': 0.0011494694737913215, 
      'multi_class': 'multinomial', 
    'penalty':'elasticnet',
          'solver': 'saga',
      'class_weight': None, 
      'l1_ratio': 0.508725921329706,
    'max_iter':10000,
          'n_jobs':-1
}
'''

In [None]:
'''
X=train_features_onehot
y=train_labels_num


name = 'logistic_regression'
k=5
seed_list=[0,1,2]
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof = np.zeros((X.shape[0],9))
test_preds_list = []
score_list = []
fold=1
  
splits = list(kf.split(X,y))
fold = 1
for train_idx, val_idx in splits:
  X_train, X_val = X[train_idx], X[val_idx]
  y_train, y_val = y[train_idx], y[val_idx]

  val_preds_list = []

  for seed in seed_list:
    
    # fit and run model
    
    base_model = LogisticRegression(**params,random_state=seed)
    model = CalibratedClassifierCV(base_model, method='sigmoid', cv=k)


    model.fit(X_train,y=y_train)

    
    val_preds_list.append(model.predict_proba(X_val))
    test_preds_list.append(model.predict_proba(test_features_onehot))
    
  oof[val_idx] = np.mean(val_preds_list,axis=0)
  score = log_loss(y_val, oof[val_idx])
  print(f"fold: {fold},log_loss: {score}")
  score_list.append(score)
  # print(f"fold: {fold}, class0 tr %: {y_train.value_counts()[0]/len(y_train)}, class0 val %: {y_val.value_counts()[0]/len(y_val)} ")
  fold +=1
  
cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_list,axis=0)


file_name_oof = "logistic_3seeds_oof.txt"
file_name_test = "logistic_3seeds_test.csv"
with open(file_name_oof, "wb") as fp:
      pickle.dump(oof, fp)

#files.download(file_name_oof)

submission.iloc[:,1:] = pd.DataFrame(preds).values
submission.to_csv(file_name_test,index=None)
#files.download(file_name_test) 
'''

### Logistic Regression tuning

In [None]:
'''
# tune the regularisation hyperparameters of the logistic regression
def objective(trial,X=train_features_onehot, y=train_labels_num):
  """
  """
  param_space = {
          'C': trial.suggest_loguniform('C', 1e-3, 1e2),
          'penalty':'elasticnet',
          'solver': 'saga',
          'multi_class':trial.suggest_categorical('multi_class',['ovr','multinomial']),
          'max_iter':10000,
          'class_weight':trial.suggest_categorical('class_weight',['balanced',None])  ,
           'n_jobs':-1,
          'l1_ratio':trial.suggest_uniform('l1_ratio', 0, 1)
                }
            
  #X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=.1,random_state=2021,stratify=y)
  k=5
  seed_list=[0]
  random_seed=0
  kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
  oof = np.zeros((X.shape[0],9))
  score_list = []
  fold=1
  
  splits = list(kf.split(X,y))
  for train_idx, val_idx in splits:
    X_train, X_val = X[train_idx,:], X[val_idx,:]
    y_train, y_val = y[train_idx], y[val_idx]
  
    val_preds_list = []
  
    for seed in seed_list:
      # fit and run model
      param_space['random_state'] = seed

      model = LogisticRegression(**param_space)
      #model = CalibratedClassifierCV(base_model, method='sigmoid', cv=k, n_jobs=-1)
      model.fit(X_train,y=y_train)

    
      val_preds_list.append(model.predict_proba(X_val))
     #test_preds_list.append(model.predict_proba(X_test)[:,1])
    
    oof[val_idx] = np.mean(val_preds_list,axis=0)
    score = log_loss(y_val, oof[val_idx])
    print(f"fold: {fold},logloss: {score}")
    score_list.append(score)
    fold +=1
  
  cv_logloss = np.mean(score_list)
  
  return cv_logloss
  '''

In [None]:
'''
study = optuna.create_study(direction='minimize')
study.optimize(objective,n_trials= 30)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
'''

### Random Forest

In [None]:
'''

params = {
    'max_depth': 25, 
 'n_estimators': 1270, 
 'max_features': 'sqrt', 
 'min_samples_split': 10, 
 'bootstrap': False, 
 'min_samples_leaf': 2
}
'''

In [None]:
'''
train_features_encoded = train_features_encoded.astype(np.int16)
test_features_encoded = test_features_encoded.astype(np.int16)
train_labels_num = train_labels_num.astype(np.int8)
'''
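
The cast above is a memory saving: `OrdinalEncoder` returns `float64` by default, while the codes themselves are small integers. A quick sketch of the effect on a made-up array:

```python
import numpy as np

# Stand-in for an ordinal-encoded matrix stored as float64
encoded = np.arange(12, dtype=np.float64).reshape(4, 3)

# Safe only while every code fits in the int16 range [-32768, 32767]
compact = encoded.astype(np.int16)
print(encoded.nbytes, compact.nbytes)
```

The values are unchanged and the array takes a quarter of the memory, which matters when a random forest holds many copies of the data across parallel jobs.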

In [None]:
'''
X=train_features_encoded
y=train_labels_num

name = 'random_forest'
k=5
seed_list=[0,1,2]
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof = np.zeros((len(X),9))
test_preds_list = []
score_list = []
fold=1
  
splits = list(kf.split(X,y))
for train_idx, val_idx in splits:
  X_train, X_val = X[train_idx], X[val_idx]
  y_train, y_val = y[train_idx], y[val_idx]

  val_preds_list = []

  for seed in seed_list:
    
    # fit and run model
    
    model = RandomForestClassifier(**params,
                                        random_state=seed,  
                                        n_jobs=-1,
                                        criterion = "entropy",
                                       verbose=200)
    #model = CalibratedClassifierCV(base_model, method='sigmoid', cv=k, n_jobs=-1)

    model.fit(X_train,y=y_train)

    
    val_preds_list.append(model.predict_proba(X_val))
    test_preds_list.append(model.predict_proba(test_features_encoded))
    
  oof[val_idx] = np.mean(val_preds_list,axis=0)

  del val_preds_list
    
  score = log_loss(y_val, oof[val_idx])
  print(f"fold: {fold},log_loss: {score}")
  score_list.append(score)

  del score

  # print(f"fold: {fold}, class0 tr %: {y_train.value_counts()[0]/len(y_train)}, class0 val %: {y_val.value_counts()[0]/len(y_val)} ")
  fold +=1
  
cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_list,axis=0)

del test_preds_list

file_name_oof = "rfc_3seed5f_oof.txt"
file_name_test = "rfc_3seed5f_test.csv"
with open(file_name_oof, "wb") as fp:
      pickle.dump(oof, fp)

del oof

#files.download(file_name_oof)
submission = pd.read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")

submission.iloc[:,1:] = pd.DataFrame(preds).values
submission.to_csv(file_name_test,index=None)
#files.download(file_name_test) 
'''

### Random Forest tuning

In [None]:
'''
random_seed=0

# tune the forest hyperparameters
def objective(trial,X=train_features_encoded, y=train_labels_num):
  """
  """
  param_space = {
               'max_depth': trial.suggest_int('max_depth', 2, 30),
               'n_estimators': trial.suggest_int('n_estimators', 200, 2000, step=10),
               'max_features': trial.suggest_categorical('max_features',['auto','sqrt']),
               'min_samples_split':trial.suggest_categorical('min_samples_split',[2,5,10]),
               'bootstrap' : trial.suggest_categorical('bootstrap',[True,False]),
               'min_samples_leaf':trial.suggest_categorical('min_samples_leaf',[2,5,10]),
               # 'min_impurity_decrease':trial.suggest_float('min_impurity_decrease', 0,0.005),
              # 'class_weight' : trial.suggest_categorical('class_weight',['balanced','balanced_subsample',None]),
              #'max_samples':trial.suggest_float('max_samples', 0.01,0.95),
              #'max_leaf_nodes': trial.suggest_int('max_leaf_nodes', 2,100)
               
                }
            
  model = RandomForestClassifier(**param_space,
                                 random_state=random_seed,
                                 n_jobs=-1, 
                                 criterion = "entropy")
  kf = StratifiedKFold(n_splits=5,shuffle=True,random_state=random_seed)
  scores = cross_val_score(model,X,y,scoring='neg_log_loss',cv=kf)
  cv_score = -1*scores.mean()
      
  return cv_score
'''

In [None]:
'''
study = optuna.create_study(direction='minimize')
study.optimize(objective,n_trials= 20)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
'''

In [None]:
#study.best_params

# Stacking

Be careful about leakage during training: use the same KFold seed, and therefore the same fold assignment, for every base model and for the stacking step.
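
As a quick check of that precaution, the sketch below shows that `StratifiedKFold` with the same seed gives identical folds whether it is applied to the base-model features or to the stacked predictions, because the split depends only on the labels and the seed. The arrays and the helper `fold_indices` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.repeat(np.arange(3), 20)   # hypothetical labels: 3 classes x 20 rows
X_base = np.zeros((len(y_toy), 4))    # features a base model would see
X_meta = np.zeros((len(y_toy), 9))    # OOF predictions the meta-model would see

def fold_indices(X, y, seed):
    """Return the validation indices of each fold as plain lists."""
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return [val_idx.tolist() for _, val_idx in kf.split(X, y)]

# Same seed: the folds line up even though the feature matrices differ.
same_seed_match = fold_indices(X_base, y_toy, seed=0) == fold_indices(X_meta, y_toy, seed=0)
# Different seed: the folds will typically not line up.
other_seed_match = fold_indices(X_base, y_toy, seed=0) == fold_indices(X_meta, y_toy, seed=1)
print(same_seed_match, other_seed_match)
```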

## Import the base model results

Out-of-fold (cross-validation) predictions from each base model

In [None]:
'''
input_val = []

val_result = ["../input/base-model/Base models results/xgboost_3seed_5fold_oof.txt",
               "../input/base-model/Base models results/lightgbm_3seed_5fold_oof.txt",
                "../input/base-model/Base models results/catboost_3seeds_5fold_oof.txt",
              #  "../input/base-model/Base models results/rfc_3seed5f_oof.txt",
               # "../input/base-model/Base models results/logistic_3seeds_oof.txt"
               ]

for path in val_result:
    with open(path, "rb") as fp:
        input_val.append(pickle.load(fp))
'''

In [None]:
#input_val = pd.DataFrame(np.hstack(input_val))
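
To make the shape of the stacked input concrete, here is a small sketch with random arrays standing in for the pickled OOF files. The three arrays, their six rows and the name `meta_features` are made up for illustration; only the `np.hstack` step mirrors the cell above.

```python
import numpy as np
import pandas as pd

n_rows, n_classes = 6, 9
rng = np.random.default_rng(0)

# Hypothetical stand-ins for three base models' OOF arrays, each of shape (n_rows, 9).
oof_arrays = [rng.dirichlet(np.ones(n_classes), size=n_rows) for _ in range(3)]

# Place the arrays side by side: one block of 9 probability columns per base model.
meta_features = pd.DataFrame(np.hstack(oof_arrays))
print(meta_features.shape)  # (6, 27)
```

With three base models and nine classes, the meta-model therefore trains on 27 columns, and the test-set frame needs the same column order.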

Test set prediction results from each base model

In [None]:
'''
input_test = pd.DataFrame()

test_result = ["../input/base-model/Base models results/xgboost_3seed_5fold_test.csv",
               "../input/base-model/Base models results/lightgbm_3seed_5fold_test.csv",
                "../input/base-model/Base models results/catboost_3seeds_5fold_test.csv",
               # "../input/base-model/Base models results/rfc_3seed5f_test.csv",
               # "../input/base-model/Base models results/logistic_3seeds_test.csv"
               ]

for tr in test_result:
    input_test = pd.concat([input_test, pd.read_csv(tr).iloc[: ,1:]], axis=1, sort=False)
  '''

In [None]:
#input_test.columns = input_val.columns

The meta-model is a ridge classifier wrapped in `CalibratedClassifierCV`, since `RidgeClassifier` has no `predict_proba` of its own.

In [None]:
'''
params = {
 'alpha': 62.040049045839396, 
 'solver': 'svd',
      'max_iter':10000,
 'class_weight': None}
 '''

In [None]:
'''
X = input_val
y = train_labels_num

name = 'stackingridge_5f'
k=5
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof_stack = np.zeros((len(X),9))

#seed_list=[0,1,2]
score_list= []
fold = 1
test_preds_stack = []

for train_index, test_index in kf.split(X, y):
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]    
    
    rd = CalibratedClassifierCV(RidgeClassifier(**params), n_jobs=-1)
    
    rd.fit(X_train, y_train)
    y_stack = rd.predict_proba(X_val)
   
    
    oof_stack[test_index] = y_stack*1
    score = log_loss(y_val, oof_stack[test_index])
    print(f"fold: {fold},log_loss: {score}")
  
    score_list.append(score)
    
    test_preds_stack.append(rd.predict_proba(input_test.values))
    fold +=1

cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_stack,axis=0)

file_name_oof = "stackingridge_5f_oof.txt"
file_name_test = "stackingridge_5f_test.csv"
with open(file_name_oof, "wb") as fp:
    pickle.dump(oof_stack, fp)

submission = pd.read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")

submission.iloc[:,1:] = pd.DataFrame(preds)
submission.to_csv(file_name_test,index=None)
'''
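
The reason for the calibration wrapper in the ridge cell above can be seen in a short sketch on toy data. The dataset, `alpha=1.0`, and the names ending in `_r` are assumptions for illustration; the tuned `params` from the notebook are not used here.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Toy 3-class problem (assumption; the competition has 9 classes).
X_r, y_r = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# RidgeClassifier only exposes decision_function, so log loss cannot be computed directly.
has_proba = hasattr(RidgeClassifier(alpha=1.0), "predict_proba")

# Wrapping it in CalibratedClassifierCV maps the decision scores to probabilities.
calibrated_r = CalibratedClassifierCV(RidgeClassifier(alpha=1.0), method="sigmoid", cv=5)
calibrated_r.fit(X_r, y_r)
proba_r = calibrated_r.predict_proba(X_r[:5])

print(has_proba, proba_r.shape)
```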

The meta-model is LightGBM.

In [None]:
'''
params = {
    'boosting_type': 'gbdt',
    'reg_lambda': 18.47848662046526,
 'reg_alpha': 0.09586897470473404, 
 'colsample_bytree': 0.4444514204868687,
 'subsample': 0.373940404514446,
 'learning_rate': 0.01, 
 'num_leaves': 38, 
 'min_child_samples': 7, 
 'max_depth': 21,
    'n_estimators':100000,
               'objective':'multiclass',
               'metric':'multi_logloss',
                'n_jobs':-1
}
'''

In [None]:
'''
X = input_val
y = train_labels_num

name = 'stackinglgb_3seed_5f'
k=5
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof_stack = np.zeros((X.shape[0],9))

seed_list=[0,1,2]
score_list= []
fold = 1
test_preds_stack = []

splits = list(kf.split(X,y))
  
for train_index, test_index in splits:
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]    
    
    val_preds_list = []
    
    for seed in seed_list:
    
        params['random_state'] = seed

        model = LGBMClassifier(**params)
        model.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val,y_val)],
                early_stopping_rounds=100,
                eval_names=['train','val'],verbose=0)

        val_preds_list.append(model.predict_proba(X_val))
        test_preds_stack.append(model.predict_proba(input_test.values))
    
    
    oof_stack[test_index] = np.mean(val_preds_list,axis=0)
    score = log_loss(y_val, oof_stack[test_index])
    print(f"fold: {fold},log_loss: {score}")
  
    score_list.append(score)
    fold +=1

cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_stack,axis=0)

file_name_oof = name + "_oof.txt"
file_name_test = name + "_test.csv"
with open(file_name_oof, "wb") as fp:
    pickle.dump(oof_stack, fp)

submission = pd.read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")

submission.iloc[:,1:] = pd.DataFrame(preds)
submission.to_csv(file_name_test,index=None)
'''

The meta-model is LightGBM, with logistic regression dropped from the base models.

In [None]:
'''
params = {
    
    'boosting_type': 'gbdt',
    'n_estimators':100000,
               'objective':'multiclass',
               'metric':'multi_logloss',
                'n_jobs':-1,
'reg_lambda': 1.1677970419963015, 
 'reg_alpha': 25.64393399350136,
 'colsample_bytree': 0.7698192407526574,
 'subsample': 0.4912058042676565, 
 'learning_rate': 0.01, 
 'num_leaves': 120, 
 'min_child_samples': 365, 
 'max_depth': 4
}
'''

'''
params = {
    'boosting_type': 'gbdt',
    'n_estimators':100000,
               'objective':'multiclass',
               'metric':'multi_logloss',
                'n_jobs':-1,
'reg_lambda': 0.8189181015375904, 
 'reg_alpha': 0.25487382221563054, 
 'colsample_bytree': 0.1275201917021311, 
 'subsample': 0.6396666339670933, 
 'learning_rate': 0.01, 
 'num_leaves': 44,
 'min_child_samples': 11, 
 'max_depth': 56
}
'''

In [None]:
'''
X = input_val
y = train_labels_num

name = 'stackinglgbnolog_3seed_5f'
k=5
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof_stack = np.zeros((X.shape[0],9))

seed_list=[0,1,2]
score_list= []
fold = 1
test_preds_stack = []

splits = list(kf.split(X,y))
  
for train_index, test_index in splits:
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]    
    
    val_preds_list = []
    
    for seed in seed_list:
    
        params['random_state'] = seed

        model = LGBMClassifier(**params)
        model.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val,y_val)],
                early_stopping_rounds=100,
                eval_names=['train','val'],verbose=0)

        val_preds_list.append(model.predict_proba(X_val))
        test_preds_stack.append(model.predict_proba(input_test.values))
    
    
    oof_stack[test_index] = np.mean(val_preds_list,axis=0)
    score = log_loss(y_val, oof_stack[test_index])
    print(f"fold: {fold},log_loss: {score}")
  
    score_list.append(score)
    fold +=1

cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_stack,axis=0)

file_name_oof = name + "_oof.txt"
file_name_test = name + "_test.csv"
with open(file_name_oof, "wb") as fp:
    pickle.dump(oof_stack, fp)

submission = pd.read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")

submission.iloc[:,1:] = pd.DataFrame(preds).values
submission.to_csv(file_name_test,index=None)
'''

The meta-model is XGBoost, with logistic regression dropped from the base models.

In [None]:
'''
params = {
'lambda': 0.0010446502460788832, 
 'alpha': 1.0638896344949464, 
 'colsample_bytree': 0.899300048854003,
 'colsample_bynode': 0.457360032783254, 
 'colsample_bylevel': 0.7961501791591739, 
 'subsample': 0.5572526278185042, 
 'eta': 0.01, 
 'grow_policy': 'depthwise', 
 'max_depth': 2, 
 'min_child_weight': 46, 
 'max_bin': 409, 
 'deterministic_histogram': False
}
'''

The meta-model is XGBoost, with logistic regression kept among the base models.

In [None]:
'''
params = {
    'lambda': 0.011483926762852138, 
 'alpha': 0.3063338385041086, 
 'colsample_bytree': 0.8674369490772537, 
 'colsample_bynode': 0.7529165609782398, 
 'colsample_bylevel': 0.6927394353409445, 
 'subsample': 0.5541902902608168, 
 'eta': 0.01,
 'grow_policy': 'lossguide',
 'max_depth': 4,
 'min_child_weight': 149, 
 'max_bin': 512, 
 'deterministic_histogram': False
}
'''

The meta-model is XGBoost, with both logistic regression and random forest dropped from the base models.

In [None]:
'''
params = {
    'lambda': 8.78438796741932, 
     'alpha': 1.5156056424257214, 
     'colsample_bytree': 0.6746676803716631, 
     'colsample_bynode': 0.23151927366501895,
     'colsample_bylevel': 0.6770030260262497, 
     'subsample': 0.4258029694908929,
     'eta': 0.01,
     'grow_policy': 'lossguide', 
     'max_depth': 4, 
     'min_child_weight': 37,
     'max_bin': 288, 
     'deterministic_histogram': False 
}
'''

In [None]:
'''
X = input_val
y = train_labels_num

params_xgb = params
params_xgb["tree_method"] = "hist"
params_xgb["predictor"] = 'cpu_predictor'
params_xgb["objective"] = 'multi:softprob'
params_xgb["num_class"] = 9
params_xgb["eval_metric"] ='mlogloss'

name = 'stackingxgbnolog_3seed_5f'
k=5
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
oof_stack = np.zeros((X.shape[0],9))

seed_list=[0,1,2]
score_list= []
fold = 1
test_preds_stack = []

splits = list(kf.split(X,y))
  
for train_index, test_index in splits:
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]    
    
    val_preds_list = []
    
    for seed in seed_list:
    
        params['random_state'] = seed
        
        dtrain = xgboost.DMatrix(data=X_train, label=y_train)
        dval = xgboost.DMatrix(data=X_val, label=y_val)
        dtest = xgboost.DMatrix(data=input_test)

        model = xgboost.train(params_xgb, dtrain,\
                       evals=[(dtrain,'train'),(dval,'val')],\
                       verbose_eval=False,
                       early_stopping_rounds=100,
                       num_boost_round=100000)

        val_preds_list.append(model.predict(dval))
        test_preds_stack.append(model.predict(dtest))
    
    
    oof_stack[test_index] = np.mean(val_preds_list,axis=0)
    score = log_loss(y_val, oof_stack[test_index])
    print(f"fold: {fold},log_loss: {score}")
  
    score_list.append(score)
    fold +=1

cv_logloss = np.mean(score_list)
print(f"{name} ,log_loss: {cv_logloss}")

preds= np.mean(test_preds_stack,axis=0)

file_name_oof = name + "_oof.txt"
file_name_test = name + "_test.csv"
with open(file_name_oof, "wb") as fp:
    pickle.dump(oof_stack, fp)

submission = pd.read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")

submission.iloc[:,1:] = pd.DataFrame(preds).values
submission.to_csv(file_name_test,index=None)
'''

### Stacking tuning

The meta-model is a ridge classifier wrapped in `CalibratedClassifierCV`.

In [None]:
'''
# tune the ridge penalty, solver and class weighting of the calibrated ridge meta-model
def objective(trial,X=input_val, y=train_labels_num):
  """
  Optuna objective: return the mean 5-fold CV log loss of the calibrated ridge meta-model.
  """
  param_space = {
          'alpha': trial.suggest_loguniform('alpha', 1e-3, 1e2),
          'solver': trial.suggest_categorical('solver',['svd', 'cholesky','sparse_cg', 'lsqr', 'sag', 'saga']),
          'max_iter':10000,
          'class_weight':trial.suggest_categorical('class_weight',['balanced',None])  
           #'n_jobs':-1
                }
            
  #X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=.1,random_state=2021,stratify=y)
  k=5
  random_seed=0
  kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
  oof = np.zeros((X.shape[0],9))
  score_list = []
  fold=1
  
  splits = list(kf.split(X,y))
    
  for train_index, test_index in splits:
    
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]
  
    val_preds_list = []
  
    rd = CalibratedClassifierCV(RidgeClassifier(**param_space, random_state=random_seed), n_jobs=-1)
    
    rd.fit(X_train, y_train)
    y_stack = rd.predict_proba(X_val)
    
    oof[test_index] = y_stack*1
    score = log_loss(y_val, oof[test_index])
    print(f"fold: {fold},logloss: {score}")
    score_list.append(score)
    fold +=1
  
  cv_logloss = np.mean(score_list)
  
  return cv_logloss
  '''

The meta-model is LightGBM.

In [None]:
'''
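# with the learning rate fixed at 0.01, let early stopping choose the number of iterations and tune the tree hyperparameters
def objective(trial,X=input_val, y=train_labels_num):
  """
  Optuna objective: return the mean 5-fold CV log loss of the LightGBM meta-model.
  """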
  param_space = {
               'device':'gpu',  # Use GPU acceleration
               'boosting_type': 'gbdt',
               'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 1e3),
               'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 1e3),
               'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1 , 1.0),
               'subsample': trial.suggest_float('subsample', 0.1, 1.0),
               'learning_rate': trial.suggest_loguniform('learning_rate', 1e-2, 1e-2),
               'num_leaves': trial.suggest_int("num_leaves", 31,256),
               'min_child_samples': trial.suggest_int('min_child_samples', 1, 500),
               'max_depth':trial.suggest_int('max_depth',3,127),
               'n_estimators':100000,
               'objective':'multiclass',
               'metric':'multi_logloss',
               # 'n_jobs':-1
                }
            
  #X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=.1,random_state=2021,stratify=y)
  k=5
  random_seed=0
  kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
  oof = np.zeros((X.shape[0],9))
  score_list = []
  fold=1
  
  splits = list(kf.split(X,y))
    
  for train_index, test_index in splits:
    
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]
  
    val_preds_list = []
  
    param_space['random_state'] = random_seed

    model = LGBMClassifier(**param_space)
    model.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val,y_val)],
                early_stopping_rounds=100,
                eval_names=['train','val'],verbose=0)

    y_stack = model.predict_proba(X_val)
    
    oof[test_index] = y_stack*1
    score = log_loss(y_val, oof[test_index])
    print(f"fold: {fold},logloss: {score}")
    score_list.append(score)
    fold +=1
  
  cv_logloss = np.mean(score_list)
  
  return cv_logloss
  '''

The meta-model is LightGBM, with logistic regression dropped from the base models.

In [None]:
'''
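# with the learning rate fixed at 0.01, let early stopping choose the number of iterations and tune the tree hyperparameters
def objective(trial,X=input_val, y=train_labels_num):
  """
  Optuna objective: return the mean 5-fold CV log loss of the LightGBM meta-model.
  """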
  param_space = {
               #'device':'gpu',  # Use GPU acceleration
               'boosting_type': 'gbdt',
               'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 1e3),
               'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 1e3),
               'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1 , 1.0),
               'subsample': trial.suggest_float('subsample', 0.1, 1.0),
               'learning_rate': trial.suggest_loguniform('learning_rate', 1e-2, 1e-2),
               'num_leaves': trial.suggest_int("num_leaves", 31,256),
               'min_child_samples': trial.suggest_int('min_child_samples', 1, 500),
               'max_depth':trial.suggest_int('max_depth',3,127),
               'n_estimators':100000,
               'objective':'multiclass',
               'metric':'multi_logloss',
               'n_jobs':-1
                }
            
  #X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=.1,random_state=2021,stratify=y)
  k=5
  random_seed=0
  kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
  oof = np.zeros((X.shape[0],9))
  score_list = []
  fold=1
  
  splits = list(kf.split(X,y))
    
  for train_index, test_index in splits:
    
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]
  
    val_preds_list = []
  
    param_space['random_state'] = random_seed

    model = LGBMClassifier(**param_space)
    model.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val,y_val)],
                early_stopping_rounds=100,
                eval_names=['train','val'],verbose=0)

    y_stack = model.predict_proba(X_val)
    
    oof[test_index] = y_stack*1
    score = log_loss(y_val, oof[test_index])
    print(f"fold: {fold},logloss: {score}")
    score_list.append(score)
    fold +=1
  
  cv_logloss = np.mean(score_list)
  
  return cv_logloss
'''

In [None]:
'''
study = optuna.create_study(direction='minimize')
study.optimize(objective,n_trials= 20)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
'''

The meta-model is XGBoost, with logistic regression dropped from the base models.

In [None]:
'''
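# with eta fixed at 0.01, let early stopping choose the number of boosting rounds and tune the tree hyperparameters
def objective(trial,X=input_val, y=train_labels_num):
  """
  Optuna objective: return the mean 5-fold CV log loss of the XGBoost meta-model.
  """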
  param_space = {
              'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
                'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.9),
                'colsample_bynode': trial.suggest_float('colsample_bynode', 0.1, 0.9),
                'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.1, 0.9),
                'subsample': trial.suggest_float('subsample', 0.1, 0.9),
                'eta':trial.suggest_float('eta', 1e-2, 1e-2),
                'grow_policy': trial.suggest_categorical("grow_policy", ['depthwise','lossguide']),
                'max_depth': trial.suggest_int('max_depth',2,25),
                'seed': 0,
                'min_child_weight': trial.suggest_int('min_child_weight', 0, 300),
                'max_bin': trial.suggest_int('max_bin', 256, 512),
                'deterministic_histogram':trial.suggest_categorical('deterministic_histogram',[False]),
               "tree_method" : "hist",
                "predictor" : 'cpu_predictor',
                "objective" : 'multi:softprob',
                 "num_class":9
                  
                }
            
  #X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=.1,random_state=2021,stratify=y)
  k=5
  random_seed=0
  kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
  oof = np.zeros((X.shape[0],9))
  score_list = []
  fold=1
  
  splits = list(kf.split(X,y))
    
  for train_index, test_index in splits:
    
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]
  
    val_preds_list = []
  
    param_space['random_state'] = random_seed
    param_space['n_jobs'] = -1

    dtrain = xgboost.DMatrix(data=X_train, label=y_train)
    dval = xgboost.DMatrix(data=X_val, label=y_val)
    #dtest = xgboost.DMatrix(data=test_features_encoded)
    xgboost.set_config(verbosity=0)

      
    model = xgboost.train(param_space, dtrain,\
                       evals=[(dtrain,'train'),(dval,'val')],\
                       verbose_eval=False,
                       early_stopping_rounds=100,
                       num_boost_round=100000
                         )


    y_stack = model.predict(dval)
    
    oof[test_index] = y_stack*1
    score = log_loss(y_val, oof[test_index])
    print(f"fold: {fold},logloss: {score}")
    score_list.append(score)
    fold +=1
  
  cv_logloss = np.mean(score_list)
  
  return cv_logloss
'''

In [None]:
'''
study = optuna.create_study(direction='minimize')
study.optimize(objective,n_trials= 20)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
'''

### Clipping Ends

This is an experiment only: clipping the stacked probabilities away from 0 and 1 gave no improvement in log loss.
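
For intuition on what clipping can and cannot do, here is a toy sketch. Clipping helps log loss mainly when a model is confidently wrong, which may explain why it did nothing here, although that is a guess rather than something measured in this notebook. The helper `multiclass_log_loss`, the probabilities and the explicit renormalisation are assumptions for illustration; the notebook itself calls `log_loss` from scikit-learn.

```python
import numpy as np

def multiclass_log_loss(y_true, proba):
    """Mean negative log-probability assigned to the true class."""
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, y_true])))

y_true_toy = np.array([0, 1, 2])
proba_toy = np.array([
    [0.98, 0.01, 0.01],   # confident and right
    [0.01, 0.98, 0.01],   # confident and right
    [0.98, 0.01, 0.01],   # confident and wrong
])

losses = {}
for clip in (0.0, 0.01, 0.02, 0.05):
    clipped = np.clip(proba_toy, clip, 1 - clip)
    # Renormalise so each row still sums to 1 after clipping.
    clipped = clipped / clipped.sum(axis=1, keepdims=True)
    losses[clip] = multiclass_log_loss(y_true_toy, clipped)
    print(clip, round(losses[clip], 4))
```

In this toy case the single confident mistake dominates the loss, so a larger clip lowers it. When predictions are rarely that extreme, clipping changes almost nothing.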

In [None]:

with open("../input/stacking-30062021/stackingxgbnolognorf_3seed_5f_oof.txt", "rb") as fp:
    oof_stacking = pickle.load(fp)
test_stacking = pd.read_csv("../input/stacking-30062021/stackingxgbnolognorf_3seed_5f_test.csv").iloc[:, 1:]


In [None]:

X = oof_stacking
y = train_labels_num

best_clip = 0
clipd={0:0,
      0.005:0,
      0.01:0,
      0.015:0,
      0.02:0}

k=5
random_seed=0
kf = StratifiedKFold(n_splits=k,shuffle=True,random_state=random_seed)
splits = list(kf.split(X,y))

for cli in clipd.keys():

    fold=1
    oof_stack = np.zeros((X.shape[0],9))
    score_list= []
    
    for train_index, test_index in splits:
        X_train, X_val = X[train_index], X[test_index]
        y_train, y_val = y[train_index], y[test_index]    

        oof_stack[test_index] = np.clip(X_val, cli, 1-cli)
        score = log_loss(y_val, oof_stack[test_index])
        #print(f"fold: {fold},log_loss: {score}")

        score_list.append(score)
        fold +=1

    cv_logloss = np.mean(score_list)
    clipd[cli]=cv_logloss
    print(f"{cli} clip,log_loss: {cv_logloss}")

best_clip=min(clipd, key=clipd.get)
preds= np.clip(test_stacking, best_clip, 1-best_clip)
oof_stack= np.clip(oof_stacking, best_clip, 1-best_clip)
print(f"Best clip is {best_clip}")

file_name_oof = str(best_clip) + "clip_oof.txt"
file_name_test = str(best_clip) + "clip_test.csv"
with open(file_name_oof, "wb") as fp:
    pickle.dump(oof_stack, fp)

submission = pd.read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")

submission.iloc[:,1:] = pd.DataFrame(preds)
submission.to_csv(file_name_test,index=None)


## Performance Log

| Model | Run name | CV log loss | Public LB | Notes |
|---|---|---|---|---|
| LGB | lightgbm_3seed_5fold | 1.744902147971151 | 1.74898 | change save data for stacking, version 81 |
| XGB | xgboost_3seed_5fold | 1.7435762563824553 | 1.74756 | |
| Random Forest | random_forest | 1.7544150313584528 | 1.75642 | change save data for stacking, version 79 |
| Catboost | catboost_3seeds_5fold | 1.7446580996538827 | 1.74870 | change save data for stacking, version 72 |
| Logistic Regression | logistic_regression | 1.7676670908351035 | 1.77073 | |
| Stacking ridge | stackingridge_5f | 1.750612802054284 | 1.75346 | |
| Stacking lgb | stackinglgb_3seed_5f | 1.7427530499190707 | 1.74649 | |
| Stacking lgb no logistic | stackinglgbnolog_3seed_5f | 1.742280337930729 | 1.74618 | version 65 |
| Stacking lgb no logistic (repeat) | stackinglgbnolog_3seed_5f | 1.7425574363337255 | 1.74643 | repeat with better base model but worse stacking, version 85 |
| Stacking xgb no logistic | stackingxgbnolog_3seed_5f | 1.7416487316326503 | 1.74587 | version 90 |
| Stacking xgb with logistic | stackingxgblog_3seed_5f | 1.741866647633571 | 1.74573 | version 93 |
| Stacking xgb no logistic no rf | stackingxgbnolognorf_3seed_5f | 1.741812529353016 | 1.74603 | version 96 |