These notebooks really inspired me:<br>
https://www.kaggle.com/code/pourchot/simple-soft-voting<br>
https://www.kaggle.com/code/ricopue/tps-jul22-clusters-and-lgb<br>
https://www.kaggle.com/code/ambrosm/tpsjul22-gaussian-mixture-cluster-analysis<br>
https://www.kaggle.com/code/thedevastator/how-to-ensemble-clustering-algorithms-updated<br>
https://www.kaggle.com/code/eduus710/getting-cluster-ensembles-to-work<br>
https://www.kaggle.com/code/plarmuseau/bruteforce-clustering<br>
https://www.kaggle.com/code/thedevastator/bruteforce-clustering<br>
(and some others; sorry, I don't remember all of them)<br>
Thank you very much.

With Ricopue's notebook [here][2], I got a score of 0.61419 on the public leaderboard (3rd place on the public leaderboard on July 13th).<br>
Then I wondered whether I could get a higher score with a method other than LGBM, which can overfit.<br>
I tried a small (and fast) **ExtraTrees on trusted data after BayesianGaussianMixture**, which improved the silhouette / Calinski-Harabasz / Davies-Bouldin scores over BayesianGaussianMixture alone (silhouette 6.0% vs 5.0%).<br>
Next, **QDA** was even more impressive.<br>
Finally, I tried **soft voting**, as in Laurent Pourchot's notebook [here][1].<br>

A high AUC or accuracy after classification (LGBM, or QDA with reg_param = 0) can give a high score on the public leaderboard. Interesting, isn't it?

[1]: https://www.kaggle.com/code/pourchot/simple-soft-voting
[2]: https://www.kaggle.com/code/ricopue/tps-jul22-clusters-and-lgb

# Libraries / data

In [1]:
import numpy as np
import pandas as pd
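# Use the full option names: the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 200)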

import matplotlib.pyplot as plt
import seaborn as sns

import gc, random, os

from sklearn.model_selection import StratifiedKFold

from sklearn.preprocessing import PowerTransformer

from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

from sklearn.mixture import BayesianGaussianMixture

import lightgbm as lgb
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

SEED = 666 # please choose a different seed than mine
N_FOLDS = 10
N_CLUSTERS = 7

def seed_everything(seed=SEED):
    # Seed Python's hashing, the random module and NumPy for reproducibility
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

seed_everything()

In [2]:
df = pd.read_csv("../input/tabular-playground-series-jul-2022/data.csv", usecols = [f"f_{i+1:02d}" for i in range(28)])
df.head()

Unnamed: 0,f_01,f_02,f_03,f_04,f_05,f_06,f_07,f_08,f_09,f_10,f_11,f_12,f_13,f_14,f_15,f_16,f_17,f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25,f_26,f_27,f_28
0,-0.912791,0.648951,0.589045,-0.830817,0.733624,2.25856,2,13,14,5,13,6,6,-0.469819,0.358126,1.068105,-0.55965,-0.366905,-0.478412,-0.757002,-0.763635,-1.090369,1.142641,-0.884274,1.137896,1.309073,1.463002,0.813527
1,-0.453954,0.654175,0.995248,-1.65302,0.86381,-0.090651,2,3,6,4,6,16,9,0.591035,-0.396915,0.145834,-0.030798,0.471167,-0.428791,-0.089908,-1.784204,-0.839474,0.459685,1.759412,-0.275422,-0.852168,0.562457,-2.680541
2,0.324568,-1.170602,-0.624491,0.105448,0.783948,1.988301,5,11,5,8,9,3,11,-0.679875,0.469326,0.349843,-0.288042,0.29147,-0.413534,-1.602377,1.190984,3.267116,-0.088322,-2.168635,-0.974989,1.335763,-1.110655,-3.630723
3,0.229049,0.264109,0.23152,0.415012,-1.221269,0.13885,6,2,13,8,9,6,4,-0.389456,0.626762,-1.074543,-1.521753,-1.150806,0.619283,1.287801,0.532837,1.036631,-2.041828,1.44049,-1.900191,-0.630771,-0.050641,0.238333
4,-1.039533,-0.270155,-1.830264,-0.290108,-1.852809,0.781898,8,7,5,3,1,13,11,-0.120743,-0.615578,-1.064359,0.444142,0.428327,-1.62883,-0.434948,0.322505,0.284326,-2.438365,1.47393,-1.044684,1.602686,-0.405263,-1.987263


In [3]:
all_scores = []
usefull_cols = [f"f_{i:02d}" for i in list(range(7, 14)) + list(range(22, 29))]

def scores(preds, lib, df=df[usefull_cols], verbose = True, compute_silhouette = True): 
    
    # Silhouette is very slow
    sil = 0
    if compute_silhouette:
        sil = silhouette_score(df, preds, metric='euclidean')
    
    s = (lib,
         sil, 
         calinski_harabasz_score(df, preds), 
         davies_bouldin_score(df, preds))
    
    if verbose:
        print(f"{s[0]} : Silhouette : {s[1]:.1%} | Calinski Harabasz : {s[2]:.1f} | Davis Bouldin : {s[3]:.3f}")
        
    return s
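
To read the three metrics returned by `scores`: silhouette lies in [-1, 1] and higher is better, Calinski-Harabasz is higher-is-better, and Davies-Bouldin is lower-is-better. Here is a minimal sketch on synthetic blobs, comparing the true labels with shuffled labels. The toy data and the names `X_toy`, `y_toy` and `toy_scores` are illustrative only and are not part of the competition data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Toy data: three well-separated blobs (illustrative only)
X_toy, y_toy = make_blobs(n_samples=600, centers=[[0, 0], [5, 5], [-5, 5]],
                          cluster_std=0.5, random_state=0)

# Shuffling the labels destroys the cluster structure
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y_toy)

def toy_scores(X, labels):
    """Return (silhouette, Calinski-Harabasz, Davies-Bouldin) for a labelling."""
    return (silhouette_score(X, labels),
            calinski_harabasz_score(X, labels),
            davies_bouldin_score(X, labels))

good = toy_scores(X_toy, y_toy)
bad = toy_scores(X_toy, y_shuffled)
print("true labels    :", np.round(good, 3))
print("shuffled labels:", np.round(bad, 3))
```

The true labels should win on all three metrics, which is the direction to look for when comparing the models below.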

# Bayesian Gaussian Mixture

In [4]:
df_scaled = pd.DataFrame(PowerTransformer().fit_transform(df[usefull_cols]), columns = usefull_cols)

BGM = BayesianGaussianMixture(n_components = N_CLUSTERS, covariance_type = 'full', random_state = SEED, n_init = 5, tol=.01)
BGM.fit(df_scaled)

BGM_predict = BGM.predict(df_scaled)
BGM_predict_proba = BGM.predict_proba(df_scaled)

all_scores.append(scores(BGM_predict, lib="BayesianGaussianMixture after powertransformer"))

BayesianGaussianMixture after powertransformer : Silhouette : 5.0% | Calinski Harabasz : 8157.7 | Davis Bouldin : 2.657


# Trusted Data
https://www.kaggle.com/code/ricopue/tps-jul22-clusters-and-lgb<br>
Thanks to Ricopue

In [5]:
# get trusted data to train LGB model.
proba_threshold = .69

df_scaled['predict'] = BGM_predict
df_scaled['predict_proba'] = 0.0  # float column, filled below with the probability of the assigned cluster
for n in range(N_CLUSTERS):
    df_scaled[f'predict_proba_{n}'] = BGM_predict_proba[:,n]
    df_scaled.loc[df_scaled.predict == n, 'predict_proba'] = df_scaled[f'predict_proba_{n}']
    
    
idxs = np.array([], dtype=int)  # integer row labels of the trusted samples
for n in range(N_CLUSTERS):
    median = df_scaled[df_scaled.predict==n]['predict_proba'].median()
    idx = df_scaled[(df_scaled.predict==n) & (df_scaled.predict_proba > proba_threshold)].index
    idxs = np.concatenate((idxs, idx))
    print(f'Class n°{n}  |  Median : {median:.4f}  |  Training data : {len(idx)/len(df_scaled[(df_scaled.predict==n)]):.1%}')
    
X = df_scaled.loc[idxs][usefull_cols]
y = df_scaled.loc[idxs]['predict']

Class n°0  |  Median : 0.8682  |  Training data : 72.9%
Class n°1  |  Median : 0.9376  |  Training data : 79.4%
Class n°2  |  Median : 0.7302  |  Training data : 55.5%
Class n°3  |  Median : 0.9833  |  Training data : 88.5%
Class n°4  |  Median : 0.8670  |  Training data : 73.2%
Class n°5  |  Median : 0.9071  |  Training data : 78.0%
Class n°6  |  Median : 0.9117  |  Training data : 76.1%
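
The selection above keeps a row only when the mixture assigns it to its cluster with a probability above `proba_threshold`. Assuming the hard label is the argmax of the membership probabilities, the same idea can be sketched in vectorized form on a toy probability matrix (`toy_proba` is made up for illustration):

```python
import numpy as np

proba_threshold = 0.69

# Hypothetical membership probabilities for 5 rows and 3 clusters
toy_proba = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.02, 0.28, 0.70],
])

labels = toy_proba.argmax(axis=1)        # hard cluster assignment
confidence = toy_proba.max(axis=1)       # probability of the assigned cluster
trusted = confidence > proba_threshold   # boolean mask of "trusted" rows

trusted_idx = np.flatnonzero(trusted)
print(trusted_idx, labels[trusted])
```

Rows 1 and 3 are dropped because no cluster reaches the threshold; only the confident rows would be used to train the classifiers.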


# LightGBM on trusted data after BayesianGaussianMixture
https://www.kaggle.com/code/ricopue/tps-jul22-clusters-and-lgb<br>
Thanks to Ricopue

In [6]:
params_lgb = {'learning_rate': 0.07,'objective': 'multiclass','boosting': 'gbdt','verbosity': -1,'n_jobs': -1, 'num_classes':N_CLUSTERS} 

lgbm_predict_proba = 0 ; classif_scores = []

gkf = StratifiedKFold(N_FOLDS, shuffle=True, random_state = SEED)
for fold, (trn_idx, val_idx) in enumerate(gkf.split(X,y)):   

    X_trn = lgb.Dataset(X.iloc[trn_idx], y.iloc[trn_idx], feature_name = usefull_cols)
    X_val = lgb.Dataset(X.iloc[val_idx], y.iloc[val_idx], feature_name = usefull_cols)
    
    model = lgb.train(params = params_lgb, 
                train_set = X_trn, valid_sets =  X_val, 
                num_boost_round = 5000, 
                callbacks = [ lgb.early_stopping(stopping_rounds=100, verbose=True), lgb.log_evaluation(period=200)])  
    
    y_pred_proba = model.predict(X.iloc[val_idx])
    y_pred = np.argmax(y_pred_proba, axis=1)
    
    s = (balanced_accuracy_score(y.iloc[val_idx], y_pred),
        roc_auc_score(y.iloc[val_idx], y_pred_proba, average="weighted", multi_class="ovo"))
    print(f"Fold n°{fold+1} on LGBM. AUC : {s[1]:.3f} | Accuracy : {s[0]:.1%}\n")
    classif_scores.append(s)

    lgbm_predict_proba += model.predict(df_scaled[usefull_cols]) / N_FOLDS
    
all_scores.append(scores(np.argmax(lgbm_predict_proba, axis=1), lib="LGBM after BayesianGaussianMixture - threshold 0.69"))

pd.DataFrame(classif_scores, columns = ["balanced_accuracy_score", "roc_auc_score"]).mean(0)

Training until validation scores don't improve for 100 rounds
[200]	valid_0's multi_logloss: 0.0420435
[400]	valid_0's multi_logloss: 0.0257652
[600]	valid_0's multi_logloss: 0.0228872
Early stopping, best iteration is:
[637]	valid_0's multi_logloss: 0.0228498
Fold n°1 on LGBM. AUC : 1.000 | Accuracy : 99.1%

Training until validation scores don't improve for 100 rounds
[200]	valid_0's multi_logloss: 0.0421074
[400]	valid_0's multi_logloss: 0.0231106
[600]	valid_0's multi_logloss: 0.0188834
[800]	valid_0's multi_logloss: 0.018073
Early stopping, best iteration is:
[821]	valid_0's multi_logloss: 0.0180097
Fold n°2 on LGBM. AUC : 1.000 | Accuracy : 99.3%

Training until validation scores don't improve for 100 rounds
[200]	valid_0's multi_logloss: 0.0456881
[400]	valid_0's multi_logloss: 0.0297501
[600]	valid_0's multi_logloss: 0.0278121
Early stopping, best iteration is:
[607]	valid_0's multi_logloss: 0.0277294
Fold n°3 on LGBM. AUC : 1.000 | Accuracy : 99.1%

Training until validation s... [output truncated]

balanced_accuracy_score    0.992903
roc_auc_score              0.999947
dtype: float64

Both AUC and accuracy are high.

# ExtraTrees on trusted data after BayesianGaussianMixture

In [7]:
et_predict_proba = 0 ; classif_scores = []

gkf = StratifiedKFold(N_FOLDS, shuffle=True, random_state = SEED + 1)

for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y)):   

    X_trn, y_trn = X.iloc[trn_idx], y.iloc[trn_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
    
    model = ExtraTreesClassifier(n_estimators=100, random_state=SEED)
    model.fit(X_trn, y_trn)
    
    y_pred = model.predict(X_val)
    y_pred_proba = model.predict_proba(X_val)
    
    s = (balanced_accuracy_score(y_val, y_pred),
        roc_auc_score(y_val, y_pred_proba, average="weighted", multi_class="ovo"))
    print(f"Fold n°{fold+1} on Extratree. AUC : {s[1]:.3f} | Accuracy : {s[0]:.1%}")
    classif_scores.append(s)

    et_predict_proba += model.predict_proba(df_scaled[usefull_cols]) / N_FOLDS

all_scores.append(scores(np.argmax(et_predict_proba, axis=1), lib="Extratree after BayesianGaussianMixture"))

pd.DataFrame(classif_scores, columns = ["balanced_accuracy_score", "roc_auc_score"]).mean(0)

Fold n°1 on Extratree. AUC : 0.999 | Accuracy : 96.3%
Fold n°2 on Extratree. AUC : 0.998 | Accuracy : 95.3%
Fold n°3 on Extratree. AUC : 0.999 | Accuracy : 95.8%
Fold n°4 on Extratree. AUC : 0.998 | Accuracy : 95.7%
Fold n°5 on Extratree. AUC : 0.999 | Accuracy : 96.3%
Fold n°6 on Extratree. AUC : 0.999 | Accuracy : 96.0%
Fold n°7 on Extratree. AUC : 0.999 | Accuracy : 95.6%
Fold n°8 on Extratree. AUC : 0.999 | Accuracy : 95.8%
Fold n°9 on Extratree. AUC : 0.999 | Accuracy : 95.6%
Fold n°10 on Extratree. AUC : 0.999 | Accuracy : 96.2%
Extratree after BayesianGaussianMixture : Silhouette : 6.0% | Calinski Harabasz : 8646.9 | Davis Bouldin : 2.497


balanced_accuracy_score    0.958731
roc_auc_score              0.998615
dtype: float64

The accuracy is not as high as with LGBM.
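
Every classifier in this notebook follows the same pattern: fit one model per fold on the trusted subset, then average `predict_proba` over the full data. A minimal sketch of that pattern as a reusable function is shown below. The helper `fold_averaged_proba` and the toy data are assumptions for illustration, not code from the notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold

def fold_averaged_proba(estimator, X_train, y_train, X_all, n_folds=5, seed=0):
    """Fit one clone per fold and average predict_proba over X_all (illustrative helper)."""
    skf = StratifiedKFold(n_folds, shuffle=True, random_state=seed)
    proba = 0
    for trn_idx, _ in skf.split(X_train, y_train):
        model = clone(estimator).fit(X_train[trn_idx], y_train[trn_idx])
        proba = proba + model.predict_proba(X_all) / n_folds
    return proba

# Toy data standing in for the trusted subset (X_train) and the full data (X_all)
X_all, y_all = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)
X_train, y_train = X_all[:200], y_all[:200]

avg = fold_averaged_proba(ExtraTreesClassifier(n_estimators=20, random_state=0),
                          X_train, y_train, X_all)
print(avg.shape, avg.sum(axis=1)[:3])
```

Because each fold contributes `1 / n_folds` of a valid probability vector, the averaged rows still sum to 1 and can be passed directly to `np.argmax` or to a soft vote.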

# QuadraticDiscriminantAnalysis on trusted data after BayesianGaussianMixture
First, I try to get the best silhouette, Calinski-Harabasz and Davies-Bouldin scores. But AUC and accuracy are much lower, and it scores only 0.39 on the public LB.<br>
reg_param = 1: strong regularization of the per-class covariances.
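
As I understand the scikit-learn documentation, `reg_param` blends each class covariance estimate `S` with the identity matrix as `(1 - reg_param) * S + reg_param * I`. The sketch below only illustrates that formula with NumPy; `shrink_covariance` is a hypothetical helper, not a scikit-learn function, and the matrix is made up.

```python
import numpy as np

def shrink_covariance(cov, reg_param):
    """Blend a covariance matrix with the identity (illustrative helper)."""
    n_features = cov.shape[0]
    return (1 - reg_param) * cov + reg_param * np.eye(n_features)

# Made-up 2x2 covariance of one class
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

for reg_param in (0.0, 0.5, 1.0):
    print(reg_param, shrink_covariance(cov, reg_param).round(2).tolist())
```

With `reg_param = 1` every class ends up with the identity covariance, so the classifier can only separate the clusters by their means and priors, which is consistent with the lower accuracy reported for that setting.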

In [8]:
qda_predict_proba = 0 ; classif_scores = []

gkf = StratifiedKFold(N_FOLDS, shuffle=True, random_state = SEED + 2)

for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y)):   

    X_trn, y_trn = X.iloc[trn_idx], y.iloc[trn_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    model = QuadraticDiscriminantAnalysis(reg_param=1)
    model.fit(X_trn, y_trn) # on trusted data only
    
    y_pred = model.predict(X_val)
    y_pred_proba = model.predict_proba(X_val)
    
    s = (balanced_accuracy_score(y_val, y_pred),
        roc_auc_score(y_val, y_pred_proba, average="weighted", multi_class="ovo"))
    print(f"Fold n°{fold+1} on QDA. AUC : {s[1]:.3f} | Accuracy : {s[0]:.1%}")
    classif_scores.append(s)

    qda_predict_proba += model.predict_proba(df_scaled[usefull_cols]) / N_FOLDS

all_scores.append(scores(np.argmax(qda_predict_proba, axis=1), lib="QuadraticDiscriminantAnalysis after BayesianGaussianMixture"))
pd.DataFrame(classif_scores, columns = ["balanced_accuracy_score", "roc_auc_score"]).mean(0)

Fold n°1 on QDA. AUC : 0.969 | Accuracy : 78.8%
Fold n°2 on QDA. AUC : 0.968 | Accuracy : 77.2%
Fold n°3 on QDA. AUC : 0.967 | Accuracy : 77.5%
Fold n°4 on QDA. AUC : 0.969 | Accuracy : 78.7%
Fold n°5 on QDA. AUC : 0.969 | Accuracy : 78.4%
Fold n°6 on QDA. AUC : 0.966 | Accuracy : 77.4%
Fold n°7 on QDA. AUC : 0.968 | Accuracy : 77.5%
Fold n°8 on QDA. AUC : 0.967 | Accuracy : 78.3%
Fold n°9 on QDA. AUC : 0.969 | Accuracy : 78.1%
Fold n°10 on QDA. AUC : 0.969 | Accuracy : 77.9%
QuadraticDiscriminantAnalysis after BayesianGaussianMixture : Silhouette : 8.1% | Calinski Harabasz : 9938.5 | Davis Bouldin : 2.214


balanced_accuracy_score    0.779673
roc_auc_score              0.967990
dtype: float64

Next, I try to get the best AUC and accuracy scores. The score on the public LB is higher than 0.615.<br>
reg_param = 0: no regularization of the per-class covariances.

In [9]:
qda_predict_proba = 0 ; classif_scores = []

gkf = StratifiedKFold(N_FOLDS, shuffle=True, random_state = SEED + 2)

for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y)):   

    X_trn, y_trn = X.iloc[trn_idx], y.iloc[trn_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    model = QuadraticDiscriminantAnalysis(reg_param=0)
    model.fit(X_trn, y_trn) # on trusted data only
    
    y_pred = model.predict(X_val)
    y_pred_proba = model.predict_proba(X_val)
    
    s = (balanced_accuracy_score(y_val, y_pred),
        roc_auc_score(y_val, y_pred_proba, average="weighted", multi_class="ovo"))
    print(f"Fold n°{fold+1} on QDA. AUC : {s[1]:.3f} | Accuracy : {s[0]:.1%}")
    classif_scores.append(s)

    qda_predict_proba += model.predict_proba(df_scaled[usefull_cols]) / N_FOLDS

all_scores.append(scores(np.argmax(qda_predict_proba, axis=1), lib="QuadraticDiscriminantAnalysis after BayesianGaussianMixture"))
pd.DataFrame(classif_scores, columns = ["balanced_accuracy_score", "roc_auc_score"]).mean(0)

Fold n°1 on QDA. AUC : 1.000 | Accuracy : 99.9%
Fold n°2 on QDA. AUC : 1.000 | Accuracy : 99.9%
Fold n°3 on QDA. AUC : 1.000 | Accuracy : 100.0%
Fold n°4 on QDA. AUC : 1.000 | Accuracy : 100.0%
Fold n°5 on QDA. AUC : 1.000 | Accuracy : 99.9%
Fold n°6 on QDA. AUC : 1.000 | Accuracy : 100.0%
Fold n°7 on QDA. AUC : 1.000 | Accuracy : 100.0%
Fold n°8 on QDA. AUC : 1.000 | Accuracy : 99.9%
Fold n°9 on QDA. AUC : 1.000 | Accuracy : 100.0%
Fold n°10 on QDA. AUC : 1.000 | Accuracy : 99.9%
QuadraticDiscriminantAnalysis after BayesianGaussianMixture : Silhouette : 5.4% | Calinski Harabasz : 8324.5 | Davis Bouldin : 2.569


balanced_accuracy_score    0.999432
roc_auc_score              0.999999
dtype: float64

Better AUC and accuracy scores than with LGBM. Interesting...
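
One plausible explanation is that unregularized QDA fits one full-covariance Gaussian per class, which is the same model family as a mixture with `covariance_type='full'`, so it can reproduce the mixture's own assignments almost exactly. A minimal sketch on synthetic blobs follows; the toy data and names are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Toy data: three Gaussian blobs with different full covariances (illustrative only)
X_toy = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=300),
    rng.multivariate_normal([6, 0], [[1.0, -0.5], [-0.5, 2.0]], size=300),
    rng.multivariate_normal([0, 6], [[2.0, 0.0], [0.0, 0.5]], size=300),
])

# Step 1: cluster with a full-covariance mixture, as in the notebook
bgm = BayesianGaussianMixture(n_components=3, covariance_type="full", random_state=0)
cluster_labels = bgm.fit_predict(X_toy)

# Step 2: fit unregularized QDA on the cluster labels
qda = QuadraticDiscriminantAnalysis(reg_param=0.0)
qda.fit(X_toy, cluster_labels)

agreement = (qda.predict(X_toy) == cluster_labels).mean()
print(f"QDA reproduces {agreement:.1%} of the mixture's assignments")
```

A high agreement here only says that QDA mimics the mixture well; it does not by itself say the mixture's clusters are the right ones.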

# Gaussian Naïve Bayes after BayesianGaussianMixture

In [10]:
GNB_predict_proba = 0 ; classif_scores = []

gkf = StratifiedKFold(N_FOLDS, shuffle=True, random_state = SEED + 2)

for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y)):   

    X_trn, y_trn = X.iloc[trn_idx], y.iloc[trn_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    model = GaussianNB(var_smoothing=.1)
    model.fit(X_trn, y_trn) # on trusted data only
    
    y_pred = model.predict(X_val)
    y_pred_proba = model.predict_proba(X_val)
    
    s = (balanced_accuracy_score(y_val, y_pred),
        roc_auc_score(y_val, y_pred_proba, average="weighted", multi_class="ovo"))
    print(f"Fold n°{fold+1} on GaussianNB. AUC : {s[1]:.3f} | Accuracy : {s[0]:.1%}")
    classif_scores.append(s)

    GNB_predict_proba += model.predict_proba(df_scaled[usefull_cols]) / N_FOLDS

all_scores.append(scores(np.argmax(GNB_predict_proba, axis=1), lib="GaussianNaïveBayes after BayesianGaussianMixture"))
pd.DataFrame(classif_scores, columns = ["balanced_accuracy_score", "roc_auc_score"]).mean(0)

Fold n°1 on GaussianNB. AUC : 0.992 | Accuracy : 90.0%
Fold n°2 on GaussianNB. AUC : 0.992 | Accuracy : 88.9%
Fold n°3 on GaussianNB. AUC : 0.991 | Accuracy : 89.0%
Fold n°4 on GaussianNB. AUC : 0.993 | Accuracy : 89.9%
Fold n°5 on GaussianNB. AUC : 0.992 | Accuracy : 89.5%
Fold n°6 on GaussianNB. AUC : 0.992 | Accuracy : 90.0%
Fold n°7 on GaussianNB. AUC : 0.992 | Accuracy : 89.9%
Fold n°8 on GaussianNB. AUC : 0.991 | Accuracy : 89.6%
Fold n°9 on GaussianNB. AUC : 0.992 | Accuracy : 89.3%
Fold n°10 on GaussianNB. AUC : 0.992 | Accuracy : 89.4%
GaussianNaïveBayes after BayesianGaussianMixture : Silhouette : 7.2% | Calinski Harabasz : 9515.1 | Davis Bouldin : 2.317


balanced_accuracy_score    0.895482
roc_auc_score              0.991994
dtype: float64

# Linear Discriminant Analysis

In [11]:
lda_predict_proba = 0 ; classif_scores = []

gkf = StratifiedKFold(N_FOLDS, shuffle=True, random_state = SEED + 2)

for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y)):   

    X_trn, y_trn = X.iloc[trn_idx], y.iloc[trn_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    model = LinearDiscriminantAnalysis()
    model.fit(X_trn, y_trn) # on trusted data only
    
    y_pred = model.predict(X_val)
    y_pred_proba = model.predict_proba(X_val)
    
    s = (balanced_accuracy_score(y_val, y_pred),
        roc_auc_score(y_val, y_pred_proba, average="weighted", multi_class="ovo"))
    print(f"Fold n°{fold+1} on LDA. AUC : {s[1]:.3f} | Accuracy : {s[0]:.1%}")
    classif_scores.append(s)

    lda_predict_proba += model.predict_proba(df_scaled[usefull_cols]) / N_FOLDS

all_scores.append(scores(np.argmax(lda_predict_proba, axis=1), lib="LinearDiscriminantAnalysis after BayesianGaussianMixture"))
pd.DataFrame(classif_scores, columns = ["balanced_accuracy_score", "roc_auc_score"]).mean(0)

Fold n°1 on LDA. AUC : 0.976 | Accuracy : 81.3%
Fold n°2 on LDA. AUC : 0.975 | Accuracy : 79.9%
Fold n°3 on LDA. AUC : 0.975 | Accuracy : 80.3%
Fold n°4 on LDA. AUC : 0.975 | Accuracy : 79.8%
Fold n°5 on LDA. AUC : 0.976 | Accuracy : 80.3%
Fold n°6 on LDA. AUC : 0.974 | Accuracy : 79.7%
Fold n°7 on LDA. AUC : 0.976 | Accuracy : 80.7%
Fold n°8 on LDA. AUC : 0.976 | Accuracy : 81.4%
Fold n°9 on LDA. AUC : 0.975 | Accuracy : 80.2%
Fold n°10 on LDA. AUC : 0.975 | Accuracy : 80.2%
LinearDiscriminantAnalysis after BayesianGaussianMixture : Silhouette : 8.4% | Calinski Harabasz : 10065.7 | Davis Bouldin : 2.192


balanced_accuracy_score    0.803733
roc_auc_score              0.975249
dtype: float64

Lowest AUC and accuracy scores among the models used in the soft voting (only QDA with reg_param = 1 scored lower).
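
The three Gaussian classifiers differ mainly in the covariance they allow: LDA shares one covariance across all classes, GaussianNB fits a diagonal covariance per class, and QDA fits a full covariance per class. The sketch below shows, on made-up data where two classes share a centre but not a spread, why a shared covariance can be limiting. It illustrates the assumption and is not a claim about the competition data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Toy data: two classes with the same mean but very different spreads (illustrative only)
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(1000, 2)),   # class 0: tight cloud
    rng.normal(0.0, 3.0, size=(1000, 2)),   # class 1: wide cloud around the same centre
])
y_toy = np.repeat([0, 1], 1000)

models = {
    "LDA": LinearDiscriminantAnalysis(),                      # one shared covariance
    "GaussianNB": GaussianNB(),                               # per-class diagonal covariance
    "QDA": QuadraticDiscriminantAnalysis(reg_param=0.0),      # per-class full covariance
}

accuracy = {}
for name, model in models.items():
    accuracy[name] = model.fit(X_toy, y_toy).score(X_toy, y_toy)
    print(f"{name}: {accuracy[name]:.1%}")
```

On this toy problem LDA stays close to chance because a single linear boundary cannot separate a tight cloud from a wide one around the same centre, while the per-class models can.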

# Soft voting
https://www.kaggle.com/code/pourchot/simple-soft-voting<br>
Thanks to Laurent Pourchot

In [12]:
def soft_voting(preds_probs):
    """Sum the class probabilities of several models after aligning their cluster labels by cluster size."""

    values = list(range(N_CLUSTERS))
    # Use N_CLUSTERS instead of a hard-coded 7 so the function follows the global setting
    pred_test = pd.DataFrame(np.zeros((df.shape[0], N_CLUSTERS)), columns = values)

    for i, p in enumerate(preds_probs):

        MAX = np.argmax(p, axis=1)
        df[f'pred_{i}'] = MAX

        # Relabel clusters by decreasing size so that label k refers to the same cluster for every model
        pred_keys = df[f'pred_{i}'].value_counts().index.tolist()
        pred_dict = dict(zip(pred_keys, values))
        df[f'pred_{i}'] = df[f'pred_{i}'].map(pred_dict)

        # Reorder the probability columns with the same mapping
        pred_new = pd.DataFrame(p).rename(columns = pred_dict)
        pred_new = pred_new.reindex(sorted(pred_new.columns), axis=1)
        pred_test += pred_new # Soft voting by probability addition

    return np.argmax(np.array(pred_test), axis=1)
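
The function above relies on two steps: reorder each model's probability columns by decreasing cluster size, then add the aligned probabilities. A minimal NumPy sketch of those two steps on made-up probabilities is given below; `align_by_cluster_size`, `proba_a` and `proba_b` are hypothetical names used for illustration.

```python
import numpy as np

def align_by_cluster_size(proba):
    """Reorder columns so column 0 is the largest predicted cluster, column 1 the next, etc."""
    hard = proba.argmax(axis=1)
    counts = np.bincount(hard, minlength=proba.shape[1])
    order = np.argsort(-counts, kind="stable")   # cluster ids sorted by decreasing size
    return proba[:, order]

# Two hypothetical models describing the same 2 clusters with swapped label ids
proba_a = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]])
proba_b = proba_a[:, ::-1]                       # same beliefs, labels 0 and 1 swapped

summed = align_by_cluster_size(proba_a) + align_by_cluster_size(proba_b)
vote = summed.argmax(axis=1)
print(vote)
```

One caveat of this design: the alignment assumes that cluster sizes rank in the same order for every model. If two clusters have nearly equal sizes, their ranks can swap between models and the summed probabilities would then mix different clusters.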

In [13]:
sv1_predict = soft_voting([et_predict_proba, lgbm_predict_proba, qda_predict_proba, lda_predict_proba, GNB_predict_proba])
all_scores.append(scores(sv1_predict, lib="Soft voting n°1 : all"))

sv2_predict = soft_voting([et_predict_proba, lgbm_predict_proba, qda_predict_proba])
all_scores.append(scores(sv2_predict, lib="Soft voting n°2 : LGBM, extratree and QDA"))

sv3_predict = soft_voting([lgbm_predict_proba, qda_predict_proba])
all_scores.append(scores(sv3_predict, lib="Soft voting n°3 : LGBM and QDA"))

Soft voting n°1 : all : Silhouette : 5.9% | Calinski Harabasz : 8634.0 | Davis Bouldin : 2.504
Soft voting n°2 : LGBM, extratree and QDA : Silhouette : 5.5% | Calinski Harabasz : 8392.7 | Davis Bouldin : 2.571
Soft voting n°3 : LGBM and QDA : Silhouette : 5.4% | Calinski Harabasz : 8360.3 | Davis Bouldin : 2.581


In [14]:
pd.DataFrame(all_scores, columns=["Model", "silhouette", "Calinski_Harabasz", "Davis_Bouldin"])

| | Model | silhouette | Calinski_Harabasz | Davis_Bouldin |
|---|---|---|---|---|
| 0 | BayesianGaussianMixture after powertransformer | 0.050471 | 8157.65689 | 2.656782 |
| 1 | LGBM after BayesianGaussianMixture - threshold... | 0.053719 | 8343.593949 | 2.59015 |
| 2 | Extratree after BayesianGaussianMixture | 0.059842 | 8646.900312 | 2.496678 |
| 3 | QuadraticDiscriminantAnalysis after BayesianGa... | 0.081142 | 9938.495708 | 2.213741 |
| 4 | QuadraticDiscriminantAnalysis after BayesianGa... | 0.053897 | 8324.527293 | 2.56928 |
| 5 | GaussianNaïveBayes after BayesianGaussianMixture | 0.071833 | 9515.110303 | 2.317072 |
| 6 | LinearDiscriminantAnalysis after BayesianGauss... | 0.083624 | 10065.664248 | 2.191927 |
| 7 | Soft voting n°1 : all | 0.058854 | 8633.950275 | 2.50402 |
| 8 | Soft voting n°2 : LGBM, extratree and QDA | 0.054696 | 8392.672557 | 2.570905 |
| 9 | Soft voting n°3 : LGBM and QDA | 0.054111 | 8360.295234 | 2.581185 |


# Submissions

In [15]:
sub = pd.read_csv("../input/tabular-playground-series-jul-2022/sample_submission.csv")

sub['Predicted'] = np.argmax(lgbm_predict_proba, axis = 1)
sub.to_csv("submission_lgbm.csv",index = False)

sub['Predicted'] = np.argmax(et_predict_proba, axis = 1)
sub.to_csv("submission_extratree.csv", index = False)
 
sub['Predicted'] = np.argmax(qda_predict_proba, axis = 1)
sub.to_csv("submission_qda.csv", index = False)
 
sub['Predicted'] = np.argmax(lda_predict_proba, axis = 1)
sub.to_csv("submission_lda.csv", index = False)
 
sub['Predicted'] = np.argmax(GNB_predict_proba, axis = 1)
sub.to_csv("submission_GNB.csv", index = False)
 
sub['Predicted'] = sv1_predict
sub.to_csv("submission_softvote1.csv", index = False)

sub['Predicted'] = sv2_predict
sub.to_csv("submission_softvote2.csv", index = False)

sub['Predicted'] = sv3_predict
sub.to_csv("submission_softvote3.csv", index = False)
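
The repeated assignments above could also be driven by a dictionary that maps a file suffix to its predicted labels. A minimal sketch with made-up stand-ins follows; `sub_toy`, `predictions` and the temporary output directory are illustrative assumptions, not the notebook's actual objects.

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the sample submission and the prediction arrays
sub_toy = pd.DataFrame({"Id": range(4), "Predicted": 0})
predictions = {
    "model_a": np.array([0, 1, 2, 1]),
    "model_b": np.array([0, 1, 1, 1]),
}

out_dir = tempfile.mkdtemp()   # write to a temporary folder for this example
written = []
for name, labels in predictions.items():
    path = os.path.join(out_dir, f"submission_{name}.csv")
    # assign() returns a copy, so the template frame is never modified
    sub_toy.assign(Predicted=labels).to_csv(path, index=False)
    written.append(path)

print(written)
```

Using `assign` keeps the template untouched, which avoids accidentally writing one model's labels under another model's file name.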