**The rationale of this notebook is simple: to demonstrate the power of ensembles. You will see that blending thousands of *default* models can still give excellent results. Is this practical? That depends on how many cores you have. It is more of an interesting phenomenon than anything else. If you know of any academic research on why 'Extreme Ensembling' works so well, please link it below.**
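
One common intuition for why blending helps is variance reduction: averaging many noisy estimates of the same quantity gives a steadier answer. The sketch below is a toy simulation, not the notebook's pipeline. It assumes every model's error is independent Gaussian noise, and `true_p` and `noise_sd` are hypothetical values picked for illustration. Real models trained on the same data have correlated errors, so the gain from adding more of them levels off sooner than this idealised case suggests.

```python
import random
import statistics

def averaged_prediction(true_p, noise_sd, n_models, rng):
    """Average n_models noisy estimates of the same probability."""
    estimates = [true_p + rng.gauss(0.0, noise_sd) for _ in range(n_models)]
    return sum(estimates) / n_models

rng = random.Random(0)
true_p, noise_sd = 0.7, 0.1  # hypothetical values, chosen for illustration

spread = {}
for n_models in (1, 10, 100):
    # Repeat the experiment many times and measure how much the blend wobbles.
    blends = [averaged_prediction(true_p, noise_sd, n_models, rng)
              for _ in range(2000)]
    spread[n_models] = statistics.pstdev(blends)
    print(f"{n_models:>4} models -> std of blended prediction: {spread[n_models]:.4f}")
```

Under the independence assumption the spread shrinks roughly with the square root of the number of models, which is why the first few models matter most.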

# **The good stuff is always imported :)**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')
sns.set(rc={'figure.figsize':(15, 10)})

from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, RobustScaler

RS = 69420
DATA_PATH = "../input/tabular-playground-series-mar-2021/train.csv"

# **Load the data**

In [None]:
train = pd.read_csv(DATA_PATH, index_col=0)

cat_features = [c for c in train.columns if 'cat' in c]
le = LabelEncoder()
for col in cat_features:
    train[col] = le.fit_transform(train[col])

X = train.iloc[:, :-1].values
y = train.iloc[:, -1].values

In [None]:
# train_test_split is already imported above; hold out a stratified 20% for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RS, shuffle=True, stratify=y
)

# **Robust Scaling, because why not**

In [None]:
sc = RobustScaler(with_centering=False)

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
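
For reference, here is a rough sketch of what robust scaling without centering does to a single feature: it divides every value by the interquartile range (IQR), so one extreme value does not set the scale. This is a plain-Python illustration, not scikit-learn's implementation, and `robust_scale_no_centering` is a hypothetical helper name. Tree-based models are largely insensitive to this kind of monotonic rescaling, which fits the "why not" spirit of the step.

```python
import statistics

def robust_scale_no_centering(values):
    """Divide each value by the interquartile range (75th minus 25th percentile)."""
    # method='inclusive' interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v / iqr for v in values]

feature = [1, 2, 3, 4, 100]  # toy column with one outlier
scaled = robust_scale_no_centering(feature)
print(scaled)
```

The outlier stays large after scaling, but it no longer influences the scale applied to the other values.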

# ***Extreme Ensemble***

In [None]:
estimators = []
for i in range(1500):
    # Each model keeps its library defaults; only the random seed differs.
    model1 = LGBMClassifier(device='gpu',
                            verbose=0,
                            random_state=np.random.randint(0, 100000))

    # Note: on XGBoost >= 2.0, device='cuda' with tree_method='hist' is the
    # replacement for the two GPU settings below.
    model2 = XGBClassifier(objective='binary:logistic',
                           predictor='gpu_predictor',
                           tree_method='gpu_hist',
                           verbosity=0,  # XGBoost's logging parameter is `verbosity`
                           random_state=np.random.randint(0, 100000))

    model3 = CatBoostClassifier(task_type="GPU",
                                devices='0:1',
                                verbose=0,  # 0 silences per-iteration logging
                                random_seed=np.random.randint(0, 100000))

    estimators.append((f"lgbm_model{i}", model1))
    estimators.append((f"xgb_model{i}", model2))
    estimators.append((f"cat_model{i}", model3))

**With the range set to only 3 (nine models in total), the blend scores 0.88557 on unseen data, enough for the top 50%, using default models! Imagine what tuning can do :) (Hint: that's what I did.)**
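
The seeds in the ensemble loop come from `np.random.randint` with no global seed set, so a rerun builds a different ensemble and the score can shift slightly. If you want a repeatable blend, one option is to derive every model seed from a single master seed. The sketch below is an assumption about how that could look, not what the notebook does; `make_model_seeds` is a hypothetical helper.

```python
import random

RS = 69420  # same master seed value the notebook defines

def make_model_seeds(n_rounds, models_per_round=3, master_seed=RS):
    """Return one list of seeds per round, all derived from a single master seed."""
    rng = random.Random(master_seed)
    return [[rng.randrange(100000) for _ in range(models_per_round)]
            for _ in range(n_rounds)]

seeds = make_model_seeds(3)
for i, (lgbm_seed, xgb_seed, cat_seed) in enumerate(seeds):
    print(f"round {i}: lgbm={lgbm_seed} xgb={xgb_seed} cat={cat_seed}")
```

Each triple could then be passed to the three classifiers in place of the `np.random.randint` calls.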

# **Build, Train, Test Voting Classifier**

In [None]:
clf = VotingClassifier(estimators=estimators,
                       verbose=1,
                       voting='soft')
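
With `voting='soft'`, the classifier averages the class probabilities predicted by each model and picks the class with the highest average, instead of counting one vote per model. Here is a small plain-Python sketch of that rule for a single row. The probabilities are made up for illustration, and `soft_vote` and `hard_vote` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

def soft_vote(prob_lists):
    """Average per-model class probabilities, then pick the most likely class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return avg, max(range(n_classes), key=avg.__getitem__)

def hard_vote(prob_lists):
    """Each model votes for its own most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_lists]
    return Counter(votes).most_common(1)[0][0]

# Made-up [P(class 0), P(class 1)] outputs from three models for one row.
probs = [[0.45, 0.55], [0.48, 0.52], [0.90, 0.10]]

avg, soft_label = soft_vote(probs)
hard_label = hard_vote(probs)
print(f"averaged probabilities: {avg}")
print(f"soft vote -> class {soft_label}, hard vote -> class {hard_label}")
```

Two lukewarm votes for class 1 lose to one confident vote for class 0 under soft voting, while hard voting would pick class 1. Soft voting also gives the averaged probability that `predict_proba` returns for the submission.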

In [None]:
%%time
clf.fit(X_train, y_train)

In [None]:
# # Predict
# y_pred = clf.predict(X_test)

# # Compute Metrics
# print(f"Testing Precision: {precision_score(y_test, y_pred, average='weighted')}")

# **Predict Unseen Data**

In [None]:
test = pd.read_csv("../input/tabular-playground-series-mar-2021/test.csv", index_col=0)

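# NOTE: refitting the encoder here only reproduces the training codes if each
# column contains exactly the same set of categories in train and test.
for col in cat_features:
    test[col] = le.fit_transform(test[col])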

submission = pd.DataFrame(index=test.index)

test = sc.transform(test.values)

submission['target'] = clf.predict_proba(test)[:, 1]

submission.to_csv("submission.csv")
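
One thing to watch in the cell above: the label encoder is fitted again on the test columns. `LabelEncoder` assigns codes in sorted order of the categories it sees, so the codes only line up with training when train and test contain the same categories. A safer pattern is to learn the mapping once on the training column and reuse it. The sketch below shows the idea in plain Python. The helper names and the `-1` code for unseen categories are assumptions for illustration (`LabelEncoder.transform` itself raises an error on unseen labels).

```python
def fit_category_codes(values):
    """Map each distinct category to an integer, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def apply_category_codes(values, codes, unknown=-1):
    """Encode values with an existing mapping; unseen categories get `unknown`."""
    return [codes.get(v, unknown) for v in values]

train_col = ["A", "B", "C", "B"]
test_col = ["B", "C", "D"]  # "A" is missing and "D" was never seen in training

codes = fit_category_codes(train_col)

# Reusing the training mapping keeps "B" -> 1 and "C" -> 2.
reused = apply_category_codes(test_col, codes)

# Refitting on the test column silently shifts the codes.
refit = apply_category_codes(test_col, fit_category_codes(test_col))

print("reused training mapping:", reused)
print("refit on test column:   ", refit)
```

In the refit version "B" becomes 0 even though the models learned that 0 means "A", so the features no longer mean what they meant at training time.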

In [None]:
print("Ensemble is served!")
