# Hello Everyone!

In this notebook, we're going to try a slightly unusual approach to improving the accuracy of our XGBoost model.
We're going to train many models on the same data with the same parameters, but each one will use a different seed.

You can find the single-XGBoost notebook [here](https://www.kaggle.com/okyanusoz/tps-jun-2021-random-forest).

If you are ready, let's begin!

# Overview of our ensemble

We have 101 models that all use the same algorithm (gradient boosting): 100 newly trained models plus the original one. We take the mean of their predicted class probabilities. As mentioned earlier, each newly trained model uses a different seed.

That's basically it!
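
To make the averaging step concrete, here is a minimal sketch with toy numbers. The three "models" and their probabilities are made up for illustration (they are not outputs of the XGBoost models in this notebook); the only point is that the mean over the model axis is still a valid probability distribution for each sample.

```python
import numpy as np

# Toy predict_proba outputs from three hypothetical models:
# 2 samples (rows) x 3 classes (columns), each row sums to 1.
model_probs = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.8, 0.1, 0.1], [0.3, 0.4, 0.3]],
])  # shape: (n_models, n_samples, n_classes)

# Average over the model axis to get one distribution per sample.
mean_probs = model_probs.mean(axis=0)

print(mean_probs.shape)        # (2, 3)
print(mean_probs.sum(axis=1))  # each row still sums to 1 (up to float rounding)
```

The real ensemble does the same thing, just with 101 models and 9 classes.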

# Load data, original XGBoost model, and best params

In [None]:
import pandas as pd

df = pd.read_csv("../input/tabular-playground-series-jun-2021/train.csv")

df.drop("id", inplace=True, axis=1) # "id" is only a row identifier, so we drop it from the train data

# Convert the labels "Class_1" ... "Class_9" to the integers 0 ... 8
df["target"] = df["target"].map(lambda x: int(x.replace("Class_", "")) - 1)

In [None]:
# Split into training and validation
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.1)

In [None]:
import xgboost as xgb

original_model = xgb.XGBClassifier()
original_model.load_model("../input/tps-jun-2021-xgboost-model/model.txt")

In [None]:
original_model

In [None]:
import json

with open("../input/tps-jun-2021-xgboost-model/best-params.json", "r") as f:
    best_hyperparams = json.load(f)

In [None]:
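# Each model gets its own seed later, so remove the saved one
# (otherwise random_state would be passed twice to XGBClassifier)
del best_hyperparams["random_state"]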

# Create the ensemble class

In [None]:
import numpy as np

class Ensemble:
    """Averages the class probabilities predicted by several fitted models."""

    def __init__(self, n_classes=9):
        # sklearn style
        self.models_ = []
        self.n_classes = n_classes

    def append_model(self, model):
        self.models_.append(model)

    def predict_proba(self, x):
        # Most of the code here is from https://github.com/Matuzas77/MNIST-0.17/blob/master/MNIST_final_solution.ipynb
        # Thank you!
        # Stack to shape (n_models, n_samples, n_classes), then average over the models
        probabilities = np.asarray([a.predict_proba(x) for a in self.models_])
        return np.mean(probabilities, axis=0)

# Train 100 models (and append original)

In [None]:
seeds = np.arange(100)

seeds

In [None]:
import copy

def train_model(seed):
    clf_params = copy.deepcopy(best_hyperparams)
    # early_stopping_rounds is passed to fit() below, not to the constructor
    del clf_params["early_stopping_rounds"]

    # https://stackoverflow.com/a/62302697
    xgb_model = xgb.XGBClassifier(
        objective="multi:softprob",
        random_state=seed,
        use_label_encoder=False,
        tree_method="gpu_hist",
        gpu_id=0,
        verbosity=0,
        **clf_params,
    )
    xgb_model.fit(
        train_df.drop("target", axis=1),
        train_df["target"],
        eval_set=[(val_df.drop("target", axis=1), val_df["target"])],
        verbose=False,
        early_stopping_rounds=best_hyperparams["early_stopping_rounds"],
    )
    return xgb_model

ensemble = Ensemble(n_classes=9)

In [None]:
ensemble.append_model(original_model)

In [None]:
from tqdm import tqdm

for seed in tqdm(seeds):
    ensemble.append_model(train_model(seed))
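
Before predicting on the test data, it can be useful to compare the ensemble's log loss with the log loss of the individual models. The sketch below is only an illustration under stated assumptions: `multiclass_log_loss`, `y_val`, and `val_probs` are hypothetical names, and the probabilities are random synthetic data rather than real model outputs. Because the negative log is convex, the log loss of the averaged probabilities can never be worse than the average of the individual log losses, although it is not guaranteed to beat the best single model.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(y_true)), y_true])))

rng = np.random.default_rng(0)
n_models, n_samples, n_classes = 5, 200, 9

# Synthetic stand-ins (assumption): random labels and random probability rows
# in place of each model's predict_proba output on the validation set.
y_val = rng.integers(0, n_classes, size=n_samples)
val_probs = rng.dirichlet(np.ones(n_classes), size=(n_models, n_samples))

# Log loss of each model on its own, and of the averaged probabilities.
single_losses = [multiclass_log_loss(y_val, p) for p in val_probs]
ensemble_loss = multiclass_log_loss(y_val, val_probs.mean(axis=0))

print("mean single-model log loss:", np.mean(single_losses))
print("ensemble log loss:", ensemble_loss)
```

In this notebook, the same comparison could be run with `ensemble.predict_proba` on `val_df`. Keep in mind that `val_df` was also used for early stopping, so a score measured on it is likely to be somewhat optimistic.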

# Predict on test data

In [None]:
test_df = pd.read_csv("../input/tabular-playground-series-jun-2021/test.csv")

In [None]:
test_pred = ensemble.predict_proba(test_df.drop("id", axis=1))

In [None]:
submission_df = pd.DataFrame()

submission_df["id"] = test_df["id"]

for i in range(9):
    probabilities = test_pred[:,i]
    submission_df[f"Class_{i + 1}"] = probabilities

In [None]:
submission_df.to_csv("submission.csv", index=False)