XGBoost baseline
==

This notebook is an example of using the output of a previous notebook as input. It reads the output of [this notebook](https://www.kaggle.com/kaaveland/tps202112-parquet), which converts the competition dataset to parquet format. The parquet files load much faster and carry appropriate dtypes for the columns, instead of everything arriving as `int64`.

Check this out:

In [None]:
import random
import pandas as pd
import plotly.express as px
import xgboost
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

random.seed(64)
np.random.seed(64)

In [None]:
%time df_test = pd.read_parquet('../input/tps202112-parquet/test.pq')
%time df = pd.read_parquet('../input/tps202112-parquet/train.pq')

And of course, the `DataFrame` has the correct dtypes in it:

In [None]:
df.info()
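
To make the dtype point concrete, here is a minimal sketch of how much smaller a frame gets once its columns are downcast from `int64`. The data is synthetic and the column names are only stand-ins, and this is not necessarily how the linked notebook does the conversion. Writing the compact frame with `DataFrame.to_parquet` should then keep these dtypes on reload.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a freshly parsed CSV: small integers stored as int64.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    'Elevation': rng.integers(1800, 4000, size=1000),
    'Soil_Type1': rng.integers(0, 2, size=1000),
    'Cover_Type': rng.integers(1, 8, size=1000),
})

# Downcast every column to the smallest integer type that holds its values.
compact = pd.DataFrame({
    col: pd.to_numeric(raw[col], downcast='integer') for col in raw.columns
})

print(compact.dtypes)
print('int64 bytes:  ', raw.memory_usage(deep=True).sum())
print('compact bytes:', compact.memory_usage(deep=True).sum())
```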

Bonus: Training a booster
==

Just to submit something from this notebook:

In [None]:
label_encoder = LabelEncoder()
X_train, y_train = df.drop(columns=['Id', 'Cover_Type']), label_encoder.fit_transform(df.Cover_Type)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=.2, shuffle=True)

X_test = df_test.drop(columns=['Id'])
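
The `LabelEncoder` is there because XGBoost's multi-class objectives expect labels in the range `0` to `num_class - 1`. A small sketch of the round trip, using made-up labels rather than the competition data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels with a gap, to show that the encoding is dense.
labels = np.array([1, 2, 3, 7, 2, 1])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)  # sorted unique original labels
print(encoded)           # positions of each label within classes_

# inverse_transform maps the dense codes back to the original labels.
restored = encoder.inverse_transform(encoded)
```

The same `inverse_transform` call is what turns predictions back into `Cover_Type` values before submitting.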

In [None]:
sane_defaults = {
    'objective': 'multi:softmax',
    'num_class': len(label_encoder.classes_),
    'tree_method': 'gpu_hist',
    'sampling_method': 'gradient_based',
    'subsample': .25,
    'max_depth': 4,
    'learning_rate': .10,
    'colsample_bytree': .5,
    'eval_metric': ['mlogloss', 'merror'],
    'predictor': 'gpu_predictor'
}

booster = xgboost.train(
    params=sane_defaults,
    dtrain=xgboost.DMatrix(X_train, label=y_train),
    num_boost_round=3000,
    early_stopping_rounds=50,
    evals=[(xgboost.DMatrix(X_val, label=y_val), 'val')],
    verbose_eval=100
)

Feature importance
==

In [None]:
# get_fscore() counts how many times each feature is used in a split
importance = booster.get_fscore()
px.bar(
    x=list(importance.keys()),
    y=list(importance.values()),
    title='Feature importance'
)

Let's submit:

In [None]:
sub = df_test[['Id']].assign(
    Cover_Type=label_encoder.inverse_transform(booster.predict(xgboost.DMatrix(X_test)).astype(np.int8))
)

sub.to_csv('submission.csv', index=False)
!head -n 5 submission.csv
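
The cast to `np.int8` is needed because, with `multi:softmax`, `predict` returns the winning class index as a float array. Conceptually that is the argmax over the per-class probabilities. Here is a sketch in plain NumPy with made-up scores, assuming the original labels were 1 through 7:

```python
import numpy as np

# Assumed original labels, in the order a LabelEncoder would store them.
classes = np.array([1, 2, 3, 4, 5, 6, 7])

# Made-up raw scores for two rows and seven classes.
margins = np.array([
    [0.1, 2.0, -1.0, 0.0, 0.0, 0.3, -0.5],
    [1.5, 0.2, 0.1, 0.0, -2.0, 0.0, 3.0],
])

# Numerically stable softmax over each row.
shifted = margins - margins.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Mimic the float class indices that a multi:softmax prediction yields.
softmax_output = probs.argmax(axis=1).astype(np.float32)

# Cast to an integer type before mapping indices back to the labels.
labels = classes[softmax_output.astype(np.int8)]
print(labels)
```

Switching the objective to `multi:softprob` would give the full probability matrix instead, which is handy for blending models.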