# XGBoost Count and Likelihood Models

- The count model is used across all three approaches (XGBoost, ViT and ResNet) to select an optimal number of likely species.

- The likelihood model trains and runs faster than the other two approaches and scores __0.32217__ on the private test data.

This model uses preprocessed Presence Absence Metadata, Bioclimatic Rasters and Time Series Landsat data to make its predictions.

If the data has not been preprocessed yet, uncomment and run the second cell.

Imports

In [1]:
import pandas as pd
import pickle as pk
from pipeline import GeoLifeXGB
from pipeline import GeoLifePostprocessor

Uncomment to preprocess data if the script has not already been run.

In [2]:
# ! python preprocess_data.py

Set a training seed

In [3]:
SEED = 3

Initialise the XGBoost trainer class and train the count prediction model

In [4]:
xgb = GeoLifeXGB(SEED)
count_model = xgb.generate_count_model()

Save the count predictions for the test dataset

In [5]:
# Write one pickle per idx value; the with-block closes each file after the dump
for idx in [0, 2, 2.5, 3, 3.5, 4, 4.5, 5]:
    with open(f"counts_plus_{idx}_.pkl", "wb") as f:
        pk.dump(xgb.predict_test_counts(idx), f)

Generate the main XGBoost species occurrence likelihood model

In [6]:
_ = xgb.generate_main_model()

Generate occurrence predictions from the test data and save the raw output for use in Ensemble predictions

In [7]:
xgbout = xgb.predict_test_scores()
with open(f"xgbout_seed_{SEED}.pkl", "wb") as f:
    pk.dump(xgbout, f)
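
The raw scores are saved so they can later be combined with the outputs of other models. How the ensemble step combines them is not shown in this notebook; the sketch below assumes a simple weighted average, which would be consistent with the postprocessor taking a list of outputs and a list of weights. The function `blend_scores` and the sample scores are hypothetical, and plain dictionaries stand in for the score tables.

```python
def blend_scores(model_outputs, weights):
    """Weighted average of per-species scores from several models.

    model_outputs: list of dicts mapping species ID -> score (same keys in each).
    weights: one non-negative weight per model.
    Hypothetical helper, not part of the pipeline module.
    """
    if len(model_outputs) != len(weights):
        raise ValueError("need exactly one weight per model output")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    species_ids = model_outputs[0].keys()
    return {
        sid: sum(w * out[sid] for out, w in zip(model_outputs, weights)) / total
        for sid in species_ids
    }


# Made-up scores for two species from two models
xgb_scores = {101: 0.8, 205: 0.2}
vit_scores = {101: 0.4, 205: 0.6}

# Weight the first model three times as heavily as the second
blended = blend_scores([xgb_scores, vit_scores], [3, 1])
print(blended)
```

With a single model and a weight of 1, as in the final cell of this notebook, such an average would leave the scores unchanged.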

Generate the final prediction file based on this model and save it as a .csv

In [8]:
with open(f"xgbout_seed_{SEED}.pkl", "rb") as f:
    xgbout = pk.load(f)
xgbout = pd.DataFrame(xgbout, columns=xgb.yy_cols, index=xgb.test.index)

with open(f"counts_seed_{SEED}.pkl", "rb") as f:
    counts = pk.load(f)
postproc = GeoLifePostprocessor([xgbout], [1], counts)
postproc.save(f"xgboost_seed_{SEED}_")

Saved to: submissions/xgboost_seed_3_2024-06-22_22-36.csv
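
The postprocessor combines the likelihood scores with the predicted counts, but its internals are not shown here. As a rough illustration of how a count can select the number of likely species, the sketch below keeps the top-scoring species for one survey. The function `select_top_species`, the `offset` parameter, and the rounding rule are all assumptions made for illustration, not the actual `GeoLifePostprocessor` logic.

```python
import math


def select_top_species(scores, predicted_count, offset=0.0):
    """Keep the highest-scoring species for a single survey.

    scores: dict mapping species ID -> likelihood score.
    predicted_count: count model estimate for this survey (may be fractional).
    offset: hypothetical constant added to the estimate before rounding up.
    Returns the selected species IDs in ascending order.
    """
    # Assumed rule: round up and always keep at least one species
    k = max(1, math.ceil(predicted_count + offset))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


# Made-up scores for one survey
survey_scores = {101: 0.91, 205: 0.15, 317: 0.64, 422: 0.08, 538: 0.47}

print(select_top_species(survey_scores, predicted_count=1.6))
print(select_top_species(survey_scores, predicted_count=1.6, offset=2))
```

Under these assumptions, a larger offset trades precision for recall by admitting more species per survey.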
