To implement Boosted Trees on a Diet (ToaD) we made some adaptations to the LightGBM framework. We included a new penalizer in the serial tree learner; see the mrf_pointer-enabled functions in src/treelearner/serial_tree_learner.cpp for details. Moreover, we added various helper functionalities, implemented in src/treelearner/memory_restricted_forest.hpp.
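To illustrate the idea behind the penalizer (this is a sketch, not the actual C++ code): the split gain can be reduced whenever a split would introduce a feature or threshold that the forest has not used yet, which biases the learner towards reusing known splits. All names below (`penalized_gain`, `feature_penalty`, `threshold_penalty`) are illustrative:

```python
def penalized_gain(gain, feature, threshold, used_features, used_thresholds,
                   feature_penalty, threshold_penalty):
    """Reduce the raw split gain when a split would introduce a feature or
    threshold the forest has not used before (illustrative sketch)."""
    if feature not in used_features:
        gain -= feature_penalty
    if (feature, threshold) not in used_thresholds:
        gain -= threshold_penalty
    return gain

# A split reusing a known feature/threshold pair keeps its full gain:
used_f = {0, 3}
used_t = {(0, 0.5)}
print(penalized_gain(1.0, 0, 0.5, used_f, used_t, 0.25, 0.25))  # → 1.0
# A split on an entirely new feature pays both penalties:
print(penalized_gain(1.0, 1, 0.7, used_f, used_t, 0.25, 0.25))  # → 0.5
```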
The experiments folder provides the means to fetch the tested datasets and run the Trees on a Diet (ToaD) variant.
The steps are split to allow short runtimes.
The buildToaD.sh script (or buildToaD-windows.sh for Windows) builds LightGBM with the ToaD extension and automatically starts the experiments. (Running .sh scripts on Windows might require additional steps or a specific shell, such as Git Bash.)
Prerequisites to build the project can be found in the LightGBM documentation.
Depending on your system, training and evaluating the different model configurations might take several hours to days! Please modify the script to enable or disable GPU usage for a speedup.
For now, we assume you install the Python packages yourself; a requirements.txt will be added later.
python/get_datasets.py downloads the datasets. The files are stored in python/data with an 80/20 training/testing split.
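An 80/20 split of this kind can be sketched as follows (the actual script may differ in shuffling, seeding, and file naming):

```python
import random

def split_80_20(rows, seed=42):
    """Shuffle rows and return (train, test) with an 80/20 split (sketch)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]

train, test = split_80_20(range(100))
print(len(train), len(test))  # → 80 20
```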
./runExperiments.sh checks the data folder for datasets following the scheme name.train. It is assumed that the corresponding file with test data is called name.test.
You need to call the script with the respective LightGBM build path, e.g. sh runExperiments.sh "../lightgbm" (Mac/Ubuntu) or sh runExperiments.sh "../Release/lightgbm" (Windows).
❗ The script runs for every dataset with 40,620 configurations (26 feature penalties, 26 threshold penalties, 20 tree sizes, 3 depths, and a run without split and threshold penalties) ❗
For testing purposes, you might want to modify the for-loops inside the script.
for i in $(seq -10 1 15); do
  for j in $(seq -10 1 15); do
    for tree in 1 2 3 4 5 6 7 8 9 10 15 20 30 40 50 100 200 500 1000 10000; do
      for depth in 3 5 7; do
(i and j are converted to different powers of two and represent the penalties.)
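The penalty grid and the total configuration count stated above can be reproduced as follows (a sketch; the exact power-of-two mapping in the script may differ):

```python
# seq -10 1 15 yields the 26 integers -10..15; each is mapped to a
# power-of-two penalty value, per the description above.
exponents = list(range(-10, 16))
penalties = [2.0 ** i for i in exponents]

trees = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 100, 200, 500, 1000, 10000]
depths = [3, 5, 7]

# 26 feature penalties x 26 threshold penalties x 20 tree sizes x 3 depths,
# plus one penalty-free baseline run per (tree size, depth) combination:
total = len(penalties) ** 2 * len(trees) * len(depths) + len(trees) * len(depths)
print(total)  # → 40620
```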
Again, we assume you install the Python packages yourself.
The data inside the models is transformed to .csv files with the python/evaluate_models.py script. This might require more time than you would expect, as accuracy metrics need to be calculated. The .csv files are stored in data/datasetname/last.csv.
Afterwards, similar graphical representations can be generated by calling the python/plot.py script.
To enable figure creation without the whole training and evaluation process, the results of our experiments are placed in the respective results directory.