Probabilistic Deep Forest (PDF)

Overview

This project presents a novel model that combines a Deep Forest (DF) with a Probabilistic Random Forest (PRF). Standard machine learning models, including DF, typically assume that input data is precise. However, real-world data, especially from scientific domains like astronomy, is inherently noisy. This work modifies the Deep Forest architecture to use Probabilistic Random Forests (PRF) as its base estimators, creating a Probabilistic Deep Forest (PDF) that can leverage data uncertainty to make more robust and accurate predictions.

The core of this project is a custom-built Python library, probabilistic-deep-forest, which is a modified version of the original deep-forest library, along with a custom PRF4DF estimator compatible with the scikit-learn ecosystem.

The Core Problem: Uncertainty in Deep Architectures

The original Deep Forest model, as proposed by Zhou and Feng (2017), builds a deep, layer-by-layer cascade of random forest estimators. A key part of its design is in-model feature transformation, where the class probability vectors generated by one layer become the input features for the next.
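
For intuition, this layer-by-layer feature transformation can be sketched with plain scikit-learn forests. This is an illustrative toy, not the deep-forest implementation (which, among other things, generates the class vectors with cross-validated rather than in-sample predictions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

features = X
for layer in range(3):
    forest = RandomForestClassifier(n_estimators=50, random_state=layer)
    forest.fit(features, y)
    proba = forest.predict_proba(features)  # class probability vectors
    # The probability vectors are appended to the raw features and become
    # the input of the next layer; note that no uncertainty is attached
    # to them.
    features = np.hstack([X, proba])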

This presents a fundamental challenge when dealing with uncertain data:

  • Loss of Information: While the initial raw features may have known uncertainties (e.g., measurement errors), the new, generated probability vectors do not. The uncertainty information is lost after the first layer.

  • Over-confidence: Subsequent layers are forced to treat these generated features as perfectly certain, even if they were produced by a highly uncertain prediction in the previous layer. This can lead to error propagation and suboptimal performance, especially on noisy datasets.

The Solution: A Probabilistic Cascade

Our solution is to create a deep cascade that is "uncertainty-aware" from end to end. We achieve this by integrating the Probabilistic Random Forest (PRF), proposed by Reis et al. (2018), into the Deep Forest architecture.
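
The key idea of the PRF is that a measured feature value is treated as a distribution rather than a point, so at each split a sample flows down both branches with complementary probabilities. A minimal sketch of that split rule, assuming Gaussian feature noise:

from scipy.stats import norm

def branch_probabilities(x, dx, threshold):
    # For a feature value modeled as N(x, dx**2), the probability mass on
    # each side of the split threshold gives the left/right branch weights.
    p_left = norm.cdf(threshold, loc=x, scale=dx)
    return p_left, 1.0 - p_left

# A value exactly at the threshold splits 50/50; as dx shrinks, the rule
# degenerates to an ordinary hard decision-tree split.
print(branch_probabilities(x=5.0, dx=1.0, threshold=5.0))  # (0.5, 0.5)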

Key Implementation Details:

  • Scikit-Learn Compatibility Wrapper: The core PRF implementation was not originally a scikit-learn estimator. To integrate it with the deep-forest library, which expects scikit-learn compatible models, we developed the SklearnCompatiblePRF wrapper. This class implements the necessary methods, allowing our custom probabilistic model to be used seamlessly as a plug-in component within the deep forest cascade (a simplified sketch of this pattern follows the list).

  • Uncertainty Propagation (dX): We modified the deep-forest library to handle a parallel uncertainty array (dX) alongside the feature array (X). When a layer generates new augmented features, it also calculates the uncertainty of those features. We use the Standard Error of the Mean (SEM) of the predictions from all trees in a forest as a robust metric for this uncertainty (see the SEM sketch after this list).

  • Automated Experimentation Pipeline: To rigorously test the PDF model against standard benchmarks (Random Forest, Deep Forest, Neural Networks, SVMs), we built a comprehensive experimentation pipeline. This pipeline, driven by simple .yaml configuration files, automates the entire workflow:

    • Data Loading: Handles multiple dataset formats (KEEL .dat files; MNIST .idx1-ubyte and .idx3-ubyte files).

    • Controlled Noise Injection: Includes a dedicated module (utils/noising.py) to synthetically add various types of noise (e.g., Gaussian, uniform) to the features, and Random Classification Noise (Angluin and Laird, 1988) to the labels, allowing us to test model robustness under different conditions (a minimal sketch follows the list).

    • Model Configuration and Comparison: Automatically configures, trains, and evaluates all specified models on the same data splits, generating summary tables and plots for easy comparison of performance.
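
As a rough illustration of the wrapper pattern mentioned above, the sketch below shows the shape of a scikit-learn compatible classifier. It is a minimal sketch only: a plain RandomForestClassifier stands in for the PRF (so the dX argument is accepted but unused here), and the class name and method bodies are illustrative assumptions, not the project's actual SklearnCompatiblePRF code.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import RandomForestClassifier

class UncertaintyAwareWrapperSketch(BaseEstimator, ClassifierMixin):
    # Exposes the fit / predict / predict_proba interface that deep-forest
    # expects, with an optional uncertainty array dX. In the real wrapper,
    # dX is forwarded to the underlying PRF; the stand-in forest ignores it.
    def __init__(self, n_estimators=100, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y, dX=None):
        self.model_ = RandomForestClassifier(
            n_estimators=self.n_estimators, random_state=self.random_state)
        self.model_.fit(X, y)  # a real PRF would also consume dX here
        return self

    def predict_proba(self, X, dX=None):
        return self.model_.predict_proba(X)

    def predict(self, X, dX=None):
        proba = self.predict_proba(X, dX=dX)
        return self.model_.classes_[np.argmax(proba, axis=1)]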
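
The SEM-based uncertainty attached to the augmented features can likewise be pictured as follows. This is a sketch that assumes a fitted scikit-learn forest; the actual computation lives inside the modified deep-forest code.

import numpy as np

def augmented_features_with_uncertainty(forest, X):
    # Per-tree class probabilities: shape (n_trees, n_samples, n_classes).
    per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
    proba = per_tree.mean(axis=0)  # the new augmented features
    # Standard Error of the Mean across trees: the uncertainty dX of
    # those features.
    sem = per_tree.std(axis=0, ddof=1) / np.sqrt(per_tree.shape[0])
    return proba, sem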
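
Finally, the controlled noise injection can be sketched as below; utils/noising.py in the repository is the authoritative implementation, and the function names and the multiclass treatment of label noise here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(seed=0)

def add_gaussian_feature_noise(X, sigma):
    # Additive Gaussian noise on the features; sigma doubles as the known
    # per-feature uncertainty dX handed to the uncertainty-aware models.
    noise = rng.normal(loc=0.0, scale=sigma, size=X.shape)
    dX = np.full(X.shape, sigma)
    return X + noise, dX

def add_random_classification_noise(y, p, n_classes):
    # Random Classification Noise (Angluin and Laird, 1988): with
    # probability p, a label is replaced by one drawn uniformly at random
    # (in this sketch, possibly the original label itself).
    flip = rng.random(y.shape) < p
    random_labels = rng.integers(0, n_classes, size=y.shape)
    return np.where(flip, random_labels, y)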

Running Experiments

Experiments are managed by the run_experiment.py script and configured using .yaml files located in the configs/ directory.
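
For orientation, a configuration file might look like the sketch below; the field names and values are illustrative assumptions, not the exact schema expected by run_experiment.py:

# Hypothetical experiment configuration (field names are illustrative).
dataset:
  name: mnist
  path: data/mnist
noise:
  features: {type: gaussian, sigma: 0.1}
  labels: {type: rcn, p: 0.05}
models: [pdf, deep_forest, random_forest, svm, mlp]
output_dir: results/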

To run an experiment, execute the following command from the project's root directory:

On Windows:

./run.ps1 -Configs "mnist1.yaml", "mnist2.yaml", "mnist3.yaml"

On Linux:

chmod +x run.sh
./run.sh mnist1.yaml mnist2.yaml mnist3.yaml

The results, including tables and plots, will be saved in the results/ directory.

Tip

The file example_notebook.ipynb walks through a simple end-to-end example of the whole project.

Note

This project was developed using Python 3.7.9.

Bibliography

Deep Forest: Zhou, Z.-H., & Feng, J. (2017). Deep Forest: Towards an Alternative to Deep Neural Networks. arXiv:1702.08835.

Probabilistic Random Forest: Reis, I., Baron, D., & Shahaf, S. (2018). Probabilistic Random Forest: A machine learning algorithm for noisy datasets. arXiv:1811.05994.

Random Classification Noise: Angluin, D., & Laird, P. (1988). Learning From Noisy Examples. Machine Learning, 2, 343–370.
