Probabilistic Deep Forest (PDF)

Overview

This project presents a novel model that combines a Deep Forest (DF) with a Probabilistic Random Forest (PRF). Standard machine learning models, including DF, typically assume that input data is precise. However, real-world data, especially from scientific domains like astronomy, is inherently noisy. This work modifies the Deep Forest architecture to use Probabilistic Random Forests (PRF) as its base estimators, creating a Probabilistic Deep Forest (PDF) that can leverage data uncertainty to make more robust and accurate predictions.

The core of this project is a custom-built Python library, probabilistic-deep-forest, which is a modified version of the original deep-forest library, along with a custom PRF4DF estimator compatible with the scikit-learn ecosystem.

The Core Problem: Uncertainty in Deep Architectures

The original Deep Forest model, as proposed by Zhou and Feng (2017), builds a deep, layer-by-layer cascade of random forest estimators. A key part of its design is in-model feature transformation, where the class probability vectors generated by one layer become the input features for the next.
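
For intuition, this layer-by-layer feature transformation can be sketched with plain scikit-learn forests. This is an illustrative toy, not the deep-forest implementation (which, among other things, generates the class vectors with cross-validated rather than in-sample predictions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

features = X
for layer in range(3):
    forest = RandomForestClassifier(n_estimators=50, random_state=layer)
    forest.fit(features, y)
    proba = forest.predict_proba(features)  # class probability vectors
    # The probability vectors are appended to the raw features and become
    # the input of the next layer; note that no uncertainty is attached
    # to them.
    features = np.hstack([X, proba])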

This presents a fundamental challenge when dealing with uncertain data:

  • Loss of Information: While the initial raw features may have known uncertainties (e.g., measurement errors), the new, generated probability vectors do not. The uncertainty information is lost after the first layer.

  • Over-confidence: Subsequent layers are forced to treat these generated features as perfectly certain, even if they were produced by a highly uncertain prediction in the previous layer. This can lead to error propagation and suboptimal performance, especially on noisy datasets.

The Solution: A Probabilistic Cascade

Our solution is to create a deep cascade that is "uncertainty-aware" from end to end. We achieve this by integrating the Probabilistic Random Forest (PRF), proposed by Reis et al. (2018), into the Deep Forest architecture.
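
The key idea of the PRF is that a measured feature value is treated as a distribution rather than a point, so at each split a sample flows down both branches with complementary probabilities. A minimal sketch of that split rule, assuming Gaussian feature noise:

from scipy.stats import norm

def branch_probabilities(x, dx, threshold):
    # For a feature value modeled as N(x, dx**2), the probability mass on
    # each side of the split threshold gives the left/right branch weights.
    p_left = norm.cdf(threshold, loc=x, scale=dx)
    return p_left, 1.0 - p_left

# A value exactly at the threshold splits 50/50; as dx shrinks, the rule
# degenerates to an ordinary hard decision-tree split.
print(branch_probabilities(x=5.0, dx=1.0, threshold=5.0))  # (0.5, 0.5)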

Key Implementation Details:

  • Scikit-Learn Compatibility Wrapper: The core PRF implementation was not originally a scikit-learn estimator. To integrate it with the deep-forest library, which expects scikit-learn compatible models, we developed the SklearnCompatiblePRF wrapper. This class implements the necessary methods, allowing our custom probabilistic model to be used seamlessly as a plug-in component within the deep forest cascade (a simplified sketch of this pattern follows the list).

  • Uncertainty Propagation (dX): We modified the deep-forest library to handle a parallel uncertainty array (dX) alongside the feature array (X). When a layer generates new augmented features, it also calculates the uncertainty of those features. We use the Standard Error of the Mean (SEM) of the predictions from all trees in a forest as a robust metric for this uncertainty (see the SEM sketch after this list).

  • Automated Experimentation Pipeline: To rigorously test the PDF model against standard benchmarks (Random Forest, Deep Forest, Neural Networks, SVMs), we built a comprehensive experimentation pipeline. This pipeline, driven by simple .yaml configuration files, automates the entire workflow:

    • Data Loading: Handles multiple dataset formats (KEEL .dat files; MNIST .idx1-ubyte and .idx3-ubyte files).

    • Controlled Noise Injection: Includes a dedicated module (utils/noising.py) to synthetically add various types of noise (e.g., Gaussian, uniform) to the features, and Random Classification Noise (Angluin and Laird, 1988) to the labels, allowing us to test model robustness under different conditions (a minimal sketch follows the list).

    • Model Configuration and Comparison: Automatically configures, trains, and evaluates all specified models on the same data splits, generating summary tables and plots for easy comparison of performance.
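
As a rough illustration of the wrapper pattern mentioned above, the sketch below shows the shape of a scikit-learn compatible classifier. It is a minimal sketch only: a plain RandomForestClassifier stands in for the PRF (so the dX argument is accepted but unused here), and the class name and method bodies are illustrative assumptions, not the project's actual SklearnCompatiblePRF code.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import RandomForestClassifier

class UncertaintyAwareWrapperSketch(BaseEstimator, ClassifierMixin):
    # Exposes the fit / predict / predict_proba interface that deep-forest
    # expects, with an optional uncertainty array dX. In the real wrapper,
    # dX is forwarded to the underlying PRF; the stand-in forest ignores it.
    def __init__(self, n_estimators=100, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y, dX=None):
        self.model_ = RandomForestClassifier(
            n_estimators=self.n_estimators, random_state=self.random_state)
        self.model_.fit(X, y)  # a real PRF would also consume dX here
        return self

    def predict_proba(self, X, dX=None):
        return self.model_.predict_proba(X)

    def predict(self, X, dX=None):
        proba = self.predict_proba(X, dX=dX)
        return self.model_.classes_[np.argmax(proba, axis=1)]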
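
The SEM-based uncertainty attached to the augmented features can likewise be pictured as follows. This is a sketch that assumes a fitted scikit-learn forest; the actual computation lives inside the modified deep-forest code.

import numpy as np

def augmented_features_with_uncertainty(forest, X):
    # Per-tree class probabilities: shape (n_trees, n_samples, n_classes).
    per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
    proba = per_tree.mean(axis=0)  # the new augmented features
    # Standard Error of the Mean across trees: the uncertainty dX of
    # those features.
    sem = per_tree.std(axis=0, ddof=1) / np.sqrt(per_tree.shape[0])
    return proba, sem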
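
Finally, the controlled noise injection can be sketched as below; utils/noising.py in the repository is the authoritative implementation, and the function names and the multiclass treatment of label noise here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(seed=0)

def add_gaussian_feature_noise(X, sigma):
    # Additive Gaussian noise on the features; sigma doubles as the known
    # per-feature uncertainty dX handed to the uncertainty-aware models.
    noise = rng.normal(loc=0.0, scale=sigma, size=X.shape)
    dX = np.full(X.shape, sigma)
    return X + noise, dX

def add_random_classification_noise(y, p, n_classes):
    # Random Classification Noise (Angluin and Laird, 1988): with
    # probability p, a label is replaced by one drawn uniformly at random
    # (in this sketch, possibly the original label itself).
    flip = rng.random(y.shape) < p
    random_labels = rng.integers(0, n_classes, size=y.shape)
    return np.where(flip, random_labels, y)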

Running Experiments

Experiments are managed by the run_experiment.py script and configured using .yaml files located in the configs/ directory.
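
For orientation, a configuration file might look like the sketch below; the field names and values are illustrative assumptions, not the exact schema expected by run_experiment.py:

# Hypothetical experiment configuration (field names are illustrative).
dataset:
  name: mnist
  path: data/mnist
noise:
  features: {type: gaussian, sigma: 0.1}
  labels: {type: rcn, p: 0.05}
models: [pdf, deep_forest, random_forest, svm, mlp]
output_dir: results/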

To run an experiment, execute the following command from the project's root directory:

On Windows:

./run.ps1 -Configs "mnist1.yaml", "mnist2.yaml", "mnist3.yaml"

On Linux:

chmod +x run.sh
./run.sh mnist1.yaml mnist2.yaml mnist3.yaml

The results, including tables and plots, will be saved in the results/ directory.

Tip

The file example_notebook.ipynb walks through a simple end-to-end example of the whole project.

Note

This project was developed using Python 3.7.9.

Bibliography

Deep Forest: Zhou, Z.-H., & Feng, J. (2017). Deep Forest: Towards an Alternative to Deep Neural Networks. arXiv:1702.08835.

Probabilistic Random Forest: Reis, I., Baron, D., & Shahaf, S. (2018). Probabilistic Random Forest: A machine learning algorithm for noisy datasets. arXiv:1811.05994.

Random Classification Noise: Angluin, D., & Laird, P. (1988). Learning From Noisy Examples. Machine Learning, 2, 343–370.
