This project combines machine learning and Answer Set Programming (ASP) to provide interpretable explanations for malware classification, based on feature manipulations derived from XGBoost models trained on the EMBER dataset.
.
├── dataset/ # EMBER dataset directory (download required)
├── export/ # Output directory for generated samples and solutions
├── model/ # Directory for saved models
│
├── lib/
│ ├── asp/ # XGBoost to ASP conversion logic
│ ├── dataset/ # EMBER dataset preprocessor
│ └── utils/ # Utility functions
│
├── metrics/
│ ├── booster.py # Evaluates the trained booster (XGBoost model)
│ └── generation.py # Evaluates the sample generation module
│
├── config.py # Configuration file for directories and parameters
│
├── narrow_bounds.py # ASP-based narrow bound solver
├── narrow_bounds_plot.py # Visualization for narrow bounds
│
├── expand_bounds.py # ASP-based expanded bound solver
├── expand_bounds_plot.py # Visualization for expanded bounds
│
├── sample_generation.py # Generates a sample with a desired malware probability
├── rule_extraction.py # Train XGBoost, extract rules, and convert to ASP
│
├── LICENSE
├── requirements.txt
└── README.md
- Python 3.12
- Install dependencies:
pip install -r requirements.txt
pip install git+https://github.com/blkdmr/ember.git
Download the EMBER 2018 dataset from:
https://ember.elastic.co/ember_dataset_2018_2.tar.bz2
Place the archive in the dataset/ folder and extract it.
If you are using custom folders, update the paths in config.py.
python rule_extraction.py
- Trains an XGBoost model
- Dumps the model
- Extracts decision rules
- Converts them to an ASP program
The first time you run this script, it will initialize the EMBER dataset.
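To give an idea of what the rule extraction and ASP conversion steps involve, here is a minimal sketch that turns one decision tree into ASP rules, one rule per leaf. It is an illustration under stated assumptions, not the encoding used in lib/asp: the function name `tree_to_asp` and the predicates `leaf/2` and `lt/2` are hypothetical. The input is a tree in the JSON layout produced by XGBoost's `Booster.get_dump(dump_format="json")` (each string parsed with `json.loads`), where the "yes" child is taken when the feature value is below the split threshold. Thresholds are written as quoted strings because ASP terms have no floating-point type.

```python
def tree_to_asp(node, tree_id, path=()):
    """Return one ASP rule per leaf of a tree in XGBoost's JSON dump layout.

    Predicate names are hypothetical and chosen only for illustration.
    """
    if "leaf" in node:
        head = f"leaf({tree_id},{node['nodeid']})"
        # A single-leaf tree has no conditions, so it becomes a fact.
        return [f"{head} :- {', '.join(path)}." if path else f"{head}."]

    children = {child["nodeid"]: child for child in node["children"]}
    cond = f'lt({node["split"]},"{node["split_condition"]}")'

    rules = []
    # "yes" branch: feature value < threshold.
    rules += tree_to_asp(children[node["yes"]], tree_id, path + (cond,))
    # "no" branch: feature value >= threshold.
    rules += tree_to_asp(children[node["no"]], tree_id, path + (f"not {cond}",))
    return rules


# Hand-written example tree with one split on feature f3.
example_tree = {
    "nodeid": 0, "depth": 0, "split": "f3", "split_condition": 0.5,
    "yes": 1, "no": 2, "missing": 1,
    "children": [
        {"nodeid": 1, "leaf": -0.4},
        {"nodeid": 2, "leaf": 0.7},
    ],
}

for rule in tree_to_asp(example_tree, tree_id=0):
    print(rule)
# leaf(0,1) :- lt(f3,"0.5").
# leaf(0,2) :- not lt(f3,"0.5").
```

The sketch ignores the "missing" branch and the leaf weights; a full encoding would also need facts that carry each leaf's value so a solver can reason about the model output.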
python narrow_bounds.py
- Finds minimal feature combinations to generate a sample with a target malware probability
- Saves the solution in the export/ directory
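A target malware probability can be restated as a target on the model's raw score. The sketch below assumes the model was trained with XGBoost's `binary:logistic` objective, which this README does not confirm; under that objective the predicted probability is the logistic function of the summed leaf values plus a base margin. The function names and the `base_margin` parameter are hypothetical.

```python
import math


def margin_for_probability(p):
    """Raw score (log-odds) that corresponds to probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def probability_from_leaves(leaf_values, base_margin=0.0):
    """Probability implied by one selected leaf value per tree.

    Assumes a binary:logistic objective; base_margin stands in for the
    model's base score expressed as log-odds.
    """
    margin = base_margin + sum(leaf_values)
    return 1.0 / (1.0 + math.exp(-margin))


# Any set of leaves whose values sum to this margin yields probability 0.8.
target_margin = margin_for_probability(0.8)
recovered = probability_from_leaves([target_margin])
```

Under this assumption, searching for a sample with a given probability amounts to choosing one reachable leaf per tree so that the leaf values sum close to the target margin.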
To visualize:
python narrow_bounds_plot.py
First, create a sample with a specific malware probability p:
python sample_generation.py
Then, expand bounds for a target probability q:
python expand_bounds.py
- Alters the sample to achieve the new malware probability q
- Saves the result in the export/ directory
To visualize:
python expand_bounds_plot.py
python metrics/booster.py
- Evaluates the trained booster (XGBoost model)
- Outputs performance metrics and checks internal booster statistics
python metrics/generation.py
- Evaluates the quality of sample generation
- Outputs statistics related to malware probability manipulation and feature adjustments
This project is licensed under the MIT License. See LICENSE for details.