Machine Learning-Based Exploitability Prediction for Penetration Testing A Data-Driven Approach to Prioritizing Vulnerabilities
IEEE TIFS
Python 3.8+
License: MIT # ๐ License
This project is licensed under the MIT License. See LICENSE for details.
๐ Overview This repository contains the code and data pipeline for the IEEE TIFS paper:
"Machine Learning-Based Exploitability Prediction for Penetration Testing: A Data-Driven Approach"
We present a production-ready XGBoost model that predicts the likelihood of a CVE being weaponized, using features from:
National Vulnerability Database (NVD)
Exploit Database (ExploitDB)
Key innovations: โ 25% recall at 6% precision (optimized for high-risk triage) โ 62.5% reduction in missed exploits vs. random sampling โ FastAPI microservice for integration with pentesting tools (Metasploit/Burp Suite)
๐ Quick Start
- Install Dependencies bash pip install -r requirements.txt # Python 3.8+
- Run the Jupyter Notebook bash jupyter notebook exploit_prediction.ipynb # Full pipeline: EDA โ Training โ Evaluation
- Deploy the FastAPI Service bash uvicorn api:app --reload # Access docs at http://localhost:8000/docs ๐ Repository Structure Copy โโโ data/ # Processed datasets (NVD + ExploitDB) โ โโโ nvd_2024.json # Sample NVD data โ โโโ exploits.csv # ExploitDB records โโโ models/ # Pretrained XGBoost + SMOTE โ โโโ exploit_model.joblib โโโ api/ # FastAPI deployment โ โโโ app.py # REST endpoint โ โโโ schemas.py # Pydantic input validation โโโ exploit_prediction.ipynb # Main Colab notebook โโโ requirements.txt # Python dependencies โโโ LICENSE # MIT License ๐ Key Features ๐ Feature Engineering CVSS Metrics: Base score, attack vector, criticality flags
Temporal Signals: Days since publication ("golden hour" for exploits)
Class Imbalance Handling: SMOTE oversampling (1:738 ratio)
โ๏ธ Optimized XGBoost Model python model = XGBClassifier( scale_pos_weight=100, # Penalize false negatives 100ร more max_depth=10, n_estimators=200, eval_metric='logloss' ) ๐จ Security Thresholding Recall-Optimized Decision Threshold (ฮธ=0.10):
25% exploit detection rate
<1% false alarms
๐ API Endpoints Endpoint Description Example Request /predict Predict exploit probability {"cve_id": "CVE-2024-1234", "cvss_score": 9.8, "days_since_published": 30} /docs Interactive OpenAPI 3.0 docs - ๐ Performance Comparison with Baselines (Test Set, n=5,979 CVEs):
Model Recall Precision F0.7-Score CVSS โฅ 7.0 8% 0.5% 0.03 Random Forest 7% 0.3% 0.02 Our XGBoost 25% 6% 0.18 SHAP Analysis: SHAP Summary Plot
๐ ๏ธ Integration with Pentesting Tools python import requests
response = requests.post( "http://localhost:8000/predict", json={"cve_id": "CVE-2024-1234", "cvss_score": 9.2, "days_since_published": 15} ) print(response.json()) # {"risk_level": "HIGH", "probability": 0.87, "threshold_used": 0.10} ๐ Citation If you use this work, please cite:
bibtex @article{your_tifs_paper, title={Machine Learning-Based Exploitability Prediction for Penetration Testing}, author={Your Name et al.}, journal={IEEE Transactions on Information Forensics and Security}, year={2024} } ๐ฎ Contact For questions or collaborations: ๐ง Email: your.email@example.com ๐ป GitHub Issues: Open an issue
๐จ Disclaimer This tool is designed for defensive security only. Always comply with ethical hacking guidelines.