This project implements a machine learning pipeline to detect Command and Control (C2) beaconing activity from Zeek network connection logs. It includes scripts for data preprocessing, feature engineering, model training (XGBoost), evaluation, and visualization. The system is designed to work with the IOT-23 dataset and custom C2 capture data.
Ensure you have Python 3 installed.
```bash
# Clone or copy this project, then navigate to the project directory
git clone https://github.com/wspencerhurst/BeaconClassifier.git  # Or your repo URL
cd BeaconClassifier

# Create and activate a virtual environment (optional but recommended)
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Dependencies:
- Python 3
- pandas, pyarrow, numpy
- xgboost, scikit-learn, joblib, tqdm
- matplotlib (for `plot_metrics.py`)
System Requirements:
- Processing the full IOT-23 dataset and training can be memory-intensive. At least 8 GB of RAM is recommended, but I've seen training on the full IOT-23 dataset use over 24 GB.
- The IOT-23 dataset files (Zeek `conn.log` format, often found as `.log` or `.csv` with pipe delimiters) should be placed in a directory, e.g., `data/iot23_raw/`.
- If you used the `data/download_data.py` script mentioned previously and it places data in `data/network-malware-detection-connection-analysis/`, ensure your input paths in the commands below reflect this.
- Capture Traffic: Use tools like `tcpdump` or Wireshark to capture network traffic while your custom C2 beacon is active, preferably with some background noise traffic. Save the capture as a `.pcap` file (e.g., `capture.pcap`).
- Convert to Zeek Logs:

  ```bash
  zeek -C -r capture.pcap  # The -C option ignores checksum errors
  ```

  This will generate several log files, including `conn.log`.
- Organize Logs: Create a directory (e.g., `data/custom_c2_raw/`) and move the generated `conn.log` file (and any others if needed by different tools, though our classifier only uses `conn.log`) into it. If your logs are from different capture sessions or times and are gzipped (e.g., `conn.18_00_00-19_00_00.log.gz`), place them all in this directory. Our `preprocess.py` script can handle `.gz` files for the `custom` input type.
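For reference, a Zeek `conn.log` is tab-separated, with column names on a `#fields` metadata line. The sketch below shows one way to load such a file into pandas; the `read_conn_log` helper and the sample column set are illustrative, not the exact parsing logic in `preprocess.py`.

```python
import io

import pandas as pd

# A minimal Zeek conn.log excerpt ('#'-prefixed lines are metadata).
SAMPLE = (
    "#separator \\x09\n"
    "#fields\tts\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto\tduration\n"
    "#types\ttime\taddr\tport\taddr\tport\tenum\tinterval\n"
    "1600000000.000000\t10.10.140.58\t49152\t10.10.140.32\t443\ttcp\t0.5\n"
    "1600000030.000000\t10.10.140.58\t49153\t10.10.140.32\t443\ttcp\t0.4\n"
)

def read_conn_log(handle) -> pd.DataFrame:
    """Parse a Zeek conn.log: column names come from the '#fields' line,
    data rows are tab-separated, other '#' lines are skipped."""
    fields, rows = [], []
    for line in handle:
        if line.startswith("#fields"):
            fields = line.rstrip("\n").split("\t")[1:]
        elif line.startswith("#") or not line.strip():
            continue
        else:
            rows.append(line.rstrip("\n").split("\t"))
    return pd.DataFrame(rows, columns=fields)

df = read_conn_log(io.StringIO(SAMPLE))
```

Gzipped logs can be read the same way by opening them with `gzip.open(path, "rt")` before parsing.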
The general workflow involves preprocessing raw data, (optionally) preparing a combined training set, training a model, and then evaluating it.
This script converts raw Zeek conn.log files into feature-rich Parquet files. It handles different input formats and performs feature engineering (especially temporal IAT features).
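The temporal IAT (inter-arrival time) idea behind the feature engineering can be sketched as follows: a beacon that calls home on a fixed interval produces very regular gaps between connections to the same destination, so the standard deviation of those gaps is near zero. This toy example is an assumption about the spirit of the features, not the exact code in `preprocess.py`.

```python
import pandas as pd

# Toy connection records: a regular 30 s beacon vs. irregular browsing.
df = pd.DataFrame({
    "ts": [0, 30, 60, 90, 5, 12, 47, 180],
    "src": ["10.0.0.5"] * 4 + ["10.0.0.7"] * 4,
    "dst": ["203.0.113.9"] * 4 + ["198.51.100.2"] * 4,
})

def iat_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per (src, dst) pair: mean and std of inter-arrival times.
    A near-zero std suggests a fixed-interval beacon."""
    out = []
    for (src, dst), grp in df.sort_values("ts").groupby(["src", "dst"]):
        iat = grp["ts"].diff().dropna()
        out.append({"src": src, "dst": dst,
                    "iat_mean": iat.mean(), "iat_std": iat.std()})
    return pd.DataFrame(out)

feats = iat_features(df)
```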
A. Pre-processing the IOT-23 Dataset:
```bash
python scripts/preprocess.py \
  --input-dir data/iot23_raw \
  --output-fp artifacts/iot23_features.parquet \
  --input-type iot23 \
  --recursive
```

- `--input-dir`: Directory containing IOT-23 log files.
- `--output-fp`: Path to save the processed Parquet file.
- `--input-type iot23`: Specifies the format and labeling logic for IOT-23.
- `--recursive`: Search for log files in subdirectories.
B. Pre-processing Your Custom C2 Dataset:
```bash
python scripts/preprocess.py \
  --input-dir data/custom_c2_raw \
  --output-fp artifacts/custom_c2_features.parquet \
  --input-type custom \
  --victim-ip YOUR_VICTIM_IP \
  --c2-server-ip YOUR_C2_SERVER_IP \
  --recursive
```

- `--input-dir`: Directory containing your custom C2 `conn.log` (or `conn.*.log.gz`) files.
- `--input-type custom`: Specifies format and labeling for custom C2.
- Requires: `--victim-ip` (e.g., `10.10.140.58`) and `--c2-server-ip` (e.g., `10.10.140.32`) to correctly label your C2 traffic as malicious and set the `is_c2` flag.
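The labeling implied by `--victim-ip` and `--c2-server-ip` can be sketched as below. This is an assumed simplification: the actual rules in `preprocess.py` may also consider the reverse direction or other conditions, and the column names follow Zeek's `conn.log` conventions.

```python
import pandas as pd

VICTIM_IP = "10.10.140.58"     # stand-in for your --victim-ip
C2_SERVER_IP = "10.10.140.32"  # stand-in for your --c2-server-ip

df = pd.DataFrame({
    "id.orig_h": ["10.10.140.58", "10.10.140.58", "192.168.1.4"],
    "id.resp_h": ["10.10.140.32", "8.8.8.8", "10.10.140.32"],
})

# Connections from the victim host to the C2 server are labeled as C2;
# everything else in the capture is treated as benign background traffic.
is_c2 = (df["id.orig_h"] == VICTIM_IP) & (df["id.resp_h"] == C2_SERVER_IP)
df["is_c2"] = is_c2.astype(int)
df["label"] = df["is_c2"]
```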
After running these, you should have:
- `artifacts/iot23_features.parquet`
- `artifacts/custom_c2_features.parquet`
This script takes the preprocessed IOT-23 data and a portion of your preprocessed custom C2 data to create a combined training set. It also saves a hold-out portion of your custom C2 data for testing the combined model.
```bash
python scripts/prepare_combined_dataset.py \
  --iot23-input artifacts/iot23_features.parquet \
  --custom-input artifacts/custom_c2_features.parquet \
  --combined-train-output artifacts/combined_train_features.parquet \
  --custom-holdout-output artifacts/custom_c2_holdout_test_features.parquet \
  --custom-train-fraction 0.7 \
  --random-seed 42
```

This will create:
- `artifacts/combined_train_features.parquet` (IOT-23 + 70% of your custom C2 data)
- `artifacts/custom_c2_holdout_test_features.parquet` (the remaining 30% of your custom C2 data)
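The split-and-combine step can be sketched in a few lines of pandas. The toy frames below stand in for the two preprocessed Parquet files; the real script also writes its outputs back to Parquet.

```python
import pandas as pd

# Stand-ins for the preprocessed IOT-23 and custom C2 feature sets.
iot23 = pd.DataFrame({"x": range(100), "is_c2": 0})
custom = pd.DataFrame({"x": range(40), "is_c2": 1})

SEED, TRAIN_FRAC = 42, 0.7  # mirrors --random-seed / --custom-train-fraction

# Shuffle the custom C2 data, keep 70% for training, hold out the rest.
custom_train = custom.sample(frac=TRAIN_FRAC, random_state=SEED)
custom_holdout = custom.drop(custom_train.index)

# Combined training set: all of IOT-23 plus the custom training slice.
combined_train = pd.concat([iot23, custom_train], ignore_index=True)
```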
This script trains the XGBoost classifiers (`model_mal` for Malicious/Benign and `model_c2` for C2/Non-C2).
A. Training Model A (on IOT-23 data only):
```bash
python scripts/train.py \
  --train artifacts/iot23_features.parquet \
  --model-dir artifacts/model_A_iot23_trained \
  --test-size 0.2 \
  --random-seed 42
```

- `--train`: Path to the training data Parquet file.
- `--model-dir`: Directory to save the trained models (`model_mal.joblib`, `model_c2.joblib`) and `feature_columns.joblib`.
- `--test-size`: Fraction of training data to hold out for internal validation during this training run.
B. Training Model B (on combined data):
```bash
python scripts/train.py \
  --train artifacts/combined_train_features.parquet \
  --model-dir artifacts/model_B_combined_trained \
  --test-size 0.2 \
  --random-seed 42
```

These scripts evaluate a trained model on new test data and generate visualizations.
General Evaluation Command Structure (evaluate.py):
```bash
python scripts/evaluate.py \
  --model-dir path/to/your/model_directory \
  --test-data path/to/your/test_features.parquet \
  --output-csv reports/some_predictions.csv \
  --task c2  # or 'mal' for the Malicious/Benign model
```

- This prints metrics if the test data has ground truth labels (`label` or `is_c2`).
- Outputs a CSV with predictions (`pred`) and probabilities (`proba`).
- Saves detailed metrics to `eval_metrics.json` within the `--model-dir`.
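Since the output CSV carries `is_c2`, `pred`, and `proba` columns, the printed metrics can also be recomputed after the fact. A small sketch with a toy predictions frame (the exact metrics set that `evaluate.py` reports may differ):

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Toy frame in the shape of the evaluate.py output CSV: ground truth,
# model probability, and the predicted class at the 0.5 threshold.
preds = pd.DataFrame({
    "is_c2": [1, 1, 0, 0, 1, 0],
    "proba": [0.9, 0.4, 0.2, 0.6, 0.8, 0.1],
})
preds["pred"] = (preds["proba"] >= 0.5).astype(int)

precision = precision_score(preds["is_c2"], preds["pred"])
recall = recall_score(preds["is_c2"], preds["pred"])
```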
General Plotting Command Structure (plot_metrics.py):
```bash
python scripts/plot_metrics.py \
  --model-dir path/to/your/model_directory \
  --test-data path/to/your/test_features.parquet \
  --output-dir reports/some_plots_directory \
  --task c2 \
  --sample 100000  # Optional: samples N rows from test data for faster plotting
```

- Generates `confusion_matrix_thresh_0.5.png`, `roc_curve.png`, `precision_recall_curve.png`, and `feature_importance.png`.
- Prints metrics for various thresholds to the console.
- Metrics Files: `eval_metrics.json` (in each model directory) contains detailed metrics for the default 0.5 threshold (or as evaluated by `evaluate.py`).
- Console Output from `plot_metrics.py`: Provides precision, recall, and F1 for various thresholds – crucial for understanding threshold impact.
- Plots (`reports/` subdirectories):
  - Confusion Matrix: Visualizes true/false positives/negatives for a given threshold (default 0.5 in `plot_metrics.py`).
  - ROC Curve: Shows model discrimination ability across all thresholds. Higher AUC is better.
  - Precision-Recall Curve: Illustrates the trade-off between precision and recall. Essential for selecting an operational threshold, especially in security contexts.
  - Feature Importance: Shows which features contributed most to the XGBoost model's predictions (by gain).
- Feature Engineering: Modify `preprocess.py` to add or change features.
- Model Tuning: Adjust hyperparameters in `train.py`.
- Threshold Selection: The `plot_metrics.py` output and the precision-recall curve are key. For operational use, you'd select a threshold from the PR curve, or based on F1-scores, that balances your desired recall and precision.
- Memory Management:
  - The `train.py` script samples 500,000 rows by default if the dataset is larger. Remove or adjust the `df = df.sample(n=500_000, ...)` line to use more/all data.
  - XGBoost's `tree_method="hist"` is already used in `train.py` for better memory efficiency.
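The threshold-selection step described above can be automated from model probabilities: sweep the PR curve and pick the threshold that maximizes F1. A sketch with toy scores (in practice, use the `proba` column from the `evaluate.py` output):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy ground truth and model probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
proba = np.array([0.1, 0.3, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, proba)
# The precision/recall arrays are one longer than 'thresholds', so drop
# their final (precision=1, recall=0) point before computing F1.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]
```

In a security context you might instead fix a minimum recall and take the threshold with the best precision above it; the same arrays support either policy.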