A machine learning project for classifying DDoS (Distributed Denial of Service) attacks. This repository provides a reusable preprocessing pipeline and compares several approaches (logistic regression, an autoencoder, and DBSCAN clustering) for identifying DDoS traffic patterns.
- Production-ready preprocessing pipeline - Modular, sklearn-compatible transformers for feature engineering
- Multiple ML models - Logistic Regression, Autoencoders, and clustering algorithms
- Comprehensive data analysis - Exploratory notebooks demonstrating dataset characteristics and model performance
- Easy experimentation - Jupyter notebooks for iterative development and model evaluation
- Reproducible results - Poetry-managed dependencies and standardized preprocessing
- 📚 Quick Start Guide - Get up and running in 5 minutes
- 🔧 Preprocessing Documentation - Data pipeline API and examples
- 🤖 Models Documentation - Algorithm descriptions and use cases
- 📊 Dataset Documentation - Dataset structure and features
- 🤝 Contributing Guide - How to contribute to the project
- Python 3.12 - Programming language
- scikit-learn - ML algorithms and utilities
- pandas / numpy - Data manipulation and numerical computing
- matplotlib / seaborn - Data visualization
- PyTorch - Deep learning framework (CPU build)
- Poetry - Dependency management and packaging
├── preprocessing/ # Data preprocessing pipeline and transformers
│ ├── training_pipeline.py # DataPreprocessingPipeline class
│ ├── feature_dropper.py # Remove unwanted features
│ ├── composite_splitter.py # Split IP addresses into octets
│ ├── numerical_standardiser.py # Standardize numerical features
│ ├── category_encoder.py # Encode categorical variables
│ └── feature_imputer.py # Handle missing values
├── models/ # Trained model files
│ ├── logistic_regression.pkl # Trained logistic regression model
│ └── autoencoder_model.pt # Trained autoencoder model
├── logistic_regression/ # Logistic regression experiments
├── autoencoder/ # Autoencoder experiments
├── notebooks/ # Additional exploratory notebooks
├── data/ # Dataset directory (not in repo)
├── data analysis.ipynb # Main data exploration notebook
├── pipeline usage.ipynb # Preprocessing pipeline examples
└── README.md # This file
- Python 3.12 or higher
- Poetry (install from https://python-poetry.org)
Install Poetry (if not already installed):
curl -sSL https://install.python-poetry.org | python3 -
Clone the repository:
git clone https://github.com/rtweera/ddos-classifier.git
cd ddos-classifier
Install dependencies:
poetry install
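Activate the Poetry shell:
poetry shell
Note: Poetry 2.0 and later no longer bundle the shell command; it is provided by the poetry-plugin-shell plugin. Prefixing commands with poetry run works on any version.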
The main preprocessing pipeline is DataPreprocessingPipeline in preprocessing/training_pipeline.py. See PREPROCESSING.md for detailed API documentation.
Quick example:
from preprocessing import DataPreprocessingPipeline
import pandas as pd
# Load your data
df = pd.read_csv('path/to/data.csv')
# Create and fit the pipeline
pipeline = DataPreprocessingPipeline()
pipeline.fit(df)
# Transform new data
transformed = pipeline.transform(df)
# Save for later use
pipeline.save('pipeline.pkl')
Start exploring with the included notebooks:
# Main data analysis
jupyter notebook "data analysis.ipynb"
# Pipeline usage examples
jupyter notebook "pipeline usage.ipynb"
# Model-specific experiments
jupyter notebook notebooks/logistic_regression.ipynb
jupyter notebook notebooks/autoencoder.ipynb
The preprocessing pipeline includes the following stages:
- Feature Dropping - Removes irrelevant features (index, Flow ID, Timestamp)
- Composite Splitting - Splits IP addresses into octets for better feature representation
- Numerical Standardization - Standardizes numerical features to mean=0, std=1
See PREPROCESSING.md for more details on each transformer and customization options.
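The three stages listed above can be sketched with plain scikit-learn building blocks. This is only an illustration, not the repository's implementation: the class names DropColumns and SplitIPv4 and the column names "Source IP" and "Flow Duration" are hypothetical stand-ins, and the real transformers in preprocessing/ may behave differently.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DropColumns(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the feature dropper: removes named columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # errors="ignore" skips columns that are not present
        return X.drop(columns=self.columns, errors="ignore")


class SplitIPv4(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the composite splitter: IPv4 string -> 4 octets."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            # "192.168.0.1" -> four integer columns
            octets = X[col].str.split(".", expand=True).astype(int)
            octets.columns = [f"{col}_octet{i + 1}" for i in range(4)]
            X = pd.concat([X.drop(columns=col), octets], axis=1)
        return X


# Tiny made-up frame; the column names are assumptions, not the dataset schema.
df = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "Flow ID": ["a", "b", "c", "d"],
    "Timestamp": ["t0", "t1", "t2", "t3"],
    "Source IP": ["192.168.0.1", "10.0.0.7", "172.16.5.9", "192.168.1.20"],
    "Flow Duration": [120.0, 3500.0, 80.0, 910.0],
})

sketch = Pipeline([
    ("drop", DropColumns(["index", "Flow ID", "Timestamp"])),
    ("split_ip", SplitIPv4(["Source IP"])),
    ("scale", StandardScaler()),  # mean 0, std 1 per column
])
features = sketch.fit_transform(df)
print(features.shape)
```

Writing each stage as its own transformer keeps the steps independently testable and lets them be reordered or swapped inside a Pipeline.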
The project explores three main approaches:
- Logistic Regression - Fast baseline model for binary classification
- Autoencoder - Deep learning model for feature learning and anomaly detection
- DBSCAN - Clustering algorithm for unsupervised DDoS detection
See MODELS.md for detailed algorithm descriptions and use cases.
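To make the difference between the supervised baseline and the clustering approach concrete, here is a minimal sketch on synthetic data. The data, hyperparameters (eps, min_samples, max_iter), and variable names are assumptions chosen for illustration; they are not the settings used in this repository, and the autoencoder is omitted to keep the example short.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for preprocessed flow features (NOT the Kaggle dataset):
# two well-separated Gaussian blobs labelled benign (0) and DDoS (1).
benign = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
attack = rng.normal(loc=6.0, scale=1.0, size=(200, 4))
X = StandardScaler().fit_transform(np.vstack([benign, attack]))
y = np.array([0] * 200 + [1] * 200)

# Supervised baseline: logistic regression trained on labelled flows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = baseline.score(X_test, y_test)

# Unsupervised alternative: DBSCAN groups dense regions without using labels.
# Points labelled -1 are treated as noise.
clusterer = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = clusterer.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(f"baseline accuracy: {accuracy:.3f}, DBSCAN clusters: {n_clusters}")
```

On real traffic the classes are unlikely to separate this cleanly, so eps and min_samples would need tuning against the actual feature scales.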
This project uses the Kaggle DDoS Dataset. See Dataset.md for:
- Dataset structure and schema
- Feature descriptions
- Download instructions
- Data characteristics and statistics
If you encounter issues installing the CPU version of PyTorch, ensure you're using Python 3.12:
python --version
The preprocessing pipeline supports incremental processing, so large files can be fitted chunk by chunk:
import pandas as pd
from preprocessing import DataPreprocessingPipeline
pipeline = DataPreprocessingPipeline()
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    pipeline.partial_fit(chunk)
Download the DDoS dataset from Kaggle and place it in the data/ directory.
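Returning to the chunked partial_fit loop for large files: the idea can be tried without the dataset by using scikit-learn's StandardScaler, which exposes a partial_fit method of its own. The sketch below assumes nothing about how DataPreprocessingPipeline implements partial_fit; the in-memory CSV and the column names f1, f2, f3 are made up for the example.

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Build a small in-memory CSV so the example runs without a dataset file.
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(1000, 3)),
    columns=["f1", "f2", "f3"],  # hypothetical feature names
)
csv_text = frame.to_csv(index=False)

# Fit incrementally: only one 100-row chunk is held in memory at a time.
incremental = StandardScaler()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    incremental.partial_fit(chunk)

# Fit once on the full data for comparison.
full = StandardScaler().fit(pd.read_csv(io.StringIO(csv_text)))

# The running statistics should agree with the single-pass fit.
same_mean = np.allclose(incremental.mean_, full.mean_)
same_var = np.allclose(incremental.var_, full.var_)
print(same_mean, same_var)
```

Standardization suits chunked fitting because the mean and variance can be updated from running totals; stages that need a global view of the data may not have an equivalent incremental form.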
Contributions are welcome! Please see CONTRIBUTING.md for:
- Development setup
- Code style guidelines
- Testing procedures
- Pull request process
This project is open source. See the LICENSE file for details.
- Dataset: Kaggle DDoS Datasets
- scikit-learn: https://scikit-learn.org
- PyTorch: https://pytorch.org