DDoS Classifier

A machine learning project for classifying DDoS (Distributed Denial of Service) attacks. This repository provides reusable preprocessing pipelines and compares several approaches (logistic regression, autoencoders, and clustering) for identifying DDoS traffic patterns.

Features

Production-ready preprocessing pipeline - Modular, sklearn-compatible transformers for feature engineering
Multiple ML models - Logistic Regression, Autoencoders, and clustering algorithms
Comprehensive data analysis - Exploratory notebooks demonstrating dataset characteristics and model performance
Easy experimentation - Jupyter notebooks for iterative development and model evaluation
Reproducible results - Poetry-managed dependencies and standardized preprocessing

Quick Links

📚 Quick Start Guide (QUICKSTART.md) - Get up and running in 5 minutes
🔧 Preprocessing Documentation (PREPROCESSING.md) - Data pipeline API and examples
🤖 Models Documentation (MODELS.md) - Algorithm descriptions and use cases
📊 Dataset Documentation (Dataset.md) - Dataset structure and features
🤝 Contributing Guide (CONTRIBUTING.md) - How to contribute to the project

Tech Stack

Python 3.12 - Programming language
scikit-learn - ML algorithms and utilities
pandas / numpy - Data manipulation and numerical computing
matplotlib / seaborn - Data visualization
PyTorch - Deep learning framework (CPU build)
Poetry - Dependency management and packaging

Project Structure

├── preprocessing/              # Data preprocessing pipeline and transformers
│   ├── training_pipeline.py    # DataPreprocessingPipeline class
│   ├── feature_dropper.py      # Remove unwanted features
│   ├── composite_splitter.py   # Split IP addresses into octets
│   ├── numerical_standardiser.py # Standardize numerical features
│   ├── category_encoder.py     # Encode categorical variables
│   └── feature_imputer.py      # Handle missing values
├── models/                     # Trained model files
│   ├── logistic_regression.pkl # Trained logistic regression model
│   └── autoencoder_model.pt    # Trained autoencoder model
├── logistic_regression/        # Logistic regression experiments
├── autoencoder/                # Autoencoder experiments
├── notebooks/                  # Additional exploratory notebooks
├── data/                       # Dataset directory (not in repo)
├── data analysis.ipynb         # Main data exploration notebook
├── pipeline usage.ipynb        # Preprocessing pipeline examples
└── README.md                   # This file

Setup

Prerequisites

Python 3.12 or higher
Poetry (install from https://python-poetry.org)

Installation

Install Poetry (if not already installed):

curl -sSL https://install.python-poetry.org | python3 -

Clone the repository:

git clone https://github.com/rtweera/ddos-classifier.git
cd ddos-classifier

Install dependencies:
```
poetry install
```
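Activate the Poetry shell (in Poetry 2.0 and later this command requires the poetry-plugin-shell plugin; prefixing commands with poetry run works without it):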
```
poetry shell
```

Usage

Using the Preprocessing Pipeline

The main preprocessing pipeline is DataPreprocessingPipeline in preprocessing/training_pipeline.py. See PREPROCESSING.md for detailed API documentation.

Quick example:

from preprocessing import DataPreprocessingPipeline
import pandas as pd

# Load your data
df = pd.read_csv('path/to/data.csv')

# Create and fit the pipeline
pipeline = DataPreprocessingPipeline()
pipeline.fit(df)

# Transform new data
transformed = pipeline.transform(df)

# Save for later use
pipeline.save('pipeline.pkl')

Running Experiments

Start exploring with the included notebooks:

# Main data analysis
jupyter notebook "data analysis.ipynb"

# Pipeline usage examples
jupyter notebook "pipeline usage.ipynb"

# Model-specific experiments
jupyter notebook notebooks/logistic_regression.ipynb
jupyter notebook notebooks/autoencoder.ipynb

Pipeline Overview

The preprocessing pipeline includes the following stages:

Feature Dropping - Removes irrelevant features (index, Flow ID, Timestamp)
Composite Splitting - Splits IP addresses into octets for better feature representation
Numerical Standardization - Standardizes numerical features to mean=0, std=1

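The preprocessing/ package also contains transformers for encoding categorical variables (category_encoder.py) and handling missing values (feature_imputer.py). See PREPROCESSING.md for more details on each transformer and customization options.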
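
The three stages listed above can be sketched with plain scikit-learn building blocks. This is a minimal illustration, not the project's implementation: the class names DropColumns and SplitIPv4, the column name "Source IP", and the toy data are all assumptions made for the example, and StandardScaler stands in for the numerical standardiser.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the feature-dropping stage."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # errors="ignore" tolerates inputs that lack some of the columns
        return X.drop(columns=self.columns, errors="ignore")


class SplitIPv4(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the composite-splitting stage."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            # "a.b.c.d" -> four integer columns, one per octet
            octets = X[col].str.split(".", expand=True).astype(int)
            octets.columns = [f"{col}_octet{i}" for i in range(1, 5)]
            X = pd.concat([X.drop(columns=col), octets], axis=1)
        return X


# Toy data; the real dataset has many more columns.
df = pd.DataFrame({
    "Flow ID": ["a", "b", "c"],
    "Source IP": ["192.168.0.1", "10.0.0.7", "172.16.5.20"],
    "Flow Duration": [100.0, 200.0, 300.0],
})

sketch = Pipeline([
    ("drop", DropColumns(["Flow ID", "Timestamp"])),
    ("split", SplitIPv4(["Source IP"])),
    ("scale", StandardScaler()),  # mean=0, std=1 per column
])
features = sketch.fit_transform(df)
print(features.shape)
```

Because each stage implements fit and transform, the steps can be reordered or swapped inside the Pipeline without changing the calling code.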

Models

The project explores three main approaches:

Logistic Regression - Fast baseline model for binary classification
Autoencoder - Deep learning model for feature learning and anomaly detection
DBSCAN - Clustering algorithm for unsupervised DDoS detection

See MODELS.md for detailed algorithm descriptions and use cases.
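
As a rough illustration of how the supervised and unsupervised approaches differ, the sketch below fits a logistic regression baseline and runs DBSCAN on the same data. The data is synthetic (two Gaussian blobs standing in for benign and attack traffic), and the hyperparameters eps=0.8 and min_samples=5 are arbitrary choices for this toy example, not values taken from the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for preprocessed flow features: two well-separated blobs.
benign = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
attack = rng.normal(loc=6.0, scale=1.0, size=(200, 4))
X = np.vstack([benign, attack])
y = np.array([0] * 200 + [1] * 200)  # 0 = benign, 1 = DDoS

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Supervised baseline: fit the scaler on the training split only.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
accuracy = clf.score(scaler.transform(X_test), y_test)

# Unsupervised alternative: DBSCAN groups dense regions without labels;
# points it cannot assign to any cluster get the label -1 (noise).
cluster_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(scaler.transform(X))
n_clusters = len(set(cluster_labels) - {-1})

print(f"logistic regression test accuracy: {accuracy:.3f}")
print(f"DBSCAN clusters found: {n_clusters}")
```

Real traffic is far less cleanly separated than these blobs, so the numbers printed here say nothing about performance on the actual dataset.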

Dataset

This project uses the Kaggle DDoS Dataset. See Dataset.md for:

Dataset structure and schema
Feature descriptions
Download instructions
Data characteristics and statistics

Troubleshooting

Issues with PyTorch Installation

If you encounter issues installing the CPU version of PyTorch, confirm that you are using Python 3.12:

python --version
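
The same check can be made from inside the interpreter that Poetry actually uses, which is sometimes not the one on your PATH. The short script below is a sketch that uses only the standard library; it compares the running version against the 3.12 minimum listed under Prerequisites and reports whether a torch package is importable.

```python
import importlib.util
import sys

# Python 3.12 is the minimum version listed under Prerequisites.
REQUIRED = (3, 12)

python_ok = sys.version_info[:2] >= REQUIRED
# find_spec returns None when the package is not installed in this environment
torch_found = importlib.util.find_spec("torch") is not None

print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"{'OK' if python_ok else 'too old'}")
print(f"torch importable: {torch_found}")
```

Saving this under a name of your choice and running it with poetry run python should report on the project's virtual environment rather than the system interpreter.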

Memory Issues with Large Datasets

The preprocessing pipeline supports incremental processing:

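import pandas as pd
from preprocessing import DataPreprocessingPipeline
pipeline = DataPreprocessingPipeline()
# Fit incrementally, 10,000 rows at a time, so the full file never sits in memory
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    pipeline.partial_fit(chunk)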
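
To show what incremental fitting involves, the sketch below uses scikit-learn's StandardScaler, whose partial_fit method updates a running mean and variance one chunk at a time. It is a stand-in only: whether DataPreprocessingPipeline.partial_fit works the same way internally is an assumption, and the in-memory CSV with columns duration and packets is made up for the example.

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small in-memory CSV standing in for large_file.csv.
csv_text = "duration,packets\n" + "\n".join(f"{i},{i * 2}" for i in range(100))

# First pass: learn the statistics chunk by chunk.
scaler = StandardScaler()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    scaler.partial_fit(chunk)  # updates running mean and variance

# Second pass: transform chunk by chunk with the statistics learned above.
scaled_chunks = [
    scaler.transform(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25)
]
scaled = np.vstack(scaled_chunks)

print(scaler.mean_)
print(scaled.shape)
```

Two passes over the file are needed because no chunk can be scaled correctly until the statistics from every chunk are known.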

Missing Data Files

Download the DDoS dataset from Kaggle and place it in the data/ directory.
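
A quick way to confirm the files landed in the right place is to list what is in data/ before opening a notebook. The helper below is a sketch using only the standard library; the function name find_csv_files is hypothetical, and it assumes the dataset is distributed as CSV files placed directly inside data/.

```python
from pathlib import Path


def find_csv_files(data_dir="data"):
    """Return the CSV files found directly inside data_dir, sorted by name."""
    root = Path(data_dir)
    if not root.is_dir():
        return []  # directory missing: nothing downloaded yet
    return sorted(root.glob("*.csv"))


csv_files = find_csv_files()
if csv_files:
    for path in csv_files:
        print(f"found {path} ({path.stat().st_size} bytes)")
else:
    print("No CSV files in data/; download the dataset from Kaggle first.")
```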

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for:

Development setup
Code style guidelines
Testing procedures
Pull request process

License

This project is open source. See the LICENSE file for details.
