Merged
Commits
67 commits
94cc438
Remove unnecessary dockerfiles
stefanDeveloper Sep 26, 2025
98482ac
Fix linting
stefanDeveloper Sep 26, 2025
c7738bd
Remove fill_levels insertion in collector.py
lamr02n Sep 29, 2025
b2c587b
Remove LogCollector fill level panel from dashboard and edit remainin…
lamr02n Sep 29, 2025
42c2f85
Edit LogCollector fill level panels from dashboard (Overview)
lamr02n Sep 29, 2025
f7b0794
Small updates in Log Volumes dashboard
lamr02n Sep 29, 2025
24086fd
Small updates in Overview dashboard
lamr02n Sep 29, 2025
b558c43
Merge pull request #91 from stefanDeveloper/90-log-volume-collector-f…
stefanDeveloper Sep 29, 2025
66e178c
Update pipeline.rst for Stage 1: Log Storage
lamr02n Sep 30, 2025
feb9e6d
Update docstrings for server.py
lamr02n Sep 30, 2025
9f5cf3f
Add developer guide section to readthedocs for better structure
maldwg Sep 30, 2025
cbff8a6
Fix small typo
maldwg Sep 30, 2025
56f02d1
Merge pull request #94 from stefanDeveloper/documentation/rc1-develop…
stefanDeveloper Sep 30, 2025
42b5834
Update docstrings for server.py
lamr02n Oct 1, 2025
7d0b48c
Update docstrings for server.py (2)
lamr02n Oct 1, 2025
1c69201
Update docstrings for server.py (3)
lamr02n Oct 1, 2025
149105a
Merge remote-tracking branch 'origin/v1.0.0-rc1' into v1.0.0-rc1
lamr02n Oct 1, 2025
720b36f
Update docstrings for server.py (4)
lamr02n Oct 1, 2025
147cf95
Update docstrings for collector.py
lamr02n Oct 1, 2025
1bf7fe0
Update docstrings for batch_handler.py
lamr02n Oct 2, 2025
b181763
Update pipeline.rst for Stage 2: Log Collection
lamr02n Oct 2, 2025
7741f3f
Update pipeline.rst for Stage 2: Log Collection (2)
lamr02n Oct 2, 2025
4b13a7a
Update pipeline.rst for Stage 2: Log Collection (2)
lamr02n Oct 2, 2025
8c016dd
Merge remote-tracking branch 'origin/v1.0.0-rc1' into v1.0.0-rc1
lamr02n Oct 2, 2025
1db94ad
Update pipeline.rst for Stage 2: Log Collection (3)
lamr02n Oct 2, 2025
a217bf2
Update docker compose
stefanDeveloper Oct 2, 2025
0b98e95
Update docker compose changes
stefanDeveloper Oct 2, 2025
d95b611
Fix linting
stefanDeveloper Oct 2, 2025
8edbc3c
Update banner
stefanDeveloper Oct 2, 2025
21af55d
Update quality
stefanDeveloper Oct 2, 2025
af7383c
Update readthedocs
stefanDeveloper Oct 2, 2025
a5c28d2
Update logline format description in configuration.rst
lamr02n Oct 6, 2025
1abf463
Update pipeline.rst for Stage 3: Log Filtering
lamr02n Oct 6, 2025
f22cd30
Update references and underlines in configuration.rst and pipeline.rst
lamr02n Oct 6, 2025
5941c39
Update docstrings for prefilter.py
lamr02n Oct 6, 2025
ceac86d
Update docstrings for logline_handler.py
lamr02n Oct 6, 2025
5d23967
Update docstrings for clickhouse_kafka_sender.py
lamr02n Oct 6, 2025
fe5f545
Update docstrings for utils.py
lamr02n Oct 6, 2025
df686be
Update docstrings for log_config.py
lamr02n Oct 6, 2025
e9db495
Update docstrings for logline_handler.py
lamr02n Oct 6, 2025
ce68ef9
Update docstrings for kafka_handler.py
lamr02n Oct 6, 2025
89a372b
Update docstrings for inspector.py
lamr02n Oct 6, 2025
174eb0f
Update docstrings for detector.py
lamr02n Oct 6, 2025
c813b29
Update docstrings for clickhouse_batch_sender.py
lamr02n Oct 6, 2025
ff1916f
Update docstrings for monitoring_agent.py
lamr02n Oct 6, 2025
28eb2a5
Handle all sphinx warnings
lamr02n Oct 6, 2025
2783a52
Create global variable to make mock_logs.dev.py more adjustable
lamr02n Oct 8, 2025
2b5de4b
Refactor and update README.md
lamr02n Oct 8, 2025
e304d3f
Update README.md (2)
lamr02n Oct 8, 2025
7e7641f
Update docstrings for inspector.py
lamr02n Oct 13, 2025
052fdce
Update docstrings for detector.py
lamr02n Oct 13, 2025
e0e2df7
Update docstrings for dataset.py
lamr02n Oct 13, 2025
a587b99
Update docstrings for explainer.py
lamr02n Oct 13, 2025
ad4cbc7
Update docstrings for feature.py
lamr02n Oct 13, 2025
a895c55
Update docstrings for model.py
lamr02n Oct 13, 2025
5de1114
Fix argument in model.py
lamr02n Oct 13, 2025
84957b2
Update docstrings for train.py
lamr02n Oct 13, 2025
017b04c
Optimize imports for src/train files
lamr02n Oct 13, 2025
823d925
Small docstring fix
lamr02n Oct 14, 2025
1a0b625
Small docstring fixes (2)
lamr02n Oct 14, 2025
e2cafcb
Fix detector feature calculation
stefanDeveloper Oct 14, 2025
69fb4ad
Fix linting
stefanDeveloper Oct 14, 2025
3f6cc5e
Adapt config.yaml to point at external kafka APIs
maldwg Oct 14, 2025
8b0ae07
Merge branch 'v1.0.0-rc1' of github.com:stefanDeveloper/heiDGAF into …
maldwg Oct 14, 2025
f486aad
Update inspector and detector docu
stefanDeveloper Oct 14, 2025
b96543b
Remove too detailed information in pipeline.rst
lamr02n Oct 17, 2025
12c5387
Update Inspector usage section in pipeline.rst
lamr02n Oct 17, 2025
225 changes: 124 additions & 101 deletions README.md
@@ -28,8 +28,6 @@
<a href="https://heidgaf.readthedocs.io/en/latest/"><strong>Explore the docs »</strong></a>
<br />
<br />
<a href="https://mybinder.org/v2/gh/stefanDeveloper/heiDGAF-tutorials/HEAD?labpath=demo_notebook.ipynb">View Demo</a>
·
<a href="https://github.com/stefanDeveloper/heiDGAF/issues/new?labels=bug&template=bug-report---.md">Report Bug</a>
·
<a href="https://github.com/stefanDeveloper/heiDGAF/issues/new?labels=enhancement&template=feature-request---.md">Request Feature</a>
@@ -58,23 +56,78 @@

## About the Project

![Pipeline overview](https://raw.githubusercontent.com/stefanDeveloper/heiDGAF/main/docs/media/pipeline_overview.png?raw=true)
![Pipeline overview](https://raw.githubusercontent.com/stefanDeveloper/heiDGAF/main/docs/media/heidgaf_overview_detailed.drawio.png?raw=true)

## Getting Started

If you want to use heiDGAF, just use the provided Docker compose to quickly bootstrap your environment:
#### Run **heiDGAF** using Docker Compose:

```
docker compose -f docker/docker-compose.yml up
```
```sh
HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml up
```
<p align="center">
<img src="https://raw.githubusercontent.com/stefanDeveloper/heiDGAF/main/assets/terminal_example.gif?raw=true" alt="Terminal example"/>
</p>

## Exemplary Dashboards
In the summary below you will find exemplary views of the Grafana dashboards. The metrics were obtained using the [mock-generator](./docker/docker-compose.send-real-logs.yml).
#### Or run the modules locally on your machine:
```sh
python -m venv .venv
source .venv/bin/activate

sh install_requirements.sh
```
Alternatively, you can install all needed requirements individually with `pip install -r requirements.*.txt`.

Now, you can start each stage, e.g. the inspector:

```sh
python src/inspector/inspector.py
```

<p align="right">(<a href="#readme-top">back to top</a>)</p>


## Usage

### Configuration

To configure **heiDGAF** according to your needs, use the provided `config.yaml`.

The most relevant settings concern your specific log line format, the model you want to use, and possibly your
infrastructure.

The section `pipeline.log_collection.collector.logline_format` has to be adjusted to reflect your specific input log
line format. Using our adjustable and flexible log line configuration, you can rename, reorder and fully configure each
field of a valid log line. Freely define timestamps, RegEx patterns, lists, and IP addresses. For example, your
configuration might look as follows:

```yml
- [ "timestamp", Timestamp, "%Y-%m-%dT%H:%M:%S.%fZ" ]
- [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
- [ "client_ip", IpAddress ]
- [ "dns_server_ip", IpAddress ]
- [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,63}$' ]
- [ "record_type", ListItem, [ "A", "AAAA" ] ]
- [ "response_ip", IpAddress ]
- [ "size", RegEx, '^\d+b$' ]
```
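Conceptually, each entry maps a field name to a type and optional constraints. A minimal sketch of how such a
specification could be checked against incoming values (illustrative only; `validate_field` is a hypothetical helper,
the project's real logic lives in its logline handler):

```python
import ipaddress
import re
from datetime import datetime

def validate_field(field_config, value):
    """Check a raw value against one entry of the logline_format list.

    field_config mirrors the YAML shape: [name, type, *constraints].
    Raises ValueError if the value does not satisfy the configured type.
    """
    field_type = field_config[1]
    if field_type == "Timestamp":
        datetime.strptime(value, field_config[2])  # format string, e.g. "%Y-%m-%dT%H:%M:%S.%fZ"
    elif field_type == "IpAddress":
        ipaddress.ip_address(value)                # accepts IPv4 and IPv6, raises ValueError otherwise
    elif field_type == "ListItem":
        if value not in field_config[2]:           # list of allowed values
            raise ValueError(f"{value!r} not in allowed list {field_config[2]}")
    elif field_type == "RegEx":
        if not re.fullmatch(field_config[2], value):
            raise ValueError(f"{value!r} does not match configured pattern")
    return value

# One field of the example configuration above
status = ["status_code", "ListItem", ["NOERROR", "NXDOMAIN"], ["NXDOMAIN"]]
print(validate_field(status, "NXDOMAIN"))  # → NXDOMAIN
```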

The options `pipeline.data_inspection` and `pipeline.data_analysis` are relevant for configuring the model. The section
`environment` can be fine-tuned to prevent naming collisions for Kafka topics and adjust addressing in your environment.

For more in-depth information on your options, have a look at our
[official documentation](https://heidgaf.readthedocs.io/en/latest/usage.html), where we provide tables explaining all
values in detail.

### Monitoring
To monitor the system and observe its real-time behavior, multiple Grafana dashboards have been set up.

Have a look at the following pictures showing examples of how these dashboards might look at runtime.

<details>
<summary>📊 <strong>Overview Dashboard</strong></summary>
<summary><strong>Overview</strong> dashboard</summary>

Contains the most relevant information on the system's runtime behavior, its efficiency, and its effectiveness.

<p align="center">
<a href="./assets/readme_assets/overview.png">
@@ -85,7 +138,10 @@ In the below summary you will find examplary views of the grafana dashboards. Th
</details>

<details>
<summary>📈 <strong>Latencies Dashboard</strong></summary>
<summary><strong>Latencies</strong> dashboard</summary>

Presents information on latencies, including comparisons between the modules as well as more detailed,
stand-alone metrics.

<p align="center">
<a href="./assets/readme_assets/latencies.jpeg">
@@ -96,7 +152,11 @@ In the below summary you will find examplary views of the grafana dashboards. Th
</details>

<details>
<summary>📉 <strong>Log Volumes Dashboard</strong></summary>
<summary><strong>Log Volumes</strong> dashboard</summary>

Presents information on the fill levels of each module, i.e. the number of entries currently held by the
module for processing. Includes comparisons between the modules, detailed stand-alone metrics, and
total numbers of logs entering the pipeline or being marked as fully processed.
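Conceptually, a module's fill level is just the size of its internal buffer. A toy sketch of the idea (illustrative
only, not the project's actual metric code):

```python
from collections import deque

class ModuleBuffer:
    """Toy buffer exposing its fill level, i.e. the number of
    entries currently waiting to be processed."""

    def __init__(self):
        self._entries = deque()

    def add(self, entry):
        self._entries.append(entry)

    def process_one(self):
        return self._entries.popleft()

    @property
    def fill_level(self):
        return len(self._entries)

buf = ModuleBuffer()
for line in ["log1", "log2", "log3"]:
    buf.add(line)
buf.process_one()
print(buf.fill_level)  # → 2
```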

<p align="center">
<a href="./assets/readme_assets/log_volumes.jpeg">
@@ -107,7 +167,9 @@ In the below summary you will find examplary views of the grafana dashboards. Th
</details>

<details>
<summary>🚨 <strong>Alerts Dashboard</strong></summary>
<summary><strong>Alerts</strong> dashboard</summary>

Presents details on the number of logs detected as malicious, including the IP addresses responsible for those alerts.

<p align="center">
<a href="./assets/readme_assets/alerts.png">
@@ -118,7 +180,12 @@ In the below summary you will find examplary views of the grafana dashboards. Th
</details>

<details>
<summary>🧪 <strong>Dataset Dashboard</strong></summary>
<summary><strong>Dataset</strong> dashboard</summary>

This dashboard is only active in **_datatest_** mode. Users who want to test their own models can use it
to inspect confusion matrices on testing data.

> This feature is in a very early development stage.
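Conceptually, the confusion matrices shown here compare predicted and true labels on the test data; with
scikit-learn this might look as follows (illustrative example, not tied to the project's exact data format):

```python
from sklearn.metrics import confusion_matrix

# True vs. predicted classes for a handful of test domains
y_true = ["benign", "malicious", "malicious", "benign", "malicious"]
y_pred = ["benign", "malicious", "benign", "benign", "malicious"]

# Rows are true labels, columns are predicted labels, in the given order
cm = confusion_matrix(y_true, y_pred, labels=["benign", "malicious"])
print(cm)
```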

<p align="center">
<a href="./assets/readme_assets/datatests.png">
@@ -128,131 +195,87 @@ In the below summary you will find examplary views of the grafana dashboards. Th

</details>


### Developing

Install all Python requirements:

```sh
python -m venv .venv
source .venv/bin/activate

sh install_requirements.sh
```

Alternatively, you can install all needed requirements individually with `pip install -r requirements.*.txt`.

Now, you can start each stage, e.g. the inspector:

```sh
python src/inspector/main.py
```
<p align="right">(<a href="#readme-top">back to top</a>)</p>

### Configuration

The following table lists the most important configuration parameters with their respective default values.
The full list of configuration parameters is available in the [documentation](https://heidgaf.readthedocs.io/en/latest/usage.html).

| Path | Description | Default Value |
| :----------------------------------------- | :-------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------- |
| `pipeline.data_inspection.inspector.mode` | Mode of operation for the data inspector. | `univariate` (options: `multivariate`, `ensemble`) |
| `pipeline.data_inspection.inspector.ensemble.model` | Model to use when inspector mode is `ensemble`. | `WeightEnsemble` |
| `pipeline.data_inspection.inspector.ensemble.module` | Module name for the ensemble model. | `streamad.process` |
| `pipeline.data_inspection.inspector.models` | List of models to use for data inspection (e.g., anomaly detection). | Array of model definitions (e.g., `{"model": "ZScoreDetector", "module": "streamad.model", "model_args": {"is_global": false}}`)|
| `pipeline.data_inspection.inspector.anomaly_threshold` | Threshold for classifying an observation as an anomaly. | `0.01` |
| `pipeline.data_analysis.detector.model` | Model to use for data analysis (e.g., DGA detection). | `rf` (Random Forest) option: `XGBoost` |
| `pipeline.data_analysis.detector.checksum` | Checksum for the model file to ensure integrity. | `021af76b2385ddbc76f6e3ad10feb0bb081f9cf05cff2e52333e31040bbf36cc` |
| `pipeline.data_analysis.detector.base_url` | Base URL for downloading the model if not present locally. | `https://heibox.uni-heidelberg.de/d/0d5cbcbe16cd46a58021/` |
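The checksum allows verifying the integrity of a downloaded model file. A sketch of such a check (the file name is
hypothetical; the verification helper is not part of the project's public API):

```python
import hashlib

def sha256_of(path):
    """Compute the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the configured pipeline.data_analysis.detector.checksum
expected = "021af76b2385ddbc76f6e3ad10feb0bb081f9cf05cff2e52333e31040bbf36cc"
# if sha256_of("rf_model.pickle") != expected:  # hypothetical file name
#     raise RuntimeError("model file failed integrity check")
```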

<p align="right">(<a href="#readme-top">back to top</a>)</p>

### Insert test data
## Models and Training

> [!IMPORTANT]
> To be able to train and test our or your own models, you will need to download the datasets.
To train and test our models, and possibly your own, we currently rely on the following datasets:

For training our models, we currently rely on the following datasets:
- [CICBellDNS2021](https://www.unb.ca/cic/datasets/dns-2021.html)
- [DGTA Benchmark](https://data.mendeley.com/datasets/2wzf9bz7xr/1)
- [DNS Tunneling Queries for Binary Classification](https://data.mendeley.com/datasets/mzn9hvdcxg/1)
- [UMUDGA - University of Murcia Domain Generation Algorithm Dataset](https://data.mendeley.com/datasets/y8ph45msv8/1)
- [Real-CyberSecurity-Datasets](https://github.com/gfek/Real-CyberSecurity-Datasets/)
- [DGArchive](https://dgarchive.caad.fkie.fraunhofer.de/)

However, we compute all features separately and only rely on the `domain` and `class`.
Currently, we are only interested in binary classification, thus, the `class` is either `benign` or `malicious`.
We compute all features separately and only rely on the `domain` and `class` for binary classification.
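As an illustration, lexical features such as these can be derived from the domain string alone (a simplified
sketch; the project's actual feature set is defined in `src/train/feature.py` and may differ):

```python
import math
from collections import Counter

def domain_features(domain: str) -> dict:
    """Derive simple lexical features from a domain name."""
    counts = Counter(domain)
    total = len(domain)
    # Shannon entropy of the character distribution; DGA domains
    # often score higher than human-chosen names
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "length": total,
        "num_labels": domain.count(".") + 1,
        "digit_ratio": sum(ch.isdigit() for ch in domain) / total,
        "entropy": entropy,
    }

print(domain_features("example.com")["length"])  # → 11
```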

After downloading the dataset and storing it under `<project-root>/data` you can run
```
docker compose -f docker/docker-compose.send-real-logs.yml up
```
to start inserting the dataset traffic.
### Inserting Data for Testing

<p align="right">(<a href="#readme-top">back to top</a>)</p>
For testing purposes, we provide multiple scripts in the `scripts` directory. Use `real_logs.dev.py` to send data from
the datasets into the pipeline. After downloading the dataset and storing it under `<project-root>/data`, run
```sh
python scripts/real_logs.dev.py
```
to start continuously inserting dataset traffic.

### Training Your Own Models

### Train your own models
> [!IMPORTANT]
> This is only a brief wrap-up of a custom training process.
> We highly encourage you to have a look at the [documentation](https://heidgaf.readthedocs.io/en/latest/training.html)
> for a full description and explanation of the configuration parameters.

Currently, we feature two trained models, namely XGBoost and RandomForest.
We feature two trained models:
1. XGBoost (`src/train/model.py#XGBoostModel`) and
2. RandomForest (`src/train/model.py#RandomForestModel`).

After installing the requirements, use `src/train/train.py`:

```sh
python -m venv .venv
source .venv/bin/activate

pip install -r requirements/requirements.train.txt
```

```sh
> python src/train/train.py
Usage: train.py [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help  Show this message and exit.

Commands:
  explain
  test
  train
```

Setting up the [dataset directories](#insert-test-data) (and adding the code for your model class if applicable) lets you start
the training process by running the following commands:
#### Model Training

```sh
> python src/train/train.py train --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name>
```
The results will be saved to `./results` by default, unless configured otherwise.

### Data

> [!IMPORTANT]
> We support custom schemes.

Depending on your data and use case, you can customize the data scheme to fit your needs.
The configuration below is part of the [main configuration file](./config.yaml), which is detailed in our [documentation](https://heidgaf.readthedocs.io/en/latest/usage.html#id2):
```yml
loglines:
  fields:
    - [ "timestamp", RegEx, '^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$' ]
    - [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
    - [ "client_ip", IpAddress ]
    - [ "dns_server_ip", IpAddress ]
    - [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,63}$' ]
    - [ "record_type", ListItem, [ "A", "AAAA" ] ]
    - [ "response_ip", IpAddress ]
    - [ "size", RegEx, '^\d+b$' ]
```

#### Model Tests

```sh
> python src/train/train.py test --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>
```

#### Model Explain

```sh
> python src/train/train.py explain --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>
```
This creates a `rules.txt` file describing the model's internals, i.e. the decision rules it learned.
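For tree-based models such as the ones featured here, a comparable rules dump can be produced with
scikit-learn's text export; the following is an illustrative sketch on a toy dataset, not the project's actual
`explain` implementation:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a small tree on a toy dataset and dump its decision rules
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = export_text(
    clf, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
)
with open("rules.txt", "w") as f:
    f.write(rules)

print(rules)  # nested "|---" lines showing thresholds and leaf classes
```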

<p align="right">(<a href="#readme-top">back to top</a>)</p>


<!-- CONTRIBUTING -->
## Contributing

Binary file added assets/heidgaf_logo_github.png
6 changes: 3 additions & 3 deletions config.yaml
@@ -70,11 +70,11 @@ pipeline:

environment:
kafka_brokers:
- hostname: kafka1
- hostname: 127.0.0.1
port: 8097
- hostname: kafka2
- hostname: 127.0.0.1
port: 8098
- hostname: kafka3
- hostname: 127.0.0.1
port: 8099
kafka_topics:
pipeline:
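For reference, a brokers list like this typically gets flattened into the comma-separated bootstrap string most
Kafka clients expect; a small illustrative sketch (the helper name is hypothetical):

```python
def bootstrap_servers(brokers):
    """Build a Kafka bootstrap string from the environment.kafka_brokers list."""
    return ",".join(f"{b['hostname']}:{b['port']}" for b in brokers)

# Mirrors the environment.kafka_brokers section of config.yaml
brokers = [
    {"hostname": "127.0.0.1", "port": 8097},
    {"hostname": "127.0.0.1", "port": 8098},
    {"hostname": "127.0.0.1", "port": 8099},
]
print(bootstrap_servers(brokers))  # → 127.0.0.1:8097,127.0.0.1:8098,127.0.0.1:8099
```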
Empty file removed data/.gitkeep
Empty file.
Empty file removed data/cic/.gitkeep
Empty file.
26 changes: 0 additions & 26 deletions data/cic/cic_dns_decode.py

This file was deleted.

Empty file removed data/dgta/.gitkeep
Empty file.
17 changes: 0 additions & 17 deletions data/dgta/dgta_decode.py

This file was deleted.

17 changes: 0 additions & 17 deletions docker/benchmark_tests/Dockerfile.run_test

This file was deleted.
