Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

MalPaCA (Malware Packet-sequence Clustering and Analysis)

This repo contains the implementation of MalPaCA (Malware Packet-sequence Clustering and Analysis). Roughly, it does the following:

  1. It takes as input a pcap file or a folder containing pcap files
  2. It reads each file and splits pcaps into uni-directional connections
  3. Each connection is represented as 4 sequential features, i.e. packet sizes (bytes), inter-arrival-times (ms), source ports, dest ports
  4. Then, for each feature, it computes the pairwise distance between all connections and stores them in their respective distance matrix
  5. The distance matrices are combined using a simple weighted average, where all features have equal weights
  6. The final diatcne matrix goes as input into HDBScan clustering algorithm
  7. The final clusters are post-processed and printed out as a .csv and in their respective temporal heatmaps.

Required packages:

pip install dpkt, statsmodels, cython, dtw-python, fastdtw, hdbscan

sudo apt-get install r-base-dev (If you want to use the speedup)

Usage (for clustering):

python {file|folder} {path/to/file|path/to/folder} {experiment-name} {sequence-length-threshold} {speedup}

{file|folder} : Choose one option for either one pcap or multiple in a folder

{experiment-name} : Name of your experiments. It will be used to name the generated artifacts, i.e. distance matrices, tsne plots, clusters.

{sequence-length-threshold} : Length of feature sequences

{speedup} : Boolean depending on whether to use R's parallel DTW library or not

Usage (for re-clustering):

python {path/to/model.pkl} {file|folder} {path/to/file|path/to/folder} {experiment-name} {sequence-length-threshold}

{path/to/model.pkl} : Path to the model file of the prior experiment, to which you want to re-cluster new sequences

{file|folder} : (For the new experiment) Choose one option for either one pcap or multiple in a folder

{path/to/file|path/to/folder}: (For the new experiment) Path to .pcap file (if file selected) or folder containing .pcap files (if folder selected)

{experiment-name} : (For the new experiment) Name of your new experiments. It will be used to name the generated artifacts, i.e. distance matrices, tsne plots, clusters.

{sequence-length-threshold} : (For the new experiment) Length of feature sequences, i.e. number of packets per sequence to consider for clustering


  1. ngram order : Size of the sliding window for port numbers. Can be found as ngrams.append(zip(x,x+1,x+2)). Default: 3
  2. sequence-length-threshold (thresh) : Length of feature sequences. Default: 20
  3. min_cluster_size : Minimum cluster size. The smaller, the better. Default: 7
  4. min_samples : Size of radius around each point to consider nearest neighbor. The smaller, the higher the noise samples. Default: 7
  5. speedup : True | False depending on whether you want to speed up the code using R's parallel DTW computation library


  • If connection length falls less than 3 packets, this code will fail (due to trigram calculation).
  • If number of clusters are 0 for some reason, this code will fail.

If you use MalPaCA in a scientific work, consider citing the following paper:

  title={Beyond Labeling: Using Clustering to Build Network Behavioral Profiles of Malware Families},
  author={Nadeem, Azqa and Hammerschmidt, Christian and Ga{\~n}{\'a}n, Carlos H and Verwer, Sicco},
  journal={Malware Analysis Using Artificial Intelligence and Deep Learning},

Azqa Nadeem

TU Delft


No releases published


No packages published