This repository contains the ALL dataset, which includes the edges from all
600 benign and attack scenario graphs. The YDC and GFC datasets can be derived
from ALL by picking the graph ID's whose scenarios are as follows:
YDC: YouTube, Download, CNN
GFC: GMail, VGame, CNN
The dataset is a tab-separated file with one edge per line, in the following format:
source-id source-type destination-id destination-type edge-type graph-id
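A minimal sketch of reading one line of this format in Python. It assumes the six fields appear in exactly the order shown and that node and graph ID's are integers. The names `Edge` and `parse_edge` are illustrative and not part of this repository.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One edge of the dataset (field names are illustrative)."""
    source_id: int
    source_type: str
    dest_id: int
    dest_type: str
    edge_type: str
    graph_id: int


def parse_edge(line: str) -> Edge:
    """Parse one tab-separated edge line into an Edge.

    Assumption: ID's are integers and types are short strings.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    src, src_type, dst, dst_type, edge_type, graph_id = fields
    return Edge(int(src), src_type, int(dst), dst_type, edge_type, int(graph_id))
```

A whole file can then be read lazily with `(parse_edge(l) for l in open(path))`, where `path` points to the edge file.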
Graph ID's correspond to scenarios as follows:
- YouTube (graph ID's 0 - 99)
- GMail (graph ID's 100 - 199)
- VGame (graph ID's 200 - 299)
- Drive-by-download attack (graph ID's 300 - 399)
- Download (graph ID's 400 - 499)
- CNN (graph ID's 500 - 599)
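Since each scenario occupies a block of 100 consecutive graph ID's, the scenario can be recovered by integer division, and a subset such as YDC or GFC can be filtered out of ALL. The following is a hedged sketch, not a script shipped with this repository: `scenario_of` and `filter_subset` are hypothetical names, and the subsets contain only the scenarios named in the YDC and GFC lists above.

```python
# Scenario names in graph-ID order: block i covers ID's 100*i .. 100*i + 99.
SCENARIOS = [
    "YouTube",
    "GMail",
    "VGame",
    "Drive-by-download attack",
    "Download",
    "CNN",
]

# Subsets as listed in this README (hypothetical constant name).
SUBSETS = {
    "YDC": {"YouTube", "Download", "CNN"},
    "GFC": {"GMail", "VGame", "CNN"},
}


def scenario_of(graph_id: int) -> str:
    """Return the scenario name for a graph ID in 0..599."""
    if not 0 <= graph_id <= 599:
        raise ValueError(f"graph ID out of range: {graph_id}")
    return SCENARIOS[graph_id // 100]


def filter_subset(lines, subset: str):
    """Yield only the edge lines whose graph belongs to the given subset.

    Assumes the graph ID is the sixth tab-separated field of each line.
    """
    wanted = SUBSETS[subset]
    for line in lines:
        graph_id = int(line.rstrip("\n").split("\t")[5])
        if scenario_of(graph_id) in wanted:
            yield line
```

For example, `filter_subset(open(path), "GFC")` would stream the GFC edges from an ALL edge file at `path`.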
The ALL dataset was extracted from the raw flow-graph data using preprocess.py,
which performs the following:
- Each node and edge type is mapped to a single character.
- Consecutive edges between the same pair of nodes corresponding to block-by-block file reads are collapsed into a single edge.
- Node ID's are incremented by 1 (so ID's of
- The timestamp field is removed (raw edges are sorted by timestamp).
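The steps above can be illustrated with the following sketch. It is not the actual script. The raw edge layout `(timestamp, src_id, src_type, dst_id, dst_type, edge_type)`, the `"READ"` label for file reads, and the function name `preprocess_sketch` are all assumptions made for illustration, and the graph ID is left out for brevity.

```python
import string


def preprocess_sketch(raw_edges, read_type="READ"):
    """Apply the four listed steps to raw edges sorted by timestamp.

    raw_edges: iterable of (timestamp, src_id, src_type, dst_id, dst_type,
    edge_type) tuples. This layout is assumed, not taken from the raw data.
    """
    alphabet = string.ascii_letters + string.digits  # supports up to 62 types
    codes = {}

    def code(name):
        # Map each distinct type name to a single character, in first-seen order.
        if name not in codes:
            codes[name] = alphabet[len(codes)]
        return codes[name]

    out = []
    prev_key = None
    for _timestamp, src, src_type, dst, dst_type, edge_type in raw_edges:
        key = (src, dst, edge_type)
        # Collapse a run of consecutive reads between the same pair of nodes.
        if edge_type == read_type and key == prev_key:
            continue
        prev_key = key
        # Shift node ID's by 1 and drop the timestamp.
        out.append((src + 1, code(src_type), dst + 1, code(dst_type), code(edge_type)))
    return out
```

Because only consecutive duplicates are dropped, a read that follows a different edge between the same nodes is kept as a new edge.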
preprocess.py is run as:
python preprocess.py <raw edges file>