Data-Mining/Assignment 3 - Graph Classification at master · Aditi-Singla/Data-Mining

Readme.md

Aditi Singla	Ankush Phulia	Vaibhav Bhagee
2014CS50277	2014CS50279	2014CS50297

To run -

Comparison between gSpan, FSG and Gaston

python scripts/plot.py <input-data> --supports <list of frequency thresholds>

Generation of train/test set for classification

./scripts/classify.sh <train-data> <active-ids> <inactive-ids> <test-data>

Comparision for execution time for gSpan, fsg and gaston at support thresholds 5, 10, 25, 50, 95:

The running time increases with decrease in support threshold.
This increase is greater for gSpan and FSG than for Gaston. The former two show an almost exponential increase, for the latter, it is very less.
Gaston performs the best out of the three - requiring the minimum running time. FSG does the worst on the aido99_all dataset

Lowered support thresholds results in more frequent subgrahps being mined, and this increases exopnentially.
FSG follows a breadth-first strategy - generates all possible candidates at each level and prunes. This is generally slower than gSpan and Gaston, which follow a depth-first strategy - remove infrequent edges and avoid generating all candidates/pruning.
Gaston works the best as it improves the depth-first approach of gSpan by first mining frequent paths and trees, followed by trees with cycles and then the others.
- It follows a global order on cycle-closing edges and only generates those cycles that are bigger than the previous one
- Duplicate detection is carried out efficiently - first is hashing to perform a pre-sort and then an isomorphism test for final duplicate detection. This makes for very fast pattern