Skip to content
Crystal is a novel method for distributed subgraph matching on very large graphs. Crystal outperforms existing methods by several orders of magnitude on very large graphs. The work was published on VLDB 2018 with title "Subgraph matching: on compression and computation".
Branch: master
Clone or download
Pull request Compare This branch is even with zhanghao111111111:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

Crystal:A very fast distributed subgraph matching algorithm

This project contains the implemetation of the method described in Subgraph matching: on compression and computation published in VLDB2018.



Environment Requirement

Hadoop 2.7.3.

How to use?

1) preprocess datasets (in edge list form) by

run ""

2) perform counting

run "", which will count subgraph matching for all pattern listed.

3) generate compression form

run "", which will dump the the result of subgraph matching for all pattern listed to HDFS

Detail Specification

Source Code is in Subgraph folder, which can be open by Intellij Idea.

In it you can find three packages:

  • sAlg
    • Square
    • Chordal Square
    • House
      • HouseS: for small data graph
      • House:for large data graph
    • Chordal Roof
    • Three Triangle
    • Solar Square
    • Near5Clique
    • QuadTriangle
    • SolarSquarePlus
  • sData: data structure used in sAlg
  • sPreprocess: Preprocessing Related
    • sCliqueGeneration
      • classes used to generate the crystal
    • sTool
      • preprocessing the graph

Each of the patterns comes with two java class, one for counting, one for outputting(class with name ended with O). More detail please read the code.

Test File

Files related to testing is at Top level.

  • Data: small datasets for testing the algorithm, in which each dataset is attached with a *-pattern-size.txt recording the pattern's size in each graph for each pattern.
  • Preprocessing the Data for and
  • Testing the code related to counting in Crystal
  • Testing the code related to writing in Crystal
  • pattern.eps: give a visulization about what pattern looks like

Please run first, before running the other two shell script. More detail please read the .sh file, and change some of the variables accordingly in the script file.


  • mapreduce.job.reduces: better using the same number as vCore in yarn
  • mapreduce.reduce.memory.mb: the default is 4000MB, if graph bigger than UK2002 need to be processed, this might need to increase.
  • the default is 4000MB, if graph bigger than UK2002 need to be processed, this might need to increase.
  • test.memory: same as But only available in sAlg.
  • bloomHash: controlling hash function used in Bloom filter
  • bloomBit: controlling numbers of bit for representing a record in Bloom filter
  • test.p
    • test.p (for classes in sIntersection): controlling the number of the round for the algorithm, the lower the better (if memory permit)
    • test.p (for classes in sPartition)L controlling the number of partitions for the algorithm, the lower the better (if memory permit)
  • test.isOutput: specify whether the result should be outputted in *O.class.
  • test.isEnumerating: whether just counting or enumerating.


  • InputFormat: Edge List
  • OutputFormat: file format specified in the paper.
    //Below three configuration is commented for maximum compatibility for the different version of Hadoop, if possible, they should be uncommented to improve the performance and compression ratio.
    //These configurations can be found in the source code.
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

Core-Crystal Decomposition

The pattern decomposition aims at optimzing Objective 3 in the paper.

This package provides decomposition.cpp which decomposes the pattern graph with/without the statistics of the cliques.

input specification Input the pattern graph to pattern.txt.
first line : n m b n: # of nodes, m: # of edges. Note that nodes are labeled with [1,n]. b: a boolean value of 0 or 1, specifying that statistics are used (1) or not (0) followed by m lines, each line x y describes an undirected edge (x,y)

followed by a line, if b=1,
k M k: the largest clique with statistics, k should be at least = the largest clique size in the pattern graph M: memory size followed by k lines, the i-th line specifies the total number of Clique_i in the target graph. Specifically, the first line is the # of nodes in the target graph and the second line is the # of edges in the target

Example input 1

4 4 0
1 2
1 3
2 4
Example input 2
4 4 1
1 2
1 3
2 4
3 4

2 20

Please refer to patternpool.txt for further example

For citatation

  title={Subgraph matching: on compression and computation},
  author={Qiao, Miao and Zhang, Hao and Cheng, Hong},
  journal={Proceedings of the VLDB Endowment},
  publisher={VLDB Endowment}

contact me at

You can’t perform that action at this time.