This project develops a parallel solution for mining maximal quasi-cliques that is able to fully utilize CPU cores. Our solution utilizes divide and conquer to decompose the overall mining workloads into independent tasks for parallel mining, and we addressed the problem of (i) drastic load imbalance among different tasks and (ii) difficulty in predicting the task running time and the time growth with task-subgraph size, by (a) using a timeout-based task decomposition strategy, and by (b) utilizing a priority task queue to schedule long-running tasks earlier for mining and decomposition to avoid stragglers.
We target a single-machine multi-core environment, since a distributed cluster is not always readily available to an average end user while most modern computers are multi-core. Our system achieved a close to ideal speedup ratio as illustrated in the figure below for the YouTube dataset.
Previously, we developed a distributed solution which can be found here, which is built on top of the G-thinker system.
- The
system
folder: it contains the code for our T-thinker engine, which is a task-based general-purpose framework for writing parallel programs. In the folder,worker.h
is the main thread that creates other computing threads (aka. compers) to work on tasks. When task queues are near empty, T-thinker will generate new tasks from data items to refill the queues; while if too many tasks are created (e.g. due to decomposing a big task), tasks will be spilled to local disk to keep memory bounded, and these tasks will be loaded back for processing when task queues have space. The figure below shows the tuned system parameters for our task queues:
- The
app_qc
folder: this is the application code for mining maximal quasi-cliques, which runs on top of T-thinker. The figure below shows an example quasi-clique for the Arxiv GR-QC dataset (arXiv collaboration network) found by our application code.
-
The
app_kernel
folder: this is the application code for kernel expansion, which uses the top-k maximal quasi-cliques found byapp_qc
as seeds, to grow maximal quasi-cliques with a smaller γ. This idea was originally proposed by Sanei-Mehri et al. to speed up mining of large quasi-cliques, and we improved the method to avoid redundant search space exploration. -
The
maximal_check
folder: This is the postprocessing step, used to remove non-maximal quasi-cliques from the output ofapp_qc
andapp_kernel
.
In each folder, app_qc
, app_kernel
and maximal_check
, there is a Makefile. Just enter each folder and use the command make
to compile, and a program named run
will be generated.
Workflow A: to Mine Maximal Quasi-Cliques Directly
-
Quasi-clique mining:
Run the program in the
app_qc
folder:app_qc/run [input_data] [thread_num] [gamma] [min_size] [time_split_threshold]
, where:- input_data: input graph file where the i-th row records the adjacency list of Vertex i
- thread_num: number of threads. We also call each computing thread a comper
- gamma: user-specified minimum degree-ratio threshold
- min_size: minimum size threshold; each returned result should have at least so many vertices
- time_split_threshold: timeout duration threshold. A task running longer than the threshold will decompose into subtasks
Example:
app_qc/run input_graph 5 0.8 10 5
-
Postprocessing:
- Each thread (Comper i) will write the results it finds to a file
output_i
- Aggregate all quasi-cliques outputs into one file:
cat output_* > results
- Remove non-maximal quasi-cliques:
maximal_check/quasiCliques results max_results
- Each thread (Comper i) will write the results it finds to a file
Workflow B: to Mine Maximal Quasi-Cliques Using Kernels
-
Mine large quasi-cliques by first mining dense parts that are faster to find using gamma', where gamma' > gamma, using
app_qc/run [input_data] [thread_num]
[gamma']
[min_size] [time_split_threshold]
-
Postprocessing:
- Aggregate all quasi-cliques outputs into one file:
cat output_* > result
- Remove non-maximal quasi-cliques:
maximal_check/quasiCliques results max_results
- Generate top-k Kernels:
sort -n -r -k 1 max_results > kernels
- Aggregate all quasi-cliques outputs into one file:
-
Kernel expansion:
- Using the generated kernels (from Step 2 above) and the original values of gamma and min_size to expand the results, by selecting the largest k' kernels. Run the program in the
app_kernel
folder:app_kernel/run [input_data] [thread_num] [gamma] [min_size] [time_split_threshold] [kernels]
[k_prime]
- Using the generated kernels (from Step 2 above) and the original values of gamma and min_size to expand the results, by selecting the largest k' kernels. Run the program in the
-
Top-k kernels
- Aggregate all quasi-clique outputs into one file:
cat output_* > results_expanded
- Remove non-maximal quasi-cliques:
maximal_check/quasiCliques results_expanded max_results_expanded
- Generate top-k Kernels:
sort -n -r -k 1 max_results_expanded > sorted
head -n k sorted > topk
- Aggregate all quasi-clique outputs into one file:
Click here for a demo on Google Colab. The notebook first clones the repo and download the Arxiv GR-QC dataset. It then runs the quasi-clique mining program to find maximal results. Finally, it plots the first and second largest quasi-cliques.
- C++11
- OpenMP
- Jalal Khalil (jalalk@uab.edu)
- Da Yan (yanda@uab.edu)
- Guimu Guo (guimuguo@uab.edu)
The authors are affiliated with the Department of Computer Science, University of Alabama at Birmingham