This repo is a framework for evaluating how well various graph sparsifiers preserve graph properties. If you find this repo useful, please cite our paper:
@inproceedings{chen2023sparsification,
  author    = {Yuhan Chen and
               Haojie Ye and
               Sanketh Vedula and
               Alex Bronstein and
               Ronald Dreslinski and
               Trevor Mudge and
               Nishil Talati},
  title     = {{Demystifying Graph Sparsification Algorithms in Graph Properties Preservation}},
  booktitle = {{50th International Conference on Very Large Databases (VLDB 2024)}},
  pages     = {},
  publisher = {ACM},
  year      = {2024}
}
We may use the terms sparsify and prune interchangeably in this doc.
├── config.json                 # config file for sparsification params
├── data                        # dataset raw files and pruned files, auto-created
│                               # graphs are stored in edgelist format:
│                               # uduw.el, duw.el, udw.wel, dw.wel mean
│                               # undirected-unweighted, directed-unweighted,
│                               # undirected-weighted, and directed-weighted
│                               # edgelist files, respectively
├── dataLoader                  # code for loading datasets
├── env.sh                      # bash file for setting PROJECT_HOME
├── experiments                 # output folder for GNN, auto-created
├── myLogger                    # logger lib
├── output_metric_parsed        # parsed metric evaluation output, auto-created when parsing output_metric_raw
├── output_metric_plot          # plot output, auto-created when plotting from output_metric_parsed
├── output_metric_raw           # raw metric evaluation output, auto-created when running eval
├── output_sparsifier_parsed    # parsed sparsifier output, auto-created when parsing output_sparsifier_raw
├── output_sparsifier_raw       # raw sparsifier output, auto-created when running sparsify
├── paper_fig                   # reproduced figures matching the ones in the paper
├── parser                      # parser code for parsing raw outputs into parsed output
├── plot                        # plotter code
├── setup.py                    # setup file for this sparsification lib
├── sparsifier                  # code for the ER sparsifier and some legacy sparsifiers
├── src
│   ├── ClusterGCN.py           # code for running ClusterGCN
│   ├── graph_reader.py         # helper code for reading graphs
│   ├── logger.py               # helper code for logging GNN runs
│   ├── main.py                 # entry point
│   ├── metrics_gt.py           # metric evaluators using graph-tool
│   ├── metrics_nk.py           # metric evaluators using NetworKit
│   └── sparsifiers.py          # lib invoking all sparsifiers
└── utils                       # helper functions and binaries
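For reference, here is a minimal sketch of loading these edgelist formats with NetworkX. This is a generic illustration, not the repo's own loader (src/graph_reader.py), and the file paths below are hypothetical; the actual locations under data depend on the dataset:

import networkx as nx

# Undirected-unweighted edgelist (uduw.el): one "src dst" pair per line.
# The path is illustrative only.
G = nx.read_edgelist("data/ego-Facebook/uduw.el", nodetype=int)

# Directed-weighted edgelist (dw.wel): one "src dst weight" triple per line.
Gw = nx.read_weighted_edgelist("data/ego-Facebook/dw.wel",
                               nodetype=int, create_using=nx.DiGraph)

print(G.number_of_nodes(), Gw.number_of_edges())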
Conda is recommended for managing the environment. To install the necessary packages:
- Install conda following link.
- Create an env named `spar` by running `conda env create --file env.yaml`.
- Activate the env by running `conda activate spar`.
- Set up the env by running `source env.sh`. (Run steps 3 and 4 every time a new terminal is started.)
- Install the current folder by running `pip install -e .` (make sure you type the dot (`.`) in the command).
- Install extra packages by running `pip install networkit ogb networkx rich setuptools==52.0.0`. The latest version of setuptools causes weird errors.
- Install `torch` following link. Use your own CUDA version. `pip` is recommended; `conda` had some issues for me.
- Install `PyG` following link. Use your own CUDA version. `pip` is recommended; `conda` had some issues for me.
- Install additional packages by running `pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html`, where `${TORCH}` and `${CUDA}` are the torch and CUDA versions you use; follow link for more details. (A quick sanity check for the torch/PyG install is sketched after this list.)
- Install julia following link.
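Once torch and PyG are installed, the following short check (a sketch; it only assumes both packages import cleanly) verifies the versions and CUDA visibility:

# Sanity check for the torch / PyG install.
import torch
import torch_geometric

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("PyG:", torch_geometric.__version__)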
- OS: We ran experiments on Ubuntu 20.04 LTS with Python 3.9.12. Other platforms may also work but have not been extensively tested.
- Memory: Depends on the size of the graphs.
- Storage: Graph sizes vary from ~MB to ~GB. However, to conduct end-to-end experiments for all sparsifiers, the required storage quickly explodes. Each graph is sparsified using 12 sparsifiers, each with 9 different prune rates, and some non-deterministic sparsifiers run 3-10 times to show variance. A directed/undirected, weighted/unweighted version (4 in total) of each graph may also be required for evaluating some metrics. Altogether, these factors lead to a 100x-1000x storage expansion over the original graph alone and can quickly reach the TB level (see the back-of-envelope sketch after this list). We recommend starting with small graphs first.
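To see where that multiplier comes from, here is a back-of-envelope count using the numbers quoted above (how the factors combine for a given graph is an assumption; actual sizes depend on the graph and prune rate):

# Worst-case count of sparsified files per graph, using the numbers above.
sparsifiers = 12   # sparsification algorithms
prune_rates = 9    # prune rates per sparsifier
runs = 10          # upper bound for non-deterministic sparsifiers (3-10 runs)
versions = 4       # directed/undirected x weighted/unweighted copies

files = sparsifiers * prune_rates * runs * versions
print(files)  # 4320; each pruned file is smaller than the original graph,
              # which is why the observed expansion is ~100x-1000x.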
cd $PROJECT_HOME/utils
mkdir bin
make
cd $PROJECT_HOME
python $PROJECT_HOME/utils/data_preparation.py --dataset_name [dataset_name/all]
This will download the data and do the necessary pre-processing. all will download all data; we recommend starting with small datasets. The datasets from small to large (by #edges) are: ego-Facebook (smallest), ca-HepPh, email-Enron, ca-AstroPh, com-Amazon, com-DBLP, web-NotreDame, ego-Twitter, web-Stanford, web-Google, web-BerkStan, human_gene2, ogbn-proteins, and Reddit (largest).
python $PROJECT_HOME/src/main.py --dataset_name [dataset_name/all] --mode [sparsify/eval/all/clean]
- --dataset_name indicates the dataset to use; use the name rather than the dataset path. all will run for all datasets. It is recommended NOT to use all unless you know what you are doing, because it can take a long time and a large amount of file space.
- --mode indicates what to run. sparsify runs all sparsifiers on the given dataset_name. eval assumes the sparsified files already exist and evaluates the performance of the sparsified graphs on all metrics; run eval only if you have already run sparsify on the given dataset. all runs sparsify and eval in tandem. clean deletes all files (raw graph, sparsified graphs, metric output, sparsifier output) associated with the given dataset_name.
To run at a finer granularity, e.g. if you want to run only a subset of sparsifiers and/or a subset of evaluation metrics, you need to modify the $PROJECT_HOME/src/main.py file; simply commenting out the lines for specific sparsifiers and evaluation metrics should do, as sketched below.
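Purely as an illustration (the call and sparsifier names below are hypothetical; comment out whichever actual calls appear in main.py):

# Hypothetical excerpt of src/main.py -- the names are illustrative only.
run_sparsifier(G, "random")                   # kept
# run_sparsifier(G, "local_degree")           # skipped: commented out
eval_metric(G_pruned, "degree_distribution")  # kept
# eval_metric(G_pruned, "page_rank")          # skipped: commented out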
By default, profiling for sparsifiers is not enabled. To enable profiling, go to $PROJECT_HOME/src/sparsifiers.py and enable the @profile decorator before each sparsifier function, as sketched below. In the meantime, make sure graphSparsifier() in main.py is called with multi_process=False, or it will fail.
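For illustration, enabling the decorator looks roughly like this (the sparsifier function name here is hypothetical; decorate whichever sparsifier functions you want to profile):

# Hypothetical excerpt of src/sparsifiers.py.
# Before: the decorator is disabled (commented out).
# @profile
# def random_sparsify(G, prune_rate): ...

# After: profiling enabled; requires multi_process=False in main.py.
@profile
def random_sparsify(G, prune_rate):
    ...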
The log for running sparsify will be in $PROJECT_HOME/output_sparsifier_raw/[dataset_name].txt, and the log for running eval will be in $PROJECT_HOME/output_metric_raw/[dataset_name]/[metric]/log.
Two scripts are provided for parsing the raw output: $PROJECT_HOME/parser/sparsifier_parse.py and $PROJECT_HOME/parser/metric_parse.py.
python $PROJECT_HOME/parser/{sparsifier_parse.py, metric_parse.py} --dataset_name [dataset_name]/all
As always, all will run for all datasets.
Two scripts are provided for plotting the parsed data: $PROJECT_HOME/plot/plot.py and $PROJECT_HOME/plot/paper_plot.py.
python $PROJECT_HOME/plot/plot.py --dataset_name [dataset_name]/all --metric [metric]/all
will plot the specified dataset for the specified metric for all sparsifiers and prune rates.
python $PROJECT_HOME/plot/paper_plot.py
will reproduce the figures used in the paper.
In this section, we give instructions to reproduce figures 4(c) and 11(b) in the paper. Other figures can also be reproduced, but due to the very long run time, we recommend reproducing results on the graph ego-Facebook (the smallest graph) first.
- Follow the instructions above to set up the environment and install the necessary dependencies.
- Run the following commands in order to regenerate figures 4(c) and 11(b) in the paper:
# download and pre-process ego-Facebook
python $PROJECT_HOME/utils/data_preparation.py --dataset_name ego-Facebook
# Run sparsification on ego-Facebook
python $PROJECT_HOME/src/main.py --dataset_name ego-Facebook --mode sparsify
# Run evaluation on ego-Facebook, this will take ~20 minutes
python $PROJECT_HOME/src/main.py --dataset_name ego-Facebook --mode eval
# parse the results generated in the evaluation step
python $PROJECT_HOME/parser/metric_parse.py --dataset_name ego-Facebook
# plot the results
python $PROJECT_HOME/plot/paper_plot.py
The generated plots will be in $PROJECT_HOME/paper_fig/. Note that the figures may differ slightly from the ones in the paper due to randomness in the sparsification and evaluation process, but the discrepancy should be minimal.
To reproduce the other figures in the paper, repeat the steps above but change ego-Facebook to another dataset, or to all to run all sparsifiers and evaluations on all datasets.
Warning: this will take a significantly long time and requires a lot of memory and storage space for large graphs.
Alternatively, we provide our experiment logs in output_archive.zip. To use them, run the following commands:
# unzip the archive
unzip output_archive.zip
# plot the results
python $PROJECT_HOME/plot/paper_plot.py
The generated plots will be in $PROJECT_HOME/paper_fig/. As before, the figures may differ slightly from the ones in the paper, but the discrepancy should be minimal.
The appendix file Appendix.pdf presents pseudocode for some of the sparsifiers, as well as full results for all metrics evaluated on all graphs. You can also reproduce the figures in the appendix by running the following commands:
# unzip the archive
unzip output_archive.zip
# plot the results
python $PROJECT_HOME/plot/plot.py --dataset_name all --metric all
The plots will be in $PROJECT_HOME/output_metric_plot/.