Skip to content

Files

Latest commit

0e4d51e · Dec 13, 2018

History

History
This branch is 25 commits ahead of gaolk/graph-database-benchmark:master.

tigergraph

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
Sep 18, 2018
Sep 18, 2018
Dec 13, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
Sep 18, 2018
############################################################
# Copyright (c)  2015-now, TigerGraph Inc.
# All rights reserved
# It is provided as it is for benchmark reproducible purpose.
# anyone can use it for benchmark purpose with the 
# acknowledgement to TigerGraph.
# Author: Mingxi Wu mingxi.wu@tigergraph.com
############################################################

This article documents the details on how to reproduce the graph database benchmark result on TigerGraph Dev Edtion.

Data Sets
===========

- graph500 edge file: http://service.tigergraph.com/download/benchmark/dataset/graph500-22/graph500-22

- twitter edge file: http://service.tigergraph.com/download/benchmark/dataset/twitter/twitter_rv.tar.gz


Hardware & Major enviroment
================================
- Amazon EC2 machine r4.8xlarge. 
- OS Ubuntu 14.04.5 LTS
- Java build 1.8.0_144-b01 (folow http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html)
- you need to install the following python modules
$ sudo apt-get install python-pip python-dev build-essential 
$ sudo pip install --upgrade pip 
$ sudo pip install --upgrade virtualenv 
$ sudo pip install tornado
$ sudo pip install neo4j-driver
$ sudo pip install requests
$ sudo apt-get install libcurl4-openssl-dev
$ sudo apt-get install libssl-dev
$ sudo pip install pycurl


- 32 vCPUs
- 244 GiB memory
- attached a 250G  EBS-optimized Provisioned IOPS SSD (IO1), IOPS we set is 50 IOPS/GiB
  Raw data and neo4j binary are put on this SSD. 

TigerGraph Version
===================
- TigerGraph Devloper Editon downloaded from https://www.tigergraph.com/download

Install TigerGraph 
===================

- installation video reference here https://www.youtube.com/watch?v=q-vAioBUwkI&t=3s

- put the ExprFunctions.hpp under tigergraph/dev/gdk/gsql/src/QueryUdf/ folder.
  It contains a udf for WCC use. 

Open READ and WRITE permission for TigerGraph User
==================================================
If you installed TigerGraph with the default user, it's tigergraph.

Grant Read and Write permission to the tigergraph user on 
- raw data folder.
- the script folder. 

Configure timeout
===================
Setup timeout to 180 s
Use the following command to change Restpp.timeout_seconds to  3600
 gadmin --configure timeout_seconds

Use the following two commands to apply the config change.
 gadmin config-apply
 gadmin restart

- use the following command to check TigerGraph service 
gadmin status 


Graph500 
==================
Setup schema  
----------------
On a bash shell, switch to the user you specified during TigerGraph installation. 
By default, the user is tigergraph and password is tigergraph.

When all TigerGraph service is up, install the schema, loading job and queries on 
the bash command line.

gsql graph500_setup.gsql

Note: this setup takes up to several mintues.

Load data
---------------
On a bash shell, make sure you have permission to READ the raw edge file. Then, run the loading job by

nohup  gsql -g graph500 'run loading job load_graph500_edge using file1="/ebs/raw/graph500-22/graph500-22"'

Note: In the above command, replace file1="your_raw_data_path"


Run the benchmark 
--------------------
In bash shell 

./benchmark-graph500.sh "path_to_graph500_raw_data_folder" 
e.g. 

./benchmark-graph500.sh "/ebs/raw/graph500/" 

A random seed folder for k-hop-path-neighbor-count query will be generated. 
A result folder will be generated to keep the benchmark result.
This shell script will run the following workload in sequel.

1. k=1 for k-hop-path-neighbor-count query over 1000 random selected vertices.
2. k=2 for k-hop-path-neighbor-count query over 1000 random selected vertices.
3. 3 times of the wcc algorithm over the graph500 data 
4. 3 times of the page rank algorithm 10 iterations over the graph500 data 

Twitter
==================
Setup schema  
----------------
On a bash shell, switch to the user you specified during TigerGraph installation. 
By default, the user is tigergraph and password is tigergraph.

When all TigerGraph service is up, install the schema, loading job and queries on 
the bash command line.

gsql twitter_setup.gsql

Note: this setup takes up to several mintues.

Load data
---------------
On a bash shell, make sure you have permission to READ the raw edge file. Then, run the loading job by

nohup  gsql -g twitter 'run loading job load_twitter_edge using file1="/ebs/raw/twitter_rv/twitter_rv.net"'

Note: In the above command, replace file1="your_raw_data_path"


Run the benchmark 
--------------------
In bash shell 

./benchmark-twitter.sh "path_to_twitter_raw_data_folder" 
e.g. 

./benchmark-twitter.sh "/ebs/raw/twitter_rv/" 

A random seed folder for k-hop-path-neighbor-count query will be generated. 
A result folder will be generated to keep the benchmark result.
This shell script will run the following workload in sequel.

1. k=1 for k-hop-path-neighbor-count query over 1000 random selected vertices.
2. k=2 for k-hop-path-neighbor-count query over 1000 random selected vertices.
3. 3 times of the wcc algorithm over the twitter data 
4. 3 times of the page rank algorithm 10 iterations over the twitter-rv data 


Mutliple machines 
=====================================
r4.2xlarge 8 vCPUs 61 GiB 

Setup TigerGraph Enterprise Edition, you need a license key from TigerGraph
- Edit the cluster_config.json to put the node ips, and ssh private key file.
- Make sure you open all ports.
- Then install 
./install.sh -cn 

Setup the schema, loading job and queries. 
gsql twitter_setup.gsql

Then, run the job on one of the node.

gsql -g twitter 'run loading job load_twitter_edge using file1="all:/ebs/raw/split_twitter_rv.net/"'

This will launch a loader on each machine. And the loading time should be much shorter comparing to load from one machine.

Setup Configure
--------------------
- setup segment size to 2^17. In GSQL, enter "set segsize_in_bits=17"
- Setup timeout to 3600 s

Use the following command to change "Restpp.timeout_seconds" to  3600
 gadmin --configure timeout_seconds

Use the following two commands to apply the config change.
 gadmin config-apply
 gadmin restart

 - setup libtcmalloc. Find your installation folder, and the bin folder. E.g. /ebs/install/tigergraph/bin
 gadmin --config runtime
 GPE: "LD_PRELOAD=/ebs/install/tigergraph/bin/libtcmalloc.so "
 gadmin restart all

 - make sure all ports are open. As the installer will deploy binary to each cluster node.

 - the following are convinient parallel ssh commands you can refer to do some preparation.

  parallel-ssh -h server-list  -l ubuntu -P -x "-oStrictHostKeyChecking=no  -i /home/ubuntu/qa_release"  'sudo mount /dev/xvdb /ebs'

  parallel-ssh -h server-list  -l ubuntu -P -x "-oStrictHostKeyChecking=no  -i /home/ubuntu/qa_release"  'sudo apt-get install policycoreutils -y'

  parallel-ssh -h server-list  -l ubuntu -P -x "-oStrictHostKeyChecking=no  -i /home/ubuntu/qa_release"  'sudo mkdir  -p /ebs/raw/twitter'

  parallel-ssh -h server-list  -l ubuntu -P -x "-oStrictHostKeyChecking=no  -i /home/ubuntu/qa_release"  'sudo chown -R  tigergraph /ebs/raw’

Loading preparation
-------------------------
put all node info in a server-list file.
parallel-ssh -h server-list  -l ubuntu -P -x "-oStrictHostKeyChecking=no  -i /home/ubuntu/ssh-private-key"  'sudo mkdir  -p /ebs/split'
 ./split_distribute.sh  ./raw   /ebs/split  net

Run the following to test page rank:
nohup gsql twitter_setup.gsql;nohup gsql -g twitter 'run loading job load_twitter_edge using file1="all:/ebs/split"';nohup python pg.py twitter-rv tigergraph 10 3