Files
This branch is 25 commits ahead of gaolk/graph-database-benchmark:master.
neo4j
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
############################################################ # Copyright (c) 2015-now, TigerGraph Inc. # All rights reserved # It is provided as it is for benchmark reproducible purpose. # anyone can use it for benchmark purpose with the # acknowledgement to TigerGraph. # Author: Mingxi Wu mingxi.wu@tigergraph.com ############################################################ This article documents the details on how to reproduce the graph database benchmark result on Neo4j. Data Sets =========== - graph500 edge file: http://service.tigergraph.com/download/benchmark/dataset/graph500-22/graph500-22 - graph500 vertex file: http://service.tigergraph.com/download/benchmark/dataset/graph500-22/graph500-22_unique_node - graph500 vertex header file: http://service.tigergraph.com/download/benchmark/dataset/graph500-22/graph500-22-node-header.txt - graph500 edge header file: http://service.tigergraph.com/download/benchmark/dataset/graph500-22/graph500-22-edge-header.txt - twitter edge file: http://service.tigergraph.com/download/benchmark/dataset/twitter/twitter_rv.tar.gz - twitter vertex file: http://service.tigergraph.com/download/benchmark/dataset/twitter/twitter_rv.net_unique_node - twitter vertex header file: http://service.tigergraph.com/download/benchmark/dataset/twitter/twitter_rv-node-header.txt - twitter edge header file: http://service.tigergraph.com/download/benchmark/dataset/twitter/twitter_rv-edge-header.txt Hardware & Major enviroment ================================ - Amazon EC2 machine r4.8xlarge - OS Ubuntu 14.04.5 LTS - Java build 1.8.0_144-b01 (folow http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html) - you need to install the following python modules $ sudo apt-get install python-pip python-dev build-essential $ sudo pip install --upgrade pip $ sudo pip install --upgrade virtualenv $ sudo pip install tornado $ sudo pip install neo4j-driver $ sudo pip install requests $ sudo apt-get install libcurl4-openssl-dev $ sudo apt-get install libssl-dev $ sudo pip install pycurl - 32 vCPUs - 244 GiB memory - attached a 250G EBS-optimized Provisioned IOPS SSD (IO1), IOPS we set is 50 IOPS/GiB Raw data and neo4j binary are put on this SSD. Neo4j Version ============== - 3.4.4 Community Edition downloaded from https://neo4j.com/download/other-releases/#releases - Graph library, we used apoc-3.4.0.1-all.jar (obtained from https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/3.4.0.1) Put under neo4j-community-3.4.4/plugins Install Neo4j =============== # untar to put under a folder. E.g. /ebs/install/neo4j/ folder, this is an installation point we will use through out this README. Replace it to your installation point when trying to reproduce. tar -xf neo4j-community-3.4.4-unix.tar.gz It will create a neo4j-community-3.4.4 folder. # switch to root sudo bash # under this bin folder, you will see all neo4j server/client binaries. /ebs/install/neo4j/neo4j-community-3.4.4/bin/ # start server. /ebs/install/neo4j/neo4j-community-3.4.4/bin/neo4j start # to stop server /ebs/install/neo4j/neo4j-community-3.4.4/bin/neo4j stop # download the latest apoc jar and put it in the plugins folder. Restart neo4j wget https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/3.4.0.1 #install apoc library , put apoc-3.4.0.1-all.jar under the plugins folder /ebs/install/neo4j/neo4j-community-3.4.1/plugins Loading data ============== - neo4j_load_graph500.sh: bash script to load graph500 data. Description is inside. - neo4j_load_twitter.sh: bash script to load twitter data. Description is inside. To load: - Load graph500 data nohup ./neo4j_load_graph500.sh /install/neo4j-community-3.4.4 /datafolder Note: change /install to your installation point, and data folder to where you put the graph500 data - Load twitter data nohup ./neo4j_load_twitter.sh /install/neo4j-community-3.4.4 /datafolder Note: change /install to your installation point, and data folder to where you put the twitter data #check final storage . change the path to your .db folder du -hc /ebs/install/neo4j/neo4j-community-3.4.4/data/databases/graph500.db/*store.db* du -hc /ebs/install/neo4j/neo4j-community-3.4.4/data/databases/twitter_rv.db/*store.db* Attach the loaded data store to Neo4j ========================================= Startup graph database with the loaded data by creating a symbolic link to the loaded folder. If you have not backup graph.db, backup it first by mv graph.db graph.db.bak # create a symbolic to the graph.db, the default neo4j database file cd /ebs/install/neo4j/neo4j-community-3.4.4/data/databases mv graph.db graph.db.bak #create a symbolic link of graph.db ln -s graph500.db/ graph.db Note: to run twitter graph after its loading, create a symbolic link of graph.db to twitter.db ln -s twitter.db/ graph.db #change neo4j config cd /ebs/install/neo4j/neo4j-community-3.4.4/conf use an editor to edit neo4j.conf, add the following lines at the bottom ################################### # benchmark config ################################### # run neo4j-admin memrec, and put the following in neo4j.conf # put the following in neo4j.conf dbms.memory.heap.initial_size=31g dbms.memory.heap.max_size=31g dbms.memory.pagecache.size=194000m #set timeout to 2.5 hour=9000s dbms.transaction.timeout=9000s #set apoc library dbms.security.procedures.unrestricted=apoc.* ############################################## #restart neo4j by bin/neo4j stop bin/neo4j start Create index using Cypher shell =================================== Without index, neo4j run slow on all the benchmark workload. It is important for you to create an index for vertex id field to get faster benchmark result. bin/cypher-shell user:neo4j pass:neo4j #change password once in neo4j> CALL dbms.changePassword('benchmark') #exit shell neo4j> :exit #log in again cypher-shell -u neo4j -p benchmark # we need to build index in shell neo4j> create index on :MyNode(id); # wait until it's *online*. At the beginning, it's "populating" state neo4j> CALL db.indexes; #after index finishes populating, you can explain the query plan to see NodeIndexSeek is used. neo4j> explain match (n1:MyNode)-[:MyEdge*1 ]->(n2:MyNode) where n1.id=17330741 return count(distinct n2); Restart the server ======================= bin/neo4j restart Run benchmark ================ Download all files in the README folder to a script folder. Before running the benchmark script below, please make sure the current user has the READ permission from the raw file, and the WRITE permission on the folder where you put the benchmark script folder, since the random seed will be generted in this folder. You may consider to make ssh config to keep it alive by following, since the benchmark will run long time during the first run to generate the random seed. https://www.howtogeek.com/howto/linux/keep-your-linux-ssh-session-from-disconnecting/ Output folder ----------------- Assuming you are in the script folder, the output produced will be in - ./result: the folder the benchmark result. - ./seed: the random input seed will be placed in this folder. The first run will yield the seed file, so it may take longer. Graph500 ----------------- #khop, change graph500-22-seed and twitter_rv.net-seed path to your seed file path. # here we assume they are put under /home/ubuntu/ecosys/benchmark/neo4j/ nohup python kn.py /home/ubuntu/ecosys/benchmark/neo4j/graph500-22-seed 300 neo4j 1 notes latency nohup python kn.py /home/ubuntu/ecosys/benchmark/neo4j/graph500-22-seed 300 neo4j 2 notes latency nohup python kn.py /home/ubuntu/ecosys/benchmark/neo4j/graph500-22-seed 10 neo4j 3 notes latency nohup python kn.py /home/ubuntu/ecosys/benchmark/neo4j/graph500-22-seed 10 neo4j 6 notes latency #wcc, 3 runs nohup python wcc.py graph500-22 neo4j 3 #page rank, 3 runs, each run 10 iterations nohup python pg.py graph500-22 neo4j 10 3 Twitter ------------- #khop nohup python kn.py /home/ubuntu/ecosys/benchmark/neo4j/twitter_rv.net-seed 300 neo4j 1 notes latency nohup python kn.py /home/ubuntu/ecosys/benchmark/neo4j/twitter_rv.net-seed 300 neo4j 2 notes latency nohup python kn.py /home/ubuntu/ecosys/benchmark/neo4j/twitter_rv.net-seed 10 neo4j 3 notes latency nohup python kn.py /home/ubuntu/ecosys/benchmark/neo4j/twitter_rv.net-seed 10 neo4j 6 notes latency #wcc nohup python wcc.py twitter-rv neo4j 3 #page rank, 3 runs, each run 10 iterations. #it will run very long python pg.py twitter neo4j 10 3