A distributed, scalable and efficient RDF query engine.
The following software is required in order to run d-sparq:
- MongoDB (http://www.mongodb.org)
- Java 1.6 or later (http://www.oracle.com/technetwork/java/index.html)
- Hadoop 1.0.3 (http://hadoop.apache.org)
- ant (http://ant.apache.org)
- Metis (http://glaros.dtc.umn.edu/gkhome/metis/metis/download). A distributed graph partitioner would be preferable, but at the moment no freely available distributed graph partitioner could be found.
Please download and install all of them. MongoDB needs to be installed on all the machines in the cluster that you plan to make use of. Add the executables to the PATH environment variable.
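For example, assuming MongoDB, Hadoop, and Metis are unpacked under /opt (the paths are assumptions; adjust them to your installation), the executables can be put on the PATH as follows:

```
# Paths are placeholders; point them at your actual MongoDB, Hadoop, and Metis installs.
export PATH=/opt/mongodb/bin:/opt/hadoop/bin:/opt/metis/bin:$PATH
```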
- Download the source code and compile it using `ant jar`.
- Instructions for deploying a MongoDB sharded cluster are given at http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster.
Note that if the underlying machine architecture is NUMA, then follow the instructions at http://docs.mongodb.org/manual/administration/production-notes/#mongodb-and-numa-hardware for starting MongoDB. They are summarized here for convenience.
  - Check whether the file /proc/sys/vm/zone_reclaim_mode has 0 as its content. If not, set it using `echo 0 > /proc/sys/vm/zone_reclaim_mode`.
  - Use `numactl` while starting MongoDB: `numactl --interleave=all bin/mongod <mongo_params>`.
  - Disable transparent huge pages (https://docs.mongodb.org/manual/tutorial/transparent-huge-pages/); a system restart is required.
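On many Linux systems the transparent huge pages setting can also be changed at runtime as sketched below; the sysfs paths differ on some distributions (e.g. older Red Hat kernels), and the linked MongoDB tutorial describes how to make the change persist across restarts.

```
# Disable transparent huge pages for the running kernel (run as root).
# Sketch only: sysfs paths vary by distribution; see the MongoDB tutorial linked above.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```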
The steps to set up a sharded cluster are given here for convenience. For the following commands, create an appropriate folder structure for dbpath. Minimal sketches of the configuration files used in these commands are shown after this list.
- Start the config server database instances. One instance of this is minimally sufficient. Use `numactl --interleave=all bin/mongod --config mongod-configsvr.conf`.
- Start the mongos instances. One instance of this is minimally sufficient. Use `numactl --interleave=all bin/mongos --config mongos.conf`.
- Start the shards in the cluster. Do this on all the nodes in the cluster. Use `numactl --interleave=all bin/mongod --config mongod-shardsvr.conf`.
- Add shards to the cluster. Using the mongo shell, connect to the mongos instance (`bin/mongo --port 27019`). At the prompt, run the following commands:
  a) `use admin`
  b) `db.runCommand( { addshard : "<shard_host>:<shard_port>" } );` Run this for all shards, i.e., put in the information for each of the shards.
  c) Enable sharding for the database (rdfdb) as well as the collection (idvals) to be sharded. Use the commands `db.runCommand( { enablesharding : "rdfdb" } );` and `db.runCommand( { shardcollection : "rdfdb.idvals", key : { _id : 1 }, unique : true });`
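The contents of mongod-configsvr.conf, mongos.conf, and mongod-shardsvr.conf are not specified here, so the following are only minimal sketches; ports, dbpath, and logpath values are assumptions (the mongos port matches the 27019 used above), and only the legacy key = value config format is shown.

```
# mongod-configsvr.conf (config server; port and paths are assumptions)
configsvr = true
port = 27018
dbpath = /data/mongodb/configdb
logpath = /data/mongodb/configdb/mongod.log
fork = true

# mongos.conf (query router; points at the config server started above)
configdb = <configsvr_host>:27018
port = 27019
logpath = /data/mongodb/mongos.log
fork = true

# mongod-shardsvr.conf (one per node in the cluster)
shardsvr = true
port = 27017
dbpath = /data/mongodb/shard
logpath = /data/mongodb/shard/mongod.log
fork = true
```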
- In ShardInfo.properties, make the necessary changes, i.e., put in the information regarding the MongoDB cluster.
- The input triples should be in N-Triples format. If not, RDF2RDF (http://www.l3s.de/~minack/rdf2rdf) can be used to convert the triples into N-Triples format.
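For reference, an N-Triples file contains one triple per line, each terminated by a period, e.g. (illustrative URIs):

```
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
```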
- Create a hash digest for each term (subject/predicate/object). This is required because some terms are very long (e.g., blog comments) and it is not convenient to index on long text values. Use `hadoop jar dist/d-sparq.jar dsparq.load.HashGeneratorMR <input_dir> <output_dir> <num_nodes>`. The input is the directory containing the triples.
- Load the digest messages along with their string equivalent values into MongoDB. Copy the output of the previous step from HDFS to the local file system, or to the one hosting the Mongo router (see the sketch below). Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.HashDigestLoader <input_dir>`. The input directory is the one containing the output of the previous step. This step also generates numerical IDs, which are required for Metis (vertex IDs).
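A minimal sketch of the copy step; the HDFS output directory and local target directory are placeholders:

```
# Copy the HashGeneratorMR output out of HDFS onto the machine hosting the Mongo router.
# <hdfs_output_dir> and /data/dsparq/hashes are placeholders.
hadoop fs -get <hdfs_output_dir> /data/dsparq/hashes
```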
- Generate triple file(s) with numerical IDs. After this step, triple files of the form SubjectID|PredicateID|ObjectID are generated. Use `hadoop jar dist/d-sparq.jar dsparq.load.KVFileCreatorMR <input_dir> <output_dir>`. The input directory is the one containing the original triple files.
- Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.SplitTripleLoader <dir_with_KV_files>`. The input is the directory containing the triple file(s) in key-value format generated as the output of the previous step. Note that this input directory should not contain any other files or sub-directories. The triples are loaded into the local DB; indexes are also created in this step.
- Generate the predicate selectivity. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.PredicateSelectivity`.
- Go to the section on running queries as a next step.
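Optionally, before running queries, the dictionary load can be sanity-checked through the mongos instance set up earlier (port 27019 as above; rdfdb.idvals is the sharded dictionary collection named earlier). This is only an illustrative check:

```
# Connect to the mongos instance and count the dictionary entries.
bin/mongo --port 27019
> use rdfdb
> db.idvals.count()
```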
- Use `hadoop jar dist/d-sparq.jar dsparq.partitioning.TypeTriplesSeparator <input_dir> <output_dir>`. The input directory should contain the files generated by dsparq.load.KVFileCreatorMR. Ignore this step if there is only one node in the cluster.
- Metis needs the graph to be in the form of an adjacency list, so generate it using `hadoop jar dist/d-sparq.jar dsparq.partitioning.UndirectedGraphPartitioner <input_dir> <output_dir>`. The input directory contains the untyped triples from the previous step, i.e., the triples separated from the typed triples. Copy the adjvertex-... and vertex-... files to the local file system.
- The total number of vertices and edges should also be specified. Get them using `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.partitioning.format.GraphStats adjvertex-r-00000-new`.
- Add the total number of vertices and edges to the METIS input file (adjvertex-...) as its first line.
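For example, the header line can be prepended with GNU sed; the vertex and edge counts below are made up and should be replaced with the numbers reported by GraphStats:

```
# Suppose GraphStats reports 1000000 vertices and 4500000 edges (illustrative numbers).
# METIS expects "<num_vertices> <num_edges>" as the first line of the adjacency file.
sed -i '1i 1000000 4500000' adjvertex-r-00000-new
```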
- Run Metis using `gpmetis <adjvertex_file> <num_partitions>`. The number of partitions is the same as the number of nodes in the cluster.
The output of Metis specifies which vertices belong to a particular partition. From that, we need to get the triples associated with those vertices. The following steps collect all the triples which should go into a particular partition.
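For reference, gpmetis writes its result to <adjvertex_file>.part.<num_partitions>, with one line per vertex giving the zero-based partition that vertex is assigned to. For three partitions the first few lines might look like this (values are illustrative):

```
# Illustrative first lines of adjvertex-r-00000-new.part.3:
# line i gives the partition (0, 1, or 2) assigned to vertex i.
0
2
1
0
```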
- Combine the vertexID file and the partitionID file. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.partitioning.format.VertexPartitionFormatter vertex-r-00000 adjvertex-r-00000.part.<num>`.
- Based on the vertexID - partitionID pairs, create separate files for each partition which contain all the vertices belonging to that partition. Use `hadoop jar dist/d-sparq.jar dsparq.partitioning.TripleSplitter <input_dir> <output_dir>`. The input directory contains the combined vertexID - partitionID pairs.
- For the triples on vertex partitions, use n-hop duplication.
- Based on the output of the above step, get the triples associated with each partition. Use `hadoop jar dist/d-sparq.jar dsparq.partitioning.PartitionedTripleSeparator <input_dir> <output_dir>`.
- Copy the triple files which belong to a particular partition onto that partition (see the sketch below).
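A minimal sketch of that copy, assuming SSH access between the nodes; the file name, host name, and target directory are placeholders:

```
# Ship the triples of partition k to the node that will host partition k.
# <triples_of_partition_k>, node-k, and /data/dsparq/ are placeholders.
scp <triples_of_partition_k> node-k:/data/dsparq/
```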
- Load the triples into the local MongoDB as well as into RDF-3X on each partition. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.PartitionedTripleLoader <input_triple_file>`. Indexes are also created in this step (PartitionedTripleLoader does that).
- Generate the predicate selectivity. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.PredicateSelectivity`.
- For loading into RDF-3X, use `time bin/rdf3xload rdfdb <file>`.
- Run this on each node of the cluster. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.query.opt.ThreadedQueryExecutor2 <queries/q?> <number_of_times_to_run>`. The input query file should contain only one query, and the query should be written without any line breaks (see the sketch below). The SP2 benchmark tool is used to generate the data, and the modified queries are also provided; since only basic graph patterns are supported, some of the queries from SP2 are modified. Test queries are provided in the queries folder. Note that only the total results are printed, not the individual results. This is sufficient for our purpose.
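A hypothetical single-line query file is sketched below; the file name and query text are only examples (the actual queries/q* files shipped with the code will differ), and only basic graph patterns are supported:

```
# Write a one-line basic-graph-pattern query into a file (name and query are placeholders).
echo 'SELECT ?x ?name WHERE { ?x <http://xmlns.com/foaf/0.1/name> ?name . ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> }' > queries/sample-query
```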
- For running queries on RDF-3X:
  - Put rdf3x-0.3.7/bin in the PATH variable.
  - Use `time java -Xms12g -Xmx12g -cp dist/d-sparq.jar:src dsparq.sample.RDF3XTest <path_to_rdfdb> <queries/q?>`.