A distributed, scalable and efficient RDF query engine.
The following software is required in order to run d-sparq:
- MongoDB (http://www.mongodb.org)
- Java 1.6 or later (http://www.oracle.com/technetwork/java/index.html)
- Hadoop 1.0.3 (http://hadoop.apache.org)
- ant (http://ant.apache.org)
- Metis (http://glaros.dtc.umn.edu/gkhome/metis/metis/download). A distributed graph partitioner would be preferable, but at the moment no freely available distributed graph partitioner could be found.
Please download and install all of them. MongoDB needs to be installed on all the machines in the cluster that you plan to make use of. Add the executables to the PATH environment variable.
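For example, assuming MongoDB, Hadoop, and Metis are unpacked under /opt (the paths are assumptions; adjust them to your installation), the executables can be put on the PATH as follows:

```
# Paths are placeholders; point them at your actual MongoDB, Hadoop, and Metis installs.
export PATH=/opt/mongodb/bin:/opt/hadoop/bin:/opt/metis/bin:$PATH
```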
- Download the source code and compile it using `ant jar`.
- Instructions for deploying a MongoDB sharded cluster are given at http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster.
Note that if the underlying machine architecture is NUMA, then follow the instructions at http://docs.mongodb.org/manual/administration/production-notes/#mongodb-and-numa-hardware for starting MongoDB. They are summarized here for convenience.
  - Check whether the file /proc/sys/vm/zone_reclaim_mode has 0 as its content. If not, set it using `echo 0 > /proc/sys/vm/zone_reclaim_mode`.
  - Use `numactl` while starting MongoDB: `numactl --interleave=all bin/mongod <mongo_params>`.
  - Disable transparent huge pages (https://docs.mongodb.org/manual/tutorial/transparent-huge-pages/); a system restart is required.
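On many Linux systems the transparent huge pages setting can also be changed at runtime as sketched below; the sysfs paths differ on some distributions (e.g. older Red Hat kernels), and the linked MongoDB tutorial describes how to make the change persist across restarts.

```
# Disable transparent huge pages for the running kernel (run as root).
# Sketch only: sysfs paths vary by distribution; see the MongoDB tutorial linked above.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```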
The steps to set up a sharded cluster are given here for convenience. For the following commands, create an appropriate folder structure for dbpath. Minimal sketches of the configuration files used in these commands are shown after this list.
- Start the config server database instances. One instance of this is minimally sufficient. Use `numactl --interleave=all bin/mongod --config mongod-configsvr.conf`.
- Start the mongos instances. One instance of this is minimally sufficient. Use `numactl --interleave=all bin/mongos --config mongos.conf`.
- Start the shards in the cluster. Do this on all the nodes in the cluster. Use `numactl --interleave=all bin/mongod --config mongod-shardsvr.conf`.
- Add shards to the cluster. Using the mongo shell, connect to the mongos instance (`bin/mongo --port 27019`). At the prompt, run the following commands:
  a) `use admin`
  b) `db.runCommand( { addshard : "<shard_host>:<shard_port>" } );` Run this for all shards, i.e., put in the information for each of the shards.
  c) Enable sharding for the database (rdfdb) as well as the collection (idvals) to be sharded. Use the commands `db.runCommand( { enablesharding : "rdfdb" } );` and `db.runCommand( { shardcollection : "rdfdb.idvals", key : { _id : 1 }, unique : true });`
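The contents of mongod-configsvr.conf, mongos.conf, and mongod-shardsvr.conf are not specified here, so the following are only minimal sketches; ports, dbpath, and logpath values are assumptions (the mongos port matches the 27019 used above), and only the legacy key = value config format is shown.

```
# mongod-configsvr.conf (config server; port and paths are assumptions)
configsvr = true
port = 27018
dbpath = /data/mongodb/configdb
logpath = /data/mongodb/configdb/mongod.log
fork = true

# mongos.conf (query router; points at the config server started above)
configdb = <configsvr_host>:27018
port = 27019
logpath = /data/mongodb/mongos.log
fork = true

# mongod-shardsvr.conf (one per node in the cluster)
shardsvr = true
port = 27017
dbpath = /data/mongodb/shard
logpath = /data/mongodb/shard/mongod.log
fork = true
```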
- In ShardInfo.properties, make the necessary changes, i.e., put in the information regarding the MongoDB cluster.
- The input triples should be in N-Triples format. If not, RDF2RDF (http://www.l3s.de/~minack/rdf2rdf) can be used to convert the triples into N-Triples format.
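For reference, an N-Triples file contains one triple per line, each terminated by a period, e.g. (illustrative URIs):

```
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
```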
- Create a hash digest for each term (subject/predicate/object). This is required because some terms are very long (e.g., blog comments) and it is not convenient to index on long text values. Use `hadoop jar dist/d-sparq.jar dsparq.load.HashGeneratorMR <input_dir> <output_dir> <num_nodes>`. The input is the directory containing the triples.
- Load the digest messages along with their string equivalent values into MongoDB. Copy the output of the previous step from HDFS to the local file system, or to the one hosting the Mongo router (see the sketch below). Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.HashDigestLoader <input_dir>`. The input directory is the one containing the output of the previous step. This step also generates numerical IDs, which are required for Metis (vertex IDs).
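A minimal sketch of the copy step; the HDFS output directory and local target directory are placeholders:

```
# Copy the HashGeneratorMR output out of HDFS onto the machine hosting the Mongo router.
# <hdfs_output_dir> and /data/dsparq/hashes are placeholders.
hadoop fs -get <hdfs_output_dir> /data/dsparq/hashes
```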
- Generate triple file(s) with numerical IDs. After this step, triple files of the form SubjectID|PredicateID|ObjectID are generated. Use `hadoop jar dist/d-sparq.jar dsparq.load.KVFileCreatorMR <input_dir> <output_dir>`. The input directory is the one containing the original triple files.
- Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.SplitTripleLoader <dir_with_KV_files>`. The input is the directory containing the triple file(s) in key-value format generated as the output of the previous step. Note that this input directory should not contain any other files or sub-directories. The triples are loaded into the local DB; indexes are also created in this step.
- Generate the predicate selectivity. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.PredicateSelectivity`.
- Go to the section on running queries as a next step.
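Optionally, before running queries, the dictionary load can be sanity-checked through the mongos instance set up earlier (port 27019 as above; rdfdb.idvals is the sharded dictionary collection named earlier). This is only an illustrative check:

```
# Connect to the mongos instance and count the dictionary entries.
bin/mongo --port 27019
> use rdfdb
> db.idvals.count()
```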
- Use `hadoop jar dist/d-sparq.jar dsparq.partitioning.TypeTriplesSeparator <input_dir> <output_dir>`. The input directory should contain the files generated by dsparq.load.KVFileCreatorMR. Ignore this step if there is only one node in the cluster.
- Metis needs the graph to be in the form of an adjacency list, so generate it using `hadoop jar dist/d-sparq.jar dsparq.partitioning.UndirectedGraphPartitioner <input_dir> <output_dir>`. The input directory contains the untyped triples from the previous step, i.e., the triples separated from the typed triples. Copy the adjvertex-... and vertex-... files to the local file system.
- The total number of vertices and edges should also be specified. Get them using `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.partitioning.format.GraphStats adjvertex-r-00000-new`.
- Add the total number of vertices and edges to the METIS input file (adjvertex-...) as its first line.
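For example, the header line can be prepended with GNU sed; the vertex and edge counts below are made up and should be replaced with the numbers reported by GraphStats:

```
# Suppose GraphStats reports 1000000 vertices and 4500000 edges (illustrative numbers).
# METIS expects "<num_vertices> <num_edges>" as the first line of the adjacency file.
sed -i '1i 1000000 4500000' adjvertex-r-00000-new
```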
- Run Metis using `gpmetis <adjvertex_file> <num_partitions>`. The number of partitions is the same as the number of nodes in the cluster.
The output of Metis specifies which vertices belong to a particular partition. From that, we need to get the triples associated with those vertices. The following steps collect all the triples which should go into a particular partition.
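For reference, gpmetis writes its result to <adjvertex_file>.part.<num_partitions>, with one line per vertex giving the zero-based partition that vertex is assigned to. For three partitions the first few lines might look like this (values are illustrative):

```
# Illustrative first lines of adjvertex-r-00000-new.part.3:
# line i gives the partition (0, 1, or 2) assigned to vertex i.
0
2
1
0
```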
- Combine the vertexID file and the partitionID file. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.partitioning.format.VertexPartitionFormatter vertex-r-00000 adjvertex-r-00000.part.<num>`.
- Based on the vertexID - partitionID pairs, create separate files for each partition which contain all the vertices belonging to that partition. Use `hadoop jar dist/d-sparq.jar dsparq.partitioning.TripleSplitter <input_dir> <output_dir>`. The input directory contains the combined vertexID - partitionID pairs.
- For the triples on vertex partitions, use n-hop duplication.
- Based on the output of the above step, get the triples associated with each partition. Use `hadoop jar dist/d-sparq.jar dsparq.partitioning.PartitionedTripleSeparator <input_dir> <output_dir>`.
- Copy the triple files which belong to a particular partition onto that partition (see the sketch below).
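A minimal sketch of that copy, assuming SSH access between the nodes; the file name, host name, and target directory are placeholders:

```
# Ship the triples of partition k to the node that will host partition k.
# <triples_of_partition_k>, node-k, and /data/dsparq/ are placeholders.
scp <triples_of_partition_k> node-k:/data/dsparq/
```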
- Load the triples into the local MongoDB as well as into RDF-3X on each partition. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.PartitionedTripleLoader <input_triple_file>`. Indexes are also created in this step (PartitionedTripleLoader does that).
- Generate the predicate selectivity. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.load.PredicateSelectivity`.
- For loading into RDF-3X, use `time bin/rdf3xload rdfdb <file>`.
- Run this on each node of the cluster. Use `java -Xms12g -Xmx12g -cp dist/d-sparq.jar dsparq.query.opt.ThreadedQueryExecutor2 <queries/q?> <number_of_times_to_run>`. The input query file should contain only one query, and the query should be written without any line breaks (see the sketch below). The SP2 benchmark tool is used to generate the data, and the modified queries are also provided; since only basic graph patterns are supported, some of the queries from SP2 are modified. Test queries are provided in the queries folder. Note that only the total results are printed, not the individual results. This is sufficient for our purpose.
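A hypothetical single-line query file is sketched below; the file name and query text are only examples (the actual queries/q* files shipped with the code will differ), and only basic graph patterns are supported:

```
# Write a one-line basic-graph-pattern query into a file (name and query are placeholders).
echo 'SELECT ?x ?name WHERE { ?x <http://xmlns.com/foaf/0.1/name> ?name . ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> }' > queries/sample-query
```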
- For running queries on RDF-3X:
  - Put rdf3x-0.3.7/bin in the PATH variable.
  - Use `time java -Xms12g -Xmx12g -cp dist/d-sparq.jar:src dsparq.sample.RDF3XTest <path_to_rdfdb> <queries/q?>`.