MapReduce-based generation of HDT
========================
HDT-MR Library
========================

Copyright (C) 2015, Jose M. Gimenez-Garcia, Javier D. Fernandez, Miguel A. Martinez-Prieto
All rights reserved.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.

Visit our web page: dataweb.infor.uva.es/projects/hdt-mr

Contacting the authors:
- Jose M. Gimenez-Garcia: email@example.com
- Javier D. Fernandez: firstname.lastname@example.org, email@example.com
- Miguel A. Martinez-Prieto: firstname.lastname@example.org

Overview
=================

HDT-MR improves the HDT-java library by introducing MapReduce as the computation model for large-scale HDT serialization. HDT-MR runs in linear time with respect to dataset size and has proven able to serialize datasets of up to 4.42 billion triples, preserving HDT compression and retrieval features.

HDT-java is a Java library that implements the W3C Submission (http://www.w3.org/Submission/2011/03/) of the RDF HDT (Header-Dictionary-Triples) binary format for publishing and exchanging RDF data at large scale. Its compact representation allows storing RDF in less space, while providing direct access to the stored information. See rdfhdt.org for further information.

HDT-MR provides three components:
- iface: An API to use HDT-MR, including interfaces and abstract classes.
- src: Core library and command line tools for using HDT-MR. It allows creating HDT files from RDF.
- config: Examples of configuration files.

Note that the current distribution is an alpha version. Therefore, while this build has been tested, it is still subject to bugs and optimizations.
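For reference, this is what single-machine serialization looks like with the underlying HDT-java API; a minimal sketch, assuming the standard hdt-java classes (HDTManager, HDTSpecification) and using placeholder file names and base URI. HDT-MR produces the same HDT output, but distributes the dictionary and triples encoding across a Hadoop cluster.

import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class SingleMachineExample {
    public static void main(String[] args) throws Exception {
        // Serialize an N-Triples file into HDT on a single machine.
        HDT hdt = HDTManager.generateHDT(
                "dataset.nt",                  // placeholder input RDF file
                "http://example.org/dataset",  // placeholder base URI
                RDFNotation.NTRIPLES,
                new HDTSpecification(),        // default conversion options
                null);                         // no progress listener
        try {
            hdt.saveToHDT("output.hdt", null);
        } finally {
            hdt.close();
        }
    }
}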
Compiling
=================

Dependencies:
* HDT-java (https://code.google.com/p/hdt-java/).
  Note: src/org/rdfhdt/hdt includes those classes that have been modified/extended.

Command line tools
=================

The library provides the following main command line tool:

Usage: hadoop HDTBuilderDriver [options]
Options:
  -a,   --awsbucket                   Amazon Web Services bucket
  -bu,  --baseURI                     Base URI for the dataset
  -b,   --basedir                     Root directory for the process
  -bd,  --builddictionary             Whether to build the HDT dictionary or not
  -bh,  --buildhdt                    Whether to build HDT or not
  -c,   --conf                        Path to configuration file
  -dd,  --deleteoutputdictionary      Delete dictionary job output path before running job
  -dt,  --deleteoutputtriples         Delete triples job output path before running job
  -dsd, --deletesampledictionary      Delete dictionary job sample path before running job
  -dst, --deletesampletriples         Delete triples job sample path before running job
  -d,   --dictionarydistribution      Dictionary distribution among mappers and reducers
  -fd,  --filedictionary              Name of HDT dictionary file
  -fr,  --fileobjects                 Name of HDT dictionary file for reducers
  -fm,  --filesubjects                Name of HDT dictionary file for mappers
  -hc,  --hdtconf                     Conversion config file
  -x,   --index                       Also generate external indices to solve all queries
  -i,   --input                       Path to input files. Relative to basedir
  -it,  --inputtriples                Path to triples job input files. Relative to basedir
  -nd,  --namedictionaryjob           Name of dictionary job
  -fh,  --namehdtfile                 Name of HDT file
  -nt,  --nametriplesjob              Name of triples job
  -o,   --options                     HDT conversion options (override those of config file)
  -od,  --outputdictionary            Path to dictionary job output files. Relative to basedir
  -ot,  --outputtriples               Path to triples job output files. Relative to basedir
  -q,   --quiet                       Do not show progress of the conversion
  -t,   --rdftype                     Type of RDF input (ntriples, nquad, n3, turtle, rdfxml)
  -Rd,  --reducersdictionary          Number of reducers for dictionary job
  -Rds, --reducersdictionarysampling  Number of reducers for dictionary input sampling job
  -Rt,  --reducerstriples             Number of reducers for triples job
  -Rts, --reducerstriplessampling     Number of reducers for triples input sampling job
  -rd,  --rundictionary               Whether to run dictionary job or not
  -rds, --rundictionarysampling       Whether to run dictionary input sampling job or not
  -rt,  --runtriples                  Whether to run triples job or not
  -rts, --runtriplessampling          Whether to run triples input sampling job or not
  -p,   --sampleprobability           Probability of using each element for sampling (see the illustrative sketch in the appendix below)
  -sd,  --samplesdictionary           Path to dictionary job sample files. Relative to basedir
  -st,  --samplestriples              Path to triples job sample files. Relative to basedir

Usage example
=================

After installation, run:

$ hadoop HDTBuilderDriver
# First tries to read configuration parameters from the default config file (HDTMRBuilder.xml), using default values for any missing parameters. It reads RDF input data from the default 'input' folder and writes the HDT conversion to 'output.hdt'.

$ hadoop HDTBuilderDriver -i mashup
# Same as the previous example, but it reads RDF input data from the directory 'mashup'.

$ hadoop HDTBuilderDriver -c lubm-dictionary.xml -p 0.01
# Uses 'lubm-dictionary.xml' as the configuration file. This file states that input data must be taken from the 'lubm' directory, and it forces computing only the HDT dictionary, which is written to 'dictionary/dictionary.hdt'. It uses 0.01 as the probability of using each element for sampling.

$ hadoop HDTBuilderDriver -c lubm-triples.xml -Rt 1 -Rts 1
# Uses 'lubm-triples.xml' as the configuration file. This file states that input data must be taken from the 'lubm' directory, and it forces computing the HDT triples and the final HDT representation by taking the already computed dictionary from 'dictionary/dictionary.hdt'. It forces the use of one reducer in both jobs.

License
===============

All HDT-MR content is licensed under the GNU Lesser General Public License.

Acknowledgements
================

HDT-MR is a project partially funded by the Ministerio de Economia y Competitividad, Spain (TIN2013-46238-C4-3-R), and the Austrian Science Fund (FWF): M1720-G11.
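Appendix: sampling sketch
=================

HDT-MR runs sampling jobs (-rds, -rts) over the input before the dictionary and triples jobs, keeping each element with the probability given by --sampleprobability. For illustration only, and not HDT-MR's actual code, a Hadoop mapper implementing this kind of probabilistic selection could look like the following minimal sketch; the class name and the configuration key are hypothetical.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SamplingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private double probability;
    private final Random random = new Random();

    @Override
    protected void setup(Context context) {
        // Hypothetical configuration key; HDT-MR defines its own parameters.
        probability = context.getConfiguration().getDouble("hdtmr.sample.probability", 0.01);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Keep each input element with the configured probability.
        if (random.nextDouble() < probability) {
            context.write(value, NullWritable.get());
        }
    }
}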