Switch branches/tags
Nothing to show
Find file
e43aa93 May 23, 2013
72 lines (47 sloc) 3.78 KB
Welcome to Glimmer
Glimmer is a Hadoop based distributed indexing system for building MG4J indexes from RDF tuples in NQuad format. It also includes a simple web application
for querying the resulting indexes (TODO +something about this+scorer..). Both the indexing and web application are written in Java. There are also a few
shell scripts to execute the steps need to build the indexes for a given NQuads file and query the resulting index from the command line.
Glimmer is an academic project and is the implementation of distributed indexing detailed in the paper
'Distributed Indexing for Semantic Search' by Peter Mika(Yahoo Research).
- Java JDK 6. Other versions may work. The code was written against version 1.6.0_31
Note that MG4J 5.1 uses memory mapped local file access on the local machine and a bit in Hadoop. For index building to work well, use a version of Java that has a 64bit data model
You will possibly get Map failed errors otherwise. Especially when merging lots of indexes.
- A Maven installation. Version 3.0.3 was used during development.
- A Hadoop cluster. Probably version 0.23.x. The version of Hadoop we developed the code with is defined in the pom.xml file.
- If you want try out the web app an install of a Java servlet container such as Tomcat, Jetty etc..
- If you want a more usable interface to the MG4J query command-line you can install rlwrap to get command-line history and editing.
All other dependencies are jars that are automatically downloaded by Maven.
Index Building
1. Build an uber jar with mvn for building indexes with.
In the root directory of Glimmer run:
mvn compile (or mvn test)
mvn assembly:single
This will produce a jar Glimmer-?.?.?-SNAPSHOT-jar-for-hadoop.jar in the ./target directory. This jar contains the Glimmer classes + all dependencies needed for Hadoop zipped into one file.
2. The process of building an index with Glimmer consists of the following steps. (The shell script is provided to automate the process.)
* Preprocess the NQuad tuples file to get:
- A sorted unique list of subjects with the subject's associated predicates, objects and contexts. BZip2'ed
- A file containing a mapping between the BZip2 block start offsets(in the file above) and the first subject id in that block.
- A sorted unique list of subjects resources.
- A sorted unique list of predicates resources.
- A sorted unique list of all resources(Subject, Predicate, Object & Context).
* Build minimal perfect hash functions over the unique sorted lists of resources. This function is a mapping of a given resource to its position in the unique sorted list.
* Build the 'horizontal' and 'vertical'(See the paper..) MG4J indexes.
* Compute the Document sizes.
* Copy all the generated files to the desired location.
Once you have the indexes build you can use the MG4J query command line class it.unimi.di.mg4j.query.Query to query them. You may find the script helpful here.
You can also build and deploy the Java web application as follows:
1. Configure where your indexes are in src/main/webapp/WEB-INF/classes/
The directories found in the directory given by 'multiindex.path' property and that have names starting with the
'multiindex.dirprefix' property are treated as MG4J indexes. The names will appear in the Dataset dropdown in the webapp.
2. Use the maven-war-plugin to build the war file. From the projects root directory run the command
mvn compile (or mvn test)
mvn war:war (mvn war:exploded maybe of interest to some people too)
This will build the war file Glimmer-?.?.?-SNAPSHOT.war in the target directory.
2. Copy the war into do the deployment directory of your installed Java Servlet container.