
Blazegraph vocabulary and inline uri setup

This change improves load speed and reduces disk usage by about 15%.

It uses a backport of http://trac.bigdata.com/ticket/1179#ticket.

Change-Id: I864c9a07d497280c6806ab2c23c85fcd89ffe4da
Nik Everett committed Apr 12, 2015
1 parent b88b454 commit 6942870a9cd62fcf25fddb970694a2735c737d13
Showing with 1,233 additions and 161 deletions.
  1. +51 −0 backporting_blazegraph.txt
  2. +66 −0 blazegraph/src/main/java/com/bigdata/rdf/internal/NormalizingInlineUriHandler.java
  3. +32 −0 blazegraph/src/main/java/com/bigdata/rdf/internal/TrailingSlashRemovingInlineUriHandler.java
  4. +4 −0 blazegraph/src/main/java/com/bigdata/rdf/internal/package-info.java
  5. +0 −137 blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/WikibaseDateExtension.java
  6. +2 −0 blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/WikibaseExtensionFactory.java
  7. +68 −0 blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/WikibaseInlineUriFactory.java
  8. +36 −0 blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/WikibaseVocabulary.java
  9. +118 −0 ...ph/src/main/java/org/wikidata/query/rdf/blazegraph/inline/literal/AbstractMultiTypeExtension.java
  10. +67 −0 blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/inline/literal/WikibaseDateExtension.java
  11. +57 −0 ...h/src/main/java/org/wikidata/query/rdf/blazegraph/inline/uri/UndecoratedUuidInlineUriHandler.java
  12. +48 −0 ...h/src/main/java/org/wikidata/query/rdf/blazegraph/inline/uri/ValuePropertiesInlineUriHandler.java
  13. +109 −0 ...ain/java/org/wikidata/query/rdf/blazegraph/inline/uri/WikibaseStyleStatementInlineUriHandler.java
  14. +15 −0 ...egraph/src/main/java/org/wikidata/query/rdf/blazegraph/vocabulary/CommonValuesVocabularyDecl.java
  15. +29 −0 blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/vocabulary/OntologyVocabularyDecl.java
  16. +18 −0 blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/vocabulary/ProvenanceVocabularyDecl.java
  17. +22 −0 ...egraph/src/main/java/org/wikidata/query/rdf/blazegraph/vocabulary/SchemaDotOrgVocabularyDecl.java
  18. +26 −0 ...egraph/src/main/java/org/wikidata/query/rdf/blazegraph/vocabulary/WikibaseUrisVocabularyDecl.java
  19. +133 −0 blazegraph/src/test/java/org/wikidata/query/rdf/blazegraph/AbstractRandomizedBlazegraphTestBase.java
  20. +0 −18 blazegraph/src/test/java/org/wikidata/query/rdf/blazegraph/DummyUnitTest.java
  21. +115 −0 blazegraph/src/test/java/org/wikidata/query/rdf/blazegraph/WikibaseInlineUriFactoryUnitTest.java
  22. +136 −0 blazegraph/src/test/java/org/wikidata/query/rdf/blazegraph/WikibaseVocabularyUnitTest.java
  23. +6 −0 blazegraph/src/test/resources/log4j.properties
  24. +5 −0 common/src/main/java/org/wikidata/query/rdf/common/StatementUtil.java
  25. +18 −0 common/src/main/java/org/wikidata/query/rdf/common/uri/CommonValues.java
  26. +33 −1 common/src/main/java/org/wikidata/query/rdf/common/uri/Ontology.java
  27. +10 −0 common/src/main/java/org/wikidata/query/rdf/common/uri/WikibaseUris.java
  28. +6 −3 pom.xml
  29. +1 −1 tools/pom.xml
  30. +2 −1 tools/src/test/resources/blazegraph/RWStore.properties
@@ -0,0 +1,51 @@
Creating a backport against Blazegraph 1.5.1 is reasonably simple:
1. Check out Blazegraph 1.5.1

2. Apply all the patches we've backported. So far that is:
http://trac.bigdata.com/ticket/1179

3. Get the builds you need:
mkdir ~/scratch
ant jar
mv ant-build/bigdata-1.5.1-wmf-1.jar ~/scratch/
ant war
mv ant-build/bigdata.war ~/scratch/bigdata-1.5.1-wmf-1.war
ant sourceJar
mv ant-build/bigdata-1.5.1-wmf-1-sources.jar ~/scratch/
ant executable-jar
mv ant-build/bigdata-1.5.1-wmf-1-bundled.jar ~/scratch/
cp pom.xml ~/scratch/

4. Replace this line in ~/scratch/pom.xml
<version>1.5.0-SNAPSHOT</version>
with this line:
<version>1.5.1-wmf-1</version>

5. Log in to archiva.wikimedia.org. Sometimes the login script takes a long,
long time. Just wait for it to finish or reload archiva.wikimedia.org in
another window. Eventually it'll let you in. We don't know why this happens.

6. Click Upload Artifact

7. Click choose file and select everything in your scratch directory

8. Set the war's packaging to war, the bundled jar's classifier to bundle, the
sources jar's classifier to sources, and tick the pomFile box on the pom file.

9. Start the uploads and then fill in the top of the page:
Repository Id: Wikimedia Mirrored Repository
Group ID: com.bigdata
Artifact ID: bigdata
Version: 1.5.1-wmf-1 <--- or -wmf-2, -wmf-3, etc.
Packaging: jar

10. Once the upload is done, you can click save file.

11. You can see the files here:
https://archiva.wikimedia.org/repository/mirrored/com/bigdata/bigdata/1.5.1-wmf-1/
Just replace the -wmf-1 with whatever your version number is.

12. Update the blazegraph.version property in the pom in the root directory of
the project (see the sketch after these steps).

13. mvn clean install
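
For step 12, a minimal sketch of the property to update. Only the property name
blazegraph.version and the version value come from these notes; the <properties>
wrapper and formatting are assumptions, not copied from the project's pom.

<properties>
  <blazegraph.version>1.5.1-wmf-1</blazegraph.version>
</properties>
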
@@ -0,0 +1,66 @@
package com.bigdata.rdf.internal;

import java.util.Arrays;
import java.util.List;

import org.openrdf.model.URI;

import com.bigdata.rdf.internal.impl.literal.AbstractLiteralIV;
import com.bigdata.rdf.internal.impl.uri.URIExtensionIV;
import com.bigdata.rdf.model.BigdataLiteral;
import com.bigdata.rdf.vocab.Vocabulary;

/**
* InlineURIHandler that wraps another handler, normalizing multiple uri
* prefixes into one.
*/
public class NormalizingInlineUriHandler extends InlineURIHandler {
private final InlineURIHandler next;
private final List<String> normalizedPrefixes;

public NormalizingInlineUriHandler(InlineURIHandler next, String... normalizedPrefixes) {
this(next, Arrays.asList(normalizedPrefixes));
}

public NormalizingInlineUriHandler(InlineURIHandler next, List<String> normalizedPrefixes) {
super(next.getNamespace());
this.next = next;
this.normalizedPrefixes = normalizedPrefixes;
}

@Override
public void init(Vocabulary vocab) {
super.init(vocab);
next.init(vocab);
}

@Override
@SuppressWarnings({ "unchecked", "rawtypes" })
protected URIExtensionIV createInlineIV(URI uri) {
if (namespaceIV == null) {
// Can't do anything without a namespace.
return null;
}
for (String prefix : normalizedPrefixes) {
if (uri.stringValue().startsWith(prefix)) {
AbstractLiteralIV localNameIv = next.createInlineIV(uri.stringValue().substring(prefix.length()));
if (localNameIv == null) {
return null;
}
return new URIExtensionIV(localNameIv, namespaceIV);
}
}
return next.createInlineIV(uri);
}

@Override
public String getLocalNameFromDelegate(AbstractLiteralIV<BigdataLiteral, ?> delegate) {
return next.getLocalNameFromDelegate(delegate);
}

@Override
@SuppressWarnings("rawtypes")
protected AbstractLiteralIV createInlineIV(String localName) {
return next.createInlineIV(localName);
}
}
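
A minimal usage sketch for this handler. The example class, its package, and the
example.org prefixes are hypothetical and purely illustrative; only the constructors
come from the code above.

package org.wikidata.query.rdf.blazegraph.example;

import com.bigdata.rdf.internal.InlineURIHandler;
import com.bigdata.rdf.internal.InlineUnsignedIntegerURIHandler;
import com.bigdata.rdf.internal.NormalizingInlineUriHandler;

/** Hypothetical example, not part of this commit. */
public class NormalizingExample {
    /**
     * Fold an alternate http:// spelling of a namespace into the primary
     * https:// one so both spellings are inlined identically. The
     * example.org prefixes are made up for illustration.
     */
    public static InlineURIHandler buildHandler() {
        return new NormalizingInlineUriHandler(
                new InlineUnsignedIntegerURIHandler("https://example.org/id/"),
                "http://example.org/id/");
    }
}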
@@ -0,0 +1,32 @@
package com.bigdata.rdf.internal;

import com.bigdata.rdf.internal.impl.literal.AbstractLiteralIV;
import com.bigdata.rdf.vocab.Vocabulary;

/**
* InlineURIHandler that wraps another handler and removes any trailing forward
* slashes from the localName before passing it to the wrapped handler.
*/
public class TrailingSlashRemovingInlineUriHandler extends InlineURIHandler {
private final InlineURIHandler next;

public TrailingSlashRemovingInlineUriHandler(InlineURIHandler next) {
super(next.namespace);
this.next = next;
}

@Override
public void init(Vocabulary vocab) {
super.init(vocab);
next.init(vocab);
}

@Override
@SuppressWarnings("rawtypes")
protected AbstractLiteralIV createInlineIV(String localName) {
if (localName.endsWith("/")) {
localName = localName.substring(0, localName.length() - 1);
}
return next.createInlineIV(localName);
}
}
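
A sketch of how the two wrappers compose. The wiring mirrors the VIAF setup in
WikibaseInlineUriFactory further down in this commit, but the example class and
method name are hypothetical.

import org.wikidata.query.rdf.common.uri.CommonValues;

import com.bigdata.rdf.internal.InlineURIHandler;
import com.bigdata.rdf.internal.InlineUnsignedIntegerURIHandler;
import com.bigdata.rdf.internal.NormalizingInlineUriHandler;
import com.bigdata.rdf.internal.TrailingSlashRemovingInlineUriHandler;

/** Hypothetical example, not part of this commit. */
public class ViafHandlerExample {
    /**
     * Inline VIAF identifiers as unsigned integers, tolerating a trailing
     * slash on the local name and normalizing the alternate
     * CommonValues.VIAF_HTTP prefix into CommonValues.VIAF.
     */
    public static InlineURIHandler buildViafHandler() {
        return new NormalizingInlineUriHandler(
                new TrailingSlashRemovingInlineUriHandler(
                        new InlineUnsignedIntegerURIHandler(CommonValues.VIAF)),
                CommonValues.VIAF_HTTP);
    }
}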
@@ -0,0 +1,4 @@
/**
* Package declared so we can call protected methods in Blazegraph.
*/
package com.bigdata.rdf.internal;

The old blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/WikibaseDateExtension.java was deleted.

@@ -3,6 +3,8 @@
import java.util.Collection;
import java.util.Iterator;

import org.wikidata.query.rdf.blazegraph.inline.literal.WikibaseDateExtension;

import com.bigdata.rdf.internal.DefaultExtensionFactory;
import com.bigdata.rdf.internal.IDatatypeURIResolver;
import com.bigdata.rdf.internal.IExtension;
@@ -0,0 +1,68 @@
package org.wikidata.query.rdf.blazegraph;

import org.wikidata.query.rdf.blazegraph.inline.uri.UndecoratedUuidInlineUriHandler;
import org.wikidata.query.rdf.blazegraph.inline.uri.ValuePropertiesInlineUriHandler;
import org.wikidata.query.rdf.common.uri.CommonValues;
import org.wikidata.query.rdf.common.uri.WikibaseUris;

import com.bigdata.rdf.internal.InlineURIFactory;
import com.bigdata.rdf.internal.InlineUnsignedIntegerURIHandler;
import com.bigdata.rdf.internal.NormalizingInlineUriHandler;
import com.bigdata.rdf.internal.TrailingSlashRemovingInlineUriHandler;

/**
* Factory building InlineURIHandlers for Wikidata.
*
* One thing to consider when working on these is that it's way better for write
* (and probably update) performance if all the bits of an entity are grouped
* together in Blazegraph's BTrees. Scattering them causes updates to have to
* touch lots of BTree nodes. {s,p,o}, {p,o,s}, and {o,s,p} are the indexes, and
* {s,p,o} seems most sensitive to scattering.
*
* Another thing to consider is that un-inlined uris are stored as longs which
* take up 9 bytes including the flags byte. Inlined uris are stored as 1 flag
* byte, 1 (or 2) uri prefix bytes, and then the delegate data type. That means
* that if the delegate data type is any larger than 6 bytes then it's a net
* loss on index size to use it. So you should avoid longs and uuids. Maybe
* even forbid them entirely.
*/
public class WikibaseInlineUriFactory extends InlineURIFactory {
public WikibaseInlineUriFactory() {
// TODO lookup wikibase host and default to wikidata
final WikibaseUris uris = WikibaseUris.WIKIDATA;

/*
* Order matters here because some of these are prefixes of each other.
*/
addHandler(new ValuePropertiesInlineUriHandler(uris.qualifier() + "P"));
addHandler(new InlineUnsignedIntegerURIHandler(uris.qualifier() + "Q"));
addHandler(new ValuePropertiesInlineUriHandler(uris.value() + "P"));
addHandler(new InlineUnsignedIntegerURIHandler(uris.value() + "Q"));
addHandler(new UndecoratedUuidInlineUriHandler(uris.value()));
/*
* We don't use WikibaseStyleStatementInlineUriHandler because it makes
* things worse!
*/
addHandler(new InlineUnsignedIntegerURIHandler(uris.truthy() + "P"));
addHandler(new InlineUnsignedIntegerURIHandler(uris.truthy() + "Q"));
addHandler(new InlineUnsignedIntegerURIHandler(uris.entity() + "P"));
addHandler(new InlineUnsignedIntegerURIHandler(uris.entity() + "Q"));

// These aren't part of wikibase but are common in wikidata
addHandler(new NormalizingInlineUriHandler(new TrailingSlashRemovingInlineUriHandler(
new InlineUnsignedIntegerURIHandler(CommonValues.VIAF)), CommonValues.VIAF_HTTP));

/*
* Value nodes are inlined even though they are pretty big (uuids). It
* doesn't seem to affect performance either way.
*
* Statements can't be inlined without losing information or making them
* huge and bloating the index. We could probably rewrite them at the
* munger into something less-uuid-ish.
*
* References aren't uuids - they are sha1s or sha0s or something similarly
* 160 bits wide. 160 bits is too big to fit into a uuid so we can't inline
* them without bloating the index either.
*/
}
}
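
A back-of-the-envelope restatement of the size argument in the class comment above.
The byte counts come from that comment, not from measurements, and the class itself
is hypothetical.

/** Hypothetical example, not part of this commit. */
public class InlineSizeMath {
    public static void main(String[] args) {
        int unInlinedBytes = 9;          // 1 flags byte + 8 byte term identifier
        int inlineOverheadBytes = 1 + 2; // 1 flags byte + up to 2 uri prefix bytes
        int breakEvenDelegateBytes = unInlinedBytes - inlineOverheadBytes;
        // Anything wider than this (longs, uuids) is a net loss on index size.
        System.out.println("break-even delegate size: " + breakEvenDelegateBytes + " bytes");
    }
}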