SPARQL_Update_Performance
Follow these tips for even faster updates.
We had reports of a performance slowdown for SPARQL UPDATE DELETE/INSERT WHERE operations:
DELETE {...}
INSERT {...}
WHERE {...}
For example, you can observe this using the following SPARQL UPDATE request against a data set with 100k or more instances of rdfs:label.
DELETE { ?s rdfs:label ?o } INSERT {?s rdfs:comment ?o } WHERE { ?s rdfs:label ?o }
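For reference, here is a minimal sketch of how such an UPDATE request can be issued through the openrdf Repository API (which Blazegraph's BigdataSailRepository implements). The repository setup and class name are assumptions for illustration, not part of the original report.
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.Update;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;

public class LabelToCommentUpdate {

    // Issue the DELETE/INSERT WHERE request over an already-configured Repository.
    public static void run(final Repository repo) throws Exception {
        final RepositoryConnection conn = repo.getConnection();
        try {
            final String updateStr =
                    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                    "DELETE { ?s rdfs:label ?o }\n" +
                    "INSERT { ?s rdfs:comment ?o }\n" +
                    "WHERE  { ?s rdfs:label ?o }";
            final Update update = conn.prepareUpdate(QueryLanguage.SPARQL, updateStr);
            update.execute(); // evaluates the WHERE clause, then applies DELETE and INSERT
        } finally {
            conn.close();
        }
    }
}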
Looking into the timing, we found that the time to insert or remove each statement was growing in proportion to the number of statements already added or removed in the connection. The actual timings are below. In the first log output, it took 3 seconds to process 10,000 statements. The performance is fairly flat for the next 20,000 statements. However, the latency of the operation then starts to grow very rapidly. By the last log output it was taking 97 seconds to process 10,000 statements!
[java] Added 10000 stmts for removal in 3 seconds (now 10000 stmts in total)
[java] Added 10000 stmts for removal in 3 seconds (now 20000 stmts in total)
[java] Added 10000 stmts for removal in 4 seconds (now 30000 stmts in total)
[java] Added 10000 stmts for removal in 9 seconds (now 40000 stmts in total)
[java] Added 10000 stmts for removal in 9 seconds (now 50000 stmts in total)
[java] Added 10000 stmts for removal in 12 seconds (now 60000 stmts in total)
[java] Added 10000 stmts for removal in 20 seconds (now 70000 stmts in total)
[java] Added 10000 stmts for removal in 26 seconds (now 80000 stmts in total)
[java] Added 10000 stmts for removal in 32 seconds (now 90000 stmts in total)
[java] Added 10000 stmts for removal in 39 seconds (now 100000 stmts in total)
[java] Added 10000 stmts for removal in 46 seconds (now 110000 stmts in total)
[java] Added 10000 stmts for removal in 53 seconds (now 120000 stmts in total)
[java] Added 10000 stmts for removal in 65 seconds (now 130000 stmts in total)
[java] Added 10000 stmts for removal in 74 seconds (now 140000 stmts in total)
[java] Added 10000 stmts for removal in 83 seconds (now 150000 stmts in total)
[java] Added 10000 stmts for removal in 90 seconds (now 160000 stmts in total)
[java] Added 10000 stmts for removal in 97 seconds (now 170000 stmts in total)
...
What’s going on here!?! The underlying BigdataSailConnection batches everything into a StatementBuffer object. The StatementBuffer is then incrementally flushed onto the backing indices every 10,000 statements (by default; see BigdataSail.Options.BUFFER_CAPACITY to change this parameter). The remove-statements code path is nearly identical. Both add statements and remove statements on the BigdataSailConnection are known to scale into billions of statements added or removed in a single operation. Scaling should not be worse than log-linear since it depends on the depth of the B+Tree indices (the cost of an index probe is proportional to the height of the B+Tree, which grows only logarithmically with the number of tuples in the index).
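As an aside, here is a minimal configuration sketch showing where BUFFER_CAPACITY can be raised or lowered when the Sail is created. The journal file path and surrounding setup are placeholders for illustration, not taken from the original text.
import java.util.Properties;
import com.bigdata.journal.Options;
import com.bigdata.rdf.sail.BigdataSail;

public class SailConfigSketch {

    // Create a BigdataSail whose StatementBuffer flushes every 50,000 statements.
    public static BigdataSail createSail() throws Exception {
        final Properties props = new Properties();
        // hypothetical journal file location for this sketch
        props.setProperty(Options.FILE, "/tmp/blazegraph.jnl");
        // raise the incremental flush threshold from the default of 10,000 statements
        props.setProperty(BigdataSail.Options.BUFFER_CAPACITY, "50000");
        final BigdataSail sail = new BigdataSail(props);
        sail.initialize();
        return sail;
    }
}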
The scaling behavior was immediately suggestive of a poor data structure choice: something where the cost of each access was proportional to the number of items already buffered.
Looking in our code, the relevant lines are:
// Evaluate the WHERE clause.
final MutableTupleQueryResult result = new MutableTupleQueryResult(
        ASTEvalHelper.evaluateTupleQuery(
                context.conn.getTripleStore(), astContainer,
                context.getQueryBindingSet()/* bindingSets */));

// Evaluate the DELETE clause (evaluation of the INSERT clause is similar).
final ASTConstructIterator itr = new ASTConstructIterator(context,
        context.conn.getTripleStore(), template, op.getWhereClause(),
        null, result);

while (itr.hasNext()) {

    final BigdataStatement stmt = itr.next();

    addOrRemoveStatement(context.conn.getSailConnection(), stmt,
            false/* insert */);

}
// Batch up individual add or remove statement calls.
void addOrRemoveStatement(final BigdataSailConnection conn,
        final BigdataStatement spo, final boolean insert) throws SailException {

    // ...

    if (insert) {
        conn.addStatement(s, p, o, contexts);
    } else {
        conn.removeStatements(s, p, o, contexts);
    }

}
Looking into our code (above), we found an openrdf MutableTupleQueryResult object. This object is used to buffer the solutions produced by the WHERE clause of the query. That buffer is then rewound and replayed twice: once for the DELETE clause and once more for the INSERT clause. So our suspicions naturally fell on the otherwise innocuous line:
final BigdataStatement stmt = itr.next();
That line is our ASTConstructIterator, which is known to be scalable. But in this case it is backed by the MutableTupleQueryResult class. So, looking into MutableTupleQueryResult we see:
private List bindingSets = new LinkedList();
Bingo!
MutableTupleQueryResult.next() is calling through to List.get(index):
public BindingSet next() {
    if (hasNext()) {
        BindingSet result = get(currentIndex);
        lastReturned = currentIndex;
        currentIndex++;
        return result;
    }
    throw new NoSuchElementException();
}

public BindingSet get(int index) {
    return bindingSets.get(index);
}
List.get(index) against a LinkedList does a linear scan from the nearer end of the list, so replaying the buffer through get(index) costs time that grows with the square of the number of buffered solutions. That explains the performance degradation we were seeing. A one-line change fixes this:
private List bindingSets = new ArrayList();
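To make the difference concrete, here is a small stand-alone comparison (not from the original investigation; absolute timings will vary by machine) that replays both list types through get(index), the same access pattern MutableTupleQueryResult uses.
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListScanDemo {

    // Replay a list through get(i): O(i) per call on a LinkedList, O(1) on an ArrayList.
    static long replayMillis(final List<Integer> list) {
        final long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        System.out.println("checksum: " + sum); // keep the loop from being optimized away
        return (System.nanoTime() - start) / 1000000;
    }

    public static void main(final String[] args) {
        final int n = 100000;
        final List<Integer> linked = new LinkedList<Integer>();
        final List<Integer> array = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {
            linked.add(i);
            array.add(i);
        }
        System.out.println("LinkedList replay: " + replayMillis(linked) + " ms");
        System.out.println("ArrayList  replay: " + replayMillis(array) + " ms");
    }
}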
Performance for the same SPARQL UPDATE request is now perfectly flat. The main cost was the MutableTupleQueryResult scan of the LinkedList. Blazegraph automatically and scalably buffers the data and incrementally evicts the mutations to the write cache and from there to the disk.
[java] Added 10000 stmts for removal in 3 seconds (now 10000 stmts in total)
[java] Added 10000 stmts for removal in 3 seconds (now 20000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 30000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 40000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 50000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 60000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 70000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 80000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 90000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 100000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 110000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 120000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 130000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 140000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 150000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 160000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 170000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 180000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 190000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 200000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 210000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 220000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 230000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 240000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 250000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 260000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 270000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 280000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 290000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 300000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 310000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 320000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 330000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 340000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 350000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 360000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 370000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 380000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 390000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 400000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 410000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 420000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 430000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 440000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 450000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 460000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 470000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 480000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 490000 stmts in total)
See BLZG-1404 for the workaround and a link to the corresponding ticket that we have raised with the openrdf project. This fix will be in our next major release, and is available in hot fix releases to customers with support subscriptions.