
InfluxDBStore: point batching during Collect is suboptimal #132

Closed
slimsag opened this issue Apr 5, 2016 · 2 comments

slimsag commented Apr 5, 2016

Issue

Ideally, InfluxDBStore.Collect is as fast as reasonably possible. Right now, with examples/cmd/webapp-influxdb, I'm seeing Collect times in the range of 60-200ms (just eyeballing it, so I could be off by a bit).

If Collect cannot complete in under 50ms we lose trace data, because we cannot let trace data build up in memory forever (a memory leak), nor can Collect block pending HTTP requests. 50ms is, ideally, an upper bound for Collect (hopefully most calls are much quicker).

To measure this, I've added some hacky timing debug information to influxdb_store.go and changed the webapp-influxdb command. You can try my test branch issue131, or see my changes here: https://github.com/sourcegraph/appdash/compare/issue131 (note: this branch is just PRs #127 and #131 merged, then f552611 applied on top).
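
The instrumentation is nothing fancy, just wall-clock timing around the expensive call inside Collect; roughly the following pattern (a sketch of the idea, not the exact diff on the branch):

    start := time.Now()

    // ... build the point and BatchPoints as usual ...

    writeStart := time.Now()
    _, writeErr := in.con.Write(bps)
    fmt.Printf("InfluxDBStore.Collect -> in.con.Write took %v\n", time.Since(writeStart))
    fmt.Printf("InfluxDBStore.Collect -> took %v\n", time.Since(start))
    if writeErr != nil {
        return writeErr
    }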

Reproducing

Run the example app cleanly:

go install ./examples/cmd/webapp-influxdb/ && rm -rf ~/.influxdb/ && webapp-influxdb

Then, using the vegeta HTTP load testing tool, perform 1 HTTP request/sec for 8s:

echo "GET http://localhost:8699/" | vegeta attack -duration=8s -rate=1 | vegeta report

You should observe some logs that look like:

InfluxDBStore.Collect -> in.con.Write took 364.948424ms
InfluxDBStore.Collect -> took 367.329577ms
appdash: 2016/04/05 10:55:54 ChunkedCollector: queue entirely dropped (trace data will be missing)
appdash: 2016/04/05 10:55:54 ChunkedCollector: queueSize:3 queueSizeBytes:2133

...

InfluxDBStore.Collect -> in.con.Write took 65.655258ms
InfluxDBStore.Collect -> took 68.063186ms
appdash: 2016/04/05 10:56:00 ChunkedCollector: queue entirely dropped (trace data will be missing)
appdash: 2016/04/05 10:56:00 ChunkedCollector: queueSize:3 queueSizeBytes:3440

Possible solution

Note that in.con.Write accounts for most of the time spent during Collect; i.e. the Collect function itself is not very expensive, but writing to InfluxDB via in.con.Write is!

I think this is because Collect is inherently a very small operation: at most it writes a single InfluxDB data point. Consider our code:

    // A single point represents one span.
    pts := []influxDBClient.Point{*p}
    bps := influxDBClient.BatchPoints{
        Points:   pts,
        Database: in.dbName,
    }
    _, writeErr := in.con.Write(bps)
    if writeErr != nil {
        return writeErr
    }
    return nil

We only write a single point to InfluxDB per call, and this becomes very expensive because InfluxDB does not handle small writes efficiently: it adds a large overhead (around 50-200ms) to each one. However, in my tests InfluxDB can write a very large number of points (500+, i.e. a batch) in almost the same amount of time (50-200ms).
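
For comparison, the batched version of the write above would only differ in how many points go into the slice; something along these lines (pendingPoints is a hypothetical name for the queued points):

    // Many points, one Write: the per-request overhead is paid once per batch.
    bps := influxDBClient.BatchPoints{
        Points:   pendingPoints, // everything queued since the last flush
        Database: in.dbName,
    }
    if _, err := in.con.Write(bps); err != nil {
        return err
    }
    pendingPoints = pendingPoints[:0] // keep the backing array for the next batch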

I think the solution here is to make InfluxDBStore.Collect append to an internal slice, so that it queues up an entire batch of points and then, after some period of time, writes them to InfluxDB from a background goroutine (a rough sketch follows the list below). Important aspects would be:

  • Allow the flush interval, maximum queue size, etc. to be configurable.
  • Ensure that the queue cannot grow forever, causing memory leaks.
  • For inspiration, see ChunkedCollector here
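
To make the shape of this more concrete, here is a minimal sketch of such a batcher. The pointBatcher type, its field names, and the import path/alias for the InfluxDB client are assumptions for illustration; error handling and flush-on-shutdown are omitted. It only shows the bounded-queue-plus-background-flush idea:

    import (
        "log"
        "sync"
        "time"

        influxDBClient "github.com/influxdata/influxdb/client" // assumed import path/alias
    )

    // pointBatcher is a hypothetical helper: Collect appends points to it instead
    // of calling in.con.Write directly, and a background goroutine flushes batches.
    type pointBatcher struct {
        mu       sync.Mutex
        pending  []influxDBClient.Point
        maxQueue int                                // bound on queued points (configurable)
        interval time.Duration                      // flush interval (configurable)
        flush    func([]influxDBClient.Point) error // e.g. wraps in.con.Write with one BatchPoints
    }

    // add queues one point; it drops the point if the queue is full so memory
    // cannot grow forever (same idea as ChunkedCollector dropping its queue).
    func (b *pointBatcher) add(p influxDBClient.Point) {
        b.mu.Lock()
        defer b.mu.Unlock()
        if len(b.pending) >= b.maxQueue {
            log.Println("pointBatcher: queue full, dropping point")
            return
        }
        b.pending = append(b.pending, p)
    }

    // loop runs in a background goroutine, flushing whatever has accumulated
    // every interval so Collect itself never blocks on InfluxDB.
    func (b *pointBatcher) loop() {
        for range time.Tick(b.interval) {
            b.mu.Lock()
            batch := b.pending
            b.pending = nil
            b.mu.Unlock()
            if len(batch) == 0 {
                continue
            }
            if err := b.flush(batch); err != nil {
                log.Println("pointBatcher: flush failed:", err)
            }
        }
    }

Collect would then reduce to building the point and calling add, and the flush function would build a single BatchPoints from the whole slice, exactly like the batched write shown earlier.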

slimsag commented Apr 18, 2016

#139 made a huge difference for us: prior to that change, as little as 1 req/sec caused ChunkedCollector to go over its 50ms threshold and spill trace data. I've just confirmed that after #139 we can handle 75 req/sec before hitting that threshold. A 75x improvement, good job @chris-ramon!

slimsag commented Apr 18, 2016

In the influx-batching branch I added some very naive point batching that writes 8092 points per batch instead of one. This bumped us up to 200-500 req/sec. I don't have an accurate number because various system spikes pushed it above 50ms, but at one point I was able to sustain 500 req/sec without spilling over for a few seconds.
