
InfluxDBStore: point batching during Collect is suboptimal #132

Closed
slimsag opened this issue Apr 5, 2016 · 2 comments

slimsag commented Apr 5, 2016

Issue

Ideally, InfluxDBStore.Collect is as fast as reasonably possible. Right now, with examples/cmd/webapp-influxdb, I'm seeing Collect times in the range of 60-200ms (just eyeballing it, so I could be off by a bit).

If Collect cannot complete in under 50ms we lose trace data, because we cannot let trace data build up in memory forever (a memory leak), nor can Collect block pending HTTP requests. 50ms is, ideally, an upper bound for Collect (hopefully most calls are much quicker).

To measure this, I've added some hacky timing debug information to influxdb_store.go and changed the webapp-influxdb command. You can try my test branch issue131, or see my changes here: https://github.com/sourcegraph/appdash/compare/issue131 (note: this branch is just PRs #127 and #131 merged, then f552611 applied on top).
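
The instrumentation is nothing fancy, just wall-clock timing around the expensive call inside Collect; roughly the following pattern (a sketch of the idea, not the exact diff on the branch):

    start := time.Now()

    // ... build the point and BatchPoints as usual ...

    writeStart := time.Now()
    _, writeErr := in.con.Write(bps)
    fmt.Printf("InfluxDBStore.Collect -> in.con.Write took %v\n", time.Since(writeStart))
    fmt.Printf("InfluxDBStore.Collect -> took %v\n", time.Since(start))
    if writeErr != nil {
        return writeErr
    }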

Reproducing

Run the example app cleanly:

go install ./examples/cmd/webapp-influxdb/ && rm -rf ~/.influxdb/ && webapp-influxdb

Then, using the vegeta HTTP load testing tool, perform 1 HTTP request/sec for 8s:

echo "GET http://localhost:8699/" | vegeta attack -duration=8s -rate=1 | vegeta report

You should observe some logs that look like:

InfluxDBStore.Collect -> in.con.Write took 364.948424ms
InfluxDBStore.Collect -> took 367.329577ms
appdash: 2016/04/05 10:55:54 ChunkedCollector: queue entirely dropped (trace data will be missing)
appdash: 2016/04/05 10:55:54 ChunkedCollector: queueSize:3 queueSizeBytes:2133

...

InfluxDBStore.Collect -> in.con.Write took 65.655258ms
InfluxDBStore.Collect -> took 68.063186ms
appdash: 2016/04/05 10:56:00 ChunkedCollector: queue entirely dropped (trace data will be missing)
appdash: 2016/04/05 10:56:00 ChunkedCollector: queueSize:3 queueSizeBytes:3440

Possible solution

Note that in.con.Write accounts for most of the time spent during Collect; i.e. the Collect function itself is not very expensive, but writing to InfluxDB via in.con.Write is!

I think this is because Collect is inherently a very small operation: at most it writes a single InfluxDB data point. Consider our code:

    // A single point represents one span.
    pts := []influxDBClient.Point{*p}
    bps := influxDBClient.BatchPoints{
        Points:   pts,
        Database: in.dbName,
    }
    _, writeErr := in.con.Write(bps)
    if writeErr != nil {
        return writeErr
    }
    return nil

We only write a single point to InfluxDB per call, and this becomes very expensive because InfluxDB does not handle small writes efficiently: it adds a large overhead (around 50-200ms) to each one. However, in my tests InfluxDB can write a very large number of points (500+, i.e. a batch) in almost the same amount of time (50-200ms).
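
For comparison, the batched version of the write above would only differ in how many points go into the slice; something along these lines (pendingPoints is a hypothetical name for the queued points):

    // Many points, one Write: the per-request overhead is paid once per batch.
    bps := influxDBClient.BatchPoints{
        Points:   pendingPoints, // everything queued since the last flush
        Database: in.dbName,
    }
    if _, err := in.con.Write(bps); err != nil {
        return err
    }
    pendingPoints = pendingPoints[:0] // keep the backing array for the next batch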

I think the solution here is to make InfluxDBStore.Collect append to an internal slice, so that it queues up an entire batch of points and then, after some period of time, writes them to InfluxDB from a background goroutine (a rough sketch follows the list below). Important aspects would be:

  • Allow the flush interval, maximum queue size, etc. to be configurable.
  • Ensure that the queue cannot grow forever, causing memory leaks.
  • For inspiration, see ChunkedCollector here
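
To make the shape of this more concrete, here is a minimal sketch of such a batcher. The pointBatcher type, its field names, and the import path/alias for the InfluxDB client are assumptions for illustration; error handling and flush-on-shutdown are omitted. It only shows the bounded-queue-plus-background-flush idea:

    import (
        "log"
        "sync"
        "time"

        influxDBClient "github.com/influxdata/influxdb/client" // assumed import path/alias
    )

    // pointBatcher is a hypothetical helper: Collect appends points to it instead
    // of calling in.con.Write directly, and a background goroutine flushes batches.
    type pointBatcher struct {
        mu       sync.Mutex
        pending  []influxDBClient.Point
        maxQueue int                                // bound on queued points (configurable)
        interval time.Duration                      // flush interval (configurable)
        flush    func([]influxDBClient.Point) error // e.g. wraps in.con.Write with one BatchPoints
    }

    // add queues one point; it drops the point if the queue is full so memory
    // cannot grow forever (same idea as ChunkedCollector dropping its queue).
    func (b *pointBatcher) add(p influxDBClient.Point) {
        b.mu.Lock()
        defer b.mu.Unlock()
        if len(b.pending) >= b.maxQueue {
            log.Println("pointBatcher: queue full, dropping point")
            return
        }
        b.pending = append(b.pending, p)
    }

    // loop runs in a background goroutine, flushing whatever has accumulated
    // every interval so Collect itself never blocks on InfluxDB.
    func (b *pointBatcher) loop() {
        for range time.Tick(b.interval) {
            b.mu.Lock()
            batch := b.pending
            b.pending = nil
            b.mu.Unlock()
            if len(batch) == 0 {
                continue
            }
            if err := b.flush(batch); err != nil {
                log.Println("pointBatcher: flush failed:", err)
            }
        }
    }

Collect would then reduce to building the point and calling add, and the flush function would build a single BatchPoints from the whole slice, exactly like the batched write shown earlier.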

slimsag commented Apr 18, 2016

#139 made a huge difference for us: prior to that change, as little as 1 req/sec caused ChunkedCollector to go over its 50ms threshold and spill trace data. I've just confirmed that after #139 we can handle 75 req/sec before hitting that threshold. A 75x improvement, good job @chris-ramon!

slimsag commented Apr 18, 2016

In the influx-batching branch I added some very naive point batching that writes 8092 points per batch instead of one. This bumped us up to 200-500 req/sec. I don't have an accurate number because various system spikes pushed it above 50ms, but at one point I was able to sustain 500 req/sec without spilling over for a few seconds.
