
[ElasticsearchSink] OutOfMemory exception #293

Closed
chuwik opened this issue Nov 19, 2014 · 15 comments
@chuwik
Contributor

chuwik commented Nov 19, 2014

Hi,

I have a system using Serilog with the ElasticSearch sink, generating a big volume of logs (~45 million log entries / 24h) and sometimes at high frequency. The problem is that sometimes I get an OutOfMemory exception when I use the ElasticSearch sink; if I use the File sink, for instance, there's no problem. I don't have a clear repro from the code in my system, but I have observed it usually happens when there's a high frequency of logs being created.

I've been able to replicate the issue by tweaking the Serilog.PerformanceTests to use the ElasticSearch sink and more iterations:

static void Main()
{
    const int iterations = 10000000;
    var log = new LoggerConfiguration()
            .WriteTo.ElasticSearch(node: new Uri("http://localhost:9200"))
            .CreateLogger();

    for (var i = 0; i < iterations; i++)
    {
        log.Information("Running iteration {I:00.0} for {@J}!", i, new { Goal = "Speed" });
    }
}

It throws an OutOfMemoryException for me a bit past the 7,000,000th iteration. It works fine with the File sink.

Is there any configuration of the ElasticSearch sink that I'm missing and could use to handle this scenario (high volume/frequency)? Is there any documented "ceiling" for how much the sink can take? Or is it just a bug/improvement needed in the code?

Right now I have to default to enabling both the ElasticSearch and File sinks (the latter just for error & up) just to make sure I don't miss anything... which is quite annoying! Hopefully there'll be a fix for this :)

Thanks a lot

@nblumhardt
Member

Hi - this is caused by the sink falling behind. There's definitely an improvement that can be made here, but deciding what the hard queue size limit should be is a tricky call.

The cause of the perf problem is most likely the default batch size of 50 events, which was chosen with network traversal in mind, and to a lesser extent the 2-second buffering period. On localhost, try for example:

.WriteTo.ElasticSearch(
    node: new Uri("http://localhost:9200"),
    batchPostingLimit: 10000,
    period: TimeSpan.FromMilliseconds(1))

Let me know if this helps, and if you're interested in pursuing the "ceiling" let me know also - I'm not able to attempt it right now but opening a PR/discussion might move things forward on that front.

Cheers,
Nick

@chuwik
Contributor Author

chuwik commented Nov 19, 2014

Thanks!

I've tried your suggestion, and also larger and smaller batch sizes and periods, but unfortunately it still gives the same OutOfMemoryException around the 7,000,000th iteration. I'm happy to look into it more, but I'm afraid I don't know where to start looking for the cause. Could you give me some pointers, or do you have any other ideas about what could be causing it?

Cheers,
Enrique

@nblumhardt
Member

Hi again Enrique - any luck with this one (e.g. via the changes in #302)?

Regards,
Nick

@chuwik
Contributor Author

chuwik commented Jan 19, 2015

No, unfortunately the issue still repros. I've tried the latest ES and Serilog versions, with both the original config from this issue and the ones you proposed, as well as various values for "BatchPostingLimit" and "Period"; in all cases it still ends up throwing an OutOfMemoryException at around iteration 7,000,000.

Cheers,
Enrique

@mivano
Member

mivano commented Jan 19, 2015

I can reproduce the rapid increase in memory now - on Mono, admittedly, but it shows the same problem and will hit the ceiling at some point. I have no idea what to do about it; I looked for places where we might not be disposing things, but it all looks okay. Maybe @Mpdreamz has some ideas?
I also have no way to measure where the memory is being used inside the Xamarin tools - I'd have to use Visual Studio for that, I'm afraid. I can see that the queue in the PeriodicBatchingSink grows very quickly, so it looks like the sink simply can't write to Elasticsearch fast enough?

@gmarz
Contributor

gmarz commented Jan 20, 2015

@mivano That's correct. IIRC, last time I looked into this, it wasn't a memory leak or anything of that nature. It was simply a matter of events queuing faster than Elasticsearch/Serilog could process them, eventually leading to an OOM exception.

Short of queuing events to disk rather than memory, I'm not sure what else can be done?

@mivano
Member

mivano commented Jan 20, 2015

So either we optimize the Elasticsearch sink or we use another mechanism. However, I have no clue how to further optimize the ES sink: async doesn't help, it already uses bulk indexing, and I don't think it has very heavy dependencies. Playing around with the timeout and batch size might help a little. Note that you could run into similar issues with any other sink that is slow to submit its data to its service.

Another mechanism is to send the data to an external queue system like RabbitMQ. This offloads the internal queue, can store on disk if needed and be on another machine. You can fill ES from RabbitMQ using ES rivers or using Logstash.
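The offloading idea above could be sketched roughly like this. This is purely an illustrative sketch assuming the RabbitMQ.Client NuGet package (the "logs" queue name and localhost host are made up for the example); it is not code from any Serilog sink:

```csharp
// Hedged sketch: hand a formatted log event off to RabbitMQ instead of
// buffering it in-process, so a separate consumer (e.g. Logstash) can
// index it into Elasticsearch at its own pace.
using System.Text;
using RabbitMQ.Client;

class RabbitLogForwarder
{
    public static void Publish(string jsonEvent)
    {
        // Assumed broker location; in practice the connection and channel
        // would be long-lived rather than created per event.
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            // Durable queue: events survive a broker restart on disk.
            channel.QueueDeclare(queue: "logs", durable: true,
                                 exclusive: false, autoDelete: false,
                                 arguments: null);

            var body = Encoding.UTF8.GetBytes(jsonEvent);
            channel.BasicPublish(exchange: "", routingKey: "logs",
                                 basicProperties: null, body: body);
        }
    }
}
```

The point of the design is that the broker, not the application, absorbs bursts: its queue can spill to disk and live on another machine, as described above.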

@konste

konste commented Feb 9, 2015

It is always possible that an application generates logs faster than the Elasticsearch sink (or any sink, for that matter) can handle them, regardless of how well it is optimized. One important principle should hold in most cases: problems with logging should not affect the availability of the application (read: crash it). Ideally the logging subsystem should affect app performance (CPU and memory) as little as possible, but it definitely should not crash the app.

So let me formulate the problem differently: what can we do to the Elasticsearch sink to prevent excessive memory consumption, even if the price is dropped log events?
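One possible shape for such a safeguard - an illustrative sketch only, not part of Serilog or the Elasticsearch sink, with all names invented for the example - is a capacity-bounded buffer that drops events instead of growing without limit:

```csharp
// Hedged sketch: a bounded in-memory buffer for log events. When the
// consumer (the sink's shipping loop) falls behind, new events are
// counted and discarded rather than queued, keeping memory bounded.
using System.Collections.Concurrent;
using System.Threading;

class BoundedLogBuffer<T>
{
    readonly ConcurrentQueue<T> _queue = new ConcurrentQueue<T>();
    readonly int _capacity;
    long _dropped;

    public BoundedLogBuffer(int capacity)
    {
        _capacity = capacity;
    }

    // Returns false and records a drop instead of growing without bound.
    // Count is approximate under concurrency, which is acceptable here:
    // the goal is a rough ceiling, not an exact one.
    public bool TryEnqueue(T item)
    {
        if (_queue.Count >= _capacity)
        {
            Interlocked.Increment(ref _dropped);
            return false;
        }
        _queue.Enqueue(item);
        return true;
    }

    public bool TryDequeue(out T item)
    {
        return _queue.TryDequeue(out item);
    }

    // Exposing the drop count lets the sink report data loss via
    // SelfLog instead of failing silently.
    public long Dropped
    {
        get { return Interlocked.Read(ref _dropped); }
    }
}
```

The trade-off is exactly the one stated above: under sustained overload some events are lost, but the process stays alive.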

@nblumhardt
Member

I think PeriodicBatchingSink will already handle this in most (non-pathological?) cases; @gmarz helped drive out that behaviour a little while ago.

https://github.com/serilog/serilog/blob/dev/src/Serilog.FullNetFx/Sinks/PeriodicBatching/PeriodicBatchingSink.cs#L171 handles dropping the queue after a certain number of delivery failures.

Some improvement is obviously possible here, and a PR would be appreciated if anyone can formulate a reliable strategy that doesn't just limit the buffer size (which would artificially break the sink for successful high-throughput logging). Ideas?

@nblumhardt
Member

Just for context, in case it gets lost in the depths of this thread: we're talking about a test case that does nothing but pump 7,000,000 events through a log sink - imaginable, perhaps, but not really representative of the behavior of even broken apps (an app doing this unexpectedly would most likely die for other reasons before Serilog did!). Cheers :-)

@chuwik
Contributor Author

chuwik commented Feb 18, 2015

While it's a pretty extreme case, I have seen this problem happen in a production system with very high volumes of logs. In all fairness there were way too many, and reducing them to a (more) sensible amount made the issue go away. I'd agree with @konste that dropping logs is better than having the app crash, even if this volume is difficult to reach (yet still possible).

Maybe #391 could be a good workaround for this, using a file as an intermediary in scenarios where there are just too many logs?

@mookid8000
Contributor

With #391 and a configuration like this:

var node = new Uri("http://my-elastic-url:9200/");
var options = new ElasticsearchSinkOptions(node)
{
    BufferBaseFilename = @"c:\somedir\somewhere\logs",
};
config.WriteTo.Elasticsearch(options);

the original Elasticsearch sink will be bypassed, and the logging will actually be done by a RollingFileSink with an ElasticSearchJsonFormatter. In the background, an ElasticSearchLogShipper will then send batches of 100 log events to Elasticsearch as fast as it can.

With this solution, I believe it's only disk space that can turn out to be a problem ;)

@mivano
Member

mivano commented Apr 1, 2015

Let's close this one, as durability support is now in the new Elasticsearch sink as described in #394 and will be released soon.

@mivano mivano closed this as completed Apr 1, 2015
@mookid8000
Contributor

👍

@chuwik
Contributor Author

chuwik commented Apr 1, 2015

Sounds good!
