Bulk insertion timeout #25

Closed
miklosbarabas opened this issue May 22, 2017 · 15 comments

miklosbarabas commented May 22, 2017

Hi!

I am running monstache with HA mode on, which goes fine for the first few minutes, but then:

ERROR 2017/05/22 13:59:53 elastic: bulk processor "monstache" failed but will retry: Post http://elastichost:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It seems like it cannot reconnect to the cluster. The Elasticsearch log only shows that it was updating, with no errors.

Any idea why this could happen? Is there any timeout in monstache that can be set for the Elasticsearch cluster (like the MongoDB timeout options)?

Thanks for the help in advance!
M

rwynn (Owner) commented May 22, 2017

@miklosbarabas,
Did a quick test with

./monstache --cluster-name c1 --verbose

Ran 2 processes like that, and it looked like when I killed one, the other took over and was still able to connect and send bulk requests.

There is currently a hard-coded 10s timeout on the HTTP client used for Elasticsearch:

https://github.com/rwynn/monstache/blob/master/monstache.go#L1101

I'm not sure if that's the problem or not. Do you also see this issue on a simple local install of Elasticsearch?
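
For reference, the wiring is roughly the sketch below: a plain net/http client with a Timeout handed to the Elasticsearch client. This is only a minimal sketch assuming the gopkg.in/olivere/elastic.v3 client API (monstache may use a different major version of that library); the 10s value mirrors the hard-coded timeout mentioned above, and elastichost is just the placeholder from your error message.

// Minimal sketch: how an http.Client timeout ends up governing bulk requests.
package main

import (
	"log"
	"net/http"
	"time"

	elastic "gopkg.in/olivere/elastic.v3"
)

func main() {
	httpClient := &http.Client{
		// Requests that do not complete within this window fail with
		// "net/http: request canceled (Client.Timeout exceeded ...)".
		Timeout: 10 * time.Second,
	}
	client, err := elastic.NewClient(
		elastic.SetURL("http://elastichost:9200"),
		elastic.SetHttpClient(httpClient),
	)
	if err != nil {
		log.Fatalln(err)
	}
	_ = client // the bulk processor issues its requests through this client
}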

rwynn (Owner) commented May 24, 2017

@miklosbarabas,

Just published a new version with a configurable timeout. It now defaults to 60 seconds.

miklosbarabas (Author) commented May 26, 2017

@rwynn

Sorry for the late response, and thank you for your fast reaction and for the new release!

Some additions for the scenario:
Using ES 2.4.4 with 3-5 nodes in the cluster and running monstache on the node boxes (3 workers per node), the above-mentioned error occurred regardless of resources (though it started to happen earlier on smaller AWS T2 boxes).
The following configuration was used:

...
elasticsearch-max-docs = 1000
elasticsearch-max-seconds = 1
elasticsearch-max-bytes = 32768 # I tried with the default as well..
elasticsearch-retry = true
gtm-channel-size = 200
direct-read-namespaces = ["mydb.mycoll"] # here we use multiple db and collection
#some mappings
...

Also, the error was happening during the direct-read-namespaces stage (new oplog entries weren't synced at this point) and somehow left the ES cluster stuck: no errors in the logs, the REST endpoint seemingly up, though apps could not connect over the transport protocol (9300) anymore. The metrics showed normal resource consumption and the cluster was in green state, but it was not really fully usable...

Finally, after rereading the monstache documentation (:+1:), I found the thread pool link.
Adding the following properties to elasticsearch.yml made the issue go away:

threadpool:
    index:
        type: fixed
        size: 30
        queue_size: 1000
    bulk:
        type: fixed
        size: 30
        queue_size: 1000

So far the above-mentioned scenario is working properly. I could even roll out your new version without any issue, with the other hosts' workers picking up the job!
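
As a side note, one way to keep an eye on those thread pools is the _cat/thread_pool API, whose default columns include the active, queue, and rejected counts for the bulk and index pools. Below is a minimal Go sketch; the elastichost:9200 address is just the placeholder used earlier, and any HTTP client or curl would do the same.

// Minimal sketch: poll _cat/thread_pool to watch for queue growth or rejections.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://elastichost:9200/_cat/thread_pool?v")
	if err != nil {
		log.Fatalln(err)
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println(string(body)) // includes bulk.active, bulk.queue, bulk.rejected, etc.
}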

miklosbarabas (Author) commented May 29, 2017

@rwynn I have some other questions, hope you don't mind if I put them down here as well:

  • Any idea why the direct-read-namespaces phase sometimes slows down so much that it is indexing only 1-2 documents/s? (At those times, the stats show that only 1-2 requests are queued.)
  • Checking the monstache.monstache collection in Mongo, the worker ts field is sometimes updated in a strange way: hopping back and forth in time (like 2017-Feb-4 12.00, then 2015-May-12 14.00, then 2017-Jan-3 11.00, with an inc of 0). Why is this happening?
  • Also, could you please say a few words about how the elasticsearch-retry exponential backoff works?

Many thanks!

rwynn (Owner) commented May 31, 2017

@miklosbarabas,
can you explain a little more about the slowdown you are seeing? I tried direct reads on 3 Mongo collections with 100K documents each, and monstache indexed them in 30 seconds. You can try timing this with a command like:

time ./monstache --direct-read-namespace test.a --direct-read-namespace test.b --direct-read-namespace test.c --exit-after-direct-reads

I pushed a new release which should fix the ts field in the monstache collection. The problem was that it was recording timestamps from direct reads, which it should not. Now the ts should only progress forward and come from the oplog data.

I also added some documentation about retry. Hopefully, this will explain it.
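
Roughly speaking, each failed bulk attempt waits about twice as long as the previous one before the next retry, up to a cap. The sketch below is only a generic illustration of that pattern, not monstache's actual retry code; the 1s initial wait and 30s cap are made-up values.

// Generic exponential-backoff sketch: 1s, 2s, 4s, ... between attempts, capped.
package main

import (
	"errors"
	"fmt"
	"time"
)

func retryWithBackoff(attempts int, op func() error) error {
	wait := time.Second
	const maxWait = 30 * time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil // the operation eventually succeeded
		}
		if i == attempts-1 {
			break // no more retries left
		}
		time.Sleep(wait)
		wait *= 2 // double the wait before the next attempt
		if wait > maxWait {
			wait = maxWait
		}
	}
	return err
}

func main() {
	err := retryWithBackoff(5, func() error {
		// stand-in for a bulk request to Elasticsearch
		return errors.New("bulk failed")
	})
	fmt.Println(err)
}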

Have you tried running the direct reads only via a cron job and not in the process with clustering and workers? The direct reads are good for catching up periodically. The long-running clustered processes can tail the oplog without doing direct reads. Another single, non-clustered process could be run periodically to do the direct reads and exit.

miklosbarabas (Author) commented

@rwynn
Thanks for the new release!

It happens after around the same number of documents indexed, which is ~3 million, and it only happens for this collection, which has ~6 million documents.
MongoDB metrics are fine as well as the ES metrics. No errors in the logs.

I tried running only one monstache process, in non-worker and non-HA mode, just to direct-read the data and exit when it's done.
I experienced the same slowdown.

I even tried increasing the gtm values like this:

[gtm-settings]
channel-size = 512
buffer-size = 256
buffer-duration = "4000ms"

with
elasticsearch-max-seconds = 5
Nothing changed. The slowdown happens at around the same point of the direct read (after ~3 million docs).

Following the trace file, it looks like it really is inserting mostly 1-3 documents per request, but there are times when one or two larger (1000-document) bulks appear.

miklosbarabas (Author) commented Jun 1, 2017

Also, after enabling index-oplog-time, I see these in the trace:
"_oplog_date":"1970/01/01 00:00:00","_oplog_ts":0
for some of the indexed documents.

The application stops after a while, so the direct read finished its job, but comparing the indexed document count in ES with the count of the collection in MongoDB, it still seems to be missing ~1 million docs.
Based on the stats file, only ~1500 failed before it finished the direct read.

rwynn (Owner) commented Jun 1, 2017

The oplog time is not always available for direct reads. If the _id of the document is an ObjectID, then it can be determined, because of the nature of ObjectIDs (they embed a creation timestamp). If not, the timestamp will just be 0.

https://github.com/rwynn/gtm/blob/master/gtm.go#L465
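
In other words, an ObjectID embeds its creation time in its first 4 bytes, so a timestamp can be recovered from the _id itself when the oplog offers nothing. A minimal sketch using the gopkg.in/mgo.v2/bson package (assuming that is the driver in play here):

// Minimal sketch: read the creation time embedded in an ObjectID.
package main

import (
	"fmt"

	"gopkg.in/mgo.v2/bson"
)

func main() {
	id := bson.NewObjectId()
	fmt.Println(id.Time()) // creation time recovered from the _id itself
}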

rwynn (Owner) commented Jun 1, 2017

My current guess is that the slowdown may be occurring because of the nature of the direct read query. It uses a $natural sort, which might not be optimal. I can try to change this to an _id sort to see if it then uses the index to speed things up.

https://github.com/rwynn/gtm/blob/master/gtm.go#L448
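
For comparison, the two scans would look roughly like the sketch below in mgo terms (assuming the gopkg.in/mgo.v2 driver gtm is built on; the connection string, database, and collection names are placeholders). A $natural sort walks documents in storage order, while sorting on _id can let the scan use the _id index.

// Minimal sketch: $natural-ordered scan vs. an _id-ordered scan.
package main

import (
	"log"

	mgo "gopkg.in/mgo.v2"
)

func main() {
	session, err := mgo.Dial("mongodb://localhost:27017")
	if err != nil {
		log.Fatalln(err)
	}
	defer session.Close()
	coll := session.DB("mydb").C("mycoll")

	naturalIter := coll.Find(nil).Sort("$natural").Iter() // current behavior
	idIter := coll.Find(nil).Sort("_id").Iter()           // proposed change
	_, _ = naturalIter, idIter
}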

miklosbarabas (Author) commented

I could give it a try straight away if you release one based on the _id sort!

rwynn (Owner) commented Jun 2, 2017

@miklosbarabas, just pushed a new release. Give that one a try and let me know if the problem persists. Thanks for reporting this issue.

rwynn (Owner) commented Jun 2, 2017

The new release includes an updated version of the github.com/rwynn/gtm library. That is where the changes related to the slowness issue were made.

rwynn (Owner) commented Jun 3, 2017

@miklosbarabas, any luck with the latest release?

miklosbarabas (Author) commented Jun 3, 2017 via email

rwynn (Owner) commented Jun 3, 2017

Glad to hear it helped. You might be able to squeeze out a little more performance with the workers option and multiple processes. That will increase the overall query load on MongoDB by a factor of N workers, but will decrease the Elasticsearch indexing load of each process by (N-1)/N; for example, with 4 workers each process indexes roughly 1/4 of the documents while MongoDB sees about 4x the query load. Depending on how well Mongo can handle the queries, it might give you an overall boost.

rwynn closed this as completed Jun 21, 2017