Bulk insertion timeout #25

Closed
miklosbarabas opened this issue May 22, 2017 · 15 comments

miklosbarabas commented May 22, 2017

Hi!

I am running monstache with HA mode on, which goes fine for the first few minutes, but then:

ERROR 2017/05/22 13:59:53 elastic: bulk processor "monstache" failed but will retry: Post http://elastichost:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It seems like it cannot reconnect to the cluster. The Elasticsearch log only shows that it was updating, with no errors.

Any idea why this could happen? Is there any timeout in monstache that can be set for the Elasticsearch cluster (like the MongoDB timeout options)?

Thanks for the help in advance!
M

rwynn (Owner) commented May 22, 2017

@miklosbarabas,
Did a quick test with

./monstache --cluster-name c1 --verbose

Ran 2 processes like that, and it looked like when I killed one, the other took over and was still able to connect and send bulk requests.

There is currently a hard-coded 10s timeout on the HTTP client used for Elasticsearch:

https://github.com/rwynn/monstache/blob/master/monstache.go#L1101

I'm not sure if that's the problem or not. Do you also see this issue on a simple local install of Elasticsearch?
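
For reference, the wiring is roughly the sketch below: a plain net/http client with a Timeout handed to the Elasticsearch client. This is only a minimal sketch assuming the gopkg.in/olivere/elastic.v3 client API (monstache may use a different major version of that library); the 10s value mirrors the hard-coded timeout mentioned above, and elastichost is just the placeholder from your error message.

// Minimal sketch: how an http.Client timeout ends up governing bulk requests.
package main

import (
	"log"
	"net/http"
	"time"

	elastic "gopkg.in/olivere/elastic.v3"
)

func main() {
	httpClient := &http.Client{
		// Requests that do not complete within this window fail with
		// "net/http: request canceled (Client.Timeout exceeded ...)".
		Timeout: 10 * time.Second,
	}
	client, err := elastic.NewClient(
		elastic.SetURL("http://elastichost:9200"),
		elastic.SetHttpClient(httpClient),
	)
	if err != nil {
		log.Fatalln(err)
	}
	_ = client // the bulk processor issues its requests through this client
}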

rwynn (Owner) commented May 24, 2017

@miklosbarabas,

Just published a new version with a configurable timeout. It now defaults to 60 seconds.

miklosbarabas (Author) commented May 26, 2017

@rwynn

Sorry for the late response, and thank you for your fast reaction and for the new release!

Some additions for the scenario:
Using ES 2.4.4 with 3-5 nodes in the cluster and running monstache on the node boxes (3 workers per node), the above-mentioned error occurred regardless of resources (though it started to happen earlier on smaller AWS T2 boxes).
The following configuration was used:

...
elasticsearch-max-docs = 1000
elasticsearch-max-seconds = 1
elasticsearch-max-bytes = 32768 # I tried with the default as well..
elasticsearch-retry = true
gtm-channel-size = 200
direct-read-namespaces = ["mydb.mycoll"] # here we use multiple db and collection
#some mappings
...

Also, the error was happening during the direct-read-namespaces stage (new oplog entries weren't synced at this point) and somehow left the ES cluster stuck: no errors in the logs, the REST endpoint seemingly up, though apps could not connect over the transport protocol (9300) anymore. The metrics showed normal resource consumption and the cluster was in green state, but it was not really fully usable...

Finally, after rereading the monstache documentation (:+1:), I found the thread pool link.
Adding the following properties to elasticsearch.yml made the issue go away:

threadpool:
    index:
        type: fixed
        size: 30
        queue_size: 1000
    bulk:
        type: fixed
        size: 30
        queue_size: 1000

So far the above-mentioned scenario is working properly. I could even roll out your new version without any issue, with the other hosts' workers picking up the job!
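
As a side note, one way to keep an eye on those thread pools is the _cat/thread_pool API, whose default columns include the active, queue, and rejected counts for the bulk and index pools. Below is a minimal Go sketch; the elastichost:9200 address is just the placeholder used earlier, and any HTTP client or curl would do the same.

// Minimal sketch: poll _cat/thread_pool to watch for queue growth or rejections.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://elastichost:9200/_cat/thread_pool?v")
	if err != nil {
		log.Fatalln(err)
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println(string(body)) // includes bulk.active, bulk.queue, bulk.rejected, etc.
}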

miklosbarabas (Author) commented May 29, 2017

@rwynn I have some other questions, hope you don't mind if I put them down here as well:

  • Any idea why the direct-read-namespaces phase sometimes slows down so much that it is indexing only 1-2 documents/s? (At those times, the stats show that only 1-2 requests are queued.)
  • Checking the monstache.monstache collection in Mongo, the worker ts field is sometimes updated in a strange way: hopping back and forth in time (like 2017-Feb-4 12.00, then 2015-May-12 14.00, then 2017-Jan-3 11.00, with an inc of 0). Why is this happening?
  • Also, could you please say a few words about how the elasticsearch-retry exponential backoff works?

Many thanks!

rwynn (Owner) commented May 31, 2017

@miklosbarabas,
can you explain a little more about the slowdown you are seeing? I tried direct reads on 3 Mongo collections with 100K documents each, and monstache indexed them in 30 seconds. You can try timing this with a command like:

time ./monstache --direct-read-namespace test.a --direct-read-namespace test.b --direct-read-namespace test.c --exit-after-direct-reads

I pushed a new release which should fix the ts field in the monstache collection. The problem was that it was recording timestamps from direct reads, which it should not. Now the ts should only progress forward and come from the oplog data.

I also added some documentation about retry. Hopefully, this will explain it.
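
Roughly speaking, each failed bulk attempt waits about twice as long as the previous one before the next retry, up to a cap. The sketch below is only a generic illustration of that pattern, not monstache's actual retry code; the 1s initial wait and 30s cap are made-up values.

// Generic exponential-backoff sketch: 1s, 2s, 4s, ... between attempts, capped.
package main

import (
	"errors"
	"fmt"
	"time"
)

func retryWithBackoff(attempts int, op func() error) error {
	wait := time.Second
	const maxWait = 30 * time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil // the operation eventually succeeded
		}
		if i == attempts-1 {
			break // no more retries left
		}
		time.Sleep(wait)
		wait *= 2 // double the wait before the next attempt
		if wait > maxWait {
			wait = maxWait
		}
	}
	return err
}

func main() {
	err := retryWithBackoff(5, func() error {
		// stand-in for a bulk request to Elasticsearch
		return errors.New("bulk failed")
	})
	fmt.Println(err)
}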

Have you tried running the direct reads only via a cron job and not in the process with clustering and workers? The direct reads are good for catching up periodically. The long-running clustered processes can tail the oplog without doing direct reads. Another single, non-clustered process could be run periodically to do the direct reads and exit.

miklosbarabas (Author) commented

@rwynn
Thanks for the new release!

It happens after around the same number of documents indexed, which is ~3 million, and it only happens for this collection, which has ~6 million documents.
MongoDB metrics are fine as well as the ES metrics. No errors in the logs.

I tried running only one monstache process, in non-worker and non-HA mode, just to direct-read the data and exit when it's done.
I experienced the same slowdown.

I even tried increasing the gtm values like this:

[gtm-settings]
channel-size = 512
buffer-size = 256
buffer-duration = "4000ms"

with
elasticsearch-max-seconds = 5
Nothing changed. The slowdown happens at around the same point of the direct read (after ~3 million docs).

Following the trace file, it looks like it really is inserting mostly 1-3 documents per request, but there are times when one or two larger (1000-document) bulks appear.

miklosbarabas (Author) commented Jun 1, 2017

Also, after enabling index-oplog-time, I see these in the trace:
"_oplog_date":"1970/01/01 00:00:00","_oplog_ts":0
for some of the indexed documents.

The application stops after a while, so the direct read finished its job, but comparing the indexed document count in ES with the count of the collection in MongoDB, it still seems to be missing ~1 million docs.
Based on the stats file, only ~1500 failed before it finished the direct read.

rwynn (Owner) commented Jun 1, 2017

The oplog time is not always available for direct reads. If the _id of the document is an ObjectID, then it can be determined, because of the nature of ObjectIDs (they embed a creation timestamp). If not, the timestamp will just be 0.

https://github.com/rwynn/gtm/blob/master/gtm.go#L465
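
In other words, an ObjectID embeds its creation time in its first 4 bytes, so a timestamp can be recovered from the _id itself when the oplog offers nothing. A minimal sketch using the gopkg.in/mgo.v2/bson package (assuming that is the driver in play here):

// Minimal sketch: read the creation time embedded in an ObjectID.
package main

import (
	"fmt"

	"gopkg.in/mgo.v2/bson"
)

func main() {
	id := bson.NewObjectId()
	fmt.Println(id.Time()) // creation time recovered from the _id itself
}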

rwynn (Owner) commented Jun 1, 2017

My current guess is that the slowdown may be occurring because of the nature of the direct read query. It uses a $natural sort, which might not be optimal. I can try to change this to an _id sort to see if it then uses the index to speed things up.

https://github.com/rwynn/gtm/blob/master/gtm.go#L448
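
For comparison, the two scans would look roughly like the sketch below in mgo terms (assuming the gopkg.in/mgo.v2 driver gtm is built on; the connection string, database, and collection names are placeholders). A $natural sort walks documents in storage order, while sorting on _id can let the scan use the _id index.

// Minimal sketch: $natural-ordered scan vs. an _id-ordered scan.
package main

import (
	"log"

	mgo "gopkg.in/mgo.v2"
)

func main() {
	session, err := mgo.Dial("mongodb://localhost:27017")
	if err != nil {
		log.Fatalln(err)
	}
	defer session.Close()
	coll := session.DB("mydb").C("mycoll")

	naturalIter := coll.Find(nil).Sort("$natural").Iter() // current behavior
	idIter := coll.Find(nil).Sort("_id").Iter()           // proposed change
	_, _ = naturalIter, idIter
}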

miklosbarabas (Author) commented

I could give it a try straight away if you release one based on the _id sort!

rwynn (Owner) commented Jun 2, 2017

@miklosbarabas, just pushed a new release. Give that one a try and let me know if the problem persists. Thanks for reporting this issue.

rwynn (Owner) commented Jun 2, 2017

The new release includes an updated version of the github.com/rwynn/gtm library. That is where the changes related to the slowness issue were made.

rwynn (Owner) commented Jun 3, 2017

@miklosbarabas, any luck with the latest release?

miklosbarabas (Author) commented Jun 3, 2017 via email

rwynn (Owner) commented Jun 3, 2017

Glad to hear it helped. You might be able to squeeze out a little more performance with the workers option and multiple processes. That will increase the overall query load on MongoDB by a factor of N workers, but will decrease the Elasticsearch indexing load of each process by (N-1)/N; for example, with 4 workers each process indexes roughly 1/4 of the documents while MongoDB sees about 4x the query load. Depending on how well Mongo can handle the queries, it might give you an overall boost.

rwynn closed this as completed Jun 21, 2017