Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow job execution [Race condition workaround - Thread.sleep 2s] #4456

Closed
jbguerraz opened this Issue Feb 5, 2019 · 8 comments

Comments

Projects
None yet
2 participants
@jbguerraz
Copy link
Contributor

jbguerraz commented Feb 5, 2019

Describe the bug
Job executions are taking 2 seconds (instead of 90ms for the very first execution).
c91d7e0 forces Rundeck to wait for 2 seconds if somehow the execution was not found (in our case, after very first executions, it systematically waits for 2 seconds, retries, gets the data, and finally, moves ahead).
Instead of a sleep, a lock mechanism would probably helps to wait the very minimum of time.

My Rundeck detail

  • Rundeck version: 3.0.13
  • install type: war
  • OS Name/version: alpine linux (k8s)
  • DB Type/version: postgres

To Reproduce
Steps to reproduce the behavior:

  1. Get a slow datasource (or, somehow, delay the writes) :) use postgres (probably other db as well) => we use a docker container with official postgres image to reproduce the issue
  2. Import job https://gist.github.com/jbguerraz/c88dfa1a4254c1709ba6e29bb15e1fd4#file-gistfile1-txt
  3. Run the job few several (tons of) times (actually, watch -n 0.1 "curl -X POST http://localhost:4440/api/16/job/the-job-id/executions\?authtoken\=thetoken" is very good at reproducing the issue)
  4. See the 2 seconds duration for some of the job executions

Expected behavior
The execution won't be blocked for 2 seconds (* needed retries) but instead will wait for the execution to be available for reading.

@jbguerraz

This comment has been minimized.

Copy link
Contributor Author

jbguerraz commented Feb 5, 2019

( related to #193 #190 )

@gschueler gschueler added the bug label Feb 6, 2019

@jbguerraz

This comment has been minimized.

Copy link
Contributor Author

jbguerraz commented Feb 6, 2019

We started today to tackle that one. Hopefully we will have something by tomorrow.

@jbguerraz

This comment has been minimized.

Copy link
Contributor Author

jbguerraz commented Feb 7, 2019

Actually it's not a race condition between "write" and "read" ("read" - get() - is always executed after "write" - save()).
The issue lives in the Hibernate usage. The save(flush:true) actually doesn't makes the data available to another transaction as said here grails/grails-core#10836 (comment)
We first implemented a locking mechanism; it helped since there is no more sleep. We moved from a 30 seconds pipeline (about 14 jobs) to a 1.4 second pipeline!!! , but, this locking stuff is not a proper fix (at the end, it's doing nothing more than a smaller sleep actually).

@jbguerraz

This comment has been minimized.

Copy link
Contributor Author

jbguerraz commented Feb 8, 2019

For the record, reproduced with MySQL 8 but not with MySQL 5.7

@jbguerraz

This comment has been minimized.

Copy link
Contributor Author

jbguerraz commented Feb 8, 2019

After discussing it with Hibernate community, it sounds like that issue is related to the way Hibernate is used by Rundeck more than an Hibernate/Oracle(/Postgres/MySQL 8) bug.

Since MySQL 5.7 saved our performances, for now, this issue is of less priority on our side

@jbguerraz

This comment has been minimized.

Copy link
Contributor Author

jbguerraz commented Feb 11, 2019

FTR: actually, MySQL 5.7 also fall into this execution not found case, but pretty rarely (8 times over about 3000 executions)

@gschueler

This comment has been minimized.

Copy link
Member

gschueler commented Feb 13, 2019

Another method might be: use "gorm events" to listen to the afterInsert event for Execution objects with a central bean. the ExecutionJob would query that bean if the execution is not found, and give it a handler to trigger it to proceed when the event is received (or immediately if it was already received)

@gschueler gschueler added this to the 3.0.17 milestone Feb 26, 2019

@gschueler gschueler closed this Feb 26, 2019

@jbguerraz

This comment has been minimized.

Copy link
Contributor Author

jbguerraz commented Feb 26, 2019

👍 thank you @carlosrfranco !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.