I think there is a race condition with limit_concurrency #224
Hey @hms, thanks for writing this up. I don't have a lot of experience with PostgreSQL but I think the whole transaction wrapping the 3 jobs being enqueued might be the problem.
This is what is being done and the basis of the concurrency limits. The unique key is precisely to avoid race conditions with multiple jobs that share a concurrency key being enqueued at the same time.
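As a rough illustration of that unique-key pattern, here is a toy in-memory sketch (all names are invented for illustration; this is not Solid Queue's actual schema or code): the first enqueue for a key inserts the semaphore row, and concurrent enqueues that lose the insert race fall back to decrementing it.

```ruby
# Hypothetical in-memory model of a semaphore table guarded by a unique index.
class SemaphoreTable
  DuplicateKey = Class.new(StandardError)

  def initialize
    @rows = {}
  end

  # Mimics an INSERT that fails on the unique index on :key.
  def create!(key, value)
    raise DuplicateKey if @rows.key?(key)
    @rows[key] = value
  end

  # Mimics UPDATE ... SET value = value - 1 WHERE key = ? AND value > 0.
  def attempt_decrement(key)
    return false unless @rows[key] && @rows[key] > 0
    @rows[key] -= 1
    true
  end
end

# The first job with a given key inserts the semaphore; concurrent jobs that
# hit the duplicate-key error fall back to decrementing the remaining slots.
def acquire(table, key, limit)
  table.create!(key, limit - 1)
  true
rescue SemaphoreTable::DuplicateKey
  table.attempt_decrement(key)
end
```

With `limit: 2`, the first two acquisitions for a key succeed and the third is rejected, which is the behavior the unique key is there to guarantee.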
I imagine that in this case, the expected
Have you tried just running |
@rosa always appreciate the amount of support here!
I understand the point of a unique key constraint, but as I had mentioned, it really depends on how you are using it. I believe there are 2 distinct possibilities here:
You are correct that my wrapping 3 jobs in a transaction is not making things easier. That was simply a response to randomly losing one of my job requests. A better way to handle these random failures to enqueue is to change my code to support a retry on a per-job-request basis. As for getting a working test environment up: bin/setup has always blown up on my laptop with the following, and since I've never worked in a Docker environment, this is befuddling to me:
|
I spent a little time running 'by-hand' tests to confirm my suspicions. Here is the TL;DR version of things:
Unfortunately, as stated above, something in my environment prevents me from building the Solid test environment, but I was able to test 'by hand' via:
Without a savepoint, one of the two sessions would become 'dead/unstable' and generate the same error I'm getting in my logs from SolidQueue (i.e., both inserts would fail and I would get the `PG::InFailedTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block` error message). With the savepoint as suggested below, both transactions would run to completion -- one transaction writing the conflict record and both transactions writing the validation records. I believe this change will fix the issue (maybe PostgreSQL only?):

```ruby
def attempt_creation
  Semaphore.transaction(requires_new: true) do # this creates the required savepoint
    Semaphore.create!(key: key, value: limit - 1, expires_at: expires_at)
    true
  end
rescue ActiveRecord::RecordNotUnique # this automatically rolls back to the savepoint
  # Any database interactions from here on are valid and will work
  limit == 1 ? false : attempt_decrement
end
```
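To make the Postgres semantics concrete, here is a small self-contained Ruby model of why the rescue only helps when it rolls back to a savepoint. This is a sketch, not real database code: `ToyPgTransaction` and its methods are invented for illustration.

```ruby
# Toy model of PostgreSQL transaction state: once any statement fails, every
# later statement in the transaction fails too -- unless the failing statement
# ran inside a savepoint that can be rolled back on its own.
class ToyPgTransaction
  InFailedTransaction = Class.new(StandardError)

  def initialize
    @aborted = false
  end

  # Run one statement; a failure poisons the whole transaction.
  def execute
    if @aborted
      raise InFailedTransaction,
            "current transaction is aborted, commands ignored until end of transaction block"
    end
    yield
  rescue InFailedTransaction
    raise
  rescue StandardError
    @aborted = true
    raise
  end

  # Run a statement under a savepoint; on failure, "ROLLBACK TO SAVEPOINT"
  # leaves the outer transaction usable again.
  def savepoint
    yield
  rescue StandardError
    @aborted = false
    raise
  end
end
```

Without the savepoint, a rescued duplicate-key error still leaves the transaction aborted, so the follow-up `attempt_decrement` UPDATE fails with `InFailedTransaction`; wrapped in the savepoint, the same rescue leaves the transaction healthy and the decrement can proceed.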
@rosa I managed to get my bin/setup working. But I could use some help designing a test that works within the Solid environment.
We have the same problem. We are experiencing this issue with Twilio webhook. onConversationAdded and onMessageAdded trigger at the same time. When onConversationAdded is added as a job and onMessageAdded is added, it throws the same error when both try to go to the db at the same time.
@MuhammetDilmac I don't know if you are triggering the problem the same way I am, but the code above solves it for me. I have that code in a PR without any tests if you would like me to push it up.
When you have a minute: I'm stuck on how to write a test that demonstrates this issue, given that the code change doesn't lend itself to flags, easy access, or any of the other standard tricks for testing two implementations of the same functionality. I have a fix that I've been running for about two weeks and it seems to resolve the issue, but I'd like to have tests rather than trust the "it was just your lucky two weeks" fairies. Also, when attempting to run the tests on main, I am unable to run them to completion successfully. So clearly I have to start there, and it would help to know whether these errors are expected or if I'm "being special again". Hal
Hey @hms, sorry for the delay! I'm on-call this week and haven't had time to dedicate to Solid Queue 😳 I'll try on Friday. I'm not sure it'll be possible to write a test for this scenario, TBH. What error do you get when you run the tests on main? You can also try running them for Postgres only, with the environment variable |
Per all of my earlier discussion, Postgres transactions become invalid as soon as any database constraint rejects an operation, and all follow-on database interactions fail until a rollback is executed. Semaphore#attempt_creation uses a rather common database pattern: relying on database locks to reliably synchronize based on uniqueness of values. The catch is that PostgreSQL requires a little more handholding to use this pattern.

By wrapping just the Semaphore.create statement in a SAVEPOINT (a pretend nested transaction), we get what appears to be a transparent operation that works across all of the other supported databases without any known consequences AND allows Postgres to continue to work as expected with the existing code. This change works across the Solid supported databases (at least based on the tests), and I believe the extra overhead for the non-Postgres databases is small enough that special-casing the code or resorting to the complexity of dependency injection is just not worth it.

I added two tests to concurrency_controls_test.rb: one to show the error is real and easy to reproduce, and one to show the fix fixes it.
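For what it's worth, a race test of this kind can take roughly the following shape. This is a standalone sketch only: the real tests would live in concurrency_controls_test.rb and exercise a real database, while `FakeSemaphores` here is invented and uses a Mutex to stand in for a unique index.

```ruby
require "minitest/autorun"

# Stand-in for the semaphores table: exactly one concurrent insert per key
# can win, the way a unique index lets exactly one INSERT succeed.
class FakeSemaphores
  def initialize
    @mutex = Mutex.new
    @keys = {}
  end

  # Returns true for the winning insert, false for the loser.
  def insert(key)
    @mutex.synchronize do
      return false if @keys.key?(key)
      @keys[key] = true
    end
  end
end

class ConcurrentInsertTest < Minitest::Test
  def test_exactly_one_winner
    semaphores = FakeSemaphores.new
    results = 2.times.map { Thread.new { semaphores.insert("JobName/1") } }.map(&:value)
    assert_equal 1, results.count(true)
    assert_equal 1, results.count(false)
  end
end
```

The assertion is the important part: however the two sessions interleave, exactly one insert wins and the loser must be able to fall back without poisoning its transaction.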
@rosa if my response came across as demanding or expecting, I'm sorry. I know everyone here is on an all-volunteer basis with day jobs.
Hey @hms, no, not at all! Don't worry at all, sorry if I implied I was bothered or something! 😅 I'm fortunate that I can work on Solid Queue as part of my day job, but lately I've been busy with other priorities there, including on-call, but hopefully I'll manage to find time soon 🤞 Great job on this issue and the associated PR.
Environment:
My workflow that seems to reliably cause the issue:
One of the jobs fairly consistently generates an error on INSERT INTO "solid_queue_semaphores".
For each block of 10 or so creates, the offending job's solid_queue_semaphores "key" value is JobName/User.id, and all 10 of the commits will have the same key.
What I believe I'm seeing via the logs is:
I haven't dug down into any of the SolidQueue code around enqueuing a job with limits_concurrency, but, at least with Postgres and its version of MVCC, there is no way to preemptively detect a potential duplicate key via a query prior to the insert. It either has to be rescued or handled with an ON CONFLICT clause (or the MySQL / SQLite equivalent).
If my somewhat rushed conclusion is correct, then given that limits_concurrency is a SolidQueue feature that enhances an enqueue request, it would be highly desirable for Solid to detect and retry internally. Less desirable, but absolutely functional, would be to raise an error specific to this situation: Job.perform_later is not the root cause, there is nothing wrong with the database, you just got stepped on by a combination of a database constraint and unfortunate timing, and the right answer is to resubmit the job request -- it will likely work.
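As a caller-side stopgap until something like that exists, a small retry wrapper is one option. This is a hedged sketch: `DuplicateSemaphore` and `enqueue_with_retry` are invented names, and the exception class is a stand-in for whatever error actually bubbles up in your logs.

```ruby
# Stand-in for the real error raised when the semaphore insert loses the race.
DuplicateSemaphore = Class.new(StandardError)

# Retry the enqueue a few times; unfortunate timing usually clears on resubmit.
def enqueue_with_retry(attempts: 3)
  tries = 0
  begin
    yield
  rescue DuplicateSemaphore
    tries += 1
    retry if tries < attempts
    raise
  end
end
```

Usage would look like `enqueue_with_retry { SomeJob.perform_later(user) }`; after the configured number of attempts the error is re-raised so genuine failures still surface.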
When I figure out how to get the SolidQueue test environment working (I'm on a Mac, and Homebrew seems to require some kind of weird incantation to build a version of mysql that works with Docker), I'll see if I can develop a test for this specific case.