
I think there is a race condition with limit_concurrency #224

Open
hms opened this issue May 13, 2024 · 10 comments

Comments

@hms

hms commented May 13, 2024

Environment:

  • Latest version of SolidQueue
  • PostgreSQL 16.3
  • macOS

My workflow that seems to reliably cause the issue:

  • I have a workflow where I create approximately 10 records in a couple of seconds.
  • This model has an after_commit block that enqueues 3 jobs via perform_later.
  • The after_commit code is wrapped in a transaction (the 3 jobs really should be enqueued all-or-nothing), but the SolidQueue error occurred with and without the transactional wrapper.

One of the jobs fairly consistently generates an error on 'INSERT INTO "solid_queue_semaphores"'.
For each block of 10 or so creates, the offending job's solid_queue_semaphores "key" value is JobName/User.id, so all 10 of the commits share the same key.

What I believe I'm seeing via the logs is:

  • N jobs are being enqueued in parallel (N appears to be 2).
  • Because of some unfortunate random timing, these 2 enqueue requests attempt their inserts at effectively the same time.
  • Because they have the same solid_queue_semaphores "key" value, and the "key" column has a unique constraint, one of the enqueues gets rejected with an 'ActiveRecord::StatementInvalid (PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block)' from deep within the bowels of Rails ActiveJob#enqueue.
  • This is less than ideal ;-)

I haven't dug into the SolidQueue code around enqueuing a job with limits_concurrency, but, at least with Postgres and its version of MVCC, there is no way to preemptively detect a potential duplicate key via a query prior to the insert. The violation either has to be rescued or handled with an 'ON CONFLICT' clause (or the MySQL / SQLite equivalent).
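The check-then-insert race can be reproduced without a database at all. The sketch below is plain Ruby: the Set stands in for the solid_queue_semaphores table, and the queues just force the worst-case interleaving. Both workers pass the existence check before either inserts, which is exactly why the unique index, not a pre-check, has to be the arbiter:

```ruby
require "set"

table    = Set.new   # stand-in for solid_queue_semaphores
checked  = Queue.new # signals "I have done my existence check"
proceed  = Queue.new # released only after BOTH threads have checked
inserted = []
mutex    = Mutex.new

threads = 2.times.map do |i|
  Thread.new do
    passed = !table.include?("JobName/1") # step 1: check for the key
    checked.push(i)
    proceed.pop                           # hold here until both have checked
    if passed
      table.add("JobName/1")              # step 2: insert (the 2nd insert would
      mutex.synchronize { inserted << i } # violate the unique index in a real DB)
    end
  end
end

2.times { checked.pop }       # both checks happened before any insert
2.times { proceed.push(nil) }
threads.each(&:join)

puts inserted.size # => 2 : both threads believed the key was free
```

In a real database the second INSERT raises the uniqueness violation instead of silently merging, which is the error this issue describes.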

If my somewhat rushed conclusion is correct, then given that limits_concurrency is a SolidQueue feature that enhances an enqueue request, it would be highly desirable for SolidQueue to detect and retry internally. Less desirable, but still perfectly functional, would be raising an error specific to this situation: Job.perform_later is not the root cause and there is nothing wrong with the database; you just got stepped on by the combination of a database constraint and unfortunate timing, so the right answer is to resubmit the job request, which will likely succeed.
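The "detect and retry internally" idea amounts to a bounded retry around the enqueue. A minimal sketch in plain Ruby (RetryableConflict and enqueue_with_retry are illustrative names, not SolidQueue API):

```ruby
# Illustrative only: a bounded retry around a block that may fail with a
# uniqueness-style conflict. RetryableConflict stands in for the error the
# enqueue path would raise.
RetryableConflict = Class.new(StandardError)

def enqueue_with_retry(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue RetryableConflict
    retry if attempts < max_attempts
    raise # give up after max_attempts and surface the conflict
  end
end

# Simulate the race: the first attempt hits the unique-key conflict,
# the retry succeeds.
result = enqueue_with_retry do |attempt|
  raise RetryableConflict, "duplicate semaphore key" if attempt == 1
  :enqueued
end

puts result # => enqueued
```

The key point is that the retry happens inside the library, so the caller's perform_later never sees the transient conflict.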

When I figure out how to get the SolidQueue test environment working (I'm on a Mac, and Homebrew seems to require some kind of weird incantation to build a version of mysql that works with Docker), I'll see if I can develop a test for this specific case.

@rosa
Member

rosa commented May 13, 2024

Hey @hms, thanks for writing this up. I don't have a lot of experience with PostgreSQL, but I think the whole transaction wrapping the 3 enqueued jobs might be the problem.

it would be highly desirable for Solid to detect and retry internally

This is what is being done and the basis of the concurrency limits. The unique key is precisely to avoid race conditions with multiple jobs that share a concurrency key being enqueued at the same time.

The after_commit code is wrapped in a transaction (the 3 jobs really should be enqueued all-or-nothing), but the SolidJob error was occurring with and without the transactional wrapper

I imagine that in this case, the expected ActiveRecord::RecordNotUnique is not being raised, but instead that other PostgreSQL error.

When I figure out how to get SolidQueue test environment working (I'm on a Mac and homebrew seems to require some kind of weird incantation to build a working version of mysql that works with docker), I'll see if I can develop a test for this specific case.

Have you tried just running bin/setup? It will use Docker to run MySQL so you don't have to install it on your machine.

@hms
Author

hms commented May 13, 2024

@rosa always appreciate the amount of support here!

This is what is being done and the basis of the concurrency limits. The unique key is precisely to avoid race conditions with multiple jobs that share a concurrency key being enqueued at the same time.

I understand the point of a unique key constraint, but as I mentioned, it really depends on how you are using it. I believe there are 2 distinct possibilities here:

  1. The key already exists, and a read or 'SELECT FOR UPDATE' to avoid race conditions makes detection easy.
  2. The key does not exist...
    In PostgreSQL, you can't 'SELECT FOR UPDATE' a key that doesn't exist, and you can't proactively lock to avoid a race; you'd have to fall back to table locking if you want to 'read-verify' before writing.
    Otherwise, you have to attempt the insert, and one thread will end up in a non-recoverable state at commit.
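The second case is the one that matters here: with an atomic insert, exactly one writer wins and the loser has no choice but to rescue the violation. A plain-Ruby sketch of that shape (DuplicateKey stands in for ActiveRecord::RecordNotUnique; the class below is a stand-in, not SolidQueue code):

```ruby
DuplicateKey = Class.new(StandardError)

# A Hash guarded by a Mutex stands in for the table's unique index:
# the insert either fully succeeds or raises, atomically.
class FakeSemaphoreTable
  def initialize
    @rows = {}
    @lock = Mutex.new
  end

  def insert!(key, value)
    @lock.synchronize do
      raise DuplicateKey, key if @rows.key?(key)
      @rows[key] = value
    end
  end
end

table = FakeSemaphoreTable.new
results = 2.times.map do
  Thread.new do
    table.insert!("JobName/1", 0)
    :created
  rescue DuplicateKey
    :conflict # the loser must rescue; there is no safe pre-check
  end
end.map(&:value)

puts results.sort.inspect # => [:conflict, :created]
```

In SolidQueue terms, the :conflict path is roughly where attempt_decrement comes in; the point is that the rescue is the only portable way to learn you lost the race.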

You are correct that my wrapping the 3 jobs in a transaction is not making things easier. That was simply a response to randomly losing one of my job requests. A better way to handle these random failures to enqueue is to change my code to support a retry on a per-job basis.

As for my ability to get a working test environment up and running: bin/setup has always blown up on my laptop with the following, and since I've never worked in a Docker environment, this is befuddling to me:

bin/setup
creating Makefile

current directory: /Users/hms/.rvm/gems/ruby-3.2.2/gems/mysql2-0.5.4/ext/mysql2
make DESTDIR\= sitearchdir\=./.gem.20240513-34066-kirvj1 sitelibdir\=./.gem.20240513-34066-kirvj1 clean

current directory: /Users/hms/.rvm/gems/ruby-3.2.2/gems/mysql2-0.5.4/ext/mysql2
make DESTDIR\= sitearchdir\=./.gem.20240513-34066-kirvj1 sitelibdir\=./.gem.20240513-34066-kirvj1
compiling client.c
In file included from client.c:15:
./mysql_enc_name_to_ruby.h:43:1: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
mysql2_mysql_enc_name_to_rb_hash (str, len)
^
./mysql_enc_name_to_ruby.h:86:1: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
mysql2_mysql_enc_name_to_rb (str, len)
^
client.c:1377:3: error: call to undeclared function 'mysql_ssl_set'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  mysql_ssl_set(wrapper->client,
  ^
client.c:1377:3: note: did you mean 'mysql_close'?
/opt/homebrew/opt/mysql/include/mysql/mysql.h:797:14: note: 'mysql_close' declared here
void STDCALL mysql_close(MYSQL *sock);
             ^
2 warnings and 1 error generated.
make: *** [client.o] Error 1

make failed, exit code 2

@hms
Author

hms commented May 14, 2024

@rosa

I spent a little time running 'by-hand' tests to confirm my suspicions. Here is the TL;DR version:

  • PostgreSQL transactions become "unstable" after any statement is rejected due to a constraint violation. In this case, that's SolidQueue::Semaphore#attempt_creation and the unique index on the key field.
  • These unstable transactions are dead until a rollback is executed:
    • With Rails defaults, the rollback covers the entire transaction. As a side effect, any database DML (insert/update/delete) that runs after the rescue is in a state of denial about the fact that it is dead code.
    • Using Rails transaction options, the transaction can be 'restabilized' by rolling back to a 'savepoint' (the Rails documentation refers to these as sub-transactions). This requires not only wrapping the code in question in another transaction block, but also passing the requires_new: true option.
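The savepoint behavior can be mimicked with a toy in-memory session (an analogy only; the real mechanism is Postgres's SAVEPOINT / ROLLBACK TO SAVEPOINT, reached from Rails via transaction(requires_new: true)):

```ruby
# Toy model: each nested "transaction" snapshots state on entry (SAVEPOINT)
# and restores it on error (ROLLBACK TO SAVEPOINT), re-raising so the caller
# can rescue with the outer transaction still usable.
class ToySession
  attr_reader :rows

  def initialize
    @rows = []
  end

  def transaction
    savepoint = @rows.dup # SAVEPOINT
    yield
  rescue
    @rows = savepoint     # ROLLBACK TO SAVEPOINT
    raise
  end

  def insert(row)
    @rows << row
  end
end

db = ToySession.new
db.transaction do             # outer transaction
  begin
    db.transaction do         # inner "sub-transaction"
      db.insert("conflicting row")
      raise "unique constraint violation"
    end
  rescue RuntimeError
    # rolled back only to the savepoint; the outer transaction stays healthy
  end
  db.insert("follow-up row")  # succeeds, unlike Postgres without a savepoint
end

puts db.rows.inspect # => ["follow-up row"]
```

Without the inner snapshot, a real Postgres session would refuse every statement after the violation until the whole transaction rolled back, which is the observed failure.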

Unfortunately, as stated above, something in my environment prevents me from building the SolidQueue test environment. But I was able to test by hand via:

  • In window 1, lock a shared point of access that blocks windows/sessions 2 and 3, to simulate simultaneous access to the semaphore table. In my database, the shared point of contention is: User.lock.find(1)
  • In two separate terminal sessions:
    • ActiveRecord::Base.transaction do
    • Attempt User.lock.find(1) # synchronizing the multiple requests
    • In a begin/rescue block: attempt to insert into a table using the same primary key (id), simulating the Semaphore key conflict
    • rescue ActiveRecord::RecordNotUnique
    • After the rescue block, attempt to insert a second, distinct record into the same table that is guaranteed valid (i.e., it inserts).
    • end # transaction

Without a savepoint, one of the two sessions would become 'dead/unstable' and generate the same error I'm getting in my logs from SolidQueue (i.e., both inserts would fail and I would get the PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block error message).

With the savepoint as suggested below, both transactions would run to completion: one transaction writing the conflict record, and both transactions writing the validation records.

I believe this change will fix the issue (though it may only matter for PostgreSQL):

def attempt_creation
  Semaphore.transaction(requires_new: true) do # creates the required savepoint
    Semaphore.create!(key: key, value: limit - 1, expires_at: expires_at)
    true
  end
rescue ActiveRecord::RecordNotUnique # automatically rolls back to the savepoint
  # Any database interaction from here on is valid and will work
  limit == 1 ? false : attempt_decrement
end

@hms
Author

hms commented May 15, 2024

@rosa I managed to get bin/setup working. But I could use some help designing a test that works within the SolidQueue environment.

@MuhammetDilmac

MuhammetDilmac commented May 21, 2024

We have the same problem. We are experiencing this issue with Twilio webhooks: onConversationAdded and onMessageAdded trigger at the same time, and when both are enqueued as jobs and hit the db simultaneously, it throws the same error.

PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block

@hms
Author

hms commented May 21, 2024

@MuhammetDilmac I don't know if you are triggering the problem in the same way I am, but the code above solves it for me. I have that code in a PR without any tests, if you would like me to push it up.

@hms
Author

hms commented May 22, 2024

@rosa

When you have a minute: I'm stuck on how to write a test that demonstrates this issue, given the code change doesn't lend itself to flags or any of the other standard tricks for testing two implementations of the same functionality. I have a fix that I've been running for about 2 weeks and it seems to resolve the issue. But I'd like to have tests, rather than trusting the "it was just your lucky two weeks" fairies.

Also, when attempting to run the tests on main, I am unable to run them to completion successfully. So clearly I have to start there, and it would help to know whether these errors are expected or whether I'm "being special again".

Hal

@rosa
Member

rosa commented May 22, 2024

Hey @hms, sorry for the delay! I'm on-call this week and haven't had time to dedicate to Solid Queue 😳 I'll try on Friday. I'm not sure it'll be possible to write a test for this scenario, TBH.

What error do you get when you run the tests on main? You can also try running them for Postgres only, with the environment variable TARGET_DB=postgres, in case that helps, so you don't need to bother with MySQL.

hms added a commit to ikyn-inc/solid_queue that referenced this issue May 22, 2024
Per all of my earlier discussion, Postgres transactions become
invalid when any database constraint rejects an operation,
and all follow-on database interactions stay invalid until a
rollback is executed.

Semaphore#attempt_creation uses a rather common database pattern:
rely on database locks to reliably synchronize based on uniqueness
of values. The catch is that PostgreSQL requires a little more
handholding to use this pattern. By wrapping just the Semaphore.create
statement in a SAVEPOINT (a pretend nested transaction), we have
what appears to be a transparent operation that works across all
of the other supported databases without any known consequences
AND allows Postgres to continue to work as expected per the existing
code.

This change works across the SolidQueue-supported databases
(at least based on the tests), and I believe the extra overhead
for the non-Postgres databases is small enough that special-casing
the code, or resorting to the complexity of dependency injection,
is just not worth it.

I added two tests to concurrency_controls_test.rb:
one to show the error is real and easy to reproduce,
the other to show the fix fixes it.
@hms
Author

hms commented May 23, 2024

@rosa If my response came across as demanding or presumptuous, I'm sorry. I know everyone here is volunteering on top of their day jobs.

@rosa
Member

rosa commented May 23, 2024

Hey @hms, no, not at all! Don't worry at all, sorry if I implied I was bothered or something! 😅 I'm fortunate that I can work on Solid Queue as part of my day job, but lately I've been busy with other priorities there, including on-call, but hopefully I'll manage to find time soon 🤞 Great job on this issue and the associated PR.


3 participants