For discussion: experiment using optimistic locking to avoid race condition #291
Conversation
… race conditions AR-only in this demo. A side effect is you no longer get AR model callbacks on promote (or manual `swap`), which may be a deal-breaker
Thanks a lot for the PR and your detailed thoughts. I should have time tomorrow to review it properly. But for now I want to say that I've also tried this approach in the past, but it introduced a regression because people relied on ActiveRecord callbacks to trigger during promotion, so I reverted it. Maybe it could still work if it was introduced in a safer, more backwards-compatible way; we can discuss these possibilities.
Thanks for the response. Yeah, I knew that might be a blocker. If it really is the right way to do things, it might need to be a backwards-incompatible change -- or an option. But really this was just a trial run for needing to do things with "derivatives", where, if we let people update derivatives "whenever" -- which I think is actually a core requirement -- an analogous situation becomes a problem which is both much more dangerous and much more complicated to deal with. :( The two problems could interact, though, to the extent that someone could be using the backgrounding plugin and "new" derivatives. Spins my head to think about it.
Here's a thought: the mechanism of a race-condition-free update strategy needs to be entirely pluggable (not necessarily meaning the shrine plugin system, but maybe). Like, not even assumed by an ORM plugin -- although a given ORM plugin might have a default. But even with a given ORM, you might want to choose an optimistic locking approach (either using the optimistic locking built into the ORM, or not, as in this PR); a pessimistic locking approach; or transaction isolation levels. Ideally there would be race-free-update "strategies" that could work for backgrounding update AND new-style derivatives update. You may or may not want to choose the same one for each kind of update. Of course, if we were to provide all of those out of the box, it gives us a lot more work to do! And it's really hard to test and be sure you are free of race conditions with some of the approaches.
Ok, was able to properly read the code and your thoughts in this PR. First of all, in case you're interested, here are the commits that introduced (and later reverted) a similar kind of optimistic locking in the past:
One interesting note is that with Sequel it's possible to add a WHERE filter to a model instance that will be applied on update, and this way the callbacks will still trigger. But I really disliked that it modifies the current model instance, which could be very surprising if the user wanted to make another update after attachment promotion (that update probably then wouldn't work). So, I would prefer to avoid this approach if we can. Mainly because of the lack of callbacks, but also because I don't know how it would help with
Yes, I agree that having a mechanism to eliminate race conditions for the new
I don't think it should be too complicated, I imagine that we'd have an ORM-specific
Oh, I thought I added tests after introducing that backwards incompatibility with callbacks. I agree they should definitely exist.
Yeah,
I would like to hear more. What makes you concerned about performance of optimistic locking? What are the scenarios which you imagine where pessimistic locking could cause deadlocks? Since all that would happen inside the lock would be an `UPDATE`.

The advantages for me are: (a) callbacks are still triggered, (b) no need to rescue exceptions as the DB query will just block until the lock is released, and (c) it should work for both promoting and derivates because of record refreshing. To briefly explain (c): with promoting you only have to check whether the attachment has changed, but with derivates we also need to be able to retrieve the up-to-date list of derivates before updating the column, which we would, because the record would be refreshed.

This is how I imagine it:

```ruby
def safe_update
  record.db.transaction do
    record.lock! # record is refreshed with a SELECT ... FOR UPDATE query
    yield        # this updates any record column values in-memory
    record.save(validate: false) # persists any new values
  end
end

# promoting
# ... upload cached file to permanent storage ...
safe_update do
  # write stored file data to attachment column
  record.send("#{data_column}=", stored_file.to_json)
end

# adding new derivates
# ... process new derivates and upload them to permanent storage ...
safe_update do
  # record is refreshed with any derivates that were added while new
  # derivates were processed, so we get the up-to-date list
  record.send("#{derivates_column}=", derivates.merge(**new_derivates).to_json)
end
```

Note that this is just a sketch. There should also be some failure callback which would get triggered if the attachment has changed, where we would delete the processed derivates. It should maybe also be possible to do promotion/derivate updating without locking, to avoid refreshing and locking in case people don't use backgrounding (though maybe this should still be done in either case for safety, e.g. if people decide to do their own backgrounding).
I guess we could have ActiveRecord's/Sequel's built-in optimistic locking (with the separate version column) as an alternative locking implementation, and I agree with the behaviour that you suggested. I think your suggestion that we could make locking strategies configurable is great, e.g.
That's an interesting find, but I have to say that I'm having trouble understanding how it works. And also, I thought that transactions are used for making multiple updates atomic, that they don't actually prevent updates from happening during a transaction. Since we're doing a single update, I don't think transactions will help us. Also, I'm worried about cross-DB support.
Thanks for the detailed comments!
Hmm... but doesn't the current promotion code also modify the current model instance? Even with backgrounding, it fetches a new model instance to check for changes (still with a race condition as discussed)... but then just abandons the new model instance and updates the original. And without backgrounding, it's always updating the original (and saving it to the db). No? Am I misunderstanding the code?

But yeah, this is some of the stuff I'm running into working on derivatives -- for which I hope to have some code to talk about (that will potentially have major things that need to be changed) next week. It's always a lot easier to talk about this with some code to comment on than so abstractly.

Incidentally, an interesting thing about pessimistic locking is that, as far as I can tell, it's really only safe if ALL the code that might update that record (or at least that column in that record) uses it. As opposed to optimistic locking, which is safe even if other code updating that record/column isn't in on the game. In derivatives, in order to have any hope of dealing with race conditions sanely, I think

Which I guess probably means it should not be on the current model instance, as, like I think you were saying, persisting other previous un-saved changes on the model instance would be unexpected. (But I thought that's what promotion was doing now... in my following of the code, any call to
On performance characteristics of pessimistic locking -- well, when you take out a pessimistic lock, it means anything else that is going to need that lock is going to block waiting for it. I am not totally sure if that includes anything else wanting to read the row in question, or just anything else wanting to write to it. And hypothetically two processes can end up deadlocked, each waiting for a lock the other one has out (in which case I believe postgres, at least, will notice and return an error, rather than letting them both wait forever). Will performance be an issue here, where the lock should not be held very long at all? Is deadlock actually possible? I'm not sure, it's quite possible it's just fine, but it's hard enough for me to wrap my head around that it always makes me nervous.

On transactions with isolation levels... theoretically it's an SQL standard, so all dbs should work the same. I am not totally sure how much practice matches this theory. But you don't need to have multiple updates for transactions with isolation levels to be useful, even though examples to demonstrate them often use multiple records. Theoretically, transaction isolation will do something very similar to pessimistic locking, but with better performance/concurrency characteristics. Unlike a pessimistic lock, other processes won't be blocked waiting for access; rather, if another process does do something that would conflict with the semantics, one of the processes will error out (similar to an optimistic locking error in that case). Here's the pertinent line from the postgres docs that makes me think this:
If you have a transaction with "repeatable read" isolation -- if someone else updated the record you fetched, and you then tried to update it -- postgres will refuse to do your update and error out. Accomplishing exactly what we are trying to accomplish here: the race condition we are dealing with is when our process fetches a record, makes changes, and then saves it -- but in between our fetch and save, someone else updated it, and we're in danger of overwriting their update and/or putting data in an inconsistent state because of it. "Repeatable read" isolation avoids that, giving you an error if that was going to happen -- but without actually taking out a lock that will make other processes wait in a queue for it to be released.

Have I ever actually used this myself? No. Does it make my head spin? Yes. Do I trust my ability to understand the implications when using it? Totally not. So I'm not really saying to use it... but I'm scared of pessimistic locks for the same reason.

It is really hard to make a test for any of this. I am not surprised there's no test for the current backgrounding attempt to avoid the race condition. Plus, if there WAS a test and it passed... wouldn't that be some false confidence, since we in fact know there IS a race condition?
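The "refuse and error out" behavior can be modeled without a database. Here's a toy in-memory sketch (all names are hypothetical; the real implementation is the DB's job) that mimics the first-updater-wins write-conflict detection described above:

```ruby
# Toy model of "repeatable read" write-conflict detection: each transaction
# remembers the row version it read; at commit time the store rejects the
# write if another transaction committed in between. Illustration only.
class SerializationFailure < StandardError; end

class SnapshotStore
  attr_reader :data, :version

  def initialize(data)
    @data    = data
    @version = 0
  end

  # Simulates UPDATE inside a repeatable-read transaction: fails if the
  # row changed since +read_version+ (first updater wins).
  def commit(read_version, new_data)
    raise SerializationFailure, "row changed since snapshot" if @version != read_version
    @data     = new_data
    @version += 1
  end
end

store = SnapshotStore.new("original.jpg")

snapshot = store.version                         # our transaction reads the row
store.commit(store.version, "someone_elses.jpg") # another process commits first

begin
  store.commit(snapshot, "promoted.jpg")         # our UPDATE is refused
rescue SerializationFailure
  puts "serialization failure -- attachment changed, abandoning promotion"
end
```

No process ever waits here; the losing writer simply gets an error it can rescue, which is the key contrast with pessimistic locking.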
So, this PR was meant to demonstrate that no extra DB column is required for optimistic locking, right? But yeah, to use the built-in AR optimistic locking, true. I don't know about Sequel. As far as what to do with backgrounding and race conditions... I honestly have no idea. I'm not even totally sure how bad the consequences of the race condition are here. Thinking it through, if the race condition happens, what could happen is:
While annoying, that is not as disastrous as the race conditions that could happen with derivatives, which could be either A) a record ends up, either temporarily or permanently, having derivatives that don't match the original file, they are for some other file, or B) some derivatives you tried to add didn't end up stored in the model at all (cause they were overwritten by another update), but the derivative files are still persisted in storage. A lot more disastrous in both cases. It's worth pointing out that if we took ActiveStorage's approach of having one row per derivative... it might make these race conditions a LOT easier to deal with. But it would also make some other things, especially with shrine's architecture, a lot harder. I don't even know how you'd deal with foreign-key related objects in a DB-agnostic way, like shrine does now. AS's approach is a lot harder when you can't assume a particular ORM.
True, promotion will discard any other in-memory column changes, and that is expected. Note that promotion is intended to be called after the record has been saved, so it's unexpected that the user will modify it before promotion. Also, the only way the user can modify the record before promotion is if they use
This thought also came to my mind, so I went ahead and checked it, and while the lock has been taken by a
I don't think saving other un-saved changes would be unexpected or unwanted, only that the caller shouldn't expect it. In the case of pessimistic locking (where reloading happens) and your optimistic locking implementation, other changes would be dropped, while in the case of built-in optimistic locking other changes would be persisted. I wouldn't worry about that.
Hmm. I don't think it will discard them. I think it will save them. But we could obviously play with the code and find out for sure!
I believe there's at least some documentation somewhere that suggests calling promote (or update or swap, which I think will have the same issues) directly in some circumstances. I could go look for it. But yes, ordinarily that won't be an issue. Except with derivatives it will be, because we're definitely gonna be suggesting people call

In some of my current implementations,

One way or another, something's gonna be weird I think; I can't find any ways to make all of these things act as one might expect, with no race conditions. But this will be more clear when there's code to look at maybe.
Normal
Since the lock will be held for the shortest amount of time possible, I don't think it could cause performance issues (otherwise I would never use this feature 😛). I think the only way a deadlock is possible is if, while holding the lock, the connection (a) does

Though it's possible that user-defined callbacks trigger some other UPDATE queries... You're right, we should consider the risk of deadlocks.
If there is a parallel update to the record while we're updating the attachment, I would like that update to succeed once that transaction is finished, not error out. What I like about the pessimistic locking approach is that there are no errors, only "adjusting" when UPDATES will happen.
If while we're adding a derivate to the record someone updates another column on the same record, I wouldn't consider that a race condition, and I wouldn't want an exception to happen. I would want the other update to succeed (after the derivate has been added). If the parallel update is caused by another derivate being added to the same record, then the pessimistic locking approach would correctly add one derivate after the other without race conditions, while the transaction isolation approach would error out?
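The "one derivate after the other, no exceptions" semantics can be illustrated with an in-memory analogy, where a Mutex stands in for `SELECT ... FOR UPDATE` (this is only an analogy; real row locks live in the database, and the names here are illustrative):

```ruby
# In-memory stand-in for a row with a serialized derivates column.
# The Mutex plays the role of the row lock: the second writer blocks until
# the first finishes, then sees the fresh value under the lock, so both
# derivates survive -- no exception is ever raised.
row_lock  = Mutex.new
derivates = { "thumb" => "thumb.jpg" }

add_derivate = lambda do |name, file|
  row_lock.synchronize do
    current   = derivates.dup              # "refresh" while holding the lock
    derivates = current.merge(name => file)
  end
end

t1 = Thread.new { add_derivate.call("small", "small.jpg") }
t2 = Thread.new { add_derivate.call("large", "large.jpg") }
[t1, t2].each(&:join)

puts derivates.keys.sort.inspect # => ["large", "small", "thumb"]
```

Whichever thread wins the lock goes first; the loser just waits and then merges onto the winner's result, which is exactly the "adjusting when UPDATEs happen" behavior described above.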
Hehe, yeah, I didn't test this because I didn't know how to stop the process between the SELECT (refresh) and the UPDATE. I'm comfortable with using Fibers, but I cannot just insert a
Yes, I was considering pessimistic locking vs ActiveRecord's/Sequel's built-in optimistic locking here (Sequel has it as well). At the moment I don't know if the implementation from this PR is going to work, because (a) it doesn't trigger callbacks and (b) I don't see how it could be utilized for derivates (if update count is
Agreed 😄
Actually, I'm not sure how having one row per derivate in a separate table would help us. Can you clarify?
Oh, right, core promotion will save changes. The
The way I see it, we'll want to have two ways of adding derivates:
The latter would then probably be a wrapper around the former. We can then document that their record changes won't be saved in the latter case (I think that's completely reasonable).
Sorry for so many words, there's just a lot going on, thanks for the discussion.
Yeah, I mean, everyone would LIKE that, but there's a reason people use optimistic locking instead... cause it's got a LOT more straightforward performance characteristics. In this case, transparently recovering from an optimistic locking error (and this applies to the analogous transaction isolation "serialization" error too) is really straightforward: fetch the record again, re-apply your merges on top of the hash, save again. If error again, repeat up to some number of max tries. That's how optimistic locking is actually designed to be used, although the way ORMs expose it can make it hard to use in this way sometimes!
So there's a new class of race condition introduced by single-column-hash derivatives. Not just the "backgrounding" one, where another process may have changed the original. But also one where the original remains the same, and another process has changed the derivatives. Because we're storing all the derivatives in a single column, if someone else added or removed a derivative in between when we fetched and when we saved -- our change overwrites their changes, since we're writing the whole hash. And overwrites it without any app callbacks ever being called, so there may be actual derivative files still on storage but no longer referenced (or derivatives referenced but no longer persisted in storage since someone else thought they removed em). Single derivative per row gets rid of that kind of conflict, because the db already lets you atomically update a row, that's what dbs do. But it doesn't let you atomically update part of a serialized json hash (well, except pg jsonb does, with the right calls, that ActiveRecord won't do for you on
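The whole-hash lost update described above is easy to demonstrate without any database at all (column contents and file names are illustrative):

```ruby
require "json"

# A serialized-hash "column": whoever writes last overwrites the whole value.
column = JSON.generate("thumb" => "thumb.jpg")

# Process A and process B both read the current hash...
a_view = JSON.parse(column)
b_view = JSON.parse(column)

# ...each adds its own derivative and writes the WHOLE hash back.
a_view["small"] = "small.jpg"
column = JSON.generate(a_view)

b_view["large"] = "large.jpg"
column = JSON.generate(b_view)   # clobbers A's write without any error

final = JSON.parse(column)
puts final.keys.sort.inspect # => ["large", "thumb"]
```

"small" is silently gone from the column even though small.jpg was uploaded, which is exactly the orphaned-file scenario: no callback fired, no exception was raised, and nothing references the file anymore.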
The former puts you at risk of race conditions unless you really know what you're doing, and pay attention to where you're doing it in your code. The latter... in addition to any in-memory unrelated record changes not being saved (which I think is probably desirable in addition to being reasonable)... what if the derivatives changes aren't actually reflected in your in-memory record at all? You'd have to
on optimistic locking (with a failure that can be recovered from) compared to pessimistic locking (with blocking/waiting)... see "lock free concurrency", "wait free concurrency", "lock-free design" etc. in concurrent programming. We are talking about a kind of concurrent programming here. There's a reason people prefer the non-blocking ones for performance reasons. That applies here, at least theoretically. Still, the pessimistic locking approach is probably good enough... for anyone not dealing with massive scale (is that good enough?)... and I'm really no concurrency expert either, I just know enough to know what I don't know, heh. If each derivative was in a separate row, it kind of gets us out of having to be concurrency experts, cause the DB takes care of it for us -- DBs were designed for atomic updates on an individual column in an individual row, not for a key in a serialized hash. But I really don't want to do one-row-per-derivative either if we can avoid it. In part cause I have no idea how to implement it in shrine.
That's fine, I'm glad we're discussing this, it's great for me to have other points of view besides mine 😃
I don't think that's the primary reason people use optimistic locking (I'm talking now about the one built into ActiveRecord/Sequel). I think people use it mainly for the use case of two people updating the same resource via a form: if the other person updated the resource you're looking at between the time you opened it and the time you clicked save, you want the save to fail and the app to tell you to refresh the page. This is not possible with pessimistic locking, because there updates always succeed, they just wait in certain scenarios. This is what I think I want in Shrine.

Note that if we're talking about ActiveRecord's/Sequel's built-in optimistic locking, then I'm not sure which one has better performance characteristics. For every time you retry the update you add two more queries (refresh + update), while with pessimistic locking you'll only ever make two queries (

So I think that ActiveRecord's/Sequel's built-in optimistic locking could produce more surprising behaviour (unnecessary exceptions) compared to pessimistic locking. I know in this PR we were talking about a different kind of optimistic locking, I just wanted to cover this one as well because it was mentioned here.
Thanks for clarifying. Agreed, having a row per derivate would eliminate race condition considerations for the derivates column. Though, we would still need to check whether the original attachment has changed when updating derivates.
Yes, I really dislike the performance considerations of this approach. One is that people would need to remember to eager load whenever they have a collection of records with attachments – I want people to opt in for this responsibility. The other is that the speed of retrieving derivates would become inversely proportional to the number of derivates per attachment. I don't think that's a good tradeoff, as it shouldn't be that hard to make the current approach race-condition free.
Yes, that would make sense mostly in synchronous examples, where you're not using background jobs. Still, I would like to allow people to save the record in their preferred way, should they really want to. But we can think about that after we merge the initial version of the

I would like us to summarise our thoughts now, so that we can reach a conclusion. I will start. For me the downsides of the "optimistic locking" proposed in this PR are:
I also don't think that ActiveRecord's/Sequel's built-in optimistic locking would work well, because:
For the latter reason I'm also not fond of transactions with a specific isolation level – I don't want any exceptions to be raised when the user does a regular record update. I also think that could be overwhelming for people trying to understand the Shrine code, as it's non-standard; at least I've never seen it (optimistic/pessimistic locking are more standard in ActiveRecord/Sequel IMO). So, at the moment, I'm still leaning towards pessimistic locking.
Do keep in mind that this is not necessarily true, so long as we have a way to fetch what was in the db when the exception happens, our own logic can rescue it and fix it. Pseudo-code:
This works with any kind of optimistic locking, built-in ORM or not. It does potentially get complicated with how it impacts the existing record, if you're using the existing record, rather than a copy you got with

But I think even with pessimistic locking you've got to re-fetch the record and then update that other copy, instead of the user's original copy -- cause the fetch and update have to both happen within the lock, for it to work, right? So it's still got weird results on whether the in-memory record is updated.

But okay, I hear ya! My current code experiments do have a 'pluggable' place for how the update is done... but different ones have really different impacts on the semantics (callbacks, whether the in-memory record is updated, etc). I've just got to get my code into a state where I can share it, and we can give feedback and iterate from there. I'm not super happy with where the code is getting me, I just can't figure out any other way; more sets of eyes/brains would be very helpful. (I am also considering exploring ActiveStorage more, to be honest... as well as doing a custom-fit ActiveRecord-only implementation that uses one row per derivative, but with a shrine attacher on each one)
Yes, I agree about this on the Shrine side. But I forgot to clarify: I meant I didn't want exceptions to be raised in user-defined code. For example, if an Article has a cover photo attachment, and while the cover photo is being promoted in the background the user e.g. updates the article title, I wouldn't want that article title update to fail. Because, if you're using the ORM's optimistic locking, it will automatically be used for any kind of update, not just on the Shrine side. And when the update on the Shrine side increments the

As I understood, the same holds for the transaction isolation approach.
By "copy" do you mean model instance? You can actually still update the same model instance with pessimistic locking; we can re-fetch the record into a separate model instance, so instead of

```ruby
record.class.transaction do
  current_record   = record.class.where(id: record.id).lock(true).first
  current_attacher = self.class.new(current_record, name)

  raise ActiveRecord::Rollback if current_attacher.read != self.read # attachment has changed

  write_derivates(current_attacher.derivates.merge(**new_derivates))
  record.save
end
```
Awesome, feel free to push these changes to a branch on this repo directly, and Hiren and I will definitely take a look.
Ok, that makes sense, I would like to draw some knowledge and ideas from ActiveStorage.
I think this is not true. If all we are updating is the derivatives column, on optimistic locking failure we reload the record from the db (so we get the title they updated), then re-apply our changes and save again.
Hmm. I'm not sure your code really protects against the race condition... wait, maybe it does. Interesting. It's really a pain to create test/demonstration cases here that prove it and/or demonstrate a case where it doesn't; I'll spend more time thinking on it. But thanks for the code there, that's not the approach I was thinking of. We also need to take care of removing from persistent storage any derivatives that have been replaced, but it's relatively straightforward how to fit that in there, I think.
I think you are describing the scenario in which the "promote" backgrounding job loads the article and begins processing/upload, then in the meanwhile

I'm talking about the opposite scenario:
Yeah, I assumed we just wouldn't have tests for this race-condition safety; at least I don't know how I would write them 😛
Okay, thanks for your code snippet -- I wasn't understanding/coming up with that myself for some reason, that DOES look like the right way to do it! Still have a buncha things I'm not sure about, but it probably makes sense to just get it in code at this point. My draft right now does use a second attacher/attachment -- partially just cause it was so useful to be able to re-use the functionality that was already there, and overwhelming to try to reimplement it from scratch; partially cause I'm not yet convinced it's not useful to do so (or at least an optionally custom uploader?). It'll be a place to start, if I can get it from messy proof-of-concept code in a custom app I have to something I can actually share. :)
I'm going to close this, as we agreed to incorporate the ideas from this discussion into #299. |
This is not really proposed as being possibly ready for merge, it's sometimes just easier to talk starting from some code, instead of purely abstract discussion. I also did this as part of getting more familiar with shrine's internal implementation, and because concurrent race conditions are likely to be even more of a problem (and more complicated to solve) when dealing with 'derivatives' possibilities.
This uses an "optimistic locking" approach on promotion with the backgrounding plugin, replacing the logic that was there before, which was meant to avoid the race condition but did not do so completely. Right now there is only an ActiveRecord implementation. It does not use ActiveRecord's built-in optimistic locking; instead it locks on the shrine data column, while only updating the shrine data column.
One of the nice things about that approach is that, unlike built-in optimistic locking, if someone else has updated the record, but only other columns, it will successfully avoid conflict.
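A minimal in-memory sketch of that column-scoped compare-and-swap (the real implementation issues `UPDATE ... WHERE <data_column> = <old value>` and checks the affected-row count; the names and values here are illustrative):

```ruby
# The "row": other columns can change freely; we only guard the data column.
row = { title: "My article", data: '{"id":"cache/abc"}' }

# Compare-and-swap confined to the attachment column, like
#   UPDATE records SET data = <new> WHERE id = ? AND data = <old>
# returning the number of rows matched (1 = swapped, 0 = stale).
def swap_data(row, old_data, new_data)
  return 0 unless row[:data] == old_data
  row[:data] = new_data
  1
end

old = row[:data]
row[:title] = "Edited title"     # concurrent update to ANOTHER column...

puts swap_data(row, old, '{"id":"store/abc"}')   # => 1 (...does not conflict)
puts swap_data(row, old, '{"id":"store/other"}') # => 0 (stale: promotion abandoned)
```

Because the WHERE clause only compares the data column, edits to unrelated columns never invalidate the swap, while a concurrent attachment change makes the update match zero rows.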
Because of the limitations on the APIs available in AR, the consequence of this is you don't get AR callbacks called -- so no AR callbacks would be called in the promotion phase, or in any manual uses of the `swap` method, as for instance recommended in the regenerating versions doc. It's possible that if we were to go ahead with this approach, we'd have to make sure there were easy ways to use built-in before/after/around promotion/swap/update hooks that could serve as a substitute for the missing AR ones (which might be difficult to make fully capable, with access to deltas). That may be a deal-breaker already.

It does not cause any tests to fail, but that may be considered a test bug/hole. When I tried overriding `update` directly instead of `swap` to use this approach, it did cause test failures. It's not actually clear to me whether `update` or `swap` would be the "right" thing to override; I am unclear on what the difference in semantics between the two methods is supposed to be.

Other approaches
If this whole approach is not gonna work out (I'm not sure), and this race condition is actually in need of fixing (I'm not sure how serious it is; I was mostly investigating cause it will get more serious with 'derivatives' and I wanted to start playing with something simpler), other approaches could include:
Pessimistic locking. Personally not a fan; the performance characteristics and deadlock potential of pessimistic locking are something I find confusing and hard to deal with.
For ActiveRecord, suggest that if someone cares about the race condition, they use AR's own optimistic locking feature on the relevant model -- but then shrine code could rescue the `ActiveRecord::StaleObjectError` and do appropriate things in the rescue (the appropriate thing may differ in this simple case, where it's probably just to silently abandon, and in the hypothetical derivatives case, where it would want to merge changes and try again).

Use DB transactions with explicitly raised isolation levels. It is possible to do that in AR, and should be in Sequel. Hypothetically, if I am understanding the documentation right, postgres with "repeatable read" for this use case will result in functionality almost identical to optimistic locking -- more like AR's built-in optimistic locking, where any change to any column of the model will cause an error. I'm not totally sure what exception AR will raise in that case, or whether it's consistent from db to db. It's hard to wrap your head around transaction isolation levels, and the semantics can differ between dbs (although there is an SQL spec for it, it's not always followed). One reason I like optimistic locking is cause I pretty much understand its semantics and performance characteristics, whereas features like transaction isolation and pessimistic locking seem much more like dark arts.