
For discussion: experiment using optimistic locking to avoid race condition #291

Closed

Conversation

jrochkind
Contributor

This is not really proposed as possibly being ready for merge; it's sometimes just easier to talk starting from some code instead of a purely abstract discussion. I also did this as part of getting more familiar with shrine's internal implementation, and because concurrent race conditions are likely to be even more of a problem (and more complicated to solve) when dealing with 'derivatives' possibilities.

This uses an "optimistic locking" approach on promotion with the backgrounding plugin, replacing the earlier logic that was meant to avoid the race condition but did not do so completely. Right now there is only an ActiveRecord implementation, but it does not use ActiveRecord's built-in optimistic locking; instead it locks on the shrine data column, while updating only the shrine data column.

One of the nice things about that approach is that, unlike built-in optimistic locking, if someone else has updated the record but only touched other columns, it will successfully avoid a conflict.
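To make the column-scoped check concrete, here's a plain-Ruby simulation of the idea (not Shrine's or ActiveRecord's actual API; `FakeRow` and `compare_and_set` are hypothetical stand-ins). Against a real database this would be an `UPDATE records SET data = ? WHERE id = ? AND data = ?` whose affected-row count tells you whether the swap won:

```ruby
require "json"

# In-memory stand-in for a DB row. A real implementation would issue
#   UPDATE records SET data = :new WHERE id = :id AND data = :old
# and check the affected-row count instead of using compare_and_set.
class FakeRow
  def initialize(attrs)
    @attrs = attrs
    @mutex = Mutex.new
  end

  def read(column)
    @mutex.synchronize { @attrs[column] }
  end

  def write(column, value)
    @mutex.synchronize { @attrs[column] = value }
  end

  # Optimistic update scoped to a single column: it succeeds only if that
  # column still holds the value we originally read. Other columns are
  # free to change in the meantime without causing a conflict.
  def compare_and_set(column, expected, new_value)
    @mutex.synchronize do
      return false unless @attrs[column] == expected
      @attrs[column] = new_value
      true
    end
  end
end

row = FakeRow.new(
  "title" => "old title",
  "data"  => { "id" => "cache/abc" }.to_json
)

original = row.read("data")
row.write("title", "new title") # concurrent change to an unrelated column

# The scoped check still passes, because only "data" is compared:
ok = row.compare_and_set("data", original, { "id" => "store/abc" }.to_json)
# whereas a second attempt from the same stale snapshot would now fail
```

With AR's built-in optimistic locking, the title change above would have bumped the lock version and made the promotion fail; scoping the check to the data column avoids that false conflict.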

Because of the limitations of the APIs available in AR, the consequence is that you don't get AR callbacks -- so no AR callbacks would be called in the promotion phase, or in any manual uses of the swap method, as recommended for instance in the regenerating-versions doc. If we were to go ahead with this approach, we'd probably have to make sure there were easy built-in before/after/around promotion/swap/update hooks that could serve as a substitute for the missing AR ones (though it might be difficult to make them fully capable, with access to deltas).

That may be a deal-breaker already. It does not cause any tests to fail, but that may be considered a test bug/hole. When I tried overriding update directly instead of swap to use this approach, it did cause test failures. It's not actually clear to me whether update or swap would be the "right" thing to override; I am unclear on what the difference in semantics between the two methods is supposed to be.

Other approaches

If this whole approach is not gonna work out (I'm not sure), and this race condition actually needs fixing (I'm not sure how serious it is; I was mostly investigating because it will get more serious with 'derivatives', and I wanted to start by playing with something simpler), other approaches could include:

  1. Pessimistic locking. Personally not a fan: the performance characteristics and deadlock potential of pessimistic locking are something I find confusing and hard to deal with.

  2. For ActiveRecord, suggest that if someone cares about the race condition, they use AR's own optimistic locking feature on the relevant model -- but then shrine code could rescue the ActiveRecord::StaleObjectError and do the appropriate thing in the rescue (which may differ between this simple case, where it's probably just to silently abandon, and the hypothetical derivatives case, where it would want to merge changes and try again).

  3. Use DB transactions with explicitly raised isolation levels. It is possible to do that in AR, and should be in Sequel. Hypothetically, if I am understanding the documentation right, postgres with "repeatable read" for this use case will give functionality almost identical to optimistic locking -- more like AR's built-in optimistic locking, where any change to any column of the model will cause an error. I'm not totally sure what exception AR will raise in that case, or whether it's consistent from db to db. It's hard to wrap your head around transaction isolation levels, and the semantics can differ between dbs (although there is an SQL spec for it, it's not always followed). One reason I like optimistic locking is because I pretty much understand its semantics and performance characteristics, whereas features like transaction isolation and pessimistic locking seem much more like dark arts.

… race conditions

AR-only in this demo.
A side effect is you no longer get AR model callbacks on promote (or manual ) which may be a deal-breaker
@janko
Member

janko commented Sep 13, 2018

Thanks a lot for the PR and your detailed thoughts. I should have time tomorrow to review it properly.

But for now I want to say that I've tried this approach in the past too, and it introduced a regression because people relied on ActiveRecord callbacks triggering during promotion, so I reverted it. Maybe it could still work if it was introduced in a safer, more backwards-compatible way; we can discuss the possibilities.

@jrochkind
Contributor Author

Thanks for response.

Yeah, I knew that might be a blocker. If it really is the right way to do things, it might need to be a backwards-incompatible change -- or an option.

But really this was just a trial run for needing to do things with "derivatives", where, if we let people update derivatives "whenever" -- which I think is actually a core requirement -- an analogous situation becomes a problem that is both much more dangerous and much more complicated to deal with. :(

The two problems could interact though, to the extent that someone could be using the backgrounding plugin and "new" derivatives. Spins my head to think about it.

@jrochkind
Contributor Author

Here's a thought:

The mechanism of a race-condition-free update strategy needs to be entirely pluggable (not necessarily meaning the shrine plugin system, but maybe).

Like not even assumed by an ORM plugin -- although a given ORM plugin might have a default. But even with a given ORM, you might want to choose an optimistic locking approach (either using the optimistic locking built into the ORM, or not, as in this PR); a pessimistic locking approach; or transaction isolation levels.

Ideally there would be race-free-update "strategies" that could work for backgrounding update AND new-style derivatives update. You may or may not want to choose the same one for each kind of update.

Of course, if we were to provide all of those out of the box, it gives us a lot more work to do! And it's really hard to test and be sure you are free of race conditions with some of the approaches.

@janko
Member

janko commented Sep 15, 2018

Ok, I was able to properly read the code and your thoughts in this PR.

First of all, in case you're interested, here are the commits that introduced (and later reverted) a similar kind of optimistic locking in the past:

  • 5d7cab6 (added for ActiveRecord + Sequel)
  • 706ee20 (reverted for ActiveRecord)
  • 28b5447 (reverted for Sequel)

One interesting note is that with Sequel it's possible to add a WHERE filter to a model instance that will be applied on update, and this way the callbacks will still trigger. But I really disliked that it modifies the current model instance, which could be very surprising if the user wanted to make another update after attachment promotion (that update probably then wouldn't work).

So, I would prefer to avoid this approach if we can -- mainly because of the lack of callbacks, but also because I don't know how it would help with derivatives (I'll explain further down in this comment).

But really this was just a trial run for needing to do things with "derivatives", where, if we let people update derivatives "whenever" -- which I think is actually a core requirement -- an analogous situation becomes a problem that is both much more dangerous and much more complicated to deal with. :(

Yes, I agree that having a mechanism to eliminate race conditions for the new derivatives plugin is a core requirement.

The two problems could interact though, to the extent that someone could be using the backgrounding plugin and "new" derivatives. Spins my head to think about it.

I don't think it should be too complicated. I imagine we'd have an ORM-specific #atomic_update method like in this PR, which would be used in the backgrounding plugin or with derivatives. The only reason the current atomic-update attempt lives in the backgrounding plugin is that IMO it's only really needed when using backgrounding, but it's not coupled to the rest of the backgrounding implementation.

That may be a deal-breaker already. It does not cause any tests to fail, but that may be considered a test bug/hole.

Oh, I thought I added tests after introducing that backwards incompatibility with callbacks. I agree they should definitely exist.

It's not actually clear to me whether update or swap would be the "right" thing to override; I am unclear on what the difference in semantics between the two methods is supposed to be.

Yeah, #swap should have probably been named better. The difference is that #update simply updates the record with the new attachment data, while #swap wraps the #update and tries to make it race condition free.

  1. Pessimistic locking. Personally not a fan: the performance characteristics and deadlock potential of pessimistic locking are something I find confusing and hard to deal with.

I would like to hear more. What makes you concerned about the performance of pessimistic locking? What are the scenarios you imagine where pessimistic locking could cause deadlocks? Since all that would happen inside the lock is an UPDATE query, I expect the lock to be held only very briefly.

The advantages for me are: (a) callbacks are still triggered, (b) no need to rescue exceptions, as the DB query will just block until the lock is released, and (c) it should work for both promoting and derivatives because of record refreshing. To briefly explain (c): with promoting you only have to check whether the attachment has changed, but with derivatives we also need to be able to retrieve the up-to-date list of derivatives before updating the column, which we would because the record would be refreshed. This is how I imagine it:

def safe_update
  record.db.transaction do
    record.lock! # record is refreshed with a SELECT ... FOR UPDATE query
    yield        # this updates any record column values in-memory
    record.save(validate: false) # persists any new values
  end
end

# promoting

# ... upload cached file to permanent storage ...
safe_update do
  # write stored file data to attachment column
  record.send("#{data_column}=", stored_file.to_json)
end

# adding new derivatives

# ... process new derivatives and upload them to permanent storage ...
safe_update do
  # record is refreshed with any derivatives that were added while the new
  # ones were being processed, so we get the up-to-date list
  record.send("#{derivatives_column}=", derivatives.merge(**new_derivatives).to_json)
end

Note that this is just a sketch. There should also be some failure callback which would get triggered if the attachment has changed, where we would delete the processed derivatives. It should maybe also be possible to do promotion/derivative updating without locking, to avoid the refresh and the lock in case people don't use backgrounding (though maybe it should still be done in either case for safety, e.g. if people decide to do their own backgrounding).

  2. For ActiveRecord, suggest that if someone cares about the race condition, they use AR's own optimistic locking feature on the relevant model -- but then shrine code could rescue the ActiveRecord::StaleObjectError and do the appropriate thing in the rescue (which may differ between this simple case, where it's probably just to silently abandon, and the hypothetical derivatives case, where it would want to merge changes and try again).

I guess we could have ActiveRecord's/Sequel's built-in optimistic locking (with the separate version column) as an alternative locking implementation, and I agree with the behaviour that you suggested. I think your suggestion that we could make locking strategies configurable is great, e.g. plugin :activerecord, locking: :optimistic. I would like pessimistic locking to be the default as it doesn't have any prerequisites like an extra DB column.

  3. Use DB transactions with explicitly raised isolation levels.

That's an interesting find, but I have to say I'm having trouble understanding how it works. Also, I thought that transactions are for making multiple updates atomic, and that they don't actually prevent updates from happening during a transaction. Since we're doing a single update, I don't think transactions will help us. I'm also worried about cross-DB support.

@jrochkind
Contributor Author

Thanks for detailed comments!

But I really disliked that it modifies the current model instance, which could be very surprising if the user wanted to make another update after attachment promotion (that update probably then wouldn't work).

Hmm... but doesn't the current promotion code also modify the current model instance? Even with backgrounding, it fetches a new model instance to check for changes (still with a race condition as discussed)... but then just abandons the new model instance, and updates the original. And without backgrounding, it's always updating the original (and saving it to the db). No? Am I misunderstanding the code?

But yeah, this is some of the stuff I'm running into working on derivatives -- for which I hope to have some code to talk about (that will potentially have major things that need changing) next week. It's always a lot easier to talk about this with some code to comment on than purely abstractly.

Incidentally, an interesting thing about pessimistic locking is that, as far as I can tell, it's really only safe if ALL the code that might update that record (or at least that column in that record) uses it. As opposed to optimistic locking, that is safe even if other code updating that record/column isn't in on the game.
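A tiny scripted example of that point, using a Ruby Mutex as a stand-in for an application-level lock and a hash as the row (illustrative names only). Note the caveat applies to locks the database doesn't know about, such as app-level or advisory locks taken inconsistently; as discussed later in the thread, a DB-level SELECT ... FOR UPDATE does make plain UPDATEs block:

```ruby
# Scripted single-threaded interleaving: writer A does a proper
# read-modify-write under the lock, writer B ignores the lock entirely.
# The Mutex stands in for an application-level or advisory lock; the hash
# stands in for the row. (At the DB level a plain UPDATE would in fact
# block on a SELECT ... FOR UPDATE row lock, so this concern is about
# locks the database doesn't enforce.)
row  = { "data" => "v1" }
lock = Mutex.new

lock.lock
snapshot = row["data"]          # A reads "v1" while holding the lock

row["data"] = "B's update"      # B writes without ever taking the lock

row["data"] = snapshot + " + A" # A writes from its snapshot and...
lock.unlock

# ...B's update has been silently overwritten: row["data"] == "v1 + A"
```

With optimistic locking, A's write would instead fail its compare step, because it would notice the column no longer holds the snapshot value.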

In derivatives, in order to have any hope of dealing with race conditions sanely, I think update_derivatives has got to make an immediately persisted change to the db.

Which I guess probably means it should not be on the current model instance, since, as I think you were saying, persisting other previously un-saved changes on the model instance would be unexpected. (But I thought that's what promotion is doing now... in my reading of the code, any call to update ends up persisting the current record -- which may have other changes; hope you wanted them persisted -- and I think modifying it too. But maybe I missed something!)

@jrochkind
Contributor Author

On the performance characteristics of pessimistic locking -- well, when you take out a pessimistic lock, it means anything else that needs that lock will block waiting for it. I am not totally sure if that includes anything else wanting to read the row in question, or just anything else wanting to write to it. And hypothetically, two processes can end up deadlocked, each waiting for a lock the other holds (in which case I believe postgres, at least, will notice and return an error, rather than letting them both wait forever).

Will performance be an issue here, where the lock should not be held very long at all? Is deadlock actually possible? I'm not sure, it's quite possible it's just fine, but it's hard enough for me to wrap my head around that it always makes me nervous.

On transactions with isolation levels... theoretically it's an SQL standard, so all dbs should work the same. I am not totally sure how much practice matches this theory.

But you don't need to have multiple updates for transactions with isolation levels to be useful, even though examples to demonstrate them often use multiple records. Theoretically, transaction isolation will do something very similar to pessimistic locking, but with better performance/concurrency characteristics. Unlike a pessimistic lock, other processes won't be blocked waiting for access; rather, if another process does do something that would conflict with the semantics, one of the processes will error out (similar to an optimistic locking error in that case).

Here's the pertinent line from the postgres docs that makes me think this:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the message ERROR: could not serialize access due to concurrent update

If you have a transaction with "repeatable read" isolation -- if someone else updated the record you fetched, and then you tried to update it -- postgres will refuse to do your update and error out. Accomplishing exactly what we are trying to accomplish here: the race condition we are dealing with is when our process fetches a record, makes changes, and then saves it, but in between our fetch and save someone else updated it, and we're in danger of overwriting their update and/or putting data in an inconsistent state because of it. "Repeatable read" isolation avoids that, giving you an error if that was going to happen -- but without actually taking out a lock that would make other processes wait in a queue for it to be released.

Have I ever actually used this myself? No. Does it make my head spin? Yes. Do I trust my ability to understand the implications when using it? Totally not. So I'm not really saying to use it... but I'm scared of pessimistic lock for the same reason.

It is really hard to write a test for any of this. I am not surprised there's no test for the current backgrounding attempt to avoid the race condition. Plus, if there WAS a test and it passed... wouldn't that be some false confidence, since we in fact know there IS a race condition?

@jrochkind
Contributor Author

I would like pessimistic locking to be the default as it doesn't have any prerequisites like an extra DB column.

So, this PR was meant to demonstrate that no extra DB column is required for optimistic locking, right? But yeah, to use the built-in AR optimistic locking, true. I don't know about Sequel.

As far as what to do with backgrounding and race conditions... I honestly have no idea. I'm not even totally sure how bad the consequences of the race condition are here. Thinking it through, if the race condition happens, what could happen is:

  1. You change an attached file to file1.
  2. You change it again to file2, before promotion has occurred. file2 is visible in the db.
  3. The promotion from file1 occurs... it's visible as file1 again!
  4. (maybe) The promotion from file2 occurs, and now it's file2 again!

While annoying, that is not as disastrous as the race conditions that could happen with derivatives, which could be either (A) a record ends up, either temporarily or permanently, with derivatives that don't match the original file -- they are for some other file -- or (B) some derivatives you tried to add don't end up stored in the model at all (because they were overwritten by another update), but the derivative files are still persisted in storage.

A lot more disastrous in both cases.

It's worth pointing out that if we took ActiveStorage's approach of having one row per derivative... it might make these race conditions a LOT easier to deal with. But it would also make some other things, especially with shrine's architecture, a lot harder. I don't even know how you'd deal with foreign-key related objects in a DB-agnostic way, like shrine does now. AS's approach is a lot harder when you can't assume a particular ORM.

@janko
Member

janko commented Sep 16, 2018

Hmm... but doesn't the current promotion code also modify the current model instance? Even with backgrounding, it fetches a new model instance to check for changes (still with a race condition as discussed)... but then just abandons the new model instance, and updates the original. And without backgrounding, it's always updating the original (and saving it to the db). No? Am I misunderstanding the code?

True, promotion will discard any other in-memory column changes, and that is expected. Note that promotion is intended to be called after the record has been saved, so it's unexpected that the user will modify it before promotion. Also, the only way the user can modify the record before promotion is if they use Attacher#promote directly.

Incidentally, an interesting thing about pessimistic locking is that, as far as I can tell, it's really only safe if ALL the code that might update that record (or at least that column in that record) uses it. As opposed to optimistic locking, that is safe even if other code updating that record/column isn't in on the game.

This thought also came to my mind, so I went ahead and checked: while the lock is held by a SELECT ... WHERE id = 123 FOR UPDATE, the following holds:

  • a parallel SELECT ... WHERE id = 123 FOR UPDATE (on the same record) will block
  • a parallel UPDATE ... WHERE id = 123 will also block

Which I guess probably means it should not be on the current model instance, since, as I think you were saying, persisting other previously un-saved changes on the model instance would be unexpected. (But I thought that's what promotion is doing now... in my reading of the code, any call to update ends up persisting the current record -- which may have other changes; hope you wanted them persisted -- and I think modifying it too. But maybe I missed something!)

I don't think saving other un-saved changes would be unexpected or unwanted, only that the caller shouldn't rely on it. In the case of pessimistic locking (where reloading happens) and of your optimistic locking implementation, other changes would be dropped, while in the case of built-in optimistic locking other changes would be persisted. I wouldn't worry about that.

@jrochkind
Contributor Author

True, promotion will discard any other in-memory column changes, and that is expected

Hmm. I don't think it will discard them. I think it will save them. But we could obviously play with the code and find out for sure!

Also, the only way the user can modify the record before promotion is if they use Attacher#promote directly.

I believe there's at least some documentation somewhere that suggests calling promote (or update or swap, which I think will have the same issues) directly in some circumstances. I could go look for it.

But yes, ordinarily that won't be an issue.

Except with derivatives it will be, because we're definitely gonna be suggesting people call update_derivatives directly, whenever the heck they want... right? I think supporting that is the whole goal.

In some of my current implementations, update_derivatives ends up not even modifying the current record, to get around that -- like doing a find_record and then modifying that (with some kind of locking, etc.). But that could be considered weird too: if you want the current record to show your added derivatives, you'd have to reload it.

One way or another, something's gonna be weird, I think; I can't find any way to make all of these things act as one might expect, with no race conditions. But this will be clearer when there's code to look at, maybe.

@janko
Member

janko commented Sep 16, 2018

I am not totally sure if that includes anything else wanting to read the row in question, or just anything else wanting to write to it.

Normal SELECTs will not block when the lock is taken, only another SELECT ... FOR UPDATE. The UPDATEs will block.

Will performance be an issue here, where the lock should not be held very long at all? Is deadlock actually possible? I'm not sure, it's quite possible it's just fine, but it's hard enough for me to wrap my head around that it always makes me nervous.

Since the lock will be held for the shortest amount of time possible, I don't think it could cause performance issues (otherwise I would never use this feature 😛).

I think the only way a deadlock is possible is if while holding the lock, the connection (a) does UPDATE to another record or (b) does SELECT ... FOR UPDATE for another record. Since none of this holds (as we only do one UPDATE of the current record), I'm pretty sure a deadlock is not possible.

Though it's possible that user-defined callbacks trigger some other UPDATE queries... You're right, we should consider risks of deadlocks.

Unlike a pessimistic lock, other processes won't be blocked waiting for access; rather, if another process does do something that would conflict with the semantics, one of the processes will error out (similar to an optimistic locking error in that case).

If there is a parallel update to the record while we're updating the attachment, I would like that update to succeed once that transaction is finished, not error out. What I like about the pessimistic locking approach is that there are no errors, only "adjusting" when UPDATEs will happen.

If you have a transaction with "repeatable read" isolation -- if someone else updated the record you fetched, and then you tried to update it -- postgres will refuse to do your update and error out. Accomplishing exactly what we are trying to accomplish here: the race condition we are dealing with is when our process fetches a record, makes changes, and then saves it, but in between our fetch and save someone else updated it, and we're in danger of overwriting their update and/or putting data in an inconsistent state because of it. "Repeatable read" isolation avoids that, giving you an error if that was going to happen -- but without actually taking out a lock that would make other processes wait in a queue for it to be released.

If, while we're adding a derivative to the record, someone updates another column on the same record, I wouldn't consider that a race condition, and I wouldn't want an exception to happen. I would want the other update to succeed (after the derivative has been added).

If the parallel update is caused by another derivative being added to the same record, then the pessimistic locking approach would correctly add one derivative after the other without race conditions, while the transaction isolation approach would error out?
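The "one derivative after the other" behaviour can be sketched in plain Ruby, with a Mutex standing in for the row lock taken by SELECT ... FOR UPDATE plus the record refresh, and a hash standing in for the record (hypothetical names, not Shrine's API):

```ruby
require "json"

# Two writers each add a different derivative under a lock. Because the
# whole read-merge-write happens inside the critical section, the second
# writer sees the first one's addition, so both survive in either order --
# queued updates, no errors.
row  = { "derivatives" => {}.to_json }
lock = Mutex.new

add_derivative = lambda do |name, file|
  lock.synchronize do
    current = JSON.parse(row["derivatives"]) # "refresh" under the lock
    row["derivatives"] = current.merge(name => file).to_json
  end
end

[
  Thread.new { add_derivative.call("thumb", "store/thumb.jpg") },
  Thread.new { add_derivative.call("large", "store/large.jpg") },
].each(&:join)

# Both derivatives are present regardless of scheduling order.
```

If each writer had instead read its snapshot before acquiring the lock, the second write would clobber the first even with the lock held around the write alone; holding the lock across the whole read-merge-write is what makes this safe.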

It is really hard to write a test for any of this. I am not surprised there's no test for the current backgrounding attempt to avoid the race condition. Plus, if there WAS a test and it passed... wouldn't that be some false confidence, since we in fact know there IS a race condition?

Hehe, yeah, I didn't test this because I didn't know how to stop the process between the SELECT (refresh) and the UPDATE. I'm comfortable with using Fibers, but I cannot just insert a Fiber.yield between two lines of a method in tests 😛

So, this PR was meant to demonstrate that no extra DB column is required for optimistic locking, right? But yeah, to use the built-in AR optimistic locking, true. I don't know about Sequel.

Yes, I was considering pessimistic locking vs ActiveRecord's/Sequel's built-in optimistic locking here (Sequel has it as well). At the moment I don't know whether the implementation from this PR is going to work, because (a) it doesn't trigger callbacks and (b) I don't see how it could be utilized for derivatives (if the update count is 0, do we refresh the model, check the current attachment, and retry?).

A lot more disastrous in both cases.

Agreed 😄

It's worth pointing out that if we took ActiveStorage's approach of having one row per derivative... it might make these race conditions a LOT easier to deal with. But it would also make some other things, especially with shrine's architecture, a lot harder. I don't even know how you'd deal with foreign-key related objects in a DB-agnostic way, like shrine does now. AS's approach is a lot harder when you can't assume a particular ORM.

Actually, I'm not sure how having one row per derivative in a separate table would help us. Can you clarify?

@janko
Member

janko commented Sep 16, 2018

Hmm. I don't think it will discard them. I think it will save them. But we could obviously play with the code and find out for sure!

Oh, right, core promotion will save changes. The backgrounding plugin will also save changes in case the attachment hasn't changed (otherwise it skips the update altogether).

Except with derivatives it will be, because we're definitely gonna be suggesting people call update_derivatives directly, whenever the heck they want... right? I think supporting that is the whole goal.

The way I see it, we'll want to have two ways of adding derivatives:

  • one writes to the derivatives record attribute in-memory, and lets the user save the record when they want (in this case they can chuck in any other record changes they want)
  • the other updates the derivatives column in the database using locking (in this case the user cannot expect their record changes to be persisted, because the update is focused only on the derivatives)

The latter would then probably be a wrapper around the former. We can then document that their record changes won't be saved in the latter case (I think that's completely reasonable).

@jrochkind
Contributor Author

jrochkind commented Sep 16, 2018

Sorry for so many words, there's just a lot going on, thanks for the discussion.

If there is a parallel update to the record while we're updating the attachment, I would like that update to succeed once that transaction is finished, not error out. What I like about the pessimistic locking approach is that there are no errors, only "adjusting" when UPDATEs will happen.

Yeah, I mean, everyone would LIKE that, but there's a reason people use optimistic locking instead... because it has a LOT more straightforward performance characteristics.

In this case, transparently recovering from an optimistic locking error (and this applies to the analogous transaction isolation "serialization" error too) is really straightforward:

Fetch the record again, re-apply your merges on top of the hash, and save again. If it errors again, repeat up to some max number of tries.

That's how optimistic locking is actually designed to be used, although the way ORMs expose it can sometimes make it hard to use this way!

Actually, I'm not sure how having one row per derivative in a separate table would help us. Can you clarify?

So there's a new class of race condition introduced by single-column-hash derivatives. Not just the "backgrounding" one, where another process may have changed the original.

But also one where the original remains the same, and another process has changed the derivatives. Because we're storing all the derivatives in a single column, if someone else added or removed a derivative in between when we fetched and when we saved, our change overwrites theirs, since we're writing the whole hash. And it overwrites them without any app callbacks ever being called, so there may be actual derivative files still on storage but no longer referenced (or derivatives still referenced but no longer persisted in storage, since someone else thought they removed them).

A single derivative per row gets rid of that kind of conflict, because the db already lets you atomically update a row -- that's what dbs do. But it doesn't let you atomically update part of a serialized JSON hash (well, except pg jsonb does, with the right calls, which ActiveRecord won't do for you on save).
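The lost-update hazard with a single serialized hash column, spelled out in plain Ruby (no locking of any kind; the names are illustrative):

```ruby
require "json"

# Both writers start from the same snapshot of the serialized hash, each
# writes the whole hash back, and the second write silently erases the
# first writer's derivative. No locking, no callbacks, no error raised --
# the file behind "thumb" is now orphaned in storage.
row = { "derivatives" => {}.to_json }

snapshot_a = JSON.parse(row["derivatives"]) # writer A reads {}
snapshot_b = JSON.parse(row["derivatives"]) # writer B also reads {}

row["derivatives"] = snapshot_a.merge("thumb" => "store/thumb.jpg").to_json
row["derivatives"] = snapshot_b.merge("large" => "store/large.jpg").to_json

# Final column value contains only "large" -- "thumb" was overwritten.
```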

The way I see it, we'll want to have two ways of adding derivatives:

  • writes to the derivatives record attribute in-memory, and lets the user save the record when they want (in this case they can chuck in any other record changes they want)
  • updates the derivatives column in the database using locking (in this case the user cannot expect their record changes to be persisted, because the update is focused only on the derivatives)

The latter would then probably be a wrapper around the former. We can then document that their record changes won't be saved in the latter case (I think that's completely reasonable).

The former puts you at risk of race conditions unless you really know what you're doing, and pay attention to where you're doing it in your code.

The latter... in addition to any unrelated in-memory record changes not being saved (which I think is probably desirable as well as reasonable)... the derivative changes might not be reflected in your in-memory record at all. You'd have to reload to see them? That is definitely not desirable... but may be necessary, given the APIs of actual ORMs and the desire to have it work at any point of, or unrelated to, the actual shrine cache/promotion process.

@jrochkind
Contributor Author

jrochkind commented Sep 16, 2018

on optimistic locking (with a failure that can be recovered from) compared to pessimistic locking (with blocking/waiting)... see "lock-free concurrency", "wait-free concurrency", "lock-free design", etc. in concurrent programming. We are talking about a kind of concurrent programming here. There's a reason people prefer the non-blocking ones for performance reasons. That applies here, at least theoretically.

Still, the pessimistic locking approach is probably good enough... for anyone not dealing with massive scale (is that good enough?)... and I'm really no concurrency expert either, I just know enough to know what I don't know, heh.

If each derivative was in a separate row, it kind of gets us out of having to be a concurrency expert, cause the DB takes care of it for us -- DBs were designed for atomic updates on an individual column in an individual row, not for a key in a serialized hash.

But I really don't want to do one-row-per-derivative either if we can avoid it. In part cause I have no idea how to implement it in shrine.

@janko
Member

janko commented Sep 16, 2018

Sorry for so many words, there's just a lot going on, thanks for the discussion.

That's fine, I'm glad we're discussing this, it's great for me to have other points of view besides mine 😃

Yeah, I mean, everyone would LIKE that, but there's a reason people use optimistic locking instead... cause it's got a LOT more straightforward performance characteristics.

I don't think that's the primary reason people use optimistic locking (I'm talking now about the one built into ActiveRecord/Sequel). I think people use it mainly for the use case of two people updating the same resource via a form; if the other person updated the resource you're looking at between the time you opened it and the time you clicked save, you want the save to fail and the app to tell you to refresh the page.

This is not possible with pessimistic locking, because there updates always succeed, they just wait in certain scenarios. This is what I think I want in Shrine.
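The "updates always succeed, they just wait" behaviour can be sketched in plain Ruby, with a `Mutex` standing in for a `SELECT ... FOR UPDATE` row lock. All names here are hypothetical; this only illustrates the serialization property, not Shrine's implementation.

```ruby
row = { derivatives: {} }
row_lock = Mutex.new

update_derivatives = lambda do |new_entries|
  row_lock.synchronize do                           # SELECT ... FOR UPDATE
    current = row[:derivatives]                     # re-read inside the lock
    row[:derivatives] = current.merge(new_entries)  # UPDATE, then COMMIT releases
  end
end

# Two concurrent updates: neither fails, the second simply waits its turn.
threads = [
  Thread.new { update_derivatives.call("thumb" => "v1") },
  Thread.new { update_derivatives.call("large" => "v1") },
]
threads.each(&:join)
```

Both writes survive regardless of interleaving, because the read and the write happen atomically under the same lock; there is no exception path, only (brief) waiting.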

Note that if we're talking about ActiveRecord's/Sequel's built-in optimistic locking, then I'm not sure which one has better performance characteristics. For every retry of the update you add two more queries (refresh + update), while with pessimistic locking you'll only ever make two queries (SELECT ... FOR UPDATE + UPDATE). True, pessimistic locking will block any parallel user-made updates, but then again optimistic locking will raise an exception, which I don't think is correct behaviour (there is no need to raise an exception if one process updates column A and the other column B; there is no race condition there).

So I think that ActiveRecord's/Sequel's built-in optimistic locking could produce more surprising behaviour (unnecessary exceptions) compared to pessimistic locking. I know in this PR we were talking about a different kind of optimistic locking, I just wanted to cover this one as well because it was mentioned here.

Single derivative per row gets rid of that kind of conflict, because the db already lets you atomically update a row, that's what dbs do.

Thanks for clarifying. Agreed, having a row per derivative would eliminate race condition considerations for the derivatives column. Though, we would still need to check whether the original attachment has changed when updating derivatives.

But I really don't want to do one-row-per-derivative either if we can avoid it. In part cause I have no idea how to implement it in shrine.

Yes, I really dislike the performance considerations of this approach. One is that people would need to remember to eager load whenever they have a collection of records with attachments – I want people to opt in to this responsibility. The other is that the speed of retrieving derivatives would become inversely proportional to the number of derivatives per attachment. I don't think that's a good tradeoff, as it shouldn't be that hard to make the current approach race-condition free.

The former puts you at risk of race conditions unless you really know what you're doing, and pay attention to where you're doing it in your code.

Yes, that would make sense mostly in synchronous examples, where you're not using background jobs. Still, I would like to allow people to save the record in their preferred way, should they really want to. But we can think about that after we merge the initial version of the derivatives plugin.


I would like us to summarise our thoughts now, so that we can reach a conclusion. I will start:

For me the downsides of the "optimistic locking" proposed in this PR are:

  • no ORM callbacks triggered
    • means backwards incompatibility
    • also means that extensions like Paper Trail will not work with Shrine anymore
  • in-memory record will not reflect the actual state after promotion or updating derivatives is finished (should the user choose to interact with the record further)

I also don't think that ActiveRecord's/Sequel's built-in optimistic locking would work well, because:

  • users would need to add the extra DB column
  • parallel updates would cause exceptions instead of correctly succeeding (as they're not updating the same column)

For the latter reason I'm also not fond of transaction with a specific isolation level – I don't want any exceptions to be raised when user does a regular record update. I also think that could be overwhelming for people trying to understand the Shrine code, as it's non-standard, at least I've never seen it (optimistic/pessimistic locking are more standard in ActiveRecord/Sequel IMO).

So, at the moment, I'm still leaning towards pessimistic locking.

  • I think the performance considerations will apply only in the worst-case scenario, where someone attempts to update the record between the time it was refreshed and the time it was updated, which is an extremely small time window so IMO it's very unlikely to happen.
  • In this case I think the correct thing would happen – the "outer" update will block until Shrine's one is finished, and then execute normally. Also, when two parallel derivative updates occur, the pessimistic locking will sync them so that they happen one after the other, at minimal necessary performance cost IMO.
  • The only real danger IMO is a theoretical possibility of deadlocks. I think they can only happen by user mistake (if they add certain callbacks that would cause inter-connection between records that are locked). It seems that Postgres doesn't detect a deadlock, it just hangs forever (in case it would raise an exception we could rescue that and retry). I created an example of a deadlock just to verify that my understanding is right.

@jrochkind
Contributor Author

parallel updates would cause exceptions instead of correctly succeeding (as they're not updating the same column)

Do keep in mind that this is not necessarily true, so long as we have a way to fetch what was in the db when the exception happens, our own logic can rescue it and fix it.

Pseudo-code:

tries = 0
begin
  tries += 1
  record.deriv_column.merge!(new_derivs)
  record.save!
rescue ActiveRecord::StaleObjectError
  record.reload
  # after the reload we'll be merging into what we _just_ fetched from the
  # db, so we won't get another locking failure ... unless there's been
  # _another_ change in the meantime
  retry if tries < max_tries
  raise
end

This works with any kind of optimistic locking, built into the ORM or not. It does potentially get complicated with how it impacts the existing record, if you're using the existing record, rather than a copy you got with find_record.

But I think even with pessimistic locking you've got to re-fetch the record and then update that other copy, instead of the user's original copy -- cause the fetch and update have to both happen within the lock, for it to work, right? So it's still got weird results on whether the in-memory record is updated.

But okay, I hear ya! My current code experiments do have a 'pluggable' place for how the update is done... but different ones have really different impacts on the semantics (callbacks, whether the in-memory record is updated, etc).

I've just got to get my code into a state where I can share it, and we can feedback and iterate from there. I'm not super happy with where the code is getting me, I just can't figure out any other way, more sets of eyes/brains would be very helpful.

(I am also considering exploring ActiveStorage more, to be honest... as well as doing a custom-fit ActiveRecord only implementation that uses one row per derivative, but with a shrine attacher on each one)

@janko
Member

janko commented Sep 16, 2018

Do keep in mind that this is not necessarily true, so long as we have a way to fetch what was in the db when the exception happens, our own logic can rescue it and fix it.

Yes, I agree about this on the Shrine side. But I forgot to clarify: I meant I didn't want exceptions to be raised in user-defined code. For example, if an Article has a cover photo attachment, and while the cover photo is being promoted in the background the user e.g. updates the article title, I wouldn't want that article title update to fail.

Because, if you're using the ORM's optimistic locking, it will automatically be used for any kind of update, not just on the Shrine side. And when the update on the Shrine side increments the lock_version column, the user-defined update can fail with an ActiveRecord::StaleObjectError exception due to the lock_version not matching. This is what I would like to avoid, as I don't think that failing the user-defined update is appropriate in this case (I would rather that update succeed).
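The problem with a single record-wide version column can be sketched in plain Ruby. This is a hypothetical simulation, not ActiveRecord's implementation: `record` is a hash standing in for the row, and `update` is a toy version-checked write.

```ruby
StaleObject = Class.new(StandardError)

record = { title: "Old", derivatives: {}, lock_version: 0 }

# Toy version-checked update: any write bumps the shared lock_version.
update = lambda do |row, expected_version, changes|
  raise StaleObject if row[:lock_version] != expected_version
  row.merge!(changes)
  row[:lock_version] += 1
end

# The controller loads the record (seeing lock_version 0)...
seen_version = record[:lock_version]

# ...meanwhile the promote job saves derivatives, bumping lock_version to 1.
update.call(record, record[:lock_version], derivatives: { "thumb" => "v1" })

# The user's title-only update now fails, even though the two writes
# touched different columns and could both have safely succeeded.
begin
  update.call(record, seen_version, title: "New")
rescue StaleObject
  # the exception we'd rather user-defined updates never see
end
```

Because the version column is shared by all columns, it cannot distinguish a genuine conflict on the same data from two writers touching disjoint columns.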

As I understood, the same holds for the transaction isolation approach.

But I think even with pessimistic locking you've got to re-fetch the record and then update that other copy, instead of the user's original copy -- cause the fetch and update have to both happen within the lock, for it to work, right? So it's still got weird results on whether the in-memory record is updated.

By "copy" do you mean model instance? You can actually still update the same model instance with pessimistic locking: we can re-fetch the record into a separate model instance (so, instead of #lock!, which does a #reload, we do Record.where(id: id).lock(true).first), and just use it to check the attachment and derivatives column, and merge the newly generated derivatives with the existing ones. In pseudo-code (inside the Shrine::Attacher instance):

record.class.transaction do
  # re-fetch the row under a SELECT ... FOR UPDATE lock
  current_record   = record.class.where(id: record.id).lock(true).first
  current_attacher = self.class.new(current_record, name)
  raise ActiveRecord::Rollback if current_attacher.read != self.read # attachment has changed
  write_derivatives(current_attacher.derivatives.merge(**new_derivatives))
  record.save
end

I've just got to get my code into a state where I can share it, and we can feedback and iterate from there. I'm not super happy with where the code is getting me, I just can't figure out any other way, more sets of eyes/brains would be very helpful.

Awesome, feel free to push these changes to a branch on this repo directly, and Hiren and I will definitely take a look.

(I am also considering exploring ActiveStorage more, to be honest... as well as doing a custom-fit ActiveRecord only implementation that uses one row per derivative, but with a shrine attacher on each one)

Ok, that makes sense, I would like to draw some knowledge and ideas from ActiveStorage.

@jrochkind
Contributor Author

But I forgot to clarify, I meant I didn't want exceptions to be raised in user-defined code.

I think this is not true. If all we are updating is the derivatives column, on optimistic locking failure we reload the record from the db (so we got the title they updated), then re-apply changes and save again.

....You can actually still update the same model instance with pessimistic locking...

Hmm. I'm not sure your code really protects against the race condition... wait, maybe it does. Interesting. It's really a pain to create test/demonstration cases here that prove it and/or demonstrate a case where it doesn't, I'll spend more time thinking on it. But thanks for the code there, that's not the approach I was thinking of.

We also need to take care of removing from persistent storage any derivatives that have been replaced, but it's relatively straightforward how to fit that in there I think.

@janko
Member

janko commented Sep 17, 2018

I think this is not true. If all we are updating is the derivatives column, on optimistic locking failure we reload the record from the db (so we got the title they updated), then re-apply changes and save again.

I think you are describing the scenario in which the "promote" backgrounding job loads the article and begins processing/upload, then in the meanwhile ArticlesController#update gets called because the user wanted to update the article's title (which then also updates the lock_version), and then when the "promote" backgrounding job attempts to save the article it gets an ActiveRecord::StaleObjectError exception (which can then of course be retried).

I'm talking about the opposite scenario: ArticlesController#update starts and loads the article record, then in the meanwhile the "promote" backgrounding job just finishes saving the article's attachment/derivatives (which then also updates the lock_version), and then when ArticlesController#update wants to save the article's title change, it fails with an ActiveRecord::StaleObjectError because the lock_version has changed. I would like the update to succeed in this scenario.

It's really a pain to create test/demonstration cases here that prove it and/or demonstrate a case where it doesn't, I'll spend more time thinking on it.

Yeah, I assumed we just wouldn't have tests for this race condition safety, at least I don't know how I would write them 😛

@jrochkind
Contributor Author

Okay, thanks for your code snippet -- I wasn't understanding/coming up with that myself for some reason, that DOES look like the right way to do it! Still have a buncha things I'm not sure about, but probably makes sense to just get it in code at this point.

My draft right now does use a second attacher/attachment -- partially just cause it was so useful to be able to re-use that functionality and overwhelming to try to reimplement stuff that was already there from the start; partially cause I'm not yet convinced it's not useful to do so (or at least an optionally custom uploader?). It'll be a place to start, if I can get it from a random messy proof-of-concept in a custom app of mine to something I can actually share. :)

@janko
Member

janko commented Nov 8, 2018

I'm going to close this, as we agreed to incorporate the ideas from this discussion into #299.
