Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_or_create makes two GETs on a cache fail because of an oddity in Lock #122

Closed
sqlalchemy-bot opened this issue May 5, 2018 · 10 comments
Labels
bug Something isn't working

Comments

@sqlalchemy-bot
Copy link

Migrated issue, originally created by jvanasco (jvanasco)

I originally thought this was a documentation issue (#89) and desired, but now I think this is just an artifact of code changing over time. I started looking at this deeper when an app was slamming Redis when it's cache wasn't prime.

looking at lock.py lines 86-94

https://bitbucket.org/zzzeek/dogpile.cache/src/e0c1f63907e63ed1678f44ec2443314808ada1c3/dogpile/lock.py#lines-86:94

in line 87, dogpile calls value_fn() which is really self.value_and_created_fn to set the generated and createdtime values.

in line 94, dogpile calls ._enter_create(createdtime)

in that method, it looks like dogpile immediately tries to create a mutex, then calls self.value_and_created_fn again (line 131).

the two calls to get happen immediately after each other, within a microsecond or two.

i don't know all the scenarios that are needed in this area, but one 'fix' that works on tests is the following:

  1. _enter_create is modified to accept (value, createdtime) instead of just createdtime
  2. the second call is conditionally called.
#!python

-         generated = self._enter_create(createdtime)
+         generated = self._enter_create(value, createdtime)

-    def _enter_create(self, createdtime):
+    def _enter_create(self, value, createdtime):

            # see if someone created the value already
            try:
-                value, createdtime = self.value_and_created_fn()

            # see if someone created the value already
            try:
+                if not self._has_value(createdtime):
+                    value, createdtime = self.value_and_created_fn()

i'm not sure how that would affect the async model, or the call to value_fn in line 101 either.

@sqlalchemy-bot
Copy link
Author

Michael Bayer (zzzeek) wrote:

the race that this extra get solves is that there is no value, every thread races to reach _enter_create(), then one of them gets the lock, the rest wait; then when complete, the next one waiting to be the "creator" gets into the block, sees that someone just created, and then can return that. the emphasis is on not calling the creation function twice which is assumed to be vastly more expensive than get().

so then in the other case, we have the value, we acquired the mutex with block=False, which means we are definitely the ones chosen to create the new value, so we shouldnt need to check someone else did it.

I need to re-learn what the async thing does but this sort of should work:

#!diff

diff --git a/dogpile/lock.py b/dogpile/lock.py
index c4c5c80..77be275 100644
--- a/dogpile/lock.py
+++ b/dogpile/lock.py
@@ -115,11 +115,13 @@ class Lock(object):
         _async = False
 
         if self._has_value(createdtime):
+            has_value = True
             if not self.mutex.acquire(False):
                 log.debug("creation function in progress "
                             "elsewhere, returning")
                 return NOT_REGENERATED
         else:
+            has_value = False
             log.debug("no value, waiting for create lock")
             self.mutex.acquire()
 
@@ -127,19 +129,20 @@ class Lock(object):
             log.debug("value creation lock %r acquired" % self.mutex)
 
             # see if someone created the value already
-            try:
-                value, createdtime = self.value_and_created_fn()
-            except NeedRegenerationException:
-                pass
-            else:
-                if not self._is_expired(createdtime):
-                    log.debug("value already present")
-                    return value, createdtime
-                elif self.async_creator:
-                    log.debug("Passing creation lock to async runner")
-                    self.async_creator(self.mutex)
-                    _async = True
-                    return value, createdtime
+            if not has_value:
+                try:
+                    value, createdtime = self.value_and_created_fn()
+                except NeedRegenerationException:
+                    pass
+                else:
+                    if not self._is_expired(createdtime):
+                        log.debug("value already present")
+                        return value, createdtime
+                    elif self.async_creator:
+                        log.debug("Passing creation lock to async runner")
+                        self.async_creator(self.mutex)
+                        _async = True
+                        return value, createdtime
 
             log.debug("Calling creation function")
             created = self.creator()

@sqlalchemy-bot
Copy link
Author

Michael Bayer (zzzeek) wrote:

the async_creator thing makes this more intricate than that. i wonder if this issue was introduced by async_creator.

@sqlalchemy-bot
Copy link
Author

Michael Bayer (zzzeek) wrote:

so https://gerrit.sqlalchemy.org/#/c/zzzeek/dogpile.cache/+/740 will run through the test suite, though it would be nice to add a test that demonstrates the double-get behavior in the old version and confirms it's fixed.

@sqlalchemy-bot
Copy link
Author

jvanasco (jvanasco) wrote:

I can put together a test. i tracked this down via a "wrapper", which I thought was misconfigured. I ended up tearing apart dogpile from 11pm to 4am trying to work this out ;)

I remember the race condition you mentioned above well. I think this is just an artifact of dogpile being modified over the years - i'm pretty sure it only triggered a second get when necessary when I looked at this in the past.

@sqlalchemy-bot
Copy link
Author

Michael Bayer (zzzeek) wrote:

have you reviewed and tested https://gerrit.sqlalchemy.org/#/c/zzzeek/dogpile.cache/+/740 yet? it's ready to go. no hurry, jsut want to make sure you look at it since you're more focused on this than me.

@sqlalchemy-bot
Copy link
Author

jvanasco (jvanasco) wrote:

i got a bit frustrated with gerrit so just patched it myself and am still writing tests.

fyi, i have a few small cpython optimizations that i'm going to suggest in a separate pr too.

@sqlalchemy-bot
Copy link
Author

Michael Bayer (zzzeek) wrote:

when you're on the gerrit there's a "download" dropdown in the upper-right, you can bypass git and get a patchfile directly:

https://gerrit.sqlalchemy.org/changes/740/revisions/076acbaf8f882024acbd11f572c6c8f6e3426e4e/patch?zip

@sqlalchemy-bot
Copy link
Author

jvanasco (jvanasco) wrote:

This looks okay. I had one failure on one test run, but I haven't been able to replicate it. I think my computer froze and a cache value expired.

I haven't figured out how to write the test case for the old behavior yet.

FAIL tests/cache/test_dbm_backend.py::DBMBackendConditionTest::test_region_expire

@sqlalchemy-bot
Copy link
Author

Michael Bayer (zzzeek) wrote:

Don't call getter twice for existing, expired value

Fixed a condition in the :class:.Lock where the "get" function could be
called a second time unnecessarily, when returning an existing, expired
value from the cache.

Change-Id: I7b4f529a191130f99bc15763a446d1d441c50cbe
Fixes: #122

5b75a9f

@sqlalchemy-bot
Copy link
Author

Changes by Michael Bayer (zzzeek):

  • changed status to closed

@sqlalchemy-bot sqlalchemy-bot added the bug Something isn't working label Nov 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant