
Distributed lock could be acquired more than once. #127

Closed

laixintao opened this issue May 29, 2019 · 4 comments

@laixintao (Contributor) commented May 29, 2019

The problem is that redbeat uses pexpire to extend the lock lease without checking whether it still owns the lock.

For example, say we have two instances running and A currently owns the lock. A has a network issue and can't connect to Redis, so once the lease expires B acquires the lock. When A reconnects, it calls pexpire without checking whether it is still the owner of the lock. Finally, the lock redbeat::lock has two owners.
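
A minimal illustration of why the unconditional pexpire is unsafe (the key name is from this report; the token values and TTLs are made up):

import redis

r = redis.Redis()

# B re-acquired the lock after A's lease expired.
r.set('redbeat::lock', b'token-of-B', px=10000)

# A comes back and blindly extends the lease. PEXPIRE only checks that
# the key exists, not who set it, so it succeeds and A keeps scheduling
# even though B owns the lock now.
assert r.pexpire('redbeat::lock', 10000)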

To reproduce this scenario, start one redbeat instance with the command below:

➜ [redbeat] redbeat_test celery beat -S redbeat.RedBeatScheduler --config config -l debug --pidfile="/tmp/celerybeat5.pid"
celery beat v4.3.0 (rhubarb) is starting.
__    -    ... __   -        _
LocalTime -> 2019-05-29 12:26:43
Configuration ->
    . broker -> redis://localhost:6379//
    . loader -> celery.loaders.default.Loader
    . scheduler -> redbeat.schedulers.RedBeatScheduler
       . redis -> redis:///
       . lock -> `redbeat::lock` 10.00 seconds (10s)
    . logfile -> [stderr]@%DEBUG
    . maxinterval -> 5.00 seconds (5s)
[2019-05-29 12:26:43,612: DEBUG/MainProcess] Setting default socket timeout to 30
[2019-05-29 12:26:43,613: INFO/MainProcess] beat: Starting...
[2019-05-29 12:26:43,617: DEBUG/MainProcess] Stored entry: <RedBeatSchedulerEntry: celery.backend_cleanup celery.backend_cleanup() <crontab: 0 4 * * * (m/h/d/dM/MY)>
[2019-05-29 12:26:43,617: DEBUG/MainProcess] beat: Ticking with max interval->5.00 seconds
[2019-05-29 12:26:43,617: DEBUG/MainProcess] beat: Acquiring lock...

The config.py file I use:

# -*- coding: utf-8 -*-
import os


broker_url = os.environ.get('BROKER_URL', 'redis://')

celerybeat_scheduler = 'redbeat.RedBeatScheduler'
beat_max_loop_interval = 5
redbeat_lock_timeout = 10
task_send_sent_event = True

Then use Ctrl+Z to suspend this process, to simulate a network outage or a Python GC stop-the-world pause. The output looks like this:

[1]  + 92360 suspended  celery beat -S redbeat.RedBeatScheduler --config config -l debug

Wait until the lock in Redis expires:

127.0.0.1:6379> TTL redbeat::lock
(integer) 8
127.0.0.1:6379> TTL redbeat::lock
(integer) 7
127.0.0.1:6379> TTL redbeat::lock
(integer) -2

Then start another redbeat instance:

➜ [redbeat] redbeat_test celery beat -S redbeat.RedBeatScheduler --config config -l debug --pidfile="/tmp/celerybeat6.pid"
celery beat v4.3.0 (rhubarb) is starting.
__    -    ... __   -        _
LocalTime -> 2019-05-29 12:28:59
Configuration ->
    . broker -> redis://localhost:6379//
    . loader -> celery.loaders.default.Loader
    . scheduler -> redbeat.schedulers.RedBeatScheduler
       . redis -> redis:///
       . lock -> `redbeat::lock` 10.00 seconds (10s)
    . logfile -> [stderr]@%DEBUG
    . maxinterval -> 5.00 seconds (5s)
[2019-05-29 12:28:59,121: DEBUG/MainProcess] Setting default socket timeout to 30
[2019-05-29 12:28:59,121: INFO/MainProcess] beat: Starting...
[2019-05-29 12:28:59,126: DEBUG/MainProcess] Stored entry: <RedBeatSchedulerEntry: celery.backend_cleanup celery.backend_cleanup() <crontab: 0 4 * * * (m/h/d/dM/MY)>
[2019-05-29 12:28:59,126: DEBUG/MainProcess] beat: Ticking with max interval->5.00 seconds
[2019-05-29 12:28:59,126: DEBUG/MainProcess] beat: Acquiring lock...
[2019-05-29 12:28:59,127: DEBUG/MainProcess] beat: Extending lock...

Then use fg %1 to resume the first suspended redbeat instance:

➜ [redbeat] redbeat_test fg %1
[1]  - 92360 continued  celery beat -S redbeat.RedBeatScheduler --config config -l debug
[2019-05-29 12:29:53,169: DEBUG/MainProcess] beat: Synchronizing schedule...
[2019-05-29 12:29:53,170: DEBUG/MainProcess] beat: Extending lock...
[2019-05-29 12:29:53,170: DEBUG/MainProcess] Selecting tasks
[2019-05-29 12:29:53,170: INFO/MainProcess] Loading 0 tasks
[2019-05-29 12:29:53,170: DEBUG/MainProcess] beat: Waking up in 5.00 seconds.
[2019-05-29 12:29:58,172: DEBUG/MainProcess] beat: Extending lock...
[2019-05-29 12:29:58,172: DEBUG/MainProcess] Selecting tasks
[2019-05-29 12:29:58,173: INFO/MainProcess] Loading 1 tasks

So we have two instances running now. Each job will be put into the queue twice and will run twice.

@laixintao (Contributor, Author) commented May 29, 2019

I have tried using redis-py's lock.extend ( https://github.com/andymccurdy/redis-py/blob/b51bfd818ce36cc3ae8591b54c988fbb16eb336d/redis/lock.py#L235 ) to replace pexpire. lock.extend checks whether the caller is still the owner; if the check fails (meaning it no longer holds the lock), an exception is raised. Thus only one instance can work.

But their additional_time ( https://github.com/andymccurdy/redis-py/blob/b51bfd818ce36cc3ae8591b54c988fbb16eb336d/redis/lock.py#L249 ) behaves differently from what we need: it is added on top of the remaining TTL, whereas we want to reset the TTL to a fixed value.

I guess the best way to do this is to register a Lua script, just like the redis-py lock does: first check the lock's owner, and only if it matches, extend the lock.

@sibson what do you think?
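
For reference, a minimal sketch of what replacing pexpire with redis-py's lock.extend could look like (the key name and timeout come from the config above; the rest is illustrative, not redbeat's actual code):

import redis
from redis.exceptions import LockError

r = redis.Redis()

lock = r.lock('redbeat::lock', timeout=10)
if lock.acquire(blocking=False):
    try:
        # Unlike a bare PEXPIRE, extend() first verifies that the stored
        # token is still ours, so an instance that lost the lock fails
        # loudly instead of silently "re-owning" it.
        lock.extend(10)
    except LockError:
        # Another beat instance took over while we were away.
        pass

The caveat above still applies: extend(10) adds 10 seconds on top of whatever TTL is left rather than resetting it to 10 seconds.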

@laixintao (Contributor, Author) commented May 29, 2019

The extend Lua script:

    LUA_EXTEND_SCRIPT = """
        local token = redis.call('get', KEYS[1])
        if not token or token ~= ARGV[1] then
            return 0
        end
        local expiration = redis.call('pttl', KEYS[1])
        if not expiration then
            expiration = 0
        end
        if expiration < 0 then
            return 0
        end
        redis.call('pexpire', KEYS[1], ARGV[2])
        return 1
    """

@sibson (Owner) commented May 29, 2019

Thanks for debugging this and providing a great summary. My first instinct would be to use redis-py's implementation, to avoid having more code to maintain, but I'm not sure whether that opens us up to a situation where the lock extends far, far into the future. @laixintao you have a good understanding of this. What do you think is the best option?

@laixintao (Contributor, Author) commented May 29, 2019

I think the best solution would be for redis-py's lock.extend to take a new timeout instead of adding to the remaining one. But an issue about this has been open for 4 years, and it seems they won't support the feature: andymccurdy/redis-py#629.

So the choices left to us would be:

  1. Submit a PR to redis-py (best choice)
  2. Monkey patch lock.py
  3. Monkey patch LUA_EXTEND_SCRIPT
  4. Calculate additional_timeout = time since last extend
  5. Set timeout to a small value

I will try 1, but it depends on whether redis-py is willing to merge it. 4 is not a good option because that elapsed time won't always be accurate. 5 is not possible: if the tick interval is smaller than the timeout, the lock extends far, far into the future; if the tick interval is bigger than the timeout, redbeat's lock will expire after the first tick.

I think we can try 3, monkey patching LUA_EXTEND_SCRIPT (a rough sketch follows below). Let me know if you figure out something better.
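
Roughly, option 3 would look like this (attribute names as in redis-py 3.x's redis/lock.py; this is only a sketch under that assumption, not the actual patch):

from redis.lock import Lock

# LUA_EXTEND_SCRIPT is the ownership-checking script from the earlier
# comment. Swapping it in makes ARGV[2] the new TTL instead of an
# amount added to the remaining one.
Lock.LUA_EXTEND_SCRIPT = LUA_EXTEND_SCRIPT

# Clear the cached Script object so the patched source is registered
# the next time a Lock is created.
Lock.lua_extend = None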

laixintao added a commit to laixintao/redbeat that referenced this issue May 29, 2019

Bugfix: use lock.extend replace pexpire
monkey patch lua LUA_EXTEND_TO_SCRIPT

close sibson#127

sibson closed this in #128 on Jul 29, 2019
