duplicate scheduling when running on multiple hosts #332

Open · rvyas opened this issue Mar 8, 2021 · 23 comments
@rvyas

rvyas commented Mar 8, 2021

This gem works fine for the general use case. However, we have some jobs that take a long time (20-25 min), which increases queue latency.

In that scenario, jobs get scheduled twice, which causes other issues in the app. There is a section in the readme that warns about this, but is there a way I can minimize duplication, like tweaking certain parameters or limiting certain things?

Or can this issue be fixed in the gem itself?

Current config:

  • 4 hosts
  • 4 processes (1 per host)
  • 12 threads each process
  • 5 queues
  • 6 recurring jobs using cron at different times.
  • There can be >100 jobs scheduled in a queue at the same time.
@bpo

bpo commented Mar 8, 2021

@rvyas you might want to take a look at sidekiq-uniq

@rvyas
Author

rvyas commented Mar 8, 2021

Thanks @bpo. I will try this gem.

I was trying to do the same in the worker by checking other jobs in the queue, etc., but there is some delay before those jobs show up in the Sidekiq APIs.

@bpo

bpo commented Mar 8, 2021

@rvyas I don't use that one myself, but I know others have had similar issues. Sidekiq Enterprise has built-in support for unique jobs, which would probably be your best choice if that's an option for you.
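For reference, Sidekiq Enterprise's unique jobs are declared per worker roughly like this (a sketch based on its public docs, not anything this gem provides; verify the option and initializer against the Enterprise wiki for your version):

# In an initializer, per the Sidekiq Enterprise wiki:
#   Sidekiq::Enterprise.unique!

class LongRunningJob
  include Sidekiq::Worker
  # Drop duplicate pushes of this job (same class + args) for up to
  # 30 minutes, or until the first copy finishes executing.
  sidekiq_options unique_for: 30.minutes
end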

@marcelolx
Member

@rvyas Did you solve your problem?

@rvyas
Author

rvyas commented Jun 14, 2021

No, it did not solve the problem. Even with sidekiq-uniq I see the same duplicate scheduling, although the rate has dropped.
Maybe I need to leverage the built-in support. I will also give sidekiq-cron a shot, which seems to have a similar feature to this gem.

@mgoggin

mgoggin commented Jul 13, 2021

@rvyas Were you able to solve your issue with sidekiq-cron?

@marcelolx
Member

@rvyas @mgoggin I haven't taken a look at this gem, https://github.com/mhenrixon/sidekiq-unique-jobs, but maybe it can help.
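For reference, sidekiq-unique-jobs is also configured per worker; a minimal sketch (the lock and conflict option names come from that gem's README and are worth double-checking against the version you install):

class LongRunningJob
  include Sidekiq::Worker
  # :until_executed holds the lock from push until the job finishes, so a
  # second identical push is ignored while the first is queued or running.
  sidekiq_options lock: :until_executed, on_conflict: :reject
end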

@stale

stale bot commented Oct 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale The issue or PR has been inactive label Oct 2, 2021
@marcelolx marcelolx removed the stale The issue or PR has been inactive label Oct 2, 2021
@stale

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale The issue or PR has been inactive label Jan 9, 2022
@marcelolx marcelolx removed the stale The issue or PR has been inactive label Jan 9, 2022
@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale The issue or PR has been inactive label Apr 16, 2022
@marcelolx marcelolx removed the stale The issue or PR has been inactive label Apr 16, 2022
@marcelolx
Member

OK, I think one solution to these problems would be to use a Redis-based lock instead of rufus-scheduler's default filelock:

https://github.com/jmettraux/rufus-scheduler#advanced-lock-schemes

This would allow us to use something like https://github.com/kenn/redis-mutex to create a single lock shared by all schedulers running on different hosts, which would ensure that only one scheduler is able to schedule a job at a given time.

I'll investigate more and run some tests; this may land as an experimental feature in v4.1.
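A minimal sketch of that idea (not code from this gem; the RedisMutex option names are taken from redis-mutex's README, and CrossHostSchedulerLock is an illustrative name):

require "redis-mutex"

# redis-mutex needs its Redis connection configured first, e.g.:
#   RedisClassy.redis = Redis.new

class CrossHostSchedulerLock
  def initialize
    # block: 0 makes #lock non-blocking; expire releases the lock if the
    # winning host dies without calling #unlock.
    @mutex = RedisMutex.new(:sidekiq_scheduler, block: 0, expire: 300)
  end

  # rufus-scheduler calls #lock on startup; returning false makes this
  # scheduler instance skip scheduling entirely.
  def lock
    @mutex.lock
  end

  def unlock
    @mutex.unlock
  end
end

# Rufus::Scheduler.new(scheduler_lock: CrossHostSchedulerLock.new)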

@marcelolx marcelolx added this to the v4.1 milestone Apr 28, 2022
@bpo

bpo commented May 1, 2022

@marcelolx this seems like a tricky thing for sidekiq-scheduler to handle well, as it's still always possible for the app to enqueue jobs outside of the scheduler.

@marcelolx
Member

marcelolx commented May 2, 2022

@bpo Could you give an example?

PS: The only thing we want is to avoid sidekiq-scheduler enqueueing duplicate jobs when running on multiple hosts.

@bpo

bpo commented May 2, 2022

I think there may be 3 issues here:

A) Sidekiq-scheduler can enqueue a job when it's already running
B) Sidekiq can enqueue a job when it's already running
C) Sidekiq can start running a job when it's already started

The initial description of the issue here is all 3 of those: The overall system has performance problems because two copies of a job are running at once (C), caused by Sidekiq enqueuing two copies of a job (B), and sidekiq-scheduler handled the actual enqueueing (A).

A is clearly a subset of B, which makes me think that a more general plugin to Sidekiq might be the right way to solve the whole problem rather than adding the complexity within sidekiq-scheduler.

The only thing we want is to avoid sidekiq-scheduler enqueueing duplicate jobs when running on multiple hosts.

I think this means you are saying "we only want to solve A". I think you're right that A is solvable within sidekiq-scheduler. I wanted to raise a caution about added complexity vs value (since B/C will remain unaddressed) but it sounds like you've already thought that through.

@marcelolx
Member

Got it @bpo, that makes sense! I agree that adding too much complexity is not a path we want to follow, and yeah, what I am looking to solve is A: just avoiding sidekiq-scheduler enqueueing the same job when running on multiple hosts.

@eldemcan
Contributor

I would like to share one of our solutions to this problem (we ended up with a different solution because we have a lot of dynamically scheduled tasks):

require "singleton"
require "socket"
require "securerandom"

class RedisLock
  include Singleton

  LOCK_NAME = "rufus:trigger"
  INSTANCE_IDENTIFIER = Socket.gethostname || SecureRandom.hex(10)

  # Called by rufus-scheduler at startup. Returns true when this host owns
  # the lock, either because it just acquired it or already held it.
  def lock
    Sidekiq.redis do |r|
      # SET NX only succeeds when the key is absent; EX expires the lock so
      # a crashed host cannot hold on to it forever.
      res = r.set(LOCK_NAME, INSTANCE_IDENTIFIER, nx: true, ex: 5.minutes.to_i)

      res == false ? r.get(LOCK_NAME) == INSTANCE_IDENTIFIER : res
    end
  end

  def unlock
    true
  end

  def remove_instance_from_queue
    removed = Sidekiq.redis { |r| r.del(LOCK_NAME) }
    Honeycomb.add_field("scheduler_removed", removed == 1)
  end
end


# Monkey patch SidekiqScheduler::Scheduler since we don't have a way to
# inject scheduler_lock
module SidekiqScheduler
  class Scheduler
    def initialize(options = {})
      self.enabled = options[:enabled]
      self.dynamic = options[:dynamic]
      self.dynamic_every = options[:dynamic_every]
      self.listened_queues_only = options[:listened_queues_only]
      self.rufus_scheduler_options = options[:rufus_scheduler_options] || {}

      # Force our Redis-backed lock into the options handed to rufus-scheduler
      self.rufus_scheduler_options =
        rufus_scheduler_options.merge({ scheduler_lock: RedisLock.instance })
    end
  end
end

This will make sure only one host can schedule. Each host also needs to run a thread that checks whether rufus:trigger is still set in Redis; if not, it can declare itself the scheduler host if another host hasn't already done so (a sketch of that thread is below). Hope this helps.
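A minimal sketch of such a watchdog thread (the 60-second interval and the TTL refresh are assumptions layered on top of the snippet above, not part of it):

Thread.new do
  loop do
    if RedisLock.instance.lock
      # We hold the lock (or just took it over after the old key expired);
      # push the expiry forward so it only lapses if this host dies.
      Sidekiq.redis { |r| r.expire(RedisLock::LOCK_NAME, 5.minutes.to_i) }
    end
    sleep 60
  end
end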

@marcelolx
Member

@eldemcan You don't have to patch sidekiq-scheduler, it is just a matter of passing the scheduler_lock to the rufus_scheduler_options, no?

One question: what your team wants is a single scheduler running, right?

BTW, what happens if your server crashes and remove_instance_from_queue is not called to remove the lock? In the next deployment no server will be able to set up the scheduler, right?

Also, I suggest looking at how Square's https://github.com/square/sqeduler implemented its locking strategy using Redis, which may be helpful. (My plan is to introduce opt-in support for that at some point, but first we want to solve #396.)

@eldemcan
Contributor

@eldemcan You don't have to patch sidekiq-scheduler, it is just a matter of passing the scheduler_lock to the rufus_scheduler_options, no?

How can I pass options to the rufus_scheduler instance inside SidekiqScheduler::Scheduler? SidekiqScheduler::Scheduler seems to get its options from the YAML file. Let me know if I am overlooking something.

One question, what your team wants is a single scheduler running, right?

We want only one instance to be able to schedule at a time, but if that instance goes away, another instance should be able to take over that role.

BTW, what happens if your server crashes and remove_instance_from_queue is not called to remove the lock? In the next deployment no server will be able to set up the scheduler, right?

The key will expire in 5 min (this could be shorter), so if remove_instance_from_queue is not called, Redis will remove the key. We could also create a thread that keeps checking whether the key is set; if not, the instance will set the key, declaring itself the scheduler among the other hosts.

Also, I suggest looking at how Square's https://github.com/square/sqeduler implemented its locking strategy using Redis, which may be helpful. (My plan is to introduce opt-in support for that at some point, but first we want to solve #396.)

Thank you I will take a look into this

@marcelolx
Member

How can I pass options to the rufus_scheduler instance inside SidekiqScheduler::Scheduler? SidekiqScheduler::Scheduler seems to get its options from the YAML file. Let me know if I am overlooking something.

Doesn't this work? ⬇️

rufus_scheduler_options:
  scheduler_lock: <%= RedisLock.instance %>

We want only one instance to be able to schedule at a time, but if that instance goes away, another instance should be able to take over that role.

Then I think that you want to pass along a trigger_lock strategy, no? https://github.com/jmettraux/rufus-scheduler#trigger_lock
From the description of the scheduler_lock

The scheduler lock is an object that responds to #lock and #unlock. The scheduler calls #lock when starting up. If the answer is false, the scheduler stops its initialization work and won't schedule anything.

while the trigger_lock

The trigger lock is an object that responds to #lock. The scheduler calls that method on the job lock right before triggering any job. If the answer is false, the trigger doesn't happen, the job is not done (at least not in this scheduler).

So I think that what you're looking for is a trigger lock and not a scheduler lock....
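A minimal sketch of a trigger lock with failover (illustrative only; RedisTriggerLock, the key name, and the 60-second TTL are assumptions, not anything this gem ships):

require "socket"

class RedisTriggerLock
  LOCK_NAME = "sidekiq_scheduler:trigger_lock"

  # Called by rufus-scheduler before every trigger. Take the lock if it is
  # free, or confirm we still own it; a host that loses the lock simply
  # stops triggering, and another host wins once the key expires.
  def lock
    Sidekiq.redis do |r|
      r.set(LOCK_NAME, identity, nx: true, ex: 60) ||
        r.get(LOCK_NAME) == identity
    end
  end

  private

  def identity
    @identity ||= Socket.gethostname
  end
end

# Rufus::Scheduler.new(trigger_lock: RedisTriggerLock.new)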

The key will expire in 5 min (this could be shorter), so if remove_instance_from_queue is not called, Redis will remove the key. We could also create a thread that keeps checking whether the key is set; if not, the instance will set the key, declaring itself the scheduler among the other hosts.

🤦‍♂️ Didn't see the expire!

@eldemcan
Contributor

@marcelolx thanks for engagement,

Doesn't this work? ⬇️

It doesn't work because the YAML is loaded before the class itself, so it throws an undefined-constant error; maybe there is a trick to overcome this problem.

Then I think that you want to pass along a trigger_lock strategy, no?

We tried trigger_lock as well, but because of jmettraux/rufus-scheduler#130 jobs were duplicated (we believe).

@marcelolx
Member

Thanks for the feedback @eldemcan, I'll keep an eye on that issue.

And once I work on supporting a redis lock strategy I'll see if we can allow users to set a trigger_lock/scheduler_lock without having to patch sidekiq-scheduler!

@mrjonesbot

I recently experienced this same issue, although things had been working correctly despite having multiple hosts.

This occurred after I adjusted my sidekiq.yml file, where I removed a scheduled job, then changed the scheduled times of 3 others.

Overall, I deployed a wide array of changes to my scheduled jobs, so I'm wondering if this has anything to do with the scheduler queuing duplicates?

Perhaps each change + deploy to the scheduler populates additional scheduled jobs without clearing out the old ones?

@bibendi

bibendi commented Oct 9, 2022

And once I work on supporting a redis lock strategy

I've made one for myself. You may be interested :)
bibendi/schked#31
