Using Que in a multi-node environment #10

odarriba · 2018-11-22T13:32:12Z

We are trying to use Que to schedule jobs in a multi-node environment, allowing nodes to connect to each other and share mnesia information.

What I'm doing:

Start two iex consoles, one per node:

$ iex --sname node1 -S mix
$ iex --sname node2 -S mix

Connect the nodes (from the node1 console):

iex> Node.connect(:node2@local_hostname)

Check that they are connected by calling Node.list() on both sides.
Spawn 100 jobs of my TestWorker (which logs a message and waits one second):

for number <- 1..100 do
  Que.add(TestQue.TestWorker, number)
end

It works, but all jobs run on only one of the nodes (let's say node1), while no job is executed on the other (node2).
Also, if I run the same for loop on node2, all jobs are still executed on node1.

Is there anything we can do to:

Run jobs on any node
Ensure that if one node goes down, the tasks in :mnesia remain available on the rest of the cluster

Thanks in advance!

PS: I have also tried running the Mnesia disk-persistence setup, with this result:

iex(node1@patata)2> nodes = [node() | Node.list]
[:node1@patata, :node2@patata]
iex(node1@patata)3> Que.Persistence.Mnesia.setup!(nodes)
[info] Application mnesia exited: :stopped
** (Memento.MnesiaException) Mnesia operation failed
   :not_active
   Mnesia Error: {:not_active, Que.Persistence.Mnesia.DB.Jobs, :node2@patata}
    (memento) lib/memento/table/table.ex:274: Memento.Table.handle_for_bang!/1


sheharyarn · 2018-11-26T03:01:47Z

Hey @odarriba! This is an excellent question and something that I have previously deliberated a lot. In the end, I decided to not include this in que for multiple reasons:

Jobs are typically run only on a single machine, and in most of the cases, it's the main application server. I wanted to make Que super simple to use.
Not everyone wants to handle failures in the same way. And while it's common to retry them, Que's default mode of operation is to not do anything.
Distribution is hard and no single solution applies to all scenarios. I didn't want to lock que to a specific node distribution and job processing model.
As with the third point, not all consensus algorithms are created equal, and they do not all apply to the same situations.

How/when/if an application should be replicated to another node is up to the developer. See the Distributed Applications guide on the Erlang website. All that said, for a simple node failover model, this should be pretty straightforward. You can take a look at how singleton implements this (though this means you'll have to start que in runtime: false mode and manually set up the distribution strategy).

sheharyarn · 2018-11-26T03:10:31Z

Thinking about this more (and seeing how quickly your post got many 👍), I might have to reconsider my position on this.

I'm open to the idea of integrating this into Que if we're able to come up with sane defaults that are easy to change and do not negatively affect the developer experience. PRs/Issues are welcome if you have any ideas on how to approach this.

noizu · 2018-12-01T11:26:23Z

I might suggest adding a node parameter to the que table so that queue entries are associated with the node they were created on. My cluster handles about half a billion device reports per day and spans multiple nodes. There are certain tasks I'd like to queue up but they'd overwhelm any single node responsible for handling them.

noizu · 2018-12-01T11:27:24Z

Or possibly use Mnesia's local table type, although I'm not entirely sure about all of the details of that table type.

noizu · 2018-12-02T04:02:31Z

@odarriba, because I needed some additional functionality, I made a fork this morning.

https://github.com/noizu/que

I have added the ability to set a priority level on jobs such that all :pri0 jobs will be processed before any :pri1, :pri2, or :pri3 ones (in that order).

Additionally, the schema has been updated to include the current node, so that on restart each node requeues only the jobs that were queued on it.

I have not yet added load-balancing logic, since I don't need it for my immediate needs, but you can simulate it well enough using something like:

cluster = [:"node1@domain", :"node2@domain", ...]
for number <- 1..100 do
  Que.remote_add(Enum.random(cluster), TestQue.TestWorker, number)
end

Or, if using the original version of the code:

timeout = 50_000
cluster = [:"node1@domain", :"node2@domain", ...]
for number <- 1..100 do
  :rpc.call(Enum.random(cluster), Que, :add, [TestQue.TestWorker, number], timeout)
end

Use :rpc.cast instead of :rpc.call if you don't care about confirmation.
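
If random selection spreads jobs too unevenly, round-robin assignment is one alternative. The following is only a minimal sketch, not part of Que or of the fork: the RoundRobinDispatch module and its function names are hypothetical, and it assumes Que is already running on every node in the list.

```elixir
defmodule RoundRobinDispatch do
  # Hypothetical helper for illustration; not part of Que.

  # Pairs each job argument with a node, cycling through the node list in order.
  def assign(args, nodes) when is_list(nodes) and nodes != [] do
    Enum.zip(args, Stream.cycle(nodes))
  end

  # Enqueues each job on its assigned node without waiting for a reply.
  # Assumes Que is started on every node in `nodes`.
  def dispatch(worker, args, nodes) do
    args
    |> assign(nodes)
    |> Enum.each(fn {arg, target} ->
      :rpc.cast(target, Que, :add, [worker, arg])
    end)
  end
end
```

For example, RoundRobinDispatch.dispatch(TestQue.TestWorker, 1..100, [node() | Node.list()]) would enqueue the odd-numbered jobs on the local node and the even-numbered ones on the other node of a two-node cluster.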

@sheharyarn you have written some fantastically readable code here ^_^ keep up the good work. If you'd like to incorporate the priority logic chat with me sometime. I'll continue to refine it either way on my fork since I need to very efficiently process tens of thousands of jobs a minute with prioritization.

sheharyarn · 2019-01-18T05:54:11Z

Thank you for the suggestions @noizu (and sorry for the late reply!). I took a quick look at your fork and it looks very interesting. Over the next few weeks, I'll try to come up with a plan for adding distributed job execution support, and will post back here with an update.

noizu · 2019-01-18T06:27:56Z

My fork has been working pretty well and is handling a few hundred thousand tasks per day. But I did some things for expediency that you're not going to want upstream. There also seems to be a bug with the memento dependency: it doesn't correctly compile first when including Que in a project.


sheharyarn added the Type: Question label Nov 26, 2018

sheharyarn added the Type: Enhancement and Status: In Progress labels Jan 18, 2019
