
This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Server clustering configuration possible? #1680

Closed
dwmcc opened this issue Jan 30, 2021 · 13 comments

Comments

@dwmcc

dwmcc commented Jan 30, 2021

Is your feature request related to a problem? Please describe.

Reading through the reSolve documentation, I'm curious about whether clustering is natively supported: specifically, the ability for multiple reSolve servers to run in a cluster, where a given aggregate's state is accessible to every server across the cluster, and where events can be emitted and processed by any member of the cluster.

Looking through the Application Configuration documentation, the only potentially relevant information I've found is in the eventBroker section, where you can configure publisher and consumer settings. If I want to spin up multiple reSolve servers, should I set both of these configuration options on every server?

I'd also like to know whether configuring the eventBroker will allow multiple reSolve servers to cluster in the way I've described.

An example use-case for this is deploying a reSolve application into a kubernetes cluster. The application will serve an API which will dispatch commands to aggregates.
You will likely want multiple instances (pods) of the server running, for availability's sake, but these instances should be aware of each other. Otherwise, you have no way to guarantee a given aggregate is unique across the cluster of servers, and you risk aggregate state getting out of sync between servers, etc.

Am I misunderstanding the design principles of the reSolve framework?
Any clarity is appreciated here. Thanks for your input.

@dwmcc changed the title from "Documentation for eventBroker and/or clustering configuration" to "Server clustering configuration possible?" on Jan 30, 2021
@dwmcc
Author

dwmcc commented Feb 1, 2021

In doing some more research, I've come across this StackOverflow answer (by @superroma, I believe), which says you can run multiple reSolve servers from the same event store:

[...] you can have the same code in all instances and just route command and queries to the different instance pools.

I'm very glad to hear this! But I still don't fully understand the following:

  • When you're running multiple reSolve servers (with the same application code), do you need to configure the eventBroker so that all servers are connected to the same event broker?
    If so, what configuration flags should be set on the servers (launchBroker/publisherAddress/consumerAddress)? If not, what is the purpose of the eventBroker? I can't seem to find detailed documentation on the broker configuration.
  • How does reSolve server ensure that an aggregate on one server does not get out of sync with an aggregate from another server, if commands/events are applied in quick succession? Wouldn't each aggregate have to be globally unique across the cluster of running servers?

Thanks for any help you can provide! I've worked with ES/CQRS systems in Elixir and Scala and both have had a clear way to cluster servers, so I'm looking forward to understanding the same about reSolve.

@superroma
Collaborator

@dwmcc, sorry for the delay. I would like to answer this myself, but I'm on vacation this week; I will answer in detail next week.

As a very short answer: several options are possible, some of which we have not tested yet, since we decided to focus on the serverless approach.

@dwmcc
Author

dwmcc commented Feb 5, 2021

Hi @superroma - this is very much appreciated. Enjoy your vacation and I look forward to discussing next week!

Happy to help contribute to the documentation as I get through the process of running multiple hosts.

@superroma
Collaborator

superroma commented Feb 8, 2021

First let me briefly describe the state of the project.

There is "local" implementation based on expressjs, and "cloud" serverless implementation.
Local implementation was heavily tested in development scenario, and single-server scenario.
We did not have multiple-server/cluster projects so far, so those scenarios are kept in mind.

Here is what possible right now in the current version.

Processing Commands

How does reSolve server ensure that an aggregate on one server does not get out of sync with an aggregate from another server, if commands/events are applied in quick succession?

We use a serverless approach: nothing is stored in memory. When a command arrives, the aggregate state is calculated from the events in the event store. Then the command is processed, the event is saved, and the response is sent to the caller. We ensure aggregate consistency by using optimistic locking with aggregate versions, as described here: https://stackoverflow.com/a/54826095/318097

This works the same way in both the local and cloud implementations.
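
To make the optimistic locking concrete, here is a minimal sketch of how a concurrent append can be rejected. This is not reSolve's actual event store code; it assumes a PostgreSQL events table with a unique constraint on (aggregate_id, aggregate_version) and a node-postgres client passed in as client:

// Illustrative sketch only, not reSolve's actual event store code.
// Assumes an events table with a UNIQUE constraint on (aggregate_id, aggregate_version)
// and a node-postgres ("pg") client.
const saveEvent = async (client, aggregateId, expectedVersion, event) => {
  try {
    await client.query(
      'INSERT INTO events (aggregate_id, aggregate_version, type, payload) VALUES ($1, $2, $3, $4)',
      [aggregateId, expectedVersion + 1, event.type, JSON.stringify(event.payload)]
    )
  } catch (error) {
    // 23505 = unique_violation: another instance appended this version first,
    // so the command should be retried against a freshly rebuilt aggregate state.
    if (error.code === '23505') {
      throw new Error('Concurrency conflict: aggregate version already taken, retry the command')
    }
    throw error
  }
}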

Updating read models

After an event is saved to the event store, it should be applied to all read models.

The cloud implementation uses AWS Step Functions to collect and distribute events to read models. The local implementation uses a simple event broker based on ZeroMQ. I'll explain the local one in detail.

The eventBroker is the service that notifies other instances about new events. It should run on only one instance; all others connect to it.

It is very simple: the instance that stored an event notifies the eventBroker, which notifies all other instances. This means that in the default configuration, all instances will try to update their read models, and if they point to the same database, the first one will update it and the others will receive an "already updated" result.

This can be further optimized by having different configs for different instances.
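
As an illustration, a hypothetical two-kinds-of-instance setup could look roughly like the sketch below. Only the option names (launchBroker, publisherAddress, consumerAddress) come from this thread; the addresses and the exact config shape are assumptions and are not taken from any particular reSolve release:

// Hypothetical sketch of per-instance eventBroker configs in a multi-instance setup.
// Option names are the ones mentioned in this thread; values and shape are assumptions.

// The single instance that hosts the broker launches it and binds locally:
const brokerHostConfig = {
  eventBroker: {
    launchBroker: true,
    publisherAddress: 'tcp://0.0.0.0:5555',
    consumerAddress: 'tcp://0.0.0.0:5556'
  }
}

// Every other instance does not launch a broker and connects to the broker host:
const workerInstanceConfig = {
  eventBroker: {
    launchBroker: false,
    publisherAddress: 'tcp://broker-host:5555',
    consumerAddress: 'tcp://broker-host:5556'
  }
}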

Exactly-once read model update problem

We spent a lot of time trying to build a central "event bus" that could reliably update any kind of read model, so that a projection function could be a simple SQL statement or appending a line to a file, with events arriving exactly once. We did not succeed; it is too complicated and there are still cases where it fails.

So at the moment, a reliable read model implementation requires a "ledger": a store/table that knows which events were applied and in which order.

Unfortunately, this complicates the read model adapter; it has to be written in a certain way.

Another option is to make projection functions idempotent, i.e. able to handle multiple calls with the same event.
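
For example, a minimal sketch of an idempotent projection handler (the table, event, and field names are invented for illustration; the point is only that re-applying the same event is a no-op):

// Minimal sketch: applying the same event twice leaves the read model unchanged.
// Table, event, and field names are invented for illustration.
const usersProjection = {
  USER_REGISTERED: async (store, event) => {
    // If the row already exists, this event was applied before, so skip it.
    const existing = await store.findOne('Users', { id: event.aggregateId })
    if (existing != null) {
      return
    }
    await store.insert('Users', {
      id: event.aggregateId,
      name: event.payload.name
    })
  }
}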

Anyway, in a server-cluster scenario, multiple servers may attempt to apply the same event, but this is prevented by the adapter implementation: the first will win, and the second will receive an "already applied" error. There is no negotiation mechanism to decide which server will apply a given event.

Serving queries

If an instance has the specified read model, it just serves the query the same way as in single-process mode.

This is what is implemented in the current codebase.

The devs told me that they have not tested multi-instance use cases for quite a while, so there may be bugs. We'll test it ASAP and fix anything we find.

I can see issues with the current eventBus implementation in the cluster scenario: if that server dies, it should be started somewhere else. As a solution, we could abstract away this service and have adapters for several systems like Kafka or RabbitMQ.

We could provide an example/tutorial for a server cluster. Can you describe your scenario?

@dwmcc
Author

dwmcc commented Feb 8, 2021

Hi @superroma -- first off, thank you for your thorough response - it is much appreciated! Hope you had a restful vacation.

Please bear with me with regards to my reply - I have several specific questions about how clustering might function with this framework.

We ensure aggregate consistency by using optimistic locking with aggregate versions, as described here [...]

Perfect - you're resolving the aggregate consistency problem (in both multi-node and serverless scenarios) with a database constraint. 👍

The eventBroker is the service that notifies other instances about new events. It should run on only one instance; all others connect to it.

Can you describe your scenario?

What I'm considering for deployment is a kubernetes cluster (wherein each instance will be able to privately reach the other instances). I'd have one instance specifically configured as the broker, and a pool of instances that are connected to the broker. To my understanding this will work, assuming it is configured properly.

I do still have a few questions on the broker setup for my clustering scenario, specifically:

  • Can you elaborate a bit on what configurations should be set for this option, both on the instance that is running the broker, and for the instances that are not?
  • Should incoming API traffic be routed to all instances, including the one hosting the broker, or only the pool of instances connected to the broker? I assume this has to do with the upstream config.

[...] in the default configuration, all instances will try to update their read models, and if they point to the same database, the first one will update it and the others will receive an "already updated" result.

Perfect - as long as this is the case for the PostgreSQL adapter, I don't think there would be any trouble with this functionality when clustering.

So at the moment, a reliable read model implementation requires a "ledger": a store/table that knows which events were applied and in which order.

Is this built in to the PostgreSQL adapter for the read model? I'd assume so but wanted to confirm.
Another question I have on this topic -- in the documentation for initializing a read model, you define the table structure using the adapter's store.defineTable method.

Ostensibly, the table would only need to be defined and created once, and then once it exists we wouldn't need to run that statement again (outside of schema changes). I need to spend some time digging into the repository, but for my sanity -- does the read model adapter check the existence of the table, and only run the init code if it doesn't exist, or if the enumeration of fields has changed?

The docs mention:

Note that reSolve does not force you to use adapters [...] you can just work with that system in the code of the projection function and query resolver, without writing a new Read Model adapter

If I wanted to use an ORM for defining and updating tables in the read models, such as TypeORM, would you suggest I use it directly (keeping in mind the "all instances will try to update their read models" constraint)?

Final questions:

I can see issues with the current eventBus implementation in the cluster scenario: if that server dies, it should be started somewhere else. As a solution, we could abstract away this service and have adapters for several systems like Kafka or RabbitMQ.

To clarify, you mention "eventBus" in that quote -- is this another term for eventBroker in the reSolve framework?
This could certainly be a good solution -- in fact, in parsing through commits I did see references several years ago to a separate broker. Maybe it will make a return! 🙂

If I deploy reSolve as described above -- one instance running the broker and a pool of instances connected, assuming kubernetes has the job of keeping the broker online and restarting it if it's unhealthy, should this operate as intended? If the broker goes offline, I would assume any API requests would fail until it comes back online? Would we be guarded against an instance entering an invalid state such as when applying commands to an aggregate?

Thanks Roman! Thoroughly appreciate your input here.
After we discuss the items above, and while I'm testing out the cluster deployment scenario, I'd be happy to contribute to the documentation with regards to deployment and clarifications on what we discussed here.

@superroma
Collaborator

superroma commented Feb 8, 2021

Dylan, thank you for your interest in reSolve and possible contribution.

First of all, the devs are not sure that the event broker currently works with more than one server. We will test it and fix it soon; I will keep you informed.

Can you elaborate a bit on what configurations should be set for this option, both on the instance that is running the broker, and for the instances that are not?

Let me do that after our tests, because the implementation may change.

Should incoming API traffic be routed to all instances, including the one hosting the broker, or only the pool of instances connected to the broker?

The instance that hosts the broker actually runs two processes: a broker and a reSolve server. So yes, it can serve API calls.

Is this built in to the PostgreSQL adapter for the read model? I'd assume so but wanted to confirm.

Yes, all supplied adapters implement "ledger".

does the read model adapter check the existence of the table, and only run the init code if it doesn't exist, or if the enumeration of fields has changed?

Yes. It does not alter tables, though; instead you reset the read model, meaning delete everything, re-initialize the db structure, and rebuild from events.
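
Roughly speaking, the "init db structure" step is the read model's Init handler using store.defineTable, something like the sketch below (the table name and fields are invented for illustration):

// Sketch of a read model Init handler; table name and fields are invented.
const projection = {
  Init: async (store) => {
    await store.defineTable('Users', {
      indexes: { id: 'string' },
      fields: ['name', 'registeredAt']
    })
  }
}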

The docs mention:

Note that reSolve does not force you to use adapters [...] you can just work with that system in the code of the projection function and query resolver, without writing a new Read Model adapter

If I wanted to use an ORM for defining and updating tables in the read models, such as TypeORM, would you suggest I use it directly (keeping in mind the "all instances will try to update their read models" constraint)?

This section is likely to be removed, since the "central ledger" approach does not seem to work reliably.

You can still work without an adapter: just do anything in the init and projection functions, but then they should be idempotent, because reSolve will retry events.

At the moment it is not possible to get the underlying db connection from the adapter, so you cannot init TypeORM from it, but that would not be hard to provide.

I guess using an ORM would be an anti-pattern for a CQRS/ES app, because of the significant overhead (the code is serverless, i.e. stateless, remember?) and because ORMs assume normalization, while CQRS works best with denormalized read models, meaning you just store the answers to your queries. Perhaps this topic deserves its own article.
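
As a small illustration of "just store the answers to your queries": a denormalized read model can keep a precomputed row per answer, so the resolver is a single lookup. This is only a sketch with invented names, and it assumes an $inc-style update operator is available in the read model store:

// Sketch of a denormalized read model: the projection maintains the precomputed
// answer (a comment count per story) instead of normalizing and joining at query time.
// All names are invented for illustration.
const storyStatsProjection = {
  COMMENT_ADDED: async (store, event) => {
    await store.update(
      'StoryStats',
      { storyId: event.aggregateId },
      { $inc: { commentCount: 1 } }
    )
  }
}

const storyStatsResolvers = {
  stats: async (store, { storyId }) => store.findOne('StoryStats', { storyId })
}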

To clarify, you mention "eventBus" in that quote -- is this another term for eventBroker in the reSolve framework?

It's a typo; I meant event broker.

If I deploy reSolve as described above -- one instance running the broker and a pool of instances connected, assuming kubernetes has the job of keeping the broker online and restarting it if it's unhealthy, should this operate as intended? If the broker goes offline, I would assume any API requests would fail until it comes back online? Would we be guarded against an instance entering an invalid state such as when applying commands to an aggregate?

Yes, everything should work this way (after our tests of course). If the event broker dies, everything will keep working except read model updates. When the event broker starts again, it will continue where it left off; no events will be missed.

@dwmcc
Author

dwmcc commented Feb 8, 2021

we will test it and fix it soon; I will keep you informed

🎉 💯

Yes, all supplied adapters implement "ledger".

Let me do that after our tests, because the implementation may change.

👍

I guess using an ORM would be an anti-pattern for a CQRS/ES app, because of the significant overhead (the code is serverless, i.e. stateless, remember?)

Fair point - we should keep the projection logic slim since we do our state management in the aggregate. I was mainly curious about using an external library within the read models, or possibly even running raw SQL.

Yes, everything should work this way (after our tests of course).

Thank you for taking the time for such thorough responses @superroma -- excited to see if the event broker is able to function with multiple servers without too much work! In the meantime, let me know if I can assist.

@superroma
Collaborator

Hi Dylan,

JFYI, we have tested the multi-instance config, and, as I feared, it doesn't work correctly with the current version. We have fixed the problem, and the fix is likely to be included in the next release.

@dwmcc
Author

dwmcc commented Feb 25, 2021

@superroma fantastic news! Excited to check it out. Thanks for the update.

@superroma
Collaborator

superroma commented Mar 16, 2021

Hi Dylan,

We have released reSolve 0.28. In this release we removed the event broker process. Now all instances are registered in the event store and notify each other without explicit configuration.

So all instances working with the same event store work well together.

To demonstrate this without clusters and containers, we have added a "replica" config to the hacker news example:

case 'dev:replica': {
  const moduleAdmin = resolveModuleAdmin()
  const resolveConfig = merge(baseConfig, devReplicaConfig, moduleAdmin)
  await watch(resolveConfig)
  break
}

https://github.com/reimagined/resolve/blob/dev/examples/hacker-news/config.dev.replica.js

So you can run two instances of the hacker news app on the same machine: they will use the same event store, but have their own read models and listen on different ports.

If you want, we can prepare an example with app containers and a db container. What container orchestration system do you prefer?

@mikecann

If you want, we can prepare an example with app containers and a db container. What container orchestration system do you prefer?

@superroma that would be very helpful for understanding, I think. Maybe Docker Compose, because it's quite simple? I personally would love to see a serverless example, perhaps using CDK, though I'm not sure if that is out of the scope of what you were thinking.

To be honest, I think some good docs on how everything fits together would go a long way toward understanding it all. For example, I think Wolkenkit does a good job here: https://docs.wolkenkit.io/3.1.0/getting-started/understanding-wolkenkit/architecture/ (the Wolkenkit docs in general are really good and worth a look).

@superroma
Collaborator

Yes, we need a good architecture overview; we will work on it.

Regarding the docker and serverless examples: are you implying that the resolve-cloud deployment works in containers?
If so, it is not correct. A reSolve app deployed to the cloud is compiled into AWS Lambdas, Lambda@Edge, and many AWS infrastructure elements like app gateways, step functions, Aurora db instances, etc.

So here we are describing how to run the "local" Express-based reSolve version in a multi-instance cluster. I would not call it serverless. It is mostly stateless and works similarly to the cloud version, but it is not serverless.

We'll prepare an example using Docker Compose.

@mikecann

mikecann commented Apr 2, 2021

If so, it is not correct. A reSolve app deployed to the cloud is compiled into AWS Lambdas, Lambda@Edge, and many AWS infrastructure elements like app gateways, step functions, Aurora db instances, etc.

This I would be very interested in seeing, because it sounds the most scalable and lowest cost (pay only for what you use), but I understand if this is your secret sauce that you don't want to give away.

@reimagined reimagined locked and limited conversation to collaborators Apr 9, 2021
