
factor out membership layer #96

Merged
mijicd merged 5 commits into zio:master on Nov 25, 2019

Conversation

mschuwalow
Member

As described in https://github.com/zio/zio-keeper/milestone/1, it makes sense to have separate membership and transport layers.

This PR is basically a high-level draft of how these layers could look, quite heavily inspired by the current zio-keeper code. I do have more stuff ready, but I would like us to start by discussing the high-level API and design.

The transport layer should be rather self-explanatory. I would very much like us to go with UDP for the gossip messages, though. The user (who will be internal for this) should have control over which protocol to use. The implementation I have in mind is essentially lazily created, long-lived TCP connections when necessary and UDP otherwise. Listening should not distinguish between the two, I believe.
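
To make this a bit more concrete, here is a minimal sketch of what such a transport service could look like. Only sendBestEffort appears in the PR itself; sendReliably, bind and the Error alias are illustrative placeholders.

import zio.{ Chunk, ZIO }
import zio.stream.ZStream
import java.net.SocketAddress

object transport {

  type Error = Exception // placeholder for the module's own error type

  trait Service[R] {
    // fire-and-forget delivery, intended to be backed by UDP
    def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]

    // delivery over a lazily created, long-lived TCP connection
    def sendReliably(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]

    // listening does not distinguish between the two protocols;
    // datagrams and TCP frames surface on the same stream
    def bind(addr: SocketAddress): ZStream[R, Error, Chunk[Byte]]
  }
}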

On the membership side I believe we should drop reliable broadcast, as it is very hard to support and will have wonky semantics. Instead I propose that we allow the user to register messages that will spread as part of the underlying membership protocol's gossip messages. Another thing that came up while experimenting with the current API: I think it's very useful to support "wait for reply" at this level. I believe this can be done very easily by adapting the address to something like %nodeId-%optional conversation id-%maybe an index into the conversation, handled transparently for the user.
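
A rough sketch of the kind of enriched address and conversation id this could use (all of the names here, GossipAddress, ConversationId, ConversationIds, are hypothetical, just to illustrate the idea):

import zio.{ Ref, UIO }

final case class ConversationId(value: Long) extends AnyVal

// a node id optionally tagged with a conversation id and an index into
// the conversation, so replies can be routed back transparently
final case class GossipAddress[A](
  nodeId: A,
  conversation: Option[ConversationId],
  index: Option[Int]
)

// fresh conversation ids come from a per-node monotonically increasing
// counter; combined with the node address they are globally unique
final class ConversationIds private (counter: Ref[Long]) {
  val next: UIO[ConversationId] =
    counter.modify(n => (ConversationId(n + 1), n + 1))
}

object ConversationIds {
  val make: UIO[ConversationIds] =
    Ref.make(0L).map(new ConversationIds(_))
}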

I believe SWIM is a good fit for our membership protocol and would also be a good basis for spreading our CRDT updates. I am thinking of having CRDTs be broadcast using registerBroadcast, implemented by having a clock for each broadcast (including membership protocol gossiping). Whenever a clock is ready to fire, it checks for other clocks that are ready to fire and groups them into one message, up to the UDP limits. This design is taken from Consul's memberlist and I think it is a very elegant solution.
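
A small sketch of the grouping step, purely to illustrate the idea (names are hypothetical and framing/length-prefixing of the individual payloads is omitted):

import zio.Chunk

object piggyback {

  final case class PendingBroadcast(payload: Chunk[Byte])

  // when one broadcast clock fires, drain the other ready broadcasts and pack
  // as many as fit into a single UDP payload; the rest wait for the next round
  def packPiggybacked(
    ready: List[PendingBroadcast],
    maxUdpPayload: Int
  ): (Chunk[Byte], List[PendingBroadcast]) =
    ready.foldLeft((Chunk.empty: Chunk[Byte], List.empty[PendingBroadcast])) {
      case ((packed, overflow), b) =>
        if (packed.length + b.payload.length <= maxUdpPayload)
          (packed ++ b.payload, overflow)
        else
          (packed, overflow :+ b)
    }
}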

There are a number of open questions here, marked with TODO, that we should discuss. But all in all I think this forms a useful basis we can collaborate and split work on.

/cc @mijicd

@pshemass
Contributor

How do you know that reliable broadcast is very hard to support? I'm running an 80-node Hazelcast cluster in production which pretty much provides reliable broadcast, and I haven't had many problems with it for the last 3 years... The Apple App Store probably has 300 or more Hazelcast nodes...

The other thing is: do we want to replicate across the entire cluster? Why don't we split into partitions and then order updates within a partition? Having a partition leader and a backup would make things much simpler.

Consul's SWIM is a very elegant solution but has a different use case. They are trying to detect whether a node went down in the cluster, but as far as I remember they use a strongly consistent KV store to keep the cluster metadata, the same as etcd in Kubernetes.

My point is: is using the same mechanism for cluster failure detection and for propagating data structures a good idea in the first place?

Review thread on membership/src/main/scala/zio/membership/NodeState.scala (outdated, resolved)

import zio.Chunk

final case class Message(
Contributor

I think Message should have a correlationId so that we can match all the messages in a conversation. WDYT?

Member Author

Do you mean for tracing? A combination of both should be usable for this. The idea is to have the node address + a monotonically increasing counter, so that every conversation has a unique id.

Contributor

Primarily I would like to have this for tracing. A monotonically increasing counter might be tricky if we provide a client library, because we need to generate the request id on the client. This is important when someone needs to debug what happened with their request.

* to the same port number on bind.
*/
trait Service[R] {
def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]
Contributor

Why do you think we should make this distinction here? Why not have send/bind only?

Member Author

Performance, mostly. I believe as much as possible should be done over UDP for performance reasons, but user messages or large messages, for example, might be better sent over TCP.

Also, Consul for example uses both, in case a node is not reachable over UDP but is over TCP.

I'm not sure how exactly we want to do this, but I believe we should make both available.
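
For illustration, one possible policy, written against two hypothetical send operations rather than any concrete API: small messages go best-effort over UDP, and large messages, or ones whose UDP send fails, take the reliable TCP path, similar to Consul's fallback.

import zio.{ Chunk, ZIO }
import java.net.SocketAddress

object sendPolicy {

  def send[R, E](
    bestEffort: (SocketAddress, Chunk[Byte]) => ZIO[R, E, Unit],
    reliable: (SocketAddress, Chunk[Byte]) => ZIO[R, E, Unit],
    udpLimit: Int
  )(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, E, Unit] =
    if (msg.length <= udpLimit)
      // fall back to TCP if the UDP send itself fails
      bestEffort(to, msg).orElse(reliable(to, msg))
    else
      // too large for a single datagram, go straight to TCP
      reliable(to, msg)
}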

Contributor
@pshemass pshemass Nov 13, 2019

Good reasons. The question is more about whether this belongs in Transport. I thought Transport is just UDP or TCP. In your Consul example I would use both, not one combined.

I was thinking about something similar to this https://haskell-distributed.github.io/tutorials/1ch.html#sending-messages

import Network.Transport.TCP (createTransport, defaultTCPParameters)
....

main = do
  Right t <- createTransport "127.0.0.1" "10501" defaultTCPParameters
...

Member Author

One issue I was thinking about is that some algorithms specifically need a reliable transport layer for failure detection.
This gets us into OOP land, but what do you think about having Transport <:< UnreliableTransport?
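
A minimal sketch of that hierarchy (hypothetical names, error type left abstract): gossip-only algorithms would ask for UnreliableTransport, a failure detector that needs guaranteed delivery would ask for Transport, and any Transport can still be used where an UnreliableTransport is expected.

import zio.{ Chunk, ZIO }
import java.net.SocketAddress

// best-effort delivery only (UDP-like)
trait UnreliableTransport[R, E] {
  def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, E, Unit]
}

// adds reliable delivery (TCP-like); Transport[R, E] <:< UnreliableTransport[R, E]
trait Transport[R, E] extends UnreliableTransport[R, E] {
  def sendReliably(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, E, Unit]
}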

Contributor

You are optimizing for something that is not implemented or even planned. It would be better to keep an MVP approach and deliver the minimal thing.

@mschuwalow
Member Author

> How do you know that reliable broadcast is very hard to support? I'm running an 80-node Hazelcast cluster in production which pretty much provides reliable broadcast, and I haven't had many problems with it for the last 3 years... The Apple App Store probably has 300 or more Hazelcast nodes...

That is very cool 👍
I'm not familiar with Hazelcast, but it does sound like a very impressive system.
Still, I believe for broadcast you would need to rely on Hazelcast's infrastructure or something similar. If we are going to build this from scratch, we can only provide this at a much higher level, in my opinion.

> The other thing is: do we want to replicate across the entire cluster? Why don't we split into partitions and then order updates within a partition? Having a partition leader and a backup would make things much simpler.

That is also an option. I believe we should wait for Itamar's requirements and our session this week either way.

> Consul's SWIM is a very elegant solution but has a different use case. They are trying to detect whether a node went down in the cluster, but as far as I remember they use a strongly consistent KV store to keep the cluster metadata, the same as etcd in Kubernetes.
>
> My point is: is using the same mechanism for cluster failure detection and for propagating data structures a good idea in the first place?

That is a good point. What makes their architecture attractive, in my opinion, is that they solve these issues completely independently.
If I understand their architecture correctly, at a high level they have a membership layer that basically has the same responsibilities as what we came up with for v0.1, and run Raft on top of it to elect a leader for a replicated KV store. I don't believe they have CRDTs in userland at all (they do have them as part of the membership protocol).

I think this is quite nice for us, as we also want to tackle these problems incrementally. We could reuse the membership layer for CRDTs, which I believe is not unusual for state-based CRDTs (or the delta-based CRDTs in the paper I linked, which target exactly this use case). But we might also build this as a completely separate system on top, should that prove more viable.

Note that I have no strong attachment to this approach; it just seems like a sane way to make a lot of progress while reusing a lot of ideas from a heavily battle-tested system.

@pshemass
Contributor

pshemass commented Nov 13, 2019

@mschuwalow Consul's Raft protocol runs on the servers only. Servers have a replication mechanism based on TCP. This is because of the size of the quorum (it grows as you add nodes), and that affects performance dramatically. They have gossip on the client side, but it's not the same as the server gossip, which only runs across data centers.
https://www.consul.io/docs/internals/architecture.html

https://www.consul.io/docs/internals/consensus.html#raft-in-consul

But I'm glad that you are not very attached to this. Hopefully we will get some requirements from Itamar.

@mschuwalow
Member Author

mschuwalow commented Nov 24, 2019

I've updated the PR. In its current form it is again very close to the original design in zio-keeper.

Do we want to merge the modules or keep them separate for now?

@mijicd mijicd mentioned this pull request Nov 24, 2019
@mijicd
Member

mijicd commented Nov 24, 2019

@mschuwalow Let's merge them, as discussed in the meeting. Once this is in, I'll enable the 2.13 build as well. Other than that, looks great!

@pshemass
Contributor

> @mschuwalow Let's merge them, as discussed in the meeting. Once this is in, I'll enable the 2.13 build as well. Other than that, looks great!

As we discussed, membership should still be in core, and we will decide what to do with it in the future.

@mschuwalow
Member Author

OK, one issue I ran into just now is that core does not compile with the current dependencies.
But I think it makes sense to keep it around until we reach feature parity.

@mijicd
Member

mijicd commented Nov 25, 2019

I'm merging this one in order to "unstuck" the other work. We can merge the modules and fix the deficiencies in subsequent PRs.

@mijicd mijicd merged commit 1c79f23 into zio:master Nov 25, 2019
@mijicd mijicd mentioned this pull request Nov 25, 2019
pshemass pushed a commit to pshemass/scalaz-distributed that referenced this pull request Nov 27, 2019
pshemass pushed a commit to pshemass/scalaz-distributed that referenced this pull request Nov 28, 2019