
factor out membership layer #96

Merged
mijicd merged 5 commits into zio:master on Nov 25, 2019

Conversation

mschuwalow
Member

As described in https://github.com/zio/zio-keeper/milestone/1, it makes sense to have separate membership and transport layers.

This PR is basically a high-level draft of how these layers could look, quite heavily inspired by the current zio-keeper code. I do have more stuff ready, but I would like us to start by discussing the high-level API and design.

The transport layer should be rather self-explanatory. I would very much like us to go with UDP for the gossip messages, though. The user (who will be internal for this) should have control over which protocol to use. The implementation I have in mind is essentially lazily created, long-lived TCP connections when necessary and UDP otherwise. Listening should not distinguish between the two, I believe.
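
To make this a bit more concrete, here is a minimal sketch of what such a transport service could look like. Only sendBestEffort appears in the PR itself; sendReliably, bind and the Error alias are illustrative placeholders.

import zio.{ Chunk, ZIO }
import zio.stream.ZStream
import java.net.SocketAddress

object transport {

  type Error = Exception // placeholder for the module's own error type

  trait Service[R] {
    // fire-and-forget delivery, intended to be backed by UDP
    def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]

    // delivery over a lazily created, long-lived TCP connection
    def sendReliably(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]

    // listening does not distinguish between the two protocols;
    // datagrams and TCP frames surface on the same stream
    def bind(addr: SocketAddress): ZStream[R, Error, Chunk[Byte]]
  }
}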

On the membership side I believe we should drop reliable broadcast, as it is very hard to support and will have wonky semantics. Instead I propose that we allow the user to register messages that will spread as part of the underlying membership protocol's gossip messages. Another thing that came up while experimenting with the current API: I think it's very useful to support "wait for reply" at this level. I believe this can be done very easily by adapting the address to something like %nodeId-%optional conversation id-%maybe an index into the conversation, handled transparently for the user.
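
A rough sketch of the kind of enriched address and conversation id this could use (all of the names here, GossipAddress, ConversationId, ConversationIds, are hypothetical, just to illustrate the idea):

import zio.{ Ref, UIO }

final case class ConversationId(value: Long) extends AnyVal

// a node id optionally tagged with a conversation id and an index into
// the conversation, so replies can be routed back transparently
final case class GossipAddress[A](
  nodeId: A,
  conversation: Option[ConversationId],
  index: Option[Int]
)

// fresh conversation ids come from a per-node monotonically increasing
// counter; combined with the node address they are globally unique
final class ConversationIds private (counter: Ref[Long]) {
  val next: UIO[ConversationId] =
    counter.modify(n => (ConversationId(n + 1), n + 1))
}

object ConversationIds {
  val make: UIO[ConversationIds] =
    Ref.make(0L).map(new ConversationIds(_))
}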

I believe SWIM is a good fit for our membership protocol and would also be a good basis for spreading our CRDT updates. I am thinking of having CRDTs be broadcast using registerBroadcast, implemented by having a clock for each broadcast (including membership protocol gossiping). Whenever a clock is ready to fire, it checks for other clocks that are ready to fire and groups them into one message, up to the UDP limits. This design is taken from Consul's memberlist and I think it is a very elegant solution.
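
A small sketch of the grouping step, purely to illustrate the idea (names are hypothetical and framing/length-prefixing of the individual payloads is omitted):

import zio.Chunk

object piggyback {

  final case class PendingBroadcast(payload: Chunk[Byte])

  // when one broadcast clock fires, drain the other ready broadcasts and pack
  // as many as fit into a single UDP payload; the rest wait for the next round
  def packPiggybacked(
    ready: List[PendingBroadcast],
    maxUdpPayload: Int
  ): (Chunk[Byte], List[PendingBroadcast]) =
    ready.foldLeft((Chunk.empty: Chunk[Byte], List.empty[PendingBroadcast])) {
      case ((packed, overflow), b) =>
        if (packed.length + b.payload.length <= maxUdpPayload)
          (packed ++ b.payload, overflow)
        else
          (packed, overflow :+ b)
    }
}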

There are a number of open questions here, marked with TODO, that we should discuss. But all in all I think this forms a useful basis we can collaborate and split work on.

/cc @mijicd

@pshemass
Contributor

How do you know that reliable broadcast is very hard to support? I'm running an 80-node Hazelcast cluster in production which pretty much provides reliable broadcast, and I haven't had many problems with it for the last 3 years... The Apple App Store probably has 300 or more Hazelcast nodes...

The other thing is: do we want to replicate across the entire cluster? Why don't we split into partitions and then order updates within a partition? Having a partition leader and a backup would make things much simpler.

Consul's SWIM is a very elegant solution but has a different use case. They are trying to detect whether a node went down in the cluster, but as far as I remember they use a strongly consistent KV store to keep the cluster metadata, the same as etcd in Kubernetes.

My point is: is using the same mechanism for cluster failure detection and for propagating data structures a good idea in the first place?

Review thread on membership/src/main/scala/zio/membership/NodeState.scala (outdated, resolved)

import zio.Chunk

final case class Message(
Contributor

I think Message should have a correlationId so that we can match all the messages in a conversation. WDYT?

Member Author

Do you mean for tracing? A combination of both should be usable for this. The idea is to have the node address + a monotonically increasing counter, so that every conversation has a unique id.

Contributor

Primarily I would like to have this for tracing. A monotonically increasing counter might be tricky if we provide a client library, because we need to generate the request id on the client. This is important when someone needs to debug what happened with their request.

* to the same port number on bind.
*/
trait Service[R] {
def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]
Contributor

Why do you think we should make this distinction here? Why not have send/bind only?

Member Author

Performance, mostly. I believe as much as possible should be done over UDP for performance reasons, but user messages or large messages, for example, might be better sent over TCP.

Also, Consul for example uses both, in case a node is not reachable over UDP but is over TCP.

I'm not sure how exactly we want to do this, but I believe we should make both available.
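
For illustration, one possible policy, written against two hypothetical send operations rather than any concrete API: small messages go best-effort over UDP, and large messages, or ones whose UDP send fails, take the reliable TCP path, similar to Consul's fallback.

import zio.{ Chunk, ZIO }
import java.net.SocketAddress

object sendPolicy {

  def send[R, E](
    bestEffort: (SocketAddress, Chunk[Byte]) => ZIO[R, E, Unit],
    reliable: (SocketAddress, Chunk[Byte]) => ZIO[R, E, Unit],
    udpLimit: Int
  )(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, E, Unit] =
    if (msg.length <= udpLimit)
      // fall back to TCP if the UDP send itself fails
      bestEffort(to, msg).orElse(reliable(to, msg))
    else
      // too large for a single datagram, go straight to TCP
      reliable(to, msg)
}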

Contributor
@pshemass pshemass Nov 13, 2019

Good reasons. The question is more about whether this belongs in Transport. I thought Transport is just UDP or TCP. In your Consul example I would use both, not one combined.

I was thinking about something similar to this https://haskell-distributed.github.io/tutorials/1ch.html#sending-messages

import Network.Transport.TCP (createTransport, defaultTCPParameters)
....

main = do
  Right t <- createTransport "127.0.0.1" "10501" defaultTCPParameters
...

Member Author

One issue I was thinking about is that some algorithms specifically need a reliable transport layer for failure detection.
This gets us into OOP land, but what do you think about having Transport <:< UnreliableTransport?
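
A minimal sketch of that hierarchy (hypothetical names, error type left abstract): gossip-only algorithms would ask for UnreliableTransport, a failure detector that needs guaranteed delivery would ask for Transport, and any Transport can still be used where an UnreliableTransport is expected.

import zio.{ Chunk, ZIO }
import java.net.SocketAddress

// best-effort delivery only (UDP-like)
trait UnreliableTransport[R, E] {
  def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, E, Unit]
}

// adds reliable delivery (TCP-like); Transport[R, E] <:< UnreliableTransport[R, E]
trait Transport[R, E] extends UnreliableTransport[R, E] {
  def sendReliably(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, E, Unit]
}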

Contributor

You are optimizing for something that is not implemented or even planned. It would be better to keep an MVP approach and deliver the minimal thing.

@mschuwalow
Member Author

> How do you know that reliable broadcast is very hard to support? I'm running an 80-node Hazelcast cluster in production which pretty much provides reliable broadcast, and I haven't had many problems with it for the last 3 years... The Apple App Store probably has 300 or more Hazelcast nodes...

That is very cool 👍
I'm not familiar with Hazelcast, but it does sound like a very impressive system.
Still, I believe for broadcast you would need to rely on Hazelcast's infrastructure or something similar. If we are going to build this from scratch, we can only provide this at a much higher level, in my opinion.

> The other thing is: do we want to replicate across the entire cluster? Why don't we split into partitions and then order updates within a partition? Having a partition leader and a backup would make things much simpler.

That is also an option. I believe we should wait for Itamar's requirements and our session this week either way.

> Consul's SWIM is a very elegant solution but has a different use case. They are trying to detect whether a node went down in the cluster, but as far as I remember they use a strongly consistent KV store to keep the cluster metadata, the same as etcd in Kubernetes.
>
> My point is: is using the same mechanism for cluster failure detection and for propagating data structures a good idea in the first place?

That is a good point. What makes their architecture attractive, in my opinion, is that they solve these issues completely independently.
If I understand their architecture correctly, at a high level they have a membership layer that basically has the same responsibilities as what we came up with for v0.1, and run Raft on top of it to elect a leader for a replicated KV store. I don't believe they have CRDTs in userland at all (they do have them as part of the membership protocol).

I think this is quite nice for us, as we also want to tackle these problems incrementally. We could reuse the membership layer for CRDTs, which I believe is not unusual for state-based CRDTs (or the delta-based CRDTs in the paper I linked, which target exactly this use case). But we might also build this as a completely separate system on top, should that prove more viable.

Note that I have no strong attachment to this approach; it just seems like a sane way to make a lot of progress while reusing a lot of ideas from a heavily battle-tested system.

@pshemass
Contributor

pshemass commented Nov 13, 2019

@mschuwalow Consul's Raft protocol runs on the servers only. Servers have a replication mechanism based on TCP. This is because of the size of the quorum (it grows as you add nodes), and that affects performance dramatically. They have gossip on the client side, but it's not the same as the server gossip, which only runs across data centers.
https://www.consul.io/docs/internals/architecture.html

https://www.consul.io/docs/internals/consensus.html#raft-in-consul

But I'm glad that you are not very attached to this. Hopefully we will get some requirements from Itamar.

@mschuwalow
Member Author

mschuwalow commented Nov 24, 2019

I've updated the PR. In its current form it is again very close to the original design in zio-keeper.

Do we want to merge the modules or keep them separate for now?

@mijicd mijicd mentioned this pull request Nov 24, 2019
@mijicd
Member

mijicd commented Nov 24, 2019

@mschuwalow Let's merge them, as discussed in the meeting. Once this is in, I'll enable the 2.13 build as well. Other than that, looks great!

@pshemass
Contributor

> @mschuwalow Let's merge them, as discussed in the meeting. Once this is in, I'll enable the 2.13 build as well. Other than that, looks great!

As we discussed, membership should still be in core, and we will decide what to do with it in the future.

@mschuwalow
Member Author

OK, one issue I ran into just now is that core does not compile with the current dependencies.
But I think it makes sense to keep it around until we reach feature parity.

@mijicd
Member

mijicd commented Nov 25, 2019

I'm merging this one in order to "unstuck" the other work. We can merge the modules and fix the deficiencies in subsequent PRs.

@mijicd mijicd merged commit 1c79f23 into zio:master Nov 25, 2019
@mijicd mijicd mentioned this pull request Nov 25, 2019
pshemass pushed a commit to pshemass/scalaz-distributed that referenced this pull request Nov 27, 2019
pshemass pushed a commit to pshemass/scalaz-distributed that referenced this pull request Nov 28, 2019