factor out membership layer #96
Conversation
How do you know that it is very hard to support broadcast? I'm running an 80-node Hazelcast cluster in production which pretty much provides reliable broadcast, and I haven't had many problems with it over the last 3 years... The Apple App Store probably runs 300 or more Hazelcast nodes... The other question is whether we want to replicate across the entire cluster at all. Why don't we split into partitions and then order updates within a partition? Having a partition leader and a backup makes things much simpler. Consul's SWIM is a very elegant solution but has a different use case: it tries to detect whether a node in the cluster went down, but as far as I remember they use a strongly consistent KV store to keep metadata in the cluster, the same as etcd in Kubernetes. My point is: is using the same mechanism for cluster failure detection and for propagating data structures a good idea in the first place?
import zio.Chunk

final case class Message(
I think Message should have a correlationId so that we can match all the messages in a conversation. WDYT?
Do you mean for tracing? A combination of both should be usable for this. The idea is to have the node address plus a monotonically increasing counter, so that every conversation has a unique id.
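For illustration, a minimal sketch of how such an id could be built — NodeAddress, ConversationIds, and all the wiring here are made up for the example, not actual zio-keeper types:

```scala
import zio.{ Ref, UIO }

// Hypothetical shape of a conversation id: the originating node's address plus
// a counter that only ever increases on that node, so ids are globally unique.
final case class NodeAddress(host: String, port: Int)
final case class ConversationId(origin: NodeAddress, counter: Long)

final class ConversationIds(origin: NodeAddress, counter: Ref[Long]) {
  // each call hands out a fresh id and bumps the node-local counter
  val next: UIO[ConversationId] =
    counter.modify(c => (ConversationId(origin, c), c + 1))
}

object ConversationIds {
  def make(origin: NodeAddress): UIO[ConversationIds] =
    Ref.make(0L).map(new ConversationIds(origin, _))
}
```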
Primarily I would like to have this for tracing. A monotonically increasing counter might be tricky if we provide a client library, because we would need to generate the request id on the client. This is important when someone needs to debug what happened with their request.
 * to the same port number on bind.
 */
trait Service[R] {
  def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]
Why do you think we should make this distinction here? Why not have send/bind only?
Performance, mostly. I believe as much as possible should be done over UDP for performance reasons, but user messages or large messages, for example, might be better sent over TCP.
Consul, for example, also uses both: it falls back to TCP in case a node is not reachable over UDP.
I'm not sure how exactly we want to do this, but I believe we should make both available.
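To make that concrete, a rough sketch of the kind of interface I mean. Only sendBestEffort comes from the diff above; sendReliably, bind, and the Error / java.net.SocketAddress stand-ins are hypothetical:

```scala
import java.net.SocketAddress

import zio.{ Chunk, ZIO }
import zio.stream.ZStream

sealed trait Error

trait Service[R] {
  // fire-and-forget, e.g. backed by UDP
  def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]
  // delivery acknowledged or failed loudly, e.g. backed by a TCP connection
  def sendReliably(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, Error, Unit]
  // listening does not distinguish which protocol a message arrived on
  def bind(addr: SocketAddress): ZStream[R, Error, Chunk[Byte]]
}
```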
Good reasons. The question is more about whether this belongs in Transport. I thought that a Transport is just UDP or TCP. In your example with consul I would use both, not one combined.
I was thinking about something similar to this https://haskell-distributed.github.io/tutorials/1ch.html#sending-messages
import Network.Transport.TCP (createTransport, defaultTCPParameters)
....
main = do
  Right t <- createTransport "127.0.0.1" "10501" defaultTCPParameters
  ...
One issue I was thinking about is that some algorithms specifically need a reliable transport layer for failure detection.
This gets us into OOP land, but what do you think about having Transport <:< UnreliableTransport?
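As a sketch only (trait names and java.net.SocketAddress are placeholders), the subtyping would look something like this:

```scala
import java.net.SocketAddress

import zio.{ Chunk, ZIO }

// Anything that can send reliably can also be used where best-effort delivery
// is enough, hence Transport extends UnreliableTransport.
trait UnreliableTransport[R, E] {
  def sendBestEffort(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, E, Unit]
}

trait Transport[R, E] extends UnreliableTransport[R, E] {
  def sendReliably(to: SocketAddress, msg: Chunk[Byte]): ZIO[R, E, Unit]
}

object Usage {
  // gossip tolerates message loss, so any transport will do
  def gossip[R, E](transport: UnreliableTransport[R, E]): ZIO[R, E, Unit] = ZIO.unit
  // a failure detector that needs reliable delivery asks for the stronger type
  def failureDetector[R, E](transport: Transport[R, E]): ZIO[R, E, Unit] = ZIO.unit
}
```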
You are optimizing for something that is not implemented or even planned. It would be better to keep an MVP approach and deliver the minimal thing.
That is very cool 👍
That is also an option. I believe we should wait for Itamar's requirements and our session this week either way.
That is a good point. What makes their architecture attractive in my opinion is that they solve these issues completely independently. I think this is quite nice for us, as we also want to tackle these problems incrementally. We could reuse the membership layer for CRDTs, which I believe is not unusual for state-based CRDTs (or the delta-based CRDTs in the paper I linked, which target exactly this use case). But we might also build this as a completely separate system on top of it, should that prove more viable. Note that I have no strong attachment to this approach; it just seems like a sane way to get a lot of progress done while reusing a lot of ideas from a heavily battle-tested system.
@mschuwalow consul's raft protocol runs on servers only. Servers have a replication mechanism based on TCP. This is because the quorum size grows as you add nodes, and that affects performance dramatically. They have gossip on the client side, but it's not the same as server gossip, which only runs across data centers. https://www.consul.io/docs/internals/consensus.html#raft-in-consul But I'm glad that you are not very attached to this. Hopefully we will get some requirements from Itamar.
I've updated the PR. In its current form it is again very close to the original design in zio-keeper. Do we want to merge the modules or keep them separate for now?
@mschuwalow Let's merge them, as discussed during the meeting. Once this is in, I'll enable the 2.13 build as well. Other than that, looks great!
As we discussed, membership should still be in core, and we will decide what to do with it in the future.
Ok, one issue that I ran into just now is that core does not compile with the current dependencies.
I'm merging this one in order to "unstuck" the other work. We can merge the modules and fix the deficiencies in subsequent PRs.
As described in https://github.com/zio/zio-keeper/milestone/1, it makes sense to have a separate membership / transport layer.
This PR is basically a high-level draft of how these layers could look and is quite inspired by the current zio-keeper code. I do have more stuff ready, but I would like us to start by discussing the high-level API and design.
The transport layer should be rather self-explanatory. I would very much like us to go for UDP for the gossip messages, though. The user (who will be internal for this stuff) should have control over which protocol they want to use. The implementation I have in mind is essentially lazily created, long-lived TCP connections when necessary and UDP otherwise. Listening should not distinguish between the two, I believe.
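A minimal sketch of the lazily created, long-lived connection idea; Connection, connect, and all names here are placeholders rather than the actual transport implementation, and concurrent connection attempts and cleanup are ignored for brevity:

```scala
import java.net.SocketAddress

import zio.{ Chunk, IO, Ref, ZIO }

trait Connection {
  def send(msg: Chunk[Byte]): IO[Throwable, Unit]
}

final class CachedConnections(
  connections: Ref[Map[SocketAddress, Connection]],
  connect: SocketAddress => IO[Throwable, Connection]
) {
  // reliable sends reuse an existing connection or open one on first use
  def sendReliably(to: SocketAddress, msg: Chunk[Byte]): IO[Throwable, Unit] =
    connectionFor(to).flatMap(_.send(msg))

  private def connectionFor(to: SocketAddress): IO[Throwable, Connection] =
    connections.get.map(_.get(to)).flatMap {
      case Some(conn) => ZIO.succeed(conn)
      case None       => connect(to).tap(c => connections.update(_ + (to -> c)).unit)
    }
}
```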
On the membership side I believe we should drop reliable broadcast, as it is very hard to support / will have wonky semantics. Instead I propose that we allow the user to register messages that will spread as part of the underlying membership protocol's gossip messages. Another thing that came up while experimenting with the current API is that I think it's very useful to support "wait for reply" at this level. I believe this can be done very easily by essentially adapting the address to be
%nodeId-%optional conversation id-%maybe an index into the conversation?
and handling it transparently for the user. I believe SWIM is a good fit for our membership protocol and would also be a good basis for spreading our CRDT updates. I am thinking of having CRDTs be broadcast using registerBroadcast and implemented by having a clock for each broadcast (including membership protocol gossiping). Whenever a clock is ready to fire, it checks for other clocks that are ready to fire and groups them into one message, up to the UDP limits. This design is taken from consul's memberlist and I think it is a very elegant solution to this.
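A very rough sketch of that grouping idea — the names, the list-based queue, and the byte budget are invented for the example, and the real memberlist-style implementation is more involved:

```scala
import zio.{ Chunk, Ref, UIO }

// A pending gossip payload from one of the registered broadcasts
// (membership protocol gossip, CRDT deltas, user broadcasts, ...).
final case class PendingGossip(source: String, payload: Chunk[Byte])

final class GossipBatcher(pending: Ref[List[PendingGossip]], maxBytes: Int) {

  // broadcasts enqueue payloads whenever they have something to spread
  def enqueue(msg: PendingGossip): UIO[Unit] =
    pending.update(_ :+ msg).unit

  // called whenever any gossip clock fires: take as many pending payloads as
  // fit into a single UDP datagram and leave the rest for the next round
  def nextBatch: UIO[Chunk[PendingGossip]] =
    pending.modify { msgs =>
      val (taken, rest) = takeWithinBudget(msgs, maxBytes)
      (Chunk.fromIterable(taken), rest)
    }

  private def takeWithinBudget(
    msgs: List[PendingGossip],
    budget: Int
  ): (List[PendingGossip], List[PendingGossip]) =
    msgs match {
      case m :: tail if m.payload.length <= budget =>
        val (taken, rest) = takeWithinBudget(tail, budget - m.payload.length)
        (m :: taken, rest)
      case other => (Nil, other)
    }
}
```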
There are a number of open questions here marked with TODO that we should discuss. But all in all I think this forms a useful basis that we can collaborate and split work on.
/cc @mijicd