ra performance in large data transfer scenarios #135
We cannot suggest anything with the amount of information provided. How large is "large"? Over what kind of link? What heartbeat value was used? What did various metrics, both from the Ra WAL and from the infrastructure, tell you? This is RabbitMQ mailing list material.
We also have no information about how you use Ra (the relevant parts of your code) or your Erlang version. Generally, any Raft implementation will be highly sensitive to log entry size and to disk and network I/O throughput. Log entry payload compression can be a decent answer to all of those. We have an Erlang distribution implementation that can use LZ4 or Zstandard compression (note: the repository is not currently public, and we don't know if it will be). It can significantly reduce the amount of data transferred between nodes at the cost of 5-30% additional CPU load. Erlang 22 uses fragmented inter-node message transfers, which help avoid the well-known phenomenon where transferring one large message (say, a gigabyte in size) blocks all other inter-node communication and suspends every process that tries to send a message to a process on another node. That naturally manifests itself as a missed net tick/heartbeat. Use prometheus.erl and our Erlang distribution Grafana dashboard to see if there's any evidence of "head of the distribution transfer buffer" blocking. Don't guess; collect and use metrics instead.
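For illustration, payload compression can also be done at the application level, before a command ever enters the Ra log. The sketch below is hypothetical (the module name, `write/5`, and the `{write, ...}` command shape are made up for this example); it assumes `ra:process_command/3` and the standard `zlib` module, and the state machine's `apply/3` would have to call `zlib:uncompress/1` before using the payload:

```erlang
%% Sketch only: compress large payloads on the client side so less data
%% travels over distribution and into the Ra log. Not Ra API beyond
%% ra:process_command/3; command shape and names are illustrative.
-module(compressed_writes).
-export([write/5]).

write(ServerId, Path, Offset, Data, Timeout) ->
    Compressed = zlib:compress(iolist_to_binary(Data)),
    ra:process_command(ServerId, {write, Path, Offset, Compressed}, Timeout).
```

This trades some CPU time (the 5-30% range mentioned above is a reasonable expectation) for less data on the wire and on disk.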
rabbitmq/inet_tcp_compress_dist is now open source under the same double license as Ra. It's a very immature and highly experimental project, but it is sufficiently complete to run a real-world RabbitMQ quorum queue workload, so our engineers can compare its CPU and I/O effects with the built-in (uncompressed) distribution carrier.
I am very sorry for my vague statement. The following is our experimental environment.

We start one Ra cluster. The server's state machine `apply/3` callback is:

```erlang
apply(#{index := Idx}, {write, Path, Offset, Data}, State) ->
    case ets:lookup(?FDTAB, Path) of
        [] ->
            {State, {error, ebadf}, []};
        [{Path, Fd}] ->
            %% write file data to storage
            ok = file:pwrite(Fd, Offset, Data),
            %% emit a release_cursor effect every 32K entries so the log
            %% can be truncated
            Effect = case Idx rem (32 * 1024) of
                         0 ->
                             lager:info("return effect release_cursor Idx: ~p", [Idx]),
                             [{release_cursor, Idx, State}];
                         _ ->
                             []
                     end,
            {State, ok, Effect}
    end;
```

The client sends an unlimited number of messages to the servers. The measured upload rate is about 200 MB/s when `apply` writes the file data to disk storage. When we write to RAM, or do not write at all in `apply`, the measured upload rate is about 350 MB/s. When we use only ranch to receive the data, without Ra, the upload rate is about 1000 MB/s (the network bandwidth). However, we also see leader transfers in the Ra cluster, like …

I am very sorry for my English again :)
Well, not writing data to the log (disk) will obviously always be significantly more efficient than doing so. I suggest that you set up a small example app that …

Once you have some initial data, …

My guess is that you overload the distribution layer on OTP 21, which suspends Ra processes, which in turn leads their peers to detect heartbeat timeouts and trigger a new election. We are aware of one workload where the WAL operations do not keep up with the Ra client; however, that doesn't seem to be the case here.
It's important to mention that fragmented distribution only kicks in when both peers run OTP 22, so running only some nodes on OTP 22 will not make any difference.
Our team had to make the repo private again, as we don't know if it will be open sourced yet. So instead of using …
OK, so your performance issue most likely comes from using https://www.youtube.com/watch?v=wHpNfCeX_Vk. Do you need to write the data again? It is already on disk in the Ra log. Could the state machine just keep a map of the blobs and their references? I'm not sure what you are trying to do in this case, so more info would be needed.
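To illustrate that suggestion, a state machine that only tracks references to the blobs, instead of rewriting the bytes in `apply/3`, might look roughly like this. This is a sketch: the `put_ref`/`get_ref` command names and the map-based state are invented for the example; only the `apply/3` callback shape comes from Ra:

```erlang
%% Sketch: keep a map of Path => {Offset, Size} references. The raw bytes
%% stay wherever they were first persisted (e.g. in the Ra log itself),
%% so apply/3 never writes the payload a second time.
apply(_Meta, {put_ref, Path, Offset, Size}, State) ->
    {State#{Path => {Offset, Size}}, ok, []};
apply(_Meta, {get_ref, Path}, State) ->
    {State, maps:find(Path, State), []}.
```

Compared with the `file:pwrite/3` version above, this removes the second disk write from the hot path; whether it fits depends on how the data is later read back.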
Our system uses the ranch service and Ra. Ranch is responsible for receiving data from clients and handing it over to the Ra cluster for storage, but we found that the performance is not good. We also found that Ra members miss heartbeats in large data transfer scenarios, even when the Ra cluster is not doing anything. How can we optimize this?