Fix memory leak in Xandra.Connection #355
Conversation
lib/xandra/connection.ex
@@ -249,7 +255,7 @@ defmodule Xandra.Connection do
      {:error, {:connection_crashed, reason}}
  after
    timeout ->
      Process.demonitor(req_alias, [:flush])
      :gen_statem.cast(conn_pid, {:release_stream_id, stream_id})
Aaah, I see, okay. A couple of questions, since it's gonna take me a sec to load all of Xandra's context back into my head 😄
- We should still demonitor the request alias, no?
- What happens if the response for this stream ID then comes after the timeout? Doesn't this happen?
- Yeah, the `req_alias` gets demonitored here.
- Good point, I thought that there was something fishy here, but didn't think it through.
- If the old stream_id that we made a request for but didn't get an answer from Cassandra within `timeout` ms isn't currently used by another request, we should raise that error, yes.
- More problematic: if the old stream_id that we made a request for but didn't get an answer from Cassandra within `timeout` ms is currently being used by another in_flight_request, we will be sending the answer to that new request, so we will be mixing up replies. And it is very likely that the stream_id gets used by another request, since we're fetching stream_ids with `Enum.at(MapSet.new(5000), 0)`, which is eventually ordered, and we're using the first elements of the MapSet.

Not sure how to solve this, though. Is it possible to somehow encode the req_alias into the Cassandra request?
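To make that race concrete, here is a minimal, self-contained sketch (plain Elixir, not Xandra's internals; the caller atoms and map shape are invented for illustration) of a freed stream ID being reused and the late reply landing at the wrong caller:

```elixir
# A connection tracks in-flight requests as stream_id => caller (simplified).
in_flight = %{}

# Request A is assigned stream ID 0.
in_flight = Map.put(in_flight, 0, :caller_a)

# Caller A times out, so the connection releases stream ID 0 right away...
in_flight = Map.delete(in_flight, 0)

# ...and, since free IDs are picked "from the front", request B reuses ID 0.
in_flight = Map.put(in_flight, 0, :caller_b)

# The late Cassandra reply for request A finally arrives on stream ID 0 and is
# delivered to caller B: the mixed-up reply described above.
{stream_id, payload} = {0, :result_for_request_a}
wrong_recipient = Map.fetch!(in_flight, stream_id)
IO.puts("delivering #{inspect(payload)} to #{inspect(wrong_recipient)}")
#=> delivering :result_for_request_a to :caller_b
```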
Another idea would be to free the stream ids waaay later after the timeout, so it is unlikely that we still get a reply from Cassandra.
Okay, the stream_id is a short, so a reference wouldn't fit.
https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec#L318
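Quick illustration of that constraint (plain Elixir, nothing Xandra-specific): the frame header carries the stream id as a 2-byte `[short]` field, so only a small integer fits, not an arbitrary term like a reference.

```elixir
# A stream id has to fit into the 2-byte ([short]) field of the frame header.
stream_id = 17
<<decoded::16>> = <<stream_id::16>>
decoded
#=> 17

# An Erlang reference is an opaque, variable-size term; even its external
# representation is far larger than 2 bytes, so it can't go in that field.
byte_size(:erlang.term_to_binary(make_ref()))
#=> typically well over 20
```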
@harunzengin what's the throughput you're testing at? Because ~30 unanswered responses in 5 minutes is a huge number; I'm pretty confused. Also, you could try asking in the C* community whether this is a possibility; they have a ton of ways to get in touch and people are really nice.
The rate is 1000 insertions per second. We have a cluster of 3 nodes, with 10 connections in each pool, so it's around 33 insertions per `Xandra.Connection` per second.

I have to emphasize that with DBConnection, we had 0 timeouts on our staging system for the same 1000 insertions per second. That's why I suspect the async protocol is the cause.

I'll ask in the Cassandra community, but in the meantime, can we agree on setting a second timeout for the `timed_out_ids`, let's say 30 minutes? I already implemented it and it is ready for review.
@harunzengin 30 minutes is too long. If you don't get a response within something like 5 minutes, I don't see how you ever would (for a single query at least).
@whatyouhide Reduced it to 5 minutes.
@whatyouhide This is ready for review if you have time :)
Left a few comments but this is looking really good.
Enum.each(data.in_flight_requests, fn {_stream_id, req_alias} ->
  send_reply(req_alias, {:error, :disconnected})
end)
I'm confused: now we send the reply to the caller, but we don't update `in_flight_requests`, which we were resetting to `%{}` before. How does this work?
True, I must've overlooked this, I deleted it in another iteration.
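For reference, a minimal self-contained sketch of what the resolved version would do (the `send_reply` function and the `data` shape here are simplified stand-ins for Xandra's internals): notify every in-flight caller, then reset the map so no stale request aliases stick around.

```elixir
# Simplified stand-in for Xandra's send_reply/2.
send_reply = fn req_alias, reply -> send(req_alias, {:reply, reply}) end

data = %{in_flight_requests: %{5 => self(), 9 => self()}}

# Tell every waiting caller that the connection dropped...
Enum.each(data.in_flight_requests, fn {_stream_id, req_alias} ->
  send_reply.(req_alias, {:error, :disconnected})
end)

# ...and then reset the map, which is the step that was accidentally dropped.
data = %{data | in_flight_requests: %{}}
IO.inspect(data.in_flight_requests, label: "in_flight_requests after reset")
```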
lib/xandra/connection.ex
data = update_in(data.timed_out_ids, &MapSet.put(&1, stream_id))

actions = [
  {{:timeout, {:stream_id, stream_id}}, @restore_timed_out_stream_id_timeout,
We should try to avoid setting potentially thousands of timeouts here. Instead, what we could do is store the timestamp that a stream ID timed out at and then periodically clean those up. For example, store `timed_out_ids` as `%{id => timed_out_at, ...}`. Then, every 30 seconds, flush the ones that are older than 5 minutes. Makes sense?
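A minimal sketch of that idea, with illustrative names (`@sweep_interval`, `@stream_id_ttl`, and the module itself are not Xandra's actual code): `timed_out_ids` becomes a map of `stream_id => timed_out_at`, and a single recurring sweep frees the old ones.

```elixir
defmodule TimedOutIdsSweepSketch do
  # Illustrative values; Xandra would pick its own constants.
  @sweep_interval :timer.seconds(30)
  @stream_id_ttl :timer.minutes(5)

  # Record when a stream ID timed out instead of arming a per-ID timeout.
  def mark_timed_out(timed_out_ids, stream_id) do
    Map.put(timed_out_ids, stream_id, System.monotonic_time(:millisecond))
  end

  # One periodic sweep: split off every ID that timed out more than
  # @stream_id_ttl ago; those can go back into the free pool.
  def sweep(timed_out_ids) do
    now = System.monotonic_time(:millisecond)

    {expired, kept} =
      Enum.split_with(timed_out_ids, fn {_id, timed_out_at} ->
        now - timed_out_at >= @stream_id_ttl
      end)

    {Enum.map(expired, fn {id, _at} -> id end), Map.new(kept)}
  end

  # In the :gen_statem this would be one recurring named timeout, e.g.
  # {{:timeout, :sweep_timed_out_ids}, @sweep_interval, nil}, re-armed after
  # every sweep, rather than one timeout action per stream ID.
  def sweep_interval, do: @sweep_interval
end
```

Usage on each sweep tick would be along the lines of `{freed_ids, timed_out_ids} = TimedOutIdsSweepSketch.sweep(timed_out_ids)`.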
Agreed
Co-authored-by: Andrea Leopardi <an.leopardi@gmail.com>
@whatyouhide ready for another pass
@whatyouhide Also added a telemetry event for a client timeout (3a5f2b2)
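If you want to observe those client timeouts, here is a hedged sketch of attaching a handler; the event name `[:xandra, :client_timeout]` and the metadata shape are guesses for illustration only, not necessarily what this PR uses (check the commit/docs for the real event):

```elixir
# Sketch only: the event name and metadata keys are assumptions, not Xandra's
# documented contract.
:telemetry.attach(
  "log-xandra-client-timeouts",
  [:xandra, :client_timeout],
  fn _event_name, _measurements, metadata, _config ->
    IO.puts("Xandra client timeout: #{inspect(metadata)}")
  end,
  nil
)
```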
There are some adjustments to make to the docs here, but this looks fantastic. I'll take care of those, @harunzengin. Thank you for all the great work and for the patience with this long review time 😄
@whatyouhide cool, and no worries. Should we release a patch then?
@harunzengin I opened #358 first with a couple of fixes. Before releasing a patch, would you have a chance to run this for a bit and see the impact? Especially around the timed-out IDs.
Closes #354

Fixed the memory leak in `Xandra.Connection` that was caused by not releasing stream ids. This is how the memory usage of our application looks after the fix, deployed on 08.02 at around 10:00:

[memory usage graph]