com.twitter.ostrich.stats.LatchedStatsListener is leaking memory #96

Open
vogelito opened this Issue Apr 9, 2012 · 5 comments

@vogelito

vogelito commented Apr 9, 2012

By taking memory snapshots of a live kestrel server using YourKit, we identified that the following object is leaking memory:

com.twitter.ostrich.stats.LatchedStatsListener

@robey


robey commented Apr 9, 2012

Contributor

this is on kestrel 2.1.5?

i can't see anything obvious in LatchedStatsListener that collects memory. do you have any other clues? is there stuff being reported through the stats API that you don't think should be there (like: is it reporting on more queues than actually exist, or something like that?)


@vogelito


vogelito commented Apr 9, 2012

We were on 2.1.4 and noticed that after our servers had been up and running for about 3 days, performance would degrade significantly, and a restart of the software would resolve the issue.

We started our servers with GC logging on and realized that the size of the Perm pool was growing steadily, up to the point where we were doing Full GC cycles more than once a second.
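For reference, the kind of GC logging described here can be enabled on the HotSpot JVMs of that era with flags along these lines (the exact flags and log path are an illustrative assumption, not taken from this report):

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/var/log/kestrel/gc.log
-XX:MaxPermSize=256m
```

Steady growth of the Perm generation in such a log, together with increasingly frequent Full GC lines, is the symptom described above.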

Our queue names are slightly transient (around 100 active queues at all times, but names change about 5-10 times a day depending on live configuration), and some googling yielded this.
We also have about 300 other queues whose names don't change.

So we're currently running up to commit fbc5548.

Unfortunately, this didn't solve our issue, so we started taking memory snapshots using YourKit. It was evident that there was a big leak in com.twitter.ostrich.admin.TimeSeriesCollector: our first snapshot showed one object with a size of 12MB, 2 hours later the same object was 40MB, and it kept growing steadily.

Per this, we eliminated the TimeSeriesCollectorConfig, and that alleviated our problem. The size of the Perm generation is still growing, but nowhere near as fast as before.

However, from the memory snapshots I noticed that LatchedStatsListener more than doubled in size in about 60 hours (5MB to 12MB).

This isn't a big concern for us anymore (our servers will most likely be restarted before this becomes an issue), but thought I'd share the findings.


@robey


robey commented Apr 10, 2012

Contributor

I just did a diff from fbc5548 to head, and this change looks relevant:

```diff
@@ -405,6 +427,11 @@ class PersistentQueue(val name: String, persistencePath: String, @volatile var c
     Stats.clearGauge(statNamed("waiters"))
     Stats.clearGauge(statNamed("open_transactions"))
     Stats.clearGauge(statNamed("create_time"))
+    Stats.removeMetric(statNamed("set_latency_usec"))
+    Stats.removeMetric(statNamed("get_timeout_msec"))
+    Stats.removeMetric(statNamed("delivery_latency_msec"))
+    Stats.removeMetric(statNamed("get_hit_latency_usec"))
+    Stats.removeMetric(statNamed("get_miss_latency_usec"))
   }

   private final def nextXid(): Int = {
```

that fixed a bug where 5 metrics per queue were not removed from the stats tables when a queue was deleted. if you do have a series of temporary queues, this sounds like it's probably the problem.
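The leak pattern described here can be sketched as a minimal, self-contained model (the map-based `Stats` object and queue class below are illustrative assumptions, not ostrich's or kestrel's actual code; only the five metric names come from the diff above):

```scala
import scala.collection.mutable

// Minimal model of a global stats table keyed by metric name.
object Stats {
  val metrics = mutable.HashMap[String, List[Long]]()
  def addMetric(name: String, value: Long): Unit =
    metrics(name) = value :: metrics.getOrElse(name, Nil)
  def removeMetric(name: String): Unit = metrics -= name
}

class PersistentQueue(val name: String) {
  private def statNamed(stat: String) = s"q/$name/$stat"
  private val metricNames = List(
    "set_latency_usec", "get_timeout_msec", "delivery_latency_msec",
    "get_hit_latency_usec", "get_miss_latency_usec")

  def recordSomeActivity(): Unit =
    metricNames.foreach(m => Stats.addMetric(statNamed(m), 1L))

  // Before the fix, deleting a queue cleared its gauges but left these
  // five metrics behind; with transient queue names the table grew forever.
  def delete(): Unit =
    metricNames.foreach(m => Stats.removeMetric(statNamed(m)))
}

// Simulate transient queues: create, use, and delete 1000 of them.
for (i <- 0 until 1000) {
  val q = new PersistentQueue(s"transient-$i")
  q.recordSomeActivity()
  q.delete()  // without the removeMetric calls, 5000 entries would remain
}
println(Stats.metrics.size)  // 0 with the fix applied
```

With ~400 queues and names rotating several times a day, as vogelito describes, the stale entries (and the strings naming them) accumulate without bound unless they are removed on deletion.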

sorry about that.

the oldest commit with the fix looks like: 58a75aa

i will try to do a release soon to get these fixes out.


@vogelito


vogelito commented Apr 10, 2012

I'm pretty sure that won't fix the problem, since we had already merged that change in (that's the link in my previous post).


@vogelito


vogelito commented Apr 10, 2012

I apologize for misreporting. We're actually running up to commit 419009c.

