Memory leak #1625

Closed
cosminadrianpopescu opened this issue Aug 1, 2022 · 37 comments · Fixed by #2675
Labels
bug Something isn't working

Comments

@cosminadrianpopescu
Contributor

Thank you for taking the time to file this issue! Please follow the instructions and fill in the missing parts below the instructions, if it is meaningful. Try to be brief and concise.

In Case of Graphical or Performance Issues

  1. Delete the contents of /tmp/zellij-1000/zellij-log, ie with cd /tmp/zellij-1000/ and rm -fr zellij-log/
  2. Run zellij --debug
  3. Recreate your issue.
  4. Quit Zellij immediately with ctrl-q (your bug should ideally still be visible on screen)

Please attach the files that were created in /tmp/zellij-1000/zellij-log/ to the extent you are comfortable with.

Basic information

zellij --version: zellij 0.30.0
stty size: 46 197
uname -av or ver(Windows): Linux ip-### Wed Dec 16 22:44:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

List of programs you interact with as, PROGRAM --version: output cropped meaningful, for example:
nvim --version: NVIM v0.5.0-dev+1299-g1c2e504d5 (used the appimage release)
alacritty --version: alacritty 0.7.2 (5ac8060b)

Further information
Reproduction steps, noticeable behavior, related issues, etc

I have a long-running zellij instance to which I connect and disconnect often (more than once per day). After weeks of using the same session, I notice that more than 2 GB of RAM are being used by the zellij server. See the attached screenshot.

@cosminadrianpopescu cosminadrianpopescu added the bug Something isn't working label Aug 1, 2022
@cosminadrianpopescu
Contributor Author

[Screenshot: bug-zellij]

@cosminadrianpopescu
Contributor Author

Unfortunately I don't have the logs from the session shown in the attached screenshot. If this is a blocker for investigating this issue, then next time I will attach another screenshot and also copy the logs beforehand.

@olanod

olanod commented Aug 30, 2022

I have a similar setup and was surprised to see zellij take 900 MB. I assume that is not expected?

@har7an
Contributor

har7an commented Sep 26, 2022

Thank you for reporting this!

Sorry we took so long to reply. I tried to reproduce this with the latest zellij version locally and I can confirm this behavior. That clearly shouldn't happen!

I have a suspicion where this may originate. I'll investigate and get back to you once I have found something.

@har7an
Contributor

har7an commented Oct 13, 2022

So this turns out to be more subtle than I had imagined. But I have another question: Do you have a scrollback limit set? Or is your scrollback infinite?

@cosminadrianpopescu
Contributor Author

No, I don't have any limit set by default. Should I set up a limit?

@InnovativeInventor

I have been encountering this issue as well (I have seen Zellij take up > 1 GB of memory).

I have attempted to reproduce this synthetically by:

  1. opening a new tab
  2. running yes in the new tab for a few seconds
  3. opening a new tab and closing the current tab
  4. observing that the memory usage at step 4 is higher than at step 1.

Note: this may be distinct from the > 1 GB possible memory leak. It may also be the same.

@InnovativeInventor

Update: It appears that the memory leak may originate from here:

let mut input_parser = InputParser::new();

In particular, when I move this line into the loop block, the memory leak appears to go away. More investigation is needed, though, to figure out a fix and confirm the cause of the memory leak.
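
One possible reading of that observation is that a stateful parser keeps whatever it has buffered for as long as it lives. The sketch below is purely illustrative: `ToyParser` and both helper functions are hypothetical and are not the code of zellij or of the parser it uses. It only contrasts a parser shared across loop iterations with one created fresh in each iteration.

```rust
/// Hypothetical stand-in for a stateful input parser: it buffers bytes
/// until it sees a newline and never shrinks its buffer on its own.
struct ToyParser {
    pending: Vec<u8>,
}

impl ToyParser {
    fn new() -> Self {
        ToyParser { pending: Vec::new() }
    }

    /// Feed a chunk; complete lines are counted and removed,
    /// anything after the last newline stays buffered.
    fn feed(&mut self, chunk: &[u8]) -> usize {
        self.pending.extend_from_slice(chunk);
        let mut lines = 0;
        while let Some(pos) = self.pending.iter().position(|&b| b == b'\n') {
            self.pending.drain(..=pos);
            lines += 1;
        }
        lines
    }

    fn buffered_bytes(&self) -> usize {
        self.pending.len()
    }
}

/// One parser shared by all iterations: unterminated input accumulates.
fn buffered_with_shared_parser(chunks: &[&[u8]]) -> usize {
    let mut parser = ToyParser::new();
    for chunk in chunks {
        parser.feed(chunk);
    }
    parser.buffered_bytes()
}

/// A fresh parser per iteration: its state is dropped after every pass.
/// Returns the largest amount buffered at any one time.
fn max_buffered_with_fresh_parser(chunks: &[&[u8]]) -> usize {
    let mut max = 0;
    for chunk in chunks {
        let mut parser = ToyParser::new();
        parser.feed(chunk);
        max = max.max(parser.buffered_bytes());
    }
    max
}
```

A fresh parser per iteration would also forget a sequence that is split across two reads, so this is a diagnostic contrast rather than a proposed fix.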

I'm not quite sure what zellij-client is supposed to be doing (or that event loop). Is there some documentation that provides a high-level overview of how zellij is structured?

@raphCode
Contributor

raphCode commented Jan 19, 2023

I can confirm that the memory leaks are related to the scrollback. The memory is not (fully) released when the pane or tab is closed.
Test case:

  • open zellij
  • yes $(python -c "print('x' * 2000)") (in bash)
  • open new pane / tab
  • close first tab with the python command
  • memory consumption stays high

See also #2104

@tlinford
Contributor

tlinford commented Feb 1, 2023

> I can confirm that the memory leaks are related to the scrollback. The memory is not (fully) released when the pane or tab is closed.

I did a bit of testing on this, and I'm starting to doubt that this is really a memory leak. I basically repeated the test above 10 times in the same zellij session and saw memory usage stay pretty much constant. If the memory was actually leaked I would have expected usage to go up 10x.

Some searching around supports the idea (given that by default on Linux, Rust uses the system malloc): https://stackoverflow.com/questions/45538993/why-dont-memory-allocators-actively-return-freed-memory-to-the-os/45539066#45539066.

Repeating the same kinds of tests after switching to jemalloc gives different results: I observed memory usage drop significantly after closing a tab.
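
A rough way to tell a real leak from allocator retention is to count live heap bytes inside the process and compare that number with the RSS that ps reports. The following is a minimal sketch using only the standard library. `CountingAllocator`, `LIVE_BYTES` and `live_bytes` are made-up names, and nothing like this is claimed to exist in zellij.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Bytes currently allocated by the program and not yet freed.
static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper around the system allocator that tracks live bytes.
struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// Route every heap allocation of the program through the counter.
#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Number of heap bytes the program currently holds.
fn live_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}
```

If `live_bytes()` returns to its baseline after a tab is closed while RSS stays high, the program has freed the memory and the allocator is keeping it. If `live_bytes()` itself keeps growing, something in the program is still holding the data.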

@raphCode
Contributor

raphCode commented Feb 1, 2023

I think you are correct, I have not repeated my tests often enough to average out the kind-of-nondeterministic behavior of the allocator.
I played around and found no definite signs of leaks when closing panes.

@DianaNites

DianaNites commented Apr 23, 2023

layout {
    pane size=1 borderless=true {
        plugin location="zellij:tab-bar"
    }
    
    pane split_direction="vertical" {
        pane split_direction="vertical" size="80%"
        pane split_direction="vertical"
    }
    pane split_direction="horizontal" {
        pane split_direction="horizontal" size="80%"
        pane split_direction="vertical"
    }
    
    pane size=2 borderless=true {
        plugin location="zellij:status-bar"
    }
}

This layout causes zellij to rapidly consume all memory on the system, a recoverable condition only because I have 64 GB of RAM and it takes a while to fill that much up. It gets to 20 GB in seconds.

It does not respond to SIGINT or SIGQUIT and has to be killed manually, and quickly.

$ stty size
45 167

@raphCode
Contributor

> this layout causes zellij to rapidly consume all memory on the system

moved to #2407

This issue is more about memory usage in sessions that are long running and/or had a lot of tabs / panes / scrollback lines.

@kseistrup

When I have btop running in its own tab, zellij will gobble up 3–5 GB RAM in a matter of a couple of days.

  • zellij: 0.36.0
  • btop: 1.2.13
  • O/S: Linux 6.3 (x86_64)

@kseistrup

This is a comment regarding a sub-thread on HN where possible memory leaking is mentioned.

TL;DR: Zellij v0.37.2 is still leaking memory with btop running in a separate tab.

Setup

In a freshly launched zellij with a fairly default configuration (I believe I changed the mouse value only) I opened 3 tabs:

  1. a working shell
  2. lnav log file navigator (probably irrelevant in this case)
  3. btop

In all three tabs I exec'ed into the running program, so exec btop in tab 3.

To capture zellij's memory consumption I ran ps(1) every 71 seconds:

while :; do
  ps faux \
  | rg '[z]ellij --server' \
  | timestamp \
  | tee -a zellij-memory-leak.txt
  # sleep for 1m 11s
  sleep 71
done
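
For anyone who wants to post-process a log like the one this loop produces, here is a small sketch of extracting the memory columns. It assumes each logged line starts with a date field and a time field (as in the output quoted in this thread), followed by the usual ps aux columns, where VSZ and RSS are reported in KiB. `Sample` and `parse_logged_line` are hypothetical names.

```rust
/// One memory sample taken from a timestamped `ps aux`-style line.
#[derive(Debug, PartialEq)]
struct Sample {
    pid: u32,
    vsz_kib: u64,
    rss_kib: u64,
}

/// Parses a line of the assumed form
/// `DATE TIME USER PID %CPU %MEM VSZ RSS ...`.
/// Returns `None` for marker lines or anything that does not fit.
fn parse_logged_line(line: &str) -> Option<Sample> {
    let mut fields = line.split_whitespace();
    let _date = fields.next()?;
    let _time = fields.next()?;
    let _user = fields.next()?;
    let pid = fields.next()?.parse().ok()?;
    let _cpu = fields.next()?;
    let _mem = fields.next()?;
    let vsz_kib = fields.next()?.parse().ok()?;
    let rss_kib = fields.next()?.parse().ok()?;
    Some(Sample { pid, vsz_kib, rss_kib })
}
```

Lines that are not samples, such as hand-written markers in the log, come back as `None` and can simply be skipped.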

Results

[…]
2023-06-27 19:51:32	kas       894490  2.2  6.1 38396436 125172 ?     Sl   19:40   0:14 /usr/bin/zellij --server /run/user/1000/zellij/0.37.2/fertile-curtain
2023-06-27 19:52:43	kas       894490  2.1  6.3 38412820 128628 ?     Sl   19:40   0:15 /usr/bin/zellij --server /run/user/1000/zellij/0.37.2/fertile-curtain
2023-06-27 19:53:54	kas       894490  1.9  6.4 38412820 130676 ?     Sl   19:40   0:15 /usr/bin/zellij --server /run/user/1000/zellij/0.37.2/fertile-curtain
[…]
2023-06-28 06:16:43	kas       894490  0.6 61.3 40483420 1238556 ?    Sl   Jun27   4:05 /usr/bin/zellij --server /run/user/1000/zellij/0.37.2/fertile-curtain
2023-06-28 06:17:54	kas       894490  0.6 61.4 40483420 1240604 ?    Sl   Jun27   4:06 /usr/bin/zellij --server /run/user/1000/zellij/0.37.2/fertile-curtain
2023-06-28 06:19:05	kas       894490  0.6 62.2 53066496 1257172 ?    Sl   Jun27   4:07 /usr/bin/zellij --server /run/user/1000/zellij/0.37.2/fertile-curtain
[…]
[ btop is killed, and then: ]
[…]
2023-06-28 06:21:56	kas       894490  0.6  4.6 13220532 94736 ?      Sl   Jun27   4:10 /usr/bin/zellij --server /run/user/1000/zellij/0.37.2/fertile-curtain

Initially zellij was using ~93 MB RAM, which immediately climbed in roughly 2 kB increments for each loop above. Interestingly, it plateaued at 148 MB after 20 minutes and stayed there for almost half an hour, but then it continued climbing (except for 5 times during the test period, where memory consumption decreased by between 1_664 B and 41 kB between two loop rounds).

The loop ran from 2023-06-27 @ 19:40 to 2023-06-28 @ 06:19 (i.e., 10h 39m), and at the end zellij was consuming 1_228 MB RAM (1.2 GB), all of which was released when the btop tab was closed.

Notes

I probably wouldn't notice this on a modern machine with lots of RAM, but on a small VPS with 2 GB RAM or less, zellij will eventually be killed with an OOM error by the system. For that reason I went back to using tmux on remote machines, because zellij would be killed after a day or two since I usually run btop in a separate tab.

Non-standard software used:

PS: The memory I am referring to is the RSS size as reported by ps. The VSZ increased from 36 GB to 51 GB during the test period, and went down to 13 GB when tabs 2 and 3 were closed.

@kseistrup

PS:

$ stty size
50 211
$ uname -av
Linux fyhav 6.3.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 14 Jun 2023 20:10:31 +0000 x86_64 GNU/Linux

The whole thing is running on a 2 GB Linode VPS with 1 CPU running an ArchLinux installation.

I'm not comfortable with attaching zellij's debug logs as they seem to include everything that zellij has seen during its entire runtime.

(I reran the loop in a zellij --debug instance, rather than the zellij used above, but I don't see any difference: it's still eating memory.)

@imsnif
Member

imsnif commented Jun 28, 2023

Hey, just to confirm: I'm reproducing this with the btop tab. Thanks for the detailed reproduction. I'll poke around and see what's up hopefully some time this month.

@imsnif
Member

imsnif commented Aug 4, 2023

Alright - so I issued a fix for this in #2675 - it will be released in the next version. Thank you very much @kseistrup for the reproduction.

This issue has become a bit of a grab bag for memory issues, some of them actual issues, others symptoms of other issues (e.g. things we need to give an error about instead of crashing). So I'm going to close it now after this fix. If anyone is still experiencing memory issues (starting from the next release), please open a separate issue and be sure to provide a detailed and minimal reproduction. Thanks everyone!

@kseistrup

I'm afraid the changes made in #2675 haven't had any effect on the memleak issue when running btop in a separate tab. Please note that there are two instances of zellij running in the logfile below: vitreous-lemon (PID 2730169) had a tab with btop, as before, while excellent-galaxy (PID 2730826) had a tab with just a tty-based RSS reader:

2023-08-28 10:57:06	kas      2730169  8.4  0.7 50470704 85532 ?      Sl   10:51   0:26  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/vitreous-lemon
2023-08-28 10:57:06	kas      2730826  1.5  0.4 12687136 56976 ?      Sl   10:52   0:04  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/excellent-galaxy
2023-08-28 10:58:17	kas      2730169  7.3  0.7 50476792 91100 ?      Sl   10:51   0:28  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/vitreous-lemon
2023-08-28 10:58:17	kas      2730826  1.5  0.5 12687136 61432 ?      Sl   10:52   0:04  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/excellent-galaxy
2023-08-28 10:59:28	kas      2730169  6.6  0.8 50478992 93504 ?      Sl   10:51   0:30  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/vitreous-lemon
2023-08-28 10:59:28	kas      2730826  1.4  0.5 12687136 61432 ?      Sl   10:52   0:05  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/excellent-galaxy
[…]
2023-08-28 12:16:28	kas      2730169  3.0  2.2 50724728 262328 ?     Sl   10:51   2:34  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/vitreous-lemon
2023-08-28 12:16:28	kas      2730826  1.2  0.5 12687200 62464 ?      Sl   10:52   1:02  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/excellent-galaxy
2023-08-28 12:17:39	kas      2730169  3.0  2.2 50724728 264840 ?     Sl   10:51   2:36  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/vitreous-lemon
2023-08-28 12:17:39	kas      2730826  1.2  0.5 12687200 62464 ?      Sl   10:52   1:03  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/excellent-galaxy
2023-08-28 12:18:50	kas      2730169  3.0  2.3 50724728 268596 ?     Sl   10:51   2:38  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/vitreous-lemon
2023-08-28 12:18:50	kas      2730826  1.2  0.5 12687200 62464 ?      Sl   10:52   1:04  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/excellent-galaxy
[ btop is killed, and then: ]
2023-08-28 12:20:01	kas      2730169  3.0  0.6 37873300 79588 ?      Sl   10:51   2:40  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/vitreous-lemon
2023-08-28 12:20:01	kas      2730826  1.2  0.5 12687168 62336 ?      Sl   10:52   1:05  \_ /usr/bin/zellij --server /run/user/1000/zellij/0.38.0/excellent-galaxy

I only ran the loop for slightly more than an hour, but the results are clear:

For every loop an average of 2653 bytes is lost (min: 1044, max: 5568), which is at least as much as in v0.37.2, and possibly a little worse.

(The zellij without a btop tab (excellent-galaxy) is largely unaffected. Sometimes the memory consumption goes up by a few bytes, sometimes it goes down.)

Zellij v0.38.0 hasn't reached my Linux distribution yet, so I used the pre-compiled zellij-x86_64-unknown-linux-musl.tar.gz binary from the Releases page here on Microsoft GitHub.

@imsnif
Member

imsnif commented Aug 28, 2023

@kseistrup - thanks for giving this a test so quickly. I spent some time on this issue in the past couple of months, and for me the fixes I made solved the problem with the separate btop tab. I'm sorry to hear it didn't for you.

I am growing suspicious that this is caused by the behavior of the Rust allocator, which might be a bit too reluctant to release memory here and there. If you are comfortable compiling the repo from source and giving it a try, I'm wondering if you'd be able to reproduce the issue with a different allocator: main...jemalloc

EDIT: You might want to wait a day until I finish updating CONTRIBUTING.md with new build instructions since we added some stuff. Unless you don't mind the build-tool error-and-install game.

@kseistrup

@imsnif

I'm not sure how a btop tab can be an issue here and not there. Does btop show colours when you run it? I always imagined that zellij blew up because of the gazillions of ANSI colour codes that btop is spewing out. But perhaps I lack imagination…

I also thought I wouldn't be able to compile zellij myself because it needed newer features than ArchLinux' rust version provides, but it seems I can compile it, and it's only two tiny changes. I'll be back when I know more.

@imsnif
Member

imsnif commented Aug 28, 2023

> I'm not sure how a btop tab can be an issue here and not there. Does btop show colours when you run it? I always imagined that zellij blew up because of the gazillions of ANSI colour codes that btop is spewing out. But perhaps I lack imagination…

Honestly I'm also baffled. I've been around this area with a very fine comb and the only thing I found to help is the linked fix in the recent version. We keep the colors in a stack allocated structure that's essentially an elaborate "terminal character configuration". It's the same size for each character whether you have styles or not (and it's really not that big).

I thought this was somehow us not dropping (eventually deallocating) the terminal lines, but that also wasn't it. The particularities of this issue across machines and setups are the reason I currently suspect the allocator. Let's see.

@kseistrup

The jemalloc variant started out promising: it immediately climbed from 132 MB to 162 MB and stayed there for almost 20 minutes, so that looked great. But after one hour we're at 226 MB and climbing with every loop. 😒

I'll let it run until tomorrow to see if some sort of garbage collection takes place at some point.

@kseistrup

PS: I believe I'm running wild in B, kB and MB: ps(1) shows its values in kB, doesn't it? So when I said earlier that we're losing 2653 bytes per loop, I think I meant 2653 kB. But whatever the unit, we're losing memory for every loop (at the rate shown by ps, whichever unit that really is). Sorry for being unclear.
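
To make the unit question concrete, here is a small worked sketch of the per-loop arithmetic. It assumes the RSS column of ps is in KiB (1024 bytes), which is what the procps ps(1) man page documents. The function names are hypothetical.

```rust
/// Average change in RSS per sampling interval, in KiB.
/// Needs at least two samples; the result is negative if memory shrank.
fn average_growth_kib(rss_kib: &[u64]) -> Option<f64> {
    if rss_kib.len() < 2 {
        return None;
    }
    let first = *rss_kib.first()? as f64;
    let last = *rss_kib.last()? as f64;
    let intervals = (rss_kib.len() - 1) as f64;
    Some((last - first) / intervals)
}

/// Converts a KiB figure, as printed by ps, to bytes.
fn kib_to_bytes(kib: f64) -> f64 {
    kib * 1024.0
}
```

Applied to the first three RSS samples quoted earlier in this thread (125172, 128628, 130676), this gives (130676 - 125172) / 2 = 2752 KiB per loop, or about 2.8 million bytes, which is the same order of magnitude as the corrected 2653 kB figure.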

@tlinford
Contributor

I may have found something - the key seems to be having the tab open but not being rendered!

@kseistrup would you be able to give #2745 a try?

I did a quick test and got:

@kseistrup

@tlinford

I'm not a developer so I hope I got it right, but I believe I have your branch up and running now. I'll be back as soon as I see a trend.

@imsnif

Even the jemalloc approach blew up during the night: I woke up to a memory consumption that was 10 times what it was when I went to bed.

@kseistrup

TL;DR: The @tlinford branch does seem to act differently, but it is still leaking memory here.

First I thought it had stabilised at 154_632 (whatever unit ps serves), then at 169_196, before it seemed to settle at 247_544 — at which point I was going to report the results. But after more than 2 hours it climbed to 270_236. For unrelated reasons I had to stop the experiment at that time, and when I took down btop the memory consumption immediately dropped to 157_956.

It doesn't seem to matter which tab is active: in stabler periods it doesn't change anything to switch away from btop and in more unstable periods it doesn't change anything to switch to btop. In short, I can only reliably correlate memory changes to running btop or not.

I hope you guys can make more sense of it than I can.

@tlinford
Contributor

Thanks for checking it out! I'll keep investigating :)

@kseistrup

It's really a strange bug. I have been monitoring another instance (with your branch) since I wrote my previous reply: it rose rather fast to 303908 and has stayed there since. I'm running the same tabs each time, so I find it very strange that the behaviour can be so different… Computer programs are never creative, so there must be something that triggers the leak.

@imsnif
Member

imsnif commented Aug 29, 2023

> Computer programs are never creative, so there must be something that triggers the leak.

This is what made me suspect the allocator, which in Rust's case doesn't immediately return memory to the OS when the relevant data is dropped, and so can be particular to the whole state of its memory space (the screen thread in this case). But it's really just guessing. We're chipping away at this (I think @tlinford found a really interesting leak in the output buffer) and I trust him to find whatever can be found here.

We might eventually "explain" this away, but I think it's a good idea to try and track down every stray byte we cannot explain at the very least. Thank you very much for helping us out @kseistrup !

@kseistrup

I had no luck with a long-running instance of @tlinford's branch: over 75 minutes it climbed steadily from 130_748 to 303_908 where it stayed for 3 hours and then climbed on. When I woke up this morning it had reached 1_421_400. So same pattern as before.

The terminator terminal emulator is VTE3 based. Grasping at straws, I tried the usual combo in a good ol' xterm (to see if the leak could be related to the way zellij communicates with/inside VTE3), only to observe the usual (memory-leaking) behaviour.

@tgross35

If this is still an issue: is self.alternate_screen_state ever Some before the following code runs?

let current_lines_above =
    std::mem::replace(&mut self.lines_above, VecDeque::new());
let current_viewport =
    std::mem::replace(&mut self.viewport, vec![Row::new().canonical()]);
let current_cursor = std::mem::replace(
    &mut self.cursor,
    Cursor::new(0, 0, self.styled_underlines),
);
let sixel_image_store = self.sixel_grid.sixel_image_store.clone();
let alternate_sixelgrid = std::mem::replace(
    &mut self.sixel_grid,
    SixelGrid::new(self.character_cell_size.clone(), sixel_image_store),
);
self.alternate_screen_state = Some(AlternateScreenState::new(
    current_lines_above,
    current_viewport,
    current_cursor,
    alternate_sixelgrid,
));

If so, it seems like a lot of allocations could be saved: instead of replacing the fields with new ones via mem::replace, just .clear() the fields in self.alternate_screen_state and then mem::swap them with self.lines_above, self.viewport, etc. That way it probably doesn't need to free and then reallocate every time that code gets hit.

Those were touched in #2675, so I'm assuming that is a hot-ish loop. This change would make no difference if there is actually a leak, but might give the allocator a chance to catch up.
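
The suggestion above can be sketched roughly as follows. This is a toy model, not zellij's code: `ToyGrid`, its fields, and both methods are hypothetical simplifications (rows are plain `String`s, and the cursor and sixel state are left out). It only shows that clearing the previously saved buffers and swapping them in keeps their capacity, whereas `mem::replace` with a new collection starts again from an empty one.

```rust
use std::collections::VecDeque;
use std::mem;

/// Hypothetical, heavily simplified stand-in for a terminal grid.
struct ToyGrid {
    lines_above: VecDeque<String>,
    viewport: Vec<String>,
    /// Buffers saved while the alternate screen is active.
    alternate: Option<(VecDeque<String>, Vec<String>)>,
}

impl ToyGrid {
    /// Moves the live buffers into `alternate` and installs brand-new,
    /// empty ones, which have to grow again from scratch.
    fn enter_alternate_replace(&mut self) {
        let lines = mem::replace(&mut self.lines_above, VecDeque::new());
        let viewport = mem::replace(&mut self.viewport, Vec::new());
        self.alternate = Some((lines, viewport));
    }

    /// Reuses previously saved buffers when they exist: clear them
    /// (their capacity is kept), then swap them with the live ones.
    fn enter_alternate_swap(&mut self) {
        if let Some((saved_lines, saved_viewport)) = self.alternate.as_mut() {
            saved_lines.clear();
            saved_viewport.clear();
            mem::swap(saved_lines, &mut self.lines_above);
            mem::swap(saved_viewport, &mut self.viewport);
        } else {
            // Nothing to reuse yet, so fall back to the replace variant.
            self.enter_alternate_replace();
        }
    }
}
```

Whether this matters in practice depends on how often the real code path is hit and on whether stale saved state can exist at that point, which is exactly the question raised above.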

@imsnif
Member

imsnif commented Feb 29, 2024

@tlinford did a thorough investigation of this a couple of months ago. Maybe he can help us out with a quick writeup?

@kseistrup

@imsnif, @tgross35

I still see a minor leak when I run btop in a zellij tab. It's in the ballpark of 100-200 bytes per hour, so it's nothing near what I have seen or reported earlier, and I frankly didn't have the time or energy to re-report it, so I had decided to just live with it and restart zellij every now and then. But now that someone is commenting on it, so will I. I haven't saved any data, though, so this comment is all I can offer for now.

@imsnif
Member

imsnif commented Feb 29, 2024

Thanks @kseistrup - @tlinford showed me graphs that show the memory being released after a few days. I know he's very busy, so I'm going to give him some time to write this up here.

@tlinford
Contributor

tlinford commented Mar 5, 2024

So, it's been a while now since I looked at this, but I ran an experiment and gathered some data on this (thanks for being patient @imsnif).

At least for me the results were quite eye opening, and challenged what I thought I knew about memory (probably not a lot to start with 😄).

I started a couple of different sessions on a VPS, and left them running for some days, all using this layout:

layout {
    tab_template name="default-tab" {
        pane size=1 borderless=true {
            plugin location="zellij:compact-bar"
        }
        children
    }
    default-tab name="Btop" {
        pane command="btop"
    }
    default-tab name="Empty" {
        pane
    }
}

A couple of times I logged back in, reattached to each, and then left them detached again.

This chart shows the memory usage over time (logged using @kseistrup's script) of these sessions, where:

  • polite-mouse: release binary for v0.38.1 from github, using musl allocator
  • awesome-tomato: main branch (697723d) compiled with jemalloc as allocator
  • verdant-lake: main branch (697723d) compiled with mimalloc as allocator

[Image: Memory Usage Chart]

Note: the peak for polite-mouse right at the end was me reattaching.

What I found surprising in all cases is that the memory usage takes quite a long time to go down from the initial peak, but does not seem to grow in a significant way afterwards (although the jemalloc build was trending upwards, unlike the other two). So it really looks like this isn't a memory leak with the zellij + btop combo but rather an allocator related issue (unless my scenario is missing some important interaction).

Another thing to note is that, depending on how a user builds zellij, one could end up with a program that uses one of multiple allocators, and each of these has different algorithms and behaviors:

  • musl if using the official Linux binaries from GitHub
  • glibc (probably?) if building from source on Linux (e.g. via cargo install)
  • the macOS allocator on Mac

I'm hoping that someone more knowledgeable about allocators can also provide some insights into this, but for now my take is that one can't just look at ps output on its own to say whether there is or isn't a memory leak. There could be other factors at play (like heap fragmentation and other allocator internals).

There's also been some substantial changes to the code since I ran this, so it might be worth giving another experiment like this a try.

@kseistrup

@tlinford The memory jump when re-attaching is interesting. I have only ever seen numbers from an attached session, as I “never” detach a session. (I'm using the zellij binary provided by ArchLinux, and it is using glibc.)
