systemd-logind - memory leak on SSH connections #8015

Open
guilhermepiccoli opened this Issue Jan 26, 2018 · 11 comments


Submission type

  • Bug report

systemd version the issue has been seen with

Upstream version (at commit 6cddc79).
But it's reproducible in much older versions, tested with 204 and 229 too.

Used distribution

Ubuntu 18.04 (Bionic) - I've built the upstream version of systemd on top of it.

Bug description

systemd-logind shows a clear memory leak on every SSH connection. Each connection allocates some memory, and that memory stays allocated even after the SSH session is disconnected - the code seems to lack a free somewhere.

I have been made aware of an OOM case in systemd-logind, caused by its huge memory footprint after some weeks of machine uptime.

Valgrind analysis showed many potential leaks in session creation routines:
valgrind_bionic.txt

Steps to reproduce the problem

To reproduce the issue and expose the memory leak, a simple ssh loop is enough:

while true; do ssh <hostname> "whoami" 1>/dev/null; done

A 5min run led to these results:

[Graph: upstream_anon]

The numbers of the graph:
upstreamAnon.txt

We measured the anonymous pages of systemd-logind from the process's /proc/<pid>/smaps.
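
For anyone wanting to repeat the measurement, here is a minimal sketch of how the anonymous-page total could be sampled. It assumes that what was summed is the per-mapping `Anonymous:` field of the smaps file (reported in kB). The helper name `anon_kb` is hypothetical and is not taken from the original measurement.

```shell
#!/bin/sh
# Sum every "Anonymous:" field (in kB) of an smaps file.
# anon_kb is a hypothetical helper name, not part of systemd.
# "AnonHugePages:" lines are not counted: the pattern is anchored
# to lines that start with "Anonymous:".
anon_kb() {
    awk '/^Anonymous:/ { sum += $2 } END { print sum + 0 }' "$1"
}

# Example (typically needs root to read another process's smaps):
#   anon_kb "/proc/$(pidof systemd-logind)/smaps"
```

Sampling this value once per second while the ssh loop runs should give a series comparable to the attached numbers, if the assumption about the summed field holds.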

poettering Jan 27, 2018

Owner

Hmm, so valgrind output is generated during normal runtime when the process is abnormally terminated by SIGINT. It shows all memory allocated at that time, which is different from leaked memory...
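
To illustrate the distinction: Valgrind's Memcheck ends its report with a LEAK SUMMARY that separates blocks that are "definitely lost" (no pointer to them remains) from blocks that are "still reachable" (still referenced when the process ended). The sketch below pulls those two figures out of a saved log, assuming the standard summary line format. The helper name `leak_summary` is made up for illustration.

```shell
#!/bin/sh
# Print the "definitely lost" and "still reachable" lines of a
# Valgrind log, with the "==PID==" prefix stripped.
# leak_summary is a hypothetical helper name.
# Input lines look like:
#   ==1234==    definitely lost: 1,024 bytes in 2 blocks
leak_summary() {
    grep -E 'definitely lost:|still reachable:' "$1" |
        sed -E 's/^==[0-9]+== +//'
}

# Example:
#   leak_summary valgrind_bionic.txt
```

Memory that is still referenced but never released while the daemon runs would also be reported as still reachable, so the summary alone cannot settle whether there is a leak.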

guilhermepiccoli Jan 27, 2018

I think I understand what you're saying, and I disagree. Correct me if I got it wrong:
you're saying that Valgrind is measuring all the memory allocated during the process lifetime, and since we are terminating the process in an abnormal way (SIGINT), that memory isn't freed. The way you put it suggests that systemd-logind would free all memory on a regular program termination. Is my understanding of your statement right?

Well, the reason I consider this clearly wrong behavior is simple: systemd-logind will consume all the memory of a machine if we keep making SSH connections during its lifetime. This behavior shouldn't be acceptable, do you agree? If all applications behaved this way, we couldn't keep a machine running for 24 or 48h, because the applications would end up getting OOM'ed all the time.
As the graph (and data) showed, the memory consumption of logind is continuously increasing...

It is considered a leak if you allocate memory, use it and don't free it within a reasonable time. What is the point of freeing all the memory at the end of a program while letting it consume all the RAM of a machine during the lifetime of the process? Basically it's the same as saying "this program should be terminated on a regular basis or it'll break your system" heheh

poettering Jan 28, 2018

Owner

Well, I am not saying there wasn't a leak somewhere, I am just saying that the tool you used (or specifically, the way you used it) is not useful for finding it...

What does "loginctl" actually report when this happens? How many open sessions?

guilhermepiccoli Jan 29, 2018

Thanks for your clarification Lennart!

I did the following experiment: I ran the "while true; ssh" loop for 1 minute, then captured the output of loginctl:

loginctl_1min.txt

Then I waited another 9 minutes and captured the output of loginctl again. I was hoping the sessions would be cleared by some delayed mechanism (something along the lines of garbage collection), but the results were the same:

loginctl_10min.txt

Cheers,

Guilherme

boucman Jan 29, 2018

Contributor

I had that behaviour once, which was due to an upgrade of logind without reboot of the machine (just saying in case it helps diagnose...)

guilhermepiccoli Jan 29, 2018

Thanks boucman... in my case it's consistent: you can start a machine, run the aforementioned ssh loop, and you'll see the continuous increase in RAM usage.

BTW, I noticed that the sessions created by the ssh loop are kept in the "closing" state - what prevents them from being released? It seems to me that if a timeout were triggered after a session had been in the closing state for a while, and the session were then removed, we wouldn't see the memory issue.
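
A quick way to see how many sessions are stuck in each state is sketched below. It assumes `loginctl list-sessions --no-legend` prints the session ID in the first column and that `loginctl show-session <id> -p State` prints a line of the form `State=closing`. The helper names `session_states` and `tally_states` are hypothetical.

```shell
#!/bin/sh
# Tally state names read from stdin (one per line) and print
# "state count" pairs, sorted by state name.
tally_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Print the State property of every session logind knows about.
# Sessions that disappear between the two loginctl calls are skipped.
session_states() {
    loginctl list-sessions --no-legend | awk '{ print $1 }' |
    while read -r id; do
        loginctl show-session "$id" -p State 2>/dev/null | cut -d= -f2
    done
}

# Example:
#   session_states | tally_states
```

Running it before and after the ssh loop would show whether the number of sessions in the "closing" state keeps growing.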

yuwata Jan 31, 2018

Member

Hmm... I cannot reproduce this (with recent snapshot of systemd on Fedora 27 x86_64)...

poettering Jan 31, 2018

Owner

@guilhermepiccoli it appears you are leaking full sessions. The question is of course why. If you look into those sessions with "loginctl session-status", what do you see? Is this in some container environment or so? Or anything else weird? Do those sessions possibly leave processes around? If so, we won't close them.

guilhermepiccoli Jan 31, 2018

yuwata, I was able to reproduce it using upstream systemd, which I built myself. Maybe the distro version is a bit different and somehow does not show the issue?

Lennart: I've been testing in an LXD container, but the issue also reproduces on a bare-metal system, I just re-checked. I'm using an Ubuntu 18.04 candidate with upstream systemd.

I proposed a pull request that fixed it for me: #8062
I'm not sure how to relate issues/pull requests in GitHub, feel free to do it your way.
Thanks,

Guilherme

poettering Jan 31, 2018

Owner

> Lennart: I've been testing using a LXD container, but the issue reproduces on bare-metal system, just re-checked.

cgroup empty notifications are not reliable inside containers, hence the LXD and the baremetal case are actually very different. Before looking into the LXD case I'd hence focus on the baremetal case.

guilhermepiccoli Jan 31, 2018

Thanks for the hint! I'll focus on bare-metal then
