Rework the way updates work to fix #3584 #3603

zenhack · 2022-01-25T06:51:17Z

Rather than having the update monitor itself exec the newly downloaded
version of sandstorm, it simply exits (with status 0), and its parent
process takes this as an indictation that it should re-spawn it (with
the new version). This avoids the need to map ourselves to UID 0 in the
user namespace, which avoids the EPERM error we were getting when
later hiding the real UID from a grain.

Per discussion on the issue, the old design no longer works as of Linux
5.12, after which spawning the grain we would need CAP_SETFCAP, which we
do not want to have.

This seems to basically work, but I'm marking it as a draft for now for a couple reasons:

It touches the update monitor, so I want to test it very thoroughly before merging; an update that broke updates would be very bad.
There are still some test failures, and while I expected that (there were known failures before we started having this problem) I want to stare at them a bit to make sure I understand what's going on.
There is a minor issue where when stopping sandstorm, a message is spuriously printed stating that sandstorm exited abnormally, even when it did not. Doesn't really cause problems, but I should fix it.

Rather than having the update monitor itself exec the newly downloaded version of sandstorm, it simply exits (with status 0), and its parent process takes this as an indictation that it should re-spawn it (with the new version). This avoids the need to map ourselves to UID 0 in the user namespace, which avoids the EPERM error we were getting when later hiding the real UID from a grain. Per discussion on the issue, the old design no longer works as of Linux 5.12, after which spawning the grain we would need CAP_SETFCAP, which we do not want to have.

ocdtrekkie · 2022-01-26T22:18:57Z

These (found in your Action results) existed prior to the large number of account failures, so I think we are good there, at least in getting back to where we used to be.

✖ tests/account #3239
✖ apps/web-publishing #3240

You can see there are two other tests which also fail intermittently, all tracked here: https://github.com/sandstorm-io/sandstorm/milestone/19

zenhack · 2022-01-26T23:50:21Z

Ok, yeah, it looks like we're good then. I was still seeing the appHooks failure locally and I couldn't remember if that one was one of the existing ones.

zenhack · 2022-01-26T23:51:40Z

Might make sense to add this/#3584 to that milestone.

ocdtrekkie · 2022-01-27T00:22:15Z

Done.

zenhack · 2022-01-29T07:23:16Z

I can't seem to reproduce the error on stop anymore; maybe I fixed it and forgot, or maybe I was running a stale version when I last tested. In any case, it appears to work.

I still want to do a bit more testing; I patched the URLs used for updates and checked a bunch of stuff manually, both for a system wide install and just running the test configuration as an unprivileged user:

The manual sandstorm update command works as expected.
We still check for and download updates 10m after startup
We check for an update after an hour if a previous check fails.

I still want to check:

A valid, correctly signed update goes all the way through.
Afterwards, update checks happen after 24 hours.

I'll need to futz around with the signing code to check this, and then it's mostly a matter of backgrounding it and checking back.

I'm being way more careful about this patch than I normally would be; I'm pretty confident it's fine, but breaking automatic updates would be very very bad, so I'm being really pedantic about this.

At some point I'd like to have some actual automated tests for this stuff, but that will have to wait.

I encountered this while testing sandstorm-io#3603; as written, the `update-tool add` subcommand does not print out the correct public key for the keypair that it generates; instead, it prints out the seed used to generate the keypair. This patch changes the code to actually generate the keypair from the seed, and then print out the public key that should be added to update-tool.capnp I assume this code was not used to generate the current contents of that file, since if it was loading the keyring would fail (which is the problem I encountered).

zenhack · 2022-01-30T21:27:37Z

Alright, finished with the above checks and everything seems to work; marking ready for review.

kentonv · 2022-02-12T19:39:03Z

Trying to make sure I understand what's going on...

So previously:

The top-level long-lived process of a Sandstorm instance was the "update monitor" process.
We went out of our way to have this process run as PID 1 of a PID namespace, so that killing it would kill the entire Sandstorm instance.
When not root, to create a PID namespace, we have to create a user namespace first. So the update monitor was also in its own user namespace. However, no other namespaces are set up at this point: it's just PID and userns.
One of the children of the update monitor process is the "server monitor", which represents the actual Sandstorm instance. This process actually arranges unsharing all the other namespaces, setting up a mount namepsace with just the current version of Sandstorm mapped in, etc.
Hence, the server monitor needs to inherit all the capabilities to set up namespaces, which were obtained by creating the user namespace.
But there's a wrinkle: Upon updates, the update monitor process exec()s the new version. If the process isn't UID 0, then this exec() will drop all capabilities, and the new instance's server monitor will be unable to set up any namespaces.
Hence, the update monitor sets itself to UID 0 within its userns.

However, now you effectively cannot set yourself to UID 0 in a userns anymore.

Your solution is to create yet another level of process hierarchy: A parent process to the update monitor itself. This parent process stays entirely outside the user namespace and PID namespace. When it sees the update monitor exit, it restarts it.

A major problem with this approach is that the parent process itself never updates. So, we can't ever change anything about it. But also, I think it presents a problem on the first update: Existing sandstorm servers today don't have this parent process running. When they update to this logic the first time, they will end up with an update monitor that expects a parent process but doesn't have one. So the next update after that will, I think, cause the server to exit.

Here are two other solutions to consider:

Adjust things so that the update monitor itself does not run in any namespaces, and so the server monitor ends up being the root of the PID namespace and user namespace. We'd need to be careful about ensuring that if the update monitor dies, the server monitor dies too, rather than being reparented -- perhaps prctl(PR_SET_PDEATHSIG) can help here. (There is also a significant complication here in that already-running servers on previous versions are already in a user namespace with UID 0. They are presumably running on kernels where this works fine, but we'll need the code to account for this possibility.)
Just create an additional level of userns inside of the server monitor. This way it doesn't matter if the capabilities from the update monitor's userns have been lost, because they are regained upon creating another userns. So, the update monitor no longer needs to map itself to UID 0. This solution may feel icky (so many layers of userns) but is probably fine in practice.

I'm inclined to say we should try the extra-userns approach. If it works, it's a much narrower change. In fact, in this approach, we don't have to change anything for the common case where the server runs as root -- we only add some extra new code to the start of the server monitor for the non-root case. Additionally, this approach doesn't touch the update monitor at all, so in the case that it's terribly broken, we can put another update to fix it -- no servers will need human intervention to fix. So, that feels much less risky.

Thoughts?

zenhack · 2022-02-13T06:42:43Z

(2) does seem much simpler. Implemented in #3609, closing.

zenhack marked this pull request as draft January 25, 2022 06:51

zenhack mentioned this pull request Jan 25, 2022

Permissions error causing test failures #3584

Closed

ocdtrekkie added bug sandstorm-dev Issues hacking on Sandstorm labels Jan 26, 2022

zenhack mentioned this pull request Jan 29, 2022

Fix a bug in the update tool's key generation. #3605

Merged

zenhack marked this pull request as ready for review January 30, 2022 21:27

ocdtrekkie added the ready-for-review We think this is ready for review label Jan 30, 2022

zenhack mentioned this pull request Jan 31, 2022

Fix web-publishing test #3606

Merged

ocdtrekkie linked an issue Jan 31, 2022 that may be closed by this pull request

Permissions error causing test failures #3584

Closed

ocdtrekkie removed the ready-for-review We think this is ready for review label Feb 12, 2022

zenhack closed this Feb 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework the way updates work to fix #3584 #3603

Rework the way updates work to fix #3584 #3603

zenhack commented Jan 25, 2022 •

edited

Loading

ocdtrekkie commented Jan 26, 2022

zenhack commented Jan 26, 2022

zenhack commented Jan 26, 2022

ocdtrekkie commented Jan 27, 2022

zenhack commented Jan 29, 2022 •

edited

Loading

zenhack commented Jan 30, 2022 •

edited

Loading

kentonv commented Feb 12, 2022

zenhack commented Feb 13, 2022

Rework the way updates work to fix #3584 #3603

Rework the way updates work to fix #3584 #3603

Conversation

zenhack commented Jan 25, 2022 • edited Loading

ocdtrekkie commented Jan 26, 2022

zenhack commented Jan 26, 2022

zenhack commented Jan 26, 2022

ocdtrekkie commented Jan 27, 2022

zenhack commented Jan 29, 2022 • edited Loading

zenhack commented Jan 30, 2022 • edited Loading

kentonv commented Feb 12, 2022

zenhack commented Feb 13, 2022

zenhack commented Jan 25, 2022 •

edited

Loading

zenhack commented Jan 29, 2022 •

edited

Loading

zenhack commented Jan 30, 2022 •

edited

Loading