Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework the way updates work to fix #3584 #3603

Closed
wants to merge 1 commit into from

Conversation

zenhack
Copy link
Collaborator

@zenhack zenhack commented Jan 25, 2022

Rather than having the update monitor itself exec the newly downloaded
version of sandstorm, it simply exits (with status 0), and its parent
process takes this as an indictation that it should re-spawn it (with
the new version). This avoids the need to map ourselves to UID 0 in the
user namespace, which avoids the EPERM error we were getting when
later hiding the real UID from a grain.

Per discussion on the issue, the old design no longer works as of Linux
5.12, after which spawning the grain we would need CAP_SETFCAP, which we
do not want to have.


This seems to basically work, but I'm marking it as a draft for now for a couple reasons:

  • It touches the update monitor, so I want to test it very thoroughly before merging; an update that broke updates would be very bad.
  • There are still some test failures, and while I expected that (there were known failures before we started having this problem) I want to stare at them a bit to make sure I understand what's going on.
  • There is a minor issue where when stopping sandstorm, a message is spuriously printed stating that sandstorm exited abnormally, even when it did not. Doesn't really cause problems, but I should fix it.

Rather than having the update monitor itself exec the newly downloaded
version of sandstorm, it simply exits (with status 0), and its parent
process takes this as an indictation that it should re-spawn it (with
the new version). This avoids the need to map ourselves to UID 0 in the
user namespace, which avoids the EPERM error we were getting when
later hiding the real UID from a grain.

Per discussion on the issue, the old design no longer works as of Linux
5.12, after which spawning the grain we would need CAP_SETFCAP, which we
do not want to have.
@zenhack zenhack marked this pull request as draft January 25, 2022 06:51
@ocdtrekkie
Copy link
Collaborator

These (found in your Action results) existed prior to the large number of account failures, so I think we are good there, at least in getting back to where we used to be.

✖ tests/account #3239
✖ apps/web-publishing #3240

You can see there are two other tests which also fail intermittently, all tracked here: https://github.com/sandstorm-io/sandstorm/milestone/19

@ocdtrekkie ocdtrekkie added bug sandstorm-dev Issues hacking on Sandstorm labels Jan 26, 2022
@zenhack
Copy link
Collaborator Author

zenhack commented Jan 26, 2022

Ok, yeah, it looks like we're good then. I was still seeing the appHooks failure locally and I couldn't remember if that one was one of the existing ones.

@zenhack
Copy link
Collaborator Author

zenhack commented Jan 26, 2022

Might make sense to add this/#3584 to that milestone.

@ocdtrekkie
Copy link
Collaborator

Done.

@zenhack
Copy link
Collaborator Author

zenhack commented Jan 29, 2022

I can't seem to reproduce the error on stop anymore; maybe I fixed it and forgot, or maybe I was running a stale version when I last tested. In any case, it appears to work.

I still want to do a bit more testing; I patched the URLs used for updates and checked a bunch of stuff manually, both for a system wide install and just running the test configuration as an unprivileged user:

  • The manual sandstorm update command works as expected.
  • We still check for and download updates 10m after startup
  • We check for an update after an hour if a previous check fails.

I still want to check:

  • A valid, correctly signed update goes all the way through.
  • Afterwards, update checks happen after 24 hours.

I'll need to futz around with the signing code to check this, and then it's mostly a matter of backgrounding it and checking back.

I'm being way more careful about this patch than I normally would be; I'm pretty confident it's fine, but breaking automatic updates would be very very bad, so I'm being really pedantic about this.

At some point I'd like to have some actual automated tests for this stuff, but that will have to wait.

zenhack added a commit to zenhack/sandstorm that referenced this pull request Jan 29, 2022
I encountered this while testing sandstorm-io#3603; as written, the `update-tool add`
subcommand does not print out the correct public key for the keypair
that it generates; instead, it prints out the seed used to generate the
keypair. This patch changes the code to actually generate the keypair
from the seed, and then print out the public key that should be added
to update-tool.capnp

I assume this code was not used to generate the current contents of that
file, since if it was loading the keyring would fail (which is the
problem I encountered).
@zenhack zenhack marked this pull request as ready for review January 30, 2022 21:27
@zenhack
Copy link
Collaborator Author

zenhack commented Jan 30, 2022

Alright, finished with the above checks and everything seems to work; marking ready for review.

@ocdtrekkie ocdtrekkie added the ready-for-review We think this is ready for review label Jan 30, 2022
@zenhack zenhack mentioned this pull request Jan 31, 2022
@ocdtrekkie ocdtrekkie linked an issue Jan 31, 2022 that may be closed by this pull request
@kentonv
Copy link
Member

kentonv commented Feb 12, 2022

Trying to make sure I understand what's going on...

So previously:

  • The top-level long-lived process of a Sandstorm instance was the "update monitor" process.
  • We went out of our way to have this process run as PID 1 of a PID namespace, so that killing it would kill the entire Sandstorm instance.
  • When not root, to create a PID namespace, we have to create a user namespace first. So the update monitor was also in its own user namespace. However, no other namespaces are set up at this point: it's just PID and userns.
  • One of the children of the update monitor process is the "server monitor", which represents the actual Sandstorm instance. This process actually arranges unsharing all the other namespaces, setting up a mount namepsace with just the current version of Sandstorm mapped in, etc.
  • Hence, the server monitor needs to inherit all the capabilities to set up namespaces, which were obtained by creating the user namespace.
  • But there's a wrinkle: Upon updates, the update monitor process exec()s the new version. If the process isn't UID 0, then this exec() will drop all capabilities, and the new instance's server monitor will be unable to set up any namespaces.
  • Hence, the update monitor sets itself to UID 0 within its userns.

However, now you effectively cannot set yourself to UID 0 in a userns anymore.

Your solution is to create yet another level of process hierarchy: A parent process to the update monitor itself. This parent process stays entirely outside the user namespace and PID namespace. When it sees the update monitor exit, it restarts it.

A major problem with this approach is that the parent process itself never updates. So, we can't ever change anything about it. But also, I think it presents a problem on the first update: Existing sandstorm servers today don't have this parent process running. When they update to this logic the first time, they will end up with an update monitor that expects a parent process but doesn't have one. So the next update after that will, I think, cause the server to exit.

Here are two other solutions to consider:

  1. Adjust things so that the update monitor itself does not run in any namespaces, and so the server monitor ends up being the root of the PID namespace and user namespace. We'd need to be careful about ensuring that if the update monitor dies, the server monitor dies too, rather than being reparented -- perhaps prctl(PR_SET_PDEATHSIG) can help here. (There is also a significant complication here in that already-running servers on previous versions are already in a user namespace with UID 0. They are presumably running on kernels where this works fine, but we'll need the code to account for this possibility.)
  2. Just create an additional level of userns inside of the server monitor. This way it doesn't matter if the capabilities from the update monitor's userns have been lost, because they are regained upon creating another userns. So, the update monitor no longer needs to map itself to UID 0. This solution may feel icky (so many layers of userns) but is probably fine in practice.

I'm inclined to say we should try the extra-userns approach. If it works, it's a much narrower change. In fact, in this approach, we don't have to change anything for the common case where the server runs as root -- we only add some extra new code to the start of the server monitor for the non-root case. Additionally, this approach doesn't touch the update monitor at all, so in the case that it's terribly broken, we can put another update to fix it -- no servers will need human intervention to fix. So, that feels much less risky.

Thoughts?

@ocdtrekkie ocdtrekkie removed the ready-for-review We think this is ready for review label Feb 12, 2022
@zenhack
Copy link
Collaborator Author

zenhack commented Feb 13, 2022

(2) does seem much simpler. Implemented in #3609, closing.

@zenhack zenhack closed this Feb 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug sandstorm-dev Issues hacking on Sandstorm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Permissions error causing test failures
3 participants