
Get Riju back up #92

Closed
raxod502 opened this issue Aug 8, 2021 · 3 comments


raxod502 commented Aug 8, 2021

As per the status page, Riju is currently down:

[screenshot: status page showing Riju as down]

This issue tracks the needed work to get it back up. I started by doing two things, each of which was a fair bit of work:

  • Use a system-level cgroup to limit the total resources consumed by all user containers together, thus guaranteeing that Riju itself will never run out of resources (see the sketch after this list).
  • Configure persistent logging using Promtail and Loki so that I can actually get some visibility into anomalous behavior.
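
For concreteness, here is a minimal sketch of the system-level cgroup idea, assuming cgroup v2 with the memory and cpu controllers already enabled for the parent hierarchy. The `/sys/fs/cgroup/riju-containers` path and the 6 GiB / 4 CPU limits are made-up examples, not Riju's actual configuration:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write a single value into a cgroup control file. */
static void write_control(const char *dir, const char *file, const char *value) {
    char path[512];
    snprintf(path, sizeof path, "%s/%s", dir, file);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(value, f);
    fclose(f);
}

int main(void) {
    /* Hypothetical parent cgroup for all user containers; the path and the
       limit values below are examples only. */
    const char *cg = "/sys/fs/cgroup/riju-containers";

    if (mkdir(cg, 0755) != 0 && errno != EEXIST) { perror("mkdir"); return 1; }

    /* Cap the combined memory of every user container at 6 GiB... */
    write_control(cg, "memory.max", "6442450944\n");
    /* ...and the combined CPU at 4 cores (400 ms of CPU per 100 ms period). */
    write_control(cg, "cpu.max", "400000 100000\n");
    return 0;
}
```

User containers could then be started underneath this cgroup (for example via Docker's `--cgroup-parent` option) so that their combined usage counts against the single limit.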

Unfortunately, due to moby/moby#42704, it turns out that the resource constraints don't do anything. My current plan to mitigate this issue is to update the sentinel process running inside the container to allow commands to be fed in on stdin, with the output exposed through named pipes. This would allow us to execute processes inside the container without using docker exec, thus bypassing the linked Docker issue. I believe this would fix the current problem I'm seeing where Riju goes down almost immediately once it starts receiving traffic. (In testing, opening seven Python tabs is sufficient to bring it down, even without any other traffic.)
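
As a rough sketch of what that sentinel could look like (illustrative only; the output FIFO path and the one-command-per-line protocol are assumptions, not the actual implementation):

```c
/* Sketch of a sentinel that reads commands line by line from its own stdin
 * (kept open by the host, e.g. via `docker run -i`) and runs each one with
 * output redirected into a named pipe that the host reads from. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    const char *out_path = "/var/run/riju/out";  /* hypothetical output FIFO */
    char line[4096];

    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\n")] = '\0';        /* strip trailing newline */
        if (line[0] == '\0') continue;

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: route stdout/stderr through the output FIFO (this open
             * blocks until the host opens the FIFO for reading), then run
             * the requested command under a shell. */
            int out = open(out_path, O_WRONLY);
            if (out >= 0) { dup2(out, STDOUT_FILENO); dup2(out, STDERR_FILENO); close(out); }
            execl("/bin/sh", "sh", "-c", line, (char *)NULL);
            _exit(127);
        }
        waitpid(pid, NULL, 0);  /* sequential for simplicity; real code would multiplex */
    }
    return 0;
}
```

The key point is that the spawned processes are descendants of a process that already lives inside the container, so the container's resource limits apply to them without going through docker exec.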

This process has taken longer than I would like because of the interference of "real life", as it were---health issues and preexisting social engagements. However, I still want to get Riju back up as soon as I can manage.

raxod502 added the sev label Aug 8, 2021
raxod502 pinned this issue Aug 8, 2021
@autolisis

So is this due to possible malicious/spam usage by a few parties, or are we truly at capacity in terms of the infrastructure?

@raxod502
Member Author

It's most likely due to being at infrastructure capacity. As I note above, simply having seven Python tabs open at the same time is sufficient to exhaust resources. This is primarily because language servers are very greedy for memory.

@raxod502
Member Author

Riju is now back up. This is due to the following changes I made:

  • Work around the docker exec bug by setting up a sentinel process inside the container which receives commands over a named pipe and runs them inside the container, setting up additional named pipes for input and output. To make this work properly, I also needed to implement a pty frontend and backend in C (see the sketch after this list), which turned out to eliminate the dependency on node-pty since I had implemented a superset of its functionality. This was the most important fix, as it caused container resource limits to actually be applied to user processes.
  • Rewrite the frontend layout and CSS, in support of the next bullet points.
  • Time out sessions on the server side after 1 hour.
  • Put the frontend into an idle state after 15 minutes of inactivity, and do not restore broken connections until the next user action.
  • Don't enable LSP by default; instead, provide a button to turn it on.
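
For reference, the pty piece boils down to the standard forkpty-plus-relay pattern. A minimal sketch in C (illustrative only, not the code that replaced node-pty; assumes glibc's `forkpty` from `<pty.h>`, link with `-lutil`):

```c
/* Allocate a pseudoterminal, run a command on its slave side, and relay
 * bytes between the pty master and our own stdin/stdout. */
#include <poll.h>
#include <pty.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]);
        return 2;
    }

    int master;
    pid_t pid = forkpty(&master, NULL, NULL, NULL);
    if (pid < 0) { perror("forkpty"); return 1; }
    if (pid == 0) {
        /* Child: stdin/stdout/stderr are now the pty slave; run the command. */
        execvp(argv[1], argv + 1);
        perror("execvp");
        _exit(127);
    }

    /* Parent: shuttle bytes between our stdio and the pty master. */
    struct pollfd fds[2] = {
        { .fd = STDIN_FILENO, .events = POLLIN },
        { .fd = master,       .events = POLLIN },
    };
    char buf[4096];
    for (;;) {
        if (poll(fds, 2, -1) < 0) break;
        if (fds[0].revents & (POLLIN | POLLHUP)) {
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            if (n <= 0) break;
            if (write(master, buf, n) < 0) break;
        }
        if (fds[1].revents & (POLLIN | POLLHUP)) {
            ssize_t n = read(master, buf, sizeof buf);
            if (n <= 0) break;          /* child exited and closed the pty */
            if (write(STDOUT_FILENO, buf, n) < 0) break;
        }
    }
    return 0;
}
```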
