
Get Riju back up #92

Closed
raxod502 opened this issue Aug 8, 2021 · 3 comments


raxod502 commented Aug 8, 2021

As per the status page, Riju is currently down:

[screenshot: status page showing Riju as down]

This issue tracks the needed work to get it back up. I started by doing two things, each of which was a fair bit of work:

  • Use a system-level cgroup to limit the total resources consumed by all user containers together, thus guaranteeing that Riju itself will never run out of resources (see the sketch after this list).
  • Configure persistent logging using Promtail and Loki so that I can actually get some visibility into anomalous behavior.
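
For concreteness, here is a minimal sketch of the system-level cgroup idea, assuming cgroup v2 with the memory and cpu controllers already enabled for the parent hierarchy. The `/sys/fs/cgroup/riju-containers` path and the 6 GiB / 4 CPU limits are made-up examples, not Riju's actual configuration:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write a single value into a cgroup control file. */
static void write_control(const char *dir, const char *file, const char *value) {
    char path[512];
    snprintf(path, sizeof path, "%s/%s", dir, file);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(value, f);
    fclose(f);
}

int main(void) {
    /* Hypothetical parent cgroup for all user containers; the path and the
       limit values below are examples only. */
    const char *cg = "/sys/fs/cgroup/riju-containers";

    if (mkdir(cg, 0755) != 0 && errno != EEXIST) { perror("mkdir"); return 1; }

    /* Cap the combined memory of every user container at 6 GiB... */
    write_control(cg, "memory.max", "6442450944\n");
    /* ...and the combined CPU at 4 cores (400 ms of CPU per 100 ms period). */
    write_control(cg, "cpu.max", "400000 100000\n");
    return 0;
}
```

User containers could then be started underneath this cgroup (for example via Docker's `--cgroup-parent` option) so that their combined usage counts against the single limit.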

Unfortunately, due to moby/moby#42704, it turns out that the resource constraints don't do anything. My current plan to mitigate this issue is to update the sentinel process running inside the container to allow commands to be fed in on stdin, with the output exposed through named pipes. This would allow us to execute processes inside the container without using docker exec, thus bypassing the linked Docker issue. I believe this would fix the current problem I'm seeing where Riju goes down almost immediately once it starts receiving traffic. (In testing, opening seven Python tabs is sufficient to bring it down, even without any other traffic.)
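
As a rough sketch of what that sentinel could look like (illustrative only; the output FIFO path and the one-command-per-line protocol are assumptions, not the actual implementation):

```c
/* Sketch of a sentinel that reads commands line by line from its own stdin
 * (kept open by the host, e.g. via `docker run -i`) and runs each one with
 * output redirected into a named pipe that the host reads from. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    const char *out_path = "/var/run/riju/out";  /* hypothetical output FIFO */
    char line[4096];

    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\n")] = '\0';        /* strip trailing newline */
        if (line[0] == '\0') continue;

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: route stdout/stderr through the output FIFO (this open
             * blocks until the host opens the FIFO for reading), then run
             * the requested command under a shell. */
            int out = open(out_path, O_WRONLY);
            if (out >= 0) { dup2(out, STDOUT_FILENO); dup2(out, STDERR_FILENO); close(out); }
            execl("/bin/sh", "sh", "-c", line, (char *)NULL);
            _exit(127);
        }
        waitpid(pid, NULL, 0);  /* sequential for simplicity; real code would multiplex */
    }
    return 0;
}
```

The key point is that the spawned processes are descendants of a process that already lives inside the container, so the container's resource limits apply to them without going through docker exec.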

This process has taken longer than I would like because of the interference of "real life", as it were---health issues and preexisting social engagements. However, I still want to get Riju back up as soon as I can manage.

raxod502 added the sev label Aug 8, 2021
raxod502 pinned this issue Aug 8, 2021
@autolisis

So is this due to possible malicious/spam usage by a few parties, or are we truly at capacity in terms of the infrastructure?

@raxod502
Member Author

It's most likely due to being at infrastructure capacity. As I note above, simply having seven Python tabs open at the same time is sufficient to exhaust resources. This is primarily because language servers are very greedy for memory.

@raxod502
Member Author

Riju is now back up. This is due to the following changes I made:

  • Work around the docker exec bug by setting up a sentinel process inside the container which receives commands over a named pipe and runs them inside the container, setting up additional named pipes for input and output. To make this work properly, I also needed to implement a pty frontend and backend in C (see the sketch after this list), which turned out to eliminate the dependency on node-pty since I had implemented a superset of its functionality. This was the most important fix, as it caused container resource limits to actually be applied to user processes.
  • Rewrite the frontend layout and CSS, in support of the next bullet points.
  • Time out sessions on the server side after 1 hour.
  • Put the frontend into an idle state after 15 minutes of inactivity, and do not restore broken connections until the next user action.
  • Don't enable LSP by default; instead, provide a button to turn it on.
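
For reference, the pty piece boils down to the standard forkpty-plus-relay pattern. A minimal sketch in C (illustrative only, not the code that replaced node-pty; assumes glibc's `forkpty` from `<pty.h>`, link with `-lutil`):

```c
/* Allocate a pseudoterminal, run a command on its slave side, and relay
 * bytes between the pty master and our own stdin/stdout. */
#include <poll.h>
#include <pty.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]);
        return 2;
    }

    int master;
    pid_t pid = forkpty(&master, NULL, NULL, NULL);
    if (pid < 0) { perror("forkpty"); return 1; }
    if (pid == 0) {
        /* Child: stdin/stdout/stderr are now the pty slave; run the command. */
        execvp(argv[1], argv + 1);
        perror("execvp");
        _exit(127);
    }

    /* Parent: shuttle bytes between our stdio and the pty master. */
    struct pollfd fds[2] = {
        { .fd = STDIN_FILENO, .events = POLLIN },
        { .fd = master,       .events = POLLIN },
    };
    char buf[4096];
    for (;;) {
        if (poll(fds, 2, -1) < 0) break;
        if (fds[0].revents & (POLLIN | POLLHUP)) {
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            if (n <= 0) break;
            if (write(master, buf, n) < 0) break;
        }
        if (fds[1].revents & (POLLIN | POLLHUP)) {
            ssize_t n = read(master, buf, sizeof buf);
            if (n <= 0) break;          /* child exited and closed the pty */
            if (write(STDOUT_FILENO, buf, n) < 0) break;
        }
    }
    return 0;
}
```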
