@sqwishy
Contributor
@sqwishy sqwishy commented Sep 20, 2022

There were a couple of issues with the current implementation:

  1. When reading stdout and stderr from the child, as soon as we hit EOF on one, we would stop reading from both (line 1420). This could lead to the return value never being read from the job program.

  2. Lines read from stdout and stderr are put into a channel and read elsewhere with `rx.recv()` (line 1497), but that channel isn't drained until empty. It is only read in the `while !done.load(...)` loop (line 1449), and that loop can stop after any `.store(true, ...)`, which happens when the child exits, when the job is cancelled, and when either stdout or stderr reaches EOF.

     This can be verified by putting `dbg!(rx.recv().await)` or a similar assertion after the while loop, before returning from that function. It shows the channel still containing log lines on rare occasions.

I was pretty careful in this change to maintain the current behaviour, adding comments to express intent.

One difference with this change is that some recurring intervals (the cancel check and the ping update) should now be more regular:

Before...

> at 00ms wait for 10ms
> at 10ms do things for 3ms
> at 13ms wait again for *10ms*
> at 23ms do things again ...

With change...

> at 00ms wait for 10ms
> at 10ms do things for 3ms
> at 13ms wait again but for *7ms*
> at 20ms do things again ...

Which I'm guessing is preferable but I could be wrong.

@sqwishy sqwishy marked this pull request as ready for review September 20, 2022 04:23
Comment on lines 1488 to 1490
* We want append_logs() to run concurrently while reading output from the
* job process. take_until() should only stop reading logs once either:
* - `write` has resolved
Contributor
I'm confused. Is append_logs really running concurrently, in the sense that no new parallel thread is spawned to run it?

Contributor Author

Good catch. My understanding is that the future should run as long as it's being polled, and it is being polled in the `take_until()`. However, it's entirely possible that it doesn't even start polling the scheduled write future until after `sleep(write_logs_delay)` resolves.

Generally speaking, something like:

`future::join(sleep(Duration::from_secs(1)), sleep(Duration::from_secs(1))).await;`

will run both futures concurrently in the same task until they both finish, and will resolve after one second. But I think that's only because `join` polls both of them, whereas `sleep(...).then(bar)` will only start polling `bar` after the sleep resolves. I guess I could just `join(sleep(...), db_write_future)`.

Contributor Author

Although, it won't even run while we're inside `while let Some(line) = output.by_ref().next().await { ... }`. So it might be best as a task. If a task panics, the panic is caught in a `JoinHandle`, so that adds another thing to deal with, but it may make this scheduling much easier to think about.

Contributor Author

I updated this to use a task. It's still a little ugly.

@rubenfiszel
Contributor

> at 00ms wait for 10ms
> at 10ms do things for 3ms
> at 13ms wait again but for 7ms
> at 20ms do things again ...

This seems preferable indeed, but I need to look back at the code to see why it is the case.

@sqwishy
Contributor Author

sqwishy commented Sep 20, 2022

This fucking piece of shit CI dude.

Also @rubenfiszel this is an example of some nsjail jank that I've seen locally sometimes.

2022-09-20T23:08:36.8978994Z   left: `Object {"error": String("Error during execution of the script\nlast 5 logs lines:\n[I][2022-09-20T23:08:36+0000] Executing '/usr/local/bin/python3' for '[STANDALONE MODE]'\n[E][2022-09-20T23:08:36+0000][1] void subproc::subprocNewProc(nsjconf_t*, int, int, int, int, int)():211 execve('/usr/local/bin/python3') failed: No such file or directory\n[F][2022-09-20T23:08:36+0000][1] pid_t subproc::runChild(nsjconf_t*, int, int, int, int)():466 Launching child process failed\n[W][2022-09-20T23:08:36+0000][11929] pid_t subproc::runChild(nsjconf_t*, int, int, int, int)():486 Received error message from the child process before it has been executed\n[E][2022-09-20T23:08:36+0000][11929] int nsjail::standaloneMode(nsjconf_t*)():272 Couldn't launch the child process")}`,

aka

[I][2022-09-20T23:08:36+0000] Executing '/usr/local/bin/python3' for '[STANDALONE MODE]'
[E][2022-09-20T23:08:36+0000][1] void subproc::subprocNewProc(nsjconf_t*, int, int, int, int, int)():211 execve('/usr/local/bin/python3') failed: No such file or directory
[F][2022-09-20T23:08:36+0000][1] pid_t subproc::runChild(nsjconf_t*, int, int, int, int)():466 Launching child process failed
[W][2022-09-20T23:08:36+0000][11929] pid_t subproc::runChild(nsjconf_t*, int, int, int, int)():486 Received error message from the child process before it has been executed
[E][2022-09-20T23:08:36+0000][11929] int nsjail::standaloneMode(nsjconf_t*)():272 Couldn't launch the child process

@rubenfiszel
Contributor

@sqwishy I think for this error we might want to reproduce it under nsjail's more verbose/debug mode. It's pretty annoying; I hope we will not have to patch nsjail...

@sqwishy
Contributor Author

sqwishy commented Sep 21, 2022

@rubenfiszel Any changes regarding nsjail debugging will go in a separate branch. CI appears to be failing because it's using an old Go compiler.

If you're fine with everything else in the changes here, you can merge it. Or, if you want the CI to pass, I can look at writing our Go bootstrapping code to support older compilers and put that in another pull request; let me know.

@rubenfiszel
Contributor

rubenfiszel commented Sep 21, 2022 via email

};

/* a future that reads output from the child and appends to the database */
/* this whole section is kind of a mess and could use some love */
Contributor

Remove this kind of language, but list what could be improved if you want.

Contributor Author

When you gave me feedback earlier, you said you were surprised that I wrote this the way I did because it wasn't very clear, or something like that. I shared the general sentiment. This comment was just to be up front about that, so that it would be less surprising to readers.

*
* (This looks a bit nicer using try_for_each but the side-effects/capturing
* in FnMut closure seems impractical, _maybe_ a futures::Sink would work but
* I know next to nothing about that.) */
Contributor

Stop the comment at "would work".

}
let wait_result = tokio::select! {
(w, _) = future::join(wait_on_child, lines) => w,
_ = ping.collect::<()>() => unreachable!("job ping stopped"),
Contributor

The unreachable part is weird, could you explain in more detail?
If it's really unreachable, why even `select!`?

Contributor Author

> The unreachable part is weird, could you explain in more detail?

The stream should repeat forever.

> If it's really unreachable, why even `select!`?

To poll on the future.

let mut log_remaining = (MAX_LOG_SIZE as usize).saturating_sub(logs.chars().count());
let mut result = io::Result::Ok(());
let mut output = output;
/* write_is_done resolves when the task is done, same as write, but does not contain the Result
Contributor

I could not find what `write_is_done` was referring to.

@rubenfiszel rubenfiszel merged commit e7a6c1b into windmill-labs:main Sep 24, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Sep 24, 2022