Skip to content
This repository was archived by the owner on May 9, 2020. It is now read-only.

Remove race conditions from stream piping. r=garndt #332

Merged
merged 15 commits into from
Feb 7, 2018

Conversation

walac
Copy link
Contributor

@walac walac commented Nov 3, 2017

In several parts of the code there is a common pattern to combine nodejs
pipe method and waitForEvent:

stream1.pipe(stream2);
await waitForEvent(stream1, 'end');

It turns out this code is racy and may end up in a deadlock. If the 'end'
event arrives before the 'end' event, waitForEvent will get stuck
forever.

We solve it by using the promisepipe package, which supplies a promised
version of the pipe function and replace the code above with
promisepipe.

Update: this branch evolved and now includes a lot of fix for intermittent test failures.

@walac walac requested review from djmitche and gregarndt November 3, 2017 17:04
@gregarndt
Copy link
Contributor

This certainly seems a lot nicer, I wasn't aware of promisepipe!

Copy link
Contributor

@gregarndt gregarndt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious to see how the tests turn out, but this looks promising (no pun intended)

@djmitche
Copy link
Contributor

djmitche commented Nov 3, 2017

I don't understand streams well enough to say much about this. But Greg seems happy!

@djmitche djmitche removed their request for review November 3, 2017 17:44
@gregarndt
Copy link
Contributor

Appears the tests aren't happy with this. We'll need to investigate those.

@walac
Copy link
Contributor Author

walac commented Nov 5, 2017

Appears the tests aren't happy with this. We'll need to investigate those.

Yeah, I expected that, had to push the PR to run the tests on CI.

walac pushed a commit to walac/docker-worker that referenced this pull request Nov 7, 2017
scopes.hasPrefixedScopes is an async function and any call to it must
await. For reasons we probably we never know, the code used to pass
until we changed the code in [1]. Now it returns a pending promise.

We fix this bad behavior by awaiting hasPrefixedScopes.

[1] taskcluster#332
@walac walac mentioned this pull request Nov 7, 2017
walac pushed a commit to walac/docker-worker that referenced this pull request Nov 7, 2017
scopes.hasPrefixedScopes is an async function and any call to it must
await. For reasons we probably we never know, the code used to pass
until we changed it in [1]. Now it returns a pending promise.

We fix this bad behavior by awaiting hasPrefixedScopes.

[1] taskcluster#332
@walac walac force-pushed the fix-wait-for-event branch from 4c2ab21 to 0fec51d Compare November 22, 2017 12:22
@walac walac force-pushed the fix-wait-for-event branch 5 times, most recently from 5510cee to bdd6f53 Compare January 31, 2018 20:40
@walac walac requested review from djmitche and jonasfj February 2, 2018 18:52
Copy link
Contributor

@jonasfj jonasfj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a general improvement..

But I don't really believe that you are missing end events..
No event can happen when you're waiting for it in the same synchronize function. There is no awaits before you create the promise...

@jonasfj
Copy link
Contributor

jonasfj commented Feb 3, 2018

// state1
stream1.pipe(stream2);
// state2
await waitForEvent(stream1, 'end');
// state3

No events can fire between state1 and state2. If they do, the events goes on to the event queue and gets fired when the promise is awaited. NO, I'm wrong... I realize as I'm writing.... events can fire... you can call all the event handlers synchronously... And then we would miss the event... In that case the fix could have been.

const done = waitForEvent(stream1, 'end');
stream1.pipe(stream2);
await done;

But yeah, using this magic pipe seems better...

Copy link
Contributor

@djmitche djmitche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are some rather heroic measures to keep tests running. Nice work! If these tests continue to cause issues, I'd be happy deleting the run-in-parallel test (we're pretty sure docker can run multiple tasks in parallel now, and that's unlikely to regress), for example.

@walac walac force-pushed the fix-wait-for-event branch 2 times, most recently from a58068a to 9d93033 Compare February 7, 2018 11:13
Wander Lairson Costa added 6 commits February 7, 2018 10:52
got is a more stream friendly library.
In several parts of the code there is a common pattern to combine nodejs
pipe method and waitForEvent:

stream1.pipe(stream2);
await waitForEvent(stream1, 'end');

It turns out this code is racy and may end up in a deadlock. If the 'end'
event arrives before the 'end' event, waitForEvent will get stuck
forever.

We solve it by using the promisepipe package, which supplies a promised
version of the pipe function and replace the code above with
promisepipe.
* fs.exits is an asynchronous callback function, replace it by its sync
version.

* lastUsed isn't necessarily monotonic, this really depends on how fast
caches are added.
Some binary packages of nodejs randomly fail to build during yarn
install, causing the task to fail.
Parallel task test infers if tasks ran in parallel by testing if they ran
in a bounded amount of time. This methodology is weak because the time a
task takes to complete may vary dramastically because the virtual
machine where the tests execute may be under heavy load.

Theoretically we should apply some statistics by repeating the tests a
certain number of times, removing the outliers and calculate the average
time along with its variance, and then check if the results are
acceptable.

As the development time urges, we take a simpler approach: we run the
tests until the time spent to execute the tasks is bounded, at a maximum
of 10 tries, and if at least one try succeeds, the test passes.

As the parallelism of a task, by itself, is something deterministic, if
we get one green iteration, it is enough.
Wander Lairson Costa added 8 commits February 7, 2018 10:52
The reasoning behind blocking this test on development environment was
because we didn't want to mess up developer machine iptables
configuration. We the tests are intended to run under a Vagrant machine,
we reenable them.
Upload failures are, in most of the cases, due to a very short time
external problem; it doesn't make sense to wait for several minutes
to try to upload again. That said, we set the maximum back off time
to 30 seconds.

As we retry 10 times, we must adjust the exponential factor to make
the timeout at most 30 seconds in worst case to avoid timeout
saturation.
Dashes and underlines in the container image name may cause patterns
that are invalid in docker reference. We replace all dashes and
underlines by '0' to make sure no such invalid patterns are built.

We also fix the unlink call, which expects a callback instead of
returning a Promise.
The strategy to simulate a network failure when testing the artifact
upload retry feature is to add a rule to iptables when we detect the
worker is uploading an artifact to S3.

Due to the asynchronous nature of nodejs, other parts of the worker may
run while waiting for I/O response, and with iptables blocking outgoing
traffic, the unrelated network calls will be blocked.

The solution is to block only the outgoing packets to Amazon S3 service.
We do so by downloading the IP ranges from Amazon and filtering those
that belong to the us-west-2 region and whose service is S3. We then
block those IP ranges.
the app.use in koa package must receive a generator, even if we don't
yield anything.
For indexed image pulling tests, we have a fixed task containing a test
image. Unfortunatelly, the task expired and broke the tests.

We expand the create_lz4_images script to also upload tar and zst images
and use a unique task for image pulling tests.

We also update the task IDs accordingly.
waitForEvent installs a event handler and then waits for it be called.
But between the event sink and the handler, the event may be trigged
which would cause the Promise waiting forever.
@walac walac force-pushed the fix-wait-for-event branch 3 times, most recently from 0bc848d to 6a5d502 Compare February 7, 2018 14:44
As we use an event catching mechanism to simulate a network down for
retry upload, during spawning of iptables process, the upload may resume
and finish before the S3 IPs are blocked.

To avoid a false negative in these tests, we retry a few times before
mark the test as failed.
@walac walac force-pushed the fix-wait-for-event branch from 6a5d502 to 4f5563b Compare February 7, 2018 15:05
@walac walac merged commit 1e91d71 into taskcluster:master Feb 7, 2018
@walac walac deleted the fix-wait-for-event branch September 30, 2019 13:26
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants