Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] don't spin up an intermediate process when yarn run runs a script that yarn runs #3732

Open
2 tasks done
seansfkelley opened this issue Nov 15, 2021 · 5 comments
Open
2 tasks done
Labels
enhancement New feature or request

Comments

@seansfkelley
Copy link
Contributor

  • I'd be willing to implement this feature (contributing guide)
  • This feature is important to have in this repository; a contrib plugin wouldn't do

Describe the user story

I have a monorepo with ~200 packages. I use a combination of workspaces foreach, npm-run-all and common script names to do operations across the whole repo. For instance, a common pattern we have allows linting invocations to be configured per-package, which looks like:

{
  "scripts": {
    "lint": "npm-run-all -clp 'lint:**'",
    "lint:prettier": "yarn g:prettier --list-different .",
    "lint:eslint": "yarn g:eslint ."
  }
}

(where g:prettier and g:eslint are global scripts provided by a package that wraps those tools with some configuration, etc.)

This is useful when e.g. you are transitioning linters or linter configuration, but only part the project has been ported so far.

This is run as (among other ways) yarn workspaces foreach run lint. This spins up a LOT of processes just for this package:

  1. Yarn 1 wrapper, which delegates to
  2. Yarn 2, to run workspaces foreach, which executes the selected scripts in-process, except that
  3. npm-run-all is a opaque binary, which then invokes [following lint:prettier as an example]
    1. Yarn 2 (I believe, since yarn run puts the new yarn on $PATH), which sees a call to Yarn, so it spins up
    2. Yarn 2 again, this time to run a global script, which finally runs
    3. prettier

(Steps i-iii being repeated for ESLint as well.)

With ~200 packages, you can imagine this eats up a lot of memory for near-useless overhead (there's a surprising amount of memory used for an idle Yarn 2 in this repo, but that's another issue). I've spent some time noodling CI configuration, parallelism, etc., but ultimately these intermediates are just wasting memory and occasionally OOMing the CI machines.

Describe the solution you'd like

I think Yarn can safely and transparently collapse these invocations together, or at least, provide more-featureful hooks to allow third parties to do so.

Idea 1: yarn run inspects the script it's about to run. If it determines that it's going to invoke run again, and only run, it simply pulls the arguments, constructs the new command, and calls the implementation of run in-process.

Idea 2: Yarn's hooks are augmented to allow implementing the behavior in Idea 1. In particular, the cli object is provided to wrapScriptExecution or some new hook (executeScript?) is provided that can take its place.

Describe the drawbacks of your solution

In either version, there are some environment variables that would have to be juggled to make sure they are the same whether or not this in-process pathway is hit (INIT_CWD comes to mind).

Idea 1: I don't believe there are any correctness or performance issues with the feature as described. Yarn scripts don't generally do much work before they end up shelling out to some other binary, so even if you had 10x nested scripts that hit this pathway, all you'd really be doing is collecting arguments along the way until you finally hit a non-Yarn binary or an inline shell script, which would not be eligible for inlining.

Idea 2: The hooks seem to want to be tightly scoped for a specific purpose. Providing cli, or even special-casing the ability to invoke run (which seems... unlikely) would blow up the access that hooks have to Yarn considerably.

Describe alternatives you've considered

I've been trying to implement a hook with this idea, but it's really gross. It reimplements a chunk of the body of run (since it can't call it) and does it own parsing of the command (though I learned about shellquote after this particular implementation was written).

I also tried writing my own stripped-down version of npm-run-all (this is an earlier version of it; it's messy). Doing so allows me to collapse (3) into (i) above, which is a good start. However, because workspace foreach doesn't know about it, it can't collapse (2) into (3). It also doesn't help with collapsing (i) into (ii).

I think if workspaces foreach, run, and run-all were all aware of each other (i.e., all implemented together), the example above could condense from 9 processes to 4: Yarn 1 -> Yarn 2 -> {Prettier, ESLint}.

@seansfkelley seansfkelley added the enhancement New feature or request label Nov 15, 2021
@arcanis
Copy link
Member

arcanis commented Nov 15, 2021

Indeed, it'd be a good thing. Since we parse and evaluate the scripts it should be doable, although a little tricky. Perhaps as a first step yarn could be implemented as a builtin that would forward the execution to the current CLI?

@seansfkelley
Copy link
Contributor Author

Ah, that's a much more clever idea. If I understand correctly, that would also allow my yarn run-all plugin and workspace-tools to benefit with no further work, assuming they use this.cli.run to delegate.

@seansfkelley
Copy link
Contributor Author

seansfkelley commented Nov 17, 2021

Okay, I poked around this a little bit, and it certainly seems possible, though definitely tricky.

The biggest blocker I've discovered so far is process.env. A script like yarn run baz && FOO=bar yarn run baz has to correctly reflect FOO=bar only in the second invocation.

A possible solution is to follow the pattern of this.context.stdin, i.e., punch this.context.env around all over the place. This would require rewriting a bunch of utility methods to accept an env argument. It's also definitely a footgun for any custom plugins that aren't being super careful (though you could say the same for stdio, I suppose.) I believe this will allow it to interact properly with background jobs and the like.

There is also other global state to think about. For instance, if a custom plugin fiddles with, say, Workspace.manifest without serializing it to disk, other Yarns invoked in-process will see those changes, which they would not normally. One could figure out how to isolate that as well, though that's a bit whack-a-mole and partially undermines the benefits of this feature -- namely, that it doesn't need to reinitialize project state and spend time and memory on that.

With these kinds of things in mind, I'd really like to get your gut feeling on this before I spend much more time on it. Do you think:

  1. the isolation issues can be resolved without unduly (that word's doing a lot of work here) sacrificing performance, readability, maintainability and extensibility (e.g., without having to dependency-inject the entire world, or, without crippling plugins that modify state)
  2. common isolation issues can be transparently resolved (e.g. process.env), but we might want to permanently hide the feature behind a flag/new command so a user has to consciously trade feature restrictions for performance (e.g. throw exceptions if a plugin tries to modify shared state, which would be fine for my use-case)
  3. (2) can eventually become (1)
  4. it can only ever work with a smaller scope, e.g., as described in my original post (which would be a shame -- I really like the idea of things like workspace-tools getting the optimization for free and it would considerably relieve the burden my CI machines are under)
  5. this is fundamentally flawed

Thoughts?

@arcanis
Copy link
Member

arcanis commented Nov 23, 2021

Honestly it's difficult to say 😕 I knew env would be a thorny point, and even now I'm not quite sure what's the best way to deal with it.

One option that seems relatively enticing would be to use AsyncLocalStorage to have a state that would both be global, while at the same time local to the promise chain. Kinda like if the shell was to directly modify process.env, except that it would only affect the current execution. This way we wouldn't have to pass env everywhere (because it'd be a "global"), but we wouldn't either have to worry about mutating it (because it'd be promise-local).

The problem is that this approach is very novel and would involve quite a bit of R&D with trial-and-error process. Should this async state be managed inside Clipanion? Or inside the shell itself? My gut feeling would tend to say that the answer is a bit of both, which sounds quite complicated.

@seansfkelley
Copy link
Contributor Author

seansfkelley commented Nov 23, 2021

Interesting; I had not considered AsyncLocalStorage. It's more ergonomic than dependency-injecting process.env into everywhere that needs it, but agreed that it's a little spooky and still quite a bit of work.

I'm thinking that I may need to take a more scoped-down and/or buyer-beware approach, as the nature of the plugin architecture makes it very difficult or impossible to reason about every edge case (see: my example about Workspace.manifest in my previous comment).

I don't think I'll have time in the near future to give this another good go given what it seems to be becoming, but here are the directions I'm considering, in no particular order:

  • yarn-as-builtin and transparently handling environment, etc., as the main thread of conversation has been (sounds hard!)
  • yarn-as-builtin, but only in some cases:
    • when forwarding to nested runs only, or
    • when forwarding to any subcommand as long as no environment flags are set in the command, or
    • when global configuration or a run CLI flag opts-in to the behavior, or
    • when forwarding to subcommands that are known to be insensitive to in-memory state (?!)
  • a dedicated first-party or third-party command which is otherwise identical to run
  • adding to expressive power of hooks so this can be done transparently with a third-party plugin (as suggested in the original comment -- thus sidestepping the issue of needing to handle arbitrary use-cases)

(I'm not expecting any response here, but I'd welcome any input regardless. This is primarily a list for my future self or anyone else who may become interested in picking up this feature.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants