Storing post installation artifacts in offline mirror #50

124 changes: 124 additions & 0 deletions text/0000-post-installation-mirror.md
@@ -0,0 +1,124 @@
- Start Date: 2017-02-16
- RFC PR:
- Yarn Issue:

# Summary

Enable storing post-installation artifacts in the Yarn offline mirror.

# Motivation

`Yarn` has dramatically improved the reliability of node dependency resolution. However, we still cannot
achieve truly deterministic dependency resolution and hermetic builds without addressing the issues caused by installation scripts.

Node modules allow arbitrary code execution before, during, and after installation. Although this feature gives package owners enormous flexibility, it is also one of the root causes of non-deterministic builds. Some packages use installation scripts to download additional code and/or dependencies. This means those packages cannot be installed in network restricted environments (NRE). Production zones in enterprise environments are one example of such NREs; they tend to restrict network access for obvious security reasons.

Setting aside the question of whether installation scripts are a valid practice, a possible workaround is to store the post-installation artifacts in the Yarn offline mirror. In essence, we aim to provide a better alternative to a practice that is already prevalent among enterprise users: saving or checking in the post-installation node_modules folder. Formalizing this practice provides at least the following benefits:
- Allows node builds to be truly offline, deterministic, and hermetic
- Allows package owners to continue using installation scripts
- Allows saving a single .tar.gz file instead of hundreds, if not thousands, of files
- Makes code review easier when dependencies are added, updated, or deleted

# Detailed design

## Assumptions
- Most, if not all, installation scripts only modify files and directories within their own package folder.
- Storing post-installation artifacts is voluntary, on a per-package basis.

## Analysis of installation scripts
It is difficult to categorize all possible uses of installation scripts, as they are truly free-form. That said, a couple of common patterns have been observed so far:
- Downloading additional dependencies or code
- Compiling node native extensions

This observation leads to the assumption that most installation scripts only modify files and directories within their own package folder. Consequently, we can store the post-installation content as artifacts without worrying about inter-module dependencies.
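
One way to test this assumption for a given package is to compare the paths an installation script touched against the package folder. The sketch below is illustrative only: `pathsOutsidePackage`, `isInside`, and the example paths are hypothetical, not part of Yarn, and it assumes POSIX-style absolute paths collected by some external means (for example, a file system diff taken before and after the script runs).

```typescript
// Hypothetical helper for checking the "scripts stay in their own folder"
// assumption. Nothing here is a Yarn API.

// Split a POSIX-style path into segments, resolving "." and "..".
function normalize(path: string): string[] {
  const parts: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      parts.pop(); // step up one directory; no-op at the root
    } else {
      parts.push(segment);
    }
  }
  return parts;
}

// True when `candidate` is the package folder itself or lies below it.
function isInside(packageDir: string, candidate: string): boolean {
  const root = normalize(packageDir);
  const target = normalize(candidate);
  if (target.length < root.length) return false;
  return root.every((segment, i) => segment === target[i]);
}

// Returns the touched paths that escape the package folder.
function pathsOutsidePackage(packageDir: string, touchedPaths: string[]): string[] {
  return touchedPaths.filter((p) => !isInside(packageDir, p));
}

// Example paths are made up for illustration.
const escaped = pathsOutsidePackage("/app/node_modules/foo", [
  "/app/node_modules/foo/build/Release/foo.node",
  "/app/node_modules/foo/../bar/index.js",
  "/tmp/foo-download.zip",
]);
```

A package for which `escaped` is non-empty would not be a safe candidate for a stored post-installation artifact under this RFC's assumption.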

## Modification to `yarn` offline mirror structure and `yarn.lock`
Member

The offline mirror kicks in at the fetch phase.
After the .tgz file is extracted into the global cache folder, the link phase starts.
During the link phase, files are copied from the cache into node_modules, taking hoisting into account, and then lifecycle scripts are executed, which modify some files in those node_modules.

You would have to generate a new .tgz file for each package folder that was modified during the lifecycle scripts phase, disable their lifecycle scripts, and then modify the yarn.lock file to point to the new .tgz file.

That could be hard to implement without bringing too much complexity into Yarn.

Author

It sounds like your suggestion is to use a separate cache not related to offline mirror to store those artifacts?

Member

Yes, I think this will make the offline mirror cache too confusing.
The idea of the offline mirror cache is that it stores each file as it was downloaded from a remote repository; this RFC adds a lot of new conditions.


- Store post-installation artifacts under a post-installation subdirectory when using the yarn offline mirror. The _resolved_ field should reflect this change as well.

Why does the resolved field need to change?

Author

We now need to store a path (post-install/foo.tar.gz#xxxx) instead of just a file name (foo.tar.gz#xxxx). Very minor difference, but I thought I should call it out.

Is the path configurable, i.e. is it ever going to change to something other than 'post-install' ?

Author

It could be configurable via a yarn option or even an environment variable. The key feature is that we need some way in the stored file itself to tell us that it is a post-install artifact.

Imagine a project adding a new dependency with only the offline mirror: the Yarn CLI must know which installation steps should be skipped. In my view, the only place we can store this information is in the file name / directory path.

Author

@UnrememberMe Feb 22, 2017

On second thought, there is an alternative to storing the post-installation information in the file name / directory name. We could potentially add a file, say .post-install, to the stored artifact. In this case, we do not need to change the resolved field or the current structure of the offline mirror, which is a flat list.


examples:
```
node-pre-gyp@^0.6.29:
  version "0.6.31"
  resolved "post-installation/node-pre-gyp-0.6.31.tgz#sha1"
  dependencies:
    mkdirp "~0.5.1"
    nopt "~3.0.6"
    npmlog "^4.0.0"
    ...
```
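
To make the two identification schemes from this thread concrete, here is a minimal sketch of how a tool might recognize a post-installation artifact, either from the directory prefix in the `resolved` value or from a marker file inside the artifact. The names `parseResolved`, `hasMarkerFile`, and `POST_INSTALL_DIR` are hypothetical, and the sketch only handles the relative `path#hash` form shown in the example above; it is not how Yarn parses `yarn.lock`.

```typescript
// Assumption: the subdirectory name follows the RFC example; the discussion
// notes it could be made configurable.
const POST_INSTALL_DIR = "post-installation";

interface ResolvedRef {
  isPostInstall: boolean; // true when the entry points into the subdirectory
  fileName: string;       // tarball name without the directory or hash
  hash: string;           // text after '#', or "" when absent
}

// Scheme 1: the directory prefix in the resolved value marks the artifact.
function parseResolved(resolved: string, dir: string = POST_INSTALL_DIR): ResolvedRef {
  const hashIndex = resolved.indexOf("#");
  const path = hashIndex === -1 ? resolved : resolved.slice(0, hashIndex);
  const hash = hashIndex === -1 ? "" : resolved.slice(hashIndex + 1);
  const prefix = dir + "/";
  const isPostInstall = path.indexOf(prefix) === 0;
  const fileName = isPostInstall ? path.slice(prefix.length) : path;
  return { isPostInstall, fileName, hash };
}

// Scheme 2: a marker file at the top level of the artifact marks it.
// `entries` is a listing of top-level names inside the extracted tarball.
function hasMarkerFile(entries: string[], marker: string = ".post-install"): boolean {
  return entries.indexOf(marker) !== -1;
}

const fromExample = parseResolved("post-installation/node-pre-gyp-0.6.31.tgz#sha1");
const plainEntry = parseResolved("node-pre-gyp-0.6.31.tgz#sha1");
```

With the first scheme the lockfile entry changes; with the second, the `resolved` value and the flat mirror layout stay as they are and only the artifact contents differ.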

## Modification to cli

Storing post installation artifacts should be a voluntary feature. This means that we should only store post installation artifacts when asked specifically. When installing from offline mirror, post installation artifacts should be given preference since the existence of such artifacts reflects specific actions taken by the maintainer of the offline mirror previously.

- For the _add_ command, add an additional command line parameter, --save-post-install. When this parameter is specified, store the post-installation artifact in the offline mirror. The artifact should include all files and directories present after running installation scripts, but exclude the node_modules subdirectory.

- For the _install_ command, search the post-install subdirectory first; if there are no post-installation artifacts, fall through to the existing workflow.

- When installing a post-install artifact from the offline mirror:
  - extract the tar.gz file in place
  - do not run install scripts
  - install dependencies and create bin links as usual
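
The proposed CLI behavior can be sketched roughly as follows. This is a simplified model, not Yarn code: `planInstall`, `artifactEntries`, the `Step` names, and the flat `mirrorFiles` listing are all hypothetical, and the step order for the existing workflow is only an approximation.

```typescript
type Step = "extract" | "install-dependencies" | "link-bins" | "run-install-scripts";

interface InstallPlan {
  source: string; // path relative to the offline mirror root
  steps: Step[];
}

// Install: prefer the post-install artifact, else fall through to the plain tarball.
function planInstall(
  tarball: string,
  mirrorFiles: string[],
  dir: string = "post-installation" // assumed subdirectory name
): InstallPlan | null {
  const postInstallPath = dir + "/" + tarball;
  if (mirrorFiles.indexOf(postInstallPath) !== -1) {
    // Stored artifact: install scripts are skipped.
    return { source: postInstallPath, steps: ["extract", "install-dependencies", "link-bins"] };
  }
  if (mirrorFiles.indexOf(tarball) !== -1) {
    // Existing workflow: install scripts still run.
    return {
      source: tarball,
      steps: ["extract", "install-dependencies", "link-bins", "run-install-scripts"],
    };
  }
  return null; // not present in the mirror at all
}

// Add with --save-post-install: keep everything except the node_modules subdirectory.
// `packageFiles` holds paths relative to the package root.
function artifactEntries(packageFiles: string[]): string[] {
  return packageFiles.filter(
    (f) => f !== "node_modules" && f.indexOf("node_modules/") !== 0
  );
}

const examplePlan = planInstall("node-pre-gyp-0.6.31.tgz", [
  "node-pre-gyp-0.6.31.tgz",
  "post-installation/node-pre-gyp-0.6.31.tgz",
]);
```

Returning a plan rather than performing the steps keeps the preference rule easy to inspect: whenever both files exist, the stored artifact wins.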

# How We Teach This

*What names and terminology work best for these concepts and why?*

- "post installation artifacts", to distinguish them from node modules
- "network restricted environments (NRE)", where access to the Internet is controlled
- "hermetic builds", meaning all dependencies are included; could be used as a synonym for "deterministic builds"

*How is this idea best presented?*

- Emphasize the concept of "hermetic builds".
- Emphasize `yarn` usage within "network restricted environments (NRE)".

*Would the acceptance of this proposal mean the Yarn documentation must be re-organized or altered?*

No.

*Does it change how Yarn is taught to new users at any level?*

This feature should be considered advanced and should be taught specifically to the following users:
- users who need to operate within network restricted environments (NRE)
- users who value deterministic dependency resolution and hermetic builds above convenience

*How should this feature be introduced and taught to existing Yarn users?*
Explain the intended use case with an illustrated workflow.

# Drawbacks
Member

  1. I think it might work for a subset of Node.js npm packages that don't read or write to folders outside of the package.
    This won't be a generic solution for packages heavy on native code; we are working on https://github.com/jordwalke/esy to address that.

  2. The offline mirror is designed to be cross-platform because it caches things at the fetching phase.
    This feature will be platform specific, and in some cases machine specific (sometimes binaries store local paths), and it is a linking phase cache.

Author

  1. Are there examples of packages that either read/write to folders outside of the package or store absolute paths from the local machine? The explicit assumption in the RFC is that very few packages, if any, behave this way.

  2. I specifically avoided the platform dependency issue in an effort to limit the reach of this RFC. I guess this is a can of worms that I cannot avoid :-(.

    There are two main ways to deal with platform-specific code: storing precompiled binaries or compiling during installation. Prior art includes Python wheels (https://www.python.org/dev/peps/pep-0427/), Ruby Gems (http://guides.rubygems.org/specification-reference/#platform=), and Go (https://golang.org/src/go/build/doc.go). Python and Go store platform-dependent binaries in their packages, while Ruby recompiles during Gem installation.

    Storing platform-dependent post-installation packages via a scheme similar to the one outlined in this RFC is my preferred choice.
    Pros:

    • Guarantees consistent installation across machines with the same os / arch / node version combination.
    • Compatible with NREs.
    • Adds no cost for package owners. The choice of which os / arch / node version combinations to support is made by the post-installation package maintainer, presumably someone who has those specific needs.
    • Makes it possible to tar up the entire set of installed packages and copy it to other machines with the same platform. This means a single version can be tracked as a tar file across its life cycle, which can be a great feature for enterprises.

    Cons:

    • The matrix of os / arch / node version to support can explode. This is somewhat mitigated by the fact that maintainers can choose how big a matrix they want to support.
    • Will not work if there is machine-specific code, such as linking to an absolute path.
    • Adds further complexity to Yarn (or a plugin).

  3. I chose to reuse the offline mirror because I didn't want to introduce another cache. If it is conceptually cleaner to have a separate cache for post-installation artifacts, that's a change we should make to this RFC.

Member

  1. Here is an example with node-gyp: "Parallel workers running install scripts can interfere" (yarn#1874).
  2. Yep, I know the pain, but we have to deal with it, as many projects share the same yarn.lock files and offline mirror .tgz files across all OSes.
  3. I am pretty sure a postinstall cache should be independent from the offline mirror.


- Deviates somewhat from the current node community norm. This feature may require additional teaching.
- Adds complexity to the installation workflow and the offline mirror structure
- Depends on the directory structure to identify post-installation modules

The following points are drawbacks under certain circumstances and advantages for other circumstances.

- Native extension support
If a node module has native extensions, a stored post-installation module will not work on platforms different from the one where the module was created.

Currently, native extensions are compiled during installation. However, since the compiler and libraries are provided by the running environment, the compilation output is not guaranteed to be repeatable. Because of the complexity of supporting node native extensions, ensuring hermetic and repeatable builds for them requires a separate RFC.
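
As a rough illustration of the platform constraint, a stored artifact could record the platform it was built on and be rejected elsewhere. The sketch below is an assumption-laden example, not part of this RFC's design: `PlatformInfo`, `platformKey`, and `isCompatible` are hypothetical names, and it assumes that os, arch, and Node major version are a sufficient compatibility key, which may be too coarse for some native modules.

```typescript
// Hypothetical description of the platform an artifact was built on.
interface PlatformInfo {
  os: string;          // e.g. "linux", "darwin"
  arch: string;        // e.g. "x64"
  nodeVersion: string; // e.g. "v6.9.5"
}

// Build a coarse key from os, arch, and the Node major version.
function platformKey(p: PlatformInfo): string {
  const match = /^v?(\d+)/.exec(p.nodeVersion);
  const major = match ? match[1] : "unknown";
  return p.os + "-" + p.arch + "-node" + major;
}

// An artifact is reused only when the keys match exactly.
function isCompatible(builtOn: PlatformInfo, current: PlatformInfo): boolean {
  return platformKey(builtOn) === platformKey(current);
}

const builtOn: PlatformInfo = { os: "linux", arch: "x64", nodeVersion: "v6.9.5" };
const sameFamily: PlatformInfo = { os: "linux", arch: "x64", nodeVersion: "v6.10.0" };
const otherOs: PlatformInfo = { os: "darwin", arch: "x64", nodeVersion: "v6.9.5" };

const reuseOnSameFamily = isCompatible(builtOn, sameFamily);
const reuseOnOtherOs = isCompatible(builtOn, otherOs);
```

In a Node.js process the three fields could be filled from `process.platform`, `process.arch`, and `process.version`.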

- Loss of some flexibility
In at least one package (cldr-data), [the build result can be altered by the environment variable $CLDR_COVERAGE](https://github.com/rxaviers/cldr-data-npm/blob/master/install.js#L91). Caching post-installation artifacts loses this flexibility.

- Updates to installation-time downloads are ignored / require explicit action
Installation scripts tend to download the latest version of dependencies. A stored post-installation artifact will always contain the same version of those dependencies and thus may not have the latest ones. Updating such installation-time downloads requires explicit action from offline mirror maintainers.

# Alternatives
Member

As I said above, this RFC goes beyond the concerns of Offline Mirror feature.

I think the problem may be solved by caching and sharing a built package in some way.
This may not work across platforms and machines; it depends on each package and how a project is built.
I would try:

  • disable lifecycle scripts for a package that needs Internet (maybe have this setting in package.json)
  • before package is installed from Yarn cache, replace the cache with the prebuilt content. Packing, sharing and replacing in cache could be automated in some way by Yarn or a plugin or a third party script

I think those that operate in a NRE would likely be less concerned about cross platform compatibility

> I think those that operate in a NRE would likely be less concerned about cross platform compatibility

I'm at Red Hat, working in NREs on multiple architectures.


The following alternatives have been considered:

- Do nothing
One may argue that this problem is not worth addressing and that doing nothing is the correct approach: it has existed since the inception of npm, and the node community has thrived all the same. A second argument is that users in network restricted environments (NRE) are not the intended customers.
However, as adoption of node widens, the inability to run node builds in network restricted environments has been, and will continue to be, a hurdle for adoption. Not addressing this problem is no longer a valid option.

- Work with individual package owners to make sure package installation is hermetic
There are several drawbacks to this approach:
  - Installation scripts may serve legitimate purposes in certain circumstances
  - It requires significant effort to educate node module writers
  - Working on a per-package basis and updating all dependent packages means the necessary changes might take a long time to propagate.

Member

The npm community is large and free to do anything, so it is impossible to enforce any kind of behavior.

The right thing to do would be for the community members to work with the packages individually to provide ability to be installed while using a mirror (sinopia based mirrors have the same problem) and without Internet access: raise issues, send PRs, fork.

This may be a dumb question, but why do some npm packages need internet access to be installed? Why can't they hold all needed information within the package itself (aside from defined dependencies)?

Phantomjs, for example, actually downloads its platform-specific binary upon npm module installation. The npm module is just a wrapper.

I suppose it could package up each target platform/architecture binary and only configure the intended one for that runtime.

Author

Although I agree that the right thing to do is to work with the package owner to remove network dependencies, the process has proven to be slow, and owners are sometimes unresponsive. We not only need to work with the owner of the package in question; in some cases, we need to work with dependent packages and dependents of dependents as well.

Member

That is the default assumption - a package is released "as is" and I think it is an exception when a package author has time to support more use cases.

# Unresolved questions