Storing post installation artifacts in offline mirror #50
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,124 @@ | ||
- Start Date: 2017-02-16 | ||
- RFC PR: | ||
- Yarn Issue: | ||
|
||
# Summary | ||
|
||
Enable storing post installation artifacts in the Yarn offline mirror. | ||
|
||
# Motivation | ||
|
||
`Yarn` has dramatically improved the reliability of node dependency resolution. However, we still cannot | ||
achieve truly deterministic dependency resolution and hermetic builds without addressing the issues caused by installation scripts. | ||
|
||
Node modules allow arbitrary code execution before, during and after installation. Although this feature gives package owners enormous flexibility, it is also one of the root causes of non-deterministic builds. Some packages use installation scripts to download additional code and/or dependencies, which means those packages cannot be installed in network restricted environments (NRE). Production zones in enterprise environments are one example of such NREs; they tend to restrict network access for security reasons. | ||
|
||
Setting aside the question of whether installation scripts are a valid practice, a workaround is to store the post installation artifacts in the yarn offline mirror. In essence, we aim to provide a better alternative to a practice already prevalent among enterprise users, namely saving or checking in the post installation node_modules folder. Formalizing this practice provides at least the following benefits: | ||
- Allows node builds to be truly offline, deterministic and hermetic | ||
- Allows package owners to continue using installation scripts | ||
- Allows saving a single .tar.gz file instead of hundreds, if not thousands, of files | ||
- Makes code review easier when dependencies are added, updated or deleted | ||
|
||
# Detailed design | ||
|
||
## Assumptions | ||
- Most, if not all, installation scripts only modify files and directories within their own package folder. | ||
- Storing post installation artifacts is voluntary and decided on a per-package basis. | ||
|
||
## Analysis of installation scripts | ||
It is difficult to categorize all possible usages of installation scripts, as they are truly free-form. That said, a couple of common patterns have been observed so far: | ||
- Downloading additional dependencies or code | ||
- Compiling node native extensions | ||
|
||
This observation leads to the assumption that most installation scripts only modify files and directories within their own package folder. Consequently, we can store the post installation content of each package as an artifact without worrying about inter-module dependencies. | ||
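One way to check this assumption for a given package is to compare a snapshot of a folder taken before the installation scripts run with one taken afterwards. The following is a minimal sketch only: `snapshot` and `changed_files` are hypothetical helper names, not part of Yarn, and the choice of SHA-256 is arbitrary.

```python
import hashlib
from pathlib import Path
from typing import Dict, Set


def snapshot(folder: Path) -> Dict[str, str]:
    """Map each file path (relative to folder) to a SHA-256 digest of its contents."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Return the files that were added, removed or modified between two snapshots."""
    return {
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    }
```

Taking the snapshots over a sibling package's folder instead and getting an empty result from `changed_files` would be evidence that the installation script stayed within its own folder.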
|
||
## Modification to `yarn` offline mirror structure and `yarn.lock` | ||
|
||
- Store post installation artifacts under a post-installation subdirectory of the yarn offline mirror. The _resolved_ field in `yarn.lock` should reflect this path as well. | ||
Why does the resolved field need to change?

We now need to store a path (post-install/foo.tar.gz#xxxx) instead of just a file name (foo.tar.gz#xxxx). Very minor difference, but I thought I should call it out.

Is the path configurable, i.e. is it ever going to change to something other than 'post-install' ?

It could be configurable via a yarn option or even an environment variable. The key feature is that we need some way in the stored file itself to tell us that it is a post-install artifact. Imaging a project adding a new dependency with only the offline mirror, Yarn cli must know what installation steps should be skipped. The only place that we can store this information is in the file name / directory path in my POV.

On a second thought, there is an alternative to store the post installation information in file name / directory name. We could potentially add a file, say
|
||
|
||
examples: | ||
``` | ||
node-pre-gyp@^0.6.29: | ||
version "0.6.31" | ||
resolved "post-installation/node-pre-gyp-0.6.31.tgz#sha1" | ||
dependencies: | ||
mkdirp "~0.5.1" | ||
nopt "~3.0.6" | ||
npmlog "^4.0.0" | ||
... | ||
``` | ||
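To illustrate how a _resolved_ value like the one above could signal a post installation artifact, here is a minimal sketch. It assumes the subdirectory is named `post-installation`, as in the example, and that the checksum follows a `#`; `parse_resolved` and `Resolved` are hypothetical names, not Yarn APIs.

```python
from typing import NamedTuple, Optional

# Assumed subdirectory name, copied from the yarn.lock example; the review
# discussion suggests it might become configurable.
POST_INSTALL_DIR = "post-installation"


class Resolved(NamedTuple):
    path: str                # file name or relative path inside the mirror
    checksum: Optional[str]  # text after '#', or None when absent
    is_post_install: bool    # True when the path sits under POST_INSTALL_DIR


def parse_resolved(value: str) -> Resolved:
    """Split a resolved value into path and checksum and flag post-install artifacts."""
    path, sep, checksum = value.partition("#")
    first_segment = path.split("/", 1)[0]
    is_post_install = "/" in path and first_segment == POST_INSTALL_DIR
    return Resolved(path, checksum if sep else None, is_post_install)
```

With this sketch, `parse_resolved("post-installation/node-pre-gyp-0.6.31.tgz#sha1").is_post_install` is `True`, while a bare file name such as `node-pre-gyp-0.6.31.tgz#sha1` is treated as a regular mirror entry.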
|
||
## Modification to cli | ||
|
||
Storing post installation artifacts should be a voluntary feature. This means that we should only store post installation artifacts when asked specifically. When installing from offline mirror, post installation artifacts should be given preference since the existence of such artifacts reflects specific actions taken by the maintainer of the offline mirror previously. | ||
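The preference described above can be sketched as a simple lookup. This is an illustration under assumptions, not Yarn's implementation: the mirror is taken to be a flat directory of tarballs, the subdirectory name `post-installation` comes from the `yarn.lock` example, and `pick_tarball` is a hypothetical helper.

```python
from pathlib import Path
from typing import Tuple

POST_INSTALL_DIR = "post-installation"  # assumed subdirectory name


def pick_tarball(mirror: Path, tarball_name: str) -> Tuple[Path, bool]:
    """Return (tarball path, whether install scripts should still run)."""
    post_install = mirror / POST_INSTALL_DIR / tarball_name
    if post_install.is_file():
        # A stored artifact reflects a deliberate choice by the mirror
        # maintainer, so it wins and install scripts are skipped.
        return post_install, False
    regular = mirror / tarball_name
    if regular.is_file():
        # No artifact stored: fall through to the existing workflow.
        return regular, True
    raise FileNotFoundError(f"{tarball_name} not found in offline mirror {mirror}")
```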
|
||
- For _add_ command, add an additional command line parameter --save-post-install. When this parameter is specified, store post installation artifact to offline mirror. The post install artifact should include all files and directories after running installation scripts, but exclude node_modules subdirectory. | ||
|
||
- For the _install_ command, search the post install subdirectory first; if there is no post installation artifact, fall through to the existing work flow. | ||
|
||
- When installing a post install artifact from the offline mirror: | ||
- extract the tar.gz file in place | ||
- do not run install scripts | ||
- install dependencies and create bin link as usual | ||
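The save and restore steps above can be sketched with a plain tar round trip. This is a sketch under assumptions: the archive root is named `package` (as in npm tarballs), only the package's own top-level `node_modules` is excluded, and `save_post_install` / `restore_post_install` are hypothetical names rather than Yarn functions. Installing dependencies and creating bin links are left out.

```python
import tarfile
from pathlib import Path
from typing import Optional


def _skip_node_modules(info: tarfile.TarInfo) -> Optional[tarfile.TarInfo]:
    """tarfile filter: drop the package's own node_modules subdirectory."""
    parts = Path(info.name).parts
    # parts[0] is the archive root ("package"); parts[1] is a top-level entry.
    if len(parts) > 1 and parts[1] == "node_modules":
        return None
    return info


def save_post_install(package_dir: Path, artifact: Path) -> None:
    """Pack the folder as it looks after install scripts have run."""
    with tarfile.open(artifact, "w:gz") as tar:
        tar.add(package_dir, arcname="package", filter=_skip_node_modules)


def restore_post_install(artifact: Path, dest: Path) -> None:
    """Extract the artifact in place; install scripts are deliberately not run."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(artifact, "r:gz") as tar:
        tar.extractall(dest)
```

Because returning `None` from the filter for a directory also skips everything beneath it, the nested dependencies never enter the archive and are expected to be installed separately, as the last bullet describes.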
|
||
# How We Teach This | ||
|
||
*What names and terminology work best for these concepts and why?* | ||
"post installation artifacts" to distinguish from node modules | ||
"network restricted environments (NRE)" where access to Internet is controlled | ||
"hermetic builds" means all dependencies are included, could be used as synonym for "deterministic builds" | ||
|
||
*How is this idea best presented?* | ||
|
||
- Emphasize the concept of "hermetic builds". | ||
- Emphasize `yarn` usage within "network restricted environments (NRE)". | ||
|
||
*Would the acceptance of this proposal mean the Yarn documentation must be re-organized or altered?* | ||
|
||
No. | ||
|
||
*Does it change how Yarn is taught to new users at any level?* | ||
|
||
This feature should be considered advanced and should be taught specifically to the following users: | ||
- users who need to operate within network restricted environments (NRE) | ||
- users who value deterministic dependency resolution and hermetic builds above convenience | ||
|
||
*How should this feature be introduced and taught to existing Yarn users?* | ||
Explain the intended use case with illustrated work flow. | ||
|
||
# Drawbacks | ||
|
||
|
||
- Deviates somewhat from the current node community norm, so this feature may require additional teaching. | ||
- Added complexity for installation work flow and offline mirror structure | ||
- Depending on directory structure to identify post installation modules | ||
|
||
The following points are drawbacks in some circumstances and advantages in others. | ||
|
||
- Native extension support | ||
If a node module has native extensions, a stored post installation artifact will not work on platforms different from the one where the artifact was created. | ||
|
||
Currently, native extensions are compiled during installation. However, since the compiler and libraries are provided by the running environment, the compilation output is not guaranteed to be repeatable. Given the complexity of supporting node native extensions, a separate RFC is needed to ensure hermetic and repeatable builds for them. | ||
|
||
- Loss of some flexibility | ||
In at least one package (cldr-data), the [build result can be altered by the environment variable $CLDR_COVERAGE](https://github.com/rxaviers/cldr-data-npm/blob/master/install.js#L91). Caching post installation artifacts loses this flexibility. | ||
|
||
- Updates to installation time downloads are ignored / require explicit action | ||
Installation scripts tend to download the latest version of their dependencies. A stored post installation artifact always contains the versions that were downloaded when it was created and thus may not have the latest ones. Updating such installation time downloads requires explicit action from the offline mirror maintainers. | ||
|
||
# Alternatives | ||
As I said above, this RFC goes beyond the concerns of Offline Mirror feature. I think the problem may be solved by caching and sharing a built package in some way.

I think those that operate in a NRE would likely be less concerned about cross platform compatibility

I'm at Red Hat, working in NREs on multiple architectures.
|
||
|
||
The following alternatives have been considered: | ||
|
||
- Do nothing | ||
One may argue that this problem is not worth addressing and that doing nothing is the correct approach: the problem has existed since the inception of npm, and the node community has thrived nonetheless. A second argument is that users in network restricted environments (NRE) are not the intended customers. | ||
However, as adoption of node widens, the inability to run node builds in network restricted environments has been and will continue to be a hurdle for adoption. Not addressing this problem is no longer a valid option. | ||
|
||
- Work with individual package owners to make sure package installation is hermetic | ||
There are several drawbacks of this approach: | ||
- Installation scripts may serve legitimate purposes in certain circumstances | ||
- Requires significant efforts to educate node module writers | ||
- Working on a per package basis and updating all dependent packages might take a long time for the necessary changes to propagate. | ||
|
||
Npm community is large and free to do anything, so it will be impossible to enforce any kind of behavior. The right thing to do would be for the community members to work with the packages individually to provide ability to be installed while using a mirror (sinopia based mirrors have the same problem) and without Internet access: raise issues, send PRs, fork.

This may be a dumb question but why do some npm packages need internet access to be install? Why can't they hold all needed information within the package itself? (aside from defined dependencies)

Phantomjs, for example, actually downloads its platform-specific binary upon npm module installation. The npm module is just a wrapper. I suppose it could package up each target platform/architecture binary and only configure the intended one for that runtime.

Although I agree that the right thing to do is to work with the package owner to remove network dependencies, the process has been proven as slow and sometimes unresponsive. We not only need to work with the owner of the package in question, in some case, we need to work with dependents package and dependents of dependents as well.

That is the default assumption - a package is released "as is" and I think it is an exception when a package author has time to support more use cases.
|
||
# Unresolved questions | ||
|
||
|
Offline mirror kicks in at `fetch` phase. After the .tgz file is extracted into global cache folder `link` phase starts. During `link` phase files are copied from cache into node_modules, considering hoisting, and then `lifecycle scripts` are executed that modify some files on those node_modules. You would have to generate a new .tgz file for each package folder that got modified after `lifecycle scripts` phase, disabling their `lifecycle scripts`, and then modify yarn.lock file to point to the new .tgz file. That could be quite complex to implement without bringing too much complexity into Yarn.

It sounds like your suggestion is to use a separate cache not related to offline mirror to store those artifacts?

Yes, I think this will make the offline mirror cache too confusing.
The idea of offline mirror cache is that it stores the file as it was downloaded from a remote repository, this RFC adds a lot of new conditions