Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lazyHashes option to define integrity hashes only in direct parents of assets #172

Merged
merged 25 commits into from
Jan 9, 2022

Conversation

MLoughry
Copy link
Contributor

Adds new option lazyHashes (name subject to change), which controls new behavior that creates a directed acyclic graph of the chunk graph by reducing cycles to strongly-connected components. The DAG is then topologically sorted such that the integrity hashes are calculated from the leaf nodes up, so the hash of the assets with child integrity hashes are not invalidated.

I did some minor refactoring in the process, but the only behavioral change for the default configuration was moving the hashes from __webpack_require__.sriHashes to ${globalObject}.sriHashes. In testing, I ran into issues where async chunks could not access __webpack_require__.

I still need to write tests, but the holidays are approaching, and I will be OOF for the last two weeks of the year. I wanted to make sure I got this out for the maintainers to look at before I disappeared for a while. I have verified that this does work in my group's project with a very large chunk graph (>1.5k chunks).

@MLoughry MLoughry marked this pull request as ready for review December 16, 2021 19:46
@MLoughry
Copy link
Contributor Author

Tests added, should be ready for review.

Copy link
Collaborator

@jscheid jscheid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @MLoughry, amazing work, thanks so much for this!

I've left a few comments with just minor tweaks to the documentation, and there are a few CI issues that need tending to.

I haven't yet had a chance to look in more depth at the actual code, I'll try and make time for that in the next few days.

webpack-subresource-integrity/README.md Outdated Show resolved Hide resolved
examples/lazy-hashes-cycles/README.md Outdated Show resolved Hide resolved
examples/lazy-hashes-simple/README.md Outdated Show resolved Hide resolved
MLoughry and others added 5 commits December 19, 2021 18:26
@MLoughry
Copy link
Contributor Author

Thanks. I think I resolved all the issues from the workflow, but I can't re-trigger it myself

@jscheid
Copy link
Collaborator

jscheid commented Dec 20, 2021

It shouldn't be too hard for you to run the tests locally in Node 12. I've just tested this, along the lines of:

git clone https://github.com/MLoughry/webpack-subresource-integrity.git
cd webpack-subresource-integrity
nvm use 12
yarn
yarn check # works
git checkout lazy-hashes
yarn check # doesn't

Let me know if you have any troubles with that process.

@jscheid
Copy link
Collaborator

jscheid commented Dec 20, 2021

I did some minor refactoring in the process, but the only behavioral change for the default configuration was moving the hashes from __webpack_require__.sriHashes to ${globalObject}.sriHashes. In testing, I ran into issues where async chunks could not access __webpack_require__.

Even if all tests still pass, I'm a bit uneasy about this change. Tobias Koppers had recommended to use __webpack_require__ here and here. If we have to make this change I'd prefer if we could get his blessings first.

Before I try to summon him, could you have another go at it and -- if there's no other workaround for the problem with async chunks that you've mentioned -- explain the problem in a bit more detail, perhaps with an example?

EDIT: it would also be nice if we didn't have to pollute the global object. And this would potentially open up a new attack vector, where an XSS might overwrite hashes. I haven't thought through the threat model fully yet but it doesn't feel good to expose this globally.

@MLoughry
Copy link
Contributor Author

Before I try to summon him, could you have another go at it and -- if there's no other workaround for the problem with async chunks that you've mentioned -- explain the problem in a bit more detail, perhaps with an example?

Unless there's a way to introduce a runtime module in async chunks, I'm not sure how to access __webpack_require__ in the chunks when adding the new hashes. As you can see from the (prettier-ified) output of the simple lazy hashes example, the code adds the hashes in the global scope before the chunk modules:

Object.assign(self.sriHashes, {
  493: "sha384-vAyVFrkhZk9hl1bkA2pImobnoaEEQojEZ2uX7uG40DjfzjWid83WONnaDpp/IQq+",
}),
  (self.webpackChunklazy_hashes_simple =
    self.webpackChunklazy_hashes_simple || []).push([
    [574],
    {
      574: (s, e, h) => {
        "use strict";
        h.r(e), h.d(e, { default: () => a }), h.e(493).then(h.bind(h, 493));
        const a = { chunk: 1 };
      },
    },
  ]);

If there were a way to extend the webpackJsonpCallback to accept more arguments (and do what we want with the additional arguments), that could work; however, I don't see that there's a way to do that.

@jscheid
Copy link
Collaborator

jscheid commented Dec 20, 2021

It appears that you can move the Object.assign code into the closure that has access to __webpack_require__ by using the renderModuleContent hook instead of renderContent. I've tried this, tests still pass but I haven't made sure that it actually works at runtime (it would be good to add an end-to-end (puppeteer) test, perhaps similar to the dynamic-modified example.)

Here is a rough patch: https://gist.github.com/jscheid/b03aa1a5bf0421d8e2afdd040e4302ab

@MLoughry
Copy link
Contributor Author

It appears that you can move the Object.assign code into the closure that has access to webpack_require by using the renderModuleContent hook instead of renderContent

Using renderModuleContent would render the sri hashes inside of every single module within a chunk, which is clearly not the desired result.

@MLoughry
Copy link
Contributor Author

We definitely wouldn't want to naively do it in the first module, as I don't think there's any guarantee that it would be executed first. As far as I can tell, webpack doesn't expose information about a specific chunk's entry modules, only the compliation's entry modules (for when you define multiple entry points).

@jscheid
Copy link
Collaborator

jscheid commented Dec 20, 2021

True. Then what do you think about generating a function, maybe called something like addSriHashesForThisChunk, in renderContent and invoking it from renderModuleContent like so: addSriHashesForThisChunk(__webpack_require__)? That would compile down to something like a(b) with minification, which in turn would gzip well even if you have a large number of modules per chunk. There would be a small overhead at runtime but maybe tolerable?

@jscheid
Copy link
Collaborator

jscheid commented Dec 20, 2021

I've pinged @sokra , perhaps he has an idea of how we could do it with zero overhead.

@MLoughry
Copy link
Contributor Author

That's still adding code to every module that would be merging the sri hash objects. Even if it were a marginal size hit, that's a lot of unnecessary processing.

@jscheid
Copy link
Collaborator

jscheid commented Dec 20, 2021

That's still adding code to every module that would be merging the sri hash objects.

You wouldn't have to merge the hash objects more than once, you'd use a guard boolean to only merge them if not done yet, or something like:

var sriHashes = {
  493: "sha384-wGv0P3xRYvSE7KBueqmOLlBkiUwMutW1CYZQYalt/kY8gdnfO0n4vm+szSHWX5iH",
};
function addSriHashes(r) {
  Object.assign(r.sriHashes, sriHashes);
  sriHashes = null;
}

A zero-overhead approach would be preferable, of course, but I'd estimate the size overhead to be a constant 10 bytes or so per chunk (post-gz), and the runtime overhead difficult to measure. For a data point, can you find out easily how many modules your product has per chunk, min/max/avg?

@MLoughry
Copy link
Contributor Author

For a data point, can you find out easily how many modules your product has per chunk, min/max/avg?

From our latest CI build (Which is still on webpack 4, but we're migrating), across 2348 chunks, it ranges between 1 and 1504 modules, with an average of 30.3. For our core scenario chunks (the ones we bother naming and tracking the size of), the average is 178

@jscheid
Copy link
Collaborator

jscheid commented Dec 21, 2021

I think I've found a better approach... or at the very least a lead pointing in the right direction: https://gist.github.com/jscheid/30ea9da1f2538a69b997dc4afb8dee08

Can I leave you to explore it more @MLoughry? Hopefully we'll hear from @sokra as well with some guidance on how to do this properly and future-proof.

@jscheid
Copy link
Collaborator

jscheid commented Dec 21, 2021

@MLoughry I've had a chat with Tobias on Slack and he says the RuntimeModule-based approach is looking good. Based on his advice I've updated the Gist above to use additionalChunkRuntimeRequirements. I think it's an elegant solution now that ticks all the boxes, I hope you agree?

If so, I think the next steps should be the following:

  1. Integrate and polish the RuntimeModule approach, make sure it all works... it might be necessary to postpone the call to generateSriHashPlaceholders (?) and the new class should probably live elsewhere.
  2. There are a few eslint warnings about Forbidden non-null assertion, they are only warnings but could you have a look to see if we can do without the !s?
  3. Add integration tests, we should have at least one E2E test to make sure all chunks load successfully with lazyHashes set and ideally also one to make sure that tampering will cause loading to fail (i.e. ensure that SRI is added.) Let me know if I can help with this, I noticed the wsi-test-helper doesn't have any documentation yet but hopefully is fairly self explanatory when looking at one of the other similar examples.

What do you think, sound good?

@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 3, 2022

Sorry, I unplugged over the holidays. I'll take a look at this today.

@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 3, 2022

I made the changes, and confirmed they work in our repo. However, while it is decreasing the size of the entry bundles, it's massively increasing the size of child bundles (more than off-setting any perf gain). I'll need to look more into this.

@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 3, 2022

Digging into it, the issue is that the sri hashes are duplicated across every chunk within a chunk group. I can't create the graph by chunkGroup, since a given chunk can be in multiple chunk groups, but I need to try to de-dupe hash entries when a different chunk in the same chunk group already has them.

@jscheid
Copy link
Collaborator

jscheid commented Jan 3, 2022

Welcome back @MLoughry, hope you've had a nice break.

Sounds good, that's not ideal of course. Is it caused by the new RuntimeModule approach or was it an issue before?

@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 3, 2022

It's not related to the runtime module approach, I just hadn't noticed before.

My initial local changes to mitigate this is helping a little, but not nearly enough. I need to noodle on this some more.

@jscheid
Copy link
Collaborator

jscheid commented Jan 3, 2022

Even if the issue at hand might be just a bug or missing optimization, as opposed to a fundamental problem, it makes me wonder if it would make sense to have not two, but three modes:

  1. Fully eager: full up-front cost but zero overhead (current behaviour)
  2. Fully lazy: defer loading of hashes to reduce up-front cost as much as possible, at the cost of overhead caused by duplication of hashes and extra runtime code (your current proposal)
  3. Lazy hoisted: move all hashes as far up the dependency tree as necessary to avoid duplication, but no further. You'd still have a small amount of overhead due to the extra runtime code but this might actually be a good new default?

(I'm not suggesting this would need to go into this PR, just thinking out loud.)

Also, it occurred to me that Object.assign(__webpack_require__.sriHashes, {...}) should probably be something like __webpack_require__.sri({...}) in the interest of reducing runtime module overhead as much as possible, where sri would be a helper function defined once in entry chunks.

@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 4, 2022

My latest change completely eliminates duplicate hashes within a chunk group, which brings down the size increase to our async chunks to a tolerable level. However, I want to dig in some more, as I'm seeing a lot of hashes duplicated between our entry chunk (mailindex) and the first async chunk (MailBoot) that shouldn't be happening absent a bug, unless I'm missing something.

@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 4, 2022

Realized the code wasn't taking into account whether a hash already existed in all parent branches to a given chunk, so the hash was being inserted for a child chunk in every chunk group that had that child chunk.

Retooling the code to account for this now.

Probably need to improve the speed/memory perf of the algorithm
@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 4, 2022

Latest commit significantly improves the result (in our case, -100kB from entry chunk, now only +5kB to primary async chunk group), and I've verified it works.

Two remaining work items:

  1. Improve the perf, particularly memory perf, of these changes. I had to increase our webpack build's memory allotment to get it not to crash with these changes (16GB -> 18GB)
  2. See if I can better name the runtime module, especially per-chunk, so our bundle size report can differentiate the different modules

@jscheid
Copy link
Collaborator

jscheid commented Jan 4, 2022

Amazing, that's all very exciting!

I'm conscious of not further expanding the (already sizeable) scope of this change, but could I ask you a favor? As you continue working on it, could you keep an eye out to see how the idea of "lazy hoisted" hashes could best be implemented? I'm not asking you to implement it of course (unless it was trivial), just to keep it in the back of your mind so that later on you could perhaps point me in the right direction. I'm thinking it would make for a better default than what we have currently.

@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 4, 2022

My current changes already have a Chunk -> (Chunks to include in this chunk's manifest) map. It would be trivial to invert that map, find any chunks which have their hash in more than one chunk, then walk up the existing graph to find a common parent (or root node(s))

@jscheid
Copy link
Collaborator

jscheid commented Jan 4, 2022

I guess I was wondering how it could best be implemented with least performance impact, but if you think that's going to be obvious then that's great news.

@MLoughry
Copy link
Contributor Author

MLoughry commented Jan 5, 2022

I think I'm happy with the code, pending your review. There are still a few outstanding ESLint warnings about the null-assertion operator (!), but I feel adding unnecessary null-checks there would just muddy up the code.

Copy link
Collaborator

@jscheid jscheid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks again!

I've added a few suggestions with the aim to remove all uses of the ! postfix operator, with only small impact on readability, or in some cases arguably an improvement in readability.

In the meantime I'll start working on adding one or two end-to-end (Puppeteer) tests.

webpack-subresource-integrity/util.ts Outdated Show resolved Hide resolved
webpack-subresource-integrity/util.ts Show resolved Hide resolved
webpack-subresource-integrity/util.ts Outdated Show resolved Hide resolved
webpack-subresource-integrity/util.ts Outdated Show resolved Hide resolved
webpack-subresource-integrity/util.ts Show resolved Hide resolved
webpack-subresource-integrity/util.ts Outdated Show resolved Hide resolved
webpack-subresource-integrity/index.ts Show resolved Hide resolved
webpack-subresource-integrity/util.ts Show resolved Hide resolved
Copy link
Collaborator

@jscheid jscheid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MLoughry, all looks great! I'm going to merge this now, but before cutting a new release I'm planning to follow up with the following changes:

  • add e2e tests
  • user feedback when hashLoading option isn't a supported value

@jscheid jscheid merged commit fc7cf19 into waysact:main Jan 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Only include hashes for direct child chunks
2 participants