
Fix memory leak in image optimization #23565

Merged

merged 1 commit into vercel:canary from shuding:fix-image-memory on Mar 31, 2021

Conversation

@shuding (Member) commented Mar 30, 2021

This PR fixes the problem that the image optimization API uses a large amount of memory that is not correctly freed afterwards. There are multiple causes of this problem:

1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case, there will be `N*6` WASM instances created (where `N` is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there are at most 6 workers, and the same type of operation is always assigned to the same worker. With this change, at most 6 WASM instances are created in the worst case.
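As an illustration, here is a minimal sketch of that routing idea (the file name `image-op-worker.js` and the map-based pool are hypothetical, not the actual Next.js implementation):

```ts
// Sketch: one dedicated worker per operation type, so at most
// OPERATIONS.length WASM instances exist at any time.
import { Worker } from 'worker_threads'

const OPERATIONS = [
  'decode', 'resize', 'rotate', 'encodeJpeg', 'encodePng', 'encodeWebp',
] as const
type Operation = (typeof OPERATIONS)[number]

const workers = new Map<Operation, Worker>()

// Always route an operation to the same worker, creating it lazily.
function workerFor(op: Operation): Worker {
  let worker = workers.get(op)
  if (!worker) {
    // Hypothetical worker entry point that instantiates only the
    // WASM module needed for `op`.
    worker = new Worker('./image-op-worker.js', { workerData: { operation: op } })
    workers.set(op, worker)
  }
  return worker
}
```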

2. WebAssembly memory can't be deallocated

It's known that WebAssembly can't simply deallocate its memory as of today (see https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules we are using, they're not well suited to long-running use; they're designed more for one-off use. Each operation such as resize allocates new memory to store its data, so memory use grows quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initialization, so a new WASM module is created each time and the old one is GC'd entirely once nothing references it. AFAIK that's the simplest (and only) way to free the memory used by a WASM module.
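A minimal sketch of the change, assuming a hypothetical `initWasmModule()` initializer (the real code lives in the squoosh wrappers):

```ts
// The old pattern: cache the module forever, which pins its linear memory.
function execOnce<T>(fn: () => T): () => T {
  let result: T | undefined
  return () => (result ??= fn())
}

declare function initWasmModule(): Promise<{ resize(input: Buffer): Buffer }>

// Before: a single shared instance whose memory only ever grows.
const getWasm = execOnce(initWasmModule)

// After: a fresh instance per operation; once `wasm` goes out of scope,
// the whole module (including its memory) becomes eligible for GC.
async function resize(input: Buffer): Promise<Buffer> {
  const wasm = await initWasmModule()
  return wasm.resize(input)
}
```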

3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables like `cachegetUint8Memory0` and `wasm` that always hold a reference to the WASM memory. We need to manually clean them up after finishing each operation.

This PR ensures that these variables will be deleted so the memory overhead can go back to 0 when an operation is finished.
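The cleanup looks roughly like this (a sketch based on the variable names from the generated glue code; the `cleanup()` helper itself is hypothetical):

```ts
// Globals emitted by wasm-bindgen's generated glue code: both keep the
// WebAssembly memory reachable for as long as they are set.
let wasm: any
let cachegetUint8Memory0: Uint8Array | null = null

// Hypothetical helper: drop every reference after an operation finishes
// so the instance and its memory can be garbage-collected.
export function cleanup(): void {
  cachegetUint8Memory0 = null
  wasm = undefined
}
```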

4. Memory leak inside event listeners

`emscripten` generates code that registers global error listeners (without ever cleaning them up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listener has references to the WASM instance, directly or indirectly: https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192 (`e`, `y`, `r`).

That means whenever a WASM module is created (with emscripten), its memory is kept alive by the global scope. And when we replace the WASM module with a new one, the new listeners are added again while the old ones are still referenced, which causes a leak.

Since we're running them inside worker threads (which retry on failure), this PR simply removes these listeners.
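Conceptually, the generated registration and the removal look like this (a sketch; the handler body is illustrative):

```ts
// emscripten's Node glue registers process-wide listeners like these and
// never removes them; each handler closes over the WASM instance:
const onError = (err: unknown): void => {
  // ...references the module's internals, keeping its memory alive
  console.error(err)
}
process.on('uncaughtException', onError)
process.on('unhandledRejection', onError)

// Because the module runs in a worker thread that retries on failure,
// the listeners can simply be dropped again:
process.removeListener('uncaughtException', onError)
process.removeListener('unhandledRejection', onError)
```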

Test

Here are some statistics showing that these changes have improved memory usage a lot (the app I'm using to test has one page of 20 high-res PNGs):

Before this PR (`next@10.1.0`):

Memory went from ~250MB to 3.2GB (peak: 3.5GB) and never decreased again.

With fix 1 applied:

Memory went from ~280MB to 1.5GB (peak: 2GB).

With fix 1+2 applied:

Memory went from ~280MB to 1.1GB (peak: 1.6GB).

With fix 1+2+3+4 applied:

It's back to normal; memory changed from ~300MB to ~480MB, peaked at 1.2GB. You can clearly see that GC is working correctly here.


Bug

  • Related issues #23189, #23436
  • Integration tests added

Feature

  • Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
  • Related issues linked using fixes #number
  • Integration tests added
  • Documentation added
  • Telemetry added. In case of a feature if it's used or not.

Documentation / Examples

  • Make sure the linting passes

@ijjk (Member) commented Mar 30, 2021

Stats from current PR

Default Server Mode (Increase detected ⚠️)

General (Overall decrease ✓)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| buildDuration | 13s | 13.1s | ⚠️ +75ms |
| nodeModulesSize | 45.9 MB | 45.9 MB | -295 B |

Page Load Tests (Overall increase ✓)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| / failed reqs | 0 | 0 |  |
| / total time (seconds) | 2.325 | 2.31 | -0.02 |
| / avg req/sec | 1075.06 | 1082.34 | +7.28 |
| /error-in-render failed reqs | 0 | 0 |  |
| /error-in-render total time (seconds) | 1.593 | 1.59 | 0 |
| /error-in-render avg req/sec | 1569.36 | 1572.24 | +2.88 |

Client Bundles (main, webpack, commons)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| 677f882d2ed8..7edd.js gzip | 13.4 kB | 13.4 kB |  |
| framework.HASH.js gzip | 39 kB | 39 kB |  |
| main-HASH.js gzip | 7.12 kB | 7.12 kB |  |
| webpack-HASH.js gzip | 751 B | 751 B |  |
| Overall change | 60.2 kB | 60.2 kB |  |

Legacy Client Bundles (polyfills)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| polyfills-HASH.js gzip | 31.3 kB | 31.3 kB |  |
| Overall change | 31.3 kB | 31.3 kB |  |

Client Pages

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| _app-8fbabfc..6440.js gzip | 1.28 kB | 1.28 kB |  |
| _error-af59f..582f.js gzip | 3.46 kB | 3.46 kB |  |
| amp-9716187d..0aa8.js gzip | 536 B | 536 B |  |
| hooks-107e90..74c7.js gzip | 888 B | 888 B |  |
| index-ac435c..ecf2.js gzip | 227 B | 227 B |  |
| link-c31053f..c329.js gzip | 1.64 kB | 1.64 kB |  |
| routerDirect..dc9d.js gzip | 303 B | 303 B |  |
| withRouter-6..0e02.js gzip | 302 B | 302 B |  |
| Overall change | 8.64 kB | 8.64 kB |  |

Client Build Manifests

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| _buildManifest.js gzip | 370 B | 370 B |  |
| Overall change | 370 B | 370 B |  |

Rendered Page Sizes

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| index.html gzip | 612 B | 612 B |  |
| link.html gzip | 620 B | 620 B |  |
| withRouter.html gzip | 607 B | 607 B |  |
| Overall change | 1.84 kB | 1.84 kB |  |

Serverless Mode (Decrease detected ✓)

General (Overall decrease ✓)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| buildDuration | 15.2s | 14.9s | -324ms |
| nodeModulesSize | 45.9 MB | 45.9 MB | -295 B |

Client Bundles (main, webpack, commons)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| 677f882d2ed8..7edd.js gzip | 13.4 kB | 13.4 kB |  |
| framework.HASH.js gzip | 39 kB | 39 kB |  |
| main-HASH.js gzip | 7.12 kB | 7.12 kB |  |
| webpack-HASH.js gzip | 751 B | 751 B |  |
| Overall change | 60.2 kB | 60.2 kB |  |

Legacy Client Bundles (polyfills)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| polyfills-HASH.js gzip | 31.3 kB | 31.3 kB |  |
| Overall change | 31.3 kB | 31.3 kB |  |

Client Pages

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| _app-8fbabfc..6440.js gzip | 1.28 kB | 1.28 kB |  |
| _error-af59f..582f.js gzip | 3.46 kB | 3.46 kB |  |
| amp-9716187d..0aa8.js gzip | 536 B | 536 B |  |
| hooks-107e90..74c7.js gzip | 888 B | 888 B |  |
| index-ac435c..ecf2.js gzip | 227 B | 227 B |  |
| link-c31053f..c329.js gzip | 1.64 kB | 1.64 kB |  |
| routerDirect..dc9d.js gzip | 303 B | 303 B |  |
| withRouter-6..0e02.js gzip | 302 B | 302 B |  |
| Overall change | 8.64 kB | 8.64 kB |  |

Client Build Manifests

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| _buildManifest.js gzip | 370 B | 370 B |  |
| Overall change | 370 B | 370 B |  |

Serverless bundles

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| _error.js | 1.36 MB | 1.36 MB |  |
| 404.html | 2.67 kB | 2.67 kB |  |
| 500.html | 2.65 kB | 2.65 kB |  |
| amp.amp.html | 10.7 kB | 10.7 kB |  |
| amp.html | 1.86 kB | 1.86 kB |  |
| hooks.html | 1.92 kB | 1.92 kB |  |
| index.js | 1.36 MB | 1.36 MB |  |
| link.js | 1.42 MB | 1.42 MB |  |
| routerDirect.js | 1.41 MB | 1.41 MB |  |
| withRouter.js | 1.41 MB | 1.41 MB |  |
| Overall change | 6.99 MB | 6.99 MB |  |

Webpack 5 Mode (Decrease detected ✓)

General (Overall decrease ✓)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| buildDuration | 15.6s | 15.6s | ⚠️ +7ms |
| nodeModulesSize | 45.9 MB | 45.9 MB | -295 B |

Page Load Tests (Overall decrease ⚠️)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| / failed reqs | 0 | 0 |  |
| / total time (seconds) | 2.311 | 2.317 | ⚠️ +0.01 |
| / avg req/sec | 1081.92 | 1078.77 | ⚠️ -3.15 |
| /error-in-render failed reqs | 0 | 0 |  |
| /error-in-render total time (seconds) | 1.552 | 1.646 | ⚠️ +0.09 |
| /error-in-render avg req/sec | 1610.55 | 1519.26 | ⚠️ -91.29 |

Client Bundles (main, webpack, commons)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| 597-e27c5352..db8c.js gzip | 13.3 kB | 13.3 kB |  |
| 778-a4568938..e1f5.js gzip | 7.04 kB | 7.04 kB |  |
| framework.HASH.js gzip | 39.3 kB | 39.3 kB |  |
| main-HASH.js gzip | 151 B | 151 B |  |
| webpack-HASH.js gzip | 993 B | 993 B |  |
| Overall change | 60.8 kB | 60.8 kB |  |

Legacy Client Bundles (polyfills)

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| polyfills-HASH.js gzip | 31.1 kB | 31.1 kB |  |
| Overall change | 31.1 kB | 31.1 kB |  |

Client Pages

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| _app-5cc66b2..6f03.js gzip | 1.3 kB | 1.3 kB |  |
| _error-b58c1..9b8e.js gzip | 3.4 kB | 3.4 kB |  |
| amp-89a5460c..567f.js gzip | 558 B | 558 B |  |
| hooks-8c2e74..be37.js gzip | 924 B | 924 B |  |
| index-fec729..83b2.js gzip | 243 B | 243 B |  |
| link-dd34d9b..0ade.js gzip | 1.66 kB | 1.66 kB |  |
| routerDirect..5759.js gzip | 336 B | 336 B |  |
| withRouter-1..98bf.js gzip | 334 B | 334 B |  |
| Overall change | 8.76 kB | 8.76 kB |  |

Client Build Manifests

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| _buildManifest.js gzip | 349 B | 349 B |  |
| Overall change | 349 B | 349 B |  |

Rendered Page Sizes

|  | vercel/next.js canary | shuding/next.js fix-image-memory | Change |
| --- | --- | --- | --- |
| index.html gzip | 610 B | 610 B |  |
| link.html gzip | 616 B | 616 B |  |
| withRouter.html gzip | 604 B | 604 B |  |
| Overall change | 1.83 kB | 1.83 kB |  |

Diffs

Diff for index.html

```diff
@@ -43,7 +43,7 @@
         "props": { "pageProps": {} },
         "page": "/",
         "query": {},
-        "buildId": "MxEAJG5MAjHKhxXaLeDr9",
+        "buildId": "_tyFtaSES-ZJzwYweQSJC",
         "isFallback": false,
         "gip": true
       }
@@ -77,11 +77,11 @@
       async=""
     ></script>
     <script
-      src="/_next/static/MxEAJG5MAjHKhxXaLeDr9/_buildManifest.js"
+      src="/_next/static/_tyFtaSES-ZJzwYweQSJC/_buildManifest.js"
       async=""
     ></script>
     <script
-      src="/_next/static/MxEAJG5MAjHKhxXaLeDr9/_ssgManifest.js"
+      src="/_next/static/_tyFtaSES-ZJzwYweQSJC/_ssgManifest.js"
       async=""
     ></script>
   </body>
```

Diff for link.html

```diff
@@ -48,7 +48,7 @@
         "props": { "pageProps": {} },
         "page": "/link",
         "query": {},
-        "buildId": "MxEAJG5MAjHKhxXaLeDr9",
+        "buildId": "_tyFtaSES-ZJzwYweQSJC",
         "isFallback": false,
         "gip": true
       }
@@ -82,11 +82,11 @@
       async=""
     ></script>
     <script
-      src="/_next/static/MxEAJG5MAjHKhxXaLeDr9/_buildManifest.js"
+      src="/_next/static/_tyFtaSES-ZJzwYweQSJC/_buildManifest.js"
       async=""
     ></script>
     <script
-      src="/_next/static/MxEAJG5MAjHKhxXaLeDr9/_ssgManifest.js"
+      src="/_next/static/_tyFtaSES-ZJzwYweQSJC/_ssgManifest.js"
       async=""
     ></script>
   </body>
```

Diff for withRouter.html

```diff
@@ -43,7 +43,7 @@
         "props": { "pageProps": {} },
         "page": "/withRouter",
         "query": {},
-        "buildId": "MxEAJG5MAjHKhxXaLeDr9",
+        "buildId": "_tyFtaSES-ZJzwYweQSJC",
         "isFallback": false,
         "gip": true
       }
@@ -77,11 +77,11 @@
       async=""
     ></script>
     <script
-      src="/_next/static/MxEAJG5MAjHKhxXaLeDr9/_buildManifest.js"
+      src="/_next/static/_tyFtaSES-ZJzwYweQSJC/_buildManifest.js"
       async=""
     ></script>
     <script
-      src="/_next/static/MxEAJG5MAjHKhxXaLeDr9/_ssgManifest.js"
+      src="/_next/static/_tyFtaSES-ZJzwYweQSJC/_ssgManifest.js"
       async=""
     ></script>
   </body>
```
Commit: 38e894e

@shuding marked this pull request as ready for review March 30, 2021 21:31
@kodiakhq (bot) merged commit 7adfce2 into vercel:canary Mar 31, 2021
@gu-stav commented Mar 31, 2021

npm didn't pick up the new release yet. Excited to test and upgrade ✨

@shuding deleted the fix-image-memory branch March 31, 2021 09:29
timcole added a commit to spaceflight-live/comet that referenced this pull request Apr 18, 2021
- Switched because vercel/next.js#23189 has
  been fixed there
  - vercel/next.js#23565
SokratisVidros pushed a commit to SokratisVidros/next.js that referenced this pull request Apr 20, 2021
flybayer pushed a commit to blitz-js/next.js that referenced this pull request Apr 29, 2021
@styfle mentioned this pull request Jun 28, 2021
kodiakhq bot pushed a commit that referenced this pull request Oct 6, 2021
Bump `squoosh` to the latest version, currently commit [cad09160](GoogleChromeLabs/squoosh@cad0916).

Ideally, we would use the version published to npm but it hasn't been published in [two months](https://www.npmjs.com/package/@squoosh/lib?activeTab=versions) and we have a patch (#23565) that isn't available upstream.

This also is a precursor to getting support for AVIF.

- Fixes #27092
- Fixes #26527 
- Reapplies the patch from #23565
@vercel locked as resolved and limited conversation to collaborators Jan 28, 2022