Race condition with Image Loader makes it unusable in a production environment #23436

Closed
kmdev1 opened this issue Mar 26, 2021 · 11 comments · Fixed by #24000

@kmdev1

kmdev1 commented Mar 26, 2021

What version of Next.js are you using?

10.0.9

What version of Node.js are you using?

v12.19.0, v14.16.0

What browser are you using?

Any browser

What operating system are you using?

Any OS

How are you deploying your application?

AWS

Describe the Bug

The default image loader currently appears to suffer from a race condition. Unfortunately, this means that for us this otherwise pretty awesome functionality cannot be used at all.

Assuming the image cache does not contain the given image, when a request is made the loader will go off and do its thing, then return the given image and add it to its cache. The problem comes when you have real traffic.

Example scenario:

Let's assume you have a simple page with 3 images that are utilising next/image. The page has average traffic of 10 req/s for a given screen size. The page is brand new and the images have never been served before (i.e. the images do not yet exist in the cache).

Currently, when the above scenario happens, around 30 calls to the original image URL are made. Each of these images then resolves, gets transformed, is placed into the cache and is returned to the user. The important point is that we now have a cache folder containing 10 cached copies of each of the three images.

When we replicate the above scenario in more realistic terms (12 images on a new page with average traffic of 25 req/s), we suddenly have a cache folder that contains 25 copies of each of the 12 images (300 in total). This is because when the first request comes in, the code says "hey, you're not in the cache, let's go get you". The second request comes in and says "hey, you're not in the cache" (because the first one is still processing), "let's go get you", and so on.

So why does this cause the container to die? Disk I/O on its own doesn't; but writing to and reading from disk is not the only thing the loader is doing. HTTP requests, filesystem access and image manipulation are heavy not only on memory but on CPU too.

In the above example we're asking a container not only to fetch 300 images per second, but also to manipulate them, write them to file storage and serve them back. By the time these have finished, the next second comes around with another 300 requests. The images probably still aren't in the cache because CPU and memory usage have rocketed, so things back up and get worse and worse until the container falls over completely.

TL;DR:

Requests for non-cached images are not aware of each other, which means there's a race condition whenever the same image is requested multiple times during the period in which it is not yet in the cache. With enough traffic this spirals into an inescapable loop that leads to the containers falling over.

Expected Behavior

I'd expect that when parallel requests are made, the image loader is only called once, sitting behind some sort of batching mechanism (similar to DataLoader) that removes duplicates, is aware of parallel requests and maps the responses accordingly. This means the site could receive 500 requests per second for the same image but only do all of the hardware-intensive work once (a rough sketch of this idea follows).
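
A minimal sketch of that kind of in-flight deduplication (assuming a hypothetical `optimizeImage` function that fetches, transforms and writes a single image to the cache; this is not the actual Next.js implementation):

```js
// In-flight request deduplication, similar in spirit to DataLoader.
// Sketch only - `optimizeImage` is a hypothetical async function that does the
// heavy work (fetch, transform, write to cache) for a single cache key.
const inFlight = new Map()

async function getOptimizedImage(cacheKey, optimizeImage) {
  // If the same image (same URL/width/quality) is already being processed,
  // reuse that promise instead of starting the heavy work again.
  const existing = inFlight.get(cacheKey)
  if (existing) return existing

  const promise = optimizeImage(cacheKey).finally(() => {
    // Remove the entry once the work settles so a later cache miss
    // (e.g. after expiry) can start fresh.
    inFlight.delete(cacheKey)
  })
  inFlight.set(cacheKey, promise)
  return promise
}
```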

To Reproduce

  • Create simple page with a single (or multiple) image(s) utilising next/image
  • Deploy/run the container so you can see memory/cpu usage
  • Start simultaneous requests
  • Notice Memory/CPU usage climbing
  • Notice cache folder starting to drastically grow in size
  • If sustained, hardware limits will eventually be hit
@kmdev1 kmdev1 added the `bug` label Mar 26, 2021
@goguda
Contributor

goguda commented Mar 28, 2021

I assume this is related to the same thing I'm experiencing with image optimization on anything higher than Next.js v10.0.7. I've tried both v10.0.8 and v10.0.9 but any images that are not in cache seem to cause a memory leak during conversion and memory usage rapidly climbs until my server just runs out of memory and crashes. Conversion also seems much slower overall than previously. Obviously this isn't suitable for production.

I've downgraded back to v10.0.7 for now and I can see memory usage go up by about 10-15MB during a request that would cause several conversions, then the memory gets released like it should. Seems to be a change somewhere after v10.0.7.

@kmdev1
Author

kmdev1 commented Mar 28, 2021

Hey @goguda,

So I think the issue you mention could exacerbate the issue I've raised; however, the race condition is unfortunately a problem in itself.

Maybe you'll also notice the same thing as me: when running 10.0.7 locally, do you have multiple cached images that are exactly the same, just with different expiry timestamps (the timestamp is the first part of the file name)?

To replicate the traffic locally I used a tool called k6 - this simulated 15 users over 10 seconds and I was able to see my cache already being populated with duplicates. A minimal script along those lines is shown below.
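
Something like this (the host and image query parameters here are placeholders, not from the original report):

```js
// k6 load script: 15 virtual users for 10 seconds, all requesting the same
// optimized image. The host and query parameters below are placeholders.
import http from 'k6/http'

export const options = { vus: 15, duration: '10s' }

export default function () {
  http.get('http://localhost:3000/_next/image?url=%2Fexample.jpg&w=1080&q=75')
}
```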

As a side effect of all this over-processing, there was definitely a hit on hardware for me.

@atom258

atom258 commented Mar 30, 2021

In version 10.1.1, the memory consumption is still too high

kodiakhq bot pushed a commit that referenced this issue Mar 31, 2021
This PR fixes the problem that the image optimization API uses a large amount of memory which is not correctly freed afterwards. There are multiple causes of this problem:

### 1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case, there will be `N*6` WASM instances created (N is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there will be at most 6 workers, and the same type of operations will always be assigned to the same worker. With this change, 6 WASM instances will be created in the worst case.
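
Illustratively (a sketch of that assignment idea only, not the code in this PR; `createWorker` stands in for whatever spawns a worker thread for an operation):

```js
// Sketch of "the same operation type always goes to the same worker".
// `createWorker` is a caller-supplied factory, e.g. something that spawns a
// worker thread and instantiates the matching WASM module.
const OPERATIONS = ['decode', 'resize', 'rotate', 'encodeJpeg', 'encodePng', 'encodeWebp']
const workers = new Map()

function getWorkerFor(operation, createWorker) {
  if (!OPERATIONS.includes(operation)) {
    throw new Error(`Unknown operation: ${operation}`)
  }
  // Lazily create at most one worker per operation type, so at most six
  // WASM instances exist regardless of how many CPU cores the host has.
  if (!workers.has(operation)) {
    workers.set(operation, createWorker(operation))
  }
  return workers.get(operation)
}
```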

### 2. WebAssembly memory can't be deallocated

It's known that [WebAssembly can't simply deallocate its memory as of today](https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules that we are using, they're not very suitable for long-running use; they're more like one-off tools. Each operation like resize allocates **new memory** to store its data, so memory use increases quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initializations, so a new WASM module will be created each time and the old module will be GC'd entirely as there's no reference to it. That's the easiest (and, AFAIK, only) way to free the memory used by a WASM module.
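
Conceptually, that change looks something like the sketch below (illustrative only; `instantiateWasm` is a hypothetical initializer, not the actual module code):

```js
// Before (conceptual): the module was memoized for the process lifetime,
// so its WASM memory could never be reclaimed:
//   const getModule = execOnce(() => instantiateWasm())
//
// After (conceptual): a fresh module per operation; once the call returns,
// nothing references the module, so the GC can reclaim its memory.
async function resizeImage(instantiateWasm, buffer, width) {
  const wasmModule = await instantiateWasm() // hypothetical initializer
  return wasmModule.resize(buffer, width)    // no reference is kept afterwards
}
```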

### 3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables like `cachegetUint8Memory0` and `wasm` that always hold the WASM memory as a reference. We need to manually clean them up after finishing each operation. 

This PR ensures that these variables will be deleted so the memory overhead can go back to 0 when an operation is finished.

### 4. Memory leak inside event listeners

`emscripten` generates code with global error listener registration (without cleaning them up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listener has references to the WASM instance directly or indirectly: https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192 (`e`, `y`, `r`).

That means whenever a WASM module is created (via emscripten), its memory will be kept alive by the global scope. And when we replace the WASM module with a new one, the new listener will be added as well while the old one is still referenced, which causes a leak.

Since we're running them inside worker threads (which will retry on fail), this PR simply removes these listeners.
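
For illustration only (not the PR's exact change), the effect of dropping that registration is roughly the same as unregistering the listeners so that nothing in the global scope keeps the WASM instance reachable; the handler names here are hypothetical:

```js
// Sketch only. The emscripten glue typically registers process-level error
// listeners; removing them means they no longer hold the WASM instance
// (directly or indirectly) after the module is replaced.
function removeEmscriptenErrorListeners(handlers) {
  // `handlers` is a hypothetical map of the listeners the glue registered,
  // e.g. { uncaughtException: fn, unhandledRejection: fn }.
  for (const [event, handler] of Object.entries(handlers)) {
    process.removeListener(event, handler)
  }
}
```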

### Test

Here are some statistics showing that these changes have improved memory usage a lot (the app I'm using to test has one page with 20 high-res PNGs):

Before this PR (`next@10.1.0`):

<img src="https://user-images.githubusercontent.com/3676859/113058480-c3496100-91e0-11eb-9e5a-b325e484adac.png" width="500">

Memory went from ~250MB to 3.2GB (peak: 3.5GB) and never decreased again.

With fix 1 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059060-921d6080-91e1-11eb-8ac6-83c70c1f2f75.png" width="500">

Memory went from ~280MB to 1.5GB (peak: 2GB).

With fix 1+2 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059207-bf6a0e80-91e1-11eb-845a-870944f9e116.png" width="500">

Memory went from ~280MB to 1.1GB (peak: 1.6GB).

With fix 1+2+3+4 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059362-ec1e2600-91e1-11eb-8d9a-8fbce8808802.png" width="500">

It's back to normal; memory changed from ~300MB to ~480MB, peaked at 1.2GB. You can clearly see that GC is working correctly here.

---

## Bug

- [x] Related issues #23189, #23436
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
@kmdev1
Author

kmdev1 commented Mar 31, 2021

Hey @shuding,

I have just seen your comment on the other issue about memory usage. I think the issues that I raise above are related. I believe this still exists in the canary version (unless I'm missing something 😄) - the more concurrent traffic I see, the more memory is used, pretty consistently.

Are you able to replicate?

@shuding
Member

shuding commented Mar 31, 2021

Hi @kmdev1, yes I think these two issues are possibly related, and I'm now trying to reproduce this one. Just to confirm: you are still able to see the high memory usage in the latest canary with concurrent traffic, right? I'm asking because in the same situation memory usage will be lower with the fix in #23565.

@PranavSathy

PranavSathy commented Mar 31, 2021

For what it is worth, our organization is on NextJS 10.0.5 and seeing spikes up to 100% memory utilization and then instance cycling due to OOM on Cloud Run. This started happening once we enabled long-running instances (previously we cycled continuously).

#23188

One other option seems to be moving to the new canary release with @shuding's fix:

https://github.com/vercel/next.js/releases/tag/v10.1.3-canary.0

UPDATE: It turns out that after we upgraded to 10.1.0, realized this issue and reverted, our package.json still had ^10.0.5, so it still resolved to 10.1.0 due to semver rules, and thus our container image caching still used the build image with 10.1.0.

@kmdev1
Author

kmdev1 commented Apr 1, 2021

Hey @shuding,

Great piece of work there 😄 - the canary release does indeed fix the memory overload, and usage now seems to stay similar to that of 10.0.7, with no real noticeable difference between the two.

Of course I know your work doesn't cater for this ticket, so I'll also drop a line on the original issue confirming what I've seen.

@shuding shuding added the `kind: bug` label and removed the `bug` label Apr 2, 2021
@shuding shuding self-assigned this Apr 2, 2021
dazedbear added a commit to dazedbear/dazedbear.github.io that referenced this issue Apr 5, 2021
@kmdev1
Author

kmdev1 commented Apr 9, 2021

Hey @shuding,

I was just wondering if you'd started any work on this, as I'll probably have a bit of free time this weekend where I could help out if you haven't started already? 😄

@shuding
Member

shuding commented Apr 12, 2021

Sorry @kmdev1 that I missed your comment, and yes I'm already working on a PR for this issue 😄

@timneutkens timneutkens added this to the Iteration 19 milestone Apr 12, 2021
@kmdev1
Author

kmdev1 commented Apr 12, 2021

No worries at all @shuding! Awesome thank you very much my friend! If there's anything I can do to help just let me know 👍🏻

@kodiakhq kodiakhq bot closed this as completed in #24000 Apr 12, 2021
kodiakhq bot pushed a commit that referenced this issue Apr 12, 2021
This PR adds deduplication logic to the image optimizer so the cache will be reused as much as possible when there are multiple concurrent requests.

Fixes #23436.

## Bug

- [x] Related issues linked using `fixes #number`
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
SokratisVidros pushed a commit to SokratisVidros/next.js that referenced this issue Apr 20, 2021
SokratisVidros pushed a commit to SokratisVidros/next.js that referenced this issue Apr 20, 2021
flybayer pushed a commit to blitz-js/next.js that referenced this issue Apr 29, 2021
@balazsorban44
Member

This issue has been automatically locked due to no recent activity. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.

@vercel vercel locked as resolved and limited conversation to collaborators Jan 28, 2022