automatically shrink aggregation lists #8657

Merged
merged 2 commits into main from sokra/shrink-memory on Jul 4, 2024
Conversation

@sokra sokra (Member) commented Jul 3, 2024

Description

When unloading tasks, edges from the aggregation structure are removed, but their memory is not reclaimed as Vecs don't shrink automatically.

This adds a shrink_amortized method which shrinks the Vec/HashMap when capacity is 2 times larger than length.

This also calls shrink_to_fit when running GC on a task.
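As an illustration, here is a minimal sketch of this kind of amortized shrinking on a plain `Vec` (the actual change implements it on `AutoMap`; the helper below and its name are assumptions made for illustration only):

```rust
/// Hypothetical stand-alone version of the amortized shrink described above:
/// release spare capacity only when it exceeds 2x the live length, so
/// repeated insert/remove cycles don't thrash the allocator.
fn shrink_amortized_vec<T>(list: &mut Vec<T>) {
    if list.capacity() > list.len() * 2 {
        list.shrink_to_fit();
    }
}

fn main() {
    let mut edges: Vec<u32> = (0..1024).collect();
    edges.truncate(10); // simulate removing most aggregation edges
    println!("before: len={} capacity={}", edges.len(), edges.capacity());
    shrink_amortized_vec(&mut edges);
    println!("after:  len={} capacity={}", edges.len(), edges.capacity());
}
```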

Testing Instructions

vercel bot commented Jul 3, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rust-docs | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 4, 2024 1:14pm |

9 Skipped Deployments

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| examples-basic-web | ⬜️ Ignored (Inspect) | Visit Preview | | Jul 4, 2024 1:14pm |
| examples-designsystem-docs | ⬜️ Ignored (Inspect) | Visit Preview | | Jul 4, 2024 1:14pm |
| examples-gatsby-web | ⬜️ Ignored (Inspect) | Visit Preview | | Jul 4, 2024 1:14pm |
| examples-kitchensink-blog | ⬜️ Ignored (Inspect) | Visit Preview | | Jul 4, 2024 1:14pm |
| examples-native-web | ⬜️ Ignored (Inspect) | Visit Preview | | Jul 4, 2024 1:14pm |
| examples-svelte-web | ⬜️ Ignored (Inspect) | Visit Preview | | Jul 4, 2024 1:14pm |
| examples-tailwind-web | ⬜️ Ignored (Inspect) | Visit Preview | | Jul 4, 2024 1:14pm |
| examples-vite-web | ⬜️ Ignored (Inspect) | Visit Preview | | Jul 4, 2024 1:14pm |
| examples-nonmonorepo | ⬜️ Skipped (Inspect) | | | Jul 4, 2024 1:14pm |

@sokra sokra requested review from bgw and arlyon July 3, 2024 09:47
github-actions bot (Contributor) commented Jul 3, 2024

🟢 Turbopack Benchmark CI successful 🟢

Thanks

github-actions bot (Contributor) commented Jul 3, 2024

⚠️ This change may fail to build next-swc.

Logs

```
packages/next-swc/crates/next-core/src/next_shared/transforms/relay.rs:15:46: error[E0061]: this function takes 2 arguments but 1 argument was supplied
error: could not compile `next-core` (lib) due to 1 previous error
```

See job summary for details

github-actions bot (Contributor) commented Jul 3, 2024

⚠️ CI failed ⚠️

The following steps have failed in CI:

  • Turbopack Rust tests (mac/win, non-blocking)
  • Turbopack Rust benchmark tests (mac/win, non-blocking)

See workflow summary for details

```rust
pub fn shrink_amortized(&mut self) {
    match self {
        AutoMap::List(list) => {
            if list.capacity() > list.len() * 2 {
```
Because HashMap and Vec grow by powers of two, this could lead to situations where we're frequently growing and shrinking the same list every time something is added and removed.

It's worse for memory, but better for time complexity, to grow and shrink at different thresholds. E.g. if `list.capacity() > list.len() * 4`, then shrink to `list.len() * 2`.

> Many dynamic arrays also deallocate some of the underlying storage if its size drops below a certain threshold, such as 30% of the capacity. This threshold must be strictly smaller than 1/a in order to provide hysteresis (provide a stable band to avoid repeatedly growing and shrinking) and support mixed sequences of insertions and removals with amortized constant cost.
>
> — https://en.wikipedia.org/wiki/Dynamic_array#Geometric_expansion_and_amortized_cost

If you think it's worth it, we could intercept inserts here and manipulate the Vec/HashMap to achieve a growth factor other than 2. Many languages use 1.5; Python uses the very low value of 1.125 (!).

It may also be worth adding an arbitrary minimum size below which it's not worth shrinking the Vec (e.g. 4, 8, or 10), to avoid excessive resizes of very small arrays.
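For illustration only, a minimal sketch of that suggestion on a plain `Vec` (the 4x/2x thresholds follow the comment above; the `MIN_CAPACITY` floor and the helper name are assumptions, not code from this PR):

```rust
/// Hypothetical hysteresis-based shrink, sketching the suggestion above:
/// only shrink when capacity exceeds 4x the length, and keep 2x headroom
/// so the next insert does not immediately force a regrow.
const MIN_CAPACITY: usize = 8; // arbitrary floor chosen for this sketch

fn shrink_with_hysteresis<T>(list: &mut Vec<T>) {
    if list.capacity() > MIN_CAPACITY && list.capacity() > list.len() * 4 {
        // `shrink_to` keeps at least the requested capacity, so the list
        // retains some spare room instead of being trimmed to exact length.
        list.shrink_to((list.len() * 2).max(MIN_CAPACITY));
    }
}
```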

@bgw bgw (Member) left a comment

LGTM, as long as the shrink threshold is adjusted to be different than the growth threshold.

@vercel vercel bot temporarily deployed to Preview – examples-nonmonorepo July 4, 2024 13:12 Inactive
@sokra sokra merged commit 0df64d5 into main Jul 4, 2024
49 of 54 checks passed
@sokra sokra deleted the sokra/shrink-memory branch July 4, 2024 13:22
ForsakenHarmony pushed a commit to vercel/next.js that referenced this pull request Jul 25, 2024
ForsakenHarmony pushed a commit to vercel/next.js that referenced this pull request Jul 29, 2024
ForsakenHarmony pushed a commit to vercel/next.js that referenced this pull request Jul 29, 2024
ForsakenHarmony pushed a commit to vercel/next.js that referenced this pull request Aug 1, 2024
bgw added a commit to vercel/next.js that referenced this pull request Nov 1, 2024
… when constructing a cell (#72113)

Cell contents are immutable once constructed, so there's no chance that they'll grow in size again. Common collections can be shrunk to avoid storing empty spare capacity in this case (note: if they're already correctly sized, `shrink_to_fit` bails out early).
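As a hedged illustration of that idea (the real change operates on Turbopack's cell types such as `SharedReference`; the `Cell` struct and `cell_from_vec` helper below are stand-ins invented for this sketch):

```rust
use std::sync::Arc;

/// Stand-in for an immutable cell; not the actual Turbopack type.
struct Cell<T> {
    content: Arc<T>,
}

fn cell_from_vec<T>(mut values: Vec<T>) -> Cell<Vec<T>> {
    // Once stored in the cell the contents never grow again, so any spare
    // capacity would be wasted for the cell's whole lifetime. If the Vec is
    // already exact-sized, shrink_to_fit returns without reallocating.
    values.shrink_to_fit();
    Cell {
        content: Arc::new(values),
    }
}
```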

**Result:** This gives a ~1.4% decrease in top-line peak memory consumption, for a theoretical CPU/Wall time cost that's too small to measure.

**Inspiration:** The inspiration for this was vercel/turborepo#2873, which decreased task storage (not the top-line memory usage?) by ~14%, vercel/turborepo#8657, and other similar optimization PRs.

## Additional Opportunities

- There may be more places where cells are constructed (e.g. persistent storage deserialization) in which a cell's `SharedReference` is built without being captured by this change.
  - Depending on the library used, deserialization may already construct exact-sized collections.
  - As an additional example, manually constructing a `ReadRef` and converting it into a cell skips this optimization because `ReadRef::cell` internally uses the type-erased shared-reference `raw_cell` API which is incompatible with this optimization. We could special-case that in the `ReadRef::new_owned` constructor (not in `ReadRef::new_arc` though), but nobody should be manually constructing `ReadRef`s.

- We still don't use `shrink_to_fit` on `RcStr` types. Some of these are in-place extended (when they have a refcount of 1) with `RcStr::map`, so we probably don't want to be too aggressive about this to avoid `O(n^2)` time complexity blowups.

## Memory Benchmark Setup

```bash
cd ~/next.js
cargo run -p next-build-test --release -- generate ~/shadcn-ui/apps/www/ > ~/shadcn-ui/apps/www/project_options.json
pnpm pack-next --project ~/shadcn-ui/apps/www/
```

```bash
cd ~/shadcn-ui
pnpm i
cd ~/shadcn-ui/apps/www/
heaptrack ~/next.js/target/release/next-build-test run sequential 1 1 '/sink'
heaptrack --analyze ~/shadcn-ui/apps/www/heaptrack.next-build-test.3604648.zst
```

### Memory Before (canary branch)

First Run:

```
peak heap memory consumption: 3.23G
peak RSS (including heaptrack overhead): 4.75G
```

Second Run:

```
peak heap memory consumption: 3.23G
peak RSS (including heaptrack overhead): 4.75G
```

### Memory After (this PR)

First Run:

```
peak heap memory consumption: 3.18G
peak RSS (including heaptrack overhead): 4.74G
```

Second Run:

```
peak heap memory consumption: 3.19G
peak RSS (including heaptrack overhead): 4.73G
```

This is about a 1.4% decrease in top-line memory consumption.

## Wall Time with `hyperfine` (Counter-Metric)

This is theoretically a time-memory tradeoff, as we'll spend some time `memcpy`ing things into smaller allocations, though in some cases reducing memory usage can improve cache locality, so it's not always obvious.

```
hyperfine --warmup 3 -r 30 --time-unit millisecond '~/next.js/target/release/next-build-test run sequential 1 1 /sink'
```

This benchmark is slow and takes about 30 minutes to run.

Before:

```
Benchmark 1: ~/next.js/target/release/next-build-test run sequential 1 1 /sink
  Time (mean ± σ):     56387.5 ms ± 212.6 ms    [User: 107807.5 ms, System: 9509.8 ms]
  Range (min … max):   55934.4 ms … 56872.9 ms    30 runs
```

After:

```
Benchmark 1: ~/next.js/target/release/next-build-test run sequential 1 1 /sink
  Time (mean ± σ):     56020.9 ms ± 235.4 ms    [User: 107483.8 ms, System: 9371.8 ms]
  Range (min … max):   55478.2 ms … 56563.6 ms    30 runs
```

This is a ~0.65% *reduction* in wall time. This is small enough (<2 standard deviations) to likely just be noise.

## Wall Time with `turbopack-bench` (Counter-Metric)

```
cargo bench -p turbopack-bench -p turbopack-cli -- "hmr_to_eval/Turbopack CSR"
```

Gives:

```
bench_hmr_to_eval/Turbopack CSR/1000 modules
                        time:   [15.123 ms 15.208 ms 15.343 ms]
                        change: [-0.8471% +0.4882% +1.9719%] (p = 0.55 > 0.05)
                        No change in performance detected.
```

Using https://github.com/bgw/benchmark-scripts/

In practice, it's not really possible to measure changes in wall time <1%, so this is within "noise" territory (as noted in the criterion output).

Closes PACK-3361
stipsan pushed a commit to sanity-io/next.js that referenced this pull request Nov 6, 2024
… when constructing a cell (vercel#72113)
