Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustdoc-search: tighter encoding for f index #119468

Merged
merged 4 commits into from Jan 6, 2024

Conversation

notriddle
Copy link
Contributor

@notriddle notriddle commented Dec 31, 2023

Depends on #119457

Two optimizations for the function signature search:

  • Instead of using JSON arrays, like [1,20], it uses VLQ
    hex with no commas, like [aAd].
  • This also adds backrefs: if you have more than one function
    with exactly the same signature, it'll not only store it once,
    it'll decode it once, and store in the typeIdMap only once.

Based partially on discussions on zulip:
https://rust-lang.zulipchat.com/#narrow/stream/266220-t-rustdoc/topic/search.20index.20size

Performance

https://notriddle.com/rustdoc-html-demo-8/compression-perf-v2/index.html

memory/time profiler output (for more details, consult the above link)

benchmarkbeforeafter
arti
user: 002.789 s
sys:  000.390 s
wall: 002.096 s
child_RSS_high:     440796 KiB
group_mem_high:     414924 KiB
user: 002.295 s
sys:  000.278 s
wall: 001.738 s
child_RSS_high:     314588 KiB
group_mem_high:     285220 KiB
cortex-m
user: 000.127 s
sys:  000.030 s
wall: 000.134 s
child_RSS_high:      60264 KiB
group_mem_high:      23824 KiB
user: 000.136 s
sys:  000.038 s
wall: 000.137 s
child_RSS_high:      59204 KiB
group_mem_high:      22712 KiB
sqlx
user: 000.887 s
sys:  000.118 s
wall: 000.592 s
child_RSS_high:     190408 KiB
group_mem_high:     157804 KiB
user: 000.798 s
sys:  000.101 s
wall: 000.525 s
child_RSS_high:     159292 KiB
group_mem_high:     126292 KiB
stm32f4
user: 013.884 s
sys:  005.399 s
wall: 013.149 s
child_RSS_high:    1942244 KiB
group_mem_high:    1954916 KiB
user: 006.128 s
sys:  003.297 s
wall: 007.994 s
child_RSS_high:    1038108 KiB
group_mem_high:    1023900 KiB
ripgrep
user: 000.441 s
sys:  000.063 s
wall: 000.264 s
child_RSS_high:     109180 KiB
group_mem_high:      74272 KiB
user: 000.408 s
sys:  000.044 s
wall: 000.238 s
child_RSS_high:     101488 KiB
group_mem_high:      66000 KiB

Size change

standard library without gzip:

$ du -bs search-index-old.js search-index-new.js
4976370 search-index-old.js
4404391 search-index-new.js

((4976370-4404391)/4404391)*100% = 12.9%

with gzip:

$ du -hs search-index-old.js.gz search-index-new.js.gz 
520K    search-index-old.js.gz
504K    search-index-new.js.gz

$ du -bs search-index-old.js.gz search-index-new.js.gz 
522092  search-index-old.js.gz
507654  search-index-new.js.gz

((522092-507654)/507654)*100% = 2.8%

Benchmarks are similarly shrunk.

Without gzip:

$ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js
10555067        tmp/arti/toolchain_old/doc/search-index.js
8921236 tmp/arti/toolchain_new/doc/search-index.js
77018   tmp/cortex-m/toolchain_old/doc/search-index.js
66676   tmp/cortex-m/toolchain_new/doc/search-index.js
2876330 tmp/sqlx/toolchain_old/doc/search-index.js
2436812 tmp/sqlx/toolchain_new/doc/search-index.js
63632890        tmp/stm32f4/toolchain_old/doc/search-index.js
52337438        tmp/stm32f4/toolchain_new/doc/search-index.js
631150  tmp/ripgrep/toolchain_old/doc/search-index.js
541646  tmp/ripgrep/toolchain_new/doc/search-index.js

With gzip:

$ du -bs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js.gz
1618852 tmp/arti/toolchain_old/doc/search-index.js.gz
1582007 tmp/arti/toolchain_new/doc/search-index.js.gz
16109   tmp/cortex-m/toolchain_old/doc/search-index.js.gz
15831   tmp/cortex-m/toolchain_new/doc/search-index.js.gz
422257  tmp/sqlx/toolchain_old/doc/search-index.js.gz
411507  tmp/sqlx/toolchain_new/doc/search-index.js.gz
4454761 tmp/stm32f4/toolchain_old/doc/search-index.js.gz
4334924 tmp/stm32f4/toolchain_new/doc/search-index.js.gz
98312   tmp/ripgrep/toolchain_old/doc/search-index.js.gz
96864   tmp/ripgrep/toolchain_new/doc/search-index.js.gz

$ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.j
s.gz
1.6M    tmp/arti/toolchain_old/doc/search-index.js.gz
1.6M    tmp/arti/toolchain_new/doc/search-index.js.gz
24K     tmp/cortex-m/toolchain_old/doc/search-index.js.gz
24K     tmp/cortex-m/toolchain_new/doc/search-index.js.gz
424K    tmp/sqlx/toolchain_old/doc/search-index.js.gz
412K    tmp/sqlx/toolchain_new/doc/search-index.js.gz
4.3M    tmp/stm32f4/toolchain_old/doc/search-index.js.gz
4.2M    tmp/stm32f4/toolchain_new/doc/search-index.js.gz
108K    tmp/ripgrep/toolchain_old/doc/search-index.js.gz
104K    tmp/ripgrep/toolchain_new/doc/search-index.js.gz

@rustbot
Copy link
Collaborator

rustbot commented Dec 31, 2023

r? @GuillaumeGomez

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Dec 31, 2023
@rustbot
Copy link
Collaborator

rustbot commented Dec 31, 2023

Some changes occurred in HTML/CSS/JS.

cc @GuillaumeGomez, @jsha

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

Two optimizations for the function signature search:

* Instead of using JSON arrays, like `[1,20]`, it uses VLQ
  hex with no commas, like `[aAd]`.
* This also adds backrefs: if you have more than one function
  with exactly the same signature, it'll not only store it once,
  it'll *decode* it once, and store in the typeIdMap only once.

Size change
-----------

standard library

```console
$ du -bs search-index-old.js search-index-new.js
4976370 search-index-old.js
4404391 search-index-new.js
```

((4976370-4404391)/4404391)*100% = 12.9%

Benchmarks are similarly shrunk:

```console
$ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js
10555067        tmp/arti/toolchain_old/doc/search-index.js
8921236 tmp/arti/toolchain_new/doc/search-index.js
77018   tmp/cortex-m/toolchain_old/doc/search-index.js
66676   tmp/cortex-m/toolchain_new/doc/search-index.js
2876330 tmp/sqlx/toolchain_old/doc/search-index.js
2436812 tmp/sqlx/toolchain_new/doc/search-index.js
63632890        tmp/stm32f4/toolchain_old/doc/search-index.js
52337438        tmp/stm32f4/toolchain_new/doc/search-index.js
631150  tmp/ripgrep/toolchain_old/doc/search-index.js
541646  tmp/ripgrep/toolchain_new/doc/search-index.js
```
@GuillaumeGomez
Copy link
Member

Can you post the difference for the search-index JS execution as well please?

Might be nice to add some explanations about what algorithm is used so it's easier to understand the code too.

id.serialize(serializer)
// zig-zag notation
let value: u32 = (id << 1) | (if sign { 1 } else { 0 });
// encode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you precise which encoding it is please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

I've opened another PR, rust-lang/rustc-dev-guide#1846, which documents this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's super nice! Could you also precise the encoding used here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've added some comments describing the format.

@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/compression branch 2 times, most recently from 69d86c9 to 448b19b Compare January 4, 2024 22:02
@GuillaumeGomez
Copy link
Member

Thanks! r=me when CI pass

@rust-log-analyzer

This comment has been minimized.

@notriddle
Copy link
Contributor Author

@bors r=GuillaumeGomez rollup

@bors
Copy link
Contributor

bors commented Jan 5, 2024

📌 Commit 004bfc5 has been approved by GuillaumeGomez

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 5, 2024
compiler-errors added a commit to compiler-errors/rust that referenced this pull request Jan 6, 2024
…=GuillaumeGomez

rustdoc-search: tighter encoding for f index

Depends on rust-lang#119457

Two optimizations for the function signature search:

* Instead of using JSON arrays, like `[1,20]`, it uses VLQ
  hex with no commas, like `[aAd]`.
* This also adds backrefs: if you have more than one function
  with exactly the same signature, it'll not only store it once,
  it'll *decode* it once, and store in the typeIdMap only once.

Based partially on discussions on zulip:
https://rust-lang.zulipchat.com/#narrow/stream/266220-t-rustdoc/topic/search.20index.20size

Performance
-----------

https://notriddle.com/rustdoc-html-demo-8/compression-perf-v2/index.html

### memory/time profiler output (for more details, consult the above link)

<table>
<thead><tr><th>benchmark<th>before<th>after</tr></thead>
<tbody>
<tr><th>arti<td>

```
user: 002.789 s
sys:  000.390 s
wall: 002.096 s
child_RSS_high:     440796 KiB
group_mem_high:     414924 KiB
```

</td><td>

```
user: 002.295 s
sys:  000.278 s
wall: 001.738 s
child_RSS_high:     314588 KiB
group_mem_high:     285220 KiB
```

</td></tr><tr><th>cortex-m<td>

```
user: 000.127 s
sys:  000.030 s
wall: 000.134 s
child_RSS_high:      60264 KiB
group_mem_high:      23824 KiB
```

</td><td>

```
user: 000.136 s
sys:  000.038 s
wall: 000.137 s
child_RSS_high:      59204 KiB
group_mem_high:      22712 KiB
```

</td></tr><tr><th>sqlx<td>

```
user: 000.887 s
sys:  000.118 s
wall: 000.592 s
child_RSS_high:     190408 KiB
group_mem_high:     157804 KiB
```

</td><td>

```
user: 000.798 s
sys:  000.101 s
wall: 000.525 s
child_RSS_high:     159292 KiB
group_mem_high:     126292 KiB
```

</td></tr><tr><th>stm32f4<td>

```
user: 013.884 s
sys:  005.399 s
wall: 013.149 s
child_RSS_high:    1942244 KiB
group_mem_high:    1954916 KiB
```

</td><td>

```
user: 006.128 s
sys:  003.297 s
wall: 007.994 s
child_RSS_high:    1038108 KiB
group_mem_high:    1023900 KiB
```

</td></tr><tr><th>ripgrep<td>

```
user: 000.441 s
sys:  000.063 s
wall: 000.264 s
child_RSS_high:     109180 KiB
group_mem_high:      74272 KiB
```

</td><td>

```
user: 000.408 s
sys:  000.044 s
wall: 000.238 s
child_RSS_high:     101488 KiB
group_mem_high:      66000 KiB
```

</td></tr></tbody></table>

Size change
-----------

standard library without gzip:

```console
$ du -bs search-index-old.js search-index-new.js
4976370 search-index-old.js
4404391 search-index-new.js
```

((4976370-4404391)/4404391)*100% = 12.9%

with gzip:

```console
$ du -hs search-index-old.js.gz search-index-new.js.gz
520K    search-index-old.js.gz
504K    search-index-new.js.gz

$ du -bs search-index-old.js.gz search-index-new.js.gz
522092  search-index-old.js.gz
507654  search-index-new.js.gz
```

((522092-507654)/507654)*100% = 2.8%

Benchmarks are similarly shrunk.

Without gzip:

```console
$ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js
10555067        tmp/arti/toolchain_old/doc/search-index.js
8921236 tmp/arti/toolchain_new/doc/search-index.js
77018   tmp/cortex-m/toolchain_old/doc/search-index.js
66676   tmp/cortex-m/toolchain_new/doc/search-index.js
2876330 tmp/sqlx/toolchain_old/doc/search-index.js
2436812 tmp/sqlx/toolchain_new/doc/search-index.js
63632890        tmp/stm32f4/toolchain_old/doc/search-index.js
52337438        tmp/stm32f4/toolchain_new/doc/search-index.js
631150  tmp/ripgrep/toolchain_old/doc/search-index.js
541646  tmp/ripgrep/toolchain_new/doc/search-index.js
```

With gzip:

```console
$ du -bs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js.gz
1618852 tmp/arti/toolchain_old/doc/search-index.js.gz
1582007 tmp/arti/toolchain_new/doc/search-index.js.gz
16109   tmp/cortex-m/toolchain_old/doc/search-index.js.gz
15831   tmp/cortex-m/toolchain_new/doc/search-index.js.gz
422257  tmp/sqlx/toolchain_old/doc/search-index.js.gz
411507  tmp/sqlx/toolchain_new/doc/search-index.js.gz
4454761 tmp/stm32f4/toolchain_old/doc/search-index.js.gz
4334924 tmp/stm32f4/toolchain_new/doc/search-index.js.gz
98312   tmp/ripgrep/toolchain_old/doc/search-index.js.gz
96864   tmp/ripgrep/toolchain_new/doc/search-index.js.gz

$ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.j
s.gz
1.6M    tmp/arti/toolchain_old/doc/search-index.js.gz
1.6M    tmp/arti/toolchain_new/doc/search-index.js.gz
24K     tmp/cortex-m/toolchain_old/doc/search-index.js.gz
24K     tmp/cortex-m/toolchain_new/doc/search-index.js.gz
424K    tmp/sqlx/toolchain_old/doc/search-index.js.gz
412K    tmp/sqlx/toolchain_new/doc/search-index.js.gz
4.3M    tmp/stm32f4/toolchain_old/doc/search-index.js.gz
4.2M    tmp/stm32f4/toolchain_new/doc/search-index.js.gz
108K    tmp/ripgrep/toolchain_old/doc/search-index.js.gz
104K    tmp/ripgrep/toolchain_new/doc/search-index.js.gz
```
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 6, 2024
…mpiler-errors

Rollup of 9 pull requests

Successful merges:

 - rust-lang#119208 (coverage: Hoist some complex code out of the main span refinement loop)
 - rust-lang#119216 (Use diagnostic namespace in stdlib)
 - rust-lang#119414 (bootstrap: Move -Clto= setting from Rustc::run to rustc_cargo)
 - rust-lang#119420 (Handle ForeignItem as TAIT scope.)
 - rust-lang#119468 (rustdoc-search: tighter encoding for f index)
 - rust-lang#119628 (remove duplicate test)
 - rust-lang#119638 (fix cyle error when suggesting to use associated function instead of constructor)
 - rust-lang#119640 (library: Fix warnings in rtstartup)
 - rust-lang#119642 (library: Fix a symlink test failing on Windows)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit b3f3074 into rust-lang:master Jan 6, 2024
11 checks passed
@rustbot rustbot added this to the 1.77.0 milestone Jan 6, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Jan 6, 2024
Rollup merge of rust-lang#119468 - notriddle:notriddle/compression, r=GuillaumeGomez

rustdoc-search: tighter encoding for f index

Depends on rust-lang#119457

Two optimizations for the function signature search:

* Instead of using JSON arrays, like `[1,20]`, it uses VLQ
  hex with no commas, like `[aAd]`.
* This also adds backrefs: if you have more than one function
  with exactly the same signature, it'll not only store it once,
  it'll *decode* it once, and store in the typeIdMap only once.

Based partially on discussions on zulip:
https://rust-lang.zulipchat.com/#narrow/stream/266220-t-rustdoc/topic/search.20index.20size

Performance
-----------

https://notriddle.com/rustdoc-html-demo-8/compression-perf-v2/index.html

### memory/time profiler output (for more details, consult the above link)

<table>
<thead><tr><th>benchmark<th>before<th>after</tr></thead>
<tbody>
<tr><th>arti<td>

```
user: 002.789 s
sys:  000.390 s
wall: 002.096 s
child_RSS_high:     440796 KiB
group_mem_high:     414924 KiB
```

</td><td>

```
user: 002.295 s
sys:  000.278 s
wall: 001.738 s
child_RSS_high:     314588 KiB
group_mem_high:     285220 KiB
```

</td></tr><tr><th>cortex-m<td>

```
user: 000.127 s
sys:  000.030 s
wall: 000.134 s
child_RSS_high:      60264 KiB
group_mem_high:      23824 KiB
```

</td><td>

```
user: 000.136 s
sys:  000.038 s
wall: 000.137 s
child_RSS_high:      59204 KiB
group_mem_high:      22712 KiB
```

</td></tr><tr><th>sqlx<td>

```
user: 000.887 s
sys:  000.118 s
wall: 000.592 s
child_RSS_high:     190408 KiB
group_mem_high:     157804 KiB
```

</td><td>

```
user: 000.798 s
sys:  000.101 s
wall: 000.525 s
child_RSS_high:     159292 KiB
group_mem_high:     126292 KiB
```

</td></tr><tr><th>stm32f4<td>

```
user: 013.884 s
sys:  005.399 s
wall: 013.149 s
child_RSS_high:    1942244 KiB
group_mem_high:    1954916 KiB
```

</td><td>

```
user: 006.128 s
sys:  003.297 s
wall: 007.994 s
child_RSS_high:    1038108 KiB
group_mem_high:    1023900 KiB
```

</td></tr><tr><th>ripgrep<td>

```
user: 000.441 s
sys:  000.063 s
wall: 000.264 s
child_RSS_high:     109180 KiB
group_mem_high:      74272 KiB
```

</td><td>

```
user: 000.408 s
sys:  000.044 s
wall: 000.238 s
child_RSS_high:     101488 KiB
group_mem_high:      66000 KiB
```

</td></tr></tbody></table>

Size change
-----------

standard library without gzip:

```console
$ du -bs search-index-old.js search-index-new.js
4976370 search-index-old.js
4404391 search-index-new.js
```

((4976370-4404391)/4404391)*100% = 12.9%

with gzip:

```console
$ du -hs search-index-old.js.gz search-index-new.js.gz
520K    search-index-old.js.gz
504K    search-index-new.js.gz

$ du -bs search-index-old.js.gz search-index-new.js.gz
522092  search-index-old.js.gz
507654  search-index-new.js.gz
```

((522092-507654)/507654)*100% = 2.8%

Benchmarks are similarly shrunk.

Without gzip:

```console
$ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js
10555067        tmp/arti/toolchain_old/doc/search-index.js
8921236 tmp/arti/toolchain_new/doc/search-index.js
77018   tmp/cortex-m/toolchain_old/doc/search-index.js
66676   tmp/cortex-m/toolchain_new/doc/search-index.js
2876330 tmp/sqlx/toolchain_old/doc/search-index.js
2436812 tmp/sqlx/toolchain_new/doc/search-index.js
63632890        tmp/stm32f4/toolchain_old/doc/search-index.js
52337438        tmp/stm32f4/toolchain_new/doc/search-index.js
631150  tmp/ripgrep/toolchain_old/doc/search-index.js
541646  tmp/ripgrep/toolchain_new/doc/search-index.js
```

With gzip:

```console
$ du -bs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js.gz
1618852 tmp/arti/toolchain_old/doc/search-index.js.gz
1582007 tmp/arti/toolchain_new/doc/search-index.js.gz
16109   tmp/cortex-m/toolchain_old/doc/search-index.js.gz
15831   tmp/cortex-m/toolchain_new/doc/search-index.js.gz
422257  tmp/sqlx/toolchain_old/doc/search-index.js.gz
411507  tmp/sqlx/toolchain_new/doc/search-index.js.gz
4454761 tmp/stm32f4/toolchain_old/doc/search-index.js.gz
4334924 tmp/stm32f4/toolchain_new/doc/search-index.js.gz
98312   tmp/ripgrep/toolchain_old/doc/search-index.js.gz
96864   tmp/ripgrep/toolchain_new/doc/search-index.js.gz

$ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.j
s.gz
1.6M    tmp/arti/toolchain_old/doc/search-index.js.gz
1.6M    tmp/arti/toolchain_new/doc/search-index.js.gz
24K     tmp/cortex-m/toolchain_old/doc/search-index.js.gz
24K     tmp/cortex-m/toolchain_new/doc/search-index.js.gz
424K    tmp/sqlx/toolchain_old/doc/search-index.js.gz
412K    tmp/sqlx/toolchain_new/doc/search-index.js.gz
4.3M    tmp/stm32f4/toolchain_old/doc/search-index.js.gz
4.2M    tmp/stm32f4/toolchain_new/doc/search-index.js.gz
108K    tmp/ripgrep/toolchain_old/doc/search-index.js.gz
104K    tmp/ripgrep/toolchain_new/doc/search-index.js.gz
```
@notriddle notriddle deleted the notriddle/compression branch January 6, 2024 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants