Add guide for rustdoc search implementation (#1846)

rust-lang · Jan 6, 2024 · 5606d30 · 5606d30
1 parent d13e851
commit 5606d30

Showing 2 changed files with 245 additions and 0 deletions.
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -74,6 +74,7 @@
 - [Serialization in Rustc](./serialization.md)
 - [Parallel Compilation](./parallel-rustc.md)
 - [Rustdoc internals](./rustdoc-internals.md)
+    - [Search](./rustdoc-internals/search.md)
 
 # Source Code Representation
 

diff --git a/src/rustdoc-internals/search.md b/src/rustdoc-internals/search.md
@@ -0,0 +1,244 @@
+# Rustdoc search
+
+Rustdoc Search is two programs: `search_index.rs`
+and `search.js`. The first generates a nasty JSON
+file with a full list of items and function signatures
+in the crates in the doc bundle, and the second reads
+it, turns it into some in-memory structures, and
+scans them linearly to search.
+
+<!-- toc -->
+
+## Search index format
+
+`search.js` calls this Raw, because it turns it into
+a more normal object tree after loading it.
+Naturally, it's also written without newlines or spaces.
+
+```json
+[
+    [ "crate_name", {
+        "doc": "Documentation",
+        "n": ["function_name", "Data"],
+        "t": "HF",
+        "d": ["This function gets the name of an integer with Data", "The data struct"],
+        "q": [[0, "crate_name"]],
+        "i": [2, 0],
+        "p": [[1, "i32"], [1, "str"], [5, "crate_name::Data"]],
+        "f": "{{gb}{d}}`",
+        "b": [],
+        "c": [],
+        "a": [["get_name", 0]]
+    }]
+]
+```
+
+[`src/librustdoc/html/static/js/externs.js`]
+defines an actual schema in a Closure `@typedef`.
+
+The above index defines a crate called `crate_name`
+with a struct called `Data`,
+a free function called `function_name` with the type signature `Data, i32 -> str`,
+and an alias, `get_name`, that equivalently refers to `function_name`.
+
+[`src/librustdoc/html/static/js/externs.js`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/externs.js#L204-L258
+
+The search index has to serve both the `rustdoc` compiler
+and the `search.js` frontend,
+while also being compact and fast to decode.
+It makes a lot of compromises:
+
+* The `rustdoc` compiler runs on one crate at a time,
+  so each crate has an essentially separate search index.
+  It [merges] them by having each crate on one line
+  and looking at the first quoted string.
+* Names in the search index are given
+  in their original case and with underscores.
+  When the search index is loaded,
+  `search.js` stores the original names for display,
+  but also folds them to lowercase and strips underscores for search.
+  You'll see them called `normalized`.
+* The `f` field stores types as indices into the `p` array.
+  These types might actually be from another crate,
+  so `search.js` has to turn the numbers into names and then
+  back into numbers to deduplicate them if multiple crates in the
+  same index mention the same types.
+* It's a JSON file, but not designed to be human-readable.
+  Browsers already include an optimized JSON decoder,
+  so this saves on `search.js` code and performs better for small crates,
+  but instead of using objects like normal JSON formats do,
+  it tries to put data of the same type next to each other
+  so that the sliding window used by [DEFLATE] can find redundancies.
+  Where `search.js` does its own compression,
+  it's designed to save memory when the file is finally loaded,
+  not just size on disk or network transfer.
+
+[merges]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/render/write_shared.rs#L151-L164
+[DEFLATE]: https://en.wikipedia.org/wiki/Deflate
+
+### Parallel arrays and indexed maps
+
+Most data in the index
+(other than `doc`, which is a single string for the whole crate,
+`p`, which is a separate structure,
+and `a`, which is also a separate structure)
+is a set of parallel arrays defining each searchable item.
+
+For example,
+the above search index can be turned into this table:
+
+| n | t | d | q | i | f | b | c |
+|---|---|---|---|---|---|---|---|
+| `function_name` | `H` | This function gets the name of an integer with Data | `crate_name` | 2 | `{{gb}{d}}` | NULL | NULL |
+| `Data` | `F` | The data struct | `crate_name` | 0 | `` ` `` | NULL | NULL |
+
+The above code doesn't use `c`, which holds deprecated indices,
+or `b`, which maps indices to strings.
+If `crate_name::function_name` used both, it would look like this.
+
+```json
+        "b": [[0, "impl-Foo-for-Bar"]],
+        "c": [0],
+```
+
+This attaches a disambiguator to index 0 and marks it deprecated.
+
+The advantage of this layout is that these APIs often have implicit structure
+that DEFLATE can take advantage of,
+but that rustdoc can't assume.
+For example, names are usually CamelCase or snake_case,
+but descriptions aren't.
+
+`q` is a Map from *the first applicable* ID to a parent module path.
+This is a weird trick, but it makes more sense in pseudo-code:
+
+```rust
+let mut parent_module = "";
+for (i, entry) in search_index.iter().enumerate() {
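+    // `q` only has a key for the first item of each run that shares a parent
+    // module; every later item inherits the most recent path.
+    if let Some(path) = q.get(&i) {
+        parent_module = path;
+    }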
+    // ... do other stuff with `entry` ...
+}
+```
+
+This is valid because everything has a parent module
+(even if it's just the crate itself),
+and is easy to assemble because the rustdoc generator sorts by path
+before serializing.
+Doing this allows rustdoc to not only make the search index smaller,
+but reuse the same string representing the parent path across multiple in-memory items.
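
As a minimal sketch of how the parallel arrays and the sparse `q` map fit together, the following Rust rebuilds one row per item from `n`, `t`, and `q`. The `Row` struct and the `build_rows` function are hypothetical names for this illustration, not part of `search.js` or `search_index.rs`, and `q` is assumed to be already parsed into a `HashMap`.

```rust
use std::collections::HashMap;

/// One searchable item, reassembled from the parallel arrays.
/// (Hypothetical type, used only for this sketch.)
#[derive(Debug, PartialEq)]
struct Row {
    name: String,
    kind: char,
    parent_module: String,
}

/// Zips `n` (names) with `t` (one type letter per item) and fills in the
/// parent module from the sparse `q` map.
fn build_rows(n: &[&str], t: &str, q: &HashMap<usize, &str>) -> Vec<Row> {
    let mut parent_module = "";
    let mut rows = Vec::with_capacity(n.len());
    for (i, (name, kind)) in n.iter().zip(t.chars()).enumerate() {
        // Only the first item of a run has a key in `q`;
        // the following items reuse the previous path.
        if let Some(&path) = q.get(&i) {
            parent_module = path;
        }
        rows.push(Row {
            name: (*name).to_string(),
            kind,
            parent_module: parent_module.to_string(),
        });
    }
    rows
}
```

With the example index above, `build_rows(&["function_name", "Data"], "HF", &q)` gives both rows the parent module `crate_name`, even though `q` only has a key for index 0.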
+
+### `i`, `f`, and `p`
+
+`i` and `f` both index into `p`, the array of parent items.
+
+`i` is just a one-indexed number
+(not zero-indexed because `0` is used for items that have no parent item).
+It's different from `q` because `q` represents the parent *module or crate*,
+which everything has,
+while `i`/`p` are used for *type and trait-associated items* like methods.
+
+`f`, the function signatures, uses its own encoding.
+
+```ebnf
+f = { FItem | FBackref }
+FItem = FNumber | ( '{', {FItem}, '}' )
+FNumber = { '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' }, ( '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' )
+FBackref = ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | '?' )
+```
+
+An FNumber is a variable-length, self-terminating base16 number
+(terminated because the last hexit is lowercase while all others are uppercase).
+These are one-indexed references into `p`, because zero is used for nulls,
+and negative numbers represent generics.
+The sign bit is represented using [zig-zag encoding]
+(the internal object representation also uses negative numbers,
+even after decoding,
+to represent generics).
+This alphabet is chosen because the characters can be turned into hexits by
+keeping only the low four bits of the ASCII encoding (`c & 0xF`).
+
+For example, `{{gb}{d}}` is equivalent to the JSON `[[3, 1], [2]]`.
+Because of zigzag encoding, `` ` `` is +0, `a` is -0 (which is not used),
+`b` is +1, and `c` is -1.
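
The decoding rules above can be sketched in Rust as follows. This is an illustration under stated assumptions rather than the code in `search.js`: `FItem`, `decode_f`, `decode_items`, and `decode_number` are hypothetical names, the sign is assumed to live in the lowest bit (as the `b` is +1, `c` is -1 examples indicate), and backreferences are only recorded, not resolved.

```rust
/// One decoded element of an `f` string (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum FItem {
    /// A signed number: positive values index into `p`, negatives are generics.
    Number(i32),
    /// A `{ ... }` group of nested items.
    List(Vec<FItem>),
    /// A single backreference character, stored as its low four bits.
    Backref(u8),
}

/// Decodes a whole `f` string into its top-level items.
fn decode_f(encoded: &str) -> Vec<FItem> {
    let mut pos = 0;
    decode_items(encoded.as_bytes(), &mut pos)
}

/// Reads items until the input ends or a closing `}` is consumed.
fn decode_items(bytes: &[u8], pos: &mut usize) -> Vec<FItem> {
    let mut items = Vec::new();
    while *pos < bytes.len() {
        let c = bytes[*pos];
        match c {
            b'{' => {
                *pos += 1;
                items.push(FItem::List(decode_items(bytes, pos)));
            }
            b'}' => {
                *pos += 1;
                return items;
            }
            // '0'..='?' is the FBackref alphabet (0x30..=0x3F).
            b'0'..=b'?' => {
                *pos += 1;
                items.push(FItem::Backref(c & 0xF));
            }
            _ => items.push(FItem::Number(decode_number(bytes, pos))),
        }
    }
    items
}

/// Reads one self-terminating base-16 number and undoes the sign encoding.
fn decode_number(bytes: &[u8], pos: &mut usize) -> i32 {
    let mut n: i32 = 0;
    // Characters below '`' (the '@'..='O' range) mean more hexits follow.
    while *pos < bytes.len() && bytes[*pos] < b'`' {
        n = (n << 4) | i32::from(bytes[*pos] & 0xF);
        *pos += 1;
    }
    // The final hexit comes from the '`'..='o' range.
    if *pos < bytes.len() {
        n = (n << 4) | i32::from(bytes[*pos] & 0xF);
        *pos += 1;
    }
    // Assumption: lowest bit is the sign, remaining bits are the magnitude.
    let magnitude = n >> 1;
    if n & 1 == 1 { -magnitude } else { magnitude }
}
```

Because every number terminates itself, the decoder never needs separators: `decode_f("bc")` yields two numbers, +1 and -1, and the two-hexit `Ab` is read as a single value.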
+
+[empirically]: https://github.com/rust-lang/rust/pull/83003
+[zig-zag encoding]: https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding
+
+## Searching by name
+
+Searching by name works by looping through the search index
+and running these functions on each:
+
+* [`editDistance`] is always used to determine a match
+  (unless quotes are specified, which would use simple equality instead).
+  It computes the number of swaps, inserts, and removes needed to turn
+  the query name into the entry name.
+  For example, `foo` has zero distance from itself,
+  but a distance of 1 from `ofo` (one swap) and `foob` (one insert).
+  It is checked against a heuristic threshold, and then,
+  if it is within that threshold, the distance is stored for ranking.
+* [`String.prototype.indexOf`] is always used to determine a match.
+  If it returns anything other than -1, the result is added,
+  even if `editDistance` exceeds its threshold,
+  and the index is stored for ranking.
+* [`checkPath`] is used if, and only if, a parent path is specified
+  in the query. For example, `vec` has no parent path, but `vec::vec` does.
+  Within `checkPath`, `editDistance` and `indexOf` are used,
+  and the path query has its own heuristic threshold, too.
+  If it's not within the threshold, the entry is rejected,
+  even if the first two pass.
+  If it's within the threshold, the path distance is stored
+  for ranking.
+* [`checkType`] is used only if there's a type filter,
+  like the struct in `struct:vec`. If it fails,
+  the entry is rejected.
+
+If all four criteria pass
+(plus the crate filter, which isn't technically part of the query),
+the results are sorted by [`sortResults`].
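
The first two checks can be sketched in Rust as below. This is a hedged illustration, not a port of `search.js`: `edit_distance`, `NameMatch`, `match_name`, and the `max_distance` parameter are hypothetical names, the real threshold heuristic is not reproduced, and the sketch also counts single-character substitutions (as the classic optimal-string-alignment distance does), which the description above does not spell out.

```rust
/// Optimal-string-alignment distance: each insert, remove, substitution,
/// or swap of two adjacent characters costs 1.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (n, m) = (a.len(), b.len());
    // d[i][j] = distance between the first i chars of a and first j of b.
    let mut d = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n {
        d[i][0] = i;
    }
    for j in 0..=m {
        d[0][j] = j;
    }
    for i in 1..=n {
        for j in 1..=m {
            let cost = usize::from(a[i - 1] != b[j - 1]);
            let mut best = (d[i - 1][j] + 1)
                .min(d[i][j - 1] + 1)
                .min(d[i - 1][j - 1] + cost);
            // Adjacent swap, e.g. "fo" <-> "of".
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(d[i - 2][j - 2] + 1);
            }
            d[i][j] = best;
        }
    }
    d[n][m]
}

/// What gets kept for ranking when an entry matches (hypothetical type).
#[derive(Debug, PartialEq)]
struct NameMatch {
    /// Edit distance, kept only when it is within the threshold.
    distance: Option<usize>,
    /// Byte offset of the query inside the entry name, if it is a substring.
    index: Option<usize>,
}

/// Accepts an entry if either check passes; `max_distance` stands in for
/// the heuristic threshold.
fn match_name(query: &str, entry: &str, max_distance: usize) -> Option<NameMatch> {
    let distance = Some(edit_distance(query, entry)).filter(|&d| d <= max_distance);
    let index = entry.find(query);
    if distance.is_none() && index.is_none() {
        None
    } else {
        Some(NameMatch { distance, index })
    }
}
```

Note how a substring hit keeps an entry even when the edit distance is far over the threshold: `match_name("vec", "into_vec", 0)` still matches, with only the index recorded.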
+
+[`editDistance`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L137
+[`String.prototype.indexOf`]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf
+[`checkPath`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1814
+[`checkType`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1787
+[`sortResults`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1229
+
+## Searching by type
+
+Searching by type can be divided into two phases,
+and the second phase has two sub-phases.
+
+* Turn names in the query into numbers.
+* Loop over each entry in the search index:
+   * Quick rejection using a bloom filter.
+   * Slow rejection using a recursive type unification algorithm.
+
+In the names->numbers phase, if the query has only one name in it,
+the `editDistance` function is used to find a near match if the exact match fails,
+but if there are multiple items in the query,
+non-matching items are treated as generics instead.
+This means `hahsmap` will match `hashmap` on its own, but `hahsmap, u32`
+is going to match the same things `T, u32` matches
+(though rustdoc will detect this particular problem and warn about it).
+
+Then, when actually looping over each item,
+the bloom filter will probably reject entries that don't have every
+type mentioned in the query.
+For example, the bloom filter allows a query of `i32 -> u32` to match
+a function with the type `i32, u32 -> bool`,
+but unification will reject it later.
+
+The unification filter ensures that:
+
+* Bag semantics are respected. If your query says `i32, i32`,
+  then the function has to mention *two* i32s, not just one.
+* Nesting semantics are respected. If your query says `vec<option>`,
+  then `vec<option<i32>>` is fine, but `option<vec<i32>>` *is not* a match.
+* The division between return type and parameter is respected.
+  `i32 -> u32` and `u32 -> i32` are completely different.
+
+The bloom filter checks none of these things,
+and, on top of that, can have false positives.
+But it's fast and uses very little memory, so the bloom filter helps.
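
The two-stage rejection can be sketched in Rust like this. It is a simplified stand-in under loud assumptions: types are plain numeric IDs, the bloom filter is a single 64-bit mask with two hash bits per type, and the slow path only checks bag semantics and the parameter/return split (no nesting, no generics), so it is not the recursive unification algorithm itself. All names (`Signature`, `bloom_bits`, `bloom_of`, `bloom_may_match`, `is_sub_bag`, `matches`) are hypothetical.

```rust
use std::collections::HashMap;

/// A flat function signature: type IDs of parameters and of the return type.
struct Signature {
    inputs: Vec<u32>,
    output: Vec<u32>,
}

/// Two bit positions (0..64) derived from the type ID by multiplicative hashing.
fn bloom_bits(id: u32) -> u64 {
    let h1 = id.wrapping_mul(0x9E37_79B1) >> 26;
    let h2 = id.wrapping_mul(0x85EB_CA6B) >> 26;
    (1u64 << h1) | (1u64 << h2)
}

/// Bloom mask of every type mentioned anywhere in the signature.
fn bloom_of(sig: &Signature) -> u64 {
    sig.inputs
        .iter()
        .chain(&sig.output)
        .fold(0, |mask, &id| mask | bloom_bits(id))
}

/// Quick rejection: false means "definitely no match", true means "maybe".
fn bloom_may_match(entry_bloom: u64, query: &Signature) -> bool {
    let needed = bloom_of(query);
    (entry_bloom & needed) == needed
}

/// Bag semantics: every query type must be matched by a distinct entry type.
fn is_sub_bag(query: &[u32], entry: &[u32]) -> bool {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &id in entry {
        *counts.entry(id).or_insert(0) += 1;
    }
    query.iter().all(|&id| {
        let remaining = counts.entry(id).or_insert(0);
        if *remaining == 0 {
            false
        } else {
            *remaining -= 1;
            true
        }
    })
}

/// Cheap filter first, exact check second. A real index would store the
/// entry's bloom mask instead of recomputing it for each query.
fn matches(query: &Signature, entry: &Signature) -> bool {
    bloom_may_match(bloom_of(entry), query)
        && is_sub_bag(&query.inputs, &entry.inputs)
        && is_sub_bag(&query.output, &entry.output)
}
```

With IDs 1, 2, and 3 standing for `i32`, `u32`, and `bool`, the query `i32 -> u32` passes the bloom check against `i32, u32 -> bool` because both types are mentioned somewhere, and is then rejected by the exact check because `u32` is not in the return position.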