Warn when generated search index is huge #704

killercup · 2018-06-11T09:32:56Z

cf. rust-lang/rfcs#2469

TimNN · 2018-06-11T11:25:29Z

Some more thoughts from when I was investigating this:

One could probably get a significant size reduction by assigning integer keys to every entry (which are currently strings looking like rfc-0000-foo.html#heading, can be pretty long, and are generally repeated dozens or hundreds of times).
Generating JSON instead of JS and loading that asynchronously would probably be easier on the browser, since they don't have to execute it.

mattico · 2018-06-13T17:18:19Z

Few quick thoughts:

Modifying the index format is somewhat tricky since we're using a JS library (elasticlunr.js) which expects a specific format. We could work around this, of course.
We're loading the search index as JS is to avoid CORS issues when the book is opened as a local file in the browser. There's probably a better way to solve that problem, should look into it.
One reason the index is so big is because it includes a Document Store which contains the full processed text of each document in the index, used for the "teasers" in search results. This is optional, but I notice we don't provide a config option to disable it.
One option to shrink the index while retaining search teasers would be to separate the document store from the index. Each document could have a separate JSON file which could be fetched as needed. It might also be possible to build the teaser by fetching the actual html page and parsing that, which could take advantage of the browser's cache and avoid having two separate representations of the same content.
It seems we don't have a config option to disable search... the original intent was for search to be disabled if the [output.html.search] section was not present, but that was later changed to use the default config value for that section. We should add the output.html.search.enabled key back.

TimNN · 2018-06-13T17:50:15Z

It seems we don't have a config option to disable search...

That can be worked-around by setting copy-js = false, although I don't know if that was the intended usage of this function.

One reason the index is so big is because it includes a Document Store which contains the full processed text of each document in the index, used for the "teasers" in search results. This is optional, but I notice we don't provide a config option to disable it.

I don't think that is the main reason, removing ~all body content [0] reduce the RFC index from 38MB to 35MB.

Modifying the index format is somewhat tricky since we're using a JS library (elasticlunr.js) which expects a specific format. We could work around this, of course.

I'm not familiar with elasticlunr.js, however I would assume that it supports user-defined keys for the documents it indexes? In that case just replacing the current rfc-0000-foo.html#heading keys with integers could prove a huge win.

There are currently ~4500 unique URLs referenced in the index, the most common one occurs almost 1000 times, the least common ones occur 6 times. Not counting quotation marks, just these URLs make up about 21MB.

[0] perl -pi -e 's/"body":".*?","breadcrumbs":/"body":"...","breadcrumbs":/g' index.js

mattico · 2018-06-14T17:53:58Z

Here's the effect of the changes in #707 on page load performance for the rfcs book. The page load stats were recorded using the Firefox dev tools in "Good 3G" throttling and simple-http-server with GZip.

	v0.1.7	6ca68e4d
searchindex.js{,on}	38,659KB	18,697KB
searchindex.js{,on}.gz	6,726KB	2,915KB
Page Load (empty cache)	7.44s	633ms
Page Load (full cache)	346ms	233ms
Search Index Load (empty cache)	N/A*	849ms
Search Index Load (full cache)	N/A*	388ms

* searchindex.js was loaded before a few page resources, so it's included in the Page Load numbers

Don't know how much parse time would be spent on the JSON on a phone, but I think the above numbers are acceptable if imperfect.

TimNN · 2018-06-14T18:19:54Z

While iOS still crashes, this looks much better already, thanks a lot! The performance analysis tools offered by Chrome are also much happier :D

mattico mentioned this issue Jun 13, 2018

Optimize search index #707

Merged

mattico closed this as completed in #707 Jul 23, 2018

mattico mentioned this issue Aug 13, 2018

Consider using a search library that handles typo tolerance #764

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Warn when generated search index is huge #704

Warn when generated search index is huge #704

killercup commented Jun 11, 2018

TimNN commented Jun 11, 2018

mattico commented Jun 13, 2018

TimNN commented Jun 13, 2018

mattico commented Jun 14, 2018 •

edited

TimNN commented Jun 14, 2018

Warn when generated search index is huge #704

Warn when generated search index is huge #704

Comments

killercup commented Jun 11, 2018

TimNN commented Jun 11, 2018

mattico commented Jun 13, 2018

TimNN commented Jun 13, 2018

mattico commented Jun 14, 2018 • edited

TimNN commented Jun 14, 2018

mattico commented Jun 14, 2018 •

edited