Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn when generated search index is huge #704

Closed
killercup opened this issue Jun 11, 2018 · 5 comments
Closed

Warn when generated search index is huge #704

killercup opened this issue Jun 11, 2018 · 5 comments

Comments

@killercup
Copy link
Member

cf. rust-lang/rfcs#2469

@TimNN
Copy link

TimNN commented Jun 11, 2018

Some more thoughts from when I was investigating this:

  • One could probably get a significant size reduction by assigning integer keys to every entry (which are currently strings looking like rfc-0000-foo.html#heading, can be pretty long, and are generally repeated dozens or hundreds of times).
  • Generating JSON instead of JS and loading that asynchronously would probably be easier on the browser, since they don't have to execute it.

@mattico
Copy link
Contributor

mattico commented Jun 13, 2018

Few quick thoughts:

  • Modifying the index format is somewhat tricky since we're using a JS library (elasticlunr.js) which expects a specific format. We could work around this, of course.
  • We're loading the search index as JS is to avoid CORS issues when the book is opened as a local file in the browser. There's probably a better way to solve that problem, should look into it.
  • One reason the index is so big is because it includes a Document Store which contains the full processed text of each document in the index, used for the "teasers" in search results. This is optional, but I notice we don't provide a config option to disable it.
  • One option to shrink the index while retaining search teasers would be to separate the document store from the index. Each document could have a separate JSON file which could be fetched as needed. It might also be possible to build the teaser by fetching the actual html page and parsing that, which could take advantage of the browser's cache and avoid having two separate representations of the same content.
  • It seems we don't have a config option to disable search... the original intent was for search to be disabled if the [output.html.search] section was not present, but that was later changed to use the default config value for that section. We should add the output.html.search.enabled key back.

@TimNN
Copy link

TimNN commented Jun 13, 2018

  • It seems we don't have a config option to disable search...

That can be worked-around by setting copy-js = false, although I don't know if that was the intended usage of this function.

  • One reason the index is so big is because it includes a Document Store which contains the full processed text of each document in the index, used for the "teasers" in search results. This is optional, but I notice we don't provide a config option to disable it.

I don't think that is the main reason, removing ~all body content [0] reduce the RFC index from 38MB to 35MB.

  • Modifying the index format is somewhat tricky since we're using a JS library (elasticlunr.js) which expects a specific format. We could work around this, of course.

I'm not familiar with elasticlunr.js, however I would assume that it supports user-defined keys for the documents it indexes? In that case just replacing the current rfc-0000-foo.html#heading keys with integers could prove a huge win.

There are currently ~4500 unique URLs referenced in the index, the most common one occurs almost 1000 times, the least common ones occur 6 times. Not counting quotation marks, just these URLs make up about 21MB.

[0] perl -pi -e 's/"body":".*?","breadcrumbs":/"body":"...","breadcrumbs":/g' index.js

@mattico
Copy link
Contributor

mattico commented Jun 14, 2018

Here's the effect of the changes in #707 on page load performance for the rfcs book. The page load stats were recorded using the Firefox dev tools in "Good 3G" throttling and simple-http-server with GZip.

v0.1.7 6ca68e4d
searchindex.js{,on} 38,659KB 18,697KB
searchindex.js{,on}.gz 6,726KB 2,915KB
Page Load (empty cache) 7.44s 633ms
Page Load (full cache) 346ms 233ms
Search Index Load (empty cache) N/A* 849ms
Search Index Load (full cache) N/A* 388ms

* searchindex.js was loaded before a few page resources, so it's included in the Page Load numbers

Don't know how much parse time would be spent on the JSON on a phone, but I think the above numbers are acceptable if imperfect.

@TimNN
Copy link

TimNN commented Jun 14, 2018

While iOS still crashes, this looks much better already, thanks a lot! The performance analysis tools offered by Chrome are also much happier :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants