Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search: index from html files for mkdocs projects #7208

Merged
merged 2 commits into from Jun 23, 2020

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Jun 19, 2020

This is on top of #7207

It's under a feature flag currently,
since I would like to test this with sphinx later.

Tested with all themes available in
https://github.com/mkdocs/mkdocs/wiki/MkDocs-Themes

I did a search in all of them to check that sections are correctly indexed,
and that irrelevant content isn't indexed.

I wasn't able to build these,
but I checked the html, and it's very similar to the default mkdocs theme,
so it should work.

These themes don't expose search,
but the content is indexed, and we show results in the dashboard.

Other problems I found

We don't override the search for https://github.com/squidfunk/mkdocs-material
(they don't use the same identifier we use for overriding search).
That isn't related to this, but just wanted to mention it.
We can send a patch to fix that.

You can check the results by either building mkdocs projects and
searching (you need to enable the mkdocs search feature flag and the
index from html feature flag), or just by opening the html files and
checking the content of the json files from tests and see if the indexed
content makes sense (note that I remove some content from the original
html file to make it easier to read the json files, since json doesn't
support multiple lines...)

Things to do later

  • Write a guide with recommendations for static site generators/themes
    authors, to integrate nice with Read the Docs.
  • With this method we may be indexing autogenerated pages (like 404.html or search.html),
    we could make this configurable by the user in the config file, like:
search:
  exlude:
    - 404.html
    - search.html
    - search/index.html

Those would be the default values.

@stsewd stsewd requested review from ericholscher and a team Jun 19, 2020
Copy link
Member

@ericholscher ericholscher left a comment

This should make life a lot easier for us to support going forward, but we likely need some kind of user-facing error when search indexing fails. That's likely a larger project, but perhaps we need some kind of build output interface for search?


return None

def _get_main_node(self, html):
Copy link
Member

@ericholscher ericholscher Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely part of the complexity that an "only HTML" parser runs into. We're definitely going to end up parsing random navigation and other content on pages, just because we're relying on heuristics instead of the tool telling us what the exact main content is.

I don't think we can avoid that, but definitely a downside of generic HTML parsing.

Copy link
Member Author

@stsewd stsewd Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's why I want just to write a guide for authors of theme/static site generators. That would also make their site to be better indexed by search engines and work better with screen readers.

readthedocs/search/parsers.py Show resolved Hide resolved
@stsewd
Copy link
Member Author

stsewd commented Jun 23, 2020

but perhaps we need some kind of build output interface for search?

I like the idea of having a "indexing search" in the output, but indexing happens async, so not sure how that will work. Maybe having an "extra logs" tab? I can open an issue to discuss that.

Base automatically changed from recursive-parser to master Jun 23, 2020
It's under a feature flag currently,
since I would like to test this with sphinx later.

Tested with all themes available in
https://github.com/mkdocs/mkdocs/wiki/MkDocs-Themes

I did a search in all of them to check that sections are correctly indexed,
and that irrelevant content isn't indexed.

I wasn't able to build these,
but I checked the html, and it's very similar to the default mkdocs theme,
so it should work.

- https://github.com/michaeltlombardi/mkdocs-psinder
- https://github.com/byrnereese/mkdocs-bootstrap4
- https://github.com/daizutabi/mkdocs-ivory

These themes don't expose search,
but the content is indexed and we show results in the dashboard.

- https://gitlab.com/lramage/mkdocs-bootstrap386
- https://gitlab.com/lramage/mkdocs-gitbook-theme

Things to do later:

Write a guide with recommendations for static site generators/themes
authors, to integrate nice with Read the Docs.

Other problems I found:

We don't override the search for https://github.com/squidfunk/mkdocs-material
(they don't use the identifier we use to override search).
That isn't related to this, but just wanted to mention it.
We can send a patch to fix that.

You can check the results by either building mkdocs projects and
searching (you need to enable the mkdocs search feature flag and the
index from html feature flag), or just by opening the html files and
checking the content of the json files from tests and see if the indexed
content makes sense (note that I remove some content from the original
html file to make it easier to read the json files, since json doesn't
support multiple lines...)
@stsewd stsewd force-pushed the index-from-html-mkdocs branch from a0d645c to 47ffe35 Compare Jun 23, 2020
@stsewd stsewd merged commit 9d5387d into master Jun 23, 2020
2 checks passed
@stsewd stsewd deleted the index-from-html-mkdocs branch Jun 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants