Search: recursively parse sections #7207

Merged
merged 1 commit into from Jun 23, 2020

stsewd (Member) commented Jun 18, 2020

This is on top of #7204

If we have a structure like

  • parent
    • content
    • content
      • h1
      • content
    • content
      • h2
      • content

And we start indexing from parent,
we index all of its children in the first step,
and then index the content under each header again later. That is, the content is duplicated.

This is solved by checking for a section up to one level deep.
In this example, the first parsing pass stops when it finds the first h1,
so no content is duplicated. The remaining nodes are indexed later as usual.
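
The stopping rule described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `Node`, `full_text`, and `collect_content` are hypothetical names, and the tree is a hand-built stand-in for parsed HTML.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML node (not the parser's real type)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def full_text(node: Node) -> str:
    """Return the node's own text followed by the text of all its descendants."""
    parts = [node.text] + [full_text(child) for child in node.children]
    return " ".join(part for part in parts if part)


def collect_content(nodes: List[Node], depth: int = 1) -> Tuple[List[str], bool]:
    """Collect text from `nodes` in order, stopping at the first section.

    A heading marks the start of a section. Containers are searched for a
    heading up to `depth` levels down; below that, a node's text is taken whole.
    Returns the collected texts and whether a section was found.
    """
    texts: List[str] = []
    for node in nodes:
        if node.tag in HEADINGS:
            return texts, True
        if depth > 0 and node.children:
            inner, found = collect_content(node.children, depth - 1)
            texts.extend(inner)
            if found:
                return texts, True
        else:
            texts.append(full_text(node))
    return texts, False


# The structure from the description: two wrappers that each hold a header.
parent = Node("div", children=[
    Node("p", "intro"),
    Node("div", children=[Node("h1", "Title one"), Node("p", "body one")]),
    Node("div", children=[Node("h2", "Title two"), Node("p", "body two")]),
])

# Without the check, the first pass swallows the content under both headers.
shallow, _ = collect_content(parent.children, depth=0)
# With a one-level check, the first pass stops at the wrapper holding the h1.
checked, found = collect_content(parent.children, depth=1)
```

In this sketch, `shallow` contains the text under both headers, which would be indexed again when each header is processed, while `checked` contains only the text that precedes the first header.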

Also, we can increase the depth of this check when parsing all sections. That way we no longer rely on the div that Sphinx uses to enclose a section, and we avoid indexing duplicated content when other themes don't follow the same structure.
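
One way to read the paragraph above, as a hedged sketch: when the content for each header is collected, the search for a nested section looks a couple of levels down, so a subsection wrapped in arbitrary containers is still detected. The names `Node`, `text_of`, `content_until_section`, and `index_sections`, the default depth of 2, and the example markup are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML node."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def text_of(node: Node) -> str:
    """Return the node's own text followed by the text of all its descendants."""
    parts = [node.text] + [text_of(child) for child in node.children]
    return " ".join(part for part in parts if part)


def content_until_section(nodes: List[Node], depth: int) -> Tuple[List[str], bool]:
    """Collect text from `nodes`, stopping at a heading found within `depth` levels."""
    texts: List[str] = []
    for node in nodes:
        if node.tag in HEADINGS:
            return texts, True
        if depth > 0 and node.children:
            inner, found = content_until_section(node.children, depth - 1)
            texts.extend(inner)
            if found:
                return texts, True
        else:
            texts.append(text_of(node))
    return texts, False


def index_sections(root: Node, depth: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (title, content) for every heading under `root`, in document order."""
    for i, child in enumerate(root.children):
        if child.tag in HEADINGS:
            # The content of a section is whatever follows its heading,
            # up to the next section detected within `depth` levels.
            texts, _ = content_until_section(root.children[i + 1:], depth)
            yield child.text, " ".join(texts)
        else:
            yield from index_sections(child, depth)


# A theme that nests a subsection inside two generic wrappers
# instead of a single section div.
body = Node("body", children=[
    Node("h1", "Guide"),
    Node("p", "Intro."),
    Node("div", children=[
        Node("div", children=[Node("h2", "Install"), Node("p", "Run pip.")]),
    ]),
])

deep = list(index_sections(body, depth=2))
shallow = list(index_sections(body, depth=1))
```

With `depth=2` the "Guide" section ends before the nested "Install" section. With `depth=1` the check does not reach the inner wrapper, so the "Install" text ends up in both sections.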

A real example of this is https://github.com/readthedocs/readthedocs.org/blob/a0d645c9b561c0189ba0956a1554f577c413ecdf/readthedocs/search/tests/data/mkdocs/in/gitbook/index.html (from #7208)

ericholscher (Member) left a comment

This could use a test to show exactly how it is working.

stsewd (Member, Author) commented Jun 23, 2020

The gitbook theme at https://github.com/readthedocs/readthedocs.org/blob/a0d645c9b561c0189ba0956a1554f577c413ecdf/readthedocs/search/tests/data/mkdocs/in/gitbook/index.html is a test case for this. I just wanted to put this logic in a separate PR to avoid making the other one more complex. Without this logic, the tests on the other PR fail.

Base automatically changed from more-general-parser to master Jun 23, 2020
stsewd merged commit ec9022c into master Jun 23, 2020
2 checks passed
stsewd deleted the recursive-parser branch Jun 23, 2020