Code blocks in documentation search #10139

Open
2 of 5 tasks
tbrlpld opened this issue Feb 22, 2023 · 12 comments

Comments

@tbrlpld
Contributor

tbrlpld commented Feb 22, 2023

When using search on the docs, the surfaced results seem to ignore words that are found in code blocks (inline or block).

E.g. the search for image_url only reveals two entries:

https://docs.wagtail.org/en/stable/search.html?q=image_url

image

The second result matches the search term more closely and contains it multiple times, but the other instances are not shown in the results view. Not sure if this is because the page was already linked. In that case it would be nice to see the more fitting results first.

https://docs.wagtail.org/en/stable/advanced_topics/images/image_serve_view.html#generating-dynamic-image-urls-in-python

Screenshot 2023-02-22 at 06 59 49

But, there are other pages in the docs that the search completely missed.

https://docs.wagtail.org/en/stable/advanced_topics/performance.html#image-urls

image

I wonder if we can configure Algolia somehow to pick up code blocks better.


Tasks

  1. Documentation type:Bug
    laymonage
  2. Documentation Infrastructure type:Bug
    thibaudcolas
@gasman
Collaborator

gasman commented Feb 22, 2023

This is a known problem with the Algolia search indexing along with #8159, although it seems that we didn't have an open issue for it already, so thanks for reporting!

@thibaudcolas
Member

Thanks for the report @tbrlpld! Can confirm this is the case, though only for code blocks. Inline code formatting is still searchable as long as it’s inside a heading, paragraph, or list item.

For code blocks – this is intentional as per recommendations from Algolia, who state indexing code blocks creates a lot of noise since there’s lots of repetition in code.

We’ve discussed this at the last core team meeting and decided to give this a go anyway, as we now have much more control over how the indexing of the docs is configured. So we’ll be able to compare results with and without code blocks indexed.

@thibaudcolas thibaudcolas self-assigned this Mar 7, 2023
@thibaudcolas thibaudcolas added this to the 5.0 milestone Mar 7, 2023
@tbrlpld tbrlpld changed the title Code blocks don't appear seem to be discoverable through search Code blocks don't seem to be discoverable through search Mar 7, 2023
@tbrlpld
Contributor Author

tbrlpld commented Mar 7, 2023

Awesome. Thanks for the update @thibaudcolas

@thibaudcolas
Member

thibaudcolas commented Mar 9, 2023

I have updated the Documentation search wiki page with a copy of our crawler configuration. There are still a few steps to go through before we can try out code block indexing but we’re getting closer.

Once we’re ready to try this indexing, here is the recordExtractor configuration including code blocks:

recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl1: ["header h1", "article h1", "main h1", "h1", "head > title"],
      content: ["article p, article li", "main p, main li", "p, li, pre"],
      lvl0: {
        selectors: "",
        defaultValue: "Documentation",
      },
      lvl2: ["article h2", "main h2", "h2"],
      lvl3: ["article h3", "main h3", "h3"],
      lvl4: ["article h4", "main h4", "h4"],
      lvl5: ["article h5", "main h5", "h5"],
      lvl6: ["article h6", "main h6", "h6"],
    },
    aggregateContent: true,
    recordVersion: "v3",
  });
},

The only difference is the pre in the last content selector. I’m not sure how exactly to configure this so might need a bit more research before we do our trial.
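
One possible arrangement, sketched under assumptions rather than tested against the crawler: if the entries in content act as fallbacks (so only the first matching entry is used), then pre in the last entry alone would not apply to pages that match article or main. The hypothetical helper below, withCodeBlocks, appends a scoped pre to every entry instead. Neither the helper nor the resulting selectors come from our actual configuration.

```javascript
// Hypothetical helper (not part of the crawler API): append a scoped "pre"
// selector to each content entry. Assumes every entry is a comma-separated
// selector list whose parts share one scope prefix, e.g. "article p, article li".
function withCodeBlocks(contentSelectors) {
  return contentSelectors.map((selector) => {
    const first = selector.split(",")[0].trim(); // "article p" or "p"
    const parts = first.split(/\s+/);
    const scope = parts.slice(0, -1).join(" "); // "article", or "" when unscoped
    const pre = scope ? `${scope} pre` : "pre";
    return `${selector}, ${pre}`;
  });
}

const content = withCodeBlocks([
  "article p, article li",
  "main p, main li",
  "p, li",
]);
console.log(content);
```

The resulting array could then be used as the content value in recordProps, if the fallback assumption holds.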

@thibaudcolas
Member

Now ready to be picked up (was waiting on #8159). If anyone has suggested searches to try this out with please post them here.

While investigating this I think I might have also spotted another related issue: dt elements which we use for Python classes don’t seem to be indexed either.

@thibaudcolas
Member

Here is the comparison, with results from Algolia as-is, Algolia with code blocks indexed, and Read the Docs – side by side across 44 different queries: https://wagtail-docs-search-comparison.netlify.app/.

The code blocks indexing gives code blocks relatively low priority over keywords appearing in headings, so the difference isn’t too big, but still there. Searches with differences:

Based on this I’d say I generally find results with code blocks better than those without. The difference is rarely big (aside from get_children), but code blocks seem to have the advantage of adding extra results when there might not be anything otherwise.

@tbrlpld
Contributor Author

tbrlpld commented Mar 21, 2023 via email

@thibaudcolas
Member

@tbrlpld I’ve added this specific example at https://wagtail-docs-search-comparison.netlify.app/#image_url

I think that result is there already, even in the first screenshot you shared in this issue? We get one more result with code search turned on but it seems to be a partial match on "image" rather than a match of image_url.

@tbrlpld
Contributor Author

tbrlpld commented Mar 21, 2023

Right yea, the page is there. I guess I was confused because it's listed due to the substring match on "generate_image_url" rather than the exact match of "image_url", which also appears multiple times on the page.

But right, it looks like it now highlights one exact match too 👍

@thibaudcolas thibaudcolas removed their assignment Mar 22, 2023
@allcaps
Member

allcaps commented Mar 22, 2023

Searching for base_form_class, I'd expect to be in the results:

Currently, all Algolia results point to the Panel API.


As a comparison it might sometimes be useful to Google search with: site:docs.wagtail.org/en/latest/ SEARCH_QUERY
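
For anyone wanting to script that comparison, a minimal sketch of building the query URL follows. The function name googleSiteSearchUrl is made up for illustration, and the only thing assumed about Google is the usual q query parameter.

```javascript
// Hypothetical helper: build a Google search URL restricted to the latest docs.
function googleSiteSearchUrl(query) {
  const q = `site:docs.wagtail.org/en/latest/ ${query}`;
  // encodeURIComponent escapes ":", "/" and the space; "_" and "." are left as-is.
  return `https://www.google.com/search?q=${encodeURIComponent(q)}`;
}

console.log(googleSiteSearchUrl("base_form_class"));
```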

@thibaudcolas
Member

Thanks Coen, I’ve added this specific query to the comparison. It’s a very interesting one because it highlights how Google, Algolia, RTD differ in what they index / how they return results:

  • Google gives you one result per relevant page
  • Algolia gives you one result per relevant section of a page
  • RTD gives you one result per relevant page, but has relevant sections within that parent result
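
To make the difference concrete, here is a small sketch that takes section-level hits (the Algolia-style shape) and groups them by parent page (closer to what RTD shows). The hit objects and URLs are invented for illustration and do not reflect the real index records.

```javascript
// Hypothetical section-level hits: one entry per matching section of a page.
const sectionHits = [
  { url: "/example/panels.html#field-panel", title: "Field panel" },
  { url: "/example/panels.html#inline-panel", title: "Inline panel" },
  { url: "/example/forms.html#custom-forms", title: "Custom forms" },
];

// Group hits by page by dropping the "#fragment" part of each URL.
function groupByPage(hits) {
  const pages = new Map();
  for (const hit of hits) {
    const page = hit.url.split("#")[0];
    if (!pages.has(page)) pages.set(page, []);
    pages.get(page).push(hit);
  }
  return pages;
}

for (const [page, hits] of groupByPage(sectionHits)) {
  console.log(page, hits.map((hit) => hit.title));
}
```

With this grouping, a page with many matching sections shows up once with its sections nested underneath, rather than filling the result list on its own.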

There are a couple things we could try in Algolia to improve its results:

  • Only index signatures’ class names / function names, not the parameters – so Panel API results are no longer there
  • Index class attributes as content rather than sections of the page – so there is a single Panel API result.
  • Avoid indexing those heading levels separately (indexHeadings config option) – unsure of the result
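
As a sketch of the first option, the function below reduces a signature's text to its name, so a parameter such as base_form_class would no longer be part of the indexed heading. The function name signatureName and the sample signature are hypothetical. How this would be wired into the record extractor is not shown and would need checking against the crawler's capabilities.

```javascript
// Hypothetical helper: keep only the (dotted) name of a class or function
// signature, dropping the parameter list and a leading "class" keyword.
function signatureName(signatureText) {
  const withoutParams = signatureText.split("(")[0];
  return withoutParams.replace(/^\s*class\s+/, "").trim();
}

// Illustrative signature only; not copied from the docs.
console.log(
  signatureName("class example.ExamplePanel(children=(), base_form_class=None)")
);
```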

@laymonage
Member

laymonage commented Mar 22, 2023

Could we make it so that code blocks rank lower than other content? If so, I assume pages that mention base_form_class in a paragraph would be shown before the Panel API page in the results.

Edit: Just realised that it isn't a code block but rather a signature. Still, could potentially be applied too.
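
Roughly what I have in mind, as a sketch only: each hit carries the kind of element it matched in, and a weight per kind decides the order. The weights, the source field, and the sample hits are all made up. This does not use Algolia's ranking settings, it just illustrates the intended ordering.

```javascript
// Hypothetical weights per match source; a higher weight sorts first.
const SOURCE_WEIGHT = { heading: 3, paragraph: 2, signature: 1, code: 1 };

// Return a new array ordered by weight. Array.prototype.sort is stable,
// so hits with equal weight keep their original relative order.
function rankHits(hits) {
  return [...hits].sort(
    (a, b) => SOURCE_WEIGHT[b.source] - SOURCE_WEIGHT[a.source]
  );
}

// Invented hits for the query "base_form_class".
const ranked = rankHits([
  { page: "Panel API", source: "signature" },
  { page: "Customising forms", source: "paragraph" },
  { page: "Example snippet", source: "code" },
]);
console.log(ranked.map((hit) => hit.page));
```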

@thibaudcolas thibaudcolas removed this from the 5.0 milestone Apr 18, 2023
@thibaudcolas thibaudcolas changed the title Code blocks don't seem to be discoverable through search Code blocks in documentation search Apr 20, 2023