Setup tokenization for Material for MkDocs search #264

Closed
HonkingGoose opened this issue Jan 5, 2023 · 21 comments

@HonkingGoose
Collaborator

What browser are you using?

Firefox

Other browser name

No response

Describe the bug

I can't find the presets via the Material for MkDocs search.

Steps to reproduce

  1. Go to production docs site.
  2. Enter workarounds:javaLTSVersions in the search bar
  3. Search bar says: "no matching documents"
  4. But we do have a page with workarounds:javaLTSVersions as the heading title: https://docs.renovatebot.com/presets-workarounds/#workaroundsjavaltsversions

Additional context

@viceice thinks the : character breaks the search somehow.

Related issue:

@HonkingGoose added the bug ("Something isn't working") label on Jan 5, 2023
@TWiStErRob

This might be a telling sign:
[screenshot]
Notice how the word "js" is not found on the page "js-lib", but if you search for js alone, there are results:
[screenshot]
This should confirm the : theory.

@TWiStErRob

https://www.mkdocs.org/user-guide/configuration/#separator

I think mkdocs.yml change would fix this:

plugins:
    - search:
        separator: '[\s\-.:]+'
        min_search_length: 2
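
To see what this separator would change, here is a minimal Python sketch. It only emulates the splitting with Python's re module; the real tokenization happens in the browser inside the search plugin, so treat it as an illustration. The names tokenize, DEFAULT_SEPARATOR and PROPOSED_SEPARATOR are made up for this example, and the default value is my assumption based on the MkDocs docs.

```python
import re

# Assumption: the stock MkDocs search plugin splits on whitespace and dashes only.
DEFAULT_SEPARATOR = r"[\s\-]+"
# The separator proposed above, which also splits on "." and ":".
PROPOSED_SEPARATOR = r"[\s\-.:]+"


def tokenize(text, separator):
    """Split text on the separator regex and drop empty tokens."""
    return [token for token in re.split(separator, text) if token]


heading = "workarounds:javaLTSVersions"
print(tokenize(heading, DEFAULT_SEPARATOR))   # ['workarounds:javaLTSVersions']
print(tokenize(heading, PROPOSED_SEPARATOR))  # ['workarounds', 'javaLTSVersions']
```

With the default, the whole preset name is one token, so a query tokenized differently has nothing to match against. With the colon added, both halves are indexed separately.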

@HonkingGoose
Collaborator Author

> @HonkingGoose please open another issue for search, I can confirm that search experience is pretty bad for any presets, even when trying to search just for the thing after the :. I think tokenization setup is messed up (if it's configurable)

The key term I needed was tokenization. 😄 Quote from the Material for MkDocs manual: [1]

separator

Default: automatically set – The separator for indexing and query tokenization can be customized, making it possible to index parts of words separated by other characters than whitespace and -, e.g. by including .:

plugins:
  - search:
      separator: '[\s\-\.]+'

With 9.0.0, a faster and more flexible tokenizer method is shipped, allowing for tokenizing with lookahead, which yields more influence on the way documents are indexed. As a result, we use the following separator setting for this site's search:

plugins:
  - search:
      separator: '[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'
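
For readers who find that separator hard to parse, here is a rough sketch that spells out its four alternatives using Python's re.VERBOSE mode. This is my own reading of the regex, run through Python rather than the JavaScript engine the search actually uses, so the comments and the tokenize helper are illustrative assumptions, not part of the manual.

```python
import re

SEPARATOR = re.compile(
    r'''
    [\s\-,:!=\[\]()"/]+      # 1. runs of whitespace and common punctuation
    | (?!\b)(?=[A-Z][a-z])   # 2. zero-width split inside a word, before "Xy" (case change)
    | \.(?!\d)               # 3. a dot, unless a digit follows (keeps 1.2 intact)
    | &[lg]t;                # 4. the HTML entities for the less-than and greater-than signs
    ''',
    re.VERBOSE,
)


def tokenize(text):
    """Split on the separator and drop the empty strings re.split can produce."""
    return [token for token in SEPARATOR.split(text) if token]


print(tokenize("foo.bar 9.0.7 fooBar a&lt;b"))
# ['foo', 'bar', '9.0.7', 'foo', 'Bar', 'a', 'b']
```

The second and third alternatives are the lookahead parts: they decide where to split by peeking at the following characters without consuming them.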

Footnotes

  1. https://squidfunk.github.io/mkdocs-material/setup/setting-up-site-search/#built-in-search-plugin

@HonkingGoose changed the title from "No search results for workarounds:javaLTSVersions" to "Setup tokenization for Material for MkDocs search" on Jan 5, 2023
@TWiStErRob

'[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

Bless you! What a beauty 🤣

@TWiStErRob

Make sure to test thoroughly because that "case change" part might mess things up. Is there a way to deploy the website into public URL but not prod to test before merge?

@viceice
Member

viceice commented Jan 5, 2023

Only via dev server from local / codespaces or gitpod

@HonkingGoose
Collaborator Author

I copy/pasted the example code into a branch:

plugins:
  - search:
      separator: '[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

This makes the search match too much. The search prediction also shows too much. So just copy/pasting things is right out. 😄

I don't understand this, so I'll let one of you regex wizards fix this problem. 😉

@TWiStErRob

Can you give an example of "too much" please (screenshot)

@HonkingGoose
Collaborator Author

HonkingGoose commented Jan 9, 2023

Hmm, I can't reproduce my problem with the example search tokenization anymore. Maybe the upstream fixed something, or I was messing things up. 😄

Here's my branch: https://github.com/HonkingGoose/renovatebot.github.io/tree/search-tokens

You can check it out with GitHub Codespaces, or use Gitpod to test things yourself. 😉

@TWiStErRob

TWiStErRob commented Jan 9, 2023

Btw, I fully understand this regex, the question is what requirements do you want. What are reasonable token-separators for Renovate? Tokenization simply splits along these characters / magical places. The options are (from left to right from the above separator):

  • whitespace (foo bar -> foo bar)
    usual word separator, definitely on
  • dash (foo-bar -> foo bar)
    would split up HTML class-names / IDs, but Renovate docs don't have these (Bootstrap docs would have); anyway, usually these would be counted as one token in programming, but two words in sentences. I'm not sure. Probably off to reduce false positives. Edit: actually, some names use it for hierarchical separator: monorepo:sitecore-jss, maybe worth turning it on to be able to search jss.
  • comma (foo,bar -> foo bar)
    usual list separator, definitely on
  • colon (foo:bar -> foo bar)
    Renovate specific (see OP), should be on, because otherwise you can't find workarounds:javaLTSVersions by just looking for javaLTSVersions. It would also help splitting JSON "key":"value" pairs along with "
  • exclamation mark (foo!bar -> foo bar)
    sentence terminator + usually not part of any "token" in programming, keep on
  • equal sign (foo=bar -> foo bar)
    splits up assignments, there's no imperative code in renovate docs, but good to keep anyway.
  • square brackets (foo[bar] -> foo bar)
    definitely on for all kinds of parentheses
  • parentheses (foo(bar) -> foo bar)
    definitely on for all kinds of parentheses
  • curly braces (foo{bar} -> foo bar)
    I would add this to the list too.
  • angle brackets (foo<bar> -> foo bar)
    I would add this to the list too.
  • quotes (foo"bar" -> foo bar)
    probably good to tokenize along, otherwise in the above example bar wouldn't be searchable.
  • slash (foo/bar -> foo bar)
    usual hierarchical separator, would keep it on
  • pipe (foo|bar -> foo bar)
    usual alternative separator, would keep it on
  • case change (fooBar -> foo Bar)
    separates words, but not necessary because we treat programming names as one unit, i.e. no one would split: javaLTSVersions -> java LTSVersions; the surrounding text will match "java", "LTS" and "versions" if the documentation warrants it.
  • dot (foo.bar -> foo bar, except 1.2 -> 1.2)
    this is a nice one, would keep
  • &lt; and &gt; (I guess tokenization happens in HTML or XML?)
    can't hurt
  • anything else?

Please check/edit the ones above you want to keep and I'll refine the regex. I put my reasons why it should/shouldn't be included.
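
The list above can be checked mechanically. Below is a small Python sketch that runs each example through the separator quoted from the manual. It uses Python's re module as a stand-in for the real tokenizer, which is an assumption on my side, and the tokenize helper is made up for the illustration.

```python
import re

# The separator quoted from the Material for MkDocs manual earlier in this thread.
SEPARATOR = r'[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'


def tokenize(text):
    """Split on the separator and drop the empty strings re.split can produce."""
    return [token for token in re.split(SEPARATOR, text) if token]


examples = [
    "foo bar", "foo-bar", "foo,bar", "foo:bar", "foo!bar", "foo=bar",
    "foo[bar]", "foo(bar)", 'foo"bar"', "foo/bar", "fooBar", "foo.bar",
    "1.2", "foo{bar}", "foo<bar>", "foo|bar", "javaLTSVersions",
]
for example in examples:
    print(f"{example!r:20} -> {tokenize(example)}")
```

Under Python's re, the curly brace, angle bracket and pipe examples come out as a single token because those characters are not in the character class, and javaLTSVersions splits into javaLTS and Versions, since the case-change rule only fires before an uppercase letter that is followed by a lowercase one.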

@TWiStErRob

After going through the above exercise it looks like all of them are useful for something, maybe even case change if javaLTSVersions directly searched is matching the right thing.

Tip, you can experiment with the regex at https://regex101.com/r/VflcWH/1

@HonkingGoose
Collaborator Author

> Is there a way to deploy the website into public URL but not prod to test before merge?

Yes, it's called Vercel. 😜 I use Vercel to host a small docs site, and I get a link to a public preview URL on each pull request. Makes it really easy to click around and prod things until I'm happy things work.

Vercel can cost money though, once you exceed certain limits of the free tier. It's for the maintainers to decide if they want to spend time/money switching from GitHub Pages to Vercel.

> Btw, I fully understand this regex, the question is what requirements do you want.

Lucky you, I never managed to get far with learning regex, it just looks like gibberish to me. I'd rather click and type around in the development server preview and see how the search behaves with real data. 😄

For me the big things are that you should be able to find the presets when searching for them, either by their full name or parts of their name.

@TWiStErRob

> Yes, it's called Vercel. 😜

I know, I was more curious if it or another was set up already. I managed to get Codespaces running, it's pretty nice, but not public. Anyway, it'll do for now.

> Lucky you, I never managed to get far with learning regex

It was forced on us at uni; we had to learn language theory, and regex is the most basic class of languages. Although I knew practical regex before I knew the theory, because the Operating Systems class taught basic grep/sed. I thoroughly recommend learning it: for text processing (search, replace) it's unbeatable, and since our websites and source code are text, we do text processing a lot ;) The basics are only a few symbols, and after that, using a site like regex101.com just for syntax highlighting helps a lot with reading them.

> For me the big things are that you should be able to find the presets when searching for them, either by their full name or parts of their name.

Looks to me that your branch is pretty good now compared to prod. I added a few more as described above (I kept case switches too, because they help to find config options partially):
separator: '[\s\-,:!?=\[\]()<>{}"/\\]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'
The only problem is that foo:bar presets are not directly searchable, but searching for foo or bar alone (or even sub-words of those) yields way better results than in prod right now. To me it looks like this might be a bug in the search plugin, not our usage. If I was you, I would merge this ^ regex and open an issue upstream to ask them why search containing colon is not working.
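
As a rough illustration of why searching for parts of a preset name works better with this separator, here is a Python sketch. The tokenize and matches helpers are hypothetical, and the prefix matching is a toy stand-in of my own, not how the search plugin actually scores results.

```python
import re

# The extended separator from this comment.
EXTENDED_SEPARATOR = (
    r'[\s\-,:!?=\[\]()<>{}"/\\]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'
)


def tokenize(text):
    """Split on the separator and drop empty tokens."""
    return [token for token in re.split(EXTENDED_SEPARATOR, text) if token]


def matches(query, heading):
    """Toy matcher: every query token must be a prefix of some heading token."""
    heading_tokens = [token.lower() for token in tokenize(heading)]
    return all(
        any(token.startswith(part.lower()) for token in heading_tokens)
        for part in tokenize(query)
    )


print(tokenize("workarounds:javaLTSVersions"))  # ['workarounds', 'javaLTS', 'Versions']
print(tokenize("monorepo:sitecore-jss"))        # ['monorepo', 'sitecore', 'jss']
print(matches("jss", "monorepo:sitecore-jss"))             # True
print(matches("javaLTS", "workarounds:javaLTSVersions"))   # True
print(matches("gradle", "workarounds:javaLTSVersions"))    # False
```

In this toy model the full foo:bar query matches as well, because the query is tokenized the same way as the heading. That is consistent with the suspicion that the colon problem sits in the search plugin rather than in the separator.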

> This makes the search match too much. The search prediction also shows too much.

It shows more than production, that's for sure, but that's only because there's a problem in prod :) I think you've gotten used to so few search results that a normal amount looks like "too much" :D Additionally, it would be nice to disable these "Missing" search results, but I can't seem to find an option for it:
[screenshot]
That said, it might help people discover more related options.

@TWiStErRob

Make sure #265 doesn't close this issue when it's merged. I reported the problem upstream: squidfunk/mkdocs-material#4884

@HonkingGoose
Collaborator Author

Thank you for reporting the problem upstream. ❤️

This issue should remain open, I'm not using any closes keywords in my PR's body text. 😉

@TWiStErRob

So this will be fixed as soon as Renovatebot picks up the new patch, right?

@HonkingGoose
Collaborator Author

We should test the new behavior after applying the latest patch for Material for MkDocs. Then we know if the search is fixed now.

@viceice
Member

viceice commented Jan 29, 2023

the update is currently pending because of stability days

@HonkingGoose
Collaborator Author

We're using Material for MkDocs 9.0.7. When I put workarounds:javaLTSVersions in the search bar, I get the correct result! 🥳

[screenshot: search-matches-input]

@TWiStErRob

Confirmed, I think we can close this as fixed by squidfunk/mkdocs-material#4884 (comment) via 294979e.

It also works for prefixes.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 4, 2023