Custom word splitting regex #2818

Closed
wants to merge 3 commits into from


thequeenofspades

Added feature to pass in custom word splitting regex in html_search_options (English only)

Use by setting wordre in html_search_options in conf.py (example below)

html_search_options = { 'wordre': r'(?u)[\w\.\\:\/-]+' }

Sample regex above allows users to search for strings containing these punctuation characters: \, /, :, ., and -.
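
To illustrate the difference, here is a minimal sketch comparing a plain `\w+` word pattern with the sample pattern. It uses only Python's `re` module, not Sphinx itself; the names `DEFAULT_WORD_RE`, `CUSTOM_WORD_RE`, and `split_words` are hypothetical, and treating `\w+` as the default splitting pattern is an assumption made for the comparison.

```python
import re

# Assumed baseline: split on runs of word characters only.
DEFAULT_WORD_RE = re.compile(r'\w+')
# The sample pattern from this PR. The (?u) flag is left out because
# str patterns are already Unicode-aware in Python 3.
CUSTOM_WORD_RE = re.compile(r'[\w\.\\:\/-]+')


def split_words(pattern, text):
    """Return every non-overlapping match of `pattern` in `text`."""
    return pattern.findall(text)


sample = "See https://example.com/docs for the built-in search"

default_words = split_words(DEFAULT_WORD_RE, sample)
custom_words = split_words(CUSTOM_WORD_RE, sample)

print(default_words)  # URL and hyphenated word are broken into pieces
print(custom_words)   # URL and hyphenated word are kept whole
```

With the plain pattern the URL is split into `https`, `example`, `com`, and `docs`; with the sample pattern it stays a single searchable term.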

@tk0miya
Member

tk0miya commented Aug 7, 2016

I feel a regexp is too powerful a tool for configuring word-splitting behavior.
I would prefer to introduce a new splitter-extension interface for this case.

@shimizukawa In the Japanese search, you have already introduced three types of splitters.
If you have any ideas, could you share them with us?

@thequeenofspades
Author

@tk0miya Can you describe what you mean by an interface of splitter-extensions? The main desired functionality (from my perspective) of this PR is to be able to search for URLs and hyphenated words without having to break them up into fairly unintuitive chunks (that often don't return what they should).

@shimizukawa
Member

The Japanese search module has a pluggable interface: https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/ja.py#L534
IMO all languages should have the same pluggable interface for custom splitters, instead of a 'wordre' option for English search only.
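
One possible shape for such a language-independent interface is sketched below. This is an assumption for illustration, not Sphinx's actual API: `BaseSplitter`, `WordSplitter`, `PunctuationKeepingSplitter`, the `SPLITTERS` registry, `make_splitter`, and the `'splitter'` option key are all hypothetical names.

```python
import re
from abc import ABC, abstractmethod


class BaseSplitter(ABC):
    """Hypothetical base class that every splitter would derive from."""

    def __init__(self, options):
        self.options = options

    @abstractmethod
    def split(self, text):
        """Return the list of searchable words found in `text`."""


class WordSplitter(BaseSplitter):
    """Splits on runs of word characters only."""

    _word_re = re.compile(r'\w+')

    def split(self, text):
        return self._word_re.findall(text)


class PunctuationKeepingSplitter(BaseSplitter):
    """Keeps '.', '\\', ':', '/', and '-' inside words."""

    _word_re = re.compile(r'[\w.\\:/-]+')

    def split(self, text):
        return self._word_re.findall(text)


# Hypothetical registry mapping an option value to a splitter class.
SPLITTERS = {
    'default': WordSplitter,
    'keep-punctuation': PunctuationKeepingSplitter,
}


def make_splitter(search_options):
    """Pick a splitter by name, falling back to the default one."""
    name = search_options.get('splitter', 'default')
    return SPLITTERS[name](search_options)


splitter = make_splitter({'splitter': 'keep-punctuation'})
tokens = splitter.split("search for well-known usr/local/bin")
fallback_tokens = make_splitter({}).split("well-known")
print(tokens)
```

Selecting a named splitter class rather than passing a raw pattern keeps the configuration surface small, which seems to be the concern raised earlier in the thread about regexps being too powerful.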

@shimizukawa shimizukawa assigned tk0miya and unassigned shimizukawa Oct 15, 2016
@roxannemoslehi

Hi @shimizukawa & @tk0miya. I'm taking over this custom word splitting that Hana was working on, and I wanted to clarify what exactly you're asking for.

I realize there are 3 specific splitter options for the Japanese language. Are you asking for a similar custom splitter within the English language (so introducing a custom splitter class the users can specify with the splitter name rather than providing a specific regex pattern in html_search_options)? Or, instead, are you asking that we give users the option to provide a custom regex pattern to use as the splitter for any of the available languages rather than just English?

@AA-Turner
Member

Closing this PR as it needs more discussion on an issue -- making SearchLanguage._word_re or .split extensible would be my suggestion.
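
As a rough sketch of what that extensibility could look like: the `SearchLanguage` class below is a simplified stand-in written for this example, not the real `sphinx.search.SearchLanguage`, and the two subclasses are hypothetical. It assumes the base class exposes a compiled `_word_re` attribute and a `split` method that applies it.

```python
import re


class SearchLanguage:
    """Simplified stand-in for a search language class (assumption)."""

    _word_re = re.compile(r'\w+')

    def split(self, input):
        return self._word_re.findall(input)


class UrlFriendlySearch(SearchLanguage):
    """Option 1: override only the pattern; split() is inherited."""

    _word_re = re.compile(r'[\w.\\:/-]+')


class HyphenAwareSearch(SearchLanguage):
    """Option 2: override split() to index a hyphenated word and its parts."""

    def split(self, input):
        words = []
        for word in re.findall(r'[\w-]+', input):
            words.append(word)
            if '-' in word:
                # Also keep the individual parts so partial queries match.
                words.extend(part for part in word.split('-') if part)
        return words


url_words = UrlFriendlySearch().split("visit https://example.com/docs")
hyphen_words = HyphenAwareSearch().split("a built-in role")
print(url_words)
print(hyphen_words)
```

Overriding `_word_re` covers the simple case from this PR, while overriding `split` leaves room for logic that a single pattern cannot express.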

A

@AA-Turner AA-Turner closed this May 23, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 23, 2022