searchindex: non-main index entries scored too high #11578

Closed
bradking opened this issue Aug 10, 2023 · 7 comments · Fixed by #11696

@bradking
Contributor

bradking commented Aug 10, 2023

Describe the bug

Since #10819, index entries are returned as search results. When the query string exactly matches an indexed term, all index entries become search results with score 100. This places them above most other search results.

Reporting "main" index entries early in the search results makes sense. However, non-main index entries are often just arbitrary cross-references to the indexed term. Currently these receive the same score as main index entries and are therefore ordered early in search results. This obscures the main entries and other more-important matches such as document titles.
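As a rough illustration of why this happens, here is a minimal Python model of the index-entry scoring that appears in the `searchtools.js` excerpt quoted in this issue. The function names `js_round` and `index_entry_score` are made up for this sketch; only the match condition and the formula come from the quoted JavaScript.

```python
import math
from typing import Optional


def js_round(x: float) -> int:
    # JavaScript's Math.round rounds halves up, while Python's round()
    # rounds halves to even, so emulate it explicitly (non-negative input).
    return math.floor(x + 0.5)


def index_entry_score(query: str, entry: str) -> Optional[int]:
    """Return the score an index entry gets for a query, or None if no match."""
    query_lower = query.lower()
    entry_lower = entry.lower()
    # Match when the query is a substring of the entry and is at least
    # half as long as the entry.
    if query_lower in entry_lower and len(query_lower) >= len(entry_lower) / 2:
        return js_round(100 * len(query_lower) / len(entry_lower))
    return None


# Whether an entry is "main" is not an input to the score, so main and
# non-main entries for the same term necessarily score the same.
print(index_entry_score("CMAKE_TOOLCHAIN_FILE", "cmake_toolchain_file"))
print(index_entry_score("cmake_toolchain", "cmake_toolchain_file"))
print(index_entry_score("cmake", "cmake_toolchain_file"))
```

The score depends only on the length ratio between the query and the entry, which is why every entry for an exactly matched term reaches 100.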

This problem affects CMake's documentation, and is tracked in CMake Issue 25175. Here are some examples showing a search in CMake 3.27.1's documentation.

Using Sphinx 5.3.0: [screenshot: cmake-3.27.1-sphinx-5.3.0]
Using Sphinx 5.1.1: [screenshot: cmake-3.27.1-sphinx-5.1.1]
Using Sphinx 5.3.0 plus a patch that removes non-main index entries: [screenshot: cmake-3.27.1-sphinx-5.3.0-patch]

The patch is for demonstration in this issue and is not a proposed fix:

diff --git a/sphinx/search/__init__.py b/sphinx/search/__init__.py
index 5330d7e7c..1741ff7ff 100644
--- a/sphinx/search/__init__.py
+++ b/sphinx/search/__init__.py
@@ -387,7 +387,7 @@ class IndexBuilder:
         index_entries: Dict[str, List[Tuple[int, str]]] = {}
         for docname, entries in self._index_entries.items():
             for entry, entry_id, main_entry in entries:
-                index_entries.setdefault(entry.lower(), []).append((fn2index[docname], entry_id))
+                index_entries.setdefault(entry.lower(), []).append((fn2index[docname], entry_id, main_entry))

         return dict(docnames=docnames, filenames=filenames, titles=titles, terms=terms,
                     objects=objects, objtypes=objtypes, objnames=objnames,
diff --git a/sphinx/themes/basic/static/searchtools.js b/sphinx/themes/basic/static/searchtools.js
index e89e34d4e..f4920587a 100644
--- a/sphinx/themes/basic/static/searchtools.js
+++ b/sphinx/themes/basic/static/searchtools.js
@@ -300,7 +300,7 @@ const Search = {
     // search for explicit entries in index directives
     for (const [entry, foundEntries] of Object.entries(indexEntries)) {
       if (entry.includes(queryLower) && (queryLower.length >= entry.length/2)) {
-        for (const [file, id] of foundEntries) {
+        for (const [file, id, main] of foundEntries.filter((fe) => fe[2] === "main")) {
           let score = Math.round(100 * queryLower.length / entry.length)
           results.push([
             docNames[file],
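
For readers who prefer to see the effect on the data, the following Python sketch models what the demonstration patch does: it carries the main flag through into the serialized index entries and then keeps only the entries flagged `"main"`. The document names and anchor ids are hypothetical, and representing a non-main entry with an empty string flag is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

# Per document: (entry text, anchor id, main flag). The flag is assumed to
# be "main" for main entries and "" otherwise.
RawEntries = Dict[str, List[Tuple[str, str, str]]]


def build_index_entries(
    raw: RawEntries, fn2index: Dict[str, int]
) -> Dict[str, List[Tuple[int, str, str]]]:
    """Group entries by lowercased term, keeping the main flag as a third field."""
    index_entries: Dict[str, List[Tuple[int, str, str]]] = {}
    for docname, entries in raw.items():
        for entry, entry_id, main_entry in entries:
            index_entries.setdefault(entry.lower(), []).append(
                (fn2index[docname], entry_id, main_entry)
            )
    return index_entries


def main_only(found_entries: List[Tuple[int, str, str]]) -> List[Tuple[int, str, str]]:
    """Mirror the patched JavaScript filter: keep entries whose flag is "main"."""
    return [fe for fe in found_entries if fe[2] == "main"]


# Hypothetical documents and anchors, for illustration only.
raw = {
    "variable/FOO": [("FOO", "index-0", "main")],
    "manual/guide": [("FOO", "index-3", "")],
}
fn2index = {"variable/FOO": 0, "manual/guide": 1}

entries = build_index_entries(raw, fn2index)
print(entries["foo"])
print(main_only(entries["foo"]))
```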

How to Reproduce

This is difficult to reproduce in a small example because it requires a large number of documents to demonstrate non-trivial search results.

One can reproduce the problem by building CMake 3.27.1's documentation like this:

$ curl -OL https://github.com/Kitware/CMake/releases/download/v3.27.1/cmake-3.27.1.tar.gz
$ tar xvf cmake-3.27.1.tar.gz
$ cmake -S cmake-3.27.1/Utilities/Sphinx -B cmake-docs -DSPHINX_HTML=ON
$ cmake --build cmake-docs
$ x-www-browser 'cmake-docs/html/search.html?q=CMAKE_TOOLCHAIN_FILE'

This should show results similar to the description's screenshots.

Environment Information

Platform:              linux; (Linux-6.4.0-1-amd64-x86_64-with-glibc2.37)
Python version:        3.11.3 (main, Apr  5 2023, 00:00:00) [GCC 13.0.1 20230401 (Red Hat 13.0.1-0)])
Python implementation: CPython
Sphinx version:        5.3.0
Docutils version:      0.19
Jinja2 version:        3.0.3
@mwoehlke-kitware

This code fails to preserve whether an entry is 'main' or not. It looks like how index entries are stored needs to be reworked. @AA-Turner, maybe you can take a look?

@AA-Turner added this to the "some future version" milestone Aug 11, 2023
@picnixz
Member

picnixz commented Aug 12, 2023

+1 for this. One approach is as follows:

  • Instead of storing one list of tuple[str, str], store two such lists: the first holds the main entries and the second holds the non-main entries.
  • Loop once more to concatenate the two lists. We would end up with a single list of main entries followed by non-main entries.

I'm not sure whether this would break search for large projects. It would also slow down search index generation, though probably not by much (it just does the freeze step twice).
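
A minimal Python sketch of the two-list idea proposed in this comment, assuming each entry carries a main flag equal to `"main"` for main entries. The function name `main_first` and the sample data are hypothetical and are not taken from Sphinx.

```python
from typing import List, Tuple


def main_first(entries: List[Tuple[int, str, str]]) -> List[Tuple[int, str]]:
    """Return (file index, anchor id) pairs with main entries ahead of non-main ones."""
    main: List[Tuple[int, str]] = []
    non_main: List[Tuple[int, str]] = []
    for file_index, entry_id, main_entry in entries:
        # First pass: split the entries into two lists by their main flag.
        target = main if main_entry == "main" else non_main
        target.append((file_index, entry_id))
    # Second step: concatenate so that main entries always come first.
    return main + non_main


# Hypothetical entries for a single indexed term.
sample = [(3, "index-2", ""), (0, "index-0", "main"), (5, "index-1", "")]
print(main_first(sample))
```

The relative order within each group is preserved, so only the main versus non-main grouping changes.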

@jayaddison
Contributor

Does someone have an example of a non-main index entry that is indeed relevant to a corresponding search and should be included?

Search results that are only tangentially relevant can be a source of noise/distraction, so I'd like to check whether it's worthwhile to include non-main entries in the first place.

@bradking
Contributor Author

I'd be fine with leaving non-main index entries out. If someone is really looking for all the mentions of an indexed entity, they can look at the index instead of using search.

@picnixz
Member

picnixz commented Aug 25, 2023

What I really want to be sure about is that it does not cause any regression for the projects involved in the original issue (especially the Python doc).

bradking added a commit to bradking/sphinx that referenced this issue Sep 26, 2023
Since commit 8ae8183 (Support searching for index entries (sphinx-doc#10819),
2022-09-20, v5.2.0~16), index entries are returned as search results.
When the query string exactly matches an indexed term, all index
entries become search results with score 100.  This places them above
most other search results.

Reporting "main" index entries early in the search results makes sense,
but non-main index entries are often just arbitrary cross-references to
the indexed term.  Give non-main entries lower scores so they are not
ordered early in search results.  This avoids obscuring the main entries
and other more-important matches such as document titles.

Fixes: sphinx-doc#11578
@bradking
Contributor Author

#11695 gives non-main index entries lower scores. This should order them much later, similar to the proposal in #11578 (comment), but without risk of dropping entries for existing projects.
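
The lower-score idea can be sketched in Python as follows. The actual adjustment made in #11695 is not shown in this thread, so the constant `NON_MAIN_PENALTY`, the function names, and the sample scores (including the score of 15 for a title match) are all hypothetical choices for illustration.

```python
import math
from typing import List, Tuple

# Hypothetical penalty; the real change in #11695 may use a different rule.
NON_MAIN_PENALTY = 95


def entry_score(query: str, entry: str, is_main: bool) -> int:
    """Score an already-matched index entry, demoting it if it is not main."""
    base = math.floor(100 * len(query) / len(entry) + 0.5)
    return base if is_main else max(base - NON_MAIN_PENALTY, 0)


def rank(results: List[Tuple[str, int]]) -> List[str]:
    """Order result names by descending score, then by name."""
    return [name for name, _ in sorted(results, key=lambda r: (-r[1], r[0]))]


# Hypothetical results for the query "foo".
results = [
    ("manual/guide (non-main index entry)", entry_score("foo", "foo", False)),
    ("variable/FOO (main index entry)", entry_score("foo", "foo", True)),
    ("Some document title", 15),  # illustrative score for a title match
]
print(rank(results))
```

Every entry still appears in the results under this scheme. Only the ordering changes, which is the "without risk of dropping entries" property mentioned above.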

bradking added a commit to bradking/sphinx that referenced this issue Sep 28, 2023
Since commit 8ae8183 (Support searching for index entries (sphinx-doc#10819),
2022-09-20, v5.2.0~16), index entries are returned as search results.
When the query string exactly matches an indexed term, all index
entries become search results with score 100.  This places them above
most other search results.

Reporting "main" index entries early in the search results makes sense,
but non-main index entries are often just arbitrary cross-references to
the indexed term.  Collect them in a separate group that is always
placed after other results.  This avoids obscuring the main entries and
other more-important matches such as document titles.

Fixes: sphinx-doc#11578
@bradking
Contributor Author

#11696 splits non-main index entries into their own group that is placed after all other results.
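
A rough Python sketch of the grouping approach, assuming results are simple (name, score) pairs. The function name `order_results`, the tie-breaking by name, and the sample data are assumptions of this sketch and are not taken from the actual change in #11696.

```python
from typing import List, Tuple

Result = Tuple[str, int]  # (display name, score)


def order_results(normal: List[Result], non_main_index: List[Result]) -> List[Result]:
    """Sort each group by descending score, then place non-main entries last."""

    def by_score_then_name(result: Result) -> Tuple[int, str]:
        return (-result[1], result[0])

    # The groups are sorted independently, so a non-main index entry can
    # never outrank a normal result, whatever its score.
    return sorted(normal, key=by_score_then_name) + sorted(
        non_main_index, key=by_score_then_name
    )


# Hypothetical results, for illustration only.
normal = [("Some document title", 15), ("variable/FOO (main entry)", 100)]
non_main = [("manual/guide (non-main entry)", 100)]
print(order_results(normal, non_main))
```

Unlike a score penalty, this ordering does not depend on choosing a penalty value that happens to fall below every other kind of match.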

@jayaddison added the type:bug and html search labels and removed the type:enhancement label Mar 14, 2024
@github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2024