New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ordering inside searchindex.js not being deterministic #11665
Fix ordering inside searchindex.js not being deterministic #11665
Conversation
d7636f7
to
302fe69
Compare
Is the Windows failure a spurious one? I don't think I changed anything that would influence translations. |
Yes, there's an issue about it somewhere (@jayaddison had a go at fixing it to no avail) A |
695b5d9
to
b8a1de6
Compare
Thanks for the quick review! I should've addressed all feedback. |
tests/test_search.py
Outdated
# Lists in the search index cannot be sorted: for some lists, their | ||
# position inside the list is referenced elsewhere in the index, so if | ||
# we were to sort lists, the search index would break. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a problem we should solve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how it's possible to sort all lists without changing the format of the search index itself, which feels outside the scope of this PR. Those orderings are deterministic, just not sortable.
There are two cases today where items are not sorted inside of lists:
-
In the
.titles
and.filenames
lists, each item corresponds to the item at the same position in the (sorted).docnames
list. Sorting.titles
and.filenames
would break that relationship. -
In the
.alltitles.*
lists, the lists are actually an(int, str)
tuple, which is represented as a list in JSON. Sorting them wouldn't make much sense, and won't work anyway due to the different types.
Still, I updated the test to check that everything but those two exceptions are sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those orderings are deterministic
As long as that's true, this seems reasonable to me. For a moment I was worried that if .titles
/ .filenames
were not deterministic, then we'd be solving 95% of cases while leaving a more difficult-to-resolve case deeper in the datastructure.
Could I check why/how confident we are that those fields are deterministically-generated, to get a sense for how robust or fragile that is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those orderings are deterministic
Ahh, this wasn't clear in the comment -- please would you be able to update it just to clarify?
A
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, this wasn't clear in the comment -- please would you be able to update it just to clarify?
Updated the comments to clarify why they are deterministic!
Could I check why/how confident we are that those fields are deterministically-generated, to get a sense for how robust or fragile that is?
The code seems pretty robust:
sphinx/sphinx/search/__init__.py
Lines 385 to 386 in fcc3899
docnames, titles = zip(*sorted(self._titles.items())) | |
filenames = [self._filenames.get(docname) for docname in docnames] |
Also, there is another test that verifies the search index content is the expected one, which (implicitly) tests the relationship between docnames and titles/filenames:
Lines 179 to 197 in fcc3899
assert index.freeze() == { | |
'docnames': ('docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'), | |
'envversion': '1.0', | |
'filenames': ['filename1_1', 'filename1_2', 'filename2_1', 'filename2_2'], | |
'objects': {'': [(0, 0, 1, '#anchor', 'objdispname1'), | |
(2, 1, 1, '#anchor', 'objdispname1')]}, | |
'objnames': {0: ('dummy1', 'objtype1', 'objtype1'), 1: ('dummy2', 'objtype1', 'objtype1')}, | |
'objtypes': {0: 'dummy1:objtype1', 1: 'dummy2:objtype1'}, | |
'terms': {'ar': [0, 1, 2, 3], | |
'comment': [0, 1, 2, 3], | |
'fermion': [0, 1, 2, 3], | |
'index': [0, 1, 2, 3], | |
'non': [0, 1, 2, 3], | |
'test': [0, 1, 2, 3]}, | |
'titles': ('title1_1', 'title1_2', 'title2_1', 'title2_2'), | |
'titleterms': {'section_titl': [0, 1, 2, 3]}, | |
'alltitles': {'section_title': [(0, 'section-title'), (1, 'section-title'), (2, 'section-title'), (3, 'section-title')]}, | |
'indexentries': {}, | |
} |
Ah, yep - that's #11232. I think we've gotten fairly close to the cause - my latest guess was these lines here. It's something to do with whether an internationalized byte-order-mark file is rewritten. |
80b71f7
to
9a4e039
Compare
9a4e039
to
d8c0485
Compare
tests/test_search.py
Outdated
|
||
app.builder.build_all() | ||
index = load_searchindex(app.outdir / 'searchindex.js') | ||
print('"index contents:\n{json.dumps(index, indent=2)}') # Pretty print the index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think it's a good idea to pprint. If the test fails in the future, then it is our job to know why it failed and only then would we likely debug the index. But otherwise, I prefer not having print messages in tests that are only meant for debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove it, but I think this print doesn't add noise to the code or the test passing output, while providing helpful context when debugging a failure (for example, if it fails in CI you can quickly glance what's wrong, rather than having to push a new commit adding prints or trying to reproduce locally). Still, it's your project, so if you'd rather not have the print I'll remove it 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well first of all it's not my project xD In general, we expect contributors to test locally by running tox or anything else (at least I think it's faster to do it locally first because failures are likely to occur both locally and on CI/CD if any).
Still, now that you mention it, I don't remember exactly whether the output is shown during the traceback or it its captured somewhere else (I know that pytest captures stdout and stdin when using capsys fixtures but I don't remember if, when testing locally my stuff, I could end up seeing those "print"). If nothing is shown locally, then I think it's fine but in general I tend to avoid leaving print messages whether it's in the src or the test code.
Nevertheless, I think that most of the tests do not print things (though I know that there are some that print messages but I think we didn't create new print messages for a while, meaning the former are some legacy from a long time ago).
Anyway, let's leave it like that unless there is a real local visual pollution when running the tests (we could always fix later though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, now that you mention it, I don't remember exactly whether the output is shown during the traceback or it its captured somewhere else (I know that pytest captures stdout and stdin when using capsys fixtures but I don't remember if, when testing locally my stuff, I could end up seeing those "print"). If nothing is shown locally, then I think it's fine but in general I tend to avoid leaving print messages whether it's in the src or the test code.
Nothing is printed when the output is successful, it's only shown during failure. I captured both the success and failure output in a gist 🙂
d8c0485
to
6412bd5
Compare
@pietroalbini it looks like I can't push to this branch / apply review comments -- please could you check if the tick-box is ticked for allowing me to do this? Thanks, |
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
eaee141
to
5506d1a
Compare
Applied all suggestions and rebased on top of master.
Unfortunately the organization I created the fork on doesn't support that 🙁 (it's an enterprise org with SSO). |
Thanks @pietroalbini! A |
This PR changes
searchindex.js
's sorting of dictionary keys to be deterministic.Feature or Bugfix
Purpose
At my day job we have a need to digitally sign the HTML output of Sphinx, and for those signatures to stay valid as we rebuild the documentation (if all dependencies and source code do not change).
Unfortunately, the contents of
searchindex.js
are not deterministic, as the keys are inserted in random order. This breaks our digital signatures as soon as a rebuild is done.Detail
This PR changes the sorting of dumped dictionary keys to be deterministic, and adds a test verifying all dictionary keys in
searchindex.js
are sorted.Relates