Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ordering inside searchindex.js not being deterministic #11665

Merged
merged 3 commits into from Sep 16, 2023

Conversation

pietroalbini
Copy link
Contributor

This PR changes searchindex.js's sorting of dictionary keys to be deterministic.

Feature or Bugfix

  • Bugfix

Purpose

At my day job we have a need to digitally sign the HTML output of Sphinx, and for those signatures to stay valid as we rebuild the documentation (if all dependencies and source code do not change).

Unfortunately, the contents of searchindex.js are not deterministic, as the keys are inserted in random order. This breaks our digital signatures as soon as a rebuild is done.

Detail

This PR changes the sorting of dumped dictionary keys to be deterministic, and adds a test verifying all dictionary keys in searchindex.js are sorted.

Relates

@pietroalbini
Copy link
Contributor Author

Is the Windows failure a spurious one? I don't think I changed anything that would influence translations.

@AA-Turner
Copy link
Member

Yes, there's an issue about it somewhere (@jayaddison had a go at fixing it to no avail)

A

tests/test_search.py Outdated Show resolved Hide resolved
tests/test_search.py Outdated Show resolved Hide resolved
tests/test_search.py Outdated Show resolved Hide resolved
tests/test_search.py Outdated Show resolved Hide resolved
@pietroalbini pietroalbini force-pushed the pa-search-index-reproducible branch 2 times, most recently from 695b5d9 to b8a1de6 Compare September 1, 2023 10:05
@pietroalbini
Copy link
Contributor Author

Thanks for the quick review! I should've addressed all feedback.

Comment on lines 319 to 321
# Lists in the search index cannot be sorted: for some lists, their
# position inside the list is referenced elsewhere in the index, so if
# we were to sort lists, the search index would break.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a problem we should solve?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how it's possible to sort all lists without changing the format of the search index itself, which feels outside the scope of this PR. Those orderings are deterministic, just not sortable.

There are two cases today where items are not sorted inside of lists:

  • In the .titles and .filenames lists, each item corresponds to the item at the same position in the (sorted) .docnames list. Sorting .titles and .filenames would break that relationship.

  • In the .alltitles.* lists, the lists are actually an (int, str) tuple, which is represented as a list in JSON. Sorting them wouldn't make much sense, and won't work anyway due to the different types.

Still, I updated the test to check that everything but those two exceptions are sorted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those orderings are deterministic

As long as that's true, this seems reasonable to me. For a moment I was worried that if .titles / .filenames were not deterministic, then we'd be solving 95% of cases while leaving a more difficult-to-resolve case deeper in the datastructure.

Could I check why/how confident we are that those fields are deterministically-generated, to get a sense for how robust or fragile that is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those orderings are deterministic

Ahh, this wasn't clear in the comment -- please would you be able to update it just to clarify?

A

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, this wasn't clear in the comment -- please would you be able to update it just to clarify?

Updated the comments to clarify why they are deterministic!

Could I check why/how confident we are that those fields are deterministically-generated, to get a sense for how robust or fragile that is?

The code seems pretty robust:

docnames, titles = zip(*sorted(self._titles.items()))
filenames = [self._filenames.get(docname) for docname in docnames]

Also, there is another test that verifies the search index content is the expected one, which (implicitly) tests the relationship between docnames and titles/filenames:

sphinx/tests/test_search.py

Lines 179 to 197 in fcc3899

assert index.freeze() == {
'docnames': ('docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'),
'envversion': '1.0',
'filenames': ['filename1_1', 'filename1_2', 'filename2_1', 'filename2_2'],
'objects': {'': [(0, 0, 1, '#anchor', 'objdispname1'),
(2, 1, 1, '#anchor', 'objdispname1')]},
'objnames': {0: ('dummy1', 'objtype1', 'objtype1'), 1: ('dummy2', 'objtype1', 'objtype1')},
'objtypes': {0: 'dummy1:objtype1', 1: 'dummy2:objtype1'},
'terms': {'ar': [0, 1, 2, 3],
'comment': [0, 1, 2, 3],
'fermion': [0, 1, 2, 3],
'index': [0, 1, 2, 3],
'non': [0, 1, 2, 3],
'test': [0, 1, 2, 3]},
'titles': ('title1_1', 'title1_2', 'title2_1', 'title2_2'),
'titleterms': {'section_titl': [0, 1, 2, 3]},
'alltitles': {'section_title': [(0, 'section-title'), (1, 'section-title'), (2, 'section-title'), (3, 'section-title')]},
'indexentries': {},
}

@jayaddison
Copy link
Contributor

@.pietroalbini Is the Windows failure a spurious one? I don't think I changed anything that would influence translations.

@.AA-Turner Yes, there's an issue about it somewhere (@jayaddison had a go at fixing it to no avail)

Ah, yep - that's #11232. I think we've gotten fairly close to the cause - my latest guess was these lines here. It's something to do with whether an internationalized byte-order-mark file is rewritten.

tests/test_search.py Outdated Show resolved Hide resolved

app.builder.build_all()
index = load_searchindex(app.outdir / 'searchindex.js')
print('"index contents:\n{json.dumps(index, indent=2)}') # Pretty print the index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think it's a good idea to pprint. If the test fails in the future, then it is our job to know why it failed and only then would we likely debug the index. But otherwise, I prefer not having print messages in tests that are only meant for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove it, but I think this print doesn't add noise to the code or the test passing output, while providing helpful context when debugging a failure (for example, if it fails in CI you can quickly glance what's wrong, rather than having to push a new commit adding prints or trying to reproduce locally). Still, it's your project, so if you'd rather not have the print I'll remove it 🙂

Copy link
Member

@picnixz picnixz Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well first of all it's not my project xD In general, we expect contributors to test locally by running tox or anything else (at least I think it's faster to do it locally first because failures are likely to occur both locally and on CI/CD if any).

Still, now that you mention it, I don't remember exactly whether the output is shown during the traceback or it its captured somewhere else (I know that pytest captures stdout and stdin when using capsys fixtures but I don't remember if, when testing locally my stuff, I could end up seeing those "print"). If nothing is shown locally, then I think it's fine but in general I tend to avoid leaving print messages whether it's in the src or the test code.

Nevertheless, I think that most of the tests do not print things (though I know that there are some that print messages but I think we didn't create new print messages for a while, meaning the former are some legacy from a long time ago).

Anyway, let's leave it like that unless there is a real local visual pollution when running the tests (we could always fix later though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, now that you mention it, I don't remember exactly whether the output is shown during the traceback or it its captured somewhere else (I know that pytest captures stdout and stdin when using capsys fixtures but I don't remember if, when testing locally my stuff, I could end up seeing those "print"). If nothing is shown locally, then I think it's fine but in general I tend to avoid leaving print messages whether it's in the src or the test code.

Nothing is printed when the output is successful, it's only shown during failure. I captured both the success and failure output in a gist 🙂

tests/test_search.py Outdated Show resolved Hide resolved
tests/test_search.py Outdated Show resolved Hide resolved
tests/test_search.py Outdated Show resolved Hide resolved
tests/test_search.py Outdated Show resolved Hide resolved
CHANGES Outdated Show resolved Hide resolved
@AA-Turner
Copy link
Member

@pietroalbini it looks like I can't push to this branch / apply review comments -- please could you check if the tick-box is ticked for allowing me to do this?

Thanks,
Adam

@pietroalbini
Copy link
Contributor Author

Applied all suggestions and rebased on top of master.

it looks like I can't push to this branch / apply review comments -- please could you check if the tick-box is ticked for allowing me to do this?

Unfortunately the organization I created the fork on doesn't support that 🙁 (it's an enterprise org with SSO).

@AA-Turner AA-Turner merged commit 8e768e6 into sphinx-doc:master Sep 16, 2023
26 of 27 checks passed
@AA-Turner
Copy link
Member

Thanks @pietroalbini!

A

@pietroalbini pietroalbini deleted the pa-search-index-reproducible branch September 17, 2023 14:05
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sphinx.search._JavaScriptIndex: non-determinite searchindex.js output
4 participants