Fix ordering inside searchindex.js not being deterministic #11665

pietroalbini · 2023-09-01T08:23:55Z

This PR changes searchindex.js's sorting of dictionary keys to be deterministic.

Feature or Bugfix

Bugfix

Purpose

At my day job we have a need to digitally sign the HTML output of Sphinx, and for those signatures to stay valid as we rebuild the documentation (if all dependencies and source code do not change).

Unfortunately, the contents of searchindex.js are not deterministic, as the keys are inserted in random order. This breaks our digital signatures as soon as a rebuild is done.

Detail

This PR changes the sorting of dumped dictionary keys to be deterministic, and adds a test verifying all dictionary keys in searchindex.js are sorted.
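To illustrate why key order matters for signing, here is a minimal sketch using only the standard library. It is not the code from this PR: the two dictionaries and the `digest` helper are made up for illustration, and SHA-256 merely stands in for whatever signature scheme is actually used.

```python
import hashlib
import json

# Two dictionaries with identical contents but different insertion order,
# standing in for index data that was built in a different order on each run.
first_build = {"terms": {"sphinx": 0, "index": 1}, "docnames": ["a", "b"]}
second_build = {"docnames": ["a", "b"], "terms": {"index": 1, "sphinx": 0}}


def digest(payload: str) -> str:
    """Return a SHA-256 hex digest, used here as a stand-in for a signature."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Without sort_keys, the output follows insertion order, so the digests differ.
unsorted_differs = digest(json.dumps(first_build)) != digest(json.dumps(second_build))

# With sort_keys=True, keys are emitted in sorted order at every nesting level.
sorted_matches = (
    digest(json.dumps(first_build, sort_keys=True))
    == digest(json.dumps(second_build, sort_keys=True))
)

print(unsorted_differs, sorted_matches)
```

Note that `sort_keys=True` only affects dictionaries; lists are serialized in whatever order they already have.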

Relates

Fixes sphinx.search._JavaScriptIndex: non-determinite searchindex.js output #11622

pietroalbini · 2023-09-01T08:44:37Z

Is the Windows failure a spurious one? I don't think I changed anything that would influence translations.

AA-Turner · 2023-09-01T08:46:01Z

Yes, there's an issue about it somewhere (@jayaddison had a go at fixing it to no avail)

A

pietroalbini · 2023-09-01T10:09:39Z

Thanks for the quick review! I should've addressed all feedback.

AA-Turner · 2023-09-01T10:22:26Z

tests/test_search.py

+        # Lists in the search index cannot be sorted: for some lists, their
+        # position inside the list is referenced elsewhere in the index, so if
+        # we were to sort lists, the search index would break.


This seems like a problem we should solve?

I don't see how it's possible to sort all lists without changing the format of the search index itself, which feels outside the scope of this PR. Those orderings are deterministic, just not sortable.

There are two cases today where items are not sorted inside of lists:

In the .titles and .filenames lists, each item corresponds to the item at the same position in the (sorted) .docnames list. Sorting .titles and .filenames would break that relationship.

In the .alltitles.* lists, each item is actually an (int, str) tuple, which is represented as a list in JSON. Sorting those inner lists wouldn't make much sense, and wouldn't work anyway because the two elements have different types.

Still, I updated the test to check that everything but those two exceptions is sorted.
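For readers following along, a check of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the test added in this PR: the helper name `assert_is_sorted`, the path syntax, and the sample index are all hypothetical, and the sample omits fields such as `objects` whose entries mix types.

```python
def assert_is_sorted(item, path=""):
    """Recursively check that dict keys and list items appear in sorted order.

    Hypothetical helper: the exempted paths mirror the two exceptions
    described in the comment above; they are not copied from the real test.
    """
    if isinstance(item, dict):
        keys = list(item)
        assert keys == sorted(keys), f"{path or '.'}: dict keys are not sorted"
        for key, value in item.items():
            assert_is_sorted(value, f"{path}.{key}")
    elif isinstance(item, list):
        # .titles and .filenames mirror .docnames position by position, and the
        # entries under .alltitles.* are (int, str) pairs, so none are sorted.
        exempt = path in (".titles", ".filenames") or path.startswith(".alltitles.")
        if not exempt:
            assert item == sorted(item), f"{path}: list is not sorted"
        for index, value in enumerate(item):
            assert_is_sorted(value, f"{path}[{index}]")


# Simplified, made-up index contents (as they would look after json.loads).
sample_index = {
    "alltitles": {"Section title": [[0, "section-title"], [1, "section-title"]]},
    "docnames": ["a", "b"],
    "filenames": ["z.rst", "b.rst"],
    "terms": {"index": [0, 1], "test": 0},
    "titles": ["Zeta", "Beta"],
}
assert_is_sorted(sample_index)  # passes: only the exempted lists are unsorted

# A dictionary whose keys are out of order is rejected.
failure = ""
try:
    assert_is_sorted({"b": 1, "a": 2})
except AssertionError as exc:
    failure = str(exc)
print(failure)
```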

Those orderings are deterministic

As long as that's true, this seems reasonable to me. For a moment I was worried that if .titles / .filenames were not deterministic, then we'd be solving 95% of cases while leaving a more difficult-to-resolve case deeper in the data structure.

Could I check why/how confident we are that those fields are deterministically-generated, to get a sense for how robust or fragile that is?

Those orderings are deterministic

Ahh, this wasn't clear in the comment -- please would you be able to update it just to clarify?

A

Ahh, this wasn't clear in the comment -- please would you be able to update it just to clarify?

Updated the comments to clarify why they are deterministic!

Could I check why/how confident we are that those fields are deterministically-generated, to get a sense for how robust or fragile that is?

The code seems pretty robust:

sphinx/sphinx/search/__init__.py

Lines 385 to 386 in fcc3899

    docnames, titles = zip(*sorted(self._titles.items()))
    filenames = [self._filenames.get(docname) for docname in docnames]

Also, there is another test that verifies the search index content is the expected one, which (implicitly) tests the relationship between docnames and titles/filenames:

sphinx/tests/test_search.py

Lines 179 to 197 in fcc3899

    assert index.freeze() == {
        'docnames': ('docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'),
        'envversion': '1.0',
        'filenames': ['filename1_1', 'filename1_2', 'filename2_1', 'filename2_2'],
        'objects': {'': [(0, 0, 1, '#anchor', 'objdispname1'),
                         (2, 1, 1, '#anchor', 'objdispname1')]},
        'objnames': {0: ('dummy1', 'objtype1', 'objtype1'), 1: ('dummy2', 'objtype1', 'objtype1')},
        'objtypes': {0: 'dummy1:objtype1', 1: 'dummy2:objtype1'},
        'terms': {'ar': [0, 1, 2, 3],
                  'comment': [0, 1, 2, 3],
                  'fermion': [0, 1, 2, 3],
                  'index': [0, 1, 2, 3],
                  'non': [0, 1, 2, 3],
                  'test': [0, 1, 2, 3]},
        'titles': ('title1_1', 'title1_2', 'title2_1', 'title2_2'),
        'titleterms': {'section_titl': [0, 1, 2, 3]},
        'alltitles': {'section_title': [(0, 'section-title'), (1, 'section-title'), (2, 'section-title'), (3, 'section-title')]},
        'indexentries': {},
    }
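The two lines quoted from sphinx/search/__init__.py can be tried in isolation to see why the parallel lists stay aligned. The sketch below uses made-up module-level dictionaries in place of the indexer's `self._titles` and `self._filenames`; the document names and titles are invented for illustration.

```python
# Hypothetical stand-ins for the indexer's internal mappings; insertion order
# is deliberately scrambled to mimic a build that visits documents unordered.
_titles = {"usage": "Usage", "api": "API reference", "index": "Welcome"}
_filenames = {"index": "index.rst", "usage": "usage.rst", "api": "api.rst"}

# Sorting the (docname, title) pairs orders them by docname; zip(*...) then
# splits the pairs back into two parallel tuples.
docnames, titles = zip(*sorted(_titles.items()))

# filenames is derived by walking docnames, so it inherits the same order.
filenames = [_filenames.get(docname) for docname in docnames]

print(docnames)
print(titles)
print(filenames)
```

Because both `titles` and `filenames` are derived from the sorted docnames, their order is fixed by the docnames even though the lists themselves are not alphabetical.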

jayaddison · 2023-09-01T11:52:25Z

@pietroalbini: Is the Windows failure a spurious one? I don't think I changed anything that would influence translations.

@AA-Turner: Yes, there's an issue about it somewhere (@jayaddison had a go at fixing it to no avail)

Ah, yep - that's #11232. I think we've gotten fairly close to the cause - my latest guess was these lines here. It's something to do with whether an internationalized byte-order-mark file is rewritten.

picnixz · 2023-09-04T07:37:58Z

tests/test_search.py

+
+    app.builder.build_all()
+    index = load_searchindex(app.outdir / 'searchindex.js')
+    print(f'index contents:\n{json.dumps(index, indent=2)}')  # Pretty print the index.


I still don't think it's a good idea to pprint. If the test fails in the future, then it is our job to know why it failed and only then would we likely debug the index. But otherwise, I prefer not having print messages in tests that are only meant for debugging.

I can remove it, but I think this print doesn't add noise to the code or the test passing output, while providing helpful context when debugging a failure (for example, if it fails in CI you can quickly glance what's wrong, rather than having to push a new commit adding prints or trying to reproduce locally). Still, it's your project, so if you'd rather not have the print I'll remove it 🙂

Well first of all it's not my project xD In general, we expect contributors to test locally by running tox or anything else (at least I think it's faster to do it locally first because failures are likely to occur both locally and on CI/CD if any).

Still, now that you mention it, I don't remember exactly whether the output is shown during the traceback or if it's captured somewhere else (I know that pytest captures stdout and stderr when using the capsys fixture, but I don't remember whether, when testing my stuff locally, I could end up seeing those "print"). If nothing is shown locally, then I think it's fine, but in general I tend to avoid leaving print messages whether it's in the src or the test code.

Nevertheless, I think that most of the tests do not print things (though I know that there are some that print messages but I think we didn't create new print messages for a while, meaning the former are some legacy from a long time ago).

Anyway, let's leave it like that unless there is a real local visual pollution when running the tests (we could always fix later though).

Still, now that you mention it, I don't remember exactly whether the output is shown during the traceback or if it's captured somewhere else (I know that pytest captures stdout and stderr when using the capsys fixture, but I don't remember whether, when testing my stuff locally, I could end up seeing those "print"). If nothing is shown locally, then I think it's fine, but in general I tend to avoid leaving print messages whether it's in the src or the test code.

Nothing is printed when the output is successful, it's only shown during failure. I captured both the success and failure output in a gist 🙂

AA-Turner · 2023-09-14T08:44:07Z

@pietroalbini it looks like I can't push to this branch / apply review comments -- please could you check if the tick-box is ticked for allowing me to do this?

Thanks,
Adam

pietroalbini · 2023-09-14T12:34:08Z

Applied all suggestions and rebased on top of master.

it looks like I can't push to this branch / apply review comments -- please could you check if the tick-box is ticked for allowing me to do this?

Unfortunately the organization I created the fork on doesn't support that 🙁 (it's an enterprise org with SSO).

AA-Turner · 2023-09-16T03:37:09Z

Thanks @pietroalbini!

A

pietroalbini force-pushed the pa-search-index-reproducible branch 2 times, most recently from d7636f7 to 302fe69 Compare September 1, 2023 08:27

pietroalbini mentioned this pull request Sep 1, 2023

sphinx.search._JavaScriptIndex: non-determinite searchindex.js output #11622

Closed

picnixz reviewed Sep 1, 2023

pietroalbini force-pushed the pa-search-index-reproducible branch 2 times, most recently from 695b5d9 to b8a1de6 Compare September 1, 2023 10:05

AA-Turner reviewed Sep 1, 2023

pietroalbini force-pushed the pa-search-index-reproducible branch from 80b71f7 to 9a4e039 Compare September 1, 2023 13:23

picnixz reviewed Sep 2, 2023

pietroalbini force-pushed the pa-search-index-reproducible branch from 9a4e039 to d8c0485 Compare September 4, 2023 07:33

picnixz reviewed Sep 4, 2023

pietroalbini force-pushed the pa-search-index-reproducible branch from d8c0485 to 6412bd5 Compare September 4, 2023 09:00

picnixz approved these changes Sep 4, 2023

AA-Turner reviewed Sep 14, 2023

pietroalbini and others added 3 commits September 14, 2023 14:32

Fix ordering inside searchindex.js not being deterministic

c79c50d

Fixes sphinx-doc#11622

ensure lists in the search index are also deterministic

26bfeea

Apply suggestions from code review

5506d1a

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>

pietroalbini force-pushed the pa-search-index-reproducible branch from eaee141 to 5506d1a Compare September 14, 2023 12:33

AA-Turner merged commit 8e768e6 into sphinx-doc:master Sep 16, 2023
26 of 27 checks passed

pietroalbini deleted the pa-search-index-reproducible branch September 17, 2023 14:05

github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2023
