[8.x] [intersphinx] support for arbitrary title names #11932

picnixz · 2024-02-03T15:30:09Z

Since the logic is different, this required a complete update on how intersphinx inventories are represented. Before, each line in an inventory was of the form

name [SPACE] domain [COLON] objtype [SPACE] priority [SPACE] URI [SPACE] display name

However, if name contains a whitespace (which is the case in the linked issue), then it becomes extremely hard to have a "good" regex. Now, we represent records as follows:

priority ([COLON] name_length) name [SPACE] domain [COLON] objtype [SPACE] URI [SPACE] display name

where the name's length (and the colon) is only included if the name contains a space. In particular, the inventory's size is not changed if the name has no space inside. Otherwise, we increase a bit the inventory size with a 64-bit integer for those entries.

picnixz · 2024-02-06T09:08:17Z

sphinx/util/inventory.py

+                    # them internally to be docutils compatible. As such,
+                    # we encode the length of the name after the priority.
+                    slen = f':{len(name)}' if ' ' in name else ''
+                    entry = '%s%s %s %s:%s %s %s\n' % (


One alternative that I had in mind is to use a special character for the end of the name, but this means that we assume that this character is not used in the name (e.g., \x00). The size of the inventory won't be much bigger but we need to check that the name does not contain such character and raise an exception if this is the case during the writing phase.

picnixz · 2024-02-20T09:08:13Z

Quick update after discussing with someone:

If a project builds its documentation with the new format, then any other project using that project needs to be built using the latest sphinx version otherwise they won't be able to read the new inventory format.
I think this bug already exists for inventories built using v1 vs v2 so it's not really a problem I'd say. However, we could possibly have a configuration value for the format of the inventory we are actually expecting.

…indices

picnixz · 2024-03-14T11:17:43Z

technically, this one is compatible with the current version but... I'm not really sure it should land in 7.3 (when would 7.3 even release?). So I'll convert it into a draft and we'll merge it in 8.x (after dropping the intersphinx format, though it's unrelated).

bskinn · 2024-03-25T16:13:10Z

I don't think this change to the inventory semantics is necessary.

The most recent regex I implemented for sphobjinv, which incorporated a couple of tweaks from bskinn/sphobjinv#181 and bskinn/sphobjinv#256, has been working in the wild for two years now.

It successfully imports all objects from the minimal example provided by OP at #11711. After a pip install sphobjinv in a virtualenv:

>>> import sphobjinv as soi
>>> inv = soi.Inventory(url='https://thosse.github.io/Playground_GH_Pages/objects.inv')
>>> len(inv.objects)
18
>>> from pathlib import Path
>>> len(Path("objects.txt").read_text().splitlines()) - 4  # Minus 4 b/c the header lines aren't objects
18

That decoded inventory, for reference:

# Sphinx inventory version 2
# Project: Pages Playground
# Version: 
# The remainder of this file is compressed using zlib.
genindex std:label -1 genindex.html Index
index std:doc -1 index.html Intersphinx Minimal Example
index:1 2 3 fails std:label -1 index.html#fails 1 2 3 Fails
index:1 fails 1 std:label -1 index.html#fails-1 1 Fails 1
index:1 works std:label -1 index.html#works 1 Works
index:123 works std:label -1 index.html#id1 123 Works
index:fails 1 2 3 std:label -1 index.html#fails-1-2-3 Fails 1 2 3
index:fails 1 2 fails std:label -1 index.html#fails-1-2-fails Fails 1 2 Fails
index:fails 1 fails 2 std:label -1 index.html#fails-1-fails-2 Fails 1 Fails 2
index:fails fails 1 std:label -1 index.html#fails-fails-1 Fails Fails 1
index:fails fails 2 fails fails std:label -1 index.html#fails-fails-2-fails-fails Fails Fails 2 Fails Fails
index:headers without digits work std:label -1 index.html#headers-without-digits-work Headers without Digits work
index:intersphinx minimal example std:label -1 index.html#intersphinx-minimal-example Intersphinx Minimal Example
index:works 1 std:label -1 index.html#works-1 Works 1
index:works 1 works std:label -1 index.html#works-1-works Works 1 Works
modindex std:label -1 py-modindex.html Module Index
py-modindex std:label -1 py-modindex.html Python Module Index
search std:label -1 search.html Search Page

It's of course possible that my regex is still missing some key features. But I've spent a lot of time working on this reverse-engineering, and I think it's pretty solid based on the lack of tickets in the sphobjinv tracker, which is lately seeing ~40K downloads per month.

There are a few more issues in sphobjinv related to the semantics of the object-reference lines in the v2 objects.inv. I'd be happy to collect those together if it would be helpful.

picnixz · 2024-03-26T08:57:46Z

@bskinn Thank you for your help but there are some titles such as word word1:word2 integer that would fail your regex since word1:word2 would be incorrectly matched as domain:reftype. While I don't have a specific example of such title, I cannot exclude that possibility (that's the reason why I gave up on using regexes). However, would you be willing to check that this regex could also extend your regex (and handle my weird case)?

pattern = re.compile(r'''^
	(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+?)
	\s+
	(?P<domain>[^\s:]+):(?P<reftype>\S+)
	\s+
	(?P<prio>-?\d+)
	\s+?
	(?P<uri>\S*)
	\s+
	(?P<dispname>.+?)
	\r?
$''', re.M | re.X)

I would have wanted to use negative lookbehinds but only .NET regex parser supports variable-width patterns.

There are a few more issues in sphobjinv related to the semantics of the object-reference lines in the v2 objects.inv. I'd be happy to collect those together if it would be helpful.

I'd be glad to have it. It could help a lot for #12204.

bskinn · 2024-03-26T13:29:23Z

but there are some titles such as word word1:word2 integer that would fail your regex since word1:word2 would be incorrectly matched as domain:reftype.

Indeed, that is a point of failure.

I can take a look at your regex sometime soon, sure, but---I had an idea. What if we changed the paradigm of the format a bit further?

The key hangup as far as I can tell is that the single-line object format doesn't allow for an unambiguous separator character sequence, given the possible domain (not domain) of names and dispnames.

What if an uncompressed v3 inventory were something like this:

# Sphinx inventory version 3
# Project: MyProject
# Version: 0.1.0
# Object count: 4
# Uncompressed inventory MD5: abcdef123789654
# The remainder of this file is compressed using zlib.
word1:word2 1
  std
  label
  1
  foo/bar.html
  -

word1 word2:word3 2
  std
  label
  1
  foo/bar.html
  -

word1 1 word2:word3 2
  std
  label
  1
  foo/bar.html
  -

foobar
  rst
  directive:option
  1
  baz/quux.html
  -

This format would provide <NEWLINE><SPACE><SPACE> as a delimiter between fields, and <NEWLINE><NEWLINE> as a delimiter between objects. (The ability to use <NEWLINE> as part of the delimiter is the key feature here, as none of the fields for a given object reference permit newlines, as best I know.

This format would take a lot more vertical space when decompressed, but ... so what? In nearly all uses, humans won't be reading it, and it'll be zlib-compressed anyways. Even in the few cases where someone is looking directly at it, it still seems quite readable to me.

And, even though there are extra characters in these delimiters, the zlib compression should work quite effectively on them.

(I also tossed an object count and and MD5 hash into the header to improve the options for validity checking.)

What do you think?

picnixz · 2024-03-26T14:40:18Z

What if we changed the paradigm of the format a bit further?

Ideally yes: it's better to change the format so that we don't have possible ambiguities. An alternative for a simpler visual (without new lines) is to encode the length of the title (and there will be no need for a regex, or a fancy separator that could in the future exist (though I doubt newlines will ever be a valid title)).

I should check (but I don't have time) the exact specs of DEFLATE in stream mode but I'm pretty sure that the size shouldn't blow up (could you check the bytes size of the compressed inventories with and w/o newlines as separators and with titles with whitespaces?)

bskinn · 2024-03-26T15:14:50Z

Definitely true, including the name length as you propose is probably about the most minimal change achievable, and wouldn't dramatically change the flow of line parsing.

Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%:

>>> import sphobjinv as soi
>>> inv = soi.Inventory(url="https://docs.python.org/3/objects.inv")
>>> import zlib
>>> ( len_v2 := len(zlib.compress(inv.data_file(), 9)) )  # Compress the whole thing for simplicity
136182
>>> def newline_fmt(do):
...   return f"{do.name}\n  {do.domain}\n  {do.role}\n  {do.priority}\n  {do.uri}\n  {do.dispname}"
...
>>> file_newlines = b'\n'.join(inv.data_file().splitlines()[:4]) + b'\n' + ('\n\n'.join(newline_fmt(o) for o in inv.objects)).encode()
>>> print(file_newlines.decode()[:245])
# Sphinx inventory version 2
# Project: Python
# Version: 3.12
# The remainder of this file is compressed using zlib.
CO_FUTURE_DIVISION
  c
  member
  1
  c-api/veryhigh.html#c.$
  -

METH_CLASS
  c
  macro
  1
  c-api/structures.html#c.$
  -

>>> ( len_nls := len(zlib.compress(file_newlines, 9)) )
138880
>>> len_nls - len_v2
2698
>>> len_nls/len_v2
1.019811722547767

Impact is likely greater for smaller inventories, due to the smaller number of repeats of the \n and \n\n motifs. But, for smaller inventories, the size is already not a problem.

bskinn · 2024-03-26T15:15:38Z

Re your regex, I think you have a typo in the first line:

(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+)?

Is that trailing ? supposed to be inside that final )? Where it's located now, I read it as making the entire title group optional.

picnixz · 2024-03-26T15:19:42Z

Is that trailing ? supposed to be inside that final )? Where it's located now, I read it as making the entire title group optional.

Ah, yes thank you, it's a typo (I did my regex this morning and I already forgot about it)

picnixz · 2024-03-26T15:21:56Z

Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%:

I see. If we include the title length, I think we'll probably have less (in the end, it's as if you had a longer title with at most 3 characters (because, who in their right mind would have a title of more than 1000 characters)).

bskinn · 2024-03-26T15:32:12Z

If we include the title length, I think we'll probably have less

Yes, agreed, yours is much more concise.

because, who in their right mind would have a title of more than 1000 characters

Yessss, goodness.

I like your proposal far better than continuing work to find a more robust regex. (I rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.)

picnixz · 2024-03-26T15:42:53Z

rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.

This is clearly something that I'd agree with. As such I will close this PR and make a new one. For a more precise parsing, I think we can even encode the length of each field instead and completely get rid of any regex (this will save us a lot of work!). Note that we can even drop the spaces in the format ! (although we could want to keep them for testing purposes). So, I have two suggestions:

encode the length of each field on a separate line, e.g., (name_len, domain_len, reftype_len, prio (this is an integer so it can directly be included!), uri_len). And the line following after would be the current line that we have.
encode the length of each field at the beginning of each line (same as above, but leave the priority integer where it is currently).

I'm not sure which one would be the most compact but I like the first one because tests will be easier to write (and debug). I'm not sure if doubling the number of lines would have a direct impact.

However, we need to synchronize whatever decision we make with #12204 to avoid creating an unusable format. Jakob's suggested that each domain is responsible for dumping and loading their own intersphinx fragment and I suggested using a similar construction as in ELF where each domain would have a dedicated section where they would write whatever they want and they would know what to do. I think, we can introduce this line-encoded format as a "default" encoding if a domain does not have a specific way to encode its information or if it supports the default logic.

bskinn · 2024-03-29T16:27:34Z

I'd be glad to have it. It could help a lot for #12204

Ok, here's what I turned up, in no particular order:

Possible colon in role:
- Adjust role import to account for possible colon in role name bskinn/sphobjinv#256
- Freshen attrs and sphinx objects.invs and correct for colons actually being allowed in {role} bskinn/sphobjinv#258
Empty dispname is NOT possible
- InventoryFile ~appears to be equipped to~ *cannot* handle an empty dispname bskinn/sphobjinv#190
- Add crosschecks for objects & counts between Sphinx and sphobjinv parsing bskinn/sphobjinv#189
  - This PR also closes the following three issues, 181/182/183
Priority must be an integer
- Reverse loosening of priority constraints in regex & tests bskinn/sphobjinv#182
URI can be absent
- Investigate and accommodate the possibility of a missing URI bskinn/sphobjinv#183
Names can have spaces (this is not news 😅)
- Names can have spaces after all bskinn/sphobjinv#181
Encodings
- Test for extended ASCII and Unicode in project name bskinn/sphobjinv#104
- Re-evaluate usage of encoding throughout the package bskinn/sphobjinv#103 (comment)
In the header, it's possible for each of project and version to be blank
- empty version (or project (?)) cause regex matching crash bskinn/sphobjinv#85
Why sphobjinv doesn't support v1 objects.inv files
- Implement import of v1 objects.inv files bskinn/sphobjinv#69
Discussion of automatic directive -> role transformation in suggest output
- Suggested cross-reference syntax for attributes should be :py:attr: bskinn/sphobjinv#234

I also have an entire test module in sphobjinv dedicated to testing values that do/don't survive passage through sphinx.util.inventory.InventoryFile. It's very likely that there are errors here, but they're thorough/correct enough that they handle much of what's out in the wild.

picnixz added 5 commits February 3, 2024 16:22

implement intersphinx v3

62dd7f9

fix lint

87a9b3e

Merge branch 'master' into fix/11711-intersphinx-decimals-indices

bd23568

fix lint

8e1802d

update comment

14286f8

picnixz requested a review from AA-Turner February 3, 2024 15:44

picnixz added internals:refactoring extensions:intersphinx labels Feb 6, 2024

picnixz commented Feb 6, 2024

View reviewed changes

Merge branch 'sphinx-doc:master' into fix/11711-intersphinx-decimals-…

7a5f099

…indices

picnixz marked this pull request as draft March 14, 2024 11:17

picnixz added this to the 8.x milestone Mar 14, 2024

picnixz changed the title ~~[intersphinx] support for arbitrary title names~~ [8.x] [intersphinx] support for arbitrary title names Mar 14, 2024

picnixz added 3 commits March 23, 2024 13:55

Merge branch 'master' into fix/11711-intersphinx-decimals-indices

9ad014e

fixup

30598a4

Merge branch 'master' into fix/11711-intersphinx-decimals-indices

23df758

picnixz mentioned this pull request Mar 24, 2024

Add a new references builder #12190

Open

bskinn mentioned this pull request Mar 26, 2024

Add pathological inventory examples to test suite bskinn/sphobjinv#285

Open

picnixz removed the request for review from AA-Turner March 26, 2024 15:03

picnixz closed this Mar 27, 2024

github-actions bot locked as resolved and limited conversation to collaborators Apr 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[8.x] [intersphinx] support for arbitrary title names #11932

[8.x] [intersphinx] support for arbitrary title names #11932

picnixz commented Feb 3, 2024 •

edited

Loading

picnixz Feb 6, 2024

picnixz commented Feb 20, 2024

picnixz commented Mar 14, 2024

bskinn commented Mar 25, 2024

picnixz commented Mar 26, 2024 •

edited

Loading

bskinn commented Mar 26, 2024 •

edited

Loading

picnixz commented Mar 26, 2024

bskinn commented Mar 26, 2024 •

edited

Loading

bskinn commented Mar 26, 2024 •

edited

Loading

picnixz commented Mar 26, 2024 •

edited

Loading

picnixz commented Mar 26, 2024

bskinn commented Mar 26, 2024

picnixz commented Mar 26, 2024

bskinn commented Mar 29, 2024

[8.x] [intersphinx] support for arbitrary title names #11932

[8.x] [intersphinx] support for arbitrary title names #11932

Conversation

picnixz commented Feb 3, 2024 • edited Loading

picnixz Feb 6, 2024

Choose a reason for hiding this comment

picnixz commented Feb 20, 2024

picnixz commented Mar 14, 2024

bskinn commented Mar 25, 2024

picnixz commented Mar 26, 2024 • edited Loading

bskinn commented Mar 26, 2024 • edited Loading

picnixz commented Mar 26, 2024

bskinn commented Mar 26, 2024 • edited Loading

bskinn commented Mar 26, 2024 • edited Loading

picnixz commented Mar 26, 2024 • edited Loading

picnixz commented Mar 26, 2024

bskinn commented Mar 26, 2024

picnixz commented Mar 26, 2024

bskinn commented Mar 29, 2024

picnixz commented Feb 3, 2024 •

edited

Loading

picnixz commented Mar 26, 2024 •

edited

Loading

bskinn commented Mar 26, 2024 •

edited

Loading

bskinn commented Mar 26, 2024 •

edited

Loading

bskinn commented Mar 26, 2024 •

edited

Loading

picnixz commented Mar 26, 2024 •

edited

Loading