Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.x] [intersphinx] support for arbitrary title names #11932

Conversation

picnixz
Copy link
Member

@picnixz picnixz commented Feb 3, 2024

Fix #11711.

Since the logic is different, this required a complete update on how intersphinx inventories are represented. Before, each line in an inventory was of the form

name [SPACE] domain [COLON] objtype [SPACE] priority [SPACE] URI [SPACE] display name

However, if name contains a whitespace (which is the case in the linked issue), then it becomes extremely hard to have a "good" regex. Now, we represent records as follows:

priority ([COLON] name_length) name [SPACE] domain [COLON] objtype [SPACE] URI [SPACE] display name

where the name's length (and the colon) is only included if the name contains a space. In particular, the inventory's size is not changed if the name has no space inside. Otherwise, we increase a bit the inventory size with a 64-bit integer for those entries.

# them internally to be docutils compatible. As such,
# we encode the length of the name after the priority.
slen = f':{len(name)}' if ' ' in name else ''
entry = '%s%s %s %s:%s %s %s\n' % (
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative that I had in mind is to use a special character for the end of the name, but this means that we assume that this character is not used in the name (e.g., \x00). The size of the inventory won't be much bigger but we need to check that the name does not contain such character and raise an exception if this is the case during the writing phase.

@picnixz
Copy link
Member Author

picnixz commented Feb 20, 2024

Quick update after discussing with someone:

  • If a project builds its documentation with the new format, then any other project using that project needs to be built using the latest sphinx version otherwise they won't be able to read the new inventory format.
  • I think this bug already exists for inventories built using v1 vs v2 so it's not really a problem I'd say. However, we could possibly have a configuration value for the format of the inventory we are actually expecting.

@picnixz
Copy link
Member Author

picnixz commented Mar 14, 2024

technically, this one is compatible with the current version but... I'm not really sure it should land in 7.3 (when would 7.3 even release?). So I'll convert it into a draft and we'll merge it in 8.x (after dropping the intersphinx format, though it's unrelated).

@picnixz picnixz marked this pull request as draft March 14, 2024 11:17
@picnixz picnixz added this to the 8.x milestone Mar 14, 2024
@picnixz picnixz changed the title [intersphinx] support for arbitrary title names [8.x] [intersphinx] support for arbitrary title names Mar 14, 2024
@bskinn
Copy link
Contributor

bskinn commented Mar 25, 2024

I don't think this change to the inventory semantics is necessary.

The most recent regex I implemented for sphobjinv, which incorporated a couple of tweaks from bskinn/sphobjinv#181 and bskinn/sphobjinv#256, has been working in the wild for two years now.

It successfully imports all objects from the minimal example provided by OP at #11711. After a pip install sphobjinv in a virtualenv:

>>> import sphobjinv as soi
>>> inv = soi.Inventory(url='https://thosse.github.io/Playground_GH_Pages/objects.inv')
>>> len(inv.objects)
18
>>> from pathlib import Path
>>> len(Path("objects.txt").read_text().splitlines()) - 4  # Minus 4 b/c the header lines aren't objects
18

That decoded inventory, for reference:

# Sphinx inventory version 2
# Project: Pages Playground
# Version: 
# The remainder of this file is compressed using zlib.
genindex std:label -1 genindex.html Index
index std:doc -1 index.html Intersphinx Minimal Example
index:1 2 3 fails std:label -1 index.html#fails 1 2 3 Fails
index:1 fails 1 std:label -1 index.html#fails-1 1 Fails 1
index:1 works std:label -1 index.html#works 1 Works
index:123 works std:label -1 index.html#id1 123 Works
index:fails 1 2 3 std:label -1 index.html#fails-1-2-3 Fails 1 2 3
index:fails 1 2 fails std:label -1 index.html#fails-1-2-fails Fails 1 2 Fails
index:fails 1 fails 2 std:label -1 index.html#fails-1-fails-2 Fails 1 Fails 2
index:fails fails 1 std:label -1 index.html#fails-fails-1 Fails Fails 1
index:fails fails 2 fails fails std:label -1 index.html#fails-fails-2-fails-fails Fails Fails 2 Fails Fails
index:headers without digits work std:label -1 index.html#headers-without-digits-work Headers without Digits work
index:intersphinx minimal example std:label -1 index.html#intersphinx-minimal-example Intersphinx Minimal Example
index:works 1 std:label -1 index.html#works-1 Works 1
index:works 1 works std:label -1 index.html#works-1-works Works 1 Works
modindex std:label -1 py-modindex.html Module Index
py-modindex std:label -1 py-modindex.html Python Module Index
search std:label -1 search.html Search Page

It's of course possible that my regex is still missing some key features. But I've spent a lot of time working on this reverse-engineering, and I think it's pretty solid based on the lack of tickets in the sphobjinv tracker, which is lately seeing ~40K downloads per month.

There are a few more issues in sphobjinv related to the semantics of the object-reference lines in the v2 objects.inv. I'd be happy to collect those together if it would be helpful.

@picnixz
Copy link
Member Author

picnixz commented Mar 26, 2024

@bskinn Thank you for your help but there are some titles such as word word1:word2 integer that would fail your regex since word1:word2 would be incorrectly matched as domain:reftype. While I don't have a specific example of such title, I cannot exclude that possibility (that's the reason why I gave up on using regexes). However, would you be willing to check that this regex could also extend your regex (and handle my weird case)?

pattern = re.compile(r'''^
	(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+?)
	\s+
	(?P<domain>[^\s:]+):(?P<reftype>\S+)
	\s+
	(?P<prio>-?\d+)
	\s+?
	(?P<uri>\S*)
	\s+
	(?P<dispname>.+?)
	\r?
$''', re.M | re.X)

I would have wanted to use negative lookbehinds but only .NET regex parser supports variable-width patterns.

There are a few more issues in sphobjinv related to the semantics of the object-reference lines in the v2 objects.inv. I'd be happy to collect those together if it would be helpful.

I'd be glad to have it. It could help a lot for #12204.

@bskinn
Copy link
Contributor

bskinn commented Mar 26, 2024

but there are some titles such as word word1:word2 integer that would fail your regex since word1:word2 would be incorrectly matched as domain:reftype.

Indeed, that is a point of failure.

I can take a look at your regex sometime soon, sure, but---I had an idea. What if we changed the paradigm of the format a bit further?

The key hangup as far as I can tell is that the single-line object format doesn't allow for an unambiguous separator character sequence, given the possible domain (not domain) of names and dispnames.

What if an uncompressed v3 inventory were something like this:

# Sphinx inventory version 3
# Project: MyProject
# Version: 0.1.0
# Object count: 4
# Uncompressed inventory MD5: abcdef123789654
# The remainder of this file is compressed using zlib.
word1:word2 1
  std
  label
  1
  foo/bar.html
  -

word1 word2:word3 2
  std
  label
  1
  foo/bar.html
  -

word1 1 word2:word3 2
  std
  label
  1
  foo/bar.html
  -

foobar
  rst
  directive:option
  1
  baz/quux.html
  -

This format would provide <NEWLINE><SPACE><SPACE> as a delimiter between fields, and <NEWLINE><NEWLINE> as a delimiter between objects. (The ability to use <NEWLINE> as part of the delimiter is the key feature here, as none of the fields for a given object reference permit newlines, as best I know.

This format would take a lot more vertical space when decompressed, but ... so what? In nearly all uses, humans won't be reading it, and it'll be zlib-compressed anyways. Even in the few cases where someone is looking directly at it, it still seems quite readable to me.

And, even though there are extra characters in these delimiters, the zlib compression should work quite effectively on them.

(I also tossed an object count and and MD5 hash into the header to improve the options for validity checking.)

What do you think?

@picnixz
Copy link
Member Author

picnixz commented Mar 26, 2024

What if we changed the paradigm of the format a bit further?

Ideally yes: it's better to change the format so that we don't have possible ambiguities. An alternative for a simpler visual (without new lines) is to encode the length of the title (and there will be no need for a regex, or a fancy separator that could in the future exist (though I doubt newlines will ever be a valid title)).

I should check (but I don't have time) the exact specs of DEFLATE in stream mode but I'm pretty sure that the size shouldn't blow up (could you check the bytes size of the compressed inventories with and w/o newlines as separators and with titles with whitespaces?)

@picnixz picnixz removed the request for review from AA-Turner March 26, 2024 15:03
@bskinn
Copy link
Contributor

bskinn commented Mar 26, 2024

Definitely true, including the name length as you propose is probably about the most minimal change achievable, and wouldn't dramatically change the flow of line parsing.


Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%:

>>> import sphobjinv as soi
>>> inv = soi.Inventory(url="https://docs.python.org/3/objects.inv")
>>> import zlib
>>> ( len_v2 := len(zlib.compress(inv.data_file(), 9)) )  # Compress the whole thing for simplicity
136182
>>> def newline_fmt(do):
...   return f"{do.name}\n  {do.domain}\n  {do.role}\n  {do.priority}\n  {do.uri}\n  {do.dispname}"
...
>>> file_newlines = b'\n'.join(inv.data_file().splitlines()[:4]) + b'\n' + ('\n\n'.join(newline_fmt(o) for o in inv.objects)).encode()
>>> print(file_newlines.decode()[:245])
# Sphinx inventory version 2
# Project: Python
# Version: 3.12
# The remainder of this file is compressed using zlib.
CO_FUTURE_DIVISION
  c
  member
  1
  c-api/veryhigh.html#c.$
  -

METH_CLASS
  c
  macro
  1
  c-api/structures.html#c.$
  -

>>> ( len_nls := len(zlib.compress(file_newlines, 9)) )
138880
>>> len_nls - len_v2
2698
>>> len_nls/len_v2
1.019811722547767

Impact is likely greater for smaller inventories, due to the smaller number of repeats of the \n and \n\n motifs. But, for smaller inventories, the size is already not a problem.

@bskinn
Copy link
Contributor

bskinn commented Mar 26, 2024

Re your regex, I think you have a typo in the first line:

(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+)?

Is that trailing ? supposed to be inside that final )? Where it's located now, I read it as making the entire title group optional.

@picnixz
Copy link
Member Author

picnixz commented Mar 26, 2024

Is that trailing ? supposed to be inside that final )? Where it's located now, I read it as making the entire title group optional.

Ah, yes thank you, it's a typo (I did my regex this morning and I already forgot about it)

@picnixz
Copy link
Member Author

picnixz commented Mar 26, 2024

Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%:

I see. If we include the title length, I think we'll probably have less (in the end, it's as if you had a longer title with at most 3 characters (because, who in their right mind would have a title of more than 1000 characters)).

@bskinn
Copy link
Contributor

bskinn commented Mar 26, 2024

If we include the title length, I think we'll probably have less

Yes, agreed, yours is much more concise.

because, who in their right mind would have a title of more than 1000 characters

Yessss, goodness.

I like your proposal far better than continuing work to find a more robust regex. (I rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.)

@picnixz
Copy link
Member Author

picnixz commented Mar 26, 2024

rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.

This is clearly something that I'd agree with. As such I will close this PR and make a new one. For a more precise parsing, I think we can even encode the length of each field instead and completely get rid of any regex (this will save us a lot of work!). Note that we can even drop the spaces in the format ! (although we could want to keep them for testing purposes). So, I have two suggestions:

  • encode the length of each field on a separate line, e.g., (name_len, domain_len, reftype_len, prio (this is an integer so it can directly be included!), uri_len). And the line following after would be the current line that we have.
  • encode the length of each field at the beginning of each line (same as above, but leave the priority integer where it is currently).

I'm not sure which one would be the most compact but I like the first one because tests will be easier to write (and debug). I'm not sure if doubling the number of lines would have a direct impact.

However, we need to synchronize whatever decision we make with #12204 to avoid creating an unusable format. Jakob's suggested that each domain is responsible for dumping and loading their own intersphinx fragment and I suggested using a similar construction as in ELF where each domain would have a dedicated section where they would write whatever they want and they would know what to do. I think, we can introduce this line-encoded format as a "default" encoding if a domain does not have a specific way to encode its information or if it supports the default logic.

@picnixz picnixz closed this Mar 27, 2024
@bskinn
Copy link
Contributor

bskinn commented Mar 29, 2024

I'd be glad to have it. It could help a lot for #12204

Ok, here's what I turned up, in no particular order:

I also have an entire test module in sphobjinv dedicated to testing values that do/don't survive passage through sphinx.util.inventory.InventoryFile. It's very likely that there are errors here, but they're thorough/correct enough that they handle much of what's out in the wild.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Intersphinx cannot link to some Headers containing digits
2 participants