-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redo support for cross references to close numerous edge cases and fix buggy behaviour. #2067
Conversation
…sheets. man: redo support for cross references to close numerous edge cases and fix buggy behaviour. This change introduces helper templates for: - permalink-id-scheme-html.xsl : splits out support for ID conflict resolution and permalinks (anchors). - cross-refs-html.xsl : implements support for cross references to other systemd man pages. The new cross reference logic replaces previous commit: 958caa5 It addresses Github issue systemd#1956: systemd#1956 'RFE: make systemd.directives html pages jump to the anchor in the target' This version should fix the following issues inherent in the previous 'guessing' implementation: - Because of the fact the inline.seq DocBook template is used when generating values for id attributes (arguably broken), the value of a @target attribute cannot be assumed to correspond to the (sub)section ID even if it is otherwise entirely correct. E.g.: "--address=" from 'command line options' in systemd.directives (@target='--address=') lists a reference to systemd-bus-proxyd(8) but the corresponding section is actually called: "--address=ADDRESS[:ADDRESS...]" which is nonobvious as the square brackets ([]) are not part of the source XML but generated by the inline.seq DocBook template. - In fact, it is not generally safe to assume the ID attributes are, in fact, valid HTML id attributes. For instance some ID attributes contain linebreaks... - Beyond that a fair number of cross references refer to some kind of 'alias' which is not reflected in ID attributes at all. E.g.: "AssertDirectoryNotEmpty=" from 'Unit directives' in systemd.directives (@target='AssertDirectoryNotEmpty=') actually refers to the subsection with ID value 'AssertArchitecture=' in systemd.unit(5). - Also, a number of @target attributes refer to a partial value for the ID attributes. E.g.: "$attr{file}" from 'UDEV Directives' in systemd.directives (@target='$attr{'); Furthermore by noting structural correspondence between the location of a citerefentry and the referenced man page additional cross references are now also picked up on: - Cross references between variablelist structures: e.g. the kernel command line 'quiet' option is described as "Parameter understood by both the kernel and the system and service manager to control console log verbosity. For details, see systemd(1)" Because the systemd page contains a section on the 'quiet' option a direct link to the subsection can be synthesised. We get another step closer to the "ideal" world case described in the RFE. - Similarly, specific markup like filename, option or command elements following or preceding a citerefentry may also be turned into a fully function cross reference, e.g.: "Use machinectl(1)'s login command to request an additional login prompt in a running container." in systemd-nspawn and "(See KillMode= in systemd.kill(5).)" in systemd.socket. - Shorthand references which only supply a partial ID for target attribute or when the logic for the above two cases would otherwise fail to recover the full ID, e.g.: "For details about the format and contents of .nspawn files, consult systemd.nspawn(5)." in systemd-nspawn (using implicit: .nspawn to find '.nspawn File Discovery' in systemd.nspawn); "... with systemd-nspawn's --settings= switch, see systemd-nspawn(1) for details." in systemd.nspawn (using implicit --settings= to find '--settings=MODE' in systemd-nspawn) - Sometimes command line arguments can be related to subsections as well, for example: "systemd.journald.forward_to_syslog=, systemd.journald.forward_to_kmsg=, systemd.journald.forward_to_console=, systemd.journald.forward_to_wall= Enables/disables forwarding of collected log messages to syslog, the kernel log buffer, the system console or wall. See journald.conf(5) for details" in systemd-journald.service is related to the 'ForwardToSyslog=' section of journald.conf Below follows a high level overview of the extra logic introduced to the code: Support for cross references to anchor tags is divided into two general cases: 1. The explicit case, in which the source docbook manpage XML contains a relevant @target attribute 2. The implicit case, in which a number of potential substitutes for an explicit @target are attempted. Potential subsitutes are determined by noting various structural patterns in man page source XML. In either case a set of potential 'targets' is obtained. If this set is not empty the potential targets are then matched against the referenced XML by querying it for corresponding source nodes. This query succeeds if at least one XML node from the referenced XML document is returned. The query may fail if either the referenced XML document cannot be loaded, or no corresponding source node is found. In the case the query succeeds, a match is selected (currently the first) and a corresponding value for the fragment portion of the URL is generated by replaying the permalink ID scheme and returning the resulting ID value. Otherwise an empty fragment is returned. Additionally some citerefentry nodes referring to systemd man pages are excluded from this cross reference scheme, to wit: - References to section 3 of the man pages (because these generally refer to documents by an alias which does not exist as XML file) - References in the 'See Also' sections (because these are not intended to refer to subsections) Apart from that, the logic for references in general has been refactored to condense the code and fixed so URLs generated omit the hash (#) if the fragement value is empty. Finally, the parameter cross.refs.debug is introduced which can be used to generate some debug output following the hyperlink. This can be enabled by passing a non zero number for XSLT_DEBUG_CROSS_REFS when calling make, e.g.: make XSLT_DEBUG_CROSS_REFS=1
The most salient point about this PR/patch is: This PR fixes a lot of bugs, but does so by the referenced systemd man pages (XML) into memory (node sets). Currently XML files are being loaded & parsed each time for each reference (unless xsltproc is 'clever' enough to optimise this, but I doubt it), which increases the build times somewhat. Because the same parsing work is done twice if you enable debug output, I ran a comparison using In theory, this can be 'optimised' to only load each referenced XML document the once but it would not be trivial. Because XSLT does not feature destructive update (reassignment) you cannot simply 'cache' things, however you can preload things. Preloading all documents being referenced is probably not a good idea because in the worst case (systemd.directives) you wind up preloading all the XML in the man/ tree which is ... a lot of memory. This can be avoided by inverting the preloading logic: instead of doing the citerefentry work for systemd on as-you-go basis (using the normal XSLT template behaviour) we could try to preload the actual work as follows:
|
Semaphore didn't like it, probably due to memory requirements, dunno. I don't grok any of this stuff, @keszybz could you look into this please? |
document() out of the inner function call and moving it up by a few levels so it's inside the determineCrossID template. This ought not to make a difference for memory consumption, because the documents never need to be kept open concurrently; except, it turns out that it does because somehow calling document() triggers a memory leak in libxslt or something. This change may lop as much as 400MB off of the maximum rss size as reported by /usr/bin/time -v on a make XSLT_DEBUG_CROSS_REFS=1 call.
It could very well be: It seems xsltproc/libxslt are not in any hurry to release claimed memory. My first attempt at implementing the 3-step 'optimisation' thing turned out to be a glaring memory leak which quickly consumed GBs of memory on systemd.directives -- all because I used 'xsl:for-each' instead of 'xsl:apply-templates' at one point. Besides, it ran even slower so... I've pushed an experimental change that hoists the document() call up by a few levels so it's called directly in the context of the determineCrossID template (i.e. at the point where/when the link is inserted in the output) as opposed to inside helper functions that do the querying of the resulting node set for what the cross reference ID should be. At this point I suppose this PR should be kept open for discussion but eventually a new 'cleaner' PR should be prepared for merging if/when the effort is deemed successful. |
What are we doing with this one? This exceeds the available memory on Semaphore, which makes me nervous about other machines out there. Is there anything we can/want to do about this? |
Ugh, I'd forgotten about this one.
It's the big two (systemd.directives/systemd.index) that exhibit a pathological case of the problem. It appears that libxslt/xsltproc does not dispose of the documents (and therefore does not release the memory) until the transformation finishes. So it is effectively a memory leak that scales linearly with the number of cross references in a document. The upshot is that this PR 'works' just not for the case in which it would be most useful, unless you have a lot of free RAM to play with. We have the somewhat odd situation in which cross references without target attributes in 'normal' man pages work better than cross references with dedicated target attributes that are supposed to provide this feature. Keeping old behaviour also implies keeping the old bugs resulting from the fact that the 'target' attributes generated by the Python scripts that put together systemd.directives/systemd.index are mostly bogus.... Fixing that is nontrivial. Generated ID attributes may contain a lot of 'formatting' that is injected by the DocBook templates, and secondly the scheme to disambiguate the ID values means that values depend on the number of similar IDs that are located before it in the same document. For the target attributes to work properly you need to replicate this logic. Alternatively we'd need a different ID scheme that, in particular, doesn't go through the DocBook template stuff so we stand a change of generating valid target attributes. (Teaching Python to count is probably doable by introducing a lookup table, teaching it all quirks of DocBook whitespace handling seems an exercise in frustration.) If we are going to redo ID attributes that permalinks refer to, we might as well take the time to make sure the ID values are actually valid HTML (currently, they are not -- this is a side effect of the DocBook stuff, incidentally). Another alternative is to concede defeat and proclaim it not a priority right now. ;) |
Here I get a bunch of warnings:
Some of those seem to be because we refer to aliases of man pages, not to man pages directly. It is entirely legitimate to link to the alias, so this would have to be fixed too. Other seem to be because we refer to external man pages without specifying The memory usage high, but not extremely so: All in all, some more fixing would be required, and it requires more resources..., but what worries me the most about this patch is the complexity: a few hundred lines of xslt code to achieve something which should be rather simple... You suggest that we could switch to a different ID generation scheme. This seems like a more viable option. How would that work? |
Noticed in pull request systemd#2067.
Noticed in pull request systemd#2067.
Noticed in pull request systemd#2067.
It doesn't reduce the complexity of the patch by itself -- you'd still need that if you want to support the cross references elsewhere, it buys you a way to avoid running out of memory with pathological cases (systemd.directives, systemd.index). Alternatively you could of course update each and every man page to use proper target attributes, make it part of coding style/guidelines that cross references should be specified with targets where possible, and you'd only need to use those target attributes. Then your original attempt could be used with a tweak to make sure that without target attribute no trailing hash is appended to the URL ... :) |
That's infeasible.
Simply discarding non-letters and lowercasing would be enough. I could write the python code, that's easily, but I don't know how to do the xslt stuff. |
That means you have to retain the bulk of the complexity in the patch if you want cross references to work throughout the whole body of man pages. Most of the complexity is about the heuristics for inferring a target attribute/value for a
One of the tricky things with XSLT 1.0 is that you don't have any good string handling functions... Lowercasing, for example, is not a thing XSLT 1.0 does, you have to do a
So it would be more like a whitelist of supported/mapped characters and discard everything else. Additionally the Python code and XSLT code have to agree which node sets are run through the mapping. This seems trivial (pick the node that will correspond to the heading and carry the ID value) but it does mean the Python code will need to be altered to "back up" the tree and find the "correct" node. |
This
We already do that in Python code and it works without any trouble. I wrote the code a while ago and it just works. |
Can one of the admins verify this patch? |
Closing since this didn't get a response in a couple of month, and doesn't apply nor build. Feel free to reopen this if there's interest to work on this again. |
build: split custom-html.xsl into custom-html.xsl + helper XSLT stylesheets.
man: redo support for cross references to close numerous edge cases and fix buggy behaviour.
This change introduces helper templates for:
The new cross reference logic replaces previous commit: 958caa5
It addresses Github issue #1956: #1956
'RFE: make systemd.directives html pages jump to the anchor in the target'
This version should fix the following issues inherent in the previous 'guessing' implementation:
attributes (arguably broken), the value of a @target attribute cannot be assumed to correspond to the
(sub)section ID even if it is otherwise entirely correct.
E.g.: "--address=" from 'command line options' in systemd.directives (@target='--address=') lists a
reference to systemd-bus-proxyd(8) but the corresponding section is actually called:
"--address=ADDRESS[:ADDRESS...]" which is nonobvious as the square brackets ([]) are not part of the
source XML but generated by the inline.seq DocBook template.
For instance some ID attributes contain linebreaks...
in ID attributes at all. E.g.:
"AssertDirectoryNotEmpty=" from 'Unit directives' in systemd.directives
(@target='AssertDirectoryNotEmpty=') actually refers to the subsection with ID value
'AssertArchitecture=' in systemd.unit(5).
"$attr{file}" from 'UDEV Directives' in systemd.directives (@target='$attr{');
Furthermore by noting structural correspondence between the location of a citerefentry and the referenced
man page additional cross references are now also picked up on:
Cross references between variablelist structures:
e.g. the kernel command line 'quiet' option is described as
"Parameter understood by both the kernel and the system and service manager to control console log verbosity.
For details, see systemd(1)"
Because the systemd page contains a section on the 'quiet' option a direct link to the subsection
can be synthesised. We get another step closer to the "ideal" world case described in the RFE.
Similarly, specific markup like filename, option or command elements following or preceding a citerefentry
may also be turned into a fully function cross reference, e.g.:
"Use machinectl(1)'s login command to request an additional login prompt in a running container." in systemd-nspawn
and "(See KillMode= in systemd.kill(5).)" in systemd.socket.
Shorthand references which only supply a partial ID for target attribute or when the logic for the above two cases
would otherwise fail to recover the full ID, e.g.:
"For details about the format and contents of .nspawn files, consult systemd.nspawn(5)." in systemd-nspawn
(using implicit: .nspawn to find '.nspawn File Discovery' in systemd.nspawn);
"... with systemd-nspawn's --settings= switch, see systemd-nspawn(1) for details." in systemd.nspawn
(using implicit --settings= to find '--settings=MODE' in systemd-nspawn)
Sometimes command line arguments can be related to subsections as well, for example:
"systemd.journald.forward_to_syslog=, systemd.journald.forward_to_kmsg=, systemd.journald.forward_to_console=,
systemd.journald.forward_to_wall=
Enables/disables forwarding of collected log messages to syslog, the kernel log buffer, the system console or wall.
See journald.conf(5) for details" in systemd-journald.service is related to the 'ForwardToSyslog=' section of
journald.conf
Below follows a high level overview of the extra logic introduced to the code:
Support for cross references to anchor tags is divided into two general cases:
Potential subsitutes are determined by noting various structural patterns in man page source XML.
In either case a set of potential 'targets' is obtained. If this set is not empty the potential targets
are then matched against the referenced XML by querying it for corresponding source nodes.
This query succeeds if at least one XML node from the referenced XML document is returned. The query may
fail if either the referenced XML document cannot be loaded, or no corresponding source node is found.
In the case the query succeeds, a match is selected (currently the first) and a corresponding value for
the fragment portion of the URL is generated by replaying the permalink ID scheme and returning the resulting
ID value. Otherwise an empty fragment is returned.
Additionally some citerefentry nodes referring to systemd man pages are excluded from this cross reference
scheme, to wit:
(because these generally refer to documents by an alias which does not exist as XML file)
Apart from that, the logic for references in general has been refactored to condense the code and fixed so
URLs generated omit the hash (#) if the fragement value is empty. Finally, the parameter cross.refs.debug is
introduced which can be used to generate some debug output following the hyperlink. This can be enabled by
passing a non zero number for XSLT_DEBUG_CROSS_REFS when calling make, e.g.:
make XSLT_DEBUG_CROSS_REFS=1