Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redo support for cross references to close numerous edge cases and fix buggy behaviour. #2067

Closed
wants to merge 2 commits into from

Conversation

cmacq2
Copy link
Contributor

@cmacq2 cmacq2 commented Nov 30, 2015

build: split custom-html.xsl into custom-html.xsl + helper XSLT stylesheets.
man: redo support for cross references to close numerous edge cases and fix buggy behaviour.

This change introduces helper templates for:

  • permalink-id-scheme-html.xsl : splits out support for ID conflict resolution and permalinks (anchors).
  • cross-refs-html.xsl : implements support for cross references to other systemd man pages.

The new cross reference logic replaces previous commit: 958caa5

It addresses Github issue #1956: #1956
'RFE: make systemd.directives html pages jump to the anchor in the target'

This version should fix the following issues inherent in the previous 'guessing' implementation:

  • Because of the fact the inline.seq DocBook template is used when generating values for id
    attributes (arguably broken), the value of a @target attribute cannot be assumed to correspond to the
    (sub)section ID even if it is otherwise entirely correct.
    E.g.: "--address=" from 'command line options' in systemd.directives (@target='--address=') lists a
    reference to systemd-bus-proxyd(8) but the corresponding section is actually called:
    "--address=ADDRESS[:ADDRESS...]" which is nonobvious as the square brackets ([]) are not part of the
    source XML but generated by the inline.seq DocBook template.
  • In fact, it is not generally safe to assume the ID attributes are, in fact, valid HTML id attributes.
    For instance some ID attributes contain linebreaks...
  • Beyond that a fair number of cross references refer to some kind of 'alias' which is not reflected
    in ID attributes at all. E.g.:
    "AssertDirectoryNotEmpty=" from 'Unit directives' in systemd.directives
    (@target='AssertDirectoryNotEmpty=') actually refers to the subsection with ID value
    'AssertArchitecture=' in systemd.unit(5).
  • Also, a number of @target attributes refer to a partial value for the ID attributes. E.g.:
    "$attr{file}" from 'UDEV Directives' in systemd.directives (@target='$attr{');

Furthermore by noting structural correspondence between the location of a citerefentry and the referenced
man page additional cross references are now also picked up on:

  • Cross references between variablelist structures:
    e.g. the kernel command line 'quiet' option is described as
    "Parameter understood by both the kernel and the system and service manager to control console log verbosity.
    For details, see systemd(1)"
    Because the systemd page contains a section on the 'quiet' option a direct link to the subsection
    can be synthesised. We get another step closer to the "ideal" world case described in the RFE.

  • Similarly, specific markup like filename, option or command elements following or preceding a citerefentry
    may also be turned into a fully function cross reference, e.g.:
    "Use machinectl(1)'s login command to request an additional login prompt in a running container." in systemd-nspawn
    and "(See KillMode= in systemd.kill(5).)" in systemd.socket.

  • Shorthand references which only supply a partial ID for target attribute or when the logic for the above two cases
    would otherwise fail to recover the full ID, e.g.:
    "For details about the format and contents of .nspawn files, consult systemd.nspawn(5)." in systemd-nspawn
    (using implicit: .nspawn to find '.nspawn File Discovery' in systemd.nspawn);
    "... with systemd-nspawn's --settings= switch, see systemd-nspawn(1) for details." in systemd.nspawn
    (using implicit --settings= to find '--settings=MODE' in systemd-nspawn)

  • Sometimes command line arguments can be related to subsections as well, for example:
    "systemd.journald.forward_to_syslog=, systemd.journald.forward_to_kmsg=, systemd.journald.forward_to_console=,
    systemd.journald.forward_to_wall=
    Enables/disables forwarding of collected log messages to syslog, the kernel log buffer, the system console or wall.

    See journald.conf(5) for details" in systemd-journald.service is related to the 'ForwardToSyslog=' section of
    journald.conf

Below follows a high level overview of the extra logic introduced to the code:

Support for cross references to anchor tags is divided into two general cases:

  1. The explicit case, in which the source docbook manpage XML contains a relevant @target attribute
  2. The implicit case, in which a number of potential substitutes for an explicit @target are attempted.
    Potential subsitutes are determined by noting various structural patterns in man page source XML.

In either case a set of potential 'targets' is obtained. If this set is not empty the potential targets
are then matched against the referenced XML by querying it for corresponding source nodes.

This query succeeds if at least one XML node from the referenced XML document is returned. The query may
fail if either the referenced XML document cannot be loaded, or no corresponding source node is found.

In the case the query succeeds, a match is selected (currently the first) and a corresponding value for
the fragment portion of the URL is generated by replaying the permalink ID scheme and returning the resulting
ID value. Otherwise an empty fragment is returned.

Additionally some citerefentry nodes referring to systemd man pages are excluded from this cross reference
scheme, to wit:

  • References to section 3 of the man pages
    (because these generally refer to documents by an alias which does not exist as XML file)
  • References in the 'See Also' sections (because these are not intended to refer to subsections)

Apart from that, the logic for references in general has been refactored to condense the code and fixed so
URLs generated omit the hash (#) if the fragement value is empty. Finally, the parameter cross.refs.debug is
introduced which can be used to generate some debug output following the hyperlink. This can be enabled by
passing a non zero number for XSLT_DEBUG_CROSS_REFS when calling make, e.g.:
make XSLT_DEBUG_CROSS_REFS=1

…sheets.

man: redo support for cross references to close numerous edge cases and fix buggy behaviour.

This change introduces helper templates for:
 - permalink-id-scheme-html.xsl : splits out support for ID conflict resolution and permalinks (anchors).
 - cross-refs-html.xsl : implements support for cross references to other systemd man pages.

The new cross reference logic replaces previous commit: 958caa5

It addresses Github issue systemd#1956: systemd#1956
 'RFE: make systemd.directives html pages jump to the anchor in the target'

This version should fix the following issues inherent in the previous 'guessing' implementation:

 - Because of the fact the inline.seq DocBook template is used when generating values for id
   attributes (arguably broken), the value of a @target attribute cannot be assumed to correspond to the
   (sub)section ID even if it is otherwise entirely correct.
   E.g.: "--address=" from 'command line options' in systemd.directives (@target='--address=') lists a
   reference to systemd-bus-proxyd(8) but the corresponding section is actually called:
   "--address=ADDRESS[:ADDRESS...]" which is nonobvious as the square brackets ([]) are not part of the
   source XML but generated by the inline.seq DocBook template.
 - In fact, it is not generally safe to assume the ID attributes are, in fact, valid HTML id attributes.
   For instance some ID attributes contain linebreaks...
 - Beyond that a fair number of cross references refer to some kind of 'alias' which is not reflected
   in ID attributes at all. E.g.:
   "AssertDirectoryNotEmpty=" from 'Unit directives' in systemd.directives
   (@target='AssertDirectoryNotEmpty=') actually refers to the subsection with ID value
   'AssertArchitecture=' in systemd.unit(5).
 - Also, a number of @target attributes refer to a partial value for the ID attributes. E.g.:
   "$attr{file}" from 'UDEV Directives' in systemd.directives (@target='$attr{');

Furthermore by noting structural correspondence between the location of a citerefentry and the referenced
man page additional cross references are now also picked up on:

 - Cross references between variablelist structures:
   e.g. the kernel command line 'quiet' option is described as
   "Parameter understood by both the kernel and the system and service manager to control console log verbosity.
   For details, see systemd(1)"
   Because the systemd page contains a section on the 'quiet' option a direct link to the subsection
   can be synthesised. We get another step closer to the "ideal" world case described in the RFE.
 - Similarly, specific markup like filename, option or command elements following or preceding a citerefentry
   may also be turned into a fully function cross reference, e.g.:
   "Use machinectl(1)'s login command to request an additional login prompt in a running container." in systemd-nspawn
   and "(See KillMode= in systemd.kill(5).)" in systemd.socket.
 - Shorthand references which only supply a partial ID for target attribute or when the logic for the above two cases
   would otherwise fail to recover the full ID, e.g.:
   "For details about the format and contents of .nspawn files, consult systemd.nspawn(5)." in systemd-nspawn
   (using implicit: .nspawn to find '.nspawn File Discovery' in systemd.nspawn);
   "... with systemd-nspawn's --settings= switch, see systemd-nspawn(1) for details." in systemd.nspawn
   (using implicit --settings= to find '--settings=MODE' in systemd-nspawn)
 - Sometimes command line arguments can be related to subsections as well, for example:
   "systemd.journald.forward_to_syslog=, systemd.journald.forward_to_kmsg=, systemd.journald.forward_to_console=,
    systemd.journald.forward_to_wall=
    Enables/disables forwarding of collected log messages to syslog, the kernel log buffer, the system console or wall.

    See journald.conf(5) for details" in systemd-journald.service is related to the 'ForwardToSyslog=' section of
    journald.conf

Below follows a high level overview of the extra logic introduced to the code:

Support for cross references to anchor tags is divided into two general cases:
 1. The explicit case, in which the source docbook manpage XML contains a relevant @target attribute
 2. The implicit case, in which a number of potential substitutes for an explicit @target are attempted.
 Potential subsitutes are determined by noting various structural patterns in man page source XML.

In either case a set of potential 'targets' is obtained. If this set is not empty the potential targets
are then matched against the referenced XML by querying it for corresponding source nodes.

This query succeeds if at least one XML node from the referenced XML document is returned. The query may
fail if either the referenced XML document cannot be loaded, or no corresponding source node is found.

In the case the query succeeds, a match is selected (currently the first) and a corresponding value for
the fragment portion of the URL is generated by replaying the permalink ID scheme and returning the resulting
ID value. Otherwise an empty fragment is returned.

Additionally some citerefentry nodes referring to systemd man pages are excluded from this cross reference
scheme, to wit:
 - References to section 3 of the man pages
   (because these generally refer to documents by an alias which does not exist as XML file)
 - References in the 'See Also' sections (because these are not intended to refer to subsections)

Apart from that, the logic for references in general has been refactored to condense the code and fixed so
URLs generated omit the hash (#) if the fragement value is empty. Finally, the parameter cross.refs.debug is
introduced which can be used to generate some debug output following the hyperlink. This can be enabled by
passing a non zero number for XSLT_DEBUG_CROSS_REFS when calling make, e.g.:
 make XSLT_DEBUG_CROSS_REFS=1
@cmacq2
Copy link
Contributor Author

cmacq2 commented Nov 30, 2015

The most salient point about this PR/patch is:

This PR fixes a lot of bugs, but does so by the referenced systemd man pages (XML) into memory (node sets).

Currently XML files are being loaded & parsed each time for each reference (unless xsltproc is 'clever' enough to optimise this, but I doubt it), which increases the build times somewhat. Because the same parsing work is done twice if you enable debug output, I ran a comparison using time to see how that impacts build times. Currently it amounts to a ~20 sec increase in build times for all the xsltproc invocations. -- on this SSD enabled machine.

In theory, this can be 'optimised' to only load each referenced XML document the once but it would not be trivial.

Because XSLT does not feature destructive update (reassignment) you cannot simply 'cache' things, however you can preload things. Preloading all documents being referenced is probably not a good idea because in the worst case (systemd.directives) you wind up preloading all the XML in the man/ tree which is ... a lot of memory.

This can be avoided by inverting the preloading logic: instead of doing the citerefentry work for systemd on as-you-go basis (using the normal XSLT template behaviour) we could try to preload the actual work as follows:

  1. First obtain a list of all documents being referenced, call it $manPageList
  2. For each document in $manPage, grab all citerefentry nodes which refer to it, compute the proper cross reference and store it in a $refList. Each reference generated would have to be tagged with a 'cookie' value for later retrieval, e.g. using a wrapper structure and generate-id().
  3. Then in the normal templating logic try and fetch the same cookie value from $refList using an XPATH expression with generate-id() as key.

@poettering
Copy link
Member

Semaphore didn't like it, probably due to memory requirements, dunno.

I don't grok any of this stuff, @keszybz could you look into this please?

document() out of the inner function call and moving it up by a
few levels so it's inside the determineCrossID template.

This ought not to make a difference for memory consumption,
because the documents never need to be kept open concurrently; except,
it turns out that it does because somehow calling document() triggers
a memory leak in libxslt or something.

This change may lop as much as 400MB off of the maximum rss size as
reported by /usr/bin/time -v on a make XSLT_DEBUG_CROSS_REFS=1 call.
@cmacq2
Copy link
Contributor Author

cmacq2 commented Dec 1, 2015

It could very well be:

It seems xsltproc/libxslt are not in any hurry to release claimed memory. My first attempt at implementing the 3-step 'optimisation' thing turned out to be a glaring memory leak which quickly consumed GBs of memory on systemd.directives -- all because I used 'xsl:for-each' instead of 'xsl:apply-templates' at one point. Besides, it ran even slower so...

I've pushed an experimental change that hoists the document() call up by a few levels so it's called directly in the context of the determineCrossID template (i.e. at the point where/when the link is inserted in the output) as opposed to inside helper functions that do the querying of the resulting node set for what the cross reference ID should be.

At this point I suppose this PR should be kept open for discussion but eventually a new 'cleaner' PR should be prepared for merging if/when the effort is deemed successful.

@zonque
Copy link
Member

zonque commented Jan 20, 2016

What are we doing with this one? This exceeds the available memory on Semaphore, which makes me nervous about other machines out there. Is there anything we can/want to do about this?

@zonque zonque added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Jan 20, 2016
@cmacq2
Copy link
Contributor Author

cmacq2 commented Jan 22, 2016

Ugh, I'd forgotten about this one.

Is there anything we can/want to do about this?

It's the big two (systemd.directives/systemd.index) that exhibit a pathological case of the problem. It appears that libxslt/xsltproc does not dispose of the documents (and therefore does not release the memory) until the transformation finishes. So it is effectively a memory leak that scales linearly with the number of cross references in a document.

The upshot is that this PR 'works' just not for the case in which it would be most useful, unless you have a lot of free RAM to play with. We have the somewhat odd situation in which cross references without target attributes in 'normal' man pages work better than cross references with dedicated target attributes that are supposed to provide this feature. Keeping old behaviour also implies keeping the old bugs resulting from the fact that the 'target' attributes generated by the Python scripts that put together systemd.directives/systemd.index are mostly bogus....


Fixing that is nontrivial.

Generated ID attributes may contain a lot of 'formatting' that is injected by the DocBook templates, and secondly the scheme to disambiguate the ID values means that values depend on the number of similar IDs that are located before it in the same document.

For the target attributes to work properly you need to replicate this logic.

Alternatively we'd need a different ID scheme that, in particular, doesn't go through the DocBook template stuff so we stand a change of generating valid target attributes. (Teaching Python to count is probably doable by introducing a lookup table, teaching it all quirks of DocBook whitespace handling seems an exercise in frustration.)

If we are going to redo ID attributes that permalinks refer to, we might as well take the time to make sure the ID values are actually valid HTML (currently, they are not -- this is a side effect of the DocBook stuff, incidentally).

Another alternative is to concede defeat and proclaim it not a priority right now. ;)

@keszybz
Copy link
Member

keszybz commented Jan 25, 2016

Here I get a bunch of warnings:

  XSLT     man/systemd-detect-virt.html
warning: failed to load external entity "man/systemd-journal-gatewayd.xml"
  XSLT     man/systemd-escape.html
  XSLT     man/systemd-fsck@.service.html
  XSLT     man/systemd-fstab-generator.html
  XSLT     man/systemd-getty-generator.html
warning: failed to load external entity "man/keyctl.xml"
warning: failed to load external entity "man/keyctl.xml"
  XSLT     man/systemd-gpt-auto-generator.html
...
warning: failed to load external entity "man/bootchart.conf.d.xml"
...

Some of those seem to be because we refer to aliases of man pages, not to man pages directly. It is entirely legitimate to link to the alias, so this would have to be fixed too. Other seem to be because we refer to external man pages without specifying project=..., i'll submit a fix for those.

The memory usage high, but not extremely so: xsltproc ... man/systemd.index.xml has maximum RSS of 800MB, and takes 8s. In master, it's 0.5s, 37MB. OK, so it's a bit more ;)

All in all, some more fixing would be required, and it requires more resources..., but what worries me the most about this patch is the complexity: a few hundred lines of xslt code to achieve something which should be rather simple... You suggest that we could switch to a different ID generation scheme. This seems like a more viable option. How would that work?

keszybz added a commit to keszybz/systemd that referenced this pull request Jan 26, 2016
keszybz added a commit to keszybz/systemd that referenced this pull request Jan 26, 2016
@poettering poettering added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Jan 27, 2016
keszybz added a commit to keszybz/systemd that referenced this pull request Jan 28, 2016
@cmacq2
Copy link
Contributor Author

cmacq2 commented Jan 28, 2016

All in all, some more fixing would be required, and it requires more resources..., but what worries me the most about this patch is the complexity: a few hundred lines of xslt code to achieve something which should be rather simple...

You suggest that we could switch to a different ID generation scheme. This seems like a more viable option. How would that work?

  • Start with a well-defined <node> -> <string> -> QNAME mapping. This would work similarly to how static/blog site generators deal with article slugs. So something complex like "Abenteuerspielplatz & Ponyhof" would be translated into, e.g. "abenteuerspielplatz-ponyhof" for IDs. The point is that the IDs become predictable and independent from whitespace/formatting issues. As a bonus they also become valid.
  • Teach Python code to replay the same mapping when generating target attributes. This could be a bit more complex because multiple 'directives' may be grouped underneath the same permalink/heading.
  • Teach XSLT not to load documents when cross referencing from within systemd.directives/systemd.index man pages and to trust target attributes instead. This is fairly straightforward.

It doesn't reduce the complexity of the patch by itself -- you'd still need that if you want to support the cross references elsewhere, it buys you a way to avoid running out of memory with pathological cases (systemd.directives, systemd.index).

Alternatively you could of course update each and every man page to use proper target attributes, make it part of coding style/guidelines that cross references should be specified with targets where possible, and you'd only need to use those target attributes. Then your original attempt could be used with a tweak to make sure that without target attribute no trailing hash is appended to the URL ... :)

@keszybz
Copy link
Member

keszybz commented Jan 29, 2016

Alternatively you could of course update each and every man page to use proper target attributes,

That's infeasible.

well-defined -> -> QNAME mapping.

Simply discarding non-letters and lowercasing would be enough.

I could write the python code, that's easily, but I don't know how to do the xslt stuff.

@cmacq2
Copy link
Contributor Author

cmacq2 commented Jan 29, 2016

That's infeasible.

That means you have to retain the bulk of the complexity in the patch if you want cross references to work throughout the whole body of man pages. Most of the complexity is about the heuristics for inferring a target attribute/value for a citerefentry from its context. (This is basically the same whether you have broken target attributes or no target attributes at all.)

Simply discarding non-letters and lowercasing would be enough.

One of the tricky things with XSLT 1.0 is that you don't have any good string handling functions...

Lowercasing, for example, is not a thing XSLT 1.0 does, you have to do a translate() call and translate is basically a version of the Unix tr command that only understands literal character strings. E.g. lowercasing is typically implemented like this:

<xsl:variable name="lc" select="translate($str, 
                                          'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                                          'abcdefghijklmnopqrstuvwxyz')"/>

So it would be more like a whitelist of supported/mapped characters and discard everything else.

Additionally the Python code and XSLT code have to agree which node sets are run through the mapping. This seems trivial (pick the node that will correspond to the heading and carry the ID value) but it does mean the Python code will need to be altered to "back up" the tree and find the "correct" node.

@keszybz
Copy link
Member

keszybz commented Jan 30, 2016

This translate doesn't look too bad. We don't have any non-ascii characters so it would be enough. But on second thought, downcasing wouldn't be necessary. It would make things nicer for external users, but internally we wouldn't care.

but it does mean the Python code will need to be altered to "back up" the tree and find the "correct" node.

We already do that in Python code and it works without any trouble. I wrote the code a while ago and it just works.

@centos-ci
Copy link

Can one of the admins verify this patch?

@poettering
Copy link
Member

Closing since this didn't get a response in a couple of month, and doesn't apply nor build. Feel free to reopen this if there's interest to work on this again.

@poettering poettering closed this May 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR documentation needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer
Development

Successfully merging this pull request may close these issues.

None yet

5 participants