Prevent header collisions on generation. #126
rtfb merged 1 commit into russross:master from halostatue:generate-unique-header-ids
|
Note 1: this does not protect against collisions of manually specified header IDs in either form. I can come up with an alternative patch that does this. Ideally, this would also warn on any collision rather than just overriding the collision blindly. |
|
I think a The logic, when generating unique anchor names, would need to check that the generated name does not already exist, and if it does, just skip it and try the next number. In the worst case, there'll be O(n) iterations where n is number of headers, but that doesn't seem like a big deal if you consider the average/expected case.
So my suggested approach would result in |
|
Note 2: As with |
block.go
Outdated
If you just check if that potential anchorText already exists, i.e.:
    alreadyUsedUp := p.headers[anchorText] > 0
And if so, repeat (increment count and try again).
Although, this approach will potentially create sub-optimal mappings where # Header 2 may actually have anchor name of header-2-1 because another # Header already used up header-2.
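The check-and-retry idea from the two review comments above could be sketched roughly as follows. This is only an illustration: the `anchorNamer` type, its `used` and `counts` maps, and the `unique` method are hypothetical names (not blackfriday's actual `p.headers` field), and the input is assumed to be already-sanitized anchor text.

```go
package main

import "fmt"

// anchorNamer is a hypothetical helper for illustration; it is not part of blackfriday.
type anchorNamer struct {
	used   map[string]bool // every anchor name handed out so far
	counts map[string]int  // last numeric suffix tried for each base name
}

func newAnchorNamer() *anchorNamer {
	return &anchorNamer{used: map[string]bool{}, counts: map[string]int{}}
}

// unique returns base if it is still free. Otherwise it increments the
// counter for base and retries until it finds an unused "base-N" name.
func (n *anchorNamer) unique(base string) string {
	name := base
	for n.used[name] {
		n.counts[base]++
		name = fmt.Sprintf("%s-%d", base, n.counts[base])
	}
	n.used[name] = true
	return name
}

func main() {
	n := newAnchorNamer()
	// Already-sanitized anchor text for: # Header, # Header, # Header, # Header 2
	for _, base := range []string{"header", "header", "header", "header-2"} {
		fmt.Println(n.unique(base))
	}
}
```

With this input the last header ends up as `header-2-1`, which is the sub-optimal mapping described above: the third `# Header` has already claimed `header-2`.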
|
Yeah. This has the potential of being a very difficult-to-deal-with problem, which is also the reason that I think that we need to somehow warn users of collisions when they happen. This is related to gohugoio/hugo#591, which has a pretty extensive discussion (and isn’t 100% germane to this issue, as Hugo deals with multiple documents being generated, and we need to consider cross-document collisions as those multiple documents may be rendered in a single HTML page). This is a real problem in multiple ways that I don’t know how to solve cleanly. |
Yeah, I'm starting to see that.
I will propose the following solution, which I think has the potential to solve the problem, but it has costs. There can be a notion of However, there could be the option of supplying it an already existing context, that it will reuse. Hugo can then maintain its own context at the per-page scope, and reuse it for all documents it renders within that page. It would guarantee all anchor names within a page are unique. If that approach is to be taken, I think it's best for this anchor name generation to be a separate lower-level package that blackfriday (and Hugo) imports. I'd like this I think at this point blackfriday having more imports is a lesser cost to pay compared to its internal code growing larger (making it more complex and diluted). |
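One way to picture the reusable-context proposal is the sketch below. All names here (`Context`, `NewContext`, `Unique`, `renderAnchors`) are hypothetical and chosen only for illustration, and `strings.ToLower` stands in for real anchor sanitization. The point is that a caller may pass in an existing context so that several documents rendered into one page share a single uniqueness scope, while a nil context falls back to a fresh per-document scope.

```go
package main

import (
	"fmt"
	"strings"
)

// Context is a hypothetical uniqueness scope for anchor names.
type Context struct {
	used map[string]bool
}

// NewContext returns an empty scope.
func NewContext() *Context { return &Context{used: map[string]bool{}} }

// Unique returns name if it is unused in this context, otherwise the
// first free "name-N" for N = 1, 2, 3, ...
func (c *Context) Unique(name string) string {
	candidate := name
	for i := 1; c.used[candidate]; i++ {
		candidate = fmt.Sprintf("%s-%d", name, i)
	}
	c.used[candidate] = true
	return candidate
}

// renderAnchors stands in for rendering one document. A nil ctx means
// "use a fresh context", i.e. uniqueness only within this document.
func renderAnchors(ctx *Context, headers []string) []string {
	if ctx == nil {
		ctx = NewContext()
	}
	out := make([]string, 0, len(headers))
	for _, h := range headers {
		// ToLower is a placeholder for real sanitization (assumption).
		out = append(out, ctx.Unique(strings.ToLower(h)))
	}
	return out
}

func main() {
	page := NewContext() // one context per generated page
	fmt.Println(renderAnchors(page, []string{"Intro", "Usage"}))
	fmt.Println(renderAnchors(page, []string{"Intro"})) // shares the page scope
	fmt.Println(renderAnchors(nil, []string{"Intro"}))  // independent scope
}
```

In this sketch the second document's `Intro` becomes `intro-1` because it shares the page context, while the document rendered with a nil context keeps plain `intro`.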
|
I can’t see Hugo using the The solution we’re talking about in gohugoio/hugo#591 is to use the single conversion, but to strip header IDs when used in a list context (possibly with a different method; that hasn’t been resolved). |
|
I see. I'm not familiar with Hugo so I can't comment on what it could/should do, but I wanted to share the above proposal just in case it's helpful. |
|
Prevent generated header collisions, less naively.
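The automatic header ID generation code submitted in #125 has a subtle bug where it will use the same ID for multiple headers with identical text. In the case below, all the headers are rendered as `<h1 id="header">Header</h1>`.
# Header
# Header
# Header
# Header
This change is a simple but robust approach that uses an incrementing counter and pre-checking to prevent header collision. (The above would be rendered as `header`, `header-1`, `header-2`, and `header-3`.) In more complex cases, it will append a new counter suffix (`-1`), like so:
# Header
# Header 1
# Header
# Header
This will generate `header`, `header-1`, `header-1-1`, and `header-1-2`. This code has two additional changes over the prior version:
1. Rather than reimplementing @shurcooL’s anchor sanitization code, I have imported it from `github.com/shurcooL/go/github_flavored_markdown/sanitized_anchor_name`.
2. The markdown block parser is now only interested in *generating* a sanitized anchor name, not in ensuring its uniqueness. That code has been moved to the HTML renderer. This means that if the HTML renderer is modified to identify all unique headers prior to rendering, the hackish nature of the collision detection can be eliminated.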
|
Pretty cool. I gave it a shot to create similar functionality that would ensure uniqueness of headers within a context, which you can see on this branch. I've kinda abandoned it because I've realized I have no (or very little) need for it in any of my code. Most of the headers I generate come from things that are guaranteed to be unique (Go import paths, symbol names, files in a folder, etc.). I didn't want to work on something I wouldn't use and thus could not test, that's why I left it as is.
👍 from me, but it's up to the maintainers if they're willing to accept an external dependency. I'm willing to move it elsewhere if needed. Also, I'm planning to make a small improvement to the current behavior. It simply skips all non-alphanumeric characters, but I want to change it so that each sequence of 1 or more non-alphanumeric characters will be replaced with a single dash. That way, for example, a header titled |
block.go
Outdated
There's no need to rename the imported package to sanitized_anchor_name, that is already its package name.
Ah. That’s my newness to Go showing. (I had also originally just renamed it “sanitizer”.)
|
I like the idea of squashing sequences of non-alphanumeric strings into a single dash. |
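The squashing behavior discussed here might look something like the sketch below. It is not the actual `sanitized_anchor_name` implementation: `squashAnchorName` is a hypothetical function, and lower-casing the result and dropping leading or trailing dashes are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// squashAnchorName is a hypothetical illustration: every run of one or
// more non-alphanumeric characters becomes a single dash. Lower-casing
// and omitting leading/trailing dashes are assumptions of this sketch.
func squashAnchorName(text string) string {
	var b strings.Builder
	pendingDash := false // true while inside a run of non-alphanumeric characters
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			// Emit one dash for the whole preceding run, but never at the start.
			if pendingDash && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingDash = false
			b.WriteRune(unicode.ToLower(r))
		} else {
			pendingDash = true
		}
	}
	return b.String()
}

func main() {
	fmt.Println(squashAnchorName("This & That"))
	fmt.Println(squashAnchorName("  Hello,   World! "))
}
```

A trailing run never produces a dash in this version because the dash is only written when another letter or digit follows.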
> This is a rework of an earlier version of this code.
The automatic header ID generation code submitted in #125 has a subtle bug where it will use the same ID for multiple headers with identical text. In the case below, all the headers are rendered as `<h1 id="header">Header</h1>`.
```markdown
# Header
# Header
# Header
# Header
```
This change is a simple but robust approach that uses an incrementing counter and pre-checking to prevent header collision. (The above would be rendered as `header`, `header-1`, `header-2`, and `header-3`.) In more complex cases, it will append a new counter suffix (`-1`), like so:
```markdown
# Header
# Header 1
# Header
# Header
```
This will generate `header`, `header-1`, `header-1-1`, and `header-1-2`. This code has two additional changes over the prior version:
1. Rather than reimplementing @shurcooL’s anchor sanitization code, I have imported it from `github.com/shurcooL/go/github_flavored_markdown/sanitized_anchor_name`.
2. The markdown block parser is now only interested in *generating* a sanitized anchor name, not in ensuring its uniqueness. That code has been moved to the HTML renderer. This means that if the HTML renderer is modified to identify all unique headers prior to rendering, the hackish nature of the collision detection can be eliminated.
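The counter-plus-pre-check scheme described in this commit message can be sketched as below. This is a minimal illustration that reproduces the IDs listed above, not necessarily the code that was merged: the `headerIDs` type and `ensureUnique` method are hypothetical names, and the input is assumed to be an already-sanitized anchor name.

```go
package main

import "fmt"

// headerIDs is a hypothetical tracker: it maps every ID issued so far to
// the highest counter suffix that has been appended to that ID.
type headerIDs map[string]int

// ensureUnique returns id unchanged if it is unused. On a collision it
// tries "id-(count+1)"; if that candidate is itself taken (the
// pre-check), it starts a new counter level by appending "-1" and retries.
func (h headerIDs) ensureUnique(id string) string {
	for {
		count, found := h[id]
		if !found {
			break
		}
		candidate := fmt.Sprintf("%s-%d", id, count+1)
		if _, taken := h[candidate]; taken {
			id += "-1" // new counter suffix level
			continue
		}
		h[id] = count + 1
		id = candidate
	}
	h[id] = 0
	return id
}

func main() {
	ids := headerIDs{}
	// Sanitized names for: # Header, # Header 1, # Header, # Header
	for _, name := range []string{"header", "header-1", "header", "header"} {
		fmt.Println(ids.ensureUnique(name))
	}
}
```

Keeping the state in one map means the renderer only needs a single lookup per attempt, at the cost of the somewhat surprising `header-1-1` style names in mixed cases.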
Prevent header collisions on generation.
|
Thanks! |
The automatic header ID generation code that I submitted (#125) has a subtle bug where it will use the same ID for two headers that are identical. In the case below, both headers will be rendered as `<h1 id="header">Header</h1>`.
This change is a relatively naive approach that uses an incrementing counter for cases like this to prevent header collision (resulting in `header`, `header-1`, and `header-2`). This will not prevent a collision like this:
This could be prevented by a somewhat smarter approach that appends a suffix (like `-1`) to each collision (resulting in `header`, `header-1`, `header-1-1`), but that feels a bit wrong and the implementation is heavyweight in two ways:
1. `parser.headers` would be changed from `map[string]int` to `map[string]bool`, and would grow larger for each header that collides, because each of the forms `header`, `header-1`, and `header-1-1` would be put into `parser.headers`.
2. `parser.createSanitizedAnchorName` would need to be modified to be something like what follows. (The code here is untested.)
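The author's snippet is not preserved on this page. Purely as an independent illustration of the suffix-appending idea, and not as the author's code, a `map[string]bool` version might look like the sketch below. The function name `createUniqueAnchorName` and its signature are hypothetical, and the input is assumed to be already sanitized.

```go
package main

import "fmt"

// createUniqueAnchorName is a hypothetical sketch of the suffix approach:
// while the name has been seen, append "-1"; then record the final form.
// Every form handed out is stored, so the map grows with each collision.
func createUniqueAnchorName(seen map[string]bool, name string) string {
	for seen[name] {
		name += "-1"
	}
	seen[name] = true
	return name
}

func main() {
	seen := map[string]bool{}
	for i := 0; i < 3; i++ {
		fmt.Println(createUniqueAnchorName(seen, "header"))
	}
}
```

Three identical headers come out as `header`, `header-1`, and `header-1-1`, which shows both the behavior and the reason it "feels a bit wrong": names grow by one suffix per repeat.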