New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent header collisions on generation. #126

Merged
merged 1 commit into from Nov 24, 2014

Conversation

Projects
None yet
3 participants
@halostatue
Copy link
Contributor

halostatue commented Nov 1, 2014

This is the original description of the pull request. For the updated version which is less naive and more robust, please see this comment.

The automatic header ID generation code that I submitted (#125) has a subtle bug where it will use the same ID for two headers that are identical. In the case below, both headers will be rendered as <h1 id="header">Header</h1>.

    # Header
    # Header
    # Header

This change is a relatively naive approach that uses an incrementing counter for cases like this to prevent header collision (resulting in header, header-1, and header-2). This will not prevent a collision like this:

    # Header
    # Header 1
    # Header

This could be prevented by a somewhat smarter approach that appends a suffix (like -1) to each collision (resulting in header, header-1, header-1-1), but that feels a bit wrong and the implementation is heavyweight in two ways:

  1. The parser.headers would be changed from map[string]int to map[string]bool, and would grow larger for each header that collides, because each of the forms header, header-1, and header-1-1 would be put into parser.headers.
  2. parser.createSanitizedAnchorName would need to be modified to be something like what follows. (The code here is untested.)
func (p *parser) createSanitizedAnchorName(text string) string {
  var anchorName []rune
  for _, r := range []rune(text) {
    switch {
    case r == ' ':
      anchorName = append(anchorName, '-')
    case unicode.IsLetter(r) || unicode.IsNumber(r):
      anchorName = append(anchorName, unicode.ToLower(r))
    }
  }

  return ensureUniqueAnchor(string(anchorName))
}

func (p *parser) ensureUniqueAnchor(anchor string) string {
  for _, found := p.headers[anchor]; found; found = p.headers[anchor] {
    anchor = anchor + "-1";
  }

  p.headers[anchor] = true

  return anchor
}
@halostatue

This comment has been minimized.

Copy link
Contributor

halostatue commented Nov 2, 2014

Note 1: this does not protect against collisions of manually specified header IDs in either form. I can come up with an alternative patch that does this. Ideally, this would also warn on any collision rather than just overriding the collision blindly.

@dmitshur

This comment has been minimized.

Copy link
Collaborator

dmitshur commented Nov 2, 2014

I think a map[string]int and some logic is enough to prevent all collisions.

The logic, when generating unique anchor names, would need to check that the generated name does not already exist, and if it does, just skip it and try the next number. In the worst case, there'll be O(n) iterations where n is number of headers, but that doesn't seem like a big deal if you consider the average/expected case.

```markdown
# Header

# Header 1

# Header
```

This could be prevented by a somewhat smarter approach that appends a
suffix (like -1) to each collision (resulting in header, header-1,
header-1-1), but that feels a bit wrong

So my suggested approach would result in header, header-1 and header-2 (it would try header-1 first, but since header-1 is already used up, it would increment count to 2 and try header-2 next, successfully).

@halostatue

This comment has been minimized.

Copy link
Contributor

halostatue commented Nov 2, 2014

Note 2: As with FootnoteAnchorPrefix, it may be desirable to have HeaderAnchorPrefix and/or HeaderAnchorSuffix…but it isn’t clear to me whether such a prefix/suffix should be applied to only automatically generated header IDs or whether it should be applied to any header ID. The use of this on an explicit header ID partially breaks the contract the author of the markdown document expects when writing # Header {#my-header}, resulting in #<prefix>my-header<suffix>.

block.go Outdated
if count, found := p.headers[anchorText]; found {
count++
p.headers[anchorText] = count
anchorText = fmt.Sprintf("%s-%d", anchorText, count)

This comment has been minimized.

@dmitshur

dmitshur Nov 2, 2014

Collaborator

If you just check if that potential anchorText already exists, i.e.:

alreadyUsedUp := p.headers[anchorText] > 0

And if so, repeat (increment count and try again).

This comment has been minimized.

@dmitshur

dmitshur Nov 2, 2014

Collaborator

Although, this approach will potentially create sub-optimal mappings where # Header 2 may actually have anchor name of header-2-1 because another # Header already used up header-2.

@halostatue

This comment has been minimized.

Copy link
Contributor

halostatue commented Nov 2, 2014

Yeah. This has the potential of being a very difficult-to-deal-with problem, which is also the reason that I think that we need to somehow warn users of collisions when they happen. This is related to gohugoio/hugo#591, which has a pretty extensive discussion (and isn’t 100% germane to this issue, as Hugo deals with multiple documents being generated, and we need to consider cross-document collisions as those multiple documents may be rendered in a single HTML page).

This is a real problem in multiple ways that I don’t know how to solve cleanly.

@dmitshur

This comment has been minimized.

Copy link
Collaborator

dmitshur commented Nov 2, 2014

This has the potential of being a very difficult-to-deal-with problem

Yeah, I'm starting to see that.

(and isn’t 100% germane to this issue, as Hugo deals with multiple documents being generated, and we need to consider cross-document collisions as those multiple documents may be rendered in a single HTML page).

This is a real problem in multiple ways that I don’t know how to solve cleanly.

I will propose the following solution, which I think has the potential to solve the problem, but it has costs.

There can be a notion of Context, which contains a map of used/generated anchor names. By default, blackfriday should create its own single context for each Markdown document it renders.

However, there could be the option of supplying it an already existing context, that it will reuse.

Hugo can then maintain its own context at the per-page scope, and reuse it for all documents it renders within that page. It would guarantee all anchor names within a page are unique.

If that approach is to be taken, I think it's best for this anchor name generation to be a separate lower-level package that blackfriday (and Hugo) imports. I'd like this sanitized_anchor_name package to be it, and I'd be happy to move it if that's deemed necessary.

I think at this point blackfriday having more imports is a lesser cost to pay compared to its internal code growing larger (making it more complex and diluted). go get makes getting dependencies very inexpensive, so why not take advantage of that and outsource well-defined problems to dedicated smaller packages?

@halostatue

This comment has been minimized.

Copy link
Contributor

halostatue commented Nov 2, 2014

I can’t see Hugo using the Context solution here, because when the markdown Page is converted, it is converted in isolation deliberately. The Page may be inserted in a template in a single page context or in a list context—and the pages aren’t converted each time. Hugo could convert each page twice; once for single context and once for list context, using the Context solution you’ve suggested—but that would be bad for memory use and/or performance, especially if Pygments-based syntax highlighting is used.

The solution we’re talking about in gohugoio/hugo#591 is to use the single conversion, but to strip header IDs when used in a list context (possibly with a different method; that hasn’t been resolved).

@dmitshur

This comment has been minimized.

Copy link
Collaborator

dmitshur commented Nov 2, 2014

I see. I'm not familiar with Hugo so I can't comment on what it could/should do, but I wanted to share the above proposal just in case it's helpful.

@halostatue

This comment has been minimized.

Copy link
Contributor

halostatue commented Nov 23, 2014

Prevent generated header collisions, less naively.

This is a rework of an earlier version of this code.

The automatic header ID generation code submitted in #125 has a subtle bug where it will use the same ID for multiple headers with identical text. In the case below, all the headers are rendered a <h1 id="header">Header</h1>.

# Header
# Header
# Header
# Header

This change is a simple but robust approach that uses an incrementing counter and pre-checking to prevent header collision. (The above would be rendered as header, header-1, header-2, and header-3.) In more complex cases, it will append a new counter suffix (-1), like so:

  # Header
  # Header 1
  # Header
  # Header

This will generate header, header-1, header-1-1, and header-1-2.

This code has two additional changes over the prior version:

  1. Rather than reimplementing @shurcooL’s anchor sanitization code, I have imported it as sanitized_anchor_name from github.com/shurcooL/go/github_flavored_markdown/sanitized_anchor_name.
  2. The markdown block parser is now only interested in generating a sanitized anchor name, not with ensuring its uniqueness. That code has been moved to the HTML renderer. This means that if the HTML renderer is modified to identify all unique headers prior to rendering, the hackish nature of the collision detection can be eliminated.
@dmitshur

This comment has been minimized.

Copy link
Collaborator

dmitshur commented Nov 23, 2014

This change is a simple but robust approach that uses an incrementing counter and pre-checking to prevent header collision.

Pretty cool.

I gave it a shot to create similar functionality that would ensure uniqueness of headers within a context, which you can see on this branch. I've kinda abandoned it because I've realized I have no (or very little) need for it in any of my code. Most of the headers I generate come from things that are guaranteed to be unique (Go import paths, symbol names, files in a folder, etc.). I didn't want to work on something I wouldn't use and thus could not test, that's why I left it as is.

Rather than reimplementing @shurcooL’s anchor sanitization code, I have imported it as sanitized_anchor_name from github.com/shurcooL/go/github_flavored_markdown/sanitized_anchor_name.

👍 from me, but it's up to the maintainers if they're willing to accept an external dependency. I'm willing to move it elsewhere if needed.

Also, I'm planning to make a small improvement to the current behavior. It simply skips all non-alphanumeric characters, but I want to change it so that each sequence of 1 or more non-alphanumeric characters will be replaced with a single dash.

That way, for example, a header titled main.go will become #main-go instead of #maingo. Let me know if you're okay with that change (I'll do it as a PR anyway).

block.go Outdated
@@ -15,7 +15,8 @@ package blackfriday

import (
"bytes"
"unicode"

sanitized_anchor_name "github.com/shurcooL/go/github_flavored_markdown/sanitized_anchor_name"

This comment has been minimized.

@dmitshur

dmitshur Nov 23, 2014

Collaborator

There's no need to rename the imported package to sanitized_anchor_name, that is already its package name.

This comment has been minimized.

@halostatue

halostatue Nov 23, 2014

Contributor

Ah. That’s my newness to Go showing. (I had also originally just renamed it “sanitizer”.)

@halostatue

This comment has been minimized.

Copy link
Contributor

halostatue commented Nov 23, 2014

I like the idea of squashing sequences of non-alphanumeric strings into a single dash.

Prevent generated header collisions, less naively.
> This is a rework of an earlier version of this code.

The automatic header ID generation code submitted in #125 has a subtle
bug where it will use the same ID for multiple headers with identical
text. In the case below, all the headers are rendered a `<h1
id="header">Header</h1>`.

  ```markdown
  # Header
  # Header
  # Header
  # Header
  ```

This change is a simple but robust approach that uses an incrementing
counter and pre-checking to prevent header collision. (The above would
be rendered as `header`, `header-1`, `header-2`, and `header-3`.) In
more complex cases, it will append a new counter suffix (`-1`), like so:

  ```markdown
  # Header
  # Header 1
  # Header
  # Header
  ```

This will generate `header`, `header-1`, `header-1-1`, and `header-1-2`.

This code has two additional changes over the prior version:

1.  Rather than reimplementing @shurcooL’s anchor sanitization code, I
    have imported it as from
    `github.com/shurcooL/go/github_flavored_markdown/sanitized_anchor_name`.

2.  The markdown block parser is now only interested in *generating* a
    sanitized anchor name, not with ensuring its uniqueness. That code
    has been moved to the HTML renderer. This means that if the HTML
    renderer is modified to identify all unique headers prior to
    rendering, the hackish nature of the collision detection can be
    eliminated.

rtfb added a commit that referenced this pull request Nov 24, 2014

Merge pull request #126 from halostatue/generate-unique-header-ids
Prevent header collisions on generation.

@rtfb rtfb merged commit 1f004e1 into russross:master Nov 24, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@halostatue

This comment has been minimized.

Copy link
Contributor

halostatue commented Nov 24, 2014

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment