Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdown lexer edge case with links and square brackets (rare) #1444

Closed
Ravlen opened this issue Feb 20, 2020 · 9 comments · Fixed by #1445
Closed

Markdown lexer edge case with links and square brackets (rare) #1444

Ravlen opened this issue Feb 20, 2020 · 9 comments · Fixed by #1445
Assignees
Labels
bugfix-request A request for a bugfix to be developed.

Comments

@Ravlen
Copy link
Contributor

Ravlen commented Feb 20, 2020

Name of the lexer
Markdown

This might be too rare to bother put effort into it, but this is something I run into from time to time as some of our engineers like linking to TOML section examples, and the section names are usually [section] or [[section]].

I'm raising the issue because::

Screen Shot 2020-02-20 at 14 16 12

Code sample

Does not work with GitHub's coloring either:

[This is a link `with backticks`](example.com)
[This is a link to a TOML section `[with brackets]` (and backticks)](example.com)
[This is a link to a TOML section `[[with double brackets]]` (and backticks)](example.com)

Does not work with Rouge:

Screen Shot 2020-02-20 at 14 31 08

Works fine with VS Code:

Screen Shot 2020-02-20 at 14 31 51

Additional context
Yeah, it's a rare edge case. Now that I saw one fix earlier this week (that was fast, btw!), I'll see if I can spot any issues with the regex (though I'm still learning).

@Ravlen Ravlen added the bugfix-request A request for a bugfix to be developed. label Feb 20, 2020
@pyrmont
Copy link
Contributor

pyrmont commented Feb 20, 2020

Agreed this should parse. It works fine in CommonMark.

@pyrmont pyrmont self-assigned this Feb 20, 2020
@pyrmont
Copy link
Contributor

pyrmont commented Feb 20, 2020

The problem is with how the regex uses non-greedy matching in rules like this:

rule %r/(!?\[)(#{edot}*?)(\])/ do

The problem that I can imagine is that a more complex set of rules potentially leads you to a state where it's difficult to know on an initial pass whether you're in a link or not.

@Ravlen
Copy link
Contributor Author

Ravlen commented Feb 20, 2020

@pyrmont Would it be crazy to add the opening paren for the link, as a way to identify the full link-text section?

- rule %r/(!?\[)(#{edot}*?)(\])/ do
+ rule %r/(!?\[)(#{edot}*?)(\]\()/ do

Screen Shot 2020-02-20 at 15 00 48

Screen Shot 2020-02-20 at 15 00 54

@pyrmont
Copy link
Contributor

pyrmont commented Feb 20, 2020

It would need to be either a ( or a [ (since you can have inline links and reference links) and it might be preferable to do it as a lookahead so that the rest of the lexing logic stays the same.

@Ravlen
Copy link
Contributor Author

Ravlen commented Feb 20, 2020

(Sidenote, I just noticed you and jneen are both in Japan, which is funny because I'm in Kagoshima!)

@Ravlen
Copy link
Contributor Author

Ravlen commented Feb 20, 2020

@pyrmont I struggle with lookahead logic and never got a good grasp of it, but I could easily add a ( OR [:

rule %r/(!?\[)(#{edot}*?)(\]\(|\]\[)/ do

Screen Shot 2020-02-20 at 15 09 56

@pyrmont
Copy link
Contributor

pyrmont commented Feb 20, 2020

You can think of a lookahead as a special kind of 'group' (using (?=...)). So in this case: /(!?\[)(#{edot}*?)(?=\][\[(])/. It's difficult to read because of all the escaping but in plain English, that's a ] and then a character class consisting of either [ or (.

Ravlen added a commit to Ravlen/rouge that referenced this issue Feb 20, 2020
When hiding brackets inside links with backticks, rendering
was breaking. This makes the lexer search for `](` or `](`
to find the end delimiting the link text.
@Ravlen
Copy link
Contributor Author

Ravlen commented Feb 20, 2020

@pyrmont I read through some regex docs and I'm starting to grasp it, but with the tool I was using your example wasn't working, though a slight modification worked. Does this make sense?

rule %r/(!?\[)(#{edot}*?)(?=\]\(|\]\[)/ do

Screen Shot 2020-02-20 at 16 03 18

Once I got that working, I started to wonder if I should use lookbehind to skip the first bracket as well. Does this make sense?

rule %r/(!?(?<=\[))(#{edot}*?)(?=\]\(|\]\[)/ do

Screen Shot 2020-02-20 at 16 03 05

@Ravlen
Copy link
Contributor Author

Ravlen commented Feb 20, 2020

I got it working now, turns out the online tool was just buggy and I had to refresh and it just started working again. I'll add the lookahead to the PR.

Ravlen added a commit to Ravlen/rouge that referenced this issue Feb 20, 2020
When hiding brackets inside links (with backticks),
coloring was breaking. This makes the lexer search
for the open bracket or paren that starts the url
or reference link
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix-request A request for a bugfix to be developed.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants