Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better clickable URL regexes #3112

Closed
pragma- opened this issue Feb 14, 2023 · 12 comments
Closed

Better clickable URL regexes #3112

pragma- opened this issue Feb 14, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@pragma-
Copy link

pragma- commented Feb 14, 2023

The default URL regexes leave a lot to be desired. URLs ending with ) e.g. https://en.wikipedia.org/wiki/Class_(set_theory) will not include the trailing closing ) in the clicked result. Likewise for URLs ending with - as in https://www.onlinegdb.com/fork/Jd2x8Yvh-. There also were quirky issues with URLs wrapped with parentheses or brackets or quotes.

After a lot of tweaking and experimentation, I've settled on these two regexes for URLs:

    hyperlink_rules = {
        -- Handle URLs wrapped with punctuation
        -- e.g. 'http://foo' (http://foo) <http://foo> etc
        -- the punctuation will be underlined but excluded when clicked
        {
            regex = '[[:punct:]](\\w+://\\S+)[[:punct:]]',
            format = '$1',
        },

        -- Linkify things that look like URLs
        -- and include terminating ), / or - characters, if any
        {
            regex = '\\b\\w+://\\S+[)/a-zA-Z0-9-]+',
            format = '$0',
        },
    },

I have used these two regexes for the last couple of days on busy IRC channels and they seem to work much, much better than the default regexes. I don't profess to be an expert with regexes and I appreciate any critical feedback/improvements/suggestions.

What does everyone think?

@pragma- pragma- added the enhancement New feature or request label Feb 14, 2023
@pragma-
Copy link
Author

pragma- commented Feb 14, 2023

Oh, and some websites require the closing / on URLs, e.g. some cgit repositories. The default wezterm URL regexes do not include the closing /. The above regexes do.

@pragma-
Copy link
Author

pragma- commented Feb 14, 2023

Is there a way to not underline the surrounding punctuation in the first regex? I use a capture-group to capture $1 within the [[:punct:]] classes.

@pragma-
Copy link
Author

pragma- commented Feb 14, 2023

Unfortunately it doesn't handle messages like:

(test if your terminal supports sixels before anything: https://github.com/csdvrx/sixel-testsuite)

It ends up making https://github.com/csdvrx/sixel-testsuite) the clickable URL, with the ) attached to the end. :-( Hopefully someone has an idea to improve this situation.

@wez
Copy link
Owner

wez commented Feb 17, 2023

Essentially a duplicate of #928
I'm open to a PR that is accompanied by a set of regexes that works well. As you've discovered, it's not as easy as it might appear at first glance!
I'm going to close this issue in favor of the other one so that there are fewer things to keep track of. You are welcome to continue iterating and sharing your findings on the other issue.

@wez wez closed this as completed Feb 17, 2023
@pragma-
Copy link
Author

pragma- commented Feb 17, 2023

What about a way to underline just the part that is captured? My first regex is regex = '[[:punct:]](\\w+://\\S+)[[:punct:]] which is a capture group within punct classes. Is there a way to make the underlined part be just the captured $1 region? Can the hyperlink underliner be patched to underline just the captured region instead?

@wez
Copy link
Owner

wez commented Feb 18, 2023

If you don't need the punctuation to match or be underlined, can you omit the [[:punct:]] from either end? It doesn't seem like it serves a purpose.

@pragma-
Copy link
Author

pragma- commented Feb 18, 2023

The purpose is to specifically match hyperlinks that are within punctuation. It is a simple (and therefore performant) way to match things like <http://foo/> and (http://foo/) and [http://foo/] without losing the trailing /. It will even correctly extract http://foo/bar_(baz) from (http://foo/bar_(baz)).

There are two regexes in my snippet in the first comment. The [[:punct:]] regex is first specifically to handle hyperlinks that are wrapped with punctuation. The regex is very general in order to be much simpler so it can be more performant. It's intended to match hyperlinks surrounded by [] or () or <> or whatever punctuation.

If a hyperlink doesn't match this first regex, the next regex gets tested, which has a \b word-boundary to handle the beginning protocol identifier and then terminates on [)/a-zA-Z0-9-]+ to include the terminating - or ) or / in the hyperlinks, for reasons described in my previous comments.

I've been using these two regexes instead of the default regexes for the last 3 or 4 days, and it is much, much, much better for matching hyperlinks. 99.9% of the links are correctly clickable now whereas before only about 60% to 70% of the links would be clickable without needing to manually add a trailing ) or / or to manually highlight the entire link because it'd incorrectly stop at an earlier word-boundary.

I was hoping to share these with the community so they could try them out and iterate on them to improve them. They are much better than the defaults because not only do they match more hyperlinks more accurately but they're also simpler, and therefore more performant, than the defaults. There is just this one edge-case, as mentioned, that if resolved would make these regexes 100% perfect for me.

@wez
Copy link
Owner

wez commented Feb 18, 2023

Have you considered using a non-capturing group for the punctuation?
https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions

@pragma-
Copy link
Author

pragma- commented Feb 18, 2023

... but the capture group isn't wrapped around the punctuation classes!

@pragma-
Copy link
Author

pragma- commented Feb 18, 2023

Okay, I tried setting the regex to (?:[[:punct:]])(\\w+://\\S+)(?:[[:punct:]]) with non-capturing groups around the punctuation classes. It had no effect.

@pragma-
Copy link
Author

pragma- commented Feb 18, 2023

I'd like to move this conversation to #928 and to continue it there. I've condensed my multiple comments here into one heavily edited comment on that issue.

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants