
bad URL recognition #47

Open
slavkoja opened this issue Sep 21, 2013 · 12 comments

Comments

@slavkoja

I often see that URLs are recognized badly. For example, here is the text:

sftp://raspi.skk.

(dot at the end). The URL is recognized as "ftp://raspi.skk.", with the dot at the end. I understand the change from SFTP to FTP, but the trailing dot is wrong, and this happens with parentheses and other characters too...

@Tetralet
Owner

Although a dot is a legal character in a URL, I think we can safely assume that no URL ends with a dot... XD

Thanks for reporting this bug. I'll fix it ASAP.

@slavkoja
Author

Thanks

This URL recognition is a feature that makes LilyTerm a better choice than other terminal emulators ;-)

Recognizing "sftp" would be nice too, but it is not in the "must have" category.

@slavkoja
Author

slavkoja commented Nov 9, 2013

Hi,

I found another problem. The text:

... 'http://raspi.skk/gitlist/rm-hull_pcd8544/': ...

is recognized as:

http://raspi.skk/gitlist/rm-hull_pcd8544/':

regards

@eevee

eevee commented Nov 19, 2013

Something similar happens with question marks and trailing parentheses: a closing ")" gets included in the URL.

Also, URLs whose domains don't contain a dot are not recognized, so I can't click on http://bugtracker/ticket/1234 in work IRC.

@Profpatsch
Contributor

For example, given an Emacs-style link like <http://emacswiki.org/wiki/EMMS>, it tries to open http://emacswiki.org/wiki/EMMS> with the trailing angle bracket.

What is a good rule for which chars should be stripped at the end?

Oh, I found it: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
I just knew there had to be someone who already solved this.

@eevee

eevee commented Nov 22, 2013

DO NOT use that regex; it's susceptible to DoS via catastrophic backtracking.

I'm not sure a pure regex is appropriate for this; more likely you want a dead simple regex with some manual inspection to weed out false positives.
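The "broad match plus postprocessing" approach could look something like the following Python sketch. The CANDIDATE pattern, the TRAILING set, and the parenthesis heuristic are all assumptions for illustration, not LilyTerm's actual code:

```python
import re

# Deliberately simple candidate matcher: a scheme, then any run of
# non-space characters.  No nested quantifiers, so no catastrophic
# backtracking is possible.
CANDIDATE = re.compile(r'\b(?:https?|s?ftp)://\S+', re.IGNORECASE)

# Punctuation that is legal in a URL but almost never intended as its
# last character when the URL is embedded in prose.
TRAILING = ".,;:!?'\""

def clean(url):
    """Strip trailing punctuation, keeping balanced parentheses."""
    while url and url[-1] in TRAILING:
        url = url[:-1]
    # Drop a trailing ')' only when it has no matching '(' inside the
    # URL, so Wikipedia-style "...,/Foo_(bar)" links stay intact.
    while url.endswith(')') and url.count('(') < url.count(')'):
        url = url[:-1]
    return url

def find_urls(text):
    return [clean(m.group(0)) for m in CANDIDATE.finditer(text)]
```

With this, "sftp://raspi.skk." yields "sftp://raspi.skk", the 'http://raspi.skk/gitlist/rm-hull_pcd8544/': case loses its "':" tail, and "(https://something.xyz)" loses the unbalanced ")".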

@Profpatsch
Contributor

Can you elaborate on why this is dangerous?
You just ordered us not to use that regex, without any explanation. I think I'm not the only one who doesn't want to be ordered around by a complete stranger.

As far as security goes, there are no bad comments here: https://news.ycombinator.com/item?id=1552766

@eevee

eevee commented Nov 22, 2013

Try matching that regex against http://www.foo.com/!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.

Depends on the engine, of course, but most engines handle nested +/* operators very poorly. Perl has special protection against this, so it works; PCRE (or at least pcregrep(1)) hits some backtracking limit and aborts; Python obediently carries on for more time than I cared to finish measuring, appearing frozen in the meantime.

It's an artifact of naïve backtracking that makes pathological input take exponential time. And you'd never notice on normal input, just like the author and the HN commenters didn't. But I'd rather not have every single terminal window potentially lock up because some jerk said a malformed URL on IRC. :)

Don't use hairy regexes, and particularly not ones you just got off some guy's website, unless you really really understand what you're doing. Just use something broad and touch it up with postprocessing. Or use a real parser, I guess.
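The catastrophic backtracking described above can be shown with a toy nested-quantifier pattern (not the daringfireball regex itself); the failing match below forces the engine to try every way of splitting the run of 'a's, so the work roughly doubles with each extra character:

```python
import re
import time

# (a+)+ can split a run of n 'a's into groups in ~2^(n-1) ways, and on
# a match that ultimately fails, the backtracking engine tries them all.
nested = re.compile(r'^(a+)+b$')

def time_failing_match(n):
    start = time.perf_counter()
    assert nested.match('a' * n + 'c') is None  # can never match
    return time.perf_counter() - start

# n=18 already forces ~130,000 backtracking attempts; every extra 'a'
# roughly doubles the work, so a few dozen '!' characters after a URL
# are enough to freeze a naïve matcher.
print(time_failing_match(18))
```

The equivalent pattern without nesting, `^a+b$`, fails on the same input in linear time, which is why flat patterns like the broad matcher above are safe.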

@Profpatsch
Contributor

Hm, I think I understand.

How about something like http://uriparser.sourceforge.net/?
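The idea behind that suggestion, validating each broad-matched candidate with a strict RFC 3986 parser, can be sketched with Python's urllib.parse as a stand-in (looks_valid and its acceptance rules are assumptions for illustration, not uriparser's API):

```python
from urllib.parse import urlsplit

def looks_valid(candidate):
    """Accept a candidate only if a strict parse yields a scheme, a
    hostname, and (when present) a sane numeric port."""
    try:
        parts = urlsplit(candidate)
        # Accessing .port raises ValueError on a non-numeric or
        # out-of-range port, e.g. "http://host:notaport/".
        port_ok = parts.port is None or 0 < parts.port < 65536
        return bool(parts.scheme) and bool(parts.hostname) and port_ok
    except ValueError:
        return False
```

A strict parse also handles the cases this thread keeps hitting, such as hosts with ports, without hand-tuning the matcher itself.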

@slavkoja
Author

Hi again :-)

Here is the opposite problem: the URL "http://localhost:8000/" is not recognized as a URL at all. Perhaps due to the port number?
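Both recognition gaps reported in this thread (hosts without a dot, like localhost and bugtracker, and explicit port numbers) disappear if the matcher requires only a scheme and at least one host character. A hedged sketch of such a pattern, not LilyTerm's actual regex:

```python
import re

# One scheme, at least one host character, an optional :port, and an
# optional path.  No dot is required anywhere, and there are no nested
# quantifiers that could backtrack catastrophically.
URL = re.compile(
    r'\b(?:https?|s?ftp)://'   # scheme
    r'[\w.-]+'                 # host: "localhost", "bugtracker", "raspi.skk"
    r'(?::\d{1,5})?'           # optional port, e.g. ":8000"
    r'(?:/\S*)?',              # optional path
    re.IGNORECASE)

assert URL.fullmatch("http://localhost:8000/")
assert URL.fullmatch("http://bugtracker/ticket/1234")
```

The trade-off is more false positives on dotless words, which is where a strict-parse or postprocessing pass earns its keep.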

@Shougo

Shougo commented Jan 28, 2014

@Tetralet You should close this issue; it is still open.

@Profpatsch
Contributor

Hm, I still get (https://something.xyz) recognized as https://something.xyz).

It would also be cool if URLs with \n inside them could be recognized. Something like: if matching hits \n, try to match on until the next space anyway and strip out every \n.
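That \n heuristic could be prototyped as below. The one-extra-line limit is an assumption, and gluing on the next line's first word when a URL merely ends at a line break is a known false positive of this approach; a real implementation might additionally check that the break falls exactly at the terminal's column width:

```python
import re

# \S excludes '\n', so the optional group lets a URL hard-wrapped by
# the terminal continue onto exactly one following line.
URL_WRAP = re.compile(
    r'\b(?:https?|s?ftp)://\S+'   # the URL up to the line break
    r'(?:\n\S+)?',                # optionally continue on the next line
    re.IGNORECASE)

def find_wrapped(text):
    # Splice out the line break the terminal inserted.
    return [m.group(0).replace('\n', '') for m in URL_WRAP.finditer(text)]
```

For a wrapped link this reassembles the whole URL; for an unwrapped link followed by ordinary text it wrongly absorbs the next word, which the test below demonstrates.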
