adding the possibility for extracting www-based URLs as well #8
Conversation
Hey! Thanks for taking the time to start the work for this, and sorry that it took a while for me to get to reviewing this :).
I've added some comments inline, can you please address them and update this pull request? Once that's done, I'll merge and release a new version with these changes.
@@ -88,8 +93,9 @@ public Builder emailDomainMustHaveDot(boolean emailDomainMustHaveDot) {
      */
     public LinkExtractor build() {
         UrlScanner urlScanner = linkTypes.contains(LinkType.URL) ? new UrlScanner() : null;
+        WwwUrlScanner wwwScanner = linkTypes.contains(LinkType.URL) ? new WwwUrlScanner() : null;
Can you please add a new LinkType.WWW for this instead? Some users of the library might want to opt out of www scanning but keep normal URL scanning.
Sure thing, but should we not have a third type to handle all URLs? So we would have LinkType.HTTP, LinkType.WWW and LinkType.URL, where the latter includes the two former. This would not break backwards compatibility, just add extra functionality to the same methods.
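For illustration, here is a minimal sketch of the opt-out a separate LinkType.WWW would enable: only URL scanning is turned on, so scheme-less www links are skipped. The package name org.nibor.autolink and the LinkSpan accessors are taken from the released library; treat this as a sketch, not code from this pull request.

import java.util.EnumSet;

import org.nibor.autolink.LinkExtractor;
import org.nibor.autolink.LinkType;

public class UrlOnlyExample {
    public static void main(String[] args) {
        // Only LinkType.URL is enabled, so "www.example.org" without a scheme
        // is not extracted, while "http://example.org" still is.
        LinkExtractor urlOnly = LinkExtractor.builder()
                .linkTypes(EnumSet.of(LinkType.URL))
                .build();
        urlOnly.extractLinks("see http://example.org and www.example.org")
                .forEach(span -> System.out.println(span.getType() + " at " + span.getBeginIndex()));
    }
}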
    /**
     * @return the found sequence
     */
    CharSequence sequence();
Please remove this part of the change, it's not related to the WWW change at all as far as I can tell. It should be in a separate PR. But I'm also not sure I want to include a method like this, because it means that LinkSpan is required to keep either a reference to the original input or make a copy of the substring.
This is fine by me. Removed it
     }

     // See "scheme" in RFC 3986
-    private int findFirst(CharSequence input, int beginIndex, int rewindIndex) {
+    protected int findFirst(CharSequence input, int beginIndex, int rewindIndex) {
Don't make this protected, it's not used in the subclass you added.
 * <p>
 * Based on RFC 3986.
 */
public class WwwUrlScanner extends UrlScanner {
Instead of inheriting from UrlScanner, extract the code of findLast into a new utility function, maybe Scanners.findLastForUrl.
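As a rough illustration of that suggestion, the sketch below shows the shape such a utility could take. The class name Scanners, the method findLastForUrl and its simplified body are assumptions; the real logic would be moved out of UrlScanner.findLast and handles much more than whitespace.

// Hypothetical sketch: a shared utility that both UrlScanner and WwwUrlScanner
// could call, instead of WwwUrlScanner extending UrlScanner.
final class Scanners {

    private Scanners() {
        // no instances
    }

    // Returns the last index (inclusive) of the URL starting at beginIndex,
    // or beginIndex - 1 if nothing qualifies. Greatly simplified stand-in body:
    // it only stops at whitespace, whereas the real findLast logic also handles
    // trailing punctuation, brackets and so on.
    static int findLastForUrl(CharSequence input, int beginIndex) {
        int last = beginIndex - 1;
        for (int i = beginIndex; i < input.length(); i++) {
            if (Character.isWhitespace(input.charAt(i))) {
                break;
            }
            last = i;
        }
        return last;
    }
}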
int last = super.findLast(input, beginIndex);

// Make sure there is at least one dot after the first dot,
// so www.something is not allowed, but www.something.co.uk is
www.something should be allowed. See for example www.com. Also, rinku recognizes that as well. But there should be at least a single domain letter after www.
I am a little reluctant towards this, as I believe the URL in itself is not valid, but http://www.com would be. And you have already taken care of that case. I am trying to add the functionality of extracting links where http:// has been stripped, which is common.
Ok, fair enough, it's unusual. While looking at this, I wondered whether www..com is handled correctly (not being linked). Looks like it is :). Can you add that to the test cases?
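A self-contained sketch of such a test case, using plain JUnit 4 and the public LinkExtractor API rather than the project's parameterized test setup; the class and method names are made up, and it assumes the default builder has www scanning enabled.

import static org.junit.Assert.assertFalse;

import java.util.Iterator;

import org.junit.Test;
import org.nibor.autolink.LinkExtractor;
import org.nibor.autolink.LinkSpan;

public class WwwEdgeCaseTest {

    private final LinkExtractor extractor = LinkExtractor.builder().build();

    @Test
    public void doubleDotAfterWwwIsNotLinked() {
        // "www..com" has an empty label right after "www.", so nothing
        // should be extracted from it.
        Iterator<LinkSpan> links = extractor.extractLinks("www..com").iterator();
        assertFalse(links.hasNext());
    }
}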
    return null;
}

int first = triggerIndex;
I think this is not enough for finding the first character. What about something like foo.www.example.org?
Actually, I am not quite convinced that would be a valid URL. If it ought to be valid, would it not only be so with http:// in front?
That's what I mean, it feels like it shouldn't extract www.example.org from foo.www.example.org, but with the current code it does:

org.junit.ComparisonFailure:
Expected :foo.www.example.org
Actual   :foo.|www.example.org|

It even extracts www.example.org from wwww.example.org. It has to check the characters before the triggerIndex, maybe using Character.isSpaceChar.
It turns out that Character.isWhitespace is not sufficient to look for, as this will not extract links from HTML (<a href="">www.something.com</a>), so I have tried to look at which characters are definitely not allowed before www. I actually came to the conclusion that just a single dot should render the whole link unusable. Everything else should just be stripped. Do you agree on this?
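For illustration, the dot rule described here could look like the sketch below. The method name and placement are assumptions, and it deliberately covers only the preceding-dot case; the handling of other preceding characters (for example the extra w in wwww.example.org) was still being discussed, so this is not necessarily what was merged.

// A "www." candidate is rejected only when the character immediately before
// the trigger is a dot, so "foo.www.example.org" is not linked, while markup
// such as ">www.example.org" still is.
static boolean allowedBeforeWwwTrigger(CharSequence input, int triggerIndex) {
    if (triggerIndex == 0) {
        return true; // start of input, nothing before the first 'w'
    }
    return input.charAt(triggerIndex - 1) != '.';
}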
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class AutolinkWwwUrlTest extends AutolinkUrlTest {
When we have a new LinkType for www, you won't be able to inherit from AutolinkUrlTest anymore.
Please add handling for characters before the trigger w and the additional test I suggested, otherwise looks good :)!
Have you had some time to look at the updated pull request?
Sorry, thanks for pinging :). I don't think we should also accept WWW.example.org, that doesn't seem to be something someone would type that often. And if they really want that, they can add http:// in front. Would you mind removing it? The rest looks good now!
I have removed the check for the uppercase W. I also have a suggestion for a rename … What do you say?
What would be gained by adding another LinkType? I can see it leading to bugs, because a user could get confused when checking the type of an extracted LinkSpan. And what happens when you tell it to extract type …?
My thinking was to rename … The addition to … However, I do not mind if you find this confusing to the user. It was just a thought.
The thing that might confuse users is that … You can already extract URL and WWW links but not emails by configuring the builder like this:

LinkExtractor linkExtractor = LinkExtractor.builder()
        .linkTypes(EnumSet.of(LinkType.URL, LinkType.WWW))
        .build();

Given that, I'm not sure what having an additional …
Merged. Thanks a lot for submitting the pull request and seeing it through :)! I noticed that the README also needs a section for the new functionality. Would you mind submitting another PR for that?
Released in 0.6.0 now 🎉: https://github.com/robinst/autolink-java/releases/tag/autolink-0.6.0