
Conversation

@MTDdk (Contributor) commented Oct 19, 2016

No description provided.

@robinst (Owner) left a comment

Hey! Thanks for taking the time to start the work for this, and sorry that it took a while for me to get to reviewing this :).

I've added some comments inline, can you please address them and update this pull request? Once that's done, I'll merge and release a new version with these changes.

@@ -88,8 +93,9 @@ public Builder emailDomainMustHaveDot(boolean emailDomainMustHaveDot) {
*/
public LinkExtractor build() {
UrlScanner urlScanner = linkTypes.contains(LinkType.URL) ? new UrlScanner() : null;
WwwUrlScanner wwwScanner = linkTypes.contains(LinkType.URL) ? new WwwUrlScanner() : null;
@robinst (Owner) commented Oct 24, 2016

  • Can you please add a new LinkType.WWW for this instead? Some users of the library might want to opt out of www scanning but keep normal URL scanning.
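
(For reference, a rough sketch of what this suggestion could look like inside Builder.build(), assuming a LinkType.WWW constant is added to the enum; an illustrative fragment, not the merged code:)

    public LinkExtractor build() {
        // Each scanner is only created when its corresponding LinkType was requested,
        // so www scanning can be switched off independently of normal URL scanning.
        UrlScanner urlScanner = linkTypes.contains(LinkType.URL) ? new UrlScanner() : null;
        WwwUrlScanner wwwScanner = linkTypes.contains(LinkType.WWW) ? new WwwUrlScanner() : null;
        // ... email scanner and construction of the LinkExtractor unchanged
    }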

@MTDdk (Contributor, Author):

Sure thing, but should we not have a third type to handle all URLs?
So we would have a LinkType.HTTP, LinkType.WWW and LinkType.URL, where the latter includes the two former.
This would not exactly break backwards compatibility, but just add extra functionality to the same methods.

/**
* @return the found sequence
*/
CharSequence sequence();
@robinst (Owner) commented Oct 24, 2016

  • Please remove this part of the change, it's not related to the WWW change at all as far as I can tell. It should be in a separate PR. But I'm also not sure I want to include a method like this, because it means that LinkSpan is required to keep either a reference to the original input or make a copy of the substring.

@MTDdk (Contributor, Author):

This is fine by me. Removed it.

}

// See "scheme" in RFC 3986
private int findFirst(CharSequence input, int beginIndex, int rewindIndex) {
protected int findFirst(CharSequence input, int beginIndex, int rewindIndex) {
@robinst (Owner) commented Oct 24, 2016

  • Don't make this protected, it's not used in the subclass you added.

* <p>
* Based on RFC 3986.
*/
public class WwwUrlScanner extends UrlScanner {
@robinst (Owner) commented Oct 24, 2016

  • Instead of inheriting from UrlScanner, extract the code of findLast into a new utility function, maybe Scanners.findLastForUrl.
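
(A minimal sketch of that extraction; the name Scanners.findLastForUrl is the reviewer's proposal, and the scanning logic here is deliberately simplified compared to the real UrlScanner#findLast:)

    final class Scanners {

        private Scanners() {
        }

        // Shared "find the end of the URL starting at beginIndex" helper that both
        // UrlScanner and WwwUrlScanner could call, instead of one inheriting from the other.
        // Simplified: stops at the first whitespace character; returns the index of the
        // last character belonging to the URL, or beginIndex - 1 if there is none.
        static int findLastForUrl(CharSequence input, int beginIndex) {
            int last = beginIndex - 1;
            for (int i = beginIndex; i < input.length(); i++) {
                if (Character.isWhitespace(input.charAt(i))) {
                    break;
                }
                last = i;
            }
            return last;
        }
    }

WwwUrlScanner would then call Scanners.findLastForUrl(input, beginIndex) directly rather than extending UrlScanner.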

int last = super.findLast(input, beginIndex);

// Make sure there is at least one dot after the first dot,
// so www.something is not allowed, but www.something.co.uk is
@robinst (Owner):

  • www.something should be allowed. See for example www.com. Also, rinku recognizes that as well. But there should be at least a single domain letter after www..

@MTDdk (Contributor, Author):

I am a little reluctant about this, as I believe the URL in itself is not valid, whereas http://www.com would be, and you have already taken care of that case.
I am trying to add functionality for extracting links where the http:// prefix has been stripped, as it often is.

@robinst (Owner):

Ok, fair enough, it's unusual. While looking at this, I wondered whether www..com is handled correctly (not being linked). Looks like it is :). Can you add that to the test cases?
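
(Something along these lines, where the expected extracted link is wrapped in | characters; assertLinked is assumed to be the existing test helper:)

    assertLinked("www.something.co.uk", "|www.something.co.uk|");
    assertLinked("www..com", "www..com");   // nothing but a dot after "www.", so no link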

return null;
}

int first = triggerIndex;
@robinst (Owner):

  • I think this is not enough for finding the first character. What about something like foo.www.example.org?

@MTDdk (Contributor, Author):

Actually, I am not quite convinced that would be a valid URL.
If that ought to be valid, would it not have to come with http:// in front?

@robinst (Owner):

That's what I mean, it feels like it shouldn't extract www.example.org from foo.www.example.org, but with the current code it does:

org.junit.ComparisonFailure:
Expected :foo.www.example.org
Actual :foo.|www.example.org|

It even extracts www.example.org from wwww.example.org. It has to check the characters before the triggerIndex, maybe using Character.isSpaceChar.

@MTDdk (Contributor, Author):

It turns out that Character.isWhitespace is not sufficient to check for, as that would not extract links from HTML (<a href="">www.something.com</a>), so I have tried to look at which characters are definitely not allowed before www.

I actually came to the conclusion that a single preceding dot should render the whole link unusable, while everything else should just be stripped. Do you agree?
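
(A rough sketch of the kind of check being discussed, rejecting a trigger that is preceded by a dot or by a letter or digit; illustrative only, the merged code may differ:)

    // Only treat "www." at triggerIndex as a potential link start when the preceding
    // character does not continue a word or domain. This rejects "foo.www.example.org"
    // and "wwww.example.org" while still linking <a href="">www.example.org</a>.
    private static boolean allowedBeforeTrigger(CharSequence input, int triggerIndex) {
        if (triggerIndex == 0) {
            return true;
        }
        char before = input.charAt(triggerIndex - 1);
        return before != '.' && !Character.isLetterOrDigit(before);
    }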

import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class AutolinkWwwUrlTest extends AutolinkUrlTest {
@robinst (Owner):

When we have a new LinkType for www, you won't be able to inherit from AutolinkUrlTest anymore.
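
(Once there is a separate LinkType.WWW, a standalone test against the public API could look roughly like this instead of inheriting from AutolinkUrlTest; class name and test data are illustrative:)

    import java.util.EnumSet;
    import java.util.Iterator;

    import org.junit.Test;
    import org.nibor.autolink.LinkExtractor;
    import org.nibor.autolink.LinkSpan;
    import org.nibor.autolink.LinkType;

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertFalse;

    public class WwwOnlyTest {

        private final LinkExtractor extractor = LinkExtractor.builder()
                .linkTypes(EnumSet.of(LinkType.WWW))
                .build();

        @Test
        public void extractsOnlyWwwLinks() {
            String input = "see www.example.org and http://example.com";
            Iterator<LinkSpan> links = extractor.extractLinks(input).iterator();
            LinkSpan link = links.next();
            assertEquals(LinkType.WWW, link.getType());
            assertEquals("www.example.org", input.substring(link.getBeginIndex(), link.getEndIndex()));
            // LinkType.URL was not requested, so the http:// link is not extracted.
            assertFalse(links.hasNext());
        }
    }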

@robinst (Owner) left a comment

Please add handling for characters before the trigger w and the additional test I suggested, otherwise looks good :)!

@MTDdk (Contributor, Author) commented Nov 2, 2016

Have you had some time to look at the updated pull request?

@robinst (Owner) commented Nov 2, 2016

Sorry, thanks for pinging :). I don't think we should also accept WWW.example.org, that doesn't seem to be something someone would type that often. And if they really want that, they can add http:// in front. Would you mind removing it? The rest looks good now!
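
(A case-sensitive trigger check along these lines would match only a lowercase www.; an illustrative sketch, not necessarily the exact code in the pull request:)

    // Require the literal lowercase "www." at index i, so "WWW.example.org" is not linked.
    private static boolean isWwwTrigger(CharSequence input, int i) {
        return i + 3 < input.length()
                && input.charAt(i) == 'w'
                && input.charAt(i + 1) == 'w'
                && input.charAt(i + 2) == 'w'
                && input.charAt(i + 3) == '.';
    }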

@MTDdk (Contributor, Author) commented Nov 2, 2016

I have removed the check for uppercase W.

I also have a suggestion to rename UrlScanner to HttpScanner and introduce an extra link type, so we have a LinkType.HTTP, LinkType.WWW and LinkType.URL, where the latter includes the two former.
This would not exactly break backwards compatibility, but just add extra functionality to the same methods.

What do you say?

@robinst (Owner) commented Nov 2, 2016

HttpScanner wouldn't be accurate because it also scans https or ssh or other schemes. I think leaving it as it is now is good.

What would be gained by adding another LinkType? I can see it leading to bugs, because a user could get confused when checking the type of an extracted LinkSpan. And what happens when you tell it to extract both URL and HTTP? That would be redundant.

@MTDdk (Contributor, Author) commented Nov 2, 2016

My thinking was to rename UrlScanner -> HttpScanner and WwwUrlScanner -> WwwScanner.
I think this would describe more accurately what the classes do.

The addition to LinkType just needed to reflect this and thus have an additional type. The LinkType.URL would then be used when you wanted to extract ALL links (http and www) but not email.

However, I do not mind if you find this confusing to the user; it was just a thought.
I could create a new pull request so you could see what I meant.

@robinst (Owner) commented Nov 3, 2016

The thing that might confuse users is that www.example.org is not a URL at all, see here: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax

You can already extract URL and WWW links but not emails by configuring the builder like this:

LinkExtractor linkExtractor = LinkExtractor.builder()
        .linkTypes(EnumSet.of(LinkType.URL, LinkType.WWW))
        .build();

Given that, I'm not sure what having an additional LinkType that is a combination of other link types would improve.

@robinst merged commit 92b4d11 into robinst:master on Nov 3, 2016

@robinst (Owner) commented Nov 3, 2016

Merged. Thanks a lot for submitting the pull request and seeing it through :)!

I noticed that the README also needs a section for the new functionality. Would you mind submitting another PR for that?

@robinst (Owner) commented Nov 7, 2016

Released in 0.6.0 now 🎉: https://github.com/robinst/autolink-java/releases/tag/autolink-0.6.0
