adding the possibility for extracting www-based URLs as well #8
Conversation
Hey! Thanks for taking the time to start the work for this, and sorry that it took a while for me to get to reviewing this :).
I've added some comments inline, can you please address them and update this pull request? Once that's done, I'll merge and release a new version with these changes.
@@ -88,8 +93,9 @@ public Builder emailDomainMustHaveDot(boolean emailDomainMustHaveDot) {
      */
     public LinkExtractor build() {
         UrlScanner urlScanner = linkTypes.contains(LinkType.URL) ? new UrlScanner() : null;
+        WwwUrlScanner wwwScanner = linkTypes.contains(LinkType.URL) ? new WwwUrlScanner() : null;
Can you please add a new LinkType.WWW for this instead? Some users of the library might want to opt out of www scanning but keep normal URL scanning.
Sure thing, but should we not have a third type to handle all URLs? So we would have LinkType.HTTP, LinkType.WWW and LinkType.URL, where the latter includes the two former. This would not break backwards compatibility, just add extra functionality to the same methods.
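For illustration, here is a minimal sketch of the opt-out a separate LinkType.WWW would enable: only URL scanning is turned on, so scheme-less www links are skipped. The package name org.nibor.autolink and the LinkSpan accessors are taken from the released library; treat this as a sketch, not code from this pull request.

import java.util.EnumSet;

import org.nibor.autolink.LinkExtractor;
import org.nibor.autolink.LinkType;

public class UrlOnlyExample {
    public static void main(String[] args) {
        // Only LinkType.URL is enabled, so "www.example.org" without a scheme
        // is not extracted, while "http://example.org" still is.
        LinkExtractor urlOnly = LinkExtractor.builder()
                .linkTypes(EnumSet.of(LinkType.URL))
                .build();
        urlOnly.extractLinks("see http://example.org and www.example.org")
                .forEach(span -> System.out.println(span.getType() + " at " + span.getBeginIndex()));
    }
}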
    /**
     * @return the found sequence
     */
    CharSequence sequence();
Please remove this part of the change, it's not related to the WWW change at all as far as I can tell. It should be in a separate PR. But I'm also not sure I want to include a method like this, because it means that LinkSpan is required to keep either a reference to the original input or make a copy of the substring.
This is fine by me. Removed it
     }

     // See "scheme" in RFC 3986
-    private int findFirst(CharSequence input, int beginIndex, int rewindIndex) {
+    protected int findFirst(CharSequence input, int beginIndex, int rewindIndex) {
Don't make this protected, it's not used in the subclass you added.
 * <p>
 * Based on RFC 3986.
 */
public class WwwUrlScanner extends UrlScanner {
Instead of inheriting from UrlScanner, extract the code of findLast into a new utility function, maybe Scanners.findLastForUrl.
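As a rough illustration of that suggestion, the sketch below shows the shape such a utility could take. The class name Scanners, the method findLastForUrl and its simplified body are assumptions; the real logic would be moved out of UrlScanner.findLast and handles much more than whitespace.

// Hypothetical sketch: a shared utility that both UrlScanner and WwwUrlScanner
// could call, instead of WwwUrlScanner extending UrlScanner.
final class Scanners {

    private Scanners() {
        // no instances
    }

    // Returns the last index (inclusive) of the URL starting at beginIndex,
    // or beginIndex - 1 if nothing qualifies. Greatly simplified stand-in body:
    // it only stops at whitespace, whereas the real findLast logic also handles
    // trailing punctuation, brackets and so on.
    static int findLastForUrl(CharSequence input, int beginIndex) {
        int last = beginIndex - 1;
        for (int i = beginIndex; i < input.length(); i++) {
            if (Character.isWhitespace(input.charAt(i))) {
                break;
            }
            last = i;
        }
        return last;
    }
}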
int last = super.findLast(input, beginIndex);

// Make sure there is at least one dot after the first dot,
// so www.something is not allowed, but www.something.co.uk is
www.something should be allowed. See for example www.com. Also, rinku recognizes that as well. But there should be at least a single domain letter after www.
I am a little reluctant towards this, as I believe the URL in itself is not valid, but http://www.com would be. And you have already taken care of that case. I am trying to add the functionality of extracting links where http:// has been stripped, which is common.
Ok, fair enough, it's unusual. While looking at this, I wondered whether www..com is handled correctly (not being linked). Looks like it is :). Can you add that to the test cases?
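A self-contained sketch of such a test case, using plain JUnit 4 and the public LinkExtractor API rather than the project's parameterized test setup; the class and method names are made up, and it assumes the default builder has www scanning enabled.

import static org.junit.Assert.assertFalse;

import java.util.Iterator;

import org.junit.Test;
import org.nibor.autolink.LinkExtractor;
import org.nibor.autolink.LinkSpan;

public class WwwEdgeCaseTest {

    private final LinkExtractor extractor = LinkExtractor.builder().build();

    @Test
    public void doubleDotAfterWwwIsNotLinked() {
        // "www..com" has an empty label right after "www.", so nothing
        // should be extracted from it.
        Iterator<LinkSpan> links = extractor.extractLinks("www..com").iterator();
        assertFalse(links.hasNext());
    }
}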
    return null;
}

int first = triggerIndex;
I think this is not enough for finding the first character. What about something like foo.www.example.org?
Actually, I am not quite convinced that would be a valid URL. If it ought to be valid, would it not only be so with http:// in front?
That's what I mean, it feels like it shouldn't extract www.example.org from foo.www.example.org, but with the current code it does:

org.junit.ComparisonFailure:
Expected :foo.www.example.org
Actual   :foo.|www.example.org|

It even extracts www.example.org from wwww.example.org. It has to check the characters before the triggerIndex, maybe using Character.isSpaceChar.
It turns out that Character.isWhitespace is not sufficient to look for, as this will not extract links from HTML (<a href="">www.something.com</a>), so I have tried to look at which characters are definitely not allowed before www. I actually came to the conclusion that just a single dot should render the whole link unusable. Everything else should just be stripped. Do you agree on this?
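For illustration, the dot rule described here could look like the sketch below. The method name and placement are assumptions, and it deliberately covers only the preceding-dot case; the handling of other preceding characters (for example the extra w in wwww.example.org) was still being discussed, so this is not necessarily what was merged.

// A "www." candidate is rejected only when the character immediately before
// the trigger is a dot, so "foo.www.example.org" is not linked, while markup
// such as ">www.example.org" still is.
static boolean allowedBeforeWwwTrigger(CharSequence input, int triggerIndex) {
    if (triggerIndex == 0) {
        return true; // start of input, nothing before the first 'w'
    }
    return input.charAt(triggerIndex - 1) != '.';
}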
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class AutolinkWwwUrlTest extends AutolinkUrlTest {
When we have a new LinkType for www, you won't be able to inherit from AutolinkUrlTest anymore.
Please add handling for characters before the trigger w and the additional test I suggested, otherwise looks good :)!
Have you had some time to look at the updated pull request?
Sorry, thanks for pinging :). I don't think we should also accept WWW.example.org, that doesn't seem to be something someone would type that often. And if they really want that, they can add http:// in front. Would you mind removing it? The rest looks good now!
I have removed the check for the uppercase W. I also have a suggestion for a rename … What do you say?
What would be gained by adding another LinkType? I can see it leading to bugs, because a user could get confused when checking the type of an extracted LinkSpan. And what happens when you tell it to extract type …?
My thinking was to rename … The addition to … However, I do not mind if you find this confusing to the user. It was just a thought.
The thing that might confuse users is that … You can already extract URL and WWW links but not emails by configuring the builder like this:

LinkExtractor linkExtractor = LinkExtractor.builder()
        .linkTypes(EnumSet.of(LinkType.URL, LinkType.WWW))
        .build();

Given that, I'm not sure what having an additional …
Merged. Thanks a lot for submitting the pull request and seeing it through :)! I noticed that the README also needs a section for the new functionality. Would you mind submitting another PR for that?
Released in 0.6.0 now 🎉: https://github.com/robinst/autolink-java/releases/tag/autolink-0.6.0