Add test case for 2 letter top level domains #59

Open
chmac opened this Issue Sep 26, 2013 · 2 comments

chmac commented Sep 26, 2013

I've been testing a little by posting to Twitter, and domains like neustar.us, cal.io and github.io are not auto-linked. However, domains like chmac.com and army.mil are. These differences are not covered by the test cases.

I tried to dig into the JavaScript implementation to see what the actual behaviour is, but I've spent enough time on PHP regexes today, so maybe another day!

I just did another test: the domain foo.dd.uk is auto-linked, even though dd.uk is a non-existent second-level domain. Meanwhile gov.uk, which is a valid domain, is not linked, but www.gov.uk is.

I'd hazard a guess that any name directly under a 2-letter top-level domain (a.xx) is not linked, while domains like a.xxx.xx or a.xx.xx are linked. Ironically, t.co is not linked!

Contributor

jakl commented Sep 27, 2013

Good sleuthing! The signal-to-noise ratio in tweets often leans towards noise, so we only autolink text if we're mostly certain it's a URL; otherwise it could be an emoticon, an internet meme, or something with meaning in another language. Some TLDs, like .com, are treated as especially strong signals.

For now, you're right, we should have clear tests. Later we can revisit which domains should link and how changes would affect a sample set of tweets.

chmac commented Sep 27, 2013

After a little more sleuthing, it looks like the following happens on both the API and the JavaScript frontend on twitter.com:

  • Any first-level domain under a valid CC TLD is not linked, e.g. github.io
  • Any second-level domain under a valid CC TLD is linked, e.g. www.github.io or chmac.github.io
  • Any domain with a non-existent CC TLD is not linked, e.g. github.pp, www.github.pp or chmac.github.pp
  • Any first- or second-level domain under a valid global or US TLD is linked, e.g. github.aero, github.museum, github.mil, github.edu or chmac.github.museum
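
The observed rules above can be sketched as a small predicate. This is only a model of the behaviour reported in this thread, not the twitter-text implementation: the function name `would_autolink` and the two tiny TLD sets are assumptions made for illustration, and the real lists are much longer.

```python
# Illustrative subsets only (assumption); the real TLD lists are far longer.
CC_TLDS = {"io", "us", "uk", "co"}
GENERIC_TLDS = {"com", "mil", "edu", "aero", "museum"}


def would_autolink(domain: str) -> bool:
    """Model the linking behaviour observed in this thread (hypothetical helper)."""
    labels = domain.lower().split(".")
    if len(labels) < 2 or not all(labels):
        return False  # not a dotted name, or it has an empty label
    tld = labels[-1]
    if tld in GENERIC_TLDS:
        return True  # global/US TLDs link at any depth
    if tld in CC_TLDS:
        # A bare name.cc is skipped; one or more extra labels makes it link.
        return len(labels) >= 3
    return False  # unknown TLD, e.g. github.pp


if __name__ == "__main__":
    for d in ["github.io", "www.github.io", "github.pp", "github.aero", "t.co"]:
        print(d, would_autolink(d))
```

Note that this model also links foo.dd.uk, which matches the earlier observation, because it never checks whether the second-level label actually exists.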

There's presumably a list of valid TLDs in the JS and other codebases. It's probably possible to generate tests from that.
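
As a rough sketch of that idea, test cases could be generated from the TLD lists along these lines. The function name `generate_cases`, its parameters, and the `(text, expected)` tuple shape are all assumptions for illustration, not the format of the existing test suite, and the expected values simply encode the behaviour observed above.

```python
def generate_cases(cc_tlds, generic_tlds, name="example", sub="www"):
    """Build (domain, should_link) pairs from TLD lists (hypothetical helper)."""
    cases = []
    for tld in sorted(generic_tlds):
        # Global/US TLDs were observed to link with or without a subdomain.
        cases.append((f"{name}.{tld}", True))
        cases.append((f"{sub}.{name}.{tld}", True))
    for tld in sorted(cc_tlds):
        # Country-code TLDs were observed to link only with an extra label.
        cases.append((f"{name}.{tld}", False))
        cases.append((f"{sub}.{name}.{tld}", True))
    return cases


if __name__ == "__main__":
    for text, expected in generate_cases({"io", "uk"}, {"com", "museum"}):
        print(f"{text}: {'linked' if expected else 'not linked'}")
```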

I'd suggest that domains like github.io probably should be linked automatically: a non-existent domain like github.pp wouldn't be linked anyway, so things like that.ll wouldn't be linked either. But that's a decision for somebody else to make.
