add TLD list to detect all domains links and stay under twitter char limit #416

Closed
snarfed opened this Issue Jun 5, 2015 · 10 comments

@snarfed
Owner

snarfed commented Jun 5, 2015

twitter publish hit the 140 char limit on this post of @diplix's: http://wirres.net/article/articleview/7773/1/6/

log here (and below). looks like he's since revised that post and successfully published it.

Params: [(u'source', u'http://wirres.net/7773'), (u'target', u'http://brid.gy/publish/twitter')]
Starting new HTTP connection (1): wirres.net
"HEAD /7773 HTTP/1.1" 302 None
"HEAD /article/articleview/7773/1/6/ HTTP/1.1" 200 None
Resolved http://wirres.net/7773 to http://wirres.net/article/articleview/7773/1/6/
Source: https://www.brid.gy/twitter/diplix , features [u'publish', u'listen'], status enabled
Publish entity: aglzfmJyaWQtZ3lyVgsSDVB1Ymxpc2hlZFBhZ2UiL2h0dHA6Ly93aXJyZXMubmV0L2FydGljbGUvYXJ0aWNsZXZpZXcvNzc3My8xLzYvDAsSB1B1Ymxpc2gYgICAgICAgAoM
Starting new HTTP connection (1): wirres.net
"GET /article/articleview/7773/1/6/ HTTP/1.1" 200 None
Parsed microformats2: {
...
Converted to ActivityStreams object: {
"updated": "2015-06-05T23:29:22+0200",
"author": {
"url": "http://wirres.net",
"image": {
"url": "http://root.wirres.net/03022687630_h2.jpg"
},
"displayName": "felix schwenzel",
"objectType": "person"
},
"url": "http://wirres.net/7773",
"content": "\n<p><aside class=\"pullout\">ix freue mich auf die <a href=\"http://nebenan.hamburg\">nebenan.hamburg</a> morgen. ich spreche auch ne halbe stunde \u00fcbers <a href=\"http://wirres.net/article/index/indieweb\">#indieweb</a> und @reclaim_fm.</aside><a href=\"https://www.brid.gy/publish/twitter\"></a><a href=\"https://www.brid.gy/publish/facebook\"></a></p>\n<a name=\"mehr\"></a>\n",
"published": "2015-06-05T23:29:22+0200",
"objectType": "article"
}
Rendered content to:
ix freue mich auf die nebenan.hamburg morgen. ich spreche auch ne halbe stunde übers #indieweb und @reclaim_fm.
Collected params: [(u'status', u'ix freue mich auf die nebenan.hamburg morgen. ich spreche auch ne halbe stunde \xfcbers #indieweb und @reclaim_fm. (http://wirres.net/7773)'), (u'oauth_nonce', u'...'), (u'oauth_timestamp', u'1433539767'), (u'oauth_consumer_key', u'...'), (u'oauth_signature_method', u'HMAC-SHA1'), (u'oauth_version', u'1.0'), (u'oauth_token', u'...')]
Normalized params: oauth_consumer_key=...&oauth_nonce=...&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1433539767&oauth_token=...&oauth_version=1.0&status=ix%20freue%20mich%20auf%20die%20nebenan.hamburg%20morgen.%20ich%20spreche%20auch%20ne%20halbe%20stunde%20%C3%BCbers%20%23indieweb%20und%20%40reclaim_fm.%20%28http%3A%2F%2Fwirres.net%2F7773%29
Normalized URI: https://api.twitter.com/1.1/statuses/update.json
Base signing string: POST&https%3A%2F%2Fapi.twitter.com%2F1.1%2Fstatuses%2Fupdate.json&oauth_consumer_key%3D...
Signature: ...=
Encoding URI, headers and body to utf-8.
Generated Authorization header from access token ... 5101... and secret ...
Fetching https://api.twitter.com/1.1/statuses/update.json?status=ix+freue+mich+auf+die+nebenan.hamburg+morgen.+ich+spreche+auch+ne+halbe+stunde+%C3%BCbers+%23indieweb+und+%40reclaim_fm.+%28http%3A%2F%2Fwirres.net%2F7773%29
Error 403, response body: {"errors":[{"code":186,"message":"Status is over 140 characters."}]}
Error 403, response body: {"errors":[{"code":186,"message":"Status is over 140 characters."}]}
Error: {"errors":[{"code":186,"message":"Status is over 140 characters."}]} HTTP Error 403: Forbidden
Traceback (most recent call last):
File "/base/data/home/apps/s~brid-gy/3.384761739021503328/publish.py", line 195, in _run
  result = self.attempt_single_item(item)
File "/base/data/home/apps/s~brid-gy/3.384761739021503328/publish.py", line 313, in attempt_single_item
  result = self.source.as_source.create(obj, include_link=not omit_link)
File "/base/data/home/apps/s~brid-gy/3.384761739021503328/activitystreams/twitter.py", line 405, in create
  return self._create(obj, preview=False, include_link=include_link)
File "/base/data/home/apps/s~brid-gy/3.384761739021503328/activitystreams/twitter.py", line 589, in _create
  resp = self.urlopen(API_POST_TWEET_URL, data=data)
File "/base/data/home/apps/s~brid-gy/3.384761739021503328/activitystreams/twitter.py", line 735, in urlopen
  return request()
File "/base/data/home/apps/s~brid-gy/3.384761739021503328/activitystreams/twitter.py", line 714, in request
  url, self.access_token_key, self.access_token_secret, **kwargs)
File "/base/data/home/apps/s~brid-gy/3.384761739021503328/activitystreams/oauth_dropins/twitter_auth.py", line 60, in signed_urlopen
  timeout=timeout)
...
HTTPError: HTTP Error 403: Forbidden

Owner

snarfed commented Jun 6, 2015

my best guess so far is that twitter auto-linked the nebenan.hamburg text inside the content, and since links cost 24 chars, that put it over.
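
To make that arithmetic concrete, here is a rough sketch of the count. It is not bridgy's actual code. It assumes every link is charged a flat `LINK_COST` of 23 characters (the figure quoted later in this thread; the real t.co length has varied over time), and `weighted_length` is a hypothetical helper that is told which substrings count as links.

```python
LINK_COST = 23  # assumption: flat per-link cost; the real t.co length has varied
LIMIT = 140


def weighted_length(text, links, link_cost=LINK_COST):
    """Length of text when each string in links is charged link_cost chars.

    Hypothetical helper; assumes each link occurs exactly once in text.
    """
    length = len(text)
    for link in links:
        if text.count(link) != 1:
            raise ValueError('expected exactly one occurrence of %r' % link)
        length += link_cost - len(link)
    return length


# The status text from the log above.
status = ('ix freue mich auf die nebenan.hamburg morgen. ich spreche auch ne '
          'halbe stunde \u00fcbers #indieweb und @reclaim_fm. '
          '(http://wirres.net/7773)')

# Count with only the explicit backlink treated as a link.
ours = weighted_length(status, ['http://wirres.net/7773'])
# Count with nebenan.hamburg auto-linked as well.
theirs = weighted_length(status, ['http://wirres.net/7773', 'nebenan.hamburg'])
print(ours, theirs, LIMIT)
```

Under these assumptions the first count stays within the limit and the second goes over it, which would explain the 403.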

Collaborator

kylewm commented Jun 6, 2015

pretty sure your diagnosis is correct, and dang .hamburg appears to be a TLD (http://nic.hamburg), so I guess Twitter's linker has some pretty extensive cases for autolinking these new domains.

our regex matches TLDs that are 2-4 characters, and I get enough false positives on my site that it's annoying with just that (like mm.dd.yyyy the other day).

I guess I would open this up to 2+ characters, and then post-process to check the match against http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains or better yet http://data.iana.org/TLD/tlds-alpha-by-domain.txt

I kinda want to go ahead and add this to https://github.com/kylewm/brevity now, and see how I like it
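
A minimal sketch of that idea, assuming a deliberately loose candidate regex and a TLD set parsed from text in the format of the IANA file (a `#` comment line followed by one upper-case TLD per line). The names `parse_tld_list`, `CANDIDATE_RE`, and `find_domains` are made up for illustration and are not taken from brevity or bridgy.

```python
import re

# Candidate bare domains: dot-separated labels ending in 2+ letters.
# Deliberately loose; the TLD check below weeds out false positives.
CANDIDATE_RE = re.compile(r'\b(?:[a-z0-9-]+\.)+([a-z]{2,})\b', re.IGNORECASE)


def parse_tld_list(text):
    """Parse IANA-style text: '#' comment lines, then one TLD per line."""
    return {line.strip().lower() for line in text.splitlines()
            if line.strip() and not line.startswith('#')}


def find_domains(text, tlds):
    """Return candidate domains in text whose last label is a known TLD."""
    return [m.group(0) for m in CANDIDATE_RE.finditer(text)
            if m.group(1).lower() in tlds]


# Tiny stand-in for the real file, which is much longer.
SAMPLE_TLDS = parse_tld_list('# sample only\nCOM\nNET\nORG\nHAMBURG\n')

print(find_domains('ix freue mich auf die nebenan.hamburg morgen.', SAMPLE_TLDS))
print(find_domains('due on mm.dd.yyyy', SAMPLE_TLDS))
```

The regex alone would accept `mm.dd.yyyy`; the post-processing step rejects it because `yyyy` is not in the list, while `nebenan.hamburg` passes.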

Collaborator

kylewm commented Jun 6, 2015

diplix commented Jun 6, 2015

yup, twitter did (correctly) autoexpand nebenan.hamburg.

curiously, i could publish the post without revision via the bridgy web frontend by omitting the link back to wirres. without the backlink, the tweet is < 140 chars.

as always, there also was a mistake on my side. i should have flagged the tweet to omit the backlink, but i also miscounted the chars.

Owner

snarfed commented Jun 6, 2015

thanks guys, and nice addition to brevity @kylewm! I'm lazy and probably won't follow suit here, so that's one more reason i should start using brevity too. :P I'll update this issue in the meantime.

@snarfed changed the title from "twitter publish: hitting 140 char limit" to "add TLD list to detect all domains links and stay under twitter char limit" on Jun 6, 2015

@snarfed removed the "now" label on Jun 6, 2015

Collaborator

kylewm commented Sep 5, 2015

Here's an odd one from @myfreeweb. I would expect bridgy to autolink Matrix.org and include it in the count, but maybe the capital letter throws it off. My best guess is it's counting as 10 characters instead of 23.

Logs here: https://www.brid.gy/log?start_time=1441413320&key=aglzfmJyaWQtZ3lyXwsSDVB1Ymxpc2hlZFBhZ2UiOGh0dHBzOi8vdW5yZWxlbnRpbmcudGVjaG5vbG9neS9ub3Rlcy8yMDE1LTA5LTA1LTAwLTM1LTEzDAsSB1B1Ymxpc2gYgICAgICAgAoM

Bridgy shortener:

The Telegram Bot API is the best bot API ever. Everyone should learn from it, especially Matrix.org, which… (https://unrelenting.technology/notes/2015-09-05-00-35-13)

And just for kicks, here's what Brevity gives:

The Telegram Bot API is the best bot API ever. Everyone should learn from it, especially Matrix.org,… https://unrelenting.technology/notes/2015-09-05-00-35-13

I should update it to trim that trailing comma...
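
For the trailing comma, something along these lines might do. This is only a sketch of the trimming step with plain character counts (no link weighting), and `truncate` is a hypothetical function, not brevity's API.

```python
ELLIPSIS = '\u2026'


def truncate(text, max_len):
    """Shorten text to at most max_len chars, cutting at a word boundary."""
    if len(text) <= max_len:
        return text
    # Leave one character of room for the ellipsis.
    cut = text[:max_len - 1]
    # If the cut landed mid-word, back up to the last whole word.
    if text[max_len - 1] != ' ' and ' ' in cut:
        cut = cut[:cut.rindex(' ')]
    # Drop trailing spaces and punctuation so the result never ends in ",…".
    return cut.rstrip(' ,;:.-') + ELLIPSIS


note = 'Everyone should learn from it, especially Matrix.org, which is new'
print(truncate(note, 56))
```

Stripping punctuation after backing up to the word boundary means the ellipsis attaches directly to the last whole word.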

Owner

snarfed commented Sep 5, 2015

@kylewm good instincts, i bet it's the capital letter too. the exception email for this has actually been in my inbox for half a day. :P thanks for triaging!

Collaborator

kylewm commented Sep 5, 2015

I checked twitter and it seems to ignore the case altogether, so I followed suit. (Sorry this didn't really end up relating to the original issue)

snarfed/webutil@45bb252
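
As an illustration of why case matters here (a sketch, not the code in that commit): the pattern below is a hypothetical stand-in for a link regex, compiled with and without `re.IGNORECASE`, and the per-link cost of 23 is the figure assumed earlier in this thread.

```python
import re

LINK_COST = 23  # assumption: per-link cost quoted in this thread

# One hypothetical link pattern, compiled both ways.
PATTERN = r'\b(?:[a-z0-9-]+\.)+(?:com|net|org)\b'
case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)


def counted_length(text, link_re, link_cost=LINK_COST):
    """Length of text with every link_re match charged link_cost chars."""
    return len(link_re.sub('x' * link_cost, text))


text = 'especially Matrix.org, which'
# The case-sensitive pattern misses Matrix.org, so it is counted at face
# value (10 chars); the case-insensitive one charges it as a link.
print(counted_length(text, case_sensitive))
print(counted_length(text, case_insensitive))
```

The gap between the two counts is the link cost minus the 10 characters of `Matrix.org`, enough to push a borderline tweet over the limit.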

Owner

snarfed commented Sep 5, 2015

👍

Owner

snarfed commented Sep 5, 2016

we're now using brevity, and we also better match twitter's linkifying logic (#696). closing.

@snarfed snarfed closed this Sep 5, 2016
