Twitter publish preview shortening a post unnecessarily #343

kylewm · 2015-01-13T21:41:52Z

source post: http://tantek.com/2015/013/t1/names-ind-ie-indie-vc-not-indieweb
manually POSSEd tweet demonstrates no shortening necessary https://twitter.com/t/status/555115245405229056

~ confusing screenshot removed ~

snarfed · 2015-01-13T21:51:06Z

hmm, that's not an apples to apples comparison. tantek's tweet includes a PSC at the end, which costs 14 chars. bridgy includes a PSL (ie link), which iirc costs 23 chars.

kylewm · 2015-01-13T21:54:04Z

oh my bad, here is a screenshot without "include link" selected

kylewm · 2015-01-13T21:55:13Z

Dollars to donuts, the link regex recognizes ind.ie and indie.vc as URLs and thinks they will cost 23 character each.

snarfed · 2015-01-13T21:55:58Z

could be a good hypothesis

kylewm · 2015-01-13T21:59:37Z

I think if that's the case, I'd vote to WONTFIX it. Even the Cassis autolinker recognizes them as links http://tantek.com/2015/013/t1/names-ind-ie-indie-vc-not-indieweb

snarfed · 2015-01-13T22:48:20Z

hmm. i'm actually not sure that's the bug after all. i tried manually, and it thinks it can do it without truncating:

hell:~/src/bridgy/activitystreams> python
...
>>> import twitter
>>> twitter.Twitter('', '').preview_create({'objectType': 'note', 'content': 'ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown @thomatronic'})
CreationResult(content='ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown @thomatronic', description='<span class="verb">tweet</span>:', abort=False, error_plain=None, error_html=None)

snarfed · 2015-01-13T22:55:03Z

aha, it's the newlines. they're triggering it somehow.

>>> twitter.Twitter('', '').preview_create({'objectType': 'note', 'content': 'Despite names,\nind.ie&indie.vc are NOT #indieweb @indiewebcamp\nindiewebcamp.com/2014-review#Indie_Term_Re-use\n@iainspad @sashtown @thomatronic'})
CreationResult(content=u'Despite names, ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown\u2026', description='<span class="verb">tweet</span>:', abort=False, error_plain=None, error_html=None)

snarfed · 2015-01-13T22:58:47Z

nm, it's not the newlines. i left the first line, Despite names,\n, off my first newline-less test. :guilty:

snarfed · 2015-01-13T23:03:07Z

anyway. regardless, @kylewm sgtm, tentatively closing as wontfix. feel free to reopen anytime if you want.

kylewm · 2015-01-14T04:40:15Z

agreed, thanks for helping me diagnose!

kylewm · 2015-02-01T06:16:19Z

My guess on this was wrong. Bridgy doesn't recognize the schemaless indiewebcamp.com/2014-review#Indie_Term_Re-use URL as a URL, so it counts as more characters than it should.

webutil's extract_links is a little too simplistic, it expects links to have a schema, and expects them to be separated by word boundaries (for instance util.extract_links('http://kylewm.com;http://snarfed.org') returns one long string).

The trailing punctuation will also mess with a-u's tweet shortening logic because

util.extract_links('http://kylewm.com, http://snarfed.org') == ['http://kylewm.com', 'http://snarfed.org']

but twitter._create is looking at the slightly different string tokens (note the trailing comma after kylewm.com)

['http://kylewm.com,', 'http://snarfed.org']

kylewm · 2015-02-01T20:31:10Z

I'm gaining an appreciation for @tantek's research in this area that I admittedly did not have before. Cassis's tw_length_check counts the number of characters in this text the same way Twitter does -- it explicitly skips bare ccTLDs. So even though auto_link would linkify them, indie.vc and ind.ie are not counted as 23 characters each against the tweet length limit.

I propose we:

replace the current _LINKIFY_RE with https://github.com/kylewm/cassis-autolink-py/blob/master/cassis.py#L13 -- I don't think it will have the lookahead problems the huck regex did (it is basically just a more thorough and explicit version of the one we use now)
use re.findall and re.split instead of string.split to split the string up into link and non-link tokens.
add logic to remove trailing punctuation for link tokens and tack it on the beginning of the next non-link token.
count normal links as 22 or 23 characters. count bare ccTLDs and @-names as len(string).

This should get us pretty close to counting tweet length the way twitter does.

snarfed · 2015-02-01T23:31:09Z

thanks for diving onto this so deep!

that proposal all makes sense... but hoo boy, that regex is a monster. I'm not exactly jumping at the maintenance burden. I'm sensitive to diminishing returns here, and I think my goal is "just good enough," not "as good as possible."

...having said that, so far I've fixed bugs by switching to entirely new regexes just as often as I've debugged existing ones, so meh.

what the hell, sure, go for it!

one question: how twitter-specific is that regex? if it is, should we maybe keep the existing one in webutil since it's simpler and more predictable, and only use the new one in a-u/twitter?

kylewm · 2015-02-02T04:42:14Z

I'm sensitive to diminishing returns here, and I think my goal is "just good enough," not "as good as possible."

No I'm with you on that. I think I can modify the current regex to make https?:// optional, and then ignore the obvious cases where twitter doesn't autolink ([domain-name].[2-letter-TLD]) in the length calculation, and that would cover like 98% of cases.

snarfed · 2015-02-06T21:37:01Z

piling on with another related but different link example that hopefully this will fix (and we should confirm). the content in @dissolve's https://ben.thatmustbe.me/note/2015/2/6/9/ is:

The next IndieWebCamp is fast approaching. If anyone is interested, head over the https://indiewebcamp.com/2015/Cambridge<br><br>(btmb.me s/7D)

we render it to:
The next IndieWebCamp is fast approaching. If anyone is interested, head over the indiewebcamp.com/2015/Cambridge... s/7D)

ie we interpret the <br><br>(btmb.me as part of the URL. we should probably not count < and > in URLs.

snarfed · 2015-02-10T19:21:41Z

just checking in here. @kylewm, do you think we should go ahead with tantek's regexp and try to simplify it later? or upgrade the current one like you mentioned? or something else?

kylewm · 2015-02-10T19:32:11Z

Ah thanks for the reminder, this had fallen off my queue :p I think we should

modify the current regex to accept urls without a schema
explicitly skip bare two-letter TLDs with no path when doing the shortening.

I'm happy to do this if you'd like me to, but wouldn't be offended if you want to before I have a chance.

P.S. eventually I think I'd like to write a library for tweet shortening based on https://github.com/kylewm/cassis-autolink-py, and then I will lobby you for its inclusion in Bridgy :)

snarfed · 2015-02-10T21:33:18Z

@kylewm sgtm, definitely go ahead if you want. thank you!

kylewm · 2015-02-11T16:24:02Z

I started beefing up the regex to match urls without a schema. kylewm/webutil@f53386e

Based partially on Cassis, it will match punctuation at the end of a link greedily, and then do some post processing to back off (same for matching links that are already inside an <a>)

Still todo:

optionally skip those ccTLD domains like ind.ie and indie.vc
use tokenize_links in the tweet shortening algorithm.

snarfed · 2015-02-13T00:09:49Z

yay!

kylewm · 2015-02-13T00:10:13Z

The code is a little labyrinthine, but I added enough test cases that I feel fairly confident it's close to correct :D

snarfed · 2015-02-13T00:11:05Z

agreed, the tests are what make this feel ok.

kylewm added now publish labels Jan 13, 2015

snarfed closed this as completed Jan 13, 2015

kylewm reopened this Feb 1, 2015

kylewm self-assigned this Feb 11, 2015

kylewm closed this as completed in snarfed/granary@d2e8821 Feb 13, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Twitter publish preview shortening a post unnecessarily #343

Twitter publish preview shortening a post unnecessarily #343

kylewm commented Jan 13, 2015

snarfed commented Jan 13, 2015

kylewm commented Jan 13, 2015

kylewm commented Jan 13, 2015

snarfed commented Jan 13, 2015

kylewm commented Jan 13, 2015

snarfed commented Jan 13, 2015

snarfed commented Jan 13, 2015

snarfed commented Jan 13, 2015

snarfed commented Jan 13, 2015

kylewm commented Jan 14, 2015

kylewm commented Feb 1, 2015

kylewm commented Feb 1, 2015

snarfed commented Feb 1, 2015

kylewm commented Feb 2, 2015

snarfed commented Feb 6, 2015

snarfed commented Feb 10, 2015

kylewm commented Feb 10, 2015

snarfed commented Feb 10, 2015

kylewm commented Feb 11, 2015

snarfed commented Feb 13, 2015

kylewm commented Feb 13, 2015

snarfed commented Feb 13, 2015

Twitter publish preview shortening a post unnecessarily #343

Twitter publish preview shortening a post unnecessarily #343

Comments

kylewm commented Jan 13, 2015

snarfed commented Jan 13, 2015

kylewm commented Jan 13, 2015

kylewm commented Jan 13, 2015

snarfed commented Jan 13, 2015

kylewm commented Jan 13, 2015

snarfed commented Jan 13, 2015

snarfed commented Jan 13, 2015

snarfed commented Jan 13, 2015

snarfed commented Jan 13, 2015

kylewm commented Jan 14, 2015

kylewm commented Feb 1, 2015

kylewm commented Feb 1, 2015

snarfed commented Feb 1, 2015

kylewm commented Feb 2, 2015

snarfed commented Feb 6, 2015

snarfed commented Feb 10, 2015

kylewm commented Feb 10, 2015

snarfed commented Feb 10, 2015

kylewm commented Feb 11, 2015

snarfed commented Feb 13, 2015

kylewm commented Feb 13, 2015

snarfed commented Feb 13, 2015