Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Twitter publish preview shortening a post unnecessarily #343

Closed
kylewm opened this issue Jan 13, 2015 · 22 comments
Closed

Twitter publish preview shortening a post unnecessarily #343

kylewm opened this issue Jan 13, 2015 · 22 comments
Assignees
Labels

Comments

@kylewm
Copy link
Contributor

kylewm commented Jan 13, 2015

~ confusing screenshot removed ~

cc: @tantek

@snarfed
Copy link
Owner

snarfed commented Jan 13, 2015

hmm, that's not an apples to apples comparison. tantek's tweet includes a PSC at the end, which costs 14 chars. bridgy includes a PSL (ie link), which iirc costs 23 chars.

@kylewm
Copy link
Contributor Author

kylewm commented Jan 13, 2015

oh my bad, here is a screenshot without "include link" selected

ellipsized-tweet2

@kylewm
Copy link
Contributor Author

kylewm commented Jan 13, 2015

Dollars to donuts, the link regex recognizes ind.ie and indie.vc as URLs and thinks they will cost 23 character each.

@snarfed
Copy link
Owner

snarfed commented Jan 13, 2015

could be a good hypothesis

@kylewm
Copy link
Contributor Author

kylewm commented Jan 13, 2015

I think if that's the case, I'd vote to WONTFIX it. Even the Cassis autolinker recognizes them as links http://tantek.com/2015/013/t1/names-ind-ie-indie-vc-not-indieweb

@snarfed
Copy link
Owner

snarfed commented Jan 13, 2015

hmm. i'm actually not sure that's the bug after all. i tried manually, and it thinks it can do it without truncating:

hell:~/src/bridgy/activitystreams> python
...
>>> import twitter
>>> twitter.Twitter('', '').preview_create({'objectType': 'note', 'content': 'ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown @thomatronic'})
CreationResult(content='ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown @thomatronic', description='<span class="verb">tweet</span>:', abort=False, error_plain=None, error_html=None)

@snarfed
Copy link
Owner

snarfed commented Jan 13, 2015

aha, it's the newlines. they're triggering it somehow.

>>> twitter.Twitter('', '').preview_create({'objectType': 'note', 'content': 'Despite names,\nind.ie&indie.vc are NOT #indieweb @indiewebcamp\nindiewebcamp.com/2014-review#Indie_Term_Re-use\n@iainspad @sashtown @thomatronic'})
CreationResult(content=u'Despite names, ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown\u2026', description='<span class="verb">tweet</span>:', abort=False, error_plain=None, error_html=None)

@snarfed
Copy link
Owner

snarfed commented Jan 13, 2015

nm, it's not the newlines. i left the first line, Despite names,\n, off my first newline-less test. :guilty:

@snarfed
Copy link
Owner

snarfed commented Jan 13, 2015

anyway. regardless, @kylewm sgtm, tentatively closing as wontfix. feel free to reopen anytime if you want.

@snarfed snarfed closed this as completed Jan 13, 2015
@kylewm
Copy link
Contributor Author

kylewm commented Jan 14, 2015

agreed, thanks for helping me diagnose!

@kylewm kylewm reopened this Feb 1, 2015
@kylewm
Copy link
Contributor Author

kylewm commented Feb 1, 2015

My guess on this was wrong. Bridgy doesn't recognize the schemaless indiewebcamp.com/2014-review#Indie_Term_Re-use URL as a URL, so it counts as more characters than it should.

webutil's extract_links is a little too simplistic, it expects links to have a schema, and expects them to be separated by word boundaries (for instance util.extract_links('http://kylewm.com;http://snarfed.org') returns one long string).

The trailing punctuation will also mess with a-u's tweet shortening logic because

util.extract_links('http://kylewm.com, http://snarfed.org') == ['http://kylewm.com', 'http://snarfed.org']

but twitter._create is looking at the slightly different string tokens (note the trailing comma after kylewm.com)

['http://kylewm.com,', 'http://snarfed.org']

@kylewm
Copy link
Contributor Author

kylewm commented Feb 1, 2015

I'm gaining an appreciation for @tantek's research in this area that I admittedly did not have before. Cassis's tw_length_check counts the number of characters in this text the same way Twitter does -- it explicitly skips bare ccTLDs. So even though auto_link would linkify them, indie.vc and ind.ie are not counted as 23 characters each against the tweet length limit.

I propose we:

  1. replace the current _LINKIFY_RE with https://github.com/kylewm/cassis-autolink-py/blob/master/cassis.py#L13 -- I don't think it will have the lookahead problems the huck regex did (it is basically just a more thorough and explicit version of the one we use now)
  2. use re.findall and re.split instead of string.split to split the string up into link and non-link tokens.
  3. add logic to remove trailing punctuation for link tokens and tack it on the beginning of the next non-link token.
  4. count normal links as 22 or 23 characters. count bare ccTLDs and @-names as len(string).

This should get us pretty close to counting tweet length the way twitter does.

@snarfed
Copy link
Owner

snarfed commented Feb 1, 2015

thanks for diving onto this so deep!

that proposal all makes sense... but hoo boy, that regex is a monster. I'm not exactly jumping at the maintenance burden. I'm sensitive to diminishing returns here, and I think my goal is "just good enough," not "as good as possible."

...having said that, so far I've fixed bugs by switching to entirely new regexes just as often as I've debugged existing ones, so meh.

what the hell, sure, go for it!

one question: how twitter-specific is that regex? if it is, should we maybe keep the existing one in webutil since it's simpler and more predictable, and only use the new one in a-u/twitter?

@kylewm
Copy link
Contributor Author

kylewm commented Feb 2, 2015

I'm sensitive to diminishing returns here, and I think my goal is "just good enough," not "as good as possible."

No I'm with you on that. I think I can modify the current regex to make https?:// optional, and then ignore the obvious cases where twitter doesn't autolink ([domain-name].[2-letter-TLD]) in the length calculation, and that would cover like 98% of cases.

@snarfed
Copy link
Owner

snarfed commented Feb 6, 2015

piling on with another related but different link example that hopefully this will fix (and we should confirm). the content in @dissolve's https://ben.thatmustbe.me/note/2015/2/6/9/ is:

The next IndieWebCamp is fast approaching. If anyone is interested, head over the https://indiewebcamp.com/2015/Cambridge<br><br>(btmb.me s/7D)

we render it to:
The next IndieWebCamp is fast approaching. If anyone is interested, head over the indiewebcamp.com/2015/Cambridge... s/7D)

ie we interpret the <br><br>(btmb.me as part of the URL. we should probably not count < and > in URLs.

@snarfed
Copy link
Owner

snarfed commented Feb 10, 2015

just checking in here. @kylewm, do you think we should go ahead with tantek's regexp and try to simplify it later? or upgrade the current one like you mentioned? or something else?

@kylewm
Copy link
Contributor Author

kylewm commented Feb 10, 2015

Ah thanks for the reminder, this had fallen off my queue :p I think we should

  1. modify the current regex to accept urls without a schema
  2. explicitly skip bare two-letter TLDs with no path when doing the shortening.

I'm happy to do this if you'd like me to, but wouldn't be offended if you want to before I have a chance.

P.S. eventually I think I'd like to write a library for tweet shortening based on https://github.com/kylewm/cassis-autolink-py, and then I will lobby you for its inclusion in Bridgy :)

@snarfed
Copy link
Owner

snarfed commented Feb 10, 2015

@kylewm sgtm, definitely go ahead if you want. thank you!

@kylewm kylewm self-assigned this Feb 11, 2015
@kylewm
Copy link
Contributor Author

kylewm commented Feb 11, 2015

I started beefing up the regex to match urls without a schema. kylewm/webutil@f53386e

Based partially on Cassis, it will match punctuation at the end of a link greedily, and then do some post processing to back off (same for matching links that are already inside an <a>)

Still todo:

  • optionally skip those ccTLD domains like ind.ie and indie.vc
  • use tokenize_links in the tweet shortening algorithm.

@snarfed
Copy link
Owner

snarfed commented Feb 13, 2015

yay!

@kylewm
Copy link
Contributor Author

kylewm commented Feb 13, 2015

The code is a little labyrinthine, but I added enough test cases that I feel fairly confident it's close to correct :D

@snarfed
Copy link
Owner

snarfed commented Feb 13, 2015

agreed, the tests are what make this feel ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants