New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Twitter publish preview shortening a post unnecessarily #343
Comments
hmm, that's not an apples to apples comparison. tantek's tweet includes a PSC at the end, which costs 14 chars. bridgy includes a PSL (ie link), which iirc costs 23 chars. |
Dollars to donuts, the link regex recognizes ind.ie and indie.vc as URLs and thinks they will cost 23 character each. |
could be a good hypothesis |
I think if that's the case, I'd vote to WONTFIX it. Even the Cassis autolinker recognizes them as links http://tantek.com/2015/013/t1/names-ind-ie-indie-vc-not-indieweb |
hmm. i'm actually not sure that's the bug after all. i tried manually, and it thinks it can do it without truncating: hell:~/src/bridgy/activitystreams> python
...
>>> import twitter
>>> twitter.Twitter('', '').preview_create({'objectType': 'note', 'content': 'ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown @thomatronic'})
CreationResult(content='ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown @thomatronic', description='<span class="verb">tweet</span>:', abort=False, error_plain=None, error_html=None) |
aha, it's the newlines. they're triggering it somehow. >>> twitter.Twitter('', '').preview_create({'objectType': 'note', 'content': 'Despite names,\nind.ie&indie.vc are NOT #indieweb @indiewebcamp\nindiewebcamp.com/2014-review#Indie_Term_Re-use\n@iainspad @sashtown @thomatronic'})
CreationResult(content=u'Despite names, ind.ie&indie.vc are NOT #indieweb @indiewebcamp indiewebcamp.com/2014-review#Indie_Term_Re-use @iainspad @sashtown\u2026', description='<span class="verb">tweet</span>:', abort=False, error_plain=None, error_html=None) |
nm, it's not the newlines. i left the first line, |
anyway. regardless, @kylewm sgtm, tentatively closing as wontfix. feel free to reopen anytime if you want. |
agreed, thanks for helping me diagnose! |
My guess on this was wrong. Bridgy doesn't recognize the schemaless indiewebcamp.com/2014-review#Indie_Term_Re-use URL as a URL, so it counts as more characters than it should. webutil's extract_links is a little too simplistic, it expects links to have a schema, and expects them to be separated by word boundaries (for instance The trailing punctuation will also mess with a-u's tweet shortening logic because util.extract_links('http://kylewm.com, http://snarfed.org') == ['http://kylewm.com', 'http://snarfed.org'] but twitter._create is looking at the slightly different string tokens (note the trailing comma after kylewm.com) ['http://kylewm.com,', 'http://snarfed.org'] |
I'm gaining an appreciation for @tantek's research in this area that I admittedly did not have before. Cassis's tw_length_check counts the number of characters in this text the same way Twitter does -- it explicitly skips bare ccTLDs. So even though auto_link would linkify them, indie.vc and ind.ie are not counted as 23 characters each against the tweet length limit. I propose we:
This should get us pretty close to counting tweet length the way twitter does. |
thanks for diving onto this so deep! that proposal all makes sense... but hoo boy, that regex is a monster. I'm not exactly jumping at the maintenance burden. I'm sensitive to diminishing returns here, and I think my goal is "just good enough," not "as good as possible." ...having said that, so far I've fixed bugs by switching to entirely new regexes just as often as I've debugged existing ones, so meh. what the hell, sure, go for it! one question: how twitter-specific is that regex? if it is, should we maybe keep the existing one in webutil since it's simpler and more predictable, and only use the new one in a-u/twitter? |
No I'm with you on that. I think I can modify the current regex to make https?:// optional, and then ignore the obvious cases where twitter doesn't autolink ([domain-name].[2-letter-TLD]) in the length calculation, and that would cover like 98% of cases. |
piling on with another related but different link example that hopefully this will fix (and we should confirm). the content in @dissolve's https://ben.thatmustbe.me/note/2015/2/6/9/ is:
we render it to: ie we interpret the |
just checking in here. @kylewm, do you think we should go ahead with tantek's regexp and try to simplify it later? or upgrade the current one like you mentioned? or something else? |
Ah thanks for the reminder, this had fallen off my queue :p I think we should
I'm happy to do this if you'd like me to, but wouldn't be offended if you want to before I have a chance. P.S. eventually I think I'd like to write a library for tweet shortening based on https://github.com/kylewm/cassis-autolink-py, and then I will lobby you for its inclusion in Bridgy :) |
@kylewm sgtm, definitely go ahead if you want. thank you! |
I started beefing up the regex to match urls without a schema. kylewm/webutil@f53386e Based partially on Cassis, it will match punctuation at the end of a link greedily, and then do some post processing to back off (same for matching links that are already inside an <a>) Still todo:
|
yay! |
The code is a little labyrinthine, but I added enough test cases that I feel fairly confident it's close to correct :D |
agreed, the tests are what make this feel ok. |
~ confusing screenshot removed ~
cc: @tantek
The text was updated successfully, but these errors were encountered: