Add "·" MIDDLEDOT (U+00B7) support #4

Open
jmontane opened this issue Dec 11, 2014 · 8 comments

Comments

@jmontane

Note: this issue is copied from the old "twitter-text-conformance" repo:
twitter-archive/twitter-text-conformance#63

Hi,

MIDDLE DOT (U+00B7) is widely used as inner-word punctuation in Catalan; it is a mandatory character under Catalan orthography rules. Currently Twitter doesn't allow "·" in several places, so I'm requesting better support for it in Twitter.

I requested this in the Twitter support forum and got no feedback, so I'm requesting it here. If this isn't the right place, please forward it to the Twitter L10N team.

For instance:

  1. It's not possible to create hashtags like #il·lusió
  2. It's not possible to set valid URLs like "http://www.l·l.cat" in a user's profile
  3. It's not possible to create or name a list like "al·lucinant"

About 1 and 3
There is a workaround using the legacy compatibility characters ŀ (U+0140) / Ŀ (U+013F). According to Unicode, their decompositions are preferred: l+· and L+·. So the weird effect is that you can use the legacy ŀ in hashtags (#iŀlusió works fine), but not the preferred Unicode encoding l·l (#il·lusió fails).
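To illustrate the relationship between the two encodings, here is a minimal Python sketch. It relies only on the standard `unicodedata` module; the function name `to_preferred` is mine, just for illustration. NFKC normalization expands the legacy characters into l/L followed by U+00B7, which is the form that currently fails in hashtags.

```python
import unicodedata

# "iŀlusió" written with the legacy character U+0140
LEGACY = "i\u0140lusi\u00f3"
# "il·lusió" written as l + U+00B7 (the preferred encoding)
PREFERRED = "il\u00b7lusi\u00f3"


def to_preferred(text: str) -> str:
    """Apply NFKC, which expands U+0140/U+013F to l/L + U+00B7."""
    return unicodedata.normalize("NFKC", text)


print(to_preferred(LEGACY) == PREFERRED)
```

So any text pipeline that applies NFKC turns the working hashtag into the failing one.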

About 2
MIDDLE DOT (U+00B7) is a valid character (between two Ls) in the .CAT and .ES TLDs, and it's allowed by RFC 5892.
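As I understand it, the relevant text is the contextual rule for MIDDLE DOT in RFC 5892 (Appendix A.3): U+00B7 is only allowed directly between two "l" (U+006C) characters. A rough Python sketch of that single rule follows. The function name `middle_dot_allowed` is hypothetical, the label is assumed to be already lowercased, and no other IDNA checks are done.

```python
MIDDLE_DOT = "\u00b7"


def middle_dot_allowed(label: str) -> bool:
    """Return True if every U+00B7 in the label sits directly between two 'l'."""
    for i, ch in enumerate(label):
        if ch != MIDDLE_DOT:
            continue
        # A middle dot at either end of the label has no neighbour on one side.
        if i == 0 or i == len(label) - 1:
            return False
        if label[i - 1] != "l" or label[i + 1] != "l":
            return False
    return True


print(middle_dot_allowed("l\u00b7l"))  # the label from www.l·l.cat
print(middle_dot_allowed("a\u00b7b"))  # not between two l's
```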

So, please, improve U+00B7 support in Twitter.

Thanks in advance.

Related links

@jmontane
Author

I found a new bug related to U+00B7 and Twitter. Please see this Tweet: https://twitter.com/unjoanqualsevol/status/469148413486194688 There are 2 valid and registered URLs

@jmontane
Author

Hi,

The current Unicode UAX #31 mentions U+00B7 and its use in hashtags:
http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments

Is there any improvement or roadmap about this issue?

Regards,

@montxovs

We need the middle dot to work normally in hashtags in the Catalan language.

@jmontane
Author

Hi,

Twitter now supports hashtags with middle dot (U+00B7), really good news, :)

There are still some issues with middle dot support in URLs:
1.- Twitter lists with L·L can be created now, but the URL uses a hyphen: L-L. See https://twitter.com/unjoanqualsevol/lists/l-l
2.- Twitter validates l·l in the domain part of URLs, but only if the scheme (http://) is given.
3.- Twitter breaks URL links if L·L is in the path part of the URL. Sample: https://twitter.com/BernatDedeu/status/594162637396643842

The expected behaviour in all 3 cases is the same as what accented letters (à, ç, ñ...) already get, i.e. autolinking working fine with L·L.

Please note that CMSs like WordPress don't escape the middle dot, and there are many words in the Catalan Wiktionary with L·L. See: http://ca.wiktionary.org/wiki/Categoria:Mots_en_catal%C3%A0_amb_eles_geminades

@jmontane
Author

Just to point out one more example of autolinking URLs.

See following Tweet:
https://twitter.com/XSalaimartin/status/647004755512958977

It has a link to:
http://caffereggio.net/2015/09/24/la-economia-ante-la-independencia-del-col·lectiu-wilson-en-la-vanguardia/

But Twitter's autolinker breaks on the "·" (U+00B7) character and splits the URL:
http://caffereggio.net/2015/09/24/la-economia-ante-la-independencia-del-col
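The truncation can be reproduced with a small Python sketch. To be clear, these are NOT the real twitter-text regexes; `STRICT` and `RELAXED` are simplified patterns I made up for illustration. The only difference between them is whether U+00B7 is accepted in the path character class.

```python
import re

URL = ("http://caffereggio.net/2015/09/24/"
       "la-economia-ante-la-independencia-del-col\u00b7lectiu-wilson-en-la-vanguardia/")
TWEET = "Worth reading: " + URL + " (in Spanish)"

# Hypothetical, simplified autolink patterns: ASCII host, then path characters.
STRICT = re.compile(r"https?://[A-Za-z0-9.\-]+[A-Za-z0-9\-._~/%]*")
# Same pattern, with MIDDLE DOT (U+00B7) added to the path class.
RELAXED = re.compile(r"https?://[A-Za-z0-9.\-]+[A-Za-z0-9\-._~/%\u00b7]*")

print(STRICT.search(TWEET).group(0))   # stops right before the middle dot
print(RELAXED.search(TWEET).group(0))  # keeps the whole URL
```

With the strict class the match ends at "...del-col", which is exactly the broken link shown above.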

@jmontane
Author

jmontane commented Oct 7, 2015

Just a funny effect: Twitter's autolinking feature breaks Twitter's own URLs. For instance, a link to the #L·L hashtag is automagically broken if it's copied and pasted into a Tweet:

https://twitter.com/hashtag/L·L

@twuttke
Contributor

twuttke commented Oct 7, 2015

Or, properly escaped if you copy it from the address bar of a modern
browser: https://twitter.com/hashtag/L%C2%B7L
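For reference, the escaped form is just the UTF-8 bytes of U+00B7 (C2 B7) percent-encoded. A quick sketch with Python's standard `urllib.parse`; the helper name `escape_segment` is only illustrative:

```python
from urllib.parse import quote, unquote


def escape_segment(segment: str) -> str:
    """Percent-encode one URL path segment as UTF-8."""
    return quote(segment, safe="")


escaped = escape_segment("L\u00b7L")
print("https://twitter.com/hashtag/" + escaped)
print(unquote(escaped) == "L\u00b7L")  # decoding round-trips to the original
```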

I think it's funny how people and messaging products are gradually giving
the middle finger to https://www.ietf.org/rfc/rfc1738.txt At some point,
we'll have to invent a special url termination char because we will be
allowing all other chars to be in urls.

But, in the case of the middle dot, I don't mind adding it. It is just a
matter of what is more expected by users - that it extend the url, or
terminate the url?

Is there a new RFC for what chars are allowed in urls in the age of modern
message parsers? Seems Twitter, gmail, Facebook, etc... should all agree on
these additions. Or web pages should stop exposing unescaped urls.

@jmontane
Author

jmontane commented Oct 7, 2015

Yeah! I know non-ASCII chars should be escaped but, as you point out, several web services (WordPress, Twitter...) generate URLs with such chars, so links become unusable, :(

MIDDLE DOT (U+00B7) is used as an inner-word char in Catalan. According to Unicode UAX #29 it's a MidLetter character [1] for word boundary segmentation. So it's unlikely to be used as a URL terminator.
[1] http://unicode.org/reports/tr29/#MidLetter
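A simplified sketch of what MidLetter means in practice: a middle dot flanked by letters on both sides does not break the word. This is only an approximation of the UAX #29 rules written with a plain regex, not a full implementation, and the names `WORD` and `words` are mine.

```python
import re

# One or more letters, optionally continued by U+00B7 plus more letters.
# [^\W\d_] matches any Unicode letter (word chars minus digits and underscore).
WORD = re.compile(r"[^\W\d_]+(?:\u00b7[^\W\d_]+)*")


def words(text: str) -> list:
    """Split text into words, keeping letter-flanked middle dots inside words."""
    return WORD.findall(text)


print(words("el col\u00b7lectiu Wilson"))  # middle dot stays inside the word
print(words("a \u00b7 b"))                 # a free-standing dot is not a word
```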
