encoding unicode values - best practice? #65

Closed
kevinmarks opened this issue Mar 18, 2016 · 7 comments

@kevinmarks
Member

If I start with

<div class="h-entry"><span class="p-name">Entity &mdash; emdash</span></div>
<div class="h-entry"><span class="p-name">unicode — emdash</span></div>

I get

{"rels": {}, 
"items": 
  [{"type": ["h-entry"], "properties": {"name": ["Entity \u2014 emdash"]}}, 
  {"type": ["h-entry"], "properties": {"name": ["unicode \u2014 emdash"]}}], 
"rel-urls": {}}

with the emdash as a \u escape sequence rather than the literal character.
If we passed ensure_ascii=False to json.dumps() we'd get

{"rels": {},
"items": 
  [{"type": ["h-entry"], "properties": {"name": ["Entity — emdash"]}}, 
  {"type": ["h-entry"], "properties": {"name": ["unicode — emdash"]}}],  
"rel-urls": {}}

Would that be more normal json? What is good practice here?
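For comparison, here is a minimal sketch of the two behaviours using only Python's standard `json` module. The `data` dict is a cut-down stand-in for the parser output above, not something mf2py produces verbatim.

```python
import json

# Cut-down stand-in for the parsed result above (illustrative only).
data = {"properties": {"name": ["unicode \u2014 emdash"]}}

# Default: ensure_ascii=True, so non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False: the em dash is written out as a literal character.
literal = json.dumps(data, ensure_ascii=False)

print(escaped)
print(literal)

# Both strings decode to exactly the same Python object.
assert json.loads(escaped) == json.loads(literal) == data
```

Either form decodes to the same data, so the choice mostly affects how readable the serialized text is and whether it has to pass through an ASCII-only channel.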

@kevinmarks
Member Author

If I make it an e- field instead, we still get \u encoding in the html, which seems off:

<div class="h-entry"><span class="e-name">Entity &mdash; emdash</span></div>
<div class="h-entry"><span class="e-name">unicode — emdash</span></div>

becomes

{"rels": {}, 
"items":
   [{"type": ["h-entry"], "properties": 
    {"name": 
      [{"html": "Entity \u2014 emdash", 
      "value": "Entity \u2014 emdash"}]}}, 
  {"type": ["h-entry"], "properties": 
    {"name": 
      [{"html": "unicode \u2014 emdash", 
      "value": "unicode \u2014 emdash"}]}}], 
"rel-urls": {}}

Is having \u escaped text in the HTML field a good idea?

@kylewm
Collaborator

kylewm commented Apr 23, 2016

I guess my expectation is that the HTML property would preserve the original markup, i.e. continue to include the entity &mdash;

I'd vote not to force the result to ASCII anymore because every system we expect to use mf2py supports UTF-8, and we don't want to subject our Russian friends to

"content": [{
  "html": "\n<p>\u0430 \u043f\u0440\u044f\u043c\u043e\u0433\u043e \u0438\u0437 \u041c\u0421\u041a \u0432 \u0442\u043e\u0447\u043a\u0443 \u043d\u0430\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u044f \u043d\u0435 \u0431\u044b\u043b\u043e?</p>\n",
   "value": "\n\u0430 \u043f\u0440\u044f\u043c\u043e\u0433\u043e \u0438\u0437 \u041c\u0421\u041a \u0432 \u0442\u043e\u0447\u043a\u0443 \u043d\u0430\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u044f \u043d\u0435 \u0431\u044b\u043b\u043e?\n"
}]

so I propose

<div class="h-entry"><span class="e-name">Entity &mdash; emdash</span></div>

should be parsed as

{"rels": {}, 
"items":
   [{"type": ["h-entry"], "properties": 
    {"name": 
      [{"html": "Entity &mdash; emdash", 
      "value": "Entity — emdash"}]}}], 
"rel-urls": {}}
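A rough sketch of what that proposal would mean for an e- property, assuming the inner markup is already available as a string. The helper `e_property` and the `_TextCollector` class are hypothetical names for illustration and are not part of mf2py; only the standard library's `html.parser` is used.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text content; character references are decoded by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def e_property(inner_html):
    """Hypothetical helper: keep the markup untouched in "html",
    and put the entity-decoded plain text in "value"."""
    collector = _TextCollector()
    collector.feed(inner_html)
    collector.close()
    return {"html": inner_html, "value": "".join(collector.chunks)}


result = e_property("Entity &mdash; emdash")
print(result)
```

Under these assumptions `result["html"]` still contains `&mdash;`, while `result["value"]` holds the decoded character U+2014.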

@kevinmarks
Member Author

kevinmarks commented Apr 25, 2016

Can you clarify that 'Russian' example? They use KOI8-r or utf8, don't they?
The JSON output is in utf8, surely?
You can't be utf8 and preserve source encoding.
Oh, hang on, utf9 was a typo, and I think we're mostly agreeing.
I think removing HTML safe entity encoding (apart from < > etc) is worth doing for the sake of uniformity.

@kylewm
Collaborator

kylewm commented Apr 25, 2016

Heh, yeah UTF-9 was an unfortunate typo. Wish GitHub would wait a tick before sending the email notification...

And yep I'm agreeing with you, except I think we should leave html entities as-is in the "html" output (precisely because there are exceptions like &lt; and &gt;, easier to just treat everything the same)

@kartikprabhu
Member

phpmf2 also encodes the &mdash; to a \u2014. So at the moment this seems fine.

@snarfed
Member

snarfed commented Jan 14, 2022

We're on Python 3 now, and mf2py now returns strs with both unicode characters (eg \u2014) and HTML entities (including &lt; and &gt;) properly decoded. It's been like that for a while, probably since the initial Python 3 port, so I'm inclined to keep it that way. Tentatively closing. If anyone has an argument for why HTML entities shouldn't be decoded, we can definitely reopen!

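(Also, afaik JSON itself isn't ASCII-only: RFC 8259 allows any Unicode character in strings and expects UTF-8 for interchange. mf2py's to_json() and Python's json.dumps() use \u-encoding because ensure_ascii defaults to True, which keeps the output safe for ASCII-only channels. ensure_ascii=False still emits valid JSON, just with the non-ASCII characters written out literally.)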
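For anyone who wants to check how Python's own parser treats the two forms, here is a small sketch using a fragment of the Cyrillic text quoted earlier in the thread. It only exercises the standard `json` module and says nothing about other parsers.

```python
import json

# First two words of the Cyrillic example above, written with \u escapes
# so this source file itself stays ASCII.
text = "\u0430 \u043f\u0440\u044f\u043c\u043e\u0433\u043e"
doc = {"value": text}

# Default output is pure ASCII with \uXXXX escapes.
escaped = json.dumps(doc).encode("ascii")

# ensure_ascii=False output, encoded as UTF-8 bytes for transport.
raw = json.dumps(doc, ensure_ascii=False).encode("utf-8")

# json.loads accepts UTF-8 bytes directly and yields the same object.
assert json.loads(escaped) == json.loads(raw) == doc

print(len(escaped), len(raw))
```

With this input the UTF-8 form is also noticeably shorter, since each Cyrillic letter takes two bytes instead of a six-byte escape.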

@snarfed snarfed closed this as completed Jan 14, 2022
@kevinmarks
Member Author

kevinmarks commented Jan 14, 2022 via email
