
Reimplemented JSON (un)escaping #66

Merged
merged 2 commits into from Feb 17, 2017
Conversation

kbairak
Member

@kbairak kbairak commented Feb 8, 2017

No description provided.

@coveralls

Coverage Status

Coverage decreased (-2.6%) to 93.633% when pulling 4dea4a8 on json_escaping into bb00d39 on devel.

@kbairak
Member Author

kbairak commented Feb 8, 2017

The escaper becomes the escapee... The following markdown looks better in my editor than it does here. Some characters are unescaped...

Braindump

  • Python's unicode type:

    • Python has the unicode type, which is as close to "pure" strings as we
      can hope
    • A few types of information can't be decoded into unicode strings, such as
      binary files (compressed files, images, etc); we don't have to worry about
      those in the case of JSON and most other file formats however
    • We don't have to worry about how strings go in and out of postgres, Django
      does all en/decoding behind the scenes; so we work with the assumption that
      we always store unicode strings in the database and always retrieve
      unicode string from the database
  • Uploaded file's encoding

    • We work with the assumption that the files are UTF-8-encoded and that they
      can always be decoded into unicode
    • Thus, we don't care about the specific bytes in the uploaded files, just
      the elements of the unicode strings that we are left with on the server
      side
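The decode-on-upload assumption above can be sketched as follows (`handle_upload` is an illustrative name, not from the actual codebase):

```python
# Minimal sketch of the assumption: uploaded content arrives as bytes
# and is expected to decode cleanly as UTF-8 into a unicode string.
def handle_upload(raw_bytes):
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        raise ValueError("Uploaded file is not valid UTF-8")

print(handle_upload('{"a": "ε"}'.encode('utf-8')))
```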
  • If the uploaded JSON file was hand-crafted, we don't necessarily know what
    the escaping rules used by the customer are meant to represent, so all we can
    do is best guesses

  • If the uploaded JSON file was automatically generated by gathering strings
    from another source (gettext-style), then there has to be a "pure" version of
    the string somewhere in the customer's codebase; we can only do a best guess
    as to what that was

    • For example, python's json module will by default escape non-ascii
      characters using \u notation, ensuring that the resulting json file is
      ascii-only; if we encounter something like this, we can assume that the
      file was automatically generated, though it's not clear what we should do
      about it
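The default-escaping behavior described above can be verified directly; `ensure_ascii` is the `json.dumps` flag that controls it:

```python
import json

# By default, Python's json module escapes non-ASCII characters with \u
# notation, so machine-generated files are often ASCII-only:
print(json.dumps({"a": "ε"}))                      # {"a": "\u03b5"}
print(json.dumps({"a": "ε"}, ensure_ascii=False))  # {"a": "ε"}
```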
  • In general, more than one raw (escaped) strings can result in the same rich
    (unescaped) string, depending on what un/escaping rules we will follow

    • For example,

      In [2]: json.loads('{"a": "ε"}') == json.loads('{"a": "\\u03b5"}')
      Out[2]: True
    • Thus, we always store the raw version of the strings in the database, to
      ensure no loss of information

  • Backwards compatibility:

    • Changing the un/escape functions should not introduce (I hope) any
      backwards incompatibility since uploading and downloading of files do not
      use these functions
    • The '/strings/' API endpoints use the unescape function, so they will
      be affected
  • The editor is by default on rich mode, in order to let translators not worry
    about encoding/decoding techniques.

    • This means that we load the raw string from the database, unescape it
      (convert it to rich), present it in the editor, get the string the
      translator submitted as the translation, escape it (convert it to raw) and
      save it in the database
    • The compiler will simply replace the section of the source file with
      whatever's in the database
  • If the raw source string has '\u' notation strings, I think it makes sense to
    show the actual unicode symbols to the translators; thus, unescaping should
    convert '\u' notation to proper unicode characters

  • If the translators submit non-ascii characters as part of a translation (they
    will), I think it makes sense to not apply '\u' notation during escaping;
    it's reasonable to expect that they will want the downloaded files to be
    readable.
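Both directions above are already what the stdlib `json` module does, sketched here for reference:

```python
import json

# json.loads turns \u notation into real unicode characters, and
# json.dumps with ensure_ascii=False keeps the translators' non-ASCII
# characters readable in the output.
rich = json.loads('"h\\u03b5llo"')
print(rich)                                        # hεllo
print(json.dumps('καλημέρα', ensure_ascii=False))  # "καλημέρα"
```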

  • Thus, the following scenario is possible:

    Source file:               {"a": "h\u03b5llo"}
    Raw source string:         u'h\u03b5llo'
    Rich source string:        u'hεllo'
    Rich translation:          u'καλημέρα'
    Raw translation:           u'καλημέρα'
    Downloaded language file:  {"a": "καλημέρα"}
    

    This might be confusing because the customer seems to be working with '\u'
    notation and we give him a language file with non-ascii characters; they can
    use raw mode however :p
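The scenario above can be reproduced with the `json` module (the quote-stripping with `[1:-1]` is an illustrative shortcut, not the actual implementation):

```python
import json

# The source uses \u notation, but the translation is stored and
# compiled as-is, yielding a readable non-ASCII language file.
source_file = '{"a": "h\\u03b5llo"}'
rich_source = json.loads(source_file)["a"]   # 'hεllo'
rich_translation = 'καλημέρα'                # what the translator types
raw_translation = json.dumps(rich_translation, ensure_ascii=False)[1:-1]
print('{"a": "%s"}' % raw_translation)       # {"a": "καλημέρα"}
```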

  • Official JSON specification says that a valid string
    consists of a sequence of:

    • Any UNICODE character except " or \ or a control character
    • A \ followed by one of: ", \, /, b, f, n, r, t, u + 4 hex digits

Proposal

  • Raw strings are saved as they are in the database (no changes there)
  • For rich strings, if we encounter a backslash, we use the unescaping methods
    to convert it and the following symbol(s) to their rich version:
    • '\', '"' and '/' are converted to '', '"' and '/' respectively
    • '\b', '\f', '\n', '\r' and '\t' are converted to their relevant control
      characters
    • '\u03b5' is converted to the relevant unicode character (ε)
  • For converting rich translations to raw ones, we use the reverse of the above
    strategy, only if otherwise the result would not be a valid JSON string:
    • '' ans '"' are converted to '\' and '"' respectively
    • the backspace, formfeed, newline, carriage-return and horizontal-tab
      control characters are converted to '\b', '\f', '\n', '\r' and '\t'
      respectively
    • '/' is not touched; it is valid JSON
    • '\u03b5' is converted to '\\u03b5'; it is not reverse-converted into
      unicode, but only the '\' is escaped
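The proposed rules can be sketched as a pair of functions; the names and internals here are illustrative, not the actual implementation in the PR:

```python
# Unescaping table for the raw -> rich direction.
_UNESCAPES = {'\\': '\\', '"': '"', '/': '/',
              'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}

def unescape(raw):
    """Convert a raw JSON string body to its rich (editor) form."""
    out, i = [], 0
    while i < len(raw):
        ch = raw[i]
        if ch == '\\' and i + 1 < len(raw):
            nxt = raw[i + 1]
            if nxt == 'u' and i + 6 <= len(raw):
                # \uXXXX notation becomes the actual unicode character
                out.append(chr(int(raw[i + 2:i + 6], 16)))
                i += 6
                continue
            if nxt in _UNESCAPES:
                out.append(_UNESCAPES[nxt])
                i += 2
                continue
        out.append(ch)
        i += 1
    return ''.join(out)

# Escaping table for the rich -> raw direction: only what must be
# escaped to stay a valid JSON string; '/' and non-ASCII are untouched.
_ESCAPES = {'\\': '\\\\', '"': '\\"', '\b': '\\b', '\f': '\\f',
            '\n': '\\n', '\r': '\\r', '\t': '\\t'}

def escape(rich):
    """Convert a rich string back to raw, keeping non-ASCII readable."""
    return ''.join(_ESCAPES.get(ch, ch) for ch in rich)

assert unescape('h\\u03b5llo') == 'hεllo'
assert escape('καλημέρα') == 'καλημέρα'
assert escape('\\u03b5') == '\\\\u03b5'  # only the backslash is escaped
```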

@kbairak
Member Author

kbairak commented Feb 8, 2017

End result:

[screenshots: screen shot 2017-02-08 at 19 19 45, screen shot 2017-02-08 at 19 19 52]

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 95.502% when pulling 7c61d6a on json_escaping into bb00d39 on devel.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 95.502% when pulling 550f535 on json_escaping into bb00d39 on devel.

@kbairak kbairak requested a review from ChrsMark February 9, 2017 11:29
@kbairak
Member Author

kbairak commented Feb 9, 2017

BTW, I commented/uncommented the lines that do the work using the json module and the tests behave identically. We should decide which approach to use.

@diegobz
Contributor

diegobz commented Feb 9, 2017

@kbairak What do you mean by "We should decide which approach to use"? There is only one proposal here, isn't there?

@kbairak
Member Author

kbairak commented Feb 10, 2017

hey @diegobz. If you check the code changes, there is a commented-out line of code that uses Python's json module instead of our (un)escaping methods. It seems that the two behaviors are identical (i.e. all tests pass). So the decision is whether to use our code or the json module's.
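For reference, the json-module-based alternative can be sketched like this (function names and the quote-wrapping trick are illustrative, not necessarily what the commented-out line does):

```python
import json

def unescape(raw):
    # Wrap the raw string body in quotes and let json do the unescaping.
    return json.loads('"%s"' % raw)

def escape(rich):
    # Serialize and strip the surrounding quotes; keep non-ASCII readable.
    return json.dumps(rich, ensure_ascii=False)[1:-1]

assert unescape('h\\u03b5llo') == 'hεllo'
assert escape('καλημέρα') == 'καλημέρα'
```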

@kroustou

kroustou commented Feb 17, 2017

@kbairak I support using python's json, because they rely on the rfcs and we don't have to maintain anything 💃

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 95.526% when pulling 53ad9a0 on json_escaping into b6e912c on devel.

@ChrsMark ChrsMark merged commit 748e1ae into devel Feb 17, 2017
@ChrsMark ChrsMark deleted the json_escaping branch February 17, 2017 12:38