
Reimplemented JSON (un)escaping #66

Merged
merged 2 commits into from Feb 17, 2017
Conversation

kbairak
Member

@kbairak kbairak commented Feb 8, 2017

No description provided.

@coveralls

Coverage Status

Coverage decreased (-2.6%) to 93.633% when pulling 4dea4a8 on json_escaping into bb00d39 on devel.

@kbairak
Member Author

kbairak commented Feb 8, 2017

The escaper becomes the escapee... The following markdown looks better in my editor than it does here. Some characters are unescaped...

Braindump

  • Python's unicode type:

    • Python has the unicode type, which is as close to "pure" strings as we
      can hope
    • A few types of information can't be decoded into unicode strings, such as
      binary files (compressed files, images, etc); we don't have to worry about
      those in the case of JSON and most other file formats however
    • We don't have to worry about how strings go in and out of postgres, Django
      does all en/decoding behind the scenes; so we work with the assumption that
      we always store unicode strings in the database and always retrieve
      unicode string from the database
  • Uploaded file's encoding

    • We work with the assumption that the files are UTF-8-encoded and that they
      can always be decoded into unicode
    • Thus, we don't care about the specific bytes in the uploaded files, just
      the elements of the unicode strings that we are left with on the server
      side
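The decode-on-upload assumption above can be sketched as follows (`handle_upload` is an illustrative name, not from the actual codebase):

```python
# Minimal sketch of the assumption: uploaded content arrives as bytes
# and is expected to decode cleanly as UTF-8 into a unicode string.
def handle_upload(raw_bytes):
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        raise ValueError("Uploaded file is not valid UTF-8")

print(handle_upload('{"a": "ε"}'.encode('utf-8')))
```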
  • If the uploaded JSON file was hand-crafted, we don't necessarily know what
    the escaping rules used by the customer are meant to represent, so all we can
    do is best guesses

  • If the uploaded JSON file was automatically generated by gathering strings
    from another source (gettext-style), then there has to be a "pure" version of
    the string somewhere in the customer's codebase; we can only do a best guess
    as to what that was

    • For example, python's json module will by default escape non-ascii
      characters using \u notation, ensuring that the resulting json file is
      ascii-only; if we encounter something like this, we can assume that the
      file was automatically generated, though it's not clear what we should do
      about it
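The default-escaping behavior described above can be verified directly; `ensure_ascii` is the `json.dumps` flag that controls it:

```python
import json

# By default, Python's json module escapes non-ASCII characters with \u
# notation, so machine-generated files are often ASCII-only:
print(json.dumps({"a": "ε"}))                      # {"a": "\u03b5"}
print(json.dumps({"a": "ε"}, ensure_ascii=False))  # {"a": "ε"}
```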
  • In general, more than one raw (escaped) strings can result in the same rich
    (unescaped) string, depending on what un/escaping rules we will follow

    • For example,

      In [2]: json.loads('{"a": "ε"}') == json.loads('{"a": "\\u03b5"}')
      Out[2]: True
    • Thus, we always store the raw version of the strings in the database, to
      ensure no loss of information

  • Backwards compatibility:

    • Changing the un/escape functions should not introduce (I hope) any
      backwards incompatibility since uploading and downloading of files do not
      use these functions
    • The '/strings/' API endpoints use the unescape function, so they will
      be affected
  • The editor is by default on rich mode, in order to let translators not worry
    about encoding/decoding techniques.

    • This means that we load the raw string from the database, unescape it
      (convert it to rich), present it in the editor, get the string the
      translator submitted as the translation, escape it (convert it to raw) and
      save it in the database
    • The compiler will simply replace the section of the source file with
      whatever's in the database
  • If the raw source string has '\u' notation strings, I think it makes sense to
    show the actual unicode symbols to the translators; thus, unescaping should
    convert '\u' notation to proper unicode characters

  • If the translators submit non-ascii characters as part of a translation (they
    will), I think it makes sense to not apply '\u' notation during escaping;
    it's reasonable to expect that they will want the downloaded files to be
    readable.
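Both directions above are already what the stdlib `json` module does, sketched here for reference:

```python
import json

# json.loads turns \u notation into real unicode characters, and
# json.dumps with ensure_ascii=False keeps the translators' non-ASCII
# characters readable in the output.
rich = json.loads('"h\\u03b5llo"')
print(rich)                                        # hεllo
print(json.dumps('καλημέρα', ensure_ascii=False))  # "καλημέρα"
```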

  • Thus, the following scenario is possible:

    Source file:               {"a": "h\u03b5llo"}
    Raw source string:         u'h\u03b5llo'
    Rich source string:        u'hεllo'
    Rich translation:          u'καλημέρα'
    Raw translation:           u'καλημέρα'
    Downloaded language file:  {"a": "καλημέρα"}
    

    This might be confusing because the customer seems to be working with '\u'
    notation and we give him a language file with non-ascii characters; they can
    use raw mode however :p
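The scenario above can be reproduced with the `json` module (the quote-stripping with `[1:-1]` is an illustrative shortcut, not the actual implementation):

```python
import json

# The source uses \u notation, but the translation is stored and
# compiled as-is, yielding a readable non-ASCII language file.
source_file = '{"a": "h\\u03b5llo"}'
rich_source = json.loads(source_file)["a"]   # 'hεllo'
rich_translation = 'καλημέρα'                # what the translator types
raw_translation = json.dumps(rich_translation, ensure_ascii=False)[1:-1]
print('{"a": "%s"}' % raw_translation)       # {"a": "καλημέρα"}
```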

  • Official JSON specification says that a valid string
    consists of a sequence of:

    • Any UNICODE character except " or \ or a control character
    • A \ followed by one of: ", \, /, b, f, n, r, t, u + 4 hex digits

Proposal

  • Raw strings are saved as they are in the database (no changes there)
  • For rich strings, if we encounter a backslash, we use the unescaping methods
    to convert it and the following symbol(s) to their rich version:
    • '\', '"' and '/' are converted to '', '"' and '/' respectively
    • '\b', '\f', '\n', '\r' and '\t' are converted to their relevant control
      characters
    • '\u03b5' is converted to the relevant unicode character (ε)
  • For converting rich translations to raw ones, we use the reverse of the above
    strategy, only if otherwise the result would not be a valid JSON string:
    • '' ans '"' are converted to '\' and '"' respectively
    • the backspace, formfeed, newline, carriage-return and horizontal-tab
      control characters are converted to '\b', '\f', '\n', '\r' and '\t'
      respectively
    • '/' is not touched; it is valid JSON
    • '\u03b5' is converted to '\\u03b5'; it is not reverse-converted into
      unicode, but only the '\' is escaped
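The proposed rules can be sketched as a pair of functions; the names and internals here are illustrative, not the actual implementation in the PR:

```python
# Unescaping table for the raw -> rich direction.
_UNESCAPES = {'\\': '\\', '"': '"', '/': '/',
              'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}

def unescape(raw):
    """Convert a raw JSON string body to its rich (editor) form."""
    out, i = [], 0
    while i < len(raw):
        ch = raw[i]
        if ch == '\\' and i + 1 < len(raw):
            nxt = raw[i + 1]
            if nxt == 'u' and i + 6 <= len(raw):
                # \uXXXX notation becomes the actual unicode character
                out.append(chr(int(raw[i + 2:i + 6], 16)))
                i += 6
                continue
            if nxt in _UNESCAPES:
                out.append(_UNESCAPES[nxt])
                i += 2
                continue
        out.append(ch)
        i += 1
    return ''.join(out)

# Escaping table for the rich -> raw direction: only what must be
# escaped to stay a valid JSON string; '/' and non-ASCII are untouched.
_ESCAPES = {'\\': '\\\\', '"': '\\"', '\b': '\\b', '\f': '\\f',
            '\n': '\\n', '\r': '\\r', '\t': '\\t'}

def escape(rich):
    """Convert a rich string back to raw, keeping non-ASCII readable."""
    return ''.join(_ESCAPES.get(ch, ch) for ch in rich)

assert unescape('h\\u03b5llo') == 'hεllo'
assert escape('καλημέρα') == 'καλημέρα'
assert escape('\\u03b5') == '\\\\u03b5'  # only the backslash is escaped
```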

@kbairak
Member Author

kbairak commented Feb 8, 2017

End result:

[screenshots: screen shot 2017-02-08 at 19 19 45, screen shot 2017-02-08 at 19 19 52]

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 95.502% when pulling 7c61d6a on json_escaping into bb00d39 on devel.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 95.502% when pulling 550f535 on json_escaping into bb00d39 on devel.

@kbairak kbairak requested a review from ChrsMark February 9, 2017 11:29
@kbairak
Member Author

kbairak commented Feb 9, 2017

BTW, I commented/uncommented the lines that do the work using the json module and the tests behave identically. We should decide which approach to use.

@diegobz
Contributor

diegobz commented Feb 9, 2017

@kbairak What do you mean by "We should decide which approach to use"? There is only one proposal here, isn't there?

@kbairak
Member Author

kbairak commented Feb 10, 2017

hey @diegobz. If you check the code changes, there is a commented-out line of code that uses Python's json module instead of our (un)escaping methods. It seems that the two behaviors are identical (i.e. all tests pass). So the decision is whether to use our code or the json module's.
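For reference, the json-module-based alternative can be sketched like this (function names and the quote-wrapping trick are illustrative, not necessarily what the commented-out line does):

```python
import json

def unescape(raw):
    # Wrap the raw string body in quotes and let json do the unescaping.
    return json.loads('"%s"' % raw)

def escape(rich):
    # Serialize and strip the surrounding quotes; keep non-ASCII readable.
    return json.dumps(rich, ensure_ascii=False)[1:-1]

assert unescape('h\\u03b5llo') == 'hεllo'
assert escape('καλημέρα') == 'καλημέρα'
```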

@kroustou

kroustou commented Feb 17, 2017

@kbairak I support using python's json, because they rely on the rfcs and we don't have to maintain anything 💃

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 95.526% when pulling 53ad9a0 on json_escaping into b6e912c on devel.

@ChrsMark ChrsMark merged commit 748e1ae into devel Feb 17, 2017
@ChrsMark ChrsMark deleted the json_escaping branch February 17, 2017 12:38