🍄 [[ Builder ]] Further work on encoding/decoding module #1754

peter-b · 2015-01-30T15:14:37Z

This pull request is for continued work on encoding/decoding support for Builder, based on @livecodeali's progress in #1652.

In particular, I have:

Added support for multi-byte replacements when encoding (only ASCII and native encodings affected)
Added support for multi-character replacements when decoding ASCII
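To make the decoding case concrete, here is a minimal sketch of multi-character replacement when decoding ASCII. It is not the libfoundation code: `SketchDecodeAscii` is a hypothetical name, and the use of `std::u16string` as a stand-in for the engine's string type is an assumption made to keep the example self-contained.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper for illustration; not the libfoundation API.
// Decodes `bytes` as ASCII into UTF-16 code units. Every byte above 127 is
// invalid ASCII and is substituted with the whole `replacement` string,
// which may be empty or may itself contain non-ASCII characters.
std::u16string SketchDecodeAscii(const std::vector<uint8_t>& bytes,
                                 const std::u16string& replacement)
{
    std::u16string out;
    for (uint8_t byte : bytes)
    {
        if (byte < 0x80)
            out.push_back(static_cast<char16_t>(byte));
        else
            out += replacement;
    }
    return out;
}
```

With this shape, an empty replacement simply drops the invalid bytes, and a longer replacement expands each invalid byte into several characters.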

Not done yet

Support for multi-character replacements when decoding encodings other than ASCII (I can't deal with this until the issues listed below are resolved).

Unresolved issues

There are still some unresolved issues when decoding UTF-8, UTF-16 and UTF-32.

When decoding UTF-8, we still use UTF8ToUnicode(), which detects and ignores errors with helpful comments like:
```
foundation-legacy.cpp:401: // This is an error
```
Should there be a MCStringCreateWithUTF8CharsAndReplacement() function? My current feeling is that MCStringCreateWithBytesAndReplacement() should be able to detect UTF-8 encoding errors and thus a replacement for UTF8ToUnicode() is needed.
When decoding UTF-16 and UTF-32, what should be done with trailing bytes? Should they be ignored, should they cause a replacement string to be inserted, or should they always cause an error? @livecodeali and I have discussed it, and have had difficulty coming to a solid conclusion.
When decoding UTF-16, what should be done about unpaired surrogates? Given that we can safely round-trip an unpaired surrogate through LiveCode's internal representation, I don't think that it makes sense to consider them to be an encoding error.
What should be done about UTF-16 encoded as UTF-8? @runrevmark brought this up in the previous pull request.

Several of these issues could be solved by allowing flags to be passed to the codec functions, at the cost of making the API even more complicated.
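As a rough illustration of the flags idea, the sketch below applies it to the trailing-bytes question for UTF-32. Everything here is hypothetical: the `SketchTrailingBytes` enum, the function name, the little-endian byte order and the `std::u32string` output type are assumptions for the example, not an existing or proposed libfoundation signature.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical policy flag covering the three options under discussion.
enum class SketchTrailingBytes { Ignore, Replace, Fail };

// Hypothetical decoder: reads little-endian UTF-32 and handles a trailing
// partial code unit according to `policy`. Range validation of the decoded
// values is deliberately omitted to keep the sketch short.
bool SketchDecodeUtf32LE(const std::vector<uint8_t>& bytes,
                         SketchTrailingBytes policy,
                         const std::u32string& replacement,
                         std::u32string& r_text)
{
    std::u32string out;
    size_t i = 0;
    for (; i + 4 <= bytes.size(); i += 4)
    {
        uint32_t value = static_cast<uint32_t>(bytes[i])
                       | (static_cast<uint32_t>(bytes[i + 1]) << 8)
                       | (static_cast<uint32_t>(bytes[i + 2]) << 16)
                       | (static_cast<uint32_t>(bytes[i + 3]) << 24);
        out.push_back(static_cast<char32_t>(value));
    }

    // One to three bytes are left over: apply the caller's policy.
    if (i < bytes.size())
    {
        if (policy == SketchTrailingBytes::Fail)
            return false;
        if (policy == SketchTrailingBytes::Replace)
            out += replacement;
    }

    r_text = out;
    return true;
}
```

The extra parameter is exactly the API cost mentioned above: every caller has to pick a policy, even when it doesn't care.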

* Return true on success/false on failure from MCStringGetNativeChars() and MCStringGetNativeCharsWithReplacement().
* Add an in parameter for size of character buffer provided and an out parameter for number of characters used/needed.
* Allow multi-byte replacement sequences for characters that can't be represented in the native encoding, and add an explicit in parameter for the replacement sequence length.
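The calling convention described in these bullets might look roughly like the sketch below. This is not the actual function: `SketchGetNativeChars` is a hypothetical name, the native encoding is assumed to be ISO-8859-1, `std::u16string` stands in for the engine's string type, and treating "buffer too small" as the failure case is my assumption about what failure means.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical illustration of the revised calling convention.
// Writes at most `buffer_size` bytes to `buffer`, substituting the
// `replacement_length`-byte `replacement` sequence for every code unit that
// has no ISO-8859-1 equivalent. `r_count` always receives the number of
// bytes needed; the result is true only if everything fitted. When it did
// not fit, `buffer` holds a truncated prefix.
bool SketchGetNativeChars(const std::u16string& text,
                          uint8_t* buffer, size_t buffer_size,
                          const uint8_t* replacement, size_t replacement_length,
                          size_t& r_count)
{
    size_t needed = 0;
    for (char16_t unit : text)
    {
        if (unit <= 0xFF)
        {
            if (needed < buffer_size)
                buffer[needed] = static_cast<uint8_t>(unit);
            needed += 1;
        }
        else
        {
            for (size_t i = 0; i < replacement_length; i++)
            {
                if (needed + i < buffer_size)
                    buffer[needed + i] = replacement[i];
            }
            needed += replacement_length;
        }
    }
    r_count = needed;
    return needed <= buffer_size;
}
```

A caller that doesn't know the required size can pass a zero-length buffer first, read `r_count`, allocate, and call again.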

…acement(). Allow zero-byte and multibyte replacement sequences. Note that it is *not* an error for the replacement array to contain non-native (or non-ASCII) values. **NOTE**: This patch means that "ASCII" encoding **really is** ASCII encoding, not the native encoding discarding bytes > 127.

…ding. The "native" encoding available to script programs is rarely correct for modern systems, e.g.:

* Almost all Linux systems use UTF-8, not ISO-8859-1.
* All Macs use UTF-8, not MacRoman.

Block all Builder programs from using "native" to specify an encoding. If someone's absolutely determined to get "native encoding exactly as in script", they can use "-native" to bypass the check.
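The check amounts to something like the following sketch. The enum, the function name and the other encoding name strings are hypothetical and chosen only for illustration; the one behaviour taken from the commit message is that "native" is refused while "-native" is accepted.

```cpp
#include <string>

// Hypothetical set of encodings, for illustration only.
enum class SketchEncoding { Ascii, Utf8, Utf16, Utf32, Native };

// Hypothetical name lookup. Returns false for "native" (blocked on purpose)
// and for any unrecognised name; "-native" is the explicit opt-in.
bool SketchResolveEncoding(const std::string& name, SketchEncoding& r_encoding)
{
    if (name == "ASCII")   { r_encoding = SketchEncoding::Ascii;  return true; }
    if (name == "UTF-8")   { r_encoding = SketchEncoding::Utf8;   return true; }
    if (name == "UTF-16")  { r_encoding = SketchEncoding::Utf16;  return true; }
    if (name == "UTF-32")  { r_encoding = SketchEncoding::Utf32;  return true; }
    if (name == "-native") { r_encoding = SketchEncoding::Native; return true; }
    return false;
}
```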

peter-b · 2015-02-02T15:25:41Z

Discussion with @runrevmark today (offline).

About the unresolved issues: what determines a decoding error is whether we encounter something while decoding that we can't represent internally. That means:

For UTF-8, incomplete or too-long sequences get replaced. But we allow short sequences (as in Modified UTF-8) or surrogates (as in UTF-16-encoded-as-UTF-8). We'll have to add a new function, e.g. MCStringCreateWithUtf8BytesAndReplacement().
For UTF-16 & UTF-32, trailing bytes get replaced.
Unpaired surrogates get accepted silently.
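For UTF-16, the two rules above could be sketched as follows. This is an illustration under assumptions, not the planned implementation: the function name is hypothetical, the input is assumed to be little-endian with no byte order mark, and `std::u16string` stands in for the internal representation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical decoder for illustration; not the libfoundation API.
// Every complete 16-bit code unit is copied through unchanged, so unpaired
// surrogates are accepted silently. A single trailing byte cannot form a
// code unit and is substituted with `replacement`.
std::u16string SketchDecodeUtf16LE(const std::vector<uint8_t>& bytes,
                                   const std::u16string& replacement)
{
    std::u16string out;
    size_t i = 0;
    for (; i + 1 < bytes.size(); i += 2)
        out.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));

    // Odd byte count: one byte is left over.
    if (i < bytes.size())
        out += replacement;

    return out;
}
```

Because nothing is rejected except the trailing byte, any UTF-16 input of even length round-trips exactly.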

A couple of additional things were brought up:

We should convert by grapheme rather than by unichar_t. This may require changes to the way we do ASCII and "native" encoding. Unfortunately this is likely to be a massive amount of work.
We need to clean up the MCString* API and hide as many of the encoding-specific functions as possible. Also, get rid of the external_rep flag wherever feasible.

We need to decide how far we want to go with all of the above before the next release.

runrevali and others added 18 commits January 22, 2015 11:41

[[ LC Builder ]] Implement various things in encoding module

cbad3ae

Conflicts: engine/src/modules.cpp libscript/libscript.xcodeproj/project.pbxproj libscript/libstdscript-modules.list toolchain/lc-compile/lc-compile.xcodeproj/project.pbxproj toolchain/lc-compile/src/module-helper.cpp

[[ LC Builder ]] Fix typo in encode using base64 docs

a7686fe

[[ LCB StdLib ]] Remove syntactic forms of text encodings

ae9c6c5

[[ LCB StdLib ]] Update lc-compile project file

9868b86

[[ LCB StdLib ]] Allow replacement byte (resp. char) to be specified …

6687396

…when encoding (resp. decoding) [[ LCB StdLib ]] Return undefined when replacement is unspecified and encoding or decoding fails

[[ LCB StdLib ]] Deal with invalid data appropriately

86262f4

[[ LBC StdLib ]] Add encoding module to vs build rules

e1cc062

[[ LCB StdLib ]] Allow replacement char for invalid utf16 and utf32 b…

8b90e13

…yte sequences

Merge branch 'develop' into builder-encoding_module

a46c753

libfoundation: Remove MCStringGetNativeChars() from public API.

1efca9b

Remove both MCStringGetNativeChars() and MCStringGetNativeCharsWithReplacement() from the public libfoundation API.

libfoundation: Add MCStringGetAsciiCharsWithReplacement().

bcf86f7

Add a new function that allows more efficient conversion to ASCII, without going via the native encoding.

libfoundation: Update comments for MCStringConvertToBytesWithReplacem…

64b3330

…ent().

libfoundation: Update comments for MCStringEncodeWithReplacement().

685ae54

[[ LCB stdlib ]] Permit multi-byte replacements when encoding.

702fb88

libfoundation: Refactor MCStringCreateWithAsciiCharsAndReplacement().

095d703

Allow multi-character substitutions for non-ASCII chars, including allowing non-ASCII characters in the replacement.

peter-b added this to the 8.0.0-rc-1 milestone Jan 30, 2015

peter-b added the enhancement label Jan 30, 2015

peter-b added the WIP label Sep 1, 2015

peter-b modified the milestone: 8.0.0-rc-1 Sep 28, 2015

peter-b changed the title ~~[[ Builder ]] Further work on encoding/decoding module~~ [DEAD][[ Builder ]] Further work on encoding/decoding module Apr 13, 2017

peter-b changed the title ~~[DEAD][[ Builder ]] Further work on encoding/decoding module~~ 🍄 [[ Builder ]] Further work on encoding/decoding module Apr 13, 2017
