UTF8 and Unicode #25

ppaulweber · 2019-09-07T01:20:19Z

Provides UTF-8 byte sequence parsing and a code point abstraction according to RFC 3629.
Provides a generic include point called Unicode, which provides an alias for UTF8 and defines the Unicode planes, blocks, and ranges.
Unicode contains helper functions to check whether a UTF-8 character lies inside one or more block ranges.
Contains unit tests for both abstractions.
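
For readers unfamiliar with RFC 3629, the decoding step summarized above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the names `sequenceLength` and `decodeFirst` are hypothetical, and C++17 is assumed for `std::optional` and `std::string_view`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not this PR's API): number of bytes announced by a
// lead byte according to RFC 3629, or 0 if the byte cannot start a sequence.
constexpr std::size_t sequenceLength( const std::uint8_t lead )
{
    if( lead < 0x80 ) return 1;                    // 0xxxxxxx (ASCII)
    if( lead >= 0xC2 && lead <= 0xDF ) return 2;   // 110xxxxx (C0, C1 would be overlong)
    if( lead >= 0xE0 && lead <= 0xEF ) return 3;   // 1110xxxx
    if( lead >= 0xF0 && lead <= 0xF4 ) return 4;   // 11110xxx (F5..FF exceed U+10FFFF)
    return 0;                                      // continuation byte or invalid lead
}

// Hypothetical helper: decodes the first UTF-8 sequence of `bytes` into a
// 32-bit code point, or returns std::nullopt if the sequence is malformed.
inline std::optional< std::uint32_t > decodeFirst( const std::string_view bytes )
{
    if( bytes.empty() ) return std::nullopt;

    const auto lead = static_cast< std::uint8_t >( bytes[ 0 ] );
    const std::size_t length = sequenceLength( lead );
    if( length == 0 || bytes.size() < length ) return std::nullopt;

    // The lead byte carries 7, 5, 4, or 3 payload bits for lengths 1 to 4.
    std::uint32_t codePoint = ( length == 1 ) ? lead : ( lead & ( 0xFF >> ( length + 1 ) ) );

    // Every continuation byte must look like 10xxxxxx and adds 6 payload bits.
    for( std::size_t i = 1; i < length; i++ )
    {
        const auto byte = static_cast< std::uint8_t >( bytes[ i ] );
        if( ( byte & 0xC0 ) != 0x80 ) return std::nullopt;
        codePoint = ( codePoint << 6 ) | ( byte & 0x3F );
    }

    // Reject overlong encodings, UTF-16 surrogates, and values above U+10FFFF.
    constexpr std::uint32_t minimum[] = { 0, 0x0, 0x80, 0x800, 0x10000 };
    if( codePoint < minimum[ length ] ) return std::nullopt;
    if( codePoint >= 0xD800 && codePoint <= 0xDFFF ) return std::nullopt;
    if( codePoint > 0x10FFFF ) return std::nullopt;

    return codePoint;
}
```

In this sketch, `decodeFirst( "\xE2\x82\xAC" )` yields the code point U+20AC, while a truncated or overlong sequence yields an empty optional.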

* added UTF-8 support as its own standard implementation - http://tools.ietf.org/html/rfc3629
* allows to create UTF-8 abstraction of byte sequences and decode as 32bit value
* added some basic unit tests

* added code value description
* updated some basic unit tests

* updated byte sequence detection
* provided new `byteSequenceLengthIndication` functionality

* split single header into header file and compilation unit
* updated expansion functionality to support UTF-8

* provided a new internal helper function to extract a UTF-8 slice of a given source file and the source positions

* added support to represent UTF-8 byte sequence as unicode value and string representation

* primer support of common Unicode planes and block ranges
* added functionality to test if UTF-8 characters are inside certain block ranges
* provided proper unit tests
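
The block range test mentioned in the last item could look roughly like the sketch below. It is an illustration under assumptions, not the interface added by this PR: the `Range` type, its members, and `insideAny` are hypothetical names, and only the two block boundaries shown (taken from the Unicode standard) are included. C++17 is assumed for the fold expression.

```cpp
#include <cstdint>

// Hypothetical range type (not this PR's API): an inclusive span of code points.
struct Range
{
    std::uint32_t first;
    std::uint32_t last;

    // True if the code point lies inside this inclusive range.
    constexpr bool contains( const std::uint32_t codePoint ) const
    {
        return first <= codePoint && codePoint <= last;
    }

    // Each Unicode plane spans 0x10000 code points, so the plane index is
    // the code point shifted right by 16 bits.
    constexpr std::uint32_t plane() const
    {
        return first >> 16;
    }
};

// Block boundaries as defined by the Unicode standard.
constexpr Range BASIC_LATIN{ 0x0000, 0x007F };
constexpr Range TRANSPORT_AND_MAP_SYMBOLS{ 0x1F680, 0x1F6FF };

// Hypothetical helper: true if the code point lies in at least one of the
// given ranges. With no ranges at all the fold expression yields false.
template < typename... Ranges >
constexpr bool insideAny( const std::uint32_t codePoint, const Ranges&... ranges )
{
    return ( ranges.contains( codePoint ) || ... );
}
```

Here `insideAny( 0x1F680, BASIC_LATIN, TRANSPORT_AND_MAP_SYMBOLS )` is true, and `TRANSPORT_AND_MAP_SYMBOLS.plane()` is 1, the Supplementary Multilingual Plane.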

emmanuel099 · 2019-09-11T12:21:03Z

etc/test/cpp/unicode.cpp

+                EXPECT_EQ( range.plane(), Plane::SUPPLEMENTARY_MULTILINGUAL );
+                break;
+            }
+            case Block::TRANSPORT_AND_MAP_SYMBOLS:


unreachable?

Better to use multiple test cases (or a parameterized test case) instead of the for loop.
That makes the test easier to follow and makes it easier to find the faulty block when a test case fails.
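
A rough sketch of the table-driven idea, written in plain C++ so that it stands alone. Everything here is hypothetical and simplified (the `BlockCase` table, `planeOf`, and `failingCases` are not names from this PR); with GoogleTest each row would instead become one instance of a value-parameterized test, so a failure is reported per block.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for the plane enumeration; only two planes are listed.
enum class Plane : std::uint32_t
{
    BASIC_MULTILINGUAL = 0,
    SUPPLEMENTARY_MULTILINGUAL = 1,
};

// One row per block: a readable name, the block boundaries, and the
// plane the block is expected to belong to.
struct BlockCase
{
    std::string name;
    std::uint32_t first;
    std::uint32_t last;
    Plane expected;
};

// The plane index is the code point shifted right by 16 bits.
inline Plane planeOf( const std::uint32_t codePoint )
{
    return static_cast< Plane >( codePoint >> 16 );
}

// Example rows; block boundaries follow the Unicode standard.
inline std::vector< BlockCase > exampleCases()
{
    return {
        { "BASIC_LATIN", 0x0000, 0x007F, Plane::BASIC_MULTILINGUAL },
        { "GREEK_AND_COPTIC", 0x0370, 0x03FF, Plane::BASIC_MULTILINGUAL },
        { "TRANSPORT_AND_MAP_SYMBOLS", 0x1F680, 0x1F6FF, Plane::SUPPLEMENTARY_MULTILINGUAL },
    };
}

// Checks every row independently and returns the names of the rows that
// failed, so a faulty block is identified by name rather than hidden
// inside a loop with a switch statement.
inline std::vector< std::string > failingCases( const std::vector< BlockCase >& cases )
{
    std::vector< std::string > failures;
    for( const auto& testCase : cases )
    {
        const bool ok = planeOf( testCase.first ) == testCase.expected
                        && planeOf( testCase.last ) == testCase.expected;
        if( !ok )
        {
            failures.push_back( testCase.name );
        }
    }
    return failures;
}
```

Because each row carries its own expectation, no switch over the block is needed, which also avoids unreachable case labels.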

@emmanuel099 good catch. I've created an issue that addresses this comment and will fix the problem with the suggested solution in a future PR.

ppaulweber added 14 commits September 4, 2019 11:27

UTF-8 (RFC3629)

529e34c

* added UTF-8 support as its own standard implementation - http://tools.ietf.org/html/rfc3629
* allows to create UTF-8 abstraction of byte sequences and decode as 32bit value
* added some basic unit tests

UTF-8 (RFC3629)

0245510

* added code value description
* updated some basic unit tests

C++ String: added UTF-8 test case for expansion functionality

fef543b

UTF-8 (RFC3629)

86a2c95

* updated byte sequence detection
* provided new `byteSequenceLengthIndication` functionality

C++ String: compile unit and expansion

6b4d90c

* split single header into header file and compilation unit
* updated expansion functionality to support UTF-8

SourceLocation: extract correct UTF-8 sequence of source file

bea7360

* provided a new internal helper function to extract a UTF-8 slice of a given source file and the source positions

UTF-8 (RFC3629): unicode point representation

da86449

* added support to represent UTF-8 byte sequence as unicode value and string representation

C++ Unicode: added block, plane, and range abstractions

c143763

* primer support of common Unicode planes and block ranges
* added functionality to test if UTF-8 characters are inside certain block ranges
* provided proper unit tests

C++ String: fixed includes

5e2e57f

C++ LSP: fixed includes

e2fca8f

C++ Unicode: updated test compile units

d06e069

UTF-8 (RFC3629): updated and format code

8b9380a

C++ SourceLocation: fixed reading of file location with UTF-8 characters

45f6f5e

C++ SourceLocation: updated UTF-8 based slice calculation

970c6cc

ppaulweber requested a review from emmanuel099 September 7, 2019 01:20

ppaulweber self-assigned this Sep 7, 2019

This was referenced Sep 7, 2019

UTF8 support for identifiers, string literals, and comments casm-lang/libcasm-fe#193

Merged

UTF8 character support sealangdotorg/sea#101

Merged

emmanuel099 approved these changes Sep 11, 2019


ppaulweber mentioned this pull request Sep 11, 2019

Full UTF-8 Unicode Block Range Support and Unit Test Structure sealangdotorg/sea#102

Open

ppaulweber merged commit c168223 into master Sep 11, 2019

ppaulweber deleted the feature/100_utf8 branch September 11, 2019 12:42
