feat(rt): replace GSON based JSON serde with KMP compatible impl #477

aajtodd · 2021-09-14T15:35:13Z

Issue #

N/A

Description of changes

Removes GSON in favor of a hand-rolled JSON encoder and lexer.

This is based on early work from @kiiadi in #42, updated to handle Unicode and escape sequences and to improve the overall correctness/strictness of the parser. The state machine was adapted from work in smithy-rs by @jdisanti, which drastically simplified the handling of various conditions and improved the generated error messages.

Implementation notes:

- The tokenizer is based on peek(), which requires state mutations to be delayed until the peeked token is actually consumed. This allowed removing the RawJsonToken type.
- The original CharStream abstraction was removed for performance reasons.
  - It turns out that suspend is incredibly slow when invoked on every character (not entirely surprising).
  - It is also 3x slower to decode UTF-8 chars on the fly from raw byte sequences than to call decodeToString() on the appropriate slice of data. I'm not entirely sure why, but I suspect decodeToString() either hits a JVM intrinsic somewhere and/or has an optimized ASCII loop that is simply faster when done in bulk.
  - This led to keeping the lexer/deserializer working off the raw ByteArray payload, which requires keeping generated deserializers as they are now: they read the entire payload into memory (val payload = httpResponse.body.readAll()). This trades memory for CPU. The team discussed and agreed that most service responses will be small enough that this is the right trade-off.
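To illustrate the bulk-decode point, here is a minimal sketch (not the code in this PR) of reading a string token by scanning raw bytes for the closing quote and decoding the slice in one call. `readSimpleString` is a hypothetical name, and the sketch assumes the string contains no escape sequences.

```kotlin
/**
 * Hypothetical helper, for illustration only: reads a JSON string that starts at [start]
 * (which must point at the opening quote) and returns the decoded text together with the
 * index just past the closing quote. Assumes there are no escape sequences in the string.
 */
fun readSimpleString(data: ByteArray, start: Int): Pair<String, Int> {
    val quote: Byte = 0x22 // '"'
    require(start < data.size && data[start] == quote) { "expected '\"' at offset $start" }

    // Scanning raw bytes is safe: in UTF-8 the byte 0x22 never occurs inside a
    // multi-byte sequence, because lead and continuation bytes are all >= 0x80.
    var end = start + 1
    while (end < data.size && data[end] != quote) end++
    if (end >= data.size) throw IllegalStateException("unterminated string starting at offset $start")

    // Decode the whole slice at once instead of one character at a time.
    return data.decodeToString(start + 1, end) to end + 1
}
```

Because the payload is already fully in memory as a ByteArray, nothing here needs to suspend.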

The tokenizer was run through the JSON test suite as described in TESTING.md.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Co-authored-by: Aaron Todd <todaaron@amazon.com> Co-authored-by: Kyle Thomson <kylthoms@amazon.com>

ianbotsf

Heroic!

ianbotsf · 2021-09-14T17:03:37Z

runtime/utils/common/test/aws/smithy/kotlin/runtime/util/text/Utf8Test.kt

+        assertEquals(1, byteCountUtf8("$".encodeToByteArray()[0]))
+        assertEquals(2, byteCountUtf8("¢".encodeToByteArray()[0]))
+        assertEquals(3, byteCountUtf8("€".encodeToByteArray()[0]))
+        assertEquals(4, byteCountUtf8("\uD834\uDD22".encodeToByteArray()[0]))


Question: Can this (and other instances) be replaced with the literal Unicode character 𝄢 in source?

I don't believe so, because it's a surrogate pair (try copying and pasting it in the editor and it will encode it for you).

I was able to copy/paste the 𝄢 symbol directly into source code in the editor and it seemed to render fine as a single character. Maybe that behavior is platform/font dependent.

It's a minor question so I can live with it as-is; it merely seemed conspicuous to render three Unicode characters and then fall back to Unicode escape sequences for the last test case.

That is likely the case. I can't paste it directly; the editor auto-formats it to the code points.
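For context on what these assertions exercise, here is a sketch of deriving the UTF-8 sequence length from the lead byte. `utf8SequenceLength` is a hypothetical stand-in and not the `byteCountUtf8` implementation under review.

```kotlin
/**
 * Hypothetical stand-in for illustration: returns the number of bytes in the UTF-8
 * sequence that begins with [leadByte], based on the lead byte's high bits.
 */
fun utf8SequenceLength(leadByte: Byte): Int {
    val b = leadByte.toInt() and 0xFF // treat the byte as unsigned
    return when {
        (b and 0x80) == 0x00 -> 1 // 0xxxxxxx: ASCII
        (b and 0xE0) == 0xC0 -> 2 // 110xxxxx
        (b and 0xF0) == 0xE0 -> 3 // 1110xxxx
        (b and 0xF8) == 0xF0 -> 4 // 11110xxx: supplementary plane
        else -> throw IllegalArgumentException("not a UTF-8 lead byte: 0x${b.toString(16)}")
    }
}
```

Either spelling of the last test input should produce the same bytes: in a UTF-8 source file the literal `"𝄢"` and the escaped `"\uD834\uDD22"` are the same two-char Kotlin string, so the choice is about editor behavior rather than semantics.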

ianbotsf · 2021-09-14T17:08:36Z

runtime/serde/build.gradle.kts

 }

-subprojects {
+allprojects {


Question: Why was this change necessary?

It might not be now. It was because we added code to serde/common that needed it. I'll check if it's still required.

It was added for CharStream tests (since removed). I think I'll keep it since it allows tests to be added to serde common if needed.

ianbotsf · 2021-09-14T17:12:13Z

runtime/serde/serde-json/TESTING.md

+How to run JSONTestSuite against serde-json deserialize
+========================================================
+
+When making changes to the lexer it is a good idea to run the
+changes against the [JSONTestSuite](https://github.com/nst/JSONTestSuite) and manually examine the test results.
+
+### How to setup the JSONTestSuite


Question: Why not just add an integration test that uses JSONTestSuite directly into the code?

JSONTestSuite is a bit interesting/onerous to set up and use. It does not lend itself to this kind of integration, and I see it as an operational cost not worth paying.

In this line of thinking, can this code as-is serve as a general purpose JSON parser? If so, perhaps it would make sense to break it out as a separate dependency at some point in the future, so that JSONTestSuite could run against it as it does against the other parsers in their repo. Obviously not something for this PR.

Unless we are planning to support it as a general purpose parser, I would think it probably isn't in our interest (or anyone else's) to do that. Perhaps we fork it and set it up so that it's easier to run...

The "good idea to run the changes against JSONTestSuite" sounds like best intentions to me. Is there a way we can mechanize it? When I was playing around with the testing, I just copy/pasted all of the parser scenarios from their input files into an (admittedly massive) test class.
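One possible way to mechanize this, sketched under assumptions: JSONTestSuite names its parsing inputs with a `y_` prefix (must be accepted), an `n_` prefix (must be rejected), or an `i_` prefix (either outcome is allowed). The runner below is JVM-only, and `runSuite`, `expectationFor`, and the `parse` callback are hypothetical names rather than anything in this PR.

```kotlin
import java.io.File

enum class Expectation { ACCEPT, REJECT, EITHER }

/** Maps a JSONTestSuite file name to the outcome its prefix requires. */
fun expectationFor(fileName: String): Expectation = when {
    fileName.startsWith("y_") -> Expectation.ACCEPT
    fileName.startsWith("n_") -> Expectation.REJECT
    else -> Expectation.EITHER // i_ files: implementation-defined
}

/**
 * Runs [parse] over every .json file in [dir] and returns one message per file whose
 * outcome contradicts its prefix. [parse] is expected to throw on invalid input.
 */
fun runSuite(dir: File, parse: (ByteArray) -> Unit): List<String> {
    val files = dir.listFiles()?.filter { it.isFile && it.extension == "json" }.orEmpty()
    val failures = mutableListOf<String>()
    for (file in files.sortedBy { it.name }) {
        val accepted = runCatching { parse(file.readBytes()) }.isSuccess
        when (expectationFor(file.name)) {
            Expectation.ACCEPT -> if (!accepted) { failures.add("${file.name}: rejected but must be accepted") }
            Expectation.REJECT -> if (accepted) { failures.add("${file.name}: accepted but must be rejected") }
            Expectation.EITHER -> Unit
        }
    }
    return failures
}
```

A test could point `runSuite` at a checked-out copy of the suite's parsing inputs and assert that the returned list is empty, with `parse` draining all tokens from the lexer.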

ianbotsf · 2021-09-14T17:13:53Z

runtime/serde/serde-json/TESTING.md

+
+
+// NOTE: set to whatever locally published version you are working on
+val smithyKotlinVersion: String = "0.4.1-kmp-json"


Nit: The versions users have will almost always be something like 0.4.0-alpha or 0.4.0-snapshot. We should make the example something that looks familiar.

This readme is targeted at developers (i.e. us), not end users.

ianbotsf · 2021-09-14T17:15:08Z

runtime/serde/serde-json/TESTING.md

+val smithyKotlinVersion: String = "0.4.1-kmp-json"
+dependencies {
+   implementation(kotlin("stdlib"))
+   implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.5.0")


Question: Do we need to include a note about matching the coroutines version used by the rest of the project?

I'm not sure that's important. Again, this is targeted at developers of smithy-kotlin, so I would expect us to make changes here as necessary.

ianbotsf · 2021-09-14T20:17:06Z

runtime/serde/serde-json/common/src/aws/smithy/kotlin/runtime/serde/json/JsonLexer.kt

+                        'r' -> append('\r')
+                        'n' -> append('\n')
+                        't' -> append('\t')
+                        else -> throw DeserializationException("Invalid escape character: `$byte`")


Question: Can we find a way for this to include position information?
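One way this could look, as a sketch only: carry the byte offset on the exception and have the escape handler receive its position. `JsonLexException` and `unescapeAt` are hypothetical names (the code under review throws `DeserializationException`), and `\u` escapes are left out for brevity.

```kotlin
/** Hypothetical exception type that records where in the payload lexing failed. */
class JsonLexException(message: String, val offset: Int) :
    Exception("$message (at offset $offset)")

/**
 * Decodes the single-character escape whose designator (the byte after the backslash)
 * sits at [offset] in [data]. Unicode (\u) escapes are intentionally not handled here.
 */
fun unescapeAt(data: ByteArray, offset: Int): Char {
    if (offset >= data.size) {
        throw JsonLexException("Unexpected end of input in escape sequence", offset)
    }
    return when (val ch = (data[offset].toInt() and 0xFF).toChar()) {
        '"', '\\', '/' -> ch
        'b' -> '\b'
        'f' -> '\u000C'
        'r' -> '\r'
        'n' -> '\n'
        't' -> '\t'
        else -> throw JsonLexException("Invalid escape character: `$ch`", offset)
    }
}
```

Since the lexer works directly off the ByteArray, the current index is always available at the throw site.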

ianbotsf · 2021-09-14T20:36:04Z

runtime/utils/common/src/aws/smithy/kotlin/runtime/util/text/Utf8.kt

+/**
+ * Checks to see if a codepoint is in the supplementary plane or not (surrogate pair)
+ */
+@InternalApi
+fun Char.Companion.isSupplementaryCodePoint(codePoint: Int): Boolean = codePoint in SUPPLEMENTARY_PLANE_LOW..MAX_CODEPOINT
+
+/**
+ * Converts the [codePoint] to a char array. If the codepoint is in the supplementary plane then it will
+ * return an array with the high surrogate and low surrogate at indexes 0 and 1. Otherwise it will return a char
+ * array with a single character.
+ */
+@InternalApi
+fun Char.Companion.codePointToChars(codePoint: Int): CharArray = when (codePoint) {
+    in 0 until SUPPLEMENTARY_PLANE_LOW -> charArrayOf(codePoint.toChar())
+    in SUPPLEMENTARY_PLANE_LOW..MAX_CODEPOINT -> {
+        val low = MIN_LOW_SURROGATE.code + ((codePoint - 0x10000) and 0x3FF)
+        val high = MIN_HIGH_SURROGATE.code + (((codePoint - 0x10000) ushr 10) and 0x3FF)
+        charArrayOf(high.toChar(), low.toChar())
+    }
+    else -> throw IllegalArgumentException("invalid codepoint $codePoint")
+}


Style: I don't think I've ever seen (effectively) static extension methods before. Given that there's no state to use/reuse in Char.Companion, I think these might be better as top-level non-extension methods or extensions on Int.

Given that there's no state to use/reuse in Char.Companion

Not sure why this matters. It's more about scoping the methods to their intended use. I could see them as extensions off Int though if you feel strongly.
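For comparison, the extension-on-Int alternative might look roughly like this. It is a sketch with a hypothetical name (`toUtf16Chars`) and inlined range bounds instead of the constants used in the diff.

```kotlin
/**
 * Hypothetical Int-extension variant of codePointToChars: returns one char for code
 * points in the Basic Multilingual Plane, or a high/low surrogate pair otherwise.
 */
fun Int.toUtf16Chars(): CharArray = when (this) {
    in 0..0xFFFF -> charArrayOf(toChar())
    in 0x10000..0x10FFFF -> {
        val offset = this - 0x10000 // 20-bit value split across the two surrogates
        val high = Char.MIN_HIGH_SURROGATE.code + (offset ushr 10)
        val low = Char.MIN_LOW_SURROGATE.code + (offset and 0x3FF)
        charArrayOf(high.toChar(), low.toChar())
    }
    else -> throw IllegalArgumentException("invalid code point $this")
}
```

The call site changes from `Char.codePointToChars(cp)` to `cp.toUtf16Chars()`. The trade-off is that the function then shows up in completion for every Int, which is the scoping concern raised above.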

kggilmer

I spent some time but only got maybe a quarter way through this. I'll continue but will share what I have so far. Very clean, nice work.

kggilmer · 2021-09-15T18:59:09Z

runtime/serde/serde-json/common/src/aws/smithy/kotlin/runtime/serde/json/JsonEncoder.kt

+
+    private val state: ListStack<LexerState> = mutableListOf(LexerState.Initial)
+
+    private var depth: Int = 0


nit

consider using UInt here

kggilmer · 2021-09-15T19:22:59Z

runtime/serde/serde-json/TESTING.md

+How to run JSONTestSuite against serde-json deserialize
+========================================================
+
+When making changes to the lexer it is a good idea to run the
+changes against the [JSONTestSuite](https://github.com/nst/JSONTestSuite) and manually examine the test results.
+
+### How to setup the JSONTestSuite


In this line of thinking, can this code as-is serve as a general purpose JSON parser? If so, perhaps it would make sense to break it out as a separate dependency at some point in the future, so that JSONTestSuite could run against it as it does against the other parsers in their repo. Obviously not something for this PR.

kggilmer · 2021-09-15T19:35:37Z

runtime/serde/serde-json/common/src/aws/smithy/kotlin/runtime/serde/json/JsonEncoder.kt

+    override fun writeRawValue(value: String) = encodeValue(value)
+}
+
+internal fun String.escape(): String {


suggestion

This may be faster and more concise if implemented with a map. As written, there are two iterations over the input string: one to check whether anything needs to be done, and another to do the conditional transform. These could be collapsed into one iteration with a static Map<Char, String> that contains the escape values, simply calling Map.getOrElse(chr) on each character.

hmm ya I'll look into it
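A single-pass, map-based version could look something like the sketch below. `escapeJson` and `JSON_ESCAPES` are hypothetical names, and this is not the `String.escape()` implementation in the diff.

```kotlin
/** Short escapes defined by JSON; other control characters fall back to \uXXXX. */
val JSON_ESCAPES: Map<Char, String> = mapOf(
    '"' to "\\\"",
    '\\' to "\\\\",
    '\n' to "\\n",
    '\r' to "\\r",
    '\t' to "\\t",
    '\b' to "\\b",
    '\u000C' to "\\f",
)

/** Escapes [input] for use inside a JSON string in one pass over its characters. */
fun escapeJson(input: String): String {
    val sb = StringBuilder(input.length)
    for (ch in input) {
        val mapped = JSON_ESCAPES[ch]
        when {
            mapped != null -> sb.append(mapped)
            // Remaining control characters must be written as \u00XX.
            ch < ' ' -> sb.append("\\u").append(ch.code.toString(16).padStart(4, '0'))
            else -> sb.append(ch)
        }
    }
    return sb.toString()
}
```

One thing to weigh: a pre-check pass lets the common case (nothing to escape) return the original string without allocating, whereas this version always builds a new string. A benchmark would be needed to say which wins for typical payloads.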

kggilmer · 2021-09-15T19:37:26Z

runtime/serde/serde-json/common/src/aws/smithy/kotlin/runtime/serde/json/JsonLexer.kt

+    /**
+     * The size of the state stack
+     */
+    val size: Int


nit

consider UInt

kggilmer

Impressive work! I don't have any ship-blocking concerns. Some nits and questions.

kggilmer · 2021-09-17T20:09:00Z

runtime/serde/serde-json/common/src/aws/smithy/kotlin/runtime/serde/json/JsonLexer.kt

+     * Advance the cursor until next non-whitespace character is encountered
+     * @param peek Flag indicating if the next non-whitespace character should be consumed or peeked
+     */
+    private fun nextNonWhitespace(peek: Boolean = false): Char? {


suggestion

I can't find any call site where peek is false, so the parameter may be unneeded.
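For reference, the difference the flag makes can be sketched as follows. `ByteCursor` is a hypothetical class, and committing the whitespace skip even when peeking is an assumption about the intended behavior, not something confirmed by the diff.

```kotlin
/** Hypothetical cursor over a raw JSON payload, for illustration only. */
class ByteCursor(private val data: ByteArray) {
    var idx: Int = 0
        private set

    private fun isJsonWhitespace(b: Byte): Boolean =
        b == ' '.code.toByte() || b == '\t'.code.toByte() ||
            b == '\n'.code.toByte() || b == '\r'.code.toByte()

    /**
     * Skips whitespace and returns the next character, or null at end of input.
     * With [peek] = true the character is left unconsumed; otherwise the cursor moves past it.
     * The byte-to-char conversion is only meaningful for ASCII, which covers JSON's
     * structural characters.
     */
    fun nextNonWhitespace(peek: Boolean = false): Char? {
        var i = idx
        while (i < data.size && isJsonWhitespace(data[i])) i++
        idx = i // assumption: skipped whitespace is committed even when peeking
        if (i >= data.size) return null
        if (!peek) idx = i + 1
        return (data[i].toInt() and 0xFF).toChar()
    }
}
```

If every caller passes `peek = true`, the parameter and the consuming branch could be dropped and the function renamed to make the non-consuming behavior explicit.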

kiiadi · 2021-09-24T15:28:13Z

runtime/serde/serde-json/TESTING.md

+How to run JSONTestSuite against serde-json deserialize
+========================================================
+
+When making changes to the lexer it is a good idea to run the
+changes against the [JSONTestSuite](https://github.com/nst/JSONTestSuite) and manually examine the test results.
+
+### How to setup the JSONTestSuite


The "good idea to run the changes against JSONTestSuite" sounds like best intentions to me. Is there a way we can mechanize it? When I was playing around with the testing, I just copy/pasted all of the parser scenarios from their input files into an (admittedly massive) test class.

aajtodd and others added 30 commits July 14, 2021 15:11

Add reatUtf8Char and import CharStream

5efac86

Co-authored-by: Aaron Todd <todaaron@amazon.com> Co-authored-by: Kyle Thomson <kylthoms@amazon.com>

Import KMP compatible stream reader from smithy-kotlin#42

6ea21d9

Co-authored-by: Aaron Todd <todaaron@amazon.com> Co-authored-by: Kyle Thomson <kylthoms@amazon.com>

fix invalid number parsing

e9c599c

wip comma handling

289b089

note to self

4da6a6d

Merge remote-tracking branch 'origin/main' into kmp-json

adb40c6

rename to readLiteral

cd1f752

refactor: remove utf8 char handling from SdkByteReadChannel

dbc4a10

fix: handle surrogates in CharStream

5201698

fix: handle escapes

ab208a7

fix reading escaped unicode and control chars

656a757

add failing tests

7131dd1

exception hygiene

74a13f4

add unescaped control character handling

336074c

add instructions for testing against JSONTestSuite

4b88f50

Merge remote-tracking branch 'origin/main' into kmp-json

2838994

Merge remote-tracking branch 'origin/main' into kmp-json

c111e64

Merge remote-tracking branch 'origin/main' into kmp-json

ee50ae0

update expected exception

8ba6147

use more meaningful states for handling errors

e27566c

remove RawJsonToken

377037b

fix lexer to support peek operations

7d76fcf

cleanup state management

bffb4a4

Merge remote-tracking branch 'origin/main' into kmp-json

c2fdf90

cleanup

9270970

Merge branch 'kmp-json-refactor' into kmp-json

b67e0ba

update testing readme

2519012

replace gson with hand rolled encoder

bb966a6

share same state definition

fb0cbff

reset sdk version

f780452

aajtodd added 9 commits September 10, 2021 10:23

Merge remote-tracking branch 'origin/main' into kmp-json

7892ddc

fix multibyte unicode order

22e2ad9

wip microbenchmarking

bf711fe

optimize lexer

fb2d23f

cleanup

0889cf9

remove CharStream and cleanup

93eed44

cleanup error handling and include position when possible

6bb1a3f

cleanup encoder

54d989a

Merge remote-tracking branch 'origin/main' into kmp-json

f45a4a1

aajtodd requested review from ianbotsf, kggilmer and kiiadi September 14, 2021 15:35

aajtodd mentioned this pull request Sep 14, 2021

refactor: replace GSON-based Json reader with pure Kotlin version #42

Closed

ianbotsf reviewed Sep 14, 2021

refactor: remove unnecessary type from consts

b41fac3

kggilmer reviewed Sep 15, 2021

kggilmer approved these changes Sep 17, 2021

aajtodd added 2 commits September 20, 2021 14:13

Merge remote-tracking branch 'origin/main' into kmp-json

be5a6b9

feedback and cleanup

dacca0d

kiiadi approved these changes Sep 24, 2021

aajtodd added 3 commits October 6, 2021 09:43

Merge remote-tracking branch 'origin/main' into kmp-json

a2f600c

fix surrounding backticks

2affe95

include better offset info in exceptions; fix backticks

78e8be7

aajtodd merged commit 637e0f7 into main Oct 6, 2021

aajtodd deleted the kmp-json branch October 6, 2021 15:41

aajtodd mentioned this pull request Oct 11, 2021

fix(rt): remove gson from dependencies #496

Merged

2 tasks


