Skip to content

Conversation

@aajtodd
Copy link
Contributor

@aajtodd aajtodd commented Sep 14, 2021

Issue #

N/A

Description of changes

Removes GSON in favor of a hand rolled JSON encoder and lexer.

This was based on early work from @kiiadi in #42 but updated to handle Unicode, escape sequences, and improve overall correctness/strictness of the parser. The state machine was adapted from work in smithy-rs by @jdisanti which drastically simplified handling various conditions as well as improved the error messages generated.

Implementation notes:

  • The tokenizer is based on peek() which requires state mutations to be delayed until the token that is peeked is actually consumed. This allowed removing the RawJsonToken type.
  • The original CharStream abstraction was removed for performance reasons.
    • It turns out that suspend is incredibly slow when invoked on every character (not entirely surprising).
    • It is also is 3x slower to decode UTF-8 chars on the fly from raw byte sequences than just calling decodeToString() on the appropriate slice of data. I'm not entirely sure why but suspect it's either a JVM intrinsic somewhere and/or has an optimized ASCII loop that is just faster when done in bulk.
    • This lead to keeping the lexer/deserializer to work off the raw ByteArray payload which requires us to keep generated deserializers as they are now where they read the entire payload contents into memory (val payload = httpResponse.body.readAll()). This trades memory for CPU. The team discussed and agreed that most service responses will be small enough that this is the right trade off.

The tokenizer was ran through JSON test suite as described in TESTING.md.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

aajtodd and others added 30 commits July 14, 2021 15:11
Co-authored-by: Aaron Todd <todaaron@amazon.com>
Co-authored-by: Kyle Thomson <kylthoms@amazon.com>
Co-authored-by: Aaron Todd <todaaron@amazon.com>
Co-authored-by: Kyle Thomson <kylthoms@amazon.com>
Copy link
Contributor

@ianbotsf ianbotsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heroic!

assertEquals(1, byteCountUtf8("$".encodeToByteArray()[0]))
assertEquals(2, byteCountUtf8("¢".encodeToByteArray()[0]))
assertEquals(3, byteCountUtf8("".encodeToByteArray()[0]))
assertEquals(4, byteCountUtf8("\uD834\uDD22".encodeToByteArray()[0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Can this (and other instances) be replaced with the literal Unicode character 𝄢 in source?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe so because it's a surrogate pair (you can try in the editor to copy and paste and it will encode it for you).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to copy/paste the 𝄢 symbol directly into source code in the editor and it seemed to render fine as a single character. Maybe that behavior is platform/font dependent.

It's a minor question so I can live with it as-is, it merely seemed conspicuous to render three Unicode characters and then fallback to Unicode escape sequences for the last test case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is likely the case. I can't paste it directly, it auto formats it to the code points

}

subprojects {
allprojects {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why was this change necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be now. It was because we added code to serde/common that needed it. I'll check if it's still required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was added for CharStream tests (since removed). I think I'll keep it since it allows tests to be added to serde common if needed.

Comment on lines +1 to +7
How to run JSONTestSuite against serde-json deserialize
========================================================

When making changes to the lexer it is a good idea to run the
changes against the [JSONTestSuite](https://github.com/nst/JSONTestSuite) and manually examine the test results.

### How to setup the JSONTestSuite
Copy link
Contributor

@ianbotsf ianbotsf Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why not just add an integration test that uses JSONTestSuite directly into the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSONTestSuite is a bit interesting/onerous to setup and use. It does not lend itself to this kind of integration and I see it as an operational cost not worth paying.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this line of thinking, can this code as-is serve as a general purpose JSON parser? If so perhaps it would make sense to break out as a separate dependency at some point in the future such that JSONTestSuite could run against it as the others they have in their repo. Obviously not something for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we are planning to support it as a general purpose parser I would think it probably isn't in our interest (or anyone else) to do that. Perhaps we fork it and set it up so that it's easier to run...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "good idea to run the changes against JSONTestSuite" sounds like best intentions to me, is there a way we can mechanize it? When I was playing around with the testing, I just copy/pasted all of the parser scenarios from their input files into a (admittedly massive) test class.



// NOTE: set to whatever locally published version you are working on
val smithyKotlinVersion: String = "0.4.1-kmp-json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The versions users have will almost always be something like 0.4.0-alpha or 0.4.0-snapshot. We should make the example something that looks familiar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This readme is targeted towards developers (i.e. us). Not end users.

val smithyKotlinVersion: String = "0.4.1-kmp-json"
dependencies {
implementation(kotlin("stdlib"))
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.5.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Do we need to include a note about matching the coroutines version used by the rest of the project?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that's important, again this is targeted at developers of smithy-kotlin though so I would expect us to make changes here as necessary.

'r' -> append('\r')
'n' -> append('\n')
't' -> append('\t')
else -> throw DeserializationException("Invalid escape character: `$byte`")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Can we find a way for this to include position information?

Comment on lines +33 to +53
/**
* Checks to see if a codepoint is in the supplementary plane or not (surrogate pair)
*/
@InternalApi
fun Char.Companion.isSupplementaryCodePoint(codePoint: Int): Boolean = codePoint in SUPPLEMENTARY_PLANE_LOW..MAX_CODEPOINT

/**
* Converts the [codePoint] to a char array. If the codepoint is in the supplementary plane then it will
* return an array with the high surrogate and low surrogate at indexes 0 and 1. Otherwise it will return a char
* array with a single character.
*/
@InternalApi
fun Char.Companion.codePointToChars(codePoint: Int): CharArray = when (codePoint) {
in 0 until SUPPLEMENTARY_PLANE_LOW -> charArrayOf(codePoint.toChar())
in SUPPLEMENTARY_PLANE_LOW..MAX_CODEPOINT -> {
val low = MIN_LOW_SURROGATE.code + ((codePoint - 0x10000) and 0x3FF)
val high = MIN_HIGH_SURROGATE.code + (((codePoint - 0x10000) ushr 10) and 0x3FF)
charArrayOf(high.toChar(), low.toChar())
}
else -> throw IllegalArgumentException("invalid codepoint $codePoint")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: I don't think I've ever seen (effectively) static extension methods before. Given that there's no state to use/reuse in Char.Companion, I think these might be better as top-level non-extension methods or extensions on Int.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there's no state to use/reuse in Char.Companion

Not sure why this matters. It's more about scoping the methods to their intended use. I could see them as extensions off Int though if you feel strongly.

Copy link
Contributor

@kggilmer kggilmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time but only got maybe a quarter way through this. I'll continue but will share what I have so far. Very clean, nice work.


private val state: ListStack<LexerState> = mutableListOf(LexerState.Initial)

private var depth: Int = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

consider using UInt here

Comment on lines +1 to +7
How to run JSONTestSuite against serde-json deserialize
========================================================

When making changes to the lexer it is a good idea to run the
changes against the [JSONTestSuite](https://github.com/nst/JSONTestSuite) and manually examine the test results.

### How to setup the JSONTestSuite
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this line of thinking, can this code as-is serve as a general purpose JSON parser? If so perhaps it would make sense to break out as a separate dependency at some point in the future such that JSONTestSuite could run against it as the others they have in their repo. Obviously not something for this PR.

override fun writeRawValue(value: String) = encodeValue(value)
}

internal fun String.escape(): String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion

This may be more efficient in time and concise if implemented with a map. Here we have two iterations on the input string, once to check to see if anything should be done, and then again to do a conditional transform. This could be collapsed to one iteration if there is a static Map<Char, String> which contains the escape values and then simply calling Map.getOrElse(chr) on each character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ya I'll look into it

/**
* The size of the state stack
*/
val size: Int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

consider UInt

Copy link
Contributor

@kggilmer kggilmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! I don't have any ship-blocking concerns. Some nits and questions.

* Advance the cursor until next non-whitespace character is encountered
* @param peek Flag indicating if the next non-whitespace character should be consumed or peeked
*/
private fun nextNonWhitespace(peek: Boolean = false): Char? {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion

I don't find any place where this function is called where peek is false, maybe unneeded.

Comment on lines +1 to +7
How to run JSONTestSuite against serde-json deserialize
========================================================

When making changes to the lexer it is a good idea to run the
changes against the [JSONTestSuite](https://github.com/nst/JSONTestSuite) and manually examine the test results.

### How to setup the JSONTestSuite
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "good idea to run the changes against JSONTestSuite" sounds like best intentions to me, is there a way we can mechanize it? When I was playing around with the testing, I just copy/pasted all of the parser scenarios from their input files into a (admittedly massive) test class.

@aajtodd aajtodd merged commit 637e0f7 into main Oct 6, 2021
@aajtodd aajtodd deleted the kmp-json branch October 6, 2021 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants