feat(parser)!: support custom input encoding #136
Conversation
Given that this is a Draft, I assume you are aware that there are issues with the current implementation. Hopefully my comments are useful nonetheless; if not feel free to ignore them.
If you want to support only UTF-32 and no other charsets, then a handwritten decoder might be easiest. Maybe the broader question is how common use cases with other charsets are, and if so, which charsets are actually used.
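For illustration, a minimal sketch of what a handwritten UTF-32LE decode callback could look like, assuming the callback receives a memory segment, a byte length, and an out-parameter for the code point (the method name and the return-value convention here are my assumptions, not the actual jtreesitter API):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

// Hypothetical sketch of a handwritten UTF-32LE decoder callback body.
// Returns the number of bytes consumed; what to return for incomplete input
// is one of the open questions further down.
static int decodeUtf32LE(MemorySegment string, int length, MemorySegment codePoint) {
    if (length < 4) {
        return 0; // not enough bytes for a complete UTF-32 code unit
    }
    int value = string.get(
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN), 0);
    codePoint.set(ValueLayout.JAVA_INT, 0, value);
    return 4; // UTF-32 is fixed-width: always 4 bytes per code point
}
```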
Otherwise if you want to support arbitrary charsets, then it probably becomes more difficult.
In general I think custom charset support could work by using `CharsetDecoder#decode(ByteBuffer, CharBuffer, boolean)`, where the output buffer is sized to only permit the decoder to decode at most one code point. And then measure how many bytes were consumed from the input buffer (see the sketch further down).
However, this becomes quite convoluted because if it should be completely correct then you would probably have to:
- first try to decode into an output buffer which fits 1 `char`
- if that failed with `coderResult.isOverflow()` and did not produce any output, try again with a buffer which fits 2 `char` (a surrogate pair)
- maybe have to account for charsets which behave weirdly, e.g.:
  - report `isOverflow()` but already wrote one char
  - write 2 chars which are not actually a surrogate pair
  - write 1 char which is a lone surrogate
  - write a prefix (similar to a BOM) or suffix
  - report …
- perform the full decoding operation as described by the `CharsetDecoder` docs
That is, once there is no further input, call `decode(..., ..., true)` and eventually also `flush()`. At the very least, if tree-sitter or jtreesitter cannot support decoded data at the end but the decoder produced additional data, throw an exception to avoid silently incorrect behavior.
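Roughly, the single-code-point step could look like the sketch below (hypothetical helper, untested; it ignores the weird-charset cases from the list above and the final `decode(..., true)` / `flush()` step):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical sketch: decode at most one code point and report how many
// bytes were consumed. Ignores the edge cases listed above and the final
// decode(..., true) / flush() calls at end of input.
record DecodedCodePoint(int codePoint, int bytesConsumed) {}

static DecodedCodePoint decodeOneCodePoint(CharsetDecoder decoder, ByteBuffer input) {
    int start = input.position();
    CharBuffer out = CharBuffer.allocate(1);
    CoderResult result = decoder.decode(input, out, false);
    if (result.isOverflow() && out.position() == 0) {
        // Probably a supplementary code point: retry with room for a surrogate pair.
        out = CharBuffer.allocate(2);
        decoder.decode(input, out, false);
    }
    out.flip();
    if (!out.hasRemaining()) {
        return null; // underflow: not enough input bytes for a complete code point
    }
    return new DecodedCodePoint(Character.codePointAt(out, 0), input.position() - start);
}
```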
It is also not completely clear to me how tree-sitter's `DecodeFunction` is supposed to behave:
- Does it provide all bytes, or could the bytes be incomplete (e.g. when using a custom reading function)? And in that case, what should the decode function return when there are too few bytes to decode a code point?
- Are there guarantees for how long the `DecodeFunction` is called? Until it returns 0? Or is it not even called anymore when `length` becomes 0 (regardless of whether the decode function returned 0 in the past)?
```java
(string, length, code_point) -> {
    if (length == 0) return 0;
    var buffer = string.asSlice(0, length).asByteBuffer();
    var decoded = encoding.charset().decode(buffer);
```
Calling `decode(buffer)` here will decode all bytes, even though just the first code point is needed. This is probably rather inefficient.

Also, as mentioned by the documentation, `Charset#decode` uses `CodingErrorAction.REPLACE`, which might not be desired? (On the other hand, I am not sure how the Java FFM behaves when an upcall throws an exception; maybe it exits the JVM, see `Linker#upcallStub`.)
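If `REPLACE` is not wanted, the decoder could be configured explicitly, something like this (sketch only; whether `REPORT` is the right choice for this PR is up for discussion):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Sketch: a decoder that reports malformed/unmappable input instead of
// silently replacing it with U+FFFD.
static CharsetDecoder strictDecoder(Charset charset) {
    return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
}
```

With the three-argument `decode`, errors then show up as a `CoderResult` rather than being replaced; if they are turned into exceptions (or anything else throws), that would probably have to be caught inside the upcall lambda instead of letting it propagate into native code. What to return in that case ties into the `DecodeFunction` questions above.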
```java
    if (length == 0) return 0;
    var buffer = string.asSlice(0, length).asByteBuffer();
    var decoded = encoding.charset().decode(buffer);
    code_point.set(C_INT, 0, decoded.charAt(0));
```
`charAt(0)` only retrieves a single 16-bit Java `char`. For supplementary code points (>= U+10000) that will only be the high surrogate.
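Quick illustration of the difference (self-contained, not tied to this PR):

```java
public class SupplementaryCodePointDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, a supplementary code point (two Java chars)
        System.out.println((int) s.charAt(0));           // 55357 (0xD83D): only the high surrogate
        System.out.println(Character.codePointAt(s, 0)); // 128512 (0x1F600): the full code point
    }
}
```

In the snippet above, something like `Character.codePointAt(decoded, 0)` would yield the complete code point instead of just the high surrogate.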
```java
    var buffer = string.asSlice(0, length).asByteBuffer();
    var decoded = encoding.charset().decode(buffer);
    code_point.set(C_INT, 0, decoded.charAt(0));
    return (int) encoder.maxBytesPerChar();
```
`maxBytesPerChar()` only works if the charset consumes a fixed number of bytes per code point. It would not work for a charset like UTF-8 (if that didn't have built-in support by tree-sitter), where a code point uses a variable number of bytes.
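To illustrate with UTF-8 (just as an example of a variable-width charset): `maxBytesPerChar()` is a single per-`char` upper bound and says nothing about how many bytes a particular code point actually occupied:

```java
import java.nio.charset.StandardCharsets;

public class MaxBytesPerCharDemo {
    public static void main(String[] args) {
        // A single per-char upper bound...
        System.out.println(StandardCharsets.UTF_8.newEncoder().maxBytesPerChar()); // 3.0
        // ...while actual code points occupy anywhere from 1 to 4 bytes:
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);            // 1
        System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

Measuring the input `ByteBuffer` position before and after a bounded decode (as in the sketch further up) would give the actual number of bytes consumed.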
```java
@DisplayName("parse(custom)")
void parseCustom() {
    parser.setLanguage(language);
    var encoding = InputEncoding.valueOf(StandardCharsets.UTF_32);
```
Related to https://github.com/tree-sitter/java-tree-sitter/pull/136/files#r2341779397, this test only works because UTF-32 uses a fixed number of bytes per code point.

And I am also not completely sure whether this works properly, since for these surrogate pairs probably only the high surrogate is actually read.
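For reference, a supplementary character is where the two issues meet: it is still a single fixed-width UTF-32 code unit, but two Java `char`s (using `UTF-32LE` here just to avoid BOM questions):

```java
import java.nio.charset.Charset;

public class Utf32SupplementaryDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600
        byte[] utf32le = s.getBytes(Charset.forName("UTF-32LE"));
        System.out.println(utf32le.length);                   // 4 bytes: one UTF-32 code unit
        System.out.println(s.length());                       // 2 Java chars (surrogate pair)
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
    }
}
```

So a test input containing such a character would probably expose the `charAt(0)` issue.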