feat(parser)!: support custom input encoding #136
Conversation
Given that this is a Draft, I assume you are aware that there are issues with the current implementation. Hopefully my comments are useful nonetheless; if not feel free to ignore them.
If you want to support only UTF-32 and no other charsets, then a handwritten decoder might be easiest. Maybe the broader question is how common use cases with other charsets are, and if so, which charsets are actually used.
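For illustration, a minimal sketch of what a handwritten UTF-32LE decode callback could look like, assuming the callback receives a memory segment, a byte length, and an out-parameter for the code point (the method name and the return-value convention here are my assumptions, not the actual jtreesitter API):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

// Hypothetical sketch of a handwritten UTF-32LE decoder callback body.
// Returns the number of bytes consumed; what to return for incomplete input
// is one of the open questions further down.
static int decodeUtf32LE(MemorySegment string, int length, MemorySegment codePoint) {
    if (length < 4) {
        return 0; // not enough bytes for a complete UTF-32 code unit
    }
    int value = string.get(
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN), 0);
    codePoint.set(ValueLayout.JAVA_INT, 0, value);
    return 4; // UTF-32 is fixed-width: always 4 bytes per code point
}
```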
Otherwise if you want to support arbitrary charsets, then it probably becomes more difficult.
In general I think custom charset support could work by using `CharsetDecoder#decode(ByteBuffer, CharBuffer, boolean)`, where the output buffer is sized to only permit the decoder to decode at most one code point. And then measure how many bytes were consumed from the input buffer (see the sketch further down).
However, this becomes quite convoluted because if it should be completely correct then you would probably have to:
- first try to decode into an output buffer which fits 1 `char`
- if that failed with `coderResult.isOverflow()` and did not produce any output, try again with a buffer which fits 2 `char` (a surrogate pair)
- maybe have to account for charsets which behave weirdly, e.g.:
  - report `isOverflow()` but already wrote one char
  - write 2 chars which are not actually a surrogate pair
  - write 1 char which is a lone surrogate
  - write a prefix (similar to a BOM) or suffix
  - report …
- perform the full decoding operation as described by the `CharsetDecoder` docs
That is, once there is no further input, call `decode(..., ..., true)` and eventually also `flush()`. At the very least, if tree-sitter or jtreesitter cannot support decoded data at the end but the decoder produced additional data, throw an exception to avoid silently incorrect behavior.
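Roughly, the single-code-point step could look like the sketch below (hypothetical helper, untested; it ignores the weird-charset cases from the list above and the final `decode(..., true)` / `flush()` step):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical sketch: decode at most one code point and report how many
// bytes were consumed. Ignores the edge cases listed above and the final
// decode(..., true) / flush() calls at end of input.
record DecodedCodePoint(int codePoint, int bytesConsumed) {}

static DecodedCodePoint decodeOneCodePoint(CharsetDecoder decoder, ByteBuffer input) {
    int start = input.position();
    CharBuffer out = CharBuffer.allocate(1);
    CoderResult result = decoder.decode(input, out, false);
    if (result.isOverflow() && out.position() == 0) {
        // Probably a supplementary code point: retry with room for a surrogate pair.
        out = CharBuffer.allocate(2);
        decoder.decode(input, out, false);
    }
    out.flip();
    if (!out.hasRemaining()) {
        return null; // underflow: not enough input bytes for a complete code point
    }
    return new DecodedCodePoint(Character.codePointAt(out, 0), input.position() - start);
}
```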
It is also not completely clear to me how tree-sitter's `DecodeFunction` is supposed to behave:
- Does it provide all bytes, or could the bytes be incomplete (e.g. when using a custom reading function)? And in that case, what should the decode function return when there are too few bytes to decode a code point?
- Are there guarantees for how long the `DecodeFunction` is called? Until it returns 0? Or is it not even called anymore when `length` becomes 0 (regardless of whether the decode function returned 0 in the past)?
```java
(string, length, code_point) -> {
    if (length == 0) return 0;
    var buffer = string.asSlice(0, length).asByteBuffer();
    var decoded = encoding.charset().decode(buffer);
```
Calling `decode(buffer)` here will decode all bytes, even though just the first code point is needed. This is probably rather inefficient.

Also, as mentioned by the documentation, `Charset#decode` uses `CodingErrorAction.REPLACE`, which might not be desired? (On the other hand, I am not sure how the Java FFM behaves when an upcall throws an exception; maybe it exits the JVM, see `Linker#upcallStub`.)
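If `REPLACE` is not wanted, the decoder could be configured explicitly, something like this (sketch only; whether `REPORT` is the right choice for this PR is up for discussion):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Sketch: a decoder that reports malformed/unmappable input instead of
// silently replacing it with U+FFFD.
static CharsetDecoder strictDecoder(Charset charset) {
    return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
}
```

With the three-argument `decode`, errors then show up as a `CoderResult` rather than being replaced; if they are turned into exceptions (or anything else throws), that would probably have to be caught inside the upcall lambda instead of letting it propagate into native code. What to return in that case ties into the `DecodeFunction` questions above.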
```java
    if (length == 0) return 0;
    var buffer = string.asSlice(0, length).asByteBuffer();
    var decoded = encoding.charset().decode(buffer);
    code_point.set(C_INT, 0, decoded.charAt(0));
```
`charAt(0)` only retrieves a single 16-bit Java `char`. For supplementary code points (>= U+10000) that will only be the high surrogate.
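Quick illustration of the difference (self-contained, not tied to this PR):

```java
public class SupplementaryCodePointDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, a supplementary code point (two Java chars)
        System.out.println((int) s.charAt(0));           // 55357 (0xD83D): only the high surrogate
        System.out.println(Character.codePointAt(s, 0)); // 128512 (0x1F600): the full code point
    }
}
```

In the snippet above, something like `Character.codePointAt(decoded, 0)` would yield the complete code point instead of just the high surrogate.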
```java
    var buffer = string.asSlice(0, length).asByteBuffer();
    var decoded = encoding.charset().decode(buffer);
    code_point.set(C_INT, 0, decoded.charAt(0));
    return (int) encoder.maxBytesPerChar();
```
`maxBytesPerChar()` only works if the charset consumes a fixed number of bytes per code point. It would not work for a charset like UTF-8 (if that didn't have built-in support by tree-sitter), where a code point uses a variable number of bytes.
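To illustrate with UTF-8 (just as an example of a variable-width charset): `maxBytesPerChar()` is a single per-`char` upper bound and says nothing about how many bytes a particular code point actually occupied:

```java
import java.nio.charset.StandardCharsets;

public class MaxBytesPerCharDemo {
    public static void main(String[] args) {
        // A single per-char upper bound...
        System.out.println(StandardCharsets.UTF_8.newEncoder().maxBytesPerChar()); // 3.0
        // ...while actual code points occupy anywhere from 1 to 4 bytes:
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);            // 1
        System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

Measuring the input `ByteBuffer` position before and after a bounded decode (as in the sketch further up) would give the actual number of bytes consumed.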
```java
@DisplayName("parse(custom)")
void parseCustom() {
    parser.setLanguage(language);
    var encoding = InputEncoding.valueOf(StandardCharsets.UTF_32);
```
Related to https://github.com/tree-sitter/java-tree-sitter/pull/136/files#r2341779397, this test only works because UTF-32 uses a fixed number of bytes per code point.

And I am also not completely sure whether this works properly, since for these surrogate pairs probably only the high surrogate is actually read.
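For reference, a supplementary character is where the two issues meet: it is still a single fixed-width UTF-32 code unit, but two Java `char`s (using `UTF-32LE` here just to avoid BOM questions):

```java
import java.nio.charset.Charset;

public class Utf32SupplementaryDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600
        byte[] utf32le = s.getBytes(Charset.forName("UTF-32LE"));
        System.out.println(utf32le.length);                   // 4 bytes: one UTF-32 code unit
        System.out.println(s.length());                       // 2 Java chars (surrogate pair)
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
    }
}
```

So a test input containing such a character would probably expose the `charAt(0)` issue.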