-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for casting row to json with field name #3613
Conversation
5412d33
to
87fd1be
Compare
@losipiuk We haven't yet decided the actual approach. Let me add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Implementation-wise it looks good. Though, I am not sure what the semantics should be. Especially I am not a big fan if mapping ROWs
without column names to JSON objects, with artificial field%d
keys. It seems not an improvement over current semantics, where ROW
is mapped to JSON array.
Maybe we should have mixed logic? Where only ROWs
with names are mapped to JSON objects and ROWs
without names are mapped to JSON arrays.
On the other hand this may be too complex to follow. And imposes a problem for mixed case when only some columns in ROW
have names assigned.
} | ||
MethodHandle methodHandle = METHOD_HANDLE.bindTo(fieldWriters); | ||
MethodHandle methodHandle = legacyRowToJson ? LEGACY_METHOD_HANDLE.bindTo(fieldWriters) : METHOD_HANDLE.bindTo(fieldNames).bindTo(fieldWriters); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would prefer if .. else
instead of elvis.
throw new RuntimeException(e); | ||
} | ||
} | ||
|
||
@UsedByGeneratedCode | ||
public static Slice toJson(List<JsonGeneratorWriter> fieldWriters, ConnectorSession session, Block block) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename toJsonArray
|
||
return new ScalarFunctionImplementation( | ||
false, | ||
ImmutableList.of(valueTypeArgumentProperty(RETURN_NULL_ON_NULL)), | ||
methodHandle); | ||
} | ||
|
||
@UsedByGeneratedCode | ||
public static Slice toJson(List<String> fieldNames, List<JsonGeneratorWriter> fieldWriters, ConnectorSession session, Block block) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to toJsonObject
. I did not spot the difference between first and second method at first read.
From user perspective, an anonymous ROW is just a tuple and has no field names, only ordinals, so indeed it resembles JSON array most closely. |
Assigning @martint for the "syntax". |
How about using parameters to control whether to carry column names? For example |
4964afb
to
aca3db6
Compare
Hello, first time here.
Our case is that AWS glue crawler marks json fields as rows. Data comes from semistructured data so every event has their own nested properties. trino correctly recognizes Maybe the solution here is to expose some iterators that can give the keys, the values, or the key/value pairs from the row. Then the user can have the flexibility to use them in their needs. Maybe this is the fault of glue in the first place, or maybe I don't know what exactly the purpose of the Row is and how it's different from a map. Excuse me in advance for my ignorance on some parts above, please let me know if I'm wrong somewhere or if there is a way around that I'm missing. Thanks for all your work. |
I also wonder about the decision to chose array. A full string representation of the JSON object would be preferable as this can be read by any parser. |
It's because a ROW in SQL is a tuple with named fields, so the order of the fields matters. For instance, In particular:
This is what the SQL spec says about the ROW type:
Having said that, I think we should reconsider how the CAST to JSON works, since that seems to be the more intuitive way people think about it. CASTs don't need to preserve semantics (even if we do today) and can be "lossy". An example of this is how the comparison operations for numbers are not equivalent to the comparison operations for the numbers after they are CAST as VARCHAR. But, there are a couple of open questions to consider:
We may also want to consider emitting a deprecation warning when a query contains a CAST from ROW to JSON and legacy semantics are enabled. |
For the remaining open questions, @dain suggested the following:
According to the JSON RFC (https://tools.ietf.org/html/rfc8259#section-4), it's legal to have duplicate key names, so it seems reasonable to go that route:
In essence, the options boil down to:
|
core/trino-main/src/main/java/io/trino/operator/scalar/RowToJsonCast.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/analyzer/FeaturesConfig.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/test/java/io/trino/type/TestMapOperatorsLegacy.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/test/java/io/trino/type/TestRowOperatorsLegacy.java
Outdated
Show resolved
Hide resolved
2e739ea
to
c573e4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but mind the CI failures, they are related.
Fixes #3536