
Add support for casting row to json with field name #3613

Merged: 1 commit into trinodb:master on May 7, 2021

Conversation

@ebyhr (Member) commented May 4, 2020

Fixes #3536

@losipiuk (Member)

@ebyhr Did we actually decide that we want to go with this approach? I wasn't following the conversation when it started, but what I see now on Slack is inconclusive.

cc: @martint

@ebyhr (Member, Author) commented Jun 29, 2020

@losipiuk We haven't decided on the actual approach yet. Let me add the syntax-needs-review label.

@losipiuk (Member) left a comment

Thanks.

Implementation-wise it looks good, though I am not sure what the semantics should be. In particular, I am not a big fan of mapping ROWs without column names to JSON objects with artificial field%d keys. That doesn't seem like an improvement over the current semantics, where a ROW is mapped to a JSON array.

Maybe we should have mixed logic, where only ROWs with named fields are mapped to JSON objects and ROWs without names are mapped to JSON arrays.

On the other hand, this may be too complex to follow, and it poses a problem for the mixed case where only some columns in a ROW have names assigned.
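For illustration, the "mixed" logic under discussion could be sketched like this. This is a minimal Python simulation of the proposed semantics, not Trino's Java implementation; the function name and signature are made up for the example:

```python
import json

def row_to_json(values, names=None):
    # Proposed "mixed" semantics: a ROW whose fields all have names becomes
    # a JSON object; an anonymous (or partially named) ROW stays a JSON array.
    # Note that dict(zip(...)) would silently drop duplicate names, which is
    # exactly the kind of edge case debated later in the thread.
    if names is not None and all(n is not None for n in names):
        return json.dumps(dict(zip(names, values)))
    return json.dumps(list(values))

print(row_to_json([1, "x"], names=["a", "b"]))   # {"a": 1, "b": "x"}
print(row_to_json([1, "x"]))                     # [1, "x"]
print(row_to_json([1, "x"], names=["a", None]))  # [1, "x"] -- mixed case falls back to array
```

The sketch shows why the mixed case is awkward: a single missing name flips the entire output shape from object to array.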

}
-        MethodHandle methodHandle = METHOD_HANDLE.bindTo(fieldWriters);
+        MethodHandle methodHandle = legacyRowToJson ? LEGACY_METHOD_HANDLE.bindTo(fieldWriters) : METHOD_HANDLE.bindTo(fieldNames).bindTo(fieldWriters);

nit: I would prefer if .. else instead of the ternary (elvis) operator.

throw new RuntimeException(e);
}
}

@UsedByGeneratedCode
public static Slice toJson(List<JsonGeneratorWriter> fieldWriters, ConnectorSession session, Block block)

rename to toJsonArray


return new ScalarFunctionImplementation(
false,
ImmutableList.of(valueTypeArgumentProperty(RETURN_NULL_ON_NULL)),
methodHandle);
}

@UsedByGeneratedCode
public static Slice toJson(List<String> fieldNames, List<JsonGeneratorWriter> fieldWriters, ConnectorSession session, Block block)

rename to toJsonObject. I did not spot the difference between the first and second methods at first read.

@findepi (Member) commented Jun 30, 2020

> Implementation-wise it looks good. Though, I am not sure what the semantics should be. Especially I am not a big fan of mapping ROWs without column names to JSON objects, with artificial field%d keys. It seems not an improvement over current semantics, where ROW is mapped to JSON array.
>
> Maybe we should have mixed logic? Where only ROWs with names are mapped to JSON objects and ROWs without names are mapped to JSON arrays.

From a user's perspective, an anonymous ROW is just a tuple; it has no field names, only ordinals, so it indeed most closely resembles a JSON array.

@findepi (Member) commented Jun 30, 2020

Assigning @martint for the "syntax".

@ssquan (Contributor) commented Jul 31, 2020

How about using a parameter to control whether to carry column names, for example json_format(JSON, to_object)? It's more flexible than a configuration property: users can format one column with names and another without names in the same SQL statement.

@serkef commented Jan 27, 2021

Hello, first time here.
My 2 cents as a user who has been struggling for a couple of hours to understand why a row becomes an array and how I can make it a map instead.

  • A row, in my understanding, resembles a row from a database table. That means it contains information about a specific entity, and everything it contains refers to that same entity.
  • The docs say a cast(row as json) converts to an array because "order is more important than names". They also state that a row can contain different types. I would argue that the types are actually more important than the order, and the current behavior neglects that: it creates a JSON array where the semantics of getting the elements with the correct type are broken, again per the docs.
  • Both row and json support named fields, so the field names of a row should not be eliminated implicitly. An iterator over the values could do that explicitly. But currently there is no way to convert a row to a map and retain the keys (even with a workaround like getting two iterators of keys and values and combining them into a new map). And from what I see, there is no way of explicitly converting a row to an array either.

Our case is that the AWS Glue crawler marks JSON fields as rows. The data is semistructured, so every event has its own nested properties, and Trino correctly recognizes ROW as the field type. Using Trino to insert data from Hive into PostgreSQL (since the latter doesn't support the row type) forces us to cast, and there is currently no way to do that. Our only hope is to split results into different tables and use foreign keys.

Maybe the solution here is to expose some iterators that can give the keys, the values, or the key/value pairs from the row. Then users have the flexibility to use them as needed. Maybe this is Glue's fault in the first place, or maybe I don't know what exactly the purpose of ROW is and how it differs from a map.
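Outside SQL, the conversion the commenter wants is trivial to express; this Python sketch only illustrates what such key/value iterators would enable. The field_names and field_values lists are hypothetical stand-ins for the keys() and values() iterators ROW does not currently expose:

```python
# Hypothetical stand-ins for the iterators the commenter wishes ROW exposed;
# this is plain Python, not Trino SQL.
field_names = ["id", "name"]      # what a keys() iterator would yield
field_values = [42, "alice"]      # what a values() iterator would yield

# With both iterators available, converting a row to a map while
# retaining the keys is a one-liner.
as_map = dict(zip(field_names, field_values))
print(as_map)  # {'id': 42, 'name': 'alice'}
```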

Excuse me in advance for my ignorance on some parts above; please let me know if I'm wrong somewhere or if there is a workaround I'm missing.

Thanks for all your work.

@geotheory

I also wonder about the decision to choose an array. A full string representation of the JSON object would be preferable, as it can be read by any parser.

@martint (Member) commented Apr 21, 2021

> I also wonder about the decision to choose an array. A full string representation of the JSON object would be preferable, as it can be read by any parser.

It's because a ROW in SQL is a tuple with named fields, so the order of the fields matters. For instance, ROW(a BIGINT, b VARCHAR) is not the same type as ROW(b VARCHAR, a BIGINT).

In particular:

  • Values of ROW(a BIGINT, b VARCHAR) are not comparable with values of ROW(b VARCHAR, a BIGINT). Under cast-to-map semantics, the comparison of two rows wouldn't match the comparison of their JSON-map versions.
  • The comparison of values of compatible row types (e.g., ROW(a BIGINT, b BIGINT) and ROW(b BIGINT, a BIGINT)) would not match the result of comparing the corresponding row-as-JSON-map values. For the ROW type, comparison is positional, field by field; for a JSON map, it would be based on matching keys with each other.
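The mismatch can be seen with a small Python analogue, where tuples stand in for ROW values (positional comparison) and dicts stand in for JSON objects (key-based comparison); these are illustrative stand-ins, not Trino internals:

```python
# Tuples model ROW values: comparison is positional, field by field.
row_ab = (1, 2)            # ROW(a BIGINT, b BIGINT) with a = 1, b = 2
row_ba = (2, 1)            # ROW(b BIGINT, a BIGINT) with b = 2, a = 1
print(row_ab == row_ba)    # False: same data, different field order

# Dicts model JSON objects: comparison matches keys with each other.
json_ab = {"a": 1, "b": 2}
json_ba = {"b": 2, "a": 1}
print(json_ab == json_ba)  # True: key order is irrelevant
```

So two rows that compare unequal would compare equal after a cast-to-map, which is the semantic gap being discussed.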

This is what the SQL spec says about the ROW type:

> A row type is a sequence of (<field name> <data type>) pairs, called fields. It is described by a row type descriptor. A row type descriptor consists of the field descriptor of every field of the row type.
>
> The most specific type of a row of a table is a row type. In this case, each column of the table corresponds to the field of the row type that has the same ordinal position as the column.
>
> Row type RT2 is a subtype of data type RT1 if and only if RT1 and RT2 are row types of the same degree and, in every n-th pair of corresponding field definitions, FD1n in RT1 and FD2n in RT2, the <field name>s are equivalent and the <data type> of FD2n is a subtype of the <data type> of FD1n.
>
> A value of row type RT1 is assignable to a site of row type RT2 if and only if the degree of RT1 is the same as the degree of RT2 and every field in RT1 is assignable to the field in the same ordinal position in RT2.
>
> A value of row type RT1 is comparable with a value of row type RT2 if and only if the degree of RT1 is the same as the degree of RT2 and every field in RT1 is comparable with the field in the same ordinal position in RT2.

Having said that, I think we should reconsider how the CAST to JSON works, since that seems to be the more intuitive way people think about it. CASTs don't need to preserve semantics (even if they do today) and can be "lossy". An example of this is how comparison operations for numbers are not equivalent to comparison operations on the same numbers after they are CAST to VARCHAR.
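The numbers-vs-VARCHAR example can be checked directly; Python's string comparison is lexicographic, just like VARCHAR comparison in SQL:

```python
print(2 < 10)        # True: numeric comparison
print("2" < "10")    # False: lexicographic comparison of the VARCHAR casts
```

The cast already fails to preserve comparison semantics, which is the precedent for letting ROW-to-JSON be lossy too.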

But, there are a couple of open questions to consider:

  • What happens when field names are missing, as in anonymous rows? Making up names is problematic, because users will come to rely on them even though how the names are chosen is an implementation detail. One option would be to disallow casting anonymous rows to JSON.
  • What happens if there are duplicate field names in the ROW? This is allowed by the SQL specification: ROW(a BIGINT, a VARCHAR). We may want to disallow that case, too.

We may also want to consider emitting a deprecation warning when a query contains a CAST from ROW to JSON and legacy semantics are enabled.

@martint (Member) commented Apr 30, 2021

For the remaining open questions, @dain suggested the following:

  • generate duplicate keys if the fields have non-unique names
  • generate empty keys if the fields are unnamed

According to the JSON RFC (https://tools.ietf.org/html/rfc8259#section-4), it's legal to have duplicate key names, so it seems reasonable to go that route:

> When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.
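The "many implementations report the last name/value pair only" behavior is easy to observe; Python's json module, for instance, does exactly that:

```python
import json

# Legal JSON per RFC 8259, even though the member names are not unique.
doc = '{"a": 1, "a": 2}'
print(json.loads(doc))   # {'a': 2} -- the last name/value pair wins
```

Other parsers may raise an error or keep all pairs, which is why generating such JSON pushes the decision onto each client.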

In essence, the options boil down to:

  • disallow it on the Trino side
  • generate "legal" JSON (per the RFC) and let clients handle it. Some will, some won't, depending on how they choose to deserialize the JSON documents, but that's ok. This seems like the way to go for flexibility.

@ebyhr force-pushed the row-json branch 2 times, most recently from 2e739ea to c573e4f on May 5, 2021 14:45
@martint (Member) left a comment

Looks good, but mind the CI failures, they are related.

@ebyhr ebyhr merged commit a86f2d9 into trinodb:master May 7, 2021
@ebyhr ebyhr deleted the row-json branch May 7, 2021 06:04
@ebyhr ebyhr mentioned this pull request May 7, 2021
Merging this pull request closed: Cast ROW to JSON including column names