
Add support for casting row to json with field name #3613

Merged: 1 commit into trinodb:master on May 7, 2021

Conversation

@ebyhr (Member) commented May 4, 2020

Fixes #3536

@losipiuk (Member)

@ebyhr Did we actually decide that we want to go with this approach? I wasn't following the conversation when it started, but what I see now on Slack is inconclusive.

cc: @martint

@ebyhr (Member, Author) commented Jun 29, 2020

@losipiuk We haven't decided on the actual approach yet. Let me add the syntax-needs-review label.

@losipiuk (Member) left a comment

Thanks.

Implementation-wise it looks good, though I am not sure what the semantics should be. In particular, I am not a big fan of mapping ROWs without column names to JSON objects with artificial field%d keys. That doesn't seem like an improvement over the current semantics, where a ROW is mapped to a JSON array.

Maybe we should have mixed logic, where only ROWs with named fields are mapped to JSON objects and ROWs without names are mapped to JSON arrays.

On the other hand, this may be too complex to follow, and it poses a problem for the mixed case where only some columns in a ROW have names assigned.
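For illustration, the "mixed" logic under discussion could be sketched like this. This is a minimal Python simulation of the proposed semantics, not Trino's Java implementation; the function name and signature are made up for the example:

```python
import json

def row_to_json(values, names=None):
    # Proposed "mixed" semantics: a ROW whose fields all have names becomes
    # a JSON object; an anonymous (or partially named) ROW stays a JSON array.
    # Note that dict(zip(...)) would silently drop duplicate names, which is
    # exactly the kind of edge case debated later in the thread.
    if names is not None and all(n is not None for n in names):
        return json.dumps(dict(zip(names, values)))
    return json.dumps(list(values))

print(row_to_json([1, "x"], names=["a", "b"]))   # {"a": 1, "b": "x"}
print(row_to_json([1, "x"]))                     # [1, "x"]
print(row_to_json([1, "x"], names=["a", None]))  # [1, "x"] -- mixed case falls back to array
```

The sketch shows why the mixed case is awkward: a single missing name flips the entire output shape from object to array.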

}
-        MethodHandle methodHandle = METHOD_HANDLE.bindTo(fieldWriters);
+        MethodHandle methodHandle = legacyRowToJson ? LEGACY_METHOD_HANDLE.bindTo(fieldWriters) : METHOD_HANDLE.bindTo(fieldNames).bindTo(fieldWriters);

nit: I would prefer if .. else instead of the ternary (elvis) operator.

throw new RuntimeException(e);
}
}

@UsedByGeneratedCode
public static Slice toJson(List<JsonGeneratorWriter> fieldWriters, ConnectorSession session, Block block)

rename to toJsonArray


return new ScalarFunctionImplementation(
false,
ImmutableList.of(valueTypeArgumentProperty(RETURN_NULL_ON_NULL)),
methodHandle);
}

@UsedByGeneratedCode
public static Slice toJson(List<String> fieldNames, List<JsonGeneratorWriter> fieldWriters, ConnectorSession session, Block block)

rename to toJsonObject. I did not spot the difference between the first and second methods at first read.

@findepi (Member) commented Jun 30, 2020

> Implementation-wise it looks good. Though, I am not sure what the semantics should be. Especially I am not a big fan of mapping ROWs without column names to JSON objects, with artificial field%d keys. It seems not an improvement over current semantics, where ROW is mapped to JSON array.
>
> Maybe we should have mixed logic? Where only ROWs with names are mapped to JSON objects and ROWs without names are mapped to JSON arrays.

From a user's perspective, an anonymous ROW is just a tuple; it has no field names, only ordinals, so it indeed most closely resembles a JSON array.

@findepi (Member) commented Jun 30, 2020

Assigning @martint for the "syntax".

@ssquan (Contributor) commented Jul 31, 2020

How about using a parameter to control whether to carry column names, for example json_format(JSON, to_object)? It's more flexible than a configuration property: users can format one column with names and another without names in the same SQL statement.

@serkef commented Jan 27, 2021

Hello, first time here.
My 2 cents as a user who has been struggling for a couple of hours to understand why a row becomes an array and how I can make it a map instead.

  • A row, in my understanding, resembles a row from a database table. That means it contains information about a specific entity, and everything it contains refers to that same entity.
  • The docs say a cast(row as json) converts to an array because "order is more important than names". They also state that a row can contain different types. I would argue that the types are actually more important than the order, and the current behavior neglects that: it creates a JSON array where the semantics of getting the elements with the correct type are broken, again per the docs.
  • Both row and json support named fields, so the field names of a row should not be eliminated implicitly. An iterator over the values could do that explicitly. But currently there is no way to convert a row to a map and retain the keys (even with a workaround like getting two iterators of keys and values and combining them into a new map). And from what I see, there is no way of explicitly converting a row to an array either.

Our case is that the AWS Glue crawler marks JSON fields as rows. The data is semistructured, so every event has its own nested properties, and Trino correctly recognizes ROW as the field type. Using Trino to insert data from Hive into PostgreSQL (since the latter doesn't support the row type) forces us to cast, and there is currently no way to do that. Our only hope is to split results into different tables and use foreign keys.

Maybe the solution here is to expose some iterators that can give the keys, the values, or the key/value pairs from the row. Then users have the flexibility to use them as needed. Maybe this is Glue's fault in the first place, or maybe I don't know what exactly the purpose of ROW is and how it differs from a map.
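Outside SQL, the conversion the commenter wants is trivial to express; this Python sketch only illustrates what such key/value iterators would enable. The field_names and field_values lists are hypothetical stand-ins for the keys() and values() iterators ROW does not currently expose:

```python
# Hypothetical stand-ins for the iterators the commenter wishes ROW exposed;
# this is plain Python, not Trino SQL.
field_names = ["id", "name"]      # what a keys() iterator would yield
field_values = [42, "alice"]      # what a values() iterator would yield

# With both iterators available, converting a row to a map while
# retaining the keys is a one-liner.
as_map = dict(zip(field_names, field_values))
print(as_map)  # {'id': 42, 'name': 'alice'}
```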

Excuse me in advance for my ignorance on some parts above; please let me know if I'm wrong somewhere or if there is a workaround I'm missing.

Thanks for all your work.

@geotheory

I also wonder about the decision to choose an array. A full string representation of the JSON object would be preferable, as it can be read by any parser.

@martint (Member) commented Apr 21, 2021

> I also wonder about the decision to choose an array. A full string representation of the JSON object would be preferable, as it can be read by any parser.

It's because a ROW in SQL is a tuple with named fields, so the order of the fields matters. For instance, ROW(a BIGINT, b VARCHAR) is not the same type as ROW(b VARCHAR, a BIGINT).

In particular:

  • Values of ROW(a BIGINT, b VARCHAR) are not comparable with values of ROW(b VARCHAR, a BIGINT). Under cast-to-map semantics, the comparison of two rows wouldn't match the comparison of their JSON-map versions.
  • The comparison of values of compatible row types (e.g., ROW(a BIGINT, b BIGINT) and ROW(b BIGINT, a BIGINT)) would not match the result of comparing the corresponding row-as-JSON-map values. For the ROW type, comparison is positional, field by field; for a JSON map, it would be based on matching keys with each other.
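The mismatch can be seen with a small Python analogue, where tuples stand in for ROW values (positional comparison) and dicts stand in for JSON objects (key-based comparison); these are illustrative stand-ins, not Trino internals:

```python
# Tuples model ROW values: comparison is positional, field by field.
row_ab = (1, 2)            # ROW(a BIGINT, b BIGINT) with a = 1, b = 2
row_ba = (2, 1)            # ROW(b BIGINT, a BIGINT) with b = 2, a = 1
print(row_ab == row_ba)    # False: same data, different field order

# Dicts model JSON objects: comparison matches keys with each other.
json_ab = {"a": 1, "b": 2}
json_ba = {"b": 2, "a": 1}
print(json_ab == json_ba)  # True: key order is irrelevant
```

So two rows that compare unequal would compare equal after a cast-to-map, which is the semantic gap being discussed.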

This is what the SQL spec says about the ROW type:

> A row type is a sequence of (<field name> <data type>) pairs, called fields. It is described by a row type descriptor. A row type descriptor consists of the field descriptor of every field of the row type.
>
> The most specific type of a row of a table is a row type. In this case, each column of the table corresponds to the field of the row type that has the same ordinal position as the column.
>
> Row type RT2 is a subtype of data type RT1 if and only if RT1 and RT2 are row types of the same degree and, in every n-th pair of corresponding field definitions, FD1n in RT1 and FD2n in RT2, the <field name>s are equivalent and the <data type> of FD2n is a subtype of the <data type> of FD1n.
>
> A value of row type RT1 is assignable to a site of row type RT2 if and only if the degree of RT1 is the same as the degree of RT2 and every field in RT1 is assignable to the field in the same ordinal position in RT2.
>
> A value of row type RT1 is comparable with a value of row type RT2 if and only if the degree of RT1 is the same as the degree of RT2 and every field in RT1 is comparable with the field in the same ordinal position in RT2.

Having said that, I think we should reconsider how the CAST to JSON works, since that seems to be the more intuitive way people think about it. CASTs don't need to preserve semantics (even if they do today) and can be "lossy". An example of this is how comparison operations for numbers are not equivalent to comparison operations on the same numbers after they are CAST to VARCHAR.
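The numbers-vs-VARCHAR example can be checked directly; Python's string comparison is lexicographic, just like VARCHAR comparison in SQL:

```python
print(2 < 10)        # True: numeric comparison
print("2" < "10")    # False: lexicographic comparison of the VARCHAR casts
```

The cast already fails to preserve comparison semantics, which is the precedent for letting ROW-to-JSON be lossy too.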

But, there are a couple of open questions to consider:

  • What happens when field names are missing, as in anonymous rows? Making up names is problematic, because users will come to rely on them even though how the names are chosen is an implementation detail. One option would be to disallow casting anonymous rows to JSON.
  • What happens if there are duplicate field names in the ROW? This is allowed by the SQL specification: ROW(a BIGINT, a VARCHAR). We may want to disallow that case, too.

We may also want to consider emitting a deprecation warning when a query contains a CAST from ROW to JSON and legacy semantics are enabled.

@martint (Member) commented Apr 30, 2021

For the remaining open questions, @dain suggested the following:

  • generate duplicate keys if the fields have non-unique names
  • generate empty keys if the fields are unnamed

According to the JSON RFC (https://tools.ietf.org/html/rfc8259#section-4), it's legal to have duplicate key names, so it seems reasonable to go that route:

> When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.
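The "many implementations report the last name/value pair only" behavior is easy to observe; Python's json module, for instance, does exactly that:

```python
import json

# Legal JSON per RFC 8259, even though the member names are not unique.
doc = '{"a": 1, "a": 2}'
print(json.loads(doc))   # {'a': 2} -- the last name/value pair wins
```

Other parsers may raise an error or keep all pairs, which is why generating such JSON pushes the decision onto each client.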

In essence, the options boil down to:

  • disallow it on the Trino side
  • generate "legal" JSON (per the RFC) and let clients handle it. Some will, some won't, depending on how they choose to deserialize the JSON documents, but that's ok. This seems like the way to go for flexibility.

@ebyhr force-pushed the row-json branch 2 times, most recently from 2e739ea to c573e4f on May 5, 2021 14:45
@martint (Member) left a comment

Looks good, but mind the CI failures, they are related.

@ebyhr ebyhr merged commit a86f2d9 into trinodb:master May 7, 2021
@ebyhr ebyhr deleted the row-json branch May 7, 2021 06:04
@ebyhr ebyhr mentioned this pull request May 7, 2021
Merging this pull request closed: Cast ROW to JSON including column names