New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support the SQL TRANSLATE function #4080
Support the SQL TRANSLATE function #4080
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. A few minor comments
presto-main/src/main/java/io/prestosql/operator/scalar/StringFunctions.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/scalar/StringFunctions.java
Outdated
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/operator/scalar/TestStringFunctions.java
Outdated
Show resolved
Hide resolved
How does this relate to the standard SQL translate syntax?
|
b671871
to
311de32
Compare
@martint and I discussed this. He made the point that TRANSLATE(source, originals, translations) could in general be implemented using TRANSLATE(source USING character_routine) where the routine itself had knowledge of the originals and translations. The reverse is not true, since character translation routines can make arbitrary transformations. TRANSLATE(source USING character_routine) is definitely part of the SQL specification. However, the only RDB vendor I've found that supports it is Oracle. TRANSLATE(source, originals, translations) is supported by Oracle, SQL Server, Redshift, Azure and probably others. Supporting TRANSLATE(source USING character_routine) requires a bunch of machinery that we don't currently have to be built. @martint registered these concerns:
I would vote to move forward with new function TRANSLATE(source, originals, translations), mostly because AFAICT it's more commonly supported, and it's straightforward. |
The SQL standard
The transliteration name is defined separately via @martint unless you have a strong objection, we should merge this as-is since it's a common function and provides a useful feature that can't be done any other way. |
I was looking at the Oracle and PostgreSQL and found a couple interesting things. Duplicates in
The other is that the lengths don't have to match:
For duplicates, we could be strict and not allow them, or we could follow Oracle / PostgreSQL behavior. The current behavior of using the last one is not a good choice. Similar, we can be strict for different lengths, or we could allow a shorter We might want to use the from/to language for the arguments as that's easy to reason about and matches Oracle and PostgreSQL. |
No, my main concern was whether the semantics were too different or the intent behind them was similar (it is). The situation to avoid would be having two functions that look similar but do something completely different -- in that case, we'd need to come up with a different name. |
e698894
to
9ad02de
Compare
I just pushed updates to this PR that follow the Postgres spec:
One result is that no correspondence is required of the "from" and "to" lengths. |
9ad02de
to
228a255
Compare
The behavior for
Otherwise, everything else looks good. |
228a255
to
2556309
Compare
Doh, I hadn't noticed the deletion case. I just force pushed the commit with code that handles the deletion case properly. I also included documentation examples copied from the unit test, since the translate function isn't trivial. I hope this is ready for final review and merge now, @electrum |
2556309
to
4b4b7fe
Compare
@electrum, I force-pushed the translate branch, with the static int eliminated, replaced by -1. and the translate examples formatted like the JSON examples as you suggested. |
|
||
Here are some examples illustrating the translate function:: | ||
|
||
translate('abcd', '', '') -- 'abcd' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write these as full select queries that can be copy/pasted by the user:
SELECT translate('abcd', '', ''); -- 'abcd'
So add SELECT
and semicolon.
translate('abcd', 'a', 'z') -- 'zbcd' | ||
translate('abcda', 'a', 'z') -- 'zbcdz' | ||
translate('Palhoça', 'ç','c') -- 'Palhoca' | ||
translate('abcd', 'b', '\uD840\uDC00') -- 'a\uD840\uDC00cd' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This \u
is actually a Java escape and not valid in SQL (it works in the test because it's translated by the Java compiler, so it's effectively a literal character in the SQL)
We can write this using the SQL U&
syntax as documented here: https://prestosql.io/docs/current/language/types.html#string
For the output, we can paste the literal in directly.
SELECT translate('abcd', 'b', U&'\+01F600'); -- a😀cd
This implementation follows the specification used by Postgres; see the documentation section of this commit, which also includes illustrative examples.
4b4b7fe
to
bdc4f06
Compare
@electrum, I added the SELECT and ; in the docs, and used the suggested unicode transliteration. |
Thanks! |
#3313
This implementation follows the specification used by Postgres; see the documentation section of this commit.