Support the SQL TRANSLATE function #4080

djsstarburst · 2020-06-18T15:00:14Z

#3313

This implementation follows the specification used by Postgres; see the documentation section of this commit.

electrum

Overall looks good. A few minor comments

presto-main/src/main/java/io/prestosql/operator/scalar/StringFunctions.java

presto-main/src/test/java/io/prestosql/operator/scalar/TestStringFunctions.java

martint · 2020-06-18T17:45:00Z

How does this relate to the standard SQL translate syntax?

TRANSLATE <left paren> <character value expression>
      USING <transliteration name> <right paren>

djsstarburst · 2020-06-18T19:40:15Z

How does this relate to the standard SQL translate syntax?
TRANSLATE <left paren> <character value expression>
      USING <transliteration name> <right paren>

@martint and I discussed this. He made the point that TRANSLATE(source, originals, translations) could in general be implemented using TRANSLATE(source USING character_routine) where the routine itself had knowledge of the originals and translations. The reverse is not true, since character translation routines can make arbitrary transformations.

TRANSLATE(source USING character_routine) is definitely part of the SQL specification. However, the only RDB vendor I've found that supports it is Oracle.

TRANSLATE(source, originals, translations) is supported by Oracle, SQL Server, Redshift, Azure and probably others.

Supporting TRANSLATE(source USING character_routine) requires a bunch of machinery that we don't currently have to be built.

@martint registered these concerns:

confusion: once we add support for standard TRANSLATE, will that be confusing to users? Similar names, very different semantics
name conflict: in theory, there could be parsing ambiguities. The advantage here is that the standard TRANSLATE has the USING clause in the middle, so that should avoid all ambiguities.

I would vote to move forward with new function TRANSLATE(source, originals, translations), mostly because AFAICT it's more commonly supported, and it's straightforward.

electrum · 2020-06-18T20:05:59Z

The SQL standard TRANSLATE ... USING function is rather different:

<character transliteration> is a function for changing each character of a given string according to some many- to-one or one-to-one mapping between two not necessarily distinct character sets. The mapping, rather than being specified as part of the function, is some external function identified by a <transliteration name>.

The transliteration name is defined separately via CREATE TRANSLATION. Oracle doesn't support this -- they only have two predefined translations. I can't find any systems that actually implement this feature (which makes me wonder who added this to the standard and why).

@martint unless you have a strong objection, we should merge this as-is since it's a common function and provides a useful feature that can't be done any other way.

electrum · 2020-06-18T20:21:55Z

I was looking at the Oracle and PostgreSQL and found a couple interesting things.

Duplicates in from are ignored -- the first one is used. Oracle documents this:

If a character appears multiple times in from_string, then the to_string mapping corresponding to the first occurrence is used.

The other is that the lengths don't have to match:

The argument from_string can contain more characters than to_string. In this case, the extra characters at the end of from_string have no corresponding characters in to_string.

For duplicates, we could be strict and not allow them, or we could follow Oracle / PostgreSQL behavior. The current behavior of using the last one is not a good choice.

Similar, we can be strict for different lengths, or we could allow a shorter to_string to be used for removal. We support removal today with regex replace, so perhaps it's best to start off strict. The SQL Server implementation requires the strings to be the same lengths.

We might want to use the from/to language for the arguments as that's easy to reason about and matches Oracle and PostgreSQL.

martint · 2020-06-18T21:42:06Z

@martint unless you have a strong objection, we should merge this as-is since it's a common function and provides a useful feature that can't be done any other way.

No, my main concern was whether the semantics were too different or the intent behind them was similar (it is). The situation to avoid would be having two functions that look similar but do something completely different -- in that case, we'd need to come up with a different name.

djsstarburst · 2020-06-18T22:53:57Z

I just pushed updates to this PR that follow the Postgres spec:

Changed names originals and translations to from and to.
In the case of duplicates, use the "to" character corresponding to the first instance of a character in "from"
If the "from" character is at an index that would put it off the end of the "to" string, don't translate the source character.

One result is that no correspondence is required of the "from" and "to" lengths.

electrum · 2020-06-22T23:25:19Z

The behavior for from being larger than to doesn't match PostgreSQL. The correct behavior is to drop any characters in from that do not exist in to:

# select translate('abcd', 'a', '');
 translate 
-----------
 bcd
(1 row)

# select translate('abcd', 'ac', 'z');
 translate 
-----------
 zbd
(1 row)

# select translate('abcd', 'aac', 'zq');
 translate 
-----------
 zbd
(1 row)

Otherwise, everything else looks good.

djsstarburst · 2020-06-23T14:31:16Z

Doh, I hadn't noticed the deletion case. I just force pushed the commit with code that handles the deletion case properly. I also included documentation examples copied from the unit test, since the translate function isn't trivial. I hope this is ready for final review and merge now, @electrum

djsstarburst · 2020-06-23T21:05:40Z

@electrum, I force-pushed the translate branch, with the static int eliminated, replaced by -1. and the translate examples formatted like the JSON examples as you suggested.

electrum · 2020-06-23T21:58:44Z

presto-docs/src/main/sphinx/functions/string.rst

+
+   Here are some examples illustrating the translate function::
+
+       translate('abcd', '', '') -- 'abcd'


Please write these as full select queries that can be copy/pasted by the user:

SELECT translate('abcd', '', ''); -- 'abcd'

So add SELECT and semicolon.

electrum · 2020-06-23T22:12:36Z

presto-docs/src/main/sphinx/functions/string.rst

+       translate('abcd', 'a', 'z') -- 'zbcd'
+       translate('abcda', 'a', 'z') -- 'zbcdz'
+       translate('Palhoça', 'ç','c') -- 'Palhoca'
+       translate('abcd', 'b', '\uD840\uDC00') -- 'a\uD840\uDC00cd'


This \u is actually a Java escape and not valid in SQL (it works in the test because it's translated by the Java compiler, so it's effectively a literal character in the SQL)

We can write this using the SQL U& syntax as documented here: https://prestosql.io/docs/current/language/types.html#string

For the output, we can paste the literal in directly.

SELECT translate('abcd', 'b', U&'\+01F600'); -- a😀cd

This implementation follows the specification used by Postgres; see the documentation section of this commit, which also includes illustrative examples.

djsstarburst · 2020-06-23T23:16:58Z

@electrum, I added the SELECT and ; in the docs, and used the suggested unicode transliteration.

electrum · 2020-06-23T23:55:56Z

Thanks!

cla-bot bot added the cla-signed label Jun 18, 2020

electrum reviewed Jun 18, 2020

View reviewed changes

djsstarburst force-pushed the david.stryker/support-sql-translate branch from b671871 to 311de32 Compare June 18, 2020 17:50

djsstarburst force-pushed the david.stryker/support-sql-translate branch 2 times, most recently from e698894 to 9ad02de Compare June 18, 2020 22:48

djsstarburst force-pushed the david.stryker/support-sql-translate branch from 9ad02de to 228a255 Compare June 20, 2020 21:07

djsstarburst force-pushed the david.stryker/support-sql-translate branch from 228a255 to 2556309 Compare June 23, 2020 14:27

djsstarburst force-pushed the david.stryker/support-sql-translate branch from 2556309 to 4b4b7fe Compare June 23, 2020 17:13

electrum approved these changes Jun 23, 2020

View reviewed changes

electrum reviewed Jun 23, 2020

View reviewed changes

Support the SQL TRANSLATE function

bdc4f06

This implementation follows the specification used by Postgres; see the documentation section of this commit, which also includes illustrative examples.

djsstarburst force-pushed the david.stryker/support-sql-translate branch from 4b4b7fe to bdc4f06 Compare June 23, 2020 23:15

electrum merged commit f4170ac into trinodb:master Jun 23, 2020

djsstarburst deleted the david.stryker/support-sql-translate branch June 24, 2020 20:24

electrum mentioned this pull request Jun 25, 2020

Release notes for 337 #4058

Closed

9 tasks

electrum added this to the 337 milestone Jun 25, 2020

ebyhr mentioned this pull request Jun 27, 2020

equivalent function SQL Server's TRANSLATE() to in Presto #3313

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support the SQL TRANSLATE function #4080

Support the SQL TRANSLATE function #4080

djsstarburst commented Jun 18, 2020 •

edited

electrum left a comment

martint commented Jun 18, 2020

djsstarburst commented Jun 18, 2020

electrum commented Jun 18, 2020

electrum commented Jun 18, 2020

martint commented Jun 18, 2020

djsstarburst commented Jun 18, 2020

electrum commented Jun 22, 2020

djsstarburst commented Jun 23, 2020

djsstarburst commented Jun 23, 2020

electrum Jun 23, 2020

electrum Jun 23, 2020

djsstarburst commented Jun 23, 2020

electrum commented Jun 23, 2020


		Here are some examples illustrating the translate function::

		translate('abcd', '', '') -- 'abcd'

Support the SQL TRANSLATE function #4080

Support the SQL TRANSLATE function #4080

Conversation

djsstarburst commented Jun 18, 2020 • edited

electrum left a comment

Choose a reason for hiding this comment

martint commented Jun 18, 2020

djsstarburst commented Jun 18, 2020

electrum commented Jun 18, 2020

electrum commented Jun 18, 2020

martint commented Jun 18, 2020

djsstarburst commented Jun 18, 2020

electrum commented Jun 22, 2020

djsstarburst commented Jun 23, 2020

djsstarburst commented Jun 23, 2020

electrum Jun 23, 2020

Choose a reason for hiding this comment

electrum Jun 23, 2020

Choose a reason for hiding this comment

djsstarburst commented Jun 23, 2020

electrum commented Jun 23, 2020

djsstarburst commented Jun 18, 2020 •

edited