Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support the SQL TRANSLATE function #4080

Merged

Conversation

djsstarburst
Copy link
Member

@djsstarburst djsstarburst commented Jun 18, 2020

#3313

This implementation follows the specification used by Postgres; see the documentation section of this commit.

@cla-bot cla-bot bot added the cla-signed label Jun 18, 2020
Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. A few minor comments

@martint
Copy link
Member

martint commented Jun 18, 2020

How does this relate to the standard SQL translate syntax?

TRANSLATE <left paren> <character value expression>
      USING <transliteration name> <right paren>

@djsstarburst djsstarburst force-pushed the david.stryker/support-sql-translate branch from b671871 to 311de32 Compare June 18, 2020 17:50
@djsstarburst
Copy link
Member Author

How does this relate to the standard SQL translate syntax?

TRANSLATE <left paren> <character value expression>
      USING <transliteration name> <right paren>

@martint and I discussed this. He made the point that TRANSLATE(source, originals, translations) could in general be implemented using TRANSLATE(source USING character_routine) where the routine itself had knowledge of the originals and translations. The reverse is not true, since character translation routines can make arbitrary transformations.

TRANSLATE(source USING character_routine) is definitely part of the SQL specification. However, the only RDB vendor I've found that supports it is Oracle.

TRANSLATE(source, originals, translations) is supported by Oracle, SQL Server, Redshift, Azure and probably others.

Supporting TRANSLATE(source USING character_routine) requires a bunch of machinery that we don't currently have to be built.

@martint registered these concerns:

  • confusion: once we add support for standard TRANSLATE, will that be confusing to users? Similar names, very different semantics
  • name conflict: in theory, there could be parsing ambiguities. The advantage here is that the standard TRANSLATE has the USING clause in the middle, so that should avoid all ambiguities.

I would vote to move forward with new function TRANSLATE(source, originals, translations), mostly because AFAICT it's more commonly supported, and it's straightforward.

@electrum
Copy link
Member

The SQL standard TRANSLATE ... USING function is rather different:

<character transliteration> is a function for changing each character of a given string according to some many- to-one or one-to-one mapping between two not necessarily distinct character sets. The mapping, rather than being specified as part of the function, is some external function identified by a <transliteration name>.

The transliteration name is defined separately via CREATE TRANSLATION. Oracle doesn't support this -- they only have two predefined translations. I can't find any systems that actually implement this feature (which makes me wonder who added this to the standard and why).

@martint unless you have a strong objection, we should merge this as-is since it's a common function and provides a useful feature that can't be done any other way.

@electrum
Copy link
Member

I was looking at the Oracle and PostgreSQL and found a couple interesting things.

Duplicates in from are ignored -- the first one is used. Oracle documents this:

If a character appears multiple times in from_string, then the to_string mapping corresponding to the first occurrence is used.

The other is that the lengths don't have to match:

The argument from_string can contain more characters than to_string. In this case, the extra characters at the end of from_string have no corresponding characters in to_string.

For duplicates, we could be strict and not allow them, or we could follow Oracle / PostgreSQL behavior. The current behavior of using the last one is not a good choice.

Similar, we can be strict for different lengths, or we could allow a shorter to_string to be used for removal. We support removal today with regex replace, so perhaps it's best to start off strict. The SQL Server implementation requires the strings to be the same lengths.

We might want to use the from/to language for the arguments as that's easy to reason about and matches Oracle and PostgreSQL.

@martint
Copy link
Member

martint commented Jun 18, 2020

@martint unless you have a strong objection, we should merge this as-is since it's a common function and provides a useful feature that can't be done any other way.

No, my main concern was whether the semantics were too different or the intent behind them was similar (it is). The situation to avoid would be having two functions that look similar but do something completely different -- in that case, we'd need to come up with a different name.

@djsstarburst djsstarburst force-pushed the david.stryker/support-sql-translate branch 2 times, most recently from e698894 to 9ad02de Compare June 18, 2020 22:48
@djsstarburst
Copy link
Member Author

I just pushed updates to this PR that follow the Postgres spec:

  • Changed names originals and translations to from and to.
  • In the case of duplicates, use the "to" character corresponding to the first instance of a character in "from"
  • If the "from" character is at an index that would put it off the end of the "to" string, don't translate the source character.

One result is that no correspondence is required of the "from" and "to" lengths.

@djsstarburst djsstarburst force-pushed the david.stryker/support-sql-translate branch from 9ad02de to 228a255 Compare June 20, 2020 21:07
@electrum
Copy link
Member

The behavior for from being larger than to doesn't match PostgreSQL. The correct behavior is to drop any characters in from that do not exist in to:

# select translate('abcd', 'a', '');
 translate 
-----------
 bcd
(1 row)

# select translate('abcd', 'ac', 'z');
 translate 
-----------
 zbd
(1 row)

# select translate('abcd', 'aac', 'zq');
 translate 
-----------
 zbd
(1 row)

Otherwise, everything else looks good.

@djsstarburst djsstarburst force-pushed the david.stryker/support-sql-translate branch from 228a255 to 2556309 Compare June 23, 2020 14:27
@djsstarburst
Copy link
Member Author

Doh, I hadn't noticed the deletion case. I just force pushed the commit with code that handles the deletion case properly. I also included documentation examples copied from the unit test, since the translate function isn't trivial. I hope this is ready for final review and merge now, @electrum

@djsstarburst djsstarburst force-pushed the david.stryker/support-sql-translate branch from 2556309 to 4b4b7fe Compare June 23, 2020 17:13
@djsstarburst
Copy link
Member Author

@electrum, I force-pushed the translate branch, with the static int eliminated, replaced by -1. and the translate examples formatted like the JSON examples as you suggested.


Here are some examples illustrating the translate function::

translate('abcd', '', '') -- 'abcd'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write these as full select queries that can be copy/pasted by the user:

SELECT translate('abcd', '', ''); -- 'abcd'

So add SELECT and semicolon.

translate('abcd', 'a', 'z') -- 'zbcd'
translate('abcda', 'a', 'z') -- 'zbcdz'
translate('Palhoça', 'ç','c') -- 'Palhoca'
translate('abcd', 'b', '\uD840\uDC00') -- 'a\uD840\uDC00cd'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This \u is actually a Java escape and not valid in SQL (it works in the test because it's translated by the Java compiler, so it's effectively a literal character in the SQL)

We can write this using the SQL U& syntax as documented here: https://prestosql.io/docs/current/language/types.html#string

For the output, we can paste the literal in directly.

SELECT translate('abcd', 'b', U&'\+01F600'); -- a😀cd

This implementation follows the specification used by Postgres;
see the documentation section of this commit, which also includes
illustrative examples.
@djsstarburst djsstarburst force-pushed the david.stryker/support-sql-translate branch from 4b4b7fe to bdc4f06 Compare June 23, 2020 23:15
@djsstarburst
Copy link
Member Author

@electrum, I added the SELECT and ; in the docs, and used the suggested unicode transliteration.

@electrum electrum merged commit f4170ac into trinodb:master Jun 23, 2020
@electrum
Copy link
Member

Thanks!

@djsstarburst djsstarburst deleted the david.stryker/support-sql-translate branch June 24, 2020 20:24
@electrum electrum mentioned this pull request Jun 25, 2020
9 tasks
@electrum electrum added this to the 337 milestone Jun 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

3 participants