Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: remake string value functions #4145

Open
romanhabibov opened this issue Apr 12, 2019 · 8 comments
Open

sql: remake string value functions #4145

romanhabibov opened this issue Apr 12, 2019 · 8 comments
Labels
Milestone

Comments

@romanhabibov
Copy link
Contributor

romanhabibov commented Apr 12, 2019

These are functions like TRIM(), QUOTE(), SUBSTR(), HEX(), LOWER(). If we want to follow ANSI SQL, we have to:

  1. disallow using other types for args which supposed to be string. String and binary string must be allowed only.
  2. modify the signatures of the functions according to the standard (if needed).
  3. make each function with two or more args accept only args of the same type. For example, TRIM(string FROM string) - allowed, TRIM(binary_string FROM string) - disallowed.

From discussion #3879

@kyukhin kyukhin added this to the 2.3.0 milestone May 6, 2019
@pgulutzan
Copy link
Contributor

Also SELECT TYPEOF(TRIM(both X'00' FROM X'')); should return 'varbinary'. Currently it returns 'string'.

@romanhabibov romanhabibov self-assigned this Sep 25, 2019
@kyukhin kyukhin modified the milestones: 2.3.1, wishlist Sep 25, 2019
@pgulutzan
Copy link
Contributor

SELECT TRIM(1 FROM '123'); currently does nothing, and will be illegal after this fix.
In other words, this will be another case where we do not automatically cast integer to string.

@kyukhin kyukhin removed the ansi sql label Mar 22, 2021
@kyukhin kyukhin modified the milestones: wishlist, 2.10.1 Aug 19, 2021
@igormunkin igormunkin added the 3sp label Aug 20, 2021
@kyukhin kyukhin added the feature A new functionality label Aug 23, 2021
@ImeevMA
Copy link
Collaborator

ImeevMA commented Aug 30, 2021

@pgulutzan I plan to rework some SQL built-in functions. What I plan to do:

  1. Rework substr() function. Now there will be substring() function:
<character substring function> ::=
SUBSTRING (<character value expression> FROM <start position> [ FOR <string length> ])
<binary substring function> ::=
SUBSTRING (<binary value expression> FROM <start position> [ FOR <string length> ])

I removed [ USING <char length units> ] in <character substring function> definition for now. I will add it later in all functions, that needs it.

How it will work:

SUBSTRING (<character value expression> FROM <start position>) = SUBSTR(<character value expression>, MAX(1, <start position>))
SUBSTRING (<character value expression> FROM <start position> FOR <string length>) = SUBSTR(<character value expression>, MAX(1, <start position>), <string length>)

If value of <string length> is less than 0, an error is thrown.
The same for <binary value expression>. Function substr() will be removed. What do you think?

@ImeevMA ImeevMA self-assigned this Aug 30, 2021
@pgulutzan
Copy link
Contributor

@ImeevMA:

I believe that when you write MAX(1, <start position>) you mean,
in Tarantool/SQL terms, GREATEST(1, <start position>). That is fine.

I have a simple C program that simulates the SQL-2016 standard rules
for feature E021-06 SUBSTRING function in
Subclause 6.32, "<string value function>":
This is what it tells me will be returned for various situations.
SUBSTRING('abc' FROM 1 FOR 1)        a
SUBSTRING('abc' FROM -1 FOR 1)       zero-length string
SUBSTRING('abc' FROM 1 FOR -1)        data exception -- substring error
SUBSTRING('abc' FROM 3 FOR 5)         c
SUBSTRING('abc' FROM 2)                    bc
SUBSTRING('' FROM -1)                        zero-length string
SUBSTRING('abc' FROM 0 FOR 1000000000)     abc
Do you think it is correct?
It seems to me that one of the answers is different from
what you propose.

I know that SUBSTR() currently accepts DECIMAL and DOUBLE
values for start position.
(The standard would say that DECIMAL with scale 0 is okay
but we don't define that.)
Do you intend to allow non-integer values for SUBSTRING?

It is an undocumented feature that, when start position is negative,
SUBSTR() counts backwards from the end of the string. For example
SELECT SUBSTR('abc',-1,4); is 'c'
SELECT substr('abc',-1,-3); is 'ab'
...
It is not quite like Lua's string.sub
string.sub('abc',-1,4)        is 'a'
string.sub('abc',-1,-3)       is ''
Similar behaviour appears in MySQL and Oracle and SQLite
(not SQL Server).
Do you think that anyone should care if we're compatible with them?

Since you will disallow USING OCTETS, I guess the only way to 
find a byte-substring in a STRING is to cast. For example
SELECT HEX(SUBSTRING(CAST('aГc' AS VARBINARY) FROM 3 FOR 2));
is 9363, and
SELECT TYPEOF(SUBSTRING(CAST('aГc' AS VARBINARY) FROM 3 FOR 2));
is varbinary.
Can you think of a better way?

Why not allow data type UUID as well as STRING and VARBINARY?

In what Tarantool version will you deprecate and/or disallow SUBSTR()?

@ImeevMA
Copy link
Collaborator

ImeevMA commented Aug 31, 2021

@pgulutzan thank you for the answer! My answers below.

I was planning to drop SUBSTR() in the next release. However, with a little thought, I realized that it might not be the best idea to just drop it without proper deprecation. @unera @kyukhin what do you think? We can just leave it, but I suggest making it a smaller version of ANSI's SUBSTRING() function. It will have the same rules as SUBSTRING(), but it will not have its own syntax and will not be USING OCTETS.

About your program, that simulates ANSI - I believe the SQL-2011 standard (6.30 , in General Rules, 3)-f)-ii)-1)) says:

1) Let S1 be the larger of S and 1 (one).

The S here is value of the <start position>. Should we do what ANSI says, or are you proposing to change this part? I do not have text of SQL-2016 standard.

I executed code below in MySQL, PostgreSQL, MS SQL Server:

CREATE TABLE t(i INT PRIMARY KEY, s TEXT);
INSERT INTO t VALUES(1, '1234567890');
SELECT SUBSTRING(s FROM 5 FOR 3) FROM t;
SELECT SUBSTRING(s FROM -5 FOR 3) FROM t;
SELECT SUBSTRING(s FROM 5 FOR -3) FROM t;
SELECT SUBSTRING(s FROM -5 FOR -3) FROM t;

The results are:

MySQL 5.6:

567
678
NULL
NULL

PostgreSQL 9.6:

567
NULL
error
error

MS SQL Server 2017:

567
NULL
error
error

In addition, MySQL and PostgreSQL can work with both SUBSTRING(s from 5 for 3) and SUBSTRING(s, 5, 3). MS SQL Server can only work with SUBSTRING(s, 5, 3)

For us the result will be:

567
123
error
error

What do you think about such difference?

Also, after some thought, I came to the conclusion that I would add USING CHARACTERS and USING OCTETS to the SUBSTRING() function.

@pgulutzan
Copy link
Contributor

Re deprecation:
As you know, I deplore the 2.10-beta changes. But this one, SUBSTR, is relatively small.

Re "1) Let S1 be the larger of S and 1 (one)":
I agree with that,
I hope I didn't something so unclearly that you think I want to ignore the standard,
and the 2016 version has the same sub-clause as the 2011 version.

Re "I executed code below in MySQL, PostgreSQL, MS SQL Server":
The NULLs are surprising.
Is it possible that you used a client that doesn't distinguish NULL from zero-length string?
Can you confirm with "SELECT SUBSTRING(s FROM 5 FOR -3) is null FROM t;"?
As for the particular results:
MySQL of course is counting from the end, I asked what others think of such behaviour.
PostgreSQL doesn't appear to be acting like Oracle, I had expected that it would.
SQL Server's manual
https://docs.microsoft.com/en-us/sql/t-sql/functions/substring-transact-sql?view=sql-server-ver15
is what I prefer to believe.

Re "For us the result will be:"
By the time you reach f) what do you calculate are the values of the
relevant variables? My program is outputting:
./substring 1234567890 -5 3
character_value_expression: 1234567890.
start_position: -5
string_length: 3
LC: 10
S: -5
L: 3
E: -2
S > LC or E < 1. Return zero-length string
I have checked. I have checked again. I think it is right.
Therefore I think you are ignoring f)i).

Re "USING OCTETS":
Sure. Of course the result string might not be valid UTF-8 but we never promise that.

Re my other questions:
I guess you'll be able to answer after you start making the changes.

@ImeevMA
Copy link
Collaborator

ImeevMA commented Sep 2, 2021

@pgulutzan you are right: I wrongly interpreted '' as NULL and I missed f)i).

About UUID as an argument - I am not sure how to implement this. I mean, should it work as SUBSTR(CAST(uuid AS STRING), start, length) or SUBSTR(CAST(uuid AS VARBINARY), start, length)?

About non-integer arguments - if an argument can be implicitly cast to INTEGER, than there is no reason to not accept it.

In a few days I will submit a document in which I describe what I plan to do with the existing SQL built-in functions and what SQL built-in function I plan to add in the next release. There will be quite a few questions that I hope you can help us resolve.

@pgulutzan
Copy link
Contributor

I have published a blog post about the SQL substring function:
http://ocelot.ca/blog/blog/2021/09/02/the-sql-substring-function/
Although it is not specifically about Tarantool, I believe it will clarify some things for any reader of this issue. And I included the program that I was talking about earlier.

ImeevMA added a commit that referenced this issue Sep 10, 2021
This patch causes OP_AggStep and OP_BuiltinFunction to use register P1
for the number of arguments instead of register P5. This makes it easier
to use these opcodes.

Needed for #4145
ImeevMA added a commit that referenced this issue Sep 10, 2021
This patch moves the initialization of sql_context out of the VDBE. This
allows us to remove the opcodes OP_BuiltinFunction0 and OP_AggStep0,
which work in a rather strange way. Moreover, due to the changes these
opcodes make to the VDBEs, it is possible that the estimated size of the
VDBE may change during execution of VDBE, which could lead to various
problems.

Needed for #4145
ImeevMA added a commit that referenced this issue Sep 10, 2021
This patch makes it easier to get a collation by a function.

Needed for #4145
ImeevMA added a commit that referenced this issue Sep 10, 2021
This patch introduces function mem_append(). This function appends the
given string to the end of the STRING or VARBINARY contained in MEM.
In case MEM needs to increase the size of allocated memory, additional
memory is allocated in an attempt to reduce the total number of
allocations. The main use of this function is to add multiple strings to
a single MEM.

Needed for #4145
ImeevMA added a commit that referenced this issue Sep 10, 2021
We don't need this function, since it is easier to call finalizers
directly. This patch also allows us to make further simplifications.

Needed for #4145
ImeevMA added a commit that referenced this issue Sep 10, 2021
This patch makes SUM() accept DOUBLE values by default. Also, after this
patch SUM() will be able to work with DECIMAL values.

Part of #4145
Part of #6355
ImeevMA added a commit that referenced this issue Sep 10, 2021
This patch makes TOTAL() accept DOUBLE values by default. Also, after
this patch TOTAL() will be able to work with DECIMAL values.

Part of #4145
Part of #6355
kyukhin pushed a commit that referenced this issue Nov 11, 2021
This patch makes SUBSTR() work according to ANSI rules for SUBSTRING()
function. Also, SUBSTR() can now work correctly with large INTEGER
values. The SUBSTR() syntax has not changed.

Part of #4145

@TarantoolBot document
Title: SUBSTR() function

SUBSTR() now works according to the ANSI rules for SUBSTRING().

Rules for SUBSTR() with 2 arguments:
1) let the first argument be VALUE, and the second argument be START;
2) VALUE should be STRING or VARBINARY, START should be INTEGER;
3) if any of arguments is NULL, NULL is returned;
4) let POS be MAX(START - 1, 0), END be length of the VALUE;
5) if POS >= END, the result is empty string;
6) if POS < END, the result will be substring of VALUE, starting from
   the position POS to the position END.

Rules for SUBSTR() with 3 arguments:
1) let the first argument be VALUE, the second argument be START, and
   the third argument be LENGTH;
2) VALUE should be STRING or VARBINARY, START and LENGTH should be
   INTEGERs;
3) if any of arguments is NULL, NULL is returned;
4) if LENGTH < 0, an error is thrown;
5) let POS be MAX(START - 1, 0), END be START + LENGTH - 1;
6) if POS >= END, the result is empty string;
7) if POS < END, the result will be substring of VALUE, starting from
   the position POS to the position END.
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
The CHAR() function now uses the ICU macro to get characters.

Part of #4145
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
This patch refactors RANDOMBLOB() function. Also, RANDOMBLOB(0) now
returns empty string.

part of #4145
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
This patch refactors UUID() function. Also, UUID(NULL) now returns NULL.

Part of #4145
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
kyukhin pushed a commit that referenced this issue Nov 11, 2021
Some of the code is no longer used after changes in the SQL built-in
functions. This patch removes part of the unused code.

Needed for #4145
kyukhin pushed a commit that referenced this issue Nov 11, 2021
This patch removes the MEM_Dyn flag, because after changes in the SQL
built-in functions, this flag is no longer used.

Needed for #4145
kyukhin pushed a commit that referenced this issue Nov 11, 2021
This patch removes the MEM_Term flag, because after changes in the SQL
built-in functions, this flag is no longer used.

Needed for #4145
kyukhin pushed a commit that referenced this issue Nov 11, 2021
This patch forces SQL built-in function implementations to accept
'const struct Mem *' instead of just 'struct Mem *'.

Needed for #4145
@kyukhin kyukhin modified the milestones: 2.10.0-rc1, 2.10.1 Dec 30, 2021
@kyukhin kyukhin modified the milestones: 2.10.0, wishlist Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants