
bug(expr): SUBSTR of unicode produces error: "byte index _ is not a char boundary" #9065

Closed
jon-chuang opened this issue Apr 9, 2023 · 1 comment · Fixed by #9079
Labels: type/bug (Something isn't working)


jon-chuang commented Apr 9, 2023

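SUBSTR should compute offsets in Unicode characters, not in bytes. Slicing by byte offset fails with "byte index _ is not a char boundary" whenever an offset lands inside a multi-byte character such as é (2 bytes in UTF-8).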

To reproduce:
RisingWave:

=> select substr('Mér', 1, 2);
SSL SYSCALL error: EOF detected

PostgreSQL (psql):

=> select substr('Mér', 1, 2);
 substr 
--------
 Mé
(1 row)

The same happens for 'Mér'::char(3).
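The reproduction above is consistent with a byte-indexed slice ending inside é. The following is a minimal sketch of that failure mode in plain Rust, not code taken from RisingWave. The helper names char_to_byte_offset and slice_bytes_checked are hypothetical, and the string is written as "M\u{e9}r" on the assumption that é is the precomposed code point U+00E9.

```rust
/// Hypothetical helper: byte offset of the `n`-th character (0-based),
/// or `s.len()` when `n` is at or past the end of the string.
fn char_to_byte_offset(s: &str, n: usize) -> usize {
    s.char_indices().nth(n).map_or(s.len(), |(i, _)| i)
}

/// Hypothetical helper: checked byte slicing. `str::get` returns `None`
/// where `&s[start..end]` would panic with
/// "byte index _ is not a char boundary".
fn slice_bytes_checked(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

For "M\u{e9}r" the string is 4 bytes long, byte index 2 is not a char boundary, and the first two characters end at byte offset 3, so slice_bytes_checked(s, 0, 2) yields None while &s[..char_to_byte_offset(s, 2)] yields "Mé".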


Resources:

src/expr/src/vector_op/substr.rs:48:23

pub fn substr_start_for(s: &str, start: i32, count: i32, writer: &mut dyn Write) -> Result<()> {
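One possible shape for a character-based version is sketched below. This is not the fix merged in #9079, and it does not use the writer-based signature above. The function name substr_chars and its Option return type are assumptions made for illustration. It follows the PostgreSQL convention that start is 1-based, that positions before 1 still consume part of the count, and that a negative count is rejected.

```rust
/// Hypothetical character-based substring, counting Unicode scalar values
/// rather than bytes. Returns `None` for a negative `count`.
fn substr_chars(s: &str, start: i32, count: i32) -> Option<String> {
    if count < 0 {
        return None;
    }
    // Use i64 so that `start + count` cannot overflow.
    // `begin` and `end` are 0-based character positions, clamped at 0.
    let begin = (start as i64 - 1).max(0);
    let end = (start as i64 - 1 + count as i64).max(0);
    // `end >= begin` always holds here because `count >= 0`.
    let take = (end - begin) as usize;
    Some(s.chars().skip(begin as usize).take(take).collect())
}
```

With this sketch, substr_chars("Mér", 1, 2) gives Some("Mé"), matching the psql output in the report. It walks the string with chars(), so it never slices at a byte index.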

https://stackoverflow.com/questions/4249745/does-postgresql-varchar-count-using-unicode-character-length-or-ascii-character

@github-actions github-actions bot added this to the release-0.19 milestone Apr 9, 2023
@jon-chuang jon-chuang changed the title bug(expr): Substr of unicode: "byte index _ is not a char boundary" bug(expr): SUBSTR of unicode produces error: "byte index _ is not a char boundary" Apr 9, 2023
@jon-chuang jon-chuang added type/bug Something isn't working and removed type/feature labels Apr 9, 2023
xiangjinwu (Contributor) commented

A similar problem exists for other string functions:

  • overlay
  • ascii
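For the two functions listed above, a character-based approach might look like the sketch below. Both function names are hypothetical and neither is RisingWave's implementation. The overlay sketch is deliberately simplified: it only accepts start >= 1 and count >= 0 and returns None otherwise. The ascii sketch assumes the PostgreSQL behaviour under UTF-8, which is to return the code point of the first character, and 0 for an empty string.

```rust
/// Hypothetical: code point of the first character, or 0 for an empty string.
fn ascii_first_char(s: &str) -> i32 {
    s.chars().next().map_or(0, |c| c as i32)
}

/// Hypothetical: replace `count` characters of `s`, starting at the 1-based
/// character position `start`, with `new`. Counts characters, not bytes.
fn overlay_chars(s: &str, new: &str, start: i32, count: i32) -> Option<String> {
    if start < 1 || count < 0 {
        return None; // simplification: reject out-of-range arguments
    }
    let mut chars = s.chars();
    // Keep the characters before the replaced region.
    let mut out: String = chars.by_ref().take((start - 1) as usize).collect();
    out.push_str(new);
    // Drop `count` characters, then keep the remainder.
    out.extend(chars.skip(count as usize));
    Some(out)
}
```

For example, overlay_chars("Mér", "XX", 2, 1) replaces only é and gives Some("MXXr"), and ascii_first_char("é") gives 233.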


3 participants