index() function returns wrong offset for non-ascii chars #1430

Open
atschabu opened this issue Jun 19, 2017 · 3 comments · May be fixed by #3065

@atschabu

I'm trying to strip trailing text from a string. Something like sub("!.*"; "") doesn't work, because it gives me a segmentation fault when the text is too long. So I tried this route instead:

$ jq '.msg | .[0:index("!")]'

which works fine with input like:
{"msg": "hello world!"}
but fails when the text contains non-ASCII (multi-byte) characters:
{"msg": "здравствуй мир!"}

$ echo '{"msg": "здравствуй мир!"}' | jq '.msg | index("!")'
27
$ echo '{"msg": "hello world!"}' | jq '.msg | index("!")'
11
$ jq --version
jq-1.5
$ uname -a
Darwin atschabu-C02SF0UTG8WM 15.6.0 Darwin Kernel Version 15.6.0: Tue Apr 11 16:00:51 PDT 2017; root:xnu-3248.60.11.5.3~1/RELEASE_X86_64 x86_64
@pkoppstein
Contributor

pkoppstein commented Jun 19, 2017

There is some documentation about this on the "Pitfalls" page (https://github.com/stedolan/jq/wiki/How-to:-Avoid-Pitfalls).

In brief, you can use match/1:

echo '{"msg": "здравствуй мир!"}' | jq '.msg | match("!").offset'
14

This works in jq 1.5 and later.
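
To make the two numbers in this thread concrete, here is a minimal Python sketch (not jq code, and not a description of jq's internals). It assumes the 27 reported by index("!") is an offset in UTF-8 bytes and the 14 reported by match("!").offset is an offset in Unicode code points; the variable names are illustrative only.

```python
# Illustrative sketch: byte offset vs. code point offset of "!" in a string.
msg = "здравствуй мир!"

# Offset counted in Unicode code points (assumed to be what match/1 reports).
codepoint_offset = msg.index("!")

# Offset counted in UTF-8 bytes (assumed to be what index/1 reported in jq 1.5).
# Each Cyrillic letter here takes 2 bytes in UTF-8; the space and "!" take 1.
byte_offset = msg.encode("utf-8").index("!".encode("utf-8"))

# Slicing by code points drops "!" and everything after it.
stripped = msg[:codepoint_offset]

# For pure ASCII input the two offsets coincide, which is why the
# "hello world!" case appeared to work.
ascii_msg = "hello world!"
ascii_codepoint_offset = ascii_msg.index("!")
ascii_byte_offset = ascii_msg.encode("utf-8").index(b"!")

print(codepoint_offset, byte_offset, stripped)
print(ascii_codepoint_offset, ascii_byte_offset)
```

Under those assumptions, slicing with the byte offset overshoots the end of the string, so nothing is stripped, while the code point offset cuts at the intended position.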

By the way, could you please give more details about the failure of sub/2? Here is an illustration that it does not always fail when given a long string:

 jq1.5 -n '[range(0;100000) | "a"] | join("") + "!xx" | sub("!.*";"") | length'
100000

@atschabu
Author

My bad. I hadn't even realized there was a wiki. I took all my information from the manual, which doesn't mention anything about index being byte-wise. I'll give match a go.

I still haven't figured out exactly when the segmentation fault happens, as I haven't yet found the input that produces it. For now I'm assuming it is related to issue 922 until I can prove otherwise.

I guess we can close this one; I'll open a new ticket if my segmentation fault turns out not to be related to 922.

@nicowilliams
Contributor

No, this is a bug. We should fix it.
