Preserving char-offsets #5
Thanks for reporting this, Shantanu! I checked, and the problem is a combination of (1) method `utokenize_string()` lacking …

Solutions:
(ii) Medium term (hopefully this week): …
(iii) Longer-term: …
Output (stderr/stdout), after changing from 200 to 300 semicolons: `Alert: Exceeded general tokenization recursion depth of 150 in line None (300 characters, 1 words).`
Hi @uhermjakob, thanks a lot for the quick reply. Setting … Closing this issue. (Please feel free to reopen if you want to use this issue to work on the …)
Hi @uhermjakob, thanks a lot for making the tokenizer public.
We are using utoken in one of our projects where we require that each token be associated with its offset in the original text. Currently, we have it working in the following manner:
This works fine and we get the correct output:
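The original snippet and its output were lost from this page, so here is a minimal sketch of one way to track offsets. Only `utokenize_string()` is mentioned elsewhere in this thread; the `token_offsets` helper below, and the assumption that utoken splits `Hello world!` into `Hello`, `world`, `!`, are illustrative and not part of utoken itself.

```python
# Sketch of char-offset tracking: search for each token in the original
# text, starting from a cursor that advances past the previous match.
# The tokens themselves would come from utoken, e.g. via its
# utokenize_string() method (exact API may differ from this sketch).

def token_offsets(text: str, tokens: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for each token in text."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)  # next occurrence at or after cursor
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end  # never re-match earlier text
    return offsets

# Assumed tokenization of "Hello world!" into three tokens:
text = "Hello world!"
tokens = ["Hello", "world", "!"]
print(token_offsets(text, tokens))  # [(0, 5), (6, 11), (11, 12)]
```

One caveat of this cursor-based search: if the tokenizer normalizes a token in any way (e.g. rewrites quotes), `str.find` will miss it, so real code may need a fuzzier alignment step.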
However, when we change the text to include repeated punctuation, we run into an error. To reproduce, I am just changing the text from
Hello world!
to `;` repeated 200 times. The first and last few lines of the call stack are:
…
Is our current way of keeping track of char-offsets incorrect, and is that why we are running into this issue? Is there a different way to tokenize and keep track of char-offsets within utoken?
Thanks.