Preserving char-offsets #5

Closed
spookyQubit opened this issue Mar 15, 2022 · 2 comments

@spookyQubit

Hi @uhermjakob , thanks a lot for making the tokenizer public.

We are using utoken in one of our projects where we have the requirement that each token be associated with its offset in the original text. Currently, we have it working in the following manner:

from utoken import utokenize

text = 'Hello world!' 

tokenizer = utokenize.Tokenizer()
chart = utokenize.Chart(s=text, snt_id='id-0')
tokenizer.next_tok(None, text, chart, {}, 'eng', None)
tokens, offsets = [], []
for tok in chart.tokens:
    # hard_from/hard_to hold the token's character span in the original text
    s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
    tokens.append(text[s:e])
    offsets.append((s, e))

print(tokens, offsets)

This works fine and we get the correct output:

['Hello', 'world', '!'] [(0, 5), (6, 11), (11, 12)]

However, when we change the text to include repeated punctuation, we run into an error. To reproduce, I am just changing the text from Hello world! to a semicolon (;) repeated 200 times:

from utoken import utokenize

text = ';' * 200  # this text causes the error. 

tokenizer = utokenize.Tokenizer()
chart = utokenize.Chart(s=text, snt_id='id-0')
tokenizer.next_tok(None, text, chart, {}, 'eng', None)
tokens, offsets = [], []
for tok in chart.tokens:
    s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
    tokens.append(text[s:e])
    offsets.append((s, e))

print(tokens, offsets)

The first and last few lines of the call stack are:

Traceback (most recent call last):
  File "/Users/shantanu/PycharmProjects/isi-better/t-phrase/tests/test_ulf_token.py", line 7, in <module>
    tokenizer.next_tok(None, text, chart, {}, 'eng', None)
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 820, in next_tok
    s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 962, in normalize_characters
    return self.next_tok(this_function, s, chart, ht, lang_code, line_id, offset)
...
...
File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 734, in rec_tok
    tokenizations.append(calling_function(pre, chart, ht, lang_code, line_id, offset1))
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 1652, in tokenize_punctuation_according_to_resource_entries
    return self.rec_tok([token], [start_position], s, offset, 'PUNCT-E',
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 734, in rec_tok
    tokenizations.append(calling_function(pre, chart, ht, lang_code, line_id, offset1))
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 1652, in tokenize_punctuation_according_to_resource_entries
    return self.rec_tok([token], [start_position], s, offset, 'PUNCT-E',
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 714, in rec_tok
    n_chars = len(self.current_orig_s)
TypeError: object of type 'NoneType' has no len()

Is our current approach to keeping track of char-offsets incorrect, and is that why we are running into this issue? Or is there a different way to tokenize and keep track of char-offsets within utoken?

Thanks.

@uhermjakob
Owner

Thanks for reporting this, Shantanu!

I checked, and the problem is a combination of (1) method utokenize_string() lacking
a chart option (as available for the CLI), (2) some of the code being recursive,
which runs into deep recursion limits for sentences containing e.g. 200+ semicolons,
and (3) your somewhat unorthodox (but clever) use of method next_tok(), which is
supposed to be internal only.

Solutions:
(i) Short-term
There's a quick workaround: add the following line after initializing the tokenizer:
tokenizer.current_orig_s = text
Please let me know if that does not resolve the problem for you.
(See below for full script.)

(ii) Medium term (hopefully this week)
I will update method utokenize_string() to include a chart option.
This will facilitate calling utoken with offset info from within Python.
It will still have a recursion limit of 250, but at least will not break,
just stop tokenizing. I assume that this problem is very rare. (Please tell
me if not, which would add urgency to a longer-term solution.)

(iii) Longer-term
I will rewrite some of the relevant code from recursive to iterative. I already
did some of that transformation for other functions earlier. The recursive
code is a bit more elegant, but does not fare well for really long/weird
sentences. Note: The "logical" recursion in the code is sound; it's just
that Python interpreters at some point assume (around level 300) that a
recursion is infinite even if that's not actually the case, and my limit is
there to preempt that.
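
To illustrate the kind of recursive-to-iterative transformation meant here (a toy sketch only, not utoken code): both functions below implement the same left-to-right "take the next token, then process the rest" logic, but the recursive one uses a Python stack frame per token and fails on long punctuation runs, while the iterative one does not.

import re

TOKEN_RE = re.compile(r'\w+|[^\w\s]')  # toy pattern: words or single punctuation marks

def tokenize_recursive(s, start=0):
    m = TOKEN_RE.search(s, start)
    if not m:
        return []
    # one stack frame per token -> RecursionError on e.g. ';' * 5000
    return [(m.group(), m.start(), m.end())] + tokenize_recursive(s, m.end())

def tokenize_iterative(s):
    tokens, pos = [], 0
    while True:
        m = TOKEN_RE.search(s, pos)
        if not m:
            return tokens
        tokens.append((m.group(), m.start(), m.end()))
        pos = m.end()

print(len(tokenize_iterative(';' * 5000)))   # 5000 tokens, no recursion depth involved
# tokenize_recursive(';' * 5000)             # would hit Python's recursion limit

And here is the full script with the short-term workaround from (i):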

from utoken import utokenize

text = ';' * 200  # this text causes the error.

tokenizer = utokenize.Tokenizer()
tokenizer.current_orig_s = text  # ADDED LINE
chart = utokenize.Chart(s=text, snt_id='id-0')
tokenizer.next_tok(None, text, chart, {}, 'eng', None)
tokens, offsets = [], []
for tok in chart.tokens:
    s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
    tokens.append(text[s:e])
    offsets.append((s, e))

print(tokens, offsets)

Output (stderr/stdout):
Alert: Exceeded general tokenization recursion depth of 150 in line None (200 characters, 1 words).
[';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';'] [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16), (16, 17), (17, 18), (18, 19), (19, 20), (20, 21), (21, 22), (22, 23), (23, 24), (24, 25), (25, 26), (26, 27), (27, 28), (28, 29), (29, 30), (30, 31), (31, 32), (32, 33), (33, 34), (34, 35), (35, 36), (36, 37), (37, 38), (38, 39), (39, 40), (40, 41), (41, 42), (42, 43), (43, 44), (44, 45), (45, 46), (46, 47), (47, 48), (48, 49), (49, 50), (50, 51), (51, 52), (52, 53), (53, 54), (54, 55), (55, 56), (56, 57), (57, 58), (58, 59), (59, 60), (60, 61), (61, 62), (62, 63), (63, 64), (64, 65), (65, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 72), (72, 73), (73, 74), (74, 75), (75, 76), (76, 77), (77, 78), (78, 79), (79, 80), (80, 81), (81, 82), (82, 83), (83, 84), (84, 85), (85, 86), (86, 87), (87, 88), (88, 89), (89, 90), (90, 91), (91, 92), (92, 93), (93, 94), (94, 95), (95, 96), (96, 97), (97, 98), (98, 99), (99, 100), (100, 101), (101, 102), (102, 103), (103, 104), (104, 105), (105, 106), (106, 107), (107, 108), (108, 109), (109, 110), (110, 111), (111, 112), (112, 113), (113, 114), (114, 115), (115, 116), (116, 117), (117, 118), (118, 119), (119, 120), (120, 121), (121, 122), (122, 123), (123, 124), (124, 125), (125, 126), (126, 127), (127, 128), (128, 129), (129, 130), (130, 131), (131, 132), (132, 133), (133, 134), (134, 135), (135, 136), (136, 137), (137, 138), (138, 139), (139, 140), (140, 141), (141, 142), (142, 143), (143, 144), (144, 145), (145, 146), (146, 147), (147, 148), (148, 149), (149, 150), (150, 151), (151, 152), (152, 153), (153, 154), (154, 155), (155, 156), (156, 157), (157, 158), (158, 159), (159, 160), (160, 161), (161, 162), (162, 163), (163, 164), (164, 165), (165, 166), (166, 167), (167, 168), (168, 169), (169, 170), (170, 171), (171, 172), (172, 173), (173, 174), (174, 175), (175, 176), (176, 177), (177, 178), (178, 179), (179, 180), (180, 181), (181, 182), (182, 183), (183, 184), (184, 185), (185, 186), (186, 187), (187, 188), (188, 189), (189, 190), (190, 191), (191, 192), (192, 193), (193, 194), (194, 195), (195, 196), (196, 197), (197, 198), (198, 199), (199, 200)]

Changing from 200 to 300 semicolons:

Alert: Exceeded general tokenization recursion depth of 150 in line None (300 characters, 1 words).
Warning: Exceeded general tokenization recursion depth of 250 in line None. Will skip remaining tokenization steps for this sentence.
[';;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';'] [(0, 50), (50, 51), (51, 52), (52, 53), (53, 54), (54, 55), (55, 56), (56, 57), (57, 58), (58, 59), (59, 60), (60, 61), (61, 62), (62, 63), (63, 64), (64, 65), (65, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 72), (72, 73), (73, 74), (74, 75), (75, 76), (76, 77), (77, 78), (78, 79), (79, 80), (80, 81), (81, 82), (82, 83), (83, 84), (84, 85), (85, 86), (86, 87), (87, 88), (88, 89), (89, 90), (90, 91), (91, 92), (92, 93), (93, 94), (94, 95), (95, 96), (96, 97), (97, 98), (98, 99), (99, 100), (100, 101), (101, 102), (102, 103), (103, 104), (104, 105), (105, 106), (106, 107), (107, 108), (108, 109), (109, 110), (110, 111), (111, 112), (112, 113), (113, 114), (114, 115), (115, 116), (116, 117), (117, 118), (118, 119), (119, 120), (120, 121), (121, 122), (122, 123), (123, 124), (124, 125), (125, 126), (126, 127), (127, 128), (128, 129), (129, 130), (130, 131), (131, 132), (132, 133), (133, 134), (134, 135), (135, 136), (136, 137), (137, 138), (138, 139), (139, 140), (140, 141), (141, 142), (142, 143), (143, 144), (144, 145), (145, 146), (146, 147), (147, 148), (148, 149), (149, 150), (150, 151), (151, 152), (152, 153), (153, 154), (154, 155), (155, 156), (156, 157), (157, 158), (158, 159), (159, 160), (160, 161), (161, 162), (162, 163), (163, 164), (164, 165), (165, 166), (166, 167), (167, 168), (168, 169), (169, 170), (170, 171), (171, 172), (172, 173), (173, 174), (174, 175), (175, 176), (176, 177), (177, 178), (178, 179), (179, 180), (180, 181), (181, 182), (182, 183), (183, 184), (184, 185), (185, 186), (186, 187), (187, 188), (188, 189), (189, 190), (190, 191), (191, 192), (192, 193), (193, 194), (194, 195), (195, 196), (196, 197), (197, 198), (198, 199), (199, 200), (200, 201), (201, 202), (202, 203), (203, 204), (204, 205), (205, 206), (206, 207), (207, 208), (208, 209), (209, 210), (210, 211), (211, 212), (212, 213), (213, 214), (214, 215), (215, 216), (216, 217), (217, 218), (218, 219), (219, 220), (220, 221), (221, 222), (222, 223), (223, 224), (224, 225), (225, 226), (226, 227), (227, 228), (228, 229), (229, 230), (230, 231), (231, 232), (232, 233), (233, 234), (234, 235), (235, 236), (236, 237), (237, 238), (238, 239), (239, 240), (240, 241), (241, 242), (242, 243), (243, 244), (244, 
245), (245, 246), (246, 247), (247, 248), (248, 249), (249, 250), (250, 251), (251, 252), (252, 253), (253, 254), (254, 255), (255, 256), (256, 257), (257, 258), (258, 259), (259, 260), (260, 261), (261, 262), (262, 263), (263, 264), (264, 265), (265, 266), (266, 267), (267, 268), (268, 269), (269, 270), (270, 271), (271, 272), (272, 273), (273, 274), (274, 275), (275, 276), (276, 277), (277, 278), (278, 279), (279, 280), (280, 281), (281, 282), (282, 283), (283, 284), (284, 285), (285, 286), (286, 287), (287, 288), (288, 289), (289, 290), (290, 291), (291, 292), (292, 293), (293, 294), (294, 295), (295, 296), (296, 297), (297, 298), (298, 299), (299, 300)]

@spookyQubit
Author

Hi @uhermjakob, thanks a lot for the quick reply. Setting tokenizer.current_orig_s = text does seem to resolve the crash we were seeing earlier. -- Thanks.
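
For anyone else who needs char-offsets: the same pattern, consolidated into a small helper. The function name tokenize_with_offsets and the default arguments are just illustrative; every call inside it appears earlier in this thread, including the workaround line.

from utoken import utokenize

def tokenize_with_offsets(text, lang_code='eng', snt_id='id-0'):
    """Return parallel lists of tokens and their (start, end) char offsets in text."""
    tokenizer = utokenize.Tokenizer()
    tokenizer.current_orig_s = text  # workaround from above
    chart = utokenize.Chart(s=text, snt_id=snt_id)
    tokenizer.next_tok(None, text, chart, {}, lang_code, None)
    tokens, offsets = [], []
    for tok in chart.tokens:
        s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
        tokens.append(text[s:e])
        offsets.append((s, e))
    return tokens, offsets

print(tokenize_with_offsets('Hello world!'))
# (['Hello', 'world', '!'], [(0, 5), (6, 11), (11, 12)])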

Closing this issue. (Please feel free to reopen if you want to use this issue to work on the medium-term/longer-term solutions.)
