Preserving char-offsets #5

Closed
spookyQubit opened this issue Mar 15, 2022 · 2 comments

@spookyQubit

Hi @uhermjakob , thanks a lot for making the tokenizer public.

We are using utoken in one of our projects where we have the requirement that each token be associated with its offset in the original text. Currently, we have it working in the following manner:

from utoken import utokenize

text = 'Hello world!' 

tokenizer = utokenize.Tokenizer()
chart = utokenize.Chart(s=text, snt_id='id-0')
tokenizer.next_tok(None, text, chart, {}, 'eng', None)
tokens, offsets = [], []
for tok in chart.tokens:
    # hard_from/hard_to hold the token's character span in the original text
    s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
    tokens.append(text[s:e])
    offsets.append((s, e))

print(tokens, offsets)

This works fine and we get the correct output:

['Hello', 'world', '!'] [(0, 5), (6, 11), (11, 12)]

However, when we change the text to include repeated punctuation, we run into an error. To reproduce, I am just changing the text from Hello world! to a semicolon (;) repeated 200 times:

from utoken import utokenize

text = ';' * 200  # this text causes the error. 

tokenizer = utokenize.Tokenizer()
chart = utokenize.Chart(s=text, snt_id='id-0')
tokenizer.next_tok(None, text, chart, {}, 'eng', None)
tokens, offsets = [], []
for tok in chart.tokens:
    s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
    tokens.append(text[s:e])
    offsets.append((s, e))

print(tokens, offsets)

The first and last few lines of the call stack are:

Traceback (most recent call last):
  File "/Users/shantanu/PycharmProjects/isi-better/t-phrase/tests/test_ulf_token.py", line 7, in <module>
    tokenizer.next_tok(None, text, chart, {}, 'eng', None)
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 820, in next_tok
    s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 962, in normalize_characters
    return self.next_tok(this_function, s, chart, ht, lang_code, line_id, offset)
...
...
File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 734, in rec_tok
    tokenizations.append(calling_function(pre, chart, ht, lang_code, line_id, offset1))
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 1652, in tokenize_punctuation_according_to_resource_entries
    return self.rec_tok([token], [start_position], s, offset, 'PUNCT-E',
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 734, in rec_tok
    tokenizations.append(calling_function(pre, chart, ht, lang_code, line_id, offset1))
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 1652, in tokenize_punctuation_according_to_resource_entries
    return self.rec_tok([token], [start_position], s, offset, 'PUNCT-E',
  File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 714, in rec_tok
    n_chars = len(self.current_orig_s)
TypeError: object of type 'NoneType' has no len()

Is our current approach to keeping track of char-offsets incorrect, and is that why we are running into this issue? Or is there a different way to tokenize and keep track of char-offsets within utoken?

Thanks.

@uhermjakob
Owner

Thanks for reporting this, Shantanu!

I checked, and the problem is a combination of (1) method utokenize_string() lacking
a chart option (as available for the CLI), (2) some of the code being recursive,
which runs into deep recursion limits for sentences containing e.g. 200+ semicolons,
and (3) your somewhat unorthodox (but clever) use of method next_tok(), which is
supposed to be internal only.

Solutions:
(i) Short-term
There's a quick workaround: add the following line after initializing the tokenizer:
tokenizer.current_orig_s = text
Please let me know if that does not resolve the problem for you.
(See below for full script.)

(ii) Medium term (hopefully this week)
I will update method utokenize_string() to include a chart option.
This will facilitate calling utoken with offset info from within Python.
It will still have a recursion limit of 250, but at least will not break,
just stop tokenizing. I assume that this problem is very rare. (Please tell
me if not, which would add urgency to a longer-term solution.)

(iii) Longer-term
I will rewrite some of the relevant code from recursive to iterative. I already
did some of that transformation for other functions earlier. The recursive
code is a bit more elegant, but does not fare well for really long/weird
sentences. Note: The "logical" recursion in the code is sound; it's just
that Python interpreters at some point assume (around level 300) that a
recursion is infinite even if that's not actually the case, and my limit is
there to preempt that.
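
To illustrate the kind of recursive-to-iterative transformation meant here (a toy sketch only, not utoken code): both functions below implement the same left-to-right "take the next token, then process the rest" logic, but the recursive one uses a Python stack frame per token and fails on long punctuation runs, while the iterative one does not.

import re

TOKEN_RE = re.compile(r'\w+|[^\w\s]')  # toy pattern: words or single punctuation marks

def tokenize_recursive(s, start=0):
    m = TOKEN_RE.search(s, start)
    if not m:
        return []
    # one stack frame per token -> RecursionError on e.g. ';' * 5000
    return [(m.group(), m.start(), m.end())] + tokenize_recursive(s, m.end())

def tokenize_iterative(s):
    tokens, pos = [], 0
    while True:
        m = TOKEN_RE.search(s, pos)
        if not m:
            return tokens
        tokens.append((m.group(), m.start(), m.end()))
        pos = m.end()

print(len(tokenize_iterative(';' * 5000)))   # 5000 tokens, no recursion depth involved
# tokenize_recursive(';' * 5000)             # would hit Python's recursion limit

And here is the full script with the short-term workaround from (i):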

from utoken import utokenize

text = ';' * 200  # this text causes the error.

tokenizer = utokenize.Tokenizer()
tokenizer.current_orig_s = text  # ADDED LINE
chart = utokenize.Chart(s=text, snt_id='id-0')
tokenizer.next_tok(None, text, chart, {}, 'eng', None)
tokens, offsets = [], []
for tok in chart.tokens:
    s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
    tokens.append(text[s:e])
    offsets.append((s, e))

print(tokens, offsets)

Output (stderr/stdout):
Alert: Exceeded general tokenization recursion depth of 150 in line None (200 characters, 1 words).
[';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';'] [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16), (16, 17), (17, 18), (18, 19), (19, 20), (20, 21), (21, 22), (22, 23), (23, 24), (24, 25), (25, 26), (26, 27), (27, 28), (28, 29), (29, 30), (30, 31), (31, 32), (32, 33), (33, 34), (34, 35), (35, 36), (36, 37), (37, 38), (38, 39), (39, 40), (40, 41), (41, 42), (42, 43), (43, 44), (44, 45), (45, 46), (46, 47), (47, 48), (48, 49), (49, 50), (50, 51), (51, 52), (52, 53), (53, 54), (54, 55), (55, 56), (56, 57), (57, 58), (58, 59), (59, 60), (60, 61), (61, 62), (62, 63), (63, 64), (64, 65), (65, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 72), (72, 73), (73, 74), (74, 75), (75, 76), (76, 77), (77, 78), (78, 79), (79, 80), (80, 81), (81, 82), (82, 83), (83, 84), (84, 85), (85, 86), (86, 87), (87, 88), (88, 89), (89, 90), (90, 91), (91, 92), (92, 93), (93, 94), (94, 95), (95, 96), (96, 97), (97, 98), (98, 99), (99, 100), (100, 101), (101, 102), (102, 103), (103, 104), (104, 105), (105, 106), (106, 107), (107, 108), (108, 109), (109, 110), (110, 111), (111, 112), (112, 113), (113, 114), (114, 115), (115, 116), (116, 117), (117, 118), (118, 119), (119, 120), (120, 121), (121, 122), (122, 123), (123, 124), (124, 125), (125, 126), (126, 127), (127, 128), (128, 129), (129, 130), (130, 131), (131, 132), (132, 133), (133, 134), (134, 135), (135, 136), (136, 137), (137, 138), (138, 139), (139, 140), (140, 141), (141, 142), (142, 143), (143, 144), (144, 145), (145, 146), (146, 147), (147, 148), (148, 149), (149, 150), (150, 151), (151, 152), (152, 153), (153, 154), (154, 155), (155, 156), (156, 157), (157, 158), (158, 159), (159, 160), (160, 161), (161, 162), (162, 163), (163, 164), (164, 165), (165, 166), (166, 167), (167, 168), (168, 169), (169, 170), (170, 171), (171, 172), (172, 173), (173, 174), (174, 175), (175, 176), (176, 177), (177, 178), (178, 179), (179, 180), (180, 181), (181, 182), (182, 183), (183, 184), (184, 185), (185, 186), (186, 187), (187, 188), (188, 189), (189, 190), (190, 191), (191, 192), (192, 193), (193, 194), (194, 195), (195, 196), (196, 197), (197, 198), (198, 199), (199, 200)]

Changing from 200 to 300 semicolons:

Alert: Exceeded general tokenization recursion depth of 150 in line None (300 characters, 1 words).
Warning: Exceeded general tokenization recursion depth of 250 in line None. Will skip remaining tokenization steps for this sentence.
[';;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';', ';'] [(0, 50), (50, 51), (51, 52), (52, 53), (53, 54), (54, 55), (55, 56), (56, 57), (57, 58), (58, 59), (59, 60), (60, 61), (61, 62), (62, 63), (63, 64), (64, 65), (65, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 72), (72, 73), (73, 74), (74, 75), (75, 76), (76, 77), (77, 78), (78, 79), (79, 80), (80, 81), (81, 82), (82, 83), (83, 84), (84, 85), (85, 86), (86, 87), (87, 88), (88, 89), (89, 90), (90, 91), (91, 92), (92, 93), (93, 94), (94, 95), (95, 96), (96, 97), (97, 98), (98, 99), (99, 100), (100, 101), (101, 102), (102, 103), (103, 104), (104, 105), (105, 106), (106, 107), (107, 108), (108, 109), (109, 110), (110, 111), (111, 112), (112, 113), (113, 114), (114, 115), (115, 116), (116, 117), (117, 118), (118, 119), (119, 120), (120, 121), (121, 122), (122, 123), (123, 124), (124, 125), (125, 126), (126, 127), (127, 128), (128, 129), (129, 130), (130, 131), (131, 132), (132, 133), (133, 134), (134, 135), (135, 136), (136, 137), (137, 138), (138, 139), (139, 140), (140, 141), (141, 142), (142, 143), (143, 144), (144, 145), (145, 146), (146, 147), (147, 148), (148, 149), (149, 150), (150, 151), (151, 152), (152, 153), (153, 154), (154, 155), (155, 156), (156, 157), (157, 158), (158, 159), (159, 160), (160, 161), (161, 162), (162, 163), (163, 164), (164, 165), (165, 166), (166, 167), (167, 168), (168, 169), (169, 170), (170, 171), (171, 172), (172, 173), (173, 174), (174, 175), (175, 176), (176, 177), (177, 178), (178, 179), (179, 180), (180, 181), (181, 182), (182, 183), (183, 184), (184, 185), (185, 186), (186, 187), (187, 188), (188, 189), (189, 190), (190, 191), (191, 192), (192, 193), (193, 194), (194, 195), (195, 196), (196, 197), (197, 198), (198, 199), (199, 200), (200, 201), (201, 202), (202, 203), (203, 204), (204, 205), (205, 206), (206, 207), (207, 208), (208, 209), (209, 210), (210, 211), (211, 212), (212, 213), (213, 214), (214, 215), (215, 216), (216, 217), (217, 218), (218, 219), (219, 220), (220, 221), (221, 222), (222, 223), (223, 224), (224, 225), (225, 226), (226, 227), (227, 228), (228, 229), (229, 230), (230, 231), (231, 232), (232, 233), (233, 234), (234, 235), (235, 236), (236, 237), (237, 238), (238, 239), (239, 240), (240, 241), (241, 242), (242, 243), (243, 244), (244, 
245), (245, 246), (246, 247), (247, 248), (248, 249), (249, 250), (250, 251), (251, 252), (252, 253), (253, 254), (254, 255), (255, 256), (256, 257), (257, 258), (258, 259), (259, 260), (260, 261), (261, 262), (262, 263), (263, 264), (264, 265), (265, 266), (266, 267), (267, 268), (268, 269), (269, 270), (270, 271), (271, 272), (272, 273), (273, 274), (274, 275), (275, 276), (276, 277), (277, 278), (278, 279), (279, 280), (280, 281), (281, 282), (282, 283), (283, 284), (284, 285), (285, 286), (286, 287), (287, 288), (288, 289), (289, 290), (290, 291), (291, 292), (292, 293), (293, 294), (294, 295), (295, 296), (296, 297), (297, 298), (298, 299), (299, 300)]

@spookyQubit
Author

Hi @uhermjakob, thanks a lot for the quick reply. Setting tokenizer.current_orig_s = text does seem to resolve the crash we were seeing earlier. -- Thanks.
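
For anyone else who needs char-offsets: the same pattern, consolidated into a small helper. The function name tokenize_with_offsets and the default arguments are just illustrative; every call inside it appears earlier in this thread, including the workaround line.

from utoken import utokenize

def tokenize_with_offsets(text, lang_code='eng', snt_id='id-0'):
    """Return parallel lists of tokens and their (start, end) char offsets in text."""
    tokenizer = utokenize.Tokenizer()
    tokenizer.current_orig_s = text  # workaround from above
    chart = utokenize.Chart(s=text, snt_id=snt_id)
    tokenizer.next_tok(None, text, chart, {}, lang_code, None)
    tokens, offsets = [], []
    for tok in chart.tokens:
        s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
        tokens.append(text[s:e])
        offsets.append((s, e))
    return tokens, offsets

print(tokenize_with_offsets('Hello world!'))
# (['Hello', 'world', '!'], [(0, 5), (6, 11), (11, 12)])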

Closing this issue. (Please feel free to reopen if you want to use this issue to work on the medium-term/longer-term solutions.)
