
Better sentence tokenization needed in the InteractiveText #334

Open
mircealungu opened this issue Mar 29, 2024 · 4 comments
@mircealungu
Member
mircealungu commented Mar 29, 2024

When the original text is missing a space after the end of a sentence, the last word of the previous sentence and the first word of the next are treated as a single token.

[screenshot]
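A minimal sketch of one way to handle this case: detect a sentence-ending punctuation mark glued directly to a following capitalized word and split the token there. The function name and regex are illustrative, not from the Zeeguu codebase, and the heuristic will still over-split abbreviations like "U.S.A" (the case discussed below).

```javascript
// Hypothetical sketch: split a token where sentence-ending punctuation is
// immediately followed by an uppercase letter, e.g. "ended.Next".
// \p{Lu} matches any Unicode uppercase letter, so non-ASCII text works too.
function splitJoinedSentences(token) {
  const match = token.match(/^(.+[.!?])(\p{Lu}.*)$/u);
  return match ? [match[1], match[2]] : [token];
}

console.log(splitJoinedSentences("word.Next")); // → ["word.", "Next"]
console.log(splitJoinedSentences("hello"));     // → ["hello"]
```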

Equally wrong is the opposite case, where two words are wrongly disconnected because one of them is an abbreviation ending in a period:

[screenshot]

And another imperfect situation:
[screenshot]

@tfnribeiro
Contributor

I can look into this. Would you share those text ids so I can take a look at those examples?

For the second example, I am not exactly sure what the expected behaviour would be. For the other two, I think we just need to improve the pattern matching in the algorithm. We could also consider using a tokenizer from the API, to ensure more consistency throughout.

@tfnribeiro
Contributor

I have something like this now:

[screenshot]

Essentially, I have added a second pass to the process that splits the text into word tokens based on whitespace: it checks each token for joined words and these special tokens, which is where we can handle these cases.
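The two-pass approach described above could be sketched roughly as follows. This is not the code from the branch; the function names and the second-pass rule are illustrative assumptions.

```javascript
// Hypothetical sketch of the two-pass tokenization: first split on
// whitespace, then run a second pass over each token that peels apart
// joined words at sentence punctuation followed by an uppercase letter.
function tokenize(text) {
  return text
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .flatMap(secondPass);
}

function secondPass(token) {
  // Recursively split "word.Next" style tokens into ["word.", "Next"].
  const joined = token.match(/^(.+[.!?])(\p{Lu}.*)$/u);
  if (joined) return [...secondPass(joined[1]), ...secondPass(joined[2])];
  return [token];
}

console.log(tokenize("One sentence.Another one."));
// → ["One", "sentence.", "Another", "one."]
```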

@tfnribeiro
Contributor

I have added a check for abbreviations that first checks whether the next word is uppercase to decide if we are at the end of the sentence. The results look like this:

[screenshot]

This should at least handle cases like "ift.", as long as the next sentence doesn't start with a proper noun. I didn't find a list of abbreviations, so without using a pipeline to do more complex parsing this might be the best we can do. If we could check for proper nouns, I think we could have something a little more robust.
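The heuristic described above could look roughly like this. It is a sketch under the stated assumption (period counts as a sentence end only when the next token starts with an uppercase letter); the function name is illustrative.

```javascript
// Hypothetical sketch: decide whether a token ends a sentence.
// A period followed by a lowercase word (e.g. "ift." + "reglerne") is
// treated as an abbreviation, not a sentence boundary. The known weakness
// is an abbreviation followed by a proper noun, as noted above.
function isSentenceEnd(token, nextToken) {
  if (!/[.!?]$/.test(token)) return false;
  // No next token means end of text, hence end of sentence.
  if (!nextToken) return true;
  // Sentence end only if the next word starts with an uppercase letter.
  return /^\p{Lu}/u.test(nextToken);
}

console.log(isSentenceEnd("ift.", "reglerne"));  // → false (abbreviation)
console.log(isSentenceEnd("slut.", "Derefter")); // → true
```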

@tfnribeiro
Contributor

Added branch https://github.com/zeeguu/web/tree/update-tokenization with suggestions to obtain the result above.
