
Better sentence tokenization needed in the InteractiveText #334

Open
mircealungu opened this issue Mar 29, 2024 · 4 comments
@mircealungu
Member
mircealungu commented Mar 29, 2024

When the original text is missing a space after the end of a sentence, the last word of the previous sentence and the first word of the next are treated as a single token.

[screenshot]
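A minimal sketch of one way to handle this case: detect a sentence-ending punctuation mark glued directly to a following capitalized word and split the token there. The function name and regex are illustrative, not from the Zeeguu codebase, and the heuristic will still over-split abbreviations like "U.S.A" (the case discussed below).

```javascript
// Hypothetical sketch: split a token where sentence-ending punctuation is
// immediately followed by an uppercase letter, e.g. "ended.Next".
// \p{Lu} matches any Unicode uppercase letter, so non-ASCII text works too.
function splitJoinedSentences(token) {
  const match = token.match(/^(.+[.!?])(\p{Lu}.*)$/u);
  return match ? [match[1], match[2]] : [token];
}

console.log(splitJoinedSentences("word.Next")); // → ["word.", "Next"]
console.log(splitJoinedSentences("hello"));     // → ["hello"]
```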

Equally wrong is the opposite case, where two words are wrongly disconnected because one of them is an abbreviation ending in a period:

[screenshot]

And another imperfect situation:
[screenshot]

@tfnribeiro
Contributor

I can look into this. Would you share those text ids so I can take a look at those examples?

For the second example, I am not exactly sure what the expected behaviour would be. For the other two, I think we just need to improve the pattern matching in the algorithm. We could also consider using a tokenizer from the API, to ensure more consistency throughout.

@tfnribeiro
Contributor

I have something like this now:

[screenshot]

Essentially, I have added a second pass to the process that splits the text into word tokens based on whitespace: it checks each token for joined words and these special tokens, which is where we can handle these cases.
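The two-pass approach described above could be sketched roughly as follows. This is not the code from the branch; the function names and the second-pass rule are illustrative assumptions.

```javascript
// Hypothetical sketch of the two-pass tokenization: first split on
// whitespace, then run a second pass over each token that peels apart
// joined words at sentence punctuation followed by an uppercase letter.
function tokenize(text) {
  return text
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .flatMap(secondPass);
}

function secondPass(token) {
  // Recursively split "word.Next" style tokens into ["word.", "Next"].
  const joined = token.match(/^(.+[.!?])(\p{Lu}.*)$/u);
  if (joined) return [...secondPass(joined[1]), ...secondPass(joined[2])];
  return [token];
}

console.log(tokenize("One sentence.Another one."));
// → ["One", "sentence.", "Another", "one."]
```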

@tfnribeiro
Contributor

I have added a check for abbreviations that first checks whether the next word is uppercase to decide if we are at the end of the sentence. The results look like this:

[screenshot]

This should at least handle cases like "ift.", as long as the next sentence doesn't start with a proper noun. I didn't find a list of abbreviations, so without using a pipeline to do more complex parsing this might be the best we can do. If we could check for proper nouns, I think we could have something a little more robust.
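The heuristic described above could look roughly like this. It is a sketch under the stated assumption (period counts as a sentence end only when the next token starts with an uppercase letter); the function name is illustrative.

```javascript
// Hypothetical sketch: decide whether a token ends a sentence.
// A period followed by a lowercase word (e.g. "ift." + "reglerne") is
// treated as an abbreviation, not a sentence boundary. The known weakness
// is an abbreviation followed by a proper noun, as noted above.
function isSentenceEnd(token, nextToken) {
  if (!/[.!?]$/.test(token)) return false;
  // No next token means end of text, hence end of sentence.
  if (!nextToken) return true;
  // Sentence end only if the next word starts with an uppercase letter.
  return /^\p{Lu}/u.test(nextToken);
}

console.log(isSentenceEnd("ift.", "reglerne"));  // → false (abbreviation)
console.log(isSentenceEnd("slut.", "Derefter")); // → true
```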

@tfnribeiro
Contributor

Added branch https://github.com/zeeguu/web/tree/update-tokenization with suggestions to obtain the result above.
