Recognition of russian abbr names #7

Closed
wants to merge 3 commits into from

@SergeyKishenin
Collaborator

It would be nice if the tokenizer recognized Russian abbreviated names (initials followed by a surname), such as "В.Ф. Иванов", and kept each one as a single token.

I've checked a regexp in irb that should do that, but when I put it into word_tokenizer.rb nothing happens.
I added the following there:

[/([\p{Word}]{1}\.[\p{Word}]{1}\.)\s([\p{Word}]{1,}[^\.\s])/u, '\1\2']

Any ideas?
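For reference, the irb check described above can be sketched as follows. How word_tokenizer.rb actually applies its `[pattern, replacement]` pairs is not shown in this thread, so `apply_rule` is a hypothetical helper that simply assumes `String#gsub`; only the pattern and replacement are taken from the comment.

```ruby
# Pattern and replacement copied verbatim from the comment above.
ABBR_NAME_RULE = [/([\p{Word}]{1}\.[\p{Word}]{1}\.)\s([\p{Word}]{1,}[^\.\s])/u, '\1\2']

# Hypothetical helper (not part of the project): applies one
# [pattern, replacement] pair to the text with String#gsub.
def apply_rule(text, rule)
  pattern, replacement = rule
  text.gsub(pattern, replacement)
end

# Initials without a space between them, then a space before the surname:
# the rule removes that space.
puts apply_rule("В.Ф. Иванов", ABBR_NAME_RULE)   # В.Ф.Иванов

# Initials separated by a space: the pattern does not match,
# so the text comes back unchanged.
puts apply_rule("В. Ф. Иванов", ABBR_NAME_RULE)  # В. Ф. Иванов
```

In isolation the pattern does join "В.Ф. Иванов", which suggests looking at how and in what order the tokenizer applies its rules rather than at the pattern itself.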

@zencephalon
Owner

@SergeyKishenin is this still a problem? I want to go through the regexes to fix this issue, so maybe I can fix the Russian abbreviations at the same time.

@SergeyKishenin
Collaborator Author

Seems like it's still an issue. I'll create a branch and attach a pull request here with test cases.

A name written without spaces, like В.Ф.Иванов, is tokenized correctly. The forms with spaces, В.Ф. Иванов and В. Ф. Iванов, are not.
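One possible way to cover all three spellings is to normalize them to the compact form that already tokenizes correctly. The sketch below is only an illustration under that assumption: `INITIALS_NAME` and `join_abbr_name` are hypothetical names, not part of the project, and the final `split` stands in for the real tokenizer.

```ruby
# Hypothetical pattern: two uppercase initials, each followed by a dot,
# with optional whitespace after each dot, then a capitalized surname.
INITIALS_NAME = /(\p{Lu})\.\s*(\p{Lu})\.\s*(\p{Lu}\p{L}+)/

# Hypothetical helper: rewrites every matched name to the compact
# "В.Ф.Иванов" form so that no whitespace is left inside it.
def join_abbr_name(text)
  text.gsub(INITIALS_NAME, '\1.\2.\3')
end

["В.Ф.Иванов", "В.Ф. Иванов", "В. Ф. Иванов"].each do |name|
  puts join_abbr_name(name)  # prints В.Ф.Иванов for each input
end

# With plain whitespace splitting as a stand-in tokenizer,
# the name stays in one piece.
p join_abbr_name("автор В. Ф. Иванов").split  # ["автор", "В.Ф.Иванов"]
```

This sketch requires uppercase initials and a capitalized surname, which is stricter than the `\p{Word}` pattern from the first comment. That should reduce false matches on ordinary dotted abbreviations, but it has not been checked against the project's test cases.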

@zencephalon
Owner

Thanks for the test cases!

2 participants