Recognition of russian abbr names #7

SergeyKishenin · 2014-04-30T05:16:49Z

It would be great if the tokenizer recognized Russian abbreviated names like "В.Ф. Иванов" and treated each one as a single token.

I checked a regexp in irb that should do this, but when I put it in word_tokenizer.rb nothing happens.
This is what I added:

[/([\p{Word}]{1}\.[\p{Word}]{1}\.)\s([\p{Word}]{1,}[^\.\s])/u, '\1\2']

Any ideas?
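To check the pattern on its own, outside the tokenizer, it can be run through plain `String#gsub`. This is only a minimal sketch: the constant `INITIALS_SURNAME` and the helper `join_initials` are hypothetical names introduced for illustration, and nothing here reflects how word_tokenizer.rb actually applies its rules.

```ruby
# Standalone check of the regexp from this issue, using only core Ruby.
# INITIALS_SURNAME and join_initials are hypothetical names for this sketch;
# they are not part of the tokenizer.
INITIALS_SURNAME = /([\p{Word}]{1}\.[\p{Word}]{1}\.)\s([\p{Word}]{1,}[^\.\s])/u

# Removes the single whitespace character between "X.Y." and the word after it.
def join_initials(text)
  text.gsub(INITIALS_SURNAME, '\1\2')
end

puts join_initials("В.Ф. Иванов")   # initials written together, then a space
puts join_initials("В. Ф. Иванов")  # initials separated by a space
```

In isolation the first string becomes "В.Ф.Иванов", while the second is left unchanged because the pattern requires the two initials to be adjacent. If the substitution works here but not inside the tokenizer, the cause is more likely where or when the rule is applied than the pattern itself.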

zencephalon · 2014-04-30T00:47:24Z

@SergeyKishenin is this still a problem? I want to go through the regexes to fix this issue; maybe I can fix the Russian abbreviations at the same time.

SergeyKishenin · 2014-04-30T05:15:40Z

Seems like it's still an issue. I'll create a branch and attach a pull request here with test cases.

A name written without spaces, like В.Ф.Иванов, is tokenized correctly. В.Ф. Иванов and В. Ф. Иванов (with spaces) are not.
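One possible way to cover all three spellings is to normalize them to the compact form that already tokenizes correctly. The sketch below assumes that approach; the pattern `SPACED_INITIALS` and the helper `compact_initials` are hypothetical and are not taken from the tokenizer or from the commits in this thread.

```ruby
# Hypothetical normalization step: collapse optional whitespace between two
# capitalized initials and a capitalized surname. Not the tokenizer's own rule.
SPACED_INITIALS = /(\p{Lu}\.)\s*(\p{Lu}\.)\s*(\p{Lu}\p{Ll}+)/

# Rewrites "X. Y. Surname" and "X.Y. Surname" as "X.Y.Surname".
def compact_initials(text)
  text.gsub(SPACED_INITIALS, '\1\2\3')
end

# The three spellings mentioned in this comment.
["В.Ф.Иванов", "В.Ф. Иванов", "В. Ф. Иванов"].each do |name|
  puts compact_initials(name)
end
```

A pattern this broad would also join unrelated text such as an abbreviation followed by a capitalized word (for example "U.S. Army"), so it would need to be restricted or ordered carefully among the other rules.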

zencephalon · 2014-05-01T01:10:26Z

Thanks for the test cases!

SergeyKishenin closed this Sep 21, 2011

SergeyKishenin reopened this Sep 21, 2011

zencephalon added 2 commits April 29, 2014 22:10

Ignored test_out file.

13be3ac

Add test case for right parens disappearing. Doesn't appear to be an …

9ade045

…issue.

add test case for russian abbr names

75eec96

zencephalon closed this Sep 22, 2021
