Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed false positive for "食べられる" (potential verb) by JapaneseBrokenExpression #880

Merged
merged 3 commits into from Oct 19, 2020

Conversation

hirokiky
Copy link
Contributor

JapaneseBrokenExpression would cause false positives for potential verb.
It would assert error for "食べられる", "見られる", "寝られる", and so on.

What I did:

  • Added tests for checking this
  • Fixed code (JapaneseBrokenExpression only).
  • Added special case handling

About special case

The tokenizer will parse ”見れる" as one token.
It might depends on dicts of Kuromoji or so.

This issue looks similar takuyaa/kuromoji.js#28

I think we need to add other special cases (if there are).

About baseForm of tokens

It's better to use BaseForm in this logic

q.getSurface().startsWith("られ")

Best way.

q.getBaseForm().equals("られる")

But to get baseForm, we need to change TokenElement and NoelogdJapaneseTokenizer.
It looks other Validators won't use BaseForm of each tokens, and it's only necessary with Japanese.
So in this PullRequest, I avoided to change them.

As ScreenShot

Before

スクリーンショット 2020-10-16 13-37-20

After

スクリーンショット 2020-10-16 13-37-54

Note

I'm not good at Java, so feel free to change my code and syntax as you like.

* Case: られる
* Case: 下一段活用
Due to tokenizer's behavior '見れる' will be one token.
So now it handle this case specially
@coveralls
Copy link

Coverage Status

Coverage increased (+0.003%) to 91.41% when pulling 4ac46d3 on hirokiky:fix-jp-broken-ra into 4ea76e6 on redpen-cc:master.

@norm-ideal
Copy link
Contributor

I am also working on the subject and I agree with you that the best way is to introduce BaseForm information, even though it is required only by Japanese. How about like this?

https://github.com/norm-ideal/redpen/tree/ra-drop

@takahi-i takahi-i self-requested a review October 19, 2020 05:01
@takahi-i
Copy link
Member

LGTM! Thank you very much for the valuable contribution @hirokiky 🙏

@takahi-i takahi-i merged commit 5552ece into redpen-cc:master Oct 19, 2020
@hirokiky hirokiky deleted the fix-jp-broken-ra branch October 20, 2020 00:38
@hirokiky
Copy link
Contributor Author

Thanks for the quick review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants