Fix unk.def in mecab-ipadic #37

Open
wants to merge 3 commits into
base: master

polm commented Jul 18, 2017

GitHub isn't showing the file properly, so to be clear, I changed this line:

SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*

to this:

SYMBOL,1283,1283,17585,記号,一般,*,*,*,*,*

The previous setting makes no sense and has confused many people. I guess it was a mistake?

The jumandic unk.def did not seem to have this problem.
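
For anyone who wants to apply the same change locally, here is a minimal sketch of the edit as a script. It assumes the usual MeCab dictionary layout (category, left context ID, right context ID, cost, then the feature fields) and works on already-decoded text; the function name `fix_symbol_pos` is hypothetical, and reading or writing the file in the right encoding is left to the caller.

```python
def fix_symbol_pos(text, category="SYMBOL", new_pos=("記号", "一般")):
    """Replace the first two POS fields on every unk.def row for `category`.

    Hypothetical helper for illustration; not part of MeCab or mecab-ipadic.
    """
    fixed = []
    for line in text.splitlines():
        fields = line.split(",")
        # Assumed layout: category, left ID, right ID, cost, POS features...
        if fields[0] == category and len(fields) >= 6:
            fields[4:6] = new_pos  # only the two leading POS fields change
        fixed.append(",".join(fields))
    return "\n".join(fixed)


before = "SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*"
print(fix_symbol_pos(before))
```

The context IDs and cost are left untouched, as in the PR itself. The dictionary would still need to be rebuilt afterwards for the change to take effect.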

If there's anything I should improve, please let me know.

Many thanks for providing MeCab.

Unknown symbols are not nouns. -POLM

polm commented Mar 16, 2019

Hello. This PR has been here for over a year, and it would be great to have it addressed one way or another.

I will add that I realized why the current setting is in place. There's a footnote in "Applying Conditional Random Fields to Japanese Morphological Analysis" that explains it:

> JUMAN assigns “unknown POS” to the words not seen in
> the lexicon. We simply replace the POS of these words with
> the default POS, Noun-SAHEN.

While that sounds reasonable, the articles I linked above, and the issue that has been linked to this PR since I originally posted it, show that this setting causes confusion. I still think it should be changed.

Any feedback at all would be appreciated.
