Improving the handling of numerals of nagisa's word tokenizer #9
Comments
Hi @BLKSerene

The over-tokenization problem is caused because:

Since it is difficult to modify the training data, I recommend using the following post-processing function:

import nagisa

def concat_numeric_chars(words, postags, num_postag="数詞"):
    out_words = []
    out_postags = []
    substring = []  # pending run of digits and symbols
    for word, postag in zip(words, postags):
        # Digits, symbol-tagged tokens ("補助記号"), and "." extend the current run.
        if word.isnumeric() or (postag == "補助記号") or (word == "."):
            substring.append(word)
        else:
            # Any other token ends the run: emit the run as a single numeral token.
            if len(substring) > 0:
                out_words.append("".join(substring))
                out_postags.append(num_postag)
                substring = []
            out_words.append(word)
            out_postags.append(postag)
    # Emit a run that reaches the end of the input.
    if len(substring) > 0:
        out_words.append("".join(substring))
        out_postags.append(num_postag)
    return out_words, out_postags


def main():
    # Numbers
    text = "357"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags)  #=> ['357'] ['数詞']

    # Decimals
    text = "1.48"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags)  #=> ['1.48'] ['数詞']

    # Numbers with currency symbols (and other symbols)
    text = "$5.5"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags)  #=> ['$5.5'] ['数詞']

    # Phone numbers
    text = "133-1111-2222"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags)  #=> ['133-1111-2222'] ['数詞']


if __name__ == "__main__":
    main()
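One thing to watch for with the function above: any token tagged "補助記号" starts a run, so a symbol with no digit next to it (for example a sentence-final "。") also comes out tagged as "数詞". If that matters for your data, a possible variant is sketched below. It is not part of nagisa; the name `concat_numeric_runs` and the `symbol_postag` parameter are made up for this example, and it only merges a run when the run contains at least one numeric token. It works on plain lists, so it can be tried without loading the tagger.

```python
def concat_numeric_runs(words, postags, num_postag="数詞", symbol_postag="補助記号"):
    """Merge runs of digits and symbols into one token, but only if the run
    contains at least one numeric token. Hypothetical variant, not nagisa API."""
    out_words, out_postags = [], []
    run = []  # pending (word, postag) pairs that might form a numeral

    def flush():
        if not run:
            return
        if any(w.isnumeric() for w, _ in run):
            # The run contains a digit: emit it as a single numeral token.
            out_words.append("".join(w for w, _ in run))
            out_postags.append(num_postag)
        else:
            # No digit in the run: keep the original tokens and tags.
            for w, p in run:
                out_words.append(w)
                out_postags.append(p)
        run.clear()

    for word, postag in zip(words, postags):
        if word.isnumeric() or postag == symbol_postag or word == ".":
            run.append((word, postag))
        else:
            flush()
            out_words.append(word)
            out_postags.append(postag)
    flush()
    return out_words, out_postags


# Hand-written token lists (assumed tags, not actual tagger output).
print(concat_numeric_runs(["$", "5", ".", "5"], ["補助記号", "名詞", "補助記号", "名詞"]))
print(concat_numeric_runs(["猫", "。"], ["名詞", "補助記号"]))
```

Like the original function, this still merges a symbol that sits between two numbers (such as a comma separating two list items), so it is only a partial refinement.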
Thanks, it works.
Thanks for the feedback.
Thank you, that works for me!
I'm using nagisa v0.1.1. There are some problems with the tokenizer's handling of numerals: numbers and decimals are split into single characters and tagged as "名詞":
357 -> 3_名詞 5_名詞 7_名詞 # Numbers
1.48 -> 1_名詞 ._名詞 4_名詞 8_名詞 # Decimals
$5.5 -> $_補助記号 5_名詞 ._補助記号 5_名詞 # Numbers with currency symbols (and other symbols)
133-1111-2222 -> 1_名詞 3_名詞 3_名詞 -_補助記号 1_名詞 1_名詞 1_名詞 1_名詞 -_補助記号 2_名詞 2_名詞 2_名詞 2_名詞 # Phone numbers
and so on. Is it possible to improve this?