Improving the handling of numerals of nagisa's word tokenizer #9

Closed
BLKSerene opened this issue Dec 25, 2018 · 4 comments

@BLKSerene

I'm using nagisa v0.1.1. There are some problems with the tokenizer's handling of numerals: numbers and decimals are split into single characters, and each digit is tagged as "名詞" (noun):
357 -> 3_名詞 5_名詞 7_名詞 # Numbers
1.48 -> 1_名詞 ._名詞 4_名詞 8_名詞 # Decimals
$5.5 -> $_補助記号 5_名詞 ._補助記号 5_名詞 # Numbers with currency symbols (and other symbols)
133-1111-2222 -> 1_名詞 3_名詞 3_名詞 -_補助記号 1_名詞 1_名詞 1_名詞 1_名詞 -_補助記号 2_名詞 2_名詞 2_名詞 2_名詞 # Phone numbers

and so on. Is it possible to improve this?

@taishi-i
Owner

Hi @BLKSerene

The over-tokenization happens because many numerals appear in the training data as single characters tagged "名詞". nagisa's word segmentation and POS-tagging model learns this pattern, so numerals in input text are also split into single characters and tagged "名詞".

Since it is difficult to modify the training data, I recommend using the following post-processing function, concat_numeric_chars. It concatenates consecutive numerals and symbols into a single word tagged "数詞" ("数詞" means "numeral" in Japanese). Please try this approach to avoid the over-tokenization.

import nagisa


def concat_numeric_chars(words, postags, num_postag="数詞"):
    """Merge consecutive numerals and symbols into one word tagged num_postag."""
    out_words = []
    out_postags = []
    substring = []  # tokens of the run currently being collected
    for word, postag in zip(words, postags):
        # Numerals, symbol tokens ("補助記号") and "." all extend the current run.
        if word.isnumeric() or (postag == "補助記号") or (word == "."):
            substring.append(word)
        else:
            # Any other token ends the run: emit the merged word first.
            if len(substring) > 0:
                out_words.append("".join(substring))
                out_postags.append(num_postag)
                substring = []
            out_words.append(word)
            out_postags.append(postag)

    # Emit a run that reaches the end of the input.
    if len(substring) > 0:
        out_words.append("".join(substring))
        out_postags.append(num_postag)

    return out_words, out_postags


def main():
    # Numbers
    text = "357"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags) #=> ['357'] ['数詞']

    # Decimals
    text = "1.48"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags) #=> ['1.48'] ['数詞']

    # Numbers with currency symbols (and other symbols)
    text = "$5.5"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags) #=> ['$5.5'] ['数詞']

    # Phone numbers
    text = "133-1111-2222"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags) #=> ['133-1111-2222'] ['数詞']


if __name__ == "__main__":
    main()
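
One thing to be aware of with the function above: every token tagged "補助記号" joins a run, so a symbol that is not next to any numeral (for example a lone "、") is also re-tagged as "数詞". The following is a minimal sketch of a stricter variant, not part of nagisa itself. The name concat_numeric_runs and the symbol_postag parameter are hypothetical, and the token lists at the bottom are written by hand (the "$5.5" tags are copied from the report above, the Japanese example is an assumed tagging) so that the snippet runs without nagisa installed.

```python
def concat_numeric_runs(words, postags, num_postag="数詞", symbol_postag="補助記号"):
    """Merge a run of numerals/symbols only if it contains at least one numeral.

    Hypothetical variant of concat_numeric_chars: symbol-only runs keep
    their original words and tags.
    """
    out_words, out_postags = [], []
    run = []  # pending (word, postag) pairs that may form a number

    def flush():
        # Merge the run when it holds a numeral; otherwise emit it unchanged.
        if any(w.isnumeric() for w, _ in run):
            out_words.append("".join(w for w, _ in run))
            out_postags.append(num_postag)
        else:
            for w, p in run:
                out_words.append(w)
                out_postags.append(p)
        run.clear()

    for word, postag in zip(words, postags):
        if word.isnumeric() or postag == symbol_postag or word == ".":
            run.append((word, postag))
        else:
            flush()
            out_words.append(word)
            out_postags.append(postag)
    flush()
    return out_words, out_postags


# Hand-written tokens: the first pair mirrors the "$5.5" output in the report,
# the second pair is an assumed tagging used only for illustration.
price = concat_numeric_runs(["$", "5", ".", "5"],
                            ["補助記号", "名詞", "補助記号", "名詞"])
plain = concat_numeric_runs(["晴れ", "、", "雨"],
                            ["名詞", "補助記号", "名詞"])
print(price)
print(plain)
```

A symbol that directly follows a numeral (such as a sentence-final "。" after a digit) is still merged into the number by this sketch, so further trimming may be needed depending on the text.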

@BLKSerene
Author

Thanks, it works.

@taishi-i
Owner

Thanks for the feedback.

@Subrata15

Thank you, that works for me!
