Improving the handling of numerals of nagisa's word tokenizer #9

BLKSerene · 2018-12-25T13:30:28Z

I'm using nagisa v0.1.1. There are some problems with the tokenizer's handling of numerals: numbers and decimals are split into single characters, and each character is tagged as "名詞":
357 -> 3_名詞 5_名詞 7_名詞 # Numbers
1.48 -> 1_名詞 ._名詞 4_名詞 8_名詞 # Decimals
$5.5 -> $_補助記号 5_名詞 ._補助記号 5_名詞 # Numbers with currency symbols (and other symbols)
133-1111-2222 -> 1_名詞 3_名詞 3_名詞 -_補助記号 1_名詞 1_名詞 1_名詞 1_名詞 -_補助記号 2_名詞 2_名詞 2_名詞 2_名詞 # Phone numbers

and so on. Is it possible to improve this?

taishi-i · 2018-12-26T15:34:18Z

Hi @BLKSerene

The over-tokenization happens because many numerals appear in the training data as single characters tagged "名詞". nagisa's word segmentation and POS-tagging model learns this pattern, so numerals in text are split into single characters, each tagged "名詞".

Since it is difficult to modify the training data, I recommend using the following post-processing function, concat_numeric_chars. It concatenates consecutive numerals and symbols into a single word tagged "数詞" ("数詞" means "numeral" in Japanese). Please try this approach to avoid the over-tokenization.

import nagisa

                                                           
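def concat_numeric_chars(words, postags, num_postag="数詞"):
    # Merge each run of numerals, "." and 補助記号 tokens into one word tagged num_postag.
    out_words = []
    out_postags = []
    substring = []  # tokens of the run currently being merged
    for word, postag in zip(words, postags):
        if word.isnumeric() or postag == "補助記号" or word == ".":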
            substring.append(word)                                      
        else:                                                           
            if len(substring) > 0:                                      
                out_words.append("".join(substring))                    
                out_postags.append(num_postag)                          
                substring = []                                          
            out_words.append(word)                                      
            out_postags.append(postag)                                  
                                                                        
    if len(substring) > 0:                                              
        out_words.append("".join(substring))                            
        out_postags.append(num_postag)                                  
                                                                        
    return out_words, out_postags                                       
                                                                        
                                                                        
def main():                                                             
    # Numbers                                                           
    text = "357"                                                        
    tokens = nagisa.tagging(text)                                       
    words, postags = concat_numeric_chars(tokens.words, tokens.postags) 
    print(words, postags) #=> ['357'] ['数詞']                          
                                                                        
    # Decimals                                                          
    text = "1.48"                                                       
    tokens = nagisa.tagging(text)                                       
    words, postags = concat_numeric_chars(tokens.words, tokens.postags) 
    print(words, postags) #=> ['1.48'] ['数詞']                         
                                                                        
    # Numbers with currency symbols (and other symbols)                 
    text = "$5.5"                                                       
    tokens = nagisa.tagging(text)                                       
    words, postags = concat_numeric_chars(tokens.words, tokens.postags) 
    print(words, postags) #=> ['$5.5'] ['数詞']                         
                                                                        
    # Phone numbers                                                     
    text = "133-1111-2222"                                              
    tokens = nagisa.tagging(text)                                       
    words, postags = concat_numeric_chars(tokens.words, tokens.postags) 
    print(words, postags) #=> ['133-1111-2222'] ['数詞']                
                                                                        
                                                                        
if __name__ == "__main__":                                              
    main()
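
One caveat with the function above: every token tagged "補助記号" qualifies, so a run that contains no digit at all (for example, a lone "。") would also be merged and re-tagged as "数詞". Below is a minimal sketch of a stricter variant, assuming that behavior is not wanted. The name concat_numeric_runs is hypothetical and not part of nagisa, and the function takes plain lists, so it can be tried without running the tokenizer.

```python
def concat_numeric_runs(words, postags, num_postag="数詞"):
    """Merge runs of numerals and symbols, but only if the run contains a numeral.

    Hypothetical helper for illustration; not part of nagisa.
    """
    out_words, out_postags = [], []
    run = []  # (word, postag) pairs of the candidate run being collected

    def flush():
        # Emit the collected run, merged or unchanged, then reset it.
        if not run:
            return
        if any(w.isnumeric() for w, _ in run):
            out_words.append("".join(w for w, _ in run))
            out_postags.append(num_postag)
        else:
            # No numeral in the run: keep the original tokens and tags.
            for w, p in run:
                out_words.append(w)
                out_postags.append(p)
        run.clear()

    for word, postag in zip(words, postags):
        if word.isnumeric() or postag == "補助記号" or word == ".":
            run.append((word, postag))
        else:
            flush()
            out_words.append(word)
            out_postags.append(postag)
    flush()
    return out_words, out_postags


print(concat_numeric_runs(["$", "5", ".", "5"], ["補助記号", "名詞", "補助記号", "名詞"]))
# (['$5.5'], ['数詞'])
print(concat_numeric_runs(["これ", "。"], ["代名詞", "補助記号"]))
# (['これ', '。'], ['代名詞', '補助記号'])
```

This sketch still merges a symbol that directly follows a number (a sentence-final "。" after a digit, for instance) into the numeric token; handling that would need an additional rule about which symbols may end a run.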

BLKSerene · 2018-12-28T05:03:44Z

Thanks, it works.

taishi-i · 2018-12-28T06:14:33Z

Thanks for the feedback.

Subrata15 · 2020-12-03T10:00:42Z

Thank you, that works for me!

taishi-i closed this as completed Dec 28, 2018
