ipadic problem for 四半期連結会計期間末日満期手形 #1

Open
KoichiYasuoka opened this issue Oct 27, 2021 · 5 comments
Labels: bug, enhancement

Comments

@KoichiYasuoka

Thank you for releasing bert-small-japanese-fin and the other ELECTRA models for FinTech. However, I've found that they tokenize "四半期連結会計期間末日満期手形" badly:

>>> from transformers import AutoTokenizer
>>> tokenizer=AutoTokenizer.from_pretrained("izumi-lab/bert-small-japanese-fin")
>>> tokenizer.tokenize("四半期連結会計期間末日満期手形")
['四半期', '連結', '会計', '期間', '末日', '満期', '手形']
>>> tokenizer.tokenize("第3四半期連結会計期間末日満期手形")
['第', '3', '四半期連結会計期間末日満期手形']

This is caused by a bug in ipadic's 名詞,数 (numeral noun) tokenization of kanji strings that begin with a kanji numeral (漢数字).

>>> import fugashi,ipadic
>>> parser=fugashi.GenericTagger(ipadic.MECAB_ARGS).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半期  名詞,一般,*,*,*,*,四半期,シハンキ,シハンキ
連結    名詞,サ変接続,*,*,*,*,連結,レンケツ,レンケツ
会計    名詞,サ変接続,*,*,*,*,会計,カイケイ,カイケイ
期間    名詞,一般,*,*,*,*,期間,キカン,キカン
末日    名詞,一般,*,*,*,*,末日,マツジツ,マツジツ
満期    名詞,一般,*,*,*,*,満期,マンキ,マンキ
手形    名詞,一般,*,*,*,*,手形,テガタ,テガタ
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第      接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
3       名詞,数,*,*,*,*,*
四半期連結会計期間末日満期手形  名詞,数,*,*,*,*,*
EOS

I recommend using a tokenizer other than BertJapaneseTokenizer+ipadic. See the details in my diary (written in Japanese).

@retarfi
Owner

retarfi commented Oct 27, 2021

Thank you for your comment and for sharing this issue.
I had not noticed this ipadic problem.
Not only the tokenization but also vocab.txt (i.e., the vocabulary-building process) would be affected: the vocabulary wrongly contains such a long word, which should have been tokenized into shorter words such as '四半期', '連結', '会計', '期間', '末日', '満期', '手形'.
Is this problem unique to ipadic?
If so, one solution would be to change the dictionary from ipadic to unidic_lite or unidic, and we would need to pre-train our model again with the new dictionary.
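
For reference, the dictionary can be swapped at tokenization time. A minimal sketch, assuming a recent transformers version in which BertJapaneseTokenizer accepts mecab_kwargs (this alone would not fix the model, since vocab.txt was built with ipadic):

from transformers import BertJapaneseTokenizer
# Sketch only: load the released vocabulary but word-tokenize with unidic_lite
tokenizer=BertJapaneseTokenizer.from_pretrained(
    "izumi-lab/bert-small-japanese-fin",
    word_tokenizer_type="mecab",
    mecab_kwargs={"mecab_dic":"unidic_lite"},  # default is "ipadic"
)
print(tokenizer.tokenize("第3四半期連結会計期間末日満期手形"))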

@KoichiYasuoka
Author

Is this problem unique to ipadic?

Maybe. At least unidic_lite does not tokenize them that way:

>>> import fugashi,unidic_lite
>>> parser=fugashi.GenericTagger("-d "+unidic_lite.DICDIR).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半    シハン  シハン  四半    名詞-普通名詞-一般                      0,2
期      キ      キ      期      名詞-普通名詞-助数詞可能                       1
連結    レンケツ        レンケツ        連結    名詞-普通名詞-サ変可能         0
会計    カイケー        カイケイ        会計    名詞-普通名詞-サ変可能         0
期間    キカン  キカン  期間    名詞-普通名詞-一般                      1,2
末日    マツジツ        マツジツ        末日    名詞-普通名詞-一般             0
満期    マンキ  マンキ  満期    名詞-普通名詞-一般                      0,1
手形    テガタ  テガタ  手形    名詞-普通名詞-一般                      0
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第      ダイ    ダイ    第      接頭辞
3       3       3       3       名詞-数詞                       0
四半    シハン  シハン  四半    名詞-普通名詞-一般                      0,2
期      キ      キ      期      名詞-普通名詞-助数詞可能                       1
連結    レンケツ        レンケツ        連結    名詞-普通名詞-サ変可能         0
会計    カイケー        カイケイ        会計    名詞-普通名詞-サ変可能         0
期間    キカン  キカン  期間    名詞-普通名詞-一般                      1,2
末日    マツジツ        マツジツ        末日    名詞-普通名詞-一般             0
満期    マンキ  マンキ  満期    名詞-普通名詞-一般                      0,1
手形    テガタ  テガタ  手形    名詞-普通名詞-一般                      0
EOS

However, unidic_lite (and unidic) are based on 国語研短単位 (NINJAL Short Unit Words), which is a rather short word unit for this purpose. I think a longer unit, such as 国語研長単位 (NINJAL Long Unit Words), would be more suitable for FinTech. Would you try making your own tokenizer?
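
A minimal sketch of such long-unit segmentation, using the bert-base-japanese-luw-upos token-classification model that appears later in this thread (aggregation_strategy="simple" merges subwords back into long-unit chunks):

from transformers import pipeline
# Segment a sentence into long unit words via token classification
nlp=pipeline("token-classification",model="KoichiYasuoka/bert-base-japanese-luw-upos",aggregation_strategy="simple")
print([t["word"] for t in nlp("第3四半期連結会計期間末日満期手形")])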

@retarfi
Owner

retarfi commented Oct 27, 2021

As you mentioned, it seems that subword tokenization based on long units like 長単位 would work better than using ipadic or unidic(_lite).
Creating such a tokenizer would be the way to go, but that is difficult with my current resources...

@KoichiYasuoka
Author

KoichiYasuoka commented Oct 30, 2021

Hi @retarfi, I've just released Japanese-LUW-Tokenizer. It took about 20 hours to build the tokenizer from a 700MB orig.txt (one UTF-8 sentence per line) on one GPU (NVIDIA GeForce RTX 2080):

import unicodedata
from tokenizers import CharBPETokenizer
from transformers import AutoModelForTokenClassification,AutoTokenizer,TokenClassificationPipeline,RemBertTokenizerFast

# Segment the raw corpus into long unit words (LUW) with a token-classification model
brt="KoichiYasuoka/bert-base-japanese-luw-upos"
mdl=AutoModelForTokenClassification.from_pretrained(brt)
tkz=AutoTokenizer.from_pretrained(brt)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,aggregation_strategy="simple",device=0)

# Write LUW-segmented sentences (space-separated) to luw.txt, in batches of 256
with open("orig.txt","r",encoding="utf-8") as f, open("luw.txt","w",encoding="utf-8") as w:
  d=[]
  for r in f:
    if r.strip()!="":
      d.append(r.strip())
    if len(d)>255:
      for s in nlp(d):
        print(" ".join(t["word"] for t in s),file=w)
      d=[]
  if len(d)>0:
    for s in nlp(d):
      print(" ".join(t["word"] for t in s),file=w)

# Collect every single CJK character in the original vocabulary as the initial alphabet
alp=[c for c in tkz.convert_ids_to_tokens([i for i in range(len(tkz))]) if len(c)==1 and unicodedata.name(c).startswith("CJK")]
pst=tkz.backend_tokenizer.post_processor

# Train a character-level BPE tokenizer on the LUW-segmented corpus
tkz=CharBPETokenizer(lowercase=False,unk_token="[UNK]",suffix="")
tkz.normalizer.handle_chinese_chars=False  # do not split CJK characters apart
tkz.post_processor=pst  # reuse the [CLS]/[SEP] post-processing of the original tokenizer
tkz.train(files=["luw.txt"],vocab_size=250300,min_frequency=2,limit_alphabet=20000,initial_alphabet=alp,special_tokens=["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]","<special0>","<special1>","<special2>","<special3>","<special4>","<special5>","<special6>","<special7>","<special8>","<special9>"],suffix="")
tkz.save("tokenizer.json")

# Wrap the trained tokenizer as a RemBertTokenizerFast for use with transformers
tokenizer=RemBertTokenizerFast(tokenizer_file="tokenizer.json",vocab_file="/dev/null",bos_token="[CLS]",cls_token="[CLS]",unk_token="[UNK]",pad_token="[PAD]",mask_token="[MASK]",sep_token="[SEP]",do_lower_case=False,keep_accents=True)
tokenizer.save_pretrained("Japanese-LUW-Tokenizer")

vocab_size=250300 seems too big, but it is acceptable. See the details in my diary (written in Japanese).
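
As a quick check, the released tokenizer can be loaded straight from the Hub (a sketch; the exact output depends on the trained vocabulary):

from transformers import AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/Japanese-LUW-Tokenizer")
print(tokenizer.tokenize("四半期連結会計期間末日満期手形"))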

@retarfi
Owner

retarfi commented Nov 4, 2021

Thank you for sharing! I will check it in detail.

retarfi added the bug and enhancement labels on Aug 6, 2022