ipadic problem for 四半期連結会計期間末日満期手形 #1

Open
KoichiYasuoka opened this issue Oct 27, 2021 · 5 comments
Labels: bug, enhancement

Comments

@KoichiYasuoka

Thank you for releasing bert-small-japanese-fin and the other ELECTRA models for FinTech. However, I've found that they tokenize "四半期連結会計期間末日満期手形" badly:

>>> from transformers import AutoTokenizer
>>> tokenizer=AutoTokenizer.from_pretrained("izumi-lab/bert-small-japanese-fin")
>>> tokenizer.tokenize("四半期連結会計期間末日満期手形")
['四半期', '連結', '会計', '期間', '末日', '満期', '手形']
>>> tokenizer.tokenize("第3四半期連結会計期間末日満期手形")
['第', '3', '四半期連結会計期間末日満期手形']

This is caused by a bug in ipadic's 名詞,数 (numeral noun) tokenization of kanji strings that begin with a kanji numeral (漢数字).

>>> import fugashi,ipadic
>>> parser=fugashi.GenericTagger(ipadic.MECAB_ARGS).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半期  名詞,一般,*,*,*,*,四半期,シハンキ,シハンキ
連結    名詞,サ変接続,*,*,*,*,連結,レンケツ,レンケツ
会計    名詞,サ変接続,*,*,*,*,会計,カイケイ,カイケイ
期間    名詞,一般,*,*,*,*,期間,キカン,キカン
末日    名詞,一般,*,*,*,*,末日,マツジツ,マツジツ
満期    名詞,一般,*,*,*,*,満期,マンキ,マンキ
手形    名詞,一般,*,*,*,*,手形,テガタ,テガタ
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第      接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
3       名詞,数,*,*,*,*,*
四半期連結会計期間末日満期手形  名詞,数,*,*,*,*,*
EOS

I recommend using a tokenizer other than BertJapaneseTokenizer+ipadic. See the details in my diary (written in Japanese).

@retarfi
Owner

retarfi commented Oct 27, 2021

Thank you for your comment and for sharing this issue.
I had not noticed this ipadic problem.
Not only the tokenization but also vocab.txt (i.e., the vocabulary-building process) would be affected: the vocabulary wrongly contains such a long word, which should have been tokenized into shorter words such as '四半期', '連結', '会計', '期間', '末日', '満期', '手形'.
Is this problem unique to ipadic?
If so, one solution would be to change the dictionary from ipadic to unidic_lite or unidic, and we would need to pre-train our model again with the new dictionary.
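
For reference, the dictionary can be swapped at tokenization time. A minimal sketch, assuming a recent transformers version in which BertJapaneseTokenizer accepts mecab_kwargs (this alone would not fix the model, since vocab.txt was built with ipadic):

from transformers import BertJapaneseTokenizer
# Sketch only: load the released vocabulary but word-tokenize with unidic_lite
tokenizer=BertJapaneseTokenizer.from_pretrained(
    "izumi-lab/bert-small-japanese-fin",
    word_tokenizer_type="mecab",
    mecab_kwargs={"mecab_dic":"unidic_lite"},  # default is "ipadic"
)
print(tokenizer.tokenize("第3四半期連結会計期間末日満期手形"))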

@KoichiYasuoka
Author

Is this problem unique to ipadic?

Maybe. At least unidic_lite does not tokenize them that way:

>>> import fugashi,unidic_lite
>>> parser=fugashi.GenericTagger("-d "+unidic_lite.DICDIR).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半    シハン  シハン  四半    名詞-普通名詞-一般                      0,2
期      キ      キ      期      名詞-普通名詞-助数詞可能                       1
連結    レンケツ        レンケツ        連結    名詞-普通名詞-サ変可能         0
会計    カイケー        カイケイ        会計    名詞-普通名詞-サ変可能         0
期間    キカン  キカン  期間    名詞-普通名詞-一般                      1,2
末日    マツジツ        マツジツ        末日    名詞-普通名詞-一般             0
満期    マンキ  マンキ  満期    名詞-普通名詞-一般                      0,1
手形    テガタ  テガタ  手形    名詞-普通名詞-一般                      0
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第      ダイ    ダイ    第      接頭辞
3       3       3       3       名詞-数詞                       0
四半    シハン  シハン  四半    名詞-普通名詞-一般                      0,2
期      キ      キ      期      名詞-普通名詞-助数詞可能                       1
連結    レンケツ        レンケツ        連結    名詞-普通名詞-サ変可能         0
会計    カイケー        カイケイ        会計    名詞-普通名詞-サ変可能         0
期間    キカン  キカン  期間    名詞-普通名詞-一般                      1,2
末日    マツジツ        マツジツ        末日    名詞-普通名詞-一般             0
満期    マンキ  マンキ  満期    名詞-普通名詞-一般                      0,1
手形    テガタ  テガタ  手形    名詞-普通名詞-一般                      0
EOS

However, unidic_lite (and unidic) are based on 国語研短単位 (NINJAL Short Unit Words), which is a rather short word unit for this purpose. I think a longer unit, such as 国語研長単位 (NINJAL Long Unit Words), would be more suitable for FinTech. Would you try making your own tokenizer?
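
A minimal sketch of such long-unit segmentation, using the bert-base-japanese-luw-upos token-classification model that appears later in this thread (aggregation_strategy="simple" merges subwords back into long-unit chunks):

from transformers import pipeline
# Segment a sentence into long unit words via token classification
nlp=pipeline("token-classification",model="KoichiYasuoka/bert-base-japanese-luw-upos",aggregation_strategy="simple")
print([t["word"] for t in nlp("第3四半期連結会計期間末日満期手形")])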

@retarfi
Owner

retarfi commented Oct 27, 2021

As you mentioned, it seems that subword tokenization based on long units like 長単位 would work better than using ipadic or unidic(_lite).
Creating such a tokenizer would be the way to go, but that is difficult with my current resources...

@KoichiYasuoka
Author

KoichiYasuoka commented Oct 30, 2021

Hi @retarfi, I've just released Japanese-LUW-Tokenizer. It took about 20 hours to build the tokenizer from a 700MB orig.txt (one UTF-8 sentence per line) on one GPU (NVIDIA GeForce RTX 2080):

import unicodedata
from tokenizers import CharBPETokenizer
from transformers import AutoModelForTokenClassification,AutoTokenizer,TokenClassificationPipeline,RemBertTokenizerFast

# Segment the raw corpus into long unit words (LUW) with a token-classification model
brt="KoichiYasuoka/bert-base-japanese-luw-upos"
mdl=AutoModelForTokenClassification.from_pretrained(brt)
tkz=AutoTokenizer.from_pretrained(brt)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,aggregation_strategy="simple",device=0)

# Write LUW-segmented sentences (space-separated) to luw.txt, in batches of 256
with open("orig.txt","r",encoding="utf-8") as f, open("luw.txt","w",encoding="utf-8") as w:
  d=[]
  for r in f:
    if r.strip()!="":
      d.append(r.strip())
    if len(d)>255:
      for s in nlp(d):
        print(" ".join(t["word"] for t in s),file=w)
      d=[]
  if len(d)>0:
    for s in nlp(d):
      print(" ".join(t["word"] for t in s),file=w)

# Collect every single CJK character in the original vocabulary as the initial alphabet
alp=[c for c in tkz.convert_ids_to_tokens([i for i in range(len(tkz))]) if len(c)==1 and unicodedata.name(c).startswith("CJK")]
pst=tkz.backend_tokenizer.post_processor

# Train a character-level BPE tokenizer on the LUW-segmented corpus
tkz=CharBPETokenizer(lowercase=False,unk_token="[UNK]",suffix="")
tkz.normalizer.handle_chinese_chars=False  # do not split CJK characters apart
tkz.post_processor=pst  # reuse the [CLS]/[SEP] post-processing of the original tokenizer
tkz.train(files=["luw.txt"],vocab_size=250300,min_frequency=2,limit_alphabet=20000,initial_alphabet=alp,special_tokens=["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]","<special0>","<special1>","<special2>","<special3>","<special4>","<special5>","<special6>","<special7>","<special8>","<special9>"],suffix="")
tkz.save("tokenizer.json")

# Wrap the trained tokenizer as a RemBertTokenizerFast for use with transformers
tokenizer=RemBertTokenizerFast(tokenizer_file="tokenizer.json",vocab_file="/dev/null",bos_token="[CLS]",cls_token="[CLS]",unk_token="[UNK]",pad_token="[PAD]",mask_token="[MASK]",sep_token="[SEP]",do_lower_case=False,keep_accents=True)
tokenizer.save_pretrained("Japanese-LUW-Tokenizer")

vocab_size=250300 seems too big, but it is acceptable. See the details in my diary (written in Japanese).
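
As a quick check, the released tokenizer can be loaded straight from the Hub (a sketch; the exact output depends on the trained vocabulary):

from transformers import AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/Japanese-LUW-Tokenizer")
print(tokenizer.tokenize("四半期連結会計期間末日満期手形"))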

@retarfi
Owner

retarfi commented Nov 4, 2021

Thank you for sharing! I will check it in detail.

retarfi added the bug and enhancement labels on Aug 6, 2022