CJK_models #57

Open: wants to merge 9 commits into base: master
5uperpalo (Collaborator) commented Feb 14, 2021

If I understand correctly, the idea is:

  • create balanced dataset of articles with labels
  • add text (draft or article)
  • vectorize the text
  • create models

After fetching the text, I apply the same Wikitext2Words procedure that was used to create the vectors in python-mwtext (it includes CJK tokenization and Traditional-to-Simplified Chinese conversion).
The tokenization may not be necessary (I haven't checked), but the learned vectors were created from "cleaned up" text, i.e. no numbers, just 'anumber', etc. So it makes sense to do the same here... I think :)
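
To illustrate the kind of cleanup meant above, here is a minimal sketch of replacing numeric tokens with the 'anumber' placeholder. The regular expression and the helper name `normalize_numbers` are assumptions made for illustration; the actual rules live in python-mwtext and may differ.

```python
import re

# Hypothetical rule (an assumption, not the mwtext implementation):
# a token made only of digits, optionally with '.' or ',' separators,
# counts as a number.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_numbers(tokens):
    """Replace purely numeric tokens with the placeholder 'anumber'."""
    return ["anumber" if NUMBER_RE.fullmatch(tok) else tok for tok in tokens]


print(normalize_numbers(["born", "in", "1986", "scored", "3.5", "goals"]))
```

The point is only that the text fed to the model should go through the same normalization as the text the vectors were trained on, so that placeholder tokens line up with the vocabulary.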

@halfak Please let me know if this makes sense. I have not tested the code yet; it is a draft.

5uperpalo self-assigned this Feb 14, 2021

5uperpalo (Collaborator, Author) commented Feb 17, 2021

I came across the following issues:

Japanese tokenizer

  • PROBLEM: sudachipy sometimes fails with the default dictionary, e.g.:
2021-02-18 09:15:32,337 WARNING:drafttopic.utilities.fetch_draft_text -- アンドリュー・トムソン doesn't look like article text: アンドリュー・トムソン(Andrew T
Traceback (most recent call last):
  File "./utility", line 4, in <module>
    drafttopic.main()
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/drafttopic.py", line 60, in main
    module.main(sys.argv[2:])
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/utilities/fetch_draft_text.py", line 70, in main
    run(observations, session, threads, output, wtpp)
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/utilities/fetch_draft_text.py", line 74, in run
    for obs in fetch_draft_texts(observations, session, threads, wtpp):
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/utilities/fetch_draft_text.py", line 86, in fetch_draft_texts
    for obs in executor.map(_fetch_draft_text, observations):
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 556, in result_iterator
    yield future.result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/utilities/fetch_draft_text.py", line 108, in _fetch_text
    obs['text'] = ' '.join(wtpp.transform(text))
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/mwtext-0.2.1-py3.5.egg/mwtext/content_transformers/wikitext2words.py", line 54, in transform
    return self._extract_words(content)
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/mwtext-0.2.1-py3.5.egg/mwtext/content_transformers/wikitext2words.py", line 86, in _extract_words
    extracted_words[i], language)
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/deltas/tokenizers/cjk_tokenization.py", line 76, in CJK_tokenization
    mode)]
  File "sudachipy/tokenizer.pyx", line 151, in sudachipy.tokenizer.Tokenizer.tokenize
  File "sudachipy/lattice.pyx", line 116, in sudachipy.lattice.Lattice.get_best_path
AttributeError: EOS is not connected to BOS
  • SOLUTION: install the small dictionary and link it (the problems with the default "core" dictionary seem to be recent, since December 2020?), and set threads to 1 with "--threads=1":
pip install sudachidict_small
sudachipy link -t small
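
Independent of the dictionary fix, the traceback shows that a single tokenizer failure aborts the whole run. A possible guard, sketched here under the assumption that skipping the affected article is acceptable, is to catch the exception around the transform call. `safe_transform` and `broken_tokenizer` are hypothetical names used for illustration; this is not code from the PR.

```python
import logging

logger = logging.getLogger(__name__)


def safe_transform(transform, text, title=None):
    """Return the space-joined tokens, or None if the tokenizer raises."""
    try:
        return " ".join(transform(text))
    except Exception as error:  # e.g. the AttributeError raised by sudachipy
        logger.warning("Skipping %r: tokenization failed (%s)", title, error)
        return None


def broken_tokenizer(text):
    # Stand-in for a tokenizer that fails on some input (illustration only).
    raise AttributeError("EOS is not connected to BOS")


print(safe_transform(str.split, "a b c"))
print(safe_transform(broken_tokenizer, "x", title="example"))
```

The caller would then drop observations whose text is None instead of writing them to the output.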

Korean tokenizer

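import jpype

def main(argv=None):
    # ... rest of main elided; `threads` is the value of --threads
    if threads > 1:
        # start the JVM once, before the worker threads run
        jpype.startJVM()

def build_fetch_text(get_first_revision, wtpp):
    # attach the calling thread to the running JVM
    jpype.attachThreadToJVM()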
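  • PROBLEM: I tried starting the JVM in main when threads > 1 and attaching threads to it (snippet above), but then I received the following error: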
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/deltas/tokenizers/cjk_tokenization.py", line 79, in CJK_tokenization
    seg = get_kor_tokenizer()
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/deltas/tokenizers/cjk_tokenization.py", line 29, in get_kor_tokenizer
    KOR_KONLPY_OKT = ko_okt()
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/konlpy/tag/_okt.py", line 94, in __init__
    OktInterfaceJavaClass = oktJavaPackage.OktInterface
AttributeError: Java package 'kr.lucypark.okt' is not valid
  • SOLUTION: a temporary fix is to set threads to 1:
    --threads=1 \

Notes

  • I set "--threads=1" for all CJK preprocessing, because I also received some "list out of bounds errors" when using the Chinese tokenizer with more threads
  • I used Wikitext2Words from python-mwtext because it cleans the text and converts Traditional to Simplified Chinese using hanziconv
    • the cleaning is the same as the one used to create the "learned vectors", i.e. references, numbers, etc. are cleaned
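
If single-threaded runs turn out to be too slow, one option worth trying is to keep the fetching parallel and serialize only the tokenization with a lock. The sketch below assumes the failures come from tokenizers that are not thread-safe, which the errors above suggest but do not prove; in particular, it is untested whether this helps with the konlpy/JVM error. The names `fetch_and_tokenize` and `run`, and the `fetch`/`tokenize` callables, are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One lock shared by all worker threads (assumption: tokenizers are not thread-safe).
_tokenizer_lock = threading.Lock()


def fetch_and_tokenize(title, fetch, tokenize):
    text = fetch(title)        # network I/O, may run in parallel
    with _tokenizer_lock:      # only one thread tokenizes at a time
        return title, " ".join(tokenize(text))


def run(titles, fetch, tokenize, threads=4):
    with ThreadPoolExecutor(max_workers=threads) as executor:
        # executor.map yields results in the order of `titles`
        return list(executor.map(
            lambda title: fetch_and_tokenize(title, fetch, tokenize), titles))


# Usage with stand-in callables: the "fetch" just echoes the title.
print(run(["a b", "c"], fetch=lambda t: t, tokenize=str.split, threads=2))
```

Since most of the time is presumably spent waiting on the API, this would keep much of the benefit of threading while avoiding concurrent calls into the tokenizers.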
