CJK_models #57

5uperpalo · 2021-02-14T00:59:22Z

If I understand correctly, the idea is:

- create a balanced dataset of articles with labels
- add the text (draft or article)
- vectorize the text
- create the models

After fetching a text, I apply the same wikitext2words procedure that was used to create the vectors in python-mwtext (it includes CJK tokenization and traditional-to-simplified Chinese conversion).
The tokenization may not be necessary (I haven't checked), but the learned vectors were created from "cleaned up" text, i.e. no numbers, just 'anumber', etc. So it makes sense to do the same here... I think :)
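The kind of cleanup mentioned above (numbers replaced by a placeholder token) could look roughly like the sketch below. It is an illustration only: `normalize_tokens` and the regular expression are hypothetical and are not taken from python-mwtext; only the 'anumber' placeholder comes from the description above.

```python
import re

# Hypothetical pattern: a token consisting only of digits, optionally with
# "," or "." separators between digit groups (e.g. "1987", "1,200", "3.14").
NUMBER_RE = re.compile(r"^\d+(?:[.,]\d+)*$")


def normalize_tokens(tokens):
    """Replace purely numeric tokens with the placeholder 'anumber'."""
    normalized = []
    for token in tokens:
        if NUMBER_RE.match(token):
            normalized.append("anumber")
        else:
            normalized.append(token)
    return normalized


print(" ".join(normalize_tokens(["born", "in", "1987", "with", "1,200", "fans"])))
```

The point is that text scored by the model should go through the same normalization as the text the vectors were trained on, otherwise raw numbers would simply be out-of-vocabulary tokens.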

@halfak Please let me know whether this makes sense. I have not tested the code yet; it is a draft.

5uperpalo · 2021-02-17T21:45:25Z

I came across the following issues:

Japanese tokenizer

PROBLEM: sudachipy sometimes fails with the default dictionary, e.g.:

2021-02-18 09:15:32,337 WARNING:drafttopic.utilities.fetch_draft_text -- アンドリュー・トムソン doesn't look like article text: アンドリュー・トムソン(Andrew T
Traceback (most recent call last):
  File "./utility", line 4, in <module>
    drafttopic.main()
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/drafttopic.py", line 60, in main
    module.main(sys.argv[2:])
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/utilities/fetch_draft_text.py", line 70, in main
    run(observations, session, threads, output, wtpp)
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/utilities/fetch_draft_text.py", line 74, in run
    for obs in fetch_draft_texts(observations, session, threads, wtpp):
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/utilities/fetch_draft_text.py", line 86, in fetch_draft_texts
    for obs in executor.map(_fetch_draft_text, observations):
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 556, in result_iterator
    yield future.result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/dev/CJK_tokenization_testing/drafttopic/drafttopic/utilities/fetch_draft_text.py", line 108, in _fetch_text
    obs['text'] = ' '.join(wtpp.transform(text))
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/mwtext-0.2.1-py3.5.egg/mwtext/content_transformers/wikitext2words.py", line 54, in transform
    return self._extract_words(content)
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/mwtext-0.2.1-py3.5.egg/mwtext/content_transformers/wikitext2words.py", line 86, in _extract_words
    extracted_words[i], language)
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/deltas/tokenizers/cjk_tokenization.py", line 76, in CJK_tokenization
    mode)]
  File "sudachipy/tokenizer.pyx", line 151, in sudachipy.tokenizer.Tokenizer.tokenize
  File "sudachipy/lattice.pyx", line 116, in sudachipy.lattice.Lattice.get_best_path
AttributeError: EOS is not connected to BOS

SOLUTION: install the small dictionary and link it (the problems with the default "core" dictionary seem to be recent, since December 2020?), and set the number of threads to 1 with "--threads=1":

pip install sudachidict_small
sudachipy link -t small
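Independently of the dictionary fix, a single failing text does not have to abort the whole run. The sketch below is a complementary precaution, not what this PR does: `safe_transform` is a hypothetical helper, and `broken_tokenizer` is only a stand-in that mimics the error in the traceback above.

```python
import logging

logger = logging.getLogger(__name__)


def safe_transform(transform, text, title="<unknown>"):
    """Return the list of tokens, or None if the tokenizer raises."""
    try:
        return list(transform(text))
    except Exception as error:  # e.g. an AttributeError raised inside a tokenizer
        logger.warning("Skipping %s: tokenization failed (%s)", title, error)
        return None


def broken_tokenizer(text):
    # Stand-in for a tokenizer that fails on some inputs.
    raise AttributeError("EOS is not connected to BOS")


print(safe_transform(str.split, "a b c"))
print(safe_transform(broken_tokenizer, "some text", title="example page"))
```

Observations for which the helper returns None could then be dropped, in the same spirit as the existing "doesn't look like article text" warning.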

Korean tokenizer

PROBLEM: KoNLPy has an issue with multithreading: it runs on the JVM and raises an error when jpype tries to start multiple JVMs (which is not allowed).
- A solution is described in the docs ("Multithreading with KoNLPy"): https://buildmedia.readthedocs.org/media/pdf/konlpy/latest/konlpy.pdf
- I tried to implement it with the following code:

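import jpype

def main(argv=None):
    # ... argument parsing elided; `threads` comes from --threads ...
    if threads > 1:
        jpype.startJVM()  # start the JVM once, before any worker thread runs

def build_fetch_text(get_first_revision, wtpp):
    jpype.attachThreadToJVM()  # attach the calling thread to the running JVM
    # ... rest of the function elided ...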

Then I received the following error:

  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/deltas/tokenizers/cjk_tokenization.py", line 79, in CJK_tokenization
    seg = get_kor_tokenizer()
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/deltas/tokenizers/cjk_tokenization.py", line 29, in get_kor_tokenizer
    KOR_KONLPY_OKT = ko_okt()
  File "/home/pavol86/venv/3.5/lib/python3.5/site-packages/konlpy/tag/_okt.py", line 94, in __init__
    OktInterfaceJavaClass = oktJavaPackage.OktInterface
AttributeError: Java package 'kr.lucypark.okt' is not valid

SOLUTION: as a temporary fix, set the number of threads to 1:
--threads=1 \
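For reference, the general pattern of creating an expensive, shared tokenizer exactly once and reusing it from worker threads can be sketched as follows. This is an untested idea for this code base, not a confirmed fix: `get_tokenizer` and `fake_factory` are hypothetical names, the factory is a stand-in for a JVM-backed tagger, and with KoNLPy the worker threads may still need to attach to the JVM as its documentation describes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_tokenizer = None
_tokenizer_lock = threading.Lock()


def get_tokenizer(factory):
    """Create the tokenizer on first use and reuse it afterwards."""
    global _tokenizer
    with _tokenizer_lock:  # only one thread may run the factory
        if _tokenizer is None:
            _tokenizer = factory()
    return _tokenizer


calls = []


def fake_factory():
    # Stand-in for an expensive constructor such as a JVM-backed tagger.
    calls.append(1)
    return str.split


def tokenize(text):
    return get_tokenizer(fake_factory)(text)


with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(tokenize, ["a b", "c d", "e f"]))

print(results, "factory calls:", len(calls))
```

The lock guarantees that the factory runs once even when several threads request the tokenizer at the same moment.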

Notes

I set "--threads=1" for all CJK preprocessing, as I also received some "list out of bounds" errors while using the Chinese tokenizer.
I used Wikitext2Words from python-mwtext because it cleans the text and converts traditional to simplified Chinese using hanziconv.
- The text is cleaned in the same way as for the creation of the "learned vectors", i.e. references, numbers, etc. are cleaned.
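The single-thread rule for CJK preprocessing could also be enforced in code rather than only on the command line. The following is a minimal sketch under assumptions: `effective_threads` is a hypothetical helper, and the language codes in `CJK_LANGUAGES` are assumed rather than taken from this repository.

```python
# Assumed language codes for Chinese, Japanese, and Korean.
CJK_LANGUAGES = {"zh", "ja", "ko"}


def effective_threads(language, requested_threads):
    """Force a single worker for CJK languages; otherwise honor the request."""
    if language in CJK_LANGUAGES:
        return 1
    return max(1, requested_threads)


for language in ("en", "ja"):
    print(language, effective_threads(language, 4))
```

A guard like this would keep the workaround in one place until the tokenizers are confirmed to be thread-safe.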

5uperpalo · 2021-03-02T09:15:54Z

After the recent changes in the "feature files", it finally works.
@halfak After you merge and upload the CJK learned vectors in "Add CJK datasets" (mediawiki-utilities/python-mwtext#24), I will adjust the Makefile, and then this can finally be merged.

CJK_models

9e68530

5uperpalo self-assigned this Feb 14, 2021

update

c42c40c

Pavol86 and others added 7 commits February 18, 2021 09:47

update

6b2516f

new

42adfae

updated code

8e2cf6a

update

8a4d319

japanase model

340552b

cjk models

c199e4b

it is alive

a06db9e
