yypy22/gsoc_try


For building a new language pair, see https://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO

Final evaluation (publicly accessible): https://docs.google.com/document/d/1yxQH6hk-_fjIM9U0fx14OcezMwBUcIlM2WC-AM8O5Y8/edit

The one on encoding is written above.

Interjections added

Command to count unknown (*-marked) analyses in the sample corpus, sorted by frequency:

$ cat sample.txt | apertium-destxt | apertium -f none -d . jpn-morph | apertium-cleanstream -n | grep -F '*' | sort | uniq -c | sort -nr

A nice amount of sample corpus of horror stories:

https://kikikaikai.fan/feature_article/syarekowa/%e3%82%b3%e3%83%88%e3%83%aa%e3%83%90%e3%82%b3

$ cat article.txt | apertium -d . jpn-disam | cg-conv -a                

$ cat ../gsoc_try/top.txt |apertium-destxt|apertium -f none -d . jpn-morph|apertium-cleanstream -n |grep -F '*'|sort|uniq -c |sort -nr

I use https://tool.konisimple.net/text/hinshi_keitaiso to understand the part-of-speech classification types, such as noun, verb, suffix, etc.

The English rlx file has 650 lines, and I will set the same goal for the Japanese rlx file.

I will focus more on the rlx file from now on. I need to find many samples to identify ambiguous words.

With the Japanese script of "Stay Hungry, Stay Foolish", I got:

total words:3299 unknown words:83 coverage: 97.48%

English words are added from https://englishgrammarhere.com/speaking/100-english-sentences-used-in-daily-life/

The lexc file could probably have its word count reduced; it does not properly follow the format of MeCab's tokenization.

Created a pull request

Their company names are added too

British royal family names are added, and people from the Forbes rich list are added as well

tokenize.py in apertium-jpn is fixed

A malformed-input error occurs with tokenize.py

A few English names, such as Tom, are added in katakana form

Some English words are added, and I will add more

I am adding katakana words from https://benritecho.com/katakanakotoba/

After adding common, frequently used words from non-annotation.txt, I got:

total words:2927 unknown words:218 coverage: 92.55%

I added many entries from all-four.txt, and it is not very reliable now, so I tried non-annotation.txt instead (non-annotation is the correct name; non-notation was a typo).

total words:2927 unknown words:264 coverage: 90.98%

total words:2016 unknown words:21 coverage: 98.96%

98.96% coverage with a new evaluation Python script, which simply counts the number of unknown words and divides it by the total word count.
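Not the actual script in this repo, just a minimal sketch of the idea (the file name coverage_sketch.py is hypothetical), assuming apertium-cleanstream -n output with one lexical unit per line, where an analysis containing '*' means the surface form was unknown:

```python
# coverage_sketch.py (hypothetical name): count unknown analyses on stdin.
import sys

total = 0
unknown = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    total += 1
    if '*' in line:       # unknown surface forms are marked with '*'
        unknown += 1

coverage = 100.0 * (total - unknown) / total if total else 0.0
print(f"total words:{total} unknown words:{unknown} coverage: {coverage:.2f}%")
```

It can be fed from the same pipeline as above, e.g. cat sample.txt | apertium-destxt | apertium -f none -d . jpn-morph | apertium-cleanstream -n | python3 coverage_sketch.py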

With all-four.txt, the output has these unknown words:

[Screenshot from 2023-08-03 14-40-19]

The 65% coverage is due to * being assigned to の, に, は, を, へ, だ, etc., and modCov.py recognises words with * as unknown words!

Tested coverage with modCov.py -> it needs to be modified in the tokenize function

Business words such as アサイン are added to the lexc file as katakana entries

rlx file improved a bit

[Screenshot from 2023-08-01 15-06-48]

Goal: corpus coverage over 90%

It is a bit hard to convert ダウンロード into a rule

I will keep improving the lexc and rlx files so that the tagger and morph commands can be used

With the new rlx: [Screenshot from 2023-07-28 12-54-42]

With the old one, とび is ambiguous, as you can see here: [Screenshot from 2023-07-28 13-01-26]

Now: [Screenshot from 2023-07-23 05-49-53]

After: [Screenshot from 2023-07-22 10-07-13]

Before: [Screenshot from 2023-07-22 10-06-33]

Research on converting a dictionary into a lexc file: https://docs.google.com/document/d/1p2qFp1g9OufeL_Obgg8vpljfgAwKz4D2briQvqepsMw/edit

real_tokenizer.cpp and mecab_tokenizer.py are moved into apertium-jpn as buffer_mecab.cpp and tokenize.py

C++ file with the hfst command: [Screenshot from 2023-07-12 05-59-04]

Tokenizer with the hfst command: [Screenshot from 2023-07-12 05-52-36]

Slight improvement in tokenize.cpp: [Screenshot from 2023-07-08 12-21-11]

Tokenizer with MeCab in C++ (tokenize.cpp): a few mistakes are visible, but the main function works well. [Screenshot from 2023-07-08 10-16-09]

Tokenizer with MeCab (tokenizer_mecab.py); [] is not processed. [Screenshot from 2023-07-08 05-24-06]

tokenize_mecab.py for testing; probably something is still going wrong. [Screenshot from 2023-07-07 12-23-44]

Following https://qiita.com/taku910/items/fbaeab4684665952d5a9, I tried producing the output file in MeCab format, and with bochan.txt I got Precision: 0.25633528265107214, Recall: 0.2321270962047661, F-score: 0.24363131079203335. The command I used was: mecab -F"%M||||" -E"\n" -b 100000 < bochan.txt > bochan.txt.tok
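For context, the precision/recall/F-score values here compare a predicted segmentation against a reference segmentation of the same text. A small illustrative sketch (not the exact scoring script used above) that scores each token as a character span:

```python
# Illustrative scoring of a predicted segmentation against a reference one.
# Each token is mapped to a (start, end) character span; spans are compared as sets.
def spans(tokens):
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def precision_recall_f(reference_tokens, predicted_tokens):
    ref, pred = spans(reference_tokens), spans(predicted_tokens)
    correct = len(ref & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy example: reference vs. a slightly different segmentation of the same text.
print(precision_recall_f(["吾輩", "は", "猫", "で", "ある"], ["吾輩", "は", "猫", "である"]))
```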

SentencePiece with a 251 MB text file from Japanese Wikipedia data > Precision: 0.357958872810358, Recall: 0.4148278905560459, F-score: 0.384300899427637. Probably not the best choice for Apertium. The file exceeds 100 MB, so I cannot upload it here. The text looks like this: [Screenshot from 2023-07-01 05-56-32]

SentencePiece with a 22 MB text file > Precision: 0.2625968992248062, Recall: 0.23918799646954986, F-score: 0.25034642032332566 (with a vocabulary size of 8000 > Precision: 0.357958872810358, Recall: 0.4148278905560459, F-score: 0.384300899427637)

SentencePiece with an approximately 13 MB text file -> Precision: 0.26522593320235754, Recall: 0.2383053839364519, F-score: 0.2510460251046025

I did MeCab word segmentation on bochan.txt and used it to train SentencePiece. I got Precision: 0.24645030425963488, Recall: 0.2144748455428067, F-score: 0.2293534686172723. With plain bochan.txt, I got Precision: 0.25074925074925075, Recall: 0.22153574580759047, F-score: 0.23523898781630742. No big difference.
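A sketch of how that MeCab pre-segmentation step could be done with the mecab-python3 bindings (the runs above used the mecab command-line tool; bochan.seg.txt is a hypothetical output name):

```python
# Sketch of the MeCab word-segmentation step with the mecab-python3 bindings.
import MeCab

tagger = MeCab.Tagger("-Owakati")  # wakati output: tokens separated by spaces

with open("bochan.txt", encoding="utf-8") as src, \
     open("bochan.seg.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if line:
            dst.write(tagger.parse(line).strip() + "\n")
```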

The large Japanese corpus is from Livedoor News: https://www.rondhuit.com/download.html#ldcc

For apertium-jpn, there is a Python file called modCov.py. It tokenizes a text file with tokeniser.py, and the evaluation is calculated as untokenized/total. It got 42.22% on the Hiroshima file. In the tests directory, python3 modCov.py ~/Desktop/gsoc/apertium-jpn/jpn.automorf.bin ~/Desktop/gsoc/jpn-corpus/Hiroshima_Peace_Park_Wikipedia.txt outputs the accuracy.

For sentence.py, I simply tried the word-segmentation aspect and used the Hiroshima file as input. It produces a model and a vocabulary. With the model, I tokenized the Hiroshima file and got 62.11% using some words from the lexc file in apertium-jpn. Word-segmentation docs in Python: https://pypi.org/project/sentencepiece/ I got 85.24% with bochan.txt as input. A larger input corpus will give better accuracy.

Brief summary of SentencePiece: SentencePiece solves the problem by using subwords. First, the text is split into words and the frequency of each word is determined. High-frequency words are then treated as single vocabulary items, while low-frequency words are split into shorter vocabulary units. The splitting is repeated until the vocabulary reaches a pre-specified size. This makes it possible to eliminate unknown words while keeping the vocabulary size small.
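For reference, training and applying a SentencePiece model with the sentencepiece Python package looks roughly like this (corpus path and model prefix are placeholders; vocab_size=8000 matches the experiment mentioned above):

```python
import sentencepiece as spm

# Train a SentencePiece model on a raw-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder corpus path
    model_prefix="jpn_sp",   # writes jpn_sp.model and jpn_sp.vocab
    vocab_size=8000,         # the vocabulary size mentioned above
)

# Load the trained model and segment new text into subword pieces.
sp = spm.SentencePieceProcessor(model_file="jpn_sp.model")
print(sp.encode("広島平和記念公園を訪れた。", out_type=str))
```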

For speed comparison, I used bochan.txt, which is a fairly large Japanese corpus. I got the initial speed with only one Japanese word.

Below is the screenshot of SentencePiece word segmentation (62%):

[Screenshot from 2023-06-22 09-03-49]

Below is the 85% one: [Screenshot from 2023-06-23 00-44-55]

About

For trial purposes only (GSoC 2023)
