Newest Ichiran with newest data seems to be failing 31 tests #45

Closed
vpltd-kgalaj opened this issue Dec 22, 2023 · 11 comments

@vpltd-kgalaj

Edict, Kanjidic2, jmdict-data, quicklisp and ichiran pulled from the Net yesterday.

Did full-init.

Had to comment out 2209300 additions in the errata, because the entire entry was deleted in jmdict. Then applied errata again.

macOS 13.6.1 Intel, Postgres and SBCL installed through Brew.

Results:

Unit Test Summary
| 707 assertions total
| 676 passed
| 31 failed
| 2 execution errors
| 0 missing tests

| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("猫" "は" "しっぽ" "を" "ぴんと" "立てて" "歩いた")
| but saw ("猫" "は" "しっぽ" "を" "ぴんと立てて" "歩いた")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("わかりきった") but saw ("わ" "かりきった")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("おとめ" "に" "ふさわしい" "振る舞い") but saw ("お" "とめ" "に" "ふさわしい" "振る舞い")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("折りたたみ" "式" "ついたて") but saw ("折りたたみ" "式" "ついた" "て")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("使い物" "に" "ならん" "だろ") but saw ("使い" "物にならん" "だろ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("雪" "が" "ない" "ため") but saw ("雪" "が" "な" "いため")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("バラしちゃってる") but saw ("バラ" "しちゃってる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("何も" "口" "に" "せぬ") but saw ("何も" "口" "にせぬ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("工夫" "が" "される") but saw ("工夫" "がされる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("だめ" "だったら") but saw ("だ" "めだったら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("彼女" "は" "苦しげ" "に" "うめいて" "横たわった")
| but saw ("彼女" "は" "苦しげ" "に" "うめ" "いて" "横たわった")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("共感" "性") but saw ("共感性")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("それ" "ただ" "の" "怪しい" "人" "です" "し")
| but saw ("それた" "だの" "怪しい" "人" "です" "し")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("出したい" "とき" "は") but saw ("出した" "いと" "き" "は")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("旅行" "に" "いきたい") but saw ("旅行" "にい" "きたい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("しない" "かい") but saw ("し" "ないかい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("てか" "最近" "ファン" "層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる"
"ってのは" "無謀")
| but saw ("てか" "最近" "ファン層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる" "ってのは"
"無謀")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("なんというか" "すみません") but saw ("なんという" "かすみません")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("そう" "したい" "から" "した" "だけ" "だ") but saw ("そうした" "いからした" "だけ" "だ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("手にとって" "いただき" "やすくなる") but saw ("手にとっていた" "だ" "きやすくなる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("大事" "に" "なります") but saw ("大" "事になります")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("奴" "が" "まとも" "に" "見られない") but saw ("奴" "が" "まともに" "見られない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("といった" "ところ" "でしょうか") but saw ("と" "いった" "ところ" "でしょうか")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("言い方" "も" "します") but saw ("言い方" "もします")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("届け" "したら") but saw ("届" "けしたら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("全く" "と" "いって" "いい") but saw ("全く" "と" "いっていい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("仲良し" "に" "なったら") but saw ("仲良し" "になったら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("体" "に" "悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
| but saw ("体に悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("雨" "が" "降りそう" "な" "気がします") but saw ("雨が降りそう" "な" "気がします")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("そういう" "お" "隣" "どうし") but saw ("そういう" "お" "隣どうし")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("みんな" "土足で" "おいで") but saw ("みんな" "土足で" "おい" "で")
|
SEGMENTATION-TEST: 451 assertions passed, 31 failed, and an execution error.

| Execution error:
| Database error 42P01: relation "kanji" does not exist
QUERY: (SELECT r.text, r.type FROM kanji AS k INNER JOIN reading AS r ON (r.kanji_id = k.id) WHERE ((k.text = E'取') and (not (r.type IN (E'ja_na')))))
|
MATCH-READINGS-TEST: 0 assertions passed, 0 failed, and an execution error.

| Execution error:
| Database error 42P01: relation "kanji" does not exist
QUERY: (SELECT r.text, r.type FROM kanji AS k INNER JOIN reading AS r ON (r.kanji_id = k.id) WHERE ((k.text = E'気') and (not (r.type IN (E'ja_na')))))
|
SEGMENTATION-TEST: 451 assertions passed, 31 failed, and an execution error.

#<TEST-RESULTS-DB Total(707) Passed(676) Failed(31) Errors(2)>

@tshatrov
Owner

Hi, unfortunately, because JMdict data is always changing, it's impossible for the segmentation tests to always pass unless the tests are modified and the code is manually recalibrated. For that reason, only the latest release is guaranteed to actually pass all the tests.

For example

| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("猫" "は" "しっぽ" "を" "ぴんと" "立てて" "歩いた")
| but saw ("猫" "は" "しっぽ" "を" "ぴんと立てて" "歩いた")

This test failure is caused by the word ぴんと立つ being added to the JMdict database on 2022-07-19. Since the latest release of Ichiran was in January 2022, the expected result in the test predates this word and doesn't use it for segmentation.
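
To see at a glance which part of a sentence a failure like this affects, the expected and actual token lists can be aligned by character offset. The following is a minimal Python sketch, not part of Ichiran; the function name `diff_segmentation` is hypothetical, and it assumes both lists spell out the same text with no empty tokens.

```python
def diff_segmentation(expected, seen):
    """Align two segmentations of the same text; return the spans where they disagree.

    Hypothetical helper for reading test failures, not Ichiran's own code.
    """
    if "".join(expected) != "".join(seen):
        raise ValueError("segmentations cover different text")
    diffs = []
    i = j = 0
    while i < len(expected) and j < len(seen):
        exp_group, seen_group = [expected[i]], [seen[j]]
        exp_len, seen_len = len(expected[i]), len(seen[j])
        i += 1
        j += 1
        # Extend whichever side is shorter until both groups cover the same characters.
        while exp_len != seen_len:
            if exp_len < seen_len:
                exp_group.append(expected[i])
                exp_len += len(expected[i])
                i += 1
            else:
                seen_group.append(seen[j])
                seen_len += len(seen[j])
                j += 1
        if exp_group != seen_group:
            diffs.append((exp_group, seen_group))
    return diffs
```

Applied to the failure above, this would report only the ぴんと / 立てて span, which makes it easier to look up the newly added dictionary word responsible for the merge.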

As for this,

Database error 42P01: relation "kanji" does not exist

check that you have downloaded the file kanjidic2.xml and specified the path to it in the settings. Try running the following functions manually:

(ichiran/mnt:load-kanjidic)
(ichiran/mnt:load-kanji-stats)

@vpltd-kgalaj
Author

I understand; so in that example the actual result is better than the expected one, given the current state of JMDict, and the expected value needs to be adjusted.

As for kanjidic2.xml, I have it and the path is correct.

  * (ichiran/mnt:load-kanjidic)
    500 entries loaded
    1000 entries loaded
    1500 entries loaded
    2000 entries loaded
    2500 entries loaded
    3000 entries loaded
    3500 entries loaded
    4000 entries loaded
    4500 entries loaded
    5000 entries loaded
    5500 entries loaded
    6000 entries loaded
    6500 entries loaded
    7000 entries loaded
    7500 entries loaded
    8000 entries loaded
    8500 entries loaded
    9000 entries loaded
    9500 entries loaded
    10000 entries loaded
    10500 entries loaded
    11000 entries loaded
    11500 entries loaded
    12000 entries loaded
    12500 entries loaded
    13000 entries loaded
    13109 entries total
    NIL
  * (ichiran/mnt:load-kanji-stats)
    100 kanji processed
    200 kanji processed
    300 kanji processed
    400 kanji processed
    500 kanji processed
    600 kanji processed
    700 kanji processed
    800 kanji processed
    900 kanji processed
    1000 kanji processed
    1100 kanji processed
    1200 kanji processed
    1300 kanji processed
    1400 kanji processed
    1500 kanji processed
    1600 kanji processed
    1700 kanji processed
    1800 kanji processed
    1900 kanji processed
    2000 kanji processed
    2100 kanji processed
    2136 kanji total
    NIL

I ran these just now, but they should also have been executed earlier as part of full-init, so I have to assume the data was already loaded and calculated when I ran the tests previously. I can't rerun the tests at the moment to confirm whether the error is still there, though: in the meantime I added some logging to better understand how it works, and as a side effect the tests seem to lock up midway. I think it's possible that some other change to JMDict or KanjiDic is causing the earlier error.

@vpltd-kgalaj
Author

I repeated the procedure on a fresh database, and the 'kanji' error didn't show up. So indeed, most likely the kanjidic2 database hadn't been loaded despite full-init having finished execution, and the kanjidic2 path being already provided to it before it started.

A mystery, but apparently no longer reproducible.

It's still failing the same 31 tests, but that's expected. Closing.

@tshatrov
Owner

tshatrov commented Dec 25, 2023

I think the first time it failed on add-errata because the word in question was deleted from JMdict (due to my comment, in fact...). I'll try to make it work with the latest data in the coming weeks.

debugger invoked on a CL-POSTGRES-ERROR:FOREIGN-KEY-VIOLATION in thread
#<THREAD "main thread" RUNNING {1001870103}>:
  Database error 23503: insert or update on table "kana_text" violates foreign key constraint "kana_text_entry_seq_foreign"
DETAIL: Key (seq)=(2209300) is not present in table "entry".
QUERY: INSERT INTO kana_text (best_kanji, nokanji, conjugate_p, common_tags, common, ord, text, seq)  VALUES (NULL, false, true, E'', NULL, 0, E'たへる', 2209300) RETURNING id

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ABORT] Exit debugger, returning to top level.

(CL-POSTGRES::GET-ERROR #<SB-SYS:FD-STREAM for "socket 127.0.0.1:54562, peer: 127.0.0.1:5432" {1001B65323}>)
   source: (ERROR (CL-POSTGRES-ERROR::GET-ERROR-TYPE CODE) :CODE CODE :MESSAGE
                  (GET-FIELD #\M) :DETAIL (GET-FIELD #\D) :HINT (GET-FIELD #\H)
                  :CONTEXT (GET-FIELD #\W) ...)

@tshatrov tshatrov reopened this Dec 25, 2023
@vpltd-kgalaj
Author

I reinitialized the entire database, and indeed, it turned out that there had been lingering side effects of that crash (notably, n-kanji and n-kana were left at 0 in many conjugations, which didn't cause crashes but did cause trouble with scoring).

After the reinitialisation, it only fails on 13 tests:

Unit Test Summary
| 748 assertions total
| 735 passed
| 13 failed
| 0 execution errors
| 0 missing tests
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("だしといて") but saw ("だし" "といて")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("猫" "は" "しっぽ" "を" "ぴんと" "立てて" "歩いた")
| but saw ("猫" "は" "しっぽ" "を" "ぴんと立てて" "歩いた")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("おとめ" "に" "ふさわしい" "振る舞い") but saw ("お" "とめ" "に" "ふさわしい" "振る舞い")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("バラしちゃってる") but saw ("バラ" "しちゃってる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("ガス" "が" "ついている") but saw ("ガス" "が" "ついて" "いる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("工夫" "が" "される") but saw ("工夫" "がされる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("共感" "性") but saw ("共感性")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("しない" "かい") but saw ("し" "ないかい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("てか" "最近" "ファン" "層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる"
"ってのは" "無謀")
| but saw ("てか" "最近" "ファン層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる" "ってのは"
"無謀")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("奴" "が" "まとも" "に" "見られない") but saw ("奴" "が" "まともに" "見られない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("体" "に" "悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
| but saw ("体に悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("雨" "が" "降りそう" "な" "気がします") but saw ("雨が降りそう" "な" "気がします")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("そういう" "お" "隣" "どうし") but saw ("そういう" "お" "隣どうし")
|
SEGMENTATION-TEST: 497 assertions passed, 13 failed.
#<TEST-RESULTS-DB Total(748) Passed(735) Failed(13) Errors(0)>

@tshatrov
Owner

tshatrov commented Dec 31, 2023

dec23 branch contains code which should pass all tests on recent JMdict dumps (make sure to run (add-errata) after updating). I'll make a new release soon unless there are some terrible issues with it (haven't tested this version much yet).

@vpltd-kgalaj
Author

I just got around to doing it, and full-init seems to be failing very early on:

* (ichiran/maintenance:full-init)
Initializing ichiran/dict...

debugger invoked on a CL-POSTGRES-ERROR:UNIQUE-VIOLATION in thread
#<THREAD "main thread" RUNNING {10010C0093}>:
  Database error 23505: duplicate key value violates unique constraint "entry_pkey"
DETAIL: Key (seq)=(1000280) already exists.
QUERY: INSERT INTO entry (primary_nokanji, n_kana, n_kanji, root_p, content, seq)  VALUES (false, 0, 0, true, E'<?xml version="1.0" encoding="UTF-8"?>
<entry>
<ent_seq>1000280</ent_seq>
<k_ele>
<keb>論う</keb>
</k_ele>
<r_ele>
<reb>あげつらう</reb>
</r_ele>
<sense>
<pos>v5u</pos>
<pos>vt</pos>
<misc>uk</misc>
<gloss xml:lang="eng">to discuss</gloss>
</sense>
<sense>
<pos>v5u</pos>
<pos>vt</pos>
<gloss xml:lang="eng">to find fault with</gloss>
<gloss xml:lang="eng">to criticize</gloss>
<gloss xml:lang="eng">to criticise</gloss>
</sense>
</entry>', 1000280)

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ABORT] Exit debugger, returning to top level.

(CL-POSTGRES::GET-ERROR #<SB-SYS:FD-STREAM for "socket 127.0.0.1:55237, peer: 127.0.0.1:5432" {100EF410F3}>)
   source: (ERROR (CL-POSTGRES-ERROR::GET-ERROR-TYPE CODE) :CODE CODE :MESSAGE
                  (GET-FIELD #\M) :DETAIL (GET-FIELD #\D) :HINT (GET-FIELD #\H)
                  :CONTEXT (GET-FIELD #\W) ...)
0] 0
* 

Previous master worked correctly with the same JMDict file from around the middle of December, so I think some code change must have caused this...

To be sure, I downloaded today's newest JMdict_e and tried with it, but that didn't fix anything; same crash.

Very strange, it's supposed to be dropping the tables at the beginning of full-init, and it seems impossible for the xml file to have a duplicated entry...
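
One way to rule out duplicated data is to scan the dictionary XML for repeated `<ent_seq>` values. Below is a minimal Python sketch, independent of Ichiran's own loader; `duplicate_seqs` is a hypothetical helper, and it assumes each `<ent_seq>` element sits on a single line, as in the entry quoted above.

```python
import re
from collections import Counter

# Matches one sequence-number element, e.g. <ent_seq>1000280</ent_seq>.
ENT_SEQ = re.compile(r"<ent_seq>(\d+)</ent_seq>")


def duplicate_seqs(lines):
    """Return the sorted sequence numbers that occur more than once in `lines`.

    `lines` is any iterable of strings, for example an open file object.
    Hypothetical helper; it only pattern-matches and does not parse the XML.
    """
    counts = Counter(
        int(match.group(1))
        for line in lines
        for match in ENT_SEQ.finditer(line)
    )
    return sorted(seq for seq, n in counts.items() if n > 1)
```

Feeding it the lines of more than one file (for instance with `itertools.chain`) would also catch a sequence number that appears in two different sources, which a check of the main dump alone would miss.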

Maybe I should have tried just add-errata first, but I wanted to be sure it's all reset. Now I also can't try add-errata anymore, since full-init deleted the tables.

@tshatrov
Owner

tshatrov commented Jan 6, 2024 via email

@tshatrov
Owner

tshatrov commented Jan 6, 2024

Actually, never mind that. This is related to a change I made to load-entry to auto-conjugate words from data/extra.xml.

EDIT: just pushed a fix to the branch

@vpltd-kgalaj
Author

vpltd-kgalaj commented Jan 7, 2024

Your last fix seems to have fixed that one. full-init now gets as far as the "Loading custom data..." before crashing:

Loading custom data...

debugger invoked on a CXML:WELL-FORMEDNESS-VIOLATION in thread
#<THREAD "main thread" RUNNING {10010E8093}>:
  Document not well-formed: Bad attribute value delimiter #\\, must be either #\" or #\'.
Location:
  Line 44, column 24 in NIL


Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ABORT] Exit debugger, returning to top level.

(CXML::%ERROR CXML:WELL-FORMEDNESS-VIOLATION #<RUNES:XSTREAM [main document :MAIN NIL]> "Document not well-formed: Bad attribute value delimiter #\\\\, must be either #\\\" or #\\'.")
   source: (ERROR CLASS :FORMAT-CONTROL "~A" :FORMAT-ARGUMENTS
                  (LIST (GET-OUTPUT-STREAM-STRING S)))
0] 

EDIT: I am going to assume the problem is that "eng" in the last two seqs in extra.xml is escaped, unlike "eng" in the older content there. I'll edit that and restart full-init.
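
A quick way to confirm this kind of problem before starting a long full-init run is to parse the XML text on its own and look at where the parser gives up. This is a minimal Python sketch using the standard library parser rather than CXML, so the exact message and column may differ from the error above; `well_formedness_error` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if `xml_text` is well-formed, else (line, column, message).

    Hypothetical helper; uses Python's bundled expat parser, not CXML.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))
    return None


# An attribute value delimited by \" instead of " is rejected by the parser.
escaped = '<entry>\n<gloss xml:lang=\\"eng\\">to discuss</gloss>\n</entry>'
plain = '<entry>\n<gloss xml:lang="eng">to discuss</gloss>\n</entry>'
```

Here `well_formedness_error(plain)` returns `None`, while the escaped variant yields an error located on the line containing the backslash-escaped quotes.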

@tshatrov
Owner

tshatrov commented Jan 7, 2024

Yeah, the XML file was corrupted. I fixed it and added a test for it.
