Using spacy for POS detection when creating word wise epub #71

Closed
Vuizur opened this issue Sep 18, 2022 · 25 comments

@Vuizur
Contributor

Vuizur commented Sep 18, 2022

First of all, thank you for all your hard work on this extension! It is a really impressive and cool feature.

If I understand it correctly, you currently use spaCy for named entity recognition for X-Ray. I think it would also be cool if you could use spaCy's POS detection to get the correct translation of a word for Word Wise. This is especially useful for languages like Spanish, where verbs and nouns are often written the same. I tested it a bit, and spaCy is pretty good at telling them apart, so this could improve the accuracy of the translations quite a bit.
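
For example, here is a minimal sketch of what I mean (not your plugin's code; it assumes the small Spanish model es_core_news_sm is installed):

```python
# Minimal sketch (not the plugin's code): spaCy's POS tags separate Spanish
# homographs such as "vino" (a form of the verb "venir" vs. the noun "wine").
# Assumes: python -m spacy download es_core_news_sm
import spacy

nlp = spacy.load("es_core_news_sm")
doc = nlp("Vino a casa y bebió un vaso de vino.")

for token in doc:
    # Prints text, universal POS tag and lemma; ideally the first "Vino" comes
    # out as VERB/venir and the second "vino" as NOUN/vino.
    print(token.text, token.pos_, token.lemma_)
```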

@xxyzz
Owner

xxyzz commented Sep 18, 2022

The verb and noun glosses of the same word won't be much different, and applying the spaCy pipeline will also make the program slower (the current code creates the Word Wise file instantly). The benefits of using POS probably aren't worth the effort and the loss of speed.

I see you also use files from kaikki.org in your project, so maybe you'll be interested in the pull requests I created for wiktextract and wikitextprocessor. These pull requests add support for parsing non-English Wiktionary dump files, which would provide non-English Word Wise glosses, a feature requested by many users.

Maybe POS will be used in the future, but currently parsing non-English Wiktionary has higher priority.

@Vuizur
Contributor Author

Vuizur commented Sep 18, 2022

The verb and noun glosses of the same word won't be much different, and applying the spaCy pipeline will also make the program slower (the current code creates the Word Wise file instantly). The benefits of using POS probably aren't worth the effort and the loss of speed.

I think for Spanish it would be super beneficial: I looked at a random page on my tablet and think it could have fixed about 9 mistranslations. For other languages the improvement might be a lot smaller. But I can see that this might take a lot of effort, so of course one has to prioritise.

I see you also use files from kaikki.org in your project, so maybe you'll be interested in the pull requests I created for wiktextract and wikitextprocessor. These pull requests add support for parsing non-English Wiktionary dump files, which would provide non-English Word Wise glosses, a feature requested by many users.

Totally. I was always too afraid to try to edit this code, so it is really great that you are opening the path for more Wiktionaries to get parsed in the future. 👍 I wrote a program that reads the 40 GB HTML dump of the Russian Wiktionary only because I doubted I could get an adapted wiktextract to run properly 😁.

@xxyzz
Owner

xxyzz commented Feb 3, 2023

Hi @Vuizur, I have added this feature to the master branch. Please download the zip file from GitHub Actions to test it: https://github.com/xxyzz/WordDumb/actions/runs/4081858346

The new POS feature can be enabled in the plugin configuration window.

@Vuizur
Contributor Author

Vuizur commented Feb 5, 2023

This is amazing, I am really excited about this feature. 👍

It seems like there is still a small bug and some words get skipped in the output file:
[screenshots]

(Windows 11, Calibre 6.11.0)

@Vuizur
Contributor Author

Vuizur commented Feb 5, 2023

I am not exactly sure if this is the reason, but when I have both an epub and a converted mobi file and WordDumb asks me to select a format (in this case I clicked on mobi), I get the following error message:

Starting job: Generating Word Wise and X-Ray for El Dragón Renacido 
Job: "Generating Word Wise and X-Ray for El Dragón Renacido" failed with error: 
Traceback (most recent call last):
  File "calibre\gui2\threaded_jobs.py", line 82, in start_work
  File "calibre_plugins.worddumb.parse_job", line 127, in do_job
  File "calibre_plugins.worddumb.deps", line 157, in download_word_wise_file
TypeError: spacy_model_name() missing 1 required positional argument: 'prefs'
 
Called with args: ((6, 'MOBI', 'C:\\Users\\hanne\\Calibre Library\\Robert Jordan\\El Dragon Renacido (6)\\El Dragon Renacido - Robert Jordan.mobi', <calibre.ebooks.metadata.book.base.Metadata object at 0x000002C076625120>, {'spacy': 'es_core_news_', 'wiki': 'es', 'kaikki': 'Spanish', 'gloss': False, 'has_trf': False}), True, True) {'notifications': <queue.Queue object at 0x000002C076625420>, 'abort': <threading.Event object at 0x000002C076625CF0>, 'log': <calibre.utils.logging.GUILog object at 0x000002C076625DB0>} 

Edit: I think this only occurs with the new feature enabled.

@xxyzz
Owner

xxyzz commented Feb 5, 2023

bc32c7a fixes the error.

There are two cases where a word doesn't have Word Wise:

  • The customize lemmas table doesn't have that word, or doesn't have it with that POS type
  • spaCy doesn't lemmatize the word

@Vuizur
Contributor Author

Vuizur commented Feb 5, 2023

Thanks a lot for fixing the error!

There are two cases where a word doesn't have Word Wise:

  • The customize lemmas table doesn't have that word, or doesn't have it with that POS type

  • spaCy doesn't lemmatize the word

I don't mean the Word Wise information; the entire word seems to be missing. In the picture it is, for example, the third word (dejó).

@xxyzz
Owner

xxyzz commented Feb 6, 2023

Oh, I didn't notice that; 723634a should fix this bug.

@Vuizur
Contributor Author

Vuizur commented Feb 6, 2023

Awesome, the sentences are complete now. And the new feature fixed two errors in just the first sentence of my Spanish example book 😁; the glosses are super good now.

The lemmatization currently seems to be a bit broken by the new feature: inflected words don't get definitions.

Before: [screenshot]
After: [screenshot]

I checked with the medium Spanish model; it seems to get the POS and lemmatization correct (for example for dejó or ojos). All the words that are missing are inflected. For some reason the word pero also gets no gloss: spaCy classifies it as CCONJ, which gets mapped to conj by your code, so I have no idea why it should not work. (There is also a noun with the same string in the kaikki data, but the disambiguation works in the other cases. 🤔)
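
For reference, this is roughly how I checked it (the POS mapping dict below is only for illustration, not your actual conversion table):

```python
# Rough check with the medium Spanish model; SPACY_TO_WIKTIONARY is an
# illustrative mapping only, not the plugin's real table.
import spacy

nlp = spacy.load("es_core_news_md")
doc = nlp("Dejó el libro, pero no cerró los ojos.")

SPACY_TO_WIKTIONARY = {"CCONJ": "conj", "SCONJ": "conj", "NOUN": "noun", "VERB": "verb", "ADJ": "adj"}

for token in doc:
    print(token.text, token.lemma_, token.pos_, SPACY_TO_WIKTIONARY.get(token.pos_, "?"))
```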

@xxyzz
Owner

xxyzz commented Feb 6, 2023

Could you upload the book in your screenshot?

@Vuizur
Contributor Author

Vuizur commented Feb 6, 2023

Novelas y fantasias - Roberto Payro_comparison.zip

This is another (copyright free) Spanish book, but it shows the same behaviour.

@Vuizur
Contributor Author

Vuizur commented Feb 6, 2023

I just tested some other languages. I think the spaCy POS approach is also good for languages where the flashtext algorithm is currently broken due to special characters (I think German, and probably all Cyrillic languages). In those cases flashtext also matches substrings; for example, for the word "really" it would match the word "real".

@xxyzz
Owner

xxyzz commented Feb 7, 2023

Ah, I forgot to use the lemma form; a483fe5 should fix this bug.

flashtext is pretty much abandonware, but I can't find an alternative library. Maybe I can enable the POS feature by default and get rid of this dependency. But spaCy is slower than flashtext, and spaCy's lemmatizer is not that accurate.

@Vuizur
Contributor Author

Vuizur commented Feb 7, 2023

Great, it is fixed now. 👍 You are right that for languages like Spanish the POS version finds slightly fewer definitions than the original flashtext version, mostly due to spaCy/Wiktionary disagreements (sometimes spaCy says something is AUX and Wiktionary says it is a verb) and spaCy's rule-based lemmatizer, which is bad at irregular verbs, for example.

In the original flashtext version you used the words plus inflections from the Kaikki data? The big advantage of this approach, as I understand it, is that in the end it could also support basically any language, not only the ones supported by spaCy. In the PR I linked there seem to be workarounds so that non-Latin languages are better supported, though nobody has benchmarked them.

I also think it might be a good idea to create a general-purpose library for cleaning the kaikki inflections (or maybe contribute it to wiktextract) that we both could work on, because it is potentially useful to many people (me among them) and requires language-specific knowledge.

@xxyzz
Owner

xxyzz commented Feb 7, 2023

Yeah, I was using flashtext and pyahocorasick (for Chinese, Korean and Japanese) before. The Kindle English inflected forms are created with LemmInflect, and for EPUB books the inflected forms come from kaikki.org's Wiktionary JSON file. This also guarantees that all known forms in the book will be found.
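
For example, the LemmInflect side looks roughly like this (a sketch, not the exact WordDumb code):

```python
# Sketch (not the exact WordDumb code): collect all English inflected forms
# of a lemma with LemmInflect.
from lemminflect import getAllInflections

# getAllInflections returns a dict of Penn tags to form tuples,
# e.g. {'NN': ('gloss',), 'NNS': ('glosses',), ...}
forms = {form for infl in getAllInflections("gloss").values() for form in infl}
print(forms)
```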

I'm not sure what you mean by a "library cleaning kaikki inflections data". The forms data in kaikki.org's JSON file doesn't need much cleaning; I don't change that data much.

@Vuizur
Contributor Author

Vuizur commented Feb 7, 2023

I'm not sure what you mean by a "library cleaning kaikki inflections data". The forms data in kaikki.org's JSON file doesn't need much cleaning; I don't change that data much.

I think quite a few things come together here (a rough sketch of what I mean follows the list):

  • Removing meta info tags (inflection table info/...)
  • Removing inflections that actually aren't inflections (for example, paired verbs in Russian, or auxiliary verbs that are also sometimes in the table)
  • Removing stress marks in Cyrillic languages (I think this, in addition to the flashtext bugs, currently causes weird behaviour for Cyrillic languages; the simplemma library was also bitten by this, I think). Latin also has a similar problem
  • Removing articles (relevant for German I think)
  • Removing duplicates. I still don't know if one should keep the tags in the table though
  • Bonus feature: adding inflections that only exist as separate glosses (in Spanish these are words like comérselo) <- but this is quite difficult and would require creating a DB
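
Very roughly, I imagine something like this (the tag names and prefix lists are only examples, not a finished rule set):

```python
# Very rough sketch of the kind of cleanup I mean. META_TAGS and SKIP_PREFIXES
# are only examples, not a complete rule set.
import unicodedata

META_TAGS = {"table-tags", "inflection-template", "class"}            # inflection-table metadata
SKIP_PREFIXES = ("der ", "die ", "das ", "den ", "haben ", "sein ")   # German articles / auxiliaries

def strip_stress(word: str) -> str:
    # Drop combining acute accents (U+0301) used as stress marks in Cyrillic.
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))

def clean_forms(forms: list[dict]) -> set[str]:
    cleaned = set()
    for form in forms:
        if set(form.get("tags", [])) & META_TAGS:
            continue  # skip table metadata entries like "ru-noun-table"
        word = form.get("form", "")
        for prefix in SKIP_PREFIXES:
            if word.startswith(prefix):
                word = word[len(prefix):]
        if word:
            cleaned.add(strip_stress(word))  # the set also removes duplicates
    return cleaned
```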

@xxyzz
Owner

xxyzz commented Feb 7, 2023

I just realized what the issue you linked before really means... So flashtext is only useful for English. I think I could just use spaCy's PhraseMatcher in place of flashtext and pyahocorasick for all languages, by adding the inflected forms to the matcher and matching the text attribute instead of the lemma attribute.
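
Something along these lines (a sketch; the forms dict would come from the kaikki.org JSON file):

```python
# Sketch of using PhraseMatcher like flashtext/pyahocorasick: add every known
# inflected form as a pattern and match the verbatim token text (default ORTH).
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("es")             # only the tokenizer is needed for pure text matching
matcher = PhraseMatcher(nlp.vocab)  # default attr="ORTH" matches the token text

forms = {"dejar": ["dejó", "dejaba"], "ojo": ["ojos"]}  # lemma -> inflected forms
for lemma, inflections in forms.items():
    matcher.add(lemma, [nlp.make_doc(w) for w in inflections])

doc = nlp("El hombre dejó el libro y cerró los ojos.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```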

As for improving kaikki.org's data, that would ideally require changing the wiktextract code and editing Wiktionary pages, IIUC. Unfortunately I don't understand German or Russian, so I can't tell whether the inflections are correct for those languages. Wiktionary editors use bots to make changes like these a lot; I bet they already have bots that can remove duplicated words.

@Vuizur
Contributor Author

Vuizur commented Feb 7, 2023

As for improving kaikki.org's data, that would ideally require changing the wiktextract code and editing Wiktionary pages, IIUC. Unfortunately I don't understand German or Russian, so I can't tell whether the inflections are correct for those languages. Wiktionary editors use bots to make changes like these a lot; I bet they already have bots that can remove duplicated words.

True, duplicates are maybe not the best example; they only occur if you throw away the forms' tags. But in general the fixes I refer to are not mistakes in Wiktionary or wiktextract, because, for example, the stressed inflections for Russian are super useful in general. They are only problems if you try to use the Kaikki data directly for lemmatization, or to look up a word from its inflected form as it would appear in a normal text. So it doesn't directly fit in the wiktextract core code.

@xxyzz
Owner

xxyzz commented Feb 8, 2023

Just to make sure I understand the problems you posted, I'll use this https://kaikki.org/dictionary/All%20languages%20combined/meaning/%D1%87/%D1%87%D0%B8/%D1%87%D0%B8%D1%82%D0%B0%D1%82%D1%8C.html page as an example:

* Removing meta info tags (inflection table info/...)

"ru-noun-table", "hard-stem", "accent-c", "1a imperfective transitive" should be removed from forms. This should be fixed in wikiextarct code. But won't affect Word Wise because book texts usually don't have these words.

* Removing inflections that actually aren't inflections (for example, paired verbs in Russian, or auxiliary verbs that are also sometimes in the table)

You mean "бу́ду" and it's forms should be removed? But have this word won't affect Word Wise.

* Removing stress marks in Cyrillic languages (I think this, in addition to the flashtext bugs, currently causes weird behaviour for Cyrillic languages; the simplemma library was also bitten by this, I think). Latin also has a similar problem

Remove the stress mark so "чита́ть" becomes "читать", right? But book texts use "чита́ть", not "читать". And this won't be a problem for spaCy's PhraseMatcher. I should add the stress mark to kaikki's lemma form instead.

* Removing articles (relevant for German I think)

https://kaikki.org/dictionary/German/meaning/l/le/lesen.html

In this case, ignore forms that start with "haben"? Again, having this word doesn't matter; the form after it can still be matched.

@Vuizur
Contributor Author

Vuizur commented Feb 8, 2023

"ru-noun-table", "hard-stem", "accent-c", "1a imperfective transitive" should be removed from forms. This should be fixed in wikiextarct code. But won't affect Word Wise because book texts usually don't have these words.

True, although exceptions are theoretically conceivable; I can't name any right now.

You mean "бу́ду" and it's forms should be removed? But have this word won't affect Word Wise.

In Russian there are paired verbs that are written very similarly and differ in their "aspect" (a grammatical concept). These paired verbs are in the forms list. However, this causes, for example, the words for "try" and "torture" to exist as inflections of each other, which we don't really want. For German this causes "haben" to be an inflection of almost everything (because it is listed as an auxiliary verb in the inflections).

Remove the stress mark so "чита́ть" becomes "читать", right? But book texts use "чита́ть", not "читать". And this won't be a problem for spaCy's PhraseMatcher.

In books stress marks are not written; only in books for learners, and those are pretty hard to find. For Russian you have a point though, because in theory every learner should use my program to get them back 😁. For other Cyrillic languages with unpredictable stress, those stressed inflections will cause the words not to be found. The same goes for Latin.

I should add the stress mark to kaikki's lemma form instead.

This is actually a good point. I was also working a bit on dictionary creation and came to the conclusion that ideally the form displayed to the user should be the inflection tagged as "canonical". It does have the drawback that some of these canonical forms can still be buggy and contain something like " f m". So for smaller languages where the data hasn't been looked over this might lead to small mistakes, but in general it is a good idea.

https://kaikki.org/dictionary/German/meaning/l/le/lesen.html
In this case, ignore forms that start with "haben"? Again, having this word doesn't matter; the form after it can still be matched.

I think in my last tests there were some German nouns where you only had "den Zaubererern" as an inflection, but not "Zauberern" (don't nail me down on this example, but it was something similar). As far as I understand it, this would cause the word not to be found if the article isn't used in the text.

I already wrote some untested code; I will link it here when it is OK-ish.

@xxyzz
Owner

xxyzz commented Feb 8, 2023

I was completely wrong about the stress... I read the Russian example sentence on the English Wiktionary and thought normal Russian text also has those marks. I should read the Stress Wikipedia page more carefully. So for Russian, Belarusian and Ukrainian, forms both with and without stress marks should be added to the PhraseMatcher. But I guess your "add-stress-to-epub" library can't "un-stress" words.

Russian paired verbs seem complicated and require knowledge of the language, so I'll temporarily ignore that. And for "haben" or "den" in German forms I can probably get away with it if most words have the ideal inflection form without them.

The unstressed forms will be addressed first; this seems more important than the other issues.
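
The un-stressing itself should be simple enough, something like this sketch (just stripping the combining acute accent):

```python
# Sketch: remove the combining acute accent (U+0301) that marks stress, so
# both the stressed and unstressed forms can be added to the PhraseMatcher.
import unicodedata

def unstress(word: str) -> str:
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))

print(unstress("чита́ть"))  # читать
```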

@Vuizur
Contributor Author

Vuizur commented Feb 10, 2023

I wrote some not very well tested code here (the API might change): https://github.com/Vuizur/wiktextract-lemmatization
The idea is that you can put a forms array in and get a fixed one back, so it should be easy to integrate into existing code.

@xxyzz
Owner

xxyzz commented Feb 11, 2023

Cool! I could use your code in the Proficiency repo for Russian, Belarusian and Ukrainian languages.

@xxyzz
Owner

xxyzz commented Feb 12, 2023

I included your code in the v0.5.4dev Proficiency pre-release; it's currently used by the WordDumb code in the master branch. I also use spaCy's PhraseMatcher to find Word Wise words even when the "Use POS type" option is disabled. This adds a few seconds for loading the PhraseMatcher object but should give better results than flashtext for Russian and German.

@Vuizur
Contributor Author

Vuizur commented Feb 13, 2023

Nice, the word detection now works flawlessly for Russian. 👍

@xxyzz xxyzz closed this as completed Mar 15, 2023