Using spaCy for POS detection when creating Word Wise EPUB #71
Comments
The verb and noun glosses of the same word usually won't differ much, and applying a spaCy pipeline will also make the program slower (the current code creates the Word Wise file instantly). The benefit of using POS probably isn't worth the effort and the loss of speed. I see you also use files from kaikki.org in your project; maybe you'll be interested in the pull requests I created for wiktextract and wikitextprocessor. These pull requests add support for parsing non-English Wiktionary dump files, which would provide non-English Word Wise glosses, something many users have requested. Maybe POS will be used in the future, but currently parsing non-English Wiktionary has higher priority.
I think for Spanish it would be super beneficial: I looked at a random page on my tablet and think it could have fixed about 9 mistranslations. For other languages the improvement might be a lot smaller. But I can see that this might take a lot of effort, so of course one has to prioritise.
Totally, I was always too afraid to try to edit this code, so it is really great that you are enabling a path for more Wiktionaries to get parsed in the future. 👍 I wrote a program reading the 40 GB HTML dump of the Russian Wiktionary only because I doubted too much that I could get an adapted wiktextract to run properly 😁.
Hi @Vuizur, I have added this feature to the master branch. Please download the zip file from GitHub Actions to test it: https://github.com/xxyzz/WordDumb/actions/runs/4081858346 The new POS feature can be enabled in the plugin configuration window.
I am not exactly sure if this is the reason, but when I have both an EPUB and a converted MOBI file and WordDumb wants me to select a format (in this case I clicked on MOBI), I get the following error message:
Edit: I think this only occurs with the new feature enabled.
bc32c7a fixes the error. There are two cases where a word doesn't have Word Wise:
Thanks a lot for fixing the error!
I don't mean Word Wise information, the entire word seems to be missing. In the picture it is the third word for example (dejó).
Oh I didn't notice that, 723634a should fix this bug.
Awesome, the sentences are complete now. And the new feature fixed two errors in just the first sentence of my Spanish example book 😁; the glosses are super good now. However, the lemmatization currently seems to be somewhat broken by the new feature: inflected words don't get definitions. I checked with the medium Spanish model, and it gets the POS and lemmatization correct (for example for dejó or ojos). All the words that are missing are inflected. For some reason the word pero also gets no gloss: spaCy classifies it as CCONJ, which gets mapped to conj by your code, so I have no idea why it shouldn't work. (There is also a noun with the same string in the kaikki data, but the disambiguation works in the other cases. 🤔)
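For context, the kind of POS-based disambiguation being discussed here can be sketched in a few lines. Everything below is illustrative (the mapping table, `pick_gloss`, and the `senses` dict are hypothetical names, not WordDumb's actual code):

```python
# Hypothetical sketch: map spaCy's universal POS tags to Wiktionary-style
# POS labels, then use the label to pick the right gloss for a word.
SPACY_TO_WIKTIONARY_POS = {
    "NOUN": "noun",
    "VERB": "verb",
    "AUX": "verb",   # treat auxiliaries as verbs so they still match verb entries
    "ADJ": "adj",
    "ADV": "adv",
    "CCONJ": "conj",
    "SCONJ": "conj",
}

def pick_gloss(senses, spacy_pos):
    """Return the gloss whose Wiktionary POS matches the spaCy tag, if any."""
    wikt_pos = SPACY_TO_WIKTIONARY_POS.get(spacy_pos)
    return senses.get(wikt_pos)

# Spanish "pero" is both a conjunction ("but") and a noun ("pear tree"):
senses = {"conj": "but", "noun": "pear tree"}
print(pick_gloss(senses, "CCONJ"))  # but
print(pick_gloss(senses, "NOUN"))   # pear tree
```

This also shows why an AUX/VERB disagreement between spaCy and Wiktionary matters: without the `"AUX": "verb"` entry, auxiliary uses of a verb would get no gloss at all.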
Could you upload the book in your screenshot? |
Novelas y fantasias - Roberto Payro_comparison.zip This is another (copyright free) Spanish book, but it shows the same behaviour. |
I just tested some other languages, and I think the spaCy POS algorithm is also good for languages where the flashtext algorithm is currently broken due to special characters (I think German, and probably all Cyrillic-script languages). In those cases flashtext also matches substrings: for example, if you have the word "really" it would match the word "real".
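The substring behaviour described above comes from flashtext treating only `[A-Za-z0-9_]` as word characters by default (its `non_word_boundaries` set), so every Cyrillic letter looks like a word boundary. A minimal pure-Python sketch of that boundary logic (not flashtext's actual implementation) reproduces the over-matching:

```python
import string

# flashtext's default word characters are ASCII-only; anything outside
# this set counts as a word boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def ascii_boundary_match(keyword, text):
    """Return True if `keyword` occurs in `text` delimited by non-word
    characters, using the ASCII-only word-character set above."""
    i = text.find(keyword)
    while i != -1:
        before = text[i - 1] if i > 0 else ""
        after = text[i + len(keyword)] if i + len(keyword) < len(text) else ""
        if before not in WORD_CHARS and after not in WORD_CHARS:
            return True
        i = text.find(keyword, i + 1)
    return False

print(ascii_boundary_match("real", "really"))  # False: 'y' is a word char
print(ascii_boundary_match("чита", "читать"))  # True: Cyrillic letters are
                                               # not in the ASCII word set
```

For Latin-script English the boundary check works, but for Cyrillic text the keyword "чита" is wrongly found inside "читать", which is exactly the broken behaviour reported here.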
Ah, I forgot to use the lemma form; a483fe5 should fix this bug. flashtext is pretty much abandonware, but I can't find an alternative library. Maybe I can enable the POS feature by default and get rid of this dependency. But spaCy is slower than flashtext, and spaCy's lemmatizer is not that accurate.
Great, it is fixed now. 👍 You are right that for languages like Spanish the POS version finds somewhat fewer definitions than the original flashtext version, mostly due to spaCy/Wiktionary disagreements (sometimes spaCy says something is AUX and Wiktionary says it is a verb) and spaCy's rule-based lemmatizer, which is poor for irregular verbs, for example. In the original flashtext version you used the words plus inflections from the kaikki data? The big advantage of that approach, as I understand it, is that in the end it could support basically any language, not only the ones supported by spaCy. On the PR I linked there seem to exist workarounds so that non-Latin languages are better supported, though nobody has benchmarked them. I also think it might be a good idea to create a general-purpose library to perform the cleaning of the kaikki inflections (or maybe contribute it to wiktextract) that we both could contribute to, because it is potentially useful to many people (me among them) and requires language-specific knowledge.
Yeah, I was using flashtext and pyahocorasick (for Chinese, Korean and Japanese) before. The Kindle English inflected forms are created with LemmInflect, and for EPUB books the inflected forms come from kaikki.org's Wiktionary JSON file. This also guarantees that all known forms in the book will be found. I'm not sure what you mean by a "library cleaning kaikki inflections data". The forms data in kaikki.org's JSON file doesn't need much cleaning; I don't change that data much.
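To make the discussion concrete: a kaikki.org dump is JSONL (one JSON object per line), and each entry's `forms` list carries inflections with their grammar tags. A minimal sketch of building a form-to-lemma table from such lines (the sample entry below is made up; "leer" is used only for illustration):

```python
import json

# Hypothetical sample line mimicking the kaikki.org JSONL structure:
# each object has "word", "pos", and a "forms" list of {"form", "tags"} dicts.
sample_line = json.dumps({
    "word": "leer",
    "pos": "verb",
    "forms": [
        {"form": "leo", "tags": ["first-person", "singular", "present"]},
        {"form": "leyó", "tags": ["third-person", "singular", "preterite"]},
    ],
})

def forms_to_lemma(jsonl_lines):
    """Map every known inflected form back to its lemma (headword)."""
    mapping = {}
    for line in jsonl_lines:
        entry = json.loads(line)
        for form in entry.get("forms", []):
            mapping[form["form"]] = entry["word"]
    return mapping

print(forms_to_lemma([sample_line]))  # {'leo': 'leer', 'leyó': 'leer'}
```

This is what guarantees that every form listed in the dump can be matched in a book, independent of whether spaCy supports the language.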
I think there are quite a few things that come together:
I just realized what the issue you linked before really means... So flashtext is only useful for English. I think I could just use spaCy's As for improving kaikki.org's data, that would ideally require changing the wiktextract code and editing Wiktionary pages, IIUC. Unfortunately I don't understand German or Russian, so I can't tell whether the inflections are correct for those languages. Wiktionary editors use bots a lot to make changes like these; I bet they already have bots that can remove duplicated words.
True, duplicates are maybe not the best example. Those only occur if you throw away the forms' tags. But in general the fixes I refer to are not mistakes in Wiktionary or wiktextract, because for example the stressed inflections for Russian are super useful in general. They are only problems if you try to directly use the kaikki data for lemmatization, or to look up a word from its inflected form as it would appear in an ordinary text. So it doesn't directly fit into the wiktextract core code.
Just to make sure I understand the problems you posted, I'll use this https://kaikki.org/dictionary/All%20languages%20combined/meaning/%D1%87/%D1%87%D0%B8/%D1%87%D0%B8%D1%82%D0%B0%D1%82%D1%8C.html page as an example:
"ru-noun-table", "hard-stem", "accent-c", "1a imperfective transitive" should be removed from forms. This should be fixed in the wiktextract code. But it won't affect Word Wise, because book texts usually don't contain these words.
You mean "бу́ду" and its forms should be removed? But having this word won't affect Word Wise.
Remove the stress mark so "чита́ть" becomes "читать", right? But book texts use "чита́ть", not "читать". And this won't be a problem for spaCy's
https://kaikki.org/dictionary/German/meaning/l/le/lesen.html In this case, ignore forms that start with "haben"? Again, having this word doesn't matter. The form after it can still be matched.
True, although it is theoretically conceivable that exceptions exist, but I can't name any right now.
In Russian, there are pair verbs that are written very similarly and differ in their "aspect" (a grammatical concept). These pair verbs are in the forms list. However, this causes, for example, the words for "try" and "torture" to exist as inflections of each other, which we don't really want. For German this causes "haben" to be an inflection of almost everything (because it is listed as an auxiliary verb in the inflections).
In books, stress marks are not written; they only appear in books for learners, and those are pretty hard to find. For Russian you have a point though, because in theory every learner should use my program to get them back 😁. For other Cyrillic-script languages with unpredictable stress, those stressed inflections will cause the words not to be found. The same goes for Latin.
This is actually a good point. I was also working a bit on dictionary creation, and came to the conclusion that ideally the form displayed to the user should be the inflection tagged as "canonical". This does have the drawback that some of these canonical forms can still be buggy and contain something like " f m". So for smaller languages where the data hasn't been looked over, this might lead to small mistakes, but in general it is a good idea.
I think in my last tests there were some German nouns that only had "den Zaubererern" as an inflection, but not "Zauberern" (don't nail me on this example, but it was something similar). As far as I understand, this would cause the word not to be found if the article isn't used in the text. I already wrote some untested code; I will link it here when it is OKish.
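The two German cleanups discussed above (dropping auxiliary-led forms like "haben gelesen", and also indexing article-less variants like "Zauberern" next to "den Zauberern") can be sketched as a simple filter. This is an illustrative heuristic under assumed word lists, not the actual cleaning code either project uses:

```python
# Hypothetical German form cleaner: skip forms led by an auxiliary verb
# (they would make "haben"/"sein" an inflection of almost every verb),
# and for "article + noun" forms also emit the bare noun.
GERMAN_ARTICLES = {"der", "die", "das", "den", "dem", "des"}
GERMAN_AUXILIARIES = {"haben", "sein"}

def clean_german_forms(forms):
    cleaned = set()
    for form in forms:
        words = form.split()
        if words and words[0] in GERMAN_AUXILIARIES:
            continue  # e.g. drop "haben gelesen"
        cleaned.add(form)
        if len(words) == 2 and words[0] in GERMAN_ARTICLES:
            cleaned.add(words[1])  # "den Zauberern" -> also keep "Zauberern"
    return cleaned

print(sorted(clean_german_forms(["den Zauberern", "haben gelesen", "liest"])))
# ['Zauberern', 'den Zauberern', 'liest']
```

Keeping both the article form and the bare form means the word is found whether or not the article appears next to it in the book text.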
I was completely wrong about the stress... I read the Russian example sentence on the English Wiktionary and thought normal Russian text also has those marks. I should have read the Stress Wikipedia page more carefully. So for Russian, Belarusian and Ukrainian, forms both with and without stress marks should be added to
Russian pair verbs seem complicated and require knowledge of the language; I'll temporarily ignore that. And for "haben" or "den" in German forms, I can probably get away with it if most words have the ideal inflection form without them. The unstressed forms will be addressed first; this seems more important than the other issues.
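Stripping the stress marks is mechanical, since in Russian, Belarusian and Ukrainian they are written as a combining acute accent (U+0301) on top of the vowel. A minimal sketch of generating both spellings (assumed helper names, not the actual Proficiency code):

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent used to mark East Slavic stress

def unstressed(form):
    """Remove combining stress marks: "чита́ть" -> "читать"."""
    decomposed = unicodedata.normalize("NFD", form)
    stripped = "".join(ch for ch in decomposed if ch != ACUTE)
    return unicodedata.normalize("NFC", stripped)

def with_both_forms(form):
    """Index both the stressed and the unstressed spelling of a form."""
    return {form, unstressed(form)}

print(unstressed("чита́ть"))             # читать
print(sorted(with_both_forms("чита́ть")))
```

Normalizing to NFD first matters: a form may arrive with the accent already combined into a precomposed character, and decomposition makes the accent a separate, removable code point in every case.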
I wrote a bit not very tested code here (API might change): https://github.com/Vuizur/wiktextract-lemmatization |
Cool! I could use your code in the Proficiency repo for Russian, Belarusian and Ukrainian languages. |
I included your code in the v0.5.4dev Proficiency pre-release; it's currently used by the WordDumb code in the master branch. I also use spaCy's
Nice, the word detection now works flawlessly for Russian. 👍
First of all, thank you for all your hard work on this extension! It is a really impressive and cool feature.
If I understand it correctly, you currently use spaCy for named entity recognition for X-Ray. I think it would also be cool if you could use spaCy's POS detection to get the correct translation of a word for Word Wise. This is especially useful for languages like Spanish, where verbs and nouns are often written the same. I tested it a bit, and spaCy is pretty good at telling them apart, so this could improve the accuracy of the translations quite a bit.