Underline known words "migaku-style" using anki as a database #169

Closed
AxillV opened this issue Jul 11, 2023 · 2 comments
Labels
enhancement (New feature or request), wontfix (This will not be worked on)

Comments


AxillV commented Jul 11, 2023

Would it be possible to highlight words that you already "know" (that is, words that exist in the sort field of an Anki card)? The program already recognizes whether a word exists when you open the dictionary, but I suspect the hard part would be deciding which letter/mora to "hover over" in order to check. Maybe it could check all the different combinations and pick the longest result (in letters/morae), with the user able to correct the selection?
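The "longest result" idea could look roughly like the sketch below. It is only an illustration, not anything Memento does: `known` is a hypothetical in-memory set standing in for the Anki sort fields, `max_len` is an arbitrary cap, and the matching works on characters rather than morae and ignores conjugated forms.

```python
from typing import List, Optional, Set, Tuple


def longest_match_at(text: str, pos: int, known: Set[str], max_len: int = 10) -> Optional[int]:
    """Return the end index of the longest known word starting at pos, or None."""
    # Try the longest candidate first and shrink until something matches.
    for end in range(min(len(text), pos + max_len), pos, -1):
        if text[pos:end] in known:
            return end
    return None


def underline_spans(text: str, known: Set[str]) -> List[Tuple[int, int]]:
    """Greedily collect non-overlapping [start, end) spans of known words."""
    spans = []
    pos = 0
    while pos < len(text):
        end = longest_match_at(text, pos, known)
        if end is None:
            pos += 1  # nothing known starts here, move on one character
        else:
            spans.append((pos, end))
            pos = end  # continue after the matched word
    return spans


# Hypothetical "known" words; a real version would read them from Anki.
known = {"猫", "食べ", "食べる"}
spans = underline_spans("猫が食べる", known)
```

Here the longer 食べる is preferred over 食べ because candidates are tried from longest to shortest.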

I don't know how useful this would be, but as someone at the early stages of language learning, being able to recognize i+1 sentences at a glance could make the process easier.

Owner

ripose-jp commented Jul 12, 2023

The major problem to solve is subtitle tokenization.

This can be done quickly and easily with MeCab. The issue with relying only on MeCab's results is that it tokenizes based only on the data in ipadic, which isn't necessarily going to line up with what is actually available in a user's dictionaries. For example, JMdict contains a lot of entries for phrases that MeCab likely won't consider a single token.

The alternative to MeCab would be writing a tokenizer that's aware of the user's dictionaries. A simple algorithm would be: for each character in the subtitle, create a token for every possible substring starting from that character, then highlight all the matches. This is O(n^2) in searches alone, which is expensive since each search goes out to disk and to Anki in order to get a result. If subtitles are on the screen for only a second or two, there's no guarantee that you even get a result back in time unless you're preloading results.
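A minimal sketch of that simple algorithm, to make the search count concrete. The `is_known` callback is a hypothetical stand-in for the dictionary and Anki lookups; here it is backed by a plain in-memory set, whereas the real lookups would hit disk and Anki.

```python
from typing import Callable, List, Tuple


def find_matches(text: str, is_known: Callable[[str], bool]) -> Tuple[List[Tuple[int, int]], int]:
    """Check every substring of text; return matching [start, end) spans and the lookup count."""
    spans = []
    lookups = 0
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            lookups += 1  # one search per candidate substring
            if is_known(text[start:end]):
                spans.append((start, end))
    return spans, lookups


# Hypothetical known-word set standing in for the real lookups.
known = {"猫", "魚", "食べる"}
spans, lookups = find_matches("猫が魚を食べる", known.__contains__)
```

For this 7-character subtitle the loop performs 7 * 8 / 2 = 28 lookups, and the count grows quadratically with subtitle length, which is where the cost comes from when each lookup is a round trip.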

The other question I have is: what is the utility of all this? If you search a word, it's likely because you didn't know it or didn't remember it. Knowing you have a card for the term before you even search doesn't really move the needle in my opinion, since Memento is not an SRS program.

Sorry for the half-posted comment originally. I accidentally pressed Ctrl+Enter, which GitHub takes as "publish my in-progress comment".

AxillV closed this as completed Jul 12, 2023
Author

AxillV commented Jul 12, 2023

I see, thank you for the very thoughtful answer. Sounds like too much work without a whole lot of reward. I'm still at the start of my language learning journey so indeed, the utility might be a lot lower than what I expected.

(Sorry for the (re)opening spam.)

AxillV reopened this Jul 12, 2023
AxillV closed this as completed Jul 12, 2023
ripose-jp added the enhancement and wontfix labels Jul 13, 2023