Problems with latin alphabet languages #8
Ironically, short English terms seem to be the hardest to guess:
Great question. The answer might, however, not be what you’d like. With so many supported languages, shorter passages are often way off. Also, “tennis”, “flume”, and “court” are all words which originate from French! As for the other languages seeming to work well on short passages: I’m not sure, it may be coincidence. Or not. I’ll investigate 😄
Thanks. Anyhow, I think it's a great project! Keep up the good work. Maybe you can draw inspiration from language-detection, which seems to use a naive Bayesian filter. As far as I understand the source code, yours tries to detect which Unicode code page the characters come from, and code pages should correlate with languages (or are shared between them). Is that roughly correct? If so, can your algorithm handle decomposed Unicode characters (NFD), or "only" NFC?
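The thread does not answer whether franc handles decomposed input, so what follows is only a sketch of what the NFC/NFD difference looks like and of a caller-side workaround: normalizing to NFC before handing text to any detector. It uses Python's standard `unicodedata` module purely for illustration; `to_nfc` is a made-up helper name, not part of franc.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


composed = "caf\u00e9"      # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form)

# The two strings render identically but differ code point by code point,
# so character- or n-gram-based statistics would see different input.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5
print(to_nfc(decomposed) == composed)  # True
```

Normalizing once at the boundary keeps the detector itself unaware of the two forms, at the cost of one extra pass over the text.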
@Worm. To elaborate on these etymological issues, you'll note that we have here 3 different cases:
@djui Yeah, I’ve seen it. It’s interesting, but it also states, “the more languages, the more difficult”. Which holds true for 49, 168, and 300+ languages. See unicode-7.0.0 for more information on the used scripts.
@odalet Thanks for the extra information. Yeah, although they are not literally French words, my limited understanding of the language made me sympathise with franc detecting “tennis court flume” as French 😛
Any reason why your lib seems to be named after the barbarian people who gave their name to my country? ;)
@odalet Hahaha, I wanted a short name, was thinking about “lingua franca”, and came up with “franc”. Which is short, human-like, and awesome. The only disadvantage is that it’s hard to Google: you have to add “language” or my name! Thanks for the kind words. It’s really interesting, and I’m looking forward to seeing where it’s all heading!
A term like “yellow flicker beat” suggests German, with English (the correct answer) quite far below. Can you explain how this would work?
I would like to use franc in combination with a spell checker, first detecting the language and then looking up correct words with a spell checker using the identified language.
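A minimal sketch of that pipeline's first step, assuming a detector that returns ranked (language, score) guesses with the best guess first. Everything here is hypothetical: `pick_language`, the threshold values, and the sample scores are invented for illustration and are not franc's API or its actual output. The idea is to trust the top guess only when the text is long enough and the lead over the runner-up is clear, and otherwise to fall back to a default dictionary.

```python
from typing import List, Tuple


def pick_language(
    ranked: List[Tuple[str, float]],
    text: str,
    default: str = "eng",
    min_length: int = 20,
    min_margin: float = 0.1,
) -> str:
    """Return the top-ranked language only if it looks trustworthy, else `default`.

    `ranked` holds (language code, score) pairs, best first, as produced by
    some hypothetical detector. The thresholds are arbitrary example values.
    """
    # Very short input, or no guesses at all: do not trust the detector.
    if not ranked or len(text) < min_length:
        return default
    if len(ranked) == 1:
        return ranked[0][0]
    (best_lang, best_score), (_, second_score) = ranked[0], ranked[1]
    # A narrow lead over the runner-up means the guess is ambiguous.
    if best_score - second_score < min_margin:
        return default
    return best_lang


# Made-up scores, only to exercise the function.
guesses = [("deu", 1.0), ("eng", 0.8)]
print(pick_language(guesses, "yellow flicker beat"))  # eng (19 chars is below min_length)
```

The language code returned here would then select which spell-checking dictionary to load; for three-word inputs like the one above, letting the user override the choice is probably safer than any threshold.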