Problems with latin alphabet languages #8

Closed
djui opened this issue Oct 4, 2014 · 8 comments

Comments


djui commented Oct 4, 2014

A term like "yellow flicker beat" is detected as German, with English (the correct answer) ranked quite far below.

Can you explain how this would work?

I would like to use franc in combination with a spell checker, first detecting the language and then looking up correct words with a spell checker using the identified language.
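A minimal sketch of that detect-then-spell-check flow, written in Python for illustration only. The names `DICTIONARIES` and `pick_language`, the tiny word lists, and the `min_margin` threshold are all hypothetical; the detector output is assumed to be a best-first list of (language code, score) pairs with scores between 0 and 1. When the top two guesses are close, as tends to happen with short input, the sketch falls back to counting how many words each candidate dictionary recognises.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical word lists standing in for real spell-checker dictionaries.
DICTIONARIES: Dict[str, Set[str]] = {
    "eng": {"yellow", "flicker", "beat", "tennis", "court"},
    "deu": {"gelb", "flackern", "schlag"},
}


def pick_language(
    ranked: List[Tuple[str, float]],
    words: List[str],
    min_margin: float = 0.1,
) -> str:
    """Choose a language from detector output (assumed sorted best-first).

    If the best guess leads the runner-up by at least `min_margin`, trust it.
    Otherwise pick the candidate whose dictionary knows the most input words.
    """
    if not ranked:
        raise ValueError("detector returned no candidates")
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= min_margin:
        return ranked[0][0]

    def coverage(lang: str) -> int:
        known = DICTIONARIES.get(lang, set())
        return sum(word.lower() in known for word in words)

    # max() keeps the first maximum, so the detector's order breaks ties.
    return max((lang for lang, _ in ranked), key=coverage)
```

With a close call such as `[("deu", 1.0), ("eng", 0.97)]` for the words `yellow`, `flicker`, `beat`, the fallback selects `eng`, because only the English word list recognises them.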

Author

djui commented Oct 4, 2014

Ironically, short English terms seem to be the hardest to guess:

  • elastinen vain elämää: fin(nish), no problem
  • cae el sol airbag: cat(alan), no problem
  • tennis court flume: quite off

Owner

wooorm commented Oct 4, 2014

Great question. The answer might, however, not be what you’d like. Because of the large number of supported languages, short passages are often way off.

Also, “tennis”, “flume”, “court”, are all words which originate from French!

As for why the other languages seem to work well on short passages: I’m not sure, it may be coincidence. Or not. I’ll investigate 😄
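To make the short-passage problem concrete, here is a toy Python sketch of rank-order trigram matching, a classic technique for language detection. It is an assumption-laden illustration, not franc’s actual code: the function names, the profile size, and the miss penalty are all hypothetical. The point is that “tennis court flume” yields only 18 trigrams, so a handful of coincidental matches against any one of many language models can decide the ranking.

```python
from collections import Counter
from typing import Dict, List, Tuple


def trigrams(text: str) -> List[str]:
    """Return overlapping 3-character windows of the lowercased, padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def profile(text: str, size: int = 300) -> Dict[str, int]:
    """Map the `size` most frequent trigrams to their rank (0 = most frequent)."""
    counts = Counter(trigrams(text))
    # Sort by descending count, then alphabetically so ranks are deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))[:size]
    return {t: rank for rank, t in enumerate(ordered)}


def distance(sample: Dict[str, int], model: Dict[str, int], penalty: int = 300) -> int:
    """Sum rank differences; trigrams missing from the model cost `penalty`."""
    return sum(
        abs(rank - model[t]) if t in model else penalty
        for t, rank in sample.items()
    )


def rank_languages(text: str, models: Dict[str, Dict[str, int]]) -> List[Tuple[str, int]]:
    """Return (language, distance) pairs, closest model first."""
    sample = profile(text)
    scored = [(lang, distance(sample, model)) for lang, model in models.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

In practice the models would be built from large corpora per language; every additional model is one more chance for a short sample to land closer to the wrong language.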

Author

djui commented Oct 4, 2014

Thanks. Anyhow, I think it's a great project! Keep up the good work. Maybe you can draw inspiration from language-detection, which seems to use a naive Bayesian filter. As far as I understand the source code, yours tries to detect which Unicode code page the characters come from, and code pages should correlate with language (or are shared). Is that roughly correct? If so, can your algorithm handle decomposed Unicode characters (NFD) or "only" NFC?
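On the NFD/NFC point, a small Python sketch of why the normalization form matters for any character-based detector. Whether franc normalizes its input is not stated in this thread; `normalize_for_detection` is a hypothetical helper that shows one way a caller could normalize to NFC before detection, using the standard `unicodedata` module.

```python
import unicodedata


def normalize_for_detection(text: str) -> str:
    """Compose base letters and combining marks into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)


# "elämää" written two ways: precomposed ä (U+00E4) versus a + U+0308.
composed = "el\u00e4m\u00e4\u00e4"
decomposed = "ela\u0308ma\u0308a\u0308"

# The strings render identically but differ in code points and length,
# so they would produce different character n-grams if left as-is.
print(composed == decomposed)                           # False
print(len(composed), len(decomposed))                   # 6 9
print(normalize_for_detection(decomposed) == composed)  # True
```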


odalet commented Oct 4, 2014

@wooorm To elaborate on these etymological issues, you'll note that we have three different cases here:

  • "Flume" does not exist in modern French (I had never heard this word before, and it appears to be very old French).
  • "Tennis" comes from "tenez" (hold) but is used in French with the English meanings (the sport and the shoes).
  • "Court" is also used in French but usually means "short". Btw, "short" means short trousers in French.

I suppose the fact that English is leaking into every other language does not help. This is especially true of French, since it had previously influenced English...

Owner

wooorm commented Oct 4, 2014

@djui Yeah, I’ve seen it. It’s interesting, but it also states “the more languages, the more difficult”, which holds true for 49, 168, and 300+ languages.

See unicode-7.0.0 for more information on the scripts used.

Owner

wooorm commented Oct 4, 2014

@odalet Thanks for the additional information. Yeah, although they are not literally French words, my limited understanding of the language made me sympathise with franc detecting “tennis court flume” as French 😛


odalet commented Oct 4, 2014

Any reason why your lib seems to be named after the barbarian people who gave their name to my country? ;)
Anyway, very interesting project. Mixing languages and computing; I love this. Keeping an eye on it!

Owner

wooorm commented Oct 4, 2014

@odalet Hahaha, I wanted a short name, was thinking about “lingua franca”, and came up with “franc”. Which is short, human-like, and awesome. Only disadvantage is that it’s hard to Google: you have to add “language” or my name!

Thanks for the kind words. It’s really interesting, and I’m looking forward to seeing where it’s all heading!
