Problems with latin alphabet languages #8

djui · 2014-10-04T17:53:58Z

A term like "yellow flicker beat" is detected as German, with English (the correct answer) ranked quite far below.

Can you explain how this would work?

I would like to use franc in combination with a spell checker, first detecting the language and then looking up correct words with a spell checker using the identified language.
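
A minimal sketch of the detect-then-spell-check flow described above. Everything here is an assumption for illustration: `spellCheckDetected`, `detectAll`, `checkers`, and `minGap` are hypothetical names, the detector is assumed to return `[languageCode, score]` pairs sorted best first, and the stubs stand in for a real detector and real dictionaries.

```javascript
// Hypothetical glue code; none of these names are part of franc.
// `detectAll(text)` is assumed to return [[languageCode, score], ...], best first.
// `checkers` maps a language code to a function returning the misspelled words.
function spellCheckDetected(text, detectAll, checkers, minGap = 0.1) {
  const ranked = detectAll(text);
  if (ranked.length === 0) return { language: null, misspelled: [] };

  const [bestCode, bestScore] = ranked[0];
  const secondScore = ranked.length > 1 ? ranked[1][1] : 0;

  // On short input the top candidates can score almost the same, so
  // refuse to pick a dictionary when the lead over the runner-up is small.
  if (bestScore - secondScore < minGap || !checkers[bestCode]) {
    return { language: null, misspelled: [] };
  }
  return { language: bestCode, misspelled: checkers[bestCode](text) };
}

// Stub detector and stub checker, for illustration only.
const stubDetect = () => [['eng', 1], ['deu', 0.6]];
const stubCheckers = {
  eng: (t) => t.split(' ').filter((w) => w === 'flikcer'),
};

console.log(spellCheckDetected('yellow flikcer beat', stubDetect, stubCheckers));
```

Returning `language: null` when the lead is small lets the caller fall back to a default dictionary or ask the user, rather than spell-checking against the wrong language.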

djui · 2014-10-04T17:58:27Z

Ironically, short English terms seem to be the hardest to guess:

"elastinen vain elämää": fin (Finnish), no problem
"cae el sol airbag": cat (Catalan), no problem
"tennis court flume": quite far off

wooorm · 2014-10-04T18:03:16Z

Great question. The answer might not be what you’d like, however: because so many languages are supported, results for short passages are often way off.

Also, “tennis”, “flume”, and “court” are all words which originate from French!

As for why the other languages seem to work well on short passages: I’m not sure. It may be coincidence, or not. I’ll investigate 😄
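
To illustrate why short passages carry so little signal, here is a minimal sketch that assumes a character-trigram model. That is an assumption for illustration, since this thread does not say how franc scores text, and `trigrams` is a hypothetical helper, not franc's API.

```javascript
// Hypothetical helper: count overlapping 3-character sequences in a text.
function trigrams(text) {
  // Lowercase, collapse whitespace, and pad with one space on each side
  // so that word boundaries also produce trigrams.
  const padded = ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' ';
  const counts = new Map();
  for (let i = 0; i + 3 <= padded.length; i++) {
    const gram = padded.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

const shortSample = trigrams('tennis court flume');
console.log(shortSample.size); // 18 distinct trigrams
console.log([...shortSample.keys()].join('|'));
```

With only 18 trigrams to compare against the profiles of many Latin-script languages, a handful of shared sequences such as "cou" or "our" can be enough to tip the ranking toward another language.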

djui · 2014-10-04T18:08:11Z

Thanks. Anyhow, I think it's a great project! Keep up the good work. Maybe you can draw inspiration from language-detection, which seems to use a naive Bayesian filter. As far as I understand the source code, yours tries to detect which Unicode codepage the characters come from, on the basis that codepages correlate with languages (or are shared between them). Is that roughly correct? If so, can your algorithm handle decomposed Unicode characters (NFD), or "only" NFC?
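
For context on the NFC/NFD question, a minimal sketch using the standard `String.prototype.normalize` method. It does not show what franc does internally: `normalizeForDetection` is a hypothetical helper that a caller could apply to input before handing it to any detector.

```javascript
// Hypothetical helper, not part of franc: compose characters before detection.
function normalizeForDetection(text) {
  return text.normalize('NFC');
}

// "elämää" written two ways.
const composed = 'el\u00E4m\u00E4\u00E4';      // precomposed ä (U+00E4)
const decomposed = 'ela\u0308ma\u0308a\u0308'; // a + combining diaeresis (U+0308)

console.log(composed === decomposed);                        // false
console.log(composed.length, decomposed.length);             // 6 9
console.log(normalizeForDetection(decomposed) === composed); // true
```

The two strings render identically but differ in their code units, so any character-based statistics would see different sequences unless the input is normalized first.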

odalet · 2014-10-04T18:20:05Z

@wooorm To elaborate on these etymological issues, note that we have three different cases here:
Flume does not exist in modern French (I had never heard this word before, and it appears to be very old French).
Tennis comes from 'tenez' (hold) but is used in French with the English meanings (sport and shoes).
Court is also used in French but usually means 'short'. Btw, 'short' means short trousers in French.
I suppose the fact that English is leaking into every other language does not help. This is especially true of French, which had previously influenced English...

wooorm · 2014-10-04T18:20:59Z

@djui Yeah, I’ve seen it. It’s interesting, but it also states, “the more languages, the more difficult”. Which holds true for 49, 168, and 300+ languages.

See unicode-7.0.0 for more information on the scripts used.

wooorm · 2014-10-04T18:23:00Z

@odalet Thanks for the extra information. Yeah, although they are not literally French words, my limited understanding of the language made me sympathise with franc detecting “tennis court flume” as French 😛

odalet · 2014-10-04T18:43:46Z

Any reason why your lib seems to be named after the barbarian people who gave their name to my country? ;)
Anyway, very interesting project. Mixing languages and computing; I love this. Keeping an eye on it!

wooorm · 2014-10-04T18:53:00Z

@odalet Hahaha, I wanted a short name, was thinking about “lingua franca”, and came up with “franc”. Which is short, human-like, and awesome. Only disadvantage is that it’s hard to Google: you have to add “language” or my name!

Thanks for the kind words. It’s really interesting, and I’m looking forward to seeing where it’s all heading!

wooorm closed this as completed Oct 4, 2014

wooorm mentioned this issue Mar 19, 2015

Inaccurate detection examples #16

Closed

wooorm mentioned this issue Jul 24, 2015

Accuracy #23

Closed

wooorm mentioned this issue Nov 30, 2015

Wrong language detection even for simple texts #26

Closed

wooorm mentioned this issue Dec 25, 2015

Russian is detected incorrectly. #27

Closed

wooorm mentioned this issue Feb 22, 2016

I got NaN when runinng franc.all #29

Closed

wooorm mentioned this issue Oct 18, 2016

Issue in detecting English #38

Closed
