
Improved accuracy for small documents #100

Open
fabiospampinato opened this issue Jun 28, 2021 · 19 comments
@fabiospampinato

I'd like to play with patching franc, or making some alternative to it, that can detect the language of small documents much more accurately.

First of all, is this something that could be interesting to merge into franc itself?

Secondly, I'm almost clueless about language classification; would trying the following things make sense?

  1. Storing more than 300 trigrams, maybe 400 or so.
  2. Using quadgrams or bigrams rather than trigrams.
  3. Extracting the trigrams from a longer and more diverse document than the UDHR.

From a shallow reading of this paper on n-grams, it sounds like n-grams may be fundamentally ill-suited for short documents: there just isn't enough data in them to reliably reconstruct the top 300 (or however many) n-grams, maybe 🤔.
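For context, the classic out-of-place n-gram method (roughly the approach franc and the paper above describe) can be sketched like this; the function names and penalty scheme here are illustrative, not franc's actual API:

```javascript
// Extract trigrams from text and rank them by frequency (most frequent first).
function rankedTrigrams(text) {
  const counts = new Map();
  const padded = ` ${text.toLowerCase().replace(/[^a-z\s]/g, '')} `;
  for (let i = 0; i < padded.length - 2; i++) {
    const gram = padded.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([gram]) => gram);
}

// "Out-of-place" distance: sum of rank differences against a language profile;
// trigrams missing from the profile get a maximum penalty.
function distance(docGrams, profile) {
  const rank = new Map(profile.map((gram, i) => [gram, i]));
  const MAX_PENALTY = profile.length;
  return docGrams.reduce((sum, gram, i) => {
    return sum + (rank.has(gram) ? Math.abs(rank.get(gram) - i) : MAX_PENALTY);
  }, 0);
}
```

The shortcoming described above is visible here: a ~300-character document only yields a few dozen distinct trigrams, so the document-side ranking is noisy and the distance becomes unreliable.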

CLD3 seems to feed unigrams, bigrams, and trigrams to some neural network, and that somehow works much better for smaller texts. I'm not sure how or why, but maybe that's the way to go.

Any other ideas that I should try?

@wooorm
Owner

wooorm commented Jun 28, 2021

I’ve been meaning to backport some of the nice additions from: https://github.com/greyblake/whatlang-rs#how-is_reliable-calculated.

I believe you’re right that n-grams are less suited for small input. However, I think an n-gram based approach has the nice benefit of supporting many more languages, including some that are in danger of going extinct.
For that we’re tied to a smallish dataset (UDHR — the Bible has more translations but is often copyrighted).

A different idea to investigate is: #83

@fabiospampinato
Author

A different idea to investigate is: #83

That sounds potentially like a good hint, although if a letter is not present in the known trigrams/alphabet at all, the trigrams from the document containing it will already be ranked pretty poorly. It might be worth ranking them worse than other non-matching trigrams that at least use the right letters, though 🤔 I'm not sure how much of a difference that would make in practice.

I wonder if one could "just" somehow feed all known words of all known languages, straight from the dictionaries used for spell checking, to some neural network. Then one could classify each word in the document and average out the output probabilities for each language 🤔 The problem is I have no clue whether that would work at all (that way one wouldn't take advantage of the fact that words like "the" are used all the time in English, and it may introduce some bias toward languages with gazillions of words), and I know nothing about machine learning.
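The dictionary idea above, minus the neural network, could be sketched as plain word-set voting; everything here (including the `dictionaries` shape) is hypothetical, just to make the idea concrete:

```javascript
// Hypothetical sketch: score each language by the fraction of the document's
// words found in that language's spell-checking dictionary.
// `dictionaries` maps a language code to a Set of known words.
function classifyByWords(text, dictionaries) {
  // Crude tokenizer; good enough for a sketch.
  const words = text.toLowerCase().match(/[a-zà-ÿ]+/g) || [];
  const scores = {};
  for (const [lang, dict] of Object.entries(dictionaries)) {
    const hits = words.filter((word) => dict.has(word)).length;
    scores[lang] = words.length ? hits / words.length : 0;
  }
  return scores;
}
```

Note that this treats "the" and a rare word as equally informative, and a language with a huge dictionary will match more words by chance, which is exactly the bias mentioned above; weighting hits by word frequency would be one way to address both.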

It's interesting how CLD3 is able to segment a document into sections where different languages are used, though I haven't checked how well that works. No clue how that works either.

@wooorm
Owner

wooorm commented Jul 2, 2021

I'm not sure how much of a difference that will make in practice though.

I’m probably remembering the details wrong, but I think this is how Google is very fast, on tiny input, at detecting, for example, the “Turkish I” (or so).
Here’s the data btw, from that other thread: #83 (comment).
I’m guessing that the shorter the input text is, the more this letter-based scoring could be weighted, and the longer it is, the more the trigrams could be weighted; weighing the two scores together could perhaps improve language detection.
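That weighting idea could look something like this; the crossover length of 100 characters is an arbitrary illustrative value, not a measured one:

```javascript
// Blend a letter-based score and a trigram-based score, trusting letters more
// on short input and trigrams more on long input. Both scores are assumed to
// be normalized to [0, 1].
function blendedScore(letterScore, trigramScore, inputLength) {
  const CROSSOVER = 100; // illustrative: length at which trigrams fully dominate
  const trigramWeight = Math.min(inputLength / CROSSOVER, 1);
  return (1 - trigramWeight) * letterScore + trigramWeight * trigramScore;
}
```

A smooth ramp like this avoids a hard cutoff where detection behavior suddenly changes at one specific input length.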

(I think that way one wouldn't take advantage of the fact that words like "the" are used all the time in English, and it may introduce some bias toward languages with gazillions of words)

Yep, indeed. That’s what I’m afraid of too.

I haven't checked how well that works though. No clue about how that works either.

I’d guess that they don’t just do paragraphs, as it’s possible to have one phrase of French words in an otherwise English sentence. Then I’d score each phrase. Then there should be some bias toward the repeatedly occurring languages: if a document is mostly French and English, it’s unlikely that one sentence is Scottish instead of English.
That’s how I’d think of it 🤔

@alanalvarado

This would be a great addition. I'm doing an MBA and I use Obsidian to store my notes in English, Spanish, and French, but I only extract the lang from HTML tags, and most of the time they are incorrect. So I would like to send a small sample of the document, around 140 characters (the summary), and get the correct language.

@porkopek

porkopek commented May 14, 2022

In my opinion, the use of translated texts is what diminishes the accuracy of language detection. You don't need texts to be translations of another original text, because translations introduce words that a text originally composed in that language might not use; a certain bias comes with the use of translations.
I think the program would be greatly improved, for languages that are frequently confused (such as Portuguese and Spanish, or even Catalan, which is rarely used on the internet), by a selection of texts originally written in the mother tongue.
So

† - Based on the UDHR, the most translated document in the world.

is the root of the problem

@wooorm
Owner

wooorm commented May 15, 2022

Yes. But it is also exactly what makes franc, and its support for more languages than anything else, possible. See #100 (comment) and some of the linked discussions :)

@titanism

Would love this!

@fabiospampinato
Author

fabiospampinato commented Jun 21, 2022

I might give the following approach a shot in my spare time:

  1. Use the UDHRs as the dataset.
  2. Extract various sections out of each document, things like sentences, paragraphs and maybe whole pages too.
  3. Extract unigrams, digrams and trigrams out of each of those sections.
  4. Encode those as the input layer of a small-ish neural network.
  5. Ask the network to detect the language for each section.
  6. Tweak the weights.
  7. Repeat.
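Steps 3 and 4 above could be encoded roughly like this (a hypothetical sketch of turning a section into a fixed-size input vector, not an actual implementation):

```javascript
// Turn a text section into a fixed-size input vector for the network:
// one slot per known n-gram, holding that n-gram's relative frequency.
// `vocabulary` is an array of the unigrams/bigrams/trigrams kept as features.
function encode(text, vocabulary) {
  const index = new Map(vocabulary.map((gram, i) => [gram, i]));
  const vector = new Float32Array(vocabulary.length);
  let total = 0;
  for (const n of [1, 2, 3]) {
    for (let i = 0; i <= text.length - n; i++) {
      const slot = index.get(text.slice(i, i + n));
      if (slot !== undefined) {
        vector[slot] += 1;
        total += 1;
      }
    }
  }
  // Normalize so the vector is comparable across sections of different length.
  if (total > 0) for (let i = 0; i < vector.length; i++) vector[i] /= total;
  return vector;
}
```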

The idea being that the result might be more accurate because:

  1. A much shorter text is necessary to faithfully reconstruct unigram and digram frequencies.
  2. Both unigrams, digrams and trigrams would be provided as input to the program, so more information would be available to the network for finding patterns, I guess.
  3. There are going to be a bunch of hidden layers that should be able to detect patterns that a human programmer might not write the logic for in an imperatively-written language detector.

Also, the resulting program may be easier to improve: the network could be tweaked, and the underlying dataset could be swapped with a bigger one (though I couldn't find a reasonably sized one, and certainly not one supporting as many languages as the UDHR is translated into).

I'll report back any findings 👍


Edit: this may be a good dataset: https://paperswithcode.com/dataset/wili-2018

@wooorm
Owner

wooorm commented Jun 22, 2022

I'm very excited about your findings! See wooorm/udhr and wooorm/trigrams for inspiration on 1 through 3.

@fabiospampinato
Author

fabiospampinato commented Jul 1, 2022

The ML approach may be usable. Some findings:

  • brain.js can export the network as a single JS function. That's amazing because it can be made super small and would work everywhere, and a lot of the weights can probably be trimmed significantly (e.g. the 10th decimal digit shouldn't have much of an effect on the result); the result should be much smaller than the python -> wasm -> base64 -> js approach.
  • I tried to make a little dataset out of the UDHRs, but there might just not be enough data in there to train the network properly (or to test it, for that matter), and the structure may be too rigid, so the network might pick up on things like "if the string starts with "evr" it's probably English".
  • Beyond a handful of languages, the time it takes to train the network is pretty significant, and for whatever reason brain.js runs faster on the CPU than the GPU for me, which doesn't sound right.
  • There's a lot I still have to explore in this space: different network parameters, different ways to process the input, different ways to encode the input, expanding the dataset to perhaps include the more representative WiLI-2018 dataset, etc.

For now I've put the project on hold, but I'll get back to it in the future; it seems promising, with the right dataset.

@fabiospampinato
Author

Alright, I've spent some more time on this. I got something, but it doesn't seem super interesting.

  • The code is here.
  • Potentially it can be trained on the UDHRs, if somebody spends some time converting them into a single CSV with this format [pointless number]\t[language code]\t[sentence].
  • It supports 50 languages, I think most of the ones supported by franc-min, though many lesser-known ones that franc-min supports are missing.
  • It weighs about 290kb min+gzip, which is a bit of a bummer; I was hoping for something more compact.
  • There are probably better architectures and optimizations that I haven't thought about; if I find a much smaller way to do this I'll post an update.

Comparing it against CLD3, franc, franc-all and franc-min, I got the following accuracies by language:

- afr
  - cld3: 0
  - franc: 0.6203170028818443
  - francAll: 0.5770893371757925
  - francMin: 0
  - lande: 0
- ara
  - cld3: 0
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.9746333333333334
- aze
  - cld3: 0
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.004921618665694495
- bel
  - cld3: 0
  - franc: 0.7479899101371591
  - francAll: 0.7040832413684377
  - francMin: 0.7818855431183982
  - lande: 0.8424247201639603
- ben
  - cld3: 0
  - franc: 0.9493717664449371
  - francAll: 0.9493717664449371
  - francMin: 0.9493717664449371
  - lande: 0.9358832224685883
- bul
  - cld3: 0
  - franc: 0.5330446396050023
  - francAll: 0.5196891820794043
  - francMin: 0.6722651665385082
  - lande: 0.48059411550447206
- cat
  - cld3: 0
  - franc: 0.5487940630797774
  - francAll: 0.3800865800865801
  - francMin: 0
  - lande: 0.2902906617192331
- ces
  - cld3: 0
  - franc: 0.32463333333333333
  - francAll: 0.26616666666666666
  - francMin: 0.44126666666666664
  - lande: 0.732
- ckb
  - cld3: 0
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.9240833333333334
- cmn
  - cld3: 0
  - franc: 0.5978666666666667
  - francAll: 0.5978666666666667
  - francMin: 0.5978666666666667
  - lande: 0.9497
- dan
  - cld3: 0
  - franc: 0.4583
  - francAll: 0.4345333333333333
  - francMin: 0
  - lande: 0.7901
- deu
  - cld3: 0
  - franc: 0.8122333333333334
  - francAll: 0.7596
  - francMin: 0.9238666666666666
  - lande: 0.9237666666666666
- ell
  - cld3: 0
  - franc: 0.9859666666666667
  - francAll: 0.9859666666666667
  - francMin: 0.9859666666666667
  - lande: 0.991
- eng
  - cld3: 0
  - franc: 0.5408333333333334
  - francAll: 0.4644666666666667
  - francMin: 0.7935
  - lande: 0.9475333333333333
- est
  - cld3: 0
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0
- eus
  - cld3: 0
  - franc: 0
  - francAll: 0.6401423257318454
  - francMin: 0
  - lande: 0.5725376031052887
- fin
  - cld3: 0
  - franc: 0.7455666666666667
  - francAll: 0.46116666666666667
  - francMin: 0
  - lande: 0.9574333333333334
- fra
  - cld3: 0
  - franc: 0.8001333333333334
  - francAll: 0.6492
  - francMin: 0.8789
  - lande: 0.9019333333333334
- hau
  - cld3: 0
  - franc: 0.8678492849284929
  - francAll: 0.8256325632563256
  - francMin: 0.9248716538320498
  - lande: 0.9421983865053172
- heb
  - cld3: 0
  - franc: 0.9465333333333333
  - francAll: 0.9465333333333333
  - francMin: 0
  - lande: 0.9901666666666666
- hin
  - cld3: 0
  - franc: 0.5303505843071786
  - francAll: 0.5278130217028381
  - francMin: 0.5303505843071786
  - lande: 0.7825041736227045
- hrv
  - cld3: 0
  - franc: 0.2578211262421789
  - francAll: 0.18862716231137283
  - francMin: 0.32664703717335297
  - lande: 0
- hun
  - cld3: 0
  - franc: 0.7023
  - francAll: 0.6457666666666667
  - francMin: 0.7784
  - lande: 0.9340333333333334
- hye
  - cld3: 0
  - franc: 0.9631589261218891
  - francAll: 0.9631589261218891
  - francMin: 0
  - lande: 0.9102488732118362
- ind
  - cld3: 0
  - franc: 0.3552997221119492
  - francAll: 0.3319344411047468
  - francMin: 0.39426076107298813
  - lande: 0.8465944535813531
- isl
  - cld3: 0
  - franc: 0
  - francAll: 0.7015968063872255
  - francMin: 0
  - lande: 0.8826193766313527
- ita
  - cld3: 0
  - franc: 0.5973333333333334
  - francAll: 0.3994666666666667
  - francMin: 0.7073333333333334
  - lande: 0.883
- jpn
  - cld3: 0
  - franc: 0.9564666666666667
  - francAll: 0.9564666666666667
  - francMin: 0.9564666666666667
  - lande: 0.955
- kat
  - cld3: 0
  - franc: 0.9072665479851109
  - francAll: 0.9072665479851109
  - francMin: 0
  - lande: 0.9705453956950963
- kaz
  - cld3: 0
  - franc: 0.8507681053401609
  - francAll: 0.7539624481833699
  - francMin: 0.9290416971470373
  - lande: 0
- kor
  - cld3: 0
  - franc: 0.8159621948017852
  - francAll: 0.8159621948017852
  - francMin: 0.8159621948017852
  - lande: 0.9372538724074561
- lit
  - cld3: 0
  - franc: 0.5288
  - francAll: 0.42156666666666665
  - francMin: 0
  - lande: 0.9143666666666667
- mar
  - cld3: 0
  - franc: 0.7049666666666666
  - francAll: 0.6956
  - francMin: 0.7049666666666666
  - lande: 0.9022333333333333
- mkd
  - cld3: 0
  - franc: 0.5904666666666667
  - francAll: 0.5827333333333333
  - francMin: 0
  - lande: 0.7509333333333333
- nld
  - cld3: 0
  - franc: 0.5688666666666666
  - francAll: 0.5407333333333333
  - francMin: 0.8422
  - lande: 0.9222
- nob
  - cld3: 0
  - franc: 0.25186988009022915
  - francAll: 0.23881040009497803
  - francMin: 0
  - lande: 0.32286596224623054
- pes
  - cld3: 0
  - franc: 0.45455898771864534
  - francAll: 0.45199106810569406
  - francMin: 0.9215481950130257
  - lande: 0.955563825828061
- pol
  - cld3: 0
  - franc: 0.6814333333333333
  - francAll: 0.6161333333333333
  - francMin: 0.7497333333333334
  - lande: 0.9480666666666666
- por
  - cld3: 0
  - franc: 0.5497333333333333
  - francAll: 0.44143333333333334
  - francMin: 0.7293666666666667
  - lande: 0.8747666666666667
- ron
  - cld3: 0
  - franc: 0.5804217174289498
  - francAll: 0.4924111235611694
  - francMin: 0.6998404128892058
  - lande: 0.8934501375165529
- run
  - cld3: 0
  - franc: 0.33597285067873306
  - francAll: 0.290158371040724
  - francMin: 0.4326923076923077
  - lande: 0.016025641025641024
- rus
  - cld3: 0
  - franc: 0.45686666666666664
  - francAll: 0.4451
  - francMin: 0.48556666666666665
  - lande: 0.8157
- slk
  - cld3: 0
  - franc: 0.2931611117518164
  - francAll: 0.24414715719063546
  - francMin: 0
  - lande: 0.5831507323261447
- spa
  - cld3: 0
  - franc: 0.49323333333333336
  - francAll: 0.26563333333333333
  - francMin: 0.6714333333333333
  - lande: 0.8227
- srp
  - cld3: 0
  - franc: 0.25716666666666665
  - francAll: 0.1847
  - francMin: 0.3091333333333333
  - lande: 0.6424
- swe
  - cld3: 0
  - franc: 0.4616
  - francAll: 0.42083333333333334
  - francMin: 0.6951333333333334
  - lande: 0.8498666666666667
- tgl
  - cld3: 0
  - franc: 0.40868407032498666
  - francAll: 0.3979754928076718
  - francMin: 0.630207778369739
  - lande: 0.9145444858817262
- tur
  - cld3: 0
  - franc: 0.4473
  - francAll: 0.26206666666666667
  - francMin: 0.5488
  - lande: 0.9459333333333333
- ukr
  - cld3: 0
  - franc: 0.5976333333333333
  - francAll: 0.5812666666666667
  - francMin: 0.6293333333333333
  - lande: 0.7911666666666667
- vie
  - cld3: 0
  - franc: 0.7861802255148634
  - francAll: 0.6792470412822663
  - francMin: 0.8431180691454664
  - lande: 0.9648681390364365

Looking at it, it seems to have ignored some smaller languages; I guess it was more useful for the model to focus on the languages that have more sentences. That's something I should try to address somehow 🤔

In summary, I made some progress. It would be interesting to try this approach using the UDHRs as the dataset, but so far it doesn't seem to have turned out particularly well or to be particularly promising.

@wooorm
Owner

wooorm commented Dec 16, 2022

interesting stuff!

@fabiospampinato
Author

Some more findings, in case they could be useful:

  • I've fixed the issue of some languages being ignored by the model by ensuring that approximately the same number of training samples is provided for each language.
  • I've shaved ~30kb from the min+gzip size by reordering how some bytes are stored.
  • I've improved accuracy a bit by adding more neurons to the input layer (more n-grams that can be looked at) while removing some from the hidden layer (fewer features being learned); the tradeoff seemed worth it. Accuracy is now better than franc's almost across the board (not particularly surprising given the bigger file size and the lower number of languages supported).
  • I've added a benchmark script; this approach seems some 30% slower than franc-all, but 20x faster than cld3-asm.
  • I tried getting to the same accuracy just by looking at bigrams, but that didn't work super well; I guess some languages are easier to identify with unigrams or trigrams, and presumably looking at quadgrams would also improve accuracy.
  • If I can somehow store just 1 byte per weight, the size of the library could potentially be cut in half, but I'm not quite sure how to do that just yet; maybe accuracy would suffer too much.
  • Getting below 100kb min+gzip seems kind of impossible with this approach, unless the number of supported languages is reduced even further.
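For reference, storing 1 byte per weight usually means linear quantization: map each weight onto 0–255 and keep the min/scale needed to reconstruct it. A minimal sketch (per-model rather than per-layer scaling, which a real implementation might refine):

```javascript
// Quantize float weights to 8 bits with a linear mapping onto 0..255.
function quantize(weights) {
  const min = Math.min(...weights);
  const max = Math.max(...weights);
  const scale = (max - min) / 255 || 1; // `|| 1` guards all-equal weights
  const bytes = new Uint8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    bytes[i] = Math.round((weights[i] - min) / scale);
  }
  return { bytes, min, scale };
}

// Reconstruct approximate float weights at load time.
function dequantize({ bytes, min, scale }) {
  return Array.from(bytes, (b) => b * scale + min);
}
```

Reconstruction error is bounded by half a quantization step, i.e. (max − min) / 510 per weight, which is often small enough not to hurt accuracy much.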

Accuracy output:

- afr
  - cld3: 0
  - franc: 0.6203170028818443
  - francAll: 0.5770893371757925
  - francMin: 0
  - lande: 0.797550432276657
- ara
  - cld3: 0
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.9831
- aze
  - cld3: 0
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.8786000729128691
- bel
  - cld3: 0
  - franc: 0.7479899101371591
  - francAll: 0.7040832413684377
  - francMin: 0.7818855431183982
  - lande: 0.8823900362604445
- ben
  - cld3: 0
  - franc: 0.9493717664449371
  - francAll: 0.9493717664449371
  - francMin: 0.9493717664449371
  - lande: 0.9822616407982262
- bul
  - cld3: 0
  - franc: 0.5330446396050023
  - francAll: 0.5196891820794043
  - francMin: 0.6722651665385082
  - lande: 0.6906390384070582
- cat
  - cld3: 0
  - franc: 0.5487940630797774
  - francAll: 0.3800865800865801
  - francMin: 0
  - lande: 0.7720470006184292
- ces
  - cld3: 0
  - franc: 0.32463333333333333
  - francAll: 0.26616666666666666
  - francMin: 0.44126666666666664
  - lande: 0.7368666666666667
- ckb
  - cld3: 0
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.9623333333333334
- cmn
  - cld3: 0
  - franc: 0.5979
  - francAll: 0.5979
  - francMin: 0.5979
  - lande: 0.9081
- dan
  - cld3: 0
  - franc: 0.4583
  - francAll: 0.4345333333333333
  - francMin: 0
  - lande: 0.5512666666666667
- deu
  - cld3: 0
  - franc: 0.8122333333333334
  - francAll: 0.7596
  - francMin: 0.9238666666666666
  - lande: 0.9243333333333333
- ell
  - cld3: 0
  - franc: 0.9859666666666667
  - francAll: 0.9859666666666667
  - francMin: 0.9859666666666667
  - lande: 0.9972666666666666
- eng
  - cld3: 0
  - franc: 0.5408333333333334
  - francAll: 0.4644666666666667
  - francMin: 0.7935
  - lande: 0.9025333333333333
- est
  - cld3: 0
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.7495312081435842
- eus
  - cld3: 0
  - franc: 0
  - francAll: 0.6401423257318454
  - francMin: 0
  - lande: 0.8757884522076662
- fin
  - cld3: 0
  - franc: 0.7455666666666667
  - francAll: 0.46116666666666667
  - francMin: 0
  - lande: 0.9094
- fra
  - cld3: 0
  - franc: 0.8001333333333334
  - francAll: 0.6492
  - francMin: 0.8789
  - lande: 0.8543
- hau
  - cld3: 0
  - franc: 0.8678492849284929
  - francAll: 0.8256325632563256
  - francMin: 0.9248716538320498
  - lande: 0.9375687568756875
- heb
  - cld3: 0
  - franc: 0.9465333333333333
  - francAll: 0.9465333333333333
  - francMin: 0
  - lande: 0.9923
- hin
  - cld3: 0
  - franc: 0.5303505843071786
  - francAll: 0.5278130217028381
  - francMin: 0.5303505843071786
  - lande: 0.8548914858096828
- hrv
  - cld3: 0
  - franc: 0.2578211262421789
  - francAll: 0.18862716231137283
  - francMin: 0.32664703717335297
  - lande: 0.7449392712550608
- hun
  - cld3: 0
  - franc: 0.7023
  - francAll: 0.6457666666666667
  - francMin: 0.7784
  - lande: 0.9325
- hye
  - cld3: 0
  - franc: 0.9631589261218891
  - francAll: 0.9631589261218891
  - francMin: 0
  - lande: 0.9900058788947678
- ind
  - cld3: 0
  - franc: 0.3552997221119492
  - francAll: 0.3319344411047468
  - francMin: 0.39426076107298813
  - lande: 0.9073895536777633
- isl
  - cld3: 0
  - franc: 0
  - francAll: 0.7015968063872255
  - francMin: 0
  - lande: 0.9547059726700445
- ita
  - cld3: 0
  - franc: 0.5973333333333334
  - francAll: 0.3994666666666667
  - francMin: 0.7073333333333334
  - lande: 0.9036333333333333
- jpn
  - cld3: 0
  - franc: 0.8058666666666666
  - francAll: 0.8058666666666666
  - francMin: 0.8058666666666666
  - lande: 0.9269
- kat
  - cld3: 0
  - franc: 0.9072665479851109
  - francAll: 0.9072665479851109
  - francMin: 0
  - lande: 0.9925554296811782
- kaz
  - cld3: 0
  - franc: 0.8507681053401609
  - francAll: 0.7539624481833699
  - francMin: 0.9290416971470373
  - lande: 0.8951475249939039
- kor
  - cld3: 0
  - franc: 0.8159621948017852
  - francAll: 0.8159621948017852
  - francMin: 0.8159621948017852
  - lande: 0.9798722324319594
- lit
  - cld3: 0
  - franc: 0.5288
  - francAll: 0.42156666666666665
  - francMin: 0
  - lande: 0.9219333333333334
- mar
  - cld3: 0
  - franc: 0.7049666666666666
  - francAll: 0.6956
  - francMin: 0.7049666666666666
  - lande: 0.9053666666666667
- mkd
  - cld3: 0
  - franc: 0.5904666666666667
  - francAll: 0.5827333333333333
  - francMin: 0
  - lande: 0.5291333333333333
- nld
  - cld3: 0
  - franc: 0.5688666666666666
  - francAll: 0.5407333333333333
  - francMin: 0.8422
  - lande: 0.9019666666666667
- nob
  - cld3: 0
  - franc: 0.25186988009022915
  - francAll: 0.23881040009497803
  - francMin: 0
  - lande: 0.6826546361153983
- pes
  - cld3: 0
  - franc: 0.45455898771864534
  - francAll: 0.45199106810569406
  - francMin: 0.9215481950130257
  - lande: 0.9704503163379233
- pol
  - cld3: 0
  - franc: 0.6814333333333333
  - francAll: 0.6161333333333333
  - francMin: 0.7497333333333334
  - lande: 0.9395333333333333
- por
  - cld3: 0
  - franc: 0.5497333333333333
  - francAll: 0.44143333333333334
  - francMin: 0.7293666666666667
  - lande: 0.8669
- ron
  - cld3: 0
  - franc: 0.5804217174289498
  - francAll: 0.4924111235611694
  - francMin: 0.6998404128892058
  - lande: 0.8985093884757733
- run
  - cld3: 0
  - franc: 0.33597285067873306
  - francAll: 0.290158371040724
  - francMin: 0.4326923076923077
  - lande: 0.8844268476621417
- rus
  - cld3: 0
  - franc: 0.45686666666666664
  - francAll: 0.4451
  - francMin: 0.48556666666666665
  - lande: 0.7847333333333333
- slk
  - cld3: 0
  - franc: 0.2931611117518164
  - francAll: 0.24414715719063546
  - francMin: 0
  - lande: 0.704820666589782
- spa
  - cld3: 0
  - franc: 0.49323333333333336
  - francAll: 0.26563333333333333
  - francMin: 0.6714333333333333
  - lande: 0.7731666666666667
- srp
  - cld3: 0
  - franc: 0.25716666666666665
  - francAll: 0.1847
  - francMin: 0.3091333333333333
  - lande: 0.2798333333333333
- swe
  - cld3: 0
  - franc: 0.4616
  - francAll: 0.42083333333333334
  - francMin: 0.6951333333333334
  - lande: 0.8582333333333333
- tgl
  - cld3: 0
  - franc: 0.40868407032498666
  - francAll: 0.3979754928076718
  - francMin: 0.630207778369739
  - lande: 0.908950452850293
- tur
  - cld3: 0
  - franc: 0.4473
  - francAll: 0.26206666666666667
  - francMin: 0.5488
  - lande: 0.9178666666666667
- ukr
  - cld3: 0
  - franc: 0.5976333333333333
  - francAll: 0.5812666666666667
  - francMin: 0.6293333333333333
  - lande: 0.7790666666666667
- vie
  - cld3: 0
  - franc: 0.7861802255148634
  - francAll: 0.6792470412822663
  - francMin: 0.8431180691454664
  - lande: 0.9706457925636007
- average
  - cld3: 0
  - franc: 0.5591889885479926
  - francAll: 0.511084565861915
  - francMin: 0.522001091558954
  - lande: 0.8498095388771921

Benchmark:

cld3: 264.809ms
franc: 5.346s
francAll: 12.830s
francMin: 2.295s
lande: 15.608s

@fabiospampinato
Author

fabiospampinato commented Dec 18, 2022

Getting below 100kb min+gzip seems kind of impossible with this approach, unless the number of supported languages is reduced even further.

lol, famous last words. I thought I was already encoding weights with 2 bytes rather than 4, so I tried to add a way to encode them with just 1 byte, hoping that accuracy wouldn't go down too much. Turns out I was still using 4 bytes, so now all the weights take 25% of the space. As a result the library is just a tiny bit less accurate, but it now weighs 78.7kb, which makes it much more interesting imo.

@fabiospampinato
Author

fabiospampinato commented Dec 19, 2022

Last (?) update:

  • Since weights are cheap now, I've rewritten the whole thing to use a single neural network rather than 4; that made the model about half the size again, still with similar accuracy. So the program now weighs about 45kb total; for context, that's smaller than franc-min.
  • A further ~30% of the size could be removed by bundling just the operations needed to run the model; at the moment I'm shipping the entire training library along with it for convenience.
  • It's now ~3x faster than franc-all, some 20% faster than franc, and about 2x slower than franc-min.
  • The benchmark against cld3 was totally broken and I misread it; it's actually about 50% slower than cld3.
  • The accuracy comparison against cld3 was also totally broken; now it's fixed. CLD3 seems a bit more accurate, maybe 5 percentage points or thereabouts.
  • CLD3 is about 16x bigger when looking at min sizes, 11x bigger when looking at min+gzip sizes, and it supports 2x the number of languages.

Basically, I think that after some tweaks this has turned out pretty well. If anybody would like to spend the time to turn the UDHRs into a dataset, and if @wooorm is interested, I'd be happy to train dedicated models that do what franc/franc-all/franc-min do. I'd guess that for roughly similar bundle sizes this approach could deliver much higher accuracy, if there's enough data to learn from in the UDHRs, so it'd be interesting to try.

Updated comparison:

- afr
  - cld3: 0.8662343900096061
  - franc: 0.6203170028818443
  - francAll: 0.5770893371757925
  - francMin: 0
  - lande: 0.6853986551392891
- ara
  - cld3: 0.9125333333333333
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.9524333333333334
- aze
  - cld3: 0.8900838497994896
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.8056872037914692
- bel
  - cld3: 0.9181775185243576
  - franc: 0.7479899101371591
  - francAll: 0.7040832413684377
  - francMin: 0.7818855431183982
  - lande: 0.8718272110988491
- ben
  - cld3: 0.9979674796747967
  - franc: 0.9493717664449371
  - francAll: 0.9493717664449371
  - francMin: 0.9493717664449371
  - lande: 0.9665558019216556
- bul
  - cld3: 0.8441458577846129
  - franc: 0.5330446396050023
  - francAll: 0.5196891820794043
  - francMin: 0.6722651665385082
  - lande: 0.06560362620907362
- cat
  - cld3: 0.8184291898577613
  - franc: 0.5487940630797774
  - francAll: 0.3800865800865801
  - francMin: 0
  - lande: 0.9209647495361781
- ces
  - cld3: 0.8108333333333333
  - franc: 0.32463333333333333
  - francAll: 0.26616666666666666
  - francMin: 0.44126666666666664
  - lande: 0.7595
- ckb
  - cld3: 0.002416666666666667
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.97375
- cmn
  - cld3: 0.9883
  - franc: 0.5979
  - francAll: 0.5979
  - francMin: 0.5979
  - lande: 0.9603
- dan
  - cld3: 0.7553333333333333
  - franc: 0.4583
  - francAll: 0.4345333333333333
  - francMin: 0
  - lande: 0.18566666666666667
- deu
  - cld3: 0.9318
  - franc: 0.8122333333333334
  - francAll: 0.7596
  - francMin: 0.9238666666666666
  - lande: 0.9323
- ell
  - cld3: 0.9998333333333334
  - franc: 0.9859666666666667
  - francAll: 0.9859666666666667
  - francMin: 0.9859666666666667
  - lande: 0.9826
- eng
  - cld3: 0.9006333333333333
  - franc: 0.5408333333333334
  - francAll: 0.4644666666666667
  - francMin: 0.7935
  - lande: 0.9118666666666667
- est
  - cld3: 0.7921242968122154
  - franc: 0
  - francAll: 0
  - francMin: 0
  - lande: 0.7548888293597643
- eus
  - cld3: 0.8609089438783762
  - franc: 0
  - francAll: 0.6401423257318454
  - francMin: 0
  - lande: 0.9000485201358563
- fin
  - cld3: 0.9183333333333333
  - franc: 0.7455666666666667
  - francAll: 0.46116666666666667
  - francMin: 0
  - lande: 0.9460333333333333
- fra
  - cld3: 0.9060333333333334
  - franc: 0.8001333333333334
  - francAll: 0.6492
  - francMin: 0.8789
  - lande: 0.652
- hau
  - cld3: 0.9443069306930693
  - franc: 0.8678492849284929
  - francAll: 0.8256325632563256
  - francMin: 0.9248716538320498
  - lande: 0.9056197286395307
- heb
  - cld3: 0.9894333333333334
  - franc: 0.9465333333333333
  - francAll: 0.9465333333333333
  - francMin: 0
  - lande: 0.9729
- hin
  - cld3: 0.8836060100166945
  - franc: 0.5303505843071786
  - francAll: 0.5278130217028381
  - francMin: 0.5303505843071786
  - lande: 0.8629048414023373
- hrv
  - cld3: 0.4449760765550239
  - franc: 0.2578211262421789
  - francAll: 0.18862716231137283
  - francMin: 0.32664703717335297
  - lande: 0.6302907618697092
- hun
  - cld3: 0.9297
  - franc: 0.7023
  - francAll: 0.6457666666666667
  - francMin: 0.7784
  - lande: 0.9259666666666667
- hye
  - cld3: 0.9990201842053694
  - franc: 0.9631589261218891
  - francAll: 0.9631589261218891
  - francMin: 0
  - lande: 0.977464236723496
- ind
  - cld3: 0.6384619747065162
  - franc: 0.3552997221119492
  - francAll: 0.3319344411047468
  - francMin: 0.39426076107298813
  - lande: 0.8822094935632053
- isl
  - cld3: 0.9434976201443267
  - franc: 0
  - francAll: 0.7015968063872255
  - francMin: 0
  - lande: 0.9497159527099647
- ita
  - cld3: 0.8625333333333334
  - franc: 0.5973333333333334
  - francAll: 0.3994666666666667
  - francMin: 0.7073333333333334
  - lande: 0.8038333333333333
- jpn
  - cld3: 0.9991666666666666
  - franc: 0.8058666666666666
  - francAll: 0.8058666666666666
  - francMin: 0.8058666666666666
  - lande: 0.9112666666666667
- kat
  - cld3: 0.9982197766628904
  - franc: 0.9072665479851109
  - francAll: 0.9072665479851109
  - francMin: 0
  - lande: 0.9886713060365755
- kaz
  - cld3: 0.9324554986588637
  - franc: 0.8507681053401609
  - francAll: 0.7539624481833699
  - francMin: 0.9290416971470373
  - lande: 0.897342111680078
- kor
  - cld3: 0.9964120066509145
  - franc: 0.8159621948017852
  - francAll: 0.8159621948017852
  - francMin: 0.8159621948017852
  - lande: 0.9469677080598582
- lit
  - cld3: 0.8789333333333333
  - franc: 0.5288
  - francAll: 0.42156666666666665
  - francMin: 0
  - lande: 0.9466333333333333
- mar
  - cld3: 0.9079333333333334
  - franc: 0.7049666666666666
  - francAll: 0.6956
  - francMin: 0.7049666666666666
  - lande: 0.8873333333333333
- mkd
  - cld3: 0.7934
  - franc: 0.5904666666666667
  - francAll: 0.5827333333333333
  - francMin: 0
  - lande: 0.8236666666666667
- nld
  - cld3: 0.9051
  - franc: 0.5688666666666666
  - francAll: 0.5407333333333333
  - francMin: 0.8422
  - lande: 0.8379
- nob
  - cld3: 0.8031580197079425
  - franc: 0.25186988009022915
  - francAll: 0.23881040009497803
  - francMin: 0
  - lande: 0.9076338596699514
- pes
  - cld3: 0.9594715295868999
  - franc: 0.45455898771864534
  - francAll: 0.45199106810569406
  - francMin: 0.9215481950130257
  - lande: 0.9604391514700409
- pol
  - cld3: 0.9402
  - franc: 0.6814333333333333
  - francAll: 0.6161333333333333
  - francMin: 0.7497333333333334
  - lande: 0.9436666666666667
- por
  - cld3: 0.8888
  - franc: 0.5497333333333333
  - francAll: 0.44143333333333334
  - francMin: 0.7293666666666667
  - lande: 0.7611333333333333
- ron
  - cld3: 0.8122644392380565
  - franc: 0.5804217174289498
  - francAll: 0.4924111235611694
  - francMin: 0.6998404128892058
  - lande: 0.8223829411564972
- run
  - cld3: 0
  - franc: 0.33597285067873306
  - francAll: 0.290158371040724
  - francMin: 0.4326923076923077
  - lande: 0.8964932126696833
- rus
  - cld3: 0.8584666666666667
  - franc: 0.45686666666666664
  - francAll: 0.4451
  - francMin: 0.48556666666666665
  - lande: 0.7372666666666666
- slk
  - cld3: 0.7268481144043363
  - franc: 0.2931611117518164
  - francAll: 0.24414715719063546
  - francMin: 0
  - lande: 0.6873486333756199
- spa
  - cld3: 0.7938
  - franc: 0.49323333333333336
  - francAll: 0.26563333333333333
  - francMin: 0.6714333333333333
  - lande: 0.5963333333333334
- srp
  - cld3: 0.24833333333333332
  - franc: 0.25716666666666665
  - francAll: 0.1847
  - francMin: 0.3091333333333333
  - lande: 0.4236333333333333
- swe
  - cld3: 0.8631333333333333
  - franc: 0.4616
  - francAll: 0.42083333333333334
  - francMin: 0.6951333333333334
  - lande: 0.7063333333333334
- tgl
  - cld3: 0
  - franc: 0.40868407032498666
  - francAll: 0.3979754928076718
  - francMin: 0.630207778369739
  - lande: 0.9349493873201918
- tur
  - cld3: 0.8988333333333334
  - franc: 0.4473
  - francAll: 0.26206666666666667
  - francMin: 0.5488
  - lande: 0.9159
- ukr
  - cld3: 0.8943
  - franc: 0.5976333333333333
  - francAll: 0.5812666666666667
  - francMin: 0.6293333333333333
  - lande: 0.6868
- vie
  - cld3: 0.9800111825552139
  - franc: 0.7861802255148634
  - francAll: 0.6792470412822663
  - francMin: 0.8431180691454664
  - lande: 0.975398378529494
- average
  - cld3: 0.8404255020375455
  - franc: 0.5591889885479926
  - francAll: 0.511084565861915
  - francMin: 0.522001091558954
  - lande: 0.8123052208534568
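A per-language accuracy table like the one above can be produced with a small benchmark harness. A minimal sketch (not the actual benchmark code used here; `samples` and the detector functions are placeholders): for each language, run every detector over its sample texts and record the fraction of correct top-1 guesses.

```javascript
// Minimal benchmark sketch: `samples` maps a language code to an array of
// texts, `detectors` maps a detector name to a function that returns a
// language code for a given text. Returns per-language, per-detector accuracy.
function benchmark(samples, detectors) {
  const results = {};
  for (const [language, texts] of Object.entries(samples)) {
    results[language] = {};
    for (const [name, detect] of Object.entries(detectors)) {
      let correct = 0;
      for (const text of texts) {
        if (detect(text) === language) correct++;
      }
      results[language][name] = correct / texts.length;
    }
  }
  return results;
}
```

The averages at the bottom of the table would then just be the mean of each detector's column across languages.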

@porkopek

@fabiospampinato Wow! Nice job. I'm testing the accuracy of your library vs franc in a project, but at the moment there are no relevant differences.

For me, the problem with franc is accuracy. The weight is not a problem in my case.
I'm following your progress with enthusiasm.
Are your texts translations or texts written in the original language?

@fabiospampinato
Author

fabiospampinato commented Dec 19, 2022

> I'm testing the accuracy of your library vs franc in a project, but at the moment there are no relevant differences

  1. Is there a particular language that you are trying to detect?
  2. Can you post some samples where detection isn't accurate enough?
  3. Did you check how accurate something like cld3 is?
  4. How many texts are you testing this with? It might take a good amount of tries to build statistically significant results, I guess.

> The weight is not a problem in my case.

If the weight is not a problem, CLD3 should be pretty good. If even that is not good enough for your use case, you may have to customize my thing: one could make the model bigger, train it on more examples, support fewer languages, look at quadgrams too, etc. All of that should drive accuracy up.
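To make "look at quadgrams too" concrete, a hedged sketch of the kind of feature extraction involved (this is illustrative, not franc's or CLD3's actual code; the function name and parameters are made up): count character n-grams in a normalized text and keep the most frequent ones as the language profile.

```javascript
// Count character n-grams (quadgrams by default) in a text and return the
// `keep` most frequent ones. Whitespace is collapsed and the text is padded
// so word boundaries show up as grams containing spaces.
function topNgrams(text, n = 4, keep = 100) {
  const counts = new Map();
  const padded = ` ${text.toLowerCase().replace(/\s+/g, ' ')} `;
  for (let i = 0; i + n <= padded.length; i++) {
    const gram = padded.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, keep)
    .map(([gram]) => gram);
}
```

A profile like this, built per language from training text, is what an n-gram classifier (or a neural network's input layer) would consume.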

> Are your texts translations or texts written in the original language?

Original language, I believe; I'm using this.

@fabiospampinato
Author

fabiospampinato commented Dec 19, 2022

@porkopek I've just made it slightly more accurate by also taking into account the top 100 quadgrams. I seem to have hit some kind of ceiling though: I can't get to 90% accuracy just by making the network bigger 🤔, though I haven't tried increasing multiple numbers at the same time.

@wooorm
Owner

wooorm commented Dec 19, 2022

The conversation currently is about how different projects, specifically around neural networks, work. This issue is about improving franc, potentially through neural networks, but also through other methods.

Can questions about other specific tools be discussed elsewhere? Exploratory work is of course interesting, so perhaps discuss it in other places and link to those here?
