
Improving language detection #929

Closed
gavishpoddar opened this issue May 24, 2021 · 17 comments

@gavishpoddar
Contributor

Improving language detection using an optional language detection library.

Related issues: #567, #575, #612.

Implementing pluggable wrappers for optional language detection libraries, a wrapper template, and docs on implementing a custom wrapper.
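For illustration, a minimal sketch of what such a pluggable wrapper could look like, assuming the detect_languages_func hook used later in this thread (a callable that takes the input text and returns a list of language codes); the confidence_threshold knob is hypothetical:

# Hypothetical wrapper sketch around langdetect, matching the
# detect_languages_func hook discussed in this thread: it takes the
# input text and returns a list of ISO 639-1 language codes.
from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

def detect_languages(text, confidence_threshold=0.5):
    try:
        candidates = detect_langs(text)  # e.g. [en:0.99, fr:0.01]
    except LangDetectException:
        return []
    return [c.lang for c in candidates if c.prob >= confidence_threshold]

# Example: parse("Tuesday Jul 22, 2014", detect_languages_func=detect_languages)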

@gavishpoddar
Contributor Author

Testing different libraries using dateparser/tests and #928

@gavishpoddar
Contributor Author

gavishpoddar commented May 27, 2021

Language Detection Tests

fasttext

Requirements - pybind11, numpy
Model - 'lid.176.bin' (~130 MB)
Last Commit - 11 Months ago

spacy_langdetect

Requirements - spacy, langdetect
Last Commit - 2 Years ago

langdetect

Requirements - six
Last Commit - 22 days ago


langdetect

              precision    recall  f1-score   support

           0       0.80      0.93      0.86      5224
           1       0.91      0.75      0.82      4776

    accuracy                           0.85     10000
   macro avg       0.86      0.84      0.84     10000
weighted avg       0.85      0.85      0.84     10000

spacy

              precision    recall  f1-score   support

           0       0.80      0.94      0.87      5224
           1       0.92      0.75      0.83      4776

    accuracy                           0.85     10000
   macro avg       0.86      0.84      0.85     10000
weighted avg       0.86      0.85      0.85     10000

fasttext

              precision    recall  f1-score   support

           0       0.92      0.81      0.86      5224
           1       0.81      0.92      0.86      4776

    accuracy                           0.86     10000
   macro avg       0.87      0.86      0.86     10000
weighted avg       0.87      0.86      0.86     10000

dateparser

              precision    recall  f1-score   support

           0       0.59      0.71      0.65      5224
           1       0.60      0.47      0.53      4776

    accuracy                           0.59     10000
   macro avg       0.59      0.59      0.59     10000
weighted avg       0.59      0.59      0.59     10000
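(For reference, the reports above are in the format produced by scikit-learn's classification_report; a minimal, hypothetical sketch of generating one from arrays of true and predicted labels:)

# Hypothetical sketch: producing a report in the above format with scikit-learn.
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1]  # placeholder ground-truth labels
y_pred = [0, 1, 0, 0, 1]  # placeholder predictions
print(classification_report(y_true, y_pred))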

Other observations:

  • Speed: fasttext > langdetect > spacy > dateparser
  • Memory requirement was highest for fasttext.
  • Dependency requirements were highest for spacy_langdetect.
  • Limiting the text to 100 chars increases the speed of language detection for search_dates (see the sketch below).
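A rough sketch of that truncation idea (an assumption about how it could be wired up, using the pretrained 'lid.176.bin' fasttext model mentioned above):

# Sketch: truncate long input before detection to speed it up.
import fasttext

model = fasttext.load_model("lid.176.bin")

def detect_language_fast(text, max_chars=100):
    # fasttext's predict() works on a single line, so strip newlines,
    # then limit the text to max_chars before predicting.
    text = text.replace("\n", " ")[:max_chars]
    labels, probs = model.predict(text)
    return labels[0].replace("__label__", ""), probs[0]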

@lopuhin
Member

lopuhin commented May 27, 2021

Thanks @gavishpoddar. On which dataset did you measure the quality?

@gavishpoddar
Contributor Author

gavishpoddar commented May 27, 2021

Hi @lopuhin, I used this dataset.
I am still researching them and will be testing cld3, langid, and Polyglot, as well as the dataset from #928.

Should I share the code that was used? Please suggest anything else that would help.

@gavishpoddar
Contributor Author

The accuracy of langdetect is ~0.24 when the input text is short (1-3 words), measured on the test_languages data.

A few language detection libraries were excluded from the results or not tested due to speed, maintenance, or installation-related issues:

here and here and NLTK

Other notes

The issues discussed above regarding langdetect are confirmed.

@lopuhin
Member

lopuhin commented May 31, 2021

The accuracy of langdetect is ~0.24 when the input text is short (1-3 words), measured on the test_languages data.

Thanks @gavishpoddar, that is test data from the dateparser library, right? Do you have accuracy values for other libraries on that data?

@gavishpoddar
Contributor Author

Accuracy scores change drastically if we discard the test data for languages not supported by the library; the scores below were observed after the data was cleaned.

Accuracy values:

Using test_languages data (Partial) **:

fasttext

Accuracy Score: 0.7823

Data cleaned stats

Mapping Not Found: 204
Mapping Found: 233
Total: 437

langdetect

Accuracy Score: 0.6608

Data cleaned stats

Mapping Not Found: 299
Mapping Found: 138
Total: 437

** Data was cleaned by removing entries for languages not supported by the above libraries; the cleaning code is shared in the comment below.

@gavishpoddar
Contributor Author

Code for cleaning supported languages:

import fasttext

# Languages supported by the pretrained 'lid.176.bin' model
fasttext_supported_lang = "af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue"
fasttext_supported_lang = fasttext_supported_lang.split(" ")

path_to_pretrained_model = 'lid.176.bin'
fmodel = fasttext.load_model(path_to_pretrained_model)

def param(cc, text, *_):
    # Returns (expected code, predicted code); the prediction is None when
    # the expected language is not supported by the model.
    if cc not in fasttext_supported_lang:
        return cc, None
    code = fmodel.predict([text])[0][0][0].replace("__label__", "")
    return cc, code

test = [
    param('en', "Sep 03 2014", "september 03 2014"),
    param('en', "friday, 03 september 2014", "friday 03 september 2014"),
    ...
    param('zu', "3 mashi 2007 ulwesibili 10:08", "3 march 2007 tuesday 10:08"),
    param('zu', "son 23 umasingana 1996", "sunday 23 january 1996"),
]

mapping_not_found = 0
mapping_found = 0
total_candidates = 0

for expected, predicted in test:
    total_candidates += 1
    if predicted is None:
        mapping_not_found += 1
    else:
        mapping_found += 1

print(mapping_not_found, mapping_found, total_candidates)
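For completeness, a rough sketch (an assumption, since the exact scoring code was not posted) of how the accuracy on the mapped subset could be computed from the same (expected, predicted) pairs, reusing test and mapping_found from the snippet above (with the elided entries filled in):

# Rough sketch (assumption): accuracy over the examples with a supported mapping.
correct = sum(
    1 for expected, predicted in test
    if predicted is not None and predicted == expected
)
print("Accuracy Score:", round(correct / mapping_found, 4))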


@lopuhin
Member

lopuhin commented Jun 1, 2021

@gavishpoddar Thanks for posting the code - that's really helpful. And it's useful to know how many of the examples are in languages missing from the library. But this way to evaluate quality does not sound fair to me, especially comparing accuracy on different datasets between libraries. I don't think we should discard examples just because a library does not support them - after all we want them to work, don't we? I'd rather run evaluation on the whole dataset, doing re-mapping between languages if needed.

Also I see that even fasttext has a lot of "Mapping not found" - is that because languages are really missing, or because the names of languages are different?

@gavishpoddar
Contributor Author

Hi @lopuhin, thanks for the review 😀

I will try to remap as many of them as possible and re-run the tests without discarding any values. I am also working on the remapping part beyond just the tests.

In some cases the language is genuinely missing, and in others it is a mapping issue: fasttext uses ISO 639 codes while our data uses CLDR language codes. I will try to get more insight into that. Additionally, some script-specific CLDR codes appear in our test data but are not very commonly written, e.g. 'sr-Cyrl', 'uz-Cyrl', 'az-Latn', and these are not covered by the model.
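As a rough illustration, a remapping between the CLDR-style codes in our test data and the base ISO 639 codes returned by fasttext could start by simply stripping the script subtag (a hypothetical helper, not the final mapping):

# Hypothetical remapping sketch: normalize CLDR-style codes such as
# 'sr-Cyrl' or 'az-Latn' to their base language code before comparison.
def to_base_code(lang_code):
    return lang_code.split("-")[0].lower()

assert to_base_code("sr-Cyrl") == "sr"
assert to_base_code("uz-Cyrl") == "uz"
assert to_base_code("az-Latn") == "az"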

@gavishpoddar
Contributor Author

Accuracy scores without removing unsupported languages:

langdetect: 0.2376

fasttext: 0.3867

Other observations:

  • The test includes many languages that are not very commonly used (over 50% of the test data is in languages individually spoken by less than 0.5% of the world population; see data).

@gavishpoddar
Contributor Author

With PR #932 and the fasttext function, the issues discussed above are solved.

@gavishpoddar
Contributor Author

Language Detection: Speed

For 10,000 calls (times in seconds):

fasttext:   14.983941316604614
langdetect: 131.56952500343323
dateparser: 19.442069053649902

Please find the code below.

@gavishpoddar
Contributor Author

import time

from dateparser import parse
from dateparser.custom_lang_detect.fasttext import detect_languages
from dateparser.custom_lang_detect.langdetect import detect_languages as langdetect_detect_languages

# 10,000 parses with fasttext-based language detection
start = time.time()
for _ in range(10000):
    parse("Tuesday Jul 22, 2014", detect_languages_func=detect_languages)
end = time.time()
print(end - start)

# 10,000 parses with langdetect-based language detection
start = time.time()
for _ in range(10000):
    parse("Tuesday Jul 22, 2014", detect_languages_func=langdetect_detect_languages)
end = time.time()
print(end - start)

# 10,000 parses with dateparser's default language detection
start = time.time()
for _ in range(10000):
    parse("Tuesday Jul 22, 2014")
end = time.time()
print(end - start)

@gavishpoddar
Contributor Author

WIP:

Language Detection: Accuracy Score

parse: 0.08849557522123894
search_dates: 0.1592920353982301
parse with fasttext: 0.09779646017699115
search_dates with fasttext: 0.336283185840708

These scores are not accurate and are just for discussion.
Please find the code below.

@gavishpoddar
Contributor Author

import requests
import wikipediaapi
from bs4 import BeautifulSoup
from tldextract import extract

from dateparser import parse
from dateparser.search import search_dates
from dateparser.custom_lang_detect.fasttext import detect_languages

# Collect snippets of the same article in different languages via the
# interlanguage links on the English Wikipedia page for "Augustus".
r = requests.get("https://en.wikipedia.org/wiki/Augustus")
soup = BeautifulSoup(r.content, "lxml")

urls = [a['href'] for a in soup.find_all('a', href=True)]

used = []
data = []

for url in urls:
    tsd, td, tsu = extract(url)  # subdomain is the language code, e.g. 'fr'
    if td == "wikipedia" and tsd not in used:
        path = url.split("/")[-1]
        wiki_wiki = wikipediaapi.Wikipedia(tsd)
        page_py = wiki_wiki.page(path)
        if page_py.exists():
            data.append([tsd, page_py.text[:400]])
    used.append(tsd)

total = 0
date_parser = 0
date_search = 0
ft_parse = 0
ft_search = 0

for x, y in data:
    total += 1
    y = y.replace('\n', ' ').replace('\r', ' ')
    try:
        dates, language_code_search = search_dates(y)
    except Exception:
        dates = []
        language_code_search = None

    if len(dates) == 0:
        continue
    dates_text = dates[0][0]
    language_code_parse = parse(dates_text)

    # fasttext-based detection on the full snippet and on the date text only
    language_code_search_withft = detect_languages(y)
    language_code_search_withft = language_code_search_withft[0] if language_code_search_withft else None

    language_code_parse_withft = detect_languages(dates_text)
    language_code_parse_withft = language_code_parse_withft[0] if language_code_parse_withft else None

    if x == language_code_search:
        date_search += 1
    if x == language_code_parse:
        date_parser += 1
    if x == language_code_search_withft:
        ft_search += 1
    if x == language_code_parse_withft:
        ft_parse += 1

date_search = date_search / total
date_parser = date_parser / total
ft_search = ft_search / total
ft_parse = ft_parse / total

print(date_search, date_parser, ft_search, ft_parse)

@gavishpoddar
Contributor Author

Thanks, maintainers; this issue discussion is concluded.

Thanks for your time.
