
Improving language detection #929

Closed
gavishpoddar opened this issue May 24, 2021 · 17 comments

@gavishpoddar
Contributor

Improving language detection using an optional language detection library.

Related issues: #567, #575, #612.

Implementing pluggable wrappers for optional language detection libraries, a wrapper template, and docs on implementing a custom wrapper.
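For illustration, a minimal sketch of what such a pluggable wrapper could look like, assuming the detect_languages_func hook used later in this thread (a callable that takes the input text and returns a list of language codes); the confidence_threshold knob is hypothetical:

# Hypothetical wrapper sketch around langdetect, matching the
# detect_languages_func hook discussed in this thread: it takes the
# input text and returns a list of ISO 639-1 language codes.
from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

def detect_languages(text, confidence_threshold=0.5):
    try:
        candidates = detect_langs(text)  # e.g. [en:0.99, fr:0.01]
    except LangDetectException:
        return []
    return [c.lang for c in candidates if c.prob >= confidence_threshold]

# Example: parse("Tuesday Jul 22, 2014", detect_languages_func=detect_languages)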

@gavishpoddar
Contributor Author

Testing different libraries using dateparser/tests and #928

@gavishpoddar
Contributor Author

gavishpoddar commented May 27, 2021

Language Detection Tests

fasttext

Requirements - pybind11, numpy
Model - 'lid.176.bin' (~130 MB)
Last Commit - 11 Months ago

spacy_langdetect

Requirements - spacy, langdetect
Last Commit - 2 Years ago

langdetect

Requirements - six
Last Commit - 22 days ago


langdetect

              precision    recall  f1-score   support

           0       0.80      0.93      0.86      5224
           1       0.91      0.75      0.82      4776

    accuracy                           0.85     10000
   macro avg       0.86      0.84      0.84     10000
weighted avg       0.85      0.85      0.84     10000

spacy

              precision    recall  f1-score   support

           0       0.80      0.94      0.87      5224
           1       0.92      0.75      0.83      4776

    accuracy                           0.85     10000
   macro avg       0.86      0.84      0.85     10000
weighted avg       0.86      0.85      0.85     10000

fasttext

              precision    recall  f1-score   support

           0       0.92      0.81      0.86      5224
           1       0.81      0.92      0.86      4776

    accuracy                           0.86     10000
   macro avg       0.87      0.86      0.86     10000
weighted avg       0.87      0.86      0.86     10000

dateparser

              precision    recall  f1-score   support

           0       0.59      0.71      0.65      5224
           1       0.60      0.47      0.53      4776

    accuracy                           0.59     10000
   macro avg       0.59      0.59      0.59     10000
weighted avg       0.59      0.59      0.59     10000
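(For reference, the reports above are in the format produced by scikit-learn's classification_report; a minimal, hypothetical sketch of generating one from arrays of true and predicted labels:)

# Hypothetical sketch: producing a report in the above format with scikit-learn.
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1]  # placeholder ground-truth labels
y_pred = [0, 1, 0, 0, 1]  # placeholder predictions
print(classification_report(y_true, y_pred))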

Other observations:

  • Speed: fasttext > langdetect > spacy > dateparser
  • Memory requirement was highest for fasttext.
  • Dependency requirements were highest for spacy_langdetect.
  • Limiting the text to 100 chars increases the speed of language detection for search_dates (see the sketch below).
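A rough sketch of that truncation idea (an assumption about how it could be wired up, using the pretrained 'lid.176.bin' fasttext model mentioned above):

# Sketch: truncate long input before detection to speed it up.
import fasttext

model = fasttext.load_model("lid.176.bin")

def detect_language_fast(text, max_chars=100):
    # fasttext's predict() works on a single line, so strip newlines,
    # then limit the text to max_chars before predicting.
    text = text.replace("\n", " ")[:max_chars]
    labels, probs = model.predict(text)
    return labels[0].replace("__label__", ""), probs[0]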

@lopuhin
Member

lopuhin commented May 27, 2021

Thanks @gavishpoddar. On which dataset did you measure the quality?

@gavishpoddar
Contributor Author

gavishpoddar commented May 27, 2021

Hi @lopuhin, I used this dataset.
I am still researching them and will be testing cld3, langid, and Polyglot, as well as the dataset from #928.

Should I share the code that was used? Please suggest anything else that would help.

@gavishpoddar
Contributor Author

The accuracy of langdetect is ~0.24 when the input text is short (1-3 words), measured on the test_languages data.

A few language detection libraries were excluded from the results or not tested due to speed, maintenance, or installation-related issues:

here and here and NLTK

Other notes

The issues discussed above regarding langdetect are confirmed.

@lopuhin
Member

lopuhin commented May 31, 2021

The accuracy of langdetect is ~0.24 when the input text is short (1-3 words), measured on the test_languages data.

Thanks @gavishpoddar, that is test data from the dateparser library, right? Do you have accuracy values for other libraries on that data?

@gavishpoddar
Contributor Author

Accuracy scores change drastically if we discard the test data for languages not supported by the library; the scores below were observed after the data was cleaned.

Accuracy values:

Using test_languages data (Partial) **:

fasttext

Accuracy Score: 0.7823

Data cleaned stats

Mapping Not Found: 204
Mapping Found: 233
Total: 437

langdetect

Accuracy Score: 0.6608

Data cleaned stats

Mapping Not Found: 299
Mapping Found: 138
Total: 437

** Data was cleaned by removing entries for languages not supported by the above libraries; the cleaning code is shared in the comment below.

@gavishpoddar
Contributor Author

Code for cleaning supported languages:

import fasttext

# Languages supported by the pretrained 'lid.176.bin' model
fasttext_supported_lang = "af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue"
fasttext_supported_lang = fasttext_supported_lang.split(" ")

path_to_pretrained_model = 'lid.176.bin'
fmodel = fasttext.load_model(path_to_pretrained_model)

def param(cc, text, *_):
    # Returns (expected code, predicted code); the prediction is None when
    # the expected language is not supported by the model.
    if cc not in fasttext_supported_lang:
        return cc, None
    code = fmodel.predict([text])[0][0][0].replace("__label__", "")
    return cc, code

test = [
    param('en', "Sep 03 2014", "september 03 2014"),
    param('en', "friday, 03 september 2014", "friday 03 september 2014"),
    ...
    param('zu', "3 mashi 2007 ulwesibili 10:08", "3 march 2007 tuesday 10:08"),
    param('zu', "son 23 umasingana 1996", "sunday 23 january 1996"),
]

mapping_not_found = 0
mapping_found = 0
total_candidates = 0

for expected, predicted in test:
    total_candidates += 1
    if predicted is None:
        mapping_not_found += 1
    else:
        mapping_found += 1

print(mapping_not_found, mapping_found, total_candidates)
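For completeness, a rough sketch (an assumption, since the exact scoring code was not posted) of how the accuracy on the mapped subset could be computed from the same (expected, predicted) pairs, reusing test and mapping_found from the snippet above (with the elided entries filled in):

# Rough sketch (assumption): accuracy over the examples with a supported mapping.
correct = sum(
    1 for expected, predicted in test
    if predicted is not None and predicted == expected
)
print("Accuracy Score:", round(correct / mapping_found, 4))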


@lopuhin
Member

lopuhin commented Jun 1, 2021

@gavishpoddar Thanks for posting the code - that's really helpful. And it's useful to know how many of the examples are in languages missing from the library. But this way to evaluate quality does not sound fair to me, especially comparing accuracy on different datasets between libraries. I don't think we should discard examples just because a library does not support them - after all we want them to work, don't we? I'd rather run evaluation on the whole dataset, doing re-mapping between languages if needed.

Also I see that even fasttext has a lot of "Mapping not found" - is that because languages are really missing, or because the names of languages are different?

@gavishpoddar
Contributor Author

Hi @lopuhin, thanks for the review 😀

I will try to remap as many of them as possible and re-run the tests without discarding any values. I am also working on the remapping part beyond just the tests.

In some cases the language is genuinely missing, and in others it is a mapping issue: fasttext uses ISO 639 codes while our data uses CLDR language codes. I will try to get more insight into that. Additionally, some script-specific CLDR codes appear in our test data but are not very commonly written, e.g. 'sr-Cyrl', 'uz-Cyrl', 'az-Latn', and these are not covered by the model.
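As a rough illustration, a remapping between the CLDR-style codes in our test data and the base ISO 639 codes returned by fasttext could start by simply stripping the script subtag (a hypothetical helper, not the final mapping):

# Hypothetical remapping sketch: normalize CLDR-style codes such as
# 'sr-Cyrl' or 'az-Latn' to their base language code before comparison.
def to_base_code(lang_code):
    return lang_code.split("-")[0].lower()

assert to_base_code("sr-Cyrl") == "sr"
assert to_base_code("uz-Cyrl") == "uz"
assert to_base_code("az-Latn") == "az"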

@gavishpoddar
Contributor Author

Accuracy scores without removing unsupported languages:

langdetect: 0.2376

fasttext: 0.3867

Other observations:

  • The test includes many languages that are not very commonly used (over 50% of the test data is in languages individually spoken by less than 0.5% of the world population; see data).

@gavishpoddar
Contributor Author

With PR #932 and the fasttext function, the issues discussed above are solved.

@gavishpoddar
Contributor Author

Language Detection: Speed

For 10,000 calls (times in seconds):

fasttext:   14.983941316604614
langdetect: 131.56952500343323
dateparser: 19.442069053649902

Please find the code below.

@gavishpoddar
Contributor Author

import time

from dateparser import parse
from dateparser.custom_lang_detect.fasttext import detect_languages
from dateparser.custom_lang_detect.langdetect import detect_languages as langdetect_detect_languages

# 10,000 parses with fasttext-based language detection
start = time.time()
for _ in range(10000):
    parse("Tuesday Jul 22, 2014", detect_languages_func=detect_languages)
end = time.time()
print(end - start)

# 10,000 parses with langdetect-based language detection
start = time.time()
for _ in range(10000):
    parse("Tuesday Jul 22, 2014", detect_languages_func=langdetect_detect_languages)
end = time.time()
print(end - start)

# 10,000 parses with dateparser's default language detection
start = time.time()
for _ in range(10000):
    parse("Tuesday Jul 22, 2014")
end = time.time()
print(end - start)

@gavishpoddar
Contributor Author

WIP:

Language Detection: Accuracy Score

parse: 0.08849557522123894
search_dates: 0.1592920353982301
parse with fasttext: 0.09779646017699115
search_dates with fasttext: 0.336283185840708

These scores are not accurate and are just for discussion.
Please find the code below.

@gavishpoddar
Contributor Author

import requests
import wikipediaapi
from bs4 import BeautifulSoup
from tldextract import extract

from dateparser import parse
from dateparser.search import search_dates
from dateparser.custom_lang_detect.fasttext import detect_languages

# Collect snippets of the same article in different languages via the
# interlanguage links on the English Wikipedia page for "Augustus".
r = requests.get("https://en.wikipedia.org/wiki/Augustus")
soup = BeautifulSoup(r.content, "lxml")

urls = [a['href'] for a in soup.find_all('a', href=True)]

used = []
data = []

for url in urls:
    tsd, td, tsu = extract(url)  # subdomain is the language code, e.g. 'fr'
    if td == "wikipedia" and tsd not in used:
        path = url.split("/")[-1]
        wiki_wiki = wikipediaapi.Wikipedia(tsd)
        page_py = wiki_wiki.page(path)
        if page_py.exists():
            data.append([tsd, page_py.text[:400]])
    used.append(tsd)

total = 0
date_parser = 0
date_search = 0
ft_parse = 0
ft_search = 0

for x, y in data:
    total += 1
    y = y.replace('\n', ' ').replace('\r', ' ')
    try:
        dates, language_code_search = search_dates(y)
    except Exception:
        dates = []
        language_code_search = None

    if len(dates) == 0:
        continue
    dates_text = dates[0][0]
    language_code_parse = parse(dates_text)

    # fasttext-based detection on the full snippet and on the date text only
    language_code_search_withft = detect_languages(y)
    language_code_search_withft = language_code_search_withft[0] if language_code_search_withft else None

    language_code_parse_withft = detect_languages(dates_text)
    language_code_parse_withft = language_code_parse_withft[0] if language_code_parse_withft else None

    if x == language_code_search:
        date_search += 1
    if x == language_code_parse:
        date_parser += 1
    if x == language_code_search_withft:
        ft_search += 1
    if x == language_code_parse_withft:
        ft_parse += 1

date_search = date_search / total
date_parser = date_parser / total
ft_search = ft_search / total
ft_parse = ft_parse / total

print(date_search, date_parser, ft_search, ft_parse)

@gavishpoddar
Contributor Author

Thanks, maintainers; this issue discussion is concluded.

Thanks for your time.
