
The docker image on Docker Hub for version 1.2.7 (latest) has broken auto language detect #217

Closed
MatsBjerin opened this issue Feb 22, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@MatsBjerin

To replicate

  1. Create this docker-compose file:
version: "3"

services:
  libretranslate:
    image: libretranslate/libretranslate:latest
    container_name: libretranslate
    stdin_open: true
    tty: true
    restart: unless-stopped
    ports:
      - 5000:5000
  2. Start it with: "docker-compose up -d"

  3. Then run this:
    curl -X POST -H "Content-Type: application/json" -d '{"q": "Ciao!", "source": "auto", "target": "en"}' localhost:5000/translate

Language auto-detection will fail and no translation is made.
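For reference, the same request can be made from Python using only the standard library (a sketch; it assumes the container from the compose file above is listening on localhost:5000):

```python
import json
import urllib.request

def translate(q, source="auto", target="en",
              url="http://localhost:5000/translate"):
    """POST a translation request to a local LibreTranslate instance."""
    payload = json.dumps({"q": q, "source": source, "target": target}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# translate("Ciao!")  # with the broken image, auto-detect fails and no translation comes back
```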

@cristeigabriel

I can't try it myself, but I recommend that in any 'production' sort of environment you make your POST requests through a module whose errors you can handle. My guess is that a character in the input is throwing off the language auto-detection. Personally, I applied the following changes to 'language.py' in the LibreTranslate app; the error should then be visible in the host console.

import regex

RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

def detect_languages(text):
    # detect batch processing
    if isinstance(text, list):
        is_batch = True
    else:
        is_batch = False
        text = [text]

    # get the candidates
    candidates = []
    for _t in text:
        t = remove_bad_chars(_t)

I borrowed this solution from someone else who had the same issue long ago; sadly I didn't keep a reference to the post. Nonetheless, the fix has held up just fine after long periods of testing.
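For anyone who would rather avoid the third-party regex dependency, the same stripping can be sketched with the standard library, dropping characters whose Unicode general category is Cc (control) or Cs (surrogate), which is what `\p{Cc}|\p{Cs}` matches above:

```python
import unicodedata

def remove_bad_chars(text):
    # drop control (Cc) and surrogate (Cs) characters, as the regex version does
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Cc", "Cs")
    )

print(remove_bad_chars("Ciao!\x00\x1b"))  # prints "Ciao!"
```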

@cristeigabriel

If the above fixes the issue, I'm willing to make a PR (which would, of course, credit the solution's original poster).

@MatsBjerin
Author

MatsBjerin commented Feb 23, 2022

Thank you Cristei,

The issue, however, affects all text: it does not matter whether the input is just "Ciao" or "Hello", detection still fails.
The curly braces are not part of the text; they are just the JSON wrappers required by the POST format and should never reach the actual language engine.

Also, we see exactly the same issue when performing the call through the Python library suggested on the LibreTranslate page.

The issue appeared in the latest release. However, the previous version, tagged "main", turns out to show the same behaviour.

I am guessing this is due to some change in the language models that get downloaded: I believe the "main" Docker image itself (not just "latest") is unchanged, yet its behaviour has changed since just a week ago.

We have worked around the issue by using another language-detection engine and specifying the source and target languages explicitly when calling LibreTranslate.

We are of course still hoping the actual problem will be fixed at some point as most users will probably be affected.

Best regards,
Mats

@cristeigabriel

I forgot that I hadn't synced my repository with upstream, and the issue you are facing is very similar to one I ran into myself, hence the response. My apologies.

@MatsBjerin
Author

No worries! I appreciate you looking into this anyhow!
Best, /Mats

@ZenulAbidin
Contributor

ZenulAbidin commented Mar 3, 2022

I'm sure this is related to this issue.

When running the LibreTranslate docker instance locally, /detect totally fails to detect the language.

French text copied from random website:

curl -X POST "http://localhost:5000/detect" -H  "accept: application/json" -H  "Content-Type: application/x-www-form-urlencoded" -d "q=Voil%C3%A0%20pourquoi%2C%20je%20n'ach%C3%A9terais%20pas%20d'Asics%20d'occasion%20o%C3%B9%20m%C3%AAme%20neuve%20sauf%20pour%20ceux%20qui%20veulent%20jeter%20leur%20argent%20par%20la%20fen%C3%AAtre" -i
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Authorization, Content-Type
Access-Control-Allow-Methods: GET, POST
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Authorization
Access-Control-Max-Age: 1728000
Content-Length: 37
Content-Type: application/json
Date: Thu, 03 Mar 2022 10:29:14 GMT
Server: waitress

[{"confidence":0.0,"language":"en"}]

Strangely, the libretranslate.org instance reports the language correctly as "fr".

/translate works properly though.

This applies to both "latest" and "main" tags.
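For reference, the percent-encoded q parameter in the curl command above decodes to plain French, so the input itself is well-formed:

```python
from urllib.parse import unquote

q = ("Voil%C3%A0%20pourquoi%2C%20je%20n'ach%C3%A9terais%20pas%20d'Asics"
     "%20d'occasion%20o%C3%B9%20m%C3%AAme%20neuve%20sauf%20pour%20ceux"
     "%20qui%20veulent%20jeter%20leur%20argent%20par%20la%20fen%C3%AAtre")

# "Voilà pourquoi, je n'achéterais pas d'Asics d'occasion où même neuve
#  sauf pour ceux qui veulent jeter leur argent par la fenêtre"
print(unquote(q))
```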

@MatsBjerin
Author

MatsBjerin commented Mar 3, 2022 via email

@ZenulAbidin
Contributor

ZenulAbidin commented Mar 3, 2022

Also, it looks like the polyglot module used internally by LibreTranslate reports the detected language and confidence correctly, so it's not as though an ML model inside the container is untrained or anything like that.

(To reproduce: create a virtualenv and install everything in requirements.txt; you may have to install ctranslate2 explicitly to resolve dependency issues.)

Python 3.8.3 (default, Jul  2 2020, 16:21:59) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from polyglot.detect import Detector
>>> arabic_text = u"""
... أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
... الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
... انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
... والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
... """
>>> detector = Detector(arabic_text)
>>> print(detector.language)
name: Arabic      code: ar       confidence:  99.0 read bytes:   907
>>> 

Arabic text from https://polyglot.readthedocs.io/en/latest/Detection.html

Even the French text I called LibreTranslate with returns the correct output when calling polyglot's Detector directly:

>>> french_text = u"""
... Voilà pourquoi, je n'achéterais pas d'Asics d'occasion où même neuve sauf pour ceux qui veulent jeter leur argent par la fenêtre
... """
>>> french_detector = Detector(french_text)
>>> print(french_detector.language)
name: French      code: fr       confidence:  99.0 read bytes:   862
>>> 

This indicates that the Detector is being passed additional parameters that are hurting its reliability.

This hypothesis is backed by these warnings, printed to the console by the LibreTranslate daemon when calling /detect; they might offer a good file and line number to trace:

WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
...

Note: This warning does not always appear during /detect call.

@ZenulAbidin
Contributor

ZenulAbidin commented Mar 4, 2022

The failures occur because the __lang_codes list, which holds the supported language codes, is empty.

After inserting a pdb breakpoint inside the detect_languages() function:

$ python main.py                  
Updating language models
Found 54 models
Downloading Arabic → English (1.0) ...
Downloading Azerbaijani → English (1.5) ...
Downloading Chinese → English (1.1) ...
... Output trimmed here
Serving on http://localhost:5000
> /home/zenulabidin/Documents/LibreTranslate/app/language.py(16)detect_languages()
-> is_batch = True
(Pdb) l
 11  
 12  
 13     def detect_languages(text):
 14         # detect batch processing
 15         if isinstance(text, list):
 16  ->         is_batch = True
 17         else:
 18             is_batch = False
 19             text = [text]
 20  
 21         # get the candidates
(Pdb) print(__lang_codes)
[]
(Pdb) 

Now we just have to figure out why the list is empty.

@ZenulAbidin
Contributor

ZenulAbidin commented Mar 4, 2022

Apparently, argostranslate's load_installed_languages() was being called at the top level of the module instead of inside a function, so it ran at import time, before the language models were loaded, and of course returned an empty list. Wrapping the call inside a function resolved this issue.

(this is definitely a bug, someone should edit the label on this issue)
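A minimal sketch of the fix just described, deferring the lookup into a function so it runs only after the models are installed (illustrative names, not the actual patch):

```python
installed = []  # stand-in for argostranslate's installed-languages registry

def load_installed_languages():
    return list(installed)

def get_lang_codes():
    # resolved lazily, on each call, instead of once at import time
    return load_installed_languages()

installed.extend(["en", "fr"])
print(get_lang_codes())  # ['en', 'fr']
```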

@pierotofy pierotofy added bug Something isn't working and removed possible bug labels Mar 4, 2022
@pierotofy
Member

Should be fixed by #219
