
The docker image on Docker Hub for version 1.2.7 (latest) has broken auto language detect #217

Closed
MatsBjerin opened this issue Feb 22, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@MatsBjerin

To replicate

  1. Create this docker-compose file:
version: "3"

services:
  libretranslate:
    image: libretranslate/libretranslate:latest
    container_name: libretranslate
    stdin_open: true
    tty: true
    restart: unless-stopped
    ports:
      - 5000:5000
  2. Start it with: "docker-compose up -d"

  3. Then run this:
    curl -X POST -H "Content-Type: application/json" -d '{"q": "Ciao!", "source": "auto", "target": "en"}' localhost:5000/translate

Language auto-detection will fail and no translation is made.
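For reference, the same request can be made from Python using only the standard library (a sketch; it assumes the container from the compose file above is listening on localhost:5000):

```python
import json
import urllib.request

def translate(q, source="auto", target="en",
              url="http://localhost:5000/translate"):
    """POST a translation request to a local LibreTranslate instance."""
    payload = json.dumps({"q": q, "source": source, "target": target}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# translate("Ciao!")  # with the broken image, auto-detect fails and no translation comes back
```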

@cristeigabriel

I can't try it myself, but I recommend that in any 'production' sort of environment you make your POST requests through a module whose errors you can handle. My guess is that a character in the input is throwing off the language auto-detection. Personally, I applied the following changes to 'language.py' in the LibreTranslate app; the error should then be visible in the host console.

import regex

RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

def detect_languages(text):
    # detect batch processing
    if isinstance(text, list):
        is_batch = True
    else:
        is_batch = False
        text = [text]

    # get the candidates
    candidates = []
    for _t in text:
        t = remove_bad_chars(_t)

I borrowed this solution from someone else who had the same issue long ago; sadly I didn't keep a reference to the post. Nonetheless, the fix has held up just fine after long periods of testing.
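For anyone who would rather avoid the third-party regex dependency, the same stripping can be sketched with the standard library, dropping characters whose Unicode general category is Cc (control) or Cs (surrogate), which is what `\p{Cc}|\p{Cs}` matches above:

```python
import unicodedata

def remove_bad_chars(text):
    # drop control (Cc) and surrogate (Cs) characters, as the regex version does
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Cc", "Cs")
    )

print(remove_bad_chars("Ciao!\x00\x1b"))  # prints "Ciao!"
```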

@cristeigabriel

If the above fixes the issue, I'm willing to make a PR (which would, of course, credit the solution's original poster).

@MatsBjerin
Author

MatsBjerin commented Feb 23, 2022

Thank you Cristei,

The issue, however, affects all text: it does not matter whether the input is just "Ciao" or "Hello", detection still fails.
The curly braces are not part of the text; they are just the JSON wrappers required by the POST format and should never reach the actual language engine.

Also, we see exactly the same issue when performing the call through the Python library suggested on the LibreTranslate page.

The issue appeared in the latest release. However, the previous version, tagged "main", turns out to show the same behaviour.

I am guessing this is due to some change in the language models that get downloaded: I believe the "main" Docker image itself (not just "latest") is unchanged, yet its behaviour has changed since just a week ago.

We have worked around the issue by using another language-detection engine and specifying the source and target languages explicitly when calling LibreTranslate.

We are of course still hoping the actual problem will be fixed at some point as most users will probably be affected.

Best regards,
Mats

@cristeigabriel

I forgot that I hadn't synced my repository with upstream, and the issue you are facing is very similar to one I ran into myself, hence the response. My apologies.

@MatsBjerin
Author

No worries! I appreciate you looking into this anyhow!
Best, /Mats

@ZenulAbidin
Contributor

ZenulAbidin commented Mar 3, 2022

I'm sure this is related to this issue.

When running the LibreTranslate docker instance locally, /detect totally fails to detect the language.

French text copied from random website:

curl -X POST "http://localhost:5000/detect" -H  "accept: application/json" -H  "Content-Type: application/x-www-form-urlencoded" -d "q=Voil%C3%A0%20pourquoi%2C%20je%20n'ach%C3%A9terais%20pas%20d'Asics%20d'occasion%20o%C3%B9%20m%C3%AAme%20neuve%20sauf%20pour%20ceux%20qui%20veulent%20jeter%20leur%20argent%20par%20la%20fen%C3%AAtre" -i
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Authorization, Content-Type
Access-Control-Allow-Methods: GET, POST
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Authorization
Access-Control-Max-Age: 1728000
Content-Length: 37
Content-Type: application/json
Date: Thu, 03 Mar 2022 10:29:14 GMT
Server: waitress

[{"confidence":0.0,"language":"en"}]

Strangely, the libretranslate.org instance reports the language correctly as "fr".

/translate works properly though.

This applies to both "latest" and "main" tags.
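For reference, the percent-encoded q parameter in the curl command above decodes to plain French, so the input itself is well-formed:

```python
from urllib.parse import unquote

q = ("Voil%C3%A0%20pourquoi%2C%20je%20n'ach%C3%A9terais%20pas%20d'Asics"
     "%20d'occasion%20o%C3%B9%20m%C3%AAme%20neuve%20sauf%20pour%20ceux"
     "%20qui%20veulent%20jeter%20leur%20argent%20par%20la%20fen%C3%AAtre")

# "Voilà pourquoi, je n'achéterais pas d'Asics d'occasion où même neuve
#  sauf pour ceux qui veulent jeter leur argent par la fenêtre"
print(unquote(q))
```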

@MatsBjerin
Author

MatsBjerin commented Mar 3, 2022 via email

@ZenulAbidin
Contributor

ZenulAbidin commented Mar 3, 2022

Also, it looks like the polyglot module used internally by LibreTranslate reports the detected language and confidence correctly, so it's not as though an ML model inside the container is untrained or anything like that.

(To reproduce: create a virtualenv and install everything in requirements.txt; you may have to install ctranslate2 explicitly to resolve dependency issues.)

Python 3.8.3 (default, Jul  2 2020, 16:21:59) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from polyglot.detect import Detector
>>> arabic_text = u"""
... أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
... الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
... انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
... والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
... """
>>> detector = Detector(arabic_text)
>>> print(detector.language)
name: Arabic      code: ar       confidence:  99.0 read bytes:   907
>>> 

Arabic text from https://polyglot.readthedocs.io/en/latest/Detection.html

Even the French text I called LibreTranslate with returns the correct output when calling polyglot's Detector directly:

>>> french_text = u"""
... Voilà pourquoi, je n'achéterais pas d'Asics d'occasion où même neuve sauf pour ceux qui veulent jeter leur argent par la fenêtre
... """
>>> french_detector = Detector(french_text)
>>> print(french_detector.language)
name: French      code: fr       confidence:  99.0 read bytes:   862
>>> 

This indicates that the Detector is being passed additional parameters that are hurting its reliability.

This hypothesis is backed by these warnings, printed to the console by the LibreTranslate daemon when calling /detect; they might offer a good file and line number to trace:

WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
...

Note: This warning does not always appear during /detect call.

@ZenulAbidin
Contributor

ZenulAbidin commented Mar 4, 2022

The failures occur because the __lang_codes list, which holds the supported language codes, is empty.

After inserting a pdb breakpoint inside the detect_languages() function:

$ python main.py                  
Updating language models
Found 54 models
Downloading Arabic → English (1.0) ...
Downloading Azerbaijani → English (1.5) ...
Downloading Chinese → English (1.1) ...
... Output trimmed here
Serving on http://localhost:5000
> /home/zenulabidin/Documents/LibreTranslate/app/language.py(16)detect_languages()
-> is_batch = True
(Pdb) l
 11  
 12  
 13     def detect_languages(text):
 14         # detect batch processing
 15         if isinstance(text, list):
 16  ->         is_batch = True
 17         else:
 18             is_batch = False
 19             text = [text]
 20  
 21         # get the candidates
(Pdb) print(__lang_codes)
[]
(Pdb) 

Now we just have to figure out why the list is empty.

@ZenulAbidin
Contributor

ZenulAbidin commented Mar 4, 2022

Apparently, argostranslate's load_installed_languages() was being called at the top level of the module instead of inside a function, so it ran at import time, before the language models were loaded, and of course returned an empty list. Wrapping the call inside a function resolved this issue.

(this is definitely a bug, someone should edit the label on this issue)
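A minimal sketch of the fix just described, deferring the lookup into a function so it runs only after the models are installed (illustrative names, not the actual patch):

```python
installed = []  # stand-in for argostranslate's installed-languages registry

def load_installed_languages():
    return list(installed)

def get_lang_codes():
    # resolved lazily, on each call, instead of once at import time
    return load_installed_languages()

installed.extend(["en", "fr"])
print(get_lang_codes())  # ['en', 'fr']
```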

@pierotofy pierotofy added bug Something isn't working and removed possible bug labels Mar 4, 2022
@pierotofy
Member

Should be fixed by #219
