
Enforcing a specific language when uploading documents #2016


Open · ssztemberg opened this issue Mar 3, 2025 · 5 comments

@ssztemberg

How to enforce a specific language when uploading a document?

I noticed that when uploading a document, the language always (at least for me) defaults to "eng". In the code, I found the following snippet, which appears to be the only reference to language settings:

import uuid
from r2r import R2RClient

client = R2RClient("http://localhost:7272")  # or your deployment's URL

# Random suffixes keep the sample documents distinct across runs
random_1, random_2, random_3 = (uuid.uuid4().hex[:8] for _ in range(3))

docs = [
    (f"Aristotle was a philosopher {random_1}", "English"),
    (f"Aristóteles fue un filósofo {random_2}", "Spanish"),
    (f"アリストテレスは哲学者でした {random_3}", "Japanese"),
]
doc_ids = []
for text, lang in docs:
    doc_id = client.documents.create(
        raw_text=text,
        metadata={"language": lang},
    ).results.document_id
    doc_ids.append(doc_id)

# Query in different languages
queries = [
    "Who was Aristotle?",
    "¿Quién fue Aristóteles?",
    "アリストテレスとは誰でしたか?",
]

However, even when adding metadata with "language", the chunk metadata still contains:
"unstructuredLanguages": "eng"

This also affects the results in Entities, where everything seems to be processed in English.

Is there a way to explicitly enforce a different language for documents? I know that Unstructured, for example, may require adding additional dependencies like Tesseract OCR for different languages.

For now, I have manually set "language": "Spanish" and "unstructuredLanguages": "spa" in the metadata, which does update the chunk metadata, but I'm unsure if this is the correct approach. Any guidance would be appreciated!
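Here is a minimal sketch of that workaround, assuming the same R2RClient setup as in the snippet above. Note that "unstructuredLanguages" is just the field I observed in the chunk metadata after ingestion, not a documented setting:

# Workaround sketch: set both the human-readable label and the code that
# Unstructured writes into chunk metadata ("spa" is the ISO 639-2 code).
doc_id = client.documents.create(
    raw_text="Aristóteles fue un filósofo",
    metadata={
        "language": "Spanish",
        "unstructuredLanguages": "spa",
    },
).results.document_id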

@NolanTrem
Collaborator

Hey @ssztemberg, this is a great request.

I'm not sure how Unstructured determines the language. I'm going to look at moving towards their official Docker image, which may very well fix this.

We also don't yet support graph building in other languages, but it is on our roadmap and should be supported soon!

@ryanrib14

Hey @ssztemberg, any updates? Did you manage to ingest and query in Spanish? I have a similar need, working with Portuguese docs.

@ryanrib14

@ssztemberg Which provider are you currently using? I tried the "quickstart" free tier of SciPhi Cloud for the first time, ingested Portuguese content, and ended up with my chunks in English. After looking at the code for a few hours, I see that R2R uses the VLMPDFParser by default to parse PDF files, and this parser uses an English prompt. Maybe changing the prompt, or just specifying your language in it, could resolve this. When I overrode the PDF parser to "unstructured", my chunks came out in the native language; see the sketch below. Tomorrow I will try to run the entire project locally to modify the prompt and test. Does this make sense, @NolanTrem?
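A sketch of the override I mean. Whether ingestion_config can switch the provider per request, and the exact key names, are assumptions on my part; the server-side equivalent is setting the [ingestion] provider in your r2r.toml, so check your R2R version's reference:

from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Route this document through Unstructured instead of the default VLMPDFParser.
# "unstructured_local" and the ingestion_config shape are assumptions here;
# verify them against your deployment's configuration reference.
doc_id = client.documents.create(
    file_path="relatorio.pdf",  # hypothetical Portuguese PDF
    ingestion_config={"provider": "unstructured_local"},
).results.document_id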

@SuperPauly
Contributor

As long as the model supports that language, it should work. AFAIK the LLM is used for chunking, so you can use a system prompt like this:

from litellm import completion

# Define the system prompt with language instruction
messages = [
    {
        "role": "system",
        "content": "Please respond in Portuguese."  # Change to "Por favor, responda em português." for Portuguese
    },
    {
        "role": "user",
        "content": "Qual é a capital do Brasil?"  # "What is the capital of Brazil?"
    }
]

# Make the completion call
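# (litellm reads provider credentials from the environment,
#  e.g. OPENAI_API_KEY for this model)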
response = completion(model="gpt-3.5-turbo", messages=messages)

# Extract and print the assistant's reply
assistant_reply = response['choices'][0]['message']['content']
print(assistant_reply)

That should work, or at least improve accuracy.

But you raise a valid point: the model might not be 100% accurate, and English prompts used in other parts of the R2R system may confuse the LLM. See for example https://github.com/SciPhi-AI/R2R/blob/main/py/core/agent/rag.py. System prompt tuning might also be temperamental across different LLM providers.
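If you would rather not patch rag.py, something like the following might steer the answer language at query time. This is only a sketch: task_prompt_override is an assumption (its name and availability vary across R2R versions), so check your client's signature before relying on it.

from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Ask the RAG pipeline to answer in Portuguese regardless of the default
# English prompt. task_prompt_override is assumed here; some versions expose
# a prompts API or a different parameter name instead.
response = client.retrieval.rag(
    query="Quem foi Aristóteles?",  # "Who was Aristotle?"
    task_prompt_override=(
        "Responda sempre em português, usando apenas o contexto fornecido."
    ),
)
print(response.results.generated_answer)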

@camilarmoraes

> @ssztemberg Which provider are you currently using? [...] When I overrode the PDF parser to "unstructured", my chunks came out in the native language.

@ryanrib14 Hello, I'm experiencing the same problem when extracting entities to build graphs and performing a search with R2R retrieval. Did you make any further progress with these changes?
