
Enforcing a specific language when uploading documents #2016


Open · ssztemberg opened this issue Mar 3, 2025 · 5 comments

@ssztemberg

How to enforce a specific language when uploading a document?

I noticed that when uploading a document, the language always (at least for me) defaults to "eng". In the code, I found the following snippet, which appears to be the only reference to language settings:

import uuid
from r2r import R2RClient

client = R2RClient("http://localhost:7272")  # or your deployment's URL

# Random suffixes keep the sample documents distinct across runs
random_1, random_2, random_3 = (uuid.uuid4().hex[:8] for _ in range(3))

docs = [
    (f"Aristotle was a philosopher {random_1}", "English"),
    (f"Aristóteles fue un filósofo {random_2}", "Spanish"),
    (f"アリストテレスは哲学者でした {random_3}", "Japanese"),
]
doc_ids = []
for text, lang in docs:
    doc_id = client.documents.create(
        raw_text=text,
        metadata={"language": lang},
    ).results.document_id
    doc_ids.append(doc_id)

# Query in different languages
queries = [
    "Who was Aristotle?",
    "¿Quién fue Aristóteles?",
    "アリストテレスとは誰でしたか?",
]

However, even when adding metadata with "language", the chunk metadata still contains:
"unstructuredLanguages": "eng"

This also affects the results in Entities, where everything seems to be processed in English.

Is there a way to explicitly enforce a different language for documents? I know that Unstructured, for example, may require adding additional dependencies like Tesseract OCR for different languages.

For now, I have manually set "language": "Spanish" and "unstructuredLanguages": "spa" in the metadata, which does update the chunk metadata, but I'm unsure if this is the correct approach. Any guidance would be appreciated!
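Here is a minimal sketch of that workaround, assuming the same R2RClient setup as in the snippet above. Note that "unstructuredLanguages" is just the field I observed in the chunk metadata after ingestion, not a documented setting:

# Workaround sketch: set both the human-readable label and the code that
# Unstructured writes into chunk metadata ("spa" is the ISO 639-2 code).
doc_id = client.documents.create(
    raw_text="Aristóteles fue un filósofo",
    metadata={
        "language": "Spanish",
        "unstructuredLanguages": "spa",
    },
).results.document_id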

@NolanTrem
Collaborator

Hey @ssztemberg, this is a great request.

I'm not sure how Unstructured determines the language. I'm going to look at moving towards their official Docker image, which may very well fix this.

We also don't yet support graph building in other languages, but it is on our roadmap and should be supported soon!

@ryanrib14

Hey @ssztemberg, any updates? Did you manage to ingest and query in Spanish? I have a similar need, working with Portuguese docs.

@ryanrib14

@ssztemberg Which provider are you currently using? I tried the "quickstart" free tier of SciPhi Cloud for the first time, ingested Portuguese content, and ended up with my chunks in English. After looking at the code for a few hours, I see that R2R uses the VLMPDFParser by default to parse PDF files, and this parser uses an English prompt. Maybe changing the prompt, or just specifying your language in it, could resolve this. When I overrode the PDF parser to "unstructured", my chunks came out in the native language; see the sketch below. Tomorrow I will try to run the entire project locally to modify the prompt and test. Does this make sense, @NolanTrem?
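A sketch of the override I mean. Whether ingestion_config can switch the provider per request, and the exact key names, are assumptions on my part; the server-side equivalent is setting the [ingestion] provider in your r2r.toml, so check your R2R version's reference:

from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Route this document through Unstructured instead of the default VLMPDFParser.
# "unstructured_local" and the ingestion_config shape are assumptions here;
# verify them against your deployment's configuration reference.
doc_id = client.documents.create(
    file_path="relatorio.pdf",  # hypothetical Portuguese PDF
    ingestion_config={"provider": "unstructured_local"},
).results.document_id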

@SuperPauly
Contributor

As long as the model supports that language, it should work. AFAIK the LLM is used for chunking, so you can use a system prompt like this:

from litellm import completion

# Define the system prompt with language instruction
messages = [
    {
        "role": "system",
        "content": "Please respond in Portuguese."  # Change to "Por favor, responda em português." for Portuguese
    },
    {
        "role": "user",
        "content": "Qual é a capital do Brasil?"  # "What is the capital of Brazil?"
    }
]

# Make the completion call
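# (litellm reads provider credentials from the environment,
#  e.g. OPENAI_API_KEY for this model)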
response = completion(model="gpt-3.5-turbo", messages=messages)

# Extract and print the assistant's reply
assistant_reply = response['choices'][0]['message']['content']
print(assistant_reply)

That should work, or at least improve accuracy.

But you raise a valid point: the model might not be 100% accurate, and English prompts used in other parts of the R2R system may confuse the LLM. See for example https://github.com/SciPhi-AI/R2R/blob/main/py/core/agent/rag.py. System prompt tuning might also be temperamental across different LLM providers.
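If you would rather not patch rag.py, something like the following might steer the answer language at query time. This is only a sketch: task_prompt_override is an assumption (its name and availability vary across R2R versions), so check your client's signature before relying on it.

from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Ask the RAG pipeline to answer in Portuguese regardless of the default
# English prompt. task_prompt_override is assumed here; some versions expose
# a prompts API or a different parameter name instead.
response = client.retrieval.rag(
    query="Quem foi Aristóteles?",  # "Who was Aristotle?"
    task_prompt_override=(
        "Responda sempre em português, usando apenas o contexto fornecido."
    ),
)
print(response.results.generated_answer)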

@camilarmoraes

> @ssztemberg Which provider are you currently using? [...] When I overrode the PDF parser to "unstructured", my chunks came out in the native language.

@ryanrib14 Hello, I'm experiencing the same problem when extracting entities to build graphs and performing a search with R2R retrieval. Did you make any further progress with these changes?
