Enforcing a specific language when uploading documents #2016
Comments
Hey @ssztemberg, this is a great request. I'm not sure how Unstructured is determining the language. I'm going to look at moving toward their official Docker image, which may very well fix this. We also don't yet support graph building in other languages, but it's on our roadmap and should be supported soon!
Hey @ssztemberg, any updates? Did you manage to get ingestion and querying working in Spanish? I have a similar need for working with Portuguese docs.
@ssztemberg Which provider are you currently using? I tried the "quickstart" free tier of SciPhi Cloud for the first time, ingesting Portuguese content, and I ended up with my chunks in English. After looking at the code for a few hours, I see that R2R uses the VLMPDFParser by default to parse PDF files, and this parser uses an English prompt. Maybe changing the prompt, or just specifying your language in it, could resolve this. When I overrode the PDF parser to "unstructured", my chunks came out in the native language. Tomorrow I will try to run the entire project locally to modify the prompt and test. Does this make sense, @NolanTrem?
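For anyone trying the same workaround, this is roughly how a per-ingestion parser override might look with the Python SDK. The `ingestion_config` / `parser_overrides` keys are assumptions on my part and may differ between R2R versions (the config-file route via r2r.toml may be the supported way to do this):

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Hypothetical sketch: ask the server to parse PDFs with the
# "unstructured" parser instead of the default VLM-based parser.
# The ingestion_config / parser_overrides keys are assumptions and
# may not match your R2R version.
client.documents.create(
    file_path="relatorio.pdf",
    ingestion_config={
        "parser_overrides": {"pdf": "unstructured"},
    },
)
```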
As long as the model supports that language, it should work. AFAIK the LLM is used to chunk, so a system prompt like the following should work, or at least increase accuracy:

```python
from litellm import completion

# Define the system prompt with a language instruction
messages = [
    {
        "role": "system",
        "content": "Please respond in Portuguese."  # or "Por favor, responda em português."
    },
    {
        "role": "user",
        "content": "Qual é a capital do Brasil?"  # "What is the capital of Brazil?"
    }
]

# Make the completion call
response = completion(model="gpt-3.5-turbo", messages=messages)

# Extract and print the assistant's reply
assistant_reply = response["choices"][0]["message"]["content"]
print(assistant_reply)
```

But you raise a valid point: the model might not be 100% accurate, and the English prompts used in other parts of the R2R system could confuse the LLM, for example in https://github.com/SciPhi-AI/R2R/blob/main/py/core/agent/rag.py. System prompt tuning might also be temperamental across different LLM providers.
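At query time you could also try forcing the answer language with a prompt override. The sketch below assumes the Python client exposes `retrieval.rag` with a `task_prompt_override` parameter and `{context}`/`{query}` placeholders; those names are assumptions and may differ in your R2R version:

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Hypothetical sketch: override the default (English) RAG task prompt
# so answers come back in Portuguese. The task_prompt_override name and
# the {context}/{query} placeholders are assumptions and may not match
# the actual prompt template used by your R2R version.
response = client.retrieval.rag(
    query="Qual é a capital do Brasil?",
    task_prompt_override=(
        "Responda em português, usando apenas o contexto fornecido.\n\n"
        "Contexto:\n{context}\n\nPergunta: {query}"
    ),
)
print(response)
```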
@ryanrib14 Hello, I'm experiencing the same problem when extracting entities to build graphs and performing search with R2R retrieval. Did you make any further progress with these changes?
How to enforce a specific language when uploading a document?
I noticed that when uploading a document, the language always (at least for me) defaults to "eng". In the code, I found a snippet that appears to be the only reference to language settings. However, even when adding "language" to the metadata, the chunk metadata still contains "unstructuredLanguages": "eng".
This also affects the results in Entities, where everything seems to be processed in English.
Is there a way to explicitly enforce a different language for documents? I know that Unstructured, for example, may require additional dependencies like Tesseract OCR language packs for different languages.
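For what it's worth, when calling Unstructured directly the language can be passed explicitly. This is a minimal sketch assuming a recent unstructured release where the partition functions accept a `languages` argument:

```python
from unstructured.partition.pdf import partition_pdf

# Minimal sketch: partition a Spanish PDF and tell Unstructured which
# language(s) to use for detection/OCR. The "languages" argument is
# available in recent unstructured releases (older ones used
# ocr_languages); Tesseract language packs such as tesseract-ocr-spa
# are still needed for OCR-based strategies.
elements = partition_pdf(
    filename="documento.pdf",
    languages=["spa"],
)

for element in elements[:5]:
    print(type(element).__name__, "-", element.text[:80])
```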
For now, I have manually set "language": "Spanish" and "unstructuredLanguages": "spa" in the metadata, which does update the chunk metadata, but I'm unsure if this is the correct approach. Any guidance would be appreciated!
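In case it helps others, this is roughly what my manual workaround looks like with the Python SDK; the `documents.create` call is based on my current setup and the exact method name may differ between R2R versions:

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Workaround sketch: attach language hints as plain metadata at ingestion
# time. These keys do show up in the chunk metadata, but whether the
# ingestion pipeline (Unstructured, graph extraction, etc.) actually
# honors them is exactly what I'm unsure about.
client.documents.create(
    file_path="informe.pdf",
    metadata={
        "language": "Spanish",
        "unstructuredLanguages": "spa",
    },
)
```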