Description
[x] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
I'm encountering issues when trying to evaluate my RAG system using Ragas. The `answer_correctness` and `answer_similarity` metrics calculate correctly, but other metrics like `context_recall`, `faithfulness`, and `context_precision` consistently result in `nan`. The evaluation also logs frequent `RagasOutputParserException` and `TimeoutError` errors.
I suspect this is related to the choice of LLM (likely a local model via Ollama, qwen2.5:14b, as indicated by `init_ragas_ollama_components`) or to how Ragas parses the output from this specific LLM.
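Roughly, `init_ragas_ollama_components` wraps the Ollama chat model and embeddings in Ragas' LangChain wrappers. A simplified sketch of the idea (the `args` attribute names here are illustrative, not my exact code):

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper

def init_ragas_ollama_components(args):
    # Local chat model served by Ollama; Ragas drives all metric
    # prompts through this wrapper.
    llm = ChatOllama(model="qwen2.5:14b", base_url=args.ollama_url)
    # Embeddings back the similarity-based metrics.
    embeddings = OllamaEmbeddings(model=args.embed_model, base_url=args.ollama_url)
    return LangchainEmbeddingsWrapper(embeddings), LangchainLLMWrapper(llm)
```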
Code Snippet
```python
import json
from pathlib import Path

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_similarity,
    context_precision,
    context_recall,
    faithfulness,
)

# args and init_ragas_ollama_components are defined elsewhere in my script.
with open(Path(args.base_dir) / "retrival_results.json", encoding="utf-8") as f:
    retrieval_results = json.load(f)

embed_wrapper, llm_wrapper = init_ragas_ollama_components(args)
metrics = [answer_correctness, answer_similarity, context_recall, faithfulness, context_precision]

# Point every metric at the local LLM / embeddings wrappers.
for metric in metrics:
    if hasattr(metric, "llm"):
        metric.llm = llm_wrapper
    if hasattr(metric, "embeddings"):
        metric.embeddings = embed_wrapper

modes = ["naive", "local", "global"] if args.mode == "all" else [args.mode]
all_results = {}
for mode in modes:
    samples = retrieval_results.get(mode, [])
    questions, ground_truths, predictions, contexts = [], [], [], []
    for sample in samples:
        questions.append(sample.get("question", ""))
        ground_truths.append(sample.get("ground_truth", ""))
        predictions.append(sample.get("prediction", ""))
        contexts.append([sample.get("retrieval_context", "")])
    dataset = Dataset.from_dict({
        "question": questions,
        "ground_truth": ground_truths,
        "answer": predictions,
        "contexts": contexts,
    })
    all_results[mode] = evaluate(dataset, metrics=metrics)
```
Error Messages:
```
Evaluating: 10%|██████████████ | 5/50 [01:41<20:45, 27.69s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[12]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 18%|█████████████████████████▏ | 9/50 [02:48<10:41, 15.66s/it]ERROR:ragas.executor:Exception raised in Job[2]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[3]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[4]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[5]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[7]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[8]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[9]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[13]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[14]: TimeoutError()
Evaluating: 38%|████████████████████████████████████████████████████▊ | 19/50 [03:01<01:37, 3.13s/it]ERROR:ragas.executor:Exception raised in Job[17]: TimeoutError()
Evaluating: 42%|██████████████████████████████████████████████████████████▍ | 21/50 [03:02<01:14, 2.56s/it]ERROR:ragas.executor:Exception raised in Job[18]: TimeoutError()
Evaluating: 46%|███████████████████████████████████████████████████████████████▉ | 23/50 [03:03<00:57, 2.12s/it]ERROR:ragas.executor:Exception raised in Job[19]: TimeoutError()
Evaluating: 48%|██████████████████████████████████████████████████████████████████▋ | 24/50 [03:04<00:49, 1.92s/it]ERROR:ragas.executor:Exception raised in Job[20]: TimeoutError()
Evaluating: 50%|█████████████████████████████████████████████████████████████████████▌ | 25/50 [04:41<07:17, 17.52s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[32]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 54%|███████████████████████████████████████████████████████████████████████████ | 27/50 [05:24<06:47, 17.74s/it]ERROR:ragas.executor:Exception raised in Job[22]: TimeoutError()
Evaluating: 56%|█████████████████████████████████████████████████████████████████████████████▊ | 28/50 [05:33<05:47, 15.77s/it]ERROR:ragas.executor:Exception raised in Job[23]: TimeoutError()
Evaluating: 58%|████████████████████████████████████████████████████████████████████████████████▌ | 29/50 [05:46<05:15, 15.04s/it]ERROR:ragas.executor:Exception raised in Job[24]: TimeoutError()
Evaluating: 60%|███████████████████████████████████████████████████████████████████████████████████▍ | 30/50 [05:48<03:52, 11.63s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[34]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 64%|████████████████████████████████████████████████████████████████████████████████████████▉ | 32/50 [05:55<02:15, 7.53s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[29]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 66%|███████████████████████████████████████████████████████████████████████████████████████████▋ | 33/50 [05:58<01:46, 6.27s/it]ERROR:ragas.executor:Exception raised in Job[25]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[27]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[28]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[30]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[33]: TimeoutError()
Evaluating: 68%|██████████████████████████████████████████████████████████████████████████████████████████████▌ | 34/50 [06:00<01:16, 4.78s/it]ERROR:ragas.executor:Exception raised in Job[35]: TimeoutError()
Evaluating: 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 39/50 [06:01<00:19, 1.79s/it]ERROR:ragas.executor:Exception raised in Job[37]: TimeoutError()
Evaluating: 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 40/50 [06:02<00:16, 1.60s/it]ERROR:ragas.executor:Exception raised in Job[38]: TimeoutError()
Evaluating: 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 41/50 [06:03<00:13, 1.51s/it]ERROR:ragas.executor:Exception raised in Job[39]: TimeoutError()
Evaluating: 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 43/50 [06:16<00:27, 3.92s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[49]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 44/50 [07:52<02:40, 26.72s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[43]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 45/50 [08:17<02:11, 26.36s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[47]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 47/50 [08:24<00:46, 15.39s/it]ERROR:ragas.executor:Exception raised in Job[42]: TimeoutError()
Evaluating: 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 48/50 [08:24<00:22, 11.13s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[48]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 49/50 [08:31<00:09, 9.88s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[44]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [08:31<00:00, 10.23s/it]
{'answer_correctness': 0.7917, 'semantic_similarity': 0.7538, 'context_recall': nan, 'faithfulness': nan, 'context_precision': nan}
INFO:main:=== Evaluation finished ===
```
Additional context:
- I am using a local LLM, likely via Ollama (based on the `init_ragas_ollama_components` function name).
- The errors suggest the LLM might not be returning output in the format Ragas' parsers expect, or it might be timing out during evaluation of certain metrics.
- `answer_correctness` and `answer_similarity` rely more on embeddings and simpler LLM calls, while `context_recall`, `faithfulness`, and `context_precision` typically require the LLM to perform more complex reasoning and produce structured output (such as JSON). This difference in required capability might explain why some metrics work and others fail.
- Could there be specific prompt or parsing issues with certain local LLMs? Or are there parameters I need to adjust for timeouts or parser robustness? (A sketch of what I have in mind follows this list.)
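One thing I plan to try is passing a more generous `RunConfig` to `evaluate()`. A sketch, with guessed values (defaults vary by Ragas version):

```python
from ragas import evaluate
from ragas.run_config import RunConfig

# More forgiving settings for a slow local model: a longer per-job
# timeout, a bigger retry budget, and less concurrency so the single
# Ollama instance isn't flooded with parallel requests.
run_config = RunConfig(
    timeout=600,
    max_retries=15,
    max_workers=4,
)
results = evaluate(dataset, metrics=metrics, run_config=run_config)
```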
Activity
openhands-ai commented on May 17, 2025
I'm on it! jjmachan can track my progress at all-hands.dev
jjmachan commented on May 17, 2025
hey @WangAo-0, let me take a deeper look at this, but in the meantime I've set up the @OpenHands agent on it and will check back with what it has to say.
openhands-ai commented on May 17, 2025
I'm on it! jjmachan can track my progress at all-hands.dev
Fix timeout and output parsing issues for local models
openhands-ai commented on May 17, 2025
I've successfully fixed issue #2044 in the ragas repository, which was about evaluation errors related to output parsing and timeout issues for context/faithfulness metrics when using local models like Ollama.
Here's a summary of the changes I made:
- Added support for local models in RunConfig: added an `is_local_model` flag to the `RunConfig` class; when `True`, the timeout is automatically increased to 600 seconds (10 minutes) to accommodate slower local models.
- Improved JSON output parsing: enhanced the `extract_json` function to better handle common issues with local model outputs (see the sketch after this list).
- Made the `RagasOutputParser` more robust.
- Added tests.
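For illustration, a lenient extraction in that spirit might look like this (a sketch of the approach, not the actual code in the PR):

```python
import json

def extract_json_lenient(text: str) -> dict:
    """Pull the first JSON object out of a noisy LLM response.

    Local models often wrap JSON in markdown fences or surround it
    with prose; slicing from the first '{' to the last '}' tolerates
    both without requiring the reply to be pure JSON.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])

# e.g. extract_json_lenient('Sure! {"verdict": 1} Hope that helps.')
# returns {'verdict': 1}
```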
All changes have been successfully pushed to the remote branch, and I've created a pull request (#2045) that references and fixes issue #2044. The PR includes a detailed description of the changes and mentions that it fixes the issue.
The PR can be found at: #2045
Fix issue #2044: Improve output parsing and timeout for local LLMs
openhands-ai commented on May 17, 2025
I've successfully fixed issue #2044 in the Ragas repository, which was related to evaluation errors when using local LLMs like Ollama with qwen2.5:14b. The issue had two main problems: output parsing failures (`RagasOutputParserException`) and request timeouts (`TimeoutError`).
Changes Made:
- Increased default timeout
- Enhanced JSON extraction
- Improved error handling
- Added comprehensive tests
Pull Request:
I've created PR #2046, which fixes issue #2044. The PR includes all the changes mentioned above and has been pushed to the remote branch `openhands/fix-local-llm-output-parsing`. The PR description includes a detailed explanation of the changes and the requested link for @jjmachan to continue refining the PR.
Jarzab3kDH commented on May 21, 2025
I've also seen errors like these when running GCP models via LangChain. Is anyone working on those issues?