
Evaluation Errors: Output Parsing and Timeout Issues for Context/Faithfulness Metrics #2044

@WangAo-0

Description

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
I'm encountering issues when trying to evaluate my RAG system using Ragas. The answer_correctness and answer_similarity metrics calculate correctly, but other metrics like context_recall, faithfulness, and context_precision consistently result in nan. The evaluation process also shows frequent RagasOutputParserException and TimeoutError.

I suspect this is related to the choice of LLM (a local model served via Ollama, qwen2.5:14b, as the name init_ragas_ollama_components suggests) or to how Ragas parses the output from this specific LLM.

Code Snippet

    # Imports assumed from the rest of the script
    import json
    from pathlib import Path

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_correctness,
        answer_similarity,
        context_precision,
        context_recall,
        faithfulness,
    )

    with open(Path(args.base_dir) / "retrival_results.json", encoding="utf-8") as f:
        retrieval_results = json.load(f)

    embed_wrapper, llm_wrapper = init_ragas_ollama_components(args)
    metrics = [answer_correctness, answer_similarity, context_recall, faithfulness, context_precision]

    # Point every metric at the local LLM / embedding wrappers
    for metric in metrics:
        if hasattr(metric, "llm"):
            metric.llm = llm_wrapper
        if hasattr(metric, "embeddings"):
            metric.embeddings = embed_wrapper

    modes = ['naive', 'local', 'global'] if args.mode == 'all' else [args.mode]
    all_results = {}

    for mode in modes:
        samples = retrieval_results.get(mode, [])
        questions = []
        ground_truths = []
        predictions = []
        contexts = []

        for sample in samples:
            questions.append(sample.get("question", ""))
            ground_truths.append(sample.get("ground_truth", ""))
            predictions.append(sample.get("prediction", ""))
            contexts.append([sample.get("retrieval_context", "")])

        dataset = Dataset.from_dict({
            "question": questions,
            "ground_truth": ground_truths,
            "answer": predictions,
            "contexts": contexts
        })
        results = evaluate(dataset, metrics=metrics)
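
A rough sketch of what init_ragas_ollama_components does (the function body, the args attribute names, and the embedding model below are illustrative assumptions, not the exact implementation):

    # Illustrative only: one way init_ragas_ollama_components could be built.
    from langchain_ollama import ChatOllama, OllamaEmbeddings
    from ragas.embeddings import LangchainEmbeddingsWrapper
    from ragas.llms import LangchainLLMWrapper

    def init_ragas_ollama_components(args):
        # args.ollama_url and the embedding model name are assumptions
        llm = ChatOllama(model="qwen2.5:14b", base_url=args.ollama_url, temperature=0)
        embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url=args.ollama_url)
        return LangchainEmbeddingsWrapper(embeddings), LangchainLLMWrapper(llm)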

Error Messages:

Evaluating: 10%|██████████████ | 5/50 [01:41<20:45, 27.69s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[12]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 18%|█████████████████████████▏ | 9/50 [02:48<10:41, 15.66s/it]ERROR:ragas.executor:Exception raised in Job[2]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[3]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[4]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[5]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[7]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[8]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[9]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[13]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[14]: TimeoutError()
Evaluating: 38%|████████████████████████████████████████████████████▊ | 19/50 [03:01<01:37, 3.13s/it]ERROR:ragas.executor:Exception raised in Job[17]: TimeoutError()
Evaluating: 42%|██████████████████████████████████████████████████████████▍ | 21/50 [03:02<01:14, 2.56s/it]ERROR:ragas.executor:Exception raised in Job[18]: TimeoutError()
Evaluating: 46%|███████████████████████████████████████████████████████████████▉ | 23/50 [03:03<00:57, 2.12s/it]ERROR:ragas.executor:Exception raised in Job[19]: TimeoutError()
Evaluating: 48%|██████████████████████████████████████████████████████████████████▋ | 24/50 [03:04<00:49, 1.92s/it]ERROR:ragas.executor:Exception raised in Job[20]: TimeoutError()
Evaluating: 50%|█████████████████████████████████████████████████████████████████████▌ | 25/50 [04:41<07:17, 17.52s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[32]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 54%|███████████████████████████████████████████████████████████████████████████ | 27/50 [05:24<06:47, 17.74s/it]ERROR:ragas.executor:Exception raised in Job[22]: TimeoutError()
Evaluating: 56%|█████████████████████████████████████████████████████████████████████████████▊ | 28/50 [05:33<05:47, 15.77s/it]ERROR:ragas.executor:Exception raised in Job[23]: TimeoutError()
Evaluating: 58%|████████████████████████████████████████████████████████████████████████████████▌ | 29/50 [05:46<05:15, 15.04s/it]ERROR:ragas.executor:Exception raised in Job[24]: TimeoutError()
Evaluating: 60%|███████████████████████████████████████████████████████████████████████████████████▍ | 30/50 [05:48<03:52, 11.63s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[34]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 64%|████████████████████████████████████████████████████████████████████████████████████████▉ | 32/50 [05:55<02:15, 7.53s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[29]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 66%|███████████████████████████████████████████████████████████████████████████████████████████▋ | 33/50 [05:58<01:46, 6.27s/it]ERROR:ragas.executor:Exception raised in Job[25]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[27]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[28]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[30]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[33]: TimeoutError()
Evaluating: 68%|██████████████████████████████████████████████████████████████████████████████████████████████▌ | 34/50 [06:00<01:16, 4.78s/it]ERROR:ragas.executor:Exception raised in Job[35]: TimeoutError()
Evaluating: 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 39/50 [06:01<00:19, 1.79s/it]ERROR:ragas.executor:Exception raised in Job[37]: TimeoutError()
Evaluating: 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 40/50 [06:02<00:16, 1.60s/it]ERROR:ragas.executor:Exception raised in Job[38]: TimeoutError()
Evaluating: 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 41/50 [06:03<00:13, 1.51s/it]ERROR:ragas.executor:Exception raised in Job[39]: TimeoutError()
Evaluating: 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 43/50 [06:16<00:27, 3.92s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[49]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 44/50 [07:52<02:40, 26.72s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[43]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 45/50 [08:17<02:11, 26.36s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[47]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 47/50 [08:24<00:46, 15.39s/it]ERROR:ragas.executor:Exception raised in Job[42]: TimeoutError()
Evaluating: 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 48/50 [08:24<00:22, 11.13s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[48]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 49/50 [08:31<00:09, 9.88s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[44]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [08:31<00:00, 10.23s/it]
{'answer_correctness': 0.7917, 'semantic_similarity': 0.7538, 'context_recall': nan, 'faithfulness': nan, 'context_precision': nan}
INFO:main:=== Evaluation finished ===

Additional context:

  • I am using a local LLM, likely via Ollama (based on the init_ragas_ollama_components function name).
  • The errors suggest the LLM might not be returning output in the expected format for Ragas' parsers, or it might be timing out during the evaluation process for certain metrics.
  • answer_correctness and answer_similarity rely more on embeddings and simpler LLM calls, while context_recall, faithfulness, and context_precision typically require the LLM to perform more complex reasoning and structured output generation (like JSON). This difference in required LLM capability might explain why some metrics work and others fail.
  • Could there be specific prompt or parsing issues with certain local LLMs? Or are there parameters I need to adjust for timeouts or parser robustness? (See the configuration sketch below for the kind of knobs I mean.)
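
As a concrete example of such knobs, a minimal sketch (assuming the current RunConfig API) that raises the per-call timeout and lowers concurrency before calling evaluate:

    from ragas import evaluate
    from ragas.run_config import RunConfig

    run_config = RunConfig(
        timeout=600,     # allow up to 10 minutes per LLM call for slow local models
        max_workers=4,   # limit concurrency so the local Ollama server is not overloaded
    )
    results = evaluate(dataset, metrics=metrics, run_config=run_config)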

Activity

openhands-ai commented on May 17, 2025

I'm on it! jjmachan can track my progress at all-hands.dev

jjmachan (Member) commented on May 17, 2025

hey @WangAo-0, let me take a deeper look at this, but in the meantime I've set up the @OpenHands agent on it and will check back with what it has to say

openhands-ai commented on May 17, 2025

I'm on it! jjmachan can track my progress at all-hands.dev

openhands-ai commented on May 17, 2025

I've successfully fixed issue #2044 in the ragas repository, which was about evaluation errors related to output parsing and timeout issues for context/faithfulness metrics when using local models like Ollama.

Here's a summary of the changes I made:

  1. Added support for local models in RunConfig:

    • Added a new is_local_model flag to RunConfig class
    • When this flag is set to True, the timeout is automatically increased to 600 seconds (10 minutes) to accommodate slower local models
  2. Improved JSON output parsing:

    • Enhanced the extract_json function to better handle common issues with local model outputs (see the illustrative sketch after this list):
      • Better detection of JSON in markdown code blocks
      • Converting single quotes to double quotes (common in local model outputs)
      • Removing trailing commas in arrays and objects
      • Added validation and auto-correction of malformed JSON
  3. Made the RagasOutputParser more robust:

    • Added multiple fallback mechanisms for parsing outputs
    • Implemented more aggressive JSON extraction for malformed outputs
    • Added better error handling for parsing failures
  4. Added tests:

    • Created tests to verify the JSON extraction improvements
    • Added tests for the local model timeout configuration
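
To illustrate the kind of cleanup described in point 2, here is a hypothetical standalone helper; it is not the actual extract_json shipped with Ragas or changed in the PR, just a sketch of the listed transformations:

    import json
    import re

    def extract_json_lenient(text: str):
        """Hypothetical helper illustrating the cleanup steps above;
        not the actual extract_json used by Ragas."""
        # Prefer JSON wrapped in a markdown code block, if present.
        fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
        if fenced:
            text = fenced.group(1)
        # Otherwise keep only the outermost {...} span.
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start:end + 1]
        # Drop trailing commas before closing brackets/braces.
        text = re.sub(r",\s*([\]}])", r"\1", text)
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Last resort: single quotes to double quotes (common in local model output).
            return json.loads(text.replace("'", '"'))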

All changes have been successfully pushed to the remote branch, and I've created a pull request (#2045) that references and fixes issue #2044. The PR includes a detailed description of the changes and mentions that it fixes the issue.

The PR can be found at: #2045

A commit referencing this issue was added on May 17, 2025:

Fix issue #2044: Improve output parsing and timeout for local LLMs

openhands-ai commented on May 17, 2025

I've successfully fixed issue #2044 in the Ragas repository, which was related to evaluation errors when using local LLMs like Ollama with qwen2.5:14b. The issue had two main problems:

  1. TimeoutError: Local LLMs were taking longer to process requests than the default timeout allowed
  2. RagasOutputParserException: The output parser was failing to handle various JSON formats produced by local LLMs

Changes Made:

  1. Increased default timeout:

    • Changed the default timeout in RunConfig from 180 to 300 seconds to accommodate slower local LLMs
  2. Enhanced JSON extraction:

    • Improved the extract_json function to handle various LLM output formats:
    • Added support for markdown code blocks
    • Added handling for single quotes in JSON
    • Added handling for trailing commas in JSON
    • Improved JSON extraction from text with surrounding content
  3. Improved error handling:

    • Enhanced RagasOutputParser with more robust error handling
    • Increased default retries from 1 to 3
    • Added detailed error messages with suggestions for working with local LLMs
  4. Added comprehensive tests:

    • Created tests for extract_json with various input formats (a sketch of such a test follows this list)
    • Added tests for RagasOutputParser fallback mechanism
    • Added tests for timeout configuration
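
As an illustration of point 4, a small pytest sketch exercising a lenient extraction helper such as the one sketched in the earlier comment (the helper name and module path are hypothetical, not part of Ragas):

    import pytest

    from json_utils import extract_json_lenient  # hypothetical module containing the helper sketched above

    @pytest.mark.parametrize(
        "raw",
        [
            '{"verdict": 1, "reason": "supported"}',                 # plain JSON
            '```json\n{"verdict": 1, "reason": "supported"}\n```',   # markdown fence
            '{"verdict": 1, "reason": "supported",}',                # trailing comma
        ],
    )
    def test_extract_json_lenient(raw):
        parsed = extract_json_lenient(raw)
        assert parsed["verdict"] == 1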

Pull Request:

I've created PR #2046, which fixes issue #2044. The PR includes all the changes mentioned above and has been pushed to the remote branch openhands/fix-local-llm-output-parsing.

The PR description includes a detailed explanation of the changes and mentions that it fixes issue #2044, along with the requested link for @jjmachan to continue refining the PR.

Jarzab3kDH commented on May 21, 2025

I've also seen errors like:

Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt extract_keyphrase_prompt failed to parse output: The output parser failed to parse the output including retries.
Traceback (most recent call last):
  File "/home/vscode/.local/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 86, in parse_result
    return parse_json_markdown(text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/langchain_core/utils/json.py", line 150, in parse_json_markdown
    return _parse_json(json_str, parser=parser)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/langchain_core/utils/json.py", line 166, in _parse_json
    return parser(json_str)
           ^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/langchain_core/utils/json.py", line 123, in parse_partial_json
    return json.loads(s, strict=strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vscode/.local/lib/python3.12/site-packages/ragas/prompt/pydantic_prompt.py", line 403, in parse_output_string
    result = super().parse(jsonstr)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/langchain_core/output_parsers/pydantic.py", line 84, in parse
    return super().parse(text)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 100, in parse
    return self.parse_result([Generation(text=text)])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/langchain_core/output_parsers/pydantic.py", line 68, in parse_result
    json_object = super().parse_result(result)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 89, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: 
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 

I'm running GCP models via LangChain.

Is anyone working on these issues?
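
For reference, a minimal sketch of wiring a GCP model into Ragas via LangChain with a longer timeout (the model name and configuration below are placeholders, not my exact setup):

    from langchain_google_vertexai import ChatVertexAI
    from ragas import evaluate
    from ragas.llms import LangchainLLMWrapper
    from ragas.run_config import RunConfig

    # Placeholder model; project/credentials are assumed to come from the environment.
    vertex_llm = ChatVertexAI(model_name="gemini-1.5-pro", temperature=0)

    # dataset and metrics defined as in the original snippet above
    results = evaluate(
        dataset,
        metrics=metrics,
        llm=LangchainLLMWrapper(vertex_llm),
        run_config=RunConfig(timeout=300, max_workers=4),
    )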


Metadata

Assignees: No one assigned

Labels: bug (Something isn't working), module-metrics (this is part of metrics module), openhands, question (Further information is requested)

Participants: @jjmachan, @WangAo-0, @Jarzab3kDH
