
Implement EvalQueryEngineTool #11679

Merged
merged 6 commits into run-llama:main from eval-query-engine-tool Mar 25, 2024

Conversation

d-mariano
Contributor

@d-mariano d-mariano commented Mar 6, 2024

Description

Notice

I would like input on this PR from the llama-index team. If the team agrees with the need and approach, I will provide unit tests, documentation updates, and Google Colab notebooks as required.

Summary

This change provides a plug-and-play way to run on-the-fly evaluation of tool responses within an agent. Some areas I would like feedback on:

  1. Whether handling this at the tool level makes sense for the use case
  2. Whether this is a good approach for addressing the use case, and whether it aligns with the llama-index vision
  3. Anything else you can think of! <3

Implements the EvalQueryEngineTool (a usage sketch follows this list):

  • Inherits from QueryEngineTool
  • Uses a given evaluator to evaluate query engine responses
  • If the response fails evaluation:
    • The tool output is manipulated to deter the LLM from using it
    • The evaluator feedback is used as a reason for failure
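
To make the intended usage concrete, here is a minimal sketch of how the tool might be wired up. The import path, the `from_defaults` signature, and the choice of `RelevancyEvaluator` are assumptions based on this description, not confirmed against the merged code:

```python
# Hedged usage sketch -- import path, constructor signature, and the
# choice of evaluator are assumptions, not confirmed merged APIs.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import RelevancyEvaluator
from llama_index.core.tools.eval_query_engine import EvalQueryEngineTool

# Build an ordinary query engine over some local documents.
index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./data").load_data()
)

# Wrap it in the evaluating tool: it behaves like QueryEngineTool, but
# runs the evaluator on every response before the agent sees it.
tool = EvalQueryEngineTool.from_defaults(
    evaluator=RelevancyEvaluator(),
    query_engine=index.as_query_engine(),
    name="lyft_10k",
    description="Answers questions about Lyft's 2021 10-K filing.",
)

output = tool.call("What was Lyft's revenue in 2021?")
# On a failed evaluation, `content` carries the evaluator's feedback
# instead of the raw answer, deterring the LLM from using it.
print(output.content)
```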

Type of Change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests you ran to verify your changes. Provide instructions so we can reproduce them, and list any relevant details of your test configuration.

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 6, 2024
@logan-markewich
Collaborator

@d-mariano Yea I think I like this approach. If this was combined with an agent that had the chance to "retry" tool calls, it would be pretty powerful (and also helps avoid tracebacks when calling a tool)

@d-mariano
Contributor Author

> @d-mariano Yea I think I like this approach. If this was combined with an agent that had the chance to "retry" tool calls, it would be pretty powerful (and also helps avoid tracebacks when calling a tool)

@logan-markewich great! And totally agree, that's the idea down the road. I would like to test this first with the simple ReAct implementation, then with more complex use cases such as within a query pipeline.

Also curious what implications this has for a QueryPlan, but that doesn't need to be addressed today.

Okay, great. So with this I can implement some unit tests, of course... should I introduce this in the docs with a notebook example as well? <3
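
For reference, hooking the tool into a simple ReAct agent could look like the sketch below. `ReActAgent.from_tools` is an existing llama-index API at the time of this PR; `tool` is the EvalQueryEngineTool instance from the earlier sketch, and the model choice is arbitrary:

```python
# Hedged sketch: the proposed tool inside a ReAct loop. `tool` is the
# EvalQueryEngineTool from the earlier sketch; model choice is arbitrary.
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

agent = ReActAgent.from_tools(
    [tool],
    llm=OpenAI(model="gpt-4"),
    verbose=True,
)

# If a tool response fails evaluation, the ReAct loop observes the
# evaluator feedback and can retry or pick another tool rather than
# trusting a bad answer.
print(agent.chat("What was Lyft's revenue in 2021?"))
```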

@logan-markewich
Collaborator

@d-mariano yea may as well make this PR a little more complete 💪

@d-mariano
Contributor Author

d-mariano commented Mar 8, 2024

@logan-markewich I've made a few updates so far:

  1. I corrected the change to update ToolOutput.content rather than ToolOutput.raw_output (see the sketch after this list):
    1. content is injected into agent prompts, and the entire point of this change is to inform agents of evaluation failures
    2. Preserving raw_output is probably desirable for debugging, i.e. "this tool failed eval, but this was its original output"
  2. I've added a test suite covering the evaluation pass and fail modes
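
A minimal sketch of the pass/fail handling described in point 1. ToolOutput and EvaluationResult (and their fields) are real llama-index types, but the helper function and the failure message are illustrative, not the merged implementation:

```python
# Illustrative only: shows the content-vs-raw_output split described
# above. The helper name and message wording are made up; ToolOutput
# and EvaluationResult are llama-index's own types.
from llama_index.core.evaluation import EvaluationResult
from llama_index.core.tools.types import ToolOutput

def apply_evaluation(output: ToolOutput, result: EvaluationResult) -> ToolOutput:
    if not result.passing:
        # Agents are prompted with `content`, so overwrite it with the
        # evaluator's feedback to deter the LLM from using the answer...
        output.content = f"Tool output failed evaluation. Reason: {result.feedback}"
        # ...while `raw_output` keeps the original response for debugging.
    return output
```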

I would like to update the llama-index documentation with a section for EvalQueryEngineTool. There I can write a blurb on its purpose and include a notebook example of it being used with an agent. Are these docs included in this repo? (I haven't looked around yet; I can do that at some point.)

I need to get to some other stuff, so I'll have this open a bit longer.

@logan-markewich
Collaborator

@d-mariano yea, there's a docs folder in this repo 💪 it's all markdown.

There should be some existing evaluation sections you can maybe hook into? Or maybe the section on tools would be better


@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 11, 2024
@d-mariano d-mariano force-pushed the eval-query-engine-tool branch 4 times, most recently from 0f82f98 to dd606b6 March 12, 2024 20:13
Implement the EvalQueryEngineTool:
* Inherits from QueryEngineTool
* Uses a given evaluator to evaluate query engine responses
* If the response fails evaluation:
    * The tool output is manipulated to deter the LLM
      from using it
    * The evaluator feedback is used as a reason for failure
logan-markewich and others added 5 commits March 18, 2024 12:49
* Simplify eval_query_engine_tool imports and avoid circular dependencies
* Fix EvalQueryEngineTool notebook by increasing similarity_top_k to 5 (see the sketch below)
  * The above resulted in the Lyft query engine returning a response that passed evaluation
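
For context, the notebook fix boils down to a retriever setting like the one below. `similarity_top_k` is a real `as_query_engine` parameter (its default is 2); `lyft_index` is assumed from the notebook's Lyft 10-K setup:

```python
# Hedged sketch of the notebook fix: retrieve 5 chunks instead of the
# default 2 so the evaluator has enough context to mark the response
# as passing. `lyft_index` is assumed from the notebook's 10-K setup.
lyft_engine = lyft_index.as_query_engine(similarity_top_k=5)
```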
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 25, 2024
@logan-markewich logan-markewich enabled auto-merge (squash) March 25, 2024 18:40
@logan-markewich logan-markewich merged commit 1052ee2 into run-llama:main Mar 25, 2024
8 checks passed
@d-mariano d-mariano deleted the eval-query-engine-tool branch March 25, 2024 20:15