
Implement EvalQueryEngineTool #11679

Merged
merged 6 commits into run-llama:main from eval-query-engine-tool Mar 25, 2024

Conversation

d-mariano
Contributor

@d-mariano d-mariano commented Mar 6, 2024

Description

Notice

I would like input on this PR from the llama-index team. If the team agrees with the need and approach, I will provide unit tests, documentation updates, and Google Colab notebooks as required.

Summary

This change provides a plug-and-play way to run on-the-fly evaluation of tool responses within an agent. Some areas I would like feedback on:

  1. Whether handling this at the tool level makes sense for the use case
  2. Whether this is a good approach for addressing the use case, and whether it aligns with the llama-index vision
  3. Anything else you can think of! <3

Implements the EvalQueryEngineTool (a usage sketch follows this list):

  • Inherits from QueryEngineTool
  • Uses a given evaluator to evaluate query engine responses
  • If the response fails evaluation:
    • The tool output is manipulated to deter the LLM from using it
    • The evaluator feedback is used as a reason for failure
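
To make the intended usage concrete, here is a minimal sketch of how the tool might be wired up. The import path, the `from_defaults` signature, and the choice of `RelevancyEvaluator` are assumptions based on this description, not confirmed against the merged code:

```python
# Hedged usage sketch -- import path, constructor signature, and the
# choice of evaluator are assumptions, not confirmed merged APIs.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import RelevancyEvaluator
from llama_index.core.tools.eval_query_engine import EvalQueryEngineTool

# Build an ordinary query engine over some local documents.
index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./data").load_data()
)

# Wrap it in the evaluating tool: it behaves like QueryEngineTool, but
# runs the evaluator on every response before the agent sees it.
tool = EvalQueryEngineTool.from_defaults(
    evaluator=RelevancyEvaluator(),
    query_engine=index.as_query_engine(),
    name="lyft_10k",
    description="Answers questions about Lyft's 2021 10-K filing.",
)

output = tool.call("What was Lyft's revenue in 2021?")
# On a failed evaluation, `content` carries the evaluator's feedback
# instead of the raw answer, deterring the LLM from using it.
print(output.content)
```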

Type of Change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests you ran to verify your changes. Provide instructions so we can reproduce them, and list any relevant details of your test configuration.

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 6, 2024
@logan-markewich
Collaborator

@d-mariano Yea I think I like this approach. If this was combined with an agent that had the chance to "retry" tool calls, it would be pretty powerful (and also helps avoid tracebacks when calling a tool)

@d-mariano
Contributor Author

> @d-mariano Yea I think I like this approach. If this was combined with an agent that had the chance to "retry" tool calls, it would be pretty powerful (and also helps avoid tracebacks when calling a tool)

@logan-markewich great! And totally agree, that's the idea down the road. I would like to test this first with the simple ReAct implementation, then with more complex use cases such as within a query pipeline.

Also curious what implications this has for a QueryPlan, but that doesn't need to be addressed today.

Okay, great. So with this I can implement some unit tests, of course... should I introduce this in the docs with a notebook example as well? <3
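
For reference, hooking the tool into a simple ReAct agent could look like the sketch below. `ReActAgent.from_tools` is an existing llama-index API at the time of this PR; `tool` is the EvalQueryEngineTool instance from the earlier sketch, and the model choice is arbitrary:

```python
# Hedged sketch: the proposed tool inside a ReAct loop. `tool` is the
# EvalQueryEngineTool from the earlier sketch; model choice is arbitrary.
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

agent = ReActAgent.from_tools(
    [tool],
    llm=OpenAI(model="gpt-4"),
    verbose=True,
)

# If a tool response fails evaluation, the ReAct loop observes the
# evaluator feedback and can retry or pick another tool rather than
# trusting a bad answer.
print(agent.chat("What was Lyft's revenue in 2021?"))
```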

@logan-markewich
Collaborator

@d-mariano yea may as well make this PR a little more complete 💪

@d-mariano
Contributor Author

d-mariano commented Mar 8, 2024

@logan-markewich I've made a few updates so far:

  1. I corrected the change to update ToolOutput.content rather than ToolOutput.raw_output (see the sketch after this list):
    1. content is injected into agent prompts, and the entire point of this change is to inform agents of evaluation failures
    2. Preserving raw_output is probably desirable for debugging, i.e. "this tool failed eval, but this was its original output"
  2. I've added a test suite covering the evaluation pass and fail modes
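
A minimal sketch of the pass/fail handling described in point 1. ToolOutput and EvaluationResult (and their fields) are real llama-index types, but the helper function and the failure message are illustrative, not the merged implementation:

```python
# Illustrative only: shows the content-vs-raw_output split described
# above. The helper name and message wording are made up; ToolOutput
# and EvaluationResult are llama-index's own types.
from llama_index.core.evaluation import EvaluationResult
from llama_index.core.tools.types import ToolOutput

def apply_evaluation(output: ToolOutput, result: EvaluationResult) -> ToolOutput:
    if not result.passing:
        # Agents are prompted with `content`, so overwrite it with the
        # evaluator's feedback to deter the LLM from using the answer...
        output.content = f"Tool output failed evaluation. Reason: {result.feedback}"
        # ...while `raw_output` keeps the original response for debugging.
    return output
```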

I would like to update the llama-index documentation with a section for EvalQueryEngineTool. There I can write a blurb on its purpose and include a notebook example of it being used with an agent. Are these docs included in this repo? (I haven't looked around yet; I can do that at some point.)

I need to get to some other stuff, so I'll have this open a bit longer.

@logan-markewich
Collaborator

@d-mariano yea, there's a docs folder in this repo 💪 it's all markdown.

There should be some existing evaluation sections you can maybe hook into? Or maybe the section on tools would be better


@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 11, 2024
@d-mariano d-mariano force-pushed the eval-query-engine-tool branch 4 times, most recently from 0f82f98 to dd606b6 March 12, 2024 20:13
Implement the EvalQueryEngineTool:
* Inherits from QueryEngineTool
* Uses a given evaluator to evaluate query engine responses
* If the response fails evaluation:
    * The tool output is manipulated to deter the LLM
      from using it
    * The evaluator feedback is used as a reason for failure
logan-markewich and others added 5 commits March 18, 2024 12:49
* Simplify eval_query_engine_tool imports and avoid circular dependencies
* Fix EvalQueryEngineTool notebook by increasing similarity_top_k to 5 (see the sketch below)
  * The above resulted in the Lyft query engine returning a response that passed evaluation
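
For context, the notebook fix boils down to a retriever setting like the one below. `similarity_top_k` is a real `as_query_engine` parameter (its default is 2); `lyft_index` is assumed from the notebook's Lyft 10-K setup:

```python
# Hedged sketch of the notebook fix: retrieve 5 chunks instead of the
# default 2 so the evaluator has enough context to mark the response
# as passing. `lyft_index` is assumed from the notebook's 10-K setup.
lyft_engine = lyft_index.as_query_engine(similarity_top_k=5)
```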
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 25, 2024
@logan-markewich logan-markewich enabled auto-merge (squash) March 25, 2024 18:40
@logan-markewich logan-markewich merged commit 1052ee2 into run-llama:main Mar 25, 2024
8 checks passed
@d-mariano d-mariano deleted the eval-query-engine-tool branch March 25, 2024 20:15