-
Notifications
You must be signed in to change notification settings - Fork 749
Change contexts to use ids rather than text #990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors context citations to use deterministic IDs instead of text names, adds a raw_answer
field to store unformatted LLM output, and updates answer formatting to substitute IDs with context names and build a reference list.
- Add an auto-populated
Context.id
field and helper to extract citation IDs. - Change
aquery
andformat_answers
to work with IDs, storing bothraw_answer
andformatted_answer
. - Update prompts, settings, and tests to use and validate new citation IDs.
Reviewed Changes
Copilot reviewed 6 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
tests/test_paperqa.py | Rename test param, add assertions for citation IDs in answers |
paperqa/utils.py | Add get_citation_keys , deprecate old get_citenames |
paperqa/types.py | Introduce Context.id with a pre-validator for auto-generation |
paperqa/settings.py | Update example citation to new ID format |
paperqa/prompts.py | Adjust citation examples and prompt text for ID usage |
paperqa/docs.py | Replace context-name references with IDs; implement format_answers |
Comments suppressed due to low confidence (2)
paperqa/utils.py:159
- [nitpick] Instead of a comment, consider marking
get_citenames
as deprecated (for example, emitting a warning or using a@deprecated
decorator) to clearly signal to maintainers that it should no longer be used.
# no longer used, but kept for backwards compatibility
paperqa/types.py:135
- The decorator
@model_validator
is used but not imported from pydantic, which will cause a NameError. Please addfrom pydantic import model_validator
at the top of this file.
@model_validator(mode="before")
"- Example2024Example et al. (2024) \n" | ||
"- Example's work (pages 17–19) \n" # noqa: RUF001 | ||
"- (pages 17–19) \n" # noqa: RUF001 | ||
"- (pqac-d79ef6fa and pqac-0f650d59) \n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you review in the README.md
or docs/
if there's anything we need to update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea i'll give it a once over -- for the end user, they shouldn't notice this change using our standard objects.
Gives each context a deterministically generated ID, which is then used for references.
This replaces the need for complex citation string formatting because the IDs are simple and easy to identify via regexp. The answer which uses the context ids rather than the text names is added into a new
Session
attribute,raw_answer
, this can be used to find the exact context corresponding to each key.answer
andformatted_answer
are left as they were before, but rather than relying on the LLM to populate the citation strings, we deterministically substitute the text name for each citation ID, after de-duping.Before this change, if
gather_evidence
was run twice with different questions, we had no way to map a citation to the correct summary.This PR also fixes a latent bug I found in the tests where auto-generated dockeys weren't able to be overridden with the metadata based dockey if using
aadd
. This only showed in the tests because I needed to regenerate the cassette for a few tests.