Skip to content

Change contexts to use ids rather than text #990

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Conversation

mskarlin
Copy link
Collaborator

@mskarlin mskarlin commented Jul 3, 2025

Gives each context a deterministically generated ID, which is then used for references.

This replaces the need for complex citation string formatting because the IDs are simple and easy to identify via regexp. The answer which uses the context ids rather than the text names is added into a new Session attribute, raw_answer, this can be used to find the exact context corresponding to each key. answer and formatted_answer are left as they were before, but rather than relying on the LLM to populate the citation strings, we deterministically substitute the text name for each citation ID, after de-duping.

Before this change, if gather_evidence was run twice with different questions, we had no way to map a citation to the correct summary.

This PR also fixes a latent bug I found in the tests where auto-generated dockeys weren't able to be overridden with the metadata based dockey if using aadd. This only showed in the tests because I needed to regenerate the cassette for a few tests.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working enhancement New feature or request labels Jul 3, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors context citations to use deterministic IDs instead of text names, adds a raw_answer field to store unformatted LLM output, and updates answer formatting to substitute IDs with context names and build a reference list.

  • Add an auto-populated Context.id field and helper to extract citation IDs.
  • Change aquery and format_answers to work with IDs, storing both raw_answer and formatted_answer.
  • Update prompts, settings, and tests to use and validate new citation IDs.

Reviewed Changes

Copilot reviewed 6 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_paperqa.py Rename test param, add assertions for citation IDs in answers
paperqa/utils.py Add get_citation_keys, deprecate old get_citenames
paperqa/types.py Introduce Context.id with a pre-validator for auto-generation
paperqa/settings.py Update example citation to new ID format
paperqa/prompts.py Adjust citation examples and prompt text for ID usage
paperqa/docs.py Replace context-name references with IDs; implement format_answers
Comments suppressed due to low confidence (2)

paperqa/utils.py:159

  • [nitpick] Instead of a comment, consider marking get_citenames as deprecated (for example, emitting a warning or using a @deprecated decorator) to clearly signal to maintainers that it should no longer be used.
# no longer used, but kept for backwards compatibility

paperqa/types.py:135

  • The decorator @model_validator is used but not imported from pydantic, which will cause a NameError. Please add from pydantic import model_validator at the top of this file.
    @model_validator(mode="before")

"- Example2024Example et al. (2024) \n"
"- Example's work (pages 17–19) \n" # noqa: RUF001
"- (pages 17–19) \n" # noqa: RUF001
"- (pqac-d79ef6fa and pqac-0f650d59) \n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you review in the README.md or docs/ if there's anything we need to update?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i'll give it a once over -- for the end user, they shouldn't notice this change using our standard objects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants