Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable regex to extract floats in score generation #1223

Merged
merged 4 commits into from
Jun 18, 2024

Conversation

sfc-gh-dhuang
Copy link
Contributor

@sfc-gh-dhuang sfc-gh-dhuang commented Jun 18, 2024

Items to add to release announcement:

  • Heading:
    During benchmarking of various feedback providers, I notice there are models (i.e. a finetuned mixtral-8x7b that tend to give 10.0 instead of 10 in their feedback scoring before normalization. In the current implementation, PATTERN_INTEGER will extract 0 and 10 from 10.0 and eventually pick the lesser value.

Failing example when testing groundedness feedback functions - where the score accompanying COT was interpreted as 0, instead of the expected 10:

0.0,
 {'reasons': 'STATEMENT 0:\nCriteria: I am a man and I love fish,\nSupporting Evidence: The source states "All men love fish", and the statement contains "I am a man and I love fish". The source contains the information that a man loves fish, and the statement contains the information that the speaker is a man and loves fish.\nScore: 10.0\n'}

I'm switching to PATTERN_NUMBER to unblock for now.

Other details that are good to know but need not be announced:

  • This is only a stopgap solution and I might be missing some contexts here as in why PATTERN_INTEGER was used over PATTERN_NUMBER in the previous PR. cc @sfc-gh-pmardziel to add more background if I'm missing sth obvious.

I do believe we should move toward structured and systematic feedback score generation mechanisms with some self-refining prompt iterations (i.e. via DSPy) ASAP for more robust score generation, ideally before integrating w/ the monitoring stack, even at the cost of slightly higher token usage/cost/latency (which can also be alleviated via better prompts and instruction tuning).

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Jun 18, 2024
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Jun 18, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jun 18, 2024
@sfc-gh-jreini sfc-gh-jreini merged commit e481329 into main Jun 18, 2024
9 checks passed

vals = set()
for match in matches:
try:
vals.add(validate_rating(int(match)))
vals.add(
validate_rating(int(float(match)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kinda late to the game, but:

  1. Why are we constrained to ints?
  2. If we are constrained to ints, shouldn't we round it instead of flooring it?
  3. The doc on L54 says "If the string does not match an integer ... raises an error ...", but this won't do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No particular reason beyond easier intrepretability for the end users AFAIK.
  2. Make sense - I can make a change to this
  3. Same as 1. - I do recall we went back and forth a bit on this and the doc just became outdated b/c of my change. Will update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1244 fix PR

@sfc-gh-jreini sfc-gh-jreini mentioned this pull request Jun 21, 2024
sfc-gh-dhuang added a commit that referenced this pull request Jun 28, 2024
* match floats and integers for score generation

* bb

* updated expected test cases

* minor fix
sfc-gh-dhuang added a commit that referenced this pull request Jul 1, 2024
* match floats and integers for score generation

* bb

* updated expected test cases

* minor fix
sfc-gh-chu pushed a commit that referenced this pull request Sep 25, 2024
* match floats and integers for score generation

* bb

* updated expected test cases

* minor fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm This PR has been approved by a maintainer size:S This PR changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants