[grpo]Tool rl: add reward func for ToolRL #4694
Conversation
examples/train/grpo/plugin/plugin.py
Outdated
max_possible_reward = self.format_max_possible
min_possible_reward = self.format_min_possible
if str(os.getenv("MAX1STEP30MAX3", 0)) == "1":
    print("MAX1STEP30MAX3 is set to 1, so max 1 -> 30 steps -> max 3")
Please remove the print statements; otherwise they will flood the training logs with messages.
All print statements have been removed.
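If some of this diagnostic output is still useful during debugging, one common alternative to deleting it outright is to route it through a logger at DEBUG level, so it stays silent in normal training runs. The sketch below is only an illustration under stated assumptions: it uses the standard-library `logging` module rather than whatever logger the project provides, and `describe_reward_bounds` is a hypothetical stand-in for the surrounding method in `plugin.py`.

```python
import logging
import os

# Hypothetical module-level logger; the project may provide its own logger helper.
logger = logging.getLogger(__name__)


def describe_reward_bounds(format_max_possible: float, format_min_possible: float):
    """Return (max, min) reward bounds, logging at DEBUG level instead of printing."""
    max_possible_reward = format_max_possible
    min_possible_reward = format_min_possible
    if str(os.getenv("MAX1STEP30MAX3", 0)) == "1":
        # Emitted only when the logger is configured for DEBUG.
        logger.debug("MAX1STEP30MAX3 is set to 1, so max 1 -> 30 steps -> max 3")
    return max_possible_reward, min_possible_reward
```

With the default logging configuration, DEBUG messages are not shown, so the reward function can be called on every batch without adding noise to the training output.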
examples/train/grpo/plugin/plugin.py
Outdated

# schedule reward
if str(os.getenv("SCHEDULEREWARD", 0)) == "1":
    print("SCHEDULEREWARD is set to 1, so schedule reward is used")
same
examples/train/grpo/plugin/plugin.py
Outdated
print("\n======= Answer ======= ")
print(solution[0])
print("\n======= Responses ======= ")
same
examples/train/grpo/plugin/plugin.py
Outdated
print(solution[0])
print("\n======= Responses ======= ")
for idx, response in enumerate(responses):
    print(f"*** Response {idx+1}***\n{response}")
same
examples/train/grpo/plugin/plugin.py
Outdated
print("\n======= Reward for <format> =======")
print("Reward function for <format> is called ...")
print(rewards)
same
examples/train/grpo/plugin/plugin.py
Outdated
print("Max possible score:", "Exact Match!")
print("Score:", max_possible_reward)
same
examples/train/grpo/plugin/plugin.py
Outdated
return max_possible_reward

if os.getenv("COARSEREWARD", 0) == "1":
    print("COARSEREWARD is set to 1, so coarse reward is used")
same
examples/train/grpo/plugin/plugin.py
Outdated
gt_params = gt_tool["parameters"]

if str(os.getenv("INTERMEDIATEREWARD", 0)) == "1":
    print("INTERMEDIATEREWARD is set to 1, so local max possible is changed")
same
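The hunks in this thread read their feature flags in two slightly different ways: most wrap the lookup in `str(...)`, while the `COARSEREWARD` check compares the raw return value. Both behave the same when the variable is set to `"1"`, but a small helper would keep the checks uniform. The following is a minimal sketch, not code from the PR; `env_flag` is a hypothetical name.

```python
import os


def env_flag(name: str) -> bool:
    """Return True only when the environment variable `name` is exactly "1"."""
    # A string default keeps the comparison string-to-string in every case.
    return os.getenv(name, "0") == "1"
```

A call site would then read `if env_flag("INTERMEDIATEREWARD"):`, which matches the behavior of the existing checks for both set and unset variables.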
examples/train/grpo/plugin/plugin.py
Outdated
print()
print("Max possible score:", local_max_possible)
print("Score:", score)
same
examples/train/grpo/plugin/plugin.py
Outdated
print("\n======= Reward for <tool call> =======")
print("Reward function for <tool call> correctness is called ...")
print(rewards)
same
Please make sure the lint test passes.
The lint test has been fixed.
PR type
PR information
Add a reward function for ToolRL (https://arxiv.org/abs/2504.13958).
Experiment results
Paste your experiment result here(if needed).