[grpo]Tool rl: add reward func for ToolRL #4694


Merged
merged 7 commits into from
Jun 27, 2025

tpx818
Contributor

@tpx818 tpx818 commented Jun 24, 2025

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Add a reward function for ToolRL (https://arxiv.org/abs/2504.13958).

Experiment results

Paste your experiment results here (if needed).

max_possible_reward = self.format_max_possible
min_possible_reward = self.format_min_possible
if str(os.getenv("MAX1STEP30MAX3", 0)) == "1":
    print("MAX1STEP30MAX3 is set to 1, so max 1 -> 30 steps -> max 3")
Collaborator

Remove print statements; otherwise, there will be a lot of print messages in the logger.
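
One way to act on this is to route the messages through a logger at debug level and emit each one only once, so the reward function stays quiet on every call. The following is a minimal sketch using only the Python standard library; `env_flag`, `debug_once`, and `describe_reward_mode` are hypothetical helpers for illustration and are not part of this PR or of the project's API.

```python
import functools
import logging
import os

logger = logging.getLogger(__name__)


def env_flag(name: str) -> bool:
    """Return True when the environment variable `name` is set to "1"."""
    return os.getenv(name, "0") == "1"


@functools.lru_cache(maxsize=None)
def debug_once(message: str) -> None:
    """Log `message` at DEBUG level the first time it is seen.

    Repeated calls with the same message hit the cache and log nothing.
    """
    logger.debug(message)


def describe_reward_mode() -> None:
    """Hypothetical stand-in for the flag checks inside the reward function."""
    if env_flag("SCHEDULEREWARD"):
        debug_once("SCHEDULEREWARD is set to 1, so schedule reward is used")
    if env_flag("COARSEREWARD"):
        debug_once("COARSEREWARD is set to 1, so coarse reward is used")
```

With this shape the messages appear only when the logger is configured for DEBUG, and at most once per process even if the reward function is called on every training step.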

Contributor Author

All print statements have been removed.


# schedule reward
if str(os.getenv("SCHEDULEREWARD", 0)) == "1":
    print("SCHEDULEREWARD is set to 1, so schedule reward is used")
Collaborator

same

Comment on lines 483 to 485
print("\n======= Answer ======= ")
print(solution[0])
print("\n======= Responses ======= ")
Collaborator

same

print(solution[0])
print("\n======= Responses ======= ")
for idx, response in enumerate(responses):
    print(f"*** Response {idx+1}***\n{response}")
Collaborator

same

Comment on lines 510 to 512
print("\n======= Reward for <format> =======")
print("Reward function for <format> is called ...")
print(rewards)
Collaborator

same

Comment on lines 586 to 587
print("Max possible score:", "Exact Match!")
print("Score:", max_possible_reward)
Collaborator

same

return max_possible_reward

if os.getenv("COARSEREWARD", 0) == "1":
    print("COARSEREWARD is set to 1, so coarse reward is used")
Collaborator

same

gt_params = gt_tool["parameters"]

if str(os.getenv("INTERMEDIATEREWARD", 0)) == "1":
    print("INTERMEDIATEREWARD is set to 1, so local max possible is changed")
Collaborator

same

Comment on lines 647 to 649
print()
print("Max possible score:", local_max_possible)
print("Score:", score)
Collaborator

same

Comment on lines 712 to 714
print("\n======= Reward for <tool call> =======")
print("Reward function for <tool call> correctness is called ...")
print(rewards)
Collaborator

same

@hjh0119
Collaborator

hjh0119 commented Jun 27, 2025

plz pass the lint test

@tpx818
Contributor Author

tpx818 commented Jun 27, 2025

plz pass the lint test

Lint test has been fixed

@hjh0119 hjh0119 merged commit 696fad6 into modelscope:main Jun 27, 2025
2 checks passed
2 participants