[grpo]Tool rl: add reward func for ToolRL #4694

tpx818 · 2025-06-24T12:09:20Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Add a reward function for ToolRL (https://arxiv.org/abs/2504.13958).

Experiment results

Paste your experiment results here (if needed).

hjh0119 · 2025-06-24T12:19:10Z

examples/train/grpo/plugin/plugin.py

+        max_possible_reward = self.format_max_possible
+        min_possible_reward = self.format_min_possible
+        if str(os.getenv("MAX1STEP30MAX3", 0)) == "1":
+            print("MAX1STEP30MAX3 is set to 1, so max 1 -> 30 steps -> max 3")


Please remove the print statements; otherwise they will flood the logger with print messages.

All print statements have been removed.
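
If the diagnostic output is still useful while debugging, one option is to route it through Python's standard `logging` module at DEBUG level instead of `print`, so it stays silent in normal training runs. The following is a minimal sketch under that assumption, not the code merged in this PR; the logger name and the `log_rewards` helper are hypothetical.

```python
import logging

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("toolrl_reward_sketch")


def log_rewards(tag, rewards):
    """Emit reward diagnostics only when DEBUG logging is enabled.

    Returns `rewards` unchanged so the call can wrap an existing return value.
    """
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Reward for <%s>: %s", tag, rewards)
    return rewards
```

With the default logging configuration the call is silent; setting the logger to `logging.DEBUG` brings the messages back when they are needed.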

hjh0119 · 2025-06-24T12:19:18Z

examples/train/grpo/plugin/plugin.py

+
+        # schedule reward
+        if str(os.getenv("SCHEDULEREWARD", 0)) == "1":
+            print("SCHEDULEREWARD is set to 1, so schedule reward is used")


hjh0119 · 2025-06-24T12:19:48Z

examples/train/grpo/plugin/plugin.py

+        print("\n======= Answer ======= ")
+        print(solution[0])
+        print("\n======= Responses ======= ")


hjh0119 · 2025-06-24T12:19:57Z

examples/train/grpo/plugin/plugin.py

+        print(solution[0])
+        print("\n======= Responses ======= ")
+        for idx, response in enumerate(responses):
+            print(f"*** Response {idx+1}***\n{response}")


hjh0119 · 2025-06-24T12:20:16Z

examples/train/grpo/plugin/plugin.py

+        print("\n======= Reward for <format> =======")
+        print("Reward function for <format> is called ...")
+        print(rewards)


hjh0119 · 2025-06-24T12:20:53Z

examples/train/grpo/plugin/plugin.py

+            print("Max possible score:", "Exact Match!")
+            print("Score:", max_possible_reward)


hjh0119 · 2025-06-24T12:20:59Z

examples/train/grpo/plugin/plugin.py

+            return max_possible_reward
+
+        if os.getenv("COARSEREWARD", 0) == "1":
+            print("COARSEREWARD is set to 1, so coarse reward is used")
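
The quoted excerpts read their feature flags in two slightly different forms, `str(os.getenv(NAME, 0)) == "1"` and `os.getenv(NAME, 0) == "1"`. Both evaluate the same way, since `os.getenv` returns a string whenever the variable is set. A small helper could make that uniform; the sketch below is illustrative only, and `env_flag` is a hypothetical name that does not appear in the PR.

```python
import os


def env_flag(name, default="0"):
    """Return True only when the environment variable `name` equals "1"."""
    return os.getenv(name, default) == "1"


# Hypothetical usage mirroring the flag names quoted in this review.
use_coarse_reward = env_flag("COARSEREWARD")
use_schedule_reward = env_flag("SCHEDULEREWARD")
```

Any value other than the exact string `"1"` (including an unset variable) leaves the flag off, matching the comparisons in the excerpts.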


hjh0119 · 2025-06-24T12:21:14Z

examples/train/grpo/plugin/plugin.py

+            gt_params = gt_tool["parameters"]
+
+            if str(os.getenv("INTERMEDIATEREWARD", 0)) == "1":
+                print("INTERMEDIATEREWARD is set to 1, so local max possible is changed")


hjh0119 · 2025-06-24T12:21:25Z

examples/train/grpo/plugin/plugin.py

+        print()
+        print("Max possible score:", local_max_possible)
+        print("Score:", score)


hjh0119 · 2025-06-26T13:11:27Z

examples/train/grpo/plugin/plugin.py

+        print("\n======= Reward for <tool call> =======")
+        print("Reward function for <tool call> correctness is called ...")
+        print(rewards)


hjh0119 · 2025-06-27T01:39:16Z

Please pass the lint test.

tpx818 · 2025-06-27T03:38:53Z

> Please pass the lint test.

The lint test has been fixed.

tpx818 added 5 commits June 23, 2025 17:28

init rl func

dba806f

fix

3f7d0d8

Merge branch 'main' into tool_rl

d3bb9f5

fix nan

617c7b3

add comment

4648f6e

hjh0119 reviewed Jun 26, 2025

delete print

ef2cdd9

fix lint

6e52fff

hjh0119 approved these changes Jun 27, 2025

hjh0119 merged commit 696fad6 into modelscope:main Jun 27, 2025
2 checks passed
