Skip to content

[Feature] Support tool-use parse and reasoning content.#503

Merged
jiapingW merged 6 commits intomainfrom
tool-use
Mar 24, 2026
Merged

[Feature] Support tool-use parse and reasoning content.#503
jiapingW merged 6 commits intomainfrom
tool-use

Conversation

@jiapingW
Copy link
Collaborator

@jiapingW jiapingW commented Mar 17, 2026

Motivation

This PR currently supports tool-use and user-provided tool for data parsing. Besides, it fix the training bug from ckpt of DFlash. Users can add tool information as the following format which should be add to each conversation:

[
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit of temperature",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    },
                    "num_results": {
                        "type": "integer",
                        "description": "Number of results to return"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

The format of dataset each line is :

{
    "conversation": [
        {
            "role":"user",
            "content": ""
        },
        {
            ...
        }
    ],
    "tools":[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "description": "The unit of temperature",
                            "enum": ["celsius", "fahrenheit"]
                        }
                    },
                    "required": ["location"]
                }
            }
        },
    ]
}

The below are some suggestions:

  • You can check the format correctness in tests/test_data/test_parsers.py to make sure that your training content is ok.
  • To ensure consistency in training and inference formats, we recommend that each data point have only one user message, but can have multiple tool and assistant messages, which is common in current agent applications. The example is listed below:
[
            [
                # First turn: User asks about weather
                {"role": "user", "content": "我想知道今天北京和上海的天气怎么样?"},
                # Assistant thinks and decides to call tools
                {
                    "role": "assistant",
                    "content": "我来帮您查询北京和上海的天气情况。",
                    "reasoning_content": "用户想知道两个城市的天气:北京和上海。我需要分别调用 get_weather 工具两次,一次查询北京,一次查询上海。",
                    "tool_calls": [
                        {
                            "type": "function",
                            "function": {
                                "name": "get_weather",
                                "arguments": {"location": "北京", "date": "today"},
                            },
                        },
                        {
                            "type": "function",
                            "function": {
                                "name": "get_weather",
                                "arguments": {"location": "上海", "date": "today"},
                            },
                        },
                    ],
                },
                # Tool responses
                {
                    "role": "tool",
                    "content": '{"location": "北京", "temperature": 25, "condition": "晴朗", "humidity": "45%"}',
                },
                {
                    "role": "tool",
                    "content": '{"location": "上海", "temperature": 28, "condition": "多云", "humidity": "65%"}',
                },
                # Assistant summarizes with reasoning
                {
                    "role": "assistant",
                    "content": "根据查询结果,北京今天晴朗,25°C;上海多云,28°C。两地都比较适合出行。",
                    "reasoning_content": "我已经获取了两个城市的天气数据。北京天气更好,晴朗且温度适宜;上海稍微热一些且多云。我可以给用户一个简洁的总结。",
                },
                # Second turn: User asks follow-up question
                {"role": "user", "content": "那明天呢?会下雨吗?"},
                # Assistant checks forecast
                {
                    "role": "assistant",
                    "content": "让我查询一下明天的天气预报。",
                    "reasoning_content": "用户想知道明天是否会下雨,我需要查询两个城市的天气预报。",
                    "tool_calls": [
                        {
                            "type": "function",
                            "function": {
                                "name": "get_weather_forecast",
                                "arguments": {"location": "北京", "days": 1},
                            },
                        },
                        {
                            "type": "function",
                            "function": {
                                "name": "get_weather_forecast",
                                "arguments": {"location": "上海", "days": 1},
                            },
                        },
                    ],
                },
                # Tool forecast responses
                {
                    "role": "tool",
                    "content": '{"location": "北京", "tomorrow": {"condition": "小雨", "temperature": 22, "rain_probability": 70}}',
                },
                {
                    "role": "tool",
                    "content": '{"location": "上海", "tomorrow": {"condition": "晴", "temperature": 29, "rain_probability": 10}}',
                },
                # Final assistant response
                {
                    "role": "assistant",
                    "content": "明天北京有小雨,记得带伞;上海晴天,适合外出。",
                    "reasoning_content": "北京明天有70%概率下雨,需要提醒用户带伞;上海天气很好,不需要特别准备。",
                },
            ]
        ]
  • We now recommend uniformly placing the thought process information in the reasoning_content field (in a format consistent with sglang). The regenerate_train_data.py script will also now include the thought process in reasoning_content. An example is shown below:
{
   "id": "dbef7268287a6a53db87040685896cc666a2932e5f44b66c1c18f3230b59ad44",
   "conversations": [
       {
           "role": "user",
           "content": "Here is a piece of text: HyperXآ®, a division of Kingstonآ® Technology Company, Inc., the independent world leader in memory products, today announced a new fierce look to its Predator DDR4 and DDR3 DRAM. The aggressive design offers greater heat dissipation to optimize reliability while the black aluminum heat spreader and black PCB complement the look of the latest PC hardware. HyperX Predator DDR4 DRAM delivers fast frequencies and low latencies with speeds up to 3333MHz matched with low CL15-CL16 latencies to keep enthusiasts and gamers in the lead. HyperX Predator DDR3 DRAM is available in speeds up to 2666MHz with low CL9-CL11 latencies for an overall faster system.\nHyperX Predator DDR4 and DDR3 is 100-percent factory tested at speed and backed by a lifetime warranty, free technical support and legendary reliability. HyperX is the high-performance product division of Kingston Technology encompassing high-speed DDR4 and DDR3 memory, SSDs, USB Flash drives, headsets and mouse pads. Targeted at gamers, overclockers and enthusiasts, HyperX is known throughout the world for quality, performance and innovation. HyperX is committed to eSports as it sponsors over 30 teams globally and is the main sponsor of Intel Extreme Masters. HyperX can be found at many shows including Brasil Game Show, China Joy, DreamHack, gamescom and PAX.\n\nBased on the text material above, generate the response to the following quesion or instruction: How is HyperX committed to eSports?"
       },
       {
           "role": "assistant",
           "content": "Based on the text provided, HyperX is committed to eSports by sponsoring over 30 teams globally and serving as the main sponsor of Intel Extreme Masters.",
           "reasoning_content": "Thinking Process:\n\n1.  **Analyze the Request:**\n    *   Input: A piece of text about HyperX (a division of Kingston Technology) announcing new Predator DDR4 and DDR3 DRAM products.\n    *   Task: Generate a response to the question \"How is HyperX committed to eSports?\" based *only* on the provided text.\n\n2.  **Scan the Text for Keywords:**\n    *   Keywords: \"eSports\", \"committed\", \"sponsor\", \"team\", \"events\".\n    *   Scanning the text...\n    *   Found in the second paragraph: \"HyperX is committed to eSports as it sponsors over 30 teams globally and is the main sponsor of Intel Extreme Masters.\"\n\n3.  **Extract Relevant Information:**\n    *   Action 1: Sponsors over 30 teams globally.\n    *   Action 2: Is the main sponsor of Intel Extreme Masters.\n\n4.  **Draft the Response:**\n    *   *Draft 1:* HyperX is committed to eSports by sponsoring over 30 teams globally and being the main sponsor of Intel Extreme Masters.\n    *   *Draft 2 (More formal):* According to the text, HyperX supports eSports by sponsoring over 30 teams around the world and serving as the main sponsor for Intel Extreme Masters.\n\n5.  **Review and Refine:**\n    *   Does it answer the question directly? Yes.\n    *   Is it based *only* on the text? Yes.\n    *   Is it concise? Yes.\n\n6.  **Final Polish:**\n    *   \"Based on the text, HyperX is committed to eSports by sponsoring over 30 teams globally and is the main sponsor of Intel Extreme Masters.\"\n\n7.  **Final Output Generation** (matching the selected draft).\n    *   \"HyperX is committed to eSports by sponsoring over 30 teams globally and is the main sponsor of Intel Extreme Masters.\" (Simple and direct).\n\n    *   *Self-Correction:* Ensure I don't add external knowledge. The text explicitly says: \"HyperX is committed to eSports as it sponsors over 30 teams globally and is the main sponsor of Intel Extreme Masters.\" I should paraphrase or quote this accurately.\n\n    *   *Final Version:* HyperX is committed to eSports by sponsoring over 30 teams globally and serving as the main sponsor of Intel Extreme Masters.cw\n"
       },
       {
           "role": "user",
           "content": "Can you provide more information on the speeds and latencies of the HyperX Predator DDR4 and DDR3 DRAM?"
       },
       {
           "role": "assistant",
           "content": "Based on the text provided, here are the specifications for speeds and latencies:\n\n*   **HyperX Predator DDR4 DRAM:** Speeds up to **3333MHz** with latencies of **CL15-CL16**.\n*   **HyperX Predator DDR3 DRAM:** Speeds up to **2666MHz** with latencies of **CL9-CL11**.",
           "reasoning_content": "Thinking Process:\n\n1.  **Analyze the Request:** The user is asking for more information on the speeds and latencies of the HyperX Predator DDR4 and DDR3 DRAM based on the provided text.\n\n2.  **Scan the Text for Keywords:** I need to look for \"DDR4\", \"DDR3\", \"speeds\", \"latencies\", \"MHz\", \"CL\".\n\n3.  **Locate Relevant Information:**\n    *   \"HyperX Predator DDR4 DRAM delivers fast frequencies and low latencies with speeds up to 3333MHz matched with low CL15-CL16 latencies...\"\n    *   \"...HyperX Predator DDR3 DRAM is available in speeds up to 2666MHz with low CL9-CL11 latencies...\"\n\n4.  **Synthesize the Answer:**\n    *   **DDR4:** Speeds up to 3333MHz, Latencies CL15-CL16.\n    *   **DDR3:** Speeds up to 2666MHz, Latencies CL9-CL11.\n\n5.  **Draft the Response:** Combine these points into a clear statement.\n\n    *   *Draft:* According to the text, HyperX Predator DDR4 DRAM has speeds up to 3333MHz with CL15-CL16 latencies. The HyperX Predator DDR3 DRAM has speeds up to 2666MHz with CL9-CL11 latencies.\n\n6.  **Refine the Response:** Ensure it directly answers the prompt using only the provided text.\n\n    *   *Refined:* Based on the text, the HyperX Predator DDR4 DRAM offers speeds up to 3333MHz with latencies of CL15-CL16. The HyperX Predator DDR3 DRAM is available in speeds up to 2666MHz with latencies of CL9-CL11.\n\n7.  **Final Check:** Does this match the source text accurately? Yes. Is it concise? Yes.\n\n8.  **Final Output Generation.** (Matches the refined draft).ot\n"
       }
   ],
   "status": "success"
}

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

polish code

polish code

polish code

polish code
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the data parsing capabilities to natively support advanced conversational patterns, specifically focusing on tool-use and explicit reasoning content. By introducing a specialized parser and enhancing chat templates, it ensures that models can be trained more effectively on complex agent-like interactions. The changes also refine the training process by correctly handling learning rate schedules during resumed training and providing mechanisms to ignore specific tokens during loss calculation, leading to more robust and accurate model training.

Highlights

  • Enhanced Tool-Use and Reasoning Support: Introduced comprehensive support for tool-use and reasoning content within conversational data parsing, enabling models to process and generate structured tool calls and explicit thought processes.
  • New ThinkingParser Implementation: Added a dedicated ThinkingParser to handle multi-turn conversations that include reasoning content and tool calls, ensuring accurate loss mask generation for these complex interactions.
  • Configurable Ignore Tokens in Chat Templates: Implemented a new ignore_token field in ChatTemplate to allow specific tokens (e.g., thinking tokens) to be excluded from loss calculation during training, improving training efficiency and focus.
  • Improved Training Resumption Logic: Adjusted the learning rate scheduler initialization in the training script to correctly account for remaining_steps when resuming training from a checkpoint, ensuring proper learning rate progression.
  • Expanded Test Coverage for Complex Conversations: Significantly updated and added new test cases and reference files to validate the parsing and loss masking logic for multi-turn conversations involving both reasoning and tool-use scenarios across various models.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • scripts/regenerate_train_data.py
    • Updated the key for reasoning content from 'thinking' to 'reasoning_content' in regenerated messages.
  • scripts/train_dflash.py
    • Modified learning rate calculation to use 'remaining_steps' when resuming training, ensuring correct schedule continuation.
  • specforge/data/init.py
    • Imported and exposed preprocess_conversations and ChatTemplate in the module's public API.
  • specforge/data/parse.py
    • Imported TOOLS for tool calling support.
    • Added ThinkingParser to the module's public API.
    • Initialized self.tools in the base Parser class.
    • Modified GeneralParser.apply_chat_template to pass tools to the tokenizer.
    • Implemented logic in GeneralParser.parse to zero out the loss mask for specified ignore_tokens.
    • Introduced ThinkingParser with specialized apply_chat_template and parse methods for reasoning and tool calls.
  • specforge/data/preprocessing.py
    • Removed an unnecessary blank line.
  • specforge/data/template.py
    • Imported Optional for type hinting.
    • Added an ignore_token field to the ChatTemplate Pydantic model.
    • Updated the qwen3-instruct template to adjust assistant_header and include ignore_token for thinking patterns.
  • specforge/data/tools.py
    • Added a new file defining a global TOOLS list for tool calling definitions.
  • tests/test_data/test_parsers.py
    • Updated tool_use_messages with more detailed multi-turn Chinese examples including tool calls and responses.
    • Added reasoning_multi_turn_messages for testing conversations with explicit reasoning content.
    • Introduced complete_reasoning_tool_conversation for comprehensive multi-turn tests combining reasoning and tool calls.
    • Extended _run_template_test to support new message labels for different conversation types.
    • Modified test_qwen3_thinking, test_kimi_k2_thinking, test_qwen3_next_thinking, and test_qwen35_instruct to utilize the newly defined complex conversation test data.
  • tests/test_data/test_references/kimi-k2-instruct_reasoning-multi-turn_ref.json
    • Added new reference JSON for kimi-k2-instruct with reasoning multi-turn messages.
  • tests/test_data/test_references/kimi-k2-thinking_reasoning-multi-turn_ref.json
    • Added new reference JSON for kimi-k2-thinking with reasoning multi-turn messages.
  • tests/test_data/test_references/kimi-k2-thinking_standard_ref.json
    • Updated reference JSON for kimi-k2-thinking standard messages.
  • tests/test_data/test_references/qwen3-instruct_multi-turn-tool-calls-with-reasoning_ref.json
    • Added new reference JSON for qwen3-instruct with multi-turn tool calls and reasoning.
  • tests/test_data/test_references/qwen3-instruct_standard_ref.json
    • Updated reference JSON for qwen3-instruct standard messages.
  • tests/test_data/test_references/qwen3-instruct_tool-use-multi-turn_ref.json
    • Added new reference JSON for qwen3-instruct with tool-use multi-turn messages.
  • tests/test_data/test_references/qwen3-instruct_tool-use_ref.json
    • Updated reference JSON for qwen3-instruct tool-use messages.
  • tests/test_data/test_references/qwen3-next-thinking_multi-turn-tool-calls-with-reasoning_ref.json
    • Added new reference JSON for qwen3-next-thinking with multi-turn tool calls and reasoning.
  • tests/test_data/test_references/qwen3-next-thinking_reasoning-multi-turn_ref.json
    • Added new reference JSON for qwen3-next-thinking with reasoning multi-turn messages.
  • tests/test_data/test_references/qwen3-next-thinking_standard_ref.json
    • Updated reference JSON for qwen3-next-thinking standard messages.
  • tests/test_data/test_references/qwen3-thinking_reasoning-multi-turn_ref.json
    • Added new reference JSON for qwen3-thinking with reasoning multi-turn messages.
  • tests/test_data/test_references/qwen3-thinking_standard_ref.json
    • Updated reference JSON for qwen3-thinking standard messages.
  • tests/test_data/test_references/qwen3.5_multi-turn-tool-calls-with-reasoning_ref.json
    • Added new reference JSON for qwen3.5 with multi-turn tool calls and reasoning.
  • tests/test_data/test_references/qwen3.5_multi-turn-tool-calls_ref.json
    • Added new reference JSON for qwen3.5 with multi-turn tool calls.
  • tests/test_data/test_references/qwen3.5_standard_ref.json
    • Updated reference JSON for qwen3.5 standard messages.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for 'thinking' or 'reasoning' models and tool calling within chat templates. Key changes include adding a ThinkingParser and enhancing the GeneralParser to handle ignore_tokens for loss masking, along with updating the ChatTemplate model to include ignore_token and tools definitions. The scripts/regenerate_train_data.py file renames a key from 'thinking' to 'reasoning_content' for consistency. In scripts/train_dflash.py, the logic for calculating remaining_steps for the optimizer during training resumption was refactored; however, a critical bug was identified where ckpt_info would be undefined if training is not resumed, causing a NameError. New test data and reference outputs were added to validate the new reasoning and tool-calling functionalities.

Comment on lines +436 to +437
start_epoch = ckpt_info[0]
global_step = ckpt_info[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The variable ckpt_info is only defined within the if args.resume ... block (L357-359). If training is started without the --resume flag, ckpt_info will be undefined, and accessing it here will raise a NameError, causing the script to crash. To fix this, ckpt_info should be initialized with default values (e.g., (0, 0)) before it's potentially assigned inside the conditional block, ensuring it's always available.

fix bug

support dflash train and fix test_sglang_backend

fix lint

fix:train on ckpt

fix:optimizer create when load from ckpt
@jiapingW jiapingW merged commit 2c4893f into main Mar 24, 2026
5 checks passed
@jiapingW jiapingW deleted the tool-use branch March 24, 2026 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant