337 changes: 337 additions & 0 deletions docs/docs/tutorials/tool_use/index.ipynb
@@ -0,0 +1,337 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Advanced Tool Use\n",
"\n",
"Let's walk through a quick example of building and prompt-optimizing a DSPy agent for advanced tool use. We'll do this for the challenging task [ToolHop](https://arxiv.org/abs/2501.02506) but with an even stricter evaluation criteria.\n",
"\n",
"Install the latest DSPy via `pip install -U dspy` and follow along. You will also need to `pip install func_timeout`."
]
},
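{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to install from within the notebook, the optional cell below does the same thing; uncomment it to run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: install the dependencies from within the notebook.\n",
"# %pip install -U dspy func_timeout"
]
},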
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Recommended: Set up MLflow Tracing to understand what's happening under the hood.</summary>\n",
"\n",
"### MLflow DSPy Integration\n",
"\n",
"<a href=\"https://mlflow.org/\">MLflow</a> is an LLMOps tool that natively integrates with DSPy and offer explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to understand the DSPy's behavior better. You can set up MLflow easily by following the four steps below.\n",
"\n",
"1. Install MLflow\n",
"\n",
"```bash\n",
"%pip install mlflow>=2.20\n",
"```\n",
"\n",
"2. Start MLflow UI in a separate terminal\n",
"```bash\n",
"mlflow ui --port 5000\n",
"```\n",
"\n",
"3. Connect the notebook to MLflow\n",
"```python\n",
"import mlflow\n",
"\n",
"mlflow.set_tracking_uri(\"http://localhost:5000\")\n",
"mlflow.set_experiment(\"DSPy\")\n",
"```\n",
"\n",
"4. Enabling tracing.\n",
"```python\n",
"mlflow.dspy.autolog()\n",
"```\n",
"\n",
"To learn more about the integration, visit [MLflow DSPy Documentation](https://mlflow.org/docs/latest/llms/dspy/index.html) as well.\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we'll demonstrate the new experimental `dspy.SIMBA` prompt optimizer, which tends to be powerful for larger LLMs and harder tasks. Using this, we'll improve our agent from 35% accuracy to 60%."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import dspy\n",
"import ujson\n",
"import random\n",
"\n",
"gpt4o = dspy.LM(\"openai/gpt-4o\", temperature=0.7)\n",
"dspy.configure(lm=gpt4o)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now download the data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading 'ToolHop.json'...\n"
]
}
],
"source": [
"from dspy.utils import download\n",
"\n",
"download(\"https://huggingface.co/datasets/bytedance-research/ToolHop/resolve/main/data/ToolHop.json\")\n",
"\n",
"data = ujson.load(open(\"ToolHop.json\"))\n",
"random.Random(0).shuffle(data)"
]
},
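{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can peek at the raw data to see how many datapoints we have and which fields each one carries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: inspect the size and the fields of the raw data.\n",
"print(len(data))\n",
"print(data[0].keys())"
]
},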
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then let's prepare a cleaned set of examples. The ToolHop task is interesting in that the agent gets a _unique set_ of tools (functions) to use separately for each request. Thus, it needs to learn how to use _any_ such tools effectively in practice."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"import inspect\n",
"\n",
"examples = []\n",
"fns2code = {}\n",
"\n",
"def finish(answer: str):\n",
" \"\"\"Conclude the trajectory and return the final answer.\"\"\"\n",
" return answer\n",
"\n",
"for datapoint in data:\n",
" func_dict = {}\n",
" for func_code in datapoint[\"functions\"]:\n",
" cleaned_code = func_code.rsplit(\"\\n\\n# Example usage\", 1)[0]\n",
" fn_name = re.search(r\"^\\s*def\\s+([a-zA-Z0-9_]+)\\s*\\(\", cleaned_code)\n",
" fn_name = fn_name.group(1) if fn_name else None\n",
"\n",
" if not fn_name:\n",
" continue\n",
"\n",
" local_vars = {}\n",
" exec(cleaned_code, {}, local_vars)\n",
" fn_obj = local_vars.get(fn_name)\n",
"\n",
" if callable(fn_obj):\n",
" func_dict[fn_name] = fn_obj\n",
" assert fn_obj not in fns2code, f\"Duplicate function found: {fn_name}\"\n",
" fns2code[fn_obj] = (fn_name, cleaned_code)\n",
"\n",
" func_dict[\"finish\"] = finish\n",
"\n",
" example = dspy.Example(question=datapoint[\"question\"], answer=datapoint[\"answer\"], functions=func_dict)\n",
" examples.append(example.with_inputs(\"question\", \"functions\"))\n",
"\n",
"trainset, devset, testset = examples[:100], examples[100:400], examples[400:]"
]
},
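{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, we can inspect one prepared example to see its question, its gold answer, and the names of the tools the agent will have available for that particular request."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: look at one prepared example and the tools attached to it.\n",
"example = trainset[0]\n",
"print(example.question)\n",
"print(example.answer)\n",
"print(list(example.functions.keys()))"
]
},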
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's define some helpers for the task. Here, we will define the `metric`, which will be (much) stricter than in the original paper: we'll expect the prediction to match exactly (after normalization) with the ground truth. We'll also be strict in a second way: we'll only allow the agent to take 5 steps in total, to allow for efficient deployment."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from func_timeout import func_set_timeout\n",
"\n",
"def wrap_function_with_timeout(fn):\n",
" @func_set_timeout(10)\n",
" def wrapper(*args, **kwargs):\n",
" try:\n",
" return {\"return_value\": fn(*args, **kwargs), \"errors\": None}\n",
" except Exception as e:\n",
" return {\"return_value\": None, \"errors\": str(e)}\n",
"\n",
" return wrapper\n",
"\n",
"def fn_metadata(func):\n",
" signature = inspect.signature(func)\n",
" docstring = inspect.getdoc(func) or \"No docstring.\"\n",
" return dict(function_name=func.__name__, arguments=str(signature), docstring=docstring)\n",
"\n",
"def metric(example, pred, trace=None):\n",
" gold = str(example.answer).rstrip(\".0\").replace(\",\", \"\").lower()\n",
" pred = str(pred.answer).rstrip(\".0\").replace(\",\", \"\").lower()\n",
" return pred == gold # stricter than the original paper's metric!\n",
"\n",
"evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24, display_progress=True, display_table=0, max_errors=999)"
]
},
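{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the normalization in `metric` concrete, here's a tiny illustration with made-up values: commas and a trailing `.0` are stripped and the comparison is case-insensitive, so a prediction of `\"1,234.0\"` matches a gold answer of `\"1234\"`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A tiny illustration of the metric's normalization, using made-up values.\n",
"toy_gold = dspy.Example(answer=\"1234\")\n",
"toy_pred = dspy.Prediction(answer=\"1,234.0\")\n",
"print(metric(toy_gold, toy_pred))  # True: the comma and trailing .0 are normalized away"
]
},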
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's define the agent! The core of our agent will be based on a ReAct loop, in which the model sees the trajectory so far and the set of functions available to invoke, and decides the next tool to call.\n",
"\n",
"To keep the final agent fast, we'll limit its `max_steps` to 5 steps. We'll also run each function call with a timeout."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"class Agent(dspy.Module):\n",
" def __init__(self, max_steps=5):\n",
" self.max_steps = max_steps\n",
" instructions = \"For the final answer, produce short (not full sentence) answers in which you format dates as YYYY-MM-DD, names as Firstname Lastname, and numbers without leading 0s.\"\n",
" signature = dspy.Signature('question, trajectory, functions -> next_selected_fn, args: dict[str, Any]', instructions)\n",
" self.react = dspy.ChainOfThought(signature)\n",
"\n",
" def forward(self, question, functions):\n",
" tools = {fn_name: fn_metadata(fn) for fn_name, fn in functions.items()}\n",
" trajectory = []\n",
"\n",
" for _ in range(self.max_steps):\n",
" pred = self.react(question=question, trajectory=trajectory, functions=tools)\n",
" selected_fn = pred.next_selected_fn.strip('\"').strip(\"'\")\n",
" fn_output = wrap_function_with_timeout(functions[selected_fn])(**pred.args)\n",
" trajectory.append(dict(reasoning=pred.reasoning, selected_fn=selected_fn, args=pred.args, **fn_output))\n",
"\n",
" if selected_fn == \"finish\":\n",
" break\n",
"\n",
" return dspy.Prediction(answer=fn_output.get(\"return_value\", ''), trajectory=trajectory)"
]
},
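{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional spot check before the full evaluation, we can run a fresh agent on a single development example and glance at its answer and trajectory length."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional spot check: run a fresh agent on one dev example.\n",
"example = devset[0]\n",
"pred = Agent()(**example.inputs())\n",
"print(pred.answer)\n",
"print(len(pred.trajectory), \"steps taken\")"
]
},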
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Out of the box, let's assess our `GPT-4o`-powered agent on the development set."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025/03/23 21:46:10 INFO dspy.evaluate.evaluate: Average Metric: 105.0 / 300 (35.0%)\n"
]
},
{
"data": {
"text/plain": [
"35.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent = Agent()\n",
"evaluate(agent)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's optimize the agent using `dspy.SIMBA`, which stands for **Stochastic Introspective Mini-Batch Ascent**. This prompt optimizer accepts arbitrary DSPy programs like our agent here and proceeds in a sequence of mini-batches seeking to make incremental improvements to the prompt instructions or few-shot examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"simba = dspy.SIMBA(metric=metric, max_steps=12, max_demos=10)\n",
"optimized_agent = simba.compile(agent, trainset=trainset, seed=6793115)"
]
},
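{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like to reuse the optimized agent later without re-running the optimization, you can save its state to disk and load it back into a fresh `Agent` instance. The filename below is just an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optionally persist the optimized program; the filename is just an example.\n",
"optimized_agent.save(\"optimized_tool_agent.json\")\n",
"\n",
"# Later, load the saved state into a fresh Agent instance:\n",
"# loaded_agent = Agent()\n",
"# loaded_agent.load(\"optimized_tool_agent.json\")"
]
},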
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having completed this optimization, let's now evaluate our agent again. We see a substantial 71% relative gain, jumping to 60% accuracy."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025/03/23 21:46:21 INFO dspy.evaluate.evaluate: Average Metric: 182.0 / 300 (60.7%)"
]
},
{
"data": {
"text/plain": [
"60.67"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"evaluate(optimized_agent)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "jun2024_py310",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -53,6 +53,7 @@ nav:
- Classification Finetuning: tutorials/classification_finetuning/index.ipynb
- Multi-Hop Search: tutorials/multihop_search/index.ipynb
- Privacy-Conscious Delegation: tutorials/papillon/index.md
- Advanced Tool Use: tutorials/tool_use/index.ipynb
- Image Generation Prompt iteration: tutorials/image_generation_prompting/index.ipynb
- Output Refinement: tutorials/output_refinement/best-of-n-and-refine.md
- Finetuning Agents: tutorials/games/index.ipynb