Refresh AITK Bulk Run Page (#8471)

MuyangAmigo · ntrogh · web-flow · commit 9fb527035003 · 2025-06-17T13:10:31.000+02:00
* Refresh bulk run doc

* Update docs/intelligentapps/bulkrun.md

Co-authored-by: Nick Trogh <1908215+ntrogh@users.noreply.github.com>

* update image

---------

Co-authored-by: Nick Trogh <1908215+ntrogh@users.noreply.github.com>
diff --git a/docs/intelligentapps/bulkrun.md b/docs/intelligentapps/bulkrun.md
@@ -1,37 +1,89 @@
 ---
 ContentId: 1124d141-e893-4780-aba7-b6ca13628bc5
-DateApproved: 12/11/2024
-MetaDescription: Run a set of prompts in an imported dataset, individually or in a full batch towards the selected genAI models and parameters.
+DateApproved: 06/16/2025
+MetaDescription: Run a set of prompts with variables or function calls against the selected models and parameters, using an imported or synthetically generated dataset.
 ---
 # Run multiple prompts in bulk
 
-The bulk run feature in AI Toolkit enables you to run multiple prompts in batch. When you use the playground, you can only run one prompt manually at a time, in the order they're listed.
+> [!NOTE]
+> Bulk run was previously a standalone webview feature in AI Toolkit. It is now fully integrated into **Agent Builder** under the **Evaluation** tab. You can still access it through the AI Toolkit view by selecting **TOOLS** > **Bulk Run**.
 
-Bulk run takes a dataset as input, where each row in the dataset has at least a prompt. Typically, the dataset has multiple rows. Once imported, you can select one or more prompts to run on the selected model. The responses are then displayed in the same dataset view. The results from running the dataset can be exported.
+The bulk run feature in AI Toolkit lets you test agents and prompts against multiple test cases in batch mode. Unlike the playground, which runs one prompt at a time, bulk run automates the process by using a dataset as input and running all prompts sequentially.
+
+After execution, AI responses appear in the dataset view next to your original prompts. You can review, compare, and export the complete dataset with responses for further analysis.
+
+![Screenshot showing AI Toolkit interface with the bulk run feature. The dataset table displays multiple prompts and responses, with queries about weather in Paris France and Shanghai China.](./images/bulkrun/bulkrun.png)
 
 ## Start a bulk run
 
-1. In the AI Toolkit view, select **TOOLS** > **Bulk Run** to open the Bulk Run view
+To start a bulk run in AI Toolkit, follow these steps:
+
+1. In the AI Toolkit view, select **Agent Builder** from the Activity Bar.
+1. Enter your prompt and variables using the `{{your_variable}}` format. Select a model to run the prompt against.
+1. Switch to the **Evaluation** tab in **Agent Builder**.
+
+> [!NOTE]
+> To generate datasets, AI Toolkit uses the same LLMs that you use for agents, which might incur costs. You can view the meta prompt used to generate datasets in the [AI Toolkit GitHub repository](https://github.com/microsoft/vscode-ai-toolkit/blob/main/doc/data_generator.md).
+
+1. Select **Generate Data** to create a synthetic dataset.
+1. Choose the number of rows to generate and view or modify the data generation logic.
+    ![Screenshot showing Generate Data dialog in AI Toolkit.](./images/bulkrun/generate_data.png)
+1. Select **Generate** to create the dataset.
+
+1. Once the dataset is loaded, select **Run** to run a single row or **Run All** to run all rows in the dataset.
+
+> [!TIP]
+> To run only the rows that have not been run yet, select **Run Remaining**.
+
+## Operate on dataset
+
+![Screenshot showing AI Toolkit interface with dataset operations and a table of evaluation results.](./images/bulkrun/dataset_operation.png)
+
+AI Toolkit provides several operations to manage and analyze your dataset during a bulk run:
+
+- **Generate Data**: Create a synthetic dataset based on a prompt and variables. Specify the number of rows and modify the data generation logic.
+- **Add Row**: Add a new row to the dataset.
+- **Delete Row**: Delete the selected row from the dataset.
+- **Export Dataset**: Export the dataset to a CSV file for further analysis or reporting.
+- **Import Dataset**: Import a dataset from a CSV file to use as input for the bulk run.
+- **Run**: Execute a single row in the dataset against the selected model.
+- **Run All**: Execute all rows in the dataset against the selected model.
+- **Run Remaining**: Execute only the rows that have not yet been run against the selected model.
+- **Manual Evaluation**: Mark responses with a thumbs up or thumbs down to keep a record of manual evaluations.
+
+## Evaluate bulk run results
+
+AI Toolkit lets you evaluate the results of your bulk run directly in the dataset view.
+
+![Screenshot showing AI Toolkit interface in full screen mode with the Evaluation tab expanded. The dataset table displays multiple columns, including query prompts and AI responses, for detailed analysis.](./images/bulkrun/full_screen.png)
+
+You can expand the **Evaluation** tab to full screen mode for a more detailed view of the results. Full screen mode provides the same functionality as the standard view, but with a larger display area for better visibility and analysis.
 
-1. Select either a sample dataset or import a local [JSONL](https://jsonlines.org/) file with chat prompts
+![Screenshot showing detailed view of evaluation results with a modal dialog displaying a full conversation between user and assistant about weather queries.](./images/bulkrun/view_detail.png)
 
-    The JSONL file needs to have a `query` field to represent a prompt.
+Select **View Details** to see the full response for each query.
 
-1. Once the dataset is loaded, select **Run** or **Rerun** on any prompt to run a single prompt.
+In the detail view, you can:
 
-    Similar to testing a model in the playground, select a model, add context for your prompt, and change inference parameters.
+- Review the full conversation between the user and the assistant.
+- Analyze the AI's responses.
+- Mark responses as good or bad to keep a record of manual evaluations.
+- Navigate to previous or next queries in the dataset.
+- Select **Exit** to return to the dataset overview.
+- View the total number of queries in the dataset and the current query index.
 
-    ![Bulk run prompts](./images/bulkrun/bulkrun_one.png)
+## Manage data columns
 
-1. Select **Run all** to automatically run through all queries.
+![Screenshot showing AI Toolkit interface with dataset management options and column management controls.](./images/bulkrun/manage_columns.png)
 
-    The model responses are shown in the **response** column.
+With data column management, you can customize the dataset view to focus on the most relevant information for your bulk run analysis.
 
-    ![Run all](./images/bulkrun/runall.png)
+You can:
 
-    > [!TIP]
-    > There is an option to only run the remaining queries that have not yet been run.
+- **Add Columns**: Add columns to the left or right of the current column.
+- **Edit Column Name**: Change the name of any column in the dataset.
+- **Add Ground Truth Column**: Add a column for ground truth values to compare with AI responses.
 
-1. Select the **Export** button to export the results to a JSONL format
+## Next steps
 
-1. Select **Import** to import another dataset in JSONL format for the bulk run
+- [Run an evaluation](/docs/intelligentapps/evaluation.md) with the popular evaluators
diff --git a/docs/intelligentapps/images/bulkrun/bulkrun.png b/docs/intelligentapps/images/bulkrun/bulkrun.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a77a27e41277d8fc796178ea572d6aa742729d97aafb525ed37190c573adb9ed
+size 189192
diff --git a/docs/intelligentapps/images/bulkrun/dataset_operation.png b/docs/intelligentapps/images/bulkrun/dataset_operation.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f6eae926a65e2cc69f7451bb077d7c49d17dcc3953c7813e15a6e8a1c1f618c9
+size 66277
diff --git a/docs/intelligentapps/images/bulkrun/full_screen.png b/docs/intelligentapps/images/bulkrun/full_screen.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c44bebaf684b2deeb44576ec90fedcf0aa2ce50ffc67353381626584b34aed68
+size 149497
diff --git a/docs/intelligentapps/images/bulkrun/generate_data.png b/docs/intelligentapps/images/bulkrun/generate_data.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ffdedaa53e227dea0bea36a2bafbc5a21bfceb55a6295411d935e4ad015e1445
+size 71161
diff --git a/docs/intelligentapps/images/bulkrun/manage_columns.png b/docs/intelligentapps/images/bulkrun/manage_columns.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1e171e93df82820a3b82d96fbaf3a576a7e32125e1b488903d6e7699a23e8df2
+size 28564
diff --git a/docs/intelligentapps/images/bulkrun/view_detail.png b/docs/intelligentapps/images/bulkrun/view_detail.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ffa76492580d65e776d56a05ed2204c3420367e9eaa51540d1037cf6a10ac88b
+size 302181
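
Note on the refreshed `bulkrun.md`: the page describes prompts that contain `{{your_variable}}` placeholders, a dataset that can be imported from CSV, and a **Run Remaining** operation. The Python sketch below illustrates how those three ideas fit together conceptually. It is not AI Toolkit's implementation: the CSV layout (one column per variable plus a `response` column), the helper names `render_prompt` and `remaining_rows`, and the placeholder regex are all assumptions made for illustration.

```python
import csv
import io
import re

# Assumed placeholder syntax: {{name}}, where name consists of word characters.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, row: dict) -> str:
    """Replace each {{name}} with row[name]; raises KeyError if a column is missing."""
    return PLACEHOLDER.sub(lambda m: row[m.group(1)], template)


def remaining_rows(rows: list, response_column: str = "response") -> list:
    """Return the rows whose response column is empty or absent (not yet run)."""
    return [r for r in rows if not (r.get(response_column) or "").strip()]


# Hypothetical dataset: one column per prompt variable, plus a response column.
SAMPLE_CSV = """city,country,response
Paris,France,Sunny
Shanghai,China,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
template = "What is the weather in {{city}}, {{country}}?"

# Only the Shanghai row has no response yet, so only that prompt is rendered.
prompts = [render_prompt(template, r) for r in remaining_rows(rows)]
print(prompts)
```

A dataset prepared this way has one row per test case, so checking that every placeholder in the prompt has a matching column before importing can catch missing variables early.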