Commit 9fb5270

MuyangAmigo and ntrogh authored
Refresh AITK Bulk Run Page (#8471)
* Refresh bulk run doc

* Update docs/intelligentapps/bulkrun.md

Co-authored-by: Nick Trogh <1908215+ntrogh@users.noreply.github.com>

* update image

---------

Co-authored-by: Nick Trogh <1908215+ntrogh@users.noreply.github.com>
1 parent de9f5c5 commit 9fb5270

7 files changed

+87
-17
lines changed

docs/intelligentapps/bulkrun.md

Lines changed: 69 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -1,37 +1,89 @@
11
---
22
ContentId: 1124d141-e893-4780-aba7-b6ca13628bc5
3-
DateApproved: 12/11/2024
4-
MetaDescription: Run a set of prompts in an imported dataset, individually or in a full batch towards the selected genAI models and parameters.
3+
DateApproved: 06/16/2025
4+
MetaDescription: Run a set of prompts with variables or function calls with an imported or synthetically generated dataset towards the selected models and parameters.
55
---
66
# Run multiple prompts in bulk
77

8-
The bulk run feature in AI Toolkit enables you to run multiple prompts in batch. When you use the playground, you can only run one prompt manually at a time, in the order they're listed.
8+
> [!NOTE]
9+
> Bulk run was previously a standalone webview feature in AI Toolkit. It is now fully integrated into **Agent Builder** under the **Evaluation** tab. You can still access it through the AI Toolkit view by selecting **TOOLS** > **Bulk Run**.
910
10-
Bulk run takes a dataset as input, where each row in the dataset has at least a prompt. Typically, the dataset has multiple rows. Once imported, you can select one or more prompts to run on the selected model. The responses are then displayed in the same dataset view. The results from running the dataset can be exported.
11+
The bulk run feature in AI Toolkit lets you test agents and prompts against multiple test cases in batch mode. Unlike the playground, which runs one prompt at a time, bulk run automates the process by using a dataset as input and running all prompts sequentially.
12+
13+
After execution, AI responses appear in the dataset view next to your original prompts. You can review, compare, and export the complete dataset with responses for further analysis.
14+
15+
![Screenshot showing AI Toolkit interface with the bulk run feature. The dataset table displays multiple prompts and responses, with queries about weather in Paris France and Shanghai China.](./images/bulkrun/bulkrun.png)
1116

1217
## Start a bulk run
1318

14-
1. In the AI Toolkit view, select **TOOLS** > **Bulk Run** to open the Bulk Run view
19+
To start a bulk run in AI Toolkit, follow these steps:
20+
21+
1. In the AI Toolkit view, select **Agent Builder** from the Activity Bar.
22+
1. Enter your prompt and variables using the `{{your_variable}}` format. Select a model to run the prompt against.
23+
1. Switch to the **Evaluation** tab in **Agent Builder**.
24+
25+
> [!NOTE]
26+
> AI Toolkit uses the same LLM models you use for agents to generate datasets, which might incur costs. You can view the meta prompt used to generate datasets in the [AI Toolkit GitHub repository](https://github.com/microsoft/vscode-ai-toolkit/blob/main/doc/data_generator.md).
27+
28+
1. Select **Generate Data** to create a synthetic dataset.
29+
1. Choose the number of rows to generate and view or modify the data generation logic.
30+
![Screenshot showing Generate Data dialog in AI Toolkit.](./images/bulkrun/generate_data.png)
31+
1. Select **Generate** to create the dataset.
32+
33+
> [!TIP]
34+
> You can choose to run only the remaining queries that have not yet been run.
35+
36+
1. Once the dataset is loaded, select **Run** to run a single row or **Run All** to run all rows in the dataset.
37+
38+
## Operate on dataset
39+
40+
![Screenshot showing AI Toolkit interface with dataset operations and a table of evaluation results.](./images/bulkrun/dataset_operation.png)
41+
42+
AI Toolkit provides several operations to manage and analyze your dataset during a bulk run:
43+
44+
- **Generate Data**: Create a synthetic dataset based on a prompt and variables. Specify the number of rows and modify the data generation logic.
45+
- **Add Row**: Add a new row to the dataset.
46+
- **Delete Row**: Delete the selected row from the dataset.
47+
- **Export Dataset**: Export the dataset to a CSV file for further analysis or reporting.
48+
- **Import Dataset**: Import a dataset from a CSV file to use as input for the bulk run.
49+
- **Run**: Execute a single row in the dataset against the selected model.
50+
- **Run All**: Execute all rows in the dataset against the selected model.
51+
- **Run Remaining**: Execute only the rows that have not yet been run against the selected model.
52+
- **Manual Evaluation**: Mark responses as Thumb Up or Thumb Down to keep a record of manual evaluations.
53+
54+
## Evaluate bulk run results
55+
56+
AI Toolkit lets you evaluate the results of your bulk run directly in the dataset view.
57+
58+
![Screenshot showing AI Toolkit interface in full screen mode with the Evaluation tab expanded. The dataset table displays multiple columns, including query prompts and AI responses, for detailed analysis.](./images/bulkrun/full_screen.png)
59+
60+
You can expand the **Evaluation** tab to full screen mode for a more detailed view of the results. Full screen mode provides the same functionality as the standard view, but with a larger display area for better visibility and analysis.
1561

16-
1. Select either a sample dataset or import a local [JSONL](https://jsonlines.org/) file with chat prompts
62+
![Screenshot showing detailed view of evaluation results with a modal dialog displaying a full conversation between user and assistant about weather queries.](./images/bulkrun/view_detail.png)
1763

18-
The JSONL file needs to have a `query` field to represent a prompt.
64+
Select **View Details** to see the full response for each query.
1965

20-
1. Once the dataset is loaded, select **Run** or **Rerun** on any prompt to run a single prompt.
66+
In the detail view, you can:
2167

22-
Similar to testing a model in the playground, select a model, add context for your prompt, and change inference parameters.
68+
- Review the full conversation between the user and the assistant.
69+
- Analyze the AI's responses.
70+
- Mark responses as good or bad to keep a record of manual evaluations.
71+
- Navigate to previous or next queries in the dataset.
72+
- Select **Exit** to return to the dataset overview.
73+
- View the total number of queries in the dataset and the current query index.
2374

24-
![Bulk run prompts](./images/bulkrun/bulkrun_one.png)
75+
## Manage data columns
2576

26-
1. Select **Run all** to automatically run through all queries.
77+
![Screenshot showing AI Toolkit interface with dataset management options and column management controls.](./images/bulkrun/manage_columns.png)
2778

28-
The model responses are shown in the **response** column.
79+
With data column management, you can customize the dataset view to focus on the most relevant information for your bulk run analysis.
2980

30-
![Run all](./images/bulkrun/runall.png)
81+
You can:
3182

32-
> [!TIP]
33-
> There is an option to only run the remaining queries that have not yet been run.
83+
- **Add Columns**: Add columns to the left or right of the current column.
84+
- **Edit Column Name**: Change the name of any column in the dataset.
85+
- **Add Ground Truth Column**: Add a column for ground truth values to compare with AI responses.
3486

35-
1. Select the **Export** button to export the results to a JSONL format
87+
## Next steps
3688

37-
1. Select **Import** to import another dataset in JSONL format for the bulk run
89+
- [Run an evaluation](/docs/intelligentapps/evaluation.md) with the popular evaluators
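
The diff above drops the JSONL import and export steps (where each record needed a `query` field) and describes CSV import and export instead. For anyone holding datasets in the old JSONL shape, the following is a minimal sketch of a conversion to CSV. It rests on assumptions that the diff does not confirm: that the CSV importer expects a header row, and that a single column named `query` is acceptable. The helper name `jsonl_to_csv` is hypothetical and not part of AI Toolkit.

```python
import csv
import io
import json


def jsonl_to_csv(jsonl_text: str, column: str = "query") -> str:
    """Convert JSONL records that have a `query` field into one-column CSV text.

    Hypothetical helper: the header name and one-column layout are assumptions,
    not a documented AI Toolkit format.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column])  # header row (assumed to be expected on import)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)
        writer.writerow([record["query"]])  # csv.writer quotes commas as needed
    return out.getvalue()


# Example input modeled on the weather queries mentioned in the screenshots.
sample = (
    '{"query": "What is the weather in Paris, France?"}\n'
    '{"query": "What is the weather in Shanghai, China?"}\n'
)
csv_text = jsonl_to_csv(sample)
print(csv_text)
```

If the prompt uses variables in the `{{your_variable}}` format, the column header would presumably need to match the variable name rather than `query`; pass it through the `column` argument and check the result against a dataset exported from the tool.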
6 other files changed, each with 3 additions & 0 deletions (contents not loaded).
