Added a simple Testbed tool for repeatedly running templated Autogen scenarios with tightly-controlled initial conditions. (microsoft#455)

* Initial commit of the autogen testbed environment.
* Fixed some typos in the Testbed README.md
* Added some stricter termination logic to the two_agent scenario, and switched the logo task from finding Autogen's logo to finding Microsoft's (it's easier)
* Added documentation to testbed code in preparation for PR
* Added a variation of HumanEval to the Testbed. It is also a reasonable example of how to integrate other benchmarks.
* Removed ChatCompletion.start_logging and related features. Added an explicit TERMINATE output to HumanEval to save 1 turn in each conversation.
* Added metrics utils script for HumanEval
* Updated the requirements in the README.
* Added documentation for HumanEval csv schemas
* Standardized on how the OAI_CONFIG_LIST is handled.
* Removed dot-slash from 'includes' path for cross-platform compatibility
* Missed a file.
* Updated readme to include known-working versions.

Showing 10 changed files with 920 additions and 0 deletions.

``README.md``:

# Autogen Testbed Environment

The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen starts from a blank slate, working out what code needs to be written and what libraries or dependencies to install. The results of each run are logged, and can be ingested by analysis or metrics scripts (see the HumanEval example later in this README). By default, all runs are conducted in freshly-initialized Docker containers, providing the recommended level of consistency and safety.

This Testbed sample has been tested in, and is known to work with, Autogen versions 0.1.14 and 0.2.0b1.

## Setup

Before you begin, you must configure your API keys for use with the Testbed. As with other Autogen applications, the Testbed looks for the OpenAI keys in a file in the current working directory, or in an environment variable, named OAI_CONFIG_LIST. This can be overridden using a command-line parameter described later.
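
For reference, an OAI_CONFIG_LIST is a JSON list of model configurations. A minimal example, with placeholder keys (the specific model entries shown here are only illustrative), might look like:

```
[
    {
        "model": "gpt-4",
        "api_key": "<your OpenAI API key here>"
    },
    {
        "model": "gpt-3.5-turbo-16k",
        "api_key": "<your OpenAI API key here>"
    }
]
```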

For some scenarios, additional keys may be required (e.g., keys for the Bing Search API). These can be added to an ``ENV`` file in the ``includes`` folder. A sample has been provided in ``includes/ENV.example``. Edit ``includes/ENV`` as needed.

The Testbed also requires Docker (Desktop or Engine) AND the __python docker__ library. **It will not run in codespaces**, unless you opt for native execution (which is strongly discouraged). To install Docker Desktop see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/). To install the Python library:

``pip install docker``

## Running the Testbed

To run the Testbed, simply execute
``python run_scenarios.py``

The default is to repeat each scenario 10 times. This can be costly. To run each scenario only once, use:
``python run_scenarios.py --repeat 1``

The run_scenarios.py script also allows a number of command-line arguments to control various parameters of execution. Type ``python run_scenarios.py -h`` to explore these options:

```
run_scenarios.py will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.

positional arguments:
  scenario      The JSONL scenario file to run. If a directory is specified,
                then all JSONL scenarios in the directory are run. (default:
                ./scenarios)

options:
  -h, --help    show this help message and exit
  -r REPEAT, --repeat REPEAT
                The number of repetitions to run for each scenario (default: 10).
  -c CONFIG, --config CONFIG
                The environment variable name or path to the OAI_CONFIG_LIST (default: OAI_CONFIG_LIST).
  --native      Run the scenarios natively rather than in docker.
                NOTE: This is not advisable, and should be done with great caution.
```

## Results

By default, the Testbed stores results in a folder hierarchy with the following template:

``./results/[scenario]/[instance_id]/[repetition]``

For example, consider the following folders:

``./results/default_two_agents/two_agent_stocks_gpt4/0``
``./results/default_two_agents/two_agent_stocks_gpt4/1``

...

``./results/default_two_agents/two_agent_stocks_gpt4/9``

This folder holds the results for the ``two_agent_stocks_gpt4`` instance of the ``default_two_agents`` scenario. The ``0`` folder contains the results of the first run. The ``1`` folder contains the results of the second run, and so on. You can think of the _instance_ as mapping to a prompt, or a unique set of parameters, while the _scenario_ defines the template in which those parameters are input.

Within each folder, you will find the following files:

- *timestamp.txt*: records the date and time of the run, along with the version of the pyautogen library installed
- *console_log.txt*: all console output produced by Docker when running autogen. Read this like you would a regular console.
- *chat_completions.json*: a log of all OpenAI ChatCompletions, as logged by ``autogen.ChatCompletion.start_logging(compact=False)``
- *[agent]_messages.json*: for each Agent, a log of their messages dictionaries (see the sketch after this list for one way to load them)
- *./coding*: a directory containing all code written by Autogen, and all artifacts produced by that code
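
These logs are plain JSON, so downstream analysis or metrics scripts can consume them directly. As a rough, illustrative sketch (not one of the included utility scripts), the following assumes the message schema written by ``testbed_utils.finalize()`` and counts how many turns the assistant took in a single run:

```
import json
import os


def count_assistant_turns(run_dir, agent_name="assistant"):
    """Load [agent]_messages.json from a single run folder and count how
    many messages that agent authored. Assumes the file maps each
    conversation partner's name to a list of message dictionaries, as
    written by testbed_utils.finalize()."""
    fname = os.path.join(run_dir, agent_name + "_messages.json")
    with open(fname, "rt") as fh:
        messages = json.load(fh)

    turns = 0
    for partner, msgs in messages.items():
        # Messages the agent itself generated are assumed to be recorded
        # with role "assistant"; received messages carry role "user".
        turns += sum(1 for m in msgs if m.get("role") == "assistant")
    return turns


# Example (hypothetical path following the folder template above):
# print(count_assistant_turns("./results/default_two_agents/two_agent_stocks_gpt4/0"))
```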

## Scenario Templating

All scenarios are stored in JSONL files in the ``./scenarios`` directory. Each line of a scenario file is a JSON object with the following schema:

```
{
    "id": string,
    "template": filename,
    "values": {
        "field_name1": string,
        "field_name2": string,
        ...
        "field_nameN": string
    }
}
```

For example:

```
{
    "id": "two_agent_stocks_gpt4",
    "template": "default_two_agents.py",
    "values": {
        "__MODEL__": "gpt-4",
        "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
    }
}
```

Here, ``id`` is the instance id used when saving results, ``template`` points to a python file that contains the scenario logic, and ``values`` contains a set of strings to find and replace when expanding the template.
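
The expansion itself amounts to a plain textual find-and-replace over the template file. The following sketch is not the actual ``run_scenarios.py`` implementation, just an illustration of the idea (the JSONL path in the commented usage is hypothetical):

```
import json


def expand_template(template_path, values):
    """Read a scenario template and replace each placeholder field
    (e.g. __MODEL__, __PROMPT__) with its value via simple string
    substitution, returning the expanded Python source."""
    with open(template_path, "rt") as fh:
        content = fh.read()
    for field, value in values.items():
        content = content.replace(field, value)
    return content


# Example: expand the first scenario listed in a JSONL file.
# with open("scenarios/default_two_agents.jsonl", "rt") as fh:
#     scenario = json.loads(fh.readline())
#     expanded_source = expand_template(scenario["template"], scenario["values"])
```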

An example templated python file is:

```
from autogen import AssistantAgent, UserProxyAgent, config_list_from_json
import os
import json
import testbed_utils

testbed_utils.init()
##############################

config_list = config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["__MODEL__"]},
)

assistant = AssistantAgent(
    "assistant",
    llm_config={
        "request_timeout": 180,
        "config_list": config_list,
    },
)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False,
    },
    max_consecutive_auto_reply=10,
)
user_proxy.initiate_chat(assistant, message="__PROMPT__")

##############################
testbed_utils.finalize(agents=[assistant, user_proxy])
```

## (Example) Running HumanEval

One sample Testbed scenario type is a variation of the classic [HumanEval](https://github.com/openai/human-eval) benchmark. In this scenario, agents are given access to the unit test results, and are able to continue to debug their code until the problem is solved or they run out of tokens or turns. We can then count how many turns it took to solve the problem (returning -1 if the problem remains unsolved by the end of the conversation, and "" if the run is missing).

Accessing this scenario type requires downloading and converting the HumanEval dataset, running the Testbed, collating the results, and finally computing the metrics. The following commands will accomplish this, running each test instance 3 times with GPT-3.5-Turbo-16k:

```
python utils/download_humaneval.py
python ./run_scenarios.py --repeat 3 scenarios/human_eval_two_agents_gpt35.jsonl
python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 | python utils/metrics_human_eval.py > human_eval_results_gpt35.csv
cat human_eval_results_gpt35.csv
```

``includes/ENV.example``:

```
export BING_API_KEY=
```

``testbed_utils.py``:

```
from importlib.metadata import version as lib_version
from datetime import datetime
import os
import autogen
import json


def init():
    """Helper function to initialize logging in a testbed scenario.
    Specifically, write timestamp and version information, then
    initialize autogen logging.

    Args:
        None

    Returns:
        None
    """

    # Print some information about the run
    with open("timestamp.txt", "wt") as f:
        f.write("Timestamp: " + datetime.now().isoformat() + "\n")
        f.write("pyautogen version: " + lib_version("pyautogen") + "\n")


def finalize(agents):
    """Helper function to finalize logging in a testbed scenario.
    Calling this function will save all the chat completions logged
    by Autogen to disk, and will save the messages dictionaries of
    all agents passed via the agents argument.

    Args:
        agents (list): a list of the agents whose messages will be logged to disk.

    Returns:
        None
    """

    script_dir = os.path.dirname(os.path.realpath(__file__))

    def messages_to_json(agent):
        # agent.chat_messages maps each conversation partner (an Agent
        # object) to the list of messages exchanged with that partner;
        # key the output by the partner's name so it serializes cleanly.
        messages = dict()
        for item in agent.chat_messages.items():
            messages[item[0].name] = item[1]
        return json.dumps(messages, indent=4)

    # Write one [agent]_messages.json file per agent.
    for agent in agents:
        fname = agent.name + "_messages.json"
        with open(os.path.join(script_dir, fname), "wt") as fh:
            fh.write(messages_to_json(agent))
```