TIMEARENA: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation

TimeArena

TimeArena is a dynamic, interactive environment that incorporates time into the simulation. It challenges agents to handle multiple tasks at once and to save time by processing them in parallel, as an efficient human would.

It is grounded in 30 real-world tasks spanning cooking, household activities, and laboratory work.

More details are in the paper.

Setup Environment

Create a conda environment and install the dependencies:

conda create -n timearena python=3.11
conda activate timearena
pip install -r requirements.txt

Running

Evaluating LLMs

For Closed Source Models

To evaluate GPT or Gemini models, set the corresponding API key as an environment variable.

# To use OpenAI's models
export OPENAI_API_KEY=YOUR_OPENAI_KEY
# To use Google's models
export GOOGLE_API_KEY=YOUR_GOOGLE_KEY
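
A quick check that the key is visible to Python can save a failed run. The following is a minimal sketch, not part of the repository: `missing_key` and the provider labels `"openai"` and `"google"` are hypothetical names chosen for illustration; only the two environment variable names come from the instructions above.

```python
import os

# Environment variable expected for each provider (labels are illustrative).
REQUIRED_KEYS = {"openai": "OPENAI_API_KEY", "google": "GOOGLE_API_KEY"}


def missing_key(provider, environ=None):
    """Return the name of the unset variable for `provider`, or None if it is set."""
    env = os.environ if environ is None else environ
    name = REQUIRED_KEYS[provider]
    # An empty string counts as unset, since an empty key cannot authenticate.
    return None if env.get(name) else name


for provider in REQUIRED_KEYS:
    name = missing_key(provider)
    if name is not None:
        print(f"{name} is not set; export it before evaluating {provider} models")
```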

Running test.sh lets agents interact with TimeArena and saves the interaction trajectory.

# household1 refers to the first task of the household activity scenario.
list=("household1"
"household2"
"household3"
"household4"
"household5"
"household6"
"household7"
"household8"
"household9"
"household10")
for item in "${list[@]}"
do
    python LLM_test.py --taskName "$item" --lm gpt3.5 --total_time 40 --save_path ./trajectory/single --save_name "$item"
done

Required arguments:

--taskName is the task to evaluate. To combine tasks, separate the task names with a comma, e.g., "cooking1,cooking2".
--lm is the language model to evaluate.
--total_time is the maximum time allowed for completing the task(s). For a single task, 40 is enough. For a combination of n tasks, n*40 is recommended (consistent with the setting in the paper).
--save_path is the folder where the generated output is stored.
--save_name is the file name of the generated output.

Optional arguments:

--constraint applies resource constraints. (In the cooking scenario, the number of pots, fryers, and ovens is limited to at most one each.)
--sp uses the self-plan prompting method.
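
The required flags and the n*40 rule can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the save path and save name in the usage line are illustrative. Only the flag names and the 40-per-task budget come from the list above; the optional flags are left out because their exact form is not specified here.

```python
def build_command(task_names, lm, save_path, save_name, time_per_task=40):
    """Return the argument list for one LLM_test.py run (hypothetical helper)."""
    if not task_names:
        raise ValueError("at least one task name is required")
    # Combined tasks are passed as a single comma-separated value.
    task_arg = ",".join(task_names)
    # Recommended budget: 40 per task, so a combination of n tasks gets n * 40.
    total_time = time_per_task * len(task_names)
    return [
        "python", "LLM_test.py",
        "--taskName", task_arg,
        "--lm", lm,
        "--total_time", str(total_time),
        "--save_path", save_path,
        "--save_name", save_name,
    ]


# Illustrative values: two combined cooking tasks evaluated with gpt3.5.
cmd = build_command(["cooking1", "cooking2"], "gpt3.5",
                    "./trajectory/combined", "cooking1_cooking2")
print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with the comma-separated task names.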

For Open Source Models

For open-source models, deploy the model as an OpenAI-compatible API server using vLLM.

For example:

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8001

Then, running test.sh lets agents interact with TimeArena and saves the interaction trajectory.

# household1 refers to the first task of the household activity scenario.
list=("household1"
"household2"
"household3"
"household4"
"household5"
"household6"
"household7"
"household8"
"household9"
"household10")
for item in "${list[@]}"
do
    python LLM_test.py --taskName "$item" --lm mistral --total_time 40 --save_path ./trajectory/single --save_name "$item" --model_name ../hf_model/Mistral-7B-Instruct-v0.2 --ip http://localhost --port 8090
done

Required arguments:

--taskName is the task to evaluate. To combine tasks, separate the task names with a comma, e.g., "cooking1,cooking2".
--lm is the language model to evaluate.
--total_time is the maximum time allowed for completing the task(s). For a single task, 40 is enough. For a combination of n tasks, n*40 is recommended (consistent with the setting in the paper).
--save_path is the folder where the generated output is stored.
--save_name is the file name of the generated output.
--model_name is the model name (or path) used when deploying the model with vLLM.
--ip is the IP address of the model server.
--port is the port of the model server. It should match the port the vLLM server was started on; note that the vLLM example above uses 8001 while this script passes 8090.

Optional arguments:

--constraint applies resource constraints. (In the cooking scenario, the number of pots, fryers, and ovens is limited to at most one each.)
--sp uses the self-plan prompting method.
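
Before starting a long evaluation loop, it can help to confirm that the vLLM server is reachable at the address given by --ip and --port. The following is a minimal sketch that relies on the `/v1/models` route of vLLM's OpenAI-compatible server. `models_url` and `list_served_models` are hypothetical helpers, and the way the address is joined here is an assumption about how the two flags combine, not a description of what LLM_test.py does.

```python
import json
import urllib.request


def models_url(ip, port):
    """Build the model-listing URL of an OpenAI-compatible server."""
    # Assumption: --ip carries the scheme (e.g. http://localhost) and --port the port.
    return f"{ip.rstrip('/')}:{port}/v1/models"


def list_served_models(ip, port, timeout=5.0):
    """Return the model ids the server reports; raises OSError if it is unreachable."""
    with urllib.request.urlopen(models_url(ip, port), timeout=timeout) as resp:
        payload = json.load(resp)
    # The OpenAI-style response wraps the models in a "data" list.
    return [entry["id"] for entry in payload["data"]]
```

Calling `list_served_models("http://localhost", 8001)` against the server from the example above should return the name the model was served under, which is the kind of value --model_name expects.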

Calculate Four Metrics

The code in the metrics folder calculates the four metrics from the paper: Average Progress Score (score, AS), Completion Speed (score per minute, CS), Task Completion Rate (%, CR), and Average Completion Time (minutes, CT).

To calculate the metrics:

cd metrics
bash bash_cal_metric.sh

For example:

# Path to the folder of trajectories to calculate the metrics for
MY_PATH="../trajectory"

python cal_metric.py --path "$MY_PATH"
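
To make the units of the four metrics concrete, the following is a minimal sketch of one plausible reading of them. It is not the logic of cal_metric.py, and the paper gives the authoritative definitions. `RunSummary` and `four_metrics` are hypothetical names, and the sketch assumes each trajectory has already been reduced to a score, the time at which that score was reached, and a completion flag.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunSummary:
    score: float     # assumed: highest progress score reached in the run
    minutes: float   # assumed: time at which that score was reached
    completed: bool  # assumed: whether the run finished all of its tasks


def four_metrics(runs):
    """Aggregate AS, CS, CR and CT over a list of run summaries (illustrative)."""
    if not runs:
        raise ValueError("no runs to aggregate")
    done = [r for r in runs if r.completed]
    # Skip runs with zero elapsed time to avoid dividing by zero.
    speeds = [r.score / r.minutes for r in runs if r.minutes > 0]
    return {
        "AS": mean(r.score for r in runs),            # score
        "CS": mean(speeds) if speeds else 0.0,        # score per minute
        "CR": 100.0 * len(done) / len(runs),          # percent
        "CT": mean(r.minutes for r in done) if done else None,  # minutes
    }


runs = [RunSummary(100.0, 20.0, True), RunSummary(50.0, 25.0, False)]
print(four_metrics(runs))
```

In this reading, CT is averaged over completed runs only, so it is undefined (`None`) when no run completes.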

Calculate Oracle Performance

To calculate the oracle performance of tasks, enter the list of tasks in bash_cal_oracle.sh and run it:

cd algorithm
bash bash_cal_oracle.sh

For example:

# The list contains ten combined tasks. For example, the first combines cooking1 and cooking2 (the first and second tasks of the cooking scenario).
list=('cooking1,cooking2'
'cooking2,cooking3'
'cooking3,cooking4'
'cooking4,cooking5'
'cooking5,cooking6'
'cooking6,cooking7'
'cooking7,cooking8'
'cooking8,cooking9'
'cooking9,cooking10'
'cooking10,cooking1')

python cal_oracle.py --task "${list[@]}"
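
The list above pairs each task with the next one and wraps the last task back to the first. The following is a minimal sketch of generating such a list for any scenario. `adjacent_pairs` is a hypothetical helper, not part of the repository; the scenario prefix and task count are the only inputs.

```python
def adjacent_pairs(scenario, count):
    """Pair task i with task i+1, wrapping the last task back to the first."""
    names = [f"{scenario}{i}" for i in range(1, count + 1)]
    # The modulo makes the final pair end with the first task.
    return [f"{names[i]},{names[(i + 1) % count]}" for i in range(count)]


pairs = adjacent_pairs("cooking", 10)
print(pairs[0], pairs[-1])  # cooking1,cooking2 cooking10,cooking1
```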

Contact

If you have any problems, please contact Yikai Zhang.

Citation

If you find our paper or related resources valuable to your research, please cite:

@article{zhang2024timearena,
  title={TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation},
  author={Zhang, Yikai and Yuan, Siyu and Hu, Caiyu and Richardson, Kyle and Xiao, Yanghua and Chen, Jiangjie},
  journal={arXiv preprint arXiv:2402.05733},
  year={2024}
}
