Learning Multi-Robot Decentralized Macro-Action-Based Policies via a Centralized Q-Net

In this paper, we first introduce a new macro-action-based decentralized multi-agent double deep recurrent Q-net (MacDec-MADDRQN), which adopts centralized training with decentralized execution by allowing each decentralized Q-net update to use a centralized Q-net for action selection. To balance centralized and decentralized exploration, a more general version, called Parallel-MacDec-MADDRQN, is also proposed. This repo implements both algorithms.

  • The decentralized macro-action-based policies learned via MacDec-MADDRQN enable the agents to collaboratively push the big box for a higher reward:

  • A team of robots collaborates to bring the correct tools to a human at the right time by running the decentralized macro-action-based policies learned via Parallel-MacDec-MADDRQN:

Installation

  • To create the Anaconda virtual env with all the dependencies:
    cd Anaconda_Env/
    conda env create -f icra2020.yml
    
  • To install the Python module:
    cd MacDec-via-Cen
    pip install -e .
    

MacDec-MADDRQN

Either the decentralized Q-nets or the centralized Q-net can be used as the exploration policy to generate training data; each decentralized Q-net is then optimized via a novel double-Q update rule by minimizing the loss:
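A plausible reconstruction of this loss in standard notation (the symbols below are our assumptions: Q_{\theta_i} is agent i's decentralized macro-action Q-net with target parameters \theta_i^-, Q_\phi is the centralized Q-net, h denotes macro-action-observation histories, m denotes macro-actions, and bold symbols are joint quantities):

    \mathcal{L}(\theta_i) = \mathbb{E}\!\left[\left(r + \gamma\, Q_{\theta_i^-}(h_i', m_i'^{*}) - Q_{\theta_i}(h_i, m_i)\right)^{2}\right],
    \qquad \mathbf{m}'^{*} = \operatorname*{argmax}_{\mathbf{m}'} Q_{\phi}(\mathbf{h}', \mathbf{m}'),

with m_i'^{*} being agent i's entry of the joint macro-action \mathbf{m}'^{*}.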

where the target value for updating each decentralized macro-action Q-net is computed by using the centralized Q-net for macro-action selection and the corresponding decentralized Q-net for value estimation.
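A minimal PyTorch sketch of this double-Q target (not the repo's code; the flat joint-action encoding, tensor names, and shapes are illustrative assumptions):

    import torch

    n_agents, n_actions, batch = 2, 3, 4
    n_joint = n_actions ** n_agents

    # enumerate joint macro-actions so a flat centralized output can be decoded
    joint_action_table = torch.tensor(
        [[j // n_actions, j % n_actions] for j in range(n_joint)]
    )

    q_cen = torch.randn(batch, n_joint)                     # centralized Q(h', .)
    q_dec_target = torch.randn(n_agents, batch, n_actions)  # per-agent target nets
    reward = torch.randn(batch)
    gamma, done = 0.98, torch.zeros(batch)

    best_joint = q_cen.argmax(dim=1)            # centralized macro-action selection
    per_agent = joint_action_table[best_joint]  # (batch, n_agents) action indices

    targets = []
    for i in range(n_agents):
        # decentralized value estimation of agent i's selected macro-action
        q_i = q_dec_target[i].gather(1, per_agent[:, i:i + 1]).squeeze(1)
        targets.append(reward + gamma * (1 - done) * q_i)  # y_i for agent i's loss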

Training in the Box Pushing domain and the Warehouse Tool Delivery domain (single run):

  • Box Pushing (10 x 10)

    ma_dec_cen_hddrqn.py --grid_dim 10 10 --env_name=BP_MA --env_terminate_step=100 --trace_len=15 --batch_size=128 --dec_rnn_h_size=32 --cen_rnn_h_size=64 --train_freq=15 --total_epi=15000 --replay_buffer_size=80000 --l_rate=0.001 --discount=0.98 --start_train=2 --l_mode=0 --cen_explore --eps_end=0.1 --dynamic_h --eps_l_d --save_dir=bpma10 --seed=0 --run_id=0
    
  • Box Pushing (30 x 30)

    ma_dec_cen_hddrqn.py --grid_dim 30 30 --env_name=BP_MA --env_terminate_step=200 --trace_len=45 --batch_size=128 --dec_rnn_h_size=32 --cen_rnn_h_size=64 --train_freq=45 --total_epi=15000 --replay_buffer_size=80000 --eps_l_d_steps=6000 --l_rate=0.001 --discount=0.98 --start_train=2 --l_mode=0 --cen_explore --eps_end=0.1 --dynamic_h --eps_l_d --save_dir=bpma30 --seed=0 --run_id=0
    
  • Warehouse Tool Delivery

    ma_dec_cen_hddrqn.py --env_name=OSD_S_4 --env_terminate_step=150 --batch_size=16 --dec_rnn_h_size=64 --cen_rnn_h_size=64 --train_freq=30 --total_epi=40000 --replay_buffer_size=1000 --eps_l_d_steps=6000 --l_rate=0.0006 --discount=1.0 --start_train=2 --l_mode=0 --eps_end=0.1 --cen_explore --h_explore --sample_epi --dynamic_h --eps_l_d --save_dir=warehouse --seed=0 --run_id=0
    

Parallel-MacDec-MADDRQN

This approach differs from MacDec-MADDRQN in that it involves two parallel environments: agents perform centralized exploration (cen-ε-greedy) in one and decentralized exploration (dec-ε-greedy) in the other.

The centralized Q-net is first trained purely on the centralized experiences, and each decentralized Q-net is then optimized using the new double-Q update rule described above.
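A minimal sketch of this parallel data flow (toy code, not the repo's implementation; the env observations and greedy-policy stand-ins are placeholders):

    import random

    N_AGENTS, N_ACTIONS, EPS = 2, 3, 0.1

    def cen_greedy(joint_obs):
        # stand-in for an argmax over the centralized Q-net's joint values
        return [random.randrange(N_ACTIONS) for _ in range(N_AGENTS)]

    def dec_greedy(obs, i):
        # stand-in for an argmax over agent i's decentralized Q-net
        return random.randrange(N_ACTIONS)

    def cen_explore(joint_obs):
        # cen-eps-greedy: a single eps draw over the whole joint macro-action
        if random.random() < EPS:
            return [random.randrange(N_ACTIONS) for _ in range(N_AGENTS)]
        return cen_greedy(joint_obs)

    def dec_explore(joint_obs):
        # dec-eps-greedy: an independent eps draw per agent
        return [random.randrange(N_ACTIONS) if random.random() < EPS
                else dec_greedy(o, i) for i, o in enumerate(joint_obs)]

    cen_buffer, dec_buffer = [], []
    for step in range(100):
        joint_obs = [0.0] * N_AGENTS                            # toy observations
        cen_buffer.append((joint_obs, cen_explore(joint_obs)))  # parallel env 1
        dec_buffer.append((joint_obs, dec_explore(joint_obs)))  # parallel env 2

    # per iteration: update the centralized Q-net on cen_buffer first, then the
    # decentralized Q-nets on dec_buffer via the double-Q target sketched above.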

  • Training in the warehouse domain
    ma_dec_cen_hddrqn_sep.py --env_name=OSD_S_4 --env_terminate_step=150 --batch_size=16 --dec_rnn_h_size=64 --cen_rnn_h_size=64 --train_freq=30 --total_epi=40000 --replay_buffer_size=1000 --eps_l_d_steps=6000 --l_rate=0.0006 --discount=1.0 --start_train=2 --l_mode=0 --eps_end=0.1 --h_explore --sample_epi --dynamic_h --eps_l_d --save_dir=warehouse_parallel --seed=0 --run_id=0
    

Dec-HDDRQN with Mac-CERTs and Cen-DDRQN with Mac-JERTs

These two methods are, respectively, the pure decentralized and the pure centralized learning frameworks for macro-action-based domains proposed in our CoRL 2019 paper. Here, we only provide example commands for running these two algorithms in the warehouse domain. For details, please refer to the paper or the MacDeepMARL repo.

  • Dec-HDDRQN

    ma_hddrqn.py --env_name=OSD_S_4  --env_terminate_step=150 --batch_size=16 --mlp_layer_size=32 --rnn_h_size=64 --train_freq=30 --total_epi=40000 --replay_buffer_size=1000 --l_rate=0.0006 --start_train=2 --discount=1.0 --sample_epi --h_explore --dynamic_h --init_h=0.2 --end_h=0.4 --h_stable_at=6000 --eps_l_d --eps_l_d_steps=6000 --eps_end=0.1 --save_dir=warehouse_dec_hddrqn --seed=0 --run_id=0
    
  • Cen-DDRQN

    ma_cen_condi_ddrqn.py --env_name=OSD_S_4  --env_terminate_step=150 --batch_size=16 --mlp_layer_size=32 --rnn_h_size=64 --train_freq=30 --total_epi=40000 --replay_buffer_size=1000 --l_rate=0.0006 --start_train=2 --discount=1.0 --sample_epi --h_explore --dynamic_h --eps_l_d --eps_l_d_steps=6000 --eps_end=0.1 --save_dir=warehouse_cen_ddrqn --seed=0 --run_id=0
    

How to Run in a New Domain

  • Encode the new macro/primitive-action domain as a gym env;
  • Add "obs_size", "n_action" and "action_spaces" as properties of the env class;
  • Let the step function return <a, o', r, t, v> instead of <o', r, t> (see the sketch after this list), where
    • a is the list of the agents' current macro/primitive-action indices, List[int];
    • o' is the list of the new macro/primitive observations, List[ndarray];
    • r is the shared reward, float;
    • t is whether the episode terminates, bool;
    • v is a list of binary values indicating whether each agent's macro/primitive action has terminated, List[int]. In the primitive-action version, every entry of v should always be 1.
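Below is a minimal sketch of such an env. It is not from this repo: the class name, the toy dynamics, and the exact value types of obs_size, n_action and action_spaces are illustrative assumptions; only the step return signature follows the interface above.

    import numpy as np

    class ToyMacroEnv:
        """A toy two-agent macro-action env with the <a, o', r, t, v> interface."""

        def __init__(self, n_agents=2):
            self.n_agents = n_agents
            self.obs_size = [3] * n_agents      # per-agent observation dims (assumed type)
            self.n_action = [2] * n_agents      # per-agent macro-action counts (assumed type)
            self.action_spaces = self.n_action  # exact expected type depends on the repo
            self.t_step = 0

        def reset(self):
            self.t_step = 0
            return [np.zeros(3, dtype=np.float32) for _ in range(self.n_agents)]

        def step(self, actions):
            self.t_step += 1
            a = list(actions)                   # current macro-action indices, List[int]
            o = [np.random.rand(3).astype(np.float32) for _ in range(self.n_agents)]
            r = 0.0                             # shared reward, float
            t = self.t_step >= 10               # episode termination, bool
            v = [1] * self.n_agents             # which agents' macro-actions just ended
            return a, o, r, t, v

    env = ToyMacroEnv()
    obs = env.reset()
    a, obs, r, t, v = env.step([0, 1])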

Visualization of the Trained Policies in Simulation

  • Box Pushing (10 x 10)

    cd ./test/
    python test_bp_ma.py --grid_dim 10 10
    
  • Box Pushing (30 x 30)

    cd ./test/
    python test_bp_ma.py --grid_dim 30 30
    
  • Warehouse Tool Delivery

    cd ./test/
    python test_osd_s_policy_dec.py
    

Code Structure

  • ./scripts/ma_hddrqn.py the main training loop of Dec-HDDRQN
  • ./scripts/ma_cen_condi_ddrqn.py the main training loop of Cen-DDRQN
  • ./scripts/ma_dec_cen_hddrqn.py the main training loop of MacDec-MADDRQN
  • ./scripts/ma_dec_cen_hddrqn_sep.py the main training loop of Parallel-MacDec-MADDRQN
  • ./src/rlmamr/method_name the source code for each corresponding method
  • ./src/rlmamr/method_name/team.py the class for a team of agents with feature functions for training
  • ./src/rlmamr/method_name/learning_methods.py core code for the corresponding algorithm
  • ./src/rlmamr/method_name/env_runner.py multi-processing for parallel envs
  • ./src/rlmamr/method_name/model.py the neural network module
  • ./src/rlmamr/method_name/utils/ other useful functions
  • ./src/rlmamr/my_env code for each domain problem

Demo Videos

Please check our YouTube channel for the full real-robot videos.

Paper Citation

If you use this code for your research or find it helpful, please consider citing the following two papers:

@InProceedings{xiao_corl_2019,
    author = "Xiao, Yuchen and Hoffman, Joshua and Amato, Christopher",
    title = "Macro-Action-Based Deep Multi-Agent Reinforcement Learning",
    booktitle = "3rd Annual Conference on Robot Learning",
    year = "2019"
}

@InProceedings{xiao_icra_2020,
    author = "Xiao, Yuchen and Hoffman, Joshua and Xia, Tian and Amato, Christopher",
    title = "Learning Multi-Robot Decentralized Macro-Action-Based Policies via a Centralized Q-Net",
    booktitle = "Proceedings of the International Conference on Robotics and Automation",
    year = "2020"
}
