PRs Welcome

DIAYN-PyTorch

While intelligent creatures can explore their environments and learn useful skills without supervision, many RL algorithms rely heavily on the assumption that skills can only be acquired by defining explicit reward functions for them.

Thus, in order to translate this natural, unsupervised way of learning diverse skills into a suitable mathematical formulation, DIAYN (Diversity is All You Need) was proposed as a method for learning useful skills without any domain-specific reward function.

Instead of the real reward of the environment, DIAYN optimizes the following objective:
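In the notation of the paper, with skills Z, states S, and actions A, this objective is:

$$
\mathcal{F}(\theta) = I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S) = \mathcal{H}[A \mid S, Z] + \mathcal{H}[Z] - \mathcal{H}[Z \mid S]
$$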

where z is the skill the agent is learning; since we want the learned skills to be as diverse as possible, z is modeled by a uniform random variable, the choice with the highest entropy.
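Maximizing a variational lower bound on this objective yields a pseudo-reward that replaces the environment reward. It is computed from a learned skill discriminator q_φ(z | s) (following the paper):

$$
r_z(s, a) = \log q_\phi(z \mid s) - \log p(z)
$$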

This pseudo-reward measures how well the skill z can be recognized from the state s that the agent has visited, relative to the prior distribution over z (which is a uniform distribution in the DIAYN paper).
The larger r_z(s, a) is, the more clearly state s identifies skill z; maximizing it therefore pushes each skill toward states that the other skills do not visit, which is exactly what makes the learned skills diverse.

While the discriminator that provides r_z(s, a) is being trained, any conventional RL method can be used concurrently to learn the skill-conditioned policy; DIAYN uses SAC.
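As a rough illustration of how such a pseudo-reward can be computed from a discriminator's logits (the function and variable names below are illustrative, not this repository's exact API):

```python
import torch
import torch.nn.functional as F

def diayn_pseudo_reward(discriminator, states, skills, n_skills):
    """r_z(s) = log q_phi(z|s) - log p(z), with a uniform prior p(z) = 1/n_skills.

    discriminator: module mapping states (batch, obs_dim) to skill logits (batch, n_skills)
    skills:        LongTensor of skill indices, shape (batch,)
    """
    logits = discriminator(states)                                # (batch, n_skills)
    log_q = F.log_softmax(logits, dim=-1)                         # log q_phi(z|s) for every z
    log_q_z = log_q.gather(-1, skills.unsqueeze(-1)).squeeze(-1)  # keep only the active skill
    log_p_z = torch.log(torch.tensor(1.0 / n_skills))             # log of the uniform prior
    return log_q_z - log_p_z                                      # pseudo-reward per transition
```

SAC is then trained exactly as usual, except that this quantity is used in place of the environment reward.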

This repository is a PyTorch implementation of Diversity is All You Need, and the SAC part of the code is based on this repo.
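The policy and value networks also need to know which skill is currently active. A common way to do this in DIAYN-style implementations (not necessarily how this repository structures it) is to sample a skill from the uniform prior at the start of each episode, keep it fixed until the episode ends, and concatenate its one-hot encoding to the raw observation:

```python
import numpy as np

def concat_state_skill(state, skill, n_skills):
    """Append a one-hot encoding of the skill index to the observation."""
    one_hot = np.zeros(n_skills, dtype=np.float32)
    one_hot[skill] = 1.0
    return np.concatenate([np.asarray(state, dtype=np.float32), one_hot])
```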

Results

The x-axis in all of the plots in this section is the episode number.

Hopper

number of skills = 20

[GIFs: one skill similar to the environment's goal and two emergent behaviors; below each, its reward-distribution plot.]

BipedalWalker

number of skills = 50

[GIFs: one skill similar to the environment's goal and two emergent behaviors; below each, its reward-distribution plot.]

MountainCarContinuous

number of skills = 20

[GIFs: one skill similar to the environment's goal and two emergent behaviors; below each, its reward-distribution plot.]

Dependencies

  • gym == 0.17.3
  • mujoco-py == 2.0.2.13
  • numpy == 1.19.2
  • opencv_contrib_python == 4.4.0.44
  • psutil == 5.5.1
  • torch == 1.6.0
  • tqdm == 4.50.0

Installation

pip3 install -r requirements.txt

Usage

How to run

usage: main.py [-h] [--env_name ENV_NAME] [--interval INTERVAL] [--do_train]
               [--train_from_scratch] [--mem_size MEM_SIZE]
               [--n_skills N_SKILLS] [--reward_scale REWARD_SCALE]
               [--seed SEED]

Variable parameters based on the configuration of the machine or user's choice

optional arguments:
  -h, --help            show this help message and exit
  --env_name ENV_NAME   Name of the environment.
  --interval INTERVAL   The interval specifies how often different parameters
                        should be saved and printed, counted by episodes.
  --do_train            The flag determines whether to train the agent or play
                        with it.
  --train_from_scratch  The flag determines whether to train from scratch or
                        continue previous tries.
  --mem_size MEM_SIZE   The memory size.
  --n_skills N_SKILLS   The number of skills to learn.
  --reward_scale REWARD_SCALE   The reward scaling factor introduced in SAC.
  --seed SEED           The randomness' seed for torch, numpy, random & gym[env].
  • To train the agent with the default arguments, execute the following command with the --do_train flag; without it, the agent is played/tested instead. You may change the memory capacity, the environment, and the number of skills to learn as you wish:
python3 main.py --mem_size=1000000 --env_name="Hopper-v3" --interval=100 --do_train --n_skills=20
  • If you want to keep training from your previous run, execute the following:
python3 main.py --mem_size=1000000 --env_name="Hopper-v3" --interval=100 --do_train --n_skills=20 --train_from_scratch
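  • To watch the learned skills after training, omit the --do_train flag; the script should then load the saved parameters for the chosen environment (a matching checkpoint, e.g. in the Checkpoints directory, is assumed to exist):
python3 main.py --env_name="Hopper-v3" --n_skills=20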

An important Note!!!

  • When I tried to resume training from checkpoints, I observed some undesirable behavior from the discriminator: its loss rapidly converged towards 0, but after some epochs it returned to its previous, correct training regime. I suspect this happens because the replay memory is empty at the start of a resumed run and experiences that are familiar to the policy are gradually added to it, so the discriminator trained in the previous run can easily recognize their true skills until the memory grows large enough to contain newer, more novel transitions. Thus, I recommend running your whole training in one go and avoiding checkpoints and successive pausing, even though the feature is provided.

Environments tested

  • Hopper-v3
  • BipedalWalker-v3
  • MountainCarContinuous-v0
  • HalfCheetah-v3

Structure

├── Brain
│   ├── agent.py
│   ├── __init__.py
│   ├── model.py
│   └── replay_memory.py
├── Checkpoints
│   ├── BipedalWalker
│   │   └── params.pth
│   ├── Hopper
│   │   └── params.pth
│   └── MountainCar
│       └── params.pth
├── Common
│   ├── config.py
│   ├── __init__.py
│   ├── logger.py
│   └── play.py
├── Gifs
│   ├── BipedalWalker
│   │   ├── skill11.gif
│   │   ├── skill40.gif
│   │   └── skill7.gif
│   ├── Hopper
│   │   ├── skill2.gif
│   │   ├── skill8.gif
│   │   └── skill9.gif
│   └── MountainCar
│       ├── skill3.gif
│       ├── skill7.gif
│       └── skill8.gif
├── LICENSE
├── main.py
├── README.md
├── requirements.txt
└── Results
    ├── BipedalWalker
    │   ├── running_logq.png
    │   ├── skill11.png
    │   ├── skill40.png
    │   └── skill7.png
    ├── equation.png
    ├── Hopper
    │   ├── running_logq.png
    │   ├── skill2.png
    │   ├── skill8.png
    │   └── skill9.png
    ├── MountainCar
    │   ├── running_logq.png
    │   ├── skill3.png
    │   ├── skill7.png
    │   └── skill8.png
    └── r_z.png
  1. The Brain directory contains the neural network definitions and the agent's decision-making core.
  2. Common contains utility code that is shared by most RL projects and handles auxiliary tasks such as configuration, logging, and playing the trained agent.
  3. main.py is the core module that ties the other parts together and makes the agent interact with the environment.

Reference

  1. Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al., 2018

Acknowledgment

Big thanks to:

  1. @ben-eysenbach for sac.
  2. @p-christ for DIAYN.py.
  3. @johnlime for RlkitExtension.
  4. @Dolokhow for rl-algos-tf2.
