<a href="https://colab.research.google.com/github/AI4Finance-Foundation/ElegantRL/blob/master/tutorial_BipedalWalker_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BipedalWalker-v3 Example in ElegantRL**

# **Task Description**

[BipedalWalker-v3](https://gym.openai.com/envs/BipedalWalker-v2/) is a robotics task in OpenAI Gym that exercises one of the most fundamental skills: moving. The goal is to get a 2D bipedal walker to walk through rough terrain. BipedalWalker is a difficult task with a continuous action space, and only a few RL implementations can reach the target reward.
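The target reward for this task is 300 (the `target_return` reported in Part 3). A common convention, which we assume here, is to call the task solved once the average return over 100 consecutive episodes reaches that target. A minimal sketch of that check follows; `is_solved` and its parameters are hypothetical names, not part of ElegantRL or Gym.

```python
from collections import deque


def is_solved(episode_returns, target_return=300.0, window=100):
    """Return True once the mean of `window` consecutive episode returns reaches target_return.

    Assumption: "solved" means a sliding-window average over 100 episodes.
    """
    recent = deque(maxlen=window)  # holds only the last `window` returns
    total = 0.0
    for ret in episode_returns:
        if len(recent) == window:
            total -= recent[0]  # remove the oldest return before append evicts it
        recent.append(ret)
        total += ret
        if len(recent) == window and total / window >= target_return:
            return True
    return False


print(is_solved([250.0] * 50 + [310.0] * 100))
```

Averaging over a window rather than checking a single episode keeps one lucky run from counting as success.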

# **Part 1: Install ElegantRL**

In [None]:
# install elegantrl library
!pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

Collecting git+https://github.com/AI4Finance-LLC/ElegantRL.git
  Cloning https://github.com/AI4Finance-LLC/ElegantRL.git to /tmp/pip-req-build-0qrcm_61
  Running command git clone -q https://github.com/AI4Finance-LLC/ElegantRL.git /tmp/pip-req-build-0qrcm_61
Collecting pybullet
  Downloading pybullet-3.2.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (90.8 MB)
     |████████████████████████████████| 90.8 MB 254 bytes/s 
Collecting box2d-py
  Downloading box2d_py-2.3.8-cp37-cp37m-manylinux1_x86_64.whl (448 kB)
     |████████████████████████████████| 448 kB 47.2 MB/s 
Building wheels for collected packages: elegantrl
  Building wheel for elegantrl (setup.py) ... done
  Created wheel for elegantrl: filename=elegantrl-0.3.3-py3-none-any.whl size=185894 sha256=4be8a158b6dd2d3dff15408ea26953c52a7263bf8d63f9e0b44e1cc582c6cc59
  Stored in directory: /tmp/pip-ephem-wheel-cache-nkyrgypx/wheels/52/9a/b3/08c8a0b5be22a65da0132538c05e7e961b1253c90d6845e0c6
Successfully bui

# **Part 2: Import Packages**


*   **elegantrl**: the deep reinforcement learning library installed in Part 1.
*   **OpenAI Gym**: a toolkit for developing and comparing reinforcement learning algorithms.



In [None]:
import gym
from elegantrl.agents import AgentPPO
from elegantrl.train.config import get_gym_env_args, Arguments
from elegantrl.train.run import *  # provides train_and_evaluate

gym.logger.set_level(40)  # 40 = ERROR: hide warnings, show only errors

# **Part 3: Get environment information**

In [None]:
get_gym_env_args(gym.make("BipedalWalker-v3"), if_print=False)

{'action_dim': 4,
 'env_name': 'BipedalWalker-v3',
 'env_num': 1,
 'if_discrete': False,
 'max_step': 1600,
 'state_dim': 24,
 'target_return': 300}
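The output shows a 24-dimensional state and a 4-dimensional continuous action (`if_discrete: False`). In Gym, each BipedalWalker action component is a motor torque bounded to [-1, 1]; we rely on that bound below. As a minimal, Gym-free sketch (the helper names are hypothetical), a uniform random baseline policy and a helper that clips a raw policy output back into the valid range could look like this:

```python
import random
from typing import List

ACTION_DIM = 4                        # action_dim from get_gym_env_args above
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed BipedalWalker torque bounds


def random_action(rng: random.Random) -> List[float]:
    """Sample one continuous action uniformly inside the action bounds."""
    return [rng.uniform(ACTION_LOW, ACTION_HIGH) for _ in range(ACTION_DIM)]


def clip_action(action: List[float]) -> List[float]:
    """Clip a raw policy output (e.g. an unbounded Gaussian sample) into the bounds."""
    return [min(max(a, ACTION_LOW), ACTION_HIGH) for a in action]


print(random_action(random.Random(0)))
print(clip_action([1.7, -0.3, -2.0, 0.5]))  # [1.0, -0.3, -1.0, 0.5]
```

A random policy like this is a useful floor: a trained agent's evaluation return should clearly beat it.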

# **Part 4: Specify Agent and Environment**

*   **agent**: an agent (DRL algorithm) chosen from the agents in the [directory](https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl/agents); here, AgentPPO.
*   **env_func**: the function that creates an environment; here, gym.make creates BipedalWalker-v3.
*   **env_args**: the environment information returned by get_gym_env_args in Part 3, plus the `id` argument that gym.make expects.


In [None]:
env_func = gym.make
env_args = {
    "env_num": 1,
    "env_name": "BipedalWalker-v3",
    "max_step": 1600,
    "state_dim": 24,
    "action_dim": 4,
    "if_discrete": False,
    "target_return": 300,
    "id": "BipedalWalker-v3",
}
args = Arguments(AgentPPO, env_func=env_func, env_args=env_args)
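`env_args` contains more keys than `gym.make` accepts, which is why it also carries an `id` entry. One plausible way a framework can pass such a dictionary to `env_func` is to keep only the keys that match the function's parameter names. The sketch below assumes that behavior; `filter_kwargs` and `fake_make` are hypothetical stand-ins, not ElegantRL's or Gym's actual code.

```python
import inspect


def filter_kwargs(func, kwargs: dict) -> dict:
    """Keep only the entries of kwargs whose key names a parameter of func."""
    params = set(inspect.signature(func).parameters)
    return {key: value for key, value in kwargs.items() if key in params}


def fake_make(id, max_episode_steps=None):  # `id` mirrors gym.make's first parameter
    """Hypothetical stand-in for gym.make."""
    return f"env<{id}>"


env_args = {"env_name": "BipedalWalker-v3", "state_dim": 24, "id": "BipedalWalker-v3"}
kwargs = filter_kwargs(fake_make, env_args)
print(kwargs)               # {'id': 'BipedalWalker-v3'}
print(fake_make(**kwargs))  # env<BipedalWalker-v3>
```

Filtering by signature lets one dictionary describe the environment for both the framework (state_dim, action_dim, ...) and the environment constructor.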

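# **Part 5: Specify hyper-parameters**
A list of hyper-parameters is available [here](https://elegantrl.readthedocs.io/en/latest/api/config.html).

In [None]:
args.target_step = args.max_step * 4  # collect 4 * 1600 = 6400 environment steps before each update
args.gamma = 0.98  # discount factor; effective horizon of about 1 / (1 - 0.98) = 50 steps
args.eval_times = 2**4  # average each evaluation over 16 episodes

# **Part 6: Train and Evaluate the Agent**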

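`train_and_evaluate(args)` alternates between collecting experience, updating the networks, and evaluating the current policy. The following is a simplified, hypothetical outline of such an on-policy loop in plain Python, not ElegantRL's actual implementation. `explore`, `update`, `evaluate`, and `break_step` are assumed names, and we assume training stops early once the average evaluation return reaches `target_return`.

```python
from typing import Callable, List, Tuple


def train_loop_sketch(
    explore: Callable[[int], List[float]],  # returns rewards of target_step new transitions
    update: Callable[[List[float]], None],  # updates the networks from that batch
    evaluate: Callable[[], float],          # returns the average evaluation return (avgR)
    target_step: int,
    target_return: float,
    break_step: int,
) -> Tuple[int, float]:
    total_step = 0
    max_r = float("-inf")
    while total_step < break_step:
        rewards = explore(target_step)  # on-policy: fresh data every iteration
        total_step += len(rewards)
        update(rewards)                 # this batch is not reused after the update
        avg_r = evaluate()
        max_r = max(max_r, avg_r)       # maxR: best avgR so far
        if avg_r >= target_return:      # assumed early-stop rule
            break
    return total_step, max_r


scores = iter([-90.0, 150.0, 310.0])  # made-up avgR values for a stub evaluator
print(train_loop_sketch(lambda n: [0.0] * n, lambda r: None, lambda: next(scores),
                        target_step=6400, target_return=300.0, break_step=10**6))
# (19200, 310.0)
```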
In [None]:
train_and_evaluate(args)

| Arguments Remove cwd: ./BipedalWalker-v3_PPO_0
################################################################################
ID     Step    maxR |    avgR   stdR   avgS  stdS |    expR   objC   etc.
0  6.98e+03  -91.89 |
0  6.98e+03  -91.89 |  -91.89    0.0    109     2 |   -0.39 676.16   0.06  -0.50
0  9.49e+04  -21.05 |
0  9.49e+04  -21.05 |  -21.05    0.4   1600     0 |   -0.05   6.96   0.02  -0.50
0  1.59e+05  -21.05 |  -38.62    1.8   1600     0 |   -0.03   0.34  -0.01  -0.51
0  2.24e+05  -21.05 |  -34.80    3.4   1600     0 |   -0.02   0.31   0.05  -0.52
0  2.94e+05  133.03 |
0  2.94e+05  133.03 |  133.03    4.3   1600     0 |    0.01   0.59  -0.05  -0.53
0  3.65e+05  133.03 |  -95.17    0.2    121     7 |    0.04   0.75   0.05  -0.55
0  4.55e+05  133.03 | -125.18   13.9    268    68 |    0.07   5.88   0.03  -0.56
0  5.37e+05  133.03 |  -63.86   34.8    416   175 |    0.08   7.43  -0.01  -0.57
0  6.20e+05  152.64 |
0  6.20e+05  152.64 |  152.64  137.1   1152   451 |    0.14 

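Understanding the above results:
*   **ID**: the ID of the agent (learner) that produced the row.
*   **Step**: the total number of environment steps collected so far.
*   **maxR**: the highest average evaluation return (**avgR**) reached so far; it changes only when a new best **avgR** is recorded.
*   **avgR**: the average return over the evaluation episodes.
*   **stdR**: the standard deviation of those returns.
*   **avgS** / **stdS**: the average and standard deviation of the number of steps per evaluation episode.
*   **expR**: the average reward per step of the data collected during exploration.
*   **objC**: the objective (loss) value of the Critic Network (Value Network).
*   **etc.**: further agent-specific values; for PPO the first of these is **objA**, the objective value of the Actor Network (Policy Network).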
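The relation between avgR, stdR, and maxR can be reproduced in a few lines. This sketch assumes stdR is the population standard deviation over the `eval_times` evaluation returns; `summarize` and `running_max` are hypothetical helper names. Applied to the avgR column of the log above, `running_max` reproduces its maxR column.

```python
import statistics
from typing import List, Tuple


def summarize(eval_returns: List[float]) -> Tuple[float, float]:
    """avgR and stdR for one evaluation round (population std assumed)."""
    return statistics.mean(eval_returns), statistics.pstdev(eval_returns)


def running_max(avg_returns: List[float]) -> List[float]:
    """maxR column: the best avgR seen up to and including each row."""
    best = float("-inf")
    out = []
    for r in avg_returns:
        best = max(best, r)
        out.append(best)
    return out


avg_r = [-91.89, -21.05, -38.62, -34.80, 133.03, -95.17, -125.18, -63.86, 152.64]
print(running_max(avg_r))
# [-91.89, -21.05, -21.05, -21.05, 133.03, 133.03, 133.03, 133.03, 152.64]
```

This is why maxR can stay at 133.03 while avgR drops to -125.18: maxR only records the best evaluation so far.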