
Alter the Reward function used in learn.py #6

Closed
amijeet opened this issue Oct 26, 2020 · 4 comments
Labels
question Further information is requested

Comments


amijeet commented Oct 26, 2020

Hi Jacopo, could you please share the reward function you used in learn.py? Could you also suggest how I can alter the reward function already used in learn.py?
And roughly how long does it take to train and reach satisfactory results with learn.py?
Best wishes.


JacopoPan commented Oct 27, 2020

Hello @amijeet,
apologies if I break workflows (especially around learn.py), as I am actively modifying the code.
If you want to get started on single-agent RL, look at this commit and, in particular, these 2 scripts and these 2 classes.

This is a much simplified take-off and hover scenario, with a 2-D observation space (z and the velocity along z) and a 1-D action space (a single RPM value applied to all motors).

The reward is 1 for z between 0.75 and 0.99 and 0 otherwise.
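
A minimal sketch of that piecewise reward, assuming the 2-D observation is laid out as [z, vz] (the function name and layout are assumptions, not the repo's exact code):

def takeoff_reward(obs):
    # obs is assumed to be the 2-D observation [z, vz]
    z = obs[0]
    # 1 inside the target altitude band, 0 everywhere else
    return 1.0 if 0.75 <= z <= 0.99 else 0.0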

In this example, running stable-baselines3's PPO finds a solution in just a few minutes.

$ cd gym-pybullet-drones/experiments/learning/
$ python singleagent.py --env takeoff --algo ppo --pol mlp --input rpm

Output:

Eval num_timesteps=10000, episode_reward=26.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=20000, episode_reward=29.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=30000, episode_reward=58.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=40000, episode_reward=173.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Stopping training because the mean reward 173.00  is above the threshold 100
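
The early stop in the log above is the behavior stable-baselines3 provides through its evaluation callbacks; a minimal sketch of wiring a reward-threshold stop (the env id, eval frequency, and timestep budget below are assumptions, not the exact values in singleagent.py):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# 'takeoff-aviary-v0' is assumed to be the registered id of the take-off env
train_env = gym.make("takeoff-aviary-v0")
eval_env = gym.make("takeoff-aviary-v0")

# stop as soon as an evaluation run exceeds the reward threshold (100, as in the log)
stop_cb = StopTrainingOnRewardThreshold(reward_threshold=100, verbose=1)
eval_cb = EvalCallback(eval_env, callback_on_new_best=stop_cb, eval_freq=10000, verbose=1)

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=int(1e6), callback=eval_cb)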

Of course, more complicated tasks, using higher-dimensional observation and action vectors, can require:

  • More sophisticated reward engineering (see TakeoffAviary.py)
  • And/or customizing the learning network architecture (see singleagent.py; a sketch follows the log below)

as well as much longer training times. E.g., simply making the input 4-D complicates the problem enough that PPO only collects 1/5 of the reward in 15x the number of iterations:

Eval num_timesteps=680000, episode_reward=31.00 +/- 0.00
Episode length: 86.00 +/- 0.00
New best mean reward!
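
Regarding the network-architecture point above, with stable-baselines3 the layer sizes can typically be adjusted through policy_kwargs; a minimal sketch (the sizes, activation, and env id below are illustrative assumptions, not the values used in singleagent.py):

import gym
import torch
from stable_baselines3 import PPO

# 'takeoff-aviary-v0' is assumed to be the registered id of the take-off env
train_env = gym.make("takeoff-aviary-v0")

# two hidden layers of 64 units for the MLP policy (illustrative sizes)
policy_kwargs = dict(activation_fn=torch.nn.ReLU, net_arch=[64, 64])

model = PPO("MlpPolicy", train_env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=100_000)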

I don't have all the answers; the purpose of this gym is exactly to try (and let others try) these things.

JacopoPan added the question label on Oct 28, 2020

ArminBaz commented Dec 6, 2020

Hey @JacopoPan, forgive me if this is a naive question, as I am still relatively new to reinforcement learning and your library. I just ran singleagent.py (from the most recent commit) on takeoff, and I noticed that my model seems to learn far more slowly than the one you showed.

It seems that you were able to break the mean reward threshold after 40000 timesteps, while I am stuck around -30 at 120000. Do you know why this may be happening, and do you have any suggestions on how to speed up the training? Thanks!

Here is the output for reference:

Eval num_timesteps=110000, episode_reward=-30.23 +/- 0.00
Episode length: 242.00 +/- 0.00
Eval num_timesteps=115000, episode_reward=-30.18 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=120000, episode_reward=-30.15 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=125000, episode_reward=-30.12 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!

@JacopoPan

@ArminBaz the reward function in the latest commit is not the same as when I wrote the message above.
Have you tried looking at the performance of the trained agent using the script test_singleagent.py?
The saved results should be under the folder gym-pybullet-drones/experiments/learning/results

$ python ./test_singleagent.py --exp ./results/save-<env>-<algo>-<obs>-<act>-<time-date>

(A return of -30 over the episode should be OK, as there are negative rewards at every point except the desired hover one.)
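
As a purely illustrative sketch of that kind of shaped reward (the actual function in the latest commit is not reproduced here; the target altitude below is an assumption):

def shaped_hover_reward(z, target_z=1.0):
    # 0 only at the desired hover altitude, negative everywhere else,
    # so an episode return around -30 can still correspond to a sensible policy
    return -abs(z - target_z)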


ArminBaz commented Dec 6, 2020

@JacopoPan That makes a lot of sense, thank you for getting back so quickly!
