
Alter the Reward function used in learn.py #6

Closed
amijeet opened this issue Oct 26, 2020 · 4 comments
Labels
question Further information is requested

Comments


amijeet commented Oct 26, 2020

Hi Jacopo, could you please share the reward function you used in learn.py? Could you also suggest how I can alter the reward function already used in learn.py?
And roughly how long does it take to train and reach satisfactory results with learn.py?
Best wishes.


JacopoPan commented Oct 27, 2020

Hello @amijeet,
apologies if I break workflows (especially around learn.py), as I am actively modifying the code.
If you want to get started on single-agent RL, look at this commit and, in particular, these 2 scripts and these 2 classes.

This is a much simplified take-off and hover scenario, with a 2-D observation space (z and the velocity along z) and a 1-D action space (a single RPM value applied to all motors).

The reward is 1 for z between 0.75 and 0.99 and 0 otherwise.
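
A minimal sketch of that piecewise reward, assuming the 2-D observation is laid out as [z, vz] (the function name and layout are assumptions, not the repo's exact code):

def takeoff_reward(obs):
    # obs is assumed to be the 2-D observation [z, vz]
    z = obs[0]
    # 1 inside the target altitude band, 0 everywhere else
    return 1.0 if 0.75 <= z <= 0.99 else 0.0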

In this example, running stable-baselines3's PPO finds a solution in just a few minutes.

$ cd gym-pybullet-drones/experiments/learning/
$ python singleagent.py --env takeoff --algo ppo --pol mlp --input rpm

Output:

Eval num_timesteps=10000, episode_reward=26.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=20000, episode_reward=29.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=30000, episode_reward=58.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=40000, episode_reward=173.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Stopping training because the mean reward 173.00  is above the threshold 100
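
The early stop in the log above is the behavior stable-baselines3 provides through its evaluation callbacks; a minimal sketch of wiring a reward-threshold stop (the env id, eval frequency, and timestep budget below are assumptions, not the exact values in singleagent.py):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# 'takeoff-aviary-v0' is assumed to be the registered id of the take-off env
train_env = gym.make("takeoff-aviary-v0")
eval_env = gym.make("takeoff-aviary-v0")

# stop as soon as an evaluation run exceeds the reward threshold (100, as in the log)
stop_cb = StopTrainingOnRewardThreshold(reward_threshold=100, verbose=1)
eval_cb = EvalCallback(eval_env, callback_on_new_best=stop_cb, eval_freq=10000, verbose=1)

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=int(1e6), callback=eval_cb)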

Of course, more complicated tasks, using higher-dimensional observation and action vectors, can require:

  • More sophisticated reward engineering (see TakeoffAviary.py)
  • And/or customizing the learning network architecture (see singleagent.py; a sketch follows the log below)

as well as much longer training times. E.g., simply making the input 4-D complicates the problem enough that PPO only collects 1/5 of the reward in 15x the number of iterations:

Eval num_timesteps=680000, episode_reward=31.00 +/- 0.00
Episode length: 86.00 +/- 0.00
New best mean reward!
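
Regarding the network-architecture point above, with stable-baselines3 the layer sizes can typically be adjusted through policy_kwargs; a minimal sketch (the sizes, activation, and env id below are illustrative assumptions, not the values used in singleagent.py):

import gym
import torch
from stable_baselines3 import PPO

# 'takeoff-aviary-v0' is assumed to be the registered id of the take-off env
train_env = gym.make("takeoff-aviary-v0")

# two hidden layers of 64 units for the MLP policy (illustrative sizes)
policy_kwargs = dict(activation_fn=torch.nn.ReLU, net_arch=[64, 64])

model = PPO("MlpPolicy", train_env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=100_000)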

I don't have all the answers; the purpose of this gym is exactly to try (and let others try) these things.

JacopoPan added the question label on Oct 28, 2020

ArminBaz commented Dec 6, 2020

Hey @JacopoPan, forgive me if this is a naive question, as I am still relatively new to reinforcement learning and your library. I just ran singleagent.py (from the most recent commit) on takeoff, and I noticed that my model seems to learn far more slowly than the one you showed.

It seems that you were able to break the mean reward threshold after 40000 timesteps, while I am stuck around -30 at 120000. Do you know why this may be happening, and do you have any suggestions on how to speed up the training? Thanks!

Here is the output for reference:

Eval num_timesteps=110000, episode_reward=-30.23 +/- 0.00
Episode length: 242.00 +/- 0.00
Eval num_timesteps=115000, episode_reward=-30.18 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=120000, episode_reward=-30.15 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=125000, episode_reward=-30.12 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!

@JacopoPan

@ArminBaz the reward function in the latest commit is not the same as when I wrote the message above.
Have you tried looking at the performance of the trained agent using the script test_singleagent.py?
The saved results should be under the folder gym-pybullet-drones/experiments/learning/results

$ python ./test_singleagent.py --exp ./results/save-<env>-<algo>-<obs>-<act>-<time-date>

(A return of -30 over the episode should be OK, as there are negative rewards at every point except the desired hover one.)
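
As a purely illustrative sketch of that kind of shaped reward (the actual function in the latest commit is not reproduced here; the target altitude below is an assumption):

def shaped_hover_reward(z, target_z=1.0):
    # 0 only at the desired hover altitude, negative everywhere else,
    # so an episode return around -30 can still correspond to a sensible policy
    return -abs(z - target_z)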


ArminBaz commented Dec 6, 2020

@JacopoPan That makes a lot of sense, thank you for getting back so quickly!
