Inverse Reinforcement Learning #57

farbod1277 · 2021-08-16T02:16:18Z

Hi,

Firstly, thanks for putting together such an awesome project!

I've been playing around with the singleagent.py problems recently and was wondering if there is any way to incorporate demonstration learning / inverse reinforcement learning into these tasks. Instead of hard-coding the reward functions in _computeReward, I wanted to try learning a reward function from expert demonstrations. (See: Let's do IRL). I'd appreciate if you could give me some pointers on how to modify the _computeReward and the learning process to include an IRL_reward.

Cheers

The text was updated successfully, but these errors were encountered:

JacopoPan · 2021-08-16T16:54:51Z

Hello @farbod1277,

To modify _computeReward is as simple as re-implementing _computeReward (a function returning a scalar) in any (non-abstract) Aviary class/environment—when you have figured out how to learn the "correct" reward, that is the function to implement. However, to do so for IRL (or imitation learning), I'd imagine you want to find a way to collect trajectories/state(obs)-action pairs for a known, working controller.

For that, I'd start from script fly.py in which the actions for N quadcopters in aviary CtrlAviary (whose reward is not-implemented, see line 168) are picked by PID controller DSLPIDControl:

gym-pybullet-drones/examples/fly.py

Lines 152 to 158 in ed9aaa5

    
           for j in range(ARGS.num_drones): 
        
               action[str(j)], _, _ = ctrl[j].computeControlFromState(control_timestep=CTRL_EVERY_N_STEPS*env.TIMESTEP, 
        
                                                                      state=obs[str(j)]["state"], 
        
                                                                      target_pos=np.hstack([TARGET_POS[wp_counters[j], 0:2], INIT_XYZS[j, 2]]), 
        
                                                                      # target_pos=INIT_XYZS[j, :] + TARGET_POS[wp_counters[j], :], 
        
                                                                      target_rpy=INIT_RPYS[j, :] 
        
                                                                      )

farbod1277 · 2021-08-17T13:15:26Z

Hi @JacopoPan ,

Thank you very much for your response. Yes I was thinking of using an xbox controller or arrow keys to fly the drone manually using the PID controller and save those as my expert trajectories, then use imitation learning. Would that be at all possible with this setup? I would need to write the pipeline to interface the simulator with a controller somehow. Any ideas?

Thanks again!!

JacopoPan · 2021-08-17T17:51:28Z

In fly.py, the PID controllers track positions in x,y,z, written in matrix of shape (waypoints,3), TARGET_POS:

gym-pybullet-drones/examples/fly.py

Lines 68 to 69 in ed9aaa5

    
           for i in range(NUM_WP): 
        
               TARGET_POS[i, :] = R*np.cos((i/NUM_WP)*(2*np.pi)+np.pi/2)+INIT_XYZS[0, 0], R*np.sin((i/NUM_WP)*(2*np.pi)+np.pi/2)-R+INIT_XYZS[0, 1], 0

I'd proceed by steps, (i) first write a separate program that translates your controller inputs into such format, (ii) plug these new trajectories in fly.py, an (iii) try to do in online, writing the desired waypoints from the external input as they are tracked.

farbod1277 · 2021-08-18T11:58:25Z

Thanks again @JacopoPan. How about if I wanted to control the drone directly with attitude and thrust control instead of way-points?

JacopoPan · 2021-08-18T15:35:10Z

fly.py uses computeControlFromState

gym-pybullet-drones/examples/fly.py

Line 153 in ed9aaa5

    
           action[str(j)], _, _ = ctrl[j].computeControlFromState(control_timestep=CTRL_EVERY_N_STEPS*env.TIMESTEP,

which simply wraps around ComputeControl

gym-pybullet-drones/gym_pybullet_drones/control/BaseControl.py

Line 58 in ed9aaa5

def computeControlFromState(self,

gym-pybullet-drones/gym_pybullet_drones/control/BaseControl.py

Lines 87 to 96 in ed9aaa5

    
           return self.computeControl(control_timestep=control_timestep, 
        
                                      cur_pos=state[0:3], 
        
                                      cur_quat=state[3:7], 
        
                                      cur_vel=state[10:13], 
        
                                      cur_ang_vel=state[13:16], 
        
                                      target_pos=target_pos, 
        
                                      target_rpy=target_rpy, 
        
                                      target_vel=target_vel, 
        
                                      target_rpy_rates=target_rpy_rates 
        
                                      )

ComputeControl computes thrust and desired roll/pitch/yaw from the waypoint (using the position control method) and feeds those into the attitude control to obtain the desired motor RPMs

gym-pybullet-drones/gym_pybullet_drones/control/DSLPIDControl.py

Lines 119 to 134 in ed9aaa5

    
           thrust, computed_target_rpy, pos_e = self._dslPIDPositionControl(control_timestep, 
        
                                                                            cur_pos, 
        
                                                                            cur_quat, 
        
                                                                            cur_vel, 
        
                                                                            target_pos, 
        
                                                                            target_rpy, 
        
                                                                            target_vel 
        
                                                                            ) 
        
           rpm = self._dslPIDAttitudeControl(control_timestep, 
        
                                             thrust, 
        
                                             cur_quat, 
        
                                             computed_target_rpy, 
        
                                             target_rpy_rates 
        
                                             ) 
        
           cur_rpy = p.getEulerFromQuaternion(cur_quat) 
        
           return rpm, pos_e, computed_target_rpy[2] - cur_rpy[2]

I suppose you'd want to talk directly to this latter method.

farbod1277 · 2021-09-09T00:51:16Z

@JacopoPan thanks again! Just wondering, if I was to control the drone using positional control using ComputeControl how would I retrieve the following intermediary variables? Thrust in the z direction, and torques in the x,y and z directions. I would wanna retrieve these intermediary variables from ComputeControl and use them with the actionSpace of a DynAviary to control a drone, instead of doing it directly with RPM. This way I can record trajectories where the actionSpace is Thrusts and torques instead of RPMs. I hope that makes sense. Any help is appreciated.

What I'm thinking of doing right now is to retrieve the aforementioned variables from inside of _dslPIDAttitudeControl, the variables being thrust and target_torques. And then feeding these into the .step() method of a DynAviary. Would that be correct?

JacopoPan · 2021-09-10T11:06:51Z

I never tried this but it potentially makes sense to me. DynAviary still converts the target thrust/torques into RPM solving a non-negative least squares problem (but I admit it is probably the least tested of the environments).

farbod1277 · 2021-09-12T10:05:55Z

I tried it and I've got the following issue:


[WARNING] iter 1205 in utils.nnlsRPM(), unfeasible pitch torque 3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1205 in utils.nnlsRPM(), unfeasible squared rotor speeds, using NNLS
Negative sq. rotor speeds:       [-90183354888115.02, 90183773559175.77, 90183773559175.77, -90183354888115.02]                 Normalized: [-0.50, 0.50, 0.50, -0.50]
NNLS:                            [0.00, 60122655263137.41, 60122655263137.43, 0.00]                     Normalized: [0.00, 0.71, 0.71, 0.00]            Residual: 208269536884307.03
[WARNING] iter 1210 in utils.nnlsRPM(), unfeasible thrust 1429.12 outside range [0, 0.60]
[WARNING] iter 1210 in utils.nnlsRPM(), unfeasible roll torque 3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1210 in utils.nnlsRPM(), unfeasible pitch torque 3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1210 in utils.nnlsRPM(), unfeasible yaw torque -3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1210 in utils.nnlsRPM(), unfeasible squared rotor speeds, using NNLS
Negative sq. rotor speeds:       [101886302823643.95, 80742096258340.28, 101886302823643.94, -279992160636241.31]               Normalized: [0.31, 0.25, 0.31, -0.86]
NNLS:                            [101886302823643.94, 0.00, 101886302823643.94, 0.00]                   Normalized: [0.71, 0.00, 0.71, 0.00]            Residual: 457744885724003.88
[WARNING] iter 1215 in utils.nnlsRPM(), unfeasible roll torque -3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1215 in utils.nnlsRPM(), unfeasible pitch torque -3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1215 in utils.nnlsRPM(), unfeasible yaw torque 3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1215 in utils.nnlsRPM(), unfeasible squared rotor speeds, using NNLS
Negative sq. rotor speeds:       [-100755667506297.23, -79611460940993.56, -100755667506297.22, 281122795953588.00]             Normalized: [-0.31, -0.24, -0.31, 0.86]
NNLS:                            [0.00, 0.00, 0.00, 254585642306590.12]                         Normalized: [0.00, 0.00, 0.00, 1.00]            Residual: 313233026428473.94
[WARNING] iter 1220 in utils.nnlsRPM(), unfeasible roll torque -3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1220 in utils.nnlsRPM(), unfeasible pitch torque -3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1220 in utils.nnlsRPM(), unfeasible yaw torque 3200.00 outside range [-0.01, 0.01]
[WARNING] iter 1220 in utils.nnlsRPM(), unfeasible squared rotor speeds, using NNLS
Negative sq. rotor speeds:       [-100755667506297.23, -79611460940993.56, -100755667506297.22, 281122795953588.00]             Normalized: [-0.31, -0.24, -0.31, 0.86]
NNLS:                            [0.00, 0.00, 0.00, 254585642306590.12]                         Normalized: [0.00, 0.00, 0.00, 1.00]            Residual: 313233026428473.94
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1653: RuntimeWarning: overflow encountered in multiply
  multiply(a1, b2, out=cp0)
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1654: RuntimeWarning: overflow encountered in multiply
  tmp = array(a2 * b1)
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1655: RuntimeWarning: invalid value encountered in subtract
  cp0 -= tmp
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1656: RuntimeWarning: overflow encountered in multiply
  multiply(a2, b0, out=cp1)
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1657: RuntimeWarning: overflow encountered in multiply
  multiply(a0, b2, out=tmp)
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1658: RuntimeWarning: invalid value encountered in subtract
  cp1 -= tmp
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1659: RuntimeWarning: overflow encountered in multiply
  multiply(a0, b1, out=cp2)
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1660: RuntimeWarning: overflow encountered in multiply
  multiply(a1, b0, out=tmp)
/home/farbod/.local/lib/python3.8/site-packages/numpy/core/numeric.py:1661: RuntimeWarning: invalid value encountered in subtract
  cp2 -= tmp
/home/farbod/git/gym-pybullet-drones/gym_pybullet_drones/control/DSLPIDControlDyn.py:194: UserWarning: Gimbal lock detected. Setting third angle to zero since it is not possible to uniquely determine all angles.
  target_euler = (Rotation.from_matrix(target_rotation)).as_euler('XYZ', degrees=False)
Traceback (most recent call last):
  File "expert_trajectories.py", line 135, in 
    action[str(j)], _, _ = ctrl[j].computeControlFromState(control_timestep=CTRL_EVERY_N_STEPS*env.TIMESTEP,
  File "/home/farbod/git/gym-pybullet-drones/gym_pybullet_drones/control/BaseControl.py", line 87, in computeControlFromState
    return self.computeControl(control_timestep=control_timestep,
  File "/home/farbod/git/gym-pybullet-drones/gym_pybullet_drones/control/DSLPIDControlDyn.py", line 127, in computeControl
    torques = self._dslPIDAttitudeControl(control_timestep,
  File "/home/farbod/git/gym-pybullet-drones/gym_pybullet_drones/control/DSLPIDControlDyn.py", line 233, in _dslPIDAttitudeControl
    target_rotation = (Rotation.from_quat([w, x, y, z])).as_matrix()
  File "rotation.pyx", line 624, in scipy.spatial.transform.rotation.Rotation.from_quat
  File "rotation.pyx", line 528, in scipy.spatial.transform.rotation.Rotation.__init__
ValueError: Found zero norm quaternions in `quat`.

The control works fine when the drone is not required to do any rotations, i.e. going straight up. As soon as it is required to rotate to move in the x and y axis I get the following error in _dslPIDAttitudeControl in my class DSLPIDControlDyn which is the same as DSLPIDControl, the only difference being it returns thrusts and torques as its control output instead of RPMs which are then fed into a DynAviary env instead of a CtrlAviary env. Any ideas on how to fix this would be appreciated! Cheers

farbod1277 · 2021-09-12T12:05:43Z

I fixed it by adding a scaling factor for the thrust and the torques after realising that their scales and clamps are completely different to what DynAviary expects.

JacopoPan · 2021-09-12T14:09:56Z

I see, I think it can make sense. If you run a few tests and eventually want to create a PR for your modified DSLPIDControlDyn (I'd recommend to subclass from DSLPIDControl for as much as possible), contributions are always welcome!

farbod1277 changed the title ~~Inverse Reinforcement Learning~~ [Question] Inverse Reinforcement Learning Aug 16, 2021

JacopoPan added the question Further information is requested label Aug 16, 2021

farbod1277 changed the title ~~[Question] Inverse Reinforcement Learning~~ Inverse Reinforcement Learning Aug 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inverse Reinforcement Learning #57

Inverse Reinforcement Learning #57

farbod1277 commented Aug 16, 2021

JacopoPan commented Aug 16, 2021

farbod1277 commented Aug 17, 2021

JacopoPan commented Aug 17, 2021

farbod1277 commented Aug 18, 2021

JacopoPan commented Aug 18, 2021

farbod1277 commented Sep 9, 2021

JacopoPan commented Sep 10, 2021

farbod1277 commented Sep 12, 2021

farbod1277 commented Sep 12, 2021 •

edited

JacopoPan commented Sep 12, 2021

Inverse Reinforcement Learning #57

Inverse Reinforcement Learning #57

Comments

farbod1277 commented Aug 16, 2021

JacopoPan commented Aug 16, 2021

farbod1277 commented Aug 17, 2021

JacopoPan commented Aug 17, 2021

farbod1277 commented Aug 18, 2021

JacopoPan commented Aug 18, 2021

farbod1277 commented Sep 9, 2021

JacopoPan commented Sep 10, 2021

farbod1277 commented Sep 12, 2021

farbod1277 commented Sep 12, 2021 • edited

JacopoPan commented Sep 12, 2021

farbod1277 commented Sep 12, 2021 •

edited