
action-observation pairing in RNN-style training #241

Closed
jujun622 opened this issue Oct 14, 2020 · 11 comments
Labels: question (Further information is requested)

jujun622 commented Oct 14, 2020

Hi, I want to pair action and observation for RNN-style training (#19). I noticed the advice in your documentation to modify the step function in gym so that the history stored in the buffer becomes "(a,o), a, (a',o'), r". I tried to do this but still got errors after multiple modifications.

What I have done: I first preprocessed the observation into shape (1,84,84), wrote an observation wrapper in gym to change the observation space to gym.spaces.Tuple(), and then wrote another gym.Wrapper to modify the step function. Next, I prepared three networks: two for feature extraction of the action and the observation respectively (both produce 512-d features, so they can be concatenated) and one for the Q-function. Everything up to here seems correct, since I can get values from the network when I feed it an observation from the wrapped environment.

However, when I plugged the environment and the network into tianshou's policies, I got errors. I think the problem might be that tianshou can only handle simple observations rather than combined observations? Do you have examples for such a situation?

(I have been struggling with this problem for more than a week :()

Many thanks!

Trinkle23897 (Collaborator) commented Oct 14, 2020

Could you please provide the code snippets?

The solution is: do not use a tuple space. Use a dict-style obs space, for example:

import gym
from tianshou.data import Batch

class Wrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)  # gym.Wrapper stores env as self.env

    def step(self, action):
        obs_next, rew, done, info = self.env.step(action)
        # pack the action and the observation into a single Batch
        return Batch(observation=obs_next, action=action), rew, done, info

(That said, the above may not be exactly what you want, since it pairs the action with obs_next rather than obs; but this is the idea: use Batch() to put them together.)
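A rough sketch of my own (the wrapper name PrevActionWrapper and the dummy initial action are illustrative, not part of tianshou) of the same idea, with a reset() added so the very first observation also comes paired with an action:

import gym
from tianshou.data import Batch

class PrevActionWrapper(gym.Wrapper):
    """Return every observation together with the most recent action."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        return Batch(observation=obs, action=0)  # dummy action before the first step

    def step(self, action):
        obs_next, rew, done, info = self.env.step(action)
        return Batch(observation=obs_next, action=action), rew, done, info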

Examples are in #172, and why tuple spaces do not work well is discussed in #147.

I'll add a link to this issue in a future version of the documentation.

Trinkle23897 added the "question" label on Oct 14, 2020
jujun622 (Author) commented Oct 15, 2020

I just tried what you suggested, using Batch here, and the code is as follows:

from tianshou.data import Batch

class ActObsPairing(gym.Wrapper):
    def __init__(self, env):
        gym.Wrapper.__init__(self, env)
    
    def reset(self):
        observation = self.env.reset()
        action = [0]
        return Batch(act=action, obs=observation)
        
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return Batch(act=[action], obs=obs), reward, done, info
class ADRQN(nn.Module):

    def __init__(self, ...):
        super().__init__()

    def forward(self, s, state, info):
        # get the action's feature
        act = s["act"][0]
        act_onehot = np.zeros(self.action_shape)
        act_onehot[act] = 1

        actnet = ActionNet(..)
        act = actnet(act_onehot)

        # get the observation's feature
        obs = s['obs']
        obsnet = ObsNet(...)
        obs = obsnet(obs)

        s = torch.cat((act, obs))

        s, (h, c) = self.lstm(s)
        ...
        return s, {"h": h.transpose(0, 1).detach(),
                   "c": c.transpose(0, 1).detach()}

But when I plugged the environment into tianshou's policy, I still got the errors below:

[screenshot of the error traceback]

I guess the reason is still that tianshou cannot deal with the combined observations inside the policy, or I haven't found the correct way yet.

Trinkle23897 (Collaborator) commented Oct 15, 2020

What does s in L31 look like? Print it out first.
And I guess it is because you use [0] for a scalar, maybe?

jujun622 (Author) commented Oct 15, 2020

Thanks for the tips!

I printed "s" out and found that its shape is (train_num, 1). I dealt with the indexing and dimension issues to match tianshou's architecture as follows:

        # get the action's feature
        act = [s[:, 0][i]['act'] for i in range(len(s[:, 0]))]
        act_onehot = np.zeros((len(act), self.action_shape))
        act_onehot[np.arange(len(act)), act] = 1

        actnet = ActionNet(self.action_shape)
        act = actnet(act_onehot)

        # get the observation's feature
        obs = [s[:, 0][i]['obs'] for i in range(len(s[:, 0]))]
        obsnet = ObsNet(...)
        obs = obsnet(obs)
        obs = obs.squeeze()

        s = torch.cat((act, obs), dim=1)
        s = s.view(s.size()[0], -1)
        s = s.unsqueeze(0)

Then I put "s" into the LSTM, and I think it is ready for the policy this time, since I got a different error (finally, a new error!):

[screenshot of the new error]

Would you mind helping me with this one? :)

Trinkle23897 (Collaborator) commented Oct 15, 2020

act = [s[:, 0][i]['act'] for i in range(len(s[:,0]))]

This can be simplified to s.act[:, 0]. Same for the obs: s.obs[:, 0]. But PLEASE make sure obs's shape is what you want (I mean, check whether s.obs is what you need rather than s.obs[:, 0]).

I guess it is because you wrap the action in an extra list: remove the [] in Batch(act=[action]) in step and reset, change s.act[:, 0] to s.act, and see what happens.
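As a small illustration of my own (it may not reproduce exactly what the collector does internally, but it shows the extra dimension that the [] introduces when a list of Batch objects is stacked):

from tianshou.data import Batch

b_list = Batch([Batch(act=[4]), Batch(act=[4]), Batch(act=[4])])
b_scalar = Batch([Batch(act=4), Batch(act=4), Batch(act=4)])
print(b_list.act.shape)    # (3, 1) -- the [] adds a trailing dimension
print(b_scalar.act.shape)  # (3,)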

Do you use DQNPolicy?

jujun622 (Author) commented Oct 15, 2020

If I do Batch(act=action, obs=obs) in step and reset, the 'act' disappears from the input of my network. I mean:

  • If I do Batch(act=action, obs=obs) in step and reset:

The observation I extract from the environment with the following code is Batch(act: array(4), obs: array([ ]),), and the shape of obs['obs'] is what I want, (1,84,84).

env = ActObsPairing(make_env('..'))  # make_env is the wrapper I wrote to preprocess the image into shape (1,84,84)
env.reset()
obs = env.step(env.action_space.sample())[0]

However, when I printed out "s" inside the network, "s" looked like [[] [] []] (I set num_train=3) and its shape was (3,1,84,84). It seems "s" has lost the action information.

  • If I do Batch(act=[action], obs=obs) in step and reset, as before:

The observation I extract directly from the environment is Batch(act: array([4]), obs: array([ ]),). "s" in the network looks like [[Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),]] and its shape is (3,1). One more thing: the shape of "obs" here is (84,84) instead of (1,84,84).

I find this weird; I probably did something wrong somewhere.

By the way, the type of "s" is numpy.ndarray, so I cannot use s.act[:, 0].

Yes, I used DQNPolicy.

Trinkle23897 (Collaborator) commented Oct 15, 2020

What do you mean by "disappear"?

In [5]: s = np.zeros([1, 84, 84])

In [6]: a = 1

In [7]: new_obs = Batch(obs=s, act=a)

In [8]: new_obs
Out[8]: 
Batch(
    obs: array([[[0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 ...,
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.]]]),
    act: array(1),
)

In [9]: new_obs.obs.shape
Out[9]: (1, 84, 84)

In [10]: new_obs.act.shape
Out[10]: ()
# act is a np.scalar so it has no shape

In [11]: concat_new_obs = Batch([new_obs, new_obs, new_obs])

In [12]: concat_new_obs
Out[12]: 
Batch(
    obs: array([[[[0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  ...,
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.]]],
         
         
                [[[0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  ...,
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.]]],
         
         
                [[[0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  ...,
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.]]]]),
    act: array([1, 1, 1]),
)

In [13]: concat_new_obs.act.shape
Out[13]: (3,)
# here act is a np.ndarray after concat, so this has a shape

In [14]: concat_new_obs.obs.shape
Out[14]: (3, 1, 84, 84)

In [15]: concat_new_obs[0]
Out[15]: 
Batch(
    obs: array([[[0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 ...,
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.]]]),
    act: 1,
)

The above is the expected output. Could you please give me code snippets like that to show any behavior you did not expect?

One more thing: have you already read the documentation and gone through the examples there (up to the Advanced Topics): https://tianshou.readthedocs.io/en/master/tutorials/batch.html ?

Trinkle23897 (Collaborator) commented Oct 15, 2020

[[Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),]]

This cannot happen with Batch. Batch will automatically format them into

Batch(
    act: array([4, 4, 4]),
    obs: array([
        obs1,
        obs2,
        obs3,
    ]),
)

jujun622 (Author) commented Oct 15, 2020

Thanks for the quick reply! And thanks for your examples.

I have read the documentation on Batch, though probably not carefully enough; still, I think the issues I got are not caused by Batch. The only thing I modified using Batch was the step function, and then I passed the new environment and my network into DQNPolicy.

  • This is the "s" from the network that I got after using Batch(act=action, obs=obs) in the step function; you can see it only contains "obs" and there is no "act" anymore.

[screenshot of the printed s]

The code used to get this just prints s inside the network. I believe that if I did the same operations as you did outside of the policy, I would get the expected output like yours.

  • This is where I got something like [[Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),]]:

[screenshot of the printed s]

Is that clearer now? Do you need other code snippets?

Trinkle23897 (Collaborator) commented Oct 15, 2020

Do you need other code snippets?

Sure, of course. You can send it to my email. I need runnable code to find out what is happening.

And which version of tianshou are you using? Is it the newest?

Trinkle23897 (Collaborator) commented Oct 16, 2020

The issue is marked as resolved:

  1. Do not use Batch(obs=obs, act=act); use Batch(observation=obs, action=act) instead. DQN has the action-masking feature and reserves the keys in Batch(obs, mask, agent_id), so the former naming causes a conflict (which is why the action disappears in the screenshot above).
  2. Make sure the first dimension of every output of the nn model is the batch size.
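For completeness, a rough batch-first network sketch of my own (the class name ADRQNSketch, the layer sizes, and the flatten-plus-linear observation encoder are illustrative, not from this issue); it assumes the wrapper returns Batch(observation=..., action=...) as in point 1, and keeps the batch dimension first everywhere as in point 2:

import torch
import torch.nn as nn


class ADRQNSketch(nn.Module):
    """Q-network over (previous action, observation) pairs, batch-first throughout."""

    def __init__(self, act_dim, hidden=512):
        super().__init__()
        self.act_dim = act_dim
        self.obs_net = nn.Sequential(nn.Flatten(), nn.Linear(1 * 84 * 84, hidden), nn.ReLU())
        self.act_net = nn.Sequential(nn.Linear(act_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, s, state=None, info={}):
        obs = torch.as_tensor(s.observation, dtype=torch.float32)  # (batch, 1, 84, 84)
        act = torch.as_tensor(s.action, dtype=torch.long)          # (batch,)
        act_onehot = nn.functional.one_hot(act, self.act_dim).float()
        feat = torch.cat([self.obs_net(obs), self.act_net(act_onehot)], dim=1)
        hidden_state = None if state is None else (
            torch.as_tensor(state["h"]).transpose(0, 1),
            torch.as_tensor(state["c"]).transpose(0, 1))
        out, (h, c) = self.lstm(feat.unsqueeze(1), hidden_state)   # out: (batch, 1, hidden)
        logits = self.head(out[:, -1])                             # (batch, act_dim)
        return logits, {"h": h.transpose(0, 1).detach(),
                        "c": c.transpose(0, 1).detach()}

The transpose(0, 1) calls follow the convention from the snippets above: the recurrent state is kept batch-first, while nn.LSTM expects (num_layers, batch, hidden).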
