
action-observation pairing in RNN-style training #241

Closed
jujun622 opened this issue Oct 14, 2020 · 11 comments
Labels: question (Further information is requested)

jujun622 commented Oct 14, 2020

Hi, I want to pair action and observation for RNN-style training (#19). I noticed the advice in your documentation to modify the step function in gym so that the history stored in the buffer becomes "(a,o), a, (a',o'), r". I tried to do this but still got errors after multiple modifications.

What I have done: I first preprocessed the observation into shape (1,84,84), wrote an observation wrapper in gym to change the observation space to gym.spaces.Tuple(), and then wrote another gym.Wrapper to modify the step function. Next, I prepared three networks: two for feature extraction of the action and the observation respectively (both produce 512-d features, so they can be concatenated) and one for the Q-function. Everything up to here seems correct, since I can get values from the network when I feed it an observation from the wrapped environment.

However, when I plugged the environment and the network into tianshou's policies, I got errors. I think the problem might be that tianshou can only handle simple observations rather than combined observations? Do you have examples for such a situation?

(I have been struggling with this problem for more than a week :()

Many thanks!

Trinkle23897 (Collaborator) commented Oct 14, 2020

Could you please provide the code snippets?

The solution is: do not use a tuple space. Use a dict-style obs space, for example:

import gym
from tianshou.data import Batch

class Wrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)  # gym.Wrapper stores env as self.env

    def step(self, action):
        obs_next, rew, done, info = self.env.step(action)
        # pack the action and the observation into a single Batch
        return Batch(observation=obs_next, action=action), rew, done, info

(That said, the above may not be exactly what you want, since it pairs the action with obs_next rather than obs; but this is the idea: use Batch() to put them together.)
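A rough sketch of my own (the wrapper name PrevActionWrapper and the dummy initial action are illustrative, not part of tianshou) of the same idea, with a reset() added so the very first observation also comes paired with an action:

import gym
from tianshou.data import Batch

class PrevActionWrapper(gym.Wrapper):
    """Return every observation together with the most recent action."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        return Batch(observation=obs, action=0)  # dummy action before the first step

    def step(self, action):
        obs_next, rew, done, info = self.env.step(action)
        return Batch(observation=obs_next, action=action), rew, done, info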

Examples are in #172, and why tuple spaces do not work well is discussed in #147.

I'll add a link to this issue in a future version of the documentation.

Trinkle23897 added the "question" label on Oct 14, 2020
jujun622 (Author) commented Oct 15, 2020

I just tried what you suggested, using Batch here, and the code is as follows:

from tianshou.data import Batch

class ActObsPairing(gym.Wrapper):
    def __init__(self, env):
        gym.Wrapper.__init__(self, env)
    
    def reset(self):
        observation = self.env.reset()
        action = [0]
        return Batch(act=action, obs=observation)
        
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return Batch(act=[action], obs=obs), reward, done, info
class ADRQN(nn.Module):

    def __init__(self, ...):
        super().__init__()

    def forward(self, s, state, info):
        # get the action's feature
        act = s["act"][0]
        act_onehot = np.zeros(self.action_shape)
        act_onehot[act] = 1

        actnet = ActionNet(..)
        act = actnet(act_onehot)

        # get the observation's feature
        obs = s['obs']
        obsnet = ObsNet(...)
        obs = obsnet(obs)

        s = torch.cat((act, obs))

        s, (h, c) = self.lstm(s)
        ...
        return s, {"h": h.transpose(0, 1).detach(),
                   "c": c.transpose(0, 1).detach()}

But when I plugged the environment into tianshou's policy, I still got the errors below:

[screenshot of the error traceback]

I guess the reason is still that tianshou cannot deal with the combined observations inside the policy, or I haven't found the correct way yet.

Trinkle23897 (Collaborator) commented Oct 15, 2020

What does s in L31 look like? Print it out first.
And I guess it is because you use [0] for a scalar, maybe?

jujun622 (Author) commented Oct 15, 2020

Thanks for the tips!

I printed "s" out and found that its shape is (train_num, 1). I dealt with the indexing and dimension issues to match tianshou's architecture as follows:

        # get the action's feature
        act = [s[:, 0][i]['act'] for i in range(len(s[:, 0]))]
        act_onehot = np.zeros((len(act), self.action_shape))
        act_onehot[np.arange(len(act)), act] = 1

        actnet = ActionNet(self.action_shape)
        act = actnet(act_onehot)

        # get the observation's feature
        obs = [s[:, 0][i]['obs'] for i in range(len(s[:, 0]))]
        obsnet = ObsNet(...)
        obs = obsnet(obs)
        obs = obs.squeeze()

        s = torch.cat((act, obs), dim=1)
        s = s.view(s.size()[0], -1)
        s = s.unsqueeze(0)

Then I put "s" into the LSTM, and I think it is ready for the policy this time, since I got a different error (finally, a new error!):

[screenshot of the new error]

Would you mind helping me with this one? :)

Trinkle23897 (Collaborator) commented Oct 15, 2020

act = [s[:, 0][i]['act'] for i in range(len(s[:,0]))]

This can be simplified to s.act[:, 0]. Same for the obs: s.obs[:, 0]. But PLEASE make sure obs's shape is what you want (I mean, check whether s.obs is what you need rather than s.obs[:, 0]).

I guess it is because you wrap the action in an extra list: remove the [] in Batch(act=[action]) in step and reset, change s.act[:, 0] to s.act, and see what happens.
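As a small illustration of my own (it may not reproduce exactly what the collector does internally, but it shows the extra dimension that the [] introduces when a list of Batch objects is stacked):

from tianshou.data import Batch

b_list = Batch([Batch(act=[4]), Batch(act=[4]), Batch(act=[4])])
b_scalar = Batch([Batch(act=4), Batch(act=4), Batch(act=4)])
print(b_list.act.shape)    # (3, 1) -- the [] adds a trailing dimension
print(b_scalar.act.shape)  # (3,)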

Do you use DQNPolicy?

jujun622 (Author) commented Oct 15, 2020

If I do Batch(act=action, obs=obs) in step and reset, the 'act' disappears from the input of my network. I mean:

  • If I do Batch(act=action, obs=obs) in step and reset:

The observation I extract from the environment with the following code is Batch(act: array(4), obs: array([ ]),), and the shape of obs['obs'] is what I want, (1,84,84).

env = ActObsPairing(make_env('..'))  # make_env is the wrapper I wrote to preprocess the image into shape (1,84,84)
env.reset()
obs = env.step(env.action_space.sample())[0]

However, when I printed out "s" inside the network, "s" looked like [[] [] []] (I set num_train=3) and its shape was (3,1,84,84). It seems "s" has lost the action information.

  • If I do Batch(act=[action], obs=obs) in step and reset, as before:

The observation I extract directly from the environment is Batch(act: array([4]), obs: array([ ]),). "s" in the network looks like [[Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),]] and its shape is (3,1). One more thing: the shape of "obs" here is (84,84) instead of (1,84,84).

I find this weird; I probably did something wrong somewhere.

By the way, the type of "s" is numpy.ndarray, so I cannot use s.act[:, 0].

Yes, I used DQNPolicy.

Trinkle23897 (Collaborator) commented Oct 15, 2020

What do you mean by "disappear"?

In [5]: s = np.zeros([1, 84, 84])

In [6]: a = 1

In [7]: new_obs = Batch(obs=s, act=a)

In [8]: new_obs
Out[8]: 
Batch(
    obs: array([[[0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 ...,
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.]]]),
    act: array(1),
)

In [9]: new_obs.obs.shape
Out[9]: (1, 84, 84)

In [10]: new_obs.act.shape
Out[10]: ()
# act is a np.scalar so it has no shape

In [11]: concat_new_obs = Batch([new_obs, new_obs, new_obs])

In [12]: concat_new_obs
Out[12]: 
Batch(
    obs: array([[[[0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  ...,
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.]]],
         
         
                [[[0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  ...,
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.]]],
         
         
                [[[0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  ...,
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.],
                  [0., 0., 0., ..., 0., 0., 0.]]]]),
    act: array([1, 1, 1]),
)

In [13]: concat_new_obs.act.shape
Out[13]: (3,)
# here act is a np.ndarray after concat, so this has a shape

In [14]: concat_new_obs.obs.shape
Out[14]: (3, 1, 84, 84)

In [15]: concat_new_obs[0]
Out[15]: 
Batch(
    obs: array([[[0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 ...,
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.]]]),
    act: 1,
)

The above is the expected output. Could you please give me code snippets like that to show any behavior you did not expect?

One more thing: have you already read the documentation and gone through the examples there (up to the Advanced Topics): https://tianshou.readthedocs.io/en/master/tutorials/batch.html ?

Trinkle23897 (Collaborator) commented Oct 15, 2020

[[Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),]]

This cannot happen with Batch. Batch will automatically format them into

Batch(
    act: array([4, 4, 4]),
    obs: array([
        obs1,
        obs2,
        obs3,
    ]),
)

jujun622 (Author) commented Oct 15, 2020

Thanks for the quick reply! And thanks for your examples.

I have read the documentation on Batch, though probably not carefully enough; still, I think the issues I got are not caused by Batch. The only thing I modified using Batch was the step function, and then I passed the new environment and my network into DQNPolicy.

  • This is the "s" from the network that I got after using Batch(act=action, obs=obs) in the step function; you can see it only contains "obs" and there is no "act" anymore.

[screenshot of the printed s]

The code used to get this just prints s inside the network. I believe that if I did the same operations as you did outside of the policy, I would get the expected output like yours.

  • This is where I got something like [[Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),] [Batch(act:4, obs: array([]),]]:

[screenshot of the printed s]

Is that clearer now? Do you need other code snippets?

Trinkle23897 (Collaborator) commented Oct 15, 2020

Do you need other code snippets?

Sure, of course. You can send it to my email. I need runnable code to find out what is happening.

And which version of tianshou are you using? Is it the newest?

Trinkle23897 (Collaborator) commented Oct 16, 2020

The issue is marked as resolved:

  1. Do not use Batch(obs=obs, act=act); use Batch(observation=obs, action=act) instead. DQN has the action-masking feature and reserves the keys in Batch(obs, mask, agent_id), so the former naming causes a conflict (which is why the action disappears in the screenshot above).
  2. Make sure the first dimension of every output of the nn model is the batch size.
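For completeness, a rough batch-first network sketch of my own (the class name ADRQNSketch, the layer sizes, and the flatten-plus-linear observation encoder are illustrative, not from this issue); it assumes the wrapper returns Batch(observation=..., action=...) as in point 1, and keeps the batch dimension first everywhere as in point 2:

import torch
import torch.nn as nn


class ADRQNSketch(nn.Module):
    """Q-network over (previous action, observation) pairs, batch-first throughout."""

    def __init__(self, act_dim, hidden=512):
        super().__init__()
        self.act_dim = act_dim
        self.obs_net = nn.Sequential(nn.Flatten(), nn.Linear(1 * 84 * 84, hidden), nn.ReLU())
        self.act_net = nn.Sequential(nn.Linear(act_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, s, state=None, info={}):
        obs = torch.as_tensor(s.observation, dtype=torch.float32)  # (batch, 1, 84, 84)
        act = torch.as_tensor(s.action, dtype=torch.long)          # (batch,)
        act_onehot = nn.functional.one_hot(act, self.act_dim).float()
        feat = torch.cat([self.obs_net(obs), self.act_net(act_onehot)], dim=1)
        hidden_state = None if state is None else (
            torch.as_tensor(state["h"]).transpose(0, 1),
            torch.as_tensor(state["c"]).transpose(0, 1))
        out, (h, c) = self.lstm(feat.unsqueeze(1), hidden_state)   # out: (batch, 1, hidden)
        logits = self.head(out[:, -1])                             # (batch, act_dim)
        return logits, {"h": h.transpose(0, 1).detach(),
                        "c": c.transpose(0, 1).detach()}

The transpose(0, 1) calls follow the convention from the snippets above: the recurrent state is kept batch-first, while nn.LSTM expects (num_layers, batch, hidden).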
