action-observation pairing in RNN-style training #241
Could you please provide the code snippets? The solution is: do not use a tuple space. Use a dict-style observation space, e.g. with a wrapper like

class Wrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)

    def step(self, action):
        obs_next, rew, done, info = self.env.step(action)
        return Batch(observation=obs_next, action=action), rew, done, info

(But I think the above is not quite right, since we want to combine obs and action, not obs_next; still, this is the idea: use Batch() to put them together.) Examples are in #172, and why a tuple space cannot work well is discussed in #147. I'll add a link to this issue in a future version of the documentation.
I just tried what you suggested, using Batch here, and the code is as follows:

import gym
import numpy as np
import torch
from torch import nn
from tianshou.data import Batch

class ActObsPairing(gym.Wrapper):
    def __init__(self, env):
        gym.Wrapper.__init__(self, env)

    def reset(self):
        observation = self.env.reset()
        action = [0]
        return Batch(act=action, obs=observation)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return Batch(act=[action], obs=obs), reward, done, info

class ADRQN(nn.Module):
    def __init__(self, ...):
        super().__init__()

    def forward(self, s, state, info):
        # get the action's feature
        act = s["act"][0]
        act_onehot = np.zeros(self.action_shape)
        act_onehot[act] = 1
        actnet = ActionNet(...)
        act = actnet(act_onehot)
        # get the observation's feature
        obs = s['obs']
        obsnet = ObsNet(...)
        obs = obsnet(obs)
        # concatenate the two features and feed them through the LSTM
        s = torch.cat((act, obs))
        s, (h, c) = self.lstm(s)
        ...
        return s, {"h": h.transpose(0, 1).detach(),
                   "c": c.transpose(0, 1).detach()}

But when I applied the environment and the network to tianshou's policy, I still got the errors below. I guess the reason is still that tianshou cannot deal with combined observations in the policy, or I didn't find the correct way.
What's the shape of "s"?
Thanks for the tips! I printed "s" out and found its shape is (train_num, 1). I have dealt with the index and dimension issues to match tianshou's architecture as follows:

# get the action's feature
act = [s[:, 0][i]['act'] for i in range(len(s[:, 0]))]
act_onehot = np.zeros((len(act), self.action_shape))
act_onehot[np.arange(len(act)), act] = 1
actnet = ActionNet(self.action_shape)
act = actnet(act_onehot)
# get the observation's feature
obs = [s[:, 0][i]['obs'] for i in range(len(s[:, 0]))]
obsnet = ObsNet(...)
obs = obsnet(obs)
obs = obs.squeeze()
# concatenate the features and reshape for the LSTM
s = torch.cat((act, obs), dim=1)
s = s.view(s.size()[0], -1)
s = s.unsqueeze(0)

Then I put "s" into the LSTM, and I think it is ready for the policy this time, since I got another error (finally, a new error!). Would you mind helping me at this point? :)
act = [s[:, 0][i]['act'] for i in range(len(s[:, 0]))]

This can be simplified: a Batch supports direct attribute access (e.g. s.act), so the per-element loop is unnecessary. I guess it is because you wrap another layer of action: remove the extra list around action in the wrapper's step, i.e. return Batch(act=action, ...) instead of Batch(act=[action], ...). Do you use DQNPolicy?
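As a concrete illustration of the simplification (the values here are hypothetical): a list of Batches is stacked automatically, so the fields come back as plain arrays:

import numpy as np
from tianshou.data import Batch

# three hypothetical frames; passing a list of Batches stacks them
frames = [Batch(act=a, obs=np.zeros((1, 84, 84))) for a in (0, 1, 2)]
s = Batch(frames)
print(s.act)        # array([0, 1, 2]) -- no per-element loop needed
print(s.obs.shape)  # (3, 1, 84, 84)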
If I do that, the observation I extract from the environment with the wrapper looks as expected. However, when I printed out "s" in the network, "s" looks different from the observation I extracted directly from the environment: the paired Batch structure seems to disappear. I found it weird, and probably I did something wrong somewhere. By the way, the type of "s" there is 'numpy.ndarray', so I cannot use the Batch-style access you suggested. Yeah, I used DQNPolicy.
What do you mean by saying "disappear"?

In [4]: import numpy as np; from tianshou.data import Batch
In [5]: s = np.zeros([1, 84, 84])
In [6]: a = 1
In [7]: new_obs = Batch(obs=s, act=a)
In [8]: new_obs
Out[8]:
Batch(
obs: array([[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]),
act: array(1),
)
In [9]: new_obs.obs.shape
Out[9]: (1, 84, 84)
In [10]: new_obs.act.shape
Out[10]: ()
# act is a np.scalar so it has no shape
In [11]: concat_new_obs = Batch([new_obs, new_obs, new_obs])
In [12]: concat_new_obs
Out[12]:
Batch(
obs: array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]],
[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]],
[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]]),
act: array([1, 1, 1]),
)
In [13]: concat_new_obs.act.shape
Out[13]: (3,)
# here act is a np.ndarray after concat, so this has a shape
In [14]: concat_new_obs.obs.shape
Out[14]: (3, 1, 84, 84)
In [15]: concat_new_obs[0]
Out[15]:
Batch(
obs: array([[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]),
act: 1,
)

The above is the expected output. Could you please give me code snippets like that showing any behavior you did not expect? One more thing: have you already read the documentation and gone through the examples here (before the Advanced Topics): https://tianshou.readthedocs.io/en/master/tutorials/batch.html ?
This cannot happen in Batch. Batch will automatically format them into

Batch(
    act: array([4, 4, 4]),
    obs: array([
        obs1,
        obs2,
        obs3,
    ]),
)
Thanks for the quick reply! And thanks for your examples. I have read the documentation on Batch, though probably not carefully enough; still, I think the issues I got are not because of Batch. The only thing I modified using Batch was the step function, and then I passed the new environment and my defined network into DQNPolicy.

The code I used to get this is only a few lines. Is it clear this time? Do you need other code snippets?
Sure, of course. You can send it to my email. I need runnable code to find out what happens. And what's your version of tianshou? Is it the newest?
The issue is marked as resolved.
Hi, I want to pair action and observation in RNN-style training (#19). I noticed the advice in your documentation to modify the step function in gym so that the history stored in the buffer becomes "(a,o), a, (a',o'), r". I tried to do this but still got errors after multiple modifications.
What I have done: I pre-processed the observation into the shape (1, 84, 84), wrote an observation wrapper in gym to change the observation space to gym.spaces.Tuple(), and then wrote another gym.Wrapper to modify the step function. Next, I prepared three networks: two for extracting the action's and the observation's features respectively (both features are 512-d, so they can be concatenated afterwards), and one for the Q-function (a rough sketch of the two feature extractors is shown below). Everything up to this point works, since I can get values from the network after feeding it an observation from the wrapped environment.
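For reference, a rough sketch of the two feature extractors described here; the convolution layout is an assumption (a standard DQN-style stack), and only the (1, 84, 84) input shape and the 512-d feature size come from this description:

import torch
from torch import nn

class ObsNet(nn.Module):
    # Sketch: (1, 84, 84) observation -> 512-d feature.
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )

    def forward(self, obs):
        # obs: (bsz, 1, 84, 84) float tensor -> (bsz, 512)
        return self.conv(obs)

class ActionNet(nn.Module):
    # Sketch: one-hot action -> 512-d feature.
    def __init__(self, action_shape):
        super().__init__()
        self.fc = nn.Linear(action_shape, 512)

    def forward(self, act_onehot):
        # act_onehot: (bsz, action_shape) float tensor -> (bsz, 512)
        return self.fc(act_onehot)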
However, when I plugged the environment and the network into tianshou's policies, I got errors. I think the problem might be that tianshou can only retrieve simple observations rather than combined ones? Do you have examples for such a situation?

(I have struggled with this problem for more than one week :()

Many thanks!