
RNN support #19

Closed
3 of 8 tasks
miriaford opened this issue Apr 4, 2020 · 55 comments
Assignees
Labels
enhancement Feature that is not a new algorithm or an algorithm enhancement good first issue Good for newcomers

Comments

@miriaford

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website, and in particular read the known issues
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, sys
    print(tianshou.__version__, sys.version, sys.platform)

I see in the README that RNN support is on your TODO list. However, the module API already seems to support RNNs (the forward(obs, state) method). Could you please provide some examples of how to train an RNN policy? Thanks!

@Trinkle23897
Collaborator

Oh yes, in fact it already supports an RNN policy, but I will write a small demonstration script after finishing the documentation. Thanks for pushing me.

@miriaford
Author

Thanks! Do you mind quickly posting a snippet here and explaining how to use the RNN policy? I need to run something ASAP, thanks a lot!

@fengredrum
Contributor

Actually, my experiments show that an RNN policy is nearly always inferior to an MLP policy in robotic control domains.
Besides, RNNs suffer from slow training speed :)

@miriaford
Author

miriaford commented Apr 5, 2020

It varies from case to case, but I agree that an RNN might not be the best option. I do need it as a baseline, though, so I would still appreciate it if someone could explain how to run an RNN policy with Tianshou. Thanks!

@Trinkle23897 Trinkle23897 self-assigned this Apr 5, 2020
@Trinkle23897
Collaborator

Trinkle23897 commented Apr 8, 2020

@miriaford I uploaded a script in test/discrete/test_drqn.py.
What I have changed in the codebase:

  1. modified the replay buffer to support frame_stack sampling;
  2. fixed the collector to clear the state when the corresponding episode finishes.

How to use:

  1. add the argument "stack_num" to the training collector's replay buffer (code snippet);
  2. change your actor network to a recurrent one (code snippet);
  3. use the current git master branch instead of the PyPI wheel: pip3 install git+https://github.com/thu-ml/tianshou.git@master.

That's all. If you find anything wrong, please give me feedback.
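For reference, a condensed sketch of these steps, roughly following test/discrete/test_drqn.py (class locations and argument names depend on the installed tianshou version, so treat it as illustrative rather than copy-paste ready):

import gym
import torch

from tianshou.data import Collector, ReplayBuffer
from tianshou.policy import DQNPolicy
from net import Recurrent  # the recurrent (LSTM) Q-network in test/discrete/net.py

env = gym.make('CartPole-v0')
state_shape = env.observation_space.shape
action_shape = env.action_space.n

# step 2: a recurrent Q-network instead of the MLP one
net = Recurrent(layer_num=3, state_shape=state_shape,
                action_shape=action_shape, device='cpu')
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = DQNPolicy(net, optim, discount_factor=0.9,
                   estimation_step=3, target_update_freq=320)

# step 1: stack_num in the training collector's replay buffer
train_collector = Collector(policy, gym.make('CartPole-v0'),
                            ReplayBuffer(20000, stack_num=4))
test_collector = Collector(policy, gym.make('CartPole-v0'))

The two collectors are then passed to offpolicy_trainer exactly as in test_dqn.py.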

@miriaford
Author

Cool, thanks a lot!

Does it also work out of the box for PPO and SAC? Do you have any benchmark scripts/results for them?

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 19, 2020

@miriaford I haven't tried RNN on PPO and SAC, but you can be the first. Follow the changes between test_dqn.py and test_drqn.py and you can construct your first RNN-style PPO and SAC!

@miriaford
Author

Hi, I might have missed something obvious. Could you please educate me on:

  1. In these lines (https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L90-L91), why do you add .detach()? What happens if you don't?
  2. In the PPO policy (lines), why doesn't the critic support the RNN interface like the actor does?

Thanks for your help!

@Trinkle23897
Collaborator

  1. It's my fault. Previously the step in the collector was not wrapped in with torch.no_grad(). I'll remove the .detach() (see the sketch below).
  2. Good catch! When I implemented GAE, I did not think about RNNs. I'll upgrade the interface.
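A tiny illustration of point 1, using a bare GRU rather than the actual collector code:

import torch
import torch.nn as nn

gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
obs = torch.zeros(1, 1, 4)  # one environment, one step
hidden = None
# Because the collection step runs under no_grad, the hidden state carried to
# the next step has no autograd graph attached, so no explicit .detach() is needed.
with torch.no_grad():
    out, hidden = gru(obs, hidden)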

@miriaford
Author

This is great! Would love to check out the new interface.

Another quick question:
https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L70-L79
If I understand this correctly, the [bsz, len, dim] shape only occurs when learning the parameters; in all other cases [bsz, dim] is passed in, because evaluation runs the RNN one forward step at a time? Then what's the practical impact of a longer vs. shorter len for training?

Thanks again!

@Trinkle23897
Collaborator

Yes, you are right. The longer the length, the smaller the difference in RNN performance between training and testing.
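As a toy illustration of the shape handling being discussed (dimensions chosen arbitrarily; this mirrors the net.py lines linked above but is not the actual code):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=128, num_layers=1, batch_first=True)
fc = nn.Linear(128, 2)

def q_values(obs):
    # evaluation feeds [bsz, dim] one step at a time; learning feeds stacked [bsz, len, dim]
    s = torch.as_tensor(obs, dtype=torch.float32)
    if s.dim() == 2:
        s = s.unsqueeze(1)       # [bsz, dim] -> [bsz, 1, dim]
    out, (h, c) = lstm(s)        # out: [bsz, len, 128]
    return fc(out[:, -1])        # last time step -> [bsz, 2]

q_values(torch.zeros(8, 4))      # evaluation-style input
q_values(torch.zeros(8, 4, 4))   # training-style stacked input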

@miriaford
Author

Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!

@Trinkle23897
Collaborator

Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!

Sure! I'm currently fixing issue #38. Maybe today or tomorrow I'll resolve your issue.

@Trinkle23897 Trinkle23897 reopened this Apr 29, 2020
@miriaford
Author

Thanks for reopening the issue. I'd be happy to test it when you finish, thanks!

@miriaford
Author

Any updates? Sorry for pinging again. Thanks!

@Trinkle23897
Collaborator

I'm coding it.

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 30, 2020

diff --git a/test/continuous/test_ppo.py b/test/continuous/test_ppo.py
index e94a917..c496de2 100644
--- a/test/continuous/test_ppo.py
+++ b/test/continuous/test_ppo.py
@@ -12,7 +12,7 @@ from tianshou.trainer import onpolicy_trainer
 from tianshou.data import Collector, ReplayBuffer
 
 if __name__ == '__main__':
-    from net import ActorProb, Critic
+    from net import RecurrentActorProb, RecurrentCritic
 else:  # pytest
     from test.continuous.net import ActorProb, Critic
 
@@ -71,11 +71,11 @@ def test_ppo(args=get_args()):
     train_envs.seed(args.seed)
     test_envs.seed(args.seed)
     # model
-    actor = ActorProb(
+    actor = RecurrentActorProb(
         args.layer_num, args.state_shape, args.action_shape,
         args.max_action, args.device
     ).to(args.device)
-    critic = Critic(
+    critic = RecurrentCritic(
         args.layer_num, args.state_shape, device=args.device
     ).to(args.device)
     optim = torch.optim.Adam(list(
@@ -95,7 +95,7 @@ def test_ppo(args=get_args()):
         gae_lambda=args.gae_lambda)
     # collector
     train_collector = Collector(
-        policy, train_envs, ReplayBuffer(args.buffer_size))
+        policy, train_envs, ReplayBuffer(args.buffer_size, stack_num=4))
     test_collector = Collector(policy, test_envs)
     # log
     log_path = os.path.join(args.logdir, args.task, 'ppo')

The code has been pushed.

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 30, 2020

I'm not sure whether or not to support stacked actions for the Q-network. That would require changing some lines of code in every policy that needs a Q-network (DDPG, TD3, SAC). Alternatively, you can create the Q-network as (stacked-s, single-a) -> Q value, which is what this RecurrentCritic does:

import numpy as np
import torch
import torch.nn as nn

class RecurrentCritic(nn.Module):
    def __init__(self, layer_num, state_shape, action_shape=0, device='cpu'):
        super().__init__()
        self.state_shape = state_shape
        self.action_shape = action_shape
        self.device = device
        self.nn = nn.LSTM(input_size=np.prod(state_shape), hidden_size=128,
                          num_layers=layer_num, batch_first=True)
        self.fc2 = nn.Linear(128 + np.prod(action_shape), 1)

    def forward(self, s, a=None):
        if not isinstance(s, torch.Tensor):
            s = torch.tensor(s, device=self.device, dtype=torch.float)
        # s is [bsz, len, dim] during training and [bsz, dim] during evaluation,
        # i.e. the training tensor has one extra (time) dimension.
        assert len(s.shape) == 3
        self.nn.flatten_parameters()
        s, (h, c) = self.nn(s)
        s = s[:, -1]  # keep only the last time step
        if a is not None:
            if not isinstance(a, torch.Tensor):
                a = torch.tensor(a, device=self.device, dtype=torch.float)
            s = torch.cat([s, a], dim=1)
        s = self.fc2(s)
        return s

But for PPO, the current version is enough.
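A toy usage example of the (stacked-s, single-a) critic above, with arbitrary shapes just to show the expected tensor layout (continuing from the snippet above):

critic = RecurrentCritic(layer_num=1, state_shape=(3,), action_shape=(1,))
s = torch.zeros(32, 4, 3)  # [bsz, stack_num, state_dim] from a stack_num=4 buffer
a = torch.zeros(32, 1)     # [bsz, action_dim], a single (non-stacked) action
q = critic(s, a)           # -> shape [32, 1]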

@miriaford
Author

Thanks!! Will take a look and let you know!

@Trinkle23897 Trinkle23897 added enhancement Feature that is not a new algorithm or an algorithm enhancement good first issue Good for newcomers labels May 5, 2020
@Trinkle23897 Trinkle23897 moved this from Usage to Other feature requests in Issue/PR Categories May 11, 2020
@Trinkle23897
Collaborator

Trinkle23897 commented May 14, 2020

I'm not sure whether or not to support stacked action for the Q network. This needs to change some lines of code in all policy which need a Q-network (ddpg, td3, sac). Or you can create the Q-network as (stacked-s, single-a) -> Q value

Aha, I know how to support it (Q(stacked-s, stacked-a)) without modifying any part of the codebase. Try adding a gym.Wrapper that modifies the state to {'original_state': state, 'action': action}. That means that instead of saving (s, a, s', r, d) in the replay buffer the traditional way, we now save ([s, a], a, [s', a'], r, d).
And on your network side, extract the [s, a] pair from the batch.state.
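A rough sketch of such a wrapper (the class name is hypothetical; it assumes a Box action space and the classic 4-tuple gym step API used at the time of this thread, and a full version should also update observation_space to a gym.spaces.Dict):

import gym
import numpy as np

class AppendActionWrapper(gym.Wrapper):
    """Expose {'original_state': obs, 'action': last_action} as the observation,
    so the buffer effectively stores ([s, a], a, [s', a'], r, d)."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        zero_act = np.zeros(self.env.action_space.shape, dtype=np.float32)
        return {'original_state': obs, 'action': zero_act}

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        wrapped = {'original_state': obs,
                   'action': np.asarray(action, dtype=np.float32)}
        return wrapped, rew, done, info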

@Trinkle23897 Trinkle23897 pinned this issue May 16, 2020
@Trinkle23897
Collaborator

I think there is a bug in the implementation of the framestack option.

This issue has been fixed on the dev branch. Yes, keys_partial = set.union(*keys_map) - keys_share is now used instead.

Should we merge the essential commit from dev to master at this point?

@Trinkle23897
Collaborator

Also, it seems that if sample_avail is set, the batch that buffer.sample() returns would not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?

What's your desired feature? Could you please provide a simple example?

@Trinkle23897 Trinkle23897 reopened this Jul 13, 2020
@Arthur-Null

I am trying to reach O(n) complexity in policy.learn(). In the current implementation, when sample_avail is not set, O(n) can't be reached. When sample_avail is set, policy.learn() can only get part of the reward, which makes it impossible to calculate returns.

@Arthur-Null

It is possible to use the original replay buffer without framestack and sample_avail and leave all the dirty work to the policy. I wonder if there is a better way?

@Trinkle23897
Collaborator

Trinkle23897 commented Jul 13, 2020

Do you mean sampling only available indexes with an on-policy method for RNN usage? I haven't figured it out clearly, so let's take an example:

(This is a buffer)
Index 0 1 2 3 4 5 6 7 8 9 a b c d e f
Done  0 0 1 0 0 0 1 0 0 0 1 0 1 / / /

If I understand correctly, you want to only extract [0-2], [3-6], [7-a], [b-c] without overlap. What I'm not sure about is the data format that you expect. (and for me "O(n)" is ambiguous)
buffer.sample(0) can do something like this since it will extract all of the data, but only once. I think it is not what you want.

@Arthur-Null

Let me state my question more clearly and thoroughly.
To me, the frame_stack solution has two problems:

  1. As mentioned before, some redundant calculation is included. To calculate the action distribution for sequence [0, 1, 2], we have to do RNN calculation on at least three sequences [0, 0, 0], [0, 0, 1], [0, 1, 2]. (This is what I mean by O(n^2). Only running RNN([0, 1, 2]) once is O(n)).
  2. Sequences like [0, 0, 0] would introduce a bias, since we always take the last output of the RNN as the final output.

By setting sample_avail to True, the first problem is solved to some extent. However, sequences like [b, c, c] may still exist, so the second problem remains. Also, in the current implementation, only the reward at the end of an episode is returned by buf.sample(), which makes training impossible.

The data format I am expecting is something like PyTorch's PackedSequence, which includes the sequences and their lengths. It seems a bit complicated to implement this in the replay buffer, so I am now implementing it in the process_fn of my policy.
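For reference, a minimal sketch of building a PackedSequence-style batch from whole episodes (the episode splitting and function name are illustrative, not tianshou API):

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

def pack_episodes(episodes):
    # episodes: list of [len_i, dim] tensors, each a full, non-overlapping episode
    lengths = torch.tensor([len(e) for e in episodes])
    order = torch.argsort(lengths, descending=True)
    padded = pad_sequence([episodes[i] for i in order], batch_first=True)
    return pack_padded_sequence(padded, lengths[order], batch_first=True)

# each episode is then processed by the RNN exactly once (the O(n) pass above)
packed = pack_episodes([torch.zeros(5, 3), torch.zeros(2, 3), torch.zeros(7, 3)])
out, (h, c) = torch.nn.LSTM(3, 16, batch_first=True)(packed)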

@Trinkle23897
Collaborator

So you want a new sample method: you specify the number of episodes you want, and it returns the sampled episodes, for example [[7-a], [3-6], [7-a], [b-c]] with episode_size=4, am I right?

@Arthur-Null

Yes

@Trinkle23897
Collaborator

Trinkle23897 commented Jul 13, 2020

Okay, I see. I think the simplest way is to implement it on the buffer side. Since we can easily maintain the done flag, we can do something like:

# begin_index stores all the start index of one episode (if available), e.g., [0, 3, 7, b]
episode_index = np.random.choice(begin_index, episode_size)  # assume we get [b, 7, 3, 7]
# do a sorting with key=-episode_length, since we want to mimic torch's PackedSequence
# and after that the episode_index is [7, 7, 3, b]
# should maintain a list of episode_length, in the above case: [3, 4, 4, 2]
data, seq_length = [], []
while not self._meta.done[episode_index][0]:  # since the first episode's length is the longest
    # find out the episodes that haven't terminated
    available_episode_index = episode_index[self._meta.done[episode_index] == 0]
    data.append(self[available_episode_index])  # should set stack_num=0
    seq_length.append(len(available_episode_index))
    episode_index += self._meta.done[episode_index] == 0
data = Batch.cat(data)
# for now, (data, seq_length) is something like a PackedSequence.

I think maintaining begin_index and episode_length is not a big problem. It is very similar to my implementation of sample_avail.

I'd prefer to add a new argument to the Collector's initialization to indicate whether to use this mode.

@duburcqa
Collaborator

Should we merge the essential commit from dev to master at this point?

Yes I think it is appropriate.

@Trinkle23897
Collaborator

Trinkle23897 commented Jul 13, 2020

Should we merge the essential commit from dev to master at this point?

Yes I think it is appropriate.

But should we release a new version, like 0.2.4.rc1, since we want to keep master stable?

@duburcqa
Collaborator

duburcqa commented Jul 13, 2020

But should we release a new version, like 0.2.4.rc1, since we want to keep master stable?

It's up to you; it's really a personal choice. For a Python-only project, I tend to prefer releasing patches like 0.2.4.rc1 in the case of "critical" bug fixes, but there is no clear answer. I would say the minimum is to push them to master, but not necessarily to create an actual release.

@heartInsert

Did someone try a transformer before? In the NLP area, transformers seem to perform better than RNN units.

@Trinkle23897
Collaborator

But in RL, the bigger the model, the harder it is to converge to an optimal policy. :(

@heartInsert

Ok, I get it.

@Trinkle23897 Trinkle23897 unpinned this issue Mar 28, 2021
@doylech

doylech commented Jul 27, 2021

Is there a reason why, in the DQNPolicy.learn method, you don't pass an initial hidden state to the forward method during training? Wouldn't this allow it to train from better predictions?

From Recurrent Experience Replay in Distributed Reinforcement Learning by Kapturowski et al. in 2018 (https://openreview.net/pdf?id=r1lyTjAqYX)

...although zero state initialization was often used in previous works (Hausknecht & Stone, 2015; Gruslys et al., 2018), we have found that it leads to misestimated action-values, especially in the early states of replayed sequence

@Trinkle23897
Collaborator

No. As you mentioned, zero state initialization was often used in previous works, and here we mimic that workflow.
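Concretely, "zero state initialization" just means the recurrent model is called on the replayed (stacked) sequence without a stored hidden state, so the LSTM falls back to its default all-zero initial state. A bare-bones illustration:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
seq = torch.zeros(16, 4, 4)  # a replayed [bsz, stack_num, dim] sequence
out, _ = lstm(seq)           # hidden state omitted -> zero initial state
out, _ = lstm(seq, None)     # equivalent, with the "zero state" made explicit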

@cbellinger27

Is it possible to use the recurrent network in the discrete soft actor critic?

@BFAnas
Contributor

BFAnas commented Mar 10, 2022

No. As you mentioned, zero state initialization was often used in previous works, and here we mimic that workflow.

@Trinkle23897 Could you please point me to the exact workflow you based this RNN implementation on? I'm trying to figure out why it doesn't work (I have tested RNN with SAC and with DQN on 5 environments; it only worked with DQN on CartPole).

@xiaoshenxian

xiaoshenxian commented Aug 27, 2022

Hello @Trinkle23897, I am encountering a problem at lines 89-90 in
https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/a2c.py#L89
which are:

v_s.append(self.critic(minibatch.obs))
v_s_.append(self.critic(minibatch.obs_next))

I am wondering why the critic does not support a 'state' input as the actor does. Since it is common for the critic to share the same RNN input network as the actor, how can I pass the same state that the actor used to the critic? I noticed there is a state batch stored in the 'policy' key of the minibatch, but this should be the output state if I understand correctly, right? Or am I not supposed to pass any state to the critic during training?

Did I miss anything here? Thanks a lot.

@Trinkle23897
Collaborator

Or am I not supposed to pass any state to the critic during training?

Currently, yes

@xiaoshenxian

Or am I not supposed to pass any state to the critic during training?

Currently, yes

Thanks. I finally put the input state into the model input dict to solve the problem.

Another thing: I found that torch.nn.utils.clip_grad_norm_ cannot deal with inf gradients, which often occur in RNNs. Do you mind adding a torch.nn.utils.clip_grad_value_ before it in all policy implementations, like below? Thanks a lot.

if self._grad_value:  # clip large gradient
    nn.utils.clip_grad_value_(
        self._actor_critic.parameters(), clip_value=self._grad_value
    )
if self._grad_norm:  # clip large gradient
    nn.utils.clip_grad_norm_(
        self._actor_critic.parameters(), max_norm=self._grad_norm
    )

experiment:

>>> w.grad
tensor([[0.4000, 0.4000, 0.4000, 0.4000, 0.4000],
        [   inf,    inf,    inf,    inf,    inf],
        [2.3000, 2.3000, 2.3000, 2.3000, 2.3000]])
>>> torch.nn.utils.clip_grad_norm_([w], 0.5) # this cannot deal with inf
tensor(inf)
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # and this cannot deal with nan
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])

========

>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [   inf,    inf,    inf,    inf,    inf],
        [1.8000, 1.8000, 1.8000, 1.8000, 1.8000]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # but this can deal with inf
>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])

@Trinkle23897
Collaborator

Sure! Feel free to submit a PR

BFAnas pushed a commit to BFAnas/tianshou that referenced this issue May 5, 2024