
RNN support #19

Closed
3 of 8 tasks
miriaford opened this issue Apr 4, 2020 · 55 comments
Assignees
Labels
enhancement Feature that is not a new algorithm or an algorithm enhancement good first issue Good for newcomers

Comments

@miriaford

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website, and in particular read the known issues
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, sys
    print(tianshou.__version__, sys.version, sys.platform)

I see in the README that RNN support is on your TODO list. However, the module API already seems to support RNNs (the forward(obs, state) method). Could you please provide some examples of how to train an RNN policy? Thanks!

@Trinkle23897
Collaborator

Oh yes, in fact it already supports an RNN policy, but I will write a small demonstration script after finishing the documentation. Thanks for pushing me.

@miriaford
Author

Thanks! Do you mind quickly posting a snippet here and explaining how to use the RNN policy? I need to run something ASAP, thanks a lot!

@fengredrum
Contributor

Actually, my experiments show that an RNN policy is nearly always inferior to an MLP policy in robotic control domains.
Besides, RNNs suffer from slow training speed :)

@miriaford
Author

miriaford commented Apr 5, 2020

It varies from case to case, but I agree that an RNN might not be the best option. I do need it as a baseline, though, so I would still appreciate it if someone could explain how to run an RNN policy with Tianshou. Thanks!

@Trinkle23897 Trinkle23897 self-assigned this Apr 5, 2020
@Trinkle23897
Collaborator

Trinkle23897 commented Apr 8, 2020

@miriaford I uploaded a script in test/discrete/test_drqn.py.
What I have changed in the codebase:

  1. modified the replay buffer to support frame_stack sampling;
  2. fixed the collector to clear the state when the corresponding episode finishes.

How to use:

  1. add the argument "stack_num" to the training collector's replay buffer (code snippet);
  2. change your actor network to a recurrent one (code snippet);
  3. use the current git master branch instead of the PyPI wheel: pip3 install git+https://github.com/thu-ml/tianshou.git@master.

That's all. If you find anything wrong, please give me feedback.
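For reference, a condensed sketch of these steps, roughly following test/discrete/test_drqn.py (class locations and argument names depend on the installed tianshou version, so treat it as illustrative rather than copy-paste ready):

import gym
import torch

from tianshou.data import Collector, ReplayBuffer
from tianshou.policy import DQNPolicy
from net import Recurrent  # the recurrent (LSTM) Q-network in test/discrete/net.py

env = gym.make('CartPole-v0')
state_shape = env.observation_space.shape
action_shape = env.action_space.n

# step 2: a recurrent Q-network instead of the MLP one
net = Recurrent(layer_num=3, state_shape=state_shape,
                action_shape=action_shape, device='cpu')
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = DQNPolicy(net, optim, discount_factor=0.9,
                   estimation_step=3, target_update_freq=320)

# step 1: stack_num in the training collector's replay buffer
train_collector = Collector(policy, gym.make('CartPole-v0'),
                            ReplayBuffer(20000, stack_num=4))
test_collector = Collector(policy, gym.make('CartPole-v0'))

The two collectors are then passed to offpolicy_trainer exactly as in test_dqn.py.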

@miriaford
Author

Cool, thanks a lot!

Does it also work out of the box for PPO and SAC? Do you have any benchmark scripts/results for them?

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 19, 2020

@miriaford I haven't tried RNN on PPO and SAC, but you can be the first. Follow the changes between test_dqn.py and test_drqn.py and you can construct your first RNN-style PPO and SAC!

@miriaford
Author

Hi, I might have missed something obvious. Could you please educate me on:

  1. In these lines (https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L90-L91), why do you add .detach()? What happens if you don't?
  2. In the PPO policy (lines), why doesn't the critic support the RNN interface like the actor does?

Thanks for your help!

@Trinkle23897
Collaborator

  1. It's my fault. Previously the step in the collector was not wrapped in with torch.no_grad(). I'll remove the .detach() (see the sketch below).
  2. Good catch! When I implemented GAE, I did not think about RNNs. I'll upgrade the interface.
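A tiny illustration of point 1, using a bare GRU rather than the actual collector code:

import torch
import torch.nn as nn

gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
obs = torch.zeros(1, 1, 4)  # one environment, one step
hidden = None
# Because the collection step runs under no_grad, the hidden state carried to
# the next step has no autograd graph attached, so no explicit .detach() is needed.
with torch.no_grad():
    out, hidden = gru(obs, hidden)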

@miriaford
Author

This is great! Would love to check out the new interface.

Another quick question:
https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L70-L79
If I understand this correctly, the [bsz, len, dim] shape only occurs when learning the parameters; in all other cases [bsz, dim] is passed in, because evaluation runs the RNN one forward step at a time? Then what's the practical impact of a longer vs. shorter len for training?

Thanks again!

@Trinkle23897
Collaborator

Yes, you are right. The longer the length, the smaller the difference in RNN performance between training and testing.
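As a toy illustration of the shape handling being discussed (dimensions chosen arbitrarily; this mirrors the net.py lines linked above but is not the actual code):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=128, num_layers=1, batch_first=True)
fc = nn.Linear(128, 2)

def q_values(obs):
    # evaluation feeds [bsz, dim] one step at a time; learning feeds stacked [bsz, len, dim]
    s = torch.as_tensor(obs, dtype=torch.float32)
    if s.dim() == 2:
        s = s.unsqueeze(1)       # [bsz, dim] -> [bsz, 1, dim]
    out, (h, c) = lstm(s)        # out: [bsz, len, 128]
    return fc(out[:, -1])        # last time step -> [bsz, 2]

q_values(torch.zeros(8, 4))      # evaluation-style input
q_values(torch.zeros(8, 4, 4))   # training-style stacked input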

@miriaford
Author

Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!

@Trinkle23897
Collaborator

Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!

Sure! I'm currently fixing issue #38. Maybe today or tomorrow I'll resolve your issue.

@Trinkle23897 Trinkle23897 reopened this Apr 29, 2020
@miriaford
Author

Thanks for reopening the issue. I'd be happy to test it when you finish, thanks!

@miriaford
Author

Any updates? Sorry for pinging again. Thanks!

@Trinkle23897
Collaborator

I'm coding it.

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 30, 2020

diff --git a/test/continuous/test_ppo.py b/test/continuous/test_ppo.py
index e94a917..c496de2 100644
--- a/test/continuous/test_ppo.py
+++ b/test/continuous/test_ppo.py
@@ -12,7 +12,7 @@ from tianshou.trainer import onpolicy_trainer
 from tianshou.data import Collector, ReplayBuffer
 
 if __name__ == '__main__':
-    from net import ActorProb, Critic
+    from net import RecurrentActorProb, RecurrentCritic
 else:  # pytest
     from test.continuous.net import ActorProb, Critic
 
@@ -71,11 +71,11 @@ def test_ppo(args=get_args()):
     train_envs.seed(args.seed)
     test_envs.seed(args.seed)
     # model
-    actor = ActorProb(
+    actor = RecurrentActorProb(
         args.layer_num, args.state_shape, args.action_shape,
         args.max_action, args.device
     ).to(args.device)
-    critic = Critic(
+    critic = RecurrentCritic(
         args.layer_num, args.state_shape, device=args.device
     ).to(args.device)
     optim = torch.optim.Adam(list(
@@ -95,7 +95,7 @@ def test_ppo(args=get_args()):
         gae_lambda=args.gae_lambda)
     # collector
     train_collector = Collector(
-        policy, train_envs, ReplayBuffer(args.buffer_size))
+        policy, train_envs, ReplayBuffer(args.buffer_size, stack_num=4))
     test_collector = Collector(policy, test_envs)
     # log
     log_path = os.path.join(args.logdir, args.task, 'ppo')

The code has been pushed.

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 30, 2020

I'm not sure whether or not to support stacked actions for the Q-network. That would require changing some lines of code in every policy that needs a Q-network (DDPG, TD3, SAC). Alternatively, you can create the Q-network as (stacked-s, single-a) -> Q value, which is what this RecurrentCritic does:

import numpy as np
import torch
import torch.nn as nn

class RecurrentCritic(nn.Module):
    def __init__(self, layer_num, state_shape, action_shape=0, device='cpu'):
        super().__init__()
        self.state_shape = state_shape
        self.action_shape = action_shape
        self.device = device
        self.nn = nn.LSTM(input_size=np.prod(state_shape), hidden_size=128,
                          num_layers=layer_num, batch_first=True)
        self.fc2 = nn.Linear(128 + np.prod(action_shape), 1)

    def forward(self, s, a=None):
        if not isinstance(s, torch.Tensor):
            s = torch.tensor(s, device=self.device, dtype=torch.float)
        # s is [bsz, len, dim] during training and [bsz, dim] during evaluation,
        # i.e. the training tensor has one extra (time) dimension.
        assert len(s.shape) == 3
        self.nn.flatten_parameters()
        s, (h, c) = self.nn(s)
        s = s[:, -1]  # keep only the last time step
        if a is not None:
            if not isinstance(a, torch.Tensor):
                a = torch.tensor(a, device=self.device, dtype=torch.float)
            s = torch.cat([s, a], dim=1)
        s = self.fc2(s)
        return s

But for PPO, the current version is enough.
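A toy usage example of the (stacked-s, single-a) critic above, with arbitrary shapes just to show the expected tensor layout (continuing from the snippet above):

critic = RecurrentCritic(layer_num=1, state_shape=(3,), action_shape=(1,))
s = torch.zeros(32, 4, 3)  # [bsz, stack_num, state_dim] from a stack_num=4 buffer
a = torch.zeros(32, 1)     # [bsz, action_dim], a single (non-stacked) action
q = critic(s, a)           # -> shape [32, 1]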

@miriaford
Author

Thanks!! Will take a look and let you know!

@Trinkle23897 Trinkle23897 added enhancement Feature that is not a new algorithm or an algorithm enhancement good first issue Good for newcomers labels May 5, 2020
@Trinkle23897 Trinkle23897 moved this from Usage to Other feature requests in Issue/PR Categories May 11, 2020
@Trinkle23897
Collaborator

Trinkle23897 commented May 14, 2020

I'm not sure whether or not to support stacked action for the Q network. This needs to change some lines of code in all policy which need a Q-network (ddpg, td3, sac). Or you can create the Q-network as (stacked-s, single-a) -> Q value

Aha, I know how to support it (Q(stacked-s, stacked-a)) without modifying any part of the codebase. Try adding a gym.Wrapper that modifies the state to {'original_state': state, 'action': action}. That means that instead of saving (s, a, s', r, d) in the replay buffer the traditional way, we now save ([s, a], a, [s', a'], r, d).
And on your network side, extract the [s, a] pair from the batch.state.
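A rough sketch of such a wrapper (the class name is hypothetical; it assumes a Box action space and the classic 4-tuple gym step API used at the time of this thread, and a full version should also update observation_space to a gym.spaces.Dict):

import gym
import numpy as np

class AppendActionWrapper(gym.Wrapper):
    """Expose {'original_state': obs, 'action': last_action} as the observation,
    so the buffer effectively stores ([s, a], a, [s', a'], r, d)."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        zero_act = np.zeros(self.env.action_space.shape, dtype=np.float32)
        return {'original_state': obs, 'action': zero_act}

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        wrapped = {'original_state': obs,
                   'action': np.asarray(action, dtype=np.float32)}
        return wrapped, rew, done, info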

@Trinkle23897 Trinkle23897 pinned this issue May 16, 2020
@Trinkle23897
Collaborator

I think there is a bug in the implementation of the framestack option.

This issue has been fixed on the dev branch. Yes, keys_partial = set.union(*keys_map) - keys_share is now used instead.

Should we merge the essential commit from dev to master at this point?

@Trinkle23897
Collaborator

Also, it seems that if sample_avail is set, the batch that buffer.sample() returns would not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?

What's your desired feature? Could you please provide a simple example?

@Trinkle23897 Trinkle23897 reopened this Jul 13, 2020
@Arthur-Null

I am trying to reach O(n) complexity in policy.learn(). In the current implementation, when sample_avail is not set, O(n) can't be reached. When sample_avail is set, policy.learn() can only get part of the reward, which makes it impossible to calculate returns.

@Arthur-Null

It is possible to use the original replay buffer without framestack and sample_avail and leave all the dirty work to the policy. I wonder if there is a better way?

@Trinkle23897
Collaborator

Trinkle23897 commented Jul 13, 2020

Do you mean sampling only available indexes with an on-policy method for RNN usage? I haven't figured it out clearly, so let's take an example:

(This is a buffer)
Index 0 1 2 3 4 5 6 7 8 9 a b c d e f
Done  0 0 1 0 0 0 1 0 0 0 1 0 1 / / /

If I understand correctly, you want to only extract [0-2], [3-6], [7-a], [b-c] without overlap. What I'm not sure about is the data format that you expect. (and for me "O(n)" is ambiguous)
buffer.sample(0) can do something like this since it will extract all of the data, but only once. I think it is not what you want.

@Arthur-Null

Let me state my question more clearly and thoroughly.
To me, the frame_stack solution has two problems:

  1. As mentioned before, some redundant calculation is included. To calculate the action distribution for sequence [0, 1, 2], we have to do RNN calculation on at least three sequences [0, 0, 0], [0, 0, 1], [0, 1, 2]. (This is what I mean by O(n^2). Only running RNN([0, 1, 2]) once is O(n)).
  2. Sequences like [0, 0, 0] would introduce a bias, since we always take the last output of the RNN as the final output.

By setting sample_avail to True, the first problem is solved to some extent. However, sequences like [b, c, c] may still exist, so the second problem remains. Also, in the current implementation, only the reward at the end of an episode is returned by buf.sample(), which makes training impossible.

The data format I am expecting is something like PyTorch's PackedSequence, which includes the sequences and their lengths. It seems a bit complicated to implement this in the replay buffer, so I am now implementing it in the process_fn of my policy.
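For reference, a minimal sketch of building a PackedSequence-style batch from whole episodes (the episode splitting and function name are illustrative, not tianshou API):

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

def pack_episodes(episodes):
    # episodes: list of [len_i, dim] tensors, each a full, non-overlapping episode
    lengths = torch.tensor([len(e) for e in episodes])
    order = torch.argsort(lengths, descending=True)
    padded = pad_sequence([episodes[i] for i in order], batch_first=True)
    return pack_padded_sequence(padded, lengths[order], batch_first=True)

# each episode is then processed by the RNN exactly once (the O(n) pass above)
packed = pack_episodes([torch.zeros(5, 3), torch.zeros(2, 3), torch.zeros(7, 3)])
out, (h, c) = torch.nn.LSTM(3, 16, batch_first=True)(packed)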

@Trinkle23897
Collaborator

So you want a new sample method: you specify the number of episodes you want, and it returns the sampled episodes, for example [[7-a], [3-6], [7-a], [b-c]] with episode_size=4, am I right?

@Arthur-Null

Yes

@Trinkle23897
Collaborator

Trinkle23897 commented Jul 13, 2020

Okay, I see. I think the simplest way is to implement it on the buffer side. Since we can easily maintain the done flag, we can do something like:

# begin_index stores all the start index of one episode (if available), e.g., [0, 3, 7, b]
episode_index = np.random.choice(begin_index, episode_size)  # assume we get [b, 7, 3, 7]
# do a sorting with key=-episode_length, since we want to mimic torch's PackedSequence
# and after that the episode_index is [7, 7, 3, b]
# should maintain a list of episode_length, in the above case: [3, 4, 4, 2]
data, seq_length = [], []
while not self._meta.done[episode_index][0]:  # since the first episode's length is the longest
    # find out the episodes that haven't terminated
    available_episode_index = episode_index[self._meta.done[episode_index] == 0]
    data.append(self[available_episode_index])  # should set stack_num=0
    seq_length.append(len(available_episode_index))
    episode_index += self._meta.done[episode_index] == 0
data = Batch.cat(data)
# for now, (data, seq_length) is something like a PackedSequence.

I think maintaining begin_index and episode_length is not a big problem. It is very similar to my implementation of sample_avail.

I'd prefer to add a new argument to the Collector's initialization to indicate whether to use this mode.

@duburcqa
Collaborator

Should we merge the essential commit from dev to master at this point?

Yes I think it is appropriate.

@Trinkle23897
Collaborator

Trinkle23897 commented Jul 13, 2020

Should we merge the essential commit from dev to master at this point?

Yes I think it is appropriate.

But should we release a new version, like 0.2.4.rc1, since we want to keep master stable?

@duburcqa
Collaborator

duburcqa commented Jul 13, 2020

But should we release a new version, like 0.2.4.rc1, since we want to keep master stable?

It's up to you; it's really a personal choice. For a Python-only project, I tend to prefer releasing patches like 0.2.4.rc1 in the case of "critical" bug fixes, but there is no clear answer. I would say the minimum is to push them to master, but not necessarily to create an actual release.

@heartInsert

Did someone try a transformer before? In the NLP area, transformers seem to perform better than RNN units.

@Trinkle23897
Collaborator

But in RL, the bigger the model, the harder it is to converge to an optimal policy. :(

@heartInsert

Ok, I get it.

@Trinkle23897 Trinkle23897 unpinned this issue Mar 28, 2021
@doylech

doylech commented Jul 27, 2021

Is there a reason why, in the DQNPolicy.learn method, you don't pass an initial hidden state to the forward method during training? Wouldn't this allow it to train from better predictions?

From Recurrent Experience Replay in Distributed Reinforcement Learning by Kapturowski et al. in 2018 (https://openreview.net/pdf?id=r1lyTjAqYX)

...although zero state initialization was often used in previous works (Hausknecht & Stone, 2015; Gruslys et al., 2018), we have found that it leads to misestimated action-values, especially in the early states of replayed sequence

@Trinkle23897
Collaborator

No. As you mentioned, zero state initialization was often used in previous works, and here we mimic that workflow.
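Concretely, "zero state initialization" just means the recurrent model is called on the replayed (stacked) sequence without a stored hidden state, so the LSTM falls back to its default all-zero initial state. A bare-bones illustration:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
seq = torch.zeros(16, 4, 4)  # a replayed [bsz, stack_num, dim] sequence
out, _ = lstm(seq)           # hidden state omitted -> zero initial state
out, _ = lstm(seq, None)     # equivalent, with the "zero state" made explicit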

@cbellinger27

Is it possible to use the recurrent network in the discrete soft actor critic?

@BFAnas
Contributor

BFAnas commented Mar 10, 2022

No. As you mentioned, zero state initialization was often used in previous works, and here we mimic that workflow.

@Trinkle23897 Could you please point me to the exact workflow you based this RNN implementation on? I'm trying to figure out why it doesn't work (I have tested RNN with SAC and with DQN on 5 environments; it only worked with DQN on CartPole).

@xiaoshenxian

xiaoshenxian commented Aug 27, 2022

Hello @Trinkle23897, I am encountering a problem at lines 89-90 in
https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/a2c.py#L89
which are:

v_s.append(self.critic(minibatch.obs))
v_s_.append(self.critic(minibatch.obs_next))

I am wondering why the critic does not support a 'state' input as the actor does. Since it is common for the critic to share the same RNN input network as the actor, how can I pass the same state that the actor used to the critic? I noticed there is a state batch stored in the 'policy' key of the minibatch, but this should be the output state if I understand correctly, right? Or am I not supposed to pass any state to the critic during training?

Did I miss anything here? Thanks a lot.

@Trinkle23897
Collaborator

Or am I not supposed to pass any state to the critic during training?

Currently, yes

@xiaoshenxian

Or am I not supposed to pass any state to the critic during training?

Currently, yes

Thanks. I finally put the input state into the model input dict to solve the problem.

Another thing: I found that torch.nn.utils.clip_grad_norm_ cannot deal with inf gradients, which often occur in RNNs. Do you mind adding a torch.nn.utils.clip_grad_value_ before it in all policy implementations, like below? Thanks a lot.

if self._grad_value:  # clip large gradient
    nn.utils.clip_grad_value_(
        self._actor_critic.parameters(), clip_value=self._grad_value
    )
if self._grad_norm:  # clip large gradient
    nn.utils.clip_grad_norm_(
        self._actor_critic.parameters(), max_norm=self._grad_norm
    )

experiment:

>>> w.grad
tensor([[0.4000, 0.4000, 0.4000, 0.4000, 0.4000],
        [   inf,    inf,    inf,    inf,    inf],
        [2.3000, 2.3000, 2.3000, 2.3000, 2.3000]])
>>> torch.nn.utils.clip_grad_norm_([w], 0.5) # this cannot deal with inf
tensor(inf)
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # and this cannot deal with nan
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])

========

>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [   inf,    inf,    inf,    inf,    inf],
        [1.8000, 1.8000, 1.8000, 1.8000, 1.8000]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # but this can deal with inf
>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])

@Trinkle23897
Collaborator

Sure! Feel free to submit a PR

BFAnas pushed a commit to BFAnas/tianshou that referenced this issue May 5, 2024