
Naming and typing improvements in Actor/Critic/Policy forwards #1032

Merged (14 commits) on Apr 1, 2024

Conversation

@arnaujc91 (Contributor) commented on Jan 23, 2024:

Closes #917

Internal Improvements

Breaking Changes

@arnaujc91 (Contributor, Author):
@MischaPanch as of now, if you agree with the naming, there are other places, as you mentioned, where logits should be replaced. I will continue once we agree that the naming is fine for you.

@codecov-commenter commented on Jan 23, 2024:

Codecov Report

Attention: Patch coverage is 97.75281%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.21%. Comparing base (edae9e4) to head (89d888f).
Report is 4 commits behind head on master.

❗ Current head 89d888f differs from pull request most recent head ef0b0dc. Consider uploading reports for the commit ef0b0dc to get more accurate results

| Files | Patch % | Lines |
| --- | --- | --- |
| tianshou/policy/imitation/base.py | 87.50% | 1 Missing ⚠️ |
| tianshou/utils/net/common.py | 66.66% | 1 Missing ⚠️ |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #1032      +/-   ##
==========================================
+ Coverage   88.15%   88.21%   +0.05%     
==========================================
  Files         100      100              
  Lines        8180     8297     +117     
==========================================
+ Hits         7211     7319     +108     
- Misses        969      978       +9     
```

| Flag | Coverage Δ |
| --- | --- |
| unittests | 88.21% <97.75%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@arnaujc91 (Contributor, Author):
ModelOutputBatchProtocol might need to be modified too, if the changes are accepted.

@arnaujc91 (Contributor, Author) commented on Feb 1, 2024:

Hi guys, in the following places I think the rename

logits --> action_dist_input

should also be applied:

ImitationPolicy Class

```python
) -> ModelOutputBatchProtocol:
    logits, hidden = self.actor(batch.obs, state=state, info=batch.info)
    act = logits.max(dim=1)[1] if self.action_type == "discrete" else logits
    result = Batch(logits=logits, act=act, state=hidden)
    return cast(ModelOutputBatchProtocol, result)
```

DiscreteSACPolicy Class

```python
logits, hidden = self.actor(batch.obs, state=state, info=batch.info)
dist = Categorical(logits=logits)
if self.deterministic_eval and not self.training:
    act = logits.argmax(axis=-1)
else:
    act = dist.sample()
return Batch(logits=logits, act=act, state=hidden, dist=dist)
```

SACPolicy Class

```python
logits, hidden = self.actor(batch.obs, state=state, info=batch.info)
assert isinstance(logits, tuple)
dist = Independent(Normal(*logits), 1)
if self.deterministic_eval and not self.training:
    act = logits[0]
else:
    act = dist.rsample()
log_prob = dist.log_prob(act).unsqueeze(-1)
# apply correction for Tanh squashing when computing logprob from Gaussian
# You can check out the original SAC paper (arXiv 1801.01290): Eq 21.
# in appendix C to get some understanding of this equation.
squashed_action = torch.tanh(act)
log_prob = log_prob - torch.log((1 - squashed_action.pow(2)) + self.__eps).sum(
    -1,
    keepdim=True,
)
result = Batch(
    logits=logits,
    act=squashed_action,
    state=hidden,
    dist=dist,
    log_prob=log_prob,
)
```
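For reference, the Tanh-squashing correction in the snippet above corresponds to Eq. 21 (Appendix C) of the SAC paper: with the squashed action $a = \tanh(u)$ and $u$ sampled from the Gaussian,

$$\log \pi(a \mid s) = \log \mu(u \mid s) - \sum_{i=1}^{D} \log\left(1 - \tanh^2(u_i)\right),$$

where the code additionally adds a small eps inside the log for numerical stability.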

REDQPolicy Class: should loc_scale also be renamed here? @MischaPanch

```python
def forward(  # type: ignore
    self,
    batch: ObsBatchProtocol,
    state: dict | Batch | np.ndarray | None = None,
    **kwargs: Any,
) -> Batch:
    loc_scale, h = self.actor(batch.obs, state=state, info=batch.info)
    loc, scale = loc_scale
    dist = Independent(Normal(loc, scale), 1)
    act = loc if self.deterministic_eval and not self.training else dist.rsample()
    log_prob = dist.log_prob(act).unsqueeze(-1)
    # apply correction for Tanh squashing when computing logprob from Gaussian
    # You can check out the original SAC paper (arXiv 1801.01290): Eq 21.
    # in appendix C to get some understanding of this equation.
    squashed_action = torch.tanh(act)
    log_prob = log_prob - torch.log((1 - squashed_action.pow(2)) + self.__eps).sum(
        -1,
        keepdim=True,
    )
    return Batch(logits=loc_scale, act=squashed_action, state=h, dist=dist, log_prob=log_prob)
```

Net Class

```python
def forward(
    self,
    obs: np.ndarray | torch.Tensor,
    state: Any = None,
    **kwargs: Any,
) -> tuple[torch.Tensor, Any]:
    """Mapping: obs -> flatten (inside MLP) -> logits.

    :param obs:
    :param state: unused and returned as is
    :param kwargs: unused
    """
    logits = self.model(obs)
    batch_size = logits.shape[0]
    if self.use_dueling:  # Dueling DQN
        assert self.Q is not None
        assert self.V is not None
        q, v = self.Q(logits), self.V(logits)
        if self.num_atoms > 1:
            q = q.view(batch_size, -1, self.num_atoms)
            v = v.view(batch_size, -1, self.num_atoms)
        logits = q - q.mean(dim=1, keepdim=True) + v
    elif self.num_atoms > 1:
        logits = logits.view(batch_size, -1, self.num_atoms)
    if self.softmax:
        logits = torch.softmax(logits, dim=-1)
    return logits, state
```

Actor Class (continuous)

```python
def forward(
    self,
    obs: np.ndarray | torch.Tensor,
    state: Any = None,
    info: dict[str, Any] | None = None,
) -> tuple[torch.Tensor, Any]:
    """Mapping: obs -> logits -> action."""
    if info is None:
        info = {}
    logits, hidden = self.preprocess(obs, state)
    logits = self.max_action * torch.tanh(self.last(logits))
    return logits, hidden
```

Actor Class (discrete)

```python
def forward(
    self,
    obs: np.ndarray | torch.Tensor,
    state: Any = None,
    info: dict[str, Any] | None = None,
) -> tuple[torch.Tensor, Any]:
    r"""Mapping: s -> Q(s, \*)."""
    if info is None:
        info = {}
    logits, hidden = self.preprocess(obs, state)
    logits = self.last(logits)
    if self.softmax_output:
        logits = F.softmax(logits, dim=-1)
    return logits, hidden
```

So basically this applies to all actors and all policies that use actors. It also seems to apply to the Net class, for which I still have to figure out how it is properly related to Policies and Actors.

Any objections? @MischaPanch @opcode81

@MischaPanch (Collaborator):
> REDQPolicy Class: should loc_scale also be renamed here? @MischaPanch


What's wrong with the loc_scale?

> So basically this applies to all actors and all policies that use actors. It also seems to apply to the Net class, for which I still have to figure out how it is properly related to Policies and Actors.
>
> Any objections? @MischaPanch @opcode81

Looks good, go ahead! But you should also rename things inside some BatchProtocols, as partially discussed above.

@arnaujc91 (Contributor, Author):
> > REDQPolicy Class: should loc_scale also be renamed here? @MischaPanch
>
> What's wrong with the loc_scale?

Because the logical flow I have seen is usually

self.policy -> dist

so the output of the policy network (the torch forward method) is usually a tuple, where the second element is some state and the first element is what you always feed into the Torch distribution, as the code demonstrates:

```python
loc_scale, h = self.actor(batch.obs, state=state, info=batch.info)
loc, scale = loc_scale
dist = Independent(Normal(loc, scale), 1)
```

I just wanted to keep consistency. If I call it action_dist_input somewhere because it is the input to the Torch distribution, I should also do so anywhere else where that is the case. That was my thought.
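To illustrate this flow with a self-contained toy example (TinyActor and the shapes below are made up purely for illustration, not tianshou code):

```python
import torch
from torch import nn
from torch.distributions import Categorical


class TinyActor(nn.Module):
    """Toy actor whose forward returns (dist input, state), mirroring the flow described above."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 3) -> None:
        super().__init__()
        self.net = nn.Linear(obs_dim, n_actions)

    def forward(self, obs: torch.Tensor, state=None) -> tuple[torch.Tensor, object]:
        # first element: whatever will later be fed into the torch distribution
        # second element: some (here unused) hidden state
        return self.net(obs), state


actor = TinyActor()
action_dist_input, hidden = actor(torch.randn(5, 8))
dist = Categorical(logits=action_dist_input)  # the "policy -> dist" step
act = dist.sample()
```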

@MischaPanch (Collaborator):

> I just wanted to keep consistency. If I call it action_dist_input somewhere because it is the input to the Torch distribution, I should also do so anywhere else where that is the case. That was my thought.

In this case it is not passed to a general action dist, but to a Gaussian, where it is used as loc and scale. So here the name loc_scale is the most appropriate one, I think. Names should always reflect the semantics of the variable as precisely as possible.

@arnaujc91 (Contributor, Author) commented on Feb 1, 2024:

Hmm I bet that in the case of a continuous actor:

```python
def forward(
    self,
    obs: np.ndarray | torch.Tensor,
    state: Any = None,
    info: dict[str, Any] | None = None,
) -> tuple[torch.Tensor, Any]:
    """Mapping: obs -> logits -> action."""
    if info is None:
        info = {}
    logits, hidden = self.preprocess(obs, state)
    logits = self.max_action * torch.tanh(self.last(logits))
    return logits, hidden
```

The output of line 85, which is named logits, is also a mean and standard deviation, no? But it is still called logits and not loc_scale. Maybe I am misunderstanding something.

@MischaPanch (Collaborator) commented on Feb 1, 2024:

> The output of line 85, which is named logits, is also a mean and standard deviation, no? But it is still called logits and not loc_scale. Maybe I am misunderstanding something.

Here you don't know how the output of forward will be used (into which kind of continuous action_dist it would go), so you shouldn't call it loc_scale. In the code above, on the other hand, the variable is fed into a Gaussian as loc and scale, so the semantics are clear.

If tianshou only supported Gaussians for continuous actors, then the output of the continuous Actor should indeed be loc_scale. It should certainly never be logits - please change that as part of your PR!

The fact that tianshou right now implicitly depends on the action_dist_input being of the loc_scale format in some places is a separate problem and deserves a separate issue. Fixing that is outside the scope of this PR.

In this PR, the goal should be to make the naming at least a bit better and consistent with the current interfaces. If you know that the variable has to be of the form (loc, scale) in some place (like in the example above), then it should be named as such. If you only know that it will be added to a Batch's field called action_dist_input, then the variable should be called action_dist_input, because in principle tianshou's current interfaces don't allow you to infer more precise semantics.

That's what I mean when I say that the variable names should reflect semantics as precisely as possible.
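A minimal sketch of this distinction, with made-up tensors and a made-up dist_fn used only to restate the rule in code:

```python
import torch
from torch.distributions import Categorical, Independent, Normal

# Case 1: the distribution is constructed right here, so the precise name (loc, scale) is known.
loc_scale = (torch.zeros(4, 2), torch.ones(4, 2))  # e.g. what a Gaussian actor returns
loc, scale = loc_scale
gaussian_dist = Independent(Normal(loc, scale), 1)

# Case 2: the distribution is built elsewhere (e.g. by a user-supplied dist_fn), so only
# the generic role "input to the action distribution" can be inferred at this point.
def dist_fn(action_dist_input: torch.Tensor) -> Categorical:
    return Categorical(logits=action_dist_input)

action_dist_input = torch.randn(4, 3)
generic_dist = dist_fn(action_dist_input)
```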

@arnaujc91 (Contributor, Author) commented on Feb 1, 2024:

Understood. So wherever logits appears and can be interpreted as action_dist_input, I will make the replacement

logits -> action_dist_input

but if a more accurate name (e.g. loc_scale) is already given, I will leave it untouched. I think we agree here.

@arnaujc91 marked this pull request as ready for review on March 26, 2024, 19:49
@MischaPanch (Collaborator) left a comment:

Thanks @arnaujc91, this brings us an important step closer to removing wrong names across the source code!

The issue is not fully resolved at this point, since all kinds of BatchProtocol still carry logits, but it can't be fully resolved before introducing an Algorithm abstraction as part of release 2.0.0 anyway. So for the 1.x versions this is likely as far as we can push it.

I will address the comments myself and merge this

Resolved (outdated) review comments on:
- tianshou/policy/imitation/discrete_bcq.py
- tianshou/policy/imitation/discrete_cql.py
- tianshou/policy/imitation/discrete_crr.py (2 comments)
- tianshou/policy/imitation/gail.py
- tianshou/policy/modelfree/pg.py
```python
action_dist_input_BD, hidden_BH = self.actor(batch.obs, state=state, info=batch.info)
# in the case that self.action_type == "discrete", the dist should always be Categorical, and D=A
# therefore action_dist_input_BD is equivalent to logits_BA
if self.action_type == "discrete":
```
Collaborator:

This if-else seems weird and unnecessary; I'll look into it.

Collaborator:
Required a breaking change to fix, see d0d9d45. Also required a change in the internals of the high-level interfaces.

@opcode81 @maxhuettenrauch please have a look at that commit. Max, I believe you had written a todo on how to improve the typing of the dist-fn; this commit addresses that todo.

Collaborator:

I ran several low-level and high-level examples, things work. I think I haven't missed anything :)

Resolved (outdated) review comment on tianshou/policy/modelfree/sac.py
```diff
@@ -29,7 +29,7 @@ class TD3Policy(DDPGPolicy[TTD3TrainingStats], Generic[TTD3TrainingStats]):  # t
     """Implementation of TD3, arXiv:1802.09477.

     :param actor: the actor network following the rules in
-        :class:`~tianshou.policy.BasePolicy`. (s -> logits)
+        :class:`~tianshou.policy.BasePolicy`. (s -> actions)
```
Collaborator:
Not sure it's true; the actor might be expected to output dist inputs. Gonna look into it.

```python
logits, hidden = self.preprocess(obs, state)
logits = self.max_action * torch.tanh(self.last(logits))
return logits, hidden
"""Mapping: s -> action_values, hidden_state.
```
Collaborator:
Follow naming convention

Resolved (outdated) review comments on tianshou/policy/modelfree/trpo.py (2 comments)
@MischaPanch changed the title from "Refactored pg.py logits variable name." to "Naming and typing improvements in Actor/Critic/Policy forwards" on Apr 1, 2024
@MischaPanch merged commit bf0d632 into thu-ml:master on Apr 1, 2024
ZhengLi1314 pushed a commit to ZhengLi1314/tianshou_0.5.1 that referenced this pull request Apr 15, 2024
…l#1032)

Closes thu-ml#917 

### Internal Improvements
- Better variable names related to model outputs (logits, dist input
etc.). thu-ml#1032
- Improved typing for actors and critics, using Tianshou classes like
`Actor`, `ActorProb`, etc.,
instead of just `nn.Module`. thu-ml#1032
- Added interfaces for most `Actor` and `Critic` classes to enforce the
presence of `forward` methods. thu-ml#1032
- Simplified `PGPolicy` forward by unifying the `dist_fn` interface (see
associated breaking change). thu-ml#1032
- Use `.mode` of distribution instead of relying on knowledge of the
distribution type. thu-ml#1032

### Breaking Changes

- Changed interface of `dist_fn` in `PGPolicy` and all subclasses to
take a single argument in both
continuous and discrete cases. thu-ml#1032

---------

Co-authored-by: Arnau Jimenez <arnau.jimenez@zeiss.com>
Co-authored-by: Michael Panchenko <m.panchenko@appliedai.de>
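To make the breaking change listed above concrete, here is a rough sketch of what single-argument dist_fn callables could look like for the discrete and continuous cases (the exact types used in tianshou may differ):

```python
import torch
from torch.distributions import Categorical, Distribution, Independent, Normal

# discrete case: dist_fn receives a single tensor of logits
def discrete_dist_fn(action_dist_input: torch.Tensor) -> Distribution:
    return Categorical(logits=action_dist_input)

# continuous case: dist_fn receives a single (loc, scale) tuple instead of two separate arguments
def continuous_dist_fn(loc_scale: tuple[torch.Tensor, torch.Tensor]) -> Distribution:
    loc, scale = loc_scale
    return Independent(Normal(loc, scale), 1)
```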
Linked issue that may be closed by this pull request: Rename logits to model_output
4 participants