[Feature] RewardSum transform #751
Conversation
vmoens left a comment:
Sorry for the late review, I was on PTO.
Overall LGTM! I like that you code everything in the input-output tensordict. It is not well documented, but we should discourage setting attributes on the transform to track call-to-call carry-over variables; the tensordict API is made for that.
Also, because we change the output tensordict by adding a new key, it would be nice to add the out_keys to the observation_spec, or to change the reward_spec to a CompositeSpec with "reward" and "episode_reward" keys.
WDYT?
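A sketch of what that CompositeSpec option could look like (spec classes from torchrl.data; the shapes, and the exact constructor arguments, are assumptions and may differ between torchrl versions):

```python
from torchrl.data import CompositeSpec, UnboundedContinuousTensorSpec

# One possible reward_spec exposing both keys, per the suggestion above.
reward_spec = CompositeSpec(
    reward=UnboundedContinuousTensorSpec(shape=(1,)),
    episode_reward=UnboundedContinuousTensorSpec(shape=(1,)),
)
```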
Beyond this PR:
We should drop the inplace attribute of the transforms. In general, inplace is a recipe for bugs and PyTorch developers usually recommend not using it. The performance gain is usually marginal and it significantly narrows the range of applications.
```python
# reward=reward_spec,
# episode_reward=episode_reward_spec
# )
return reward_spec
```
Not sure if I would consider …
I tried to modify the reward_spec, but that generated problems somewhere else in the code.
In any case, is episode_reward part of the reward?
It's probably better to put it in the observation_spec.
Added the transform_observation_spec method and some code to test it.
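A minimal sketch of what such a transform_observation_spec could look like (the spec classes, shapes, and the way the out_keys are registered are assumptions, not the code from this PR):

```python
from torchrl.data import CompositeSpec, UnboundedContinuousTensorSpec

def transform_observation_spec(self, observation_spec):
    """Register the cumulative-reward out_keys in the observation spec."""
    if not isinstance(observation_spec, CompositeSpec):
        observation_spec = CompositeSpec(observation=observation_spec)
    for out_key in self.out_keys:
        # Each episode-reward entry is a single unbounded scalar per env.
        observation_spec[out_key] = UnboundedContinuousTensorSpec(shape=(1,))
    return observation_spec
```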
```python
else:
    # If reward_spec is not a CompositeSpec, the only in_key should be "reward"
    assert set(self.in_keys) == {
```
We should not use asserts anywhere other than in tests.
Here I think a KeyError would be appropriate.
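A sketch of the suggested change, assuming the truncated comparison above completes as set(self.in_keys) == {"reward"} (per the comment in the snippet); the error message is illustrative:

```python
# Raise instead of asserting, so the check also runs with python -O
# and surfaces a meaningful error to the user.
if set(self.in_keys) != {"reward"}:
    raise KeyError(
        "When reward_spec is not a CompositeSpec, in_keys must be ['reward'], "
        f"got {self.in_keys}."
    )
```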
ok!
```python
        tensordict[out_key] = 0.0

    # Batched environments
    elif "reset_workers" in tensordict.keys():
```
I guess that the default in the batched case where no "reset_workers" is present would be to reset them all, no?
Otherwise, in multi-agent settings for instance, a user could do
env.reset()
and end up with an episode reward that keeps on piling up.
You are right. Now, if no "reset_workers" key is found in the tensordict in the batched case, all envs are reset.
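A minimal sketch of the reset behaviour being described (the method name, key names and shapes are assumptions rather than the PR's actual code):

```python
import torch

def reset(self, tensordict):
    """Zero the cumulative rewards for the environments being reset."""
    for out_key in self.out_keys:
        if out_key not in tensordict.keys():
            continue
        if "reset_workers" in tensordict.keys():
            # Only zero the entries of the workers flagged for reset.
            reset_workers = tensordict.get("reset_workers")
            tensordict[out_key][reset_workers] = 0.0
        else:
            # No mask provided: reset the episode reward for all envs.
            tensordict[out_key] = torch.zeros_like(tensordict[out_key])
    return tensordict
```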
```python
reward = tensordict.get(in_key)
if out_key not in tensordict.keys():
    tensordict.set(
        out_key, torch.zeros(*tensordict.shape, 1, dtype=reward.dtype)
```
let's add device=reward.device here
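Applied to the snippet above, the suggestion would read roughly as follows (reusing the variables from the quoted diff):

```python
tensordict.set(
    out_key,
    torch.zeros(*tensordict.shape, 1, dtype=reward.dtype, device=reward.device),
)
```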
done
My code considered the possibility that reward spec could be an NdUnboundedContinuousTensorSpec, but I see that this class has been removed. Should I consider that reward spec can only be an UnboundedContinuousTensorSpec?

The Nd*Spec classes were confusing, so we removed them in favour of their parent class.
vmoens left a comment:
We should also add this transform to the doc in the envs.rst file
vmoens left a comment:
Looks like the tests are passing. The only thing missing is the doc in envs.rst
Codecov Report
```
@@           Coverage Diff           @@
##             main     #751   +/-   ##
=======================================
  Coverage   88.72%   88.72%
=======================================
  Files         123      123
  Lines       21033    21106    +73
=======================================
+ Hits        18661    18726    +65
- Misses       2372     2380     +8
```
Flags with carried forward coverage won't be shown.
I added the transform to the list of transforms in envs.rst, but I am not sure if that is enough.
Description
Adds a new Transform class called RewardSum, which tracks the cumulative reward of all episodes in progress and adds this information to the tensordict under a new key.
Motivation and Context
It can be informative to be able to access the training episode rewards.
e.g., it can be used to track the performance during training, as sketched below.
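A minimal sketch of such usage, assuming a Gym environment and the "episode_reward" out_key discussed in this PR (import paths and the exact placement of the key in the rollout output may differ between torchrl versions):

```python
from torchrl.envs import TransformedEnv
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.transforms import RewardSum

# Wrap the base env so that rollouts carry a running episode reward.
env = TransformedEnv(GymEnv("CartPole-v1"), RewardSum())

td = env.rollout(max_steps=200)
# The cumulative reward collected so far can be logged during training.
print(td.get("episode_reward", None))
```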
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an x in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!