# Rollout collection walkthrough with TensorDicts

This notebook demonstrates how TensorDicts carry data between actors and environments, whatever inputs and outputs each one requires, while keeping the data workflow transparent. We illustrate this with a data collection loop, printing the TensorDict at each step to show that its contents remain visible and easy to access throughout.

### Import dependencies

In [114]:
from torch import nn
from tensordict import TensorDict
from tensordict.nn import TensorDictModule
from torchrl.envs.libs.gym import GymEnv
from torchrl.modules import OneHotCategorical, ProbabilisticActor
from torchrl.envs.utils import step_mdp

### Create an environment

In [115]:
env = GymEnv("CartPole-v1", device="cpu")

### Visualise the environment specs

We can obtain comprehensive information about all environment inputs and outputs, including their shapes, data types, and devices, by examining the environment specifications.

In [116]:
print(env.specs)

CompositeSpec(
    output_spec: CompositeSpec(
        _observation_spec: CompositeSpec(
            observation: BoundedTensorSpec(
                shape=torch.Size([4]),
                space=ContinuousBox(
                    minimum=Tensor(shape=torch.Size([4]), device=cpu, dtype=torch.float32, contiguous=True),
                    maximum=Tensor(shape=torch.Size([4]), device=cpu, dtype=torch.float32, contiguous=True)),
                device=cpu,
                dtype=torch.float32,
                domain=continuous), device=cpu, shape=torch.Size([])),
        _reward_spec: CompositeSpec(
            reward: UnboundedContinuousTensorSpec(
                shape=torch.Size([1]),
                space=None,
                device=cpu,
                dtype=torch.float32,
                domain=continuous), device=cpu, shape=torch.Size([])),
        _done_spec: CompositeSpec(
            done: DiscreteTensorSpec(
                shape=torch.Size([1]),
                space=DiscreteBox

### Create the actor and visualise its inputs and outputs

We can retrieve a list of the expected tensor inputs and outputs of the actor using the 'in_keys' and 'out_keys' attributes. In this case, the actor expects to receive a TensorDict object with an 'observation' tensor in it, and will populate the TensorDict with two additional tensors, the 'logits' and the 'action'.

In [117]:
actor = ProbabilisticActor(
    module=TensorDictModule(nn.Linear(4, 2), in_keys=["observation"], out_keys=["logits"]),
    in_keys=["logits"],
    out_keys=["action"],
    distribution_class=OneHotCategorical)

print("actor inputs: ", actor.in_keys)
print("actor outputs: ", actor.out_keys)

actor inputs:  ['observation']
actor outputs:  ['logits', 'action']
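
The key-based contract can be illustrated without torchrl. The following is a minimal sketch that uses plain Python dictionaries in place of TensorDicts; `KeyedModule`, `Chain`, `toy_logits` and `greedy_one_hot` are hypothetical names invented for this illustration and are not part of tensordict or torchrl. It shows why a composed actor reports only 'observation' as an input even though its second stage reads 'logits'. Unlike the real actor, the toy one picks the action greedily instead of sampling.

```python
class KeyedModule:
    """Hypothetical wrapper: reads named entries of a dict, writes named results back."""

    def __init__(self, fn, in_keys, out_keys):
        self.fn = fn
        self.in_keys = list(in_keys)
        self.out_keys = list(out_keys)

    def __call__(self, data):
        outputs = self.fn(*(data[key] for key in self.in_keys))
        if len(self.out_keys) == 1:
            outputs = (outputs,)
        for key, value in zip(self.out_keys, outputs):
            data[key] = value
        return data


class Chain:
    """Hypothetical sequence of KeyedModules exposing aggregated keys."""

    def __init__(self, *modules):
        self.modules = modules
        self.in_keys, self.out_keys = [], []
        for module in modules:
            # A key is an external input only if no earlier stage produced it.
            self.in_keys += [
                key for key in module.in_keys
                if key not in self.out_keys and key not in self.in_keys
            ]
            self.out_keys += [key for key in module.out_keys if key not in self.out_keys]

    def __call__(self, data):
        for module in self.modules:
            data = module(data)
        return data


def toy_logits(observation):
    # Stand-in for nn.Linear(4, 2): two fixed linear scores.
    total = sum(observation)
    return [total, -total]


def greedy_one_hot(logits):
    # Assumption: greedy choice instead of sampling from a categorical distribution.
    best = max(range(len(logits)), key=logits.__getitem__)
    return [1 if index == best else 0 for index in range(len(logits))]


toy_actor = Chain(
    KeyedModule(toy_logits, ["observation"], ["logits"]),
    KeyedModule(greedy_one_hot, ["logits"], ["action"]),
)
print("toy actor inputs: ", toy_actor.in_keys)
print("toy actor outputs: ", toy_actor.out_keys)
result = toy_actor({"observation": [0.1, 0.2, -0.05, 0.0]})
print(result["action"])
```

Because every stage declares the keys it reads and writes, the composed inputs and outputs can be derived without running the modules.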


### Create a target TensorDict to store the rollouts

In this example, we will collect T consecutive steps of the environment. To do so, we create an empty TensorDict with batch size T. We can print it and see that it contains no tensors yet.
<br>

In [118]:
T = 10
out = TensorDict({}, batch_size=[T], device="cpu")
print(out)

TensorDict(
    fields={
    },
    batch_size=torch.Size([10]),
    device=cpu,
    is_shared=False)


### Single step-by-step data collection loop

Next, we go through the first iteration of the data collection loop step by step, showing how the data TensorDict is filled with transition data.
<br>
<br>

In [119]:
data = env.reset()
print(data)

TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([4]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)


<br>
As expected, the actor adds 'action' and 'logits' to data.
<br>
<br>

In [120]:
data = actor(data)
print(data)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([2]), device=cpu, dtype=torch.int64, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        logits: Tensor(shape=torch.Size([2]), device=cpu, dtype=torch.float32, is_shared=False),
        observation: Tensor(shape=torch.Size([4]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)


<br>
Now that we have the action in our TensorDict, we can take a step in the environment. As in any RL environment, our environment returns the next observation, the reward and a done flag in response to the selected action. These are written under the 'next' key, and we store the resulting transition in the first slot of out.
<br>
<br>

In [121]:
data = env.step(data)
out[0] = data  # first transition; the loop further down fills indices 1 to T-1
print(data)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([2]), device=cpu, dtype=torch.int64, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        logits: Tensor(shape=torch.Size([2]), device=cpu, dtype=torch.float32, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([4]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([4]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)


<br>
Finally, we use the 'step_mdp()' function to update the TensorDict by one step, shifting the 'done' and 'observation' tensors from the 'next' state to the current state, thereby preparing the TensorDict for another iteration through the loop.
<br>
<br>

In [122]:
data = step_mdp(data)
print(data)

TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        logits: Tensor(shape=torch.Size([2]), device=cpu, dtype=torch.float32, is_shared=False),
        observation: Tensor(shape=torch.Size([4]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
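
What this step does to the data can be sketched with plain dictionaries. The helper below, `toy_step_mdp`, is a hypothetical stand-in written for this illustration and not the torchrl implementation; it only reproduces the behaviour visible in the printed output above (entries under 'next' are promoted to the root, 'action' and 'reward' are dropped, 'logits' is kept). The real function has further options that are not modelled here.

```python
def toy_step_mdp(data, exclude=("reward", "action")):
    """Hypothetical sketch: build the dict that starts the next iteration."""
    # Keep root entries other than the nested 'next' dict and the excluded keys.
    advanced = {
        key: value for key, value in data.items()
        if key != "next" and key not in exclude
    }
    # Promote the 'next' entries, overwriting the stale root values.
    advanced.update(
        {key: value for key, value in data.get("next", {}).items() if key not in exclude}
    )
    return advanced


transition = {
    "observation": [0.0, 0.1, 0.0, -0.1],
    "done": False,
    "logits": [0.3, -0.3],
    "action": [1, 0],
    "next": {"observation": [0.0, 0.3, 0.0, -0.4], "done": False, "reward": 1.0},
}
advanced = toy_step_mdp(transition)
print(sorted(advanced))
print(advanced["observation"])
```

The sketch returns a new dictionary and leaves `transition` untouched, so a transition that was already stored is not altered when the loop moves on.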


<br>
Because TensorDicts can carry any tensors between actors and environments, the loop below works for any actor and environment. The tensors within the TensorDict may vary from case to case, but they always remain accessible to the user, as demonstrated in this notebook, which makes the workflow easier to understand.
<br>
<br>

In [123]:
for i in range(1, T):
    data = actor(data)
    data = env.step(data)
    out[i] = data
    data = step_mdp(data)

<br>
The out TensorDict now contains the collected data.
<br>
<br>

In [124]:
print(out)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 2]), device=cpu, dtype=torch.int64, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        logits: Tensor(shape=torch.Size([10, 2]), device=cpu, dtype=torch.float32, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 4]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=cpu,
    is_shared=False)
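
The collection loop relies only on the convention that the actor writes 'action' and the environment writes its results under 'next'. The following self-contained sketch replays the same loop with plain dictionaries; `ToyEnv`, `toy_policy` and `advance` are hypothetical stand-ins, not torchrl objects. The reset on 'done' is an addition made for this sketch and is not part of the notebook's loop.

```python
import random


class ToyEnv:
    """Hypothetical environment: a 1-D walk whose episodes end after 3 steps."""

    def reset(self):
        self.t = 0
        return {"observation": [0.0], "done": False}

    def step(self, data):
        self.t += 1
        # The one-hot action selects a move of -1 or +1.
        move = 1.0 if data["action"][1] else -1.0
        data["next"] = {
            "observation": [data["observation"][0] + move],
            "reward": 1.0,
            "done": self.t >= 3,
        }
        return data


def toy_policy(data, rng):
    # Uniformly random one-hot action, written into the same dict.
    choice = rng.randrange(2)
    data["action"] = [int(index == choice) for index in range(2)]
    return data


def advance(data):
    # Same idea as step_mdp: promote 'next' entries, drop 'action' and 'reward'.
    root = {key: value for key, value in data.items() if key not in ("next", "action")}
    root.update({key: value for key, value in data["next"].items() if key != "reward"})
    return root


steps = 10
rng = random.Random(0)
toy_env = ToyEnv()
toy_out = [None] * steps  # preallocated storage, one slot per step
data = toy_env.reset()
for i in range(steps):
    data = toy_env.step(toy_policy(data, rng))
    toy_out[i] = data
    # Assumption: start a new episode once the current one has ended.
    data = toy_env.reset() if data["next"]["done"] else advance(data)

print(sum(item["next"]["done"] for item in toy_out), "episodes finished")
```

Swapping in a different toy environment or policy requires no change to the loop as long as the same keys are read and written.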
