# Buffer Module

The replay buffer is a fundamental component of deep reinforcement learning (DRL) implementations. In Tianshou, the Buffer module builds on the `Batch` class, adding trajectory tracking and sampling utilities on top of basic data storage.

Tianshou provides several buffer implementations, with `ReplayBuffer` and `VectorReplayBuffer` being the most fundamental. The latter is specifically designed for parallelized environments, which will be covered in the [Vectorized Environment](https://tianshou.readthedocs.io/en/master/02_notebooks/L3_Vectorized__Environment.html) tutorial. This tutorial focuses exclusively on the `ReplayBuffer` implementation.

In [None]:
import pickle

import numpy as np

from tianshou.data import Batch, ReplayBuffer

## Core Functionality

### Circular Queue Storage Mechanism

The buffer stores data in batches using a circular queue mechanism. When the buffer reaches its maximum capacity, newly added data automatically overwrites the oldest entries, ensuring efficient memory utilization while maintaining the most recent experiences.

In [None]:
# Initialize buffer with maximum capacity of 10 transitions
print("========================================")
dummy_buf = ReplayBuffer(size=10)
print(dummy_buf)
print(f"maxsize: {dummy_buf.maxsize}, data length: {len(dummy_buf)}")

# Add 3 transition steps sequentially
print("========================================")
for i in range(3):
    dummy_buf.add(
        Batch(obs=i, act=i, rew=i, terminated=0, truncated=0, done=0, obs_next=i + 1, info={}),
    )
print(dummy_buf)
print(f"maxsize: {dummy_buf.maxsize}, data length: {len(dummy_buf)}")

# Add 10 additional transitions to demonstrate circular queue behavior
# Note: First 3 transitions will be overwritten as capacity is exceeded
print("========================================")
for i in range(3, 13):
    dummy_buf.add(
        Batch(obs=i, act=i, rew=i, terminated=0, truncated=0, done=0, obs_next=i + 1, info={}),
    )
print(dummy_buf)
print(f"maxsize: {dummy_buf.maxsize}, data length: {len(dummy_buf)}")
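The overwrite behavior shown above comes down to modular index arithmetic. The following is a minimal sketch in plain Python, not Tianshou's implementation; the `RingBuffer` class and its attributes are hypothetical names used only for illustration.

```python
class RingBuffer:
    """Hypothetical fixed-capacity store that overwrites its oldest entry when full."""

    def __init__(self, size: int) -> None:
        self.maxsize = size
        self._data = [None] * size
        self._ptr = 0  # slot the next item will be written to
        self._len = 0  # number of valid items, capped at maxsize

    def add(self, item) -> int:
        """Write item at the current pointer and return the slot that was used."""
        slot = self._ptr
        self._data[slot] = item
        self._ptr = (self._ptr + 1) % self.maxsize  # wrap around at capacity
        self._len = min(self._len + 1, self.maxsize)
        return slot

    def __len__(self) -> int:
        return self._len


ring = RingBuffer(size=10)
slots = [ring.add(i) for i in range(13)]
print(slots)       # items 10, 11, 12 reuse slots 0, 1, 2
print(ring._data)  # [10, 11, 12, 3, 4, 5, 6, 7, 8, 9]
print(len(ring))   # 10
```

As in the cell above, adding 13 items to a buffer of size 10 leaves the three oldest items overwritten while the length stays at capacity.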

### Batch-Compatible Operations

Consistent with the `Batch` interface, `ReplayBuffer` supports standard operations including concatenation, splitting, advanced slicing, and indexing.

In [None]:
print(dummy_buf[-1])
print(dummy_buf[-3:])
# Additional Batch methods can be explored as needed

### Persistence and Serialization

The buffer can be pickled, and therefore saved to disk, without losing trajectory information. This is particularly valuable for offline reinforcement learning, where training uses pre-collected experience datasets.

In [None]:
_dummy_buf = pickle.loads(pickle.dumps(dummy_buf))
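The cell above round-trips the buffer in memory. Writing to a file follows the same pattern with the standard `pickle` file API. The sketch below uses two hypothetical helpers, `save_buffer` and `load_buffer`, and a plain dictionary as a stand-in so that it runs without Tianshou; any picklable object, including a `ReplayBuffer`, could be passed instead.

```python
import pickle
import tempfile
from pathlib import Path


def save_buffer(buf, path: Path) -> None:
    """Hypothetical helper: pickle any picklable object to `path`."""
    with path.open("wb") as f:
        pickle.dump(buf, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_buffer(path: Path):
    """Hypothetical helper: load an object previously written by save_buffer."""
    with path.open("rb") as f:
        return pickle.load(f)


# Stand-in for a buffer (assumption: a dict is used here only to keep the example self-contained)
stand_in = {"obs": [0, 1, 2], "done": [False, False, True]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "buffer.pkl"
    save_buffer(stand_in, path)
    restored = load_buffer(path)

print(restored == stand_in)  # True
```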

### Reserved Keys for DRL Integration

To facilitate seamless integration with DRL algorithms, `ReplayBuffer` utilizes nine reserved keys within the `Batch` structure. These keys follow the [Gymnasium](https://gymnasium.farama.org/index.html#) conventions:

*   `obs` - Current observation
*   `act` - Action taken
*   `rew` - Reward received
*   `terminated` - Episode termination flag (goal reached or failure)
*   `truncated` - Episode truncation flag (time limit or external interruption)
*   `done` - Combined termination/truncation indicator
*   `obs_next` - Subsequent observation
*   `info` - Auxiliary information dictionary
*   `policy` - Policy-specific data

**Best Practice**: Use the `info` dictionary for custom metadata rather than adding additional top-level keys. The `done` flag is internally tracked to determine trajectory boundaries, episode lengths, and cumulative rewards.

```python
# Not recommended: Custom top-level keys
buf.add(Batch(......, extra_info=0))

# Recommended: Use info dictionary
buf.add(Batch(......, info={"extra_info": 0}))
```
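To make the relationship between the flags concrete, here is a small sketch of a hypothetical helper, `make_transition`, that derives `done` from `terminated` and `truncated` and routes custom metadata into `info`. It returns a plain dictionary rather than a `Batch` so that it stays self-contained, and it omits the `policy` key for brevity.

```python
def make_transition(obs, act, rew, obs_next, terminated, truncated, **extra):
    """Hypothetical helper: build a transition using the reserved key names."""
    return {
        "obs": obs,
        "act": act,
        "rew": rew,
        "terminated": terminated,
        "truncated": truncated,
        # done combines both ways an episode can end
        "done": terminated or truncated,
        "obs_next": obs_next,
        # custom metadata goes under info, not into new top-level keys
        "info": dict(extra),
    }


step = make_transition(0, 1, 0.5, 1, terminated=False, truncated=True, extra_info=0)
print(step["done"])  # True
print(step["info"])  # {'extra_info': 0}
```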

### Experience Sampling

The primary function of a replay buffer in DRL is to enable experience sampling for training. The buffer provides two methods for this purpose:

1. `ReplayBuffer.sample()` - Direct batch sampling with specified size
2. `ReplayBuffer.split(..., shuffle=True)` - Split buffer into multiple batches with optional shuffling

In [None]:
dummy_buf.sample(batch_size=5)
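The two sampling styles differ in how indices are chosen: one draws a random batch of a given size, the other partitions the whole buffer into minibatches. The sketch below illustrates the difference at the index level in plain Python. The function names are hypothetical, and drawing with replacement is an assumption made for this example, not a statement about Tianshou's internals.

```python
import random


def sample_indices(buffer_len: int, batch_size: int, rng: random.Random) -> list[int]:
    """Draw batch_size indices uniformly at random (with replacement, by assumption)."""
    return [rng.randrange(buffer_len) for _ in range(batch_size)]


def split_indices(buffer_len: int, size: int, shuffle: bool, rng: random.Random):
    """Yield chunks of at most `size` indices that cover every index exactly once."""
    order = list(range(buffer_len))
    if shuffle:
        rng.shuffle(order)
    for start in range(0, buffer_len, size):
        yield order[start:start + size]


rng = random.Random(0)
drawn = sample_indices(10, 5, rng)
chunks = list(split_indices(10, 4, shuffle=True, rng=rng))
print(drawn)   # five indices, possibly with repeats
print(chunks)  # three chunks of sizes 4, 4, 2 that together cover 0..9
```

Random draws suit off-policy updates that repeatedly sample from a large buffer, while a shuffled split suits on-policy updates that should use each collected transition once per epoch.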

## Trajectory Management

A distinguishing feature of `ReplayBuffer` compared to `Batch` is its trajectory tracking capability, which maintains episode boundaries and associated metadata.

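The following example demonstrates trajectory tracking by simulating three episodes:

1. First episode: 3 steps (completed)
2. Second episode: 5 steps (completed)
3. Third episode: 5 steps (ongoing); because the buffer holds only 10 transitions, its last three steps wrap around and overwrite the first episode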

In [None]:
trajectory_buffer = ReplayBuffer(size=10)

# Episode 1: 3 steps, terminates at step 2
print("========================================")
for i in range(3):
    result = trajectory_buffer.add(
        Batch(
            obs=i,
            act=i,
            rew=i,
            terminated=1 if i == 2 else 0,
            truncated=0,
            done=i == 2,
            obs_next=i + 1,
            info={},
        ),
    )
    print(result)
print(trajectory_buffer)
print(f"maxsize: {trajectory_buffer.maxsize}, data length: {len(trajectory_buffer)}")

# Episode 2: 5 steps, terminates at step 7
print("========================================")
for i in range(3, 8):
    result = trajectory_buffer.add(
        Batch(
            obs=i,
            act=i,
            rew=i,
            terminated=1 if i == 7 else 0,
            truncated=0,
            done=i == 7,
            obs_next=i + 1,
            info={},
        ),
    )
    print(result)
print(trajectory_buffer)
print(f"maxsize: {trajectory_buffer.maxsize}, data length: {len(trajectory_buffer)}")

# Episode 3: 5 steps added, episode still ongoing
print("========================================")
for i in range(8, 13):
    result = trajectory_buffer.add(
        Batch(obs=i, act=i, rew=i, terminated=0, truncated=0, done=False, obs_next=i + 1, info={}),
    )
    print(result)
print(trajectory_buffer)
print(f"maxsize: {trajectory_buffer.maxsize}, data length: {len(trajectory_buffer)}")

### Episode Metrics Tracking

The `ReplayBuffer.add()` method returns a tuple of four values: `(current_index, episode_reward, episode_length, episode_start_index)`.

**Important**: The `episode_reward` and `episode_length` fields are only populated when an episode completes (i.e., when `done=True`); on all other steps they are zero. Because the buffer computes them automatically, there is no need to track episode metrics manually during data collection.
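The bookkeeping behind this return value can be sketched in a few lines of plain Python. The `EpisodeTracker` class below is a hypothetical, simplified model for illustration only: it accumulates reward and length per step and reports them on the step where `done` is true.

```python
class EpisodeTracker:
    """Hypothetical model of per-episode reward and length bookkeeping."""

    def __init__(self, maxsize: int) -> None:
        self.maxsize = maxsize
        self._ep_rew = 0.0
        self._ep_len = 0
        self._ep_start = 0

    def add(self, index: int, rew: float, done: bool):
        """Return (index, episode_reward, episode_length, episode_start_index)."""
        self._ep_rew += rew
        self._ep_len += 1
        if not done:
            # Metrics are reported as zero while the episode is still running
            return index, 0.0, 0, self._ep_start
        result = (index, self._ep_rew, self._ep_len, self._ep_start)
        # Reset the accumulators; the next episode starts in the following slot
        self._ep_rew, self._ep_len = 0.0, 0
        self._ep_start = (index + 1) % self.maxsize
        return result


tracker = EpisodeTracker(maxsize=10)
results = [tracker.add(i, rew=float(i), done=(i == 2)) for i in range(3)]
for r in results:
    print(r)
# (0, 0.0, 0, 0)
# (1, 0.0, 0, 0)
# (2, 3.0, 3, 0)
```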

### Episode Boundary Navigation

The buffer provides mechanisms to navigate episode boundaries efficiently. Consider the following scenario where we query a mid-episode step:

In [None]:
print(trajectory_buffer)
print("========================================")

data = trajectory_buffer[6]
print(data)

### Determining Episode Start Indices

Step 6 belongs to the second episode (steps 3-7). While this may appear straightforward, determining the episode start index programmatically is non-trivial due to:

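1. **Ambiguous done flags**: Scanning backward for the preceding `done=True` flag fails when the buffer contains an incomplete episode. Here the unfinished third episode has wrapped around and overwritten indices 0-2, so step 3 is surrounded by `done=False` values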
2. **Complex buffer structures**: Advanced buffers like `VectorReplayBuffer` do not store data sequentially, making boundary detection more challenging

The buffer provides a unified API to handle these complexities through the `prev()` method, which identifies the previous step within an episode:

In [None]:
# Query previous steps for indices [0, 1, 2, 3, 4, 5, 6]
# Episode boundaries prevent backward traversal past episode starts
print(trajectory_buffer.prev(np.array([0, 1, 2, 3, 4, 5, 6])))

In the output, index 3 maps to itself, which confirms that step 3 marks the episode start. The complementary `ReplayBuffer.next()` method enables forward traversal to identify episode terminations, providing a consistent interface across all buffer implementations.

In [None]:
# Query next steps for indices [4, 5, 6, 7, 8, 9]
# Episode boundaries prevent forward traversal past episode ends
print(trajectory_buffer.next(np.array([4, 5, 6, 7, 8, 9])))
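To show the kind of logic such traversal requires, here is a simplified sketch for a single circular buffer, written in plain Python. It is not Tianshou's source: the function names are hypothetical, and the sketch assumes that an episode boundary is either a slot flagged `done` or the most recently written slot (`last_index`). The `done` list mirrors the state of `trajectory_buffer` after the three episodes above.

```python
def prev_index(i: int, done: list[bool], last_index: int) -> int:
    """Previous step in the same episode, or i itself at an episode start."""
    p = (i - 1) % len(done)
    # The slot before i closes an episode if it is flagged done
    # or if it is the most recently written slot.
    at_start = done[p] or p == last_index
    return i if at_start else p


def next_index(i: int, done: list[bool], last_index: int) -> int:
    """Next step in the same episode, or i itself at an episode end."""
    at_end = done[i] or i == last_index
    return i if at_end else (i + 1) % len(done)


def episode_start(i: int, done: list[bool], last_index: int) -> int:
    """Walk backward until prev_index stops moving."""
    while (p := prev_index(i, done, last_index)) != i:
        i = p
    return i


# Slots 0-2 hold the tail of episode 3, slots 3-7 hold episode 2, slots 8-9 the head of episode 3
done = [False] * 10
done[7] = True
last_index = 2  # slot written most recently

print([prev_index(i, done, last_index) for i in range(7)])     # [9, 0, 1, 3, 3, 4, 5]
print([next_index(i, done, last_index) for i in range(4, 10)])  # [5, 6, 7, 7, 9, 0]
print(episode_start(6, done, last_index))                       # 3
print(episode_start(1, done, last_index))                       # 8
```

Note how index 0 steps back to index 9: the ongoing episode wraps around the end of the buffer, which is exactly the case a naive scan over `done` flags would miss.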

### Identifying Incomplete Episodes

The buffer maintains tracking of incomplete episodes through the `unfinished_index()` method, which identifies the most recent step of ongoing episodes (marked with `done=False`):

In [None]:
print(trajectory_buffer.unfinished_index())

### Applications in DRL Algorithms

These trajectory navigation APIs are essential for computing algorithmic quantities such as:

- Generalized Advantage Estimation (GAE)
- N-step returns
- Temporal difference targets

The unified interface ensures modular design and enables algorithm implementations that generalize across different buffer types. For reference implementations, see the [Tianshou policy base class](https://github.com/thu-ml/tianshou/blob/6fc68578127387522424460790cbcb32a2bd43c4/tianshou/policy/base.py#L384).
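As an illustration of why forward traversal matters, the sketch below computes a truncated n-step return over a circular buffer in plain Python. It is a simplified example, not the linked implementation: `n_step_return` is a hypothetical function, bootstrapping from a value estimate is left out, and the episode-end rule (a `done` flag or the most recently written slot) is an assumption of this sketch.

```python
def n_step_return(start, rew, done, last_index, gamma, n):
    """Discounted sum of up to n rewards from `start`, never crossing an episode end."""
    size = len(rew)
    total, discount, i = 0.0, 1.0, start
    for _ in range(n):
        total += discount * rew[i]
        if done[i] or i == last_index:
            break  # stop at the end of the episode
        discount *= gamma
        i = (i + 1) % size  # step forward, wrapping around the buffer
    return total


# Same layout as trajectory_buffer after the three episodes above
rew = [10.0, 11.0, 12.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
done = [False] * 10
done[7] = True
last_index = 2

print(n_step_return(5, rew, done, last_index, gamma=0.5, n=3))  # 5 + 0.5*6 + 0.25*7 = 9.75
print(n_step_return(6, rew, done, last_index, gamma=0.5, n=3))  # 6 + 0.5*7 = 9.5, cut off by done
print(n_step_return(9, rew, done, last_index, gamma=0.5, n=3))  # 9 + 0.5*10 + 0.25*11 = 16.75, wraps
```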

## Advanced Topics

### Specialized Buffer Implementations

Tianshou provides several specialized buffer variants for advanced use cases:

*   **PrioritizedReplayBuffer**: Implements [prioritized experience replay](https://arxiv.org/abs/1511.05952) for importance-weighted sampling
*   **CachedReplayBuffer**: Maintains a primary buffer with auxiliary cached buffers for improved sample efficiency in specific scenarios
*   **ReplayBufferManager**: Base class for custom buffer implementations requiring management of multiple buffer instances

Consult the API documentation and source code for detailed implementation specifications.
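For intuition about prioritized sampling, the following sketch implements the proportional scheme from the linked paper in plain Python: sampling probabilities are proportional to priority raised to `alpha`, and importance weights are `(N * P(i)) ** -beta`, normalized by their maximum. The function names are hypothetical, and Tianshou's `PrioritizedReplayBuffer` may differ in details such as normalization.

```python
import random


def sampling_probs(priorities: list[float], alpha: float) -> list[float]:
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs: list[float], beta: float) -> list[float]:
    """w_i = (N * P(i))^(-beta), normalized so that the largest weight is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]


priorities = [1.0, 1.0, 2.0, 4.0]
probs = sampling_probs(priorities, alpha=1.0)
weights = importance_weights(probs, beta=1.0)
print(probs)    # [0.125, 0.125, 0.25, 0.5]
print(weights)  # [1.0, 1.0, 0.5, 0.25]

# Draw a batch of indices according to the sampling probabilities
rng = random.Random(0)
batch = rng.choices(range(len(priorities)), weights=probs, k=3)
print(batch)
```

High-priority transitions are sampled more often but receive smaller weights, which corrects the bias that non-uniform sampling would otherwise introduce into the gradient.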

### Recurrent Neural Network Support

The buffer initialization accepts a `stack_num` parameter (default: 1) to enable frame stacking for recurrent neural network (RNN) integration in DRL algorithms. This feature facilitates temporal sequence processing by automatically stacking consecutive observations. Refer to the API documentation for configuration details and usage examples.
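Frame stacking can be pictured as walking backward from the sampled index and collecting the most recent frames without crossing into a previous episode. The sketch below shows this at the index level in plain Python. `stacked_indices` is a hypothetical function, and repeating the earliest frame when an episode start is reached is an assumption of this sketch rather than a documented guarantee.

```python
def stacked_indices(i: int, done: list[bool], last_index: int, stack_num: int) -> list[int]:
    """Indices of the stack_num most recent frames ending at i, oldest first."""
    size = len(done)
    indices = [i]
    for _ in range(stack_num - 1):
        p = (indices[0] - 1) % size
        if done[p] or p == last_index:
            # Episode start reached: repeat the earliest frame instead of crossing over
            p = indices[0]
        indices.insert(0, p)
    return indices


# Same layout as trajectory_buffer after the three episodes above
done = [False] * 10
done[7] = True
last_index = 2

print(stacked_indices(6, done, last_index, stack_num=4))  # [3, 4, 5, 6]
print(stacked_indices(4, done, last_index, stack_num=4))  # [3, 3, 3, 4]
print(stacked_indices(0, done, last_index, stack_num=3))  # [8, 9, 0]
```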