# DreamerV1 Code Explanation
The Dreamer paper (“Dream to Control: Learning Behaviors by Latent Imagination”) boils down to three parts:

1) Dynamics learning — learn a world model from real data.

2) Behavior learning — use that model to imagine rollouts and train the actor/critic.

3) Environment interaction — collect fresh data with the current policy and repeat.

The following sections walk through these phases and explain the components and logic flow of each, beginning with **Dynamics Learning**.

## 1. Dynamics Learning

Initially, Dreamer learns a **world model** that captures how the environment behaves, using real data.  
Instead of learning directly from raw pixels, Dreamer first encodes observations, such as images, into compact latent states and learns how those states change over time.

The world model includes:
- an **Encoder** and **RSSM**, which together implement the Representation and Transition Models,
- a **Reward Model** that predicts rewards from latent states, and
- a **Decoder**, which implements the Observation Model and reconstructs images to provide a training signal.

All of these parts, and the models that implement them, are explained in detail below.

Once this model is trained, Dreamer can generate future states entirely within latent space, which allows it to imagine the outcomes of actions without additional real-world data.

Below is a pseudocode overview of Dreamer's world model learning phase. It shows how the main components (Representation, Transition, Reward, and Observation models) work together.



In [None]:
def dynamics_learning(self, replay_buffer):

    # Sample a batch of real trajectories from replay buffer
    b_obs, b_act, b_rew = replay_buffer.sample(self.config.main.batch_size)

    # Perform the learning step (computes different losses and updates the world model)
    losses = self.dynamic_learning_step(b_obs, b_act, b_rew)

    # Return the computed loss values for logging or display
    return losses

def dynamic_learning_step(self, b_obs, b_act, b_rew):

    emb_obs = self.encoder(b_obs)  # Encoded latent observations

    # Initialize latent states
    deter_state = self.rssm.init_deterministic(b_obs.size(0)).to(self.device)
    post_state  = self.rssm.init_stochastic(b_obs.size(0)).to(self.device)

    # Initialize loss terms
    reconstruction_loss, reward_prediction_loss, consistency_alignment_loss = 0, 0, 0

    for t in range(b_obs.size(1)):
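        # Predict the next deterministic state from the previous state and action
        # (at t = 0 there is no previous action, so a zero action is used)
        prev_act = b_act[:, t-1, :] if t > 0 else torch.zeros_like(b_act[:, 0, :])
        deter_state = self.rssm.recurrent(post_state, prev_act, deter_state)
        prior_dist, _ = self.rssm.transition(deter_state)
        # Correct the prediction with the current observation
        post_dist, post_state = self.rssm.representation(emb_obs[:, t, :], deter_state)

        # Compute individual losses
        reconstruction_loss += -self.decoder(post_state, deter_state).log_prob(b_obs[:, t]).mean()
        # Negative log-likelihood of the true reward under a unit-variance Normal around the prediction
        reward_dist = torch.distributions.Normal(self.reward(post_state, deter_state), 1)
        reward_prediction_loss += -reward_dist.log_prob(b_rew[:, t]).mean()
        consistency_alignment_loss += torch.distributions.kl.kl_divergence(post_dist, prior_dist).mean()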

    total_loss = reconstruction_loss + reward_prediction_loss + consistency_alignment_loss

    return {
        "total_loss": total_loss,
        "reconstruction_loss": reconstruction_loss,
        "reward_prediction_loss": reward_prediction_loss,
        "consistency_alignment_loss": consistency_alignment_loss
    }


The code above shows how Dreamer trains its world model step by step.  
Each part of the model learns a different skill from real experience:  

- The **Representation Model** turns raw images into smaller “latent” features that are easier to work with.  
- The **Transition Model** learns how those features change when the agent takes an action — basically predicting what comes next.  
- The **Reward Model** learns to guess the reward the agent would get in that situation.  
- The **Observation Model** tries to rebuild the original image, helping the model remember what details matter.  

Each of these parts produces its own loss:  
- **Reconstruction loss** (how well the model rebuilds the image),  
- **Reward loss** (how well it predicts the reward), and  
- **Consistency loss** (the KL divergence between the state inferred from the real observation and the state predicted by the transition model).  

By combining these, Dreamer gradually learns how the environment works inside a compact, internal world.
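
To make the consistency term concrete, here is a minimal sketch of the KL divergence between two diagonal Gaussians, written in plain Python so it runs without dependencies. It assumes both the posterior and the prior are diagonal Gaussians described by lists of means and standard deviations; the function name `kl_diag_gaussians` is hypothetical and is not part of the code discussed in this document, which calls `torch.distributions.kl.kl_divergence` instead.

```python
import math


def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL(q || p) for two diagonal Gaussians given as lists of means and stds.

    Hypothetical helper for illustration: q plays the role of the posterior
    (state inferred from the observation), p the role of the prior (state
    predicted by the transition model).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        # Closed-form KL between two univariate Gaussians, summed over dimensions
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


# Identical distributions have zero divergence
same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# Shifting the mean by one standard deviation costs 0.5 nats per dimension
shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

The term is zero only when the prediction already matches what the observation implies, so minimizing it pushes the transition model toward states the representation model would infer.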

Next, we’ll explore each of these models to see what they do.


## Models

The latent world model that Dreamer uses is composed of three components:
- Representation Model
- Transition Model
- Reward Model

An additional observation model is used as a training signal for image reconstruction loss.

The behavior is learned via an action/value model like in actor/critic RL algorithms.

### Representation Model

The representation model encodes actions and observations in order to output a continuous, vector-valued state with Markovian transitions.

Mathematically it is the probability distribution over the real data of a state $s_t$ given previous state/action ($s_{t-1}$, $a_{t-1}$) and current observation $o_t$:

> $p_{\theta}(s_t|s_{t-1},a_{t-1},o_t)$ where $p$ indicates a probability distribution over real experience data


In DreamerV1 it is represented as a CNN encoder and an RSSM (Recurrent State Space Model).

In [None]:
# Encoder is responsible for compressing the raw pixel space into a lower dimensional feature vector
#   This provides the observation part in the representation model
self.encoder = models.ConvEncoder(input_shape=self.obs_size).to(self.device)

# The encoder's output is stored in eb_obs variable which is used by the RSSM later
eb_obs = self.encoder(b_obs)

After the encoder processes an image from the environment, its output feature vector is passed to the RSSM. The RSSM effectively acts as a non-linear Kalman filter. First, a deterministic prior belief of the world dynamics propagates the past state/action pair into the current time step (prediction). Then the observation taken from the real environment is used to correct this prediction.

It is important to remember that this process operates with real transition data, not the imagined data.
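
The predict-then-correct pattern can be illustrated with a one-dimensional linear Kalman filter. This is only an analogy under simplifying assumptions (scalar state, additive action, Gaussian noise); the RSSM replaces these fixed formulas with learned networks. The names `predict` and `correct` are hypothetical.

```python
def predict(mean, var, action, process_var):
    """Prediction step: propagate the belief through assumed dynamics s' = s + a."""
    return mean + action, var + process_var


def correct(mean, var, obs, obs_var):
    """Correction step: pull the predicted belief toward the new observation."""
    gain = var / (var + obs_var)          # how much to trust the observation
    new_mean = mean + gain * (obs - mean)
    new_var = (1.0 - gain) * var          # uncertainty shrinks after observing
    return new_mean, new_var


# Prior belief after acting, then posterior belief after observing
prior_mean, prior_var = predict(mean=0.0, var=1.0, action=1.0, process_var=1.0)
post_mean, post_var = correct(prior_mean, prior_var, obs=3.0, obs_var=2.0)
```

In this analogy, `predict` corresponds to the recurrent/transition path (the prior) and `correct` corresponds to the representation step that uses the encoded observation (the posterior).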

In [None]:
# RSSM combines the feature vector from the encoder with the previous state/action to create the new state s_t
self.rssm = models.RSSM(self.config.main.stochastic_size,
                                self.config.main.embedded_obs_size,
                                self.config.main.deterministic_size,
                                self.config.main.hidden_units,
                                self.action_size).to(self.device)

# Prior belief is propagated to current time (prediction)
deterministic = self.rssm.recurrent(posterior, b_a[:, t-1, :], deterministic)

# The posterior is calculated given the new observation, this is now a stochastic state (correction)
posterior_dist, posterior = self.rssm.representation(eb_obs[:, t, :], deterministic)

For more details, see the **dynamic_learning** function in the Dreamer class.

### Reward Model

The reward model operates in the imagined latent domain and is used to predict rewards along imagined trajectories.

Mathematically it is the probability of the current reward given current state:
> $q_{\theta}(r_t|s_t)$ where $q$ is a probability distribution over imagined data


In DreamerV1 it is represented as a simple fully connected network.

In [None]:
# The reward model is defined as an MLP with 2 layers (also a dense network)
self.reward = models.RewardNet(self.config.main.stochastic_size + self.config.main.deterministic_size,
                                       self.config.main.hidden_units).to(self.device)
# Its implementation in RewardNet's constructor:
self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            activation(),
            nn.Linear(hidden_size, 1)
        )

The reward model can only estimate rewards for imagined states because it first learns from real experience. In the **dynamic_learning** function, a log-probability loss is computed between the rewards predicted for real states and the true rewards taken from the replay buffer.

In [None]:
# Variable b_r is the true reward from the replay buffer of real experiences
rewards = self.reward(posteriors, deterministics)
rewards_dist = torch.distributions.Normal(rewards, 1)
rewards_dist = torch.distributions.Independent(rewards_dist, 1)
# This is a log-likelihood, so it must enter the total loss with a negative sign
rewards_loss = rewards_dist.log_prob(b_r[:, 1:]).mean()
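
Because the reward distribution above is a Normal with standard deviation 1, its negative log-probability is a squared error plus a constant. The sketch below checks this in plain Python; `normal_log_prob` is a hypothetical helper written from the Gaussian density formula, not a function from the code discussed here.

```python
import math


def normal_log_prob(x, mean, std=1.0):
    """Log-density of a univariate Normal(mean, std) evaluated at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


true_reward, predicted_reward = 2.0, 0.5

# Negative log-likelihood under a unit-variance Normal centered on the prediction
nll = -normal_log_prob(true_reward, predicted_reward)

# Half the squared error plus a constant that does not depend on the prediction
half_squared_error = 0.5 * (true_reward - predicted_reward) ** 2
constant = 0.5 * math.log(2 * math.pi)
```

Here `nll` equals `half_squared_error + constant`, so maximizing this log-probability is equivalent to minimizing the mean squared error of the reward prediction.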

For imagined states, the reward model predicts the reward along the imagined trajectories. This happens in the **behavior_learning** function.

In [None]:
# In the behavior_learning function, the imagined trajectories are computed and passed to the reward model to predict rewards
rewards = self.reward(state_trajectories, deterministics_trajectories)
rewards_dist = torch.distributions.Normal(rewards, 1)
rewards_dist = torch.distributions.Independent(rewards_dist, 1)
rewards = rewards_dist.mode

### Transition Model

The transition model predicts new imagined latent states given a previous state and action. It can generate full imagined trajectories of future states without real-world data, hence the "dreamer" in DreamerV1.

Mathematically it is the probability of current state given previous state/action:
> $q_{\theta}(s_t|s_{t-1},a_{t-1})$

It is implemented by the same RSSM that the representation model uses.

Like the reward model, the transition model is trained on real experience data in **dynamic_learning**, where it learns the physics/logic of the environment. It is then used in the **behavior_learning** function to generate imagined trajectories.

In [None]:
# In dynamic_learning function, the next stochastic state is calculated by applying transition model's current learned dynamics
prior_dist, prior = self.rssm.transition(deterministic)

# Transition model used to dream new states in behavior_learning function
for t in range(self.config.main.horizon):
    action = self.actor(state, deterministics)
    deterministics = self.rssm.recurrent(state, action, deterministics)
    _, state = self.rssm.transition(deterministics)
    state_trajectories[:, t, :] = state
    deterministics_trajectories[:, t, :] = deterministics

### Observation Model

The observation model is used as a training signal in the world model learning phase. Specifically, it learns from the real image data and tries to learn how to reconstruct them.

In the code it is a decoder that takes latent states and converts them into a probability distribution of possible images.

In [None]:
# Defining the decoder
self.decoder = models.ConvDecoder(self.config.main.stochastic_size,
                                  self.config.main.deterministic_size,
                                  out_shape=self.obs_size).to(self.device)

# Decode latent states into a distribution over images (the input to the reconstruction loss)
reconstruct_dist = self.decoder(posteriors, deterministics, mps_flatten)

## 2. Behavior Learning

After the world model is trained, Dreamer uses it to learn how to act — this is the **Behavior Learning** phase.  
Instead of exploring in the real environment, the agent uses its learned **world model** to **imagine trajectories** inside the latent space.

Here’s what happens in this phase:

1. The **Actor (Action Model)** proposes actions based on the current latent state.  
2. The **Transition Model** imagines what the next latent state would be if that action were taken.  
3. The **Reward Model** predicts the reward the agent would get in that imagined step.  
4. The **Critic (Value Model)** estimates how good each imagined state is in the long term.  
5. The Actor and Critic are trained together — the Actor to maximize imagined returns, and the Critic to match its value predictions to those returns.

This process is called **latent imagination**: the agent “dreams” future experiences inside its world model, so it can keep learning without interacting with the real environment.


In [None]:
def behavior_learning(self, start_state, start_deter):

    # Generate imagined trajectories
    actions, rewards, values = [], [], []
    state, deter = start_state, start_deter

    for t in range(self.config.main.horizon):
        action = self.actor(state, deter)
        deter = self.rssm.recurrent(state, action, deter)
        _, state = self.rssm.transition(deter)

        reward = self.reward(state, deter)
        value = self.critic(state, deter)
        actions.append(action); rewards.append(reward); values.append(value)

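    # Stack the per-step lists into tensors of shape (batch, horizon, ...)
    # (assumes the reward and critic calls above return tensors)
    rewards, values = torch.stack(rewards, dim=1), torch.stack(values, dim=1)

    # Compute returns and optimize actor and critic
    returns = td_lambda(rewards, torch.ones_like(rewards)*self.config.main.discount, values, self.config.main.lambda_, self.device)
    self.update_actor_critic(returns, values)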


The function above puts the Behavior Learning process into action. It shows how Dreamer loops through imagined steps: using the actor to choose actions, the world model to predict what happens next, and the critic to estimate long-term value. The returns from these imagined rollouts are then used to update both the actor and critic networks.  

Next, we’ll look at how the **Action Model**, **Value Model**, and the **Latent Imagination** process make this learning possible.


### Action Model

The action model is the actor in the actor-critic algorithm. It is defined as a simple neural network with two linear layers and predicts an action given the latent state. As in the paper, the action is sampled from the actor's distribution, but in a reparameterized form so that gradients can be backpropagated through the sampling operation.

$a_\tau \sim q_\phi(a_\tau \mid s_\tau)$

$a_\tau = \tanh\!\big(\mu_\phi(s_\tau) + \sigma_\phi(s_\tau) \, \epsilon\big), \quad \epsilon \sim \mathcal{N}(0, I)$

During behavior learning, the actor does not interact with the actual environment; it only sees latent states produced by imagination.
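
The reparameterized sampling formula can be sketched for a single scalar action in plain Python. The noise is drawn separately from the mean and standard deviation, so the action is a smooth function of both and has a well-defined derivative. All names here (`squash_action`, `sample_action`, `grad_wrt_mean`) are hypothetical and are not part of the `Actor` class shown in this document.

```python
import math
import random


def squash_action(mean, std, eps):
    """Reparameterized action: a = tanh(mean + std * eps)."""
    return math.tanh(mean + std * eps)


def sample_action(mean, std, rng):
    """Draw eps ~ N(0, 1) and return the squashed action."""
    eps = rng.gauss(0.0, 1.0)
    return squash_action(mean, std, eps)


def grad_wrt_mean(action):
    """d(action)/d(mean) = 1 - tanh(u)^2, expressed through the action itself."""
    return 1.0 - action ** 2


rng = random.Random(0)
actions = [sample_action(0.0, 1.0, rng) for _ in range(5)]
```

Every sampled action lies in the interval [-1, 1] because of the `tanh`, and `grad_wrt_mean` shows why gradients can flow from the action back into the network that produced the mean.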

In [None]:
class Actor(nn.Module):
    def __init__(self,
                 latent_size,
                 hidden_size,
                 action_size,
                 discrete=True,
                 activation=nn.ELU,
                 min_std=1e-4,
                 init_std=5,
                 mean_scale=5):

        super().__init__()
        self.latent_size = latent_size
        self.hidden_size = hidden_size
        self.action_size = (action_size if discrete else action_size*2)
        self.discrete = discrete
        self.min_std = min_std
        self.init_std = init_std
        self.mean_scale = mean_scale

        self.net = nn.Sequential(
            nn.Linear(latent_size, hidden_size),
            activation(),
            nn.Linear(hidden_size, self.action_size)
        )

Here is how the actor is trained during imagination. We optimize the model by backpropagating gradients so as to maximize the returns of the predicted actions along the imagined trajectories.

The objective from the paper for actor's parameters $\phi$ is:

$\max_{\phi} \;
\mathbb{E}_{q_\theta, q_\phi}\!\left[
  \sum_{\tau = t}^{t + H} V_\lambda(s_\tau)
\right]$

In [None]:
# Instantiation of the action model
self.actor = models.Actor(self.config.main.stochastic_size + self.config.main.deterministic_size,
                                  self.config.main.hidden_units,
                                  self.action_size,
                                  self.config.env.discrete).to(self.device)

# Get action during imagination from the actor model
action = self.actor(state, deterministics)

# Compute loss based on discounted returns from the simulated trajectories
actor_loss = -(discount * returns).mean()
self.actor_optimizer.zero_grad()
actor_loss.backward()
nn.utils.clip_grad_norm_(
    self.actor.parameters(),
    self.config.main.clip_grad,
    norm_type=self.config.main.grad_norm_type,
)
self.actor_optimizer.step()

### Value Model

The value model is the critic in the actor-critic algorithm. It is defined as a simple neural network with three linear layers. It estimates the total expected long-term value of a latent state. As seen in the paper:

$v_\psi(s_\tau) \approx
\mathbb{E}_{q}\!\left[
  \sum_{n = \tau}^{t + H}
  \gamma^{n - \tau} r_n
\right]$

Similarly, the critic does not interact with the actual environment either, but only with the latent state during imagination.

In [None]:
class Critic(nn.Module):
    def __init__(self, latent_size, hidden_size, activation=nn.ELU):
        super().__init__()
        self.latent_size = latent_size

        # Three linear layers ending in a single scalar value estimate
        self.net = nn.Sequential(
            nn.Linear(latent_size, hidden_size),
            activation(),
            nn.Linear(hidden_size, hidden_size),
            activation(),
            nn.Linear(hidden_size, 1)
        )

Here is how the critic is trained during imagination. We optimize the model by trying to match it to returns from many future steps. The objective from the paper for critic's parameters $\psi$ is:

$\min_{\psi} \;
\mathbb{E}_{q_\theta, q_\phi}\!\left[
  \sum_{\tau = t}^{t + H}
  \frac{1}{2}
  \,
  \|v_\psi(s_\tau) - V_\lambda(s_\tau)\|^2
\right]$

where $V_\lambda(s_\tau)$ is an exponentially weighted average of $k$-step return estimates for different $k$; each estimate sums the imagined rewards over the next $k$ steps and then bootstraps from the learned value model.
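
One common way to compute such a λ-return is a backward recursion over the imagined trajectory. The sketch below is a plain-Python illustration under that assumption; the name `lambda_returns` and its arguments are hypothetical, and the `td_lambda` function used elsewhere in this document may differ in details such as tensor shapes and per-step discounts.

```python
def lambda_returns(rewards, values, bootstrap, discount=0.99, lam=0.95):
    """Backward recursion for lambda-returns over one imagined trajectory.

    rewards[t] and values[t] belong to imagined step t (t = 0..H-1);
    bootstrap is the value estimate of the state after the last step.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = bootstrap
    for t in reversed(range(horizon)):
        next_value = values[t + 1] if t + 1 < horizon else bootstrap
        # Mix the one-step bootstrap with the longer return from the next step
        returns[t] = rewards[t] + discount * ((1 - lam) * next_value + lam * next_return)
        next_return = returns[t]
    return returns


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]

# lam = 0 reduces to one-step TD targets: r_t + discount * v(s_{t+1})
one_step = lambda_returns(rewards, values, bootstrap=2.0, discount=0.9, lam=0.0)
# lam = 1 reduces to the discounted reward sum bootstrapped only at the end
full = lambda_returns(rewards, values, bootstrap=2.0, discount=0.9, lam=1.0)
```

Values of λ between 0 and 1 interpolate between these two extremes, trading the bias of short bootstrapped targets against the variance of long rollouts.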

In [None]:
# Instantiation of the value model
self.critic = models.Critic(self.config.main.stochastic_size + self.config.main.deterministic_size,
                                  self.config.main.hidden_units).to(self.device)

# Compute loss by matching future returns
values_dist = self.critic(state_trajectories[:, :-1].detach(), deterministics_trajectories[:, :-1].detach())

critic_loss = -(discount.squeeze() * values_dist.log_prob(returns.detach())).mean()

self.critic_optimizer.zero_grad()
critic_loss.backward()
nn.utils.clip_grad_norm_(
    self.critic.parameters(),
    self.config.main.clip_grad,
    norm_type=self.config.main.grad_norm_type,
)
self.critic_optimizer.step()

### Latent Imagination

Instead of running rollouts on the actual environment, the Dreamer agent simulates subsequent trajectories within the latent space. This process is known as latent imagination.

We start with `state`, which represents the stochastic part of the latent state, and `deterministics`, which represents the deterministic part. Both initially come from the Representation Model.

In [None]:
# Stochastic part of the latent state
state = state.reshape(-1, self.config.main.stochastic_size)

# Deterministic part of the latent state
deterministics = deterministics.reshape(-1, self.config.main.deterministic_size)

We then feed the state and deterministics into the actor model, which gives us the action our agent would take. With the state, deterministics, and action, we use the recurrent part of the RSSM to get the updated deterministic hidden state. Conditioned on this new deterministic state, the transition part of the RSSM predicts the new stochastic state (still within the latent space). Keep in mind that the agent is not actually stepping through the environment; it is only using what it has learned so far to speculate about what could happen if it took certain actions.

In [None]:
for t in range(self.config.main.horizon):
    action = self.actor(state, deterministics)
    deterministics = self.rssm.recurrent(state, action, deterministics)
    _, state = self.rssm.transition(deterministics)
    state_trajectories[:, t, :] = state
    deterministics_trajectories[:, t, :] = deterministics
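
The same loop structure can be seen in a dependency-free toy version, where scalar functions stand in for the actor, the RSSM, and the reward model. Everything in this sketch (`imagine`, `toy_policy`, `toy_model`, `toy_reward`) is hypothetical and chosen only so the numbers are easy to follow; the point is that no environment is called anywhere in the rollout.

```python
def imagine(start_state, policy, model, reward_fn, horizon):
    """Roll out a trajectory using only learned components, never the environment."""
    states, actions, rewards = [], [], []
    state = start_state
    for _ in range(horizon):
        action = policy(state)          # the actor proposes an action
        state = model(state, action)    # the model predicts the next state
        states.append(state)
        actions.append(action)
        rewards.append(reward_fn(state))
    return states, actions, rewards


# Stand-ins for the learned networks (assumed forms, for illustration only)
def toy_policy(state):
    return -0.5 * state


def toy_model(state, action):
    return state + action


def toy_reward(state):
    return -abs(state)


states, actions, rewards = imagine(1.0, toy_policy, toy_model, toy_reward, horizon=3)
```

With these stand-ins the imagined state halves at every step (0.5, 0.25, 0.125) and the imagined reward rises toward zero, all without a single real environment step.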

Now with the imagined latent states from the state and deterministics trajectories, we can use the reward model to give us an estimate of the expected reward along this path. Similarly, we use the critic to estimate the expected value from the state and deterministics trajectories.

In [None]:
rewards = self.reward(state_trajectories, deterministics_trajectories)
rewards_dist = torch.distributions.Normal(rewards, 1)
rewards_dist = torch.distributions.Independent(rewards_dist, 1)
rewards = rewards_dist.mode

# continues holds the per-step discount: a learned continuation estimate if enabled, otherwise the constant discount factor
if self.config.main.continue_loss:
    _, conts_dist = self.cont_net(state_trajectories, deterministics_trajectories)
    continues = conts_dist.mean
else:
    continues = self.config.main.discount * torch.ones_like(rewards)

values = self.critic(state_trajectories, deterministics_trajectories).mode

With these rewards and values, we can calculate the estimated returns.

In [None]:
# Returns calculated using temporal difference bootstrapping
# td_lambda uses gamma (discount factor) and lambda to compute multi-step returns
returns = td_lambda(
    rewards,
    continues,
    values,
    self.config.main.lambda_,
    self.device
)

Now we can finally use these results to train our models. The actor is optimized to maximize the expected returns, which in practice means minimizing their negative with gradient descent. The critic is trained to predict those returns through its value distribution.

In [None]:
# Optimizing the Actor Model
actor_loss = -(discount * returns).mean()

self.actor_optimizer.zero_grad()
actor_loss.backward()
nn.utils.clip_grad_norm_(
    self.actor.parameters(),
    self.config.main.clip_grad,
    norm_type=self.config.main.grad_norm_type,
)
self.actor_optimizer.step()


# Optimizing the Critic Model
values_dist = self.critic(state_trajectories[:, :-1].detach(), deterministics_trajectories[:, :-1].detach())

critic_loss = -(discount.squeeze() * values_dist.log_prob(returns.detach())).mean()

self.critic_optimizer.zero_grad()
critic_loss.backward()
nn.utils.clip_grad_norm_(
    self.critic.parameters(),
    self.config.main.clip_grad,
    norm_type=self.config.main.grad_norm_type,
)
self.critic_optimizer.step()

## 3. Environment Interaction and the Full Training Loop

After learning both the **world model** and the **behavior policy**, the final phase of Dreamer’s training loop is **Environment Interaction**,  
where the agent uses its current policy to gather new real experiences. These experiences are then added to the replay buffer and used in the next round of Dynamics and Behavior Learning.

This is the workflow:

1. The agent **collects real experience** from the environment using the current policy (the Actor).  
2. That experience is **stored in a replay buffer**.  
3. The **world model** is updated using this real data (Dynamics Learning).  
4. The **Actor** and **Critic** are then improved inside the world model using imagined trajectories (Behavior Learning).  
5. The updated policy is used again to collect more experience — and the cycle repeats.

This loop allows Dreamer to keep learning efficiently, balancing imagination with real-world data.  
By alternating between these steps, the agent continually refines both its understanding of the environment and the quality of its decisions.
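
The workflow above can be summarized as a short skeleton. This is a sketch with a hypothetical interface (`train`, `collect`, `update_world_model`, `update_behavior`) rather than the method names of the Dreamer class in this document, and the stand-in agent only records which phase runs so that the ordering is visible.

```python
def train(agent, env, replay_buffer, iterations, updates_per_iteration):
    """Alternate between real data collection and model/behavior updates."""
    for _ in range(iterations):
        # Steps 1-2: gather real experience and store it
        agent.collect(env, replay_buffer)
        for _ in range(updates_per_iteration):
            # Step 3: fit the world model on real data (dynamics learning)
            start_states = agent.update_world_model(replay_buffer)
            # Step 4: improve actor and critic in imagination (behavior learning)
            agent.update_behavior(start_states)


class RecordingAgent:
    """Stand-in agent that only records the order of the phases."""

    def __init__(self):
        self.calls = []

    def collect(self, env, replay_buffer):
        self.calls.append("collect")
        replay_buffer.append("episode")

    def update_world_model(self, replay_buffer):
        self.calls.append("model")
        return "latent_start_states"

    def update_behavior(self, start_states):
        self.calls.append("imagine")


agent = RecordingAgent()
buffer = []
train(agent, env=None, replay_buffer=buffer, iterations=2, updates_per_iteration=2)
```

Running several model and behavior updates per collection phase is what lets the agent extract more learning from each real episode.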


In [None]:
def environment_interaction(self, env, replay_buffer):

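    observation = env.reset()
    done = False

    # Initialize the latent state and previous action once per episode
    deter_state = self.rssm.init_deterministic(1).to(self.device)
    stoch_state = self.rssm.init_stochastic(1).to(self.device)
    action = torch.zeros(1, self.action_size).to(self.device)

    while not done:
        with torch.no_grad():
            # Encode the current observation
            obs_tensor = torch.tensor(observation).unsqueeze(0).float().to(self.device)
            emb_obs = self.encoder(obs_tensor)

            # Update the latent state: predict from the previous action, then correct with the observation
            deter_state = self.rssm.recurrent(stoch_state, action, deter_state)
            _, stoch_state = self.rssm.representation(emb_obs, deter_state)

            # Actor chooses action based on current latent state
            action = self.actor(stoch_state, deter_state)

        # Step in the real environment
        action_np = action.cpu().numpy()
        next_obs, reward, done, _ = env.step(action_np)

        # Store the transition in replay buffer
        replay_buffer.add(observation, action_np, reward, next_obs)

        # Move to next step
        observation = next_obs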


This final phase closes Dreamer’s training loop.
  
After training its world model and policy, Dreamer uses the current Actor to interact with the real environment, collecting new experience tuples `(observation, action, reward, next_observation)` that are stored in a replay buffer. These samples are then used in the next round of Dynamics and Behavior Learning.  
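
Since dynamics learning trains on sequences rather than isolated transitions, a replay buffer for this setting needs to return contiguous chunks of experience. Below is a minimal plain-Python sketch under simplifying assumptions: the class `SequenceReplayBuffer` and its `add`/`sample` signatures are hypothetical and differ from the `replay_buffer` calls shown in this document, and for brevity it ignores episode boundaries when sampling.

```python
import random


class SequenceReplayBuffer:
    """Fixed-capacity buffer that samples contiguous chunks of steps."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.steps = []                  # each entry: (obs, action, reward, done)
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.steps)

    def add(self, obs, action, reward, done):
        self.steps.append((obs, action, reward, done))
        if len(self.steps) > self.capacity:
            self.steps.pop(0)            # drop the oldest step

    def sample(self, batch_size, seq_len):
        if len(self.steps) < seq_len:
            raise ValueError("not enough steps stored for the requested length")
        max_start = len(self.steps) - seq_len
        batch = []
        for _ in range(batch_size):
            start = self.rng.randint(0, max_start)   # inclusive on both ends
            batch.append(self.steps[start:start + seq_len])
        return batch


buf = SequenceReplayBuffer(capacity=5, seed=0)
for i in range(10):
    buf.add(obs=i, action=0, reward=0.0, done=False)
batch = buf.sample(batch_size=4, seq_len=3)
```

Each element of `batch` is a run of three consecutive steps, which is the shape of data the time loop in dynamics learning iterates over.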

By repeating this cycle — **collect → model → imagine → improve** —  Dreamer continually refines both its understanding of the world and its ability to act within it.
