# DreamerV1 Code Explanation

## Core Algorithm in Paper

### Dynamics Learning

### Behavior Learning

### Environment Interaction

## Models

The latent world model that Dreamer uses is composed of three components:
- Representation Model
- Transition Model
- Reward Model

An additional observation model provides a training signal through an image reconstruction loss.

Behavior is learned with an action model and a value model, as in actor-critic RL algorithms.

### Representation Model

The representation model encodes actions and observations in order to output a continuous, vector-valued state with Markovian transitions.

Mathematically, it is the distribution over the current state $s_t$ given the previous state and action ($s_{t-1}$, $a_{t-1}$) and the current observation $o_t$:

> $p_{\theta}(s_t|s_{t-1},a_{t-1},o_t)$ where $p$ indicates a probability distribution over real experience data


In DreamerV1 it is represented as a CNN encoder and an RSSM (Recurrent State Space Model).

In [None]:
# Encoder is responsible for compressing the raw pixel space into a lower dimensional feature vector
#   This provides the observation part in the representation model
self.encoder = models.ConvEncoder(input_shape=self.obs_size).to(self.device)

# The encoder's output is stored in eb_obs variable which is used by the RSSM later
eb_obs = self.encoder(b_obs)

After the encoder processes an image from the environment, its output feature vector is passed to the RSSM. The RSSM effectively acts as a non-linear Kalman filter. First, a deterministic prior belief about the world dynamics propagates the past state/action pair into the current time step (prediction). Then this prediction is corrected using the observation taken from the real environment (correction).

It is important to remember that this process operates with real transition data, not the imagined data.
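
The predict-then-correct pattern described above can be illustrated with a scalar linear Kalman filter. This is only a minimal sketch of the analogy, not the RSSM itself: the dynamics `x_t = x_{t-1} + a_{t-1}`, the function names `kalman_predict` and `kalman_correct`, and the noise variances are all assumptions made for illustration. The RSSM replaces these closed-form linear updates with learned neural networks.

```python
def kalman_predict(mean, var, action, process_var):
    """Prediction: push the previous belief through the assumed dynamics
    x_t = x_{t-1} + a_{t-1}. Uncertainty grows by the process noise."""
    return mean + action, var + process_var


def kalman_correct(prior_mean, prior_var, observation, obs_var):
    """Correction: blend the prior belief with the new observation,
    weighting each by how certain it is."""
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (observation - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


# Hypothetical (action, observation) pairs standing in for replay-buffer data
mean, var = 0.0, 1.0
for action, observation in [(1.0, 1.2), (1.0, 1.9), (1.0, 3.1)]:
    prior_mean, prior_var = kalman_predict(mean, var, action, process_var=0.1)
    mean, var = kalman_correct(prior_mean, prior_var, observation, obs_var=0.5)
    print(f"prior=({prior_mean:.3f}, {prior_var:.3f})  posterior=({mean:.3f}, {var:.3f})")
```

In this toy setting the posterior variance is always smaller than the prior variance, which mirrors the role of the observation in the representation model: it narrows the belief produced by the prediction step.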

In [None]:
# RSSM combines the feature vector from the encoder with the previous state/action to create the new state s_t
self.rssm = models.RSSM(self.config.main.stochastic_size,
                        self.config.main.embedded_obs_size,
                        self.config.main.deterministic_size,
                        self.config.main.hidden_units,
                        self.action_size).to(self.device)

# Prior belief is propagated to current time (prediction)
deterministic = self.rssm.recurrent(posterior, b_a[:, t-1, :], deterministic)

# The posterior is calculated given the new observation, this is now a stochastic state (correction)
posterior_dist, posterior = self.rssm.representation(eb_obs[:, t, :], deterministic)

For more details, see the **dynamic_learning** function in the Dreamer class.

### Reward Model

The reward model operates in the imagined latent domain and is used to predict rewards along imagined trajectories.

Mathematically, it is the distribution over the current reward given the current state:
> $q_{\theta}(r_t|s_t)$ where $q$ is a probability distribution over imagined data


In DreamerV1 it is represented as a simple fully connected network.

In [None]:
# The reward model is defined as an MLP with 2 layers (also a dense network)
self.reward = models.RewardNet(self.config.main.stochastic_size + self.config.main.deterministic_size,
                               self.config.main.hidden_units).to(self.device)
# Its implementation in RewardNet's constructor:
self.net = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    activation(),
    nn.Linear(hidden_size, 1)
)

The reward model gains its ability to predict rewards for imagined states by first learning from real experience. In the **dynamic_learning** function, a log-probability loss is computed between the predicted rewards of real states and the true rewards taken from the replay buffer.
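
To make the log-probability objective concrete: under a Normal distribution with unit variance, the log probability of a target is the negative half squared error plus a constant, so maximizing it is equivalent to minimizing squared error. The sketch below computes this by hand in plain Python. The function names and sample values are hypothetical and are not part of the repository. Note that an optimizer that minimizes would typically be given the negated mean log probability; whether and where the repository applies that sign is not shown in the excerpt here.

```python
import math


def unit_normal_log_prob(target, prediction):
    """Log density of `target` under Normal(mean=prediction, std=1)."""
    return -0.5 * (target - prediction) ** 2 - 0.5 * math.log(2.0 * math.pi)


def mean_log_prob(targets, predictions):
    """Average log probability over a batch of (target, prediction) pairs."""
    pairs = list(zip(targets, predictions))
    return sum(unit_normal_log_prob(t, p) for t, p in pairs) / len(pairs)


# Hypothetical true rewards and two sets of predictions
true_rewards = [0.0, 1.0, 0.5]
good_predictions = [0.1, 0.9, 0.5]
bad_predictions = [1.0, -1.0, 2.0]

# Better predictions give a higher (less negative) mean log probability
print(mean_log_prob(true_rewards, good_predictions))
print(mean_log_prob(true_rewards, bad_predictions))
```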

In [None]:
# Variable b_r is the true reward from the replay buffer of real experiences
rewards = self.reward(posteriors, deterministics)
rewards_dist = torch.distributions.Normal(rewards, 1)
rewards_dist = torch.distributions.Independent(rewards_dist, 1)
rewards_loss = rewards_dist.log_prob(b_r[:, 1:]).mean()

During imagination, the reward model predicts the rewards of the imagined trajectories. This happens in the **behavioral_learning** function.

In [None]:
# In the behavioral_learning function the imagined trajectories are calculated and given to the reward model to predict reward
rewards = self.reward(state_trajectories, deterministics_trajectories)
rewards_dist = torch.distributions.Normal(rewards, 1)
rewards_dist = torch.distributions.Independent(rewards_dist, 1)
# The mode of a Normal distribution is its mean, so this returns the predicted reward itself
rewards = rewards_dist.mode

### Transition Model

The transition model predicts new imagined latent states given a previous state and action. It can generate full imagined trajectories of future states without real-world data, hence the "dreamer" in DreamerV1.

Mathematically, it is the distribution over the current state given the previous state and action:
> $q_{\theta}(s_t|s_{t-1},a_{t-1})$

It uses the same RSSM that the representation model is built on.

Just as the reward model is trained on real experience data, the transition model is trained in the **dynamic_learning** function to learn the physics/logic of the environment. It is then used in the **behavior_learning** function to generate imagined trajectories.

In [None]:
# In dynamic_learning function, the next stochastic state is calculated by applying transition model's current learned dynamics
prior_dist, prior = self.rssm.transition(deterministic)

# Transition model used to dream new states in behavior_learning function
for t in range(self.config.main.horizon):
    action = self.actor(state, deterministics)
    deterministics = self.rssm.recurrent(state, action, deterministics)
    _, state = self.rssm.transition(deterministics)
    state_trajectories[:, t, :] = state
    deterministics_trajectories[:, t, :] = deterministics
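
The structure of the imagination loop above can be sketched without any deep learning library. The version below is a toy with scalar states: `imagine`, `toy_actor`, `toy_recurrent`, and `toy_transition` are hypothetical stand-ins for the actor and the RSSM calls, and the toy transition returns a deterministic value where the real prior would be sampled. The point it illustrates is that no observation enters the loop: each step depends only on the previous imagined state, the chosen action, and the deterministic state.

```python
def imagine(start_state, start_det, actor, recurrent, transition, horizon):
    """Roll out `horizon` imagined steps starting from a given latent state."""
    state, det = start_state, start_det
    states, dets = [], []
    for _ in range(horizon):
        action = actor(state, det)            # choose an action from the current latent state
        det = recurrent(state, action, det)   # advance the deterministic state
        state = transition(det)               # predict the next state from the prior only
        states.append(state)
        dets.append(det)
    return states, dets


# Toy stand-ins (assumptions for illustration, not the repository's models)
def toy_actor(state, det):
    return -0.5 * state


def toy_recurrent(state, action, det):
    return 0.9 * det + state + action


def toy_transition(det):
    return det  # a real prior would return a sample from a distribution


states, dets = imagine(1.0, 0.0, toy_actor, toy_recurrent, toy_transition, horizon=3)
print(states)
```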

### Observation Model

decoder

### Action Model

### Value Model

## Latent Imagination

## Running Experiments