### The actor

- Input: the stochastic and hidden states produced by the world model.
- Output depends on whether we're training or not:
    - (1) training: action, logprobs, entropy
    - (2) not training: action only

The network:  
- MLP (input = hidden + stochastic states concatenated, output = 2 values per action dimension: mean and logstd)

Forward pass:  
- get the mean and logstd from the MLP and build a normal distribution from them  
- sample from the distribution, squash the sample with tanh, then scale it back to the action space via an affine transform
- (for Dreamer) the sample must not be backpropagated through, so apply a stop-gradient here.
- during training, take the logprobs of the sample from the distribution; these carry the gradient for backprop. When not training, just return the sampled action scaled back to the range of the action space.
- also during training, compute the entropy, which measures the randomness (uncertainty) of the policy's action distribution. Lower entropy = more deterministic policy. Higher entropy = more random policy. Entropy is essentially a regularization tool: if it drops too low too early, the policy is likely to converge prematurely to a single behavior. Keeping it moderate encourages exploration of different actions, which helps with long-term learning.
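
The network and forward pass described above can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the reference DreamerV3 implementation: the class name `Actor`, the single hidden layer, the `units` size, the SiLU activation, and the log-std clamp bounds are all assumptions made for the example, and the entropy returned is that of the pre-squash Gaussian, used as a simple proxy.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Tanh-squashed Gaussian actor (illustrative sketch only)."""

    def __init__(self, hidden_dim, stoch_dim, action_low, action_high, units=64):
        super().__init__()
        low = torch.as_tensor(action_low, dtype=torch.float32)
        high = torch.as_tensor(action_high, dtype=torch.float32)
        action_dim = low.shape[0]
        # MLP: input = hidden + stochastic state, output = mean and log-std per action dim.
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + stoch_dim, units),
            nn.SiLU(),
            nn.Linear(units, 2 * action_dim),
        )
        # Buffers: saved in state_dict, moved by .to(device), not updated by the optimizer.
        self.register_buffer("action_scale", (high - low) / 2.0)
        self.register_buffer("action_bias", (high + low) / 2.0)

    def forward(self, hidden, stoch, training=False):
        features = torch.cat([hidden, stoch], dim=-1)
        mean, log_std = self.net(features).chunk(2, dim=-1)
        log_std = log_std.clamp(-5.0, 2.0)  # assumed bounds, for numerical stability
        dist = torch.distributions.Normal(mean, log_std.exp())
        x = dist.sample()  # sample() carries no gradient: this is the stop-gradient
        squashed = torch.tanh(x)
        action = squashed * self.action_scale + self.action_bias
        if not training:
            return action
        # Change-of-variables correction for a = scale * tanh(x) + bias.
        log_det = torch.log(self.action_scale * (1.0 - squashed.pow(2)) + 1e-6)
        log_prob = (dist.log_prob(x) - log_det).sum(dim=-1)
        # Entropy of the pre-squash Gaussian, summed over action dimensions.
        entropy = dist.entropy().sum(dim=-1)
        return action, log_prob, entropy
```

Calling `actor(hidden, stoch)` returns only the scaled action, while `actor(hidden, stoch, training=True)` returns the action together with one log-prob and one entropy value per batch element.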




What is an "actor"?
- In policy gradient RL, the actor is a function pi_theta(a | s) which outputs a distribution over actions given a state/observation. A sample of an action is taken from the distribution to decide what to do in the environment. Theta is trained so that actions that lead to higher return become more likely.  
- The actor and critic are trained entirely on imagined trajectories produced by the world model, not from real environment steps.

- DreamerV3's actor is different in that it does not use lookahead planning. The actor network is trained alongside the world model. Therefore, when acting online, it samples an action from its learned distribution rather than simulating multiple action sequences and choosing the best one based on expected rewards.


**Side note:** (required for creating the actor)

register_buffer registers a tensor as module state that is not a parameter: it is never updated by the optimizer, it moves with the module across devices, and it is saved in the state_dict for reproducibility.  
The actor needs these buffers to map its squashed outputs back to the environment's action range. After the network predicts the mean and std, the sampled value is passed through an activation function (e.g., tanh) that squashes it into [-1, 1]. The environment usually does not expect actions in that range: if the original action range is [-3, 3], values in [-1, 1] cover only part of it and do not map properly to environment actions. We have to map values between -1 and 1 to values between -3 and 3, and register_buffer stores the tensors (a scale and a bias) that do this.  

They do this with an affine transformation, y = s * x + b (applied elementwise, per action dimension), where s is the scale (stretches or compresses values) and b is the bias (shifts the range).  

We take the action range (action low and action high) and solve for s and b, since we know both the source range [-1, 1] and the target range: tanh = 1 should give the high action and tanh = -1 should give the low action. So x = -1 gives low = -s + b and x = 1 gives high = s + b. Adding the two equations gives b = (low + high) / 2, and subtracting them gives s = (high - low) / 2. Plugging in the action lows and highs yields the bias and scale, and with them the formula for any action value in the range: action_value = tanh_output * scale + bias.  
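
The scale and bias formulas can be checked with a few lines of plain Python. The helper names `affine_params` and `to_action` are made up for this sketch, and the asymmetric range [-2, 4] is an invented example to show a non-zero bias.

```python
def affine_params(low, high):
    """Return (scale, bias) mapping [-1, 1] onto [low, high]."""
    scale = (high - low) / 2.0
    bias = (high + low) / 2.0
    return scale, bias


def to_action(tanh_output, low, high):
    """Map a tanh output in [-1, 1] to the environment's action range."""
    scale, bias = affine_params(low, high)
    return tanh_output * scale + bias


# Symmetric range from the text: [-3, 3] gives scale = 3, bias = 0.
print(affine_params(-3.0, 3.0))    # (3.0, 0.0)

# Asymmetric range (invented for illustration): [-2, 4] gives scale = 3, bias = 1.
print(to_action(-1.0, -2.0, 4.0))  # -2.0
print(to_action(1.0, -2.0, 4.0))   # 4.0
print(to_action(0.0, -2.0, 4.0))   # 1.0
```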



**Side note:**  

When transforming a random variable, total probability must be conserved (the density still has to integrate to 1 over the new space). Since tanh squashes the sample, it stretches and compresses regions of the space, so the density has to be rescaled to compensate. This matters because during training we backpropagate through the log probabilities of the actions the actor samples. I say this so that when I take log_probs from the normal distribution for the sampled x and then squash it with tanh, I remember to implement the change-of-variables correction (Jacobian of a = tanh(x)): log p(a) = log p(x) - log(1 - tanh(x)^2), summed over action dimensions. If the action is also rescaled by the affine transform, log(scale) is subtracted as well.
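
As a sanity check on the correction, the sketch below numerically integrates the density of a one-dimensional squashed action over its range, with and without the Jacobian term. Everything here is illustrative: the function names, the mean 0.3, the std 0.8, the scale 3, and the midpoint-rule integration are assumptions chosen for the example, and the bias is taken as 0 since a shift does not change the density.

```python
import math


def normal_log_prob(x, mean, std):
    """Log-density of N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def squashed_log_prob(x, mean, std, scale=1.0):
    """Log-density of a = scale * tanh(x), evaluated via the pre-squash sample x."""
    # |da/dx| = scale * (1 - tanh(x)^2)
    log_det = math.log(scale) + math.log(1.0 - math.tanh(x) ** 2)
    return normal_log_prob(x, mean, std) - log_det


def total_probability(mean, std, scale=1.0, corrected=True, n=20000):
    """Midpoint-rule integral of the density over the action range (-scale, scale)."""
    da = 2.0 * scale / n
    total = 0.0
    for i in range(n):
        a = -scale + (i + 0.5) * da
        x = math.atanh(a / scale)  # invert the squashing
        if corrected:
            log_p = squashed_log_prob(x, mean, std, scale)
        else:
            log_p = normal_log_prob(x, mean, std)
        total += math.exp(log_p) * da
    return total


print(total_probability(0.3, 0.8, scale=3.0))                   # close to 1
print(total_probability(0.3, 0.8, scale=3.0, corrected=False))  # not close to 1
```

With the correction the integral comes out at approximately 1; without it, the "density" no longer integrates to 1, which is why uncorrected log_probs would give wrong gradients.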



**Side note:** 

Entropy is computed in the actor forward pass to encourage diversity of actions: H = E[-log p(a | st)].

Entropy-regularized policy gradient objective:  
actor_loss = -E[Advantage * log πθ(at | st) + β * H(πθ(· | st))]  

In the actor loss, E[-log p(a | st)] = H(πθ(· | st)), where the dot represents the whole distribution over a.  

Without the entropy term, the actor tends to become greedy and collapse to deterministic behavior. The term acts as a regularizer: it keeps exploration going by preventing premature convergence (the policy settling on one behavior and no longer trying others), which also allows for more varied imagination rollouts.  

β controls how strongly exploration is encouraged.
- small β -> low entropy -> greedy
- large β -> high entropy -> exploratory

Essentially, in a practical sense, it is a "stay alive" regularizer to ensure the actor keeps acting and exploring.
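
To make the role of β concrete, here is a small plain-Python sketch of the objective above, averaged over a batch. The function names and the numbers are invented for illustration, and the entropy uses the closed form for a one-dimensional Gaussian. In a real training loop the advantages would be treated as constants (no gradient flows through them).

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def actor_loss(advantages, log_probs, entropies, beta):
    """Batch average of -(advantage * log_prob + beta * entropy)."""
    terms = [adv * lp + beta * ent
             for adv, lp, ent in zip(advantages, log_probs, entropies)]
    return -sum(terms) / len(terms)


advantages = [1.0, -0.5]   # made-up values
log_probs = [-1.2, -0.7]   # made-up values
narrow = [gaussian_entropy(0.5)] * 2  # low-entropy policy
wide = [gaussian_entropy(1.0)] * 2    # high-entropy policy

# With beta = 0 the entropy is ignored, so both policies get the same loss.
print(actor_loss(advantages, log_probs, narrow, beta=0.0))
print(actor_loss(advantages, log_probs, wide, beta=0.0))

# With beta > 0 the higher-entropy policy gets the lower loss.
print(actor_loss(advantages, log_probs, narrow, beta=0.1))
print(actor_loss(advantages, log_probs, wide, beta=0.1))
```

Raising `beta` widens the gap in favor of the higher-entropy policy, which is the "large β -> exploratory" case from the list above.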

