### Normal Distribution

##### Example 1

In [1]:
import torch

In [2]:
mean = torch.zeros(3)
std = torch.ones(3)

In [3]:
dist = torch.distributions.Normal(mean, std)

Sample an action from the distribution

In [4]:
action = dist.sample()

In [6]:
action

tensor([ 0.6733, -0.2301,  1.6005])

Calculate the log probability of the action

In [8]:
log_prob = dist.log_prob(action)

In [9]:
log_prob

tensor([-1.1456, -0.9454, -2.1998])
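As a sanity check, the values returned by `log_prob` can be reproduced from the closed-form Gaussian log-density $\log \mathcal{N}(x ; \mu, \sigma)=-\frac{(x-\mu)^2}{2 \sigma^2}-\log \sigma-\frac{1}{2} \log (2 \pi)$. The sketch below reuses the sampled action printed above; the helper name `gaussian_log_prob` is introduced here for illustration only.

```python
import math
import torch

def gaussian_log_prob(x, mean, std):
    # Closed-form log-density of an independent Gaussian, evaluated per dimension
    return -0.5 * ((x - mean) / std) ** 2 - torch.log(std) - 0.5 * math.log(2 * math.pi)

mean = torch.zeros(3)
std = torch.ones(3)
dist = torch.distributions.Normal(mean, std)

# The action printed in the cell above
action = torch.tensor([0.6733, -0.2301, 1.6005])

manual = gaussian_log_prob(action, mean, std)
builtin = dist.log_prob(action)

# Summing over dimensions gives the joint log-probability of the whole action vector
joint_log_prob = builtin.sum()
```

`log_prob` returns one value per action dimension, so the sum over dimensions is typically what a policy-gradient or SAC loss would consume.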

### 3.2 Maximum Entropy Reinforcement Learning

##### Example 1

$J(\pi)=\sum_{t=0}^T \mathbb{E}_{\left(\mathbf{s}_t, \mathbf{a}_t\right) \sim \rho_\pi}\left[r\left(\mathbf{s}_t, \mathbf{a}_t\right)+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_t\right)\right)\right]$
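To make the objective concrete, the following minimal sketch estimates $J(\pi)$ from a single trajectory by adding $\alpha$ times the policy entropy to each reward. The rewards, the per-state standard deviations, and the helper `entropy_augmented_return` are all made up for illustration, and a diagonal Gaussian policy is assumed.

```python
import torch

def entropy_augmented_return(rewards, entropies, alpha):
    # Single-trajectory estimate of J(pi): sum over t of r_t + alpha * H(pi(.|s_t))
    return (rewards + alpha * entropies).sum()

# Hypothetical 4-step trajectory: rewards and the policy's std at each visited state
rewards = torch.tensor([1.0, 0.0, 0.5, 2.0])
policy_std = torch.tensor([[1.0, 1.0], [0.5, 0.5], [2.0, 1.0], [0.1, 0.1]])
policy_mean = torch.zeros_like(policy_std)

# Entropy of a diagonal Gaussian is the sum of the per-dimension entropies
entropies = torch.distributions.Normal(policy_mean, policy_std).entropy().sum(dim=-1)

plain_return = entropy_augmented_return(rewards, entropies, alpha=0.0)
soft_return = entropy_augmented_return(rewards, entropies, alpha=0.2)
```

With `alpha=0.0` the objective reduces to the ordinary sum of rewards; a larger `alpha` rewards the policy for staying stochastic.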

# 1. Soft Policy Iteration

### 1.1 Policy evaluation step

##### Example 2

Write pseudocode to update the Q function according to the policy evaluation step's equation:

$\mathcal{T}^\pi Q\left(\mathbf{s}_t, \mathbf{a}_t\right) \triangleq r\left(\mathbf{s}_t, \mathbf{a}_t\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V\left(\mathbf{s}_{t+1}\right)\right]$

**Hint**
- `n_steps` is the number of time steps for which the agent interacts with the environment in an episode
- `states` and `actions` are the states observed and the actions taken at each time step

In [None]:
for t in range(n_steps):
    # Compute the reward for the current time step
    reward = compute_reward(states[t], actions[t])
    
    # Compute the expected value of the next state
    expected_value = value_function(states[t+1]).mean()
    
    # Update the Q function using the modified Bellman backup operator
    Q[t] = reward + gamma * expected_value
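For a runnable counterpart to the pseudocode, here is a minimal sketch of soft policy evaluation on a small tabular MDP. The MDP (random rewards and transitions) and the policy are hypothetical, and $V$ is taken to be the soft state value $V(\mathbf{s})=\mathbb{E}_{\mathbf{a} \sim \pi}[Q(\mathbf{s}, \mathbf{a})-\log \pi(\mathbf{a} \mid \mathbf{s})]$, the same quantity that appears inside the soft value loss in this document. Repeatedly applying the backup should converge to a fixed point because the operator is a $\gamma$-contraction.

```python
import torch

torch.manual_seed(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Hypothetical tabular MDP: rewards r(s, a) and transition probabilities p(s' | s, a)
rewards = torch.rand(n_states, n_actions)
transitions = torch.softmax(torch.randn(n_states, n_actions, n_states), dim=-1)

# A fixed stochastic policy pi(a | s) to evaluate
policy = torch.softmax(torch.randn(n_states, n_actions), dim=-1)

def soft_bellman_backup(q):
    # Soft state value: V(s) = E_{a ~ pi}[Q(s, a) - log pi(a | s)]
    v = (policy * (q - policy.log())).sum(dim=-1)
    # T^pi Q(s, a) = r(s, a) + gamma * E_{s' ~ p}[V(s')]
    return rewards + gamma * (transitions @ v)

# Start from an arbitrary Q and apply the backup until it stops changing
q = torch.zeros(n_states, n_actions)
for _ in range(500):
    q = soft_bellman_backup(q)
```

Unlike the per-step pseudocode, this version uses the known transition probabilities to compute the expectation over next states exactly.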

### 1.2 Policy improvement step

##### Example 1

$\pi_{\text {new }}=\arg \min _{\pi^{\prime} \in \Pi} D_{K L}\left(\pi^{\prime}\left(\cdot \mid \mathbf{s}_t\right) \| \frac{\exp \left(Q^{\pi_{\text {old }}}\left(\mathbf{s}_t, \cdot\right)\right)}{Z^{\pi_{\text {old }}}\left(\mathbf{s}_t\right)}\right)$

In the context of soft actor-critic, answer the following questions.

`Big Picture`

**1. What does the equation do?**

**Answer**: It picks the new policy that minimizes the KL divergence between two probability distributions over actions: the candidate policy $\pi^{\prime}\left(\cdot \mid \mathbf{s}_t\right)$ and the normalized exponentiated Q-function $\frac{\exp \left(Q^{\pi_{\text {old }}}\left(\mathbf{s}_t, \cdot\right)\right)}{Z^{\pi_{\text {old }}}\left(\mathbf{s}_t\right)}$

`Notations`

**2. What does $\pi^{\prime} \in \Pi$ mean?**

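**Answer**: It restricts the candidate policies to a chosen set $\Pi$, for example a parameterized family of distributions such as Gaussians, so that the new policy stays tractable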


**3. What does $Z^{\pi_{\text {old }}}\left(\mathbf{s}_t\right)$ mean? Why is it needed?**

**Answer**:

- It is the partition function: a scalar (for a given state) that normalizes the distribution defined by the exponentiated Q-function
- It ensures that this distribution sums (or, for continuous actions, integrates) to 1 over all actions

`Q-Function`

**4. What does $Q^{\pi_{\text {old }}}\left(\mathbf{s}_t, \cdot\right)$ mean?**

**Answer**: It is the Q-function of the old policy with the state fixed to $\mathbf{s}_t$, viewed as a function of the action only: it takes an action as input and produces a scalar output.



**5. What does the $\cdot$ in $Q^{\pi_{\text {old }}}\left(\mathbf{s}_t, \cdot\right)$ represent?**

**Answer**: The dot indicates that the action component of the state-action pair has not been specified yet

**6. Why is the old $\exp \left(Q^{\pi_{\text {old }}}\left(\mathbf{s}_t,{\cdot}\right)\right)$ the target for the new policy?**

**7. Why is the exponential of $Q^{\pi_{\text {old }}}\left(\mathbf{s}_t, \cdot\right)$ taken in this equation?**

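**Answer**: It ensures that the target can be a valid probability distribution. The exponential maps every real number to a positive value, so $\exp (Q)$ is positive everywhere, and dividing by $Z^{\pi_{\text {old }}}\left(\mathbf{s}_t\right)$ then normalizes it. Actions with higher Q-values receive exponentially more probability.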
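The answers above can be checked numerically in the discrete-action case, where $Z$ is a finite sum. The sketch below is only an illustration: the Q-values are made up, and `kl_divergence` is a hypothetical helper, not part of any SAC implementation.

```python
import torch

def kl_divergence(p, q):
    # D_KL(p || q) for discrete distributions given as probability vectors
    return (p * (p.log() - q.log())).sum(dim=-1)

# Hypothetical Q-values of the old policy for one state and 4 discrete actions
q_old = torch.tensor([1.0, 2.0, 0.5, -1.0])

# log Z(s) = log sum_a exp(Q(s, a)); dividing by Z turns exp(Q) into a distribution
log_z = torch.logsumexp(q_old, dim=-1)
target = torch.exp(q_old - log_z)

# A candidate policy that ignores the Q-values, and the candidate equal to the target
uniform_policy = torch.full((4,), 0.25)
kl_uniform = kl_divergence(uniform_policy, target)
kl_target = kl_divergence(target, target)
```

If the policy set $\Pi$ contained every distribution, the minimizer would be the target itself, with a KL divergence of zero; any other candidate, such as the uniform policy here, has a strictly positive divergence.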

##### Example 2

Policy improvement step

$\pi_{\text {new }}=\arg \min _{\pi^{\prime} \in \Pi} D_{K L}\left(\pi^{\prime}\left(\cdot \mid \mathbf{s}_t\right) \| \frac{\exp \left(Q^{\pi_{\text {old }}}\left(\mathbf{s}_t, \cdot\right)\right)}{Z^{\pi_{\text {old }}}\left(\mathbf{s}_t\right)}\right)$

Wha

### Enforcing Action Bounds

$\mu(\mathbf{u} \mid \mathbf{s})$ is the density of the unbounded Gaussian distribution

$\pi(\tanh (\mathbf{u}) \mid \mathbf{s})$ is the density of the bounded action distribution

The correction term in the log-likelihood below ensures that the bounded action distribution remains a properly normalized density after the unbounded Gaussian sample is squashed

**Hints**
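- $\mu$ and $\pi$ are just notation for the density before and after squashing: $\mu$ is the density of the unbounded sample $\mathbf{u}$, and $\pi$ is the density of the bounded action $\mathbf{a}=\tanh (\mathbf{u})$. Both describe the same random draw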
- $\mathrm{d} \mathbf{a} / \mathrm{d} \mathbf{u}=\operatorname{diag}\left(1-\tanh ^2(\mathbf{u})\right)$

$\log \pi(\tanh (\mathbf{u}) \mid \mathbf{s})=\log \mu(\mathbf{u} \mid \mathbf{s})-\sum_{i=1}^D \log \left(1-\tanh ^2\left(u_i\right)\right)$

Answer the following questions:

##### Question 1: Explain why $\tanh (\mathbf{u})$ is used

**Explain**

The sample $\mathbf{u}$ is drawn from a normal distribution, which has infinite support. In practice, however, the action must lie in a finite interval.

To enforce this bound, we simply apply a squashing function: $\tanh$ maps every real number into $(-1, 1)$.

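##### Question 2: Why subtract $\sum_{i=1}^D \log \left(1-\tanh ^2\left(u_i\right)\right)$?

**Explain**

Squashing changes the density: $\tanh$ compresses the whole real line into $(-1, 1)$, so the probability mass of $\mu(\mathbf{u} \mid \mathbf{s})$ is redistributed over a smaller region. The subtracted term is the log-determinant of the Jacobian $\mathrm{d} \mathbf{a} / \mathrm{d} \mathbf{u}$, and it rescales $\mu$ so that $\pi(\tanh (\mathbf{u}) \mid \mathbf{s})$ is still a valid density.

##### Question 3: What is the math behind this equation?

**Explain**

Use the Jacobian of the transformation to convert the density of $\mathbf{u}$ into the density of $\mathbf{a}=\tanh (\mathbf{u})$, following the change-of-variables formula $f_Y(y)=f_X(x)\left|\frac{d x}{d y}\right|$

With $x=u_i$ and $y=\tanh \left(u_i\right)$ we get $\left|\frac{d x}{d y}\right|=\frac{1}{1-\tanh ^2\left(u_i\right)}$. Taking the logarithm and summing over the $D$ independent dimensions gives the correction term.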
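The log-likelihood formula can be verified numerically. The following sketch, assuming a standard Gaussian for $\mu$, computes $\log \pi$ by hand and cross-checks it against PyTorch's `TransformedDistribution` with `TanhTransform`. It also shows an algebraically equivalent form of the correction, $\log \left(1-\tanh ^2(u)\right)=2(\log 2-u-\operatorname{softplus}(-2 u))$, which avoids evaluating $\log (0)$ when $\tanh$ saturates; that rewrite is not taken from the notes above.

```python
import math
import torch
import torch.nn.functional as F
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

torch.manual_seed(0)
mean = torch.zeros(3, dtype=torch.float64)
std = torch.ones(3, dtype=torch.float64)
base = Normal(mean, std)

u = base.sample()   # unbounded pre-squash sample
a = torch.tanh(u)   # bounded action in (-1, 1)

# log pi(a | s) = log mu(u | s) - sum_i log(1 - tanh^2(u_i))
log_mu = base.log_prob(u).sum(dim=-1)
log_pi = log_mu - torch.log(1 - a.pow(2)).sum(dim=-1)

# Equivalent correction: log(1 - tanh^2(u)) = 2 * (log 2 - u - softplus(-2u))
stable_correction = (2 * (math.log(2) - u - F.softplus(-2 * u))).sum(dim=-1)
log_pi_stable = log_mu - stable_correction

# Cross-check against PyTorch's own change-of-variables machinery
squashed = TransformedDistribution(base, [TanhTransform()])
log_pi_reference = squashed.log_prob(a).sum(dim=-1)
```

Because $1-\tanh ^2\left(u_i\right) \leq 1$, each subtracted logarithm is non-positive, so the squashed log-density is never smaller than the Gaussian one: the same mass is packed into a smaller region.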

### Loss function of soft value function

##### Example 1

In [12]:
import torch
import torch.nn.functional as F

In [13]:
q_function = torch.nn.Linear(5, 3)

In [14]:
v_function = torch.nn.Linear(5, 3)

In [15]:
action_probs = torch.randn(10, 6)
action_probs = F.softmax(action_probs, dim=-1)

In [16]:
states = torch.randn(10, 5)
actions = torch.randn(10, 5)

In [17]:
def q_function(s, a):
    # Define your q function here, for example:
    q_prediction = torch.sum(s * a)
    return q_prediction

In [18]:
q_function, v_function

(<function __main__.q_function(s, a)>,
 Linear(in_features=5, out_features=3, bias=True))

Given
- `action_probs`: a probability distribution over all possible actions at each time step
- `q_function`, `v_function`, `states`, `actions`: as defined in the cells above

In [19]:
states.shape, actions.shape, action_probs.shape

(torch.Size([10, 5]), torch.Size([10, 5]), torch.Size([10, 6]))

Write a function that calculates the loss of the soft value function according to

$J_V(\psi)=\mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}}\left[\frac{1}{2}\left(V_\psi\left(\mathbf{s}_t\right)-\mathbb{E}_{\mathbf{a}_t \sim \pi_\phi}\left[Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)-\log \pi_\phi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)\right]\right)^2\right]$

(Equation 5 in the paper)

In [20]:
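def calculate_loss_of_soft_v(states, actions, action_probs):
    # log pi(a_t | s_t): a log-probability, not the entropy itself
    log_probs = action_probs.log()

    # Estimate of E_a[Q(s_t, a_t) - log pi(a_t | s_t)]. This toy version averages
    # over the whole batch; it is detached so only v_function receives gradients.
    soft_q_target = torch.mean(q_function(states, actions) - log_probs).detach()

    # Squared residual between V(s_t) and the target, scaled by 1/2
    squared_error = (v_function(states) - soft_q_target).pow(2) / 2

    # Expectation over states sampled from the replay buffer D
    loss = torch.mean(squared_error)

    return loss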

In [21]:
calculate_loss_of_soft_v(states, actions, action_probs)

tensor(10.0669, grad_fn=<MeanBackward0>)
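The toy function above averages the target over the whole batch, whereas the equation takes the expectation over actions separately for each state. A minimal sketch of that per-state version for discrete actions follows. The network shapes are assumptions made for illustration (`q_net` outputs one Q-value per action, `v_net` outputs one value per state), and the expectation is computed exactly by weighting with `action_probs`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, state_dim, n_actions = 10, 5, 6

# Assumed shapes: Q outputs one value per discrete action, V outputs one value per state
q_net = torch.nn.Linear(state_dim, n_actions)
v_net = torch.nn.Linear(state_dim, 1)

states = torch.randn(batch_size, state_dim)
action_probs = F.softmax(torch.randn(batch_size, n_actions), dim=-1)

def soft_value_loss(states, action_probs):
    log_probs = action_probs.log()
    with torch.no_grad():
        # Exact expectation over actions, computed separately for each state:
        # E_{a ~ pi}[Q(s, a) - log pi(a | s)]
        target = (action_probs * (q_net(states) - log_probs)).sum(dim=-1)
    v_pred = v_net(states).squeeze(-1)
    return 0.5 * (v_pred - target).pow(2).mean()

loss = soft_value_loss(states, action_probs)
loss.backward()
```

Computing the target under `torch.no_grad()` means that only the value network is updated by this loss; the Q-network and the policy have their own objectives.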

### Loss function of Q function

##### Example 1

Given
- `states`: a list of observations that the agent observed at each time step
- `actions`: a list of actions that the agent took at each time step

Write pseudocode that calculates the loss of the Q-network according to the equation

$J_Q(\theta)=\mathbb{E}_{\left(\mathbf{s}_t, \mathbf{a}_t\right) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)-\hat{Q}\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)^2\right]$

with $\hat{Q}\left(\mathbf{s}_t, \mathbf{a}_t\right)=r\left(\mathbf{s}_t, \mathbf{a}_t\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V_{\bar{\psi}}\left(\mathbf{s}_{t+1}\right)\right]$

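In [None]:
# s_{t+1} is the next entry in the trajectory; `states + 1` would add 1 to every observation instead
next_states = states[1:]
# One sampled next state per step serves as the estimate of the expectation over s_{t+1}
target_q = get_rewards(states[:-1], actions[:-1]) + discount_factor * v_network(next_states)

In [None]:
# The last time step has no successor state, so it is dropped
predicted_q = q_network(states[:-1], actions[:-1])

In [None]:
loss = ((predicted_q - target_q).pow(2) * 0.5).mean()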
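A runnable version of this pseudocode might look like the sketch below. Everything concrete in it is an assumption made for illustration: the linear networks, the random batch standing in for replay-buffer samples, and the name `target_v_net` for the value network with parameters $\bar{\psi}$. Terminal states are ignored for brevity.

```python
import torch

torch.manual_seed(0)
batch_size, state_dim, action_dim, gamma = 8, 5, 2, 0.99

# Assumed architectures: Q takes the concatenated (state, action), V takes the state
q_net = torch.nn.Linear(state_dim + action_dim, 1)
target_v_net = torch.nn.Linear(state_dim, 1)  # plays the role of V with parameters psi-bar

# Hypothetical replay-buffer batch of transitions (s_t, a_t, r_t, s_{t+1})
states = torch.randn(batch_size, state_dim)
actions = torch.randn(batch_size, action_dim)
rewards = torch.randn(batch_size)
next_states = torch.randn(batch_size, state_dim)

def soft_q_loss(states, actions, rewards, next_states):
    with torch.no_grad():
        # Q-hat(s_t, a_t) = r(s_t, a_t) + gamma * V_psi-bar(s_{t+1}), one sample of s_{t+1}
        target_q = rewards + gamma * target_v_net(next_states).squeeze(-1)
    predicted_q = q_net(torch.cat([states, actions], dim=-1)).squeeze(-1)
    return 0.5 * (predicted_q - target_q).pow(2).mean()

loss = soft_q_loss(states, actions, rewards, next_states)
loss.backward()
```

The target is built without gradient tracking, so the loss only moves $Q_\theta$ toward $\hat{Q}$ and leaves the target value network untouched.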