### **Chapter 8: Deep Reinforcement Learning**

In this chapter, we introduce **deep reinforcement learning (DRL)**, a powerful framework for solving complex optimal control problems in nonlinear and high-dimensional systems. Our focus is on **Proximal Policy Optimization (PPO)**, one of the most popular and effective policy gradient algorithms in modern DRL. Later in this chapter, we compare PPO with nonlinear model predictive control (NMPC) to highlight their respective strengths and limitations in practical control scenarios.

First, we need to set up our Python environment and import relevant packages.

In [None]:
import sys
import os
import numpy as np

sys.path.append(os.path.abspath(".."))
from utils.env import *
from utils.simulator import *
from ex5_MPC.mpc_utils import *
from ex8_DRL.ppo_utils import *

### **Preparation: specify some common task parameters**

In the previous exercise, we demonstrated how to define symbolic functions using CasADi: defining the mountain profile as a function of $p$, deriving the conversion formulas between the slope profile $h(p)$ and the inclination angle $\theta(p)$, and establishing the system's dynamics. These formulas have already been integrated into the classes `Env` and `Dynamics`. In this chapter, we specify the arguments and instantiate these classes directly to use their functionality.

- Parameters in the task:  

   - case: 4 (hilly terrain)
   
   - initial state: $\boldsymbol{x}_0 = [-0.5, 0.0]^T$

   - target state: $\boldsymbol{x}_T = [0.6, 0.0]^T$

   - state space: $ \mathcal{X}_1 = [-1.7, 1.3]$,  $ \mathcal{X}_2 = [-3.5, 3.5]$

   - input space: $ \mathcal{U} = [-5.0, 5.0]$

In [None]:
# Define profile of slope, the initial / target state
case = 4
initial_position = -0.5
initial_velocity = 0.0
target_position = 0.6
target_velocity = 0.0

state_lbs = np.array([initial_position-1.2, -3.5])
state_ubs = np.array([target_position+0.7, 3.5])
input_lbs = -5.0
input_ubs = 5.0

# Define time length and freq for simulation
t_terminal = 10
freq = 10

Later in this chapter, we compare the performance of the PPO controller with traditional model-based control, represented by NMPC. We therefore predefine a few MPC controller settings here for later evaluation.

In [None]:
# NMPC parameters (as reference)
# To find a feasible solution, the MPC needs a slightly different discretization and larger bounds
freq_mpc = 20
dt_mpc = 1.0/freq_mpc  # time step of the MPC discretization
state_lbs_mpc = np.array([-2.0, -4.0])
state_ubs_mpc = np.array([2.0, 4.0])

# Define weight matrix in stage and terminal cost and the horizon for MPC (reference controller)
Q = np.diag([1, 1])
R = np.array([[0.1]])
Qf = Q
N = 60


----

<br>

### **Proximal Policy Optimization (PPO)**

Proximal Policy Optimization (PPO), as introduced by Schulman et al. in their original paper *“Proximal Policy Optimization Algorithms”* (2017), is one of the most popular policy gradient methods in Reinforcement Learning. It was proposed by OpenAI as a **simplified and more stable alternative** to Trust Region Policy Optimization (TRPO). PPO belongs to the family of **on-policy actor-critic methods**, and it’s widely used in both research and practical applications (e.g., training agents to play Dota 2, as shown in the core course slide). The core idea is to **avoid large destructive updates** to the policy by **constraining or clipping** how much the policy is allowed to change at each step, thereby preventing performance degradation or training instability caused by overly aggressive policy updates — this is precisely where the term "Proximal" comes from.

* Schulman et al. (2017): [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)

<br>

#### **Core Components of PPO**

* **Actor-Critic Structure**: 

   PPO adopts an actor-critic architecture, which is a foundational design pattern in modern policy gradient reinforcement learning methods. This structure separates the roles of **decision making (actor)** and **value estimation (critic)**.

   * **Actor**: The Policy Network

     * The **actor** is responsible for selecting actions based on the current state.
     * It parameterizes the **policy** $\pi_\theta(a|s)$, which is a probability distribution over actions given the state $s$.
     * During training, the actor is updated to maximize expected return by following the **policy gradient**, typically using advantage-weighted updates:
     $$
     \nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot \hat{A}(s, a) \right]
     $$
   
   * **Critic**: The Value Network

     * The **critic** provides a **baseline estimate** of how good a state is, helping to reduce the variance of the policy gradient.
     * It learns the **state value function** $V^\pi(s)$, which estimates the expected return from state $s$ under the current policy $\pi$.
     * The critic is typically trained by minimizing the squared error between the predicted value and empirical return (or bootstrapped target):
     $$
     \mathcal{L}_{\text{critic}} = \left( V_\phi(s_t) - R_t \right)^2
     $$


* **Advantage Estimation with State-Dependent Baseline (GAE)**:

  * PPO reduces variance in the policy gradient by using a **state-dependent baseline**, typically the value function $V(s)$, instead of relying solely on raw returns $R_t$.
  * This leads to the **advantage function**:
    $$
    \hat{A}_t = R_t - V(s_t)
    $$
    which measures how much better or worse an action performed compared to the expected value of the state.
  * To further improve this estimate, PPO employs **Generalized Advantage Estimation (GAE)**, which blends multi-step bootstrapping and Monte Carlo returns to strike a balance between bias and variance. GAE defines the advantage recursively via the **TD residual** $\delta_t$:
    $$
    \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
    $$
    and constructs the advantage as an exponentially weighted sum:
    $$
    \hat{A}_t^{GAE(\lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}
    $$
  * Here, $\lambda \in [0, 1]$ controls the trade-off: lower values reduce variance but introduce more bias, while higher values favor accuracy but increase variance. This technique leads to smoother, more stable advantage estimates and more efficient learning.


* **Clipped Surrogate Objective**:

  * PPO avoids the complexity of trust region methods (like TRPO) by using a **clipping mechanism** to constrain how much the new policy $\pi_\theta$ is allowed to deviate from the old policy $\pi_{\theta_{\text{old}}}$ during each update.
  * The key quantity is the **probability ratio**:
    $$
    r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}
    $$
    which measures how much more (or less) likely the new policy is to take action $a_t$ at state $s_t$ compared to the old policy.
  * The **clipped surrogate objective** then takes the minimum between the unclipped and clipped policy objective:
    $$
    L^{CLIP}(\theta) = \mathbb{E} \left[ \min \left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]
    $$
    where $\hat{A}_t$ is the estimated advantage and $\epsilon$ is a small positive hyperparameter (typically $\epsilon = 0.1 \sim 0.3$).
  * This **clip operation prevents the policy from changing too much in a single update step**, which protects against large, destabilizing policy updates that could hurt performance.
    * If $r_t(\theta) > 1 + \epsilon$ and $\hat{A}_t > 0$: the update would excessively increase the probability of a "good" action, so the ratio is clipped down to $1 + \epsilon$.
    * If $r_t(\theta) < 1 - \epsilon$ and $\hat{A}_t < 0$: the update would excessively decrease the probability of a "bad" action, so the ratio is clipped up to $1 - \epsilon$.
  * Unlike TRPO, which enforces a hard KL-divergence constraint using second-order optimization, PPO uses this **first-order, simple-to-implement clip mechanism** to achieve similar "proximal" updates while remaining computationally efficient.
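
The GAE recursion and the clipped surrogate objective described above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the function names `compute_gae` and `clipped_surrogate` are hypothetical, the infinite sum is truncated at the end of a single rollout with no episode termination inside it, and the expectation is replaced by a sample mean. It is not the implementation inside `PPOController`.

```python
import numpy as np


def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE for one rollout of length T (assumes no episode ends inside it).

    rewards: r_0..r_{T-1}, values: V(s_0)..V(s_{T-1}), last_value: V(s_T).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.1):
    """Sample mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))


# Toy rollout: three steps with reward 1 and a critic that predicts 0 everywhere
adv = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0, gamma=0.5, lam=1.0)
print("advantages:", adv)

# A "good" action whose probability grew by 50%: the objective is capped at (1 + eps) * A
print("objective:", clipped_surrogate([np.log(1.5)], [0.0], [1.0], eps=0.1))
```

Working with log-probabilities and exponentiating their difference is a common way to form the ratio $r_t(\theta)$ without dividing two small numbers. In training, the negative of this objective would be minimized with a gradient-based optimizer.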

<br>

#### **Tips for Practice**

Although PPO is **fairly robust** to the choice of hyperparameters, **implementation details still matter a lot**. In fact, the paper *“Implementation Matters in Deep RL: A Case Study on PPO and TRPO”* (Engstrom et al., 2020) identifies **9 critical tricks** that have a **significant impact on PPO's performance**. These implementation techniques, such as advantage normalization, observation normalization, proper reward scaling, and value clipping, are worth considering when building PPO from scratch or tuning an existing implementation, since they can lead to more stable training and better empirical performance.

* Engstrom et al. (2020): [Implementation Matters in Deep RL: A Case Study on PPO and TRPO](https://arxiv.org/abs/2005.12729)
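
Two of the techniques named above, advantage normalization and value clipping, are small enough to sketch directly. The helper names `normalize_advantages` and `clipped_value_loss` and the default `clip_range` are assumptions made for illustration. The value-clipping form shown is one commonly used variant, and whether `PPOController` applies either trick is not stated in this chapter.

```python
import numpy as np


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and (roughly) unit std."""
    adv = np.asarray(advantages, dtype=float)
    return (adv - adv.mean()) / (adv.std() + eps)


def clipped_value_loss(values_new, values_old, returns, clip_range=0.2):
    """Squared value error, where the new prediction may move at most
    clip_range away from the prediction made when the data was collected."""
    values_new = np.asarray(values_new, dtype=float)
    values_old = np.asarray(values_old, dtype=float)
    returns = np.asarray(returns, dtype=float)
    values_clipped = values_old + np.clip(values_new - values_old, -clip_range, clip_range)
    loss_unclipped = (values_new - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Taking the element-wise maximum keeps the more pessimistic of the two errors
    return float(np.mean(np.maximum(loss_unclipped, loss_clipped)))


adv = normalize_advantages([1.0, 2.0, 3.0])
print("normalized advantages:", adv)
print("clipped value loss:", clipped_value_loss([1.0], [0.0], [1.0]))
```

Normalizing per mini-batch keeps the scale of the policy gradient roughly constant across training, which is one reason a single learning rate can work over the whole run.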



<br>

#### **Hyperparameters in PPO**

PPO involves several hyperparameters that significantly affect training performance, stability, and sample efficiency. Choosing appropriate values for these parameters is crucial for successful learning. Below, we list some commonly used hyperparameters in PPO, along with their **typical value ranges** and brief notes on **how they impact the algorithm’s behavior and performance**.


| Hyperparameter                        | Description                        | Typical Value(s) | Notes on Performance Impact                         |
| ------------------------------------- | ---------------------------------- | ---------------- | --------------------------------------------------- |
| `learning_rate_p` / `learning_rate_v` | Learning rate $\alpha$ for actor and critic | `1e-4 ~ 1e-3`  | Too high → divergence; too low → slow learning      |
| `gae_lambda`                          | GAE smoothing parameter    $\lambda$        | `0.95 ~ 0.98`           | Closer to 1 → less bias, more variance              |
| `clip_coef`                           | Clipping range $\epsilon$          | `0.1 ~ 0.3`      | Smaller → more conservative updates                 |
| `ent_coef`                            | Entropy bonus weight (exploration) | `0.01 ~ 0.05`    | Higher → more exploration, slower convergence       |
| `vf_coef`                             | Value function loss weight         | `0.5 ~ 1.0`            | Balance actor/critic loss contributions             |
| `max_grad_norm`                       | Gradient clipping norm             | `0.5 ~ 1.0`      | Helps stabilize training                            |
| `n_steps`                             | Rollout length                     | `128 ~ 2048`     | Longer → better estimates, but more memory          |
| `batch_size`                          | Training batch size                | `32 ~ 256`       | Affects convergence speed & stability               |
| `n_epochs`                            | Training epochs per rollout        | `4 ~ 10`         | More epochs → better fitting, risk of overfitting   |


<br>
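
To get a feeling for how `n_steps`, `batch_size`, `n_epochs`, and `max_iterations` interact, the short calculation below uses the values chosen for this chapter's training run. It assumes that each iteration collects one rollout of `n_steps` transitions and that every epoch sweeps that rollout once in non-overlapping mini-batches. This is the usual PPO arrangement, but how `PPOController` schedules its updates internally is not shown here.

```python
n_steps = 256          # transitions collected per rollout (assumed: one rollout per iteration)
batch_size = 32        # transitions per mini-batch
n_epochs = 10          # passes over each rollout
max_iterations = 1300  # number of rollout/update iterations

minibatches_per_epoch = n_steps // batch_size
updates_per_iteration = minibatches_per_epoch * n_epochs
total_env_steps = max_iterations * n_steps

print("mini-batches per epoch:", minibatches_per_epoch)
print("gradient updates per iteration:", updates_per_iteration)
print("total environment steps:", total_env_steps)
```

Under these assumptions, each collected transition is reused `n_epochs` times before it is discarded, which is why the clipping mechanism matters: it limits how far the policy can drift from the one that generated the data.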

Based on the hyperparameter configuration shown in the block below, we instantiate the PPO controller and demonstrate both the training process and the controller’s performance.


In [None]:
seed = 42

learning_rate_p = 1e-3    # learning rate for policy
learning_rate_v = 3e-4    # learning rate for value function
gamma = 0.99              # discount factor
gae_lambda = 0.95         # lambda for GAE
clip_coef = 0.1           # clipping coefficient for PPO
ent_coef = 0.01           # entropy coefficient
vf_coef = 0.5             # value function coefficient
max_grad_norm = 1.5       # gradient clipping
n_steps = 256             # number of environment steps collected per rollout
batch_size = 32           # size of mini-batch
n_epochs = 10             # number of epochs to update the policy
max_iterations = 1300     # maximum number of iterations for training

In [None]:
# Instantiate class 'Env'
env = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs, state_ubs=state_ubs, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics = Dynamics(env)
mdp = Env_rl_c(env=env, dynamics=dynamics, dt=1/freq)

# Instantiate the PPO controller class
controller_ppo = PPOController(mdp, freq, max_iterations=max_iterations, seed=seed,
                               learning_rate_p=learning_rate_p, learning_rate_v=learning_rate_v,
                               gamma=gamma, gae_lambda=gae_lambda, clip_coef=clip_coef,
                               ent_coef=ent_coef, vf_coef=vf_coef, max_grad_norm=max_grad_norm,
                               n_steps=n_steps, batch_size=batch_size, n_epochs=n_epochs,
                               name="PPO")
controller_ppo.load("checkpoints/ppo.pt")  # comment this out on the first run; keep it enabled to load the stored model
#controller_ppo.setup()  # keep this commented out when using the stored model
#controller_ppo.save("checkpoints/ppo.pt")  # keep this commented out when using the stored model
controller_ppo.plot_training_curve()
controller_ppo.plot_policy_heatmap()


In [None]:
# Instantiate the simulator, and then run the simulation
simulator_ppo = Simulator(dynamics, controller_ppo, env, 1/freq, t_terminal)
simulator_ppo.run_simulation()

# Also setup a NMPC controller for reference
env_mpc = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs_mpc, state_ubs=state_ubs_mpc, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics_mpc = Dynamics(env_mpc)
controller_mpc = MPCController(env_mpc, dynamics_mpc, Q, R, Qf, freq_mpc, N, name="NMPC")
simulator_mpc = Simulator(dynamics_mpc, controller_mpc, env_mpc, 1/freq_mpc, t_terminal)
simulator_mpc.run_simulation()

# Instantiate the visualizer, and display the plots and animation
visualizer_ppo = Visualizer(simulator_ppo)
visualizer_ppo.display_contrast_plots(simulator_mpc, title="Simulation of PPO Controller on Hilly Terrain", if_gray=True)
visualizer_ppo.display_contrast_animation_same(simulator_mpc, if_gray=True)

#### **Result Analysis:**

The above plots show a simulation comparison between a **PPO-trained controller** and a traditional **nonlinear model predictive controller (NMPC)** on a hilly terrain scenario. We observe that **PPO achieves performance comparable to the model-based NMPC**, both in terms of position tracking and constraint satisfaction. Despite minor differences in velocity overshoot and control effort, **PPO successfully reaches the target with stable behavior**.

Notably, the small performance discrepancy stems from **different cost function designs**:

* The **NMPC cost** is explicitly crafted to **balance stabilization and control effort minimization**, resulting in smoother and more energy-efficient control inputs.
* In contrast, the **DRL cost function (used to train PPO)** is often designed to **encourage time-optimality**, pushing the agent to reach the goal as quickly as possible.

As a result, the PPO controller adopts a **bang-bang-like control policy**, applying **larger input magnitudes** to accelerate faster, which can be seen in the third subplot. While this leads to higher input usage, it aligns with the learned time-optimal objective and demonstrates the flexibility of PPO in learning aggressive, goal-driven behaviors.
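
The effect of the two cost designs on input magnitude can be illustrated numerically. The sketch below evaluates a quadratic stage cost with the `Q`, `R`, and target state defined earlier in this chapter, next to a purely hypothetical constant per-step penalty. Both the exact quadratic form and `time_penalty_reward` are assumptions for illustration; the stage cost inside `MPCController` and the reward inside `Env_rl_c` are not shown in this chapter.

```python
import numpy as np

Q = np.diag([1.0, 1.0])          # state weights, as defined for the NMPC above
R = np.array([[0.1]])            # input weight, as defined for the NMPC above
x_target = np.array([0.6, 0.0])  # target state


def quadratic_stage_cost(x, u):
    """Assumed form: (x - x_target)^T Q (x - x_target) + u^T R u."""
    e = np.asarray(x, dtype=float) - x_target
    u = np.atleast_1d(np.asarray(u, dtype=float))
    return float(e @ Q @ e + u @ R @ u)


def time_penalty_reward(x, u):
    """Hypothetical time-optimal reward: -1 per step, independent of the input."""
    return -1.0


x0 = np.array([-0.5, 0.0])
for u in (1.0, 5.0):
    print(f"u = {u}: quadratic cost = {quadratic_stage_cost(x0, u):.2f}, "
          f"time-penalty reward = {time_penalty_reward(x0, u):.1f}")
```

With the quadratic cost, a larger input is penalized directly, whereas a pure time penalty is indifferent to input magnitude and only rewards finishing sooner. That difference is consistent with the more aggressive inputs seen for PPO in the plots.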


<blockquote style="padding-top: 20px; padding-bottom: 10px;">

##### **🔍 Hands-on Exploration: hyperparameters in the training step**

To better understand how hyperparameters affect training dynamics, try modifying the values of the parameters listed above and observe their impact on the PPO controller's learning behavior. Some hyperparameters that have the **strongest influence** on performance include:

* **`learning_rate_p / learning_rate_v`** — affects convergence speed and stability; too high may cause divergence.
* **`clip_coef`** — controls how aggressively the policy is allowed to change; smaller values lead to more conservative updates.
* **`ent_coef`** — governs the balance between exploration and exploitation; higher values promote exploration.
* **`n_steps`** — determines how many environment steps are collected per update; longer rollouts can improve value estimates but increase memory use.

We encourage you to **experiment** with these values and compare:

* How fast the agent learns,
* How stable the training curves are,
* Whether the final policy performs well across multiple runs.

</blockquote>

<br>

### **Wrap-up: PPO and What’s Next**

In this chapter, we introduced the **core ideas behind Proximal Policy Optimization (PPO)** — one of the most widely used deep reinforcement learning algorithms today. While time constraints prevented us from delving into the full theoretical background or complete implementation, we focused on **how PPO works**, and showcased its **impressive performance**, especially in scenarios involving **continuous state and action spaces**.

PPO represents just one example of learning-based control. If you are curious to **explore the underlying theory more deeply**, understand **advanced implementation techniques**, or gain **hands-on experience applying learning algorithms to real-world systems**, we highly encourage you to consider the following courses:

* **Advanced Robot Learning and Decision-Making** (SoSe/WiSe): Deepen your understanding through hands-on implementation and dive into state-of-the-art methods.
* **Autonomous Drone Racing Project Course** (SoSe/WiSe): Apply these algorithms in high-speed, real-time robotics settings.
* **Seminar on Semantics for Robot Perception and Decision-Making** (SoSe/WiSe): Build foundational research skills in preparation for your thesis and beyond.

These courses offer not just theoretical depth, but also the **opportunity to design and implement controllers and modern learning algorithms in practice**. We believe you’ll gain valuable insights and skills — and have a lot of fun along the way.


