
---
# Module 4: Model Predictive Control (MPC)

## Objectives

* Understand the principles of **Model Predictive Control** (MPC) for sequential decision-making.
* Learn how **optimization** and **prediction models** guide control actions.
* Explore how **learned representations** integrate into MPC.

---

## 1. Introduction to MPC

### **What is Model Predictive Control (MPC)?**

Model Predictive Control (MPC) is an advanced control strategy that uses a model of the system to predict its future behavior over a finite time horizon. It is particularly well suited to **sequential decision-making** problems, where a series of actions is taken over time to achieve a goal.

At each time step, MPC follows an iterative process:

1. **Predict:** Using the system model, MPC predicts the system's future states over a prediction horizon, considering a sequence of potential control actions.
2. **Optimize:** Based on these predictions, an optimization problem is solved to find the sequence of control actions that minimizes a cost function (e.g., minimizing error, maximizing performance) over the prediction horizon.
3. **Apply:** Only the first action from the optimal sequence is applied to the system.
4. **Repeat:** The prediction and optimization process is repeated at the next time step, using the updated system state. This receding horizon approach allows MPC to adapt to changing conditions and disturbances.
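
The four steps above can be sketched as a short closed loop. This is a minimal illustration under stated assumptions, not a production controller: the scalar model, the three-element action set, the cost weights, and all function names are hypothetical, and the optimization is done by brute-force enumeration rather than by a real solver.

```python
from itertools import product

# Hypothetical scalar system x[k+1] = A*x[k] + B*u[k]; all values are illustrative.
A, B = 1.0, 0.5
ACTIONS = (-1.0, 0.0, 1.0)   # coarse discrete action set (assumption)
HORIZON = 4                  # prediction horizon N


def predict(x, u):
    """Predict: one-step state prediction from the system model."""
    return A * x + B * u


def sequence_cost(x, actions, target):
    """Roll the model forward along an action sequence and accumulate the cost."""
    cost = 0.0
    for u in actions:
        x = predict(x, u)
        cost += (x - target) ** 2 + 0.01 * u ** 2
    return cost


def mpc_step(x, target):
    """Optimize by enumerating all sequences, then return only the first action."""
    best = min(product(ACTIONS, repeat=HORIZON),
               key=lambda seq: sequence_cost(x, seq, target))
    return best[0]


def run_closed_loop(x0, target, steps=12):
    """Apply the first action, observe the new state, and repeat."""
    x, history = x0, [x0]
    for _ in range(steps):
        u = mpc_step(x, target)
        x = predict(x, u)  # the "plant" is the model itself here (no disturbance)
        history.append(x)
    return history


states = run_closed_loop(x0=0.0, target=2.0)
print(states)
```

Enumeration only works for tiny action sets and short horizons, since the number of candidate sequences grows exponentially with the horizon; it is used here only to keep the loop structure visible.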

---
MPC is used extensively in **process control**, **robotics**, and **autonomous systems** due to its ability to handle multi-variable systems with constraints.

<img src= "https://i.postimg.cc/d3zWdCD6/anim.webp">
The graphic above visualizes the MPC principle: the dotted line shows the current prediction, and the solid line shows the realized values. It was generated with the built-in plotting capabilities of do-mpc.


---

## TD-MPC: Model Predictive Control with Latent Representations

TD-MPC (Temporal Difference Learning for Model Predictive Control) uses a **latent dynamics model** to learn compact representations of high-dimensional environments. Prediction, planning, and optimization all take place in **latent space**.

The goal is to learn a policy $\pi_\theta(s) \rightarrow a$ that maximizes long-term reward in an infinite-horizon Markov Decision Process (MDP) with continuous state and action spaces. Formally, the policy should maximize the expected discounted return:

$$
J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \right],
$$

where:

* $\pi_\theta(s)$ is the parameterized policy (e.g., a neural network),
* $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory,
* $s_0 \sim p_0$, the initial state distribution,
* $a_t = \pi_\theta(s_t)$, the action taken by the policy,
* $s_{t+1} \sim T(\cdot|s_t, a_t)$, the environment dynamics.

---

### Common RL Methods for Continuous Control

To optimize $J(\pi_\theta)$ in continuous domains, we often use **policy gradient methods** or **actor-critic methods**:

---

#### 1. **Policy Gradient Methods**

These directly optimize the policy by estimating the gradient $\nabla_\theta J(\pi_\theta)$ and applying gradient ascent:

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t \right],
$$

where $R_t = \sum_{t'=t}^\infty \gamma^{t'-t} R(s_{t'}, a_{t'})$ is the return from time $t$ onward.
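
As a small worked example, the return $R_t$ can be computed for every step of a recorded episode with a single backward pass. The sketch assumes a finite episode (the infinite sum is truncated at the last recorded reward), and the function name is illustrative.

```python
def rewards_to_go(rewards, gamma):
    """Return R_t = sum over t' >= t of gamma^(t'-t) * r_t' for a finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so that R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


episode_rewards = [1.0, 0.0, 2.0]
returns = rewards_to_go(episode_rewards, gamma=0.5)
print(returns)  # [1.5, 1.0, 2.0]
```

The first entry, `returns[0]`, is the discounted return of the whole episode, i.e. one sample of the quantity inside the expectation in $J(\pi_\theta)$.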

**Example algorithms**:

* **REINFORCE** (Vanilla Policy Gradient)
* **Trust Region Policy Optimization (TRPO)**
* **Proximal Policy Optimization (PPO)** — very popular due to stability and ease of implementation

---

#### 2. **Actor-Critic Methods**

These use two function approximators:

* The **actor** $\pi_\theta(a|s)$, which selects actions
* The **critic** $V_w(s)$ or $Q_w(s,a)$, which estimates the value function

The critic helps reduce variance in the policy gradient:

$$
\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{s,a} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A^{\pi}(s,a) \right],
$$

where $A^\pi(s,a)$ is the advantage function (e.g., $Q(s,a) - V(s)$).
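
In practice $Q$ and $V$ are not known exactly, so the advantage has to be estimated. One common choice, shown here as an assumption rather than something prescribed by this module, is the one-step temporal-difference error $r + \gamma V(s') - V(s)$ computed from the critic's value estimates.

```python
def td_advantage(reward, value_s, value_next, gamma, done=False):
    """One-step TD estimate of the advantage: r + gamma * V(s') - V(s)."""
    # No bootstrapping past the end of an episode.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# Illustrative numbers: the critic predicted V(s) = 2.0 and V(s') = 3.0.
advantage = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.9)
print(advantage)
```

A positive value means the action turned out better than the critic expected from that state, so the policy gradient increases its probability.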

**Popular actor-critic methods**:

* **Deep Deterministic Policy Gradient (DDPG)** — deterministic policy for continuous actions
* **Twin Delayed DDPG (TD3)** — improves over DDPG with less overestimation
* **Soft Actor-Critic (SAC)** — adds entropy regularization for better exploration and stability

---



The **Task-Oriented Latent Dynamics (TOLD) model**, introduced in the TD-MPC paper, forms the **backbone of the TD-MPC architecture**. It enables efficient decision-making in a learned latent space rather than in the high-dimensional raw observation space.

---

## **Task-Oriented Latent Dynamics (TOLD) Model: In-Depth**

---

### **Why Learn Latent Dynamics?**

In high-dimensional environments (like vision-based RL), it's inefficient and unnecessary to model every detail of the raw observation.
Instead, we learn a **latent representation** that:

* **Compresses** the observation,
* **Focuses** only on task-relevant aspects,
* **Supports planning and value estimation** in this compact space.

This helps the agent **generalize** better across tasks and reduces sample complexity.

---

## Components of the TOLD Model

Each component is a neural network that plays a role in **modeling and predicting future behavior** in the **latent space**.

---

### 1. **Representation Function**

Encodes an observation $s_t$ into a latent state $z_t$:

$$
z_t = h_\theta(s_t)
$$

* $s_t$: Raw observation (e.g., image, joint state).
* $z_t$: Compressed latent representation.
* $h_\theta$: Encoder (typically a CNN for images, MLP for state vectors).

**Purpose**: Filter out noise and extract features relevant to the control task.

---

### 2. **Dynamics Function**

Predicts next latent state from current latent and action:

$$
z_{t+1} = d_\theta(z_t, a_t)
$$

* $d_\theta$: Latent dynamics model (e.g., MLP or recurrent network).
* Learns to simulate how the latent state evolves under action $a_t$.

**Purpose**: Enable model-based planning entirely in the latent space.

---

### 3. **Reward Function**

$$
\hat{r}_t = R_\theta(z_t, a_t)
$$

* Predicts the reward from latent state and action.
* Trained to regress to the actual reward $r_t$.

**Purpose**: Provide reward signals for imagined trajectories during planning.

---

### 4. **Value Function**

$$
\hat{q}_t = Q_\theta(z_t, a_t)
$$

* Approximates the **expected return** (cumulative discounted reward) from a state-action pair.
* Used for bootstrapping and value learning.

**Purpose**: Guide the agent toward high-return trajectories.

---

### 5. **Policy Function**

$$
\hat{a}_t = \pi_\theta(z_t)
$$

* Outputs the action to take from latent state $z_t$.
* Learns a mapping from latent representation to optimal actions.

**Purpose**: Infer actions directly from compact features.
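
The five components can be combined to score a candidate action sequence entirely in latent space. The sketch below is a toy illustration, not the TD-MPC implementation: the scalar functions `h`, `d`, `R`, `Q`, and `pi` are hypothetical stand-ins for the learned networks, and the scoring rule (discounted predicted rewards plus a discounted terminal value) is an assumed, commonly used way to combine the reward and value functions during planning.

```python
GAMMA = 0.9  # RL discount factor (illustrative value)


# Toy stand-ins for the learned networks, with scalar latents and actions.
def h(s):
    """Representation: observation -> latent state."""
    return 0.5 * s


def d(z, a):
    """Latent dynamics: next latent from current latent and action."""
    return 0.9 * z + 0.1 * a


def R(z, a):
    """Predicted reward for a latent state-action pair."""
    return -(z ** 2) - 0.01 * a ** 2


def Q(z, a):
    """Predicted return from a latent state-action pair."""
    return -2.0 * z ** 2


def pi(z):
    """Policy: action proposed for a latent state."""
    return -z


def trajectory_value(s0, actions):
    """Score an action sequence using only latent-space predictions."""
    z = h(s0)                      # encode the observation once
    total = 0.0
    for t, a in enumerate(actions):
        total += (GAMMA ** t) * R(z, a)
        z = d(z, a)                # never decode back to observations
    # Bootstrap beyond the horizon with the value function.
    total += (GAMMA ** len(actions)) * Q(z, pi(z))
    return total


passive = trajectory_value(2.0, [0.0, 0.0, 0.0])
active = trajectory_value(2.0, [-1.0, -1.0, -1.0])
print(passive, active)
```

With these toy functions the sequence that pushes the latent state toward zero scores higher, which is the kind of comparison a planner makes between candidate sequences.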

---

## Training Objective

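The TOLD model is trained by minimizing a **composite loss** accumulated over latent rollouts of horizon $H$. The representation, dynamics, reward, and value components are learned **jointly** through the reward, value, and consistency terms described below. The policy $\pi_\theta$ has no term of its own in this loss; in the TD-MPC paper it is trained with a separate objective that maximizes the learned $Q$-values.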

---

### **Total Loss Function**

$$
J(\theta) = \sum_{i=t}^{t+H} \lambda^{i-t} \mathcal{L}(\theta; \Gamma_i)
$$

* $H$: Horizon length.
* $\Gamma_i$: Training tuple at time $i$.
* $\lambda$: Discount factor for horizon depth (not to be confused with RL discount $\gamma$).

---

### Single-Step Loss: $\mathcal{L}(\theta; \Gamma_i)$

$$
\mathcal{L}(\theta; \Gamma_i) = c_1 \ell^r_i + c_2 \ell^v_i + c_3 \ell^c_i
$$

Each term trains a different component:

---

#### a. **Reward Loss**

$$
\ell^r_i = \| R_\theta(z_i, a_i) - r_i \|^2
$$

* Supervised regression of predicted reward to ground-truth reward.
* Encourages accurate predictions for imagined rewards.

---

#### b. **Value TD Loss**

$$
\ell^v_i = \left\| Q_\theta(z_i, a_i) - \left(r_i + \gamma Q_{\theta^-}(z_{i+1}, \pi_\theta(z_{i+1}))\right) \right\|^2
$$

* A **temporal difference (TD)** loss using bootstrapped targets.
* $\gamma$: RL discount factor.
* $\theta^-$: Target network (slow-moving version of $\theta$).

Because the target parameters $\theta^-$ change only slowly, the bootstrapped target stays nearly fixed between updates, which promotes stable training.

---

#### c. **Consistency Loss**

$$
\ell^c_i = \| d_\theta(z_i, a_i) - h_{\theta^-}(s_{i+1}) \|^2
$$

* Ensures that the predicted next latent state matches the encoder output of the next true observation.
* Encourages the model to learn **consistent latent transitions**.
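
The three loss terms and the $\lambda$-weighted sum can be traced on scalar toy functions. This sketch rests on several assumptions: all functions and weights are hypothetical stand-ins, $z_{i+1}$ in the TD target is taken to be the dynamics prediction $d_\theta(z_i, a_i)$ (one reading of the notation above), and plain Python computes only the loss value, with no gradients or stop-gradient handling.

```python
GAMMA, LAMBDA = 0.99, 0.5        # RL discount and horizon weighting (illustrative)
C1, C2, C3 = 0.5, 0.1, 2.0       # loss weights c_1, c_2, c_3 (illustrative)


# Toy stand-ins for the online and target networks.
def h(s): return 0.5 * s
def h_target(s): return 0.5 * s
def d(z, a): return 0.9 * z + 0.1 * a
def R(z, a): return -z * z
def Q(z, a): return -2.0 * z * z
def Q_target(z, a): return -2.0 * z * z
def pi(z): return -z


def single_step_loss(z, a, r, s_next):
    """Weighted sum of reward, value (TD), and consistency losses for one step."""
    z_next = d(z, a)
    reward_loss = (R(z, a) - r) ** 2
    td_target = r + GAMMA * Q_target(z_next, pi(z_next))
    value_loss = (Q(z, a) - td_target) ** 2
    consistency_loss = (z_next - h_target(s_next)) ** 2
    loss = C1 * reward_loss + C2 * value_loss + C3 * consistency_loss
    return loss, z_next


def total_loss(trajectory):
    """trajectory: list of (s_i, a_i, r_i, s_next) tuples of length H + 1."""
    z = h(trajectory[0][0])      # only the first observation is encoded
    total = 0.0
    for k, (_, a, r, s_next) in enumerate(trajectory):
        step_loss, z = single_step_loss(z, a, r, s_next)
        total += (LAMBDA ** k) * step_loss   # later steps are down-weighted
    return total


one_step = total_loss([(2.0, 0.0, -1.0, 1.8)])
two_step = total_loss([(2.0, 0.0, -1.0, 1.8), (1.8, 0.0, -0.81, 1.62)])
print(one_step, two_step)
```

In the example data the rewards and next observations agree with the toy model, so the reward and consistency terms vanish and only the value term contributes.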

---

## Target Network Update

To improve training stability, target networks (used in value and consistency loss) are updated slowly via:

$$
\theta^-_{t+1} = (1 - \zeta) \theta^-_t + \zeta \theta_t
$$

* $\zeta \in (0, 1)$: Smoothing factor (e.g., 0.01).
* Keeps the targets slowly varying, which damps training oscillations and limits the feedback loop that can inflate value estimates.
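
The update rule is an exponential moving average of the online parameters. A minimal sketch, assuming the parameters are stored as flat lists of floats and using a deliberately large $\zeta$ so the effect is visible after a few steps:

```python
def soft_update(target_params, online_params, zeta):
    """Move each target parameter a fraction zeta toward its online counterpart."""
    return [(1.0 - zeta) * t + zeta * p
            for t, p in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 3.0]
for _ in range(3):
    target = soft_update(target, online, zeta=0.5)
print(target)  # [0.875, 2.75]
```

With a small value such as $\zeta = 0.01$, the target lags far behind the online parameters, which is what keeps the bootstrapped targets stable.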

---





## 2. System Modelling and Prediction Horizons

---

### **System Modelling in MPC**

In **Model Predictive Control (MPC)**, the controller **predicts the future behavior of a system** using a **mathematical model** of its dynamics. At each time step, MPC solves an optimization problem to determine the best control inputs by simulating how the system will evolve in the future. This prediction is based entirely on the model.

---

####  What is a System Model?

A **system model** is a set of equations that describe how the internal state of a system evolves in response to inputs. It allows us to forecast future states and outputs.

In its **linear time-invariant (LTI)** form (a simplified case often used in control), the model looks like this:

$$
\begin{aligned}
x(t+1) &= A x(t) + B u(t) \\
y(t) &= C x(t)
\end{aligned}
$$

---

#### Breakdown of the Terms

* **$x(t)$**: The state vector at time $t$. This represents the internal configuration of the system (e.g., position and velocity in a robot).

* **$u(t)$**: The control input at time $t$. This is the decision the controller makes (e.g., force applied by motors).

* **$y(t)$**: The system output (what we observe or want to control, such as robot position).

* **$A, B, C$**: Matrices that define how states evolve:

  * $A$: State transition matrix — how the system moves on its own.
  * $B$: Control matrix — how the input affects the state.
  * $C$: Output matrix — how the state is mapped to measurable outputs.
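
To make the roles of $A$, $B$, and $C$ concrete, the sketch below simulates a discretized double integrator (position and velocity driven by an acceleration input). The specific matrices, the time step `DT`, and the helper names are assumptions chosen for illustration.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]


DT = 0.1
A = [[1.0, DT],      # position += DT * velocity
     [0.0, 1.0]]     # velocity carries over
B = [[0.0],
     [DT]]           # input changes the velocity
C = [[1.0, 0.0]]     # only the position is measured


def step(x, u):
    """x(t+1) = A x(t) + B u(t)"""
    return vec_add(mat_vec(A, x), mat_vec(B, u))


def output(x):
    """y(t) = C x(t)"""
    return mat_vec(C, x)


def simulate(x0, inputs):
    """Apply a sequence of inputs and record the measured outputs."""
    x, outputs = x0, [output(x0)]
    for u in inputs:
        x = step(x, u)
        outputs.append(output(x))
    return x, outputs


x_final, ys = simulate([0.0, 0.0], [[1.0]] * 10)
print(x_final, ys[-1])
```

This forward simulation is exactly what MPC repeats inside its optimization: every candidate input sequence is pushed through the model to obtain the predicted states and outputs.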

> In practice, especially for complex or nonlinear systems, these matrices are rarely known ahead of time. Instead, they are:
>
> * **Derived** from physical modeling (e.g., physics-based equations), or
> * **Learned** from data (using system identification or neural networks).

---

#### Optimization Problem in MPC

The optimization problem typically looks like:

$$
\min_{u_0, \dots, u_{N-1}} \sum_{k=0}^{N-1} \ell(x_k, u_k) + \ell_f(x_N)
$$

subject to:

$$
\begin{aligned}
& x_{k+1} = A x_k + B u_k && \text{(system dynamics)} \\
& x_k \in \mathcal{X}, \quad u_k \in \mathcal{U} && \text{(state/input constraints)}
\end{aligned}
$$

Where:

* $\ell(x_k, u_k)$: Stage cost at each time step (e.g., penalize deviation from a desired state)
* $\ell_f(x_N)$: Terminal cost at the horizon end
* $\mathcal{X}, \mathcal{U}$: Constraint sets on states and controls
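
The objective and constraints can be evaluated for any candidate input sequence by rolling the model forward. In this sketch the scalar model, the quadratic costs, and the box constraints are all illustrative assumptions, and an infeasible sequence is simply given infinite cost; real solvers treat constraints explicitly instead.

```python
import math

A, B = 1.0, 0.5                  # hypothetical scalar dynamics
X_MIN, X_MAX = -5.0, 5.0         # state constraint set (box)
U_MIN, U_MAX = -1.0, 1.0         # input constraint set (box)


def stage_cost(x, u):
    return x * x + 0.1 * u * u


def terminal_cost(x):
    return 10.0 * x * x


def horizon_cost(x0, inputs):
    """Sum of stage costs plus terminal cost, or infinity if a constraint fails."""
    x, total = x0, 0.0
    for u in inputs:
        if not (U_MIN <= u <= U_MAX and X_MIN <= x <= X_MAX):
            return math.inf
        total += stage_cost(x, u)
        x = A * x + B * u        # system dynamics constraint
    if not X_MIN <= x <= X_MAX:
        return math.inf
    return total + terminal_cost(x)


feasible = horizon_cost(1.0, [-1.0, -1.0])
infeasible = horizon_cost(1.0, [-2.0, 0.0])
print(feasible, infeasible)
```

An optimizer searches over `inputs` to minimize this function; the horizon length $N$ is the length of the list.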

---

#### Receding Horizon Control

Only the **first control input** $u(t)$ is applied to the real system. Then the system moves to a new state $x(t+1)$, and the process repeats with a new optimization problem. This is called **receding horizon control** and provides **feedback**, making the system robust to disturbances or model errors.

---

### Summary

| Concept                      | Description                                                                                   |
| ---------------------------- | --------------------------------------------------------------------------------------------- |
| **System Model**             | Describes how the system state evolves in time using equations like $x(t+1) = Ax(t) + Bu(t)$. |
| **Prediction Horizon ($N$)** | Number of future steps MPC uses to simulate and optimize control decisions.                   |
| **Control Input ($u(t)$)**   | The decision made by the MPC at time $t$.                                                     |
| **Receding Horizon**         | Only the first control input is used, then re-optimization happens at the next step.          |

---

### What If the System Is Nonlinear or Unknown?

In modern settings (like autonomous driving or robotics), we often replace the linear model with a **learned nonlinear model**, such as:

* Neural networks
* Gaussian processes
* Koopman operator models

This leads to **Learning-based MPC** or **Model-based Reinforcement Learning**, where a learned model is used within the MPC framework.

---




### **Prediction Horizon**

The **prediction horizon** $N$ is a key parameter in MPC: it is the number of future time steps over which the system's behavior is predicted and the cost function is minimized at every time step. A longer horizon lets MPC anticipate future events and plan for them, which can improve long-term performance, but it also adds decision variables and therefore increases the computational cost of each optimization.

---



## 3. Optimization-Based Action Selection in MPC

---

### **What Is It?**

At every time step, **Model Predictive Control (MPC)** computes the best control input by solving an optimization problem. This problem:

* Predicts how the system will behave over a **future time window** (the prediction horizon),
* Chooses control inputs that **minimize a cost function**,
* Respects **constraints** on the system and controls.

This makes MPC **explicitly goal-driven**: it chooses actions that optimize future performance based on predictions from a system model.

---

### Optimization Problem in MPC

A common form of the MPC optimization problem is:

$$
\min_{\mathbf{U}} \ \| \mathbf{Y} - \mathbf{Y}_{\text{ref}} \|^2 + \| \Delta \mathbf{U} \|^2
$$

**Subject to:**

* Dynamics constraints (i.e., how $Y$ depends on inputs and the model),
* State and input constraints (e.g., actuator limits, safety constraints).

---

### Terms Explained

| Term                      | Description                                                                       |
| ------------------------- | --------------------------------------------------------------------------------- |
| $\mathbf{Y}$              | Vector of **predicted outputs** over the horizon: $[y(t+1), y(t+2), ..., y(t+N)]$ |
| $\mathbf{Y}_{\text{ref}}$ | **Reference trajectory** (desired future outputs to follow)                       |
| $\Delta \mathbf{U}$       | Change in control inputs over time: $\Delta u(t) = u(t) - u(t-1)$                 |
| $\mathbf{U}$              | Control input sequence: $[u(t), u(t+1), ..., u(t+N-1)]$                           |

---

### Why These Cost Terms?

* **$\|\mathbf{Y} - \mathbf{Y}_{\text{ref}}\|^2$**:

  * Penalizes deviation from the target trajectory.
  * Drives the system toward goals (e.g., desired position, speed, temperature).

* **$\|\Delta \mathbf{U}\|^2$**:

  * Penalizes rapid changes in control inputs (e.g., jerky movements).
  * Encourages **smooth, stable control behavior**.
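
The effect of the two terms is easy to see numerically. The sketch below assumes scalar outputs and inputs and holds the predicted outputs fixed for both input sequences, which is unrealistic but isolates the smoothness penalty; the function name and numbers are illustrative.

```python
def tracking_cost(outputs, reference, inputs, u_prev):
    """||Y - Y_ref||^2 + ||Delta U||^2 for scalar signals."""
    tracking = sum((y - y_ref) ** 2 for y, y_ref in zip(outputs, reference))
    # Delta u(t) = u(t) - u(t-1); the first difference uses the last applied input.
    previous = [u_prev] + list(inputs[:-1])
    smoothness = sum((u - up) ** 2 for u, up in zip(inputs, previous))
    return tracking + smoothness


Y = [0.5, 0.9, 1.0]
Y_ref = [1.0, 1.0, 1.0]
smooth_cost = tracking_cost(Y, Y_ref, [0.5, 0.5, 0.5], u_prev=0.5)
jerky_cost = tracking_cost(Y, Y_ref, [1.0, -1.0, 1.0], u_prev=0.5)
print(smooth_cost, jerky_cost)
```

Both sequences incur the same tracking error, but the oscillating inputs are penalized far more heavily, so the optimizer prefers the smooth sequence.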

---

### Constraints

Constraints ensure physical feasibility and safety:

* **Input constraints**: $u_{\text{min}} \le u(t) \le u_{\text{max}}$
* **State constraints**: $x(t) \in \mathcal{X}$ (e.g., position must stay inside boundaries)
* **Output constraints**: $y(t) \in \mathcal{Y}$

The optimization must choose control actions that satisfy all of these.

---

### Receding Horizon Control

Once the optimal control sequence $\mathbf{U}^* = [u^*_0, u^*_1, \dots, u^*_{N-1}]$ is found:

* **Only the first input** $u^*_0$ is applied to the system.
* The system moves forward to the next state.
* A **new optimization** is solved at the next time step with updated state info.

This loop gives MPC its **feedback** and **adaptive behavior**.

---

### What If the Model or Cost Is Complex?

In practice, the dynamics model might be:

* **Nonlinear** (e.g., robot arm, drone),
* **Learned from data** (e.g., neural networks),
* **Implicit** (e.g., in black-box simulators).

These introduce challenges:

* The optimization may become **non-convex**, requiring nonlinear or gradient-based solvers.
* If a neural network is used to model dynamics or output predictions, automatic differentiation (e.g., using PyTorch or TensorFlow) is used to compute gradients.
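
A gradient-based solver for a nonlinear model can be sketched without any dependencies. The example below is an assumption-laden illustration: the pendulum-like scalar model, the cost weights, and the step size are hypothetical, and central finite differences stand in for the automatic differentiation that PyTorch or TensorFlow would provide.

```python
import math

DT, HORIZON, TARGET = 0.1, 10, 1.0


def model(x, u):
    """Hypothetical nonlinear dynamics (stand-in for a learned model)."""
    return x + DT * (-math.sin(x) + u)


def cost(x0, inputs):
    """Tracking cost plus a small input penalty along the predicted trajectory."""
    x, total = x0, 0.0
    for u in inputs:
        x = model(x, u)
        total += (x - TARGET) ** 2 + 0.01 * u ** 2
    return total


def numerical_gradient(x0, inputs, eps=1e-5):
    """Central finite differences; autodiff would replace this in practice."""
    grad = []
    for i in range(len(inputs)):
        up = inputs[:i] + [inputs[i] + eps] + inputs[i + 1:]
        down = inputs[:i] + [inputs[i] - eps] + inputs[i + 1:]
        grad.append((cost(x0, up) - cost(x0, down)) / (2 * eps))
    return grad


def optimize_inputs(x0, iterations=200, step_size=0.5):
    """Plain gradient descent on the whole input sequence."""
    inputs = [0.0] * HORIZON
    for _ in range(iterations):
        grad = numerical_gradient(x0, inputs)
        inputs = [u - step_size * g for u, g in zip(inputs, grad)]
    return inputs


initial_cost = cost(0.0, [0.0] * HORIZON)
optimized = optimize_inputs(0.0)
final_cost = cost(0.0, optimized)
print(initial_cost, final_cost)
```

Because the problem is non-convex in general, gradient descent only finds a local minimum, and the result can depend on the initial guess for the input sequence.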

---

### Example (High-Level)

Let’s say a drone wants to follow a path at a constant altitude and avoid obstacles.

The MPC optimization problem could be:

$$
\min_{\mathbf{U}} \sum_{k=0}^{N-1} \left[ \| y_k - y_k^{\text{ref}} \|^2 + \lambda \| \Delta u_k \|^2 \right]
$$

**Subject to:**

* Drone dynamics (e.g., learned or physics-based model)
* Constraints on thrust, pitch, yaw
* Obstacle avoidance regions

Here, the optimization must **balance trajectory tracking**, **smooth motion**, and **constraint satisfaction**.

---

### Summary Table

| Component                 | Role                                                                  |
| ------------------------- | --------------------------------------------------------------------- |
| **Cost Function**         | Guides the system toward desired behavior (e.g., trajectory tracking) |
| **Optimization Problem**  | Computes best future control actions over prediction horizon          |
| **Constraints**           | Enforce physical limits and safety                                    |
| **Receding Horizon**      | Only apply the first control input; replan at next step               |
| **Learning-Based Models** | Allow MPC in systems where physical models are unavailable or complex |

---



## 4. State-Based MPC Formulation

### **State-Based MPC – Concept**

State-based MPC formulates the MPC problem directly on the state-space model and treats it as a finite-horizon **optimal control synthesis** problem: the model is propagated forward step by step from the current state to find the control inputs that minimize the cost function.

An infinite-horizon optimal control problem can be approximated by an **$N$-step horizon** problem, leading to a state-space MPC formulation:

$$
\begin{aligned}
\text{minimize} \quad & J = \sum_{\tau=t}^{t+N} \left[ (y(\tau) - r(\tau))^2 + u(\tau)^2 \right] \\
\text{subject to} \quad & x(\tau+1) = A x(\tau) + B u(\tau)
\end{aligned}
$$

Where:

* `J` is the cost function.
* `r(τ)` is the reference trajectory.
* `x(τ)` is the system state at time `τ`.
* `u(τ)` is the control action at time `τ`.

This formulation uses **receding horizon control** (RHC), where the optimization is recalculated at each time step, always looking ahead over a fixed horizon.
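
For an unconstrained linear model with quadratic cost, the finite-horizon problem has a closed-form solution via a backward Riccati recursion. The sketch below covers only a simplified special case, stated as assumptions: a scalar system, output equal to the state, a zero reference $r(\tau) = 0$, and a terminal weight equal to the stage weight. The names and numbers are illustrative.

```python
def finite_horizon_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for x+ = a*x + b*u with cost sum(q*x^2 + r*u^2)."""
    p = q                          # terminal weight (assumed equal to q)
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)      # optimal feedback gain, u = -k*x
        p = q + a * a * p - a * b * p * k      # cost-to-go weight one step earlier
        gains.append(k)
    gains.reverse()                # gains[0] now belongs to the first step
    return gains


A_SYS, B_SYS = 1.2, 1.0            # open-loop unstable, since |a| > 1
gains = finite_horizon_gains(A_SYS, B_SYS, q=1.0, r=1.0, horizon=10)

# Receding horizon: at every step only the first gain of the plan is used.
x = 1.0
for _ in range(10):
    u = -gains[0] * x
    x = A_SYS * x + B_SYS * u
print(gains[0], x)
```

Since the model and cost do not change over time, re-solving at each step yields the same first gain, so the receding-horizon controller reduces to a fixed state feedback in this special case. Constraints break this property, which is why general MPC needs an online optimizer.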

---

## 5. Integration with Learned Representations

A central theme of this module is the **integration of learned representations** into the MPC framework. Traditionally, MPC relies on hand-designed or identified system models and on cost functions defined over raw sensor data. Learned representations, often produced by deep learning models, can instead provide a more abstract and potentially more informative state representation. Integrating them into MPC can offer several advantages:

* **Improved System Modelling**: Learned representations can capture complex, non-linear system dynamics more effectively than traditional models, leading to more accurate predictions.
* **Enhanced Optimization**: Optimization can be performed in a lower-dimensional or more meaningful latent space defined by the learned representations, potentially simplifying the optimization problem and improving efficiency.
* **Handling High-Dimensional Data**: Learned representations can effectively process high-dimensional sensor data (e.g., images) into a compact representation suitable for MPC.

TD-MPC, described earlier in this module, is one concrete instance: the encoder output $z_t = h_\theta(s_t)$ serves as the state for planning, the latent dynamics $d_\theta$ takes the place of a hand-designed prediction model, and the learned reward and value functions $R_\theta$ and $Q_\theta$ take the place of a hand-designed cost function.

---


## References

* Joe Qin's Survey on Industrial MPC: [Qin MPC Resource](http://www.che.utexas.edu/~qin/cpcv/cpcv14.html)
* Bemporad et al., 1994 - Stability of Constrained MPC
* Mayne et al., 2000 - MPC Stability Theory

---

