### **Chapter 7.2: Model-Free Reinforcement Learning**



In this chapter, we introduce reinforcement learning, a powerful and widely used framework for solving constrained optimal control problems in both linear and nonlinear systems.

All the contents are summarized in the table below.  


<table border="1" style="border-collapse: collapse; text-align: center;">
  <!-- Title Row -->
  <tr>
    <th colspan="2" style="text-align:center">Content of Chapter 7.2 Exercise</th>
  </tr>

  <!-- Row group 1 -->
  <tr>
    <td rowspan="1">Sampling</td>
    <td>Sampling from a stochastic Policy</td>
  </tr>

  <!-- Row group 2 -->
  <tr>
    <td rowspan="4">Monte-Carlo Method</td>
    <td>Implementation</td>
  </tr>
  <tr>
    <td>Training Curve and Key Metrics</td>
  </tr>
  <tr>
    <td>Simulation on a Hilly Terrain</td>
  </tr>
  <tr>
    <td>Exploration vs. Exploitation</td>
  </tr>

  <!-- Row group 3 -->
  <tr>
    <td rowspan="3">Q-Learning</td>
    <td>Implementation</td>
  </tr>
  <tr>
    <td>Simulation on a Flat Terrain</td>
  </tr>
  <tr>
    <td>Simulation on a Hilly Terrain</td>
  </tr>

  <!-- Row group 4 -->
  <tr>
    <td rowspan="2">Comparison I</td>
    <td>Fair Comparison &amp; Statistical Evaluation in RL</td>
  </tr>
  <tr>
    <td>Monte-Carlo Method vs. Q-Learning</td>
  </tr>

  <!-- Row group 5 -->
  <tr>
    <td rowspan="1">Comparison II</td>
    <td>Comparison of All Control and Learning Paradigms</td>
  </tr>

</table>

First, we need to set up our Python environment and import relevant packages.

In [None]:
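import sys
import os

# Imported explicitly because the cells below use `np` and `plt` directly
import numpy as np
import matplotlib.pyplot as plt

sys.path.append(os.path.abspath(".."))
from utils.env import *
from utils.simulator import *
from ex5_MPC.mpc_utils import *
from ex7_RL.rl_utils import *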

<br>

### **Mountain Car Problem Setup:**

- Task: starting from given initial position $p_0$, reach a given target position $p_T$ (stabilization)

- Slope profile (height $h$ with reference to horizontal displacement $p$):  
   - case 1: zero slope (linear case), $h(p) = c$
   - case 2: constant slope (linear case), $h(p) = \frac{\pi}{18} \cdot p$
   - case 3: varying slope for small disturbances (nonlinear case), $h(p) = k \cdot \cos(18 p)$
   - case 4: varying slope for the underactuated case (nonlinear case), $h(p) = \begin{cases} k \cdot \sin(3 p), & p \in [- \frac{\pi}{2}, \frac{\pi}{6}] \\ k, & p \in (-\infty, -\frac{\pi}{2}) \cup (\frac{\pi}{6}, \infty) \end{cases}$

- System dynamics of 1d mountain car model (in state space representation): 
   - state vector $\boldsymbol{x} = [p, v]^T$
   - input vector $u$
   - system dynamics:
   \begin{align*}
     \begin{bmatrix} \dot{p} \\ \dot{v} \end{bmatrix} = \begin{bmatrix} v \\ - g \sin(\theta) \cos(\theta) \end{bmatrix} + \begin{bmatrix} 0 \\ \cos(\theta)  \end{bmatrix} u
   \end{align*}
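
The dynamics above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the `Dynamics` class used in this notebook: the relation $\theta(p) = \arctan(h'(p))$, the value of `G_ACC`, the forward-Euler step and all function names are assumptions made for the example.

```python
import numpy as np

G_ACC = 9.81  # gravitational acceleration [m/s^2] (assumed value)

def slope_angle(p, dh_dp):
    """Inclination angle theta(p) = arctan(h'(p)) for a height profile h (assumed relation)."""
    return np.arctan(dh_dp(p))

def car_dynamics(x, u, dh_dp):
    """Continuous-time dynamics x_dot = f(x, u) of the 1D mountain car."""
    p, v = x
    theta = slope_angle(p, dh_dp)
    p_dot = v
    v_dot = -G_ACC * np.sin(theta) * np.cos(theta) + np.cos(theta) * u
    return np.array([p_dot, v_dot])

def euler_step(x, u, dh_dp, dt):
    """One forward-Euler step (a simple, hypothetical discretization)."""
    return np.asarray(x, dtype=float) + dt * car_dynamics(x, u, dh_dp)

# Case 1 (flat terrain): h(p) = c, so h'(p) = 0 and theta = 0
flat_slope = lambda p: 0.0
x_next = euler_step([-0.5, 0.0], 2.0, flat_slope, dt=0.1)
print(x_next)

# Case 2 (constant slope): h'(p) = pi / 18, gravity decelerates the car
const_slope = lambda p: np.pi / 18
print(car_dynamics([0.0, 0.0], 0.0, const_slope))
```

On flat terrain the model reduces to a double integrator, which is why case 1 is referred to as the linear case.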

### **Preparation: specify some common task parameters**

In the previous exercise, we demonstrated how to define a symbolic function using CasADi, including the definition of the mountain profile as a function of $p$, deriving the conversion formulas between the slope profile $h(p)$ and the inclination angle $\theta(p)$, and establishing the system's dynamics. These formulas have already been integrated into the classes `Env` and `Dynamics`. In this chapter, we will specify the arguments and instantiate these classes directly to use their functionality.

- Parameters in the task:  

   - case: 1 (flat terrain) / 4 (hilly terrain)
   
   - initial state: $\boldsymbol{x}_0 = [-0.5, 0.0]^T$

   - target state: $\boldsymbol{x}_T = [0.6, 0.0]^T$

   - state constraints (discretization space in RL): $ \mathcal{X}_1 = [-1.7, 1.3]$,  $ \mathcal{X}_2 = [-3.5, 3.5]$

   - input constraints (discretization space in RL): $ \mathcal{U} = [-5.0, 5.0]$

   - state / input space discretization: 31 units for $x_1$, 21 units for $x_2$, 21 units for $u$
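
To make the discretization concrete, the sketch below builds the three grids and maps a continuous state to its nearest grid point. It is not the `Env_rl_d` implementation: it assumes that the "units" are grid points including both interval end points, and the helper names and the row-major flat index are hypothetical.

```python
import numpy as np

# Grids with the bounds and resolutions listed above
# (assumption: the grid points include both interval end points)
pos_grid = np.linspace(-1.7, 1.3, 31)     # x_1: position, spacing 0.1
vel_grid = np.linspace(-3.5, 3.5, 21)     # x_2: velocity, spacing 0.35
input_grid = np.linspace(-5.0, 5.0, 21)   # u: input, spacing 0.5

def nearest_index(grid, value):
    """Index of the grid point closest to `value`."""
    return int(np.argmin(np.abs(grid - value)))

def state_to_index(state):
    """Map a continuous state [p, v] to a flat index (hypothetical row-major layout)."""
    i = nearest_index(pos_grid, state[0])
    j = nearest_index(vel_grid, state[1])
    return i * len(vel_grid) + j

print(nearest_index(pos_grid, -0.5), nearest_index(vel_grid, 0.0))
print(state_to_index([-0.5, 0.0]))
print(input_grid[nearest_index(input_grid, 1.3)])
```

Under these assumptions both the initial state and the target state coincide with grid points, so they are represented without discretization error.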

In [None]:
# Define the initial / target state
initial_position = -0.5
initial_velocity = 0.0
target_position = 0.6
target_velocity = 0.0

# State bounds
state_lbs = np.array([initial_position-1.2, -3.5])
state_ubs = np.array([target_position+0.7, 3.5])

# Input bounds
input_lbs = -5.0
input_ubs = 5.0

# Define time length for simulation
t_terminal = 8.0

# Define the control frequency for controller
freq = 10
dt = 1.0/freq

# Define the number of states and actions
num_states = np.array([31, 21])
num_actions = 21

Meanwhile, at the end of this chapter, we will compare the performance of model-free RL with traditional model-based control methods, such as MPC. Therefore, we can predefine a few MPC controller configurations here in advance for later evaluation.

In [None]:
# NMPC parameters (as reference)
# To find a feasible solution, the MPC requires a slightly different discretization and larger bounds
freq_mpc = 20
dt_mpc = 1.0/freq_mpc
state_lbs_mpc = np.array([-2.0, -4.0])
state_ubs_mpc = np.array([2.0, 4.0])

# Define weight matrix in stage and terminal cost and the horizon for MPC (reference controller)
Q = np.diag([1, 1])
R = np.array([[0.1]])
Qf = Q
N = 60

---

<br>

### **Sampling from a stochastic Policy: A Discrete CDF Approach**

In reinforcement learning, an agent often follows a stochastic policy rather than a fully greedy one, i.e. a probability-mass function $\pi(u|x)$ that assigns a weight to every action in the current state. An $\varepsilon$-soft policy is one of the simplest cases: with probability $1-\varepsilon$ the agent takes the greedy action, and with probability $\varepsilon$ it draws an action uniformly from the whole action set (greedy action included). Sampling from such a policy is crucial: without it the agent cannot balance exploration and exploitation, especially in an on-policy RL method.

In this section, we will:

- Define an $\varepsilon$-soft policy based on given $Q$-values as a testbed, and compute its cumulative distribution function (CDF) from the probability distribution function (PDF),

- Introduce how to sample an action from a discrete probability distribution using its CDF and implement the method manually from scratch,

- Compare it with `np.random.choice`, which is the convenience routine provided by Python’s NumPy library that performs the same draw in a single call.

*Note: In this section we only consider the distribution of a discrete random variable. As continuous distributions require different machinery, they are not covered here.*

An $\varepsilon$-soft policy defined over $x \in \mathcal{X}$ and $u \in \mathcal{U}$ can be formulated as:

$$
p_{\pi}(u \mid x) =
\begin{cases}
\dfrac{\varepsilon}{\lvert \mathcal{U} \rvert}, 
    & \text{for an action associated with non-optimal } Q, \\[6pt]
1-\varepsilon \Bigl(1-\dfrac{1}{\lvert \mathcal{U} \rvert}\Bigr), 
    & \text{for the action associated with optimal } Q.
\end{cases}
$$

In [None]:
# Example Q-values for 4 actions
action_index = np.array([0, 1, 2, 3])
Q_table = np.array([5.0, 2.0, 1.0, 2.5]) # assume the Q-value is only a function of actions

# Define a stochastic policy (ε-soft as example)
epsilon = 0.2
num_u = len(Q_table)
action_probs = np.ones(num_u) * (epsilon / num_u)
best_action = np.argmax(Q_table)
action_probs[best_action] += 1 - epsilon

print("ε-soft action probabilities:", action_probs)

# Convert PDF to CDF
cdf = np.cumsum(action_probs)

# Plotting the PDF and CDF
fig, ax1 = plt.subplots(figsize=(8, 4))
x = np.arange(num_u)
ax1.set_xticks(x)
ax1.bar(x, action_probs, color='C0', width=0.4, label='PDF', align='center')
ax1.set_ylabel("Probability", color='C0')
ax1.set_xlabel("Action index")
ax1.set_ylim(0, 1.05)
ax1.tick_params(axis='y', labelcolor='C0')
ax1.set_title("PDF and CDF of ε-soft Policy")
ax2 = ax1.twinx()
ax2.step(x, cdf, where='post', color='C3', label='CDF', linewidth=2)
ax2.scatter(x, cdf, color='C3', marker='o')
ax2.set_ylabel("Cumulative Probability", color='C3')
ax2.tick_params(axis='y', labelcolor='C3')
ax2.set_ylim(0, 1.05)
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='lower right')
plt.tight_layout()
plt.show()


At each step in a training episode, you need to sample an action from the $\varepsilon$-soft policy, which is represented as a probability distribution. Denote $\hat{f}_x(x)$ as a generic PDF over a discrete random variable $x \in \mathcal{X}$ and $\hat{F}_x(x)=\sum_{\tilde{x}=-\infty}^{x}\hat{f}_x(\tilde{x})$ as the associated CDF. One approach to sample from the distribution $\hat{f}_x(x)$ is to:

- Generate a random number $u_s$ from a uniform distribution $f_u(u)$ supported on the half-open interval $[0,1)$, 

- Take the smallest value $x_s$ with $u_s < \hat{F}_x(x_s)$, i.e. the one satisfying $\hat{F}_x(x_s-1) \le u_s < \hat{F}_x(x_s)$, as the sample. The strict upper inequality ensures that a value with zero probability is never selected.


In [None]:
def sample_from_cdf(cdf):
    """
    Sample an index from a discrete probability distribution using its cumulative distribution function (CDF).
    
    Parameters:
    -----------
    cdf : array-like of shape (n,)
        The cumulative distribution function values of the discrete distribution, which must be non-decreasing and end at 1.0.
    
    Returns:
    --------
    index : int
        The sampled index based on the CDF.
    """

    u = np.random.rand()  # uniform sample from [0,1)
    
    for i in range(len(cdf)):
        if u < cdf[i]:
            return i
        
    return len(cdf) - 1  # fallback in case of round-off
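
The same inverse-CDF lookup can also be written without an explicit loop. The sketch below uses `np.searchsorted`; the function name `sample_from_cdf_vectorized` and the demo values are illustrative and not part of the exercise code.

```python
import numpy as np

def sample_from_cdf_vectorized(cdf, uniforms):
    """Map uniform draws in [0, 1) to indices via inverse-CDF lookup."""
    cdf = np.asarray(cdf, dtype=float)
    # side="right" returns the first index i with cdf[i] > u, i.e. u < cdf[i]
    idx = np.searchsorted(cdf, uniforms, side="right")
    # Guard against round-off when cdf[-1] ends up slightly below 1.0
    return np.minimum(idx, len(cdf) - 1)

cdf_demo = np.cumsum([0.85, 0.05, 0.05, 0.05])
u_demo = np.array([0.0, 0.5, 0.87, 0.92, 0.99])
print(sample_from_cdf_vectorized(cdf_demo, u_demo))
```

Because the CDF is sorted, the lookup is a binary search, which matters mainly when many samples are drawn at once or the action set is large.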

Building on the `sample_from_cdf` implementation above, here we will draw 10 000 actions from the same CDF to test the sampler. By listing the empirical probabilities next to the theoretical ones for each action, we can immediately see whether the hand-crafted sampler reproduces the target distribution.

In [None]:
# Draw 10000 samples with the manual CDF sampler and compare the empirical distribution with the theoretical one
samples = [sample_from_cdf(cdf) for _ in range(10000)]

# Statistics of the samples
counts = np.bincount(samples, minlength=num_u)
empirical_probs = counts / len(samples)

# Print each action's probabilities
print(f"{'Action':>6} | {'Theoretical':>12} | {'Empirical':>10}")
print("-" * 36)
for i in range(num_u):
    print(f"{i:>6} | {action_probs[i]:>12.4f} | {empirical_probs[i]:>10.4f}")


> <br>
> 📌 <b>Note:</b> Try varying the sampling size! You can replace `10000` with `10`, `100`, `500`, etc., and observe how the empirical probabilities change.
>
> - A **smaller** number of samples may result in larger **deviations** from the theoretical distribution due to randomness.
> 
> - A **larger** sample size will generally lead to a **closer match** with the theoretical probabilities.
>
> <br>
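
The effect of the sample size can be quantified: for $n$ independent draws, the empirical frequency of an action with probability $p$ has standard deviation $\sqrt{p(1-p)/n}$. The sketch below compares this prediction with the observed deviation; the seeded generator and the variable names are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)              # seeded generator for reproducibility
probs = np.array([0.85, 0.05, 0.05, 0.05])  # ε-soft probabilities for ε = 0.2 and 4 actions

def standard_error(p, n):
    """Standard deviation of an empirical frequency estimated from n i.i.d. draws."""
    return np.sqrt(p * (1.0 - p) / n)

max_abs_errors = {}
for n in [10, 100, 1000, 10000, 100000]:
    samples = rng.choice(len(probs), size=n, p=probs)
    empirical = np.bincount(samples, minlength=len(probs)) / n
    max_abs_errors[n] = np.max(np.abs(empirical - probs))
    print(f"n={n:>6}: max |error| = {max_abs_errors[n]:.4f}, "
          f"predicted std (greedy action) = {standard_error(probs[0], n):.4f}")
```

The predicted deviation shrinks with $1/\sqrt{n}$, so a hundred times more samples buy roughly one extra decimal digit of accuracy.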


<br>

Alternatively, NumPy provides a built-in one-liner for drawing samples from a discrete probability distribution, essentially replacing the hand-rolled CDF routine:

<p align="center">
  <code>numpy.random.choice(a,&nbsp;size=None,&nbsp;replace=True,&nbsp;p=None)</code>
</p>

Arguments:
 - `a`: if an integer N, samples are drawn from the set {0,…,N-1}; if an array, it is treated as the list of labels to sample.

 - `size`: number (or shape) of samples to return; defaults to a single draw.

 - `replace`: sample with replacement (True, default) or without (False).

 - `p`: 1-D array of probabilities that must sum to 1; if omitted, sampling is uniform.

Return value:

 - An ndarray whose shape is given by `size`, containing the drawn labels (dtype matches `a`); a single value is returned when `size` is `None`.

In [None]:
# In Python, we can use NumPy's random.choice to sample from the action probabilities
samples_np = np.random.choice(num_u, size=10000, p=action_probs)

# Statistics of the samples
counts_np = np.bincount(samples_np, minlength=num_u)
empirical_probs_np = counts_np / len(samples_np)

# Print each action's probabilities
print(f"{'Action':>6} | {'Theoretical':>12} | {'Empirical':>10}")
print("-" * 36)
for i in range(num_u):
    print(f"{i:>6} | {action_probs[i]:>12.4f} | {empirical_probs_np[i]:>10.4f}")
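
As a side note, newer NumPy code often uses a `Generator` object created with `np.random.default_rng` instead of the global `np.random.seed` / `np.random.choice` pair. The sketch below draws from the same probabilities as above; this notebook itself keeps the global-seed style, so treat it as an optional alternative.

```python
import numpy as np

probs = np.array([0.85, 0.05, 0.05, 0.05])  # same ε-soft probabilities as above

rng = np.random.default_rng(42)             # independent, seedable generator
samples_rng = rng.choice(len(probs), size=10000, p=probs)

# Re-creating the generator with the same seed reproduces the same draws
samples_again = np.random.default_rng(42).choice(len(probs), size=10000, p=probs)
print(np.array_equal(samples_rng, samples_again))
print(np.bincount(samples_rng, minlength=len(probs)) / len(samples_rng))
```

A local generator keeps the random stream of one experiment independent of any other code that touches the global NumPy state.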


---

<br>

### **Monte-Carlo Method**

In the last section we learned how to draw actions from an $\varepsilon$-soft stochastic policy. Building on that, we can now explore model-free RL methods that use a stochastic policy to balance exploration and exploitation. Most textbooks split reinforcement-learning algorithms that do not need an explicit model into a handful of archetypes, each of which updates the value (or policy) estimate in a different way.

<div align="center">

| Family                              | Core idea                                                                       | Example methods                                       |
| ----------------------------------- | ------------------------------------------------------------------------------- | ----------------------------------------------------- |
| **Monte-Carlo**                | Wait until an episode ends, then use the **total return** to update $Q$ or $V$. | every/first-visit MC Method |
| **Temporal-Difference (TD)**        | Update after **every step** by boot-strapping from the next state’s estimate.   | SARSA, Q-learning, TD(λ)                              |
| **n-step / λ-return hybrids**       | Blend MC and TD: update after $n$ steps or with eligibility traces.             | n-step TD, TD(λ)                                      |
| **Policy-gradient / Actor–Critic**  | Optimise the policy parameters directly via ∇-estimates.                        | REINFORCE, PPO, A2C                                   |

</div>

<br>

Among these, Monte-Carlo methods stand out by requiring no bootstrap targets and being naturally unbiased, but they must wait for the episode to finish, giving them higher variance and slower online feedback. To keep exploring we adopt an $\varepsilon$-soft policy, so that every action retains a non-zero probability of being sampled, and to tame the variance we use a small, constant step size $\alpha$ to form a recursive MC update.

$$
\begin{array}{l}
\textbf{$\varepsilon$-Soft, On-Policy Monte-Carlo Algorithm}
\end{array}
$$

$$
\begin{array}{l}
\text{1. Initialization: }\\
\text{01: } \text{\quad - \quad Define a discount factor $\gamma$}\\
\text{02: } \text{\quad - \quad Choose a constant learning rate $\alpha$ and rollout length $N$}\\
\text{03: } \text{\quad - \quad Initialise arbitrary $Q(x,u)\in\mathbb{R}$ for all $x\in\mathcal{X},\,u\in\mathcal{U}(x)$}\\
\text{04: } \text{\quad - \quad Initialise an arbitrary $\varepsilon$-soft policy $\pi(u\mid x)$ for every $x$}\\
\text{2. Episode-Based Action Value Function and Policy Updates: }\\
\text{05: } \quad \textbf{while } \textit{not converged} \textbf{ do} \\
\text{06: } \quad | \quad \text{a) Rollout}\\
\text{07: } \quad | \quad \text{Generate an episode using the $\varepsilon$-soft policy $p_{\pi}(u|x)$ and record $\mathcal{D}=\{(x_k,u_k,r_k)\}_{k=0}^{N}$}\\
\text{08: } \quad | \quad \text{b) Policy Evaluation}\\
\text{09: } \quad | \quad \textbf{for } \textit{each pair $(x,u)$ in $\mathcal{D}$} \textbf{ do}\\
\text{10: } \quad | \quad | \quad R(x,u) \leftarrow \text{the cumulative return following the first occurrence of } (x,u)\\
\text{11: } \quad | \quad | \quad Q(x,u) \leftarrow Q(x,u) + \alpha(R(x,u) - Q(x,u))\\
\text{12: } \quad | \quad \textbf{end}\\
\text{13: } \quad | \quad \text{c) Policy Improvement}\\
\text{14: } \quad | \quad \textbf{for } \textit{each $x$ in $\mathcal{D}$} \textbf{ do}\\
\text{15: } \quad | \quad | \quad \pi^*(x) \leftarrow \text{argmax}_{u \in \mathcal{U}(x)} Q(x,u)\\
\text{16: } \quad | \quad | \quad \text{Update $\varepsilon$-soft policy $p_{\pi}(u|x)$}\\
\text{17: } \quad | \quad \textbf{end}\\
\text{18: } \quad | \quad \text{d) Optionally reduce $\varepsilon$}\\
\text{19: } \quad \textbf{end}
\end{array}
$$



#### **Reward Function**

In model-free RL, due to the absence of explicit dynamics guidance, reward shaping plays a critical role in determining whether the algorithm converges. A well-designed reward should provide informative, dense, and consistent feedback to effectively guide the agent’s behavior. In this example, we design the reward function as follows:

$$
r(x, u, x') = \begin{cases} 
10, & \text{if }x' \text{ is the target state} \\
-10, & \text{if }x' \notin \mathcal{X} \\
-1, & \text{otherwise}
\end{cases},
$$

where $x$ and $u$ denote the current state and input, and $x'$ denotes the next state.
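
A minimal sketch of such a reward function is given below. It is not the reward code inside `Env_rl_d`: the function name, the argument list, the tolerance-based target test and the order in which the conditions are checked are assumptions for illustration.

```python
import numpy as np

def reward(next_state, target_state, state_lbs, state_ubs, tol=1e-9):
    """Piecewise reward: -10 outside the state bounds, +10 at the target, -1 otherwise."""
    next_state = np.asarray(next_state, dtype=float)
    # Assumption: leaving the admissible set is checked before the target test
    if np.any(next_state < state_lbs) or np.any(next_state > state_ubs):
        return -10.0
    if np.allclose(next_state, target_state, rtol=0.0, atol=tol):
        return 10.0
    return -1.0

lbs = np.array([-1.7, -3.5])
ubs = np.array([1.3, 3.5])
target = np.array([0.6, 0.0])

print(reward([0.6, 0.0], target, lbs, ubs))   # target reached
print(reward([0.1, 0.7], target, lbs, ubs))   # ordinary step
print(reward([1.4, 0.0], target, lbs, ubs))   # position bound violated
```

The constant $-1$ per step acts as a time penalty, so among all successful trajectories the shortest one collects the highest return.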

> <br>
> 📌 <b>Note:</b> Cost design in model-free RL vs. Cost design in model-based control
>
> In traditional model-based control methods such as LQR or MPC, the cost function is typically a smooth, differentiable quadratic form:
> $$
> J = \sum_{k=0}^{N-1} \left( x_k^\top Q x_k + u_k^\top R u_k \right) + x_N^\top Q_f x_N
> $$
> which penalizes deviations from the desired state and excessive control effort. These formulations require the cost to be continuous and differentiable to ensure the optimization problem is well-posed and efficiently solvable. <b>In contrast, RL methods can flexibly handle sparse or discrete rewards—such as binary success/failure signals—without requiring differentiability, making them more versatile for problems where such feedback is more natural or practical to define.</b>
>
> <br>
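
For comparison, the quadratic cost from the note can be evaluated along a given trajectory as sketched below. The function name and the short example trajectory are made up for illustration; the weights match the `Q` and `R` defined for the reference MPC earlier in this notebook.

```python
import numpy as np

def quadratic_cost(xs, us, Q, R, Qf):
    """J = sum_k (x_k^T Q x_k + u_k^T R u_k) + x_N^T Qf x_N."""
    xs = np.asarray(xs, dtype=float)   # shape (N + 1, n_x)
    us = np.asarray(us, dtype=float)   # shape (N, n_u)
    stage = sum(x @ Q @ x + u @ R @ u for x, u in zip(xs[:-1], us))
    terminal = xs[-1] @ Qf @ xs[-1]
    return float(stage + terminal)

Q = np.diag([1.0, 1.0])
R = np.array([[0.1]])
xs_demo = [[1.0, 0.0], [0.5, -1.0], [0.0, 0.0]]   # hypothetical states (as deviations from the target)
us_demo = [[2.0], [-1.0]]                         # hypothetical inputs
print(quadratic_cost(xs_demo, us_demo, Q, R, Qf=Q))
```

Every term of this cost varies smoothly with the state and input, whereas the reward above only changes when the next state enters the target or leaves the admissible set.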

<br>

#### **Implementation**

Before we can roll out episodes, we must define the $\varepsilon$-soft sampling policy, which can be formulated as:
$$
p_{\pi}(u \mid x) =
\begin{cases}
\dfrac{\varepsilon}{\lvert \mathcal{U} \rvert}, 
    & \text{for an action associated with non-optimal } Q, \\[6pt]
1-\varepsilon \Bigl(1-\dfrac{1}{\lvert \mathcal{U} \rvert}\Bigr), 
    & \text{for the action associated with optimal } Q.
\end{cases}
$$

Given a discrete state index, the routine first looks up the action with the highest Q-value in that row of the table. It then constructs a probability vector in which every action receives a uniform exploration mass of $\varepsilon / \lvert \mathcal{U} \rvert$, while the greedy action gets an extra slice $1-\varepsilon$. The resulting vector therefore sums to $1$, places most of the probability on the current greedy action, yet guarantees that every action is sampled with non-zero probability. This vector is returned and can be passed directly to `np.random.choice` (or any other sampler) to draw a concrete action.



In [None]:
def _get_action_probabilities(self, state_index: int) -> np.ndarray:
    """Calculate the action probabilities using epsilon-soft policy."""

    probabilities = np.ones(self.dim_inputs) * (self.epsilon / self.dim_inputs)
    best_action = np.argmax(self.Q[state_index, :])
    probabilities[best_action] += (1.0 - self.epsilon)

    return probabilities

Based on the ε-soft sampling rule, the whole **Monte Carlo method training loop** in the `setup()` routine is built around three consecutive phases:

1) **Episode rollout:** 
   Starting from the specified initial state, the agent uses the current ε-soft policy to generate a complete episode, storing each $(x,u,r)$ triple until it reaches the goal, fails, or hits the step cap.

2) **Recursive Monte-Carlo update:** 
   After the episode ends the code scans the trajectory backwards, accumulates the discounted return $G$, and updates the table with
   $Q(x,u)\leftarrow Q(x,u)+\alpha\bigl(G-Q(x,u)\bigr)$,
   turning the raw return into an exponential moving-average estimate. Because the update is applied at every occurrence of a pair $(x,u)$ in the episode, the code implements the every-visit variant.

3) **Policy refresh & exploration decay:** 
   With the new $Q$ the greedy action for every visited state is recomputed and plugged back into an ε-soft distribution; ε is then multiplied by a decay factor (but never below `epsilon_min`) so that each subsequent episode strikes a gradually tighter explore-exploit balance.
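
The backward accumulation of the return in phase 2 can be checked on a tiny example. The sketch below is a standalone illustration with a hypothetical three-step reward sequence; the function names are not part of `MCRLController`.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_k = r_k + gamma * G_{k+1} for every step, computed backwards."""
    returns = np.zeros(len(rewards))
    G = 0.0
    for k in reversed(range(len(rewards))):
        G = rewards[k] + gamma * G
        returns[k] = G
    return returns

def mc_update(q_old, G, alpha):
    """Constant-step-size update Q <- Q + alpha * (G - Q)."""
    return q_old + alpha * (G - q_old)

# Two ordinary steps (-1 each) followed by reaching the target (+10)
returns = discounted_returns([-1.0, -1.0, 10.0], gamma=0.9)
print(returns)
print(mc_update(0.0, returns[0], alpha=0.1))
```

Scanning the episode backwards needs only one pass, since each return reuses the one computed for the following step.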

<br>

> 📌 <b>Note:</b> Here we fix the initial state to speed up training; given sufficient interaction, one can also initialize the state randomly to improve generalization.


In [None]:
def setup(self) -> None:

    for iteration in range(self.max_iterations):

        episode = []  # stores (state index, action index, reward) for the current episode
        total_reward = 0  # total reward for current episode
        total_steps = 0  # total steps for current episode

        if iteration % 100 == 0:

            if iteration != 0:
                # Record the SR_100epsd, F_100epsd, TO_100epsd
                self.SR_100epsd.append(SR_100epsd)
                self.F_100epsd.append(F_100epsd)
                self.TO_100epsd.append(TO_100epsd)
            
            SR_100epsd = 0
            F_100epsd = 0
            TO_100epsd = 0

        # Start from init state
        # Note: Here we restrict initial state distribution to boost the training but one can also 
        #       randomly initialize state given sufficient interaction to improve the generalization.
        current_state = self.mdp.init_state
        current_state_index = self.mdp.nearest_state_index_lookup(current_state)
        
        # Generate an episode
        for step in range(self.max_steps_per_episode):
            # Choose action based on epsilon-soft policy
            action_probabilities = self._get_action_probabilities(current_state_index)
            action_index = np.random.choice(np.arange(self.dim_inputs), p=action_probabilities)
            current_input = self.mdp.input_space[action_index]

            # Take action and observe the next state and reward
            next_state, reward = self.mdp.one_step_forward(current_state, current_input)
            next_state_index = self.mdp.nearest_state_index_lookup(next_state)
            total_reward += reward
            total_steps += 1 

            # Store the state, action and reward for this step
            episode.append((current_state_index, action_index, reward))

            # Check if the episode is finished
            terminate_condition_1 = next_state[0]>self.mdp.pos_partitions[-1]
            terminate_condition_2 = next_state[0]<self.mdp.pos_partitions[0]
            terminate_condition_3 = np.all(self.mdp.state_space[:, next_state_index]==self.target_state)
            
            if terminate_condition_1 or terminate_condition_2 or terminate_condition_3:
                if terminate_condition_3:
                    SR_100epsd += 1
                    if self.verbose:
                        print(f"Iteration {iteration + 1}/{self.max_iterations}: finished successfully! epsilon: {self.epsilon:.4f}, residual reward: {total_reward:.2f}")
                else:
                    F_100epsd += 1
                    if self.verbose:
                        print(f"Iteration {iteration + 1}/{self.max_iterations}: episode failed! epsilon: {self.epsilon:.4f}, residual reward: {total_reward:.2f}")
                break

            if step == self.max_steps_per_episode-1:
                TO_100epsd +=1
                if self.verbose:
                    print(f"Iteration {iteration + 1}/{self.max_iterations}: time out! epsilon: {self.epsilon:.4f}, residual reward: {total_reward:.2f}")
                
            # Move to the next state
            current_state_index = next_state_index
            current_state = self.mdp.state_space[:, current_state_index]

        # Update Q table using Monte Carlo method
        G = 0  # Return
        for state_index, action_index, reward in reversed(episode):
            # Cumulative return
            G = reward + self.gamma * G
            
            # Track visit counts; the recursive update itself uses a constant step size
            self.state_action_counts[state_index, action_index] += 1
            alpha = self.learning_rate

            # Monte-Carlo error: sampled return minus current estimate (stored in `td_error`)
            td_error = G - self.Q[state_index, action_index]
            self.Q[state_index, action_index] += alpha * td_error

        # Decrease epsilon
        self.epsilon *= self.k_epsilon
        self.epsilon = max(self.epsilon, self.epsilon_min)
        self.epsilon_list.append(self.epsilon)

        # Record the total reward and the number of steps of this episode
        self.residual_rewards.append(total_reward)
        self.step_list.append(total_steps)

    # Return the deterministic policy and value function
    self.policy = np.argmax(self.Q, axis=1)
    self.value_function = np.max(self.Q, axis=1)

    # Repeat success/failure stats for plotting
    self.SR_100epsd = np.repeat(self.SR_100epsd, 100)/100
    self.F_100epsd = np.repeat(self.F_100epsd, 100)/100
    self.TO_100epsd = np.repeat(self.TO_100epsd, 100)/100

    if self.verbose:
        print("Training finished!")

<br>

#### **Training Curve and Key Metrics**

When analysing a reinforcement-learning run we usually track a handful of standard metrics:

1) **Episode Return / Reward:** the total reward collected in one episode (the `setup()` routine above logs the undiscounted sum); the primary signal for whether the agent is learning the task.

2) **Success Rate (SR):** fraction of episodes that reach the goal under the current policy; complements reward on sparse-reward tasks.

3) **Episode Length:** number of steps per episode; often drops as the policy becomes more efficient.

4) **TD Error (or Loss):** the squared temporal-difference error (or MC error); useful for diagnosing convergence and instability.
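
As an illustration of the second metric, the success rate per block of 100 episodes can be computed from a list of per-episode outcome flags. The sketch below is a simplified stand-in for the bookkeeping done inside `setup()`; the function name and the example flags are hypothetical.

```python
import numpy as np

def block_success_rate(successes, block=100):
    """Fraction of successful episodes in each consecutive block of `block` episodes."""
    successes = np.asarray(successes, dtype=float)
    n_blocks = len(successes) // block          # an incomplete final block is dropped
    return successes[: n_blocks * block].reshape(n_blocks, block).mean(axis=1)

# Hypothetical outcome flags: 1 = goal reached, 0 = failure or time-out
flags = np.concatenate([np.zeros(100), np.tile([1, 0, 0, 0], 25), np.ones(100)])
print(block_success_rate(flags))
```

Averaging over blocks trades time resolution for a less noisy estimate, which is why the curve only changes every 100 episodes.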

Using the hyper-parameter settings listed below we train the agent and then plot the learning curves. 

*Note that: In this Jupyter Notebook we will mainly focus on* **Episode Reward** *and* **Success Rate (SR)**, *as these two curves give the clearest picture of both sample-efficiency and eventual task performance.*


In [None]:
case = 1

gamma = 0.90
epsilon = 1.0
k_epsilon = 0.995
learning_rate = 0.1
max_iterations = 2000

seed = 42

In [None]:
np.random.seed(seed)

# Instantiate class 'Env', 'Dynamics', and 'Env_rl_d'
env = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs, state_ubs=state_ubs, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics = Dynamics(env)
mdp = Env_rl_d(env=env, dynamics=dynamics, num_states=num_states, num_actions=num_actions, dt=1/freq, build_stochastic_mdp=False)

# Instantiate the MCRL controller class
controller_mcrl = MCRLController(mdp, freq, epsilon=epsilon, k_epsilon=k_epsilon, 
                                    learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations)
controller_mcrl.setup()
controller_mcrl.plot_training_curve("Training Curve before Smoothing")

The raw-reward trace on the left is **very jagged**: individual episode returns oscillate wildly, making the underlying learning trend hard to see. To reveal that trend we usually **smooth** the curve, e.g. with a simple moving average:

$$
\tilde R_t \;=\; \frac{1}{w}\sum_{k=0}^{w-1} R_{t-k},
$$

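where $w$ is a fixed window (50 episodes in this case). Averaging neighbouring points damps the high-frequency noise so the long-term rise in performance becomes much clearer. The `postprocessing` routine below uses a centered window instead of the trailing one in the formula and renormalizes near the two ends of the curve, where fewer than $w$ points are available.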
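
The two common variants of the moving average can be compared on a five-point toy signal. The sketch below is illustrative only; the function names are not part of the controller class.

```python
import numpy as np

def trailing_average(r, w):
    """Moving average over the last w points; defined from index w - 1 onwards."""
    return np.convolve(r, np.ones(w) / w, mode="valid")

def centered_average(r, w):
    """Centered moving average that divides by the number of points actually in the window."""
    kernel = np.ones(w)
    counts = np.convolve(np.ones(len(r)), kernel, mode="same")
    return np.convolve(r, kernel, mode="same") / counts

r_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(trailing_average(r_demo, 3))
print(centered_average(r_demo, 3))
```

The trailing form matches the formula above but yields $w-1$ fewer points, while the centered, edge-normalized form keeps the original length at the cost of shorter windows at both ends.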


In [None]:
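def postprocessing(self, window=20):
    """Postprocess the training results, including smoothing the reward curve."""

    residual_rewards = np.array(self.residual_rewards)
    if window <= 1:
        # No smoothing requested: keep the raw curve and return the same tuple as below
        self.residual_rewards_smoothed = residual_rewards.copy()
        return self.residual_rewards_smoothed, self.SR_100epsd

    # Centered moving average; dividing by the convolved ones-vector normalizes
    # by the number of samples that actually fall inside the window at the edges
    kernel = np.ones(window)
    z = np.ones(len(residual_rewards))
    self.residual_rewards_smoothed = np.convolve(residual_rewards, kernel, mode='same') / np.convolve(z, kernel, mode='same')

    return self.residual_rewards_smoothed, self.SR_100epsd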

In [None]:
controller_mcrl.postprocessing(window=50)
controller_mcrl.plot_training_curve("Training Curve after Smoothing")

The training curves make the learning dynamics of the car controller easy to follow. First, $\varepsilon$ decays geometrically with the episode index, since it is multiplied by the constant factor $k_\varepsilon$ after every episode; this gradual reduction shifts the $\varepsilon$-soft policy from heavy exploration toward exploitation. In parallel, the success rate measured every 100 episodes rises steadily with more training, ultimately reaching about $60 \%$. Together these plots show the Monte Carlo agent learning to trade exploration for performance and converging to a reliably successful driving policy, which is shown in detail in the following simulation section.
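
The shape of the exploration schedule follows directly from the update $\varepsilon \leftarrow \max(k_\varepsilon \, \varepsilon, \varepsilon_{\min})$. The sketch below evaluates it for $k_\varepsilon = 0.995$; the floor `eps_min=0.01` is an assumed value, since the default used by `MCRLController` is not shown in this notebook.

```python
import numpy as np

def epsilon_schedule(eps0, k_eps, n_episodes, eps_min=0.0):
    """Exploration rate after each episode: eps_n = max(eps0 * k_eps**n, eps_min)."""
    n = np.arange(1, n_episodes + 1)
    return np.maximum(eps0 * k_eps ** n, eps_min)

eps0, k_eps = 1.0, 0.995
half_life = np.log(0.5) / np.log(k_eps)   # episodes needed to halve epsilon
schedule = epsilon_schedule(eps0, k_eps, 2000, eps_min=0.01)  # eps_min is an assumed value
first_at_floor = int(np.argmax(schedule <= 0.01)) + 1

print(f"half-life: {half_life:.1f} episodes")
print(f"epsilon after 100 / 500 episodes: {schedule[99]:.3f} / {schedule[499]:.3f}")
print("first episode at the floor:", first_at_floor)
```

With these numbers the exploration rate halves roughly every 138 episodes, so most of the 2000 training episodes are spent with a small $\varepsilon$.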


In [None]:
# Instantiate the simulator, and then run the simulation
simulator_mcrl = Simulator(dynamics, controller_mcrl, env, 1/freq, t_terminal)
simulator_mcrl.run_simulation()

# Also setup a NMPC controller for reference
env_mpc = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs_mpc, state_ubs=state_ubs_mpc, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics_mpc = Dynamics(env_mpc)
controller_mpc = MPCController(env_mpc, dynamics_mpc, Q, R, Qf, freq_mpc, N, name="NMPC")
simulator_mpc = Simulator(dynamics_mpc, controller_mpc, env_mpc, 1/freq_mpc, t_terminal)
simulator_mpc.run_simulation()

# Instantiate the visualizer, and display the plottings and animation
visualizer_mcrl = Visualizer(simulator_mcrl)
visualizer_mcrl.display_contrast_plots(simulator_mpc, title="Simulation of Monte Carlo Method on Flat Terrain", if_gray=True)
visualizer_mcrl.display_contrast_animation_same(simulator_mpc, if_gray=True)


Overall, the trained policy is able to steer the car to the target position without violating any constraints, although it has not yet converged to the optimal policy, which should be a bang-bang policy under the current cost design.

<blockquote style="padding-top: 20px; padding-bottom: 10px;">

##### **🔍 Hands-on Exploration: hyperparameters in training**

Hyperparameters are user-chosen knobs that are set before training starts and are not updated by the learning algorithm itself. Together they dictate the speed, stability and final quality of learning. **Which hyperparameters appear in our implementation of the Monte Carlo method? How does each one influence the training curves?** Change the values and rerun the code block above to verify your understanding.

Some important hyperparameters in our Monte Carlo implementation include:
- **Learning rate (α)**: controls how quickly the value function adapts; large α speeds up learning but can introduce instability.
- **Exploration parameters (ε and k_ε)**: ε controls how often non-greedy actions are taken; k_ε is the decay factor, gradually reducing exploration over time.
- **Episode length (N)**: determines how long each trajectory rollout is; longer episodes can provide more return information but increase variance.
- **Max episode number**: affects the total number of updates and convergence time; more episodes generally allow for better value estimation.

</blockquote>


<br>

#### **Simulation on a Hilly Terrain**

The figures above illustrate the **flat-terrain scenario**. In addition, we can carry out the same training and simulation procedure on the underactuated hilly terrain. **Here the input bounds make the system underactuated**, which means the car cannot generate enough kinetic energy in a single burst to crest the hill; instead it must swing back and forth several times to build momentum, making the task markedly more challenging. Run the training script below, inspect the resulting learning curves, and then launch a simulation to see how the converged policy performs in this tougher setting.


In [None]:
case = 4

gamma = 0.90
epsilon = 1.0
k_epsilon = 0.995
learning_rate = 0.05
max_iterations = 1500

seed = 42

In [None]:
np.random.seed(seed)

# Instantiate class 'Env', 'Dynamics', and 'Env_rl_d'
env = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs, state_ubs=state_ubs, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics = Dynamics(env)
mdp = Env_rl_d(env=env, dynamics=dynamics, num_states=num_states, num_actions=num_actions, dt=1/freq, build_stochastic_mdp=False)

# Instantiate the MCRL controller class
controller_mcrl = MCRLController(mdp, freq, epsilon=epsilon, k_epsilon=k_epsilon, 
                                    learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations)
controller_mcrl.setup()
controller_mcrl.postprocessing(window=100)
controller_mcrl.plot_training_curve()

# Instantiate the simulator, and then run the simulation
simulator_mcrl = Simulator(dynamics, controller_mcrl, env, 1/freq, t_terminal)
simulator_mcrl.run_simulation()

# Also set up an NMPC controller for reference
env_mpc = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs_mpc, state_ubs=state_ubs_mpc, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics_mpc = Dynamics(env_mpc)
controller_mpc = MPCController(env_mpc, dynamics_mpc, Q, R, Qf, freq_mpc, N, name="NMPC")
simulator_mpc = Simulator(dynamics_mpc, controller_mpc, env_mpc, 1/freq_mpc, t_terminal)
simulator_mpc.run_simulation()

# Instantiate the visualizer, and display the plots and animation
visualizer_mcrl = Visualizer(simulator_mcrl)
visualizer_mcrl.display_contrast_plots(simulator_mpc, title="Simulation of Monte Carlo Method on Hilly Terrain", if_gray=True)
visualizer_mcrl.display_contrast_animation_same(simulator_mpc, if_gray=True)

We observe that, after training, the car has learned to swing back and forth to accumulate kinetic energy and then crest the hill in one final push.
Compared with the flat-terrain experiment, the hilly case converges much faster: the success-rate curve climbs earlier and reaches a stable plateau sooner.
Moreover, although Monte Carlo remains a high-variance estimator, the reward trace on hilly terrain shows far fewer late-stage spikes; even without smoothing out the downward spikes, the curve clearly trends upward toward convergence. In contrast, the reward curve for the flat-terrain case still displays many sharp spikes in the later stages, signalling lingering high variance. **All in all, training, and therefore inspecting the policy's progress, is noticeably easier in the hilly-terrain scenario.**

*Note: In this setup, once the car reaches the target position, its velocity is set to zero and it remains at the target state indefinitely. This simplification ensures convergence in reinforcement learning and avoids oscillatory behavior after success. While this does not fully reflect real-world dynamics—where residual motion or control delay may cause overshoot—it is a common modeling practice in RL. Alternatively, one may continue sampling the dynamics after reaching the goal until the episode terminates to reflect more realistic interaction.*

<br>

#### **Exploration vs. Exploitation**

In model-free reinforcement learning, the agent has no prior knowledge of the system dynamics or the transition probabilities, so it needs a policy that explores them. This gives rise to the fundamental trade-off between **exploration** (trying new actions to discover potentially better strategies) and **exploitation** (leveraging the current best-known policy to maximize reward). To manage this balance, we adopt an **$\varepsilon$-soft policy**, where the agent chooses the best-known action with high probability $1 - \varepsilon$, but with small probability $\varepsilon$ it randomly explores other actions.
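
A minimal sketch of one common way to build such a policy is given below. It uses the convention in which every action receives probability $\varepsilon / |\mathcal{U}|$ and the greedy action receives the remaining $1 - \varepsilon$ on top, so the greedy action ends up slightly above $1 - \varepsilon$. The function name and the Q-values are hypothetical, and this is not necessarily how `_get_action_probabilities` in `rl_utils` is implemented.

```python
import numpy as np

def epsilon_soft_probabilities(q_row, epsilon):
    """Action probabilities of an epsilon-soft policy for a single state.

    Every action receives epsilon / num_actions; the greedy action
    additionally receives the remaining 1 - epsilon.
    """
    num_actions = len(q_row)
    probs = np.full(num_actions, epsilon / num_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

rng = np.random.default_rng(0)
q_row = np.array([0.2, 1.5, -0.3])            # hypothetical Q-values for one state
probs = epsilon_soft_probabilities(q_row, epsilon=0.1)
action = rng.choice(len(q_row), p=probs)       # sample an action index
print(probs, action)
```

Setting `epsilon=1.0` yields a uniform distribution (pure exploration), while `epsilon=0.0` puts all probability mass on the greedy action (pure exploitation).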

To investigate the impact of this trade-off, we compare the learning performance under three different settings:

* **Run 1**: $\varepsilon = 1.0$ with decay ($k_\varepsilon = 0.995$), a gradually shifting balance from pure exploration to exploitation.

* **Run 2**: $\varepsilon = 0.0$ (pure exploitation), no exploration throughout training.

* **Run 3**: $\varepsilon = 1.0$ (pure exploration), the agent never exploits.

Run 1 represents a well-balanced exploration strategy, while Runs 2 and 3 illustrate extreme cases. These variations help us understand the importance of proper exploration in ensuring convergence and performance.


In [None]:
case = 4

gamma = 0.90
learning_rate = 0.1
max_iterations = 1500

seed_list = [42]

save_dir = "./mfrl_results"

# Run 1: decaying epsilon
epsilon_1 = 1.0
k_epsilon_1 = 0.995

# Run 2: epsilon = 0.0
epsilon_2 = 0.0
k_epsilon_2 = 0.0

# Run 3: epsilon = 1.0
epsilon_3 = 1.0
k_epsilon_3 = 1.0 

In [None]:
# Instantiate class 'Env', 'Dynamics', and 'Env_rl_d'
env = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs, state_ubs=state_ubs, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics = Dynamics(env)
mdp = Env_rl_d(env=env, dynamics=dynamics, num_states=num_states, num_actions=num_actions, dt=1/freq, build_stochastic_mdp=False)

# Instantiate the MCRL controller class
controller_mcrl_1 = MCRLController(mdp, freq, epsilon=epsilon_1, k_epsilon=k_epsilon_1, learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations, name="decaying_eps")
controller_mcrl_2 = MCRLController(mdp, freq, epsilon=epsilon_2, k_epsilon=k_epsilon_2, learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations, name="eps=0(greedy_case)")
controller_mcrl_3 = MCRLController(mdp, freq, epsilon=epsilon_3, k_epsilon=k_epsilon_3, learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations, name="eps=1(explorative_case)")

# Set up the controllers
controller_instances = {
    "mcrl_decaying_eps": controller_mcrl_1,
    "mcrl_eps=0(greedy_case)": controller_mcrl_2,
    "mcrl_eps=1(explorative_case)": controller_mcrl_3,
}

# Set up the RLExperimentRunner
runner = RLExperimentRunner(
    controller_instances=controller_instances,
    seed_list=seed_list,
    save_dir=save_dir
)

# Run all controllers with all seeds and save the results
runner.run_all()
runner.plot("Exploration vs. Exploitation")

In [None]:
# Load the trained controllers for simulation
controller_mcrl_1 = runner.get_trained_controller(name="mcrl_decaying_eps", seed=seed_list[0])
simulator_mcrl_1 = Simulator(dynamics, controller_mcrl_1, env, 1/freq, t_terminal)
simulator_mcrl_1.run_simulation()

controller_mcrl_2 = runner.get_trained_controller(name="mcrl_eps=0(greedy_case)", seed=seed_list[0])
simulator_mcrl_2 = Simulator(dynamics, controller_mcrl_2, env, 1/freq, t_terminal)
simulator_mcrl_2.run_simulation()

controller_mcrl_3 = runner.get_trained_controller(name="mcrl_eps=1(explorative_case)", seed=seed_list[0])
simulator_mcrl_3 = Simulator(dynamics, controller_mcrl_3, env, 1/freq, t_terminal)
simulator_mcrl_3.run_simulation()

# Instantiate the visualizer, and display the plots and animation
visualizer_mcrl = Visualizer(simulator_mcrl_1)
visualizer_mcrl.display_contrast_plots(simulator_mcrl_2, simulator_mcrl_3, title="Exploration vs. Exploitation")
visualizer_mcrl.display_contrast_animation_same(simulator_mcrl_2, simulator_mcrl_3)

- **Run 1: Decaying ε (ε=1.0 → ε\_min, k=0.995)**

  -  **Reward Curve**: Significant fluctuations early on indicate active exploration; gradually converges to a higher reward, showing stable policy learning.
  -  **Success Rate**: Rapidly increases around episode 500 and approaches 100%, indicating good convergence and robustness.
  -  **Trajectory Analysis**: Shows trial-and-error behavior initially, but converges to smooth and efficient motion towards the target in later stages.

   **Summary**: The decaying ε-soft strategy effectively balances exploration and exploitation. It allows the agent to explore sufficiently at the beginning while exploiting learned knowledge later, leading to the best overall performance.

<br>

- **Run 2: Constant ε = 0.0 (Pure Exploitation)**

  -  **Reward Curve**: Initially poor due to suboptimal policy, but once a good policy is accidentally discovered, it converges quickly.
  -  **Success Rate**: Approaches 100% eventually, but this is largely attributed to the long episode length.
  -  **Trajectory Analysis**: Shows repetitive and conservative behavior with no variation, lacking adaptability.

   **Summary**: Pure exploitation may converge quickly if lucky with initialization, but it risks getting stuck in local optima due to lack of exploration.

<br>

- **Run 3: Constant ε = 1.0 (Pure Exploration)**

  -  **Reward Curve**: Highly volatile throughout, with no sign of convergence.
  -  **Success Rate**: Remains low (<20%) across episodes, indicating that the agent continues to explore without learning from past experience.
  -  **Trajectory Analysis**: Trajectories remain random and inconsistent, showing no effective strategy formation.

   **Summary**: Full exploration without exploitation prevents the agent from consolidating learning. Although it covers the state space well, it fails to use acquired knowledge.

<blockquote style="padding-top: 20px; padding-bottom: 10px;">

##### **💡 Take-away: Exploration–Exploitation Trade-off in model-free RL**

**A proper balance between exploration and exploitation is critical in model-free RL.** Pure exploitation may lead to premature convergence, while pure exploration prevents policy stabilization. A decaying ε-soft strategy enables effective learning by exploring early and exploiting later.

</blockquote>

In [None]:
# Also set up an NMPC controller for reference
env_mpc = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs_mpc, state_ubs=state_ubs_mpc, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics_mpc = Dynamics(env_mpc)
controller_mpc = MPCController(env_mpc, dynamics_mpc, Q, R, Qf, freq_mpc, N, name="NMPC")
simulator_mpc = Simulator(dynamics_mpc, controller_mpc, env_mpc, 1/freq_mpc, t_terminal)
simulator_mpc.run_simulation()

# Compare the best result from Monte Carlo method with NMPC
visualizer_mcrl = Visualizer(simulator_mcrl_1)
visualizer_mcrl.display_contrast_plots(simulator_mpc, title="The best Epsilon-decaying Group from Monte Carlo Method vs. NMPC", if_gray=True)
visualizer_mcrl.display_contrast_animation_same(simulator_mpc, if_gray=True)

----

<br>

### **Q-Learning**

Q-learning is one of the most widely used **model-free reinforcement learning algorithms**, based on the *temporal-difference (TD)* learning paradigm. Unlike the Monte Carlo Reinforcement Learning method discussed earlier—which waits until the end of each episode to update value estimates—Q-learning performs **incremental updates** at every step based on the current state, action, reward, and the maximum expected future reward.

This **off-policy** nature and step-wise update strategy allow Q-learning to converge more efficiently and with lower variance than the Monte Carlo method, especially in environments with long or stochastic episodes. Below is the pseudocode for the Q-learning algorithm.

$$
\begin{array}{l}
\textbf{Q-learning (Off-Policy Temporal-Difference Method)}
\end{array}
$$

$$
\begin{array}{l}
\text{1. Initialization: }\\
\text{01: } \text{\quad - \quad Define a discount factor $\gamma$}\\
\text{02: } \text{\quad - \quad Choose a constant learning rate $\alpha$ and a small $\epsilon > 0$}\\
\text{03: } \text{\quad - \quad Initialise arbitrary $Q(x,u)\in\mathbb{R}$ for all $x\in\mathcal{X},\,u\in\mathcal{U}(x)$}\\
\text{2. Episode-Based Action Value Function Updates: }\\
\text{04: } \quad \textbf{while } \textit{not converged} \textbf{ do} \\
\text{05: } \quad | \quad \text{Initialize state $x$}\\
\text{06: } \quad | \quad \textbf{while } \textit{episode not terminated} \textbf{ do}\\
\text{07: } \quad | \quad | \quad \text{Choose $u$ based on a policy $p_{\pi}(u|x)$ derived from $Q(x,u)$ (e.g., an $\epsilon$-soft policy)}\\
\text{08: } \quad | \quad | \quad \text{Take action $u$ and observe $r$ and $x'$}\\
\text{09: } \quad | \quad | \quad Q(x,u) \leftarrow Q(x,u) + \alpha\bigl(r + \gamma \max_{u'} Q(x', u') - Q(x,u)\bigr)\\
\text{10: } \quad | \quad | \quad \text{Goto next state $x \leftarrow x'$}\\
\text{11: } \quad | \quad \textbf{end}\\
\text{12: } \quad \textbf{end}
\end{array}
$$
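
To see the loop in isolation, the following sketch runs tabular Q-learning on a toy one-dimensional chain. The chain world, its reward of −1 per step, and all hyperparameter values are assumptions made purely for illustration; they are unrelated to the mountain-car `Env_rl_d` used in this chapter.

```python
import numpy as np

def q_learning_chain(num_states=6, num_episodes=500, max_steps=50,
                     alpha=0.3, gamma=0.9, epsilon=1.0, k_epsilon=0.99,
                     epsilon_min=0.05, seed=0):
    """Tabular Q-learning on a toy chain: start at state 0, goal at the last state."""
    rng = np.random.default_rng(seed)
    goal = num_states - 1
    Q = np.zeros((num_states, 2))             # actions: 0 = left, 1 = right

    for _ in range(num_episodes):
        x = 0                                  # initialize state
        for _ in range(max_steps):
            # Epsilon-soft action selection derived from the current Q-table
            probs = np.full(2, epsilon / 2)
            probs[np.argmax(Q[x])] += 1.0 - epsilon
            u = rng.choice(2, p=probs)

            # Deterministic transition with a reward of -1 per step
            x_next = min(x + 1, goal) if u == 1 else max(x - 1, 0)
            r = -1.0

            # TD update; Q at the goal is never updated and therefore stays 0
            Q[x, u] += alpha * (r + gamma * np.max(Q[x_next]) - Q[x, u])

            x = x_next                         # go to next state
            if x == goal:
                break
        # Geometric exploration decay after each episode
        epsilon = max(epsilon * k_epsilon, epsilon_min)
    return Q

Q = q_learning_chain()
print("greedy policy:", np.argmax(Q[:-1], axis=1))
print("state values:", np.round(np.max(Q, axis=1), 2))
```

After training, the greedy action in every non-terminal state is expected to be 1 (move right), since that is the shortest route to the goal under the per-step penalty.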


#### **Implementation**

The `setup()` routine for **Q-learning** revolves around three core phases executed iteratively across episodes:

1. **Step-wise interaction:**
   Starting from the initial state (typically fixed to accelerate training), the agent repeatedly selects actions based on the current ε-soft policy, steps through the environment, and records the reward and next state until a termination condition is reached or the episode times out.

2. **Temporal-Difference (TD) update:**
   Immediately after each transition $(x, u, r, x')$, the agent performs an online update to the Q-table using the TD rule: $Q(x,u) \leftarrow Q(x,u) + \alpha \bigl(r + \gamma \max_{u'} Q(x', u') - Q(x,u)\bigr)$, which incorporates both immediate rewards and bootstrapped estimates of future returns.

3. **Policy evaluation & exploration decay:**
   After each episode, the greedy policy (used implicitly inside the ε-soft action sampler) becomes more reliable as the Q-table improves. The exploration rate ε is reduced geometrically to gradually shift from exploration to exploitation, while all key statistics like success rate and average reward are logged.


In [None]:
# Implementation
def setup(self) -> None:

    for iteration in range(self.max_iterations):

        total_reward = 0  # To accumulate rewards for this episode
        total_steps = 0  # To count total steps in this episode

        if iteration % 100 == 0:

            if iteration != 0:
                # Record the SR_100epsd, F_100epsd, TO_100epsd
                self.SR_100epsd.append(SR_100epsd)
                self.F_100epsd.append(F_100epsd)
                self.TO_100epsd.append(TO_100epsd)
            
            SR_100epsd = 0
            F_100epsd = 0
            TO_100epsd = 0

        # Start from init state
        # Note: Here we restrict initial state distribution to boost the training but one can also 
        #       randomly initialize state given sufficient interaction to improve the generalization.
        current_state = self.mdp.init_state
        current_state_index = self.mdp.nearest_state_index_lookup(current_state)

        for step in range(self.max_steps_per_episode):

            # Choose action based on epsilon-soft policy
            action_probabilities = self._get_action_probabilities(current_state_index)
            action_index = np.random.choice(np.arange(self.dim_inputs), p=action_probabilities)
            current_input = self.mdp.input_space[action_index]

            # Take action and observe the next state and reward
            next_state, reward = self.mdp.one_step_forward(current_state, current_input)
            next_state_index = self.mdp.nearest_state_index_lookup(next_state)
            total_reward += reward  # Accumulate total reward
            total_steps += 1  # Increment step count
            
            # Compute the TD error from the one-step bootstrapped target
            td_error = reward + self.gamma * np.max(self.Q[next_state_index, :]) - self.Q[current_state_index, action_index]

            # Track visit counts; a constant learning rate is used for the update
            self.state_action_counts[current_state_index, action_index] += 1
            alpha = self.learning_rate

            # Update the Q-table with the TD error
            self.Q[current_state_index, action_index] += alpha * td_error
            
            # Check if the episode is finished
            terminate_condition_1 = next_state[0]>self.mdp.pos_partitions[-1]
            terminate_condition_2 = next_state[0]<self.mdp.pos_partitions[0]
            terminate_condition_3 = np.all(self.mdp.state_space[:, next_state_index]==self.target_state)

            if terminate_condition_1 or terminate_condition_2 or terminate_condition_3:
                if terminate_condition_3:
                    SR_100epsd += 1
                    if self.verbose:
                        print(f"Iteration {iteration + 1}/{self.max_iterations}: finished successfully at step {step}! epsilon: {self.epsilon:.4f}, residual reward: {total_reward:.2f}")
                else:
                    F_100epsd += 1
                    if self.verbose:
                        print(f"Iteration {iteration + 1}/{self.max_iterations}: episode failed at step {step}! epsilon: {self.epsilon:.4f}, residual reward: {total_reward:.2f}")
                break

            else:
                
                # Move to the next state
                current_state_index = next_state_index
                current_state = self.mdp.state_space[:, current_state_index]
            
            if step == self.max_steps_per_episode-1:
                TO_100epsd +=1
                if self.verbose:
                    print(f"Iteration {iteration + 1}/{self.max_iterations}: time out (step: {step})! epsilon: {self.epsilon:.4f}, residual reward: {total_reward:.2f}")
                
        # Decrease epsilon
        self.epsilon *= self.k_epsilon
        self.epsilon = max(self.epsilon, self.epsilon_min)
        self.epsilon_list.append(self.epsilon)

        # Record the episode's accumulated reward and step count
        self.residual_rewards.append(total_reward)
        self.step_list.append(total_steps)
    
    # Return the deterministic policy and value function
    self.policy = np.argmax(self.Q, axis=1)
    self.value_function = np.max(self.Q, axis=1)

    # Repeat success/failure stats for plotting
    self.SR_100epsd = np.repeat(self.SR_100epsd, 100)/100
    self.F_100epsd = np.repeat(self.F_100epsd, 100)/100
    self.TO_100epsd = np.repeat(self.TO_100epsd, 100)/100
    
    if self.verbose:
        print("Training finished!")

<br>

#### **Simulation on a Flat Terrain**

Similar to the Monte Carlo RL setup, we first train and simulate the agent on a flat terrain to establish a clean and ideal baseline. Notably, since Q-learning uses recursive Temporal-Difference updates, the learning signal (TD error) tends to have smaller variance, leading to more stable updates. This allows us to adopt a relatively higher learning rate, accelerating the convergence process without sacrificing stability.
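
The variance argument can be illustrated numerically with a small experiment. The reward model below (−1 per step plus unit-variance Gaussian noise over a fixed horizon) is a hypothetical stand-in, and the TD target uses the exact expected value of the next state, which is an idealization: in actual Q-learning the bootstrapped value is only an estimate and introduces bias.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon, num_rollouts = 0.9, 20, 10_000

# Hypothetical reward model: -1 per step plus unit-variance Gaussian noise
rewards = -1.0 + rng.standard_normal((num_rollouts, horizon))
discounts = gamma ** np.arange(horizon)

# Monte Carlo target: full discounted return of each rollout
mc_targets = rewards @ discounts

# TD target: first reward plus the discounted value of the next state.
# The exact expected value of the next state is used (idealized bootstrap).
v_next = -np.sum(discounts[:-1])
td_targets = rewards[:, 0] + gamma * v_next

print(f"MC target: mean {mc_targets.mean():.2f}, variance {mc_targets.var():.2f}")
print(f"TD target: mean {td_targets.mean():.2f}, variance {td_targets.var():.2f}")
```

Both targets estimate the same quantity, but the Monte Carlo target accumulates noise from every step of the rollout, whereas the one-step target is affected only by the noise of the first reward.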

In [None]:
case = 1

gamma = 0.90
epsilon = 1.0
k_epsilon = 0.995
learning_rate = 0.3
max_iterations = 1500

seed = 42

In [None]:
np.random.seed(seed)

# Instantiate class 'Env', 'Dynamics', and 'Env_rl_d'
env = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs, state_ubs=state_ubs, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics = Dynamics(env)
mdp = Env_rl_d(env=env, dynamics=dynamics, num_states=num_states, num_actions=num_actions, dt=1/freq, build_stochastic_mdp=False)

# Instantiate the Q-learning controller class
controller_ql = QLearningController(mdp, freq, epsilon=epsilon, k_epsilon=k_epsilon,
                                    learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations)
controller_ql.setup()
controller_ql.postprocessing(window=50)
controller_ql.plot_training_curve()

# Instantiate the simulator, and then run the simulation
simulator_ql = Simulator(dynamics, controller_ql, env, 1/freq, t_terminal)
simulator_ql.run_simulation()

# Also set up an NMPC controller for reference
env_mpc = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs_mpc, state_ubs=state_ubs_mpc, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics_mpc = Dynamics(env_mpc)
controller_mpc = MPCController(env_mpc, dynamics_mpc, Q, R, Qf, freq_mpc, N, name="NMPC")
simulator_mpc = Simulator(dynamics_mpc, controller_mpc, env_mpc, 1/freq_mpc, t_terminal)
simulator_mpc.run_simulation()

# Instantiate the visualizer, and display the plots and animation
visualizer_ql = Visualizer(simulator_ql)
visualizer_ql.display_contrast_plots(simulator_mpc, title="Simulation of Q-learning Controller on Flat Terrain", if_gray=True)
visualizer_ql.display_contrast_animation_same(simulator_mpc, if_gray=True)

<br>

#### **Simulation on a Hilly Terrain**

We likewise conduct training on the hilly terrain and visualize the simulation results:

In [None]:
case = 4

gamma = 0.90
epsilon = 1.0
k_epsilon = 0.995
learning_rate = 0.2
max_iterations = 2000

seed = 42

In [None]:
np.random.seed(seed)

# Instantiate class 'Env', 'Dynamics', and 'Env_rl_d'
env = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs, state_ubs=state_ubs, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics = Dynamics(env)
mdp = Env_rl_d(env=env, dynamics=dynamics, num_states=num_states, num_actions=num_actions, dt=1/freq, build_stochastic_mdp=False)

# Instantiate the Q-learning controller class
controller_ql = QLearningController(mdp, freq, epsilon=epsilon, k_epsilon=k_epsilon,
                                    learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations)
controller_ql.setup()
controller_ql.postprocessing(window=50)
controller_ql.plot_training_curve()

# Instantiate the simulator, and then run the simulation
simulator_ql = Simulator(dynamics, controller_ql, env, 1/freq, t_terminal)
simulator_ql.run_simulation()

# Also set up an NMPC controller for reference
env_mpc = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs_mpc, state_ubs=state_ubs_mpc, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics_mpc = Dynamics(env_mpc)
controller_mpc = MPCController(env_mpc, dynamics_mpc, Q, R, Qf, freq_mpc, N, name="NMPC")
simulator_mpc = Simulator(dynamics_mpc, controller_mpc, env_mpc, 1/freq_mpc, t_terminal)
simulator_mpc.run_simulation()

# Instantiate the visualizer, and display the plots and animation
visualizer_ql = Visualizer(simulator_ql)
visualizer_ql.display_contrast_plots(simulator_mpc, title="Simulation of Q-learning Controller on Hilly Terrain", if_gray=True)
visualizer_ql.display_contrast_animation_same(simulator_mpc, if_gray=True)


The Q-learning agent demonstrates a smooth and stable learning curve even on the more challenging hilly terrain. The reward increases steadily, and the success rate reaches nearly 100% after sufficient training. Although the final policy is not perfectly optimal, it satisfies the task requirements without triggering input saturation or constraint violations, showing strong generalization despite the terrain's complexity.

---

<br>

### **Monte Carlo Method vs. Q-Learning**

In this section, we aim to empirically compare the performance of two fundamental model-free RL algorithms: Monte Carlo Reinforcement Learning and Q-learning. While both methods share the same objective—learning an optimal policy without explicit model knowledge—they differ in how value updates are performed: Monte Carlo method uses full-episode returns, whereas Q-learning adopts a recursive temporal-difference approach.


#### **Fair Comparison & Statistical Evaluation**

To ensure a fair and rigorous comparison, we follow best practices inspired by recent reproducibility efforts in RL research, particularly:

> **[JMLR 2024, “Revisiting RL Evaluation: Reproducibility, Fairness, and Statistical Significance”](https://www.jmlr.org/papers/volume25/23-0183/23-0183.pdf)**

Key recommendations adopted from the paper:

* **Multiple Random Seeds:** Since RL algorithms are sensitive to initialization and stochastic transitions, we fix a list of diverse random seeds (e.g., `[42, 63, 8042, 10328, 4174]`) and evaluate performance across them.

* **Mean & Variance Reporting:** After training under each seed, we aggregate the reward and success rate curves, and report **mean ± 3 * standard deviation** to reflect central tendency and variability.

* **Identical Environment Setup:** Both algorithms use the same reward function, state/input space, discount factor `γ`, and initial conditions to ensure structural fairness.

* **Controlled Hyperparameters:** Only algorithm-specific parameters (like update rule or exploration strategy) are allowed to differ.

This ensures that the observed performance differences can be attributed to algorithmic factors rather than confounding implementation artifacts.
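
As a rough illustration of the mean and variance reporting step, the sketch below aggregates per-seed curves into a mean curve and a ±3 standard deviation band. The helper `aggregate_curves` is hypothetical and the reward curves are synthetic; this is not the implementation inside `RLExperimentRunner`.

```python
import numpy as np

def aggregate_curves(curves, num_std=3.0):
    """Stack per-seed curves and return the mean curve and a +/- num_std band."""
    data = np.asarray(curves, dtype=float)    # shape: (num_seeds, num_episodes)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return mean, mean - num_std * std, mean + num_std * std

# Synthetic reward curves standing in for five independent training runs
seed_list = [42, 63, 8042, 10328, 4174]
episodes = np.arange(200)
curves = []
for seed in seed_list:
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 5.0, episodes.size)
    curves.append(-100.0 * np.exp(-episodes / 50.0) + noise)

mean, lower, upper = aggregate_curves(curves)
print(mean.shape, lower.shape, upper.shape)
```

The three returned arrays can be passed directly to a plotting routine, for example a line for `mean` and a shaded region between `lower` and `upper`.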


#### **Experimental Design**

The hyperparameter settings and chosen random seeds are specified in the following code block. Under the setup mentioned above, both MCRL and Q-learning are trained and evaluated independently for each seed. The final results are visualized using reward curves, success rate evolution, and state-input trajectories under the learned policy.

In [None]:
case = 4

gamma = 0.90
epsilon = 1.0
k_epsilon = 0.995
learning_rate = 0.1

max_iterations = 2000

seed_list = [42, 63, 8042, 10328, 4174]

save_dir = "./mfrl_results"

In [None]:
# Instantiate class 'Env', 'Dynamics', and 'Env_rl_d'
env = Env(case, np.array([initial_position, initial_velocity]), np.array([target_position, target_velocity]),
          state_lbs=state_lbs, state_ubs=state_ubs, input_lbs=input_lbs, input_ubs=input_ubs)
dynamics = Dynamics(env)
mdp = Env_rl_d(env=env, dynamics=dynamics, num_states=num_states, num_actions=num_actions, dt=1/freq, build_stochastic_mdp=False)

# Instantiate the MCRL controller class
controller_mcrl = MCRLController(mdp, freq, epsilon=epsilon, k_epsilon=k_epsilon, 
                                    learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations, name="Monte Carlo Method")

# Instantiate the Q-learning controller class
controller_ql = QLearningController(mdp, freq, epsilon=epsilon, k_epsilon=k_epsilon,
                                    learning_rate=learning_rate, gamma=gamma, max_iterations=max_iterations, name="Q-learning")

# Set up the controllers
controller_instances = {
    "MCRL": controller_mcrl,
    "Q-learning": controller_ql
}

# Set up the RLExperimentRunner
runner = RLExperimentRunner(
    controller_instances=controller_instances,
    seed_list=seed_list,
    save_dir=save_dir
)

# Run all controllers with all seeds and save the results
runner.run_all()
runner.plot("Monte Carlo Method vs. Q-learning")

In [None]:
# Load the trained controllers for simulation
controller_mcrl = runner.get_trained_controller(name="MCRL", seed=seed_list[0])
simulator_mcrl = Simulator(dynamics, controller_mcrl, env, 1/freq, t_terminal)
simulator_mcrl.run_simulation()

controller_ql = runner.get_trained_controller(name="Q-learning", seed=seed_list[0])
simulator_ql = Simulator(dynamics, controller_ql, env, 1/freq, t_terminal)
simulator_ql.run_simulation()

# Instantiate the visualizer, and display the plots and animation
visualizer_mcrl = Visualizer(simulator_mcrl)
visualizer_mcrl.display_contrast_plots(simulator_ql, title="Monte Carlo Method vs. Q-learning")
visualizer_mcrl.display_contrast_animation_same(simulator_ql)

The comparative results between **Monte Carlo method** and **Q-learning** on the same hilly terrain task yield the following insights:

* **Monte Carlo method** benefits from **unbiased return estimation**, which helps it eventually reach a near-optimal solution. However, its reliance on full-episode returns leads to **higher variance**, especially in the early stage of training. This can result in less stable learning curves and lower repeatability across runs. Despite this, Monte Carlo method sometimes outperforms Q-learning in final performance when training is sufficiently long.

* **Q-learning**, by contrast, exhibits **faster initial convergence** due to its recursive temporal-difference update mechanism. It has **lower variance** across seeds, resulting in **smoother and more stable training**. However, its bootstrapping nature may introduce bias, which occasionally limits its ability to reach the globally optimal value compared to Monte Carlo method.

Overall, while both algorithms can achieve comparable success rates after sufficient training, Q-learning offers better training stability and efficiency, whereas Monte Carlo method may achieve slightly better peak performance if variance is well managed.


<blockquote style="padding-top: 20px; padding-bottom: 10px;">

##### **💡 Take-away: Monte Carlo Method vs. Q-learning**

* **Monte Carlo method** provides **unbiased return estimation** by averaging full-episode rewards, which can lead to highly accurate value predictions in the long run. However, its reliance on complete episodes and high variance may result in unstable or slow convergence.

* **Q-learning** offers **more stable and sample-efficient learning** through recursive temporal-difference updates. Though slightly biased, its lower variance and online nature allow faster convergence and smoother training dynamics in practice.

</blockquote> 




---

<br>

### **Comparison of Control and Reinforcement Learning Paradigms**

In the previous sections, we have explored three different paradigms for sequential decision-making: **model-based Control** (e.g., MPC), **model-based Reinforcement Learning** (e.g., Value Iteration, Policy Iteration, GPI), and **model-free Reinforcement Learning** (e.g., Monte Carlo method, Q-learning). Each of these methods offers distinct strengths and trade-offs, depending on the availability of system dynamics, the nature of the state and action spaces, and computational constraints. Below we provide a structured comparison across several key dimensions:

<br>

- In terms of model:

| **Dimension**            | **Model-based Control (MPC)**                                  | **Model-based RL (VI/PI/GPI)**                              | **Model-free RL (Monte Carlo method/Q-learning)**                         |
| ------------------------ | -------------------------------------------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
| **Need for dynamics**     | System dynamics required                                         | MDP required (transition probabilities & rewards) | No (all learned from samples)                                             |
| **Type of dynamics**     | Continuous / discrete-time                                     | Discrete-time | Discrete-time                                       |
| **State/action space**   | Continuous | Discrete (tabular, continuous requires discretization)       | Discrete (tabular, continuous requires discretization)                                          |

<br>

- In terms of computational burden:

| **Dimension**            | **Model-based Control (MPC)**                                  | **Model-based RL (VI/PI/GPI)**                              | **Model-free RL (Monte Carlo method/Q-learning)**                         |
| ------------------------ | -------------------------------------------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
| **Optimization timing**  | Online (solves OCP in real-time)                                  | Offline (solves DP using model)                             | Offline (learns directly from interaction)                  |
| **Offline computation time**     | -                       | High (but depends on grid size)                    | Moderate (depends on number of episodes) |
| **Online computation time**     | High (due to online QPs)                       | Low                   | Low |
| **Scalability**     | Moderate (proportional to state/input dimension)                       | Poor (grid size grows exponentially with state dimension)                   | Better than the other two (learning through sampling) |
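
To make the scalability row concrete, the short sketch below counts grid states and Q-table entries for a uniform discretization. The numbers of bins and actions are hypothetical values chosen for illustration, not the settings used in this chapter.

```python
def tabular_sizes(bins_per_dim, state_dim, num_actions):
    """Number of grid states and Q-table entries for a uniform discretization."""
    num_states = bins_per_dim ** state_dim
    return num_states, num_states * num_actions

for state_dim in (2, 4, 6):
    num_states, q_entries = tabular_sizes(bins_per_dim=20, state_dim=state_dim, num_actions=5)
    print(f"dim={state_dim}: {num_states:,} states, {q_entries:,} Q-table entries")
```

A two-dimensional state such as (position, velocity) remains small, but the table size is multiplied by the number of bins for every additional state dimension.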

<br>

- In terms of learning properties:

| **Dimension**            | **Model-based Control (MPC)**                                  | **Model-based RL (VI/PI/GPI)**                              | **Model-free RL (Monte Carlo method/Q-learning)**                         |
| ------------------------ | -------------------------------------------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
| **Data requirement**     | Low (default not learning from data)                                 | Low (default not learning from data)                               | High (samples collected through interaction)                |
| **Exploration strategy** | -                                     | Implicit via value improvement                              | Explicit via ε-soft policy or random sampling               |
| **Convergence & stability**     | Deterministic and stable under proper design                                 | Can converge, but may be unstable if model is inaccurate                               | Slower convergence, sensitive to hyperparameters                |
| **Optimality** | High (provably optimal under known system dynamics and cost design) | High (provably optimal under known MDP model and reward structure) | Often suboptimal (limited by data efficiency and exploration challenges) |

<br>

#### **Summary of Each Method**

* **Model-based Control:**
  Provides high performance with accurate models and works well in continuous domains. However, real-time optimization can be computationally intensive, especially in constrained or nonlinear problems. No learning is involved—relies entirely on predictive models and open-loop optimization at each step.

* **Model-based RL:**
  Bridges control and learning by computing value functions from a known model (MDP). It guarantees convergence to the optimal policy in discrete settings but is limited by discretization and computational burden in large-scale problems. It serves as a theoretical benchmark but is impractical in continuous domains.

* **Model-free RL:**
  Trades off model accuracy for direct interaction-based learning. It is more robust to model mismatch and easier to implement when dynamics are unknown. However, it typically requires large amounts of data and careful exploration strategy design to achieve comparable performance. Similar to model-based RL, its classical tabular formulation is limited to discrete domains.



> <br>
> 📌 <b>Bridging the Gap: From Classical Reinforcement Learning to Deep Reinforcement Learning</b>
>
> These limitations highlight the necessity of **Deep Reinforcement Learning (DRL)**, which combines neural function approximation with model-free learning paradigms to scale RL to high-dimensional, continuous problems.
>
> <br>