# L8c: Introduction to Q-Learning
In this lecture, we'll continue our discussion of reinforcement learning concepts and introduce model-free, off-policy Q-learning. We'll start with the Q-learning algorithm, then discuss its convergence and see how to implement Q-learning in a more complex environment. Finally, we'll discuss the limitations of Q-learning and see how deep Q-learning (an advanced topic) can overcome them.

The key ideas of this lecture are:
* Q-learning is a model-free off-policy algorithm that learns the optimal state-action-value function $Q(s,a)$ by experimenting with the environment. This is done by updating the Q-values using the Bellman equation. The optimal policy is then derived from the optimal Q-values: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
* Q-learning is guaranteed to converge to the optimal Q-values under certain conditions. In practice, convergence is observed when Q-value updates become negligible and the reward plateaus. The theoretical guarantees require two conditions, learning rate decay and infinite exploration, and assume that the discount factor satisfies $0<\gamma<1$ for infinite-horizon problems and that the environment has the Markov property.
* $\epsilon$-greedy exploration is a common strategy for balancing exploration and exploitation in Q-learning. The agent selects the action with the highest Q-value with probability $1-\epsilon$, and a random action with probability $\epsilon$. We'll let $\epsilon$ decay over time so that the agent explores more in the beginning and exploits more later.

The notes for today were inspired by the [Reinforcement Learning (RL) course](https://gibberblot.github.io/rl-notes/) prepared by [Prof. Tim Miller from The University of Queensland](https://uqtmiller.github.io).

## Q-Learning Theory
Q-learning estimates the action-value function $Q(s, a)$ by conducting repeated experiments $t=1,2,\ldots$ in the world $\mathcal{W}$.
In each experiment $t$, an agent in state $s\in\mathcal{S}$ takes action $a\in\mathcal{A}$, receives a reward $r$, and (potentially) transitions to a new state $s^{\prime}$. After each experiment $t$, the agent updates its estimate of $Q(s, a)$:
$$
\begin{equation*}
Q_{t+1}(s,a)\leftarrow{Q_{t}(s,a)}+\alpha_{t}\cdot\left(r+\gamma\cdot\max_{a^{\prime}\in\mathcal{A}}Q_{t}(s^{\prime},a^{\prime}) - Q_{t}(s,a)\right)\quad{t = 1,2,3,\ldots}
\end{equation*}
$$
where $0<\alpha_{t} <{1}$ is the learning rate parameter at time $t$, and $0<\gamma<{1}$ is the discount factor. 
We estimate the policy function $\pi:\mathcal{S}\rightarrow\mathcal{A}$ by selecting the action $a$ that maximizes $Q(s,a)$ at each state $s$:
$$
\begin{equation*}
\pi(s) = \arg\max_{a\in\mathcal{A}}Q(s,a)
\end{equation*}
$$
where $\pi(s)$ is the action that maximizes the action-value function $Q(s,a)$ at state $s$.
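
To make the update rule and the greedy policy concrete, here is a minimal sketch of a single update on a tiny Q-table. The states (`s0`, `s1`), the actions (`left`, `right`), the initial Q-values, and the helper names `q_update` and `greedy_policy` are all hypothetical and chosen only for illustration.

```python
# Hypothetical 2-state, 2-action Q-table (values chosen for illustration only).
ACTIONS = ["left", "right"]
Q = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 2.0, ("s1", "right"): 4.0,
}


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    best_next = max(Q[(s_next, a_next)] for a_next in ACTIONS)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


def greedy_policy(Q, s):
    """Return the action that maximizes Q(s, a) in state s."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


# One experiment: in s0 take "right", receive r = 1, and land in s1.
td = q_update(Q, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.5)
# TD error: 1.0 + 0.5 * 4.0 - 1.0 = 2.0, so Q(s0, right) moves from 1.0 to 2.0.
print(td, Q[("s0", "right")], greedy_policy(Q, "s0"))
```

With $\alpha = 0.5$ the estimate moves halfway toward the target $r+\gamma\max_{a^{\prime}}Q(s^{\prime},a^{\prime}) = 3$, and the greedy action in `s0` remains `right`.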

### Algorithm
Initialize $Q(s,a)$ arbitrarily for all $s\in\mathcal{S}$, and $a\in\mathcal{A}$.
Set the hyperparameters: learning rate $\alpha_{t}$, the discount factor $\gamma$, the exploration rate $\epsilon_{t}$, and the convergence tolerance $\theta$.

For $s\in\mathcal{S}$
1. Initialize the time $t\gets{1}$
2. While not converged and $s\not\in\mathcal{S}_{\text{term}}$:
    1. Draw a uniform random number $p\in[0,1]$.
    2. If $p\leq\epsilon_{t}$, choose a random (uniform) action $a_{t}\in\mathcal{A}$. Otherwise, choose a greedy action $a_{t} = \arg\max_{a\in\mathcal{A}}{Q_{t}(s,a)}$.
    3. Take action $a_{t}$, observe the reward $r$ and transition to the next state $s^{\prime}$.
    4. __Update__ the state-action-value function: $Q_{t+1}(s,a)\leftarrow{Q_{t}(s,a)}+\alpha_{t}\cdot\left(r+\gamma\cdot\max_{a^{\prime}\in\mathcal{A}}Q_{t}(s^{\prime},a^{\prime}) - Q_{t}(s,a)\right)$.
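    5. __Update__ the state $s\leftarrow{s^{\prime}}$, the time $t\leftarrow{t+1}$, the exploration rate $\epsilon_{t+1}$ and the learning rate $\alpha_{t+1}$. In the simplest case these are held fixed, $\epsilon_{t+1}\leftarrow\epsilon_{t}$ and $\alpha_{t+1}\leftarrow\alpha_{t}$; alternatively, they are decayed according to a schedule (see the Convergence section).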
    6. __Convergence?__: if the change in the estimate is bounded, $\lVert{Q_{t+1}(s,a) - Q_{t}(s,a)}\rVert\leq\theta$, then the algorithm has _converged_. Otherwise, continue.
3. End While
4. End For
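
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the five-state chain world, its `step` dynamics, and all hyperparameter values are hypothetical. Two deliberate simplifications are made: the learning rate is held constant (adequate here because the toy dynamics are deterministic), and convergence is tested on the largest change over a full sweep of start states rather than after every single update, since a single update can be zero before any reward has been seen.

```python
import random

N_STATES = 5        # states 0..4 on a line (hypothetical toy world)
TERMINAL = {4}      # reaching state 4 ends the episode
ACTIONS = (0, 1)    # 0 = step left, 1 = step right


def step(s, a):
    """Deterministic toy dynamics: reward 1.0 only when entering the terminal state."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s_next in TERMINAL else 0.0
    return reward, s_next


def q_learning(n_sweeps=300, alpha=0.5, gamma=0.9, eps=1.0, eps_decay=0.99,
               eps_min=0.1, theta=1e-8, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # arbitrary initialization (zeros)
    for _ in range(n_sweeps):
        largest_change = 0.0
        for s0 in range(N_STATES):              # "For s in S": try every start state
            s, t = s0, 1
            while s not in TERMINAL and t <= max_steps:
                # Epsilon-greedy action selection.
                if rng.random() <= eps:
                    a = rng.choice(ACTIONS)
                else:
                    a = max(ACTIONS, key=lambda b: Q[s][b])
                # Act, then observe the reward and the next state.
                r, s_next = step(s, a)
                # Q-learning update (terminal states keep Q = 0).
                change = alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
                Q[s][a] += change
                largest_change = max(largest_change, abs(change))
                # Advance the state and the time.
                s, t = s_next, t + 1
            # Assumption: decay the exploration rate once per episode, with a floor.
            eps = max(eps_min, eps * eps_decay)
        # Assumption: convergence is checked per sweep, not per update.
        if largest_change <= theta:
            break
    policy = [max(ACTIONS, key=lambda b: Q[s][b]) for s in range(N_STATES)]
    return Q, policy


Q, policy = q_learning()
print(policy[:4])                      # greedy action in each non-terminal state
print([round(max(q), 3) for q in Q])   # estimated state values
```

In this toy world the only reward sits at the right end of the chain, so the greedy policy should step right in every non-terminal state, with values that shrink by a factor of $\gamma$ per step away from the goal.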

### Convergence
Q-learning converges to the optimal Q-values, and hence to the optimal policy, under two key theoretical conditions (assuming the Markov property holds for the world):
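* __Learning rate decay__: The learning rate $\alpha_{t}$ must satisfy $\sum_{t=0}^\infty \alpha_t(s,a) = \infty$ and $\sum_{t=0}^\infty \alpha_t^2(s,a) < \infty$ for all state-action pairs. The first condition keeps the steps large enough to overcome poor initial estimates, while the second makes them shrink fast enough for the estimates to stabilize. A common choice that satisfies both is $\alpha_t \sim 1/t$. Geometric decay, $\alpha_{t+1} \gets \beta\alpha_{t}$ with $0<\beta<1$, is also common in practice, but its sum is finite, so it does not strictly satisfy the first condition.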
* __Infinite exploration__: All state-action pairs must be visited infinitely often, typically enforced by $\epsilon$-greedy policies with persistent exploration ($\epsilon_{t} > 0\,\,\forall{t}$).
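
The two learning rate conditions can be checked numerically by comparing partial sums for different schedules. The sketch below is illustrative only: the helper `partial_sums` and the geometric schedule's starting value (0.5) and decay factor (0.9) are hypothetical choices.

```python
def partial_sums(alpha, n_terms):
    """Return (sum of alpha_t, sum of alpha_t squared) for t = 1..n_terms."""
    total, total_sq = 0.0, 0.0
    for t in range(1, n_terms + 1):
        a = alpha(t)
        total += a
        total_sq += a * a
    return total, total_sq


def harmonic(t):
    """alpha_t = 1/t."""
    return 1.0 / t


def geometric(t):
    """alpha_t = 0.5 * 0.9**(t - 1), i.e. alpha_{t+1} = 0.9 * alpha_t (hypothetical values)."""
    return 0.5 * 0.9 ** (t - 1)


for n in (10**2, 10**3, 10**5):
    print(n, partial_sums(harmonic, n), partial_sums(geometric, n))
```

For $\alpha_t = 1/t$ the first sum keeps growing (roughly like $\ln n$) while the sum of squares stays below $\pi^2/6$. For the geometric schedule the first sum levels off near $0.5/(1-0.9) = 5$, which is why it fails the first condition even though it often works well in practice.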

## Limitations of Q-learning
Q-learning, while foundational in reinforcement learning, faces several key limitations:  
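* __Scalability issues with tabular methods__: The Q-table needs one entry per state-action pair, $|\mathcal{S}|\times|\mathcal{A}|$ in total, and the number of states typically grows exponentially with the number of state variables. In large or continuous state-action spaces the table becomes impractical (e.g., millions of entries), leading to slow convergence and inefficient learning.
* __Overestimation bias__: The update takes a maximum over estimated Q-values, so when those estimates are noisy the maximum is biased upward. Q-learning can therefore overestimate action values and settle on suboptimal policies, a problem mitigated by Double Q-learning.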
* __Discrete state-action requirement__: The algorithm inherently struggles with continuous spaces, requiring discretization techniques that trade precision for computational feasibility.
* __Delayed rewards__: Q-learning learns slowly when rewards are delayed. Each one-step temporal difference (TD) update uses only the immediate reward and the current estimate at the next state, so information about a distant reward propagates back one step at a time.
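
The overestimation bias listed above can be illustrated with a small simulation. In this sketch every action has a true value of zero, but the agent only sees noisy estimates, so the maximum over the estimates is positive on average. The function name, the number of actions, and the Gaussian noise model are assumptions made for illustration.

```python
import random


def mean_max_of_noisy_estimates(n_actions=10, noise_std=1.0, n_trials=10000, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is 0 and the noise is zero-mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Noisy estimates of action values whose true value is 0.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / n_trials


bias_many = mean_max_of_noisy_estimates(n_actions=10)
bias_one = mean_max_of_noisy_estimates(n_actions=1)
print(bias_many, bias_one)
```

With a single action the average stays near zero, while with ten actions the average maximum is clearly positive even though no action is actually better than another. This is the effect that the $\max$ in the Q-learning target inherits.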

## Lab
In `L8d`, we will implement the $\epsilon$-greedy algorithm to solve a Q-learning problem for an $n$-product consumer choice problem.

# Today?
That's a wrap! What are some of the interesting things we discussed today?