# What is reinforcement learning? 

In very simple terms, an agent learns to pick actions in an environment to get more reward over time. The point is to make the agent learn the process of 

## Markov Decision Processes

Markov decision processes (MDPs) provide a way to formalize sequential decision making. In an MDP, we have: 

1. Environment: A surrounding or scenario which defines our world.
2. Agent: A model that interacts with the environment it is placed in.
3. State: A configuration of the environment that the agent faces at a given time step.
4. Action: A step taken by the agent to bring about a change of state in its environment.
5. Reward: A score given to the agent as a result of that action.

This process repeats over and over, and the resulting sequence is referred to as a trajectory. The agent's aim is to maximize cumulative reward, i.e. the reward it collects across all time steps (not just at one time step). 
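The interaction loop described above can be sketched in a few lines. This is only an illustration: `CorridorEnv` is a hypothetical toy environment invented here (it is not part of any library), and the agent simply picks random actions.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 moves left, action 1 moves right (clipped to the corridor)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def run_episode(env, max_steps=100, seed=0):
    """Run one episode with a random agent and record (state, action, reward)."""
    rng = random.Random(seed)
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = rng.choice([0, 1])                  # the agent acts
        next_state, reward, done = env.step(action)  # the environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

trajectory = run_episode(CorridorEnv())
cumulative_reward = sum(reward for _, _, reward in trajectory)
print(len(trajectory), cumulative_reward)
```

A learning agent would replace the random choice with a rule that improves as rewards come in.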

### Formal Problem Notation

In an MDP, we have a set of states **S**, a set of actions **A**, and a set of rewards **R**. Each of these sets has a finite number of elements. 

1. At each time step $t = 0, 1, 2, \dots$, the agent is in a state $S_t \in S$ and takes an action $A_t \in A$. So, for each time step, we have a state-action pair $(S_t, A_t)$.
2. The environment then moves to state $S_{t+1}$, and the agent receives a numerical reward $R_{t+1} \in R$.
3. The reward can be expressed as a function of state and action.
   $$f(S_t,A_t)= R_{t+1}$$
4. The trajectory is a sequence of states, actions, and rewards.
   $$ S_0, A_0, R_1, S_1, A_1, R_2, ...$$
5. Because **S** and **R** are finite, the random variables $S_t$ and $R_t$ have well-defined probability distributions. The probability of moving to state $s'$ with reward $r$, after taking action $a$ in state $s$, can be defined as
   $$ p(s',r|s,a)= \Pr(S_t=s', R_t=r \mid S_{t-1}=s, A_{t-1}=a)$$
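For a finite MDP, the dynamics function $p(s',r|s,a)$ can be stored as a plain lookup table. The sketch below uses a hypothetical two-state MDP (the state names, action names, and probabilities are made up for illustration). It shows how the state-transition probability and the expected reward follow from $p$ by summing over outcomes.

```python
# Hypothetical two-state MDP: dynamics[(s, a)] maps (s', r) -> p(s', r | s, a)
dynamics = {
    ("cold", "heat"): {("warm", 1.0): 0.9, ("cold", 0.0): 0.1},
    ("cold", "wait"): {("cold", 0.0): 1.0},
    ("warm", "heat"): {("warm", 0.5): 1.0},
    ("warm", "wait"): {("warm", 1.0): 0.7, ("cold", 0.0): 0.3},
}

def p(s_next, r, s, a):
    """Return p(s', r | s, a), or 0.0 for outcomes that cannot occur."""
    return dynamics[(s, a)].get((s_next, r), 0.0)

def state_transition_prob(s_next, s, a):
    """Marginalize over rewards: p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(prob for (sn, _), prob in dynamics[(s, a)].items() if sn == s_next)

def expected_reward(s, a):
    """Expected immediate reward: sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (_, r), prob in dynamics[(s, a)].items())

# For every (s, a), the probabilities over all (s', r) outcomes must sum to 1
for (s, a), outcomes in dynamics.items():
    print(s, a, sum(outcomes.values()), expected_reward(s, a))
```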

### Expected Return

In an MDP, the main driving force behind the agent is maximizing its cumulative reward, known as the *return*. Because rewards are random, the agent maximizes this quantity in expectation, which is the *Expected Return*. We can define the return as 

 $$ G_t= R_{t+1}+ R_{t+2}+ R_{t+3}+ ...+ R_T $$ 

where *T* is the final time step, and the goal of the agent is to maximize the expected value of $G_t$.

However, there is an issue if we just add up rewards forever:

1. **Infinity problem**: If episodes don’t end, the sum may diverge (e.g., a robot that keeps getting +1 reward at each step accumulates an infinite return). 

2. **Unrealistic planning**: We care more about rewards that arrive soon than about rewards far in the future. For example, a robot that delivers coffee in 10 steps is better than one that delivers in 1000 steps, even if both eventually succeed.

To combat these issues, we introduce the discounted expected return. 

### Discounted Return

We define the discount rate to be a number $\gamma$ between 0 and 1 (strictly less than 1 if episodes never end). Each reward in the return is multiplied by a power of $\gamma$ that grows with how far in the future the reward arrives: 

$$G_t= R_{t+1}+ \gamma R_{t+2}+ \gamma^2R_{t+3}+ ...$$
$$G_t= \sum_{k=0}^{\infty} \gamma^kR_{t+k+1}$$

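One key feature of the discounted return is that the agent values immediate rewards more, because future rewards receive geometrically smaller weights. With bounded rewards and $\gamma < 1$, the infinite sum also stays finite.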
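A minimal sketch of the summation form $\sum_{k} \gamma^k R_{t+k+1}$, applied to a finite list of rewards. The function name `discounted_return` and the reward lists are illustrative choices, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k], where rewards[0] plays the role of R_{t+1}."""
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

rewards = [1.0] * 1000   # +1 at every step, as in the robot example
print(discounted_return(rewards, gamma=1.0))  # undiscounted: grows with the horizon
print(discounted_return(rewards, gamma=0.9))  # discounted: stays near 1 / (1 - 0.9)

# Coffee example: a single reward of 1 arriving after 10 steps vs. after 1000 steps
fast = [0.0] * 9 + [1.0]
slow = [0.0] * 999 + [1.0]
print(discounted_return(fast, 0.99) > discounted_return(slow, 0.99))
```

With discounting, the quicker delivery is worth more even though both reward lists contain the same total reward.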

## Policy and Value Functions

When it comes to agents selecting an action at a given state, there are two factors of consideration: Policies and Value Functions.

### Policy

A policy is a function that describes the probability of picking each action from each possible state. For each state $s\in S$, the policy $\pi$ is a probability distribution over all actions $a\in A(s)$ available in that state.

At any time $t$, under a policy $\pi$, the probability of taking action $a$ in state $s$ is given by $\pi(a|s)$.
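A stochastic policy over a finite state and action set can be written as a table of probabilities. The sketch below uses hypothetical states and actions and shows how $\pi(a|s)$ is looked up and how an action is sampled from it.

```python
import random

# Hypothetical stochastic policy: policy[s][a] = pi(a | s)
policy = {
    "cold": {"heat": 0.8, "wait": 0.2},
    "warm": {"heat": 0.1, "wait": 0.9},
}

def pi(a, s):
    """Probability of taking action a in state s under the policy."""
    return policy[s].get(a, 0.0)

def sample_action(s, rng):
    """Draw one action from the distribution pi(. | s)."""
    actions = list(policy[s].keys())
    weights = list(policy[s].values())
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [sample_action("cold", rng) for _ in range(10_000)]
print(pi("heat", "cold"), samples.count("heat") / len(samples))
```

The empirical frequency of an action approaches $\pi(a|s)$ as the number of samples grows.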

### Value Functions

A value function is a function over states or state-action pairs that estimates one of two things: 

1. State-Value Function: How good it is for the agent to be in a given state. 
2. Action-Value Function: How good it is for the agent to pick an action in a given state.

Value functions are thus defined in terms of expected returns. The expected return depends on how the agent acts, which is determined by the policy. Value functions are therefore defined with respect to a policy. 

**State Value Function** for a state $s$ is the expected return when starting from $s$ at time $t$ and following policy $\pi$ thereafter: 
$$v_{\pi}(s)= E_{\pi}[G_t \mid S_t=s]$$

**Action Value Function/Q Function** for an action $a$ in state $s$ is the expected return when starting from $s$ at time $t$, taking action $a$, and following policy $\pi$ thereafter: 
$$q_{\pi}(s,a)= E_{\pi}[G_t \mid S_t=s,A_t=a]$$
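The two value functions are linked: averaging the action values over the policy's action probabilities gives the state value, $v_{\pi}(s)=\sum_a \pi(a|s)\,q_{\pi}(s,a)$. A small sketch of that identity, with hypothetical numbers chosen only for illustration:

```python
# Hypothetical policy and action values for a single state "s0"
policy = {"s0": {"left": 0.25, "right": 0.75}}
q_pi = {("s0", "left"): 2.0, ("s0", "right"): 6.0}

def state_value(s):
    """v_pi(s) = sum over actions of pi(a | s) * q_pi(s, a)."""
    return sum(prob * q_pi[(s, a)] for a, prob in policy[s].items())

print(state_value("s0"))  # 0.25 * 2.0 + 0.75 * 6.0
```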

## Optimization

Any learning algorithm learns by optimizing a function. In traditional ML, the cost function is optimized. In RL, what do we optimize? We optimize the policy and the value functions!

**Optimal Policy**: A policy $\pi$ is considered better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ in every state. An optimal policy is better than or equal to all other policies. 

$$\pi \geq \pi' \iff v_{\pi}(s) \geq v_{\pi'}(s), \forall s\in S$$

**Optimal State-Value Function**: The optimal policy has an associated optimal state-value function $v_*$ that gives the largest expected return achievable by any policy $\pi$ for each state. 
$$v_*(s)=\max_{\pi}v_{\pi}(s),  \forall s \in S$$

**Optimal Action-Value/Q Function**: The optimal policy has an associated optimal action-value function $q_*$ that gives the largest expected return achievable by any policy $\pi$ for each state-action pair. 
$$q_*(s,a)=\max_{\pi}q_{\pi}(s,a),  \forall s \in S , \forall a \in A$$

The $q_*$ function must satisfy the Bellman optimality equation. For any state-action pair $(s,a)$ at time $t$, the expected return of starting in state $s$, selecting action $a$, and following the optimal policy thereafter (i.e. the *q-value* of $(s,a)$) equals the expected immediate reward plus the discounted maximum q-value achievable from the next state $s'$ over all actions $a'$:

$$ q_*(s,a)= E[R_{t+1} + \gamma \max_{a'}q_*(s',a') \mid S_t=s, A_t=a]$$
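When the dynamics are fully known, the Bellman optimality equation can be applied repeatedly as an update until the q-values stop changing. The sketch below does this on a hypothetical three-state chain (state 2 is terminal). The model, its rewards, and $\gamma = 0.9$ are assumptions made for illustration.

```python
GAMMA = 0.9
ACTIONS = ["stay", "go"]

# Hypothetical model: (s, a) -> list of (probability, next_state, reward, terminal)
model = {
    (0, "stay"): [(1.0, 0, 0.0, False)],
    (0, "go"):   [(1.0, 1, 0.0, False)],
    (1, "stay"): [(1.0, 1, 0.0, False)],
    (1, "go"):   [(1.0, 2, 1.0, True)],
}

def bellman_backup(q, s, a):
    """Right-hand side of the Bellman optimality equation for the pair (s, a)."""
    total = 0.0
    for prob, s_next, reward, terminal in model[(s, a)]:
        # No reward can be collected after a terminal state
        best_next = 0.0 if terminal else max(q[(s_next, a2)] for a2 in ACTIONS)
        total += prob * (reward + GAMMA * best_next)
    return total

q = {sa: 0.0 for sa in model}
for _ in range(200):
    # Recompute every entry from the previous table
    q = {(s, a): bellman_backup(q, s, a) for (s, a) in model}

for sa, value in sorted(q.items()):
    print(sa, round(value, 4))
```

At convergence each entry equals its own backup, which is what the equation requires. Q-learning aims for the same fixed point without access to `model`, using sampled transitions instead.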

## Q Learning

Q-learning is a reinforcement learning algorithm that learns the optimal q-values directly from the agent's experience. 

* Objective: Learn the optimal q-values for each $(s,a)$ pair and thus find the optimal policy. 
* Value Iteration: The algorithm iteratively updates the q-values for each state-action pair until they converge to $q_*$. 
* Episodes: The number of episodes, i.e. the number of times the agent runs until termination, is set beforehand. Each episode consists of multiple time steps.

### Intuition

The Bellman optimality equation gives us the q-value at optimality:

$$ q_*(s,a)= E[R_{t+1} + \gamma \max_{a'}q_*(s',a')]$$

During learning, we only have a current estimate $q(s,a)$ of these values. We calculate its error $\delta$:

$$ \delta= q_*(s,a) - q(s,a)$$

Substituting the optimal q-value from the Bellman equation, and using the single observed transition in place of the expectation, we get

$$ \delta= [R_{t+1} + \gamma \max_{a'}q_*(s',a')] - q(s,a)$$

To get the new q-value, we add this $\delta$, scaled by a learning rate $\alpha$, to the old q-value.
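As a short worked step, assume the error is scaled by a learning rate $\alpha$ before it is added, and that the current estimate $q_{old}(s',a')$ stands in for the unknown $q_*$ inside the max. The update then expands into a weighted average of the old value and the new target:

$$
\begin{aligned}
q_{new}(s,a) &= q_{old}(s,a) + \alpha\,\delta \\
&= q_{old}(s,a) + \alpha\left[R_{t+1} + \gamma \max_{a'} q_{old}(s',a') - q_{old}(s,a)\right] \\
&= (1-\alpha)\,q_{old}(s,a) + \alpha\left[R_{t+1} + \gamma \max_{a'} q_{old}(s',a')\right]
\end{aligned}
$$

With $\alpha=1$ the old value is overwritten by the target, and with $\alpha=0$ nothing is learned.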

### Training


The training process is as follows:

1. We initialize a q-value matrix $X_{m\times n}$, typically with zeros, where $m$ is the number of states and $n$ is the number of actions.
2. Each time the agent takes an action, the new q-value is calculated with learning rate $\alpha$:
$$ q_{new}(s,a)= (1-\alpha)q_{old}(s,a) + \alpha (R_{t+1}+ \gamma \max_{a'} q(s',a'))$$

3. The agent can use the q-values to choose, in every state, the action with the highest q-value.
4. This whole process is repeated until a termination condition is met or a fixed number of steps has been taken.
5. The optimal q-function gives us the optimal policy: in each state, pick the action with the largest q-value.
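The steps above can be sketched as a tabular training loop. This is a minimal illustration under stated assumptions, not a definitive implementation: the five-cell corridor, the hyperparameters `ALPHA`, `GAMMA`, and `EPSILON`, and the fixed exploration rate are all hypothetical choices.

```python
import random

N_STATES, N_ACTIONS = 5, 2      # hypothetical corridor: action 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

def step(state, action):
    """Toy dynamics: reward 1 only on reaching the right-most cell."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def greedy_action(q_row, rng):
    """Pick an action with the highest q-value, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])

def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # step 1: m x n table of zeros
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < EPSILON:
                action = rng.randrange(N_ACTIONS)       # explore
            else:
                action = greedy_action(q[state], rng)   # exploit
            next_state, reward, done = step(state, action)
            # A terminal state has no future return, so its max is taken as 0
            best_next = 0.0 if done else max(q[next_state])
            target = reward + GAMMA * best_next
            # step 2: blend the old value with the new target
            q[state][action] = (1 - ALPHA) * q[state][action] + ALPHA * target
            state = next_state
            if done:
                break
    return q

q = train()
policy = [greedy_action(row, random.Random(0)) for row in q[:-1]]
print(policy)  # greedy action for each non-terminal state
```

Reading off the greedy action in each row of the learned table is how the q-function is turned into a policy (step 5).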

The exploration-exploitation tradeoff determines how much the agent tries new paths in the environment versus how much it tries to maximize rewards based on what it has already seen. It is typically handled through an $\epsilon$-greedy strategy. 

1. We initially set $\epsilon=1$ and decay it in each step.
2. At every time step, we draw a random number $r$ uniformly from $[0,1)$. We explore (pick a random action) if $r<\epsilon$ and exploit (pick the action with the highest q-value) if $r\geq\epsilon$. With $\epsilon=1$, the agent therefore starts out exploring all the time. 
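A minimal sketch of $\epsilon$-greedy selection with a decaying $\epsilon$. The convention used here is that the agent explores when the random draw falls below $\epsilon$, so $\epsilon=1$ means always exploring. The exponential schedule and its parameters (`eps_min`, `decay_rate`), and decaying per episode rather than per step, are assumptions made for illustration.

```python
import math
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise take a greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                      # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])     # exploit

def decayed_epsilon(episode, eps_max=1.0, eps_min=0.01, decay_rate=0.01):
    """Hypothetical exponential schedule from eps_max down towards eps_min."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay_rate * episode)

rng = random.Random(0)
q_row = [0.1, 0.7, 0.2, 0.0]   # made-up q-values for one state; action 1 is greedy
for episode in (0, 100, 1000):
    eps = decayed_epsilon(episode)
    picks = [epsilon_greedy(q_row, eps, rng) for _ in range(1000)]
    print(episode, round(eps, 3), picks.count(1) / len(picks))
```

As $\epsilon$ shrinks, the share of greedy picks rises, so the agent shifts from exploring to exploiting.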

# Implementing RL

In [8]:
# Frozen Lake Game
import numpy as np
import gymnasium as gym
import random 
import time
from IPython.display import clear_output

In [9]:
# initializing environment and agent
env = gym.make('FrozenLake-v1', render_mode='ansi')