In [1]:
from IPython.display import Image
import os

### Reinforcement Learning

#### Disclaimer

### I am not an expert. I don't really know that much about reinforcement learning as applied to a machine learning context beyond a couple of books and some research. There are ~probably~ people in here who know more than me, and I welcome any corrections, criticism or questions.


The aim of this presentation is to give a *general overview* of reinforcement learning as it applies to machine learning. We're going to look at the history briefly, and then dive into an example.

We are not going to be getting into any of the tricksy mathematics, so far as we can avoid it.

Just because you don't understand exactly how these algorithms work, don't avoid touching the topic. It's cool and interesting, and apparently, ever more relevant...

### What is Reinforcement Learning?

Reinforcement learning is an area of machine learning that involves:
- **Policy**: The strategy an agent employs to determine its actions.🕹️
- **Value Function**: An estimation of future rewards given a state and action.🎯
- **Model of the Environment**: Optionally used to simulate outcomes of actions (not always used).📖

It is a cross between neuroscience, behaviorist psychology, engineering and mathematics.


It is not unsupervised learning (find the pattern) or supervised learning (you've trained on the pattern). Fundamentally, this is more about training the actions of an agent, first by trial and error, and then by figuring out how to get more of a good thing.

It is also *not* an LLM - but it has been used as the final step in training them!

### How Did It Start?

Originating from early studies in behavioral psychology, reinforcement learning has evolved through:
- Thorndike's Law of Effect, where behaviors followed by rewards become more likely.

_When a modifiable connection between a situation and a response is made and is accompanied or followed by a satisfying state of affairs, that connection's strength is increased: When made and accompanied or followed by an annoying state of affairs its strength is decreased. The strengthening effect of satisfyingness (or the weakening effect of annoyingness) upon a bond varies with the closeness of the connection between it and the bond_


### Operant conditioning

Note: this is *operant* conditioning, not *classical* conditioning. Mowgli's voluntary behaviour is affected by the consequences of his behaviour.


Let's see my glamorous assistant Mowgli assisting me:

![come](come.gif)
![shake](shake.gif)
![lie_down](lie_down.gif)
![jump](jump.gif)

We can connect this to our earlier statement:
- Mowgli's *policy* is a strategy that maximises getting treats.
- Mowgli's *rewards* are his treats.
- Mowgli's *value function* estimates that when he sees me do certain motions combined with hearing certain sound cues, sitting/lying down/coming/shaking will lead to a reward. He does those things because he *thinks* a treat will follow.


### A note about shaping

![pigeon](pigeon.jpeg)

B.F. Skinner wanted to train pigeons to fly guided missiles during WWII - and he even got it working. How does someone go about training an animal to do increasingly complex behaviour?

Through shaping. Reward closer and closer *approximations* of the behaviour, until you get the desired behaviour.

This is still fraught with challenges:

- how can you make sure the correct behaviour is trained?
- how do you balance exploration vs exploitation?


### How can we apply this to machines?

Turing talking about how he might train a network:

_When a configuration is reached for which the action is undetermined, a random choice for the missing data is made and the appropriate entry is made in the description tentatively, and is applied. When a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent_

Milestones on the way from psychology to machines:
- Skinner's operant conditioning, which introduced the use of consequences for shaping behavior. ⚡
- Bellman's development of dynamic programming, setting the stage for modern reinforcement learning algorithms.
- Sutton and Barto founding the field of reinforcement learning.

Andrew Barto and Richard Sutton: they began in 1972 and didn't stop publishing papers and researching for 45 years

### Cross-discipline advances

Advances in psychology were critical to early work in the field of machine learning (and indeed, in its inception). But what goes around comes around: years of modelling the whys of behaviour in machine learning were in turn critical to understanding how dopamine works (or in other words, how our value function works).

What is Dopamine? 

Reward? Attention? Novelty? Surprise?

It's more explained by expectation - or as they called it in machine learning: Temporal Difference Learning.


### Temporal Difference Learning

Learning guesses from a guess

As you move forward toward an uncertain future, you maintain a kind of "running expectation" of how promising things seem.

In a chess game, it might be the odds you give yourself to win the game. In a video game, it might be how much progress you expect to make or how many points you expect to rack up in total. These guesses fluctuate over time, and in general they get more accurate the closer you are to whatever it is you're trying to predict.

As our expectation fluctuates, we get differences between successive expectations, each of which is a learning opportunity. These are temporal differences.

A well-known algorithm built on this idea is TD-lambda, written TD(λ).
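
To make "learning a guess from a guess" concrete, here is a minimal sketch of the simplest temporal-difference update, TD(0), rather than full TD(λ). The setup is my own assumption, not something from the demo later on: a five-state corridor where the agent steps left or right at random, earning a reward of 1 for walking off the right end and 0 for walking off the left. The function name and the values of `alpha`, `gamma` and `episodes` are illustrative.

```python
import random

def td0_random_walk(episodes=10000, alpha=0.01, gamma=1.0, seed=0):
    """Estimate the value of each state in a 5-state random walk with TD(0)."""
    rng = random.Random(seed)
    n_states = 5
    values = [0.5] * n_states              # initial guesses for every state
    for _ in range(episodes):
        state = n_states // 2              # always start in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:             # fell off the left end
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= n_states:   # fell off the right end
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # Temporal difference: the new guess minus the old guess
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error
            if done:
                break
            state = next_state
    return values

print(td0_random_walk())
```

Each update nudges the old guess for a state toward the next guess, without waiting for the episode to finish. In this toy corridor the estimates should drift toward roughly 1/6, 2/6, ..., 5/6 from left to right, which are the true chances of exiting on the right.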




### Dynamic programming link

RL takes the key steps from DP, which are all about making decisions and improving them, and adapts them for situations where you don’t have all the information from the start. It’s all about learning the best moves by practicing and seeing the results, much like learning to play a video game better by playing it more and more.


### Common Definitions in Reinforcement Learning

#### Operant Conditioning vs. Classical Conditioning
- **Operant Conditioning**: A form of learning where an individual's behavior is modified by its consequences, such as rewards and punishments.
- **Classical Conditioning**: A learning process that occurs when two stimuli are repeatedly paired; a response that is at first elicited by the second stimulus is eventually elicited by the first stimulus alone.

#### On-policy vs. Off-policy Learning
- **On-policy Learning**: The strategy where the learning agent evaluates and improves the policy that it uses to make decisions.
- **Off-policy Learning**: The strategy where the agent learns a potentially different policy from the behavior policy that is used for making decisions and collecting data.
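
One way to see the difference is in the target each kind of method learns toward. This is a small sketch using two classic tabular algorithms that are not part of the demo: SARSA (on-policy) and Q-learning (off-policy). The function names and the numbers are hypothetical, chosen only for illustration.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy: bootstrap from the action the agent will actually take next."""
    return reward + gamma * q_next[next_action]

def q_learning_target(reward, gamma, q_next):
    """Off-policy: bootstrap from the best next action, whatever the agent does next."""
    return reward + gamma * max(q_next)

# Hypothetical Q-values for the two actions available in the next state
q_next = [1.0, 3.0]

# Suppose the behaviour policy explores and picks action 0 next
print(sarsa_target(0.5, 0.9, q_next, next_action=0))
print(q_learning_target(0.5, 0.9, q_next))
```

When the agent explores, the SARSA target reflects that exploratory choice, while the Q-learning target still assumes greedy behaviour from the next state onward.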

#### Common Reinforcement Learning Algorithms
- **Proximal Policy Optimization (PPO)**: A policy gradient method for reinforcement learning that uses a clipped surrogate objective function to prevent large policy updates, thereby improving training stability.
- **Deep Q-Network (DQN)**: An algorithm that combines Q-learning with deep neural networks to approximate the Q-value function, allowing the agent to learn from high-dimensional sensory input.

### Additional Relevant Concepts

#### Markov Decision Process (MDP)
A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
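
A tiny sketch of what an MDP can look like in code, solved with value iteration (a dynamic programming method). The three-state "car engine" example, its probabilities and its rewards are all made up for illustration, and the names `TRANSITIONS` and `value_iteration` are my own.

```python
# TRANSITIONS[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal state: no actions available
}

def value_iteration(transitions, gamma=0.9, sweeps=200):
    """Repeatedly back up each state's value from the values of its successors."""
    values = {state: 0.0 for state in transitions}
    for _ in range(sweeps):
        for state, actions in transitions.items():
            if not actions:  # terminal states keep a value of 0
                continue
            # Value of a state = expected return of its best action
            values[state] = max(
                sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
                for outcomes in actions.values()
            )
    return values

print(value_iteration(TRANSITIONS))
```

The "partly random" part is the probabilities on each outcome; the "partly under control" part is the choice of action. Value iteration only works here because the full transition table is known up front, which is exactly what a reinforcement learning agent usually lacks.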

#### Temporal Difference (TD) Learning
A class of model-free reinforcement learning methods that learn by bootstrapping from the estimated value of subsequent states, rather than waiting for a final outcome.

#### Exploration vs. Exploitation
A dilemma faced by learning agents about whether to explore new possibilities (exploration) or choose options that are known to yield high rewards (exploitation).

#### Reward Function
A function used to signal the success of an agent's actions. It's the key feedback signal that guides the learning algorithm in reinforcement learning.

#### ε-greedy Policy
A common policy for controlling the agent's balance between exploration and exploitation. With a certain probability ε, the agent explores a random action, and with probability 1-ε, it exploits the currently known best action.
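
A minimal sketch of ε-greedy action selection, assuming the agent already holds a list of estimated values, one per action. The function name and the example values are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# Hypothetical value estimates for "push left" (0) and "push right" (1)
estimates = [0.1, 0.9]
choices = [epsilon_greedy(estimates, epsilon=0.1) for _ in range(1000)]
print(choices.count(1) / len(choices))  # mostly action 1, with occasional exploration
```

Note that the random branch can also land on the greedy action, so with two actions and ε = 0.1 the best-known action is chosen about 95% of the time.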

These definitions and concepts form the foundation of reinforcement learning theory and practice, providing a framework for developing algorithms that enable machines to learn and make decisions autonomously.


#### Dynamic Programming
This technique solves problems by breaking them into smaller, overlapping sub-problems. The results are then stored in a table to be reused so the same problem will not have to be computed again. 

In [4]:
# Function for nth Fibonacci number (naive, not using dynamic programming)

def fibonacci_basic(n):
    # Fail loudly on bad input instead of printing and returning None
    if n <= 0:
        raise ValueError("Incorrect input: n must be a positive integer")
    # First Fibonacci number is 0
    if n == 1:
        return 0
    # Second Fibonacci number is 1
    if n == 2:
        return 1
    # Everything else is recomputed from scratch on every call
    return fibonacci_basic(n-1) + fibonacci_basic(n-2)

print(fibonacci_basic(35))

### Each number is just re-calculated every single time for every recursive call

5702887


In [5]:
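# Function for nth Fibonacci number - Dynamic Programming

FibArray = [0, 1]  # table of results computed so far

def fibonacci_dynamic(n):
    # n == 0 would otherwise silently return the last table entry
    if n <= 0:
        raise ValueError("Incorrect input: n must be a positive integer")
    # Already computed: look it up in the table
    if n <= len(FibArray):
        return FibArray[n-1]
    # Not computed yet: compute it once, then store it for reuse
    temp_fib = fibonacci_dynamic(n-1) + fibonacci_dynamic(n-2)
    FibArray.append(temp_fib)
    return temp_fib

print(fibonacci_dynamic(50))
print(FibArray)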

### Each number is calculated just once

7778742049
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976, 7778742049]


### Live Demonstration

In [2]:
Image(url='https://media.tenor.com/IJuLSEYNCcAAAAAC/its-happening.gif')

In [3]:
Image(url='https://gymnasium.farama.org/_images/cart_pole.gif')

In [6]:
!pip install 'stable-baselines3[extra]'
!pip install gymnasium





In [8]:
import os
import gymnasium as gym # https://gymnasium.farama.org/
import pygame

from IPython import display
from stable_baselines3 import PPO 
from stable_baselines3.common.vec_env import DummyVecEnv 
from stable_baselines3.common.evaluation import evaluate_policy 

In [9]:
environment_name = 'CartPole-v0'

## Cartpole - training an agent to balance a pole on top of a cart.
### Action Space: Defines the actions you can take in your environment. So, in our case, we have two actions in our action space:
 0: Push cart to the left
 1: Push cart to the right


In [None]:
env = gym.make(environment_name, render_mode="human")
env.action_space

### Observation Space: Defines what you can see:
| Num | Observation           | Min                 | Max                |
|-----|-----------------------|---------------------|--------------------|
| 0   | Cart Position         | -4.8                | 4.8                |
| 1   | Cart Velocity         | -Inf                | Inf                |
| 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°)  |
| 3   | Pole Angular Velocity | -Inf                | Inf                |

In [10]:
env = gym.make(environment_name, render_mode="human")
env.reset()

  logger.deprecation(


(array([-0.00712545, -0.02686494, -0.02453898,  0.04317247], dtype=float32),
 {})

The observations here live in a *Box* space: an array (vector) of numbers, each within a specific range, like the dimensions of a box. The actions live in a *Discrete* space: just the integers 0 and 1.

Or for the more mathematically minded:

![box_env](box_env.png)

### Rewards - Keep the pole upright for as long as possible, a reward of +1 for every step taken

### Episode End
The episode ends if any one of the following occurs:

1. Termination: Pole Angle is greater than ±12°
2. Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)
3. Truncation: Episode length is greater than 500 (200 for v0)
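
The three rules above can be sketched as a small function. This is my own illustration of the logic, not Gymnasium's source code; the function name, the constant names and the choice to truncate once the step limit is reached are assumptions.

```python
import math

ANGLE_LIMIT = math.radians(12)   # about 0.209 rad
POSITION_LIMIT = 2.4

def episode_status(cart_position, pole_angle, steps_taken, max_steps=500):
    """Return (terminated, truncated) for a CartPole-like episode."""
    # Termination: the task has failed (pole tipped too far or cart left the track)
    terminated = abs(cart_position) > POSITION_LIMIT or abs(pole_angle) > ANGLE_LIMIT
    # Truncation: nothing went wrong, we simply hit the time limit
    truncated = steps_taken >= max_steps
    return terminated, truncated

print(episode_status(cart_position=0.1, pole_angle=0.05, steps_taken=42))
print(episode_status(cart_position=0.1, pole_angle=0.05, steps_taken=200, max_steps=200))
```

The distinction matters for the loops that follow: an episode is over when *either* flag is set.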

### More info

- This is an *episodic* task (as opposed to a *continuing* one). An episode lasts at most 200 steps in v0 rather than running forever, like a game where you run out of lives.

In [12]:
# let's see what it looks like when we randomly take actions
env = gym.make(environment_name, render_mode="human")

episodes = 5
for episode in range(1, episodes+1):
    obs, info = env.reset() # Initial set of observations (reset returns a tuple)
    terminated = False
    truncated = False
    score = 0

    # An episode is over when it terminates (failure) or is truncated (time limit)
    while not (terminated or truncated):
        env.render()
        action = env.action_space.sample() # only a 1 or a 0
        obs, reward, terminated, truncated, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
pygame.display.quit()
env.close()

Episode:1 Score:22.0
Episode:2 Score:12.0
Episode:3 Score:12.0
Episode:4 Score:25.0
Episode:5 Score:15.0


### Randomness won't get you nowhere, y'hear

So - randomly taking actions is the baseline. There's no training going on here, we're just randomly picking something in the action space (here, that's just a zero or a one for left and right). So let's pick an algorithm and get training!

### Algorithm Choice

There's a ton of different algos out there to choose from. 
Two main types:

Model-Free RL:
Learns directly from experience, using only the states and rewards it observes to make its predictions. It does not build an explicit understanding or representation of how the environment works.

Model-Based RL:
Learns and uses an explicit model of the environment. The model predicts how the environment will behave in the future under different actions, and those predictions are used to decide which actions are likely to lead to the best outcomes.

| Name           | Box | Discrete | MultiDiscrete | MultiBinary | Multi Processing |
|----------------|-----|----------|---------------|-------------|------------------|
| ARS 1          | ✔️   | ✔️        | ❌             | ❌           | ✔️                |
| A2C            | ✔️   | ✔️        | ✔️             | ✔️           | ✔️                |
| DDPG           | ✔️   | ❌        | ❌             | ❌           | ✔️                |
| DQN            | ❌   | ✔️        | ❌             | ❌           | ✔️                |
| HER            | ✔️   | ✔️        | ❌             | ❌           | ✔️                |
| PPO            | ✔️   | ✔️        | ✔️             | ✔️           | ✔️                |
| QR-DQN 1       | ❌   | ✔️        | ❌             | ❌           | ✔️                |
| RecurrentPPO 1 | ✔️   | ✔️        | ✔️             | ✔️           | ✔️                |
| SAC            | ✔️   | ❌        | ❌             | ❌           | ✔️                |
| TD3            | ✔️   | ❌        | ❌             | ❌           | ✔️                |
| TQC 1          | ✔️   | ❌        | ❌             | ❌           | ✔️                |
| TRPO 1         | ✔️   | ✔️        | ✔️             | ✔️           | ✔️                |
| Maskable PPO 1 | ❌   | ✔️        | ✔️             | ✔️           | ✔️                |


Which algo you can use is based on the action space, not so much the observation space. So, in our case, we have a discrete action space. We're going to use PPO.

Because it was in the tutorial I watched. Sue me.

It stands for Proximal Policy Optimization.

Proximal - because it adjusts the policy gradually
It tries to balance exploration and exploitation.
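
Here is a minimal sketch of how "adjusts the policy gradually" can be enforced, using the clipped surrogate objective for a single action. The function name and example numbers are my own; `clip_range=0.2` is just a commonly used value, not a requirement.

```python
def clipped_surrogate(new_prob, old_prob, advantage, clip_range=0.2):
    """PPO-style clipped objective for one (state, action) sample."""
    # How much more (or less) likely the new policy makes this action
    ratio = new_prob / old_prob
    # Keep the ratio within [1 - clip_range, 1 + clip_range]
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the two estimates
    return min(ratio * advantage, clipped * advantage)

# A good action (positive advantage) the new policy already favours far more
print(clipped_surrogate(new_prob=0.9, old_prob=0.5, advantage=1.0))
# A bad action (negative advantage) the new policy favours more: no clipping relief
print(clipped_surrogate(new_prob=0.9, old_prob=0.5, advantage=-1.0))
```

Once the ratio is outside the clip range in the direction that would improve the objective, pushing it further earns nothing extra, so there is no incentive for one huge policy jump.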

### Pseudocode example of PPO
```
# Initialize policy network
policy_network = initialize_policy_network()

# Set hyperparameters
# These are not learned from the data; they are set before the learning process begins.

num_episodes = 1000
num_policy_updates = 10
buffer_size = 10000
gamma = 0.99
epsilon = 0.2  # clip range
learning_rate = 0.001

# Repeat for multiple episodes
for episode in range(num_episodes):
    # Initialize buffer to store experiences
    buffer = []

    # Interact with the environment
    state, _ = env.reset()
    done = False

    while not done:
        # Collect experiences by taking actions using the current policy
        action_probabilities = policy_network.predict(state)
        action = sample_action_from_distribution(action_probabilities)

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Store experience, including the probability the current
        # (soon to be "old") policy assigned to the chosen action
        buffer.append((state, action, reward, next_state, done, action_probabilities[action]))

        state = next_state

    # Prepare data from the buffer for the policy update
    states, actions, rewards, next_states, dones, old_action_probs = zip(*buffer)

    # Calculate advantages (using e.g., generalized advantage estimation)
    advantages = calculate_advantages(rewards, dones, states, next_states, gamma)

    # Normalize advantages
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Update policy several times using the same collected experiences
    for update in range(num_policy_updates):
        # Perform policy gradient updates (PPO clipping)
        for i in range(len(states)):
            # Probability of the action under the policy being updated
            new_action_prob = policy_network.predict(states[i])[actions[i]]

            # Calculate surrogate objective (PPO objective)
            ratio = new_action_prob / (old_action_probs[i] + 1e-8)
            clipped_ratio = np.clip(ratio, 1 - epsilon, 1 + epsilon)
            surrogate = min(ratio * advantages[i], clipped_ratio * advantages[i])

            # Update policy by maximizing the surrogate objective
            policy_network.train_step(states[i], surrogate, learning_rate)
```

### Training Metrics


Let's focus on two types:

- Evaluation metrics
  - all to do with episode length and reward
- Other metrics
  - explained variance - how much of the variance in the returns our agent's value function can explain
  - learning rate - how fast our policy is updating
  - how many

### Cat break and refocus
Here's some inspiration from Mowgli to get you back into it:

![cat](cat.jpg)


Wait! Don't go to sleep! Or work on your project! Or watch TikTok!


Let's train the model. And I can answer some questions while it's training!

In [13]:
log_path = os.path.join('Training', 'Logs')
env = gym.make(environment_name, render_mode="human")
env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device


In [14]:
model.learn(total_timesteps=20000)

Logging to Training/Logs/PPO_8
-----------------------------
| time/              |      |
|    fps             | 45   |
|    iterations      | 1    |
|    time_elapsed    | 44   |
|    total_timesteps | 2048 |
-----------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 45         |
|    iterations           | 2          |
|    time_elapsed         | 89         |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.00966014 |
|    clip_fraction        | 0.0822     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.687     |
|    explained_variance   | -0.00749   |
|    learning_rate        | 0.0003     |
|    loss                 | 7.99       |
|    n_updates            | 10         |
|    policy_gradient_loss | -0.0121    |
|    value_loss           | 48.6       |
----------------------------------------
---------------------

KeyboardInterrupt: 

In [15]:
PPO_Path = os.path.join('Training', 'Saved Models', 'PPO_Model_Cartpole1')
# model.save(PPO_Path)
model = PPO.load(PPO_Path, env=env)

In [None]:
evaluate_policy(model, env, n_eval_episodes=2, render=True)
pygame.display.quit()
env.close()

In [16]:
# let's see what it looks like when the model actually works
env = gym.make(environment_name, render_mode="human")

episodes = 5
for episode in range(1, episodes+1):
    obs, info = env.reset()
    terminated = False
    truncated = False
    score = 0

    # Only `terminated` is checked here, so an episode keeps running past the
    # truncation limit (200 steps for v0) until the pole falls or the cart leaves the track
    while not terminated:
        env.render()
        action, _states = model.predict(obs) # Now using model here; the second value only matters for recurrent policies
        obs, reward, terminated, truncated, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
pygame.display.quit()
env.close()

  logger.deprecation(


Episode:1 Score:223.0
Episode:2 Score:256.0
Episode:3 Score:330.0
Episode:4 Score:240.0
Episode:5 Score:246.0


So that was pretty neat!

We can use tensorboard to look at actual metrics for how the model performed:

In [None]:
training_log_path = os.path.join(log_path, 'PPO_1')
!tensorboard --logdir={training_log_path}

TensorFlow installation not found - running with reduced feature set.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.15.1 at http://localhost:6006/ (Press CTRL+C to quit)


### Now what?

Questions?