# [IA Frameworks](https://github.com/wikistat/AI-Frameworks) - Introduction to Deep Reinforcement Learning 

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
    
</center>

# Part 1 : Q-Learning
The objective of this notebook is the following:

* Implement Q-Learning on a simple Markov Decision Process

Source : [https://github.com/ageron/handson-ml](https://github.com/ageron/handson-ml) and https://github.com/breeko/Simple-Reinforcement-Learning-with-Tensorflow/blob/master/Part%202%20-%20Policy-based%20Agents%20with%20Keras.ipynb

# Import libraries

In [1]:
import copy
import numpy as np
import random
import os

# to make this notebook's output stable across runs
# (tf.reset_default_graph and tf.set_random_seed are TensorFlow 1.x APIs)
import tensorflow as tf

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# To plot figures and animations
%matplotlib inline
%matplotlib nbagg
import matplotlib
import matplotlib.animation as animation
import matplotlib.pyplot as plt
from IPython.display import HTML


plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12


import tensorflow.keras.models as km
import tensorflow.keras.layers as kl
import tensorflow.keras.initializers as ki
import tensorflow.keras.optimizers as ko
import tensorflow.keras.losses as klo
import tensorflow.keras.backend as K


# Gym library
import gym
import pandas as pd


def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

def plot_animation(frames, repeat=False, interval=400):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    return animation.FuncAnimation(fig, update_scene, fargs=(frames, patch), frames=len(frames), repeat=repeat, interval=interval)

# Markov Decision Process

## Definition

We will first define a simple Markov decision process (MDP) on which we will apply the Q-learning algorithm.

Here is an illustration of the MDP that we will define.

![images](images/mdp.png)

### Transition probabilities

We first define the different **transition probabilities** for each $(s,a,s')$ combination where
* $s$ is the `from_state`
* $a$ is the `action` taken
* $s'$ is the `to_state`

We store the **transition probabilities** in a Python list and use pandas to display them more clearly. A `None` entry means that the action is not available in that state.

In [2]:
transition_probabilities = [
        [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]], 
        [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
        [None, [0.8, 0.1, 0.1], None],
    ]

transition_probabilities_df = pd.DataFrame(transition_probabilities).rename_axis('Actions', axis=1)
transition_probabilities_df.index.name="State"
transition_probabilities_df

| State | Action 0 | Action 1 | Action 2 |
|---|---|---|---|
| 0 | [0.7, 0.3, 0.0] | [1.0, 0.0, 0.0] | [0.8, 0.2, 0.0] |
| 1 | [0.0, 1.0, 0.0] | None | [0.0, 0.0, 1.0] |
| 2 | None | [0.8, 0.1, 0.1] | None |


### Rewards 

We also define the **rewards** for each $(s,a,s')$ combination.

In [3]:
rewards = [
        [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
        [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
        [[0, 0, 0], [+40, 0, 0], [0, 0, 0]],
    ]

rewards_df = pd.DataFrame(rewards).rename_axis('Actions', axis=1)
rewards_df.index.name="State"
rewards_df

| State | Action 0 | Action 1 | Action 2 |
|---|---|---|---|
| 0 | [10, 0, 0] | [0, 0, 0] | [0, 0, 0] |
| 1 | [0, 0, 0] | [0, 0, 0] | [0, 0, -50] |
| 2 | [0, 0, 0] | [40, 0, 0] | [0, 0, 0] |
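To make the reward table concrete, here is a minimal sketch that combines it with the transition probabilities to compute the expected immediate reward of each available $(s,a)$ pair, $\sum_{s'}P^a_{s,s'} R(s,a,s')$. The helper name `expected_reward` is an assumption introduced for illustration; the two tables are copied from the cells above.

```python
import numpy as np

transition_probabilities = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None],
]
rewards = [
    [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
    [[0, 0, 0], [+40, 0, 0], [0, 0, 0]],
]

def expected_reward(state, action):
    """Hypothetical helper: expected one-step reward, or None if the action is unavailable."""
    probs = transition_probabilities[state][action]
    if probs is None:
        return None
    # Weight each possible reward by the probability of reaching the matching next state.
    return float(np.dot(probs, rewards[state][action]))

for state in range(3):
    for action in range(3):
        value = expected_reward(state, action)
        if value is not None:
            print("state={}, action={}: expected reward={:.1f}".format(state, action, value))
```

Only three pairs have a non-zero expected reward: $(0,0)$ with $0.7 \times 10 = 7$, $(1,2)$ with $1.0 \times (-50) = -50$, and $(2,1)$ with $0.8 \times 40 = 32$.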


### Actions

We also define the list of possible **actions** that can be taken in each state.

In [4]:
possible_actions = [[0, 1, 2], [0, 2], [1]]

possible_actions_df = pd.DataFrame([[x] for x in possible_actions], columns=["List of possible actions"])
possible_actions_df.index.name="State"
possible_actions_df

| State | List of possible actions |
|---|---|
| 0 | [0, 1, 2] |
| 1 | [0, 2] |
| 2 | [1] |
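Since the three tables are written by hand, a quick consistency check can catch typos before any algorithm is run. The sketch below assumes a hypothetical helper `check_mdp` (not part of the original notebook) that verifies that each available action has a probability distribution summing to 1 and that the non-`None` entries match `possible_actions`.

```python
import numpy as np

transition_probabilities = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None],
]
possible_actions = [[0, 1, 2], [0, 2], [1]]

def check_mdp(transitions, actions_per_state):
    """Hypothetical helper: raise AssertionError if the MDP tables are inconsistent."""
    for state, row in enumerate(transitions):
        # Actions with a defined distribution must match the declared possible actions.
        defined = [a for a, probs in enumerate(row) if probs is not None]
        assert defined == sorted(actions_per_state[state]), "state {}".format(state)
        for action in defined:
            # Each distribution over next states must sum to 1.
            assert np.isclose(sum(row[action]), 1.0), (state, action)
    return True

print(check_mdp(transition_probabilities, possible_actions))
```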


## Class environment

Finally, we define a class that will act as a Gym environment.

* The environment is the MDP.
* The observation is the current state.
* The possible actions are the three actions we previously defined.

In [5]:
class MDPEnvironment(object):
    def __init__(self, start_state=0):
        self.start_state=start_state
        self.reset()
    def reset(self):
        self.total_rewards = 0
        self.state = self.start_state
    def step(self, action):
        next_state = np.random.choice(range(3), p=transition_probabilities[self.state][action])
        reward = rewards[self.state][action][next_state]
        self.state = next_state
        self.total_rewards += reward
        return self.state, reward
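The `step` method relies on `np.random.choice` with the `p=` argument to draw the next state. As a minimal sketch of what that sampling does, the block below draws many next states for state 0 and action 0 and compares the empirical frequencies with the table. It uses a seeded `np.random.default_rng` generator instead of the global `np.random` functions; that substitution is an assumption made here for reproducibility, not what the class itself does.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)
probs = [0.7, 0.3, 0.0]         # transition_probabilities[0][0] from the table above
n_samples = 10000

# Draw next states 0, 1 or 2 with the given probabilities.
samples = rng.choice(3, size=n_samples, p=probs)

# Empirical frequency of each next state.
frequencies = np.bincount(samples, minlength=3) / n_samples
print(frequencies)
```

The frequencies should be close to 0.7 and 0.3, and state 2 is never drawn because its probability is 0.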

**Questions** More question to understand how it works?

## Hard Coded Policy

Let's first implement a random policy as a baseline that we want to improve on.

In [6]:
def policy_random(state):
    return np.random.choice(possible_actions[state])


def run_episode(policy, n_steps, start_state=0):
    # Play n_steps of the MDP with the given policy and return the cumulative reward
    env = MDPEnvironment(start_state)
    for step in range(n_steps):
        action = policy(env.state)
        state, reward = env.step(action)
    return env.total_rewards


all_score = []
for episode in range(1000):
    all_score.append(run_episode(policy_random, n_steps=100))
print("Summary: mean={:.1f}, std={:1f}, min={}, max={}".format(np.mean(all_score), np.std(all_score), np.min(all_score), np.max(all_score)))


Summary: mean=-24.4, std=83.257374, min=-290, max=250


**Exercise** Which policy would be the safest? Which would be the riskiest? Implement both and test them. What can you say about their results?

In [7]:
# %load solutions/exercise_2_1.py
def policy_fire(state):
    return [0, 2, 1][state]

all_score = []
for episode in range(1000):
    all_score.append(run_episode(policy_fire, n_steps=100))
print("Summary: mean={:.1f}, std={:1f}, min={}, max={}".format(np.mean(all_score), np.std(all_score), np.min(all_score), np.max(all_score)))


def policy_safe(state):
    return [0, 0, 1][state]

all_score = []
for episode in range(1000):
    all_score.append(run_episode(policy_safe, n_steps=100))
print("Summary: mean={:.1f}, std={:1f}, min={}, max={}".format(np.mean(all_score), np.std(all_score), np.min(all_score), np.max(all_score)))


Summary: mean=120.3, std=135.570008, min=-360, max=480
Summary: mean=24.3, std=29.021838, min=0, max=260


## Q-iteration

Let's now try to find the best policy! <br>
Because we know all the **transition probabilities** and **reward values** for each $(s,a,s')$ combination, we can compute this best policy using the **Q-iteration algorithm**:

$$Q_{k+1}(s,a) \leftarrow  \sum_{s'}P^a_{s,s'}\big[ R(s,a,s') + \gamma \cdot \max_{a'}~Q_k(s',a') \big]$$

In [8]:
n_states = 3
n_actions = 3
gamma = 0.99  #<-- The discount rate
q_values = np.full((n_states, n_actions), -np.inf) 
for state, action in enumerate(possible_actions):
    q_values[state][action]=0
q_values

array([[  0.,   0.,   0.],
       [  0., -inf,   0.],
       [-inf,   0., -inf]])

In [11]:
n_steps=10
for step in range(n_steps):
    q_values_ = copy.deepcopy(q_values)  # Q_k, kept fixed while Q_{k+1} is computed
    for state in range(n_states):
        # Only update available actions, so that impossible ones keep their -inf value
        for action in possible_actions[state]:
            qas = 0
            for next_state in range(n_states):
                qas += transition_probabilities[state][action][next_state] * (rewards[state][action][next_state] + gamma * max(q_values_[next_state]))
            q_values[state][action]= qas

In [12]:
optimal_action_per_state = np.argmax(q_values,axis=1)
optimal_action_per_state

array([0, 2, 1])
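The cell above runs a fixed number of sweeps. With $\gamma = 0.99$ the error is only guaranteed to shrink by a factor $\gamma$ per sweep, so 10 sweeps may leave the Q-values far from their limit even when the argmax is already stable. As a sketch under that observation, the self-contained block below repeats the same update until the largest change falls below a tolerance. The function name `q_iteration` and the parameters `tol` and `max_iterations` are assumptions introduced for illustration.

```python
import numpy as np

transition_probabilities = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None],
]
rewards = [
    [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
    [[0, 0, 0], [+40, 0, 0], [0, 0, 0]],
]
possible_actions = [[0, 1, 2], [0, 2], [1]]
gamma = 0.99

def q_iteration(tol=1e-6, max_iterations=10000):
    """Sketch: repeat the Q-iteration update until the largest change is below `tol`."""
    q = np.full((3, 3), -np.inf)
    for state, actions in enumerate(possible_actions):
        q[state, actions] = 0.0
    for iteration in range(1, max_iterations + 1):
        q_prev = q.copy()  # Q_k, kept fixed during the sweep
        for state in range(3):
            for action in possible_actions[state]:
                q[state, action] = sum(
                    transition_probabilities[state][action][next_state]
                    * (rewards[state][action][next_state] + gamma * q_prev[next_state].max())
                    for next_state in range(3)
                )
        # Compare only finite entries: impossible actions stay at -inf.
        finite = np.isfinite(q)
        if np.max(np.abs(q[finite] - q_prev[finite])) < tol:
            break
    return q, iteration

q_converged, n_iterations = q_iteration()
print("sweeps used:", n_iterations)
print(q_converged)
print(np.argmax(q_converged, axis=1))
```

Comparing `q_converged` with the table obtained after 10 sweeps shows how much the values still move, even if the greedy actions do not.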

In [13]:
def optimal_policy(state):
    return optimal_action_per_state[state]

In [14]:
all_totals = []
for episode in range(1000):
    all_totals.append(run_episode(optimal_policy, n_steps=100))
print("Summary: mean={:.1f}, std={:1f}, min={}, max={}".format(np.mean(all_totals), np.std(all_totals), np.min(all_totals), np.max(all_totals)))
print()

Summary: mean=121.2, std=127.635330, min=-330, max=510



## Q-Learning

Let's now implement the Q-learning algorithm to learn a better policy!

Q-Learning works by watching an agent play (e.g., randomly) and gradually improving its estimates of the Q-Values. 
Once it has accurate Q-Value estimates (or close enough), then the optimal policy consists in choosing the action that has the highest Q-Value (i.e., the greedy policy).
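One common way to write this gradual improvement is the update below, given as a sketch in the notation of the Q-iteration formula above. Here $\alpha$ is the learning rate, $r$ is the reward observed after taking action $a$ in state $s$, and $s'$ is the state actually reached. The observed sample replaces the sum over $s'$, so the transition probabilities are never used:

$$Q_{k+1}(s,a) \leftarrow (1-\alpha)\,Q_k(s,a) + \alpha\big[ r + \gamma \cdot \max_{a'}~Q_k(s',a') \big]$$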

We first initialize:
* The different parameters (the learning rate $\alpha$ and the discount rate $\gamma$)
* The number of steps to play
* The exploration policy (a random one)
* The Q-value table

In [21]:
n_states = 3
n_actions = 3
n_steps = 200000
alpha = 0.01  #<-- Learning Rate
gamma = 0.99  #<-- The discount rate


 
exploration_policy = policy_random #<-- Policy that we will play during exploration
q_values = np.full((n_states, n_actions), -np.inf) #<-- Q-value table that will be updated
for state, actions in enumerate(possible_actions):
    q_values[state][actions]=0
q_values

array([[  0.,   0.,   0.],
       [  0., -inf,   0.],
       [-inf,   0., -inf]])

**Exercise**
Run *n_steps* over the MDP and update the Q-values table at each step according to the Q-learning iteration algorithm

In [22]:
# %load solutions/exercise_2_21.py
env = MDPEnvironment()
for step in range(n_steps):
    action = exploration_policy(env.state)
    state = env.state
    next_state, reward = env.step(action)
    next_value = np.max(q_values[next_state]) # greedy policy
    q_values[state, action] = (1-alpha)*q_values[state, action] + alpha*(reward + gamma * next_value)
q_values

array([[119.71944697, 118.1115916 , 113.8363728 ],
       [ 99.90285568,         -inf, 100.79449868],
       [        -inf, 152.24852118,         -inf]])

In [23]:
optimal_action_per_state = np.argmax(q_values,axis=1)
optimal_action_per_state

array([0, 2, 1])

**Exercise** How do we define the optimal policy from the computed Q-values? Implement it.

In [24]:
def optimal_policy(state):
    return optimal_action_per_state[state]

Compute its performance.

In [None]:
all_totals = []
for episode in range(1000):
    all_totals.append(run_episode(optimal_policy, n_steps=100))
print("Summary: mean={:.1f}, std={:1f}, min={}, max={}".format(np.mean(all_totals), np.std(all_totals), np.min(all_totals), np.max(all_totals)))
print()

**Q** We used the Q-learning algorithm to learn the best policy. Would it have been possible to use a different algorithm here? Why?

# Frozen lake
Add Q-learning and approximate Q-learning on frozen lake example?