<a href="https://colab.research.google.com/github/wikistat/AI-Frameworks/blob/master/IntroductionDeepReinforcementLearning/Policy_Gradient.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [IA Frameworks](https://github.com/wikistat/AI-Frameworks) - Introduction to Deep Reinforcement Learning 

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
    
</center>

# Part 2 : Policy Gradient Algorithm

The objectives of this notebook are the following:

* Implement hard-coded and neural network policies to solve the *CartPole* game.
* Implement the Policy Gradient algorithm to solve the *CartPole* game.


# Files & Data (Google Colab)

If you're running this notebook on Google Colab, you do not have access to the `solutions` folder you get by cloning the repository locally. 

The following lines will build the folders and files you need for this TP.

**WARNING 1** Do not run these lines locally. <br>
**WARNING 2** The magic command `%load` does not work on Google Colab; you will have to copy-paste the solutions into the notebook.

In [None]:
! mkdir solutions
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/pg_simple_policy.py
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/pg_neural_network_policy.py
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/pg_learn_given_policy.py
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/PG_class.py
! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/discounted_rewards.py
! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/keras_model.py

# Import libraries

In [None]:
import numpy as np
import random
import math
from tqdm import tqdm

# To plot figures and animations
import matplotlib
import matplotlib.animation as animation
import matplotlib.pyplot as plt
from IPython.display import HTML


#Tensorflow/Keras utils
import tensorflow.keras.models as km
import tensorflow.keras.layers as kl
import tensorflow.keras.initializers as ki
import tensorflow.keras.optimizers as ko
import tensorflow.keras.backend as K


# Gym Library
import gym

The following functions build a video from a list of images. <br>
They will be used to build videos of the games you will play.

In [None]:
def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

def plot_animation(frames, repeat=False, interval=40):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    return animation.FuncAnimation(fig, update_scene, fargs=(frames, patch), frames=len(frames), repeat=repeat, interval=interval)

# OpenAI Gym Library
<a href="https://gym.openai.com/" ><img src="https://gym.openai.com/assets/dist/home/header/home-icon-54c30e2345.svg" style="float:left; max-width: 120px; display: inline" alt="OpenAI Gym"/></a> 

In this notebook we will be using [OpenAI gym](https://gym.openai.com/), a great toolkit for developing and comparing Reinforcement Learning algorithms. It provides many environments for your learning *agents* to interact with.

# A simple environment: the Cart-Pole

## Description
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

### Observation

Num | Observation | Min | Max
---|---|---|---
0 | Cart Position | -2.4 | 2.4
1 | Cart Velocity | -Inf | Inf
2 | Pole Angle | ~ -41.8&deg; | ~ 41.8&deg;
3 | Pole Velocity At Tip | -Inf | Inf

### Actions

Num | Action
--- | ---
0 | Push cart to the left
1 | Push cart to the right

Note: The amount the velocity is reduced or increased is not fixed as it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it

### Reward
Reward is 1 for every step taken, including the termination step

### Starting State
All observations are assigned a uniform random value between ±0.05

### Episode Termination
1. Pole Angle is more than ±12°
2. Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
3. Episode length is greater than 200

### Solved Requirements
Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.

The description above is part of the official description of this environment. Read the full description [here](https://github.com/openai/gym/wiki/CartPole-v0).

The following command will load the `CartPole` environment.

In [None]:
env = gym.make("CartPole-v0")

In [None]:
env.reset()
img = env.render(mode = "rgb_array")
env.close()
print("The environment image is a %dx%dx%d array" % img.shape)

In [None]:
plt.imshow(img)
plt.axis("off")

If you have forgotten how the `CartPole` environment works, open the `Deep_Q_Learning_CartPole.ipynb` notebook and run its explanation cells.

# Hard coded policy

How can we make the pole remain upright? We will need to define a _policy_ for that. 

This is the strategy that the agent will use to select an action at each step. It can use all the past actions and observations to decide what to do.

Let's first implement **hard-coded policies**, *i.e.* simple rules that define which action to take according to the observed parameters.

In [None]:
def run_one_episode(policy, return_frames=False):
    frames = []
    observation = env.reset()    
    reward_episod = 0    
    done = False
    while not(done):
        action = policy(observation)
        observation, reward, done, _ = env.step(action)
        reward_episod += reward
        if return_frames:
            img = env.render(mode = "rgb_array")
            env.close()
            frames.append(img)
    return reward_episod, frames

In [None]:
def play_games(policy, n_games=100):
    all_reward_sum = []
    for n_game in range(1, n_games + 1):
        reward_episod, _ = run_one_episode(policy)
        # Store the score first, so the statistics below never run on an empty list
        all_reward_sum.append(reward_episod)
        if n_game % 10 == 0:
            print("Games played: %d. Mean and standard deviation of the reward over the last 10 episodes: %.1f - %.1f" %(n_game, np.mean(all_reward_sum[-10:]), np.std(all_reward_sum[-10:])) )
    print("Over %d episodes, mean reward: %.1f, std : %.1f" %(n_games, np.mean(all_reward_sum), np.std(all_reward_sum)))

## Random policy

Let's start with a completely random policy and see how long the pole remains upright over 100 episodes.

In [None]:
def policy_random(state):
    return env.action_space.sample()
play_games(policy = policy_random)

#### Visualize a complete game

Let's run one episode with this random policy and save all the images representing the environment at each step.

In [None]:
reward_episod, frames = run_one_episode(policy = policy_random, return_frames=True)
HTML(plot_animation(frames).to_html5_video())

## Simple strategy

Let's hard code a simple strategy: if the pole is tilting to the left, then push the cart to the left, and _vice versa_. Let's see if that works.

**Exercise** Implement this policy and play 100 games with it. What are the mean and standard deviation of the reward sum over the 100 games?
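
As a hint, the rule above can be sketched in a few lines. This is only one possible reading, not the reference solution: the name `simple_policy_sketch` is made up for this illustration, and it assumes that the pole angle is at index 2 of the observation (as in the observation table above) and that a negative angle means the pole is tilting to the left.

```python
def simple_policy_sketch(observation):
    """Push the cart towards the side the pole is leaning to."""
    # Assumption: observation[2] is the pole angle, negative when tilting left.
    pole_angle = observation[2]
    # Action 0 pushes the cart to the left, action 1 pushes it to the right.
    return 0 if pole_angle < 0 else 1


# The pole leans left here, so the sketch pushes the cart to the left.
action = simple_policy_sketch([0.0, 0.0, -0.02, 0.0])
```

Such a function can be passed to `play_games` in the same way as `policy_random`.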

In [None]:
# %load solutions/pg_simple_policy.py

In [None]:
play_games(policy = simple_policy)

**Q** What can you say about this strategy?

**Exercise** Visualize a complete game:

In [None]:
reward_episod, frames = run_one_episode(policy = simple_policy, return_frames=True)
HTML(plot_animation(frames).to_html5_video())

# Neural Network Policies

Let's create a neural network to build a better policy.

This network will take the observations as inputs, and output the probability of the action to take for each observation. <br>

In the case of the Cart-Pole environment, there are just two possible actions (left or right), so we only need one output neuron: it will output the probability `p` of action 1 (right), and of course the probability of action 0 (left) will be `1 - p`.

Let's first see how this neural network policy works without training it, and then let's try to make it learn the simple policy defined above.

## The architecture

Because this problem is simple, we can define a very simple architecture for our neural network.<br> 
    
Here it is a simple MLP with one hidden layer of 9 neurons, taking the 4 observations as inputs.

In [None]:
# Specify the network architecture
n_inputs = 4  # == env.observation_space.shape[0]
n_hidden = 9  # it's a simple task, we don't need more than this
n_outputs = 1 # only outputs the probability of accelerating right

# Build the neural network
policy_network=km.Sequential()
policy_network.add(kl.Dense(n_hidden, input_shape = (n_inputs,), activation = "relu"))
policy_network.add(kl.Dense(n_outputs, activation = "sigmoid"))
policy_network.summary()

Note that the model is not compiled so far: no loss function is defined. 

## Policy from a neural network
We can now easily predict the probability of both actions given the observation.

**Exercise** Define a function to choose the action to take from an observation and the neural network.
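
To make the idea concrete, here is a minimal sketch of how an action could be drawn from the network output. It is not the reference solution: the name `neural_network_policy_sketch` and the `rng` argument are hypothetical, and the sketch only assumes that `model.predict` accepts a batch of shape `(1, n_inputs)` and returns something array-like holding the probability `p` of action 1.

```python
import numpy as np


def neural_network_policy_sketch(observation, model, rng=None):
    """Sample an action from the probability predicted by `model`."""
    if rng is None:
        rng = np.random.default_rng()
    # The network expects a batch, so the observation is reshaped to (1, n_inputs).
    batch = np.reshape(observation, (1, -1))
    # Assumption: the first value of the output is p = P(action == 1).
    p_right = float(np.asarray(model.predict(batch)).ravel()[0])
    # Play right (1) with probability p, left (0) with probability 1 - p.
    return 1 if rng.random() < p_right else 0
```

Sampling the action, instead of always taking the most likely one, lets the agent keep exploring while the network is still uncertain.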

In [None]:
# %load solutions/pg_neural_network_policy.py

***NB*** :  In this particular environment, the past actions and observations can safely be ignored, since each observation contains the environment's full state. If there were some hidden state then you may need to consider past actions and observations in order to try to infer the hidden state of the environment. 

For example, if the environment only revealed the position of the cart but not its velocity, you would have to consider not only the current observation but also the previous observation in order to estimate the current velocity. <br> Another example:  if the observations are noisy: you may want to use the past few observations to estimate the most likely current state. Our problem is thus as simple as can be: the current observation is noise-free and contains the environment's full state.

## Random neural network policy
Let's see how this neural network policy performs.



In [None]:
play_games(policy = lambda obs : neural_network_policy(obs, model=policy_network), n_games=10)

Let's use this randomly initialized policy network to play one game:

In [None]:
reward_episod, frames = run_one_episode(policy = lambda obs : neural_network_policy(obs, model=policy_network), return_frames=True)
HTML(plot_animation(frames).to_html5_video())

The neural network is working, but it still acts randomly because we have not trained it. Let's try to make it learn a better policy.

## Learn a given policy

In this part we will train the neural network so that it learns the simple strategy we hard-coded before: if the pole is tilting to the left, then push the cart to the left, and _vice versa_. <br>


The class defined below trains the neural network to learn this simple policy.

The **pseudo code** is quite simple here:

while *n_episode_max* is not reached:
    * Play an episode and return the observations and targets.
    * Train the network on these observations and targets.


**Exercise**: Complete the code below: <br>
* Choose a loss to compile the model with in the `init_model` method.
* Write the `play_one_episode` method that will: 
    * Play an episode until its end with the hard-coded policy.
    * Return the observations and targets of this episode, i.e. the action to take according to each observation.
    
You can do this exercise in this notebook or with the `PG_learn_a_policy.py` and `PG_learn_a_policy_solution.py` files.
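
The data collection step can be sketched as follows. This is an illustration under assumptions, not the reference solution: `collect_supervised_episode` and `teacher_policy` are hypothetical names, and the environment is assumed to follow the classic Gym interface used in this notebook (`reset()` returns an observation, `step()` returns a 4-tuple).

```python
import numpy as np


def collect_supervised_episode(env, teacher_policy):
    """Play one episode with `teacher_policy` and record (observation, action) pairs."""
    train_data = []
    observation = env.reset()
    done = False
    while not done:
        # The action chosen by the hard-coded policy is the target to imitate.
        action = teacher_policy(observation)
        train_data.append((np.asarray(observation), action))
        observation, _, done, _ = env.step(action)
    return train_data
```

Each pair holds the observation first and the target action second, which is the layout the `train` method below reads with `x[0]` and `x[1]`.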

In [None]:
class PG:

    def __init__(self):
        # Environment
        self.env = gym.make("CartPole-v0")
        self.dim_input = self.env.observation_space.shape[0]

        # Model
        self.model = self.init_model()
        self.n_episode_max = 1000

    def init_model(self):

        # Build the neural network
        policy_network = km.Sequential()
        policy_network.add(kl.Dense(9, input_shape=(self.dim_input,), activation="relu"))
        policy_network.add(kl.Dense(1, activation="sigmoid"))
        policy_network.compile(loss=, optimizer=ko.Adam(), metrics=['accuracy'])
        return policy_network

    def play_one_episode(self):
        # Todo
        return train_data

    def train(self):

        for iteration in tqdm(range(self.n_episode_max)):
            train_data = self.play_one_episode()
            n_step = len(train_data)
            target = np.array([x[1] for x in train_data]).reshape((n_step, 1))
            observations = np.array([x[0] for x in train_data])
            self.model.train_on_batch(observations, target)

In [None]:
pg = PG()
pg.train()

In [None]:
# %load solutions/pg_learn_given_policy.py

In [None]:
play_games(policy = lambda obs : neural_network_policy(obs, model=pg.model), n_games=10)

In [None]:
reward_episod, frames = run_one_episode(policy = lambda obs : neural_network_policy(obs, model=pg.model), return_frames=True)
HTML(plot_animation(frames).to_html5_video())

Looks like it learned the policy correctly! <br>

Let's now reach our final target: the neural network has to find a better policy on its own.

# Policy Gradients

The idea behind *Policy Gradients* is quite simple: the algorithm first plays multiple games, then makes the actions taken in good games slightly more likely, while the actions taken in bad games are made slightly less likely. First we play, then we go back and think about what we did.
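
Written as a formula, this idea corresponds to the usual REINFORCE-style gradient estimate. The notation is introduced only for this illustration and is not used elsewhere in the notebook: $\pi_\theta(a_t \mid s_t)$ is the probability the network with weights $\theta$ gives to the action $a_t$ played in state $s_t$, $R_t$ is the score attached to that action (the discounted reward defined in the next section), and $N$ is the number of recorded steps.

$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{t=1}^{N} R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) $$

Actions with a large $R_t$ pull the weights more strongly towards making them likely again, which is the "slightly more likely" behaviour described above.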

### Algorithm

* Run an episode until it is done and save the observation, action and reward at each step.
* When an episode is done, compute the discounted rewards for the whole episode and save them.
* Once you have played *batch_size=50* episodes, train your model on this batch.
* Stop if you have reached *num episodes* or the *goal* target.


### Parameters

| Variable  | Value  | Description  | 
|---|---|---|
| Gamma | 0.99 | Discount rate applied when computing the discounted rewards |
| batch_size | 50 | Number of episodes to run before training the model on a batch of episodes |
| Num episodes | 10,000 | Maximum number of episodes to run before stopping the training |
| goal | 190 | Number of steps to reach in one episode to stop the training |

These parameters are fixed for this TP. They are common values for this kind of problem, chosen from experience; they are neither definitive nor the result of any research.

## Discounted rewards


To train this neural network, we will use the observations from the experiences as inputs and the actions taken as outputs.

But how do we tell the neural network whether the chosen actions were good or bad?
The problem is that most actions have delayed effects, so when you win or lose points in a game, it is not clear which actions contributed to this result: was it just the last action? Or the last 10? Or just one action 50 steps earlier? <br>
This is called the _credit assignment problem_.


To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount rate $\gamma$ at each step. This is called the **discounted reward**:

$$ R_t = \sum_{i=0}^{\infty}\gamma^i r_{t+i}$$
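
Since an episode is finite, the sum stops at the last step, and it satisfies the recursion $R_t = r_t + \gamma R_{t+1}$. A minimal sketch of this backward computation is shown below; the name `discount_rewards_sketch` is chosen for this illustration so that it does not replace the exercise function.

```python
import numpy as np


def discount_rewards_sketch(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1}, walking backwards through the episode."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounted = np.zeros_like(rewards)
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        # Everything that comes after step t is discounted once more.
        running_sum = rewards[t] + gamma * running_sum
        discounted[t] = running_sum
    return discounted


# With rewards [1, 1, 1] the values are approximately 2.9701, 1.99 and 1.0.
example = discount_rewards_sketch([1, 1, 1], gamma=0.99)
```

The first action gets the largest value because every later reward still counts towards it, only slightly discounted.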



These discounted rewards will then be applied as weights in the loss function of the neural network:
* A high discounted reward leads to a larger gradient, which increases the importance of this action.
* A low discounted reward leads to a smaller gradient, which decreases the importance of this action.
 

**Exercise** : Implement the discount_rewards function.

In [None]:
def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [2.9701, 1.99, 1]
    """
    # TODO
    return discounted_rewards

In [None]:
# %load discounted_rewards.py

In [None]:
# np.allclose avoids spurious failures caused by floating-point rounding
assert np.allclose(discount_rewards([1,1,1], gamma=0.99), [2.9701, 1.99, 1])
assert np.allclose(discount_rewards([3,2,1], gamma=0.99), [5.9601, 2.99, 1.0])

## Architecture & Loss Function

As before, we will define a very simple architecture for our neural network: an MLP with only one hidden layer of 9 neurons.

We have to be aware here that the neural network will have two different behaviours:

* For training: the model takes two pieces of information as input: the observations (to predict the action) and the discounted rewards (also called advantages), which are applied in the loss function.
* For prediction: the model takes only the observations as input to predict the action.

So we have to define a neural network that can handle either one or two inputs! 

In Keras we do it this way: we define the layers once, and we expose two functions (`predict` for prediction and `train_step` for training) that share the same layers and weights.

We will also implement the loss function, which is a weighted binary cross-entropy where the weights are the discounted rewards computed from the rewards.
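
To see what this weighting does numerically, here is a small NumPy sketch of the same quantity. It is an illustration only: the function name is hypothetical, and the clipping constant `eps` is an assumption added here to avoid `log(0)`.

```python
import numpy as np


def weighted_binary_cross_entropy(y_true, y_pred, advantages, eps=1e-7):
    """Mean binary cross-entropy where each sample is scaled by its advantage."""
    y_true = np.asarray(y_true, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    # Assumption: clip the probabilities so that the logarithms stay finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    neg_log_lik = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(neg_log_lik * advantages))


# Same predictions, but the first action has twice the weight of the second one.
loss = weighted_binary_cross_entropy([1, 0], [0.5, 0.5], [2.0, 1.0])
```

With all advantages equal to 1 this reduces to the ordinary binary cross-entropy; a larger advantage makes the corresponding action count more in the gradient.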

Here is how we implement it : (Make sure you understand it!) 

**TODO** Define this keras Model class with tensorflow 2

**Exercise**: Write the loss?

In [None]:
import tensorflow as tf
import tensorflow.keras.models as km
import tensorflow.keras.layers as kl
import tensorflow.keras.initializers as ki
import tensorflow.keras.optimizers as ko
import tensorflow.keras.losses as klo
import tensorflow.keras.backend as K
import tensorflow.keras.metrics as kme

class discountedLoss(klo.Loss):
    """
    Binary cross-entropy weighted by the discounted rewards (advantages).

    Args:
      reduction: Type of tf.keras.losses.Reduction to apply to loss.
      name: Name of the loss function.
    """

    def __init__(self,
                 reduction=klo.Reduction.AUTO,
                 name='discountedLoss'):
        super().__init__(reduction=reduction, name=name)

    def call(self, y_true, y_pred, adv):
        log_lik = - (y_true * K.log(y_pred) + (1 - y_true) * K.log(1 - y_pred))
        loss = K.mean(log_lik * adv, keepdims=True)
        return loss


class kerasModel(km.Model):
    def __init__(self):
        super(kerasModel, self).__init__()
        self.layersList = []
        self.layersList.append(kl.Dense(9, activation="relu",
                     input_shape=(4,),
                     use_bias=False,
                     kernel_initializer=ki.VarianceScaling(),
                     name="dense_1"))
        self.layersList.append(kl.Dense(1,
                       activation="sigmoid",
                       kernel_initializer=ki.VarianceScaling(),
                       use_bias=False,
                       name="out"))

        self.loss = discountedLoss()
        self.optimizer = ko.Adam(learning_rate=1e-2)
        self.train_loss = kme.Mean(name='train_loss')
        self.validation_loss = kme.Mean(name='val_loss')
        self.metric = kme.Accuracy(name="accuracy")

        @tf.function()
        def predict(x):
            """
            Forward pass: run the input through each layer in turn
            and return the predicted probability of action 1 (right).
            """
            for l in self.layersList:
                x = l(x)
            return x
        self.predict = predict

        @tf.function()
        def train_step(x, labels, adv):
            """
                This is a TensorFlow function, run once for each batch of
                experiences. We move forward first, then calculate gradients with
                Gradient Tape to move backwards.
            """
            with tf.GradientTape() as tape:
                predictions = self.predict(x)
                loss = self.loss.call(
                    y_true=labels,
                    y_pred = predictions,
                    adv = adv)
            gradients = tape.gradient(loss, self.trainable_variables)
            self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
            self.train_loss(loss)
            return loss

        self.train_step = train_step

## PG class


The `PG` class contains the implementation of the **Policy Gradient** algorithm. The code is incomplete and you will have to fill it in!

**GENERAL INSTRUCTIONS**:

* Read the `__init__` of the `PG` class. 
    * Various variables are set along with their definitions; make sure you understand all of them.
    * The *game environment*, the *experiences list* and the *keras model* are initialised.
* Read the `train` method. It contains the main code corresponding to the **pseudo code** below. YOU DO NOT HAVE TO MODIFY IT! But make sure you understand it.
* The `train` method uses methods that are not implemented. 
    * You will have to complete the code of 3 functions (read the instructions of each exercise below).
    * After the cell containing the `PG` class code below, there is a **test cell**. <br>
    This cell should be executed after all the methods have been completed. It checks that the functions you implemented take inputs and return outputs in the expected format. <br> DO NOT MODIFY this cell. It will work if your code is good. <br> **Warning** The test cell does not guarantee that your code is correct. It only tests that inputs and outputs are in the right format.


#### Pseudo code 
We will consider that the game is **completed** if the mean score over 10 games is above 190.

While neither the expected *goal* reward nor the maximum number of episodes (*num_episodes*) has been reached:
* Run `run_one_episode` and save all experiences in the `experiences` list (**Exercise 1 & 2**).
* Every `batch_size` episodes played:
    * Train the model over a batch of experiences (**Exercise 3**).


    
**Exercise 1**:  Implement `choose_action`<br>
This method chooses an action according to these rules:<br>

* Let $p$ be the probability, output by the model, of playing the action $right(=1)$.
* Then play right with probability $p$, otherwise play left.
* Hence, the more confident the model is in its prediction, the less exploration we perform.


**Exercise 2**:  Implement `run_one_episode` <br>
This method:<br>
* plays a complete episode until it is done. At each step of the episode, it:
    * chooses an action
    * saves the action, state and reward
* once the episode is done and all rewards are known, computes all the discounted rewards.
* fills the `experiences` list with all the experiences of the episode, each stored as `[state, action, discounted_reward]`.
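
The last step can be sketched on its own, assuming the states, actions and discounted rewards of the episode have already been collected. The helper name `build_experiences` is hypothetical, and the shapes and types follow what the test cell further down checks (states of shape `(1, 4)`, `int` actions, `np.float64` discounted rewards).

```python
import numpy as np


def build_experiences(states, actions, discounted_rewards):
    """Pack one episode into a list of [state, action, discounted_reward] entries."""
    discounted_rewards = np.asarray(discounted_rewards, dtype=np.float64)
    experiences = []
    for state, action, dreward in zip(states, actions, discounted_rewards):
        # Each state is stored as a batch of one observation: shape (1, n_inputs).
        experiences.append([np.reshape(state, (1, -1)), int(action), dreward])
    return experiences
```

Inside the class, such entries would be appended to `self.experiences` at the end of each episode.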

**Exercise 3**:  Implement `run_one_batch_train`<br>
This method:<br>
* calls the `train_step` method of the `model` with the arguments taken from the `experiences` list.
* empties the `experiences` list.
* returns the loss of this batch step.


You can do these exercises in this notebook or with the `PG.py` and `PG_solution.py` files.

In [None]:
import tensorflow as tf
tf.config.experimental_run_functions_eagerly(True)
tf.keras.backend.set_floatx('float64')
class PG:

    def __init__(self, gamma=.99, batch_size=50, num_episodes=10000, goal=190, n_test=10, print_every=100):
        # Environment
        self.env = gym.make("CartPole-v0")
        self.dim_input = self.env.observation_space.shape[0]

        # Parameters
        self.gamma = gamma  # -> Discount rate used for the discounted rewards
        self.batch_size = batch_size  # -> Number of episodes played before training on a batch

        # Stop factor
        self.num_episodes = num_episodes  # Max number of episodes
        self.goal = goal  # Stop if our network achieves this goal over *n_test* test games
        self.n_test = n_test
        self.print_every = print_every  # Number of episodes between two checks of how well the model performs

        # Init Model to be trained
        self.model = kerasModel()

        # Placeholders for our observations, outputs and rewards
        self.experiences = []
        self.losses = []

    def choose_action(self, state):
        # TODO
        return action

    def run_one_episode(self):
        # TODO
        return score

    def run_one_batch_train(self):
        # TODO
        return loss

    def score_model(self, model, num_tests, dimen, ):
        scores = []
        for num_test in range(num_tests):
            observation = self.env.reset()
            reward_sum = 0
            while True:
                state = np.reshape(observation, [1, dimen])
                predict = model.predict(state)[0]
                action = 1 if predict > 0.5 else 0
                observation, reward, done, _ = self.env.step(action)
                reward_sum += reward
                if done:
                    break
            scores.append(reward_sum)
        return np.mean(scores)

    def train(self):
        metadata = []
        i_batch = 0
        # Number of episode and total score
        num_episode = 0
        train_score_sum = 0

        while num_episode < self.num_episodes:
            train_score = self.run_one_episode()
            train_score_sum += train_score
            num_episode += 1

            if num_episode % self.batch_size == 0:
                i_batch += 1
                loss = self.run_one_batch_train()
                self.losses.append(loss)
                metadata.append([i_batch, self.score_model(self.model, self.n_test, self.dim_input)])

            # Print results periodically
            if num_episode % self.print_every == 0:
                test_score = self.score_model(self.model, self.n_test, self.dim_input)
                print(
                    "Average reward for training episode {}: {:0.2f} Mean test score over {:d} episode: {:0.2f} Loss: {:0.6f} ".format(
                        num_episode, train_score_sum / self.print_every, self.n_test,
                        test_score,
                        self.losses[-1]))
                train_score_sum = 0  # reset so the next average covers only the next print_every episodes
                if test_score >= self.goal:
                    print("Solved in {} episodes!".format(num_episode))
                    break
        return metadata

In [None]:
pg = PG()
score = pg.run_one_episode()
assert type(score) is float
for state, action, dreward in pg.experiences:
    assert np.all(state.shape==(1,4))
    assert type(action)==int
    assert type(dreward)==np.float64

loss = pg.run_one_batch_train()
assert type(loss) == float
assert len(pg.experiences)==0

In [None]:
# %load solutions/PG_class.py

### Training

Let's train the model !

In [None]:
pg = PG(goal=200)
metadata = pg.train()

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb
sb.set_style("whitegrid")
fig = plt.figure(figsize=(20,6))
ax = fig.add_subplot(1,1,1)
ax.plot(list(range(len(metadata))),[x[1] for x in metadata])
ax.set_yticks(np.arange(0,210,10))    
ax.set_xticks(np.arange(0,100,25))    
ax.set_title("Score/Length of episode over Iteration", fontsize=20)
ax.set_xlabel("Number of iteration", fontsize=14)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
ax.set_ylabel("Score/Length of episode", fontsize=16)
plt.savefig("pg_normalized.png", bbox_inches="tight", dpi=200)

**Exercise** 

* Use the model to play 100 games and check how it performs compared to the previously tested policies.
* Record a video of a game and display it.

In [None]:
play_games(policy = lambda obs : neural_network_policy(obs, model=pg.model), n_games=100)

In [None]:
reward_episod, frames = run_one_episode(policy = lambda obs : neural_network_policy(obs, model=pg.model), return_frames=True)
HTML(plot_animation(frames).to_html5_video())