# AI-Frameworks

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
    
</center>

# LAB 4 Introduction to Reinforcement Learning  -  Part 1 : Policy Gradient Algorithm

The objectives of this notebook are the following:

* Discover the OpenAI Gym library and its environments
* Implement hard-coded and neural network policies
* Implement the Policy Gradient algorithm
 
Source : [https://github.com/ageron/handson-ml](https://github.com/ageron/handson-ml) and https://github.com/breeko/Simple-Reinforcement-Learning-with-Tensorflow/blob/master/Part%202%20-%20Policy-based%20Agents%20with%20Keras.ipynb

# Import libraries

In [None]:
import numpy as np
import random
import os
import tensorflow as tf  # required by reset_graph() below (TensorFlow 1.x API)

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# To plot figures and animations
%matplotlib inline
%matplotlib nbagg
import matplotlib
import matplotlib.animation as animation
import matplotlib.pyplot as plt
from IPython.display import HTML

plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

#Tensorflow/Keras utils
import tensorflow.keras.models as km
import tensorflow.keras.layers as kl
import tensorflow.keras.initializers as ki
import tensorflow.keras.optimizers as ko
import tensorflow.keras.losses as klo
import tensorflow.keras.backend as K


# Gym Library
import gym

# AI Gym Library
<a href="https://gym.openai.com/" ><img src="https://gym.openai.com/assets/dist/home/header/home-icon-54c30e2345.svg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 

In this notebook we will be using [OpenAI gym](https://gym.openai.com/), a great toolkit for developing and comparing Reinforcement Learning algorithms. It provides many environments for your learning *agents* to interact with.

# A simple environment: the Cart-Pole

## Description
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

### Observation

Num | Observation | Min | Max
---|---|---|---
0 | Cart Position | -2.4 | 2.4
1 | Cart Velocity | -Inf | Inf
2 | Pole Angle | ~ -41.8&deg; | ~ 41.8&deg;
3 | Pole Velocity At Tip | -Inf | Inf

### Actions

Num | Action
--- | ---
0 | Push cart to the left
1 | Push cart to the right

Note: The amount the velocity is reduced or increased is not fixed as it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it

### Reward
Reward is 1 for every step taken, including the termination step

### Starting State
All observations are assigned a uniform random value between ±0.05

### Episode Termination
1. Pole Angle is more than ±12°
2. Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
3. Episode length is greater than 200
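
The three termination conditions above can be sketched as a small check. This is only an illustration, not the environment's own code: the function and constant names are hypothetical, and it assumes the pole angle in the observation is expressed in radians and that an episode stops once 200 steps have been played.

```python
import math

# Thresholds taken from the description above; the names are illustrative.
MAX_POLE_ANGLE_RAD = math.radians(12)  # 12 degrees, assuming the angle is observed in radians
MAX_CART_POSITION = 2.4
MAX_EPISODE_STEPS = 200                # assumed inclusive: no episode runs longer than this


def is_episode_over(observation, n_steps):
    """Return True if any of the three termination conditions holds.

    observation: (cart position, cart velocity, pole angle, pole velocity at tip)
    n_steps: number of steps already played in the current episode
    """
    cart_position, _cart_velocity, pole_angle, _tip_velocity = observation
    return (
        abs(pole_angle) > MAX_POLE_ANGLE_RAD
        or abs(cart_position) > MAX_CART_POSITION
        or n_steps >= MAX_EPISODE_STEPS
    )
```

For instance, `is_episode_over([0.0, 0.0, 0.3, 0.0], 10)` is true because 0.3 rad is roughly 17°, beyond the 12° limit.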

### Solved Requirements
Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.

The description above is part of the official description of this environment. Read the full description [here](https://github.com/openai/gym/wiki/CartPole-v0).

The following command will load the `CartPole` environment.

In [None]:
env = gym.make("CartPole-v0")

The `reset` command initializes the environment and returns the first observation, which is a 1D array of size 4.

In [None]:
obs = env.reset()
env.observation_space, obs

**Q:** What are the four values above?

The `render` command generates an image of the environment, which is here a 400x600 pixel image with RGB channels. 

For the `CartPole` environment, the `render` command also opens another window, which we close immediately with the `env.close` command, but this can produce disturbing behaviour.

In [None]:
img = env.render(mode = "rgb_array")
env.close()
print("Environment is a %dx%dx%d image" %img.shape)

The environment can then easily be displayed with a matplotlib function. 

In [None]:
plt.imshow(img)
plt.axis("off")

The action space is composed of two actions: push to the left (0) and push to the right (1).

In [None]:
env.action_space

The `step` function applies one of these actions and returns several pieces of information:

* The new observation after applying this action.
* The reward this action has produced.
* A boolean that indicates whether the episode is over.
* Extra information that depends on the environment (the CartPole environment does not provide any).

Let's push the cart pole to the left!

In [None]:
obs, reward, done, info = env.step(0)
print("New observation : %s" %str(obs))
print("Reward : %s" %str(reward))
print("Is the experience over? : %s" %str(done))
print("Extra information : %s" %str(info))


Let's visualize the environment

In [None]:
img = env.render(mode = "rgb_array")
env.close()
plt.imshow(img)
axs =  plt.axis("off")

**Q** : What can you see? Does it seem normal to you?

**Exercise** : Reset the environment and push the cart to the left until the episode is over, then display the final environment. 
**Q** : Why does the episode end? 

In [None]:
# %load solutions/exercise_1_1.py

## Hard coded policy

How can we make the pole remain upright? We will need to define a _policy_ for that. This is the strategy that the agent will use to select an action at each step. It can use all the past actions and observations to decide what to do.

### Random policy

Let's start with a completely random policy and see how long the pole remains upright over 100 episodes.

In [None]:

env.reset()           
num_games = 100       
reward_sum = 0        
all_reward_sum = []   
num_game = 0          
while num_game < num_games:
    observation, reward, done, _ = env.step(env.action_space.sample())
    reward_sum += reward
    if done:
        if num_game %10 == 0:
            print("Game played : %d. Reward for the last 10 episode: %s" %(num_game,all_reward_sum[-10:]) )
        all_reward_sum.append(reward_sum)
        reward_sum = 0
        num_game += 1
        env.reset()
        
print("Over %d episodes, mean reward: %d, std : %d" %(num_games, np.mean(all_reward_sum), np.std(all_reward_sum)))

#### Visualize a complete game

Let's run one episode with this random policy and save all the images representing the environment at each step.

In [None]:
frames = []

env.reset()    
observation, reward, done, _ = env.step(env.action_space.sample())
reward_sum = 0        
while not(done):
    img = env.render(mode = "rgb_array")
    env.close()
    frames.append(img)
    reward_sum += reward
    observation, reward, done, _ = env.step(env.action_space.sample())

In [None]:
def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

def plot_animation(frames, repeat=False, interval=40):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    return animation.FuncAnimation(fig, update_scene, fargs=(frames, patch), frames=len(frames), repeat=repeat, interval=interval)

In [None]:
HTML(plot_animation(frames).to_html5_video())

### Simple strategy

Let's hard code a simple strategy: if the pole is tilting to the left, then push the cart to the left, and _vice versa_. Let's see if that works.

**Exercise** Implement this policy and play 100 games with it. What are the mean and standard deviation of the reward sum over the 100 games?

In [None]:
# %load solutions/exercise_1_2.py

**Exercise** Visualize a complete game:

In [None]:
# %load solutions/exercise_1_3.py

**Q** What can you say about this strategy?

## Neural Network Policies

Let's create a neural network to build a better policy. <br> 
It will take observations as inputs, and output the probability of the action to take for each observation. <br>
In the case of the Cart-Pole environment, there are just two possible actions (left or right), so we only need one output neuron: it will output the probability `p` of the action 0 (left), and of course the probability of action 1 (right) will be `1 - p`.
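
As a minimal sketch of how a single probability `p` can be turned into one of the two actions, the snippet below draws a uniform random number and compares it to `p`. The helper name `sample_action` and the value `0.8` are illustrative assumptions, not part of the notebook.

```python
import random


def sample_action(p_left, rng=random):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1


# Draw many actions with a fixed, made-up probability to check the proportions.
rng = random.Random(0)
actions = [sample_action(0.8, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print("Fraction of left pushes: %.3f" % left_fraction)
```

Over many draws the fraction of left pushes should be close to the probability passed in.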

### The architecture

Because this problem is simple, we can define a very simple architecture for our neural network. Here it's a simple MLP with one hidden layer of four neurons.

In [None]:
# Specify the network architecture
n_inputs = 4  # == env.observation_space.shape[0]
n_hidden = 4  # it's a simple task, we don't need more than this
n_outputs = 1 # only outputs the probability of accelerating left
initializer = ki.VarianceScaling()

# Build the neural network
policy_network=km.Sequential()
policy_network.add(kl.Dense(n_hidden, input_shape = (n_inputs,), activation = "relu", kernel_initializer = initializer))
policy_network.add(kl.Dense(n_outputs, activation = "sigmoid", kernel_initializer = initializer))

policy_network.summary()

### Predict the action
We can now easily predict the probability of each of the two actions given the observation:

In [None]:
obs = env.reset()
p_left = policy_network.predict(np.expand_dims(obs,axis=0))
# Choose an action according to the previously generated probability
action = 0 if random.uniform(0,1)< p_left else 1 

**Question** What can you say about the procedure to choose the action? Why are we doing this?

***NB*** : 

* In this particular environment, the past actions and observations can safely be ignored, since each observation contains the environment's full state. If there were some hidden state then you may need to consider past actions and observations in order to try to infer the hidden state of the environment. For example, if the environment only revealed the position of the cart but not its velocity, you would have to consider not only the current observation but also the previous observation in order to estimate the current velocity. Another example is if the observations are noisy: you may want to use the past few observations to estimate the most likely current state. Our problem is thus as simple as can be: the current observation is noise-free and contains the environment's full state.
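
To illustrate the hidden-state example above, here is a small sketch of how a velocity could be estimated from two consecutive positions with a finite difference. The function name and the time step `dt=0.02` are assumptions made for the illustration; the notebook itself does not state the simulation time step.

```python
def estimate_velocity(previous_position, current_position, dt=0.02):
    """Finite-difference estimate of the velocity from two consecutive positions.

    dt is the (assumed) time elapsed between the two observations.
    """
    return (current_position - previous_position) / dt


# Made-up cart positions observed at three consecutive steps.
positions = [0.00, 0.01, 0.03]
velocities = [estimate_velocity(a, b) for a, b in zip(positions, positions[1:])]
print(velocities)
```

With noisy observations, the same idea extends to averaging over the last few observations instead of using only two.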

### Random neural network policy.
Let's see how this neural network policy performs.

In [None]:
reward_sum = 0
num_games = 100       
num_game = 0
all_reward_sum = []
obs = env.reset()
while num_game < num_games:
    p_left = policy_network.predict(np.expand_dims(obs,axis=0))
    action = 0 if random.uniform(0,1)< p_left else 1 
    obs, reward, done, info = env.step(action)
    reward_sum += reward
    if done:
        if num_game %10 == 0:
            print("Game played : %d. Reward for the last 10 episode: %s" %(num_game,all_reward_sum[-10:]) )
        all_reward_sum.append(reward_sum)
        reward_sum = 0
        num_game += 1
        obs = env.reset()  # start the next episode from a fresh observation
print("Over %d episodes, mean reward: %d, std : %d" %(num_games, np.mean(all_reward_sum), np.std(all_reward_sum)))

Let's randomly initialize this policy neural network and use it to play one game:

In [None]:
frames = []

env.reset()    
observation, reward, done, _ = env.step(env.action_space.sample())
reward_sum = 0        
while not(done):
    img = env.render(mode = "rgb_array")
    env.close()
    frames.append(img)
    reward_sum += reward
    p_left = policy_network.predict(np.expand_dims(observation,axis=0))
    action = 0 if random.uniform(0,1)< p_left else 1 
    observation, reward, done, _ = env.step(action)
plt.close()
HTML(plot_animation(frames).to_html5_video())

The neural network is working, but it's still acting randomly. Let's try to make it learn a better policy.

### Learn a given policy

In this part we will train the neural network so that it learns the simple strategy we hard-coded before: if the pole is tilting to the left, then push the cart to the left, and _vice versa_.

We start by defining the neural network policy with the same architecture as before. <br>
This time we compile the network so that it can learn the desired behaviour.

In [None]:
policy_network=km.Sequential()
policy_network.add(kl.Dense(n_hidden, input_shape = (n_inputs,), activation = "relu", kernel_initializer = initializer))
policy_network.add(kl.Dense(n_outputs, activation = "sigmoid", kernel_initializer = initializer))

# Training configuration
learning_rate = 0.01
adam = ko.Adam(lr = learning_rate)
policy_network.compile(loss='binary_crossentropy',optimizer=adam,metrics=['accuracy'])


In order to learn this policy, we first define 10 environments that we will play in parallel at each iteration.

In [None]:
n_environments = 10

envs = [gym.make("CartPole-v0") for _ in range(n_environments)]
observations = np.array([env.reset() for env in envs])

At each iteration we will:

* Train the network on the ten observations we have, where the observations are the inputs and the targets are the expected behaviour.
* Predict the next actions according to the neural network policy.

In [None]:
n_iterations = 2000
for iteration in range(n_iterations):
    target_probas = np.array([(1 if obs[2] < 0 else 0) for obs in observations]) # if angle<0 we want proba(left)=1., or else proba(left)=0.
    policy_network.train_on_batch(observations, target_probas)  # one batch iteration
    
    # Generate the probabilities and actions for the next step
    p_lefts = policy_network.predict(observations)
    actions = [0 if random.uniform(0,1)< p_left else 1 for p_left in p_lefts]
    
    # Compute next state
    for env_index, env in enumerate(envs):
        obs, reward, done, info = env.step(actions[env_index])
        observations[env_index] = obs if not done else env.reset()

for env in envs:
    env.close()

In [None]:
reward_sum = 0
num_games = 100       
num_game = 0
all_reward_sum = []
obs = env.reset()
while num_game < num_games:
    # policy learned by the neural network
    p_left = policy_network.predict(np.expand_dims(obs,axis=0))
    action = 0 if random.uniform(0,1)< p_left else 1 
    obs, reward, done, info = env.step(action)
    reward_sum += reward
    if done:
        if num_game %10 == 0:
            print("Game played : %d. Reward for the last 10 episode: %s" %(num_game,all_reward_sum[-10:]) )
        all_reward_sum.append(reward_sum)
        reward_sum = 0
        num_game += 1
        obs = env.reset()  # start the next episode from a fresh observation
print("Over %d episodes, mean reward: %d, std : %d" %(num_games, np.mean(all_reward_sum), np.std(all_reward_sum)))

In [None]:
frames = []

env.reset()    
observation, reward, done, _ = env.step(env.action_space.sample())
reward_sum = 0        
while not(done):
    img = env.render(mode = "rgb_array")
    env.close()
    frames.append(img)
    reward_sum += reward
    p_left = policy_network.predict(np.expand_dims(observation,axis=0))
    action = 0 if random.uniform(0,1)< p_left else 1 
    observation, reward, done, _ = env.step(action)
plt.close()
HTML(plot_animation(frames).to_html5_video())

Looks like it learned the policy correctly! <br>

Let's now reach our final target: the neural network has to find a better policy on its own.

## Policy Gradients

The idea behind *Policy Gradients* is quite simple: the algorithm first plays multiple games, then makes the actions taken in good games slightly more likely, while actions taken in bad games are made slightly less likely. First we play, then we go back and think about what we did.

### Algorithm

* Run an episode until it's done, saving the observation, action and reward at each step.
* When an episode is done, compute the discounted rewards for the whole episode and save them.
* Once *batch_size=50* episodes have been played, train your model on this batch.
* Stop when *num episodes* have been played or the *goal* target is reached.


### Parameters

| Variable | Value | Description |
|---|---|---|
| Gamma | 0.99 | Discount rate applied when computing the discounted rewards |
| batch_size | 50 | Number of episodes to run before training the model on a batch of episodes |
| Num episodes | 10000 | Maximum number of episodes to run before stopping the training |
| goal | 190 | Average score over the test episodes at which the training stops |

These parameters are fixed for this lab. They are common values for this kind of problem, chosen from experience; they are neither definitive nor the result of any research.

### Discounted rewards


To train this neural network, we will use the observations collected during the episodes as inputs and the actions taken as outputs.

But how do we tell the neural network whether the chosen actions were good or bad?
The problem is that most actions have delayed effects, so when you win or lose points in a game, it is not clear which actions contributed to this result: was it just the last action? Or the last 10? Or just one action 50 steps earlier? <br>
This is called the _credit assignment problem_.


To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount rate `gamma` at each step. 
The resulting discounted reward is then applied as a weight in the loss function of the neural network:
* A high discounted reward leads to a larger gradient, which increases the importance of this action
* A low discounted reward leads to a smaller gradient, which decreases the importance of this action
 

**Exercise** : Implement the discount_rewards function.

In [None]:
def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [2.9701, 1.99, 1]
    """
    TODO
    return discounted_rewards

In [None]:
# %load solutions/exercise_1_4.py
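
One possible way to compute the discounted rewards, given here only as a hedged sketch and not as the official solution file: walk through the rewards backwards and keep a running sum that is multiplied by `gamma` at each step. The name `discount_rewards_sketch` is hypothetical, and this version works on a plain Python list.

```python
def discount_rewards_sketch(rewards, gamma=0.99):
    """Return, for each step, the sum of the future rewards discounted by gamma.

    rewards: list of floats, one reward per step of a single episode.
    """
    discounted = [0.0] * len(rewards)
    running_sum = 0.0
    # Go backwards so that each step reuses the discounted sum of the next one.
    for t in reversed(range(len(rewards))):
        running_sum = rewards[t] + gamma * running_sum
        discounted[t] = running_sum
    return discounted


example = discount_rewards_sketch([1, 1, 1], gamma=0.99)
print(example)
```

The training loop further down stacks rewards as a NumPy column vector, so a list-based version like this one would probably need its result converted to the same shape before being used there.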

### Architecture

As before, we will define a very simple architecture for our neural network: an MLP with a single hidden layer of 8 neurons.

We have to be aware here that the neural network will have two different behaviours:

* For training: the model takes two inputs: the observations (to predict the action) and the discounted rewards (also called advantages), which are applied in the loss function.
* For prediction: the model takes only the observations as input to predict the action.

So we have to define a neural network that can handle either one or two inputs! 

In Keras we do it this way: we define the layers, then create two models (one for training and one for prediction) that share the same layers and weights.

Here is how we implement it (make sure you understand it!):

In [None]:
hidden_layer_neurons = 8

dimen = env.reset().shape
num_actions = 1
inp = kl.Input(shape=dimen,name="input_x")
adv = kl.Input(shape=[1], name="advantages")
x = kl.Dense(hidden_layer_neurons,  activation="relu", 
                 use_bias=False,
                 kernel_initializer=initializer,
                 name="dense_1")(inp)
out = kl.Dense(num_actions, 
                   activation="sigmoid", 
                   kernel_initializer=initializer,
                   use_bias=False,
                   name="out")(x)

model_train = km.Model(inputs=[inp, adv], outputs=out)
model_predict = km.Model(inputs=[inp], outputs=out)

### Loss Function

We will now implement the loss function, which is a weighted binary cross-entropy where the weights are the discounted rewards computed from the rewards.
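
Written out as a formula, and with symbols introduced here only for illustration ($p_i$ for the network output at step $i$, $y_i$ for the target, $A_i$ for the weight given by the normalized discounted reward, and $N$ for the number of steps in the batch), this weighted binary cross-entropy can be sketched as:

$$
L = \frac{1}{N} \sum_{i=1}^{N} A_i \, \Bigl[ -\bigl( y_i \log p_i + (1 - y_i) \log (1 - p_i) \bigr) \Bigr]
$$

A step with a large positive $A_i$ pulls $p_i$ towards the target $y_i$, while a step with a negative $A_i$ pushes it away.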

In [None]:
lr = 1e-2

def my_custom_loss(y_true, y_pred):

    log_lik = - (y_true*K.log(y_pred) + (1-y_true)*K.log(1-y_pred))
    return K.mean(log_lik * adv, keepdims=True)
model_train.compile(loss=my_custom_loss, optimizer=ko.Adam(lr))


### Test Function

We define a simple function that runs *num_tests* episodes following our learned policy to evaluate it.

In [None]:
# See our trained bot in action
def score_model(model, num_tests, render=False):
    scores = []    
    for num_test in range(num_tests):
        observation = env.reset()
        reward_sum = 0
        while True:
            if render:
                env.render()

            state = np.reshape(observation, [1, dimen])
            predict = model.predict([state])[0]
            action = 0 if predict>0.5 else 1 
            observation, reward, done, _ = env.step(action)
            reward_sum += reward
            if done:
                break
        scores.append(reward_sum)
    env.close()
    return np.mean(scores)

### Training

In [None]:
# Hyperparameters
gamma = .99      # Discount rate applied to the rewards
batch_size = 50  # Number of episodes to play before training on a batch
dimen = 4        # Size of an observation

# Stopping criteria
num_episodes = 10000 # Maximum number of episodes
goal = 190           # Stop if our network reaches this average score over *n_test* test episodes
n_test = 10   

print_every = 100 # Number of episodes between two evaluations of the model

In [None]:
# Placeholders for our observations, outputs and rewards
states = np.empty(0).reshape(0,dimen)
actions = np.empty(0).reshape(0,1)
rewards = np.empty(0).reshape(0,1)
discounted_rewards = np.empty(0).reshape(0,1)

# Setting up our environment
observation = env.reset()
num_episode = 0
reward_sum = 0

losses = []

while num_episode < num_episodes:
    
    
    # Generate the state and action for the current iteration
    state = np.reshape(observation, [1, dimen])
    predict = model_predict.predict([state])[0]
    action = 0 if random.uniform(0,1)< predict else 1 
    
    # Append the observations and outputs for learning
    states = np.vstack([states, state])
    actions = np.vstack([actions, action])
    
    
    # Determine the outcome of the generated action
    observation, reward, done, _ = env.step(action)
    reward_sum += reward
    
    ##Append the rewards for learning
    rewards = np.vstack([rewards, reward])
    
    
    # If the episode is over
    if done:
        
        # Compute the discounted rewards for this episode
        discounted_rewards_episode = discount_rewards(rewards, gamma)   
        
        # Append the discounted rewards for learning
        discounted_rewards = np.vstack([discounted_rewards, discounted_rewards_episode])
        rewards = np.empty(0).reshape(0,1)
        
        # If we have reached batch_size (50) episodes, train on the accumulated batch
        if (num_episode + 1) % batch_size == 0:
            
            # Normalize the discounted rewards
            discounted_rewards -= discounted_rewards.mean()
            discounted_rewards /= discounted_rewards.std()
            discounted_rewards = discounted_rewards.squeeze()
           
        
            actions_train = 1-actions
            loss = model_train.train_on_batch([states, discounted_rewards], actions_train)
            losses.append(loss)
            
            # Clear out game variables
            states = np.empty(0).reshape(0,dimen)
            actions = np.empty(0).reshape(0,1)
            discounted_rewards = np.empty(0).reshape(0,1)

          
        # Print results periodically
        if (num_episode + 1) % print_every == 0:
            # Print status
            score = score_model(model_predict, n_test)
            print("Average reward for training episode {}: {:0.2f} Test Score of {:d} episode: {:0.2f} Loss: {:0.6f} ".format(
                (num_episode + 1), reward_sum/print_every, n_test,
                score,
                losses[-1]))
            
            if score >= goal:
                print("Solved in {} episodes!".format(num_episode))
                break
            reward_sum = 0
                
        num_episode += 1
        observation = env.reset()
        

In [None]:
reward_sum = 0
num_games = 100       
num_game = 0
all_reward_sum = []
obs = env.reset()
while num_game < num_games:
    # learned policy: pick the most likely action
    p_left = model_predict.predict(np.expand_dims(obs,axis=0))
    action = 0 if p_left[0][0]>0.5 else 1 
    obs, reward, done, info = env.step(action)
    reward_sum += reward
    if done:
        if num_game %10 == 0:
            print("Game played : %d. Reward for the last 10 episode: %s" %(num_game,all_reward_sum[-10:]) )
        all_reward_sum.append(reward_sum)
        reward_sum = 0
        num_game += 1
        obs = env.reset()  # start the next episode from a fresh observation
print("Over %d episodes, mean reward: %d, std : %d" %(num_games, np.mean(all_reward_sum), np.std(all_reward_sum)))

In [None]:
frames = []

env.reset()    
observation, reward, done, _ = env.step(env.action_space.sample())
reward_sum = 0        
while not(done):
    img = env.render(mode = "rgb_array")
    env.close()
    frames.append(img)
    reward_sum += reward
    p_left = model_predict.predict(np.expand_dims(observation,axis=0))
    action = 0 if  p_left>0.5 else 1 
    observation, reward, done, _ = env.step(action)
plt.close()
HTML(plot_animation(frames).to_html5_video())