# Proximal Policy Optimization (PPO) with TensorFlow

Understanding the PPO reinforcement learning algorithm and implementing it with TensorFlow 2.x

![image](https://miro.medium.com/max/1400/1*XreRjuz6MmATuoRvmprsdA.webp)

These notes are based on this [blog post](https://towardsdatascience.com/proximal-policy-optimization-ppo-with-tensorflow-2-x-89c9430ecc26).

In this article, we will try to understand OpenAI’s Proximal Policy Optimization algorithm for reinforcement learning. After some basic theory, we will implement PPO with TensorFlow 2.x. Before reading further, I recommend taking a look at the Actor-Critic method described [here](https://towardsdatascience.com/actor-critic-with-tensorflow-2-x-part-2of-2-b8ceb7e059db), as we will be modifying the code from that article for PPO.



## Why PPO?

1. *Unstable Policy Update*: In many policy gradient methods, policy updates are unstable when the step size is large. A large step can produce a bad policy, and because that bad policy is then used to collect the data for the next update, the policy gets even worse. If the step size is small, on the other hand, learning is slow.

2. *Data Inefficiency*: Many learning methods learn from the current experience and discard it after a single gradient update. This slows learning, because a neural network needs a lot of data to learn.

PPO is designed to overcome both issues.

### Core Idea Behind PPO

In earlier policy gradient methods, the objective function was something like $\hat{\mathbb{E}}[\log \pi_{\theta}(a_t | s_t) \cdot \hat{A}_t]$. PPO replaces the log of the current policy with the ratio of the current policy to the old policy:

$$\hat{\mathbb{E}} \big[\frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t \big] = \hat{\mathbb{E}} \big[r_t(\theta) \hat{A}_t \big]$$

The equation comes from this [paper](https://arxiv.org/abs/1707.06347).
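
As a small numeric illustration of the ratio $r_t(\theta)$, here is a minimal sketch. It uses NumPy rather than TensorFlow, the probabilities are made up, and the function name `probability_ratio` is hypothetical (it does not appear in the original code).

```py
import numpy as np

def probability_ratio(new_probs, old_probs, actions):
    """Return r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) for each step."""
    idx = np.arange(len(actions))
    # Pick the probability of the action that was actually taken at each step.
    new_p = new_probs[idx, actions]
    old_p = old_probs[idx, actions]
    # Dividing in log space is numerically safer for very small probabilities.
    return np.exp(np.log(new_p) - np.log(old_p))

# Made-up action probabilities for two steps and two actions (assumption).
old_probs = np.array([[0.5, 0.5], [0.8, 0.2]])
new_probs = np.array([[0.6, 0.4], [0.4, 0.6]])
actions = np.array([0, 1])

ratios = probability_ratio(new_probs, old_probs, actions)
print(ratios)  # approximately [1.2, 3.0]
```

A ratio above 1 means the new policy makes the taken action more likely than the old policy did; a ratio below 1 means it makes the action less likely.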

We also clip the ratio and take the minimum of the clipped and unclipped terms:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}} \big[\min\big(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\big)\big]$$

This clipped objective will restrict large policy updates as shown below.

![image](https://miro.medium.com/max/1400/1*VN01Obh5VyJ6QuA0qfyq6w.webp)

Figure from this [paper](https://arxiv.org/abs/1707.06347).
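
To make the clipping concrete, the following sketch evaluates $L^{\text{CLIP}}$ on a few made-up ratios and advantages. It is written in NumPy for readability, `clipped_surrogate` is a hypothetical helper name, and $\epsilon = 0.2$ is an assumed value (the notes above do not fix it).

```py
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# Made-up values (assumption): three steps with different ratios.
ratio = np.array([1.5, 0.5, 1.0])
advantage = np.array([1.0, -1.0, 2.0])

objective = clipped_surrogate(ratio, advantage)
print(objective)  # approximately 0.8
```

In the first step the ratio 1.5 is capped at 1.2, so a positive advantage gives no extra credit for moving the policy further. In the second step the ratio 0.5 is raised to 0.8, and the minimum keeps the more pessimistic value. Since optimizers minimize, an actor loss built on this objective would typically be its negative.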


## Algorithm Steps

1. Play the game for n steps and store the states, action probabilities, rewards, and done flags.
2. Apply the Generalized Advantage Estimation (GAE) method to this experience. We will see this in the coding section.
3. Train the neural networks for some epochs, computing the loss for each network.
4. Test the trained model for “m” episodes.
5. If the average reward over the test episodes exceeds the target reward you set, stop; otherwise, repeat from step 1.

## Code

1. After importing the required libraries and initializing the environment, we define our neural networks, which are similar to those in the Actor-Critic article.
2. The actor network takes the current state as input and outputs a probability for each action.
3. The critic network outputs the value of a state.

```py
class critic():
    # here is a neural network with dense layers
    ...

class actor():
    # here is another neural network with dense layers
    ...
```

## Action Selection

1. We define our agent class and initialize the optimizers and learning rate.
2. We also define a `clip_pram` variable, which will be used in the actor loss function.
3. For action selection, we use the TensorFlow Probability library, which takes probabilities as input and converts them into a distribution.
4. We then sample the action from that distribution.

```py
class agent():
    def __init__(self):
        # define optimizer
        # define actor()
        # define critic()
        ...

    def act(self, state):
        # define what the actor does
        ...

    def actor_loss(self, prob, action, td):
        # define the formula
        # and compute the loss for the actor
        # loss is defined according to the PPO formula
        ...

    def learn(self):
        # use gradient tape to update gradient
        # according to PPO loss function
        ...
```
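
The action-selection step can be sketched without TensorFlow as follows. This is a NumPy stand-in for sampling from a categorical distribution, not the `tfp.distributions.Categorical` call itself; `sample_action` and the probability vector are hypothetical.

```py
import numpy as np

def sample_action(probs, rng):
    """Sample one action index from a vector of action probabilities."""
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(seed=0)
probs = np.array([0.1, 0.2, 0.3, 0.4])  # made-up actor output for 4 actions

action = sample_action(probs, rng)
# Log-probability of the sampled action under the current policy.
log_prob = float(np.log(probs[action]))
```

Sampling, rather than always taking the most probable action, keeps the agent exploring. Storing the probability (or log-probability) of the sampled action at collection time is what later allows the ratio between the new and old policy to be formed.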

## Test Model Knowledge

This function tests our agent’s knowledge and returns the total reward for one episode.

```py
def test_reward(env):
    # here we have a while-loop that accumulates the reward
    ...
```
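
A minimal sketch of such an evaluation loop is shown below. It assumes the older Gym interface used later in these notes (`reset()` returns the state and `step()` returns a 4-tuple). `CountdownEnv`, `evaluate_episode`, and the `choose_action` parameter are hypothetical names introduced only so that the example runs on its own.

```py
class CountdownEnv:
    """Hypothetical stand-in environment: pays +1 per step, ends after 5 steps."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= 5
        return self.t, 1.0, done, {}

def evaluate_episode(env, choose_action):
    """Play one episode without learning and return the total reward."""
    total_reward = 0.0
    state = env.reset()
    done = False
    while not done:
        action = choose_action(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

episode_return = evaluate_episode(CountdownEnv(), choose_action=lambda state: 0)
print(episode_return)  # 5.0
```

With the real agent, `choose_action` would wrap the actor network. Whether to sample from the policy or pick the most probable action during testing is a design choice that these notes do not specify.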

## Training Loop

1. The outer loop runs “steps” times, i.e., we collect experience “steps” times.
2. The inner loop runs once per interaction between the agent and the environment, and we store the experience in separate lists.
3. After the inner loop, we compute the value of the state that follows the last stored state and append it to the stored values, because the Generalized Advantage Estimation calculation needs it.
4. We then pass all the lists to the Generalized Advantage Estimation method to get the returns and advantages.
5. Next, we train our networks for 10 epochs.
6. After training, we test our agent on the test environment for five episodes.
7. If the average reward over the test episodes exceeds the target reward you set, stop; otherwise, repeat from step 1.

```py
# define params
for s in range(steps):
    # define params for inner loop
    done = False
    while not done:
        # run: act in the environment, store the experience, update `done`
        agent.learn()
```
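
Steps 3 and 4 rely on Generalized Advantage Estimation. The sketch below is one common way to write it in NumPy and is not necessarily the original author's implementation. The name `compute_gae` and the default `lam=0.95` are assumptions, while `gamma=0.99` matches the discount used by the agent later in these notes.

```py
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one batch of stored steps.

    `values` has one more entry than `rewards`: its last entry is the value
    of the state reached after the final stored step (step 3 above).
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])  # no bootstrapping across episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values[:-1])
    return returns, advantages

# Made-up two-step episode with an all-zero critic, no discounting.
returns, advantages = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0, 0.0], dones=[False, True],
    gamma=1.0, lam=1.0,
)
print(returns, advantages)  # [2. 1.] [2. 1.]
```

With `gamma=1` and `lam=1` the advantages reduce to reward-to-go minus the value estimate, which makes the example easy to check by hand.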

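## Code Starts from Here

The cells below contain the one-step Actor-Critic baseline that the PPO implementation builds on: `actor_loss` uses the log-probability times the TD error rather than the clipped ratio, and `learn` is called after every environment step.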

### Library

```py
import numpy as np
import tensorflow as tf
import gym
import tensorflow_probability as tfp
```

### Installation

Please install the following.

```sh
pip install box2d-py
```


```sh
pip install "gym[box2d]"
```

### Initiate Environment

```py
env = gym.make("LunarLander-v2")
low = env.observation_space.low
high = env.observation_space.high
```


### Define Class Objects: `critic` and `actor`

```py
class critic(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.d1 = tf.keras.layers.Dense(2048,activation='relu')
        self.d2 = tf.keras.layers.Dense(1536,activation='relu')
        # Single linear output: the estimated value of the state.
        self.v = tf.keras.layers.Dense(1, activation = None)

    def call(self, input_data):
        x = self.d1(input_data)
        x = self.d2(x)
        v = self.v(x)
        return v

class actor(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.d1 = tf.keras.layers.Dense(2048,activation='relu')
        self.d2 = tf.keras.layers.Dense(1536,activation='relu')
        # Softmax output: one probability per action (4 actions in LunarLander-v2).
        self.a = tf.keras.layers.Dense(4,activation='softmax')

    def call(self, input_data):
        x = self.d1(input_data)
        x = self.d2(x)
        a = self.a(x)
        return a
```

### Define Class Object: `agent`

```py
class agent():
    def __init__(self, gamma = 0.99):
        self.gamma = gamma
        self.a_opt = tf.keras.optimizers.Adam(learning_rate=5e-6)
        self.c_opt = tf.keras.optimizers.Adam(learning_rate=5e-6)
        self.actor = actor()
        self.critic = critic()
        self.log_prob = None

    def act(self, state):
        # Sample an action from the categorical distribution given by the actor.
        prob = self.actor(np.array([state]))
        prob = prob.numpy()
        dist = tfp.distributions.Categorical(probs=prob, dtype=tf.float32)
        action = dist.sample()
        return int(action.numpy()[0])

    def actor_loss(self, prob, action, td):
        # Policy gradient loss: -log pi(a|s) * TD error.
        dist = tfp.distributions.Categorical(probs=prob, dtype=tf.float32)
        log_prob = dist.log_prob(action)
        loss = -log_prob*td
        return loss

    def learn(self, state, action, reward, next_state, done):
        state = np.array([state])
        next_state = np.array([next_state])
        with tf.GradientTape() as tape1, tf.GradientTape() as tape2:
            p = self.actor(state, training=True)
            v = self.critic(state, training=True)
            vn = self.critic(next_state, training=True)
            # One-step TD error; the bootstrap term is dropped at episode end.
            td = reward + self.gamma*vn*(1-int(done)) - v
            a_loss = self.actor_loss(p, action, td)
            c_loss = td**2
        grads1 = tape1.gradient(a_loss, self.actor.trainable_variables)
        grads2 = tape2.gradient(c_loss, self.critic.trainable_variables)
        self.a_opt.apply_gradients(zip(grads1, self.actor.trainable_variables))
        self.c_opt.apply_gradients(zip(grads2, self.critic.trainable_variables))
        return a_loss, c_loss
```

### Define and Run Training

```py
agentoo7 = agent()
steps = 10  # number of episodes to run

for s in range(steps):

    done = False
    state = env.reset()
    total_reward = 0
    all_aloss = []
    all_closs = []

    while not done:
        action = agentoo7.act(state)
        next_state, reward, done, _ = env.step(action)
        # Update both networks after every environment step.
        aloss, closs = agentoo7.learn(state, action, reward, next_state, done)
        all_aloss.append(aloss)
        all_closs.append(closs)
        state = next_state
        total_reward += reward

        if done:
            print("total reward after {} steps is {}".format(s, total_reward))
```

```
total reward after 0 steps is -101.85410305615486
total reward after 1 steps is -388.9730642679365
total reward after 2 steps is -128.5784935695012
total reward after 3 steps is -94.56507968972281
total reward after 4 steps is -184.29354193471238
total reward after 5 steps is -420.0412933497794
total reward after 6 steps is -92.74203850219722
total reward after 7 steps is -156.6542665888134
total reward after 8 steps is -196.48567587771737
total reward after 9 steps is -213.73151792177583
```
