<a href="https://colab.research.google.com/github/gitHubAndyLee2020/OpenAI_Gym_RL_Algorithms_Database/blob/main/PPO_Module.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### PPO

> About

- Consists of an Actor and a Critic: the Actor outputs action probabilities for a given state, and the Critic outputs the expected return for that state
- The Critic estimates the expected return of a state sampled from the training data; the advantage is the difference between the actual discounted return and that estimate
- For a state and action taken during training, the probability that the updated Actor assigns to that action is divided by the probability that the original (data-collecting) Actor assigned to it. This ratio scales how strongly the advantage updates the model, and clipping it prevents the updated network from deviating too far from the original network
- The difference between the actual discounted return and the Critic's predicted return is used to update the Critic network
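
The points above can be written compactly in the usual PPO notation. The symbols are assumptions introduced here rather than names from the notebook: $\pi_\theta$ is the updated Actor, $\pi_{\theta_\text{old}}$ the Actor that collected the data, $V$ the Critic, $G_t$ the discounted return, and $\epsilon$ the clip parameter.

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad
\hat{A}_t = G_t - V(s_t)
$$

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad
L^{V} = \mathbb{E}_t\left[\left(G_t - V(s_t)\right)^2\right]
$$

The Actor is trained to maximize $L^{\text{CLIP}}$ (equivalently, to minimize its negation), and the Critic is trained to minimize $L^{V}$.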

> Pro

- Stability and robustness: clipping the probability ratio limits how far a single update can move the policy

> Con

- Sample inefficiency: PPO is on-policy, so each update needs data freshly collected by the current policy

```
class Actor(nn.Module):
  def __init__(self):
    - Initialize neural network that maps state -> hidden layer -> action probability

  def forward(self, x):
    - Feed the input state through the neural network and apply softmax to the output to get the action probability
    - Return the action probability
```
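
A minimal PyTorch sketch of the Actor outlined above. The layer sizes (`num_state=4`, `num_action=2`, `hidden=100`) and the attribute names are assumptions chosen for a CartPole-sized problem, not values taken from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""

    def __init__(self, num_state=4, num_action=2, hidden=100):
        # Sizes are assumed defaults for illustration only.
        super().__init__()
        self.fc1 = nn.Linear(num_state, hidden)
        self.action_head = nn.Linear(hidden, num_action)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        # Softmax over the action dimension so each row sums to 1.
        return F.softmax(self.action_head(x), dim=1)


actor = Actor()
state = torch.zeros(1, 4)                      # a batch containing one dummy state
probs = actor(state)                           # shape (1, 2)
dist = torch.distributions.Categorical(probs)  # categorical distribution over actions
action = dist.sample()                         # tensor of shape (1,)
action_prob = probs[0, action.item()].item()   # probability of the sampled action
```

The last four lines show one way the output might be used to sample an action and record its probability for the later ratio calculation.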

```
class Critic(nn.Module):
  def __init__(self):
    - Initialize neural network that maps state -> hidden layer -> expected reward
    
  def forward(self, x):
    - Feed the input state through the neural network
    - Return the expected reward
```
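
A matching sketch of the Critic, under the same assumed sizes (`num_state=4`, `hidden=100`). The only structural difference from the Actor sketch is the single unbounded output in place of a softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Maps a state vector to a single scalar: the expected return from that state."""

    def __init__(self, num_state=4, hidden=100):
        # Sizes are assumed defaults for illustration only.
        super().__init__()
        self.fc1 = nn.Linear(num_state, hidden)
        self.state_value = nn.Linear(hidden, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        # No output activation: a return can be any real number.
        return self.state_value(x)


critic = Critic()
values = critic(torch.zeros(3, 4))  # a batch of three dummy states -> shape (3, 1)
```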

```
class PPO():
  def __init__(self):
    - Initialize Actor and Critic networks, Actor and Critic optimizers, and buffer to store training data

  def select_action(self, state):
    - Feed the state through the Actor network and get the action probability
    - Build a categorical distribution from the action probability and sample an action from it
    - Return the selected action and its action probability

  def get_value(self, state):
    - Feed the state through the Critic network
    - Return the expected reward

  def save_param(self):
    - Save the Actor and Critic networks' weights

  def store_transition(self, transition):
    - Store the transition into the storage

  def update(self, i_ep):
    - Get the state tensor, action tensor, reward tensor, and old action probability tensor (the probability returned by select_action when the data was collected) from the transitions in the storage
    - Calculate the discounted return for each time step, Gt = r_t + gamma * r_t+1 + gamma^2 * r_t+2 + ..., summed to the end of the episode. The returns are stored as a tensor [r0 + gamma * r1 + gamma^2 * r2 + ..., r1 + gamma * r2 + gamma^2 * r3 + ..., ...], one entry per time step. Gt represents how much discounted reward the Actor model achieved from time step t to the end of the game
    - Run the following update loop some number of times
    # Update Loop
    - 1. Sample a random batch of indices from the storage; the items at those indices form one batch of data
    - 2. Fetch the Gt values at those indices; these represent the actual return achieved
    - 3. Feed the states at those indices to the Critic model, and get the expected return
    - 4. Subtract the expected return from the Gt value. The result, called the advantage, represents how much better (or worse) the Actor model performed than the Critic expected from that state
    - 5. Feed the state into the Actor network, and get the probability of the action that was actually taken; call this the new action probability
    - 6. Calculate the ratio as new action probability / old action probability, where the old action probability is the one stored when the data was collected. The ratio measures how much the Actor model has changed since the data was collected (in the first loop, the ratio will be close to 1 since the model hasn't been updated yet)
    - 7. Calculate the first surrogate value by multiplying the ratio by the advantage. This measures how much more (or less) likely the updated Actor is to choose the action, weighted by the "advantage" amount of extra reward that the action brought ("advantage" can be positive or negative)
    - 8. Calculate the second surrogate value by clamping the ratio to the range [1 - clip parameter, 1 + clip parameter] and then multiplying by the advantage. This serves the same purpose as the first surrogate value, except that the ratio is confined to a hard range, which removes the incentive to move the policy far from the one that collected the data
    - 9. Take the element-wise minimum of the first and second surrogate values, then take the negated mean (all the values above are tensors over a batch of data). Minimizing this negated value maximizes the surrogate: (1) with a positive advantage the ratio is pushed up, but the gain stops once the ratio exceeds 1 + clip parameter; (2) with a negative advantage the ratio is pushed down, but the gain stops once the ratio falls below 1 - clip parameter. The model is directed towards favoring high-reward actions without straying far from the original policy
    - 10. Apply backpropagation to the Actor network and step its optimizer
    - 11. Calculate the Critic loss as the mean squared error between the Gt value and the expected value, then backpropagate it through the Critic network. A larger gap between the Gt value (actual return) and the expected value gives a higher loss and a larger adjustment to the Critic network
```
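
The arithmetic in `update` can be checked by hand on a tiny example. The sketch below uses plain Python lists instead of tensors so every number is easy to follow; the function names and all input values are made up for illustration. A tensor version would typically use `torch.clamp` and `torch.min` for the same steps.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for r in reversed(rewards):       # walk backwards from the end of the episode
        running = r + gamma * running
        returns.insert(0, running)
    return returns


def clipped_surrogate_loss(new_probs, old_probs, advantages, clip_param):
    """Negated mean of min(ratio * A, clamp(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for new_p, old_p, adv in zip(new_probs, old_probs, advantages):
        ratio = new_p / old_p
        clipped = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


def critic_mse(returns, values):
    """Mean squared error between actual returns and Critic predictions."""
    return sum((g - v) ** 2 for g, v in zip(returns, values)) / len(returns)


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
values = [1.0, 2.0, 1.0]                                  # made-up Critic predictions
advantages = [g - v for g, v in zip(returns, values)]     # [0.75, -0.5, 0.0]

actor_loss = clipped_surrogate_loss(
    new_probs=[0.9, 0.2, 0.5],   # made-up probabilities from the updated Actor
    old_probs=[0.5, 0.5, 0.5],   # made-up probabilities stored at collection time
    advantages=advantages,
    clip_param=0.2,
)
critic_loss = critic_mse(returns, values)
```

In the first sample the ratio is 1.8 with a positive advantage, so the clipped value (1.2 * 0.75) is the smaller one and is used. In the second the ratio is 0.4 with a negative advantage, so the clipped value (0.8 * -0.5) is again the smaller one. In both cases the Actor gains nothing from moving the ratio further outside the clip range.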

```
def main():
  - Run the following training loop for some number of epochs
  # Training loop
  - Collect data from the environment until the game ends
  - When the game is over, update the agent using the collected data
```
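
The collect-then-update structure of `main` can be sketched without any RL library. `ToyEnv` and `StubAgent` below are hypothetical stand-ins (not the notebook's environment or `PPO` class) that exist only to make the loop runnable. The transition layout and the clearing of the buffer after each update are also assumptions.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: every episode lasts exactly 5 steps."""

    def reset(self):
        self.t = 0
        return 0.0  # dummy initial state

    def step(self, action):
        self.t += 1
        done = self.t >= 5
        return float(self.t), 1.0, done  # next_state, reward, done


class StubAgent:
    """Hypothetical stand-in for the PPO class; it only records what it is given."""

    def __init__(self):
        self.buffer = []
        self.updates = 0

    def select_action(self, state):
        return random.choice([0, 1]), 0.5  # action, probability of that action

    def store_transition(self, transition):
        self.buffer.append(transition)

    def update(self, i_ep):
        self.updates += 1
        self.buffer.clear()  # assumed: on-policy data is discarded after an update


def main(num_epochs=3):
    env, agent = ToyEnv(), StubAgent()
    for i_ep in range(num_epochs):
        # Collect data from the environment until the game ends.
        state = env.reset()
        done = False
        while not done:
            action, action_prob = agent.select_action(state)
            next_state, reward, done = env.step(action)
            agent.store_transition((state, action, action_prob, reward, next_state))
            state = next_state
        # The game is over: update the agent using the collected data.
        agent.update(i_ep)
    return agent


agent = main()
```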