<a href="https://colab.research.google.com/github/gitHubAndyLee2020/OpenAI_Gym_RL_Algorithms_Database/blob/main/ActorCritic_Module.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Actor-Critic

> About

- Given a state, the Actor-Critic model outputs two things: the probability of taking each possible action (the actor) and the state value, i.e. the expected return from that state (the critic)
- The difference between the actual return observed after taking the selected action and the predicted state value is called the advantage; it scales the loss so that the neural network raises the probability of actions that did better than expected and lowers the probability of actions that did worse
- Same concept as Policy Gradient, except that subtracting the state value as a baseline reduces the variance of the gradient estimate, which stabilizes training
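
In symbols, the idea above can be sketched as follows. The notation ($R_t$ for the discounted return from step $t$, $V(s_t)$ for the predicted state value, $\pi(a_t \mid s_t)$ for the probability of the selected action, $A_t$ for the advantage) is introduced here for illustration and is not taken from the notebook:

$$
R_t = \sum_{k \ge 0} \gamma^k r_{t+k}, \qquad A_t = R_t - V(s_t), \qquad L_{\text{policy}} = -\sum_t A_t \, \log \pi(a_t \mid s_t)
$$

Plain Policy Gradient corresponds to using $R_t$ in place of $A_t$ in the last expression.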

> Pro

- Variance Reduction; more stable training than Policy Gradient

> Con

- Sensitivity to Hyperparameters; more sensitive to hyperparameter choices than Policy Gradient

```
class Policy(nn.Module):
  def __init__(self):
    - Initialize a neural network with two mappings: one from the state space to the action space, and one from the state space to a single state value
    - The state value represents the estimated return (discounted future reward) from the current state
    - Initialize storage for saving actions and rewards

  def forward(self, x):
    - Feed the state input through both the state-to-action-probability mapping and the state-to-state-value mapping
    - Apply softmax to the action output to get the action probabilities, then return the action probabilities and the state value
```
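
The outline above might look roughly like the following in PyTorch. This is a minimal sketch, not necessarily the notebook's exact network: the shared hidden layer, the layer sizes (`state_dim=4`, `n_actions=2`, `hidden=128`), and the attribute names are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    """Actor head and critic head on top of one shared hidden layer (assumed layout)."""

    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.shared = nn.Linear(state_dim, hidden)        # shared feature layer
        self.action_head = nn.Linear(hidden, n_actions)   # actor: state -> action scores
        self.value_head = nn.Linear(hidden, 1)            # critic: state -> state value
        # Per-episode storage, filled during data collection
        self.saved_actions = []
        self.rewards = []

    def forward(self, x):
        x = F.relu(self.shared(x))
        # Softmax turns raw action scores into probabilities that sum to 1
        action_prob = F.softmax(self.action_head(x), dim=-1)
        state_value = self.value_head(x)
        return action_prob, state_value


policy = Policy()
probs, value = policy(torch.zeros(4))  # dummy 4-dimensional state
```

Sharing the hidden layer lets both heads reuse the same state features; two fully separate networks would also fit the outline.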

```
def select_action(state):
  - Feed the state to the Policy network to get the action probabilities and the state value
  - Sample an action from the action probabilities; an action with probability p has a p chance of being selected
  - Store the selected action and state value pair in the storage
  - Return the selected action
```
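
A hedged sketch of this sampling step, using `torch.distributions.Categorical`. `TinyPolicy` is a hypothetical stand-in network defined only so the block runs on its own, and storing the log probability of the action (rather than the raw action) alongside the state value is an assumption made because the loss later needs the log probability.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class TinyPolicy(nn.Module):
    """Hypothetical stand-in: 4-dimensional state, 2 actions, no hidden layer."""

    def __init__(self):
        super().__init__()
        self.action_head = nn.Linear(4, 2)
        self.value_head = nn.Linear(4, 1)
        self.saved_actions = []  # (log_prob, state_value) pairs for the current episode

    def forward(self, x):
        return torch.softmax(self.action_head(x), dim=-1), self.value_head(x)


policy = TinyPolicy()


def select_action(state):
    state = torch.as_tensor(state, dtype=torch.float32)
    probs, state_value = policy(state)
    dist = Categorical(probs)   # action i is drawn with probability probs[i]
    action = dist.sample()
    # Keep what the loss needs later: log pi(a|s) and V(s)
    policy.saved_actions.append((dist.log_prob(action), state_value))
    return action.item()


action = select_action([0.0, 0.1, -0.2, 0.3])
```

Sampling, rather than always taking the most probable action, is what keeps the agent exploring.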

```
def finish_episode():
  - Calculate the discounted return at each time step with the formula R = r0 + gamma * r1 + gamma^2 * r2 + ...
  - Normalize the returns
  - Calculate the policy loss using the following steps:
    1. Compute the difference between the actual return and the state value, a.k.a. the predicted return. The difference represents the advantage of the action compared to the average return expected from that state
    2. The advantage is then multiplied by the negated log probability of the selected action to get the loss value. Lowering this loss raises the log probability of actions with positive advantage and lowers it for actions with negative advantage, so the neural network is adjusted toward actions that result in higher reward
  - Backpropagation is applied
```
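
The loss computation above can be sketched as below. Function names and signatures are hypothetical. Two parts are assumptions rather than something the outline states: the critic term (a smooth L1 loss between state value and return, added so the value mapping has a training signal) and the `detach()` on the values when forming the advantage, which is a common choice that keeps the policy loss from updating the critic.

```python
import torch
import torch.nn.functional as F


def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., computed back to front."""
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return torch.tensor(returns)


def actor_critic_loss(rewards, log_probs, values, gamma=0.99, eps=1e-8):
    # Needs at least two steps, otherwise the standard deviation is undefined
    returns = discounted_returns(rewards, gamma)
    returns = (returns - returns.mean()) / (returns.std() + eps)  # normalize
    advantages = returns - values.detach()           # actual return minus predicted value
    policy_loss = (-log_probs * advantages).sum()    # negated log prob times advantage
    value_loss = F.smooth_l1_loss(values, returns, reduction="sum")  # assumed critic term
    return policy_loss + value_loss


# Dummy three-step episode
p = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)
values = torch.zeros(3, requires_grad=True)
loss = actor_critic_loss([1.0, 1.0, 1.0], torch.log(p), values)
loss.backward()  # gradients reach both the action probabilities and the values
```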

```
def main():
  - During the data collection loop, selected actions and rewards are stored
  - The data collection loop runs until the episode ends, e.g. when the agent fails in the environment
  - Afterwards, the collected data is used by the finish_episode function defined above, where the model is trained
```
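
The control flow of this loop can be illustrated without a real environment. Everything below is a hypothetical stand-in: `ToyEnv` is not a Gym environment (its `step` returns only state, reward, and a done flag), and `select_action` and `finish_episode` are passed in as plain callables with assumed signatures so that the sketch runs on its own.

```python
import random


class ToyEnv:
    """Hypothetical environment: each step has a 10% chance of failure, capped at max_steps."""

    def __init__(self, max_steps=20, seed=0):
        self.rng = random.Random(seed)
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        failed = self.rng.random() < 0.1
        done = failed or self.t >= self.max_steps
        return [float(self.t)], 1.0, done  # next state, reward, episode finished?


def collect_episode(env, select_action):
    """Run one episode, storing the selected actions and the rewards."""
    state, actions, rewards, done = env.reset(), [], [], False
    while not done:
        action = select_action(state)
        state, reward, done = env.step(action)
        actions.append(action)
        rewards.append(reward)
    return actions, rewards


def train(env, select_action, finish_episode, n_episodes=3):
    lengths = []
    for _ in range(n_episodes):
        actions, rewards = collect_episode(env, select_action)
        finish_episode(actions, rewards)  # the model update would happen here
        lengths.append(len(rewards))
    return lengths


# Placeholder callables: always pick action 0, and skip the update
lengths = train(ToyEnv(), lambda state: 0, lambda actions, rewards: None)
```

In the real loop the placeholders would be replaced by the policy-based `select_action` and the loss-based `finish_episode` described above.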