<a href="https://colab.research.google.com/github/gitHubAndyLee2020/OpenAI_Gym_RL_Algorithms_Database/blob/main/A2C_Module.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### A2C

> About

- Given a state, the A2C model outputs the probability of taking each possible action (the actor) and the expected return from that state (the critic)
- The A2C model is applied to multiple environments, each with a different starting state, to collect more data in parallel
- The difference between the discounted return actually observed after taking the selected actions and the return the critic expected (the advantage) is used to calculate the loss for each environment, and the mean of the losses is used to train the A2C model. This trains the neural network to increase the probability of actions that produce a higher return than expected
- Same concept as Actor-Critic, except that the Actor-Critic network is applied to multiple environments simultaneously

> Pro

- Parallelism; more data collected simultaneously

> Con

- Complexity

```
Assume multiprocessing_env module is implemented already
```

```
def make_env():
  def _thunk():
    - Create environment and return it
  - Return the function to create environment
```
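
The thunk pattern outlined above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: `ToyEnv` is a hypothetical stand-in for a real Gym environment, and the `env_name` parameter and the `"Toy-v0"` string are assumptions added for the example.

```python
class ToyEnv:
    """Hypothetical stand-in for a Gym environment (not a real library class)."""

    def __init__(self, name):
        self.name = name


def make_env(env_name="Toy-v0"):
    # Return a zero-argument function (a "thunk") instead of an environment,
    # so that each worker process can build its own copy later.
    def _thunk():
        return ToyEnv(env_name)

    return _thunk


# Build the list of thunks first; no environment exists yet at this point.
env_fns = [make_env() for _ in range(4)]

# Each call creates a separate, independent environment instance.
envs = [fn() for fn in env_fns]
```

Returning a function rather than an environment lets the multi-process wrapper decide where and when each environment is actually created.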

```
# Before model definition
- Initialize multiple environments
- Convert them into multi-processed environments
- Define test environment
```

```
class ActorCritic(nn.Module):
  def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0):
    - Define critic network that maps inputs -> hidden layer -> estimated value of the state (a single number: the expected return)
    - Define actor network that maps inputs -> hidden layer -> action probabilities

  def forward(self, x):
    - Feed the input through the critic network and the actor network, and wrap the action probabilities in a categorical distribution (ex: P(A) = 0.2, P(B) = 0.3, etc...)
    - Return the estimated value and the categorical action distribution
```
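
To make the "categorical distribution" step concrete, here is a small sketch in plain Python of what such a distribution provides: probabilities from raw actor outputs, a sampled action, its log probability, and the entropy. It is a stand-in for illustration only and does not use PyTorch; the helper names and the example logits are assumptions.

```python
import math
import random


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    # Draw an action index with probability proportional to probs.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def log_prob(probs, action):
    # Log probability of the chosen action; always <= 0.
    return math.log(probs[action])


def entropy(probs):
    # Shannon entropy: high when the distribution is close to uniform.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


rng = random.Random(0)
probs = softmax([1.0, 2.0, 0.5])  # hypothetical actor outputs for 3 actions
action = sample_action(probs, rng)
```

The log probability and entropy computed here are the two quantities the training loop stores at every step.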

```
def test_env(vis=False):
  - Test the trained model on the test environment
  - Run the model on the environment with the actions generated by the Actor-Critic network until the game is done
  - Return the total reward
```
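
The evaluation loop described above might look roughly like the following. `CountdownEnv` is a hypothetical toy environment whose `step` returns `(state, reward, done)`, and `policy` stands in for sampling an action from the Actor-Critic network; neither comes from the notebook.

```python
class CountdownEnv:
    """Hypothetical toy environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is just the step counter

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done


def run_test_episode(env, policy):
    # Play one episode with the given policy and return the total reward.
    state = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


def always_one(state):
    # Stand-in for sampling an action from the actor.
    return 1


total = run_test_episode(CountdownEnv(horizon=5), always_one)
```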

```
def compute_returns(next_value, rewards, masks, gamma=0.99):
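  - Starting from R = next_value (the critic's value estimate for the state after the last collected step), work backwards through the rewards with R = r_t + gamma * R * mask_t. This gives R_t = r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ..., and a mask of 0 resets the running sum so that rewards from after an episode ends are not counted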
  - Return the computed return values
```
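
A minimal sketch of the backward recursion, written for a single environment with plain floats (the batched, tensor-based version is assumed to follow the same pattern). The example numbers are made up for illustration.

```python
def compute_returns(next_value, rewards, masks, gamma=0.99):
    # Start from the critic's estimate for the state after the last step.
    R = next_value
    returns = []
    # Walk backwards so each return can reuse the one after it.
    for step in reversed(range(len(rewards))):
        # A mask of 0 cuts the link to later steps (the episode ended here).
        R = rewards[step] + gamma * R * masks[step]
        returns.insert(0, R)
    return returns


# No episode ends: every return bootstraps from next_value.
no_done = compute_returns(10.0, [1.0, 1.0, 1.0], [1, 1, 1], gamma=0.5)
# -> [3.0, 4.0, 6.0]

# The episode ends at step 1: steps 0 and 1 ignore everything after it.
with_done = compute_returns(10.0, [1.0, 1.0, 1.0], [1, 0, 1], gamma=0.5)
# -> [1.5, 1.0, 6.0]
```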

```
# Repeat the training loop and the post-training update below until the number of frames played exceeds the max frame limit
# Multiple environments are used during data collection to speed up the process
# Pre Training Loop
- Reset the environments; each environment will have a slightly different starting state
# Main Training Loop
- For x steps, feed the state of each environment to the Actor-Critic network, and sample an action from the categorical action distribution
- Apply each action to its environment, and get the resulting reward for each environment
- Store the reward, action log probability, entropy, estimated value, and mask for each environment. The mask is 0 at steps where the environment returned "done" and 1 otherwise. This acts to ignore rewards after the agent is done in the environment. Entropy is a measure of the uncertainty and randomness in the action distribution
- Every y frames, run the test_env function above to test the performance of the current Actor-Critic network
# Post Training Loop
- Continuing from the last state of each environment in the training loop, feed it to the Actor-Critic network and get the estimated values; named "next_value" in the code
- Use "next_value", the collected rewards, and the masks to calculate the temporally discounted returns with the compute_returns function above
- Subtract the collected estimated values from the discounted returns to get the advantage for each environment and step, which represents how much better or worse the current Actor did than the return the Critic expected
- Actor loss is the mean of each negated log probability multiplied by its advantage. A higher positive advantage results in a higher loss, so the optimizer adjusts the neural network more strongly towards giving high-advantage actions a higher probability
- Critic loss is the mean of the squared advantages. This is the Mean Squared Error of the Critic network's predictions, which the network will try to minimize
- Actor loss and Critic loss are added together, and the entropy is subtracted. A larger advantage in absolute value increases the size of both the Actor update and the Critic update, which suits both networks (the Actor network shifts more towards high-reward actions, the Critic network corrects more for predictions that were further off).
- Entropy is subtracted from the combined loss because (1) exploration is encouraged, (2) high entropy means more uncertainty in the model's choice of actions, and (3) higher entropy decreases the loss value, which rewards the model for staying exploratory
- Apply backpropagation and update the network weights
```
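
The loss arithmetic in the post-training steps can be checked by hand with a few numbers. The sketch below uses plain Python floats for one environment over three steps; all values are invented, and the weighting coefficients `value_coef` and `entropy_coef` are assumptions, since the text above does not specify any.

```python
import math

# Hypothetical rollout from ONE environment over three steps.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]  # log p(a_t | s_t)
values = [1.0, 2.0, 0.5]        # critic estimates for each state
returns = [2.0, 1.0, 0.5]       # discounted returns, e.g. from compute_returns
entropies = [0.69, 1.04, 0.50]  # entropy of each action distribution


def mean(xs):
    return sum(xs) / len(xs)


# Advantage: how much better the outcome was than the critic expected.
advantages = [r - v for r, v in zip(returns, values)]

# Actor loss: mean of -log p(a|s) * advantage.
# (In an autograd framework the advantage is usually treated as a constant here.)
actor_loss = mean([-lp * adv for lp, adv in zip(log_probs, advantages)])

# Critic loss: mean squared advantage (MSE between returns and values).
critic_loss = mean([adv ** 2 for adv in advantages])

# Assumed weighting coefficients, chosen only for this example.
value_coef = 0.5
entropy_coef = 0.001

# Subtracting the entropy term lowers the loss for more exploratory policies.
loss = actor_loss + value_coef * critic_loss - entropy_coef * mean(entropies)
```

Here the first step has a positive advantage and contributes a positive actor-loss term, the second has a negative advantage and contributes a negative term, and the third matches the critic's estimate and contributes nothing.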

```
# Why does a higher advantage lead to a higher loss, and why does a higher loss lead to the model favoring those actions?
# The actor loss term is -log p(a∣s) * advantage. Because log p(a∣s) is negative, a higher positive advantage makes this term more positive.
Since the optimizer aims to minimize this loss, it adjusts the parameters so that log p(a∣s) becomes less negative, which makes p(a∣s) larger. The probability of taking action a in state s therefore increases, making it more likely to be chosen in the future.
```
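
The argument above can be verified numerically with a single hand-written gradient step. This sketch assumes a softmax policy over three actions with the logits as the only parameters, and an arbitrary learning rate of 0.1; it is an illustration, not the notebook's training code.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_step(logits, action, advantage, lr=0.1):
    # Loss = -advantage * log softmax(logits)[action].
    # Its gradient w.r.t. logit i is -advantage * (1[i == action] - p_i),
    # so one gradient-descent step adds lr * advantage * (1[i == action] - p_i).
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]  # uniform policy over 3 actions
before = softmax(logits)[1]

# Positive advantage: the probability of action 1 goes up.
after_pos = softmax(actor_step(logits, action=1, advantage=2.0))[1]

# Negative advantage: the probability of action 1 goes down.
after_neg = softmax(actor_step(logits, action=1, advantage=-2.0))[1]
```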