# Neural Networks used in PPO

**Important**: the following is valid only for Ray 2.40.0!

I use PPO to perform experiments with the DFaaS environment. Since I have adopted a decentralized training approach for this multi-agent problem, each trained agent is associated with a policy that has two artificial neural networks, one for the actor and one for the critic. This notebook explores the structure of these networks.

Main page of the official documentation: [Models, Preprocessors, and Action Distributions](https://docs.ray.io/en/releases-2.40.0/rllib/rllib-models.html#models-preprocessors-and-action-distributions).

If no option is given, the default neural network is a fully connected network. This network is configurable, and the defaults are stored in the `MODEL_DEFAULTS` dictionary in [rllib/models/catalog.py](https://github.com/ray-project/ray/blob/887eddd9245c77adc5684c78410400327d266427/rllib/models/catalog.py#L52).

The fully connected network in Ray is represented by the `FullyConnectedNetwork` class in [rllib/models/torch/fcnet.py](https://github.com/ray-project/ray/blob/releases/2.40.0/rllib/models/torch/fcnet.py).

With the default options:

* Two hidden layers with 256 neurons each,
* [tanh](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html) as the activation function for the hidden layers,
* Linear activation function for the output layer.

Both the actor and the critic networks use this architecture. They differ in the output layer: the actor outputs the parameters of the action distribution, while the critic outputs a single state value.
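To get a feeling for the size of these default networks, the following sketch counts the trainable parameters (weights plus biases) of a fully connected network with two hidden layers of 256 neurons. The input size `obs_dim` and the actor output size `num_action_outputs` are hypothetical values chosen only for illustration; the real values depend on the DFaaS observation and action spaces.

```python
from typing import Sequence


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def fcnet_params(obs_dim: int, hiddens: Sequence[int], num_outputs: int) -> int:
    """Total trainable parameters of a plain fully connected network."""
    sizes = [obs_dim, *hiddens, num_outputs]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


# Hypothetical sizes (assumptions, not taken from the DFaaS environment).
obs_dim = 10
num_action_outputs = 3

actor = fcnet_params(obs_dim, [256, 256], num_action_outputs)
critic = fcnet_params(obs_dim, [256, 256], 1)  # The critic outputs one value.

print(f"Actor parameters  = {actor}")   # 69379
print(f"Critic parameters = {critic}")  # 68865
```

Most of the parameters sit in the 256x256 connection between the two hidden layers, so the totals change little with the input and output sizes.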

In the `FullyConnectedNetwork` class, the neural networks are stored in the following object variables:

* `_logits`: the output layer of the actor network;
* `_hidden_layers`: the hidden layers of the actor network;
* `_value_branch_separate`: the hidden layers of the critic network (built only when `vf_share_layers` is `False`, otherwise the critic reuses `_hidden_layers`);
* `_value_branch`: the output layer of the critic network.

## Default neural networks

In [None]:
# Common imports.
from pathlib import Path
from pprint import pformat as pf
from pprint import pp

import base

import numpy as np

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.policy import PolicySpec
from ray.rllib.models.catalog import MODEL_DEFAULTS

import dfaas_env

In [None]:
# Create a dummy environment, used to get observation and action spaces.
dummy_env = dfaas_env.DFaaS()
dummy_agent = dummy_env.agents[0]

# Normally we would have one policy for each agent, but in this simplified
# version we only need to show the network architecture, so one policy for
# all agents is sufficient.
policies = {
    "policy_node_x": PolicySpec(
        policy_class=None,
        observation_space=dummy_env.observation_space[dummy_agent],
        action_space=dummy_env.action_space[dummy_agent],
        config=None,
    )
}


# Link the single policy to all agents.
def policy_mapping_fn(agent_id, episode, runner, **kwargs):
    return "policy_node_x"


# Algorithm config.
ppo_config = (
    PPOConfig()
    # By default RLlib uses the new API stack, but I use the old one.
    .api_stack(
        enable_rl_module_and_learner=False, enable_env_runner_and_connector_v2=False
    )
    .environment(env=dfaas_env.DFaaS.__name__)
    .framework("torch")
    .env_runners(num_env_runners=0)
    .evaluation(evaluation_interval=None)
    .resources(num_gpus=1)
    .callbacks(dfaas_env.DFaaSCallbacks)
    .multi_agent(policies=policies, policy_mapping_fn=policy_mapping_fn)
)

# Build the algorithm.
ppo_algo = ppo_config.build()

# Get the (only) policy.
policy = ppo_algo.get_policy("policy_node_x")

In [None]:
agent_observation_space = dummy_env.observation_space[dummy_agent]
agent_action_space = dummy_env.action_space[dummy_agent]

print(f"Agent observation space = {pf(dict(agent_observation_space))}\n")
print("Agent action space =", agent_action_space)

In [None]:
print(f"Policy observation space = {policy.observation_space}\n")
print(
    f"Policy original observation space = {pf(dict(policy.observation_space.original_space))}\n"
)

print(f"Policy Action space = {policy.action_space}\n")

print(f"Model\n{policy.model}")

**Note**: The input to the networks is pre-processed, which is why the policy's observation space and the agent's observation space are different. 
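As a rough illustration of this pre-processing, the sketch below flattens a dictionary observation into a single vector, one-hot encoding the discrete entries. It is not RLlib's implementation: the function names, the observation keys (`queue_level`, `load`) and the sizes are all hypothetical, and the sketch only assumes that dictionary spaces are flattened key by key and that discrete values are one-hot encoded.

```python
def one_hot(index: int, n: int) -> list:
    """Encode a discrete value in [0, n) as a one-hot vector."""
    vec = [0.0] * n
    vec[index] = 1.0
    return vec


def flatten_observation(obs: dict, discrete_sizes: dict) -> list:
    """Concatenate all entries (in sorted key order) into one flat vector."""
    flat = []
    for key in sorted(obs):
        value = obs[key]
        if key in discrete_sizes:
            # Discrete entry: one-hot encode it.
            flat.extend(one_hot(value, discrete_sizes[key]))
        else:
            # Continuous entry: copy the values as floats.
            flat.extend(float(v) for v in value)
    return flat


# Hypothetical observation with one discrete and one continuous entry.
obs = {"queue_level": 2, "load": [0.5, 0.25]}
discrete_sizes = {"queue_level": 4}

flat = flatten_observation(obs, discrete_sizes)
print(flat)  # [0.5, 0.25, 0.0, 0.0, 1.0, 0.0]
```

The flat vector has six elements even though the original observation has only two entries, which is the kind of difference visible between the two observation spaces printed above.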

This default neural network is not suitable for the DFaaS environment for three reasons:

1. The observation space is discrete (finite) and the action space is continuous (the concentration parameters), so more hidden layers may help the agent learn complex patterns when distributing requests.

2. The network output (actor only) is a vector of concentration parameters for a Dirichlet distribution. These parameters must be strictly positive, but since the output layer (`_logits`) has a linear activation function, the network can produce negative values. The [Softplus](https://pytorch.org/docs/stable/generated/torch.nn.Softplus.html#torch.nn.Softplus) activation function is more appropriate.

3. The same is true for the critic network, since the reward function returns non-negative rewards.
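The effect of Softplus on the actor output can be sketched with the standard library alone. The raw outputs below are hypothetical values standing in for what a linear output layer might produce. Softplus, defined as `log(1 + exp(x))`, maps them to strictly positive concentration parameters, which can then be used to sample from a Dirichlet distribution (here by normalizing Gamma draws, a standard construction and not necessarily how RLlib samples).

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is always > 0."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_dirichlet(concentration, rng):
    """Sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in concentration]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical raw outputs of a linear output layer (one is negative).
raw_outputs = [-1.5, 0.0, 2.0]

# Softplus turns every raw output into a valid concentration parameter.
concentration = [softplus(x) for x in raw_outputs]

rng = random.Random(0)
action = sample_dirichlet(concentration, rng)

print("Concentration =", concentration)
print("Action =", action, "sum =", sum(action))
```

Passing the raw value `-1.5` directly as a concentration parameter would be invalid (`random.gammavariate` raises a `ValueError` for non-positive parameters), which is the problem the Softplus output layer avoids.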

## Updated neural networks

The model can be customized using the `custom_model_config` subdictionary. The `last_activation_fn` key specifies the activation function for the output layer of the actor network.

Note that I have modified the `FullyConnectedNetwork` class to support this new functionality, since the original class forces a linear activation function on the output layer.

The following code specifies a model that uses the Softplus function.

In [None]:
# Create a dummy environment, used to get observation and action spaces.
dummy_env = dfaas_env.DFaaS()
dummy_agent = dummy_env.agents[0]

# Normally we would have one policy for each agent, but in this simplified
# version we only need to show the network architecture, so one policy for
# all agents is sufficient.
policies = {
    "policy_node_x": PolicySpec(
        policy_class=None,
        observation_space=dummy_env.observation_space[dummy_agent],
        action_space=dummy_env.action_space[dummy_agent],
        config=None,
    )
}


# Link the single policy to all agents.
def policy_mapping_fn(agent_id, episode, runner, **kwargs):
    return "policy_node_x"


# Customize the default model.
model = MODEL_DEFAULTS.copy()
model["custom_model_config"] = {"last_activation_fn": "Softplus"}
model["vf_share_layers"] = False

# Algorithm config.
ppo_config = (
    PPOConfig()
    # By default RLlib uses the new API stack, but I use the old one.
    .api_stack(
        enable_rl_module_and_learner=False, enable_env_runner_and_connector_v2=False
    )
    .environment(env=dfaas_env.DFaaS.__name__)
    .training(model=model)
    .framework("torch")
    .env_runners(num_env_runners=0)
    .evaluation(evaluation_interval=None)
    .resources(num_gpus=1)
    .callbacks(dfaas_env.DFaaSCallbacks)
    .multi_agent(policies=policies, policy_mapping_fn=policy_mapping_fn)
)

# Build the algorithm.
ppo_algo = ppo_config.build()

# Get the (only) policy.
policy = ppo_algo.get_policy("policy_node_x")

In [None]:
print(f"Model\n{policy.model}")