# MARL training and deployment

We use CommonPower to create a simulation of a power system where one node (corresponding to a multi-family household) is controlled by the RL agent. Since RL does not natively handle constraints, such as a minimum state of charge of a battery, we have implemented a safety layer that wraps the RL agent. It extracts all necessary constraints from the power system model and checks whether a control action suggested by the agent is safe. If necessary, the safety layer adjusts the action before passing it on to the simulation. The agent then receives feedback informing it about the adjustment of its action.

In this notebook, you will learn how to 
- use CommonPower to import a network topology from [pandapower](http://www.pandapower.org/)
- add components like energy storage systems to a network
- set up a decentralized control scheme with multiple RL agents
- assign nodes to the agents
- train the RL agents
- monitor the training process using TensorBoard

##### Prerequisites
1. Install the requirements in `Readme.md`
2. Optional (if you want to track the learning processes using Weights&Biases): Sign up for the [academic version of Weights&Biases](https://wandb.ai/site/research).
3. Catch up on the basics of [Deep Reinforcement Learning (DRL)](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html) and multi-agent DRL with [MAPPO](https://github.com/TUMcps/on-policy), the multi-agent variant of PPO (we use our own fork).
4. Be familiar with [TensorBoard](https://www.tensorflow.org/tensorboard/get_started), a tool for tracking the training of machine learning models, and (optionally) [Weights&Biases](https://docs.wandb.ai/quickstart).

In [None]:
import pathlib
from datetime import datetime, timedelta
import numpy as np
from numpy.random import uniform, choice
from commonpower.models.components import *
from commonpower.models.buses import *
from commonpower.models.lines import *
from commonpower.models.powerflow import *
from commonpower.extensions.factories import Factory, Sampler
from commonpower.data_forecasting import *
from commonpower.modeling.param_initialization import RangeInitializer
from commonpower.utils.helpers import get_adjusted_cost
from commonpower.control.controllers import *
from commonpower.control.observation_handling import ObservationHandler
from commonpower.control.safety_layer.safety_layers import *
from commonpower.control.safety_layer.penalties import *
from commonpower.control.logging_utils.loggers import *
from commonpower.control.runners import *
from commonpower.control.wrappers import *
from commonpower.control.configs.algorithms import *
from commonpower.control.util import predicted_cost_callback

from commonpower.extensions.network_import import PandaPowerImporter
import pandapower.networks as pn

## System set-up
First, we have to define the energy system the RL agents should interact with. We will use a small network that we import from pandapower to get the network topology and characteristics (line admittances etc.). Then, we will add components to the load buses of this network. 

In [None]:
# control horizon and time resolution (24 hourly control steps per day)
horizon = timedelta(hours=24)
frequency = timedelta(minutes=60)

# path to data profiles
current_path = pathlib.Path().absolute()
data_path = current_path / 'data'
data_path = data_path.resolve()

ds1a = CSVDataSource(
    data_path / '1-LV-rural2--1-sw' / 'LoadProfile.csv',
    delimiter=";",
    datetime_format="%d.%m.%Y %H:%M",
    rename_dict={"time": "t", "H0-A_pload": "p", "H0-A_qload": "q"},
    auto_drop=True,
    resample=frequency).apply_to_column("p", lambda x: 10 * x).apply_to_column("q", lambda x: 0.0)

ds1b = CSVDataSource(
    data_path / '1-LV-rural2--1-sw' / 'LoadProfile.csv',
    delimiter=";",
    datetime_format="%d.%m.%Y %H:%M",
    rename_dict={"time": "t", "H0-A_pload": "p", "H0-A_qload": "q"},
    auto_drop=True,
    resample=frequency
).apply_to_column("p", lambda x: 10 * x).shift_time_series(timedelta(hours=24)).apply_to_column("q", lambda x: 0.0)

ds1c = CSVDataSource(
    data_path / '1-LV-rural2--1-sw' / 'LoadProfile.csv',
    delimiter=";",
    datetime_format="%d.%m.%Y %H:%M",
    rename_dict={"time": "t", "H0-B_pload": "p", "H0-B_qload": "q"},
    auto_drop=True,
    resample=frequency).apply_to_column("p", lambda x: 10 * x).apply_to_column("q", lambda x: 0.0)

ds2 = CSVDataSource(
    data_path / 'spot_prices_dk.csv',
    delimiter=";",
    decimal=",",
    datetime_format="%Y-%m-%d %H:%M",
    rename_dict={"HourUTC": "t", "SpotPriceEUR": "psi"},
    auto_drop=True,
    resample=frequency).apply_to_column("psi", lambda x: x / 1000)  # prices are EUR/MWh

ds3a = CSVDataSource(
    data_path / '1-LV-rural2--1-sw' / 'RESProfile.csv',
    delimiter=";",
    datetime_format="%d.%m.%Y %H:%M",
    rename_dict={"time": "t", "PV3": "p"},
    auto_drop=True,
    resample=frequency).apply_to_column("p", lambda x: -10 * x)

ds3b = CSVDataSource(
    data_path / '1-LV-rural2--1-sw' / 'RESProfile.csv',
    delimiter=";",
    datetime_format="%d.%m.%Y %H:%M",
    rename_dict={"time": "t", "PV7": "p"},
    auto_drop=True,
    resample=frequency).apply_to_column("p", lambda x: -10 * x)

dp1a = DataProvider(ds1a, NoisyForecaster(frequency=frequency, horizon=horizon))
dp1b = DataProvider(ds1b, NoisyForecaster(frequency=frequency, horizon=horizon))
dp1c = DataProvider(ds1c, NoisyForecaster(frequency=frequency, horizon=horizon))
dp2 = DataProvider(ds2, NoisyForecaster(frequency=frequency, horizon=horizon))
dp3a = DataProvider(ds3a, NoisyForecaster(frequency=frequency, horizon=horizon))
dp3b = DataProvider(ds3b, NoisyForecaster(frequency=frequency, horizon=horizon))

# We are using DC powerflow 
power_flow_mode = DCPowerFlowModel()

We use a Factory so we do not have to manually add components to each load bus.

In [None]:
# Individual system (decentralized control)
rand_seed = 888
np.random.seed(rand_seed)

ind_factory = Factory()

ind_factory.set_bus_template(
    RTPricedBusLinear,
    meta_config={
        "p": Sampler(uniform, low=[-1e3, 1e3], high=[-1e3, 1e3]),
        "q": Sampler(uniform, low=[-1e3, 1e3], high=[-1e3, 1e3]),
        "v": Sampler(uniform, low=[0.9, 1.1], high=[0.9, 1.1]),
        "d": Sampler(uniform, low=[-15, 15], high=[-15, 15])
    },
    data_providers=[dp2]
)

# add components to factory
# Load: base load of the household (corresponds to fridge, dishwasher, washing machine, etc.)
ind_factory.add_component_template(Load, probability=1., data_providers=[Sampler(choice, a=[dp1a, dp1b, dp1c])])

# RenewableGen: renewable generation (e.g., PV)
ind_factory.add_component_template(
    RenewableGen,
    probability=.5,
    meta_config={
        "p": Sampler(uniform, low=[-7, 0], high=[-7, 0]),
        "q": Sampler(uniform, low=[0, 0], high=[0, 0])
    },
    data_providers=[Sampler(choice, a=[dp3a, dp3b])]
)

# ESSLinear: energy storage system (e.g., battery) with highly simplified dynamics  (to reduce computation time)
ess_capa = 10  # kWh
ind_factory.add_component_template(
    ESSLinear,
    probability=0.5,
    meta_config={
        'rho': 0.1,
        'p': Sampler(uniform, low=[-5, 5], high=[-5, 5]),
        'q': Sampler(uniform, low=[0, 0], high=[0, 0]),
        'etac': 0.95,
        'etad': 0.95,
        'etas': 0.99,
        'soc': Sampler(uniform, low=[0.2 * ess_capa, 0.8 * ess_capa], high=[0.2 * ess_capa, 0.8 * ess_capa]),
        "soc_init": [RangeInitializer, {"lb": Sampler(uniform, low=0.3 * ess_capa, high=0.3 * ess_capa),
                                        "ub": Sampler(uniform, low=0.5 * ess_capa, high=0.5 * ess_capa)}]
    }
)
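
A note on the `Sampler` entries above: most of them pass identical `low` and `high` values, so the "sampled" bounds are in fact fixed (for example, the battery power is always limited to `[-5, 5]`). The sketch below mimics this with the standard library only. It is not CommonPower code: `sample_bounds` and `sample_components` are hypothetical helpers, and we assume that each component template is kept independently per bus with its `probability`.

```python
import random

rng = random.Random(888)  # seeded so the sketch is reproducible


def sample_bounds(low, high):
    """Element-wise uniform sample, mimicking uniform(low=[...], high=[...])."""
    return [rng.uniform(lo, hi) for lo, hi in zip(low, high)]


def sample_components(templates):
    """Keep each component template with its given probability (assumed behavior)."""
    return [name for name, prob in templates if rng.random() < prob]


# low == high, so the result is deterministic: fixed bounds [-5, 5]
ess_p_bounds = sample_bounds(low=[-5, 5], high=[-5, 5])
print(ess_p_bounds)

# Every bus gets a Load; PV and battery are each present in roughly half of the buses
templates = [("Load", 1.0), ("RenewableGen", 0.5), ("ESSLinear", 0.5)]
for bus in range(3):
    print(bus, sample_components(templates))
```

Only the state of charge initialization (`soc_init`) uses a genuine range here, namely 30 % to 50 % of the capacity.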

Now we load the network topology from pandapower and create a power system using the factory.

In [None]:
# initialize system
net = pn.create_kerber_landnetz_kabel_2()
ind_sys = PandaPowerImporter().import_net(
    net=net,
    power_flow_model=power_flow_mode,
    node_factory=ind_factory,
    restrict_factory_to="loadbus"
)

n999 = ExternalGrid("ExternalGrid")

# set node 1 (main_busbar) as external grid connection, i.e. trading node to import unbalanced quantities of power
ind_sys.add_node(n999, at_index=1)

# update the respective lines
ind_sys.lines[0].src = n999
ind_sys.lines[24].src = n999

# remove line from ext to main busbar
ind_sys.lines.pop(-1)
# remove node 0
ind_sys.nodes.pop(0)
# Show the system set-up
ind_sys.pprint()

## MARL Training
Now that we have the system set-up, we can add controllers to all buses that have controllable components. In this case, this means all buses with a battery storage system (ESSLinear). 

There are many hyperparameters for MARL training. The most important ones are `num_env_steps`, which determines the total length of the training in environment steps, and `episode_length`, which determines how many environment steps (and thus how many days of data) we collect before updating the policies. `episode_length` should be a multiple of the number of control steps in `horizon`. We use pydantic classes to handle the hyperparameters.
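
To get a feeling for these numbers, the following stand-alone sketch works through the arithmetic for the values used in this notebook. The variable names `steps_per_day` and `n_episodes` are ours, and the episode count assumes a single rollout environment with one policy update per collected episode.

```python
from datetime import timedelta

horizon = timedelta(hours=24)
frequency = timedelta(minutes=60)

# Number of control steps in one horizon (24 for hourly control over one day)
steps_per_day = int(horizon / frequency)

episode_length = 3 * steps_per_day     # 72 steps, i.e. three days of data per episode
num_env_steps = 1200 * steps_per_day   # 28800 environment steps in total

# The episode length should be a multiple of the steps per horizon
assert episode_length % steps_per_day == 0

# Number of collected episodes (assumed: one policy update per episode)
n_episodes = num_env_steps // episode_length
print(steps_per_day, episode_length, num_env_steps, n_episodes)
```

Note that the configuration cell computes the step count as `horizon.total_seconds() // 3600`, which matches `steps_per_day` only because the control frequency is one hour.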

In [None]:
# pydantic class
mappo_config = MAPPOBaseConfig(
    algorithm_name='mappo',
    seed=1,
    num_env_steps=1200 * int(horizon.total_seconds() // 3600),
    episode_length=3 * int(horizon.total_seconds() // 3600),
    penalty_factor=2.0
)

In [None]:
# add individual RL controllers
for i in range(len(ind_sys.nodes)):
    # will also add a controller to households which do not have inputs (e.g., households with only a Load component),
    # but these are disregarded when the system is initialized
    _ = RLControllerMA(
        name=f"agent{i}",
        obs_handler=ObservationHandler(num_forecasts=6),
        cost_callback=predicted_cost_callback,
        safety_layer=ActionProjectionSafetyLayer(
            penalty=DistanceDependingPenalty(penalty_factor=mappo_config.penalty_factor)
        )
    ).add_entity(ind_sys.nodes[i])
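
To illustrate what an action-projection safety layer with a distance-dependent penalty does conceptually, here is a minimal sketch for a single battery. It is not CommonPower's implementation: the function `project_battery_action` is hypothetical, losses are ignored, and we assume that positive power charges the battery. The bounds mirror the battery configured above (2 to 8 kWh, -5 to 5 kW).

```python
def project_battery_action(p_requested, soc, dt_h=1.0,
                           soc_min=2.0, soc_max=8.0,
                           p_min=-5.0, p_max=5.0,
                           penalty_factor=2.0):
    """Clip a requested battery power (kW) so the SoC (kWh) stays within its bounds.

    Returns the safe action and a penalty proportional to the size of the correction.
    """
    # Power range that keeps the SoC inside [soc_min, soc_max] after one time step
    lower = max(p_min, (soc_min - soc) / dt_h)
    upper = min(p_max, (soc_max - soc) / dt_h)
    # Projection onto an interval is simply clipping
    p_safe = min(max(p_requested, lower), upper)
    penalty = penalty_factor * abs(p_requested - p_safe)
    return p_safe, penalty


# An almost full battery cannot absorb 3 kW for a full hour
p_safe, penalty = project_battery_action(p_requested=3.0, soc=7.5)
print(p_safe, penalty)
```

A safe action passes through unchanged and incurs no penalty, so the agent is only penalized in proportion to how far its suggestion was from the safe set.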

For logging the training, you can use TensorBoard or Weights&Biases, as described above. If you use W&B, you need to change the `entity_name` to the name of your W&B team.

In [None]:
# logging using Tensorboard
logger = MARLTensorboardLogger(
    log_dir="./test_run/",
    callback=MARLBaseCallback
)
# logging using WandB
#logger = MARLWandBLogger(
#    log_dir="./test_run/",
#    entity_name="srl4ps",  # change to your team name!
#    project_name="commonpower",
#    alg_config=mappo_config,
#    callback=MARLWandBCallback
#)

The actual training will happen in the next cell. WARNING: It will take several hours until the training fully converges. If you just want to get an idea of what the training process would look like, reduce the `num_env_steps` in the `mappo_config` above. You can also skip the training and go directly to the benchmarking, in which case you will use agents that we have pre-trained for you. 

In [None]:
runner = MAPPOTrainer(
    sys=ind_sys,
    global_controller=OptimalController('global'),
    wrapper=MultiAgentWrapper,
    alg_config=mappo_config,
    seed=mappo_config.seed,
    logger=logger
)
runner.run()

### Training visualization
If you used the `MARLTensorboardLogger`, you can plot the training metrics using the TensorBoard notebook magic. The metrics are sorted by agent. The most interesting metrics for us are the __average_episode_rewards__, the __ep_penalty_mean__, and the __value_loss__. Think about what these charts tell you and discuss them!

In [None]:
%load_ext tensorboard
%tensorboard --logdir test_run

## Benchmarking MARL and decentralized optimal control
We want to benchmark our trained agents against decentralized optimal control. 

In [None]:
# parameters for deployment
n_deployment_steps = 48
eval_seed = 5

### Decentralized optimal control

In [None]:
# add optimal controllers (will overwrite the RL controllers)
for i in range(len(ind_sys.nodes)):
    # will also add a controller to households which do not have inputs (e.g., households with only a Load component), 
    # but these are disregarded when the system is initialized
    _ = OptimalController(f"agent{i}").add_entity(ind_sys.nodes[i])

In [None]:
ind_sys_history = ModelHistory([ind_sys])
runner = DeploymentRunner(sys=ind_sys, global_controller=OptimalController("global"),
                          seed=eval_seed, history=ind_sys_history, continuous_control=True)
runner.run(n_steps=n_deployment_steps)

### MARL deployment
Once the agents are trained, we show how to load and deploy them.

In [None]:
# retrieve information which agent controlled which nodes
top_level_nodes = []
for ctrl in ind_sys.controllers.values():
    top_level_nodes.append(ctrl.top_level_nodes)
    
# deployment of trained agents
# In case you used W&B for logging, the models will be saved in "./test_run/models/og1te5k4",
# where you have to replace the last part with the respective run ID.
load_path = "./saved_models/test_model/"
agents = []
for i in range(len(ind_sys.controllers)):
    agent_i = RLControllerMA(
        name=f"agent{i}",
        obs_handler=ObservationHandler(num_forecasts=6),
        safety_layer=ActionProjectionSafetyLayer(
            penalty=DistanceDependingPenalty(penalty_factor=mappo_config.penalty_factor)
        ),
        pretrained_policy_path=load_path+f"/agent{i}"
    ).add_entity(top_level_nodes[i][0])
    agents.append(agent_i)

In [None]:
ind_sys_history_marl = ModelHistory([ind_sys])
runner = DeploymentRunner(sys=ind_sys, global_controller=OptimalController("global"), alg_config=mappo_config,
                          wrapper=MultiAgentWrapper, history=ind_sys_history_marl, seed=eval_seed, continuous_control=True)
runner.run(n_steps=n_deployment_steps)

### Comparison of total cost

In [None]:
# We compare controllers by tracking the realized cost until the last timestep.
# The cost of the last timestep is the accumulated cost of the projected horizon.
# Since the projection is computed by the system's "internal" solver, which is by definition optimal w.r.t. the
# system's cost function, this represents the "best case" cost (subject to the forecaster).
# This makes sure that costs realized in the future, e.g. by discharging batteries, are considered in the comparison.
decentralized_cost = get_adjusted_cost(ind_sys_history, ind_sys)
decentralized_cost_rl = get_adjusted_cost(ind_sys_history_marl, ind_sys)
print(f"decentralized_cost: {sum(decentralized_cost)}")
print(f"decentralized_cost_rl: {sum(decentralized_cost_rl)}")
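
The reasoning in the comments above can be illustrated with a toy example. This is only our reading of the comments, not the implementation of `get_adjusted_cost`: the helper `adjusted_total_cost` is hypothetical, and all numbers are made up for illustration and are not results from this notebook.

```python
def adjusted_total_cost(realized_step_costs, projected_horizon_cost):
    """Realized cost of all steps but the last, plus the projected cost from the last step onward."""
    return sum(realized_step_costs[:-1]) + projected_horizon_cost


# Controller A empties its battery early and has to buy more energy later
realized_a = [1.0, 0.5, 0.5]
cost_a = adjusted_total_cost(realized_a, projected_horizon_cost=4.0)

# Controller B pays more up front but ends with a charged battery
realized_b = [1.0, 1.0, 1.0]
cost_b = adjusted_total_cost(realized_b, projected_horizon_cost=2.5)

print(sum(realized_a), sum(realized_b))  # realized cost alone favors A
print(cost_a, cost_b)                    # including the projection favors B
```

Without the projected term, a controller could look cheap simply by draining its battery right before the end of the evaluation window.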

### Comparison of controllers for one day
Next, we will show an example of the difference in behavior of an RL controller and an optimal controller for a given day and one household. 

In [None]:
# day to use
start = datetime(2016, 8, 29, 0, 0)
end = datetime(2016, 8, 29, 23, 0)

In [None]:
price_history = ind_sys_history.filter_for_entities(ind_sys.nodes[4]).filter_for_time_period(start,end).filter_for_element_names(["psi"]).filter_for_time_index().history
prices = [t[1]['n4.psi'] for t in price_history]
time_stamps = [t[0] for t in price_history]
cost_history = ind_sys_history.filter_for_entities(ind_sys.nodes[4]).filter_for_time_period(start,end).filter_for_element_names(["cost"]).filter_for_time_index().history
costs = [t[1]['n4.cost'] for t in cost_history]
soc_history = ind_sys_history.filter_for_entities(ind_sys.nodes[4].nodes[2]).filter_for_time_period(start,end).filter_for_element_names(["soc"]).filter_for_time_index().history
soc = [t[1]["n4.el42.soc"] for t in soc_history]
soc_history_rl = ind_sys_history_marl.filter_for_entities(ind_sys.nodes[4].nodes[2]).filter_for_time_period(start,end).filter_for_element_names(["soc"]).filter_for_time_index().history
soc_rl = [t[1]["n4.el42.soc"] for t in soc_history_rl]

In [None]:
import matplotlib.pyplot as plt
def make_results_plot(time_stamps, prices, soc_oc, soc_rl):
    fig, ax = plt.subplots()
    ax.plot(time_stamps, prices, color='blue', label='Spot market prices')
    ax.tick_params(axis='y', labelcolor='blue')
    ax.tick_params(axis='x', rotation=45)
    ax2 = ax.twinx()
    ax2.plot(time_stamps, soc_oc, color='orange', label='SoC Optimal Controller')
    ax2.tick_params(axis='y', labelcolor='orange')
    ax3 = ax.twinx()
    ax3.plot(time_stamps, soc_rl, color='green', label='SoC RL Controller')
    ax3.tick_params(axis='y', labelcolor='green')
    ax3.spines['right'].set_position(('outward', 60))
    fig.legend()
    
    plt.show()

make_results_plot(time_stamps, prices, soc, soc_rl)

As you can see, the optimal controller is far more aggressive than the RL controller. This is probably due to the safety constraints: the agent is penalized whenever the safety layer corrects its action, so it has learned to stay away from the limits of the battery. This gives a hint as to why the optimal controller achieves a lower overall cost.