# Training an RL agent with Stable Baselines3 using a GEM environment

This notebook is an educational introduction to using Stable Baselines3 with a GEM environment. Its goal is to explain what Stable Baselines3 is and how to use it to train and evaluate a reinforcement learning agent that can solve a current control problem from the GEM toolbox.

The following code snippet is only needed if you are executing this file directly from a cloned GitHub repository and do not have GEM installed.

In [5]:
from pathlib import Path
import sys
sys.path.append(str(Path().resolve().parent.parent))

## 1. What you need

Before you can start you need to make sure that you have both gym-electric-motor and Stable-Baselines3 installed. You can install both easily using pip:

- pip install gym-electric-motor
- pip install stable-baselines3

Alternatively, you can install their latest development versions directly from GitHub:

- https://github.com/upb-lea/gym-electric-motor
- https://github.com/DLR-RM/stable-baselines3

You also need numpy and gym, both of which can be installed with pip as well. After that, you should be able to execute the following cells without any problems.


## 2. Setting up a GEM environment

This notebook does not focus on GEM itself or on how to set up a GEM environment. If you are new to GEM and want to find out what it does and how to use it, we recommend taking a look at the educational notebook that deals with GEM.


For this notebook, we will use a function defined in an external Python file called setting_environment.py. If you are interested in how we defined our environment's parameters, you can take a look at that file. We are using the Discrete DC Permanently Excited Motor Environment:

- https://upb-lea.github.io/gym-electric-motor/parts/environments/dc_permex_disc.html

- https://upb-lea.github.io/gym-electric-motor/parts/physical_systems/electric_motors/pmsm.html

The motor schematic is the following:

![Motor Setup](img/ESBdq1.svg)

The electrical ODEs of that motor, with the permanent magnet flux linkage $\Psi_p$, are:

<h3 align="center">

$\frac{\mathrm{d}i_{sq}}{\mathrm{d}t} = \frac{u_{sq}-p\omega_{me}(L_di_{sd}+\Psi_p)-R_si_{sq}}{L_q}$

$\frac{\mathrm{d}i_{sd}}{\mathrm{d}t} = \frac{u_{sd}+pL_q\omega_{me}i_{sq}-R_si_{sd}}{L_d}$

$\frac{\mathrm{d}\epsilon_{el}}{\mathrm{d}t} = p\omega_{me}$

</h3>

Ultimately, we want an agent that solves the current control problem of this environment. This means it should control the system such that $i_{sq}$ and $i_{sd}$ follow a given reference trajectory. The following code uses our pre-written function set_env to create our pre-defined GEM environment.

In [6]:
from setting_environment import set_env
env = set_env(training=True)

## 3. What is Stable Baselines3?



Stable Baselines3 is a collection of reinforcement learning algorithms implemented in PyTorch. It is useful when you want to train an agent with a specific RL algorithm but do not want to implement the algorithm yourself.

Stable Baselines3 is still a very new library, with its current release being 0.9. That is why its selection of algorithms is not very large yet and most algorithms lack more sophisticated variants. However, the developers plan to broaden the range of available algorithms in the future. For the currently available algorithms, see the documentation:

- https://stable-baselines3.readthedocs.io/en/master/guide/rl.html

To use an agent provided by Stable Baselines3 your environment has to have a gym interface:

- https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html
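
To illustrate what that interface amounts to, here is a minimal sketch of an environment exposing the classic gym methods `reset()` and `step(action)`, the same calls used later in this notebook. The class `ToyCurrentEnv`, its dynamics, and its reward are hypothetical and only show the call pattern. A real environment for Stable Baselines3 would additionally subclass `gym.Env` and declare `observation_space` and `action_space`.

```python
class ToyCurrentEnv:
    """Hypothetical toy environment; illustrates the gym call pattern only."""

    def __init__(self, reference=0.5, max_steps=100):
        self.reference = reference   # current value the agent should reach
        self.max_steps = max_steps   # episode length limit
        self.current = 0.0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.current = 0.0
        self.steps = 0
        return [self.current, self.reference]

    def step(self, action):
        """Apply a discrete action (0: lower, 1: hold, 2: raise the current)."""
        self.current += (action - 1) * 0.1
        self.steps += 1
        reward = -abs(self.reference - self.current)  # penalize tracking error
        done = self.steps >= self.max_steps
        return [self.current, self.reference], reward, done, {}


toy_env = ToyCurrentEnv()
toy_obs = toy_env.reset()
toy_obs, toy_reward, toy_done, toy_info = toy_env.step(2)  # raise the current once
```

Every `step` call returns the observation, the reward, a `done` flag and an info dictionary, which is the four-tuple unpacked in the evaluation loops of this notebook.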

## 4. Training an agent

To train an agent in Stable Baselines3 you need two things: an agent and a policy. The agent is the algorithm you want to use to solve your problem. The policy defines the function approximator you want to use; MLP and CNN policies, named after the respective neural network architecture, are the most commonly supported. Check the documentation to see which policies your chosen algorithm supports. Recurrent policies are planned for the future, too.

### 4.1 Imports

In our control problem we have an environment with a discrete action space. Therefore, we chose the Deep Q-Network (DQN):

- https://arxiv.org/abs/1312.5602
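
As a reminder of what DQN optimizes, the sketch below computes the one-step temporal-difference target $y = r + \gamma \max_{a'} Q_{target}(s', a')$ for a single transition, with bootstrapping switched off at terminal states. The function name `td_target` and the example numbers are made up for illustration; the library performs this computation internally on mini-batches.

```python
def td_target(reward, next_q_values, gamma, done):
    """One-step DQN target for a single transition.

    next_q_values: target-network Q-values of all actions in the next state.
    """
    if done:
        # No future return after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


example_target = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.99, done=False)
```

The network is then trained to move its estimate for the taken action towards this target.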

For the implementation of the DQN you can check the Stable Baselines3 docs and see that currently the MLP and the CNN policy are supported:

- https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html

In our case only the MLP policy makes sense. That is why we have to import the DQN and the MlpPolicy. The docs also list which gym spaces are supported for observations and actions. You might have to take that into account for your environment.

In [7]:
from stable_baselines3 import DQN
from stable_baselines3.dqn import MlpPolicy

### 4.2 Setting the parameters

For the DQN algorithm we have to define a set of hyperparameters. The policy_kwargs dictionary is passed directly to the MlpPolicy. Its net_arch key defines the network architecture of our MLP.
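
To make `net_arch` concrete, the following sketch lists the fully connected layers that result from a list of hidden-layer sizes and counts their weights and biases. The observation dimension and the number of actions are placeholder values, not those of the GEM environment, and `mlp_layer_shapes` and `count_parameters` are hypothetical helpers rather than part of Stable Baselines3.

```python
def mlp_layer_shapes(obs_dim, net_arch, n_actions):
    """Return (inputs, outputs) for each dense layer of a Q-network MLP."""
    sizes = [obs_dim] + list(net_arch) + [n_actions]
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(layer_shapes):
    """Weights plus one bias per output unit, summed over all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)


# Placeholder dimensions chosen for illustration only.
shapes = mlp_layer_shapes(obs_dim=4, net_arch=[64, 64], n_actions=8)
n_params = count_parameters(shapes)
```

The output layer has one unit per discrete action, so the network returns one Q-value for each action the agent can take.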

In [None]:
buffer_size = 200000 # number of old observation steps saved
learning_starts = 10000 # memory warmup
train_freq = 1 # prediction network gets an update every train_freq steps
batch_size = 25 # mini batch size drawn at each update step
gamma = 0.99 # discount factor; 0.99 is assumed here (the Stable Baselines3 default)
policy_kwargs = {
        'net_arch': [64,64] # hidden layer size of MLP
        }
exploration_fraction = 0.1 # fraction of training steps over which epsilon decays
target_update_interval = 1000 # target network gets updated every target_update_interval steps
verbose = 1 # verbosity of Stable Baselines3's prints
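
The role of `exploration_fraction` can be sketched as a linear schedule: epsilon falls from an initial to a final value during the first fraction of training and stays constant afterwards. The start and end values below (1.0 and 0.05) are assumed to match the library's DQN defaults (`exploration_initial_eps`, `exploration_final_eps`); check the documentation of your installed version. `epsilon_at` is a hypothetical helper, not a library function.

```python
def epsilon_at(step, total_steps, exploration_fraction=0.1,
               initial_eps=1.0, final_eps=0.05):
    """Linearly decay epsilon over the first exploration_fraction of training."""
    progress = step / total_steps          # fraction of training completed
    if progress >= exploration_fraction:
        return final_eps                   # decay finished, keep final value
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


eps_start = epsilon_at(0, 500_000)        # beginning of training
eps_mid = epsilon_at(25_000, 500_000)     # halfway through the decay phase
eps_end = epsilon_at(50_000, 500_000)     # decay phase over
```

With a fraction of 0.1, the agent explores heavily only during the first tenth of all training steps and acts almost greedily for the rest.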

Additionally, we have to define how long our agent will train. We can either set a concrete number of steps or use our knowledge of the environment's temporal resolution to define an in-simulation training time. In this example we want to train the agent for 5 seconds of simulated time, which at a step size of tau = 1e-5 s translates to 500000 steps.

In [None]:
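tau = 1e-5 # temporal resolution (step size) in seconds
simulation_time = 5 # seconds
nb_steps = int(round(simulation_time / tau)) # round() instead of // avoids losing a step to floating-point error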

### 4.3 Starting the training

Once you have set up the environment and defined your parameters, starting the training takes only a couple of lines: create the agent and call its .learn() function. Note, however, that the training can take a long time. Don't execute the next cell if you don't have that time.

In [None]:
model = DQN(MlpPolicy, env, buffer_size=buffer_size, learning_starts=learning_starts, train_freq=train_freq,
            batch_size=batch_size, gamma=gamma, policy_kwargs=policy_kwargs,
            exploration_fraction=exploration_fraction, target_update_interval=target_update_interval,
            verbose=verbose)
model.learn(total_timesteps=nb_steps)

### 4.4 Saving the model

When the training has finished you can save the model your DQN has learned to reuse it later, e.g. for evaluation or if you want to continue your training. For this, each Stable Baselines3 algorithm has a .save() function where you only have to specify your path.

In [None]:
from pathlib import Path
model.save(Path().resolve() / "saved_agents" / "TutorialAgent") # relative to the working directory, since __file__ is undefined in a notebook

## 5. Evaluating an agent

After you have trained your agent, you will want to see how well it performs on your control problem. For example, you can look at a visual representation of the currents along a test trajectory or check the reward your agent achieves in a test scenario.

### 5.1 Loading a model

Before we start the evaluation, let us load a trained agent. The next line loads the pre-trained agent TutorialPreTrainedAgent; if you have executed the training code above, you can change the path to load your own saved agent instead. To load a trained agent, you only have to call the .load() function of your algorithm with the respective path.

In [None]:
from pathlib import Path
model = DQN.load(Path().resolve() / "saved_agents" / "TutorialPreTrainedAgent") # relative to the working directory, since __file__ is undefined in a notebook

### 5.2 Taking a look at a test trajectory

First, we look at a test trajectory to see how well the trained agent makes the currents follow their references. To let the agent choose an action for a given observation, call its .predict() function. The keyword argument deterministic=True is important: it ensures that the agent does not use a stochastic policy such as epsilon-greedy but chooses its action greedily.
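
The difference between greedy and epsilon-greedy action selection can be sketched in a few lines. The Q-values are invented numbers and `select_action` is a hypothetical helper; the snippet only illustrates the idea and does not reproduce the library's implementation of `.predict()`.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit


q_values = [0.1, 0.7, 0.3]                             # made-up Q-values for three actions
greedy_action = select_action(q_values, epsilon=0.0)   # always the argmax
```

With `epsilon=0.0` the choice is fully determined by the Q-values, which is the behaviour wanted during evaluation.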

In [None]:
env = set_env(training=False)
visualization_steps = int(9e4) # currently this crashes for larger values
cum_rew_episode = 0 # cumulative reward of the current episode
obs = env.reset()
for i in range(visualization_steps):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, _ = env.step(action)
    cum_rew_episode += reward
    env.render()
    if done:
        obs = env.reset()
        cum_rew_episode = 0 # start accumulating anew for the next episode

### 5.3 Calculating further evaluation parameters

With the knowledge from the previous sections, you are now able to train and evaluate any reinforcement learning algorithm available in Stable Baselines3. The code below gives an example of how to use the trained agent to calculate the mean reward per step and the mean episode length over a specific number of steps. For further questions you can always have a look at the documentation of gym-electric-motor and Stable Baselines3 or raise an issue in the respective GitHub repositories.

In [None]:
import numpy as np

test_steps = int(1e6) # 1 million for stability reasons
cum_rew = 0 # reward summed over all test steps
episode_step = 0 # steps taken in the current episode
episode_lengths = [] # lengths of all completed episodes
obs = env.reset()
for i in range(test_steps):
    print(f"{i+1}", end = '\r')
    episode_step += 1
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, _ = env.step(action)
    cum_rew += reward
    if done:
        episode_lengths.append(episode_step)
        episode_step = 0
        obs = env.reset()
print(f"The reward per step with {test_steps} steps was: {cum_rew/test_steps:.4f} ")
if episode_lengths:
    print(f"The average Episode length was: {round(np.mean(episode_lengths))} ")
else:
    print("No episode terminated during the test steps.")
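
The same bookkeeping can be separated from the rollout loop. The helper below is a sketch that takes per-step rewards and `done` flags, for example collected in lists during a loop like the one above, and returns the mean reward per step and the mean length of completed episodes. The name `summarize_rollout` and the sample data are illustrative.

```python
def summarize_rollout(rewards, dones):
    """Return (mean reward per step, mean length of completed episodes)."""
    episode_lengths = []
    episode_step = 0
    for done in dones:
        episode_step += 1
        if done:                      # episode finished, store its length
            episode_lengths.append(episode_step)
            episode_step = 0
    mean_reward = sum(rewards) / len(rewards)
    # None signals that no episode finished within the logged steps.
    mean_length = (sum(episode_lengths) / len(episode_lengths)
                   if episode_lengths else None)
    return mean_reward, mean_length


sample_rewards = [-0.5, -0.25, -0.25, -1.0, 0.0]
sample_dones = [False, True, False, False, True]
mean_reward, mean_length = summarize_rollout(sample_rewards, sample_dones)
```

Keeping the statistics in a separate function makes it easy to compare several agents on the same logged quantities.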
