# 1. Getting Started with AngoraPy

In this tutorial, we introduce the basic functionality of AngoraPy. We cover the full workflow: creating an environment, building a model, combining the two into an agent, and finally training and evaluating that agent. We take all of these steps without major customization so that we can focus on the overall structure of an AngoraPy application. Customizing the task and the model is covered in other notebooks in this repository.

## Installation
Before you build your first agent in AngoraPy, you need to install the package. Since AngoraPy depends on a multitude of other packages and their specific versions, we recommend doing a clean installation in a new virtual environment. In this environment, first install some build dependencies as follows:

    pip install swig imageio

and then install AngoraPy itself.

    pip install angorapy

You now have all you need to build an agent.

## Your First Agent in AngoraPy

We begin by importing angorapy and, for basic operations, numpy. We also turn off TensorFlow's logging to keep the output clean.

In [1]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import numpy as np
import angorapy as apy

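First, we create the environment. For most environments, PPO needs normalized states and rewards. AngoraPy adds this functionality by wrapping the environment with transformers that fulfill this task; as the agent's log output further below shows, a StateNormalizationTransformer and a RewardNormalizationTransformer are used here. You can add your own custom transformers in the same manner.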

In [2]:
env = apy.make_env("CartPole-v1")
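
To give an intuition for what state normalization does, the following minimal sketch keeps a running mean and variance of the observations and standardizes each state with them. It is a plain NumPy illustration under our own assumptions: the class RunningNormalizer and its methods are hypothetical names, not AngoraPy's transformer implementation.

```python
import numpy as np


class RunningNormalizer:
    """Toy running mean/variance normalizer (hypothetical, not AngoraPy's class)."""

    def __init__(self, shape, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)  # sum of squared deviations
        self.eps = eps

    def update(self, x):
        # Welford's online update of the mean and the squared deviations.
        x = np.asarray(x, dtype=np.float64)
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Standardize with the population variance of everything seen so far.
        var = self.m2 / max(self.count, 1)
        return (np.asarray(x, dtype=np.float64) - self.mean) / np.sqrt(var + self.eps)


normalizer = RunningNormalizer(shape=(4,))  # CartPole observations have 4 entries
rng = np.random.default_rng(0)
states = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # made-up observations
for s in states:
    normalizer.update(s)

normalized = np.array([normalizer.normalize(s) for s in states])
```

After seeing all states, the normalized values have approximately zero mean and unit standard deviation per dimension, which is the property the optimization benefits from.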

Next, we need to create the policy distribution that we would like the model to map to. We will use a categorical distribution. Since the distribution depends on the action space of the environment, we need to provide it with the environment object.

In [4]:
distribution = apy.policies.CategoricalPolicyDistribution(env)
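
Conceptually, a categorical policy assigns one probability to each discrete action (CartPole has two) and samples from these probabilities. The sketch below illustrates this idea in plain NumPy; the logits are made-up values and the code is not how AngoraPy implements its distribution class.

```python
import numpy as np


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
logits = np.array([0.2, 1.5])  # hypothetical network outputs, one per action
probabilities = softmax(logits)

# Sample an action in proportion to its probability.
action = rng.choice(len(probabilities), p=probabilities)

# Log-probability of the sampled action, as used by policy gradient methods.
log_prob = np.log(probabilities[action])
```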

Lastly, we need a model. To that end, we create a *model builder*. AngoraPy needs to be able to build new instances of the model at any time. Thus, it requires a model-building function instead of a model instance. This function must return a tuple of (policy, value, joint) networks. The first two are the network that selects the action (policy network) and the network that evaluates the state (value network). The last is their combination. Separating the three serves computational efficiency.

For built-in architectures, we can use the *get_model_builder()* function. Let's also check the models this model builder creates.

In [5]:
from tensorflow.keras.utils import plot_model

build_models = apy.get_model_builder(model="simple", model_type="ffn", shared=False)
policy, value, joint = build_models(env, distribution)

plot_model(joint)

('Failed to import pydot. You must `pip install pydot` and install graphviz (https://graphviz.gitlab.io/download/), ', 'for `pydotprint` to work.')


We can see that the model builder created three network references. Importantly, these are references: any change to the value or policy network will also change the joint network and vice versa. In the model plot, we can also see how the policy and value networks are separated. They share only their input, not their weights.
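
What it means for the joint network to hold references rather than copies can be shown with a toy stand-in. The classes TinyNetwork and JointNetwork below are hypothetical and use plain NumPy instead of Keras; they only demonstrate that an in-place change to the policy network is visible through the joint network.

```python
import numpy as np


class TinyNetwork:
    """A single linear layer standing in for a network (toy example)."""

    def __init__(self, n_in, n_out, rng):
        self.weights = rng.normal(size=(n_in, n_out))

    def __call__(self, x):
        return x @ self.weights


class JointNetwork:
    """Holds references to both networks rather than copies of them."""

    def __init__(self, policy, value):
        self.policy = policy
        self.value = value

    def __call__(self, x):
        return self.policy(x), self.value(x)


rng = np.random.default_rng(0)
policy_net = TinyNetwork(4, 2, rng)  # 4 inputs -> 2 action logits
value_net = TinyNetwork(4, 1, rng)   # 4 inputs -> 1 state value
joint_net = JointNetwork(policy_net, value_net)

state = np.ones((1, 4))
before, _ = joint_net(state)
policy_net.weights += 1.0            # modify the policy network in place
after, _ = joint_net(state)          # the joint output changes as well
```

The two networks receive the same input but keep separate weights, mirroring the unshared setup described above.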

With model, environment and distribution set up, we can now assemble an agent.

In [6]:
agent = apy.Agent(build_models, env, horizon=1024, workers=1, distribution=distribution)

Detected 0 GPU devices.
Using [StateNormalizationTransformer, RewardNormalizationTransformer] for preprocessing.
An MPI Optimizer with 1 ranks has been created; the following ranks optimize: [0]


We will now train the agent for 10 cycles and afterwards save the final state. Additionally, AngoraPy always saves the agent's best version.

In [7]:
agent.drill(n=10, epochs=3, batch_size=64)
agent.save_agent_state()



Drill started using 1 processes for 1 workers of which 1 are optimizers. Worker distribution: [1].
IDs over Workers: [[0]]
IDs over Optimizers: [[0]]
Gathering cycle 0...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 240.17it/s]

Before Training: r:    19.20; len:    19.20; n:  51; loss: [  pi  |  v     |  ent ]; eps:     0; lr: 1.00e-03; upd:      0; f:    0.000k; y.exp: 0.000; times:  ; took s [unknown time left]; mem: 1.11/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:02<?, ?it/s]

Finalizing...
Gathering cycle 1...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 248.97it/s]

Cycle     1/10: r:    31.48; len:    31.48; n:  31; loss: [  0.08|    0.63|  0.68]; eps:    51; lr: 1.00e-03; upd:     48; f:    1.024k; times: [5.2|0.0|2.3] [69|0|31]; took 7.58s [1.1mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:00<?, ?it/s]

Finalizing...
Gathering cycle 2...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 245.70it/s]

Cycle     2/10: r:    52.26; len:    52.26; n:  19; loss: [ -0.06|    0.32|  0.67]; eps:    82; lr: 1.00e-03; upd:     96; f:    2.048k; times: [5.1|0.0|1.0] [84|0|16]; took 6.36s [0.9mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:00<?, ?it/s]

Finalizing...
Gathering cycle 3...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 246.86it/s]

Cycle     3/10: r:    85.82; len:    85.82; n:  11; loss: [ -0.13|    0.22|  0.63]; eps:   101; lr: 1.00e-03; upd:    144; f:    3.072k; times: [5.1|0.0|1.0] [84|0|16]; took 6.26s [0.8mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:00<?, ?it/s]

Finalizing...
Gathering cycle 4...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 212.49it/s]

Cycle     4/10: r:    99.70; len:    99.70; n:  10; loss: [  0.10|    0.14|  0.61]; eps:   112; lr: 1.00e-03; upd:    192; f:    4.096k; times: [5.1|0.0|1.0] [84|0|16]; took 6.96s [0.7mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:00<?, ?it/s]

Finalizing...
Gathering cycle 5...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 243.64it/s]

Cycle     5/10: r:   115.50; len:   115.50; n:   8; loss: [  0.03|    0.07|  0.59]; eps:   122; lr: 1.00e-03; upd:    240; f:    5.120k; times: [5.8|0.0|1.0] [85|0|15]; took 6.43s [0.6mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:01<?, ?it/s]

Finalizing...
Gathering cycle 6...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 250.87it/s]

Cycle     6/10: r:   162.17; len:   162.17; n:   6; loss: [ -0.03|    0.06|  0.57]; eps:   130; lr: 1.00e-03; upd:    288; f:    6.144k; times: [5.2|0.0|1.2] [81|0|19]; took 6.42s [0.4mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:00<?, ?it/s]

Finalizing...
Gathering cycle 7...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 242.95it/s]

Cycle     7/10: r:   276.00; len:   276.00; n:   3; loss: [ -0.08|    0.03|  0.56]; eps:   136; lr: 1.00e-03; upd:    336; f:    7.168k; times: [5.0|0.0|1.0] [83|0|17]; took 6.4s [0.3mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:00<?, ?it/s]

Finalizing...
Gathering cycle 8...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 251.23it/s]

Cycle     8/10: r:   399.50; len:   399.50; n:   2; loss: [  0.09|    0.02|  0.59]; eps:   139; lr: 1.00e-03; upd:    384; f:    8.192k; times: [5.2|0.0|1.0] [84|0|16]; took 6.24s [0.2mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:00<?, ?it/s]

Finalizing...
Gathering cycle 9...

Gathering experience...: 100%|█████████████| 1024/1024 [00:04<00:00, 239.37it/s]

Cycle     9/10: r:   336.33; len:   336.33; n:   3; loss: [ -0.14|    0.04|  0.54]; eps:   141; lr: 1.00e-03; upd:    432; f:    9.216k; times: [5.0|0.0|1.0] [83|0|17]; took 6.45s [0.1mins left]; mem: 1.14/12|0.0/0.0;

Optimizing...:   0%|                                     | 0/48 [00:00<?, ?it/s]

Finalizing...
Drill finished after 65.84serialization.
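
The numbers in this log can be related to the settings we chose. The following arithmetic is our own reading of the output, not a statement about AngoraPy's internals: it assumes that each cycle gathers horizon × workers samples and that every epoch passes over them once in batches of batch_size.

```python
horizon = 1024     # steps gathered per worker and cycle (see the progress bar)
workers = 1
epochs = 3
batch_size = 64
cycles = 10

samples_per_cycle = horizon * workers
batches_per_epoch = samples_per_cycle // batch_size
updates_per_cycle = batches_per_epoch * epochs

# Cumulative number of updates at the time each statistics line is printed.
updates_at_report = [updates_per_cycle * c for c in range(cycles)]

print(updates_per_cycle)  # 48
```

Under these assumptions, 16 batches per epoch over 3 epochs give the 48 optimization steps shown in the progress bar, and the cumulative values 0, 48, ..., 432 match the upd column of the log.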


After training is done, we can evaluate the agent. For this purpose we tell the agent to *act confidently*. Because policies in AngoraPy are stochastic, actions are usually sampled from the policy distribution. At evaluation time, we would however prefer the agent to stop exploring and instead choose the action it is most confident about. Thus, when told to act confidently, the agent will not sample but instead choose the most likely action under the predicted distribution.
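
The difference between sampling and acting confidently can be illustrated with a small NumPy sketch. The probabilities are made-up values for a single state, and the code only demonstrates the idea rather than AngoraPy's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
probabilities = np.array([0.3, 0.7])  # hypothetical policy output for one state

# Stochastic action selection: sample in proportion to the probabilities.
sampled_actions = rng.choice(len(probabilities), size=1000, p=probabilities)

# Confident action selection: always take the most likely action.
confident_action = int(np.argmax(probabilities))

# Fraction of sampled actions that equal action 1 (close to its probability).
share_of_action_1 = np.mean(sampled_actions == 1)
```

When sampling, the less likely action is still chosen in a sizable share of the steps, whereas the confident agent picks action 1 every time.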

In [8]:
evaluation_results = agent.evaluate(10, act_confidently=True)[0]
print(np.mean(evaluation_results.episode_rewards))

100%|███████████████████████████████████████████| 10/10 [00:10<00:00,  1.08s/it]

500.0





As we can see, evaluation performance is usually higher than the performance reported after the last optimization, because the agent is no longer exploring.