# 1. Getting Started with AngoraPy

In this tutorial, we introduce the basic functionality of AngoraPy. We cover the full workflow: creating an environment, building a model, combining the two in an agent, and finally training and evaluating that agent. We take all of these steps without major customization so that we can focus on the overall structure of an AngoraPy application. Customizing the task and the model is covered in other notebooks in this repository.

## Installation
Before you build your first agent in AngoraPy, you need to install the package. Since AngoraPy depends on a multitude of other packages and their specific versions, we recommend doing a clean installation in a new virtual environment. In this environment, first install some build dependencies as follows:

    pip install swig imageio

and then install AngoraPy itself.

    pip install angorapy

You now have all you need to build an agent.
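Before moving on, it can help to confirm that the packages are visible from the interpreter of the new virtual environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of AngoraPy.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every listed package is importable in this environment.
print(missing_packages(["angorapy", "numpy", "tensorflow"]))
```

If the list is not empty, the most likely cause is that the notebook kernel runs in a different environment than the one you installed into.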

## Your First Agent in AngoraPy

We begin by importing AngoraPy, along with NumPy for basic operations. Additionally, we turn off TensorFlow's logging to keep the output clean.

In [7]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import numpy as np
import angorapy as ap

For most environments, PPO needs to normalize states and rewards; to add this functionality we wrap the environment with transformers fulfilling this task. You can also add your own custom transformers this way.

In [8]:
env = ap.make_env("CartPole-v1")
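To make the idea of normalization concrete, here is a minimal sketch of one common approach: keeping a running mean and variance and standardizing each new value with them. This is an illustration under our own assumptions, not AngoraPy's transformer implementation; the class name `RunningNormalizer` is hypothetical.

```python
import math


class RunningNormalizer:
    """Tracks a running mean and variance with Welford's algorithm (illustrative only)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero when the variance is 0

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        """Standardize x with the statistics gathered so far."""
        variance = self.m2 / (self.count - 1) if self.count > 1 else 1.0
        return (x - self.mean) / math.sqrt(variance + self.eps)


normalizer = RunningNormalizer()
for reward in [1.0, 2.0, 3.0]:
    normalizer.update(reward)

print(normalizer.mean)
print(normalizer.normalize(3.0))
```

The same scheme can be applied per dimension to state vectors. Keeping the statistics online means the normalization adapts as the agent visits new parts of the environment.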

Next, we need to create the policy distribution that we would like the model to map to. We will use a categorical distribution. Since the distribution depends on the action space of the environment, we need to provide the distribution with the environment object.

In [9]:
distribution = ap.policies.CategoricalPolicyDistribution(env)
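As a rough illustration of what a categorical policy distribution does, the sketch below turns unnormalized scores (logits) into action probabilities, samples an action, and computes the entropy. It is plain Python written for this tutorial, not AngoraPy's `CategoricalPolicyDistribution`; the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - highest) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


rng = random.Random(0)
probs = softmax([0.0, 0.0])  # two equally likely actions, as in an untrained CartPole policy
action = sample_action(probs, rng)
print(probs, action, entropy(probs))
```

For two equally likely actions the entropy is ln 2, roughly 0.69. If the third loss value in the training log reports policy entropy, which its `ent` label suggests, this matches the value seen in the first cycles.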

Lastly, we need a model. To that end, we create a *model builder*. AngoraPy needs to be able to constantly build new versions of the model. Thus, it requires a model building function instead of a model instance. This function must return a tuple of (policy, value, joint) networks. The first two are the network selecting the action (policy network) and the network evaluating the state (value network). The third is their combination. The separation of the three serves computational efficiency.

For built-in architectures, we can use the *get_model_builder()* function. Let's also check the models this model builder creates.

In [10]:
from tensorflow.keras.utils import plot_model

build_models = ap.models.get_model_builder(model="simple", model_type="ffn", shared=False)
policy, value, joint = build_models(env, distribution)

plot_model(joint)

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.


We can see that the model builder created three network references. Importantly, these are references: any change to the value or policy network also changes the joint network, and vice versa. In the model plot, we can also see how the policy and value networks are separated. They share only their input, not their weights.
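The two ideas above, a builder function instead of an instance and a joint network that references the same parts as the policy and value networks, can be sketched with toy classes. Everything here (`TinyNet`, `JointNet`, `build_toy_models`) is a hypothetical stand-in and involves neither Keras nor AngoraPy.

```python
class TinyNet:
    """Toy stand-in for a network: a name and a mutable list of weights."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = weights


class JointNet:
    """Toy joint network that holds references to the other two networks."""

    def __init__(self, policy, value):
        self.policy = policy
        self.value = value

    def all_weights(self):
        return self.policy.weights + self.value.weights


def build_toy_models():
    """Builder function: every call creates a fresh (policy, value, joint) triple."""
    policy = TinyNet("policy", [0.0, 0.0])
    value = TinyNet("value", [0.0])
    joint = JointNet(policy, value)
    return policy, value, joint


toy_policy, toy_value, toy_joint = build_toy_models()
toy_policy.weights[0] = 1.5     # change the policy network ...
print(toy_joint.all_weights())  # ... and the joint network sees it: [1.5, 0.0, 0.0]

# A second call builds an independent copy that is unaffected by the change.
_, _, fresh_joint = build_toy_models()
print(fresh_joint.all_weights())  # [0.0, 0.0, 0.0]
```

Passing a function rather than an instance is what lets a framework create such independent copies whenever it needs them.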

With model, environment and distribution set up, we can now assemble an agent.

In [11]:
agent = ap.Agent(build_models, env, horizon=1024, workers=1, distribution=distribution)

We will now train the agent for 10 cycles and afterwards save the final state. AngoraPy will additionally always save the agent's best version.

In [12]:
agent.drill(n=10, epochs=3, batch_size=64)
agent.save_agent_state()



Drill started using 1 processes for 1 workers of which 1 are optimizers. Worker distribution: [1].
IDs over Workers: [[0]]
IDs over Optimizers: [[0]]
Gathering cycle 0...

Before Training; r: 24.14; len: 24.14; n: 42; loss: [pi|v|ent]; upd: 0; y.exp: 0.000; ; time:  ; time left: unknown time; took s [unknown time left]

Gathering cycle 1...

Cycle 1/10; r: 25.25; len: 25.25; n: 40; loss: [-0.01|0.58|0.69]; upd: 48; ; time: [8.6|0.0|1.6] [84|0|16]; time left: 1.5mins; took 9.75s [1.5mins left]

Gathering cycle 2...

Cycle 2/10; r: 36.11; len: 36.11; n: 28; loss: [-0.09|0.25|0.66]; upd: 96; ; time: [7.7|0.0|1.0] [88|0|12]; time left: 1.2mins; took 8.66s [1.2mins left]

Gathering cycle 3...

Cycle 3/10; r: 60.06; len: 60.06; n: 17; loss: [-0.02|0.22|0.65]; upd: 144; ; time: [7.3|0.0|1.0] [88|0|12]; time left: 1.1mins; took 9.62s [1.1mins left]

Gathering cycle 4...

Cycle 4/10; r: 98.89; len: 98.89; n: 9; loss: [-0.01|0.17|0.61]; upd: 192; ; time: [8.2|0.0|1.0] [89|0|11]; time left: 1.0mins; took 10.11s [1.0mins left]

Gathering cycle 5...

Cycle 5/10; r: 216.25; len: 216.25; n: 4; loss: [-0.03|0.12|0.60]; upd: 240; ; time: [8.8|0.0|1.0] [90|0|10]; time left: 0.8mins; took 9.24s [0.8mins left]

Gathering cycle 6...

Cycle 6/10; r: 186.20; len: 186.20; n: 5; loss: [-0.06|0.07|0.58]; upd: 288; ; time: [7.9|0.0|1.0] [89|0|11]; time left: 0.6mins; took 8.53s [0.6mins left]

Gathering cycle 7...

Cycle 7/10; r: 137.29; len: 137.29; n: 7; loss: [0.02|0.05|0.55]; upd: 336; ; time: [7.3|0.0|1.0] [88|0|12]; time left: 0.5mins; took 8.78s [0.5mins left]

Gathering cycle 8...

Cycle 8/10; r: 160.00; len: 160.00; n: 6; loss: [-0.02|0.02|0.57]; upd: 384; ; time: [7.5|0.0|0.9] [89|0|11]; time left: 0.3mins; took 8.44s [0.3mins left]

Gathering cycle 9...

Cycle 9/10; r: 198.50; len: 198.50; n: 4; loss: [-0.08|0.02|0.54]; upd: 432; ; time: [7.3|0.0|1.0] [88|0|12]; time left: 0.2mins; took 8.28s [0.2mins left]

Finalizing... Drill finished after 91.72s.
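The `upd` column in the log above grows by 48 per cycle. A short calculation reproduces this number from the settings used, under the assumption that each cycle gathers `horizon * workers` samples, splits them into minibatches of `batch_size`, and passes over them once per epoch. The helper `updates_per_cycle` is hypothetical and only serves the arithmetic.

```python
def updates_per_cycle(horizon, workers, batch_size, epochs):
    """Number of gradient updates per cycle, assuming full minibatches only."""
    samples = horizon * workers
    minibatches = samples // batch_size
    return minibatches * epochs


per_cycle = updates_per_cycle(horizon=1024, workers=1, batch_size=64, epochs=3)
cumulative = [per_cycle * cycle for cycle in range(1, 10)]

print(per_cycle)   # 48
print(cumulative)  # [48, 96, 144, 192, 240, 288, 336, 384, 432]
```

The cumulative values agree with the `upd` entries for cycles 1 through 9, which supports the assumption for this configuration.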


After training is done, we can evaluate the agent. For this purpose we tell the agent to *act confidently*. Because policies in AngoraPy are stochastic, actions are usually sampled from the policy distribution. At evaluation time, we would however prefer the agent to stop exploring and instead choose the action it is most confident about. Thus, when told to act confidently, the agent will not sample but instead choose the most likely action under the predicted distribution.

In [13]:
evaluation_results = agent.evaluate(10, act_confidently=True)[0]
print(np.mean(evaluation_results.episode_rewards))

100%|██████████| 10/10 [00:05<00:00,  1.90it/s]

197.2





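Usually, evaluation performance is higher than the performance during the last training cycle, because the agent is no longer exploring. In this particular run, the two are close (197.2 in evaluation versus 198.50 in cycle 9).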
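The difference between sampling and acting confidently can be sketched in a few lines. This is an illustration of the idea for a categorical policy, not AngoraPy's implementation; the function `choose_action` is hypothetical.

```python
import random


def choose_action(probs, act_confidently, rng):
    """Pick the most likely action, or sample one according to its probability."""
    if act_confidently:
        # Deterministic: index of the highest probability.
        return max(range(len(probs)), key=lambda action: probs[action])
    # Stochastic: keeps exploring less likely actions.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
probs = [0.3, 0.7]

confident = [choose_action(probs, True, rng) for _ in range(1000)]
sampled = [choose_action(probs, False, rng) for _ in range(1000)]

print(set(confident))  # {1}
print(sampled.count(0), sampled.count(1))
```

With confident acting, the less likely action is never taken, whereas sampling still picks it in a substantial share of steps. In a task like CartPole, those exploratory steps can end an episode early, which is why evaluation scores tend to be higher without them.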