# 3. Building Custom Recurrent Models

In the previous notebook, we looked at building custom models. We revisit this here, but this time the model will also be recurrent. This will be a very short tutorial, as the intricacy lies solely in the model building function. Everything else, i.e. handling the model, happens under the hood of AngoraPy, which automatically detects whether your model is recurrent and deals with it accordingly.

In [1]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import numpy as np
import tensorflow as tf
import angorapy as ap

env = ap.make_env("CartPole-v1")
distribution = ap.policies.CategoricalPolicyDistribution(env)

The model we build is going to be quite similar to the previous one; however, we replace the last shared feedforward layers with a shared recurrent GRU layer. AngoraPy's training algorithm requires recurrent layers to be stateful (`stateful=True`) and to return sequences (`return_sequences=True`). Additionally, you need to set the batch size, because stateful Keras layers need a fixed batch dimension.

In [2]:
from tensorflow.keras.layers import TimeDistributed
from angorapy.utilities.model_utils import env_extract_dims

@ap.models.register_model("MyModel")
def build_my_amazing_model(env, distribution, bs=1, sequence_length=1):
    state_dimensionality, n_actions = env_extract_dims(env)

    # input with a fixed batch shape: (batch, time, features)
    inputs = tf.keras.Input(batch_shape=(bs, sequence_length,) + state_dimensionality["proprioception"], name="proprioception")
    # skip timesteps whose features all equal the mask value (0 by default)
    masked = tf.keras.layers.Masking(batch_input_shape=(bs, sequence_length,) + (inputs.shape[-1], ))(inputs)

    # apply the same dense layer to every timestep
    x = TimeDistributed(tf.keras.layers.Dense(8))(masked)

    # shared recurrent layer; return_state=True also returns the final state, which we discard here
    x, *_ = tf.keras.layers.GRU(4,
                       stateful=True,
                       return_sequences=True,
                       return_state=True,
                       batch_size=bs,
                       name="policy_recurrent_layer")(x)

    # separate heads for policy and value on top of the shared recurrent features
    x_policy = tf.keras.layers.Dense(8)(x)
    x_value = tf.keras.layers.Dense(8)(x)

    out_policy = distribution.build_action_head(n_actions, x_policy.shape[1:], bs)(x_policy)
    out_value = tf.keras.layers.Dense(1)(x_value)

    policy = tf.keras.Model(inputs=inputs, outputs=out_policy, name="my_policy_function")
    value = tf.keras.Model(inputs=inputs, outputs=out_value, name="my_value_function")
    joint = tf.keras.Model(inputs=inputs, outputs=[out_policy, out_value], name="my_joint_networks")

    return policy, value, joint
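
To build some intuition for why the recurrent layer has to be stateful and return sequences, here is a minimal sketch in plain Python. It does not use AngoraPy or Keras; the toy cell `run_cell` and its weights are made up for illustration. It shows that processing a sequence chunk by chunk only matches processing the full sequence if the hidden state is carried over between chunks.

```python
import math


def run_cell(xs, h=0.0, w=0.5, u=0.9):
    """Hypothetical one-unit recurrent cell: h_t = tanh(w * x_t + u * h_{t-1}).

    Returns the hidden state at every step (analogous to return_sequences=True)
    and the final state, so that the caller can carry it over to the next chunk.
    """
    outputs = []
    for x in xs:
        h = math.tanh(w * x + u * h)
        outputs.append(h)
    return outputs, h


sequence = [0.1, -0.4, 0.7, 0.2, -0.3, 0.5, 0.0, 0.9]
chunks = [sequence[i:i + 4] for i in range(0, len(sequence), 4)]

# reference: process the whole sequence in one go
full_outputs, _ = run_cell(sequence)

# "stateful": carry the final state of one chunk into the next chunk
state = 0.0
stateful_outputs = []
for chunk in chunks:
    out, state = run_cell(chunk, h=state)
    stateful_outputs.extend(out)

# "stateless": reset the state to zero at the start of every chunk
stateless_outputs = []
for chunk in chunks:
    out, _ = run_cell(chunk, h=0.0)
    stateless_outputs.extend(out)

print("stateful matches full sequence: ",
      all(math.isclose(a, b) for a, b in zip(full_outputs, stateful_outputs)))
print("stateless matches full sequence:",
      all(math.isclose(a, b) for a, b in zip(full_outputs, stateless_outputs)))
```

The per-step outputs are what the policy and value heads consume at every timestep, which is why the layer also needs to return the full sequence rather than only its last output.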

We won't load the model this time, so we could skip registering it. However, let's again plot the model after building the agent.

In [3]:
from tensorflow.keras.utils import plot_model

agent = ap.Agent(build_my_amazing_model, env, horizon=2048, workers=1, distribution=distribution)
plot_model(agent.joint)

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.


Great. Now only training is left.

In [4]:
agent.drill(n=5, epochs=3, batch_size=128)
agent.save_agent_state()



Drill started using 1 processes for 1 workers of which 1 are optimizers. Worker distribution: [1].
IDs over Workers: [[0]]
IDs over Optimizers: [[0]]

The policy is recurrent and the batch size is interpreted as the number of transitions per policy update. Given the batch size of 128 this results in:
	8 chunks per update and 16 updates per epoch
	Batch tilings of (1, 8) per process and (1, 8) in total.

Gathering cycle 0...
Before Training; r:    20.89; len:    20.89; n:  74; loss: [  pi  |  v     |  ent ]; upd:      0; y.exp: 0.000; ; time:  ; time left: unknown time; took s [unknown time left]

Gathering cycle 1...
Cycle     1/5; r:    26.93; len:    26.93; n:  59; loss: [  9.90|    7.33|  5.54]; upd:     48; ; time: [30.2|0.0|5.4] [85|0|15]; time left: 2.3mins; took 34.42s [2.3mins left]

Gathering cycle 2...
Cycle     2/5; r:    31.33; len:    31.33; n:  49; loss: [ 11.89|    6.71|  5.47]; upd:     96; ; time: [28.6|0.0|3.2] [90|0|10]; time left: 1.6mins; took 29.7s [1.6mins left]

Gathering cycle 3...
Cycle     3/5; r:    44.77; len:    44.77; n:  39; loss: [ 17.41|    9.25|  5.25]; upd:    144; ; time: [26.2|0.0|3.2] [89|0|11]; time left: 1.1mins; took 32.87s [1.1mins left]

Gathering cycle 4...
Cycle     4/5; r:    95.47; len:    95.47; n:  19; loss: [  9.30|    5.26|  5.10]; upd:    192; ; time: [29.4|0.0|3.3] [90|0|10]; time left: 0.5mins; took 33.99s [0.5mins left]

Finalizing...Drill finished after 165.32s.


You might have noticed that the drill function printed some details about the training that it did not print previously. It does so because the model is recurrent. AngoraPy trains recurrent models on temporal chunks (as opposed to full sequences), so it needs to convert the batch size you provide (the number of transitions included in every batch) into the number of chunks it processes per update. If we distributed the training, it would additionally have to allocate chunks to the processes.
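
The numbers in the log can be reproduced with a bit of arithmetic. The sketch below is not AngoraPy's implementation; `chunk_schedule` is a hypothetical helper, and the chunk length of 16 transitions is an assumption inferred from the log (a batch of 128 transitions split into 8 chunks), not a value taken from the library.

```python
def chunk_schedule(horizon, n_workers, batch_size, chunk_length):
    """Hypothetical helper: convert a batch size given in transitions
    into chunks per update and updates per epoch."""
    total_transitions = horizon * n_workers
    if batch_size % chunk_length != 0:
        raise ValueError("batch_size must be a multiple of chunk_length")
    if total_transitions % batch_size != 0:
        raise ValueError("horizon * n_workers must be a multiple of batch_size")

    chunks_per_update = batch_size // chunk_length
    updates_per_epoch = total_transitions // batch_size
    return chunks_per_update, updates_per_epoch


# values from this notebook: horizon=2048, workers=1, batch_size=128
# chunk_length=16 is ASSUMED (inferred from "8 chunks per update" in the log)
chunks_per_update, updates_per_epoch = chunk_schedule(
    horizon=2048, n_workers=1, batch_size=128, chunk_length=16
)
print(f"{chunks_per_update} chunks per update and {updates_per_epoch} updates per epoch")

# with epochs=3, the number of updates per cycle follows directly
updates_per_cycle = updates_per_epoch * 3
print(f"{updates_per_cycle} updates per cycle")
```

Under these assumptions, the result agrees with the log: 8 chunks per update, 16 updates per epoch, and an `upd` counter that grows by 48 per cycle.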

Let's evaluate again. Most likely, training ended at a lower performance than with our feedforward model. That's because training recurrent policies generally requires more data, and the given task does not need memory: the state dynamics (cart velocity and pole angular velocity) are already explicitly included as variables in the observation.

In [5]:
evaluation_results = agent.evaluate(1, act_confidently=True)[0]
print(f"Mean performance after training: {np.mean(evaluation_results.episode_rewards)}")

100%|██████████| 1/1 [00:03<00:00,  3.29s/it]

Mean performance after training: 285.0

That's it for model building. Next, we will learn how to load and inspect agents.