
Policies are not stochastic on arm64 architectures #961

@kamilazdybal

Description


I'm implementing a continuous action space with TF-Agents, where I want the action to be a four-element array with components:

$s \in [0,10]$, $dx \in [-1,1]$, $dy \in [-1,1]$, $dz \in [-1,1]$

I'm then using RandomTFPolicy to sample actions for a batch of 5 observations. This is what I get every time I run my code:

[[ 3.1179297  -0.37641406 -0.37641406 -0.37641406]
 [ 8.263412    0.65268254  0.65268254  0.65268254]
 [ 6.849456    0.36989117  0.36989117  0.36989117]
 [ 0.06709099 -0.9865818  -0.9865818  -0.9865818 ]
 [ 7.8749514   0.5749903   0.5749903   0.5749903 ]]

My questions are:

  1. How come $dx$, $dy$, and $dz$ are the same float? Why aren't they sampled independently? (In each row above, $dx = dy = dz = 2(s/10) - 1$, so all four components look like affine rescalings of a single uniform draw per batch element.)
  2. How come I get the exact same action values every time I run my code? I'm not setting random seeds anywhere.

I'm using macOS on arm64 with:

python==3.11.13
tf-agents==0.19.0
tensorflow==2.15.1
tensorflow-metal==1.1.0
tensorflow-probability==0.23.0
numpy==1.26.4
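
For triaging, a quick check I can run (minimal sketch below, assuming TensorFlow's global stateful RNG is the suspect rather than TF-Agents itself) is to compare stateful and stateless uniform draws across fresh Python processes:

import tensorflow as tf

# If the global stateful RNG is the culprit on arm64/Metal, these draws would
# also repeat across fresh Python processes even without an explicit seed.
print(tf.random.uniform(shape=(2, 4), minval=-1.0, maxval=1.0).numpy())

# The stateless variant is *expected* to be identical across runs, since the
# seed is supplied explicitly; it serves as a reference point.
print(tf.random.stateless_uniform(shape=(2, 4), seed=[1, 2],
                                  minval=-1.0, maxval=1.0).numpy())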

Interestingly, this does not happen on an x86 macOS machine, nor on a Windows machine with the same package versions! There, all numbers are random:

[[ 2.9280972e+00 -3.7891769e-01 -7.7160120e-02  8.4350657e-01]
 [ 2.3010242e+00 -1.9348240e-01 -6.7645931e-01  2.9825187e-01]
 [ 6.4993248e+00  4.0297508e-03 -5.8490920e-01 -5.0786805e-01]
 [ 9.6005363e+00 -2.8406858e-01 -7.8258038e-02  5.8963799e-01]
 [ 1.4861953e+00 -8.2189059e-01 -2.9714632e-01 -5.1117587e-01]]

My code:

import numpy as np
import tensorflow as tf
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts
from tf_agents.policies import random_tf_policy

observation_spec = array_spec.BoundedArraySpec(
    shape=(64, 64, 2),
    dtype=np.float32,
    minimum=0.0,
    maximum=1.0,
    name="observation",
)

action_spec = array_spec.BoundedArraySpec(
    shape=(4,),
    dtype=np.float32,
    minimum=np.array([0.0, -1.0, -1.0, -1.0], dtype=np.float32),
    maximum=np.array([10.0,  1.0,  1.0,  1.0], dtype=np.float32),
    name="action",
)

time_step_spec = ts.time_step_spec(observation_spec)

policy = random_tf_policy.RandomTFPolicy(time_step_spec=time_step_spec,
                                         action_spec=action_spec)

obs = tf.random.uniform(shape=(5, 64, 64, 2), minval=0.0, maxval=1.0, dtype=tf.float32)

timestep = ts.restart(observation=obs, batch_size=5)
action_step = policy.action(timestep, seed=None)
actions = action_step.action

print(actions.numpy())
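
As a sanity check/workaround, I can also sample the action box manually with plain tf.random.uniform, bypassing RandomTFPolicy entirely (a sketch, under the assumption that scaling one uniform draw per component is equivalent to sampling the bounded spec):

# Hypothetical workaround: sample the box [minimum, maximum] directly with one
# uniform draw per component, bypassing RandomTFPolicy.
low = tf.constant(action_spec.minimum, dtype=tf.float32)     # [0, -1, -1, -1]
high = tf.constant(action_spec.maximum, dtype=tf.float32)    # [10, 1, 1, 1]
u = tf.random.uniform(shape=(5, 4), minval=0.0, maxval=1.0)  # batch of 5
manual_actions = low + u * (high - low)
print(manual_actions.numpy())

On the arm64 machine this lets me check whether the four components come out independent once tf-agents is out of the loop.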

Moreover, I'm also seeing that agent.collect_policy samples the same action for the same observation value, just as agent.policy would. My understanding is that collect_policy should always be stochastic. Here are some actions where the last 11 correspond to the agent seeing the exact same observation (the actions are deterministic at this point, but should be stochastic); a small diagnostic sketch follows after the dump:

<tf.Tensor: shape=(1, 20, 4), dtype=float32, numpy=
array([[[ 4.543779  , -0.7549822 , -0.98628926, -0.6924889 ],
        [ 4.543779  , -0.75414133, -0.9862873 , -0.6924889 ],
        [ 4.543778  , -0.7378491 , -0.98595184, -0.6924772 ],
        [ 4.5437074 , -0.69559884, -0.98290896, -0.69219804],
        [ 4.5435147 , -0.66731834, -0.97919464, -0.6916727 ],
        [ 4.543341  , -0.6532383 , -0.9768447 , -0.6912718 ],
        [ 4.5433545 , -0.6541612 , -0.9770225 , -0.6912997 ],
        [ 4.543534  , -0.6692811 , -0.979528  , -0.69171715],
        [ 4.5436664 , -0.6870763 , -0.9819559 , -0.69207287],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ]]],
      dtype=float32)>
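
To separate a collapsed action distribution from deterministic sampling, one check I can run (a hedged sketch, assuming agent and a batched timestep holding the repeated observation are already in scope) is to query collect_policy's distribution directly and draw a few samples:

# Hedged diagnostic: if repeated samples from the distribution differ, the
# distribution itself is fine and the determinism comes from how actions are
# sampled/seeded; if they are identical, the distribution has collapsed.
dist_step = agent.collect_policy.distribution(timestep)
action_dist = dist_step.action
for _ in range(3):
    print(action_dist.sample().numpy())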

I've used a simple REINFORCE agent here:

from tf_agents.agents.reinforce import reinforce_agent

# train_env, actor_net, optimizer, and train_step_counter are defined earlier in my setup.
agent = reinforce_agent.ReinforceAgent(time_step_spec=train_env.time_step_spec(),
                                       action_spec=train_env.action_spec(),
                                       actor_network=actor_net,
                                       optimizer=optimizer,
                                       train_step_counter=train_step_counter,
                                       gamma=0.95,
                                       normalize_returns=False,
                                       entropy_regularization=None)

agent.initialize()
