# Pulse Sequence Design using DQN
_Written by Will Kaufman_

This notebook walks through a reinforcement learning (RL) approach to pulse sequence design for spin systems. It uses [TF-Agents](https://www.tensorflow.org/agents), a reinforcement learning library built on TensorFlow, a common machine learning framework.

## TODO

- Figure out if an RNN starts with empty initial state (even if start of trajectory isn't initial state), or if it starts with initial state saved in replay buffer

In [1]:
import numpy as np

import spin_simulation as ss

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import q_network, q_rnn_network
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer, episodic_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.trajectories import time_step as ts
from tf_agents.utils import common

from environments import spin_sys_discrete

In [2]:
import importlib
importlib.reload(spin_sys_discrete)

<module 'environments.spin_sys_discrete' from '/Users/willkaufman/projects/rl_pulse/rl_pulse/environments/spin_sys_discrete.py'>

## Define algorithm hyperparameters



In [3]:
num_iterations = 1000 # @param {type:"integer"}
episode_length = 5 # @param {type:"integer"}

initial_collect_steps = 1000  # @param {type:"integer"}
collect_steps_per_iteration = episode_length  # @param {type:"integer"}
replay_buffer_max_length = 100000  # @param {type:"integer"}

batch_size = 12 #64  # @param {type:"integer"}
learning_rate = 1e-3  # @param {type:"number"}
log_interval = 100  # @param {type:"integer"}

num_eval_episodes = 5  # @param {type:"integer"}
eval_interval = 200  # @param {type:"integer"}

## Initialize the spin system

This sets the parameters of the system ($N$ spin-1/2 particles, which corresponds to a Hilbert space with dimension $2^N$). For the purposes of simulation, $\hbar \equiv 1$.

The total internal Hamiltonian is given by
$$
H_\text{int} = C H_\text{dip} + \delta \sum_i^N I_z^{i}
$$
where $C$ is the coupling strength, $\delta$ is the chemical shift strength (each spin is assumed to be identical), and $H_\text{dip}$ is given by
$$
H_\text{dip} = \sum_{i,j}^N d_{i,j} \left(3I_z^{i}I_z^{j} - \mathbf{I}^{i} \cdot \mathbf{I}^{j}\right)
$$
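
As a rough illustration of the two equations above, the internal Hamiltonian can be assembled with NumPy from single-spin operators. This is only a sketch under stated assumptions: the helper names `site_operator` and `internal_hamiltonian` are hypothetical (they are not the `spin_simulation` functions used in this notebook), the dipolar sum is taken over pairs $i<j$, and the couplings $d_{i,j}$ are drawn from a standard normal distribution purely for demonstration.

```python
import numpy as np

# Single spin-1/2 operators (hbar = 1).
IX = 0.5 * np.array([[0, 1], [1, 0]], dtype=complex)
IY = 0.5 * np.array([[0, -1j], [1j, 0]], dtype=complex)
IZ = 0.5 * np.array([[1, 0], [0, -1]], dtype=complex)


def site_operator(op, i, n):
    """Embed the single-spin operator `op` at site `i` of an n-spin system."""
    result = np.eye(1, dtype=complex)
    for k in range(n):
        result = np.kron(result, op if k == i else np.eye(2))
    return result


def internal_hamiltonian(n, coupling, delta, d):
    """Return C * H_dip + delta * sum_i I_z^i for an n-spin system.

    Assumption: the dipolar sum runs over pairs i < j, with strengths d[i, j].
    """
    dim = 2 ** n
    h_dip = np.zeros((dim, dim), dtype=complex)
    for i in range(n):
        for j in range(i + 1, n):
            zz = site_operator(IZ, i, n) @ site_operator(IZ, j, n)
            dot = sum(site_operator(a, i, n) @ site_operator(a, j, n)
                      for a in (IX, IY, IZ))
            h_dip += d[i, j] * (3 * zz - dot)
    h_cs = delta * sum(site_operator(IZ, i, n) for i in range(n))
    return coupling * h_dip + h_cs


rng = np.random.default_rng(0)
n_spins = 4
d = rng.normal(size=(n_spins, n_spins))  # illustrative couplings only
h_int = internal_hamiltonian(n_spins, coupling=1e3, delta=500, d=d)
print(h_int.shape)  # (16, 16)
```

The result is a Hermitian, traceless $2^N \times 2^N$ matrix, as expected for a sum of spin operators.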

The target Hamiltonian is the zeroth-order average Hamiltonian of the WHH-4 (WAHUHA) pulse sequence, which is designed to remove the dipolar interaction term from the internal Hamiltonian. The pulse sequence is $\tau, \overline{X}, \tau, Y, \tau, \tau, \overline{Y}, \tau, X, \tau$.
The zeroth-order average Hamiltonian for this sequence is
$$
H_\text{WHH}^{(0)} = \delta / 3 \sum_i^N \left( I_x^{i} + I_y^{i} + I_z^{i} \right)
$$
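
The averaging can be checked numerically for a small system. The sketch below, for two spins with ideal (zero-width) pulses, averages the toggling-frame Hamiltonians over the five free-evolution windows of the sequence ($\tau, \tau, 2\tau, \tau, \tau$). The helper names are hypothetical, and the pulse sign convention $\exp(+i\theta I_\alpha)$ is an assumption chosen so that the result matches the formula above; the opposite convention flips the signs of the $I_x$ and $I_y$ terms.

```python
import numpy as np

sx = 0.5 * np.array([[0, 1], [1, 0]], dtype=complex)
sy = 0.5 * np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = 0.5 * np.array([[1, 0], [0, -1]], dtype=complex)
eye2 = np.eye(2)

# Total spin operators and dipolar term for two spins (d_{1,2} = 1).
x_tot, y_tot, z_tot = (np.kron(s, eye2) + np.kron(eye2, s) for s in (sx, sy, sz))
h_dip = 3 * np.kron(sz, sz) - sum(np.kron(s, s) for s in (sx, sy, sz))


def pulse(axis_op, angle):
    """Ideal pulse exp(+i * angle * axis_op); the sign is a convention choice."""
    w, v = np.linalg.eigh(axis_op)
    return (v * np.exp(1j * angle * w)) @ v.conj().T


def average_hamiltonian(h, pulses, weights):
    """Zeroth-order average: weighted mean of the toggling-frame Hamiltonians."""
    u = np.eye(h.shape[0], dtype=complex)
    total = weights[0] * h
    for p, w in zip(pulses, weights[1:]):
        u = p @ u  # accumulated pulse propagator
        total = total + w * (u.conj().T @ h @ u)
    return total / sum(weights)


half_pi = np.pi / 2
# X-bar, Y, Y-bar, X
pulses = [pulse(x_tot, -half_pi), pulse(y_tot, half_pi),
          pulse(y_tot, -half_pi), pulse(x_tot, half_pi)]
weights = [1, 1, 2, 1, 1]  # window lengths in units of tau

avg_dip = average_hamiltonian(h_dip, pulses, weights)
avg_cs = average_hamiltonian(z_tot, pulses, weights)  # chemical shift, delta = 1
print(np.allclose(avg_dip, 0))
print(np.allclose(avg_cs, (x_tot + y_tot + z_tot) / 3))
```

Both checks should print `True`: the dipolar term averages to zero and the chemical shift term averages to $(I_x + I_y + I_z)/3$ per unit $\delta$.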

In [4]:
N=4
dim = 2**N
coupling = 1e3
delta = 500
(X,Y,Z) = ss.get_total_spin(N=N, dim=dim)
H_target = ss.get_H_WHH_0(X, Y, Z, delta)

The `SpinSystemDiscreteEnv` class keeps track of the system dynamics, and implements methods that are necessary for RL:

- `action_spec`: Returns an `ArraySpec` that gives the shape and range of a valid action. For example, in a discrete action space, an action will be an integer scalar between 0 and `numActions - 1`. For a continuous action space, an action will be a 3-dimensional vector representing phase, amplitude, and duration of the pulse.
- `observation_spec`: Returns an `ArraySpec` that gives the shape and range of a valid observation. In this case, the observations are all the actions performed on the environment so far.
- `_reset`: Resets the environment. This means setting the propagator to the identity, and choosing a new random dipolar interaction matrix $(d_{i,j})$.
- `_step`: Evolves the environment according to the action. Returns a `TimeStep` which includes the step type (`FIRST`, `MID`, or `LAST`), the **reward**, the discount rate to apply to future rewards, and an **observation** of the environment.
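
The reset/step contract described in this list can be sketched without TF-Agents. The toy class below is not `SpinSystemDiscreteEnv`: it is a hypothetical single-spin stand-in in which the action numbering, the random chemical-shift offset (standing in for the random dipolar matrix), and the zero reward are all assumptions made for illustration.

```python
import collections
import numpy as np

TimeStep = collections.namedtuple("TimeStep", "step_type reward discount observation")
FIRST, MID, LAST = 0, 1, 2


class ToyPulseEnv:
    """Toy single-spin stand-in for the reset/step contract described above."""

    def __init__(self, episode_length=5, delay=5e-6, seed=0):
        self.episode_length = episode_length
        self.delay = delay
        self.rng = np.random.default_rng(seed)
        sx = 0.5 * np.array([[0, 1], [1, 0]], dtype=complex)
        sy = 0.5 * np.array([[0, -1j], [1j, 0]], dtype=complex)
        self.sz = 0.5 * np.array([[1, 0], [0, -1]], dtype=complex)
        # Hypothetical action set: 0 = delay only, 1..4 = pi/2 pulses about +/-x, +/-y.
        self.pulse_axes = {1: sx, 2: -sx, 3: sy, 4: -sy}
        self.reset()

    @staticmethod
    def _expm(h, t):
        """exp(-i * h * t) for a Hermitian matrix h."""
        w, v = np.linalg.eigh(h)
        return (v * np.exp(-1j * w * t)) @ v.conj().T

    def _observation(self):
        # Actions performed so far, padded with -1 up to the episode length.
        obs = np.full(self.episode_length, -1, dtype=np.int32)
        obs[:len(self.actions)] = self.actions
        return obs

    def reset(self):
        self.propagator = np.eye(2, dtype=complex)   # propagator back to identity
        self.offset = self.rng.normal(scale=500.0)   # new random system parameter
        self.actions = []
        return TimeStep(FIRST, 0.0, 1.0, self._observation())

    def step(self, action):
        if action in self.pulse_axes:
            self.propagator = self._expm(self.pulse_axes[action], np.pi / 2) @ self.propagator
        # Free evolution for one delay after the (optional) pulse.
        self.propagator = self._expm(self.offset * self.sz, self.delay) @ self.propagator
        self.actions.append(action)
        done = len(self.actions) == self.episode_length
        reward = 0.0  # placeholder; the notebook's reward is fidelity-based
        return TimeStep(LAST if done else MID, reward, 0.0 if done else 1.0,
                        self._observation())


toy_env = ToyPulseEnv()
time_step = toy_env.reset()
for action in [1, 3, 0, 4, 2]:
    time_step = toy_env.step(action)
print(time_step.step_type == LAST, time_step.observation)
```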

The reward function $r(s,a)$ can in general depend on the environment state _and_ action performed. However, because the goal of pulse sequence design is to find high-fidelity pulse sequences, the reward only depends on the state. 
$$
r = -\log \left( 1-
    \left|
        \frac{\text{Tr} (U_\text{target}^\dagger U_\text{exp})}{\text{Tr}(\mathbb{1})}
    \right|
    \right)
% = -\log\left( 1- \text{fidelity}(U_\text{target}, U_\text{exp}) \right)
$$
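
A minimal NumPy sketch of this reward, assuming the natural logarithm (the base used by the environment is not shown here) and using a single-spin $z$-rotation as a stand-in for the experimental propagator. The function names are illustrative, not part of `spin_sys_discrete`.

```python
import numpy as np


def reward(u_target, u_exp):
    """-log(1 - |Tr(U_target^dagger U_exp)| / dim), using the natural log."""
    dim = u_target.shape[0]
    fidelity = np.abs(np.trace(u_target.conj().T @ u_exp)) / dim
    return -np.log(1.0 - fidelity)


def z_rotation(phi):
    """Single-spin propagator exp(-i * phi * I_z)."""
    return np.diag([np.exp(-0.5j * phi), np.exp(0.5j * phi)])


u_target = np.eye(2, dtype=complex)
for phi in (1.0, 0.1, 0.01):
    # Smaller rotation error -> fidelity closer to 1 -> larger reward.
    print(phi, reward(u_target, z_rotation(phi)))
```

The reward grows without bound as the fidelity approaches 1, so an implementation may want to cap it; whether the notebook's environment does so is not shown here.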



In [5]:
env = spin_sys_discrete.SpinSystemDiscreteEnv(N=4, dim=16, coupling=1e3,
    delta=500, H_target=H_target, X=X, Y=Y, delay=5e-6, pulse_width=0,
    delay_after=True, state_size=episode_length)
# env.reset()

# train_py_env = spin_sys_discrete.SpinSystemDiscreteEnv(N=4, dim=16, coupling=1e3,
#     delta=500, H_target=H_target, X=X, Y=Y, delay=5e-6, pulse_width=0,
#     delay_after=True)
# eval_py_env = spin_sys_discrete.SpinSystemDiscreteEnv(N=4, dim=16, coupling=1e3,
#     delta=500, H_target=H_target, X=X, Y=Y, delay=5e-6, pulse_width=0,
#     delay_after=True)

print('Observation Spec:')
print(env.time_step_spec().observation)

print('Reward Spec:')
print(env.time_step_spec().reward)

print('Action Spec:')
print(env.action_spec())

train_env = tf_py_environment.TFPyEnvironment(env)
eval_env = tf_py_environment.TFPyEnvironment(env)

Observation Spec:
ArraySpec(shape=(16, 16, 4), dtype=dtype('float32'), name=None)
Reward Spec:
ArraySpec(shape=(), dtype=dtype('float32'), name='reward')
Action Spec:
BoundedArraySpec(shape=(), dtype=dtype('int32'), name=None, minimum=0, maximum=4)


## Define Q-network

For the DQN algorithm, a Q-network must be defined. The Q-function $Q^\pi(s,a)$ is the expected total (discounted) return from performing action $a$ in state $s$ and then following policy $\pi$; the Q-network is a neural network that approximates it.
<!-- $$
Q^\pi(s,a) = 
$$ -->
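
For reference, the standard textbook definition with discount factor $\gamma$ is given below. The symbols $r_{t+k}$, $s_t$, and $a_t$ (reward, state, and action at time step $t$) are introduced here for illustration and do not appear elsewhere in this notebook.

$$
Q^\pi(s,a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s,\ a_t = a \right]
$$

The optimal Q-function satisfies the Bellman optimality equation, which is the relation that Q-learning methods such as DQN try to enforce on sampled transitions $(s, a, r, s')$:

$$
Q^*(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]
$$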

Because the system's state is entirely determined by the sequence of actions performed, an RNN is a natural choice for the Q-network: the network has an internal state that acts as a memory of the episode, each observation passed to the network updates that state, and the state is reset at the end of every episode. The cell below, however, builds a feed-forward `QNetwork` with three convolutional layers rather than a `QRnnNetwork`.

In [6]:
q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    conv_layer_params=[(32, 5, 1), (32, 3, 1), (16, 3, 1)]
)

target_q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    conv_layer_params=[(32, 5, 1), (32, 3, 1), (16, 3, 1)]
)

See what the initial Q-values are for the network.

In [7]:
q_net(train_env.current_time_step().observation)[0].numpy()

array([[-0.20589064, -0.19506574, -0.18172377, -0.19854292, -0.22036171]],
      dtype=float32)

In [8]:
q_net.summary()

Model: "QNetwork"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
EncodingNetwork (EncodingNet multiple                  97019     
_________________________________________________________________
dense_2 (Dense)              multiple                  205       
Total params: 97,224
Trainable params: 97,224
Non-trainable params: 0
_________________________________________________________________


In [9]:
q_net.get_layer("EncodingNetwork").summary()

Model: "EncodingNetwork"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              multiple                  3232      
_________________________________________________________________
conv2d_1 (Conv2D)            multiple                  9248      
_________________________________________________________________
conv2d_2 (Conv2D)            multiple                  4624      
_________________________________________________________________
flatten (Flatten)            multiple                  0         
_________________________________________________________________
dense (Dense)                multiple                  76875     
_________________________________________________________________
dense_1 (Dense)              multiple                  3040      
Total params: 97,019
Trainable params: 97,019
Non-trainable params: 0
_______________________________________________
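
The parameter counts in the two summaries can be reproduced by hand. The sketch below assumes each `conv_layer_params` tuple is `(filters, kernel_size, stride)`, that the convolutions use no padding, and that the encoder ends in two dense layers of 75 and 40 units; the last two assumptions are inferred from the printed counts rather than read from the network definition.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (inputs + 1) * units


# (filters, kernel_size, stride) tuples as passed to conv_layer_params above.
conv_layer_params = [(32, 5, 1), (32, 3, 1), (16, 3, 1)]
size, channels = 16, 4  # observation shape (16, 16, 4)

counts = []
for filters, kernel, stride in conv_layer_params:
    counts.append(conv2d_params(kernel, channels, filters))
    size = (size - kernel) // stride + 1  # assumes 'valid' (no) padding
    channels = filters

flat = size * size * channels
# Assumed fully connected sizes (75, 40); inferred from the printed counts.
for units in (75, 40):
    counts.append(dense_params(flat, units))
    flat = units

encoder_total = sum(counts)
q_head = dense_params(flat, 5)  # one output per action (5 actions)
print(counts, encoder_total, q_head, encoder_total + q_head)
# [3232, 9248, 4624, 76875, 3040] 97019 205 97224
```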

In [11]:
#q_net.get_layer("EncodingNetwork").get_layer("dense_42").get_weights()[0].shape

## Create agent

In RL, the "agent" has a policy that determines its behavior. For DQN, the agent will act greedily during evaluation (i.e. it picks the action with the maximal Q-value) and epsilon-greedily during data collection. These policies are accessed with `agent.policy` (for evaluation) and `agent.collect_policy` (for data collection).
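
The difference between the two policies can be sketched in a few lines of NumPy. This is an illustration of greedy and epsilon-greedy action selection in general, not the TF-Agents implementation; the Q-values are the untrained network's outputs printed earlier in this notebook, and `epsilon = 0.1` is chosen only for the example.

```python
import numpy as np


def greedy_action(q_values):
    """Pick the action with the largest Q-value."""
    return int(np.argmax(q_values))


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return greedy_action(q_values)


q_values = np.array([-0.20589064, -0.19506574, -0.18172377,
                     -0.19854292, -0.22036171])
rng = np.random.default_rng(0)

print(greedy_action(q_values))  # 2
actions = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(1000)]
# Fraction of collection steps on which the greedy action was chosen.
print(np.mean(np.array(actions) == 2))
```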

According to [the docs](https://www.tensorflow.org/agents/api_docs/python/tf_agents/agents/tf_agent/TFAgent?hl=fa#args), I can adjust `train_sequence_length=None` for RNN-based agents. When using non-RNN DQN, though, I don't have that option. 

In [12]:
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    target_q_network=target_q_net,
    target_update_period=10,
    optimizer=optimizer,
    gamma=0.99,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()

In [13]:
eval_policy = agent.policy
collect_policy = agent.collect_policy

random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())

In [14]:
def compute_avg_return(environment, policy, num_episodes=10, print_actions=False):

    total_return = 0.0
    for _ in range(num_episodes):

        time_step = environment.reset()
        policy_state = policy.get_initial_state(environment.batch_size)
        episode_return = 0.0

        while not time_step.is_last():
            action_step = policy.action(time_step, policy_state = policy_state)
            policy_state = action_step.state
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
            if print_actions:
                print(f"action: {action_step.action}, reward: {time_step.reward}, return: {episode_return}")
        total_return += episode_return

    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

In [15]:
compute_avg_return(eval_env, random_policy, num_eval_episodes)

0.7807545

In [None]:
# TODO include other metrics

## Create the replay buffer

A replay buffer stores trajectories (sequences of states and actions) from data collection, and training then samples from those stored trajectories. Reusing experience this way improves data efficiency, and sampling at random reduces the correlation between consecutive training examples.

The aim is to use the [EpisodicReplayBuffer](https://github.com/tensorflow/agents/blob/v0.5.0/tf_agents/replay_buffers/episodic_replay_buffer.py#L100) to return full episodes from the buffer. This is important when using a Q-RNN network, because the internal state must be updated starting from the beginning of the episode. The cell below still uses a `TFUniformReplayBuffer`.

In [16]:
# replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
#     data_spec=agent.collect_data_spec,
#     batch_size=train_env.batch_size,
#     max_length=replay_buffer_max_length)

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length,
)

replay_buffer

<tf_agents.replay_buffers.tf_uniform_replay_buffer.TFUniformReplayBuffer at 0x7f8e8c84eef0>
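
The idea behind a uniform replay buffer can be sketched with the standard library alone. The class below is a hypothetical toy, not the TF-Agents `TFUniformReplayBuffer`: it keeps at most `max_length` items, drops the oldest when full, and samples uniformly with replacement.

```python
import collections
import random


class UniformReplayBuffer:
    """Fixed-capacity FIFO store with uniform random sampling."""

    def __init__(self, max_length, seed=None):
        self._items = collections.deque(maxlen=max_length)
        self._rng = random.Random(seed)

    def add(self, item):
        self._items.append(item)  # the oldest item is dropped when full

    def sample(self, batch_size):
        # Uniform sampling with replacement.
        return [self._rng.choice(self._items) for _ in range(batch_size)]

    def __len__(self):
        return len(self._items)


toy_buffer = UniformReplayBuffer(max_length=3, seed=0)
for step in range(5):
    toy_buffer.add({"step": step, "action": step % 5, "reward": 0.0})

print(len(toy_buffer))  # 3
sampled_steps = {item["step"] for item in toy_buffer.sample(12)}
print(sorted(sampled_steps))  # only steps 2, 3, 4 can appear
```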

In [20]:
def collect_step(environment, policy, buffer):
    time_step = environment.current_time_step()
    if time_step.is_last():
        time_step = environment.reset()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
#     print(traj)
    # Add trajectory to the replay buffer
    buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
    for _ in range(steps):
        collect_step(env, policy, buffer)

Collect 64 episodes of experience (`episode_length*64` steps) with a random policy and store them in the replay buffer.

In [24]:
train_env.reset()

TimeStep(step_type=<tf.Tensor: shape=(1,), dtype=int32, numpy=array([0], dtype=int32)>, reward=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>, discount=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>, observation=<tf.Tensor: shape=(1, 16, 16, 4), dtype=float32, numpy=
array([[[[ 9.99949932e-01, -1.00060096e-02,  9.99997914e-01,
          -1.66666496e-03],
         [ 0.00000000e+00,  0.00000000e+00,  4.16145253e-04,
          -4.17186908e-04],
         [ 0.00000000e+00,  0.00000000e+00,  4.16145253e-04,
          -4.17186908e-04],
         ...,
         [ 0.00000000e+00,  0.00000000e+00, -1.44736126e-10,
          -1.44615570e-10],
         [ 0.00000000e+00,  0.00000000e+00, -1.44736126e-10,
          -1.44615570e-10],
         [ 0.00000000e+00,  0.00000000e+00, -1.20563227e-13,
           1.66400347e-29]],

        [[ 0.00000000e+00,  0.00000000e+00, -4.17186908e-04,
          -4.16145253e-04],
         [ 9.99995530e-01,  6.05578534e-0

In [18]:
train_env.reset()

collect_data(env=train_env,
    policy=random_policy, 
    buffer=replay_buffer,
    steps=episode_length*64)

In [19]:
agent.collect_data_spec

Trajectory(step_type=TensorSpec(shape=(), dtype=tf.int32, name='step_type'), observation=TensorSpec(shape=(16, 16, 4), dtype=tf.float32, name=None), action=BoundedTensorSpec(shape=(), dtype=tf.int32, name=None, minimum=array(0, dtype=int32), maximum=array(4, dtype=int32)), policy_info=(), next_step_type=TensorSpec(shape=(), dtype=tf.int32, name='step_type'), reward=TensorSpec(shape=(), dtype=tf.float32, name='reward'), discount=BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)))

A TensorFlow `Dataset` takes care of sampling the replay buffer and generating trajectories. The replay buffer can be converted to a `Dataset`, which is then used for training.

In [None]:
# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=2,
    sample_batch_size=batch_size, 
    num_steps=2).prefetch(3)


dataset
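
With `num_steps=2`, each sampled item holds two adjacent time steps, which is what a one-step temporal-difference update needs. The NumPy sketch below shows the standard one-step DQN target and squared TD loss on a tiny hand-made batch. It is not taken from TF-Agents internals; `gamma=0.99` mirrors the value passed to the agent above, and the function names are illustrative.

```python
import numpy as np


def td_targets(rewards, discounts, next_q_values, gamma=0.99):
    """One-step targets r + gamma * discount * max_a' Q_target(s', a')."""
    return rewards + gamma * discounts * next_q_values.max(axis=1)


def td_loss(q_values, actions, targets):
    """Mean squared TD error for the actions that were actually taken."""
    taken = q_values[np.arange(len(actions)), actions]
    return np.mean((targets - taken) ** 2)


# Batch of two transitions with three possible actions.
rewards = np.array([1.0, 2.0])
discounts = np.array([1.0, 0.0])  # 0 marks a transition into a terminal step
next_q = np.array([[0.5, 1.0, 0.0],
                   [3.0, 2.0, 1.0]])  # target-network Q-values at s'
q_values = np.array([[1.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0]])  # online-network Q-values at s
actions = np.array([0, 1])

targets = td_targets(rewards, discounts, next_q)
print(targets)  # [1.99 2.  ]
print(td_loss(q_values, actions, targets))
```

The second transition has discount 0, so its target is just the reward; no value is bootstrapped past the end of an episode.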

In [None]:
iterator = iter(dataset)

print(iterator)

In [None]:
#iterator.next()

## Create the driver

TODO

- see whether I actually need driver (seems slower...)
- add writeup to this section

In [None]:
num_episodes = tf_metrics.NumberOfEpisodes()
avg_return = tf_metrics.AverageReturnMetric()

observers = [num_episodes,
             avg_return,
             replay_buffer.add_batch]

In [None]:
driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    collect_policy,
    observers,
    num_steps=episode_length*1
)

In [None]:
final_time_step, policy_state = driver.run()

In [None]:
num_episodes.result()

## Train the agent

In [None]:
# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)
agent.collect_policy.action = common.function(agent.collect_policy.action)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]
print(returns)

In [None]:
%load_ext line_profiler

In [None]:
def train_agent():
    train_env.reset()
    policy_state = agent.collect_policy.get_initial_state(train_env.batch_size)

    for _ in range(num_iterations):

        # Collect a few steps using collect_policy and save to the replay buffer.
#         final_time_step, policy_state = driver.run()
        for _ in range(collect_steps_per_iteration):
            #print(policy_state)
            collect_step(train_env,
                         agent.collect_policy,
                         replay_buffer)

        # Sample a batch of data from the buffer and update the agent's network.
        experience, unused_info = next(iterator)
        train_loss = agent.train(experience).loss

        step = agent.train_step_counter.numpy()

        if step % log_interval == 0:
            # print(q_net(np.zeros((1,5,5), dtype="float32"))[0].numpy())
            print(f'step = {step}: loss = {train_loss}')

        if step % eval_interval == 0:
            avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
            print(f'step = {step}: Average Return = {avg_return}')
            # Record the return before checking the early-stopping threshold.
            returns.append(avg_return)
            if avg_return > 50:
                break

In [None]:
%lprun -f train_agent train_agent()

## Evaluate the agent

See what pulse sequences the trained agent performs.

In [None]:
compute_avg_return(eval_env, agent.policy, num_episodes=1, print_actions=True)

Look at the Q-network structure (the encoding network and the final dense layer).

In [None]:
q_net.summary()

In [None]:
w = q_net.get_layer("EncodingNetwork").get_weights()
for weight in w:
    print(weight.shape)

See what the Q-function returns at each step of a play-through.

In [None]:
# Step through the action sequence 1, 2, 4, 3, 0 and print the Q-values at each step.
# `time_step` is used instead of `ts` to avoid shadowing the imported `ts` module.
time_step = train_env.reset()
print(q_net(time_step.observation, step_type=time_step.step_type)[0].numpy())
for action in [1, 2, 4, 3, 0]:
    time_step = train_env.step(action)
    print(q_net(time_step.observation, step_type=time_step.step_type)[0].numpy())
print(time_step.reward.numpy())

## Manually interact with the environment

In [None]:
eval_env.reset()
# run the WHH-4 sequence
eval_env.step(1)
eval_env.step(2)
eval_env.step(4)
eval_env.step(3)
eval_env.step(0)