# Pulse Sequence Design using DDPG
_Written by Will Kaufman, October 2020_

This notebook walks through a reinforcement learning approach to pulse sequence design for spin systems. It uses [TF-Agents](https://www.tensorflow.org/agents), a reinforcement learning library built on TensorFlow, a common machine learning framework.

This notebook is loosely based on [this DDPG example](https://github.com/tensorflow/agents/blob/v0.6.0/tf_agents/agents/ddpg/examples/v2/train_eval.py#L71) using TF-Agents, although the agent defined below uses PPO rather than DDPG.

In [None]:
import numpy as np
import os
import time
import qutip as qt
import tensorflow as tf

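# tf_py_environment is needed below to wrap the Python environment
from tf_agents.environments import tf_py_environment

from rl_pulse.environments import spin_system_continuous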

In [None]:
import importlib
importlib.reload(spin_system_continuous)

## Define algorithm hyperparameters



In [None]:
# TODO eventually fill in hyperparameters at top of doc
gamma = 0.99  # discount factor, matching the default in calculate_returns

In [None]:
# TODO add summary writers, save data periodically to view in tensorboard

In [None]:
# TODO add global step, for training/eval

## Initialize the spin system

This sets the parameters of the system ($N$ spin-1/2 particles, which corresponds to a Hilbert space with dimension $2^N$). For the purposes of simulation, $\hbar \equiv 1$.
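
As a quick check of the dimension counting, the sketch below embeds a single-spin operator into the $N$-spin space with Kronecker products. It uses plain NumPy rather than QuTiP, and the helper name `embed` is hypothetical (it is not part of `rl_pulse`).

```python
import numpy as np

# Pauli z matrix for a single spin-1/2 particle
sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)


def embed(op, i, n):
    """Place the 2x2 operator `op` on spin `i` of an n-spin system.

    Hypothetical helper: identities act on every other spin.
    """
    result = np.eye(1, dtype=complex)
    for k in range(n):
        factor = op if k == i else np.eye(2, dtype=complex)
        result = np.kron(result, factor)
    return result


n_spins = 3
op = embed(sigma_z, 1, n_spins)
print(op.shape)  # (8, 8), i.e. 2**3 by 2**3
```

The QuTiP cells below build the same kind of operator with `qt.tensor` applied to a list of identities and one Pauli matrix.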

The total internal Hamiltonian is given by
$$
H_\text{int} = C H_\text{dip} + \sum_i^N \delta_i I_z^{i}
$$
where $C$ is the coupling strength, $\delta_i$ is the chemical shift of spin $i$ (the spins are otherwise assumed to be identical), and $H_\text{dip}$, with pairwise couplings $d_{i,j}$, is given by
$$
H_\text{dip} = \sum_{i<j}^N d_{i,j} \left(3I_z^{i}I_z^{j} - \mathbf{I}^{i} \cdot \mathbf{I}^{j}\right)
$$
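
Since $\mathbf{I}^{i} \cdot \mathbf{I}^{j} = I_x^{i}I_x^{j} + I_y^{i}I_y^{j} + I_z^{i}I_z^{j}$, each pair term equals $2I_z^{i}I_z^{j} - I_x^{i}I_x^{j} - I_y^{i}I_y^{j}$. The NumPy sketch below checks this for two spins and also writes it with Pauli matrices, assuming $I = \sigma/2$. The code cell that builds `Hdip` uses the Pauli form without the factor of $1/4$, which appears to be absorbed into the random couplings.

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

# Spin-1/2 operators I = sigma / 2
ix, iy, iz = sx / 2, sy / 2, sz / 2

# Two-spin products such as I_z^1 I_z^2 are Kronecker products
dot = np.kron(ix, ix) + np.kron(iy, iy) + np.kron(iz, iz)
h_dip_pair = 3 * np.kron(iz, iz) - dot

# The same operator written with Pauli matrices
h_pauli = (2 * np.kron(sz, sz) - np.kron(sx, sx) - np.kron(sy, sy)) / 4

print(np.allclose(h_dip_pair, h_pauli))  # True
```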

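The target unitary transformation is a simple $\pi/2$-pulse about the x-axis applied to every spin. With $I_x^j = \sigma_x^j / 2$,
$$
U_\text{target} = \exp\left(-i \frac{\pi}{2} \sum_j I_x^j \right) = \exp\left(-i \frac{\pi}{4} \sum_j \sigma_x^j \right)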
$$
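
For a single spin, a rotation by an angle $\theta$ about x is $\exp(-i\theta I_x)$ with $I_x = \sigma_x/2$, which has the closed form $\cos(\theta/2)\,\mathbb{1} - i\sin(\theta/2)\,\sigma_x$. The NumPy sketch below (the function name `rotation_x` is hypothetical) confirms that $\theta = \pi/2$ tips a spin from $+z$ into the transverse plane.

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)


def rotation_x(angle):
    """exp(-i * angle * I_x) with I_x = sigma_x / 2, in closed form."""
    return np.cos(angle / 2) * np.eye(2) - 1j * np.sin(angle / 2) * sx


u = rotation_x(np.pi / 2)
spin_up = np.array([1, 0], dtype=complex)  # spin along +z
rotated = u @ spin_up

# Expectation values after the pulse
expect_z = np.real(rotated.conj() @ sz @ rotated)
expect_y = np.real(rotated.conj() @ sy @ rotated)
print(round(expect_z, 12), round(expect_y, 12))  # 0.0 -1.0
```

The spin ends up along $-y$, as expected for a right-handed $\pi/2$ rotation about x.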

<!-- Hamiltonian is set to be the 0th-order average Hamiltonian from the WHH-4 pulse sequence, which is designed to remove the dipolar interaction term from the internal Hamiltonian. The pulse sequence is $\tau, \overline{X}, \tau, Y, \tau, \tau, \overline{Y}, \tau, X, \tau$.
The zeroth-order average Hamiltonian for the WAHUHA pulse sequence is
$$
H_\text{WHH}^{(0)} = \delta / 3 \sum_i^N \left( I_x^{i} + I_y^{i} + I_z^{i} \right)
$$ -->

In [None]:
N = 3  # 3-spin system

In [None]:
chemical_shifts = np.random.normal(scale=50, size=(N,))
Hcs = sum(
    [qt.tensor(
        [qt.identity(2)]*i
        + [chemical_shifts[i] * qt.sigmaz()]
        + [qt.identity(2)]*(N-i-1)
    ) for i in range(N)]
)

In [None]:
dipolar_matrix = np.random.normal(scale=50, size=(N, N))
Hdip = sum([
    dipolar_matrix[i, j] * (
        2 * qt.tensor(
            [qt.identity(2)]*i
            + [qt.sigmaz()]
            + [qt.identity(2)]*(j-i-1)
            + [qt.sigmaz()]
            + [qt.identity(2)]*(N-j-1)
        )
        - qt.tensor(
            [qt.identity(2)]*i
            + [qt.sigmax()]
            + [qt.identity(2)]*(j-i-1)
            + [qt.sigmax()]
            + [qt.identity(2)]*(N-j-1)
        )
        - qt.tensor(
            [qt.identity(2)]*i
            + [qt.sigmay()]
            + [qt.identity(2)]*(j-i-1)
            + [qt.sigmay()]
            + [qt.identity(2)]*(N-j-1)
        )
    )
    for i in range(N) for j in range(i+1, N)
])

In [None]:
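Hsys = Hcs + Hdip
# Collective spin operators: the sum of the single-spin Pauli operator over
# all spins (the N-fold tensor product would be an N-body operator instead).
X = sum(qt.tensor([qt.identity(2)]*i + [qt.sigmax()] + [qt.identity(2)]*(N-i-1))
        for i in range(N))
Y = sum(qt.tensor([qt.identity(2)]*i + [qt.sigmay()] + [qt.identity(2)]*(N-i-1))
        for i in range(N))
Hcontrols = [50e3 * X, 50e3 * Y]
# exp(-i * X * pi/4): a pi/2 rotation of every spin about x
target = qt.propagator(X, np.pi/4)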

The `SpinSystemContinuousEnv` simulates the quantum system given above and exposes the methods needed for RL: a `step` method that takes an action and returns an observation and reward, and a `reset` method that resets the system.
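
To make the `reset`/`step` interface concrete, here is a minimal single-spin sketch in plain NumPy. Everything in it is an assumption made for illustration: the class `ToySpinEnv`, the helper `propagator`, the fixed time step, and in particular the reward (a gate fidelity), which is not necessarily what `SpinSystemContinuousEnv` computes.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)


def propagator(h, dt):
    """exp(-i * h * dt) for a Hermitian matrix h, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(h)
    return (vecs * np.exp(-1j * vals * dt)) @ vecs.conj().T


class ToySpinEnv:
    """Hypothetical single-spin environment with a reset/step interface."""

    def __init__(self, target, dt=0.1, max_steps=10):
        self.target = target
        self.dt = dt
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.unitary = np.eye(2, dtype=complex)
        self.actions = []
        return np.zeros((0, 2))  # observation: the actions applied so far

    def step(self, action):
        ax, ay = action
        h = ax * SX + ay * SY  # control Hamiltonian for this step
        self.unitary = propagator(h, self.dt) @ self.unitary
        self.actions.append([ax, ay])
        # Assumed reward: gate fidelity |Tr(U_target^dagger U)|^2 / d^2
        overlap = np.trace(self.target.conj().T @ self.unitary)
        reward = np.abs(overlap) ** 2 / 4
        done = len(self.actions) >= self.max_steps
        return np.array(self.actions), reward, done


target = propagator(SX, np.pi / 4)  # pi/2 rotation about x
env = ToySpinEnv(target, dt=np.pi / 40, max_steps=10)
obs = env.reset()
done = False
while not done:
    obs, reward, done = env.step((1.0, 0.0))  # constant drive about x
print(obs.shape)  # (10, 2)
```

Ten steps of a constant x drive with `dt = pi/40` add up to the target rotation, so the final reward in this toy setting is 1 up to rounding.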

**TODO**: rewrite `spin_system_continuous` so it only uses `tensorflow` (no `tf-agents`).

In [None]:
env = spin_system_continuous.SpinSystemContinuousEnv(
    Hsys=Hsys,
    Hcontrols=Hcontrols,
    target=target,
    discount_factor=gamma
)

In [None]:
train_env = tf_py_environment.TFPyEnvironment(env)
eval_env = tf_py_environment.TFPyEnvironment(env)

## Define actor and critic networks

The observations of the system are sequences of control amplitudes that have been applied to the system (this most closely represents the knowledge available in a typical experiment). Both the actor and the critic (value) networks share an LSTM layer, which converts the sequence of control amplitudes to a hidden state, and two dense layers. Separate policy and value "heads" produce the outputs of the two networks.

In [None]:
lstm = tf.keras.layers.LSTM(64)
stateful_lstm = tf.keras.layers.LSTM(64, stateful=True)
hidden1 = tf.keras.layers.Dense(64, activation=tf.keras.activations.relu)
hidden2 = tf.keras.layers.Dense(64, activation=tf.keras.activations.relu)
policy = tf.keras.layers.Dense(2, activation=tf.keras.activations.tanh)
value = tf.keras.layers.Dense(1)

In [None]:
actor_net = tf.keras.models.Sequential([
    lstm,
    hidden1,
    hidden2,
    policy
])

In [None]:
critic_net = tf.keras.models.Sequential([
    lstm,
    hidden1,
    hidden2,
    value
])

In [None]:
stateful_actor_net = tf.keras.models.Sequential([
    stateful_lstm,
    hidden1,
    hidden2,
    policy
])

## Define PPO agent

[Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347) is a state-of-the-art RL algorithm that can be used for both discrete and continuous action spaces. PPO prevents the policy from over-adjusting during training by defining a clipped policy gradient loss function:
$$
L^\text{clip}(\theta) = \mathbb{E}_t\left[
\min\left(r_t(\theta)\hat{A}_t, \text{clip}\left(
    r_t(\theta), 1-\epsilon, 1+\epsilon\right)
\hat{A}_t\right)
\right]
$$
where the "importance ratio" $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$ is the probability of choosing the action under the new policy relative to the old policy. Because of the clipping, the gradient vanishes once $r_t(\theta)$ leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction favored by the advantage, which keeps each update close to the original policy.
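
The effect of the clipping is easiest to see on a few numbers. The NumPy sketch below evaluates the per-sample objective $\min(r\hat{A}, \text{clip}(r, 1-\epsilon, 1+\epsilon)\hat{A})$ for a range of ratios. The function name `clipped_surrogate` is hypothetical.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)
    return np.minimum(ratio * advantage, clipped * advantage)


ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.5])

# Positive advantage: the objective stops growing once r > 1 + epsilon
pos = clipped_surrogate(ratios, advantage=1.0)
print(pos)  # values: 0.5, 0.9, 1.0, 1.1, 1.2

# Negative advantage: the objective is flat once r < 1 - epsilon
neg = clipped_surrogate(ratios, advantage=-1.0)
print(neg)  # values: -0.8, -0.9, -1.0, -1.1, -1.5
```

In both cases the objective offers no further gain for pushing the ratio beyond the clip range in the favorable direction, while moves in the unfavorable direction are still penalized in full.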

Because the actor and critic networks share layers, a single combined loss function is used for training:
$$
L(\theta) = \mathbb{E}_t \left[
-L^\text{clip}(\theta) + c_1 L^\text{VF}(\theta)
\right]
$$
with $L^\text{VF}(\theta)$ as the MSE loss for value estimates.
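
A small worked example of the combined loss, again as a NumPy sketch with a hypothetical function name (`total_loss`) and made-up numbers:

```python
import numpy as np


def total_loss(ratio, advantage, value_pred, returns, epsilon=0.2, c1=1.0):
    """Combined loss -L_clip + c1 * L_VF, averaged over the batch."""
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)
    l_clip = np.mean(np.minimum(ratio * advantage, clipped * advantage))
    l_vf = np.mean((returns - value_pred) ** 2)  # MSE of value estimates
    return -l_clip + c1 * l_vf


loss = total_loss(
    ratio=np.array([1.0, 1.3]),
    advantage=np.array([1.0, -0.5]),
    value_pred=np.array([0.5, 0.0]),
    returns=np.array([1.0, 0.5]),
)
# L_clip = (1.0 - 0.65) / 2 = 0.175 and L_VF = 0.25
print(round(loss, 6))  # 0.075
```

The sign convention matters: the clipped objective is maximized, so it enters the minimized loss with a minus sign, while the value error enters with a plus sign.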

The implementation below is based on the TF-Agents [abstract base class](https://www.tensorflow.org/agents/api_docs/python/tf_agents/agents/TFAgent) and on the [PPOAgent code](https://github.com/tensorflow/agents/blob/v0.6.0/tf_agents/agents/ppo/ppo_agent.py#L746).

## Collect some experience from the environment

The following cells collect experience by interacting with the environment. Right now collection is really slow, so I'm using `%%time` to time the code and `%%prun` to profile it and see where the slow-downs are happening.

It runs 100 timesteps in 3.5 s but 1000 timesteps in 86 s, so the time per step grows the longer you run it.

- [ ] rewrite actor/critic so they can take lstm state as input and return final lstm state as output

All data tensors should have shape `(batch_size, [other dims])` rather than just `(batch_size,)`.

In [None]:
# %load_ext line_profiler

In [None]:
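def calculate_returns(rewards, step_types, gamma=0.99):
    """Compute discounted returns by working backwards through the rewards.

    Args:
        rewards: A tensor of rewards, one per collected step. Should have
            size batch_size * 1.
        step_types: A tensor of step types, 1 if a normal step.
        gamma: Discount factor.

    Returns:
        A list of returns, one per step. The return is not propagated
        backwards across a step whose type is not 1 (an episode boundary).
    """
    returns = [0] * rewards.shape[0]
    returns[-1] = rewards[-1]
    for i in range(1, len(rewards)):
        # 1.0 for a normal step, 0.0 at an episode boundary
        not_done = tf.cast(step_types[-(i + 1)] == 1, tf.float32)
        returns[-(i + 1)] = gamma * returns[-i] * not_done + rewards[-(i + 1)]
    return returns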
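
The recursion in `calculate_returns` is $G_t = r_t + \gamma G_{t+1}$, with the carry-over dropped at episode boundaries. The NumPy sketch below shows the same idea on a tiny example. The function name `discounted_returns` and the boolean `episode_ended` flag are assumptions for illustration (the notebook uses step types instead).

```python
import numpy as np


def discounted_returns(rewards, episode_ended, gamma=0.99):
    """Backward recursion G_t = r_t + gamma * G_{t+1}, reset at episode ends.

    `episode_ended[t]` is True when step t was the last step of an episode.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        if episode_ended[t]:
            running = 0.0  # nothing carries over from the next episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = np.array([1.0, 1.0, 1.0, 2.0])
ended = np.array([False, False, True, False])
result = discounted_returns(rewards, ended, gamma=0.5)
print(result)  # values: 1.75, 1.5, 1.0, 2.0
```

The reward of 2.0 in the second episode does not leak into the returns of the first three steps.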

In [None]:
def get_obs_and_mask(obs, max_sequence_length=500):
    obs2 = []
    mask = []
    num_features = obs[0].shape[-1]
    for i in range(len(obs)):
        obs_length = obs[i].shape[-2]
        obs2.append(tf.concat(
            [tf.cast(obs[i], tf.float32), tf.zeros((1, max_sequence_length - obs_length, num_features))],
            axis=1
        ))
        mask.append(tf.concat(
            [tf.ones((1, obs_length)), tf.zeros((1, max_sequence_length - obs_length))],
            axis=1
        ))
    obs2 = tf.squeeze(tf.stack(obs2))
    mask = tf.squeeze(tf.stack(mask))
    return obs2, mask
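
Because the observation grows by one timestep per action, observations from different steps have different lengths and must be padded to a common length before batching. The NumPy sketch below shows the padding-plus-mask idea in isolation. The function name `pad_and_mask` is hypothetical, and the mask is boolean here, whereas `get_obs_and_mask` builds it from ones and zeros.

```python
import numpy as np


def pad_and_mask(sequences, max_length):
    """Zero-pad variable-length (length, features) arrays and build a mask."""
    num_features = sequences[0].shape[-1]
    padded = np.zeros((len(sequences), max_length, num_features))
    mask = np.zeros((len(sequences), max_length), dtype=bool)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq
        mask[i, :len(seq)] = True  # True marks real (unpadded) timesteps
    return padded, mask


seqs = [np.ones((1, 2)), np.ones((3, 2))]
padded, mask = pad_and_mask(seqs, max_length=4)
print(padded.shape)      # (2, 4, 2)
print(mask.sum(axis=1))  # valid timesteps per sequence: 1 and 3
```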

In [None]:
def collect_data(num_steps=500, stddev=1e-3, max_sequence_length=500):
    step_types = []
    observations = []
    actions = []
    action_means = []
    rewards = []
    step = train_env.reset()
    if stateful_actor_net.built:
        stateful_actor_net.reset_states()
    # collect experience
    for _ in range(num_steps):
        observations.append(step.observation)
        action_mean = stateful_actor_net(tf.expand_dims(step.observation[:, -1, :], 1)) #collect_action(step.observation, stddev=stddev)
        action = action_mean + tf.random.normal(shape=action_mean.shape, stddev=stddev)
        actions.append(action)
        action_means.append(action_mean)
        step = train_env.step(action)
        rewards.append(step.reward)
        step_types.append(step.step_type)
        if step.step_type == 2:
            stateful_actor_net.reset_states()
            step = train_env.reset()
    # put data into tensors
    step_types = tf.stack(step_types)
    actions = tf.squeeze(tf.stack(actions))
    action_means = tf.squeeze(tf.stack(action_means))
    rewards = tf.stack(rewards)
    # reshape observations to be same sequence length, and create
    # a mask for original sequence length
    obs, mask = get_obs_and_mask(observations, max_sequence_length)
    returns = tf.stack(calculate_returns(rewards, step_types))
    advantages = returns - critic_net(obs, mask=mask)
    # TODO check below... I think this should be a 500x1 tensor
    old_action_log_probs = tf.reduce_sum(-(actions - action_means)**2 / stddev**2, axis=1, keepdims=True)
    return (obs, mask, actions, action_means,
            rewards, step_types, returns,
            advantages, old_action_log_probs)
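
For reference, the log-density of a diagonal Gaussian policy with fixed standard deviation can be sketched in NumPy as below. The function name `gaussian_log_prob` and the numbers are made up. Note that this sketch keeps the factor of 1/2 in the Gaussian exponent and the normalization term, while the expression in `collect_data` omits both. The normalization cancels in the importance ratio when `stddev` is fixed, but the factor of 1/2 does not.

```python
import numpy as np


def gaussian_log_prob(action, mean, stddev):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    z = (action - mean) / stddev
    log_norm = np.log(stddev) + 0.5 * np.log(2 * np.pi)
    return np.sum(-0.5 * z ** 2 - log_norm, axis=-1, keepdims=True)


stddev = 1e-3
action = np.array([[0.1000, -0.2000]])
old_mean = np.array([[0.1005, -0.2000]])  # action is 0.5 stddev from this mean
new_mean = np.array([[0.1000, -0.2000]])  # new policy mean sits on the action

log_ratio = (gaussian_log_prob(action, new_mean, stddev)
             - gaussian_log_prob(action, old_mean, stddev))
ratio = np.exp(log_ratio)
print(log_ratio.shape)  # (1, 1)
```

Here the log-ratio is $0.5 \cdot 0.5^2 = 0.125$, so the new policy is about 13% more likely to choose this action than the old one.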

In [None]:
# %lprun -f f f()
(obs, mask, actions,
 action_means, rewards, step_types, returns,
 advantages, old_action_log_probs) = collect_data()

## Training

In [None]:
optimizer = tf.optimizers.Adam()
mse = tf.losses.mse

In [None]:
if not actor_net.built:
    actor_net.build(input_shape=(None, None, 2))

Define a list of trainable variables that should be updated when minimizing the loss function.

In [None]:
critic_vars = critic_net.trainable_variables
actor_vars = actor_net.trainable_variables
trainable_variables = set()
for var in critic_vars + actor_vars:
    trainable_variables.add(var.ref())
trainable_variables = list(trainable_variables)
trainable_variables = [var.deref() for var in trainable_variables]
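
The deduplication above is needed because the shared LSTM and dense layers appear in both networks' variable lists. Going through a `set` does not preserve the original order, which is harmless here but can make debugging harder. An order-preserving alternative keyed on object identity might look like the plain-Python sketch below, where `unique_by_identity` is a hypothetical helper and simple lists stand in for the variables.

```python
def unique_by_identity(items):
    """Return items with repeated objects removed, keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if id(item) not in seen:
            seen.add(id(item))
            unique.append(item)
    return unique


# Stand-ins for network variables (plain lists compared by identity)
shared = [0.0]
policy_head = [1.0]
value_head = [2.0]
actor_vars = [shared, policy_head]
critic_vars = [shared, value_head]

merged = unique_by_identity(actor_vars + critic_vars)
print(len(merged))  # 3
```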

In [None]:
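def grad(
        actor_net,
        critic_net,
        trainable_variables,
        obs,
        mask,
        actions,
        action_means,
        old_action_log_probs,
        returns,
        advantages,
        epsilon=.2,
        c1=1,
        stddev=1e-3):
    """
    Args:
        stddev: Standard deviation of the exploration noise. Should match
            the value used in collect_data.

    Returns: tuple containing
        grads: Gradient of the total loss wrt trainable variables.
        loss_clip: Clipped surrogate objective.
        loss_value: MSE loss of the value estimates.
    """
    # Make advantages a column so it multiplies elementwise with the
    # (batch_size, 1) importance ratio instead of broadcasting.
    advantages = tf.reshape(advantages, (-1, 1))
    batch_size = obs.shape[0]
    with tf.GradientTape() as tape:
        # Pass mask by keyword: the second positional argument of a Keras
        # model call is `training`, not `mask`.
        action_log_probs = tf.reduce_sum(
            -(actions - actor_net(obs, mask=mask))**2 / stddev**2,
            axis=1,
            keepdims=True)
        importance_ratio = tf.exp(action_log_probs - old_action_log_probs)
        loss_clip = tf.reduce_sum(tf.minimum(
            importance_ratio * advantages,
            tf.clip_by_value(
                importance_ratio,
                1 - epsilon,
                1 + epsilon) * advantages
        )) / batch_size
        loss_value = mse(tf.squeeze(returns), tf.squeeze(critic_net(obs, mask=mask)))
        l = -loss_clip + c1 * loss_value
    # Compute gradients outside the tape context so they are not recorded
    return tape.gradient(l, trainable_variables), loss_clip, loss_value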

In [None]:
def train_minibatch(obs, mask, actions, action_means, old_action_log_probs, returns, advantages,
                    actor_net, critic_net, trainable_variables,
                    num_epochs=10, minibatch_size=100):
    for i in range(num_epochs):
        minibatch = np.random.choice(obs.shape[0], size=minibatch_size)
        grads, loss_clip, loss_value = grad(
            actor_net, critic_net, trainable_variables,
            tf.gather(obs, indices=minibatch),
            tf.gather(mask, indices=minibatch),
            tf.gather(actions, indices=minibatch),
            tf.gather(action_means, indices=minibatch),
            tf.gather(old_action_log_probs, indices=minibatch),
            tf.gather(returns, indices=minibatch),
            tf.gather(advantages, indices=minibatch))
        optimizer.apply_gradients(zip(grads, trainable_variables))
        print(f'{i}:\tloss_clip: {loss_clip}\tloss_value: {loss_value}')

In [None]:
train_minibatch(obs, mask, actions, action_means, old_action_log_probs, returns, advantages,
                actor_net, critic_net, trainable_variables,
                num_epochs=10, minibatch_size=100)

In [None]:
stateful_actor_net.layers[0].set_weights(actor_net.layers[0].get_weights())
stateful_actor_net.reset_states()

## Write a training loop

In [None]:
def train_step():
    """Collect experience, train networks, and sync the stateful actor."""
    (obs, mask, actions,
     action_means, rewards, step_types, returns,
     advantages, old_action_log_probs) = collect_data()
    print('collected data')
    # train
    train_minibatch(obs, mask, actions, action_means,
                    old_action_log_probs, returns, advantages,
                    actor_net, critic_net, trainable_variables,
                    num_epochs=10, minibatch_size=64)
    print('trained networks')
    stateful_actor_net.layers[0].set_weights(actor_net.layers[0].get_weights())
    stateful_actor_net.reset_states()
    print('reset state')

In [None]:
train_step()

In [None]:
(obs, mask, actions,
 action_means, rewards, step_types, returns,
 advantages, old_action_log_probs) = collect_data()

In [None]:
tf.reduce_sum(rewards)

In [None]:
stateful_actor_net.reset_states()