# Pulse Sequence Design Using Reinforcement Learning

This notebook implements deep deterministic policy gradient (DDPG) to learn pulse sequence design for spin systems. The [OpenAI SpinningUp resource](https://spinningup.openai.com/en/latest/algorithms/ddpg.html#pseudocode) gives a good theoretical background on DDPG, which I used to implement the algorithm below.

DDPG is designed for _continuous_ action spaces, which matches the ultimate goal of this project: applying pulses with arbitrary axes of rotation, rotation angles, and durations, instead of limiting to $\pi/2$ pulses along X or Y. The trade-off is that the algorithm is less suited to constrained versions of the problem, such as only applying $\pi/2$ pulses of a fixed length about X or Y.

For training, the following reward function was used (with a base-10 logarithm):
$$
r = -\log_{10}\left( 1- \left| \text{Tr}\left( \frac{U_\text{target}^\dagger U_\text{exp}}{2^N} \right) \right| \right)
= -\log_{10}\left( 1- \text{fidelity}(U_\text{target}, U_\text{exp}) \right)
$$
For example, if the fidelity is $0.999$, then the reward is $r = -\log_{10}(0.001) = 3$, so each additional "nine" of fidelity adds one unit of reward.
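
A minimal NumPy sketch of this reward, assuming a base-10 logarithm (which is what the worked example implies). The function names and the `eps` floor are hypothetical and are not taken from `rl_pulse`, whose implementation may differ.

```python
import numpy as np

def fidelity(U_target, U_exp):
    """|Tr(U_target^dagger U_exp)| / dim, where dim = 2^N."""
    dim = U_target.shape[0]
    return np.abs(np.trace(U_target.conj().T @ U_exp)) / dim

def fidelity_reward(U_target, U_exp, eps=1e-12):
    """-log10(1 - fidelity).

    eps is an assumption: it floors the infidelity so that a perfect
    (or numerically slightly > 1) fidelity does not produce log(0).
    """
    infidelity = max(1.0 - fidelity(U_target, U_exp), eps)
    return -np.log10(infidelity)
```

With this floor the reward saturates at 12 for a perfect match instead of diverging.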

<!-- For the policy function, I need to perform gradient ascent with the following gradient
$$
\nabla_\theta 1/|B| \sum_{s \in B} Q_\phi (s, \pi_\theta(s))
$$

And for the Q-function, perform gradient descent with
$$
\nabla_\phi 1/|B| \sum_{(s,a,r,s',d) \in B} (Q_\phi(s,a) - y(r,s',d))^2
$$ -->

Other resources:

- https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough#define_the_loss_and_gradient_function
- https://www.tensorflow.org/guide/migrate#customize_the_training_step

In [None]:
import spin_simulation as ss
import rl_pulse as rlp
import numpy as np
import scipy.linalg as spla
import importlib
from tqdm import tqdm
import matplotlib.pyplot as plt
from datetime import datetime

In [None]:
importlib.reload(ss)
importlib.reload(rlp)

# Initialize spin system

This sets the parameters of the system ($N$ spin-1/2 particles, which corresponds to a Hilbert space with dimension $2^N$). For the purposes of simulation, $\hbar \equiv 1$.

The total internal Hamiltonian is given by
$$
H_\text{int} = C H_\text{dip} + \Delta \sum_{i=1}^N I_z^{(i)}
$$
where $C$ is the coupling strength, $\Delta$ is the chemical shift strength (each spin is assumed to be identical), and $H_\text{dip}$ is given by
$$
H_\text{dip} = \sum_{i,j}^N d_{i,j} \left(3I_z^{(i)}I_z^{(j)} - \mathbf{I}^{(i)} \cdot \mathbf{I}^{(j)}\right)
$$
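
As a rough illustration of how these Hamiltonians could be assembled numerically, here is a sketch using Kronecker products. It assumes the sum runs over distinct pairs $i<j$ and that the couplings $d_{i,j}$ are supplied as a matrix. The helper names are hypothetical, and `ss.getAllH` may construct things differently.

```python
import numpy as np

# single spin-1/2 operators (hbar = 1)
sx = np.array([[0, 1], [1, 0]], dtype=complex) / 2
sy = np.array([[0, -1j], [1j, 0]], dtype=complex) / 2
sz = np.array([[1, 0], [0, -1]], dtype=complex) / 2

def site_op(op, i, N):
    """Embed a single-spin operator at site i of an N-spin system."""
    out = np.array([[1]], dtype=complex)
    for k in range(N):
        out = np.kron(out, op if k == i else np.eye(2))
    return out

def dipolar_hamiltonian(N, d):
    """H_dip = sum_{i<j} d[i,j] (3 Iz_i Iz_j - I_i . I_j)  (pair sum is an assumption)."""
    dim = 2**N
    H = np.zeros((dim, dim), dtype=complex)
    for i in range(N):
        for j in range(i + 1, N):
            zz = site_op(sz, i, N) @ site_op(sz, j, N)
            dot = sum(site_op(s, i, N) @ site_op(s, j, N) for s in (sx, sy, sz))
            H += d[i, j] * (3 * zz - dot)
    return H

def internal_hamiltonian(N, coupling, delta, d):
    """H_int = C * H_dip + Delta * sum_i Iz_i."""
    Z = sum(site_op(sz, i, N) for i in range(N))
    return coupling * dipolar_hamiltonian(N, d) + delta * Z
```

The resulting matrix is Hermitian and commutes with the total $I_z$, which is a convenient sanity check for any implementation.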

The WAHUHA pulse sequence is designed to remove the dipolar interaction term from the internal Hamiltonian. The pulse sequence is $\tau, P_{-x}, \tau, P_{y}, \tau, \tau, P_{-y}, \tau, P_{x}, \tau$.
The zeroth-order average Hamiltonian for the WAHUHA pulse sequence is
$$
H_\text{WHH} = \frac{\Delta}{3} \sum_{i=1}^N \left( I_x^{(i)} + I_y^{(i)} + I_z^{(i)} \right)
$$
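
As a quick check of why the dipolar term averages away (a sketch of the standard average-Hamiltonian argument, not something computed in this notebook): over the $6\tau$ cycle, the toggling-frame $I_z$ points along each of the three axes for a total of $2\tau$, up to signs that cancel in the bilinear dipolar term. Averaging the dipolar term over the three orientations gives
$$
\bar{H}_\text{dip}^{(0)} = \frac{1}{3} \sum_{i,j}^N d_{i,j} \sum_{\mu \in \{x,y,z\}} \left(3I_\mu^{(i)}I_\mu^{(j)} - \mathbf{I}^{(i)} \cdot \mathbf{I}^{(j)}\right)
= \frac{1}{3} \sum_{i,j}^N d_{i,j} \left(3\,\mathbf{I}^{(i)} \cdot \mathbf{I}^{(j)} - 3\,\mathbf{I}^{(i)} \cdot \mathbf{I}^{(j)}\right) = 0
$$
since $\sum_\mu I_\mu^{(i)}I_\mu^{(j)} = \mathbf{I}^{(i)} \cdot \mathbf{I}^{(j)}$. The same averaging applied to the chemical shift term $\Delta \sum_i I_z^{(i)}$ leaves the linear combination of $I_x$, $I_y$, and $I_z$ with prefactor $\Delta/3$.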

In [None]:
N = 4
dim = 2**N
coupling = 2*np.pi * 5e3    # coupling strength
delta = 2*np.pi * 500       # chemical shift strength (for identical spins)

(x,y,z) = (ss.x, ss.y, ss.z)
(X,Y,Z) = ss.get_total_spin(N, dim)

Hdip, Hint = ss.getAllH(N, dim, coupling, delta)
HWHH0 = ss.get_H_WHH_0(N, dim, delta)

# Initialize RL algorithm

An "action" performed on the system corresponds to an RF-pulse applied to the system. A pulse can be parametrized by the axis of rotation (e.g. $(\theta, \phi)$, but for now $\theta = \pi/2$ is assumed so the axis of rotation lies in the xy-plane), the rotation angle, and the duration of the pulse.
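
To make the parametrization concrete, here is a sketch of how an action `[phi, rot, time]` might map to a propagator. It assumes a constant RF amplitude `rot / time` about the axis $\cos\phi\, X + \sin\phi\, Y$, with the internal Hamiltonian left on during the pulse. The function name is hypothetical, and how `rl_pulse` scales or interprets the action components is not shown here.

```python
import numpy as np
import scipy.linalg as spla

def pulse_propagator(Hint, X, Y, phi, rot, time):
    """Propagator for a pulse of angle `rot` about the xy-plane axis
    at azimuth `phi`, lasting `time` (all conventions are assumptions).

    exp(-i (Hint + (rot/time) * axis) * time) is written as
    exp(-i (Hint * time + rot * axis)) so that time = 0 is well defined.
    """
    axis = np.cos(phi) * X + np.sin(phi) * Y
    return spla.expm(-1j * (Hint * time + rot * axis))
```

Setting `time = 0` gives an ideal (delta) pulse, while a finite `time` lets `Hint` act during the pulse.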

The state of the system could be represented by the propagator, but the propagator's size grows exponentially with system size (it has $4^N$ elements for an $N$-spin system). Because the pulse sequence applied so far determines the propagator, the state is represented by the pulse sequence instead.

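The target network parameters $\theta_\text{target}$ are updated toward the current network parameters $\theta$ by Polyak averaging with parameter $\rho$ (`polyak` below):
$$
\theta_\text{target} \leftarrow (1-\rho) \theta_\text{target} + \rho\theta
$$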
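
The update rule can be sketched on plain lists of NumPy arrays as follows. The function name `polyak_update` is hypothetical, and `copyParams` in `rl_pulse` presumably does something similar on the network weights.

```python
import numpy as np

def polyak_update(target_params, params, rho):
    """Return (1 - rho) * target + rho * source for each parameter array."""
    return [(1 - rho) * t + rho * p for t, p in zip(target_params, params)]
```

With $\rho = 0.01$, the target moves 1% of the way toward the current network at each update, so the gap to a fixed $\theta$ shrinks by a factor of $0.99^n$ after $n$ updates.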

TODO figure out if this buffer size makes sense

In [None]:
sDim = 3 # state represented by sequences of actions...
aDim = 3 # action = [phi, rot, time]

numGen = 5 # how many generations to run
bufferSize = int(1e5) # size of the replay buffer
batchSize = 1024 # size of batch for training, multiple of 32
popSize = 10 # size of population
polyak = .01 # polyak averaging parameter
gamma = .99 # future reward discount rate

syncEvery = 1 # how often to copy RL actor into population

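p = .05 # noise process parameter (passed to rlp.NoiseProcess)

actorLR = .01 # actor learning rate
criticLR = .01 # critic learning rate
lstmLayers = 1 # number of LSTM layers
fcLayers = 3 # number of fully connected layers
lstmUnits = 32 # units per LSTM layer
fcUnits = 256 # units per fully connected layer

# parameters for the evolutionary step (passed to pop.iterate)
eliteFrac = .2
tourneyFrac = .3
mutateProb = .25
mutateFrac = .1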

Initialize the actor and critic, as well as target actor and target critic. The actor learns the policy function
$$
\pi_\theta: S \to A, s \mapsto a
$$
that picks the optimal action $a$ for a given state $s$, with some set of parameters $\theta$ (in this case weights/biases in the neural network). The critic learns the Q-function
$$
Q_\phi: S \times A \to \mathbb{R}, (s,a) \mapsto q
$$
where $q$ is the expected total discounted reward from taking action $a$ in state $s$, and $\phi$ is the parameter set for the Q-function model. The target actor and target critic have their own parameter sets $\theta_\text{target}$ and $\phi_\text{target}$.
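
The critic is trained toward a bootstrapped target computed with the target networks. In the SpinningUp DDPG pseudocode this is $y(r,s',d) = r + \gamma (1-d)\, Q_{\phi_\text{target}}(s', \pi_{\theta_\text{target}}(s'))$, where $d$ flags a terminal state. A minimal sketch, with the networks stood in by plain Python callables (the function names are hypothetical and are not the `rl_pulse` API):

```python
import numpy as np

def td_target(r, s_next, done, actor_target, critic_target, gamma=0.99):
    """y = r + gamma * (1 - d) * Q_target(s', pi_target(s'))."""
    a_next = actor_target(s_next)
    return r + gamma * (1.0 - done) * critic_target(s_next, a_next)

def critic_loss(q_pred, y):
    """Mean squared Bellman error over a batch."""
    return np.mean((q_pred - y) ** 2)
```

The actor is then updated by gradient ascent on the batch mean of $Q_\phi(s, \pi_\theta(s))$, which needs an autodiff framework and is not sketched here.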

The "environment" keeps track of the system state, and calculates rewards after each episode.

The replay buffer stores the most recent experience (up to `bufferSize` entries), from which training batches are sampled.
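
For reference, a fixed-size buffer of this kind can be sketched with `collections.deque`. The class name and the stored tuple layout are assumptions for illustration, and `rlp.ReplayBuffer` may be organized differently.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Keeps only the `capacity` most recent transitions."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest entries drop off automatically

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    @property
    def size(self):
        return len(self.data)

    def sample(self, batch_size):
        """Uniformly sample up to batch_size transitions without replacement."""
        k = min(batch_size, len(self.data))
        return random.sample(list(self.data), k)
```

Sampling uniformly from past experience decorrelates the transitions in each training batch.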

In [None]:
env = rlp.Environment(N, dim, coupling, delta, sDim, HWHH0, X, Y)
noiseProcess = rlp.NoiseProcess(p)

actor = rlp.Actor(sDim, aDim, actorLR)
critic = rlp.Critic(sDim, aDim, gamma, criticLR)
actor.createNetwork(lstmLayers, fcLayers, lstmUnits, fcUnits)
critic.createNetwork(lstmLayers, fcLayers, lstmUnits, fcUnits)

actorTarget = actor.copy()
criticTarget = critic.copy()

pop = rlp.Population(popSize)
pop.startPopulation(sDim, aDim, actorLR,
    lstmLayers, fcLayers, lstmUnits, fcUnits)

replayBuffer = rlp.ReplayBuffer(bufferSize)

## ERL algorithm

In [None]:
paramDiff = []
popFitnesses = [] # generation, array of fitnesses
testMat = [] # generation, fitness from test

samples = 250 # approximate number of generations at which to record results

for i in range(numGen):
    # evaluate and iterate the population
    pop.evaluate(env, replayBuffer, None, numEval=2)
    if i % int(np.ceil(numGen / samples)) == 0:
        popFitnesses.append((i, np.copy(pop.fitnesses)))
    pop.iterate(eliteFrac=eliteFrac, tourneyFrac=tourneyFrac,
                mutateProb=mutateProb, mutateFrac=mutateFrac)
    print("iterated population")
    
    # evaluate the actor
    f = actor.evaluate(env, replayBuffer, noiseProcess)
    print(f"evaluated the actor,\tfitness is {f:.02f}")
    
    # update networks
    batch = replayBuffer.getSampleBatch(batchSize)
    # train critic
    critic.trainStep(batch, actorTarget, criticTarget)
    # train actor
    actor.trainStep(batch, critic)
    # update target networks
    criticTarget.copyParams(critic, polyak)
    actorTarget.copyParams(actor, polyak)
    
    print("trained actor/critic")
    
    if i % int(np.ceil(numGen / samples)) == 0:
        print("="*20 + f"\nRecording test results (generation {i})")
        s, rMat = actor.test(env)
        f = np.max(rMat)
        # record results from the test
        print(f'Fitness from test: {f:0.02f}')
        testMat.append((i, f))
        print(f"Test result from generation {i}")
        print("Chosen pulse sequence:")
        print(rlp.formatAction(s) + "\n")
        print("Rewards from the pulse sequence:\n")
        for testR in rMat:
            print(f"{testR:.02f}, ", end='')
        print(f'\nFitness: {f:.02f}')
        print("\n"*3)
    
    print(f'buffer size is {replayBuffer.size}\n')
    
    if i % syncEvery == 0:
        # sync actor with population
        pop.sync(actor)
    
    if i % int(np.ceil(numGen/samples)) == 0:
        # calculate difference between parameters for actors/critics
        paramDiff.append((i, actor.paramDiff(actorTarget),
                          critic.paramDiff(criticTarget)))

# Results

In [None]:
%matplotlib inline

diffEps = [_[0] for _ in paramDiff]
actorDiffs = np.array([_[1] for _ in paramDiff])
criticDiffs = np.array([_[2] for _ in paramDiff])

for d in range(np.shape(actorDiffs)[1]):
    plt.plot(diffEps, actorDiffs[:,d], label=f"parameter {d}")
plt.title("Actor parameter MSE vs target networks")
plt.xlabel('Generation number')
plt.ylabel('MSE')
plt.yscale('log')
#plt.legend()
# plt.gcf().set_size_inches(12,8)

In [None]:
%matplotlib inline

for d in range(np.shape(criticDiffs)[1]):
    plt.plot(diffEps, criticDiffs[:,d], label=f"parameter {d}")
plt.title("Critic parameter MSE vs target networks")
plt.xlabel('Generation number')
plt.ylabel('MSE')
plt.yscale('log')

In [None]:
%matplotlib inline

popFitGens = [_[0] for _ in popFitnesses]
popFits = [_[1] for _ in popFitnesses]

for i in range(len(popFitGens)):
    g = popFitGens[i]
    plt.plot([g] * len(popFits[i]), popFits[i], '.k')
plt.title("Population fitnesses by generation")
plt.xlabel('Generation number')
plt.ylabel('Fitness')

In [None]:
%matplotlib inline

testGens = [_[0] for _ in testMat]
testFits = [_[1] for _ in testMat]

plt.plot(testGens, testFits, '.k')
plt.title("Test fitnesses by generation")
plt.xlabel('Generation number')
plt.ylabel('Fitness')
# plt.yscale('log')

## Analysis of networks

In [None]:
s = np.zeros((32,3), dtype="float32")
s[:,2] = -1
# s1 = np.array([[0,0,1],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0]], dtype="float32")
for i in range(5):
    print(pop.pop[i].predict(s))
# print(actor.predict(s1))
# a = np.array([0,0,1], dtype="float32")
# a1 = np.array([.5,.5,.5], dtype="float32")
# print(critic.predict(s,a))
# print(critic.predict(s,a1))

In [None]:
w = actor.getParams()
print(len(w))

In [None]:
w[3:7]