# Pulse Sequence Design Using Reinforcement Learning

{{ explanation for my approach, what I'm trying to do }}

The current plan is to implement deep deterministic policy gradient (DDPG) as the RL algorithm.

The theoretical background on DDPG comes from the [OpenAI SpinningUp resource](https://spinningup.openai.com/en/latest/algorithms/ddpg.html#pseudocode), and the implementation below leans heavily on the TensorFlow documentation.

For the policy function, I need to perform gradient ascent with the following gradient
$$
\nabla_\theta \frac{1}{|B|} \sum_{s \in B} Q_\phi (s, \pi_\theta(s))
$$

And for the Q-function, perform gradient descent with
$$
\nabla_\phi \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \left(Q_\phi(s,a) - y(r,s',d)\right)^2
$$
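
The target $y(r,s',d)$ is not written out above. In the SpinningUp pseudocode it is $y = r + \gamma (1-d)\, Q_{\phi_\text{target}}(s', \pi_{\theta_\text{target}}(s'))$. As a minimal sketch of how the two batch objectives are evaluated, assuming toy linear stand-ins for the networks (the names `pi`, `q`, `bellman_target`, `critic_loss`, and `actor_objective` are hypothetical and not part of `rlPulse`):

```python
import numpy as np

def pi(theta, s):
    """Toy linear policy standing in for the actor network (assumption)."""
    return s @ theta

def q(phi, s, a):
    """Toy linear Q-function standing in for the critic network (assumption)."""
    return np.concatenate([s, a], axis=1) @ phi

def bellman_target(r, s1, d, theta_targ, phi_targ, gamma):
    """y(r, s', d) = r + gamma * (1 - d) * Q_targ(s', pi_targ(s'))."""
    return r + gamma * (1.0 - d) * q(phi_targ, s1, pi(theta_targ, s1))

def critic_loss(phi, s, a, y):
    """Mean squared Bellman error over the batch (minimized w.r.t. phi)."""
    return np.mean((q(phi, s, a) - y) ** 2)

def actor_objective(theta, phi, s):
    """Mean Q-value of the policy's own actions (maximized w.r.t. theta)."""
    return np.mean(q(phi, s, pi(theta, s)))

rng = np.random.default_rng(0)
B, s_dim, a_dim = 5, 3, 3
s, s1 = rng.normal(size=(B, s_dim)), rng.normal(size=(B, s_dim))
a = rng.normal(size=(B, a_dim))
r = rng.normal(size=B)
d = np.array([0, 0, 1, 0, 1], dtype=float)   # 1 marks a terminal transition
theta = rng.normal(size=(s_dim, a_dim))
phi = rng.normal(size=s_dim + a_dim)

y = bellman_target(r, s1, d, theta, phi, gamma=0.5)
print(critic_loss(phi, s, a, y), actor_objective(theta, phi, s))
```

For terminal transitions ($d = 1$) the target reduces to the immediate reward $r$, so no value is bootstrapped past the end of an episode.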

Other resources:

- https://keras.io/getting-started/sequential-model-guide/
- https://www.tensorflow.org/guide/keras/overview
- https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough#define_the_loss_and_gradient_function
- https://github.com/floodsung/DDPG/blob/master/actor_network.py
- https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG.py

Also helpful: https://www.tensorflow.org/guide/migrate#customize_the_training_step

In [None]:
import spinSimulation as ss
import rlPulse as rlp
import numpy as np
import scipy.linalg as spla
import importlib

In [None]:
importlib.reload(ss)
importlib.reload(rlp)

Define system parameters.

In [None]:
N = 4
dim = 2**N

pulse = .25e-6    # duration of pulse
delay = 3e-6      # duration of delay
f1 = 1/(4*pulse)  # for pi/2 pulses
coupling = 5e3    # coupling strength
delta = 500       # chemical shift strength (for identical spins)

(x,y,z) = (ss.x, ss.y, ss.z)
(X,Y,Z) = ss.getTotalSpin(N, dim)

Hdip, Hint = ss.getAllH(N, dim, coupling, delta)
HWHH0 = ss.getHWHH0(X,Y,Z,delta)
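
As a quick check of the `f1 = 1/(4*pulse)` choice: assuming `f1` is a nutation frequency in Hz, so that a pulse of duration `pulse` rotates the spins by $2\pi f_1 \cdot \text{pulse}$ (this reading of the units is an assumption and is not confirmed by `spinSimulation`), the rotation angle works out to $\pi/2$:

```python
import numpy as np

pulse = .25e-6           # pulse duration in seconds (value from the cell above)
f1 = 1 / (4 * pulse)     # nutation frequency in Hz (assumed interpretation)

angle = 2 * np.pi * f1 * pulse   # rotation angle accumulated during one pulse
print(f1, angle, np.pi / 2)
```

With a 0.25 µs pulse this corresponds to a nutation frequency of 1 MHz.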

Initialize the RL algorithm parameters.

- `numExp`: Specifies how many experiences to "play" through.
- `numUpdates`: How many updates to perform using a random subset of experiences from the replay buffer.
- `bufferSize`: Size of the replay buffer (i.e. how many experiences to keep in memory).
- `batchSize`: Size of batch (subset of replay buffer) to use as training for actor and critic.
- `p`: Action noise level (sets the spread of the random perturbation added to each action).
- `gamma`: Discount factor $\gamma$ for future rewards (passed to the critic).
- `polyak`: Polyak averaging parameter. The target network parameters $\theta_\text{target}$ are updated by
$$
\theta_\text{target} = \rho \theta_\text{target} + (1-\rho)\theta
$$
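
The Polyak update above can be sketched as follows, assuming the network parameters are available as a list of NumPy arrays (the helper `polyak_update` is hypothetical; the actual `updateParams` method in `rlPulse` may be implemented differently):

```python
import numpy as np

def polyak_update(target_params, params, rho):
    """Return rho * target + (1 - rho) * params for each parameter array."""
    return [rho * t + (1.0 - rho) * p for t, p in zip(target_params, params)]

target = [np.zeros((2, 2)), np.zeros(2)]
online = [np.ones((2, 2)), np.ones(2)]

target = polyak_update(target, online, rho=0.75)
print(target[0])   # every entry has moved a quarter of the way toward the online value
```

A value of $\rho$ close to 1 makes the target networks change slowly. For comparison, the SpinningUp DDPG implementation defaults to `polyak=0.995`, so $\rho = 0.75$ tracks the online networks much more aggressively.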

In [None]:
numExp = 1000
numUpdates = 10
bufferSize = 500 # TODO figure out if this buffer size makes sense
batchSize = 50
p = 0.5 # action noise parameter
polyak = 0.75
gamma = 0.5

printEvery = 25
randomizeDipolarEvery = 25

Initialize the actor and critic, as well as target actor and target critic. The actor learns the policy function
$$
\pi_\theta: S \to A, s \mapsto a
$$
that picks the optimal action $a$ for a given state $s$, with some set of parameters $\theta$. The critic learns the Q-function
$$
Q_\phi: S \times A \to \mathbb{R}, (s,a) \mapsto q
$$
where $q$ is the expected total reward from taking action $a$ in state $s$, and $\phi$ is the parameter set of the Q-function model. The target actor and target critic have their own parameter sets $\theta_\text{target}$ and $\phi_\text{target}$, which are initialized as copies of $\theta$ and $\phi$.

In [None]:
sDim = 3 # state represented by sequences of actions...
aDim = 3 # action = [phi, rot, time]

actor = rlp.Actor(sDim,aDim,None)
actorTarget = rlp.Actor(sDim,aDim,None)
critic = rlp.Critic(sDim,aDim,None, gamma)
criticTarget = rlp.Critic(sDim,aDim,None, gamma)
env = rlp.Environment(N, dim, sDim, HWHH0, X, Y)

actorTarget.setParams(actor.getParams())
criticTarget.setParams(critic.getParams())

replayBuffer = rlp.ReplayBuffer(bufferSize)

def actionNoise(p):
    '''Return zero-mean Gaussian noise to add to an action.

    The result is a length-3 array, one entry per action component.

    Arguments:
        p: Noise level; each entry has standard deviation p/2
    '''
#     return np.array([1.0/4*np.random.choice([0,1,-1],p=[1-p,p/2,p/2]), \
#                      1.0/4*np.random.choice([0,1,-1],p=[1-p,p/2,p/2]), \
#                      np.random.normal(0,.25)])
    return np.random.normal(0, p/2, size=3)
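
The internals of `rlp.ReplayBuffer` are not shown in this notebook. As a rough sketch of the idea, assuming a fixed-capacity first-in-first-out store with uniform random sampling (the class `SimpleReplayBuffer` and its method names are hypothetical stand-ins, not the `rlPulse` implementation):

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Hypothetical stand-in for rlp.ReplayBuffer: FIFO storage, uniform sampling."""

    def __init__(self, buffer_size):
        # a deque with maxlen silently drops the oldest experience when full
        self.buffer = deque(maxlen=buffer_size)

    def add(self, s, a, r, s1, d):
        # store one transition (state, action, reward, next state, done flag)
        self.buffer.append((s, a, r, s1, d))

    def get_sample_batch(self, batch_size):
        # sample without replacement, capped at the number of stored experiences
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

buf = SimpleReplayBuffer(buffer_size=3)
for i in range(5):
    buf.add(i, 0.0, float(i), i + 1, False)

print([exp[0] for exp in buf.buffer])   # only the three most recent states remain
print(len(buf.get_sample_batch(10)))    # capped at the buffer contents
```

Because old experiences are discarded once the buffer is full, the best experience seen over a whole run is not guaranteed to still be in memory at the end.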

Below is the implementation of the DDPG algorithm (see [this OpenAI resource for reference](https://spinningup.openai.com/en/latest/algorithms/ddpg.html#pseudocode)). 

In [None]:
rMat = np.zeros((numExp,))
aMat = np.zeros((numExp,aDim))
timeMat = np.zeros((numExp, 2)) # record length of episode so far and number of pulses
# keep track of when resets/updates happen
resetStateEps = []
updateEps = []

for i in range(numExp):
    if i % printEvery == 0:
        print("on episode {}".format(i))
    if i > 0 and i % randomizeDipolarEvery == 0:
        # randomize dipolar coupling strengths for Hint
        Hdip, Hint = ss.getAllH(N, dim, coupling, delta)
    s = env.getState()
    a = rlp.clipAction(actor.predict(env.state) + actionNoise(p))
    aMat[i,:] = a
    env.evolve(a, Hint)
    r = env.reward()
    rMat[i] = r
    timeMat[i,:] = [env.t, np.sum(env.state[:,2] != 0)]  # fill both columns: elapsed time, pulse count
    if r > 1:
        print("high reward in episode {}".format(i))
    s1 = env.getState()
    d = env.isDone()
    replayBuffer.add(s,a,r,s1,d)
    if d:
        print("terminal state (episode {})".format(i))
        env.reset()
        resetStateEps.append(i)
    if (i > 0) and (i % 50 == 0):
        print("updating actor/critic networks (episode {})".format(i))
        updateEps.append(i)
        for update in range(numUpdates):
            batch = replayBuffer.getSampleBatch(batchSize)
            # train critic
            critic.trainStep(batch, actorTarget, criticTarget)
            # train actor
            actor.trainStep(batch, critic)
            # update target networks
            criticTarget.updateParams(critic, polyak)
            actorTarget.updateParams(actor, polyak)
            

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(rMat, color='black', label='rewards')
ymin, ymax = plt.ylim()
plt.vlines(updateEps, ymin, ymax, color='red', alpha=0.2, label='updates')
#plt.vlines(resetStateEps, ymin, ymax, color='blue', alpha=0.2, linestyles='dashed', label='state reset')
plt.title('Rewards for each episode')
plt.xlabel('Episode number')
plt.ylabel('Reward')
plt.legend()

In [None]:
%matplotlib inline

plt.hist(rMat, bins=20, color='black', label='rewards')
# ymin, ymax = plt.ylim()
# plt.vlines(updateEps, ymin, ymax, color='red', alpha=0.2, label='updates')
#plt.vlines(resetStateEps, ymin, ymax, color='blue', alpha=0.2, linestyles='dashed', label='state reset')
plt.title('Rewards histogram')
plt.legend()

In [None]:
# print the state with highest reward in the buffer (might not be highest of all episodes because buffer forgets)
rBuffer = [exp[2] for exp in replayBuffer.buffer]  # each experience is (s, a, r, s1, d)
rlp.printAction(replayBuffer.buffer[np.argmax(rBuffer)][3])
print("max reward in buffer: ", np.max(rBuffer))

In [None]:
# for debugging purposes...

#rlp.printAction(replayBuffer.buffer[-2][3])

Uexp = spla.expm(-1j*(Hint*.11e-6 + Y*2*np.pi))
Utarget = ss.getPropagator(HWHH0, .11e-6)
ss.fidelity(Utarget, Uexp)
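
The definition used by `ss.fidelity` is not shown here. One common choice for comparing two propagators is the normalized overlap $|\mathrm{Tr}(U_\text{target}^\dagger U_\text{exp})| / \dim$, sketched below purely as an assumption about what such a function might compute (the helpers `fidelity` and `rot_x` are hypothetical):

```python
import numpy as np

def fidelity(u_target, u_exp):
    """Normalized overlap |Tr(U_target^dagger U_exp)| / dim (assumed definition)."""
    dim = u_target.shape[0]
    return np.abs(np.trace(u_target.conj().T @ u_exp)) / dim

def rot_x(angle):
    """Single-spin rotation exp(-i * angle * sigma_x / 2)."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

print(fidelity(rot_x(np.pi / 2), rot_x(np.pi / 2)))         # identical propagators
print(fidelity(rot_x(np.pi / 2), -1j * rot_x(np.pi / 2)))   # differs only by a global phase
print(fidelity(rot_x(0.0), rot_x(np.pi)))                   # identity vs. a pi rotation
```

Under this definition the measure is insensitive to a global phase, which is usually what is wanted when comparing an experimental propagator against a target.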