# Pulse Sequence Design Using Reinforcement Learning

{{ explanation for my approach, what I'm trying to do }}

I am currently planning to implement deep deterministic policy gradient (DDPG) as the RL algorithm.

I am using the [OpenAI Spinning Up resource](https://spinningup.openai.com/en/latest/algorithms/ddpg.html#pseudocode) for the theoretical background on DDPG, and the TensorFlow documentation for how to write the algorithm below.

For the policy function, I need to perform gradient ascent with the following gradient:
$$
\nabla_\theta \frac{1}{|B|} \sum_{s \in B} Q_\phi (s, \pi_\theta(s))
$$
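
As a sanity check on the ascent step, here is a minimal NumPy sketch on a toy problem. Everything in it is hypothetical and not part of `rlPulse`: the policy is linear, $\pi_\theta(s) = \theta \cdot s$, and the critic is a fixed quadratic whose best action is known, so the gradient can be written by hand with the chain rule instead of TensorFlow's automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3-dimensional states, scalar actions.
w_best = np.array([0.5, -1.0, 2.0])  # the toy critic prefers the action w_best @ s
theta = np.zeros(3)                  # parameters of the linear policy pi_theta(s) = theta @ s


def q_value(states, actions):
    """Toy critic Q(s, a) = -(a - w_best @ s)^2, maximal when a = w_best @ s."""
    return -(actions - states @ w_best) ** 2


def policy_gradient(theta, states):
    """Gradient of (1/|B|) * sum_s Q(s, pi_theta(s)) with respect to theta."""
    actions = states @ theta
    dq_da = -2.0 * (actions - states @ w_best)  # dQ/da evaluated at a = pi_theta(s)
    # Chain rule: dQ/dtheta = dQ/da * dpi/dtheta, and dpi/dtheta = s for a linear policy.
    return (dq_da[:, None] * states).mean(axis=0)


learning_rate = 0.1
for step in range(500):
    batch = rng.normal(size=(32, 3))  # stand-in for states sampled from the replay buffer
    theta = theta + learning_rate * policy_gradient(theta, batch)  # ascent, so "+"
```

In this toy case `theta` should approach `w_best`, because that is the policy that maximizes the critic for every state. In the real algorithm the critic is itself a learned network, so the gradient would come from automatic differentiation rather than a hand-written formula.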

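For the Q-function, I need to perform gradient descent with
$$
\nabla_\phi \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} (Q_\phi(s,a) - y(r,s',d))^2
$$
where $B$ is a batch of transitions sampled from the replay buffer and $y(r,s',d)$ is the target value computed with the target networks.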
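
In the Spinning Up pseudocode, the target is
$$
y(r,s',d) = r + \gamma (1 - d) \, Q_{\phi_\text{target}}(s', \pi_{\theta_\text{target}}(s'))
$$
where $\gamma$ is the discount factor and $d = 1$ marks a terminal transition. The sketch below computes the targets and the mean-squared error for a small batch in NumPy. The function names and numbers are made up for illustration, and the arrays `q_next` and `q_pred` stand in for the outputs of the target critic and the critic.

```python
import numpy as np


def bellman_targets(rewards, dones, q_next, gamma=0.99):
    """y = r + gamma * (1 - d) * Q_target(s', pi_target(s'))."""
    return rewards + gamma * (1.0 - dones) * q_next


def critic_loss(q_pred, targets):
    """Mean-squared Bellman error over the batch."""
    return np.mean((q_pred - targets) ** 2)


rewards = np.array([1.0, 0.0, 2.0])
dones = np.array([0.0, 0.0, 1.0])    # 1 marks a terminal transition
q_next = np.array([10.0, 5.0, 7.0])  # stand-in for Q_target(s', pi_target(s'))
q_pred = np.array([9.0, 4.0, 2.0])   # stand-in for Q_phi(s, a)

y = bellman_targets(rewards, dones, q_next, gamma=0.9)
loss = critic_loss(q_pred, y)
```

The targets are treated as constants when differentiating the loss with respect to $\phi$, which is why they are computed from the target networks rather than from the networks being trained.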

Other resources:

- https://keras.io/getting-started/sequential-model-guide/
- https://www.tensorflow.org/guide/keras/overview
- https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough#define_the_loss_and_gradient_function
- https://github.com/floodsung/DDPG/blob/master/actor_network.py
- https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG.py

Also helpful: https://www.tensorflow.org/guide/migrate#customize_the_training_step

In [None]:
import spinSimulation as ss
import rlPulse as rlp
import numpy as np
import importlib

In [None]:
importlib.reload(ss)
importlib.reload(rlp)

Define system parameters.

In [None]:
N = 4
dim = 2**N

pulse = .25e-6    # duration of pulse
delay = 3e-6      # duration of delay
f1 = 1/(4*pulse)  # for pi/2 pulses
coupling = 5e3    # coupling strength
Delta = 500       # chemical shift strength (for identical spins)

a = ss.getRandDip(N) # random dipolar coupling strengths
(x,y,z) = (ss.x, ss.y, ss.z)
(X,Y,Z) = ss.getCollectiveObservables(N, dim)

Hdip = ss.getHdip(N, dim, x, y, z, a)
Hint = ss.getHint(Hdip, coupling, Z, Delta)
HWHH0 = ss.getHWHH0(X,Y,Z,Delta)
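
The choice `f1 = 1/(4*pulse)` can be checked with a few lines of NumPy. This is a sketch under two assumptions that are not confirmed by `spinSimulation`: `f1` is a nutation frequency in Hz, so the rotation angle is $2\pi f_1 t$, and the pulse acts on a single spin-1/2 as $U = e^{-i \theta S_x}$ with $S_x = \sigma_x / 2$.

```python
import numpy as np

pulse = 0.25e-6       # pulse duration, same value as above
f1 = 1 / (4 * pulse)  # nutation frequency (assumed to be in Hz)

# Rotation angle accumulated over one pulse.
angle = 2 * np.pi * f1 * pulse

# Pauli matrices for a single spin-1/2.
sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
sigma_y = np.array([[0, -1j], [1j, 0]], dtype=complex)
sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)

# U = exp(-i * angle * sigma_x / 2), written in closed form.
U = np.cos(angle / 2) * np.eye(2) - 1j * np.sin(angle / 2) * sigma_x

# Rotate the z observable by the pulse.
rotated = U @ sigma_z @ U.conj().T
```

With these conventions `angle` equals $\pi/2$ and `rotated` equals $-\sigma_y$, so the pulse tips magnetization from the $z$ axis into the transverse plane. The sign depends on the rotation convention, which may differ from the one used in `spinSimulation`.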

Initialize the actor and critic, as well as the target actor and target critic. The actor learns the policy function
$$
\pi_\theta: S \to A, \quad s \mapsto a
$$
that picks the optimal action $a$ for a given state $s$, with some set of parameters $\theta$. The critic learns the Q-function
$$
Q_\phi: S \times A \to \mathbb{R}, \quad (s,a) \mapsto q
$$
where $q$ is the expected total reward from taking action $a$ in state $s$, and $\phi$ is the parameter set of the Q-function model. The target actor and target critic have their own parameter sets, $\theta_\text{target}$ and $\phi_\text{target}$.
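
In the Spinning Up pseudocode, the target parameters are not trained directly. After each update they are moved a small step toward the main parameters by Polyak averaging, $\theta_\text{target} \leftarrow \rho \, \theta_\text{target} + (1 - \rho) \, \theta$, and likewise for $\phi_\text{target}$. A minimal NumPy sketch follows, with parameters stored as a plain list of arrays. That storage layout and the function name are assumptions for illustration, not how `rlPulse` stores its weights.

```python
import numpy as np


def polyak_update(target_params, params, rho=0.995):
    """Return rho * target + (1 - rho) * main for each pair of parameter arrays."""
    return [rho * t + (1.0 - rho) * p for t, p in zip(target_params, params)]


# Hypothetical parameters: one weight matrix and one bias vector.
theta = [np.ones((3, 2)), np.zeros(2)]
theta_target = [np.zeros((3, 2)), np.ones(2)]

theta_target = polyak_update(theta_target, theta, rho=0.9)
```

A value of $\rho$ close to 1 makes the targets change slowly, which keeps the regression target for the critic from shifting too quickly between updates.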

In [None]:
actor = rlp.Actor(3,3,None)
actorTarget = rlp.Actor(3,3,None)
critic = rlp.Critic(3,3,None,.5)
criticTarget = rlp.Critic(3,3,None,.5)

replayBuffer = rlp.ReplayBuffer(100) # TODO figure out if this buffer size makes sense
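
To think about the buffer size, it helps to see what a fixed-capacity buffer does once it is full. The class below is a hypothetical stand-in built on `collections.deque`. It is not the `rlp.ReplayBuffer` implementation, whose interface may differ.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Fixed-capacity buffer; the oldest transition is dropped once it is full."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random batch without replacement."""
        batch_size = min(batch_size, len(self.transitions))
        return random.sample(list(self.transitions), batch_size)

    def __len__(self):
        return len(self.transitions)


buffer = SimpleReplayBuffer(capacity=100)
for t in range(150):
    # Integer states are placeholders so that old and new transitions are easy to tell apart.
    buffer.add(state=t, action=0.0, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(32)
```

After 150 insertions, only the 100 most recent transitions remain, so every sampled state is at least 50. With a capacity of 100, the agent only ever learns from very recent experience, which is one thing to weigh when settling the TODO above.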

Below is the implementation of the DDPG algorithm (see [this OpenAI resource for reference](https://spinningup.openai.com/en/latest/algorithms/ddpg.html#pseudocode)). 

In [None]:
# TODO write implementation



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(range(10), np.random.normal(size=(10)), label='0')
plt.title('Random numbers')
plt.xlabel('Cycle number')
plt.ylabel('Net magnetization, real')
plt.legend()