@@ -195,6 +195,44 @@ experimental systematic noise, ...) can be done all in one, using this
algorithm.
+ The RL Algorithm
+ ================
+ Reinforcement Learning (RL) represents a different approach from traditional
+ quantum control methods, such as GRAPE and CRAB. Instead of relying on gradients or
+ prior knowledge of the system, RL uses an agent that autonomously learns to optimize
+ control policies by interacting with the quantum environment.
+
+ The RL algorithm consists of three main components:
+
+ **Agent**: The RL agent is responsible for making decisions regarding control
+ parameters at each time step. The agent observes the current state of the quantum
+ system and chooses an action (i.e., a set of control parameters) based on the current policy.
+
+ **Environment**: The environment represents the quantum system that evolves over time.
+ The environment is defined by the system's dynamics, which include drift and control Hamiltonians.
+ Each action chosen by the agent induces a response in the environment, which manifests as an
+ evolution of the system's state. From this, a reward can be derived (see the sketch below).
+
+ **Reward**: The reward is a measure of how much the action chosen by the agent brings the
+ quantum system closer to the desired objective. In this context, the objective could be the
+ preparation of a specific state, state-to-state transfer, or the synthesis of a quantum gate.
+
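+ As a purely illustrative sketch of this interplay, the snippet below propagates a single
+ qubit for one time step under a control amplitude chosen by the agent. The drift and control
+ Hamiltonians, the step duration and the amplitude are arbitrary choices made for illustration:
+
+ .. code-block:: python
+
+     import qutip
+
+     H0 = qutip.sigmaz()        # drift Hamiltonian of the system
+     Hc = qutip.sigmax()        # control Hamiltonian
+     psi = qutip.basis(2, 0)    # current state observed by the agent
+     u = 0.5                    # action: control amplitude chosen by the agent
+     dt = 0.1                   # duration of one time step
+
+     # Environment response: piecewise-constant evolution under the chosen control
+     psi_next = ((-1j * (H0 + u * Hc) * dt).expm() * psi).unit()
+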
+ Each interaction between the agent and the environment defines a step.
+ A sequence of steps forms an episode. The episode ends when certain conditions, such as reaching
+ a specific fidelity, are met.
+ The reward function is a crucial component of the RL algorithm, carefully designed to
+ reflect the objective of the quantum control problem.
+ It guides the algorithm in updating its policy to maximize the reward obtained during the
+ training episodes.
+ For example, in a state-to-state transfer problem, the reward is based on the fidelity
+ between the achieved final state and the desired target state.
+ In addition, a constant penalty term is subtracted in order to encourage the agent to reach the
+ objective in as few steps as possible.
+
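+ A minimal sketch of such a fidelity-based reward with a constant step penalty might look as
+ follows (the ``step_penalty`` value is an arbitrary illustrative choice):
+
+ .. code-block:: python
+
+     import qutip
+
+     def reward(achieved_state, target_state, step_penalty=0.01):
+         # Fidelity between the achieved final state and the desired target state
+         fid = qutip.fidelity(achieved_state, target_state)
+         # The constant penalty encourages reaching the target in as few steps as possible
+         return fid - step_penalty
+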
+ In QuTiP, the RL environment is modeled as a custom class derived from the gymnasium library.
+ This class defines the quantum system's dynamics at each step, the actions the agent
+ can take, the observation space, and so on. The RL agent is trained using the Proximal Policy
+ Optimization (PPO) algorithm from the stable-baselines3 library.
+
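+ The snippet below is a minimal, self-contained sketch of what such an environment and its
+ training could look like. It is not the actual QuTiP implementation: the class name
+ ``QubitTransferEnv``, the toy single-qubit dynamics, the observation encoding and all numerical
+ values are assumptions made for illustration.
+
+ .. code-block:: python
+
+     import gymnasium as gym
+     import numpy as np
+     import qutip
+     from stable_baselines3 import PPO
+
+     class QubitTransferEnv(gym.Env):
+         """Toy state-to-state transfer environment for a single qubit."""
+
+         def __init__(self, dt=0.1, max_steps=100):
+             super().__init__()
+             self.H0 = qutip.sigmaz()          # drift Hamiltonian
+             self.Hc = qutip.sigmax()          # control Hamiltonian
+             self.target = qutip.basis(2, 1)   # desired target state
+             self.dt = dt
+             self.max_steps = max_steps
+             # Action: one control amplitude per time step
+             self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
+             # Observation: real and imaginary parts of the state amplitudes
+             self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
+
+         def _obs(self):
+             amps = self.psi.full().ravel()
+             return np.concatenate([amps.real, amps.imag]).astype(np.float32)
+
+         def reset(self, seed=None, options=None):
+             super().reset(seed=seed)
+             self.psi = qutip.basis(2, 0)      # initial state
+             self.steps = 0
+             return self._obs(), {}
+
+         def step(self, action):
+             u = float(action[0])
+             # Environment response: piecewise-constant evolution over one time step
+             H = self.H0 + u * self.Hc
+             self.psi = ((-1j * H * self.dt).expm() * self.psi).unit()
+             self.steps += 1
+             fid = qutip.fidelity(self.psi, self.target)
+             reward = fid - 0.01               # fidelity minus a constant step penalty
+             terminated = fid > 0.99           # episode ends once the target fidelity is reached
+             truncated = self.steps >= self.max_steps
+             return self._obs(), reward, terminated, truncated, {}
+
+     # Train a PPO agent on the toy environment
+     env = QubitTransferEnv()
+     model = PPO("MlpPolicy", env, verbose=0)
+     model.learn(total_timesteps=20_000)
+
+ Once trained, ``model.predict(observation)`` returns the control amplitude the agent would
+ apply for a given observation of the quantum state.
+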
Optimal Quantum Control in QuTiP
================================
Defining a control problem with QuTiP is very easy.