
Commit c3c3adb

Merge pull request #18 from LegionAtol/update-guide-control
Update guide control
2 parents 1ce4f3c + 5403809

File tree

1 file changed: +38 −0 lines changed

doc/guide/guide-control.rst

Lines changed: 38 additions & 0 deletions
@@ -195,6 +195,44 @@ experimental systematic noise, ...) can be done all in one, using this
algorithm.


The RL Algorithm
================
Reinforcement Learning (RL) takes a different approach from traditional quantum
control methods such as GRAPE and CRAB. Instead of relying on gradients or prior
knowledge of the system, RL uses an agent that autonomously learns to optimize
control policies by interacting with the quantum environment.

The RL algorithm consists of three main components:

**Agent**: The RL agent is responsible for making decisions about the control
parameters at each time step. It observes the current state of the quantum system
and chooses an action (i.e., a set of control parameters) based on its current policy.

**Environment**: The environment represents the quantum system that evolves over time.
It is defined by the system's dynamics, which include the drift and control Hamiltonians.
Each action chosen by the agent induces a response in the environment, which manifests
as an evolution of the system's state. From this, a reward can be derived.

**Reward**: The reward measures how much the action chosen by the agent brings the
quantum system closer to the desired objective. In this context, the objective could be
the preparation of a specific state, state-to-state transfer, or the synthesis of a
quantum gate.

Each interaction between the agent and the environment defines a step, and a sequence of
steps forms an episode. The episode ends when certain conditions are met, such as reaching
a specific fidelity.
The reward function is a crucial component of the RL algorithm and is carefully designed
to reflect the objective of the quantum control problem. It guides the algorithm in
updating its policy to maximize the reward obtained over the training episodes.
For example, in a state-to-state transfer problem, the reward is based on the fidelity
between the achieved final state and the desired target state. In addition, a constant
penalty term is subtracted to encourage the agent to reach the objective in as few steps
as possible.
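As a small sketch (assuming a hypothetical constant penalty of ``0.01``; the exact
expression used by the optimizer may differ), such a reward could be computed as:

.. code-block:: python

    import qutip as qt

    # Illustrative reward for one step of a state-to-state transfer problem:
    # fidelity to the target state minus a constant per-step penalty.
    target_state = qt.basis(2, 1)                              # desired target |1>
    current_state = (qt.basis(2, 0) + qt.basis(2, 1)).unit()   # state reached so far
    step_penalty = 0.01                                        # hypothetical constant penalty

    reward = qt.fidelity(current_state, target_state) - step_penalty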

In QuTiP, the RL environment is modeled as a custom class derived from the gymnasium
library. This class allows defining the quantum system's dynamics at each step, the
actions the agent can take, the observation space, and so on. The RL agent is trained
using the Proximal Policy Optimization (PPO) algorithm from the stable-baselines3 library.
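The snippet below is a self-contained sketch of this workflow, not the class shipped with
QuTiP: a hypothetical single-qubit environment (``QubitTransferEnv``) with one drift and one
control Hamiltonian, a reward given by the fidelity to the target state minus a constant
penalty, and a PPO agent trained on it.

.. code-block:: python

    import numpy as np
    import gymnasium as gym
    import qutip as qt
    from stable_baselines3 import PPO

    class QubitTransferEnv(gym.Env):
        """Hypothetical single-qubit state-to-state transfer environment."""

        def __init__(self):
            super().__init__()
            self.H_d = qt.sigmaz()        # drift Hamiltonian
            self.H_c = qt.sigmax()        # control Hamiltonian
            self.target = qt.basis(2, 1)  # target state |1>
            self.dt = 0.1                 # duration of one step
            self.max_steps = 100
            # One control amplitude per step, bounded in [-1, 1].
            self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
            # Observation: real and imaginary parts of the two state amplitudes.
            self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)

        def _obs(self):
            psi = self.state.full().ravel()
            return np.concatenate([psi.real, psi.imag]).astype(np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            self.state = qt.basis(2, 0)   # start from |0>
            self.n_steps = 0
            return self._obs(), {}

        def step(self, action):
            # Evolve the state for one time step with the chosen control amplitude.
            H = self.H_d + float(action[0]) * self.H_c
            self.state = (-1j * H * self.dt).expm() * self.state
            self.n_steps += 1
            fid = qt.fidelity(self.state, self.target)
            reward = fid - 0.01                   # fidelity minus a constant penalty
            terminated = fid > 0.99               # episode ends near the target state
            truncated = self.n_steps >= self.max_steps
            return self._obs(), reward, terminated, truncated, {}

    # Train a PPO agent on the environment.
    model = PPO("MlpPolicy", QubitTransferEnv(), verbose=0)
    model.learn(total_timesteps=10_000)

After training, ``model.predict(obs)`` returns the control amplitude the agent would apply
for a given observation, so the learned pulse can be reconstructed step by step.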


Optimal Quantum Control in QuTiP
================================
Defining a control problem with QuTiP is very easy.
