90 changes: 45 additions & 45 deletions site/en/tutorials/reinforcement_learning/actor_critic.ipynb
@@ -37,7 +37,7 @@
"id": "p62G8M_viUJp"
},
"source": [
"# Playing CartPole with the Actor-Critic Method\n"
"# Playing CartPole with the Actor-Critic method\n"
]
},
{
@@ -74,8 +74,8 @@
"id": "kFgN7h_wiUJq"
},
"source": [
"This tutorial demonstrates how to implement the [Actor-Critic](https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf) method using TensorFlow to train an agent on the [Open AI Gym](https://gym.openai.com/) CartPole-V0 environment.\n",
"The reader is assumed to have some familiarity with [policy gradient methods](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf) of reinforcement learning. \n"
"This tutorial demonstrates how to implement the [Actor-Critic](https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf) method using TensorFlow to train an agent on the [Open AI Gym](https://gym.openai.com/) [`CartPole-v0`](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) environment.\n",
"The reader is assumed to have some familiarity with [policy gradient methods](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf) of [(deep) reinforcement learning](https://en.wikipedia.org/wiki/Deep_reinforcement_learning). \n"
]
},
{
@@ -86,7 +86,7 @@
"source": [
"**Actor-Critic methods**\n",
"\n",
"Actor-Critic methods are [temporal difference (TD) learning](https://en.wikipedia.org/wiki/Temporal_difference_learning) methods that represent the policy function independent of the value function. \n",
"Actor-Critic methods are [temporal difference (TD) learning](https://en.wikipedia.org/wiki/Temporal_difference_learning) methods that represent the policy function independent of the value function.\n",
"\n",
"A policy function (or policy) returns a probability distribution over actions that the agent can take based on the given state.\n",
"A value function determines the expected return for an agent starting at a given state and acting according to a particular policy forever after.\n",
@@ -102,12 +102,12 @@
"id": "rBfiafKSRs2k"
},
"source": [
"**CartPole-v0**\n",
"**`CartPole-v0`**\n",
"\n",
"In the [CartPole-v0 environment](https://www.gymlibrary.ml/environments/classic_control/cart_pole/), a pole is attached to a cart moving along a frictionless track. \n",
"The pole starts upright and the goal of the agent is to prevent it from falling over by applying a force of -1 or +1 to the cart. \n",
"A reward of +1 is given for every time step the pole remains upright.\n",
"An episode ends when (1) the pole is more than 15 degrees from vertical or (2) the cart moves more than 2.4 units from the center.\n",
"In the [`CartPole-v0` environment](https://www.gymlibrary.dev/environments/classic_control/cart_pole/), a pole is attached to a cart moving along a frictionless track.\n",
"The pole starts upright and the goal of the agent is to prevent it from falling over by applying a force of `-1` or `+1` to the cart.\n",
"A reward of `+1` is given for every time step the pole remains upright.\n",
"An episode ends when: 1) the pole is more than 15 degrees from vertical; or 2) the cart moves more than 2.4 units from the center.\n",
"\n",
"<center>\n",
" <figure>\n",
@@ -203,15 +203,15 @@
"id": "AOUCe2D0iUJu"
},
"source": [
"## Model\n",
"## The model\n",
"\n",
"The *Actor* and *Critic* will be modeled using one neural network that generates the action probabilities and critic value respectively. This tutorial uses model subclassing to define the model. \n",
"The *Actor* and *Critic* will be modeled using one neural network that generates the action probabilities and Critic value respectively. This tutorial uses model subclassing to define the model. \n",
"\n",
"During the forward pass, the model will take in the state as the input and will output both action probabilities and critic value $V$, which models the state-dependent [value function](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#value-functions). The goal is to train a model that chooses actions based on a policy $\\pi$ that maximizes expected [return](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#reward-and-return).\n",
"\n",
"For Cartpole-v0, there are four values representing the state: cart position, cart-velocity, pole angle and pole velocity respectively. The agent can take two actions to push the cart left (0) and right (1) respectively.\n",
"For `CartPole-v0`, there are four values representing the state: cart position, cart-velocity, pole angle and pole velocity respectively. The agent can take two actions to push the cart left (`0`) and right (`1`), respectively.\n",
"\n",
"Refer to [OpenAI Gym's CartPole-v0 wiki page](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) for more information.\n"
"Refer to [Gym's Cart Pole documentation page](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) and [_Neuronlike adaptive elements that can solve difficult learning control problems_](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) by Barto, Sutton and Anderson (1983) for more information.\n"
]
},
{
@@ -261,13 +261,13 @@
"id": "hk92njFziUJw"
},
"source": [
"## Training\n",
"## Train the agent\n",
"\n",
"To train the agent, you will follow these steps:\n",
"\n",
"1. Run the agent on the environment to collect training data per episode.\n",
"2. Compute expected return at each time step.\n",
"3. Compute the loss for the combined actor-critic model.\n",
"3. Compute the loss for the combined Actor-Critic model.\n",
"4. Compute gradients and update network parameters.\n",
"5. Repeat 1-4 until either success criterion or max episodes has been reached.\n"
]
@@ -278,7 +278,7 @@
"id": "R2nde2XDs8Gh"
},
"source": [
"### 1. Collecting training data\n",
"### 1. Collect training data\n",
"\n",
"As in supervised learning, in order to train the actor-critic model, you need\n",
"to have training data. However, in order to collect such data, the model would\n",
@@ -299,7 +299,7 @@
},
"outputs": [],
"source": [
"# Wrap OpenAI Gym's `env.step` call as an operation in a TensorFlow function.\n",
"# Wrap Gym's `env.step` call as an operation in a TensorFlow function.\n",
"# This would allow it to be included in a callable TensorFlow graph.\n",
"\n",
"def env_step(action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:\n",
@@ -377,7 +377,7 @@
"id": "lBnIHdz22dIx"
},
"source": [
"### 2. Computing expected returns\n",
"### 2. Compute the expected returns\n",
"\n",
"The sequence of rewards for each timestep $t$, $\\{r_{t}\\}^{T}_{t=1}$ collected during one episode is converted into a sequence of expected returns $\\{G_{t}\\}^{T}_{t=1}$ in which the sum of rewards is taken from the current timestep $t$ to $T$ and each reward is multiplied with an exponentially decaying discount factor $\\gamma$:\n",
"\n",
@@ -432,9 +432,9 @@
"id": "qhr50_Czxazw"
},
"source": [
"### 3. The actor-critic loss\n",
"### 3. The Actor-Critic loss\n",
"\n",
"Since a hybrid actor-critic model is used, the chosen loss function is a combination of actor and critic losses for training, as shown below:\n",
"Since you're using a hybrid Actor-Critic model, the chosen loss function is a combination of Actor and Critic losses for training, as shown below:\n",
"\n",
"$$L = L_{actor} + L_{critic}$$"
]
@@ -445,18 +445,18 @@
"id": "nOQIJuG1xdTH"
},
"source": [
"#### Actor loss\n",
"#### The Actor loss\n",
"\n",
"The actor loss is based on [policy gradients with the critic as a state dependent baseline](https://www.youtube.com/watch?v=EKqxumCuAAY&t=62m23s) and computed with single-sample (per-episode) estimates.\n",
"The Actor loss is based on [policy gradients with the Critic as a state dependent baseline](https://www.youtube.com/watch?v=EKqxumCuAAY&t=62m23s) and computed with single-sample (per-episode) estimates.\n",
"\n",
"$$L_{actor} = -\\sum^{T}_{t=1} \\log\\pi_{\\theta}(a_{t} | s_{t})[G(s_{t}, a_{t}) - V^{\\pi}_{\\theta}(s_{t})]$$\n",
"\n",
"where:\n",
"- $T$: the number of timesteps per episode, which can vary per episode\n",
"- $s_{t}$: the state at timestep $t$\n",
"- $a_{t}$: chosen action at timestep $t$ given state $s$\n",
"- $\\pi_{\\theta}$: is the policy (actor) parameterized by $\\theta$\n",
"- $V^{\\pi}_{\\theta}$: is the value function (critic) also parameterized by $\\theta$\n",
"- $\\pi_{\\theta}$: is the policy (Actor) parameterized by $\\theta$\n",
"- $V^{\\pi}_{\\theta}$: is the value function (Critic) also parameterized by $\\theta$\n",
"- $G = G_{t}$: the expected return for a given state, action pair at timestep $t$\n",
"\n",
"A negative term is added to the sum since the idea is to maximize the probabilities of actions yielding higher rewards by minimizing the combined loss.\n",
@@ -470,15 +470,15 @@
"id": "Y304O4OAxiAv"
},
"source": [
"##### Advantage\n",
"##### The Advantage\n",
"\n",
"The $G - V$ term in our $L_{actor}$ formulation is called the [advantage](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#advantage-functions), which indicates how much better an action is given a particular state over a random action selected according to the policy $\\pi$ for that state.\n",
"The $G - V$ term in our $L_{actor}$ formulation is called the [Advantage](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#advantage-functions), which indicates how much better an action is given a particular state over a random action selected according to the policy $\\pi$ for that state.\n",
"\n",
"While it's possible to exclude a baseline, this may result in high variance during training. And the nice thing about choosing the critic $V$ as a baseline is that it trained to be as close as possible to $G$, leading to a lower variance.\n",
"\n",
"In addition, without the critic, the algorithm would try to increase probabilities for actions taken on a particular state based on expected return, which may not make much of a difference if the relative probabilities between actions remain the same.\n",
"In addition, without the Critic, the algorithm would try to increase probabilities for actions taken on a particular state based on expected return, which may not make much of a difference if the relative probabilities between actions remain the same.\n",
"\n",
"For instance, suppose that two actions for a given state would yield the same expected return. Without the critic, the algorithm would try to raise the probability of these actions based on the objective $J$. With the critic, it may turn out that there's no advantage ($G - V = 0$) and thus no benefit gained in increasing the actions' probabilities and the algorithm would set the gradients to zero.\n",
"For instance, suppose that two actions for a given state would yield the same expected return. Without the Critic, the algorithm would try to raise the probability of these actions based on the objective $J$. With the Critic, it may turn out that there's no Advantage ($G - V = 0$), and thus no benefit gained in increasing the actions' probabilities and the algorithm would set the gradients to zero.\n",
"\n",
"<br>"
]
@@ -489,7 +489,7 @@
"id": "1hrPLrgGxlvb"
},
"source": [
"#### Critic loss\n",
"#### The Critic loss\n",
"\n",
"Training $V$ to be as close possible to $G$ can be set up as a regression problem with the following loss function:\n",
"\n",
@@ -512,7 +512,7 @@
" action_probs: tf.Tensor, \n",
" values: tf.Tensor, \n",
" returns: tf.Tensor) -> tf.Tensor:\n",
" \"\"\"Computes the combined actor-critic loss.\"\"\"\n",
" \"\"\"Computes the combined Actor-Critic loss.\"\"\"\n",
"\n",
" advantage = returns - values\n",
"\n",
@@ -530,7 +530,7 @@
"id": "HSYkQOmRfV75"
},
"source": [
"### 4. Defining the training step to update parameters\n",
"### 4. Define the training step to update parameters\n",
"\n",
"All of the steps above are combined into a training step that is run every episode. All steps leading up to the loss function are executed with the `tf.GradientTape` context to enable automatic differentiation.\n",
"\n",
@@ -567,14 +567,14 @@
" action_probs, values, rewards = run_episode(\n",
" initial_state, model, max_steps_per_episode) \n",
"\n",
" # Calculate expected returns\n",
" # Calculate the expected returns\n",
" returns = get_expected_return(rewards, gamma)\n",
"\n",
" # Convert training data to appropriate TF tensor shapes\n",
" action_probs, values, returns = [\n",
" tf.expand_dims(x, 1) for x in [action_probs, values, returns]] \n",
"\n",
" # Calculating loss values to update our network\n",
" # Calculate the loss values to update our network\n",
" loss = compute_loss(action_probs, values, returns)\n",
"\n",
" # Compute the gradients from the loss\n",
@@ -598,7 +598,7 @@
"\n",
"Training is executed by running the training step until either the success criterion or maximum number of episodes is reached. \n",
"\n",
"A running record of episode rewards is kept in a queue. Once 100 trials are reached, the oldest reward is removed at the left (tail) end of the queue and the newest one is added at the head (right). A running sum of the rewards is also maintained for computational efficiency. \n",
"A running record of episode rewards is kept in a queue. Once 100 trials are reached, the oldest reward is removed at the left (tail) end of the queue and the newest one is added at the head (right). A running sum of the rewards is also maintained for computational efficiency.\n",
"\n",
"Depending on your runtime, training can finish in less than a minute."
]
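The queue-plus-running-sum bookkeeping described above can be sketched with `collections.deque`. The class name `RunningReward` and the window size used in the example are assumptions for illustration. The notebook's own training loop may compute the average differently.

```python
import collections


class RunningReward:
    """Track the mean of the most recent `window` episode rewards (sketch)."""

    def __init__(self, window: int) -> None:
        self.rewards: collections.deque = collections.deque(maxlen=window)
        self.total = 0.0

    def add(self, reward: float) -> float:
        # When the queue is full, the oldest reward (left end) is about to be
        # evicted by append(), so subtract it from the running sum first.
        if len(self.rewards) == self.rewards.maxlen:
            self.total -= self.rewards[0]
        self.rewards.append(reward)
        self.total += reward
        return self.total / len(self.rewards)
```

With a window of 2, adding rewards 1.0, 3.0, and 5.0 in turn gives running means of 1.0, 2.0, and 4.0. Keeping the sum up to date avoids re-adding the whole window after every episode.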
@@ -617,15 +617,15 @@
"max_episodes = 10000\n",
"max_steps_per_episode = 500\n",
"\n",
"# Cartpole-v1 is considered solved if average reward is >= 475 over 500 \n",
"# `CartPole-v1` is considered solved if average reward is >= 475 over 500 \n",
"# consecutive trials\n",
"reward_threshold = 475\n",
"running_reward = 0\n",
"\n",
"# Discount factor for future rewards\n",
"# The discount factor for future rewards\n",
"gamma = 0.99\n",
"\n",
"# Keep last episodes reward\n",
"# Keep the last episodes reward\n",
"episodes_reward: collections.deque = collections.deque(maxlen=min_episodes_criterion)\n",
"\n",
"t = tqdm.trange(max_episodes)\n",
@@ -642,7 +642,7 @@
" t.set_postfix(\n",
" episode_reward=episode_reward, running_reward=running_reward)\n",
" \n",
" # Show average episode reward every 10 episodes\n",
" # Show the average episode reward every 10 episodes\n",
" if i % 10 == 0:\n",
" pass # print(f'Episode {i}: average reward: {avg_reward}')\n",
" \n",
@@ -660,7 +660,7 @@
"source": [
"## Visualization\n",
"\n",
"After training, it would be good to visualize how the model performs in the environment. You can run the cells below to generate a GIF animation of one episode run of the model. Note that additional packages need to be installed for OpenAI Gym to render the environment's images correctly in Colab."
"After training, it would be good to visualize how the model performs in the environment. You can run the cells below to generate a GIF animation of one episode run of the model. Note that additional packages need to be installed for Gym to render the environment's images correctly in Colab."
]
},
{
@@ -731,15 +731,15 @@
"source": [
"## Next steps\n",
"\n",
"This tutorial demonstrated how to implement the actor-critic method using Tensorflow.\n",
"This tutorial demonstrated how to implement the Actor-Critic method using Tensorflow.\n",
"\n",
"As a next step, you could try training a model on a different environment in OpenAI Gym. \n",
"As a next step, you could try training a model on a different environment in Gym. \n",
"\n",
"For additional information regarding actor-critic methods and the Cartpole-v0 problem, you may refer to the following resources:\n",
"For additional information regarding Actor-Critic methods and the Cartpole-v0 problem, you may refer to the following resources:\n",
"\n",
"- [Actor Critic Method](https://hal.inria.fr/hal-00840470/document)\n",
"- [Actor Critic Lecture (CAL)](https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=7&t=0s)\n",
"- [Cartpole learning control problem \\[Barto, et al. 1983\\]](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) \n",
"- [The Actor-Critic method](https://hal.inria.fr/hal-00840470/document)\n",
"- [The Actor-Critic lecture (CAL)](https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=7&t=0s)\n",
"- [Cart Pole learning control problem \\[Barto, et al. 1983\\]](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) \n",
"\n",
"For more reinforcement learning examples in TensorFlow, you can check the following resources:\n",
"- [Reinforcement learning code examples (keras.io)](https://keras.io/examples/rl/)\n",