added pendulum examples
hardmaru committed Nov 30, 2018
1 parent c6b448f commit cf66682
Showing 2 changed files with 4 additions and 4 deletions.
4 changes: 2 additions & 2 deletions draft.md
@@ -426,12 +426,12 @@ We have shown that one iteration of this training loop was enough to solve simpl

<div style="text-align: center;">
<video autoplay muted playsinline loop style="display: block; margin: auto; width: 100%;"><source src="https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/pendulum01.mp4" type="video/mp4"/></video>
-<figcaption>Generated rollout after the first iteration. M has difficulty predicting states of a swung up pole since the data collected from the initial random policy is near the steady state in the bottom half. Despite this, C still learns to swing the pole upwards when deployed inside of M. </figcaption>
+<figcaption>Swing-up Pendulum from Pixels: Generated rollout after the first iteration. M has difficulty predicting states of a swung up pole since the data collected from the initial random policy is near the steady state in the bottom half. Despite this, C still learns to swing the pole upwards when deployed inside of M. </figcaption>
</div>

<div style="text-align: center;">
<video autoplay muted playsinline loop style="display: block; margin: auto; width: 100%;"><source src="https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/pendulum20.mp4" type="video/mp4"/></video>
-<figcaption>Generated rollout after 20 iterations. Deploying policies that swing the pole upwards in the actual environment gathered more data that recorded the pole being in the top half, allowing M to model the environment more accurately, and C to learn a better policy inside of M, eventually balancing an inverted pendulum.</figcaption>
+<figcaption>Swing-up Pendulum from Pixels: Generated rollout after 20 iterations. Deploying policies that swing the pole upwards in the actual environment gathered more data that recorded the pole being in the top half, allowing M to model the environment more accurately, and C to learn a better policy inside of M.</figcaption>
</div>

In the present approach, since M is a MDN-RNN that models a probability distribution for the next frame, if it does a poor job, then it means the agent has encountered parts of the world that it is not familiar with. Therefore we can adapt and reuse M's training loss function to encourage curiosity. By flipping the sign of M's loss function in the actual environment, the agent will be encouraged to explore parts of the world that it is not familiar with. The new data it collects may improve the world model.
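To make the sign flip in the paragraph above concrete, here is a minimal numpy sketch of the idea: M's mixture-density loss (the negative log-likelihood of the observed next latent) is reused as an intrinsic reward, so states that M predicts poorly yield a larger bonus. The function names, mixture shapes, and example values are illustrative assumptions, not code from this repository.

```python
import numpy as np

def mdn_negative_log_likelihood(z_true, pi, mu, sigma):
    """M's per-step training loss: negative log-likelihood of the observed
    next latent z_true under a mixture of K diagonal Gaussians.
    pi: (K,) mixture weights; mu, sigma: (K, D) component means and std devs."""
    z = z_true[None, :]                                        # (1, D)
    log_component = (
        -0.5 * np.sum(((z - mu) / sigma) ** 2, axis=1)
        - np.sum(np.log(sigma), axis=1)
        - 0.5 * z.shape[1] * np.log(2.0 * np.pi)
    )                                                          # (K,) log N(z | mu_k, sigma_k)
    log_mixture = np.log(pi) + log_component
    return -np.logaddexp.reduce(log_mixture)                   # scalar loss

def curiosity_bonus(z_true, pi, mu, sigma):
    # The quantity M minimizes becomes a reward the agent maximizes in the
    # actual environment: high prediction error -> high intrinsic reward.
    return mdn_negative_log_likelihood(z_true, pi, mu, sigma)

# A well-predicted latent earns a smaller bonus than a surprising one.
pi = np.array([0.5, 0.5])
mu, sigma = np.zeros((2, 4)), np.ones((2, 4))
print(curiosity_bonus(np.zeros(4), pi, mu, sigma))       # familiar state, small bonus
print(curiosity_bonus(5.0 * np.ones(4), pi, mu, sigma))  # unfamiliar state, large bonus
```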
4 changes: 2 additions & 2 deletions index.html
@@ -675,11 +675,11 @@ <h2>Iterative Training Procedure</h2>
<p>We have shown that one iteration of this training loop was enough to solve simple tasks. For more difficult tasks, we need our controller in Step 2 to actively explore parts of the environment that is beneficial to improve its world model. An exciting research direction is to look at ways to incorporate artificial curiosity and intrinsic motivation <dt-cite key="schmidhuber_creativity,s07_intrinsic,s08_curiousity,pathak2017,intrinsic_motivation"></dt-cite> and information seeking <dt-cite key="SchmidhuberStorck:94,Gottlieb2013"></dt-cite> abilities in an agent to encourage novel exploration <dt-cite key="Lehman2011"></dt-cite>. In particular, we can augment the reward function based on improvement in compression quality <dt-cite key="schmidhuber_creativity,s07_intrinsic,s08_curiousity,learning_to_think"></dt-cite>.</p>
<div style="text-align: center;">
<video autoplay muted playsinline loop style="display: block; margin: auto; width: 100%;"><source src="https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/pendulum01.mp4" type="video/mp4"/></video>
-<figcaption>Generated rollout after the first iteration. M has difficulty predicting states of a swung up pole since the data collected from the initial random policy is near the steady state in the bottom half. Despite this, C still learns to swing the pole upwards when deployed inside of M. </figcaption>
+<figcaption>Swing-up Pendulum from Pixels: Generated rollout after the first iteration. M has difficulty predicting states of a swung up pole since the data collected from the initial random policy is near the steady state in the bottom half. Despite this, C still learns to swing the pole upwards when deployed inside of M. </figcaption>
</div>
<div style="text-align: center;">
<video autoplay muted playsinline loop style="display: block; margin: auto; width: 100%;"><source src="https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/pendulum20.mp4" type="video/mp4"/></video>
-<figcaption>Generated rollout after 20 iterations. Deploying policies that swing the pole upwards in the actual environment gathered more data that recorded the pole being in the top half, allowing M to model the environment more accurately, and C to learn a better policy inside of M, eventually balancing an inverted pendulum.</figcaption>
+<figcaption>Swing-up Pendulum from Pixels: Generated rollout after 20 iterations. Deploying policies that swing the pole upwards in the actual environment gathered more data that recorded the pole being in the top half, allowing M to model the environment more accurately, and C to learn a better policy inside of M.</figcaption>
</div>
<p>In the present approach, since M is a MDN-RNN that models a probability distribution for the next frame, if it does a poor job, then it means the agent has encountered parts of the world that it is not familiar with. Therefore we can adapt and reuse M's training loss function to encourage curiosity. By flipping the sign of M's loss function in the actual environment, the agent will be encouraged to explore parts of the world that it is not familiar with. The new data it collects may improve the world model.</p>
<p>The iterative training procedure requires the M model to not only predict the next observation <span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="strut" style="height:0.43056em;"></span><span class="strut bottom" style="height:0.43056em;vertical-align:0em;"></span><span class="base textstyle uncramped"><span class="mord mathit">x</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>d</mi><mi>o</mi><mi>n</mi><mi>e</mi></mrow><annotation encoding="application/x-tex">done</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="strut" style="height:0.69444em;"></span><span class="strut bottom" style="height:0.69444em;vertical-align:0em;"></span><span class="base textstyle uncramped"><span class="mord mathit">d</span><span class="mord mathit">o</span><span class="mord mathit">n</span><span class="mord mathit">e</span></span></span></span>, but also predict the action and reward for the next time step. This may be required for more difficult tasks. For instance, if our agent needs to learn complex motor skills to walk around its environment, the world model will learn to imitate its own C model that has already learned to walk. After difficult motor skills, such as walking, is absorbed into a large world model with lots of capacity, the smaller C model can rely on the motor skills already absorbed by the world model and focus on learning more higher level skills to navigate itself using the motor skills it had already learned.<dt-fn>Another related connection is to muscle memory. For instance, as you learn to do something like play the piano, you no longer have to spend working memory capacity on translating individual notes to finger motions -- this all becomes encoded at a subconscious level.</dt-fn></p>
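The paragraph above notes that, for the iterative procedure, M must predict not just the next observation and done flag but also the next action and reward. A minimal Keras sketch of such a multi-head world model is shown below; the latent size, LSTM width, layer names, and the plain MSE/cross-entropy losses standing in for the paper's mixture-density output are all illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_SIZE = 32   # size of the VAE latent z (assumed value)
ACTION_SIZE = 1    # e.g. a single torque for the pendulum task (assumed)

# Input at each step is the current latent z_t concatenated with the action a_t.
inputs = layers.Input(shape=(None, LATENT_SIZE + ACTION_SIZE))
h = layers.LSTM(256, return_sequences=True)(inputs)

# Separate prediction heads for the next time step.
z_next = layers.Dense(LATENT_SIZE, name="z_next")(h)                     # next latent (MSE here; the paper uses an MDN head)
done   = layers.Dense(1, activation="sigmoid", name="done")(h)           # episode-termination probability
action = layers.Dense(ACTION_SIZE, activation="tanh", name="action")(h)  # next action, for imitating C's motor skills
reward = layers.Dense(1, name="reward")(h)                               # predicted reward

m_model = tf.keras.Model(inputs, [z_next, done, action, reward])
m_model.compile(
    optimizer="adam",
    loss={"z_next": "mse", "done": "binary_crossentropy",
          "action": "mse", "reward": "mse"},
)
m_model.summary()
```

In this sketch, training C inside M would rely on the reward and done heads, while the action head is what would let a large-capacity M absorb motor skills that C has already learned, as described above.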
