This challenge serves two purposes:
1. introduce you to the world of Policy Gradient methods,
2. act as an evaluation for the RL class.

It is meant to be started in class and finished at home. You will need to read a fair amount and work through the material on your own before answering the questions below.

The goal is simply to complete your discovery of Reinforcement Learning and to ensure that you validate the skill goals of the class. There are no traps here: I'd be happy to give a perfect mark to anyone who completes this exam, and extra points to those who go beyond it (see the bonus questions).

I recommend answering both questions of the theoretical part first, then moving on to the implementation part, then coming back to the second question of the theoretical part (practice allows your ideas to mature).

# Policy gradient theorem (4 points)

Use your favourite source of information to:
1. quote the policy gradient theorem (2 points)
2. explain how it's useful (2 points)

The goal is not to have you search dark places of the web for references, but to build an understanding for yourself and render it in your own words. You can, for instance, use the following sources (you may suggest other ones):
- [Policy gradient algorithms](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) on Lilian Weng's blog (OpenAI)
- [Policy gradient methods for reinforcement learning with function approximation](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf), Sutton, McAllester, Singh, Mansour. NIPS 2000.
- [Policy gradient methods](http://www.scholarpedia.org/article/Policy_gradient_methods) on Scholarpedia (written by Jan Peters).
- [Reinforcement Learning, an introduction](http://incompleteideas.net/book/the-book.html), the classic book by Sutton and Barto. Chapter 13.

**The policy gradient theorem:**
*Your answer here*

**The meaning and usefulness of the policy gradient theorem:**
*Your answer here*

# REINFORCE (3 points)

- Implement the REINFORCE algorithm (from the Machine Learning journal paper "Simple statistical gradient-following algorithms for connectionist reinforcement learning" by Williams 1992, but also explained in all links above) on OpenAI gym's inverted pendulum.
- Plot the evolution of performance vs training time
- Discuss
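
One building block shared by most REINFORCE implementations is the discounted return-to-go computed from the rewards of one episode. The following is a minimal sketch of that helper only, not a solution: the name `discounted_returns` and the default `gamma=0.99` are illustrative assumptions, and the policy and gradient step are deliberately left out.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t.

    `rewards` is the list of rewards collected during ONE episode.
    `gamma` is an assumed discount factor; pick your own.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed just after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([1.0, 1.0, 1.0], gamma=0.5)` gives `[1.75, 1.5, 1.0]`. Computing the returns backwards keeps the cost linear in the episode length.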

In [None]:
import gym
#import gym.envs.box2d.lunar_lander as ll
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
np.set_printoptions(precision=3)

In [None]:
pendulum = gym.make('Pendulum-v0')

In [None]:
# Your code here

# Actor-critic (3 points)

- Implement an Actor-Critic algorithm on OpenAI gym's inverted pendulum.
- Plot the evolution of performance vs training time
- Discuss
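
Many actor-critic variants use the one-step temporal-difference (TD) error of the critic as the learning signal for the actor. As a hedged illustration of that quantity only (the helper name `td_error` and its arguments are hypothetical, and other critic signals are possible):

```python
def td_error(reward, value, next_value, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s).

    `value` and `next_value` are the critic's estimates V(s) and V(s').
    When the episode ended at s' (`done=True`), nothing is bootstrapped.
    `gamma` is an assumed discount factor.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value
```

A positive TD error means the transition turned out better than the critic expected, which is why it can serve as a weight on the actor's update.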

In [None]:
# Your code here

# Monte Carlo Tree Search (5 points)

First note that the trick below allows you to set the pendulum state as you wish.

In [None]:
pendulum.reset()
pendulum.unwrapped.state = [np.pi/3., 0.] # format: theta, thetaDot
pendulum.render();

Note also that the integration time step is known.

In [None]:
print("time step:", pendulum.unwrapped.dt)

Finally, note that you can use the trick below to control the wall clock execution time of a method.

In [None]:
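import datetime

# Wall-clock budget: the duration of one simulation time step.
time_limit = datetime.timedelta(seconds=pendulum.unwrapped.dt)
count = 0
begin = datetime.datetime.utcnow()
# Busy-loop until the budget is spent, counting the iterations.
while datetime.datetime.utcnow() - begin < time_limit:
    count += 1
print(count)  # number of loop iterations that fit in the budget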
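
The same idea can be wrapped in a reusable helper. This is a sketch under assumptions: the function name `run_within_budget` and its `step` callback are hypothetical, and it uses `time.monotonic()` from the standard library, which is unaffected by system clock adjustments, instead of `datetime`.

```python
import time


def run_within_budget(step, budget_seconds):
    """Call `step()` repeatedly until `budget_seconds` of wall-clock time have passed.

    Returns the number of completed calls. A call that starts before the
    deadline is allowed to finish, so the budget can be slightly exceeded.
    """
    deadline = time.monotonic() + budget_seconds
    iterations = 0
    while time.monotonic() < deadline:
        step()
        iterations += 1
    return iterations
```

In this notebook, `budget_seconds` would typically be `pendulum.unwrapped.dt`, and `step` would run one unit of planning work.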

Now use all those to implement a Monte Carlo Tree Search method that controls the pendulum in real time.
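
A common ingredient of Monte Carlo Tree Search is the UCB1 rule for choosing which child to explore next. The sketch below shows that rule in isolation, assuming the continuous torque has been discretized into a finite set of actions indexed from 0; the name `ucb1_select` and the default `exploration=1.4` are illustrative assumptions, not part of the assignment.

```python
import math


def ucb1_select(visit_counts, total_values, exploration=1.4):
    """Pick an action index using the UCB1 rule.

    visit_counts[a]: how many times action a was tried from this node.
    total_values[a]: sum of the returns observed after trying action a.
    Untried actions are selected first.
    """
    for action, n in enumerate(visit_counts):
        if n == 0:
            return action
    total_visits = sum(visit_counts)

    def score(a):
        mean = total_values[a] / visit_counts[a]
        bonus = exploration * math.sqrt(math.log(total_visits) / visit_counts[a])
        return mean + bonus

    return max(range(len(visit_counts)), key=score)
```

The exploration constant only makes sense relative to the scale of the returns, so you may need to rescale the rewards or tune the constant.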

In [None]:
# Your code here

# Bonus questions (extra points)

This part is open-ended; I'm only providing hints.

- Take a look at the Deterministic Policy Gradient Theorem ([Deterministic policy gradient algorithms](http://proceedings.mlr.press/v32/silver14.html) by Silver et al, 2014) and the DDPG algorithm ([Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971), Lillicrap et al. 2015), discuss, implement, etc.
- Pick any more recent algorithm from [Lilian Weng's excellent summary](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) and implement it in a demonstrative manner.
- Take a more difficult environment and try to solve it (for instance acrobot, cart-pole or mountain-car).
- Take a question that seems difficult to you, or that is considered difficult in the literature, illustrate why it is hard, try to answer it, etc.
- Get inspiration from these papers from friends [CEM-RL](https://arxiv.org/abs/1810.01222), [overview on policy search](https://arxiv.org/abs/1803.04706), [GEP-PG](https://arxiv.org/abs/1802.05054).