In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import numpy as np

## Markov Decision Processes

**The key abstraction in reinforcement learning is the Markov decision process (MDP).** An MDP models sequential interactions with an external environment. It consists of the following:
- a **state space**
- a set of **actions**
- a **transition function**, which gives the probability of being in state $s'$ at time $t+1$ given that the MDP was in state $s$ at time $t$ and action $a$ was taken
- a **reward function**, which determines the reward received at time $t$
- a **discount factor** $\gamma$, which weights future rewards relative to immediate ones

More details are available in the [Wikipedia article on Markov decision processes](https://en.wikipedia.org/wiki/Markov_decision_process).
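
As a concrete illustration, the components listed above can be written down for a tiny MDP using plain Python dictionaries. This is only a sketch: the state names, action names, probabilities, and rewards below are invented for illustration and are not part of `gym`.

```python
import random

# A hypothetical two-state MDP, invented for illustration only.
STATES = ["cool", "hot"]     # state space
ACTIONS = ["slow", "fast"]   # set of actions
GAMMA = 0.9                  # discount factor

# Transition function: TRANSITIONS[s][a] maps each next state s' to its probability.
TRANSITIONS = {
    "cool": {"slow": {"cool": 1.0}, "fast": {"cool": 0.5, "hot": 0.5}},
    "hot": {"slow": {"cool": 0.5, "hot": 0.5}, "fast": {"hot": 1.0}},
}

# Reward function: REWARDS[s][a] is the reward for taking action a in state s.
REWARDS = {
    "cool": {"slow": 1.0, "fast": 2.0},
    "hot": {"slow": 1.0, "fast": -10.0},
}


def sample_next_state(state, action, rng=random):
    """Sample s' from the transition distribution for (state, action)."""
    distribution = TRANSITIONS[state][action]
    next_states = list(distribution)
    weights = [distribution[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]
```

Each inner dictionary of `TRANSITIONS` is a probability distribution, so its values sum to 1.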

**Note:** Reinforcement learning algorithms are often applied to problems that don't strictly fit the MDP framework. In particular, when the state of the environment is not fully observed, the Markov assumption is violated. Nevertheless, RL algorithms are frequently used in such settings.

## Policies

A **policy** is a function that takes in a **state** and returns an **action**. A policy may be deterministic, or it may be stochastic (i.e., it samples the action from a probability distribution that may depend on the state).
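
The difference between the two kinds of policy can be sketched as follows. The sketch assumes that a state is a sequence whose first entry is a number and that the actions are 0 and 1; the probabilities 0.8 and 0.2 are arbitrary values chosen for illustration.

```python
import random


def deterministic_policy(state):
    """Always returns the same action for a given state."""
    return 0 if state[0] < 0 else 1


def stochastic_policy(state, rng=random):
    """Samples an action; the probability of action 1 depends on the state."""
    prob_of_action_1 = 0.8 if state[0] >= 0 else 0.2  # arbitrary illustrative values
    return 1 if rng.random() < prob_of_action_1 else 0
```

Calling `deterministic_policy` twice on the same state always gives the same action, whereas repeated calls to `stochastic_policy` may give different actions.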

The **goal of reinforcement learning** is to learn a **policy** that maximizes the expected cumulative discounted reward in an MDP. That is, we wish to find a policy $\pi$ that solves the optimization problem

\begin{equation}
\max_{\pi} \mathbb{E}\left[\sum_{t=1}^T \gamma^t R_t\right],
\end{equation}

where $T$ is the number of steps taken in the MDP (this is a random variable and may depend on $\pi$) and $R_t$ is the reward received at time $t$ (also a random variable which depends on $\pi$). The expectation is taken over the randomness in the policy and in the transitions.
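
For a single observed episode, the discounted sum in the objective above can be computed directly from the list of rewards. The helper below is a minimal sketch; the name `discounted_return` is hypothetical, and the indexing follows the formula above, in which the first reward $R_1$ is weighted by $\gamma^1$.

```python
def discounted_return(rewards, gamma):
    """Return sum_{t=1}^{T} gamma**t * R_t for rewards = [R_1, ..., R_T]."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # 3.0
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 0.875
```

With $\gamma = 1$ the result is simply the total reward of the episode.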

A number of algorithms are available for solving reinforcement learning problems. Several of the most widely known are [value iteration](https://en.wikipedia.org/wiki/Markov_decision_process#Value_iteration), [policy iteration](https://en.wikipedia.org/wiki/Markov_decision_process#Policy_iteration), and [Q-learning](https://en.wikipedia.org/wiki/Q-learning).
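
To give a flavor of one of these algorithms, here is a minimal sketch of value iteration on a hypothetical two-state MDP. The MDP, the names `P`, `q_value`, and `value_iteration`, and the tolerance are all assumptions made for illustration. Note that value iteration needs the transition and reward functions to be known explicitly, which is not the case for simulators such as CartPole.

```python
# Hypothetical two-state, two-action MDP, invented for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(0.5, 0, 2.0), (0.5, 1, 2.0)]},
    1: {0: [(0.5, 0, 1.0), (0.5, 1, 1.0)], 1: [(1.0, 1, -10.0)]},
}
GAMMA = 0.9


def q_value(P, V, s, a, gamma):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-8):
    """Repeat the Bellman optimality update until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(P, V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Read off the greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


V, policy = value_iteration(P, GAMMA)
print(V, policy)
```

For this particular toy MDP, the values converge to approximately 15.5 for state 0 and 14.5 for state 1, and the greedy policy takes action 1 in state 0 and action 0 in state 1.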

## RL in Python

The `gym` module provides MDP interfaces to a variety of simulators. For example, the CartPole environment wraps a simple simulator of the physics of balancing a pole on a cart. This example fits into the MDP framework as follows.
- The **state** consists of the position and velocity of the cart as well as the angle and angular velocity of the pole that is balancing on the cart.
- The **actions** are to push the cart to the left or to the right with a fixed force.
- The **transition function** is deterministic and is determined by simulating physical laws.
- The **reward function** is a constant 1 for every time step until the episode ends, which happens when the pole tilts too far from vertical or the cart moves too far from the center. Therefore, maximizing the reward means balancing the pole for as long as possible.
- The **discount factor** in this case can be taken to be 1.

The code below illustrates how to create and manipulate MDPs in Python.

In [None]:
# Create a new environment (MDP).
env = gym.make('CartPole-v0')

In [None]:
# Reset the state of the environment. This returns its initial state.
env.reset()

The `env.step` method takes an action (for the CartPole environment, the valid actions are 0 and 1, which push the cart left and right, respectively). It returns a tuple of four values:
1. the new state of the environment
2. a reward
3. a boolean indicating whether the simulation has finished
4. a dictionary of miscellaneous extra information

In [None]:
# Simulate taking an action in the environment. Appropriate actions for
# the CartPole environment are 0 and 1 (for moving left and right).
action = 0
state, reward, done, info = env.step(action)
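
To make the `reset`/`step` interface less of a black box, here is a sketch of a toy environment that follows the same four-value convention. The class `CoinFlipEnv` is hypothetical and is not part of `gym`; it only mimics the shape of the interface described above.

```python
import random


class CoinFlipEnv(object):
    """Hypothetical toy environment (not part of gym) with reset/step methods."""

    def __init__(self, max_steps=5, seed=None):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps_taken = 0

    def reset(self):
        # The state is just the number of steps taken so far.
        self.steps_taken = 0
        return self.steps_taken

    def step(self, action):
        # The reward is 1 if the action (0 or 1) matches a fair coin flip.
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps_taken += 1
        done = self.steps_taken >= self.max_steps
        info = {"coin": coin}
        return self.steps_taken, reward, done, info


toy_env = CoinFlipEnv(max_steps=2, seed=0)
toy_state = toy_env.reset()
toy_state, toy_reward, toy_done, toy_info = toy_env.step(0)
print(toy_state, toy_reward, toy_done, toy_info)
```

The `done` flag becomes `True` once `max_steps` steps have been taken, which is the signal to stop stepping and call `reset` again.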

**Exercise:** Implement a function `random_rollout`, which takes an environment, resets the state of the environment, takes random actions until the simulation has finished, and returns the cumulative reward.

In [None]:
def random_rollout(env):
    # This function should do the following:
    # - reset the state of the environment
    # - take random actions until the simulation has finished
    # - return the cumulative reward
    raise NotImplementedError
    
reward = random_rollout(env)
print(reward)
reward = random_rollout(env)
print(reward)

**Exercise:** Modify the rollout function to take an environment *and* a policy. The *policy* is a function that takes in a *state* and returns an *action*.

In [None]:
def rollout_policy(env, policy):
    # This function should do the following:
    # - reset the state of the environment
    # - take actions according to the policy until the simulation has finished
    # - return the cumulative reward
    raise NotImplementedError

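def sample_policy(state):
    # state[0] is the cart position: push left (0) if the cart is left of
    # center, otherwise push right (1).
    return 0 if state[0] < 0 else 1

# Cumulative reward of one rollout under the sample policy.
reward = rollout_policy(env, sample_policy)
print(reward)
# For comparison, the cumulative reward of one rollout with random actions.
reward = random_rollout(env)
print(reward)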

**Exercise:** Modify the `rollout_policy` function to create a video showing the policy in action. This requires two steps:
1. Wrap the environment as follows: `env = gym.wrappers.Monitor(env, '/tmp/cartpole-experiment-1')`.
2. Call `env.render()` whenever you want to add a frame to the video.

**Note:** Producing the video may appear to freeze your computer. This works better when run from a regular Python interpreter, but you can solve the problem in the notebook by clicking on **Kernel -> Restart**.

In [None]:
def produce_video_of_rollout(env, policy):
    # Only `gym` is imported above, so refer to the wrapper as gym.wrappers.Monitor.
    env = gym.wrappers.Monitor(env, '/tmp/cartpole-experiment-1')
    raise NotImplementedError

# The video will be saved in the directory /tmp/cartpole-experiment-1.
produce_video_of_rollout(env, sample_policy)