# Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning concerned with **agents** acting in an **environment** in order to achieve some **goal**. The goal is expressed in the form of **rewards**: the agent receives a reward for desirable behavior. An important aspect of reinforcement learning is that rewards may be *sparse* and *time-delayed*. Reinforcement learning is often considered the machine learning paradigm closest to how humans and animals learn.

A typical reinforcement learning setup is a closed loop where
1. an agent observes the environment and decides on an action,
2. the environment carries out the action and gives the agent feedback in the form of a reward and a new observation.

<img src="images/rl.png">
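The loop can be sketched in plain Python. This is only an illustration: `CorridorEnv`, `random_agent` and `run_episode` are hypothetical names made up for this example and do not belong to any library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: reach the right end of a short corridor."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def observe(self):
        # the observation is simply the agent's current cell
        return self.position

    def act(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # sparse reward: only at the goal
        return reward, self.position, done


def random_agent(observation, rng):
    # the simplest possible agent: ignore the observation, act randomly
    return rng.choice([-1, +1])


def run_episode(env, agent, rng, max_steps=1000):
    total_reward, steps = 0.0, 0
    observation = env.observe()
    for _ in range(max_steps):
        action = agent(observation, rng)             # 1. agent decides
        reward, observation, done = env.act(action)  # 2. environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


print(run_episode(CorridorEnv(), random_agent, random.Random(0)))
```

The reward here is both sparse and delayed: the agent sees zeros until the very last step of the episode.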

# Markov Decision Process

A reinforcement learning problem is usually formalized as a Markov decision process (MDP). An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$, where
* $S$ is a set of states,
* $A$ is a set of actions,
* $P$ is a probability distribution over state transitions, i.e. $P(s'|s,a)$ is the probability of reaching state $s'$ if the agent takes action $a$ in state $s$,
* $R$ is a reward function, i.e. $R(s, a, s')$ is the reward received if the agent takes action $a$ in state $s$ and ends up in state $s'$,
* $\gamma \in [0, 1]$ is a discount factor that determines how much future rewards count relative to immediate ones.

Acting in an environment results in a trace of **states**, **actions** and **rewards**:

$$
\langle s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, \dots \rangle
$$

<!-- This trace can be finite in case of finite **episodes**, and infinite in case of infinite episodes. -->

<img src="images/mdp.png">
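As a minimal sketch, the transition distribution $P$, the reward function $R$ and a sampled trace can be written down directly in Python. The two-state "thermostat" MDP below is a hypothetical example invented for illustration, and actions are chosen uniformly at random.

```python
import random

# Hypothetical two-state MDP used only for illustration.
STATES = ["cold", "warm"]
ACTIONS = ["wait", "heat"]

# P[(s, a)] maps each next state s' to P(s' | s, a)
P = {
    ("cold", "wait"): {"cold": 1.0},
    ("cold", "heat"): {"cold": 0.2, "warm": 0.8},
    ("warm", "wait"): {"cold": 0.5, "warm": 0.5},
    ("warm", "heat"): {"warm": 1.0},
}


def R(s, a, s_next):
    # the reward depends on the full transition (s, a, s')
    reward = 1.0 if s_next == "warm" else 0.0
    return reward - (0.1 if a == "heat" else 0.0)  # heating has a small cost


def sample_trace(s0, n_steps, rng):
    """Return the flat trace [s_0, a_0, r_1, s_1, a_1, r_2, ..., s_n]."""
    trace, s = [s0], s0
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)  # uniformly random action
        next_states = list(P[(s, a)])
        weights = list(P[(s, a)].values())
        s_next = rng.choices(next_states, weights=weights)[0]
        trace += [a, R(s, a, s_next), s_next]
        s = s_next
    return trace


print(sample_trace("cold", 3, random.Random(0)))
```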

A partially observable Markov decision process (POMDP) is an extension of the MDP in which the agent cannot directly access the environment state and instead observes only a partial view of it. Examples of POMDPs are poker (the opponents' cards are part of the environment state but are not visible to you) and first-person shooter games (what is happening behind your back is part of the environment state but is not visible to you).
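The poker example can be made concrete with a small sketch. The `State` tuple and the `observe` function are hypothetical names chosen for illustration: the observation keeps only the part of the state that the agent is allowed to see.

```python
from collections import namedtuple

# Hypothetical poker-like state: both hands are part of the environment state.
State = namedtuple("State", ["my_card", "opponent_card"])


def observe(state):
    # the agent only sees its own card; the opponent's card stays hidden
    return state.my_card


s1 = State(my_card="K", opponent_card="A")
s2 = State(my_card="K", opponent_card="2")

# two different states, one identical observation
print(s1 != s2, observe(s1) == observe(s2))
```

Because both states produce the same observation, the agent cannot tell them apart from a single observation alone.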

A **policy** is the rule or strategy the agent uses to choose an action in each state, usually denoted by $\pi$. A deterministic policy is a simple function $a_t = \pi(s_t)$. A stochastic policy is represented as a conditional probability distribution $\pi(a_t|s_t)$. The goal of a reinforcement learning algorithm is to find an **optimal policy** $\pi^*$ that maximizes the expected (average) sum of rewards in an episode:

$$
\pi^* = \arg\max_{\pi} \mathbb{E} \left[ \sum_t r_t \right]
$$

The expectation here is taken over the environment transitions $P(s_{t+1}|s_t, a_t)$ and the actions chosen by the policy $\pi(a_t|s_t)$.
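For a very small MDP the argmax can be computed by brute force. The sketch below assumes a hypothetical two-state "thermostat" MDP, a fixed episode length of three steps, and deterministic policies only. It computes the expected sum of rewards exactly by recursing over the transition probabilities, then picks the best policy. Real problems are far too large for this, which is why learning algorithms are needed.

```python
import itertools

# Hypothetical two-state MDP used only for illustration.
STATES = ["cold", "warm"]
ACTIONS = ["wait", "heat"]
P = {
    ("cold", "wait"): {"cold": 1.0},
    ("cold", "heat"): {"cold": 0.2, "warm": 0.8},
    ("warm", "wait"): {"cold": 0.5, "warm": 0.5},
    ("warm", "heat"): {"warm": 1.0},
}


def R(s, a, s_next):
    # +1 for ending up warm, minus a small cost for heating
    return (1.0 if s_next == "warm" else 0.0) - (0.1 if a == "heat" else 0.0)


def expected_return(policy, s, steps_left):
    """Exact E[sum_t r_t] for a deterministic policy over a fixed horizon."""
    if steps_left == 0:
        return 0.0
    a = policy[s]
    return sum(
        prob * (R(s, a, s_next) + expected_return(policy, s_next, steps_left - 1))
        for s_next, prob in P[(s, a)].items()
    )


# enumerate every deterministic policy: one action per state
policies = [
    dict(zip(STATES, choice))
    for choice in itertools.product(ACTIONS, repeat=len(STATES))
]
best = max(policies, key=lambda pi: expected_return(pi, "cold", 3))
print(best, expected_return(best, "cold", 3))
```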

# OpenAI Gym

[OpenAI Gym](http://gym.openai.com/) provides an MDP-style interface to environments in Python. It comes with many built-in [environments](http://gym.openai.com/envs/), ranging from classic control and robotics to Atari video games. All environments implement the same basic methods:
* `reset()` restarts the environment and returns the initial state.
* `step(action)` takes an action in the environment and returns the new state, the reward, an indicator of whether the state is terminal, and auxiliary information.
* `render()` visually renders the environment for debugging.

In addition, you can get a description of the environment through its `observation_space` and `action_space` properties. Possible values for those include:
* `Discrete(N)` - a discrete space with $N$ distinct values; $N$ can be accessed with `space.n`.
* `Box(shape)` - a multi-dimensional space of the given shape; the lowest and highest values can be accessed with `space.low` and `space.high`.

Both spaces also have a `sample()` method that produces a random value from the space.

To install OpenAI Gym:
```
pip install gym
```
To install OpenAI Gym with Atari games support:
```
pip install gym[atari]
```

In [1]:
# import OpenAI Gym
import gym

## Example: Frozen Lake

*Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.*

The surface is described using a grid like the following:

```
SFFF       (S: starting point, safe)
FHFH       (F: frozen surface, safe)
FFFH       (H: hole, fall to your doom)
HFFG       (G: goal, where the frisbee is located)
```

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.
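To make these rules concrete, here is a minimal pure-Python sketch of a non-slippery version of the lake. `ToyFrozenLake` is a hypothetical class written for illustration, not Gym's implementation. It assumes states are numbered row by row (0 to 15) and that actions are numbered 0-left, 1-down, 2-right, 3-up, matching the numbering used in this notebook.

```python
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
# assumed action numbering: 0-left, 1-down, 2-right, 3-up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


class ToyFrozenLake:
    """Hypothetical, non-slippery re-implementation for illustration only."""

    def __init__(self, grid=GRID):
        self.grid = grid
        self.n_rows, self.n_cols = len(grid), len(grid[0])
        self.state = 0

    def reset(self):
        self.state = 0  # the start tile S is in the top-left corner
        return self.state

    def step(self, action):
        row, col = divmod(self.state, self.n_cols)  # state = row * n_cols + col
        d_row, d_col = MOVES[action]
        # moving into the edge of the grid leaves the position unchanged
        row = min(max(row + d_row, 0), self.n_rows - 1)
        col = min(max(col + d_col, 0), self.n_cols - 1)
        self.state = row * self.n_cols + col
        tile = self.grid[row][col]
        done = tile in "HG"  # the episode ends in a hole or at the goal
        reward = 1.0 if tile == "G" else 0.0
        return self.state, reward, done, {}


env = ToyFrozenLake()
state = env.reset()
print(env.step(1))  # one step down from the start: (4, 0.0, False, {})
```

The `step()` method returns the same four values as described for Gym's `step(action)`: the new state, the reward, the terminal indicator and an info dictionary (left empty here).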

In [2]:
# create the Frozen Lake environment
# is_slippery=False turns off the slippery ice, so every move goes exactly where intended
env = gym.make('FrozenLake-v0', is_slippery=False)

In [3]:
# print observation and action space size
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

Observation space: Discrete(16)
Action space: Discrete(4)


In [12]:
# reset environment to initial state
state = env.reset()
# states are numbered sequentially 0-15
print("Initial state:", state)
# visualize the environment
env.render()

Initial state: 0

SFFF
FHFH
FFFH
HFFG


In [18]:
# actions: 0-left, 1-down, 2-right, 3-up
action = 2
state, reward, done, info = env.step(action)
env.render()
print("State:", state, "Reward:", reward, "Done:", done, "Info:", info)

  (Right)
SFFF
FHFH
FFFH
HFFG
State: 15 Reward: 1.0 Done: True Info: {'prob': 1.0}


**Task:** Get to the goal by manually changing action and re-running the cell!

In [22]:
# play 100 random games
rewards = []
for i in range(100):
    env.reset()
    done = False
    while not done:
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
    rewards.append(reward)
print("Mean reward:", sum(rewards) / len(rewards))

Mean reward: 0.01
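With only 100 games this estimate is noisy. Since each game ends with a reward of either 0 or 1, a rough error bar can be sketched by treating the games as independent Bernoulli trials. This is an assumption made for illustration, and the numbers below are simply the ones printed above.

```python
import math

n_games = 100
mean_reward = 0.01  # value printed above

# standard error of a Bernoulli proportion: sqrt(p * (1 - p) / n)
std_error = math.sqrt(mean_reward * (1 - mean_reward) / n_games)
print(f"{mean_reward:.3f} +/- {std_error:.3f}")
```

The standard error is about as large as the estimate itself, so the mean reward will vary noticeably from one run of the cell to the next.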


## Example: CartPole

*A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.*

    Observation: 
        Num	Observation                 Min         Max
        0	Cart Position             -4.8            4.8
        1	Cart Velocity             -Inf            Inf
        2	Pole Angle                 -24 deg        24 deg
        3	Pole Velocity At Tip      -Inf            Inf
        
    Actions:
        Num	Action
        0	Push cart to the left
        1	Push cart to the right

In [23]:
# create CartPole environment
env = gym.make('CartPole-v1')
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
print("Observation space low:", env.observation_space.low)
print("Observation space high:", env.observation_space.high)

Observation space: Box(4,)
Action space: Discrete(2)
Observation space low: [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
Observation space high: [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
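The printed bounds can be related back to the table with a little arithmetic. The pole angle bound is given in radians, and the "infinite" velocity bounds are printed as the largest finite 32-bit float. The constants below are copied from the output above.

```python
import math

angle_bound_rad = 0.41887903  # from the output above
print(math.degrees(angle_bound_rad))  # approximately 24 degrees

# largest finite 32-bit float: (2 - 2**-23) * 2**127
float32_max = (2 - 2**-23) * 2.0**127
print(float32_max)  # approximately 3.4028235e+38
```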


In [26]:
import time

# play one random episode
state = env.reset()
done = False
while not done:
    # draw the current frame, then take a random action
    env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    print("Action:", action, "Reward:", reward, "Done:", done)
    # slow down so that the animation is watchable
    time.sleep(0.1)
env.close()

Action: 0 Reward: 1.0 Done: False
Action: 1 Reward: 1.0 Done: False
Action: 0 Reward: 1.0 Done: False
Action: 0 Reward: 1.0 Done: False
Action: 1 Reward: 1.0 Done: False
Action: 0 Reward: 1.0 Done: False
Action: 0 Reward: 1.0 Done: False
Action: 0 Reward: 1.0 Done: False
Action: 1 Reward: 1.0 Done: False
Action: 0 Reward: 1.0 Done: False
Action: 0 Reward: 1.0 Done: False
Action: 1 Reward: 1.0 Done: True
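The random policy drops the pole after a dozen steps. A common hand-written baseline, sketched here as a deterministic policy $a_t = \pi(s_t)$, is to push the cart toward the side the pole is leaning to. The function name `lean_policy` is made up for this example, and the sketch assumes that a positive pole angle means the pole leans to the right.

```python
def lean_policy(observation):
    """Push the cart toward the side the pole is leaning to."""
    _cart_pos, _cart_vel, pole_angle, _pole_vel = observation
    # action numbering from the table above: 0 = push left, 1 = push right
    return 1 if pole_angle > 0 else 0


print(lean_policy([0.0, 0.0, 0.05, 0.0]))   # pole leans right -> 1
print(lean_policy([0.0, 0.0, -0.05, 0.0]))  # pole leans left  -> 0
```

To try it, replace `env.action_space.sample()` with `lean_policy(state)` in the episode loop.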


## Example: Pong

*Maximize your score in the Atari 2600 game Pong. In this environment, the observation is an RGB image of the screen, which is an array of shape (210, 160, 3). Each action is repeatedly performed for a duration of $k$ frames, where $k$ is uniformly sampled from $\{2, 3, 4\}$.*
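The frame-skipping behavior can be sketched as a wrapper around a single-frame step function. The names `step_with_frameskip` and `fake_step` are hypothetical, and summing the rewards over the skipped frames is an assumption of this sketch rather than something stated in the description.

```python
import random


def step_with_frameskip(step_fn, action, rng, skips=(2, 3, 4)):
    """Repeat `action` for k frames, with k drawn uniformly from `skips`."""
    k = rng.choice(skips)
    total_reward, done, obs = 0.0, False, None
    for _ in range(k):
        obs, reward, done = step_fn(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done, k


# hypothetical single-frame emulator step: counts frames, ends after 10
frames = {"n": 0}


def fake_step(action):
    frames["n"] += 1
    return frames["n"], 0.0, frames["n"] >= 10


obs, reward, done, k = step_with_frameskip(fake_step, action=0, rng=random.Random(0))
print("frames advanced:", k, "last frame:", obs)
```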

In [27]:
# create Pong environment
env = gym.make('Pong-v0')
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
print("Observation space low:", env.observation_space.low)
print("Observation space high:", env.observation_space.high)

Observation space: Box(210, 160, 3)
Action space: Discrete(6)
Observation space low: [[[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 ...

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]]
Observation space high: [[[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 ...

In [28]:
import time

# play one random episode
env.reset()
done = False
while not done:
    # draw the current frame, then take a random action
    env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    print("Action:", action, "Reward:", reward, "Done:", done)
    # short pause so that the game is watchable
    time.sleep(0.01)
env.close()

Action: 5 Reward: 0.0 Done: False
Action: 3 Reward: 0.0 Done: False
Action: 4 Reward: 0.0 Done: False
...
Action: 1 Reward: -1.0 Done: False
Action: 0 Reward: 0.0 Done: False
...
Action: 5 Reward: 0.0 Done: False
Action: 2 Reward: 0.0 Done: False
Action: 4 Reward: -1.0 Done: True


## Example: Play

The `play()` function from the `gym.utils.play` module can be used to play an environment interactively from the keyboard.

In [29]:
from gym.utils.play import play

# make Montezuma's Revenge environment without skipping the frames
env = gym.make('MontezumaRevengeNoFrameskip-v4')

# play the game
play(env)

# keys: WASD - move, SPACE - jump, ESC - exit

pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
