Introduction
============
This tutorial introduces reinforcement learning, a subfield of machine learning concerned with designing algorithms that learn from experience. Specifically, it consists of an introductory overview of Markov Decision Processes and the Q-learning algorithm, followed by a simple Q-learning example using OpenAI's Gym.

Tutorial content
----------------

1. Review of prerequisite theoretical frameworks for the second unit.
    - [Markov Decision Process](#Markov-Decision-Process)
    - [Reinforcement Learning](#Reinforcement-Learning)
    - [Artificial Neural Networks](#Artificial-Neural-Networks)
2. Using **[tqdm](https://pypi.python.org/pypi/tqdm)**, **[opencv](http://opencv.org)**, **[tensorflow](https://www.tensorflow.org)**, and **[gym](https://gym.openai.com)** to apply reinforcement learning to *FlappyBird*.
    - [Installing the libraries](#Installing-the-libraries)
        - [Installing tqdm](#Installing-tqdm)
        - [Installing OpenCV](#Installing-OpenCV)
        - [Installing TensorFlow](#Installing-TensorFlow)
        - [Installing Gym](#Installing-Gym)
        - [Installing PLE](#Installing-PLE)
        - [Installing Gym-PLE](#Installing-Gym-PLE)
    - [Using the libraries](#Using-the-libraries)
        - [Using TensorFlow](#Using-TensorFlow)
        - [Using Gym](#Using-Gym)
    - [Tabular Q-Learning Example](#Tabular-Q-Learning-Example)
    - [Deep Q-Learning](#Deep-Q-Learning)
3. Topics for further reading.
    - [Machine Learning](#Machine-Learning)
    - [Computer Vision](#Computer-Vision)
    - [Reinforcement Learning](#Reinforcement-Learning)
    - [Deep Learning](#Deep-Learning)

---

<br><br/>
<center>
        <h1>Theory</h1>
</center>
<br><br/>

---

## Markov Decision Process

**Markov Decision Process**
- set of states $s \in S$
- set of actions $a \in A$
- a transition function $T(s,a,s^{\prime}) = \mathbf{P}(s^{\prime} \mid s,a)$, i.e. the probability of $s^{\prime}$ being the next state given action $a$ taken at state $s$.
- a reward function $R$
    - we can define the reward function to depend on any of:
        - the current state: $R(s)$
        - the current state-action pair: $R(s,a)$
        - the current state, action, and next state: $R(s,a,s^{\prime})$
- terminal states (either a goal state with positive reward, or a non-goal state with negative reward)
- discount factor $\gamma \in [0, 1]$. A discount factor of $0$ makes the agent myopic: only the immediate reward of the current state (and/or action) is considered. Conversely, a discount factor of $1$ weights all future rewards equally, so the value becomes the undiscounted expected sum of future rewards (weighted by the probability of each possible future state/action).
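
To make the role of the discount factor concrete, here is a minimal sketch that computes the discounted sum of a fixed reward sequence for a few values of $\gamma$. The reward sequence and the helper name `discounted_return` are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards[t] * gamma**t over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical reward sequence: small rewards first, a large one at the end.
rewards = [1.0, 1.0, 1.0, 10.0]

print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.5))  # later rewards are shrunk geometrically
print(discounted_return(rewards, 1.0))  # plain sum of all rewards
```

With $\gamma = 0$ the large final reward is ignored entirely, while with $\gamma = 1$ it dominates the total.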

**Markov Property**
Simply put, the outcome of any action $a$ depends only on the current state and not on any of the previous state-action pairs.
$$\mathbf{P}(s_{t+1} \mid s_{t}, a_{t}) = \mathbf{P}(s_{t+1} \mid s_{1},a_{1},s_{2},a_{2},\ldots,s_{t},a_{t})$$

**Policies**
In MDPs, instead of plans we have policies.
A policy $\pi\ :\ S \rightarrow A$ (a function mapping states to actions)
specifies what action to take in each state; $\pi^{*}$ denotes an optimal policy.


**State Value given policy $\pi$**
We can evaluate the value of a state under a policy $\pi$ by taking an expectation over the reward function, i.e. the sum of the rewards of each possible future state, weighted by the probability of reaching that future state from the current state, where future rewards are discounted by the discount factor $\gamma$:
$$
    V^{\pi}(s) = \sum_{s^{\prime} \in S} \mathbf{P}(s^{\prime} \mid s, \pi(s)) \left[ R(s,\pi(s),s^{\prime}) + \gamma V^{\pi}(s^{\prime})\right]
$$
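
The recursive definition above can be turned into a simple fixed-point iteration. The following is a minimal sketch of iterative policy evaluation on a made-up three-state MDP; the transition table `P`, the action names, and the function name `evaluate_policy` are assumptions for illustration only.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)],
        "go":   [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)],
        "go":   [(1.0, 2, 1.0)]},
    2: {},  # terminal state: no actions, value stays at 0
}
policy = {0: "go", 1: "go"}  # pi(s) for every non-terminal state


def evaluate_policy(P, policy, gamma=0.9, tol=1e-10):
    """Repeatedly apply the V^pi equation until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, a in policy.items():
            new_v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


V = evaluate_policy(P, policy)
print(V)
```

For this toy problem the result can be checked by hand: $V^{\pi}(1) = 1$, and $V^{\pi}(0)$ solves $V = 0.8 \cdot 0.9 \cdot 1 + 0.2 \cdot 0.9 \cdot V$.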

**State-Action Value given policy $\pi$**
We can evaluate the value of a state-action pair under a policy $\pi$ with a similar expectation, i.e. the sum of the rewards of each possible next state, weighted by the probability of reaching that state after taking action $a$ in state $s$, with future rewards discounted by the discount factor $\gamma$:
$$
    Q^{\pi}(s,a) = \sum_{s^{\prime} \in S} \mathbf{P}(s^{\prime} \mid s,a) \left[ R(s,a,s^{\prime}) + \gamma V^{\pi}(s^{\prime})\right]
$$
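
As a small worked example of this one-step lookahead, the sketch below computes $Q^{\pi}(s,a)$ for a single state-action pair from assumed values of $V^{\pi}$. The state names, probabilities, and rewards are invented for illustration.

```python
def q_from_v(transitions, V, gamma):
    """sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions)


V = {"A": 2.0, "B": 5.0}          # assumed values of V^pi for the next states
transitions = [(0.7, "A", 1.0),   # (probability, next state, reward)
               (0.3, "B", 0.0)]

q = q_from_v(transitions, V, gamma=0.9)
print(q)  # 0.7 * (1 + 0.9 * 2) + 0.3 * (0 + 0.9 * 5)
```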


**Optimal State Value**
For any given state $s_{i}$, we can determine the optimal state value by taking the $\max$ over actions of the same expectation, instead of following a fixed policy,
$$
    V^{*}(s_{i}) = \max_{a}\ \left( \sum_{s_{j} \in S} \mathbf{P}(s_{j} \mid s_{i}, a) \left[ R(s_{i}, a, s_{j}) + \gamma V^{*}(s_{j}) \right] \right)
$$

**Optimal State-Action Value**
Similarly, we can determine the optimal action to take in any given state (i.e. the optimal state-action pair for that state) by applying the $\text{argmax}$ function to the optimal state-action values,
$$
\begin{align*}
    \pi^{*}(s_{i}) &= \text{argmax}_{a}\ Q^{*}(s_{i}, a)\\
            &= \text{argmax}_{a}\ \left( \sum_{s_{j} \in S} \mathbf{P}(s_{j} \mid s_{i}, a) \left[ R(s_{i}, a, s_{j}) + \gamma V^{*}(s_{j}) \right] \right)
\end{align*}
$$
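
When the transition and reward functions are known, the optimal values and the greedy policy can be computed by value iteration. The following is a minimal sketch on a made-up MDP; the table `P`, the state and action names, and the helper functions are assumptions for illustration, not part of any library.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.9, "s1", 0.0), (0.1, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "end", 1.0)]},
    "end": {},  # terminal state
}


def lookahead(P, V, s, a, gamma):
    """Expected value of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-10):
    """Apply the max-over-actions backup until the values converge."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:          # skip terminal states
                continue
            best = max(lookahead(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(P, V, gamma=0.9):
    """pi*(s) = argmax over a of the one-step lookahead value."""
    return {s: max(P[s], key=lambda a: lookahead(P, V, s, a, gamma))
            for s in P if P[s]}


V_star = value_iteration(P)
pi_star = greedy_policy(P, V_star)
print(V_star)
print(pi_star)
```

In this toy problem moving right is optimal in both non-terminal states, since it is the only way to reach the rewarding terminal state.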

It is worth pointing out that in a deterministic environment (i.e. the transition function always has a probability of $1$ or $0$) where the reward is always $1$ over an infinite horizon, the $V^{\pi}$ and $Q^{\pi}$ functions take the form of a geometric series, which converges for $0 \le \gamma < 1$: $$\sum_{k=0}^{\infty} \gamma^{k} = \lim_{C \rightarrow \infty} \sum_{k=0}^{C} \gamma^{k} = \frac{1}{1 - \gamma}$$
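
The closed form is easy to check numerically. This short sketch compares a long partial sum against $\frac{1}{1-\gamma}$ for an arbitrarily chosen $\gamma = 0.9$.

```python
gamma = 0.9  # any value with 0 <= gamma < 1 works

# Partial sum with enough terms that the remaining tail is negligible.
partial_sum = sum(gamma ** k for k in range(1000))
closed_form = 1.0 / (1.0 - gamma)

print(partial_sum, closed_form)
```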

---

## Reinforcement Learning

In some ways Reinforcement Learning is an extension of the Markov Decision Process. While in an MDP the algorithm is given a model of the reward and transition functions, in RL the algorithm has no such model and must instead learn from rewards and transitions sampled from the environment.

[<img src="https://gym.openai.com/static/img/tutorial/aeloop.svg">](https://gym.openai.com/static/img/tutorial/aeloop.svg)

There are several different flavors of RL; in this tutorial we cover a specific model-free reinforcement learning algorithm, Q-learning.

Simply put, the Q-learning algorithm is so named because it learns an optimal policy for a finite MDP by approximating the optimal state-action value function $Q(s,a)$.

This approximation is accomplished using direct sampling, where $\alpha$ is the learning rate,
$$
\begin{align*}
Q_{\text{sample}}(s,a) &= R(s,a,s^{\prime}) + \gamma \max_{a^{\prime}} Q(s^{\prime}, a^{\prime}) \\
Q(s,a) &= (1 - \alpha) Q(s,a) + (\alpha) Q_{\text{sample}}(s,a)
\end{align*}
$$
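
A single update can be worked through by hand. The sketch below applies the two equations above to made-up numbers; the function name `q_update` and all values are illustrative assumptions.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma):
    """One Q-learning update for a single observed transition."""
    # Q_sample(s,a) = R(s,a,s') + gamma * max over a' of Q(s',a')
    sample = reward + gamma * max(next_q_values)
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * Q_sample(s,a)
    return (1 - alpha) * q_sa + alpha * sample


new_q = q_update(q_sa=2.0, reward=1.0, next_q_values=[0.0, 3.0, 1.0],
                 alpha=0.1, gamma=0.9)
print(new_q)  # sample = 1 + 0.9 * 3 = 3.7, so new_q = 0.9 * 2 + 0.1 * 3.7
```

The old estimate moves a fraction $\alpha$ of the way toward the sampled target, so a small $\alpha$ averages over many noisy samples.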

## Artificial Neural Networks

[<img src="http://www.global-warming-and-the-climate.com/images/Neuron-input.GIF">](http://www.global-warming-and-the-climate.com/images/Neuron-input.GIF)
[<img src="http://tensorfly.cn/special/deeplearning/images/tikz35.png">](http://tensorfly.cn/special/deeplearning/images/tikz35.png)

---

<br><br/>
<center>
        <h1>Tutorial</h1>
</center>
<br><br/>

---


# Installing the libraries
Before getting started, you'll need to install tqdm, OpenCV, TensorFlow, OpenAI Gym, PLE (PyGame Learning Environment), and gym-ple.

---

### Installing tqdm
A simple library for displaying a progress bar for loops, useful for visualizing the progress of a DQN training loop.

    $ pip install tqdm

### Installing OpenCV
If you haven't already, install <code> cmake </code>:

    $ brew install -v cmake
    
Now, we install OpenCV with homebrew.

    $ brew install homebrew/science/opencv

### Installing tensorflow
On OSX you can install the CPU only version of tensorflow for python 2.7 with 'pip':
    
    $ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.11.0rc0-py2-none-any.whl

    $ sudo pip install --upgrade $TF_BINARY_URL
   
If you want a different version, check out the other [tf binaries](https://www.tensorflow.org/versions/r0.11/get_started/os_setup.html#pip-installation).

### Installing gym
We will be using PLE's 'FlappyBird' environment, so we cannot just do a minimal install of gym. Doing a full install entails installing some dependencies first, which we can do with 'brew' on OSX:

    $ brew install cmake boost boost-python sdl2 swig wget


Now, we are ready to install the gym module with 'pip':

    $ pip install 'gym[all]'

### Installing ple
Before installing the PyGame Learning Environment, we need to install its dependencies. Note that although PLE requires a NumPy installation, tensorflow has already installed it for us. We can use a combination of 'brew' and 'pip' to install the remaining dependencies:

       $ pip install pillow
       
       $ brew install sdl sdl_ttf sdl_image sdl_mixer portmidi
       
       $ pip install pygame
    

We are now ready to install PLE:

    $ git clone https://github.com/ntasfi/PyGame-Learning-Environment.git
    
    $ cd PyGame-Learning-Environment/
    
    $ pip install -e . 

##### Note
Currently PLE does not support its Doom environment on OSX (you can get around this, but we won't bother). Annoyingly, however, importing PLE will throw an error asking the user to install the Doom environment.

We can get around this by commenting out a few lines of code (since we won't be using the Doom Env):

Navigate to wherever you cloned 'Pygame-Learning-Environment'.

- Open up '/Pygame-Learning-Environment/ple/ple.py', and comment out the lines on 127-128


        if isinstance(self.game, base.DoomWrapper):
            self.rng = rng

- Next, open up '/Pygame-Learning-Environment/ple/games/\__init\__.py', and comment out the import on line 1:

        from .doom import Doom

- Lastly, open up '/Pygame-Learning-Environment/ple/games/base/\__init\__.py', and comment out the import on line 2:

        from .doomwrapper import DoomWrapper

### Installing gym-ple
Now we just need to install 'gym-ple', so that we can use PLE environments in 'gym'. We'll just clone the git repo, and install with 'pip':

    $ git clone https://github.com/lusob/gym-ple.git
    
    $ cd gym-ple/
    
    $ pip install -e .
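
Once everything is installed, a quick sanity check can confirm that each package imports. This is a minimal sketch: the helper `check_imports` is hypothetical, and the module names are assumed from the install steps above (OpenCV is imported as `cv2`).

```python
def check_imports(module_names):
    """Return {name: True/False} depending on whether each module imports."""
    status = {}
    for name in module_names:
        try:
            __import__(name)
            status[name] = True
        except Exception:
            # Catch broadly: a broken install may raise something other
            # than ImportError while the module is being loaded.
            status[name] = False
    return status


# Module names assumed from the installation steps in this section.
required = ["tqdm", "cv2", "tensorflow", "gym", "ple", "gym_ple"]
for name, ok in sorted(check_imports(required).items()):
    print("%-12s %s" % (name, "ok" if ok else "MISSING"))
```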

In [None]:
import tensorflow as tf
import numpy as np
import gym
import gym_ple

from tqdm import tqdm
from collections import defaultdict
import logging, os

---

# Using the libraries

---

## Using TensorFlow

Here, we implement logistic regression in TensorFlow for the MNIST dataset, which is loaded through TensorFlow's bundled tutorial helpers.

In [None]:
# Import MNIST
from tensorflow.examples.tutorials.mnist import input_data


class LogisticRegression(object):
    """
    perform logistic regression, using gradient descent.
    """
    def __init__(self, **userconfig):
        mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
        self.mnist = mnist
        
        self.X_train = mnist.train.images
        self.y_train = mnist.train.labels
        self.X_test = mnist.test.images
        self.y_test = mnist.test.labels
        
        self.config = {
            "learning_rate" : 0.01,
            "n_epochs" : 20,
            "batch_size" : 100,
            "n_batches" : int(mnist.train.num_examples / 100)}
        if userconfig: 
            self.config.update(userconfig)
        
        self.X = tf.placeholder(tf.float32, [None, 784])
        self.y = tf.placeholder(tf.float32, [None, 10])
        
        self.W = tf.Variable(tf.zeros([784, 10])) # weight matrix
        self.b = tf.Variable(tf.zeros([10]))      # bias array
    
    def predict(self):
        return tf.nn.softmax(tf.matmul(self.X, self.W) + self.b)
    
    def cost(self):
        """
        use the cross-entropy loss function to measure the prediction error
        """
        return tf.reduce_mean(-tf.reduce_sum(self.y * tf.log(self.predict()), reduction_indices=1))
    
    def gradient(self):
        """
        Gradient Descent
        """
        return tf.train.GradientDescentOptimizer(self.config["learning_rate"]).minimize(self.cost())
    
        
    def train(self):
        # build the training and cost ops once, rather than on every batch
        train_op = self.gradient()
        cost_op = self.cost()
        init = tf.initialize_all_variables()
        with tf.Session() as sess:
            sess.run(init)
            
            epoch_costs = []
            for epoch in tqdm(range(self.config["n_epochs"])):
                batch_costs = []
                for t in range(self.config["n_batches"]):
                    batch_xs, batch_ys = self.mnist.train.next_batch(self.config["batch_size"])
                    _, batch_cost = sess.run([train_op, cost_op], 
                                             feed_dict={self.X: batch_xs, self.y: batch_ys})
                    batch_costs.append(batch_cost)
                # average cost over the epoch
                epoch_costs.append(sum(batch_costs) / self.config["n_batches"])
        return epoch_costs
                

In [None]:
LR = LogisticRegression()
_ = LR.train()
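# these lines only define the accuracy ops; to get a number they must be run
# in a session that holds the trained variables, with X and y fed in
model_correct = tf.equal(tf.argmax(LR.predict(), 1), tf.argmax(LR.y, 1))
model_accuracy = tf.reduce_mean(tf.cast(model_correct, tf.float32))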
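
To see what the model computes without a TensorFlow session, here is a NumPy-only sketch of the same forward pass (softmax over a linear layer), the cross-entropy cost, and one gradient descent step. It uses random stand-in data rather than MNIST, and the helper names are illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, y_onehot):
    """Mean over the batch of -sum(y * log(p))."""
    return np.mean(-np.sum(y_onehot * np.log(probs), axis=1))


rng = np.random.RandomState(0)
X = rng.rand(5, 784)                          # 5 fake flattened "images"
y = np.eye(10)[rng.randint(0, 10, size=5)]    # fake one-hot labels

W = np.zeros((784, 10))                       # same zero init as above
b = np.zeros(10)

probs = softmax(X.dot(W) + b)
loss_before = cross_entropy(probs, y)         # equals log(10) for zero weights

# Gradient of the mean cross-entropy w.r.t. the logits is (probs - y) / batch.
grad_logits = (probs - y) / X.shape[0]
W -= 0.01 * X.T.dot(grad_logits)
b -= 0.01 * grad_logits.sum(axis=0)

loss_after = cross_entropy(softmax(X.dot(W) + b), y)
print(loss_before, loss_after)
```

With all-zero weights every class gets probability $0.1$, so the initial cost is $\log 10$; a single small gradient step should already reduce it.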

## Using Gym

OpenAI's Gym is an open source toolkit for reinforcement learning which provides a standardized environment so that one can meaningfully compare different reinforcement learning algorithms.

Gym is composed of Environments and Spaces.

Environments define the actual MDP to be learned, including its Spaces and the reward function.

Spaces define the representation of actions and observations.

For a more in-depth introduction, you can consult the library [documentation](https://gym.openai.com/docs) (about one page).
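
The agent-environment loop can be illustrated without Gym itself. The class below is a hypothetical stand-in (it is not part of Gym) that mimics the `reset()` / `step()` shape used in this tutorial, where `step` returns an `(observation, reward, done, info)` tuple.

```python
import random


class CoinFlipEnv(object):
    """Hypothetical toy environment: guess the outcome of a coin flip."""
    n_actions = 2  # action 0 = guess heads, action 1 = guess tails

    def __init__(self, max_steps=10, seed=0):
        self.rng = random.Random(seed)
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t                    # observation: current step index

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.max_steps  # episode ends after max_steps flips
        return self.t, reward, done, {}  # (observation, reward, done, info)


env = CoinFlipEnv()
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.randint(0, env.n_actions - 1)  # act randomly
    observation, reward, done, info = env.step(action)
    total_reward += reward

print(observation, total_reward)
```

A real Gym environment additionally exposes `action_space` and `observation_space` objects, which is what the agents in the following cells rely on.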


In [None]:
# define an agent
class Random_Agent(object):
    """ Simple agent example, which acts randomly """
    def __init__(self, action_space):
        self.action_space = action_space
    
    def act(self, observation, reward, done):
        return self.action_space.sample()

In [None]:
# setup logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# __file__ is not defined inside a notebook, so write results under the working directory
outdir = os.path.join(os.getcwd(), 'results/Random-Agent')

# setup the environment
env = gym.make('Taxi-v1')
env.monitor.start(outdir, force=True)

# assign the agent
agent = Random_Agent(env.action_space)

episodes = 100
steps = 200
reward = 0
done = False

# run our random agent on the Taxi environment
for episode in tqdm(range(episodes)):
    observation = env.reset()
    
    for step in range(steps):
        action = agent.act(observation, reward, done)
        # keep the latest observation so the next call to act() sees it
        observation, reward, done, _ = env.step(action)
        if done:
            break

env.monitor.close()
logger.info("Successfully ran Random-Agent!")

---

### Tabular Q-Learning Example

In [None]:
class TQ_Agent(object):
    """
    A simple agent implementing Epsilon Greedy Q-learning, which uses a 2d-dict to store Q-values.
    (i.e. self.q = dict[observation] -> dict[action] -> Q-val for (observation-action) pair).
    """
    
    def __init__(self, observation_space, action_space, **userconfig):
        self.observation_space = observation_space
        self.action_space = action_space
        self.action_n = action_space.n
        self.config = {
            "mean" : 0.0,           # Initialize Q values with this mean
            "std" : 0.0,            # Initialize Q values with this standard deviation
            "alpha" : 0.1,          # Learning rate
            "epsilon": 0.05,        # Epsilon in epsilon greedy policies
            "gamma": 0.95,          # Discount factor
            "n_iter": 10000}        # Number of iterations
        if userconfig:
            self.config.update(userconfig)
        # allow for random initialization of Q-values -- following a Normal Distribution.
        self.q = defaultdict(lambda: self.config["std"] * np.random.randn(self.action_n) + self.config["mean"])

    def act(self, observation):
        """ 
        E-Greedy: 
            - with probability (1-epsilon) choose argmax{ Q(s,a) for all a}
            - with probability (epsilon) choose a random action
        """
        random_action = self.action_space.sample()
        best_action = np.argmax(self.q[observation])
        return best_action if np.random.random() > self.config["epsilon"] else random_action

    def learn(self, env):
        """
        sampleQ(s,a) = R(s,a,s') + gamma * max{ Q(s',a') for all a'} 

        updatedQ = (1-alpha) * Q(s,a)  +  (alpha) * sampleQ(s,a)
                 = ( Q(s,a) - alpha * Q(s,a) )  +  ( alpha * sampleQ(s,a) )
                 = Q(s,a) + ( - alpha * Q(s,a) )  -  ( -alpha * sampleQ(s,a) )
                 = Q(s,a) - alpha * ( Q(s,a) - sampleQ(s,a) )
            => self.q[s][a] -= alpha * ( self.q[s][a] - sampleQ(s,a) ) 
        """
        observation = env.reset()
        for t in range(self.config["n_iter"]):
            action = self.act(observation)
            next_observation, reward, done, _ = env.step(action)

            # bootstrap from the best Q-value of the *next* state s'
            sampleQ = reward + self.config["gamma"] * (np.max(self.q[next_observation]) if not done else 0.0)
            self.q[observation][action] -= self.config["alpha"] * (self.q[observation][action] - sampleQ)

            observation = next_observation
            if done:
                break

In [None]:
# setup logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# __file__ is not defined inside a notebook, so write results under the working directory
outdir = os.path.join(os.getcwd(), 'results/TQ-Agent')
print(outdir)

# setup the environment
env = gym.make('Taxi-v1')
env.monitor.start(outdir, force=True)

# assign the agent
agent = TQ_Agent(env.observation_space, env.action_space)

# train the agent on the Taxi environment
episodes = 100
for episode in tqdm(range(episodes)):
    agent.learn(env)

env.monitor.close()
logger.info("Successfully ran TQ-Agent!")
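
For readers without a working Gym install, the same epsilon-greedy tabular Q-learning idea can be tried on a tiny hand-written problem. The sketch below is an illustration under stated assumptions, not the agent above: the five-state chain, the `step` function, and the random tie-breaking in the greedy choice are all made up for this example.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical chain: states 0..4, goal at state 4
ACTIONS = [0, 1]        # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics; reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)     # explore
            else:
                best = max(q[state])             # exploit, ties broken randomly
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # sampleQ = R + gamma * max over a' of Q(s', a'), zero at terminal
            target = reward + gamma * (0.0 if done else max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = train()
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)
```

After training, the greedy action in every non-goal state should be "right", since that is the shortest path to the only reward.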

---

## Deep Q-Learning

[<img src="http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f1.jpg">](http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f1.jpg)

The above figure shows the general layout of the DQN used in Google DeepMind's Atari paper. It is outside the scope of this tutorial; however, if you are interested, feel free to check out this excellent [blog post](https://www.nervanasys.com/demystifying-deep-reinforcement-learning/) covering the DeepMind paper (specifically, the section on the Deep Q-Network). Or just read the [paper](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html).

Here is a similar algorithm for applying a Deep Q-Network to *FlappyBird*, covered [here](http://cs229.stanford.edu/proj2015/362_report.pdf).

Let's begin by considering the FlappyBird environment.

The action space is Discrete, which is pretty straightforward.
However, the observation space is a Box defined as <code> spaces.Box(low=0, high=255, shape=(self.screen_width, self.screen_height, 3)) </code>, so each observation is a single RGB image of the screen: the trailing <code> 3 </code> is the number of color channels, and every pixel value lies between 0 and 255.
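
DQN-style agents commonly convert each screen image to grayscale and stack the most recent few frames, so that the network can infer motion. The following NumPy sketch shows that preprocessing idea on random stand-in frames; the screen size, the number of stacked frames, and the helper `to_grayscale` are assumptions for illustration, not values read from the environment.

```python
import numpy as np
from collections import deque


def to_grayscale(frame):
    """Average the color channels of a (width, height, 3) uint8 frame
    and rescale the result to the range [0, 1]."""
    return frame.astype(np.float32).mean(axis=2) / 255.0


# Hypothetical screen size and history length, for illustration only.
width, height, n_frames = 288, 512, 4
history = deque(maxlen=n_frames)   # keeps only the latest n_frames frames

rng = np.random.RandomState(0)
for _ in range(6):                 # pretend to observe six screen images
    frame = rng.randint(0, 256, size=(width, height, 3)).astype(np.uint8)
    history.append(to_grayscale(frame))

# Stack the retained frames along a new last axis.
state = np.stack(list(history), axis=-1)
print(state.shape)  # (288, 512, 4)
```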

Here is some pseudocode covering what is going on behind the scenes: 
[<img src="https://www.nervanasys.com/wp-content/uploads/2015/12/Screen-Shot-2015-12-21-at-11.23.43-AM-1.png">](https://www.nervanasys.com/wp-content/uploads/2015/12/Screen-Shot-2015-12-21-at-11.23.43-AM-1.png)

---

<br><br/>
<center>
        <h1>Additional Resources</h1>
</center>
<br><br/>

---

## Machine Learning

- introductory resources:
    - [stanford machine learning (coursera)](https://www.coursera.org/learn/machine-learning)
    - [Introductory chapter of deep learning book](http://www.deeplearningbook.org/contents/intro.html)
    
- advanced resources:
    - [10-601 S15 lecture videos (youtube)](https://www.youtube.com/playlist?list=PLAJ0alZrN8rD63LD0FkzKFiFgkOmEtltQ)
    - [10-701 F14 lecture videos (youtube)](https://www.youtube.com/playlist?list=PLAJ0alZrN8rC-QCaaZ0Z-brjoWyIO8CKd)

- textbooks:
    - [Tom Mitchell's ML book](http://www.cs.cmu.edu/~tom/mlbook.html)

---

## Computer Vision

- [Simon Prince's cv book](http://www.computervisionmodels.com)

---

## Reinforcement Learning

- [Reinforcement Learning: An Introduction](http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html)

---

## Deep Learning

- [neural nets and deep learning](http://neuralnetworksanddeeplearning.com)
- [deep learning book](http://www.deeplearningbook.org)