Run the following command to install the dependencies:

```pip install tensorflow==1.13.1 keras keras-rl gym```

If you want to dig deeper, see [Deep RL](https://github.com/trokas/Deep_RL), which contains more examples and intuitive lower-level implementations. This [Medium](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724) series is also great, and a good book on the topic is Deep Reinforcement Learning Hands-On by Maxim Lapan.

In [2]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

# Q-Learning

Let's start by looking at a Markov decision process.

<img src="img/markov_decision_process.png" alt="Markov decision process" style="width: 600px;"/>

This can be represented with transition probabilities, rewards and the actions available in each state as follows:

In [3]:
transition_probabilities = [ # shape=[s, a, s']
        [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
        [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
        [None, [0.8, 0.1, 0.1], None]]
rewards = [ # shape=[s, a, s']
        [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
        [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
        [[0, 0, 0], [+40, 0, 0], [0, 0, 0]]]
possible_actions = [[0, 1, 2], [0, 2], [1]]

Now we will run the iterative optimization process (Q-value iteration):

$$Q_{k+1} (s,a) \leftarrow \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma \max_{a'} Q_k(s', a')] \; \text{for all} \; (s, a)$$
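
As a worked example of a single update, take state $s_0$ and action $a_0$ from the tables above, and assume (as the code below does) that every possible action starts at $Q_0 = 0$ with $\gamma = 0.9$. The first iteration then gives:

$$Q_1(s_0, a_0) = 0.7 \, [10 + 0.9 \cdot 0] + 0.3 \, [0 + 0.9 \cdot 0] + 0.0 \, [0 + 0.9 \cdot 0] = 7$$

Later iterations repeat the same sum, but with the non-zero estimates from the previous step in place of $Q_0$.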

In [5]:
Q_values = np.full((3, 3), -np.inf) # -np.inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q_values[state, actions] = 0.0  # for all possible actions
    
gamma = 0.90 # the discount factor

for iteration in range(50):
    Q_prev = Q_values.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q_values[s, a] = np.sum([
                    transition_probabilities[s][a][sp]
                    * (rewards[s][a][sp] + gamma * np.max(Q_prev[sp]))
                for sp in range(3)])

In [6]:
Q_values

array([[18.91891892, 17.02702702, 13.62162162],
       [ 0.        ,        -inf, -4.87971488],
       [       -inf, 50.13365013,        -inf]])
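
Once the Q-values have converged, a greedy policy can be read off by picking the best action in each state. The sketch below hardcodes the array printed above so that it runs on its own; the name `optimal_policy` is ours and not part of the original notebook.

```python
import numpy as np

# Q-values copied from the output above (-inf marks impossible actions)
Q_values = np.array([
    [18.91891892, 17.02702702, 13.62162162],
    [0.0, -np.inf, -4.87971488],
    [-np.inf, 50.13365013, -np.inf],
])

# For each state (row), choose the action (column) with the highest Q-value
optimal_policy = np.argmax(Q_values, axis=1)
print(optimal_policy)  # [0 0 1]
```

So with $\gamma = 0.9$ the greedy choice is action 0 in states 0 and 1, and action 1 (the only one available) in state 2.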

Discounting future rewards in Q-values is one of the fundamental ideas in RL. In practice we do not know the transition probabilities and rewards in advance, but as we will see, an agent can still learn good Q-values by interacting with the environment.
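
To illustrate learning without access to the model, here is a minimal sketch of tabular Q-learning on the same MDP. The agent only sees sampled transitions through a hypothetical `step` helper. The purely random exploration policy, the learning-rate schedule (`alpha0`, `decay`), the number of steps and the seed are all assumptions chosen for illustration, not values from the original notebook.

```python
import numpy as np

transition_probabilities = [  # shape=[s, a, s']
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None]]
rewards = [  # shape=[s, a, s']
    [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
    [[0, 0, 0], [+40, 0, 0], [0, 0, 0]]]
possible_actions = [[0, 1, 2], [0, 2], [1]]

rng = np.random.default_rng(42)  # fixed seed, an arbitrary choice

def step(state, action):
    """Sample the next state and reward; this is all the agent observes."""
    probs = transition_probabilities[state][action]
    next_state = rng.choice(3, p=probs)
    return next_state, rewards[state][action][next_state]

Q = np.full((3, 3), -np.inf)  # -np.inf for impossible actions
for s, actions in enumerate(possible_actions):
    Q[s, actions] = 0.0

gamma = 0.90                 # the discount factor
alpha0, decay = 0.05, 0.005  # assumed learning-rate schedule
state = 0

for iteration in range(10000):
    action = rng.choice(possible_actions[state])  # explore at random
    next_state, reward = step(state, action)
    alpha = alpha0 / (1 + iteration * decay)
    # Move Q(s, a) towards the sampled target r + gamma * max_a' Q(s', a')
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
    state = next_state

print(Q)
```

The update uses the same target as Q-value iteration, but replaces the sum over $s'$ with a single sampled transition, so the estimates are noisy and depend on the schedule chosen above.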

# Cartpole and DQN

Get the environment and extract the number of actions.

We will try to balance a pole on a moving cart: [CartPole](https://github.com/openai/gym/wiki/CartPole-v0)

![](https://miro.medium.com/max/960/1*G_whtIrY9fGlw3It6HFfhA.gif)

To take on this challenge we are going to use the [OpenAI gym](https://gym.openai.com/docs/), a collection of reinforcement learning environments.

- Observations: the agent needs to know where the pole currently is and the angle at which it is balancing.
- Delayed reward: keeping the pole in the air as long as possible means choosing moves that are advantageous both now and in the future.

In [40]:
env = gym.make('CartPole-v0')
nb_actions = env.action_space.n
print('Number of actions', nb_actions)

Number of actions 2


Let's build a simple NN model.

In [41]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_7 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_19 (Dense)             (None, 16)                80        
_________________________________________________________________
activation_19 (Activation)   (None, 16)                0         
_________________________________________________________________
dense_20 (Dense)             (None, 16)                272       
_________________________________________________________________
activation_20 (Activation)   (None, 16)                0         
_________________________________________________________________
dense_21 (Dense)             (None, 16)                272       
_________________________________________________________________
activation_21 (Activation)   (None, 16)               
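
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below applies that formula to the layer sizes of this model (4 observations, three hidden layers of 16 units, 2 actions). The helper `dense_params` is a hypothetical name used only here.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# observation size, three hidden layers, number of actions
layer_sizes = [4, 16, 16, 16, 2]
counts = [dense_params(i, o)
          for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [80, 272, 272, 34]
print(sum(counts))  # 658
```

The first three values match the `Param #` column shown above. By the same formula, the final `Dense(nb_actions)` layer should have 34 parameters.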

Finally, we configure and compile our agent. We will use Epsilon Greedy:
- All actions initially are tried with non-zero probability
- With probability $1-\epsilon$ choose the greedy action
- With probability $\epsilon$ choose an action at random

and we will estimate the target Q-value from the observed reward and the discounted estimate of the future value

$$Q_{target}(s,a) = r + \gamma \cdot \max_{a'} Q_\theta (s', a').$$
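
The two ingredients above can be sketched in plain NumPy. This is an illustration of the idea only, not the keras-rl implementation: the function names, the `rng` argument and the `done` flag are assumptions. Dropping the bootstrap term at the end of an episode is a common convention that the formula above does not spell out.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def q_target(reward, next_q_values, gamma, done):
    """Target r + gamma * max_a' Q(s', a'); no bootstrap after a terminal step."""
    if done:
        return reward
    return reward + gamma * np.max(next_q_values)

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9])

print(epsilon_greedy_action(q_values, epsilon=0.0, rng=rng))    # 1 (always greedy)
print(q_target(1.0, np.array([2.0, 3.0]), gamma=0.9, done=False))
```

With `epsilon=0.0` the choice is always greedy. The second call evaluates $1 + 0.9 \cdot 3$, which is approximately 3.7.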

In [42]:
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10, 
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

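Let's see how the agent performs before training. Note that the pole does not have to fall all the way over for gym to count the episode as failed: the episode ends as soon as the pole tilts too far from vertical or the cart moves too far from the center.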

In [43]:
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: 71.000, steps: 71
Episode 2: reward: 140.000, steps: 140
Episode 3: reward: 40.000, steps: 40
Episode 4: reward: 54.000, steps: 54
Episode 5: reward: 66.000, steps: 66


<keras.callbacks.callbacks.History at 0x132f61410>

Okay, now it's time to learn something! You can visualize the training by setting `visualize=True`, but this
slows down training quite a lot.

In [44]:
dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)

Training for 5000 steps ...




   27/5000: episode: 1, duration: 2.233s, episode steps: 27, steps per second: 12, episode reward: 27.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.519 [0.000, 1.000], mean observation: 0.111 [-0.189, 0.928], loss: 0.447221, mae: 0.503291, mean_q: 0.109952




  113/5000: episode: 2, duration: 0.542s, episode steps: 86, steps per second: 159, episode reward: 86.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.453 [0.000, 1.000], mean observation: -0.201 [-1.454, 0.273], loss: 0.152574, mae: 0.524421, mean_q: 0.741481
  126/5000: episode: 3, duration: 0.103s, episode steps: 13, steps per second: 126, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.385 [0.000, 1.000], mean observation: 0.112 [-0.764, 1.213], loss: 0.027968, mae: 0.693243, mean_q: 1.402542
  139/5000: episode: 4, duration: 0.090s, episode steps: 13, steps per second: 144, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.385 [0.000, 1.000], mean observation: 0.105 [-0.950, 1.513], loss: 0.036448, mae: 0.729355, mean_q: 1.487590
  148/5000: episode: 5, duration: 0.099s, episode steps: 9, steps per second: 91, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.111 [0.000, 1.000], mean observation: 0.135

  437/5000: episode: 34, duration: 0.124s, episode steps: 11, steps per second: 89, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.091 [0.000, 1.000], mean observation: 0.150 [-1.746, 2.849], loss: 0.206125, mae: 1.956131, mean_q: 3.776603
  446/5000: episode: 35, duration: 0.080s, episode steps: 9, steps per second: 113, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.126 [-1.772, 2.749], loss: 0.262391, mae: 1.958308, mean_q: 3.694489
  456/5000: episode: 36, duration: 0.083s, episode steps: 10, steps per second: 120, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.121 [-1.585, 2.492], loss: 0.138729, mae: 2.035957, mean_q: 4.020289
  466/5000: episode: 37, duration: 0.083s, episode steps: 10, steps per second: 121, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.900 [0.000, 1.000], mean observation: -0

 1269/5000: episode: 65, duration: 0.285s, episode steps: 42, steps per second: 147, episode reward: 42.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.524 [0.000, 1.000], mean observation: 0.097 [-0.411, 0.810], loss: 0.479696, mae: 4.827694, mean_q: 9.430436
 1328/5000: episode: 66, duration: 0.400s, episode steps: 59, steps per second: 148, episode reward: 59.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.475 [0.000, 1.000], mean observation: -0.108 [-0.708, 0.269], loss: 0.637441, mae: 5.085500, mean_q: 9.896491
 1376/5000: episode: 67, duration: 0.305s, episode steps: 48, steps per second: 157, episode reward: 48.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.479 [0.000, 1.000], mean observation: -0.131 [-0.945, 0.272], loss: 0.586007, mae: 5.303317, mean_q: 10.316116
 1400/5000: episode: 68, duration: 0.164s, episode steps: 24, steps per second: 146, episode reward: 24.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.458 [0.000, 1.000], mean observati

<keras.callbacks.callbacks.History at 0x133576d50>

Let's test our reinforcement learning model.

In [45]:
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: 189.000, steps: 189
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 181.000, steps: 181


<keras.callbacks.callbacks.History at 0x12ea05050>

This is nearly perfect play, since a `CartPole-v0` episode ends once 200 steps are reached. You can experiment with a version whose limit is 500 steps by changing the environment to `CartPole-v1`.