# Deep Q-learning
This notebook builds a neural network that can learn to play different games through reinforcement learning. The game will be simulated using [OpenAI Gym](https://gym.openai.com).

In [2]:
import gym
import tensorflow as tf
import numpy as np

>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

In [3]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')

[2017-06-19 17:05:09,260] Making new env: CartPole-v0


In [3]:
# let's explore the environment
# how many actions are possible?
# will output discrete value 2 (you can just move the pole left or right)
env.action_space

Discrete(2)

In [4]:
# get a random action
env.action_space.sample()

0

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network
We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

For this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, I'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, the Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

For the neural network I will use an LSTM cell.

In [4]:
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras import backend as K

Using TensorFlow backend.


In [13]:
class CartPoleController(object):
    def __init__(self, n_input=4, n_hidden=10, n_output=1, initial_state=0.1, training_threshold=1.5):
        self.n_input = n_input
        self.n_hidden = n_hidden
        self.n_output = n_output
        self.initial_state = initial_state
        self.training_threshold = training_threshold
        self.step_threshold = 0.5
        # Action neural network
        # Dense input -> (1 x n_input)
        # LSTM -> (n_hidden)
        # Dense output -> (n_output)
        self.action_model = Sequential()
        self.action_model.add(LSTM(self.n_hidden, input_shape=(1, self.n_input)))
        self.action_model.add(Activation('tanh'))
        self.action_model.add(Dense(self.n_output))
        self.action_model.add(Activation('sigmoid'))
        self.action_model.compile(loss='mse', optimizer='adam')
        
    def action(self, obs, prev_obs=None, prev_action=None):
        x = np.ndarray(shape=(1, 1, self.n_input)).astype(K.floatx())
        if prev_obs is not None:
            prev_norm = np.linalg.norm(prev_obs)
            if prev_norm > self.training_threshold:
            # Compute a training step
                x[0, 0, :] = prev_obs
            if prev_norm < self.step_threshold:
                y = np.array([prev_action]).astype(K.floatx())
            else:
                y = np.array([np.abs(prev_action - 1)]).astype(K.floatx())
            self.action_model.train_on_batch(x, y)
            # Predict new value
            x[0, 0, :] = obs
        output = self.action_model.predict(x, batch_size=1)
        return self.step(output)
    def step(self, value):
        if value > self.step_threshold:
            return int(1)
        else:
            return int(0)

Let's run the model

In [6]:
# Number of episodes
nb_episodes = 100
# Max execution time (in seconds)
max_execution_time = 120
# Set random seed
np.random.seed(1000)

In [17]:
from gym import wrappers
import time

print('OpenAI-Gym CartPole-v0 LSTM experiment')
env = gym.make('CartPole-v0')
env = wrappers.Monitor(env, 'cartpole_lstm-1', force=True)
cart_pole_controller = CartPoleController()
total_reward = []

for episode in range(nb_episodes):
    # Reset environment
    observation = env.reset()
    previous_observation = observation
    action = cart_pole_controller.action(observation)
    previous_action = action
    done = False
    t = 0
    partial_reward = 0.0
    start_time = time.time()
    elapsed_time = 0
    
    while not done and elapsed_time < max_execution_time:
        t += 1
        elapsed_time = time.time() - start_time
        env.render()
        observation, reward, done, info = env.step(action)
        partial_reward += reward
        action = cart_pole_controller.action(observation, previous_observation, previous_action)
        previous_observation = observation
        previous_action = action

    print('Episode %d finished after %d timesteps. Total reward: %1.0f. Elapsed time: %d s' %
          (episode+1, t+1, partial_reward, elapsed_time))

env.close()
total_reward.append(partial_reward)
total_reward = np.array(total_reward)
print('Average reward: %3.2f' % np.mean(total_reward))

[2017-06-19 17:28:22,724] Making new env: CartPole-v0
[2017-06-19 17:28:22,734] Finished writing results. You can upload them to the scoreboard via gym.upload('/Users/bernhardmayr/Documents/Projekte/MLforTrading/QLearning/gym_deep_q_learning/cartpole_lstm-1')
[2017-06-19 17:28:22,741] Clearing 12 monitor files from previous run (because force=True was provided)


OpenAI-Gym CartPole-v0 LSTM experiment


[2017-06-19 17:28:23,167] Starting new video recorder writing to /Users/bernhardmayr/Documents/Projekte/MLforTrading/QLearning/gym_deep_q_learning/cartpole_lstm-1/openaigym.video.6.22823.video000000.mp4
[2017-06-19 17:28:25,695] Starting new video recorder writing to /Users/bernhardmayr/Documents/Projekte/MLforTrading/QLearning/gym_deep_q_learning/cartpole_lstm-1/openaigym.video.6.22823.video000001.mp4


Episode 1 finished after 10 timesteps. Total reward: 9. Elapsed time: 2 s
Episode 2 finished after 11 timesteps. Total reward: 10. Elapsed time: 0 s
Episode 3 finished after 11 timesteps. Total reward: 10. Elapsed time: 0 s
Episode 4 finished after 11 timesteps. Total reward: 10. Elapsed time: 0 s
Episode 5 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 6 finished after 11 timesteps. Total reward: 10. Elapsed time: 0 s
Episode 7 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s


[2017-06-19 17:28:27,220] Starting new video recorder writing to /Users/bernhardmayr/Documents/Projekte/MLforTrading/QLearning/gym_deep_q_learning/cartpole_lstm-1/openaigym.video.6.22823.video000008.mp4


Episode 8 finished after 11 timesteps. Total reward: 10. Elapsed time: 0 s
Episode 9 finished after 27 timesteps. Total reward: 26. Elapsed time: 0 s
Episode 10 finished after 32 timesteps. Total reward: 31. Elapsed time: 0 s
Episode 11 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 12 finished after 14 timesteps. Total reward: 13. Elapsed time: 0 s
Episode 13 finished after 14 timesteps. Total reward: 13. Elapsed time: 0 s
Episode 14 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s
Episode 15 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 16 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s
Episode 17 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 18 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s
Episode 19 finished after 14 timesteps. Total reward: 13. Elapsed time: 0 s
Episode 20 finished after 14 timesteps. Total reward: 13. Elapsed time: 0 s
Episode 21 fin

[2017-06-19 17:28:32,319] Starting new video recorder writing to /Users/bernhardmayr/Documents/Projekte/MLforTrading/QLearning/gym_deep_q_learning/cartpole_lstm-1/openaigym.video.6.22823.video000027.mp4


Episode 27 finished after 14 timesteps. Total reward: 13. Elapsed time: 0 s
Episode 28 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s
Episode 29 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 30 finished after 22 timesteps. Total reward: 21. Elapsed time: 0 s
Episode 31 finished after 15 timesteps. Total reward: 14. Elapsed time: 0 s
Episode 32 finished after 14 timesteps. Total reward: 13. Elapsed time: 0 s
Episode 33 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 34 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s
Episode 35 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s
Episode 36 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 37 finished after 11 timesteps. Total reward: 10. Elapsed time: 0 s
Episode 38 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 39 finished after 11 timesteps. Total reward: 10. Elapsed time: 0 s
Episode 40 f

[2017-06-19 17:28:40,337] Starting new video recorder writing to /Users/bernhardmayr/Documents/Projekte/MLforTrading/QLearning/gym_deep_q_learning/cartpole_lstm-1/openaigym.video.6.22823.video000064.mp4


Episode 64 finished after 16 timesteps. Total reward: 15. Elapsed time: 0 s
Episode 65 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s
Episode 66 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 67 finished after 14 timesteps. Total reward: 13. Elapsed time: 0 s
Episode 68 finished after 10 timesteps. Total reward: 9. Elapsed time: 0 s
Episode 69 finished after 12 timesteps. Total reward: 11. Elapsed time: 0 s
Episode 70 finished after 17 timesteps. Total reward: 16. Elapsed time: 0 s
Episode 71 finished after 13 timesteps. Total reward: 12. Elapsed time: 0 s
Episode 72 finished after 16 timesteps. Total reward: 15. Elapsed time: 0 s
Episode 73 finished after 19 timesteps. Total reward: 18. Elapsed time: 0 s
Episode 74 finished after 17 timesteps. Total reward: 16. Elapsed time: 0 s
Episode 75 finished after 15 timesteps. Total reward: 14. Elapsed time: 0 s
Episode 76 finished after 10 timesteps. Total reward: 9. Elapsed time: 0 s
Episode 77 fin

[2017-06-19 17:28:48,971] Finished writing results. You can upload them to the scoreboard via gym.upload('/Users/bernhardmayr/Documents/Projekte/MLforTrading/QLearning/gym_deep_q_learning/cartpole_lstm-1')


Episode 100 finished after 18 timesteps. Total reward: 17. Elapsed time: 0 s
Average reward: 17.00
