# Deep Q-learning for Cart-Pole

This notebook uses OpenAI Gym and deep Q-learning to create an agent that plays Cart-Pole.

### Import dependencies and create a Cart-Pole playing environment

In [2]:
import gym
import tensorflow as tf
import numpy as np

In [3]:
env = gym.make('CartPole-v0')

### Explore the OpenAI Gym environment

Get the set of possible actions for this game:

In [4]:
env.action_space

Discrete(2)

There are two possible actions, pushing the cart left or right, coded as 0 and 1 in the environment.

---

Let's run a random simulation to see how the game is played.

In [5]:
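env.reset()
rewards = []
for move in range(100):
    env.render()
    # Take a random action and observe the outcome
    state, reward, done, info = env.step(env.action_space.sample())
    rewards.append(reward)
    if done:
        # Episode over: clear the reward list and start a new episode
        rewards = []
        env.reset()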

In [6]:
env.close()

In [7]:
print(rewards)

[1.0, 1.0, 1.0, 1.0, 1.0]


The object of the game is to move the cart left or right to keep the pole from falling. The longer the pole stays up, the more reward we receive. For this game, we get a reward of 1 for each step that the pole is still standing.
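
As a small illustration of this reward scheme, the sketch below sums a list of per-step rewards to get the total reward for one episode. The `episode_return` helper and the sample list are hypothetical and only meant to show the bookkeeping.

```python
def episode_return(rewards):
    """Total (undiscounted) reward collected over one episode."""
    return sum(rewards)

# Hypothetical episode in which the pole stayed up for five steps
sample_rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
total = episode_return(sample_rewards)
print(total)
```

Because every step earns a reward of 1, the total is simply the number of steps the pole stayed up.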

### Building the Q-Network

In basic Q-learning we keep a table of values for every state-action pair and update those values as the agent learns. For Cart-Pole, the number of state-action pairs is far too large for this to be feasible. Even though the game is simple, each state is made up of four real-valued numbers: the position and velocity of the cart, and the angle and angular velocity of the pole. Because these values are continuous, the number of distinct states is effectively infinite.
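
To get a rough feel for how quickly a table grows, the sketch below counts the entries needed if each of the four state variables were cut into a fixed number of bins. The binning itself is an assumption made for illustration (the notebook does not discretize the state), and `q_table_entries` is a hypothetical helper.

```python
def q_table_entries(bins_per_dim, state_dims=4, n_actions=2):
    """Number of Q-table cells if each state variable is split into bins."""
    return (bins_per_dim ** state_dims) * n_actions

# Finer bins give a better approximation of the continuous state,
# but the table size grows with the fourth power of the bin count.
for bins in (10, 100, 1000):
    print(bins, q_table_entries(bins))
```

Even a coarse grid of 100 bins per variable already needs 200 million entries, which is why a function approximator is attractive here.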

In deep Q-learning, we use a neural network to approximate the Q-table. Our Q-network takes a state as input and outputs a Q-value for each possible action.
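
The input and output shapes can be sketched in plain NumPy. This is only an illustration with random, untrained weights, assuming two hidden ReLU layers of 10 units each; `init_params` and `forward` are hypothetical helpers, not the TensorFlow model used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(state_size=4, hidden_size=10, action_size=2):
    """Random weights and zero biases for a small fully connected network."""
    sizes = [state_size, hidden_size, hidden_size, action_size]
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, states):
    """Map a batch of states to one Q-value per action."""
    h = np.asarray(states, dtype=np.float64)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    return h

params = init_params()
states = rng.normal(size=(3, 4))           # batch of 3 made-up states
q_values = forward(params, states)         # shape (3, 2)
greedy_actions = q_values.argmax(axis=1)   # action with the largest Q-value
print(q_values.shape, greedy_actions)
```

Picking the action with the largest output is how a trained network would be used to act greedily.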

Our targets for training are $\hat{Q}(s,a) = r + \gamma \max_{a'}{Q(s', a')}$, so we want to minimize $(\hat{Q}(s,a) - Q(s,a))^2$. Here $r$ is the reward received, $\gamma$ is the discount factor, and $s'$ is the next state. A Q-value can be thought of as an estimate of the total discounted reward we can expect if we take a given action in a given state.
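
The target and the squared error can be worked through on a tiny batch with NumPy. This is a minimal sketch: the numbers are made up, the default `gamma` is an assumption, and dropping the bootstrap term on terminal steps (the `dones` mask) is a common convention that is not stated in the text above.

```python
import numpy as np

def q_targets(rewards, next_q_values, dones, gamma=0.99):
    """r + gamma * max_a' Q(s', a'), with no bootstrap on terminal steps."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_values = np.asarray(next_q_values, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * not_done * next_q_values.max(axis=1)

def squared_td_loss(targets, q_selected):
    """Mean of (target - Q(s, a))^2 over the batch."""
    return float(np.mean((np.asarray(targets) - np.asarray(q_selected)) ** 2))

rewards = [1.0, 1.0]
next_q = [[2.0, 3.0], [5.0, 4.0]]   # Q(s', a') for both actions
dones = [False, True]               # second transition ended the episode

targets = q_targets(rewards, next_q, dones, gamma=0.5)
loss = squared_td_loss(targets, [2.0, 1.0])
print(targets, loss)
```

With `gamma=0.5` the first target is 1 + 0.5 * 3 = 2.5 and the second is just the reward, 1.0, so the loss is the mean of 0.25 and 0, or 0.125.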

In [21]:
class QNetwork():
    def __init__(self, learning_rate=0.01, state_size=4, action_size=2, hidden_size=10, name='QNetwork'):
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)

            # Target placeholder for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # Hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)
            
            # Output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, activation_fn=None)
            
            # Train on (targetQ - Q)^2
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
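
The one-hot trick used to compute `self.Q` can be reproduced in NumPy. In this sketch, with made-up values and a hypothetical `select_action_values` helper, multiplying the network output by the one-hot actions zeroes out every Q-value except the one for the action actually taken, and summing over the action axis extracts it.

```python
import numpy as np

def select_action_values(q_values, actions, action_size=2):
    """Pick Q(s, a) for the action taken in each row via a one-hot mask."""
    one_hot = np.eye(action_size)[actions]       # shape (batch, action_size)
    return np.sum(q_values * one_hot, axis=1)    # shape (batch,)

# Made-up network outputs for a batch of three states
q_values = np.array([[0.2, 0.7],
                     [1.5, -0.3],
                     [0.0, 0.9]])
actions = np.array([1, 0, 1])

selected = select_action_values(q_values, actions)
print(selected)
```

Only the selected values enter the loss, so each training example updates the estimate for the action that was actually taken.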