# MsPacman Q-learning

This notebook shows how to train an agent to play Ms. Pac-Man with reinforcement learning.
For this purpose, we use the OpenAI Gym library and its Atari environments.

In [56]:
import gym
from scipy.misc import imresize
import numpy as np

# Create the Ms. Pac-Man Atari environment
env = gym.make('MsPacman-v0')

[2017-12-18 21:50:53,024] Making new env: Breakout-v0


Since this is a sequential decision problem, we need the set of actions the agent can perform. The environment we created defines this action set, and we can list what each action means in the context of an Atari game:

In [43]:
env.unwrapped.get_action_meanings()

['NOOP',
 'UP',
 'RIGHT',
 'LEFT',
 'DOWN',
 'UPRIGHT',
 'UPLEFT',
 'DOWNRIGHT',
 'DOWNLEFT']

At each step we choose an action, and the environment returns four things: the new state (observation), a reward, a flag indicating whether the game is over, and a dictionary with extra information about the game (such as how many lives are left).

Let's see what would happen if the agent always chose to go downwards. 

In [44]:
env.reset()
for i in range(1000):
    action = 4  # index 4 is 'DOWN' in the action list above
    obs, reward, done, info = env.step(action)
    if reward > 0:
        print(reward,done,info['ale.lives'])

10.0 False 3
10.0 False 3
10.0 False 3
10.0 False 3
10.0 False 3
10.0 False 3
10.0 False 3


In [58]:
obs = env.reset()
print(obs)
print(len(obs))
print(len(obs[0]))
print(len(obs[0][0]))

[[[0 0 0]
  [0 0 0]
  [0 0 0]
  ..., 
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ..., 
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ..., 
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 ..., 
 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ..., 
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ..., 
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ..., 
  [0 0 0]
  [0 0 0]
  [0 0 0]]]
210
160
3
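
The raw observation is a 210 x 160 RGB image, while the network input defined in the next cell has shape `(4, IM_SIZE, IM_SIZE)` with `IM_SIZE = 80`. The notebook does not show how frames are reduced to that shape, so the following is only a minimal sketch of one possible approach using plain NumPy (grayscale by channel averaging, nearest-neighbour downsampling). The helper names `preprocess` and `stack_frames` are hypothetical, and the random frame is a stand-in for a real observation.

```python
import numpy as np

IM_SIZE = 80  # same value as IM_SIZE in the network definition


def preprocess(frame, size=IM_SIZE):
    """Hypothetical helper: RGB frame (H, W, 3) -> grayscale (size, size) uint8."""
    gray = frame.mean(axis=2)  # average the three colour channels
    # Nearest-neighbour downsampling: pick `size` evenly spaced rows and columns.
    rows = np.linspace(0, frame.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, frame.shape[1] - 1, size).astype(int)
    return gray[np.ix_(rows, cols)].astype(np.uint8)


def stack_frames(frames):
    """Hypothetical helper: stack the last 4 frames into a (4, size, size) state."""
    return np.stack(frames[-4:], axis=0)


# Stand-in for a real observation, using the shape printed above.
fake_obs = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
state = stack_frames([preprocess(fake_obs)] * 4)
print(state.shape)  # (4, 80, 80)
```

Note that `scipy.misc.imresize`, imported at the top of the notebook, has been removed from recent SciPy releases, which is one reason this sketch sticks to NumPy indexing.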


In [65]:
import tensorflow as tf
IM_SIZE = 80
K = env.action_space.n
X = tf.placeholder(tf.float32, shape=(None, 4, IM_SIZE, IM_SIZE), name='X')

# TensorFlow convolutions expect the input order
# (num_samples, height, width, channels),
# so we transpose X below: its axis of size 4 becomes the channel axis
G = tf.placeholder(tf.float32, shape=(None,), name='G')
actions = tf.placeholder(tf.int32, shape=(None,), name='actions')

# calculate output and cost
# convolutional layers
# these built-in layers are faster and don't require us to
# calculate the size of the output of the final conv layer!
Z = X / 255.0
# print(Z)
Z = tf.transpose(Z, [0, 2, 3, 1])
# print(Z)
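
To make the shape bookkeeping in this cell concrete, here is a small NumPy sketch. It applies the same axis permutation as `tf.transpose(Z, [0, 2, 3, 1])` to a dummy batch and then works out by hand the feature-map size that the built-in layers compute for us. The layer specs are copied from the next cell. The batch size of 32 is arbitrary, and the size calculation assumes `'SAME'` padding, which I believe is the default for `tf.contrib.layers.conv2d`.

```python
import math
import numpy as np

# Dummy batch in the placeholder's layout: (num_samples, 4, height, width).
batch = np.zeros((32, 4, 80, 80), dtype=np.float32)  # batch size 32 is arbitrary
batch[0, 2, 10, 20] = 1.0  # mark one pixel in slice 2 of the first sample

# Same permutation as tf.transpose(Z, [0, 2, 3, 1]).
nhwc = np.transpose(batch, (0, 2, 3, 1))
print(nhwc.shape)          # (32, 80, 80, 4)
print(nhwc[0, 10, 20, 2])  # 1.0

# (filters, kernel size, stride) for each conv layer, copied from the next cell.
conv_layer_sizes = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

height, width, channels = nhwc.shape[1:]
for filters, _kernel, stride in conv_layer_sizes:
    # Assumption: 'SAME' padding, so only the stride shrinks the feature map.
    height = math.ceil(height / stride)
    width = math.ceil(width / stride)
    channels = filters

flat_size = height * width * channels
print(height, width, channels, flat_size)  # 10 10 64 6400
```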




In [66]:
# Each tuple is (number of filters, kernel size, stride)
conv_layer_sizes = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
hidden_layer_sizes = [512]

for num_output_filters, filtersz, stride in conv_layer_sizes:
    Z = tf.contrib.layers.conv2d(
        Z,
        num_output_filters,
        filtersz,
        stride,
        activation_fn=tf.nn.relu
    )

# fully connected layers
Z = tf.contrib.layers.flatten(Z)
for M in hidden_layer_sizes:
    Z = tf.contrib.layers.fully_connected(Z, M)

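# final output layer: one Q-value per action
# (activation_fn=None keeps the output linear; fully_connected defaults to ReLU)
predict_op = tf.contrib.layers.fully_connected(Z, K, activation_fn=None)

# pick out the Q-value of the action actually taken in each sample
selected_action_values = tf.reduce_sum(
    predict_op * tf.one_hot(actions, K),
    axis=1
)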

# mean squared error between the targets G and the predicted Q-values
cost = tf.reduce_mean(tf.square(G - selected_action_values))

# Alternative optimizers that can be swapped in:
# train_op = tf.train.AdamOptimizer(1e-2).minimize(cost)
# train_op = tf.train.AdagradOptimizer(1e-2).minimize(cost)
# train_op = tf.train.RMSPropOptimizer(2.5e-4, decay=0.99, epsilon=10e-3).minimize(cost)
# train_op = tf.train.MomentumOptimizer(1e-3, momentum=0.9).minimize(cost)
# train_op = tf.train.GradientDescentOptimizer(1e-4).minimize(cost)
train_op = tf.train.RMSPropOptimizer(
    learning_rate=0.00025, decay=0.99, momentum=0.0, epsilon=1e-6
).minimize(cost)
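
The one-hot trick used for `selected_action_values` can be easier to follow with concrete numbers. The NumPy sketch below mirrors that part of the graph and the squared-error cost on a toy batch. The function names and the example values are made up for illustration; `K = 9` matches the nine actions listed earlier.

```python
import numpy as np

K = 9  # number of actions listed for MsPacman earlier in the notebook


def selected_q(q_values, actions, num_actions=K):
    """Keep, for each row, only the Q-value of the action that was taken."""
    one_hot = np.eye(num_actions, dtype=q_values.dtype)[actions]
    return (q_values * one_hot).sum(axis=1)


def mse_cost(targets, q_values, actions):
    """Mean squared error between targets and the selected Q-values."""
    return np.mean((targets - selected_q(q_values, actions)) ** 2)


# Toy batch: two states, nine actions each (rows are 0..8 and 9..17).
q = np.arange(18, dtype=np.float32).reshape(2, 9)
a = np.array([4, 0])                        # actions taken in each state
g = np.array([5.0, 9.0], dtype=np.float32)  # made-up targets

print(selected_q(q, a))   # [4. 9.]
print(mse_cost(g, q, a))  # 0.5
```

Multiplying by the one-hot mask zeroes every Q-value except the one for the chosen action, so only that output receives a gradient from the cost.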

In [None]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
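
The graph above expects target values to be fed into the `G` placeholder, but the notebook does not show how they are computed. In standard Q-learning the target for a transition is the reward plus the discounted maximum Q-value of the next state, with no bootstrap term when the episode has ended. The sketch below illustrates that rule in NumPy. The function name `td_targets` and the default `gamma=0.99` are assumptions, not values taken from this notebook.

```python
import numpy as np


def td_targets(rewards, next_q, dones, gamma=0.99):
    """Hypothetical helper: one-step Q-learning targets.

    rewards: (N,) rewards observed after each action
    next_q:  (N, K) Q-values predicted for the next states
    dones:   (N,) True where the episode ended on that step
    gamma:   discount factor (assumed value)
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    dones = np.asarray(dones, dtype=bool)
    best_next = np.max(next_q, axis=1)  # greedy value of each next state
    # Terminal transitions get no bootstrap term.
    return rewards + gamma * best_next * (~dones)


# Toy example with three actions and gamma = 0.5 to keep the arithmetic simple.
next_q = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]], dtype=np.float32)
targets = td_targets([10.0, 0.0], next_q, [False, True], gamma=0.5)
print(targets)  # [11.5  0. ]
```

In a training loop, an array like `targets` would be passed as `G` in the `feed_dict` together with the states and the actions taken.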
