# MsPacman Q-learning

This notebook shows how to train a model to play Ms Pacman through reinforcement learning. 
For this purpose, we use the OpenAI Gym library.

In [None]:
import gym
from scipy.misc import imresize
import numpy as np

#Use MsPacman
env = gym.make('MsPacman-v0')

Since this is an agent decision problem, we need the set of actions the agent can perform. The environment we created exposes its possible actions, and we can inspect what each one means in the context of an Atari game:

In [None]:
env.unwrapped.get_action_meanings()

At each step, we choose an action, and the environment returns a new state (the observation), a reward, a flag indicating whether the game is over, and additional information about the game (such as how many lives we have left).

Let's see what would happen if the agent always chose to go downwards. 

In [None]:
env.reset()
for i in range(500):
    action = 4  # index of 'DOWN' in the action meanings listed above
    obs, reward, done, info = env.step(action)
    if reward > 0:
        print(reward,done,info['ale.lives'])

Let's now see what each state looks like in terms of size:

In [None]:
obs = env.reset()
print(len(obs))
print(len(obs[0]))
print(len(obs[0][0]))
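
The three lengths printed above are the dimensions of the observation array. As a small sketch, assuming the observation is a NumPy array with the usual Atari frame layout of 210 rows, 160 columns and 3 colour channels, the same information is available in one call through `shape`. A zero-filled stand-in called `fake_obs` (a hypothetical name) is used so that the snippet runs without the emulator:

```python
import numpy as np

# Stand-in for an observation; assumed to match the Atari frame layout.
fake_obs = np.zeros((210, 160, 3), dtype=np.uint8)

# shape reports all three dimensions at once: rows, columns, colour channels.
height, width, channels = fake_obs.shape
print(height, width, channels)
```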

## Image preprocessing

Before training, we can reduce each frame to the part of the board that actually matters for the game. We can also turn it into a square image, as long as no information is lost.

This is how the board looks at the beginning:

In [None]:
import matplotlib.pyplot as plt
original_board = env.reset()
plt.imshow(original_board)
plt.show()

We can remove the lower rectangle that has the game statistics:

In [None]:
cropped = original_board[0:170]
plt.imshow(cropped)
plt.show()

Now we can make the image coarser by resizing it to roughly half its size with nearest-neighbour interpolation. The ghosts, the points, Ms Pacman and the walls are all still visible. We can also convert it to grayscale so that we don't have to carry the three RGB components.

In [None]:
from scipy.misc import imresize
scaled = cropped.mean(axis=2) # convert to grayscale
scaled = scaled/255 #normalize
scaled = imresize(scaled, size=(85,85), interp='nearest')
plt.imshow(scaled)
plt.show()

With this, we create our first constant and helper function:

In [None]:
BOARD_SIZE = 85

def rescale_board(board):
    cropped = board[0:170]
    scaled = cropped.mean(axis=2)
    return imresize(scaled/255, size=(BOARD_SIZE, BOARD_SIZE), interp='nearest')
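
`scipy.misc.imresize` was deprecated and later removed from SciPy, so the helper above only runs on older installations. The following is a minimal sketch of a NumPy-only alternative, not a drop-in replacement: `rescale_board_np` is a hypothetical name, the nearest-neighbour step is implemented by picking evenly spaced rows and columns, and the pixel values are not guaranteed to match what `imresize` would produce.

```python
import numpy as np

BOARD_SIZE = 85  # same value as the constant defined above

def rescale_board_np(board, size=BOARD_SIZE):
    """Crop, convert to grayscale and downsample a frame using NumPy only."""
    cropped = board[0:170]            # drop the statistics bar at the bottom
    gray = cropped.mean(axis=2)       # average the RGB channels
    # Nearest-neighbour downsampling: keep evenly spaced rows and columns.
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    return gray[np.ix_(rows, cols)].astype(np.uint8)

# Stand-in frame with the assumed Atari layout (210 x 160 x 3).
demo_frame = np.full((210, 160, 3), 100, dtype=np.uint8)
demo_small = rescale_board_np(demo_frame)
print(demo_small.shape)
```

The output keeps the 0 to 255 value range, which is what the later division by 255 in the network input expects.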

## State Definition and State Update

To train the network, the immediately previous screen is not enough; we need a short sequence of screens. This way, the network can infer things like the speed and direction of moving objects on the game's screen.

In the paper used for this project, the authors used 4 screens to define the "state". We'll do the same.

We define a function to update the state. It takes the current state and the new observation as input, removes the oldest observation and appends the new one.

In [None]:
def update_state(current_state, new_obs):
    scaled_obs = rescale_board(new_obs)
    return np.append(current_state[1:], np.expand_dims(scaled_obs, 0), axis=0)
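
To make the sliding-window behaviour concrete, here is a small sketch on toy data. The helper `push_frame` and the 2x2 "screens" are illustrative assumptions; the helper skips the rescaling step and only shows how the oldest frame is dropped and the newest one is appended.

```python
import numpy as np

def push_frame(state, frame):
    """Drop the oldest frame of `state` and append `frame` at the end."""
    return np.concatenate([state[1:], frame[np.newaxis]], axis=0)

# Four tiny 2x2 "screens" filled with 0, 1, 2 and 3 respectively.
demo_state = np.stack([np.full((2, 2), i) for i in range(4)], axis=0)
demo_next = push_frame(demo_state, np.full((2, 2), 9))

# First pixel of each screen, oldest to newest.
print(demo_next[:, 0, 0])
```

The input state is left untouched, so the same array can still be stored in a buffer as the "previous" state.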

## Experience Replay

Before doing anything else, we'll start creating our buffer for experience replay.

Experience replay is an improvement to deep Q-learning. Essentially, you accumulate experiences (transitions made of a state, an action, a reward and the next state) from gameplay. These plays can come from an expert player or from random actions. Training on samples drawn from them is useful because it breaks the time dependencies introduced by the fact that the game frames always follow a sequence.

We generate our buffer as follows:

In [None]:
REPLAY_EXPERIENCES = 500 # for real 50000, for testing 500
K = env.action_space.n # Number of possible actions
s = 4 # number of screens in a state
experience_buffer = []

initial_obs = env.reset()
scaled_initial_obs = rescale_board(initial_obs)
state = np.stack([scaled_initial_obs] * s, axis=0)
for i in range(REPLAY_EXPERIENCES):
    action = np.random.choice(K)
    obs, reward, game_over, _ = env.step(action)
    next_state = update_state(state, obs)
    experience_buffer.append((state, action, reward, next_state, game_over))
    if game_over:
        obs = env.reset()
        scaled_obs = rescale_board(obs)
        state = np.stack([scaled_obs] * s, axis=0)
    else:
        state = next_state

We define one more constant: the number of experiences we'll use for training.

In [None]:
TRAIN_EXPERIENCES = 5000 #for real 500000, for testing 5000


In [None]:
import tensorflow as tf


X = tf.placeholder(tf.float32, shape=(None, 4, BOARD_SIZE, BOARD_SIZE), name='X')

# tensorflow convolution needs the order to be:
# (num_samples, height, width, "color")
# so we need to transpose later
G = tf.placeholder(tf.float32, shape=(None,), name='G')
actions = tf.placeholder(tf.int32, shape=(None,), name='actions')

# calculate output and cost
# convolutional layers
# these built-in layers are faster and don't require us to
# calculate the size of the output of the final conv layer!
Z = X / 255.0
# print(Z)
Z = tf.transpose(Z, [0, 2, 3, 1])
# print(Z)




In [None]:
conv_layer_sizes = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
hidden_layer_sizes = [512]

# each tuple is (number of filters, kernel size, stride)
for num_output_filters, filtersz, stride in conv_layer_sizes:
    Z = tf.contrib.layers.conv2d(
        Z,
        num_output_filters,
        filtersz,
        stride,
        activation_fn=tf.nn.relu
    )

# fully connected layers
Z = tf.contrib.layers.flatten(Z)
for M in hidden_layer_sizes:
    Z = tf.contrib.layers.fully_connected(Z, M)

# final output layer
predict_op = tf.contrib.layers.fully_connected(Z, K)

selected_action_values = tf.reduce_sum(
    predict_op * tf.one_hot(actions, K),
    reduction_indices=[1]
)

cost = tf.reduce_mean(tf.square(G - selected_action_values))
train_op = tf.train.AdamOptimizer(1e-2).minimize(cost)
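
Although the built-in layers spare us from computing the size of the final convolutional output, it can be useful to check it by hand. The sketch below assumes `'SAME'` padding (believed to be the default of `tf.contrib.layers.conv2d`), in which case the spatial size after each layer is the input size divided by the stride, rounded up. The third value of each tuple in `conv_layer_sizes` is the one passed to `conv2d` as the stride; `conv_output_size` is a hypothetical helper.

```python
import math

def conv_output_size(size, stride):
    """Spatial output size of a 'SAME'-padded convolution."""
    return math.ceil(size / stride)

size = 85  # BOARD_SIZE
last_filters = None
for num_filters, kernel, stride in [(32, 8, 4), (64, 4, 2), (64, 3, 1)]:
    size = conv_output_size(size, stride)
    last_filters = num_filters

# Number of features seen by the first fully connected layer.
flattened = size * size * last_filters
print(size, flattened)
```

Under that padding assumption the kernel size does not affect the output size, which is why only the stride appears in the calculation.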

In [None]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
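
The graph above expects the targets `G` to be fed in, but the notebook does not show how they are computed. As a hedged sketch, the standard one-step Q-learning target is the reward plus the discounted maximum action value of the next state, or just the reward when the game is over. The NumPy code below illustrates that rule together with the action-value selection and the squared-error cost; `q_targets`, `selected_values` and the discount factor `gamma` are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np

def q_targets(rewards, next_q_values, game_overs, gamma=0.99):
    """One-step Q-learning targets: r + gamma * max_a Q(s', a), or r at game over."""
    rewards = np.asarray(rewards, dtype=np.float32)
    not_done = 1.0 - np.asarray(game_overs, dtype=np.float32)
    return rewards + gamma * not_done * np.max(next_q_values, axis=1)

def selected_values(q_values, actions):
    """Pick Q(s, a) for the action taken in each row (same result as the one-hot sum)."""
    q_values = np.asarray(q_values)
    return q_values[np.arange(len(actions)), actions]

# Two toy transitions with two actions each; the second one ends the game.
targets = q_targets([1.0, 0.0], np.array([[1.0, 2.0], [3.0, 4.0]]),
                    [False, True], gamma=0.5)
chosen = selected_values([[0.5, 1.5], [2.0, 0.0]], [1, 0])
loss = float(np.mean((targets - chosen) ** 2))
print(targets, chosen, loss)
```

In the TensorFlow graph, `targets` would play the role of `G` and `chosen` corresponds to `selected_action_values`.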
