# MsPacman Q-learning

This notebook shows how to train a model to play Ms Pacman through reinforcement learning. 
For this purpose, we use the OpenAI Gym library.

In [None]:
import gym
from scipy.misc import imresize
import numpy as np

#Use MsPacman
env = gym.make('MsPacman-v0')

Since this is a sequential decision problem, we need the set of actions the agent can perform. The environment we created defines these actions, and we can inspect what each one means in the context of an Atari game:

In [None]:
env.unwrapped.get_action_meanings()

At each step we choose an action. The environment then returns a new state, a reward, a flag indicating whether the game is over, and additional information about the game (such as how many lives are left).

Let's see what would happen if the agent always chose to go downwards. 

In [None]:
env.reset()
for i in range(500):
    action = 4
    obs, reward, game_over, info = env.step(action)
    if reward > 0:
        print(reward,game_over,info['ale.lives'])

Let's now see what each state looks like in terms of size:

In [None]:
obs = env.reset()
print(len(obs))
print(len(obs[0]))
print(len(obs[0][0]))

## Image preprocessing

Before training, we can reduce each frame to the part of the image that actually matters for the game. We can also resize it to a square image, as long as no important information is lost.

This is how the board looks at the beginning:

In [None]:
import matplotlib.pyplot as plt
original_board = env.reset()
plt.imshow(original_board)
plt.show()

We can remove the lower rectangle that has the game statistics:

In [None]:
cropped = original_board[0:170]
plt.imshow(cropped)
plt.show()

Now we can make the image coarser by resizing it to 85×85 pixels (roughly half the size) with nearest-neighbour interpolation. The ghosts, the dots, Ms. Pac-Man and the walls are all still visible. We also convert it to grayscale so that we do not carry the three RGB components.

In [None]:
from scipy.misc import imresize
scaled = cropped.mean(axis=2) # convert to grayscale
scaled = scaled/255 #normalize
scaled = imresize(scaled, size=(85,85), interp='nearest')
plt.imshow(scaled)
plt.show()

With this, we create our first constant and helper function:

In [None]:
BOARD_SIZE = 85

def rescale_board(board):
    cropped = board[0:170]
    scaled = cropped.mean(axis=2)
    return imresize(scaled/255, size=(BOARD_SIZE, BOARD_SIZE), interp='nearest')
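
`scipy.misc.imresize` was deprecated and later removed from SciPy (it is no longer available from version 1.3 onward), so the helper above may fail to import on a recent installation. Below is a minimal NumPy-only sketch of the same steps (crop, grayscale, nearest-neighbour downsampling). The name `rescale_board_np` is hypothetical, the random frame only stands in for a real observation, and the output is not guaranteed to be pixel-identical to what `imresize` produces.

```python
import numpy as np

BOARD_SIZE = 85  # same target size as in the notebook

def rescale_board_np(board, size=BOARD_SIZE):
    """Crop, convert to grayscale, and downsample by nearest-neighbour sampling."""
    cropped = board[0:170]            # drop the statistics bar at the bottom
    gray = cropped.mean(axis=2)       # average the RGB channels
    height, width = gray.shape
    # map each target pixel back to a source row / column
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return gray[rows][:, cols].astype(np.uint8)

# usage with a fake Atari-sized frame (210 x 160 RGB), not a real observation
fake_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
small = rescale_board_np(fake_frame)
print(small.shape, small.dtype)
```

Keeping the result in the 0-255 range matches the network defined later, which divides its input by 255.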

## State Definition and State Update

To train the network, we don't use only the most recent screen; we use a short sequence of screens. This lets the network infer things like the speed and direction of moving objects on the game's screen.

In the paper used for this project, the authors used 4 screens to define the "state". We'll do the same.

We define a function to update the state. It takes the current state and a new observation, drops the oldest observation, and appends the new one.

In [None]:
def update_state(current_state, new_obs):
    scaled_obs = rescale_board(new_obs)
    # drop the oldest screen and append the new one (use the argument, not the global `state`)
    return np.append(current_state[1:], np.expand_dims(scaled_obs, 0), axis=0)
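
To make the sliding window concrete, here is a small sketch on toy data. The helper name `push_frame` and the 2×2 "frames" are made up for illustration; the real state holds four 85×85 screens.

```python
import numpy as np

def push_frame(state, frame):
    """Return a new state with the oldest frame dropped and `frame` appended."""
    return np.concatenate([state[1:], frame[np.newaxis]], axis=0)

# four 2x2 "frames" filled with 0, 1, 2, 3
state = np.stack([np.full((2, 2), k) for k in range(4)], axis=0)
new_state = push_frame(state, np.full((2, 2), 4))

print(new_state[:, 0, 0])  # frame ids now in the window: [1 2 3 4]
```

The original `state` array is left untouched, which matters because the replay buffer stores both the old and the new state of every transition.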

## Experience Replay

Before doing anything else, we'll start creating our buffer for experience replay.

Experience replay is an improvement to deep Q-learning. Essentially, you accumulate experiences (transitions made of a state, an action, a reward, the next state and a game-over flag) from gameplay. These plays can come from an expert player or from random actions. Training on samples drawn at random from the buffer is useful because it breaks the time dependencies introduced by the fact that game frames always follow a sequence.

We generate our buffer as follows:

In [None]:
MIN_REPLAY_EXPERIENCES = 20 # for real 50000, for testing 20
MAX_REPLAY_EXPERIENCES = 100 # for real 500000, for testing 100

s = 4 # number of screens in a state
experience_buffer = []
K = env.action_space.n

initial_obs = env.reset()
scaled_initial_obs = rescale_board(initial_obs)
state = np.stack([scaled_initial_obs] * s, axis=0)
for i in range(MIN_REPLAY_EXPERIENCES):
    action = np.random.choice(K)
    obs, reward, game_over, _ = env.step(action)
    next_state = update_state(state, obs)
    experience_buffer.append((state, action, reward, next_state, game_over))
    if game_over:
        obs = env.reset()
        scaled_obs = rescale_board(obs)
        state = np.stack([scaled_obs] * s, axis=0)
    else:
        state = next_state
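
The buffer here is a plain Python list, which the training loop later trims with `pop(0)` once it reaches `MAX_REPLAY_EXPERIENCES`. As an alternative sketch (not what this notebook does), a `collections.deque` with `maxlen` discards the oldest entry automatically and avoids the linear cost of `pop(0)` on large buffers. The integer "states" below are placeholders.

```python
import random
from collections import deque

MAX_REPLAY_EXPERIENCES = 100  # testing value from the notebook

# a deque with maxlen drops the oldest item automatically when it is full
replay = deque(maxlen=MAX_REPLAY_EXPERIENCES)

for t in range(250):
    # placeholder transition: (state, action, reward, next_state, game_over)
    replay.append((t, 0, 0.0, t + 1, False))

# uniform sample without replacement, as random.sample does for the list version
batch = random.sample(list(replay), 10)
print(len(replay), replay[0][0], len(batch))  # 100 150 10
```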

## Neural Network Definition

Now we define the neural network using TensorFlow. In the code below, `G` holds the rewards (the target values the network is trained to predict for the selected actions).

In [None]:
import tensorflow as tf
import random
import sys


## Training params definition
batch_size = 10 #for real 30, for testing 10
num_episodes = 25 # for real 10000 for testing 25
gamma = 0.99
episode_rewards = np.zeros(num_episodes)

epsilon = 1.0
epsilon_min = 0.1
epsilon_delta = (epsilon - epsilon_min) / 10000 # for real divide by 500000, for testing by 10000



class QNetwork:
    def __init__(self, K):
        
        conv_layer_sizes = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
        hidden_layer_sizes = [512]
        gamma = 0.99
        
        self.K = K # Number of possible actions
        
        self.X = tf.placeholder(tf.float32, shape=(None, 4, BOARD_SIZE, BOARD_SIZE), name='X') #input
        self.G = tf.placeholder(tf.float32, shape=(None,), name='G') #rewards
        self.actions = tf.placeholder(tf.int32, shape=(None,), name='actions') #actions
        
        # normalize input
        Z = self.X / 255.0
        # permute input dimensions
        Z = tf.transpose(Z, [0, 2, 3, 1])

        # each tuple is (number of filters, kernel size, stride)
        for num_filters, kernel_size, stride in conv_layer_sizes:
            Z = tf.contrib.layers.conv2d(Z, num_filters, kernel_size,
                stride, activation_fn=tf.nn.relu)

        # fully connected layers
        Z = tf.contrib.layers.flatten(Z)
        for M in hidden_layer_sizes:
            Z = tf.contrib.layers.fully_connected(Z, M)

        # final output layer (linear, so the default ReLU does not clip the Q-values)
        self.predict_op = tf.contrib.layers.fully_connected(Z, K, activation_fn=None)

        selected_action_values = tf.reduce_sum(
            self.predict_op * tf.one_hot(self.actions, self.K),
            reduction_indices=[1]
        )

        cost = tf.reduce_mean(tf.square(self.G - selected_action_values))
        self.train_op = tf.train.AdamOptimizer(1e-2).minimize(cost)
        self.cost = cost
    
    def predict(self, states):
        return self.session.run(self.predict_op, feed_dict={self.X: states})
    
    def set_session(self, session):
        self.session = session
    
    def update_model(self, states, actions, targets):
        loss, _ = self.session.run(
            [self.cost, self.train_op],
            feed_dict={
                self.X: states,
                self.G: targets,
                self.actions: actions
            }
        )
        return loss
    
    def select_action(self, obs, eps):
        if np.random.random() < eps:
            return np.random.choice(self.K)
        else:
            return np.argmax(self.predict([obs])[0])
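
To see how large the flattened feature vector is before the 512-unit layer, the spatial size can be tracked through the three convolutions. The sketch below is plain Python and assumes that `tf.contrib.layers.conv2d` is used with its default `'SAME'` padding, as in the class above; the helper name `conv_output_shape` is hypothetical.

```python
import math

def conv_output_shape(size, conv_layers, padding="SAME"):
    """Spatial size and channel count after a stack of square convolutions.

    conv_layers is a list of (num_filters, kernel_size, stride) tuples.
    """
    channels = None
    for num_filters, kernel_size, stride in conv_layers:
        if padding == "SAME":
            size = math.ceil(size / stride)
        else:  # "VALID": no padding at the borders
            size = (size - kernel_size) // stride + 1
        channels = num_filters
    return size, size, channels

layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
h, w, c = conv_output_shape(85, layers)
print(h, w, c, h * w * c)  # 11 11 64 7744
```

Under that assumption the 85×85 input shrinks to 22, then 11, then stays at 11, so the flatten step yields 7744 features.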

In [None]:
def learn(model, experience_buffer, gamma, batch_size):

    experience_sample = random.sample(experience_buffer, batch_size)
    states, actions, rewards, next_states, game_overs = map(np.array, zip(*experience_sample))
    
    # Q-learning targets: reward plus the discounted best next Q-value (no bootstrap after game over)

    possibleQs = model.predict(next_states)
    selectedQ = np.amax(possibleQs, axis=1)
    targets = rewards + np.invert(game_overs).astype(np.float32) * gamma * selectedQ

    # Update model
    loss = model.update_model(states, actions, targets)
    return loss

def play_one_episode(env, experience_buffer, model, gamma, batch_size,
                     epsilon, epsilon_delta, epsilon_min):
    
    obs = env.reset()
    scaled_obs = rescale_board(obs)
    state = np.stack([scaled_obs] * s, axis=0)
    loss = None
    episode_reward = 0
    game_over = False
    
    while not game_over:
        # Take action
        action = model.select_action(state, epsilon)
        obs, reward, game_over, _ = env.step(action)
        scaled_obs = rescale_board(obs)
        next_state = np.append(state[1:], np.expand_dims(scaled_obs, 0), axis=0)
        
        if len(experience_buffer) == MAX_REPLAY_EXPERIENCES:
            experience_buffer.pop(0)

        experience_buffer.append((state, action, reward, next_state, game_over))

        loss = learn(model, experience_buffer, gamma, batch_size)

        state = next_state
        episode_reward += reward
        epsilon = max(epsilon - epsilon_delta, epsilon_min)

    return episode_reward, epsilon
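
The target built in `learn` is the usual Q-learning target, the reward plus `gamma` times the largest predicted Q-value of the next state, with the second term dropped when the game is over. Here is a NumPy-only sketch on hand-made numbers; `td_targets` is a hypothetical name, and the Q-values are invented rather than produced by the network.

```python
import numpy as np

def td_targets(rewards, next_q_values, game_overs, gamma=0.99):
    """Targets r + gamma * max_a' Q(s', a'), without bootstrapping on terminal steps."""
    best_next_q = next_q_values.max(axis=1)
    not_done = 1.0 - game_overs.astype(np.float32)
    return rewards + not_done * gamma * best_next_q

rewards = np.array([10.0, 0.0])
next_q = np.array([[1.0, 2.0],    # next state of transition 0
                   [3.0, 0.5]])   # next state of transition 1 (terminal)
game_overs = np.array([False, True])

targets = td_targets(rewards, next_q, game_overs, gamma=0.5)
print(targets)
```

The first target is 10 + 0.5 · 2 = 11, while the second is just its reward, 0, because that transition ended the game.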



In [None]:
model = QNetwork(K=env.action_space.n)

with tf.Session() as sess:
    model.set_session(sess)
    sess.run(tf.global_variables_initializer())

    # Play a number of episodes and learn!
    for i in range(num_episodes):
        print(i)
        episode_reward, epsilon = play_one_episode(env, experience_buffer, model,
                                           gamma, batch_size, epsilon, epsilon_delta,
                                           epsilon_min)
        
        episode_rewards[i] = episode_reward

        # average over the last 100 episodes, including the current one
        last_100_avg = episode_rewards[max(0, i - 99):i + 1].mean()
        print("Episode:", i,
              "Reward:", episode_reward,
              "Avg Reward (Last 100):", "%.3f" % last_100_avg,
              "Epsilon:", "%.3f" % epsilon)
        sys.stdout.flush()
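
The `epsilon` printed above is annealed linearly, one `epsilon_delta` per environment step, until it reaches `epsilon_min`. A closed-form sketch of that schedule, using the testing values from this notebook, shows what to expect after a given number of steps. The function name `epsilon_at` and its parameter names are hypothetical.

```python
def epsilon_at(step, start=1.0, minimum=0.1, decay_steps=10000):
    """Linearly annealed exploration rate after `step` environment steps."""
    delta = (start - minimum) / decay_steps
    return max(start - step * delta, minimum)

for step in (0, 5000, 10000, 20000):
    print(step, round(epsilon_at(step), 3))
```

With these values the agent acts fully at random at step 0, explores about 55% of the time after 5000 steps, and stays at 10% from step 10000 onward.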