# Training an agent to play games

First, we load the libraries: Keras for the deep Q-network and Gym to create the environment (states, rewards, actions, etc.).
We also define the observation time (the number of timesteps the agent spends playing to collect the experience it will learn from), epsilon (the probability of choosing a random move), gamma (the discount factor, which sets how much rewards further in the future count toward the value of a move), and the minibatch size used for learning.
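
To make the role of gamma concrete, here is a minimal sketch of how a discount factor weights a reward that arrives a few steps in the future. The helper `discounted_return` is a hypothetical name introduced only for this illustration; it is not part of Keras or Gym and is not used by the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Walk backwards so that each earlier step applies one more factor of gamma
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A single reward of 1.0 that arrives three steps in the future
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.6))   # roughly 0.6**3
print(discounted_return(rewards, gamma=0.99))  # roughly 0.99**3, much closer to 1
```

With a small gamma such as 0.6 the agent gives little credit to moves whose reward comes several steps later; with gamma close to 1 it gives them almost full credit.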


In [1]:
from keras.models import Sequential
from keras.layers import Flatten, Dense
from collections import deque

import random
import numpy as np
import gym

env = gym.make('Pong-v4')

model = Sequential()
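# Keras 2 names the initializer argument `kernel_initializer` (`init` is the deprecated Keras 1 name)
model.add(Dense(20, input_shape=(2,) + env.observation_space.shape, kernel_initializer='random_uniform', activation='relu'))
model.add(Flatten())
model.add(Dense(18, kernel_initializer='random_uniform', activation='relu'))
model.add(Dense(10, kernel_initializer='random_uniform', activation='relu'))
# One linear output per action: the predicted Q-value of that action
model.add(Dense(env.action_space.n, kernel_initializer='random_uniform', activation='linear'))

model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])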


# Replay memory: stores (state, action, reward, state_new, done) tuples
D = deque()

# Hyperparameters
observetime = int(input("Number of observe time ?"))              # Number of timesteps spent acting in the game and recording the results
epsilon = float(input("Probability of choosing random move ? "))  # Probability of making a random move
gamma = float(input("Time factor ? (gamma)"))                     # Discount factor: how much rewards further in time count
mb_size = int(input(" Mini-batches size ? "))                     # Minibatch size used for learning

Using TensorFlow backend.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Number of observe time ?10000
Probability of choosing random move ? 0.7
Time factor ? (gamma)0.6
 Mini-batches size ? 200


We then start the game. At each iteration the agent decides, according to epsilon, whether to make a random move or a Q-based one; for a Q-based move it queries the model (which has not learned anything yet).
At each iteration of the game we store the current state, the action taken from that state, the reward received, the resulting state, and whether the game ended.
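
The choice between a random move and a Q-based move can be isolated in a small function. This is only a sketch: `epsilon_greedy` is a hypothetical helper, it takes a plain list of Q-values rather than a model prediction, and it uses a strict `<` comparison so that `epsilon = 0` always gives the greedy move.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: any action index, chosen uniformly
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.9, 0.3]                     # made-up Q-values for three actions
print(epsilon_greedy(q_values, epsilon=0.0))   # always the greedy action (index 1)
print(epsilon_greedy(q_values, epsilon=0.7))   # random 70% of the time
```

A high epsilon such as 0.7 means most of the stored experience comes from random play, which is useful while the model is still untrained.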

In [2]:
# Beginning of the game
observation = env.reset()
obs = np.expand_dims(observation, axis=0)  
state = np.stack((obs, obs), axis=1)
done = False
for i in range(observetime):
    if (np.random.rand()<= epsilon): # Random factor
        action = np.random.randint(0, env.action_space.n, size=1)[0] # Random move
    else:
        Q = model.predict(state) # Q-based move, predicted by the model
        action = np.argmax(Q)
    observation_new, reward, done, info = env.step(action) # Take the action; get the new observation, the reward and the done flag
    obs_new = np.expand_dims(observation_new, axis=0)
    state_new = np.append(np.expand_dims(obs_new, axis=0), state[:, :1, :], axis=1) 
    D.append((state, action, reward, state_new, done)) 
    state = state_new 
    if done:
        observation = env.reset()           # Restart the game at the end and keep its first frame
        obs = np.expand_dims(observation, axis=0)
        state = np.stack((obs, obs), axis=1)
print("Observe time finished")

Observe time finished


We draw a random sample of the stored moves and allocate the input and target arrays for the learning phase.
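
The sampling step can be tried on its own with dummy data. The sketch below assumes a bounded memory (`maxlen=1000` is an arbitrary value chosen for illustration; the notebook's `D` is unbounded) and a hypothetical helper `sample_minibatch` that guards against asking for more samples than are stored.

```python
import random
from collections import deque

# Bounded replay memory: once full, the oldest transitions are dropped.
memory = deque(maxlen=1000)

# Fill it with dummy (state, action, reward, state_new, done) tuples
for t in range(1500):
    memory.append((t, t % 3, 0.0, t + 1, False))

def sample_minibatch(memory, batch_size):
    """Draw transitions uniformly at random, without replacement."""
    batch_size = min(batch_size, len(memory))  # cannot sample more than is stored
    return random.sample(list(memory), batch_size)

batch = sample_minibatch(memory, 200)
print(len(memory), len(batch))
```

Sampling at random, rather than taking the last moves in order, breaks the strong correlation between consecutive frames of the same game.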

In [3]:
minibatch = random.sample(D, mb_size)                              
inputs_shape = (mb_size,) + state.shape[1:]
inputs = np.zeros(inputs_shape)
targets = np.zeros((mb_size, env.action_space.n))

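We loop over the sampled history and build a target Q-value for each move: the reward alone if the game ended on that move, otherwise the reward plus gamma times the best Q-value predicted for the next state. The model then learns from these updated Q-values.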
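
The target computation is the core of the update, so here it is as a standalone sketch with made-up numbers. `q_target` is a hypothetical helper written for this illustration; it works on plain Python lists instead of model predictions.

```python
def q_target(reward, next_q_values, gamma, done):
    """Target Q-value for the action that was actually taken."""
    if done:
        # Terminal move: there is no future, so the target is the reward itself
        return reward
    # Otherwise add the discounted value of the best action in the next state
    return reward + gamma * max(next_q_values)

next_q = [0.5, 2.0, -1.0]   # made-up Q-values predicted for the next state
print(q_target(1.0, next_q, gamma=0.6, done=False))  # 1.0 + 0.6 * 2.0
print(q_target(-1.0, next_q, gamma=0.6, done=True))  # just the reward
```

Only the entry for the action taken is overwritten with this target; the other entries keep the model's own predictions, so they contribute no error to the loss.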

In [4]:
print("Learning phase")
for i in range(mb_size):
    state, action, reward, state_new, done = minibatch[i]

    # Network input and current Q-value predictions for this sample
    inputs[i:i+1] = np.expand_dims(state, axis=0)
    targets[i] = model.predict(state)
    Q_sa = model.predict(state_new)

    # Overwrite the target of the action that was actually taken
    if done:
        targets[i, action] = reward # Terminal move: reward only
    else:
        targets[i, action] = reward + gamma * np.max(Q_sa) # Reward plus discounted best future Q-value

# Train once on the completed minibatch
model.train_on_batch(inputs, targets)

Learning phase


Finally, we can test the agent and see how it performs.

In [5]:
print("Play time")
observation = env.reset()
obs = np.expand_dims(observation, axis=0)
state = np.stack((obs, obs), axis=1)
done = False
tot_reward = 0.0
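while not done:
    env.render()                    # Draw the current frame
    Q = model.predict(state)        # Q-values for the current state
    action = np.argmax(Q)           # Always take the greedy action (no random moves)
    observation, reward, done, info = env.step(action)
    obs = np.expand_dims(observation, axis=0)
    state = np.append(np.expand_dims(obs, axis=0), state[:, :1, :], axis=1)    # Newest frame first, previous frame second
    tot_reward += reward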
print("End of the game")


Play time
End of the game
