# Space Invaders

## Dependencies

In [1]:
!pip install tensorflow==2.3.1 gym keras-rl2 gym[atari]


ROM instructions: https://github.com/openai/atari-py#roms

In [2]:
!pip install atari-py


In [4]:
!python -m atari_py.import_roms roms

copying space_invaders.bin from roms/Space Invaders.bin to /usr/local/lib/python3.7/dist-packages/atari_py/atari_roms/space_invaders.bin


## Exploration and baseline

In [21]:
import gym
import random
import numpy as np

In [22]:
env = gym.make("SpaceInvaders-v0")
print(env.observation_space.shape)

(210, 160, 3)


In [23]:
env.unwrapped.get_action_meanings()

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

In [27]:
EPISODES = 100
scores = []
for episode in range(1, EPISODES + 1):
    state = env.reset()
    done = False
    score = 0

    # Play one episode with uniformly random actions
    while not done:
        # env.render()
        action = random.choice(range(env.action_space.n))
        n_state, reward, done, info = env.step(action)
        score += reward

    scores.append(score)
    print(f"Episode {episode}: Reward == {score}")

# Mean score of the random policy, used as the baseline
avg = np.mean(scores)
print(f"Average reward: {avg}")
env.close()

Episode 1: Reward == 55.0
Episode 2: Reward == 640.0
Episode 3: Reward == 145.0
Episode 4: Reward == 195.0
Episode 5: Reward == 195.0
Episode 6: Reward == 55.0
Episode 7: Reward == 110.0
Episode 8: Reward == 65.0
Episode 9: Reward == 75.0
Episode 10: Reward == 35.0
Episode 11: Reward == 35.0
Episode 12: Reward == 35.0
Episode 13: Reward == 135.0
Episode 14: Reward == 135.0
Episode 15: Reward == 135.0
Episode 16: Reward == 100.0
Episode 17: Reward == 285.0
Episode 18: Reward == 150.0
Episode 19: Reward == 190.0
Episode 20: Reward == 110.0
Episode 21: Reward == 380.0
Episode 22: Reward == 175.0
Episode 23: Reward == 185.0
Episode 24: Reward == 145.0
Episode 25: Reward == 110.0
Episode 26: Reward == 210.0
Episode 27: Reward == 230.0
Episode 28: Reward == 80.0
Episode 29: Reward == 65.0
Episode 30: Reward == 210.0
Episode 31: Reward == 110.0
Episode 32: Reward == 295.0
Episode 33: Reward == 110.0
Episode 34: Reward == 110.0
Episode 35: Reward == 105.0
Episode 36: Reward == 140.0
Episode 37

So the random-policy baseline is around 150 points per episode (the 36 episodes visible above average about 154).

## Model

In [28]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Convolution2D
from tensorflow.keras.optimizers import Adam

In [29]:
def build_model(height, width, channels, actions):
    model = Sequential()
    # The leading 3 is the number of stacked frames; it must match window_length in SequentialMemory
    model.add(Convolution2D(32, (8,8), strides=(4,4), activation='relu', input_shape=(3, height, width, channels)))
    model.add(Convolution2D(64, (4,4), strides=(2,2), activation='relu'))
    model.add(Convolution2D(64, (3,3), activation='relu'))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(256, activation='relu'))
    # One linear output per action: the estimated Q-value
    model.add(Dense(actions, activation='linear'))
    return model

In [30]:
height, width, channels = env.observation_space.shape
actions = env.action_space.n
actions

6

In [40]:
model = build_model(height, width, channels, actions)

In [41]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 3, 51, 39, 32)     6176      
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 3, 24, 18, 64)     32832     
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 3, 22, 16, 64)     36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 67584)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 512)               34603520  
_________________________________________________________________
dense_6 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_7 (Dense)              (None, 6)                
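
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes unpadded ("valid") convolutions, which is the Keras default, and uses two hypothetical helpers, `conv_out` and `conv_params`, that are not part of the notebook.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch


# Frame stack size and frame shape used in the notebook
window, height, width, channels = 3, 210, 160, 3

# (kernel, stride, filters) for the three convolution layers in build_model
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

h, w, in_ch = height, width, channels
shapes, params = [], []
for kernel, stride, filters in layers:
    h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    shapes.append((window, h, w, filters))
    params.append(conv_params(kernel, in_ch, filters))
    in_ch = filters

# Flatten keeps the frame-stack axis, so it multiplies into the feature count
flat = window * h * w * in_ch
dense_512_params = flat * 512 + 512

print(shapes)
print(params)
print(flat, dense_512_params)
```

The results agree with the summary rows for the convolution, flatten, and first dense layers. Nearly all of the parameters sit in the first dense layer, because the flattened feature vector includes all three stacked frames.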

## Agent

In [42]:
from rl.agents import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

In [43]:
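def build_agent(model, actions):
    # Epsilon-greedy exploration: eps decays linearly from 1.0 to 0.1
    # over 10000 training steps; eps is fixed at 0.2 during testing
    policy = LinearAnnealedPolicy(
        EpsGreedyQPolicy(),
        attr='eps',
        value_max=1.0,
        value_min=0.1,
        value_test=0.2,
        nb_steps=10000
    )
    # Replay buffer of the last 1000 transitions; each observation stacks 3 frames
    memory = SequentialMemory(
        limit=1000,
        window_length=3
    )
    # Dueling DQN; the first 1000 steps only collect experience (no training)
    dqn = DQNAgent(
        model=model,
        memory=memory,
        policy=policy,
        enable_dueling_network=True,
        dueling_type='avg',
        nb_actions=actions,
        nb_steps_warmup=1000
    )
    return dqn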
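
The exploration schedule configured above can be reproduced in a few lines. This is a minimal sketch of a linear decay with a floor, which is how `LinearAnnealedPolicy` behaves during training as far as I can tell; `linear_eps` is a hypothetical helper, not a keras-rl function.

```python
def linear_eps(step, value_max=1.0, value_min=0.1, nb_steps=10000):
    """Decay epsilon linearly from value_max to value_min over nb_steps, then hold."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


# Epsilon at a few points of the schedule, including one past the end
for step in (0, 2500, 5000, 10000, 20000):
    print(step, round(linear_eps(step), 4))

# Average epsilon over a span of 1030 consecutive steps (1372 to 2401)
span = range(1372, 2402)
mean_eps = sum(linear_eps(s) for s in span) / len(span)
print(round(mean_eps, 6))
```

Averaging the schedule over steps 1372 to 2401 gives about 0.830215, which is the `mean_eps` value logged for episode 3 in the training output. That suggests the sketch tracks the library's schedule, although the exact set of steps keras-rl averages over is an assumption here.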

In [44]:
dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=1e-4))  # `lr` is a deprecated alias of `learning_rate`

## Train

In [45]:
dqn.fit(env, nb_steps=10000, visualize=False, verbose=2)

Training for 10000 steps ...
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
  703/10000: episode: 1, duration: 26.704s, episode steps: 703, steps per second:  26, episode reward: 45.000, mean reward:  0.064 [ 0.000, 15.000], mean action: 2.563 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
 1372/10000: episode: 2, duration: 666.274s, episode steps: 669, steps per second:   1, episode reward: 100.000, mean reward:  0.149 [ 0.000, 20.000], mean action: 2.508 [0.000, 5.000],  loss: 9.782588, mean_q: 6.822987, mean_eps: 0.893260
 2402/10000: episode: 3, duration: 1772.951s, episode steps: 1030, steps per second:   1, episode reward: 380.000, mean reward:  0.369 [ 0.000, 200.000], mean action: 2.488 [0.000, 5.000],  loss: 6.755573, mean_q: 7.143362, mean_eps: 0.830215
 3062/10000: episode: 4, duration: 1135.876s, episode steps: 660, steps per second:   1, episode reward: 110.000, mean reward:  0.167 [ 0.000, 30.000],

<tensorflow.python.keras.callbacks.History at 0x7fe05c107f90>

In [46]:
scores = dqn.test(env, nb_episodes=20, visualize=False)
np.mean(scores.history["episode_reward"])

Testing for 20 episodes ...
Episode 1: reward: 90.000, steps: 442
Episode 2: reward: 100.000, steps: 683
Episode 3: reward: 135.000, steps: 498
Episode 4: reward: 105.000, steps: 559
Episode 5: reward: 195.000, steps: 929
Episode 6: reward: 235.000, steps: 922
Episode 7: reward: 20.000, steps: 380
Episode 8: reward: 75.000, steps: 437
Episode 9: reward: 80.000, steps: 501
Episode 10: reward: 165.000, steps: 663
Episode 11: reward: 105.000, steps: 605
Episode 12: reward: 135.000, steps: 582
Episode 13: reward: 55.000, steps: 458
Episode 14: reward: 60.000, steps: 680
Episode 15: reward: 105.000, steps: 711
Episode 16: reward: 105.000, steps: 509
Episode 17: reward: 170.000, steps: 924
Episode 18: reward: 160.000, steps: 700
Episode 19: reward: 220.000, steps: 1154
Episode 20: reward: 50.000, steps: 380


118.25
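
To put the test mean in context, it can be compared with the random baseline measured earlier. The sketch below uses only the standard library. The rewards are copied from the test output above, and the baseline of 150 is the rough figure quoted in the exploration section, not a recomputed number.

```python
from statistics import mean, stdev

# Episode rewards copied from the 20 test episodes above
test_rewards = [90, 100, 135, 105, 195, 235, 20, 75, 80, 165,
                105, 135, 55, 60, 105, 105, 170, 160, 220, 50]
baseline = 150  # approximate random-policy score from the exploration section

avg = mean(test_rewards)
# Standard error of the mean: a rough measure of how noisy a 20-episode average is
sem = stdev(test_rewards) / len(test_rewards) ** 0.5

print(f"test mean: {avg:.2f} +/- {sem:.2f} (standard error)")
print(f"difference from baseline: {avg - baseline:+.2f}")
```

After 10,000 training steps the agent's test mean is still below the random baseline. Note that the test policy itself takes a random action 20% of the time (`value_test=0.2`).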