# Space Invaders

## Dependencies

In [27]:
!pip install tensorflow==2.8 keras-rl2 "gym[atari]"  # quoted so the shell does not treat [atari] as a glob

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow==2.8
  Downloading https://us-python.pkg.dev/colab-wheels/public/tensorflow/tensorflow-2.8.0%2Bzzzcolab20220506162203-cp37-cp37m-linux_x86_64.whl
     / 668.3 MB 113.2 MB/s
Collecting tf-estimator-nightly==2.8.0.dev2021122109
  Downloading tf_estimator_nightly-2.8.0.dev2021122109-py2.py3-none-any.whl (462 kB)
     |████████████████████████████████| 462 kB 6.7 MB/s
Collecting tensorboard<2.9,>=2.8
  Downloading tensorboard-2.8.0-py3-none-any.whl (5.8 MB)
     |████████████████████████████████| 5.8 MB 68.7 MB/s
Collecting keras<2.9,>=2.8.0rc0
  Downloading keras-2.8.0-py2.py3-none-any.whl (1.4 MB)
     |████████████████████████████████| 1.4 MB 55.0 MB/s
Installing collected packages: tf-estimator-nightly, tensorboard, keras, tensorflow
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.9.1
    Uninstalling tensorboard-2.9.

In [1]:
!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2  # fixes GPU issues: pins cuDNN 8.1 for CUDA 11.2, which TensorFlow 2.8 is built against

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following packages will be REMOVED:
  libcudnn8-dev
The following held packages will be changed:
  libcudnn8
The following packages will be upgraded:
  libcudnn8
1 upgraded, 0 newly installed, 1 to remove and 43 not upgraded.
Need to get 430 MB of archives.
After this operation, 3,139 MB disk space will be freed.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  libcudnn8 8.1.0.77-1+cuda11.2 [430 MB]
Fetched 430 MB in 9s (47.2 MB/s)
(Reading database ... 155632 files and directories currently installed.)
Removing libcudnn8-dev (8.0.5.39-1+cuda11.1) ...
(Reading database ... 155610 files and directories currently installed.)
Preparing to unpack .../libcudnn8_8.1.0.77-1+cuda11.2_amd64.deb ...
Unpacking libcudnn8 (8.1.0.77-1+c

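atari-py needs the Atari ROMs imported before SpaceInvaders can be loaded; see the ROM instructions at https://github.com/openai/atari-py#roms. The next cell imports them from a local roms directory.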

In [2]:
!python -m atari_py.import_roms roms

copying space_invaders.bin from roms/Space Invaders.bin to /usr/local/lib/python3.7/dist-packages/atari_py/atari_roms/space_invaders.bin


## GPU?

In [3]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.debugging.set_log_device_placement(True)

Num GPUs Available:  1


## Exploration and baseline

In [4]:
import gym
import random
import numpy as np

In [5]:
env = gym.make("SpaceInvaders-v4")
print(env.observation_space.shape)

(210, 160, 3)


In [6]:
env.unwrapped.get_action_meanings()

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

In [8]:
from numpy import clip

EPISODES = 100
scores = []
scores_clipped = []

for episode in range(1, EPISODES + 1):
    state = env.reset()
    done = False
    score = 0 
    score_clipped = 0
    
    while not done:
        # env.render()
        action = random.choice(range(env.action_space.n))  # uniformly random action
        _, reward, done, info = env.step(action)
        score += reward
        score_clipped += clip(reward, -1.0, 1.0)  # clip to [-1, 1], matching the agent's reward processor
    
    scores.append(score)
    scores_clipped.append(score_clipped)
    print(f"Episode {episode}: Reward == {score}; Clipped Reward == {score_clipped}")

avg = np.mean(scores)
avg_clipped = np.mean(scores_clipped)
print(f"Average reward: {avg}; clipped: {avg_clipped}")
env.close()

Episode 1: Reward == 120.0; Clipped Reward == 8.0
Episode 2: Reward == 195.0; Clipped Reward == 13.0
Episode 3: Reward == 180.0; Clipped Reward == 11.0
Episode 4: Reward == 295.0; Clipped Reward == 17.0
Episode 5: Reward == 120.0; Clipped Reward == 8.0
Episode 6: Reward == 120.0; Clipped Reward == 8.0
Episode 7: Reward == 365.0; Clipped Reward == 23.0
Episode 8: Reward == 150.0; Clipped Reward == 10.0
Episode 9: Reward == 485.0; Clipped Reward == 19.0
Episode 10: Reward == 100.0; Clipped Reward == 8.0
Episode 11: Reward == 260.0; Clipped Reward == 16.0
Episode 12: Reward == 50.0; Clipped Reward == 4.0
Episode 13: Reward == 135.0; Clipped Reward == 9.0
Episode 14: Reward == 135.0; Clipped Reward == 9.0
Episode 15: Reward == 110.0; Clipped Reward == 7.0
Episode 16: Reward == 30.0; Clipped Reward == 3.0
Episode 17: Reward == 105.0; Clipped Reward == 6.0
Episode 18: Reward == 210.0; Clipped Reward == 12.0
Episode 19: Reward == 185.0; Clipped Reward == 12.0
Episode 20: Reward == 110.0; Clip

So a random agent scores roughly 150 points per episode unclipped, or about 9 to 10 with rewards clipped to [-1, 1].
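
Clipping maps every positive per-step reward to 1.0 (and any negative one to -1.0), so the clipped return counts scoring steps rather than points; that is why episode 9 above has 485 raw but only 19 clipped. A minimal sketch with an illustrative reward stream (the values are made up, not taken from the game):

```python
import numpy as np


def episode_returns(rewards):
    """Return (raw, clipped) episode returns for a sequence of per-step rewards."""
    rewards = np.asarray(rewards, dtype=np.float64)
    clipped = np.clip(rewards, -1.0, 1.0)  # same clipping as the baseline loop
    return float(rewards.sum()), float(clipped.sum())


# Illustrative reward stream: four scoring steps worth different point values.
raw, clipped = episode_returns([0.0, 5.0, 10.0, 0.0, 30.0, 200.0])
print(raw, clipped)  # 245.0 4.0
```

Because high-value targets count the same as low-value ones after clipping, the clipped score is a coarser signal, but it keeps the scale of the TD targets comparable across games.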



## Model

In [9]:
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Convolution2D, Resizing, Rescaling, Reshape
from tensorflow.keras.optimizers import Adam
from tensorflow.image import rgb_to_grayscale
from tensorflow.keras.layers import Layer
from tensorflow.keras.utils import register_keras_serializable

In [22]:
@register_keras_serializable("atari")
class GrayscaleLayer(Layer):
    """Converts RGB input (last axis of size 3) to a single grayscale channel."""

    def call(self, inputs):
        return rgb_to_grayscale(inputs)


ValueError: ignored

In [23]:
def build_model(window_size, height, width, channels, actions):
    model = Sequential()
    model.add(Input(shape=(window_size, height, width, channels)))
    # Stack the frames vertically into one tall image so the 2-D image layers below
    # (grayscale, resize, rescale) process all frames in a single pass.
    model.add(Reshape((window_size * height, width, channels), name="reshape_stack"))
    model.add(GrayscaleLayer(name="grayscale"))
    model.add(Resizing((window_size * height) // 2, width // 2, name="resize_half"))
    model.add(Rescaling(1./255, name="normalize")) # normalize to [0, 1]
    # Split the tall image back into (window, height/2, width/2, 1) frames.
    model.add(Reshape((window_size, height // 2, width // 2, 1), name="reshape_unstack"))
    model.add(Convolution2D(32, (8,8), strides=(4,4), activation='relu', name="conv1"))
    model.add(Convolution2D(64, (4,4), strides=(2,2), activation='relu', name="conv2"))
    model.add(Convolution2D(64, (3,3), activation='relu', name="conv3"))
    model.add(Flatten(name="flatten"))
    model.add(Dense(512, activation='relu', name="fully_connected_1"))
    # model.add(Dense(256, activation='relu'))
    model.add(Dense(actions, activation='linear', name="output"))
    return model

In [24]:
WINDOW_SIZE = 4
height, width, channels = env.observation_space.shape
actions = env.action_space.n

In [25]:
model = build_model(WINDOW_SIZE, height, width, channels, actions)  

In [26]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 reshape_stack (Reshape)     (None, 840, 160, 3)       0         
                                                                 
 grayscale (GrayscaleLayer)  (None, 840, 160, 1)       0         
                                                                 
 resize_half (Resizing)      (None, 420, 80, 1)        0         
                                                                 
 normalize (Rescaling)       (None, 420, 80, 1)        0         
                                                                 
 reshape_unstack (Reshape)   (None, 4, 105, 80, 1)     0         
                                                                 
 conv1 (Conv2D)              (None, 4, 25, 19, 32)     2080      
                                                                 
 conv2 (Conv2D)              (None, 4, 11, 8, 64)     
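
The summary is cut off after conv2. The remaining shapes follow from the output-length formula for unpadded convolutions, floor((n - k) / s) + 1 per spatial axis, assuming the Conv2D default padding='valid'. Judging by conv1's (None, 4, 25, 19, 32) output shape, Keras treats the window axis as an extra batch dimension here, so each of the 4 frames is convolved separately with shared weights and the frames only meet at Flatten. A small sketch that recomputes the shapes and Conv2D parameter counts (conv1's 2080 matches the summary; the other values are derived, not read from the output):

```python
def conv_out(size, kernel, stride=1):
    """Output length along one axis of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * filters + filters


window, h, w = 4, 210 // 2, 160 // 2  # 4 frames of 105 x 80 after resize_half
layers = [(8, 4, 1, 32), (4, 2, 32, 64), (3, 1, 64, 64)]  # kernel, stride, in, out

shapes, params = [], []
for kernel, stride, c_in, c_out in layers:
    h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    shapes.append((window, h, w, c_out))
    params.append(conv_params(kernel, c_in, c_out))

flat = window * h * w * 64  # size of the Flatten output
print(shapes)  # [(4, 25, 19, 32), (4, 11, 8, 64), (4, 9, 6, 64)]
print(params)  # [2080, 32832, 36928]
print(flat, flat * 512 + 512)  # 13824 7078400
```

By this count the 512-unit fully_connected_1 layer alone carries about 7.08M parameters, far more than the three conv layers combined.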

## Agent

In [15]:
from rl.agents import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.processors import Processor
from numpy import clip

In [27]:
class AtariRewardProcessor(Processor):
    """Clips every reward to [-1, 1], as in the baseline loop above."""

    def process_reward(self, reward):
        return clip(reward, -1.0, 1.0)

In [42]:
def build_agent(model, actions, window_size):
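    policy = LinearAnnealedPolicy(
        EpsGreedyQPolicy(),
        attr='eps',          # anneal the epsilon attribute of EpsGreedyQPolicy
        value_max=1.0,       # start fully random
        value_min=0.1,       # floor reached after nb_steps
        value_test=0.2,      # epsilon used by dqn.test()
        nb_steps=75000       # length of the linear decay, in agent steps
    )
    memory = SequentialMemory(
        limit=1000000,             # replay capacity, in stored observations
        window_length=window_size  # consecutive observations stacked into one state
    )
    dqn = DQNAgent(
        model=model,
        memory=memory,
        policy=policy,
        processor=AtariRewardProcessor(),  # clips rewards to [-1, 1]
        enable_dueling_network=True,
        dueling_type='avg',
        nb_actions=actions,
        nb_steps_warmup=10000,  # steps collected before the first gradient update
        gamma=0.99              # discount factor
    )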
    return dqn
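
With enable_dueling_network=True, keras-rl2's DQNAgent (as I read its source) discards the model's last layer, attaches a linear Dense layer with nb_actions + 1 units to the layer before it, and treats unit 0 as the state value V(s) and the rest as advantages A(s, a). For dueling_type='avg' they are combined as Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'). The aggregation step alone, as a NumPy sketch with made-up head outputs (dueling_avg is an illustrative name, not a keras-rl2 function):

```python
import numpy as np


def dueling_avg(y):
    """Combine a (batch, n_actions + 1) head output into Q-values.

    Column 0 is V(s); columns 1: are the advantages A(s, a).
    """
    y = np.asarray(y, dtype=np.float64)
    value = y[:, :1]      # shape (batch, 1), broadcasts across actions
    advantage = y[:, 1:]  # shape (batch, n_actions)
    return value + advantage - advantage.mean(axis=1, keepdims=True)


q = dueling_avg([[2.0, 1.0, 3.0, 5.0]])
print(q)  # [[0. 2. 4.]]
```

Subtracting the mean advantage makes V and A identifiable: averaged over actions, Q always equals V(s) (here the mean of 0, 2 and 4 is 2).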

In [32]:
dqn = build_agent(model, actions, WINDOW_SIZE)
dqn.compile(Adam(learning_rate=0.00025))

## Train

In [33]:
dqn.fit(env, nb_steps=10000, visualize=False, verbose=2)

Training for 10000 steps ...


  857/10000: episode: 1, duration: 16.724s, episode steps: 857, steps per second:  51, episode reward: 17.000, mean reward:  0.020 [ 0.000,  1.000], mean action: 2.505 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --


done, took 76.677 seconds


<keras.callbacks.History at 0x7f04841117d0>
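
Two settings matter for this short run. First, keras-rl2's DQNAgent only performs gradient updates once its step count exceeds nb_steps_warmup, so with nb_steps_warmup=10000 as in build_agent above, a fit of nb_steps=10000 never leaves warmup; that matches loss: -- and mean_q: -- in the log. Second, LinearAnnealedPolicy lowers ε linearly from value_max to value_min over nb_steps and then holds it at value_min (dqn.test() uses value_test instead). A sketch of that schedule with the build_agent values as defaults (epsilon_at is a hypothetical helper, not part of keras-rl2):

```python
def epsilon_at(step, value_max=1.0, value_min=0.1, nb_steps=75_000):
    """Linearly annealed epsilon using the LinearAnnealedPolicy values from build_agent."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


for step in (0, 10_000, 37_500, 75_000, 1_500_000):
    print(f"step {step:>9,}: eps = {epsilon_at(step):.4f}")
```

At step 10,000 ε is still about 0.88, and it only reaches its 0.1 floor at step 75,000. As a sanity check, epsilon_at(10006) is about 0.879928, which agrees with the mean_eps logged for episode 14 of the RAM run below.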

In [None]:
scores = dqn.test(env, nb_episodes=20, visualize=False)
np.mean(scores.history["episode_reward"])

Testing for 20 episodes ...
Episode 1: reward: 40.000, steps: 694
Episode 2: reward: 230.000, steps: 826
Episode 3: reward: 25.000, steps: 570
Episode 4: reward: 60.000, steps: 792
Episode 5: reward: 35.000, steps: 557
Episode 6: reward: 20.000, steps: 661
Episode 7: reward: 225.000, steps: 967
Episode 8: reward: 115.000, steps: 844
Episode 9: reward: 80.000, steps: 677
Episode 10: reward: 85.000, steps: 545
Episode 11: reward: 75.000, steps: 672
Episode 12: reward: 90.000, steps: 1360
Episode 13: reward: 50.000, steps: 375
Episode 14: reward: 105.000, steps: 1188
Episode 15: reward: 80.000, steps: 683
Episode 16: reward: 15.000, steps: 497
Episode 17: reward: 340.000, steps: 988
Episode 18: reward: 10.000, steps: 734
Episode 19: reward: 65.000, steps: 435
Episode 20: reward: 20.000, steps: 407


88.25

TODO:
* Train for 1.5M to 2M steps
* Add a fourth conv layer
* Drop the grayscale layer and instead create the environment with its grayscale observation option
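
One motivation for moving grayscale conversion and downscaling out of the model is replay memory. SequentialMemory(limit=1000000) keeps one observation per step, and as I read keras-rl2's training loop it stores the observation after the agent's Processor has run. Raw (210, 160, 3) uint8 frames would then need roughly 100.8 GB for observations alone, versus about 8.4 GB for (105, 80) grayscale frames. A minimal NumPy sketch of such a preprocessing step, assuming it would be called from a Processor's process_observation (preprocess_frame is a hypothetical helper; it uses plain 2x decimation rather than the bilinear Resizing layer, so pixel values differ slightly from the model's, and the model's Input and grayscale/resize layers would need to change to match):

```python
import numpy as np

# Luma weights used by tf.image.rgb_to_grayscale.
LUMA = np.array([0.2989, 0.5870, 0.1140], dtype=np.float32)


def preprocess_frame(frame):
    """(210, 160, 3) uint8 RGB frame -> (105, 80) uint8 grayscale frame."""
    gray = frame.astype(np.float32) @ LUMA  # weighted sum over the RGB axis
    gray = gray[::2, ::2]  # keep every other row and column
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


frame = np.full((210, 160, 3), 255, dtype=np.uint8)  # an all-white test frame
small = preprocess_frame(frame)
print(small.shape, small.dtype, small.max())

# Observation storage for 1,000,000 replay entries, ignoring Python overhead.
for shape in [(210, 160, 3), (105, 80)]:
    print(shape, f"{np.prod(shape) * 1_000_000 / 1e9:.1f} GB")
```

Keeping the stored frames as uint8 and rescaling to [0, 1] inside the model (as the normalize layer already does) preserves the memory savings.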

## RAM-based approach

The Atari 2600 uses 128 bytes of RAM for its internal representation of the game state, so instead of screen pixels the agent can observe those 128 bytes directly.

In [34]:
envram = gym.make("SpaceInvaders-v4", obs_type="ram")

In [35]:
envram.observation_space.shape

(128,)

In [47]:
def build_ram_model(window_size, ram_size, actions):
    model = Sequential(name="ram_model")
    model.add(Input(shape=(window_size, ram_size)))
    model.add(Flatten(name="flatten"))
    model.add(Dense(512, activation="relu", name="fc1"))
    model.add(Dense(128, activation="relu", name="fc2"))
    model.add(Dense(actions, activation="linear", name="output"))
    return model
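
Unlike the image model, build_ram_model feeds the raw RAM bytes (integers from 0 to 255) straight into Dense layers with no Rescaling step. If scaling is wanted, one option, assumed here rather than tested in this notebook, is to add Rescaling(1./255) right after the Input layer, mirroring the image model's normalize layer. The NumPy equivalent of that scaling on a randomly generated window (scale_ram is an illustrative helper):

```python
import numpy as np


def scale_ram(window):
    """Scale a (window_size, 128) block of RAM bytes to float32 values in [0, 1]."""
    return np.asarray(window, dtype=np.float32) / 255.0


rng = np.random.default_rng(0)
window = rng.integers(0, 256, size=(3, 128), dtype=np.uint8)  # stand-in for 3 RAM snapshots
scaled = scale_ram(window)
print(scaled.shape, scaled.dtype, float(scaled.min()), float(scaled.max()))
```

RAM bytes are counters, positions and flags rather than intensities, so this is plain normalization of input range, not a statement about what each byte means.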


In [48]:
WINDOW_SIZE = 3
ram_model = build_ram_model(WINDOW_SIZE, envram.observation_space.shape[0], envram.action_space.n)

In [49]:
ram_model.summary()

Model: "ram_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 384)               0         
                                                                 
 fc1 (Dense)                 (None, 512)               197120    
                                                                 
 fc2 (Dense)                 (None, 128)               65664     
                                                                 
 output (Dense)              (None, 6)                 774       
                                                                 
Total params: 263,558
Trainable params: 263,558
Non-trainable params: 0
_________________________________________________________________
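
The parameter counts follow from weights plus biases per Dense layer: Flatten turns the (3, 128) window into 384 inputs, and each Dense layer has n_in * n_out + n_out parameters. A quick check against the summary (dense_params is an illustrative helper):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


window_size, ram_size, n_actions = 3, 128, 6
flat = window_size * ram_size  # Flatten: (3, 128) -> 384
pairs = [(flat, 512), (512, 128), (128, n_actions)]
counts = [dense_params(n_in, n_out) for n_in, n_out in pairs]
print(flat, counts, sum(counts))  # 384 [197120, 65664, 774] 263558
```

As I understand keras-rl2's dueling option, DQNAgent replaces the output layer with a linear Dense layer of 6 + 1 = 7 units, so the network actually trained has 128 * 7 + 7 = 903 parameters in its head rather than 774.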


In [50]:
ram_dqn = build_agent(ram_model, envram.action_space.n, WINDOW_SIZE)

In [51]:
ram_dqn.compile(Adam(learning_rate=0.0002))

In [52]:
history = ram_dqn.fit(envram, nb_steps=100_000, visualize=False, verbose=2)

Training for 100000 steps ...


   913/100000: episode: 1, duration: 2.158s, episode steps: 913, steps per second: 423, episode reward: 14.000, mean reward:  0.015 [ 0.000,  1.000], mean action: 2.468 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
  2291/100000: episode: 2, duration: 2.337s, episode steps: 1378, steps per second: 590, episode reward: 25.000, mean reward:  0.018 [ 0.000,  1.000], mean action: 2.502 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
  3081/100000: episode: 3, duration: 1.356s, episode steps: 790, steps per second: 583, episode reward: 15.000, mean reward:  0.019 [ 0.000,  1.000], mean action: 2.575 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
  4072/100000: episode: 4, duration: 1.690s, episode steps: 991, steps per second: 586, episode reward:  7.000, mean reward:  0.007 [ 0.000,  1.000], mean action: 2.493 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
  4844/100000: episode: 5, duration: 1.365s, episode steps: 772, steps per second: 565, episode reward: 11.000, mea

 10012/100000: episode: 14, duration: 3.567s, episode steps: 644, steps per second: 181, episode reward: 12.000, mean reward:  0.019 [ 0.000,  1.000], mean action: 2.668 [0.000, 5.000],  loss: 6355.657504, mean_q: 244.676995, mean_eps: 0.879928
 10657/100000: episode: 15, duration: 6.705s, episode steps: 645, steps per second:  96, episode reward:  8.000, mean reward:  0.012 [ 0.000,  1.000], mean action: 2.369 [0.000, 5.000],  loss: 295.729557, mean_q: 166.894596, mean_eps: 0.875992
 11386/100000: episode: 16, duration: 7.619s, episode steps: 729, steps per second:  96, episode reward: 11.000, mean reward:  0.015 [ 0.000,  1.000], mean action: 2.432 [0.000, 5.000],  loss: 175.217151, mean_q: 162.289567, mean_eps: 0.867748
 12235/100000: episode: 17, duration: 8.991s, episode steps: 849, steps per second:  94, episode reward: 10.000, mean reward:  0.012 [ 0.000,  1.000], mean action: 2.336 [0.000, 5.000],  loss: 174.615387, mean_q: 160.636613, mean_eps: 0.858280
 12861/100000: episode:

In [54]:
scores = ram_dqn.test(envram, nb_episodes=100, visualize=False)
np.mean(scores.history["episode_reward"])

Testing for 100 episodes ...
Episode 1: reward: 0.000, steps: 884
Episode 2: reward: 0.000, steps: 527
Episode 3: reward: 0.000, steps: 731
Episode 4: reward: 0.000, steps: 1055
Episode 5: reward: 1.000, steps: 1115
Episode 6: reward: 0.000, steps: 682
Episode 7: reward: 0.000, steps: 786
Episode 8: reward: 0.000, steps: 490
Episode 9: reward: 0.000, steps: 623
Episode 10: reward: 0.000, steps: 1088
Episode 11: reward: 0.000, steps: 684
Episode 12: reward: 0.000, steps: 476
Episode 13: reward: 0.000, steps: 640
Episode 14: reward: 0.000, steps: 494
Episode 15: reward: 0.000, steps: 650
Episode 16: reward: 0.000, steps: 477
Episode 17: reward: 0.000, steps: 390
Episode 18: reward: 0.000, steps: 500
Episode 19: reward: 0.000, steps: 670
Episode 20: reward: 0.000, steps: 619
Episode 21: reward: 0.000, steps: 675
Episode 22: reward: 0.000, steps: 909
Episode 23: reward: 0.000, steps: 946
Episode 24: reward: 0.000, steps: 479
Episode 25: reward: 0.000, steps: 777
Episode 26: reward: 0.000, 

0.05

In [53]:
ram_dqn.save_weights("ramdqn.h5f", overwrite=True)