<a href="https://colab.research.google.com/github/syntactic/DeepReinforcementLearning/blob/main/Group9_HW4_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1 Homework Review
This task asks you to review two other groups' homework. The goals are (1) to deepen your own understanding of the material by reviewing other groups' submissions, (2) to help those groups see how they could improve their code (and possibly their understanding of RL), and (3) to improve your own work through the feedback you receive from other groups. Step by step:
1. Coordinate with two other groups for mutual feedback. You may use the forum for this, but we also try to match groups spontaneously at each QnA session.
2. Take 15-30 minutes per submission to review it. Write down your findings as bullet points (both what your group can learn from the submission and what the other group should improve).
3. Get together and discuss this feedback with representatives of all three groups in either an in-person or a digital QnA session. Ask one of the attending tutors to act as a ‘referee’ for any discussion and questions that come up, and make sure they note that they refereed your group.
4. Note the groups and the respective tutor in the homework submission form.


### 2 DQN
This homework asks you to implement DQN on the [ALE Breakout v5 Atari Game](https://gymnasium.farama.org/environments/atari/breakout/).
* Achieving a reasonable score takes four to twenty-four hours on a reasonable computer system with a dedicated GPU. Plan accordingly and build for efficiency and throughput (make use of vectorized environments!).
* If you do not have the necessary compute resources available, you may instead solve the discrete version of Lunar Lander (also available in Gymnasium!). Note, however, that you cannot be awarded an outstanding in this case.
* Make use of the [Implementing DQN from scratch video](https://www.youtube.com/playlist?list=PLPitqsshnVV8YOGE1r-Sm2zVtuSsGNL_G) series if necessary.
* Use a delayed target network, prefill the experience replay buffer (ERP), and add measures against overestimation bias (e.g. Double DQN).
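
As a rough illustration of the last point, the sketch below computes Double DQN targets: the online network picks the next action, and the delayed target network evaluates it. It uses NumPy instead of TensorFlow so that it runs on its own. The function name `double_dqn_targets`, the argument names, and the default `gamma` are assumptions made for this example and are not taken from the homework or the video series.

```python
import numpy as np

def double_dqn_targets(rewards, terminateds, q_online_next, q_target_next, gamma=0.99):
    """Return r + gamma * Q_target(s', argmax_a Q_online(s', a)) for non-terminal steps.

    q_online_next and q_target_next are assumed to have shape (batch, num_actions).
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    not_done = 1.0 - np.asarray(terminateds, dtype=np.float32)
    q_online_next = np.asarray(q_online_next, dtype=np.float32)
    q_target_next = np.asarray(q_target_next, dtype=np.float32)

    # action selection uses the online network ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... while action evaluation uses the delayed target network
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]

    # terminal transitions bootstrap from nothing, so only the reward remains
    return rewards + gamma * not_done * next_values
```

Plain DQN would instead take `q_target_next.max(axis=1)`, which selects and evaluates the action with the same network and tends to overestimate.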

In [None]:
pip install gymnasium[atari,accept-rom-license]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gymnasium[accept-rom-license,atari]
  Downloading gymnasium-0.28.1-py3-none-any.whl (925 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 925.5/925.5 kB 17.1 MB/s eta 0:00:00
Collecting jax-jumpy>=1.0.0 (from gymnasium[accept-rom-license,atari])
  Downloading jax_jumpy-1.0.0-py3-none-any.whl (20 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium[accept-rom-license,atari])
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Collecting autorom[accept-rom-license]~=0.4.2 (from gymnasium[accept-rom-license,atari])
  Downloading AutoROM-0.4.2-py3-none-any.whl (16 kB)
Collecting shimmy[atari]<1.0,>=0.1.0 (from gymnasium[accept-rom-license,atari])
  Downloading Shimmy-0.2.1-py3-none-any.whl (25 kB)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.4.2->gymnasium[accept-rom-license,atari])
  Download

In [None]:
import gymnasium as gym
import tensorflow as tf

In [None]:
env = gym.make("ALE/Breakout-v5")
obs, _ = env.reset()

In [None]:
class ExperienceReplayBuffer:

    def __init__(self, max_size: int, environment_name: str, parallel_game_unrolls: int, observation_preprocessing_function: callable, unroll_steps:int):
        self.max_size = max_size
        self.environment_name = environment_name
        self.parallel_game_unrolls = parallel_game_unrolls
        self.unroll_steps = unroll_steps
        self.observation_preprocessing_function = observation_preprocessing_function
        self.envs = gym.vector.make(environment_name, num_envs=self.parallel_game_unrolls)
        # read the action count from the vectorized env, which must exist first
        self.num_possible_actions = self.envs.single_action_space.n
        self.current_states, _ = self.envs.reset()
        self.data = []

    def fill_with_samples(self, dqn_network, epsilon: float):
        # adds new samples into the ERP

        states_list = []
        actions_list = []
        rewards_list = []
        terminateds_list = []
        next_states_list = []

        
        for i in range(self.unroll_steps):
            actions = self.sample_epsilon_greedy(dqn_network, epsilon)
            # take the action and get s' and r
            # the vectorized env expects a NumPy array, not a TensorFlow tensor
            next_states, rewards, terminateds, _, _ = self.envs.step(actions.numpy())
            # store observation, action, reward, next_observation into ERP container
            #
            states_list.append(self.current_states)
            actions_list.append(actions)
            rewards_list.append(rewards)
            terminateds_list.append(terminateds)
            next_states_list.append(next_states)
            self.current_states = next_states

        def data_generator():
            for states_batch, actions_batch, rewards_batch, terminateds_batch, next_states_batch in \
                zip(states_list, actions_list, rewards_list, terminateds_list, next_states_list):
                for game_idx in range(self.parallel_game_unrolls):
                    state = states_batch[game_idx,:,:,:]
                    action = actions_batch[game_idx]
                    reward = rewards_batch[game_idx]
                    terminated = terminateds_batch[game_idx]
                    next_state = next_states_batch[game_idx,:,:,:]
                    # yield once per game and step (inside the inner loop)
                    yield (state, action, reward, next_state, terminated)
        
        dataset_tensor_specs = (tf.TensorSpec(shape=(210,160,3), dtype=tf.uint8), 
                                tf.TensorSpec(shape=(), dtype=tf.int32), 
                                tf.TensorSpec(shape=(), dtype=tf.float32), 
                                tf.TensorSpec(shape=(210,160,3), dtype=tf.uint8),
                                tf.TensorSpec(shape=(), dtype=tf.bool))

        new_samples_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=dataset_tensor_specs)
        
        new_samples_dataset = new_samples_dataset.map(lambda state, action, reward, next_state, terminated:(self.observation_preprocessing_function(state), action, reward, self.observation_preprocessing_function(next_state), terminated))
        new_samples_dataset = new_samples_dataset.cache().shuffle(buffer_size=self.unroll_steps * self.parallel_game_unrolls, reshuffle_each_iteration=True)

        # iterate once so that cache() actually stores the preprocessed samples
        for _ in new_samples_dataset:
            pass

        self.data.append(new_samples_dataset)

        if(len(self.data) * self.parallel_game_unrolls * self.unroll_steps > self.max_size):
            self.data.pop(0)


    def create_dataset(self):
        ERP_dataset = tf.data.Dataset.sample_from_datasets(self.data, weights=[1/float(len(self.data)) for _ in self.data], stop_on_empty_dataset = False)
        return ERP_dataset

    def sample_epsilon_greedy(self, dqn_network, epsilon):
        observations = self.observation_preprocessing_function(self.current_states)
        q_values = dqn_network(observations) # tensor float 32 shape(parallel_game_unrolls, num_actions)
        greedy_actions = tf.argmax(q_values, axis=1)
        random_actions = tf.random.uniform(shape=(self.parallel_game_unrolls,), minval=0, maxval=self.num_possible_actions, dtype=tf.int64)
        epsilon_sampling = tf.random.uniform(shape=(self.parallel_game_unrolls,), minval=0, maxval=1, dtype=tf.float32) > epsilon
        actions = tf.where(epsilon_sampling, greedy_actions, random_actions)
        return actions

def observation_preprocessing_function(observation):
    # resize the observation to 84x84 pixels and scale uint8 values from [0, 255] to roughly [-1, 1)
    observation = tf.image.resize(observation, size=(84, 84))
    observation = tf.cast(observation, dtype=tf.float32) / 128.0 - 1.0
    return observation

def create_dqn_model(num_actions: int):
    # create the input for the functional Keras model API
    input_layer = tf.keras.Input(shape=(84,84,3), dtype=tf.float32)

    # padding='same' keeps the spatial size so the residual additions have matching shapes
    x = tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding='same', activation='relu')(input_layer)
    x = tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding='same', activation='relu')(x) + x # residual connections
    x = tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding='same', activation='relu')(x) + x

    x = tf.keras.layers.MaxPool2D(pool_size=2)(x)

    x = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')(x) + x
    x = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')(x) + x

    x = tf.keras.layers.MaxPool2D(pool_size=2)(x)

    x = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x) + x
    x = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x) + x

    x = tf.keras.layers.MaxPool2D(pool_size=2)(x)

    x = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x) + x
    x = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x) + x
    x = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x) + x

    x = tf.keras.layers.GlobalAvgPool2D()(x)

    x = tf.keras.layers.Dense(units=64, activation='relu')(x) + x
    x = tf.keras.layers.Dense(units=num_actions, activation='linear')(x)

    model = tf.keras.Model(inputs=input_layer, outputs=x)

    return model
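
The selection rule in `sample_epsilon_greedy` can be checked in isolation without a network or an environment. The sketch below mirrors it in NumPy under a few assumptions: the function name `epsilon_greedy` and the explicit `rng` argument are made up for this example, and the mask is written as "explore if the draw is below epsilon", which matches the original `> epsilon` greedy mask except at the exact boundary value.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick one action per environment from Q-values of shape (num_envs, num_actions)."""
    q_values = np.asarray(q_values)
    num_envs, num_actions = q_values.shape

    # greedy choice: the action with the highest Q-value in each environment
    greedy_actions = np.argmax(q_values, axis=1)
    # exploratory choice: a uniformly random action in each environment
    random_actions = rng.integers(0, num_actions, size=num_envs)

    # with probability epsilon an environment explores, otherwise it acts greedily
    explore = rng.random(num_envs) < epsilon
    return np.where(explore, random_actions, greedy_actions)
```

With `epsilon=0.0` the result is purely greedy, and with `epsilon=1.0` every action is drawn at random, which gives two easy sanity checks before plugging the rule into training.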




In [None]:
actions = env.action_space
for i in range(1000):
  action = actions.sample()
  obs, reward, terminated, truncated, info = env.step(action)
  if terminated:
    print(i)
    env.reset()

238
373
582
748
880
