## Title 
🆓 Exercise: Q-Learning using DQN

## Description
The aim of this exercise is to implement a Deep Q-Network (DQN) for a pre-defined reinforcement learning environment from OpenAI Gym. We will use the FrozenLake environment, which is registered as FrozenLake-v1 in recent Gym releases (older releases called it FrozenLake-v0). This is the same environment used in the previous session; refer to that session for details about it.

<img src="../fig/fig.png" style="width: 500px;">


## Instructions
- Initialize an environment using a pre-defined environment from OpenAI Gym.
- Get the number of possible actions in the environment. 
- Define a simple feed-forward neural network with your choice of hidden layers and nodes.
- Define the action sampling policy. We will be using the Epsilon Greedy policy.
- Initialize a sequential memory to store the agent's experiences, which are the training data for the DQN.
- Define the DQN and compile it with the Adam optimizer.
- Fit and test the DQN model.


## Hints

<a href="https://gym.openai.com/docs/#environments" target="_blank">gym.make(environment-name)</a> Access a pre-defined environment

Env.action_space.n : Returns the number of discrete actions

Env.observation_space.n : Returns the number of discrete states

<a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense" target="_blank">Dense()</a> A regular densely-connected NN layer

<a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten" target="_blank">Flatten()</a> Flattens the input. 

<a href="https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam" target="_blank">Adam()</a> Optimizer that implements the Adam algorithm

<a href="https://keras-rl.readthedocs.io/en/latest/agents/dqn/" target="_blank">DQNAgent()</a> Initializes the DQN Agent

SequentialMemory() A Keras-RL class (rl.memory.SequentialMemory) that provides a fast, efficient data structure for storing the agent’s experiences
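
To make the idea of an experience store concrete, here is a minimal sketch of a fixed-size replay memory in plain Python. It is not how rl.memory.SequentialMemory is implemented; the class name SimpleReplayMemory, the Transition tuple, and the sample method are hypothetical and only illustrate the behaviour: new experiences are appended, the oldest ones are discarded once the limit is reached, and training batches are drawn at random.

```python
import random
from collections import deque, namedtuple

# One stored experience. The field names are illustrative assumptions.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class SimpleReplayMemory:
    """Hypothetical stand-in for a replay buffer (not the Keras-RL class)."""

    def __init__(self, limit):
        # A deque with maxlen drops the oldest item once the limit is reached.
        self.buffer = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored experiences.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Store five experiences in a memory that can only hold three.
memory_sketch = SimpleReplayMemory(limit=3)
for step in range(5):
    memory_sketch.append(
        state=step, action=0, reward=0.0, next_state=step + 1, done=False
    )

batch = memory_sketch.sample(batch_size=2)
```

After the loop only the three most recent experiences remain, which mirrors the role of the limit argument used later in this exercise.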

In [43]:
# Run this once and then comment it to ensure it does not run multiple times
# !pip install keras-rl2

In [44]:
# Import necessary libraries
import numpy as np
import gym
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
# from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.legacy import Adam

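# Workaround: keras-rl2 reads tf.keras.__version__ on import,
# so make sure the attribute is defined before importing from rl
tf.keras.__version__ = tf.__version__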

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

In [45]:
# Initialize an environment using a pre-defined environment from OpenAI Gym
# The environment used here is 'FrozenLake-v1'
env = gym.make("FrozenLake-v1")

# Get the number of actions within the environment
nb_actions = env.action_space.n


In [46]:
# Define a Feed-Forward Neural Network 

# Initialize a keras sequential model
model = Sequential()

# Flatten the input, which has shape (1,) + the shape of the environment's
# observation space; the leading 1 matches the memory's window_length
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))

# Add Dense layers with ReLU activation
# The number of hidden layers and the number of nodes in each layer are your choice

model.add(Dense(128, activation = "relu"))
model.add(Dense(256, activation = "relu"))
model.add(Dense(128, activation = "relu"))

# Add an output layer with number of nodes as the number of actions
model.add(Dense(nb_actions))


In [47]:
# Define the policy to sample the actions
# We will be using the epsilon-greedy policy: with probability epsilon (ε) the agent explores
# by choosing a random action, and with probability (1 - ε) it exploits by choosing the action
# with the highest estimated Q-value (action-value) for the current state
policy = EpsGreedyQPolicy()

# To store the agent's experiences, initialize SequentialMemory with limit=500000 and window_length=1
memory = SequentialMemory(limit=500000, window_length=1)
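
The epsilon-greedy rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the code inside EpsGreedyQPolicy; the function name epsilon_greedy_action and the example Q-values are made up for the demonstration.

```python
import random


def epsilon_greedy_action(q_values, eps, rng=random):
    """Pick an action index from a list of Q-values (illustrative helper)."""
    # Explore: with probability eps, choose any action uniformly at random.
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    # Exploit: otherwise choose the action with the highest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up Q-values for a state with four actions.
q_values = [0.1, 0.7, 0.2, 0.0]

greedy_action = epsilon_greedy_action(q_values, eps=0.0)  # always exploits
random_action = epsilon_greedy_action(q_values, eps=1.0)  # always explores
```

With eps=0.0 the helper always returns the index of the largest Q-value, and with eps=1.0 it ignores the Q-values entirely; values in between trade off exploration against exploitation.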

### **DQN AGENT**

<img src="./images/dqn.png" alt="DQN Agent" style="width:700px">
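
As a rough illustration of what a DQN agent optimizes, the sketch below computes a one-step Q-learning target and a soft target-network update in plain Python. It rests on several assumptions: the helper names td_target and soft_update are hypothetical, the discount factor gamma is an example value, the target shown is the plain Q-learning one rather than the double-DQN variant, and reading a target_model_update value below 1 as a soft-update rate reflects our understanding of Keras-RL, which is worth confirming against the installed version.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a')."""
    # No future reward is added once the episode has ended.
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the online weight."""
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_weights, online_weights)
    ]


# Made-up numbers, chosen only to keep the arithmetic easy to follow.
target = td_target(reward=1.0, next_q_values=[0.2, 0.5], done=False, gamma=0.9)
terminal_target = td_target(reward=2.0, next_q_values=[9.0], done=True)
updated = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
```

The network is trained so that its Q-value for the action taken moves toward this target, while the slowly updated target weights keep the target from shifting too quickly.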

In [48]:
# Initialize the DQNAgent with the neural network model, nb_actions as the number of actions in the environment,
# memory as the sequential memory defined above, nb_steps_warmup as 100, policy as the epsilon-greedy policy
# defined above, and target_model_update as 1e-2

# Deep Q-Network (DQN) agent
# rl.agents.dqn.DQNAgent(model, policy=None, test_policy=None, enable_double_dqn=True, enable_dueling_network=False, dueling_type='avg')
dqn = DQNAgent(model, policy=policy, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=100, target_model_update=1e-2)

# Compile the DQN with the Adam optimizer, a learning rate of 1e-3, and mse as the metric
dqn.compile(optimizer=Adam(learning_rate=1e-3), metrics=["mse"])

# Fit the DQN by passing the environment and setting nb_steps to 5000
# You can optionally visualize training (visualize=True), which calls the environment's render function
# However, this slows down training and is not recommended for EdStem
# To see the complete training details, set verbose to 2
dqn.fit(env, nb_steps=5000, verbose=2);




Training for 5000 steps ...



TypeError: float() argument must be a string or a real number, not 'dict'
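
The TypeError above is most likely caused by a version mismatch rather than by the network itself: newer Gym releases return (observation, info) from reset() and five values from step(), whereas keras-rl2 appears to expect the older interface in which reset() returns only the observation and step() returns four values. One possible workaround, sketched below under that assumption, is a thin adapter around the environment. The class names OldGymAPIAdapter and FakeNewAPIEnv are hypothetical, and the fake environment exists only so the sketch runs without Gym installed.

```python
class OldGymAPIAdapter:
    """Hypothetical adapter exposing the older reset()/step() return shapes."""

    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        # Forward everything else (action_space, observation_space, ...)
        # to the wrapped environment. The guard avoids infinite recursion
        # if 'env' itself has not been set yet.
        if name == "env":
            raise AttributeError(name)
        return getattr(self.env, name)

    def reset(self, **kwargs):
        result = self.env.reset(**kwargs)
        # Newer API: (observation, info). Keep only the observation.
        if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
            return result[0]
        return result

    def step(self, action):
        result = self.env.step(action)
        # Newer API: (obs, reward, terminated, truncated, info).
        if len(result) == 5:
            obs, reward, terminated, truncated, info = result
            return obs, reward, terminated or truncated, info
        return result


class FakeNewAPIEnv:
    """Stand-in that mimics the newer return shapes, for demonstration only."""

    def reset(self, **kwargs):
        return 0, {"prob": 1}

    def step(self, action):
        return 1, 0.0, False, True, {"prob": 1.0}


wrapped = OldGymAPIAdapter(FakeNewAPIEnv())
first_obs = wrapped.reset()
obs, reward, done, info = wrapped.step(0)
```

In the notebook this would be used as env = OldGymAPIAdapter(gym.make("FrozenLake-v1")). With such an adapter each observation is again a single integer, so the Flatten layer's input shape would be (1,) + env.observation_space.shape. Installing a Gym release that still uses the older interface is another option; neither approach has been verified here against specific library versions.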

In [None]:
# Test your model by passing the environment and running for 10 episodes
dqn.test(env, nb_episodes=10)
