# Applied Deep Learning Tutorial 
contact: Mark.schutera@kit.edu


# Deep Reinforcement Learning with Deep-Q-Network (DQN)

## Introduction
In this tutorial, you will implement a Deep Q-Network (DQN) that learns to solve a classic control task. The approach builds upon the DeepMind paper Playing Atari with Deep Reinforcement Learning [paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), which first introduced the notion of a Deep Q-Network.
<img src="graphics/atari_play.png" width="700"><br>
<center> Fig. 1: Breakout environment of the Atari game </center>

## Core idea
As you probably remember from the lecture, we can learn a policy for our environment through trial and error and store it in a Q-matrix. Here, a deep neural network takes the place of this Q-matrix. After training, it gives us an estimate of the expected reward when taking action a in state s: Q(s, a).
Playing the action with the maximum Q-value in any given state is the same as playing optimally with respect to these estimates, that is, following a full exploitation strategy.
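
The greedy choice described above can be sketched in a few lines of NumPy. The Q-values here are placeholder numbers rather than network outputs, and `select_action` as well as its `epsilon` parameter (the probability of taking a random action instead of the greedy one) are names assumed for this illustration.

```python
import numpy as np

def select_action(q_values, epsilon, rng):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # explore: pick any action uniformly at random
        return int(rng.integers(len(q_values)))
    # exploit: pick the action with the highest estimated Q-value
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([-3.2, -1.5, -2.7])  # placeholder Q(s, a) for three actions

greedy_action = select_action(q_values, epsilon=0.0, rng=rng)  # full exploitation
```

With `epsilon=0.0` the function always exploits; with `epsilon=1.0` it always explores.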

## OpenAI Gym
[OpenAI Gym](https://gym.openai.com/docs/) is a library that can simulate a large number of reinforcement learning environments, including Atari games (these need to be installed additionally). You will need Python 3.5+.

>pip install gym


## Taking our pendulum on a first ride
Now that you have gym installed, you can load the 'Pendulum-v1' environment, one of Gym's classic control tasks.


In [None]:
# Import the gym module
import gym


In [None]:
# Load the environment
env = gym.make('Pendulum-v1', render_mode='human')

# Reset the environment; with gym >= 0.26 this returns the first observation and an info dict
observation, info = env.reset()

for _ in range(100000):
    # Perform a random action; step returns the new observation, the reward and whether the episode is over
    
    '''
    Implement to sample a random action from the action space within the loaded environment
    action = env.action_space.sample()
    '''
    
    # with gym >= 0.26, step returns separate 'terminated' and 'truncated' flags
    observation, reward, terminated, truncated, info = env.step(action)
    print('observation: ', observation, 'reward: ', reward)
    if terminated or truncated: break
    env.render()

env.close()


This already looks nice, yet the actions are random. It is time to better understand our environment and to implement our Deep-Q-Network.


In [None]:
# import the necessary libraries
import random
import pickle
from collections import deque

import gym
import gym.spaces
import gym.wrappers
import numpy as np
from keras.layers import Flatten, Dense
from keras import backend as K
from keras.models import Sequential, Model, load_model
from keras import optimizers

## Observation
The observation is made up of cos(theta), sin(theta) and theta dot. 
Theta is normalized between -pi and pi.
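
Since the observation encodes the angle through its cosine and sine, the angle itself can be recovered with `arctan2`, which returns values in the same -pi to pi range. A minimal sketch, where `angle_from_observation` is a helper name made up for this illustration and the observation is built by hand rather than taken from the environment:

```python
import numpy as np

def angle_from_observation(observation):
    """Recover theta (radians, within [-pi, pi]) from [cos(theta), sin(theta), theta_dot]."""
    cos_theta, sin_theta, _theta_dot = observation
    return np.arctan2(sin_theta, cos_theta)

# hand-built observation for a pendulum at 90 degrees, rotating at 0.5 rad/s
theta = np.pi / 2
observation = np.array([np.cos(theta), np.sin(theta), 0.5])
recovered = angle_from_observation(observation)
```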

## Action
The action is the joint effort (torque), a continuous value between -2.0 and +2.0.
Write a function to discretize this continuous action space.
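
One way to approach the discretization is an evenly spaced grid over the torque range, with each continuous action mapped to its nearest grid point. The sketch below is only one possible solution; `make_bins` and `nearest_bin` are illustrative names, and the bin count of 5 is chosen just to keep the values readable.

```python
import numpy as np

def make_bins(num_bins, low=-2.0, high=2.0):
    """Evenly spaced torque values covering [low, high], endpoints included."""
    return np.linspace(low, high, num_bins)

def nearest_bin(action, bins):
    """Index of the bin whose value is closest to the continuous action."""
    return int(np.abs(bins - action).argmin())

bins = make_bins(5)            # -2, -1, 0, 1, 2
idx = nearest_bin(0.8, bins)   # 0.8 is closest to 1.0, which sits at index 3
```

Note that with an odd number of bins, zero torque is one of the available actions; with an even number of bins it is not.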


In [None]:
# define the action space
def create_action_bins(num_action_bins):
    '''
    Using linspace of numpy implement the action bins for the pendulum, when given the number of the action bins as argument
    actionbins = 
    '''
    
    return actionbins

# find the action bin closest to the given continuous action
# (discretization of the continuous action space)
def find_actionbin(action, actionbins):
    idx = (np.abs(actionbins - action)).argmin()

    return idx

## Reward
The reward is defined as
> -(theta^2 + 0.1 x theta_dt^2 + 0.001 x action^2)

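What is the lowest possible reward (the highest cost)? And what is the highest possible reward (the lowest cost)?

Lowest reward, with theta = pi, theta_dt = 8 and action = 2:

-(pi^2 + 0.1 x 8^2 + 0.001 x 2^2) = -16.2736044

Highest reward, with theta = 0, theta_dt = 0 and action = 0:

-(0^2 + 0.1 x 0^2 + 0.001 x 0^2) = 0

From this reward function, what is the goal of the agent?
In essence, the goal is to remain at zero angle (upright), with the least rotational velocity and the least effort.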

For a hint have a look at the [wiki](https://github.com/openai/gym/wiki).
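
The two bounds of the reward can be checked numerically. This is a small sketch that simply evaluates the formula above; `pendulum_reward` is a name chosen for this illustration, and the extreme values (angle pi, angular velocity 8, torque 2) are the ones used in the calculation above.

```python
import numpy as np

def pendulum_reward(theta, theta_dt, action):
    """Reward as defined above: the negative of a quadratic cost."""
    return -(theta**2 + 0.1 * theta_dt**2 + 0.001 * action**2)

worst = pendulum_reward(np.pi, 8.0, 2.0)   # largest angle, speed and torque
best = pendulum_reward(0.0, 0.0, 0.0)      # upright, at rest, no torque
```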

In [None]:
def train_model(memory, gamma=0.9):
    for state, action, reward, state_new in memory:
        
        # flatten state to make it compatible to our neural network
        flat_state_new = np.reshape(state_new, [1, 3])
        flat_state = np.reshape(state, [1, 3])

        # determine estimated reward given state s' after action a, 
        # combination of observed and predicted exploited reward.
        '''
        Implement the Q-function for the flat_state_new
        target = 
        '''
        
        # determine current expected agent rewards
        targetfull = model.predict(flat_state)
        
        # overwrite the expected reward of the taken action with the computed target
        targetfull[0][action] = target
        
        # Fit model based on emulation and prediction
        model.fit(flat_state, targetfull, epochs=1, verbose=0)

## Deep Q Model

As a reminder, this is our Q function, where s' is the state reached after taking action a in state s.
> Q(s, a) = r + gamma max_a'(Q(s', a'))

The input of our neural network, our generalizable Q-matrix, will be the observation, that is, the state of the pendulum.
The output will be one estimate of the expected reward per action. Gamma is the discount factor applied to the predicted reward in the next state, and r is the reward observed for the transition.
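
To make the update rule concrete, here is a minimal sketch of how a single target value could be computed. The numbers are placeholders standing in for an observed reward and for the Q-values a network might predict for the next state; `q_target` is a name assumed for this illustration.

```python
import numpy as np

def q_target(reward, next_q_values, gamma):
    """Bootstrapped target: observed reward plus discounted best next-state value."""
    return reward + gamma * np.max(next_q_values)

# placeholder numbers, not outputs of a trained model
reward = -1.0
next_q_values = np.array([-4.0, -2.0, -3.0])  # Q(s', a') for three action bins
target = q_target(reward, next_q_values, gamma=0.9)
```

Here the best next-state value is -2.0, so the target is -1.0 + 0.9 x (-2.0).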

For our first network we will implement a DQN with keras:

- Input of 3 values (the observation)
- Hidden layer with 128 ReLU units
- Hidden layer with 64 ReLU units
- Output layer with one unit per action bin and linear activation function
- Adam optimizer with learning rate 0.0002, beta_1 0.9 and beta_2 0.999
- Mean squared error loss
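
To get a feeling for the shapes involved, the forward pass of such a network can be written out in plain NumPy. This is deliberately not the Keras model asked for in the exercise: it is a stand-in with random, untrained weights, and `init_params` and `forward` are hypothetical helper names.

```python
import numpy as np

def init_params(num_outputs, rng):
    """Random weights for a 3 -> 128 -> 64 -> num_outputs network (illustration only)."""
    sizes = [3, 128, 64, num_outputs]
    return [(rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, states):
    """ReLU on the hidden layers, linear activation on the output layer."""
    x = states
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
params = init_params(num_outputs=20, rng=rng)
q_values = forward(params, np.zeros((1, 3)))  # one state in, 20 Q-values out
```

The output has one row per input state and one column per action bin, which is why the training code indexes the prediction as `targetfull[0][action]`.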

In [None]:
# Define the Deep-Q-Network in keras

def build_model(num_output_nodes):
    model = Sequential()

    '''
    Design your model here (find description above)
    model. 
    model.
    model.
    '''
    
    '''
    Define the optimizer (find description above)
    adam = optimizers.
    '''
    
    '''
    define the loss (hint: search the web for keras loss and see description above)
    model.compile(loss=, optimizer=adam)
    '''

    return model

In [None]:
def run_episodes(epsilon, gamma, training_iterations, sequence_iterations):
    
    # These are hyperparameters to play around with, after your first run.
    epsilon_decay = 0.9999
    epsilon_min = 0.02
    steps_per_sequence = 250

    for epoch in range(0, training_iterations // sequence_iterations):
        for sequence_id in range(0, sequence_iterations):
            state, _ = env.reset()  # gym >= 0.26 returns (observation, info)
            memory = deque()
            
            total_reward = 0
            
            # Easy implementation of decaying exploration
            '''
            Given epsilon, epsilon_decay and epsilon_min implement a simple method for a decaying epsilon.
            
            Decay epsilon here, while bigger than epsilon_min
            if 
                            
            '''
            
            for i in range(0, steps_per_sequence):
                    
                '''
                Given epsilon implement a simple method for trading off exploration and exploitation
            
                Hint: For random values (use numpy) smaller than epsilon we want to explore
                
                if 
                    action = 
                
                Hint: For random values larger than epsilon we want to exploit
                
                else:
                    flat_state = np.reshape(state, [1, 3])
                    action = 
                '''

                # determine action
                actionbin = find_actionbin(action, actionbinslist)
                action = actionbinslist[actionbin]
                action = np.array([action])

                # emulate the action in the simulation and observe the transition 
                # as well as the reward
                observation, reward, terminated, truncated, _ = env.step(action)  # gym >= 0.26 returns five values
                total_reward += reward

                state_new = observation

                '''
                save transitions into memory
                Hint: The memory is used as an argument for the train_model function.
                
                memory.append((_, _, _, _))
                
                '''
                
                state = state_new
                
            # train the model on the transitions collected in memory
            train_model(memory, gamma)
            
            print(epoch , ' epoch', sequence_id, ' sequence. Average reward = ', total_reward / steps_per_sequence, '. epsilon = ', epsilon)

           

## Function for running the policy of our DQN after loading or training


In [None]:
def play_game(rounds):
    state, _ = env.reset()  # gym >= 0.26 returns (observation, info)
    totalreward = 0

    for _ in range(0, rounds):
        # Rendering for visualization
        env.render()

        flat_state = np.reshape(state, [1, 3])
        actionbin = np.argmax(model.predict(flat_state))

        action = actionbinslist[actionbin]
        action = np.array([action])

        observation, reward, terminated, truncated, _ = env.step(action)  # gym >= 0.26 returns five values

        totalreward += reward

        state_new = observation
        state = state_new
        
    return totalreward

## Train the DQN


In [None]:
env = gym.make('Pendulum-v1')

# These are hyperparameters to play around with

# iterations
training_iterations = 1000
sequence_iterations = 25

# epsilon (setting exploitation vs exploration)
epsilon = 1

# gamma (importance of predicted estimated reward)
gamma = 0.9

# Discretization settings for the action space
num_action_bins = 20
actionbinslist = create_action_bins(num_action_bins)



# Build model
model = build_model(num_action_bins)

run_episodes(epsilon, gamma, training_iterations, sequence_iterations)

'''
training takes super long, this is not efficient at all, how can we bypass this?
Hint: See cells in run pretrained model.

'''
   

In [None]:

# Save the trained model (architecture and weights)

print('saving model')
model.save('pendulum_model_juno_' + str(training_iterations) + '.h5')
print('model saved')


In [None]:
# Evaluate performance on 10 test runs with 100 steps each
trarray = []
rounds = 100
for i in range(10):
    trarray.append(play_game(rounds))
    print(i, ' sequence. Running average test reward = ', np.average(trarray)/rounds, 'Average test reward of this run = ', trarray[-1]/rounds)
    

## Run pretrained model

In case you have already trained a model, or want to load the pretrained model for sanity checking, use the following script (make sure you have executed the necessary cells, starting with the imports).

- How does the performance change with the amount of trained iterations?
- How can we measure performance to begin with?
- Is it sufficient to start the play_game function a single time? 
- How can we make sure that the evaluation is meaningful?
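
As a starting point for thinking about these questions, one option is to summarize the total rewards of several independent runs instead of looking at a single one. The sketch below assumes a list of totals such as those returned by repeated `play_game` calls; `summarize_returns` is a hypothetical helper, and the three numbers are placeholders, not measured results.

```python
import numpy as np

def summarize_returns(returns):
    """Mean, sample standard deviation and standard error over evaluation runs."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    std = returns.std(ddof=1)
    sem = std / np.sqrt(len(returns))
    return mean, std, sem

# placeholder totals, NOT measured results; in practice collect play_game outputs
example_returns = [-10.0, -12.0, -14.0]
mean, std, sem = summarize_returns(example_returns)
```

The standard error shrinks as more runs are collected, which gives a rough indication of how far the mean can be trusted.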



In [None]:
'''

-
-
-
-

'''


In [None]:
env = gym.make('Pendulum-v1', render_mode='human')

actionbinslist = create_action_bins(20)

# 'pendulum_model_[iterationstrained].h5' 
# iterationstrained: 100, 1000, 10000
model = load_model('pendulum_model_1000.h5')

'''
Is the next line meaningful for evaluation, if not, what can we do?

play_game(rounds=250)
'''    
    
env.close()

## Next steps to take it from here

- Implement a skip frame approach
- Experiment with the discretization of the action bins (e.g. advantages and disadvantages of triadisation)
- Experiment with exploration vs exploitation
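
For the skip frame idea, a possible starting point is a helper that repeats the chosen action for several simulation steps and sums up the rewards. This is a sketch under stated assumptions: `CountingEnv` is a tiny made-up environment with a simplified three-value `step` return, used only so the example runs on its own, and `step_with_frame_skip` is a hypothetical name.

```python
class CountingEnv:
    """Tiny stand-in environment (hypothetical), only here to make the sketch runnable."""
    def __init__(self, episode_length=10):
        self.t = 0
        self.episode_length = episode_length

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        # the observation is the step counter, the reward is a constant -1 per step
        return self.t, -1.0, done

def step_with_frame_skip(env, action, skip):
    """Repeat one action for up to `skip` steps (skip >= 1) and accumulate the reward."""
    total_reward = 0.0
    for _ in range(skip):
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return observation, total_reward, done

toy_env = CountingEnv()
observation, total_reward, done = step_with_frame_skip(toy_env, action=0.0, skip=4)
```

The agent then only needs to pick an action once every `skip` steps, at the price of coarser control.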

Send your extended ipynb file to mark.schutera@kit.edu for the chance to get bonus points for the final project.