## DQN (Deep Q-Network)
### Function approximation Q-learning
This tutorial walks through the implementation of deep Q networks (DQNs), 
an RL method which applies the function approximation capabilities of deep neural networks
to problems in reinforcement learning.
The model in this tutorial closely follows the work described in the paper 
[Human-level control through deep reinforcement learning](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html?foxtrotcallback=true), written by Volodomyr Mnih. 

To keep these chapters runnable 
by as many people as possible, 
on as many machines as possible,
and with as few headaches as possible, 
we have so far avoided dependencies on external libraries 
(besides mxnet, numpy and matplotlib). 
However, in this case, we'll need to import the [OpenAI Gym](https://gym.openai.com/docs).
That's because in reinforcement learning, 
instead of drawing examples from a data structure, 
our data comes from interactions with an environment. 
In this chapter, our environemnts will be classic Atari video games.

## Preliminaries
The following code clones and installs the OpenAI gym.
`git clone https://github.com/openai/gym ; cd gym ; pip install -e .[all]` 
Full documentation for the gym can be found on [at this website](https://gym.openai.com/).
If you want to see reasonable results before the sun sets on your AI career,
we suggest running these experiments on a server equipped with GPUs.

In [1]:
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
from __future__ import print_function
import os
import random
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
import gym
import math
from collections import namedtuple
from ale_python_interface import ALEInterface
import logging

import time
f = open('results_diff0.txt','w')



### Summary of the algorithm
#### Collect samples
At the beginning of each episode (one round of the game), 
reset the environment to its initial state using `env.reset()`. 
At each time step ``t``, the environment is at `current_state`.
With probability $\epsilon$, apply a random action.
Otherwise, apply $argmax_a~ Q(\phi($ `current_state` $),a,\theta)$,
where $Q$ is parameterized by paramters $\theta$ and $\phi(\cdot)$ is preprocessor.
Pass the action through `env.step(action)` to receive next frame, reward and whether the game terminates.
Append this frame to the end of the `current_state` and construct `next_state` while removeing $frame(t-12)$.
Store the tuple $(\phi($ `current_state` $), action, reward, \phi($ `next_ state` $))$ in the replay buffer.

#### Update Network
* Draw batches of tuples from the replay buffer: $(\phi,r,a,\phi')$.
* Define the following loss
$$\Large(\small Q(\phi,a,\theta)-r-max_{a'}Q(\phi',a',\theta^-)\Large)^2$$
* Where $\theta^-$ is the parameter of the target network.( Set $Q(\phi',a',\theta^-)$ to zero if $\phi$ is the preprocessed termination state). 
* Update the $\theta$
* Update the $\theta^-$ once in a while








## Set the hyper-parameters

In [2]:
class Options:
    def __init__(self):
        #Articheture
        self.batch_size = 32 # The size of the batch to learn the Q-function
        self.image_size = 84 # Resize the raw input frame to square frame of size 80 by 80 
        #Trickes
        self.replay_buffer_size = 1000000 # The size of replay buffer; set it to size of your memory (.5M for 50G available memory)
        self.learning_frequency = 4 # With Freq of 1/4 step update the Q-network
        self.skip_frame = 4 # Skip 4-1 raw frames between steps
        self.internal_skip_frame = 4 # Skip 4-1 raw frames between skipped frames
        self.frame_len = 4 # Each state is formed as a concatination 4 step frames [f(t-12),f(t-8),f(t-4),f(t)]
        self.Target_update = 10000 # Update the target network each 10000 steps
        self.epsilon_min = 0.1 # Minimum level of stochasticity of policy (epsilon)-greedy
        self.annealing_end = 1000000. # The number of step it take to linearly anneal the epsilon to it min value
        self.gamma = 0.99 # The discount factor
        self.replay_start_size = 50000 # Start to backpropagated through the network, learning starts
        self.no_op_max = 30 / self.skip_frame # Run uniform policy for first 30 times step of the beginning of the game
        
        #otimization
        self.num_episode = 20000 # Number episode to run the algorithm
        self.lr = 0.00025 # RMSprop learning rate
        self.gamma1 = 0.95 # RMSprop gamma1
        self.gamma2 = 0.95 # RMSprop gamma2
        self.rms_eps = 0.01 # RMSprop epsilon bias
        self.ctx = mx.gpu() # Enables gpu if available, if not, set it to mx.cpu()
opt = Options()

env_name = 'Assault_diff0' # Set the desired environment
#env = gym.make(env_name)
#num_action = env.action_space.n # Extract the number of available action from the environment setting

manualSeed = 1 # random.randint(1, 10000) # Set the desired seed to reproduce the results
mx.random.seed(manualSeed)
attrs = vars(opt)
print (', '.join("%s: %s" % item for item in attrs.items()))

replay_buffer_size: 1000000, annealing_end: 1000000.0, Target_update: 10000, gamma1: 0.95, frame_len: 4, internal_skip_frame: 4, ctx: gpu(0), skip_frame: 4, batch_size: 32, learning_frequency: 4, lr: 0.00025, num_episode: 20000, image_size: 84, epsilon_min: 0.1, replay_start_size: 50000, rms_eps: 0.01, no_op_max: 7, gamma: 0.99, gamma2: 0.95


In [3]:
ale = ALEInterface()
ale.setInt('random_seed', 1)
ale.loadROM('ROMS/space_invaders.bin')
num_action = len(ale.getLegalActionSet())

In [4]:
print(ale.getAvailableDifficulties())
ale.setDifficulty(0)

[0 1]


### Define the DQN model
The network is constructed as three CNN layers and a fully connected added on the top. Furthermore, the optimizer is assigned to the parameters.

In [5]:
DQN = gluon.nn.Sequential()
with DQN.name_scope():
    #first layer
    DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8,strides = 4,padding = 0))
    DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))
    DQN.add(gluon.nn.Activation('relu'))
    #second layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4,strides = 2))
    DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))
    DQN.add(gluon.nn.Activation('relu'))
    #tird layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3,strides = 1))
    DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))
    DQN.add(gluon.nn.Activation('relu'))
    DQN.add(gluon.nn.Flatten())
    #fourth layer
    DQN.add(gluon.nn.Dense(512,activation ='relu'))
    #fifth layer
    DQN.add(gluon.nn.Dense(num_action,activation ='relu'))

dqn = DQN
dqn.collect_params().initialize(mx.init.Normal(0.02), ctx=opt.ctx)
DQN_trainer = gluon.Trainer(dqn.collect_params(),'RMSProp', \
                          {'learning_rate': opt.lr ,'gamma1':opt.gamma1,'gamma2': opt.gamma2,'epsilon': opt.rms_eps,'centered' : True})
dqn.collect_params().zero_grad()


In [6]:
Target_DQN = gluon.nn.Sequential()
with Target_DQN.name_scope():
    #first layer
    Target_DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8,strides = 4,padding = 0))
    Target_DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))
    Target_DQN.add(gluon.nn.Activation('relu'))
    #second layer
    Target_DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4,strides = 2))
    Target_DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))
    Target_DQN.add(gluon.nn.Activation('relu'))
    #tird layer
    Target_DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3,strides = 1))
    Target_DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))
    Target_DQN.add(gluon.nn.Activation('relu'))
    Target_DQN.add(gluon.nn.Flatten())
    #fourth layer
    Target_DQN.add(gluon.nn.Dense(512,activation ='relu'))
    #fifth layer
    Target_DQN.add(gluon.nn.Dense(num_action,activation ='relu'))
target_dqn = Target_DQN
target_dqn.collect_params().initialize(mx.init.Normal(0.02), ctx=opt.ctx)


### Replay buffer
Replay buffer store the tuple of : `state`, action , `next_state`, reward , done.

In [7]:
Transition = namedtuple('Transition',('state', 'action', 'reward','done','initial_state'))
class Replay_Buffer():
    def __init__(self, replay_buffer_size):
        self.replay_buffer_size = replay_buffer_size
        self.memory = []
        self.position = 0
    def push(self, *args):
        if len(self.memory) < self.replay_buffer_size:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.replay_buffer_size
    def sample(self, batch_size,batch_state,batch_state_next,batch_reward,batch_action,batch_done):
        for i in range(batch_size):
            j = random.randint(opt.frame_len-1,len(self.memory)-2)
            for jj in range(opt.frame_len):
                batch_state[i,opt.frame_len-1-jj] =  self.memory[j-jj].state[0].as_in_context(opt.ctx).astype('float32')/255.
                batch_state_next[i,opt.frame_len-1-jj] = self.memory[j-jj+1].state.as_in_context(opt.ctx)[0].astype('float32')/255.
                if self.memory[j-jj].initial_state:
                    for kk in range(opt.frame_len-jj-1):
                        batch_state[i,opt.frame_len-2-kk] =  self.memory[j-jj].state[0].as_in_context(opt.ctx).astype('float32')/255.
                        batch_state_next[i,opt.frame_len-2-kk] = self.memory[j-jj].state.as_in_context(opt.ctx)[0].astype('float32')/255.
                    break
            if self.memory[j].done:
                batch_state_next[i,opt.frame_len-1] = batch_state[i,opt.frame_len-1]
            batch_reward[i] = self.memory[j].reward
            batch_action[i] = self.memory[j].action
            batch_done[i] = self.memory[j].done

### Preprocess frames
* Take a frame, average over the `RGB` filter and append it to the `state` to construct `next_state`
* Clip the reward
* Render the frames

In [8]:
def preprocess(raw_frame, currentState = None, initial_state = False):
    raw_frame = nd.array(raw_frame,mx.cpu())
    raw_frame = nd.reshape(nd.mean(raw_frame, axis = 2),shape = (raw_frame.shape[0],raw_frame.shape[1],1))
    raw_frame = mx.image.imresize(raw_frame,  opt.image_size, opt.image_size)
    raw_frame = nd.transpose(raw_frame, (2,0,1))
    raw_frame = raw_frame.astype('float32')/255.
    if initial_state == True:
        state = raw_frame
        for _ in range(opt.frame_len-1):
            state = nd.concat(state , raw_frame, dim = 0)
    else:
        state = mx.nd.concat(currentState[1:,:,:], raw_frame, dim = 0)
    return state, raw_frame

def rew_clipper(rew):
    if rew>0.:
        return 1.
    elif rew<0.:
        return -1.
    else:
        return 0

def renderimage(next_frame):
    if render_image:
        plt.imshow(next_frame);
        plt.show()
        display.clear_output(wait=True)
        time.sleep(.1)
        
l2loss = gluon.loss.L2Loss(batch_axis=0)


### Initialize arrays

In [None]:
frame_counter = 0. # Counts the number of steps so far
annealing_count = 0. # Counts the number of annealing steps
epis_count = 0. # Counts the number episodes so far
replay_memory = Replay_Buffer(opt.replay_buffer_size) # Initialize the replay buffer
tot_clipped_reward = np.zeros(opt.num_episode) 
tot_reward = np.zeros(opt.num_episode)
moving_average_clipped = 0.
moving_average = 0.

### Train the model

In [None]:
render_image = False # Whether to render Frames and show the game
batch_state = nd.empty((opt.batch_size,opt.frame_len,opt.image_size,opt.image_size), opt.ctx)
batch_state_next = nd.empty((opt.batch_size,opt.frame_len,opt.image_size,opt.image_size), opt.ctx)
batch_reward = nd.empty((opt.batch_size),opt.ctx)
batch_action = nd.empty((opt.batch_size),opt.ctx)
batch_done = nd.empty((opt.batch_size),opt.ctx)
initial_state = True
while epis_count < opt.num_episode:
    cum_clipped_reward = 0
    cum_reward = 0
    ale.reset_game()
    next_frame = ale.getScreenRGB()
    state, current_frame = preprocess(next_frame, initial_state = True)
    t = 0.
    done = False
    initial_state = True


    while not done:
        mx.nd.waitall()
        previous_state = state
        # show the frame
        renderimage(next_frame)
        sample = random.random()
        if frame_counter > opt.replay_start_size:
            annealing_count += 1
        if frame_counter == opt.replay_start_size:
            logging.error('annealing and laerning are started ')
            
            
        
        eps = np.maximum(1.-annealing_count/opt.annealing_end,opt.epsilon_min)
        effective_eps = eps
        if t < opt.no_op_max:
            effective_eps = 1.
        
        # epsilon greedy policy
        if sample < effective_eps:
            action = random.randint(0, num_action - 1)
        else:
            data = state.reshape([1,opt.frame_len,opt.image_size,opt.image_size]).as_in_context(opt.ctx)
            action = int(nd.argmax(dqn(data),axis=1).as_in_context(mx.cpu()).asscalar())
        
        # Skip frame
        rew = 0
        for skip in range(opt.skip_frame-1):
            reward = ale.act(action)
            next_frame = ale.getScreenRGB()
            done = ale.game_over()
            renderimage(next_frame)
            cum_clipped_reward += rew_clipper(reward)
            rew += reward
            for internal_skip in range(opt.internal_skip_frame-1):
                reward = ale.act(action)
                done = ale.game_over()
                cum_clipped_reward += rew_clipper(reward)
                rew += reward
                
        reward = ale.act(action)
        next_frame_new = ale.getScreenRGB()
        done = ale.game_over()        
        renderimage(next_frame)
        cum_clipped_reward += rew_clipper(reward)
        rew += reward
        cum_reward += rew
        
        # Reward clipping
        reward = rew_clipper(rew)
        next_frame = np.maximum(next_frame_new,next_frame)
        replay_memory.push((current_frame*255.).astype(np.uint8)\
                           ,action,reward,done, initial_state)
        state, current_frame = preprocess(next_frame, state)
        initial_state = False
        # Train
        if frame_counter > opt.replay_start_size:        
            if frame_counter % opt.learning_frequency == 0:
                replay_memory.sample(opt.batch_size,batch_state,batch_state_next,batch_reward,batch_action,batch_done)
                with autograd.record():
                    Q_sp = nd.max(target_dqn(batch_state_next),axis = 1)
                    Q_sp = Q_sp*(nd.ones(opt.batch_size,ctx = opt.ctx)-batch_done)
                    Q_s_array = dqn(batch_state)
                    Q_s = nd.pick(Q_s_array,batch_action,1)
                    loss = nd.mean(l2loss(Q_s ,  (batch_reward + opt.gamma *Q_sp)))
                loss.backward()
                DQN_trainer.step(opt.batch_size)
                
        

        
        t += 1
        frame_counter += 1
        
        # Save the model and update Target model
        if frame_counter > opt.replay_start_size:
            if frame_counter % opt.Target_update == 0 :
                check_point = frame_counter / (opt.Target_update *100)
                fdqn = './target_%s_%d' % (env_name,int(check_point))
                dqn.save_params(fdqn)
                target_dqn.load_params(fdqn, opt.ctx)
        if done:
            if epis_count % 10. == 0. :
                results = 'epis[%d],eps[%f],durat[%d],fnum=%d, cum_cl_rew = %d, cum_rew = %d,tot_cl = %d , tot = %d'\
                  %(epis_count,eps,t+1,frame_counter,cum_clipped_reward,cum_reward,moving_average_clipped,moving_average)
                print(results)
                f.write('\n' + results)
    epis_count += 1
    tot_clipped_reward[int(epis_count)-1] = cum_clipped_reward
    tot_reward[int(epis_count)-1] = cum_reward
    if epis_count > 50.:
        moving_average_clipped = np.mean(tot_clipped_reward[int(epis_count)-1-50:int(epis_count)-1])
        moving_average = np.mean(tot_reward[int(epis_count)-1-50:int(epis_count)-1])
f.close()
from tempfile import TemporaryFile
outfile = TemporaryFile()
outfile_clip = TemporaryFile()
np.save(outfile, moving_average)
np.save(outfile_clip, moving_average_clipped)

epis[0],eps[1.000000],durat[360],fnum=359, cum_cl_rew = 29, cum_rew = 640,tot_cl = 0 , tot = 0
epis[10],eps[1.000000],durat[114],fnum=2130, cum_cl_rew = 6, cum_rew = 80,tot_cl = 0 , tot = 0
epis[20],eps[1.000000],durat[165],fnum=3676, cum_cl_rew = 8, cum_rew = 120,tot_cl = 0 , tot = 0
epis[30],eps[1.000000],durat[141],fnum=5580, cum_cl_rew = 9, cum_rew = 100,tot_cl = 0 , tot = 0
epis[40],eps[1.000000],durat[156],fnum=7106, cum_cl_rew = 10, cum_rew = 150,tot_cl = 0 , tot = 0
epis[50],eps[1.000000],durat[196],fnum=8429, cum_cl_rew = 16, cum_rew = 215,tot_cl = 0 , tot = 0
epis[60],eps[1.000000],durat[147],fnum=9753, cum_cl_rew = 9, cum_rew = 135,tot_cl = 9 , tot = 137
epis[70],eps[1.000000],durat[149],fnum=11357, cum_cl_rew = 11, cum_rew = 140,tot_cl = 9 , tot = 135
epis[80],eps[1.000000],durat[108],fnum=13075, cum_cl_rew = 6, cum_rew = 50,tot_cl = 8 , tot = 130
epis[90],eps[1.000000],durat[239],fnum=14605, cum_cl_rew = 24, cum_rew = 560,tot_cl = 8 , tot = 122
epis[100],eps[1.000000],dura

ERROR:root:annealing and laerning are started 


epis[310],eps[0.999965],durat[91],fnum=50036, cum_cl_rew = 8, cum_rew = 75,tot_cl = 9 , tot = 128
epis[320],eps[0.998369],durat[206],fnum=51632, cum_cl_rew = 11, cum_rew = 155,tot_cl = 9 , tot = 134
epis[330],eps[0.996534],durat[205],fnum=53467, cum_cl_rew = 12, cum_rew = 145,tot_cl = 9 , tot = 144
epis[340],eps[0.995072],durat[232],fnum=54929, cum_cl_rew = 18, cum_rew = 245,tot_cl = 9 , tot = 143
epis[350],eps[0.993691],durat[163],fnum=56310, cum_cl_rew = 12, cum_rew = 150,tot_cl = 9 , tot = 132
epis[360],eps[0.992414],durat[262],fnum=57587, cum_cl_rew = 21, cum_rew = 310,tot_cl = 8 , tot = 113
epis[370],eps[0.990488],durat[182],fnum=59513, cum_cl_rew = 13, cum_rew = 195,tot_cl = 9 , tot = 123
epis[380],eps[0.988971],durat[139],fnum=61030, cum_cl_rew = 7, cum_rew = 70,tot_cl = 8 , tot = 121
epis[390],eps[0.987009],durat[223],fnum=62992, cum_cl_rew = 19, cum_rew = 300,tot_cl = 9 , tot = 141
epis[400],eps[0.985614],durat[84],fnum=64387, cum_cl_rew = 8, cum_rew = 90,tot_cl = 9 , tot = 14

epis[1130],eps[0.869064],durat[282],fnum=180937, cum_cl_rew = 25, cum_rew = 545,tot_cl = 11 , tot = 162
epis[1140],eps[0.867309],durat[226],fnum=182692, cum_cl_rew = 13, cum_rew = 190,tot_cl = 11 , tot = 165
epis[1150],eps[0.865840],durat[122],fnum=184161, cum_cl_rew = 8, cum_rew = 120,tot_cl = 10 , tot = 157
epis[1160],eps[0.863967],durat[280],fnum=186034, cum_cl_rew = 22, cum_rew = 290,tot_cl = 10 , tot = 168
epis[1170],eps[0.862289],durat[88],fnum=187712, cum_cl_rew = 3, cum_rew = 45,tot_cl = 10 , tot = 157
epis[1180],eps[0.860477],durat[155],fnum=189524, cum_cl_rew = 5, cum_rew = 40,tot_cl = 11 , tot = 159
epis[1190],eps[0.858392],durat[294],fnum=191609, cum_cl_rew = 27, cum_rew = 620,tot_cl = 11 , tot = 162
epis[1200],eps[0.856944],durat[235],fnum=193057, cum_cl_rew = 18, cum_rew = 215,tot_cl = 10 , tot = 165
epis[1210],eps[0.855699],durat[147],fnum=194302, cum_cl_rew = 9, cum_rew = 85,tot_cl = 10 , tot = 147
epis[1220],eps[0.854193],durat[179],fnum=195808, cum_cl_rew = 15, cum_re

epis[1930],eps[0.742959],durat[89],fnum=307042, cum_cl_rew = 5, cum_rew = 60,tot_cl = 10 , tot = 142
epis[1940],eps[0.741487],durat[247],fnum=308514, cum_cl_rew = 22, cum_rew = 500,tot_cl = 9 , tot = 122
epis[1950],eps[0.740088],durat[125],fnum=309913, cum_cl_rew = 8, cum_rew = 140,tot_cl = 9 , tot = 131
epis[1960],eps[0.738568],durat[217],fnum=311433, cum_cl_rew = 11, cum_rew = 150,tot_cl = 9 , tot = 127
epis[1970],eps[0.736997],durat[147],fnum=313004, cum_cl_rew = 13, cum_rew = 135,tot_cl = 10 , tot = 133
epis[1980],eps[0.735366],durat[140],fnum=314635, cum_cl_rew = 6, cum_rew = 70,tot_cl = 10 , tot = 132
epis[1990],eps[0.733600],durat[148],fnum=316401, cum_cl_rew = 8, cum_rew = 55,tot_cl = 11 , tot = 147
epis[2000],eps[0.731976],durat[210],fnum=318025, cum_cl_rew = 22, cum_rew = 300,tot_cl = 10 , tot = 131
epis[2010],eps[0.730362],durat[120],fnum=319639, cum_cl_rew = 9, cum_rew = 110,tot_cl = 11 , tot = 140
epis[2020],eps[0.728784],durat[205],fnum=321217, cum_cl_rew = 14, cum_rew = 

epis[2730],eps[0.619662],durat[126],fnum=430339, cum_cl_rew = 4, cum_rew = 50,tot_cl = 9 , tot = 122
epis[2740],eps[0.618049],durat[144],fnum=431952, cum_cl_rew = 5, cum_rew = 35,tot_cl = 9 , tot = 128
epis[2750],eps[0.616640],durat[102],fnum=433361, cum_cl_rew = 5, cum_rew = 55,tot_cl = 10 , tot = 137
epis[2760],eps[0.614973],durat[85],fnum=435028, cum_cl_rew = 5, cum_rew = 35,tot_cl = 10 , tot = 141
epis[2770],eps[0.613239],durat[121],fnum=436762, cum_cl_rew = 4, cum_rew = 40,tot_cl = 10 , tot = 143
epis[2780],eps[0.611540],durat[126],fnum=438461, cum_cl_rew = 6, cum_rew = 80,tot_cl = 9 , tot = 135
epis[2790],eps[0.609972],durat[150],fnum=440029, cum_cl_rew = 10, cum_rew = 95,tot_cl = 9 , tot = 135
epis[2800],eps[0.608590],durat[93],fnum=441411, cum_cl_rew = 4, cum_rew = 45,tot_cl = 10 , tot = 142
epis[2810],eps[0.606993],durat[117],fnum=443008, cum_cl_rew = 3, cum_rew = 20,tot_cl = 9 , tot = 134
epis[2820],eps[0.605517],durat[101],fnum=444484, cum_cl_rew = 4, cum_rew = 35,tot_cl = 9

epis[3530],eps[0.487917],durat[268],fnum=562084, cum_cl_rew = 16, cum_rew = 185,tot_cl = 10 , tot = 124
epis[3540],eps[0.486231],durat[141],fnum=563770, cum_cl_rew = 9, cum_rew = 105,tot_cl = 11 , tot = 133
epis[3550],eps[0.484750],durat[117],fnum=565251, cum_cl_rew = 7, cum_rew = 110,tot_cl = 10 , tot = 131
epis[3560],eps[0.483086],durat[121],fnum=566915, cum_cl_rew = 5, cum_rew = 40,tot_cl = 10 , tot = 134
epis[3570],eps[0.481535],durat[77],fnum=568466, cum_cl_rew = 5, cum_rew = 40,tot_cl = 10 , tot = 134
epis[3580],eps[0.480155],durat[85],fnum=569846, cum_cl_rew = 6, cum_rew = 70,tot_cl = 10 , tot = 132
epis[3590],eps[0.478563],durat[215],fnum=571438, cum_cl_rew = 15, cum_rew = 235,tot_cl = 9 , tot = 120
epis[3600],eps[0.476898],durat[124],fnum=573103, cum_cl_rew = 7, cum_rew = 60,tot_cl = 10 , tot = 124
epis[3610],eps[0.474860],durat[257],fnum=575141, cum_cl_rew = 17, cum_rew = 425,tot_cl = 10 , tot = 144
epis[3620],eps[0.473368],durat[106],fnum=576633, cum_cl_rew = 1, cum_rew = 5,

epis[4330],eps[0.354848],durat[152],fnum=695153, cum_cl_rew = 14, cum_rew = 140,tot_cl = 10 , tot = 133
epis[4340],eps[0.353324],durat[162],fnum=696677, cum_cl_rew = 12, cum_rew = 125,tot_cl = 10 , tot = 137
epis[4350],eps[0.351457],durat[207],fnum=698544, cum_cl_rew = 17, cum_rew = 280,tot_cl = 11 , tot = 147
epis[4360],eps[0.349727],durat[172],fnum=700274, cum_cl_rew = 11, cum_rew = 110,tot_cl = 12 , tot = 154
epis[4370],eps[0.347854],durat[199],fnum=702147, cum_cl_rew = 10, cum_rew = 130,tot_cl = 12 , tot = 166
epis[4380],eps[0.346199],durat[118],fnum=703802, cum_cl_rew = 15, cum_rew = 175,tot_cl = 13 , tot = 167
epis[4390],eps[0.344535],durat[188],fnum=705466, cum_cl_rew = 14, cum_rew = 200,tot_cl = 12 , tot = 164
epis[4400],eps[0.342998],durat[83],fnum=707003, cum_cl_rew = 4, cum_rew = 20,tot_cl = 12 , tot = 159
epis[4410],eps[0.341525],durat[247],fnum=708476, cum_cl_rew = 18, cum_rew = 265,tot_cl = 11 , tot = 145
epis[4420],eps[0.339682],durat[264],fnum=710319, cum_cl_rew = 14, c

epis[5130],eps[0.217537],durat[388],fnum=832464, cum_cl_rew = 25, cum_rew = 380,tot_cl = 12 , tot = 175
epis[5140],eps[0.215712],durat[103],fnum=834289, cum_cl_rew = 1, cum_rew = 5,tot_cl = 12 , tot = 176
epis[5150],eps[0.213923],durat[266],fnum=836078, cum_cl_rew = 22, cum_rew = 320,tot_cl = 11 , tot = 146
epis[5160],eps[0.212320],durat[236],fnum=837681, cum_cl_rew = 12, cum_rew = 165,tot_cl = 11 , tot = 153
epis[5170],eps[0.210437],durat[141],fnum=839564, cum_cl_rew = 10, cum_rew = 125,tot_cl = 10 , tot = 156
epis[5180],eps[0.208621],durat[145],fnum=841380, cum_cl_rew = 8, cum_rew = 80,tot_cl = 11 , tot = 168
epis[5190],eps[0.206544],durat[272],fnum=843457, cum_cl_rew = 17, cum_rew = 225,tot_cl = 11 , tot = 162
epis[5200],eps[0.204711],durat[118],fnum=845290, cum_cl_rew = 8, cum_rew = 115,tot_cl = 12 , tot = 176
epis[5210],eps[0.202429],durat[226],fnum=847572, cum_cl_rew = 14, cum_rew = 175,tot_cl = 12 , tot = 189
epis[5220],eps[0.200534],durat[170],fnum=849467, cum_cl_rew = 3, cum_r

epis[5930],eps[0.100000],durat[225],fnum=991597, cum_cl_rew = 12, cum_rew = 160,tot_cl = 12 , tot = 179
epis[5940],eps[0.100000],durat[216],fnum=993723, cum_cl_rew = 11, cum_rew = 125,tot_cl = 13 , tot = 200
epis[5950],eps[0.100000],durat[205],fnum=995516, cum_cl_rew = 7, cum_rew = 85,tot_cl = 12 , tot = 185
epis[5960],eps[0.100000],durat[150],fnum=997216, cum_cl_rew = 4, cum_rew = 30,tot_cl = 12 , tot = 182
epis[5970],eps[0.100000],durat[129],fnum=999275, cum_cl_rew = 14, cum_rew = 200,tot_cl = 12 , tot = 178
epis[5980],eps[0.100000],durat[249],fnum=1001139, cum_cl_rew = 23, cum_rew = 325,tot_cl = 12 , tot = 179
epis[5990],eps[0.100000],durat[159],fnum=1003305, cum_cl_rew = 12, cum_rew = 165,tot_cl = 12 , tot = 181
epis[6000],eps[0.100000],durat[213],fnum=1005189, cum_cl_rew = 8, cum_rew = 75,tot_cl = 12 , tot = 187
epis[6010],eps[0.100000],durat[208],fnum=1007452, cum_cl_rew = 7, cum_rew = 90,tot_cl = 14 , tot = 217
epis[6020],eps[0.100000],durat[304],fnum=1009707, cum_cl_rew = 14, c

epis[6720],eps[0.100000],durat[274],fnum=1160274, cum_cl_rew = 18, cum_rew = 455,tot_cl = 14 , tot = 215
epis[6730],eps[0.100000],durat[216],fnum=1162801, cum_cl_rew = 14, cum_rew = 160,tot_cl = 14 , tot = 209
epis[6740],eps[0.100000],durat[261],fnum=1165092, cum_cl_rew = 16, cum_rew = 620,tot_cl = 14 , tot = 206
epis[6750],eps[0.100000],durat[150],fnum=1167308, cum_cl_rew = 10, cum_rew = 85,tot_cl = 14 , tot = 214
epis[6760],eps[0.100000],durat[215],fnum=1169592, cum_cl_rew = 13, cum_rew = 350,tot_cl = 14 , tot = 207
epis[6770],eps[0.100000],durat[299],fnum=1171996, cum_cl_rew = 15, cum_rew = 380,tot_cl = 14 , tot = 221
epis[6780],eps[0.100000],durat[149],fnum=1174421, cum_cl_rew = 11, cum_rew = 125,tot_cl = 13 , tot = 217
epis[6790],eps[0.100000],durat[304],fnum=1176577, cum_cl_rew = 19, cum_rew = 285,tot_cl = 13 , tot = 214
epis[6800],eps[0.100000],durat[251],fnum=1178733, cum_cl_rew = 14, cum_rew = 200,tot_cl = 13 , tot = 207
epis[6810],eps[0.100000],durat[205],fnum=1181208, cum_cl

epis[7510],eps[0.100000],durat[203],fnum=1348457, cum_cl_rew = 19, cum_rew = 235,tot_cl = 17 , tot = 258
epis[7520],eps[0.100000],durat[186],fnum=1350827, cum_cl_rew = 16, cum_rew = 160,tot_cl = 17 , tot = 268
epis[7530],eps[0.100000],durat[238],fnum=1353430, cum_cl_rew = 21, cum_rew = 300,tot_cl = 17 , tot = 283
epis[7540],eps[0.100000],durat[310],fnum=1355941, cum_cl_rew = 22, cum_rew = 300,tot_cl = 17 , tot = 284
epis[7550],eps[0.100000],durat[310],fnum=1357982, cum_cl_rew = 23, cum_rew = 330,tot_cl = 17 , tot = 277
epis[7560],eps[0.100000],durat[367],fnum=1360445, cum_cl_rew = 27, cum_rew = 380,tot_cl = 17 , tot = 283
epis[7570],eps[0.100000],durat[328],fnum=1363024, cum_cl_rew = 23, cum_rew = 320,tot_cl = 17 , tot = 281
epis[7580],eps[0.100000],durat[193],fnum=1365177, cum_cl_rew = 20, cum_rew = 310,tot_cl = 17 , tot = 274
epis[7590],eps[0.100000],durat[177],fnum=1367825, cum_cl_rew = 19, cum_rew = 275,tot_cl = 18 , tot = 289
epis[7600],eps[0.100000],durat[278],fnum=1370138, cum_c

### Plot the overall performace

In [None]:
bandwidth = 1000 # Moving average bandwidth
total_clipped = np.zeros(int(epis_count)-bandwidth)
total_rew = np.zeros(int(epis_count)-bandwidth)
for i in range(int(epis_count)-bandwidth):
    total_clipped[i] = np.sum(tot_clipped_reward[i:i+bandwidth])/bandwidth
    total_rew[i] = np.sum(tot_reward[i:i+bandwidth])/bandwidth
t = np.arange(int(epis_count)-bandwidth)
belplt = plt.plot(t,total_rew[0:int(epis_count)-bandwidth],"r", label = "Return")
plt.legend()#handles[likplt,belplt])
print('Running after %d number of episodes' %epis_count)
plt.xlabel("Number of episode")
plt.ylabel("Average Reward per episode")
plt.show()
likplt = plt.plot(t,total_clipped[0:opt.num_episode-bandwidth],"b", label = "Clipped Return")
plt.legend()#handles[likplt,belplt])
plt.xlabel("Number of episode")
plt.ylabel("Average clipped Reward per episode")
plt.show()


### Accumulated average reward after 1000 episodes of game Assault
|![](./Assault.png)|![](./Assault-clipped.png)|
|:---------------:|:---------------:|
|Average reward|Average clipped reward|