# Continuous Control with Deep Reinforcement Learning

This notebook is inspired by the DDPG paper (https://arxiv.org/abs/1509.02971).



## OpenAI Bipedal Walker 

  This is a simple four-joint walker robot environment.
 
  There are two versions:
 
  - **Normal**, with slightly uneven terrain.
 
  - **Hardcore**, with ladders, stumps, and pitfalls.
  


<table><tr>
<td> <img src="images/normal_env.png"  style="width: 550px;"/> </td>
<td> <img src="images/hardcore_env.png"  style="width: 550px;"/> </td>
</tr></table>
  
  
  
We are using the Normal environment to prototype the Deep Deterministic Policy Gradient (DDPG).


#### Source

  The BipedalEnvironment was created by Oleg Klimov and is licensed under the same terms as the rest of OpenAI Gym.  
  Raw environment code: https://github.com/openai/gym/blob/master/gym/envs/box2d/bipedal_walker.py  


### Rewards Given to the Agent

  - Moving forward earns reward, totaling 300+ points by the far end.
  - If the robot falls, it gets -100.
  - Applying motor torque costs a small number of points, so a more efficient agent gets a better score.

### State Space: 24 Dimensions

  - **4 hull measurements**: hull angle, angular velocity, horizontal speed, vertical speed
  - **8 joint measurements**, 2 for each of the 4 joints: joint position and joint angular speed
  - **2 leg measurements**, one for each leg: whether the leg is in contact with the ground
  - **10 lidar rangefinder measurements**, which help the agent navigate the hardcore environment.
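
The group sizes listed above can be sanity-checked with a few lines. This is only a counting sketch: `STATE_GROUPS` and `total_state_dims` are hypothetical names, and nothing here claims the order in which the environment actually lays the values out.

```python
# Hypothetical bookkeeping for the 24-dimensional observation described above.
STATE_GROUPS = {
    "hull": 4,         # 4 hull measurements
    "joints": 8,       # 2 measurements for each of the 4 joints
    "leg_contact": 2,  # one ground-contact flag per leg
    "lidar": 10,       # rangefinder measurements
}


def total_state_dims(groups):
    """Sum the per-group sizes to get the observation length."""
    return sum(groups.values())


print(total_state_dims(STATE_GROUPS))  # 24
```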
  
### What qualifies as a solution, or a successful RL agent?
  
  To solve the game you need to get **300 points in 1600 time steps**.
 
  To solve the hardcore version you need **300 points in 2000 time steps**.
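
The two criteria above can be written as a small check. This is a sketch under assumptions: `is_solved` is a hypothetical helper, "300 points" is read as "at least 300", and the text does not say whether the threshold applies to one episode or to an average over several.

```python
def is_solved(total_reward, steps_taken, hardcore=False):
    """Return True if an episode meets the solve criterion stated above.

    Thresholds: 300 points within 1600 steps (normal)
    or within 2000 steps (hardcore).
    """
    step_limit = 2000 if hardcore else 1600
    return total_reward >= 300 and steps_taken <= step_limit


print(is_solved(305.2, 1400))                 # True
print(is_solved(305.2, 1800))                 # False: too many steps for normal
print(is_solved(305.2, 1800, hardcore=True))  # True
```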



# Pseudocode / Planning

Replay Buffer / Memory:
    
    state, action, state_p, reward, d (terminal flag)
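
As a rough illustration of the memory described above, here is a minimal ring buffer in plain Python. `TinyReplayBuffer` is a hypothetical stand-in, not the NumPy-backed `ReplayBuffer` class defined later in this notebook; it only shows the overwrite-when-full and uniform-sampling behavior.

```python
import random


class TinyReplayBuffer:
    """Illustrative ring buffer of (state, action, state_p, reward, d) tuples."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.write_index = 0

    def memorize(self, state, action, state_p, reward, d):
        transition = (state, action, state_p, reward, d)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest entry.
            self.storage[self.write_index] = transition
        self.write_index = (self.write_index + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(self.storage, batch_size)


buffer = TinyReplayBuffer(capacity=3)
for step in range(5):
    buffer.memorize(step, 0.0, step + 1, 1.0, False)

# Only the three most recent transitions survive.
print(sorted(t[0] for t in buffer.storage))  # [2, 3, 4]
```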


2 actor networks (online and target); for the actor:

    Chooses an action given a state
    randomly sample states from memory
    use actor to determine actions for those states
    plug actions into critic and get the value
    take gradient w.r.t the actor network params
    backpropagate through the critic AND actor network for a given action
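
The actor steps above amount to a chain rule: the gradient of the critic's value with respect to the action, times the gradient of the action with respect to the actor's parameters. The toy below is a sketch, not the notebook's TensorFlow implementation: the quadratic `critic_value`, the one-parameter linear `actor`, and all constants are assumptions chosen so the gradients can be written by hand.

```python
import random


def critic_value(state, action):
    """Toy critic (an assumption): highest value when action == 2 * state."""
    return -(action - 2.0 * state) ** 2


def critic_grad_wrt_action(state, action):
    """Analytic dQ/da for the toy critic above."""
    return -2.0 * (action - 2.0 * state)


def actor(weight, state):
    """Toy linear actor with a single parameter."""
    return weight * state


random.seed(0)
memory = [random.uniform(-1.0, 1.0) for _ in range(200)]  # stored states only
weight, learning_rate = 0.0, 0.1

for _ in range(500):
    batch = random.sample(memory, 32)  # randomly sample states from memory
    # Chain rule: dQ/dw = dQ/da * da/dw, where da/dw = state for this actor.
    grad = sum(critic_grad_wrt_action(s, actor(weight, s)) * s for s in batch) / len(batch)
    weight += learning_rate * grad  # gradient ASCENT on the critic's value

print(round(weight, 3))  # close to 2.0, the weight the toy critic prefers
```

Note that only the actor's parameter is updated here; the critic is used to supply the gradient but is left unchanged by this step.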
    
2 critic networks (online and target); for the critic:

    estimates the value of a state and the action chosen by the actor network
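
The critic is trained toward a bootstrapped target built from the target networks. The sketch below works through that arithmetic on plain Python lists; `td_targets` and `mse` are hypothetical helpers, the numbers are made up for illustration, and `next_q_values` stands for the target critic's output on the next state and the target actor's action.

```python
def td_targets(rewards, next_q_values, terminals, gamma=0.95):
    """Bootstrapped targets y = r + gamma * Q'(s', mu'(s')) * (1 - d)."""
    return [
        r + gamma * q_next * (0.0 if done else 1.0)
        for r, q_next, done in zip(rewards, next_q_values, terminals)
    ]


def mse(predictions, targets):
    """Mean squared error between critic predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


targets = td_targets(
    rewards=[1.0, -100.0],
    next_q_values=[10.0, 5.0],
    terminals=[False, True],  # the second transition ended the episode
)
print(targets)                      # [10.5, -100.0]
print(mse([10.0, -90.0], targets))  # 50.125
```

For the terminal transition the bootstrap term is dropped, so the target is just the reward.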


Polyak averaging for the target network update, where $\theta^Q$ denotes the critic network weights, $\theta^{Q'}$ the target critic weights, and $\rho$ the update rate:

$\theta^{Q'} = \rho\theta^Q + (1-\rho)\theta^{Q'}$
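
A minimal sketch of the update above on plain Python lists. `polyak_update` is a hypothetical helper and the weights are toy numbers, not network parameters.

```python
def polyak_update(online_weights, target_weights, rho):
    """Return rho * online + (1 - rho) * target, element by element."""
    return [rho * w + (1.0 - rho) * w_t for w, w_t in zip(online_weights, target_weights)]


online = [1.0, 2.0]
target = [0.0, 0.0]

target = polyak_update(online, target, rho=0.5)
print(target)  # [0.5, 1.0]

# rho = 1 copies the online weights outright (a "hard" update).
print(polyak_update(online, target, rho=1.0))  # [1.0, 2.0]
```

A small `rho` makes the target network trail the online network slowly, which keeps the bootstrapped targets from shifting too quickly.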



In [6]:
import tensorflow as tf
import os
import gym
import numpy as np


class Actor(tf.keras.Model):
    
    """
    Chooses an action given a state
    
    """
    def __init__(self, num_actions, d1_dims=256, d2_dims=256, action_range = (-1,1),
                 save_dir='model_weights/', name = 'actor'):
        
        super().__init__()
        self.actions = num_actions
        self.d1_dims = d1_dims
        self.d2_dims = d2_dims
        self.action_range = action_range
        
        ## Create Model Checkpoint Destination
        self.model_name = name
        self.save_dir = save_dir
        self.file_checkpoint = os.path.join(self.save_dir, self.model_name +'.h5')
        
        
        ## Build the Fully Connected Model Layers
        
        initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.1)
        self.d1 = tf.keras.layers.Dense(self.d1_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.d2 = tf.keras.layers.Dense(self.d2_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.action_vector = tf.keras.layers.Dense(self.actions, activation = 'tanh') #[-1,1]
    
    def call(self, state):
        
        
        ## Forward propagation
        mu = self.d1(state)
        mu = self.d2(mu)
        mu = self.action_vector(mu)
        
        ## Multiply the tanh output by the largest action bound
        mu = mu * max(self.action_range)
        
        return mu


class Critic(tf.keras.Model):
    
    """
    Evaluates the value by the state and actions from the Actor Network
    
    """
    def __init__(self, d1_dims=256, d2_dims=256, save_dir='model_weights/', name = 'critic'):
        
        super().__init__()
        self.d1_dims = d1_dims
        self.d2_dims = d2_dims
        
        
        ## Create Model Checkpoint Destination
        self.model_name = name
        self.save_dir = save_dir
        self.file_checkpoint = os.path.join(self.save_dir, self.model_name +'.h5')
        
        
        ## Build the Fully Connected Model Layers
        initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.1)
        self.d1 = tf.keras.layers.Dense(self.d1_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.d2 = tf.keras.layers.Dense(self.d2_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.q = tf.keras.layers.Dense(1, activation = None)
    
    def call(self, state, action):
        
        ## Forward propagation
        q = self.d1(tf.concat([state,action], axis=1))
        q = self.d2(q)
        q = self.q(q)
        
        return q
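
The Actor's `call` above scales the tanh output by `max(self.action_range)`, which matches the action bounds only when they are symmetric around zero. The sketch below shows a more general affine rescale; `scale_action` is a hypothetical helper, not part of the notebook's classes, and it reduces to the multiplication above when `low == -high`.

```python
import math


def scale_action(pre_activation, low, high):
    """Squash with tanh, then map [-1, 1] linearly onto [low, high]."""
    squashed = math.tanh(pre_activation)
    return low + 0.5 * (squashed + 1.0) * (high - low)


print(scale_action(0.0, -1.0, 1.0))  # 0.0 (symmetric bounds)
print(scale_action(0.0, 0.0, 2.0))   # 1.0 (midpoint of an asymmetric range)
```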
        

In [19]:
import tensorflow as tf
import os
import gym
import numpy as np


class Actor(tf.keras.Model):
    
    """
    Chooses an action given a state
    
    """
    def __init__(self, num_actions, d1_dims=256, d2_dims=256, action_range = (-1,1),
                 save_dir='model_weights/', name = 'actor'):
        
        super().__init__()
        self.actions = num_actions
        self.d1_dims = d1_dims
        self.d2_dims = d2_dims
        self.action_range = action_range
        
        ## Create Model Checkpoint Destination
        self.model_name = name
        self.save_dir = save_dir
        self.file_checkpoint = os.path.join(self.save_dir, self.model_name +'.h5')
        
        
        ## Build the Fully Connected Model Layers
        
        initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.1)
        self.d1 = tf.keras.layers.Dense(self.d1_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.d2 = tf.keras.layers.Dense(self.d2_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.d3 = tf.keras.layers.Dense(self.d1_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.action_vector = tf.keras.layers.Dense(self.actions, activation = 'tanh') #[-1,1]
    
    def call(self, state):
        
        
        ## Forward propagation
        mu = self.d1(state)
        mu = self.d2(mu)
        mu = self.d3(mu)
        mu = self.action_vector(mu)
        
        ## Multiply the tanh output by the largest action bound
        mu = mu * max(self.action_range)
        
        return mu


class Critic(tf.keras.Model):
    
    """
    Evaluates the value by the state and actions from the Actor Network
    
    """
    def __init__(self, d1_dims=256, d2_dims=256, save_dir='model_weights/', name = 'critic'):
        
        super().__init__()
        self.d1_dims = d1_dims
        self.d2_dims = d2_dims
        
        
        ## Create Model Checkpoint Destination
        self.model_name = name
        self.save_dir = save_dir
        self.file_checkpoint = os.path.join(self.save_dir, self.model_name +'.h5')
        
        
        ## Build the Fully Connected Model Layers
        initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.1)
        self.d1 = tf.keras.layers.Dense(self.d1_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.d2 = tf.keras.layers.Dense(self.d2_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.d3 = tf.keras.layers.Dense(self.d1_dims, kernel_initializer=initializer,
                                                  activation='relu')
        self.q = tf.keras.layers.Dense(1, activation = None)
    
    def call(self, state, action):
        
        ## Forward propagation
        q = self.d1(tf.concat([state,action], axis=1))
        q = self.d2(q)
        q = self.d3(q)
        q = self.q(q)
        
        return q
        

In [20]:
class OUActionNoise():
    
    def __init__(self, mu, sigma=0.15, theta=.2, dt=1e-2, x0=None):
        self.theta = theta
        self.mu = mu
        self.sigma = sigma
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x
        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)

    def __repr__(self):
        return 'OrnsteinUhlenbeckActionNoise(mu={}, sigma={})'.format(
                                                            self.mu, self.sigma)

class ReplayBuffer():
    
    """
    Experience replay buffers stabilize learning. This buffer
    should be large enough to capture a wide range of experiences so that
    it may generalize well.
    
    The buffer saves the (s, a, s', r, d) for each step in an environment.
    The models will randomly sample experiences using a uniform distribution
    later, to update the deep neural networks.
    
    Hyper-parameters:
        memory_capacity - Too large and training is slow. Too small and
            training will overfit to most recent experience
    """
    
    def __init__(self, size, state_dims, num_actions):

        self.memory_capacity = size
        self.memory_index = 0
        
        ## Initialize a memory array for (s, a, s', r, d)
        self.state_memory = np.zeros((self.memory_capacity, state_dims))
        self.action_memory = np.zeros((self.memory_capacity, num_actions))
        self.reward_memory = np.zeros(self.memory_capacity)
        self.state_p_memory = np.zeros((self.memory_capacity, state_dims))
        self.terminal_memory = np.zeros(self.memory_capacity, dtype=np.bool_)
        
            
    def memorize(self, state, action, reward, state_p, terminal):
        
        ## Overwrite buffer when full
        index = self.memory_index % self.memory_capacity
        
        ## Store Experience
        self.state_memory[index] = state
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.state_p_memory[index] = state_p
        self.terminal_memory[index] = terminal
        
        self.memory_index += 1
    
    def sample_memory(self, batch_size):
         
        sample_size = min(self.memory_index, self.memory_capacity)
        
        ## Randomly sample a batch of memories without replacement
        batch = np.random.choice(sample_size, batch_size, replace=False)
        
        ## Generate Batch by Indices
        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        states_p = self.state_p_memory[batch]
        terminals = self.terminal_memory[batch]
        
        ## Numpy to Tensor
        states = tf.convert_to_tensor(states, dtype=tf.float32)
        actions = tf.convert_to_tensor(actions, dtype=tf.float32)
        rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
        states_p = tf.convert_to_tensor(states_p, dtype=tf.float32)
        #terminals = tf.convert_to_tensor(terminals, dtype=tf.float32)
        
        return states, actions, rewards, states_p, terminals
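
To get a feel for the `OUActionNoise` update defined above, the sketch below repeats the same step in plain Python for a scalar. `ou_step` is a hypothetical helper that reuses the class's default `theta`, `sigma`, and `dt`. Setting `sigma = 0` isolates the mean-reverting drift.

```python
import math
import random


def ou_step(x_prev, mu=0.0, theta=0.2, sigma=0.15, dt=1e-2, rng=random):
    """One Euler step of the Ornstein-Uhlenbeck update used above."""
    drift = theta * (mu - x_prev) * dt
    diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x_prev + drift + diffusion


# With sigma = 0 only the mean-reverting drift remains.
x = 1.0
for _ in range(1000):
    x = ou_step(x, sigma=0.0)
print(round(x, 3))  # 0.135
```

With these defaults each step pulls the value only 0.2% of the way back toward `mu`, so the noise stays correlated across many consecutive actions.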

In [21]:
class Agent():
    
    def __init__(self, env_name, lr_actor=0.001, lr_critic=0.002, env=None, gamma=0.95,
                buffer_size = 20000, rho = 0.005, layer_dims =(400,300), batch_size=32,
                epsilon=0.99, e_decay = 0.001):
        self.save_dir = env_name + '_models'
        self.num_actions = env.action_space.shape[0]
        self.state_size = env.observation_space.shape[0]
        self.lr_actor = lr_actor
        self.lr_critic = lr_critic
        self.env = env
        self.gamma = gamma
        self.buffer_size = buffer_size
        self.rho = rho
        self.d1_dims = layer_dims[0]
        self.d2_dims = layer_dims[1]
        self.batch_size = batch_size
        self.epsilon = epsilon
        self.e_decay = e_decay
        self.noise = OUActionNoise(mu=np.zeros(self.num_actions))
        self.action_range = (env.action_space.high[0], env.action_space.low[0])
        self.action_counter = 0
        
        ## Define Replay Buffer
        self.buffer_size = buffer_size
        self.memory = ReplayBuffer(self.buffer_size, self.state_size, self.num_actions)
        
        ## Define Neural Networks
        self.actor = Actor(save_dir = self.save_dir, num_actions = self.num_actions, 
                           d1_dims=self.d1_dims, d2_dims=self.d2_dims, 
                           action_range = self.action_range)
        self.target_actor = Actor(save_dir = self.save_dir, num_actions = self.num_actions, 
                                  d1_dims=self.d1_dims, d2_dims=self.d2_dims, 
                                  action_range = self.action_range, 
                                  name = 'target_actor')
        
        self.critic = Critic(save_dir = self.save_dir, d1_dims=self.d1_dims, d2_dims=self.d2_dims)
        self.target_critic = Critic(d1_dims=self.d1_dims, d2_dims=self.d2_dims, 
                                    name = 'target_critic', save_dir = self.save_dir,)
        
        ## Compile the Networks
        self.actor.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_actor))
        self.target_actor.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_actor))
        self.critic.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_critic))
        self.target_critic.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_critic))
        
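        ## Build the networks with a dummy forward pass so their weights exist,
        ## then hard copy the weights to the target networks
        dummy_state = tf.zeros((1, self.state_size))
        dummy_action = tf.zeros((1, self.num_actions))
        for net in (self.actor, self.target_actor):
            net(dummy_state)
        for net in (self.critic, self.target_critic):
            net(dummy_state, dummy_action)
        self.soft_update_weights(rho=1)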
        
    def soft_update_weights(self, rho = None):
        """
        Use Polyak averaging as a soft weight update to the target networks
        
        https://spinningup.openai.com/en/latest/algorithms/ddpg.html
        """
        
        if rho is None:
            rho = self.rho
        
        ## Update Actor
        weights = []
        target_weights = self.target_actor.weights
        for i, weight in enumerate(self.actor.weights):
            weights.append(weight*rho + target_weights[i]*(1-rho))
        self.target_actor.set_weights(weights)
        
        ## Update Critic
        weights = []
        target_weights = self.target_critic.weights
        for i, weight in enumerate(self.critic.weights):
            weights.append(weight*rho + target_weights[i]*(1-rho))
        self.target_critic.set_weights(weights)
        
    def remember_experience(self, state, action, reward, state_p, terminal):
        self.memory.memorize(state, action, reward, state_p, terminal)
    
    def act(self, state, greedy = False):
        
        ## E-Greedy Policy!
        
        self.action_counter+=1
        
        if greedy:
            
            return self.actor(tf.convert_to_tensor([state], dtype='float32'))[0]
        
        else:
            
            epsilon = min(np.exp(-self.e_decay * self.action_counter), self.epsilon)
            
            if np.random.random() > epsilon:
                return tf.convert_to_tensor(self.act_noisy(state))[0], 0
            else:
            
                actions = tf.convert_to_tensor([self.env.action_space.sample()])
                return actions[0], 1
    
    def choose_action(self, observation, evaluate=False):
        state = tf.convert_to_tensor([observation], dtype=tf.float32)
        actions = self.actor(state)
        if not evaluate:
            actions += tf.random.normal(shape=[self.num_actions],
                    mean=0.0, stddev=0.1)
        # note that if the environment has an action > 1, we have to multiply by
        # max action at some point
        actions = tf.clip_by_value(actions, min(self.action_range), max(self.action_range))

        return actions[0]
    
    def act_noisy(self, state):
        
        action = self.actor(tf.convert_to_tensor([state], dtype='float32'))
        action = action + self.noise()
        action = np.clip(action, a_min = min(self.action_range), 
                         a_max = max(self.action_range))
        
        return action
        
    def save_weights(self, iteration = None):
        
        if iteration is None:
            iteration = self.actor.save_dir
        
        print(f"\n........Initializing save at Episode {iteration}........")
        print("Saving Actor.....................")
        
        if os.path.isdir(self.actor.save_dir) == False:
            os.mkdir(self.actor.save_dir)
        self.actor.save_weights(self.actor.file_checkpoint)
        print(f"Save Complete at {self.actor.file_checkpoint}")
        
        print("Saving Critic.....................")
        if os.path.isdir(self.critic.save_dir) == False:
            os.mkdir(self.critic.save_dir)
        self.critic.save_weights(self.critic.file_checkpoint)
        print(f"Save Complete at {self.critic.file_checkpoint}")
        
        print("Saving Target Networks.............")
        if os.path.isdir(self.target_actor.save_dir) == False:
            os.mkdir(self.target_actor.save_dir)
        if os.path.isdir(self.target_critic.save_dir) == False:
            os.mkdir(self.target_critic.save_dir)
        self.target_actor.save_weights(self.target_actor.file_checkpoint)
        self.target_critic.save_weights(self.target_critic.file_checkpoint)
        
        print("Save Complete.\n")
    
    def load_weights(self):
        
        print(f"........Loading Weights........")
        print("Loading Actor.....................")
        self.actor.load_weights(self.actor.file_checkpoint)
        
        print("Loading Critic.....................")
        self.critic.load_weights(self.critic.file_checkpoint)
    
        
        print("Loading Target Networks.............")
        self.target_actor.load_weights(self.target_actor.file_checkpoint)
        self.target_critic.load_weights(self.target_critic.file_checkpoint)
        print("Load Complete.")
        
    def learn(self):
        
        if self.memory.memory_index < self.batch_size:
            return
        
        ## Sample a Batch from Memory
        states, actions, rewards, states_p, terminals = self.memory.sample_memory(self.batch_size)
        
        with tf.GradientTape() as tape:
            
            ## Feed s' to target_actor
            mu_p = self.target_actor(states_p)
            
            # Feed s' and target_actor output to target_critic
            q_p = tf.squeeze(self.target_critic(states_p, mu_p),1)
            
            ## Target is generated by target_critic, reward if terminal
            target = rewards + self.gamma*q_p*(1-terminals)
            
            # Feed states and actions to critic
            q = tf.squeeze(self.critic(states,actions), 1)
            
            ## Critic loss: mean squared error between prediction and target
            critic_loss = tf.keras.losses.MSE(q, target)
            
        ## Calculate the gradient of the loss w.r.t. the critic parameters
        critic_gradient = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.critic.optimizer.apply_gradients(zip(critic_gradient, self.critic.trainable_variables))
        
        with tf.GradientTape() as tape:
            
            ## Actions of actor based on current weights
            actor_actions = self.actor(states)
            
            ## Gradient ASCENT to MAXIMIZE Expected Value over time
            actor_loss = tf.math.reduce_mean((-self.critic(states, actor_actions)))
        
        ## Calculate the gradient of the loss w.r.t. the actor parameters
        actor_gradient = tape.gradient(actor_loss, self.actor.trainable_variables)
        self.actor.optimizer.apply_gradients(zip(actor_gradient,self.actor.trainable_variables))
        
        self.soft_update_weights()

            

In [None]:
if __name__ == '__main__':
    #env = gym.make('BipedalWalker-v2')
    env_name = 'LunarLanderContinuous-v2'
    env = gym.make(env_name)
    agent = Agent(env=env, env_name = env_name, layer_dims = (400,300), batch_size = 64, 
                  rho = 0.01, gamma = 0.95, lr_critic= 0.005, lr_actor= 0.003, 
                  epsilon = 0, e_decay = 0.000001)
    
    n_games = 2000


    best_score = env.reward_range[0]
    score_history = []
    load_checkpoint = False

    if load_checkpoint:
        n_steps = 0
        while n_steps <= agent.batch_size:
            observation = env.reset()
            action = env.action_space.sample()
            observation_, reward, done, info = env.step(action)
            agent.remember_experience(observation, action, reward, observation_, done)
            n_steps += 1
        agent.learn()
        agent.load_weights()
        evaluate = False    ## Set to True if you don't want to improve the model
    else:
        evaluate = False

        
        
    ## Game Loop
    action_history = []
    for i in range(n_games):
        observation = env.reset()
        done = False
        score = 0
        action_sequence = []
        
        while not done:
            #action, explore = agent.act(observation)
            #action_hist.append(explore)
            action = agent.choose_action(observation)
            action_sequence.append(action)
            
            observation_, reward, done, info = env.step(action)
            score += reward
            agent.remember_experience(observation, action, reward, observation_, done)
           
            if not load_checkpoint:
                agent.learn()
            observation = observation_


        score_history.append(score)
        avg_score = np.mean(score_history[-100:])
        
        if avg_score > best_score:
            best_score = avg_score
            if not load_checkpoint:
                agent.save_weights(iteration = i)
                
        if i % 20 == 0:
            print(action)
            print(agent.action_counter)
        action_history.append(action_sequence)
        print('episode ', i, 'score %.1f' % score, 'avg score %.1f' % avg_score, 'avg action ',
              np.mean(action_sequence),'over ', len(action_sequence), ' actions')


........Initializing save at Episode 0........
Saving Actor.....................
Save Complete at LunarLanderContinuous-v2_models\actor.h5
Saving Critic.....................
Save Complete at LunarLanderContinuous-v2_models\critic.h5
Saving Target Networks.............
Save Complete.

tf.Tensor([-0.5125351  -0.62160116], shape=(2,), dtype=float32)
0
episode  0 score -156.3 avg score -156.3 avg action  -0.4464166 over  59  actions
episode  1 score -1428.3 avg score -792.3 avg action  0.9050902 over  114  actions
episode  2 score -1644.7 avg score -1076.4 avg action  0.9597903 over  121  actions
episode  3 score -1547.3 avg score -1194.2 avg action  0.9643386 over  122  actions
episode  4 score -1737.2 avg score -1302.8 avg action  0.9568754 over  129  actions
episode  5 score -748.7 avg score -1210.4 avg action  0.96457523 over  56  actions
episode  6 score -783.1 avg score -1149.4 avg action  0.9506825 over  87  actions
episode  7 score -805.1 avg score -1106.3 avg action  0.9628663 ov

episode  91 score -1156.1 avg score -1085.9 avg action  0.9598974 over  92  actions
episode  92 score -739.0 avg score -1082.2 avg action  0.9565745 over  58  actions
episode  93 score -1144.9 avg score -1082.8 avg action  0.95390505 over  96  actions
episode  94 score -1130.5 avg score -1083.3 avg action  0.96446955 over  90  actions
episode  95 score -764.2 avg score -1080.0 avg action  0.9586285 over  61  actions
episode  96 score -1247.0 avg score -1081.7 avg action  0.9643491 over  105  actions
episode  97 score -1246.7 avg score -1083.4 avg action  0.9674902 over  96  actions
episode  98 score -1044.2 avg score -1083.0 avg action  0.969082 over  91  actions
episode  99 score -2017.9 avg score -1092.4 avg action  0.95895165 over  137  actions
tf.Tensor([0.939968 1.      ], shape=(2,), dtype=float32)
0
episode  100 score -1211.0 avg score -1102.9 avg action  0.95583266 over  95  actions
episode  101 score -1176.5 avg score -1100.4 avg action  0.9587616 over  103  actions
episode  1
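
The training loop above saves weights only when the 100-episode moving average improves. The sketch below replays that bookkeeping on the first three scores from the log; `track_best` is a hypothetical helper written for illustration.

```python
def track_best(scores, window=100):
    """Yield (episode, moving_average, improved) for each finished episode."""
    best = float("-inf")
    history = []
    for episode, score in enumerate(scores):
        history.append(score)
        recent = history[-window:]
        average = sum(recent) / len(recent)
        improved = average > best
        if improved:
            best = average  # this is where a checkpoint would be saved
        yield episode, average, improved


for episode, average, improved in track_best([-156.3, -1428.3, -1644.7]):
    print(episode, round(average, 1), improved)
# 0 -156.3 True
# 1 -792.3 False
# 2 -1076.4 False
```

This matches the log: the only save happens at episode 0, because the moving average never again exceeds the first episode's score.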

In [18]:
for i in range(n_games):
        observation = env.reset()
        done = False
        score = 0
        while not done:
            env.render()
            action = agent.choose_action(observation, evaluate)
            observation_, reward, done, info = env.step(action)
            score += reward
            agent.remember_experience(observation, action, reward, observation_, done)
            if not load_checkpoint:
                agent.learn()
            observation = observation_

        score_history.append(score)
        avg_score = np.mean(score_history[-100:])

        if avg_score > best_score:
            best_score = avg_score
            #if not load_checkpoint:
             #   agent.save_models()

        print('episode ', i, 'score %.1f' % score, 'avg score %.1f' % avg_score)
env.close()

episode  0 score -297.8 avg score -125.6
episode  1 score 3.9 avg score -124.7
episode  2 score -189.4 avg score -125.6
episode  3 score -202.9 avg score -126.4
episode  4 score -204.0 avg score -127.4
episode  5 score -105.0 avg score -126.9
episode  6 score -96.8 avg score -126.8
episode  7 score -203.9 avg score -127.5
episode  8 score -162.8 avg score -127.6
episode  9 score -139.2 avg score -127.2
episode  10 score -149.5 avg score -127.9
episode  11 score -57.3 avg score -128.1
episode  12 score -84.3 avg score -128.4
episode  13 score -124.9 avg score -125.2
episode  14 score -60.6 avg score -124.3
episode  15 score -95.9 avg score -124.4
episode  16 score -110.9 avg score -124.9
episode  17 score -89.9 avg score -125.1
episode  18 score -177.8 avg score -124.7
episode  19 score -97.7 avg score -120.6
episode  20 score -116.4 avg score -117.5
episode  21 score -100.5 avg score -117.3
episode  22 score -112.5 avg score -117.7
episode  23 score -335.8 avg score -120.5
episode  24 

KeyboardInterrupt: 

## TODO
https://towardsdatascience.com/deep-deterministic-policy-gradients-explained-2d94655a9b7b
It should be possible to change the critic's update rule to the MSE between q and $G_t$ (the observed return), or another return estimate. Worth a try: it would give more weight to the observed rewards and rely less on bootstrapping.
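
As a starting point for this TODO, here is a sketch of how $G_t$ could be computed for one finished episode. `discounted_returns` is a hypothetical helper and nothing below is wired into the `Agent` class.

```python
def discounted_returns(rewards, gamma=0.95):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return can reuse the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

One caveat: returns taken from the replay buffer reflect the policy that collected them, so this is a trade-off against bootstrapping rather than a drop-in replacement.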

## Epsilon Decay

In [None]:
import matplotlib.pyplot as plt

plt.plot(np.linspace(0, 50000, 500), np.exp(-1/10000*np.linspace(0, 50000, 500)))
plt.show()

In [None]:
np.exp(-0.0001 * 10000)
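
The schedule plotted above can also be inverted to ask how many actions it takes for exploration to fall to a given level. This is a sketch with hypothetical helpers: `epsilon_at` mirrors the `min(exp(-decay * step), cap)` expression in `Agent.act`, and the decay of `1e-4` is the value used in the cell above.

```python
import math


def epsilon_at(step, decay=1e-4, cap=0.99):
    """Exploration rate min(exp(-decay * step), cap), as in Agent.act."""
    return min(math.exp(-decay * step), cap)


def steps_until(epsilon_target, decay=1e-4):
    """Solve exp(-decay * t) = epsilon_target for t."""
    return -math.log(epsilon_target) / decay


print(round(epsilon_at(10_000), 4))  # 0.3679
print(round(steps_until(0.05)))      # 29957
```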

In [None]:
min(agent.action_range)

In [None]:
actions = []
for episode in action_history:
    for item in episode:
        actions.append(item[0])

In [None]:

#plt.plot(score_history)
plt.plot(actions[:200])

In [None]:
smaller_hist = score_history