# Continuous Control with Deep Reinforcement Learning


## OpenAI Bipedal Walker 

  This is a simple 4-joint walker robot environment.
 
  There are two versions:
 
  - **Normal**, with slightly uneven terrain.
 
  - **Hardcore**, with ladders, stumps, and pitfalls.
  


<table><tr>
<td> <img src="images/normal_env.png"  style="width: 550px;"/> </td>
<td> <img src="images/hardcore_env.png"  style="width: 550px;"/> </td>
</tr></table>
  
  
  
We are using the Normal environment to prototype the Deep Deterministic Policy Gradient (DDPG).


#### Source

  The BipedalEnvironment was created by Oleg Klimov and is licensed on the same terms as the rest of OpenAI Gym.  
  Raw environment code: https://github.com/openai/gym/blob/master/gym/envs/box2d/bipedal_walker.py  


### Rewards Given to the Agent

  - Moving forward earns reward, totaling 300+ points by the far end. 
  - If the robot falls, it gets -100. 
  - Applying motor torque costs a small number of points, so a more efficient agent gets a better score.

### State Space: 24 Dimensions

  - **4 hull measurements**: angle, angular velocity, horizontal speed, vertical speed
  - **8 joint measurements**, 2 for each of the 4 joints: joint position and joint angular speed 
  - **2 leg measurements**, one for each leg: whether the leg is in contact with the ground
  - **10 lidar rangefinder measurements** to help to navigate the hardcore environment. 
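
The grouping above can be made concrete with a small helper. This is a minimal sketch, not part of the original notebook: the function name and dictionary keys are hypothetical, and the index layout is an assumption based on the ordering in the Gym source (hull, then joints and contact flag for each leg in turn, then lidar). It is consistent with the states printed later in this notebook, where entries 8 and 13 only take the values 0 and 1.

```python
import numpy as np

def split_observation(obs):
    """Split a flat 24-dimensional BipedalWalker observation into named parts.

    Assumed layout (per-leg values are interleaved, not grouped by type):
      0-3   hull angle, angular velocity, horizontal speed, vertical speed
      4-7   leg 1: hip angle, hip speed, knee angle, knee speed
      8     leg 1: ground contact flag
      9-12  leg 2: hip angle, hip speed, knee angle, knee speed
      13    leg 2: ground contact flag
      14-23 lidar rangefinder readings
    """
    obs = np.asarray(obs, dtype=np.float64)
    if obs.shape != (24,):
        raise ValueError(f"expected shape (24,), got {obs.shape}")
    return {
        "hull": obs[0:4],
        "leg1_joints": obs[4:8],
        "leg1_contact": obs[8],
        "leg2_joints": obs[9:13],
        "leg2_contact": obs[13],
        "lidar": obs[14:24],
    }

# Stand-in observation whose values equal their own indices.
example = np.arange(24, dtype=np.float64)
parts = split_observation(example)

# Dropping the lidar block leaves a 14-dimensional state.
without_lidar = example[:14]
```

Under this layout, slicing with `[:14]` keeps the hull, joint, and contact measurements and discards only the lidar readings.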
  
### What counts as a solution, or a successful RL agent?
  
  To solve the game you need to get **300 points in 1600 time steps**.
 
  To solve the hardcore version you need **300 points in 2000 time steps**.



In [1]:
from agents import *
import tensorflow as tf
import os
import gym
import numpy as np
import matplotlib.pyplot as plt



## Critic Loss Function

Update the critic network by minimizing the loss:

$L = \frac{1}{N}\sum_i (y_i - Q(s_i, a_i | \theta^Q))^2$

where the target $y_i$ is equal to:
$y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$

This target is reminiscent of the update target in Q-learning, except that the next-state value is not a maximum over known Q values: it is the output of our Target Critic Network, evaluated at the action chosen by the Target Actor Network.

### Intuition
Any time we see the term $\mu'(s|\theta^{\mu'})$, we should recognize that it is similar to an action, $a$, except that it is the action output by feeding a state $s$ into the Target Actor Network, given the target network's current weights $\theta^{\mu'}$.

To compute the Q value $Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$ for this action, we feed $s_{i+1}$ and $\mu'(s_{i+1}|\theta^{\mu'})$ into the Target Critic Network, since critics are responsible for generating values for a given state and action. We then form the target $y_i$ by discounting this value by $\gamma$ and adding the reward $r_i$, and update the Critic Network using the MSE between $y_i$ and its Q output for state $s_i$ and action $a_i$.
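
The target and loss can be checked by hand on a tiny batch. The following is a minimal NumPy sketch under stated assumptions: the function names are hypothetical, and the arrays `next_q` and `q` are stand-ins for the outputs of the Target Critic Network and the Critic Network, which are not reproduced here.

```python
import numpy as np

def critic_targets(rewards, next_q_target, gamma):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_target = np.asarray(next_q_target, dtype=np.float64)
    return rewards + gamma * next_q_target

def critic_loss(targets, q_values):
    """L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2."""
    targets = np.asarray(targets, dtype=np.float64)
    q_values = np.asarray(q_values, dtype=np.float64)
    return float(np.mean((targets - q_values) ** 2))

rewards = np.array([1.0, 0.0, -1.0])
next_q = np.array([2.0, 4.0, 0.0])  # stand-in for Target Critic outputs
q = np.array([2.0, 3.0, 0.0])       # stand-in for Critic outputs

y = critic_targets(rewards, next_q, gamma=0.5)  # [2.0, 2.0, -1.0]
loss = critic_loss(y, q)                        # (0 + 1 + 1) / 3
```

Only `q` depends on the critic weights $\theta^Q$ being trained; the targets `y` are treated as constants when the loss is minimized.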


## Weight Updates

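DDPG uses conservative policy iteration (soft updates) on the target actor and target critic network weights: rather than copying the learned weights outright, each target network moves a small step toward its learned counterpart. Another term for this is Polyak averaging.  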

For example, if $\theta^Q$ are the critic network weights and $\theta^{Q'}$ are the target critic network weights, then:  

$\theta^{Q'} \leftarrow \rho\theta^Q + (1-\rho)\theta^{Q'}$
where $\rho \ll 1$
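
A minimal sketch of this update on plain NumPy arrays follows. The function name `soft_update` and the list-of-arrays representation of the weights are assumptions made for illustration, not the interface of the `agents` module.

```python
import numpy as np

def soft_update(target_weights, online_weights, rho):
    """Return new target weights: rho * online + (1 - rho) * target, per array."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return [rho * w + (1.0 - rho) * w_t
            for w, w_t in zip(online_weights, target_weights)]

# Two toy weight arrays standing in for a network's layers.
online = [np.ones((2, 2)), np.full(3, 2.0)]
target = [np.zeros((2, 2)), np.zeros(3)]

target = soft_update(target, online, rho=0.01)
# After one step the target has moved 1% of the way toward the online weights.
```

With `rho = 1` the update reduces to a hard copy of the online weights; small values such as the `rho = 0.001` used for the walker make the targets change slowly.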


## Exploration Strategy
Add noise to the action, since the action space is continuous:

$\mu'(s) = \mu_\theta(s) + \mathcal{N}$

Here $\mathcal{N}$ is a noise process, and $\mu'$ denotes the exploration policy rather than the target actor. By adding noise, we can separate the exploration strategy from our policy. The noise employed here is generated by an Ornstein-Uhlenbeck process, which allows for temporally correlated exploration in physical control problems involving inertia.
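
One way to generate such noise is a simple discretization of the Ornstein-Uhlenbeck process. This is a sketch under stated assumptions, not the implementation in the `agents` module: the class name, the parameter values (`theta=0.15`, `sigma=0.2`, `dt=1e-2`), and the clipping range of [-1, 1] are illustrative choices.

```python
import numpy as np

class OUNoise:
    """Discretized OU process: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.mu = np.full(size, mu, dtype=np.float64)
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its long-run mean (e.g. at the start of an episode)."""
        self.x = self.mu.copy()

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + drift + diffusion
        return self.x.copy()

noise = OUNoise(size=4, seed=0)
policy_action = np.zeros(4)  # stand-in for the actor's output
noisy_action = np.clip(policy_action + noise.sample(), -1.0, 1.0)
```

Because each sample starts from the previous one, consecutive noise values are correlated, unlike independent Gaussian noise drawn fresh at every step.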

# Pendulum Environment: Proof-of-Concept

In [None]:
env_name = 'Pendulum-v0'

env = gym.make(env_name)
agent = Agent(env=env, env_name = env_name, layer_dims = (400,300), batch_size = 64, 
              rho = 0.01, gamma = 0.95, lr_critic= 0.005, lr_actor= 0.003, b_normalize = False)
score_history = train(agent, env, episodes = 100, debug = False, 
                           save = True)

In [None]:
plt.title("DDPG on Pendulum")
plt.plot(score_history, c = 'orange', label = 'DDPG')
plt.grid()

In [None]:
## Render 5 episodes to view policy!
evaluate=True
_ =train(agent,env,episodes=5, debug=False, save=False, evaluate=evaluate)

## Bipedal Environment

In [2]:
def train_walker(agent, env, episodes = 200, epsilon = 0, debug = True, save = True, 
          load_checkpoint = False, evaluate = False):
    
    
    score_history = []
    raw_reward_history = []
    best_score = env.reward_range[0]
    
    if load_checkpoint:
        n_steps = 0
        while n_steps <= agent.batch_size:
            state = env.reset()[:14]
            action = env.action_space.sample()
            state_p, reward, terminal, info = env.step(action)
            state_p = state_p[:14]
            agent.remember_experience(state, action, reward, state_p, terminal)
            n_steps += 1
        agent.learn()
        agent.load_weights()
      


    ## Game Loop
 
    for i in range(episodes):
        state = env.reset()[:14]
        terminal = False
        score = 0
        raw_reward =0 
        action_sequence = []

        while not terminal:
            
            if evaluate:
                env.render() 
                
            action = agent.compute_action(state, epsilon, evaluate = evaluate)
            action_sequence.append(action)
            
            state_p, reward, terminal, info = env.step(action)
            state_p = state_p[:14]
            
            raw_reward += reward
            
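            # Reward shaping to discourage the agent from learning to sit still:
            # penalize steps where state_p[3] is below 0.001, otherwise add a bonus of at least 10.
            # Note: in Gym's BipedalWalker observation, index 3 is the hull's vertical speed
            # and index 2 its horizontal speed; index 3 is kept here as in the original run.
            if state_p[3] < 0.001:
                reward -= 100
            else:
                reward += max(state_p[3] * 100, 10)
            score += reward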
  
            agent.remember_experience(state, action, reward, state_p, terminal)

            if not evaluate:
                agent.learn()
            else:
                print(state)

            state = state_p


        score_history.append(score)
        raw_reward_history.append(raw_reward)
        avg_score = np.mean(raw_reward_history[-75:])

        if avg_score > best_score:
            best_score = avg_score
            if save:
                agent.save_weights(debug=debug, iteration=i)

        if debug:
            print('Ep: ', i, 'Raw Score %.1f'% raw_reward, 'Score %.1f' % score, 'Avg %.1f' % avg_score, 
                  'Actions:', len(action_sequence))
        elif i % 10 == 0:
            print('Ep: ', i, 'Raw Score %.1f'% raw_reward, 'Score %.1f' % score, 'Avg %.1f' % avg_score, 
                      'Actions:', len(action_sequence))
    if evaluate:
        env.close()
        
    return score_history, raw_reward_history
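
The reward shaping inside `train_walker` can be isolated to see how it scales. The sketch below is a standalone restatement of that rule; the function name `shape_reward` and its keyword parameters are hypothetical, with defaults copied from the constants in the loop above.

```python
def shape_reward(raw_reward, speed, threshold=0.001, penalty=100.0, gain=100.0, floor=10.0):
    """Apply the shaping rule from train_walker to a single step's reward.

    Steps with speed below the threshold lose `penalty`; all other steps
    gain max(speed * gain, floor).
    """
    if speed < threshold:
        return raw_reward - penalty
    return raw_reward + max(speed * gain, floor)

stalled = shape_reward(0.0, 0.0)   # below threshold: penalty applies
slow = shape_reward(0.0, 0.05)     # bonus is clamped up to the floor
fast = shape_reward(0.0, 0.5)      # bonus grows with speed
```

Each step below the threshold costs 100 points, which is on the order of the entire falling penalty, so the shaped `Score` can sit far below the `Raw Score` even in short episodes.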

## Train an Agent from the Beginning

In [3]:
env_name = 'BipedalWalker-v2'
env = gym.make(env_name)
agent = Agent(env=env, env_name = env_name, layer_dims = (500,400), batch_size = 64, state_size = 14,
            rho = 0.001, gamma = 0.99, lr_critic= 0.015, lr_actor= 0.003, b_normalize = False)
score_history, raw_history = train_walker(agent, env, episodes = 2000, debug = False, save = True, epsilon = 0.005,
                                         load_checkpoint=False)

Ep:  0 Raw Score -117.8 Score -7657.8 Avg -117.8 Actions: 137
Ep:  10 Raw Score -122.4 Score -3101.5 Avg -147.2 Actions: 43
Ep:  20 Raw Score -114.4 Score -10414.4 Avg -131.5 Actions: 114
Ep:  30 Raw Score -115.9 Score -15265.9 Avg -126.3 Actions: 157
Ep:  40 Raw Score -120.5 Score -3220.5 Avg -124.1 Actions: 42
Ep:  50 Raw Score -122.7 Score -3222.7 Avg -123.2 Actions: 42
Ep:  60 Raw Score -114.9 Score -8364.9 Avg -122.6 Actions: 99
Ep:  70 Raw Score -112.0 Score -9232.0 Avg -121.7 Actions: 100
Ep:  80 Raw Score -114.4 Score -11214.4 Avg -118.5 Actions: 122
Ep:  90 Raw Score -123.4 Score -3222.1 Avg -116.7 Actions: 42
Ep:  100 Raw Score -121.7 Score -3211.7 Avg -117.4 Actions: 43
Ep:  110 Raw Score -120.4 Score -3550.4 Avg -117.7 Actions: 42
Ep:  120 Raw Score -111.9 Score -7341.9 Avg -117.9 Actions: 80
Ep:  130 Raw Score -117.8 Score -2937.8 Avg -117.4 Actions: 37
Ep:  140 Raw Score -118.3 Score -3038.3 Avg -117.1 Actions: 38
Ep:  150 Raw Score -113.3 Score -9353.3 Avg -117.6 Actions

Ep:  1290 Raw Score -119.2 Score -3129.2 Avg -118.2 Actions: 40
Ep:  1300 Raw Score -121.5 Score -3221.5 Avg -118.5 Actions: 42
Ep:  1310 Raw Score -119.4 Score -3219.4 Avg -117.2 Actions: 42
Ep:  1320 Raw Score -118.7 Score -3128.7 Avg -117.4 Actions: 40
Ep:  1330 Raw Score -112.3 Score -9202.3 Avg -117.6 Actions: 103
Ep:  1340 Raw Score -112.0 Score -7432.0 Avg -117.4 Actions: 82
Ep:  1350 Raw Score -121.6 Score -3221.6 Avg -117.7 Actions: 42
Ep:  1360 Raw Score -121.5 Score -3241.5 Avg -117.0 Actions: 40
Ep:  1370 Raw Score -113.6 Score -9033.6 Avg -116.5 Actions: 98
Ep:  1380 Raw Score -113.2 Score -8733.2 Avg -116.3 Actions: 95
Ep:  1390 Raw Score -119.3 Score -3239.3 Avg -116.7 Actions: 40
Ep:  1400 Raw Score -121.6 Score -3241.6 Avg -116.8 Actions: 40
Ep:  1410 Raw Score -113.5 Score -8733.5 Avg -116.5 Actions: 95
Ep:  1420 Raw Score -118.5 Score -17748.5 Avg -116.7 Actions: 184
Ep:  1430 Raw Score -113.5 Score -9033.5 Avg -116.3 Actions: 98
Ep:  1440 Raw Score -113.2 Score -893

In [None]:
agent.save_weights()

In [None]:
n_score = score_history
r_score = raw_history

In [None]:
2.85397625e-01

In [4]:
## Render 20 episodes to view policy!
#agent.batch_size=2000
env.close()
_ =train_walker(agent,env,episodes=20, debug=False, save=False, evaluate=True)#, load_checkpoint=True)

[ 2.74744513e-03 -1.05991354e-05  8.24505687e-04 -1.59999275e-02
  9.20036435e-02 -1.08806707e-03  8.60245302e-01  2.20709651e-03
  1.00000000e+00  3.24101932e-02 -1.08799001e-03  8.53793263e-01
  7.69489755e-04  1.00000000e+00]
[-0.01822847 -0.065975   -0.01458597  0.02806884 -0.23439743  0.19835772
  1.45398879  0.83326459  1.         -0.33928889 -0.27481112  1.70878768
  0.99151937  1.        ]
[-0.08624539 -0.12213281 -0.04749353  0.02317769  0.17412107  0.98711103
  1.02098375 -0.03091737  1.          0.03321493  1.00251937  1.42315671
  0.00218658  1.        ]
[-1.51124910e-01 -1.23203964e-01 -4.77868116e-02  3.46814930e-03
  3.09077084e-01  1.00017297e+00  8.84279288e-01  0.00000000e+00
  1.00000000e+00  1.77706912e-01  9.99932766e-01  1.27633065e+00
 -7.83602397e-05  1.00000000e+00]
[-2.28627428e-01 -1.46005363e-01 -4.45964742e-02  3.63030529e-02
  3.61151278e-01  4.97735739e-01  8.76529798e-01  7.39796956e-01
  1.00000000e+00  2.12232992e-01  1.00001204e+00  1.19923408e+00
  1

[-1.07196319e+00 -4.58193153e-03  3.32764041e-02 -1.45786211e-03
  1.13365328e+00  1.20345503e-05  9.31289144e-01  4.11272049e-06
  1.00000000e+00  1.12649226e+00 -2.32830644e-07  9.34887722e-01
 -2.02159087e-06  0.00000000e+00]
[-1.07435179e+00 -4.77776378e-03  3.46934116e-02 -1.57295659e-03
  1.13365173e+00  1.24946237e-05  9.31292400e-01  4.31761146e-06
  1.00000000e+00  1.12649047e+00 -2.42143869e-07  9.34889585e-01
 -2.10106373e-06  0.00000000e+00]
[-1.07684207e+00 -4.98123497e-03  3.61650109e-02 -1.69733360e-03
  1.13365126e+00  1.29807740e-05  9.31294680e-01  4.43930427e-06
  1.00000000e+00  1.12648892e+00 -2.51457095e-07  9.34891701e-01
 -2.18053659e-06  0.00000000e+00]
[-1.07943809e+00 -5.19266784e-03  3.76934373e-02 -1.83172718e-03
  1.13365006e+00  1.34669244e-05  9.31298122e-01  4.63922819e-06
  1.00000000e+00  1.12648678e+00 -2.60770321e-07  9.34893727e-01
 -2.26746003e-06  0.00000000e+00]
[-1.08214390e+00 -5.41239798e-03  3.92810011e-02 -1.97698012e-03
  1.13364887e+00  1

[-1.24225712e+00 -1.40046489e-02  1.00167253e-01 -9.52822566e-03
  1.13287306e+00  4.89503145e-06  9.29398462e-01  0.00000000e+00
  0.00000000e+00  1.12663448e+00 -1.56462193e-07  9.34916131e-01
 -2.37921874e-05  1.00000000e+00]
[-1.24947000e+00 -1.44348943e-02  1.03140113e-01 -1.03010082e-02
  1.13285697e+00  5.16325235e-06  9.29423332e-01  0.00000000e+00
  0.00000000e+00  1.12662876e+00 -1.63912773e-07  9.34906937e-01
 -2.50836213e-05  1.00000000e+00]
[-1.25691247e+00 -1.48880076e-02  1.06262176e-01 -1.11342156e-02
  1.13284588e+00  5.38676977e-06  9.29440230e-01  0.00000000e+00
  0.00000000e+00  1.12661707e+00 -1.71363354e-07  9.34916012e-01
 -2.62608131e-05  1.00000000e+00]
[-1.26459002e+00 -1.53649354e-02  1.09536366e-01 -1.20329630e-02
  1.13282907e+00  5.69224358e-06  9.29465711e-01  0.00000000e+00
  0.00000000e+00  1.12661016e+00 -1.78813934e-07  9.34907682e-01
 -2.78155009e-05  1.00000000e+00]
[-1.27252126e+00 -1.58661330e-02  1.12965567e-01 -1.30024576e-02
  1.13281691e+00  5

[-1.86849177e+00 -5.91799164e-02  3.15199528e-01 -1.93177872e-01
  1.13012934e+00  1.67697668e-04  9.33507502e-01  0.00000000e+00
  0.00000000e+00  1.12536144e+00 -5.09619713e-06  9.35147941e-01
 -1.01109346e-03  1.00000000e+00]
[-1.89913237e+00 -6.14652729e-02  3.17707644e-01 -2.06792202e-01
  1.12987232e+00  6.63101673e-05  9.33913529e-01  0.00000000e+00
  0.00000000e+00  1.12529516e+00 -2.02655792e-06  9.34905946e-01
 -4.02530034e-04  1.00000000e+00]
[-1.92983437 -0.06145836  0.31675077 -0.22323433  1.12968779  0.
  0.93419063  0.          0.          1.12511349  0.          0.93511307
  0.          0.        ]
[-1.96279907e+00 -6.61133432e-02  3.20082693e-01 -2.36141491e-01
  1.12940943e+00  2.43097544e-04  9.34650242e-01  0.00000000e+00
  0.00000000e+00  1.12501836e+00 -8.64267349e-06  9.34906602e-01
 -3.80794207e-04  1.00000000e+00]
[-1.99578691e+00 -6.61039686e-02  3.18937883e-01 -2.52600517e-01
  1.12919235e+00 -2.98023224e-08  9.34906602e-01  0.00000000e+00
  0.00000000e+00  1

[-1.20032084e+00 -6.08109891e-03  4.36598861e-02 -3.02804917e-03
  1.13491714e+00  1.88872218e-06  9.34907660e-01  0.00000000e+00
  0.00000000e+00  1.12700069e+00 -5.96046448e-08  9.34900753e-01
 -8.92579556e-06  0.00000000e+00]
[-1.20352459e+00 -6.37709200e-03  4.57724190e-02 -3.26802701e-03
  1.13491583e+00  2.27987766e-06  9.34907421e-01  0.00000000e+00
  0.00000000e+00  1.12700999e+00 -7.45058060e-08  9.34902363e-01
 -1.08977159e-05  0.00000000e+00]
[-1.20687139e+00 -6.68366075e-03  4.79558015e-02 -3.52702886e-03
  1.13490999e+00  2.22027302e-06  9.34906907e-01  0.00000000e+00
  0.00000000e+00  1.12701297e+00 -7.07805157e-08  9.34904404e-01
 -1.05698903e-05  0.00000000e+00]
[-1.21037817e+00 -7.00088680e-03  5.02132809e-02 -3.80603582e-03
  1.13490999e+00  2.27242708e-06  9.34906900e-01  0.00000000e+00
  0.00000000e+00  1.12701821e+00 -6.70552254e-08  9.34904173e-01
 -1.08554959e-05  0.00000000e+00]
[-1.21404219e+00 -7.32933223e-03  5.25480974e-02 -4.10689324e-03
  1.13490784e+00  2

[-1.49832034e+00 -3.02475882e-02  2.02171440e-01 -5.52124977e-02
  1.13454401e+00  1.79857016e-05  9.34906662e-01  0.00000000e+00
  0.00000000e+00  1.12657785e+00 -5.66244125e-07  9.34907496e-01
 -9.64800517e-05  0.00000000e+00]
[-1.51404071e+00 -3.14694238e-02  2.09025421e-01 -5.95514297e-02
  1.13451076e+00  1.92523003e-05  9.34906155e-01  0.00000000e+00
  0.00000000e+00  1.12654042e+00 -5.96046448e-07  9.34906811e-01
 -1.03980303e-04  0.00000000e+00]
[-1.53039384e+00 -3.27374482e-02  2.15974803e-01 -6.42183304e-02
  1.13447607e+00  2.05934048e-05  9.34906602e-01  0.00000000e+00
  0.00000000e+00  1.12649906e+00 -6.40749931e-07  9.34906870e-01
 -1.11897786e-04  0.00000000e+00]
[-1.54740357e+00 -3.40531182e-02  2.23002176e-01 -6.92353725e-02
  1.13443911e+00  2.20090151e-05  9.34906632e-01  0.00000000e+00
  0.00000000e+00  1.12645483e+00 -6.85453415e-07  9.34908062e-01
 -1.20441119e-04  0.00000000e+00]
[-1.56509423e+00 -3.54179549e-02  2.30086985e-01 -7.46255589e-02
  1.13439929e+00  2

KeyboardInterrupt: 

In [None]:
env.close()