### Summary

This repo presents a solution to the Continuous Control project using a DDPG agent. DDPG uses an actor and a critic. The actor acts on the environment according to a local policy network. The critic evaluates the state-action value of the actor's choices and uses that estimate to guide how the actor should improve its policy. In addition to the local actor and critic, the agent keeps a target actor and a target critic in order to stabilize training.
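The roles of the four networks can be illustrated with a minimal sketch of the standard DDPG objectives. This is not the code from `ddpg_agent.py`: the function names and the toy callables standing in for the target networks are assumptions made for illustration, and only the discount rate of 0.99 comes from this repo's configuration.

```python
# Minimal sketch of the standard DDPG objectives, using plain Python numbers.
# All names below are hypothetical; real networks would operate on tensors.

def td_target(reward, next_state, done, target_actor, target_critic, gamma=0.99):
    """Bootstrapped critic target: y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    next_action = target_actor(next_state)            # target actor picks the next action
    next_q = target_critic(next_state, next_action)   # target critic evaluates it
    return reward + gamma * (1.0 - float(done)) * next_q


def critic_loss(q_values, targets):
    """Mean squared error between local critic estimates and TD targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(targets)


def actor_loss(q_values_for_actor_actions):
    """The actor maximizes Q(s, mu(s)), i.e. it minimizes the negative mean."""
    return -sum(q_values_for_actor_actions) / len(q_values_for_actor_actions)


# Toy stand-ins for the target networks (assumptions, not the repo's models).
toy_target_actor = lambda state: 0.5 * state
toy_target_critic = lambda state, action: state + action

y = td_target(1.0, 2.0, False, toy_target_actor, toy_target_critic)
y_terminal = td_target(1.0, 2.0, True, toy_target_actor, toy_target_critic)
```

Because the target is computed from the slowly changing target networks rather than the local ones, the value the critic regresses toward does not shift with every gradient step, which is the stabilizing effect described above.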

### Hyper-parameters

The hyperparameters used in this solution are shown below:

In [1]:
from config import *

In [2]:
print('Memory buffer, ', MEMORY_BUFFER)
print('Batch size, ', BATCH_SIZE)
print('%d updates per step' % UPDATE_FREQUENCY_PER_STEP)
print('Discount rate, ', GAMMA)
print('Actor learning rate, ', LR_A)
print('Critic learning rate, ', LR_C)
print('Soft update TAU, ', TAU)

Memory buffer,  1000000
Batch size,  256
5 updates per step
Discount rate,  0.99
Actor learning rate,  0.0005
Critic learning rate,  0.0005
Soft update TAU,  0.001


- The replay buffer holds up to 1,000,000 experiences.
- The agent collects 20 data points at each step and adds these experiences to memory.
- The agent runs 5 updates per step.
- At each step, the agent randomly samples 256 past experiences from memory and updates the local actor and critic based on the sampled experiences.
- The actor and the critic use the same learning rate, 0.0005.
- The agent soft-updates the target actor and target critic every time the local actor and local critic are updated, using a $\tau$ of 0.001.
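The procedure in the list above can be sketched with the standard library alone. This is an illustrative sketch, not the repo's implementation: `ReplayBuffer`, `soft_update`, `learn_step`, `NUM_AGENTS`, and the placeholder `update_fn` are hypothetical names, parameters are plain lists of floats rather than network weights, and drawing a fresh batch for each of the 5 updates and waiting until the buffer holds one full batch are both assumptions.

```python
import random
from collections import deque

# Values reported in this document.
MEMORY_BUFFER = 1_000_000
BATCH_SIZE = 256
UPDATE_FREQUENCY_PER_STEP = 5
TAU = 0.001
NUM_AGENTS = 20  # assumption: the 20 data points per step come from 20 parallel agents


class ReplayBuffer:
    """Fixed-size memory; the oldest experiences are dropped when it is full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(local_params, target_params, tau):
    """target <- tau * local + (1 - tau) * target, element-wise."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


def learn_step(buffer, transitions, local_params, target_params, update_fn):
    """Store this step's transitions, then run the configured number of updates."""
    for transition in transitions:
        buffer.add(transition)
    if len(buffer) < BATCH_SIZE:  # assumption: wait until one full batch is available
        return local_params, target_params
    for _ in range(UPDATE_FREQUENCY_PER_STEP):
        batch = buffer.sample(BATCH_SIZE)
        local_params = update_fn(local_params, batch)  # stands in for the gradient steps
        target_params = soft_update(local_params, target_params, TAU)
    return local_params, target_params


# Toy run: 13 environment steps with a no-op update in place of real learning.
buffer = ReplayBuffer(MEMORY_BUFFER)
local, target = [1.0], [0.0]
no_op_update = lambda params, batch: params
for step in range(13):
    transitions = [(step, agent_id) for agent_id in range(NUM_AGENTS)]
    local, target = learn_step(buffer, transitions, local, target, no_op_update)
```

With $\tau = 0.001$, each soft update moves the target parameters only 0.1% of the way toward the local ones, so the targets lag well behind the local networks even with 5 updates per step.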

### Training Scores

The following plot shows how the policy improves during training. The agent reaches a score of 30 within 50 episodes and stabilizes around 38 after 75 episodes.

![DDPG Training Scores](DDPG_scores.png)

### Test Agent

In [1]:
import torch

from ddpg_agent import DDPGAgent
from env_wrapper import EnvWrapper
from utils import test_agent, plot_scores

In [2]:
agent = DDPGAgent(train_mode=False)
env = EnvWrapper(file_name=r'Reacher_Windows_x86_64\Reacher.exe', train_mode=False)  # raw string keeps the Windows backslash literal

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [3]:
agent.actor.load_state_dict(torch.load('checkpoints/reacher_ddpg_actor_checkpoint.pth'))
agent.critic.load_state_dict(torch.load('checkpoints/reacher_ddpg_critic_checkpoint.pth'))

In [4]:
test_scores = test_agent(env, agent, 5)




In [5]:
test_episodes = 1

In [6]:
scores = []
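for e in range(1, test_episodes+1):
    # Reset the agent and start a fresh episode (a single env reset is enough)
    agent.reset()
    agent.states = env.reset()
    done = False

    while not done:
        # No exploration noise at test time
        agent.act(add_noise=False)
        agent.rewards, agent.next_states, agent.dones = env.step(agent.actions)
        agent.scores += agent.rewards
        agent.step_count += 1
        agent.states = agent.next_states
        # End the episode as soon as any entry of agent.dones is True
        done = any(agent.dones)

    # Record the mean score so the `scores` list is actually filled
    scores.append(agent.scores.mean())
    print('Episode %d, avg score %.2f' % (e, agent.scores.mean()))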

Episode 1, avg score 0.00



In [8]:
agent.step_count

702

In [None]:
plot_scores(test_scores, 'DDPG_Test')