# Continuous Control

---

Congratulations on completing the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program!  In this notebook, you will learn how to control an agent in a more challenging environment, where the goal is to train a creature with four arms to walk forward.  **Note that this exercise is optional!**

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Crawler.app"`
- **Windows** (x86): `"path/to/Crawler_Windows_x86/Crawler.exe"`
- **Windows** (x86_64): `"path/to/Crawler_Windows_x86_64/Crawler.exe"`
- **Linux** (x86): `"path/to/Crawler_Linux/Crawler.x86"`
- **Linux** (x86_64): `"path/to/Crawler_Linux/Crawler.x86_64"`
- **Linux** (x86, headless): `"path/to/Crawler_Linux_NoVis/Crawler.x86"`
- **Linux** (x86_64, headless): `"path/to/Crawler_Linux_NoVis/Crawler.x86_64"`

For instance, if you are using a Mac, then you downloaded `Crawler.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Crawler.app")
```
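
As a minimal sketch of how the list above could be turned into code, the hypothetical helper below (`crawler_file_name`, not part of ML-Agents) picks a path using Python's `platform` module. The `base_dir` argument, the `headless` flag, and the 64-bit check are assumptions for illustration; the folder and file names are copied from the list.

```python
import platform
import sys


def crawler_file_name(base_dir="path/to", system=None, is_64bit=None, headless=False):
    """Return the Crawler path for a platform (hypothetical helper)."""
    system = system or platform.system()   # "Darwin", "Windows" or "Linux"
    if is_64bit is None:
        is_64bit = sys.maxsize > 2**32      # True when running a 64-bit Python build
    if system == "Darwin":
        return f"{base_dir}/Crawler.app"
    if system == "Windows":
        folder = "Crawler_Windows_x86_64" if is_64bit else "Crawler_Windows_x86"
        return f"{base_dir}/{folder}/Crawler.exe"
    if system == "Linux":
        folder = "Crawler_Linux_NoVis" if headless else "Crawler_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return f"{base_dir}/{folder}/Crawler.{suffix}"
    raise ValueError(f"Unsupported platform: {system}")


print(crawler_file_name(system="Linux", is_64bit=True, headless=True))
```

The result can then be passed as the `file_name` argument when creating the environment.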

In [2]:
import os
# os.environ['PATH'] = f"{os.environ['PATH']}:/home/student/.local/bin"
# os.environ['PATH'] = f"{os.environ['PATH']}:/opt/conda/lib/python3.10/site-packages"

os.environ['PATH'] = f"{os.environ['PATH']}:/home/vidy/.local/bin"
os.environ['PATH'] = f"{os.environ['PATH']}:/home/vidy/mambaforge/envs/py310/lib/python3.10/site-packages"


os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

In [3]:
# !python -m pip freeze | grep numpy
# !pip -q install . 

## Set Up the Environment
The environment binary bundled with this repository is the Linux build with 20 agents.

In [4]:
## Setting up Environment  

from unityagents import UnityEnvironment
import numpy as np

# Path to the Unity environment binary.
# The binary provided with this repository is Linux-only; replace the path to use another build.
env_path = "Reacher.x86_64"

env = UnityEnvironment(file_name=env_path, no_graphics=True)
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
env_info = env.reset(train_mode=True)[brain_name]

num_agents = len(env_info.agents)
action_size = brain.vector_action_space_size

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_size -> 5.0
		goal_speed -> 1.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Found path: /home/vidy/RL_Reacher/Reacher.x86_64
Mono path[0] = '/home/vidy/RL_Reacher/Reacher_Data/Managed'
Mono config path = '/home/vidy/RL_Reacher/Reacher_Data/MonoBleedingEdge/etc'
Preloaded 'ScreenSelector.so'
Preloaded 'libgrpc_csharp_ext.x64.so'
Unable to preload the following plugins:
	ScreenSelector.so
	libgrpc_csharp_ext.x86.so
Logging to /home/vidy/.config/unity3d/Unity Technologies/Unity Environment/Player.log
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00 
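
The printout above reports 20 agents, a 33-dimensional observation per agent, and a 4-dimensional continuous action per agent. As a rough sketch of what that implies for the arrays exchanged with the environment, the snippet below builds a random action batch with NumPy only, so no Unity process is needed. The clipping range of [-1, 1] is an assumption about this environment's action bounds and is not shown in the output above.

```python
import numpy as np

# Sizes taken from the printout above
num_agents, state_size, action_size = 20, 33, 4

rng = np.random.default_rng(0)

# One row per agent: the same shape as `env_info.vector_observations`
states = np.zeros((num_agents, state_size))

# One action vector per agent, clipped to the assumed bounds of [-1, 1]
actions = np.clip(rng.standard_normal((num_agents, action_size)), -1.0, 1.0)

print(states.shape, actions.shape)
```

An array shaped like `actions` is what the training loop later passes to `env.step`.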

### Report Functions


In [5]:

# Helpers for saving the training report and the model weights
import os

import matplotlib.pyplot as plt
import numpy as np
import torch

def save_report(scores, end_times, log_file="training_logs.txt", folder="Report"):
    os.makedirs(folder, exist_ok=True)

    # 1. Plot and save scores
    plt.figure(figsize=(10, 6))
    plt.plot(scores, label="Scores")
    plt.axhline(y=np.mean(scores[-100:]), color="r", linestyle="--", label="Last 100 Average")
    plt.xlabel("Episode")
    plt.ylabel("Score")
    plt.title("Training Progress")
    plt.legend()
    plot_path = os.path.join(folder, "training_progress.png")
    plt.savefig(plot_path)
    plt.close()
    print(f"Score plot saved: {plot_path}")

    # 2. Write one log line per episode: that episode's score and its wall-clock duration
    log_path = os.path.join(folder, log_file)
    with open(log_path, "w") as f:
        for i, (score, duration) in enumerate(zip(scores, end_times), start=1):
            f.write(f"Episode: {i} Score: {round(score, 2)}, done in {round(duration, 2)} seconds\n")
    print(f"Logs saved: {log_path}")


def save_model(agent, folder="Report"):
    # Save the local actor and critic weights, creating the folder if needed
    os.makedirs(folder, exist_ok=True)
    actor_path = os.path.join(folder, "actor.pth")
    critic_path = os.path.join(folder, "critic.pth")
    torch.save(agent.actor_local.state_dict(), actor_path)
    torch.save(agent.critic_local.state_dict(), critic_path)
    print(f"Models saved: {actor_path}, {critic_path}")
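
To read the log back for later analysis, one option is a small parser like the sketch below. `parse_log_line` is a hypothetical helper, and its regular expression assumes the line layout written by `save_report` above (`Episode: N ... Score: X, done in Y seconds`), skipping any text between the episode number and `Score:`.

```python
import re

# Matches lines such as "Episode: 3 Score: 1.13, done in 18.73 seconds".
# Any extra label between the episode number and "Score:" is skipped.
_LINE = re.compile(
    r"^Episode: (\d+) .*?Score: (-?\d+(?:\.\d+)?), done in (\d+(?:\.\d+)?) seconds$"
)


def parse_log_line(line):
    """Return (episode, score, seconds), or None if the line does not match."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    episode, score, seconds = match.groups()
    return int(episode), float(score), float(seconds)


print(parse_log_line("Episode: 1 Score: 1.13, done in 18.73 seconds"))
```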

### Declare Global Variables

In [6]:

from Agent import DDPGAgent

BUFFER_SIZE = int(1e8)  # replay buffer size
BATCH_SIZE = 256        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
DEVICE = 'cuda'
USE_PER = False
use_mixed_precision = False  # set to True only if your NVIDIA GPU supports mixed precision

## To use PER (prioritized experience replay), uncomment the lines below;
## keep them commented out when using the plain ReplayBuffer.
## PER takes longer to complete one episode, and the learning is not consistent.
# USE_PER = True
# BUFFER_SIZE = int(1e4)
# BATCH_SIZE = 128

agent = DDPGAgent(
    DEVICE=DEVICE,
    BUFFER_SIZE=BUFFER_SIZE,
    BATCH_SIZE=BATCH_SIZE,
    GAMMA=GAMMA,
    TAU=TAU,
    LR_ACTOR=LR_ACTOR,
    LR_CRITIC=LR_CRITIC,
    WEIGHT_DECAY=WEIGHT_DECAY,
    state_size=state_size, action_size=action_size, random_seed=0,
    use_mixed_precision=use_mixed_precision,
    USE_PER=USE_PER)
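
For intuition about `GAMMA` and `TAU`, the sketch below works through both on plain Python floats. It assumes `DDPGAgent` follows the usual DDPG conventions: a discounted return, and a soft target update of the form `target = TAU * local + (1 - TAU) * target`. The `Agent` module is not shown here, so the helper names are illustrative only.

```python
GAMMA = 0.99  # discount factor, as declared above
TAU = 1e-3    # soft-update rate, as declared above


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over a reward sequence."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def soft_update(local_params, target_params, tau=TAU):
    """Move each target parameter a fraction tau toward its local counterpart."""
    return [tau * local + (1.0 - tau) * target
            for local, target in zip(local_params, target_params)]


print(discounted_return([1.0, 1.0, 1.0]))   # 1 + 0.99 + 0.99**2
print(soft_update([1.0, 2.0], [0.0, 0.0]))  # each target moves 0.1% of the way
```

With a small `TAU`, the target parameters trail the local ones slowly, which is the usual reason for preferring a soft update over copying the weights outright.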




### Training code
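
The loop in this section keeps a 100-episode `scores_window`. The sketch below shows one way that window could be used to decide when training has reached a target. The target of 30.0 and the helper name `is_solved` are assumptions for illustration; this section does not state a threshold.

```python
from collections import deque

TARGET_SCORE = 30.0   # assumed target; adjust to the project's actual criterion
WINDOW = 100          # same length as `scores_window` in the training cell


def is_solved(scores_window, target=TARGET_SCORE, window=WINDOW):
    """True once the window is full and its mean reaches the target."""
    if len(scores_window) < window:
        return False
    return sum(scores_window) / len(scores_window) >= target


scores_window = deque(maxlen=WINDOW)
for episode_score in [31.0] * 99:
    scores_window.append(episode_score)
print(is_solved(scores_window))   # window not yet full
scores_window.append(31.0)
print(is_solved(scores_window))   # full window with mean 31.0
```

Requiring a full window avoids declaring success from a handful of lucky early episodes.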

In [None]:
import numpy as np
from collections import deque
import time

max_steps_per_episode = 1000       
scores = []  
end_times = []
scores_window = deque(maxlen=100)  
i_episode = 0
max_episode = 2000  # upper bound on the number of episodes, so training cannot run forever

while i_episode < max_episode:
    i_episode += 1
    env_info = env.reset(train_mode=True)[brain_name]  
    states = env_info.vector_observations            
    agent.reset()                                     
    episode_scores = np.zeros(num_agents)  # per-agent score for this episode
    start_episode = time.perf_counter()
    start_step = time.perf_counter()
    for t in range(max_steps_per_episode):
        actions = agent.act(states)                     
        env_info = env.step(actions)[brain_name]       
        next_states = env_info.vector_observations  
        rewards = env_info.rewards                   
        dones = env_info.local_done                 

        # Every 100 steps, report the running mean score and the time those steps took
        if t % 100 == 0:
            end_step = time.perf_counter() - start_step
            start_step = time.perf_counter()
            print(f"Scores for episode {i_episode} step {t}: {np.mean(episode_scores):.2f}, calculated in {end_step:.2f}s")

        # Save each agent's experience and learn
        for i in range(num_agents):
            agent.step(states[i], actions[i], rewards[i], next_states[i], dones[i], t)

        # Transition to next state
        states = next_states                           
        episode_scores += rewards                             

        if np.any(dones): 
            break

    avg_score = np.mean(episode_scores)
    scores.append(avg_score)
    scores_window.append(avg_score)
    end_episode = time.perf_counter() - start_episode
    end_times.append(end_episode)
    
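    # Save the report and the model after every episode
    print(f"\nEpisode {i_episode}\tAverage Score: {avg_score:.2f}, finished in {end_episode}s")
    save_report(scores, end_times)
    # This message is printed after every episode; it does not test whether a score target was reached
    print(f"\nEnvironment solved in {i_episode} episodes!, finished in {end_episode} seconds ")
    save_model(agent)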
    
    
    # Gradually reduce noise
    if hasattr(agent.noise, 'sigma'):
        agent.noise.sigma = max(0.1, agent.noise.sigma * 0.995)   


Scores for episode 1 step 0: 0.00, calculated in 0.04s
Scores for episode 1 step 100: 0.07, calculated in 1.49s
Scores for episode 1 step 200: 0.25, calculated in 1.72s
Scores for episode 1 step 300: 0.38, calculated in 1.81s
Scores for episode 1 step 400: 0.49, calculated in 1.81s
Scores for episode 1 step 500: 0.58, calculated in 1.89s
Scores for episode 1 step 600: 0.67, calculated in 1.95s
Scores for episode 1 step 700: 0.78, calculated in 1.94s
Scores for episode 1 step 800: 0.88, calculated in 1.97s
Scores for episode 1 step 900: 0.97, calculated in 2.15s

Episode 1	Average Score: 1.13, finished in 18.730684647001908s
Score plot saved: Report/training_progress.png
Logs saved: Report/training_logs.txt

Environment solved in 1 episodes!, finished in 18.730684647001908 seconds 
Models saved: Report/actor.pth, Report/critic.pth
Scores for episode 2 step 0: 0.00, calculated in 0.00s
Scores for episode 2 step 100: 0.10, calculated in 2.00s
Scores for episode 2 step 200: 0.22, calculate