# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
import os
# os.environ['PATH'] = f"{os.environ['PATH']}:/home/student/.local/bin"
# os.environ['PATH'] = f"{os.environ['PATH']}:/opt/conda/lib/python3.10/site-packages"

os.environ['PATH'] = f"{os.environ['PATH']}:/home/vidy/.local/bin"
os.environ['PATH'] = f"{os.environ['PATH']}:/home/vidy/mambaforge/envs/py310/lib/python3.10/site-packages"


os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

In [2]:
!python -m pip freeze | grep numpy

numpy==1.26.0


In [3]:
!pip -q install .

  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [571 lines of output]
        import pkg_resources
      Found cython-generated files...
      running bdist_wheel
      running build
      running build_py
      running build_project_metadata
      creating python_build
      creating python_build/lib.linux-x86_64-cpython-310
      creating python_build/lib.linux-x86_64-cpython-310/grpc
      copying src/python/grpcio/grpc/__init__.py -> python_build/lib.linux-x86_64-cpython-310/grpc
      copying src/python/grpcio/grpc/_auth.py -> python_build/lib.linux-x86_64-cpython-310/grpc
      copying src/python/grpcio/grpc/_channel.py -> python_build/lib.linux-x86_64-cpython-310/grpc
      copying src/python/grpcio/grpc/_plugin_wrapp

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```
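
The list of builds above can also be written as a small lookup. The sketch below is one possible way to do that, not part of `unityagents`: the helper name `banana_binary_path` and its `base_dir` parameter are hypothetical, and treating `x86_64`/`amd64` as the only 64-bit machine names is an assumption you may need to adjust.

```python
import platform


def banana_binary_path(system=None, machine=None, headless=False, base_dir="path/to"):
    """Return the Banana build path for a platform (hypothetical helper)."""
    system = system or platform.system()          # e.g. "Darwin", "Windows", "Linux"
    machine = (machine or platform.machine()).lower()
    is_64bit = machine in ("x86_64", "amd64")     # assumption: common 64-bit names

    if system == "Darwin":
        return f"{base_dir}/Banana.app"
    if system == "Windows":
        folder = "Banana_Windows_x86_64" if is_64bit else "Banana_Windows_x86"
        return f"{base_dir}/{folder}/Banana.exe"
    if system == "Linux":
        # The headless builds live in a separate "NoVis" folder
        folder = "Banana_Linux_NoVis" if headless else "Banana_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return f"{base_dir}/{folder}/Banana.{suffix}"
    raise ValueError(f"No Banana build listed for system {system!r}")


print(banana_binary_path("Linux", "x86_64", headless=True))
```

The returned string can then be passed as `file_name` when creating the `UnityEnvironment`.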

In [4]:
from unityagents import UnityEnvironment
import numpy as np

# Path to the Unity environment binary
env_path = "Banana_Linux/Banana.x86_64"

# Initialize the UnityEnvironment
env = UnityEnvironment(file_name=env_path, no_graphics=True)


# Reset the environment
env.reset()


INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Found path: /home/vidy/RL_banana/Value-based-methods/p1_navigation/Banana_Linux/Banana.x86_64
Mono path[0] = '/home/vidy/RL_banana/Value-based-methods/p1_navigation/Banana_Linux/Banana_Data/Managed'
Mono config path = '/home/vidy/RL_banana/Value-based-methods/p1_navigation/Banana_Linux/Banana_Data/MonoBleedingEdge/etc'
Preloaded 'ScreenSelector.so'
Preloaded 'libgrpc_csharp_ext.x64.so'
Unable to preload the following plugins:
	ScreenSelector.so
	libgrpc_csharp_ext.x86.so
Logging to /home/vidy/.config/unity3d/Unity Technologies/Unity Environment/Player.log


{'BananaBrain': <unityagents.brain.BrainInfo at 0x7f6251779de0>}

Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we check for the first brain available and set it as the default brain we will be controlling from Python.

In [5]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
brain = env.brains[brain_name]
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

Number of agents: 1
Number of actions: 4
States look like: [0.         1.         0.         0.         0.16895212 0.
 1.         0.         0.         0.20073597 1.         0.
 0.         0.         0.12865657 0.         1.         0.
 0.         0.14938059 1.         0.         0.         0.
 0.58185619 0.         1.         0.         0.         0.16089135
 0.         1.         0.         0.         0.31775284 0.
 0.        ]
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, the agent selects an action (uniformly) at random at each time step.  If the environment was started with graphics enabled, a window should pop up that allows you to observe the agent as it moves through the environment; with `no_graphics=True`, as used when the environment was started above, no window appears.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!
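
One common way to move from purely random actions toward learned ones is epsilon-greedy selection with a decaying epsilon. The sketch below uses only the standard library; the function names and the default values (`eps_init=1.0`, `eps_decay=0.99`, `eps_min=0.01`) are illustrative assumptions, and the `q_values` list stands in for whatever action-value estimates your agent produces.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def decayed_epsilon(episode, eps_init=1.0, eps_decay=0.99, eps_min=0.01):
    """Epsilon after `episode` multiplicative decays, floored at eps_min."""
    return max(eps_init * eps_decay ** episode, eps_min)


q_values = [0.1, 0.7, 0.2, 0.0]                  # made-up values for the 4 actions
print(epsilon_greedy(q_values, epsilon=0.0))     # epsilon 0 always takes the greedy action
print(decayed_epsilon(50))                       # roughly 0.605 with the defaults
```

With these defaults the agent explores almost all the time early on and settles at 1% random actions once the floor is reached.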

In [7]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# while True:
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     state = next_state                             # roll over the state to next time step
#     if done:                                       # exit loop if episode finished
#         break
    
# print("Score: {}".format(score))

When finished, you can close the environment.

In [8]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
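
The training cell that follows imports `PrioritizedReplayMemory` from a local `prioritized_replay` module whose source is not included in this notebook. To make the calls it relies on (`append`, `sample`, `update_priorities`, `len`) easier to follow, here is a minimal standard-library sketch of a buffer with the same call shape. The class name `SimplePrioritizedReplay`, the proportional sampling scheme, and the batch-max weight normalization are assumptions; the actual module may differ.

```python
import random


class SimplePrioritizedReplay:
    """Hypothetical stand-in; not the notebook's prioritized_replay module."""

    def __init__(self, maxlen, alpha=0.6, eps=1e-5):
        self.maxlen = maxlen
        self.alpha = alpha       # how strongly priorities skew sampling
        self.eps = eps           # keeps every priority strictly positive
        self.data = []           # stored transitions
        self.priorities = []     # one raw priority per transition
        self.pos = 0             # next slot to overwrite once full

    def __len__(self):
        return len(self.data)

    def append(self, transition, priority=1.0):
        priority = abs(priority) + self.eps
        if len(self.data) < self.maxlen:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            # Ring buffer: overwrite the oldest entry
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.maxlen

    def sample(self, batch_size, beta=0.4):
        # Sampling probability P(i) is proportional to priority_i ** alpha
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        n = len(self.data)
        indices = random.choices(range(n), weights=scaled, k=batch_size)
        # Importance-sampling weights (N * P(i)) ** -beta,
        # normalized by the largest weight in the batch
        weights = [(n * scaled[i] / total) ** (-beta) for i in indices]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return [self.data[i] for i in indices], weights, indices

    def update_priorities(self, indices, priorities):
        for i, p in zip(indices, priorities):
            self.priorities[i] = abs(p) + self.eps


memory = SimplePrioritizedReplay(maxlen=3)
for step, td_error in enumerate([0.1, 2.0, 0.5, 1.0]):
    memory.append(("transition", step), priority=td_error)
batch, weights, indices = memory.sample(2)
print(len(memory), len(batch))
```

Transitions with larger TD error are drawn more often, and the returned weights scale down their contribution to the loss to compensate for that bias.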

In [9]:


from dqn_dueling import DQNDueling as DQN
import torch
from experience_replay import ReplayMemory
from prioritized_replay import PrioritizedReplayMemory
import itertools
import random
from torch import nn
import os
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import pickle



# 'Agg': used to generate plots as images and save them to a file instead of
# rendering to screen
matplotlib.use('Agg')
env_info = env.reset(train_mode=True)[brain_name]


class DqnTrainer:
    def __init__(self, env_id, brain_name, RUNS_DIR, DATE_FORMAT):
        self.env_id = env_id
        self.DATE_FORMAT = DATE_FORMAT
        self.brain_name = brain_name

        # Loss function
        self.loss_fn = nn.MSELoss()
        self.optimizer = None
        
        # Path to Run info
        self.LOG_FILE = os.path.join(
            RUNS_DIR, f'{self.env_id}.log')
        self.MODEL_FILE = os.path.join(
            RUNS_DIR, f'{self.env_id}.pt')
        self.GRAPH_FILE = os.path.join(
            RUNS_DIR, f'{self.env_id}.png')

    def next_step(self, env, action):
        env_info = env.step(action.item())[self.brain_name]
        new_state = env_info.vector_observations[0]
        reward = env_info.rewards[0]
        terminated = env_info.local_done[0]
        return new_state, reward, terminated

    def train(self, env, env_info, num_states, num_actions, device, continue_from_checkpoint=False):
        brain = env.brains[self.brain_name]
        self.device = device
        self.action_size = brain.vector_action_space_size

        print('fc1_nodes = ',self.fc1_nodes)
        print('mini_batch_size = ',  self.mini_batch_size )
        policy_dqn = DQN(num_states, num_actions, self.fc1_nodes).to(device)
        target_dqn = DQN(num_states, num_actions, self.fc1_nodes).to(device)
        epsilon_history = []
        rewards_per_episode = []
        start_episode = 0 
        if continue_from_checkpoint:
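            checkpoint = torch.load(self.MODEL_FILE, map_location=device)
            # Resume the episode counter (not the step counter); checkpoints
            # without an 'episode' entry restart the count at 0
            start_episode = checkpoint.get('episode', 0)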

            with open(f"{self.env_id}_replay_memory.pkl", 'rb') as f:
                PERmemory = pickle.load(f)
            PERmemory.alpha = 0.6 
            print(f"Loaded replay memory with {len(PERmemory)} transitions.")

            # Restore models
            policy_dqn.load_state_dict(checkpoint['policy_model_state_dict'])
            target_dqn.load_state_dict(checkpoint['target_model_state_dict'])

            # Restore optimizer
            self.optimizer = torch.optim.Adam(
                policy_dqn.parameters(), lr=self.learning_rate_a)
            self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

            # Restore other variables
            epsilon = checkpoint.get('epsilon', self.epsilon_init)
            step_count = checkpoint.get('step_count', 0)
            best_rewards = checkpoint.get('best_rewards', -9999999)
            print(f"Resumed from step count: {step_count}, best rewards: {best_rewards}")
        
        else:
            print(f"Initial trainings")

            target_dqn.load_state_dict(policy_dqn.state_dict())

            # Policy Network optimizer, Adam Optimizer
            self.optimizer = torch.optim.Adam(
                policy_dqn.parameters(), lr=self.learning_rate_a)

            PERmemory = PrioritizedReplayMemory(maxlen=self.replay_memory_size, alpha=0.6)

            # memory = ReplayMemory(self.replay_memory_size)

            epsilon = self.epsilon_init

            step_count = 0
            best_rewards = -9999999

        start_time = datetime.now()
        last_graph_update_time = start_time

        log_message = f"{start_time.strftime(self.DATE_FORMAT)}: Training starting..."
        print(log_message)
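        # Append when resuming so the log from the earlier run is preserved
        with open(self.LOG_FILE, 'a' if continue_from_checkpoint else 'w') as file: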
            file.write(log_message + '\n')
            


        for episode in itertools.count(start=start_episode):
            env_info = env.reset(train_mode=True)[self.brain_name]
            state = env_info.vector_observations[0]
            state = torch.tensor(state, dtype=torch.float, device=device)
            terminated = False
            transition = None
            episode_reward = 0.0
            td_error = 1.0
            while not terminated:
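                if random.random() < epsilon:
                    # Explore: pick a uniformly random action
                    action = torch.tensor(
                        np.random.randint(self.action_size),
                        dtype=torch.int64, device=device)
                else:
                    # Exploit: argmax already returns an int64 tensor on `device`,
                    # so it is not re-wrapped with torch.tensor (avoids a UserWarning)
                    with torch.no_grad():
                        action = policy_dqn(
                            state.unsqueeze(dim=0)).squeeze().argmax()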

                new_state, reward, terminated = self.next_step(env, action)
                episode_reward += reward
                new_state = torch.tensor(
                    new_state, dtype=torch.float, device=device)
                reward = torch.tensor(reward, dtype=torch.float, device=device)
                
                # Calculate TD error for PER
                with torch.no_grad():
                    q_val = policy_dqn(state.unsqueeze(0)).squeeze()[action].item()
                    next_q_val = target_dqn(new_state.unsqueeze(0)).max().item()
                    td_error = abs(reward.item() + (1 - terminated) * self.discount_factor_g * next_q_val - q_val)


                # memory.append((state, action, new_state, reward, terminated))
                transition = (state, action, new_state, reward, terminated)
                PERmemory.append(transition, priority=td_error)


                step_count += 1
                state = new_state

            rewards_per_episode.append(episode_reward)
            

            if episode_reward > best_rewards:
                # Update the best score first so the checkpoint records the new value
                best_rewards = episode_reward
                torch.save({
                    'policy_model_state_dict': policy_dqn.state_dict(),
                    'target_model_state_dict': target_dqn.state_dict(),
                    'optimizer_state_dict': self.optimizer.state_dict(),
                    'epsilon': epsilon,
                    'episode': episode,
                    'step_count': step_count,
                    'best_rewards': best_rewards
                }, self.MODEL_FILE)

            if episode % 50 == 0:
                log_message = f"Episode {episode}: Total Mean reward = {np.mean(rewards_per_episode)}, Mean Reward last 50 = {np.mean(rewards_per_episode[-50:])}, Epsilon = {epsilon}, best_rewards = {best_rewards}"
                print(log_message)
                with open(self.LOG_FILE, 'a') as file:
                    file.write(log_message + '\n')
                    
            current_time = datetime.now()
            if current_time - last_graph_update_time > timedelta(seconds=10):
                self.save_graph(rewards_per_episode, epsilon_history)
                last_graph_update_time = current_time

            if len(PERmemory) > self.mini_batch_size:
                transitions, weights, indices = PERmemory.sample(self.mini_batch_size, beta=0.4)
                td_errors = self.optimize(transitions, weights, policy_dqn, target_dqn)
                PERmemory.update_priorities(indices, abs(td_errors.detach().cpu().numpy()))

                
            epsilon = max(epsilon * self.epsilon_decay, self.epsilon_min)
            epsilon_history.append(epsilon)

            if step_count % self.network_sync_rate == 0:
                target_dqn.load_state_dict(policy_dqn.state_dict())

    # Optimize with pytorch
    def optimize(self, transitions, weights, policy_dqn, target_dqn):
        # Transpose the list of experiences and separate each element
        states, actions, new_states, rewards, terminations = zip(*transitions)

        # Convert data to tensors for PyTorch to process with GPU
        states = torch.stack(states)
        actions = torch.stack(actions)
        new_states = torch.stack(new_states)
        rewards = torch.stack(rewards)
        terminations = torch.tensor(terminations, dtype=torch.float).to(self.device)
        weights = torch.tensor(weights, dtype=torch.float).to(self.device)  # Convert weights to Tensor

        with torch.no_grad():
            # Always use double DQN
            best_action_from_policy = policy_dqn(new_states).argmax(dim=1)
            target_q = rewards + \
                (1 - terminations) * self.discount_factor_g * \
                target_dqn(new_states).gather(
                    dim=1, index=best_action_from_policy.unsqueeze(dim=1)
                ).squeeze()

        # Calculate Q values from current policy
        current_q = policy_dqn(states).gather(dim=1, index=actions.unsqueeze(dim=1)).squeeze()

        td_errors = target_q - current_q
        loss = (weights * td_errors.pow(2)).mean()
        

        # Optimize the model
        self.optimizer.zero_grad()  # Clear gradients
        loss.backward()  # Compute gradients (backpropagation)
        self.optimizer.step()
        return td_errors

    def save_graph(self, rewards_per_episode, epsilon_history):
        fig = plt.figure(1)

        # Plot average rewards (Y-Axis) vs episodes (X-axis)
        mean_rewards = np.zeros(len(rewards_per_episode))
        for x in range(len(mean_rewards)):
            mean_rewards[x] = np.mean(rewards_per_episode[max(0, x-99):(x+1)])
        plt.subplot(121)  # Plot in a 1 row x 2 col grid, at cell 1

        plt.ylabel('Mean Rewards')
        plt.plot(mean_rewards)

        plt.subplot(122)
        plt.ylabel('Epsilon Decay')
        plt.plot(epsilon_history)

        plt.subplots_adjust(wspace=1.0, hspace=1.0)

        # save plots
        fig.savefig(self.GRAPH_FILE)
        plt.close(fig)
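
The `optimize` method above builds its target the Double DQN way: the policy network chooses the next action and the target network supplies that action's value. The plain-Python sketch below works through one transition with made-up numbers to show the difference this makes; the function name `double_dqn_target` is hypothetical, and lists of floats stand in for the network outputs.

```python
def double_dqn_target(reward, done, q_policy_next, q_target_next, gamma=0.99):
    """TD target where the policy net picks the action and the target net scores it."""
    # Action selection: argmax over the policy network's next-state values
    best_action = max(range(len(q_policy_next)), key=lambda a: q_policy_next[a])
    # Action evaluation: the target network's value for that same action
    bootstrap = q_target_next[best_action]
    # No bootstrapping past a terminal transition
    return reward + (1.0 - float(done)) * gamma * bootstrap


# Made-up numbers for one transition with 4 actions
q_policy_next = [0.2, 1.5, 0.3, 0.9]   # policy net prefers action 1
q_target_next = [0.4, 1.0, 2.0, 0.8]   # target net alone would prefer action 2

target = double_dqn_target(1.0, False, q_policy_next, q_target_next)
print(target)   # 1.0 + 0.99 * 1.0, using the target net's value for action 1
```

A plain DQN target would instead bootstrap from the target network's own maximum (`2.0` here), which is what the per-step priority estimate in the training loop does; decoupling selection from evaluation is meant to reduce that overestimation.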



In [None]:

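# Use the GPU when available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'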

# for printing date and time
DATE_FORMAT = "%m-%d %H: %M: %S"

# Directory for saving run info
RUNS_DIR = "runs"
os.makedirs(RUNS_DIR, exist_ok=True)

# 'Agg': used to generate plots as images and save them to a file instead of
# rendering to screen
matplotlib.use('Agg')

env.reset()
trainer = DqnTrainer('Banana_Linux', brain_name, RUNS_DIR, DATE_FORMAT)
trainer.learning_rate_a =  0.0005 
trainer.discount_factor_g = 0.99
trainer.network_sync_rate = 500
trainer.replay_memory_size = 1_000_000
trainer.mini_batch_size = 128
trainer.epsilon_init = 1
trainer.epsilon_decay = 0.99
trainer.epsilon_min = 0.01
trainer.stop_on_reward = 15
trainer.fc1_nodes = 512  

trainer.train(env, env_info, state_size, action_size, device, False)

fc1_nodes =  512
mini_batch_size =  128
Initial trainings
12-08 16: 46: 52: Training starting...
Episode 0: Total Mean reward = -2.0, Mean Reward last 50 = -2.0, Epsilon = 1, best_rewards = -2.0


  action = torch.tensor(action, dtype=torch.int64, device=device)


Episode 50: Total Mean reward = -0.3137254901960784, Mean Reward last 50 = -0.28, Epsilon = 0.6050060671375365, best_rewards = 2.0
Episode 100: Total Mean reward = 0.039603960396039604, Mean Reward last 50 = 0.4, Epsilon = 0.36603234127322926, best_rewards = 4.0
Episode 150: Total Mean reward = 0.4105960264900662, Mean Reward last 50 = 1.16, Epsilon = 0.22145178723886094, best_rewards = 6.0
Episode 200: Total Mean reward = 0.6865671641791045, Mean Reward last 50 = 1.52, Epsilon = 0.13397967485796175, best_rewards = 9.0
Episode 250: Total Mean reward = 0.8645418326693227, Mean Reward last 50 = 1.58, Epsilon = 0.08105851616218133, best_rewards = 12.0
Episode 300: Total Mean reward = 1.06312292358804, Mean Reward last 50 = 2.06, Epsilon = 0.04904089407128576, best_rewards = 12.0
Episode 350: Total Mean reward = 1.2792022792022792, Mean Reward last 50 = 2.58, Epsilon = 0.029670038450977095, best_rewards = 12.0
Episode 400: Total Mean reward = 1.6084788029925188, Mean Reward last 50 = 3.92,