<div align="center">
    <img src="https://www.sharif.ir/documents/20124/0/logo-fa-IR.png/4d9b72bc-494b-ed5a-d3bb-e7dfd319aec8?t=1609608338755" alt="Logo" width="200">
    <p><b> Reinforcement Learning Course, Dr. Rohban</b></p>
</div>


*Full Name:*

*Student Number:*

# Random Network Distillation (RND) with PPO - Homework Project

  

---

  


## 1. Introduction: Random Network Distillation (RND)

A common exploration strategy is to visit states where the prediction error of some quantity is large, for instance the TD error or the output of a random function.  
The RND algorithm ([Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894)) encourages exploration by rewarding the policy for taking transitions where the prediction error of a random neural network function is high.

Formally, let $f^*_\theta(s')$ be a randomly chosen vector-valued function represented by a neural network.  
RND trains another neural network, $\hat{f}_\phi(s')$, to match the predictions of $f^*_\theta(s')$ under the distribution of datapoints in the buffer, as shown below:

$$
\phi^* = \arg\min_\phi \mathbb{E}_{s,a,s'\sim\mathcal{D}} \left[ \left\| \hat{f}_\phi(s') - f^*_\theta(s') \right\| \right]
$$

If a transition $(s, a, s')$ is in the distribution of the data buffer, the prediction error $\mathcal{E}_\phi(s') = \left\| \hat{f}_\phi(s') - f^*_\theta(s') \right\|$ is expected to be small.  
On the other hand, for unseen state-action tuples, it is expected to be large.

In practice, RND uses two critics:
- an exploitation critic $Q_R(s,a)$, which estimates returns based on the true rewards,
- and an exploration critic $Q_E(s,a)$, which estimates returns based on the exploration bonuses.

To stabilize training, prediction errors are normalized before being used.
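
One way to picture this normalization is a running estimate of the error scale that each new batch is divided by. The sketch below is a minimal illustration in plain Python; `RunningStats` and `normalize_errors` are hypothetical names, and the project's own `RunningMeanStd` in `Common/utils.py` may be implemented differently.

```python
import math


class RunningStats:
    """Tracks a running mean and variance over batches of scalars (hypothetical helper)."""

    def __init__(self, epsilon=1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon  # small prior count avoids division by zero

    def update(self, batch):
        n = len(batch)
        if n == 0:
            return
        batch_mean = sum(batch) / n
        batch_var = sum((x - batch_mean) ** 2 for x in batch) / n
        delta = batch_mean - self.mean
        total = self.count + n
        # Combine the stored moments with the batch moments (parallel variance update)
        new_mean = self.mean + delta * n / total
        m2 = self.var * self.count + batch_var * n + delta ** 2 * self.count * n / total
        self.mean, self.var, self.count = new_mean, m2 / total, total


def normalize_errors(errors, stats, eps=1e-8):
    """Divide prediction errors by the running standard deviation."""
    std = math.sqrt(stats.var) + eps
    return [e / std for e in errors]


stats = RunningStats()
errors = [0.5, 1.5, 2.5, 3.5]
stats.update(errors)
scaled = normalize_errors(errors, stats)
```

Dividing by the running standard deviation (without subtracting the mean) keeps the bonuses non-negative while holding their scale roughly constant as the predictor improves.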

---

## 2. What You Will Implement

  

You will implement the missing core components of Random Network Distillation (RND) combined with a Proximal Policy Optimization (PPO) agent inside the MiniGrid environment.

  

Specifically, you will:

- Complete the architecture of TargetModel and PredictorModel.
- Complete the initialization of weights for these models.
- Implement the intrinsic reward calculation (prediction error).
- Implement the RND loss calculation.

You will complete TODO sections inside two main files:

    Core/ppo_rnd_agent.py
    Core/model.py

  

---

  

## 3. Project Structure


```
RND_PPO_Project/
 ├── main.py               # Main training loop and evaluation
 ├── requirements.txt      # Python dependencies
 ├── Core/
 │    ├── ppo_rnd_agent.py # Agent logic (policy + RND + training)
 │    └── model.py         # Model architectures (policy, predictor, target)
 └── Common/
      ├── config.py        # Hyperparameters and argument parsing
      ├── utils.py         # Utilities (normalization, helper functions)
      ├── logger.py        # TensorBoard logger
      ├── runner.py        # Parallel environment worker
      └── play.py          # Evaluation / Play script
```



---

## 4. Modules Explanation

| Module        | Description |
|---------------|-------------|
| `ppo_rnd_agent.py`    | **Core agent logic.** This file contains the PPO algorithm implementation and also handles the RND intrinsic reward mechanism. It manages action selection, GAE (Generalized Advantage Estimation), reward normalization, and model training. <br>➡️ You will modify this file to implement the intrinsic reward and RND loss functions. |
| `model.py`    | **Neural network architectures.** This defines the structure of the policy network (used for action selection) and the two RND networks (Target and Predictor). The policy network processes observations and outputs value estimates and a policy distribution, while the RND networks map observations to feature vectors. <br>➡️ You will define the structure of the `TargetModel` and `PredictorModel` classes here and implement proper initialization. |
| `utils.py`    | **Support utilities.** This includes helper functions like setting random seeds for reproducibility, maintaining running mean and variance for normalization, and a few decorators. It helps the rest of the codebase stay clean and modular. |
| `config.py`   | **Experiment settings.** It defines all training hyperparameters (learning rate, batch size, gamma, etc.) and parses command-line flags such as `--train_from_scratch` or `--do_test`. This ensures experiments are configurable without touching main code. |
| `logger.py`   | **Logging training metrics.** Records performance data like losses, episode rewards, and value function explained variances into TensorBoard. This helps you visually inspect whether the agent is learning or not. |
| `play.py`     | **Evaluation module.** This file runs a trained agent in the environment without further learning. It resets the environment, feeds observations through the trained policy, and executes actions until the episode terminates. |
| `runner.py`     | **Parallel environment interaction.** Runs a Gym environment in a separate process using torch.multiprocessing. It communicates with the main process to exchange observations and actions, enabling parallel experience collection. Supports episode reset and optional rendering. |
| `main.py`     | **Project entry point.** Orchestrates the full experiment — sets up environment, models, logger, and executes training or testing depending on the flag. This is where everything comes together. |

---

## 5. TODO Parts (Your Tasks)

You must complete the following parts:

| File | TODO Description |
| :--- | :--- |
| `Core/model.py` | Implement the architecture of `TargetModel` and `PredictorModel`. |
| `Core/model.py` | Implement `_init_weights()` method for proper initialization. |
| `Core/ppo_rnd_agent.py` | Implement `calculate_int_rewards()` to compute intrinsic rewards. |
| `Core/ppo_rnd_agent.py` | Implement `calculate_rnd_loss()` to compute predictor training loss. |


---

# Setup Code
Before getting started we need to run some boilerplate code to set up our environment. You'll need to rerun this setup code each time you start the notebook.

First, run this cell to load the [autoreload](https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html?highlight=autoreload) extension. This allows us to edit `.py` source files and re-import them into the notebook for a seamless editing and debugging experience.

In [None]:
# Import all necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import os
import sys
from abc import ABC

# Add the project directory to Python path
project_dir = os.getcwd()
if project_dir not in sys.path:
    sys.path.append(project_dir)

# Import project modules
from Core.model import PolicyModel, TargetModel, PredictorModel
from Core.ppo_rnd_agent import Brain
from Common.config import get_config
from Common.utils import RunningMeanStd
from Common.logger import Logger
from Common.runner import Runner
from Common.play import play

print("All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")


In [None]:
%load_ext autoreload
%autoreload 2

#### In the following cell you will mount your Google Drive and point to the project folder, if you are using Google Colab (which is preferable)

In [None]:
# ----------------------------
# 1. Mount Google Drive
# ----------------------------
from google.colab import drive
drive.mount('/content/drive')

# ----------------------------
# 2. Go to the project directory
# ----------------------------
import os

# TODO: Fill in the Google Drive path where you uploaded the assignment
# Example: If you create a 2020FA folder and put all the files under A1 folder, then '2020FA/A1'
# GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = '2020FA/A1'
GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = ''  # TODO: replace '' with your folder path under My Drive
GOOGLE_DRIVE_PATH = os.path.join('drive', 'My Drive', GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)
print(os.listdir(GOOGLE_DRIVE_PATH))


## 1. Install dependencies


## 6. Complete Implementation

The following sections show the complete implementation of all TODO parts that were required for the RND algorithm.


### 6.1 TargetModel Implementation

The TargetModel is a fixed random neural network that serves as the target for the PredictorModel to learn from. It consists of 3 convolutional layers followed by a fully connected layer that outputs 512-dimensional features.
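
The input size of the fully connected layer follows from the standard convolution output-size formula. The short check below is a sketch in plain Python (the helper names `conv2d_out_size` and `flatten_size` are illustrative, not part of the project); it assumes 3x3 kernels with stride 1 and padding 1, 128 output channels, and a 7x7 MiniGrid observation.

```python
def conv2d_out_size(size, kernel_size, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def flatten_size(height, width, out_channels=128, num_layers=3):
    """Flattened feature count after `num_layers` 3x3, stride-1, padding-1 convs."""
    for _ in range(num_layers):
        height = conv2d_out_size(height, kernel_size=3, stride=1, padding=1)
        width = conv2d_out_size(width, kernel_size=3, stride=1, padding=1)
    return out_channels * height * width


print(flatten_size(7, 7))  # 6272, i.e. 128 * 7 * 7
```

Because each layer preserves the 7x7 spatial size, only the channel count changes between layers.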


In [None]:
# TargetModel Implementation
class TargetModel(nn.Module, ABC):
    def __init__(self, state_shape):
        super(TargetModel, self).__init__()
        c, w, h = state_shape
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(c, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        
        # Calculate flattened size after conv layers
        # For MiniGrid 7x7 input, after 3 conv layers with padding=1, size remains 7x7
        flatten_size = 128 * 7 * 7
        
        # Fully connected layer to produce 512-dimensional features
        self.encoded_features = nn.Linear(flatten_size, 512)
        
        self._init_weights()  # Call this after defining layers

    def _init_weights(self):
        # Initialize all layers with orthogonal weights
        for layer in self.modules():
            if isinstance(layer, (nn.Conv2d, nn.Linear)):
                nn.init.orthogonal_(layer.weight, gain=np.sqrt(2))
                layer.bias.data.zero_()

    def forward(self, inputs):
        # Normalize input to [0, 1] range
        x = inputs / 255.0
        
        # Pass through convolutional layers with ReLU activations
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        
        # Flatten and pass through fully connected layer
        x = x.view(x.size(0), -1)  # Flatten
        encoded_features = self.encoded_features(x)
        
        return encoded_features


### 6.2 PredictorModel Implementation

The PredictorModel is a trainable neural network that learns to predict the output of the TargetModel. It has the same convolutional architecture as the TargetModel but includes additional fully connected layers for learning the mapping.


In [None]:
# PredictorModel Implementation
class PredictorModel(nn.Module, ABC):
    def __init__(self, state_shape):
        super(PredictorModel, self).__init__()
        c, w, h = state_shape
        
        # Convolutional layers (same as TargetModel)
        self.conv1 = nn.Conv2d(c, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        
        # Calculate flattened size after conv layers
        flatten_size = 128 * 7 * 7
        
        # Additional fully connected layers for prediction
        self.fc1 = nn.Linear(flatten_size, 512)
        self.fc2 = nn.Linear(512, 512)
        
        # Final output layer to match TargetModel output dimension
        self.encoded_features = nn.Linear(512, 512)
        
        self._init_weights()  # Call this after defining layers

    def _init_weights(self):
        # Initialize all layers with orthogonal weights
        for layer in self.modules():
            if isinstance(layer, (nn.Conv2d, nn.Linear)):
                if layer is self.encoded_features:
                    # Use smaller gain for final output layer to slow learning
                    nn.init.orthogonal_(layer.weight, gain=np.sqrt(0.01))
                else:
                    nn.init.orthogonal_(layer.weight, gain=np.sqrt(2))
                layer.bias.data.zero_()

    def forward(self, inputs):
        # Normalize input to [0, 1] range
        x = inputs / 255.0
        
        # Pass through convolutional layers with ReLU activations
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        
        # Flatten and pass through fully connected layers
        x = x.view(x.size(0), -1)  # Flatten
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        
        # Final encoded features
        encoded_features = self.encoded_features(x)
        
        return encoded_features


### 6.3 Intrinsic Reward Calculation

The intrinsic reward is computed as the prediction error between the TargetModel and PredictorModel. States with high prediction error (unseen states) receive higher intrinsic rewards, encouraging exploration.
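
A minimal sketch of this computation is shown below. It is not the project's `calculate_int_rewards()`: the function name `intrinsic_reward`, the tiny linear stand-in networks, and the omission of observation normalization are all assumptions made to keep the example short.

```python
import torch
import torch.nn as nn


def intrinsic_reward(target_model, predictor_model, next_obs):
    """Per-sample squared prediction error, averaged over the feature dimension."""
    with torch.no_grad():  # rewards are treated as data, so no gradients are needed
        target = target_model(next_obs)
        pred = predictor_model(next_obs)
        return (pred - target).pow(2).mean(dim=1)


# Tiny stand-in networks (hypothetical; the project uses convolutional models)
torch.manual_seed(0)
target = nn.Linear(8, 16)
predictor = nn.Linear(8, 16)
batch = torch.randn(4, 8)

rewards = intrinsic_reward(target, predictor, batch)
print(rewards.shape)  # torch.Size([4]): one reward per observation
```

The reward is zero wherever the predictor matches the target exactly, and grows with the squared distance between the two feature vectors.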


### 6.4 RND Loss Calculation

The RND loss is computed during training to update the PredictorModel. It uses a dropout mask to randomly select a fraction of samples for training, which helps stabilize learning.


In [None]:
# RND Loss Calculation Implementation
def calculate_rnd_loss(self, obs):
    # Normalize observations
    norm_obs = np.clip(
        (obs.cpu().numpy() - self.state_rms.mean) / (self.state_rms.var**0.5), -5, 5
    ).astype(np.float32)
    norm_obs = torch.tensor(norm_obs).to(self.device)

    # Get target features (fixed random network)
    with torch.no_grad():
        target = self.target_model(norm_obs)

    # Get predicted features (trainable network)
    pred = self.predictor_model(norm_obs)

    # Compute squared error between predicted and target features
    loss = torch.mean((pred - target) ** 2, dim=1)

    # Apply dropout mask using predictor_proportion
    # Each sample is kept with probability predictor_proportion
    mask = (torch.rand_like(loss) < self.config["predictor_proportion"]).float()

    # Average the loss over the selected samples only
    # (clamp avoids division by zero if no sample is selected)
    final_loss = (loss * mask).sum() / mask.sum().clamp(min=1.0)

    return final_loss


### 6.5 Key Implementation Details

**Architecture Design:**
- **TargetModel**: Fixed random network with 3 conv layers (32→64→128 channels) + FC layer (512 features)
- **PredictorModel**: Same conv architecture + 2 additional FC layers (512→512→512) for learning
- **Weight Initialization**: Orthogonal initialization with gain=√2 for most layers, gain=√0.01 for the PredictorModel's final layer

**Intrinsic Reward Mechanism:**
- Computes MSE between TargetModel and PredictorModel outputs
- Higher prediction error = higher intrinsic reward = more exploration
- Uses state normalization for stable training

**Training Strategy:**
- Dropout mask randomly selects 25% of samples for predictor training
- Combines extrinsic and intrinsic rewards in advantage calculation
- PPO updates both policy and predictor networks simultaneously
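
The way the two reward streams can be combined is sketched below in plain Python. This is an illustration under stated assumptions rather than the project's code: the helper names `gae` and `combine_advantages`, the coefficients, and the discount factors are illustrative, and treating the intrinsic stream as non-episodic (all `dones` set to 0) is a design choice assumed here.

```python
def gae(rewards, values, next_value, dones, gamma, lam):
    """Generalized Advantage Estimation over one trajectory (lists of floats)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        # One-step TD error, with bootstrapping cut at episode boundaries
        delta = rewards[t] + gamma * v_next * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


def combine_advantages(ext_adv, int_adv, ext_coeff=2.0, int_coeff=1.0):
    """Weighted sum of extrinsic and intrinsic advantages (coefficients are assumed)."""
    return [ext_coeff * e + int_coeff * i for e, i in zip(ext_adv, int_adv)]


# Extrinsic stream: sparse reward at the end of an episode
ext_adv = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], 0.0, [0.0, 0.0, 1.0], gamma=0.999, lam=0.95)
# Intrinsic stream: dense exploration bonus, no episode boundary
int_adv = gae([0.5, 0.5, 0.5], [0.0, 0.0, 0.0], 0.0, [0.0, 0.0, 0.0], gamma=0.99, lam=0.95)
total_adv = combine_advantages(ext_adv, int_adv)
```

Each stream gets its own value estimates and discount factor, and only the final advantages are mixed before the PPO policy update.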



In [None]:
!pip install -r requirements.txt


## 7. Student Instructions (Updated)

> **All TODO sections have been completed!** The following files now contain the full implementation:
- `Core/ppo_rnd_agent.py` - Complete intrinsic reward and RND loss implementation
- `Core/model.py` - Complete TargetModel and PredictorModel architectures

> **What was implemented:**
1. **TargetModel**: Fixed random network with 3 conv layers + FC layer (512 features)
2. **PredictorModel**: Same conv architecture + 2 additional FC layers for learning
3. **Intrinsic Rewards**: MSE between target and predictor outputs
4. **RND Loss**: Training loss with dropout mask for predictor updates

You can now proceed to train the agent with the complete implementation!




## 8. Train the Agent

Now that all implementations are complete, let's train the RND agent from scratch!

In [None]:
!python main.py --train_from_scratch


## 9. Visualize Training Logs

Launch TensorBoard to monitor your training progress and analyze the RND agent's performance.



In [None]:
# Start Tensorboard
%load_ext tensorboard
%tensorboard --logdir Logs


## 10. Summary

**Congratulations!** You have successfully implemented the complete Random Network Distillation (RND) algorithm with PPO. 

**Key Components Implemented:**
1. ✅ **TargetModel**: Fixed random neural network for generating target features
2. ✅ **PredictorModel**: Trainable network that learns to predict target features  
3. ✅ **Intrinsic Rewards**: Prediction error-based exploration bonuses
4. ✅ **RND Loss**: Training objective with dropout masking for stability

**How RND Works:**
- The TargetModel generates random features for each state
- The PredictorModel learns to predict these features for seen states
- Unseen states have high prediction error → high intrinsic reward → more exploration
- This encourages the agent to visit novel states and improve exploration

**Expected Results:**
- The agent should show improved exploration in the MiniGrid environment
- Intrinsic rewards should decrease over time as the agent explores more states
- The agent should learn to solve the environment more efficiently than standard PPO

Good luck with your training! 🚀