<div align="center">
    <img src="https://www.sharif.ir/documents/20124/0/logo-fa-IR.png/4d9b72bc-494b-ed5a-d3bb-e7dfd319aec8?t=1609608338755" alt="Logo" width="200">
    <p><b> Reinforcement Learning Course, Dr. Rohban</b></p>
</div>


*Full Name:*

*Student Number:*

# Random Network Distillation (RND) with PPO - Homework Project

  

---

  


## 1. Introduction: Random Network Distillation (RND)

A common exploration strategy is to seek out states where the prediction error of some quantity is large, for instance the TD error or the output of a random function.  
The RND algorithm ([Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894)) encourages exploration by rewarding the policy for taking transitions where the prediction error of a randomly initialized neural network is high.

Formally, let $f^*_\theta(s')$ be a randomly chosen vector-valued function represented by a neural network.  
RND trains another neural network, $\hat{f}_\phi(s')$, to match the predictions of $f^*_\theta(s')$ under the distribution of datapoints in the buffer, as shown below:

$$
\phi^* = \arg\min_\phi \mathbb{E}_{s,a,s'\sim\mathcal{D}} \left[ \left\| \hat{f}_\phi(s') - f^*_\theta(s') \right\| \right]
$$

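Define the prediction error as $\mathcal{E}_\phi(s') = \left\| \hat{f}_\phi(s') - f^*_\theta(s') \right\|$. If a transition $(s, a, s')$ is in the distribution of the data buffer, $\mathcal{E}_\phi(s')$ is expected to be small.  
On the other hand, for unseen state-action tuples it is expected to be large, which makes it usable as an exploration bonus.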
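
The behavior described above can be illustrated with a toy NumPy sketch. It is not the project's implementation: both networks are reduced to linear maps, the sizes and learning rate are arbitrary, the squared norm is used as the error, and all names (`W_target`, `W_pred`, `prediction_error`) are assumptions made for this example. The buffer only contains states that vary along the first two coordinates, so the predictor never learns anything about the remaining directions.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM = 4, 8  # toy sizes (assumption, for illustration only)

# Target f*_theta: a fixed random linear map that is never updated.
W_target = rng.normal(size=(STATE_DIM, FEAT_DIM))
# Predictor f_hat_phi: same shape, trained to imitate the target.
W_pred = np.zeros((STATE_DIM, FEAT_DIM))


def prediction_error(states, w_pred):
    """Per-state squared error ||f_hat(s') - f*(s')||^2."""
    diff = states @ w_pred - states @ W_target
    return np.sum(diff ** 2, axis=1)


# States in the "buffer" vary only along the first two coordinates.
seen = np.zeros((256, STATE_DIM))
seen[:, :2] = rng.normal(size=(256, 2))
# Novel states vary along the last two coordinates instead.
unseen = np.zeros((256, STATE_DIM))
unseen[:, 2:] = rng.normal(size=(256, 2))

# Distillation: gradient descent on the mean squared error over the buffer.
for _ in range(300):
    residual = seen @ W_pred - seen @ W_target   # shape (256, FEAT_DIM)
    grad = 2.0 * seen.T @ residual / len(seen)   # gradient w.r.t. W_pred
    W_pred -= 0.1 * grad

seen_error = prediction_error(seen, W_pred).mean()
unseen_error = prediction_error(unseen, W_pred).mean()
print(f"mean error on seen states:   {seen_error:.6f}")
print(f"mean error on unseen states: {unseen_error:.6f}")
```

After training, the error on buffer states is close to zero while the error on the novel states stays large, which is exactly the signal RND turns into an intrinsic reward.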

In practice, RND uses two critics:
- an exploitation critic $Q_R(s,a)$, which estimates returns based on the true rewards,
- and an exploration critic $Q_E(s,a)$, which estimates returns based on the exploration bonuses.
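
One common way to use two critics is to compute a separate advantage estimate from each and feed their weighted sum to the policy update. The sketch below assumes that scheme; the function name `combine_advantages` and the default coefficients are placeholders chosen for illustration, not values taken from this project's `config.py`.

```python
def combine_advantages(ext_advs, int_advs, ext_coeff=2.0, int_coeff=1.0):
    """Element-wise weighted sum of exploitation and exploration advantages."""
    if len(ext_advs) != len(int_advs):
        raise ValueError("advantage lists must have the same length")
    return [ext_coeff * e + int_coeff * i for e, i in zip(ext_advs, int_advs)]


ext_advs = [0.0, 0.0, 1.0]   # sparse task reward: signal only near the goal
int_advs = [0.5, 0.25, 0.0]  # novelty bonus: signal in rarely visited states
combined = combine_advantages(ext_advs, int_advs)
print(combined)  # [0.5, 0.25, 2.0]
```

Keeping the two streams separate until this point lets each critic use its own discount factor and reward scale.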

To stabilize training, prediction errors are normalized before being used.
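
A minimal sketch of such normalization is shown below, assuming the errors are divided by a running estimate of their standard deviation. The class name `RunningMeanStd` and its interface are assumptions for this example and may differ from the helper in `Common/utils.py`. Statistics are merged batch by batch with the standard parallel-variance formula.

```python
import numpy as np


class RunningMeanStd:
    """Tracks the mean and variance of a stream of values, one batch at a time."""

    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 0

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        b_mean, b_var, b_count = batch.mean(), batch.var(), batch.size
        total = self.count + b_count
        delta = b_mean - self.mean
        # Merge the old statistics with the batch statistics.
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.mean += delta * b_count / total
        self.var = m2 / total
        self.count = total


rms = RunningMeanStd()
rms.update([1.0, 2.0, 3.0])
rms.update([4.0, 5.0, 6.0, 7.0])

errors = np.array([2.0, 4.0])
# Scale only (no mean subtraction), so the bonuses stay non-negative.
normalized = errors / np.sqrt(rms.var + 1e-8)
print(rms.mean, rms.var, normalized)
```

Dividing by a running scale keeps the size of the bonus roughly constant over training, even though raw prediction errors shrink as the predictor improves.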

---

## 2. What You Will Implement

  

You will implement the missing core components of Random Network Distillation (RND) combined with a Proximal Policy Optimization (PPO) agent inside the MiniGrid environment.

Specifically, you will:

- Complete the architecture of `TargetModel` and `PredictorModel`.
- Complete the initialization of weights for these models.
- Implement the intrinsic reward calculation (prediction error).
- Implement the RND loss calculation.

You will complete the TODO sections inside two main files:

    Core/ppo_rnd_agent.py
    Core/model.py

  

---

  

## 3. Project Structure


```
RND_PPO_Project/
 ├── main.py               # Main training loop and evaluation
 ├── requirements.txt      # Python dependencies
 ├── Core/
 │    ├── ppo_rnd_agent.py # Agent logic (policy + RND + training)
 │    └── model.py         # Model architectures (policy, predictor, target)
 └── Common/
      ├── config.py        # Hyperparameters and argument parsing
      ├── utils.py         # Utilities (normalization, helper functions)
      ├── logger.py        # TensorBoard logger
      └── play.py          # Evaluation / Play script
```



---

## 4. Modules Explanation

| Module        | Description |
|---------------|-------------|
| `ppo_rnd_agent.py`    | **Core agent logic.** This file contains the PPO algorithm implementation and also handles the RND intrinsic reward mechanism. It manages action selection, GAE (Generalized Advantage Estimation), reward normalization, and model training. <br>➡️ You will modify this file to implement the intrinsic reward and RND loss functions. |
| `model.py`    | **Neural network architectures.** This defines the structure of the policy network (used for action selection) and the two RND networks, Target and Predictor. The policy network processes observations and outputs value estimates and a policy distribution; the RND networks map observations to feature vectors. <br>➡️ You will define the structure of the `TargetModel` and `PredictorModel` classes here and implement proper initialization. |
| `utils.py`    | **Support utilities.** This includes helper functions like setting random seeds for reproducibility, maintaining running mean and variance for normalization, and a few decorators. It helps the rest of the codebase stay clean and modular. |
| `config.py`   | **Experiment settings.** It defines all training hyperparameters (learning rate, batch size, gamma, etc.) and parses command-line flags such as `--train_from_scratch` or `--do_test`. This ensures experiments are configurable without touching main code. |
| `logger.py`   | **Logging training metrics.** Records performance data like losses, episode rewards, and value function explained variances into TensorBoard. This helps you visually inspect whether the agent is learning or not. |
| `play.py`     | **Evaluation module.** This file runs a trained agent in the environment without further learning. It resets the environment, feeds observations through the trained policy, and executes actions until the episode terminates. |
| `runner.py`   | **Parallel environment interaction.** Runs a Gym environment in a separate process using `torch.multiprocessing`. It communicates with the main process to exchange observations and actions, enabling parallel experience collection. Supports episode reset and optional rendering. |
| `main.py`     | **Project entry point.** Orchestrates the full experiment — sets up environment, models, logger, and executes training or testing depending on the flag. This is where everything comes together. |
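
Since the agent module relies on GAE (Generalized Advantage Estimation), a short reference sketch of the backward recursion may help when reading that code. The function name `compute_gae`, its signature, and the default `gamma` and `lam` values are assumptions for this example, not the project's actual API.

```python
def compute_gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory (backward pass)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap from the next state's value, or from next_value at the end.
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - float(dones[t])  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * v_next * not_done - values[t]  # TD residual
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages


# With gamma = lam = 1 and zero values, advantages equal the rewards-to-go.
advs = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                   [False, False, True], gamma=1.0, lam=1.0)
print(advs)  # [3.0, 2.0, 1.0]
```

Setting `lam=0` reduces each advantage to the one-step TD residual, while `lam=1` recovers the Monte Carlo return minus the value baseline.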

---

## 5. TODO Parts (Your Tasks)

You must complete the following parts:

| File | TODO Description |
| :--- | :--- |
| `Core/model.py` | Implement the architecture of `TargetModel` and `PredictorModel`. |
| `Core/model.py` | Implement the `_init_weights()` method for proper initialization. |
| `Core/ppo_rnd_agent.py` | Implement `calculate_int_rewards()` to compute intrinsic rewards. |
| `Core/ppo_rnd_agent.py` | Implement `calculate_rnd_loss()` to compute predictor training loss. |


---

# Setup Code
Before getting started, we need to run some boilerplate code to set up our environment. You'll need to rerun this setup code each time you start the notebook.

First, run this cell to load the [autoreload](https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html?highlight=autoreload) extension. This allows us to edit `.py` source files and re-import them into the notebook for a seamless editing and debugging experience.

In [None]:
%load_ext autoreload
%autoreload 2

#### In the following cell, you will mount your Google Drive and point the notebook to the project folder. This applies if you are using Google Colab, which is the preferred environment.

In [None]:
# ----------------------------
# 1. Mount Google Drive
# ----------------------------
from google.colab import drive
drive.mount('/content/drive')

# ----------------------------
# 2. Go to the project directory
# ----------------------------
import os

# TODO: Fill in the Google Drive path where you uploaded the assignment
# Example: If you create a 2020FA folder and put all the files under A1 folder, then '2020FA/A1'
# GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = '2020FA/A1'
GOOGLE_DRIVE_PATH_AFTER_MYDRIVE =
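# Build an absolute path from the mount point used above so the cell can be rerun safely.
GOOGLE_DRIVE_PATH = os.path.join('/content', 'drive', 'My Drive', GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)
print(os.listdir(GOOGLE_DRIVE_PATH))
# Switch into the project folder so later cells can use relative paths
# (requirements.txt, main.py, Logs).
os.chdir(GOOGLE_DRIVE_PATH)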


## 1. Install dependencies


In [None]:
!pip install -r requirements.txt


## 2. Student Instructions (Reminder)

> Please open and edit the following files:
- `Core/ppo_rnd_agent.py`
- `Core/model.py`

> Specifically, look for `TODO` markers in the code and complete the necessary parts.

After you have filled in the missing parts, you can proceed to train the agent.




## 3. Train the agent from scratch:

Now that you've completed the TODOs, let's train your agent!
This will launch the main script with training from scratch.

In [None]:
!python main.py --train_from_scratch


## 4. Visualize Logs
Launch TensorBoard to monitor your training logs.



In [None]:
# Start TensorBoard
%load_ext tensorboard
%tensorboard --logdir Logs


# End of Notebook
# Good Luck :)