# Deep Reinforcement Learning Assignment (Graded): The Mountain Car Problem

Welcome to your programming assignment on Deep Reinforcement Learning! You will build an RL model to solve the classic Mountain Car problem.

## Problem Description

- The Mountain Car is a classic control problem that elegantly demonstrates the core concepts of reinforcement learning. 

- In this environment, an underpowered car must drive up a steep hill. 

- The challenge lies in the fact that the car's engine is not powerful enough to climb the hill directly from a standing start.

- To solve this problem, you have to build a reinforcement learning model.

**Key Characteristics:**

- The car must learn to build momentum by swinging back and forth in the valley

- The environment provides continuous state variables (position and velocity)

- The agent has three discrete actions available: accelerate left, right, or neutral

## Dataset/Environment Specifications

**State Space:**
- Two continuous variables:
  - Position: Range (-1.2, 0.6)
  - Velocity: Range (-0.07, 0.07)[2]

**Action Space:**
- Three discrete actions:
  - 0: Accelerate left
  - 1: No acceleration
  - 2: Accelerate right[1]

**Reward Structure:**
- -1 reward for each time step
- Episode terminates when either:
  - The car reaches the goal position (≥ 0.5)
  - Maximum steps are reached (200 for MountainCar-v0)[2]
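
To make the "underpowered car" idea concrete, here is a minimal sketch of the car's dynamics in plain Python. It is a simplified re-implementation for illustration, not the Gym environment itself: the constants (`FORCE`, `GRAVITY`, the bounds, and the 200-step limit) are the values documented for MountainCar-v0 and are assumed here rather than read from the library, and `step`, `rollout`, `always_right`, and `follow_velocity` are hypothetical names. The start position of -0.5 is also an assumption (the real environment samples it at random).

```python
import math

# Constants as documented for Gym's MountainCar-v0 (assumed, not read from the library).
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE, GRAVITY = 0.001, 0.0025
MAX_STEPS = 200


def step(position, velocity, action):
    """Advance the simplified dynamics by one time step (action in {0, 1, 2})."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity


def rollout(policy, start_position=-0.5):
    """Run one episode; return (steps taken, highest position reached)."""
    position, velocity = start_position, 0.0
    best = position
    for t in range(1, MAX_STEPS + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        best = max(best, position)
        if position >= GOAL_POSITION:
            return t, best
    return MAX_STEPS, best


def always_right(position, velocity):
    return 2  # push right on every step


def follow_velocity(position, velocity):
    return 2 if velocity >= 0 else 0  # push in the direction of travel


print("always right:   ", rollout(always_right))
print("follow velocity:", rollout(follow_velocity))
```

Under these assumptions, pushing right on every step stalls partway up the hill, while pushing in the direction of travel builds enough momentum to reach the goal. The DQN agent in this assignment has to discover a strategy like the second one from reward alone.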

## Assignment Tasks

1. **Neural Network Architecture Implementation:**
   - Build a sequential model with three layers: two hidden layers with 64 neurons each and ReLU activation, and an output layer with linear activation.
   - Compile the model using the Adam optimizer and mean squared error (MSE) loss.

2. **Experience Replay Implementation:**
   - Store experience tuples in a deque, and during replay, sample a minibatch, compute target Q-values using Double DQN, and update the main network via gradient descent.

3. **Action Selection (Epsilon-Greedy):**
   - Implement epsilon-greedy strategy: return a random action with probability epsilon or the action with the highest Q-value from the model otherwise.

4. **Target Network Update:**
   - Perform a soft update of the target network by blending weights from the main network and the target network based on a parameter τ.

5. **Custom Reward Function:**
   - Compute reward based on the car’s position and velocity, reward goal achievement (+100), penalize failure (-10), and clip rewards if necessary.

6. **Training Loop:**
   - Create an episode-based loop, logging progress, updating epsilon, and saving/loading models, along with tracking performance metrics like rewards and success rates.
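
Tasks 2 and 4 mention Double DQN targets and a soft target-network update. The sketch below illustrates both ideas with NumPy arrays only, so it runs without TensorFlow. The function names and the default `tau=0.01` are assumptions made for illustration. With a Keras model, the weight lists would typically come from `model.get_weights()` and be written back with `model.set_weights()`.

```python
import numpy as np


def double_dqn_targets(rewards, dones, q_next_main, q_next_target, gamma=0.99):
    """Double DQN targets for a minibatch.

    q_next_main and q_next_target have shape (batch, n_actions) and hold the
    main and target networks' Q-values for the next states.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    # The main network chooses the next action ...
    best_actions = np.argmax(q_next_main, axis=1)
    # ... and the target network evaluates that choice.
    rows = np.arange(len(best_actions))
    next_values = np.asarray(q_next_target, dtype=np.float64)[rows, best_actions]
    # Terminal transitions (done) get no bootstrap term.
    return rewards + gamma * not_done * next_values


def soft_update(main_weights, target_weights, tau=0.01):
    """Blend each weight array: tau * main + (1 - tau) * target."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(main_weights, target_weights)]
```

Separating action selection from action evaluation is what distinguishes Double DQN from the plain DQN target, which takes the maximum over a single network's next-state Q-values.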


## Instructions

- Only write code where you see one of the prompts below:

    ```
    # YOUR CODE GOES HERE
    # YOUR CODE ENDS HERE
    # TODO
    ```

- Do not modify any other section of the code unless stated otherwise in the comments.

- Use a venv of Python 3.9.6 to solve this assignment.

- Install all the packages from requirements.txt so that you don't face any compatibility issues.

# Code Section

In [None]:
import gym
import numpy as np
import tensorflow as tf
from collections import deque
import random
import matplotlib.pyplot as plt
import imageio
from helpers.methods import detect_and_set_device, plot_training_metrics, save_frames_as_gif
from tests.test_methods import test_dqn_agent, test_compute_reward, test_run_episode, test_update_best_episode, test_train_agent

## Task: Initializing the OpenAI Gym Environment

**Task Hints:**

Complete the initialize_environment method.

* Create and configure a Gym environment using the specified environment name and render mode.

* Extract the state space dimensions from the environment's observation space.

* Determine the number of possible actions from the environment's action space.

* Return the initialized environment along with state and action dimensions.


In [None]:
def initialize_environment(env_name='MountainCar-v0', render_mode='rgb_array'):
    # YOUR CODE GOES HERE
    
    # Create the environment with the specified name and render mode
    env = 
    
    # Get the state size from the environment observation space by taking the shape of the observation space
    state_size = 
    
    # Get the action size from the environment action space by taking the number of actions in the action space
    action_size = 
    return env, state_size, action_size

## Task: Implementing a DQN Agent Class

**Task Hints:**

Complete the DQNAgent class implementation.

* Initialize the agent with state and action dimensions, setting up:
  * Experience replay memory using deque with max length 5000
  * Hyperparameters (gamma, epsilon, epsilon decay)
  * Neural network model
  * History tracking for various metrics

* Build a neural network model that:
  * Takes state size as input
  * Has two hidden layers of 64 units with ReLU activation
  * Outputs Q-values for each action
  * Uses Adam optimizer and MSE loss

* Implement core DQN methods:
  * Memory storage for experience replay
  * Action selection with epsilon-greedy policy
  * Training through replay with batch sampling
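
As a rough illustration of the epsilon-greedy policy and the multiplicative epsilon decay listed above, here is a NumPy-only sketch. `epsilon_greedy` and `steps_until_min` are hypothetical helpers, not part of the `DQNAgent` class, and the Q-values are passed in directly instead of being predicted by a model.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def steps_until_min(epsilon=1.0, epsilon_min=0.01, decay=0.995):
    """Count multiplicative decay steps until epsilon is no longer above epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= decay
        steps += 1
    return steps


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # epsilon = 0 is always greedy
print(steps_until_min())
```

With the hyperparameters from this assignment (start 1.0, minimum 0.01, decay 0.995), epsilon reaches its floor after roughly 900 decay steps. Because the decay in this notebook is applied inside `replay`, which can run on every environment step, that may correspond to only a handful of 200-step episodes.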


In [None]:
# Define the DQNAgent class
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=5000)
        
        # Define the hyperparameters
        # YOUR CODE GOES HERE
        # Define the discount factor to be used in the Bellman equation for updating the Q-values of the agent to be 0.99
        self.gamma =
        # Define the exploration rate of the agent to be 1.0
        self.epsilon =
        # Define the minimum exploration rate of the agent to be 0.01
        self.epsilon_min =
        # Define the decay rate of the exploration rate of the agent to be 0.995
        self.epsilon_decay =
        self.model = self._build_model()
        
        # Initialize histories for plotting
        self.rewards_history = []
        self.epsilon_history = []
        self.loss_history = []
        self.position_history = []
        self.velocity_history = []
        self.action_history = []
        
    # Define the neural network model
    def _build_model(self):
        # YOUR CODE GOES HERE
        # Define the model to be a Sequential model
        # Add an Input layer with the shape of the state size for the input
        # Add 2 Dense layers with 64 units and relu activation and a Dense layer with the action size and linear activation
        model = 
        
        # Compile the model with the Adam optimizer with learning rate 0.001 and mean squared error loss
        
        
        return model
    
    # Define a method to remember the state, action, reward, next state, and done flag
    def remember(self, state, action, reward, next_state, done):
        # YOUR CODE GOES HERE
        # Append the tuple (state, action, reward, next state, done) to the memory
        
    # Define a method to act based on the state provided
    def act(self, state):
        # YOUR CODE GOES HERE
        # If a random number is less than the epsilon value, return a random action
        # Otherwise, return the action with the highest Q-value predicted by the model

    # Define a method to train the agent based on a batch size of experiences
    def replay(self, batch_size):
        # YOUR CODE GOES HERE
        # If the memory is less than the batch size, return nothing
        if len(self.memory) < batch_size:
            return
        
        # Sample a minibatch of experiences from the memory with the specified batch size using random.sample
        minibatch = 
        
        # Get the states from the minibatch of experiences and stack them into an array in the states variable
        states = 
        
        # Get the actions from the minibatch of experiences and store them in the actions variable
        actions = 
        
        # Get the rewards from the minibatch of experiences and store them in the rewards variable
        rewards = 
        
        # Get the next states from the minibatch of experiences and store them in the next_states variable
        next_states = 
        
        # Get the dones from the minibatch of experiences and store them in the dones variable
        dones = 

        # Compute the Bellman targets and store them in the targets variable: the reward plus gamma times the maximum predicted Q-value of the next state, with that second term zeroed when done is True
        targets = 
        
        # Predict the current Q-values of the states using the model and store them in the target_f variable
        target_f = 
        
        # Overwrite the Q-value of each action taken in the minibatch with its computed target, leaving the Q-values of the other actions unchanged
        target_f[np.arange(len(actions)), actions] = targets

        # Fit the model on the states and target Q-values for one epoch with verbose=0, and store the returned History object in the history variable
        history = 
        
        # Append the loss from history.history['loss'] to the agent's loss history
        

        # If the epsilon value is greater than the minimum epsilon value, decay the epsilon value by the decay rate value
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
            
dummy_state_size = 2
dummy_action_size = 3
agent = DQNAgent(dummy_state_size, dummy_action_size)
test_dqn_agent(agent, dummy_state_size, dummy_action_size)
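
The target construction described in the `replay` comments can be sketched with NumPy alone. This is an illustration under assumptions, not the graded solution: `bellman_update` is a hypothetical helper, and the Q-value arrays are passed in directly where the agent would obtain them from `model.predict`.

```python
import numpy as np


def bellman_update(q_current, q_next, actions, rewards, dones, gamma=0.99):
    """Return a copy of q_current with the taken actions set to their Bellman targets."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    # Target: r + gamma * max_a' Q(s', a'), with no bootstrap term for terminal states.
    targets = rewards + gamma * not_done * np.max(q_next, axis=1)
    updated = np.array(q_current, dtype=np.float64)  # copy, so the input is untouched
    updated[np.arange(len(actions)), actions] = targets
    return updated


q_current = np.zeros((2, 3))
q_next = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(bellman_update(q_current, q_next, actions=[0, 2], rewards=[1.0, -1.0],
                     dones=[False, True], gamma=0.5))
```

Only the entry for the action actually taken changes in each row, so fitting the network on the updated array produces a nonzero error for that action alone.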

## Task: Implementing a Custom Reward Function

**Task Hints:**

Complete the compute_reward method for the Mountain Car environment.

* Design a reward function that encourages:
  * Reaching higher positions (position component)
  * Maintaining momentum (velocity component)
  * Successfully reaching the goal position (≥ 0.5)
  * Avoiding episode termination before reaching goal
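
One way to picture how these components combine is the sketch below. It is an assumption-laden illustration, not the formula the tests expect: `shaped_reward` is a hypothetical name, and the height term and the default weights are one plausible choice. Follow the comments in the cell below for the exact graded formula.

```python
def shaped_reward(next_state, done, height_weight=10.0, velocity_weight=5.0,
                  goal_position=0.5, goal_bonus=100.0, failure_penalty=10.0):
    """Combine height, speed, goal, and failure terms into one scalar reward."""
    position, velocity = next_state[0], next_state[1]
    # Hypothetical height term: zero at position -0.5 (near the valley floor), growing toward the goal.
    reward = height_weight * (position + 0.5)
    # Speed in either direction is rewarded, since swinging left also builds momentum.
    reward += velocity_weight * abs(velocity)
    if position >= goal_position:
        reward += goal_bonus
    if done and position < goal_position:
        reward -= failure_penalty
    return reward


print(shaped_reward((0.5, 0.0), done=True))
print(shaped_reward((-0.5, 0.02), done=True))
```

Rewarding the absolute value of the velocity matters here: a term that only rewarded rightward motion would discourage the backward swing the car needs.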

In [None]:
# Define the compute_reward function
def compute_reward(state, next_state, done):
    # YOUR CODE GOES HERE
    # Get the position from the next state by taking the first element of the next state
    position = 
    
    # Get the velocity from the next state by taking the second element of the next state
    velocity = 
    
    # Compute the reward based on the position by adding the position multiplied by 0.5 and 10 to the reward
    reward =   # Height reward
    
    # Compute the reward based on the velocity by adding the absolute value of the velocity multiplied by 5 to the reward
    reward +=    # Velocity reward
    
    # If the position is greater than or equal to 0.5, add 100 to the reward
    if position >= 0.5:
        
    # If the episode is done and the position is less than 0.5, subtract 10 from the reward
    if done and position < 0.5:
        
    return reward

test_compute_reward(compute_reward)

## Task: Implementing Episode Runner for DQN Training

**Task Hints:**

Complete the run_episode method to handle a single training episode.

* Initialize episode:
  * Reset environment and extract initial state
  * Set up tracking for rewards and frames
  * Handle both old and new Gym API formats

* Execute episode loop:
  * Capture environment renders for visualization
  * Get agent's action using epsilon-greedy policy
  * Execute action and handle environment step
  * Compute custom reward
  * Store experience in agent's memory
  * Perform training if enough samples available
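
Handling both Gym API formats comes down to checking the shape of what `reset()` and `step()` return. The helpers below are a minimal sketch with hypothetical names (`unpack_reset`, `unpack_step`). They operate on plain tuples, so they can be tried without creating an environment.

```python
def unpack_reset(reset_result):
    """Return the initial observation from either reset() format."""
    # Newer API: (observation, info). Older API: the observation alone.
    return reset_result[0] if isinstance(reset_result, tuple) else reset_result


def unpack_step(step_result):
    """Return (next_state, reward, done, info) from either step() format."""
    if len(step_result) == 5:
        # Newer API: the episode is over if it terminated or was truncated.
        next_state, reward, terminated, truncated, info = step_result
        done = terminated or truncated
    else:
        # Older API: a single done flag.
        next_state, reward, done, info = step_result
    return next_state, reward, bool(done), info


print(unpack_reset(([-0.5, 0.0], {})))
print(unpack_step(([-0.49, 0.01], -1.0, False, True, {})))
```

Treating truncation (the step limit) as `done` keeps the episode loop from running forever when the car never reaches the goal.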

In [None]:
def run_episode(env, agent, batch_size, compute_reward):
    # YOUR CODE GOES HERE
    # Reset the environment and get the initial state
    reset_result = 
    
    # Get the state from the reset result by taking the first element of the reset result if the reset result is a tuple, otherwise take the reset result
    state = 
    total_reward = 0
    episode_frames = []
    
    # Loop until the episode is done
    while True:
        # Render the environment and get the frame
        frame = 
        
        # If the frame is not None and the frame has 3 dimensions, append the frame to the episode frames
        if frame is not None and frame.ndim == 3:  # Append only valid RGB frames
            
        
        # Get the action from the agent by calling the act method of the agent with the state as input
        action = 
        
        # Take a step in the environment with the action and get the step result
        step_result = 
        
        # If the step result has 5 elements, get the next state, reward, terminated, truncated, and _ from the step result, and set done to terminated or truncated
        # Otherwise, get the next state, reward, done, and _ from the step result
        if len(step_result) == 5:
            
        else:
            
        # Compute the reward using the state, next state, and done flag by calling the compute_reward function with the state, next state, and done flag as inputs
        reward = 
        
        # Remember the state, action, reward, next state, and done flag by calling the remember method of the agent with the state, action, reward, next state, and done flag as inputs
        
        
        # Set the state to the next state
        
        
        # Add the reward to the total reward
        

        # Train the agent with the batch size by calling the replay method of the agent with the batch size as input if the memory of the agent is greater than the batch size
        
            
        # If the episode is done, break the loop
        if done:
            break

    return total_reward, episode_frames


test_run_episode(run_episode)

## Task: Updating Best Episode Tracker

**Task Hints:**

Complete the update_best_episode method to track the best performing episode.

* Compare the current episode's total reward with the best reward seen so far
* If current reward is better:
  * Update the best reward value
  * Store a copy of the current episode's frames
* Otherwise:
  * Maintain existing best reward and frames

In [None]:
# update_best_episode function that takes the total reward, best reward, and episode frames as input and returns the best reward and episode frames
def update_best_episode(total_reward, best_reward, episode_frames):
    
    # YOUR CODE GOES HERE
    # If the total reward is greater than the best reward, return the total reward and a copy of the episode frames
    # Otherwise, return the best reward and the episode frames



test_update_best_episode(update_best_episode)

In [None]:
# DO NOT MODIFY THIS FUNCTION: log_progress
def log_progress(episode, total_reward, epsilon):
    if episode % 10 == 0:
        print(f"Episode: {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.3f}")

## Task: Implementing the DQN Training Loop


**Task Hints:**

Complete the train_agent method to manage the full training process.

* Initialize tracking variables:
  * List for storing episode rewards
  * List for tracking epsilon values
  * Storage for best episode frames
  * Variable for best reward achieved

* Execute training loop:
  * Run episodes using run_episode function
  * Record episode rewards and epsilon values
  * Update best episode information
  * Log training progress
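
Per-episode rewards are usually noisy, so a moving average can make the trend in the recorded rewards easier to read. The helper below is a small sketch. `moving_average` and its `window` parameter are hypothetical and are not part of `helpers.methods`.

```python
import numpy as np


def moving_average(values, window):
    """Average each run of `window` consecutive values; empty if there are too few."""
    values = np.asarray(values, dtype=np.float64)
    if window < 1 or len(values) < window:
        return np.array([])
    return np.convolve(values, np.ones(window) / window, mode="valid")


print(moving_average([1, 2, 3, 4, 5], window=2))
```

The result has `len(values) - window + 1` entries, so it is shorter than the rewards list it was computed from.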

In [None]:
# Define the train_agent function
def train_agent(env, agent, episodes, batch_size, compute_reward):
    rewards_history = []
    epsilon_history = []
    best_frames = []
    best_reward = float('-inf')
    
    # Loop through the episodes
    for episode in range(episodes):
        # Run an episode with the environment, agent, batch size, and compute reward function and get the total reward and episode frames
        total_reward, episode_frames = 
        
        # Append the total reward to the rewards history of the agent
        
        
        # Append the epsilon value to the epsilon history of the agent
        
        # Update the best reward and best frames using the total reward, best reward, and episode frames by calling the update_best_episode function
        best_reward, best_frames = 
        
        # Log the progress of the episode using the log_progress function
        log_progress(episode, total_reward, agent.epsilon)

    return rewards_history, epsilon_history, best_frames


test_train_agent(train_agent)

## Saving Results and Main Function

In [None]:
# DO NOT MODIFY THIS FUNCTION
import os

def save_results(agent, rewards_history, epsilon_history, best_frames, gif_name='best_episode.gif'):
    # Check if the output directory exists, if not create it
    output_dir = './output'
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"Output directory '{output_dir}' created.")
    
    # Save the training metrics plot
    plot_training_metrics(agent, rewards_history, epsilon_history)
    # Save the plot in the output directory
    plot_file_path = os.path.join(output_dir, 'training_metrics.png')
    plt.savefig(plot_file_path)
    print(f"Training metrics plot saved at {plot_file_path}")

    try:
        # Save the frames as a GIF in the specified output directory
        gif_path = os.path.join(output_dir, gif_name)
        save_frames_as_gif(best_frames, gif_path)
        print(f"Best episode GIF saved at {gif_path}")
    except Exception as e:
        print(f"Error saving GIF: {e}")
        if best_frames:
            print(f"Frame shape: {np.array(best_frames[0]).shape}")

In [None]:
# DO NOT MODIFY THIS FUNCTION
def main():
    env, state_size, action_size = initialize_environment()
    agent = DQNAgent(state_size, action_size)
    episodes = 10
    batch_size = 32

    rewards_history, epsilon_history, best_frames = train_agent(
        env, agent, episodes, batch_size, compute_reward
    )
    save_results(agent, rewards_history, epsilon_history, best_frames)
    env.close()


In [None]:
main()

## Visualize the results

In [None]:
# DO NOT MODIFY THIS FUNCTION - Helper function to display the saved training metrics plot and best episode GIF
import matplotlib.pyplot as plt
from IPython.display import Image, display

def display_saved_results(training_plot_path='./output/training_metrics.png', gif_path='./output/best_episode.gif'):
    """Display the saved training metrics plot and best episode GIF."""
    try:
        # Display the training metrics plot
        print(f"Displaying training metrics plot from: {training_plot_path}")
        img = plt.imread(training_plot_path)
        plt.imshow(img)
        plt.axis('off')  # Hide axes for image
        plt.show()

        # Display the best episode GIF
        print(f"Displaying best episode GIF from: {gif_path}")
        display(Image(filename=gif_path))
        
    except Exception as e:
        print(f"Error displaying results: {e}")
        
display_saved_results(training_plot_path='./output/training_metrics.png', gif_path='./output/best_episode.gif')
