# Task
Build a Deep Q-Learning model using PyTorch to solve the Lunar Lander environment and generate a video of the trained agent.

## Install necessary libraries

### Subtask:
Install PyTorch, Gymnasium (for the LunarLander-v2 environment), and other required libraries.


**Reasoning**:
Install the necessary libraries using pip.



In [1]:
!pip install torch torchvision gymnasium gymnasium[box2d] moviepy tqdm

Collecting box2d-py==2.3.5 (from gymnasium[box2d])
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Collecting swig==4.* (from gymnasium[box2d])
  Downloading swig-4.3.1.post0-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (3.5 kB)
Downloading swig-4.3.1.post0-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.9 MB)
Building wheels for collected packages: box2d-py
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pi

**Reasoning**:
The installation of box2d-py failed. Attempt to install it separately with a specific version and potentially resolve build issues.



In [2]:
!pip install box2d-py==2.3.5 --no-build-isolation

Collecting box2d-py==2.3.5
  Using cached box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: box2d-py
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for box2d-py (setup.py) ... error
  ERROR: Failed building wheel for box2d-py
  Running setup.py clean for box2d-py
Failed to build box2d-py
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (box2d-py)

**Reasoning**:
Attempt to install swig and then box2d-py again as suggested by the instructions.



In [3]:
!sudo apt-get update
!sudo apt-get install swig -y
!pip install box2d-py==2.3.5 --no-build-isolation

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:3 https://cli.github.com/packages stable InRelease
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]

## Set up the environment

### Subtask:
Create the LunarLander-v2 environment using Gymnasium.


**Reasoning**:
Import the gymnasium library and create the LunarLander-v2 environment.



**Reasoning**:
The LunarLander-v2 environment is deprecated. Create the LunarLander-v3 environment instead.



## How to Create a README File

A README file is typically a plain text file named `README.md` (using Markdown format) located in the root directory of your project. It serves as the first point of contact for anyone looking at your code.

Here's a suggested structure for your README:

1.  **Title:** The name of your project.
2.  **Description:** A brief explanation of what your project does and its purpose. For this project, you can describe that it's a Deep Q-Learning model for the Lunar Lander environment.
3.  **Installation:** Instructions on how to install the necessary dependencies. You can list the libraries you installed (`torch`, `gymnasium`, `gymnasium[box2d]`, `moviepy`, `tqdm`) and mention the `pip` commands used.
4.  **Usage:** How to run your code. You can explain the different sections of your notebook: setting up the environment, defining the network and agent, training, evaluating, and generating the video. You can also mention how to load the saved model checkpoint.
5.  **Results:** Briefly summarize the training results (e.g., average score achieved, number of episodes to solve the environment) and mention the generated video.
6.  **Contributing (Optional):** How others can contribute to your project.
7.  **License (Optional):** The license under which your project is released.

You can create this file in your project directory using a text editor and write the content using Markdown syntax. Markdown is a lightweight markup language that allows you to format plain text (like using `#` for headings, `*` or `-` for lists, etc.).
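
The structure above can also be generated programmatically. The following is a minimal sketch, not a required step: `build_readme` is a hypothetical helper, the project title is made up for illustration, and the section bodies only restate details from this notebook.

```python
def build_readme(title, sections):
    """Render a title and ordered (heading, body) pairs as Markdown text."""
    lines = [f"# {title}", ""]
    for heading, body in sections:
        lines += [f"## {heading}", "", body, ""]
    return "\n".join(lines)


readme_text = build_readme(
    "Deep Q-Learning for Lunar Lander",  # hypothetical project title
    [
        ("Description", "A Deep Q-Learning agent for the LunarLander-v3 environment."),
        ("Installation", "`pip install torch torchvision gymnasium gymnasium[box2d] moviepy tqdm`"),
        ("Usage", "Run the notebook cells in order: environment, network, agent, training, evaluation, video."),
        ("Results", "Average score of 200.86 over the last 100 episodes after 770 training episodes."),
    ],
)
print(readme_text)
```

Writing the result to disk would then be a single call such as `pathlib.Path("README.md").write_text(readme_text)`.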

In [5]:
import gymnasium as gym

env = gym.make('LunarLander-v3')

## Define the Q-network

### Subtask:
Implement a neural network using PyTorch that will approximate the Q-values.


**Reasoning**:
Implement the neural network for the Q-Network using PyTorch as described in the instructions.



In [6]:
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Deep neural network for Q-value approximation."""
    def __init__(self, state_size, action_size):
        """Initialize parameters and build network.

        Args:
            state_size (int): Dimension of each state.
            action_size (int): Number of discrete actions (one Q-value output per action).
        """
        super().__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, state):
        """Build a network that maps state -> action values."""
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Example usage (optional, for verification)
# state_size = env.observation_space.shape[0]
# action_size = env.action_space.n
# q_network = QNetwork(state_size, action_size)
# print(q_network)
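
To get a feel for how small this network is, the sketch below counts its trainable parameters in plain Python. It assumes an 8-dimensional observation and 4 discrete actions for Lunar Lander; in the notebook these values come from `env.observation_space.shape[0]` and `env.action_space.n`. The helper name `mlp_parameter_count` is hypothetical.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix of one nn.Linear layer
        total += fan_out           # bias vector of the same layer
    return total


# Assumed sizes: 8 observation values and 4 discrete actions.
STATE_SIZE = 8
ACTION_SIZE = 4
print(mlp_parameter_count([STATE_SIZE, 64, 64, ACTION_SIZE]))  # 4996
```

With PyTorch available, `sum(p.numel() for p in q_network.parameters())` should report the same number for a `QNetwork(8, 4)` instance.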

## Implement the Deep Q-Learning agent

### Subtask:
Create an agent class that includes methods for selecting actions (epsilon-greedy), storing experiences in a replay buffer, and training the Q-Network.


**Reasoning**:
Implement the DQNAgent class with the specified methods for action selection, experience storage, and learning. This involves initializing the Q-networks, replay buffer, and optimizer, as well as defining the `act`, `step`, `learn`, and `soft_update` methods.
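
The quantity that the `learn` method regresses toward is the one-step Q-learning target. As a rough illustration for a single transition, written in plain Python rather than with batched tensors (the function name `td_target` and the sample numbers are made up for this sketch):

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a') * (1 - done)."""
    # A terminal transition has no future return, so nothing is bootstrapped.
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


# Non-terminal step: the best next-state value (2.0) is discounted and added.
print(td_target(1.0, [0.5, 2.0, -1.0, 0.0], done=False))
# Terminal step (for example a crash): the target is just the reward.
print(td_target(-100.0, [0.5, 2.0, -1.0, 0.0], done=True))
```

The agent's implementation computes the same expression for a whole minibatch at once, taking the next-state values from the target network.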



In [7]:
import random
import numpy as np
import torch.optim as optim
from collections import deque, namedtuple

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate
UPDATE_EVERY = 4        # how often to update the network

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class DQNAgent:
    """Interacts with and learns from the environment."""

    def __init__(self, state_size, action_size, seed, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, gamma=GAMMA, tau=TAU, lr=LR, update_every=UPDATE_EVERY):
        """Initialize an Agent object.

        Args:
            state_size (int): dimension of each state
            action_size (int): number of discrete actions
            seed (int): random seed
            buffer_size (int): replay buffer size
            batch_size (int): minibatch size
            gamma (float): discount factor
            tau (float): for soft update of target parameters
            lr (float): learning rate
            update_every (int): how often to update the network
        """
        self.state_size = state_size
        self.action_size = action_size
        random.seed(seed)  # seed Python's RNG; random.seed() itself returns None
        self.seed = seed
        self.buffer_size = buffer_size
        self.batch_size = batch_size
        self.gamma = gamma
        self.tau = tau
        self.lr = lr
        self.update_every = update_every

        # Q-Network
        self.qnetwork_local = QNetwork(state_size, action_size).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=self.lr)

        # Replay memory
        self.memory = ReplayBuffer(buffer_size, batch_size, seed)
        # Initialize time step (for updating every 'update_every' steps)
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        """Save experience in replay memory, and use random sample from buffer to learn."""
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps
        self.t_step = (self.t_step + 1) % self.update_every
        if self.t_step == 0:
            # If enough samples are available in memory, get random subset and learn
            if len(self.memory) > self.batch_size:
                experiences = self.memory.sample()
                self.learn(experiences, self.gamma)

    def act(self, state, eps=0.):
        """Returns actions for given state as per epsilon-greedy policy.

        Args:
            state (np.array): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()

        # Epsilon-greedy action selection
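        if random.random() > eps:
            # Greedy action: index of the largest Q-value, as a plain Python int
            return int(np.argmax(action_values.cpu().numpy()))
        else:
            # Exploratory action: uniform over the discrete action set
            return random.randrange(self.action_size)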

    def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.

        Args:
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # Get max predicted Q values (for next states) from target model
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Compute Q targets for current states
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Get expected Q values from local model
        Q_expecteds = self.qnetwork_local(states).gather(1, actions)

        # Compute loss
        loss = torch.nn.functional.mse_loss(Q_expecteds, Q_targets)
        # Minimize the loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Soft update target network
        self.soft_update(self.qnetwork_local, self.qnetwork_target, self.tau)

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target

        Args:
            local_model (torch.nn.Module): weights will be copied from
            target_model (torch.nn.Module): weights will be copied to
            tau (float): interpolation parameter
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.

        Args:
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        random.seed(seed)  # seed Python's RNG; random.seed() itself returns None
        self.seed = seed

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

# Example usage (optional, for verification)
# state_size = env.observation_space.shape[0]
# action_size = env.action_space.n
# agent = DQNAgent(state_size=state_size, action_size=action_size, seed=0)
# print(agent)
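
With `TAU = 1e-3`, the target network trails the local network very slowly. The sketch below applies the soft-update rule θ_target = τ·θ_local + (1 − τ)·θ_target to plain Python lists and estimates how many updates are needed to close half of the initial gap if the local weights stayed fixed (a simplifying assumption; during training they keep moving). The helper names are hypothetical.

```python
import math


def soft_update(target, local, tau):
    """Return tau * local + (1 - tau) * target, element by element."""
    return [tau * l + (1.0 - tau) * t for t, l in zip(target, local)]


def updates_to_close_gap(fraction, tau):
    """Soft updates needed to close `fraction` of the initial gap, local held fixed.

    After n updates the remaining gap is (1 - tau) ** n of the original.
    """
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - tau))


target = [0.0, 0.0]
local = [1.0, -2.0]
target = soft_update(target, local, tau=1e-3)
print(target)  # each weight moved 0.1% of the way toward the local value
print(updates_to_close_gap(0.5, tau=1e-3))  # 693
```

Since `learn` runs once every `UPDATE_EVERY = 4` environment steps, 693 soft updates correspond to roughly 2,800 environment steps.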

## Train the agent

### Subtask:
Implement the training loop, including steps for interacting with the environment, storing experiences, and updating the Q-Network.


**Reasoning**:
Implement the `dqn` function to handle the training loop as described in the instructions, including iterating through episodes and timesteps, interacting with the environment, storing experiences, updating the agent, and tracking scores. After defining the function, call it with the specified parameters and store the returned scores.



In [8]:
from collections import deque
import numpy as np
import torch

def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995, scores_window=100):
    """Deep Q-Learning.

    Args:
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
        scores_window (int): window size for calculating the average score
    """
    scores = []                         # list containing scores from each episode
    scores_window_deque = deque(maxlen=scores_window)  # last 100 scores
    eps = eps_start                     # initialize epsilon

    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    agent = DQNAgent(state_size=state_size, action_size=action_size, seed=0)

    for i_episode in range(1, n_episodes + 1):
        state, _ = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
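            # Gymnasium's step() returns (obs, reward, terminated, truncated, info)
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Store only `terminated` so the TD target still bootstraps when an episode is cut off by the time limit
            agent.step(state, action, reward, next_state, terminated)
            state = next_state
            score += reward
            if terminated or truncated:
                break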
        scores_window_deque.append(score)       # save most recent score
        scores.append(score)                  # save most recent score
        avg_score = np.mean(scores_window_deque)

        print(f'\rEpisode {i_episode}\tAverage Score: {avg_score:.2f}\tEpsilon: {eps:.2f}', end="")
        if i_episode % 100 == 0:
            print(f'\rEpisode {i_episode}\tAverage Score: {avg_score:.2f}\tEpsilon: {eps:.2f}')

        eps = max(eps_end, eps_decay * eps) # decrease epsilon

        if avg_score >= 200.0:
            print(f'\nEnvironment solved in {i_episode:d} episodes!\tAverage Score: {avg_score:.2f}')
            torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')
            break

    return scores

scores = dqn(n_episodes=2000, max_t=1000, eps_end=0.01, eps_decay=0.995)

Episode 100	Average Score: -154.62	Epsilon: 0.61
Episode 200	Average Score: -140.39	Epsilon: 0.37
Episode 300	Average Score: -65.22	Epsilon: 0.22
Episode 400	Average Score: -9.69	Epsilon: 0.14
Episode 500	Average Score: 34.86	Epsilon: 0.08
Episode 600	Average Score: 107.42	Epsilon: 0.05
Episode 700	Average Score: 175.52	Epsilon: 0.03
Episode 770	Average Score: 200.86	Epsilon: 0.02
Environment solved in 770 episodes!	Average Score: 200.86
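
The epsilon values in this log follow directly from the multiplicative decay schedule. A small sketch that reproduces them in closed form, assuming the defaults used above (`eps_start=1.0`, `eps_end=0.01`, `eps_decay=0.995`); the function name `epsilon_at_episode` is hypothetical:

```python
def epsilon_at_episode(i_episode, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Epsilon in effect during episode `i_episode` (1-indexed).

    Epsilon is decayed once at the end of every episode, so episode i has
    seen i - 1 decays; the value never drops below eps_end.
    """
    return max(eps_end, eps_start * eps_decay ** (i_episode - 1))


for episode in (100, 200, 770):
    print(episode, round(epsilon_at_episode(episode), 2))
# 100 0.61
# 200 0.37
# 770 0.02
```

By the time the environment is solved, the agent is therefore still taking a random action on about 2% of its steps.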


## Evaluate the trained agent

### Subtask:
Run the trained agent in the environment for a few episodes to see how it performs.


**Reasoning**:
Load the trained model weights, set the agent to evaluation mode, and run the environment for a few episodes using the trained agent with no exploration.



In [9]:
# Load the trained weights from the checkpoint
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size=state_size, action_size=action_size, seed=0)
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth', map_location=device))

# Set the agent to evaluation mode
agent.qnetwork_local.eval()

# Run the environment for a few episodes
num_episodes_to_run = 5  # You can adjust this number
for i_episode in range(1, num_episodes_to_run + 1):
    state, _ = env.reset()
    score = 0
    while True:
        # Select an action using the agent's act method with eps=0 (no exploration)
        action = agent.act(state, eps=0)
        # Step the environment with the selected action
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Accumulate the reward
        score += reward
        # Update the current state
        state = next_state
        # Stop when the episode terminates or is truncated by the time limit
        if terminated or truncated:
            break
    # Print the total score for the episode
    print(f'Episode {i_episode}\tScore: {score:.2f}')

env.close()

Episode 1	Score: 232.87
Episode 2	Score: 272.17
Episode 3	Score: 252.44
Episode 4	Score: 247.71
Episode 5	Score: 232.86
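
Five episodes is a small sample, so it can help to summarise it explicitly. The sketch below uses only the standard library and the five scores printed above; the 200-point threshold is the same one used as the stopping criterion in the training loop, and `summarize_scores` is a hypothetical helper.

```python
import statistics

# Scores copied from the evaluation output above.
eval_scores = [232.87, 272.17, 252.44, 247.71, 232.86]


def summarize_scores(scores, solved_threshold=200.0):
    """Basic descriptive statistics for a list of episode scores."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
        "all_above_threshold": all(s >= solved_threshold for s in scores),
    }


summary = summarize_scores(eval_scores)
for name, value in summary.items():
    # Booleans are printed as-is; floats are rounded to two decimals.
    print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```

For a tighter estimate, `num_episodes_to_run` could be raised and the resulting scores collected into a list instead of only being printed.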


## Generate video of the trained agent

### Subtask:
Record a video of the trained agent interacting with the environment.


**Reasoning**:
The subtask requires recording a video of the trained agent. This involves creating a new environment instance with a specific render mode, wrapping it with a video recorder, loading the trained agent, running the simulation, and closing the environment. These steps can be combined into a single code block.



In [10]:
import gymnasium as gym
import torch
import os

# 1. Create a new LunarLander-v3 environment instance with render_mode='rgb_array'.
# 2. Wrap the environment with the gymnasium.wrappers.RecordVideo wrapper.
video_dir = 'videos'
os.makedirs(video_dir, exist_ok=True)
eval_env = gym.make('LunarLander-v3', render_mode='rgb_array')
eval_env = gym.wrappers.RecordVideo(eval_env, video_folder=video_dir, episode_trigger=lambda x: x == 0) # Record the first episode

# 3. Load the trained agent's state dictionary from 'checkpoint.pth'.
state_size = eval_env.observation_space.shape[0]
action_size = eval_env.action_space.n
agent = DQNAgent(state_size=state_size, action_size=action_size, seed=0)
agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth', map_location=device))

# 4. Set the agent to evaluation mode.
agent.qnetwork_local.eval()

# 5. Run the wrapped environment for one episode.
state, _ = eval_env.reset()
score = 0
while True:
    action = agent.act(state, eps=0) # Use eps=0 for evaluation
    next_state, reward, terminated, truncated, _ = eval_env.step(action)
    state = next_state
    score += reward
    if terminated or truncated:  # stop on landing/crash or when the time limit is hit
        break

print(f'Episode finished with score: {score:.2f}')

# 6. Close the environment after the episode is finished to ensure the video is saved.
eval_env.close()

Episode finished with score: 295.51
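
Once the environment is closed, the recording should be in the `videos` folder. A minimal sketch for checking that, assuming the wrapper writes `.mp4` files (the exact file name depends on the wrapper's `name_prefix` setting); `list_recorded_videos` is a hypothetical helper:

```python
from pathlib import Path


def list_recorded_videos(video_dir="videos"):
    """Return the .mp4 files in `video_dir`, sorted by name."""
    folder = Path(video_dir)
    if not folder.is_dir():
        return []  # nothing recorded yet, or a different folder was used
    return sorted(folder.glob("*.mp4"))


for path in list_recorded_videos("videos"):
    print(path, path.stat().st_size, "bytes")
```

In a notebook, a file found this way can typically be displayed inline with `IPython.display.Video(str(path), embed=True)`.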



## Summary:

### Data Analysis Key Findings

*   The initial installation of `box2d-py`, a dependency for the Lunar Lander environment, failed due to build errors, which were resolved by installing `swig` beforehand.
*   The `LunarLander-v2` environment is deprecated, so the notebook uses `LunarLander-v3` instead.
*   A `QNetwork` class was successfully implemented in PyTorch for approximating Q-values.
*   A `DQNAgent` class and a `ReplayBuffer` class were implemented to handle action selection (epsilon-greedy), experience storage, and Q-network training.
*   Training stopped after 770 episodes, when the average score over the most recent 100 episodes reached 200.86, above the 200-point threshold used here as the criterion for solving the environment.
*   With exploration disabled (`eps=0`), the trained agent scored between 232.86 and 272.17 in all 5 evaluation episodes.
*   A video of the trained agent's performance on one episode was successfully recorded, showing a score of 295.51.

### Insights or Next Steps

*   The training process was effective in solving the Lunar Lander environment, indicated by the high evaluation scores and the successful recording of a high-scoring episode video.
*   The saved `checkpoint.pth` file can be used to deploy the trained agent or for further analysis and experimentation, such as testing different evaluation scenarios or using the model as a baseline for more advanced reinforcement learning techniques.
