# 🧠 Deep Q-Learning using PyTorch on CartPole

First, install compatible versions of gym and numpy:

In [None]:
!pip install gym==0.26.2 numpy==1.23.5



## 🔹 Step 1: Import Required Libraries
We begin by importing the libraries used throughout: gym, NumPy, PyTorch, and the standard-library helpers random and deque.

In [None]:
import gym
import random
import numpy as np
from collections import deque
import torch
import torch.nn as nn
import torch.optim as optim

---

## 🔹 Step 2: Define the Q-Network
This is a simple neural network with three layers.

It takes the state as input and outputs Q-values for each action.

In [None]:
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


🧠 **Explanation:**

**state_size:** Number of features in the input state.

**action_size:** Number of possible actions.

Fully connected layers with **ReLU** activation.

Final layer outputs Q-values (no activation).
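
As a quick sanity check, the sketch below passes a dummy batch through the network and inspects the output shape. It assumes CartPole's 4-dimensional state and 2 actions (values the notebook reads from the environment in a later step), and it repeats the class definition only so the snippet runs on its own.

```python
import torch
import torch.nn as nn


class DQN(nn.Module):
    # Same architecture as above, repeated so this snippet is self-contained
    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


# Assumed sizes for CartPole: 4 state features, 2 actions
net = DQN(state_size=4, action_size=2)
dummy_states = torch.zeros(3, 4)        # a batch of 3 placeholder states
q_values = net(dummy_states)

print(q_values.shape)                   # one row per state, one column per action
best_actions = q_values.argmax(dim=1)   # greedy action index for each state
```

Each row of the output holds one unbounded Q-value per action, which is why the last layer has no activation.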

---

## 🔹 Step 3: Defining Hyperparameters

These parameters control the learning process: exploration vs exploitation, learning rate, memory, etc.

- **gamma:** Discount factor for future rewards; values closer to 1 put more weight on long-term reward.
- **epsilon:** Probability of taking a random action (exploration) instead of the greedy one (exploitation).
- **batch_size:** Number of experiences sampled for each training step.
- **memory_size:** Maximum size of the experience replay buffer.

In [None]:
# Hyperparameters
gamma = 0.99             # Discount factor
epsilon = 1.0            # Initial exploration rate
epsilon_min = 0.01       # Minimum epsilon
epsilon_decay = 0.995    # Decay rate
learning_rate = 0.001
batch_size = 64
memory_size = 10000
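
To get a feel for the decay schedule, the sketch below works out how many per-episode decays it takes for epsilon to reach its floor, and what epsilon is left after 500 episodes (the run length used in Step 9). It assumes epsilon is multiplied by epsilon_decay exactly once per episode, as in the training loop.

```python
import math

epsilon_start = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995

# Smallest n with epsilon_start * epsilon_decay**n <= epsilon_min
episodes_to_min = math.ceil(
    math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay)
)

# Epsilon remaining after 500 per-episode decays
epsilon_after_500 = epsilon_start * epsilon_decay ** 500

print(episodes_to_min, round(epsilon_after_500, 3))
```

With these values the floor is reached only after roughly 900 episodes, so a 500-episode run ends with epsilon still around 0.08. A smaller epsilon_decay would reach the floor sooner.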

---

## 🔹 Step 4: Initialize Replay Buffer

The agent stores past experiences in memory and samples from it during training to break correlation in data.

- Stores past experiences: (state, action, reward, next_state, done).
- A deque with maxlen automatically discards the oldest experiences once the limit is reached.

In [None]:
memory = deque(maxlen=memory_size)

🧠 **Each memory entry is:**

(state, action, reward, next_state, done)
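
The eviction behaviour is easy to see with a tiny buffer. The sketch below uses a capacity of 3 and placeholder transitions (integers standing in for real states), then samples a mini-batch the same way the replay step does.

```python
import random
from collections import deque

buffer = deque(maxlen=3)   # tiny capacity so eviction is easy to see
for step in range(5):
    # Placeholder transition: (state, action, reward, next_state, done)
    buffer.append((step, 0, 1.0, step + 1, False))

# Only the three most recent transitions remain; the oldest two were dropped
stored_states = [transition[0] for transition in buffer]
print(stored_states)

# Uniform sampling without replacement, as used for mini-batches
batch = random.sample(buffer, 2)
```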

---

## 🔹 Step 5: Initialize Environment

We use the CartPole environment.


In [None]:
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

---

## 🔹 Step 6: Initialize Networks and Optimizer

We use two networks: a policy network (for selecting actions) and a target network (for stable learning).

- **policy_net:** Trained actively and used to choose actions.
- **target_net:** Provides stable target Q-values (updated less frequently).
- **Adam:** Adaptive optimizer to update weights.
- **MSELoss:** Compares predicted Q-values to target Q-values.


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy_net = DQN(state_size, action_size).to(device)
target_net = DQN(state_size, action_size).to(device)
target_net.load_state_dict(policy_net.state_dict())  # Synchronize initially
target_net.eval()  # Target network is not trained directly

optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)
loss_fn = nn.MSELoss()


🧠 **Target Network** provides stable Q-value targets and is updated less frequently.
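
The sketch below illustrates why the copy matters. It uses a single nn.Linear layer as a stand-in for the DQN and a made-up loss, purely to show that the target network stays frozen while the policy network is updated, and matches it again only after the next sync.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-ins for policy_net and target_net (a single layer keeps the demo small)
policy = nn.Linear(4, 2)
target = nn.Linear(4, 2)
target.load_state_dict(policy.state_dict())   # initial sync

x = torch.randn(1, 4)
same_before = torch.allclose(policy(x), target(x))

# One gradient step on the policy network only (the loss is arbitrary)
opt = optim.SGD(policy.parameters(), lr=0.1)
loss = policy(x).pow(2).sum()
opt.zero_grad()
loss.backward()
opt.step()

# The target network did not move, so the two outputs now differ
diverged = not torch.allclose(policy(x), target(x))

# Syncing again copies the updated weights across
target.load_state_dict(policy.state_dict())
same_after = torch.allclose(policy(x), target(x))
```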

---

## 🔹 Step 7: Define Action Selection Function (ε-greedy)
- The agent chooses actions using an ε-greedy exploration-exploitation strategy.
- With probability epsilon, it explores by picking a random action.
- Otherwise, it exploits its learned policy by selecting the action with the maximum predicted Q-value.



In [None]:
def get_action(state, epsilon):
    # With probability epsilon, explore (random action)
    if random.random() < epsilon:
        return random.choice(range(action_size))
    else:
        # Otherwise, exploit learned policy (choose best action)
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        with torch.no_grad():
            q_values = policy_net(state)
        return q_values.argmax().item()


🧠 **Key Points:**

- Encourages exploration early in training.

- As epsilon decays, the agent exploits more often.

- unsqueeze(0) adds a batch dimension to the input state.
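
The same rule can be tried without a network. The sketch below is a framework-free illustration using a hypothetical helper, `epsilon_greedy`, and made-up Q-values; it is not the notebook's function, only the same decision rule applied to a plain list.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values (hypothetical helper)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit


rng = random.Random(0)
q_values = [0.1, 0.7]   # made-up Q-values; action 1 is the greedy choice

# epsilon = 0: always the greedy action
greedy_picks = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]

# epsilon = 0.5: half the picks are random, so action 0 shows up about 25% of the time
mixed_picks = [epsilon_greedy(q_values, 0.5, rng) for _ in range(1000)]
share_of_action_0 = mixed_picks.count(0) / len(mixed_picks)
```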

---

## 🔹 Step 8: Define the Replay Function (Mini-Batch Training)

This function randomly samples a batch of experiences from memory and trains the network using the Bellman equation.

- Use the Bellman equation to compute target Q-values.
- Minimize the MSE loss between predicted and target Q-values.
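
Written out, the target and loss that the replay function appears to implement are given here. The symbols are notation chosen for this note rather than names from the notebook: $d$ is the done flag, $\theta$ the policy-network weights, $\theta^-$ the target-network weights, and $B$ the sampled mini-batch.

$$
y = r + \gamma \,(1 - d)\, \max_{a'} Q_{\theta^-}(s', a'), \qquad
\mathcal{L}(\theta) = \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} \bigl( Q_{\theta}(s, a) - y \bigr)^2
$$

When $d = 1$ the second term vanishes and the target reduces to the immediate reward $r$.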

In [None]:
def replay():
    if len(memory) < batch_size:
        return  # Not enough experiences to sample

    minibatch = random.sample(memory, batch_size)

    states, actions, rewards, next_states, dones = zip(*minibatch)

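    # Stack states into one NumPy array first; building a tensor from a
    # tuple of arrays is slow and triggers a PyTorch UserWarning
    states = torch.FloatTensor(np.array(states, dtype=np.float32)).to(device)
    actions = torch.LongTensor(actions).unsqueeze(1).to(device)
    rewards = torch.FloatTensor(rewards).unsqueeze(1).to(device)
    next_states = torch.FloatTensor(np.array(next_states, dtype=np.float32)).to(device)
    dones = torch.FloatTensor(dones).unsqueeze(1).to(device)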

    # Current Q-values (for the selected actions)
    current_q = policy_net(states).gather(1, actions)

    # Compute the target Q-values using the target network
    next_q = target_net(next_states).max(1)[0].detach().unsqueeze(1)
    target_q = rewards + (gamma * next_q * (1 - dones))

    # Compute loss between current Q and target Q
    loss = loss_fn(current_q, target_q)

    # Optimize the network
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


🧠 **Explanation:**

- **gather(1, actions)** extracts Q-values for taken actions.

- **next_q** uses the max Q-value from the target network.

- **dones** ensures no future Q-values are added after terminal states.

- **Gradient descent** (here via the Adam optimizer) minimizes the MSE loss.
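
A tiny worked example makes the tensor operations concrete. The numbers below are made up: three transitions, two actions, a reward of 1 each, and the last transition marked terminal.

```python
import torch

gamma = 0.99

# Made-up policy-network outputs Q(s, .) for 3 states and 2 actions
q_all = torch.tensor([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0]])
actions = torch.tensor([[1], [0], [1]])   # action taken in each state
current_q = q_all.gather(1, actions)      # picks 2.0, 3.0 and 6.0

# Made-up target-network outputs for the next states
next_q_all = torch.tensor([[0.5, 1.5],
                           [2.0, 1.0],
                           [9.0, 9.0]])
next_q = next_q_all.max(1)[0].unsqueeze(1)   # 1.5, 2.0 and 9.0

rewards = torch.ones(3, 1)
dones = torch.tensor([[0.0], [0.0], [1.0]])  # third transition is terminal

# Bellman target; the (1 - dones) mask removes the future term for row 3
target_q = rewards + gamma * next_q * (1 - dones)
print(target_q)   # 2.485, 2.98 and 1.0
```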


---

## 🔹 Step 9: Training the Agent

This loop trains the agent over multiple episodes. In each episode the agent collects experience, updates the network through replay, and decays the exploration rate; the target network is synced periodically.

- Calls **get_action()** to pick actions and **replay()** to train.
- Updates the target network every target_update_freq episodes for stability.
- Decays epsilon to reduce exploration over time.

In [None]:
episodes = 500
target_update_freq = 10  # how often to update the target network

for episode in range(episodes):
    # Reset environment and get initial state
    reset_result = env.reset()
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    total_reward = 0

    for t in range(500):  # Max steps per episode
        # Choose an action using epsilon-greedy
        action = get_action(state, epsilon)

        # Perform the action
        step_result = env.step(action)

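        # Handle gym versions returning 4 or 5 values
        if len(step_result) == 5:
            next_state, reward, terminated, truncated, _ = step_result
            done = terminated or truncated
        else:
            next_state, reward, done, _ = step_result
            terminated = done  # older API does not separate time-limit cutoffs

        # Store experience; only true termination (not a time-limit cutoff)
        # should zero out the bootstrapped future value in replay()
        memory.append((state, action, reward, next_state, terminated))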

        # Move to the next state
        state = next_state
        total_reward += reward

        # Train on sampled mini-batch
        replay()

        if done:
            break

    # Decay epsilon (exploration rate)
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

    # Update target network
    if episode % target_update_freq == 0:
        target_net.load_state_dict(policy_net.state_dict())

    print(f"Episode {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.3f}")


Episode 0, Total Reward: 13.0, Epsilon: 0.995
Episode 1, Total Reward: 10.0, Epsilon: 0.990
Episode 2, Total Reward: 29.0, Epsilon: 0.985
Episode 3, Total Reward: 48.0, Epsilon: 0.980
Episode 4, Total Reward: 20.0, Epsilon: 0.975
Episode 5, Total Reward: 16.0, Epsilon: 0.970
Episode 6, Total Reward: 12.0, Epsilon: 0.966
Episode 7, Total Reward: 50.0, Epsilon: 0.961
Episode 8, Total Reward: 40.0, Epsilon: 0.956
Episode 9, Total Reward: 26.0, Epsilon: 0.951
Episode 10, Total Reward: 14.0, Epsilon: 0.946
Episode 11, Total Reward: 24.0, Epsilon: 0.942
Episode 12, Total Reward: 26.0, Epsilon: 0.937
Episode 13, Total Reward: 16.0, Epsilon: 0.932
Episode 14, Total Reward: 27.0, Epsilon: 0.928
Episode 15, Total Reward: 12.0, Epsilon: 0.923
Episode 16, Total Reward: 14.0, Epsilon: 0.918
Episode 17, Total Reward: 26.0, Epsilon: 0.914
Episode 18, Total Reward: 18.0, Epsilon: 0.909
Episode 19, Total Reward: 15.0, Epsilon: 0.905
Episode 20, Total Reward: 19.0, Epsilon: 0.900
Episode 21, Total Reward: 32.0, Epsilon: 0.896
Episode 22, Total Reward: 34.0, Epsilon: 0.891
Episode 23, Total Reward: 18.0, Epsilon: 0.887
Episode 24, Total Reward: 27.0, Epsilon: 0.882
Episode 25, Total R

🧠 **Key Ideas:**

- **Exploration–Exploitation tradeoff:** early episodes explore more, later episodes exploit.

- **Target network sync** helps stabilize training.

- **Replay buffer** breaks correlation in sequential data.

---

## 🔹 Step 10: Save and Load the Trained DQN Model

### ✅ Saving the Policy Network After Training
You can add this after the training loop:

In [None]:
# Save the trained model weights
torch.save(policy_net.state_dict(), "dqn_cartpole.pth")
print("✅ Model saved to dqn_cartpole.pth")

### ✅ Loading the Trained Model Later
To use the saved model again (e.g., for evaluation or testing), reload it like this:

In [None]:
# Reinitialize the model architecture (must match)
loaded_model = DQN(state_size, action_size).to(device)
loaded_model.load_state_dict(torch.load("dqn_cartpole.pth", map_location=device))
loaded_model.eval()  # Set the model to evaluation mode

print("✅ Model loaded from dqn_cartpole.pth")


Note: running the load cell before the save cell raises `FileNotFoundError: [Errno 2] No such file or directory: 'dqn_cartpole.pth'`, so save the model first.

🧠 **Tips:**

- Always match architecture exactly when loading.

- Call .eval() to put layers such as dropout and batchnorm into inference mode (not needed here, but a good habit).

- You can wrap this in a condition to load if the file exists.
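
One way to write that condition is sketched below. The helper name `load_if_exists` is hypothetical, and a single nn.Linear layer stands in for the DQN so the snippet runs on its own; a temporary directory is used so no file is left behind.

```python
import os
import tempfile

import torch
import torch.nn as nn


def load_if_exists(model, path, device="cpu"):
    """Load weights into `model` if `path` exists; return True if loaded (hypothetical helper)."""
    if not os.path.exists(path):
        return False
    model.load_state_dict(torch.load(path, map_location=device))
    model.eval()
    return True


# Stand-in model; in the notebook this would be DQN(state_size, action_size)
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dqn_cartpole.pth")
    loaded_before_save = load_if_exists(model, path)   # file is not there yet
    torch.save(model.state_dict(), path)
    loaded_after_save = load_if_exists(model, path)    # file now exists
```

Returning a flag instead of raising lets the caller decide whether to fall back to training from scratch.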