# Launch 3 - Deep Q-Network (DQN)

This phase generalizes Q-Learning by replacing the table with a neural network.

## Objectives
- Introduce function approximation with Q_theta(s, a)
- Understand the role of replay buffer and target network
- Train a DQN agent on CartPole-v1 with Stable-Baselines3

## Chronological positioning
Launch 1: random baseline -> Launch 2: tabular RL -> Launch 3: deep RL.


## Scientific Justification (Launch 3)

### Why DQN in Launch 3, and why move to PPO afterward?
DQN is used here to introduce **deep function approximation** with replay and target networks. It is a key transitional step from tabular methods to deep RL.

For the mission phase, we move to PPO because:
- PPO optimizes the policy directly and is typically more stable to train than value-based DQN
- On-policy updates learn from freshly collected rollouts, which avoids the stale data a replay buffer can hold
- The clipped objective limits destructive policy jumps between updates
- It is widely adopted on control-style benchmark tasks
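
To make the clipped-objective point concrete, here is a minimal sketch of the per-sample PPO surrogate term, min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy and A is the advantage estimate. The function name and the eps = 0.2 default are illustrative choices, not values taken from this notebook.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: pushing the ratio past 1 + eps earns no extra objective,
# so a single update has no incentive to move the policy further.
print(clipped_surrogate(ratio=1.5, advantage=2.0))   # capped at 1.2 * 2.0
print(clipped_surrogate(ratio=1.1, advantage=2.0))   # inside the clip range: 1.1 * 2.0

# Negative advantage: the cap applies on the low side of the ratio instead.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))  # capped at 0.8 * -1.0
```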

### Why LunarLander for the mission trajectory?
LunarLander is selected as a structured RL benchmark with:
- Nonlinear dynamics
- Reward shaping already aligned with landing behavior
- Interpretable state/action interfaces
- Good reproducibility for controlled comparisons

It is not a real spacecraft simulator, but a useful **guidance-and-landing proxy**.

### Reward, variance, and measurement discipline
In this phase we explicitly track:
- Mean reward (central performance)
- Reward standard deviation (stability)
- Repeated evaluation episodes for robustness

This prepares mission-level reporting with smoothed trends and confidence-aware interpretation.
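
As one possible reading of "confidence-aware", the helper below summarizes a list of per-episode returns with the mean, the sample standard deviation, and an approximate 95% interval on the mean (normal approximation). The function name, the 1.96 factor, and the example returns are assumptions made for illustration; they are not results from this notebook.

```python
import math
import statistics


def summarize_returns(returns: list[float]) -> dict[str, float]:
    """Mean, sample std, and an approximate 95% interval on the mean."""
    mean = statistics.fmean(returns)
    # Sample standard deviation needs at least two episodes
    std = statistics.stdev(returns) if len(returns) > 1 else 0.0
    # Standard error of the mean shrinks with the number of episodes
    half_width = 1.96 * std / math.sqrt(len(returns))
    return {
        "mean": mean,
        "std": std,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


example_returns = [500.0, 480.0, 500.0, 455.0, 500.0]  # made-up values
print(summarize_returns(example_returns))
```

With Stable-Baselines3, per-episode returns can be obtained by calling `evaluate_policy(..., return_episode_rewards=True)`, which returns the list of episode rewards and episode lengths instead of the aggregated mean and standard deviation.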

### Known limitations and Sim2Real gap
Current setup still omits:
- Sensor noise models
- Actuator saturation nonlinearities
- Delays, faults, and uncertainty envelopes
- Realistic environment disturbances

A credible Sim2Real path would include domain randomization, robustness stress testing, safety constraints, and transfer validation before any operational claim.
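
A rough sketch of what domain randomization and sensor noise could look like is given below. The helper names and sampling ranges are hypothetical. The keyword names (`gravity`, `enable_wind`, `wind_power`, `turbulence_power`) follow the constructor arguments documented for Gymnasium's LunarLander and should be checked against the installed version before use.

```python
import random


def sample_lander_kwargs(rng: random.Random) -> dict:
    """Draw one randomized set of environment settings (ranges are illustrative)."""
    return {
        "gravity": rng.uniform(-11.0, -9.0),
        "enable_wind": True,
        "wind_power": rng.uniform(0.0, 15.0),
        "turbulence_power": rng.uniform(0.0, 1.5),
    }


def add_sensor_noise(obs: list[float], noise_std: float, rng: random.Random) -> list[float]:
    """Return a copy of the observation with zero-mean Gaussian noise on each component."""
    return [x + rng.gauss(0.0, noise_std) for x in obs]


rng = random.Random(0)
for episode in range(3):
    kwargs = sample_lander_kwargs(rng)
    # Assumed usage: env = gym.make("LunarLander-v3", **kwargs)
    print(kwargs)

print(add_sensor_noise([0.0, 1.0, -0.5], noise_std=0.05, rng=rng))
```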


### A - Value network and replay memory

Two components are central to DQN:
1. **Q-network** to estimate action values in place of the Q-table
2. **Replay buffer** to reduce temporal correlation between transitions

Sampling mini-batches from the buffer averages gradients over decorrelated transitions, which gives lower-variance updates than learning from each transition in sequence.
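
The replay buffer described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the buffer Stable-Baselines3 uses internally; the `Transition` and `ReplayBuffer` names are chosen for this example only.

```python
import random
from collections import deque
from typing import NamedTuple, Optional


class Transition(NamedTuple):
    state: tuple
    action: int
    reward: float
    next_state: tuple
    done: bool


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity: int, seed: Optional[int] = None):
        self._storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int) -> list[Transition]:
        # Uniform sampling without replacement breaks the temporal ordering.
        # Copying to a list is O(n); acceptable for a sketch, not for large buffers.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(Transition((float(t),), 0, 1.0, (float(t + 1),), False))

batch = buffer.sample(2)
print(len(buffer), [tr.state for tr in batch])
```

Only the three most recent transitions survive in this example, because the capacity is smaller than the number of transitions added.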


### B - DQN training loop (conceptual view)

The loop alternates:
- Action selection
- Transition collection
- Mini-batch updates
- Periodic target network synchronization

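Keeping the target network fixed between synchronizations prevents the bootstrap targets from shifting with every gradient step, and replay sampling decorrelates the updates. Together they mitigate the classical instability of deep Q-Learning.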
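
The steps listed above can be sketched as three small helpers: epsilon-greedy action selection, the one-step target computed from the target network's values, and a periodic synchronization check. The function names and the gamma default are illustrative assumptions, and Q-values are passed in as plain lists rather than produced by a real network.

```python
import random


def epsilon_greedy(q_values: list[float], epsilon: float, rng: random.Random) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward: float, next_q_target: list[float], done: bool, gamma: float = 0.99) -> float:
    """One-step target y = r + gamma * max_a' Q_target(s', a'); no bootstrap at terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def should_sync_target(step: int, target_update_interval: int) -> bool:
    """True on the steps where the online weights are copied into the target network."""
    return step % target_update_interval == 0


rng = random.Random(0)
action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.1, rng=rng)
y = td_target(reward=1.0, next_q_target=[0.5, 2.0], done=False, gamma=0.9)
print(action, y, should_sync_target(step=1000, target_update_interval=500))
```

The online network is then regressed toward `y` for the action actually taken, while `next_q_target` comes from the frozen copy.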


### C - Training with Stable-Baselines3

We rely on SB3 for robust experimentation:
- Algorithm configuration
- Optimization process
- Model persistence and reload

The scientific workflow remains the same: define, train, evaluate, interpret.


In [None]:
import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

# No render_mode: rendering is not needed during training and would only slow it down
train_env = gym.make("CartPole-v1")

# Create the DQN model
model = DQN(
    "MlpPolicy",
    train_env,
    learning_rate=1e-3,
    buffer_size=50_000,
    batch_size=32,
    learning_starts=1000,
    target_update_interval=500,
    verbose=0
)

# Train the agent
model.learn(total_timesteps=50_000)

# Save the model
model.save("dqn_cartpole")

print("Training complete")

# Close training resources once learning is done
train_env.close()

### D - Quantitative evaluation

We report mean reward and standard deviation over independent episodes.

Interpretation:
- High mean = strong central performance
- Low standard deviation = stable behavior


In [None]:
from stable_baselines3.common.monitor import Monitor

eval_env = Monitor(gym.make("CartPole-v1"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20, render=False)
print(f"Average reward : {mean_reward:.2f} +/- {std_reward:.2f}")
eval_env.close()

### E - Qualitative behavior inspection

Visual trajectory analysis complements numerical metrics:
- Verify dynamic consistency
- Detect edge-case behavior
- Prepare transfer to more demanding control environments


In [None]:
import matplotlib.pyplot as plt

render_env = gym.make("CartPole-v1", render_mode="rgb_array")
obs, _ = render_env.reset()
frames = []
done = False

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = render_env.step(action)
    frame = render_env.render()   # returns a NumPy image
    frames.append(frame)
    done = terminated or truncated

render_env.close()

# Show the first frame
plt.imshow(frames[0])
plt.axis("off")
plt.show()

## Launch 3 Conclusion

We validated the transition to deep RL.

### Key outcomes
- Q-table successfully replaced by a neural Q-network
- Reproducible training and evaluation pipeline
- Improved suitability for continuous/high-dimensional states

### Mission transition
The Mission phase applies these principles to LunarLander-v3 with a PPO policy, chosen for the stability reasons given in the scientific justification above.
