# Mission - PPO Baseline for Descent Control (LunarLander-v3)

This mission is the main space-oriented demonstration: an agent learns to stabilize and land a module on a target pad.

## Operational objective
Build a robust **PPO baseline** that serves as a reference for later improvements (hyperparameters, training budgets, multi-run comparisons).

## Consistency with the launch sequence
- Launch 1: RL loop and random baseline
- Launch 2: tabular learning
- Launch 3: deep value approximation
- Mission: deep RL for practical descent-control behavior


## Scientific justification (mission level)

### Why PPO over DQN / SAC in this context?
PPO is chosen for this mission baseline because it provides a strong stability/performance trade-off for iterative policy improvement:
- **Clipped surrogate objective** to prevent destructive updates
- Robust optimization behavior across seeds
- Practical reproducibility with standard SB3 tooling
- Strong empirical reliability on benchmark control tasks

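DQN is valuable for discrete-action deep RL but is more sensitive to replay-buffer and target-network tuning in this workflow. SAC is powerful, but the SB3 implementation targets continuous action spaces, so it would require the continuous variant of the environment. PPO gives a simpler and more controlled first baseline within the scope of this portfolio.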
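To make the clipped surrogate objective listed above concrete, here is a minimal per-sample sketch in plain Python. The function name `clipped_surrogate` is an illustrative assumption and not part of SB3. SB3 computes the same quantity over batched tensors and minimizes its negative mean.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate term (a quantity to maximize).

    ratio     : pi_new(a|s) / pi_old(a|s)
    advantage : advantage estimate for the sampled action
    """
    # Keep the probability ratio inside [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, large ratio: the gain is capped at (1 + clip_range) * advantage.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
# Negative advantage, large ratio: no clipping, the full penalty is kept.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)
print(capped_gain, full_penalty)
```

The asymmetry is the point: the policy gains nothing from pushing the ratio beyond the clip range in the favorable direction, while moves that make things worse are still fully penalized.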

### Why LunarLander-v3?
LunarLander-v3 is used as a **structured proxy** for descent guidance because it provides:
- Nonlinear coupled dynamics
- Explicit state variables for control reasoning
- Reward components aligned with soft-landing objectives
- Benchmark comparability and reproducibility

### Reward design and shaping interpretation
We interpret reward as a multi-objective signal balancing:
- Touchdown success
- Attitude stability
- Velocity moderation
- Control-effort efficiency
- Crash penalties

The objective is not merely to maximize return numerically, but to encourage physically plausible and repeatable landing behavior.
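One way to read the multi-objective signal above is as the change in a shaping potential between consecutive steps, minus a control-effort cost. The sketch below illustrates that reading. The function names and all weights are assumptions modeled on how the Gymnasium LunarLander reward is commonly described, so they should be checked against the installed environment source. The terminal touchdown bonus and crash penalty are deliberately left out.

```python
import math


def shaping_potential(obs, w_dist=100.0, w_speed=100.0, w_angle=100.0, w_leg=10.0):
    """Potential over an 8D LunarLander-style observation; higher is better.

    All weights are illustrative assumptions, not values confirmed by this notebook.
    """
    x, y, vx, vy, angle, _angular_velocity, left_leg, right_leg = obs
    return (
        -w_dist * math.hypot(x, y)          # distance to the pad
        - w_speed * math.hypot(vx, vy)      # velocity moderation
        - w_angle * abs(angle)              # attitude stability
        + w_leg * (left_leg + right_leg)    # leg contact
    )


def shaped_step_reward(prev_obs, obs, main_power=0.0, side_power=0.0,
                       w_main=0.30, w_side=0.03):
    """Change in potential minus a fuel (control-effort) cost for one step."""
    progress = shaping_potential(obs) - shaping_potential(prev_obs)
    return progress - w_main * main_power - w_side * side_power


# Hypothetical consecutive observations: lower, slower, and more upright.
before = (0.0, 1.0, 0.0, -0.5, 0.10, 0.0, 0.0, 0.0)
after = (0.0, 0.9, 0.0, -0.4, 0.05, 0.0, 0.0, 0.0)
step_reward = shaped_step_reward(before, after, main_power=1.0)
print(round(step_reward, 2))
```

Because the per-step reward is a difference of potentials, the agent is rewarded for progress toward a soft landing rather than for simply being near the pad.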

### Variance management and statistical credibility
Variance is handled through:
- Controlled seeding
- Repeated evaluation episodes
- Moving-average smoothing on dashboard plots
- Dispersion-aware interpretation (standard deviation, approximate confidence intervals)
- Cross-run comparison instead of single-score claims
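As an illustration of the moving-average smoothing mentioned in the list above, here is a minimal sketch of a trailing window over episode returns. The helper name `moving_average` and the sample values are hypothetical placeholders, not results from this notebook.

```python
def moving_average(values, window=10):
    """Trailing moving average; returns one value per complete window."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    values = list(values)
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Placeholder episode returns, used only to show the shape of the output.
raw_returns = [-150.0, -90.0, -120.0, 10.0, -30.0, 80.0, 40.0, 150.0]
smoothed = moving_average(raw_returns, window=4)
print(len(raw_returns), len(smoothed))
print(smoothed)
```

Smoothing makes the trend easier to see, but it hides dispersion. That is why it is paired with standard deviations and cross-run comparison rather than used alone.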

### Simulation limitations (explicit)
This environment does not model the complexity of a real flight stack:
- No high-fidelity aerodynamics/turbulence
- No realistic sensor noise pipeline (IMU/GNSS fusion)
- No actuator latency/saturation/failure logic
- No embedded real-time constraints

Therefore, the results are **algorithmic evidence**, not flight-qualification evidence.

### Sim2Real gap and forward path
A credible transfer strategy would require:
1. Domain randomization (mass, thrust, disturbances, delays)
2. Robustness envelopes and worst-case testing
3. Safety constraints and guarded policy execution
4. Staged transfer to higher-fidelity simulators
5. Hardware-in-the-loop and incremental validation
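The first step of the list above, domain randomization, can be sketched as a sampler that draws a fresh set of environment parameters per training episode or per run. The sketch below covers only disturbances, using the `gravity`, `enable_wind`, `wind_power`, and `turbulence_power` constructor arguments that the Gymnasium LunarLander environment accepts. The ranges are hypothetical choices. Randomizing mass, thrust, or delays would need a custom wrapper that is not shown here.

```python
import random

# Hypothetical sampling ranges (assumptions), kept inside the bounds described
# in the Gymnasium LunarLander documentation.
RANDOMIZATION_RANGES = {
    "gravity": (-11.0, -9.0),
    "wind_power": (0.0, 15.0),
    "turbulence_power": (0.0, 1.5),
}


def sample_env_kwargs(rng):
    """Draw one randomized set of constructor keyword arguments."""
    kwargs = {
        name: rng.uniform(low, high)
        for name, (low, high) in RANDOMIZATION_RANGES.items()
    }
    kwargs["enable_wind"] = True  # wind and turbulence only apply when enabled
    return kwargs


rng = random.Random(0)  # seeded so the sampled scenarios are repeatable
episode_kwargs = [sample_env_kwargs(rng) for _ in range(3)]
for kw in episode_kwargs:
    # Each dict could be passed as gym.make("LunarLander-v3", **kw).
    print(kw)
```

Seeding the sampler keeps the randomized scenarios reproducible, so robustness results can be compared across runs.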

This mission notebook should be read as a rigorous RL baseline and methodology demonstration, not as direct deployability proof.


### A - LunarLander-v3 environment characterization

We inspect observation/action interfaces to formalize the control problem:
- 8D state: position (x, y), linear velocity (vx, vy), angle, angular velocity, and two leg-contact flags
- 4 discrete actions, indexed 0-3: do nothing, fire left orientation engine, fire main engine, fire right orientation engine
- Reward shaped toward stable landing, with penalties for unsafe trajectories

This establishes the physical context and mission constraints.


In [None]:
import gymnasium as gym
import gymnasium.envs.box2d.lunar_lander

env = gym.make("LunarLander-v3")  # Discrete (4 actions)
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

obs, info = env.reset(seed=42)
print("Sample observation:", obs)
print("Sample action:", env.action_space.sample())
env.close()
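To make the printed observation easier to read, the sketch below attaches names to the eight state components and to the four action indices. The label strings and the helper `describe_observation` are assumptions introduced for illustration. The component and action ordering follows the Gymnasium LunarLander documentation.

```python
OBS_LABELS = (
    "x", "y", "vx", "vy",
    "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
)

ACTION_NAMES = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


def describe_observation(obs):
    """Map an 8-element observation to a {label: value} dictionary."""
    if len(obs) != len(OBS_LABELS):
        raise ValueError(f"expected {len(OBS_LABELS)} values, got {len(obs)}")
    return {label: float(value) for label, value in zip(OBS_LABELS, obs)}


# Hypothetical observation, standing in for the array returned by env.reset().
example_obs = [0.0, 1.4, 0.1, -0.3, 0.02, 0.0, 0.0, 0.0]
labeled = describe_observation(example_obs)
print(labeled)
print(ACTION_NAMES[2])
```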

### B - Baseline PPO training

We train a **PPO (MlpPolicy)** agent with logging, periodic evaluation, and best-model checkpointing.

Methodological choices:
- Separate training and evaluation environments, so evaluation episodes do not interfere with training rollouts or the training monitor log
- Continuous monitoring during learning
- Persistent model artifacts for reproducibility

The initial budget (`total_timesteps=300_000`) is a time/cost compromise. It can be increased (e.g., to 1M) and compared rigorously in the Dashboard.
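As a rough guide to what this budget means in practice, the arithmetic can be sketched as follows. It assumes a single environment, SB3's default `n_epochs=10` (the notebook does not set it), and that SB3 collects whole rollouts of `n_steps` until the step count reaches `total_timesteps`. The helper `ppo_budget` is hypothetical.

```python
import math


def ppo_budget(total_timesteps, n_steps, batch_size,
               n_epochs=10, n_envs=1, eval_freq=10_000):
    """Estimate rollouts, gradient steps, and evaluation count for a PPO run."""
    rollout_size = n_steps * n_envs
    # Whole rollouts are collected, so the last one may overshoot the budget.
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches_per_epoch = math.ceil(rollout_size / batch_size)
    return {
        "rollouts": n_rollouts,
        "env_steps_collected": n_rollouts * rollout_size,
        "gradient_steps": n_rollouts * n_epochs * minibatches_per_epoch,
        # The evaluation callback is counted in per-environment steps.
        "evaluations": (n_rollouts * n_steps) // eval_freq,
    }


baseline = ppo_budget(total_timesteps=300_000, n_steps=2048, batch_size=64)
scaled = ppo_budget(total_timesteps=1_000_000, n_steps=2048, batch_size=64)
print(baseline)
print(scaled)
```

Under these assumptions the 300k budget corresponds to 147 policy updates and 30 periodic evaluations, which gives a sense of how coarse the evaluation curve will be.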


In [None]:
import os
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.evaluation import evaluate_policy

# Output directories
log_dir = "runs/lander_baseline"
os.makedirs(log_dir, exist_ok=True)

# Training environment (no render for speed)
train_env = Monitor(gym.make("LunarLander-v3"), filename=os.path.join(log_dir, "monitor.csv"))

# Separate evaluation environment
eval_env = Monitor(gym.make("LunarLander-v3"))

# Evaluation callback: periodic eval + best model save
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path=log_dir,
    log_path=log_dir,
    eval_freq=10_000,                    # evaluate every 10k steps
    n_eval_episodes=10,                  # average over 10 episodes
    deterministic=True,
    render=False
)

# Baseline PPO model
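model = PPO(
    "MlpPolicy",
    train_env,
    verbose=0,
    tensorboard_log=log_dir,
    seed=42,             # fixed seed for controlled, repeatable runs (value is arbitrary)
    learning_rate=3e-4,  # SB3 PPO default
    n_steps=2048,        # rollout length collected before each policy update
    batch_size=64,       # minibatch size for each gradient step
    gae_lambda=0.95,
    gamma=0.99,
    ent_coef=0.005,      # small entropy bonus (SB3 default is 0.0)
    clip_range=0.2
)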

# Training budget (baseline): ~3e5 steps; can scale to 1e6 depending on runtime budget
model.learn(total_timesteps=300_000, callback=eval_callback)

# Save model
model_path = os.path.join(log_dir, "ppo_lander_baseline")
model.save(model_path)
print("Model saved ->", model_path)


### C - Statistical policy evaluation

The model is evaluated on multiple independent episodes to estimate:
- Mean reward (performance level)
- Standard deviation (episode-to-episode consistency)

This defines a quantitative baseline for version-to-version model comparison.
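A minimal sketch of how per-episode returns could be turned into a mean, a standard deviation, and an approximate 95% confidence interval is given here. The per-episode values would come from `evaluate_policy(..., return_episode_rewards=True)`. The numbers below are placeholders, not results, and `summarize_returns` is a hypothetical helper. The normal-approximation interval is only a rough guide at 20 episodes.

```python
import math
import statistics


def summarize_returns(episode_returns, z=1.96):
    """Mean, sample std, and normal-approximation CI for episode returns."""
    n = len(episode_returns)
    if n < 2:
        raise ValueError("need at least two episodes to estimate dispersion")
    mean = statistics.fmean(episode_returns)
    std = statistics.stdev(episode_returns)  # sample std (divides by n - 1)
    half_width = z * std / math.sqrt(n)      # standard error scaled by z
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "ci95": (mean - half_width, mean + half_width),
    }


# Placeholder returns, used only to demonstrate the calculation.
summary = summarize_returns([200.0, 220.0, 180.0, 240.0, 160.0])
print(summary)
```

Note that `evaluate_policy` reports a population standard deviation (NumPy's default), so its value will be slightly smaller than the sample estimate used in this sketch.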


In [None]:
# Reload the best checkpoint saved by EvalCallback if it exists; otherwise use the final model
from stable_baselines3 import PPO

best_path = os.path.join(log_dir, "best_model")
if os.path.exists(best_path + ".zip"):
    model = PPO.load(best_path)
    print("Loaded best model:", best_path)
else:
    model = PPO.load(model_path)
    print("Loaded latest model:", model_path)

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20, deterministic=True, render=False)
print(f"Baseline mean reward (20 episodes): {mean_reward:.1f} +/- {std_reward:.1f}")


### D - Visual trajectory validation

Beyond aggregate metrics, we inspect rendered trajectories for:
- Approach-phase stability
- Attitude correction
- Terminal behavior near the landing zone

Start/middle/end frame inspection helps identify failure modes quickly.


In [None]:
import matplotlib.pyplot as plt
import numpy as np

vis_env = gym.make("LunarLander-v3", render_mode="rgb_array")

obs, info = vis_env.reset(seed=123)
frames = []
done = False

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = vis_env.step(action)
    frames.append(vis_env.render())
    done = terminated or truncated

vis_env.close()

# Show start / middle / end frames
idxs = [0, len(frames)//2, len(frames)-1]
plt.figure(figsize=(12,4))
for i, idx in enumerate(idxs, 1):
    plt.subplot(1,3,i)
    plt.imshow(frames[idx])
    plt.title(["Start","Middle","End"][i-1])
    plt.axis("off")
plt.tight_layout()
plt.show()


## Overall mission conclusion

The resulting PPO baseline is technically coherent and suitable for an applied RL demonstration.

### Key results
- Complete pipeline: exploration -> training -> evaluation -> visualization
- Model artifact saved and reusable through API/frontend
- Solid base for iterative improvement (timesteps, seeds, hyperparameters)

### Recommended next steps
1. Train multiple seeds to quantify variance
2. Compare 300k vs 1M timesteps on identical metrics
3. Select best run by reward + stability, not by training length alone
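The third step above can be made explicit with a simple stability-adjusted score, for example mean reward minus a multiple of the standard deviation. This is one possible criterion among many. The function names, the `risk_weight` parameter, and the run entries below are hypothetical placeholders, not measured results.

```python
def stability_adjusted_score(mean_reward, std_reward, risk_weight=1.0):
    """Penalize dispersion: higher mean and lower std both raise the score."""
    return mean_reward - risk_weight * std_reward


def select_best_run(runs, risk_weight=1.0):
    """Return the run with the highest stability-adjusted score."""
    if not runs:
        raise ValueError("no runs to compare")
    return max(
        runs,
        key=lambda run: stability_adjusted_score(run["mean"], run["std"], risk_weight),
    )


# Placeholder entries standing in for evaluated runs (not real results).
candidate_runs = [
    {"name": "ppo_300k_seed0", "mean": 230.0, "std": 60.0},
    {"name": "ppo_1m_seed0", "mean": 245.0, "std": 90.0},
    {"name": "ppo_300k_seed1", "mean": 220.0, "std": 25.0},
]
best = select_best_run(candidate_runs, risk_weight=1.0)
print(best["name"])
```

With `risk_weight=0` the selection reduces to picking the highest mean reward. Larger values favor runs whose landings are more consistent.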
