# Launch 1 - Reinforcement Learning Foundations

This notebook is the scientific initialization phase of the project. The goal is not high performance yet, but a rigorous validation of the RL loop in a simple, interpretable setting.

## Technical objectives
- Instantiate a Gymnasium environment and verify its behavior
- Interpret the observation and action spaces
- Run a random policy as a reference baseline
- Observe the interaction dynamics: state -> action -> reward -> next state

## Working hypothesis
A random policy should yield low episode returns with high episode-to-episode variance. This baseline serves as the comparison point for the learning phases.


## Scientific Justification (Phase 1)

### Why not PPO yet in Launch 1?
Launch 1 is intentionally a **foundational phase**: we first validate the RL loop, state/action semantics, and instrumentation on a simple benchmark before introducing higher-complexity policy optimization.

At this stage, the objective is **experimental reliability**, not final mission performance.

### Reward definition and interpretation in this phase
The episode return is used as the primary outcome variable. Even with a random policy, we explicitly track reward trajectories to establish:
- A lower-bound baseline
- Variance patterns across episodes
- Whether metrics/logging are trustworthy for later launches
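
As a reference for the quantity tracked here, the undiscounted episode return can be written as below. The symbols $G$, $r_t$, and $T$ are notation introduced for this note, not taken from the notebook. The CartPole-v1 specifics (a reward of +1 per step and truncation at 500 steps) assume the default Gymnasium configuration of the benchmark used later in this launch.

$$
G = \sum_{t=1}^{T} r_t, \qquad r_t = 1 \;\Rightarrow\; G = T \le 500
$$

Under these assumptions, the return of an episode is simply its length in steps, which is what makes the random baseline easy to read.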

### Variance handling (initial level)
We already control variance through:
- Repeated episodes
- Aggregated statistics over runs
- Deterministic plotting and reproducible code paths

This prepares the statistical discipline used in later launches (seed control, smoothing, confidence bands).
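
A minimal sketch of what "aggregated statistics" could look like for a list of episode returns. The helper name `summarize_returns` and the example values are hypothetical and chosen for illustration; they are not results from this notebook. Only the standard library is used so the snippet runs without the simulator.

```python
import math
import statistics


def summarize_returns(returns):
    """Return the count, mean, sample standard deviation, and standard error."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two episodes to estimate dispersion")
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)  # sample standard deviation (n - 1 denominator)
    sem = std / math.sqrt(n)         # standard error of the mean
    return {"n": n, "mean": mean, "std": std, "sem": sem}


# Placeholder values for illustration only; these are not measured results.
example_returns = [10.0, 20.0, 30.0]
summary = summarize_returns(example_returns)
print(summary)
```

Applied to this notebook, the same helper could be called on the `rewards` list once the baseline episodes have run.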

### Limits of the simulation at this stage
This phase uses a simplified benchmark and therefore does **not** capture:
- High-fidelity flight dynamics
- Actuator/sensor delays
- Realistic disturbances (wind, vibration, turbulence)
- Hardware constraints and fault modes

### Sim2Real perspective
No direct transfer claim is made from Launch 1. The phase is a controlled methodological step that reduces implementation risk before moving to deeper RL and mission-like dynamics.


### A - Imports and minimal instrumentation

We import:
- `Gymnasium` for simulation environments
- `NumPy` for numerical handling
- `Matplotlib` for result visualization

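This cell only loads the dependencies. No random seed is set in this launch, so episode returns vary from run to run; seed control is introduced in later launches.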


In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

### B - Environment creation and exploration

We use **CartPole-v1**, a classic control benchmark.

Why this environment:
- Continuous state space with low dimensionality (4 variables)
- Discrete action space (left/right)
- Reward signal that is easy to interpret (+1 for every time step the pole stays up)

This choice introduces RL concepts without excessive algorithmic complexity.


In [None]:
# Create the CartPole environment
env = gym.make("CartPole-v1")

print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

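# Initial observation from the simulator and a random action.
# (observation_space.sample() would draw arbitrary values from the Box,
# not a state the simulator actually produces.)
obs, info = env.reset()
act = env.action_space.sample()

print("Initial observation:", obs)
print("Sample random action:", act)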


> Observation space (physical interpretation)
>
> _Box([-4.8, -inf, -0.418..., -inf], [4.8, inf, 0.418..., inf], (4,), float32)_
>
> The state vector describes, at each time step:
> 1. cart horizontal position,
> 2. cart horizontal velocity,
> 3. pole angle (in radians),
> 4. pole angular velocity.
>
> This representation is sufficient for Markovian control: the optimal action depends only on the current state, not on the history of past states.
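
The four components can be given names, and the printed bounds can be related to the episode termination limits. The sketch below assumes the default CartPole-v1 limits described in the Gymnasium documentation (cart position beyond ±2.4, pole angle beyond ±12°), which are half of the ±4.8 and ±0.418 rad bounds shown in the `Box`. The names `CartPoleState` and `is_out_of_bounds` are hypothetical helpers, not part of Gymnasium.

```python
import math
from typing import NamedTuple

# Termination limits assumed from the CartPole-v1 documentation (default configuration).
X_LIMIT = 2.4                      # cart position limit
THETA_LIMIT = math.radians(12.0)   # pole angle limit, about 0.2095 rad


class CartPoleState(NamedTuple):
    """Hypothetical named view of the 4-element observation vector."""
    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_angular_velocity: float


def is_out_of_bounds(state: CartPoleState) -> bool:
    """True when the state violates one of the assumed termination limits."""
    return abs(state.cart_position) > X_LIMIT or abs(state.pole_angle) > THETA_LIMIT


# Illustrative values, not simulator output.
state = CartPoleState(*[0.01, -0.02, 0.03, 0.04])
print(state, is_out_of_bounds(state))
```

An observation returned by `env.reset()` or `env.step()` could be unpacked the same way with `CartPoleState(*obs)`.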


### C - Random policy (experimental baseline)

We run multiple episodes with `env.action_space.sample()`.

Methodological purpose:
- Establish a lower-bound performance reference
- Measure inter-episode variability
- Validate the simulation loop before training

Comparing against a random baseline is standard RL practice: an agent's improvement is only meaningful if it clearly exceeds what random actions achieve.


In [None]:
n_episodes = 10
rewards = []

for episode in range(n_episodes):
    obs, info = env.reset()
    done, total_reward = False, 0.0

    while not done:
        # Choose a random action
        action = env.action_space.sample()
        
        # Apply the action to the environment
        obs, reward, terminated, truncated, info = env.step(action)
        
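        total_reward += reward
        # terminated: the episode ended in a failure state (pole fell or cart left the track)
        # truncated: the episode was cut off by the time limit
        done = terminated or truncated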

    rewards.append(total_reward)
    print(f"Episode {episode+1} finished with total reward = {total_reward}")

env.close()
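
The loop above can also be factored into a helper so that later launches can reuse it with other policies. This is a sketch, not the notebook's own code: `run_episode` and `FixedLengthEnv` are hypothetical names, and the stand-in environment only mimics the Gymnasium `reset`/`step` return convention so the example runs without the simulator.

```python
import random


def run_episode(env, policy):
    """Run one episode and return (total_reward, n_steps).

    `env` is any object following the Gymnasium reset/step convention;
    `policy` maps an observation to an action.
    """
    obs, info = env.reset()
    total_reward, n_steps, done = 0.0, 0, False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        n_steps += 1
        done = terminated or truncated
    return total_reward, n_steps


class FixedLengthEnv:
    """Hypothetical stand-in: +1 reward per step, truncated after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None, options=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        truncated = self.t >= self.horizon
        # (observation, reward, terminated, truncated, info)
        return self.t, 1.0, False, truncated, {}


rng = random.Random(0)


def random_policy(obs):
    """Ignore the observation and pick one of two actions, as in CartPole."""
    return rng.choice([0, 1])


total, steps = run_episode(FixedLengthEnv(horizon=5), random_policy)
print(total, steps)
```

With the real environment, the equivalent call would be `run_episode(env, lambda obs: env.action_space.sample())`, made before `env.close()`.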

### D - Visualization and result interpretation

The cumulative reward per episode chart helps assess:
- Central performance level
- Trajectory dispersion
- Absence of structured learning (typically noisy behavior)

This confirms that the simulation responds to actions, but that no learned policy is steering it.


In [None]:
plt.figure(figsize=(6, 4))
plt.bar(range(1, n_episodes + 1), rewards)
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("Random policy performance")
plt.show()
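
The "absence of structured learning" can also be checked numerically rather than by eye. One simple option, offered here as an assumption and not as the notebook's method, is the least-squares slope of return against episode index: for a random policy it should stay close to zero, although with only a handful of episodes the estimate is noisy. The helper name `return_trend_slope` is hypothetical.

```python
def return_trend_slope(returns):
    """Least-squares slope of return versus episode index (1, 2, ..., n)."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two episodes to fit a slope")
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(returns) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, returns))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


# Illustrative inputs, not measured results.
rising = return_trend_slope([1.0, 2.0, 3.0, 4.0])  # steadily improving returns
flat = return_trend_slope([5.0, 5.0, 5.0])          # no trend at all
print(rising, flat)
```

In this notebook the function could be applied to `rewards`; a clearly positive slope would be expected only once a learning agent replaces the random policy.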

## Launch 1 Conclusion

We validated the fundamentals: environment, states, actions, and the RL interaction loop.

### Scientific outcome
- Experimental protocol is operational
- Random baseline established
- Comparison metric available

### Phase transition
The next phase (Launch 2) introduces explicit learning through tabular Q-Learning on a discrete environment.
