# Launch 2 - Tabular Q-Learning

After the random baseline, this phase introduces a learning agent based on a **Q-Table**.

## Objectives
- Formulate the problem in a discrete state-action setting
- Implement the Bellman update in Q-Learning
- Quantify performance gain versus a non-trained policy

## Experimental setting
We use **FrozenLake-v1**, whose state and action spaces are both discrete (16 states and 4 actions on the default 4x4 map), so every Q(s,a) value can be inspected directly.


## Scientific Justification (Phase 2)

### Why Tabular Q-Learning here?
Launch 2 introduces a transparent, interpretable learning baseline:
- Explicit value table updates
- Clear exploration/exploitation dynamics
- Direct convergence diagnostics

This creates a traceable bridge between random behavior (Launch 1) and function approximation (Launch 3).
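
Because the learned values live in a plain array, the greedy policy can be read off and printed on the grid. The sketch below is illustrative only: `render_greedy_policy`, `ARROWS`, and the hand-filled `demo_Q` are hypothetical names and values, not results from this notebook. It assumes the default 4x4 map and FrozenLake's action order (0 = left, 1 = down, 2 = right, 3 = up).

```python
import numpy as np

# One symbol per action index: 0=left, 1=down, 2=right, 3=up
ARROWS = "<v>^"

def render_greedy_policy(Q, n_cols=4):
    """Return the greedy action of every state as rows of arrow symbols."""
    greedy = np.argmax(Q, axis=1)              # best action index per state
    symbols = [ARROWS[a] for a in greedy]
    rows = ["".join(symbols[i:i + n_cols]) for i in range(0, len(symbols), n_cols)]
    return "\n".join(rows)

# Hand-filled table, for illustration only (not a trained result)
demo_Q = np.zeros((16, 4))
demo_Q[0, 1] = 0.5     # state 0 prefers "down"
demo_Q[14, 2] = 0.9    # state 14 prefers "right"
print(render_greedy_policy(demo_Q))
```

States whose row is all zeros show `<`, because `np.argmax` returns the first index on ties. Many such cells after training would suggest that those states were rarely visited.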

### PPO status in this phase
PPO is intentionally postponed here. The goal is to verify that policy improvement and Bellman updates are correctly understood and measured before switching to neural policies.

### Reward shaping and objective clarity
The reward signal is analyzed as a design object, not only as a score. We examine whether reward structure actually incentivizes task completion and stable behavior.

### Variance and robustness controls
Variance is managed through:
- Multiple evaluation episodes
- Success-rate reporting
- Stable protocol settings

These controls are prerequisites for meaningful comparison with deep RL phases.
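
A success rate measured over a finite number of episodes is itself a noisy estimate, so it helps to report an interval alongside it. The following is a minimal sketch using the Wilson score interval for a binomial proportion. The function name `wilson_interval` and the example counts are assumptions chosen for illustration, not measured results.

```python
import math

def wilson_interval(successes, n_episodes, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if n_episodes <= 0:
        raise ValueError("n_episodes must be positive")
    p_hat = successes / n_episodes
    denom = 1.0 + z**2 / n_episodes
    center = (p_hat + z**2 / (2 * n_episodes)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n_episodes + z**2 / (4 * n_episodes**2)
    ) / denom
    return center - half_width, center + half_width

# Hypothetical counts, for illustration only
low, high = wilson_interval(successes=700, n_episodes=1000)
print(f"Observed 70.0%, approximate 95% interval: [{low:.3f}, {high:.3f}]")
```

Two policies whose intervals overlap heavily should not be ranked on the basis of a single evaluation run.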

### Simulation limitations and transfer caution
As a tabular benchmark phase, it does not model mission-level nonlinear guidance or real-world disturbances. Conclusions remain algorithmic and methodological, not operational-flight claims.


### A - Environment and Q-Table preparation

We initialize:
- Number of states S
- Number of actions A
- A Q table of shape (|S| x |A|), initialized to zeros

Each entry Q[s, a] estimates the expected discounted return of taking action a in state s and then acting greedily.


In [None]:
import gymnasium as gym
import numpy as np

# Create FrozenLake environment
env = gym.make("FrozenLake-v1", is_slippery=True)

# Q-Table dimensions
n_states = env.observation_space.n
n_actions = env.action_space.n

# Initialize Q-Table
Q = np.zeros((n_states, n_actions))

print("Number of states:", n_states)
print("Number of actions:", n_actions)
print("Initial Q-Table:", Q)


### B - Q-Learning loop (exploration vs exploitation)

At each transition, we apply an epsilon-greedy policy and the update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Interpretation:
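- `alpha` (learning rate) sets how far each update moves Q(s,a) toward the new target
- `gamma` (discount factor) weights future rewards relative to immediate ones
- `epsilon` is the probability of taking a random action instead of the greedy one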

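In a finite, discrete environment this update converges to the optimal action values, provided every state-action pair keeps being visited and the learning rate decreases appropriately. With a fixed `alpha` in a stochastic environment the estimates keep fluctuating, but the greedy policy is typically still much stronger than a random one.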
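
To make the update rule concrete, here is a small worked example on a fresh table. It is a sketch: `q_update` and `demo_Q` are hypothetical names, the two transitions are hand-picked rather than sampled from the environment, and `alpha` and `gamma` take the values used in the training cell.

```python
import numpy as np

alpha, gamma = 0.8, 0.95          # same values as the training cell
demo_Q = np.zeros((16, 4))        # fresh table for the default 4x4 map

def q_update(Q, state, action, reward, new_state):
    """Apply one Q-Learning update in place and return the new entry."""
    td_target = reward + gamma * np.max(Q[new_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q[state, action]

# Transition into the goal: state 14, action 2 (right), reward 1, lands on 15
first = q_update(demo_Q, 14, 2, 1.0, 15)   # 0 + 0.8 * (1 + 0.95 * 0 - 0) = 0.8

# Later transition one cell earlier: state 13, action 2, reward 0, lands on 14
second = q_update(demo_Q, 13, 2, 0.0, 14)  # 0 + 0.8 * (0 + 0.95 * 0.8 - 0) = 0.608

print(first, second)
```

The second update receives no reward, yet its value becomes positive: the reward observed at the goal propagates backward through the `max` term, one transition at a time.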


In [None]:
# Hyperparameters
alpha = 0.8       # learning rate
gamma = 0.95      # discount factor
epsilon = 0.2     # exploration rate
n_episodes = 5000
max_steps = 100

# Training loop
for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    
    for step in range(max_steps):
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])
        
        # Environment step
        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

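        # Q-Table update. Terminal states (holes, goal) are never updated,
        # so their rows stay at 0 and the bootstrap term vanishes at episode end.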
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[new_state, :]) - Q[state, action]
        )
        
        state = new_state
        if done:
            break

print("Q-Table after training:", Q)
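
The loop above keeps `epsilon` fixed and relies on `np.argmax`, which returns the first index when several actions share the maximum, so an all-zero row always yields action 0. One possible variant, shown here as a sketch and not as this notebook's method, breaks ties at random and decays exploration over episodes. The names `epsilon_greedy` and `decayed_epsilon` and the schedule values (`eps_start`, `eps_end`, `decay`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator for reproducible runs

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one.

    Ties between equally good actions are broken uniformly at random.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    best = np.flatnonzero(q_row == np.max(q_row))   # all maximal indices
    return int(rng.choice(best))

def decayed_epsilon(episode, eps_start=1.0, eps_end=0.05, decay=0.999):
    """Exponentially decayed exploration rate with a lower bound."""
    return max(eps_end, eps_start * decay ** episode)

# Untrained row: every action is tied, so all four should appear
picks = {epsilon_greedy(np.zeros(4), 0.0, rng) for _ in range(200)}
print(sorted(picks), decayed_epsilon(0), decayed_epsilon(5000))
```

In the training loop, the action selection could then read `action = epsilon_greedy(Q[state], decayed_epsilon(episode), rng)`. Whether this improves the final success rate would need to be checked experimentally.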


### C - Trained policy evaluation

The agent is evaluated in greedy mode (argmax over Q) across many episodes.

Primary metric: **success rate**.

FrozenLake gives a reward of 1 only when the goal is reached and 0 otherwise, so the success rate equals the mean episode return and is straightforward to interpret.


In [None]:
# Evaluate trained agent
n_eval_episodes = 1000
successes = 0

for episode in range(n_eval_episodes):
    state, _ = env.reset()
    done = False

    while not done:
        # Greedy action from Q-Table
        action = np.argmax(Q[state, :])
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Count success when terminal reward is 1
        if reward == 1.0:
            successes += 1

# Compute success rate
success_rate = successes / n_eval_episodes
print(f"Success rate over {n_eval_episodes} episodes: {success_rate*100:.2f}%")
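
To quantify the gain over an untrained policy under the same protocol, the evaluation loop can be wrapped in a function that accepts any action-selection rule. This is a sketch: `success_rate` and `choose_action` are hypothetical names, and `env` is assumed to follow the Gymnasium `reset`/`step` interface used above.

```python
def success_rate(env, choose_action, n_episodes=1000):
    """Fraction of episodes that end with a terminal reward of 1."""
    successes = 0
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        reward = 0.0
        while not done:
            action = choose_action(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        # The reward of the last step tells whether the goal was reached
        successes += int(reward == 1.0)
    return successes / n_episodes
```

With the notebook's `env` and `Q`, `success_rate(env, lambda s: env.action_space.sample())` would give the random baseline and `success_rate(env, lambda s: int(np.argmax(Q[s])))` the trained agent, both measured over the same number of episodes.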


## Launch 2 Conclusion

This phase demonstrates the first major step forward:
- From random policy to learned policy
- Measurable improvement through success rate
- Validation of Bellman-style value updates learned from sampled transitions

### Identified limitation
Tabular Q-Learning does not scale well to continuous or high-dimensional state spaces.
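
The scaling problem can be made concrete with a back-of-the-envelope count: a table needs one entry per state-action pair, and discretizing a continuous state into bins multiplies the number of states with every added dimension. The helper `table_entries` and the 8-dimensional, 10-bin example are hypothetical, chosen only to illustrate the growth.

```python
def table_entries(bins_per_dim, n_dims, n_actions):
    """Number of Q-Table entries when each state dimension is split into bins."""
    return (bins_per_dim ** n_dims) * n_actions

# FrozenLake 4x4: a single discrete dimension with 16 values, 4 actions
frozenlake = table_entries(bins_per_dim=16, n_dims=1, n_actions=4)

# Hypothetical continuous task: 8 state dimensions, 10 bins each, 4 actions
hypothetical = table_entries(bins_per_dim=10, n_dims=8, n_actions=4)

print(frozenlake, hypothetical)  # 64 400000000
```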

### Phase transition
Launch 3 replaces the table with a neural network approximator (**DQN**), which is needed once the state space is too large to enumerate.
