# Practical Implementation and Experimentation

This section focuses on **hands-on RL**: coding minimal examples, running them in common environments, and evaluating agent performance.

---

## 1. Minimal Working Code

### Tabular RL (Discrete states/actions)
- Example: Q-learning for GridWorld.
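```python
import numpy as np

# Assumes `env` is a Gymnasium-style environment with Discrete observation
# and action spaces (e.g. a GridWorld) that has already been created.
num_states = env.observation_space.n
num_actions = env.action_space.n
num_episodes = 500  # illustrative value

# Q-table initialization
Q = np.zeros((num_states, num_actions))
alpha = 0.1    # learning rate
gamma = 0.99   # discount factor
epsilon = 0.1  # exploration rate

for episode in range(num_episodes):
    state, _ = env.reset()  # Gymnasium returns (observation, info)
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        # Gymnasium reports `terminated` and `truncated` separately
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Do not bootstrap from a terminal state
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        done = terminated or truncated
```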


### Neural Network RL (Deep RL)

- Example: simple DQN skeleton using PyTorch.
```python
import torch
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.fc(x)

# Example sizes (CartPole: 4 observation values, 2 actions)
state_dim, action_dim = 4, 2
q_net = QNetwork(state_dim, action_dim)
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```

Core loop: sample transitions → compute target → update network.
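
The core loop can be sketched as a single update step. This is a minimal illustration, not a full DQN: the batch is random placeholder data standing in for samples from a replay buffer, and `target_net`, `make_net`, `dqn_update`, and the sizes are assumptions introduced here.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
state_dim, action_dim, batch_size, gamma = 4, 2, 32, 0.99  # illustrative sizes

def make_net():
    return nn.Sequential(
        nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, action_dim)
    )

q_net = make_net()
target_net = make_net()
target_net.load_state_dict(q_net.state_dict())  # start as an exact copy
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder batch; in practice these tensors are sampled from a replay buffer.
states = torch.randn(batch_size, state_dim)
actions = torch.randint(0, action_dim, (batch_size,))
rewards = torch.randn(batch_size)
next_states = torch.randn(batch_size, state_dim)
dones = torch.randint(0, 2, (batch_size,)).float()  # 1.0 where the episode terminated

def dqn_update():
    # Q(s, a) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target: r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    loss = loss_fn(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss_value = dqn_update()
```

Computing the target with a separate, periodically synced network under `torch.no_grad()` keeps the regression target from shifting with every gradient step.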

## 2. Common Environments

| Environment                | Type                                | Use Case                         |
| -------------------------- | ----------------------------------- | -------------------------------- |
| **GridWorld**              | Tabular                             | Learning basic RL mechanics      |
| **CartPole / MountainCar** | Continuous state, discrete actions  | Testing TD, policy gradient, DQN |
| **Atari (OpenAI Gym)**     | High-dimensional / image input      | Benchmarking deep RL methods     |
| **MuJoCo / PyBullet**      | Continuous control                  | Robotics and physics-based RL    |


- Use Gymnasium, the maintained successor to OpenAI Gym, for standard environments.
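
A GridWorld is often written by hand rather than loaded by name. The sketch below is one possible version that mirrors the Gymnasium `reset` / `step` return signatures so it can be dropped into a tabular loop. `GridWorld`, `DiscreteSpace`, the grid size, and the reward of -1 per step are all assumptions made for illustration, and the class does not inherit from any Gymnasium base class.

```python
import random

class DiscreteSpace:
    """Hypothetical stand-in for a Gymnasium Discrete space."""
    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)


class GridWorld:
    """Hypothetical size x size grid: start top-left, goal bottom-right."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=4, max_steps=100):
        self.size = size
        self.max_steps = max_steps
        self.observation_space = DiscreteSpace(size * size)
        self.action_space = DiscreteSpace(len(self.MOVES))
        self.row, self.col, self.steps = 0, 0, 0

    def _state(self):
        return self.row * self.size + self.col  # flatten (row, col) to one integer

    def reset(self):
        self.row, self.col, self.steps = 0, 0, 0
        return self._state(), {}

    def step(self, action):
        d_row, d_col = self.MOVES[action]
        # Moves that would leave the grid keep the agent in place
        self.row = min(max(self.row + d_row, 0), self.size - 1)
        self.col = min(max(self.col + d_col, 0), self.size - 1)
        self.steps += 1
        terminated = (self.row, self.col) == (self.size - 1, self.size - 1)
        truncated = not terminated and self.steps >= self.max_steps
        reward = -1.0  # constant step cost, so shorter paths score higher
        return self._state(), reward, terminated, truncated, {}


env = GridWorld()
state, info = env.reset()
total_reward = 0.0
for action in [1, 1, 1, 3, 3, 3]:  # three moves down, then three moves right
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
```

Separating `terminated` (goal reached) from `truncated` (step limit hit) lets the learner avoid bootstrapping only when the episode has truly ended.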

## 3. Evaluation Metrics and Debugging

- Cumulative reward per episode: the basic measure of performance.
- Moving average of reward: smooths out episode-to-episode variance.
- Policy inspection / visualization: check whether the agent behaves sensibly.
- Learning curves: detect divergence, stagnation, or instability.
- Ablation studies: test the effect of learning rate, discount factor, and network size.
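
As a small illustration of the moving-average metric, the sketch below smooths a list of per-episode returns with NumPy. The helper name `moving_average`, the window size, and the return values are made up for the example.

```python
import numpy as np

def moving_average(values, window):
    """Mean over each full window of `window` consecutive values."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    # mode="valid" drops the edges where the window would be incomplete
    return np.convolve(values, kernel, mode="valid")

episode_returns = [10.0, 20.0, 15.0, 30.0, 25.0, 40.0]  # made-up numbers
smoothed = moving_average(episode_returns, window=3)
```

Plotting `smoothed` alongside the raw returns makes trends easier to read while keeping the variance visible.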

Common pitfalls:

- Agent stuck in local optimum → increase exploration.
- Divergent Q-values → reduce learning rate, use target networks.
- High variance in returns → increase episode count or batch size.
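
One common way to increase exploration early without keeping it high forever is to anneal ε over training. The sketch below assumes a linear schedule; the function name `linear_epsilon` and its default values are illustrative choices, not recommendations from this section.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold at eps_end."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

# Epsilon at the start, midway, at the end of the decay, and well past it
schedule = [linear_epsilon(s) for s in (0, 5_000, 10_000, 20_000)]
```

In a training loop, the returned value would replace the fixed `epsilon` in the ε-greedy check.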

## 4. Reproducibility and Randomness Control

- Set random seeds consistently (Python, NumPy, PyTorch, Gym).

```python
import random

import numpy as np
import torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
# With Gymnasium, also seed the environment on its first reset: env.reset(seed=42)
```

- Control environment randomness when benchmarking.
- Log hyperparameters and versions of libraries.
- Use deterministic GPU settings if strict reproducibility is needed.
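
Logging hyperparameters and library versions can be as simple as serializing one record per run. The sketch below uses only the standard library; the helper `library_versions`, the record layout, and the package list are assumptions for illustration, and the hyperparameter values are the ones used in the examples above.

```python
import json
import platform
from importlib import metadata

def library_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

run_record = {
    "hyperparameters": {"alpha": 0.1, "gamma": 0.99, "epsilon": 0.1, "seed": 42},
    "python": platform.python_version(),
    "libraries": library_versions(["numpy", "torch", "gymnasium"]),
}
# In a real run, this string would be written to a file stored next to the results
record_json = json.dumps(run_record, indent=2, sort_keys=True)
```

Storing the record alongside the learning curves makes it possible to tell later which settings and versions produced a given result.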

## 5. Key Takeaways

- Start with simple tabular examples before deep RL.
- Use standard environments to test algorithms.
- Track reward curves and inspect behavior for debugging.
- Control randomness and seeds to make experiments reproducible.
- Proper evaluation and logging are as important as coding the RL algorithm itself.