```{contents}
```

## Reinforcement Learning

**Reinforcement Learning (RL)** is a machine learning paradigm where an *agent* learns to make decisions through **trial and error**, guided by **rewards** or **penalties**.

The agent interacts with an **environment**, takes actions, receives feedback, and uses that feedback to improve its behavior.

---

### 1. Core Intuition of Reinforcement Learning

The intuition is:

> “Do an action → see what happens → repeat the actions that lead to good outcomes.”

RL does not need labeled data.
It learns from **experience**, **feedback**, and **rewards**.

---

### 2. Key Components (The RL System)

#### **1. Agent**

The learner or decision-maker (e.g., robot, LLM, software program).

#### **2. Environment**

The world the agent interacts with (game, robot room, user conversation).

#### **3. State (s)**

What the environment looks like at a given moment.

#### **4. Action (a)**

The choice the agent makes.

#### **5. Reward (r)**

A number indicating how good the action was.

#### **6. Policy (π)**

The strategy the agent uses to choose actions.

---

### 3. The RL Loop

```
State ----> Agent ----> Action ----> Environment ----> Reward + New State
                 ↑-----------------------------------------------↓
```

This loop repeats thousands or millions of times until the agent learns a good policy.
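The loop above can be sketched in a few lines of Python. `LineWorld`, `random_policy`, and `run_episode` are hypothetical names invented for this illustration and do not come from any library; the policy here is random and does no learning, so the sketch only shows the order of calls.

```python
import random

class LineWorld:
    """Hypothetical toy environment: positions 0..4 on a line, goal at 4."""

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, 4]
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done

def random_policy(state, rng):
    """Placeholder policy: ignores the state and picks a random direction."""
    return rng.choice([-1, 1])

def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)              # agent chooses an action
        state, reward, done = env.step(action)   # environment returns feedback
        total_reward += reward                   # signal a learner would use
        if done:
            break
    return total_reward

print(run_episode(LineWorld(), random_policy, random.Random(0)))
```

A learning agent would replace `random_policy` with something that changes in response to the rewards it sees.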

---

### 4. How RL Learns

The agent tries to **maximize cumulative rewards** over time.

It has two tasks:

#### **Exploration**

Try new actions to discover if they lead to better rewards.

#### **Exploitation**

Use actions that already gave good rewards.

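The agent has to balance the two: with too little exploration it can lock in a mediocre strategy, and with too much it keeps wasting time on actions it already knows are poor.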
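One simple way to mix exploration and exploitation is an ε-greedy rule. The sketch below applies it to a hypothetical three-armed bandit; the payout probabilities, the `epsilon_greedy` helper, and the value of 0.1 for ε are assumptions chosen for illustration.

```python
import random

def epsilon_greedy(value_estimates, epsilon, rng):
    """With probability epsilon pick a random action, else the best-looking one."""
    if rng.random() < epsilon:
        return rng.randrange(len(value_estimates))    # explore
    return max(range(len(value_estimates)),
               key=lambda a: value_estimates[a])      # exploit

# Hypothetical bandit: each arm pays 1 with the given probability.
true_payout = [0.2, 0.5, 0.8]
estimates = [0.0, 0.0, 0.0]   # running average reward per arm
counts = [0, 0, 0]            # how often each arm was pulled
rng = random.Random(0)

for _ in range(5000):
    arm = epsilon_greedy(estimates, epsilon=0.1, rng=rng)
    reward = 1.0 if rng.random() < true_payout[arm] else 0.0
    counts[arm] += 1
    # Incremental mean: nudge the estimate toward the observed reward.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(counts)
print([round(v, 2) for v in estimates])
```

Most pulls should end up on the best arm, while the occasional random pull keeps the estimates for the other arms from going stale.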

---

### 5. Example to Understand RL Quickly

#### Example: Teaching a dog a trick

* Dog = agent
* You = environment
* Action = sit
* Reward = treat

Every time the dog performs the correct action ("sit"), it receives a reward.
Over many trials, the dog learns the actions that maximize treats.

RL follows the same pattern: actions that are followed by reward become more likely to be chosen again.

---

### 6. Classical RL Methods

#### **Value-Based Methods**

* Q-Learning
* Deep Q-Networks (DQN)

These learn a value for each state/action pair.

#### **Policy-Based Methods**

* REINFORCE
* PPO (used in RLHF for ChatGPT)

These learn the policy directly.
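To make "learning the policy directly" concrete, here is a minimal REINFORCE-style sketch on a hypothetical two-armed bandit. The softmax parameterization, the payout probabilities, and the learning rate are assumptions for illustration, and there is no baseline or batching as a practical implementation would typically have.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # one preference (logit) per action
true_payout = np.array([0.2, 0.8])    # hypothetical reward probabilities
learning_rate = 0.1

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)                     # sample from the policy
    reward = float(rng.random() < true_payout[action])  # 1.0 or 0.0
    # For a softmax policy, grad of log pi(action) is one_hot(action) - probs.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    # REINFORCE step: raise the log-probability of rewarded actions.
    theta += learning_rate * reward * grad_log_pi

print(softmax(theta))
```

No value table appears anywhere: the parameters `theta` define the action probabilities, and the reward only scales how strongly each sampled action is reinforced.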

#### **Actor–Critic Methods**

* A3C
* PPO (listed again because it is usually implemented with a learned value function as the critic)

These combine value and policy learning.

---

### 7. Where RL Is Used in the Real World

* Robotics
* Game AI (AlphaGo, AlphaZero)
* Autonomous driving
* Recommendation systems
* Finance trading
* Control systems
* Conversational AI (LLMs using RLHF)

---

### 8. Reinforcement Learning in LLMs (RLHF)

Modern LLMs like ChatGPT use RL in a special way called **RLHF (Reinforcement Learning from Human Feedback)**.

#### RLHF Pipeline:

1. **Supervised Fine-Tuning (SFT)**
   Teach the model how to answer.

2. **Preference Modeling (Reward Model)**
   Train a reward model that scores human-preferred answers higher.

3. **RL Optimization (usually PPO)**
   The LLM is optimized to maximize the Reward Model’s score.
   This teaches the LLM to:

   * be safer
   * be more helpful
   * avoid harmful content
   * respond in a human-preferred style

In RLHF, the LLM becomes the *agent*, the Reward Model becomes the *reward function*, and conversation becomes the *environment*.
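For step 2, one common choice is a pairwise (Bradley-Terry style) loss on the reward model's scores. The text above does not say which loss is used, so treat this as an assumed example; `preference_loss` is a hypothetical helper, and the scores are plain numbers standing in for reward-model outputs.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen answer beats the rejected one."""
    # sigmoid(margin) is the modelled probability that the chosen answer wins;
    # log1p(exp(-margin)) equals -log(sigmoid(margin)).
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

print(preference_loss(2.0, 0.0))   # preferred answer scored higher: small loss
print(preference_loss(0.0, 2.0))   # preferred answer scored lower: large loss
print(preference_loss(1.0, 1.0))   # tie: log(2)
```

Minimizing this loss pushes the reward model to score the human-preferred answer above the rejected one, which is the signal the RL step then maximizes.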

---

### 9. Intuition of RLHF

SFT teaches:

> “Here is an example of a good answer.”

Preference modeling teaches:

> “Between these two answers, A is better.”

The RL optimization step teaches:

> “Every time you answer like A, you get points. Answer like B, you lose points.”

The LLM learns to maximize the reward—just like an RL agent.

---

### 10. Why RL Is Powerful

RL is well suited for:

* long-term decision making
* learning from sparse feedback
* optimizing behavior rather than predicting labels
* adjusting dynamically to different environments

Unlike supervised learning, RL does not need explicit targets for every action.

---

**One-Sentence Summary**

**Reinforcement Learning is a learning paradigm where an agent learns through trial-and-error by interacting with an environment and receiving rewards, and it forms the core of how LLMs like ChatGPT become aligned using RLHF.**


### Reinforcement Learning Demonstration (Q-Learning)

The agent starts at the top-left of a 4×4 grid and must reach the goal at the bottom-right.

```
S . . .
. . . .
. . . .
. . . G
```

* S = Start
* G = Goal (reward = +10)
* Each move costs −1 (so shorter paths earn a higher total reward)

The agent learns by **trial and error**.

---

### 1. Create a Simple Environment

```python
import numpy as np
import random

GRID_SIZE = 4
GOAL_STATE = (3, 3)

# Actions: up, down, left, right
ACTIONS = {
    0: (-1, 0),  # up
    1: (1, 0),   # down
    2: (0, -1),  # left
    3: (0, 1)    # right
}

def step(state, action):
    """Takes an action and returns (next_state, reward, done)."""
    x, y = state  # x is the row index, y is the column index
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy

    # Stay inside grid boundaries
    nx = max(0, min(GRID_SIZE - 1, nx))
    ny = max(0, min(GRID_SIZE - 1, ny))

    next_state = (nx, ny)

    if next_state == GOAL_STATE:
        return next_state, +10, True

    return next_state, -1, False
```

---

### 2. Initialize Q-Table

```python
Q = np.zeros((GRID_SIZE, GRID_SIZE, 4))  # Q[row, col, action]; every estimate starts at 0
```

---

### 3. Q-Learning Hyperparameters

```python
alpha = 0.1        # learning rate
gamma = 0.9        # discount factor
epsilon = 0.2      # exploration probability
episodes = 2000
```

---

### 4. Training Loop

```python
for ep in range(episodes):
    state = (0, 0)
    done = False

    while not done:
        x, y = state

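        # ε-greedy action: explore with probability epsilon, otherwise exploit
        if random.random() < epsilon:
            action = random.choice(list(ACTIONS.keys()))
        else:
            # argmax breaks ties by taking the first (lowest-index) action
            action = int(np.argmax(Q[x, y]))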

        next_state, reward, done = step(state, action)
        nx, ny = next_state

        # Q-learning update rule
        Q[x, y, action] += alpha * (
            reward + gamma * np.max(Q[nx, ny]) - Q[x, y, action]
        )

        state = next_state
```

---

### 5. Visualize the Learned Policy

```python
directions = {0: "↑", 1: "↓", 2: "←", 3: "→"}

policy_grid = []
for i in range(GRID_SIZE):
    row = []
    for j in range(GRID_SIZE):
        if (i, j) == GOAL_STATE:
            row.append("G")
        else:
            best_action = np.argmax(Q[i, j])
            row.append(directions[best_action])
    policy_grid.append(row)

for row in policy_grid:
    print(row)
```

**Example output (arrows differ between runs because exploration is random, and several cells have two equally short routes to the goal):**

```
['→', '→', '↓', '↓']
['↓', '→', '↓', '↓']
['↓', '↓', '→', '↓']
['↓', '↓', '→', 'G']
```

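Each arrow is the greedy action (`argmax` over the Q-values) for that cell. Following the arrows in the example above from the top-left corner reaches `G` in six moves, which is the minimum on a 4×4 grid.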
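A learned table can also be checked by following its greedy actions from the start cell. The sketch below is self-contained and runs on a small hand-made table rather than the trained one; `greedy_path`, `MOVES`, and `demo_q` are hypothetical names, and `MOVES` assumes the same action encoding as `ACTIONS`.

```python
import numpy as np

# Same action encoding as ACTIONS in the environment: up, down, left, right.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def greedy_path(q_table, start, goal, max_steps=50):
    """Follow argmax(Q) from start; stop at the goal or after max_steps moves."""
    size = q_table.shape[0]
    state, path = start, [start]
    for _ in range(max_steps):   # the cap guards against loops in a poor table
        if state == goal:
            break
        dx, dy = MOVES[int(np.argmax(q_table[state]))]
        state = (max(0, min(size - 1, state[0] + dx)),
                 max(0, min(size - 1, state[1] + dy)))
        path.append(state)
    return path

# Hand-made 2x2 table standing in for a trained one.
demo_q = np.zeros((2, 2, 4))
demo_q[0, 0, 3] = 1.0   # from (0, 0) prefer "right"
demo_q[0, 1, 1] = 1.0   # from (0, 1) prefer "down"
print(greedy_path(demo_q, start=(0, 0), goal=(1, 1)))   # [(0, 0), (0, 1), (1, 1)]
```

With the table from the training loop, `greedy_path(Q, (0, 0), GOAL_STATE)` should return seven cells (six moves) if training has converged.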

---

### How This Demonstration Connects to the RL Explanation

| Concept         | How It Appears in Code       |
| --------------- | ---------------------------- |
| **Agent**       | Learns using Q-table         |
| **Environment** | The grid + `step()` function |
| **State**       | (x, y) grid coordinates      |
| **Action**      | up/down/left/right           |
| **Reward**      | −1 per move, +10 at goal     |
| **Policy**      | best action = `argmax(Q)`    |
| **Exploration** | `random.random() < epsilon`  |
| **Learning**    | Q-update rule                |

Tabular Q-learning on a small grid is a minimal instance of the same trial-and-error loop that larger RL systems use.

---