### Objective

This notebook demonstrates Reinforcement Learning fundamentals using a simple
LLM-based adaptive agent.

We will explicitly see:
- State
- Action
- Reward
- Policy
- Learning
- Next Action

⚠️ No model retraining is involved: the LLM's weights never change, only the prompt-level policy does.

## One Liner

In [1]:
from policy import ResponsePolicy
from rl_state import RLState
from prompt import build_prompt
from llm import llm, evaluator_llm
from feedback_interpreter import interpret_feedback
from policy_adapter import apply_policy_update

In [2]:
# Initialize RL Components (State & Policy)
policy = ResponsePolicy()
state = RLState()

print("Initial Policy:", policy.as_dict())
print("Initial State:", state.as_dict())

Initial Policy: {'verbosity': 'medium', 'tone': 'neutral'}
Initial State: {'step': 0, 'last_action': None, 'last_reward': None}


This is our agent before it has taken any action.
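The `policy` and `rl_state` modules are not shown in this notebook. As a rough sketch only, the two classes could look like the following, assuming they are plain containers whose fields match the printed output above. The method bodies are guesses for illustration, not the actual implementation.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ResponsePolicy:
    """Hypothetical stand-in for policy.ResponsePolicy."""
    verbosity: str = "medium"
    tone: str = "neutral"

    def as_dict(self) -> dict:
        return asdict(self)


@dataclass
class RLState:
    """Hypothetical stand-in for rl_state.RLState."""
    step: int = 0
    last_action: Optional[dict] = None
    last_reward: Optional[int] = None

    def update(self, action: dict, reward: int) -> None:
        # Record what was done and how it was scored, then advance the step.
        self.step += 1
        self.last_action = dict(action)
        self.last_reward = reward

    def as_dict(self) -> dict:
        return asdict(self)
```

Keeping the policy and the state as separate objects mirrors the RL vocabulary: the policy decides how to behave, while the state records what has happened so far.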

In [3]:
user_query = "Explain Reinforcement Learning"
print("Environment Input (Question):", user_query)

Environment Input (Question): Explain Reinforcement Learning


### ACTION: Agent Generates a Response

In [4]:
prompt = build_prompt(policy)
chain = prompt | llm

response = chain.invoke({"input": user_query}).content
print("Agent Response:\n")
print(response)

Agent Response:

Reinforcement Learning (RL) is a branch of machine learning focused on how agents can take actions in an environment to maximize cumulative rewards. It is inspired by behavioral psychology and involves learning through interactions with the environment. Here are the key components of reinforcement learning:

1. **Agent**: The learner or decision-maker that interacts with the environment.

2. **Environment**: The world with which the agent interacts. It provides feedback based on the actions taken by the agent.

3. **Actions**: The set of all possible moves the agent can make in the environment.

4. **States**: The different situations or configurations of the environment at any given time.

5. **Rewards**: Feedback from the environment in the form of numerical values that indicate the desirability of the agent's actions. The goal of the agent is to maximize the total reward over time.

6. **Policy**: A strategy or mapping from states to actions that defines the agent's

The response above is the result of action aₜ, which was chosen using the current policy.

In [5]:
action = policy.as_dict()
print("Action Taken (Behavior Choice):", action)

Action Taken (Behavior Choice): {'verbosity': 'medium', 'tone': 'neutral'}


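In this setup, the action is not the generated text itself; it is the behavior choice (verbosity and tone) that the policy selects. The cell above records that choice so it can later be stored in the state.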
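To make the idea of a behavior choice concrete, here is a minimal sketch of how an action dictionary might be turned into a prompt instruction. The function `policy_to_instruction` and the hint tables are hypothetical; the real `build_prompt` is not shown here and may work differently. Only `medium`, `detailed`, and `neutral` appear in this notebook's output, so the other levels are assumptions.

```python
# Hypothetical mapping from behavior choices to prompt instructions.
VERBOSITY_HINTS = {
    "brief": "Answer in two or three sentences.",        # assumed level
    "medium": "Answer in a few short paragraphs.",
    "detailed": "Answer thoroughly, with definitions and examples.",
}
TONE_HINTS = {
    "neutral": "Use a neutral, factual tone.",
    "friendly": "Use a warm, conversational tone.",      # assumed level
    "formal": "Use a formal, precise tone.",             # assumed level
}


def policy_to_instruction(action: dict) -> str:
    """Turn a behavior choice into a system-style instruction string."""
    return f"{VERBOSITY_HINTS[action['verbosity']]} {TONE_HINTS[action['tone']]}"


print(policy_to_instruction({"verbosity": "medium", "tone": "neutral"}))
```

The same user question paired with a different action yields a different instruction, which is why the action, rather than the text, is what the agent learns to choose.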

### ENVIRONMENT FEEDBACK (Human Input)

In [6]:
feedback = "too short"
print("Environment Feedback:", feedback)

Environment Feedback: too short


### REWARD: Interpret Feedback

In [7]:
interpretation = interpret_feedback(feedback, evaluator_llm)
reward = interpretation["reward"]

print("Feedback Interpretation:")
print(interpretation)
print("\nReward Signal:", reward)

Feedback Interpretation:
{'reward': -1, 'dimensions': {'verbosity': 'increase', 'tone': 'no_change'}}

Reward Signal: -1
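
The notebook delegates interpretation to `evaluator_llm`. For readers who want to see the shape of the mapping without an LLM call, the sketch below swaps in simple keyword rules. `interpret_feedback_rules` and its keyword lists are hypothetical, and the `decrease` value is an assumption; only `increase` and `no_change` appear in the output above.

```python
def interpret_feedback_rules(feedback: str) -> dict:
    """Keyword-based stand-in for the LLM-backed interpret_feedback."""
    text = feedback.lower()
    dimensions = {"verbosity": "no_change", "tone": "no_change"}

    # Map free-text complaints onto a direction for the verbosity dimension.
    if any(k in text for k in ("too short", "more detail", "elaborate")):
        dimensions["verbosity"] = "increase"
    elif any(k in text for k in ("too long", "shorter", "concise")):
        dimensions["verbosity"] = "decrease"

    # A requested change counts as negative reward; praise counts as positive.
    if any(v != "no_change" for v in dimensions.values()):
        reward = -1
    elif any(k in text for k in ("great", "perfect", "thanks")):
        reward = 1
    else:
        reward = 0

    return {"reward": reward, "dimensions": dimensions}


print(interpret_feedback_rules("too short"))
```

Keyword rules are brittle compared with an LLM evaluator, but they show that the reward is just a scalar plus a hint about which policy dimension to adjust.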


### LEARNING: Update Policy Using Reward

In [8]:
print("Policy BEFORE learning:", policy.as_dict())

learning_result = apply_policy_update(policy, interpretation)

print("Policy AFTER learning:", policy.as_dict())


Policy BEFORE learning: {'verbosity': 'medium', 'tone': 'neutral'}
Policy AFTER learning: {'verbosity': 'detailed', 'tone': 'neutral'}
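
The move from `medium` to `detailed` suggests an ordered scale of verbosity levels. A minimal sketch of such a rule-based update follows, assuming a three-level scale; the `brief` level, the `decrease` direction, and the function names are hypothetical, and the real `apply_policy_update` may differ. Tone is left untouched here because the notebook only exercises the verbosity dimension.

```python
# Assumed ordering of verbosity levels; only 'medium' and 'detailed'
# are confirmed by the notebook output.
VERBOSITY_LEVELS = ["brief", "medium", "detailed"]


def shift_verbosity(current: str, direction: str) -> str:
    """Move one step along the verbosity scale, clamping at both ends."""
    idx = VERBOSITY_LEVELS.index(current)
    if direction == "increase":
        idx = min(idx + 1, len(VERBOSITY_LEVELS) - 1)
    elif direction == "decrease":
        idx = max(idx - 1, 0)
    return VERBOSITY_LEVELS[idx]


def update_policy(policy: dict, interpretation: dict) -> dict:
    """Return a new policy dict adjusted by the interpreted feedback."""
    updated = dict(policy)
    direction = interpretation["dimensions"].get("verbosity", "no_change")
    updated["verbosity"] = shift_verbosity(policy["verbosity"], direction)
    return updated


before = {"verbosity": "medium", "tone": "neutral"}
feedback = {"reward": -1, "dimensions": {"verbosity": "increase", "tone": "no_change"}}
print(update_policy(before, feedback))
```

Clamping at the ends of the scale keeps repeated "too short" feedback from pushing the policy past its most detailed setting.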


### STATE TRANSITION (sₜ → sₜ₊₁)

In [9]:
state.update(
    action=action,
    reward=reward
)

print("Updated RL State:")
print(state.as_dict())

Updated RL State:
{'step': 1, 'last_action': {'verbosity': 'medium', 'tone': 'neutral'}, 'last_reward': -1}


### Re-run Agent with Updated Policy

In [10]:
prompt = build_prompt(policy)
chain = prompt | llm

new_response = chain.invoke({"input": user_query}).content

print("New Agent Response After Learning:\n")
print(new_response)

New Agent Response After Learning:

Reinforcement Learning (RL) is a subfield of machine learning concerned with how agents ought to take actions in an environment in order to maximize some notion of cumulative reward. It is inspired by behavioral psychology, where learning is thought to occur through interactions with the environment, leading to changes in behavior based on feedback from those interactions.

Here are the fundamental concepts and components of reinforcement learning:

1. **Agent**: The learner or decision maker that interacts with the environment. The agent takes actions based on its observations and aims to achieve a goal.

2. **Environment**: Everything that the agent interacts with. The environment can be defined as a set of states, actions, rewards, and transitions that the agent navigates through.

3. **State (s)**: A representation of the current situation of the agent in the environment. The state provides all the necessary information that the agent needs to de

### Full RL Loop Summary (Visualization)

In [11]:
summary = {
    "state": state.as_dict(),
    "policy": policy.as_dict(),
    "last_reward": reward
}

summary

{'state': {'step': 1,
  'last_action': {'verbosity': 'medium', 'tone': 'neutral'},
  'last_reward': -1},
 'policy': {'verbosity': 'detailed', 'tone': 'neutral'},
 'last_reward': -1}

### RL Mapping

| RL Concept | Where it appears |
|-----------|------------------|
| State     | RLState |
| Action    | Policy-driven behavior |
| Reward    | Interpreted feedback |
| Policy    | ResponsePolicy |
| Learning  | apply_policy_update |
| Environment | User + feedback |


## When DOES your system become RLHF?

#### 1. You introduce a learned reward model
Example:
* Train a small model to score responses
* That model replaces direct interpretation logic
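
As a toy illustration of what "a small model that scores responses" could mean, the sketch below fits a logistic regression on a single hand-picked feature (response length). The feature, the training data, and all function names are invented for illustration; a real reward model would typically be a neural network trained on human-labeled responses.

```python
import math
from typing import List, Sequence, Tuple


def features(response: str) -> List[float]:
    """Bias term plus scaled word count (a deliberately crude feature)."""
    return [1.0, len(response.split()) / 50.0]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(weights: Sequence[float], response: str) -> float:
    """Predicted probability that a human would rate the response as good."""
    x = features(response)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train_reward_model(
    examples: Sequence[Tuple[str, int]], epochs: int = 500, lr: float = 0.5
) -> List[float]:
    """Fit logistic regression with plain stochastic gradient ascent."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for text, label in examples:
            x = features(text)
            error = label - score(weights, text)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Invented labels: the imagined user dislikes short answers (0), likes long ones (1).
TOY_EXAMPLES = [
    ("word " * 5, 0),
    ("word " * 10, 0),
    ("word " * 60, 1),
    ("word " * 80, 1),
]
reward_weights = train_reward_model(TOY_EXAMPLES)
```

Once trained, `score(reward_weights, response)` can stand in for a human rating, which is the step that separates a learned reward from direct feedback interpretation.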

#### 2. You collect preference comparisons

Example:
* “Which response do you prefer? A or B?”
* Reward is derived from preference ranking
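
A common way to turn "A or B?" answers into a training signal is the Bradley-Terry model, where the probability of preferring A depends on the difference between the two reward scores. The sketch below shows that relationship and the resulting pairwise loss; the function names are illustrative, and the scores would come from a reward model in practice.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's observed preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# The loss is small when the chosen response already scores higher,
# and large when the reward model ranks the pair the wrong way round.
print(pairwise_loss(2.0, 0.0))
print(pairwise_loss(0.0, 2.0))
```

Minimizing this loss over many comparisons pushes the reward model to score preferred responses higher than rejected ones.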

#### 3. You optimize policy using PPO-like updates

Example:
* Policy is updated to maximize expected reward over time
* Not just rule-based updates
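
For contrast with the rule-based update used in this notebook, here is a sketch of PPO's clipped surrogate objective for a single sample. The `ratio` is the new policy's probability of an action divided by the old policy's probability, and `advantage` measures how much better the action was than expected. The function name and the default `epsilon` of 0.2 are illustrative choices.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO clipped objective for one (state, action) sample."""
    # Keep the probability ratio within [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to move the policy too far.
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped by the clip.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# A ratio near 1 passes through unchanged.
print(clipped_surrogate(ratio=1.0, advantage=2.0))
```

In a full PPO implementation this quantity is averaged over a batch and maximized by gradient ascent on the policy's parameters, which is what makes the update learned rather than hand-written.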



### RLHF is a special case of RL, not a different paradigm (and not the other way around)