### Objective

This notebook demonstrates Reinforcement Learning fundamentals using a simple
LLM-based adaptive agent.

We will explicitly see:
- State
- Action
- Reward
- Policy
- Learning
- Next Action

⚠️ No model retraining is involved: the LLM's weights never change. Only the prompt-level policy (verbosity and tone) is updated.

## One Liner

In [1]:
from policy import ResponsePolicy
from rl_state import RLState
from prompt import build_prompt
from llm import llm, evaluator_llm
from feedback_interpreter import interpret_feedback
from policy_adapter import apply_policy_update

In [2]:
# Initialize RL Components (State & Policy)
policy = ResponsePolicy()
state = RLState()

print("Initial Policy:", policy.as_dict())
print("Initial State:", state.as_dict())

Initial Policy: {'verbosity': 'medium', 'tone': 'neutral'}
Initial State: {'step': 0, 'last_action': None, 'last_reward': None}


This is our agent before it has taken any action.
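The `policy` and `rl_state` modules are not shown in this notebook. As a rough sketch of what could produce the output above, the two classes might be plain dataclasses. Everything below (field defaults, the `update` signature, the use of `dataclasses`) is an assumption inferred from the printed dictionaries, not the actual implementation.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ResponsePolicy:
    """Hypothetical stand-in for policy.ResponsePolicy: two behavior knobs."""
    verbosity: str = "medium"
    tone: str = "neutral"

    def as_dict(self) -> dict:
        return asdict(self)


@dataclass
class RLState:
    """Hypothetical stand-in for rl_state.RLState: a minimal interaction log."""
    step: int = 0
    last_action: Optional[dict] = None
    last_reward: Optional[int] = None

    def update(self, action: dict, reward: int) -> None:
        # Record what was done and how it was scored, then advance the step counter.
        self.step += 1
        self.last_action = dict(action)
        self.last_reward = reward

    def as_dict(self) -> dict:
        return asdict(self)


policy = ResponsePolicy()
state = RLState()
print("Initial Policy:", policy.as_dict())
print("Initial State:", state.as_dict())
```

Keeping the policy as a small set of named settings is what makes the later "learning" step possible without touching model weights: updating the policy only means changing one of these fields.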

In [4]:
user_query = "Explain Reinforcement Learning"
print("Environment Input (Question):", user_query)

Environment Input (Question): Explain Reinforcement Learning


#### ACTION: Agent Generates a Response

In [5]:
prompt = build_prompt(policy)
chain = prompt | llm

response = chain.invoke({"input": user_query}).content
print("Agent Response:\n")
print(response)

Agent Response:

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment. The core idea is to enable the agent to learn optimal behaviors through trial and error, receiving feedback in the form of rewards or penalties based on its actions.

Here are some key components of Reinforcement Learning:

1. **Agent**: The learner or decision maker that interacts with the environment.

2. **Environment**: Everything the agent interacts with. The environment responds to the agent's actions and provides feedback.

3. **State**: A representation of the current situation of the agent within the environment. The state can change based on the actions taken by the agent.

4. **Action**: The choices available to the agent that can affect the state of the environment.

5. **Reward**: A scalar feedback signal received after taking an action in a particular state. The reward indicates how good or bad the action was in achieving

#### This is Action aₜ — chosen using the current policy.

In [6]:
action = policy.as_dict()
print("Action Taken (Behavior Choice):", action)

Action Taken (Behavior Choice): {'verbosity': 'medium', 'tone': 'neutral'}


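In this setup, the RL action is not the generated text; it is the behavior choice (the verbosity and tone settings) that shaped the text.

The cell above records that choice, which is the policy → action step.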
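To make the idea of a "behavior choice" concrete, here is a minimal sketch of how a policy dictionary could be turned into instructions for the LLM. The notebook's real `build_prompt` is not shown and, given the `prompt | llm` usage, presumably returns a LangChain prompt object. The plain-string function below, its name `build_system_instructions`, the extra `short` and `friendly` settings, and all instruction wording are assumptions for illustration only.

```python
# Hypothetical wording for each policy setting; the notebook's real prompt text is not shown.
VERBOSITY_HINTS = {
    "short": "Answer in two or three sentences.",
    "medium": "Answer in a few short paragraphs.",
    "detailed": "Answer in depth, with examples and step-by-step explanations.",
}
TONE_HINTS = {
    "neutral": "Use a neutral, factual tone.",
    "friendly": "Use a warm, conversational tone.",
}


def build_system_instructions(policy: dict) -> str:
    """Translate a policy dict (the action) into instructions for the LLM."""
    return " ".join([
        "You are a helpful assistant.",
        VERBOSITY_HINTS[policy["verbosity"]],
        TONE_HINTS[policy["tone"]],
    ])


action = {"verbosity": "medium", "tone": "neutral"}
print(build_system_instructions(action))
```

The same user question paired with a different policy yields different instructions, which is why the policy settings, rather than the generated text, play the role of the action.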

### ENVIRONMENT FEEDBACK (Human Input)

In [7]:
feedback = "too short"
print("Environment Feedback:", feedback)

Environment Feedback: too short


### REWARD: Interpret Feedback

In [9]:
interpretation = interpret_feedback(feedback, evaluator_llm)
reward = interpretation["reward"]

print("Feedback Interpretation:")
print(interpretation)
print("\nReward Signal:", reward)

Feedback Interpretation:
{'reward': -1, 'dimensions': {'verbosity': 'increase', 'tone': 'no_change'}}

Reward Signal: -1
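The notebook delegates interpretation to `evaluator_llm`, and `interpret_feedback` itself is not shown. As an illustration of the mapping from free text to a reward plus per-dimension adjustments, here is a rule-based stand-in. The function name, the keyword lists, the `decrease` direction, and the `+1` reward for acceptable responses are all assumptions; only the output structure is taken from the cell above.

```python
def interpret_feedback_rules(feedback: str) -> dict:
    """Map free-text feedback to a reward and per-dimension adjustments.

    Hypothetical rule-based stand-in; the notebook uses an evaluator LLM instead.
    """
    text = feedback.lower()
    dimensions = {"verbosity": "no_change", "tone": "no_change"}

    # Assumed keyword lists; a real evaluator would handle far more phrasings.
    if any(k in text for k in ("too short", "more detail", "elaborate")):
        dimensions["verbosity"] = "increase"
    elif any(k in text for k in ("too long", "shorter", "concise")):
        dimensions["verbosity"] = "decrease"

    # Any requested change counts as a negative signal; otherwise assume +1.
    changed = any(v != "no_change" for v in dimensions.values())
    reward = -1 if changed else 1
    return {"reward": reward, "dimensions": dimensions}


print(interpret_feedback_rules("too short"))
```

An LLM evaluator replaces the keyword matching with a model call, but the contract stays the same: a scalar reward plus a direction for each policy dimension.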


### LEARNING: Update Policy Using Reward

In [10]:
print("Policy BEFORE learning:", policy.as_dict())

learning_result = apply_policy_update(policy, interpretation)

print("Policy AFTER learning:", policy.as_dict())


Policy BEFORE learning: {'verbosity': 'medium', 'tone': 'neutral'}
Policy AFTER learning: {'verbosity': 'detailed', 'tone': 'neutral'}
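The update from `medium` to `detailed` is consistent with a simple rule that moves one step along an ordered list of verbosity levels. The sketch below shows that idea. `apply_policy_update` itself is not shown in the notebook, so the function name `apply_update`, the `short` level, and the clamping at the ends of the list are assumptions. Unlike the notebook's version, which appears to modify `policy` in place, this sketch returns a new dictionary.

```python
# Ordered verbosity levels; "short" is an assumed level not seen in the notebook output.
VERBOSITY_LEVELS = ["short", "medium", "detailed"]


def apply_update(policy: dict, interpretation: dict) -> dict:
    """Return a new policy with verbosity moved one step in the requested direction."""
    direction = interpretation["dimensions"].get("verbosity", "no_change")
    idx = VERBOSITY_LEVELS.index(policy["verbosity"])

    if direction == "increase":
        idx = min(idx + 1, len(VERBOSITY_LEVELS) - 1)  # clamp at the most verbose level
    elif direction == "decrease":
        idx = max(idx - 1, 0)  # clamp at the least verbose level

    return {**policy, "verbosity": VERBOSITY_LEVELS[idx]}


before = {"verbosity": "medium", "tone": "neutral"}
interpretation = {"reward": -1, "dimensions": {"verbosity": "increase", "tone": "no_change"}}
after = apply_update(before, interpretation)
print("Policy BEFORE learning:", before)
print("Policy AFTER learning:", after)
```

This is a rule-based update rather than gradient-based learning: the reward and its interpretation select which direction to move, and the step size is fixed at one level.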


### STATE TRANSITION (sₜ → sₜ₊₁)

In [11]:
state.update(
    action=action,
    reward=reward
)

print("Updated RL State:")
print(state.as_dict())

Updated RL State:
{'step': 1, 'last_action': {'verbosity': 'medium', 'tone': 'neutral'}, 'last_reward': -1}


### Re-run Agent with Updated Policy

In [12]:
prompt = build_prompt(policy)
chain = prompt | llm

new_response = chain.invoke({"input": user_query}).content

print("New Agent Response After Learning:\n")
print(new_response)

New Agent Response After Learning:

Reinforcement Learning (RL) is a subfield of machine learning focused on how agents should take actions in an environment to maximize a cumulative reward. It is inspired by behavioral psychology and the concept of learning from interaction. The fundamental idea is that an agent learns to achieve a goal in an uncertain, potentially complex environment by interacting with it and receiving feedback in the form of rewards or penalties.

### Key Concepts in Reinforcement Learning:

1. **Agent**: The learner or decision maker that interacts with the environment. The agent takes actions based on its current knowledge and the state of the environment.

2. **Environment**: Everything the agent interacts with. The environment provides feedback based on the actions taken by the agent.

3. **State (s)**: A representation of the current situation of the agent within the environment. The state can contain all necessary information for the agent to make a decision.

### Full RL Loop Summary (Visualization)

In [11]:
summary = {
    "state": state.as_dict(),
    "policy": policy.as_dict(),
    "last_reward": reward
}

summary

{'state': {'step': 1,
  'last_action': {'verbosity': 'medium', 'tone': 'neutral'},
  'last_reward': -1},
 'policy': {'verbosity': 'detailed', 'tone': 'neutral'},
 'last_reward': -1}

### RL Mapping

| RL Concept | Where it appears |
|-----------|------------------|
| State     | RLState |
| Action    | Policy-driven behavior |
| Reward    | Interpreted feedback |
| Policy    | ResponsePolicy |
| Learning  | apply_policy_update |
| Environment | User + feedback |


## When DOES your system become RLHF?

#### 1. You introduce a learned reward model
Example:
* Train a small model to score responses
* That model replaces direct interpretation logic

#### 2. You collect preference comparisons

Example:
* “Which response do you prefer? A or B?”
* Reward is derived from preference ranking
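One common way to turn pairwise preferences into a reward signal is the Bradley-Terry model, which assigns each response a score so that the probability of preferring A over B is a sigmoid of the score difference. The sketch below fits one scalar score per response with plain gradient ascent. It is a toy under stated assumptions: the preference data is made up, the function name is hypothetical, and a real RLHF reward model would be a neural network that scores text rather than a lookup table of scores.

```python
import math


def fit_bradley_terry(n_items, preferences, lr=0.1, epochs=500):
    """Fit one score per item from (winner, loser) index pairs.

    Model: P(winner preferred over loser) = sigmoid(score[winner] - score[loser]).
    """
    scores = [0.0] * n_items
    for _ in range(epochs):
        for winner, loser in preferences:
            p = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient of log P with respect to the winner's score is (1 - p);
            # the loser's score gets the opposite update.
            scores[winner] += lr * (1.0 - p)
            scores[loser] -= lr * (1.0 - p)
    return scores


# Made-up comparisons between three candidate responses (indices 0, 1, 2).
preferences = [(0, 1), (0, 1), (1, 2), (0, 2), (1, 0)]
scores = fit_bradley_terry(3, preferences)
print("Learned reward scores:", scores)
```

The learned scores can then serve as the reward for a response, replacing the direct feedback interpretation used earlier in this notebook.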

#### 3. You optimize policy using PPO-like updates

Example:
* Policy is updated to maximize expected reward over time
* Not just rule-based updates
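For contrast with the rule-based update used in this notebook, the core of a PPO update is the clipped surrogate objective: the probability ratio between the new and old policy is multiplied by the advantage, and the ratio is clipped so that a single update cannot move the policy too far. The sketch below computes that objective for one sample. The function name and the default `epsilon=0.2` are illustrative choices, and a real implementation would average this over a batch and differentiate through the policy network.

```python
def ppo_clipped_objective(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate objective.

    ratio     = pi_new(a | s) / pi_old(a | s)
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum makes the objective a pessimistic (lower) bound.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: gains from raising the ratio stop once it exceeds 1 + epsilon.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
# Negative advantage: gains from lowering the ratio stop once it falls below 1 - epsilon.
print(ppo_clipped_objective(ratio=0.5, advantage=-1.0))
# Inside the clipping range the objective is just ratio * advantage.
print(ppo_clipped_objective(ratio=1.1, advantage=1.0))
```

The policy in this notebook has no action probabilities to form such a ratio, which is one reason its one-step-per-feedback update does not count as PPO-style optimization.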



### RLHF is a special case of RL, not a different paradigm (and not the other way around)