__*Reinforcement Loop*__
(Optional: use this section when you need to explain reinforcement learning with Python and an LLM without using any agentic framework.)

LLM-Driven Reinforcement Learning in Pure Python (Framework-Free Approach)

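This Python-only reinforcement-style loop demonstrates a lightweight, interpretable, and practical way to make an LLM agent improve over time. No model weights are trained: the "policy" is a strategy prompt that the agent rewrites when recent feedback is poor. It blends: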

✔ RL concepts

✔ LLM reasoning

✔ Pure Python

✔ Dynamic policy refinement

…all without needing CrewAI or any other agentic framework. LangChain appears below only as a thin client for the chat model.

In [1]:
pip list | grep -i langchain

langchain                                0.1.13
langchain-cohere                         0.1.5
langchain-community                      0.0.29
langchain-core                           0.1.53
langchain-openai                         0.1.7
langchain-text-splitters                 0.0.2
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Setup

import os
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
# Load environment variables
load_dotenv()

import warnings
warnings.filterwarnings('ignore')
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


In [2]:
# 1: Basic Reinforcement Loop Framework
# Message classes for the chat model; `llm` is reused from the setup cell
from langchain_core.messages import HumanMessage, SystemMessage

class ReinforcementAgent:
    def __init__(self, initial_strategy):
        self.strategy = initial_strategy
        self.learning_history = []
        self.success_rate = 0.5  # Start with neutral assumption
        
    def act(self, situation):
        """Take action based on current strategy"""
        response = llm.invoke([
            SystemMessage(content=f"Current strategy: {self.strategy}"),
            HumanMessage(content=f"Handle this situation: {situation}")
        ])
        return response.content
    
    def observe(self, feedback_score, details):
        """Receive feedback from environment"""
        observation = {
            'feedback_score': feedback_score,  # 1-5 scale
            'details': details,
            'strategy_used': self.strategy
        }
        self.learning_history.append(observation)
        return observation
    
    def adapt(self):
        """Update strategy based on learning history"""
        if not self.learning_history:
            return "No learning data yet"
            
        # Calculate recent success rate (last 5 interactions)
        recent = self.learning_history[-5:]
        if recent:
            success_count = sum(1 for obs in recent if obs['feedback_score'] >= 4)
            self.success_rate = success_count / len(recent)
        
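        # Adapt strategy if performing poorly
        if self.success_rate < 0.6:
            new_strategy = self._improve_strategy()
            if new_strategy == self.strategy:
                # _improve_strategy found no recent failures (score <= 2) to learn from
                return f"Low success rate ({self.success_rate:.1%}) but no recent failures to analyze, strategy unchanged"
            self.strategy = new_strategy
            return f"Strategy updated due to low success rate ({self.success_rate:.1%})"
        else:
            return f"Strategy working well ({self.success_rate:.1%} success), maintaining approach"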
    
    def _improve_strategy(self):
        """Get improved strategy from LLM based on failures"""
        recent_failures = [obs for obs in self.learning_history[-3:] if obs['feedback_score'] <= 2]
        
        if not recent_failures:
            return self.strategy
            
        failure_analysis = "\n".join([f"Failure: {obs['details']}" for obs in recent_failures])
        
        response = llm.invoke([
            SystemMessage(content="Analyze failures and suggest better strategy"),
            HumanMessage(content=f"Current strategy: {self.strategy}\nRecent failures:\n{failure_analysis}\nSuggest improved approach:")
        ])
        return response.content

# Create a customer service agent with initial strategy
cs_agent = ReinforcementAgent(
    initial_strategy="Be polite and provide standard shipping information"
)

print("Reinforcement Loop Agent Created!")
print(f"Initial Strategy: {cs_agent.strategy}")

Reinforcement Loop Agent Created!
Initial Strategy: Be polite and provide standard shipping information
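
The decision inside `adapt` can be checked without calling the LLM. The sketch below is a minimal, standalone restatement of that rule, assuming the same constants as the class above (a window of 5, a score of 4 or more counting as a success, and a 0.6 threshold). The function names `rolling_success_rate` and `should_adapt` and the score sequence are illustrative, not part of the notebook.

```python
def rolling_success_rate(scores, window=5, success_threshold=4):
    """Fraction of the last `window` scores that count as successes."""
    recent = scores[-window:]
    if not recent:
        return None  # no feedback yet
    return sum(1 for s in recent if s >= success_threshold) / len(recent)


def should_adapt(scores, min_rate=0.6):
    """Mirror the `success_rate < 0.6` check used in `adapt`."""
    rate = rolling_success_rate(scores)
    return rate is not None and rate < min_rate


# Hypothetical feedback sequence: one poor rating followed by three good ones.
history = []
for score in [1, 5, 5, 5]:
    history.append(score)
    rate = rolling_success_rate(history)
    print(f"scores={history} rate={rate:.1%} adapt={should_adapt(history)}")
```

With this sequence the rate climbs from 0.0% to 50.0%, 66.7%, and 75.0%, so adaptation is requested only after the first two observations.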


In [3]:
print(cs_agent.strategy)

Be polite and provide standard shipping information


In [4]:
# 2: Demo - Learning from Customer Interactions
def demonstrate_learning_loop():
    print("DEMONSTRATING REINFORCEMENT LEARNING LOOP")
    print("=" * 50)
    
    # Scenario 1: First interaction (poor outcome)
    situation1 = "Customer: My order #12345 is 3 days late! I need it for an important event."
    
    print("\nINTERACTION 1: Late Order Complaint")
    print(f"Situation: {situation1}")
    
    # ACT
    response1 = cs_agent.act(situation1)
    print(f"AI Response: {response1}")
    
    # OBSERVE (Customer gives 1/5 stars)
    observation1 = cs_agent.observe(
        feedback_score=1, 
        details="Customer angry - AI was dismissive about late order"
    )
    print(f"Feedback: {observation1['feedback_score']}/5 - {observation1['details']}")
    
    # ADAPT
    adaptation1 = cs_agent.adapt()
    print(f"Learning: {adaptation1}")
    print(f"Updated Strategy: {cs_agent.strategy}")

demonstrate_learning_loop()

DEMONSTRATING REINFORCEMENT LEARNING LOOP

INTERACTION 1: Late Order Complaint
Situation: Customer: My order #12345 is 3 days late! I need it for an important event.
AI Response: I'm sorry to hear that your order is delayed. I understand how important it is to receive it on time, especially for an event. I recommend checking the tracking information for the latest updates. If you need further assistance, please let me know, and I can help you with any options available. Thank you for your patience!
Feedback: 1/5 - Customer angry - AI was dismissive about late order
Learning: Strategy updated due to low success rate (0.0%)
Updated Strategy: To address the recent failure of a customer feeling dismissed regarding a late order, it's essential to adopt a more empathetic and proactive approach. Here’s an improved strategy:

### Improved Approach:

1. **Acknowledge the Customer's Feelings:**
   - Start by recognizing the customer's frustration. Use empathetic language to show that you underst

In [5]:
print(cs_agent.strategy)

To address the recent failure of a customer feeling dismissed regarding a late order, it's essential to adopt a more empathetic and proactive approach. Here’s an improved strategy:

### Improved Approach:

1. **Acknowledge the Customer's Feelings:**
   - Start by recognizing the customer's frustration. Use empathetic language to show that you understand their concerns.
   - Example: "I completely understand how frustrating it can be to experience a delay with your order. I’m here to help you."

2. **Provide Specific Information:**
   - Instead of just giving standard shipping information, provide specific details about the order status. If possible, include reasons for the delay and an estimated resolution time.
   - Example: "Your order was scheduled to arrive on [original date], but due to [specific reason], it has been delayed. We expect it to be shipped by [new date]."

3. **Offer Solutions:**
   - Suggest alternatives or solutions to mitigate the inconvenience. This could include 

In [5]:
# 3: Multiple Learning Iterations
def run_learning_simulation():
    """Simulate multiple customer interactions with learning"""
    
    scenarios = [
        {
            "situation": "Customer: I received the wrong size. This is very frustrating!",
            "good_response": "Apologize sincerely, offer immediate replacement and discount",
            "poor_response": "Explain return policy without empathy"
        },
        {
            "situation": "Customer: The product quality is much lower than advertised",
            "good_response": "Acknowledge concern, offer refund or exchange, ask for details",
            "poor_response": "Defend product quality and refer to manufacturer"
        },
        {
            "situation": "Customer: Your website charged me twice for the same order!",
            "good_response": "Apologize for error, verify duplicate charge, process immediate refund",
            "poor_response": "Ask customer to check their bank statement first"
        }
    ]
    
    print("\nFULL LEARNING SIMULATION")
    print("=" * 50)
    
    for i, scenario in enumerate(scenarios, 1):
        print(f"\n ITERATION {i}:")
        print(f"Situation: {scenario['situation']}")
        
        # ACT
        response = cs_agent.act(scenario['situation'])
        print(f"AI Response: {response[:100]}...")
        
        # Simulate feedback based on response quality
        if any(keyword in response.lower() for keyword in ['apologize', 'sorry', 'immediately', 'refund']):
            feedback_score = 5  # Good response
            feedback_details = "Customer satisfied - empathetic and solution-oriented"
        else:
            feedback_score = 2  # Poor response  
            feedback_details = "Customer frustrated - lacking empathy and immediate solutions"
        
        # OBSERVE
        cs_agent.observe(feedback_score, feedback_details)
        print(f"Feedback: {feedback_score}/5 - {feedback_details}")
        
        # ADAPT
        learning = cs_agent.adapt()
        print(f"Learning: {learning}")
        
        print(f"Success Rate: {cs_agent.success_rate:.1%}")

run_learning_simulation()



 ITERATION 1:
Situation: Customer: I received the wrong size. This is very frustrating!
AI Response: ### Response to Customer:

1. **Acknowledge the Customer's Feelings:**
   - "I’m really sorry to hea...
Feedback: 5/5 - Customer satisfied - empathetic and solution-oriented
Learning: Strategy updated due to low success rate (50.0%)
Success Rate: 50.0%

 ITERATION 2:
Situation: Customer: The product quality is much lower than advertised
AI Response: ### Response to Customer Concern About Product Quality

1. **Immediate Acknowledgment and Apology:**...
Feedback: 5/5 - Customer satisfied - empathetic and solution-oriented
Learning: Strategy working well (66.7% success), maintaining approach
Success Rate: 66.7%

 ITERATION 3:
Situation: Customer: Your website charged me twice for the same order!
AI Response: 1. **Immediate Acknowledgment and Apology:**
   - "I’m really sorry to hear that you were charged tw...
Feedback: 5/5 - Customer satisfied - empathetic and solution-oriented
Learning
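
The simulated feedback in the run above is a plain keyword test, which is easy to inspect in isolation. The sketch below repeats that test as a standalone function so its behavior, including its blind spot, is visible without an API call. The name `keyword_feedback` and the three sample responses are hypothetical and written only for this illustration.

```python
GOOD_KEYWORDS = ("apologize", "sorry", "immediately", "refund")


def keyword_feedback(response):
    """Return (score, details) using the same keyword test as the simulation."""
    if any(keyword in response.lower() for keyword in GOOD_KEYWORDS):
        return 5, "Customer satisfied - empathetic and solution-oriented"
    return 2, "Customer frustrated - lacking empathy and immediate solutions"


# Hypothetical responses, written only to exercise the heuristic.
helpful = "I'm sorry about the duplicate charge. I will refund it today."
curt = "Please check your bank statement."
unhelpful_but_matching = "Sorry, there is nothing we can do."

for text in (helpful, curt, unhelpful_but_matching):
    score, _ = keyword_feedback(text)
    print(f"{score}/5  <- {text}")
```

The third response scores 5/5 even though it offers no solution, which is why a keyword proxy like this is best treated as a placeholder for real customer ratings.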

In [6]:
print(cs_agent.strategy)

To address the recent failure of a customer feeling dismissed regarding a late order, it's essential to adopt a more empathetic and proactive approach. Here’s an improved strategy:

### Improved Approach:

1. **Acknowledge the Customer's Feelings:**
   - Start by recognizing the customer's frustration. Use empathetic language to show that you understand their concerns.
   - Example: "I completely understand how frustrating it can be to experience a delay with your order. I’m here to help you."

2. **Provide Specific Information:**
   - Instead of just giving standard shipping information, provide specific details about the order status. If possible, include reasons for the delay and an estimated resolution time.
   - Example: "Your order was scheduled to arrive on [original date], but due to [specific reason], it has been delayed. We expect it to be shipped by [new date]."

3. **Offer Solutions:**
   - Suggest alternatives or solutions to mitigate the inconvenience. This could include 

In [7]:
# 4: Advanced Planning with Reinforcement
class AdvancedPlanningAgent:
    def __init__(self, domain):
        self.domain = domain
        self.plan_templates = self._load_plan_templates()
        self.execution_history = []
    
    def _load_plan_templates(self):
        """Initial plan templates for different scenarios"""
        return {
            "complaint": ["1. Acknowledge issue", "2. Investigate details", "3. Offer solution", "4. Follow up"],
            "inquiry": ["1. Understand question", "2. Provide information", "3. Check understanding", "4. Offer further help"],
            "emergency": ["1. Assess urgency", "2. Take immediate action", "3. Escalate if needed", "4. Document resolution"]
        }
    
    def create_plan(self, situation):
        """Create adaptive plan based on situation type"""
        situation_type = self._classify_situation(situation)
        base_plan = self.plan_templates.get(situation_type, self.plan_templates["inquiry"])
        
        # Adapt plan based on context
        adapted_plan = self._adapt_plan(base_plan, situation)
        return adapted_plan
    
    def _classify_situation(self, situation):
        """Classify what type of situation this is"""
        response = llm.invoke([
            SystemMessage(content="Classify customer situation type: complaint, inquiry, or emergency"),
            HumanMessage(content=f"Situation: {situation}\nRespond with just one word: complaint, inquiry, or emergency")
        ])
        # Tolerate a trailing period; unknown labels fall back to "inquiry" in create_plan
        return response.content.strip().lower().rstrip(".")
    
    def _adapt_plan(self, base_plan, situation):
        """Adapt plan based on specific situation details"""
        plan_str = "\n".join(base_plan)
        response = llm.invoke([
            SystemMessage(content="Adapt this plan to fit the specific situation better"),
            HumanMessage(content=f"Base plan:\n{plan_str}\n\nSituation: {situation}\nSuggest 1-2 specific adaptations:")
        ])
        adapted_steps = base_plan + [f"Adapted: {response.content}"]
        return adapted_steps
    
    def execute_with_feedback(self, situation):
        """Full execution with reinforcement learning"""
        print(f"\nExecuting plan for: {situation}")
        
        # Create plan
        plan = self.create_plan(situation)
        print("Plan created:")
        for step in plan:
            print(f"  - {step}")
        
        # Execute plan
        result = self.act(situation, plan)
        print(f"Execution result: {result[:150]}...")
        
        # Simulate feedback
        feedback = self.simulate_feedback(result)
        print(f"Feedback: {feedback}/5")
        
        # Learn from execution
        self.learn_from_execution(situation, plan, feedback)
        
        return result
    
    def act(self, situation, plan=None):
        """Respond to the situation, following the plan when one is provided"""
        system_prompt = "Execute this customer service interaction professionally"
        if plan:
            # Include the plan steps so the response actually follows them
            system_prompt += "\nFollow this plan:\n" + "\n".join(plan)
        response = llm.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=situation)
        ])
        return response.content
    
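    def simulate_feedback(self, response):
        """Simulate customer feedback based on response quality"""
        positive_indicators = ['apologize', 'understand', 'solution', 'immediately', 'help', 'resolve']
        score = 3  # Neutral start
        
        # Each matched indicator adds 0.5 and nothing subtracts, so the result
        # always lies between 3 and 5; the `feedback <= 2` branch in
        # learn_from_execution is never reached with this heuristic
        for indicator in positive_indicators:
            if indicator in response.lower():
                score += 0.5
                
        return min(5, max(1, score))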
    
    def learn_from_execution(self, situation, plan, feedback):
        """Reinforcement learning from execution results"""
        learning_entry = {
            'situation': situation,
            'plan_used': plan,
            'feedback': feedback,
            'learned_at': len(self.execution_history) + 1
        }
        self.execution_history.append(learning_entry)
        
        # Simple learning: if feedback poor, note for future adaptation
        if feedback <= 2:
            print("Learning: This approach didn't work well for similar situations")

# Demo advanced planning agent
print("\nADVANCED PLANNING AGENT DEMO")
print("=" * 50)
planning_agent = AdvancedPlanningAgent("customer_service")

test_situations = [
    "Customer: Your delivery service lost my package! It had important documents inside.",
    "Customer: Can you tell me about your return policy for electronics?",
    "Customer: There's a security issue with my account - someone unauthorized made purchases!"
]

for situation in test_situations:
    planning_agent.execute_with_feedback(situation)



Executing plan for: Customer: Your delivery service lost my package! It had important documents inside.
Plan created:
  - 1. Acknowledge issue
  - 2. Investigate details
  - 3. Offer solution
  - 4. Follow up
  - Adapted: 1. Acknowledge the urgency: Begin by expressing understanding of the situation's seriousness, emphasizing the importance of the documents and the inconvenience caused by the lost package.

2. Provide immediate assistance: Instead of just investigating the details, offer to initiate a trace on the package right away and provide the customer with a timeline for when they can expect updates. Additionally, discuss potential options for replacing the important documents if they cannot be recovered quickly.
Execution result: Customer Service Representative: I’m very sorry to hear that your package has been lost, and I understand how important those documents are to you. Le...
Feedback: 4.5/5

Executing plan for: Customer: Can you tell me about your return policy for elect
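
The plan lookup in `create_plan` depends on the classifier returning exactly one of the three template keys. A chat model may answer with extra punctuation or words, so the sketch below shows one way to map free-form text onto a known key before the lookup. It is an assumption-laden illustration: `normalize_situation_type` and `VALID_TYPES` are hypothetical names, and the fallback to "inquiry" simply mirrors the default already used in `create_plan`.

```python
import re

VALID_TYPES = ("complaint", "inquiry", "emergency")


def normalize_situation_type(raw_label, default="inquiry"):
    """Map free-form classifier text onto one of the known template keys."""
    # Split into lowercase words so punctuation and prefixes are ignored
    words = re.findall(r"[a-z]+", raw_label.lower())
    for word in words:
        if word in VALID_TYPES:
            return word
    return default


for raw in ["complaint", "Emergency.", "Type: inquiry", "I am not sure"]:
    print(f"{raw!r} -> {normalize_situation_type(raw)}")
```

Returning the first recognized word keeps the function deterministic. A stricter variant could raise an error on unknown labels instead of silently defaulting.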

In [7]:
# 5: Real-time Strategy Adaptation Demo
def show_strategy_evolution():
    """Show how strategy evolves through reinforcement"""
    
    print("\n STRATEGY EVOLUTION THROUGH REINFORCEMENT")
    print("=" * 50)
    
    # Track strategy changes
    strategies = []
    
    # Create new agent
    learning_agent = ReinforcementAgent(
        initial_strategy="Provide factual information to customer queries"
    )
    
    strategies.append(learning_agent.strategy)
    
    # Learning iterations
    learning_scenarios = [
        ("Customer is angry about service outage", 1, "Too factual, lacked empathy"),
        ("Customer confused about billing", 3, "Clear but could be more helpful"), 
        ("Customer needs urgent help", 2, "Too slow, didn't recognize urgency"),
        ("Customer happy but has question", 5, "Good balance of info and friendliness")
    ]
    
    for i, (situation, feedback, reason) in enumerate(learning_scenarios, 1):
        print(f"\nLearning Cycle {i}:")
        print(f"Situation: {situation}")
        print(f"Current Strategy: {learning_agent.strategy}")
        
        # ACT and OBSERVE
        learning_agent.act(situation)
        learning_agent.observe(feedback, reason)
        
        # ADAPT 
        change = learning_agent.adapt()
        strategies.append(learning_agent.strategy)
        
        print(f"Strategy Change: {change}")
        print(f"New Strategy: {learning_agent.strategy}")
    
    print(f"\nFINAL STRATEGY: {learning_agent.strategy}")
    print(f"Final Success Rate: {learning_agent.success_rate:.1%}")

show_strategy_evolution()


 STRATEGY EVOLUTION THROUGH REINFORCEMENT

Learning Cycle 1:
Situation: Customer is angry about service outage
Current Strategy: Provide factual information to customer queries
Strategy Change: Strategy updated due to low success rate (0.0%)
New Strategy: To enhance your current strategy of providing factual information while addressing the recent failures of lacking empathy, consider the following improved approach:

### 1. **Empathetic Communication:**
   - **Acknowledge Emotions:** Start by recognizing the customer's feelings or concerns. Use phrases like "I understand how frustrating this must be for you" or "I can see why you would feel that way."
   - **Personalize Responses:** Use the customer's name and reference their specific situation to make the interaction feel more personal and less robotic.

### 2. **Active Listening:**
   - **Encourage Dialogue:** Ask open-ended questions to better understand the customer's needs and feelings. For example, "Can you tell me more about w

Key Learning Points from this Code:

Reinforcement Loop: ACT → OBSERVE → ADAPT cycle

Strategy Evolution: Agents improve based on feedback

Adaptive Planning: Plans change based on situation type

Continuous Learning: Each interaction informs future behavior
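
The ACT → OBSERVE → ADAPT cycle can also be exercised offline, which helps when testing the loop logic without an API key. The sketch below swaps the chat model for a canned function. `stub_llm`, `LoopAgent`, and `toy_reward` are hypothetical names introduced only for this illustration; the thresholds follow the `ReinforcementAgent` class above.

```python
def stub_llm(system_prompt, user_prompt):
    """Hypothetical stand-in for a chat model; returns canned text."""
    if "Suggest improved approach" in user_prompt:
        return "Apologize first, then offer a concrete fix"
    if "Apologize" in system_prompt:
        return "I'm sorry about this. Here is what I can do right now."
    return "Our standard shipping takes 5 business days."


class LoopAgent:
    def __init__(self, strategy, llm):
        self.strategy = strategy
        self.llm = llm
        self.history = []

    def act(self, situation):
        return self.llm(f"Current strategy: {self.strategy}", situation)

    def observe(self, score, details):
        self.history.append({"score": score, "details": details})

    def adapt(self):
        recent = self.history[-5:]
        if not recent:
            return None  # nothing observed yet
        rate = sum(1 for o in recent if o["score"] >= 4) / len(recent)
        failures = [o["details"] for o in self.history[-3:] if o["score"] <= 2]
        if rate < 0.6 and failures:
            prompt = (f"Current strategy: {self.strategy}\n"
                      f"Recent failures: {failures}\nSuggest improved approach:")
            self.strategy = self.llm("Analyze failures", prompt)
        return rate


def toy_reward(response):
    """Hypothetical reward: 5 if the reply apologizes, else 1."""
    return 5 if "sorry" in response.lower() else 1


agent = LoopAgent("Provide standard shipping information", stub_llm)
for turn in range(3):
    reply = agent.act("Customer: my order is late!")  # ACT
    score = toy_reward(reply)
    agent.observe(score, f"reply was: {reply}")       # OBSERVE
    rate = agent.adapt()                              # ADAPT
    print(f"turn={turn} score={score} rate={rate:.0%} strategy={agent.strategy!r}")
```

Because the model is injected as a plain callable, the same class could be pointed at a real client later by passing a function that wraps the API call.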

## When to Use This Approach:

This Python-only RL loop is ideal when:

You want learning over time but don’t need complex RL algorithms.

You're building conversational or reasoning agents that rely on strategy, not action probabilities.

You want full control and transparency in code, without agent frameworks.

## Production Considerations:

To make this RL-style loop production-ready, apply the following upgrades:

1. Persistent Memory Storage

2. Reward Automation

3. Strategy Versioning

4. Guardrails + Safety Policies

5. Rapid Adaptation with Weighted Feedback

6. Caching for Cost Optimization
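
Three of the upgrades listed above (persistent memory, strategy versioning, and weighted feedback) can be sketched with the standard library alone. What follows is a minimal illustration under stated assumptions, not a production design: the JSON file layout, the `StrategyStore` class, the `weighted_success_rate` function, and the decay factor of 0.5 are all hypothetical choices made for this example.

```python
import json
import os
import tempfile


def weighted_success_rate(scores, decay=0.5, success_threshold=4):
    """Success rate where each older score counts `decay` times the next one."""
    if not scores:
        return None
    # Newest score gets weight 1, the one before it `decay`, then decay**2, ...
    weights = [decay ** age for age in range(len(scores) - 1, -1, -1)]
    hits = sum(w for w, s in zip(weights, scores) if s >= success_threshold)
    return hits / sum(weights)


class StrategyStore:
    """Append-only strategy versions plus feedback, saved as one JSON file."""

    def __init__(self, path):
        self.path = path
        self.versions = []   # list of {"version": int, "strategy": str}
        self.feedback = []   # list of {"version": int, "score": int}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
            self.versions = data["versions"]
            self.feedback = data["feedback"]

    def add_version(self, strategy):
        self.versions.append({"version": len(self.versions) + 1, "strategy": strategy})
        self._save()

    def add_feedback(self, score):
        # Assumes at least one version exists; feedback is tied to the latest one
        self.feedback.append({"version": self.versions[-1]["version"], "score": score})
        self._save()

    def rollback(self):
        """Re-append the previous strategy as a new version (history is kept)."""
        if len(self.versions) >= 2:
            self.add_version(self.versions[-2]["strategy"])

    def _save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump({"versions": self.versions, "feedback": self.feedback}, f, indent=2)


path = os.path.join(tempfile.mkdtemp(), "agent_state.json")
store = StrategyStore(path)
store.add_version("Be polite and provide standard shipping information")
store.add_feedback(1)
store.add_version("Acknowledge the delay, then offer a concrete remedy")
store.add_feedback(5)

reloaded = StrategyStore(path)  # simulates a process restart
scores = [entry["score"] for entry in reloaded.feedback]
print(len(reloaded.versions), scores, weighted_success_rate(scores))
```

With decay below 1, a recent good rating outweighs an older poor one, so the agent reacts faster to a change in feedback than the unweighted five-interaction window does. A database or key-value store would be a natural replacement for the JSON file once several processes share the state.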
