# Lesson 5: Implementing LLM Feedback Loops

## Code Generation and Debugging Assistant

In this hands-on exercise, you will implement iterative feedback loops where an AI generates, tests, and revises Python code snippets based on test results and feedback.

We will use an LLM feedback loop to create and iteratively improve a Python function called `process_data`, whose intended behavior is best described by the following examples:

```python
process_data([1, 2, 3, 4, 5], mode='average')  # Should return 3.0
process_data([1, 2, 'a', 3], mode='sum')  # Should return 6
```


### Outline:

- Setup
- Define Task and Test Cases
- Initial Generation
- Expand the Test Cases
- First Iteration with Feedback
- Create Feedback Loop
- Reflection

## 1. Setup

Import the necessary libraries and define helper functions, including an LLM client, a code execution environment, and a test runner.

In [1]:
# Import necessary libraries
# No changes needed in this cell
from openai import OpenAI
from IPython.display import Markdown, display
import traceback
import io
import os
from contextlib import redirect_stdout, redirect_stderr
from enum import Enum

In [2]:
# Set up LLM credentials

client = OpenAI(
    base_url="https://openai.vocareum.com/v1",
    # Uncomment one of the following
    # api_key="**********",  # <--- TODO: Fill in your Vocareum API key here
    # api_key=os.getenv(
    #     "OPENAI_API_KEY"
    # ),  # <-- Alternately, set as an environment variable
)

# If using OpenAI's API endpoint
# client = OpenAI()

In [3]:
# Define helper functions
# No changes needed in this cell


class OpenAIModels(str, Enum):
    GPT_4O_MINI = "gpt-4o-mini"
    GPT_41_MINI = "gpt-4.1-mini"
    GPT_41_NANO = "gpt-4.1-nano"


MODEL = OpenAIModels.GPT_41_NANO


def get_completion(messages=None, system_prompt=None, user_prompt=None, model=MODEL):
    """
    Function to get a completion from the OpenAI API.
    Args:
        messages: Optional list of chat messages (copied, never mutated)
        system_prompt: The system prompt
        user_prompt: The user prompt
        model: The model to use (default is MODEL, i.e. gpt-4.1-nano)
    Returns:
        The completion text
    """

    # Copy the caller's list; fall back to an empty list when messages is None
    messages = list(messages) if messages else []
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})
    if user_prompt:
        messages.append({"role": "user", "content": user_prompt})
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content


def execute_code(code, test_cases):
    """
    Executes Python code and returns the results of test cases.
    Args:
        code: String containing Python code
        test_cases: List of dictionaries with inputs and expected outputs
    Returns:
        Dictionary containing execution results and test outcomes
    """
    results = {"execution_error": None, "test_results": [], "passed": 0, "failed": 0}

    # Create a namespace for execution
    namespace = {}

    # Capture stdout and stderr
    output_buffer = io.StringIO()

    try:
        with redirect_stdout(output_buffer), redirect_stderr(output_buffer):
            exec(code, namespace)

        # Run test cases
        for i, test in enumerate(test_cases):
            inputs = test["inputs"]
            expected = test["expected"]

            # Execute the function with test inputs
            try:
                if isinstance(inputs, dict):
                    actual = namespace["process_data"](**inputs)
                else:
                    actual = namespace["process_data"](*inputs)

                passed = actual == expected

                if passed:
                    results["passed"] += 1
                else:
                    results["failed"] += 1

                results["test_results"].append(
                    {
                        "test_id": i + 1,
                        "inputs": inputs,
                        "expected": expected,
                        "actual": actual,
                        "passed": passed,
                    }
                )
            except Exception as e:
                # If the error is the expected type, mark as passed
                passed = isinstance(expected, type) and isinstance(e, expected)
                results["test_results"].append(
                    {
                        "test_id": i + 1,
                        "inputs": inputs,
                        "expected": expected,
                        "error": str(e),
                        "passed": passed,
                    }
                )
                if passed:
                    results["passed"] += 1
                else:
                    results["failed"] += 1

    except Exception as e:
        results["execution_error"] = {
            "error_type": type(e).__name__,
            "error_message": str(e),
            "traceback": traceback.format_exc(),
        }

    results["stdout"] = output_buffer.getvalue()
    return results


# Function to format test results as feedback for the model
def format_feedback(results):
    """
    Formats test results into a clear feedback string for the model.
    Args:
        results: Dictionary containing execution results
    Returns:
        Formatted feedback string
    """
    feedback = []

    if results["execution_error"]:
        feedback.append(
            f"ERROR: Code execution failed with {results['execution_error']['error_type']}"
        )
        feedback.append(f"Message: {results['execution_error']['error_message']}")
        feedback.append("Traceback:")
        feedback.append(results["execution_error"]["traceback"])
        feedback.append("\nPlease fix the syntax or runtime errors in the code.")
        return "\n".join(feedback)

    feedback.append(
        f"Test Results: {results['passed']} passed, {results['failed']} failed"
    )

    if results["stdout"]:
        feedback.append(f"\nStandard output:\n{results['stdout']}")

    if results["failed"] > 0:
        feedback.append("\nFailed Test Cases:")
        for test in results["test_results"]:
            if not test.get("passed"):
                feedback.append(f"\nTest #{test['test_id']}:")
                feedback.append(f"  Inputs: {test['inputs']}")
                feedback.append(f"  Expected: {test['expected']}")
                if "actual" in test:
                    feedback.append(f"  Actual: {test['actual']}")
                if "error" in test:
                    feedback.append(f"  Error: {test['error']}")

    return "\n".join(feedback)
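
The pass/fail rule inside `execute_code` is easy to miss: tuple inputs are unpacked as positional arguments, dict inputs as keyword arguments, and a raised exception counts as a pass only when `expected` is that exception's class. The sketch below isolates that rule so you can try it without calling the LLM. The names `check_case`, `toy_process_data`, and `example_cases` are made up for this illustration and are not part of the notebook's helpers.

```python
def check_case(func, test):
    """Return True if func satisfies one test-case dictionary."""
    inputs, expected = test["inputs"], test["expected"]
    try:
        # A dict is unpacked as keyword arguments, anything else as positional
        actual = func(**inputs) if isinstance(inputs, dict) else func(*inputs)
    except Exception as exc:
        # An exception passes only when the test expects that exception type
        return isinstance(expected, type) and isinstance(exc, expected)
    return actual == expected


def toy_process_data(numbers, mode="sum"):
    # Hypothetical stand-in for model-generated code
    if mode != "sum":
        raise ValueError(f"Invalid mode: {mode}")
    return sum(numbers)


example_cases = [
    {"inputs": ([1, 2, 3], "sum"), "expected": 6},
    {"inputs": {"numbers": [1, 2, 3], "mode": "sum"}, "expected": 6},
    {"inputs": ([1, 2, 3], "bogus"), "expected": ValueError},
]
case_outcomes = [check_case(toy_process_data, case) for case in example_cases]
print(case_outcomes)  # [True, True, True]
```

Note that a function which returns `None` instead of raising fails a test whose expected value is an exception class, because `None == ValueError` is false.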

## 2. Define Task and Test Cases

We will create a Python function called `process_data` that analyzes numerical data with the following (possibly incomplete) set of requirements:

1. The function should accept a list of numbers and an optional parameter 'mode' that can be 'sum' or 'average' (default should be 'average').
2. If mode is 'sum', return the sum of all numbers.
3. If mode is 'average', return the average (mean) of all numbers.

Example:
```python
process_data([1, 2, 3, 4, 5], mode='average')  # Should return 3.0
process_data([1, 2, 'a', 3], mode='sum')  # Should return 6
```

In [4]:
# Write out a task description for the LLM
# TODO: Complete this cell by replacing the **********
task_description = """
We will create a Python function called `process_data` that analyzes numerical data with the following (possibly incomplete) set of requirements:

1. The function should accept a list of numbers and an optional parameter 'mode' that can be 'sum' or 'average' (default should be 'average').
2. If mode is 'sum', return the sum of all numbers.
3. If mode is 'average', return the average (mean) of all numbers.

Example:
```python
process_data([1, 2, 3, 4, 5], mode='average')  # Should return 3.0
process_data([1, 2, 'a', 3], mode='sum')  # Should return 6
```
"""

In [5]:
# Write out test cases to test the LLM's output
# TODO: Add more test cases following the examples provided replacing the **********
test_cases = [
    {"inputs": ([1, 2, 3, 4, 5], "sum"), "expected": 15},
    {"inputs": ([1, 2, 3, 4, 5], "average"), "expected": 3.0},
    {"inputs": ([11, 12, 13, 14, 15], "sum"), "expected": 65},
    {"inputs": ([11, 12, 13, 14, 15], "average"), "expected": 13.0},
    {"inputs": ([1.1, 2.2, 3.3, 4.4, 5.5], "sum"), "expected": 16.5},
    {"inputs": ([1.1, 2.2, 3.3, 4.4, 5.5], "average"), "expected": 3.3},
    {"inputs": ([-1, -2, -3, -4, -5], "sum"), "expected": -15},
    {"inputs": ([-1, -2, -3, -4, -5], "average"), "expected": -3.0},
]
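
The test runner compares results with `==`. That is fine for integer results, but float results can differ in the last bits depending on how a generated function orders its arithmetic. As a hedged sketch (not part of the notebook's helpers), a tolerant comparison could look like the following; `results_match` and its tolerance values are assumptions chosen for illustration.

```python
import math


def results_match(actual, expected, rel_tol=1e-9):
    """Compare a result to its expected value, tolerating float rounding."""
    numeric = (int, float)
    if (
        isinstance(actual, numeric)
        and isinstance(expected, numeric)
        # bool is a subclass of int, so keep it out of the numeric branch
        and not isinstance(actual, bool)
        and not isinstance(expected, bool)
    ):
        return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=1e-12)
    # Non-numeric values (None, strings, ...) fall back to exact equality
    return actual == expected


print(0.1 + 0.2 == 0.3)               # False: exact comparison
print(results_match(0.1 + 0.2, 0.3))  # True: tolerant comparison
print(results_match(None, None))      # True
```

Swapping this in for `actual == expected` would be one way to keep float test cases from failing for reasons unrelated to correctness.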

## 3. Initial Generation

Let's start with a basic prompt to generate an initial solution to our problem.

In [None]:
# Basic prompt for initial code generation
# TODO: Complete this cell by replacing the **********
initial_prompt = f"""
You are an expert Python developer. Please write a Python function based on the following requirements:

{task_description}

Write only the function surrounded by ```python and ``` without any additional explanations or examples.

Example:

```python
def function_name(arguments):
    # function body
```
"""

# Get initial completion
messages = [{"role": "user", "content": initial_prompt}]
initial_response = get_completion(messages)


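def extract_code(code):
    # Return the lines between the first "```python" line and the next "```" line.
    # Raises ValueError if the response does not contain those exact fence lines.
    lines = code.split("\n")
    start = lines.index("```python") + 1
    end = lines.index("```", start)
    return "\n".join(lines[start:end])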


initial_code = extract_code(initial_response)

print("Initial Generated Code:")
print(initial_code)

# Execute and test the initial code
initial_results = execute_code(initial_code, test_cases)
initial_feedback = format_feedback(initial_results)

print("\nTest Results:")
print(initial_feedback)

Initial Generated Code:
def process_data(numbers, mode='average'):
    # Filter out non-numeric values
    numeric_values = [num for num in numbers if isinstance(num, (int, float))]

    if not numeric_values:
        return None  # or raise an error if preferred

    if mode == 'sum':
        return sum(numeric_values)
    elif mode == 'average':
        return sum(numeric_values) / len(numeric_values)
    else:
        # Unsupported mode
        return None

Test Results:
Test Results: 8 passed, 0 failed
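
`extract_code` depends on the model emitting a line that is exactly the lowercase `python` fence; any variation (a capitalized tag, a trailing space, no tag at all) raises `ValueError`. If that happens in your runs, a more forgiving extractor may help. The version below is a sketch under that assumption; `extract_code_lenient` and `FENCE_PATTERN` are hypothetical names, and falling back to the raw reply is a design choice rather than the notebook's behavior.

```python
import re

# Matches a fenced block, with or without a python/py tag after the backticks
FENCE_PATTERN = re.compile(
    r"```(?:python|py)?[ \t]*\n(.*?)```", re.DOTALL | re.IGNORECASE
)


def extract_code_lenient(response_text):
    """Return the first fenced code block, or the whole reply if none is found."""
    match = FENCE_PATTERN.search(response_text)
    if match:
        return match.group(1).strip("\n")
    # No fence found: assume the reply is bare code
    return response_text.strip()


reply = "Sure, here it is:\n```Python\ndef f():\n    return 1\n```\nLet me know!"
print(extract_code_lenient(reply))
```

Because a malformed reply now yields text instead of an exception, any resulting syntax error surfaces through `execute_code` as an execution error, which the feedback prompt can then report to the model.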


## 4. Expand the Test Cases

Now, pretend that you've used this code in a production setting and have received feedback. The first version of your generated code worked marvelously, and now you are seeking to expand the capabilities of your function.

Unfortunately, your product manager is on vacation, but you know your function needs to:
1) support a new mode, "median"
2) ignore non-numeric values
3) handle empty lists, returning None

So, following test-driven development practices, you update your tests:


In [10]:
# These are the new test cases. No updates needed.
test_cases = [
    {"inputs": ([1, 2, 3, 4, 5], "sum"), "expected": 15},
    {"inputs": ([1, 2, 3, 4, 5], "average"), "expected": 3.0},
    {"inputs": ([11, 12, 13, 14, 15], "sum"), "expected": 65},
    {"inputs": ([11, 12, 13, 14, 15], "average"), "expected": 13.0},
    {"inputs": ([], "sum"), "expected": None},
    {"inputs": ([1, 3, 4], "median"), "expected": 3},
    {"inputs": ([1, 2, 3, 5], "median"), "expected": 2.5},
    {"inputs": ([1, 2, "a", 3], "sum"), "expected": 6},
    {"inputs": ([1, 2, None, 3, "b", 4], "average"), "expected": 2.5},
    {"inputs": ([10], "median"), "expected": 10},
    {"inputs": ([], "median"), "expected": None},
    {"inputs": ([1, 2, 3, 4, 5], "invalid_mode"), "expected": ValueError},
]

In [11]:
# Re-test the code
# No updates are needed in this cell
print("Initial Generated Code:")
print(initial_code)

# Execute and test the initial code
initial_results = execute_code(initial_code, test_cases)
initial_feedback = format_feedback(initial_results)

print("\nTest Results:")
print(initial_feedback)

Initial Generated Code:
def process_data(numbers, mode='average'):
    # Filter out non-numeric values
    numeric_values = [num for num in numbers if isinstance(num, (int, float))]

    if not numeric_values:
        return None  # or raise an error if preferred

    if mode == 'sum':
        return sum(numeric_values)
    elif mode == 'average':
        return sum(numeric_values) / len(numeric_values)
    else:
        # Unsupported mode
        return None

Test Results:
Test Results: 8 passed, 4 failed

Failed Test Cases:

Test #6:
  Inputs: ([1, 3, 4], 'median')
  Expected: 3
  Actual: None

Test #7:
  Inputs: ([1, 2, 3, 5], 'median')
  Expected: 2.5
  Actual: None

Test #10:
  Inputs: ([10], 'median')
  Expected: 10
  Actual: None

Test #12:
  Inputs: ([1, 2, 3, 4, 5], 'invalid_mode')
  Expected: <class 'ValueError'>
  Actual: None


## 5. First Iteration with Feedback
Now, let's feed the test results back to the model and ask for an improved version.

In [12]:
# Prompt with feedback for first iteration
# TODO: Complete this cell by replacing the **********
feedback_prompt = f"""
You are an expert Python developer. You wrote a function based on these requirements:

{task_description}

Here is your current implementation:
```python
{initial_code}
```
I've tested your code and here are the results:
{initial_feedback}
Please improve your code to fix any issues and make sure it passes all test cases.
Write only the improved function without any explanation.
"""

messages = [{"role": "user", "content": feedback_prompt}]

# Get improved code
improved_response = get_completion(messages)

# Extract the improved code
improved_code = extract_code(improved_response)

print("\nImproved Code:")
print(improved_code)

# Execute and test the improved code
improved_results = execute_code(improved_code, test_cases)
improved_feedback = format_feedback(improved_results)
print("\nTest Results for Improved Code:")
print(improved_feedback)



Improved Code:
def process_data(numbers, mode='average'):
    # Filter out non-numeric values
    numeric_values = [num for num in numbers if isinstance(num, (int, float))]

    if not numeric_values:
        return None  # or raise an error if preferred

    if mode == 'sum':
        return sum(numeric_values)
    elif mode == 'average':
        return sum(numeric_values) / len(numeric_values)
    elif mode == 'median':
        sorted_nums = sorted(numeric_values)
        n = len(sorted_nums)
        mid = n // 2
        if n % 2 == 0:
            return (sorted_nums[mid - 1] + sorted_nums[mid]) / 2
        else:
            return sorted_nums[mid]
    else:
        raise ValueError(f"Invalid mode: {mode}")

Test Results for Improved Code:
Test Results: 12 passed, 0 failed
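
Passing twelve tests is encouraging, but it only covers the inputs we thought of. One inexpensive extra check is to compare the generated median logic against Python's standard-library `statistics.median` on random inputs. The sketch below copies the sorting approach from the improved code into a hypothetical `median_by_hand` function; the sample sizes and value ranges are arbitrary choices.

```python
import random
import statistics


def median_by_hand(values):
    """Median via sorting, mirroring the approach in the improved code."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # Even length: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


rng = random.Random(0)  # fixed seed so the check is repeatable
for _ in range(200):
    sample = [rng.randint(-50, 50) for _ in range(rng.randint(1, 9))]
    assert median_by_hand(sample) == statistics.median(sample)
print("median_by_hand agrees with statistics.median on 200 random samples")
```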


## 6. Create Feedback Loop

We may want to give the LLM more than one chance to generate the correct code. We may even want to introduce test cases gradually, so that it has the opportunity to fix errors one at a time.

Let's build a loop that starts from scratch and keeps revising the code until it passes every test or a maximum number of iterations is reached.

In [16]:
# Write a code-creation LLM loop
# TODO: Replace the parts marked with *****

from pprint import pprint

iterations = []


# Get initial completion and extract code
messages = [{"role": "user", "content": initial_prompt}]
initial_response = get_completion(messages)
initial_code = extract_code(initial_response)

# Execute and test the initial code
initial_results = execute_code(initial_code, test_cases)
initial_feedback = format_feedback(initial_results)


# Store the initial iteration
iterations.append(
    {
        "iteration": 0,
        "code": initial_code,
        "test_results": {
            "passed": initial_results["passed"],
            "failed": initial_results["failed"],
        },
    }
)

pprint(iterations[-1]["test_results"])

current_code = initial_code
current_feedback = initial_feedback

# Loop to improve the code based on feedback
for i in range(3):
    if iterations[-1]["test_results"]["failed"] == 0:
        print("\nSuccess! All tests passed.")
        break
    feedback_prompt = f"""
    You are an expert Python developer. You wrote a function based on these requirements:

    {task_description}

    Here is your current implementation:
    ```python
    {current_code}
    ```
    I've tested your code and here are the results:
    {current_feedback}
    Please improve your code to fix any issues and make sure it passes all test cases.
    Write only the improved function without any explanation.
    """    

    # Feedback prompt with retry
    # feedback_prompt = ...
    messages = [{"role": "user", "content": feedback_prompt}]
    improved_response = get_completion(messages)
    improved_code = extract_code(improved_response)

    # Execute and test the improved code
    improved_results = execute_code(improved_code, test_cases)
    improved_feedback = format_feedback(improved_results)
    iterations.append(
        {
            "iteration": i + 1,
            "code": improved_code,
            "test_results": {
                "passed": improved_results["passed"],
                "failed": improved_results["failed"],
            },
        }
    )
    pprint(iterations[-1]["test_results"])

    current_code = improved_code
    current_feedback = improved_feedback


{'failed': 5, 'passed': 7}
{'failed': 4, 'passed': 8}
{'failed': 0, 'passed': 12}

Success! All tests passed.
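
The same control flow can be factored into a reusable function and exercised without any API calls. The sketch below is one possible shape, not the notebook's implementation: `run_feedback_loop`, `fake_generate`, and `fake_evaluate` are hypothetical names, and the canned failure counts simply replay the 5, 4, 0 sequence printed above.

```python
def run_feedback_loop(generate, evaluate, max_iterations=3):
    """Generate, test, and revise until all tests pass or attempts run out.

    generate(feedback) -> candidate
    evaluate(candidate) -> (failed_count, feedback)
    Both callables are stand-ins for the LLM call and the test runner.
    """
    history = []
    feedback = None  # no feedback exists before the first attempt
    for attempt in range(max_iterations + 1):  # one initial try plus revisions
        candidate = generate(feedback)
        failed, feedback = evaluate(candidate)
        history.append({"iteration": attempt, "failed": failed})
        if failed == 0:
            break
    return candidate, history


# Stub generator: pretends the model needs two rounds of feedback to succeed
canned_candidates = iter(["draft-1", "draft-2", "draft-3", "draft-4"])
failures_by_candidate = {"draft-1": 5, "draft-2": 4, "draft-3": 0, "draft-4": 0}


def fake_generate(feedback):
    return next(canned_candidates)


def fake_evaluate(candidate):
    failed = failures_by_candidate[candidate]
    return failed, f"{failed} tests failed"


final_candidate, loop_history = run_feedback_loop(fake_generate, fake_evaluate)
print(final_candidate)                             # draft-3
print([step["failed"] for step in loop_history])   # [5, 4, 0]
```

Separating the loop from the LLM call this way also makes the stopping logic testable with deterministic stubs.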


In [17]:
# View a summary of the different iterations
from pprint import pprint
pprint(iterations, width=200)

[{'code': "def process_data(numbers, mode='average'):\n"
          '    # Filter out non-numeric values\n'
          '    numeric_values = [num for num in numbers if isinstance(num, (int, float))]\n'
          '    if not numeric_values:\n'
          '        return 0\n'
          "    if mode == 'sum':\n"
          '        return sum(numeric_values)\n'
          "    elif mode == 'average':\n"
          '        return sum(numeric_values) / len(numeric_values)\n'
          '    else:\n'
          '        raise ValueError("Invalid mode. Choose \'sum\' or \'average\'.")',
  'iteration': 0,
  'test_results': {'failed': 5, 'passed': 7}},
 {'code': "def process_data(numbers, mode='average'):\n"
          '    # Filter out non-numeric values\n'
          '    numeric_values = [num for num in numbers if isinstance(num, (int, float))]\n'
          '    if not numeric_values:\n'
          '        return None\n'
          "    if mode == 'sum':\n"
          '        return sum(numeric_values

In [None]:
# Print the final code
print(iterations[-1]["code"])


def process_data(numbers, mode='average'):
    # Filter out non-numeric values
    numeric_values = [num for num in numbers if isinstance(num, (int, float))]
    if not numeric_values:
        return None

    mode_lower = mode.lower()

    if mode_lower == 'sum':
        return sum(numeric_values)
    elif mode_lower == 'average':
        return sum(numeric_values) / len(numeric_values)
    elif mode_lower == 'median':
        sorted_values = sorted(numeric_values)
        n = len(sorted_values)
        mid = n // 2
        if n % 2 == 1:
            return sorted_values[mid]
        else:
            return (sorted_values[mid - 1] + sorted_values[mid]) / 2
    else:
        raise ValueError("Invalid mode. Choose 'sum', 'average', or 'median'.")


## 7. Reflection & Transfer

What improvements did you observe across iterations?
Analyze how the code improved with each feedback loop iteration:

* Correctness: Did the number of passed tests increase?
* Error handling: How did error handling evolve?
* Edge cases: How did handling of edge cases improve?
* Readability: Did the code become more readable or maintainable?

How effective was the feedback loop approach?
Reflect on the effectiveness of using a feedback loop for code generation:

* What types of issues were fixed in each iteration?
* Were there any problems that persisted across iterations?
* How could the feedback mechanism be improved?
* How does this compare to traditional approaches to debugging and code refinement?
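
One concrete way to approach these questions is to inspect the recorded history instead of looking only at the last attempt, since a later iteration is not guaranteed to be better than an earlier one. The sketch below uses a made-up `sample_iterations` list in the same shape as the `iterations` list from the loop; `best_iteration` and `regressions` are hypothetical helpers.

```python
# Hypothetical history in the same shape as the `iterations` list
sample_iterations = [
    {"iteration": 0, "code": "v0", "test_results": {"passed": 7, "failed": 5}},
    {"iteration": 1, "code": "v1", "test_results": {"passed": 9, "failed": 3}},
    {"iteration": 2, "code": "v2", "test_results": {"passed": 8, "failed": 4}},
]


def best_iteration(history):
    """Return the entry with the most passed tests (earliest wins ties)."""
    return max(history, key=lambda entry: entry["test_results"]["passed"])


def regressions(history):
    """List iteration numbers whose pass count dropped versus the previous one."""
    return [
        later["iteration"]
        for earlier, later in zip(history, history[1:])
        if later["test_results"]["passed"] < earlier["test_results"]["passed"]
    ]


print(best_iteration(sample_iterations)["iteration"])  # 1
print(regressions(sample_iterations))                  # [2]
```

Keeping the best-scoring candidate, rather than the most recent one, is one possible improvement to the feedback mechanism when a revision breaks tests that previously passed.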

## Summary

In this exercise, we explored how LLM feedback loops can be used to iteratively improve code generation.

By providing structured feedback about test failures, we enabled the model to focus on specific issues and incrementally improve its solution.

The key insight is that well-structured feedback loops can significantly enhance the quality and correctness of AI-generated code, especially for complex tasks with multiple (possibly incomplete) requirements and edge cases.


Congratulations on completing this exercise! Give yourself a hand! 🤗🤗