# Stage 2: Data-Engineered RAG

In Stage 1, our baseline RAG agent played it safe and sent everything: full course objects, verbose fields, and lengthy descriptions. That worked, but it came at a cost: simple questions like *"What computer science courses are available?"* could trigger 5,000+ extra tokens of context.

Before we move on to a richer agent architecture (like a ReAct agent), we're going to pause and ask a simpler question:

> "What if we just got smarter about the data we send?"

In this stage, we'll keep the same basic RAG pattern, but engineer the context itself. Since context is just data, we can apply lightweight data engineering techniques to make it more efficient and cleaner for the LLM to consume.

More specifically, we'll:

1. Trim the data: We'll remove noise fields and unnecessary detail so the model only sees what it needs. In most cases, this will be the core course content, including descriptions, learning objectives, and other relevant details.
2. Transform the format into LLM-friendly text: We'll take the raw JSON/course objects and convert them into natural, human-readable text that LLMs handle more easily.
3. Control context depth: We'll use compact course summaries when we need an overview, and keep full details for when they're truly needed.
4. Experiment: We'll see how the token count changes compared to our original baseline RAG call.

The third point serves as a gentle introduction to a powerful concept known as progressive disclosure: start with a brief summary view and reveal full details only on demand.

In this stage, you'll see progressive disclosure in action through "summary vs. full" course representations, and in the next stage, we'll push this further inside a full agent loop.

With just these small changes, you'll see how we can reduce the context size (tokens) by around 50% while keeping answer quality essentially unchanged.

Let's dive in.

## Setup

Let's set up our environment and import the Stage 2 agent. Run the code blocks below.

In [None]:
# Set up the notebook so it can use the provided OpenAI API key and import the agent code

import sys
import os
from pathlib import Path

if "OPENAI_API_BASE" in os.environ:
    os.environ["OPENAI_BASE_URL"] = os.environ["OPENAI_API_BASE"]

project_root = Path("..").resolve()

# Path to the Stage 2 agent package
stage2_path = project_root / "progressive_agents" / "stage2_data_engineered"
src_path = project_root / "src"

# Put both locations at the front of the import path
sys.path.insert(0, str(src_path))
sys.path.insert(0, str(stage2_path))

print('OpenAI API key and agent access setup!')

In [None]:
from agent import setup_agent

print("Initializing Stage 2 Agent...")
# This reuses the same Redis data from Stage 1 (or generates it if missing)
workflow, course_manager = setup_agent(auto_load_courses=True, verbose=True)
print("Agent is ready!")

## Optimization: Human-Readable, LLM-Friendly Course Text
Our course data currently resides in a structured `Course` object that mirrors how it's stored in the Redis database: numerous fields, some student-facing and some purely system-facing. In Stage 1, we essentially dumped that structure into the prompt, which meant the LLM had to parse a lot of syntax and noise it did not really need.

There are two things to note about this approach:
1. JSON-style representations are full of extra characters: quotes, braces, commas, and field names.
2. The shape is designed for machines and storage, not for a model trying to answer a question in natural language.

For example, think about the difference between these two versions of the same idea:

```json
{
  "course_code": "CS101",
  "title": "Introduction to Computer Science",
  "department": "Computer Science"
}
```
versus
```txt
CS101: Introduction to Computer Science
Department: Computer Science
```

The first version spends tokens on punctuation and keys; the second goes straight to the meaning. Across dozens or hundreds of courses, that extra syntax turns into a lot of wasted tokens.
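
To make the difference concrete, here is a minimal sketch that renders the same record both ways and compares their sizes. The `course` dictionary is a hypothetical stand-in for the example above, and character count is only a rough proxy for tokens; a real measurement would use the tokenizer of the model you are calling.

```python
import json

# Hypothetical sample record, mirroring the example above
course = {
    "course_code": "CS101",
    "title": "Introduction to Computer Science",
    "department": "Computer Science",
}

# Machine-oriented representation: keys, quotes, braces, commas
as_json = json.dumps(course, indent=2)

# Human-readable representation: values with minimal labels
as_text = (
    f"{course['course_code']}: {course['title']}\n"
    f"Department: {course['department']}"
)

# Character counts are only a rough proxy for token counts
json_chars = len(as_json)
text_chars = len(as_text)
saved = 1 - text_chars / json_chars

print(f"JSON: {json_chars} chars, text: {text_chars} chars")
print(f"Roughly {saved:.0%} fewer characters in the text version")
```

The exact saving depends on the tokenizer and on how many fields each record carries, so treat the printed figure as an illustration rather than a benchmark.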

In this section, you will build `transform_course_to_text`, a function that reshapes a `Course` object into a compact, human-readable description. More specifically, we will accomplish the following:

1. Get rid of system-oriented fields like IDs and timestamps
2. Remove null and empty fields (e.g., when a course has no prerequisites)
3. Organize the important fields (title, level, format, instructor, prerequisites, learning objectives) into a simple text block that an LLM can scan easily

The result will be a prompt-ready course description that is both easier for the model to work with and significantly cheaper in tokens than sending the raw object or its JSON representation.
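
The first two steps can be sketched on a plain dictionary before touching the real `Course` model. This is only an illustration: `SYSTEM_FIELDS`, `trim_record`, and the sample record are hypothetical names, not part of the agent code.

```python
# Hypothetical list of system-facing fields; adjust to match your own schema
SYSTEM_FIELDS = {"id", "created_at", "updated_at"}


def trim_record(record: dict) -> dict:
    """Drop system-facing fields and fields that carry no content."""
    trimmed = {}
    for key, value in record.items():
        if key in SYSTEM_FIELDS:
            continue  # database artifact, not useful to the LLM
        if value is None or value == "" or value == []:
            continue  # empty field, nothing to tell the model
        trimmed[key] = value
    return trimmed


raw = {
    "id": "course_cs101",
    "course_code": "CS101",
    "title": "Introduction to Computer Science",
    "prerequisites": [],
    "created_at": "2023-01-01T00:00:00Z",
}

print(trim_record(raw))
```

Note that the emptiness check is deliberately narrow: a numeric `0` is kept, because a zero can be meaningful data rather than a missing value.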

Start by running the cell below to inspect a sample `Course` instance in its original form.


In [None]:
from redis_context_course.models import Course, DifficultyLevel, CourseFormat

# A dummy course for testing purposes
sample_course = Course(
    id="course_cs009",
    course_code="CS009",
    title="Computer Vision",
    department="Computer Science",
    major="Computer Science",
    credits=4,
    difficulty_level=DifficultyLevel.ADVANCED,
    format=CourseFormat.HYBRID,
    semester="fall",
    year=2025,
    max_enrollment=100,
    current_enrollment=0,
    instructor="Kayla Perez",
    description=(
        "This course teaches students how to build systems that can see and understand visual data."
    ),
    prerequisites=[
        {
            "course_code": "CS002",
            "course_title": "Prerequisite Course 2",
            "minimum_grade": "C",
            "can_be_concurrent": False,
        }
    ],
    learning_objectives=[
        "Understand core concepts in computer vision",
        "Implement computer vision algorithms and techniques",
        "Apply computer vision to real-world problems",
        "Analyze and evaluate computer vision solutions",
        "Design and build complete computer science systems",
    ],
    created_at="2023-01-01T00:00:00Z",
    updated_at="2023-06-01T00:00:00Z",
)

print("--- Raw Course Data (JSON) ---")
print(sample_course.model_dump_json(indent=2))

Now that we have the baseline, it's time to define the function that will transform the data.

### 📌 Task 1: Transform the course data

Your goal is to convert a `Course` object from its structured form into clean, human-readable text that's optimized for LLM consumption. This transformation removes database artifacts (IDs, timestamps), eliminates empty fields, and formats the essential information using natural language instead of JSON syntax.

Your function should produce clean, formatted text like this:

```
CS009: Computer Vision
Department: Computer Science
Credits: 4
Level: advanced
Format: hybrid
Instructor: Kayla Perez
Prerequisites: CS002
Description: This course teaches students how to build systems that can see and understand visual data.
Learning Objectives:
  - Understand core concepts in computer vision
  - Implement computer vision algorithms and techniques
  - Apply computer vision to real-world problems
  - Analyze and evaluate computer vision solutions
  - Design and build complete computer science systems
```

<details>
<summary>🛠️ Show Implementation Details</summary>
<br>

**Requirements:**

1. Input: A `Course` object
2. Output: A formatted string
3. Fields: Code, Title, Department, Credits, Level, Format, Instructor, Description
4. Conditional Fields: Include prerequisites and learning objectives only if they exist
5. Excluded Fields: ID, timestamps, enrollment numbers
6. Format: Use clean `Key: Value` format with proper line breaks

**Implementation Steps:**

**Step 1: Build prerequisites text**

Check if `course.prerequisites` exists. If it does:
- Extract just the `course_code` from each prerequisite object
- Join them with commas: `", ".join([p.course_code for p in course.prerequisites])`
- Store as `prereq_text`

**Step 2: Build learning objectives text**

Check if `course.learning_objectives` exists. If it does:
- Format each objective as a bulleted item with `"  - "` prefix
- Join them with newlines
- Add a leading newline before the list
- Store as `objectives_text`

**Step 3: Build the formatted course text**

Create an f-string that includes:
- Course code and title on the first line
- Department, Credits, Level, Format, Instructor on separate lines
- Prerequisites line (using the prereq_text you built)
- Description
- Learning Objectives header followed by objectives_text

Use `course.difficulty_level.value` and `course.format.value` to access enum values.

</details>

Feel free to consult the solution code to check your work or if you get stuck. You can run the cell at any point to test your output against the sample data.

In [None]:
def transform_course_to_text(course: Course) -> str:
    """
    Transform course object to LLM-optimized text format.
    
    Args:
        course: Course object to transform
    
    Returns:
        LLM-friendly text representation
    """
    
    # TODO: Build prerequisites text
    prereq_text = ""
    
    # TODO: Build learning objectives text  
    objectives_text = ""
    
    # TODO: Build and return the formatted course text
    course_text = ""
    
    return course_text

print("✅ Transform function defined!")

<details>
<summary>🗝️ Solution code</summary>
<br>
    
```python

def transform_course_to_text(course: Course) -> str:
    """
    Transform course object to LLM-optimized text format.
    """
    
    # Build prerequisites text: a comma-separated list of course codes
    prereq_text = ""
    if course.prerequisites:
        prereq_codes = [p.course_code for p in course.prerequisites]
        prereq_text = ", ".join(prereq_codes)

    # Build learning objectives text: a bulleted list with a leading newline,
    # so the first bullet starts on the line after the "Learning Objectives:" label
    objectives_text = ""
    if course.learning_objectives:
        objectives_text = "\n" + "\n".join(
            f"  - {obj}" for obj in course.learning_objectives
        )

    # Build course text
    course_text = f"""{course.course_code}: {course.title}
Department: {course.department}
Credits: {course.credits}
Level: {course.difficulty_level.value}
Format: {course.format.value}
Instructor: {course.instructor}
Prerequisites: {prereq_text}
Description: {course.description}
Learning Objectives:{objectives_text}
"""

    return course_text

print("✅ Transform function defined!")
```
</details>
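
One detail worth noticing: the solution above always prints the `Prerequisites:` and `Learning Objectives:` labels, even when a course has neither. If you would rather omit those lines entirely, one possible variant builds the text from a list of lines. The sketch below uses a hypothetical `DemoCourse` dataclass with a simplified set of fields so it can run on its own; it is not the real `Course` model.

```python
from dataclasses import dataclass, field


@dataclass
class DemoCourse:
    """Hypothetical, simplified stand-in for the Course model."""
    course_code: str
    title: str
    description: str
    prerequisite_codes: list = field(default_factory=list)
    learning_objectives: list = field(default_factory=list)


def transform_omitting_empty(course: DemoCourse) -> str:
    """Render a course as text, skipping sections that have no content."""
    lines = [f"{course.course_code}: {course.title}"]
    if course.prerequisite_codes:
        lines.append(f"Prerequisites: {', '.join(course.prerequisite_codes)}")
    lines.append(f"Description: {course.description}")
    if course.learning_objectives:
        lines.append("Learning Objectives:")
        lines.extend(f"  - {obj}" for obj in course.learning_objectives)
    return "\n".join(lines)


intro = DemoCourse(
    course_code="CS101",
    title="Introduction to Computer Science",
    description="A first course in programming.",
)
print(transform_omitting_empty(intro))
```

Collecting lines in a list and joining them at the end keeps each conditional section self-contained, which is easier to extend than a single large f-string.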

Finally, we need to attach the function to our agent so that later cells can use it. Run the code block below.

In [None]:
import agent.data_engineering

# Inject our function into the agent's module
agent.data_engineering.transform_course_to_text = transform_course_to_text

print("✅ Successfully injected the custom function into the agent!")
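
Assigning a function to a module attribute like this is often called monkey patching. The sketch below shows the general Python behavior with a throwaway module object: code that looks the function up on the module at call time sees the replacement, while code that captured the old function earlier (for example via `from module import name`) does not. All names here are hypothetical, and whether the agent resolves the function at call time is an assumption about its internals.

```python
import types

# Hypothetical stand-in for a module such as agent.data_engineering
demo_module = types.ModuleType("demo_data_engineering")


def _default_transform(course):
    return "default"


demo_module.transform_course_to_text = _default_transform


def format_via_module(course):
    # Looks the function up on the module every time it is called
    return demo_module.transform_course_to_text(course)


# Captures the function object once, like `from module import name` would
captured = demo_module.transform_course_to_text


def format_via_captured(course):
    return captured(course)


def custom_transform(course):
    return "custom"


# Inject the replacement, as the cell above does
demo_module.transform_course_to_text = custom_transform

print(format_via_module(None))    # uses the injected function
print(format_via_captured(None))  # still uses the original function
```

If your injected function does not seem to take effect, checking how the calling code imported it is a reasonable first step.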

## Optimization: Ultra-Compact Course Summaries

We've successfully reshaped a `Course` object into a clean, human-readable description. That format is much better for the LLM than raw JSON and is significantly cheaper in terms of tokens, but it is still fairly detailed: it includes a full description, learning objectives, prerequisites, and more.

Before we start testing that format in our RAG agent, we will build one more representation: an ultra-compact version that retains only the bare essentials. This gives us a third option to compare. So we'll have:

1. A raw JSON / full object (what we started with in Stage 1)
2. A human-readable full-text description (`transform_course_to_text`)
3. An ultra-compact one-line summary (`compact_course_text`)

Why do we need something this small? Well, in realistic systems, there are many situations where you don't want full details for every item. For example, when you are listing dozens of courses as a quick menu, or scanning over search results, you mostly need a short sketch: course code, title, a hint of what it is about, and maybe whether it has prerequisites.

This is where the idea of progressive disclosure starts to become practical. You can imagine a flow where the ultra-compact text is used for broad overviews or lists, and the full text representation is used only for the few items the user seems to care about.
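
As a rough sketch of that flow, the snippet below builds a context string that uses one-line summaries by default and full detail only for the courses named in `expand`. The `CATALOG` dictionary, its contents, and the helper names are all hypothetical; the real agent reads courses from Redis and uses the functions you write in this notebook.

```python
# Hypothetical in-memory catalog with made-up entries for illustration
CATALOG = {
    "CS009": {
        "title": "Computer Vision",
        "instructor": "Kayla Perez",
        "description": "Build systems that can see and understand visual data.",
    },
    "CS101": {
        "title": "Introduction to Computer Science",
        "instructor": "Sample Instructor",
        "description": "A first course in programming.",
    },
}


def summary_line(code: str) -> str:
    """One-line sketch of a course: enough to recognize it in a list."""
    return f"{code}: {CATALOG[code]['title']}"


def full_text(code: str) -> str:
    """Full detail for a course the user appears to care about."""
    course = CATALOG[code]
    return (
        f"{code}: {course['title']}\n"
        f"Instructor: {course['instructor']}\n"
        f"Description: {course['description']}"
    )


def build_context(codes, expand=()):
    """Use full detail for codes in `expand`, summaries for everything else."""
    parts = [full_text(c) if c in expand else summary_line(c) for c in codes]
    return "\n\n".join(parts)


overview = build_context(["CS009", "CS101"])
focused = build_context(["CS009", "CS101"], expand={"CS009"})

print(overview)
print("-" * 40)
print(focused)
```

The caller decides which items to expand, so the same function can serve a broad listing and a focused follow-up question.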

### 📌 Task 2: Create a compact course text representation

Your goal is to create an ultra-compact, single-line course representation that preserves only the bare essentials. This format is ideal for scenarios where you need to display multiple courses simultaneously (such as search result lists) or when you want to minimize token usage while still providing enough information for users to identify relevant courses.

Your function should produce a single-line format like this:

```
CS009: Computer Vision - This course teaches students how to build systems that can see and understand visual data.... (Prereq: CS002)
```

<details>
<summary>🛠️ Show Implementation Details</summary>
<br>

**Requirements:**

1. Input: A `Course` object
2. Output: A single-line string
3. Format: `Code: Title - Description (truncated)... (Prereq: codes)`
4. Truncate: Description to 100 characters
5. Conditional Fields: Include prerequisites only if they exist

**Implementation Steps:**

**Step 1: Build prerequisites text**

Check if `course.prerequisites` exists. If it does:
- Extract just the `course_code` from each prerequisite object
- Join them with commas
- Format as: `" (Prereq: CODE1, CODE2)"`
- If no prerequisites exist, use an empty string

Store this as `prereqs`.

**Step 2: Build the compact format**

Create a single-line string with:
- Course code and title: `f"{course.course_code}: {course.title}"`
- Truncated description: Use string slicing `course.description[:100]` followed by `"..."`
- Prerequisites: Append the `prereqs` string you built

Use an f-string to combine all parts: `f"{code}: {title} - {truncated_description}...{prereqs}"`

</details>

Feel free to consult the solution code to check your work or if you get stuck. You can run the cell at any point to test your output against the sample data.

In [None]:
def compact_course_text(course: Course) -> str:
    """
    Create an ultra-compact course description.

    Reduces token count while preserving essential information.

    Use this when you need maximum token efficiency (e.g., many courses).
    Use transform_course_to_text() when you need full details.

    Args:
        course: Course object to optimize

    Returns:
        Compact text representation
    """
    
    # TODO: Build prerequisites text
    prereqs = ""
    
    # TODO: Build and return the compact format
    
    return ""

<details>
<summary>🗝️ Solution code</summary>
<br>
    
```python

def compact_course_text(course: Course) -> str:
    """
    Create an ultra-compact course description.

    Reduces token count while preserving essential information.

    Use this when you need maximum token efficiency (e.g., many courses).
    Use transform_course_to_text() when you need full details.

    Args:
        course: Course object to optimize

    Returns:
        Compact text representation
    """
    prereqs = (
        f" (Prereq: {', '.join([p.course_code for p in course.prerequisites])})"
        if course.prerequisites
        else ""
    )
    return (
        f"{course.course_code}: {course.title} - {course.description[:100]}...{prereqs}"
    )

print("✅ Compact summary function defined!")
```
</details>
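
Note that the solution appends `"..."` unconditionally, which is why the sample output ends in four dots: the sample description is shorter than 100 characters and already ends with a period. If you prefer an ellipsis only when text was actually cut, a small helper like the hypothetical `truncate` below is one option.

```python
def truncate(text: str, limit: int = 100) -> str:
    """Keep the first `limit` characters; add '...' only if something was cut."""
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."


short = (
    "This course teaches students how to build systems "
    "that can see and understand visual data."
)
long_text = "x" * 120

print(truncate(short))      # returned unchanged, no ellipsis added
print(truncate(long_text))  # first 100 characters followed by "..."
```

The `rstrip()` call avoids a dangling space before the ellipsis when the cut happens to land right after a word.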

As before, we'll need to attach the function to our agent for later. Run the code block below to accomplish this goal.

In [None]:
import agent.data_engineering

# Inject our function into the agent's module
agent.data_engineering.compact_course_text = compact_course_text

print("Successfully injected the custom function into the agent!")

## Analysis

Let's now compare the results. In Stage 1, this query cost us ~10,000 tokens. We'll run the exact same query as in Stage 1 to make a direct comparison.

Run the code block below to initiate the query.

In [None]:
# Define the user's query
query = "What computer science courses are available?"

print(f"User asks: '{query}'")
print("Running workflow...")

# Run the graph
result = await workflow.ainvoke({"query": query})

# Display the Answer
print("="*60)
print(f"Agent Answer:\n\n{result['final_answer']}")
print("="*60)

# Display the Metrics
courses_found = result.get('courses_found', 0)
total_tokens = result.get('total_tokens', 0)

print("\nStatistics:")
print(f"   Courses Retrieved: {courses_found}")
print(f"   Total Tokens Used: {total_tokens:,}")

### The Comparison

| Metric | Stage 1 (Baseline) | Stage 2 (Engineered) | Improvement |
| :--- | :--- | :--- | :--- |
| **Total Tokens** | 6,000+ | **~576** | **~91% Reduction** |
| **Format** | Raw JSON | Clean Text | Better Readability |
| **Noise** | High (Syllabi, IDs) | Low (Relevant info only) | Focused Context |

We achieved a ~91% reduction in token usage by simply cleaning and formatting our data.
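
The reduction figure is simple arithmetic, and you can recompute it for your own run. The sketch below uses the values from the table; because the Stage 1 baseline is given as "6,000+", the true percentage depends on the exact baseline you measured. The helper name `percent_reduction` is just for illustration.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# Lower bound of the Stage 1 baseline from the table
print(f"{percent_reduction(6000, 576):.1f}%")  # 90.4%

# A baseline of 6,400 tokens would correspond to the ~91% figure
print(f"{percent_reduction(6400, 576):.1f}%")  # 91.0%
```

Replace the numbers with the `total_tokens` values from your own Stage 1 and Stage 2 runs to get the figure for your environment.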

### Inspecting the Engineered Context
Let's see what the LLM actually saw this time.

In [None]:
# Inspect the engineered context
engineered_context = result.get('engineered_context', '')

print(f"Total Context Size: {len(engineered_context):,} characters")
print("-" * 40)
print("PREVIEW OF ENGINEERED CONTEXT")
print("-" * 40)
print(engineered_context)

## Wrap Up 🏁

You've completed Stage 2 and transformed your RAG agent from a token-wasting baseline into an efficient, data-engineered system.

In this stage, you:

- Cleaned course data by removing database artifacts, noise fields, and unnecessary details
- Transformed structured data from JSON into LLM-friendly natural text format
- Optimized context with ultra-compact summaries for maximum token efficiency
- Compared different representation strategies (raw JSON vs. clean text vs. compact summaries)

The key insight is that context is just data, and data can be engineered. By applying lightweight data engineering techniques (cleaning, transforming, and optimizing), you achieved a roughly 91% reduction in token usage while maintaining answer quality. This demonstrates that more context isn't always better; curated, well-formatted context is.

In Stage 3, you'll take the next leap: hierarchical retrieval with progressive disclosure. Instead of deciding upfront whether to send summaries or full details, you'll build an intelligent agent that retrieves summaries first, evaluates relevance, and fetches full syllabi only for the most relevant courses. This adaptive approach gives you both low token usage and high detail exactly when needed.