
# Day 2 – Exercise 7: Intent‑Driven Retrieval with RealLight (LiteLLM)

## Background & Plan

In enterprise data architectures for generative AI, **retrieval‑augmented generation (RAG)** systems need to route incoming queries to the most appropriate retrieval strategy. For example, a concise factual question should use a simple keyword or BM25 retriever, while an open‑ended exploratory question might benefit from a dense vector search followed by summarization. This routing improves both efficiency and relevance of the answers.

To route intelligently, the system first **classifies the intent** of each query. Many projects begin with rule‑based heuristics or lightweight machine‑learning models. However, enterprise systems eventually integrate large language models (LLMs) to interpret nuanced queries. **[RealLight](https://docs.litellm.ai) (also called LiteLLM)** is a thin Python wrapper that standardizes calls to providers like OpenAI and Anthropic. It allows you to use your existing OpenAI key while enabling fallback to other providers.

In this exercise, we will:

* Build a small, labeled dataset of query intents (factual, procedural, exploratory).
* Train baseline classifiers (rule‑based and logistic regression) on the dataset.
* Integrate **RealLight** to call an LLM for intent classification, using environment variables for your API key.
* Route queries to different retrieval functions (simulated) based on predicted intent.
* Evaluate performance and analyze failure patterns to highlight improvements.

### Requirements

* **Python 3.10+**
* Packages: `scikit-learn==1.4.0`, `pandas==2.2.0`, `litellm==1.39.1` (wraps OpenAI), `openai==1.12.0`
* An environment variable `OPENAI_API_KEY` (or `LITELLM_OPENAI_KEY`) set to your API key. Never hard‑code secrets!
* For demonstration, we use synthetic data; in a real project, replace with domain‑specific queries and knowledge base.




## Plan Outline

We'll progress from simple to advanced stages:

1. **Stage A – Prepare Data & Baseline Models**
   1. Create a small labeled dataset of query–intent pairs.
   2. Build a **rule‑based classifier** based on keywords.
   3. Build a **logistic regression** classifier using TF‑IDF vectors.
   4. Compare baseline performances and discuss misclassifications.

2. **Stage B – RealLight LLM Classifier**
   1. Install and set up the `litellm` wrapper.
   2. Define a function that calls the LLM to classify intents.
   3. Evaluate the LLM classifier on the same dataset and analyze results.

3. **Stage C – Intent‑Driven Retrieval Pipeline**
   1. Define stub functions for **BM25** (keyword), **Dense** (vector), and **Hybrid** retrieval.
   2. Use predicted intent to select the appropriate retrieval function.
   3. Generate final answers from the retrieved documents with the LLM via RealLight (simulated in this notebook).

4. **Stage D – Evaluation & Analysis**
   1. Measure classification accuracies across classifiers.
   2. Analyze retrieval routing decisions and identify failure patterns.
   3. Discuss improvements such as hybrid models, additional intent classes, and larger datasets.

Throughout, we'll work in **small, runnable cells**, explain each step, and highlight trade‑offs.



## Stage A: Prepare Data & Baseline Models

We'll start by constructing a simple dataset with queries categorized into **factual**, **procedural**, and **exploratory** intents. This synthetic dataset demonstrates the workflow; for a production system you should curate a much larger corpus drawn from your domain.

The goal of this stage is to establish baseline models for intent classification:

1. **Rule‑based classifier** – uses keyword matching to guess the intent.
2. **Logistic regression classifier** – uses a TF‑IDF representation and supervised learning.

Let's build the dataset and baseline models.


In [1]:

# Stage A.1 – Create the labeled dataset

import pandas as pd

# Each sample has a query and its intent label
data = [
    {"query": "What is the capital of France?", "label": "factual"},
    {"query": "How do I reset my router?", "label": "procedural"},
    {"query": "Give me an overview of neural networks.", "label": "exploratory"},
    {"query": "Who founded the company OpenAI?", "label": "factual"},
    {"query": "Step‑by‑step guide to change a tire.", "label": "procedural"},
    {"query": "Explain quantum computing in simple terms.", "label": "exploratory"},
    {"query": "When was the Declaration of Independence signed?", "label": "factual"},
    {"query": "Instructions for installing Python on Windows.", "label": "procedural"},
    {"query": "What are the latest trends in artificial intelligence research?", "label": "exploratory"},
    # Add more examples for better coverage
]

# Load into a DataFrame
intent_df = pd.DataFrame(data)
intent_df


| | query | label |
|---|---|---|
| 0 | What is the capital of France? | factual |
| 1 | How do I reset my router? | procedural |
| 2 | Give me an overview of neural networks. | exploratory |
| 3 | Who founded the company OpenAI? | factual |
| 4 | Step‑by‑step guide to change a tire. | procedural |
| 5 | Explain quantum computing in simple terms. | exploratory |
| 6 | When was the Declaration of Independence signed? | factual |
| 7 | Instructions for installing Python on Windows. | procedural |
| 8 | What are the latest trends in artificial intel... | exploratory |


After constructing our small dataset, we store it in a Pandas `DataFrame`. Each row contains a `query` and its corresponding `label` (intent class). **Note:** in a real system you'll need hundreds or thousands of labeled examples, ideally labeled by subject‑matter experts.

In [2]:

# Stage A.2 – Implement a simple rule‑based classifier


def rule_based_intent(query: str) -> str:
    '''
    A naive classifier that looks for certain keywords. If a keyword matches,
    it returns the corresponding intent. If no keyword matches, it defaults to 'exploratory'.
    '''
    query_lower = query.lower()
    factual_keywords = ["what", "who", "when"]
    procedural_keywords = ["how", "instructions", "guide", "step‑by‑step"]

    # Check procedural keywords first
    if any(kw in query_lower for kw in procedural_keywords):
        return "procedural"
    elif any(kw in query_lower for kw in factual_keywords):
        return "factual"
    else:
        return "exploratory"

# Apply rule‑based classifier to the dataset
intent_df["pred_rule"] = intent_df["query"].apply(rule_based_intent)

# Show predictions
intent_df[["query", "label", "pred_rule"]]


| | query | label | pred_rule |
|---|---|---|---|
| 0 | What is the capital of France? | factual | factual |
| 1 | How do I reset my router? | procedural | procedural |
| 2 | Give me an overview of neural networks. | exploratory | exploratory |
| 3 | Who founded the company OpenAI? | factual | factual |
| 4 | Step‑by‑step guide to change a tire. | procedural | procedural |
| 5 | Explain quantum computing in simple terms. | exploratory | exploratory |
| 6 | When was the Declaration of Independence signed? | factual | factual |
| 7 | Instructions for installing Python on Windows. | procedural | procedural |
| 8 | What are the latest trends in artificial intel... | exploratory | factual |


The rule‑based classifier uses simple keyword matching. This works for obvious cues but fails on more nuanced queries. Notice that some **exploratory** queries may contain factual keywords (e.g., `"what are the latest trends"`), resulting in misclassification.
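Substring matching has a second weakness beyond the one noted above: a keyword can fire inside a longer word (for example, `"how"` inside `"show"`). A minimal sketch of a whole-word variant follows; the function name `rule_based_intent_tokens` and the compiled patterns are illustrative and not part of the original notebook.

```python
import re

# Same keyword lists as rule_based_intent, compiled as whole-word patterns.
# The character class accepts both the ASCII hyphen and the non-breaking hyphen.
PROCEDURAL_PATTERN = re.compile(
    r"\b(how|instructions|guide|step[-\u2011]by[-\u2011]step)\b", re.IGNORECASE
)
FACTUAL_PATTERN = re.compile(r"\b(what|who|when)\b", re.IGNORECASE)


def rule_based_intent_tokens(query: str) -> str:
    """Classify by whole-word keyword matches instead of raw substrings."""
    if PROCEDURAL_PATTERN.search(query):
        return "procedural"
    if FACTUAL_PATTERN.search(query):
        return "factual"
    return "exploratory"


# "Show" no longer triggers the "how" keyword.
print(rule_based_intent_tokens("Show me an overview of transformers."))
print(rule_based_intent_tokens("How do I reset my router?"))
```

This only removes accidental substring hits. A query such as "What are the latest trends in artificial intelligence research?" still matches the whole word "what" and is still labeled factual, so the underlying limitation of keyword rules remains.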

In [3]:

# Stage A.3 – Train a logistic regression classifier using TF‑IDF vectors

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    intent_df["query"], intent_df["label"], test_size=0.3, random_state=42, stratify=intent_df["label"]
)

# Convert text to TF‑IDF features
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train logistic regression
log_reg = LogisticRegression(max_iter=200, n_jobs=None)
log_reg.fit(X_train_vec, y_train)

# Evaluate on the test set
y_pred = log_reg.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Store predictions back in the DataFrame for analysis
intent_df.loc[X_test.index, "pred_log_reg"] = y_pred

# Show predictions side‑by‑side
intent_df[["query", "label", "pred_log_reg"]]


Accuracy: 0.6666666666666666
              precision    recall  f1-score   support

 exploratory       0.00      0.00      0.00         1
     factual       0.50      1.00      0.67         1
  procedural       1.00      1.00      1.00         1

    accuracy                           0.67         3
   macro avg       0.50      0.67      0.56         3
weighted avg       0.50      0.67      0.56         3





| | query | label | pred_log_reg |
|---|---|---|---|
| 0 | What is the capital of France? | factual | factual |
| 1 | How do I reset my router? | procedural | |
| 2 | Give me an overview of neural networks. | exploratory | |
| 3 | Who founded the company OpenAI? | factual | |
| 4 | Step‑by‑step guide to change a tire. | procedural | |
| 5 | Explain quantum computing in simple terms. | exploratory | |
| 6 | When was the Declaration of Independence signed? | factual | |
| 7 | Instructions for installing Python on Windows. | procedural | procedural |
| 8 | What are the latest trends in artificial intel... | exploratory | factual |


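The logistic regression classifier uses TF‑IDF features, so it can learn which terms signal each intent instead of relying on a hand‑written keyword list. With only six training queries, however, the result above is not meaningful: the model scored 2 of 3 on the held‑out queries and, like the rule‑based classifier, labeled the "latest trends" query as factual. Only the three test rows have a value in `pred_log_reg`; the training rows are left empty. For enterprise applications, consider **augmenting with more data** and exploring other models such as support vector machines or even transformers.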
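To make the TF‑IDF representation concrete, here is a small pure‑Python sketch of how the weights can be computed. It is intended to mirror the smoothed IDF formula and L2 normalization that scikit‑learn documents as the `TfidfVectorizer` defaults, but it is an illustration only; the helper names `tokenize` and `tfidf_vectors` and the three sample queries are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    "How do I reset my router?",
    "What is the capital of France?",
    "How do I change a tire?",
]
vecs = tfidf_vectors(docs)
# "how" occurs in two documents, "reset" in one, so "reset" gets the larger weight.
print(vecs[0]["how"] < vecs[0]["reset"])
```

Terms shared by many queries are down‑weighted, which is why rare, intent‑specific words end up carrying most of the signal the classifier sees.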


## Stage B: RealLight (LiteLLM) LLM Classifier

Now we'll integrate a large language model via **RealLight (LiteLLM)**. This wrapper provides a unified API for calling models from OpenAI and other providers. We'll use it to classify intents, which can capture subtler cues beyond keywords.

### Why RealLight?

* **Standardized Interface:** You write the same code regardless of whether you're using OpenAI, Anthropic, or another provider.
* **Fallback Capability:** You can configure multiple providers so that requests fail over to another one when the first is unavailable.
* **Streaming and Logging:** RealLight supports streaming responses and unified logging for observability (useful for later exercises).
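The fallback idea in the list above can be sketched by hand as a loop over candidate models. This is not the wrapper's own fallback mechanism, just an illustration of the pattern; `complete_with_fallback`, `fake_call`, and the model names are hypothetical, and `fake_call` stands in for a real completion function so the cell runs offline.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    models: Sequence[str],
    messages: list[dict],
    call: Callable[..., object],
) -> tuple[str, object]:
    """Try each model in order; return (model_name, response) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call(model=model, messages=messages)
        except Exception as exc:  # broad on purpose: provider errors vary
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))


# Stand-in for a real completion call (hypothetical, for offline use only).
def fake_call(model: str, messages: list[dict]) -> str:
    if model == "primary-model":
        raise ConnectionError("provider unavailable")
    return f"response from {model}"


used_model, response = complete_with_fallback(
    ["primary-model", "backup-model"],
    [{"role": "user", "content": "ping"}],
    call=fake_call,
)
print(used_model, response)
```

In a live notebook you could pass `litellm.completion` as `call` with real model names; recording which model actually answered is useful when you later compare classification quality across providers.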

We'll write a simple function that sends a classification prompt to the LLM and returns the predicted intent.


In [4]:

# Stage B.1 – Install litellm if not already installed
# Note: Uncomment the following line in a live environment to install. In this notebook,
# we assume it is already installed in your environment.
# !pip install litellm==1.39.1 openai==1.12.0

import os
# Check that the API key is set without echoing its value into the notebook output.
# Set OPENAI_API_KEY in your shell or secrets manager before running the notebook.
# DO NOT hard‑code secrets here!
"set" if os.environ.get("OPENAI_API_KEY") else "<missing>"


'<missing>'

In [5]:

# Stage B.2 – Define a function to classify intents with an LLM via RealLight

from typing import List

# We'll import litellm. If not installed, uncomment the pip install in the previous cell.
try:
    import litellm  # type: ignore
except ImportError:
    print("litellm is not installed. Please run the install cell above.")


def llm_intent_classify(queries: List[str], model: str = "gpt-3.5-turbo", temperature: float = 0.0) -> List[str]:
    '''
    Classify a list of queries using an LLM via RealLight. The model should respond with one
    of the following intents: 'factual', 'procedural', or 'exploratory'.

    Parameters:
        queries: List of user queries.
        model: Name of the model supported by RealLight/OpenAI.
        temperature: Sampling temperature (0 for deterministic).

    Returns:
        List of predicted intent strings corresponding to each query.
    '''
    valid_intents = {"factual", "procedural", "exploratory"}
    predictions = []
    for query in queries:
        # Compose a system + user prompt (zero‑shot: no worked examples are included)
        messages = [
            {"role": "system", "content": "You are an assistant that classifies user queries into one of three intents: factual, procedural, or exploratory."},
            {"role": "user", "content": f"Classify the intent of the following query: '{query}'. Respond with just the intent."},
        ]
        try:
            response = litellm.completion(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=10,  # labels span several tokens; a limit of 1 would truncate them
            )
            # Extract the assistant's reply; normalize case, whitespace, and stray punctuation
            intent = response.choices[0].message.content.strip().lower().strip(".'\"")
            if intent not in valid_intents:
                print(f"Unexpected label '{intent}' for query '{query}'; defaulting to 'exploratory'.")
                intent = "exploratory"
        except Exception as ex:
            print(f"Error calling LLM for query '{query}': {ex}")
            intent = "exploratory"  # default fallback
        predictions.append(intent)
    return predictions

# Example usage (only runs if litellm and API key are available)
# llm_predictions = llm_intent_classify(intent_df["query"].tolist())
# intent_df["pred_llm"] = llm_predictions
# intent_df[["query", "label", "pred_llm"]]


This function constructs a **zero‑shot classification prompt** (instructions only, no worked examples) for each query and sends it through the RealLight wrapper. It expects the model to respond with one of the three intent labels. We keep the temperature at 0 to make outputs as repeatable as possible. Error handling defaults to `exploratory` on exceptions (e.g., missing API key or network issue), which keeps the pipeline running but means failed calls are counted as predictions when you measure accuracy.
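The prompt in `llm_intent_classify` contains no worked examples. A few‑shot variant could be sketched as below, using alternating user and assistant turns to demonstrate the expected output format. The names `FEW_SHOT_EXAMPLES`, `SYSTEM_PROMPT`, and `build_few_shot_messages` are illustrative and not part of the original notebook.

```python
# Hypothetical demonstration pairs; in practice, pick examples that are NOT
# part of the evaluation set so they do not inflate measured accuracy.
FEW_SHOT_EXAMPLES = [
    ("Where is the Eiffel Tower located?", "factual"),
    ("How can I update my graphics driver?", "procedural"),
    ("Tell me about the history of cryptography.", "exploratory"),
]

SYSTEM_PROMPT = (
    "You classify user queries into exactly one intent: "
    "factual, procedural, or exploratory. Reply with the label only."
)


def build_few_shot_messages(query: str) -> list[dict[str, str]]:
    """Build a chat message list with demonstrations followed by the real query."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_query, example_label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_query})
        messages.append({"role": "assistant", "content": example_label})
    messages.append({"role": "user", "content": query})
    return messages


messages = build_few_shot_messages("Explain quantum computing in simple terms.")
print(len(messages))
```

The returned list has the same shape as the `messages` argument used in the cell above, so it can be passed to the completion call unchanged.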


## Stage C: Intent‑Driven Retrieval Pipeline

With our classifiers in place, we can now route queries to different retrieval strategies. In production, you might have:

* **BM25 / ElasticSearch** for factual queries – fast keyword search over your document collection.
* **Dense / Vector search** (e.g., via FAISS or Chroma) for exploratory queries – captures semantic similarity.
* **Hybrid** approaches combining both for procedural queries – ensures step‑by‑step instructions are fully covered.

Here we'll simulate these with simple stub functions. The goal is to demonstrate the **decision logic** for routing based on predicted intent.


In [7]:

# Stage C.1 – Define stub retrieval functions

def bm25_retrieve(query: str) -> str:
    '''Simulate BM25 retrieval by returning a canned passage.'''
    return f"[BM25] Retrieved keyword‑relevant passages for: {query}"


def dense_retrieve(query: str) -> str:
    '''Simulate dense retrieval by returning a semantically matched passage.'''
    return f"[Dense] Retrieved semantically similar passages for: {query}"


def hybrid_retrieve(query: str) -> str:
    '''Simulate hybrid retrieval by combining keyword and dense results.'''
    return f"[Hybrid] Retrieved combined passages for: {query}"

# Stage C.2 – Define a routing function

def route_and_answer(query: str, intent: str) -> str:
    '''
    Based on the predicted intent, select a retrieval function and return a simulated answer.
    In practice, you'd feed the retrieved docs to another LLM for final answering.
    '''
    if intent == "factual":
        retrieved = bm25_retrieve(query)
    elif intent == "procedural":
        retrieved = hybrid_retrieve(query)
    else:
        retrieved = dense_retrieve(query)

    # Simulate an answer generation step via an LLM (not executed here)
    answer = f"Answer based on {retrieved}"
    return answer

# Stage C.3 – Test the routing with baseline logistic regression predictions

# Use logistic regression predictions if available
try:
    test_predictions = intent_df.loc[X_test.index, "pred_log_reg"].tolist()
    for q, pred in zip(X_test.tolist(), test_predictions):
        answer = route_and_answer(q, pred)
        print(f"Query: {q}\nPredicted intent: {pred}\nRouted answer: {answer}\n")
except Exception as e:
    print("Cannot run routing test:", e)


Query: What is the capital of France?
Predicted intent: factual
Routed answer: Answer based on [BM25] Retrieved keyword‑relevant passages for: What is the capital of France?

Query: What are the latest trends in artificial intelligence research?
Predicted intent: factual
Routed answer: Answer based on [BM25] Retrieved keyword‑relevant passages for: What are the latest trends in artificial intelligence research?

Query: Instructions for installing Python on Windows.
Predicted intent: procedural
Routed answer: Answer based on [Hybrid] Retrieved combined passages for: Instructions for installing Python on Windows.



This simulation shows how your system selects different retrieval strategies based on the predicted intent. In an enterprise RAG pipeline, you would replace the stubs with calls to your **BM25 search engine**, **vector database**, or **hybrid aggregator**. After retrieval, pass the documents back to the LLM (via RealLight) to generate a final answer with citations.
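One common way to build the hybrid aggregator is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions. The notebook itself does not prescribe a fusion method, so treat this as one possible sketch; the function name, the document IDs, and the constant `k=60` (a conventional default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; higher fused score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword retriever and a vector retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF ignores the raw scores, it avoids having to calibrate BM25 scores against vector similarities, which sit on different scales.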


## Stage D: Evaluation & Analysis

Let's compare the performance of our classifiers and discuss how routing impacts retrieval. We'll compute simple accuracy metrics and highlight where each approach succeeds or fails.


In [8]:

# Stage D.1 – Evaluate the rule‑based and logistic regression classifiers

from sklearn.metrics import accuracy_score

# Compute accuracy for rule‑based classifier
rule_accuracy = accuracy_score(intent_df["label"], intent_df["pred_rule"])

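# For logistic regression, we need predictions for the whole dataset.
# Caution: we refit on all rows and score on the same rows, so this is training
# accuracy and is optimistic; the held‑out accuracy in Stage A.3 was 0.67.
# The vectorizer is reused as fitted on the Stage A.3 training split.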
full_X_vec = vectorizer.transform(intent_df["query"])
log_reg_full = LogisticRegression(max_iter=200)
log_reg_full.fit(full_X_vec, intent_df["label"])
log_reg_preds = log_reg_full.predict(full_X_vec)
log_accuracy = accuracy_score(intent_df["label"], log_reg_preds)

print(f"Rule‑based accuracy: {rule_accuracy:.2f}")
print(f"Logistic regression accuracy: {log_accuracy:.2f}")

# Stage D.2 – (Optional) Evaluate the LLM classifier if available
try:
    # Only run if RealLight predictions exist in DataFrame
    if "pred_llm" in intent_df.columns:
        llm_accuracy = accuracy_score(intent_df["label"], intent_df["pred_llm"])
        print(f"LLM classifier accuracy: {llm_accuracy:.2f}")
except Exception:
    pass


Rule‑based accuracy: 0.89
Logistic regression accuracy: 1.00


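These two numbers are not directly comparable. The logistic regression score is training accuracy (the model was fit and evaluated on the same nine queries), so 1.00 overstates how well it generalizes; on the three held‑out queries in Stage A.3 it reached 0.67. The rule‑based classifier has no training step, so its 0.89 (8 of 9) reflects performance on data it was not fit to, though the keywords were written with these examples in view. Integrating an LLM via RealLight can yield better results on nuanced queries but incurs latency and cost. Enterprise systems often **combine multiple classifiers**: a fast, lightweight model for common cases and an LLM fallback for hard queries.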
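The combination of a fast model with an LLM fallback can be sketched as a confidence‑threshold cascade. The function `cascade_classify`, the threshold value, and the two stand‑in classifiers below are hypothetical; they are placeholders for a real probabilistic model and a real LLM call.

```python
from typing import Callable


def cascade_classify(
    query: str,
    fast_proba: Callable[[str], dict[str, float]],
    slow_classify: Callable[[str], str],
    threshold: float = 0.7,
) -> tuple[str, str]:
    """Return (intent, source): use the fast model when it is confident, else the slow one."""
    probs = fast_proba(query)
    label, confidence = max(probs.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return label, "fast"
    return slow_classify(query), "llm"


# Stand-in for a lightweight model that returns class probabilities
def toy_fast_proba(query: str) -> dict[str, float]:
    if query.lower().startswith("how"):
        return {"procedural": 0.9, "factual": 0.05, "exploratory": 0.05}
    return {"procedural": 0.3, "factual": 0.4, "exploratory": 0.3}


# Stand-in for an LLM call (no network access in this sketch)
def toy_llm_classify(query: str) -> str:
    return "exploratory"


print(cascade_classify("How do I reset my router?", toy_fast_proba, toy_llm_classify))
print(cascade_classify("What are the latest trends in AI?", toy_fast_proba, toy_llm_classify))
```

In the notebook, `fast_proba` could wrap `log_reg.predict_proba` together with `log_reg.classes_`, and `slow_classify` could wrap `llm_intent_classify`. The threshold should be tuned on held‑out data, since it directly trades LLM cost against accuracy.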


### Failure Patterns & Improvement Suggestions

From the accuracy scores and manual inspection:

* **Rule‑Based Weaknesses:** Misclassifies queries containing both factual and exploratory cues (e.g., "What are the latest trends..." is exploratory but contains "what").
* **Logistic Regression Limitations:** With so few samples, the model can overfit and may not generalize. More labeled data and regularization help.
* **LLM (RealLight) Classifier:** With properly designed prompts, an LLM can understand intent better. However, it may still confuse procedural with exploratory queries if instructions are vague.
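A single accuracy number hides which classes are being confused. The short sketch below tallies (true label, predicted label) pairs for the rule‑based predictions shown in Stage A.2; the lists are copied from that output table, and the variable names are illustrative.

```python
from collections import Counter

# True labels and rule-based predictions, in dataset order (rows 0-8)
labels = [
    "factual", "procedural", "exploratory",
    "factual", "procedural", "exploratory",
    "factual", "procedural", "exploratory",
]
pred_rule = [
    "factual", "procedural", "exploratory",
    "factual", "procedural", "exploratory",
    "factual", "procedural", "factual",
]

# Count every (true, predicted) pair, then keep only the mismatches
confusion = Counter(zip(labels, pred_rule))
errors = {pair: n for pair, n in confusion.items() if pair[0] != pair[1]}
accuracy = sum(n for (true, pred), n in confusion.items() if true == pred) / len(labels)

print(f"accuracy: {accuracy:.2f}")
print("errors (true, predicted):", errors)
```

Here the only error is one exploratory query predicted as factual, which matches the "latest trends" case discussed above. The same tally works for any prediction column once it is filled for every row.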

**Recommendations for enterprise‑grade intent routing:**

1. **Expand the dataset:** Collect hundreds of real customer queries per intent class. Consider additional classes (billing, troubleshooting, feedback, etc.).
2. **Hybrid classification:** Use a lightweight model for clear cases and fall back to the LLM for ambiguous queries. This reduces latency and cost.
3. **Prompt engineering:** Provide few‑shot examples and clear instructions to the LLM to improve classification consistency. Test different models via RealLight.
4. **Monitoring & Feedback:** Log misclassifications, user corrections, and use them to continuously retrain your models.

By iterating on these steps, you can build a robust **intent router** that feeds your retrieval pipeline for a high‑quality enterprise RAG system.



## Wrap‑Up

In this exercise we:

* Built a labeled dataset of queries and intents.
* Implemented baseline classifiers and evaluated their performance.
* Integrated RealLight (LiteLLM) to call a large language model for intent classification. We used environment variables for the API key to keep secrets secure.
* Constructed a simulated retrieval pipeline and routed queries based on predicted intent.
* Evaluated the routing logic and identified areas for improvement.

**Next Steps:** In a real enterprise setting, pair this intent classification component with your retrieval back‑end (BM25 index, vector store) and feed the retrieved documents back into the LLM for final answer generation with citations. Continuously monitor performance and retrain the classifiers as your query distribution evolves.


In [9]:

# Quick Install (run once)
# These commands pin versions to ensure reproducibility. Uncomment to install.
# !pip install scikit-learn==1.4.0 pandas==2.2.0 litellm==1.39.1 openai==1.12.0

