# Agentic RAG System with Azure OpenAI & LangGraph

This Jupyter Notebook implements an Agentic Retrieval-Augmented Generation (RAG) system. It uses a Pinecone serverless index for vector storage, Azure OpenAI for embeddings and answer generation, Google Gemini for self-critique, and LangGraph for workflow orchestration. MLflow provides observability through prompt registration and evaluation metrics. Refinement is limited to one iteration, which avoids the `GraphRecursionError` that an unbounded critique loop can raise.

**Components**:
- **Azure OpenAI**: `text-embedding-3-small` for embeddings, `gpt-4o-mini` for answer generation (the chat deployment name is read from `AZURE_OPENAI_DEPLOYMENT`).
- **Pinecone**: Serverless vector database for storing and retrieving embeddings.
- **Google Gemini**: `gemini-2.5-flash` for self-critique.
- **LangGraph**: Orchestrates four nodes (Retriever, LLM Answer, Self-Critique, Refinement).
- **MLflow**: Logs queries, snippets, answers, critiques, refinement status, and evaluation metrics (latency plus ARI and Flesch-Kincaid readability grade levels).

**Functionality**:
- Processes `self_critique_loop_dataset.json` (30 entries).
- Retrieves up to 5 knowledge base snippets and generates answers with `[KBxxx]` citations.
- Performs one self-critique and, if needed, one refinement with an additional snippet (top-6).
- Logs results and evaluates answers using MLflow.

**Sample Queries**:
- What are best practices for caching?
- How should I set up CI/CD pipelines?
- What are performance tuning tips?
- How do I version my APIs?
- What should I consider for error handling?

## 1. Install Dependencies

In [15]:
!python -m pip install langgraph langchain-openai pinecone mlflow pydantic google-generativeai openai dspy litellm textstat evaluate --quiet
!python -m pip install --upgrade mlflow --quiet


## 2. Import Libraries

In [16]:
import os
import json
import time
import pandas as pd
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec
from langchain_openai import AzureOpenAIEmbeddings
from openai import AzureOpenAI
import google.generativeai as genai
from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Any
import mlflow
from mlflow import register_prompt
print(mlflow.__version__)

3.4.0


## 3. Configure Credentials

In [17]:
# Load environment variables
load_dotenv()

# Set MLflow tracking URI
mlflow.set_tracking_uri("http://20.75.92.162:5000/")

embedding_model_name = "text-embedding-3-small"
embedding_deployment_name = "text-embedding-3-small"  # Replace with your Azure deployment name

# Verify environment variables
assert os.environ.get("AZURE_OPENAI_ENDPOINT"), "AZURE_OPENAI_ENDPOINT not set in .env"
assert os.environ.get("AZURE_OPENAI_API_KEY"), "AZURE_OPENAI_API_KEY not set in .env"
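os.environ.setdefault("AZURE_OPENAI_API_VERSION", "2024-06-01")  # Fall back to a default API version when unset
assert os.environ.get("AZURE_OPENAI_DEPLOYMENT"), "AZURE_OPENAI_DEPLOYMENT not set in .env"  # Chat deployment used by generate_answer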
assert os.environ.get("PINECONE_API_KEY"), "PINECONE_API_KEY not set in .env"
assert os.environ.get("GOOGLE_API_KEY"), "GOOGLE_API_KEY not set in .env"

# Verify Pinecone API key
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
if not PINECONE_API_KEY:
    raise RuntimeError("PINECONE_API_KEY not found. Please set it in your environment.")
else:
    print("PINECONE API KEY found")

# Initialize Pinecone client
pc = Pinecone(api_key=PINECONE_API_KEY)

python-dotenv could not parse statement starting at line 32


PINECONE API KEY found
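The chain of `assert` statements above stops at the first missing variable. As a minimal sketch, assuming you would rather see every missing name at once, a small helper can collect them. The helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative, not part of the original notebook; the variable names themselves are the ones this notebook reads.

```python
import os

# Variable names read elsewhere in this notebook. AZURE_OPENAI_DEPLOYMENT is
# used later as the chat model name in generate_answer.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "PINECONE_API_KEY",
    "GOOGLE_API_KEY",
]

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Illustrative call with a fake environment that only defines the endpoint.
missing = missing_env_vars(
    REQUIRED_VARS, env={"AZURE_OPENAI_ENDPOINT": "https://example.invalid"}
)
print("Missing:", ", ".join(missing))
```

In the notebook itself you would call `missing_env_vars(REQUIRED_VARS)` after `load_dotenv()` and raise a single error if the returned list is non-empty.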


## 4. Preprocessing & Indexing

### 4.1 Load the Knowledge Base JSON

In [18]:
# Load the dataset
with open('self_critique_loop_dataset.json', 'r') as f:
    kb_data = json.load(f)

print(f"Loaded {len(kb_data)} entries from the knowledge base.")

Loaded 30 entries from the knowledge base.


### 4.2 Set Up Azure OpenAI Embeddings

In [19]:
# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    model=embedding_model_name,
    deployment=embedding_deployment_name,
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"]
)

# Detect embedding dimension
test_dim = len(embeddings.embed_query("dimension probe"))
print(f"Embedding dimension: {test_dim}")

Embedding dimension: 1536


### 4.3 Create and Populate Pinecone Index

In [20]:
# Define index parameters
INDEX_NAME = "agentic-rag-kb"
METRIC = "cosine"

# Create index if it doesn't exist
existing = [idx["name"] for idx in pc.list_indexes()]
if INDEX_NAME not in existing:
    print(f"Creating index '{INDEX_NAME}' ...")
    pc.create_index(
        name=INDEX_NAME,
        dimension=test_dim,
        metric=METRIC,
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
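    # Wait until the new index reports ready before using it
    while not pc.describe_index(INDEX_NAME).status["ready"]:
        time.sleep(1)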
else:
    print(f"Index '{INDEX_NAME}' already exists, reusing it.")

index = pc.Index(INDEX_NAME)
print(index.describe_index_stats())

# Prepare vectors: id, vector, metadata
vectors = []
for entry in kb_data:
    text_to_embed = f"{entry['question']} {entry['answer_snippet']}"
    vector = embeddings.embed_query(text_to_embed)
    vectors.append({
        "id": entry['doc_id'],
        "values": vector,
        "metadata": {
            "question": entry['question'],
            "text": entry['answer_snippet'],
            "source": entry['source'],
            "confidence_indicator": entry['confidence_indicator'],
            "last_updated": entry['last_updated']
        }
    })

# Upsert vectors to Pinecone
index.upsert(vectors=vectors)
print("Upsert complete.")

Index 'agentic-rag-kb' already exists, reusing it.
{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 30}},
 'total_vector_count': 30,
 'vector_type': 'dense'}
Upsert complete.
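With 30 entries, a single `index.upsert` call is enough. For a larger knowledge base it is common to send vectors in batches instead. The following is a minimal sketch of that idea, assuming a hypothetical helper named `chunked` and an arbitrary batch size; neither appears in the original notebook, and the right batch size depends on vector dimension and metadata size.

```python
from itertools import islice

def chunked(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items from `items`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Stand-in records shaped like the notebook's vectors (IDs only, no embeddings).
fake_vectors = [{"id": f"KB{i:03d}"} for i in range(1, 31)]
sizes = [len(batch) for batch in chunked(fake_vectors, 8)]
print(sizes)  # [8, 8, 8, 6]
```

In the notebook, the loop would look like `for batch in chunked(vectors, 100): index.upsert(vectors=batch)`, where `100` is only an example value.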


## 5. LangGraph Workflow

### 5.1 Set Up Azure OpenAI LLM (GPT-4o-mini)

In [21]:
# Initialize Azure OpenAI client
azure_client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

def generate_answer(query, snippets, temperature=0):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant. Generate a concise answer based on the provided snippets. Cite them as [KBxxx]. Ensure the answer directly addresses the query."
        },
        {
            "role": "user",
            "content": f"Query: {query}\nSnippets:\n" + "\n".join(
                [f"[{s['id']}] {s['metadata']['question']} {s['metadata']['text']}" for s in snippets]
            )
        }
    ]
    response = azure_client.chat.completions.create(
        model=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        messages=messages,
        temperature=temperature
    )
    return response.choices[0].message.content
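The system prompt asks the model to cite snippets as `[KBxxx]`, but nothing verifies that the cited IDs were actually among the retrieved snippets. A minimal sketch of such a check follows. The function name `check_citations`, the regular expression, and the sample strings are assumptions made for illustration; they are not part of the original workflow.

```python
import re

# Matches citations such as [KB003]; assumes IDs are "KB" followed by digits.
CITATION_RE = re.compile(r"\[(KB\d+)\]")

def check_citations(answer, snippet_ids):
    """Return (cited, unknown): IDs cited in `answer`, and those not retrieved.

    `cited` keeps first-seen order with duplicates removed.
    """
    cited = list(dict.fromkeys(CITATION_RE.findall(answer)))
    allowed = set(snippet_ids)
    unknown = [cid for cid in cited if cid not in allowed]
    return cited, unknown

# Illustrative answer text; KB999 is deliberately not in the retrieved set.
sample_answer = "Set TTLs and monitor hit rates [KB003][KB023]. See also [KB999]."
cited, unknown = check_citations(sample_answer, ["KB003", "KB023", "KB013"])
print("Cited:", cited)
print("Not in retrieved set:", unknown)
```

A non-empty `unknown` list could be logged to MLflow or used as an extra trigger for refinement, depending on how strict the pipeline should be.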

### 5.2 Set Up Gemini for Self-Critique

In [22]:
# Configure Gemini API
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
critique_model = "gemini-2.5-flash"

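def self_critique(query, answer):
    prompt = f"Query: {query}\nAnswer: {answer}\nIs this answer COMPLETE or does it need REFINE? Output only 'COMPLETE' or 'REFINE'."
    response = genai.GenerativeModel(critique_model).generate_content(prompt)
    # Normalize the verdict so downstream equality checks are reliable;
    # anything that is not clearly COMPLETE is treated as REFINE.
    verdict = response.text.strip().upper()
    return "COMPLETE" if verdict.startswith("COMPLETE") else "REFINE"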

### 5.3 Retrieval Function

In [23]:
def retrieve_snippets(query, k=5):
    query_vector = embeddings.embed_query(query)
    results = index.query(vector=query_vector, top_k=k, include_metadata=True)
    return results['matches']
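The index was created with the `cosine` metric, so `index.query` ranks stored vectors by cosine similarity to the query embedding. The sketch below illustrates that ranking with a brute-force search over toy two-dimensional vectors. It is only a conceptual illustration: Pinecone's actual search is a managed service and does not work this way internally, and the names `cosine` and `top_k` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(vec_id, cosine(query_vec, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-D "embeddings"; real embeddings in this notebook have 1536 dimensions.
toy_index = {"KB001": [1.0, 0.0], "KB002": [0.0, 1.0], "KB003": [1.0, 1.0]}
matches = top_k([1.0, 0.1], toy_index, k=2)
print([vec_id for vec_id, _ in matches])  # ['KB001', 'KB003']
```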

### 5.4 Define LangGraph Nodes and State

In [24]:
# Define the state
class State(TypedDict):
    query: str
    snippets: List[Any]
    answer: str
    critique: str
    refined: bool
    refinement_count: int  # Track number of refinements

# Node functions
def retriever_node(state: State) -> State:
    state['snippets'] = retrieve_snippets(state['query'], k=5)
    state['refinement_count'] = state.get('refinement_count', 0)  # Initialize counter
    return state

def llm_answer_node(state: State) -> State:
    snippets = state['snippets'][:5] if not state.get('refined', False) else state['snippets'][:6]
    state['answer'] = generate_answer(state['query'], snippets)
    return state

def self_critique_node(state: State) -> State:
    state['critique'] = self_critique(state['query'], state['answer'])
    return state

def refinement_node(state: State) -> State:
    if state.get('refinement_count', 0) < 1:  # Router only sends non-COMPLETE critiques here; limit to 1 refinement
        state['snippets'] = retrieve_snippets(state['query'], k=6)
        state['refined'] = True
        state['refinement_count'] = state.get('refinement_count', 0) + 1
    return state

# Decision function for conditional edge
def decide_to_refine(state: State):
    if state['critique'] == "COMPLETE" or state.get('refinement_count', 0) >= 1:
        return "end"
    return "refinement"

### 5.5 Build the LangGraph Workflow

In [25]:
# Initialize the graph
workflow = StateGraph(State)

# Add nodes
workflow.add_node("retriever", retriever_node)
workflow.add_node("llm_answer", llm_answer_node)
workflow.add_node("self_critique", self_critique_node)
workflow.add_node("refinement", refinement_node)

# Define edges
workflow.set_entry_point("retriever")
workflow.add_edge("retriever", "llm_answer")
workflow.add_edge("llm_answer", "self_critique")
workflow.add_conditional_edges("self_critique", decide_to_refine, {"end": END, "refinement": "refinement"})
workflow.add_edge("refinement", "llm_answer")

# Compile the graph
app = workflow.compile()
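The graph contains a cycle (`refinement` back to `llm_answer`), so it only terminates because the refinement counter forces the router to return `"end"`. The sketch below walks the same edges in plain Python with scripted critic verdicts, to show how many nodes run in each case. It does not use LangGraph; the function `simulate` and its `step_limit` parameter are hypothetical stand-ins, with `step_limit` playing the role of `recursion_limit`.

```python
from itertools import repeat

def simulate(critiques, max_refinements=1, step_limit=50):
    """Walk the workflow's edges with scripted verdicts; return visited node names."""
    verdicts = iter(critiques)
    visited = ["retriever"]
    refinement_count = 0
    node = "llm_answer"
    while node != "END":
        if len(visited) >= step_limit:
            # Analogous to LangGraph raising GraphRecursionError.
            raise RuntimeError("step limit reached without hitting END")
        visited.append(node)
        if node == "llm_answer":
            node = "self_critique"
        elif node == "self_critique":
            critique = next(verdicts)
            # Same rule as decide_to_refine: stop on COMPLETE or when the cap is hit.
            if critique == "COMPLETE" or refinement_count >= max_refinements:
                node = "END"
            else:
                node = "refinement"
        elif node == "refinement":
            refinement_count += 1
            node = "llm_answer"
    return visited

print(simulate(["COMPLETE"]))
print(simulate(repeat("REFINE")))  # critic never satisfied, still terminates
```

With the cap at one refinement, a critic that always answers `REFINE` still yields six node executions. Raising `max_refinements` far above `step_limit` with the same critic reproduces the runaway loop that the cap is meant to prevent.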

## 6. Tracing & Observability with MLflow

In [26]:
# Register the prompt template
answer_template = """\
You are a helpful assistant. Generate a concise answer based on the provided snippets for the query: {{ query }}.
Snippets: {{ snippets }}
Cite snippets as [KBxxx].
"""
prompt = mlflow.register_prompt(
    name="raj-agentic-rag-answer-prompt",
    template=answer_template,
    commit_message="Initial commit for RAG answer prompt",
)
print(f"Created prompt '{prompt.name}' (version {prompt.version})")

# Enable autologging for Azure OpenAI
mlflow.openai.autolog()

def run_query_with_logging(query):
    # Prepare evaluation data
    eval_data = pd.DataFrame({
        "inputs": [query],
        "targets": [""]  # Placeholder; ground truth not available
    })

    with mlflow.start_run(run_name=f"Query: {query[:20]}..."):
        mlflow.log_param("query", query)
        mlflow.log_param("model", "gpt-4o-mini")
        mlflow.log_param("temperature", 0)

        # Run the workflow
        result = app.invoke(
            {"query": query, "refined": False, "refinement_count": 0},
            config={"recursion_limit": 50}
        )

        # Log standard metrics
        mlflow.log_metric("refined", 1 if result.get('refined', False) else 0)
        mlflow.log_text(result['answer'], "final_answer.txt")
        mlflow.log_text("\n".join([s['id'] for s in result['snippets']]), "retrieved_snippets.txt")
        mlflow.log_text(result['critique'], "critique.txt")
        mlflow.log_text(
            "\n".join([f"[{s['id']}] {s['metadata']['question']} {s['metadata']['text']}" for s in result['snippets']]),
            "snippet_details.txt"
        )

        # Evaluate the answer
        def predict(data: pd.DataFrame) -> list[str]:
            predictions = []
            prompt_obj = mlflow.genai.load_prompt(f"prompts:/raj-agentic-rag-answer-prompt/{prompt.version}")
            for _, row in data.iterrows():
                snippets_text = "\n".join(
                    [f"[{s['id']}] {s['metadata']['question']} {s['metadata']['text']}" for s in result['snippets']]
                )
                content = prompt_obj.format(query=row["inputs"], snippets=snippets_text)
                completion = azure_client.chat.completions.create(
                    model=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
                    messages=[{"role": "user", "content": content}],
                    temperature=0,
                )
                predictions.append(completion.choices[0].message.content)
            return predictions

        # Run MLflow evaluation
        eval_results = mlflow.evaluate(
            model=predict,
            data=eval_data,
            targets="targets",
            extra_metrics=[
                mlflow.metrics.latency(),
                mlflow.metrics.ari_grade_level(),
                mlflow.metrics.flesch_kincaid_grade_level(),
            ],
        )

        # Log evaluation results
        for metric_name, metric_value in eval_results.metrics.items():
            mlflow.log_metric(metric_name, metric_value)

        return result

  prompt = mlflow.register_prompt(
2025/09/30 17:35:14 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: raj-agentic-rag-answer-prompt, version 7


Created prompt 'raj-agentic-rag-answer-prompt' (version 7)
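The evaluation above requests `ari_grade_level`, a readability score. For intuition, the Automated Readability Index is defined as 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The sketch below computes it with a deliberately naive tokenizer. The function name `ari` and the tokenization rules are assumptions for illustration; MLflow's metric may tokenize differently and therefore report different values for the same text.

```python
import re

def ari(text):
    """Automated Readability Index using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        return 0.0
    chars = sum(len(word) for word in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

simple = "The cat sat. The dog ran."
dense = "Implement comprehensive observability throughout distributed infrastructure."
print(round(ari(simple), 2))
print(ari(dense) > ari(simple))
```

Short words and short sentences give a low (here negative) score, while long words push the grade level up, which is why terse bullet-style answers and long explanatory answers can score very differently.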


## 7. Test with Sample Queries

In [27]:
# Set MLflow experiment
mlflow.set_experiment("raj-agentic-rag-evaluation")

sample_queries = [
    "What are best practices for caching?",
    "How should I set up CI/CD pipelines?",
    "What are performance tunning tips?",
    "How do i version my APIs?",
    "What should i consider for error handling?"
]

for query in sample_queries:
    result = run_query_with_logging(query)
    print(f"Query: {query}")
    print(f"Answer: {result['answer']}")
    print(f"Snippets: {[s['id'] for s in result['snippets']]}")
    print(f"Critique: {result['critique']}")
    print(f"Refined: {result.get('refined', False)}")
    print(f"Refinement Count: {result.get('refinement_count', 0)}")
    print("---")

2025/09/30 17:35:42 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/09/30 17:35:48 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


 View run Query: What are best practi... at: http://20.75.92.162:5000/#/experiments/218609737045096119/runs/1977b70ccc0540f5bf16d9a332778a63
離 View experiment at: http://20.75.92.162:5000/#/experiments/218609737045096119
Query: What are best practices for caching?
Answer: Best practices for caching include following well-defined patterns to ensure efficiency and effectiveness. This involves strategies such as setting appropriate expiration times, using cache keys wisely, and regularly monitoring cache performance to optimize resource usage and minimize latency [KB003][KB023][KB013].
Snippets: ['KB003', 'KB023', 'KB013', 'KB002', 'KB012', 'KB022']
Critique: REFINE
Refined: True
Refinement Count: 1
---


2025/09/30 17:36:44 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/09/30 17:36:52 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


 View run Query: How should I set up ... at: http://20.75.92.162:5000/#/experiments/218609737045096119/runs/731e01f0ba724361a4275186ccaefbc8
離 View experiment at: http://20.75.92.162:5000/#/experiments/218609737045096119
Query: How should I set up CI/CD pipelines?
Answer: To set up CI/CD pipelines effectively, follow these best practices: 

1. **Define Clear Stages**: Structure your pipeline into distinct stages such as build, test, and deploy.
2. **Automate Testing**: Integrate automated testing to ensure code quality at every stage.
3. **Use Version Control**: Maintain your code in a version control system to track changes and facilitate collaboration.
4. **Monitor and Optimize**: Continuously monitor the performance of your pipelines and optimize them for efficiency.
5. **Implement Rollback Mechanisms**: Ensure you have a strategy for rolling back deployments in case of failures.

These practices help in maintaining a robust and efficient CI/CD process [KB007][KB017][KB027].
Snippe

2025/09/30 17:37:51 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/09/30 17:37:56 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


 View run Query: What are performance... at: http://20.75.92.162:5000/#/experiments/218609737045096119/runs/1f837fac0512456e8e95271e78da8bbc
離 View experiment at: http://20.75.92.162:5000/#/experiments/218609737045096119
Query: What are performance tunning tips?
Answer: For effective performance tuning, consider the following best practices: 

1. **Identify Bottlenecks**: Use profiling tools to find slow parts of your application.
2. **Optimize Queries**: Ensure database queries are efficient and indexed properly.
3. **Cache Results**: Implement caching strategies to reduce load times for frequently accessed data.
4. **Monitor Resource Usage**: Keep an eye on CPU, memory, and disk I/O to identify resource constraints.
5. **Use Asynchronous Processing**: Offload long-running tasks to background processes to improve responsiveness.
6. **Review Code Efficiency**: Refactor code to eliminate unnecessary computations and improve algorithm efficiency.

Following these patterns can lead to si

2025/09/30 17:38:49 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/09/30 17:38:54 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


 View run Query: How do i version my ... at: http://20.75.92.162:5000/#/experiments/218609737045096119/runs/2046ec5706ec43b7bbd644292f19a6aa
離 View experiment at: http://20.75.92.162:5000/#/experiments/218609737045096119
Query: How do i version my APIs?
Answer: To version your APIs, it's important to follow well-defined patterns. Common practices include using version numbers in the URL (e.g., /v1/resource), in request headers, or as part of the query parameters. Ensure that each version is clearly documented and that you maintain backward compatibility where possible to avoid breaking changes for users of older versions [KB005][KB025][KB015].
Snippets: ['KB005', 'KB025', 'KB015', 'KB007', 'KB027', 'KB017']
Critique: REFINE
Refined: True
Refinement Count: 1
---


2025/09/30 17:39:46 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/09/30 17:39:52 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


 View run Query: What should i consid... at: http://20.75.92.162:5000/#/experiments/218609737045096119/runs/ed2a0397b4154526b54b42995c9891ce
離 View experiment at: http://20.75.92.162:5000/#/experiments/218609737045096119
Query: What should i consider for error handling?
Answer: When considering error handling, you should focus on following well-defined patterns, ensuring that errors are logged appropriately, providing meaningful error messages, and implementing a strategy for recovery or fallback mechanisms. Additionally, consider the context of the application and the user experience when designing your error handling approach [KB009][KB029][KB019].
Snippets: ['KB009', 'KB029', 'KB019', 'KB011', 'KB021', 'KB001']
Critique: REFINE
Refined: True
Refinement Count: 1
---
