# Aimpoint Digital AI Engineering Assignment
---

## Objective
Your assignment is to design, build, and explain a novel agentic workflow that utilizes a subset of the Wikipedia dataset. As part of this, you will need to define a distinctive GenAI use case that your system is intended to solve. The aim is to showcase not just your technical implementation skills, but also your ability to apply agentic system design innovatively and practically. You will implement your workflow in the Databricks Free Edition, starting from the provided notebook `01_agentic_wikipedia_aimpoint_interview.ipynb`.

To get you started, we pre-installed LangChain and LangGraph, which are open-source GenAI orchestration frameworks that work well in a Databricks workspace. In addition, we have provided you with a basic setup to access the data source using a LangChain document loader (https://python.langchain.com/docs/integrations/document_loaders/wikipedia/).

You may use coding assistants for this assignment, but you must provide your own custom prompts and demonstrate your own critical thinking. Large language models must not be used to generate responses for the open-response questions in Part B of this notebook.

Note: This assignment uses serverless clusters. At the time of creating this notebook, all components run successfully. However, you may need to address package dependency issues in the future to ensure your GenAI solution continues to function properly. 

## Deliverables

1. Reference Architecture
    - This should highlight your approach to addressing your use case or problem, in either PDF or image format; include the technical details of the agentic workflow here.

2. Databricks Notebook(s)
    - Includes the primary notebook `01_agentic_wikipedia_aimpoint_interview.ipynb` and any supplemental notebooks required to run the agent.
    - In the `01_agentic_wikipedia_aimpoint_interview.ipynb` notebook, complete the **GenAI Application Development** and **Reflection** sections. The GenAI Application Development section is where you add your own custom logic to create and run your agentic workflow. In the Reflection section, write a markdown response to each of the two questions.
    - To reduce your development time, we created the logic for you to have a FAISS vector store and made the LLM accessible as well.
    - Before finalizing, use "Run All" to confirm that your code runs correctly. Then go to "File" → "Export" → "HTML" to download the notebook as an HTML file, open that file, and save it as a PDF (see the instructions below). __Note: Your submission must be in PDF format.__

    > **Save HTML as PDF**
    > - Windows: (ctrl + P) → Save as PDF → Save
    > - MacOS: (⌘ + P) → Save as PDF → Save


## Data Source

The Wikipedia Loader ingests documents from the Wikipedia API and converts them into LangChain document objects. The page content includes the first sections of the Wikipedia articles and the metadata is described in detail below.

__Recommendation__: If you are using the LangChain document loader, we recommend filtering down to 10k or fewer documents. The `query_terms` argument below can be updated to change the search terms used to query Wikipedia. Make sure you update it to match the use case you defined.

The metadata of each LangChain document object contains the following fields:

| Column  | Definition                                                                 |
|---------|-----------------------------------------------------------------------------|
| title   | The Wikipedia page title (e.g., "Quantum Computing").                       |
| summary | A short extract or condensed description from the page content.             |
| source  | The URL link to the original Wikipedia article.                             |

In [0]:
%pip install -U -qqqq backoff databricks-langchain langgraph==0.5.3 uv databricks-agents mlflow-skinny[databricks] chromadb sentence-transformers langchain-huggingface langchain-chroma wikipedia faiss-cpu
dbutils.library.restartPython()

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
 #######################################################################################################
 ###### Python Package Imports for this notebook                                                  ######
 #######################################################################################################

from langchain_community.document_loaders import WikipediaLoader  # loaders live in langchain_community; the langchain.document_loaders path is deprecated
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
# from langchain.embeddings import DatabricksEmbeddings

from databricks_langchain import (
    ChatDatabricks,
    DatabricksEmbeddings,
    UCFunctionToolkit,
    VectorSearchRetrieverTool,
)

 
 #######################################################################################################
 ###### Config (Define LLMs, Embeddings, Vector Store, Data Loader specs)                         ######
 #######################################################################################################

# DataLoader Config
query_terms = ["sport", "football", "soccer", "basketball","baseball", "track","swimming", "gymnastics"] #TODO: update to match your use case requirements
max_docs = 10 #TODO: recommend starting with a smaller number for testing purposes

# Retriever Config
k = 8 # number of documents to return
EMBEDDING_MODEL = "databricks-bge-large-en" # Embedding model endpoint name


# LLM Config
LLM_ENDPOINT_NAME = "databricks-meta-llama-3-1-8b-instruct" # Model Serving endpoint name; other option see "Serving" under AI/ML tab (e.g. databricks-gpt-oss-20b)


example_question = "describe soccer?"


In [0]:
 #######################################################################################################
 ###### Wikipedia Data Loader                                                                     ######
 #######################################################################################################

docs = WikipediaLoader(query=query_terms, load_max_docs=max_docs).load() # Loading documents from Wikipedia takes about 10 minutes for 1K articles

#######################################################################################################
###### FAISS Retriever: Using DBX embedding model                                                ######
#######################################################################################################

# Define the embeddings and the FAISS vector store
embeddings = DatabricksEmbeddings(endpoint=EMBEDDING_MODEL) # Use to generate embeddings
vector_store = FAISS.from_documents(docs, embeddings)
 
# Example of how to invoke the vector store
results = vector_store.similarity_search(
    "What is the most popular sport in the US?",
    k=k
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

#######################################################################################################
###### LLM: Using DBX Foundation Model                                                           ######
#######################################################################################################

llm = ChatDatabricks(endpoint=LLM_ENDPOINT_NAME)

response = llm.invoke("What is the most popular sport in the US?")

print("\n",response.content)

* Simone Arianne Biles Owens (née Biles; born March 14, 1997) is an American artistic gymnast. Her 11 Olympic medals and 30 World Championship medals make her the most decorated gymnast in history. She is widely regarded as one of the greatest gymnasts of all time, and one of the greatest female athletes in history. With 11 Olympic medals, she is tied with Věra Čáslavská as the second-most decorated female Olympic gymnast behind Larisa Latynina, and has the most Olympic medals earned by a U.S. gymnast.
At the Olympic Games, Biles is a two-time gold medalist in the individual all-around (2016, 2024). She is also a two-time champion on vault (2016, 2024), the 2016 champion and 2024 silver medalist on floor exercise, and a two-time bronze medalist on balance beam (2016, 2020). Biles led the gold medal-winning United States teams in 2016, dubbed the "Final Five," and in 2024, dubbed the "Golden Girls". At the 2020 Summer Olympics, where she was favored to win at least four of the six avail

### a) GenAI Application Development

__REQUIRED__: This section is where you add your custom logic to create and run your agentic workflow. Feel free to add as many code cells as you need for this assignment.

In [0]:
#######################################################################################################
###### FULL SELF-REFLECTIVE AGENT WORKFLOW                                                       ######
#######################################################################################################

from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Dict, Any
import json
import mlflow
import time
import re


#######################################################################################################
# 1️⃣ Agent State
#######################################################################################################

class AgentState(TypedDict):
    question: str
    docs: List[Dict[str, Any]]
    claims: List[Dict[str, Any]]
    answer: str
    retry_count: int
    confidence: Dict[str, Any]
    answer_mode: str
    reflection_count: int
    critique: Dict[str, Any]


#######################################################################################################
# 2️⃣ Retrieval Node
#######################################################################################################

def retrieve_node(state: AgentState):
    retrieval_k = k if state["retry_count"] == 0 else k * 2

    results = vector_store.similarity_search(state["question"], k=retrieval_k)

    docs = []
    for i, r in enumerate(results):
        docs.append({
            "doc_index": i,
            "content": r.page_content,
            "metadata": r.metadata
        })

    return {"docs": docs}


#######################################################################################################
# 3️⃣ Query Rewrite Node
#######################################################################################################

def rewrite_query_node(state: AgentState):

    prompt = f"""
Rewrite the question to improve semantic retrieval.

Original:
{state["question"]}

Return only the improved question.
"""

    improved_query = llm.invoke(prompt).content.strip()

    print("\n🔁 Rewritten Query:", improved_query)

    return {
        "question": improved_query,
        "retry_count": state["retry_count"] + 1
    }
    
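def extract_json(response: str) -> str:
    """Strips markdown fences and preamble text before JSON parsing."""
    response = response.strip()
    # Prefer the contents of a fenced block if the model wrapped its output in one
    match = re.search(r'```(?:json)?\s*(.*?)\s*```', response, re.DOTALL)
    if match:
        response = match.group(1).strip()
    # Drop any preamble or trailing text around the outermost JSON object
    start, end = response.find("{"), response.rfind("}")
    if start != -1 and end > start:
        return response[start:end + 1]
    return response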

#######################################################################################################
# 4️⃣ Claim Extraction Node
#######################################################################################################

def claims_node(state: AgentState):

    prompt = f"""
Extract 3-6 factual claims answering the question using ONLY the documents.

Question:
{state["question"]}

Documents:
{json.dumps(state["docs"], ensure_ascii=False)}

Return JSON:
{{
  "claims": [
    {{"claim": "...", "doc_index": 0}}
  ]
}}
"""

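    # Strip fences and preamble once, then parse
    response = extract_json(llm.invoke(prompt).content)

    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        # Unparseable model output is treated as "no claims"
        return {"claims": []}

    claims = parsed.get("claims", []) if isinstance(parsed, dict) else []

    cleaned = []
    for c in claims:
        # Skip malformed entries instead of discarding the whole batch
        if not isinstance(c, dict) or "claim" not in c or "doc_index" not in c:
            continue
        try:
            cleaned.append({
                "claim": str(c["claim"]),
                "doc_index": int(c["doc_index"])
            })
        except (TypeError, ValueError):
            continue

    return {"claims": cleaned}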


#######################################################################################################
# 5️⃣ Evidence Validation (Hallucination Filter)
#######################################################################################################

def validate_claims_node(state: AgentState):

    validated_claims = []

    for claim in state["claims"]:
        idx = claim["doc_index"]
        # Skip claims whose doc_index falls outside the retrieved documents (including negatives)
        if not 0 <= idx < len(state["docs"]):
            continue

        doc_text = state["docs"][idx]["content"]

        prompt = f"""
Verify if the claim is directly supported by the document.

Claim:
{claim["claim"]}

Document:
{doc_text}

Return JSON:
{{ "supported": true/false }}
"""

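        response = extract_json(llm.invoke(prompt).content)

        try:
            parsed = json.loads(response)
            # Keep the claim only on an explicit boolean true
            if isinstance(parsed, dict) and parsed.get("supported") is True:
                validated_claims.append(claim)
        except json.JSONDecodeError:
            # Unparseable verdicts are treated as unsupported
            pass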

    print(f"\n🔎 Validated {len(validated_claims)} / {len(state['claims'])} claims")

    return {"claims": validated_claims}


#######################################################################################################
# 6️⃣ Retry Decision
#######################################################################################################

def retry_decision(state: AgentState):
    if len(state["claims"]) < 2 and state["retry_count"] == 0:
        return "rewrite"
    return "continue"


#######################################################################################################
# 7️⃣ Answer Node (With Style Control)
#######################################################################################################

def answer_node(state: AgentState):

    if not state["claims"]:
        return {"answer": "I don't have enough evidence in the retrieved documents to answer confidently."}

    mode_instruction = {
        "concise": "Write a short answer in 3-5 sentences.",
        "detailed": "Write a detailed explanation.",
        "bullet": "Write as bullet points.",
        "executive_summary": "Write a high-level executive summary."
    }.get(state["answer_mode"], "Write a concise answer.")

    prompt = f"""
You are a factual assistant.

Question:
{state["question"]}

Claims:
{json.dumps(state["claims"], ensure_ascii=False)}

Instructions:
{mode_instruction}

Rules:
- Only use provided claims.
- Cite document index like (Doc 0).
- Do not hallucinate.
"""

    # The answer is prose, not JSON, so it only needs whitespace trimming
    response = llm.invoke(prompt).content.strip()

    return {"answer": response}


#######################################################################################################
# 8️⃣ Self-Reflection Critique Node
#######################################################################################################

def critique_node(state: AgentState):

    prompt = f"""
Evaluate the answer quality.

Question:
{state["question"]}

Answer:
{state["answer"]}

Claims:
{json.dumps(state["claims"], ensure_ascii=False)}

Return JSON:
{{
  "needs_improvement": true/false,
  "issues": ["..."],
  "suggestions": "..."
}}
"""

    response = extract_json(llm.invoke(prompt).content)

    try:
        parsed = json.loads(response)
        if isinstance(parsed, dict):
            return {"critique": parsed}
    except json.JSONDecodeError:
        pass
    # Fall back to "no improvement needed" so the graph can proceed
    return {"critique": {"needs_improvement": False}}


#######################################################################################################
# 9️⃣ Regeneration Node
#######################################################################################################

def improve_answer_node(state: AgentState):

    prompt = f"""
Improve the answer based on critique.

Original Answer:
{state["answer"]}

Claims:
{json.dumps(state["claims"], ensure_ascii=False)}

Critique:
{json.dumps(state["critique"], ensure_ascii=False)}

Rules:
- Only use claims.
- Improve clarity and completeness.
- Do not hallucinate.
"""

    improved = llm.invoke(prompt).content.strip()

    return {
        "answer": improved,
        "reflection_count": state["reflection_count"] + 1
    }


#######################################################################################################
# 🔟 Reflection Decision
#######################################################################################################

def reflection_decision(state: AgentState):
    if (
        state["critique"].get("needs_improvement", False)
        and state["reflection_count"] < 1
    ):
        return "improve"
    return "continue"


#######################################################################################################
# 1️⃣1️⃣ Confidence Node
#######################################################################################################

def confidence_node(state: AgentState):

    prompt = f"""
Rate confidence from 1-5.

Answer:
{state["answer"]}

Claims:
{json.dumps(state["claims"], ensure_ascii=False)}

Return JSON:
{{ "score": 1-5, "reasoning": "..." }}
"""

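    response = extract_json(llm.invoke(prompt).content)

    try:
        parsed = json.loads(response)
        if isinstance(parsed, dict):
            return {"confidence": parsed}
    except json.JSONDecodeError:
        pass
    # Fall back to a low default score when the output cannot be parsed
    return {"confidence": {"score": 2}}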


#######################################################################################################
# 1️⃣2️⃣ Evaluation Metrics
#######################################################################################################

def evaluation_node(state: AgentState):

    unique_docs = len(set(c["doc_index"] for c in state["claims"])) if state["claims"] else 0

    metrics = {
        "num_claims": len(state["claims"]),
        "unique_docs_used": unique_docs,
        "retry_count": state["retry_count"],
        "reflection_count": state["reflection_count"]
    }

    print("\n📊 Metrics:", metrics)
    mlflow.log_metrics(metrics)

    return {}


#######################################################################################################
# 1️⃣3️⃣ Build Graph
#######################################################################################################

graph = StateGraph(AgentState)

graph.add_node("retrieve", retrieve_node)
graph.add_node("claims", claims_node)
graph.add_node("rewrite", rewrite_query_node)
graph.add_node("validate", validate_claims_node)
graph.add_node("answer", answer_node)
graph.add_node("critique", critique_node)
graph.add_node("improve", improve_answer_node)
graph.add_node("confidence", confidence_node)
graph.add_node("evaluate", evaluation_node)

graph.set_entry_point("retrieve")

graph.add_edge("retrieve", "claims")

graph.add_conditional_edges(
    "claims",
    retry_decision,
    {"rewrite": "rewrite", "continue": "validate"}
)

graph.add_edge("rewrite", "retrieve")
graph.add_edge("validate", "answer")
graph.add_edge("answer", "critique")

graph.add_conditional_edges(
    "critique",
    reflection_decision,
    {"improve": "improve", "continue": "confidence"}
)

graph.add_edge("improve", "critique")
graph.add_edge("confidence", "evaluate")
graph.add_edge("evaluate", END)

agent = graph.compile()


#######################################################################################################
# 1️⃣4️⃣ Run Agent
#######################################################################################################

with mlflow.start_run():

    start = time.time()

    result = agent.invoke({
        "question": example_question,
        "docs": [],
        "claims": [],
        "answer": "",
        "retry_count": 0,
        "confidence": {},
        "answer_mode": "detailed",
        "reflection_count": 0,
        "critique": {}
    })

    latency = time.time() - start
    mlflow.log_metric("latency_seconds", latency)

    print("\n==============================")
    print("QUESTION:", result["question"])
    print("==============================\n")

    print("ANSWER:\n", result["answer"])
    print("\nCLAIMS:\n", json.dumps(result["claims"], indent=2))
    print("\nCONFIDENCE:\n", result["confidence"])


🔁 Rewritten Query: What are the fundamental characteristics and rules of soccer?

🔎 Validated 0 / 0 claims

📊 Metrics: {'num_claims': 0, 'unique_docs_used': 0, 'retry_count': 1, 'reflection_count': 0}

QUESTION: What are the fundamental characteristics and rules of soccer?

ANSWER:
 I don't have enough evidence in the retrieved documents to answer confidently.

CLAIMS:
 []

CONFIDENCE:
 {'score': 3, 'reasoning': 'Confidence is being rated as 3 due to the absence of concrete evidence in the retrieved documents, which indicates a moderate level of uncertainty. A score of 3 suggests that the available information is limited and further investigation or clarification may be necessary to answer more confidently.'}


### b) Reflection

#### 1. If I had more time, what improvements would I make and why?

If given additional time, I would focus on strengthening retrieval quality, reliability guarantees, and evaluation rigor.

**a) Add Cross-Encoder Re-ranking**
Currently, retrieval uses bi-encoder embeddings (BGE + FAISS). While efficient, bi-encoders optimize for semantic similarity but may miss fine-grained relevance.  
I would add a cross-encoder re-ranking stage after initial retrieval to score the top-k documents more precisely.

**Why:**  
A cross-encoder scores each query-document pair jointly, which typically ranks the top-k more accurately than embedding similarity alone and reduces irrelevant context.  
**Value:** Better grounding → fewer hallucinations → stronger answers.
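
A minimal sketch of what such a re-ranking stage could look like, assuming a scoring callable that returns one relevance score per (query, passage) pair. The names `rerank` and `cross_encoder_scorer`, and the model name `cross-encoder/ms-marco-MiniLM-L-6-v2`, are illustrative choices and not part of the notebook above.

```python
from typing import Callable, List, Sequence, Tuple

# Any callable that scores (query, passage) pairs; higher means more relevant.
PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, passages: List[str], score_pairs: PairScorer, top_n: int = 3) -> List[str]:
    """Re-order passages by pairwise relevance score and keep the top_n."""
    if not passages:
        return []
    scores = score_pairs([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]


def cross_encoder_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2") -> PairScorer:
    """Build a scorer backed by sentence-transformers (assumed installed; downloads the model)."""
    from sentence_transformers import CrossEncoder  # imported lazily: heavy dependency

    model = CrossEncoder(model_name)
    return lambda pairs: model.predict(pairs).tolist()
```

In the workflow this would sit between retrieval and claim extraction, for example `rerank(question, [d["content"] for d in docs], cross_encoder_scorer())`, retrieving a larger `k` first and passing only the re-ranked top few to the LLM.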

---

**b) Implement Multi-Hop Retrieval Planning**
The current workflow performs single-hop retrieval. For complex questions (e.g., involving comparisons or causal reasoning), multi-hop reasoning is required.

I would:
- Add a query decomposition node
- Generate sub-questions
- Retrieve evidence per sub-question
- Merge claims across hops

**Why:**  
Many real-world questions require combining information across multiple documents.  
**Value:** Improves reasoning depth and correctness for complex queries.
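
The decomposition steps above can be sketched as follows. This is an illustrative outline under stated assumptions: `decompose` stands in for an LLM call that returns sub-questions, and `retrieve` stands in for a wrapper around the vector store; both are hypothetical callables, as is the name `multi_hop_retrieve`.

```python
from typing import Callable, Dict, List


def multi_hop_retrieve(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    max_hops: int = 3,
) -> Dict[str, List[str]]:
    """Retrieve evidence per sub-question and return a sub-question -> passages map."""
    # Fall back to the original question if decomposition yields nothing
    sub_questions = decompose(question)[:max_hops] or [question]
    evidence: Dict[str, List[str]] = {}
    seen = set()
    for sub_q in sub_questions:
        fresh = []
        for passage in retrieve(sub_q):
            if passage not in seen:  # de-duplicate passages across hops
                seen.add(passage)
                fresh.append(passage)
        evidence[sub_q] = fresh
    return evidence
```

Claims would then be extracted per sub-question and merged before answer generation, so each claim stays tied to the hop that produced it.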

---

**c) Add Semantic Similarity Threshold Gating**
Instead of always accepting retrieved documents, I would introduce similarity score thresholds.

If the top similarity score is below a threshold:
- Return "insufficient evidence"
- Or escalate to a fallback mechanism

**Why:**  
Low similarity often correlates with hallucination risk.  
**Value:** Improves system reliability and prevents overconfident answers.
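
A rough sketch of such a gate, assuming the scores come from the LangChain FAISS store's `similarity_search_with_score`, which returns `(document, score)` pairs where the score is a distance and lower means more similar under the default L2 index. The function names and any threshold value are hypothetical and would need calibration against real queries.

```python
from typing import Any, List, Tuple


def gate_by_distance(scored_docs: List[Tuple[Any, float]], max_distance: float) -> List[Any]:
    """Keep documents whose distance is at or below max_distance (lower = closer)."""
    return [doc for doc, distance in scored_docs if distance <= max_distance]


def has_sufficient_evidence(
    scored_docs: List[Tuple[Any, float]], max_distance: float, min_docs: int = 1
) -> bool:
    """True when at least min_docs documents pass the distance gate."""
    return len(gate_by_distance(scored_docs, max_distance)) >= min_docs
```

A retrieval node could call `vector_store.similarity_search_with_score(question, k=k)`, pass the result through `has_sufficient_evidence`, and route to the "insufficient evidence" answer when it returns `False`.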

---

**d) Add Offline Evaluation Benchmarking**
I would create a labeled evaluation dataset and compute:

- Faithfulness score
- Answer relevancy
- Citation precision
- Retrieval recall

**Why:**  
Prototype systems often lack rigorous measurement.  
**Value:** Enables systematic performance tracking and regression testing.
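
Two of these metrics can be computed with plain set arithmetic once a labeled dataset exists. The sketch below assumes each example carries a set of relevant document IDs and a set of documents that genuinely support the answer; the function names and ID format are illustrative.

```python
from typing import Iterable, Set


def retrieval_recall(retrieved_ids: Iterable[str], relevant_ids: Set[str]) -> float:
    """Fraction of labeled-relevant documents that appear in the retrieved set."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)


def citation_precision(cited_ids: Iterable[str], supporting_ids: Set[str]) -> float:
    """Fraction of cited documents that actually support the answer."""
    cited = set(cited_ids)
    if not cited:
        return 0.0
    return len(cited & supporting_ids) / len(cited)
```

Faithfulness and answer relevancy usually need an LLM judge or human labels, so they are left out of this sketch.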

---

**e) Add Caching and Cost Optimization**
- Cache embeddings
- Cache LLM responses
- Batch retrieval
- Use smaller models for validation steps

**Why:**  
Prototype systems do not optimize cost or latency.  
**Value:** Reduces inference cost and improves scalability.
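
As one possible starting point for response caching, a prompt-to-text function can be wrapped in an in-memory LRU cache from the standard library. The helper name `cached` is hypothetical, and this only makes sense when identical prompts should yield identical answers (for example, deterministic decoding settings).

```python
from functools import lru_cache
from typing import Callable


def cached(fn: Callable[[str], str], maxsize: int = 1024) -> Callable[[str], str]:
    """Wrap a prompt -> text function with an in-memory LRU cache keyed on the prompt."""
    return lru_cache(maxsize=maxsize)(fn)
```

Usage might look like `cached_llm = cached(lambda prompt: llm.invoke(prompt).content)`. A production system would more likely use a shared cache that survives restarts, which this sketch does not cover.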

---

#### 2. What steps are required to move this from prototype to production?

Moving from prototype to production requires changes across architecture, infrastructure, evaluation, and governance.

---

### A. Infrastructure Hardening

1. Deploy model endpoints behind autoscaling infrastructure.
2. Use managed vector databases (e.g., Databricks Vector Search or Pinecone).
3. Add retry mechanisms and rate limiting.
4. Implement structured logging and tracing (e.g., OpenTelemetry).

**Goal:** Ensure reliability, scalability, and observability.
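
The retry mechanism in item 3 could be sketched with the standard library alone, as below. The name `call_with_retry` and its defaults are assumptions for illustration; the `backoff` package installed in the setup cell offers decorator-based equivalents.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from concurrent callers
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

Each LLM or embedding call in a node could be wrapped this way, for example `call_with_retry(lambda: llm.invoke(prompt))`, ideally with `retry_on` narrowed to transient error types.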

---

### B. Monitoring & Observability

1. Track:
   - Latency
   - Token usage
   - Retrieval score distributions
   - Claim validation rate
   - Reflection frequency
2. Add alerting on:
   - High hallucination rate
   - Low confidence scores
   - Retrieval failures

**Goal:** Detect degradation early.
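
A small sketch of how the tracked values might feed alerting, assuming each metric has a minimum acceptable value. The function names, metric names, and thresholds are hypothetical placeholders.

```python
from typing import Dict, List


def claim_validation_rate(num_validated: int, num_extracted: int) -> float:
    """Share of extracted claims that survived validation (0.0 when none were extracted)."""
    return num_validated / num_extracted if num_extracted else 0.0


def check_alerts(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics that fall below their minimum threshold."""
    return [
        name for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]
```

In practice these checks would run over aggregated windows (for example, hourly averages) rather than single requests, to avoid noisy alerts.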

---

### C. Security & Governance

1. Add input sanitization and prompt injection protection.
2. Enforce strict output schemas.
3. Log prompts and outputs for auditability.
4. Apply access control and authentication.
5. Implement PII detection if needed.

**Goal:** Enterprise-grade safety and compliance.
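
To illustrate item 2, the sketch below validates the confidence payload against the shape requested in the confidence prompt (`score` from 1 to 5 plus `reasoning`). The function name `parse_confidence` is hypothetical, and a schema library such as Pydantic could replace the manual checks.

```python
import json
from typing import Any, Dict


def parse_confidence(raw: str) -> Dict[str, Any]:
    """Parse and validate a {"score": 1-5, "reasoning": str} payload; raise ValueError if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    reasoning = data.get("reasoning", "")
    if not isinstance(reasoning, str):
        raise ValueError("reasoning must be a string")
    return {"score": score, "reasoning": reasoning}
```

Rejecting invalid payloads explicitly, rather than silently substituting a default, makes schema violations visible in logs and metrics.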

---

### D. Robust Evaluation Pipeline

1. Create a labeled benchmark dataset.
2. Automate nightly evaluation runs.
3. Track:
   - Faithfulness
   - Citation correctness
   - Confidence calibration
4. Perform A/B testing for new model versions.

**Goal:** Prevent silent performance regressions.

---

### E. Performance Optimization

1. Reduce multi-pass LLM calls.
2. Use smaller models for critique/validation.
3. Add asynchronous orchestration.
4. Introduce token budgeting and truncation strategies.

**Goal:** Lower latency and cost.
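
A simple truncation strategy for item 4 might look like the following. It uses character count as a rough proxy for tokens, which is an assumption; a real tokenizer would give tighter budgets. The name `fit_to_budget` is illustrative.

```python
from typing import List


def fit_to_budget(passages: List[str], max_chars: int) -> List[str]:
    """Keep passages in rank order until the character budget is spent; truncate the last one."""
    kept: List[str] = []
    remaining = max_chars
    for passage in passages:
        if remaining <= 0:
            break
        kept.append(passage[:remaining])
        remaining -= len(kept[-1])
    return kept
```

Because passages are consumed in rank order, this works best after retrieval results have been sorted by relevance.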

---

### F. Productization Layer

1. Wrap the agent in an API (FastAPI or Databricks Serving).
2. Add streaming support.
3. Implement UI confidence indicators.
4. Provide structured JSON output mode.

**Goal:** Make the system usable in real-world applications.

---

### Final Reflection

This prototype demonstrates a self-reflective, grounded RAG architecture with structured reasoning and guardrails. However, productionizing such a system requires:

- Rigorous evaluation
- Infrastructure scalability
- Monitoring
- Security controls
- Cost optimization
- Continuous improvement pipelines

The architectural foundation is strong, but enterprise deployment requires systematic engineering beyond model prompting.




# Advanced Features Implemented

## 1️⃣ Query Rewriting

If the first pass extracts fewer than two claims, the system rewrites the user query once and retrieves again to improve semantic search.

**Why:** Improves recall and robustness against vague queries.  
**Value:** Higher retrieval quality without blindly increasing context size.

---

## 2️⃣ Conditional Retrieval Expansion

When the query has been rewritten after a weak first pass, the retry doubles the retrieval size (`k`).

**Why:** Small retrieval windows may miss relevant documents.  
**Value:** Cost-efficient scaling instead of always retrieving large contexts.

---

## 3️⃣ Structured Claim Extraction

The system extracts 3–6 factual claims tied to specific document indices before generating the final answer.

**Why:** Direct LLM answering risks hallucination.  
**Value:** Forces intermediate reasoning and grounding.

---

## 4️⃣ Hallucination Detection (Evidence Validation Layer)

Each claim is validated against its source document before being used.

**Why:** LLMs may fabricate claims.  
**Value:** Filters unsupported claims and increases factual reliability.

---

## 5️⃣ Controlled Answer Modes

Supports configurable output styles:
- concise
- detailed
- bullet
- executive_summary

**Why:** Different consumers require different formats.  
**Value:** Improves product flexibility and usability.

---

## 6️⃣ Self-Reflection Loop (Self-Critique + Regeneration)

After generating an answer:
- The system critiques its own response
- Identifies weaknesses
- Regenerates improved output (limited to 1 iteration)

**Why:** Initial answers may be shallow or incomplete.  
**Value:** Improves clarity, citation usage, and completeness.

This mimics advanced reasoning systems such as Self-Refine and Reflexion architectures.

---

## 7️⃣ Confidence Scoring

The system assigns a confidence score (1–5) based on:
- Claim strength
- Evidence coverage
- Answer completeness

**Why:** Production systems require reliability signals.  
**Value:** Enables threshold gating and monitoring.
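
Threshold gating on this score could be as simple as the sketch below. The function name `gate_answer`, the default threshold of 3, and the fallback message are hypothetical and would need tuning against observed score distributions.

```python
from typing import Any, Dict


def gate_answer(
    answer: str,
    confidence: Dict[str, Any],
    min_score: int = 3,
    fallback: str = "I don't have enough evidence to answer confidently.",
) -> str:
    """Return the answer only when the confidence score meets the threshold."""
    score = confidence.get("score")
    # Reject missing or non-numeric scores (bool is excluded because it subclasses int)
    if isinstance(score, (int, float)) and not isinstance(score, bool) and score >= min_score:
        return answer
    return fallback
```

Since the score is itself produced by an LLM, it should be treated as a heuristic signal and checked for calibration before it gates user-facing output.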

---

## 8️⃣ Evaluation & Observability

Metrics logged via MLflow:

- Number of extracted claims
- Unique documents used
- Retry count
- Reflection count
- Latency

**Why:** AI systems require instrumentation and monitoring.  
**Value:** Enables performance tracking and iterative improvement.

---

# Design Principles

- Strict grounding
- Iterative reasoning
- Conditional logic (not linear prompting)
- Modular nodes
- Observable system behavior
- Controlled generation
- Guardrail-first design

