# Notebook 00: Data Ingestion for RAG Systems

## 🎯 Your Mission

You're preparing data for RAG (Retrieval-Augmented Generation) systems. Your job: ingest IT tickets into vector databases so they can be searched semantically.

**Why this matters:** Before you can search documents semantically, you need to index them into a vector database. This notebook handles the ingestion process for both single-field and multi-field RAG approaches.

---

## ⚡ Quick Overview

**What this notebook does:**
- ✅ Loads IT ticket data from CSV files
- ✅ Creates vector stores in LlamaStack
- ✅ Prepares documents for indexing (single-field and multi-field)
- ✅ Indexes documents into vector stores in batches
- ✅ Verifies ingestion success

**This notebook will:**
- Create **both** vector stores (single-field and multi-field) in one run
- Index documents into both stores
- Prepare everything needed for notebooks 01 and 02

**Time:** ~10-15 minutes (depending on data size)

---


## 🎯 What You'll Learn

By the end of this notebook, you will:
- ✅ Understand how to prepare data for RAG ingestion
- ✅ Create both vector stores in LlamaStack (single-field and multi-field)
- ✅ Index documents using both approaches
- ✅ Verify that both ingestions were successful

---

## 📋 The Journey

1. **Load Data** - Load IT tickets from CSV
2. **Set Up LlamaStack** - Connect to LlamaStack
3. **Review Configuration** - Names and content fields for both vector stores
4. **Create Vector Stores** - Set up the single-field and multi-field stores
5. **Prepare Documents** - Format data for both ingestion modes
6. **Index Documents** - Ingest into both vector stores
7. **Verify Ingestion** - Confirm documents are searchable

---


### Step 1: Load and Explore the Dataset

**What we're doing:** Loading IT call center tickets from CSV files.

**Why:** We need to understand the data structure before we can ingest it into the vector database.


In [None]:
# Import required libraries
import pandas as pd
from pathlib import Path
from llama_stack_client import RAGDocument

# Load the CSV file from the data directory
data_dir = Path("../data")
file_path = data_dir / "synthetic-it-call-center-tickets-sample.csv"

print("🔄 Loading IT call center tickets dataset...")
df = pd.read_csv(file_path)

print(f"✅ Loaded {len(df)} tickets")
print(f"📋 Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\n🔍 Let's examine the dataset structure:")
print("=" * 60)
df.head()


**What we see:** Each ticket has multiple fields:
- **`short_description`** - Brief problem summary
- **`content`** - Detailed problem description  
- **`close_notes`** - Diagnostic findings and resolution steps
- **Other fields** - Metadata like ticket number, priority, etc.

**💡 Key insight:** We can create different types of RAG documents:
- **Single-field**: Use only `short_description` (simpler, faster)
- **Multi-field**: Combine `short_description + content + close_notes` (richer context)
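The difference between the two document types comes down to which columns are joined into the searchable text. The sketch below illustrates this with a hypothetical helper, `build_content`, and a made-up ticket (neither comes from the notebook or the dataset). Unlike the ingestion code later in this notebook, it skips empty fields so that no trailing blank lines end up in the content; treat that as a design choice rather than a requirement.

```python
def build_content(ticket: dict, fields: list[str]) -> str:
    """Join the non-empty values of the chosen fields, separated by blank lines."""
    parts = [str(ticket.get(name, "")).strip() for name in fields]
    return "\n\n".join(part for part in parts if part)


# Hypothetical ticket, not taken from the dataset
ticket = {
    "short_description": "VPN disconnects every few minutes",
    "content": "User reports the VPN client drops the connection on Wi-Fi.",
    "close_notes": "",
}

# Single-field: only the brief summary is searchable
single_field = build_content(ticket, ["short_description"])

# Multi-field: summary, details, and resolution notes are searchable together
multi_field = build_content(ticket, ["short_description", "content", "close_notes"])

print(single_field)
print("---")
print(multi_field)
```

The multi-field text is longer, which is why it tends to give richer matches at the cost of more embedding work per ticket.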

---

### Step 2: Set Up LlamaStack Client

**What we're doing:** Connecting to LlamaStack and configuring our environment.

**Why:** LlamaStack provides the vector database and ingestion APIs we need.


In [None]:
# Import required libraries for LlamaStack
import os
import sys
from pathlib import Path
from llama_stack_client import LlamaStackClient
from termcolor import cprint

# Add root src directory to path to import shared config
root_dir = Path("../..").resolve()
sys.path.insert(0, str(root_dir / "src"))

# Import centralized configuration
from config import LLAMA_STACK_URL, MODEL, NAMESPACE

# Initialize LlamaStack client
print("🔄 Step 2: Connecting to LlamaStack...")
print("=" * 60)
print(f"📡 LlamaStack URL: {LLAMA_STACK_URL}")
print(f"🤖 Model: {MODEL}")

client = LlamaStackClient(base_url=LLAMA_STACK_URL)

# Verify connection
try:
    models = client.models.list()
    print(f"\n✅ Connected to LlamaStack")
    # Some client versions return a plain list rather than an object with .data
    print(f"   Available models: {len(models.data) if hasattr(models, 'data') else len(list(models))}")
except Exception as e:
    print(f"\n❌ Failed to connect to LlamaStack: {e}")
    print(f"\n💡 Make sure:")
    print(f"   1. LlamaStack is deployed and running")
    print(f"   2. LLAMA_STACK_URL is set correctly")
    print(f"   3. You have network access to LlamaStack")
    raise

# Configure inference parameters
model = MODEL
stream = True
max_tokens = 4096
temperature = 0.0

print(f"\n⚙️  Inference Parameters:")
print(f"   Model: {model}")
print(f"   Temperature: {temperature}")
print(f"   Max Tokens: {max_tokens}")
print(f"   Stream: {stream}")


**What happened:** We connected to LlamaStack successfully! ✅

**What's next:** Now we'll create both vector stores (single-field and multi-field) and index documents into both.

---

### Step 3: Review the Vector Store Configuration

**What we're doing:** Reviewing the names and content fields of the two vector stores we are about to create: one for single-field RAG and one for multi-field RAG.

**Why:** We'll create both so you can use:
- **Single-field vector store** for notebook 01 (introduction to RAG)
- **Multi-field vector store** for notebook 02 (advanced RAG)

**💡 Benefits of creating both:**
- **Single-field RAG**: Faster, simpler, good for basic search (uses only `short_description`)
- **Multi-field RAG**: Richer context, better for complex queries (uses `short_description + content + close_notes`)
- **Comparison**: You can compare both approaches side-by-side


In [None]:
# We'll create both vector stores for this workshop
print("=" * 60)
print("📋 Vector Store Configuration")
print("=" * 60)
print("\n🔧 Creating Both Vector Stores:")

print("\n📝 Single-Field RAG Vector Store:")
print("   - Content field: short_description only")
print("   - Use case: Basic semantic search")
print("   - Best for: Notebook 01 - Introduction to RAG")
print("   - Vector store name: 'single-field-rag-tickets'")

print("\n📝 Multi-Field RAG Vector Store:")
print("   - Content fields: short_description + content + close_notes")
print("   - Use case: Advanced semantic search with full context")
print("   - Best for: Notebook 02 - Advanced RAG")
print("   - Vector store name: 'multi-field-rag-tickets'")

print("\n💡 Both vector stores will be created and populated in this notebook!")


---

### Step 4: Create Both Vector Stores

**What we're doing:** Creating two ChromaDB vector stores in LlamaStack - one for single-field and one for multi-field RAG.

**Why:** We need separate vector stores because they'll contain different document structures (single-field vs multi-field), enabling comparison and use in different notebooks.


In [None]:
# Create both ChromaDB vector stores
print("\n🔄 Step 4: Creating ChromaDB vector stores...")
print("=" * 60)
print("   - Provider: ChromaDB (embedded in LlamaStack)")
print("   - Embedding model: sentence-transformers/nomic-ai/nomic-embed-text-v1.5")
print("   - Embedding dimension: 768")

# Create single-field vector store
print("\n📦 Creating single-field vector store...")
vs_single_field = client.vector_stores.create(
    name="single-field-rag-tickets",
    extra_body={
        "provider_id": "chromadb",
        "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
        "embedding_dimension": 768
    }
)

print(f"✅ Single-field vector store created!")
print(f"   ID: {vs_single_field.id}")
print(f"   Name: {vs_single_field.name if hasattr(vs_single_field, 'name') else 'N/A'}")

# Create multi-field vector store
print("\n📦 Creating multi-field vector store...")
vs_multi_field = client.vector_stores.create(
    name="multi-field-rag-tickets",
    extra_body={
        "provider_id": "chromadb",
        "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
        "embedding_dimension": 768
    }
)

print(f"✅ Multi-field vector store created!")
print(f"   ID: {vs_multi_field.id}")
print(f"   Name: {vs_multi_field.name if hasattr(vs_multi_field, 'name') else 'N/A'}")

print("=" * 60)

# Store vector store IDs for later use
VECTOR_STORE_ID_SINGLE_FIELD = vs_single_field.id
VECTOR_STORE_ID_MULTI_FIELD = vs_multi_field.id

print(f"\n💡 Vector Store IDs saved:")
print(f"   Single-field: {VECTOR_STORE_ID_SINGLE_FIELD} (for notebook 01)")
print(f"   Multi-field: {VECTOR_STORE_ID_MULTI_FIELD} (for notebook 02)")
print(f"\n   You'll need these IDs in the query notebooks!")


**What happened:** We created both ChromaDB vector stores! ✅

**💡 What is ChromaDB?** It's a vector database that stores embeddings. Think of it as a specialized database optimized for finding similar vectors (similar meanings).

**Key point:** ChromaDB is embedded in LlamaStack - no separate deployment needed! This makes setup simple.
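To make "finding similar vectors" concrete, here is a minimal sketch of ranking stored items by cosine similarity, one common similarity measure. The three-dimensional vectors and labels are invented for illustration (the embeddings in this notebook have 768 dimensions), and the metric a given vector store actually uses depends on its configuration, so this is an illustration of the idea rather than a description of ChromaDB internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made up); real ones come from the embedding model
query_vector = [1.0, 0.0, 1.0]
stored_vectors = {
    "password reset": [0.9, 0.1, 0.8],
    "printer jam": [0.0, 1.0, 0.1],
}

# Rank stored items from most to least similar to the query
ranked = sorted(
    stored_vectors,
    key=lambda name: cosine_similarity(query_vector, stored_vectors[name]),
    reverse=True,
)
print(ranked)
```

A semantic search is essentially this ranking step, performed efficiently over all stored chunk embeddings.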

**What's next:** Now we'll prepare our ticket data and convert it into both document formats (single-field and multi-field) for indexing.

---

### Step 5: Prepare Documents for Both Ingestion Modes

**What we're doing:** Converting ticket data into RAG documents for both single-field and multi-field approaches.

**Why:** We need to create two different document structures:
- **Single-field documents**: Use only `short_description` (for notebook 01)
- **Multi-field documents**: Combine `short_description + content + close_notes` (for notebook 02)


In [None]:
# Prepare the data
print("\n🔄 Step 5: Preparing data for indexing...")
print("=" * 60)

# Fill missing values with empty strings
df = df.fillna("")

# Use all tickets (the sample file already contains 1000 rows)
df_1000 = df
print(f"   Processing {len(df_1000)} tickets (out of {len(df)} total)")

# Create single-field RAG documents
print(f"\n🔄 Creating single-field RAG documents...")
print("   Using field: short_description (problem summary)")
print("   Storing other fields as metadata")

documents_single_field = [
    RAGDocument(
        document_id=f"ticket-{i}",
        content=df_1000.iloc[i]["short_description"],
        mime_type="text/plain",
        metadata=df_1000.iloc[i].drop("short_description").to_dict(),
    )
    for i in range(len(df_1000))
]

print(f"✅ Created {len(documents_single_field)} single-field RAG documents")
print(f"   - Content: short_description (what we'll search)")
print(f"   - Metadata: All other fields (for filtering)")

# Create multi-field RAG documents
print(f"\n🔄 Creating multi-field RAG documents...")
print("   Combining fields: short_description + content + close_notes")
print("   Storing other fields as metadata")

documents_multi_field = [
    RAGDocument(
        document_id=f"ticket-{i}",
        content=f"{df_1000.iloc[i]['short_description']}\n\n{df_1000.iloc[i]['content']}\n\n{df_1000.iloc[i]['close_notes']}",
        mime_type="text/plain",
        metadata=df_1000.iloc[i].drop(["short_description", "content", "close_notes"]).to_dict(),
    )
    for i in range(len(df_1000))
]

print(f"✅ Created {len(documents_multi_field)} multi-field RAG documents")
print(f"   - Content: short_description + content + close_notes (full ticket story)")
print(f"   - Metadata: All other fields (for filtering)")

print(f"\n✅ Total: {len(documents_single_field)} single-field + {len(documents_multi_field)} multi-field documents ready for indexing")


**What happened:** We created RAG documents for both approaches! ✅

**💡 What is a RAG Document?**
- **Content:** The field(s) that will be searched semantically
- **Metadata:** Additional fields stored for filtering and context
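The content/metadata split can be sketched with plain dictionaries, without the client library. The helper `to_document` and the sample ticket below are hypothetical; the dictionary keys mirror the fields passed to `RAGDocument` in the cell above, but this is only an illustration of the split, not the library's own data model.

```python
def to_document(ticket_id: str, ticket: dict, content_fields: list[str]) -> dict:
    """Split one ticket into searchable content and filterable metadata."""
    content = "\n\n".join(str(ticket[name]) for name in content_fields)
    # Everything that is not part of the content is kept as metadata
    metadata = {key: value for key, value in ticket.items() if key not in content_fields}
    return {
        "document_id": ticket_id,
        "content": content,
        "mime_type": "text/plain",
        "metadata": metadata,
    }


# Hypothetical ticket; "number" and "priority" are illustrative column names
ticket = {
    "short_description": "Cannot log in to email",
    "number": "INC0001",
    "priority": "3",
}

doc = to_document("ticket-0", ticket, ["short_description"])
print(doc["content"])
print(doc["metadata"])
```

Only `content` is embedded and searched; the metadata travels alongside it so results can be filtered or displayed with their ticket details.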

**What's next:** Now we'll index both document sets into their respective vector stores. LlamaStack will automatically chunk them, generate embeddings, and store them for semantic search.

---

### Step 6: Index Documents into Both Vector Stores

**What we're doing:** Ingesting documents into both vector stores in batches.

**Why:** Batch processing prevents timeouts and allows progress tracking. LlamaStack will automatically:
- Chunk long documents
- Generate embeddings for each chunk
- Store them in ChromaDB for semantic search

**We'll index:**
1. Single-field documents → single-field vector store
2. Multi-field documents → multi-field vector store
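The batching pattern can be sketched independently of LlamaStack. In the example below, `index_in_batches` is a hypothetical helper and `insert_fn` stands in for whatever call performs the actual insert; the fake insert function simulates one failing batch. Recording the failed ranges, instead of only skipping them, makes it possible to retry just those batches later.

```python
from typing import Callable, Sequence


def index_in_batches(
    documents: Sequence,
    batch_size: int,
    insert_fn: Callable[[Sequence], None],
) -> tuple[int, list[tuple[int, int, str]]]:
    """Insert documents batch by batch; return (inserted count, failed ranges)."""
    inserted = 0
    failed_batches = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        try:
            insert_fn(batch)
            inserted += len(batch)
        except Exception as exc:
            # Remember which slice failed so it can be retried on its own
            failed_batches.append((start, start + len(batch), str(exc)))
    return inserted, failed_batches


def fake_insert(batch: Sequence) -> None:
    """Stand-in for the real insert call; fails for the batch containing 25."""
    if 25 in batch:
        raise TimeoutError("simulated timeout")


inserted, failed = index_in_batches(list(range(45)), 10, fake_insert)
print(inserted, failed)
```

With 45 documents and a batch size of 10, the loop runs five batches, the last one holding the remaining five documents.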


In [None]:
# Index documents into both vector stores (in batches to avoid timeout)
print("\n🔄 Step 6: Indexing documents into both vector stores...")
print("=" * 60)
print(f"   Chunk size: 1024 tokens")
print(f"   Single-field batch size: 100 documents")
print(f"   Multi-field batch size: 10 documents (smaller due to larger document size)")
print(f"   Processing in batches to avoid timeout...")

BATCH_SIZE_SINGLE_FIELD = 100  # Single-field documents are smaller, can use larger batches
BATCH_SIZE_MULTI_FIELD = 10    # Multi-field documents are larger, use smaller batches to avoid timeout

# ============================================================
# Index single-field documents into single-field vector store
# ============================================================
print(f"\n{'='*60}")
print("📦 Indexing Single-Field Documents")
print(f"{'='*60}")
print(f"   Vector store: single-field-rag-tickets")
print(f"   Total documents: {len(documents_single_field)}")
print(f"   Batch size: {BATCH_SIZE_SINGLE_FIELD} documents")

total_batches_single = (len(documents_single_field) + BATCH_SIZE_SINGLE_FIELD - 1) // BATCH_SIZE_SINGLE_FIELD
inserted_count_single = 0

for batch_num in range(total_batches_single):
    start_idx = batch_num * BATCH_SIZE_SINGLE_FIELD
    end_idx = min(start_idx + BATCH_SIZE_SINGLE_FIELD, len(documents_single_field))
    batch = documents_single_field[start_idx:end_idx]
    
    print(f"\n   Batch {batch_num + 1}/{total_batches_single}: Processing documents {start_idx} to {end_idx-1}...")
    
    try:
        insert_result = client.tool_runtime.rag_tool.insert( 
            chunk_size_in_tokens=1024,
            documents=batch,
            vector_db_id=str(VECTOR_STORE_ID_SINGLE_FIELD),
            extra_body={"vector_store_id": str(VECTOR_STORE_ID_SINGLE_FIELD)},
            extra_headers=None,
            extra_query=None,
            timeout=300  # 5 minute timeout per batch
        )
        inserted_count_single += len(batch)
        print(f"   ✅ Batch {batch_num + 1} indexed successfully ({inserted_count_single}/{len(documents_single_field)} documents)")
    except Exception as e:
        print(f"   ⚠️  Error indexing batch {batch_num + 1}: {e}")
        print(f"   💡 Tip: You can continue with the documents already indexed, or reduce BATCH_SIZE_SINGLE_FIELD")
        continue

print(f"\n✅ Single-field indexing complete!")
print(f"   Successfully indexed: {inserted_count_single}/{len(documents_single_field)} documents")

# ============================================================
# Index multi-field documents into multi-field vector store
# ============================================================
print(f"\n{'='*60}")
print("📦 Indexing Multi-Field Documents")
print(f"{'='*60}")
print(f"   Vector store: multi-field-rag-tickets")
print(f"   Total documents: {len(documents_multi_field)}")
print(f"   Batch size: {BATCH_SIZE_MULTI_FIELD} documents (smaller batches due to larger document size)")

total_batches_multi = (len(documents_multi_field) + BATCH_SIZE_MULTI_FIELD - 1) // BATCH_SIZE_MULTI_FIELD
inserted_count_multi = 0

for batch_num in range(total_batches_multi):
    start_idx = batch_num * BATCH_SIZE_MULTI_FIELD
    end_idx = min(start_idx + BATCH_SIZE_MULTI_FIELD, len(documents_multi_field))
    batch = documents_multi_field[start_idx:end_idx]
    
    print(f"\n   Batch {batch_num + 1}/{total_batches_multi}: Processing documents {start_idx} to {end_idx-1}...")
    
    try:
        insert_result = client.tool_runtime.rag_tool.insert( 
            chunk_size_in_tokens=1024,
            documents=batch,
            vector_db_id=str(VECTOR_STORE_ID_MULTI_FIELD),
            extra_body={"vector_store_id": str(VECTOR_STORE_ID_MULTI_FIELD)},
            extra_headers=None,
            extra_query=None,
            timeout=300  # 5 minute timeout per batch
        )
        inserted_count_multi += len(batch)
        print(f"   ✅ Batch {batch_num + 1} indexed successfully ({inserted_count_multi}/{len(documents_multi_field)} documents)")
    except Exception as e:
        print(f"   ⚠️  Error indexing batch {batch_num + 1}: {e}")
        print(f"   💡 Tip: You can continue with the documents already indexed, or reduce BATCH_SIZE_MULTI_FIELD")
        continue

print(f"\n✅ Multi-field indexing complete!")
print(f"   Successfully indexed: {inserted_count_multi}/{len(documents_multi_field)} documents")

# Summary
print(f"\n{'='*60}")
print("📊 Indexing Summary")
print(f"{'='*60}")
print(f"✅ Single-field vector store: {inserted_count_single}/{len(documents_single_field)} documents")
print(f"   Vector store ID: {VECTOR_STORE_ID_SINGLE_FIELD}")
print(f"\n✅ Multi-field vector store: {inserted_count_multi}/{len(documents_multi_field)} documents")
print(f"   Vector store ID: {VECTOR_STORE_ID_MULTI_FIELD}")

print(f"\n💡 LlamaStack automatically:")
print(f"   - Chunked the documents")
print(f"   - Generated embeddings for each chunk")
print(f"   - Stored them in ChromaDB for semantic search")


**What happened:** We indexed the documents into both ChromaDB vector stores! ✅ If any batch reported an error, the summary above shows fewer indexed documents than the total.

**🎉 Success!** The tickets are now searchable using semantic similarity in both vector stores.

**💡 What happened behind the scenes:**
- LlamaStack automatically chunked the documents
- Generated embeddings using the embedding model
- Stored them in the vector databases
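As a rough illustration of the chunking step, the sketch below splits text into fixed-size pieces. It counts whitespace-separated words, whereas the ingestion call above sizes chunks in model tokens (1024 per chunk), and the real chunker may also handle overlap or boundaries differently. The function `chunk_words` is hypothetical and only conveys the idea.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Tiny example with a chunk size of 3 words
chunks = chunk_words("one two three four five six seven", 3)
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Each chunk is embedded separately, so a long multi-field ticket can produce several vectors while a short single-field summary usually produces one.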

---

### Step 7: Verify Both Vector Stores

**What we're doing:** Checking that both vector stores contain our documents and are ready for queries.

**Why:** Verification ensures the ingestion was successful before we start querying in notebooks 01 and 02.


In [None]:
# Display both vector stores with documents after indexing
print("\n" + "=" * 60)
print("📊 Vector Store Status After Indexing")
print("=" * 60)

# ============================================================
# Verify Single-Field Vector Store
# ============================================================
print(f"\n{'='*60}")
print("📦 Single-Field Vector Store")
print(f"{'='*60}")

vs_single_updated = client.vector_stores.retrieve(VECTOR_STORE_ID_SINGLE_FIELD)

print(f"\n📦 Vector Store Details:")
print(f"   ID: {vs_single_updated.id}")
print(f"   Status: {vs_single_updated.status}")
if vs_single_updated.name:
    print(f"   Name: {vs_single_updated.name}")
if vs_single_updated.metadata:
    provider = vs_single_updated.metadata.get('provider_id', 'N/A')
    print(f"   Provider: {provider}")

if hasattr(vs_single_updated, 'file_counts') and vs_single_updated.file_counts:
    print(f"\n📊 Document Statistics:")
    print(f"   Total files: {vs_single_updated.file_counts.total}")
    print(f"   Completed: {vs_single_updated.file_counts.completed}")
    print(f"   In progress: {vs_single_updated.file_counts.in_progress}")
    print(f"   Failed: {vs_single_updated.file_counts.failed}")

# ============================================================
# Verify Multi-Field Vector Store
# ============================================================
print(f"\n{'='*60}")
print("📦 Multi-Field Vector Store")
print(f"{'='*60}")

vs_multi_updated = client.vector_stores.retrieve(VECTOR_STORE_ID_MULTI_FIELD)

print(f"\n📦 Vector Store Details:")
print(f"   ID: {vs_multi_updated.id}")
print(f"   Status: {vs_multi_updated.status}")
if vs_multi_updated.name:
    print(f"   Name: {vs_multi_updated.name}")
if vs_multi_updated.metadata:
    provider = vs_multi_updated.metadata.get('provider_id', 'N/A')
    print(f"   Provider: {provider}")

if hasattr(vs_multi_updated, 'file_counts') and vs_multi_updated.file_counts:
    print(f"\n📊 Document Statistics:")
    print(f"   Total files: {vs_multi_updated.file_counts.total}")
    print(f"   Completed: {vs_multi_updated.file_counts.completed}")
    print(f"   In progress: {vs_multi_updated.file_counts.in_progress}")
    print(f"   Failed: {vs_multi_updated.file_counts.failed}")

# ============================================================
# Test queries on both vector stores
# ============================================================
print(f"\n{'='*60}")
print("🔍 Testing Both Vector Stores")
print(f"{'='*60}")

sample_query = "IT support ticket"

# Test single-field vector store
print(f"\n📝 Testing single-field vector store...")
single_queryable = False
try:
    query_result_single = client.tool_runtime.rag_tool.query(
        content=sample_query,
        vector_db_ids=[str(VECTOR_STORE_ID_SINGLE_FIELD)],
        extra_body={"vector_store_ids": [str(VECTOR_STORE_ID_SINGLE_FIELD)]},
    )
    single_queryable = True
    print(f"   ✅ Single-field vector store is queryable!")
except Exception as e:
    print(f"   ⚠️  Could not query single-field vector store: {e}")

# Test multi-field vector store
print(f"\n📝 Testing multi-field vector store...")
multi_queryable = False
try:
    query_result_multi = client.tool_runtime.rag_tool.query(
        content=sample_query,
        vector_db_ids=[str(VECTOR_STORE_ID_MULTI_FIELD)],
        extra_body={"vector_store_ids": [str(VECTOR_STORE_ID_MULTI_FIELD)]},
    )
    multi_queryable = True
    print(f"   ✅ Multi-field vector store is queryable!")
except Exception as e:
    print(f"   ⚠️  Could not query multi-field vector store: {e}")

# Only report success when both test queries actually succeeded
print(f"\n{'='*60}")
if single_queryable and multi_queryable:
    print("✅ Both vector stores are ready for semantic search!")
else:
    print("⚠️  At least one vector store could not be queried; review the errors above.")
print(f"{'='*60}")


**What happened:** We verified both vector stores are working! ✅

**🎉 Success!** Your data is now ingested into both vector stores and ready for semantic search.

---

## 🎓 Key Takeaways

**What you accomplished:**
- ✅ Created two vector stores in LlamaStack (single-field and multi-field)
- ✅ Prepared documents for both ingestion modes
- ✅ Indexed documents into both vector databases
- ✅ Verified both ingestions were successful

**💡 Important Information to Save:**

```python
# Vector Store IDs: replace the placeholders with the IDs printed in Step 4
VECTOR_STORE_ID_SINGLE_FIELD = "{VECTOR_STORE_ID_SINGLE_FIELD}"
VECTOR_STORE_ID_MULTI_FIELD = "{VECTOR_STORE_ID_MULTI_FIELD}"
```
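One way to avoid copying the IDs by hand is to write them to a small JSON file that the query notebooks can read. The sketch below is only a suggestion: the file name `vector_store_ids.json`, the helper functions, and the example IDs are all hypothetical, and the demo writes to a temporary directory rather than to the workshop folder.

```python
import json
import tempfile
from pathlib import Path


def save_store_ids(path: Path, single_id: str, multi_id: str) -> None:
    """Write both vector store IDs to a JSON file."""
    payload = {"single_field": single_id, "multi_field": multi_id}
    path.write_text(json.dumps(payload, indent=2))


def load_store_ids(path: Path) -> dict:
    """Read the vector store IDs back from the JSON file."""
    return json.loads(path.read_text())


# Demo with placeholder IDs in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    ids_file = Path(tmp_dir) / "vector_store_ids.json"
    save_store_ids(ids_file, "vs_single_example", "vs_multi_example")
    loaded = load_store_ids(ids_file)

print(loaded)
```

In this notebook the two arguments would be `VECTOR_STORE_ID_SINGLE_FIELD` and `VECTOR_STORE_ID_MULTI_FIELD`, and the file would live somewhere both query notebooks can reach.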

**Next Steps:**
- **For single-field RAG**: Use `01_introduction_to_rag.ipynb` to query the single-field vector store
- **For multi-field RAG**: Use `02_advanced_rag_with_multiple_fields.ipynb` to query the multi-field vector store

**💡 Tip:** Both vector stores are now ready! You can use notebook 01 and notebook 02 without re-running ingestion.

---

## 🎉 You Did It!

You've successfully ingested IT tickets into both vector databases! The documents are now searchable using semantic similarity in both formats.

**What's next:** 
- **Single-field RAG**: `01_introduction_to_rag.ipynb` - Learn basic semantic search
- **Multi-field RAG**: `02_advanced_rag_with_multiple_fields.ipynb` - Learn advanced semantic search with full context
