# High-Fidelity Data Ingestion

Copyright 2025-2026, Denis Rothman

**Goal:** This notebook transforms our basic data pipeline into a high-fidelity ingestion system, a crucial prerequisite for the verifiable, citation-capable AI we are building in Chapter 9. We will simulate the work of a secure "Data Management Department" by taking raw source documents and processing them into a structured, metadata-rich knowledge base.

This process involves three key steps:

* **Prepare a Curated Dataset:** We will create and load several sample marketing documents, simulating a secure, pre-vetted data source ready for our engine.

* **Enrich Data with Source Metadata:** This is the core upgrade. We will modify the ingestion process to tag every single data chunk with its original document source, a critical step that enables verifiability and citations.

* **Verify the Ingestion:** We will conclude by running a test query to inspect the vector database and confirm that our high-fidelity metadata has been successfully stored.

**January 2026 Upgrade:**
In section *2.Initialize Clients*, we can either clear the index of its content or append to it:
```python
clear_index = True # If True, empties the index namespaces. If False, appends to existing index.
```

# 1.Installation and Setup

In [None]:
# 1.Installation and Setup
# -------------------------------------------------------------------------
# We install specific versions for stability and reproducibility.
# We include tiktoken for token-based chunking and tenacity for robust API calls.

In [None]:
!pip install tqdm==4.67.1 --upgrade
!pip install openai==1.104.2
!pip install pinecone==7.0.0 tqdm==4.67.1 tenacity==8.3.0

Collecting openai==1.104.2
  Downloading openai-1.104.2-py3-none-any.whl.metadata (29 kB)
Downloading openai-1.104.2-py3-none-any.whl (928 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 2.12.0
    Uninstalling openai-2.12.0:
      Successfully uninstalled openai-2.12.0
Successfully installed openai-1.104.2
Collecting pinecone==7.0.0
  Downloading pinecone-7.0.0-py3-none-any.whl.metadata (9.5 kB)
Collecting tenacity==8.3.0
  Downloading tenacity-8.3.0-py3-none-any.whl.metadata (1.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone==7.0.0)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-7.0.0-py3-none-any.whl (516 kB)

In [None]:
# Imports for this notebook
import json
import time
from tqdm.auto import tqdm
import tiktoken
from pinecone import Pinecone, ServerlessSpec
from tenacity import retry, stop_after_attempt, wait_random_exponential
# general imports required in the notebooks of this book
import re
import textwrap
from IPython.display import display, Markdown
import copy

In [None]:
# Imports and API Key Setup
# We will use the OpenAI library to interact with the LLM and Google Colab's
# secret manager to securely access your API key.

import os
from openai import OpenAI
from google.colab import userdata

# Load the API key from Colab secrets, set the env var, then init the client
try:
    api_key = userdata.get("API_KEY")
    if not api_key:
        raise userdata.SecretNotFoundError("API_KEY not found.")

    # Set environment variable for downstream tools/libraries
    os.environ["OPENAI_API_KEY"] = api_key

    # Create client (will read from OPENAI_API_KEY)
    client = OpenAI()
    print("OpenAI API key loaded and environment variable set successfully.")

except userdata.SecretNotFoundError:
    print('Secret "API_KEY" not found.')
    print('Please add your OpenAI API key to the Colab Secrets Manager.')
except Exception as e:
    print(f"An error occurred while loading the API key: {e}")

# Configuration
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536 # Dimension for text-embedding-3-small
GENERATION_MODEL = "gpt-5"

OpenAI API key loaded and environment variable set successfully.


In [None]:
try:
    # Standard way to access secrets securely in Google Colab
    from google.colab import userdata
    PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')
    if not PINECONE_API_KEY:
        raise ValueError("API Keys not found in Colab secrets.")
    print("API Keys loaded successfully.")
except ImportError:
    # Fallback for non-Colab environments (e.g., local Jupyter)
    PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
    if not PINECONE_API_KEY:
        print("Warning: API Keys not found. Ensure environment variables are set.")

API Keys loaded successfully.


# 2.Initialize Clients


We can either clear the index of its content or append to it. The flag below controls this choice; `True` empties both namespaces before ingestion:

```python
clear_index=True
```

When `clear_index=False`, **the new data will be APPENDED.** It will **not** overwrite the old data.

### The Reason (The ID Logic)

The notebook generates IDs for the vectors using this specific line of code:

```python
chunk_id = f"{doc_name}_chunk_{total_vectors_uploaded + j}"
```

This ID is made of three parts:

1. **`doc_name`**: The filename (e.g., `brand_style_guide.txt`).
2. **`_chunk_`**: A static text string.
3. **`total_vectors_uploaded + j`**: A numeric counter that restarts at 0 on every run and keeps increasing across the documents processed in that run.

### Why it Appends (Safe)

If the **"new data"** consists of files with **different filenames** than the previous run:

* **Run 1 (Old Data):** Processed `OldFile.txt`. ID generated: `OldFile.txt_chunk_0`.
* **Run 2 (New Data):** You clear the folder and add `NewFile.txt`. The counter resets to 0. ID generated: `NewFile.txt_chunk_0`.

Because the filename (`doc_name`) is part of the ID string, `OldFile.txt_chunk_0` is different from `NewFile.txt_chunk_0`. Pinecone sees them as completely different vectors and keeps both.
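
This behavior can be sketched without a database. The example below uses a plain dictionary as a stand-in for the Pinecone index, relying on the fact that an upsert replaces a record only when its ID already exists. The `fake_index` dictionary and the `simulate_run` helper are illustrative assumptions, not part of the notebook.

```python
def simulate_run(fake_index, doc_chunks):
    """Mimic one ingestion run: the counter restarts at 0 on every call."""
    total_vectors_uploaded = 0
    for doc_name, chunks in doc_chunks.items():
        for j, text in enumerate(chunks):
            chunk_id = f"{doc_name}_chunk_{total_vectors_uploaded + j}"
            # Same ID -> overwrite; new ID -> append
            fake_index[chunk_id] = text
        total_vectors_uploaded += len(chunks)

fake_index = {}  # stand-in for the index (assumption)
simulate_run(fake_index, {"OldFile.txt": ["old content"]})  # Run 1
simulate_run(fake_index, {"NewFile.txt": ["new content"]})  # Run 2

print(sorted(fake_index))
# ['NewFile.txt_chunk_0', 'OldFile.txt_chunk_0']
```

Both records survive the second run because their IDs differ only in the filename prefix.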

### The Caution (Potential Duplicates)

If you re-upload the **exact same file** (same filename), but it is processed in a different batch or order:

* The counter (`total_vectors_uploaded`) might differ from the first run.
* This would generate *new* IDs for the *same* content, resulting in duplicate data in your database rather than an overwrite.

**Summary:** As long as your new data has unique filenames compared to the old data, it will cleanly append to the index.
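
The duplicate scenario can be reproduced with a few lines of plain Python. The sketch below first generates IDs with a counter shared by all documents, as this notebook does, and shows that a different processing order changes the IDs of unchanged files. It then shows one possible alternative, numbering chunks per document, so that a re-upload would produce the same IDs and overwrite. Both `global_counter_ids` and `per_document_ids` are hypothetical helpers written for this illustration; the second is not the scheme used in this notebook.

```python
def global_counter_ids(doc_chunks):
    """IDs built with one counter shared by all documents (notebook scheme)."""
    ids, total = [], 0
    for doc_name, chunks in doc_chunks.items():
        ids += [f"{doc_name}_chunk_{total + j}" for j in range(len(chunks))]
        total += len(chunks)
    return ids

def per_document_ids(doc_chunks):
    """Hypothetical alternative: the counter restarts for every document."""
    return [f"{doc_name}_chunk_{k}"
            for doc_name, chunks in doc_chunks.items()
            for k in range(len(chunks))]

# Same two files and same content, processed in a different order
run_1 = {"a.txt": ["a0", "a1"], "b.txt": ["b0"]}
run_2 = {"b.txt": ["b0"], "a.txt": ["a0", "a1"]}

shared = set(global_counter_ids(run_1)) | set(global_counter_ids(run_2))
stable = set(per_document_ids(run_1)) | set(per_document_ids(run_2))

print(len(shared))  # 5 distinct IDs for 3 chunks: duplicates would be stored
print(len(stable))  # 3 distinct IDs: the second run would overwrite the first
```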



In [None]:
# 2.Initialize Clients
clear_index = True # If True, empties the index namespaces. If False, appends to existing index.

# --- Initialize Pinecone Client ---
pc = Pinecone(api_key=PINECONE_API_KEY)

# --- Define Index and Namespaces ---
INDEX_NAME = 'genai-mas-mcp-ch3'
NAMESPACE_KNOWLEDGE = "KnowledgeStore"
NAMESPACE_CONTEXT = "ContextLibrary"
spec = ServerlessSpec(cloud='aws', region='us-east-1')

# Check if index exists
if INDEX_NAME not in pc.list_indexes().names():
    print(f"Index '{INDEX_NAME}' not found. Creating new serverless index...")
    pc.create_index(
        name=INDEX_NAME,
        dimension=EMBEDDING_DIM,
        metric='cosine',
        spec=spec
    )
    # Wait for index to be ready
    while not pc.describe_index(INDEX_NAME).status['ready']:
        print("Waiting for index to be ready...")
        time.sleep(1)
    print("Index created successfully. It is new and empty.")
else:
    print(f"Index '{INDEX_NAME}' already exists.")

    if clear_index:
        print("clear_index=True. Clearing namespaces for a fresh start...")
        index = pc.Index(INDEX_NAME)
        namespaces_to_clear = [NAMESPACE_KNOWLEDGE, NAMESPACE_CONTEXT]

        for namespace in namespaces_to_clear:
            # Check if namespace exists and has vectors before deleting
            stats = index.describe_index_stats()
            if namespace in stats.namespaces and stats.namespaces[namespace].vector_count > 0:
                print(f"Clearing namespace '{namespace}'...")
                index.delete(delete_all=True, namespace=namespace)

                # CRITICAL: Wait for the deletion to complete before continuing
                while True:
                    stats = index.describe_index_stats()
                    if namespace not in stats.namespaces or stats.namespaces[namespace].vector_count == 0:
                        print(f"Namespace '{namespace}' cleared successfully.")
                        break
                    print(f"Waiting for namespace '{namespace}' to clear...")
                    time.sleep(5) # Poll every 5 seconds
            else:
                print(f"Namespace '{namespace}' is already empty or does not exist. Skipping.")
    else:
        print("clear_index=False. Index will not be emptied (Append mode).")

# Connect to the index for subsequent operations
index = pc.Index(INDEX_NAME)

Index 'genai-mas-mcp-ch3' already exists.
clear_index=False. Index will not be emptied (Append mode).
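
Both polling loops in the cell above run until Pinecone reports the expected state, with no upper bound on the wait. As a hedged sketch, a small helper can add a timeout to that pattern. `wait_until`, its parameters, and the choice to raise `TimeoutError` are assumptions introduced here for illustration; they are not part of the Pinecone client.

```python
import time

def wait_until(predicate, timeout=120.0, interval=5.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds.")
        time.sleep(interval)

# Self-contained demonstration: the condition becomes true on the third check.
checks = {"count": 0}

def ready_on_third_check():
    checks["count"] += 1
    return checks["count"] >= 3

wait_until(ready_on_third_check, timeout=1.0, interval=0.01)
print(checks["count"])  # 3
```

In this notebook, the predicate could be `lambda: pc.describe_index(INDEX_NAME).status['ready']`, which is the same check the index-creation loop already performs.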


# 3.Data Preparation: The Context Library (Procedural RAG)

In [None]:
# Create a directory to store our source documents
if not os.path.exists("marketing_documents"):
    os.makedirs("marketing_documents")

In [None]:
#@title Document 1: Brand Style Guide
brand_style_guide_text = """
Brand Voice and Tone Guide: "Innovate Forward"

Our brand voice is guided by three core principles: Clarity, Confidence, and Aspiration. We are expert guides, not academic lecturers. Our tone should always be empowering, forward-looking, and accessible.

1. Clarity:
- Use simple, direct language. Avoid jargon and overly technical terms.
- Prefer short, declarative sentences.
- Structure content with clear headings and bullet points for scannability.
- Goal: Make complex topics feel simple and understandable.

2. Confidence:
- Use an active voice. (e.g., "Our system delivers results," not "Results are delivered by our system.")
- Be authoritative but not arrogant. State facts and benefits directly.
- Avoid hedging language like "might," "could," or "perhaps."
- Goal: Instill trust and convey expertise.

3. Aspiration:
- Focus on the benefits, not just the features. Frame our product as a tool for achieving a better future.
- Use forward-looking and positive language (e.g., "imagine," "transform," "unlock").
- Speak to the user's goals and ambitions.
- Goal: Inspire the user and connect our brand to their success.

Forbidden Language:
- Never use overly casual slang or unprofessional language.
- Do not make specific, quantitative promises that cannot be universally guaranteed (e.g., "You will increase profits by 300%").
- Avoid negative comparisons to competitors. Focus on our strengths.
"""
with open("marketing_documents/brand_style_guide.txt", "w") as f:
    f.write(brand_style_guide_text)

print("✅ Created marketing_documents/brand_style_guide.txt")

✅ Created marketing_documents/brand_style_guide.txt


In [None]:
#@title Document 2: Product Spec Sheet
product_spec_sheet_text = """
Product Specification Sheet: Project QuantumDrive

Product Name: QuantumDrive Q-1
Product Type: Solid-State Drive (SSD)
Target Market: Creative Professionals (Video Editors, 3D Artists, Photographers)

Core Features:
- Storage Capacity: Available in 2TB, 4TB, and 8TB models.
- Read Speed: Sequential read speeds up to 7,500 MB/s.
- Write Speed: Sequential write speeds up to 7,000 MB/s.
- Interface: NVMe 2.0, PCIe Gen 5.
- Endurance Rating: 3,000 Terabytes Written (TBW) for 4TB model.
- Cooling System: Integrated graphene heat spreader. Prevents thermal throttling under sustained load.
- Software: Includes "DataWeaver" backup and encryption suite. AES-256 bit hardware encryption.
- Warranty: 5-year limited warranty.
"""
with open("marketing_documents/product_spec_sheet.txt", "w") as f:
    f.write(product_spec_sheet_text)

print("✅ Created marketing_documents/product_spec_sheet.txt")

✅ Created marketing_documents/product_spec_sheet.txt


In [None]:
#@title Document 3: Competitor Press Release
competitor_press_release_text = """
FOR IMMEDIATE RELEASE

ChronoTech Unveils the Chrono SSD Pro: Speed for the Modern Creator

CUPERTINO, CA – ChronoTech today announced the launch of the Chrono SSD Pro, its new flagship solid-state drive. Aimed at digital artists and content creators, the Chrono SSD Pro prioritizes raw performance to reduce workflow bottlenecks.

"Creators are tired of waiting. The Chrono SSD Pro is our answer," said Jane Doe, CEO of ChronoTech. "We've focused on delivering the fastest possible read and write speeds to ensure that technology never gets in the way of creativity."

The new drive boasts sequential read speeds of 7,300 MB/s and is built on the proven PCIe Gen 4 platform. ChronoTech is emphasizing its value proposition, offering the 4TB model at a highly competitive price point. The Chrono SSD Pro is available for purchase today.
"""
with open("marketing_documents/competitor_press_release.txt", "w") as f:
    f.write(competitor_press_release_text)

print("✅ Created marketing_documents/competitor_press_release.txt")

✅ Created marketing_documents/competitor_press_release.txt


In [None]:
#@title Document 4: Social Media Brief
social_media_brief_text = """
Social Media Campaign Brief: QuantumDrive Q1 Launch

Campaign Goal: Generate excitement and drive pre-orders for the new QuantumDrive Q-1.

Target Audience:
- Primary: Professional video editors and 3D artists on LinkedIn and Twitter.
- Secondary: Tech enthusiasts and PC builders on Instagram and Reddit.

Key Messages:
1. End the Wait: Focus on the theme of speed. Emphasize how the QuantumDrive eliminates rendering and loading times.
2. Built for Pros: Highlight the professional-grade features like the graphene heat spreader and hardware encryption.
3. The Ultimate Upgrade: Position the QuantumDrive as the single most impactful upgrade a creative professional can make to their workstation.

Call to Action (CTA): Drive users to the pre-order page on our website. Use a trackable link.

Hashtags: #QuantumDrive #EndTheWait #BuiltForPros #SSD
"""
with open("marketing_documents/social_media_brief.txt", "w") as f:
    f.write(social_media_brief_text)

print("✅ Created marketing_documents/social_media_brief.txt")

✅ Created marketing_documents/social_media_brief.txt


In [None]:
#@title Document 5: SEO Keywords
seo_keywords_text = """
SEO Target Keywords & Topics - 2025

Primary Keyword: "best ssd for video editing"

Secondary Keywords:
- "fastest ssd for 4k video"
- "nvme gen 5 ssd"
- "high endurance ssd for professionals"
- "video editing storage solutions"

Content Goals:
- Create a pillar page for "The Ultimate Guide to Video Editing Storage."
- Write supporting blog posts for each of the secondary keywords.
- Ensure all content is authoritative, helpful, and links back to the QuantumDrive product page where appropriate.
- Target a technical but accessible tone.
"""
with open("marketing_documents/seo_keywords.txt", "w") as f:
    f.write(seo_keywords_text)

print("✅ Created marketing_documents/seo_keywords.txt")

✅ Created marketing_documents/seo_keywords.txt


In [None]:
#@title Document 6: Customer Interview Notes
customer_interview_notes_text = """
Customer Interview Notes: Maria R., Freelance Video Editor

Background:
- Works with 4K and 6K video files from multiple clients.
- Current workstation is 2 years old.
- Struggles with project deadlines.

Pain Points:
- "My current drive is the bottleneck. I spend hours just waiting for files to transfer or for a timeline to render. It's dead time."
- "Had a drive fail on me last year. Lost a whole project. Now I'm paranoid about backups, which takes even more time."
- "When a drive overheats, the speed drops, and my whole system grinds to a halt right in the middle of a critical render. It's incredibly frustrating."

Goals:
- Wants to reduce wasted time and take on more client work.
- Needs a storage solution that is not just fast, but reliable and secure.
- "I just want my tools to disappear. I want to focus on the creative work, not the hardware."
"""
with open("marketing_documents/customer_interview_notes.txt", "w") as f:
    f.write(customer_interview_notes_text)

print("✅ Created marketing_documents/customer_interview_notes.txt")

✅ Created marketing_documents/customer_interview_notes.txt


In [None]:
#@title Document 7: Email Nurture Outline
email_nurture_outline_text = """
Email Nurture Sequence Outline: New Lead Follow-Up

Audience: Users who downloaded our "Video Editing Storage Guide."
Goal: Nurture the lead and guide them toward a purchase of the QuantumDrive.

Email 1: The Problem (Send 1 day after download)
- Objective: Acknowledge their pain point (slow storage).
- Content: Briefly introduce the concept of workflow bottlenecks and how they kill creativity.
- CTA: "Is slow storage holding you back?" (No product mention yet).

Email 2: The Solution (Send 3 days after download)
- Objective: Introduce the QuantumDrive as the solution.
- Content: Highlight the key benefits from the spec sheet (speed, reliability). Focus on the "End the Wait" message.
- CTA: Link to the QuantumDrive product page.

Email 3: The Proof (Send 5 days after download)
- Objective: Build trust with social proof.
- Content: (Fictional) Include a short testimonial from a professional editor. Reiterate the 5-year warranty.
- CTA: "Ready to upgrade? Pre-order your QuantumDrive today."
"""
with open("marketing_documents/email_nurture_outline.txt", "w") as f:
    f.write(email_nurture_outline_text)

print("✅ Created marketing_documents/email_nurture_outline.txt")

✅ Created marketing_documents/email_nurture_outline.txt


In [None]:
# 3.Data Preparation: The Context Library (Procedural RAG)
# -------------------------------------------------------------------------
# We define the Semantic Blueprints derived from Chapter 1.
# CRITICAL: We embed the 'description' (the intent), so the Librarian agent
# can find the right blueprint based on the desired style. The 'blueprint'
# itself is stored as metadata.

context_blueprints = [
    {
        "id": "blueprint_suspense_narrative",
        "description": "A precise Semantic Blueprint designed to generate suspenseful and tense narratives, suitable for children's stories. Focuses on atmosphere, perceived threats, and emotional impact. Ideal for creative writing.",
        "blueprint": json.dumps({
              "scene_goal": "Increase tension and create suspense.",
              "style_guide": "Use short, sharp sentences. Focus on sensory details (sounds, shadows). Maintain a slightly eerie but age-appropriate tone.",
              "participants": [
                { "role": "Agent", "description": "The protagonist experiencing the events." },
                { "role": "Source_of_Threat", "description": "The underlying danger or mystery." }
              ],
              "instruction": "Rewrite the provided facts into a narrative adhering strictly to the scene_goal and style_guide."
            })
    },
    {
        "id": "blueprint_technical_explanation",
        "description": "A Semantic Blueprint designed for technical explanation or analysis. This blueprint focuses on clarity, objectivity, and structure. Ideal for breaking down complex processes, explaining mechanisms, or summarizing scientific findings.",
        "blueprint": json.dumps({
              "scene_goal": "Explain the mechanism or findings clearly and concisely.",
              "style_guide": "Maintain an objective and formal tone. Use precise terminology. Prioritize factual accuracy and clarity over narrative flair.",
              "structure": ["Definition", "Function/Operation", "Key Findings/Impact"],
              "instruction": "Organize the provided facts into the defined structure, adhering to the style_guide."
            })
    },
    {
        "id": "blueprint_casual_summary",
        "description": "A goal-oriented context for creating a casual, easy-to-read summary. Focuses on brevity and accessibility, explaining concepts simply.",
        "blueprint": json.dumps({
              "scene_goal": "Summarize information quickly and casually.",
              "style_guide": "Use informal language. Keep it brief and engaging. Imagine explaining it to a friend.",
              "instruction": "Summarize the provided facts using the casual style guide."
            })
    }
]

print(f"\nPrepared {len(context_blueprints)} context blueprints.")


Prepared 3 context blueprints.
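
The blueprint is serialized with `json.dumps` because Pinecone metadata values are limited to simple types (strings, numbers, booleans, and lists of strings), so a nested dictionary cannot be stored directly. A consumer therefore has to decode the string again after retrieval. The following is a minimal sketch of that round trip; `stored_metadata` is a stand-in for the metadata of a retrieved match, and the blueprint text is abbreviated for the example.

```python
import json

# Abbreviated blueprint, used only for this illustration
blueprint = {
    "scene_goal": "Summarize information quickly and casually.",
    "style_guide": "Use informal language. Keep it brief and engaging.",
}

# What gets stored: the nested structure flattened into one JSON string
stored_metadata = {
    "description": "A casual, easy-to-read summary.",
    "blueprint_json": json.dumps(blueprint),
}

# What a consumer does after retrieval: decode the string back into a dict
restored = json.loads(stored_metadata["blueprint_json"])
print(type(stored_metadata["blueprint_json"]).__name__)  # str
print(restored["scene_goal"])
```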


In [None]:
#@title Updating the Data Loading and Processing Logic
# -------------------------------------------------------------------------
# Load all documents from our new directory
knowledge_base = {}
doc_dir = "marketing_documents"
for filename in os.listdir(doc_dir):
    if filename.endswith(".txt"):
        with open(os.path.join(doc_dir, filename), 'r') as f:
            knowledge_base[filename] = f.read()

print(f"📚 Loaded {len(knowledge_base)} documents into the knowledge base.")

📚 Loaded 7 documents into the knowledge base.


In [None]:
#@title 4.Helper Functions for Chunking and Embedding
# -------------------------------------------------------------------------

# Initialize tokenizer for robust, token-aware chunking
tokenizer = tiktoken.get_encoding("cl100k_base")

def chunk_text(text, chunk_size=400, overlap=50):
    """Chunks text based on token count with overlap (Best practice for RAG)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = tokenizer.encode(text)
    chunks = []
    # Each window starts (chunk_size - overlap) tokens after the previous one
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        # Use a distinct local name so the function name is not shadowed
        chunk = tokenizer.decode(chunk_tokens)
        # Basic cleanup
        chunk = chunk.replace("\n", " ").strip()
        if chunk:
            chunks.append(chunk)
    return chunks

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def get_embeddings_batch(texts, model=EMBEDDING_MODEL):
    """Generates embeddings for a batch of texts using OpenAI, with retries."""
    # Normalize newlines to spaces before sending the texts to the embedding API
    texts = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
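
To see how `chunk_size` and `overlap` interact, the sketch below reproduces only the window arithmetic of `chunk_text` on token positions, without a tokenizer. `chunk_windows` is a hypothetical helper added for illustration.

```python
def chunk_windows(n_tokens, chunk_size=400, overlap=50):
    """Return the (start, end) token positions of each chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # a new window starts every `step` tokens
    return [(start, min(start + chunk_size, n_tokens))
            for start in range(0, n_tokens, step)]

print(chunk_windows(900))
# [(0, 400), (350, 750), (700, 900)]
print(chunk_windows(380))
# [(0, 380), (350, 380)]
```

With the default settings, consecutive chunks share 50 tokens. A 380-token document still yields two chunks, because a new window starts every 350 tokens; the second window is entirely covered by the first.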


In [None]:
#@title Process and Upload Data (High-Fidelity Version)

# --- 6.1. Context Library (No Changes) ---
print(f"\nProcessing and uploading Context Library to namespace: {NAMESPACE_CONTEXT}")
# ... (The existing code for context_blueprints remains the same) ...
vectors_context = []
for item in tqdm(context_blueprints):
    embedding = get_embeddings_batch([item['description']])[0]
    vectors_context.append({
        "id": item['id'],
        "values": embedding,
        "metadata": { "description": item['description'], "blueprint_json": item['blueprint'] }
    })
if vectors_context:
    index.upsert(vectors=vectors_context, namespace=NAMESPACE_CONTEXT)
    print(f"Successfully uploaded {len(vectors_context)} context vectors.")

# --- 6.2. Knowledge Base (UPGRADED FOR HIGH-FIDELITY RAG) ---
print(f"\nProcessing and uploading Knowledge Base to namespace: {NAMESPACE_KNOWLEDGE}")
batch_size = 100
total_vectors_uploaded = 0

for doc_name, doc_content in knowledge_base.items():
    print(f"  - Processing document: {doc_name}")
    # Chunk the document content
    knowledge_chunks = chunk_text(doc_content)

    # Process in batches
    for i in tqdm(range(0, len(knowledge_chunks), batch_size), desc=f"  Uploading {doc_name}"):
        batch_texts = knowledge_chunks[i:i+batch_size]
        batch_embeddings = get_embeddings_batch(batch_texts)

        batch_vectors = []
        for j, embedding in enumerate(batch_embeddings):
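            # NOTE: j restarts at 0 in each batch, so these IDs are unique only while
            # a document fits in a single batch (at most batch_size chunks).
            chunk_id = f"{doc_name}_chunk_{total_vectors_uploaded + j}"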

            # CRITICAL UPGRADE: Add the 'source' document name to the metadata
            batch_vectors.append({
                "id": chunk_id,
                "values": embedding,
                "metadata": {
                    "text": batch_texts[j],
                    "source": doc_name  # This is the key to verifiability
                }
            })

        # Upsert the batch
        index.upsert(vectors=batch_vectors, namespace=NAMESPACE_KNOWLEDGE)

    total_vectors_uploaded += len(knowledge_chunks)

print(f"\n✅ Successfully uploaded {total_vectors_uploaded} knowledge vectors from {len(knowledge_base)} documents.")

In [None]:
#@title 5.Final Verification
# -------------------------------------------------------------------------
print("\nIngestion complete. Final Pinecone Index Stats (may take a moment to update):")
time.sleep(15) # Give Pinecone a moment to update stats
print(index.describe_index_stats())


Ingestion complete. Final Pinecone Index Stats (may take a moment to update):
{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'ContextLibrary': {'vector_count': 3},
                'KnowledgeStore': {'vector_count': 10}},
 'total_vector_count': 13,
 'vector_type': 'dense'}


In [None]:
#@title Verify Metadata Ingestion
# This step confirms our 'source' metadata was successfully added.
import pprint
print("Querying a sample vector to verify metadata...")

# Get embedding for a sample query
query_embedding = get_embeddings_batch(["Sum up the lead follow up"])[0]

# Query Pinecone
results = index.query(
    vector=query_embedding,
    top_k=1,
    namespace=NAMESPACE_KNOWLEDGE,
    include_metadata=True
)

# Print the metadata of the top result
if results['matches']:
    top_match_metadata = results['matches'][0]['metadata']
    print("\n✅ Verification successful! Metadata of top match:")
    pprint.pprint(top_match_metadata)
else:
    print("❌ Verification failed. No results found.")

Querying a sample vector to verify metadata...

✅ Verification successful! Metadata of top match:
{'source': 'email_nurture_outline.txt',
 'text': 'Email Nurture Sequence Outline: New Lead Follow-Up  Audience: Users '
         'who downloaded our "Video Editing Storage Guide." Goal: Nurture the '
         'lead and guide them toward a purchase of the QuantumDrive.  Email 1: '
         'The Problem (Send 1 day after download) - Objective: Acknowledge '
         'their pain point (slow storage). - Content: Briefly introduce the '
         'concept of workflow bottlenecks and how they kill creativity. - CTA: '
         '"Is slow storage holding you back?" (No product mention yet).  Email '
         '2: The Solution (Send 3 days after download) - Objective: Introduce '
         'the QuantumDrive as the solution. - Content: Highlight the key '
         'benefits from the spec sheet (speed, reliability). Focus on the "End '
         'the Wait" message. - CTA: Link to the QuantumDrive produ
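
Once every chunk carries a `source` field, a downstream component can turn retrieved matches into citations. The sketch below assumes matches shaped like the query result above (dictionaries with a `metadata` entry). `format_citations` and the sample data are hypothetical and only illustrate the idea.

```python
def format_citations(matches):
    """Return a numbered list of unique sources, in retrieval order."""
    sources = []
    for match in matches:
        source = match["metadata"].get("source", "unknown")
        if source not in sources:
            sources.append(source)
    return [f"[{n}] {source}" for n, source in enumerate(sources, start=1)]

# Hypothetical matches; real ones would come from index.query(...)
sample_matches = [
    {"metadata": {"source": "email_nurture_outline.txt", "text": "..."}},
    {"metadata": {"source": "product_spec_sheet.txt", "text": "..."}},
    {"metadata": {"source": "email_nurture_outline.txt", "text": "..."}},
]

for line in format_citations(sample_matches):
    print(line)
# [1] email_nurture_outline.txt
# [2] product_spec_sheet.txt
```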