<a href="https://colab.research.google.com/github/saskinosie/CalibrateAI-5-12-25/blob/main/_calibrateai__hack_day_micro_conference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search with Weaviate's Query Agent, LlamaIndex and Comet

## Prerequisites
For this workshop you will need:

* A (free) Weaviate Cloud (WCD) account
* A cluster set up in WCD
* The REST endpoint for your cluster
* Your cluster Admin API key
* An OpenAI API key

In this workshop we will create a Retrieval Augmented Generation system leveraging Weaviate's Query Agent, LlamaIndex and Comet.


We use LlamaIndex to split full-text PDF research articles into manageable, structured text chunks. Each chunk is enriched with metadata (title, slug, and a heuristically detected section) and uploaded to a Weaviate vector database. This supports semantic search over a collection of space medicine literature, which we query in natural language through Weaviate's Query Agent. In the subsequent workshop, we will use Comet's end-to-end model evaluation platform to benchmark our RAG system.
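
To make the chunk format concrete, here is a minimal sketch of the record each chunk becomes before upload. The helper name `make_chunk_record` and the sample values are hypothetical, chosen only for illustration; the field names and the slug rule follow the chunking code later in this notebook.

```python
import re

def make_chunk_record(text, title, section="Unknown"):
    """Hypothetical helper: bundle chunk text with the metadata stored alongside it."""
    # Slug rule mirrors the slugify function defined later in this notebook
    slug = re.sub(r"\s+", "-", re.sub(r"[^\w\s-]", "", title.lower())).strip("-")
    return {"text": text, "metadata": {"title": title, "slug": slug, "section": section}}

# Placeholder values, not taken from the research articles
record = make_chunk_record("Placeholder chunk text.", "An Example Article: Title", "Methods")
print(record["metadata"])
```

The `title` and `content` fields are the ones the collection is configured to vectorize, while `slug` and `section` are stored as plain metadata.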

### Link to slide deck
https://docs.google.com/presentation/d/1UbDpA0dhuHiiSu5vsiv_PPA9HzgzZKcRn7t4zuoFzzA/edit?usp=sharing

### Link to repo and instructions for collecting REST endpoint and API keys
https://github.com/saskinosie/CalibrateAI-5-12-25

### Link to G-Drive folder with research articles
https://drive.google.com/drive/folders/18iu8lGJ0SEZcISkUqc20pecGrb61Mo7s?usp=drive_link

In [None]:
!pip install llama-index pymupdf weaviate-client weaviate-agents


Collecting llama-index
  Downloading llama_index-0.12.35-py3-none-any.whl.metadata (12 kB)
Collecting pymupdf
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting weaviate-client
  Downloading weaviate_client-4.14.1-py3-none-any.whl.metadata (3.7 kB)
Collecting weaviate-agents
  Downloading weaviate_agents-0.6.0-py3-none-any.whl.metadata (1.1 kB)
Collecting llama-index-agent-openai<0.5,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.7-py3-none-any.whl.metadata (438 bytes)
Collecting llama-index-cli<0.5,>=0.4.1 (from llama-index)
  Downloading llama_index_cli-0.4.1-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.13,>=0.12.35 (from llama-index)
  Downloading llama_index_core-0.12.35-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-index-embeddings-openai<0.4,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.1-py3-none-any.whl.metadata (684 bytes)
Collecting lla

In [None]:
import fitz  # PyMuPDF
import json
import os  # used by the PDF download helper below
import re
import requests
from llama_index.core import Document
from llama_index.core.node_parser import HierarchicalNodeParser
import weaviate

In [None]:
from google.colab import userdata

WEAVIATE_URL = userdata.get("WEAVIATE_URL")
WEAVIATE_API_KEY = userdata.get("WEAVIATE_API_KEY")
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")

# Confirm each secret loaded without echoing its value into the notebook output
print("Weaviate URL set:", bool(WEAVIATE_URL))
print("Weaviate API key set:", bool(WEAVIATE_API_KEY))
print("OpenAI API key set:", bool(OPENAI_API_KEY))


In [None]:
client = weaviate.connect_to_weaviate_cloud(
    WEAVIATE_URL,
    auth_credentials=weaviate.AuthApiKey(WEAVIATE_API_KEY),
    headers={
        "X-OpenAI-Api-Key": OPENAI_API_KEY
    }
)

In [None]:
assert client.is_ready(), "Weaviate client is not ready. Check credentials and endpoint."


In [None]:
client.is_ready()

True

In [None]:
from weaviate.classes.config import Configure

# Create the collection only if it is missing. Re-running an unguarded create
# fails with a 422 "class name SpaceMedResearch already exists" error.
if not client.collections.exists("SpaceMedResearch"):
    client.collections.create(
        name="SpaceMedResearch",
        vectorizer_config=[
            Configure.NamedVectors.text2vec_weaviate(
                name="main_vector",
                model="Snowflake/snowflake-arctic-embed-l-v2.0",
                source_properties=["title", "content"],
            )
        ],
    )

In [None]:
gdrive_links = [
    "https://drive.google.com/file/d/1bNX5nZTif8roMK1bFaJmHF6wxapi5YDg/view?usp=sharing",
    "https://drive.google.com/file/d/1FZkvMOyTP_-kSIyx9VewaV9tP_XwpZXK/view?usp=drive_link",
    "https://drive.google.com/file/d/1jfcCLHmAazvs7DnAhd3jS0LOb7qOctMc/view?usp=drive_link",
    "https://drive.google.com/file/d/1K8D6VOe2aAX6tIfJWzF2-9zWbqH0C_wp/view?usp=drive_link",
    "https://drive.google.com/file/d/12ee59tcUcxotC1NFfz0EaLNWAAqakKDk/view?usp=drive_link",
    "https://drive.google.com/file/d/115LBMKIobYRdqKWqL2uR1VWoPS5zqoZq/view?usp=drive_link",
    "https://drive.google.com/file/d/1CcjaMYUIQNJ2S4nFGHIh0hpkeG158-Ag/view?usp=drive_link",
    "https://drive.google.com/file/d/1eR6rTQcYw_q4Lob2JFB2_cnND9U2VivA/view?usp=drive_link",
    "https://drive.google.com/file/d/1Gw9UQGNIcDTLpCaamYm4WoeYqUVlKsAG/view?usp=drive_link",
    "https://drive.google.com/file/d/1E961JtImN2eis_IxK5EZS3JaqhXkyhK8/view?usp=drive_link",
    "https://drive.google.com/file/d/1G5xQ10Ijjhrnm_Uq_hl2OgXNaKERfB3G/view?usp=drive_link",
    "https://drive.google.com/file/d/1u-nmLQIvBdcRomCo__yvoCCVNHV3AiKg/view?usp=drive_link",
    "https://drive.google.com/file/d/1cfF_cRkvfaTw5BTpiMxalw0Bc-IdBBnO/view?usp=drive_link",
    "https://drive.google.com/file/d/1WSMqabWY4pElGrQjVVkkaJ8NNEuP6T10/view?usp=drive_link",
    "https://drive.google.com/file/d/11bCbFObW-51XE0lMS3sz5-VYvWsxctRm/view?usp=drive_link",
    "https://drive.google.com/file/d/13AFfg8doORRytR3IXfTOmL1vpp6hsCuG/view?usp=drive_link",
    "https://drive.google.com/file/d/1k8QYuAsyzkMTJA-KzPrOX2BToKmzSr3l/view?usp=drive_link"

]

In [None]:
# PDF wrangling function and chunk setup

def download_google_drive_pdf(share_url, output_folder="downloads"):
    os.makedirs(output_folder, exist_ok=True)
    file_id_match = re.search(r"/d/([^/]+)", share_url)
    if not file_id_match:
        raise ValueError(f"Invalid Google Drive URL: {share_url}")
    file_id = file_id_match.group(1)
    download_url = f"https://drive.google.com/uc?export=download&id={file_id}"
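    response = requests.get(download_url, timeout=60)
    response.raise_for_status()  # surface HTTP errors instead of writing an error body to disk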
    pdf_path = os.path.join(output_folder, f"{file_id}.pdf")
    with open(pdf_path, "wb") as f:
        f.write(response.content)
    return pdf_path

def extract_text(filepath):
    # Open with a context manager so the PDF handle is closed after reading
    with fitz.open(filepath) as doc:
        return "\n".join(page.get_text("text") for page in doc)

def extract_title(text):
    candidate_block = text[:1000]
    lines = [line.strip() for line in candidate_block.split("\n") if line.strip()]
    for i, line in enumerate(lines):
        if line.lower() != line and len(line.split()) > 5 and not line.endswith(":") and i < 5:
            return line
    return "Unknown Title"

def slugify(text):
    text = text.lower()
    text = re.sub(r"[^\w\s-]", "", text)
    text = re.sub(r"\s+", "-", text)
    return text.strip("-")

def detect_section(text_chunk, chunk_index=0):
    # Heuristic: look for a section keyword in the first 150 characters of the chunk
    head = text_chunk.lower()[:150]
    if "introduction" in head or chunk_index == 0:
        return "Introduction"
    elif "methods" in head:  # also matches "materials and methods"
        return "Methods"
    elif "results" in head:
        return "Results"
    elif "discussion" in head:
        return "Discussion"
    elif "conclusion" in head:
        return "Conclusion"
    else:
        return "Unknown"

def chunk_for_weaviate(text, title=None):
    from llama_index.core.node_parser import HierarchicalNodeParser
    from llama_index.core.text_splitter import SentenceSplitter

    if not title:
        title = extract_title(text)
    slug = slugify(title)
    document = Document(text=text, metadata={"title": title, "slug": slug})

    # Build a hierarchical parser with its default chunk sizes
    parser = HierarchicalNodeParser.from_defaults()

    nodes = parser.get_nodes_from_documents([document])

    # Further chunk the nodes if they're too large using SentenceSplitter
    text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
    smaller_nodes = []
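    for node in nodes:
        split_texts = text_splitter.split_text(node.text)
        for split_text in split_texts:
            smaller_nodes.append({
                "text": split_text,
                "metadata": {
                    "title": title,
                    "slug": slug,
                    # Use the running chunk count so only the document's first chunk
                    # defaults to "Introduction"; a per-node index resets to 0 for every node
                    "section": detect_section(split_text, chunk_index=len(smaller_nodes)),
                },
            })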


    return smaller_nodes
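
The download helper above rewrites each share link into a direct-download URL by pulling out the file ID. The sketch below isolates that step so it can be checked without any network call. `drive_download_url` and `looks_like_pdf` are hypothetical names introduced here. The `%PDF-` check is an optional safeguard that is not part of the original code; it relies on the assumption that Drive may return an HTML page rather than the file for some links.

```python
import re

def drive_download_url(share_url):
    """Turn a Drive share link into the direct-download form used by the helper above."""
    match = re.search(r"/d/([^/]+)", share_url)
    if match is None:
        raise ValueError(f"Invalid Google Drive URL: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"

def looks_like_pdf(data):
    """Hypothetical check: PDF files begin with the bytes %PDF-."""
    return data.startswith(b"%PDF-")

url = "https://drive.google.com/file/d/1bNX5nZTif8roMK1bFaJmHF6wxapi5YDg/view?usp=sharing"
print(drive_download_url(url))
print(looks_like_pdf(b"%PDF-1.7 ..."), looks_like_pdf(b"<!DOCTYPE html>"))
```

One way to use the check is to call `looks_like_pdf(response.content)` before writing the file, and skip or report any link that fails it.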



In [None]:
# Chunking Function

def process_gdrive_links(gdrive_links, output_path="demo_chunks.json"):
    all_chunks = []
    for link in gdrive_links:
        print(f"Processing: {link}")
        try:
            pdf_path = download_google_drive_pdf(link)
            text = extract_text(pdf_path)
            chunks = chunk_for_weaviate(text)
            all_chunks.extend(chunks)
        except Exception as e:
            print(f"❌ Failed to process {link}: {e}")

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(all_chunks, f, ensure_ascii=False, indent=2)
    print(f"\n✅ Saved {len(all_chunks)} chunks to {output_path}")

In [None]:
# Chunk PDFs
import os
process_gdrive_links(gdrive_links, output_path="demo_chunks.json")

Processing: https://drive.google.com/file/d/1bNX5nZTif8roMK1bFaJmHF6wxapi5YDg/view?usp=sharing
Processing: https://drive.google.com/file/d/1FZkvMOyTP_-kSIyx9VewaV9tP_XwpZXK/view?usp=drive_link
Processing: https://drive.google.com/file/d/1jfcCLHmAazvs7DnAhd3jS0LOb7qOctMc/view?usp=drive_link
Processing: https://drive.google.com/file/d/1K8D6VOe2aAX6tIfJWzF2-9zWbqH0C_wp/view?usp=drive_link
Processing: https://drive.google.com/file/d/12ee59tcUcxotC1NFfz0EaLNWAAqakKDk/view?usp=drive_link
Processing: https://drive.google.com/file/d/115LBMKIobYRdqKWqL2uR1VWoPS5zqoZq/view?usp=drive_link
Processing: https://drive.google.com/file/d/1CcjaMYUIQNJ2S4nFGHIh0hpkeG158-Ag/view?usp=drive_link
Processing: https://drive.google.com/file/d/1eR6rTQcYw_q4Lob2JFB2_cnND9U2VivA/view?usp=drive_link
Processing: https://drive.google.com/file/d/1Gw9UQGNIcDTLpCaamYm4WoeYqUVlKsAG/view?usp=drive_link
Processing: https://drive.google.com/file/d/1E961JtImN2eis_IxK5EZS3JaqhXkyhK8/view?usp=drive_link
Processing: https://dri
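
Before uploading, it can help to see how the section heuristic labeled the chunks. The following is a minimal sketch, assuming records shaped like those written to `demo_chunks.json`. The function name `summarize_sections` and the sample records are hypothetical placeholders.

```python
from collections import Counter

def summarize_sections(chunks):
    """Count chunks per detected section label."""
    return Counter(
        chunk.get("metadata", {}).get("section", "Unknown") for chunk in chunks
    )

# Hypothetical records in the same shape as the entries in demo_chunks.json
sample_chunks = [
    {"text": "placeholder a", "metadata": {"title": "T", "slug": "t", "section": "Introduction"}},
    {"text": "placeholder b", "metadata": {"title": "T", "slug": "t", "section": "Methods"}},
    {"text": "placeholder c", "metadata": {"title": "T", "slug": "t", "section": "Unknown"}},
    {"text": "placeholder d", "metadata": {"title": "T", "slug": "t", "section": "Unknown"}},
]

counts = summarize_sections(sample_chunks)
print(counts.most_common())
```

On the real data, load the file with `json.load` and pass the resulting list to `summarize_sections`. A large share of "Unknown" or "Introduction" labels would suggest the keyword heuristic needs tuning.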

In [None]:
# Function to batch upload to Weaviate (including UUIDs)
def bulk_upload_space_chunks_to_weaviate(json_file_path, collection_name="SpaceMedResearch"):
    import weaviate
    from google.colab import userdata
    from weaviate.util import generate_uuid5

    # Client initialization
    WEAVIATE_URL = userdata.get("WEAVIATE_URL")
    WEAVIATE_API_KEY = userdata.get("WEAVIATE_API_KEY")
    OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")

    client = weaviate.connect_to_weaviate_cloud(
        WEAVIATE_URL,
        auth_credentials=weaviate.AuthApiKey(WEAVIATE_API_KEY),
        headers={
            "X-OpenAI-Api-Key": OPENAI_API_KEY
        }
    )

    docs_collection = client.collections.get(collection_name)

    with open(json_file_path, "r", encoding="utf-8") as f:
        chunks = json.load(f)

    successful_uploads = 0

    with docs_collection.batch.fixed_size(batch_size=100, concurrent_requests=2) as batch:
        for i, chunk in enumerate(chunks):
            text = chunk.get("text", "")
            metadata = chunk.get("metadata", {})

            # Create a unique ID by combining title with chunk index and first 20 chars of text
            unique_id = f"{metadata.get('title', 'unknown')}-chunk-{i}-{text[:20]}"
            uid = generate_uuid5(unique_id)

            batch.add_object(
                properties={
                    "content": text,
                    "title": metadata.get("title", "unknown"),
                    "slug": metadata.get("slug", "unknown"),
                    "section": metadata.get("section", "unknown")
                },
                uuid=uid
            )
            successful_uploads += 1

            # Progress indicator
            if i % 500 == 0 and i > 0:
                print(f"Progress: {i}/{len(chunks)} chunks processed")

            if batch.number_errors > 10:
                print("❌ Too many errors during batch import — stopping early.")
                break

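    # Report objects that failed during the batch import, if any
    failed_objects = docs_collection.batch.failed_objects
    if failed_objects:
        print(f"❌ {len(failed_objects)} objects failed to import")

    # Verify the actual count in the collection
    collection_count = docs_collection.aggregate.over_all().total_count
    print(f"✅ Uploaded {successful_uploads} chunks to Weaviate from {json_file_path}")
    print(f"✅ Collection now contains {collection_count} objects")

    # Close the connection opened inside this function
    client.close()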

In [None]:
# Batch upload
bulk_upload_space_chunks_to_weaviate("/content/demo_chunks.json")

Progress: 500/4008 chunks processed
Progress: 1000/4008 chunks processed
Progress: 1500/4008 chunks processed
Progress: 2000/4008 chunks processed
Progress: 2500/4008 chunks processed
Progress: 3000/4008 chunks processed
Progress: 3500/4008 chunks processed
Progress: 4000/4008 chunks processed
✅ Uploaded 4008 chunks to Weaviate from /content/demo_chunks.json
✅ Collection now contains 4011 objects
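
The upload function derives each object's ID from the title, the chunk index, and the first 20 characters of the text, so the same input always yields the same ID. The sketch below illustrates that property with the standard library's `uuid.uuid5` as a stand-in for `generate_uuid5`. The helper name `chunk_uuid` is hypothetical, and the `NAMESPACE_DNS` namespace is an assumption of this sketch, so the IDs it prints are not guaranteed to match the ones Weaviate's helper produces.

```python
import uuid

def chunk_uuid(title, index, text):
    """Stand-in for the notebook's ID scheme, built on the standard library's uuid5."""
    identifier = f"{title}-chunk-{index}-{text[:20]}"
    # NAMESPACE_DNS is an assumption made for this sketch
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, identifier))

# Placeholder values, not taken from the research articles
first = chunk_uuid("Example Title", 0, "Placeholder chunk text.")
again = chunk_uuid("Example Title", 0, "Placeholder chunk text.")
other = chunk_uuid("Example Title", 1, "Placeholder chunk text.")

print(first == again, first == other)
```

Because the chunk index is part of the identifier, any change to the chunking that shifts indices also changes the IDs of every later chunk.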


In [None]:
# QueryAgent is provided by the weaviate-agents package installed above
from weaviate_agents.query import QueryAgent

# Instantiate agent object, and specify the collections to query
qa = QueryAgent(
    client=client, collections=["SpaceMedResearch"]
)

In [None]:
# Perform a query
response = qa.run(
    "What are the greatest health concerns facing astronauts during their time in space and upon their return to earth?"
)
# Print the response
response.display()





In [None]:
# Perform a query
response = qa.run(
    "What are the health concerns for individuals in general aviation?"
)
# Print the response
response.display()





In [None]:
# Perform a query
response = qa.run(
    "What stress is common in pilots?"
)
# Print the response
response.display()



