In [1]:
## SOPbot setup and initial testing ##

In [2]:
## install dependencies ##
%pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.





In [3]:
## load and check connections ##

from dotenv import load_dotenv
import os

load_dotenv('../.env')

## Verify credentials ##
print("✓ Azure OpenAI Endpoint:", os.getenv("AZURE_OPENAI_ENDPOINT"))
print("✓ Azure OpenAI Embedding Endpoint:", os.getenv("AZURE_OPENAI_EMBEDDING_ENDPOINT"))
print("✓ Azure Search Endpoint:", os.getenv("AZURE_SEARCH_ENDPOINT"))
print("\nAll credentials loaded!")

✓ Azure OpenAI Endpoint: https://sampd-mbgsuu3f-eastus2.cognitiveservices.azure.com/
✓ Azure OpenAI Embedding Endpoint: https://crebot.cognitiveservices.azure.com/
✓ Azure Search Endpoint: https://crebot-search.search.windows.net

All credentials loaded!
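
The cell above prints "All credentials loaded!" unconditionally, so a missing variable would only show up as `None` in the output. A minimal sketch of a fail-fast check, assuming the same three variable names; `require_env` and `REQUIRED_VARS` are hypothetical helpers, not part of the SOPbot source:

```python
import os

# Names taken from the cell above; extend with API key variables if the project uses them.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_ENDPOINT",
    "AZURE_SEARCH_ENDPOINT",
]

def require_env(names, environ=None):
    """Return {name: value} for all names, or raise if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `require_env(REQUIRED_VARS)` right after `load_dotenv('../.env')` would stop the notebook at this cell rather than at the first failed Azure request.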


In [4]:
## process pdfs ##
import sys
sys.path.append('..')

from src.pdf_processor import SOPProcessor

processor = SOPProcessor()
chunks = processor.process_directory("../data")

print(f"\n✓ Processed {len(chunks)} chunks from SOPs")
print("\nSample chunk:")
print("Metadata:", chunks[0]['metadata'])
print("\nContent preview:", chunks[0]['content'][:300], "...")

Processing: 1.01 CRE SOP FDA Form 1572 - Copy.pdf
  -> Created 1 chunks
Processing: 1.01 CRE SOP FDA Form 1572.pdf
  -> Created 1 chunks
Processing: 1.02 CRE SOP GCP Training - Copy.pdf
  -> Created 1 chunks
Processing: 1.02 CRE SOP GCP Training.pdf
  -> Created 1 chunks
Processing: 1.03 CRE SOP Internal Audits - Copy.pdf
  -> Created 1 chunks
Processing: 1.03 CRE SOP Internal Audits.pdf
  -> Created 1 chunks
Processing: 1.04 CRE SOP MSO Billing - Copy.pdf
  -> Created 1 chunks
Processing: 1.04 CRE SOP MSO Billing.pdf
  -> Created 1 chunks
Processing: 1.05 CRE SOP New Employee - Copy.pdf
  -> Created 1 chunks
Processing: 1.05 CRE SOP New Employee.pdf
  -> Created 1 chunks
Processing: 1.06 CRE SOP Protocol Training - Copy.pdf
  -> Created 1 chunks
Processing: 1.06 CRE SOP Protocol Training.pdf
  -> Created 1 chunks
Processing: 1.07 CRE SOP Roles  Responsibilities v1.4 - Copy.pdf
  -> Created 29 chunks
Processing: 1.07 CRE SOP Roles  Responsibilities v1.4.pdf
  -> Created 29 chunks
Proce
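
The log above shows every SOP processed twice, once under its original name and once as `- Copy.pdf`. If those copies are byte-identical (an assumption, not confirmed by the output), the index ends up with duplicate chunks. A sketch of filtering them out by content hash before processing; `unique_pdfs` is a hypothetical helper, and whether `SOPProcessor` can accept an explicit file list is not shown here:

```python
import hashlib
from pathlib import Path

def unique_pdfs(directory):
    """Return PDF paths in `directory`, keeping one file per distinct content hash."""
    seen = set()
    kept = []
    # Shorter names sort first, so "X.pdf" is kept in preference to "X - Copy.pdf".
    candidates = sorted(Path(directory).glob("*.pdf"), key=lambda p: (len(p.name), p.name))
    for path in candidates:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept
```

Comparing `len(unique_pdfs("../data"))` with the number of files in the directory is a quick way to see how many copies are present.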

In [5]:
## generate embeddings ##
from src.embeddings import EmbeddingGenerator

generator = EmbeddingGenerator()

## Extract content for embedding ##
contents = [chunk['content'] for chunk in chunks]

## Generate embeddings ##
print("Generating embeddings (this may take a minute)...")
embeddings = generator.generate_batch_embeddings(contents)

print(f"\n✓ Generated {len(embeddings)} embeddings")
print(f"✓ Embedding dimension: {len(embeddings[0])}")

Generating embeddings (this may take a minute)...
Processed 100/365 embeddings
Processed 200/365 embeddings
Processed 300/365 embeddings
Processed 365/365 embeddings

✓ Generated 365 embeddings
✓ Embedding dimension: 1536
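
The progress lines above advance in steps of 100, which suggests the texts are sent in batches. A sketch of how such batching might look, not the actual `EmbeddingGenerator` implementation; `batched_embeddings` is hypothetical and `embed_fn` is a stand-in for whatever call returns vectors for a list of strings:

```python
def batched_embeddings(texts, embed_fn, batch_size=100):
    """Embed `texts` in batches, preserving input order.

    `embed_fn` is a placeholder: it takes a list of strings and returns
    one vector per string, in the same order.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_fn(batch)
        # Guard against a backend that silently drops or merges inputs.
        if len(vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors than inputs")
        embeddings.extend(vectors)
        print(f"Processed {len(embeddings)}/{len(texts)} embeddings")
    return embeddings
```

The length check matters because the upload cell pairs chunks and embeddings by position, so a dropped vector would misalign every document after it.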


In [6]:
## create search index ##
from src.search_index import AzureSearchIndexManager

manager = AzureSearchIndexManager()

## Delete existing index if any ##
print("Checking for existing index...")
manager.delete_index()

## Create new index ##
print("\nCreating new index...")
# Derive the dimension from the generated vectors instead of hard-coding 1536
manager.create_index(embedding_dimensions=len(embeddings[0]))

Checking for existing index...
Index 'sop-index' deleted

Creating new index...
Index 'sop-index' created successfully


<azure.search.documents.indexes.models._index.SearchIndex at 0x23cd5183bd0>

In [7]:
## upload docs ##
import uuid

documents = []
for chunk, embedding in zip(chunks, embeddings):
    doc = {
        "id": str(uuid.uuid4()),
        "content": chunk['content'],
        "sop_number": chunk['metadata']['sop_number'],
        "title": chunk['metadata']['title'],
        "section_type": chunk['metadata']['section_type'],
        "version": chunk['metadata']['version'],
        "filename": chunk['metadata']['filename'],
        "effective_date": chunk['metadata']['effective_date'],
        "content_vector": embedding
    }
    documents.append(doc)

## Upload to Azure Search ##
print(f"Uploading {len(documents)} documents to Azure Search...")
manager.upload_documents(documents)

print(f"\n✓ Successfully uploaded {len(documents)} documents!")

Uploading 365 documents to Azure Search...
Uploaded 365 documents

✓ Successfully uploaded 365 documents!
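
Because `uuid.uuid4()` produces a new ID on every run, re-running the upload without first deleting the index would add a second copy of each chunk. A sketch of deterministic IDs built with `uuid.uuid5` from the filename and the chunk's position within that file; it assumes chunks for a file come back in a stable order, and `chunk_id` and `SOP_NAMESPACE` are hypothetical names:

```python
import uuid

# Fixed namespace so IDs are reproducible across runs; the "sop-index" label is an arbitrary choice.
SOP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "sop-index")

def chunk_id(filename, position):
    """Return a stable ID for the chunk at `position` within `filename`."""
    return str(uuid.uuid5(SOP_NAMESPACE, f"{filename}#{position}"))
```

Using this in the loop above would need a per-file counter for `position`. Whether a repeated ID overwrites the earlier document depends on how `manager.upload_documents` submits the batch, which is not shown in this notebook.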


In [8]:
## test RAG ##
from src.rag_pipeline import SOPRAGPipeline

pipeline = SOPRAGPipeline()

test_queries = [
    "What is the parking procedure for patients at CH20?",
    "How should informed consent be obtained remotely?",
    "What should I do during a sponsor audit?"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Question: {query}")
    print('='*60)

    result = pipeline.query(query)

    print(f"\nAnswer:\n{result['answer']}")
    print("\nSources:\n" + "\n".join(f"- {s}" for s in result['sources']))


Question: What is the parking procedure for patients at CH20?
Retrieving documents for: What is the parking procedure for patients at CH20?
Generating answer...

Answer:
According to SOP 1.13, the parking procedure for patients at CH20 is as follows:

- When scheduling an appointment, provide patients with:
  - The address, directions, a map to the parking lot, and clinic contact information.
  - Inform them that the parking lot entrance is on 20th Street.
- Instruct patients on how to get from the parking lot to the building entrance and tell them to call upon arrival.
- Inform patients:
  - They cannot follow anyone into the building and must remain outside until let in by appropriate personnel.
  - If the front intercom is used, provide instructions:
    1. Click the Contacts Button.
    2. Choose the clinic/department for their appointment.
    3. Choose the same clinic/department again to CALL.
    4. Someone will answer, verify the appointment, and unlock the door automatically.
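
The test cell and the interactive cell both format the answer and sources inline. A small sketch of a shared formatter, assuming only the `answer` and `sources` keys used in this notebook; `format_result` is a hypothetical helper:

```python
def format_result(result, width=60):
    """Return the answer and sources from a pipeline result as printable text."""
    rule = "=" * width
    sources = "\n".join(f"- {s}" for s in result["sources"])
    return f"{rule}\nAnswer:\n{result['answer']}\n{rule}\nSources:\n{sources}"
```

With this in place, either cell could call `print(format_result(pipeline.query(query)))`.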

In [9]:
## test interaction ##
query = input("Enter your question: ")
result = pipeline.query(query)

print(f"\n{'='*60}")
print(f"Answer:\n{result['answer']}")
print(f"\n{'='*60}")
print("Sources:\n" + "\n".join(f"- {s}" for s in result['sources']))

Retrieving documents for: What all needs to be done for a monitor visit?

Generating answer...

Answer:
Based on SOP 3.04 – Monitor Visits (On Site or Remote), the following steps need to be completed for a monitor visit:

**I. Scheduling and Preparation**
- Provide new monitors with the Monitor Informational Packet before any visit (remote or onsite).
- Schedule monitoring visits in advance:
  - Minimum of four (4) weeks for new monitors.
  - Minimum of three (3) weeks for established monitors.
- Use the Fillable Monitor Request Form to schedule the visit.
- Regulatory Coordinator sends a calendar invitation to all regulatory personnel and the Study Coordinator.
- If EMR access is required, Ronald Prevatt initiates the access request process.
- Study Coordinator is cc’d on all related correspondence.
- Study Coordinator secures and reserves a designated workspace for the monitor and notifies the regulatory team of the location.
- Monitors must submit the IMV (Interim Monitoring Visit)