This notebook follows the post [Building Robust AI Systems with DSPy and Amazon Bedrock](https://dgallitelli95.medium.com/building-robust-ai-systems-with-dspy-and-amazon-bedrock-d0376f158d88).

DSPy source references:
- FAISS retriever (`FaissRM`): https://github.com/stanfordnlp/dspy/blob/main/dspy/retrieve/faiss_rm.py
- Sentence-transformers vectorizer that `FaissRM` builds on: https://github.com/stanfordnlp/dspy/blob/main/dsp/modules/sentence_vectorizer.py

In [None]:
from dspy.retrieve.qdrant_rm import QdrantRM
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams
import numpy as np

## Sample dataset

In [None]:
classes = ["pulmonary edema", "consolidation", "pleural effusion", "pneumothorax", "cardiomegaly"]

In [None]:
reports = [
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Chest pain. Comparison: Chest radiograph from XXXX, XXXX. Findings: The cardiac silhouette is borderline enlarged. Otherwise, there is no focal opacity. Mediastinal contours are within normal limits. There is no large pleural effusion. No pneumothorax. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Borderline enlargement of the cardiac silhouette without acute pulmonary disease. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Shortness of breath. Comparison: Chest radiograph from XXXX, XXXX. Findings: There is evidence of bilateral pulmonary edema. The cardiac silhouette is normal. No pleural effusion or pneumothorax. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Bilateral pulmonary edema. No evidence of pleural effusion or pneumothorax. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Cough and fever. Comparison: Chest radiograph from XXXX, XXXX. Findings: There is consolidation in the right lower lobe. The cardiac silhouette is normal. No pleural effusion or pneumothorax. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Right lower lobe consolidation. No pleural effusion or pneumothorax. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Chest pain. Comparison: Chest radiograph from XXXX, XXXX. Findings: There is a small left pleural effusion. The cardiac silhouette is normal. No pneumothorax. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Small left pleural effusion. No pneumothorax. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Trauma. Comparison: Chest radiograph from XXXX, XXXX. Findings: There is a right-sided pneumothorax. The cardiac silhouette is normal. No pleural effusion. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Right-sided pneumothorax. No pleural effusion. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Shortness of breath and leg swelling. Comparison: Chest radiograph from XXXX, XXXX. Findings: There is moderate pulmonary edema and bilateral pleural effusion. The cardiac silhouette is enlarged. No pneumothorax. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Moderate pulmonary edema and bilateral pleural effusion. Cardiomegaly. No pneumothorax. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Fever and cough. Comparison: Chest radiograph from XXXX, XXXX. Findings: There is a consolidation in the left upper lobe. The cardiac silhouette is normal. No pleural effusion or pneumothorax. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Left upper lobe consolidation. No pleural effusion or pneumothorax. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Routine check-up. Comparison: Chest radiograph from XXXX, XXXX. Findings: The cardiac silhouette is normal. No focal opacity. Mediastinal contours are within normal limits. There is no pleural effusion or pneumothorax. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Normal chest radiograph. No abnormalities detected. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Dyspnea. Comparison: Chest radiograph from XXXX, XXXX. Findings: There is mild cardiomegaly. Bilateral pleural effusions are present. No evidence of pneumothorax. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Mild cardiomegaly with bilateral pleural effusions. No pneumothorax. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """,
    """
    RADIOLOGY REPORT

    Exam
    PA and lateral chest radiograph (2 views) (2 images) Date: XXXX, XXXX at XXXX hours Indication: Trauma. Comparison: Chest radiograph from XXXX, XXXX. Findings: There is a left-sided pneumothorax. The cardiac silhouette is normal. No pleural effusion. Transcribed by - PSCB Transcription Date - XXXX

    IMPRESSION
    Left-sided pneumothorax. No pleural effusion. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX 11 XXXX XXXX  RADRES XXXX

    SIGNATURE
    XXXX
    """
]

# Ground Truth Labels for each report
ground_truth = [
    ["cardiomegaly"],
    ["pulmonary edema"],
    ["consolidation"],
    ["pleural effusion"],
    ["pneumothorax"],
    ["pulmonary edema", "pleural effusion", "cardiomegaly"],
    ["consolidation"],
    [],
    ["cardiomegaly", "pleural effusion"],
    ["pneumothorax"]
]
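
The metrics later in this notebook compare each ground-truth label to the retrieved text by exact string match, so a label that is spelled or formatted differently from its entry in `classes` can never count as a hit. A minimal sanity check might look like the sketch below. The helper `find_unknown_labels` and the toy inputs are hypothetical and exist only for illustration; they are not part of DSPy or Qdrant.

```python
def find_unknown_labels(classes, ground_truth):
    """Return (report_index, label) pairs whose label is not in `classes`."""
    known = set(classes)
    return [
        (idx, label)
        for idx, labels in enumerate(ground_truth)
        for label in labels
        if label not in known
    ]


# Toy inputs for illustration only; in the notebook, pass `classes` and `ground_truth`.
toy_classes = ["pleural effusion", "pneumothorax"]
toy_ground_truth = [["pneumothorax"], ["pleural_effusion"], []]

unknown = find_unknown_labels(toy_classes, toy_ground_truth)
print(unknown)  # [(1, 'pleural_effusion')]
```

An empty result means every label can, in principle, be matched by the retriever.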


## Retrieval model with custom vectorizer

In [None]:
# from dsp.modules import sentence_vectorizer

# vectorizer = sentence_vectorizer.SentenceTransformersVectorizer()
# vectorizer = sentence_vectorizer.SentenceTransformersVectorizer("all-MiniLM-L12-v2")
# vectorizer = sentence_vectorizer.SentenceTransformersVectorizer("all-mpnet-base-v2")

# Test different values of `model_name_or_path` from sentence_transformers, default = "all-MiniLM-L6-v2"

In [None]:
# def build_retriever_client(labels, collection_name, k, vectorizer = None):
#     client = QdrantClient(":memory:")
#     ids = list(range(0, len(labels)))

#     # If you want to change the model: (reference: https://github.com/qdrant/fastembed?tab=readme-ov-file#usage-with-qdrant)
#     # client.set_model("sentence-transformers/all-MiniLM-L6-v2")
#     # List of supported models: https://qdrant.github.io/fastembed/examples/Supported_Models

#     if vectorizer is None:
#         client.add(
#             collection_name=collection_name,
#             documents=labels,
#             ids=ids
#         )
#     else:
#         # Embed the documents using your custom vectorizer
#         embedded_docs = [vectorizer(label) for label in labels]
        
#         # Get the vector size from the first embedded document
#         vector_size = len(embedded_docs[0])
        
#         # Create the collection
#         client.create_collection(
#             collection_name=collection_name,
#             vectors_config=VectorParams(size=vector_size, distance="Cosine")
#         )
        
#         # Create PointStruct objects
#         points = [
#             PointStruct(
#                 id=idx,
#                 vector=embedded_doc.tolist(),
#                 payload={"text": label}
#             )
#             for idx, (label, embedded_doc) in enumerate(zip(labels, embedded_docs))
#         ]
        
#         # Add the embedded documents to Qdrant
#         client.upsert(
#             collection_name=collection_name,
#             points=points
#         )

#     qdrant_retriever_model = QdrantRM(collection_name, client, k=k)

#     return qdrant_retriever_model

In [None]:
def build_retriever_client(labels, collection_name, k, vectorizer = None):
    client = QdrantClient(":memory:")
    ids = list(range(0, len(labels)))

    # If you want to change the model: (reference: https://github.com/qdrant/fastembed?tab=readme-ov-file#usage-with-qdrant)
    # client.set_model("sentence-transformers/all-MiniLM-L6-v2")
    # List of supported models: https://qdrant.github.io/fastembed/examples/Supported_Models

    if vectorizer is not None:
        client.set_model(vectorizer)
        
    client.add(
        collection_name=collection_name,
        documents=labels,
        ids=ids
    )

    qdrant_retriever_model = QdrantRM(collection_name, client, k=k)

    return qdrant_retriever_model

In [None]:
# qdrant_retriever_model = build_retriever_client(labels=classes, collection_name="rad", k=3)
qdrant_retriever_model = build_retriever_client(labels=classes, collection_name="rad", k=3, vectorizer="sentence-transformers/all-MiniLM-L6-v2")
# qdrant_retriever_model = build_retriever_client(labels=classes, collection_name="rad", k=3, vectorizer="nomic-ai/nomic-embed-text-v1.5-Q")
# qdrant_retriever_model = build_retriever_client(labels=classes, collection_name="rad", k=3, vectorizer="BAAI/bge-large-en-v1.5")
# qdrant_retriever_model = build_retriever_client(labels=classes, collection_name="rad", k=3, vectorizer="intfloat/multilingual-e5-large")


In [None]:
client = QdrantClient(":memory:")
docs = classes
ids = list(range(0, len(docs)))

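# `vectorizer` is only defined in a commented-out cell above, so create it here
from dsp.modules import sentence_vectorizer
vectorizer = sentence_vectorizer.SentenceTransformersVectorizer()

# Embed the documents using your custom vectorizer (it takes a list of texts)
embedded_docs = [vectorizer([doc])[0] for doc in docs]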

# Get the vector size from the first embedded document
vector_size = len(embedded_docs[0])

# Create the collection
client.create_collection(
    collection_name="rad",
    vectors_config=VectorParams(size=vector_size, distance="Cosine")
)

# Create PointStruct objects
points = [
    PointStruct(
        id=idx,
        vector=embedded_doc.tolist(),
        payload={"text": doc}
    )
    for idx, (doc, embedded_doc) in enumerate(zip(docs, embedded_docs))
]

# Add the embedded documents to Qdrant
client.upsert(
    collection_name="rad",
    points=points
)

In [None]:
qdrant_retriever_model = QdrantRM("rad", client, k=3)

## Retrieval model

In [None]:
from dspy.retrieve.qdrant_rm import QdrantRM
from qdrant_client import QdrantClient

In [None]:
client = QdrantClient(":memory:")

In [None]:
docs = classes
ids = list(range(0,len(docs)))

In [None]:
client.add(
    collection_name="rad",
    documents=docs,
    ids=ids
    )

In [None]:
qdrant_retriever_model = QdrantRM("rad", client, k=3)

In [None]:
reports[0]

In [None]:
n = 8
print(reports[n])
print(ground_truth[n])
print(qdrant_retriever_model.forward(reports[n], k=3))

## Retrieval metrics

### TODO
- Add code that loops through the examples and records the ground-truth label, its retrieved position, and the model name. This would show how well ensembling might work, and whether certain examples are handled well by some models but not by others.
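
One possible shape for this TODO is sketched below. It assumes each retriever is wrapped as a plain callable that takes a report and returns a ranked list of label strings; for the Qdrant retriever in this notebook, that wrapper could be something like `lambda report: [r["long_text"] for r in qdrant_retriever_model.forward(report, k=5)]`. The helper `collect_positions` and the stub retrievers are hypothetical and stand in for real models.

```python
def collect_positions(retrievers, reports, ground_truth):
    """Record one row per (model, report, label) with the 1-based rank, or None on a miss."""
    rows = []
    for model_name, retrieve in retrievers.items():
        for report_idx, (report, labels) in enumerate(zip(reports, ground_truth)):
            ranked = retrieve(report)
            for label in labels:
                position = ranked.index(label) + 1 if label in ranked else None
                rows.append(
                    {"model": model_name, "report": report_idx, "label": label, "position": position}
                )
    return rows


# Stub retrievers standing in for real models (illustration only).
stub_retrievers = {
    "model_a": lambda report: ["pneumothorax", "cardiomegaly"],
    "model_b": lambda report: ["cardiomegaly"],
}
rows = collect_positions(stub_retrievers, ["report text"], [["cardiomegaly", "consolidation"]])
for row in rows:
    print(row)
```

Keeping misses as `None` rather than a numeric rank makes it possible to tell "retrieved last" apart from "not retrieved" when comparing models.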

In [None]:
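# For each ground-truth label, record the 1-based rank at which it was retrieved
positions = []
top_k = 0  # number of labels found anywhere in the retrieved results
for report, labels in zip(reports, ground_truth):
    results = qdrant_retriever_model.forward(report, k=5)
    results_list = [elt['long_text'] for elt in results]

    for label in labels:
        if label in results_list:
            position = results_list.index(label) + 1
            top_k += 1
        else:
            # Label not retrieved: fall back to the last rank
            position = len(results_list)
        positions.append(position)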

In [None]:
import statistics
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Calculate summary statistics
mean_value = statistics.mean(positions)
median_value = statistics.median(positions)
mode_value = statistics.mode(positions)
percentile_95 = np.percentile(positions, 95)

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
print(f"95th Percentile: {percentile_95}")

# Plot histogram
plt.hist(positions, bins=5, edgecolor='black')
plt.title('Histogram of Retrieval Positions')
plt.xlabel('Position of ground-truth label')
plt.ylabel('Frequency')
plt.show()
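
Two other common ways to summarize retrieval positions are the hit rate at k and the mean reciprocal rank (MRR). The sketch below assumes positions are 1-based and that a miss is stored as `None`, which differs from the `positions` list above, where a miss is assigned the last rank. The function names are hypothetical helpers, not part of DSPy.

```python
def hit_rate_at_k(positions, k):
    """Fraction of labels retrieved within the top k (None means not retrieved)."""
    if not positions:
        return 0.0
    hits = sum(1 for p in positions if p is not None and p <= k)
    return hits / len(positions)


def mean_reciprocal_rank(positions):
    """Average of 1/rank over all labels, counting a miss as 0."""
    if not positions:
        return 0.0
    return sum(1.0 / p for p in positions if p is not None) / len(positions)


toy_positions = [1, 2, None, 4]
print(hit_rate_at_k(toy_positions, k=2))    # 0.5
print(mean_reciprocal_rank(toy_positions))  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```

MRR rewards labels that appear near the top more heavily than the mean position does, which can make differences between embedding models easier to see.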

In [None]:
from dspy.retrieve.faiss_rm import FaissRM
# from dsp.modules.sentence_vectorizer import SentenceTransformersVectorizer

document_chunks = reports

In [None]:
rm = FaissRM(
    document_chunks=document_chunks,
    # vectorizer=SentenceTransformersVectorizer(),  # optional: pass an instance, not the class
)
print(rm(["Provide your question here"]))