# Miimansa Clinical Text Retriever

This notebook demonstrates the `MiimansaClinicalTextRetriever`, which improves clinical-text retrieval by consulting a pre-built question bank. For each query, the retriever first checks whether the question bank contains a sufficiently similar question. If it does (a direct hit), it returns the context linked to that question. Otherwise, it falls back to the standard retrieval step used in Retrieval-Augmented Generation (RAG) and returns the top_k most relevant contexts for the query.
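The two-stage lookup described above can be sketched in plain Python. This is only an illustration, not the retriever's actual implementation: the `cosine` and `retrieve` helpers, the toy 2-D embeddings, and the `fallback` callable are all hypothetical, and the use of cosine similarity is an assumption. The threshold of 0.91 and `top_k` of 5 mirror the values used later in this notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, bank, fallback, threshold=0.91, top_k=5):
    """Return the context of the best-matching bank question on a direct hit,
    otherwise defer to the fallback retriever.

    bank: list of (question_embedding, context) pairs (hypothetical layout).
    fallback: callable taking top_k and returning a ranked list of contexts.
    """
    best_score, best_context = -1.0, None
    for question_vec, context in bank:
        score = cosine(query_vec, question_vec)
        if score > best_score:
            best_score, best_context = score, context
    if best_context is not None and best_score >= threshold:
        return [best_context]  # direct hit: a single context
    return fallback(top_k)  # no direct hit: ranked retrieval


# Toy data standing in for real sentence embeddings and a real index.
bank = [([1.0, 0.0], "sponsor context"), ([0.0, 1.0], "phase 1 context")]
fallback = lambda k: ["ranked context A", "ranked context B"][:k]

direct = retrieve([0.99, 0.05], bank, fallback)   # very close to the first question
indirect = retrieve([0.7, 0.7], bank, fallback)   # not close enough to either question
```

A direct hit returns exactly one context, while the fallback path returns up to `top_k` ranked contexts, which matches the shape of the outputs in the retrieval examples at the end of this notebook.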

In [1]:
import warnings

warnings.filterwarnings("ignore")

## Define Variables

In [13]:
DB_PATH = "question_bank.csv"
METADATA_PATH = "./output/metadata.pkl"
VECTOR_DB_PATH = "./output"
DIRECT_HIT_MODEL = "mixedbread-ai/mxbai-embed-large-v1"

## Dataset Creation: Question Bank
The dataset must contain the following columns:
- context: the passage from which questions are generated.
- generated_question: a question generated from that context, for example by a large language model (LLM). Each row holds one question, so a context with several questions appears in several rows.
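Before building the index, it can help to confirm that both required columns are present. The check below is a minimal sketch; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical helper names, not part of any library. It accepts any iterable of column names, so it would work equally with `dataset.columns` or a CSV header row.

```python
# Columns the question bank is expected to contain (taken from the list above).
REQUIRED_COLUMNS = {"context", "generated_question"}


def missing_columns(columns):
    """Return the required column names that are absent, in sorted order."""
    return sorted(REQUIRED_COLUMNS - set(columns))


ok = missing_columns(["generated_question", "context"])
bad = missing_columns(["question", "context"])
```

An empty result means the dataset has the expected schema; a non-empty result names the columns that still need to be added or renamed.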

### Example Dataset

In [6]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

In [7]:
data = {
    "generated_question": [
        "Who is the clinical study sponsor?",
        "What phase is being described in the context?",
        "What is being evaluated in Phase 1?",
        "How many subjects are anticipated to be enrolled and dosed in this study?  ",
    ],
    "context": [
        """
Clinical Study Sponsor: 
Kite Pharma, Inc.
2400 Broadway
Santa Monica, CA 90404
United States of America
Key Sponsor Contacts: 
Clinical Development
Kite Pharma, Inc.
2 Roundwood Avenue
Stockley Park
Uxbridge, Middlesex
Phone:
Email:
Clinical Development
Kite Pharma Inc.
2400 Broadway
Santa Monica, CA 90404
Phone:
Email:
Clinical Operations
Kite Pharma, Inc.
2400 Broadway
Santa Monica, CA 90404
Phone:
Email:""",
        """Study Objectives 
Phase 1 Study
The primary objective of Phase 1 is to evaluate the safety of axicabtagene ciloleucel regimens.""",
        """Study Objectives 
Phase 1 Study
The primary objective of Phase 1 is to evaluate the safety of axicabtagene ciloleucel regimens.""",
        """
3.3. 
Number of Subjects
Participants in this trial will be referred to as “subjects”. It is anticipated that approximately 268 to 286 subjects will be enrolled and dosed in this study as defined below:
Phase 1 study: approximately 6 to 24 subjects
Phase 2 pivotal study: approximately 92 subjects enrolled into 2 cohorts
Cohort 1: approximately 72 subjects
Cohort 2: approximately 20 subjects
Phase 2 safety management study: approximately 170 subjects enrolled and dosed within 4 cohorts
Cohort 3: approximately 40 subjects
Cohort 4: approximately 40 subjects
Cohort 5: approximately 50 subjects
Cohort 6: approximately 40 subjects
It should be noted that Kite Pharma may choose to close enrollment at any time. Please refer to the statistical considerations section of the protocol for sample size estimations.
""",
    ],
    # The QID and ID columns are generated in the preprocessing step below.
}

# Load dataset
dataset = pd.DataFrame(data)
dataset

Unnamed: 0,generated_question,context
0,Who is the clinical study sponsor?,"\nClinical Study Sponsor: \nKite Pharma, Inc.\n2400 Broadway\nSanta Monica, CA 90404\nUnited States of America\nKey Sponsor Contacts: \nClinical Development\nKite Pharma, Inc.\n2 Roundwood Avenue\nStockley Park\nUxbridge, Middlesex\nPhone:\nEmail:\nClinical Development\nKite Pharma Inc.\n2400 Broadway\nSanta Monica, CA 90404\nPhone:\nEmail:\nClinical Operations\nKite Pharma, Inc.\n2400 Broadway\nSanta Monica, CA 90404\nPhone:\nEmail:"
1,What phase is being described in the context?,Study Objectives \nPhase 1 Study\nThe primary objective of Phase 1 is to evaluate the safety of axicabtagene ciloleucel regimens.
2,What is being evaluated in Phase 1?,Study Objectives \nPhase 1 Study\nThe primary objective of Phase 1 is to evaluate the safety of axicabtagene ciloleucel regimens.
3,How many subjects are anticipated to be enrolled and dosed in this study?,\n3.3. \nNumber of Subjects\nParticipants in this trial will be referred to as “subjects”. It is anticipated that approximately 268 to 286 subjects will be enrolled and dosed in this study as defined below:\nPhase 1 study: approximately 6 to 24 subjects\nPhase 2 pivotal study: approximately 92 subjects enrolled into 2 cohorts\nCohort 1: approximately 72 subjects\nCohort 2: approximately 20 subjects\nPhase 2 safety management study: approximately 170 subjects enrolled and dosed within 4 cohorts\nCohort 3: approximately 40 subjects\nCohort 4: approximately 40 subjects\nCohort 5: approximately 50 subjects\nCohort 6: approximately 40 subjects\nIt should be noted that Kite Pharma may choose to close enrollment at any time. Please refer to the statistical considerations section of the protocol for sample size estimations.\n


## Preprocess Dataset

In [8]:
# Assign a unique ID to each distinct question
dataset["QID"] = pd.factorize(dataset["generated_question"])[0].astype(str)
# Assign a unique ID to each distinct context (duplicate contexts share an ID)
dataset["ID"] = pd.factorize(dataset["context"])[0].astype(str)
# Deduplicated contexts and their IDs, in matching order, as plain lists for indexing
unique_context = dataset["context"].unique().tolist()
unique_context_id = dataset["ID"].unique().tolist()

In [9]:
dataset

Unnamed: 0,generated_question,context,QID,ID
0,Who is the clinical study sponsor?,"\nClinical Study Sponsor: \nKite Pharma, Inc.\n2400 Broadway\nSanta Monica, CA 90404\nUnited States of America\nKey Sponsor Contacts: \nClinical Development\nKite Pharma, Inc.\n2 Roundwood Avenue\nStockley Park\nUxbridge, Middlesex\nPhone:\nEmail:\nClinical Development\nKite Pharma Inc.\n2400 Broadway\nSanta Monica, CA 90404\nPhone:\nEmail:\nClinical Operations\nKite Pharma, Inc.\n2400 Broadway\nSanta Monica, CA 90404\nPhone:\nEmail:",0,0
1,What phase is being described in the context?,Study Objectives \nPhase 1 Study\nThe primary objective of Phase 1 is to evaluate the safety of axicabtagene ciloleucel regimens.,1,1
2,What is being evaluated in Phase 1?,Study Objectives \nPhase 1 Study\nThe primary objective of Phase 1 is to evaluate the safety of axicabtagene ciloleucel regimens.,2,1
3,How many subjects are anticipated to be enrolled and dosed in this study?,\n3.3. \nNumber of Subjects\nParticipants in this trial will be referred to as “subjects”. It is anticipated that approximately 268 to 286 subjects will be enrolled and dosed in this study as defined below:\nPhase 1 study: approximately 6 to 24 subjects\nPhase 2 pivotal study: approximately 92 subjects enrolled into 2 cohorts\nCohort 1: approximately 72 subjects\nCohort 2: approximately 20 subjects\nPhase 2 safety management study: approximately 170 subjects enrolled and dosed within 4 cohorts\nCohort 3: approximately 40 subjects\nCohort 4: approximately 40 subjects\nCohort 5: approximately 50 subjects\nCohort 6: approximately 40 subjects\nIt should be noted that Kite Pharma may choose to close enrollment at any time. Please refer to the statistical considerations section of the protocol for sample size estimations.\n,3,2
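The `ID` column above shows that both Phase 1 questions share context ID `1`, because `pd.factorize` numbers distinct values in order of first appearance. The pure-Python function below is a rough equivalent written only to illustrate that behaviour; `factorize_as_str` and the shortened context labels are hypothetical.

```python
def factorize_as_str(values):
    """Give each distinct value an integer code in order of first appearance,
    returned as strings to mirror the .astype(str) call in the cell above."""
    seen = {}
    codes = []
    for value in values:
        if value not in seen:
            seen[value] = len(seen)  # next unused code
        codes.append(str(seen[value]))
    return codes


# Shortened stand-ins for the four context strings in the example dataset.
contexts = ["sponsor", "objectives", "objectives", "subjects"]
context_ids = factorize_as_str(contexts)
```

Because first-appearance order is also the order returned by `unique()`, the deduplicated contexts and their IDs stay aligned when they are passed to the indexer.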


In [None]:
dataset.to_csv(DB_PATH, index=False)

## Create ColBERT Vector Database of Contexts

In [19]:
import numpy as np
from ragatouille import RAGPretrainedModel

In [None]:
%%time
# Creates the index folder at <VECTOR_DB_PATH>/colbert/indexes/Colbert-Experimental
RAG = RAGPretrainedModel.from_pretrained(
    "colbert-ir/colbertv2.0", index_root=VECTOR_DB_PATH
)

RAG.index(
    collection=unique_context,
    document_ids=unique_context_id,
    index_name="Colbert-Experimental",
    overwrite_index=True,
    max_document_length=256,
    split_documents=True,
)

## Create Metadata

In [10]:
from langchain_community.utilities.miimansa import MiimansaUtility

In [13]:
MiimansaUtility.prepare_metadata(DB_PATH, DIRECT_HIT_MODEL, METADATA_PATH)

Creating mappings...
Computing embeddings...
Metadata saved at ./output/metadata.pkl


## Retrieve relevant documents

In [11]:
import joblib
from langchain_community.retrievers.miimansa import MiimansaClinicalTextRetriever
from sentence_transformers import SentenceTransformer

In [14]:
metadata = joblib.load(METADATA_PATH)
direct_hit_model = SentenceTransformer(DIRECT_HIT_MODEL)

In [22]:
RAG = MiimansaClinicalTextRetriever.from_index(
    "output/colbert/indexes/Colbert-Experimental"
)
retriever = RAG.as_langchain_retriever(
    metadata=metadata,
    direct_hit_model=direct_hit_model,
    direct_hit_threshold=0.91,
    log_direct_hit=True,
    log_dir="./logs",
    k=5,
)

### Direct Hit example

In [23]:
retriever.get_relevant_documents("Who is the clinical study sponsor ?")

[Document(page_content='\nClinical Study Sponsor: \nKite Pharma, Inc.\n2400 Broadway\nSanta Monica, CA 90404\nUnited States of America\nKey Sponsor Contacts: \nClinical Development\nKite Pharma, Inc.\n2 Roundwood Avenue\nStockley Park\nUxbridge, Middlesex\nPhone:\nEmail:\nClinical Development\nKite Pharma Inc.\n2400 Broadway\nSanta Monica, CA 90404\nPhone:\nEmail:\nClinical Operations\nKite Pharma, Inc.\n2400 Broadway\nSanta Monica, CA 90404\nPhone:\nEmail:')]

### Non-direct Hit example

In [25]:
retriever.get_relevant_documents("Who is the sponsor?")



[Document(page_content='\nClinical Study Sponsor: \nKite Pharma, Inc.\n2400 Broadway\nSanta Monica, CA 90404\nUnited States of America\nKey Sponsor Contacts: \nClinical Development\nKite Pharma, Inc.\n2 Roundwood Avenue\nStockley Park\nUxbridge, Middlesex\nPhone:\nEmail:\nClinical Development\nKite Pharma Inc.\n2400 Broadway\nSanta Monica, CA 90404\nPhone:\nEmail:\nClinical Operations\nKite Pharma, Inc.\n2400 Broadway\nSanta Monica, CA 90404\nPhone:\nEmail:'),
 Document(page_content='Study Objectives \nPhase 1 Study\nThe primary objective of Phase 1 is to evaluate the safety of axicabtagene ciloleucel regimens.'),
 Document(page_content='\n3.3. \nNumber of Subjects\nParticipants in this trial will be referred to as “subjects”. It is anticipated that approximately 268 to 286 subjects will be enrolled and dosed in this study as defined below:\nPhase 1 study: approximately 6 to 24 subjects\nPhase 2 pivotal study: approximately 92 subjects enrolled into 2 cohorts\nCohort 1: approximately 7