# Vector Database
#### 1.25.24
#### Sheldon Sutton in connection with MUSC
#### Surgical Innovation Center

In [5]:
#!pip install openai
#!pip install tiktoken
#!pip install psycopg2 -- looping error
#!pip install pgvector



In [40]:
import re

text = """CLINICAL PRACTICE GUIDELINE\n2023 ACC/AHA/ACCP/HRS Guideline\nfor the Diagnosis and Management of\nAtrial Fibrillation\nA Report of the American College of Cardiology/American Heart Association\nJoint Committee"""

# Define the pattern to extract the title
pattern = r'\d{4}\s(.*?)(?:\n[A-Z]|$)'

# Use re.findall to extract the title
matches = re.findall(pattern, text, re.DOTALL)

# Join the matches and replace '\n' with a space
title = ' '.join(matches).replace('\n', ' ')

# Print the extracted title
print(title)

ACC/AHA/ACCP/HRS Guideline for the Diagnosis and Management of
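
The printed title stops at "Management of" because the lazy match ends at the first newline followed by a capital letter, which here is the line break before "Atrial Fibrillation". A minimal sketch of one alternative is shown below. It assumes the title block is always followed by a line beginning with "A Report of", which holds for the ACC/AHA front pages shown in this notebook but is not a verified rule for every document.

```python
import re

text = """CLINICAL PRACTICE GUIDELINE\n2023 ACC/AHA/ACCP/HRS Guideline\nfor the Diagnosis and Management of\nAtrial Fibrillation\nA Report of the American College of Cardiology/American Heart Association\nJoint Committee"""

# Capture everything after the four-digit year up to the "A Report of" line.
# Assumption: the title block is always followed by "A Report of ...".
pattern = r'\d{4}\s(.*?)\nA Report of'

match = re.search(pattern, text, re.DOTALL)
title = match.group(1).replace('\n', ' ') if match else ''
print(title)
```

If the anchor line is missing, `title` falls back to an empty string rather than raising, so such files can be flagged for manual review.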


In [9]:
# Note: this code section only works with ACC in folders by year
#pip install pymupdf


Collecting pymupdf
  Downloading PyMuPDF-1.23.19-cp39-none-macosx_10_9_x86_64.whl (4.2 MB)
Collecting PyMuPDFb==1.23.9
  Downloading PyMuPDFb-1.23.9-py3-none-macosx_10_9_x86_64.whl (30.1 MB)
Installing collected packages: PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.23.9 pymupdf-1.23.19
Note: you may need to restart the kernel to use updated packages.


# Create all DataFrames at once from multiple folders
Assumes each folder contains the PDF files directly (i.e., no subfolders).


### Imports/Settings

In [2]:
import os
import pandas as pd
import fitz  # PyMuPDF
from tqdm import tqdm #for progress bars when reading files
import warnings

# Filter out FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)
#warnings.resetwarnings() #return to default

pd.set_option('display.max_colwidth', 500)
#pd.reset_option('display.max_colwidth') #return to default

### Functions

In [2]:
# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page_num in range(doc.page_count):
        page = doc[page_num]
        text += page.get_text()
    doc.close()
    return text

In [3]:
# Function to create a DataFrame for a given folder
def create_df(folder_path, journal_name):
    rows = []  # collect one dict per PDF, then build the DataFrame once

    for pdf_file in tqdm(os.listdir(folder_path), desc=f'Processing {journal_name} PDFs', unit='file'):
        if pdf_file.endswith(".pdf"):
            pdf_path = os.path.join(folder_path, pdf_file)
            title = os.path.splitext(pdf_file)[0]  # file name without the .pdf extension
            text = extract_text_from_pdf(pdf_path)
            rows.append({'Journal': journal_name, 'Title': title, 'Text': text})

    # DataFrame.append was removed in pandas 2.0, so rows are accumulated in a list instead
    return pd.DataFrame(rows, columns=['Journal', 'Title', 'Text'])

### Code using functions

In [4]:
# Base folder path
base_folder = "/Users/sheldonsutton/Desktop/MUSC_VectorDB"

# Access folders
acc_folder = os.path.join(base_folder, 'ACC')
aha_folder = os.path.join(base_folder, 'American Heart Association')
sca_folder = os.path.join(base_folder, 'Society of Cardiovascular Anesthesiologists')
sts_folder = os.path.join(base_folder, 'Society of Thoracic Surgeons')

# Create DataFrames
df_acc = create_df(acc_folder, 'ACC')
df_aha = create_df(aha_folder, 'American Heart Association')
df_sca = create_df(sca_folder, 'Society of Cardiovascular Anesthesiologists')
df_sts = create_df(sts_folder, 'Society of Thoracic Surgeons')


Processing ACC PDFs: 100%|██████████| 20/20 [00:13<00:00,  1.46file/s]
Processing American Heart Association PDFs: 100%|██████████| 27/27 [00:05<00:00,  4.66file/s]
Processing Society of Cardiovascular Anesthesiologists PDFs: 100%|██████████| 6/6 [00:00<00:00,  7.61file/s]
Processing Society of Thoracic Surgeons PDFs: 100%|██████████| 12/12 [00:01<00:00,  8.50file/s]


In [5]:
df_acc.head(2)

Unnamed: 0,Journal,Title,Text
0,ACC,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,CLINICAL PRACTICE GUIDELINE\n2017 ACC/AHA/HRS Guideline for the\nEvaluation and Management of\nPatients With Syncope\nA Report of the American College of Cardiology/American Heart Association\nTask Force on Clinical Practice Guidelines and the Heart Rhythm Society\nDeveloped in Collaboration With the American College of Emergency Physicians and\nSociety for Academic Emergency Medicine\nEndorsed by the Pediatric and Congenital Electrophysiology Society\nWriting\nCommittee\nMembers*\nWin-Kuang...
1,ACC,gulati-et-al-2021-2021-aha-acc-ase-chest-saem-scct-scmr-guideline-for-the-evaluation-and-diagnosis-of-chest-pain,"CLINICAL PRACTICE GUIDELINE: FULL TEXT\n2021 AHA/ACC/ASE/CHEST/SAEM/\nSCCT/SCMR Guideline for the\nEvaluation and Diagnosis of Chest Pain\nA Report of the American College of Cardiology/American Heart Association\nJoint Committee on Clinical Practice Guidelines\nWriting\nCommittee\nMembers*\nMartha Gulati, MD, MS, FACC, FAHA, Chairy\nPhillip D. Levy, MD, MPH, FACC, FAHA, Vice Chairy\nDebabrata Mukherjee, MD, MS, FACC, FAHA, Vice Chairy\nEzra Amsterdam, MD, FACCy\nDeepak L. Bhatt, MD, MPH, FA..."


### Combine DFs

In [6]:
journals_df = pd.concat([df_acc, df_aha, df_sca, df_sts], ignore_index=True)
journals_df['Text'] = journals_df['Text'].str.replace('\n', ' ', regex=False) # replace each newline with a space in the 'Text' column
journals_df

Unnamed: 0,Journal,Title,Text
0,ACC,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,"CLINICAL PRACTICE GUIDELINE 2017 ACC/AHA/HRS Guideline for the Evaluation and Management of Patients With Syncope A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines and the Heart Rhythm Society Developed in Collaboration With the American College of Emergency Physicians and Society for Academic Emergency Medicine Endorsed by the Pediatric and Congenital Electrophysiology Society Writing Committee Members* Win-Kuang Shen, MD, F..."
1,ACC,gulati-et-al-2021-2021-aha-acc-ase-chest-saem-scct-scmr-guideline-for-the-evaluation-and-diagnosis-of-chest-pain,"CLINICAL PRACTICE GUIDELINE: FULL TEXT 2021 AHA/ACC/ASE/CHEST/SAEM/ SCCT/SCMR Guideline for the Evaluation and Diagnosis of Chest Pain A Report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines Writing Committee Members* Martha Gulati, MD, MS, FACC, FAHA, Chairy Phillip D. Levy, MD, MPH, FACC, FAHA, Vice Chairy Debabrata Mukherjee, MD, MS, FACC, FAHA, Vice Chairy Ezra Amsterdam, MD, FACCy Deepak L. Bhatt, MD, MPH, FACC, FAHAy Kim..."
2,ACC,grundy-et-al-2018-2018-aha-acc-aacvpr-aapa-abc-acpm-ada-ags-apha-aspc-nla-pcna-guideline-on-the-management-of-blood,"CLINICAL PRACTICE GUIDELINE 2018 AHA/ACC/AACVPR/AAPA/ ABC/ACPM/ADA/AGS/APhA/ASPC/ NLA/PCNA Guideline on the Management of Blood Cholesterol A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines Writing Committee Members Scott M. Grundy, MD, PHD, FAHA, Chair* Neil J. Stone, MD, FACC, FAHA, Vice Chair* Alison L. Bailey, MD, FACC, FAACVPRy Craig Beam, CRE* Kim K. Birtcher, MS, PHARMD, AACC, FNLAz Roger S. Blumenthal, MD, FACC, FAHA,..."
3,ACC,stout-et-al-2018-2018-aha-acc-guideline-for-the-management-of-adults-with-congenital-heart-disease,"CLINICAL PRACTICE GUIDELINE 2018 AHA/ACC Guideline for the Management of Adults With Congenital Heart Disease A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines Developed in Collaboration With the American Association for Thoracic Surgery, American Society of Echocardiography, Heart Rhythm Society, International Society for Adult Congenital Heart Disease, Society for Cardiovascular Angiography and Interventions, and Society of..."
4,ACC,whelton-et-al-2017-2017-acc-aha-aapa-abc-acpm-ags-apha-ash-aspc-nma-pcna-guideline-for-the-prevention-detection,"CLINICAL PRACTICE GUIDELINE 2017 ACC/AHA/AAPA/ABC/ACPM/ AGS/APhA/ASH/ASPC/NMA/PCNA Guideline for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines Writing Committee Members Paul K. Whelton, MB, MD, MSC, FAHA, Chair Robert M. Carey, MD, FAHA, Vice Chair Wilbert S. Aronow, MD, FACC, FAHA* Donald E. Casey, JR, MD, MPH, MBA, FAHAy Karen J. Coll..."
...,...,...,...
56,Society of Thoracic Surgeons,PBM_Guideline_2021,"PATIENT BLOOD MANAGEMENT GUIDELINES STS/SCA/AmSECT/SABM Update to the Clinical Practice Guidelines on Patient Blood Management Pierre Tibi, MD, R. Scott McClure, MD, FRCSC, Jiapeng Huang, MD, Robert A. Baker, PhD, CCP, David Fitzgerald, DHA, CCP, C. David Mazer, MD, Marc Stone, MD, Danny Chu, MD, Alfred H. Stammers, MSA, CCP Emeritus, Tim Dickinson, CCP, Linda Shore-Lesserson, MD, Victor Ferraris, MD, Scott Firestone, MS, Kalie Kissoon, and Susan Moffatt-Bruce, MD, FRCSC Department of Cardio..."
57,Society of Thoracic Surgeons,Guidelines_ArterialConduitsforCABG,"The Society of Thoracic Surgeons Clinical Practice Guidelines on Arterial Conduits for Coronary Artery Bypass Grafting Gabriel S. Aldea, MD, Faisal G. Bakaeen, MD, Jay Pal, MD, PhD, Stephen Fremes, MD, Stuart J. Head, MD, PhD, Joseph Sabik, MD, Todd Rosengart, MD, A. Pieter Kappetein, MD, PhD, Vinod H. Thourani, MD, Scott Firestone, MS, and John D. Mitchell, MD Division of Cardiothoracic Surgery, University of Washington School of Medicine, Seattle, Washington; Department of Cardiovascular S..."
58,Society of Thoracic Surgeons,CredentialingofPractitionerstoPerformEndovascularStentGrafti,"STS/AATS POSITION STATEMENT Guidelines for Credentialing of Practitioners to Perform Endovascular Stent-Grafting of the Thoracic Aorta Nicholas T. Kouchoukos, MD (Chair), Joseph E. Bavaria, MD, Joseph S. Coselli, MD, Ralph De La Torre, MD, John S. Ikonomidis, MD, Riyad C. Karmy-Jones, MD, Robert Scott Mitchell, MD, Richard J. Shemin, MD, David Spielvogel, MD, Lars G. Svensson, MD, and Grayson H. Wheatley, MD Task Force on Endovascular Surgery, Workforce on Adult Cardiac Surgery, Council on E..."
59,Society of Thoracic Surgeons,GuidelinesforReportingDataandOutcomesfortheSurgicalTreatment,"REPORT FROM THE WORKFORCE ON EVIDENCE-BASED SURGERY Guidelines for Reporting Data and Outcomes for the Surgical Treatment of Atrial Fibrillation Richard J. Shemin, MD, James L. Cox, MD, A. Marc Gillinov, MD, Eugene H. Blackstone, MD, and Charles R. Bridges, MD Division of Cardiothoracic Surgery, David Geffen School of Medicine at UCLA, Los Angeles, California; Division of Cardiothoracic Surgery, Washington University School of Medicine, St. Louis, Missouri; Department of Thoracic and Cardiov..."


In [21]:
journals_df[['Journal']].value_counts()

Journal                                    
American Heart Association                     26
ACC                                            19
Society of Thoracic Surgeons                   11
Society of Cardiovascular Anesthesiologists     5
dtype: int64
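
Each count above is one lower than the total in the corresponding progress bar (for example, 19 rows versus 20 entries for ACC). That is expected, since `os.listdir` returns every directory entry while only names ending in `.pdf` become rows. A minimal sketch of listing only the PDFs up front is shown below; `list_pdfs` is a hypothetical helper, not part of the notebook, and it also accepts an uppercase `.PDF` extension.

```python
from pathlib import Path

def list_pdfs(folder_path):
    """Return sorted PDF paths directly inside folder_path (hypothetical helper)."""
    folder = Path(folder_path)
    return sorted(
        p for p in folder.iterdir()
        # keep regular files with a .pdf extension, skip hidden files and subfolders
        if p.is_file() and p.suffix.lower() == '.pdf' and not p.name.startswith('.')
    )
```

Iterating over `list_pdfs(folder_path)` inside `create_df` would make the progress total match the row count, although the loop body would then need to work with `Path` objects (for example `p.stem` for the title) instead of file-name strings.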

# Tokenizing/Cost Estimate

### Imports

In [7]:
#import openai 
import os
import pandas as pd 
import numpy as np 
import json
import tiktoken
#import psycopg2
import ast
import pgvector
import math
#from psycopg2.extras import execute_values 
#from pgvector.psycopg2 import register_vector

### Functions

In [8]:
# Helper functions to help us create the embeddings
# Helper func: calculate number of tokens
def num_tokens_from_string(string: str, encoding_name="cl100k_base") -> int:
    if not string:
        return 0

    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))

    return num_tokens

In [9]:
# Helper function: calculate length of essay 
def get_essay_length(essay):
    word_list = essay.split() 
    num_words = len(word_list) 
    return num_words

In [10]:
# Helper function: calculate cost of embedding num_tokens 
# Assumes we're using the text-embedding-ada-002 model 
# See https://openai.com/pricing
def get_embedding_cost(num_tokens):
    return num_tokens/1000*0.0001
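
As a quick worked check of the formula, the sketch below applies it to 96,818 tokens, the count reported for the first ACC guideline later in this notebook. The `price_per_1k_tokens` parameter is an illustrative addition; the notebook hard-codes the 0.0001 rate.

```python
def get_embedding_cost(num_tokens, price_per_1k_tokens=0.0001):
    # price_per_1k_tokens is an illustrative parameter; 0.0001 is the rate hard-coded above
    return num_tokens / 1000 * price_per_1k_tokens

tokens = 96818
cost = get_embedding_cost(tokens)
print(round(cost, 6))  # 0.009682
```

Keeping the rate as a parameter makes it easier to re-run the estimate if the published price changes.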

In [11]:
# Helper function: calculate total cost of embedding all content in the dataframe 
def get_total_embeddings_cost(df):
    total_tokens = 0
    # Iterate over the DataFrame passed in, not the global journals_df
    for text in df['Text']:
        token_len = num_tokens_from_string(text)
        total_tokens += token_len
    total_cost = get_embedding_cost(total_tokens)
    return total_cost

### Apply Functions

In [12]:
import tiktoken
# Apply num_tokens_from_string to the 'Text' column to create 'NumTokens'
journals_df['NumTokens'] = journals_df['Text'].apply(num_tokens_from_string)
# Apply get_essay_length to the 'Text' column to create 'EssayLength' (word count)
journals_df['EssayLength'] = journals_df['Text'].apply(get_essay_length)
# Apply get_embedding_cost to the 'NumTokens' column to create 'EmbeddingCost'
journals_df['EmbeddingCost'] = journals_df['NumTokens'].apply(get_embedding_cost)


# Display the DataFrame with the new 'NumTokens', 'EssayLength', 'EmbeddingCost' columns
journals_df.head(3)


Unnamed: 0,Journal,Title,Text,NumTokens,EssayLength,EmbeddingCost
0,ACC,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,"CLINICAL PRACTICE GUIDELINE 2017 ACC/AHA/HRS Guideline for the Evaluation and Management of Patients With Syncope A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines and the Heart Rhythm Society Developed in Collaboration With the American College of Emergency Physicians and Society for Academic Emergency Medicine Endorsed by the Pediatric and Congenital Electrophysiology Society Writing Committee Members* Win-Kuang Shen, MD, F...",96818,54853,0.009682
1,ACC,gulati-et-al-2021-2021-aha-acc-ase-chest-saem-scct-scmr-guideline-for-the-evaluation-and-diagnosis-of-chest-pain,"CLINICAL PRACTICE GUIDELINE: FULL TEXT 2021 AHA/ACC/ASE/CHEST/SAEM/ SCCT/SCMR Guideline for the Evaluation and Diagnosis of Chest Pain A Report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines Writing Committee Members* Martha Gulati, MD, MS, FACC, FAHA, Chairy Phillip D. Levy, MD, MPH, FACC, FAHA, Vice Chairy Debabrata Mukherjee, MD, MS, FACC, FAHA, Vice Chairy Ezra Amsterdam, MD, FACCy Deepak L. Bhatt, MD, MPH, FACC, FAHAy Kim...",127678,73972,0.012768
2,ACC,grundy-et-al-2018-2018-aha-acc-aacvpr-aapa-abc-acpm-ada-ags-apha-aspc-nla-pcna-guideline-on-the-management-of-blood,"CLINICAL PRACTICE GUIDELINE 2018 AHA/ACC/AACVPR/AAPA/ ABC/ACPM/ADA/AGS/APhA/ASPC/ NLA/PCNA Guideline on the Management of Blood Cholesterol A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines Writing Committee Members Scott M. Grundy, MD, PHD, FAHA, Chair* Neil J. Stone, MD, FACC, FAHA, Vice Chair* Alison L. Bailey, MD, FACC, FAACVPRy Craig Beam, CRE* Kim K. Birtcher, MS, PHARMD, AACC, FNLAz Roger S. Blumenthal, MD, FACC, FAHA,...",101129,54412,0.010113


In [13]:
# quick check on total token amount for price estimation 
total_cost = get_total_embeddings_cost(journals_df)
print("estimated price to embed this content = $" + str(round(total_cost,3)))

estimated price to embed this content = $0.413


### Tokenization

In [14]:
total_cost = 0.0
new_list = []
ideal_token_size = 1024

# Iterate through rows in the 'journals_df' DataFrame
for i in range(len(journals_df.index)):
    text = journals_df['Text'][i]
    token_len = num_tokens_from_string(text)

    if token_len <= ideal_token_size:
        # If the text is less than or equal to 1024 tokens, add it directly to the new list
        new_list.append([journals_df['Title'][i], text, journals_df['Journal'][i], token_len])
        total_cost += get_embedding_cost(token_len)
    else:
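        # If the text is longer, split it into chunks of roughly 1024 tokens,
        # assuming about 0.75 words per token (768 words per chunk)
        words = text.split()

        # No-op safeguard: str.split() with no arguments never returns ' '
        words = [x for x in words if x != ' ']

        # Number of full chunks; floor division drops any trailing partial chunk
        num_chunks = len(words) // int(ideal_token_size * 0.75)  # Convert to integer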

        for j in range(num_chunks):
            start = int(j * ideal_token_size * 0.75)
            end = int((j + 1) * ideal_token_size * 0.75)

            new_content = words[start:end]
            new_content_string = ' '.join(new_content)
            new_content_token_len = num_tokens_from_string(new_content_string)

            if new_content_token_len > 0:
                new_list.append([journals_df['Title'][i], new_content_string, journals_df['Journal'][i], new_content_token_len])
                total_cost += get_embedding_cost(new_content_token_len)

# Display the total estimated price
print("Total Estimated Price: $", round(total_cost,3))


Total Estimated Price: $ 0.404
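
The loop above computes the number of chunks with floor division, so any words after the last full 768-word chunk are never added to `new_list`. This probably contributes to the chunked estimate ($0.404) being lower than the whole-document estimate ($0.413). A minimal sketch of a splitter that keeps the final partial chunk follows; `chunk_words` is a hypothetical helper, and the 768-word default mirrors `ideal_token_size * 0.75`.

```python
def chunk_words(text, words_per_chunk=768):
    """Split text into word chunks, keeping the final partial chunk (hypothetical helper)."""
    words = text.split()
    return [
        ' '.join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]

# Synthetic 2000-word document for illustration only
sample = ' '.join(f'w{i}' for i in range(2000))
chunks = chunk_words(sample)
print([len(c.split()) for c in chunks])  # [768, 768, 464]
print(2000 // 768)                       # 2 full chunks under floor division
```

Switching to this approach would change the chunk count reported below, so the embeddings would need to be regenerated.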


In [15]:
journals_df.sample(5)

Unnamed: 0,Journal,Title,Text,NumTokens,EssayLength,EmbeddingCost
58,Society of Thoracic Surgeons,CredentialingofPractitionerstoPerformEndovascularStentGrafti,"STS/AATS POSITION STATEMENT Guidelines for Credentialing of Practitioners to Perform Endovascular Stent-Grafting of the Thoracic Aorta Nicholas T. Kouchoukos, MD (Chair), Joseph E. Bavaria, MD, Joseph S. Coselli, MD, Ralph De La Torre, MD, John S. Ikonomidis, MD, Riyad C. Karmy-Jones, MD, Robert Scott Mitchell, MD, Richard J. Shemin, MD, David Spielvogel, MD, Lars G. Svensson, MD, and Grayson H. Wheatley, MD Task Force on Endovascular Surgery, Workforce on Adult Cardiac Surgery, Council on E...",2410,1579,0.000241
38,American Heart Association,meschia-et-al-2023-management-of-inherited-cns-small-vessel-diseases-the-cadasil-example-a-scientific-statement-from,"Stroke is available at www.ahajournals.org/journal/str Stroke e452 October 2023 Stroke. 2023;54:e452–e464. DOI: 10.1161/STR.0000000000000444 © 2023 American Heart Association, Inc. AHA SCIENTIFIC STATEMENT Management of Inherited CNS Small Vessel Diseases: The CADASIL Example: A Scientific Statement From the American Heart Association James F. Meschia, MD, FAHA, Chair; Bradford B. Worrall, MD, MSc, FAHA, Vice Chair; Fanny M. Elahi, MD, PhD; Owen A. Ross, PhD; Michael M. Wang, MD, PhD;...",21368,10345,0.002137
44,American Heart Association,perman-et-al-2023-temperature-management-for-comatose-adult-survivors-of-cardiac-arrest-a-science-advisory-from-the,"Circulation September 19, 2023 Circulation. 2023;148:982–988. DOI: 10.1161/CIR.0000000000001164 Circulation is available at www.ahajournals.org/journal/circ 982 © 2023 American Heart Association, Inc. AHA SCIENCE ADVISORY Temperature Management for Comatose Adult Survivors of Cardiac Arrest: A Science Advisory From the American Heart Association Sarah M. Perman, MD, MSCE, FAHA, Vice Chair; Jason A. Bartos, MD, PhD, FAHA; Marina Del Rios, MD, MSc; Michael W. Donnino, MD; Karen G. Hirsc...",9160,5155,0.000916
50,Society of Thoracic Surgeons,AorticValveandAscendingAortaGuidelinesforManagementandQualityMeasures,"DOI: 10.1016/j.athoracsur.2013.01.083 Ann Thorac Surg 2013;95:1-66 Vinod H. Thourani, E. Murat Tuzcu, John Webb and Mathew R. Williams Michael Reardon, T. Brett Reece, G. Russell Reiss, Eric E. Roselli, Craig R. Smith, Kodali, Samir Kapadia, Martin B. Leon, Brian Lima, Bruce W. Lytle, Michael J. Mack, Dewey, Richard S. D'Agostino, Thomas G. Gleason, Katherine B. Harrington, Susheel Joseph E. Bavaria, Eugene H. Blackstone, Tirone E. David, Nimesh D. Desai, Todd M. Craig Miller, Patrick T...",92825,56225,0.009283
14,ACC,ommen-et-al-2020-2020-aha-acc-guideline-for-the-diagnosis-and-treatment-of-patients-with-hypertrophic-cardiomyopathy,"CLINICAL PRACTICE GUIDELINE 2020 AHA/ACC Guideline for the Diagnosis and Treatment of Patients With Hypertrophic Cardiomyopathy A Report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines Developed in collaboration with and endorsed by the American Association for Thoracic Surgery, American Society of Echocardiography, Heart Failure Society of America, Heart Rhythm Society, Society for Cardiovascular Angiography and Interventions,...",110626,64372,0.011063


In [145]:
len(new_list) 

2941

In [16]:
token_df = pd.DataFrame(new_list, columns = ['title', 'text','journal','tokenLength'])
token_df

Unnamed: 0,title,text,journal,tokenLength
0,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,"CLINICAL PRACTICE GUIDELINE 2017 ACC/AHA/HRS Guideline for the Evaluation and Management of Patients With Syncope A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines and the Heart Rhythm Society Developed in Collaboration With the American College of Emergency Physicians and Society for Academic Emergency Medicine Endorsed by the Pediatric and Congenital Electrophysiology Society Writing Committee Members* Win-Kuang Shen, MD, F...",ACC,1253
1,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,"PHD, RN, FAHA Duminda N. Wijeysundera, MD, PHD #Former Task Force member; current member during the writing effort. TABLE OF CONTENTS PREAMBLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . e41 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . e43 1.1. Methodology and Evidence Review . . . . . . . . . . e43 1.2. Organization of the Writing Committee . . . . . . . e44 1.3. Document Review and Approval . . . . . . . . . . . . . e44 1.4. Scope ...",ACC,1183
2,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,". . . e69 5. REFLEX CONDITIONS: RECOMMENDATIONS . . . . e70 5.1. Vasovagal Syncope: Recommendations . . . . . . . e70 Shen et al. J A C C V O L . 7 0 , N O . 5 , 2 0 1 7 2017 ACC/AHA/HRS Syncope Guideline A U G U S T 1 , 2 0 1 7 : e 3 9 – 1 1 0 e40 5.2. Pacemakers in Vasovagal Syncope: Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . e72 5.3. Carotid Sinus Syndrome: Recommendations . . . e72 5.4. Other Reﬂex Conditions . . . . . . . . . . . . . . . . . . . . e73 6. ORTHOSTATI...",ACC,1074
3,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,"development and publication of guidelines without commercial support, and members of each organization volunteer their time to the writing and review efforts. Guidelines are ofﬁcial policy of the ACC and AHA. Intended Use Practice guidelines provide recommendations applicable to patients with or at risk of developing cardiovascular disease. The focus is on medical practice in the United States, but guidelines developed in collaboration with other organizations may have a global impact. Altho...",ACC,1010
4,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,"into actionable recommendations. ERC mem- bers may include methodologists, epidemiologists, healthcare providers, and biostatisticians. The recom- mendations developed by the writing committee on the basis of the systematic review are marked with “SR”. Guideline-Directed Management and Therapy The term guideline-directed management and therapy (GDMT) encompasses clinical evaluation, diagnostic testing, and pharmacological and procedural treatments. For these and all recommended drug treatmen...",ACC,1173
...,...,...,...,...
2936,AntibioticProphylaxisinCardiacSurgeryPartIDuration,"No Sisto and colleagues [39] Finland 1994 551 Yes Ceftriaxone 2.9 Cefuroxime 2.9 48 Hours No Hall and colleagues [40] Australia 1993 1031 No Ceftriaxone 2.7 Flucloxacillin gentamycin 1.6 48 Hours No Beam and colleagues [41] United States 1984 94 Yes Ceftriaxone 4.1 Cefazolin 2.2 48 Hours No a Surgical site infections refers to the incidence of sternal infections, including mediastinitis and superﬁcial sternal wound infections. 399 Ann Thorac Surg WORKFORCE REPORT EDWARDS ET AL 2006;81:397–40...",Society of Thoracic Surgeons,1221
2937,AntibioticProphylaxisinCardiacSurgeryPartIDuration,"There was no differ- ence in SSI between the groups. Geroulanos and col- leagues [45] compared a 2-day course of cefuroxime against a 4-day course of cefazolin in 569 randomized patients undergoing cardiac surgery in Switzerland. There was no statistically signiﬁcant difference between infectious complications in the two groups. In 1988, Jewell and colleagues [46] randomized 200 CABG pa- tients into a group receiving 48 hours of intravenous cephalothin or a group receiving 3 days of oral cep...",Society of Thoracic Surgeons,1153
2938,AntibioticProphylaxisinCardiacSurgeryPartIDuration,Surgery [56] Cephalosporin class of antimicrobials Data suggest that a 1-day course of intravenous antimicrobials is as efﬁcacious as the traditional 48-hour (or longer) regimen. American Society of Health-System Pharmacists Commission on Therapeutics [4] Cefazolin for up to 72 hours The duration is based on consensus of the expert panel because the data do not delineate the optimal duration of prophylaxis. Prophylaxis for 24 hours or less may be appropriate for cardiothoracic procedures. Ce...,Society of Thoracic Surgeons,1126
2939,AntibioticProphylaxisinCardiacSurgeryPartIDuration,"tubes are removed [16]. The writing committee found no scientiﬁc evidence that this practice provides enhanced protection against infectious complications. To the con- trary, there is uniform agreement that this policy should not be followed [4, 24, 55]. CONCLUSION. The duration of antibiotic prophylaxis should not be dependent on indwelling catheters of any type. OPTIMAL PRACTICE. Decisions regarding the continuation of antibiotic prophylaxis are not guided by the presence of indwelling cat...",Society of Thoracic Surgeons,1254


In [24]:
# Save the DataFrame to a CSV file
#token_df.to_csv('token_data.csv', index=False)

In [4]:
# Read the data from the CSV file into a DataFrame
#token_df = pd.read_csv('token_data.csv')

# Embeddings Transformer

### Simple Example


In [27]:
import warnings
warnings.filterwarnings("ignore")

In [23]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("neuml/pubmedbert-base-embeddings")
embeddings = model.encode(sentences)
print(embeddings)

[[-0.54902166 -0.00991799 -0.2637591  ... -0.15789218 -1.2998055
   0.80934805]
 [-1.0420783   0.789706    0.5180276  ... -0.5906364  -1.0819337
   0.50429845]]


In [25]:
embeddings.shape

(2, 768)

### Helper Function

In [62]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("neuml/pubmedbert-base-embeddings")

# Function to get embeddings for a given text
def get_embeddings(text):
    return model.encode([text])



### DF example

In [14]:
#Example DataFrame structure
token_test = pd.DataFrame({
    'Title': ['Title1', 'Title2', 'Title1', 'Title3'],
    'Text': ['Text1', 'Text2', 'Text3', 'Text4'],
    'Journal': ['Journal1', 'Journal2', 'Journal3', 'Journal4'],
    'TokenLength': [100, 150, 80, 120]
})
token_test

Unnamed: 0,Title,Text,Journal,TokenLength
0,Title1,Text1,Journal1,100
1,Title2,Text2,Journal2,150
2,Title1,Text3,Journal3,80
3,Title3,Text4,Journal4,120


In [16]:
token_test['Text'].apply(get_embeddings)

0    [[-0.34269476, 0.22396612, -0.39297545, -0.5840893, -0.29564363, 0.39713198, 0.058310296, -0.37871805, 0.09129639, -0.33860853, 0.12316477, 0.21731886, -0.080019265, -0.5096862, -1.383059, -0.50511414, -0.7472632, -0.41175932, -0.8023629, 0.863835, 0.17570278, 0.4570406, -1.0481806, -0.33272225, 1.482028, -0.46649534, -0.31816965, 0.40699726, -0.39038014, 0.9384389, -0.73542595, 0.22658761, -0.7853, -0.026946098, -0.7854873, -0.09198919, 0.66320056, 0.5490468, 1.5014609, 0.30245855, 0.064644...
1    [[-0.3205153, 0.2158633, -0.309447, -0.45649132, -0.2678013, 0.36074597, 0.2471737, -0.33655283, 0.031112397, -0.36891162, 0.16731533, 0.25502813, -0.045056507, -0.38582173, -1.4026335, -0.66704446, -0.7367033, -0.42078236, -0.7827431, 0.5990229, 0.37577748, 0.32353404, -1.1156484, -0.24333572, 1.3602757, -0.5134183, -0.053586997, 0.43070912, -0.43484792, 0.7828852, -0.89400333, 0.069375366, -0.9891583, 0.10668503, -0.6152518, -0.1414635, 0.4736438, 0.4884441, 1.2887976, 0.38410527, 0.

### Apply to token_df (runs for a while)

In [63]:
# Apply the function to the 'Text' column of token_df
token_df['Embeddings'] = token_df['text'].apply(get_embeddings)


In [64]:
# Save the DataFrame to a CSV file
token_df.to_csv('xxtoken_data_embed.csv', index=False)

In [65]:
xxx = pd.read_csv('xxtoken_data_embed.csv')

In [66]:
pd.DataFrame(xxx.iloc[1923]).T

Unnamed: 0,title,text,journal,tokenLength,Embeddings
1923,abdalla-et-al-2023-implementation-strategies-to-improve-blood-pressure-control-in-the-united-states-a-scientific,mitigation efforts can be paired with multilevel strategies that support wider adoption of evidence-based inter- ventions for BP control to increase access to high- quality health care services within communities. • Accurate BP measurement and increased use of SMBP monitoring – Increase and synergize efforts to educate and train clinicians and patients in how to select validated BP measurement devices and to measure BP accu- rately inside and outside of the office setting – Increase the use ...,American Heart Association,1051,[[-3.06179702e-01 3.85651082e-01 -2.64884740e-01 -3.00773293e-01\n -1.97745889e-01 2.44078264e-01 -7.20645785e-01 1.26247443e-02\n -7.91457817e-02 -1.09105849e+00 5.29329598e-01 -1.32602900e-01\n 7.84991160e-02 -8.33393097e-01 -4.56297040e-01 1.02011852e-01\n 2.89022148e-01 3.73335391e-01 2.93962121e-01 8.51432383e-01\n -5.04455745e-01 -2.73868740e-01 -1.48879409e-01 -6.35170341e-01\n 5.93232214e-01 -8.74805972e-02 3.13392848e-01 -9.63422000e-01\n -1.10025913e-01 5.186662...
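
The row above shows that a CSV round trip turns each embedding into the printed string form of a NumPy array rather than numeric data. A minimal sketch of parsing such a string back into floats is shown below. `parse_embedding_string` is a hypothetical helper, and it assumes the full array was written out with no `...` abbreviation (NumPy typically abbreviates only arrays above its print threshold, which defaults to 1000 elements).

```python
def parse_embedding_string(s):
    """Parse the printed form of a NumPy array into a flat list of floats."""
    # Brackets carry no numeric information, so replace them with whitespace
    cleaned = s.replace('[', ' ').replace(']', ' ')
    # split() handles both spaces and the embedded newlines
    return [float(token) for token in cleaned.split()]

sample = "[[-3.06179702e-01 3.85651082e-01 -2.64884740e-01\n -1.97745889e-01]]"
vector = parse_embedding_string(sample)
print(len(vector), vector[0])
```

The Parquet route used later in the notebook avoids this parsing step because the values are stored as numeric arrays.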


In [None]:
# Saving as Parquet requires squeezing each (1, 768) embedding to a 1-D array of length 768
# token_df['Embeddings'] = token_df['Embeddings'].apply(lambda x: np.squeeze(x))

In [205]:
token_df.to_parquet('token_data_embed.parquet')
# df_read = pd.read_csv('output.csv', converters={'Column2': lambda x: np.array(x.split(';')).astype(float)})

# Import Parquet with Embedding Formatting


In [72]:
token_df = pd.read_parquet('token_data_embed.parquet')
token_df.head(1)

Unnamed: 0,title,text,journal,tokenLength,Embeddings
0,shen-et-al-2017-2017-acc-aha-hrs-guideline-for-the-evaluation-and-management-of-patients-with-syncope,"CLINICAL PRACTICE GUIDELINE 2017 ACC/AHA/HRS Guideline for the Evaluation and Management of Patients With Syncope A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines and the Heart Rhythm Society Developed in Collaboration With the American College of Emergency Physicians and Society for Academic Emergency Medicine Endorsed by the Pediatric and Congenital Electrophysiology Society Writing Committee Members* Win-Kuang Shen, MD, F...",ACC,1253,"[-0.7635219, 0.028012963, -0.2274014, -1.2573576, -0.42328897, 0.6349173, 0.03296693, -0.24082562, -0.46529308, -1.5539907, -0.009345492, 1.0497372, -0.15796132, -0.38976455, -0.82200634, 0.42113262, 0.39703044, 0.41187605, 0.53316236, 0.9349648, -0.032043707, -0.511867, -0.36021292, 0.13832396, 1.0502408, 0.48061788, -0.15158539, -0.35105857, -0.71472317, -0.08860559, -0.38612437, -0.5157135, -0.18839104, 0.18588898, -0.09911649, -0.01646175, 0.15159963, 0.2816479, -0.49376765, -0.66066873,..."


# Testing Questions with Embeddings

In [14]:
token_df.shape

(2941, 5)

In [54]:
token_df.sample(1)

Unnamed: 0,title,text,journal,tokenLength,Embeddings
1923,abdalla-et-al-2023-implementation-strategies-to-improve-blood-pressure-control-in-the-united-states-a-scientific,mitigation efforts can be paired with multilevel strategies that support wider adoption of evidence-based inter- ventions for BP control to increase access to high- quality health care services within communities. • Accurate BP measurement and increased use of SMBP monitoring – Increase and synergize efforts to educate and train clinicians and patients in how to select validated BP measurement devices and to measure BP accu- rately inside and outside of the office setting – Increase the use ...,American Heart Association,1051,"[-0.3061797, 0.38565108, -0.26488474, -0.3007733, -0.19774589, 0.24407826, -0.7206458, 0.012624744, -0.07914578, -1.0910585, 0.5293296, -0.1326029, 0.078499116, -0.8333931, -0.45629704, 0.10201185, 0.28902215, 0.3733354, 0.29396212, 0.8514324, -0.50445575, -0.27386874, -0.14887941, -0.63517034, 0.5932322, -0.0874806, 0.31339285, -0.963422, -0.11002591, 0.05186662, 0.2377259, -0.6157047, 0.16658436, 0.4672873, -0.008122997, 0.1201987, -0.4960513, 0.4040435, -0.30293545, 0.06858949, -0.1535663..."


In [24]:
len(token_df.loc[20, 'Embeddings'])

768

In [75]:
# Count the number of values in the embedding list of each row of token_df.
# token_df must be read in from the parquet file for this to work.
# The result shows that every row holds an embedding of exactly 768 values.
counts = {}
for index, row in token_df.iterrows():
    count = len(row['Embeddings'])
    if count in counts:
        counts[count] += 1
    else:
        counts[count] = 1

# Create a DataFrame to display the counts
counts_df = pd.DataFrame(counts.items(), columns=['Number of Elements', 'Count'])
print(counts_df)

   Number of Elements  Count
0                 768   2941
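
The requirement to read from parquet is probably there because a CSV round trip stores each embedding as text, so `len()` would count characters rather than dimensions. The sketch below illustrates that under the assumption that embeddings are plain Python lists. `toy_df` is a hypothetical stand-in for `token_df`, and the final lines show a vectorized form of the length check.

```python
import io
import json

import pandas as pd

# Hypothetical toy frame standing in for token_df (3-dimensional embeddings for brevity)
toy_df = pd.DataFrame({
    "text": ["chunk a", "chunk b"],
    "Embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Round-trip through CSV: each list is written out as its string form
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
csv_df = pd.read_csv(buffer)

first_value = csv_df.loc[0, "Embeddings"]
is_string = isinstance(first_value, str)  # the list came back as text
string_length = len(first_value)          # counts characters, not dimensions

# Parse the text back into lists of floats
csv_df["Embeddings"] = csv_df["Embeddings"].map(json.loads)

# Vectorized version of the length check
length_counts = csv_df["Embeddings"].map(len).value_counts()
print(length_counts)
```

On the real data the same check would be `token_df['Embeddings'].map(len).value_counts()`.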


In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

In [76]:
# Example question
#question = "What are the symptoms of heart disease?"
question = input()

 What are the symptoms of heart disease?


In [6]:
def heartquestion(token_df):
    # Read a single question from the user
    question = input()

    # Load the SentenceTransformer model
    model = SentenceTransformer("neuml/pubmedbert-base-embeddings")
    # Encode the question into an embedding
    test_question_embeddings = model.encode(question)

    # Reshape to 2-D arrays, as cosine_similarity expects
    test_question_embeddings_2d = test_question_embeddings.reshape(1, -1)
    embeddings_array_2d = np.stack(token_df['Embeddings']).reshape(len(token_df), -1)

    # Compute cosine similarity between the question and every text chunk
    similarities = cosine_similarity(test_question_embeddings_2d, embeddings_array_2d)

    # Find the position of the most similar embedding
    most_similar_index = similarities.argmax()

    # argmax returns a position, so look the row up with iloc rather than loc
    best_row = token_df.iloc[most_similar_index]
    results = best_row['text']
    journal = best_row['journal']
    title = best_row['title']
    maxcossim = round(np.max(similarities),4)
    # Print the results
    print(f"Question: {question}\nTitle: {title}\nJournal:{journal}")
    print("Cosine Similarity: ", maxcossim)
    print(f"Result: {results}\n")

In [77]:
heartquestion(token_df)

 What are the symptoms of heart disease?




Question: What are the symptoms of heart disease?
Title: heidenreich-et-al-2022-2022-aha-acc-hfsa-guideline-for-the-management-of-heart-failure
Journal:ACC
Cosine Similarity:  0.5937
Result: Speciﬁcally About Family Members* With Cardiac morphology Marked LV hypertrophy Any mention of cardiomyopathy, enlarged or weak heart, HF. Document even if attributed to other causes, such as alcohol or peripartum cardiomyopathy LV noncompaction Right ventricular thinning or fatty replacement on imaging or biopsy Findings on 12- lead ECG Abnormal high or low voltage or conduction, and repolarization, altered RV forces Long QT or Brugada syndrome Dysrhythmias Frequent NSVT or very frequent PVCs ICD Recurrent syncope Sudden death attributed to “massive heart attack” without known CAD Unexplained fatal event such as drowning or single-vehicle crash Sustained ventricular tachycardia or ﬁbrillation Early onset AF “Lone” AF before age 65 y Early onset conduction disease Pacemaker before age 65 y Extracar
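
The function above returns only the single best match. When the top hit is a table fragment like this one, it can help to look at several candidates. The following is a minimal sketch of top-k retrieval using NumPy only. The helper name `top_k_matches` and the toy vectors are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_matches(question_embedding, chunk_embeddings, k=3):
    """Return (positions, scores) of the k chunks most similar to the question."""
    q = np.asarray(question_embedding, dtype=float)
    m = np.asarray(chunk_embeddings, dtype=float)
    # Normalize so that a dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q
    # Sort descending and keep the first k positions
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

# Hypothetical toy data: 4 chunks with 3-dimensional embeddings
chunks = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

positions, scores = top_k_matches(query, chunks, k=2)
print(positions, np.round(scores, 4))
```

With the real data, `chunk_embeddings` would be `np.stack(token_df['Embeddings'])`, and the matching rows could be fetched with `token_df.iloc[positions]`.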

# Extra

In [258]:
# Load the SentenceTransformer model
model = SentenceTransformer("neuml/pubmedbert-base-embeddings")
# Encode questions into embeddings
test_question_embeddings = model.encode(question)



In [259]:
# Reshape and compute Cosine Similarity
test_question_embeddings_2d = test_question_embeddings.reshape(1, -1)
embeddings_array_2d = np.stack(token_df['Embeddings']).reshape(len(token_df), -1)

# Compute cosine similarity
similarities = cosine_similarity(test_question_embeddings_2d, embeddings_array_2d)

In [266]:
#display index with max cosine similarity
similarities.argmax()

568

In [261]:
# Find the position of the most similar embedding
most_similar_indices = similarities.argmax()

# argmax returns a position, so use iloc to get the corresponding row
results = token_df.iloc[most_similar_indices]['text']
journal = token_df.iloc[most_similar_indices]['journal']
title = token_df.iloc[most_similar_indices]['title']
# Print the results
print(f"Question: {question}\nTitle: {title}\nJournal:{journal}\nResult: {results}\n")

Question: What drugs should a patient undergoing cardiac surgery be given to prevent postoperative atrial fibrillation?
Title: joglar-et-al-2023-2023-acc-aha-accp-hrs-guideline-for-the-diagnosis-and-management-of-atrial-fibrillation
Journal:ACC
Result: AF After Cardiac Surgery Referenced studies that support the recommendations are summarized in the Online Data Supplement. COR LOE RECOMMENDATIONS 2a B-R 1. In patients undergoing cardiac surgery who are at high risk for postoperative AF, it is reasonable to administer short-term prophylactic beta blockers or amiodarone to reduce the incidence of postoperative AF.1-5 2a B-R 2. In patients undergoing CABG, aortic valve, or ascending aortic aneurysm operations, it is reasonable to perform concomitant posterior left pericardiotomy to reduce the incidence of postoperative AF.6,7 FIGURE 25 Prevention of AF After Cardiac Surgery AF indicates atrial ﬁbrillation; and CABG, coronary artery bypass graft. Colors correspond to Table 2. Joglar et al 

### Cosine Similarity Example

In [111]:
embedding1 = [
    [7, 0.4, 0.1, 0.8],
    [0.3, 0.5, 0.2, 0.7]
]

embedding2 = [
    [7, 0.6, 0.3, 0.9],
    [0.4, 0.7, 0.2, 0.8],
    [4,5,6,7]
]

# Convert lists of lists to NumPy arrays
embedding1 = np.array(embedding1)
embedding2 = np.array(embedding2)

# Calculate cosine similarity between the embeddings
similarity_matrix = cosine_similarity(embedding1, embedding2)
similarity_matrix

array([[0.99911073, 0.45950211, 0.45694615],
       [0.46729121, 0.99471423, 0.93601148]])
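
To make the numbers above less opaque, the same matrix can be reproduced from the definition of cosine similarity: the dot product of two vectors divided by the product of their lengths. This is a sketch of that definition in NumPy. The helper name `cosine_similarity_matrix` is hypothetical, and the code is not a description of how scikit-learn computes it internally.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale each row to unit length; the dot product is then the cosine
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

embedding1 = [
    [7, 0.4, 0.1, 0.8],
    [0.3, 0.5, 0.2, 0.7],
]
embedding2 = [
    [7, 0.6, 0.3, 0.9],
    [0.4, 0.7, 0.2, 0.8],
    [4, 5, 6, 7],
]

manual = cosine_similarity_matrix(embedding1, embedding2)
print(manual.shape)  # one row per vector in embedding1, one column per vector in embedding2
```

Each entry `manual[i, j]` compares row `i` of `embedding1` with row `j` of `embedding2`. Because cosine similarity ignores vector length, `[0.3, 0.5, 0.2, 0.7]` and `[4, 5, 6, 7]` still score about 0.936 despite their different magnitudes.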

### Test Questions

This is an 85-year-old female patient with a history of diabetes, immunocompromised status, and a family history of premature coronary artery disease (CAD). She is a current daily smoker and consumes 2-7 drinks per week. The patient has stable angina and is classified as NYHA Class I. She has two diseased vessels and an ejection fraction of 20%. The patient is undergoing her first cardiovascular surgery, specifically a coronary artery bypass graft. The surgery is elective and scheduled this week. She is not on home oxygen and does not have chronic lung disease or pneumonia. Her hematocrit level is 45%, white blood cell count is 10 x 10^9/L, platelet count is 380,000/mL, and her last recorded creatinine level is 1.0. She has a history of an IgE mediated reaction to penicillin. What antibiotic should be given to this patient before surgery this week?

What drugs should a patient undergoing cardiac surgery be given to prevent postoperative atrial fibrillation?

A patient presents with postoperative SND and hemodynamic instability after mitral valve repair. How should this patient be managed?

## Notes/Terminal Commands


follow port forwarding to open jupyter:
jupyter lab --port=8080

In [None]:
MUSC_PRACTICUM/VectorDB.ipynb

In [None]:
MUSC_PRACTICUM/token_data_embed.parquet


Add token_data_embed.parquet to the datasets folder on the MUSC server:
scp ~/MUSC_PRACTICUM/token_data_embed.parquet suttoshe@pe8545-innovationcenter-surgery.mdc.musc.edu:/home/suttoshe/datasets

scp ~/MUSC_PRACTICUM/VectorDB.ipynb suttoshe@pe8545-innovationcenter-surgery.mdc.musc.edu:/home/suttoshe/notebooks

Connect to the MUSC server via SSH:

First connect to the VPN
ssh suttoshe@pe8545-innovationcenter-surgery.mdc.musc.edu
enter password
bash
enter password again if prompted
exit (to disconnect)


streamlit - websites from notebook


In [None]:
Also created a repo:

git clone git-innovationcenter-surgery.mdc.musc.edu:/repos/mr_tr_reduction.git