<a href="https://colab.research.google.com/github/sammargolis/GI-Board-Examination/blob/main/GI_Board_Examination_vShare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Guide for the GI Board Examination Script

This guide will assist you in setting up and running the `GI board examination vShare` Python script. The script is designed to automate the uploading, extraction, and analysis of data related to GI Board Examinations. Here's how to get everything set up and running smoothly.

#### Requirements
Before running the script, ensure that you have the following:
- **Python Environment**: A Python environment that supports package installation, such as Anaconda or a virtual environment in a development setup like Jupyter Notebook or Google Colab.
- **Internet Connection**: Required for installing packages and potentially for API calls.

#### Installation of Dependencies
All dependencies are installed by the `pip install` commands in the code cells below. Package versions change over time, so some cells may need to be re-run with `--upgrade`.

#### Files Needed
To run the script, you will need the following files:
- **`2022_Test_Blank.csv`**: Contains the blank GI board exam questions with text and references to associated images stored in `2022_ACG_Files.zip`. Place this file in a directory that the script can access.
- **`ACG_self_assessment_examples.csv`**: Includes 5-shot learning examples with textual content and references to images in `Example_Images.zip`. Place this file in a directory that the script can access.
- **`2022_ACG_Files.zip`**: Contains all images referenced in the `2022_Test_Blank.csv`, necessary for the visual components of the exam questions. Ensure this file is in the same directory as the script.
- **`Example_Images.zip`**: Holds all images referenced in the `ACG_self_assessment_examples.csv`, providing visual support for the assessment examples. Ensure this file is in the same directory as the script.
- **API Keys**: You need API keys for the OpenAI and Gemini services. Store them securely and update the script to retrieve them as needed.

All files can be found in the Google Drive Link here: https://drive.google.com/drive/folders/116W2snTaJ6l4Y55oDu9mpjF1pdW6gdul?usp=sharing
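
Before uploading anything, it can help to confirm that every file listed above is actually present. The following is a minimal sketch, assuming all files sit in one directory; `missing_required_files` is a hypothetical helper and is not part of the original script.

```python
from pathlib import Path

# File names taken from the "Files Needed" list above.
REQUIRED_FILES = [
    "2022_Test_Blank.csv",
    "ACG_self_assessment_examples.csv",
    "2022_ACG_Files.zip",
    "Example_Images.zip",
]


def missing_required_files(directory="."):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

Calling `missing_required_files("/content")` in Colab would return an empty list once all four files have been uploaded.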

#### Example Image Files
If the script processes or generates images, ensure that these image files are correctly formatted and named according to the script's requirements. Include them in the directory or upload feature as needed.

## Initial Set Up

In [None]:
!pip install langchain
!pip install openai
!pip install patool
!pip install requests
!pip install langchain_community

Collecting langchain
  Downloading langchain-0.2.5-py3-none-any.whl (974 kB)
Collecting langchain-core<0.3.0,>=0.2.7 (from langchain)
  Downloading langchain_core-0.2.7-py3-none-any.whl (315 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.1-py3-none-any.whl (23 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.77-py3-none-any.whl (125 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.3.0,>=0.2.7->langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmit

In [None]:
import pandas as pd
import io
from pathlib import Path
import base64
import time
import requests
from openai import OpenAI
import os, glob
# Note: this rebinds the name `OpenAI`, shadowing the client class imported from `openai` above
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains import SimpleSequentialChain
import patoolib

In [None]:
from google.colab import files
uploaded = files.upload()

Saving GIRedoUpdate.csv to GIRedoUpdate.csv


In [None]:
# Dataset is now stored in a Pandas Dataframe
testMedicalDF = pd.read_csv(io.BytesIO(uploaded['2022_Test_Blank.csv']))

In [None]:
patoolib.extract_archive("2022_ACG_Files.zip",outdir="/content")

INFO patool: Extracting 2022_ACG_Files.zip ...
INFO patool: running /usr/bin/7z x -o/content -- 2022_ACG_Files.zip
INFO patool:     with input=''
INFO patool: ... 2022_ACG_Files.zip extracted to `/content'.


'/content'
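
The log above shows `patool` delegating to the external `7z` binary. Where that binary is unavailable, the standard-library `zipfile` module can do the same job for `.zip` archives. This is a sketch of that swapped-in approach, not what the notebook itself runs; `extract_images` and `IMAGE_SUFFIXES` are hypothetical names.

```python
import zipfile
from pathlib import Path

# Assumed image types, based on the .jpg and .png names printed later in the notebook.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def extract_images(zip_path, out_dir):
    """Extract `zip_path` into `out_dir` and return the sorted image file names found."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    # Walk the extracted tree and keep only files with an image suffix.
    return sorted(
        p.name
        for p in out.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Listing the extracted names right after extraction makes it easier to spot a mismatch between the archive contents and the file names referenced in the CSV.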

In [None]:
# Ensure you add your API key to the Colab secrets under the name 'openai_api_key'
from google.colab import userdata
openai_api_key = userdata.get('openai_api_key')

In [None]:
image_folder_path="/content/2022_ACG_Files/"

In [None]:
base_prompt= """
Please answer the question

Format your response as follows:

Multiple choice answer to the question:

Justification:
"""
sys_prompt= """
You are an expert in gastroenterology, hepatology, and interventional gastroenterology with extensive experience in endoscopic procedures, radiographic image interpretation (MRI, CT, Ultrasound), and esophageal manometry studies. You have a deep understanding of both normal and abnormal findings in these areas. Additionally, you are well-versed in the medical guidelines from leading organizations such as the ACG, AGA, AASLD, and ASGE, and you're familiar with gastroenterology and hepatology board exam review preparation content.

Given a gastroenterology and hepatology board exam question, which may include associated images, apply the following approach:

Read the Question: Understand the clinical scenario and what it asks you to identify or solve.
Analyze Images (if any): Describe observed findings and relate them to the clinical question.
Evaluate Answer Choices: Use your expertise, understanding of guidelines, and test-taking skills to assess each option.
Select the Best Answer: Choose the option that best fits the clinical scenario based on evidence and guidelines.
Please format your response as below:

Multiple Choice Answer: Provide the letter or option you've chosen.
Justification: Briefly explain why this answer is the most appropriate, including any relevant clinical guidelines, findings from images, or key points from the question stem that guided your decision
"""
reviewer_prompt="""
Expertise and Role Description: You possess specialized knowledge in gastroenterology and hepatology with skills in endoscopic procedure image interpretation, interpreting radiographic images (MRI, CT, Ultrasound), and manometry. You are very familiar with clinical guidelines from medical societies such as the ACG, AGA, AASLD, and ASGE. Additionally, you have experience with gastroenterology and hepatology board exam preparatory content. Your primary role is to critically assess a response to a practice exam question, incorporating image analysis if images are provided and detailed examination of the provided answers.
Review Process:
Step 1: Thoroughly review the board exam practice question. Understand the clinical scenario and the specific inquiry posed.
Step 2: Examine the provided answer to the question along with the rationale. Consider the details of the question from Step 1.
Step 3: Review all the given answer choices with a focus on detail, linking back to your understanding from Steps 1 and 2.
Step 4: Based on your comprehensive review from Steps 1-3, determine the accuracy of the selected answer in Step 2. Use your expertise to evaluate the answer. Be detail oriented. Take your time. Be systematic. If the initial answer is correct, confirm it with your endorsement. If incorrect, identify the right answer, providing a thorough justification based on clinical guidelines, radiographic findings, and key aspects of the question.
Response Formatting:
Multiple Choice Answer: Indicate the best answer (either confirm the existing answer or provide a corrected option based on your reasoning from step 4).
Justification: Elaborate on why the answer you chose is the most appropriate. Include references to relevant clinical guidelines, radiographic interpretations, or critical details from the question stem that influenced your decision.
Additional Guidelines:
Ensure meticulous evaluation, grounding your justification in solid evidence and contemporary medical standards.
Consider any provided images as part of the response. If images are pertinent but overlooked in the provided answer, reassess their impact on the question and possibly revise your answer choice.
Analyze the logic used in the provided answer for accuracy and consistency with the question, answer choices, and any associated images. Address any incorrect assumptions or errors found in the thought process. If such inaccuracies might influence the correct answer, adjust your response accordingly.
Verify that the explanations and justifications in the chosen answer are accurate and align with the information from the question stem and any images provided. If you find inconsistencies or inaccuracies, assess whether these could alter the correct answer choice.
"""

## RAG Set up

In [None]:
# Package management
!pip install langchain --upgrade
!pip install pypdf
!pip install llama-index
!pip install chromadb

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0


In [None]:
#Some packages may not download initially.  If they do not, uninstall and reinstall the package.
!pip install openai --upgrade
!pip install chromadb --upgrade



## RAG Splitting

In [None]:
# Additional imports. Some langchain packages can fail to import at times; if they do, restart the runtime or reinstall/upgrade the affected packages.
# PDF loaders: if UnstructuredPDFLoader gives you a hard time, try PyPDFLoader
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
# Chroma and OpenAIEmbeddings are used below to build the vector store
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

#load_dotenv()

In [None]:
loader = TextLoader(file_path="/content/combined_text_file.txt")

In [None]:
data = loader.load()

In [None]:
# Note: If you switched to using PyPDFLoader then it will split by page for you already
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your sample document')
print(f'Here is a sample: {data[0].page_content[:200]}')

You have 1 document(s) in your data
There are 26396102 characters in your sample document
Here is a sample: PRACTICE GUIDANCE
A multidisciplinary approach to the diagnosis and
management of Wilson disease: 2022 Practice Guidance onWilson disease from the American Association for the Studyof Liver Diseases
M


In [None]:
# Split the data into chunks of around 500 characters each with a 50-character overlap.
# The chunks are deliberately small so that several relevant chunks can be passed to the model at once.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(data)
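
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses: that splitter prefers to break on separators such as paragraph and line boundaries, so its chunks are often shorter and end in different places. `chunk_with_overlap` is a hypothetical helper for illustration only.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults, consecutive chunks share their last and first 50 characters, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.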

In [None]:
# Sense check on number of documents
print (f'Now you have {len(texts)} documents')

Now you have 61350 documents


In [None]:
openai_api_key= userdata.get('openai_api_key')

In [None]:
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)



In [None]:
# load it into Chroma
vectorstore = Chroma.from_documents(texts, embeddings)

In [None]:
query = "What should I do for GERD?"
docs = vectorstore.similarity_search(query)
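
Conceptually, a similarity search embeds the query and returns the stored chunks whose embeddings are closest to it. The sketch below illustrates that idea with cosine similarity over toy vectors. It is an assumption-laden illustration: Chroma's actual index and distance metric may differ, and `cosine_similarity` and `top_k` are hypothetical helpers. The default `k=4` is an assumption about how many results `similarity_search` returns by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

In the real pipeline the vectors come from `OpenAIEmbeddings`, and the returned indices correspond to the text chunks whose `page_content` is printed in the checks that follow.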

The cells below run a few checks to confirm that RAG retrieval is working correctly.

In [None]:
# Print the content of each document returned by the search
for doc in docs:
    print(f"{doc.page_content}\n")

For patients presenting with GERD symptoms, a
stepwise diagnostic approach will identify mechanismsdriving symptoms for a precision managementapproach. Patients should receive education on GERDpathophysiology and lifestyle modi cations, and be
involved in a shared decision-making model. A 4- to 8-
week trial of single-dose PPI is considered safe and
appropriate for patients with typical re ux symptoms

We suggest elevating head of bed for nighttime GERD symptoms. Low ConditionalWe recommend treatment with PPIs over treatment with H2RA for healing EE. High StrongWe recommend treatment with PPIs over H2RA for maintenance of healing for EE. Moderate StrongWe recommend PPI administration 30 60 min before a meal rather than at bedtime for GERD symptom
control.Moderate Strong
For patients with GERD who do not have EE or Barrett s esophagus, and whose symptoms have resolved

meal rather than at bedtime for GERD symptom control (strongrecommendation, moderate level of evidence).
9. For patient

In [None]:
testQ = """
A 37-year-old man with human immunodeficiency virus and a history of Pneumocystis jirovecii pneumonia, presented to the hospital with diarrhea and hypotension. Over the past month, he reports increasingly watery diarrhea, upwards of 8-10 times per day, with nocturnal symptoms and episodes of fecal incontinence. He also reports diffuse abdominal pain and cramping, night sweats, lethargy, and new cough. He has no headache, arthralgias, or rashes. He reports that he has not been compliant with his anti-retroviral therapy for the past year.

On arrival to the hospital, his vital signs are notable for a temperature of 102.7°F, heart rate of 107 beats per minute, and blood pressure 92/61 mm Hg. On examination, he is pale, diaphoretic, but oriented to person, place, and time. His cardiac examination is notable for tachycardia with a soft systolic murmur. His pulmonary examination reveals diffuse rhonchi in the upper lobes bilaterally. His abdomen is soft, but diffusely tender. Laboratory studies in the ED are notable for sodium 124 mEq/L (normal: 136-145 mEq/L), potassium 2.7 mEq/L (normal: 3.5-5.0 mEq/L), white blood cell count 4,500/µL (normal: 4,000-10,000/µL), albumin 1.9 g/dL (normal: 3.5-5.5 g/dL), AST and ALT were normal, but his CD4 count was 9/µL. Chest radiograph showed upper lobe infiltrates bilaterally. An upper endoscopy and colonoscopy are performed for evaluation of diarrhea with the finding in the duodenum shown in the figure. The pathology from the duodenal biopsies demonstrated numerous foamy macrophages filling the lamina propria with intracellular periodic acid-Schiff (PAS)-positive organisms. What is the most likely etiology of the patient’s symptoms?


A
Mycobacterium avium intracellulare complex

B
Cryptosporidium parvum

C
Tropheryma whipplei

D
Histoplasma capsulatum
"""
docs = vectorstore.similarity_search(testQ)

In [None]:
# Print the content of each document returned by the search
for doc in docs:
    print(f"{doc.page_content}\n")

In a study of more than 15,000 hospitalized HIV patients in1998, 2.8% were admitted for a diarrheal diagnosis.
66Data
on the endoscopic evaluation of patients with HIV are
mostly from studies that preceded the use of highly active
antiretroviral therapy.67Although CMV is the most common
pathogen detected in these patients, histopathologic evalu-ation may identify other pathogens, such as adenovirus andenteropathogenic bacteria.
68-70Furthermore, a pathogen

Pathol 1996;106:544-8.
66. Anastasi JK, Capili B. HIV and diarrhea in the era of HAART: 1998 New
York State hospitalizations. Am J Infect Control 2000;28:262-6.
67. Orenstein JM, Dieterich DT. The histopathology of 103 consecutive co-
lonoscopy biopsies from 82 symptomatic patients with acquired im-
munodeficiency syndrome. Arch Pathol Lab Med 2001;125:1042-6.
68. Bini EJ. Endoscopic approach to HIV associated diarrhea: how far is far
enough? Am J Gastroenterol 1999;94:556-9.

74. Mo nkemu ller KE, Wilcox CM. Investigation of diarrh

In [None]:
docs_content = "\n\n".join(doc.page_content for doc in docs)

In [None]:
docs_content

'In a study of more than 15,000 hospitalized HIV patients in1998, 2.8% were admitted for a diarrheal diagnosis.\n66Data\non the endoscopic evaluation of patients with HIV are\nmostly from studies that preceded the use of highly active\nantiretroviral therapy.67Although CMV is the most common\npathogen detected in these patients, histopathologic evalu-ation may identify other pathogens, such as adenovirus andenteropathogenic bacteria.\n68-70Furthermore, a pathogen\n\nPathol 1996;106:544-8.\n66. Anastasi JK, Capili B. HIV and diarrhea in the era of HAART: 1998 New\nYork State hospitalizations. Am J Infect Control 2000;28:262-6.\n67. Orenstein JM, Dieterich DT. The histopathology of 103 consecutive co-\nlonoscopy biopsies from 82 symptomatic patients with acquired im-\nmunodeficiency syndrome. Arch Pathol Lab Med 2001;125:1042-6.\n68. Bini EJ. Endoscopic approach to HIV associated diarrhea: how far is far\nenough? Am J Gastroenterol 1999;94:556-9.\n\n74. Mo nkemu ller KE, Wilcox CM. Inves

In [None]:
docs_content[:50]

'In a study of more than 15,000 hospitalized HIV pa'

## OpenAI Model

In [None]:
# Searches for files in the specified folder path that match the given filename without an extension,
# appending any extension available. Returns the first match found with its complete filename and extension.
# Parameters:
# folder_path (str): The directory path where the file will be searched.
# filename_without_extension (str): The base name of the file without any extension.
# Returns:
# str: The complete filename with extension of the first matching file, or None if no match is found.
def find_file_extension(folder_path, filename_without_extension):
    # Create a search pattern
    search_pattern = os.path.join(folder_path, filename_without_extension + ".*")

    # Use glob to find matching files
    matching_files = glob.glob(search_pattern)

    if not matching_files:
        return None  # No matching file found

    # Assuming you want the first matching file
    first_matching_file = matching_files[0]

    # Extract the complete file name
    complete_file_name = os.path.basename(first_matching_file)

    return complete_file_name


# Retrieves documents similar to the provided question using a vector store's similarity search, concatenating the page contents of each resulting document into a single string.
# Parameters:
# question (str): The query question for retrieving relevant documents.
# Returns:
# str: Concatenated page contents of all documents similar to the question.
def getRagDocs(question):
  docs = vectorstore.similarity_search(question)
  docs_content = "\n\n".join(doc.page_content for doc in docs)
  return docs_content


# Encodes an image from the given path to a base64 string, allowing for image data to be easily transmitted or stored in text format.
# Parameters:
# image_path (str): The file path of the image to encode.
# Returns:
# str: The base64-encoded representation of the image.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# Encodes multiple images from a list of filenames within a specified folder into their base64 string representations.
# Parameters:
# file_list (list): A list of filenames to be converted.
# folder_path (str): The path to the folder containing the images.
# Returns:
# list: A list of base64-encoded strings of the images.
def Build64(file_list, folder_path):
    base64_images = []
    # print(file_list)
    for file_name in file_list:
        file_path = os.path.join(folder_path, file_name)
        if os.path.exists(file_path):
          # print(encode_image(file_path))
          base64_images.append(encode_image(file_path))
    return base64_images

# Sends a JSON payload to the OpenAI API. Any non-200 status other than 400 is retried every 25 seconds until the request succeeds; a 400 (bad request) stops the loop.
# Parameters:
# payload (dict): The JSON payload to be sent to the API.
# Returns:
# requests.Response on success, or a dict with an error message and status code if the API returns 400.
def ChatRequest(payload):
    api_key = openai_api_key
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    while True:
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
        print(f"Request Status Code: {response.status_code}")
        if response.status_code == 400:
            # If the status code is 400, print the error and break the loop
            print("Bad request. Please check your payload for errors.")
            break  # Break the loop on a 400 status code
        elif response.status_code != 200:
            # Assume rate is limited or other errors and wait 25 seconds.
            default_wait = 25
            print(f"Failed request, status code: {response.status_code}, waiting {default_wait} seconds.")
            time.sleep(default_wait)
        else:
            # If the request is successful, break the loop and return the response
            break
    if response.status_code == 400:
        return {"error": "Bad request", "status_code": 400}  # Return an error message and status code
    else:
        return response  # Return the successful response



# Builds a JSON payload containing the system prompt, the question, and any images, sends it through ChatRequest, and extracts the model's reply.
# Parameters:
# sys_prompt (str): System prompt sent as the first message.
# user_prompt (str): User prompt; accepted but not currently added to the payload.
# initialAnswer (str): Initial answer under review; accepted but not currently added to the payload.
# question (str): Main question to be addressed in the interaction.
# model (str): Identifier for the AI model to use.
# batch (list): List of base64-encoded images to include in the payload.
# Returns:
# dict: The model's response text ("response") and the completion tokens used ("tokens").
def vchatreviewer(sys_prompt, user_prompt, initialAnswer, question, model, batch):
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": sys_prompt}]
        },
        {
            "role": "user",
            "content": []  # Initial empty content, will populate below
        }
    ]

    # Question for the model, given after the examples if any
    question_message = {
        "type": "text",
        "text": f"\n Here is the question you will answer:\n{question}\n\n\n"
    }
    messages[1]["content"].append(question_message)


    # Add an image_url message for each base64-encoded image in the batch.
    # Note: the data URL always declares image/jpeg, even when the source file is a PNG.
    for base64_image in batch:
        image_message = {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
        }
        messages[1]["content"].append(image_message)

    # Construct the JSON payload
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": 4096  # Modify as needed, 4096 can get expensive with little value
    }

    # Send the payload; ChatRequest returns an error dict (not a Response) on a 400
    response = ChatRequest(payload)
    if isinstance(response, dict):
        return response  # Pass the error through so the caller can check for 'error'
    # Curate the response object
    response_data = response.json()
    assistant_response = response_data['choices'][0]['message']['content']
    completion_tokens_used = response_data['usage']['completion_tokens']

    res_dict = {"response": assistant_response, "tokens": completion_tokens_used}
    return res_dict
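
The retry loop in `ChatRequest` above waits 25 seconds and retries without limit, so a persistent failure such as an invalid API key would keep it running forever. One possible alternative is a bounded retry with exponential backoff. The sketch below is an illustration under assumptions, not the notebook's method: `post_with_retries` is a hypothetical helper, and `send` stands for any zero-argument callable that returns an object with a `status_code` attribute (for example a lambda wrapping `requests.post`).

```python
import time


def post_with_retries(send, max_attempts=5, base_wait=1.0, sleep=time.sleep):
    """Call `send()` until it returns status 200 or 400, or attempts run out.

    Returns the last response received (None if max_attempts is 0).
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code in (200, 400):
            return response  # success, or a bad request that retrying cannot fix
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base wait between attempts
            sleep(base_wait * (2 ** attempt))
    return response  # last failed response; the caller decides what to do
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. The caller still has to check the final status code, because the function returns the last failed response rather than raising.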

If you are including 5-shot examples, run the two cells below. They require `ACG_self_assessment_examples.csv` and `Example_Images.zip`, both described under "Files Needed" above.

In [None]:
exampleShots = pd.read_csv(io.BytesIO(uploaded['ACG_self_assessment_examples.csv']))
patoolib.extract_archive("Example_Images.zip",outdir="/content")
bool_example=True
example_image_folder_path="/content/Example_Images"

In [None]:
example_list = []
# If examples are enabled (bool_example), build example_list from the examples CSV.
if bool_example:
  for index, row in exampleShots.iterrows():
    if pd.notnull(row['question']):
      question_num = row.iloc[0]
      print(question_num)
      # Initialize lists to hold image file names and the base64 images
      batch_file_names = []
      batch = []

      if not any(row):  # Check if the entire row is empty
        break  # Stop reading when an empty row is encountered
      first_7_cells = row[:7]  # Get the first 7 cells of the row
      # Loop through the last 5 cells in first_7_cells and stop at the first empty one
      for cell_value in first_7_cells[-5:]:
          if cell_value is None or cell_value == "" or pd.isna(cell_value):
              break  # Exit the inner loop if an empty cell is found
          else:
              # Look up the full file name in case the file extension is not provided
              image_file_name = find_file_extension(example_image_folder_path, cell_value)
              batch_file_names.append(str(image_file_name))

      batch = Build64(batch_file_names, example_image_folder_path)

      # Extract the question and sample answer from the CSV row
      question = row['question']
      answer = row['Sample answer']
      example_list.append([question, batch, answer])

2023C4
2023P2
2023C5
2023st15
2023l9


In [None]:
# Loop through each row and fill in the answers.
# This is not the most elegant approach, but with a small n (300) it works well.

for index, row in testMedicalDF.iterrows():
    if pd.notnull(row['Question']):
      question_num = row.iloc[0]
      print(question_num)
      # Initialize lists to hold image file names and the base64 images
      batch_file_names = []
      batch = []

      if not any(row):  # Check if the entire row is empty
          break  # Stop reading when an empty row is encountered
      first_7_cells = row[:7]  # Get the first 7 cells of the row
      # Loop through the last 5 cells in first_7_cells and stop at the first empty one
      for cell_value in first_7_cells[-5:]:
          if cell_value is None or cell_value == "" or pd.isna(cell_value):
              break  # Exit the inner loop if an empty cell is found
          else:
              image_file_name=cell_value
              # Use the line below if the file extension is not provided
              # image_file_name = find_file_extension(image_folder_path, cell_value)
              batch_file_names.append(str(image_file_name))

      print(batch_file_names)

      batch = Build64(batch_file_names, image_folder_path)

      # Extract the question from the CSV row
      question = row['Question']

      # initialAnswer = row['Initial Answer']
      initialAnswer = "none"

      # Call the API to answer the question, with or without images.
      # base_prompt (defined above) is passed as the user prompt.
      results = vchatreviewer(reviewer_prompt, base_prompt, initialAnswer, question, "gpt-4-turbo", batch)

      if 'error' in results and results['status_code'] == 400:
        print("Encountered a bad request error. Stopping further processing for this row.")
        continue  # Skip the rest of the current iteration and move to the next row

      answer = str(results["response"])  # Extracts the answer from results if no error
      testMedicalDF.at[index, 'Correct answer'] = answer

3sb
['3sb1.jpg', '3sb2.jpg', '3sb3.jpg']
Request Status Code: 200
5m
['5m1.jpg', '5m2.jpg']
Request Status Code: 200
1IBD
['1IBD1.jpg']
Request Status Code: 200
7es
['6es1.png']
Request Status Code: 200
10st
[]
Request Status Code: 200
11m
['11m1.png']
Request Status Code: 200
17IBD
['16IBD1.jpg', '16IBD2.png']
Request Status Code: 200
29c
['29c1.jpg', '29c2.jpg']
Request Status Code: 200
3sb
['3sb1.jpg', '3sb2.jpg', '3sb3.jpg']
Request Status Code: 200
5m
['5m1.jpg', '5m2.jpg']
Request Status Code: 200
1IBD
['1IBD1.jpg']
Request Status Code: 200
7es
['6es1.png']
Request Status Code: 200
10st
[]
Request Status Code: 200
11m
['11m1.png']
Request Status Code: 200
17IBD
['16IBD1.jpg', '16IBD2.png']
Request Status Code: 200
29c
['29c1.jpg', '29c2.jpg']
Request Status Code: 200
3sb
['3sb1.jpg', '3sb2.jpg', '3sb3.jpg']
Request Status Code: 200
5m
['5m1.jpg', '5m2.jpg']
Request Status Code: 200
1IBD
['1IBD1.jpg']
Request Status Code: 200
7es
['6es1.png']
Request Status Code: 200
10st
[]
Reque
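
The prompts ask the model to begin its reply with a line such as "Multiple Choice Answer:", so the chosen letter can be pulled out of the free-text response for scoring. The following is a hedged sketch of one way to do that. `extract_choice` and `ANSWER_PATTERN` are hypothetical, the notebook itself stores the full response text, and the letter range A to E is an assumption (the sample question shown earlier uses A to D).

```python
import re

# Matches e.g. "Multiple Choice Answer: C", "**Multiple Choice Answer:** A."
# and "Multiple choice answer to the question: B". Only the label is
# case-insensitive; the captured letter must be an uppercase A-E.
ANSWER_PATTERN = re.compile(
    r"(?i:multiple[\s-]*choice\s+answer)[^:\n]*:\s*\**\s*\(?([A-E])\b"
)


def extract_choice(response_text):
    """Return the answer letter found in a model response, or None if absent."""
    match = ANSWER_PATTERN.search(response_text or "")
    return match.group(1) if match else None
```

Responses that do not follow the requested format return `None`, so they can be flagged for manual review rather than silently scored.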

In [None]:
testMedicalDF.to_csv('20240508ModelRun.csv', index=False)
files.download('20240508ModelRun.csv')


## Google Gemini

### Set up

In [None]:
!pip install -q -U google-generativeai
!pip install langchain
!pip install patool
!pip install requests

In [None]:
import pandas as pd
import io
from pathlib import Path
import base64
import time
import os, glob  # needed by os.getenv, os.path, and glob.glob below
from google.colab import files
import patoolib
import PIL.Image
import pathlib
import textwrap

import google.generativeai as genai

from IPython.display import display
from IPython.display import Markdown

# Used to securely store your API key
from google.colab import userdata

In [None]:
image_folder_path="/content/2022_ACG_Files"

In [None]:
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
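# Use the environment variable if it is set; otherwise read the key from Colab secrets via `userdata` (imported above)
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY') or userdata.get('GOOGLE_API_KEY')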
genai.configure(api_key=GOOGLE_API_KEY)

In [None]:
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.0-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro-vision-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-001
models/gemini-1.5-flash-latest
models/gemini-1.5-pro
models/gemini-1.5-pro-001
models/gemini-1.5-pro-latest
models/gemini-pro
models/gemini-pro-vision


In [None]:
# Gemini model names change regularly; check the list printed above before choosing
model = genai.GenerativeModel('gemini-pro-vision')
textmodel = genai.GenerativeModel('gemini-pro')
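
Because the available model names change over time, a hard-coded name can stop working. One option is to pick the first available name from an ordered preference list. This is a minimal sketch: `pick_model` is a hypothetical helper, and the names in the usage note are taken from the listing printed above, which may be out of date by the time you run it.

```python
def pick_model(available, preferred):
    """Return the first name in `preferred` that appears in `available`, else None."""
    available_set = set(available)
    for name in preferred:
        if name in available_set:
            return name
    return None
```

For example, `available` could be built from the names printed by the `genai.list_models()` loop above, with `preferred` set to something like `["models/gemini-1.5-pro", "models/gemini-pro-vision"]`. A `None` result means none of the preferred models is offered.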

### Running the model

In [None]:
###
# Searches for files in the specified folder path that match the given filename without an extension, appending any extension available. Returns the first match found with its complete filename and extension.
# Parameters:
# folder_path (str): The directory path where the file will be searched.
# filename_without_extension (str): The base name of the file without any extension.
# Returns:
# str: The complete filename with extension of the first matching file, or None if no match is found.
###
def find_file_extension(folder_path, filename_without_extension):
    # Create a search pattern
    search_pattern = os.path.join(folder_path, filename_without_extension + ".*")

    # Use glob to find matching files
    matching_files = glob.glob(search_pattern)

    if not matching_files:
        return None  # No matching file found

    # Assuming you want the first matching file
    first_matching_file = matching_files[0]

    # Extract the complete file name
    complete_file_name = os.path.basename(first_matching_file)

    return complete_file_name

###
# Creates a list of PIL image objects from a list of filenames within a specified folder. Each image is opened and added to the list as a PIL image object.
# Parameters:
# file_list (list): A list of filenames to be opened as images.
# folder_path (str): The path to the folder containing the image files.
# Returns:
# list: A list of PIL Image objects. Each object in the list represents an image file that was opened from the specified folder.
###
def BuildImageListGoogle(file_list, folder_path):
    pil_images = []
    for file_name in file_list:
        file_path = os.path.join(folder_path, file_name)
        pil_images.append(PIL.Image.open(file_path))
    return pil_images

###
# Retrieves documents similar to the provided question using a vector store's similarity search, concatenating the page contents of each resulting document into a single string.
# Parameters:
# question (str): The query question for retrieving relevant documents.
# Returns:
# str: Concatenated page contents of all documents similar to the question.
###
def getRagDocs(question):
    # `vectorstore` is a global and must be defined before this function is called
    docs = vectorstore.similarity_search(question)
    docs_content = "\n\n".join(doc.page_content for doc in docs)
    return docs_content
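
As a rough illustration of what `getRagDocs` does, the sketch below runs the same join logic against a stand-in vector store. `FakeDoc`, `FakeVectorStore`, and `get_rag_docs` are hypothetical names invented for this example, and the document texts are made-up placeholders. The real `vectorstore` is only assumed to expose `similarity_search(query)` returning objects with a `page_content` attribute, which is all the function above relies on. Passing the store in as an argument instead of reading a global is a choice made for the sketch, not something the notebook does.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


class FakeVectorStore:
    """Hypothetical stand-in: ranks by naive word overlap, not by embeddings."""

    def __init__(self, texts):
        self.docs = [FakeDoc(t) for t in texts]

    def similarity_search(self, query, k=2):
        words = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: len(words & set(d.page_content.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def get_rag_docs(question, store):
    """Join the page contents of the retrieved documents with blank lines."""
    docs = store.similarity_search(question)
    return "\n\n".join(doc.page_content for doc in docs)


# Placeholder texts, not real guideline content
store = FakeVectorStore([
    "Guideline text about celiac disease diagnosis",
    "Guideline text about colon cancer screening",
    "Unrelated note on scheduling",
])
context = get_rag_docs("How is celiac disease diagnosed", store)
print(context)
```

The returned string is what the answer loop appends to the prompt right after the "additional documentation" introduction.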

In [None]:
# Loop through each row and fill in the answers
for index, row in testMedicalDF.iterrows():
    if pd.notnull(row['Question']):
        question_num = row.iloc[0]  # The first column holds the question identifier
        print(question_num)
        # Initialize lists to hold image file names and the base64 images
        batch_file_names = []
        batch = []

        # Extract the question text from the DataFrame row
        question = row['Question']

        # Prompt parts for the image-capable model
        contents = [
            sys_prompt
        ]
        # Text-only prompt parts, used if the image model fails
        text_contents = [
            sys_prompt
        ]

        ragDocs = getRagDocs(question)
        contents.append("\n\nBelow is some additional documentation that may or may not be useful.  Try to use the guidelines if you can but do not forget the question asked. \n\n")
        contents.append(ragDocs)

        text_contents.append("\n\nBelow is some additional documentation that may or may not be useful.  Try to use the guidelines if you can but do not forget the question asked. \n\n")
        text_contents.append(ragDocs)

        if not any(row):  # Check if the entire row is empty
            break  # Stop reading when an empty row is encountered
        first_7_cells = row.iloc[:7]  # Get the first 7 cells of the row
        # Loop through the last 5 cells in first_7_cells (treated as image file names)
        for cell_value in first_7_cells.iloc[-5:]:
            if cell_value is None or cell_value == "" or pd.isna(cell_value):
                break  # Exit the inner loop if an empty cell is found
            else:
                # image_file_name = find_file_extension(image_folder_path, cell_value)
                image_file_name = cell_value
                print(image_file_name)
                file_path = os.path.join(image_folder_path, image_file_name)
                # Attach the image only if the file itself exists, not just the folder
                if os.path.isfile(file_path):
                    contents.append(PIL.Image.open(file_path))
                    print("complete")

        contents.append("\n\nAnd now here is the question that you will be answering.  Please answer this question based on your knowledge and expertise \n\n")
        contents.append(question)

        text_contents.append("\n\nAnd now here is the question that you will be answering.  Please answer this question based on your knowledge and expertise \n\n")
        text_contents.append(question)

        # The Gemini models can error out at times, especially the image model.
        # Try the image model first, fall back to the text model, and record an error if both fail.
        try:
            results = model.generate_content(contents)
            answer = results.text  # Extracts the answer from results
        except Exception:
            try:
                results = textmodel.generate_content(text_contents)
                answer = results.text  # Extracts the answer from results
                print("using text model")
            except Exception:
                answer = "Error based on input"
                print("waiting 10 seconds")
                time.sleep(10)
                print("Error Caught")

        testMedicalDF.at[index, 'Correct answer'] = answer

3sb
3sb1.jpg
complete
3sb2.jpg
complete
3sb3.jpg
complete
5m
5m1.jpg
complete
5m2.jpg
complete
1IBD
1IBD1.jpg
complete
using text model
7es
6es1.png
complete
10st




using text model
11m
11m1.png
complete
17IBD
16IBD1.jpg
complete
16IBD2.png
complete
29c
29c1.jpg
complete
29c2.jpg
complete


In [None]:
testMedicalDF.to_csv('20240506GeminiModelRun.csv', index=False)
files.download('20240506GeminiModelRun.csv')
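
The image-model-then-text-model fallback used in the answer loop can be factored into a small helper. The sketch below is one possible way to do that, not the notebook's own code: `generate_with_fallback` and the two stand-in functions are hypothetical names, and in the notebook the real calls would be `model.generate_content(contents).text` and `textmodel.generate_content(text_contents).text`.

```python
import time


def generate_with_fallback(attempts, error_answer="Error based on input", wait_seconds=0):
    """Try each (label, callable) pair in order and return (answer, label).

    If every attempt raises, optionally wait, then return (error_answer, None).
    """
    for label, call in attempts:
        try:
            return call(), label
        except Exception as exc:
            print(f"{label} failed: {exc}")
    if wait_seconds:
        time.sleep(wait_seconds)
    return error_answer, None


# Hypothetical stand-ins for the two model calls
def failing_image_model():
    raise RuntimeError("image model rejected the input")


def working_text_model():
    return "B"


answer, used = generate_with_fallback([
    ("image model", failing_image_model),
    ("text model", working_text_model),
])
print(answer, used)
```

Returning the label alongside the answer makes it possible to record which model produced each row, which the current loop only reports through a `print`.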