# Own your health app!

This is a simple PoC: I give the model a medical test document and then ask it
for structured responses.

For our real app, we will have tens or hundreds of such documents per individual,
and each document could run from tens to hundreds of pages. This is because someone's diagnostic
journey (e.g. cancer) is spread across tests and visits to hundreds of institutions before they
get referred to an oncologist at a large cancer center, and this oncology team has to make sense of all you have
endured {treatment, outcomes, discharge summaries} to give you the right next treatment when
you arrive at their doorstep.


For now, success means that for a single report I get back all the
diagnostic tests performed, without missing any. The model tends to miss a few and, on repeated nudging
(since I know the answer), pulls out more and more of the tests it missed previously.
As a bonus, what I would really like is to ask a question with a schema
and have it return *all* the elements in that same schema.
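
Since the answer for the test report is known, the "without missing any" criterion can be checked mechanically instead of by repeated nudging. The sketch below is one possible way to do that; the answer key and the extracted rows are hypothetical illustration data, and the only name taken from this notebook is the `analyte_measured` field of the extraction schema.

```python
def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so 'Sodium ' matches 'sodium'."""
    return " ".join(name.lower().split())


def missing_analytes(expected_names, extracted_rows):
    """Return the expected analyte names that do not appear in the extracted rows."""
    found = {normalise(row.get("analyte_measured", "")) for row in extracted_rows}
    return [name for name in expected_names if normalise(name) not in found]


# Hypothetical hand-labelled answer key and model output, for illustration only.
expected = ["Hemoglobin", "Glucose", "Sodium"]
extracted = [
    {"analyte_measured": "hemoglobin", "result": "13.5"},
    {"analyte_measured": "Sodium ", "result": "140"},
]

missed = missing_analytes(expected, extracted)
recall = 1 - len(missed) / len(expected)
print(missed, round(recall, 2))
```

Tracking the list of missed names per run makes it easier to see whether a prompt or model change actually recovers more tests.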


In [None]:
from langchain.document_loaders import UnstructuredPDFLoader

# Right now this does not handle a directory; it takes one file at a time. Could this be a limitation later?

## First approach is doing this with one big gulp, no splitting and using function calls to structure

In [None]:
## Core algorithm which parses the PDF file and structures the output.
# The plus of using pydantic models is you can force what you want the return object to look like;
# pydantic then does the validation to make sure!
# In the future we can make this configurable!

#TODO: Add a list field to the pydantic object whose items are themselves pydantic objects?


from pydantic import BaseModel, Field
from langchain.chains.openai_functions.extraction import _get_extraction_function
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

doc = UnstructuredPDFLoader(file_path="/Users/vinayak/projects/medical_records_parser/data/MinnieMouseReport.pdf")
# Above I read the whole thing as ONE large blob, this was possible since the file is only 7 pages!
# if the file becomes too large, this is not possible.
docs = doc.load()


query = """
You are an expert medical transcriber. Please give me back a table of all the analytes measured. The table should have following columns: analyte_measured, result, reference_interval, unit, notes. 
If a particular column does not exist please say NA.
Please double check your work and do not miss any analyte.
"""


class Analyte(BaseModel):
    """Information about an analyte."""
    analyte_measured: str
    result: str
    reference_interval: str
    unit: str
    notes: str

openai_function = _get_extraction_function(Analyte.schema())

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

model = ChatOpenAI(model="gpt-3.5-turbo-16k")  # this approach works as long as the document is small.

prompt = ChatPromptTemplate.from_messages([
    ("system", query),
    ("user", "{doc}")
])


output_parser = JsonKeyOutputFunctionsParser(key_name="info")
chain = prompt | model.bind(functions=[openai_function], function_call={"name": "information_extraction"}) | output_parser

response = chain.invoke({"doc": docs[0].page_content})

In [None]:
import pandas as pd
pd.DataFrame(response) # Amit/Guy can you please see if the output is correct?

## Second approach is doing the same thing as above but one page at a time since sometimes the file might be too large to fit context size of the model.
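
Whether a document needs the page-at-a-time route can be estimated before calling the model. The sketch below uses a rough rule of thumb of about four characters per token, which is only an approximation and not a real tokenizer; the 16,000-token window and the 2,000-token output reserve are assumed round numbers, and both function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, context_tokens: int = 16_000, reserve_for_output: int = 2_000) -> bool:
    """True if the estimated prompt size leaves room for the model's answer."""
    return estimate_tokens(text) <= context_tokens - reserve_for_output


# Synthetic stand-ins for a short report and a very large EMR dump.
small_report = "Hemoglobin 13.5 g/dL\n" * 100
large_dump = "Encounter note text. " * 10_000

print(fits_in_context(small_report))
print(fits_in_context(large_dump))
```

If the check fails, loading in `mode="paged"` or splitting the text first is the safer path.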

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
doc = UnstructuredPDFLoader(file_path="/Users/vinayak/projects/medical_records_parser/data/MinnieMouseReport.pdf", mode="paged")
# Above is the key difference where it is loading the data as "paged" mode.

docs = doc.load()

query = """
Please give me back a table of all the analytes measured. The table should have following columns: analyte_measured, result, reference_interval, unit, notes. 
If a particular column does not exist please say NA.
Please double check your work and do not miss any analyte.
"""

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

model = ChatOpenAI(model="gpt-3.5-turbo-16k")

prompt = ChatPromptTemplate.from_messages([
    ("system", query),
    ("user", "{doc}")
])

# This is because of what _get_extraction_function does
output_parser = JsonKeyOutputFunctionsParser(key_name="info")

chain = prompt | model.bind(functions=[openai_function], function_call={"name": "information_extraction"}) | output_parser


responses = chain.batch([{"doc": d.page_content} for d in docs], {"max_concurrency": 5})

extracted_by_function_call = []
for response in responses:
    extracted_by_function_call.extend(response)
    


In [None]:
pd.DataFrame(extracted_by_function_call)

## Now for the third approach: this is a VERY large document of 130+ pages, including a mishmash of different kinds of reports, since it has come from an EMR dump (likely EPIC)

In [None]:
doc = UnstructuredPDFLoader(file_path="/Users/vinayak/projects/kaiser/data/Barbara/UCLA Health.pdf")
docs = doc.load()

In [None]:
#Showing the entire contents of the document
docs[0].page_content

In [None]:
# We need some kind of splitter which closely resembles where records start and end.
# I am using the default one, which is suboptimal and produces a lot of repeats and not-so-useful summarization too!

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(        
    separator = "\n\n",
    chunk_size = 10000,
    chunk_overlap  = 500,
    length_function = len,
    is_separator_regex = False,
)

doc = UnstructuredPDFLoader("/Users/vinayak/projects/kaiser/data/Barbara/UCLA Health.pdf")

docs = doc.load_and_split(text_splitter)

print("Number of splits %d"%(len(docs)))
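
A splitter that follows record boundaries could look roughly like the sketch below. It assumes every record begins with a header line such as `Encounter Date: ...`; that marker is purely hypothetical and would need to be replaced by whatever header the real export uses. The function name and sample text are also made up for illustration.

```python
import re

# Hypothetical boundary marker: assumes each record starts with a line such as
# "Encounter Date: 01/02/2020". Replace with the header the real export uses.
RECORD_START = re.compile(r"^Encounter Date:", flags=re.MULTILINE)


def split_on_record_starts(text: str):
    """Split text into chunks that each begin at a record header line."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    if not starts:
        return [text]
    # Keep any preamble that appears before the first header as its own chunk.
    if starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends) if text[s:e].strip()]


sample = (
    "Patient: Jane Doe\n"
    "Encounter Date: 01/02/2020\nOffice visit, routine follow-up.\n"
    "Encounter Date: 03/04/2020\nCBC panel drawn.\n"
)
chunks = split_on_record_starts(sample)
print(len(chunks))
```

Chunks produced this way do not overlap, so the same visit should not be sent to the model twice, though a single very long record could still exceed the context window.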


### Notice I also changed the question from parsing a diagnostic report to parsing medical visits. Likely we will have to do a hybrid where we first split the very large document into different pieces and for each piece parse what is relevant (diagnostic report vs visits vs pathology report)

In [None]:
query = """
You are an expert medical transcriber and can transcribe electronic health records with great skill.
Please give me back a table of all the visits from the patient. Columns to return are:
patient_name, date_of_visit, category, provider, institution, brief_summary
The category can only be one of the following values: LAB_REPORT, PATHOLOGY, RADIOLOGY, PROCEDURE, DIAGNOSTIC_TEST, ROUTINE_VISIT
If a particular column does not exist please say NA.
Please double check your work and do not miss any visits.
"""


from pydantic import BaseModel, Field
from langchain.chains.openai_functions.extraction import _get_extraction_function

class Visit(BaseModel):
    """Information about visit to the medical facility."""
    patient_name: str
    date_of_visit: str
    category: str
    provider: str
    institution: str
    brief_summary: str

openai_function = _get_extraction_function(Visit.schema())

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

model = ChatOpenAI(model="gpt-3.5-turbo-16k")

prompt = ChatPromptTemplate.from_messages([
    ("system", query),
    ("user", "{doc}")
])

chain = prompt | model.bind(functions=[openai_function], function_call={"name": "information_extraction"}) | output_parser

# Make subset of docs below (8) so I don't become bankrupt! with openAI bills

responses = chain.batch([{"doc": d.page_content} for d in docs], {"max_concurrency": 5})


## The above code asks a question per subset of the data (according to the split, which is 10k characters). This means it will have
## answers per split. The response object is a list of responses, each response having a list of dicts

In [None]:
# we need to flatten the list

flattened_list = list()
for d in responses:
    flattened_list.extend(d)

flattened_list
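
Because neighbouring chunks overlap, the same visit can come back from two chunks. One possible way to flatten and drop such repeats in a single pass is sketched below. The helper name, the choice of key fields, and the example rows are assumptions for illustration, not part of the pipeline above.

```python
def flatten_and_dedupe(responses, key_fields):
    """Flatten per-chunk results and keep the first row seen for each key."""
    seen = set()
    unique_rows = []
    for chunk_rows in responses:
        for row in chunk_rows or []:  # tolerate None or empty chunk results
            key = tuple(str(row.get(f, "")).strip().lower() for f in key_fields)
            if key not in seen:
                seen.add(key)
                unique_rows.append(row)
    return unique_rows


# Hypothetical per-chunk output where the overlap made one visit appear twice.
responses_example = [
    [{"date_of_visit": "2020-01-02", "category": "LAB_REPORT", "provider": "Dr. A"}],
    [
        {"date_of_visit": "2020-01-02", "category": "lab_report", "provider": "Dr. A"},
        {"date_of_visit": "2020-03-04", "category": "RADIOLOGY", "provider": "Dr. B"},
    ],
]
rows = flatten_and_dedupe(responses_example, ["date_of_visit", "category", "provider"])
print(len(rows))
```

Exact-match keys will not catch repeats where the model worded the date or provider differently, so this is only a first filter.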

In [None]:
import pandas as pd
df = pd.DataFrame.from_records(flattened_list)
df
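
The prompt restricts `category` to six values, but nothing enforces that on the way back. A minimal sketch of a check is shown below; the enum and helper names are hypothetical, and the example rows are invented.

```python
from enum import Enum


class VisitCategory(str, Enum):
    """The six category values allowed by the prompt."""
    LAB_REPORT = "LAB_REPORT"
    PATHOLOGY = "PATHOLOGY"
    RADIOLOGY = "RADIOLOGY"
    PROCEDURE = "PROCEDURE"
    DIAGNOSTIC_TEST = "DIAGNOSTIC_TEST"
    ROUTINE_VISIT = "ROUTINE_VISIT"


def split_by_valid_category(rows):
    """Separate rows whose category is one of the allowed values from the rest."""
    allowed = {c.value for c in VisitCategory}
    valid, invalid = [], []
    for row in rows:
        category = str(row.get("category", "")).strip().upper()
        (valid if category in allowed else invalid).append(row)
    return valid, invalid


# Hypothetical rows for illustration.
example_rows = [
    {"category": "RADIOLOGY", "brief_summary": "Chest CT"},
    {"category": "NA", "brief_summary": "Cover page"},
    {"category": "Office Visit", "brief_summary": "Follow-up"},
]
valid, invalid = split_by_valid_category(example_rows)
print(len(valid), len(invalid))
```

Keeping the rejected rows around, rather than silently dropping them, shows which categories the model invents most often.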


In [None]:
#Doing cleanup to remove junk, this is because our parser is not yet good enough.

df1 = df[(df.patient_name != 'NA')]
df1 = df1[(df1.category != 'NA')].copy()  # copy so the column assignments below do not act on a view of df

df1['date_cleanedup']= pd.to_datetime(df1['date_of_visit'], format='mixed')
df1['final_date'] = df1['date_cleanedup'].apply(lambda x: x.strftime('%B %d, %Y'))

df1.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)

df1.sort_values(by='date_cleanedup').to_csv('/Users/vinayak/projects/df_to_test.tsv', sep="\t", index=False)
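
The cleanup above filters on `patient_name` and `category` only, so a `date_of_visit` that is `NA` or otherwise unparseable may still reach the date conversion. A standard-library sketch of a more forgiving date step is given below. The list of candidate formats is an assumption about what the model returns, and both helper names are hypothetical.

```python
from datetime import datetime

# Formats to try, in order. This list is an assumption about the model's output.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%b %d, %Y"]


def parse_visit_date(raw):
    """Return a datetime for a recognised date string, or None for 'NA' and junk."""
    text = str(raw).strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def display_date(raw):
    """Format a parsed date like the final_date column, or return 'NA'."""
    parsed = parse_visit_date(raw)
    return parsed.strftime("%B %d, %Y") if parsed else "NA"


print(display_date("2021-03-05"))
print(display_date("03/05/2021"))
print(display_date("NA"))
```

Ambiguous numeric dates are read as month first here, matching US-style records; that choice would need revisiting for other sources.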




In [None]:
df1.rename(columns={'final_date': 'title','institution': 'cardTitle', 'category': 'cardSubtitle', 'brief_summary': 'cardDetailedText' }, inplace=True)
cols_needed = ['title', 'cardTitle', 'cardSubtitle', 'cardDetailedText']
df1[cols_needed].to_dict('records')

## Now trying with GPT4 instead

In [None]:
query = """
You are an expert medical transcriber and can transcribe electronic health records with great skill.
Please give me back a table of all the visits from the patient. Columns to return are:
patient_name, date_of_visit, category, provider, institution, brief_summary
The category can only be one of the following values: LAB_REPORT, PATHOLOGY, RADIOLOGY, PROCEDURE, DIAGNOSTIC_TEST, ROUTINE_VISIT
If a particular column does not exist please say NA.
Please double check your work and do not miss any visits.
"""


from pydantic import BaseModel, Field
from langchain.chains.openai_functions.extraction import _get_extraction_function

class Visit(BaseModel):
    """Information about visit to the medical facility."""
    patient_name: str
    date_of_visit: str
    category: str
    provider: str
    institution: str
    brief_summary: str

openai_function = _get_extraction_function(Visit.schema())

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

model = ChatOpenAI(model="gpt-4")

prompt = ChatPromptTemplate.from_messages([
    ("system", query),
    ("user", "{doc}")
])

chain = prompt | model.bind(functions=[openai_function], function_call={"name": "information_extraction"}) | output_parser

# Make subset of docs below (8) so I don't become bankrupt! with openAI bills

responses = chain.batch([{"doc": d.page_content} for d in docs], {"max_concurrency": 5})


## The above code asks a question per subset of the data (according to the split, which is 10k characters). This means it will have
## answers per split. The response object is a list of responses, each response having a list of dicts

In [None]:
responses

In [None]:
# we need to flatten the list

flattened_list = list()
for d in responses:
    flattened_list.extend(d)

import pandas as pd
df = pd.DataFrame.from_records(flattened_list)
df


In [None]:
df1 = df[(df.patient_name != 'NA')]
df1 = df1[(df1.category != 'NA')].copy()  # copy so the column assignments below do not act on a view of df

df1['date_cleanedup']= pd.to_datetime(df1['date_of_visit'], format='mixed')
df1['final_date'] = df1['date_cleanedup'].apply(lambda x: x.strftime('%B %d, %Y'))

df1.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)

df1.sort_values(by='date_cleanedup').to_csv('/Users/vinayak/projects/df_to_test_gpt4.tsv', sep="\t", index=False)


## This is with Anthropic Claude 2

In [None]:
query = """
You are an expert medical transcriber and can transcribe electronic health records with great skill.
Please give me back a table of all the visits from the patient. Columns to return are:
patient_name, date_of_visit, category, provider, institution, brief_summary
The category can only be one of the following values: LAB_REPORT, PATHOLOGY, RADIOLOGY, PROCEDURE, DIAGNOSTIC_TEST, ROUTINE_VISIT
If a particular column does not exist please say NA.
Please double check your work and do not miss any visits.
"""


from pydantic import BaseModel, Field
from langchain.chains.openai_functions.extraction import _get_extraction_function

class Visit(BaseModel):
    """Information about visit to the medical facility."""
    patient_name: str
    date_of_visit: str
    category: str
    provider: str
    institution: str
    brief_summary: str

openai_function = _get_extraction_function(Visit.schema())

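# AnthropicFunctions comes from the separate langchain_experimental package (assumed to be installed).
from langchain_experimental.llms.anthropic_functions import AnthropicFunctions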
from langchain.prompts import ChatPromptTemplate

model = AnthropicFunctions(model='claude-2')

prompt = ChatPromptTemplate.from_messages([
    ("system", query),
    ("user", "{doc}")
])

chain = prompt | model.bind(functions=[openai_function], function_call={"name": "information_extraction"}) | output_parser

# Make subset of docs below (8) so I don't become bankrupt! with openAI bills

responses = chain.batch([{"doc": d.page_content} for d in docs], {"max_concurrency": 5})


## The above code asks a question per subset of the data (according to the split, which is 10k characters). This means it will have
## answers per split. The response object is a list of responses, each response having a list of dicts

# Scratch space

### The stuff below is working but chunking is per page

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
doc = UnstructuredPDFLoader(file_path="/Users/vinayak/projects/kaiser/data/Barbara/UCLA Health.pdf", mode="paged")

docs = doc.load()

query = """
Please give me back a table of all the visits from the patient. Columns to return are:
visit_date, visit_reason, visit_department, visit_summary
If a particular column does not exist please say NA.
Please double check your work and do not miss any visits.
"""

class Visit(BaseModel):
    """Information about visit to the medical facility."""
    visit_date: str
    visit_reason: str
    visit_department: str
    visit_summary: str



openai_function = _get_extraction_function(Visit.schema())

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

model = ChatOpenAI(model="gpt-3.5-turbo-16k")

prompt = ChatPromptTemplate.from_messages([
    ("system", query),
    ("user", "{doc}")
])

# This is because of what _get_extraction_function does
output_parser = JsonKeyOutputFunctionsParser(key_name="info")

chain = prompt | model.bind(functions=[openai_function], function_call={"name": "information_extraction"}) | output_parser


responses = chain.batch([{"doc": d.page_content} for d in docs], {"max_concurrency": 5})

In [None]:
import sys
from PyPDF2 import PdfReader, PdfWriter

def edit_pdf_text(input_pdf_path, output_pdf_path, old_text, new_text):
    # Read the existing PDF and keep the file open until the output is written
    with open(input_pdf_path, "rb") as file_handle:
        pdf = PdfReader(file_handle)
        content = pdf.pages[0].extract_text()

        # Replace the old text with the new text.
        # NOTE: this only changes the extracted string, not the page itself.
        content = content.replace(old_text, new_text)

        # Write the (unmodified) first page to a new PDF
        pdf_writer = PdfWriter()
        pdf_writer.add_page(pdf.pages[0])
        with open(output_pdf_path, "wb") as output_pdf:
            pdf_writer.write(output_pdf)

# input_pdf_path and output_pdf_path must be defined before this call
edit_pdf_text(input_pdf_path, output_pdf_path, "Minnie", "Vinayak")

In [None]:
!pip install PyPDF2

In [None]:
!pip install PyPDF2 pdfplumber

In [None]:
import PyPDF2
import pdfplumber

def replace_text_in_pdf(input_pdf_path, output_pdf_path, text_to_find, replacement_text):
    with pdfplumber.open(input_pdf_path) as pdf:
        pages = pdf.pages
        #print(pages)
        for i, page in enumerate(pages):
            text = page.extract_text()
            #print(text)
            replaced_text = text.replace("Mouse", "Vinayak")
            pages[i] = replaced_text

    with open(output_pdf_path, 'wb') as output_pdf:
        pdf_writer = PyPDF2.PdfWriter()
        for page in pages:
            print(page)
            pdf_writer.add_page(page)
        pdf_writer.write('~/Desktop/Vinayak.pdf')
        
replace_text_in_pdf(input_pdf_path, output_pdf_path, 'Minnie', 'Vinayak')

In [None]:
replace_text_in_pdf(input_pdf_path, output_pdf_path, 'Minnie', 'Vinayak')

In [None]:
import PyPDF2

def change_text(pdf_file, old_text, new_text):
  """
  This function changes the text in a PDF file.

  Args:
    pdf_file: The path to the PDF file.
    old_text: The text to be replaced.
    new_text: The new text.

  Returns:
    None.
  """

  pdf_reader = PyPDF2.PdfReader(pdf_file)
  pdf_writer = PyPDF2.PdfWriter()

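  for page in pdf_reader.pages:
    text = page.extract_text()
    text = text.replace(old_text, new_text)
    # The replaced string is not written back into the page here,
    # so the page is copied unchanged; `text` is only the extracted string.
    pdf_writer.add_page(page)

  pdf_writer.write(pdf_file)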


if __name__ == "__main__":
  pdf_file = input_pdf_path
  old_text = "This is the old text."
  new_text = "This is the new text."

  change_text(pdf_file, old_text, new_text)

  print("Text successfully changed.")


In [None]:
input_pdf_path

In [None]:
!pip install borb

In [None]:
#!chapter_007/src/snippet_013.py
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleFindReplace

import typing


def main():

    # attempt to read a PDF
    doc: typing.Optional[Document] = None
    with open(input_pdf_path, "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we actually read a PDF
    assert doc is not None

    # find/replace
    doc = SimpleFindReplace.sub("Minnie", "Vinayak", doc)

    # store
    with open(output_pdf_path, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)


if __name__ == "__main__":
    main()


In [None]:
output_pdf_path

In [None]:
import PyPDF2
import pdfplumber

def replace_text_in_pdf(input_pdf_path, output_pdf_path, text_to_find, replacement_text):
    # Lists to hold the text content and their bounding boxes
    replacements = []

    # Extract text and their bounding boxes using pdfplumber
    with pdfplumber.open(input_pdf_path) as pdf:
        for page in pdf.pages:
            for word in page.extract_words():
                if text_to_find in word['text']:
                    replaced_text = word['text'].replace(text_to_find, replacement_text)
                    bbox = (word['x0'], word['y0'], word['x1'], word['y1'])
                    replacements.append((replaced_text, bbox))

    # Open the PDF with PyPDF2
    with open(input_pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        writer = PyPDF2.PdfFileWriter()

        for page_num in range(reader.numPages):
            page = reader.getPage(page_num)

            # Overlay the replacement texts
            for text, bbox in replacements:
                x0, y0, x1, y1 = bbox
                # Adjust the coordinates as needed
                page.merge_text(text, x0, y0, size=y1-y0)

            writer.addPage(page)

        # Write the modified content to the output PDF
        with open(output_pdf_path, 'wb') as output_pdf:
            writer.write(output_pdf)

replace_text_in_pdf(input_pdf_path, '/Users/vinayak/Desktop/Vinaya.pdf', 'Mouse', 'Vinayak')


## This section is trying to map a file to a type of report


In [None]:
from langchain.document_loaders import UnstructuredPDFLoader

doc = UnstructuredPDFLoader(file_path="/Users/vinayak/projects/kaiser/data/barbara_split/ucla_1-6.pdf")

docs = doc.load()


query = """
You are an expert medical transcriber. Given a document you can clearly distinguish and categorize it.
"""

In [None]:
docs[0].page_content

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import create_tagging_chain, create_tagging_chain_pydantic

In [None]:
# Schema
schema = {
    "properties": {
        "aggressiveness": {
            "type": "integer",
            "enum": [1, 2, 3, 4, 5],
            "description": "describes how aggressive the statement is, the higher the number the more aggressive",
        },
        "language": {
            "type": "string",
            "enum": ["spanish", "english", "french", "german", "italian"],
        },
    },
    "required": ["language", "aggressiveness"],
}

# LLM
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
chain = create_tagging_chain(schema, llm)

In [None]:
from marvin import ai_classifier
from enum import Enum


@ai_classifier
class ReportClassifer(Enum):
    """You are an expert clinician and medical notes interpreter. Classify the report based on which part of the healthcare network it came from"""

    DIAGNOSTIC_REPORT = 1
    OFFICE_VISIT = 2
    SURGERY_VISIT = 3
    RADIOLOGY_REPORT = 4
    BLOOD_WORK = 5
    MEDICATION_LIST = 6

ReportClassifer(docs[0].page_content)


In [None]:
from langchain.document_loaders import UnstructuredPDFLoader

doc = UnstructuredPDFLoader(file_path="/Users/vinayak/projects/kaiser/data/barbara_split/ucla_104-107.pdf")

docs = doc.load()

ReportClassifer(docs[0].page_content)
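
To sanity-check the LLM classifier offline, a crude keyword baseline can be run on the same text. This is a swapped-in technique for comparison only, not the classifier used above. The `ReportType` enum, the keyword lists, and the sample strings are all hypothetical and would need tuning against real reports.

```python
from enum import Enum


class ReportType(Enum):
    """Hypothetical labels for the baseline; UNKNOWN means no keyword matched."""
    UNKNOWN = 0
    RADIOLOGY_REPORT = 4
    BLOOD_WORK = 5
    MEDICATION_LIST = 6


# Hypothetical keyword lists, chosen for illustration only.
KEYWORDS = {
    ReportType.RADIOLOGY_REPORT: ["impression", "mri", "radiograph", "contrast"],
    ReportType.BLOOD_WORK: ["hemoglobin", "platelet", "wbc"],
    ReportType.MEDICATION_LIST: ["tablet", "refill", "dosage"],
}


def keyword_baseline(text: str) -> ReportType:
    """Pick the label whose keywords occur most often, or UNKNOWN if none occur."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(word) for word in words)
        for label, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else ReportType.UNKNOWN


print(keyword_baseline("MRI of the brain. Impression: no acute findings."))
print(keyword_baseline("Hemoglobin 13.5, platelet count normal"))
print(keyword_baseline("Fax cover sheet"))
```

Disagreements between this baseline and the LLM label are a cheap way to pick which documents to review by hand.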

In [None]:
docs

In [None]:
##Trying summarization for the document

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.


from langchain.document_loaders.image import UnstructuredImageLoader

loader = UnstructuredImageLoader("/Users/vinayak/projects/kaiser/data/tcga_scanned_image/TCGA1.png")


In [None]:
data = loader.load()
data[0]

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

In [None]:
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])

In [None]:
query = "What is the summary of the document?"
index.query(query)

In [None]:
query = """Please give me a two part answer to my question. 
First, starting with Answer: is the answer to the question and second paragraph starting with Citation: the exact lines from the document used to give the answer
What is the summary of this oncology report?
"""
index.query(query)
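
If the model follows the `Answer:` / `Citation:` layout requested in the prompt, the reply can be split and the quoted lines checked against the source text. The sketch below assumes the reply is a plain string containing those two markers. The helper names and the sample reply and source are invented for illustration.

```python
def split_answer_and_citation(response_text: str):
    """Return (answer, citation); citation is None if the marker is missing."""
    answer_part, marker, citation_part = response_text.partition("Citation:")
    answer = answer_part.replace("Answer:", "", 1).strip()
    citation = citation_part.strip() if marker else None
    return answer, citation


def citation_in_source(citation, source_text: str) -> bool:
    """Check that the quoted text occurs in the source, ignoring whitespace layout."""
    def squash(s: str) -> str:
        return " ".join(s.split())

    return citation is not None and squash(citation) in squash(source_text)


# Hypothetical source text and model reply.
source = "Final diagnosis: invasive ductal carcinoma.\nMargins negative."
reply = (
    "Answer: The report describes invasive ductal carcinoma.\n\n"
    "Citation: Final diagnosis: invasive ductal carcinoma."
)

answer, citation = split_answer_and_citation(reply)
print(answer)
print(citation_in_source(citation, source))
```

A citation that fails this check is a signal that the model paraphrased or invented the supporting line.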

In [None]:
query = "What kind of cancer does the patient have? Please also provide the exact line from the document you used to answer the question"
index.query(query)

## TODO: Can I ask a question and get an answer back as a pydantic object?