## Break up a BIG PDF
This notebook splits one BIG PDF, made up of many separate PDFs bundled together, into its individual embedded PDFs.
No magic here. It just looks for the words "Page 1 of " (e.g. "Page 1 of 3" or "Page 1 of 6"), which are typical in business documents. 
Depending on your use-case, you may need to look for different wording.
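
As a rough illustration of adapting the check to different wording, the sketch below keeps a list of regular expressions and returns the page count from the first one that matches. The `first_page_total` helper and the "Pg. 1/3" pattern are hypothetical examples, not part of this notebook's functions.

```python
import re

# Hypothetical patterns; adjust them to the wording used in your own documents.
FIRST_PAGE_PATTERNS = [
    re.compile(r"\bPage\s+1\s+of\s+(\d+)", re.IGNORECASE),  # "Page 1 of 3"
    re.compile(r"\bPg\.?\s*1\s*/\s*(\d+)", re.IGNORECASE),  # "Pg. 1/3"
]


def first_page_total(text):
    """Return the total page count if text marks page 1 of a document, else None."""
    for pattern in FIRST_PAGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return None


print(first_page_total("Lab report Page 1 of 3"))
print(first_page_total("Page 2 of 3"))
```

Requiring whitespace right after the `1` means a footer such as "Page 11 of 12" is not mistaken for a first page.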

It first determines the start and end pages of each embedded document and writes that info to an array (a Python list). It then goes back and reads the array,
writing the individual PDFs out to new files based on the start and end pages in the array.  

The original BIG PDF is left intact.
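
The first pass can be pictured with a small, self-contained sketch. It assumes each page has already been reduced to either the total from its "Page 1 of N" marker or `None`; the `find_page_ranges` name and that input shape are illustrative assumptions, not the notebook's actual data structures.

```python
def find_page_ranges(first_page_totals):
    """Turn per-page 'Page 1 of N' totals into (start_page, end_page) tuples.

    first_page_totals[i] is N when page i+1 reads "Page 1 of N", otherwise None.
    Pages without a marker that are not inside a document count as 1-page documents.
    """
    ranges = []
    remaining = 0  # pages still to skip inside the current embedded document
    for page_number, total in enumerate(first_page_totals, start=1):
        if remaining > 0:
            remaining -= 1
            continue
        length = total if total else 1
        ranges.append((page_number, page_number + length - 1))
        remaining = length - 1
    return ranges


# A 6-page bundle: a 3-page document, a 1-page document, and a 2-page document.
print(find_page_ranges([3, None, None, None, 2, None]))
# → [(1, 3), (4, 4), (5, 6)]
```

The second pass then only needs these ranges to copy the matching pages into new files.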

It uses the Azure Document Intelligence Service (specifically the "prebuilt-layout" model) to do the OCR.
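
The service endpoint and key are read from the `AZURE_FORM_RECOGNIZER_ENDPOINT` and `AZURE_FORM_RECOGNIZER_KEY` environment variables (normally set in a `.env` file). A minimal sketch of a pre-flight check is shown here; the `missing_settings` helper is hypothetical and is not called anywhere in this notebook.

```python
import os

# Variable names taken from this notebook's get_Local_content function.
REQUIRED_SETTINGS = ("AZURE_FORM_RECOGNIZER_ENDPOINT", "AZURE_FORM_RECOGNIZER_KEY")


def missing_settings(env=None):
    """Return the required setting names that are absent or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Passing a dict makes the check easy to try without touching the real environment.
print(missing_settings({"AZURE_FORM_RECOGNIZER_ENDPOINT": "https://example.invalid/"}))
# → ['AZURE_FORM_RECOGNIZER_KEY']
```

Calling it with no argument after `load_dotenv()` would report problems before any API call is attempted.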


#### Install Libraries
if needed


In [None]:
# %pip install PyPDF2
# %pip install python-dateutil  # provides dateutil.parser, which the is_date function imports
# %pip install python-dotenv
# %pip install azure-core
# %pip install azure-ai-formrecognizer

## Create Functions 
#### Below are the functions used to break the one big PDF into its many embedded PDFs

In [19]:
# ************************************************************************* #
# Description:
#   This function first reads the environment vars into local vars
#   It then uses the prebuilt layout model to extract the text from a local 
#   PDF file.
# Input: 
#   a local PDF file's path and name
# Output: 
#   The entire JSON results from the API call are returned
# ************************************************************************* #
#
def get_Local_content(my_file):
    import os
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential
    from dotenv import load_dotenv
    #
    # Load the Environment vars from the .env file
    load_dotenv()
    #
    # The endpoint and key values should be in the .env file
    # Retrieving those values here
    endpoint = os.getenv('AZURE_FORM_RECOGNIZER_ENDPOINT', '')
    key =  os.getenv('AZURE_FORM_RECOGNIZER_KEY', '')
    #
    # expecting input files in a folder called Input_PDFs
    full_path="Input_PDFs" + "/" + my_file  
    #
    # Creates the Document Intelligence client
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    #
    # Opens the file and makes the API call
    with open(full_path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-layout", document=f
        )
    #
    # retrieve the result (this waits until the analysis has finished)
    result = poller.result()
    #
    # The entire results from the prebuilt layout model are returned
    return result


# **********************************************************************
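# Description:
#   This function walks the OCR pages, finds where each embedded document
#   starts and ends, and determines its type and visit date
#   In this example the code below also handles the cover page
# Input:
#   the list of pages from the layout model results (result.pages)
# Output:
#   a list with one entry per embedded document:
#   [content, start page, end page, document type, visit date]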
# **********************************************************************
#
def get_embedded_pages(pages):
    # 'skip' is a flag that indicates if we are in the middle of an embedded document
    # if so we will skip the pages until we get to the end of the embedded document
    # before looking for another embedded document's start and end pages
    skip=0
    #
    # 'embedded_document_page_numbers' is an array of arrays
    # It will contain an array for each embedded document found
    embedded_document_page_numbers=[]
    #
    # 'this_pages_content' contains the text from just the current page of the current embedded document
    this_pages_content=""
    #
    # We need to know the start and end pages of the current embedded document
    this_documents_start_page=0
    this_documents_end_page=0
    #
    # 'This_documents_content' contains the text from just the one embedded document
    This_documents_content=""
    #
    this_documents_type=""
    print("There are {} pages in the Big PDF... Checking for embedded documents".format(len(pages)))
    for page in pages:
        # if skip is 0 then we are not in the middle of an embedded document but looking for first page of one
        if skip==0:
            # we are at a new page 1, lets initialize our document vars
            this_pages_content=""
            this_documents_start_page=0
            this_documents_end_page=0
            this_documents_type=""
            this_documents_visit_date=""
            This_documents_content=""
            #
            # determine if the text "Page 1 of " is present in the page
            # we will call the "is_page_1" function to do this
            # we will get back the total number of pages in the embedded document
            total_pages =is_page_1(page)
            if total_pages>-1:
                # the start page of this embedded document is page.page_number
                # the last page is page.page_number + total_pages - 1
                # add that information to the array
                this_documents_start_page=page.page_number
                this_documents_end_page = page.page_number + total_pages-1
                # uncomment next line to help debug
                # print("starting on page ", page.page_number, "embedded doc is ",  total_pages, " pages, and ends on page ", page.page_number+total_pages-1)
                # skip to the last page of the embedded document
                skip=total_pages-1
            # process page 1
            # initialize the page content variable
            this_pages_content=""
            mypage=page.to_dict()
            lines=mypage["lines"]
            # loop through the lines and load them into the page content variable
            for line in lines:
                this_pages_content += line["content"] + " "  
            # add page 1's content to overall content for the embedded document
            This_documents_content += this_pages_content
            # 
            # The code below is used to determine the type of embedded document
            # This code is ** use case specific **
            # The sample PDF used here can contain Lab Results or Doctor reports
            # or possibly other types of documents
            # The code below looks for specific words in the text to determine the type
            # Your use case will be different
            #
            # START of the document type determination code
            this_documents_type=""
            if this_pages_content.lower().find("paraneds")>-1:
                this_documents_type="Cover Sheet"
            elif this_pages_content.lower().find(" laboratory ")>-1 or this_pages_content.lower().find(" lab ")>-1 or this_pages_content.lower().find(" laboratories ")>-1:
                this_documents_type="Test results"
            elif this_pages_content.lower().find(" xray ")>-1 or this_pages_content.lower().find(" imaging center ")>-1:
                this_documents_type="Imaging results"
            elif this_pages_content.lower().find(" visit date: ")>-1 or this_pages_content.lower().find("hpi:")>-1:
                this_documents_type="Office visit"
            elif this_pages_content.lower().find(" lab")>-1:
                this_documents_type="Test results"
            else:
                this_documents_type="Unknown"
            #
            # END of the document type determination code
            #
            # The following code is also ** use case specific **
            # No matter the type of embedded document, the document's chronological order is important
            # If these PDFs end up in a RAG index the date of the visit will most
            # likely become metadata and be part of the index, so I look for it here
            #
            # Doctor reports usually have the date of the visit on the first page
            #
            # START of the determine date of the visit code
            this_documents_visit_date=""
            date_start=this_pages_content.lower().find("date")
            if date_start>-1:
                # split the text that follows the word "date" into tokens
                split_date=this_pages_content.lower()[date_start+4:].split(" ")
                # the date is usually token 1 (e.g. "date: 01/02/2024"), otherwise try token 2
                # slicing avoids an IndexError when fewer tokens are present
                # strings don't have an isdate function, so we use our own is_date
                for candidate in split_date[1:3]:
                    if is_date(candidate):
                        this_documents_visit_date=candidate
                        break
            #
            # END of the determine date of the visit code
            #
            # ** use case specific **
            # This use case has a cover page which I do not write out as a separate PDF
            # for everything else
            if ((skip==0) and (this_documents_type != "Cover Sheet")):
                # done with all pages in the embedded document - write an element in the array
                # containing the info about the embedded document
                # Then we will continue looping, looking for more embedded documents (the next "Page 1 of")
                embedded_document_page_numbers.append([ This_documents_content.strip(), this_documents_start_page, this_documents_end_page, this_documents_type, this_documents_visit_date]) 
        else:
            # skip must be > 0 which means we are in the middle of an embedded document
            skip-=1
            this_pages_content=""
            page_content=""
            mypage=page.to_dict()
            lines=mypage["lines"]
            for line in lines:
                this_pages_content += line["content"] + " "  
            # add another page to overall content for doc
            This_documents_content += this_pages_content
            #
            # lab test results can have their date in the middle pages 
            #
            # START of the determine date of the visit code
            if skip==0:
                # try the generic "date" label first, then the lab-specific "date collected:" label
                for label in ("date", "date collected:"):
                    if this_documents_visit_date!="":
                        break
                    date_start=This_documents_content.lower().find(label)
                    if date_start>-1:
                        split_date=This_documents_content.lower()[date_start+len(label):].split(" ")
                        # the date is usually token 1, otherwise try token 2
                        # slicing avoids an IndexError when fewer tokens are present
                        for candidate in split_date[1:3]:
                            if is_date(candidate):
                                this_documents_visit_date=candidate
                                break
                # END of the determine date of the visit code
                #
                # done with all pages in the embedded document - write an element in the array
                # containing the info about the embedded document
                # Then we will continue looping, looking for more embedded documents (the next "Page 1 of")
                embedded_document_page_numbers.append([ This_documents_content.strip(), this_documents_start_page, this_documents_end_page, this_documents_type, this_documents_visit_date]) 
    return embedded_document_page_numbers



# **********************************************************************
# Description
#   a function that reads the array of start and end pages of embedded 
#   documents and saves each embedded document as a separate PDF
# Input:
#   an array of start and end pages of embedded documents
#   the file name of the big PDF (expected in the Input_PDFs folder)
# Output:
#   an int representing the number of embedded documents written to disk
#   It writes the embedded documents to disk during execution
# **********************************************************************
#
def save_embedded_documents(embedded_document_page_numbers, my_file):
    from PyPDF2 import PdfReader, PdfWriter
    #
    # initialize Vars
    files_written = 0
    #
    # expecting input files in a folder called Input_PDFs
    full_path="Input_PDFs" + "/" + my_file  
    #
    # open the big PDF file
    pdf_file = open(full_path, 'rb')
    #
    # read the big PDF file into an array of pages
    pdf_reader = PdfReader (pdf_file)
    #
    # loop through the array of embedded documents
    i=0
    for t,s,e,ty,d in embedded_document_page_numbers:
        # we will create a new PDF file for the embedded document
        # we first will get the start and end pages of the embedded document
        start_page = int(s)-1
        end_page =int(e)
        i=i+1
        # and open an output file to write to
        output=PdfWriter()
        #
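        # now we loop through the pages of the embedded document
        # (end_page is capped in case "Page 1 of N" overstates the pages that remain)
        for p in range(start_page,min(end_page,len(pdf_reader.pages))):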
            # get the page from the original PDF file
            # and write it to the output file    
            output.add_page(pdf_reader.pages[p])
        #
        # we have written all the pages of this embedded document
        # lets save the embedded document as a separate PDF file
        # into the output directory
        # i was already incremented above, so the first file is embedded_document_1.pdf
        file_name='embedded_document_{}.pdf'.format(i)
        full_path='Output_PDFs' + '/' + file_name
        with open(full_path, 'wb') as output_pdf:
            output.write(output_pdf)
            files_written+=1
    #
    # all the embedded PDFs have been written
    # close the original big PDF file
    pdf_file.close()
    return files_written



# **********************************************************************    
# Description
#   a function that takes a JSON page as a parameter
#   and determines if the phrase "1 of N" (as in "Page 1 of 3") is present
#   anywhere on the page
# Input:
#   a JSON page that should be the starting page of an embedded document
# Output:
#   an int representing the total number of pages in the embedded document 
#   which starts with the page passed in (1 if no page count is found)
# **********************************************************************
#
def is_page_1(page):
    import re
    #
    # look for a 1, then a space, then "of", then any number
    # the 1 must not be part of a larger number, so "11 of 12" is ignored
    # The word "page" does not need to be in the line, just the "1 of #"
    pattern = re.compile(r"(?<!\d)1 of\s+(\d+)")
    #
    # loop through lines on the page
    for line in page.lines:
        match = pattern.search(line.content)
        if match:
            # a page count was found; never return less than 1
            return max(1, int(match.group(1)))
    # If it never found a match
    # the assumption is that the page is a single page document
    return 1


# ************************************************************************ #    
# Description        
#   a function that takes a page as a parameter
#   and gets the first date mentioned on the page
#   after a "date:" label such as "visit date:"
# Input:
#   a JSON page
# Output:
#   a string containing the visit_date or an empty string if there is none
# ************************************************************************* #
#
def get_visit_date(page):
    # loop through lines on the page
    for line in page.lines:
        start_index = line.content.lower().find(" date:")
        if start_index > -1:
            # look at the first two tokens after " date:" and return the first
            # one that parses as a date (str has no isdate method, so is_date is used)
            tokens = line.content[start_index+6:].split()
            for token in tokens[:2]:
                if is_date(token):
                    return token
            return ""
    return ""


# **********************************************************************    
# Description        
#   a generic utility function that tests a string to determine if it is a date
# Input:
#   a string that you want to test for a date
# Output:
#   a boolean. True if the string is a date, False if it is not
# **********************************************************************
#
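def is_date(string, fuzzy=False):
    from dateutil.parser import parse
    try: 
        parse(string, fuzzy=fuzzy)
        return True
    except (ValueError, OverflowError):
        # parse raises ValueError for non-dates and OverflowError for out-of-range numbers
        return False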



## Main Program

In [8]:
# **********************************************************************    
# Description        
#   this is the main program that calls all the functions
# Input:
#   put your PDF file in the Input_PDFs folder
# Output:
#   The embedded documents will be written out to the Output_PDFs folder
# **********************************************************************
#initialize vars
embedded_docs=0
#
# specify the Big PDF, Should be in the Input_PDFs folder, just the file name is needed
my_file=r"Sample big document.pdf"
#
# Start processing
result= get_Local_content(my_file)
# call get_embedded_pages to get a list of start and end pages in the main document
embedded_document_page_numbers =[]
embedded_document_page_numbers=get_embedded_pages(result.pages)
#
# pass in the array of start and end pages of embedded documents and the file path of the PDF
# This is where the individual PDFs get written
embedded_docs=save_embedded_documents(embedded_document_page_numbers, my_file)
# All done!
print("There were {} embedded documents written to the output folder (./Output_PDFs)".format(embedded_docs))


There are 126 pages in the Big PDF... Checking for embedded documents
There were 40 embedded documents written to the output folder (./Output_PDFs)
