# How to deal with complex/large Documents

**Large documents are always a challenge for search engines.**

Technical specification guides and product manuals are typical examples of such complex files: they can span **hundreds of pages and contain information in the form of images, tables, forms, and more.** Books are also complex due to their length and the presence of images or tables.

These files are typically in PDF format. To handle these PDFs better, we need a smarter parsing method that treats each document as a special source and processes it page by page. The objective is to obtain more accurate and faster answers from our system. Fortunately, an organization usually has few documents of this type, which allows us to make exceptions and treat them differently.

If **your use case is just PDFs,** you can simply parse them with the [PyPDF library](https://pypi.org/project/pypdf/) or the [Azure AI Document Intelligence SDK (formerly Form Recognizer)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.0.0), **vectorize the content using the OpenAI API, and push it to a vector-based index.** This is probably the simplest and fastest way to go.
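As a rough illustration of that simple path, the sketch below turns parsed pages into an upload payload for a vector index. It assumes the pages are available as `(page_number, offset, text)` tuples and that `embed` is any callable mapping text to a vector; `toy_embed` is a stand-in for a real embeddings call and produces no meaningful vectors. The field names (`title`, `chunk`, `chunkVector`, `page_num`) and the helper names are hypothetical and would have to match your own index schema. The outer `value` / `@search.action` shape follows the Azure AI Search document upload request.

```python
import base64
import hashlib
from typing import Callable, Iterable, List, Tuple

Page = Tuple[int, int, str]  # (page_number, character_offset, page_text)


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embeddings API call; NOT a meaningful embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:8]]


def make_key(doc_name: str, page_num: int) -> str:
    """Build a URL-safe base64 key, avoiding characters such as '.' from file names."""
    raw = f"{doc_name}:{page_num}".encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def build_upload_payload(
    doc_name: str,
    pages: Iterable[Page],
    embed: Callable[[str], List[float]],
) -> dict:
    """Create one index document per page that contains text."""
    docs = []
    for page_num, _offset, text in pages:
        if not text.strip():
            continue  # nothing was extracted for this page, so skip it
        docs.append({
            "@search.action": "upload",
            "id": make_key(doc_name, page_num),
            "title": doc_name,          # hypothetical field name
            "chunk": text,              # hypothetical field name
            "chunkVector": embed(text), # hypothetical field name
            "page_num": page_num,       # hypothetical field name
        })
    return {"value": docs}


pages = [(0, 0, "Azure AI Search overview"), (1, 24, " \n \n")]
payload = build_upload_payload("demo.pdf", pages, toy_embed)
print(len(payload["value"]))  # the blank second page is skipped
```

In a real pipeline, `toy_embed` would be replaced by a call to your embeddings deployment, and the payload would be posted to the index in batches.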

However, if your use case entails **connecting to a data lake, SharePoint libraries, or any other document data source with thousands of documents of multiple file types that can change dynamically, then you would want to use the ingestion, document cracking, and AI enrichment capabilities of the Azure Search engine.**


In [1]:
import os
import json
import time
import requests
import random
from collections import OrderedDict
import urllib.request
from tqdm import tqdm

from common.utils import (
    parse_pdf,
    read_pdf_files,
    text_to_base64,
    get_search_results,
    num_tokens_from_docs,
    num_tokens_from_string
)

from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv("credentials.env")

def printmd(string):
    display(Markdown(string))

In [2]:
BLOB_CONTAINER_NAME = "imagehugefiles"
# Find the base URL under Storage account -> Settings -> Endpoints -> Blob service
BASE_CONTAINER_URL = "https://blobstoragedjym6eiz2jhlk.blob.core.windows.net/" + BLOB_CONTAINER_NAME + "/"

LOCAL_FOLDER = "./data/imagehugefiles/"

# Create the local download folder if it does not exist yet
os.makedirs(LOCAL_FOLDER, exist_ok=True)

## 1 - Manual Document Cracking with Push to Vector-based Index

Within our demo storage account, we have a container named `imagehugefiles`, which holds two PDFs of very different lengths and complexities. Let's create a `cogsrch-index-books-vector` index and load it with the pages of these documents.

We begin by listing the files to download to our local machine:

In [3]:
books = ["images.pdf","azure-search.pdf"] # upload these documents to the storage account

Let's download the files to the local `./data/imagehugefiles/` folder:

In [4]:
for book in tqdm(books):
    # The SAS token is read from the environment and appended as the query string
    book_url = BASE_CONTAINER_URL + book + os.environ['BLOB_SAS_TOKEN']
    print(book_url)
    urllib.request.urlretrieve(book_url, LOCAL_FOLDER + book)

100%|██████████| 2/2 [00:05<00:00,  2.77s/it]


https://blobstoragedjym6eiz2jhlk.blob.core.windows.net/imagehugefiles/images.pdf?sp=r&st=2024-07-05T00:16:30Z&se=2024-07-31T08:16:30Z&spr=https&sv=2022-11-02&sr=c&sig=%2FZFyg3Y9Ug4FsaUoqS2wZDMUW9AarunZSnG07SIBLXQ%3D
https://blobstoragedjym6eiz2jhlk.blob.core.windows.net/imagehugefiles/azure-search.pdf?sp=r&st=2024-07-05T00:16:30Z&se=2024-07-31T08:16:30Z&spr=https&sv=2022-11-02&sr=c&sig=%2FZFyg3Y9Ug4FsaUoqS2wZDMUW9AarunZSnG07SIBLXQ%3D
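Note that the URLs printed above include the SAS token, which grants read access to the container until it expires. Below is a minimal sketch of one way to log the download target without the token, using only the standard library. `redact_sas` is a hypothetical helper, not part of `common.utils`, and the account name and token in the example are made up.

```python
from urllib.parse import urlsplit


def redact_sas(url: str) -> str:
    """Return the URL without its query string, which is where the SAS token lives."""
    return urlsplit(url)._replace(query="", fragment="").geturl()


# Example with a made-up account name and token
url = "https://example.blob.core.windows.net/imagehugefiles/images.pdf?sp=r&sig=abc123"
print(redact_sas(url))
```

The full URL, including the token, is still what gets passed to `urllib.request.urlretrieve`; only the logged string is shortened.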


### What to use: PyPDF or AI Document Intelligence API (Form Recognizer)?

In `utils.py` there is a **parse_pdf()** function. This utility function can parse local files using the PyPDF library, and can also parse local PDF files or PDFs referenced by URL (`from_url=True`) using Azure AI Document Intelligence (formerly Form Recognizer).

If `form_recognizer=False`, the function parses the PDF using the Python PyPDF library, which does a good job about 75% of the time.

Setting `form_recognizer=True` gives the best (but slower) parsing, using the AI Document Intelligence API (formerly known as Form Recognizer). You can specify the prebuilt model to use; the default is `model="prebuilt-document"`. However, if you have a complex document with tables, charts, and figures, you can try `model="prebuilt-layout"`, which captures all of the nuances of each page (it takes longer, of course).

**Note: Many PDFs are scanned images. For example, any signed contract that was scanned and saved as PDF will NOT be parsed by PyPDF. Only the AI Document Intelligence API will work.**
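The guidance above can be condensed into a small decision helper. This is only a sketch: `choose_parse_options` is a hypothetical function, and its two boolean inputs are assumptions about what you know of a document in advance. It returns keyword arguments using the `form_recognizer` and `model` parameter names described in this section.

```python
def choose_parse_options(is_scanned: bool, has_complex_layout: bool) -> dict:
    """Map document traits to parse_pdf() keyword arguments (heuristic sketch)."""
    if has_complex_layout:
        # Tables, charts, and figures: slowest option, most detail per page
        return {"form_recognizer": True, "model": "prebuilt-layout"}
    if is_scanned:
        # Scanned images need OCR, which PyPDF does not provide
        return {"form_recognizer": True, "model": "prebuilt-document"}
    # Plain digital text: PyPDF is the fastest choice
    return {"form_recognizer": False}


options = choose_parse_options(is_scanned=True, has_complex_layout=False)
print(options)
# Intended usage (not executed here):
# parse_pdf(file=book_path, verbose=True, **options)
```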

In [5]:
book_pages_map = dict()
for book in books:
    print("Extracting Text from",book,"...")
    
    # Capture the start time
    start_time = time.time()
    
    # Parse the PDF
    book_path = LOCAL_FOLDER+book
    book_map = parse_pdf(file=book_path, form_recognizer=False, verbose=True)
    book_pages_map[book]= book_map
    
    # Capture the end time and Calculate the elapsed time
    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"Parsing took: {elapsed_time:.6f} seconds")
    print(f"{book} contained {len(book_map)} pages\n")

Extracting Text from images.pdf ...
Extracting text using PyPDF
Parsing took: 0.052433 seconds
images.pdf contained 2 pages

Extracting Text from azure-search.pdf ...
Extracting text using PyPDF
Parsing took: 40.272521 seconds
azure-search.pdf contained 2094 pages



Now let's look at the parsed pages of `images.pdf` to make sure the parsing was done correctly:

In [6]:
book_pages_map['images.pdf']

[(0, 0, ' \n \n'), (1, 4, ' \n \n')]

Since this document only has images in it, we need a good PDF parser with good OCR capabilities in order to extract the content of this PDF. 
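One way to catch this situation automatically is to check whether a parsed page map contains any meaningful text before indexing it. The sketch below assumes the `(page_number, offset, text)` tuple format shown in the output above. `needs_ocr` and its `min_chars_per_page` threshold are hypothetical, and the threshold would need tuning on real documents.

```python
def needs_ocr(page_map, min_chars_per_page: int = 20) -> bool:
    """Return True when the average non-whitespace text per page is too low."""
    if not page_map:
        return True
    # Count characters after removing all whitespace from each page's text
    total_chars = sum(len("".join(text.split())) for _, _, text in page_map)
    return total_chars / len(page_map) < min_chars_per_page


# The PyPDF result for images.pdf shown above
images_map = [(0, 0, " \n \n"), (1, 4, " \n \n")]
print(needs_ocr(images_map))  # → True
```

A check like this could decide whether to re-run a document through an OCR-capable parser instead of indexing empty pages.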

Let's try to parse this book again, but this time using the Azure Document Intelligence API (formerly Form Recognizer):

In [7]:
%%time
book = "images.pdf"
book_path = LOCAL_FOLDER + book

book_map = parse_pdf(file=book_path, form_recognizer=True, model="prebuilt-document", from_url=False, verbose=True)
book_pages_map[book]= book_map

Extracting text using Azure Document Intelligence
CPU times: user 140 ms, sys: 11.8 ms, total: 152 ms
Wall time: 9.54 s


In [8]:
book_pages_map['images.pdf']

[(0,
  0,
  'WCS Data Landing Zone Subscription\nAsk\nSTITT\nLydia\nShared App Landing Zone Subscription\nAsk STITT\nLydia\n= Microsoft Azure P Search resources, services, and docs (G+/)\nWCL\nData Landing Zone Subscription\nWPB\nData Landing Zone Subscription\nAsk STITT\nLydia\nAsk\nSTITT\nLydia\nWest US\n<\ntext-embedding-ada-002\nPriority 1 AOAI\nAPI Management Service\n089\nGPT-4 PTU\nPriority 2 AOAI\n< ! >\nGPT-4 PAYGO\nBilling/Logging Zone\n2000)\nSwitzerland North\nPriority 3 AOAI\nEvent Hubs\nGPT-4 PTU or PAYGO\ntext-embedding -ada-002\nPower BI\nStream Analytics Jobs\nCopilot\n2 :selected:\nHome > Resource groups > ai-bootcamp > blobstoragejed5nzg3k2jp6\nblobstoragejed5nzg3k2jp6 | Shared access signature\nStorage account\nO Search\n« 8 Give feedback\nFile shares\nQueues\nTables\nSecurity + networking\nNetworking\nFront Door and CDN\nAccess keys\nShared access signature\nQ Encryption\nMicrosoft Defender for Cloud\nData management :selected: Redundancy :selected: Data protection

As demonstrated above, Azure Document Intelligence extracts the text embedded in the images, which PyPDF could not do.

**For production scenarios, we strongly recommend using Azure Document Intelligence consistently.** When doing so, choose carefully between the available models, such as `prebuilt-document`, `prebuilt-layout`, or others. You can find more information on model selection [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature?view=doc-intel-3.0.0).
