# Tutorial: Quickstart on Using the PDF Ingestion Pipeline

This notebook provides a **beginner-friendly** tutorial on **how to use** a multimodal PDF ingestion pipeline. It will guide you through the **configuration** options, how to run the pipeline on your own PDFs, and what outputs to expect.

We will **not** dive into the pipeline’s internal implementation code. Instead, we will focus on how to set it up and control it using the **configuration parameters** so you can quickly get started.


## What Does This Pipeline Do?

1. **Reads a PDF** and splits it into pages.
2. **Optionally converts** pages to JPG or PNG.
3. **Extracts text** from each page and uses a text-based LLM to process or clean that text.
4. **Identifies embedded images** or tables using a multimodal model and generates descriptive text or Markdown.
5. **Combines all** extracted data into a `DocumentContent` object.
6. **Generates** doc-level outputs, such as a condensed summary and a table of contents, if requested.
7. **Saves everything** (images, text, tables, JSON structure) into an output folder.


## Key Configuration Flags

When you create a `ProcessingPipelineConfiguration`, you can specify a **variety of flags** to tailor how the pipeline behaves:

- **pdf_path**: Required string path to the PDF file to be processed.
- **output_directory**: Optional string path to the folder where all output files and subfolders are saved.
- **process_pages_as_jpg**: If `True`, pages are rendered as JPEG (`.jpg`); if `False`, they are rendered as PNG (`.png`).
- **process_text**: If `True`, extracted text from each page is passed through a text-based LLM for cleanup and optional summarization.
- **process_images**: If `True`, the pipeline searches for embedded images (photos, graphs, etc.) and uses a multimodal model to describe them.
- **process_tables**: If `True`, the pipeline searches for table-like structures and uses a multimodal model to extract them into Markdown.
- **save_text_files**: If `True`, a combined text-based output is saved at the doc level, along with each page’s extracted text.
- **generate_condensed_text**: If `True`, the pipeline will create a condensed summary of the entire PDF.
- **generate_table_of_contents**: If `True`, the pipeline will generate a table of contents from the aggregated text.
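
One way to keep flag combinations manageable is to group them into named presets. The sketch below is only an illustration: the preset names (`full`, `text_only`) and the `build_flags` helper are hypothetical and not part of the pipeline; only the flag names are taken from the list above.

```python
# Hypothetical helper: PRESETS and build_flags are illustrative only.
# The flag names themselves are the ones documented in the list above.
ALL_FLAGS = (
    "process_pages_as_jpg",
    "process_text",
    "process_images",
    "process_tables",
    "save_text_files",
    "generate_condensed_text",
    "generate_table_of_contents",
)

PRESETS = {
    # Every feature switched on, as in the full example in this tutorial.
    "full": {name: True for name in ALL_FLAGS},
    # Text extraction only: no image or table analysis, no doc-level extras.
    "text_only": {
        "process_pages_as_jpg": True,
        "process_text": True,
        "process_images": False,
        "process_tables": False,
        "save_text_files": True,
        "generate_condensed_text": False,
        "generate_table_of_contents": False,
    },
}


def build_flags(preset: str, **overrides: bool) -> dict:
    """Return a copy of a preset's flags, with optional per-flag overrides."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset: {preset!r}")
    flags = dict(PRESETS[preset])  # copy so the preset itself is not mutated
    for name, value in overrides.items():
        if name not in ALL_FLAGS:
            raise KeyError(f"Unknown flag: {name}")
        flags[name] = bool(value)
    return flags
```

Assuming the configuration class accepts these flags as keyword arguments, as the example later in this tutorial suggests, the result could be passed along as `ProcessingPipelineConfiguration(pdf_path="my.pdf", **build_flags("text_only"))`.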

### Choosing Your Models

The configuration also expects **two** types of model info objects:

1. **Multimodal Model Info (`MulitmodalProcessingModelInfo`)**: This model can process images, which allows the pipeline to extract insights about embedded images or tables in your PDF.
   - `provider`: Either `"azure"` or `"openai"`.
   - `model_name`: Either `"gpt-4o"` or `"o1"`. The multimodal model must accept image input, so a text-only model such as `"o1-mini"` **cannot** be used here.
   - `reasoning_efforts`: `"low"`, `"medium"`, or `"high"`. Controls how much reasoning the model performs; higher settings tend to be more detailed and more resource-intensive.
   - `endpoint`, `key`, and `api_version`: Connection details for your cloud provider’s endpoint.

2. **Text Model Info (`TextProcessingModelnfo`)**: This model handles text only, so it can be set to `"gpt-4o"`, `"o1"`, or `"o1-mini"` if you want a smaller text model.
   - `provider`: Either `"azure"` or `"openai"`.
   - `model_name`: One of `"gpt-4o"`, `"o1"`, or `"o1-mini"`.
   - `endpoint`, `key`, and `api_version`: Connection details for your text-based LLM, used the same way as for the multimodal model.

### Special Note on `o1-mini`

Because `o1-mini` **cannot process images**, it is not suitable for the multimodal portion. That is why the code provides **two** separate classes: `MulitmodalProcessingModelInfo` for image-capable models and `TextProcessingModelnfo` for text-only models.
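
The model constraints described above can be checked before any API call is made. The following is a minimal sketch, not the pipeline's own validation: `check_model_choice` and the role names `"multimodal"` and `"text"` are hypothetical, while the allowed providers and model names are the ones listed in this section.

```python
# Allowed values copied from the tutorial text; the helper itself is hypothetical.
PROVIDERS = {"azure", "openai"}
MULTIMODAL_MODELS = {"gpt-4o", "o1"}          # must accept image input
TEXT_MODELS = {"gpt-4o", "o1", "o1-mini"}     # text-only use is fine here


def check_model_choice(role: str, provider: str, model_name: str) -> None:
    """Raise ValueError if the provider/model pair is not valid for the role."""
    allowed_by_role = {"multimodal": MULTIMODAL_MODELS, "text": TEXT_MODELS}
    if role not in allowed_by_role:
        raise ValueError(f"Unknown role: {role!r}")
    if provider not in PROVIDERS:
        raise ValueError(f"Unsupported provider: {provider!r}")
    if model_name not in allowed_by_role[role]:
        raise ValueError(
            f"{model_name!r} is not valid for the {role} role; "
            f"choose one of {sorted(allowed_by_role[role])}"
        )


# 'o1-mini' is accepted for text but rejected for the multimodal role.
check_model_choice("text", "azure", "o1-mini")
try:
    check_model_choice("multimodal", "azure", "o1-mini")
except ValueError as err:
    print("Rejected:", err)
```

Running a check like this before building the model info objects surfaces a wrong model choice as a clear error instead of a failed request later on.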


## Step-by-Step: Configuring and Running

Below is an example of how to **use** the pipeline, without showing its internal code. We assume you have already installed the necessary dependencies and have the relevant classes (`ProcessingPipelineConfiguration` and `PDFIngestionPipeline`) in your codebase.

We will:
1. Import the relevant classes.
2. Create a `ProcessingPipelineConfiguration`, explaining each parameter.
3. Specify the multimodal and text models.
4. Instantiate `PDFIngestionPipeline` with the config and run it on a sample PDF.
5. Observe the created outputs.


In [None]:
# Step 1: Import the classes for configuration and pipeline usage.
# Also import the model info classes for specifying your chosen LLM endpoints.
# Adjust the import paths to match your project structure.
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("../")  # Adjust the path as needed.

from configuration_models import ProcessingPipelineConfiguration
from pdf_ingestion_pipeline import PDFIngestionPipeline  
from utils.file_utils import *
from utils.openai_data_models import (
    MulitmodalProcessingModelInfo,
    TextProcessingModelnfo,
    EmbeddingModelnfo  # In case you want an embedding model as well
)  # adapt paths as needed

# Step 2: Set up your pipeline configuration.
# This example uses the sample PDF at '../sample_data/1_London_Brochure.pdf' and enables every feature.
# We'll choose 'gpt-4o' for the multimodal model and 'o1-mini' for text; 'gpt-4o' or 'o1' also work for text.

config = ProcessingPipelineConfiguration(
    pdf_path="../sample_data/1_London_Brochure.pdf",        # Required: path to your PDF.
    output_directory="my_pipeline_output",                  # Where to store the results.
    process_pages_as_jpg=True,                    # Render PDF pages as JPEG images.
    process_text=True,                            # Extract and process text.
    process_images=True,                          # Analyze images.
    process_tables=True,                          # Analyze tables.
    save_text_files=True,                         # Save aggregated text files.
    generate_condensed_text=True,                 # Produce a condensed summary.
    generate_table_of_contents=True               # Generate a table of contents.
)

# Step 3: Specify your multimodal and text models.
# This example assumes Azure OpenAI with a placeholder 'gpt-4o' multimodal endpoint
# and an 'o1-mini' model for text. Remember that 'o1-mini' cannot handle images,
# so it is used for the text model only.

config.multimodal_model = MulitmodalProcessingModelInfo(
    provider="azure",                 # or 'openai'
    model_name="gpt-4o",              # can be 'gpt-4o' or 'o1' for multimodal
    reasoning_efforts="high",         # can be 'low', 'medium', or 'high' for o1 model only
    endpoint="https://my-azure-endpoint.openai.azure.com/",
    key="MY_SUPER_SECRET_KEY",
    api_version="2024-12-01-preview"   # example API version
)

config.text_model = TextProcessingModelnfo(
    provider="azure",                 # or 'openai'
    model_name="o1-mini",             # can be 'gpt-4o', 'o1', or 'o1-mini' (text only)
    reasoning_efforts="medium",
    endpoint="https://my-other-azure-endpoint.openai.azure.com/",
    key="MY_OTHER_SECRET_KEY",
    api_version="2024-12-01-preview"   # example API version
)

# Step 4: Create the pipeline and run it.
pipeline = PDFIngestionPipeline(config)
document_content = pipeline.process_pdf()

# Step 5: Examine the 'document_content' or browse the 'my_pipeline_output' folder.
print("Pipeline run complete.")
print("Number of pages processed:", len(document_content.pages))
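
The example above hardcodes placeholder keys for readability. In practice you may prefer to read connection details from environment variables. The sketch below shows one possible approach; the helper `read_connection_settings` and the variable naming scheme (`<PREFIX>_ENDPOINT`, `<PREFIX>_KEY`, `<PREFIX>_API_VERSION`) are assumptions made for illustration, not something the pipeline defines.

```python
import os


def read_connection_settings(prefix: str, env=None) -> dict:
    """Collect endpoint, key, and api_version from <PREFIX>_* variables.

    The variable names are a hypothetical convention; adapt them to your setup.
    `env` defaults to os.environ and can be replaced by a dict for testing.
    """
    env = os.environ if env is None else env
    settings = {}
    missing = []
    for field in ("endpoint", "key", "api_version"):
        variable = f"{prefix}_{field.upper()}"
        value = env.get(variable)
        if value:
            settings[field] = value
        else:
            missing.append(variable)
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return settings
```

Because the returned keys match the `endpoint`, `key`, and `api_version` fields used in the example, the result could be unpacked into a model info object, for instance `MulitmodalProcessingModelInfo(provider="azure", model_name="gpt-4o", **read_connection_settings("MULTIMODAL"))`.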


## Understanding the Outputs

1. **`my_pipeline_output/pages/page_X/`**: One folder per page (`X` stands for the page number), containing that page’s extracted text, images, tables, and combined data.
2. **`my_pipeline_output/text_twin.md`**: All pages’ text combined into a single Markdown file.
3. **`my_pipeline_output/condensed_text.md`**: A short summary of the PDF, if you turned on `generate_condensed_text`.
4. **`my_pipeline_output/table_of_contents.md`**: A table of contents, if you enabled `generate_table_of_contents`.
5. **`my_pipeline_output/document_content.json`**: A complete JSON representation of the entire document, including all pages and assets.

At this point, you have a variety of rich outputs describing the contents of your PDF. You can stop here or use them in subsequent steps for additional analytics, indexing, or other custom pipelines.
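
To check quickly which of these outputs a run produced, a small script can walk the output folder. This is a sketch under the assumption that the folder layout matches the list above; `summarize_outputs` is a hypothetical helper, and it makes no assumption about the keys inside `document_content.json` beyond it being valid JSON.

```python
import json
from pathlib import Path

# Doc-level files named in the list above.
DOC_LEVEL_FILES = (
    "text_twin.md",
    "condensed_text.md",
    "table_of_contents.md",
    "document_content.json",
)


def summarize_outputs(output_directory: str) -> dict:
    """Report which expected outputs exist and load the JSON if present."""
    root = Path(output_directory)
    pages_dir = root / "pages"
    # Page folder names, sorted alphabetically (so page_10 precedes page_2).
    page_folders = (
        sorted(p.name for p in pages_dir.iterdir() if p.is_dir())
        if pages_dir.is_dir()
        else []
    )
    summary = {
        "page_folders": page_folders,
        "files_present": {name: (root / name).is_file() for name in DOC_LEVEL_FILES},
        "document_content": None,
    }
    json_path = root / "document_content.json"
    if json_path.is_file():
        with json_path.open(encoding="utf-8") as handle:
            summary["document_content"] = json.load(handle)
    return summary
```

Calling `summarize_outputs("my_pipeline_output")` after a run returns a dictionary you can print or inspect. A `False` entry under `files_present` for `condensed_text.md` or `table_of_contents.md` would be expected when the corresponding flag was turned off.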

## Conclusion

By following this tutorial, you can set up the pipeline with **fine-grained control** over how it processes PDFs: text extraction, image analysis, table extraction, and doc-level outputs are each switched on or off by a configuration flag. You can also swap in different multimodal and text models to balance performance and cost, then inspect the output folder and continue your workflow from there.

Good luck exploring the **multimodal** capabilities of LLMs with your own PDF documents!
