## Parsing PDF example

This notebook shows the basic steps to parse a PDF page image, using layoutparser for layout detection and either Google Cloud Vision or Tesseract for OCR.

In [34]:
import layoutparser as lp
from pathlib import Path
from layoutparser.ocr.tesseract_agent import TesseractAgent
import numpy as np
import os
from copy import copy
from layoutparser.elements.layout_elements import TextBlock
from typing import Union

In [None]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "../credentials/credentials.json"

In [None]:
# create downloads folder at root of repo and run scripts/fep_pypaperbot_example.sh to download these examples
path = Path("../downloads")

In [None]:
# Let's load a single PDF as an example: the first one found in the downloads folder.
file = next(path.glob("*.pdf"))
file_name = file.name
pdf_tokens, pdf_images = lp.load_pdf(file, load_images=True)

In [None]:
# Start with the simplest, fastest model. The model zoo has more complex models if we want to use them.
model = lp.AutoLayoutModel("lp://efficientdet/PubLayNet/tf_efficientdet_d0")

In [None]:
len(pdf_images)

In [None]:
image = pdf_images[11]  # The page with references

In [None]:
layout = model.detect(image)
lp.draw_box(image, layout)

We can see that layoutparser has detected the layout blocks as expected. The boxes fit the text quite tightly, so we may want to pad each crop before OCR.
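
To make the effect of padding concrete, here is a minimal sketch of the arithmetic in plain Python, independent of layoutparser. The helper `pad_box`, the example coordinates, and the clamping to the page bounds are assumptions for illustration only, not part of the library's API.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_1, y_1, x_2, y_2) in pixels


def pad_box(
    box: Box,
    page_size: Tuple[int, int],
    left: int = 15,
    right: int = 5,
    top: int = 5,
    bottom: int = 5,
) -> Box:
    """Grow a box by the given margins, clamped to the page (hypothetical helper)."""
    width, height = page_size
    x_1, y_1, x_2, y_2 = box
    return (
        max(0, x_1 - left),
        max(0, y_1 - top),
        min(width, x_2 + right),
        min(height, y_2 + bottom),
    )


# A made-up block near the left edge of a 612 x 792 page.
padded = pad_box((10, 100, 400, 160), page_size=(612, 792))
print(padded)
```

The left margin is larger than the others in this sketch because a tight crop tends to clip the first character of each line; the exact values are a judgment call and worth tuning per document.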

In [None]:
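# Keep only the blocks that the model classified as body text
text_blocks = lp.Layout([b for b in layout if b.type == "Text"])
# Convert the PIL image to a NumPy array so that blocks can crop it
image = np.asarray(image)
tesseract_ocr_agent = TesseractAgent(languages="eng")
# The Google agent reads its credentials from GOOGLE_APPLICATION_CREDENTIALS
google_ocr_agent = lp.GCVAgent(languages="eng")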

## Google OCR vs Tesseract OCR

In [None]:
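def set_block_texts(
    ocr_agent: Union[lp.TesseractAgent, lp.GCVAgent],
    block: TextBlock,
    left_pad: int = 15,
    right_pad: int = 5,
    top_pad: int = 5,
    bottom_pad: int = 5,
) -> None:
    """Set the text of a block in place using the Google or Tesseract OCR agent.

    The block is cropped from the notebook-level ``image`` array.

    :param ocr_agent: Agent to use for OCR. Google or Tesseract.
    :param block: Block to set text of.
    :param left_pad: Pixels of padding added to the left of the block.
    :param right_pad: Pixels of padding added to the right of the block.
    :param top_pad: Pixels of padding added above the block.
    :param bottom_pad: Pixels of padding added below the block.
    :return: None.
    """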
    # Crop image around the detected layout, padding to improve OCR accuracy
    segment_image = block.pad(
        left=left_pad, right=right_pad, top=top_pad, bottom=bottom_pad
    ).crop_image(image)

    # Perform OCR
    text = ocr_agent.detect(segment_image, return_only_text=True)

    # Save OCR result
    block.set(text=text, inplace=True)


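# Copy each block so that the OCR text written below does not leak into text_blocks.
google_text_blocks = lp.Layout([copy(block) for block in text_blocks])

# Google OCR also returns per-character confidence scores; only the text is kept here.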
for block in google_text_blocks:
    set_block_texts(google_ocr_agent, block)

for txt in google_text_blocks:
    print(txt.text, end="\n---\n")

In [None]:
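# Copy each block: a shallow copy of the layout would share its blocks
# with google_text_blocks, and Tesseract would overwrite the Google text.
tesseract_text_blocks = lp.Layout([copy(block) for block in google_text_blocks])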
for block in tesseract_text_blocks:
    set_block_texts(tesseract_ocr_agent, block)
for txt in tesseract_text_blocks:
    print(txt.text, end="\n---\n")

On this page, the Google OCR output is better than the Tesseract output. This isn't surprising, as Google OCR is a paid service. It is relatively cheap, though, so we can use it for our purposes.
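
To make the comparison less anecdotal, one option is to measure how closely the two outputs agree block by block. The sketch below uses `difflib.SequenceMatcher` from the standard library. The helper `ocr_agreement` and the sample strings are made up for illustration, and a similarity ratio only shows where the agents disagree, not which one is right.

```python
from difflib import SequenceMatcher
from typing import List


def ocr_agreement(texts_a: List[str], texts_b: List[str]) -> List[float]:
    """Return a similarity ratio in [0, 1] for each pair of block texts."""
    return [
        SequenceMatcher(None, a.strip(), b.strip()).ratio()
        for a, b in zip(texts_a, texts_b)
    ]


# Made-up strings standing in for the two agents' outputs.
google_texts = ["Deep learning for layout analysis.", "References"]
tesseract_texts = ["Deep leaming for layout analysis.", "References"]

for i, score in enumerate(ocr_agreement(google_texts, tesseract_texts)):
    print(f"block {i}: agreement {score:.2f}")
```

In the notebook, the two lists could be built with `[b.text for b in google_text_blocks]` and `[b.text for b in tesseract_text_blocks]`, provided the two layouts hold separate block objects. Blocks with a low ratio are the ones worth inspecting by eye.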