#### Saturday, February 10, 2024

How do I load a PDF file and then extract the text from it?

#### Libraries to consider:

* [pypdf](https://github.com/py-pdf/PyPDF) 
* [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
* [pdfplumber](https://github.com/jsvine/pdfplumber) 

In [1]:
import os 

pdfFolder = "../data"

In [2]:
def list_files_in_folder(folder_path):
    # Get the list of files in the specified folder
    files = os.listdir(folder_path)

    # Filter out subdirectories, leaving only files
    files = [f for f in files if os.path.isfile(os.path.join(folder_path, f))]

    return files

In [3]:
pdfFiles = list_files_in_folder(pdfFolder)

In [4]:
for file_name in pdfFiles:
    print(file_name)

Jordan B. Peterson -12 Rules for Life_ An Antidote to Chaos.pdf
Jordan Peterson - Beyond Order_ 12 More Rules For Life.pdf
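
The listing above returns every regular file in the folder, not only PDFs. A minimal sketch, assuming the PDFs all carry a `.pdf` extension, that filters by suffix with `pathlib` (the helper name `list_pdf_files` is made up for this note):

```python
from pathlib import Path


def list_pdf_files(folder_path):
    """Return the sorted names of the .pdf files directly inside folder_path."""
    folder = Path(folder_path)
    return sorted(
        p.name
        for p in folder.iterdir()
        # Keep regular files only and compare the extension case-insensitively.
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Sorting the names also makes an index such as `pdfFiles[1]` stable between runs, since `os.listdir` returns entries in arbitrary order.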


#### pypdf

* https://github.com/py-pdf/PyPDF
* https://pypdf.readthedocs.io/en/latest/
* `mamba install conda-forge::pypdf`

In [5]:
from pypdf import PdfReader

In [6]:
pdfFile = os.path.join(pdfFolder, pdfFiles[1])
print(pdfFile)

../data/Jordan Peterson - Beyond Order_ 12 More Rules For Life.pdf


In [7]:
reader = PdfReader(pdfFile)

In [8]:
# Slicing yields a sequence of page objects: zero-based indices 22-25, i.e. pages 23-26
page = reader.pages[22:26]

In [10]:
print(page[0].extract_text())

      
D O  N O T  C A R E L E S S L Y  D E N I G R A T E  S O C I A L
I N S T I T U T I O N S  O R  C R E A T I V E  A C H I E V E M E N T
                        
For years, I saw a client who lived by himself. *  He was isolated in
many other ways in addition to his living situation. He had extremely
limited family ties. Both of his daughters had moved out of the
country, and did not maintain much contact, and he had no other
relatives except a father and sister from whom he was estranged. His
wife and the mother of his children had passed away years ago, and
the sole relationship he endeavored to establish while he saw me
over the course of more than a decade and a half terminated
tragically when his new partner was killed in an automobile accident.
When we began to work together, our conversations were
decidedly awkward. He was not accustomed to the subtleties of social
interaction, so his behaviors, verbal and nonverbal, lacked the dance-
like rhythm and harmony that characterize

In [9]:
print(page[1].extract_text())

targeting by bullies continued into his adult life, particularly in his
place of work.
I soon noticed, however, that things worked out quite well during
our sessions if I kept mostly quiet. He would drop in, weekly or
biweekly, and talk about what had befallen and preoccupied him
during the previous seven to fourteen days. If I maintained silence
for the first fifty minutes of our one-hour sessions, listening intently,
then we could converse, in a relatively normal, reciprocal manner,
for the remaining ten minutes. This pattern continued for more than
a decade, as I learned, increasingly, to hold my tongue (something
that does not come easily to me). As the years passed, however, I
noticed that the proportion of time he spent discussing negative
issues with me decreased. Our conversation—his monologue, really—
had always started with what was bothering him, and rarely
progressed past that. But he worked hard outside our sessions,
cultivating friends, attending artistic gatherings and m
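
In the output above, the letter-spaced chapter heading comes out as `D O  N O T  C A R E L E S S L Y ...`, with one space between letters and two between words. A rough sketch of a cleanup pass under exactly that spacing assumption; `collapse_letter_spacing` and `tidy_headings` are hypothetical helpers, not part of pypdf:

```python
import re

# A letter-spaced line: single characters separated by one or two spaces.
_LETTER_SPACED = re.compile(r"\S(?: {1,2}\S){2,}")


def collapse_letter_spacing(line):
    """Turn 'D O  N O T' into 'DO NOT'; return ordinary lines unchanged."""
    stripped = line.strip()
    if not _LETTER_SPACED.fullmatch(stripped):
        return line
    words = stripped.split("  ")  # two spaces separate words
    return " ".join(word.replace(" ", "") for word in words)


def tidy_headings(text):
    """Apply collapse_letter_spacing to every line of an extracted page."""
    return "\n".join(collapse_letter_spacing(line) for line in text.split("\n"))
```

For example, `print(tidy_headings(page[0].extract_text()))` would print the heading as ordinary words while leaving the body text untouched.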

In [None]:
print(page[2].extract_text())

In [None]:
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            obj = annot.get_object()
            annotation = {"subtype": obj["/Subtype"], "location": obj["/Rect"]}
            print(annotation)

In [12]:
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/Text":
                print(annot.get_object()["/Contents"])

In [13]:
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/Highlight":
                coords = annot.get_object()["/QuadPoints"]
                x1, y1, x2, y2, x3, y3, x4, y4 = coords
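
In the PDF format, `/QuadPoints` holds eight numbers per highlighted quadrilateral, so a highlight that spans several lines carries a multiple of eight values and the eight-name unpacking above would raise `ValueError`. A sketch, using a hypothetical helper `quads_to_boxes`, that splits the flat list into quads and reduces each one to an axis-aligned bounding box:

```python
def quads_to_boxes(quad_points):
    """Split a flat /QuadPoints list into (x_min, y_min, x_max, y_max) boxes."""
    coords = [float(v) for v in quad_points]
    if len(coords) % 8 != 0:
        raise ValueError("expected a multiple of 8 numbers in /QuadPoints")
    boxes = []
    for i in range(0, len(coords), 8):
        quad = coords[i:i + 8]
        xs, ys = quad[0::2], quad[1::2]  # x values sit at even positions
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Inside the loop above it could be called as `quads_to_boxes(annot.get_object()["/QuadPoints"])`.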

In [None]:
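from pathlib import Path
# pypdf documents layout_mode_debug_path as a pathlib.Path to a directory; create "debug2" first
page.extract_text(extraction_mode="layout", layout_mode_debug_path=Path("debug2"))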

In [None]:
# extract text in a fixed width format that closely adheres to the rendered
# layout in the source pdf
print(page.extract_text(extraction_mode="layout"))

In [None]:
# extract text preserving horizontal positioning without excess vertical
# whitespace (removes blank and "whitespace only" lines)
print(page.extract_text(extraction_mode="layout", layout_mode_space_vertically=False))

In [None]:
# adjust horizontal spacing
print(page.extract_text(extraction_mode="layout", layout_mode_scale_weight=1.0))

In [None]:
# exclude (default) or include (as shown below) text rotated w.r.t. the page
print(page.extract_text(extraction_mode="layout", layout_mode_strip_rotated=False))

In [None]:
page = reader.pages[31]
print(page.extract_text())

#### pdfminer.six

* https://github.com/pdfminer/pdfminer.six
* https://pdfminersix.readthedocs.io/en/latest/
* `mamba install conda-forge::pdfminer.six`

In [27]:
import pdfminer
from pdfminer.high_level import extract_text
from pdfminer.high_level import extract_pages
pdfminer.__version__

'20231228'

In [12]:
pdfFile

'../data/Jordan Peterson - Beyond Order_ 12 More Rules For Life.pdf'

In [13]:
text = extract_text(pdfFile)
# 7.8s
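
The `# 7.8s` note records how long the cell took to run. A small sketch for measuring this in code instead, using `time.perf_counter` from the standard library; `timed` is a helper name invented here:

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

Usage would look like `text, seconds = timed(extract_text, pdfFile)`.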

In [None]:
print(repr(text))

In [None]:
print(text)

In [None]:
from pdfminer.high_level import extract_pages
for page_layout in extract_pages(pdfFile):
    for element in page_layout:
        print(element)
        
# 8.2s

In [None]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages(pdfFile):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

The next two cells are taken from the pdfminer.six how-to [How to resolve the target page of ToC entries](https://pdfminersix.readthedocs.io/en/latest/howto/toc_target_page.html).

In [14]:
from enum import Enum, auto
from pathlib import Path
from typing import Any, Optional
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfpage import PDFPage, LITERAL_PAGE
from pdfminer.pdfparser import PDFParser, PDFSyntaxError
from pdfminer.pdftypes import PDFObjRef


class PDFRefType(Enum):
    """PDF reference type."""

    PDF_OBJ_REF = auto()
    DICTIONARY = auto()
    LIST = auto()
    NAMED_REF = auto()
    UNK = auto()  # fallback


class RefPageNumberResolver:
    """PDF Reference to page number resolver.

    .. note::

       Remote Go-To Actions (see 12.6.4.3 in
       `https://www.adobe.com/go/pdfreference/`__)
       are out of the scope of this resolver.

    Attributes:
        document (:obj:`pdfminer.pdfdocument.PDFDocument`):
            The document that contains the references.
        objid_to_pagenum (:obj:`dict[int, int]`):
            Mapping from an object id to the number of the page that contains
            that object.
    """

    def __init__(self, document: PDFDocument):
        self.document = document
        # obj_id -> page_number
        self.objid_to_pagenum: dict[int, int] = {
            page.pageid: page_num
            for page_num, page in enumerate(PDFPage.create_pages(document), 1)
        }

    @classmethod
    def get_ref_type(cls, ref: Any) -> PDFRefType:
        """Get the type of a PDF reference."""
        if isinstance(ref, PDFObjRef):
            return PDFRefType.PDF_OBJ_REF
        elif isinstance(ref, dict) and "D" in ref:
            return PDFRefType.DICTIONARY
        elif isinstance(ref, list) and any(isinstance(e, PDFObjRef) for e in ref):
            return PDFRefType.LIST
        elif isinstance(ref, bytes):
            return PDFRefType.NAMED_REF
        else:
            return PDFRefType.UNK

    @classmethod
    def is_ref_page(cls, ref: Any) -> bool:
        """Check whether a reference is of type '/Page'.

        Args:
            ref (:obj:`Any`):
                The PDF reference.

        Returns:
            :obj:`bool`: :obj:`True` if the reference references
            a page, :obj:`False` otherwise.
        """
        return isinstance(ref, dict) and "Type" in ref and ref["Type"] is LITERAL_PAGE

    def resolve(self, ref: Any) -> Optional[int]:
        """Resolve a PDF reference to a page number recursively.

        Args:
            ref (:obj:`Any`):
                The PDF reference.

        Returns:
            :obj:`Optional[int]`: The page number or :obj:`None`
            if the reference could not be resolved (e.g., remote Go-To
            Actions or malformed references).
        """
        ref_type = self.get_ref_type(ref)

        if ref_type is PDFRefType.PDF_OBJ_REF and self.is_ref_page(ref.resolve()):
            return self.objid_to_pagenum.get(ref.objid)
        elif ref_type is PDFRefType.PDF_OBJ_REF:
            return self.resolve(ref.resolve())

        if ref_type is PDFRefType.DICTIONARY:
            return self.resolve(ref["D"])

        if ref_type is PDFRefType.LIST:
            # Get the PDFObjRef in the list (usually first element).
            return self.resolve(next(filter(lambda e: isinstance(e, PDFObjRef), ref)))

        if ref_type is PDFRefType.NAMED_REF:
            return self.resolve(self.document.get_dest(ref))

        return None  # PDFRefType.UNK

In [15]:
def print_outlines(file: str) -> None:
    """Pretty print the outlines (ToC) of a PDF document."""
    with open(file, "rb") as fp:
        try:
            
            parser = PDFParser(fp)
            document = PDFDocument(parser)

            ref_pagenum_resolver = RefPageNumberResolver(document)

            outlines = list(document.get_outlines())

            if not outlines:
                print("No outlines found.")
            for (level, title, dest, a, se) in outlines:
                if dest:
                    page_num = ref_pagenum_resolver.resolve(dest)
                elif a:
                    page_num = ref_pagenum_resolver.resolve(a)
                elif se:
                    page_num = ref_pagenum_resolver.resolve(se)
                else:
                    page_num = None

                # Calculate leading spaces and filling dots for formatting.
                leading_spaces = (level-1) * 4
                fill_dots = 80 - len(title) - leading_spaces

                print(
                    f"{' ' * leading_spaces}"
                    f"{title}",
                    f"{'.' * fill_dots}",
                    # page_num is None when the reference cannot be resolved
                    f"{'?' if page_num is None else page_num:>3}"
                )
        except PDFNoOutlines:
            print("No outlines found.")
        except PDFSyntaxError:
            print("Corrupted PDF or non-PDF file.")
        finally:
            try:
                parser.close()
            except NameError:
                pass  # nothing to do

In [16]:
print_outlines(pdfFile)

Also by Jordan B. Peterson ......................................................   2
Title Page ......................................................................   3
Copyright .......................................................................   4
Dedication ......................................................................   5
Contents ........................................................................   6
Table of Illustrations ..........................................................   8
A Note from the Author in the Time of the Pandemic ..............................  10
Overture ........................................................................  11
Rule I: Do Not Carelessly Denigrate Social Institutions or Creative Achievement .  23
    Loneliness and Confusion ....................................................  23
    Sanity as a Social Institution ..............................................  25
    The Point of Pointing ............................
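
The outline above numbers pages from 1, because `RefPageNumberResolver` builds its map with `enumerate(..., 1)`. The `page_numbers` argument of pdfminer.six and `reader.pages` in pypdf are zero-indexed, so Rule I, listed on page 23, sits at index 22. A sketch of the conversion, with `outline_pages_to_indices` as a made-up helper name:

```python
def outline_pages_to_indices(first_page, last_page):
    """Convert an inclusive 1-based page range to zero-based page indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return list(range(first_page - 1, last_page))
```

`outline_pages_to_indices(23, 24)` gives `[22, 23]`, which can be passed as `page_numbers`.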

In [19]:
# page_numbers is zero-indexed; extract_pages returns a lazy generator
pages = extract_pages(pdf_file=pdfFile, page_numbers=[22, 23])
pages

<generator object extract_pages at 0x7f94c070ae30>

In [20]:

extract_text(pdf_file=pdfFile, page_numbers=[22, 23])

'\x00\x00\x00\x00 \x00\n\nDO NOT CARELESSLY DENIGRATE SOCIAL\nINSTITUTIONS OR CREATIVE ACHIEVEMENT\n\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\n\nFor years, I saw a client who lived by himself.* He was isolated in\nmany other ways in addition to his living situation. He had extremely\nlimited family ties. Both of his daughters had moved out of the\ncountry, and did not maintain much contact, and he had no other\nrelatives except a father and sister from whom he was estranged. His\nwife and the mother of his children had passed away years ago, and\nthe sole relationship he endeavored to establish while he saw me\nover the course of more than a decade and a half terminated\ntragically when his new partner was killed in an automobile accident.\n\nWhen we began to work together, our conversations were\n\ndecidedly awkward. He was not accustomed to the subtleties of social\ninteraction, so his behaviors, verbal and nonverbal, lacked the danc

In [25]:
extract_text(pdf_file=pdfFile, page_numbers=[24, 25])

'well, so that we may organize our otherwise inchoate bodily\nreactions, motivations, and emotions into something articulate and\norganized, and dispense with those concerns that are exaggerated\nand irrational. We need to talk—both to remember and to forget.\nMy client desperately needed someone to listen to him. He also\nneeded to be fully part of additional, larger, and more complex social\ngroups—something he planned in our sessions together, and then\ncarried out on his own. Had he fallen prey to the temptation to\ndenigrate the value of interpersonal interactions and relationships\nbecause of his history of isolation and harsh treatment, he would\nhave had very little chance of regaining his health and well-being.\nInstead, he learned the ropes and joined the world.\n\n\x00\x00\x00\x00\x00\x00 \x00\x00 \x00 \x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\nFor Drs. Sigmund Freud and Carl Jung, the great depth\npsychologists, sanity was a characteristic of t
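
The pdfminer.six output above contains runs of `\x00` characters around the section headings. A sketch that removes them, assuming those characters carry no recoverable text; `strip_nul_runs` is a hypothetical helper, not a pdfminer.six function:

```python
def strip_nul_runs(text):
    """Remove NUL characters, then drop lines that became whitespace-only."""
    cleaned_lines = []
    for line in text.split("\n"):
        had_nul = "\x00" in line
        line = line.replace("\x00", "")
        # Drop only lines emptied by the removal; keep genuine blank lines.
        if had_nul and not line.strip():
            continue
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
```

It could be applied as `strip_nul_runs(extract_text(pdf_file=pdfFile, page_numbers=[22, 23]))`.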

In [None]:
# Adapted from https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text
# The original answer targets Python 2 and the legacy pdfminer API (cStringIO,
# LTTextItem, PDFDocument.set_parser); this port uses Python 3 and pdfminer.six.
def pdf_to_csv(filename):
    from collections import defaultdict
    from io import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LTChar
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    class CsvConverter(TextConverter):
        def end_page(self, page):
            # Group characters by (truncated) vertical position, then order
            # each group by horizontal position.
            lines = defaultdict(dict)
            for child in self.cur_item:
                if isinstance(child, LTChar):
                    (_, _, x, y) = child.bbox
                    lines[int(-y)][x] = child.get_text()

            for y in sorted(lines):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line)))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the
    # convert() function in pdfminer's pdf2txt tool
    rsrc = PDFResourceManager()
    outfp = StringIO()
    # laparams=None skips layout analysis, so each page holds individual characters
    device = CsvConverter(rsrc, outfp, laparams=None)
    interpreter = PDFPageInterpreter(rsrc, device)

    with open(filename, "rb") as fp:
        for i, page in enumerate(PDFPage.get_pages(fp)):
            outfp.write("START PAGE %d\n" % i)
            interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)

    device.close()

    return outfp.getvalue()