#### Saturday, February 10, 2024

How do I load a PDF file and then extract the text from it?

#### Libraries to consider:

* [pypdf](https://github.com/py-pdf/PyPDF) 
* [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
* [pdfplumber](https://github.com/jsvine/pdfplumber) 

In [1]:
import os 

pdfFolder = "../data"

In [2]:
def list_files_in_folder(folder_path):
    # Get the list of files in the specified folder
    files = os.listdir(folder_path)

    # Filter out subdirectories, leaving only files
    files = [f for f in files if os.path.isfile(os.path.join(folder_path, f))]

    return files

In [3]:
pdfFiles = list_files_in_folder(pdfFolder)

In [4]:
for file_name in pdfFiles:
    print(file_name)

Jordan B. Peterson -12 Rules for Life_ An Antidote to Chaos.pdf
Jordan Peterson - Beyond Order_ 12 More Rules For Life.pdf
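The listing above returns every file in the folder, not only PDFs. A minimal sketch of a stricter variant, assuming only files with a `.pdf` extension are wanted; `list_pdf_files` is a hypothetical helper name and is not used elsewhere in this notebook:

```python
from pathlib import Path


def list_pdf_files(folder_path):
    """Return the sorted names of the PDF files directly inside folder_path."""
    folder = Path(folder_path)
    # Keep regular files whose extension is .pdf (case-insensitive); skip subdirectories
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `list_pdf_files(pdfFolder)` should give the same two names as above as long as the folder holds nothing but those PDFs.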


#### pypdf

* https://github.com/py-pdf/PyPDF
* https://pypdf.readthedocs.io/en/latest/
* mamba install conda-forge::pypdf 

In [5]:
from pypdf import PdfReader

In [6]:
pdfFile = os.path.join(pdfFolder, pdfFiles[0])
print(pdfFile)

../data/Jordan B. Peterson -12 Rules for Life_ An Antidote to Chaos.pdf


In [7]:
reader = PdfReader(pdfFile)

#### 12 Rules for Life - An Antidote to Chaos

Extract the text from every chapter of the book into individual objects.

There does not seem to be an obvious way to automatically determine each chapter's page range, so I will identify the start and end pages manually.
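One thing that might reduce the manual work: pypdf exposes a document's bookmarks through `reader.outline`, and `reader.get_destination_page_number()` maps a bookmark to a zero-based page index. A minimal sketch, assuming the PDF actually carries bookmarks (this was not verified with pypdf in this notebook); `outline_start_pages` is a hypothetical helper name:

```python
def outline_start_pages(reader):
    """Return (title, zero_based_page_index) pairs for a pypdf PdfReader's bookmarks."""
    entries = []

    def walk(items):
        for item in items:
            if isinstance(item, list):
                # Nested lists hold the child bookmarks of the preceding entry
                walk(item)
            else:
                entries.append((item.title, reader.get_destination_page_number(item)))

    walk(reader.outline)
    return entries
```

A bookmark may point at a chapter's title page rather than its first page of text, so any ranges derived this way would still need checking by hand.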


In [8]:
# A helper function to validate a chapter's start and end page values
def validateChapter(chapter):

    limit = 256
    start = chapter[0].extract_text()
    end = chapter[-1].extract_text()
    print("*** START ***")
    print(start[:limit])
    print("...")
    print("*** End ***")
    print(end[-limit:])

##### Foreword

In [9]:
foreword = reader.pages[4:19]
validateChapter(foreword)

*** START ***
Foreword
Rules? More rules? Really? Isn’t life complicated enough, restricting enough,
without abstract rules that don’t take our unique, individual situations into
account? And given that our brains are plastic, and all develop differently
based on our li
...
*** End ***
, it is certain
you will never feel that your life has meaning.
And perhaps because, as unfamiliar and strange as it sounds, in the deepest
part of our psyche, we all want to be judged.
Dr. Norman Doidge, MD, is the author
of 
The Brain That Changes Itself


##### Overture

In [10]:
overture = reader.pages[19:28]
validateChapter(overture)

*** START ***
Overture
This book has a short history and a long history. We’ll begin with the short
history.
In 2012, I started contributing to a website called Quora. On Quora,
anyone can ask a question, of any sort—and anyone can answer. Readers
upvote those answers t
...
*** End ***
onsibility is identical to the decision to live a meaningful life.
If we each live properly, we will collectively flourish.
Best wishes to you all, as you proceed through these pages.
Dr. Jordan B. Peterson
Clinical Psychologist and Professor of Psychology


##### Chapter 1

In [11]:
chapter1 = reader.pages[30:56]
validateChapter(chapter1)

*** START ***
RULE 1
STAND UP STRAIGHT WITH YOUR
SHOULDERS BACK
LOBSTERS—AND TERRITORY
If you are like most people, you don’t often think about lobsters
2
—unless
you’re eating one. However, these interesting and delicious crustaceans are
very much worth considering. Th
...
*** End ***
uence of mortal despair at bay.
Then you may be able to accept the terrible burden of the World, and find
joy.
Look for your inspiration to the victorious lobster, with its 350 million
years of practical wisdom. Stand up straight, with your shoulders back.


##### Chapter 2

In [12]:
chapter2 = reader.pages[59:89]
validateChapter(chapter2)


*** START ***
RULE 2
TREAT YOURSELF LIKE SOMEONE YOU ARE
RESPONSIBLE FOR HELPING
WHY WON’T YOU JUST TAKE YOUR DAMN PILLS?
Imagine that a hundred people are prescribed a drug. Consider what happens
next. One-third of them won’t fill the prescription.
30
 Half of the rema
...
*** End ***
d replace your shame and self-consciousness with the
natural pride and forthright confidence of someone who has learned once
again to walk with God in the Garden.
You could begin by treating yourself as if you were someone you were
responsible for helping.


##### Chapter 3

In [13]:
chapter3 = reader.pages[92:107]
validateChapter(chapter3)

*** START ***
RULE 3
MAKE FRIENDS WITH PEOPLE WHO WANT
THE BEST FOR YOU
THE OLD HOMETOWN
The town I grew up in had been scraped only fifty years earlier out of the
endless flat Northern prairie. Fairview, Alberta, was part of the frontier, and
had the cowboy bars to pro
...
*** End ***
 person is an ideal. It
requires strength and daring to stand up near such a person. Have some
humility. Have some courage. Use your judgment, and protect yourself from
too-uncritical compassion and pity.
Make friends with people who want the best for you.


##### Chapter 4

In [14]:
chapter4 = reader.pages[109:133]
validateChapter(chapter4)

*** START ***
RULE 4
COMPARE YOURSELF TO WHO YOU WERE
YESTERDAY, NOT TO WHO SOMEONE ELSE
IS TODAY
THE INTERNAL CRITIC
It was easier for people to be good at something when more of us lived in
small, rural communities. Someone could be homecoming queen. Someone
else coul
...
*** End ***
k, as if you want to enter, you may be offered the chance
to improve your life, a little; a lot; completely—and with that improvement,
some progress will be made in Being itself.
Compare yourself to who you were yesterday, not to who someone else is
today.


##### Chapter 5

In [15]:
chapter5 = reader.pages[135:165]
validateChapter(chapter5)

*** START ***
RULE 5
DO NOT LET YOUR CHILDREN DO
ANYTHING THAT MAKES YOU DISLIKE
THEM
ACTUALLY, IT’S NOT OK
Recently, I watched a three-year-old boy trail his mother and father slowly
through a crowded airport. He was screaming violently at five-second
intervals—and, mo
...
*** End ***
os and the terrors of the underworld, where everything is
uncertain, anxiety-provoking, hopeless and depressing. There are no greater
gifts that a committed and courageous parent can bestow.
Do not let your children do anything that makes you dislike them.


##### Chapter 6

In [16]:
chapter6 = reader.pages[168:180]
validateChapter(chapter6)

*** START ***
RULE 6
SET YOUR HOUSE IN PERFECT ORDER
BEFORE YOU CRITICIZE THE WORLD
A RELIGIOUS PROBLEM
It does not seem reasonable to describe the young man who shot twenty
children and six staff members at Sandy Hook Elementary School in
Newtown, Connecticut, in 2012 
...
*** End ***
Set your house in perfect order before you criticize the world.


##### Chapter 7

In [17]:
chapter7 = reader.pages[182:219]
validateChapter(chapter7)

*** START ***
RULE 7
PURSUE WHAT IS MEANINGFUL (NOT WHAT
IS EXPEDIENT)
GET WHILE THE GETTING’S GOOD
Life is suffering. That’s clear. There is no more basic, irrefutable truth. It’s
basically what God tells Adam and Eve, immediately before he kicks them
out of Paradise.

...
*** End ***
. Meaning is the Way, the path of
life more abundant, the place you live when you are guided by Love and
speaking Truth and when nothing you want or could possibly want takes any
precedence over precisely that.
Do what is meaningful, not what is expedient.


##### Chapter 8

In [19]:
chapter8 = reader.pages[221:246]
validateChapter(chapter8)

*** START ***
RULE 8
TELL THE TRUTH—OR, AT LEAST, DON’T LIE
TRUTH IN NO-MAN’S-LAND
I trained to become a clinical psychologist at McGill University, in Montreal.
While doing so, I sometimes met my classmates on the grounds of Montreal’s
Douglas Hospital, where we had ou
...
*** End ***
o an ideology, or wallow in nihilism, try telling the truth. If you
feel weak and rejected, and desperate, and confused, try telling the truth. In
Paradise, everyone speaks the truth. That is what makes it Paradise.
Tell the truth. Or, at least, don’t lie.


##### Chapter 9

In [20]:
chapter9 = reader.pages[249:270]
validateChapter(chapter9)

*** START ***
RULE 9
ASSUME THAT THE PERSON YOU ARE
LISTENING TO MIGHT KNOW SOMETHING
YOU DON’T
NOT ADVICE
Psychotherapy is not advice. Advice is what you get when the person you’re
talking with about something horrible and complicated wishes you would just
shut up and 
...
*** End ***
phic Oracle in ancient Greece spoke
most highly of Socrates, who always sought the truth. She described him as
the wisest living man, because he knew that what he knew was nothing.
Assume that the person you are listening to might know something you
don’t.


##### Chapter 10

In [22]:
chapter10 = reader.pages[273:295]
validateChapter(chapter10)

*** START ***
RULE 10
BE PRECISE IN YOUR SPEECH
WHY IS MY LAPTOP OBSOLETE?
What do you see, when you look at a computer—at your own laptop, more
precisely? You see a flat, thin, grey-and-black box. Less evidently, you see
something to type on and look at. Nonetheless, e
...
*** End ***
ont the chaos of Being. Take aim against a sea of troubles. Specify
your destination, and chart your course. Admit to what you want. Tell those
around you who you are. Narrow, and gaze attentively, and move forward,
forthrightly.
Be precise in your speech.


##### Chapter 11

In [24]:
chapter11 = reader.pages[297:340]
validateChapter(chapter11)

*** START ***
RULE 11
DO NOT BOTHER CHILDREN WHEN THEY
ARE SKATEBOARDING
DANGER AND MASTERY
There was a time when kids skateboarded on the west side of Sidney Smith
Hall, at the University of Toronto, where I work. Sometimes I stood there and
watched them. There are rou
...
*** End ***
rself with such a thing. No
one aiming at moving up would allow him or herself to become possessed by
such a thing. And if you think tough men are dangerous, wait until you see
what weak men are capable of.
Leave children alone when they are skateboarding.


##### Chapter 12

In [25]:
chapter12 = reader.pages[343:360]
validateChapter(chapter12)

*** START ***
RULE 12
PET A CAT WHEN YOU ENCOUNTER ONE ON
THE STREET
DOGS ARE OK TOO
I am going to start this chapter by stating directly that I own a dog, an
American Eskimo, one of the many variants of the basic spitz type. They
were known as German spitzes until the 
...
*** End ***
ipse around in her bare feet. The calf muscle on
her damaged leg is growing back. She has much more flexion in the artificial
joint. This year, she got married and had a baby girl, Elizabeth, named after
my wife’s departed mother.
Things are good.
For now.


##### Coda

In [27]:
coda = reader.pages[361:373]
validateChapter(coda)

*** START ***
Coda
WHAT SHALL I DO WITH MY NEWFOUND PEN OF LIGHT?
In late 2016 I travelled to northern California to meet a friend and business
associate. We spent an evening together thinking and talking. At one point he
took a pen from his jacket and took a few notes.
...
*** End ***
skateboarding), that you strengthen and
encourage those who are committed to your care instead of protecting them
to the point of weakness.
I wish you all the best, and hope that you can wish the best for others.
What will you write with your pen of light?


Now that we have manually identified the start and end pages of every chapter, let's put those values into a dictionary, then write a simple function to read each chapter's text from the PDF file.

In [30]:
bookChapters = { "Foreword" : [4,19],
                 "Overture": [19, 28],
                 "RULE 1: Stand up straight with your shoulders back" : [30, 56],
                 "RULE 2: Treat yourself like someone you are responsible for helping" : [59,89],
                 "RULE 3: Make friends with people who want the best for you" : [92, 107],
                 "RULE 4: Compare yourself to who you were yesterday, not to who someone else is today" : [109, 133],
                 "RULE 5: Do not let your children do anything that makes you dislike them" : [135,165],
                 "RULE 6: Set your house in perfect order before you criticize the world" : [168, 180],
                 "RULE 7: Pursue what is meaningful (not what is expedient)" : [182, 219],
                 "RULE 8: Tell the truth - or, at least, don't lie" : [221, 246],
                 "RULE 9: Assume that the person you are listening to might know something you don't" : [249, 270],
                 "RULE 10: Be precise in your speech" : [273, 295],
                 "RULE 11: Do not bother children when they are skateboarding" : [297, 340],
                 "RULE 12: Pet a cat when you encounter one on the street" : [343, 360],
                 "Coda" : [361, 373]
                }
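The reading function itself is not shown in the notebook. A minimal sketch of what it could look like, assuming each `[start, end]` pair is used as the same half-open slice as in the validation cells; `extract_chapter_text` and `extract_book` are hypothetical names:

```python
def extract_chapter_text(reader, page_range):
    """Join the text of reader.pages[start:end] for a pypdf PdfReader."""
    start, end = page_range
    # Same half-open slice as the validation cells: the end page is excluded
    pages = reader.pages[start:end]
    # Fall back to an empty string for pages that yield no text
    return "\n".join(page.extract_text() or "" for page in pages)


def extract_book(reader, chapters):
    """Map each chapter title to its extracted text."""
    return {title: extract_chapter_text(reader, rng) for title, rng in chapters.items()}
```

With the objects defined above, `bookText = extract_book(reader, bookChapters)` would give one string per chapter, keyed by chapter title.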

#### pdfminer.six

* https://github.com/pdfminer/pdfminer.six
* https://pdfminersix.readthedocs.io/en/latest/
* mamba install conda-forge::pdfminer.six

In [28]:
import pdfminer
from pdfminer.high_level import extract_text
from pdfminer.high_level import extract_pages
pdfminer.__version__

'20231228'

In [34]:
pdfFile

'../data/Jordan B. Peterson -12 Rules for Life_ An Antidote to Chaos.pdf'

In [35]:
text = extract_text(pdfFile)
# 7.8s  Beyond Order
# 32.9s 12 Rules

In [None]:
print(repr(text))

In [None]:
print(text)

In [None]:
from pdfminer.high_level import extract_pages
for page_layout in extract_pages(pdfFile):
    for element in page_layout:
        print(element)
        
# 8.2s

In [None]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages(pdfFile):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

The next two cells are from [How to resolve the target page of ToC entries](https://pdfminersix.readthedocs.io/en/latest/howto/toc_target_page.html)

In [30]:
from enum import Enum, auto
from pathlib import Path
from typing import Any, Optional
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfpage import PDFPage, LITERAL_PAGE
from pdfminer.pdfparser import PDFParser, PDFSyntaxError
from pdfminer.pdftypes import PDFObjRef


class PDFRefType(Enum):
    """PDF reference type."""

    PDF_OBJ_REF = auto()
    DICTIONARY = auto()
    LIST = auto()
    NAMED_REF = auto()
    UNK = auto()  # fallback


class RefPageNumberResolver:
    """PDF Reference to page number resolver.

    .. note::

       Remote Go-To Actions (see 12.6.4.3 in
       `https://www.adobe.com/go/pdfreference/`__)
       are out of the scope of this resolver.

    Attributes:
        document (:obj:`pdfminer.pdfdocument.PDFDocument`):
            The document that contains the references.
        objid_to_pagenum (:obj:`dict[int, int]`):
            Mapping from an object id to the number of the page that contains
            that object.
    """

    def __init__(self, document: PDFDocument):
        self.document = document
        # obj_id -> page_number
        self.objid_to_pagenum: dict[int, int] = {
            page.pageid: page_num
            for page_num, page in enumerate(PDFPage.create_pages(document), 1)
        }

    @classmethod
    def get_ref_type(cls, ref: Any) -> PDFRefType:
        """Get the type of a PDF reference."""
        if isinstance(ref, PDFObjRef):
            return PDFRefType.PDF_OBJ_REF
        elif isinstance(ref, dict) and "D" in ref:
            return PDFRefType.DICTIONARY
        elif isinstance(ref, list) and any(isinstance(e, PDFObjRef) for e in ref):
            return PDFRefType.LIST
        elif isinstance(ref, bytes):
            return PDFRefType.NAMED_REF
        else:
            return PDFRefType.UNK

    @classmethod
    def is_ref_page(cls, ref: Any) -> bool:
        """Check whether a reference is of type '/Page'.

        Args:
            ref (:obj:`Any`):
                The PDF reference.

        Returns:
            :obj:`bool`: :obj:`True` if the reference references
            a page, :obj:`False` otherwise.
        """
        return isinstance(ref, dict) and "Type" in ref and ref["Type"] is LITERAL_PAGE

    def resolve(self, ref: Any) -> Optional[int]:
        """Resolve a PDF reference to a page number recursively.

        Args:
            ref (:obj:`Any`):
                The PDF reference.

        Returns:
            :obj:`Optional[int]`: The page number or :obj:`None`
            if the reference could not be resolved (e.g., remote Go-To
            Actions or malformed references).
        """
        ref_type = self.get_ref_type(ref)

        if ref_type is PDFRefType.PDF_OBJ_REF and self.is_ref_page(ref.resolve()):
            return self.objid_to_pagenum.get(ref.objid)
        elif ref_type is PDFRefType.PDF_OBJ_REF:
            return self.resolve(ref.resolve())

        if ref_type is PDFRefType.DICTIONARY:
            return self.resolve(ref["D"])

        if ref_type is PDFRefType.LIST:
            # Get the PDFObjRef in the list (usually first element).
            return self.resolve(next(filter(lambda e: isinstance(e, PDFObjRef), ref)))

        if ref_type is PDFRefType.NAMED_REF:
            return self.resolve(self.document.get_dest(ref))

        return None  # PDFRefType.UNK

In [31]:
def print_outlines(file: str) -> None:
    """Pretty print the outlines (ToC) of a PDF document."""

    with open(file, "rb") as fp:
        try:
            
            parser = PDFParser(fp)
            document = PDFDocument(parser)

            ref_pagenum_resolver = RefPageNumberResolver(document)

            outlines = list(document.get_outlines())

            if not outlines:
                print("No outlines found.")
            for (level, title, dest, a, se) in outlines:
                if dest:
                    page_num = ref_pagenum_resolver.resolve(dest)
                elif a:
                    page_num = ref_pagenum_resolver.resolve(a)
                elif se:
                    page_num = ref_pagenum_resolver.resolve(se)
                else:
                    page_num = None

                # Calculate leading spaces and filling dots for formatting.
                leading_spaces = (level-1) * 4
                fill_dots = 80 - len(title) - leading_spaces

                print(
                    f"{' ' * leading_spaces}"
                    f"{title}",
                    f"{'.' * fill_dots}",
                    f"{str(page_num):>3}"  # str() keeps the format valid when page_num is None
                )
        except PDFNoOutlines:
            print("No outlines found.")
        except PDFSyntaxError:
            print("Corrupted PDF or non-PDF file.")
        finally:
            try:
                parser.close()
            except NameError:
                pass  # nothing to do

In [36]:
print_outlines(pdfFile)

Title Page ......................................................................   3
Foreword ........................................................................   5
Overture ........................................................................  20
‌RULE 1: Stand up straight with your shoulders back .............................  30
‌RULE 2: Treat yourself like someone you are responsible for helping ............  59
‌RULE 3: Make friends with people who want the best for you .....................  92
‌RULE 4: Compare yourself to who you were yesterday, not to who someone else is today  109
‌RULE 5: Do not let your children do anything that makes you dislike them ....... 135
‌RULE 6: Set your house in perfect order before you criticize the world ......... 168
‌RULE 7: Pursue what is meaningful (not what is expedient) ...................... 182
‌RULE 8: Tell the truth—or, at least, don’t lie ................................. 221
‌RULE 9: Assume that the person you are listening
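The outline page numbers are 1-based start pages, so they can be turned into approximate zero-based slices by letting each entry run up to the next one. A minimal sketch with abbreviated titles copied from the first entries printed above; `chapter_ranges` is a hypothetical helper and `total_pages=100` is a placeholder, not the real page count:

```python
def chapter_ranges(outline, total_pages):
    """Turn (title, 1-based start page) pairs into zero-based [start, end) ranges."""
    ranges = {}
    for i, (title, start) in enumerate(outline):
        # An entry ends where the next outline entry begins; the last one runs to the end
        next_start = outline[i + 1][1] if i + 1 < len(outline) else total_pages + 1
        ranges[title] = (start - 1, next_start - 1)
    return ranges


# Start pages copied from the outline output above (titles shortened)
outline = [("Title Page", 3), ("Foreword", 5), ("Overture", 20), ("RULE 1", 30), ("RULE 2", 59)]
ranges = chapter_ranges(outline, total_pages=100)  # placeholder page count
print(ranges["Foreword"])  # (4, 19)
```

The Foreword range matches the manual slice `[4, 19]`, but others come out wider (Overture gives `(19, 29)` against the manual `[19, 28]`), so these ranges are only a starting point for the manual check.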

In [40]:
pages = extract_pages(pdf_file = pdfFile)

In [45]:
# page_numbers lists individual zero-indexed pages, so this selects pages 0 and 10000, not a range
pages = extract_pages(pdf_file=pdfFile, page_numbers=[0, 10000])
pages

<generator object extract_pages at 0x7f94c070b760>

In [39]:

extract_text(pdf_file = pdfFile, page_numbers=[30,32])

'R U L E   1\n\nSTAND UP STRAIGHT WITH YOUR\nSHOULDERS BACK\n\nLOBSTERS—AND TERRITORY\n\nIf you are like most people, you don’t often think about lobsters2—unless\nyou’re eating one. However, these interesting and delicious crustaceans are\nvery much worth considering. Their nervous systems are comparatively\nsimple, with large, easily observable neurons, the magic cells of the brain.\nBecause of this, scientists have been able to map the neural circuitry of\nlobsters very accurately. This has helped us understand the structure and\nfunction of the brain and behaviour of more complex animals, including\nhuman beings. Lobsters have more in common with you than you might think\n(particularly when you are feeling crabby—ha ha).\n\nLobsters live on the ocean floor. They need a home base down there, a\nrange within which they hunt for prey and scavenge around for stray edible\nbits and pieces of whatever rains down from the continual chaos of carnage\nand death far above. They want somewher
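In pdfminer.six, `page_numbers` is a collection of zero-indexed page numbers, so `[30,32]` selects pages 30 and 32 rather than everything between them. A minimal sketch for extracting a contiguous range instead; `page_range_numbers` and `extract_page_range` are hypothetical helper names:

```python
def page_range_numbers(start, end):
    """Return the zero-indexed page numbers for the half-open range [start, end)."""
    return list(range(start, end))


def extract_page_range(pdf_path, start, end):
    """Extract the text of pages start..end-1 from pdf_path with pdfminer.six."""
    # Imported here so page_range_numbers stays usable without pdfminer installed
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=page_range_numbers(start, end))
```

For example, `extract_page_range(pdfFile, 30, 56)` would cover the same pages as the manual `reader.pages[30:56]` slice used for Chapter 1.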

In [25]:
extract_text(pdf_file = pdfFile, page_numbers=[24,25])

'well, so that we may organize our otherwise inchoate bodily\nreactions, motivations, and emotions into something articulate and\norganized, and dispense with those concerns that are exaggerated\nand irrational. We need to talk—both to remember and to forget.\nMy client desperately needed someone to listen to him. He also\nneeded to be fully part of additional, larger, and more complex social\ngroups—something he planned in our sessions together, and then\ncarried out on his own. Had he fallen prey to the temptation to\ndenigrate the value of interpersonal interactions and relationships\nbecause of his history of isolation and harsh treatment, he would\nhave had very little chance of regaining his health and well-being.\nInstead, he learned the ropes and joined the world.\n\n\x00\x00\x00\x00\x00\x00 \x00\x00 \x00 \x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\nFor Drs. Sigmund Freud and Carl Jung, the great depth\npsychologists, sanity was a characteristic of t

In [None]:
# Adapted from https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text
# The original answer relies on Python 2 (cStringIO) and the legacy pdfminer API,
# so this version rebuilds the same output with pdfminer.six's extract_pages.
def pdf_to_csv(filename):
    from collections import defaultdict
    from io import StringIO
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    outfp = StringIO()

    for i, page_layout in enumerate(extract_pages(filename)):
        outfp.write("START PAGE %d\n" % i)

        # Group text lines into rows keyed by the negated top edge (y1),
        # so that sorting the keys walks the page from top to bottom
        lines = defaultdict(dict)
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for child in element:
                if isinstance(child, LTTextLine):
                    (_, _, x, y) = child.bbox
                    lines[int(-y)][x] = child.get_text().strip()

        # Within a row, order the cells by x coordinate and join them with ";"
        for y in sorted(lines):
            line = lines[y]
            outfp.write(";".join(line[x] for x in sorted(line)))
            outfp.write("\n")

        outfp.write("END PAGE %d\n" % i)

    return outfp.getvalue()