#### Sunday, February 11, 2024

Continuing to work through the issues of creating embeddings from the text of a PDF file.

#### Saturday, February 10, 2024

How do I load a PDF file and then extract the text from it?

#### Libraries to consider:

* [pypdf](https://github.com/py-pdf/PyPDF) 
* [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
* [pdfplumber](https://github.com/jsvine/pdfplumber) 

In [1]:
import os 

pdfFolder = "../data"

In [2]:
def list_files_in_folder(folder_path):
    # Get the list of files in the specified folder
    files = os.listdir(folder_path)

    # Filter out subdirectories, leaving only files
    files = [f for f in files if os.path.isfile(os.path.join(folder_path, f))]

    return files

In [3]:
pdfFiles = list_files_in_folder(pdfFolder)

In [4]:
for file_name in pdfFiles:
    print(file_name)

Jordan B. Peterson -12 Rules for Life_ An Antidote to Chaos.pdf
Jordan Peterson - Beyond Order_ 12 More Rules For Life.pdf


#### pypdf

* https://github.com/py-pdf/PyPDF
* https://pypdf.readthedocs.io/en/latest/
* mamba install conda-forge::pypdf 

In [5]:
from pypdf import PdfReader

In [6]:
pdfFile = os.path.join(pdfFolder, pdfFiles[0])
print(pdfFile)

../data/Jordan B. Peterson -12 Rules for Life_ An Antidote to Chaos.pdf


In [7]:
reader = PdfReader(pdfFile)

### 12 Rules for Life - An Antidote to Chaos

Extract the text from every chapter of the book into individual objects.

There does not seem to be any obvious way to automatically extract the page range of each chapter, so I will determine the ranges manually.


#### 1) Identify the start and end pages for every chapter

In [8]:
# A helper function to validate a chapter's start and end page values
def validateChapter(chapter):

    limit = 256
    start = chapter[0].extract_text()
    end = chapter[-1].extract_text()
    print("*** START ***")
    print(start[:limit])
    print("...")
    print("*** END ***")
    print(end[-limit:])

##### Foreword

In [None]:
foreword = reader.pages[4:19]
validateChapter(foreword)

##### Overture

In [None]:
overture = reader.pages[19:28]
validateChapter(overture)

##### Chapter 1

In [None]:
chapter1 = reader.pages[30:56]
validateChapter(chapter1)

##### Chapter 2

In [None]:
chapter2 = reader.pages[59:89]
validateChapter(chapter2)


##### Chapter 3

In [None]:
chapter3 = reader.pages[92:107]
validateChapter(chapter3)

##### Chapter 4

In [None]:
chapter4 = reader.pages[109:133]
validateChapter(chapter4)

##### Chapter 5

In [None]:
chapter5 = reader.pages[135:165]
validateChapter(chapter5)

##### Chapter 6

In [None]:
chapter6 = reader.pages[168:180]
validateChapter(chapter6)

##### Chapter 7

In [None]:
chapter7 = reader.pages[182:219]
validateChapter(chapter7)

##### Chapter 8

In [None]:
chapter8 = reader.pages[221:246]
validateChapter(chapter8)

##### Chapter 9

In [None]:
chapter9 = reader.pages[249:270]
validateChapter(chapter9)

##### Chapter 10

In [None]:
chapter10 = reader.pages[273:295]
validateChapter(chapter10)

##### Chapter 11

In [None]:
chapter11 = reader.pages[297:340]
validateChapter(chapter11)

##### Chapter 12

In [None]:
chapter12 = reader.pages[343:360]
validateChapter(chapter12)

##### Coda

In [None]:
coda = reader.pages[361:373]
validateChapter(coda)

Now that we have manually identified the start and end pages of every chapter, let's put those values into a dictionary object then write a simple function to read the text from the PDF file.

In [32]:
bookChapters = { "Foreword" : [4, 19],
                 "Overture": [19, 28],
                 "RULE 1: Stand up straight with your shoulders back" : [30, 56],
                 "RULE 2: Treat yourself like someone you are responsible for helping" : [59, 89],
                 "RULE 3: Make friends with people who want the best for you" : [92, 107],
                 "RULE 4: Compare yourself to who you were yesterday, not to who someone else is today" : [109, 133],
                 "RULE 5: Do not let your children do anything that makes you dislike them" : [135, 165],
                 "RULE 6: Set your house in perfect order before you criticize the world" : [168, 180],
                 "RULE 7: Pursue what is meaningful (not what is expedient)" : [182, 219],
                 "RULE 8: Tell the truth - or, at least, don't lie" : [221, 246],
                 "RULE 9: Assume that the person you are listening to might know something you don't" : [249, 270],
                 "RULE 10: Be precise in your speech" : [273, 295],
                 "RULE 11: Do not bother children when they are skateboarding" : [297, 340],
                 "RULE 12: Pet a cat when you encounter one on the street" : [343, 360],
                 "Coda" : [361, 373]
                }
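
The function for reading the text is not shown in this notebook, so here is a minimal sketch of what it might look like. `read_chapter_text` is a hypothetical name. It assumes each `[start, end]` pair is used as a zero-based, end-exclusive slice, the same way as in the `reader.pages[start:end]` cells above, and that every page object has an `extract_text()` method, as pypdf pages do.

```python
def read_chapter_text(pages, page_range):
    """Join the text of pages[start:end] into one string.

    `pages` is any sequence of objects with an extract_text() method
    (for example reader.pages); `page_range` is a [start, end] pair
    used as a zero-based, end-exclusive slice.
    """
    start, end = page_range
    # Fall back to an empty string if a page yields no text
    texts = [page.extract_text() or "" for page in pages[start:end]]
    return "\n".join(texts)
```

With the objects defined above, something like `read_chapter_text(reader.pages, bookChapters["Overture"])` should return the text of that chapter as a single string.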

#### 2) Tokenize the Text

Now that we know how to read the text, the next step would be to tokenize it.

##### HuggingFace Sentence Transformers

In [35]:
# https://huggingface.co/sentence-transformers/all-mpnet-base-v2
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')


In [36]:
# This does not just tokenize the text, it also generates the embeddings
embeddings = model.encode(sentences)
print(embeddings)

[[ 0.02250259 -0.07829171 -0.02303074 ... -0.00827929  0.02652689
  -0.00201898]
 [ 0.04170233  0.00109744 -0.0155342  ... -0.02181628 -0.0635936
  -0.00875288]]


Let's try encoding (tokenizing and embedding) every page from chapter 1 of the book.

In [37]:
len(chapter1)

26

In [42]:
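pageEmbeddings = []
for page in chapter1:
    text = page.extract_text()
    # print(text[:64])
    # print()
    # Keep one embedding per page instead of overwriting the previous one
    pageEmbeddings.append(model.encode(text))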

OK, that seemed simple enough. What if we embed the text in chunks of a given length?

In [None]:
chunkSize = 256
for page in chapter1:
    text = page.extract_text()
    chunks = [text[i:i+chunkSize] for i in range(0, len(text), chunkSize)]
    for chunk in chunks:
        print(chunk)
        print("NEXT CHUNK")
        print()
    break

Simple enough, but this splits the text in the middle of words. It would probably be better to split on whitespace, so that every chunk contains only whole words.

In [50]:
page = chapter1[2].extract_text()

In [54]:
words = page.split()
len(words)
for word in words[:10]:
    print(word)

underwater,
and
are
seldom
served
with
butter.
However,
they
are


In [61]:
wordCount = 16
textBlock = [words[i:i+wordCount] for i in range(0, len(words), wordCount)]

In [62]:
sentence = " ".join(textBlock[0])

print(sentence)

underwater, and are seldom served with butter. However, they are also similar in important ways. Both 
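
The two cells above can be folded into one small helper. This is only a sketch: `chunk_words` is a hypothetical name, and a chunk size of 16 words is just the value tried above, not a recommendation.

```python
def chunk_words(text, word_count=16):
    """Split text into chunks of at most word_count whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + word_count])
        for i in range(0, len(words), word_count)
    ]


sample = "underwater, and are seldom served with butter. However, they are also similar"
for chunk in chunk_words(sample, word_count=5):
    print(chunk)
```

Because `model.encode` accepts a list of strings (as in the two-sentence example earlier), the list returned by `chunk_words(page)` could be passed to it directly to get one embedding per chunk.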


In [56]:
textBlock[0]

['underwater,',
 'and',
 'are',
 'seldom',
 'served',
 'with',
 'butter.',
 'However,',
 'they',
 'are',
 'also',
 'similar',
 'in',
 'important',
 'ways.',
 'Both',
 'are',
 'obsessed',
 'with',
 'status',
 'and',
 'position,',
 'for',
 'example,',
 'like',
 'a',
 'great',
 'many',
 'creatures.',
 'The',
 'Norwegian',
 'zoologist',
 'and',
 'comparative',
 'psychologist',
 'Thorlief',
 'Schjelderup-Ebbe',
 'observed',
 '(back',
 'in',
 '1921)',
 'that',
 'even',
 'common',
 'barnyard',
 'chickens',
 'establish',
 'a',
 '“pecking',
 'order.”',
 '3',
 'The',
 'determination',
 'of',
 'Who’s',
 'Who',
 'in',
 'the',
 'chicken',
 'world',
 'has',
 'important',
 'implications',
 'for',
 'each',
 'individual',
 'bird’s',
 'survival,',
 'particularly',
 'in',
 'times',
 'of',
 'scarcity.',
 'The',
 'birds',
 'that',
 'always',
 'have',
 'priority',
 'access',
 'to',
 'whatever',
 'food',
 'is',
 'sprinkled',
 'out',
 'in',
 'the',
 'yard',
 'in',
 'the',
 'morning',
 'are',
 'the',
 'celebri

In [None]:
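chunkSize = 256
for page in chapter1:
    text = page.extract_text()
    words = text.split()
    # One possible way to finish this loop (a sketch): pack whole words
    # into chunks of at most chunkSize characters
    chunks, current = [], ""
    for word in words:
        if current and len(current) + 1 + len(word) > chunkSize:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    break  # only look at the first page for now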

#### pdfminer.six

* https://github.com/pdfminer/pdfminer.six
* https://pdfminersix.readthedocs.io/en/latest/
* mamba install conda-forge::pdfminer.six

In [28]:
import pdfminer
from pdfminer.high_level import extract_text
from pdfminer.high_level import extract_pages
pdfminer.__version__

'20231228'

In [34]:
pdfFile

'../data/Jordan B. Peterson -12 Rules for Life_ An Antidote to Chaos.pdf'

In [35]:
text = extract_text(pdfFile)
# 7.8s  Beyond Order
# 32.9s 12 Rules

In [None]:
print(repr(text))

In [None]:
print(text)

In [None]:
from pdfminer.high_level import extract_pages
for page_layout in extract_pages(pdfFile):
    for element in page_layout:
        print(element)
        
# 8.2s

In [None]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages(pdfFile):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

The next two cells are from [How to resolve the target page of ToC entries](https://pdfminersix.readthedocs.io/en/latest/howto/toc_target_page.html)

In [30]:
from enum import Enum, auto
from pathlib import Path
from typing import Any, Optional
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfpage import PDFPage, LITERAL_PAGE
from pdfminer.pdfparser import PDFParser, PDFSyntaxError
from pdfminer.pdftypes import PDFObjRef


class PDFRefType(Enum):
    """PDF reference type."""

    PDF_OBJ_REF = auto()
    DICTIONARY = auto()
    LIST = auto()
    NAMED_REF = auto()
    UNK = auto()  # fallback


class RefPageNumberResolver:
    """PDF Reference to page number resolver.

    .. note::

       Remote Go-To Actions (see 12.6.4.3 in
       `https://www.adobe.com/go/pdfreference/`__)
       are out of the scope of this resolver.

    Attributes:
        document (:obj:`pdfminer.pdfdocument.PDFDocument`):
            The document that contains the references.
        objid_to_pagenum (:obj:`dict[int, int]`):
            Mapping from an object id to the number of the page that contains
            that object.
    """

    def __init__(self, document: PDFDocument):
        self.document = document
        # obj_id -> page_number
        self.objid_to_pagenum: dict[int, int] = {
            page.pageid: page_num
            for page_num, page in enumerate(PDFPage.create_pages(document), 1)
        }

    @classmethod
    def get_ref_type(cls, ref: Any) -> PDFRefType:
        """Get the type of a PDF reference."""
        if isinstance(ref, PDFObjRef):
            return PDFRefType.PDF_OBJ_REF
        elif isinstance(ref, dict) and "D" in ref:
            return PDFRefType.DICTIONARY
        elif isinstance(ref, list) and any(isinstance(e, PDFObjRef) for e in ref):
            return PDFRefType.LIST
        elif isinstance(ref, bytes):
            return PDFRefType.NAMED_REF
        else:
            return PDFRefType.UNK

    @classmethod
    def is_ref_page(cls, ref: Any) -> bool:
        """Check whether a reference is of type '/Page'.

        Args:
            ref (:obj:`Any`):
                The PDF reference.

        Returns:
            :obj:`bool`: :obj:`True` if the reference references
            a page, :obj:`False` otherwise.
        """
        return isinstance(ref, dict) and "Type" in ref and ref["Type"] is LITERAL_PAGE

    def resolve(self, ref: Any) -> Optional[int]:
        """Resolve a PDF reference to a page number recursively.

        Args:
            ref (:obj:`Any`):
                The PDF reference.

        Returns:
            :obj:`Optional[int]`: The page number or :obj:`None`
            if the reference could not be resolved (e.g., remote Go-To
            Actions or malformed references).
        """
        ref_type = self.get_ref_type(ref)

        if ref_type is PDFRefType.PDF_OBJ_REF and self.is_ref_page(ref.resolve()):
            return self.objid_to_pagenum.get(ref.objid)
        elif ref_type is PDFRefType.PDF_OBJ_REF:
            return self.resolve(ref.resolve())

        if ref_type is PDFRefType.DICTIONARY:
            return self.resolve(ref["D"])

        if ref_type is PDFRefType.LIST:
            # Get the PDFObjRef in the list (usually first element).
            return self.resolve(next(filter(lambda e: isinstance(e, PDFObjRef), ref)))

        if ref_type is PDFRefType.NAMED_REF:
            return self.resolve(self.document.get_dest(ref))

        return None  # PDFRefType.UNK

In [31]:
def print_outlines(file: str) -> None:
    """Pretty print the outlines (ToC) of a PDF document."""

    with open(file, "rb") as fp:
        try:
            
            parser = PDFParser(fp)
            document = PDFDocument(parser)

            ref_pagenum_resolver = RefPageNumberResolver(document)

            outlines = list(document.get_outlines())

            if not outlines:
                print("No outlines found.")
            for (level, title, dest, a, se) in outlines:
                if dest:
                    page_num = ref_pagenum_resolver.resolve(dest)
                elif a:
                    page_num = ref_pagenum_resolver.resolve(a)
                elif se:
                    page_num = ref_pagenum_resolver.resolve(se)
                else:
                    page_num = None

                # Calculate leading spaces and filling dots for formatting.
                leading_spaces = (level-1) * 4
                fill_dots = 80 - len(title) - leading_spaces

                print(
                    f"{' ' * leading_spaces}"
                    f"{title}",
                    f"{'.' * fill_dots}",
                    f"{page_num!s:>3}"  # !s so an unresolved (None) page number still prints
                )
        except PDFNoOutlines:
            print("No outlines found.")
        except PDFSyntaxError:
            print("Corrupted PDF or non-PDF file.")
        finally:
            try:
                parser.close()
            except NameError:
                pass  # nothing to do

In [36]:
print_outlines(pdfFile)

Title Page ......................................................................   3
Foreword ........................................................................   5
Overture ........................................................................  20
‌RULE 1: Stand up straight with your shoulders back .............................  30
‌RULE 2: Treat yourself like someone you are responsible for helping ............  59
‌RULE 3: Make friends with people who want the best for you .....................  92
‌RULE 4: Compare yourself to who you were yesterday, not to who someone else is today  109
‌RULE 5: Do not let your children do anything that makes you dislike them ....... 135
‌RULE 6: Set your house in perfect order before you criticize the world ......... 168
‌RULE 7: Pursue what is meaningful (not what is expedient) ...................... 182
‌RULE 8: Tell the truth—or, at least, don’t lie ................................. 221
‌RULE 9: Assume that the person you are listening

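Since the PDF does carry an outline, the chapter ranges might not need to be found entirely by hand. The sketch below turns outline entries into slice ranges like the ones in `bookChapters`. `outline_to_page_ranges` and `last_page` are hypothetical names. It assumes the page numbers are 1-based (the resolver above numbers pages with `enumerate(..., 1)`) and that a section runs up to the page before the next outline entry, so any notes or blank pages between chapters end up inside the range and the result would still need checking.

```python
def outline_to_page_ranges(outline, last_page):
    """Turn (title, first_page) outline entries into slice ranges.

    `outline` is a list of (title, page_number) pairs with 1-based page
    numbers, sorted by page. `last_page` is the 1-based number of the final
    page to include for the last entry. Returns {title: [start, end]} where
    start and end form a zero-based, end-exclusive slice.
    """
    ranges = {}
    for index, (title, first_page) in enumerate(outline):
        if index + 1 < len(outline):
            # A section ends where the next outline entry begins
            end = outline[index + 1][1] - 1
        else:
            end = last_page
        ranges[title] = [first_page - 1, end]
    return ranges


# Entries copied from the printed outline above; RULE 2 starts on page 59,
# so page 58 is taken as the last page of RULE 1
outline = [("Foreword", 5), ("Overture", 20),
           ("RULE 1: Stand up straight with your shoulders back", 30)]
print(outline_to_page_ranges(outline, last_page=58))
```

The computed ranges will not always match the hand-picked ones exactly, so running them through a check like `validateChapter` is still worthwhile.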
In [40]:
pages = extract_pages(pdf_file = pdfFile)

In [45]:
# Note: page_numbers is a list of zero-indexed pages to extract, not a (start, end) range
pages = extract_pages(pdf_file = pdfFile, page_numbers=[0,10000])
pages

<generator object extract_pages at 0x7f94c070b760>

In [39]:

extract_text(pdf_file = pdfFile, page_numbers=[30,32])

'R U L E   1\n\nSTAND UP STRAIGHT WITH YOUR\nSHOULDERS BACK\n\nLOBSTERS—AND TERRITORY\n\nIf you are like most people, you don’t often think about lobsters2—unless\nyou’re eating one. However, these interesting and delicious crustaceans are\nvery much worth considering. Their nervous systems are comparatively\nsimple, with large, easily observable neurons, the magic cells of the brain.\nBecause of this, scientists have been able to map the neural circuitry of\nlobsters very accurately. This has helped us understand the structure and\nfunction of the brain and behaviour of more complex animals, including\nhuman beings. Lobsters have more in common with you than you might think\n(particularly when you are feeling crabby—ha ha).\n\nLobsters live on the ocean floor. They need a home base down there, a\nrange within which they hunt for prey and scavenge around for stray edible\nbits and pieces of whatever rains down from the continual chaos of carnage\nand death far above. They want somewher

In [25]:
extract_text(pdf_file = pdfFile, page_numbers=[24,25])

'well, so that we may organize our otherwise inchoate bodily\nreactions, motivations, and emotions into something articulate and\norganized, and dispense with those concerns that are exaggerated\nand irrational. We need to talk—both to remember and to forget.\nMy client desperately needed someone to listen to him. He also\nneeded to be fully part of additional, larger, and more complex social\ngroups—something he planned in our sessions together, and then\ncarried out on his own. Had he fallen prey to the temptation to\ndenigrate the value of interpersonal interactions and relationships\nbecause of his history of isolation and harsh treatment, he would\nhave had very little chance of regaining his health and well-being.\nInstead, he learned the ropes and joined the world.\n\n\x00\x00\x00\x00\x00\x00 \x00\x00 \x00 \x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\nFor Drs. Sigmund Freud and Carl Jung, the great depth\npsychologists, sanity was a characteristic of t
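
One thing to watch in the two calls above: `page_numbers` is a list of individual zero-indexed pages, so `[30,32]` requests pages 30 and 32 rather than everything from 30 to 32. A minimal sketch of extracting a contiguous range follows; `page_indices` and `extract_page_range` are hypothetical helper names, not part of pdfminer.six.

```python
def page_indices(start, end):
    """Return every zero-based page index in the half-open range [start, end)."""
    return list(range(start, end))


def extract_page_range(pdf_path, start, end):
    """Extract the text of zero-based pages start..end-1 with pdfminer.six."""
    # Imported here so page_indices can be used without pdfminer installed
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=page_indices(start, end))
```

For example, `extract_page_range(pdfFile, 30, 56)` would cover the same pages as the `reader.pages[30:56]` slice used for chapter 1 with pypdf.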

In [None]:
# Adapted from https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text
# The original answer targets Python 2 and the legacy pdfminer API (cStringIO,
# LTTextItem, PDFDocument.initialize), which are not available in Python 3 with
# pdfminer.six. This version keeps the same idea (group text by vertical
# position, then join by horizontal position) but works on whole text lines.
from collections import defaultdict
from io import StringIO

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine


def pdf_to_csv(filename):
    outfp = StringIO()

    for i, page_layout in enumerate(extract_pages(filename)):
        outfp.write("START PAGE %d\n" % i)

        # lines[row][x] -> text; rows are keyed by -y so sorting goes top to bottom
        lines = defaultdict(dict)
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    if isinstance(text_line, LTTextLine):
                        _, _, x, y = text_line.bbox
                        lines[int(-y)][x] = text_line.get_text().strip()

        for y in sorted(lines):
            line = lines[y]
            outfp.write(";".join(line[x] for x in sorted(line)))
            outfp.write("\n")

        outfp.write("END PAGE %d\n" % i)

    return outfp.getvalue()