Support for outline item external references #2648

dawillcox · 2024-05-15T18:06:28Z

Explanation

I'm not sure if this is a request for a new feature or documentation to explain how this is already possible...

My knowledge of the PDF internal format is microscopic, but I know that PDF supports internal links (to images, pages, etc.) and external links (web pages, other files, email addresses, ...). I don't see how pypdf supports external links.

Here's my situation: I have a PDF file (from a CD I purchased) that has outline links to pages and external files. It's a scan of a book, almost 1200 pages, so the links to sections of the document are quite handy. Trouble is, the pages are all just images. It would be very useful to be able to search for text and copy text for use elsewhere. (Fair use, of course.)

Yes, I know there are resources that OCR scan PDF files, but everything I've tried balks at a file that large, at least without a charge.

So I:

1. Split the big file into 100-page chunks.
2. OCR scanned each chunk.
3. Merged the scanned chunks back into a single file.

Which worked perfectly. Except, while the text in the result is all nicely scanned, the outline is gone. So, I'm using pypdf to merge the original document's outline into the scanned document. And that works fine for the outline options that are just headers, and links to pages within the document, but the external links are gone.

See code example below. This is just the inner logic to deal with a single outline entry, obviously there's outer logic to deal with lists and embedded lists.

Code Example

Here's what I'm doing now:

from pypdf import PdfReader, PdfWriter

# Setup is basically this:
from_file = PdfReader(open(ORIGINAL_FILE, 'rb'))
scanned_file = PdfReader(open(SCANNED_FILE, 'rb'))
to_file = PdfWriter()
to_file.append_pages_from_reader(scanned_file)

# so at this point, from_file has the desired outlines, and
# to_file has all of the OCR scanned pages but no outlines. 
# (Or much of anything else.)

# Then follows loops to apply Destinations from from_file.outline to to_file. 
# Omitting the looping logic, each destination is handled as:

        pgno = from_file.get_destination_page_number(outline)
        if pgno is None:
            next_parent = to_file.add_outline_item_dict(outline, parent=parent_outline)
        else:
            next_parent = to_file.add_outline_item(outline.title, page_number=pgno, parent=parent_outline)

# next_parent becomes parent_outline for embedded lists.

# This works fine for references to pages, but external references are lost.
# They just become an item in the outline, but they don't behave like 
# in the original document.

So the question is: Is this something that can be done with the current release, but it's too obscure for me to figure out? Or would it be a useful addition in the future? Said feature probably would need a way to tell if an existing outline entry was an external reference, plus a way to specify such a reference in a new file.

Though now that I think of it, outlines can point to other internal things like images. Maybe those are IndirectObjects so already supported?

pubpub-zz · 2024-05-15T18:16:23Z

I may have an idea, but I would need an example of an original page and the output of the OCR processing to confirm it

dawillcox · 2024-05-15T18:23:56Z

The problem is the file is quite large, and just a single page wouldn't demonstrate the problem. I could try this on a smaller file, though.

pubpub-zz · 2024-05-15T18:32:49Z

I would like to see if I can merge the scanned data back into the original page.
Let me do my test 😉... It should be worth it.

dawillcox · 2024-05-15T18:47:32Z

That would be awesome, but I couldn't see how I'd even start. You'd have to extract the text along with all of the data that matched text to location in the image. I see no way to do that.

pubpub-zz · 2024-05-15T19:04:35Z

Please do what I've asked:
Extract one page from your original document:

import pypdf

w = pypdf.PdfWriter()
w.append("doc_source.pdf", pages=[10])  # replace 10 with the zero-based index of a page whose text is not sensitive or copyrighted
w.write("one_page_out.pdf")

Apply the OCR process you've selected (out of pypdf's scope).
Publish one_page_out.pdf and the processed page.

dawillcox · 2024-05-15T20:32:11Z

The trouble with that is I don't know how I'd create the outlines items to correspond to the one page.

So here's a variation. Two files that reproduce the issue without being huge and with just a cover page, so nobody should be unhappy about content. Files will be attached, I hope. Here's my tacky code to do what I want:

from pypdf.generic import Destination
from pypdf import PdfReader, PdfWriter

# This stands in for the original file. All of the images are removed,
# just the first couple of pages are there.
ORIGINAL_FILE = 'with_outline.pdf'

# This stands in for the OCR scanned file. The outline is gone, but
# a couple of pages have text added that isn't in ORIGINAL_FILE.
# This will verify that the final product has pages from SCANNED_FILE.
SCANNED_FILE = 'altered_pages.pdf'

# This is the output of the merge.  A couple of pages are marked to verify
# that the 'A' and 'B' outline items go to the right place.
OUTPUT_FILE = 'after_merge.pdf'


def copy_index(from_file: PdfReader, to_file: PdfWriter, outlines, parent_outline=None):
    next_parent = parent_outline
    for outline in outlines:
        if isinstance(outline, Destination):
            pgno = from_file.get_destination_page_number(outline)
            if pgno is None:
                next_parent = to_file.add_outline_item_dict(outline,
                                                            parent=parent_outline)
            else:
                next_parent = to_file.add_outline_item(outline.title,
                                                       page_number=pgno,
                                                       parent=parent_outline)
        elif isinstance(outline, list):
            copy_index(from_file, to_file, outline, parent_outline=next_parent)


# Pass the paths directly so pypdf opens and reads the files itself,
# rather than leaving file handles from open() unclosed.
index_pdf = PdfReader(ORIGINAL_FILE)
scanned_pdf = PdfReader(SCANNED_FILE)
writer = PdfWriter()
writer.append_pages_from_reader(scanned_pdf)
copy_index(index_pdf, writer, index_pdf.outline)
writer.write(OUTPUT_FILE)

altered_pages.pdf
with_outline.pdf

pubpub-zz · 2024-05-15T20:43:40Z

No worries about the outlines; these will be copied naturally from your scanned document.
What interests me in your documents is altered_pages. Looking at page 2, I can see clear text: can you clarify whether this is the output of the OCR? Can you also extract and send the original page? If you use the code I've provided, the outlines should be extracted too.

Once I have both, I should be able to propose some code to merge the text/content from altered_pages. I need at least one page with the images; the file with_outline is useless for me.

dawillcox · 2024-05-15T21:01:54Z

I'm clearly not communicating this well.

Yes, generally when you do an OCR scan of a document, the outlines are preserved. Task done, game over. No complaints here.

The problem is that my actual document is so large that scanners balk. So I split the big file into smaller chunks, scanned each chunk, then joined the chunks into one big file. That leaves me with another big file with all of the scanned text but no outline. I'm trying to copy the outline from the original file to the scanned one.

The files and code I just uploaded demonstrate the problem of copying indexes; the content of the pages shouldn't be an issue.

But hmm. I wonder if I could load the original document into a writer, delete the pages, then add the pages from the scanned document. That would be way simpler.

But still, wouldn't it be nice if pypdf had some kind of support for external links?

Update: Replacing the pages in the original document with the scanned pages doesn't work, presumably because the outline refers to the actual page, and if the page is removed the outline can't point at it any more.

pubpub-zz · 2024-05-15T21:29:54Z

What I have in mind is the following approach. Using

from pypdf import PdfWriter

w = PdfWriter()
w.append("input.pdf", pages=(0, 50))    # pages 0 to 49
w.write("chunk1.pdf")
w = PdfWriter()
w.append("input.pdf", pages=(50, 100))  # pages 50 to 99
w.write("chunk2.pdf")

you will get chunks of the input that should have kept their outlines.
From your comments, I understand that outlines are preserved by the OCR,
so if you use:

w = PdfWriter()
w.append("OCR_chunk1.pdf")
w.append("OCR_chunk2.pdf")
w.write("fullOCR_with_outlines.pdf")

Should work.
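
The two-chunk split above generalizes to any chunk size. A minimal sketch, assuming a hypothetical helper named chunk_ranges (not part of pypdf) that yields (start, stop) tuples in the same form as the page ranges passed to append above:

```python
from typing import Iterator, Tuple


def chunk_ranges(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) page ranges; stop is exclusive, like range()."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_pages, chunk_size):
        # The last chunk is shortened so it never runs past the final page.
        yield (start, min(start + chunk_size, total_pages))


# A 1200-page file split into 100-page chunks, as described in the thread.
ranges = list(chunk_ranges(1200, 100))
print(ranges[0], ranges[-1], len(ranges))
```

Each tuple could then be handed to append as the page range for a fresh PdfWriter, one writer per chunk, in the same way as the two hand-written chunks.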

About your proposal:

> But hmm. I wonder if I could load the original document into a writer, delete the pages, then add the pages from the scanned document. That would be way simpler.

You should not need to remove the images:
use w = PdfWriter("original.pdf") to create the writer, then use .merge_page(page_from_reader, over=False) to hide the text behind the image.

> But still, wouldn't it be nice if pypdf had some kind of support for external links?

I agree. I thought it was already in; I need to check more.

dawillcox · 2024-05-16T02:11:15Z

So here's the problem. Outline items that point to a page refer to a specific page object, not just a page number. (Or an image, or other internal object.) That way, if you have an index set up, and then add or remove pages, the outline item still points to the same content.

If you remove a page I don't know if the index item is deleted or just doesn't point anywhere at all. If you remove a page with clean=True, the deleted page is replaced by a blank one and any index still points to it (I think).

Unfortunately, there seems to be no way to replace the content of a page, keeping the page ID the same but new content.

And looking at outline content in a debugger, I haven't been able to suss out how external destinations are specified. That seems to be thoroughly obfuscated in the code.
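
For what it's worth, the raw outline tree can be inspected without a debugger. In the PDF specification, each outline item is a dictionary whose /First entry points to its first child and whose /Next entry points to its next sibling. A minimal sketch, assuming a hypothetical helper named walk_outline that accepts any dict-like node; plain dictionaries and a made-up file name stand in for real PDF objects so the example runs without a file:

```python
from typing import Any, Iterator, Mapping, Tuple


def walk_outline(root: Mapping[str, Any], depth: int = 0) -> Iterator[Tuple[int, Mapping[str, Any]]]:
    """Yield (depth, node) for every outline item below root, in document order."""
    node = root["/First"] if "/First" in root else None
    while node is not None:
        yield depth, node
        # Children come before the next sibling.
        yield from walk_outline(node, depth + 1)
        node = node["/Next"] if "/Next" in node else None


# Stand-in data: a page entry with one child, then an entry that opens another file.
child = {"/Title": "Section 1.1", "/Dest": ["<page ref>", "/Fit"]}
external = {"/Title": "External file", "/A": {"/S": "/GoToR", "/F": "other.pdf"}}
chapter = {"/Title": "Chapter 1", "/Dest": ["<page ref>", "/Fit"], "/First": child, "/Next": external}
outlines_root = {"/Type": "/Outlines", "/First": chapter}

for depth, item in walk_outline(outlines_root):
    keys = sorted(k for k in item if k not in ("/First", "/Next"))
    print("  " * depth + item["/Title"], keys)
```

With pypdf, the same function should work on the real objects, since its dictionary objects are dict subclasses; the root would be reader.trailer["/Root"]["/Outlines"] when the catalog has an /Outlines entry. This assumes a well-formed file: an outline whose /Next links form a cycle would loop forever in this sketch.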

Bottom line, I finally got ocrmypdf working. (I had problems with the ghostscript library before.) I found that:

- It behaved much like the online tool I used. Outline items for pages were fine. External references were disabled in the same way. I'm guessing there's something tricky about those external links. That's OK. I can live without them.
- The ocrmypdf dies quite quickly. Blowing a stack, perhaps?

I'm guessing that the best bet would be to somehow copy the scanned text and the hints that say where it's placed and apply that to the original pages. No clue where to start for that, though. Certainly no clues from pypdf. And I can't be sure that the original pages weren't adjusted somehow.

So, bottom line, the file I have, absent the external links, works well enough for my purposes. I'd love to know how external references and/or the OCR-applied text works, and could be moved from one file to another. But at this point it's more a matter of intellectual curiosity.

stefan6419846 · 2024-05-16T07:11:57Z

Your code seems to handle outlines only. Shouldn't external links (however this would behave with scanned files) rather be an annotation (https://pypdf.readthedocs.io/en/latest/user/adding-pdf-annotations.html#link)?

dawillcox · 2024-05-16T14:57:41Z

Hmm. You may be onto something. But can annotations be on an outline item? The external links (to other files) in the original file seem to be attached to the outline (TOC) entries, not pages. If you click on an entry on the TOC it opens an external file, or goes to a page in the file. The latter is what my code does, but it can't figure out the external links.

Conversely, I'm guessing that OCR'd text may be an annotation and I could use that to copy the OCR'd text to the original document.

But doing a preliminary investigation, what the documentation shows for finding annotations on a page doesn't find annotations on outline items.
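
On whether an annotation has to sit on an outline item: per the PDF specification, it does not. An outline item carries either a /Dest entry (a destination inside the document) or an /A entry holding an action dictionary, and a link annotation can hold the same kind of action dictionary. The action's /S entry names its type: /GoTo for an internal jump, /GoToR for a destination in another PDF file, /Launch for opening a file or application, /URI for a web address. A minimal sketch, assuming a hypothetical helper named describe_outline_target, with plain dictionaries and a made-up file name standing in for real PDF objects:

```python
from typing import Any, Mapping

# Action types from the PDF specification that leave the current document.
EXTERNAL_ACTIONS = {"/GoToR", "/Launch", "/URI"}


def describe_outline_target(node: Mapping[str, Any]) -> str:
    """Classify an outline item dictionary by where clicking it would lead."""
    if "/Dest" in node:
        return "internal destination"
    if "/A" in node:
        action_type = str(node["/A"]["/S"])
        if action_type == "/GoTo":
            return "internal destination (via action)"
        if action_type in EXTERNAL_ACTIONS:
            return f"external reference ({action_type})"
        return f"other action ({action_type})"
    return "no target (heading only)"


# Stand-in outline items for illustration only.
print(describe_outline_target({"/Title": "A", "/Dest": ["<page ref>", "/Fit"]}))
print(describe_outline_target({"/Title": "External file", "/A": {"/S": "/GoToR", "/F": "other.pdf"}}))
print(describe_outline_target({"/Title": "Part One"}))
```

Which of these action types the file in this thread actually uses is not established here; the sketch only shows where to look.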

stefan6419846 · 2024-05-16T15:08:47Z

> Conversely, I'm guessing that OCR'd text may be an annotation and I could use that to copy the OCR'd text to the original document.

Usually no. This probably just is a basic text layer, maybe with an "invisible" font which allows copying the text, but does not conflict with the possibly different font and text parameters of the scanned image.
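
One common way to make such a layer invisible, given here as general PDF background rather than as what any particular OCR tool does, is the text rendering mode operator: `3 Tr` in a page's content stream selects mode 3, which neither fills nor strokes the glyphs. A rough heuristic sketch, assuming a hypothetical helper named has_invisible_text that receives already-decoded content stream bytes; it splits on whitespace, so a string literal that happens to contain the same tokens can fool it:

```python
def has_invisible_text(content: bytes) -> bool:
    """Return True if the content stream sets text rendering mode 3 (invisible)."""
    tokens = content.split()
    # Look for the operand "3" immediately followed by the operator "Tr".
    return any(
        operand == b"3" and operator == b"Tr"
        for operand, operator in zip(tokens, tokens[1:])
    )


visible = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
hidden = b"BT 3 Tr /F1 12 Tf 72 720 Td (Hello) Tj ET"
print(has_invisible_text(visible), has_invisible_text(hidden))
```

With pypdf, the bytes could come from page.get_contents().get_data() whenever get_contents() does not return None.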

dawillcox · 2024-05-16T16:10:39Z

Just stepping through the .extract_text() code in page, I can see that pulling out the OCR results and applying them to another page would be a challenge (putting it mildly).

But can you suggest how annotations and outlines might be related? For example, if you click "Welcome Page" in the TOC of the with_outline.pdf file I attached earlier, it tries to open another document. Which isn't there so it fails, but at least the reader tried to open the file.

pubpub-zz · 2024-05-16T19:52:27Z

I've finally been able to generate a test as I was expecting:
file with image only but with outline:
tt1_outline.pdf
output of the OCR: the text is invisible but present and with an image on top:
tt1-sortie.pdf

then to merge it we can use the following code:

import pypdf
w = pypdf.PdfWriter("tt1_outline.pdf")
w2 = pypdf.PdfWriter("tt1-sortie.pdf")
w2.remove_images()   # to remove the scanned image before merging
w.pages[0].merge_page(w2.pages[0], over=False)  # put the OCR page behind so it does not cover the original image
w.write("tt1_merged.pdf")
w.pages[0].extract_text(extraction_mode="layout")
#returns : 'PDF        Reference\n  sixthedition\n\n\n  Adobe°  Portable  Document   Format\n     Version1.7\n     November2006\n\n\n\n     Adobe  SystemsIncorporated'

the output:
tt1_merged.pdf

pubpub-zz · 2024-05-16T20:05:04Z

> So here's the problem. Outline items that point to a page refer to a specific page object, not just a page number. (Or an image, or other internal object.) That way, if you have an index set up, and then add or remove pages, the outline item still points to the same content.

If you look into the PDF specification, this is how pages are pointed to. Page numbers are reserved for links to pages in external files.

> If you remove a page I don't know if the index item is deleted or just doesn't point anywhere at all. If you remove a page with clean=True, the deleted page is replaced by a blank one and any index still points to it (I think).

Correct.

> Unfortunately, there seems to be no way to replace the content of a page, keeping the page ID the same but new content.

Using replace_content will not transfer the resources. My solution works.

> And looking at outline content in a debugger, I haven't been able to suss out how external destinations are specified. That seems to be thoroughly obfuscated in the code.

I recommend using pdfbox with the debug option.

dawillcox · 2024-05-17T02:08:41Z

Well, except I had to change

w = pypdf.PdfWriter("tt1_outline.pdf")

to

w = PdfWriter()
w.clone_document_from_reader(PdfReader("tt1_outline.pdf"))

for each file, (otherwise the pages list was empty) but that worked! Magic!

I ran this on my monster file, and it worked too. (Took a while, though. A lot of mucking with stuff happens in there.) Thanks loads for your help! I never would have figured this out by myself.

I still wonder how external links from outline entries work, but at this point it's just intellectual curiosity, and I have plenty other things to keep me busy.

pubpub-zz · 2024-05-17T05:12:53Z

Oops, I'm on the dev version.
Use PdfWriter(clone_from='input pdf')

stefan6419846 · 2024-05-27T11:29:36Z

I am going to close this issue for now as it sounds solved.

dawillcox · 2024-05-27T21:35:39Z

Well, yes, we found a solution to my particular problem. It still would be nice if support for external links (in table of contents and maybe other places), both creating and finding, could be added to the list of possible enhancements.

stefan6419846 · 2024-05-28T08:33:05Z

External links should already be supported. For further discussions or issues about it, I recommend opening a new discussion with an explicit example file.

stefan6419846 added the is-question and workflow-annotation labels May 17, 2024

stefan6419846 closed this as completed May 27, 2024
