Support for outline item external references #2648

Closed
dawillcox opened this issue May 15, 2024 · 21 comments
Labels
is-question Rather a question than an issue. Should usually be a Discussion instead workflow-annotation Everything about annotating PDF files

Comments

@dawillcox

Explanation

I'm not sure if this is a request for a new feature or documentation to explain how this is already possible...

My knowledge of PDF internal format is microscopic, but I know that PDF supports internal links (to images, pages, etc.) and external links (web pages, other files, email addresses, ...). I don't see how pypdf supports external links.

Here's my situation: I have a PDF file (from a CD I purchased) that has outline links to pages and external files. It's a scan of a book, almost 1200 pages, so the links to sections of the document are quite handy. Trouble is, the pages are all just images. It would be very useful to be able to search for text and copy text for use elsewhere. (Fair use, of course.)

Yes, I know there are resources that OCR scan PDF files, but everything I've tried balks at a file that large, at least without a charge.

So I:

  1. Split the big file into 100 page chunks.
  2. OCR scanned each chunk.
  3. Merged the scanned chunks back into a single file.
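
The split step in the list above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the author's actual script: `chunk_ranges` and `split_into_chunks` are hypothetical helper names, the output file naming is invented for illustration, and only the 100-page chunk size comes from the description. It relies on `PdfWriter.append` accepting a `(start, stop)` tuple for `pages`.

```python
from pathlib import Path
from typing import List, Tuple


def chunk_ranges(total_pages: int, chunk_size: int = 100) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) page ranges that cover total_pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_into_chunks(source: str, out_dir: str, chunk_size: int = 100) -> List[Path]:
    """Write one PDF per chunk and return the paths (hypothetical helper)."""
    # Imported here so chunk_ranges() stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    total = len(PdfReader(source).pages)
    paths = []
    for index, (start, stop) in enumerate(chunk_ranges(total, chunk_size), start=1):
        writer = PdfWriter()
        writer.append(source, pages=(start, stop))  # stop is exclusive
        path = Path(out_dir) / f"chunk{index:03d}.pdf"  # assumed naming scheme
        writer.write(str(path))
        paths.append(path)
    return paths
```

For a 250-page file, `chunk_ranges(250)` gives `[(0, 100), (100, 200), (200, 250)]`, so the last chunk is simply shorter than the others.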

Which worked perfectly. Except, while the text in the result is all nicely scanned, the outline is gone. So, I'm using pypdf to merge the original document's outline into the scanned document. And that works fine for the outline options that are just headers, and links to pages within the document, but the external links are gone.

See code example below. This is just the inner logic to deal with a single outline entry, obviously there's outer logic to deal with lists and embedded lists.

Code Example

Here's what I'm doing now:

from pypdf import PdfReader, PdfWriter

# Setup is basically this:
from_file = PdfReader(open(ORIGINAL_FILE, 'rb'))
scanned_file = PdfReader(open(SCANNED_FILE, 'rb'))
to_file = PdfWriter()
to_file.append_pages_from_reader(scanned_file)

# so at this point, from_file has the desired outlines, and
# to_file has all of the OCR scanned pages but no outlines. 
# (Or much of anything else.)

# Then follows loops to apply Destinations from from_file.outline to to_file. 
# Omitting the looping logic, each destination is handled as:

        pgno = from_file.get_destination_page_number(outline)
        if pgno is None:
            next_parent = to_file.add_outline_item_dict(outline, parent=parent_outline)
        else:
            next_parent = to_file.add_outline_item(outline.title, page_number=pgno, parent=parent_outline)

# next_parent becomes parent_outline for embedded lists.

# This works fine for references to pages, but external references are lost.
# They just become an item in the outline, but they don't behave like 
# in the original document.

So the question is: Is this something that can be done with the current release, but it's too obscure for me to figure out? Or would it be a useful addition in the future? Said feature probably would need a way to tell if an existing outline entry was an external reference, plus a way to specify such a reference in a new file.

Though now that I think of it, outlines can point to other internal things like images. Maybe those are IndirectObjects so already supported?
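
On the question of how to tell whether an outline entry is an external reference: in the PDF format, an outline item carries either a `/Dest` entry or an `/A` action dictionary, and the action's `/S` subtype names the kind of target (`/GoTo` for a place in the same file, `/GoToR` or `/Launch` for another file, `/URI` for a web address). The sketch below classifies one raw outline item dictionary on that basis. `classify_outline_item` and `_resolve` are hypothetical names, the returned labels are arbitrary, and the function works on plain dictionaries as well as pypdf dictionary objects.

```python
from typing import Any, Mapping, Optional, Tuple


def _resolve(obj: Any) -> Any:
    """Follow a pypdf indirect reference if given one; other values pass through."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def classify_outline_item(node: Mapping[str, Any]) -> Tuple[str, Optional[str]]:
    """Return (kind, target) for one raw outline item dictionary."""
    node = _resolve(node)
    if "/Dest" in node:
        return ("internal", None)
    if "/A" not in node:
        return ("heading", None)  # title only, no target at all
    action = _resolve(node["/A"])
    subtype = str(_resolve(action.get("/S")))
    if subtype == "/GoTo":
        return ("internal", None)
    if subtype == "/URI":
        return ("uri", str(_resolve(action.get("/URI"))))
    if subtype in ("/GoToR", "/Launch"):
        file_spec = _resolve(action.get("/F"))
        # A file specification is either a string or a dictionary with /F or /UF.
        if isinstance(file_spec, Mapping):
            file_spec = _resolve(file_spec.get("/UF", file_spec.get("/F")))
        return ("external-file", None if file_spec is None else str(file_spec))
    return ("other", subtype)


# Example with a hand-written dictionary standing in for a real outline item:
welcome = {"/Title": "Welcome Page", "/A": {"/S": "/GoToR", "/F": "welcome.pdf"}}
kind, target = classify_outline_item(welcome)
```

With pypdf, the raw dictionary behind an item of `reader.outline` appears to be reachable through the item's `node` attribute in recent versions; treat that as an assumption to check against the installed version.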

@pubpub-zz
Collaborator

I may have an idea, but I would need an example of an original page and the output of the OCR processing to confirm it

@dawillcox
Author

The problem is the file is quite large, and just a single page wouldn't demonstrate the problem. I could try this on a smaller file, though.

@pubpub-zz
Collaborator

I would like to see if I can merge back the scanned data into the original page
Let me do my test 😉... It should be worth it.

@dawillcox
Author

That would be awesome, but I couldn't see how I'd even start. You'd have to extract the text along with all of the data that matched text to location in the image. I see no way to do that.

@pubpub-zz
Collaborator

pubpub-zz commented May 15, 2024

Please do what I asked.
Extract one page from your original document:

w = pypdf.PdfWriter()
w.append("doc_source.pdf", pages=[10])  # replace 10 with the number of a page whose text is not sensitive or copyrighted
w.write("one_page_out.pdf")

Apply the OCR process you've selected (out of pypdf's scope).
Publish one_page_out.pdf and the processed page.

@dawillcox
Author

The trouble with that is I don't know how I'd create the outlines items to correspond to the one page.

So here's a variation: two files that reproduce the issue without being huge and with just a cover page, so nobody should be unhappy about the content. Files will be attached, I hope. Here's my tacky code to do what I want:

from pypdf.generic import Destination
from pypdf import PdfReader, PdfWriter

# This stands in for the original file. All of the images are removed,
# just the first couple of pages are there.
ORIGINAL_FILE = 'with_outline.pdf'

# This stands in for the OCR scanned file. The outline is gone, but
# a couple of pages have text added that isn't in ORIGINAL_FILE.
# This will verify that the final product has pages from SCANNED_FILE.
SCANNED_FILE = 'altered_pages.pdf'

# This is the output of the merge.  A couple of pages are marked to verify
# that the 'A' and 'B' outline items go to the right place.
OUTPUT_FILE = 'after_merge.pdf'


def copy_index(from_file: PdfReader, to_file: PdfWriter, outlines, parent_outline=None):
    next_parent = parent_outline
    for outline in outlines:
        if isinstance(outline, Destination):
            pgno = from_file.get_destination_page_number(outline)
            if pgno is None:
                next_parent = to_file.add_outline_item_dict(outline,
                                                            parent=parent_outline)
            else:
                next_parent = to_file.add_outline_item(outline.title,
                                                       page_number=pgno,
                                                       parent=parent_outline)
        elif isinstance(outline, list):
            copy_index(from_file, to_file, outline, parent_outline=next_parent)


index_pdf = PdfReader(open(ORIGINAL_FILE, 'rb'))
scanned_pdf = PdfReader(open(SCANNED_FILE, 'rb'))
writer = PdfWriter()
writer.append_pages_from_reader(scanned_pdf)
copy_index(index_pdf, writer, index_pdf.outline)
writer.write(OUTPUT_FILE)

altered_pages.pdf
with_outline.pdf

@pubpub-zz
Collaborator

pubpub-zz commented May 15, 2024

No worries about outlines; these will naturally be copied from your scanned document.
What interests me in your documents is altered_pages. Looking at page 2 I can see clear text: can you clarify whether this is the output of the OCR? Can you also extract and send the original page? If you use the code I provided, the outlines should be extracted too.

Once I have both, I should be able to propose some code to merge the text/content from altered_pages. I need at least one page with the images; the file with_outline is useless for me.

@dawillcox
Author

dawillcox commented May 15, 2024

I'm clearly not communicating this well.

Yes, generally when you do an OCR scan of a document, the outlines are preserved. Task done, game over. No complaints here.

The problem is that my actual document is so large that scanners balk. So I split the big file into smaller chunks, scanned each chunk, then joined the chunks into one big file. That leaves me with another big file with all of the scanned text but no outline. I'm trying to copy the outline from the original file to the scanned one.

The files and code I just uploaded demonstrate the problem of copying indexes; the content of the pages shouldn't be an issue.

But hmm. I wonder if I could load the original document into a writer, delete the pages, then add the pages from the scanned document. That would be way simpler.

But still, wouldn't it be nice if pypdf had some kind of support for external links?

Update: Replacing the pages in the original document with the scanned pages doesn't work, presumably because the outline refers to the actual page, and if the page is removed the outline can't point at it any more.

@pubpub-zz
Collaborator

pubpub-zz commented May 15, 2024

What I have in mind is the following approach.
Using:

w = PdfWriter()
w.append("input.pdf", pages=(0, 50))
w.write("chunk1.pdf")
w = PdfWriter()
w.append("input.pdf", pages=(50, 100))
w.write("chunk2.pdf")

you will get chunks of the input that keep their outlines.
From your comments I understand that outlines are preserved by OCR,
so if you then use:

w = PdfWriter()
w.append("OCR_chunk1.pdf")
w.append("OCR_chunk2.pdf")
w.write("fullOCR_with_outlines.pdf")

it should work.

Regarding your proposal:

But hmm. I wonder if I could load the original document into a writer, delete the pages, then add the pages from the scanned document. That would be way simpler.

You should not need to remove the image:
create the writer with w = PdfWriter("original.pdf"), then use .merge_page(page_from_reader, over=False) to hide the text behind the image.

But still, wouldn't it be nice if pypdf had some kind of support for external links?

I agree; I thought it was already in... I need to check further.

@dawillcox
Author

So here's the problem. Outline items that point to a page refer to a specific page object, not just a page number. (Or an image, or other internal object.) That way, if you have an index set up, and then add or remove pages, the outline item still points to the same content.
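
The difference between pointing at a page object and pointing at a page number can be illustrated with a toy model. This is plain Python with no pypdf involved; the `Page` class and the labels are invented purely for illustration and say nothing about how pypdf stores pages internally.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare by identity, like distinct PDF page objects
class Page:
    label: str


pages = [Page("cover"), Page("chapter 1"), Page("chapter 2")]

target_by_reference = pages[1]  # what a PDF destination stores
target_by_number = 1            # what a bare page number would store

# Insert a page at the front: every page number shifts by one.
pages.insert(0, Page("new title page"))

label_via_reference = target_by_reference.label  # still "chapter 1"
label_via_number = pages[target_by_number].label  # now "cover"

# Swap the target page for a different object (for example an OCR'd copy).
pages[pages.index(target_by_reference)] = Page("chapter 1 (OCR)")
reference_still_in_document = any(p is target_by_reference for p in pages)  # False
```

The last line mirrors the situation in this thread: once the original page object is replaced by a different one, a destination that referenced the old object no longer leads anywhere in the document.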

If you remove a page I don't know if the index item is deleted or just doesn't point anywhere at all. If you remove a page with clean=True, the deleted page is replaced by a blank one and any index still points to it (I think).

Unfortunately, there seems to be no way to replace the content of a page, keeping the page ID the same but new content.

And looking at outline content in a debugger, I haven't been able to suss out how external destinations are specified. That seems to be thoroughly obfuscated in the code.
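
For inspecting how destinations are stored, it may help to walk the raw outline tree rather than the processed `reader.outline`. In the PDF format the document catalog's `/Outlines` dictionary points to its first item via `/First`, siblings are chained with `/Next`, and children hang off each item's own `/First`. The sketch below follows those links and yields each item's `/A` or `/Dest` entry untouched. `walk_outline` and `_resolve` are hypothetical names, and the pypdf usage in the trailing comments is an untested assumption about a typical file.

```python
from typing import Any, Iterator, Tuple


def _resolve(obj: Any) -> Any:
    """Follow a pypdf indirect reference if given one; other values pass through."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def walk_outline(first: Any, depth: int = 0) -> Iterator[Tuple[int, str, Any]]:
    """Yield (depth, title, raw /A or /Dest entry) for each item, depth-first."""
    node = _resolve(first)
    seen = set()  # guard against malformed files with circular /Next chains
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        target = _resolve(node.get("/A", node.get("/Dest")))
        yield depth, str(_resolve(node.get("/Title"))), target
        if "/First" in node:
            yield from walk_outline(node["/First"], depth + 1)
        node = _resolve(node.get("/Next"))


# Possible use with pypdf (assumes the file has an /Outlines entry):
#   from pypdf import PdfReader
#   reader = PdfReader("with_outline.pdf")
#   root = reader.trailer["/Root"]["/Outlines"]
#   for depth, title, target in walk_outline(root["/First"]):
#       print("  " * depth, title, target)
```

An item whose printed target is a dictionary with an `/S` entry such as `/GoToR`, `/Launch`, or `/URI` is one that points outside the document.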

Bottom line, I finally got ocrmypdf working. (I had problems with the ghostscript library before.) I found that

  • It behaved much like the online tool I used. Outline items for pages were fine. External references were disabled in the same way. I'm guessing there's something tricky about those external links. That's OK. I can live without them.
  • The ocrmypdf dies quite quickly. Blowing a stack, perhaps?

I'm guessing that the best bet would be to somehow copy the scanned text and the hints that say where it's placed and apply that to the original pages. No clue where to start for that, though. Certainly no clues from pypdf. And I can't be sure that the original pages weren't adjusted somehow.

So, bottom line, the file I have, absent the external links, works well enough for my purposes. I'd love to know how external references and/or the OCR-applied text works, and could be moved from one file to another. But at this point it's more a matter of intellectual curiosity.

@stefan6419846
Collaborator

Your code seems to handle outlines only. Shouldn't external links (however this would behave with scanned files) rather be an annotation (https://pypdf.readthedocs.io/en/latest/user/adding-pdf-annotations.html#link)?

@dawillcox
Author

Hmm. You may be onto something. But can annotations be on an outline item? The external links (to other files) in the original file seem to be attached to the outline (TOC) entries, not pages. If you click on an entry on the TOC it opens an external file, or goes to a page in the file. The latter is what my code does, but it can't figure out the external links.

Conversely, I'm guessing that OCR'd text may be an annotation and I could use that to copy the OCR'd text to the original document.

But doing a preliminary investigation, what the documentation shows for finding annotations on a page doesn't find annotations on outline items.
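
That observation is consistent with how the format is laid out: link annotations are stored per page in the page's `/Annots` array, while an outline item keeps its target on the outline item itself, so scanning pages would not be expected to turn up outline targets. A minimal sketch of the page-side lookup is shown below. `find_uri_links` and `_resolve` are hypothetical names, and the function accepts a plain dictionary or a pypdf page object.

```python
from typing import Any, List, Mapping


def _resolve(obj: Any) -> Any:
    """Follow a pypdf indirect reference if given one; other values pass through."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def find_uri_links(page: Mapping[str, Any]) -> List[str]:
    """Collect the URIs of all /Link annotations stored on one page."""
    uris = []
    for annot in _resolve(page.get("/Annots")) or []:
        annot = _resolve(annot)
        if str(_resolve(annot.get("/Subtype"))) != "/Link":
            continue  # some other annotation type, e.g. a text note
        action = _resolve(annot.get("/A"))
        if action is not None and str(_resolve(action.get("/S"))) == "/URI":
            uris.append(str(_resolve(action.get("/URI"))))
    return uris


# Hand-written stand-in for a page with one web link and one text note:
sample_page = {
    "/Annots": [
        {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "https://example.com"}},
        {"/Subtype": "/Text"},
    ]
}
links = find_uri_links(sample_page)
```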

@stefan6419846
Collaborator

Conversely, I'm guessing that OCR'd text may be an annotation and I could use that to copy the OCR'd text to the original document.

Usually no. This probably just is a basic text layer, maybe with an "invisible" font which allows copying the text, but does not conflict with the possibly different font and text parameters of the scanned image.
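
One common way to make such a text layer invisible, assumed here rather than confirmed for the files in this thread, is text rendering mode 3 (the `3 Tr` operator in the page's content stream), which draws text with neither fill nor stroke. The sketch below does a deliberately naive scan of a decoded content stream for `Tr` operators. `text_render_modes` and `has_invisible_text` are hypothetical names, and the regular expression can be fooled by string literals that happen to contain the same characters.

```python
import re
from typing import List

# A single digit followed by the Tr operator, e.g. b"3 Tr".
_TR_OPERATOR = re.compile(rb"(?<![\w.])(\d)\s+Tr\b")


def text_render_modes(content: bytes) -> List[int]:
    """Return every text rendering mode set in a decoded content stream."""
    return [int(match.group(1)) for match in _TR_OPERATOR.finditer(content)]


def has_invisible_text(content: bytes) -> bool:
    """True if the stream switches to rendering mode 3 (invisible text)."""
    return 3 in text_render_modes(content)


# Hand-written content stream standing in for an OCR text layer:
sample_stream = b"BT /F1 12 Tf 3 Tr 72 700 Td (Hello) Tj ET"
modes = text_render_modes(sample_stream)
```

With pypdf, the decoded bytes of a page can probably be obtained through something like `page.get_contents().get_data()`; check that against the installed version before relying on it.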

@dawillcox
Author

dawillcox commented May 16, 2024

Just stepping through the .extract_text() code in page I can see that pulling out the OCR results and applying them to another page would be a challenge (putting it mildly).

But can you suggest how annotations and outlines might be related? For example, if you click "Welcome Page" in the TOC of the with_outline.pdf file I attached earlier, it tries to open another document. Which isn't there so it fails, but at least the reader tried to open the file.

@pubpub-zz
Collaborator

pubpub-zz commented May 16, 2024

I've finally been able to generate the test I had in mind.
File with image only but with an outline:
tt1_outline.pdf
Output of the OCR (the text is invisible but present, with an image on top):
tt1-sortie.pdf

Then, to merge them, we can use the following code:

import pypdf
w = pypdf.PdfWriter("tt1_outline.pdf")
w2 = pypdf.PdfWriter("tt1-sortie.pdf")
w2.remove_images()  # remove the scanned image before merging
w.pages[0].merge_page(w2.pages[0], over=False)  # put the OCR page behind so it does not cover the original image
w.write("tt1_merged.pdf")
w.pages[0].extract_text(extraction_mode="layout")
#returns : 'PDF        Reference\n  sixthedition\n\n\n  Adobe°  Portable  Document   Format\n     Version1.7\n     November2006\n\n\n\n     Adobe  SystemsIncorporated'

the output:
tt1_merged.pdf

@pubpub-zz
Collaborator

So here's the problem. Outline items that point to a page refer to a specific page object, not just a page number. (Or an image, or other internal object.) That way, if you have an index set up, and then add or remove pages, the outline item still points to the same content.

If you look at the PDF specification, this is how pages are referenced. Page numbers are reserved for links to pages in external documents.

If you remove a page I don't know if the index item is deleted or just doesn't point anywhere at all. If you remove a page with clean=True, the deleted page is replaced by a blank one and any index still points to it (I think).

correct

Unfortunately, there seems to be no way to replace the content of a page, keeping the page ID the same but new content.

Using replace_content will not transfer the resources. My solution works.

And looking at outline content in a debugger, I haven't been able to suss out how external destinations are specified. That seems to be thoroughly obfuscated in the code.

I recommend using PDFBox with its debug option.

@dawillcox
Author

dawillcox commented May 17, 2024

Well, except I had to change

w = pypdf.PdfWriter("tt1_outline.pdf")

to

w = PdfWriter()
w.clone_document_from_reader(PdfReader("tt1_outline.pdf"))

for each file, (otherwise the pages list was empty) but that worked! Magic!

I ran this on my monster file, and it worked too. (Took a while, though. A lot of mucking with stuff happens in there.) Thanks loads for your help! I never would have figured this out by myself.

I still wonder how external links from outline entries work, but at this point it's just intellectual curiosity, and I have plenty of other things to keep me busy.

@pubpub-zz
Collaborator

Oops, I'm on the dev version.
Use PdfWriter(clone_from='input pdf')
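
Putting the pieces of this thread together, the whole-document version of the merge might look like the sketch below. It is an untested outline rather than the code that was actually run on the large file: `merge_text_layer` and its parameter names are hypothetical, it assumes both files have the same number of pages in the same order, and it uses only the calls already shown above (`PdfWriter(clone_from=...)`, `remove_images()`, and `merge_page(..., over=False)`).

```python
def merge_text_layer(original_path: str, ocr_path: str, output_path: str) -> int:
    """Put each OCR page's text behind the matching original page.

    Returns the number of pages written. Hypothetical helper.
    """
    # Imported inside the function so this module loads even without pypdf.
    from pypdf import PdfWriter

    base = PdfWriter(clone_from=original_path)  # keeps the outline and page objects
    text = PdfWriter(clone_from=ocr_path)
    text.remove_images()  # drop the scanned images, keep the OCR text

    if len(base.pages) != len(text.pages):
        raise ValueError("original and OCR files must have the same page count")

    for base_page, text_page in zip(base.pages, text.pages):
        # over=False puts the OCR content underneath the original image.
        base_page.merge_page(text_page, over=False)

    base.write(output_path)
    return len(base.pages)
```

Because the original page objects are modified in place instead of being replaced, outline items that point to them should keep working.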

@stefan6419846 stefan6419846 added is-question Rather a question than an issue. Should usually be a Discussion instead workflow-annotation Everything about annotating PDF files labels May 17, 2024
@stefan6419846
Collaborator

I am going to close this issue for now as it sounds solved.

@dawillcox
Author

dawillcox commented May 27, 2024

Well, yes, we found a solution to my particular problem. It still would be nice if support for external links (in table of contents and maybe other places), both creating and finding, could be added to the list of possible enhancements.

@stefan6419846
Collaborator

External links should already be supported. For further discussions or issues about it, I recommend opening a new discussion with an explicit example file.
