Support for outline item external references #2648
Comments
I may have an idea, but I would need an example of an original page and the output of the OCR processing to confirm it |
The problem is the file is quite large, and just a single page wouldn't demonstrate the problem. I could try this on a smaller file, though. |
I would like to see if I can merge back the scanned data into the original page |
That would be awesome, but I couldn't see how I'd even start. You'd have to extract the text along with all of the data that matched text to location in the image. I see no way to do that. |
Please do what I've asked:
apply the OCR process you've selected (out of pypdf scope) |
The trouble with that is I don't know how I'd create the outline items to correspond to the one page. So here's a variation: two files that reproduce the issue without being huge and with just the cover page, so nobody should be unhappy about content. Files will be attached, I hope. Here's my tacky code to do what I want:
from pypdf.generic import Destination
from pypdf import PdfReader, PdfWriter
# This stands in for the original file. All of the images are removed,
# just the first couple of pages are there.
ORIGINAL_FILE = 'with_outline.pdf'
# This stands in for the OCR scanned file. The outline is gone, but
# a couple of pages have text added that isn't in ORIGINAL_FILE.
# This will verify that the final product has pages from SCANNED_FILE.
SCANNED_FILE = 'altered_pages.pdf'
# This is the output of the merge. A couple of pages are marked to verify
# that the 'A' and 'B' outline items go to the right place.
OUTPUT_FILE = 'after_merge.pdf'
def copy_index(from_file: PdfReader, to_file: PdfWriter, outlines, parent_outline=None):
    next_parent = parent_outline
    for outline in outlines:
        if isinstance(outline, Destination):
            pgno = from_file.get_destination_page_number(outline)
            if pgno is None:
                next_parent = to_file.add_outline_item_dict(outline,
                                                            parent=parent_outline)
            else:
                next_parent = to_file.add_outline_item(outline.title,
                                                       page_number=pgno,
                                                       parent=parent_outline)
        elif isinstance(outline, list):
            copy_index(from_file, to_file, outline, parent_outline=next_parent)
index_pdf = PdfReader(open(ORIGINAL_FILE, 'rb'))
scanned_pdf = PdfReader(open(SCANNED_FILE, 'rb'))
writer = PdfWriter()
writer.append_pages_from_reader(scanned_pdf)
copy_index(index_pdf, writer, index_pdf.outline)
writer.write(OUTPUT_FILE) |
No worries about outlines; these will naturally be copied from your scanned document. Once I have both, I should be able to propose some code to merge the text/content from altered_pages. I need at least one page with the images; the file with_outline is useless for me. |
I'm clearly not communicating this well. Yes, generally when you do an OCR scan of a document, the outlines are preserved. Task done, game over. No complaints here. The problem is that my actual document is so large that scanners balk. So I split the big file into smaller chunks, scanned each chunk, then joined the chunks into one big file. That leaves me with another big file with all of the scanned text but no outline. I'm trying to copy the outline from the original file to the scanned one. The files and code I just uploaded demonstrate the problem of copying indexes; the content of the pages shouldn't be an issue. But hmm. I wonder if I could load the original document into a writer, delete the pages, then add the pages from the scanned document. That would be way simpler. But still, wouldn't it be nice if pypdf had some kind of support for external links? Update: Replacing the pages in the original document with the scanned pages doesn't work, presumably because the outline refers to the actual page, and if the page is removed the outline can't point at it any more. |
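The split, scan, and rejoin workflow described above could be sketched roughly as follows. This is only an illustration: the helper names `chunk_ranges` and `split_pdf`, the default chunk size, and the output file naming are assumptions made up for this sketch, not part of pypdf. Only `PdfReader`, `PdfWriter`, `add_page`, and `write` are pypdf calls.

```python
from typing import Iterator, List, Tuple


def chunk_ranges(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) page index pairs, stop exclusive, covering all pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)


def split_pdf(path: str, chunk_size: int = 100) -> List[str]:
    """Write each chunk of `path` to its own file and return the file names."""
    # Imported here so chunk_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    written = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        name = f"chunk_{start:05d}_{stop:05d}.pdf"  # hypothetical naming scheme
        writer.write(name)
        written.append(name)
    return written
```

Copying pages one at a time like this carries over page content only, so the chunks start out without the original outline regardless of what the OCR step does afterwards.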
What I have in mind is the following approach:
you will have chunks of input that would have kept outlines
Should work. your proposal
You should not need to remove the image:
I agree, I thought it was already in... need to check more |
So here's the problem. Outline items that point to a page refer to a specific page object, not just a page number. (Or an image, or other internal object.) That way, if you have an index set up, and then add or remove pages, the outline item still points to the same content. If you remove a page I don't know if the index item is deleted or just doesn't point anywhere at all. If you remove a page with clean=True, the deleted page is replaced by a blank one and any index still points to it (I think). Unfortunately, there seems to be no way to replace the content of a page, keeping the page ID the same but new content. And looking at outline content in a debugger, I haven't been able to suss out how external destinations are specified. That seems to be thoroughly obfuscated in the code. Bottom line, I finally got
I'm guessing that the best bet would be to somehow copy the scanned text and the hints that say where it's placed and apply that to the original pages. No clue where to start for that, though. Certainly no clues from pypdf. And I can't be sure that the original pages weren't adjusted somehow. So, bottom line, the file I have, absent the external links, works well enough for my purposes. I'd love to know how external references and/or the OCR-applied text works, and could be moved from one file to another. But at this point it's more a matter of intellectual curiosity. |
Your code seems to handle outlines only. Shouldn't external links (however this would behave with scanned files) rather be an annotation (https://pypdf.readthedocs.io/en/latest/user/adding-pdf-annotations.html#link)? |
Hmm. You may be onto something. But can annotations be on an outline item? The external links (to other files) in the original file seem to be attached to the outline (TOC) entries, not pages. If you click on an entry on the TOC it opens an external file, or goes to a page in the file. The latter is what my code does, but it can't figure out the external links. Conversely, I'm guessing that OCR'd text may be an annotation and I could use that to copy the OCR'd text to the original document. But doing a preliminary investigation, what the documentation shows for finding annotations on a page doesn't find annotations on outline items. |
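For what it's worth, in the PDF format an outline item dictionary carries either a `/Dest` entry or an `/A` (action) entry, and external targets are normally actions of type `/GoToR` (another PDF file), `/Launch`, or `/URI`. A minimal sketch of telling them apart from the raw dictionaries follows. The function names `classify_outline_node` and `walk_outline` and the label strings are hypothetical. With pypdf the raw dictionaries could presumably be reached via `reader.trailer["/Root"]["/Outlines"]["/First"]`, but that access path is an assumption and is not exercised here.

```python
from typing import Any, Iterator, Mapping, Tuple


def _resolve(obj: Any) -> Any:
    """Follow a pypdf-style indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def classify_outline_node(node: Mapping[str, Any]) -> str:
    """Return a rough label for where an outline item dictionary leads."""
    if "/Dest" in node:
        return "internal destination"
    action = _resolve(node.get("/A"))
    if action is None:
        return "no target (heading only)"
    kind = str(action.get("/S"))
    labels = {
        "/GoTo": "internal destination",
        "/GoToR": "page in another PDF file",
        "/Launch": "launch external file or application",
        "/URI": "web link",
    }
    return labels.get(kind, f"other action {kind}")


def walk_outline(first: Mapping[str, Any], depth: int = 0) -> Iterator[Tuple[int, Mapping[str, Any]]]:
    """Yield (depth, node) pairs by following /First (child) and /Next (sibling)."""
    node = _resolve(first)
    while node is not None:
        yield depth, node
        child = _resolve(node.get("/First"))
        if child is not None:
            yield from walk_outline(child, depth + 1)
        node = _resolve(node.get("/Next"))
```

The walker works on plain dictionaries as well, which makes it easy to try out on hand-built data before pointing it at a real file. It does not guard against malformed outlines whose `/Next` links form a cycle.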
Usually no. This probably just is a basic text layer, maybe with an "invisible" font which allows copying the text, but does not conflict with the possibly different font and text parameters of the scanned image. |
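One way to check the "invisible text layer" guess is to look for text rendering mode 3 in a page's content stream; in PDF, the operator `3 Tr` selects text that is neither filled nor stroked. The sketch below is a heuristic only: the function name `has_invisible_text` is made up, and obtaining the decoded bytes (for instance via something like `page.get_contents().get_data()` in pypdf) is an assumption left outside the code.

```python
import re

# Matches the text rendering mode operator with operand 3, e.g. b"3 Tr".
# The lookbehind rejects operands such as 13 or 0.3 that merely end in 3.
_INVISIBLE_TR = re.compile(rb"(?<![\d.])3\s+Tr\b")


def has_invisible_text(content: bytes) -> bool:
    """Return True if a decoded content stream switches to text rendering mode 3."""
    return _INVISIBLE_TR.search(content) is not None
```

Because this is a plain byte search, it can be fooled by the same characters appearing inside a string literal, so it is better treated as a quick hint than as proof.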
Just stepping through the But can you suggest how annotations and outlines might be related? For example, if you click "Welcome Page" in the TOC of the |
I've finally been able to generate a test as I was expecting: then to merge it we can use the following code:
the output: |
If you look into the PDF spec, this is the way pages are pointed to. Page numbers are reserved for links to external pages.
correct
Using replace_content will not transfer the resources. My solution is operational.
I recommend using PDFBox with the debug option |
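The point about local destinations referring to page objects rather than page numbers can be illustrated with a toy model. This is not pypdf code: the `Page` and `LocalDestination` classes and the `page_number` helper are invented for the illustration, with object identity standing in for an indirect reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # compare by identity, like distinct PDF objects
class Page:
    label: str


@dataclass
class LocalDestination:
    title: str
    page: Page  # stands in for an indirect reference to a page object


def page_number(pages: List[Page], dest: LocalDestination) -> Optional[int]:
    """Return the current index of the destination's page, or None if it is gone."""
    for index, page in enumerate(pages):
        if page is dest.page:
            return index
    return None


pages = [Page("cover"), Page("chapter 1")]
dest = LocalDestination("Chapter 1", pages[1])

pages.insert(1, Page("foreword"))    # inserting shifts the index, not the target
after_insert = page_number(pages, dest)

pages[2] = Page("chapter 1 (OCR)")   # swapping in a new page object orphans the link
after_replace = page_number(pages, dest)
```

In this model the destination survives the insertion but no longer resolves once the page object itself is swapped out, which matches the behaviour reported earlier in the thread when the original pages were replaced by scanned ones.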
Well, except I had to change w = pypdf.PdfWriter("tt1_outline.pdf") to w = PdfWriter()
w.clone_document_from_reader(PdfReader("tt1_outline.pdf")) for each file, (otherwise the I ran this on my monster file, and it worked too. (Took a while, though. A lot of mucking with stuff happens in there.) Thanks loads for your help! I never would have figured this out by myself. I still wonder how external links from outline entries work, but at this point it's just intellectual curiosity, and I have plenty other things to keep me busy. |
Oups |
I am going to close this issue for now as it sounds solved. |
Well, yes, we found a solution to my particular problem. It still would be nice if support for external links (in table of contents and maybe other places), both creating and finding, could be added to the list of possible enhancements. |
External links should already be supported. For further discussions or issues about it, I recommend opening a new discussion with an explicit example file. |
Explanation
I'm not sure if this is a request for a new feature or documentation to explain how this is already possible...
My knowledge of PDF internal format is microscopic, but I know that PDF supports internal links (to images, pages, etc.) and external links (web pages, other files, email addresses, ...). I don't see how pypdf supports external links.
Here's my situation: I have a PDF file (from a CD I purchased) that has outline links to pages and external files. It's a scan of a book, almost 1200 pages, so the links to sections of the document are quite handy. Trouble is, the pages are all just images. It would be very useful to be able to search for text and copy text for use elsewhere. (Fair use, of course.)
Yes, I know there are resources that OCR scan PDF files, but everything I've tried balks at a file that large, at least without a charge.
So I:
Which worked perfectly. Except, while the text in the result is all nicely scanned, the outline is gone. So, I'm using pypdf to merge the original document's outline into the scanned document. And that works fine for the outline options that are just headers, and links to pages within the document, but the external links are gone.
See code example below. This is just the inner logic to deal with a single outline entry, obviously there's outer logic to deal with lists and embedded lists.
Code Example
Here's what I'm doing now:
So the question is: Is this something that can be done with the current release, but it's too obscure for me to figure out? Or would it be a useful addition in the future? Said feature probably would need a way to tell if an existing outline entry was an external reference, plus a way to specify such a reference in a new file.
Though now that I think of it, outlines can point to other internal things like images. Maybe those are IndirectObjects so already supported?