Question :How to remove a word water_mark from PDF? #468

Closed
xjqx05 opened this issue Mar 18, 2020 · 11 comments

@xjqx05

xjqx05 commented Mar 18, 2020

Hi,

Thank you for your great work. This project helps me a lot. I want to remove a watermark that contains some particular words, let's say "www.abc.com".

My question is: is it possible to remove this "www.abc.com" (at the top of the page), or even just delete the link? In that case I can cover it with some white images.

Thanks again.

@JorjMcKie
Collaborator

Thanks for your feedback!

My question is: is it possible to remove this "www.abc.com" (at the top of the page), or even just delete the link? In that case I can cover it with some white images.

This is probably possible; it depends on the details of how this element is stored on the page(s).

If it really is text and does not occupy space that is also covered by other visible elements on the page, then you can use the new PyMuPDF feature "Redaction Annotations". In a nutshell, it works like this:

  1. Determine the rectangle of the unwanted text, e.g. using page.searchFor(...)
  2. Add a 'Redact' annot for that rectangle (or each of these rectangles), optionally specifying other text to occupy that space, etc.
  3. Execute page.apply_redactions() and the text is gone (also physically - text extraction wouldn't find it anymore!)
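
The three steps above can be sketched as one small helper. This is only an illustration, not an official recipe: the helper name `redact_text` and its `replacement` parameter are made up for this example, and the method names are the ones used in this thread (newer PyMuPDF releases use snake_case names such as `search_for` and `add_redact_annot`).

```python
def redact_text(page, needle, replacement=None):
    """Redact every occurrence of `needle` on `page` and return the hit count.

    `page` is assumed to be a PyMuPDF page object offering the methods
    named in this thread: searchFor, addRedactAnnot, apply_redactions.
    """
    rects = page.searchFor(needle)  # step 1: a list of rectangles, possibly empty
    for rect in rects:
        if replacement is None:
            page.addRedactAnnot(rect)  # step 2: plain redaction
        else:
            page.addRedactAnnot(rect, text=replacement)  # step 2 with substitute text
    if rects:
        page.apply_redactions()  # step 3: physically remove the covered content
    return len(rects)
```

With a real document this would be called as `redact_text(doc[0], "www.abc.com")` before `doc.save(...)`. A return value of 0 means the string was not found as text on that page.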

This approach also works if this is not text but really an image: determine the image bbox via page.getImageBbox() and make a 'Redact' annot as before.

The caveat with this approach: applying redact annotations removes everything within the respective rectangles, not only the element you are targeting ...

If a watermark exists as part of the background (e.g. to prevent unnoticed copies), then we would need to talk again about the details of that implementation.
I have seen cases where a semi-transparent image is put in foreground or background. There the image xref and/or symbolic reference name would have to be identified before a removal is possible ...

@xjqx05
Author

xjqx05 commented Mar 19, 2020

Thanks for your prompt reply, it works!

What's weird is that where "www.abc.com" used to be, it's now empty. That's so cool.

But when you click the empty place, it still links to the website "www.abc.com", even if I replace the text with something like "www.mmm.com".

Is there any way to remove the link? It would help me a lot. Thanks again.

@JorjMcKie
Collaborator

But when you click the empty place, it still links to the website "www.abc.com", even if I replace the text with something like "www.mmm.com".

Ah ok. No problem I guess.
You can also get the links of a page. This is a list of dictionaries:

>>> import fitz
>>> doc=fitz.open("PyMuPDF.pdf")
>>> page=doc[8]  # we want to remove link reference to MuPDF web site
>>> for link in page.links(kinds=(fitz.LINK_URI,)):  # iterate over internet links only
...     if 'www.mupdf.com' in link["uri"]:  # if found
...         break
...
>>> page.addRedactAnnot(link["from"])  # use the link hot spot area on page
'Redact' annotation on page 8 of PyMuPDF.pdf
>>> page.apply_redactions()  # remove the text
True
>>> page.deleteLink(link)  # and now also the link itself
>>> doc.save("link-deleted.pdf")  # this has no more link at that place ...
>>> 
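
The session above stops at the first matching link and assumes one exists. A slightly more defensive variant, as a sketch only: the helper name `delete_links_matching` is made up here, and it relies on `page.getLinks()` and `page.deleteLink()` as they are used in this thread (newer PyMuPDF releases rename them to snake_case).

```python
def delete_links_matching(page, fragment):
    """Delete every link on `page` whose URI contains `fragment`.

    Returns the number of links removed. `page` is assumed to be a
    PyMuPDF page; getLinks() is expected to return a list of dicts.
    """
    removed = 0
    # Copy the list first so that deleting does not disturb the iteration.
    for link in list(page.getLinks()):
        uri = link.get("uri") or ""  # non-URI links have no "uri" entry
        if fragment in uri:
            page.deleteLink(link)
            removed += 1
    return removed
```

If this returns 0 even though clicking still opens a web site, the link is probably not stored as a link of that page, and it has to be located some other way.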

@xjqx05
Author

xjqx05 commented Mar 19, 2020

I think my situation is a little different.
When I call page.getLinks(), the result is [],
and page.loadLinks() returns None.

When I followed your instructions, here is my code:

page.addRedactAnnot(rect, text=' ')
page.apply_redactions()

When I do this:

page.searchFor('www.abc.com')

it's still there, so I guess it is not deleted but covered by the redact annotation.

Did I do it wrong? Many thanks.

@JorjMcKie
Collaborator

When I call page.getLinks(), the result is []

This is weird indeed!

It's still there, so I guess it is not deleted but covered by the redact annotation.

No, the text is really gone. But the link does still exist, the question is where.
Is this PDF big? Can you share it? I would like to have a look ...

@xjqx05
Author

xjqx05 commented Mar 19, 2020

Sure, is it OK to send it to your email: jorj.x.mckie@outlook.de?

@JorjMcKie
Collaborator

yes

@JorjMcKie
Collaborator

So my final question is: how do I get the rect_like from rect = page.searchFor()?

The output of rl = page.searchFor("something") is always a list: a list of rectangles or a list of quads.
So in order to access one rectangle you must use an index, like rl[0], etc.

@xjqx05
Author

xjqx05 commented Mar 19, 2020

Yes, I just realized that.
All problems solved, you can close the issue.
Excellent PyMuPDF!
I hope I can donate a cup of coffee :)

@xjqx05 xjqx05 closed this as completed Mar 19, 2020
@JorjMcKie
Collaborator

hope I can donate one cup of coffee :)

Always welcome - you know where to find the PayPal button 😉?

Here is a brute force script that removes watermarks from pages. Its approach is completely different: it scans through the paint commands of a page (after formatting and cleaning them up), then finds and removes /Watermark artifacts. The modified page commands are then written back.

This obviously removes the watermarks. But the link on page 1 still remains 🤔, because it is not a watermark; it must still be treated via redact annotations.

import fitz

doc = fitz.open("2.pdf")
for page in doc:
    page.cleanContents()  # cleanup page painting commands
    xref = page.getContents()[0]  # get xref of the resulting source
    cont0 = doc.xrefStream(xref).decode().splitlines()  # and read it as lines of strings
    cont1 = []  # will contain reduced cont lines
    found = False  # indicates we are inside watermark instructions
    for line in cont0:
        if line.startswith("/Artifact") and "/Watermark" in line:  # start of watermark
            found = True  # switch on
            continue  # and skip line
        if found and line == "EMC":  # end of watermark
            found = False  # switch off
            continue  # and skip line
        if found is False:  # copy commands while outside watermarks
            cont1.append(line)
    cont = "\n".join(cont1)  # new paint commands source
    doc.updateStream(xref, cont.encode())  # replace old one with 'bytes' version

doc.save("2-no-watermarks.pdf", garbage=4)
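
The filtering step of the script can also be tried in isolation, without opening a PDF. The sketch below repeats the same logic as a standalone function; the name `strip_watermark_lines` and the sample lines are invented for illustration and are not taken from a real file.

```python
def strip_watermark_lines(lines):
    """Return `lines` without the /Artifact ... /Watermark ... EMC blocks."""
    kept = []
    inside = False  # True while we are between the watermark start and its EMC
    for line in lines:
        if line.startswith("/Artifact") and "/Watermark" in line:
            inside = True  # start of a watermark block: drop this line
            continue
        if inside and line == "EMC":
            inside = False  # end of the watermark block: drop this line too
            continue
        if not inside:
            kept.append(line)  # ordinary paint command, keep it
    return kept


# Made-up content stream, split into lines the way the script above does it.
sample = [
    "q",
    "/Artifact <</Subtype /Watermark /Type /Pagination>> BDC",
    "BT",
    "(www.abc.com) Tj",
    "ET",
    "EMC",
    "Q",
]
print(strip_watermark_lines(sample))  # ['q', 'Q']
```

Note that the first "EMC" line after the watermark start ends the skipping, so a watermark block that itself contains nested marked content would be cut off too early by this simple approach.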

@xjqx05
Author

xjqx05 commented Mar 19, 2020

Your code above is much more efficient!
Brute but nice.
There is still one hidden link on my Page(0), but I will live with that :)
Have a nice day.
