Question :How to remove a word water_mark from PDF? #468

xjqx05 · 2020-03-18T12:06:07Z

Hi,

Thank you for your great work. This project helps me a lot. I want to remove a text watermark that contains some particular words, let's say "www.abc.com".

My question: is it possible to remove this "www.abc.com" (at the top of the page), or even just delete the link? In that case I could cover it with a white image.

Thanks again.

JorjMcKie · 2020-03-18T12:59:40Z

Thanks for your feedback!

> My question: is it possible to remove this "www.abc.com" (at the top of the page), or even just delete the link? In that case I could cover it with a white image.

This is probably possible; it depends on the details of how this element is stored on the page(s).

If it really is text and does not occupy space that is also covered by other visible elements on the page, then you can use the new PyMuPDF feature "Redaction Annotations". In a nutshell, it works like this:

1. Determine the rectangle of the unwanted text, e.g. using page.searchFor(...).
2. Add a 'Redact' annot for that rectangle (or those rectangles), possibly specifying other text to occupy that space, etc.
3. Execute page.apply_redactions() and the text is gone (also physically: text extraction won't find it anymore!).

This approach also works if this is not text but actually an image: determine the image bbox via page.getImageBbox() and make a 'Redact' annot as before.

The caveat with this approach: applying redact annotations removes or overlays everything within the respective rectangles ...

If a watermark exists as part of the background (e.g. to prevent unnoticed copies), then we would need to talk again about the details of that implementation.
I have seen cases where a semi-transparent image is put in foreground or background. There the image xref and/or symbolic reference name would have to be identified before a removal is possible ...

xjqx05 · 2020-03-19T05:35:57Z

Thanks for your prompt reply, it works!

What's weird is that where "www.abc.com" used to be, it's now empty. That's so cool.

But when you click the empty place, it still links to the website "www.abc.com", even if I replace the text with something like "www.mmm.com".

Is there any way to remove the link? It would help me a lot. Thanks again.

JorjMcKie · 2020-03-19T07:08:30Z

> But when you click the empty place, it still links to the website "www.abc.com", even if I replace the text with something like "www.mmm.com".

Ah ok. No problem I guess.
You can also get the links of a page. This is a list of dictionaries:

>>> import fitz
>>> doc = fitz.open("PyMuPDF.pdf")
>>> page = doc[8]  # we want to remove the link reference to the MuPDF web site
>>> for link in page.links(kinds=(fitz.LINK_URI,)):  # iterate over internet links only
...     if "www.mupdf.com" in link["uri"]:  # if found
...         break
...
>>> # note: this assumes a matching link exists; otherwise 'link' is just the last one visited
>>> page.addRedactAnnot(link["from"])  # use the link hot spot area on page
'Redact' annotation on page 8 of PyMuPDF.pdf
>>> page.apply_redactions()  # remove the text
True
>>> page.deleteLink(link)  # and now also the link itself
>>> doc.save("link-deleted.pdf")  # this has no more link at that place ...
>>>

xjqx05 · 2020-03-19T07:30:55Z

I think my situation is a little different.
When I call page.getLinks(), the result is [],
and page.loadLinks() returns None.

Following your instructions, here is my code:

page.addRedactAnnot(rect, text=' ')
page.apply_redactions()

When I then do this:

page.searchFor('www.abc.com')

it's still there, so I guess it is not deleted, but covered by the redact annotation.

Did I do it wrong? Many thanks.

JorjMcKie · 2020-03-19T07:39:39Z

> When I call page.getLinks(), the result is []

This is weird indeed!

> it's still there, so I guess it is not deleted, but covered by the redact annotation.

No, the text is really gone. But the link does still exist, the question is where.
Is this PDF big? Can you share it? I would like to have a look ...

xjqx05 · 2020-03-19T07:44:08Z

Sure, is it OK to send it to your email: jorj.x.mckie@outlook.de?

JorjMcKie · 2020-03-19T07:58:35Z

yes

JorjMcKie · 2020-03-19T09:54:12Z

> So my final question is: how do I get the rect_like from rect = page.searchFor()?

The output of rl = page.searchFor("something") is always a list: a list of rectangles or a list of quads.
So in order to access one rectangle you must use an index, like rl[0], etc.

xjqx05 · 2020-03-19T10:05:48Z

Yes, I just realized that.
All problems solved, you can close the issue.
PyMuPDF is excellent.
I hope I can donate a cup of coffee :)

JorjMcKie · 2020-03-19T12:06:45Z

> I hope I can donate a cup of coffee :)

Always welcome - you know where to find the PayPal button 😉?

Here is a brute-force script that removes watermarks from pages. Its approach is completely different: it scans through the paint commands of a page (after formatting and cleaning them up), then finds and removes /Watermark artifacts. The modified page commands are then written back.

This obviously removes the watermarks. But the link on page 1 still remains 🤔, because it is not a watermark; it must still be treated via redact annotations.

import fitz

doc = fitz.open("2.pdf")
for page in doc:
    page.cleanContents()  # cleanup page painting commands
    xref = page.getContents()[0]  # get xref of the resulting source
    cont0 = doc.xrefStream(xref).decode().splitlines()  # and read it as lines of strings
    cont1 = []  # will contain reduced cont lines
    found = False  # indicates we are inside watermark instructions
    for line in cont0:
        if line.startswith("/Artifact") and "/Watermark" in line:  # start of watermark
            found = True  # switch on
            continue  # and skip line
        if found and line == "EMC":  # end of watermark
            found = False  # switch off
            continue  # and skip line
        if found is False:  # copy commands while outside watermarks
            cont1.append(line)
    cont = "\n".join(cont1)  # new paint commands source
    doc.updateStream(xref, cont.encode())  # replace old one with 'bytes' version

doc.save("2-no-watermarks.pdf", garbage=4)
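
To see the filtering step in isolation, here is the same line-skipping logic as a standalone function run on a few hand-written content-stream lines. The function name and the sample lines are made up for illustration; only the skip logic mirrors the script above.

```python
def strip_watermark_blocks(lines):
    """Drop every line from an '/Artifact ... /Watermark' start up to the next 'EMC'."""
    kept = []
    inside = False  # True while we are inside a watermark block
    for line in lines:
        if line.startswith("/Artifact") and "/Watermark" in line:
            inside = True   # watermark starts: skip this line
            continue
        if inside and line == "EMC":
            inside = False  # watermark ends: skip this line too
            continue
        if not inside:
            kept.append(line)
    return kept


# Hypothetical, simplified page commands with one watermark block.
sample = [
    "q",
    "BT /F1 12 Tf 72 700 Td (Body text) Tj ET",
    "/Artifact <</Subtype /Watermark /Type /Pagination>> BDC",
    "q",
    "/Fm0 Do",
    "Q",
    "EMC",
    "Q",
]
cleaned = strip_watermark_blocks(sample)
```

Note that the block is considered finished at the first EMC line, so a watermark that itself contains nested marked content would be cut off too early by this simple approach.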

xjqx05 · 2020-03-19T14:58:57Z

Your code above is much more efficient!
Brute but nice.
There is still one hidden link on my Page(0), but I will live with that :)
Have a nice day.

xjqx05 added the question label Mar 18, 2020

xjqx05 assigned JorjMcKie Mar 18, 2020

xjqx05 closed this as completed Mar 19, 2020

JorjMcKie mentioned this issue Apr 10, 2020

Question / Comment: how to access image bbox by its xref #487

Closed

Soumadip-Saha mentioned this issue Nov 20, 2023

Remove a background text which is overlapped with other texts. #2821

Closed
