Redact/Replace #344

Closed
Singrig opened this issue Aug 8, 2019 · 15 comments

@Singrig

Singrig commented Aug 8, 2019

Hi,
PyMuPDF is really a great package for working with PDFs and other formats.

I have a quick question: I need to redact the sensitivity information from a PDF. Is there a function for redaction, or can we replace words in a PDF, similar to highlighting them?

@JorjMcKie
Collaborator

The new MuPDF version 1.16.0 also supports Redact annotations.
I haven't decided yet, whether / when to include support for this also in PyMuPDF.
Currently I am still in v1.16.0 development.
Quite a big update to the current v1.14.x unfortunately.

@JorjMcKie
Collaborator

What do you mean by "sensitivity"? Encryption?
PyMuPDF v1.16.0 will definitely fully support password-based encryption / decryption and permission levels.

@Singrig
Author

Singrig commented Aug 8, 2019

Sorry, I'm confused: does MuPDF support redacting specific words in a PDF? Can you please share a link?

For example, if a PDF contains a customer name, I need to either redact that name or replace it with some junk letters.

@JorjMcKie
Collaborator

JorjMcKie commented Aug 8, 2019

No, we are talking about different things. Redact annotations are part of the most recent PDF specification.
What you want is not supported at all by MuPDF - sorry.
I wrote an anonymizer going in that direction some time ago, and even that covered Base-14 fonts only. This is a significant effort that goes far beyond what this repository has to offer.

@Singrig
Author

Singrig commented Aug 8, 2019

Sorry for bothering you. If I have a list of words, can't we delete or replace those specific words, the same way we highlight them using PyMuPDF?

@JorjMcKie
Collaborator

No, this already goes fairly deep into how text is coded in a PDF. In the absolutely most simple cases you might even treat the PDF as a text file and use some editor.

Of course you can always cover sensitive things with - say - a black rectangle. But that is cosmetics only: the information is still there.
A text like "Mr. Anonymous" might be coded in a plethora of ways in unfortunate circumstances: hexadecimal, each single letter being separated from each other and not in natural reading sequence, the text may be split across several lines, ... and what not.

@Singrig
Author

Singrig commented Aug 8, 2019

Can you please share some code or links that would help with what you described, such as text that is not in natural reading sequence or that is split across lines?

@JorjMcKie
Collaborator

Read the chapter on text, specifically the part about textbox text extraction, to see how a natural reading order can be re-established.
And of course you should look at the original PDF spec manual to see examples of how complex a seemingly simple text may be coded. For example, page 9 of the PDF manual for this repo looks like this:

(image: screenshot of the page with the highlighted text)

and the highlighted text is coded like this in the PDF file:

BT\n/F38 9.9626 Tf 72 462.176 Td [(PyMuPDF)]TJ/F52 9.9626 Tf 45.81 0 Td [(is)-270(a)-270(Python)-270(binding)-270(for)]TJ

I cannot share the code I developed in the above mentioned case, because that was paid work and I thus do not own the copyright.
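
As a rough illustration of why a plain search-and-replace on the file does not work (this is not the code mentioned above), the following sketch recovers the visible text from the `TJ` arrays of the snippet. The function name `text_from_tj` and the 200-unit threshold are assumptions made up for this example, and it handles only literal strings in parentheses: no hexadecimal strings, no font encodings, no octal escapes, no `Td` positioning.

```python
import re

# A TJ operand is an array of strings and kerning numbers: [(is)-270(a)]TJ
ARRAY_RE = re.compile(r"\[(.*?)\]\s*TJ", re.S)
# A token is either a literal string in parentheses or a number.
TOKEN_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)|(-?\d+(?:\.\d+)?)")


def text_from_tj(stream, space_threshold=200.0):
    """Rebuild readable text from the TJ arrays of a simple content stream."""
    chunks = []
    for array in ARRAY_RE.findall(stream):
        parts = []
        for string, number in TOKEN_RE.findall(array):
            if number:
                # Negative numbers shift the next glyph to the right;
                # a large enough shift is displayed like a space.
                if -float(number) >= space_threshold:
                    parts.append(" ")
            else:
                # Undo backslash escapes for parentheses and backslash.
                parts.append(re.sub(r"\\([()\\])", r"\1", string))
        chunks.append("".join(parts))
    # Simplification: treat every new TJ array as a new word.
    return " ".join(chunks)


sample = (r"BT\n/F38 9.9626 Tf 72 462.176 Td [(PyMuPDF)]TJ"
          r"/F52 9.9626 Tf 45.81 0 Td "
          r"[(is)-270(a)-270(Python)-270(binding)-270(for)]TJ")

found_in_raw = "is a Python" in sample   # the phrase never appears literally
recovered = text_from_tj(sample)
print(found_in_raw, recovered)
```

Note that the phrase "is a Python" does not occur anywhere in the raw stream: the spaces exist only as kerning values, which is one reason why searching the file bytes for a customer name can easily miss it.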

@Singrig
Author

Singrig commented Aug 8, 2019

Thanks, this will help me. Let me check what I can do. The difficulty is that I'm very new to this field.

@Singrig
Author

Singrig commented Aug 9, 2019

Hi,
I'm trying to add text to a PDF, but I'm getting the error "name 'Py_RETURN_NONE' is not defined".

My code

import fitz
doc = fitz.open('PyMuPDF.pdf') # new or existing PDF
page = doc[0] # new or existing page via doc[n]
p = fitz.Point(50, 72) # start point of 1st line
text = "Some text,\nspread across\nseveral lines."

@JorjMcKie
Collaborator

You are not using the current version. Please switch to 1.14.20.

@JorjMcKie
Collaborator

@Singrig - the new v1.16.11 supports redaction annotations.

@Singrig
Author

Singrig commented Feb 23, 2020 via email

@JorjMcKie
Collaborator

@Singrig - sure you will. The new documentation is already uploaded. I am about to also populate PyPI with the installation material.

@Singrig
Author

Singrig commented Feb 23, 2020 via email
