blocks not identified and so disappear from exported PDF #2455
Closed
Description
Hi,
I'm trying to redact a few words from a PDF file.
The intended words are redacted, but other, unrelated words disappear from the file. They are not redacted; they simply vanish.
To locate the problem, I ran this code:
import fitz

fitz.TOOLS.set_small_glyph_heights(True)
doc = fitz.open("file1.pdf")
page = doc[0]
page.clean_contents()
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] != 0:  # skip non-text (image) blocks
        continue
    for l in b["lines"]:
        for s in l["spans"]:
            bbox = fitz.Rect(s["bbox"])
            # make the box at least as tall as the font size
            if bbox.height < s["size"]:
                bbox.y0 = bbox.y1 - s["size"]
            # outline every span that the "dict" extraction reports
            page.draw_rect(bbox, width=0.3, color=(1, 0, 0))
doc.save("file1_redacted.pdf")
In the new PDF, every word has a red bounding box around it except the words that go missing from the redacted file. Looking further, I noticed that these spans are also missing from the blocks object, yet they do appear when I run page.get_text("text").
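The mismatch between the two extraction modes can be listed programmatically rather than spotted by eye. The following is only a sketch: the helper name lines_missing_from_dict is hypothetical, and it assumes that joining the span texts of one "dict" line gives the same string as the matching line of the "text" output once whitespace is normalized.

```python
def lines_missing_from_dict(page_dict, plain_text):
    """Return the plain-text lines that match no line of the "dict" output.

    page_dict  -- the value returned by page.get_text("dict")
    plain_text -- the string returned by page.get_text("text")
    (Hypothetical helper for illustration; not part of PyMuPDF.)
    """
    dict_lines = set()
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block.get("lines", []):
            joined = "".join(span["text"] for span in line.get("spans", []))
            dict_lines.add(" ".join(joined.split()))  # normalize whitespace

    missing = []
    for raw in plain_text.splitlines():
        norm = " ".join(raw.split())
        if norm and norm not in dict_lines:
            missing.append(norm)
    return missing


# Possible usage with the page object from the code above:
#   missing = lines_missing_from_dict(page.get_text("dict"), page.get_text("text"))
#   print(missing)
```

If the assumption holds, the returned list should contain exactly the lines that get no red box.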
The base code I run to redact the file is:
import re

import fitz

fitz.TOOLS.set_small_glyph_heights(True)
doc = fitz.open("file1.pdf")
page = doc[0]
page.wrap_contents()
lines = page.get_text("text").split("\n")


def get_sensitive_data(lines):
    # the name and ID are taken from the two lines after the "PACIENTE:" label
    reg_exp = r"PACIENTE:"
    for i, line in enumerate(lines):
        if re.search(reg_exp, line, re.IGNORECASE):
            found_name = lines[i + 1]
            found_id = lines[i + 2]
            return [found_name, found_id]
    return []  # label not found: nothing to redact


sensitive = get_sensitive_data(lines)
for data in sensitive:
    areas = page.search_for(data)
    for rect in areas:
        page.add_redact_annot(rect, fill=(0, 0, 0))
# apply all redaction annotations on the page in one pass
page.apply_redactions()
doc.save("file1_redacted.pdf")
I'll add a few screenshots to illustrate this. Notice that the words "Rh (D)" and "Biometría Hemática" disappear.
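Besides the screenshots, the disappearing words could be listed by comparing word extractions taken before and after the redaction. This is a minimal sketch under stated assumptions: the helper name vanished_words is hypothetical, and both inputs are expected to be lists in the shape returned by page.get_text("words"), where the word string is the fifth item of each tuple.

```python
from collections import Counter


def vanished_words(words_before, words_after, redacted_terms):
    """List words lost between two extractions that were not meant to be redacted.

    words_before / words_after -- lists shaped like page.get_text("words"),
                                  i.e. tuples with the word string at index 4
    redacted_terms             -- the strings that were passed to search_for()
    (Hypothetical helper for illustration; not part of PyMuPDF.)
    """
    before = Counter(w[4] for w in words_before)
    after = Counter(w[4] for w in words_after)

    # words that are expected to go away because they were redacted on purpose
    intended = set()
    for term in redacted_terms:
        intended.update(term.split())

    lost = before - after  # keeps only words that occur less often afterwards
    return sorted(word for word in lost if word not in intended)


# Possible usage around the redaction code above:
#   words_before = page.get_text("words")   # before page.apply_redactions()
#   ...add and apply the redactions...
#   words_after = page.get_text("words")
#   print(vanished_words(words_before, words_after, sensitive))
```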
I'm using version 1.21.1.
Any help would be much appreciated.
Thanks,
Aviad
original file
bounding box file