blocks not identified and so disappear from exported PDF #2455

@aviadkl

Hi,

I'm trying to redact a few words from a PDF file.
The intended words are redacted, but other, unrelated words disappear from the file. They are not redacted; they simply vanish.
To locate the problem, I ran this code:

import fitz

fitz.TOOLS.set_small_glyph_heights(True)
doc = fitz.open("file1.pdf")
page = doc[0]
page.clean_contents()
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] != 0:  # skip non-text (image) blocks
        continue
    for l in b["lines"]:
        for s in l["spans"]:
            bbox = fitz.Rect(s["bbox"])
            # grow boxes that are shorter than the font size
            if bbox.height < s["size"]:
                bbox.y0 = bbox.y1 - s["size"]
            # outline every span in red
            page.draw_rect(bbox, width=0.3, color=(1, 0, 0))
doc.save("file1_redacted.pdf")

In the new PDF, every word has a red bounding box around it except the words that go missing from the redacted file. Looking a bit further, I noticed that these spans are also missing from the blocks object, yet they do appear when I run page.get_text("text").
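One way to narrow this down is to list exactly which words the plain-text extraction reports but the "dict" extraction does not. The sketch below is a diagnostic aid only, not a fix. The helper name `words_missing_from_dict` is hypothetical, and it assumes the block/line/span layout returned by `page.get_text("dict")`.

```python
from collections import Counter


def words_missing_from_dict(plain_text, blocks):
    """Return words present in plain_text but absent from the dict spans.

    plain_text: the string returned by page.get_text("text")
    blocks:     the list returned by page.get_text("dict")["blocks"]
    """
    # Count every whitespace-separated word found in the text spans.
    span_words = Counter()
    for block in blocks:
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                span_words.update(span["text"].split())

    # Walk the plain text and report words with no matching span word.
    missing = []
    for word in plain_text.split():
        if span_words[word] > 0:
            span_words[word] -= 1  # consume one matching occurrence
        else:
            missing.append(word)
    return missing
```

For the file in this report it could be called as `words_missing_from_dict(page.get_text("text"), page.get_text("dict")["blocks"])`. A word that is split across two spans would show up as a false positive, so the result is a hint rather than proof.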

The base code I'm running to redact the file is:

import fitz
import re

fitz.TOOLS.set_small_glyph_heights(True)
doc = fitz.open("file1.pdf")
page = doc[0]
page.wrap_contents()
lines = page.get_text("text").split("\n")


def get_sensitive_data(lines):
    # the name and ID are taken from the two lines after the "PACIENTE:" label
    reg_exp = r"PACIENTE:"
    for i, line in enumerate(lines):
        if re.search(reg_exp, line, re.IGNORECASE):
            found_name = lines[i + 1]
            found_id = lines[i + 2]
            return [found_name, found_id]
    return []  # label not found: nothing to redact


sensitive = get_sensitive_data(lines)

for data in sensitive:
    areas = page.search_for(data)
    for rect in areas:
        page.add_redact_annot(rect, fill=(0, 0, 0))

page.apply_redactions()
doc.save("file1_redacted.pdf")

I'll add a few screenshots to illustrate this. Notice that the words "Rh (D)" and "Biometría Hemática" disappear.
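To confirm which words vanish without being covered by any redaction, the word lists taken before and after `apply_redactions()` can be compared against the redaction rectangles. This is a minimal sketch, assuming word tuples in the `(x0, y0, x1, y1, word, ...)` layout returned by `page.get_text("words")`. Both helper names are hypothetical.

```python
def rects_intersect(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def lost_outside_redactions(words_before, words_after, redact_rects):
    """Return words that disappeared although no redaction rectangle touches them."""
    # Identify a surviving word by its rounded top-left corner and its text.
    remaining = {(round(w[0], 1), round(w[1], 1), w[4]) for w in words_after}
    lost = []
    for w in words_before:
        key = (round(w[0], 1), round(w[1], 1), w[4])
        if key in remaining:
            continue  # word survived
        if any(rects_intersect(w[:4], r) for r in redact_rects):
            continue  # word was intentionally redacted
        lost.append(w[4])
    return lost
```

In the script above, `words_before` would be captured with `page.get_text("words")` before the annotations are added, `redact_rects` would collect every rectangle returned by `page.search_for(data)`, and `words_after` would be read after `page.apply_redactions()`.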

I'm using PyMuPDF version 1.21.1.

Any help would be much appreciated.
Thanks,

Aviad

[Screenshot: original file]

[Screenshot: bounding box file]
