Replace text #1674

mameIIas · 2022-04-12T07:58:00Z

mameIIas
Apr 12, 2022

Hi, I'm back again. I'm trying to replace text in a PDF. It worked well in almost all of the PDFs I tried, but this one has the characters too close together, and the white box generated for the text seems to be too big and deletes the text below:

I guess it has something to do with the rect, but I don't know how to change it, because it worked with the other examples.
The code used:

rl = page.searchFor(nombre)
sustituto = "RIGOBERTO"

for rect in rl:
    for b in page.getText("dict", clip=rect)["blocks"]:
        for l in b["lines"]:
            for span in l["spans"]:
                fsize = span["size"]
                origin = fitz.Point(span["origin"])  # the insertion point
                flags = span["flags"]
                if flags & 2 ** 3:  # is this font monospaced?
                    font = "cour"  # use Courier for new text
                else:
                    font = "helv"  # else stick with Helvetica
    page.addRedactAnnot(rect)  # redact the word
    page.apply_redactions()  # and immediately apply!
    # insert the new text separately - outside redaction


    tl = fitz.getTextlength(nombre, fontname=font, fontsize=fsize)
    #tl = fitz.getTextlength(sustituto, fontname=font, fontsize=fsize)
    fsize = fsize * rect.width / tl 
    page.insertText(origin, sustituto, fontname=font, fontsize=fsize)#, color=blue
doc.save("sustituto_pymu.pdf")

Another problem is that if the name starts on one line and ends on the next, it gets replaced three times:

PS: The first PDF file is attached in case you need to test something.

Thanks in advance
sustituto_pymu.pdf

Answered by JorjMcKie

Apr 20, 2022


JorjMcKie · 2022-04-12T10:39:17Z

JorjMcKie
Apr 12, 2022
Maintainer

Unexpectedly large replacement rectangles are due to overlapping lines in the PDF. Unfortunately, some PDF creators place lines at a smaller vertical distance than the line height (a font property, computed as (font.ascender - font.descender) * fontsize). The redaction algorithm (inside the base library MuPDF, so it cannot be influenced) removes every character that intersects the redact rect.
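To make the overlap concrete, here is a small sketch of the line-height formula mentioned above. It uses plain numbers instead of a real font: the ascender and descender values are illustrative assumptions (roughly Helvetica-like), and `lines_overlap` and the 11-point line distance are hypothetical, chosen only for this example.

```python
def line_height(ascender: float, descender: float, fontsize: float) -> float:
    """Line height as described above: (ascender - descender) * fontsize."""
    return (ascender - descender) * fontsize


def lines_overlap(line_distance: float, ascender: float,
                  descender: float, fontsize: float) -> bool:
    """True if consecutive lines sit closer together than the line height."""
    return line_distance < line_height(ascender, descender, fontsize)


# Illustrative metrics only (assumed, not read from a real font).
ASC, DESC = 1.075, -0.299

height = line_height(ASC, DESC, fontsize=10)   # noticeably taller than 10 pt
tight = lines_overlap(11, ASC, DESC, 10)       # lines 11 pt apart: bboxes overlap
# If the line height were reduced to fontsize (ascender 1, descender 0),
# the same two lines would no longer overlap.
small = lines_overlap(11, 1.0, 0.0, 10)
```

With bboxes taller than the distance between lines, a redaction rect taken from one line will inevitably touch characters of its neighbours.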

A first way out is setting a global PyMuPDF parameter: fitz.TOOLS.set_small_glyph_heights(True). This overrides the font line height by setting it to fontsize. It affects the rectangles / bboxes returned by page.search_for() and page.get_text() (only for options other than (X)HTML and XML).

If lines still overlap vertically, the PDF creator made a really ugly looking document! You can still work around this situation by creating rectangles of very small height for the redactions only (e.g. 20% of the original height around the middle horizontal line of the original rectangle), while still using the original rectangle for inserting the new text.
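The "small-height rectangle" idea could be sketched roughly as follows. The helper name `thin_rect` is hypothetical, and rectangles are plain `(x0, y0, x1, y1)` tuples here so the snippet runs without a PDF.

```python
def thin_rect(rect, fraction=0.2):
    """Return a rect of the same width whose height is `fraction` of the
    original, centred on the original's horizontal middle line."""
    x0, y0, x1, y1 = rect
    mid = (y0 + y1) / 2
    half = (y1 - y0) * fraction / 2
    return (x0, mid - half, x1, mid + half)


hit = (100.0, 200.0, 180.0, 214.0)   # made-up search hit
redact = thin_rect(hit)              # thin strip used only for the redaction
```

In a real script the thin tuple would presumably be wrapped as `fitz.Rect(redact)` before being passed to `page.add_redact_annot()`, while the original hit rectangle is kept for positioning the replacement text.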

1 reply

mameIIas Apr 12, 2022
Author

Adding "fitz.TOOLS.set_small_glyph_heights(True)" and resizing the rect with "rect = rect * 0.999" gives me this:

So thank you so much!

JorjMcKie · 2022-04-12T10:50:09Z

JorjMcKie
Apr 12, 2022
Maintainer

Another problem is that if the name starts in one line and ends in the next one it replaces it three times.

This is something you must handle yourself. Search returns several rectangles if the needle is spread across more than one line. This may happen for needles that contain spaces or are hyphenated: if needle = "method" and the word has been hyphenated like meth-/ od, you will get two rectangles covering meth and od respectively; the hyphen will not be covered.
The standard search flags include hyphenation detection, which yields the above behavior (and finds hyphenated needles).
So in general, you must check the contents of the hit rectangles, e.g. with page.get_textbox(hitrect).

1 reply

mameIIas Apr 12, 2022
Author

The pdfs I'm working on won't have that meth-od feature, they are just name strings like "Emilio Jose Martí Gómez", so half of the time the name will be like:
ashdsadasdas Emilio Jose \n
Martí Gómez sadasdasdf \n
next line. \n
And I don't know how to implement the hitrect: do I pass it as a string to that function? Do I have to define it first?

JorjMcKie · 2022-04-12T12:12:48Z

JorjMcKie
Apr 12, 2022
Maintainer

Ok, so maybe your names will never be hyphenated. I just mentioned it in case.
But because of the spaces in those names, you may get up to 4 hit rectangles for one name (although just two is more realistic, as in your example).
So you must find out which content is in those rectangles; there is no way around it.
Walk through the contents of the hits and check whether the name is already complete. If not, check the next rectangle, etc.
Once you have all the rects making up one name, decide in which of them you want to place your substitute text.

2 replies

mameIIas Apr 12, 2022
Author

Okay, thanks for the guidance. So if I have a 4-element name with 2 elements on one line and 2 on the next, I can compare the y positions of the bboxes to know that they are on different lines. Then, as half of the name is on the upper line, I replace it with half of the substitute, and do the same for the next line, right?

PS: I used the 1/2 example because I could use fractions to split the name between lines.

JorjMcKie Apr 12, 2022
Maintainer

Something like that. If you are sure that your name only occurs once on a page, things are simple.
Otherwise,

take a hit rect and extract its text
if text != needle, check the following hit rects until you find one with needle.endswith(text); then you are done with this name occurrence
continue with the next hit rect ...
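The steps above can be sketched on plain strings. In this hypothetical helper, `hit_texts` stands for the texts one would extract from consecutive hit rects (for example with `page.get_textbox(hitrect)`); the function name and the sample data are assumptions made for illustration.

```python
def group_hits(hit_texts, needle):
    """Group consecutive search hits into occurrences of `needle`.

    Returns a list of index lists; each inner list holds the indices of the
    hit rects that together make up one occurrence of the name.
    """
    groups, current = [], []
    for idx, text in enumerate(hit_texts):
        text = text.strip()
        current.append(idx)
        complete = text == needle
        # A later fragment closes the occurrence when the needle ends with it.
        closes = len(current) > 1 and needle.endswith(text)
        if complete or closes:
            groups.append(current)
            current = []
    return groups


needle = "Emilio Jose Martí Gómez"
hit_texts = [
    "Emilio Jose", "Martí Gómez",   # one name split across two lines
    "Emilio Jose Martí Gómez",      # one name on a single line
    "Emilio", "Jose Martí Gómez",   # another split
]
occurrences = group_hits(hit_texts, needle)
```

Each group can then be treated as one name occurrence: redact all of its rects and place the substitute text in one of them.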

mameIIas · 2022-04-13T06:22:21Z

mameIIas
Apr 13, 2022
Author

Is a hit rect the "rect" from my previous example? Because the names can occur more than once, haha.

2 replies

JorjMcKie Apr 13, 2022
Maintainer

It is simply one of the items returned by page.search_for().

mameIIas Apr 13, 2022
Author

Okay okay, I'm going to try to do it, and if I achieve it or if I fail I'll keep you informed

JorjMcKie · 2022-04-13T15:19:13Z

JorjMcKie
Apr 13, 2022
Maintainer

Just to confuse you a bit more:
You could also simply extract the page's words via words = page.get_text("words", sort=True). This will deliver those long names as 4 items in correct sequence already.
You simply need to iterate through the list which consists of items like (x0, y0, x1, y1, "word", bn, ln, wn). Whenever "word" == "Emilio" then you know that the next 3 items will be the ones for "Jose", "Martí" and "Gómez".

1 reply

mameIIas Apr 20, 2022
Author

And here I'm kind of lost. Is this more or less the way to go?

    for rect in page.searchFor(nombre):
        for words in page.get_text("words", sort=True):
            count = 0
            if words[4][0] == nombre[0]: #equal to first element of the string name
                for elem in nombre: #(x0, y0, x1, y1, "word", bn, ln, wn)
                    if words[2][0] != words[2] [elem]: #they are on a different line y0[0]!=y0[1]
                        count = count + 1 #counts how many elem are in different pos comparing with nombre[0]
                                   
                                    #print in first line len(nombre)-count elements
                                    #print in second line count elements
            else:#previous code

mameIIas · 2022-04-20T09:44:10Z

mameIIas
Apr 20, 2022
Author

Hi, I'm back. Sorry for answering late, I was on a short vacation. I'm still working on the two-line replacement, but I found a problem with one PDF. In that one I can replace everything except one string: if I try to replace a name, in this case "Antonio González", it says NameError: name 'origin' is not defined. But it is defined, and the string is in the PDF. Do you know why this might happen?
Thanks in advance.

2 replies

JorjMcKie Apr 20, 2022
Maintainer

This name "origin" only exists in "span" and "char" dictionaries which are returned as part of get_text("dict") (or "rawdict") responses.
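As an illustration of where `origin` lives, the sketch below walks a hand-written dictionary shaped like the "dict" output used in the original code (blocks, lines, spans). The sample data and the helper name `first_span_info` are made up. It also hints at one plausible way to run into the NameError: a variable that is only assigned inside the span loop stays undefined whenever the clipped area yields no spans.

```python
def first_span_info(textdict):
    """Return (fontsize, origin) of the first text span, or None if there is none."""
    for block in textdict.get("blocks", []):
        for line in block.get("lines", []):   # image blocks carry no "lines"
            for span in line["spans"]:
                return span["size"], tuple(span["origin"])
    return None   # explicit result instead of an undefined variable


# Hand-written sample mimicking the structure, not real extractor output.
sample = {
    "blocks": [
        {"lines": [
            {"spans": [{"size": 11.0, "origin": (72.0, 100.0), "flags": 0}]}
        ]}
    ]
}

info = first_span_info(sample)              # the first span's size and origin
empty = first_span_info({"blocks": []})     # nothing found, so None
```

Checking the result for `None` before redacting and inserting avoids using `origin` when no span was found for a hit rectangle.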

mameIIas Apr 20, 2022
Author

Okay, that happened because, while trying to solve another problem I had (the replacement name was printed in every bbox), I made a change that solved it but also caused this one. Thank you!

JorjMcKie · 2022-04-20T15:35:17Z

JorjMcKie
Apr 20, 2022
Maintainer

If you use get_text("words",sort=True) you don't need search!

nombre = "Emilio Jose Martí Gómez"
nombre_t = nombre.split(" ")
lnombre = len(nombre_t)  # number of name components
words = page.get_text("words", sort=True)
i = 0
while i < len(words):
    word = words[i]
    if word[4] == nombre_t[0]:  # found 1st part of name
        rects = [fitz.Rect(w[:4]) for w in words[i : i+lnombre]]
        # rects = list of rectangles containing the full name
        # process them adequately, then ...
        i += lnombre
    else:
        i += 1
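The loop above only tests the first component of the name. A variant that compares every component, written against plain word tuples so it can be tried without a PDF, might look like the sketch below. The function name `find_name` and the sample word list are assumptions; the tuples follow the (x0, y0, x1, y1, "word", bn, ln, wn) layout described earlier.

```python
def find_name(words, name):
    """Return, for each occurrence of `name`, the list of bboxes of its components."""
    parts = name.split()
    n = len(parts)
    found = []
    i = 0
    while i + n <= len(words):
        window = words[i:i + n]
        if [w[4] for w in window] == parts:        # all components must match
            found.append([tuple(w[:4]) for w in window])
            i += n
        else:
            i += 1
    return found


# Made-up word list: the name is split across two lines, followed by a lone "Emilio".
words = [
    (10, 10, 40, 22, "Hola", 0, 0, 0),
    (45, 10, 80, 22, "Emilio", 0, 0, 1),
    (85, 10, 110, 22, "Jose", 0, 0, 2),
    (10, 24, 40, 36, "Martí", 0, 1, 0),
    (45, 24, 80, 36, "Gómez", 0, 1, 1),
    (85, 24, 120, 36, "Emilio", 0, 1, 2),
    (125, 24, 150, 36, "dijo", 0, 1, 3),
]
hits = find_name(words, "Emilio Jose Martí Gómez")
```

One caveat: a word directly followed by punctuation (such as "Gómez,") would not compare equal in this sketch and would need extra handling.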

1 reply

mameIIas Apr 21, 2022
Author

Okay, now I get how to do the search correctly, but I don't get the redact/insert/origin part. In this case, do I have to redact and insert twice? Or do two redactions and then insert?

lnombre = len(nombre_t)  # number of name components
skip_line = 0
for page in doc:
	words = page.get_text("words", sort=True)
	i = 0
	while i < len(words):
		word = words[i]
		if word[4] == nombre_t[0]:  # found 1st part of name
			rects = [fitz.Rect(w[:4]) for w in words[i : i+lnombre]]
			# rects = list of rectangles containing the full name
			i += lnombre
			for elem in rects:
				if rects[0][2] != rects[elem][2] and skip_line == 0: #they are on a different line y0[0]!=y0[1]
					skip_line = elem # save which elem goes into the next line
			rect1 = [:elem] #first line words
			#redact rect1 at rects[0] position
			rect2 = [elem:] #second line words
			#redact rect 2 at rects[elem] position
			
		###normal replacing code
		else:
			i += 1
			for l in word["lines"]:
				for span in l["spans"]:
					fsize = span["size"]
					origin = fitz.Point(span["origin"])  # the insertion point
					font = "helv"  # stick with Helvetica
		rect = rect * 0.999
		page.addRedactAnnot(rect)  # redact the word
		page.apply_redactions()  # and immediately apply!
        # insert the new text separately - outside redaction

		
		####scaling
        #tl = fitz.getTextlength(nombre, fontname=font, fontsize=fsize)
        #tl = fitz.getTextlength(sustituto, fontname=font, fontsize=fsize)
        #fsize = fsize * rect.width / tl 
		
        page.insertText(origin, sustituto, fontname=font, fontsize=fsize)#, color=blue
		
#doc.save("sustituto_pymu_linea.pdf")
doc.save(salida)

JorjMcKie · 2022-04-21T09:35:11Z

JorjMcKie
Apr 21, 2022
Maintainer

The origin depends on the font you use for insertion. There are two values: font.ascender (positive) and font.descender (negative). Look at this picture:

So if you use Helvetica for insertion, get the font's ascender and add it, scaled by the rect height, to the y-coordinate of the rect's top-left point:

helv = fitz.Font("helv")
origin = rect.tl + (0, helv.ascender * rect.height)
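The same arithmetic can be checked on plain tuples. In this sketch the ascender value is an illustrative assumption rather than one read from a real font, and `insertion_origin` is a hypothetical helper name.

```python
def insertion_origin(rect, ascender):
    """Insertion point: the rect's top-left, moved down by ascender * rect height."""
    x0, y0, x1, y1 = rect
    height = y1 - y0
    return (x0, y0 + ascender * height)


ASCENDER = 1.075                       # assumed, roughly Helvetica-like
rect = (100.0, 200.0, 180.0, 210.0)    # made-up word bbox, 10 pt high
origin = insertion_origin(rect, ASCENDER)

# If the text is then inserted with fontsize equal to the rect height,
# its ascender reaches exactly up to the rect's top edge:
fontsize = rect[3] - rect[1]
top_of_text = origin[1] - ASCENDER * fontsize
```

So the formula lines up the top of the new text with the top of the redacted rectangle, provided the font size used for insertion is derived from the same rect height.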

3 replies

mameIIas Apr 21, 2022
Author

Okay, thanks for the image. So when I apply that to my code, will it be something like this?

```
rect1 = [:elem]  # first line words
origin_1 = rect1.tl + (0, helv.ascender * rect1.height)
page.addRedactAnnot(rect1)
page.apply_redactions()
page.insertText(origin_1, sustituto_1, fontname=font, fontsize=fsize)

# here the same for rect2
```

And to get the fsize, do I have to iterate through the spans and word["lines"] like in the previous example?
Thanks in advance

JorjMcKie Apr 21, 2022
Maintainer

to get the fsize
Choose this to be the rect.height

mameIIas Apr 22, 2022
Author

Okay, that's working now. The thing is, it finally replaces both texts on both lines, but it erases everything else, haha, and I don't know why this happens.

sustituto_1 = "Carlos del"
sustituto_2 = "Río Sanjuan"
nombre_t = nombre.split(" ")
lnombre = len(nombre_t)  # number of name components
skip_line = 0
helv = fitz.Font("helv")
font = "helv"
for page in doc:
	words = page.get_text("words", sort=True)
	i = 0
	while i < len(words):
		word = words[i]
		if word[4] == nombre_t[0]:  # found 1st part of name
			rects = [fitz.Rect(w[:4]) for w in words[i : i+lnombre]]
			# rects = list of rectangles containing the full name
			i += lnombre
			for index,elem in enumerate(rects):
				if rects[0][2] != rects[index][2] and skip_line == 0: #they are on a different line y0[0]!=y0[1]
					skip_line = index # save which elem goes into the next line
					
					#redact rect1 at rects[0] position
					rect1 = rects[:index] #first line words
					origin_1 = rect1[0].tl + (0, helv.ascender * rect1[0].height)
					
					page.addRedactAnnot(rect1)
					page.apply_redactions()
					
					

					
					#redact rect2 at rects[index] position
					rect2 = rects[index:] #second line words
					origin_2 = rect2[index].tl + (0, helv.ascender * rect2[index].height)
					
					page.addRedactAnnot(rect2)
					page.apply_redactions()
					
					page.insertText(origin_1, sustituto_1, fontname=font, fontsize=rect1[0].height)
					page.insertText(origin_2, sustituto_2, fontname=font, fontsize=rect2[index].height)
					#redact rect 2 at rects[elem] position
				###normal replacing code
		      else:
			      i += 1
			      """for l in word["lines"]:
				      for span in l["spans"]:
					      fsize = span["size"]
					      origin = fitz.Point(span["origin"])  # the insertion point
					      font = "helv"  # stick with Helvetica
		      rect = rect * 0.999
		      page.addRedactAnnot(rect)  # redact the word
		      page.apply_redactions()  # and immediately apply!
              # insert the new text separately - outside redaction
      
		      
		      ####scaling
              #tl = fitz.getTextlength(nombre, fontname=font, fontsize=fsize)
              #tl = fitz.getTextlength(sustituto, fontname=font, fontsize=fsize)
              #fsize = fsize * rect.width / tl 
		      
              page.insertText(origin, sustituto, fontname=font, fontsize=fsize)#, color=blue
		"""
#doc.save("sustituto_pymu_linea.pdf")
doc.save(salida)

JorjMcKie · 2022-04-22T10:18:16Z

JorjMcKie
Apr 22, 2022
Maintainer

As I said a few posts ago:

if all rectangles are on the same line (y1 of the 1st rect equals y1 of the last), you can build one redaction rect from the union of them all
otherwise, build two redaction rects for the two lines where parts of the name appear. Then choose in which of the two to put the replacement text and leave the other one empty.

6 replies

mameIIas Apr 22, 2022
Author

And when I do the replacement on one line:

The words end up spaced apart, and I also don't know why :(

elif rects[0][2] == rects[-1][2]:
    origin = rects[-1].tl + (0, helv.ascender * rects[-1].height)
    page.addRedactAnnot(rects)
    page.apply_redactions()
    page.insertText(origin, sustituto_3, fontname=font, fontsize=rects[-1].height)

JorjMcKie Apr 22, 2022
Maintainer

Oha. Why don't you do this ("rects" is the bbox list of name components):

newrects = [] # will contain rectangles joined per line
bottoms = set([bbox[3] for bbox in rects]) # contains the y1 values for the rects
for y in bottoms:
    rect = fitz.EMPTY_RECT()
    for r in rects:
        if r[3] == y:
            rect |= r
    newrects.append(rect)
# there are as many rectangles in newrects now as there are lines with names components
for r in newrects:
    page.add_redact_annot(r)
page.apply_redactions()
page.insert_text(newrects[0], ...)
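The per-line joining can also be tried on plain tuples. The sketch below differs from the snippet above in two assumed ways: it sorts the rectangles so the top line always comes first (a Python set has no guaranteed order), and it compares bottoms with a small tolerance instead of exact float equality. The name `join_per_line` and the tolerance of 1 point are hypothetical choices.

```python
def join_per_line(rects, tol=1.0):
    """Union rects whose bottoms (y1) agree within `tol`; the top line comes first."""
    lines = []
    for x0, y0, x1, y1 in sorted(rects, key=lambda r: (r[3], r[0])):
        if lines and abs(lines[-1][3] - y1) <= tol:
            a0, b0, a1, b1 = lines[-1]
            # Same line: grow the current union to include this rect.
            lines[-1] = (min(a0, x0), min(b0, y0), max(a1, x1), max(b1, y1))
        else:
            # A new line starts here.
            lines.append((x0, y0, x1, y1))
    return lines


# Made-up bboxes: two name parts on the first line, two on the second.
rects = [
    (45, 10, 80, 22), (85, 10, 110, 22.0001),
    (10, 24, 40, 36), (45, 24, 80, 36),
]
newrects = join_per_line(rects)
```

Note that `page.insert_text` expects an insertion point rather than a rectangle, so the first joined rect would still need to be converted into an origin, for example with the ascender formula from the earlier reply.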

mameIIas Apr 22, 2022
Author

With my code above, the text repeats itself along the whole line, but yours shows:

Two more things: I can't solve the erasing problem, and I don't know why comparisons are not working:
if rects[0][0] == rects[-1][0]:
This checks whether x0 of the first element equals x0 of the last element, but execution never enters the branch, even though they are indeed on the same line.
Sorry for all the messages and the bother.

JorjMcKie Apr 22, 2022
Maintainer

You for sure owe me something!
I thought it was clear that you need to adjust the insert_text! The dots just indicate that you should fill in your other parameters.

mameIIas Apr 25, 2022
Author

After filling it in with page.insert_text(newrects[0], sustituto_1, fontname=font, fontsize=newrects[0].height)
I'm getting this error:

And the thing is, I still don't know how to solve the problem of the whole page being erased. What do I have to do?
Thanks in advance
