Can search and replace functions be added to python-docx? #30

Open
frenet opened this issue Mar 26, 2014 · 35 comments

Comments

@frenet

frenet commented Mar 26, 2014

It is very easy to create a docx file with python-docx, but I would like to search for specific words and count how many times they occur. How can I do that in python-docx? I know this can be done in mikemaccana/python-docx, but its code grammar is different from python-openxml/python-docx, and I would rather not switch to mikemaccana/python-docx.

@scanny
Contributor

scanny commented Mar 26, 2014

@frenet you should be able to get most of what you describe with something like this:

document = Document('your_file.docx')
for paragraph in document.paragraphs:
    paragraph_text = paragraph.text
    # ... code to search paragraph_text for desired word(s)
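
To make the placeholder comment above concrete, here is a minimal sketch of the counting step. It works on plain strings, so it can be fed `paragraph.text` values from any source. The name `count_phrase` and its `ignore_case` parameter are made up for illustration and are not part of python-docx.

```python
import re


def count_phrase(paragraph_texts, phrase, ignore_case=True):
    """Return how many times `phrase` occurs across an iterable of strings.

    Hypothetical helper: `phrase` is matched literally (it is escaped), and a
    phrase that happens to be split across two paragraphs is not counted.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(phrase), flags)
    return sum(len(pattern.findall(text)) for text in paragraph_texts)


# Possible use with python-docx (assumes `document` was opened as shown above):
# count_phrase((p.text for p in document.paragraphs), "sustainable development")
```

Because it reads `paragraph.text`, a phrase that spans several runs inside one paragraph is still counted; only replacement needs run-level care.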

If your document included tables you wanted to search as well, you would need to look there too, something like:

for table in document.tables:
    for cell in table.cells:
        for paragraph in cell.paragraphs:
            # ... same as above ...

There can be a recursion involved, not sure how frequent it is in practice, but structurally there's nothing preventing a paragraph within a table from itself containing a table, which contains paragraphs that can contain a table, etc. Right now python-docx only detects top-level tables, so that might be something to watch out for.
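
A rough sketch of that recursion is below. It assumes a python-docx version in which a table cell exposes `.paragraphs` and `.tables` in the same way the document does, and in which tables expose `.rows` with `.cells`; the function name `iter_paragraphs` is hypothetical. It relies only on those attributes, so it is written duck-typed.

```python
def iter_paragraphs(container):
    """Yield every paragraph in `container`, descending into nested tables.

    `container` is anything with `.paragraphs` and `.tables` (assumed here to
    be a document or a table cell). Paragraphs are yielded before tables, so
    the result is not strictly in document order.
    """
    for paragraph in container.paragraphs:
        yield paragraph
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # a cell is itself a container, so recurse into it
                yield from iter_paragraphs(cell)
```

One thing to watch for: with merged cells the same cell may be reported more than once per row, so a paragraph could be visited twice. For counting, that would need de-duplication.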

I'm sure it will be good to have something that allows search and replace operations. I'm not quite sure what the search operation would return, since there's no concept so far of a 'cursor' location nor is there a concept of a word or phrase, only paragraphs and runs, which of course wouldn't always match up on a particular word. But I'll leave it here for more noodling when we get to this in the backlog.

@sk1tt1sh

sk1tt1sh commented Apr 3, 2014

@scanny awesome tool! Having some trouble along these same lines though. This spits out an attribute error when trying to update the text.
(added to api.py):

    def paragraph_replace(self, search, replace):
        searchre = re.compile(search)
        for paragraph in self.paragraphs:
            paragraph_text = paragraph.text
            if paragraph_text:
                if searchre.search(paragraph_text):
                    paragraph.text = re.sub(search, replace, paragraph_text)
        return paragraph_text

The error. I'm okay at python but couldn't really see any way to fix this looking at text.py
"paragraph.text = re.sub(search, replace, paragraph_text)
AttributeError: can't set attribute"

@scanny
Contributor

scanny commented Apr 3, 2014

@sk1tt1sh Hmm, yes, this is a very useful use case. There's actually no way yet to do what you're trying to accomplish here.

The specific error is raised because the .text property on Paragraph is read-only; there is no property setter yet.

Let me have a noodle on it and see what new features might best serve here. It would be possible to add a .text setter on Paragraph that removes everything that was there and replace it with a single new run containing the assigned string. Any existing formatting would be lost and certain uncommon cases would need to be accounted for, like when the paragraph contained a hyperlink or a picture. It would essentially imply a delete_paragraph() method, which we've been wanting to have anyway.

Probably be a while though, I'm busy on python-pptx just at the moment. Let me know if you're interested in taking a crack at it yourself, I might be able to help point you in the right direction.

@sk1tt1sh

sk1tt1sh commented Apr 3, 2014

@scanny Thanks for the quick reply.

I definitely understand the being busy.

I can try taking a swing at it if you can give me an idea of how you want to do it. I'll try to conform to pep8 and naming conventions as much as possible :)

I'd like to note that my usage case is to have a prebuilt document with some sections that will be modified, so removing the paragraph and adding it at the end of the document is not ideal. It would make more sense at that point to just build it from scratch.

@scanny
Contributor

scanny commented Apr 3, 2014

A good place to start would be having a few methods local to your solution that get you what you need. We can address the question of getting permanent features into python-docx a bit later, that tends to be substantially more involved.

I believe if you had a clear_paragraph() function, that would get the job done for you. It would work something like:

paragraph = ... matching paragraph ...
clear_paragraph(paragraph)
paragraph.add_run(replacement_text)

The clear_paragraph() function would look something like this:

def clear_paragraph(paragraph):
    p_element = paragraph._p
    p_child_elements = [elm for elm in p_element.iterchildren()]
    for child_element in p_child_elements:
        p_element.remove(child_element)

I don't have time to test this right this minute, but give it a try if you want and let me know if it gives you any trouble. What you're doing with it is manipulating the lxml elements that underlie the paragraph.

It's possible you might need to be more sophisticated too, this brute-force approach will get rid of paragraph properties you might prefer to preserve. But it's a good start anyway :)
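
As a sketch of the "more sophisticated" variant, the function below removes everything except the paragraph-properties child (`w:pPr`), which is where paragraph-level formatting such as the style lives. The function name is hypothetical; the namespace URI is the standard WordprocessingML one. It only needs an element that supports iteration, `.tag` and `.remove()`, so it would be called with `paragraph._p`.

```python
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_PR_TAG = "{%s}pPr" % W_NS  # Clark-notation tag of the <w:pPr> element


def clear_paragraph_content(p_element):
    """Remove all children of a <w:p> element except its <w:pPr>.

    Runs, hyperlinks, inline pictures and so on are all dropped; only the
    paragraph properties survive.
    """
    # copy the child list first, because we mutate the element while looping
    for child in list(p_element):
        if child.tag != P_PR_TAG:
            p_element.remove(child)
```

Usage would be `clear_paragraph_content(paragraph._p)` followed by `paragraph.add_run(replacement_text)`. As far as I know, later python-docx releases offer a `Paragraph.clear()` method with similar behavior, which would be preferable where available.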

@sk1tt1sh

sk1tt1sh commented Apr 3, 2014

@scanny I lol'd at the "...that tends to be substantially more involved." line. I've been working my way through the internals of this thing and it is quite complex!! I am excited for this one though. I think once we iron out the remove/replace and get headers/footers manipulation down it will be awesome!

The loss of style could become an issue, especially when you're working with cells within a table and the cell has 2 differently styled paragraphs. I'll consider how I can preserve them along the way while I'm working with it. I'll have to let you know how I'm doing some time tomorrow.

Thanks!

@frenet
Author

frenet commented Apr 4, 2014

You all have inspired me. My initial requirement is very simple: count the occurrences of some specific words, such as "sustainable development", and search for a word such as "comparision" and replace it with another word, "comparison". I create a docx document from scratch instead of using a replace function in place. I try to use as few functions as possible.

My further interest is in how to use regular expressions to do text mining.
Thank you very much.

@sk1tt1sh

sk1tt1sh commented Apr 7, 2014

Hey @scanny ... sorry for the delay on the update. After a little testing here's what I've come up with.

Limitations:

  • Haven't tested it on tables yet as tablesearch isn't working quite right.
  • Cannot replace multiple things. (You could use lists and loop through those)
  • Lose styling. You could potentially set paragraph_replace(self, search, replace, style=None) and pass that during paragraph.add_run() if that is modified.

I forked it, here's a link if you are alright with it I'll submit a pull request.
https://github.com/sk1tt1sh/python-docx/blob/develop/docx/api.py

Overall thank you! This did exactly what I needed. Also, it doesn't re-order the paragraphs.

    def paragraph_replace(self, search, replace):
        searchre = re.compile(search)
        for paragraph in self.paragraphs:
            paragraph_text = paragraph.text
            if paragraph_text:
                if searchre.search(paragraph_text):
                    self.clear_paragraph(paragraph)
                    paragraph.add_run(re.sub(search, replace, paragraph_text))
        return paragraph

    def clear_paragraph(self, paragraph):
        p_element = paragraph._p
        p_child_elements = [elm for elm in p_element.iterchildren()]
        for child_element in p_child_elements:
            p_element.remove(child_element)

This is how I've modified the API.

@scanny
Contributor

scanny commented Apr 8, 2014

@sk1tt1sh Glad you got it working :)

A general solution here will require a substantial amount of other work, so probably best if we leave this here as a feature request and I'll come back to it once I've burned down some other backlog. Also any pulls require all the tests etc., but all that only makes sense once the API is determined.

In particular, a general-purpose API for search/replace is a challenging problem (the good kind of challenging :). I'm thinking it entails this concept called 'Range' in the Microsoft API, which is essentially a range of characters, as though the characters in the document were arranged into a single string and each had its character offset, e.g. range(3, 6) on 'foobar' would have the text value 'bar'. The challenge being different parts of the range could occur in different runs etc. and could start and end at other than element boundaries. So how you translate that into XML element manipulations gets pretty complex; especially when you figure in all the other revision marks etc. elements Word puts in.
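
To illustrate the offset bookkeeping such a 'Range' would need, here is a small sketch that maps a character range onto the runs it touches. It works on a plain list of run texts rather than real run objects, and the name `locate_range` is hypothetical; it is not an existing python-docx API.

```python
def locate_range(run_texts, start, end):
    """Map the character range [start, end) of the joined text onto runs.

    Returns a list of (run_index, local_start, local_end) tuples, one for
    each run that contains part of the range.
    """
    pieces = []
    offset = 0  # position of the current run's first character in the joined text
    for index, text in enumerate(run_texts):
        run_start, run_end = offset, offset + len(text)
        # overlap between the requested range and this run, in joined-text offsets
        lo, hi = max(start, run_start), min(end, run_end)
        if lo < hi:
            pieces.append((index, lo - run_start, hi - run_start))
        offset = run_end
    return pieces


print(locate_range(["foo", "bar"], 3, 6))       # [(1, 0, 3)]
print(locate_range(["fo", "ob", "ar"], 1, 5))   # [(0, 1, 2), (1, 0, 2), (2, 0, 1)]
```

The second call shows the hard part: one range can start mid-run, cover a whole run, and end mid-run, and each piece may carry different formatting.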

I'll give it a noodle and come back to it when I free up a bit :)

@sk1tt1sh

sk1tt1sh commented Apr 8, 2014

Thanks @scanny I'll continue looking into it as well. I'm learning all kinds of stuff about python and office xml from this tool ;)

@scanny scanny added this to the v0.7.0 milestone May 1, 2014
@scanny scanny modified the milestones: v0.8.0, v0.7.0 Table Enhancements, later May 13, 2014
@holyoaks

Thanks @sk1tt1sh AND @scanny for getting a basic functional solution, as I need this too.

@elyparker

Is it possible to see the code for this solution?

@scanny
Contributor

scanny commented Mar 19, 2016

@Abolfazl - No, not yet. There's no current work on this as far as I know. In the meantime you'd need to run through the paragraphs and perhaps search paragraph.text to locate matches. Then you'd have to manipulate the text at the run level to do the replace. If you do, I'm sure you'll encounter the challenges that explain why this hasn't been taken up yet :)

It's actually a nice challenging problem. It's just waiting for someone motivated and able to pick it up and run with it.

@scanny scanny removed this from the later milestone Apr 9, 2016
@kdwarn

kdwarn commented Mar 20, 2017

I'm not saying this will work for all cases, but it worked pretty well for my scenario where I knew there would only be one instance in a paragraph - even kept style in the one run it should have:

for paragraph in document.paragraphs:
    for run in paragraph.runs:
        if 'text_you_search_for' in run.text:
            text = run.text.split('text_you_search_for')
            run.text = text[0] + 'replacement_text' + text[1]

@josephernest

@scanny For the table case, this does not work anymore:

for table in document.tables:
    for cell in table.cells:
        for paragraph in cell.paragraphs:

at least for the .docx I tried it with.

I had to use this instead:

for table in doc.tables:
    for col in table.columns:
        for cell in col.cells:
            for p in cell.paragraphs:

@josephernest

josephernest commented Apr 27, 2020

Here is a working solution which doesn't break the formatting (bold, italic):

for run in [run for par in doc.paragraphs for run in par.runs] + \
           [run for table in doc.tables for col in table.columns for cell in col.cells for par in cell.paragraphs for run in par.runs]:
    s = run.text.replace("foo", "bar")
    if s != run.text:    # rewrite run.text if (and only if) it has changed; *always* rewriting is not good, it could destroy column breaks, etc.
        run.text = s

It works for tables too.

@sanyoggupta

sanyoggupta commented May 4, 2020

Hi @scanny, is there any solution for searching and replacing a particular word in a paragraph without altering the formatting of the entire paragraph?

@bksim

bksim commented Jun 10, 2020

I think I have a solution, is there still active development on this?

@PedroReboli

PedroReboli commented Nov 3, 2020

@bksim
Can you share? I'm having some trouble figuring out how to replace text because it is split across two <w:r> elements.

@anitatea

@bksim
do you have a working solution? I'm working on something similar for a work project and would really love your input!

@PedroReboli

@anitatea @scanny I have found a way to replace text that spans runs without removing the style.
The average time to run this code on a paragraph with 300 letters on an FX 8300 is about 400 ms.
This code needs some improvement.

def clear_paragraph(paragraph):
	p_element = paragraph._p
	p_child_elements = [elm for elm in p_element.iterchildren()]
	for child_element in p_child_elements:
		if "'<w:r>'" in str(child_element):
			p_element.remove(child_element)
		
		
class Char ():
	def __init__(self,run ,Char : str):
		self.Char = Char
		self.style = run.style
		self.font  = run.font

class Base():
	def __init__(self,Char:Char = None):
		self.style = Char.style
		self.font  = Char.font

class ParagraphHandle ():
	def __init__(self,pagraph):
		self.Chars = []
		self.paragraph = pagraph
		for run in pagraph.runs:
			for x in range(len(run.text)): #convert 
				self.Chars.append(Char(run,run.text[x]))

	def replace(self,OldWord,NewWord):
		text = ""
		for x in self.Chars:
			text += x.Char
		fist = text.find(OldWord)
		
		if fist == -1: return False
		f = Base(self.Chars[fist])
		
		for i in range(len(OldWord)):
			self.Chars.pop(fist)
		i = 0
		
		for l in NewWord:
			self.Chars.insert(fist+i,Char(f,l))
			i += 1
		return True
	def build (self):
		if len(self.Chars) == 0 :return
		paraestilo = self.paragraph.style
		clear_paragraph(self.paragraph)
		self.paragraph.style = paraestilo
		runs = []
		fonts = []
		font = self.Chars[0].font
		run = ""
		for x in self.Chars:
			if x.font == font:
				run += x.Char
			else:
				runs.append(run)
				run = x.Char
				fonts.append(font)
				font = x.font
		runs.append(run)
		fonts.append(font)

		for i in range (len(runs)):
			run = self.paragraph.add_run(runs[i])
			fonte = run.font
			fonte.bold = fonts[i].bold
			fonte.color.rgb = fonts[i].color.rgb
			fonte.complex_script = fonts[i].complex_script
			fonte.cs_bold = fonts[i].cs_bold
			fonte.cs_italic = fonts[i].cs_italic
			fonte.double_strike = fonts[i].double_strike
			fonte.emboss = fonts[i].emboss
			fonte.hidden = fonts[i].hidden
			fonte.highlight_color = fonts[i].highlight_color
			fonte.imprint = fonts[i].imprint
			fonte.italic = fonts[i].italic
			fonte.math = fonts[i].math
			fonte.name = fonts[i].name
			fonte.no_proof = fonts[i].no_proof
			fonte.outline = fonts[i].outline
			fonte.rtl = fonts[i].rtl
			fonte.shadow = fonts[i].shadow
			fonte.size = fonts[i].size
			fonte.small_caps = fonts[i].small_caps
			fonte.snap_to_grid = fonts[i].snap_to_grid
			fonte.spec_vanish = fonts[i].spec_vanish
			fonte.strike = fonts[i].strike
			fonte.subscript = fonts[i].subscript
			fonte.superscript = fonts[i].superscript
			fonte.underline = fonts[i].underline
			fonte.web_hidden = fonts[i].web_hidden

and you should use like

hand = ParagraphHandle(paragraph) 
hand.replace("helow","hello") #replace once
while hand.replace("helow","hello"): #replace all
   pass
hand.build() #apply new text to paragraph

@Ryanauger95

^ This doesn't work at all.

@scanny
Contributor

scanny commented Jul 14, 2021

Something like this should do the trick on a paragraph-by-paragraph basis. You can just call it iteratively over document.paragraphs to get document-wide search and replace. Note headers and footers are not part of the main document so you'd have to iterate those separately. Also each cell of a table would have to be iterated separately, including recursively if there are nested tables.

import re

from docx import Document

regex = re.compile("foo")


def paragraph_replace_text(paragraph, regex, replace_str):
    """Return `paragraph` after replacing all matches for `regex` with `replace_str`.

    `regex` is a compiled regular expression prepared with `re.compile(pattern)`
    according to the Python library documentation for the `re` module.
    """
    # --- a paragraph may contain more than one match, loop until all are replaced ---
    while True:
        text = paragraph.text
        match = regex.search(text)
        if not match:
            break

        # --- when there's a match, we need to modify run.text for each run that
        # --- contains any part of the match-string.
        runs = iter(paragraph.runs)
        start, end = match.start(), match.end()

        # --- Skip over any leading runs that do not contain the match ---
        for run in runs:
            run_len = len(run.text)
            if start < run_len:
                break
            start, end = start - run_len, end - run_len

        # --- Match starts somewhere in the current run. Replace match-str prefix
        # --- occurring in this run with entire replacement str.
        run_text = run.text
        run_len = len(run_text)
        run.text = "%s%s%s" % (run_text[:start], replace_str, run_text[end:])
        end -= run_len  # --- note this is run-len before replacement ---

        # --- Remove any suffix of match word that occurs in following runs. Note that
        # --- such a suffix will always begin at the first character of the run. Also
        # --- note a suffix can span one or more entire following runs.
        for run in runs:  # --- next and remaining runs, uses same iterator ---
            if end <= 0:
                break
            run_text = run.text
            run_len = len(run_text)
            run.text = run_text[end:]
            end -= run_len

    # --- optionally get rid of any "spanned" runs that are now empty. This
    # --- could potentially delete things like inline pictures, so use your judgement.
    # for run in paragraph.runs:
    #     if run.text == "":
    #         r = run._r
    #         r.getparent().remove(r)

    return paragraph


if __name__ == "__main__":
    document = Document()
    paragraph = document.add_paragraph()
    paragraph.add_run("f").bold = True
    paragraph.add_run("o").bold = True
    paragraph.add_run("o to").bold = True
    paragraph.add_run(" you and ")
    paragraph.add_run("foo").bold = True
    paragraph.add_run(" to the horse")
    paragraph_replace_text(paragraph, regex, "bar")

    import pprint
    pprint.pprint(list(r.text for r in paragraph.runs))

>>> ['bar', '', ' to', ' you and ', 'bar', ' to the horse']
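
Since headers and footers have to be iterated separately, here is a hedged sketch of one way to gather them alongside the body paragraphs. It assumes a python-docx version where each section exposes `.header` and `.footer` objects with `.paragraphs` and `.is_linked_to_previous`; the function name is hypothetical. Tables inside headers and footers, and first-page or even-page variants, are left out for brevity.

```python
def iter_all_paragraphs(document):
    """Yield body paragraphs, then those of each section's own header and footer."""
    yield from document.paragraphs
    for section in document.sections:
        for part in (section.header, section.footer):
            # a linked header/footer has no content of its own; it only
            # repeats what the previous section defines, so skip it
            if not part.is_linked_to_previous:
                yield from part.paragraphs
```

Each paragraph it yields could then be passed to `paragraph_replace_text()` in a simple loop.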

@scanny
Contributor

scanny commented Jul 14, 2021

The concept here is that the matched string can and often will span multiple runs. In order to preserve character formatting, you need to operate at the run level, changing only the text of the required runs and leaving the rest of each run alone (the non-text run properties are where the character formatting lives).

Note that the replacement word gets the character formatting of the first character of the matched-string. This is because the initial prefix of the match string is entirely replaced by the replace-str; the rest of the match-str text is just deleted. One consequence of this is that if a matched phrase starts not-bold but has a bold word in it, the replacement will appear entirely not-bold. Likewise, if only the first word of a multi-word matched phrase is bold, the entire replacement string will appear in bold.

There are two cases for the matched-string:

  1. It occurs entirely in a single run, maybe in the beginning, middle, or end, that doesn't matter, just that its text can be entirely replaced in the text of a single run.

  2. The match str starts in one run and ends in a subsequent run. This includes the possibility of "spanning" one or more runs in the middle like "foobar" does here:

     a   l a r g e   f o o b a r   i s   s t i l l   s m a l l
    |-----------------|-----|---------------------------------|
    

    Note that any subsequent runs that are spanned or otherwise contain part of the match-str ("foobar") suffix start with that suffix. So handling the suffix characters is just a matter of deleting characters from the beginning of that run to the end of the suffix. The "end" position is decremented by the number of characters deleted until it goes to zero, signaling the replacement is complete.

In general, any "spanned" runs are left empty by this process. The few lines at the end remove any empty runs. That's optional but probably doesn't cost a lot and seems neater than leaving them around. Note that could remove things you'd rather keep around like inline pictures, so use your judgement. If you're clever you can probably work out how to do that just for "spanned" runs a few lines up.
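
For anyone who wants to experiment with the two cases without a document at hand, the same idea can be sketched on a plain list of run texts. This is a simplified stand-in rather than the function posted above; the name `replace_span` is hypothetical.

```python
def replace_span(run_texts, start, end, replacement):
    """Return new run texts with characters [start, end) replaced.

    The first run touched by the span receives the whole replacement; any
    later run touched by it only loses its share of the matched text.
    """
    result = []
    offset = 0          # joined-text offset of the current run
    inserted = False    # has the replacement been placed yet?
    for text in run_texts:
        run_start, run_end = offset, offset + len(text)
        offset = run_end
        lo, hi = max(start, run_start), min(end, run_end)
        if lo >= hi:  # this run does not overlap the match at all
            result.append(text)
            continue
        prefix, suffix = text[: lo - run_start], text[hi - run_start:]
        if not inserted:
            result.append(prefix + replacement + suffix)
            inserted = True
        else:
            result.append(prefix + suffix)
    return result


# case 1: match inside a single run
print(replace_span(["say foo now"], 4, 7, "bar"))
# ['say bar now']

# case 2: "foobar" starts in one run, spans a second, ends in a third
print(replace_span(["a large fo", "ob", "ar is still small"], 8, 14, "baz"))
# ['a large baz', '', ' is still small']
```

The empty string in the second result is the "spanned" run that is left behind.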

@cridenour

I had just started testing the solution from @PedroReboli and managed to speed it up 10-15x by working with the style_id instead of style on the run. Was going to investigate further but then @scanny comes and contributes a solution that improves performance by another 10-15x, well done.

@scanny
Contributor

scanny commented Jul 14, 2021

@cridenour what kind of times are you seeing performance-wise? I'm wondering how much optimization might be worthwhile.

I'm thinking the next frontier for performance improvements would be:

  1. Move the process down to the oxml/lxml level, probably getting all the w:r elements at once with a single XPath call and working with them from there.
  2. Use re.finditer() to get all matches in a single call, avoiding reparsing after every match. This would involve fancier bookkeeping to keep track of changes to all the starts and ends, but could be multiple times more efficient when matches are dense.

Another improvement that occurs to me is to support replacement strs formed from groups in the match expression, like to change 14 July 2021 to 7/14/2021 using groups in the regex like r"(\d+) (\w+) (\d+)" and a function f(match) that returns "%s/%s/%s" % (month_number(match.group(2)), match.group(1), match.group(3)).
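
A minimal sketch of how group-based replacements might be resolved is below. The names `resolve_replacement`, `month_number`, `MONTHS` and `us_date` are all hypothetical; only `re.compile`, `Match.group()` and `Match.expand()` are standard library calls. The idea is that the replacement may be either a template string or a callable taking the match.

```python
import re

# hypothetical lookup table for the example; English month names only
MONTHS = {
    "January": 1, "February": 2, "March": 3, "April": 4, "May": 5,
    "June": 6, "July": 7, "August": 8, "September": 9, "October": 10,
    "November": 11, "December": 12,
}


def month_number(name):
    """Hypothetical helper: 'July' -> 7."""
    return MONTHS[name]


def resolve_replacement(match, replace):
    """Return replacement text for `match`.

    `replace` is either a callable taking the match object, or a template
    string in which backreferences such as \\1 are expanded.
    """
    if callable(replace):
        return replace(match)
    return match.expand(replace)


# day, month name, year; the month group only matches known names
date_regex = re.compile(r"(\d+) (%s) (\d+)" % "|".join(MONTHS))


def us_date(match):
    day, month, year = match.group(1), match.group(2), match.group(3)
    return "%s/%s/%s" % (month_number(month), day, year)


m = date_regex.search("signed on 14 July 2021")
print(resolve_replacement(m, us_date))        # 7/14/2021
print(resolve_replacement(m, r"\3-\2-\1"))    # 2021-July-14
```

The resolved string would then take the place of `replace_str` in the run-editing step.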

@cridenour

@scanny on the performance front, for a test document about one page, 8 paragraphs, 5 substitutions on a 2018 Macbook Pro:

Timings include only looping through paragraphs and making the replacements.

Original code from Pedro: 750-900ms
Switching to style ID: 60-80ms
Regex solution from you: 4-6ms.

And that was me compiling the regex on the fly inside the loop. Could probably get a hair faster.

@scanny
Contributor

scanny commented Jul 15, 2021

Ok @cridenour, cool, thanks. That seems fairly tolerable for your average-size Word document. By 5 substitutions, did you mean there were five instances of the search-word all to be replaced by the same replace-word? Or did you have five different search-and-replace iterations on the one document?

For anyone else that tries this with a larger document, I'd be interested to collect a few more timings :)

@cridenour

5 different checks.

I'll report back with some additional metrics once we get better test documents in place.

@PedroReboli

  1. Use re.finditer() to get all matches in a single call, avoiding reparsing after every match. This would involve fancier bookkeeping to keep track of changes to all the starts and ends, but could be multiple times more efficient when matches are dense.

I did, and it isn't worth the time. Even in an absurd case like replacing 3872 times in a single paragraph, the difference between them is only ~200ms.

@scanny
Contributor

scanny commented Jul 15, 2021

@PedroReboli 200 ms is a long time if it's being multiplied by a large number (of paragraphs or pages or whatever). What was your test setup (code) and what did your test data look like?

@PedroReboli

@scanny I was actually doing a POC to see whether it is worth spending more time on, so the code only works if the text being searched for has a fixed length; otherwise it won't work properly.
And I mean that the difference between while True and finditer is 200ms on a Ryzen 5 3600x.

while True
8.74219012260437 ~ 8.624446868896484
finditer
8.500852823257446 ~ 8.490002393722534

the document that i tested was a document with only one paragraph with 3872 "A"s and the test was changing those "A"s to "hello"

def paragraph_replace_text(paragraph, regex, replace_str):
    """Return `paragraph` after replacing all matches for `regex` with `replace_str`.

    `regex` is a compiled regular expression prepared with `re.compile(pattern)`
    according to the Python library documentation for the `re` module.
    """
    # --- store how many times the string was replaced ---
    count = 0
    # --- a paragraph may contain more than one match, loop until all are replaced ---
    for match in regex.finditer(paragraph.text):
        # --- calculate how much characters must be shifted to fix the match ---
        padding = (len(replace_str) - (match.end() -match.start()) ) *count

        # --- when there's a match, we need to modify run.text for each run that
        # --- contains any part of the match-string.
        runs = iter(paragraph.runs)
        start, end = match.start() + padding , match.end() + padding

        # --- Skip over any leading runs that do not contain the match ---
        for run in runs:
            run_len = len(run.text)
            if start < run_len:
                break
            start, end = start - run_len, end - run_len

        # --- Match starts somewhere in the current run. Replace match-str prefix
        # --- occurring in this run with entire replacement str.
        run_text = run.text
        run_len = len(run_text)
        run.text = "%s%s%s" % (run_text[:start], replace_str, run_text[end:])
        end -= run_len  # --- note this is run-len before replacement ---

        # --- Remove any suffix of match word that occurs in following runs. Note that
        # --- such a suffix will always begin at the first character of the run. Also
        # --- note a suffix can span one or more entire following runs.
        for run in runs:  # --- next and remaining runs, uses same iterator ---
            if end <= 0:
                break
            run_text = run.text
            run_len = len(run_text)
            run.text = run_text[end:]
            end -= run_len
        count += 1
    # --- optionally get rid of any "spanned" runs that are now empty. This
    # --- could potentially delete things like inline pictures, so use your judgement.
    # for run in paragraph.runs:
    #     if run.text == "":
    #         r = run._r
    #         r.getparent().remove(r)
    return paragraph

Apparently it doesn't seem too hard to make it work with different search lengths.
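
One possible way to handle matches of varying length, sketched on plain run texts: collect all matches once with `finditer()`, then apply them from last to first. Replacing a later match never moves the offsets of an earlier one, so no padding arithmetic is needed. This is an assumption-laden sketch, not a drop-in replacement for the function above, and the name `replace_all_in_runs` is hypothetical.

```python
import re


def replace_all_in_runs(run_texts, regex, replacement):
    """Return new run texts with every match of `regex` replaced.

    Matches are found once on the joined text and applied in reverse order,
    so earlier match offsets stay valid while later text changes length.
    """
    texts = list(run_texts)
    matches = list(regex.finditer("".join(texts)))
    for match in reversed(matches):
        start, end = match.span()
        offset = 0
        first = True  # the first run touched by the match gets the replacement
        for i, text in enumerate(texts):
            run_start = offset
            offset += len(text)
            lo, hi = max(start, run_start), min(end, offset)
            if lo >= hi:  # run does not overlap this match
                continue
            middle = replacement if first else ""
            first = False
            texts[i] = text[: lo - run_start] + middle + text[hi - run_start:]
    return texts


runs = ["f", "o", "o to", " you and ", "foo", " to the horse"]
print(replace_all_in_runs(runs, re.compile("foo"), "bar"))
# ['bar', '', ' to', ' you and ', 'bar', ' to the horse']
```

Because the text is scanned only once, a replacement string that itself contains the search pattern (say "foo" to "foobar") does not cause repeated re-matching.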

If you don't mind, I like the way you comment and will start using it :)

@scanny
Contributor

scanny commented Jul 16, 2021

Ah, got it, thanks :)

@scanny
Contributor

scanny commented Jul 24, 2021

Here's another related snippet I developed for an SO question that isolates a range in the text of a paragraph to be its own distinct run, preserving the character formatting (of the start of the range anyway): #980 (comment)

This could be useful for example if you just want to make certain words bold or highlight them or perhaps give them a different type-face or size.

Many of the run-boundary manipulation concepts are the same. Maybe there's a paragraph-text-range helper object lurking between the two of these utility functions somewhere :)

@jemontgomery

I ran into this same problem when attempting to use a .docx as a template for a mail merge from Python. That's to say, I could only find matches when the search text was contained within a single run. As we know, that's inconsistently the case, even when there's no change in style between two consecutive runs. Thank you, @scanny for your solution. I was going in a different direction on a fix, and although this is solved I thought others might benefit from an alternative approach.

As with the previously offered solutions, the replacement text is fully inserted in the run in which the first character of the search text is found. Elements of the search text that continue into subsequent runs are deleted. This can result in empty runs (as in the example below). This could be cleaned up as a last step if desired.

I ran a couple of quick benchmarks, and this solution appears to offer a good pickup in speed.

import re
from docx import Document

def replace_text_in_paragraph(paragraph, search, replace):
   """
   Replace occurrences of specific text in a paragraph with a new text, even when
   the text occurs across multiple consecutive runs

   Parameters:
   paragraph (docx.text.paragraph.Paragraph): The original paragraph where text 
      will be replaced.
   search (str): The text to be replaced.
   replace (str): The new text that will replace the search text.

   Returns:
   docx.text.paragraph.Paragraph: The updated paragraph with the search text
   replaced by the replace text.
   """
   # if the search text is not in the paragraph then exit; `search` is treated as
   # literal text, so use a plain substring test rather than a regex search
   if search not in paragraph.text:
      return paragraph

   # use a join character to delimit runs, I selected the null character '\0' because it
   # can't appear in oxml document text.
   J_CHAR = "\0"

   # join the paragraph run text with the J_CHAR as the run delimiter
   all_runs = J_CHAR.join([r.text for r in paragraph.runs])

   # compile a regex search string that allows 0,1+ occurrences of the run delimiter
   # between each character of the search string; each character is escaped so that
   # regex metacharacters in `search` (".", "(", "*", ...) are matched literally
   pattern = re.compile(f"{J_CHAR}*".join(re.escape(ch) for ch in search))

   # substitute the replacement text, plus the contained delimiter count in the match to
   # keep the runs consistent
   all_runs_replaced = re.sub(pattern, 
                          lambda m: replace + m.group(0).count(J_CHAR) * J_CHAR,
                          all_runs)
   
   # iterate the paragraph runs and replace any text that has changed after the substitution
   for orig, new in zip(paragraph.runs, all_runs_replaced.split(J_CHAR)):
      if orig.text != new:
         orig.text = new
   return paragraph

if __name__ == "__main__":
    document = Document()
    paragraph = document.add_paragraph()
    paragraph.add_run("f").bold = True
    paragraph.add_run("o").bold = True
    paragraph.add_run("o to").bold = True
    paragraph.add_run(" you and ")
    paragraph.add_run("foo").bold = True
    paragraph.add_run(" to the horse")
    replace_text_in_paragraph(paragraph, "foo", "bar")

    import pprint
    pprint.pprint(list(r.text for r in paragraph.runs))

>>> ['bar', '', ' to', ' you and ', 'bar', ' to the horse']
