Can search and replace functions be added to python-docx? #30
Comments
@frenet you should be able to get most of what you describe with something like this:

    document = Document('your_file.docx')
    for paragraph in document.paragraphs:
        paragraph_text = paragraph.text
        # ... code to search paragraph_text for desired word(s)

If your document included tables you wanted to search as well, you would need to look there too, something like:

    for table in document.tables:
        for cell in table.cells:
            for paragraph in cell.paragraphs:
                # ... same as above ...

There can be recursion involved. I'm not sure how frequent it is in practice, but structurally there's nothing preventing a paragraph within a table from itself containing a table, which contains paragraphs that can contain a table, etc.

Right now I'm sure it will be good to have something that allows search and replace operations. I'm not quite sure what the search operation would return, since there's no concept so far of a 'cursor' location, nor is there a concept of a word or phrase, only paragraphs and runs, which of course wouldn't always match up on a particular word. But I'll leave it here for more noodling when we get to this in the backlog. |
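The recursion mentioned above can be sketched as a small generator. This is only an illustration, not part of python-docx: `iter_paragraphs` is a made-up helper name, and it assumes the container (document or cell) exposes `.paragraphs` and `.tables`, and that a table is walked via `table.rows` and `row.cells`. The stand-in objects at the bottom exist only so the sketch runs without a .docx file.

```python
from types import SimpleNamespace


def iter_paragraphs(container):
    """Yield every paragraph in `container`, descending into nested tables.

    `container` is assumed to be anything with `.paragraphs` and `.tables`
    (a document or a table cell).
    """
    yield from container.paragraphs
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # a cell is itself a container, so recurse into it
                yield from iter_paragraphs(cell)


# --- hypothetical stand-ins so the sketch is runnable on its own ---
def _container(paragraphs, tables=()):
    return SimpleNamespace(paragraphs=list(paragraphs), tables=list(tables))


def _table(*cells):
    return SimpleNamespace(rows=[SimpleNamespace(cells=list(cells))])


inner_cell = _container(["inner cell text"])
outer_cell = _container(["outer cell text"], [_table(inner_cell)])
document = _container(["body text"], [_table(outer_cell)])

print(list(iter_paragraphs(document)))
# ['body text', 'outer cell text', 'inner cell text']
```

With a real document, merged cells may show up more than once when walking rows, so a paragraph could be visited twice; that case is not handled here.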
@scanny awesome tool! Having some trouble along these same lines though. This spits out an attribute error when trying to update the text.

    def paragraph_replace(self, search, replace):
        searchre = re.compile(search)
        for paragraph in self.paragraphs:
            paragraph_text = paragraph.text
            if paragraph_text:
                if searchre.search(paragraph_text):
                    paragraph.text = re.sub(search, replace, paragraph_text)
        return paragraph_text

The error. I'm okay at python but couldn't really see any way to fix this looking at text.py |
@sk1tt1sh Hmm, yes, this is a very useful use case. There's actually no way yet to do what you're trying to accomplish here. The specific error is raised because the Let me have a noodle on it and see what new features might best serve here. It would be possible to add a Probably be a while though, I'm busy on |
@scanny Thanks for the quick reply. I definitely understand the being busy. I can try taking a swing at it if you can give me an idea of how you want to do it. I'll try to conform to pep8 and naming conventions as much as possible :) I'd like to note that my usage case is to have a prebuilt document with some sections that will be modified, so removing the paragraph and adding it at the end of the document is not ideal. It would make more sense at that point to just build it from scratch. |
A good place to start would be having a few methods local to your solution that get you what you need. We can address the question of getting permanent features into I believe if you had a

    paragraph = ... matching paragraph ...
    clear_paragraph(paragraph)
    paragraph.add_run(replacement_text)

The

    def clear_paragraph(paragraph):
        p_element = paragraph._p
        # copy the children into a list first so removal doesn't disturb iteration
        p_child_elements = [elm for elm in p_element.iterchildren()]
        for child_element in p_child_elements:
            p_element.remove(child_element)

I don't have time to test this right this minute, but if you want to give it a try and see if it gives you any trouble. What you're doing with it is manipulating the It's possible you might need to be more sophisticated too; this brute-force approach will get rid of paragraph properties you might prefer to preserve. But it's a good start anyway :) |
@scanny I lol'd at the "...that tends to be substantially more involved." line. I've been working my way through the internals of this thing and it is quite complex!! I am excited for this one though. I think once we iron out the remove/replace and get headers/footers manipulation down it will be awesome! The loss of style could become an issue, especially when you're working with cells within a table and the cell has 2 differently styled paragraphs. I'll consider how I can preserve them along the way while I'm working with it. I'll have to let you know how I'm doing some time tomorrow. Thanks! |
You all have inspired me. My initial requirement is very simple: count occurrences of some specific words, such as "sustainable development", and search for a word such as "comparision" and replace it with another word, "comparison". I create a docx document from scratch instead of using a replace function in place. I try to use as few functions as possible. My further interest is in how to use regular expressions to do text mining. |
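The counting and replacing described above can be sketched with the standard `re` module working on plain paragraph strings. The sample sentences are made up for illustration; with a real document the list would presumably come from something like `[p.text for p in document.paragraphs]`.

```python
import re

# hypothetical paragraph texts standing in for the text of a real document
paragraph_texts = [
    "Sustainable development needs a fair comparision of options.",
    "A second comparision: sustainable development goals by region.",
]

phrase = re.compile(r"sustainable development", re.IGNORECASE)
typo = re.compile(r"\bcomparision\b")

# count every occurrence of the phrase across all paragraphs
count = sum(len(phrase.findall(text)) for text in paragraph_texts)

# build corrected text for a new document rather than editing in place
fixed = [typo.sub("comparison", text) for text in paragraph_texts]

print(count)     # 2
print(fixed[0])  # Sustainable development needs a fair comparison of options.
```

This works on whole-paragraph text only, so any character formatting inside a paragraph is not carried over to the rebuilt document.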
Hey @scanny ... sorry for the delay on the update. After a little testing here's what I've come up with. Limitations:
I forked it, here's a link; if you are alright with it I'll submit a pull request. Overall thank you! This did exactly what I needed. Also, it doesn't re-order the paragraphs.

    def paragraph_replace(self, search, replace):
        searchre = re.compile(search)
        for paragraph in self.paragraphs:
            paragraph_text = paragraph.text
            if paragraph_text:
                if searchre.search(paragraph_text):
                    # drop the existing runs, then add the replaced text as one new run
                    self.clear_paragraph(paragraph)
                    paragraph.add_run(re.sub(search, replace, paragraph_text))
        return paragraph

    def clear_paragraph(self, paragraph):
        p_element = paragraph._p
        p_child_elements = [elm for elm in p_element.iterchildren()]
        for child_element in p_child_elements:
            p_element.remove(child_element)

|
@sk1tt1sh Glad you got it working :) A general solution here will require a substantial amount of other work, so probably best if we leave this here as a feature request and I'll come back to it once I've burned down some other backlog. Also any pulls require all the tests etc., but all that only makes sense once the API is determined. In particular, a general-purpose API for search/replace is a challenging problem (the good kind of challenging :). I'm thinking it entails this concept called 'Range' in the Microsoft API, which is essentially a range of characters, as though the characters in the document were arranged into a single string and each had its character offset, e.g. range(3, 6) on 'foobar' would have the text value 'bar'. The challenge being different parts of the range could occur in different runs etc. and could start and end at other than element boundaries. So how you translate that into XML element manipulations gets pretty complex; especially when you figure in all the other revision marks etc. elements Word puts in. I'll give it a noodle and come back to it when I free up a bit :) |
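To make the 'Range' idea above a little more concrete, here is a rough sketch that maps a character range of the joined text back onto the runs that hold it. It models runs as a plain list of strings; `locate` is a made-up name and nothing here is python-docx API.

```python
def locate(run_texts, start, end):
    """Map character range [start, end) of the joined text onto the runs.

    Returns a list of (run_index, local_start, local_end) tuples, one for
    each run that holds part of the range.
    """
    pieces = []
    offset = 0  # position of the current run within the joined text
    for idx, text in enumerate(run_texts):
        run_start, run_end = offset, offset + len(text)
        lo, hi = max(start, run_start), min(end, run_end)
        if lo < hi:  # the range overlaps this run
            pieces.append((idx, lo - run_start, hi - run_start))
        offset = run_end
    return pieces


runs = ["fo", "ob", "ar"]   # 'foobar' split across three runs
print(locate(runs, 3, 6))   # [(1, 1, 2), (2, 0, 2)]
```

The range for 'bar' starts in the middle of the second run and ends at the end of the third, which is exactly the "other than element boundaries" case described above.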
Thanks @scanny I'll continue looking into it as well. I'm learning all kinds of stuff about python and office xml from this tool ;) |
Is it possible to see code for this solution? |
@Abolfazl - No, not yet. There's no current work on this as far as I know. In the meantime you'd need to run through the paragraphs and perhaps search paragraph.text to locate matches. Then you'd have to manipulate the text at the run level to do the replace. If you do, I'm sure you'll encounter the challenges that explain why this hasn't been taken up yet :) It's actually a nice challenging problem. It's just waiting for someone motivated and able to pick it up and run with it. |
I'm not saying this will work for all cases, but it worked pretty well for my scenario where I knew there would only be one instance in a paragraph - even kept style in the one run it should have:
|
@scanny For the table case, this does not work anymore:
at least for the .docx I tried it with. I had to use this instead:
|
Here is a working solution which doesn't break the formatting (bold, italic):
It works for tables too. |
Hi @scanny, is there any solution for searching and replacing a particular word in a paragraph without altering the formatting of the entire paragraph? |
I think I have a solution, is there still active development on this? |
@bksim |
@bksim |
@anitatea @scanny I have found a way to replace text between runs without removing the style

    def clear_paragraph(paragraph):
        p_element = paragraph._p
        p_child_elements = [elm for elm in p_element.iterchildren()]
        for child_element in p_child_elements:
            # only remove run (<w:r>) children, leave paragraph properties alone
            if "'<w:r>'" in str(child_element):
                p_element.remove(child_element)

    class Char():
        def __init__(self, run, Char: str):
            self.Char = Char
            self.style = run.style
            self.font = run.font

    class Base():
        def __init__(self, Char: Char = None):
            self.style = Char.style
            self.font = Char.font

    class ParagraphHandle():
        def __init__(self, pagraph):
            self.Chars = []
            self.paragraph = pagraph
            for run in pagraph.runs:
                for x in range(len(run.text)):  # convert
                    self.Chars.append(Char(run, run.text[x]))

        def replace(self, OldWord, NewWord):
            text = ""
            for x in self.Chars:
                text += x.Char
            fist = text.find(OldWord)
            if fist == -1:
                return False
            f = Base(self.Chars[fist])
            for i in range(len(OldWord)):
                self.Chars.pop(fist)
            i = 0
            for l in NewWord:
                self.Chars.insert(fist + i, Char(f, l))
                i += 1
            return True

        def build(self):
            if len(self.Chars) == 0:
                return
            paraestilo = self.paragraph.style
            clear_paragraph(self.paragraph)
            self.paragraph.style = paraestilo
            runs = []
            fonts = []
            font = self.Chars[0].font
            run = ""
            for x in self.Chars:
                if x.font == font:
                    run += x.Char
                else:
                    runs.append(run)
                    run = x.Char
                    fonts.append(font)
                    font = x.font
            runs.append(run)
            fonts.append(font)
            for i in range(len(runs)):
                run = self.paragraph.add_run(runs[i])
                fonte = run.font
                fonte.bold = fonts[i].bold
                fonte.color.rgb = fonts[i].color.rgb
                fonte.complex_script = fonts[i].complex_script
                fonte.cs_bold = fonts[i].cs_bold
                fonte.cs_italic = fonts[i].cs_italic
                fonte.double_strike = fonts[i].double_strike
                fonte.emboss = fonts[i].emboss
                fonte.hidden = fonts[i].hidden
                fonte.highlight_color = fonts[i].highlight_color
                fonte.imprint = fonts[i].imprint
                fonte.italic = fonts[i].italic
                fonte.math = fonts[i].math
                fonte.name = fonts[i].name
                fonte.no_proof = fonts[i].no_proof
                fonte.outline = fonts[i].outline
                fonte.rtl = fonts[i].rtl
                fonte.shadow = fonts[i].shadow
                fonte.size = fonts[i].size
                fonte.small_caps = fonts[i].small_caps
                fonte.snap_to_grid = fonts[i].snap_to_grid
                fonte.spec_vanish = fonts[i].spec_vanish
                fonte.strike = fonts[i].strike
                fonte.subscript = fonts[i].subscript
                fonte.superscript = fonts[i].superscript
                fonte.underline = fonts[i].underline
                fonte.web_hidden = fonts[i].web_hidden

and you should use it like

    hand = ParagraphHandle(paragraph)
    hand.replace("helow", "hello")  # replace once
    while hand.replace("helow", "hello"):  # replace all
        pass
    hand.build()  # apply new text to paragraph

|
^ This doesn't work at all. |
Something like this should do the trick on a paragraph-by-paragraph basis. You can just call it iteratively over

    import re

    from docx import Document

    regex = re.compile("foo")


    def paragraph_replace_text(paragraph, regex, replace_str):
        """Return `paragraph` after replacing all matches for `regex` with `replace_str`.

        `regex` is a compiled regular expression prepared with `re.compile(pattern)`
        according to the Python library documentation for the `re` module.
        """
        # --- a paragraph may contain more than one match, loop until all are replaced ---
        while True:
            text = paragraph.text
            match = regex.search(text)
            if not match:
                break

            # --- when there's a match, we need to modify run.text for each run that
            # --- contains any part of the match-string.
            runs = iter(paragraph.runs)
            start, end = match.start(), match.end()

            # --- Skip over any leading runs that do not contain the match ---
            for run in runs:
                run_len = len(run.text)
                if start < run_len:
                    break
                start, end = start - run_len, end - run_len

            # --- Match starts somewhere in the current run. Replace match-str prefix
            # --- occurring in this run with entire replacement str.
            run_text = run.text
            run_len = len(run_text)
            run.text = "%s%s%s" % (run_text[:start], replace_str, run_text[end:])
            end -= run_len  # --- note this is run-len before replacement ---

            # --- Remove any suffix of match word that occurs in following runs. Note that
            # --- such a suffix will always begin at the first character of the run. Also
            # --- note a suffix can span one or more entire following runs.
            for run in runs:  # --- next and remaining runs, uses same iterator ---
                if end <= 0:
                    break
                run_text = run.text
                run_len = len(run_text)
                run.text = run_text[end:]
                end -= run_len

        # --- optionally get rid of any "spanned" runs that are now empty. This
        # --- could potentially delete things like inline pictures, so use your judgement.
        # for run in paragraph.runs:
        #     if run.text == "":
        #         r = run._r
        #         r.getparent().remove(r)

        return paragraph


    if __name__ == "__main__":
        document = Document()
        paragraph = document.add_paragraph()
        paragraph.add_run("f").bold = True
        paragraph.add_run("o").bold = True
        paragraph.add_run("o to").bold = True
        paragraph.add_run(" you and ")
        paragraph.add_run("foo").bold = True
        paragraph.add_run(" to the horse")
        paragraph_replace_text(paragraph, regex, "bar")

        import pprint

        pprint.pprint(list(r.text for r in paragraph.runs))

    # --- output with the optional empty-run removal enabled; with it commented out
    # --- as above, an empty '' run remains after the first 'bar'
    >>> ['bar', ' to', ' you and ', 'bar', ' to the horse']

|
The concept here is that the matched string can and often will span multiple runs. In order to preserve character formatting, you need to operate at the run level, changing only the text of the required runs and leaving the rest of each run alone (the non-text run properties are where the character formatting lives). Note that the replacement word gets the character formatting of the first character of the matched-string. This is because the initial prefix of the match string is entirely replaced by the replace-str; the rest of the match-str text is just deleted. One consequence of this is that if a matched phrase starts not-bold but has a bold word in it, the replacement will appear entirely not-bold. Likewise, if only the first word of a multi-word matched phrase is bold, the entire replacement string will appear in bold. There are two cases for the matched-string:
In general, any "spanned" runs are left empty by this process. The few lines at the end remove any empty runs. That's optional but probably doesn't cost a lot and seems neater than leaving them around. Note that could remove things you'd rather keep around like inline pictures, so use your judgement. If you're clever you can probably work out how to do that just for "spanned" runs a few lines up. |
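One way the "just for spanned runs" idea might be worked out is to record, while trimming the following runs, which of them had text and lost all of it to the match. The sketch below models runs as a list of strings so it can run without python-docx; `replace_span` is a made-up name. A run that had no text to begin with (for example one holding only a picture) is never reported.

```python
def replace_span(run_texts, start, end, replace_str):
    """Replace characters [start, end) of the joined run text.

    Returns (new_run_texts, spanned) where `spanned` lists the indices of
    runs that had text and were emptied because the match consumed them.
    """
    texts = list(run_texts)
    spanned = []
    i = 0
    # skip leading runs that end before the match starts
    while i < len(texts) and start >= len(texts[i]):
        start, end = start - len(texts[i]), end - len(texts[i])
        i += 1
    if i == len(texts):
        return texts, spanned  # range lies past the end of the text
    # the first run of the match receives the whole replacement string
    run_len = len(texts[i])
    texts[i] = texts[i][:start] + replace_str + texts[i][end:]
    end -= run_len
    i += 1
    # following runs lose whatever part of the match they held
    while i < len(texts) and end > 0:
        run_len = len(texts[i])
        if run_len and end >= run_len:
            spanned.append(i)  # had text, all of it belonged to the match
        texts[i] = texts[i][end:]
        end -= run_len
        i += 1
    return texts, spanned


runs = ["f", "o", "o to", " you and ", "foo", " to the horse"]
print(replace_span(runs, 0, 3, "bar"))
# (['bar', '', ' to', ' you and ', 'foo', ' to the horse'], [1])
```

With real runs, the indices in `spanned` would be the ones to remove with the `r.getparent().remove(r)` lines shown earlier, leaving other empty-text runs untouched.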
I had just started testing the solution from @PedroReboli and managed to speed it up 10-15x by working with the |
@cridenour what kind of times are you seeing performance-wise? I'm wondering how much optimization might be worthwhile. I'm thinking the next frontier for performance improvements would be:
Another improvement that occurs to me is to support replacement strs formed from groups in the match expression, like to change |
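A possible way to get group-based replacement strings is `Match.expand()` from the standard `re` module, which fills a template such as `\2/\1` from the groups of one match. The date pattern and template below are made up purely for illustration.

```python
import re

# hypothetical pattern: year and month captured as groups 1 and 2
regex = re.compile(r"(\d{4})-(\d{2})")
template = r"\2/\1"

text = "Report for 2021-03 and 2021-04"

# per-match replacement string, built from the groups of that match
match = regex.search(text)
replace_str = match.expand(template)
print(replace_str)                 # 03/2021

# for comparison, what re.sub does with the same template on plain text
print(regex.sub(template, text))   # Report for 03/2021 and 04/2021
```

In a run-level replace function, the idea would be to compute `replace_str` from each match this way instead of passing one fixed string for all matches.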
@scanny on the performance front, for a test document about one page, 8 paragraphs, 5 substitutions on a 2018 Macbook Pro: Timings include only looping through paragraphs and making the replacements. Original code from Pedro: 750-900ms And that was me compiling the regex on the fly inside the loop. Could probably get a hair faster. |
Ok @cridenour, cool, thanks. That seems fairly tolerable for your average-size Word document. By 5 substitutions, did you mean there were five instances of the search-word all to be replaced by the same replace-word? Or did you have five different search-and-replace iterations on the one document? For anyone else that tries this with a larger document, I'd be interested to collect a few more timing :) |
5 different checks. I'll report back with some additional metrics once we get better test documents in place. |
I did, and it isn't worth the time. Even in an absurd case like replacing 3872 times in a single paragraph, the difference between them is only ~200ms |
@PedroReboli 200 ms is a long time if it's being multiplied by a large number (of paragraphs or pages or whatever). What was your test setup (code) and what did your test data look like? |
@scanny I was actually doing a POC to see whether it is worth spending more time on this, so the code only works if the text being searched for has a fixed size; otherwise it won't work properly while True. The document that I tested was a document with only one paragraph with 3872 "A"s, and the test was changing those "A"s to "hello".

    def paragraph_replace_text(paragraph, regex, replace_str):
        """Return `paragraph` after replacing all matches for `regex` with `replace_str`.

        `regex` is a compiled regular expression prepared with `re.compile(pattern)`
        according to the Python library documentation for the `re` module.
        """
        # --- store how many times the string was replaced ---
        count = 0
        # --- a paragraph may contain more than one match, loop until all are replaced ---
        for match in regex.finditer(paragraph.text):
            # --- calculate how many characters must be shifted to fix the match ---
            padding = (len(replace_str) - (match.end() - match.start())) * count
            # --- when there's a match, we need to modify run.text for each run that
            # --- contains any part of the match-string.
            runs = iter(paragraph.runs)
            start, end = match.start() + padding, match.end() + padding

            # --- Skip over any leading runs that do not contain the match ---
            for run in runs:
                run_len = len(run.text)
                if start < run_len:
                    break
                start, end = start - run_len, end - run_len

            # --- Match starts somewhere in the current run. Replace match-str prefix
            # --- occurring in this run with entire replacement str.
            run_text = run.text
            run_len = len(run_text)
            run.text = "%s%s%s" % (run_text[:start], replace_str, run_text[end:])
            end -= run_len  # --- note this is run-len before replacement ---

            # --- Remove any suffix of match word that occurs in following runs. Note that
            # --- such a suffix will always begin at the first character of the run. Also
            # --- note a suffix can span one or more entire following runs.
            for run in runs:  # --- next and remaining runs, uses same iterator ---
                if end <= 0:
                    break
                run_text = run.text
                run_len = len(run_text)
                run.text = run_text[end:]
                end -= run_len
            count += 1

        # --- optionally get rid of any "spanned" runs that are now empty. This
        # --- could potentially delete things like inline pictures, so use your judgement.
        # for run in paragraph.runs:
        #     if run.text == "":
        #         r = run._r
        #         r.getparent().remove(r)

        return paragraph

Apparently it doesn't seem too hard to make it work with different search lengths. If you don't mind, I like the way you comment and will start using it :) |
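One way the fixed-size limitation might be lifted is to keep a running shift that grows by the length difference of each individual match, rather than multiplying one difference by a count. The sketch below works on a plain string only; `shifted_spans` is a made-up name and the sample text is arbitrary.

```python
import re


def shifted_spans(text, regex, replace_str):
    """Return the (start, end) of each match as positioned after all
    earlier matches have already been replaced by `replace_str`."""
    spans = []
    shift = 0  # accumulated length change from earlier replacements
    for match in regex.finditer(text):
        spans.append((match.start() + shift, match.end() + shift))
        shift += len(replace_str) - (match.end() - match.start())
    return spans


text = "a bb ccc"
regex = re.compile(r"b+|c+")   # matches of different lengths
spans = shifted_spans(text, regex, "X")
print(spans)  # [(2, 4), (4, 7)]

# applying the spans one after another gives the same result as re.sub
out = text
for start, end in spans:
    out = out[:start] + "X" + out[end:]
print(out)    # a X X
```

In the run-level function, `shift` would take the place of `padding`, updated at the end of each loop iteration.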
Ah, got it, thanks :) |
Here's another related snippet I developed for an SO question that isolates a range in the text of a paragraph to be its own distinct run, preserving the character formatting (of the start of the range anyway): #980 (comment) This could be useful for example if you just want to make certain words bold or highlight them or perhaps give them a different type-face or size. Many of the run-boundary manipulation concepts are the same. Maybe there's a paragraph-text-range helper object lurking between the two of these utility functions somewhere :) |
I ran into this same problem when attempting to use a .docx as a template for a mail merge from Python. That's to say, I could only find matches when the search text was contained within a single run. As we know, that's inconsistently the case, even when there's no change in style between two consecutive runs. Thank you, @scanny for your solution. I was going in a different direction on a fix, and although this is solved I thought others might benefit from an alternative approach. As with the previously offered solutions, the replacement text is fully inserted in the run in which the first character of the search text is found. Elements of the search text that continue into subsequent runs are deleted. This can result in empty runs (as in the example below). This could be cleaned up as a last step if desired. I ran a couple of quick benchmarks, and this solution appears to offer a good pickup in speed.

    import re

    from docx import Document


    def replace_text_in_paragraph(paragraph, search, replace):
        """
        Replace occurrences of specific text in a paragraph with a new text, even when
        the text occurs across multiple consecutive runs

        Parameters:
            paragraph (docx.text.paragraph.Paragraph): The original paragraph where text
                will be replaced.
            search (str): The text to be replaced.
            replace (str): The new text that will replace the search text.

        Returns:
            docx.text.paragraph.Paragraph: The updated paragraph with the search text
                replaced by the replace text.
        """
        # if the search text is not in the paragraph then exit; `search` is literal
        # text, so a plain substring test is used rather than treating it as a regex
        if search not in paragraph.text:
            return paragraph

        # use a join character to delimit runs, I selected the null character '\0' because it
        # can't appear in oxml document text.
        J_CHAR = "\0"

        # join the paragraph run text with the J_CHAR as the run delimiter
        all_runs = J_CHAR.join([r.text for r in paragraph.runs])

        # compile a regex search string that allows 0,1+ occurrences of the run delimiter
        # between each character of the search string; each character is escaped so
        # regex metacharacters in the search text are matched literally
        pattern = re.compile(f"{J_CHAR}*".join(re.escape(char) for char in search))

        # substitute the replacement text, plus the contained delimiter count in the match to
        # keep the runs consistent
        all_runs_replaced = re.sub(pattern,
                                   lambda m: replace + m.group(0).count(J_CHAR) * J_CHAR,
                                   all_runs)

        # iterate the paragraph runs and replace any text that has changed after the substitution
        for orig, new in zip(paragraph.runs, all_runs_replaced.split(J_CHAR)):
            if orig.text != new:
                orig.text = new

        return paragraph


    if __name__ == "__main__":
        document = Document()
        paragraph = document.add_paragraph()
        paragraph.add_run("f").bold = True
        paragraph.add_run("o").bold = True
        paragraph.add_run("o to").bold = True
        paragraph.add_run(" you and ")
        paragraph.add_run("foo").bold = True
        paragraph.add_run(" to the horse")
        replace_text_in_paragraph(paragraph, "foo", "bar")

        import pprint

        pprint.pprint(list(r.text for r in paragraph.runs))

    >>> ['bar', '', ' to', ' you and ', 'bar', ' to the horse']

|
It is very easy to create a docx file with python-docx, but I would like to search for some specific words and count the number of times they occur. How can I do that in python-docx? I know this can be done in mikemaccana/python-docx, but the mikemaccana/python-docx code syntax is different from python-openxml/python-docx, and I would rather not switch to mikemaccana/python-docx.