isolate_run() #980

scanny · 2021-07-24T18:55:24Z

This is some code I developed to answer this SO question.

You give it a character-position range in a paragraph and it splits and merges runs as needed to isolate that range of characters into its own single run, which keeps the character formatting of the run the range starts in. If you don't change the text of the paragraph between calls, it can be called repeatedly with different ranges to isolate multiple ranges, like the multiple matches produced by re.Pattern.finditer().

I'm not sure what will become of it but it was more work than I originally guessed so I want to keep it around for future reference.

import copy
import itertools

from docx.text.run import Run

def isolate_run(paragraph, start, end):
    """Return `docx.text.run.Run` object containing only `paragraph.text[start:end]`.

    Runs are split as required to produce a new run at the `start` that ends at `end`.
    Runs are unchanged if the indicated range of text already occupies its own run. The
    resulting run object is returned.

    `start` and `end` are as in Python slice notation. For example, the first three
    characters of the paragraph have (start, end) of (0, 3). `end` is *not* the index of
    the last character. These correspond to `match.start()` and `match.end()` of a regex
    match object and `s[start:end]` in Python slice notation.
    """
    rs = tuple(paragraph._p.r_lst)

    def advance_to_run_containing_start(start, end):
        """Return (r, r_idx, start, end) indicating start run, its index, and adjusted offsets.

        The start run is the run the `start` offset occurs in. The returned `start` and
        `end` values are adjusted to be relative to the start of `r_idx`.
        """
        # --- add 0 at end so `r_ends[-1] == 0` ---
        r_ends = tuple(itertools.accumulate(len(r.text) for r in rs)) + (0,)
        r_idx = 0
        while start >= r_ends[r_idx]:
            r_idx += 1
        skipped_rs_offset = r_ends[r_idx - 1]
        return rs[r_idx], r_idx, start - skipped_rs_offset, end - skipped_rs_offset

    def split_off_prefix(r, start, end):
        """Return adjusted `end` after splitting prefix off into separate run.

        Does nothing if `r` is already the start of the isolated run.
        """
        if start > 0:
            prefix_r = copy.deepcopy(r)
            r.addprevious(prefix_r)
            r.text = r.text[start:]
            prefix_r.text = prefix_r.text[:start]
        return end - start

    def split_off_suffix(r, end):
        """Split `r` at `end` such that suffix is in separate following run."""
        suffix_r = copy.deepcopy(r)
        r.addnext(suffix_r)
        r.text = r.text[:end]
        suffix_r.text = suffix_r.text[end:]

    def lengthen_run(r, r_idx, end):
        """Add prefixes of following runs to `r` until `end` is reached."""
        while len(r.text) < end:
            suffix_len_reqd = end - len(r.text)
            r_idx += 1
            next_r = rs[r_idx]
            if len(next_r.text) <= suffix_len_reqd:
                # --- subsume next run ---
                r.text = r.text + next_r.text
                next_r.getparent().remove(next_r)
                continue
            if len(next_r.text) > suffix_len_reqd:
                # --- take prefix from next run ---
                r.text = r.text + next_r.text[:suffix_len_reqd]
                next_r.text = next_r.text[suffix_len_reqd:]

    # --- 1. skip over any runs before the one containing the start of our range ---
    r, r_idx, start, end = advance_to_run_containing_start(start, end)

    # --- 2. split first run where our range starts, placing "prefix" to our range
    # ---    in a new run inserted just before this run. After this, our run will begin
    # ---    at the right point and the left-hand side of our work is done.
    end = split_off_prefix(r, start, end)

    # --- 3. if run is longer than isolation-range we need to split-off a suffix run ---
    if len(r.text) > end:
        split_off_suffix(r, end)

    # --- 4. But if our run is shorter than the desired isolation-range we need to
    # ---    lengthen it by taking text from subsequent runs
    elif len(r.text) < end:
        lengthen_run(r, r_idx, end)

    # --- if neither 3 nor 4 apply it's because the run already ends in the right place
    # --- and there's no further work to be done.

    return Run(r, paragraph)
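
For reference, the splitting logic can be modelled on plain strings, which makes it easy to try without a document. This is only a sketch: `isolate_span` and `run_texts` are made-up names, a list of strings stands in for the runs, and nothing here touches python-docx or formatting.

```python
import itertools


def isolate_span(run_texts, start, end):
    """Return (new_run_texts, idx) with text[start:end] alone in run `idx`.

    `run_texts` is a list of strings standing in for the runs of a paragraph.
    """
    if not 0 <= start < end <= sum(len(t) for t in run_texts):
        raise ValueError("need 0 <= start < end <= len(text)")

    # --- cumulative end offset of each run, e.g. ["AB", "CDE"] -> [2, 5] ---
    r_ends = list(itertools.accumulate(len(t) for t in run_texts))

    # --- run containing `start`, and run containing the last character ---
    first = next(i for i, r_end in enumerate(r_ends) if start < r_end)
    last = next(i for i, r_end in enumerate(r_ends) if end <= r_end)

    # --- offsets at which those two runs begin ---
    first_begin = r_ends[first] - len(run_texts[first])
    last_begin = r_ends[last] - len(run_texts[last])

    prefix = run_texts[first][: start - first_begin]
    suffix = run_texts[last][end - last_begin:]
    isolated = "".join(run_texts)[start:end]

    # --- empty prefix/suffix means the range already starts/ends on a run edge ---
    before = run_texts[:first] + ([prefix] if prefix else [])
    after = ([suffix] if suffix else []) + run_texts[last + 1:]
    return before + [isolated] + after, len(before)


print(isolate_span(["AB", "CDE", "F"], 1, 4))  # (['A', 'BCD', 'E', 'F'], 1)
```

One thing the model hides: judging from lengthen_run() above, text pulled in from following runs ends up in the start run, so it takes on that run's formatting.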


ymmeng · 2021-08-20T08:34:31Z

I want to set each character in each paragraph of the document as a separate "run". What parameters should I pass in? After trying for a long time, I always have a bunch of strange results:

demo.docx:

ABCD
EFGA

code:

doc = docx.Document('demo.docx')
for par in doc.paragraphs:
    isolate_run(par, 0, len(par.runs))
    for run in par.runs:
        print(run.text)

print:

A
BCD
E
FGA

ymmeng · 2021-08-20T08:35:51Z

What I expect:
print:

A
B
C
D
E
F
G
A

scanny · 2021-08-20T17:02:18Z

If you want each run to be one character long you can use something like:

for start in range(len(paragraph.text)):
    end = start + 1
    isolate_run(paragraph, start, end)

The len(par.runs) that appears in your code is the number of runs in the paragraph, which just doesn't have anything to do with what you're trying to do.

The way to think about start and end is like this:

"""
paragraph.text: "ABCDE"

      A   B   C   D   E 
        |           |
    start           end
"""
>>> start, end = 1, 4
>>> run = isolate_run(paragraph, start, end)
>>> run.text
'BCD'
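
In practice the offsets usually come from a regex match or a simple range over the text. A small stdlib-only sketch; the sample string and the names `spans` and `char_spans` are just for illustration, and the isolate_run() call is left as a comment because it needs a real paragraph:

```python
import re

text = "ABCDE and ABCDE"

# --- one (start, end) pair per match, same meaning as text[start:end] ---
spans = [(m.start(), m.end()) for m in re.finditer(r"BCD", text)]
print(spans)  # [(1, 4), (11, 14)]

# --- one pair per character, for the one-character-per-run case ---
char_spans = [(i, i + 1) for i in range(len(text))]

for start, end in spans:
    assert text[start:end] == "BCD"
    # run = isolate_run(paragraph, start, end)  # needs a python-docx paragraph
```

Because isolating a range moves run boundaries but leaves paragraph.text as it was, offsets computed once up front stay valid for every later call.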

ymmeng · 2021-08-21T00:11:54Z

Thank you very much for your patience. It will be of great help to me. Thank you!

tranduchuy682 · 2022-07-23T03:54:40Z

Thanks, @scanny. That's awesome.

itslily88 · 2023-01-04T18:50:15Z

I am having trouble understanding the return Run(r, paragraph).

NameError: name 'Run' is not defined

I have a word document that is already created through different checkboxes selected by the user. Those check boxes get text from plain text files to input into the word document. I would like to add styles to certain runs that I use a series of hyphens to signal. For example, the word document may look like this:

------Details about the incident

Here are the events that detail the specific incident in question. below are the listed sub categories

---This is sub category 1

details about this section

---This is sub category 2

details about this section

I would like any line with "------" to be bold and a larger font, and any line with "---" to just be bold.

#30 (comment) works great for simply replacing the text (removing the ------ but leaving the text), but any formatting applied to the run sets everything to the same formatting.

I would assume isolate_run() would work for me, but I cannot get past the return Run(r, paragraph) to even walk through how to make it do what I need.

Here is how I am calling paragraph_replace_text(doc, '------'):

def formatting(document, oldText):
    oldTextLength = len(oldText)
    for oldPara in document.paragraphs:    
        if oldPara.text.find(oldText) >= 0:
            paraText = oldPara.text
            for line in paraText.splitlines():
                if oldText in line: 
                    newText = line[oldTextLength:]
                    paragraph_replace_text(oldPara, re.compile(f'{oldText}{newText}'), newText)

My thought would be to call isolate_run inside paragraph_replace_text after the '------' is removed, passing the start and end in there, but I can't get it to run with the return Run(r, paragraph) to try.

Any help would be appreciated
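
On the NameError: `Run` has to be importable where isolate_run() is defined. In python-docx the class is docx.text.run.Run, and the function also uses the standard-library copy module. For the hyphen markers, one possible approach is to first work out the character range of each marked line. A sketch using plain strings only; `marked_spans` and its `markers` parameter are hypothetical names, not part of python-docx:

```python
def marked_spans(text, markers=("------", "---")):
    """Yield (start, end, marker) for each line of `text` beginning with a marker.

    Offsets are relative to `text` and cover the whole line, marker included.
    """
    # --- check longer markers first so "------" is not mistaken for "---" ---
    ordered = sorted(markers, key=len, reverse=True)
    offset = 0
    for line in text.split("\n"):
        for marker in ordered:
            if line.startswith(marker):
                yield offset, offset + len(line), marker
                break
        # --- +1 accounts for the "\n" removed by split() ---
        offset += len(line) + 1


sample = "------Details about the incident\nplain line\n---This is sub category 1"
for start, end, marker in marked_spans(sample):
    print(repr(marker), repr(sample[start:end]))
```

Each (start, end) pair taken from paragraph.text could then be passed to isolate_run(paragraph, start, end), and the returned run styled with run.bold = True or run.font.size. Isolating every range before editing any text keeps the offsets valid.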

This was referenced Jul 24, 2021

Replace text in paragraph keeping the runs object and styles #415

Open

Can search and replace functions be added to python-docx? #30

Open

scanny mentioned this issue Aug 19, 2021

Single character modify style #988

Open

scanny mentioned this issue Sep 27, 2023

How to modify table/paragraph/footnote text without destroying the formatting #1251

Closed

scanny mentioned this issue Mar 19, 2024

Replace or add text to a table cell paragraph without altering paragraph font properties #1358

Closed
