How to Insert HTML into Docx? #352

tbell511 · 2017-01-21T21:04:22Z

Is it possible to insert HTML into a Document using python-docx with styling applied?
The only thing I need to work are italics.

For example how to insert "Today is <i>Saturday</i>" with Saturday actually being inserted with italics?

Thanks!

vik378 · 2017-01-22T13:17:38Z

I had a similar issue however our HTML is way more complex and I couldn't find any direct method, so we ended up translating HTML fragments into a run map and then something like:
par = doc.add_paragraph()
for run_item in run_map:
run = par.add_run(run_item["text_fragment"])
if run_item["type"] == "italic":
run.style.italic = True
Something like that, see the docs around run topic for exact methods.

tbell511 · 2017-01-22T15:09:15Z

Wow thank you for the reply! Any way you could provide the link for that documentation?
Thanks,

vik378 · 2017-01-22T20:53:53Z

Have a look at docx.text.run
This is an extract from what i did in a similar case (simplified for italics only):

import re
from docx import Document


test_items = [
    "Trailing <i>tag</i>",
    "<i>Leading</i> tag",
    "This <i>time a tag</i> is in the middle",
]

class HTMLHelper(object):
    """ Translates some html into word runs. """
    def __init__(self):
        self.get_tags = re.compile("(<[a-z,A-Z]+>)(.*?)(</[a-z,A-Z]+>)")

    def html_to_run_map(self, html_fragment):
        """ breakes an html fragment into a run map """
        ptr = 0
        run_map = []
        for match in self.get_tags.finditer(html_fragment):
            if match.start() > ptr:
                text = html_fragment[ptr:match.start()]
                if len(text) > 0:
                    run_map.append((text, "plain_text"))
            run_map.append((match.group(2), match.group(1)))
            ptr = match.end()
        if ptr < len(html_fragment) - 1:
            run_map.append((html_fragment[ptr:], "plain_text"))
        return run_map
    
    def insert_runs_from_html_map(self, paragraph, run_map):
        """ inserts some runs into a paragraph object. """
        for run_item in run_map:
            run = paragraph.add_run(run_item[0])
            if run_item[1] == "<i>":
                run.italic = True

    
doc = Document()
html_helper = HTMLHelper()
for test_item in test_items:
    run_map = html_helper.html_to_run_map(test_item)
    print "------------------------------\nTest item:", test_item
    print "\nRun map: ", run_map
    par = doc.add_paragraph()
    print "\n XML before:\n", par._element.xml
    html_helper.insert_runs_from_html_map(par, run_map)
    print "\n XML after:\n", par._element.xml
doc.save("test_run_mapping.docx")

hope it helps

tbell511 · 2017-01-22T20:56:33Z

Thank you so much for spending time to do that! It looks great. I am going to try to implement it now.

tbell511 · 2017-01-22T23:08:23Z

I am getting a type error while trying to make the run fragments.

if match.start > ptr:
TypeError: unorderable types: builtin_function_or_method() > int()]

Any Ideas on how to fix this?

vik378 · 2017-01-23T08:11:32Z

I would expect it to be sorted by now, but for the sake of consistency:
it should be if match.start() > ptr: - simply missing ()

realsby · 2017-10-24T15:29:25Z

@electron378
its realy good sample but i wish you share full sample, not just for italic.
your sample doesnt work with alone br tag and double tag recursively inside each other immediately :(

realsby · 2017-10-25T14:57:30Z

I prepare something like this... This is enough for me..


from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint
from docx import Document

.....
document = Document()
document_html_parser = DocumentHTMLParser(document)

document_html_parser.add_paragraph_and_feed(html_code)
.....
class DocumentHTMLParser(HTMLParser):
    def __init__(self, document):
        HTMLParser.__init__(self)
        self.document = document
        self.paragraph = self.document.add_paragraph()
        self.run = self.paragraph.add_run()

    def add_paragraph_and_feed(self, html):
        self.paragraph = self.document.add_paragraph()
        self.run = self.paragraph.add_run()
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        self.run = self.paragraph.add_run()
        if tag == "i":
            self.run.italic = True
        if tag == "b":
            self.run.bold = True
        if tag == "u":
            self.run.underline = True
        if tag in ["br", "ul", "ol"]:
            self.run.add_break()
        if tag == "li":
            self.run.add_text(u'● ')
        if tag == "p":
            self.run.add_break()
            # self.run.add_break()
            # self.run.add_tab()

    def handle_endtag(self, tag):
        if tag in ["br", "li", "ul", "ol"]:
            self.run.add_break()
        self.run = self.paragraph.add_run()

    def handle_data(self, data):
        self.run.add_text(data)

    def handle_entityref(self, name):
        c = unichr(name2codepoint[name])
        self.run.add_text(c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = unichr(int(name[1:], 16))
        else:
            c = unichr(int(name))
        self.run.add_text(c)

prashantmalicomp · 2021-12-10T10:25:57Z

Hey, @realsby can you please share full code of python-docx html tags text parser link?

tbell511 closed this as completed Feb 1, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to Insert HTML into Docx? #352

How to Insert HTML into Docx? #352

tbell511 commented Jan 21, 2017 •

edited

vik378 commented Jan 22, 2017

tbell511 commented Jan 22, 2017

vik378 commented Jan 22, 2017 •

edited

tbell511 commented Jan 22, 2017

tbell511 commented Jan 22, 2017

vik378 commented Jan 23, 2017

realsby commented Oct 24, 2017 •

edited

realsby commented Oct 25, 2017

prashantmalicomp commented Dec 10, 2021

How to Insert HTML into Docx? #352

How to Insert HTML into Docx? #352

Comments

tbell511 commented Jan 21, 2017 • edited

vik378 commented Jan 22, 2017

tbell511 commented Jan 22, 2017

vik378 commented Jan 22, 2017 • edited

tbell511 commented Jan 22, 2017

tbell511 commented Jan 22, 2017

vik378 commented Jan 23, 2017

realsby commented Oct 24, 2017 • edited

realsby commented Oct 25, 2017

prashantmalicomp commented Dec 10, 2021

tbell511 commented Jan 21, 2017 •

edited

vik378 commented Jan 22, 2017 •

edited

realsby commented Oct 24, 2017 •

edited