How to Insert HTML into Docx? #352

Closed
tbell511 opened this issue Jan 21, 2017 · 9 comments

tbell511 commented Jan 21, 2017

Is it possible to insert HTML into a Document using python-docx with the styling applied?
The only styling I need to work is italics.

For example, how would I insert "Today is <i>Saturday</i>" so that "Saturday" is actually rendered in italics?

Thanks!


vik378 commented Jan 22, 2017

I had a similar issue, but our HTML is far more complex and I couldn't find any direct method. We ended up translating the HTML fragments into a run map and then doing something like:
par = doc.add_paragraph()
for run_item in run_map:
    run = par.add_run(run_item["text_fragment"])
    if run_item["type"] == "italic":
        run.italic = True  # set italic on the run itself, not on run.style
Something like that; see the docs on runs for the exact methods.

@tbell511
Author

Wow, thank you for the reply! Could you provide a link to that documentation?
Thanks,


vik378 commented Jan 22, 2017

Have a look at docx.text.run.
This is an extract from what I did in a similar case (simplified for italics only):

import re
from docx import Document


test_items = [
    "Trailing <i>tag</i>",
    "<i>Leading</i> tag",
    "This <i>time a tag</i> is in the middle",
]

class HTMLHelper(object):
    """ Translates some html into word runs. """
    def __init__(self):
        # Tag names are letters only (the comma in the original character class was unintended)
        self.get_tags = re.compile(r"(<[a-zA-Z]+>)(.*?)(</[a-zA-Z]+>)")

    def html_to_run_map(self, html_fragment):
        """ Breaks an HTML fragment into a run map. """
        ptr = 0
        run_map = []
        for match in self.get_tags.finditer(html_fragment):
            if match.start() > ptr:
                text = html_fragment[ptr:match.start()]
                if len(text) > 0:
                    run_map.append((text, "plain_text"))
            run_map.append((match.group(2), match.group(1)))
            ptr = match.end()
        if ptr < len(html_fragment):  # any text left after the last tag
            run_map.append((html_fragment[ptr:], "plain_text"))
        return run_map
    
    def insert_runs_from_html_map(self, paragraph, run_map):
        """ inserts some runs into a paragraph object. """
        for run_item in run_map:
            run = paragraph.add_run(run_item[0])
            if run_item[1] == "<i>":
                run.italic = True

    
doc = Document()
html_helper = HTMLHelper()
for test_item in test_items:
    run_map = html_helper.html_to_run_map(test_item)
    print("------------------------------\nTest item:", test_item)
    print("\nRun map: ", run_map)
    par = doc.add_paragraph()
    print("\n XML before:\n", par._element.xml)
    html_helper.insert_runs_from_html_map(par, run_map)
    print("\n XML after:\n", par._element.xml)
doc.save("test_run_mapping.docx")

Hope it helps.

@tbell511
Author

Thank you so much for taking the time to do that! It looks great. I am going to try to implement it now.

@tbell511
Author

I am getting a type error while trying to make the run fragments.

if match.start > ptr:
TypeError: unorderable types: builtin_function_or_method() > int()

Any ideas on how to fix this?


vik378 commented Jan 23, 2017

I expect you have sorted this out by now, but for completeness:
it should be if match.start() > ptr: (the parentheses after start were simply missing, so the method object itself was being compared with an int).
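
For anyone hitting the same error, here is a minimal sketch of the difference, using only the standard library. The sample string is the one from the original question; the exact wording of the TypeError message depends on the Python 3 version, so it is printed rather than assumed.

```python
import re

match = re.search(r"<i>(.*?)</i>", "Today is <i>Saturday</i>")
ptr = 0

# Without parentheses, match.start is the method object itself, not a number.
print(callable(match.start))  # True

# Comparing that method object with an int is what raises the TypeError in Python 3.
try:
    match.start > ptr
except TypeError as exc:
    print("TypeError:", exc)

# With parentheses the method is called and returns the integer offset of the match.
print(match.start())        # 9
print(match.start() > ptr)  # True
```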

tbell511 closed this as completed Feb 1, 2017

realsby commented Oct 24, 2017

@electron378
It's a really good sample, but I wish you had shared a full example, not just one for italics.
Your sample doesn't work with a standalone br tag, or with tags nested directly inside each other :(
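
To make that limitation concrete, here is a small sketch that reruns the regex approach on a nested fragment and on a standalone br. The function is a condensed, standalone copy of html_to_run_map from the earlier comment (an assumption made for illustration: it uses a letters-only tag pattern and checks the full length for trailing text), and the input strings are made up.

```python
import re

# Same idea as the pattern in the sample above: opening tag, lazy body, any closing tag.
TAG_RE = re.compile(r"(<[a-zA-Z]+>)(.*?)(</[a-zA-Z]+>)")


def html_to_run_map(html_fragment):
    """Condensed copy of the regex-based splitter from the earlier comment."""
    ptr = 0
    run_map = []
    for match in TAG_RE.finditer(html_fragment):
        if match.start() > ptr:
            run_map.append((html_fragment[ptr:match.start()], "plain_text"))
        run_map.append((match.group(2), match.group(1)))
        ptr = match.end()
    if ptr < len(html_fragment):
        run_map.append((html_fragment[ptr:], "plain_text"))
    return run_map


# Flat markup works as intended.
print(html_to_run_map("Today is <i>Saturday</i>"))
# [('Today is ', 'plain_text'), ('Saturday', '<i>')]

# Nested tags: the lazy match stops at the first closing tag it sees,
# so the inner <i> and the outer </b> leak into the text.
print(html_to_run_map("<b><i>both</i></b>"))
# [('<i>both', '<b>'), ('</b>', 'plain_text')]

# A standalone <br> has no closing tag, so nothing matches at all.
print(html_to_run_map("one<br>two"))
# [('one<br>two', 'plain_text')]
```

Handling these cases generally needs a real HTML parser that tracks which tags are open, rather than a single regular expression.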


realsby commented Oct 25, 2017

I put together something like this. It is enough for my needs.


from html.parser import HTMLParser         # Python 3 location of HTMLParser
from html.entities import name2codepoint   # Python 3 location of name2codepoint
from docx import Document

.....
document = Document()
document_html_parser = DocumentHTMLParser(document)

document_html_parser.add_paragraph_and_feed(html_code)
.....
class DocumentHTMLParser(HTMLParser):
    def __init__(self, document):
        # convert_charrefs=False so handle_entityref/handle_charref below are still called
        HTMLParser.__init__(self, convert_charrefs=False)
        self.document = document
        self.paragraph = self.document.add_paragraph()
        self.run = self.paragraph.add_run()

    def add_paragraph_and_feed(self, html):
        # Start a fresh paragraph, then let the parser callbacks fill it
        self.paragraph = self.document.add_paragraph()
        self.run = self.paragraph.add_run()
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        # Every tag starts a new run; formatting of enclosing tags is not carried over
        self.run = self.paragraph.add_run()
        if tag == "i":
            self.run.italic = True
        if tag == "b":
            self.run.bold = True
        if tag == "u":
            self.run.underline = True
        if tag in ["br", "ul", "ol"]:
            self.run.add_break()
        if tag == "li":
            self.run.add_text('● ')
        if tag == "p":
            self.run.add_break()
            # self.run.add_break()
            # self.run.add_tab()

    def handle_endtag(self, tag):
        if tag in ["br", "li", "ul", "ol"]:
            self.run.add_break()
        # Text after a closing tag goes into a new, unformatted run
        self.run = self.paragraph.add_run()

    def handle_data(self, data):
        self.run.add_text(data)

    def handle_entityref(self, name):
        # Named entity such as &amp;
        c = chr(name2codepoint[name])
        self.run.add_text(c)

    def handle_charref(self, name):
        # Numeric reference, hexadecimal (&#xE9;) or decimal (&#233;)
        if name.startswith(('x', 'X')):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        self.run.add_text(c)
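
The parser above starts a new run on every start tag, so with nested markup such as <b><i>text</i></b> only the innermost tag's formatting ends up on the run. One possible way around that, sketched here under assumptions rather than taken from the thread, is to keep a list of the currently open formatting tags and apply all of them to each piece of text. The names SegmentCollector, html_to_segments, add_segments and TAG_TO_ATTR are hypothetical; the sketch assumes Python 3's html.parser, and the only python-docx members it relies on are Paragraph.add_run, Run.text, Run.add_break and the Run.bold / Run.italic / Run.underline properties.

```python
from html.parser import HTMLParser

# HTML tag -> python-docx Run attribute it switches on (illustrative subset).
TAG_TO_ATTR = {"i": "italic", "em": "italic",
               "b": "bold", "strong": "bold",
               "u": "underline"}


class SegmentCollector(HTMLParser):
    """Collects (text, attrs) segments while tracking nested formatting tags."""

    def __init__(self):
        super().__init__()    # Python 3 default: entities arrive decoded in handle_data
        self.open_tags = []   # formatting tags currently open, innermost last
        self.segments = []    # (text, attrs) tuples; text None marks a line break

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_ATTR:
            self.open_tags.append(tag)
        elif tag == "br":
            self.segments.append((None, ()))

    def handle_endtag(self, tag):
        # Drop the innermost open tag with this name; ignore stray closers.
        for idx in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[idx] == tag:
                del self.open_tags[idx]
                break

    def handle_data(self, data):
        attrs = tuple(sorted({TAG_TO_ATTR[t] for t in self.open_tags}))
        self.segments.append((data, attrs))


def html_to_segments(html):
    parser = SegmentCollector()
    parser.feed(html)
    parser.close()            # flush any buffered trailing text
    return parser.segments


def add_segments(paragraph, segments):
    """Add one run per segment to a python-docx paragraph."""
    for text, attrs in segments:
        run = paragraph.add_run()
        if text is None:
            run.add_break()
            continue
        run.text = text
        for attr in attrs:
            setattr(run, attr, True)   # e.g. run.bold = True


print(html_to_segments("<b>bold <i>both</i></b> plain<br>next"))
# [('bold ', ('bold',)), ('both', ('bold', 'italic')), (' plain', ()), (None, ()), ('next', ())]
```

With python-docx installed, something like add_segments(Document().add_paragraph(), html_to_segments(html_code)) should then produce one run per segment. Lists, paragraphs and other block-level tags are deliberately left out of this sketch.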


@prashantmalicomp

Hey @realsby, could you please share a link to the full code of your python-docx HTML tag parser?
