Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to detect a hyperlink run and extract the URL? #1113

Closed
anthonyrubin opened this issue Jun 29, 2022 · 2 comments
Closed

How to detect a hyperlink run and extract the URL? #1113

anthonyrubin opened this issue Jun 29, 2022 · 2 comments
Labels
hyperlink Read and write hyperlinks in paragraph

Comments

@anthonyrubin
Copy link

anthonyrubin commented Jun 29, 2022

I am trying to get functionality such that when I go over the runs of a paragraph I can extract the text from a hyperlink but also the URL. Currently, a hyperlink run is totally omitted and not at all caught by python-docx. Digging through the issues I have found the following code snippet here

`from docx.oxml.shared import qn

def GetParagraphRuns(paragraph):
def _get(node, parent):
for child in node:
if child.tag == qn('w:r'):
yield Run(child, parent)
if child.tag == qn('w:hyperlink'):
yield from _get(child, parent)
return list(_get(paragraph._element, paragraph))

Paragraph.runs = property(lambda self: GetParagraphRuns(self))`

However, this simply converts the hyperlink into plain text and I lose the URL. Is there any way to extract the hyperlink text and URL and return it in a run with my own html tags surrounding the extracted hyperlink?

For example:
hyperlink run

would be extracted to

<a href="https://github.com/python-openxml/python-docx/issues/85">hyperlink run</a>

The use case for this is for a news website where authors submit their articles as .docx files and the articles then have any HTML markdown added to it before being pushed to the website.

@icegreentea
Copy link

Hey,

So each hyperlink run has a r:id attribute on the <w:hyperlink> element. That r:id element appears in document.xml.rels (which defines all the different relationships for the document). You can access that list of relationships with doc._part.rels (it's a dictionary).

The following snippet will enumerate the relationship id and target (URL) of all hyperlink relationships:

from docx.opc.constants import RELATIONSHIP_TYPE as RT

for relId, rel in doc._part.rels.items():
	if rel.rel_type == RT.HYPERLINK:
		print(relId)
		print(rel._target)

On the actual runs/hyperlink elements, you'll need to access to r:id. The library doesn't have proxy for <w:hyperlink> implemented, so we can get a little janky with lxml/xpath and friends:

for para in doc.paragraphs:
	for link in para._element.xpath(".//w:hyperlink"):
		inner_run = link.xpath("w:r", namespaces=link.nsmap)[0]
		# print link text
		print(inner_run.text)
		# print link relationship id
		rId = link.get("{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id")
		print(rId)
		# print link URL
		print(doc._part.rels[rId]._target)

This code makes a bunch of assumptions:

  • Your hyperlink contains a single run that contains the link text you want
  • All your hyperlinks have r:id attribute. I think various internal links (like table of contents) don't have r:id, so you may need some error catching in that case

@scanny scanny added the hyperlink Read and write hyperlinks in paragraph label Sep 28, 2023
@scanny
Copy link
Contributor

scanny commented Oct 2, 2023

This functionality is added in v1.0.0 released circa Oct 1, 2023.

There are two approaches depending on whether you care about where the hyperlink is in the paragraph or not.

The simplest is this:

for paragraph in document.paragraphs:
    if not paragraph.hyperlinks:
        continue
    for hyperlink in paragraph.hyperlinks:
        print(hyperlink.address)

If you need to know where they fit in with the runs of the paragraph:

for paragraph in document.paragraphs:
    if not paragraph.hyperlinks:
        continue
    for item in paragraph.iter_inner_content():
        if isinstance(item, Run):
            ... do the run thing ...
        elif isinstance(item, Hyperlink):
            ... do the hyperlink thing ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hyperlink Read and write hyperlinks in paragraph
Projects
None yet
Development

No branches or pull requests

3 participants