
# 1. Introduction to XPath

XPath (XML Path Language) is a query language for selecting nodes from an XML document. It provides a way to navigate through elements and attributes in XML.

# 2. Setting Up the XPath Environment

First, we need to install the `lxml` library, which provides a powerful API for XML and HTML parsing.

In [None]:
# Install lxml library
!pip install lxml

We also need to import the display tools from IPython.

In [None]:
# Import display tools
from IPython.display import display, HTML, Markdown

# 3. Sample XML Data

Let's start with a sample XML document. We will use this XML data for our XPath queries.

In [None]:
from lxml import etree

xml_data = """
<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:id="manuscript_3945" xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xmlns:tei="http://www.tei-c.org/ns/1.0">
    <fileDesc>
      <titleStmt>
        <title>Christ Church MS. 341</title>
        <title type="collection">Christ Church MSS.</title>
        <respStmt>
          <resp>Cataloguer</resp>
          <persName>Ralph Hanna</persName>
          <persName>David Rundle</persName>
        </respStmt>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>
"""

# Strip surrounding whitespace so the XML declaration is the very first thing in the string
xml_data = xml_data.strip()
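
The `strip()` call matters because an XML declaration is only allowed at the very start of a document, and a triple-quoted string that opens with a line break puts a newline in front of it. The following is a minimal sketch of that behaviour, assuming a hypothetical two-line document (`raw`) rather than the TEI sample above:

```python
from lxml import etree

# Hypothetical document; the leading newline mimics a triple-quoted string
raw = b'\n<?xml version="1.0" encoding="UTF-8"?>\n<doc/>'

try:
    etree.fromstring(raw)
    parsed_without_strip = True
except etree.XMLSyntaxError:
    # The declaration is not at the start of the input, so parsing fails
    parsed_without_strip = False

# After stripping, the declaration comes first and parsing succeeds
cleaned_root = etree.fromstring(raw.strip())

print(parsed_without_strip, cleaned_root.tag)
```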

# 4. Parsing XML Data

We will use the `lxml` library to parse the XML data.

In [None]:
# Convert the XML string to a byte string
xml_data_bytes = xml_data.encode('utf-8')

# Parse the XML data
root = etree.fromstring(xml_data_bytes)

# Display the root tag to verify parsing
root.tag
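
Because the sample document declares a default namespace, `root.tag` is reported in `{namespace-uri}local-name` form rather than as plain `TEI`. As a small sketch, assuming a cut-down stand-in document (`doc`, not the full sample), `etree.QName` can split such a tag into its two parts:

```python
from lxml import etree

# Cut-down stand-in for the sample: only the root and one child, same default namespace
doc = etree.fromstring(b'<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/></TEI>')

# The tag includes the namespace URI in curly braces
print(doc.tag)

# QName separates the namespace URI from the local element name
qname = etree.QName(doc)
local_name = qname.localname
namespace_uri = qname.namespace
print(local_name, namespace_uri)
```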

# 5. Utility Function to Display XML Nodes

Define a utility function to simplify displaying XML content.

In [None]:
# Utility function: render each value (attribute value or element text) as a plain-text block
def display_values(values):
    for value in values:
        display(Markdown(f'```text\n{value}\n```'))

# 6. XPath Queries

Let's start with some basic XPath queries to extract information from the XML document.

In [None]:
# Map the prefix 'tei' to the TEI namespace URI used by the document
namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Select the type attribute of every <title> under <fileDesc>
results = root.xpath('//tei:fileDesc//tei:title/@type', namespaces=namespaces)

# Display the matched attribute values
display_values(results)
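
The prefix mapping is needed because XPath 1.0 treats an unprefixed name as belonging to no namespace, so it never matches elements that sit in a default namespace. The sketch below illustrates this on a hypothetical one-element document (`doc`); the `local-name()` variant is shown only as a namespace-agnostic alternative, not as the approach this notebook uses:

```python
from lxml import etree

# Hypothetical minimal document that uses the same default namespace as the sample
doc = etree.fromstring(
    b'<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    b'<title type="collection">Christ Church MSS.</title>'
    b'</TEI>'
)
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Unprefixed names only match elements in no namespace, so nothing is found
unprefixed = doc.xpath('//title/@type')

# With the prefix bound to the namespace URI, the attribute is found
prefixed = doc.xpath('//tei:title/@type', namespaces=ns)

# Matching on the local name ignores the namespace entirely
by_local_name = doc.xpath('//*[local-name()="title"]/@type')

print(unprefixed, prefixed, by_local_name)
```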

In [None]:
# Select the <persName> siblings of the <resp> whose text is "Cataloguer"
results = root.xpath('//tei:resp[text()="Cataloguer"]/../tei:persName', namespaces=namespaces)

# Extract the text from each <persName> element
names = [name.text for name in results]

# Display the names
display_values(names)
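
The same names can be reached without stepping up to the parent with `..`. As a sketch, assuming a hypothetical standalone `<respStmt>` fragment (`doc`), one option is to put the condition in a predicate on `<respStmt>`, another is the `following-sibling::` axis. Ending the path with `text()` returns the strings directly:

```python
from lxml import etree

# Hypothetical fragment containing only the <respStmt> part of the sample
doc = etree.fromstring(
    b'<respStmt xmlns="http://www.tei-c.org/ns/1.0">'
    b'<resp>Cataloguer</resp>'
    b'<persName>Ralph Hanna</persName>'
    b'<persName>David Rundle</persName>'
    b'</respStmt>'
)
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Predicate on the parent: keep <respStmt> elements that have a matching <resp> child
via_predicate = doc.xpath(
    '//tei:respStmt[tei:resp="Cataloguer"]/tei:persName/text()', namespaces=ns
)

# Sibling axis: start at the <resp> and move to the <persName> elements after it
via_sibling = doc.xpath(
    '//tei:resp[.="Cataloguer"]/following-sibling::tei:persName/text()', namespaces=ns
)

print(via_predicate)
print(via_sibling)
```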

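# 7. XPath Queries on Nested Elements

The next document describes a royal family tree. It declares no namespace, so its elements can be queried without a prefix mapping.

In [35]: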
xml_data = """
<royal name="Henry" xml:id="HenryVII">
    <title rank="king" territory="England" regnal="VII" from="1485-08-22" to="1509-04-21" />
    <relationship type="marriage" spouse="#ElizabethOfYork">
        <children>
            <royal name="Arthur" xml:id="ArthurTudor" />
            <royal name="Henry" xml:id="HenryVIII">
                <title rank="king" territory="England" regnal="VIII" from="1509-04-22" to="1547-01-28" />
                <relationship type="marriage" spouse="#CatherineOfAragon" from="1509-06-11" to="1533-05-23">
                    <children>
                        <royal name="Mary">
                            <title rank="queen" territory="England" regnal="I" from="1553-07-19" to="1558-11-17" />
                            <relationship type="marriage" spouse="#PhilipOfSpain" from="1554-07-25" />
                        </royal>
                    </children>
                </relationship>
                <relationship type="marriage" spouse="#AnneBoleyn" from="1533-01-25" to="1536-05-17">
                    <children>
                        <royal name="Elizabeth">
                            <title rank="queen" territory="England" regnal="I" from="1558-11-17" to="1603-03-24" />
                        </royal>
                    </children>
                </relationship>
                <relationship type="marriage" spouse="#JaneSeymour" from="1536-05-30" to="1537-10-24">
                    <children>
                        <royal name="Edward">
                            <title rank="king" territory="England" regnal="VI" from="1547-01-28" to="1553-07-06" />
                        </royal>
                    </children>
                </relationship>
            </royal>
        </children>
    </relationship>
</royal>
"""

# Strip surrounding whitespace (this document has no XML declaration, so this is only tidying)
xml_data = xml_data.strip()

In [36]:
# Convert the XML string to a byte string
xml_data_bytes = xml_data.encode('utf-8')

# Parse the XML data
root = etree.fromstring(xml_data_bytes)

# Display the root tag to verify parsing
root.tag

'royal'

In [41]:
# Select the name of every <royal> whose <title> has rank="king" and regnal="VIII"
results = root.xpath('//royal/title[@rank="king" and @regnal="VIII"]/../@name')

# Display the results
for name in results:
    print(name)

Henry
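
An equivalent way to write this query is to nest the condition inside a predicate on `<royal>`, which avoids the `..` step. The sketch below assumes an abbreviated, hypothetical version of the family tree (`doc`) and also shows reading `xml:id` through its fully qualified attribute name:

```python
from lxml import etree

# Abbreviated, hypothetical version of the family tree used above
doc = etree.fromstring(b'''
<royal name="Henry" xml:id="HenryVII">
  <title rank="king" regnal="VII"/>
  <relationship type="marriage">
    <children>
      <royal name="Arthur" xml:id="ArthurTudor"/>
      <royal name="Henry" xml:id="HenryVIII">
        <title rank="king" regnal="VIII"/>
      </royal>
    </children>
  </relationship>
</royal>
'''.strip())

# Nested predicate: keep <royal> elements that have a matching <title> child
names = doc.xpath('//royal[title[@rank="king" and @regnal="VIII"]]/@name')

# xml:id lives in the reserved XML namespace, so it is read via its qualified name
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'
king_ids = [royal.get(XML_ID) for royal in doc.xpath('//royal[title/@rank="king"]')]

print(names)
print(king_ids)
```

The `xml:id` values are useful here because both kings share the `name` value `Henry`, so the name alone does not say which element matched.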
