
# 1. Introduction to XPath

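XPath (XML Path Language) is a query language for selecting nodes from an XML document. A path expression addresses elements, attributes, and text by their position in the document tree, and predicates in square brackets filter the matches by value or position.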

# 2. Setting Up the XPath Environment

First, we need to install the `lxml` library, which provides a powerful API for XML and HTML parsing.

In [None]:
# Install lxml library
!pip install lxml

We also need to import the display tools from IPython.

In [None]:
# Import display tools
from IPython.display import display, HTML, Markdown

# 3. Sample XML Data

Let's start with a sample XML document. We will use this XML data for our XPath queries.

In [None]:
xml_data = """
<library>
    <book id="1">
        <title>Python Programming</title>
        <author>John Doe</author>
        <year>2020</year>
        <price>29.99</price>
    </book>
    <book id="2">
        <title>Learning XPath</title>
        <author>Jane Smith</author>
        <year>2019</year>
        <price>19.99</price>
    </book>
    <book id="3">
        <title>Data Science Handbook</title>
        <author>Emily Davis</author>
        <year>2018</year>
        <price>39.99</price>
    </book>
</library>
"""

# 4. Parsing XML Data

We will use the `lxml` library to parse the XML data.

In [None]:
from lxml import etree

# Parse the XML data
root = etree.fromstring(xml_data)

# Display the root tag to verify parsing
root.tag
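
Beyond checking the root tag, it can help to look at what the parser actually returned. The following is a minimal sketch on a trimmed-down copy of the sample data; the names `sample_xml` and `sample_root` are illustrative and not part of the lab.

```python
from lxml import etree

# A trimmed-down copy of the sample library, so the block runs on its own
sample_xml = """
<library>
    <book id="1"><title>Python Programming</title><year>2020</year></book>
    <book id="2"><title>Learning XPath</title><year>2019</year></book>
</library>
"""

# fromstring() returns the root element of the parsed document
sample_root = etree.fromstring(sample_xml)

# Iterating over an element yields its child elements
child_tags = [child.tag for child in sample_root]

# Attributes are read with .get(); the values are always strings
first_book_id = sample_root[0].get("id")

print(sample_root.tag, child_tags, first_book_id)
```

Because attribute values come back as strings, `first_book_id` has to be converted with `int()` before any numeric use in Python.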

# 5. Utility Function to Display XML Nodes

Define a utility function to simplify displaying XML content.

In [None]:
# Utility function to display XML content without empty lines
def display_xml(nodes):
    for node in nodes:
        xml_str = etree.tostring(node, pretty_print=True, encoding='unicode').strip()
        display(Markdown(f'```xml\n{xml_str}\n```'))

# 6. Basic XPath Queries

Let's start with some basic XPath queries to extract information from the XML document.

**a. Extract all book titles:**

In [None]:
# Extract all book title nodes
title_nodes = root.xpath('//book/title')
# Display the content of title nodes
display_xml(title_nodes)
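
The query above returns element nodes. When only the text is needed, there are a few common options, sketched below on a shortened copy of the sample data (`sample_root` and the result names are illustrative).

```python
from lxml import etree

sample_root = etree.fromstring(
    "<library>"
    "<book id='1'><title>Python Programming</title></book>"
    "<book id='2'><title>Learning XPath</title></book>"
    "</library>"
)

# Option 1: select elements, then read .text in Python
title_elements = sample_root.xpath("//book/title")
titles_from_attr = [el.text for el in title_elements]

# Option 2: select the text nodes directly in XPath
titles_from_xpath = sample_root.xpath("//book/title/text()")

# Option 3: string() returns the text of the first matching node only
first_title = sample_root.xpath("string(//book/title)")

print(titles_from_attr, titles_from_xpath, first_title)
```

Options 1 and 2 return one entry per match, whereas `string()` collapses the result to a single string, which is convenient when exactly one match is expected.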

**b. Extract the author of the first book:**

In [None]:
# Extract the author node of the first book
author_first_book = root.xpath('//book[1]/author')
# Display the content of the author node
display_xml(author_first_book)

**c. Extract all prices:**

In [None]:
# Extract all price nodes
price_nodes = root.xpath('//book/price')
# Display the content of price nodes
display_xml(price_nodes)

# 7. Advanced XPath Queries

Now, let's move on to some advanced queries.

**a. Extract books published after 2018:**

In [None]:
# Extract book nodes published after 2018
books_after_2018 = root.xpath('//book[year > 2018]')
# Display the content of the book nodes
display_xml(books_after_2018)
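
The predicate `year > 2018` works because XPath converts the text of `<year>` to a number before comparing. The sketch below illustrates this, along with what happens when the text is not numeric. The fourth book (id 4, "Untitled Draft") is a hypothetical entry added only for this illustration.

```python
from lxml import etree

# Sample data from this lab, plus one hypothetical book (id="4") whose
# year is not a number; that entry is an assumption for illustration.
sample_root = etree.fromstring("""
<library>
    <book id="1"><title>Python Programming</title><year>2020</year></book>
    <book id="2"><title>Learning XPath</title><year>2019</year></book>
    <book id="3"><title>Data Science Handbook</title><year>2018</year></book>
    <book id="4"><title>Untitled Draft</title><year>unknown</year></book>
</library>
""")

# Numeric comparison: the text of <year> is converted to a number first
after_2018 = sample_root.xpath("//book[year > 2018]/@id")
up_to_2018 = sample_root.xpath("//book[year <= 2018]/@id")

# Non-numeric text converts to NaN, which fails every comparison
non_numeric = sample_root.xpath("//book[string(number(year)) = 'NaN']/@id")

print(after_2018, up_to_2018, non_numeric)
```

The book with the non-numeric year appears in neither comparison result, so a missing or malformed value silently drops out of numeric filters.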

**b. Extract the title and price of books that cost more than $20:**

In [None]:
# Extract the title and price nodes of books with price greater than $20
expensive_books = root.xpath('//book[price > 20]/title | //book[price > 20]/price')
# Display the content of the title and price nodes
display_xml(expensive_books)

**c. Extract book details with a specific attribute:**

In [None]:
# Extract book node with id=2
book_id_2 = root.xpath('//book[@id="2"]')
# Display the content of the book node
display_xml(book_id_2)
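
When the attribute value comes from user input or a loop, lxml lets you pass it as an XPath variable instead of building the expression with string formatting. The following is a sketch of that approach; the function name `find_book_by_id` and the variable name `$book_id` are illustrative.

```python
from lxml import etree

sample_root = etree.fromstring(
    "<library>"
    "<book id='1'><title>Python Programming</title></book>"
    "<book id='2'><title>Learning XPath</title></book>"
    "</library>"
)

def find_book_by_id(root, book_id):
    """Return the list of <book> elements whose id attribute equals book_id."""
    # Keyword arguments to xpath() are bound to $variables in the expression
    return root.xpath("//book[@id=$book_id]", book_id=book_id)

matches = find_book_by_id(sample_root, "2")
missing = find_book_by_id(sample_root, "99")

print([book.findtext("title") for book in matches], missing)
```

Passing the value as a variable avoids quoting problems when the value itself contains quote characters.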

# 8. Exploring Lists and Parent Navigation

An XPath query returns its matches as a list of nodes in document order, and the `..` step moves from a node up to its parent.


**a. Extract titles of all books (list example):**


In [None]:
# Extract all book title nodes
book_titles_nodes = root.xpath('//book/title')
# Display the content of title nodes
display_xml(book_titles_nodes)

**b. Navigate to the parent and back down to another child:**

In [None]:
# Navigate to the parent of the first book's title and get the price
parent_price_node = root.xpath('//book/title[text()="Python Programming"]/../price')
# Display the content of the price node
display_xml(parent_price_node)

**c. Use `..` to navigate from an element to its parent and then select another sibling:**

In [None]:
# Use '..' to navigate from author to title
titles_from_authors_nodes = root.xpath('//book/author[text()="Jane Smith"]/../title')
# Display the content of title nodes
display_xml(titles_from_authors_nodes)
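
The `..` step is one of several ways to express this query. As a comparison, the sketch below runs four formulations that should select the same title in this sample data; the variable names are illustrative.

```python
from lxml import etree

sample_root = etree.fromstring("""
<library>
    <book id="1"><title>Python Programming</title><author>John Doe</author></book>
    <book id="2"><title>Learning XPath</title><author>Jane Smith</author></book>
</library>
""")

# 1. Go down to the author, step up with '..', then down to the title
via_parent_step = sample_root.xpath(
    '//book/author[text()="Jane Smith"]/../title/text()'
)

# 2. Filter the book itself with a predicate, never leaving the parent
via_predicate = sample_root.xpath('//book[author="Jane Smith"]/title/text()')

# 3. Spell out the parent axis explicitly ('..' is shorthand for parent::node())
via_parent_axis = sample_root.xpath(
    '//author[.="Jane Smith"]/parent::book/title/text()'
)

# 4. Move sideways along the sibling axis (title comes before author here)
via_sibling_axis = sample_root.xpath(
    '//author[.="Jane Smith"]/preceding-sibling::title/text()'
)

print(via_parent_step, via_predicate, via_parent_axis, via_sibling_axis)
```

The predicate form (2) is often the easiest to read. The sibling form (4) depends on the order of the child elements, so it is the most fragile of the four.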

# 9. Using `//` and Wildcard `*` in XPath

**a. Using `//` to select nodes regardless of their position in the document:**

In [None]:
# Extract all author nodes regardless of their position in the document
all_authors_nodes = root.xpath('//author')
# Display the content of author nodes
display_xml(all_authors_nodes)

**b. Using the wildcard `*` to select any element:**

In [None]:
# Extract all child elements of the first book ('*' matches elements only, not text or attributes)
first_book_children = root.xpath('//book[1]/*')
# Display the content of the child elements
display_xml(first_book_children)

**c. Combine `//` and `*` to select all elements:**

In [None]:
# Extract all elements in the document
all_elements = root.xpath('//*')
# Display the content of all elements
display_xml(all_elements)
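
Displaying every element prints nested content several times, because each `<book>` is serialized once as part of `<library>`, once on its own, and its children once more. A more compact check, sketched here on a shortened sample, is to list only the tag names that `//*` returns.

```python
from lxml import etree

sample_root = etree.fromstring("""
<library>
    <book id="1"><title>Python Programming</title><author>John Doe</author></book>
    <book id="2"><title>Learning XPath</title><author>Jane Smith</author></book>
</library>
""")

# '//*' matches every element in the document, including the root
all_elements = sample_root.xpath("//*")

# Results come back in document order
all_tags = [el.tag for el in all_elements]

print(len(all_elements), all_tags)
```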

# 10. Additional XPath Functions and Expressions


**a. Using `@` to Select Attributes:**

In [None]:
# Extract the IDs of all books
book_ids = root.xpath('//book/@id')
book_ids
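
In lxml, attribute results are returned by default as "smart strings": they behave like ordinary strings but also remember the element they came from. The sketch below uses that to pair each id with its book's title; `id_to_title` is an illustrative name.

```python
from lxml import etree

sample_root = etree.fromstring(
    "<library>"
    "<book id='1'><title>Python Programming</title></book>"
    "<book id='2'><title>Learning XPath</title></book>"
    "</library>"
)

book_ids = sample_root.xpath("//book/@id")

# Each result knows it is an attribute and which attribute it is
first_is_attribute = book_ids[0].is_attribute
first_attr_name = book_ids[0].attrname

# getparent() returns the element that carries the attribute
id_to_title = {
    str(book_id): book_id.getparent().findtext("title") for book_id in book_ids
}

print(first_is_attribute, first_attr_name, id_to_title)
```

Converting with `str()` drops the extra bookkeeping when only the plain value is needed.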

**b. Using Position Functions:**

In [None]:
# Extract the title of the last book
last_book_title_node = root.xpath('//book[last()]/title')
# Display the content of the title node
display_xml(last_book_title_node)

In [None]:
# Extract the titles of the first two books
first_two_books_title_nodes = root.xpath('//book[position() <= 2]/title')
# Display the content of the title nodes
display_xml(first_two_books_title_nodes)
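
Positional predicates are evaluated relative to each parent, which matters once books are nested in more than one container. The sketch below assumes a hypothetical `<shelf>` element, which is not part of the lab's data, to show the difference between `//book[1]` and `(//book)[1]`.

```python
from lxml import etree

# Hypothetical layout: books grouped into <shelf> elements (an assumption
# for illustration; the lab's own data has no shelves)
shelved_root = etree.fromstring("""
<library>
    <shelf><book id="1"/><book id="2"/></shelf>
    <shelf><book id="3"/></shelf>
</library>
""")

# First book within each parent: one match per shelf
first_per_parent = shelved_root.xpath("//book[1]/@id")

# First book in the whole document: parentheses apply [1] to the full result
first_overall = shelved_root.xpath("(//book)[1]/@id")

# The same distinction applies to last()
last_per_parent = shelved_root.xpath("//book[last()]/@id")
last_overall = shelved_root.xpath("(//book)[last()]/@id")

print(first_per_parent, first_overall, last_per_parent, last_overall)
```

In the lab's flat `<library>` both forms return the same book, so the difference only shows up in nested documents.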

**c. Using Boolean Functions:**

In [None]:
# Check if there are any books published in 2020
books_2020 = root.xpath('boolean(//book[year=2020])')
books_2020

**d. Using Aggregation Functions:**

In [None]:
# Count the number of books
book_count = root.xpath('count(//book)')
book_count
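
XPath functions do not return node lists: lxml maps their results to plain Python values. The sketch below shows the types for `boolean()`, `count()`, and `sum()`; the last one is an additional XPath 1.0 function not used elsewhere in this lab.

```python
from lxml import etree

sample_root = etree.fromstring("""
<library>
    <book id="1"><year>2020</year><price>29.99</price></book>
    <book id="2"><year>2019</year><price>19.99</price></book>
    <book id="3"><year>2018</year><price>39.99</price></book>
</library>
""")

# boolean() returns a Python bool
has_2020_book = sample_root.xpath("boolean(//book[year=2020])")

# count() returns a float, because XPath 1.0 numbers are all doubles
book_count = sample_root.xpath("count(//book)")

# sum() adds the numeric values of the selected nodes, also as a float
total_price = sample_root.xpath("sum(//book/price)")

print(type(has_2020_book).__name__, has_2020_book)
print(type(book_count).__name__, int(book_count))
print(type(total_price).__name__, round(total_price, 2))
```

Wrapping the count in `int()` and rounding the sum avoids surprises when the values are used for indexing or displayed as currency.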

**e. Combining Predicates and Functions:**

In [None]:
# Extract the title and author nodes of books costing more than $20
expensive_books_nodes = root.xpath('//book[price > 20]/*[name()="title" or name()="author"]')
# Display the content of the title and author nodes
display_xml(expensive_books_nodes)

# 11. Conclusion

XPath is a powerful tool for navigating and querying XML documents. This lab covered basic and advanced queries, predicates on element values and attributes, parent navigation with `..`, descendant selection with `//`, the wildcard `*`, and XPath functions such as `last()`, `position()`, `boolean()`, and `count()`, mostly without relying on `text()`. The same techniques carry over to larger and more deeply nested XML documents.

