# Xpath

Xpath is a language used to query and navigate XML documents and similar tree-structured markup, such as HTML.

It is a useful tool for web scraping, as the syntax is standardized across browsers and web parsing software packages.

For this reason, xpath offers a seamless workflow between inspecting live websites in a browser and writing custom software to parse out fields (from static HTML) or interact with elements (using browser automation).

Although it is a language of its own, it lets you write simple, precise, and generalizable expressions to parse web pages.

# Static Websites

We're going to identify all the recent article titles and links from NPR.
Open your browser and go to our example website: https://text.npr.org/

Next, open the dev tools by right-clicking anywhere and selecting "Inspect" (or with your browser's keyboard shortcut).

Select any element in the "Elements" tab and copy its xpath.

<figure>
    <img src='assets/xpath_console.png' width=75%>
    <figcaption align = "left" style="font-size:80%;"> 
    How to copy the xpath of an element in Dev Tools.
    </figcaption>
</figure>

The element we're selecting is an `<a>` tag with a link and a title that looks like this:

```
<a class="topic-title" href="/nx-s1-5035272">What is in Project 2025? </a>
```

The resulting xpath that we copied looks like this:

```
/html/body/main/div/ul/li[1]/a
```

### What is xpath?

An xpath records a route through the hierarchy of nested HTML tags, with the last tag denoting the destination.

It designates where an element lives in an HTML document (as if you were homing in on a street address from the center of the earth).

The example above is long and specific to one element on the page (BAD!). 
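To make the problem concrete, here is a minimal sketch using Python's `lxml` package (the same package used later in this walkthrough). The HTML is a made-up, simplified stand-in for the NPR page, and the `titles` helper and the "redesigned" variant are hypothetical names introduced only for illustration.

```python
from lxml import etree

# The xpath copied from Dev Tools.
COPIED_XPATH = '/html/body/main/div/ul/li[1]/a'

def titles(html, xpath):
    """Parse an HTML string and return the text of every element the xpath matches."""
    tree = etree.HTML(html)
    return [a.text for a in tree.xpath(xpath)]

# Made-up stand-in for the page: a list of story links.
page = """<html><body><main><div><ul>
<li><a class="topic-title" href="/story-1">First story</a></li>
<li><a class="topic-title" href="/story-2">Second story</a></li>
</ul></div></main></body></html>"""

# The same content after a hypothetical redesign wraps the list in one more <div>.
redesigned = page.replace("<ul>", "<div><ul>").replace("</ul>", "</ul></div>")

print(titles(page, COPIED_XPATH))        # only the first story matches
print(titles(redesigned, COPIED_XPATH))  # the route no longer exists, so nothing matches
```

The copied xpath finds exactly one link, and a small change to the page layout is enough to make it find nothing at all.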

At its worst, xpath provides directions to a _specific_ destination (for example the Shake Shack in Madison Square). At best, xpath provides directions that lead to _every_ Shake Shack.

With a little practice xpath can be both precise and generalizable, providing an elegant way to locate and select elements from web pages.

Here is the other extreme: short and generic (ALSO BAD!!). 

```
.//a
```

This syntax yields the target element mixed with every other element on the page with an `<a>` tag. Following the Shake Shack analogy, this xpath represents directions to every restaurant on Earth.

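You'll notice the ".//" before the `<a>` tag. The "//" denotes a search at any depth, and the leading "." starts that search from the current node, which in the browser console is the whole page.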
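The role of the leading dot is easiest to see when the search starts from an element other than the page root. The sketch below uses Python's `lxml` package (introduced later in this walkthrough) on a made-up HTML snippet; the variable names are illustrative only.

```python
from lxml import etree

# Made-up page: two menu links plus two story links.
html = """<html><body>
<div id="menu"><a href="/">Home</a> <a href="/about">About</a></div>
<ul>
<li><a class="topic-title" href="/story-1">First story</a></li>
<li><a class="topic-title" href="/story-2">Second story</a></li>
</ul>
</body></html>"""
tree = etree.HTML(html)
story_list = tree.xpath('.//ul')[0]

# From the root, './/a' reaches every link on the page.
all_links = tree.xpath('.//a')

# From the <ul>, './/a' only looks below that element...
links_in_list = story_list.xpath('.//a')

# ...while '//a' without the dot always searches the whole document.
links_from_root = story_list.xpath('//a')

print(len(all_links), len(links_in_list), len(links_from_root))  # 4 2 4
```

Keeping the dot makes an expression safe to reuse when you later narrow the search to one part of the page.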

My favorite part about xpath is that you can identify and refine expressions _in browser_, and use the same xpath in different frameworks to make web parsing a breeze.

The Goldilocks approach is not about specifying the exact route, but rather the distinguishing attributes of the destination.

### Identifying the optimal xpath in the browser

Let's jump into the live website:

1. In Dev Tools, switch over to the "console" tab. This allows us to execute JavaScript on the page.

2. We'll use the `$x()` function to select elements on the page by xpath ("x" for **x**path). As a start, type an HTML tag such as "header", "a", or "div":
```
$x('.//a')
```

xpath offers an easy way to specify attributes and other distinguishing features.

You just add an "@" sign before the attribute name. This allows you to denote specific attribute values `.//a[@href="/nx-s1-5035272"]` or simply the presence of an attribute `.//a[@href]`. 

A simple workflow:
- Start an xpath with an HTML tag followed by square brackets: `.//a[ ... ]`
- Copy and paste the attributes from a live element (for example `<a class="topic-title" href="/nx-s1-5035272">`)
- Add "@" before each attribute and join them with `and`, ending up with: `.//a[@class="topic-title" and @href="/nx-s1-5035272"]`.

From there, you can remove overly-specific attributes. In the case above, the class is unique enough to isolate news articles.

```
$x('.//a[@class="topic-title"]')
```
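The same narrowing-down can be tried offline. This is a minimal sketch with Python's `lxml` package (covered later in this walkthrough), run against a made-up HTML snippet rather than the live NPR page; the hrefs and variable names are placeholders.

```python
from lxml import etree

# Made-up page: one menu link, one anchor without an href, two story links.
html = """<html><body>
<p><a href="/">Home</a> <a name="top">Top of page</a></p>
<ul>
<li><a class="topic-title" href="/story-1">First story</a></li>
<li><a class="topic-title" href="/story-2">Second story</a></li>
</ul>
</body></html>"""
tree = etree.HTML(html)

# Presence of an attribute: every <a> that has an href at all.
has_href = tree.xpath('.//a[@href]')

# Every attribute copied from one element: precise, but matches only that element.
exact = tree.xpath('.//a[@class="topic-title" and @href="/story-1"]')

# Overly specific attribute removed: matches every story link.
by_class = tree.xpath('.//a[@class="topic-title"]')

print(len(has_href), len(exact), len(by_class))  # 3 1 2
```

Note that `@class="topic-title"` compares the whole attribute value, so an element with several classes would need a looser test such as `contains(@class, "topic-title")`.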

## Text Matching
xpath also allows for text-matching. 

Here's how you can match a link on the page whose text mentions "2025":
```
$x('.//a[@href and contains(text(),"2025")]')
```
To sanity check your results, you can expand the resulting list and click any of the elements. This will highlight the element on the page and shoot you back to the Dev Tools "Elements" tab to view the element.


<video width=100% controls loop>
    <source src="assets/click_xpath.mp4" type=video/mp4>
</video>
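One subtlety of text matching is worth a quick sketch. In XPath 1.0, `contains(text(), ...)` only tests the first text node directly inside the element, whereas `contains(., ...)` tests all of the element's text, including text in nested tags. The example below uses Python's `lxml` package (introduced in the next section) on made-up HTML; the second link, with its nested `<b>` tag, is a hypothetical case added to show the difference.

```python
from lxml import etree

# Made-up page: the second link keeps "2025" inside a nested <b> tag.
html = """<html><body><ul>
<li><a href="/story-1">What is in Project 2025? </a></li>
<li><a href="/story-2">Read <b>Project 2025</b> explained</a></li>
<li><a href="/story-3">Weather update</a></li>
</ul></body></html>"""
tree = etree.HTML(html)

# text() here only sees the first text node directly inside each <a>.
direct_text = tree.xpath('.//a[@href and contains(text(),"2025")]')

# "." stands for all the text inside the element, nested tags included.
full_text = tree.xpath('.//a[@href and contains(.,"2025")]')

print([a.get("href") for a in direct_text])  # ['/story-1']
print([a.get("href") for a in full_text])    # ['/story-1', '/story-2']
```

On a page like NPR's text site, where each link holds plain text, both forms behave the same.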

## Parsing Xpath from HTML pages

With the correct xpath in hand, we can automate this parsing in Python using the `lxml` package.

In [1]:
!pip install lxml



In [2]:
from lxml import etree
import requests

Let's visit the website and retrieve the static HTML from the page.

In [3]:
url = "https://text.npr.org/"

In [4]:
resp = requests.get(url)

Read the HTML into the tree to parse.

In [5]:
tree = etree.HTML(resp.text)

In [6]:
# .xpath() evaluates full XPath 1.0; .findall() only supports a limited subset
elements = tree.xpath('.//a[@class="topic-title"]')
len(elements)

20

Now we can iterate through each headline and grab the title and link of each story:

In [7]:
data = []
for elm in elements:
    link = elm.get('href')
    link = f"https://npr.org{link}"
    title = elm.text
    
    row = {'link' : link, 'title': title}
    data.append(row)
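As an optional variation on the loop above, the row-building step can be made a little more defensive. This is a sketch under assumptions, not the method used in this walkthrough: `build_row` and `BASE_URL` are hypothetical names, `urljoin` comes from Python's standard library, and the sample values are taken from the example `<a>` tag shown earlier.

```python
from urllib.parse import urljoin

# Assumed base for article links, matching the prefix used in the loop above.
BASE_URL = "https://npr.org"

def build_row(href, title):
    """Turn a relative href and raw link text into a clean row (hypothetical helper)."""
    return {
        # urljoin resolves the relative href against the base URL.
        "link": urljoin(BASE_URL, href),
        # Link text can be None or carry stray whitespace.
        "title": (title or "").strip(),
    }

row = build_row("/nx-s1-5035272", "What is in Project 2025? ")
print(row)
# {'link': 'https://npr.org/nx-s1-5035272', 'title': 'What is in Project 2025?'}
```

Inside the loop, this would be called as `build_row(elm.get('href'), elm.text)`.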

In [8]:
data

[{'link': 'https://npr.org/g-s1-10379',
  'title': "Judge Cannon dismisses Trump's documents case over how the special counsel was appointed"},
 {'link': 'https://npr.org/g-s1-10620',
  'title': 'Artists on the schedule for Monday night at the RNC'},
 {'link': 'https://npr.org/nx-s1-5039234',
  'title': 'Conspiracy theories surge following the assassination attempt on Trump'},
 {'link': 'https://npr.org/1138759124',
  'title': '5 strategies to help you cope with a nagging feeling of dread'},
 {'link': 'https://npr.org/1244273973',
  'title': 'The Forever Stamp just went up in price. How does the U.S. cost compare globally?'},
 {'link': 'https://npr.org/nx-s1-5040280',
  'title': 'Copenhagen begins offering free perks to tourists who make sustainable choices'},
 {'link': 'https://npr.org/nx-s1-5039185',
  'title': 'What we know about the Trump shooter'},
 {'link': 'https://npr.org/nx-s1-5039293',
  'title': 'The man killed in the assassination attempt on Trump died shielding his family'},
 ...]

## Browser Automation

The same xpath can be used for browser automation frameworks, such as Playwright.

In [9]:
# install Playwright and download its browser binaries
!pip install playwright
!playwright install



In [10]:
from playwright.async_api import async_playwright

playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

The exact function to access xpaths is a little different across frameworks, but the xpaths stay the same.

In [12]:
# go to the webpage, and find all the links to parse
await page.goto(url)
elms = await page.locator('//a[@class="topic-title"]').all()

In [13]:
# parse the links
data = []
for elm in elms:
    title = await elm.text_content()
    link = await elm.get_attribute('href')
    link = f'https://npr.org{link}'
    
    row = {'link': link, 'title': title,}
    data.append(row)

In [14]:
data

[{'link': 'https://npr.org/g-s1-10379',
  'title': "Judge Cannon dismisses Trump's documents case over how the special counsel was appointed"},
 {'link': 'https://npr.org/g-s1-10620',
  'title': 'Artists on the schedule for Monday night at the RNC'},
 {'link': 'https://npr.org/nx-s1-5039234',
  'title': 'Conspiracy theories surge following the assassination attempt on Trump'},
 {'link': 'https://npr.org/1138759124',
  'title': '5 strategies to help you cope with a nagging feeling of dread'},
 {'link': 'https://npr.org/1244273973',
  'title': 'The Forever Stamp just went up in price. How does the U.S. cost compare globally?'},
 {'link': 'https://npr.org/nx-s1-5040280',
  'title': 'Copenhagen begins offering free perks to tourists who make sustainable choices'},
 {'link': 'https://npr.org/nx-s1-5039185',
  'title': 'What we know about the Trump shooter'},
 {'link': 'https://npr.org/nx-s1-5039293',
  'title': 'The man killed in the assassination attempt on Trump died shielding his family'},
 ...]

In [15]:
# close the browser, then shut down the Playwright driver
await browser.close()
await playwright.stop()