# Regular expressions and XPath

To use this notebook, please install the dependencies *requests* and *lxml*. If you use the Anaconda Python distribution, run:

```
conda install requests
conda install lxml
```

## Web page example

Let's try to extract some data from the example web page below. We are interested in extracting the title, the image URL and the date (the data within the red rectangles):

![An example web page](webpage.png)

In [None]:
# Import libraries
import requests

# Example of loading a live web page (it worked on April 9, 2019; check whether it still does)
"""
pageContent=requests.get(
     'https://www.avto-magazin.si/novice/odkrivamo/razkrivamo-volkswagen-golf-8-prihaja-naslednje-leto-namesto-letos/'
)
pageContent = pageContent.content
"""

# We use the locally cached file instead, because the online page may have changed.
with open('Golf8.html', 'r') as pageFile:
    pageContent = pageFile.read()
print("Page content:\n'%s'." % pageContent)

## Regular expressions

One standard way to extract structured information from text is to use regular expressions. To get familiar with regular expressions, follow the tutorial at [RegexOne](https://regexone.com/) or go through the explanations at [Regular-Expressions.info](http://www.regular-expressions.info/).

While you are learning, and whenever you want to test a regular expression against some text, tools such as [Regex101](https://regex101.com/) are helpful.

### Regular expressions in Python
Let's try some regular expressions in Python.

In [None]:
# Import libraries
import re

In [None]:
text = """Mr. Swensen, 62, runs the school’s $25.4 billion endowment, one of the largest in the country. 
Since November 1 2016 he is joined by his intellectual sparring partner, Mr. Dean Takahashi, his senior director."""

In [None]:
# Raw string, with the dot escaped so that it matches a literal "."
regex = r"Mr\. (\w+)"

# Find a person
personPattern = re.compile(regex)
match = personPattern.search(text)
print("Found person: '{}'.".format( match.group(1) ))

In [None]:
# Find all persons
matches = re.finditer(regex, text)
for match in matches:
    print("Found person: '{}'.".format( match.group(1) ))
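
Why the dot in such a pattern is worth escaping can be seen in a minimal sketch. The sample sentence below is made up for illustration and is not part of the notebook data; an unescaped `.` matches any character, so `Mrs` is accepted as well.

```python
import re

# Hypothetical sample sentence (not from the notebook text)
sample = "Mrs Jones met Mr. Takahashi."

# Unescaped dot: "." matches any character, including the "s" in "Mrs"
loose = re.findall(r"Mr. (\w+)", sample)

# Escaped dot: only a literal "Mr." is accepted
strict = re.findall(r"Mr\. (\w+)", sample)

print(loose)   # ['Jones', 'Takahashi']
print(strict)  # ['Takahashi']
```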

In [None]:
# Find money amounts
regex = r"[$€]\s*[0-9.,]+"
matches = re.finditer(regex, text)
for match in matches:
    print("Amount: '{}'.".format( match.group(0) ))
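
A character class such as `[0-9.,]+` also swallows punctuation that merely follows the number. The sketch below, on a made-up sentence, compares it with a slightly stricter variant that only accepts `.` or `,` when more digits follow. The stricter pattern is one possible refinement, not the only reasonable one.

```python
import re

# Hypothetical sample sentence (not from the notebook text)
sample = "It costs $1,250.50, or €900."

# Simple character class: trailing "," and "." become part of the match
loose = re.findall(r"[$€]\s*[0-9.,]+", sample)

# Stricter variant: a separator must be followed by at least one digit
strict = re.findall(r"[$€]\s*\d+(?:[.,]\d+)*", sample)

print(loose)   # ['$1,250.50,', '€900.']
print(strict)  # ['$1,250.50', '€900']
```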

In [None]:
# Find dates
regex = r"(January|February|November|December)\s(\d{1,2})\s(\d{2,4})"

matches = re.finditer(regex, text)
for match in matches:
    print("Found date: '{}. {} {}'.".format( match.group(2), match.group(1), match.group(3) ))
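
Captured groups are plain strings, so a matched date usually needs one more step before it can be compared or sorted. The following sketch uses named groups and a small month lookup table to build a `datetime.date`. The group names, the `MONTHS` table and the four-digit-year restriction are assumptions made for this example.

```python
import re
from datetime import date

# Illustrative lookup table covering only the months used in the pattern
MONTHS = {"January": 1, "February": 2, "November": 11, "December": 12}

# Named groups make the later conversion easier to read
pattern = re.compile(
    r"(?P<month>January|February|November|December)\s"
    r"(?P<day>\d{1,2})\s(?P<year>\d{4})"
)

sample = "Since November 1 2016 he is joined by his partner."
m = pattern.search(sample)

# Convert the captured strings into a date object
parsed = date(int(m.group("year")), MONTHS[m.group("month")], int(m.group("day")))
print(parsed.isoformat())  # 2016-11-01
```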

### Web extraction using regular expressions

As above, we read the HTML document into a string and search within it using regular expressions.

In [None]:
# Get the article title
regex = r"<div class=\"col-xs-12\">\s+<h1>(.*)<\/h1>"

match = re.compile(regex).search(pageContent)
title = match.group(1)
print("Found title: '%s'." % title)

In [None]:
# Get the article date
regex = r"<div class=\"col-md-6 col-xs-12 date\">[\n\s]*([0-9]*\.\s+[0-9]*\.\s+[0-9]*)"

match = re.compile(regex).search(pageContent)
date = match.group(1)
print("Found date: '%s'." % date)

In [None]:
# Extract the image URL (non-greedy group, so the match stops at the first closing quote)
regex = r"<img class=\"img-responsive\".*?src=\"(.*?)\""

match = re.compile(regex).search(pageContent)
imageUrl = match.group(1)
print("Found imageURL: '%s'." % imageUrl)
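
When extracting attribute values, it matters whether the quantifier is greedy. The sketch below uses a made-up `<img>` tag (not taken from the cached page) to show that `(.*)` runs to the last quote on the line, while `(.*?)` stops at the first one.

```python
import re

# Hypothetical tag, used only to illustrate the difference
snippet = '<img class="img-responsive" src="/img/golf.jpg" alt="Golf 8">'

# Greedy: captures up to the LAST double quote on the line
greedy = re.search(r'src="(.*)"', snippet).group(1)

# Non-greedy: captures up to the FIRST double quote after src="
lazy = re.search(r'src="(.*?)"', snippet).group(1)

print(greedy)  # /img/golf.jpg" alt="Golf 8
print(lazy)    # /img/golf.jpg
```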

In [None]:
# Form and output JSON
import json

dataItem = {
    "title": title,
    "date": date,
    "imageUrl": imageUrl
}

print("Output object:\n%s" % json.dumps(dataItem, indent = 4))

## XPath

XPath is a language to address and filter elements in XML data. To get familiar with XPath, follow the [W3Schools tutorial](https://www.w3schools.com/xml/xpath_intro.asp) or check a [cheatsheet](https://devhints.io/xpath).

### XPath in Python

Let's first define a small XML document and retrieve some data items from it.

In [None]:
# Import libraries
from lxml import html

In [None]:
# Define the XML document
xmlDocument = """
          <?xml version="1.0" encoding="UTF-8"?>
          
          <bookstore>
          
          <book category="cooking">
            <title lang="en">Everyday Italian</title>
            <author>Giada De Laurentiis</author>
            <year>2005</year>
            <price>30.00</price>
          </book>
          
          <book category="children">
            <title lang="en">Harry Potter</title>
            <author>J K. Rowling</author>
            <year>2005</year>
            <price>29.99</price>
          </book>
          
          <book category="web">
            <title lang="en">XQuery Kick Start</title>
            <author>James McGovern</author>
            <author>Per Bothner</author>
            <author>Kurt Cagle</author>
            <author>James Linn</author>
            <author>Vaidyanathan Nagarajan</author>
            <year>2003</year>
            <price>49.99</price>
          </book>
          
          <book category="web">
            <title lang="en">Learning XML</title>
            <author>Erik T. Ray</author>
            <year>2003</year>
            <price>39.95</price>
          </book>
          
          </bookstore>
"""

# Build an element tree using the lxml library (its lenient HTML parser also accepts this XML)
tree = html.fromstring(xmlDocument)

In [None]:
# Select all titles (the result will be a list of lxml Element objects)
tree.xpath('//title')

In [None]:
# Select the first element from the above query and retrieve text data from it
tree.xpath('//title')[0].text

In [None]:
# Similar to the above but getting title texts directly using XPath
tree.xpath('//title/text()')

In [None]:
# More explicit version of the above
tree.xpath('//bookstore/book/title/text()')

In [None]:
# Select first authors only
tree.xpath('//bookstore/book/author[1]/text()')

In [None]:
# Select titles of English books that are cheaper than 30 EUR
tree.xpath('//bookstore/book[price<30]/title[@lang="en"]/text()')

### Web extraction using XPath

As above, we parse the HTML document into an element tree and retrieve the needed data with XPath queries.

In [None]:
# Build an element tree from the HTML page using the lxml library
tree = html.fromstring(pageContent)

In [None]:
# Get the article title
title = str(tree.xpath('//*[@id="container"]/div//h1/text()')[0])
print("Found title: '%s'." % title)

In [None]:
# Get the date (the raw text node still contains extra whitespace; it is cleaned up in the next cell)
date = str(tree.xpath('//*[@id="container"]/div/div[1]/div[1]/div[2]/div[1]/text()')[0])
print("Found date: '%s'." % date)

In [None]:
# Additionally format date
date = re.sub(r"\s", "", date)
print("Found date: '%s'." % date)

In [None]:
# Extract the image URL
imageUrl = str(tree.xpath('//*[@id="container"]/div/div[1]/div[1]/div[1]/div/div[2]/a/img/@src')[0])
print("Found imageUrl: '%s'." % imageUrl)

In [None]:
# Form and output JSON
import json

dataItem = {
    "title": title,
    "date": date,
    "imageUrl": imageUrl
}

print("Output object:\n%s" % json.dumps(dataItem, indent = 4))

**Hint:** Using the developer tools in Chrome, we can easily retrieve the XPath for a selected HTML element. If you use that feature for the assignment, you need to at least understand the result. It is also a good idea to shorten it where possible.

![Chrome XPath retrieval](XPathHelper.png)
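
As an illustration of shortening a generated path, the sketch below compares a long positional path with a shorter attribute-based one. It assumes a made-up, well-formed page fragment, and it deliberately uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) instead of lxml, so that it runs without extra dependencies. Both expressions are also valid XPath for lxml's `xpath()` method.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not the cached Golf8.html)
page = ET.fromstring("""
<html><body>
  <div id="container">
    <div>
      <div class="title"><h1>Example title</h1></div>
      <div class="date">9. 4. 2019</div>
    </div>
  </div>
</body></html>
""")

# Long positional path, similar in spirit to what browser tools generate
long_path = page.find("./body/div/div/div[2]")

# Shorter path that relies on the class attribute instead of positions
short_path = page.find(".//div[@class='date']")

# Both expressions select the same element
print(long_path is short_path)  # True
print(short_path.text)          # 9. 4. 2019
```

The attribute-based path keeps working if an extra wrapper `div` is added around the content, whereas the positional path would then need to be updated.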