# Introduction

A key to effective scraping is understanding how content and data are stored on web servers, how to identify the data you want to retrieve, and how the tools support this extraction.  In this chapter we will discuss website structures and the DOM, and introduce techniques to parse and query websites with lxml, XPath and CSS.  We will also look at how to work with websites developed in other languages and different encoding types such as Unicode.

Ultimately, understanding how to find and extract data within an HTML document comes down to understanding the structure of the HTML page, its representation in the DOM, the process of querying the DOM for specific elements, and how to specify which elements you want to retrieve based upon how the data is represented.

In this chapter we will cover:
* How to parse websites & navigate the tree
* How to parse XML & HTML with lxml
* Dealing with children, parents, siblings and attributes
* Query data with XPath
* Query data with CSS
* Beautiful Soup’s find methods
* Selectors in Scrapy
* Handling HTML in UTF-8 format

## Skills Learned
* Develop an understanding of the structure of a web page when represented with the DOM
* Learn how to navigate through DOM elements using children, siblings and parents
* Learn the fundamentals of XPath, and understand how it can be used to find specific pieces of data within an HTML document.
* Be able to query and extract data with XPath, CSS and regular expressions
* Know how to handle Unicode and UTF-8 encodings for documents

## References
[Planetary Facts](https://nssdc.gsfc.nasa.gov/planetary/factsheet/)

# How to parse websites & navigate the tree, dealing with children, parents, siblings and attributes

When the browser displays a web page it builds a model of the content of the page in a representation known as the DOM (Document Object Model).  The DOM is a hierarchical representation of all content of the page, as well as structural information, style information, scripts and links to other content.  It is critical to understand this structure to be able to effectively scrape data from web pages.  We will look at an example web page, its DOM, and examine how to navigate the DOM with Beautiful Soup.

## Getting ready
It is possible to examine the DOM in Chrome by right clicking the page and selecting Inspect.  The following shows opening inspection on the page http://127.0.0.1:8080/pages/planets.min.html 

![](img/01_01.png)

This opens the developer tools and the inspector.  The DOM can be examined in the elements tab.  The following shows the selection of the first row in the table.

![](img/01_02.png)

The tr element represents the row, and there are several characteristics of this element and its neighboring elements that we will examine.  First, this element has three attributes: id, class and name.  Attributes are often important in scraping as they are commonly used to identify data embedded in the HTML.

Second, the tr element has children, in this case the five td elements.  We will often need to look into the children of a specific element to find the actual data we desire.

This element has a parent, <tbody>. It also has siblings, the other elements that are also children of the parent element.  As we will see, we can use constructs in XPath to navigate these relationships.
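
The relationships just described can be sketched on a small inline fragment.  The markup below is a made-up stand-in modeled on the planets table rather than the actual page, and the choice of the built-in html.parser is an assumption made so the sketch runs without a server.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the planets table; not the real page.
SAMPLE = (
    '<table id="planetsTable"><tbody>'
    '<tr id="planetHeader"><th>Name</th><th>Mass</th></tr>'
    '<tr id="planet1" class="planet" name="Mercury"><td>Mercury</td><td>0.330</td></tr>'
    '<tr id="planet2" class="planet" name="Venus"><td>Venus</td><td>4.87</td></tr>'
    '</tbody></table>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
row = soup.find("tr", id="planet1")

# Attributes of the row, as a dictionary.
attributes = row.attrs

# Children: the td cells directly beneath the row.
child_tags = [child.name for child in row.children]

# Parent: the element that directly contains the row.
parent_tag = row.parent.name

# Siblings: the other rows that share the same parent.
sibling_ids = [s["id"] for s in row.find_previous_siblings()]
sibling_ids += [s["id"] for s in row.find_next_siblings()]

print(attributes, child_tags, parent_tag, sibling_ids)
```

Note that Beautiful Soup reports the class attribute as a list, because an element can carry several classes at once.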

## How to do it...

We can load this page into a BeautifulSoup object using the following code.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://127.0.0.1:8080/pages/planets.min.html")
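# Name the parser explicitly; omitting it makes Beautiful Soup pick one and emit a warning.
bsobj = BeautifulSoup(html, "html.parser")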

The text property of the resulting object bsobj returns the text content of the document with the tags stripped.  The following shows the first 1000 characters.

In [None]:
bsobj.text[:1000]

We can navigate the elements in the DOM using properties of bsobj.  bsobj represents the overall document and we can drill into the document by chaining the tag names.  The following navigates to the table containing the data.

In [None]:
bsobj.html.body.div.table

Each node has both children and descendants. Descendants are all the nodes underneath a given node, while children are those that are a first level descendant.  The following retrieves the children of the table.

In [None]:
bsobj.html.body.div.table.children

We can look over each child element using a for loop. The following will get all the children of the table element, each of which is a <tr>.

In [None]:
for c in bsobj.html.body.div.table.children:
    print (c)

Beautiful Soup always returns the first matching descendant of an element when using tags as properties.  While the table has many rows, .tr only returns the first tr element.

In [None]:
bsobj.html.body.div.table.tr

From any given sibling, we can progress to the next sibling using .find_next_sibling().

In [None]:
bsobj.html.body.div.table.tr.find_next_sibling()

The following demonstrates iterating all descendants of the first tr element.

In [None]:
for d in bsobj.html.body.div.table.tr.descendants:
    print (d)

The parent of a node can be found using the .parent property.

In [None]:
bsobj.html.body.div.table.tr.parent

## How it works
Beautiful Soup converts the HTML from the page into its own internal representation.  This model closely mirrors the DOM that would be created by a browser.  Beautiful Soup also provides many powerful capabilities for navigating the elements in the DOM, such as what we have seen using the tags as properties.

## There's more...
This manner of navigating the DOM is relatively inflexible and is highly dependent upon the structure.  It is possible that this structure can change over time as web pages are updated by their creator(s).  They could even look identical, but have a completely different structure that breaks your scraping code.

So how can we deal with this?  As we will see, there are several ways of searching for elements that are much better than defining explicit paths.  In general, we can do this using XPath, and also using the find methods of beautiful soup.  We will examine both in further recipes in this chapter.
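
To make the fragility concrete, here is a minimal sketch using two hypothetical versions of a page.  The markup, the banner div and the helper names by_path and by_search are all invented for illustration; the only difference between the versions is a new div added ahead of the one that holds the table.

```python
from bs4 import BeautifulSoup

# Two hypothetical versions of the same page. The newer one adds a banner div
# before the div that holds the table; the visible table is unchanged.
OLD = (
    '<html><body><div id="planets"><table>'
    '<tr id="planet1"><td>Mercury</td></tr>'
    '</table></div></body></html>'
)
NEW = (
    '<html><body><div id="banner">News</div><div id="planets"><table>'
    '<tr id="planet1"><td>Mercury</td></tr>'
    '</table></div></body></html>'
)


def by_path(markup):
    """Follow an explicit chain of tag names down to the first cell."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.html.body.div.table  # .div is always the FIRST div
    return None if table is None else table.tr.td.text


def by_search(markup):
    """Search for the row by its id, wherever it sits in the tree."""
    soup = BeautifulSoup(markup, "html.parser")
    row = soup.find("tr", id="planet1")
    return None if row is None else row.td.text


print(by_path(OLD), by_path(NEW))
print(by_search(OLD), by_search(NEW))
```

In this sketch the explicit path stops working on the newer markup because the first div no longer contains a table, while the search by id is unaffected.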

---
# BeautifulSoup's find methods
Items can also be located within the DOM using Beautiful Soup's find methods.  These methods give us a much more flexible and powerful way of finding elements that is not dependent upon the hierarchy of those elements, and they also provide search capabilities.

We will examine several common uses of these functions to locate various elements in the DOM.

## Getting ready
We will start by loading the sample page.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://127.0.0.1:8080/pages/planets.min.html")
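# Name the parser explicitly; omitting it makes Beautiful Soup pick one and emit a warning.
bsobj = BeautifulSoup(html, "html.parser")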

## How to do it...
In the previous example to access all of the tr elements that contained data we needed to get the table property and then either select the children of that element, or get the first tr descendant and iterate through all remaining siblings.  

We can do this more effectively with the following.

In [None]:
all_tr = bsobj.html.body.div.table.findAll("tr")
all_tr

This returns all the tr elements that are descendants of the table with one simple statement, and gives them to us as a list-like result instead of an iterator.

We can also require that the tag we are searching for has an attribute with a specific name and value.  The following retrieves tr elements with an id="planet1" attribute (Mercury).

In [None]:
mercury = bsobj.html.body.div.table.findAll("tr", {"id": "planet1"})
mercury

The planet rows on this page also carry a class="planet" attribute that marks them as rows with actual data, so searching on that attribute omits the header tr element.  The following demonstrates this by building a dictionary of the planets and their masses.

In [None]:
planet_masses = dict()
planet_rows = bsobj.html.body.div.table.findAll("tr", {"class": "planet"})
for row in planet_rows:
    tds = row.findAll("td")
    # The second cell holds the planet name and the third holds its mass.
    planet_masses[tds[1].text.strip()] = tds[2].text.strip()
planet_masses

## How it works
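This works because findAll searches the descendants of the element it is called on for all elements with the given tag name and, when supplied, the given attribute values.  Beautiful Soup performs this search by walking its own parsed tree; it does not translate the query into XPath.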
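
One rough way to picture the search is as a walk over every descendant, keeping the tags that pass a name and attribute test.  The sketch below is a conceptual approximation only, not Beautiful Soup's actual implementation; the sample markup and the helper find_all_by_walk are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment modeled on the planets table.
SAMPLE = (
    '<table>'
    '<tr id="planetHeader"><th>Name</th></tr>'
    '<tr id="planet1" class="planet"><td>Mercury</td></tr>'
    '<tr id="planet2" class="planet"><td>Venus</td></tr>'
    '</table>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")


def find_all_by_walk(root, tag_name, css_class):
    """Visit every descendant and keep tags whose name and class match."""
    matches = []
    for node in root.descendants:
        # Descendants include text nodes, so filter down to tags first.
        if isinstance(node, Tag) and node.name == tag_name:
            if css_class in node.get("class", []):
                matches.append(node)
    return matches


walked = [tag["id"] for tag in find_all_by_walk(soup, "tr", "planet")]
searched = [tag["id"] for tag in soup.find_all("tr", {"class": "planet"})]
print(walked, searched)
```

Both approaches select the same rows here, which is why the search does not depend on where in the hierarchy the rows sit.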

## There's more...
Speaking of XPath, it's now time to cover it in some detail.  That's the next section.

# Query with XPath
lxml is a Pythonic binding for the C libraries libxml2 and libxslt.  It combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API that is mostly compatible with, but more capable than, the well-known ElementTree API.

Because lxml wraps C libraries, it is generally faster than Beautiful Soup but can also be harder to install on some computers.  The latest installation instructions are available at http://lxml.de/installation.html.

lxml supports XPath, which makes it considerably easier to manage complex XML and HTML documents.  We will examine several techniques of using lxml and XPath together, and how to use them to navigate the DOM and access data.

## Getting ready
Start by importing html from lxml, and also requests.

In [15]:
from lxml import html
import requests

Now let's load the page with requests.

In [21]:
page = requests.get("http://127.0.0.1:8080/pages/planets.html")
page.text[:1000]

'<html>\r\n\r\n<head>\r\n</head>\r\n\r\n<body>\r\n    <div id="planets">\r\n        <h1>Planetary data</h1>\r\n        <div id="content">Here are some interesting facts about the planets in our solar system</div>\r\n        <p></p>\r\n        <table id="planetsTable" border="1">\r\n            <tr id="planetHeader">\r\n                <th>\r\n                </th>\r\n                <th>\r\n                    Name\r\n                </th>\r\n                <th>\r\n                    Mass (10^24kg)\r\n                </th>\r\n                <th>\r\n                    Diameter (km)\r\n                </th>\r\n                <th>\r\n                    How it got its Name\r\n                </th>\r\n            </tr>\r\n\r\n            <tr id="planet1" class="planet" name="Mercury">\r\n                <td>\r\n                    <img src="img/mercury-150x150.png">\r\n                </td>\r\n                <td>\r\n                    Mercury\r\n                </td>\r\n            

Now load the content into an element tree.

In [22]:
tree = html.fromstring(page.content)
tree

<Element html at 0x10ec85cc8>

The tree variable is now an lxml representation of the DOM which models the HTML content.

## How to do it...
The ultimate goal of this task is to learn about XPath and how to use it to extract data from HTML.  We will examine this specific document and look to extract data for the planets and learn various XPath concepts.

Let's start with the following XPath.

In [23]:
for v in tree.xpath("/html/body/div/table/tr"):
    print (v)

<Element tr at 0x10ebe7ea8>
<Element tr at 0x10ebe7c28>
<Element tr at 0x10ebe7e58>
<Element tr at 0x10ea48048>
<Element tr at 0x10ea48bd8>
<Element tr at 0x10ec767c8>
<Element tr at 0x10ec76bd8>
<Element tr at 0x10ec760e8>
<Element tr at 0x10ec76958>
<Element tr at 0x10ec76cc8>
<Element tr at 0x10ec76c78>


This XPath asks to return all the tr elements found from the root of the document and descending through tags with names html, body, div and table.

This returned 11 tr elements.  This is perhaps a little curious as there are only 9 planets in the data, so why 11 rows?

Let's examine by changing the statement slightly.

In [24]:
for v in tree.xpath("/html/body/div/table/tr"):
    print (v.xpath("@id"))

['planetHeader']
['planet1']
['planet2']
['planet3']
['planet4']
['planet5']
['planet6']
['planet7']
['planet8']
['planet9']
['footerRow']


This modification prints the value of the id attribute of the tr rows that are found.  The XPath for an attribute of the current node is @ followed by the attribute name.

So this is returning all the rows from two different tables.  At each level of the XPath (between any /'s) there can be multiple matches.  There are two div tags beneath body in this document, so the XPath engine continues to look down the next level (to table) on all the found div elements.  The table step then matches two tables, and the process continues with finding all the tr elements in all the found tables.  The result is 9 planet rows, plus one header row and one footer row.
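
The same effect can be reproduced on a much smaller document.  The following is a minimal sketch with made-up inline markup that mirrors the two div and table pairs described above; it is not the actual planets page.

```python
from lxml import html

# Hypothetical page with two div/table pairs, mirroring the layout described above.
SAMPLE = """
<html><body>
  <div id="planets"><table id="planetsTable">
    <tr id="planetHeader"><th>Name</th></tr>
    <tr id="planet1"><td>Mercury</td></tr>
    <tr id="planet2"><td>Venus</td></tr>
  </table></div>
  <div id="footer"><table id="footerTable">
    <tr id="footerRow"><td>Footer</td></tr>
  </table></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Every step can match several nodes, so rows from both tables come back.
div_count = len(tree.xpath("/html/body/div"))
all_row_ids = tree.xpath("/html/body/div/table/tr/@id")

# Restricting the div step to the first div keeps only the first table's rows.
first_div_row_ids = tree.xpath("/html/body/div[1]/table/tr/@id")

print(div_count, all_row_ids, first_div_row_ids)
```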

So how can we select just the table and rows with planet data?  In this document, there are several ways (by design).

Take the following as a first example.

In [41]:
for v in tree.xpath("/html/body/div[1]/table/tr"):
    print (v, v.xpath("@id"))

<Element tr at 0x10eca3e08> ['planetHeader']
<Element tr at 0x10eca3908> ['planet1']
<Element tr at 0x10eca3ae8> ['planet2']
<Element tr at 0x10eca3368> ['planet3']
<Element tr at 0x10eca3818> ['planet4']
<Element tr at 0x10ea48bd8> ['planet5']
<Element tr at 0x10ec85728> ['planet6']
<Element tr at 0x10ec85c78> ['planet7']
<Element tr at 0x10ec850e8> ['planet8']
<Element tr at 0x10eca3a98> ['planet9']


Each level of the XPath can return multiple elements in an array.  This array starts at 1 instead of 0 (a common source of errors).  So this statement states that we want the first div that is found.

With this document we can also specify that we only want a div with an id attribute with a particular value, in this case "planets".

In [40]:
for v in tree.xpath("/html/body/div[@id='planets']/table/tr"):
    print (v, v.xpath("@id"))

<Element tr at 0x10ec8a138> ['planetHeader']
<Element tr at 0x10ec8a1d8> ['planet1']
<Element tr at 0x10ec8a3b8> ['planet2']
<Element tr at 0x10ec8a228> ['planet3']
<Element tr at 0x10ec8a2c8> ['planet4']
<Element tr at 0x10ec8a098> ['planet5']
<Element tr at 0x10ec8a0e8> ['planet6']
<Element tr at 0x10ec8a458> ['planet7']
<Element tr at 0x10ec8a4a8> ['planet8']
<Element tr at 0x10eca3a98> ['planet9']


We will want to exclude the header row from this result.  There are several ways to do this.  The first can use the fact that the row has an id of "planetHeader".

In [39]:
for v in tree.xpath("/html/body/div[@id='planets']/table/tr[@id!='planetHeader']"):
    print (v, v.xpath("@id"))

<Element tr at 0x10eca3f48> ['planet1']
<Element tr at 0x10eca3f98> ['planet2']
<Element tr at 0x10eca3548> ['planet3']
<Element tr at 0x10eca39a8> ['planet4']
<Element tr at 0x10eca3cc8> ['planet5']
<Element tr at 0x10ec8a048> ['planet6']
<Element tr at 0x10ec8a098> ['planet7']
<Element tr at 0x10ec8a0e8> ['planet8']
<Element tr at 0x10eca3a98> ['planet9']


We could also use the fact that the planet rows have an attribute class with value of planet.

In [38]:
for v in tree.xpath("/html/body/div[@id='planets']/table/tr[@class='planet']"):
    print (v, v.xpath("@id"))

<Element tr at 0x10eca3c78> ['planet1']
<Element tr at 0x10eca39a8> ['planet2']
<Element tr at 0x10eca3cc8> ['planet3']
<Element tr at 0x10eca3598> ['planet4']
<Element tr at 0x10eca3278> ['planet5']
<Element tr at 0x10eca3548> ['planet6']
<Element tr at 0x10eca3d18> ['planet7']
<Element tr at 0x10eca3d68> ['planet8']
<Element tr at 0x10eca3a98> ['planet9']


Say the planet rows did not have attributes (nor the header row), then we could do this by position, skipping the first row.

In [37]:
for v in tree.xpath("/html/body/div[1]/table/tr[position() > 1]"):
    print (v, v.xpath("@name"))

<Element tr at 0x10eca3908> ['Mercury']
<Element tr at 0x10eca3638> ['Venus']
<Element tr at 0x10eca3958> ['Earth']
<Element tr at 0x10eca39a8> ['Mars']
<Element tr at 0x10eca3228> ['Jupiter']
<Element tr at 0x10eca31d8> ['Saturn']
<Element tr at 0x10eca3278> ['Uranus']
<Element tr at 0x10eca3548> ['Neptune']
<Element tr at 0x10eca3a98> ['Pluto']


We can also navigate to the parent of a node using parent::*.

In [36]:
for v in tree.xpath("/html/body/div/table/tr/parent::*"):
    print (v, v.xpath("@id"))

<Element table at 0x10eca3598> ['planetsTable']
<Element table at 0x10ebe7c28> ['footerTable']


This returned two parents.  Remember that this XPath returns the rows from two tables, so the parents of all those rows are found.

The * is a wild card that represents any parent tags with any name.  In this case, the two parents are both tables, but in general the result can be of any number of element types.  To specify a specific type, replace the * with the name.

In [42]:
for v in tree.xpath("/html/body/div/table/tr/parent::table"):
    print (v, v.xpath("@id"))

<Element table at 0x10eca3908> ['planetsTable']
<Element table at 0x10eca3ae8> ['footerTable']


It is also possible to specify a particular parent by position or attribute.  The following selects the parent with the id footerTable.

In [44]:
for v in tree.xpath("/html/body/div/table/tr/parent::*[@id='footerTable']"):
    print (v, v.xpath("@id"))

<Element table at 0x10eca3ae8> ['footerTable']


A shortcut for parent is .. (and . also represents the current node).

In [45]:
for v in tree.xpath("/html/body/div/table/tr/.."):
    print (v, v.xpath("@id"))

<Element table at 0x10ec849f8> ['planetsTable']
<Element table at 0x10eca3ae8> ['footerTable']


The following finds the mass of Earth.

In [56]:
tree.xpath("/html/body/div[1]/table/tr[@name='Earth']/td[3]/text()[1]")[0].strip()

'5.97'
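
The same pattern extends from one value to a whole table.  As a sketch, assuming a simplified inline copy of two planet rows (the markup is a stand-in, not the real file), we can select the rows once and then run short relative XPath expressions against each row.

```python
from lxml import html

# Simplified, hypothetical copy of two rows from the planets table.
SAMPLE = """<html><body><div id="planets"><table>
<tr id="planetHeader"><th></th><th>Name</th><th>Mass (10^24kg)</th></tr>
<tr id="planet3" class="planet" name="Earth"><td></td><td> Earth </td><td> 5.97 </td></tr>
<tr id="planet4" class="planet" name="Mars"><td></td><td> Mars </td><td> 0.642 </td></tr>
</table></div></body></html>"""

tree = html.fromstring(SAMPLE)

masses = {}
for row in tree.xpath("/html/body/div[@id='planets']/table/tr[@class='planet']"):
    # Expressions without a leading / are evaluated relative to this row.
    name = row.xpath("@name")[0]
    # td[3] is the third cell of the row; XPath positions start at 1.
    mass_text = row.xpath("td[3]/text()")[0].strip()
    masses[name] = float(mass_text)

print(masses)
```

Selecting the rows first and then querying relative to each row keeps the name and the mass paired, even if a row is missing a cell.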

## How it works
XPath is an element of the XSLT standard, and provides the ability to select nodes in an XML document.  Since HTML has a tree structure much like that of XML, XPath can also be used to find elements in HTML.

lxml is a Python library created for manipulating XML and HTML documents.  One of the features of lxml is the ability to execute XPath statements against a document and return the resulting set of nodes to your code.

XPath itself is designed to model the structure of XML nodes, attributes and properties.  The syntax provides means of finding items in the XML that match the expression.  This can include matching or logical comparison of any of the nodes, attributes, values or text in the XML document.

Expressions can be combined to form very complex paths within the document.  It is also possible to navigate the document based upon relative positions, which helps greatly in finding data based upon relative instead of absolute positions within the DOM.
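
As an illustration of relative navigation, the sketch below uses the standard XPath axes following-sibling, preceding-sibling and parent on a small made-up table.  The markup is hypothetical and only loosely modeled on the planets page.

```python
from lxml import html

# Hypothetical rows; only the relative positions matter for this sketch.
SAMPLE = """<html><body><table>
<tr id="planetHeader"><th>Name</th></tr>
<tr id="planet2" name="Venus"><td>Venus</td></tr>
<tr id="planet3" name="Earth"><td>Earth</td></tr>
<tr id="planet4" name="Mars"><td>Mars</td></tr>
</table></body></html>"""

tree = html.fromstring(SAMPLE)

# The row immediately after the Earth row.
next_planet = tree.xpath("//tr[@name='Earth']/following-sibling::tr[1]/@name")

# The row immediately before it; on a reverse axis, [1] is the nearest node.
previous_planet = tree.xpath("//tr[@name='Earth']/preceding-sibling::tr[1]/@name")

# Start from a cell's text and climb to the row that contains it.
row_id_for_cell = tree.xpath("//td[normalize-space(text())='Mars']/parent::tr/@id")

print(next_planet, previous_planet, row_id_for_cell)
```

None of these expressions mention where the table sits in the document, so they keep working if the surrounding layout changes.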

Understanding XPath is essential to knowing how to parse HTML and perform web scraping.  And as we will see, it underlies features of higher level tools, such as the CSS selectors in lxml and the selectors in Scrapy.


## There's more...
XPath is actually an amazing tool for working with XML and HTML documents.  It is quite rich in its capabilities, and we have barely scratched the surface, demonstrating only a few examples that are common to scraping data in HTML documents.  To learn much more, please visit the following links:

* https://www.w3schools.com/xml/xml_xpath.asp
* https://www.w3.org/TR/xpath/

# CSS Select
CSS selectors are patterns used for selecting elements and are often used to define which elements styles should be applied to.  They can also be used with lxml to select nodes in the DOM.  CSS selectors are commonly used as they are more compact than XPath and generally can be more reusable in code.  The following are examples of common selectors:

| What you are looking for | Example |
| -- | -- |
| All tags | * |
| A specific tag (ie: tr) | tr |
| A class name (ie: "planet") | .planet |
| A tag with a class "planet" | tr.planet |
| A tag with an ID "planet3" | tr#planet3 |
| A child tr of a table | table > tr |
| A descendant tr of a table | table tr |
| A tag with an attribute (ie: tr with id="planet4") | tr[id=planet4] |

## Getting ready
Let's start examining CSS selectors by loading the page.

In [None]:
from lxml import html
import requests
page = requests.get("http://127.0.0.1:8080/pages/planets.min.html")
tree = html.fromstring(page.content)

## How to do it...

The following selects all tr elements with a class equal to "planet".

In [59]:
for v in tree.cssselect('tr.planet'):
    print (v, v.xpath("@name"))

<Element tr at 0x10ece1ae8> ['Mercury']
<Element tr at 0x10ece1bd8> ['Venus']
<Element tr at 0x10ece1e08> ['Earth']
<Element tr at 0x10ed1aa48> ['Mars']
<Element tr at 0x10ed1abd8> ['Jupiter']
<Element tr at 0x10ed1ac28> ['Saturn']
<Element tr at 0x10ed1ac78> ['Uranus']
<Element tr at 0x10ed1acc8> ['Neptune']
<Element tr at 0x10eca34a8> ['Pluto']


Data for the Earth can be found by several means.  The following gets the row based on ID.

In [74]:
tr = tree.cssselect("tr#planet3")
tr[0], tr[0].xpath("./td[2]/text()")[0].strip()

(<Element tr at 0x10ece1e08>, 'Earth')

Or find with an attribute with a specific value.

In [77]:
tr = tree.cssselect("tr[name='Pluto']")
tr[0], tr[0].xpath("td[2]/text()")[0].strip()

(<Element tr at 0x10eca34a8>, 'Pluto')

Note that unlike XPath, the @ symbol need not be used to specify an attribute.

## How it works
lxml converts the CSS selector you provide to XPath, and then performs that XPath expression against the underlying document.  In essence, CSS selectors in lxml provide a shorthand to XPath that makes finding nodes fitting certain patterns simpler than with XPath.

## There's more...
Because CSS selectors utilize XPath under the covers, there is some overhead to their use as compared to using XPath directly.  This difference is however almost a non-issue, and hence in certain scenarios it is easiest to just use cssselect.

A full description of css selectors can be found at [Selectors Level 3](https://www.w3.org/TR/2011/REC-css3-selectors-20110929/).

---
# Scrapy Selectors
Scrapy is a Python web spider framework that is used to extract data from websites.  It provides many powerful features for navigating entire websites, such as the ability to follow links.  One feature that it provides is the ability to find data within a document using the DOM and the now quite familiar XPath.

The example we will look at will load the list of current questions on StackOverflow, and then parse this using a scrapy selector.  Using that selector, we will extract the text of each question.

## Getting ready
We start by importing Selector from scrapy, and also requests so that we can retrieve the page.

In [78]:
from scrapy.selector import Selector
import requests

## How to do it...
We start with the loading of the page.  The following requests the most recent questions from StackOverflow, with the number of questions controlled by the pagesize parameter.

In [None]:
payload = { 'pagesize': 1, 'sort': 'newest'}
response = requests.get("http://stackoverflow.com/questions", params=payload)
response.url
#response.text

Now create a Selector by passing it the text of the response.

In [None]:
selector = Selector(text=response.text)
selector

Now we can use XPath to retrieve the question summaries.

In [None]:
summaries = selector.xpath('//div[@class="summary"]/h3')
summaries[0:5]

And the following prints all of the contained questions.

In [None]:
for question in summaries:
    print (question.xpath('a[@class="question-hyperlink"]/text()').extract()[0])

## How it works
## There's more...

---
# How to load data in unicode / UTF-8
A document's encoding tells an application how the characters in the document are represented as bytes in the file.  Essentially, the encoding specifies which byte sequence stands for each character, and therefore how many bytes each character occupies.  In a standard ASCII document, every character is stored in a single 8-bit byte.  HTML files are often encoded with one byte per character, but with the globalization of the Internet, this is not always the case.  Many HTML documents use encodings with 16-bit characters, or variable-width encodings in which some characters take one byte and others take two or more.

A particularly common form of encoding of HTML documents is referred to as UTF-8.  This is the encoding form that we will examine.
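
The variable width of UTF-8 is easy to observe in Python.  This minimal sketch encodes three characters chosen purely for illustration and counts the bytes each one produces.

```python
# Three illustrative characters: a Latin letter, a Cyrillic letter, the euro sign.
samples = ["A", "Љ", "€"]

# Number of bytes each character occupies once encoded as UTF-8.
widths = {character: len(character.encode("utf-8")) for character in samples}

# The raw bytes for the Cyrillic letter.
encoded = "Љ".encode("utf-8")

print(widths)
print(encoded)
```

The two bytes printed for the Cyrillic letter are the kind of \x escape sequence that shows up when UTF-8 content is viewed as raw bytes.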

## Getting ready
We will read a file named unicode.html.  This file is UTF-8 encoded and contains several sets of characters in different parts of the encoding space.  For example, the page looks like the following in your browser.

![Unicode.html](img/01_04.png)

Using an editor that supports UTF-8 we can see how the Cyrillic characters are rendered in the editor (in my case, Visual Studio Code).

![Cyrillic](img/01_05.png)

## How to do it...
We will look at using urlopen and requests to handle HTML in UTF-8.  Let's start with urlopen.  The following reads the data and displays the section of the file for the Cyrillic table.

In [None]:
from urllib.request import urlopen
page = urlopen("http://localhost:8080/pages/unicode.html")
content = page.read()
content[840:1280]

Note how the Cyrillic characters were read in as multi-byte codes using \ notation, such as \xd0\x89.  To rectify this, we can decode the bytes as UTF-8 to produce a Python string.

In [None]:
str(content, "utf-8")[837:1270]

Note that the output now has the characters encoded properly.
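
The decoding step can be sketched without a server by starting from a known string.  The sample word is chosen only for illustration; the point is that the bytes have to be decoded with the same encoding that produced them.

```python
# Bytes as they might arrive over the network for a short Cyrillic word.
raw = "Привет".encode("utf-8")

# Decoding with the matching encoding recovers the original characters.
text = raw.decode("utf-8")

# str(bytes, encoding) is an equivalent spelling of the same operation.
same_text = str(raw, "utf-8")

# Decoding with the wrong encoding raises no error here, but yields garbage.
wrong = raw.decode("latin-1")

print(len(raw), len(text))
print(text, wrong)
```

The byte count is twice the character count because each of these Cyrillic letters takes two bytes in UTF-8.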

With requests we can do this in one statement.

In [None]:
import requests
r = requests.get("http://localhost:8080/pages/unicode.html")
r.text

## How it works
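In the case of using urlopen, the conversion was explicitly performed by using the str function and specifying that the content should be decoded as UTF-8.  Requests performs the decoding for us when we access r.text.  It relies primarily on the character set declared in the HTTP response headers rather than on the HTML itself, so if the decoded text looks wrong, r.encoding can be set explicitly (for example to "utf-8") before reading r.text.  The document itself also declares its encoding with the following tag: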

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">


## There's more...
There are a number of resources available on the Internet to learn about Unicode and UTF-8 encoding techniques.  Perhaps the best is the following Wikipedia article, which is an excellent summary and has a great table describing the encoding technique.

https://en.wikipedia.org/wiki/UTF-8