# Web scraping

Python has many libraries for reading and writing data in HTML and XML formats. Examples include `lxml` (http://lxml.de), `Beautiful Soup` (https://www.crummy.com/software/BeautifulSoup/) and `html5lib`. While lxml is much faster, Beautiful Soup and html5lib can handle malformed HTML or XML files. `pandas` has a built-in function `read_html`, which uses Beautiful Soup and lxml under the hood. Since we might need to fix or parameterize things under the hood, we will also work through a Beautiful Soup case study in this course. Later in this Programming 1 course we work with JSON, web APIs, *hierarchical data formats* like HDF5 files, and SQL databases.
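
To illustrate why tolerance for malformed markup matters, the sketch below feeds a deliberately broken snippet to the strict XML parser from the standard library. The snippet itself is made up for this illustration; a lenient parser such as Beautiful Soup would typically try to repair the markup instead of raising an error.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: the <b> tag is never closed.
malformed = "<p>An <b>unclosed tag</p>"

try:
    ET.fromstring(malformed)
    parsed = True
except ET.ParseError as err:
    # The strict parser refuses input that is not well-formed.
    parsed = False
    print("Strict parser gave up:", err)
```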

Before we start working with web scraping libraries we need to install them. If the system does not have the libraries lxml, beautifulsoup4 and html5lib, we can install them in a virtual environment.

In [1]:
virtualenv -p /usr/bin/python3 venv
source venv/bin/activate

#install tools
pip3 install beautifulsoup4
pip3 install html5lib
pip3 install lxml
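
To check from Python whether the three libraries are already available, something along these lines may help. It is a minimal sketch: the helper name `missing_packages` is hypothetical, and the mapping reflects that `beautifulsoup4` is imported under the module name `bs4`.

```python
import importlib.util

# pip package name -> module name used in an import statement
packages = {"beautifulsoup4": "bs4", "html5lib": "html5lib", "lxml": "lxml"}


def missing_packages(mapping):
    """Return the pip names whose module cannot be found (hypothetical helper)."""
    return [pip_name for pip_name, module in mapping.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages(packages)
print("Missing:", missing or "none")
```

Any name printed as missing can then be installed with `pip3 install` inside the activated virtual environment.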

## Scraping XML

XML is a common structured data format that supports hierarchical, nested data with metadata. XML and HTML are structured similarly, but XML is more general. An example of XML is shown in the output below.
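
As a small illustration of nesting and metadata, the sketch below builds a tiny XML document with the standard library. The tag names and the `zone` attribute are made up for this example: child elements carry the hierarchy, attributes carry the metadata.

```python
import xml.etree.ElementTree as ET

# Illustrative record only; tag and attribute names are assumptions.
catalog = ET.Element("catalog")
plant = ET.SubElement(catalog, "plant", attrib={"zone": "4"})  # attribute = metadata
ET.SubElement(plant, "common").text = "Bloodroot"               # nested child elements
ET.SubElement(plant, "botanical").text = "Sanguinaria canadensis"

# Serialize the tree to a string instead of bytes.
xml_text = ET.tostring(catalog, encoding="unicode")
print(xml_text)
```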

We can fetch the XML tree by opening the URL, reading the data, and decoding it.

In [2]:
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl


# Ignore SSL certificate errors (this disables verification, so use it only for trusted servers)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://bioinf.nl/~fennaf/DSLS/plants.xml'
#url = 'http://www.phyloxml.org/examples/apaf.xml'
print('Retrieving', url)
uh = urllib.request.urlopen(url, context=ctx)

data = uh.read()
print('Retrieved', len(data), 'characters')
print(data.decode())


Retrieving https://bioinf.nl/~fennaf/DSLS/plants.xml
Retrieved 7086 characters
<?xml version="1.0" encoding="ISO8859-1" ?>
<CATALOG>
 <PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
 </PLANT>
 
 <PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
 </PLANT>
 
 <PLANT>
 <COMMON>Marsh Marigold</COMMON>
 <BOTANICAL>Caltha palustris</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Sunny</LIGHT>
 <PRICE>$6.81</PRICE>
 <AVAILABILITY>051799</AVAILABILITY>
 </PLANT>
 
 <PLANT>
 <COMMON>Cowslip</COMMON>
 <BOTANICAL>Caltha palustris</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.90</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
 </PLANT>
 
 <PLANT>
 <COMMON>Dutchman's-Breeches</COMMON>
 <BOTANICAL>Diecentra cucullari
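
One detail about the decoding step: the first line of the catalog declares an ISO 8859-1 encoding, while `data.decode()` without arguments assumes UTF-8. For plain ASCII content the result is the same, but with accented characters it can differ. The sketch below uses a made-up one-plant document (with the encoding name spelled `ISO-8859-1`) to show that passing the raw bytes to `ET.fromstring` lets the parser apply the declared encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; byte 0xE9 is "é" in ISO-8859-1.
raw = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
       b'<PLANT><COMMON>Caf\xe9 au lait dahlia</COMMON></PLANT>')

# Parsing the bytes directly: the parser honors the declared encoding.
root = ET.fromstring(raw)
name = root.find("COMMON").text
print(name)

# Decoding the same bytes as UTF-8 (the default of bytes.decode) fails.
try:
    raw.decode()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("UTF-8 decode worked:", utf8_ok)
```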

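If we need specific information, we can use `ElementTree` to fetch that data: `find` returns the first matching child element, `text` holds its content, and `get` reads an attribute.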

In [3]:
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Fenna</name>
  <phone type="intl">
    +31646080034
  </phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))

Name: Fenna
Attr: yes
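
Building on the same `person` data, the sketch below shows two details that may come up in practice. The text of an element keeps its surrounding whitespace, so `strip()` is useful, and `findtext` accepts a default for elements that are absent. The `fax` tag is not in the data and is used here only to demonstrate the default.

```python
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Fenna</name>
  <phone type="intl">
    +31646080034
  </phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)

phone = tree.find('phone')
number = phone.text.strip()      # remove the newlines and indentation around the number
phone_type = phone.get('type')   # read the attribute of the same element

# 'fax' does not exist in the data; findtext returns the default instead of failing.
fax = tree.findtext('fax', default='not listed')

print(number, phone_type, fax)
```

Calling `tree.find('fax').text` instead would raise an `AttributeError`, because `find` returns `None` when nothing matches. Iterating over an element yields its direct children, which is how the whole catalog is walked next.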


In [4]:
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://bioinf.nl/~fennaf/DSLS/plants.xml'
print('Retrieving', url)
uh = urllib.request.urlopen(url, context=ctx)

data = uh.read()
tree = ET.fromstring(data)
for child in tree:
    print('\n')
    for element in child:
        print(element.tag, element.text)


Retrieving https://bioinf.nl/~fennaf/DSLS/plants.xml


COMMON Bloodroot
BOTANICAL Sanguinaria canadensis
ZONE 4
LIGHT Mostly Shady
PRICE $2.44
AVAILABILITY 031599


COMMON Columbine
BOTANICAL Aquilegia canadensis
ZONE 3
LIGHT Mostly Shady
PRICE $9.37
AVAILABILITY 030699


COMMON Marsh Marigold
BOTANICAL Caltha palustris
ZONE 4
LIGHT Mostly Sunny
PRICE $6.81
AVAILABILITY 051799


COMMON Cowslip
BOTANICAL Caltha palustris
ZONE 4
LIGHT Mostly Shady
PRICE $9.90
AVAILABILITY 030699


COMMON Dutchman's-Breeches
BOTANICAL Diecentra cucullaria
ZONE 3
LIGHT Mostly Shady
PRICE $6.44
AVAILABILITY 012099


COMMON Ginger, Wild
BOTANICAL Asarum canadense
ZONE 3
LIGHT Mostly Shady
PRICE $9.03
AVAILABILITY 041899


COMMON Hepatica
BOTANICAL Hepatica americana
ZONE 4
LIGHT Mostly Shady
PRICE $4.45
AVAILABILITY 012699


COMMON Liverleaf
BOTANICAL Hepatica americana
ZONE 4
LIGHT Mostly Shady
PRICE $3.99
AVAILABILITY 010299


COMMON Jack-In-The-Pulpit
BOTANICAL Arisaema triphyllum
ZONE 4
LIGHT Mostly Shad