## PRACTICAL :: Scraping & XML

Learning Outcomes:
1. Run HTTP requests from within Python.
2. Interpret desired HTML tags as XML objects.
3. Use XML extraction techniques to scrape data out of HTML.

## The Scenario

You are a market analyst working for a tourism agency. Your boss has approached you with a client who needs a recommendation on the top tourist destinations of 2018.

This may sound easy, but the client has also requested that more innovative places be prioritised in the recommendation, in the hope that this will improve their tourism experience.

## Required Libraries

For this task, we will need the following libraries:

In [74]:
import urllib.request
import xml.etree.ElementTree as ET
import re

Using the 'urllib' library, you can pull the HTML of a web page down as raw bytes and then treat it as text. Below, we declare a function that only requires a URL to do this, handling the bulk of the web request. However, a crucial part relating to how the library functions is missing and needs to be added to complete the method (hint: visit https://docs.python.org/3/howto/urllib2.html for help):

In [34]:
def request(url):
	response = urllib.request.urlopen(url)
	html = response.read()
	return html
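
As a variation on the function above, the following is a minimal sketch, not part of the original practical, that also sets a timeout, sends a `User-Agent` header, and decodes the bytes into text. The name `request_text` and the `User-Agent` string are assumptions made for illustration. So that the sketch runs without a network connection, it is demonstrated on a `data:` URL, which `urllib.request` has supported since Python 3.4.

```python
import urllib.parse
import urllib.request


def request_text(url, timeout=10):
    """Fetch a URL and return its body as decoded text (hypothetical helper)."""
    # The User-Agent value is an arbitrary, illustrative string.
    req = urllib.request.Request(url, headers={"User-Agent": "scraping-practical/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Offline demonstration: a data: URL carries its own content.
demo_url = "data:text/html;charset=utf-8," + urllib.parse.quote("<p>café</p>")
print(request_text(demo_url))
```

Decoding with the declared charset gives a normal string with real line breaks, rather than the `b'...'` representation that `str()` produces for bytes.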

Provided that the method works, you can run the following lines of code to retrieve a simple web page:

In [38]:
url_world_tourism = 'https://en.wikipedia.org/wiki/World_Tourism_rankings' # Wikipedia's Top Tourism Rankings
raw_html = request(url_world_tourism)
# 'replace()' is used here only to prettify the output: it turns each escaped '\n' into a real line break
print(str(raw_html).replace('\\n','\n'))

b'<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>World Tourism rankings - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"World_Tourism_rankings","wgTitle":"World Tourism rankings","wgCurRevisionId":829904334,"wgRevisionId":829904334,"wgArticleId":14752049,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from August 2017","Lists of countries by economic indicator","International rankings","Tourism-related lists","World Tourism Organization"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable

The HTML shown above corresponds to the page at this link: https://en.wikipedia.org/wiki/World_Tourism_rankings. On that page, there is a section listing the world's top tourism destinations. Using your experience of working with HTML, work out which HTML tag these details are stored in on this page.

In the next lines of code, we use a crude method of extraction to look at the HTML relating specifically to the section describing the top international tourist destinations. The code is almost complete; however, the tag we'll be searching for (in which the desired data is enclosed) is missing:

In [29]:
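tag = 'table'
html_text = str(raw_html)  # request() returns bytes, so work on its text form
anchor_one = html_text.index('<'+tag)  # start of the first opening tag
anchor_two = html_text.index('</'+tag, anchor_one)  # first closing tag after it
print(html_text[anchor_one:anchor_two].replace('\\n','\n'))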

<table class="wikitable sortable" style="margin:1em auto 1em auto;">
<tr>
<th>Rank</th>
<th>Destination</th>
<th>International<br />
tourist<br />
arrivals<br />
(2016)<sup id="cite_ref-WTO_Tourism_Highlights_2016_Edition_1-1" class="reference"><a href="#cite_note-WTO_Tourism_Highlights_2016_Edition-1">[1]</a></sup></th>
<th>International<br />
tourist<br />
arrivals<br />
(2015)<sup id="cite_ref-WTO_Tourism_Highlights_2016_Edition_1-2" class="reference"><a href="#cite_note-WTO_Tourism_Highlights_2016_Edition-1">[1]</a></sup></th>
<th>Change<br />
(2015 to<br />
2016)<br />
(%)</th>
<th>Change<br />
(2014 to<br />
2015)<br />
(%)</th>
</tr>
<tr align="center">
<td>1</td>
<td align="left"><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/23px-Flag_of_France.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/35px-Flag_of_France.svg.png 1.5x, //upload.wikimedi

The method used above is simple to execute, and it gave us a more focussed look at the HTML we are after in the international tourism rankings. However, it relies on raw string positions, so applying this technique for a more comprehensive interpretation of the HTML is very difficult.
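
To make that limitation concrete, here is a small sketch on a hand-written snippet (not the Wikipedia page). The helper name `slice_first` is hypothetical. It shows that searching by string position reaches only the first match; getting at a later table would need extra bookkeeping.

```python
# Hand-written snippet with two tables, for illustration only.
snippet = (
    '<div>'
    '<table id="first"><tr><td>France</td></tr></table>'
    '<table id="second"><tr><td>Morocco</td></tr></table>'
    '</div>'
)


def slice_first(text, tag):
    """Return the text from the first '<tag' up to the next '</tag'."""
    start = text.index('<' + tag)
    end = text.index('</' + tag, start)  # search only after the opening tag
    return text[start:end]


first_table = slice_first(snippet, 'table')
print(first_table)
```

The slice contains the first table only, and it would also be cut short if one table were nested inside another.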

As a much more powerful alternative, the XML library allows us to interpret HTML as a standard XML DOM tree, provided the markup is well-formed (for a quick intro to understanding XML, visit here: https://www.w3schools.com/xml/default.asp). Looking at the raw HTML of the page we just analysed, we can develop a method using the XML library to easily isolate any data contained in the tag we want. To isolate the data even further, we can use the attributes of contained sub-elements to get only the names of the countries we are after, and nothing else. Complete the method below to do this. For help on understanding the XML library, you can visit here: https://docs.python.org/3/library/xml.etree.elementtree.html
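
As a warm-up, the following minimal sketch shows the ElementTree calls involved: parsing, iterating, and reading attributes and text. It uses a hand-written snippet, not Wikipedia's actual markup. It also shows that markup which is not well-formed XML raises `ET.ParseError`, so this approach depends on the page being well-formed.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed snippet (not Wikipedia's actual markup).
snippet = """<html><body>
<table>
  <tr><td><a href="/wiki/Tourism_in_France" title="Tourism in France">France</a></td></tr>
  <tr><td><a href="/wiki/Paris" title="Paris">Paris</a></td></tr>
</table>
</body></html>"""

root = ET.fromstring(snippet)

# iter('a') walks every <a> element below the root, in document order.
for link in root.iter('a'):
    print(link.tag, '|', link.attrib['title'], '|', link.text)

# Keep only the links whose title attribute marks a tourism article.
matches = [link.text for link in root.iter('a')
           if 'Tourism in ' in link.attrib.get('title', '')]
print(matches)

# Markup that is not well-formed XML is rejected outright.
try:
    ET.fromstring('<p>one<br>two</p>')
    parse_failed = False
except ET.ParseError as err:
    parse_failed = True
    print('not well-formed:', err)
```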

In [36]:
def xml_retrieve_locations(html):
    locations = []
    parser = ET.XMLParser(encoding="utf-8")
    root = ET.fromstring(html, parser=parser)
    for elem in root.iter():
        if (elem.tag == 'table'):
            for sub_elem in elem.iter():
                if 'title' in sub_elem.attrib:
                    if 'Tourism in ' in sub_elem.attrib['title']:
                        locations.append(sub_elem.text)
    return locations

xml_retrieve_locations(raw_html)

['France',
 'United States',
 'Spain',
 'China',
 'Italy',
 'United Kingdom',
 'Germany',
 'Mexico',
 'Thailand',
 'Austria',
 'Morocco',
 'Egypt',
 'South Africa',
 'Tunisia',
 'Zimbabwe',
 'Algeria',
 'Mozambique',
 'Botswana',
 'Ivory Coast',
 'Namibia',
 'United States',
 'Mexico',
 'Canada',
 'Brazil',
 'Argentina',
 'Dominican Republic',
 'Chile',
 'Puerto Rico',
 'Cuba',
 'Peru',
 'China',
 'Thailand',
 'Malaysia',
 'Hong Kong',
 'Japan',
 'South Korea',
 'Macau',
 'India',
 'Singapore',
 'Indonesia',
 'France',
 'Spain',
 'Italy',
 'United Kingdom',
 'Germany',
 'Austria',
 'Turkey',
 'Greece',
 'Russia',
 'Poland',
 'Saudi Arabia',
 'Egypt',
 'Iran',
 'Jordan',
 'Israel',
 'Oman',
 'Lebanon',
 'United States',
 'Spain',
 'Thailand',
 'China',
 'France',
 'Italy',
 'United Kingdom',
 'Germany',
 'Hong Kong',
 'Australia',
 'South Africa',
 'Morocco',
 'Egypt',
 'Tanzania',
 'Mauritius',
 'Tunisia',
 'Botswana',
 'Nigeria',
 'Zimbabwe',
 'United States',
 'Mexico',
 'Canada',
 '

The method above retrieved the names of all countries contained in all 'table' tags on the page. However, we only want the top ranking table for the international tourism locations, and nothing else. To do this, we must alter the method so that it collects entries only from the table at a given position, where index 0 is the first table element on the page. Below, the solution is almost complete:

In [39]:
def xml_retrieve_locations_by_index(html,index):
	locations = []
	iterator = 0
	parser = ET.XMLParser(encoding="utf-8")
	root = ET.fromstring(html, parser=parser)
	for elem in root.iter():
		if (elem.tag == 'table'):
			if (iterator == index):
				for sub_elem in elem.iter():
					if 'title' in sub_elem.attrib:
						if 'Tourism in ' in sub_elem.attrib['title']:
							locations.append(sub_elem.text)
			iterator += 1
	return locations
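
For comparison, the same selection can be sketched with ElementTree's limited XPath support: `findall('.//table')` returns the descendant tables in document order, so a table can be picked by position directly. This is an alternative sketch rather than the practical's intended solution; the function name `locations_in_table` and the sample markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample with two tables (not Wikipedia's actual markup).
sample = """<html><body>
<table>
  <tr><td><a title="Tourism in France">France</a></td></tr>
  <tr><td><a title="Tourism in Spain">Spain</a></td></tr>
</table>
<table>
  <tr><td><a title="Tourism in Morocco">Morocco</a></td></tr>
</table>
</body></html>"""


def locations_in_table(root, index):
    """Return country names from the table at the given position."""
    tables = root.findall('.//table')  # all descendant tables, in document order
    if index >= len(tables):
        return []
    return [link.text for link in tables[index].iter('a')
            if 'Tourism in ' in link.get('title', '')]


root = ET.fromstring(sample)
print(locations_in_table(root, 0))
print(locations_in_table(root, 1))
```

Because the table is selected up front, the loop never visits the other tables at all.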

To test that this method has worked, the following line of code should return the names of the countries featured in the top international tourism locations list only:

In [49]:
international_tourism_locations = xml_retrieve_locations_by_index(raw_html,0)
print(international_tourism_locations)

['France', 'United States', 'Spain', 'China', 'Italy', 'United Kingdom', 'Germany', 'Mexico', 'Thailand', 'Austria']


Now that we have the locations we are after, the next part of the task is to find out which of these countries are the most innovative. As an indication of innovation, we will use the Human Development Index (HDI), which is included on every 'country' page of Wikipedia. Visit the following link to find out the HDI of Australia: https://en.wikipedia.org/wiki/Australia

The HTML of the section that corresponds to the 'HDI' of Australia is given as follows:

Looking at the raw HTML in which the HDI (of 0.939) is contained, what can be said about the surrounding content? Brainstorm ways in which we could use the XML library to extract the HDI from the given HTML.

Below, the method to do this is declared; however, it is incomplete. Use your newfound knowledge of the XML library to finish it off:

In [135]:
def xml_retrieve_HDI(html):
    parser = ET.XMLParser(encoding="utf-8")
    root = ET.fromstring(html, parser=parser)
    HDI = ''
    close = False
    actioned = False
    for elem in root.iter():
        if close:
            if elem.tag == 'td':
                HDI = elem[0].tail
                close = False
        if 'title' in elem.attrib and (actioned == False):
            if (elem.attrib['title'] == "Human Development Index"):
                close = True
                actioned = True
    regex = r"[0-9]\.[0-9]+"  # escaped dot, so only a literal decimal point matches
    return re.findall(regex, str(HDI))[0]
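
The method reads the value from `elem[0].tail`. In ElementTree, `text` holds the text before an element's first child, while `tail` holds the text that follows an element's closing tag. The sketch below illustrates the difference on a hand-written table row. The markup is hypothetical and only loosely modelled on an infobox row; it is not Wikipedia's actual HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical row: the number sits after an icon, inside the <td>.
row = ET.fromstring(
    '<tr>'
    '<th><a title="Human Development Index">HDI</a> (2015)</th>'
    '<td><img alt="Increase"/> 0.939<sup>[1]</sup></td>'
    '</tr>'
)

cell = row.find('td')
first_child = cell[0]  # the <img> element

# Nothing precedes the first child, so the cell's own text is empty.
print(repr(cell.text))
# The number follows the <img> tag, so it is stored as that element's tail.
print(repr(first_child.tail))
```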

To test that the method works, run the following code to see if it can output the HDI from Australia's wiki page:

In [136]:
xml_retrieve_HDI(request('https://en.wikipedia.org/wiki/Australia'))

'0.939'
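
When pulling a decimal such as this out of text with a regular expression, it matters whether the dot is escaped: an unescaped `.` matches any character, not just a decimal point. The short sketch below compares a loose and a strict pattern on a made-up sample string; the variable names are illustrative only.

```python
import re

loose = r"[0-9].[0-9]*"    # '.' matches ANY character
strict = r"[0-9]\.[0-9]+"  # '\.' matches only a literal decimal point

sample = "12th 0.939"  # made-up text where another number comes first

loose_first = re.findall(loose, sample)[0]
strict_first = re.findall(strict, sample)[0]

print(loose_first)   # picks up '12', which is not a decimal
print(strict_first)  # picks up '0.939'
print(float(strict_first) > 0.9)
```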

To make our HDI checker universally applicable, we can wrap it in a method that takes only the name of a country and builds the Wikipedia URL from it. To do this, complete the method below:

In [137]:
def wiki_country_HDI(country):
	return xml_retrieve_HDI(request('https://en.wikipedia.org/wiki/'+country))

wiki_country_HDI('Australia')

'0.939'

At this point, we have created a method that can return raw HTML from web pages, a method that can retrieve the top international tourism locations for 2018, and a method that can retrieve the HDI of any given country featured on Wikipedia. To gather the information needed for a recommendation to the client, all we have to do now is run the HDI checker method on the list of top tourism locations and show the HDI of each competing country. The incomplete code for this is given below:

In [139]:
def print_location_HDIs():
    tourism_locations = xml_retrieve_locations_by_index(raw_html,0)
    location_HDIs = []
    for location in tourism_locations:
        this_loc = location.replace(' ','_')
        this_HDI = wiki_country_HDI(this_loc)
        location_HDIs.append([this_loc, this_HDI])
    print(location_HDIs)

print_location_HDIs()

[['France', '0.897'], ['United_States', '0.920'], ['Spain', '0.884'], ['China', '0.738'], ['Italy', '0.887'], ['United_Kingdom', '0.909'], ['Germany', '0.926'], ['Mexico', '0.762'], ['Thailand', '0.740'], ['Austria', '0.893']]
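
To turn this list into a ranking, one option is to sort it by HDI. The sketch below is an assumption about how the recommendation might be prepared, not a step from the original practical. It reuses the values printed above and converts each HDI string to a float before sorting.

```python
# Values copied from the output above.
location_HDIs = [
    ['France', '0.897'], ['United_States', '0.920'], ['Spain', '0.884'],
    ['China', '0.738'], ['Italy', '0.887'], ['United_Kingdom', '0.909'],
    ['Germany', '0.926'], ['Mexico', '0.762'], ['Thailand', '0.740'],
    ['Austria', '0.893'],
]

# Sort from highest to lowest HDI; the strings must be compared as numbers.
ranked = sorted(location_HDIs, key=lambda pair: float(pair[1]), reverse=True)

for name, hdi in ranked:
    print(f"{name.replace('_', ' '):<15} {hdi}")
```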


With these details, you now have enough information to form a recommendation on which country to visit. Still, suppose the client also had in mind that they were only going to visit, say, Asian countries. How would you re-run your implementation so that the countries returned were not the international tourism locations, but only those in Asia?

In [140]:
def print_location_HDIs():
    tourism_locations = xml_retrieve_locations_by_index(raw_html,3)
    location_HDIs = []
    for location in tourism_locations:
        this_loc = location.replace(' ','_')
        this_HDI = wiki_country_HDI(this_loc)
        location_HDIs.append([this_loc, this_HDI])
    print(location_HDIs)

print_location_HDIs()

[['China', '0.738'], ['Thailand', '0.740'], ['Malaysia', '0.789'], ['Hong_Kong', '0.917'], ['Japan', '0.903'], ['South_Korea', '0.901'], ['Macau', '0.905'], ['India', '0.624'], ['Singapore', '0.925'], ['Indonesia', '0.689']]
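
Rather than redefining the function for each table, the list of locations and the lookup can be passed in as arguments. The following is a sketch of that design, not the practical's own solution. The name `location_HDIs` and the `stub` dictionary are hypothetical; the stub stands in for `wiki_country_HDI` so that the example runs offline, using three values from the output above.

```python
def location_HDIs(locations, lookup):
    """Pair each location with its HDI, using the supplied lookup callable."""
    pairs = []
    for location in locations:
        page = location.replace(' ', '_')  # Wikipedia page names use underscores
        pairs.append([page, lookup(page)])
    return pairs


# Stand-in for wiki_country_HDI (assumption: values copied from the output above).
stub = {'Hong_Kong': '0.917', 'South_Korea': '0.901', 'Singapore': '0.925'}

asia = location_HDIs(['Hong Kong', 'South Korea', 'Singapore'], stub.get)
print(asia)
```

In the notebook itself, the call would presumably look like `location_HDIs(xml_retrieve_locations_by_index(raw_html, 3), wiki_country_HDI)`, with the table index as the only thing that changes between regions.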


With these new criteria, how would you form your recommendation?