<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# TOWS - Threats and opportunities

This session focuses on obtaining data from web pages, via a process called web scraping.

The tasks to be completed are:
1. Understanding how information is structured within a web page
2. Finding specific information
3. Retrieving information from web pages
4. A business intelligence scenario to find threats/opportunities


## Web scraping

### 1. Understanding how information is structured within a web page

One of the most common ways of serving information online is through web pages. Web page information is semi-structured using HTML (HyperText Markup Language).

HTML represents a document as nested elements, with sub-elements branching out from the elements that contain them. As they 'branch' out further and further, they form a 'tree' shaped structure, which the browser represents as the DOM (Document Object Model).

The root 'html' element of the tree contains a 'head' element and a 'body' element, with most of the content you see on the page stored in the 'body'.
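
As a rough illustration of this nesting, the sketch below uses 'HTMLParser' from Python's standard library to print one indented line per element. The 'TreePrinter' class and the tiny page are made up for this example, and the sketch assumes every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collects one indented line per opening tag to show the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # record the tag at the current depth, then step one level deeper
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # a closing tag takes us back up one level
        self.depth -= 1

# a tiny hypothetical page, made up for this illustration
tiny_page = (
    "<html>"
    "<head><title>Example</title></head>"
    "<body><h1>Heading</h1><p>Some text</p></body>"
    "</html>"
)

printer = TreePrinter()
printer.feed(tiny_page)
print("\n".join(printer.lines))
```

The printed outline shows 'head' and 'body' indented under 'html', with 'title', 'h1' and 'p' one level further in.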

To begin, open a web browser. Depending on the browser you are using, follow the instructions at the link below to open its developer tools:

https://www.lifewire.com/web-browser-developer-tools-3988965

Once opened, a tabbed panel displays information about the web page you currently have open. It can tell you a lot about the page, but here we are only concerned with the 'Elements' tab (called 'Inspector' in Firefox), which shows the HTML DOM tree of the page itself.

### 2. Finding specific information

Now that we can load up a web page's HTML just from opening it in a browser, let's try something a bit more specific. 

The following web page is the 'Wikipedia' article for Australia:

https://en.wikipedia.org/wiki/Australia

Open a new tab and go to this page. On the page, you'll see in the right sidebar that the capital of Australia is 'Canberra'.

Simply right-click the text, and hit the 'Inspect'/'Inspect Element' option. This will load up the 'Developer Tools', which will not only open up the 'Elements' tab of the page, but will jump to the location of the element in which the information is stored.

### 3. Retrieving information from web pages

So far, we have seen how a page structures its content in HTML, and how to trace a piece of information on a web page back to the element that contains it in the DOM tree.

In this task, we introduce 'BeautifulSoup', a powerful library for Python that enables us to automate information retrieval from a web page:

In [1]:
from bs4 import BeautifulSoup

BeautifulSoup can parse the DOM tree from an HTML document, so that we can easily pull out elements from the page with simple expressions. Take for instance the following HTML that we load into a variable:

In [2]:
some_HTML_page = \
    '<html>'\
    '   <head>'\
    '   </head>'\
    '   <body>'\
    '      <div>Not Here</div>'\
    '      <div class="target">The Text We Are After</div>'\
    '   </body>'\
    '</html>'

Using BeautifulSoup, we first parse the page into a variable. From here, there are many possible ways of getting the target element (the element whose 'class' attribute has the value 'target').

The most obvious way for this scenario is to find every element with the class 'target' (here there is only one):

In [3]:
soup = BeautifulSoup(some_HTML_page, "html.parser")

# calling soup(...) is shorthand for soup.find_all(...)
for element in soup(attrs={'class': 'target'}):
    print(element)

<div class="target">The Text We Are After</div>


In more complex situations, we might not know the target element's class value, but may know details about its previous element (e.g. the text inside the previous element):

In [4]:
element = soup.find(string="Not Here") # the text before
print(element.find_next("div"))  # the tag that you want to find

<div class="target">The Text We Are After</div>


If we wanted to run BeautifulSoup on an actual web page, we could use the 'requests' library to download the raw content of that page. Here we define a basic function for pulling down HTML from a real web page, given its URL:

In [5]:
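import requests

def get_HTML(url):
    # get data from server; the timeout stops the call hanging indefinitely
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    html = response.content  # raw bytes of the page
    return html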

Recalling the Wikipedia page for Australia, we can then get its raw HTML using the following code:

In [6]:
Australia_Wiki_HTML = get_HTML('https://en.wikipedia.org/wiki/Australia')

In [7]:
# view the first 500 bytes of the page (response.content is a bytes object)
Australia_Wiki_HTML[:500]

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clien'
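
The 'b' prefix in the output above shows that the page was returned as bytes rather than as a text string. BeautifulSoup accepts bytes directly, so nothing needs to change, but if you want to view the page as text you can decode it. The sketch below assumes the page is encoded as UTF-8 (as Wikipedia pages are) and uses a shortened copy of the output above as sample data.

```python
# a shortened copy of the bytes shown in the output above
raw = b'<!DOCTYPE html>\n<html class="client-nojs">'

# decode the bytes into a text string, assuming UTF-8 encoding
text = raw.decode("utf-8")

print(type(raw).__name__)   # the raw response content is bytes
print(type(text).__name__)  # after decoding it is a str
print(text)
```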

The next question is: what do we know about the element in which the name of Australia's capital city is stored?

ANSWER: From perusing the elements in the 'Developer Tools' window, we can state the following facts about our target element:

* It has an 'a' tag
* It is inside an element with a 'td' tag
* The element with the 'td' tag is preceded by an element with a 'th' tag
* The element with the 'th' tag contains the text 'Capital'

So the code that gets the exact element we are after is given in the function below:

In [8]:
def get_the_capital(HTML):
    soup = BeautifulSoup(HTML, "html.parser") # the html input and the parser name
    th_element = soup.find(string="Capital") # the text that we are looking for
    target_element = th_element.find_next("a") # the tag that we are looking for
    print(target_element)

get_the_capital(Australia_Wiki_HTML)

<a href="/wiki/Canberra" title="Canberra">Canberra</a>


Before reading any further, follow this link (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to learn more about the functions ('find', 'find_next') used in the code above.

To demonstrate just how flexible our solution is, we can run the exact same method on a different country e.g. France:

In [9]:
France_Wiki_HTML = get_HTML('https://en.wikipedia.org/wiki/France')
get_the_capital(France_Wiki_HTML)

<a href="/wiki/Paris" title="Paris">Paris</a>


### 4. A Business Intelligence Scenario

You are a market analyst working for a tourism agency, and your boss has approached you with a client who needs a recommendation on the top tourist destinations.

While this may sound easy, the client has also requested that more innovative places be prioritised in the recommendation, in the hope that this will improve their tourism experience.

Fortunately for this task, the top tourist destinations can be found on the following web page:

In [10]:
top_tourism_destinations = 'https://en.wikipedia.org/wiki/World_Tourism_rankings'

Using the Developer Tools, identify features that could be used to isolate the names of the countries in the table, in the section entitled "Most visited destinations by international tourist arrivals".

For this task, the details are given below, and the code that follows uses them to retrieve the values:

Details:
- A 'span' element that is contained by an 'h2' element; the title of the target 'table' is inside the 'span' element.
- A 'table' element follows the 'h2' element.
- There are 'td' elements inside the 'table' element.
- Each 'td' element has an attribute of 'align' with the value 'left'.
- In each 'td' element, there is an 'a' element with the name of a given country inside it.


In [11]:
top_tourist_locations = []

Tourism_Wiki_HTML = get_HTML(top_tourism_destinations)
soup = BeautifulSoup(Tourism_Wiki_HTML, "html.parser")

span_element = soup.find(id="Most_visited_destinations_by_international_tourist_arrivals")
#print(span_element)

h2_element = span_element.parent #'h2' is the parent of the span element
#print(h2_element)

table_element = h2_element.find_next("table")
#print(table_element)

for td_element in table_element.find_all("td", attrs={'align': 'left'}):  # 'td' tags with align="left"
    print(td_element)
    a_element = td_element.find("a")  # the link holding the country name
    if a_element is not None:
        top_tourist_locations.append(a_element.text)

# The list of names of the top tourist locations.
top_tourist_locations

<td align="left"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/23px-Flag_of_France.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/35px-Flag_of_France.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/45px-Flag_of_France.svg.png 2x" width="23"/></span></span></span> <a href="/wiki/Tourism_in_France" title="Tourism in France">France</a></td>
<td align="left"><span typeof="mw:File"><span title="Increase"><img alt="Increase" class="mw-file-element" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //

['France',
 'Spain',
 'United States',
 'Italy',
 'Turkey',
 'Mexico',
 'United Kingdom',
 'Germany',
 'Greece',
 'Austria']

Knowing that the client is also looking for places that have higher innovation, what could we use from a single country's Wikipedia page to determine this quality?

Going back to 'https://en.wikipedia.org/wiki/Australia', the HDI (Human Development Index) of the country will give a good indication of this; so how do we describe where the HDI is stored?

Once again, here are some details to help:

   * The text 'HDI' is in an 'a' element.
   * The 'a' element is in a 'th' element.
   * The 'th' is followed by a 'td' element.
   * The 'td' element contains an 'img' element.
   * Next to the 'img' element is the HDI value.

The code that retrieves the HDI from a country's Wikipedia page is included in the following function:

In [12]:
def get_country_HDI(html):
    soup = BeautifulSoup(html, "html.parser")
    a_element = soup.find('a',string='HDI')
    #print(a_element)
    th_element = a_element.parent
    #print(th_element)
    td_element = th_element.find_next('td')
    #print(td_element.text)
    return td_element.text.strip()

# Returns the full text of the HDI cell; the HDI value is at the start of the string
get_country_HDI(France_Wiki_HTML)

'0.910[12]very high\xa0(28th)'
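
The returned string mixes the HDI value with a footnote marker, the category and the world rank. If only the number is needed, one possible approach is to pull the leading decimal out with a regular expression. The sketch below is an illustration only: the function name 'parse_HDI' is hypothetical, and it assumes the cell text always starts with the HDI value, as it does in the output above.

```python
import re

def parse_HDI(cell_text):
    """Return the leading decimal number in an HDI cell as a float, or None."""
    # match digits, a decimal point and more digits at the start of the text
    match = re.match(r"\s*(\d+\.\d+)", cell_text)
    return float(match.group(1)) if match else None

# the cell text returned for France in the output above
hdi_value = parse_HDI('0.910[12]very high\xa0(28th)')
print(hdi_value)
```

Returning 'None' when no number is found lets the calling code decide how to handle a page whose layout differs from the one assumed here.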

Now all we have to do to get the HDI of each country is to substitute each country's name into the Wikipedia article URL, and to feed the returned HTML into the 'get_country_HDI' function:

In [13]:
for rank, country in enumerate(top_tourist_locations, start=1):
    print("Country: " + country)
    print("Ranking: " + str(rank))
    # spaces in a country name are percent-encoded as %20 in the URL
    url = 'https://en.wikipedia.org/wiki/' + country.replace(' ', '%20')
    print("HDI: " + get_country_HDI(get_HTML(url)))
    print('\n')

Country: France
Ranking: 1
HDI: 0.910[12]very high (28th)


Country: Spain
Ranking: 2
HDI: 0.911[10]very high (27th)


Country: United States
Ranking: 3
HDI: 0.927[15]very high (20th)


Country: Italy
Ranking: 4
HDI: 0.906[9]very high (30th)


Country: Turkey
Ranking: 5
HDI: 0.855[11]very high (45th)


Country: Mexico
Ranking: 6
HDI: 0.781[8]high (77th)


Country: United Kingdom
Ranking: 7
HDI: 0.940[19]very high (15th)


Country: Germany
Ranking: 8
HDI: 0.950[10]very high (7th)


Country: Greece
Ranking: 9
HDI: 0.893[8]very high (33rd)


Country: Austria
Ranking: 10
HDI: 0.926[11]very high (22nd)
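
To make the comparison easier, the values in the output above can be placed side by side and sorted. The sketch below is one possible way to do this, not part of the original exercise: the numbers are copied by hand from the output above (so they are a snapshot of the pages at the time they were scraped), and the names 'results' and 'by_HDI' are made up for this example.

```python
# (tourism rank, country, HDI) values copied from the output above
results = [
    (1, "France", 0.910),
    (2, "Spain", 0.911),
    (3, "United States", 0.927),
    (4, "Italy", 0.906),
    (5, "Turkey", 0.855),
    (6, "Mexico", 0.781),
    (7, "United Kingdom", 0.940),
    (8, "Germany", 0.950),
    (9, "Greece", 0.893),
    (10, "Austria", 0.926),
]

# sort by HDI, highest first; the tourism rank is kept for comparison
by_HDI = sorted(results, key=lambda row: row[2], reverse=True)

for tourism_rank, country, hdi in by_HDI:
    print(f"{country:<15} HDI {hdi:.3f}  (tourism rank {tourism_rank})")
```

Sorting by HDI rather than by tourism rank highlights where the two orderings disagree, which is the information the recommendation needs to weigh.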




Comparing the rankings and HDIs, what would you state in your recommendation?