<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Studio :: External data (Threats and opportunities)

There are two possible sources of external data: web pages and APIs.

For web pages, the process is web scraping. The tasks to be completed are:
1. Understanding How Information Is Given In Web Pages
2. Finding How Specific Information In Web Pages Is Encoded
3. Using Python To Automate Information Retrieval From Web Pages
4. A Business Intelligence Scenario

For APIs, the tasks to be completed are:
1. Get the API key/s
2. Understand the structure of the API including end-points and query strings
3. Using Python to call the API

## Web scraping

### 1. Understanding How Information Is Given In Web Pages

One of the most common ways of serving information online is through web pages. Web pages are usually written in HTML, a markup language that represents documents as well-structured elements, with sub-elements branching out from containing elements.

As they 'branch' out further and further, they form what is typically called a DOM (Document Object Model) 'tree'.

A DOM tree has a 'head' element and a 'body' element, with most of the content that is visible on the page stored in the 'body'.
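
As a rough illustration of this head/body structure, the sketch below uses Python's built-in `html.parser` module to list each opening tag, indented by its depth in the tree. The sample document and the `TreePrinter` class are made up for this example.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML document used only for illustration
sample_html = (
    "<html>"
    "<head><title>Example</title></head>"
    "<body><h1>Heading</h1><p>Some text</p></body>"
    "</html>"
)

class TreePrinter(HTMLParser):
    """Collects each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then step one level deeper
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level
        self.depth -= 1

printer = TreePrinter()
printer.feed(sample_html)
printer.close()
print("\n".join(printer.lines))
```

The 'title' element appears nested under 'head', while the heading and paragraph appear under 'body'.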

To begin, open a web browser. Depending on the browser you are using, follow instructions in the following link to open its developer tools:

https://www.lifewire.com/web-browser-developer-tools-3988965

From there, a tabbed sub-window will appear that displays information about the current web page you have opened. It can tell you a lot about the page, although all we are concerned with is the 'Elements' section, which shows the actual HTML DOM tree of the page itself.

### 2. Finding How Specific Information In Web Pages Is Encoded

Now that we can load up a web page's HTML just from opening it in a browser, let's try something a bit more specific. 

The following web page is the 'Wikipedia' article for Australia:

https://en.wikipedia.org/wiki/Australia

Open a new tab and go to this page. On the page, you'll see in the right sidebar that the capital of Australia is 'Canberra'.

Simply right-click the text, and hit the 'Inspect'/'Inspect Element' option. This will load up the 'Developer Tools', which will not only open up the 'Elements' tab of the page, but will jump to the location of the element in which the information is stored.

### 3. Using Python To Automate Information Retrieval From Web Pages

So far, we have seen how a page renders content in HTML, as well as how to trace information from a web page back to the element that contains it within the DOM tree.

In this task, we introduce 'BeautifulSoup', a powerful library for Python that enables us to automate information retrieval from a web page:

In [1]:
from bs4 import BeautifulSoup

BeautifulSoup can interpret the DOM tree from an HTML document, so that we can easily pull out elements from the page with simple expressions. Take for instance the following HTML that we load into a variable:

In [2]:
some_HTML_page = \
    '<html>'\
    '   <head>'\
    '   </head>'\
    '   <body>'\
    '      <div>Not Here</div>'\
    '      <div class="target">The Text We Are After</div>'\
    '   </body>'\
    '</html>'

Using BeautifulSoup, we first interpret the page into a variable. From here, there are many possible ways of getting the target element (element with the 'class' of value 'target'). 

The most obvious way for this scenario is to search for elements with the class 'target' and print each one that is found:

In [3]:
soup = BeautifulSoup(some_HTML_page, "html.parser")

for element in soup(attrs={'class' : 'target'}):
    print(element)

<div class="target">The Text We Are After</div>
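
The loop above prints every element whose class is 'target'. When only the first match is wanted, `find` returns a single element (or `None` if there is no match), and `get_text()` extracts just the text inside it. A small sketch, using a made-up HTML string similar to the one above:

```python
from bs4 import BeautifulSoup

# Made-up HTML with two elements sharing the class 'target'
html = (
    '<div>Not Here</div>'
    '<div class="target">First match</div>'
    '<div class="target">Second match</div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find(class_="target")            # first matching element only
all_matches = soup.find_all(class_="target")  # list of every match

print(first.get_text())
print(len(all_matches))
```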


In more complex situations, we might not know the target element's class value, but may know details about its previous element (e.g. the text inside the previous element):

In [5]:
element = soup.find(text="Not Here") # the text before
print(element.findNext("div"))  # the tag that you want to find

<div class="target">The Text We Are After</div>


If we wanted to run BeautifulSoup on an actual web page, we could use the 'urllib.request' module from Python's standard library to download the raw HTML of that page. Here we define a basic method that pulls down the HTML of a real web page, given its URL:

In [8]:
import urllib.request

def get_HTML(url):
    response = urllib.request.urlopen(url)
    html = response.read()
    return html
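
Some sites reject requests that do not identify themselves, and a slow server can leave `urlopen` waiting for a long time. One possible variation, sketched below, sends a `User-Agent` header and sets a timeout. The function names, the header value and the 10-second timeout are illustrative assumptions, not part of the method above.

```python
import urllib.request

def build_request(url, user_agent="IAB303-studio-example/0.1"):
    """Create a request object that carries a User-Agent header (the value is an assumption)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def get_HTML_with_headers(url, timeout=10):
    """Download the raw bytes of a page, giving up after `timeout` seconds."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network
request = build_request("https://en.wikipedia.org/wiki/Australia")
print(request.get_header("User-agent"))
```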

Recalling the Wikipedia page for Australia, we can then get its raw HTML using the following code:

In [9]:
Australia_Wiki_HTML = get_HTML('https://en.wikipedia.org/wiki/Australia')

The next question is: what do we know about the element in which the name of Australia's capital city is stored?

ANSWER: From perusing the elements in the 'Developer Tools' window, we can state the following facts about our target element:

* It has an 'a' tag
* It is inside an element with a 'td' tag
* The element with the 'td' tag is preceded by an element with a 'th' tag
* The element with the 'th' tag contains the text 'Capital'

So the code that would get the exact element we are after is described in the method below:

In [10]:
def get_the_capital(HTML):
    soup = BeautifulSoup(HTML, "html.parser") # the html input and the parser name
    th_element = soup.find(text="Capital") # the text that we are looking for
    target_element = th_element.findNext("a") # the tag that we are looking for
    print(target_element)

get_the_capital(Australia_Wiki_HTML)

<a href="/wiki/Canberra" title="Canberra">Canberra</a>


Before reading any further, follow this link (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to learn more about the functions ('find', 'findNext') used in the code above. Note that the documentation lists 'findNext' under its newer name, 'find_next'.

To demonstrate just how flexible our solution is, we can run the exact same method on a different country e.g. France:

In [11]:
France_Wiki_HTML = get_HTML('https://en.wikipedia.org/wiki/France')
get_the_capital(France_Wiki_HTML)

<a href="/wiki/Paris" title="Paris">Paris</a>
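
The method above assumes that the text 'Capital' and a following 'a' element are always present. A slightly more defensive sketch, which returns the link text instead of printing the element and returns `None` when the pattern is not found, might look like this. The function name `get_capital_name` and the sample HTML are made up for the example.

```python
from bs4 import BeautifulSoup

def get_capital_name(html):
    """Return the text of the first link after the 'Capital' label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string="Capital")  # `string` is the newer name for `text`
    if label is None:
        return None
    link = label.find_next("a")
    return link.get_text() if link is not None else None

# A made-up fragment shaped like the infobox rows described above
sample = (
    '<table><tr><th>Capital</th>'
    '<td><a href="/wiki/Canberra">Canberra</a></td></tr></table>'
)
print(get_capital_name(sample))
print(get_capital_name('<p>No infobox here</p>'))
```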


### 4. A Business Intelligence Scenario

You are a market analyst working for a tourism agency, and your boss has approached you with a client who needs a recommendation regarding the top tourist destinations of 2018.

While this may sound easy, the client has also requested that more innovative places be prioritised in the recommendation, in the hope that this will improve their tourism experience.

Fortunately for this task, the top tourist destinations of 2018 are stored on the following URL:

In [19]:
top_tourism_destinations = 'https://en.wikipedia.org/wiki/World_Tourism_rankings'

Using the Developer Tools, identify things that could be used to isolate the names of the countries in the table, in the section entitled "Most visited destinations by international tourist arrivals". 

For this task, the details are given below; however, the code that retrieves the values is only half complete:

Details:
     * An 'h2' element contains a 'span' element with the title of the target 'table' inside it.
     * A 'table' element follows the 'span' element.
     * There are 'td' elements inside the 'table' element.
     * Each 'td' element has an attribute of 'align' with the value 'left'.
     * In each 'td' element, there is an 'a' element with the name of a given country inside it.

In [21]:
top_tourist_locations = []

Tourism_Wiki_HTML = get_HTML(???)
soup = BeautifulSoup(???, "html.parser")
span_element = soup.find(text=???)
h2_element = span_element.parent
table_element = h2_element.findNext(???) # a parent tag
for td_element in table_element.findAll(???,attrs={'align':'left'}): # a tag with specific attributes
    a_element = td_element.find(???) # the tag we are looking for
    if a_element != None:
        top_tourist_locations.append(a_element.text)

# If you enter the missing code, this will return a list of names of the top tourist destinations for 2018.
top_tourist_locations

['France',
 'Spain',
 'United States',
 'China',
 'Italy',
 'Mexico',
 'United Kingdom',
 'Turkey',
 'Germany',
 'Thailand']

Knowing that the client is also looking for more innovative places, what could we use from a single country's Wikipedia page to gauge this quality?

Going back to 'https://en.wikipedia.org/wiki/Australia', the country's HDI (Human Development Index) will give a good indication of this; so how do we locate the HDI in the page?

Once again, here are some details to help:

   * The text 'HDI' is in an 'a' element.
   * The 'a' element is in a 'th' element.
   * The 'th' is followed by a 'td' element.
   * The 'td' element contains an 'img' element.
   * Next to the 'img' element is the HDI value.

The code that retrieves the HDI from a country's Wikipedia page is included in the following method, but it is incomplete:

In [22]:
def get_country_HDI(html):
    soup = BeautifulSoup(html, ???)
    a_element = soup.find('a',text=???)
    th_element = a_element.parent
    td_element = th_element.findNext(???)
    HDI_value = td_element.find(???).findNextSibling(text=True)
    return HDI_value.strip()

# If you enter the missing code, this function will produce the value '0.901'
get_country_HDI(France_Wiki_HTML)

'0.901'

Now all we have to do to get the HDI of each country is to append each country's name to the base Wikipedia URL, and to feed the returned HTML into the 'get_country_HDI' method:

In [23]:
for rank, country in enumerate(top_tourist_locations, start=1):
    print("Country: "+country)
    print("Ranking: "+str(rank))
    # spaces in country names are replaced with '%20' so the URL is valid
    country_url = 'https://en.wikipedia.org/wiki/'+country.replace(' ','%20')
    print("HDI: "+get_country_HDI(get_HTML(country_url)))
    print('\n')

Country: France
Ranking: 1
HDI: 0.901


Country: Spain
Ranking: 2
HDI: 0.891


Country: United States
Ranking: 3
HDI: 0.924


Country: China
Ranking: 4
HDI: 0.752


Country: Italy
Ranking: 5
HDI: 0.880


Country: Mexico
Ranking: 6
HDI: 0.774


Country: United Kingdom
Ranking: 7
HDI: 0.922


Country: Turkey
Ranking: 8
HDI: 0.791


Country: Germany
Ranking: 9
HDI: 0.936


Country: Thailand
Ranking: 10
HDI: 0.755
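
The loop above swaps spaces for '%20' by hand. As an alternative sketch, `urllib.parse.quote` from the standard library percent-encodes spaces and other unsafe characters in one call. The helper name `wiki_url` is made up for this example.

```python
from urllib.parse import quote

WIKI_BASE = 'https://en.wikipedia.org/wiki/'

def wiki_url(country_name):
    """Build a Wikipedia article URL, percent-encoding characters such as spaces."""
    return WIKI_BASE + quote(country_name)

print(wiki_url('United States'))
print(wiki_url('France'))
```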




Comparing rankings and HDIs, what would you state in your recommendation?
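
One way to line the two measures up is to sort the destinations by HDI and keep the tourism rank alongside. The sketch below uses the values printed above; the variable names are made up for this example.

```python
# (country, tourism rank, HDI) taken from the output above
destinations = [
    ('France', 1, 0.901), ('Spain', 2, 0.891), ('United States', 3, 0.924),
    ('China', 4, 0.752), ('Italy', 5, 0.880), ('Mexico', 6, 0.774),
    ('United Kingdom', 7, 0.922), ('Turkey', 8, 0.791),
    ('Germany', 9, 0.936), ('Thailand', 10, 0.755),
]

# Highest HDI first
by_hdi = sorted(destinations, key=lambda row: row[2], reverse=True)

for country, rank, hdi in by_hdi:
    print(f"{country:15} HDI {hdi:.3f}  tourism rank {rank}")
```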

## APIs

### 1. Get the API key

[The Guardian](https://www.theguardian.com/au) is a quality **open** news outlet with an easy-to-use [open-platform API](https://open-platform.theguardian.com).

* Explore and experiment with the [platform here](https://open-platform.theguardian.com/explore/)
* Get your own [developer API key here](https://bonobo.capi.gutools.co.uk/register/developer)

For this example, I've saved my key at the beginning of a file called `guardian_key.txt` in the `data` folder. I load the key before anything else...

In [14]:
#load key
with open('data/guardian_key.txt', 'r') as file:
    key = file.read().strip()
len(key) # check key loaded by reading its length - don't want to display the actual key!!


### 2. Understand the structure of the API including end-points and query strings

The Guardian API is called through a single URL which needs to be composed according to your search requirements.

In [13]:
# build a search URL
baseUrl = 'https://content.guardianapis.com/search?q=' # content search

searchString = "submarine"
office = "&production-office=aus"
tag = "&tag=politics/politics"
fromDate = "&from-date=2021-09-01"

url = baseUrl+'"'+searchString+'"'+office+fromDate
print(url)

https://content.guardianapis.com/search?q="submarine"&production-office=aus&from-date=2021-09-01


Now that we have the URL that we want to use for our search, we add our API key to the end of it.

In [15]:
urlkey = url +'&api-key='+key # add my API key to the end of the URL
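
Joining the pieces by hand works, but characters such as spaces or quotes in the search term need to be encoded before they go into a URL. As a sketch of one alternative, `urllib.parse.urlencode` builds the query string from a dictionary. The key value below is a placeholder, and the variable names are made up for this example.

```python
from urllib.parse import urlencode

base_url = 'https://content.guardianapis.com/search'

# Same parameters as above; 'YOUR-KEY' is a placeholder, not a real key
params = {
    'q': '"submarine"',
    'production-office': 'aus',
    'from-date': '2021-09-01',
    'api-key': 'YOUR-KEY',
}

# urlencode percent-encodes each value and joins the pairs with '&'
encoded_url = base_url + '?' + urlencode(params)
print(encoded_url)
```

`requests.get` also accepts such a dictionary through its `params` argument and performs the encoding itself.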

### 3. Using Python to call the API

To use the API, we need the `requests` library to connect to it, and the `json` library to work with the JSON data that the API returns.

In [17]:
#import required libraries
import requests
import json

#call the API
response = requests.get(urlkey)

We're hoping for a `200` response from the server, which means that everything was OK. If you get a different status code, there was probably an issue with your URL or your API key.

In [18]:
response

<Response [200]>
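
The number in the response is an HTTP status code, available as `response.status_code`. As a small sketch, the standard library's `http.HTTPStatus` can turn a code into its standard phrase, which helps when the call does not return 200. The helper names are made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the status code with its standard HTTP phrase, e.g. '404 Not Found'."""
    return f"{code} {HTTPStatus(code).phrase}"

def is_ok(code):
    """Treat any 2xx code as success."""
    return 200 <= code < 300

for code in (200, 401, 404, 429):
    print(describe_status(code), "->", "ok" if is_ok(code) else "check the request")
```

With `requests`, calling `response.raise_for_status()` raises an exception for 4xx and 5xx codes, which is another way to stop early when a call fails.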

We can look at the content of the response, which should be in JSON format if your request was successful.

In [19]:
response.content

b'{"response":{"status":"ok","userTier":"developer","total":158,"startIndex":1,"pageSize":10,"currentPage":1,"pages":16,"orderBy":"relevance","results":[{"id":"world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal","type":"article","sectionId":"world","sectionName":"World news","webPublicationDate":"2021-11-08T16:30:35Z","webTitle":"Australia promises jobs to workers stranded by scrapping of French submarine deal","webUrl":"https://www.theguardian.com/world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal","apiUrl":"https://content.guardianapis.com/world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal","isHosted":false,"pillarId":"pillar/news","pillarName":"News"},{"id":"world/2021/oct/01/fears-australias-france-submarine-snub-could-scupper-closer-eu-economic-ties","type":"article","sectionId":"world","sectionName":"World news","webPublicationDate":"2

Load the JSON from the content into a variable (as a dictionary) that we can navigate in Python.

In [20]:
data = json.loads(response.content)
data

{'response': {'status': 'ok',
  'userTier': 'developer',
  'total': 158,
  'startIndex': 1,
  'pageSize': 10,
  'currentPage': 1,
  'pages': 16,
  'orderBy': 'relevance',
  'results': [{'id': 'world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal',
    'type': 'article',
    'sectionId': 'world',
    'sectionName': 'World news',
    'webPublicationDate': '2021-11-08T16:30:35Z',
    'webTitle': 'Australia promises jobs to workers stranded by scrapping of French submarine deal',
    'webUrl': 'https://www.theguardian.com/world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal',
    'apiUrl': 'https://content.guardianapis.com/world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal',
    'isHosted': False,
    'pillarId': 'pillar/news',
    'pillarName': 'News'},
   {'id': 'world/2021/oct/01/fears-australias-france-submarine-snub-could-scupper-closer-eu-econ

In [21]:
results = data['response']['results']
results

[{'id': 'world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal',
  'type': 'article',
  'sectionId': 'world',
  'sectionName': 'World news',
  'webPublicationDate': '2021-11-08T16:30:35Z',
  'webTitle': 'Australia promises jobs to workers stranded by scrapping of French submarine deal',
  'webUrl': 'https://www.theguardian.com/world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal',
  'apiUrl': 'https://content.guardianapis.com/world/2021/nov/09/australia-promises-jobs-to-workers-stranded-by-scrapping-of-french-submarine-deal',
  'isHosted': False,
  'pillarId': 'pillar/news',
  'pillarName': 'News'},
 {'id': 'world/2021/oct/01/fears-australias-france-submarine-snub-could-scupper-closer-eu-economic-ties',
  'type': 'article',
  'sectionId': 'world',
  'sectionName': 'World news',
  'webPublicationDate': '2021-10-01T09:00:58Z',
  'webTitle': 'Fears Australia’s France submarine snub could scupper c
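
The nested structure can be navigated key by key. The sketch below uses a cut-down copy of the response shown above (one result, a few fields) so that it runs without calling the API; the names `sample_content`, `sample_data` and `sample_response` are made up for this example.

```python
import json

# A cut-down copy of the API response above, as bytes like `response.content`
sample_content = (
    b'{"response":{"status":"ok","total":158,"pageSize":10,"pages":16,'
    b'"results":[{"type":"article","sectionName":"World news",'
    b'"webPublicationDate":"2021-11-08T16:30:35Z",'
    b'"webTitle":"Australia promises jobs to workers stranded by scrapping of French submarine deal"}]}}'
)

sample_data = json.loads(sample_content)  # json.loads accepts bytes as well as str
sample_response = sample_data['response']

print(sample_response['status'], sample_response['total'], sample_response['pages'])

# .get() returns a default instead of raising KeyError when a field is missing
for item in sample_response.get('results', []):
    print(item['webPublicationDate'][:10], item.get('webTitle', '(no title)'))
```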

Once we have our results, it could be helpful to get a list of the titles. Then we could do unstructured data analytics on the titles to see if they contain specific words or phrases of interest.

In [22]:
titles = []
for result in results:
    titles.append(result['webTitle'])

In [23]:
titles

['Australia promises jobs to workers stranded by scrapping of French submarine deal',
 'Fears Australia’s France submarine snub could scupper closer EU economic ties',
 'Australia tore up French submarine contract ‘for convenience’ Naval Group says',
 '‘Naughty guy’: top Chinese diplomat accuses Australia of ‘sabre wielding’ with nuclear submarine deal',
 'Australia’s foreign minister to meet French ambassador in bid to heal submarine rift ',
 'Macron’s anger over nuclear submarine deal linked to French election, Peter Dutton says',
 'Former US navy secretary now Scott Morrison’s Aukus middleman on submarine plan',
 'Malcolm Turnbull excoriates Scott Morrison over ‘appalling episode’ with French submarine deal',
 '‘We felt fooled’: France still furious after Australia scraps $90bn submarine deal',
 'Game-changer or irresponsible? The known unknowns on Australia’s nuclear submarine deal']
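
As a first step towards that kind of analysis, the sketch below checks which titles mention a given word, ignoring case. It uses three of the titles listed above; the function name and the keywords are made up for this example.

```python
sample_titles = [
    'Australia promises jobs to workers stranded by scrapping of French submarine deal',
    'Former US navy secretary now Scott Morrison’s Aukus middleman on submarine plan',
    'Game-changer or irresponsible? The known unknowns on Australia’s nuclear submarine deal',
]

def titles_mentioning(titles, keyword):
    """Return the titles that contain `keyword`, ignoring upper/lower case."""
    keyword = keyword.lower()
    return [title for title in titles if keyword in title.lower()]

for word in ('french', 'nuclear', 'aukus'):
    print(word, len(titles_mentioning(sample_titles, word)))
```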