# QUT Data Analytics taster 

**Andrew Gibson**

School of Information Systems, Faculty of Science

# 1 - Introductions

* 7 x Groups of 4/5 - at least 1 person who knows Python in each group
* Who am I?
* Who are you?
* What is this taster about?

### Agenda

1. Introductions, groups, getting logged in
2. **Activity 1 - Getting Connected**
3. Why open data?
4. **Activity 2 - Working with data**
5. **Activity 3 - Live data**
6. Wrap up
7. QUT Ambassador - Why study...

# 2 - Activity 1 :: Getting Connected

*In this activity, we're going to use our Jupyter notebook environments to connect to other computers on the internet and retrieve information from them. We will explore the difference between HTML (web pages), and JSON from web APIs (Application Programming Interfaces).*

In [None]:
# Load some Python packages (libraries) that we will use
import requests
import json
import pandas as pd
import plotly.express as px
from IPython.display import HTML

### When we visit a website...
[XKCD Comic](https://xkcd.com)

### When a computer visits a website...

We can program our computer to get information from a remote computer on the internet by sending it an `HTTP request`. The remote computer then replies to our local computer with an `HTTP response`.
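To make the idea of a request more concrete, the sketch below builds a `GET` request for the XKCD home page without actually sending it. It uses Python's built-in `urllib` module rather than the `requests` package used in this notebook, and the `User-Agent` value is a made-up example. It is only meant to illustrate what a request is made of: a method, a host and some optional headers.

```python
from urllib.request import Request

# Build (but do not send) a request for the XKCD home page.
# The User-Agent value is a made-up example, not something XKCD requires.
request = Request("http://xkcd.com", headers={"User-Agent": "taster-notebook"})

print("Method:", request.get_method())      # the HTTP verb
print("Host:", request.host)                # the remote computer we are asking
print("Headers:", request.header_items())   # extra information sent along
```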

In [None]:
# Send a Get request to server and display response

response = requests.get("http://xkcd.com")
???

In [None]:
# Retrieve the content from the response

response.???

### APIs - Application Programming Interfaces

`HTML` is designed for web browsers, which render the content into a visual form that is easy for a human to read. That form is not easy to work with for analysis, however. Many computers on the internet provide data services intended for other computers (rather than directly for humans), and they do this through an `Application Programming Interface` or `API`.

We can visit the XKCD API in a browser to see what it gives us:

[XKCD API - http://xkcd.com/info.0.json](http://xkcd.com/info.0.json)

In [None]:
# The URL for the API
xkcd_api_url = "???"

# Make a GET request to the url
comic_response = requests.get(???)
print("Response Status: ",comic_response.status_code)

# Load the resulting JSON content in a form that we can use
comic_content = json.loads(comic_response.content)
???

***NOTE: You can find info on status codes here: [Wikipedia - HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). You've probably seen the famous error code before: [404 - Not Found](https://jupyter.qutanalytics.io/show_me_an_error).***
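As a small aside, Python's standard library already knows the standard phrase for each status code. The sketch below looks up a few common codes; which codes to show is an arbitrary choice made for illustration.

```python
from http import HTTPStatus

# Look up the standard meaning of a few common status codes
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, "-", status.phrase)
```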

#### Do something with the data

The data returned by the API is in a structure that we can use...
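To see what "a structure that we can use" means, here is a minimal sketch using a made-up piece of JSON text (it is not the XKCD response). It shows how `json.loads` turns JSON objects into Python dictionaries and JSON arrays into Python lists.

```python
import json

# Made-up JSON text, shaped like something an API might send back
json_text = '{"name": "Example", "tags": ["a", "b"], "details": {"pages": 3}}'

data = json.loads(json_text)      # JSON object -> Python dict

print(type(data).__name__)        # the kind of Python object we got back
print(data["name"])               # look up a value by its key
print(data["tags"][0])            # JSON arrays become Python lists
print(data["details"]["pages"])   # nested objects become nested dicts
```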

In [None]:
# Find the title of the comic
comic_content[???]

In [None]:
# Get the value for the 'img' key
comic_img = comic_content[???]
comic_img

In [None]:
# Display the content by wrapping it in HTML img tag
HTML(f'<img src="{comic_img}"/>')

### Experiment with JSON from APIs

- First create a function that gets JSON from a URL and returns a Python list or dict
- Try your function on: `https://dog.ceo/api/breeds/list/all` - how many breeds of hound are there?

In [None]:
# Function
def getDogBreeds(url):
    response = requests.get(url)
    content = response.content
    return json.loads(content)

In [None]:
# Get the breeds
data = getDogBreeds(???)
print("Data from server:\n",???)

In [None]:
# Get just the info we need
breeds = data['message']

for breed in breeds:
    print(breed)

Try answering a question, such as:

>how many breeds of `terrier` are there, and what are they?

In [None]:
# Get the breeds
breed2find = ???
selected = breeds[???]

# Calculate the number of breeds and display a list of them
print("There are",len(selected),"breeds of",breed2find,":")
for s in selected:
    print("\t",s)

In [None]:
# Try with other breeds
???
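Beyond looking up a single breed, the same dictionary-of-lists shape supports questions across all breeds at once. The sketch below uses a small made-up sample (not the real API data) in the same shape as `breeds`, and finds which breed has the most sub-breeds.

```python
# Made-up sample in the same shape as the 'message' part of the dog API
# response: each breed maps to a list of its sub-breeds (possibly empty).
sample_breeds = {
    "beagle": [],
    "collie": ["border"],
    "spaniel": ["cocker", "irish", "welsh"],
}

# Count the sub-breeds for every breed
counts = {breed: len(subs) for breed, subs in sample_breeds.items()}

# Find the breed with the most sub-breeds
largest = max(counts, key=counts.get)

print(counts)
print("Most sub-breeds:", largest)
```

Swapping `sample_breeds` for the real `breeds` dictionary should answer the same question for the live data.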

### Try some other APIs

Try calling other APIs from these sites...
* [toddmotto public APIs](https://github.com/toddmotto/public-apis) *(Tip: start with APIs that don't need authentication)*
* [Any API](https://any-api.com)

# 3 - Why open data?

## Data Analytics Big Idea

The key big idea of Data Analytics is to: **Address organisational concerns through storytelling with information**

1. **CONCERN:** The organisation/business concern or problem understood in the context of the organisation and in relation to the stakeholders.

2. **DATA ANALYTICS:** Potential sources of information that exist inside or outside of the organisation, or which may be synthesised, in order to address a concern. Techniques, processes and tools that can be used to analyse the available data for the purpose of addressing a concern.

3. **MEANING:** Relationships, perspectives, narratives, and understandings that are supported by the data analytics in a way that is meaningful for stakeholders and holds efficacy in addressing a concern.

Traditionally, businesses have relied on data *inside* the organisation, but more and more data is becoming available *outside* of the organisation - external data. When data is made available for other individuals or organisations to use, it is known as **Open Data**. Sharing and using open data can result in new [opportunities](https://www2.deloitte.com/content/dam/Deloitte/uk/Documents/deloitte-analytics/open-data-driving-growth-ingenuity-and-innovation.pdf) for business.

**What do businesses have to gain by sharing data?**
    
* Expand customer base
* Customer loyalty via transparency - evidence of claims
* Knowledge base adds value to business
* Product improvement - communication - realtime

**What do businesses have to gain by using open data?**

* Identify new customers and markets
* Identify trends
* Gauge sentiment of customers
* Adjust operations more quickly in response to external events
* Competitive knowledge


### Increasingly data is unstructured or semi-structured

Humans can make meaning from data without necessarily having pre-defined structure. In fact we frequently use very ill-defined structures to organise and communicate our thinking. We are also adept at creating these kinds of structures as required, in the moment, rather than requiring the data be structured before we can make sense of it.

Computers are not so adept, so complex, in-the-moment sense-making tasks on unstructured data are often easy for humans but very challenging for computers.

<img src="https://static.boredpanda.com/blog/wp-content/uploads/2016/03/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack-5__700.jpg">

[Puppies or Food (boredpanda.com March 2016)](https://www.boredpanda.com/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack/)

# 4 - Activity 2 - Working with data

*In this activity, we're going to use our Jupyter notebook environments to extract meaningful information from semi-structured data (in JSON format). We will connect to an API to retrieve the JSON data, and then manipulate it to extract useful information.*

It's important to understand data in context. Who has collected it? Why? How is it being used? What limitations are there?

[Australia - state of the environment report](https://soe.dcceew.gov.au)

#### EXPLORE:

* Take a look at: `Biodiversity` > `graphs, maps & tables` > `Figure 1: Annual rate of naming, and the accumulation of new species of animals, plants, fungi and protists since 1753`
* Explore some data that was used in the report: [2021 SoE Biodiversity The annual rate of naming (blue line) and the time-course of accumulation (shading) of new species of animals, plants, fungi and protists since 1753](https://data.gov.au/data/dataset/2021-soe-bio-078)
* Retrieve the data yourself using the code below...

In [None]:
# Load the dataset's metadata from the data.gov.au API
soe_url = "https://data.gov.au/data/api/3/action/package_show?id=016a071b-e75d-4eda-9632-661376ed8daf"
content = requests.get(???)
soe_data = json.loads(content.???)

In [None]:
# view the result
???

In [None]:
# Since the result is a dictionary, we can get the value for one particular key
soe_data["result"]["notes"]
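Responses like this one are dictionaries nested inside dictionaries, sometimes with lists in between. The sketch below walks through a made-up, much-shortened sample in the same general shape (a `result` dictionary holding a list of `resources`). The values and the `format` key are placeholders chosen for illustration, not the real data.

```python
# Made-up, much-shortened sample: a dictionary containing a "result"
# dictionary, which contains a list of "resources".
# All values here are placeholders, not real data.
sample = {
    "success": True,
    "result": {
        "notes": "A short description of the dataset.",
        "resources": [
            {"name": "First file", "format": "CSV"},
            {"name": "Second file", "format": "XLSX"},
        ],
    },
}

# Step down through the structure one key (or list position) at a time
sample_result = sample["result"]
sample_resources = sample_result["resources"]

print("Top-level keys:", list(sample.keys()))
print("Number of resources:", len(sample_resources))
for number, item in enumerate(sample_resources):
    print(number, item["name"], "-", item["format"])
```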

Display information from the data

In [None]:
# Get the name and URL for the data

resource = soe_data["result"]["resources"][0]
print("Name:",resource["name"])
print("URL:",resource[???])
print()
resource

Load information into a `Dataframe`

In [None]:
# Load the data from the URL into a dataframe

df = pd.read_csv(resource['original_url'],index_col='Year')
df

Save information locally (in your Jupyter environment)

In [None]:
df.to_csv("data.csv")

Visualise the data

In [None]:
# Create a chart for the dataframe

ax = df['Accumulation'].plot()

In [None]:
# Try a more advanced chart with more options, using Plotly Express

fig = px.line(df['Accumulation'])
fig.show()

In [None]:
# Try adding a title to the chart

fig = px.line(df['Accumulation'],title=???,labels={"value":???})
fig.show()

# 5 - Activity 3 - Real-time Data

*In this activity, we're going to use our Jupyter notebook environments to tap into data that is made available in real-time. We will connect to an API to retrieve the data in JSON format, and then explore it to identify how it might be useful.*

[The Guardian](https://www.theguardian.com/au) is an online news site that provides access to its stories via an API called [The Guardian Open Platform](https://open-platform.theguardian.com)

A function can help us search The Guardian without repeating code.

In [None]:
# Create a function to search The Guardian

def searchTheGuardian(search_string,office="AUS",from_date="2022-09-26"):
    base_url = 'https://content.guardianapis.com/search?q='
    key = "test"
    url = base_url+'"'+search_string+'"'+'&production-office='+office+'&from-date='+from_date+'&api-key='+key
    response = requests.get(url)
    return json.loads(response.content)
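Joining the URL together with `+` works for simple words, but search strings containing spaces or characters such as `&` can produce a broken URL. One possible alternative, sketched below with the standard library's `urlencode`, encodes each parameter safely. `build_search_url` is a hypothetical helper written for illustration, and it only builds the URL without contacting The Guardian.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_search_url(search_string, office="AUS", from_date="2022-09-26", key="test"):
    # Hypothetical helper: same parameters as searchTheGuardian, encoded safely
    base_url = "https://content.guardianapis.com/search"
    params = {
        "q": '"' + search_string + '"',
        "production-office": office,
        "from-date": from_date,
        "api-key": key,
    }
    return base_url + "?" + urlencode(params)

url = build_search_url("climate change & energy")
print(url)

# Decode the query again to check that nothing was lost
decoded_query = parse_qs(urlsplit(url).query)
print(decoded_query["q"])
```

The `requests` package can do the same encoding for you if the parameters are passed as a dictionary, as in `requests.get(base_url, params=params)`.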

In [None]:
# Try the function searching for Ukraine 

response_content = searchTheGuardian("Ukraine")
response_content

In [None]:
# Obtain the results from within the response

search_results = response_content['response']['results']
search_results

In [None]:
# Get the date, title and url for each result

for result in search_results:
    print("Date: ",result['webPublicationDate'])
    print("Title: ",result['webTitle'])
    print("Url: ",result['webUrl'])
    print()
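One simple way to start exploring results like these is to count how many stories were published on each day. The sketch below uses made-up results with the same keys as above, and assumes the publication timestamps begin with the date in `YYYY-MM-DD` form.

```python
from collections import Counter

# Made-up results with the same keys as the search results above
sample_results = [
    {"webPublicationDate": "2022-09-27T05:00:10Z", "webTitle": "Story A"},
    {"webPublicationDate": "2022-09-27T21:15:00Z", "webTitle": "Story B"},
    {"webPublicationDate": "2022-09-28T08:30:45Z", "webTitle": "Story C"},
]

# Assumption: the first 10 characters of each timestamp are the date
stories_per_day = Counter(r["webPublicationDate"][:10] for r in sample_results)

for day, count in sorted(stories_per_day.items()):
    print(day, count)
```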

In [None]:
# Using the API URL, we can retrieve the main text of the story

# Get the first story
page = requests.get(search_results[0]['apiUrl']+'?show-fields=bodyText&api-key=test')
json.loads(page.content)['response']['content']['fields']['bodyText']

Building on the code above, we can write a function that performs all of these steps in a single call.

In [None]:
# Create a function to get the data we need

def get_stories(search_results):
    stories = []
    for result in search_results:
        story = {}
        story["date"] = result['webPublicationDate']
        story["title"] = result['webTitle']
        story_url = result['webUrl']
        
        # Fetch the text of this particular story (not always the first one)
        page = requests.get(result['apiUrl']+'?show-fields=bodyText&api-key=test')
        story["text"] = json.loads(page.content)['response']['content']['fields']['bodyText']
        stories.append(story)
    return stories


In [None]:
# Test our new function with previous search_results

ukraine_stories = get_stories(search_results)
ukraine_stories

We can also create a function to help visualise the information

In [None]:
# Create a function to create HTML with stories for visualisation.

def format_stories(stories,heading):
    html = "<h1>{}</h1>".format(heading)
    for story in stories:
        html += "<div>"
        html += "<h3>{}</h3>".format(story['title'])
        html += "<p><i>{}</i></p>".format(story['date'])
        html += "<p>{}</p>".format(story['text'])
        html += "</div>"
    html += "<hr/>"
    return html
        

In [None]:
# Test the function

ukraine_html = format_stories(ukraine_stories,"Ukraine Stories")
ukraine_html

In [None]:
# Display the HTML

HTML(ukraine_html)

#### Explore

Now that we have functions, we can explore different searches by re-using them.

In [None]:
# We can try a new search...

search = "referendum"

referendum_results = searchTheGuardian(search)['response']['results']
referendum_stories = get_stories(referendum_results)

HTML(format_stories(referendum_stories,search))


In [None]:
# Annotate the text for a keyword

def format_keyword(stories,heading,keyword):
    html = "<h1>{}</h1>".format(heading)
    for story in stories:
        text = story['text']
        annotated_text = text.replace(keyword,'<span class="keyword" style="background-color:yellow;">'+keyword+'</span>')
        html += "<div>"
        html += "<h3>{}</h3>".format(story['title'])
        html += "<p><i>{}</i></p>".format(story['date'])
        html += "<p>{}</p>".format(annotated_text)
        html += "</div>"
    html += "<hr/>"
    return html

In [None]:
format_keyword(referendum_stories,"Referendum Stories","voice")

In [None]:
HTML(format_keyword(referendum_stories,"Referendum Stories","constitution"))
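Note that `text.replace` matches the keyword exactly, so a search for `voice` will not highlight `Voice`. One possible variation, sketched below with the standard library's `re` module, ignores case while keeping the original capitalisation in the output. `highlight` is a hypothetical helper and the sentence is made up for illustration.

```python
import re

def highlight(text, keyword):
    # Hypothetical helper: wrap every match of keyword, ignoring case,
    # in the same yellow span used by format_keyword
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return pattern.sub(
        lambda match: '<span class="keyword" style="background-color:yellow;">'
        + match.group(0) + '</span>',
        text,
    )

highlighted = highlight("The Voice to Parliament: a voice for change", "voice")
print(highlighted)
```

Inside `format_keyword`, the line that builds `annotated_text` could call a helper like this instead of `text.replace`.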

#### Experiment

Try your own searches. You might also try modifying the functions to:

- change the search parameters
- modify the display of the results