<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

## External Concerns & Unstructured data

[CoreSignal: External Data and Its Integration to Business Strategy](https://coresignal.com/blog/external-data/)

> Organizations that use external data effectively have more potential to place themselves ahead of their competition when it comes to strategic planning.

- Open data
- Paid data
- Shared data
- Web data

[McKinsey: Harnessing the power of external data](https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/harnessing-the-power-of-external-data)

> The COVID-19 crisis provides an example of just how relevant external data can be. In a few short months, consumer purchasing habits, activities, and digital behavior changed dramatically, making preexisting consumer research, forecasts, and predictive models obsolete. Moreover, as organizations scrambled to understand these changing patterns, they discovered little of use in their internal data. Meanwhile, a wealth of external data could—and still can—help organizations plan and respond at a granular level.

- Customer Analytics
- Strategic Analysis
- Operations and Forecasting
- Risk Management

### Unstructured data

Humans can make meaning from data without necessarily having pre-defined structure. In fact we frequently use very ill-defined structures to organise and communicate our thinking. We are also adept at creating these kinds of structures as required, in the moment, rather than requiring the data be structured before we can make sense of it.

<p><a href="https://commons.wikimedia.org/wiki/File:Coggle_Document.png#/media/File:Coggle_Document.png"><img src="https://upload.wikimedia.org/wikipedia/commons/1/19/Coggle_Document.png" alt="Coggle Document.png"></a><br>By <a href="https://en.wikipedia.org/wiki/User:Lurched95" class="extiw" title="en:User:Lurched95">User:Lurched95</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=33923406">Link</a></p>


Computers are not so adept, so complex, in-the-moment sense-making tasks on unstructured data are often easy for humans but very challenging for computers.

<img src="https://static.boredpanda.com/blog/wp-content/uploads/2016/03/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack-5__700.jpg">

[Puppies or Food (boredpanda.com March 2016)](https://www.boredpanda.com/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack/)

### Kinds of structuring of data

In order for us to perform data analysis on unstructured data, we will usually need to do some structuring of it, and this frequently results in semi-structured data. 

The 3 different kinds of structuring can be summarised as:

* **Structured** $\Rightarrow$ when the structure is pre-defined
* **Structured** $\Rightarrow$ is almost synonymous with 'stored in an RDBMS' (relational database management system), but can also exist in other software
* **Unstructured** $\leadsto$ when there is no pre-defined structure, or when the data can't easily be conformed to a structure
* **Unstructured** $\leadsto$ commonly raw text, but also images, video, audio
* **Unstructured** $\leadsto$ can appear to have some kind of structure, but often that appearance is derived from our understanding, not from the data itself
* **Semi-structured** $\rightarrow$ the data can be stored in a defined structure, but the actual instance of the structure is not predefined
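
As a rough illustration of the three kinds, the sketch below holds similar information in each form. The records, field names and values are invented for illustration and are not taken from the course data.

```python
# Structured: every row has the same pre-defined columns (id, product, price)
structured_rows = [
    ("A101", "Widget", 19.95),
    ("A102", "Gadget", 4.50),
]

# Unstructured: free text, where any structure comes from the reader
unstructured_text = "The widget costs about twenty dollars and the gadget is much cheaper."

# Semi-structured: labelled fields, but each record can have a different shape
semi_structured = [
    {"id": "A101", "product": "Widget", "price": 19.95, "tags": ["popular"]},
    {"id": "A102", "product": "Gadget"},
]

for record in semi_structured:
    print(sorted(record.keys()))
```

Notice that the second semi-structured record is still valid even though it has fewer fields than the first.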

#### Recap

Last week, we read structured data in the form of a CSV file from a URL, and saved the resulting CSV (a plain text file formatted as comma separated values). 

This week, we will use the saved file. So if you missed this step last week, make sure you run the following code.

In [None]:
# Load a CSV from remote URL and save a local file
import os
import pandas as pd
extinct_mammals_url = "https://data.gov.au/dataset/c02731e8-5327-4720-bbc7-1fe67350a569/resource/8339c2b4-c763-4c50-a647-63935537453c/download/cumulative-number-of-extinct-mammal-species.csv"
exmam_df = pd.read_csv(extinct_mammals_url)
file_name = "extinct_aus_mammals.csv"
path = "data"
# Make sure the target directory exists before writing
os.makedirs(path, exist_ok=True)
exmam_df.to_csv(f"{path}/{file_name}", index=False)

### Semi-structured data

Semi-structured data is a lot more prevalent than structured data, but the computational tools for working with it are not as mature as those for structured data. Most semi-structured data tools have come about with the advent of the internet and then social media.

### Working with semi-structured data

We will work with semi-structured data mostly by:
1. creating it from plain text which is read from a file, or
2. importing the data from a `JSON` file.

*JSON* (JavaScript Object Notation) is a way of labelling data that does not require every item to have the same shape, or the structure to be fixed in advance.


#### Reading plain text files

In [None]:
# Read in a plain text file
with open(f"{path}/{file_name}", 'r') as fp:
    exmam_text = fp.???

# Print the string that was read from the file
print(???)

In [None]:
# What does the actual string look like (not formatted)
exmam_text

In [None]:
# We can read the text in a semi-structured format by taking advantage of the lines in the file
with open(f"{path}/{file_name}", 'r') as fp:
    exmam_lines = fp.???

print(???)

In [None]:
# Show the list that was read from the file
???

In each of these examples, notice that we have `\n` newline characters in the data. This is because Python is keeping all of the data from the original file including the characters that specify the end of a line of text.

A way to import the data *without* this character is to read the text as a single string and then split it into lines using the string `.split()` method.
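
The difference between keeping and dropping the newline characters can be sketched on a small in-memory example. The two-line sample text is made up, and `io.StringIO` stands in for a real file. The sketch also shows `.splitlines()`, a related string method that is an alternative to `.split()` rather than the approach used in this notebook.

```python
import io

# Hypothetical file content, held in memory instead of on disk
sample_text = "year,count\n1900,5\n"
fp = io.StringIO(sample_text)

with_newlines = fp.readlines()             # each item keeps its trailing '\n'
split_on_newline = sample_text.split("\n") # '\n' removed, trailing empty string left
split_lines = sample_text.splitlines()     # line endings removed, no empty string

print(with_newlines)
print(split_on_newline)
print(split_lines)
```

Splitting on `'\n'` leaves an empty string at the end when the text finishes with a newline, which is one source of unclean data.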

In [None]:
# We can also create the list, by splitting the original string
lines = exmam_text.split(???)

# view the list
???

Now each line of the file is an element in the list and the `\n` characters have been removed. 

However, because each line includes 2 data points, we can split each line. This creates a list of lists which gives us a structure similar to a dataframe.
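
A minimal sketch of the list-of-lists idea, using made-up lines rather than the extinct mammals file, and assuming the values are separated by commas:

```python
# Hypothetical lines, as they might look after splitting a CSV string
sample_lines = ["year,count", "1900,5", "1950,12"]

# Split every line into its data points to get a list of lists
table = [line.split(",") for line in sample_lines]

# The first inner list plays the role of column names, the rest are rows
header, rows = table[0], table[1:]

print(header)
print(rows)
```

Every value is still a string at this point, including the numbers.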

In [None]:
# Split the original string into lines
lines = exmam_text.split('\n')

# Split each line into its data points and view the result
for line in lines:
    data_points = line.split(???)
    print(data_points)

Notice that the data is not yet clean. Think about ways that we might be able to fix this.
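
One possible approach is sketched below, assuming the problems are blank lines, stray whitespace, and numbers stored as text. The sample lines are invented, so the real file may need different fixes.

```python
# Hypothetical unclean lines: extra spaces and an empty last line
raw_lines = ["year,count", "1900, 5", "1950,12 ", ""]

cleaned = []
for line in raw_lines[1:]:          # skip the header line
    if not line.strip():            # ignore blank lines
        continue
    year, count = [part.strip() for part in line.split(",")]
    cleaned.append([int(year), int(count)])  # convert text to numbers

print(cleaned)
```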

#### Reading JSON

JSON is a very common file format for semi-structured data. To read this format we open the file as before, but we use the `json` library to help load the data into a Python dictionary or `dict` structure.
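
To see how JSON maps onto Python types, the sketch below parses a small JSON string with `json.loads()`, which works on a string rather than a file. The JSON text is a made-up example, not the contents of the course file.

```python
import json

# Hypothetical JSON text with an object, a list, a number, a boolean and a null
json_text = '{"name": "example", "count": 3, "tags": ["a", "b"], "active": true, "owner": null}'

data = json.loads(json_text)

# Show the Python type that each JSON value became
for key, value in data.items():
    print(key, type(value).__name__, value)
```

JSON objects become `dict`, arrays become `list`, `true` becomes `True` and `null` becomes `None`.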

In [None]:
# We need the JSON library
import json

In [None]:
# Read a JSON file like text, but with conversion to python dictionary
json_file_name = "simple_json_file.json"
path = "data"

with open(f"{path}/{json_file_name}", ???) as file:
    json_data = json.load(file)

# print the loaded data
print(???)

In [None]:
# View the json data
???

The advantage of a `dict` in Python is that you can access a `value` by looking it up with its `key`. These are called *key-value pairs* and are fundamental to a dictionary structure.
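
A short sketch of key-value lookup on a made-up dictionary. The keys here are hypothetical and are not the ones in the course JSON file.

```python
# Hypothetical record
record = {"title": "Report", "pages": 12}

# Look up a value directly by its key
title = record["title"]

# .get() returns a default instead of raising KeyError when the key is missing
author = record.get("author", "unknown")

# 'in' checks whether a key exists
has_pages = "pages" in record

print(title, author, has_pages)
```

Because semi-structured records do not always share the same keys, `.get()` is often a safer way to read optional fields.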

In [None]:
# Access values in the dict by calling the keys
json_data['Key 1']

In [None]:
# Get a list of keys for a dict
json_data.keys()

In [None]:
# Iterate over the keys in a dict

for key in json_data.???:
    print("key:",???)
    value = json_data[???]
    print("value:",???)
    print()

JSON data can include dictionary structures and list structures, and they can be nested. To see this in action, we can load JSON data from a URL.

To get data from a URL we use the `requests` library. This works like your web browser by sending a `get` *request* to a web server, and then processing the response (instead of rendering it in a browser).
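
Before requesting real data, nesting can be sketched offline with a small JSON string. The content below is hypothetical and only loosely imitates the shape of an API response.

```python
import json

# Hypothetical nested JSON: a dict containing a dict containing a list of dicts
nested_text = """
{
  "success": true,
  "result": {
    "title": "Example dataset",
    "resources": [
      {"format": "CSV", "name": "example.csv"},
      {"format": "JSON", "name": "example.json"}
    ]
  }
}
"""

nested = json.loads(nested_text)

# Chain keys (for dicts) and indexes (for lists) to walk down the levels
first_resource = nested["result"]["resources"][0]
formats = [resource["format"] for resource in nested["result"]["resources"]]

print(first_resource["name"])
print(formats)
```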

In [None]:
# You can also load json data from a URL
import requests

# JSON data about the CSV on extinct mammals from the same website above
mammal_url = "https://data.gov.au/api/3/action/package_show?id=c02731e8-5327-4720-bbc7-1fe67350a569"

# Request the content from the web server with a .get() request
response = requests.get(???)

response.content

In [None]:
# Get the data as json from the response

mammal_json = response.json()

mammal_json

Since we know this is json data, we can use the structure to navigate the data and find what we are interested in.

In [None]:
# Take a look at the keys
mammal_json.keys()

In [None]:
# What about the keys down a level?
mammal_json[???].keys()

In [None]:
# Digging deeper
mammal_json['result']['resources']

This is a list of dicts - let's get the first dict in the list (item 0) and explore further

In [None]:
# Only one item in the list - get it by accessing the first item 0
mammal_json['result']['resources'][???]

We can save this dictionary-formatted data as *JSON* by using the `dumps()` function of the `json` library, which converts the dictionary into a JSON string.
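
A minimal round-trip sketch on a made-up dictionary: `json.dumps()` turns a `dict` into a string, and `json.loads()` turns that string back into a `dict`. The `indent` argument is optional and only affects readability.

```python
import json

# Hypothetical record
record = {"name": "example", "tags": ["a", "b"]}

compact = json.dumps(record)           # single-line JSON string
pretty = json.dumps(record, indent=2)  # same data, easier to read

print(compact)
print(pretty)

# Loading the string gives back an equal dictionary
restored = json.loads(pretty)
print(restored == record)
```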

In [None]:
# Dump the dict into a json string
metadata = json.dumps(???)
metadata

In [None]:
# Write the json string to a file
file_name = "extinct_mammals_metadata.json"
with open(f"{path}/{file_name}",'w') as fp:
    fp.write(???)

Open the file that you just created to check that it has been written correctly.

In [None]:
# read the file back in
with open(f"{path}/{file_name}",'r') as fp:
    text = fp.read()
    file_json = json.loads(text)

file_json

Explore the JSON structure starting with the keys

In [None]:
# What keys are available in the first item in the list of resources?
mammal_json['result']['resources'][0].???

In [None]:
# Take a look at the description
mammal_json['result']['resources'][0][???]

In [None]:
# Format the description as a list
mammal_json['result']['resources'][0]['description'].split(???)

In [None]:
# Get the first item in the list
mammal_json['result']['resources'][0]['description'].split('\r\n')[???]

In [None]:
# since the result is a dictionary, we can get the value for one particular key
mammal_json["result"]["notes"]

In [None]:
# we can take this data and structure it further

notes = mammal_json["result"]["notes"]
struct_notes = notes.split(???)
for note in struct_notes:
    print(???)

### Visualise

We can use HTML to visualise text.
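
The sketch below shows the general pattern of building an HTML string from a list of text items. The title and notes are made up, and `html.escape()` is an extra precaution (not part of the exercise) so that characters like `<` in the text are shown rather than treated as markup.

```python
from html import escape

# Hypothetical title and notes
title = "Notes"
notes = ["First note", "Second note with <angle brackets>"]

# Wrap the title in a heading tag and each note in a paragraph tag
heading = f"<h3>{escape(title)}</h3>"
content = "".join(f"<p>{escape(note)}</p>" for note in notes)

page = heading + content
print(page)
```

In a notebook, passing `page` to `display(HTML(...))` would render it as formatted text.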

In [None]:
from IPython.display import display, HTML

heading = f"<h3>???</h3>"

content = ""
for note in struct_notes:
    content += f"<p>???</p>"

display(HTML(???+???))

### Explore further

Try experimenting with exploring the dict format to find interesting parts of the data. 

You might also like to try saving the extracted data as a file, and creating a new dataframe with the structured data.
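
As a starting point, the sketch below writes a list of lists as CSV text using the standard `csv` module. The rows are invented, and an in-memory buffer is used in place of a real file so that nothing is written to disk.

```python
import csv
import io

# Hypothetical structured data: a header row followed by data rows
rows = [["year", "count"], [1900, 5], [1950, 12]]

# Write the rows as CSV into an in-memory text buffer
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back into a list of lists (all values come back as strings)
read_back = list(csv.reader(io.StringIO(csv_text)))
print(read_back)
```

To write a real file, the buffer could be replaced with `open(..., 'w', newline='')`. With pandas, something like `pd.DataFrame(rows[1:], columns=rows[0])` would turn the same rows into a dataframe.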

In [None]:
# Your code here
???

### Accessing the data via The Guardian API

News publishers are a useful source of external data. **The Guardian** provides an Application Programming Interface (API) which allows us to search and retrieve news articles.

See the `Accessing_the_Guardian_API.ipynb` notebook file for details on obtaining data from the API. 