<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

## External Concerns & Unstructured data

[CoreSignal: External Data and Its Integration to Business Strategy](https://coresignal.com/blog/external-data/)

> Organizations that use external data effectively have more potential to place themselves ahead of their competition when it comes to strategic planning.

- Open data
- Paid data
- Shared data
- Web data

[McKinsey: Harnessing the power of external data](https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/harnessing-the-power-of-external-data)

> The COVID-19 crisis provides an example of just how relevant external data can be. In a few short months, consumer purchasing habits, activities, and digital behavior changed dramatically, making preexisting consumer research, forecasts, and predictive models obsolete. Moreover, as organizations scrambled to understand these changing patterns, they discovered little of use in their internal data. Meanwhile, a wealth of external data could—and still can—help organizations plan and respond at a granular level.

- Customer Analytics
- Strategic Analysis
- Operations and Forecasting
- Risk Management

### Unstructured data

Humans can make meaning from data without necessarily having pre-defined structure. In fact we frequently use very ill-defined structures to organise and communicate our thinking. We are also adept at creating these kinds of structures as required, in the moment, rather than requiring the data be structured before we can make sense of it.

<p><a href="https://commons.wikimedia.org/wiki/File:Coggle_Document.png#/media/File:Coggle_Document.png"><img src="https://upload.wikimedia.org/wikipedia/commons/1/19/Coggle_Document.png" alt="Coggle Document.png"></a><br>By <a href="https://en.wikipedia.org/wiki/User:Lurched95" class="extiw" title="en:User:Lurched95">User:Lurched95</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=33923406">Link</a></p>


Computers are not so adept. Complex, in-the-moment sense-making tasks on unstructured data are often easy for humans but very challenging for computers.

<img src="https://static.boredpanda.com/blog/wp-content/uploads/2016/03/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack-5__700.jpg">

[Puppies or Food (boredpanda.com March 2016)](https://www.boredpanda.com/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack/)

### Kinds of structuring of data

In order for us to perform data analysis on unstructured data, we will usually need to do some structuring of it, and this frequently results in semi-structured data. 

The 3 different kinds of structuring can be summarised as:

* **Structured** $\Rightarrow$ when the structure is pre-defined
* **Structured** $\Rightarrow$ is almost synonymous with 'stored in an RDBMS', but structured data can also exist in other software
* **Unstructured** $\leadsto$ when there is no pre-defined structure, or the data can't easily be conformed to a structure
* **Unstructured** $\leadsto$ commonly raw text, but also images, video, audio
* **Unstructured** $\leadsto$ can appear to have some kind of structure, but often that appearance is derived from our understanding, not from the data itself
* **Semi-structured** $\rightarrow$ the data can be stored in a defined structure, but the actual instance of the structure is not predefined
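
As a rough illustration of these three kinds, the sketch below holds similar information in each form. All of the records are invented for this example and do not come from any dataset used in this notebook.

```python
# Hypothetical example data, invented purely for illustration.

# Structured: every row follows the same pre-defined columns.
columns = ("name", "year", "status")
structured = [
    ("Species A", 1850, "extinct"),
    ("Species B", 1900, "extinct"),
]

# Semi-structured: each record is labelled, but records need not share the same fields.
semi_structured = [
    {"name": "Species A", "year": 1850},
    {"name": "Species B", "status": "extinct", "notes": ["date uncertain"]},
]

# Unstructured: raw text; any structure comes from the reader's understanding.
unstructured = "Species A was last seen around 1850. Species B is thought to be extinct."

# Every structured row has the same length as the column definition.
print(all(len(row) == len(columns) for row in structured))

# The semi-structured records have different sets of keys.
print([sorted(record.keys()) for record in semi_structured])
```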

#### Recap

Last week, we read structured data in the form of a CSV file from a URL, and saved the resulting CSV (a plain text file formatted as comma separated values). 

This week, we will use the saved file. So if you missed this step last week, make sure you run the following code.

In [4]:
# Load a CSV from a remote URL and save it as a local file
import os
import pandas as pd
extinct_mammals_url = "https://data.gov.au/dataset/c02731e8-5327-4720-bbc7-1fe67350a569/resource/8339c2b4-c763-4c50-a647-63935537453c/download/cumulative-number-of-extinct-mammal-species.csv"
exmam_df = pd.read_csv(extinct_mammals_url)
file_name = "extinct_aus_mammals.csv"
path = "data"
os.makedirs(path, exist_ok=True)  # create the data folder if it does not already exist
exmam_df.to_csv(f"{path}/{file_name}", index=False)


### Semi-structured data

Semi-structured data is a lot more prevalent than structured data, but the computational tools are not as mature as structured data tools. Most semi-structured data tools have come about with the advent of the internet and then social media.

### Working with semi-structured data

We will work with semi-structured data mostly by:
1. creating it from plain text which is read from a file, or
2. importing the data from a `JSON` file.

*JSON* (JavaScript Object Notation) is a way of labelling data without requiring every item to have the same fields and without requiring the structure to be fixed in advance.
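
To make this concrete, here is a minimal sketch using a small JSON string that was made up for this example. The two records carry different labels, and one of them nests a further structure, yet both load without any schema being declared.

```python
import json

# A hypothetical JSON document, invented for illustration.
json_text = """
[
  {"title": "First item", "tags": ["a", "b"]},
  {"title": "Second item", "author": {"name": "Example Author"}}
]
"""

# json.loads() converts a JSON string into Python lists and dicts.
records = json.loads(json_text)

for record in records:
    # Each record is a dict; its keys act as labels for the values.
    print(sorted(record.keys()))
```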


#### Reading plain text files

In [5]:
# Read in a plain text file
file_name="extinct_aus_mammals.csv"
path = "data"
with open(f"{path}/{file_name}", 'r') as fp:
    exmam_text = fp.read()

# Print the string that was read from the file
print(exmam_text)

Decade,Cumulative number of extinct mammal species
1790,0
1800,0
1810,0
1820,0
1830,0
1840,1
1850,3
1860,5
1870,7
1880,8
1890,9
1900,14
1910,14
1920,15
1930,18
1940,18
1950,22
1960,24
1970,25
1980,25
1990,25
2000,26
2010,27



In [6]:
# What does the actual string look like (not formatted)
exmam_text

'Decade,Cumulative number of extinct mammal species\n1790,0\n1800,0\n1810,0\n1820,0\n1830,0\n1840,1\n1850,3\n1860,5\n1870,7\n1880,8\n1890,9\n1900,14\n1910,14\n1920,15\n1930,18\n1940,18\n1950,22\n1960,24\n1970,25\n1980,25\n1990,25\n2000,26\n2010,27\n'

In [7]:
# We can read the text in a semi-structured format by taking advantage of the lines in the file
with open(f"{path}/{file_name}", 'r') as fp:
    exmam_lines = fp.readlines()

print(exmam_lines)

['Decade,Cumulative number of extinct mammal species\n', '1790,0\n', '1800,0\n', '1810,0\n', '1820,0\n', '1830,0\n', '1840,1\n', '1850,3\n', '1860,5\n', '1870,7\n', '1880,8\n', '1890,9\n', '1900,14\n', '1910,14\n', '1920,15\n', '1930,18\n', '1940,18\n', '1950,22\n', '1960,24\n', '1970,25\n', '1980,25\n', '1990,25\n', '2000,26\n', '2010,27\n']


In [8]:
# Show the list that was read from the file
exmam_lines

['Decade,Cumulative number of extinct mammal species\n',
 '1790,0\n',
 '1800,0\n',
 '1810,0\n',
 '1820,0\n',
 '1830,0\n',
 '1840,1\n',
 '1850,3\n',
 '1860,5\n',
 '1870,7\n',
 '1880,8\n',
 '1890,9\n',
 '1900,14\n',
 '1910,14\n',
 '1920,15\n',
 '1930,18\n',
 '1940,18\n',
 '1950,22\n',
 '1960,24\n',
 '1970,25\n',
 '1980,25\n',
 '1990,25\n',
 '2000,26\n',
 '2010,27\n']

In each of these examples, notice that we have `\n` newline characters in the data. This is because Python is keeping all of the data from the original file including the characters that specify the end of a line of text.

One way to import the data *without* this character is to read the text as a single string and then split it into lines with the string `.split()` method.

In [9]:
# We can also create the list, by splitting the original string
lines = exmam_text.split('\n')

# view the list
lines

['Decade,Cumulative number of extinct mammal species',
 '1790,0',
 '1800,0',
 '1810,0',
 '1820,0',
 '1830,0',
 '1840,1',
 '1850,3',
 '1860,5',
 '1870,7',
 '1880,8',
 '1890,9',
 '1900,14',
 '1910,14',
 '1920,15',
 '1930,18',
 '1940,18',
 '1950,22',
 '1960,24',
 '1970,25',
 '1980,25',
 '1990,25',
 '2000,26',
 '2010,27',
 '']

Now each line of the file is an element in the list and the `\n` characters have been removed. Note the empty string at the end of the list, which comes from the newline that ends the file.

Because each line includes 2 data points, we can also split each line on the comma. Splitting every line gives a list of lists, a structure similar to a dataframe.

In [10]:
# Split each line and collect the results to create a list of lists
rows = []
for line in lines:
    data_points = line.split(',')
    rows.append(data_points)
    print(data_points)

['Decade', 'Cumulative number of extinct mammal species']
['1790', '0']
['1800', '0']
['1810', '0']
['1820', '0']
['1830', '0']
['1840', '1']
['1850', '3']
['1860', '5']
['1870', '7']
['1880', '8']
['1890', '9']
['1900', '14']
['1910', '14']
['1920', '15']
['1930', '18']
['1940', '18']
['1950', '22']
['1960', '24']
['1970', '25']
['1980', '25']
['1990', '25']
['2000', '26']
['2010', '27']
['']


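Notice that the data is not yet clean: every value is a string, the header is mixed in with the data rows, and the last row is an empty string. Think about ways that we might be able to fix this.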
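
One possible approach is sketched below; it is not the only way to do it. It works on a shortened copy of the file contents shown above, and the names `sample_text`, `header` and `clean_rows` are introduced only for this sketch.

```python
# The first few lines of the file contents shown above, as a single string.
sample_text = "Decade,Cumulative number of extinct mammal species\n1790,0\n1840,1\n1850,3\n"

# splitlines() drops the line endings and does not leave a trailing empty string.
sample_lines = sample_text.splitlines()

# Separate the header row from the data rows.
header = sample_lines[0].split(',')

clean_rows = []
for line in sample_lines[1:]:
    if not line.strip():
        continue  # skip any blank lines
    decade, count = line.split(',')
    # Convert the text values to integers so they can be used in calculations.
    clean_rows.append([int(decade), int(count)])

print(header)
print(clean_rows)
```

A list of lists like `clean_rows`, together with `header`, is close to what a dataframe needs as its rows and column names.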

#### Reading JSON

JSON is a very common file format for semi-structured data. To read this format we open the file as before, but we use the `json` library to help load the data into a Python dictionary or `dict` structure.

In [11]:
# We need the JSON library
import json

In [13]:
# Read a JSON file like text, but with conversion to python dictionary
json_file_name = "simple_json_file.json"
path = "data"

with open(f"{path}/{json_file_name}", 'r') as file:
    json_data = json.load(file)

# print the loaded data
print(json_data)

{'Key 1': 'The first value that goes with Key 1', 'Second key': 'Json data that goes with the second key', '3rd_key': ['This', 'is', 'a', 'json', 'list', 'and', 'value', 'for', '3rd key']}


In [14]:
# View the json data
json_data

{'Key 1': 'The first value that goes with Key 1',
 'Second key': 'Json data that goes with the second key',
 '3rd_key': ['This',
  'is',
  'a',
  'json',
  'list',
  'and',
  'value',
  'for',
  '3rd key']}

The advantage of a `dict` in Python is that you can access a `value` by looking up its `key`. These are called *key-value pairs* and are fundamental to a dictionary structure.

In [15]:
# Access values in the dict by calling the keys
json_data['Key 1']

'The first value that goes with Key 1'

In [16]:
# Get a list of keys for a dict
json_data.keys()

dict_keys(['Key 1', 'Second key', '3rd_key'])

In [17]:
# Iterate over the keys in a dict

for key in json_data.keys():
    print("key:",key)
    value = json_data[key]
    print("value:",value)
    print()

key: Key 1
value: The first value that goes with Key 1

key: Second key
value: Json data that goes with the second key

key: 3rd_key
value: ['This', 'is', 'a', 'json', 'list', 'and', 'value', 'for', '3rd key']
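
As an alternative sketch, `.items()` gives each key together with its value in one step, and `.get()` avoids an error when a key is missing. The `example` dict below is a shortened stand-in with made-up values, not the file loaded above.

```python
# A small dict shaped like the one loaded above (values shortened for illustration).
example = {
    "Key 1": "first value",
    "Second key": "second value",
    "3rd_key": ["a", "list", "value"],
}

# .items() yields each key together with its value.
for key, value in example.items():
    print("key:", key, "| value:", value)

# .get() returns a default instead of raising KeyError when a key is missing.
missing = example.get("4th_key", "not present")
print(missing)
```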



JSON data can include dictionary structures and list structures, and they can be nested. To see this in action, we can load JSON data from a URL.

To get data from a URL we use the `requests` library. This works like your web browser by sending a `get` *request* to a web server, and then processing the response (instead of rendering it in a browser).

In [18]:
# You can also load json data from a URL
import requests

# JSON data about the CSV on extinct mammals from the same website above
mammal_url = "https://data.gov.au/api/3/action/package_show?id=c02731e8-5327-4720-bbc7-1fe67350a569"

# Request the content from the web server with a .get() request
response = requests.get(mammal_url)

response.content

b'{"help": "https://data.gov.au/data/api/3/action/help_show?name=package_show", "success": true, "result": {"author": "CSIRO Publishing", "author_email": null, "contact_point": "soe@environment.gov.au", "creator_user_id": "dd537e79-7dfd-4a78-9110-8a42785bad1a", "data_state": "active", "id": "c02731e8-5327-4720-bbc7-1fe67350a569", "isopen": true, "jurisdiction": "Commonwealth of Australia", "language": "English", "license_id": "cc-by-4.0", "license_title": "Creative Commons Attribution 4.0 International", "license_url": "http://creativecommons.org/licenses/by/4.0", "maintainer": "soe", "maintainer_email": null, "metadata_created": "2016-10-13T05:31:05.905629", "metadata_modified": "2018-12-21T04:48:53.097284", "name": "2016-soe-biodiversity-cumulative-number-of-extinct-mammal-species", "notes": "Data on cumulative historical extinctions of Australian mammal species - according to the Action Plan for Australian Mammals (2014 edition) by John Woinarksi, Andrew Burbidge and Peter Harrison.

In [19]:
# Get the data as json from the response

mammal_json = response.json()

mammal_json

{'help': 'https://data.gov.au/data/api/3/action/help_show?name=package_show',
 'success': True,
 'result': {'author': 'CSIRO Publishing',
  'author_email': None,
  'contact_point': 'soe@environment.gov.au',
  'creator_user_id': 'dd537e79-7dfd-4a78-9110-8a42785bad1a',
  'data_state': 'active',
  'id': 'c02731e8-5327-4720-bbc7-1fe67350a569',
  'isopen': True,
  'jurisdiction': 'Commonwealth of Australia',
  'language': 'English',
  'license_id': 'cc-by-4.0',
  'license_title': 'Creative Commons Attribution 4.0 International',
  'license_url': 'http://creativecommons.org/licenses/by/4.0',
  'maintainer': 'soe',
  'maintainer_email': None,
  'metadata_created': '2016-10-13T05:31:05.905629',
  'metadata_modified': '2018-12-21T04:48:53.097284',
  'name': '2016-soe-biodiversity-cumulative-number-of-extinct-mammal-species',
  'notes': 'Data on cumulative historical extinctions of Australian mammal species - according to the Action Plan for Australian Mammals (2014 edition) by John Woinarksi, A

Since we know this is JSON data, we can use its structure to navigate the data and find what we are interested in.

In [20]:
# Take a look at the keys
mammal_json.keys()

dict_keys(['help', 'success', 'result'])

In [21]:
# What about the keys down a level?
mammal_json['result'].keys()

dict_keys(['author', 'author_email', 'contact_point', 'creator_user_id', 'data_state', 'id', 'isopen', 'jurisdiction', 'language', 'license_id', 'license_title', 'license_url', 'maintainer', 'maintainer_email', 'metadata_created', 'metadata_modified', 'name', 'notes', 'num_resources', 'num_tags', 'organization', 'owner_org', 'private', 'spatial_coverage', 'state', 'temporal_coverage_from', 'temporal_coverage_to', 'title', 'type', 'update_freq', 'url', 'version', 'resources', 'tags', 'groups', 'relationships_as_subject', 'relationships_as_object'])

In [23]:
# Digging deeper
mammal_json['result']['resources']

[{'cache_last_updated': None,
  'cache_url': None,
  'created': '2016-10-13T16:31:36.178357',
  'datastore_active': True,
  'datastore_contains_all_records_of_source_file': False,
  'description': 'Source: Woinarski, JCZ, Burbidge, AA & Harrison PL (2014 edition). The action plan for Australian mammals 2012, CSIRO Publishing, Melbourne.\r\n\r\nNote: The extinction of 3 species—Bettongia pusilla, Conilurus capricornensis and Pseudomys glaucus—cannot be readily and reliably dated, and these are not included in the data.',
  'format': 'CSV',
  'hash': '',
  'id': '8339c2b4-c763-4c50-a647-63935537453c',
  'last_modified': '2016-10-13',
  'metadata_modified': '2016-10-13T16:31:36.178357',
  'mimetype': None,
  'mimetype_inner': None,
  'name': 'BIO19 Cumulative historical extinctions of Australian mammal species',
  'package_id': 'c02731e8-5327-4720-bbc7-1fe67350a569',
  'position': 0,
  'resource_type': None,
  'size': None,
  'state': 'active',
  'url': 'https://data.gov.au/data/dataset/c

This is a list of dicts - let's get the first dict in the list (item 0) and explore further

In [22]:
# Only one item in the list - get it by accessing the first item 0
mammal_json['result']['resources'][0]

{'cache_last_updated': None,
 'cache_url': None,
 'created': '2016-10-13T16:31:36.178357',
 'datastore_active': True,
 'datastore_contains_all_records_of_source_file': False,
 'description': 'Source: Woinarski, JCZ, Burbidge, AA & Harrison PL (2014 edition). The action plan for Australian mammals 2012, CSIRO Publishing, Melbourne.\r\n\r\nNote: The extinction of 3 species—Bettongia pusilla, Conilurus capricornensis and Pseudomys glaucus—cannot be readily and reliably dated, and these are not included in the data.',
 'format': 'CSV',
 'hash': '',
 'id': '8339c2b4-c763-4c50-a647-63935537453c',
 'last_modified': '2016-10-13',
 'metadata_modified': '2016-10-13T16:31:36.178357',
 'mimetype': None,
 'mimetype_inner': None,
 'name': 'BIO19 Cumulative historical extinctions of Australian mammal species',
 'package_id': 'c02731e8-5327-4720-bbc7-1fe67350a569',
 'position': 0,
 'resource_type': None,
 'size': None,
 'state': 'active',
 'url': 'https://data.gov.au/data/dataset/c02731e8-5327-4720-bb
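
When a response is deeply nested, it can help to print an outline of the keys before digging in. The following is only a sketch: `outline` is a helper invented here (it is not part of `json` or `requests`), and `sample` is made-up data loosely shaped like the response above.

```python
def outline(data, depth=0, max_depth=4):
    """Return indented lines describing the keys and value types in nested data."""
    lines = []
    indent = "  " * depth
    if depth >= max_depth:
        return lines
    if isinstance(data, dict):
        for key, value in data.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(data, list) and data:
        # Describe only the first element; the others often share its shape.
        lines.append(f"{indent}[0]: {type(data[0]).__name__}")
        lines.extend(outline(data[0], depth + 1, max_depth))
    return lines

# Hypothetical data, loosely shaped like the response above (values invented).
sample = {
    "success": True,
    "result": {
        "name": "example-dataset",
        "resources": [{"format": "CSV", "position": 0}],
    },
}

for line in outline(sample):
    print(line)
```

In the notebook, the same helper could be tried on `mammal_json` to see where the lists and dicts sit.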

We can save this dictionary-formatted data as *JSON* by using the `dumps()` function of the `json` library.

In [24]:
# Dump the dict into a json string
metadata = json.dumps(mammal_json['result']['resources'][0])
metadata

'{"cache_last_updated": null, "cache_url": null, "created": "2016-10-13T16:31:36.178357", "datastore_active": true, "datastore_contains_all_records_of_source_file": false, "description": "Source: Woinarski, JCZ, Burbidge, AA & Harrison PL (2014 edition). The action plan for Australian mammals 2012, CSIRO Publishing, Melbourne.\\r\\n\\r\\nNote: The extinction of 3 species\\u2014Bettongia pusilla, Conilurus capricornensis and Pseudomys glaucus\\u2014cannot be readily and reliably dated, and these are not included in the data.", "format": "CSV", "hash": "", "id": "8339c2b4-c763-4c50-a647-63935537453c", "last_modified": "2016-10-13", "metadata_modified": "2016-10-13T16:31:36.178357", "mimetype": null, "mimetype_inner": null, "name": "BIO19 Cumulative historical extinctions of Australian mammal species", "package_id": "c02731e8-5327-4720-bbc7-1fe67350a569", "position": 0, "resource_type": null, "size": null, "state": "active", "url": "https://data.gov.au/data/dataset/c02731e8-5327-4720-bbc7

In [25]:
# Write the json string to a file
file_name = "extinct_mammals_metadata.json"
with open(f"{path}/{file_name}",'w') as fp:
    fp.write(metadata)

Open the file that you just created to check that it has been written correctly.

In [26]:
# read the file back in
with open(f"{path}/{file_name}",'r') as fp:
    text = fp.read()
    file_json = json.loads(text)

file_json

{'cache_last_updated': None,
 'cache_url': None,
 'created': '2016-10-13T16:31:36.178357',
 'datastore_active': True,
 'datastore_contains_all_records_of_source_file': False,
 'description': 'Source: Woinarski, JCZ, Burbidge, AA & Harrison PL (2014 edition). The action plan for Australian mammals 2012, CSIRO Publishing, Melbourne.\r\n\r\nNote: The extinction of 3 species—Bettongia pusilla, Conilurus capricornensis and Pseudomys glaucus—cannot be readily and reliably dated, and these are not included in the data.',
 'format': 'CSV',
 'hash': '',
 'id': '8339c2b4-c763-4c50-a647-63935537453c',
 'last_modified': '2016-10-13',
 'metadata_modified': '2016-10-13T16:31:36.178357',
 'mimetype': None,
 'mimetype_inner': None,
 'name': 'BIO19 Cumulative historical extinctions of Australian mammal species',
 'package_id': 'c02731e8-5327-4720-bbc7-1fe67350a569',
 'position': 0,
 'resource_type': None,
 'size': None,
 'state': 'active',
 'url': 'https://data.gov.au/data/dataset/c02731e8-5327-4720-bb
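
The same round trip can also be done without an intermediate string, using `json.dump()` and `json.load()` on the file object. This is a sketch under stated assumptions: `sample_metadata` is a made-up stand-in for the resource dict, and a temporary folder is used so that nothing in the `data` folder is changed.

```python
import json
import os
import tempfile

# Hypothetical metadata, standing in for the resource dict used above.
sample_metadata = {"format": "CSV", "position": 0, "size": None}

# Write to a temporary directory so this sketch does not touch the data folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_metadata.json")

    # json.dump() writes straight to the file; indent makes the file easier to read.
    with open(sample_path, "w") as fp:
        json.dump(sample_metadata, fp, indent=2)

    # json.load() reads straight from the file.
    with open(sample_path, "r") as fp:
        round_trip = json.load(fp)

# The data read back should match what was written.
print(round_trip == sample_metadata)
```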

Explore the JSON structure starting with the keys

In [27]:
# What keys are available in the first item in the list of resources?
mammal_json['result']['resources'][0].keys()

dict_keys(['cache_last_updated', 'cache_url', 'created', 'datastore_active', 'datastore_contains_all_records_of_source_file', 'description', 'format', 'hash', 'id', 'last_modified', 'metadata_modified', 'mimetype', 'mimetype_inner', 'name', 'package_id', 'position', 'resource_type', 'size', 'state', 'url', 'url_type', 'wms_layer'])

In [28]:
# Take a look at the description
mammal_json['result']['resources'][0]["description"]

'Source: Woinarski, JCZ, Burbidge, AA & Harrison PL (2014 edition). The action plan for Australian mammals 2012, CSIRO Publishing, Melbourne.\r\n\r\nNote: The extinction of 3 species—Bettongia pusilla, Conilurus capricornensis and Pseudomys glaucus—cannot be readily and reliably dated, and these are not included in the data.'

In [29]:
# Split the description into a list on commas (note that this also breaks up the author names)
mammal_json['result']['resources'][0]['description'].split(',')

['Source: Woinarski',
 ' JCZ',
 ' Burbidge',
 ' AA & Harrison PL (2014 edition). The action plan for Australian mammals 2012',
 ' CSIRO Publishing',
 ' Melbourne.\r\n\r\nNote: The extinction of 3 species—Bettongia pusilla',
 ' Conilurus capricornensis and Pseudomys glaucus—cannot be readily and reliably dated',
 ' and these are not included in the data.']

In [30]:
# Split on line breaks instead, and get the first item in the resulting list
mammal_json['result']['resources'][0]['description'].split('\r\n')[0]

'Source: Woinarski, JCZ, Burbidge, AA & Harrison PL (2014 edition). The action plan for Australian mammals 2012, CSIRO Publishing, Melbourne.'

In [31]:
# Since the result is a dictionary, we can get the value for one particular key
mammal_json["result"]["notes"]

'Data on cumulative historical extinctions of Australian mammal species - according to the Action Plan for Australian Mammals (2014 edition) by John Woinarksi, Andrew Burbidge and Peter Harrison. CSIRO Publishing.\r\n\r\nThis data was used by the Department of Environment and Energy to produce Figure BIO19 in the Biodiversity theme of Australia State of the Environment 2016, available at \r\nhttps://soe.environment.gov.au/theme/biodiversity/topic/2016/terrestrial-plant-and-animal-species-mammals#biodiversity-figure-BIO19\r\n'

In [32]:
# we can take this data and structure it further

notes = mammal_json["result"]["notes"]
struct_notes = notes.split('\r\n')
for note in struct_notes:
    print(note)

Data on cumulative historical extinctions of Australian mammal species - according to the Action Plan for Australian Mammals (2014 edition) by John Woinarksi, Andrew Burbidge and Peter Harrison. CSIRO Publishing.

This data was used by the Department of Environment and Energy to produce Figure BIO19 in the Biodiversity theme of Australia State of the Environment 2016, available at 
https://soe.environment.gov.au/theme/biodiversity/topic/2016/terrestrial-plant-and-animal-species-mammals#biodiversity-figure-BIO19
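
Splitting on `'\r\n'` leaves empty strings and trailing spaces in the list. A possible tidy-up is sketched below, using the notes text copied from the output above; `notes_text` and `tidy_notes` are names introduced only for this sketch.

```python
# The notes text as shown in the output above.
notes_text = (
    "Data on cumulative historical extinctions of Australian mammal species - "
    "according to the Action Plan for Australian Mammals (2014 edition) by "
    "John Woinarksi, Andrew Burbidge and Peter Harrison. CSIRO Publishing."
    "\r\n\r\nThis data was used by the Department of Environment and Energy to "
    "produce Figure BIO19 in the Biodiversity theme of Australia State of the "
    "Environment 2016, available at \r\n"
    "https://soe.environment.gov.au/theme/biodiversity/topic/2016/"
    "terrestrial-plant-and-animal-species-mammals#biodiversity-figure-BIO19\r\n"
)

# splitlines() handles \r\n and \n alike; strip() and the filter drop blank entries.
tidy_notes = [part.strip() for part in notes_text.splitlines() if part.strip()]

print(len(tidy_notes))
for part in tidy_notes:
    print(part)
```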



### Visualise

We can use HTML to visualise text.

In [36]:
from IPython.display import display, HTML

heading = "<h3>NOTES:</h3>"

content = ""
for note in struct_notes:
    content += f"<p>{note}</p>"

display(HTML(heading+content))
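
Text taken from external sources may contain characters such as `&` or `<` that have a special meaning in HTML. A more defensive variation, sketched here with made-up notes, escapes each note with `html.escape()` from the standard library and skips empty strings.

```python
from html import escape

# Hypothetical notes, including characters that have special meaning in HTML.
example_notes = ["Burbidge & Harrison", "values < 30", ""]

parts = ["<h3>NOTES:</h3>"]
for note in example_notes:
    if note:  # skip empty strings so no empty paragraphs are produced
        parts.append(f"<p>{escape(note)}</p>")

html_text = "".join(parts)
print(html_text)
```

In a notebook, `html_text` could then be passed to `display(HTML(html_text))` in the same way as above.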

### Explore further

Try experimenting with exploring the dict format to find interesting parts of the data. 

You might also like to try saving the extracted data as a file, and creating a new dataframe with the structured data.

In [None]:
# Your code here
???

### Accessing the data via The Guardian API

News publishers are a useful external data source. **The Guardian** provides an Application Programming Interface (API) which allows us to search and retrieve news articles.

See the `Accessing_the_Guardian_API.ipynb` notebook file for details on obtaining data from the API. 