<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024 Sem 1)</div>

# IFN619 :: B1-DataStructures

### Why consider structure in data?

Structure in data helps us make meaning from that data and to be consistent and precise in using it.

When data is structured, we can program computational systems to compute with it, as long as the data adheres to the defined structure.

### Structured data

For example, a spreadsheet allows us to infer relationships between cells based on their rows and columns. This was possible with even the very first spreadsheets.

<a title="By User:Gortu (apple2history.org) [Public domain], via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File%3AVisicalc.png"><img width="512" alt="Visicalc" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Visicalc.png"/></a>

*By User:Gortu (apple2history.org) \[Public domain\], via Wikimedia Commons*


Even in this very simple example, we can make sense of the data. Because of the structure, the computer can also work with it: calculating the cost for each row, a subtotal, the tax, and the total. If the data in this spreadsheet were unstructured (for example, written on a piece of paper), the computer would not be able to work with it.
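As a rough illustration of that point, the sketch below stores a few invoice-style rows as a list of dictionaries and computes the same kinds of quantities. The item names, prices and the 10% tax rate are invented for this example and are not taken from the VisiCalc screenshot.

```python
# Hypothetical rows: every row has the same three labelled fields
rows = [
    {"item": "paper", "quantity": 4, "unit_price": 5.0},
    {"item": "pens", "quantity": 10, "unit_price": 1.5},
    {"item": "folders", "quantity": 2, "unit_price": 3.0},
]
TAX_RATE = 0.1  # assumed rate, for illustration only

# Because every row follows the same structure, one rule works for all rows
row_costs = [row["quantity"] * row["unit_price"] for row in rows]
subtotal = sum(row_costs)
tax = subtotal * TAX_RATE
total = subtotal + tax

print(row_costs, subtotal, tax, total)
```

If one row were missing its `quantity` label, the same rule would fail, which is why adherence to the structure matters.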

Most structured data today is found in relational database management systems (RDBMS), commonly just referred to as databases. The structure is one of tables (like a spreadsheet) and relationships between them, formed by relating particular fields or columns in one table with those in another.
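To make the idea of related tables concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, column names and values are all hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# The customer_id column relates each order to one customer
query = """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
"""
totals = conn.execute(query).fetchall()
conn.close()
print(totals)
```

The join only works because both tables agree on what `customer_id` means, which is the kind of pre-defined relationship a schema records.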

<a data-flickr-embed="true"  href="https://www.flickr.com/photos/14804582@N08/2111269218" title="database schema"><img src="https://farm3.staticflickr.com/2129/2111269218_950cf23a03_b.jpg" width="1024" height="953" alt="database schema"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>


But these are not the only kinds of data structuring. Structured data is basically when the organisation of the data is pre-defined so that certain data is associated with certain labels.

<p><a href="https://commons.wikimedia.org/wiki/File:Database_models.jpg#/media/File:Database_models.jpg"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Database_models.jpg/1200px-Database_models.jpg" alt="Database models.jpg"></a><br>By <a href="//commons.wikimedia.org/wiki/User:Mdd" title="User:Mdd">Marcel Douwe Dekker</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=5679857">Link</a></p>

### Working with structured data

We will work with structured data mostly by reading it from a file into a dataframe. One of the most common files for holding structured tabular data is the comma separated value (CSV) file. A spreadsheet table or a database table can be saved as a CSV file, which we can then import into a pandas dataframe. If you'd like more information on pandas, try the [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/version/2.1/user_guide/10min.html) tutorial.
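To see what a CSV file contains before pandas is involved, the following sketch parses a small piece of CSV text with Python's built-in `csv` module. The column name `Extinctions` and all of the values are made up for illustration; they are not the real dataset.

```python
import csv
import io

# Hypothetical CSV content; the first line holds the column labels
csv_text = "Decade,Extinctions\n1850s,1\n1860s,3\n"

# csv.DictReader uses the header row to label every value in the rows below
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)
print(records)
```

Note that the `csv` module reads every value as a string. Inferring column types, such as whole numbers, is one of the conveniences pandas adds on top.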

For the following example, we're going to use open data from [data.gov.au](https://data.gov.au) for [2016 SoE Biodiversity Cumulative historical extinctions of Australian mammal species](https://data.gov.au/data/dataset/2016-soe-biodiversity-cumulative-number-of-extinct-mammal-species). If you're interested, you can read about the data in [Terrestrial plant and animal species: Mammals](https://soe.environment.gov.au/theme/biodiversity/topic/2016/terrestrial-plant-and-animal-species-mammals#biodiversity-figure-BIO19).

Previously, we have read data into pandas from a file. But pandas also allows us to load data from a URL. Before completing the code below, take a look at the data by opening the URL in your browser. Take a note of which column might be appropriate for the index of the dataframe.

[https://data.gov.au/dataset/c02731e8-5327-4720-bbc7-1fe67350a569/resource/8339c2b4-c763-4c50-a647-63935537453c/download/cumulative-number-of-extinct-mammal-species.csv](https://data.gov.au/dataset/c02731e8-5327-4720-bbc7-1fe67350a569/resource/8339c2b4-c763-4c50-a647-63935537453c/download/cumulative-number-of-extinct-mammal-species.csv)

#### Reading data from a URL

In [None]:
# To use pandas, we need to import it (normally as 'pd')
import ??? as pd

# We can then open a CSV file into a new dataframe
extinct_mammals_url = ???
exmam_df = pd.read_csv(extinct_mammals_url,index_col=???)

# view the dataframe
exmam_df

### Saving (writing) data to a file

Pandas allows us to write a dataframe to a local file with the function `to_csv()`. This can be used in a way that is similar to reading a CSV into a dataframe.
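A minimal sketch of that write-then-read round trip is shown here on a tiny stand-in dataframe. The data, the `Extinctions` column name and the temporary file location are assumptions made for this example only.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for the downloaded CSV
csv_text = "Decade,Extinctions\n1850s,1\n1860s,3\n"
df = pd.read_csv(io.StringIO(csv_text), index_col="Decade")

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    # By default the index is written as the first column of the file
    df.to_csv(csv_path)
    # Naming that column as the index restores the original layout
    df_again = pd.read_csv(csv_path, index_col="Decade")

print(df_again)
```

Reading the file back without `index_col` would give a default numeric index with `Decade` as an ordinary column.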

In [None]:
# We can save our dataframe to use later
file_name = "extinct_aus_mammals.csv"
path = "data"
exmam_df.to_csv(f"{??}/{???}")


In [None]:
# The saved version can be loaded in the same way as the original URL
# We have already declared the path and file_name variables in the previous cell

exmam_file_df = pd.???(f"{path}/{file_name}")
exmam_file_df

#### Reading and writing different formats

We can also read and write structured data formats other than CSV. Excel is a common spreadsheet format for structured data. Pandas allows us to read and write Excel files with the functions `read_excel()` and `to_excel()`.

Try writing the data above to Excel format, then downloading it to your local computer and opening it with Excel. You might also try reading in an Excel file that you upload from your local computer.

What other formats does pandas *read* and *write*? (TIP: Type `pd.read_` or `exmam_df.to_` and then press the `tab` key to bring up a menu of suggestions.)

In [None]:
# Make sure the file extension .xlsx matches the save format
# The old excel format of .xls does not work with the current version of pandas!

excel_file = "extinct_aus_mammals.xlsx" # <---- note .xlsx
exmam_file_df.to_excel(f"{path}/{???}")

To avoid writing the (new) index to the Excel file, use the option `index=False`.

In [None]:
exmam_file_df.to_excel(f"{path}/{excel_file}",???)

In [None]:
exmam_excel_df = pd.read_excel(f"{path}/{excel_file}",???="Decade")
exmam_excel_df

### Unstructured data

Humans can make meaning from data without necessarily having pre-defined structure. In fact we frequently use very ill-defined structures to organise and communicate our thinking. We are also adept at creating these kinds of structures as required, in the moment, rather than requiring the data be structured before we can make sense of it.

<p><a href="https://commons.wikimedia.org/wiki/File:Coggle_Document.png#/media/File:Coggle_Document.png"><img src="https://upload.wikimedia.org/wikipedia/commons/1/19/Coggle_Document.png" alt="Coggle Document.png"></a><br>By <a href="https://en.wikipedia.org/wiki/User:Lurched95" class="extiw" title="en:User:Lurched95">User:Lurched95</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=33923406">Link</a></p>


Computers are not so adept, so complex, in-the-moment sense-making tasks on unstructured data are often easy for humans but very challenging for computers.

<img src="https://static.boredpanda.com/blog/wp-content/uploads/2016/03/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack-5__700.jpg">

[Puppies or Food (boredpanda.com March 2016)](https://www.boredpanda.com/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack/)

### Kinds of structuring of data

In order for us to perform data analysis on unstructured data, we will usually need to do some structuring of it, and this frequently results in semi-structured data. 

The three kinds of structuring can be summarised as:

* **Structured** $\Rightarrow$ when the structure is pre-defined
* **Structured** $\Rightarrow$ is almost synonymous with 'stored in an RDBMS', but can also exist in other software
* **Unstructured** $\leadsto$ when there is no pre-defined structure, or the data can't easily be made to conform to a structure
* **Unstructured** $\leadsto$ commonly raw text, but also images, video, audio
* **Unstructured** $\leadsto$ can appear to have some kind of structure, but often that appearance is derived from our understanding, not from the data itself
* **Semi-structured** $\rightarrow$ the data can be stored in a defined structure, but the actual instance of the structure is not pre-defined

### Semi-structured data

Semi-structured data is a lot more prevalent than structured data, but the computational tools are not as mature as structured data tools. Most semi-structured data tools have come about with the advent of the internet and then social media.

### Working with semi-structured data

We will work with semi-structured data mostly by:
1. creating it from plain text which is read from a file, or
2. importing the data from a `JSON` file.

*JSON* (JavaScript Object Notation) is a way of labelling data that does not require every record to have the same fields, or the structure to be fixed in advance.
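As a small sketch of what that flexibility looks like, the JSON text below holds two records that are both labelled but do not share the same fields. The records are invented for this example.

```python
import json

# Hypothetical JSON text: both records are labelled, but their fields differ
json_text = """
[
  {"title": "Report A", "pages": 12},
  {"title": "Report B", "authors": ["Kim", "Lee"], "draft": true}
]
"""
reports = json.loads(json_text)

for report in reports:
    # Each record carries its own set of labels
    print(report["title"], sorted(report.keys()))
```

A table would need a column for every possible field, with gaps wherever a record lacks one. Here each record simply lists the fields it has.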




#### Reading plain text files

In [None]:
# Read in a plain text file
with open(f"{path}/{file_name}", 'r') as fp:
    exmam_text = fp.???

# Print the string that was read from the file
print(???)

In [None]:
# What does the actual string look like (not formatted)
exmam_text

In [None]:
# We can read the text in a semi-structured format by taking advantage of the lines in the file
with open(f"{path}/{file_name}", 'r') as fp:
    exmam_lines = fp.???

print(???)

In [None]:
# Show the list that was read from the file
???

In each of these examples, notice that we have `\n` newline characters in the data. This is because Python is keeping all of the data from the original file including the characters that specify the end of a line of text.

A way to import the data *without* this character is to split the lines using the string `.split()` function after reading in the text as a single string.
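One detail worth knowing when splitting on the newline character is sketched here on a made-up string rather than the real file. If the text ends with a newline, `.split('\n')` leaves an empty string at the end of the list, whereas the related string method `.splitlines()` does not.

```python
# Hypothetical file contents: two lines, each ending with a newline character
text = "Decade,Extinctions\n1850s,1\n"

by_split = text.split("\n")
by_splitlines = text.splitlines()

print(by_split)        # ['Decade,Extinctions', '1850s,1', '']
print(by_splitlines)   # ['Decade,Extinctions', '1850s,1']
```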

In [None]:
# We can also create the list, by splitting the original string
lines = exmam_text.split(???)

# view the list
???

Now each line of the file is an element in the list and the `\n` characters have been removed. 

However, because each line includes 2 data points, we can split each line. This creates a list of lists which gives us a structure similar to a dataframe.

In [None]:
# Start from the list of lines, created by splitting the original string
lines = exmam_text.split('\n')

# split each line into its data points and view them
for line in lines:
    data_points = line.split(???)
    print(data_points)

Notice that the data is not `clean`. Think about ways that we might be able to fix this.

#### Reading JSON

JSON is a very common file format for semi-structured data. To read this format we open the file as before, but we use the `json` library to help load the data into a Python dictionary or `dict` structure.

In [None]:
# We need the JSON library
import json

In [None]:
# Read a JSON file like text, but with conversion to python dictionary
json_file_name = "simple_json_file.json"
path = "data"

with open(f"{path}/{json_file_name}", ???) as file:
    json_data = json.load(file)

# print the loaded data
print(???)

In [None]:
# View the json data
???

The advantage of a `dict` in python is that you can access a `value` by calling its `key`. These are called *key-value pairs* and are fundamental to a dictionary structure.
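The following sketch shows key-value access on a small dictionary. The keys and values are hypothetical and are not the contents of the JSON file loaded above.

```python
# Hypothetical dictionary, similar in shape to a small JSON file
settings = {"Key 1": "Value 1", "Key 2": 2, "Key 3": [1, 2, 3]}

# Look up a value by its key
print(settings["Key 1"])

# .get() returns a default (here None) instead of raising KeyError
missing = settings.get("Key 9")
print(missing)

# .items() yields each key together with its value
for key, value in settings.items():
    print("key:", key, "value:", value)
```

Values do not all need to be the same type: here one is a string, one a number and one a list.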

In [None]:
# Access values in the dict by calling the keys
json_data['Key 1']

In [None]:
# Get a list of keys for a dict
json_data.keys()

In [None]:
# Iterate over the keys in a dict

for key in json_data.???:
    print("key:",???)
    value = json_data[???]
    print("value:",???)
    print()

JSON data can include dictionary structures and list structures, and they can be nested. To see this in action, we can load JSON data from a URL.

To get data from a URL we use the `requests` library. This works like your web browser by sending a `get` *request* to a web server, and then processing the response (instead of rendering it in a browser).

In [None]:
# You can also load json data from a URL
import requests

# JSON data about the CSV on extinct mammals from the same website above
mammal_url = "https://data.gov.au/api/3/action/package_show?id=c02731e8-5327-4720-bbc7-1fe67350a569"

# Request the content from the web server with a .get() request
response = requests.get(???)

response.content

In [None]:
# Get the data as json from the response

mammal_json = response.json()

mammal_json

Since we know this is json data, we can use the structure to navigate the data and find what we are interested in.
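The general pattern for navigating nested JSON can be sketched without a network connection. The text below is a hypothetical response, only loosely shaped like a data catalogue record, and its field values are invented.

```python
import json

# Hypothetical response text, loosely shaped like a data catalogue record
response_text = """
{
  "success": true,
  "result": {
    "title": "Example dataset",
    "resources": [
      {"format": "CSV", "name": "example.csv"},
      {"format": "JSON", "name": "example.json"}
    ]
  }
}
"""
data = json.loads(response_text)

# dict -> dict -> list -> dict: keys select from dicts, integers from lists
first_resource = data["result"]["resources"][0]
formats = [resource["format"] for resource in data["result"]["resources"]]

print(first_resource["name"], formats)
```

At each level, checking whether you hold a `dict` or a `list` tells you whether the next step needs a key or an integer position.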

In [None]:
# Take a look at the keys
mammal_json.keys()

In [None]:
# What about the keys down a level?
mammal_json[???].keys()

In [None]:
# Digging deeper
mammal_json['result']['resources']

This is a list of dicts - let's get the first dict in the list (item 0) and explore further

In [None]:
# Only one item in the list - get it by accessing the first item 0
mammal_json['result']['resources'][???]

We can save this dictionary-formatted data as *JSON* by using the `dumps()` function of the `json` library.
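As a brief sketch of what `dumps()` produces, the example below converts a made-up dictionary to a JSON string, both compactly and with the optional `indent` argument, and then converts it back with `loads()`. The dictionary contents are hypothetical.

```python
import json

# Hypothetical metadata dictionary
record = {"name": "example.csv", "format": "CSV", "tags": ["mammals", "extinct"]}

compact = json.dumps(record)            # one-line JSON string
pretty = json.dumps(record, indent=2)   # same data, indented for reading

# json.loads() reverses dumps(), giving back an equal dictionary
restored = json.loads(pretty)
print(pretty)
```

The indented form is easier to check by eye when you open the saved file, while the compact form is smaller.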

In [None]:
# Dump the dict into a json string
metadata = json.dumps(???)
metadata

In [None]:
# Write the json string to a file
file_name = "extinct_mammals_metadata.json"
with open(f"{path}/{file_name}",'w') as fp:
    fp.write(???)

Open the file that you just created to check that it has been written correctly.

In [None]:
# read the file back in
with open(f"{path}/{file_name}",'r') as fp:
    text = fp.read()
    file_json = json.loads(text)

file_json

Explore the JSON structure starting with the keys

In [None]:
# What keys are available in the first item in the list of resources?
mammal_json['result']['resources'][0].???

In [None]:
# Take a look at the description
mammal_json['result']['resources'][0][???]

In [None]:
# Format the description as a list
mammal_json['result']['resources'][0]['description'].split(???)

In [None]:
# Get the first item in the list
mammal_json['result']['resources'][0]['description'].split('\r\n')[???]

In [None]:
# since the result is a dictionary, we can get the value for one particular key
mammal_json["result"]["notes"]

In [None]:
# we can take this data and structure it further

notes = mammal_json["result"]["notes"]
struct_notes = notes.split(???)
for note in struct_notes:
    print(???)

### Explore further

Try experimenting with exploring the dict format to find interesting parts of the data. 

You might also like to try saving the extracted data as a file, and creating a new dataframe with the structured data.

In [None]:
???