<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Unstructured and Semi-structured Data

Humans can make meaning from data without necessarily having pre-defined structure. In fact we frequently use very ill-defined structures to organise and communicate our thinking. We are also adept at creating these kinds of structures as required, in the moment, rather than requiring the data be structured before we can make sense of it.

Semi-structured data is far more prevalent than structured data, but the computational tools for working with it are less mature than those for structured data. Most semi-structured data tools emerged with the advent of the internet and, later, social media.

### Accessing semi-structured data

We will work with semi-structured data mostly by (a) creating it from plain text which is read from a file, or (b) importing the data from a JSON file.

JSON is a way of labelling data that does not require every item to have the same fields, nor the structure to be fixed in advance.
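
As a minimal sketch of that flexibility, the example below parses a small JSON string in which two records share only some of their keys. The records and field names are invented for illustration and do not come from any file used in this notebook.

```python
import json

# Hypothetical JSON text: two records that share some keys but not all
raw = '''
[
  {"name": "Ada", "email": "ada@example.com"},
  {"name": "Grace", "phones": ["555-0100", "555-0101"], "address": {"city": "Brisbane"}}
]
'''

# Convert the JSON text into Python lists and dictionaries
records = json.loads(raw)

# Each record is a dictionary, and each may carry different keys
for record in records:
    print(record["name"], sorted(record.keys()))

# .get() returns None instead of raising KeyError when a key is absent
emails = [record.get("email") for record in records]
```

Using `.get()` is one way to cope with records that lack a field; checking `"email" in record` first would work equally well.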




#### Reading plain text files

In [None]:
# Read in a plain text file
with open('data/simple_text_file.txt', 'r') as file:
    text = file.read()

In [None]:
# Show the string that was read from the file
text

In [None]:
# We can read the text in a semi-structured format by taking advantage of the lines in the file
with open('data/simple_text_file.txt', 'r') as file:
    text_lines = file.readlines()

In [None]:
# Show the list that was read from the file
text_lines

In [None]:
# We can also create the list by splitting the original string
# (unlike readlines(), this removes the newline characters)
lines = text.split('\n')

# view the list
lines

In [None]:
lines[1]
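
The two approaches above do not produce identical lists. The sketch below compares them, along with `str.splitlines()`, on a made-up string held in memory (`sample` is an assumption standing in for the file contents), so it runs without any file on disk.

```python
import io

# Hypothetical file contents, held in memory so the example runs without a file
sample = "first line\nsecond line\nthird line\n"

# readlines() keeps the trailing newline on each line
with io.StringIO(sample) as file:
    kept = file.readlines()

# split('\n') drops the newlines but leaves an empty string after the final one
split_lines = sample.split('\n')

# splitlines() drops the newlines and adds no trailing empty string
clean_lines = sample.splitlines()

print(kept)
print(split_lines)
print(clean_lines)
```

Which form is most convenient depends on whether the newline characters matter for the later processing.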

#### Reading JSON

In [None]:
# We need the JSON library
import json

# Read a JSON file like text, but with conversion to Python objects (here, a dictionary)
with open('data/simple_json_file.json', 'r') as file:
    json_data = json.load(file)

In [None]:
# View the json data
json_data

In [None]:
json_data['3rd_key']

In [None]:
# You can also load json data from a URL
import requests

# We can load data about the CSV on extinct mammals that we loaded above
mammal_url = "https://data.gov.au/api/3/action/package_show?id=c02731e8-5327-4720-bbc7-1fe67350a569"
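response = requests.get(mammal_url, timeout=30)
response.raise_for_status()  # stop here if the request failed
mammal_data = response.json()  # parse the JSON body into a Python dictionary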

In [None]:
# view the result
mammal_data

In [None]:
# Since the result is a dictionary, we can get the value for one particular key
mammal_data["result"]["notes"]

In [None]:
# we can take this data and structure it further

notes = mammal_data["result"]["notes"]
struct_notes = notes.split('\r\n')
for note in struct_notes:
    print(note)
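
One possible way to take the structuring a step further is sketched below: blank lines are dropped and any line of the form `Label: value` is stored in a dictionary. The `sample_notes` text is hypothetical and only mimics the `'\r\n'` separator; the real notes may not contain labelled lines at all.

```python
# Hypothetical notes text using the same '\r\n' separator; not the real API response
sample_notes = "Overview of the dataset.\r\n\r\nSource: example agency\r\nUpdated: annually"

# Split into lines and drop the empty ones
note_lines = [line.strip() for line in sample_notes.split('\r\n') if line.strip()]

# Lines of the form 'Label: value' become dictionary entries; the rest stay as free text
fields = {}
free_text = []
for line in note_lines:
    label, sep, value = line.partition(': ')
    if sep:
        fields[label] = value
    else:
        free_text.append(line)

print(fields)
print(free_text)
```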