<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Studio :: Unstructured data

### Unstructured data

Humans can make meaning from data without necessarily having pre-defined structure. In fact we frequently use very ill-defined structures to organise and communicate our thinking. We are also adept at creating these kinds of structures as required, in the moment, rather than requiring the data be structured before we can make sense of it.

<p><a href="https://commons.wikimedia.org/wiki/File:Coggle_Document.png#/media/File:Coggle_Document.png"><img src="https://upload.wikimedia.org/wikipedia/commons/1/19/Coggle_Document.png" alt="Coggle Document.png"></a><br>By <a href="https://en.wikipedia.org/wiki/User:Lurched95" class="extiw" title="en:User:Lurched95">User:Lurched95</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=33923406">Link</a></p>


Computers are not so adept. Complex, in-the-moment sense-making tasks on unstructured data are often easy for humans but very challenging for computers.

<img src="https://static.boredpanda.com/blog/wp-content/uploads/2016/03/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack-5__700.jpg">

[Puppies or Food (boredpanda.com March 2016)](https://www.boredpanda.com/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack/)

### Kinds of structuring of data

In order for us to perform data analysis on unstructured data, we will usually need to do some structuring of it, and this frequently results in semi-structured data. 

The three kinds of structuring can be summarised as:

* **Structured** $\Rightarrow$ when the structure is pre-defined
* **Structured** $\Rightarrow$ is almost synonymous with 'stored in an RDBMS', but can also exist in other software
* **Unstructured** $\leadsto$ when there is no pre-defined structure, or the data can't easily be made to conform to one
* **Unstructured** $\leadsto$ commonly raw text, but also images, video, audio
* **Unstructured** $\leadsto$ can appear to have some kind of structure, but often that appearance is derived from our understanding, not from the data itself
* **Semi-structured** $\rightarrow$ the data can be stored in a defined structure, but the actual instance of the structure is not predefined

### Semi-structured data

Semi-structured data is a lot more prevalent than structured data, but the computational tools for working with it are not as mature as those for structured data. Most semi-structured data tools have come about with the advent of the internet and then social media.

## Working with semi-structured data

We will work with semi-structured data mostly by (a) creating it from plain text which is read from a file, or (b) importing the data from a JSON file.

JSON is a way of labelling data that does not require every record to look the same, or the structure to be fixed in advance.
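As a minimal sketch of what "labelled but not fixed" means, the two records below are made up for illustration (the names `raw` and `records` and the `tags` label are assumptions, not part of the course data). Both records are valid JSON even though only one of them carries a `tags` label.

```python
import json

# Two hypothetical records: both are labelled, but they do not share the same fields
raw = (
    '[{"name": "Andrew", "food": "avocado"},'
    ' {"name": "Catarina", "food": "coconut", "tags": ["fruit", "tropical"]}]'
)

# json.loads turns the JSON text into Python lists and dictionaries
records = json.loads(raw)

for record in records:
    # .get() supplies a default when a record does not have the label
    print(record["name"], record.get("tags", []))
```

Code that reads semi-structured data usually has to allow for missing labels like this, which is one reason it tends to be more defensive than code that reads a fixed table.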




#### Reading plain text files

In [None]:
# Read in a plain text file
with open('data/simple_text_file.txt', 'r') as file:
    text = file.read()

In [None]:
# Show the string that was read from the file
text

In [None]:
# We can read the text in a semi-structured format by taking advantage of the lines in the file
with open('data/simple_text_file.txt', 'r') as file:
    text_lines = file.readlines()

In [None]:
# Show the list that was read from the file
text_lines

In [None]:
# We can also create the list, by splitting the original string
lines = text.split('\n')

# view the list
lines
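The two approaches above give slightly different lists. The sketch below uses an in-memory stand-in for the file (the `sample` string is an assumption, since the contents of `data/simple_text_file.txt` are not shown here) to make the difference visible: `readlines()` keeps the newline at the end of each line, while `split('\n')` removes it but leaves an empty string if the text ends with a newline.

```python
import io

# Hypothetical file contents standing in for data/simple_text_file.txt
sample = "first line\nsecond line\n"

# io.StringIO lets us treat a string like an open text file
from_readlines = io.StringIO(sample).readlines()
from_split = sample.split('\n')

print(from_readlines)  # each element still ends with the newline character
print(from_split)      # no newlines, but a trailing empty string

# One way to tidy the readlines() result: strip the trailing newline from each line
cleaned = [line.rstrip('\n') for line in from_readlines]
print(cleaned)
```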

#### Reading JSON

In [None]:
# We need the JSON library
import json

# Read a JSON file like text, but with conversion to python dictionary
with open('data/simple_json_file.json', 'r') as file:
    json_data = json.load(file)

In [None]:
# View the json data
json_data

In [None]:
# You can also load json data from a URL
import requests

# We can load data about the CSV on extinct mammals that we loaded above
mammal_url = "https://data.gov.au/api/3/action/package_show?id=c02731e8-5327-4720-bbc7-1fe67350a569"
content = requests.get(mammal_url)
mammal_data = json.loads(content.content)

In [None]:
# view the result
mammal_data

In [None]:
# since the result is a dictionary, we can get the value for one particular key
mammal_data["result"]["notes"]

In [None]:
# we can take this data and structure it further

notes = mammal_data["result"]["notes"]
struct_notes = notes.split('\r\n')
for note in struct_notes:
    print(note)
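Splitting on `'\r\n'` assumes every line break in the notes uses that exact ending. As a hedged alternative, `str.splitlines()` splits on the common line endings without naming them. The dictionary below is a made-up stand-in with the same `result` and `notes` keys, not the real API response.

```python
# Hypothetical data with the same "result" -> "notes" shape as mammal_data
sample_data = {"result": {"notes": "First note\r\nSecond note\nThird note"}}

sample_notes = sample_data["result"]["notes"]

# Splitting on '\r\n' misses the plain '\n' between the second and third notes
by_crlf = sample_notes.split('\r\n')

# splitlines() handles both '\r\n' and '\n'
by_any_ending = sample_notes.splitlines()

print(by_crlf)
print(by_any_ending)
```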

---

## Structured data from text with regex

### Required libraries

We will need the Python regular expression library, `re`, for this. *Note: Libraries are also known as packages in Python and can include multiple modules.*

In [None]:
import re

### The task

We start with text from which we want to extract some structured data. Here is the text:

In [None]:
starting_text = """2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse"""

In [None]:
starting_text

What we are looking for is the important data embedded in the text. We must look at our original data to see what is possible.

Looking at the `starting_text` we should be able to obtain the following data:

| Date | Name | Food eaten |
| --- | --- | --- |
| 2020.03.18 | Andrew | avocado |
| 2020.03.19 | Catarina | coconut |
| 2020.03.19 | Prime Minister | pineapple mousse |

### Initial structure - lines

Our text already has some structure to it: each entry is on a new line. So we can turn the string `starting_text` into a list of strings by splitting on the newline (`\n`) character that separates each line.

In [None]:
text_list = starting_text.split('\n')
text_list

### First *n* characters

We now have a list where each element in the list is one line of the starting text. Now we need to start extracting the data we want. 

The date looks the easiest to extract as it is the beginning of each line. As each line is a string, and a string is basically a list of characters, we can get the first 10 characters by indicating the range in square brackets: `[0:10]`

Let's try this on the first element of the list...


In [None]:
first_element = text_list[0]
print(first_element)

starting_10 = first_element[0:10]
print(starting_10)

We can experiment with this a bit...

In [None]:
# If we're starting from 0 we can leave it out...
first_element[:10]

In [None]:
# Get just the first 4 characters...
first_element[:4]

In [None]:
#Get the month
first_element[5:7]

Note that the first number of the range is the position of the first character, but the second number is the **next** position after the last character.
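A small sketch of that rule, using a copy of the first line (the names `sample_line`, `start` and `stop` are only for illustration): because the second number is excluded, the length of a slice is simply the second number minus the first, and two slices that share a boundary never overlap.

```python
sample_line = "2020.03.18 Andrew eats avocado"

start, stop = 5, 7
month = sample_line[start:stop]

# Positions 5 and 6 are included, position 7 is not
print(month)

# The slice length is stop minus start
print(len(month) == stop - start)

# Slices that share a boundary rejoin into the original string
print(sample_line[:10] + sample_line[10:] == sample_line)
```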

Since we need the date for all lines, we can loop over the list:

In [None]:
for element in text_list:
    print(element[0:10])

### New list with extracted data

If we want to add this data to a new list, instead of printing the elements, we could do the following:

In [None]:
#Create a new empty list
date_list = [] 

#Loop over the text_list and append dates to the new list
for element in text_list:
    date_list.append(element[0:10])
    
#Check new list
date_list

Python also allows us to do this in a single line of code: 

In [None]:
date_list_two = [element[0:10] for element in text_list]
date_list_two

This single line can be read as:
1. Create a new list 
    `[ ... ]`
2. Put the first 10 characters of each element in the list 
    `[element[0:10] ...]`
3. Where each element is obtained by looping over text_list 
    `[element[0:10] for element in text_list]`
4. Assign the resulting list to a variable 
    `date_list_two = [element[0:10] for element in text_list]`

### More complex text

But what happens if not every line of our initial text starts with a date...

In [None]:
messy_text = """
*** Start ***
2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse
*** End ***
"""

If we re-run the above approach on this text, we run into some issues...

In [None]:
messy_list = messy_text.split('\n')
messy_dates = [element[0:10] for element in messy_list]
messy_dates

We have a list of 7 elements. Only 3 of them are dates. 2 of them are empty and 2 of them have other data that mark the beginning and end of the file. 

### Pattern matching with regex

This is where pattern matching with regex can be helpful. Instead of just getting the first 10 characters, we can get those characters that match a specific regex pattern.

One possibility is a pattern that:
1. Starts a line
2. Has 4 numbers then a full stop, then 2 numbers then full stop, then 2 numbers

The regex for matching the start of a string is `^`, numbers (the digits 0 to 9) are matched by `[0-9]`, and if we want a certain number of something we can put the number in `{}`, like `{3}`. 
So if we want to match 4 numbers at the start of a string, we could use the regex `^[0-9]{4}`.

Let's try this with our messy_list...

**Tip:** You can test regex patterns and find additional information in this page [https://regexr.com/](https://regexr.com/)

In [None]:
print(messy_list[0]) # Print first line
match = re.search(r"^[0-9]{4}", messy_list[0]) # Pattern match first line
print(match) # Print the match

Now we try with the third line which we know has a date...

In [None]:
print(messy_list[2]) # Print third line
match = re.search(r"^[0-9]{4}", messy_list[2]) # Pattern match third line
print(match) # Print the match

We can get the actual value of the match with the `group()` method:

In [None]:
match.group()

We can see that it found a match which was the first 4 characters of the date. Let's try this approach with the whole of `messy_text` instead of our original counting characters approach...

In [None]:
for element in messy_list:
    match = re.search(r"^[0-9]{4}", element)
    print(match)


### Getting match values

So we can see that the dates have been matched, and those lines without dates come back with `None`. We can only call `group()` on actual matches, so in our loop we'll need to check that the match has a value in it...

In [None]:
for element in messy_list:
    match = re.search(r"^[0-9]{4}", element)
    if match:
        print(match.group())
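Besides `group()`, a match object also records where in the string the match was found. The sketch below (with illustrative names `sample_line`, `found` and `not_found`) shows `span()` and `end()`, and confirms that a line without a date gives `None` rather than a match object.

```python
import re

sample_line = "2020.03.18 Andrew eats avocado"
found = re.search(r"^[0-9]{4}", sample_line)

if found:
    print(found.group())              # the matched text
    print(found.span())               # (start, end) positions of the match
    print(sample_line[found.end():])  # everything after the match

# A line that does not start with 4 digits gives None, not a match object
not_found = re.search(r"^[0-9]{4}", "*** Start ***")
print(not_found is None)
```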



Instead of printing, let's add the results to our new `date_list`

In [None]:
# create empty list
date_list = []

# loop over lines and match dates and put in new list
for element in messy_list:
    match = re.search(r"^[0-9]{4}", element)
    if match:
        date_list.append(match.group())

# look at the resulting list
date_list

### Pattern match whole date with regex

Great, but we still need to get the whole date, not just the year. So we need to modify the regex.

Unfortunately, the `.` in regex means match any character, so we need to tell regex that we actually want to match a full stop. We can do this by using a backslash, which tells regex to treat the next character as an ordinary character, like `\.`

We also need to match the month, another full stop and the day of month. So our full regex will be: `^[0-9]{4}\.[0-9]{2}\.[0-9]{2}` (four numbers, stop, two numbers, stop, two numbers).

Let's try that in our code:


In [None]:
# create empty list
date_list = []

# loop over lines and match dates and put in new list
for element in messy_list:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", element)
    if match:
        date_list.append(match.group())

# look at the resulting list
date_list
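To see why escaping the full stop matters, the sketch below compares the escaped pattern with an unescaped version on a few made-up strings (the `candidates` list is an assumption for illustration). The unescaped pattern accepts any separator, which may not be what we want.

```python
import re

unescaped = r"^[0-9]{4}.[0-9]{2}.[0-9]{2}"    # '.' matches any character
escaped = r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}"    # '\.' matches only a full stop

candidates = ["2020.03.18", "2020-03-18", "2020x03y18"]

# Record, for each candidate, whether each pattern matches it
results = {}
for candidate in candidates:
    results[candidate] = (
        bool(re.search(unescaped, candidate)),
        bool(re.search(escaped, candidate)),
    )
    print(candidate, results[candidate])
```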

### Bring it together

> Part 1

Here, we bring it all together and tidy up the code using some more meaningful variable names

In [None]:
# the text which we want to obtain data from
text = """
*** Start ***
2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse
*** End ***
"""
# split into list of lines
lines = text.split('\n')

# create empty list for dates
dates = []

# loop over lines and match dates and put in new list
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        dates.append(match.group())
        
# look at the resulting list
for date in dates:
    print(date)

In [None]:
dates
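As an aside, and not the approach used in this notebook, the same dates can be collected without splitting the text into lines first. This sketch assumes the `re.MULTILINE` flag, which makes `^` match at the start of every line rather than only at the start of the whole string. The names `sample_text` and `all_dates` are illustrative.

```python
import re

sample_text = """
*** Start ***
2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse
*** End ***
"""

# With re.MULTILINE, '^' anchors to the start of each line in the text
all_dates = re.findall(
    r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", sample_text, flags=re.MULTILINE
)
print(all_dates)
```

Splitting into lines first, as above, is still useful here because the later steps need the rest of each line as well as the date.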

### Reducing the task

Once we have the date, we no longer need to work on the entire string because our other data is in the remainder of each line.

To obtain this data, we can use a technique that we used in part 1 - treating a string as a list.

In part 1, we tried getting just the first 10 characters, but we can also get all of the characters *after* a certain point.

In [None]:
third_line = lines[2] # The third line of the text
third_line

In [None]:
third_line[10:] # All characters from position 10 onwards

Even though the date is only 10 characters, it is followed by a space, so really we want all text from position 11 onwards...

In [None]:
third_line[11:]

We can try this in our code so far...

In [None]:
# loop over lines and match dates
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        print(line[11:]) # print the characters from 11 onwards of lines that match

These are substrings of the original lines which we can add to a new list for further processing...

In [None]:
# create a new empty list
eat_lines = []

# loop over lines and match dates
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        eat_lines.append(line[11:]) # add the substrings for lines that match

eat_lines
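The position `11` is hard-coded, so it only works while the date is exactly 10 characters followed by exactly one space. A possible alternative, sketched below with made-up sample lines, is to ask the match object where the date ended and strip whatever whitespace follows.

```python
import re

date_pattern = r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}"

# Hypothetical lines: the second has extra spaces after the date
sample_lines = [
    "2020.03.18 Andrew eats avocado",
    "2020.03.19   Catarina eats coconut",
]

remainders = []
for line in sample_lines:
    match = re.search(date_pattern, line)
    if match:
        # match.end() is the position just after the date; strip() removes the spaces
        remainders.append(line[match.end():].strip())

print(remainders)
```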

### Splitting text using regex

Now that we have reduced our task by taking the substrings, we now need to extract the data from these strings. 

For this task, we don't need to match the names and the food eaten (even though that is what we want). We can take a simpler approach, as all strings have a common pattern that separates names and foods: `eats`

So we can simply split the strings...

In [None]:
first_line = eat_lines[0]
first_line

In [None]:
first_line.split('eats')

This leaves us with a space attached to the end of the name and the beginning of the food, so instead we could use..

In [None]:
first_line.split(' eats ')

However, what if the space was an invisible character (like a tab or a non-breaking space), or what if the word was 'ate' instead of 'eats'?

Using regex can help. We need to find a pattern of a space, a word of any length, and a space. In regex, `\s` matches any whitespace character (such as a space, tab or newline). In the previous example we specified the number of characters using `{}`. In this case we don't know how many characters there are, so we can use `+` to match one or more characters.

In [None]:
alt_line = "andrew ate avocado"

re.split(r"\s[a-z]+\s",alt_line)

In [None]:
alt_line2 = "andrew someotherrediculousword avocado"
re.split(r"\s[a-z]+\s",alt_line2)

In [None]:
alt_line3 = """andrew
eats
avocado"""
re.split(r"\s[a-z]+\s",alt_line3)

As you can see, the regex captures all of these variations by matching the pattern rather than the exact text.

We can now try this back in code...

In [None]:
# loop over lines and match dates
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        eatline = line[11:] # substrings for lines that match
        print(eatline)
        splitline = re.split(r"\s[a-z]+\s",eatline)
        print("Name: ",splitline[0]) # First element of splitline is name
        print("Food: ",splitline[1]) # Second element of splitline is food
        print() # Empty line to make easy to read
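One limitation worth knowing: `re.split` splits at every place the pattern matches, so a food description that contains another lowercase word between spaces gets split again. The sketch below uses a made-up line to show this, along with the `maxsplit` argument of `re.split`, which limits the number of splits. Whether one split is the right rule depends on the data.

```python
import re

sample_eatline = "Andrew eats avocado on toast"

# Splits at " eats " and again at " on "
every_split = re.split(r"\s[a-z]+\s", sample_eatline)
print(every_split)

# maxsplit=1 stops after the first separator, keeping the food together
first_split_only = re.split(r"\s[a-z]+\s", sample_eatline, maxsplit=1)
print(first_split_only)
```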


### Bring it together

> Part 2

Here, we bring it all together including part 1

In [None]:
# the text which we want to obtain data from
text = """
*** Start ***
2020.03.18 Andrew ate avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eating pineapple mousse
*** End ***
"""
# split into list of lines
lines = text.split('\n')

# loop over lines, match dates and print the extracted data
for line in lines:
    print(line)
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        print("\tDate: ",match.group())
        eatline = line[11:] # substrings for lines that match
        splitline = re.split(r"\s[a-z]+\s",eatline)
        print("\tName: ",splitline[0]) # First element of splitline is name
        print("\tFood: ",splitline[1]) # Second element of splitline is food
        print() # Empty line to make easy to read
    else:
        print("--- not a match ---")
        print()
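The match-then-slice-then-split steps can also be folded into a single pattern with named groups. This is a sketch of an alternative, not the method used above, and it assumes each entry is laid out as a date, a name, one all-lowercase verb and then the food. The group names and `entry_pattern` are illustrative.

```python
import re

# Assumed layout: date, name, a single lowercase verb, then the food
entry_pattern = re.compile(
    r"^(?P<date>[0-9]{4}\.[0-9]{2}\.[0-9]{2})\s+"  # the date at the start of the line
    r"(?P<name>.+?)\s+[a-z]+\s+"                   # shortest name followed by a lowercase verb
    r"(?P<food>.+)$"                               # everything else is the food
)

sample_lines = [
    "*** Start ***",
    "2020.03.18 Andrew ate avocado",
    "2020.03.19 Prime Minister eating pineapple mousse",
]

entries = []
for line in sample_lines:
    match = entry_pattern.search(line)
    if match:
        # groupdict() returns the named groups as a dictionary
        entries.append(match.groupdict())

for entry in entries:
    print(entry)
```

A single pattern keeps the rule for a valid entry in one place, at the cost of being harder to read than the step-by-step version.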

### Making the data useful for further analysis

While the previous example showed that we could extract the data, it didn't provide it in an easy to use format. 

If we want to make further analysis easy, we need to take our data and put it into an appropriate **Data Structure**.

The two most common data structures we will use are **DataFrames** and **JSON**.

### Extracting to JSON

If we look carefully at the code above, we see that we already have name value pairs, like what we would expect to see in a JSON object.

```javascript
{ 
    "Date": "2020.02.02",
    "Name": "My Name",
    "Food": "my food"
}
```

The Python equivalent is called a dictionary. We can create a dictionary and add name value pairs like this...

In [None]:
# create an empty dictionary
my_dict = {}

# add name-value pairs
my_dict["Date"] = "2020.02.02"
my_dict["Name"] = "My Name"
my_dict["Food"] = "my food"

# view the dictionary
my_dict

This is a Python dictionary, but we can turn this into a JSON string by using the json library...

In [None]:
json.dumps(my_dict)

Now we can use this json approach to modify our code from part 2...

In [None]:
# create a list to put our dictionaries in
new_list = []

# loop over lines and match dates and put in new list
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        # create a new dictionary to put this match in
        new_dict = {}
        # now add the name value pairs
        new_dict["Date"] = match.group()
        eatline = line[11:] # substrings for lines that match
        splitline = re.split(r"\s[a-z]+\s",eatline)
        new_dict["Name"] = splitline[0] # First element of splitline is name
        new_dict["Food"] = splitline[1] # Second element of splitline is food
        new_list.append(new_dict) # Add the dict to the list
    
new_list

In [None]:
# JSON string version
json.dumps(new_list)
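A JSON string produced this way can be turned back into Python data with `json.loads`. The sketch below uses a small hand-written list (`sample_records` is an assumption standing in for `new_list`) and the `indent` argument of `json.dumps`, which makes the string easier to read.

```python
import json

# Hypothetical records in the same shape as new_list
sample_records = [
    {"Date": "2020.03.18", "Name": "Andrew", "Food": "avocado"},
    {"Date": "2020.03.19", "Name": "Catarina", "Food": "coconut"},
]

# indent=2 spreads the JSON over several lines for readability
pretty = json.dumps(sample_records, indent=2)
print(pretty)

# json.loads reverses json.dumps, giving back an equal list of dictionaries
restored = json.loads(pretty)
print(restored == sample_records)
```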

### Extracting to a DataFrame

In order to use pandas dataframes, we need to import the pandas library. It is common in Python to abbreviate pandas as **pd**.

In [None]:
import pandas as pd

Before extracting our data into a dataframe, we need to understand what is required to (a) create a new dataframe (b) append data to that dataframe.

In [None]:
df = pd.DataFrame(columns=['Date','Name','Food'])

In [None]:
df

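pandas needs to know which values belong in which columns, so we describe the new row as a dictionary keyed by column name. `DataFrame.append` was removed in pandas 2.0, so we wrap the dictionary in a one-row dataframe and join it to the existing one with `pd.concat`. Also, as we don't have explicit index values, we need to tell pandas to ignore the index.

In [None]:
dict_to_add = {'Date':'2020.03.15','Name':'Test Name','Food':'test food'}
# DataFrame.append was removed in pandas 2.0, so use pd.concat with a one-row dataframe
df = pd.concat([df, pd.DataFrame([dict_to_add])], ignore_index=True)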

In [None]:
df

Notice that because we are adding a dictionary to the dataframe, this is very similar to our JSON code where we were adding a dictionary to a list.

We can therefore modify the code above to add to a dataframe...

In [None]:
# create an empty dataframe
df = pd.DataFrame(columns=['Date','Name','Food'])

# loop over lines, match dates and add each match to the dataframe
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        # create a new dictionary to put this match in
        new_dict = {}
        # now add the name value pairs
        new_dict["Date"] = match.group()
        eatline = line[11:] # substrings for lines that match
        splitline = re.split(r"\s[a-z]+\s",eatline)
        new_dict["Name"] = splitline[0] # First element of splitline is name
        new_dict["Food"] = splitline[1] # Second element of splitline is food
        df = pd.concat([df, pd.DataFrame([new_dict])], ignore_index=True) # Add the dict to the df (DataFrame.append was removed in pandas 2.0)
    
df
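Growing a dataframe one row at a time inside a loop works for small examples, but a common alternative is to collect the dictionaries in a list first and build the dataframe once at the end. The sketch below assumes a hand-written list (`rows`) in the same shape as the JSON list built earlier. The conversion of the `Date` column with `pd.to_datetime` is an optional extra, not something the steps above require.

```python
import pandas as pd

# Hypothetical rows in the same shape as the list of dictionaries built earlier
rows = [
    {"Date": "2020.03.18", "Name": "Andrew", "Food": "avocado"},
    {"Date": "2020.03.19", "Name": "Catarina", "Food": "coconut"},
]

# Build the dataframe in one step from the list of dictionaries
df_from_rows = pd.DataFrame(rows, columns=["Date", "Name", "Food"])

# Optional: parse the date strings so they can be sorted and filtered as dates
df_from_rows["Date"] = pd.to_datetime(df_from_rows["Date"], format="%Y.%m.%d")

print(df_from_rows)
```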

That completes how to extract data from text using regex and put that data into JSON and DataFrame data structures.