<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Structured data from text with regex - part 3

---

### Previously...

Our task was to obtain the following data from a given text.

| Date | Name | Food eaten |
| --- | --- | --- |
| 2020.03.18 | Andrew | avocado |
| 2020.03.19 | Catarina | coconut |
| 2020.03.19 | Prime Minister | pineapple mousse |

In Part 1, we extracted the date from the beginning of lines of text (where there was a date to extract).

In Part 2, we extracted the names and foods by splitting the remainder of each line based on a regex that could cope with different eat words.
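
As a quick reminder of why the split works, here is a minimal sketch that applies the same pattern to the three line remainders from the example text. The names `eat_pattern`, `remainders` and `split_results` are only for illustration and do not appear in the earlier parts.

```python
import re

# The same split pattern used in Part 2: whitespace, a lowercase word, whitespace
eat_pattern = r"\s[a-z]+\s"

# Line remainders after the date has been removed (taken from the example text)
remainders = [
    "Andrew ate avocado",
    "Catarina eats coconut",
    "Prime Minister eating pineapple mousse",
]

# Split each remainder into [name, food]
split_results = [re.split(eat_pattern, remainder) for remainder in remainders]

for parts in split_results:
    print(parts)
# ['Andrew', 'avocado']
# ['Catarina', 'coconut']
# ['Prime Minister', 'pineapple mousse']
```

Note that the pattern matches any lowercase word surrounded by whitespace, not only eat words. It works here because the names are capitalised and no food has a lowercase word with whitespace on both sides.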

In [None]:
import re

# the text which we want to obtain data from
text = """
*** Start ***
2020.03.18 Andrew ate avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eating pineapple mousse
*** End ***
"""
# split into list of lines
lines = text.split('\n')

# loop over lines, match dates, and print the extracted fields
for line in lines:
    print(line)
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        print("\tDate: ",match.group())
        eatline = line[11:] # substrings for lines that match
        splitline = re.split(r"\s[a-z]+\s",eatline)
        print("\tName: ",splitline[0]) # First element of splitline is name
        print("\tFood: ",splitline[1]) # Second element of splitline is food
        print() # Empty line to make easy to read
    else:
        print("--- not a match ---")
        print()

### Making the data useful for further analysis

While the previous example showed that we could extract the data, it didn't provide it in an easy-to-use format.

If we want to make further analysis easy, we need to take our data and put it into an appropriate **Data Structure**.

The two most common data structures we will use are **DataFrames** and **JSON**.

### Extracting to JSON

If we look carefully at the code above, we see that we already have name-value pairs, like those we would expect to see in a JSON object.

```javascript
{ 
    "Date": "2020.02.02",
    "Name": "My Name",
    "Food": "my food"
}
```

The Python equivalent is called a dictionary. We can create a dictionary and add name-value pairs like this...

In [None]:
# create an empty dictionary
my_dict = {}

# add name-value pairs
my_dict["Date"] = "2020.02.02"
my_dict["Name"] = "My Name"
my_dict["Food"] = "my food"

# view the dictionary
my_dict

This is a Python dictionary, but we can turn this into a JSON string by using the json library...

In [None]:
import json

In [None]:
json.dumps(my_dict)
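
As a small sketch of how the conversion works in both directions, the example below uses `json.dumps` with its `indent` parameter to produce a more readable string, and `json.loads` to turn that string back into a dictionary. The names `record`, `pretty_json` and `restored` are only for illustration.

```python
import json

# Example dictionary (same illustrative values as above)
record = {"Date": "2020.02.02", "Name": "My Name", "Food": "my food"}

# indent=2 spreads the JSON string over several lines for readability
pretty_json = json.dumps(record, indent=2)
print(pretty_json)

# json.loads parses a JSON string back into a Python dictionary
restored = json.loads(pretty_json)
print(restored == record)  # True
```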

Now we can use this dictionary approach to modify our code from Part 2...

In [None]:
# create a list to put our dictionaries in
new_list = []

# loop over lines and match dates and put in new list
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        # create a new dictionary to put this match in
        new_dict = {}
        # now add the name value pairs
        new_dict["Date"] = match.group()
        eatline = line[11:] # substrings for lines that match
        splitline = re.split(r"\s[a-z]+\s",eatline)
        new_dict["Name"] = splitline[0] # First element of splitline is name
        new_dict["Food"] = splitline[1] # Second element of splitline is food
        new_list.append(new_dict) # Add the dict to the list
    
new_list

In [None]:
# JSON string version
json.dumps(new_list)

### Extracting to a DataFrame

In order to use pandas dataframes, we need to import the pandas library. It is common in Python to abbreviate pandas as **pd**.

In [None]:
import pandas as pd

Before extracting our data into a dataframe, we need to understand what is required to (a) create a new dataframe and (b) append data to that dataframe.

In [None]:
df = pd.DataFrame(columns=['Date','Name','Food'])

In [None]:
df

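pandas needs to know which values should be put with which columns, so we describe the new row as a dictionary. Older versions of pandas could append the dictionary directly with `df.append`, but that method was removed in pandas 2.0, so we wrap the dictionary in a one-row dataframe and join it on with `pd.concat`. Also, if we don't have explicit index values, we need to tell pandas to ignore the index.

In [None]:
dict_to_add = {'Date':'2020.03.15','Name':'Test Name','Food':'test food'}
df = pd.concat([df, pd.DataFrame([dict_to_add])], ignore_index=True)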

In [None]:
df

Notice that because we are adding a dictionary to the dataframe, this is very similar to our JSON code where we were adding a dictionary to a list.

We can therefore modify the code above to add to a dataframe...

In [None]:
# create an empty dataframe
df = pd.DataFrame(columns=['Date','Name','Food'])

# loop over lines, match dates, and add each match to the dataframe
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        # create a new dictionary to put this match in
        new_dict = {}
        # now add the name value pairs
        new_dict["Date"] = match.group()
        eatline = line[11:] # substrings for lines that match
        splitline = re.split(r"\s[a-z]+\s",eatline)
        new_dict["Name"] = splitline[0] # First element of splitline is name
        new_dict["Food"] = splitline[1] # Second element of splitline is food
        # DataFrame.append was removed in pandas 2.0, so add the dict as a one-row dataframe
        df = pd.concat([df, pd.DataFrame([new_dict])], ignore_index=True)
    
df
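
Since each row is a plain dictionary, another option is to collect the dictionaries in a list first (as in the JSON section) and build the dataframe in one step. This is a minimal sketch under that assumption. The `records` list is written out by hand here so the example runs on its own, and the names `records` and `records_df` are only for illustration.

```python
import pandas as pd

# Records in the same shape as the list of dictionaries from the JSON section
records = [
    {"Date": "2020.03.18", "Name": "Andrew", "Food": "avocado"},
    {"Date": "2020.03.19", "Name": "Catarina", "Food": "coconut"},
    {"Date": "2020.03.19", "Name": "Prime Minister", "Food": "pineapple mousse"},
]

# pandas takes the column names from the dictionary keys
records_df = pd.DataFrame(records)
print(records_df)
```

Building the dataframe once from a list avoids creating a new dataframe on every pass through the loop, which can matter for longer texts.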

That completes our QuAD showing how to extract data from text using regex and put that data into JSON and DataFrame data structures.