<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Structured data from text with regex - part 2

---

### Previously...

Our task was to obtain the following data from a given text.

| Date | Name | Food eaten |
| --- | --- | --- |
| 2020.03.18 | Andrew | avocado |
| 2020.03.19 | Catarina | coconut |
| 2020.03.19 | Prime Minister | pineapple mousse |

In Part 1, we extracted the date from the beginning of lines of text (where there was a date to extract).

In [None]:
# import the regex library
import re

# the text which we want to obtain data from
text = """
*** Start ***
2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse
*** End ***
"""
# split into list of lines
lines = text.split('\n')

# create empty list for dates
dates = []

# loop over lines and match dates and put in new list
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        dates.append(match.group())

# look at the resulting list
for date in dates:
    print(date)

### Reducing the task

Once we have the date, we no longer need to work on the entire string because our other data is in the remainder of each line.

To obtain this data, we can use a technique that we used in part 1 - treating a string as a list.

In part 1, we tried getting just the first 10 characters, but we can also get all of the characters *after* a certain point.

In [None]:
third_line = lines[2] # The third item in lines (lines[0] is empty, lines[1] is the Start marker)
third_line

In [None]:
third_line[10:] # All characters from position 10 onwards

Even though the date is only 10 characters long, it is followed by a space, so really we want all text from position 11 onwards...

In [None]:
third_line[11:]
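The offset 11 is hard-coded from the known length of the date plus one space. As a hedged alternative, the match object returned by `re.search` reports where the match ended via `match.end()`, so the remainder can be sliced without counting characters. The sketch below uses its own `sample_line` (a hypothetical stand-in for one dated line) and `lstrip()` to drop the separating whitespace.

```python
import re

# sample_line is a hypothetical stand-in for one dated line of the text
sample_line = "2020.03.18 Andrew eats avocado"

match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", sample_line)
if match:
    # match.end() is the index just past the matched date
    remainder = sample_line[match.end():].lstrip()
    print(remainder)
```

This gives the same substring as slicing from position 11 for these lines, but it would keep working if the date format (and therefore its length) changed.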

We can try this in our code so far...

In [None]:
# loop over lines and match dates
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        print(line[11:]) # print the characters from 11 onwards of lines that match

These are substrings of the original lines which we can add to a new list for further processing...

In [None]:
# create a new empty list
eat_lines = []

# loop over lines and match dates
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        eat_lines.append(line[11:]) # add the substrings for lines that match

eat_lines

### Splitting text using regex

Having reduced our task by taking the substrings, we now need to extract the data from these strings.

For this task, we don't need to match the names and the food eaten directly (even though that is what we want). We can solve a simpler problem instead, because all of the strings share a common pattern that separates names from foods: `eats`

So we can simply split the strings...

In [None]:
first_line = eat_lines[0]
first_line

In [None]:
first_line.split('eats')

This leaves us with a space attached to the end of the name and the beginning of the food, so instead we could use...

In [None]:
first_line.split(' eats ')

However, what if the space was a different whitespace character (like a tab or a non-breaking space), or what if the word was 'ate' instead of 'eats'?

Using regex can help...

In [None]:
alt_line = "andrew ate avocado"

re.split(r"\s[a-z]+\s",alt_line)

In [None]:
alt_line2 = "andrew someotherridiculousword avocado"
re.split(r"\s[a-z]+\s",alt_line2)

In [None]:
alt_line3 = """andrew
eats
avocado"""
re.split(r"\s[a-z]+\s",alt_line3)

As you can see, the regex handles all of these variations by matching the pattern rather than the exact text.
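One caveat worth checking: `re.split` splits at every match, so a string containing more than one lowercase word surrounded by whitespace is broken into more than two pieces. The following is a minimal sketch, using a hypothetical `long_line` that is not part of the original text, which shows the effect and how the `maxsplit` argument of `re.split` limits the split to the first separator.

```python
import re

# long_line is a hypothetical example with extra lowercase words
long_line = "andrew eats ripe green avocado"

# Without a limit, every lowercase word surrounded by whitespace is a separator
all_parts = re.split(r"\s[a-z]+\s", long_line)
print(all_parts)

# maxsplit=1 stops after the first separator, leaving the rest intact
two_parts = re.split(r"\s[a-z]+\s", long_line, maxsplit=1)
print(two_parts)
```

The lines in our text do not trigger this problem, but limiting the split is a cheap safeguard when the food description might be longer.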

We can now try this back in code...

In [None]:
# loop over lines and match dates
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        eatline = line[11:] # substrings for lines that match
        print(eatline)
        splitline = re.split(r"\s[a-z]+\s",eatline)
        print("Name: ",splitline[0]) # First element of splitline is name
        print("Food: ",splitline[1]) # Second element of splitline is food
        print() # Empty line to make easy to read


### Bring it together

Here, we bring it all together, including Part 1.

In [None]:
# the text which we want to obtain data from
text = """
*** Start ***
2020.03.18 Andrew ate avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eating pineapple mousse
*** End ***
"""
# split into list of lines
lines = text.split('\n')

# loop over lines, match dates, and print the date, name and food for each match
for line in lines:
    print(line)
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        print("\tDate: ",match.group())
        eatline = line[11:] # substrings for lines that match
        splitline = re.split(r"\s[a-z]+\s",eatline)
        print("\tName: ",splitline[0]) # First element of splitline is name
        print("\tFood: ",splitline[1]) # Second element of splitline is food
        print() # Empty line to make easy to read
    else:
        print("--- not a match ---")
        print()
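The loop above prints each field, whereas the original task was to obtain a table of dates, names and foods. As one possible next step, the sketch below collects the same three values into a list of tuples and prints them as table rows. The names `date_pattern`, `rows` and `parts` are introduced here for illustration only, and the `maxsplit=1` argument and length check are added assumptions to guard against lines that do not split cleanly.

```python
import re

text = """
*** Start ***
2020.03.18 Andrew ate avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eating pineapple mousse
*** End ***
"""

date_pattern = r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}"
rows = []  # each entry will be a (date, name, food) tuple

for line in text.split('\n'):
    match = re.search(date_pattern, line)
    if match:
        # split the remainder once, on the first lowercase word between whitespace
        parts = re.split(r"\s[a-z]+\s", line[11:], maxsplit=1)
        if len(parts) == 2:  # skip lines that did not split into name and food
            rows.append((match.group(), parts[0], parts[1]))

# print the collected data as table rows
for date, name, food in rows:
    print(f"| {date} | {name} | {food} |")
```

Keeping the data in a list, rather than only printing it, means it can be passed on to later processing steps.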