<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Structured data from text with regex - part 1
---

### Required libraries

We will need Python's regular expression library, `re`, for this QuAD. *Note: Libraries are also known as packages in Python and can include multiple modules.*

In [None]:
import re

### The task

We start with text from which we want to extract some structured data. Here is the text:

In [None]:
starting_text = """2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse"""

In [None]:
starting_text

What we are looking for is the important data embedded in the text. We must look at our original data to see what is possible.

Looking at the `starting_text` we should be able to obtain the following data:

| Date | Name | Food eaten |
| --- | --- | --- |
| 2020.03.18 | Andrew | avocado |
| 2020.03.19 | Catarina | coconut |
| 2020.03.19 | Prime Minister | pineapple mousse |

### Initial structure - lines

Our text already has some structure to it: each entry is on a new line. So we can turn the string `starting_text` into a list of strings by splitting on the newline (`\n`) character that separates each line.

In [None]:
text_list = starting_text.split('\n')
text_list
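
As an aside, Python strings also have a `splitlines()` method that does a similar job. The sketch below compares the two on a shortened copy of the text; the variable names `sample_text` and `trailing_text` are made up for this illustration and are not used elsewhere in the notebook.

```python
# A shortened copy of the starting text, for illustration only
sample_text = """2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut"""

# Splitting on the newline character, as in the cell above
split_result = sample_text.split('\n')

# splitlines() gives the same list for this text
splitlines_result = sample_text.splitlines()

print(split_result)
print(splitlines_result)

# The two differ when the text ends with a newline:
# split('\n') keeps a final empty string, splitlines() does not
trailing_text = "first line\nsecond line\n"
print(trailing_text.split('\n'))
print(trailing_text.splitlines())
```

Either approach works for the text in this notebook; we continue with `split('\n')`.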

### First *n* characters

We now have a list where each element in the list is one line of the starting text. Now we need to start extracting the data we want. 

The date looks the easiest to extract as it is at the beginning of each line. As each line is a string, and a string behaves like a sequence of characters, we can get the first 10 characters by indicating the range in square brackets: `[0:10]`.

Let's try this on the first element of the list...


In [None]:
first_element = text_list[0]
print(first_element)

starting_10 = first_element[0:10]
print(starting_10)

We can experiment with this a bit...

In [None]:
# If we're starting from 0 we can leave it out...
first_element[:10]

In [None]:
# Get just the first 4 characters...
first_element[:4]

In [None]:
#Get the month
first_element[5:7]

Note that the first number of the range is the position of the first character, but the second number is the **next** position after the last character.
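
A small sketch of this rule, using a copy of the first line (the variable names `line`, `year`, `month` and `day` are chosen just for this illustration):

```python
line = "2020.03.18 Andrew eats avocado"

year = line[0:4]    # positions 0, 1, 2, 3
month = line[5:7]   # positions 5, 6
day = line[8:10]    # positions 8, 9

print(year, month, day)

# The length of a slice is the end position minus the start position
print(len(line[0:10]))
```

Because the end position is not included, the number of characters returned is simply the second number minus the first.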

Since we need the date for all lines, we can loop over the list:

In [None]:
for element in text_list:
    print(element[0:10])

### New list with extracted data

If we want to add this data to a new list, instead of printing the elements, we could do the following:

In [None]:
#Create a new empty list
date_list = [] 

#Loop over the text_list and append dates to the new list
for element in text_list:
    date_list.append(element[0:10])
    
#Check new list
date_list

Python also allows us to do this in a single line of code: 

In [None]:
date_list_two = [element[0:10] for element in text_list]
date_list_two

This single line can be read as:
1. Create a new list 
    `[ ... ]`
2. Put the first 10 characters of each element in the list 
    `[element[0:10] ...]`
3. Where each element is obtained by looping over text_list 
    `[element[0:10] for element in text_list]`
4. Assign the resulting list to a variable 
    `date_list_two = [element[0:10] for element in text_list]`

### More complex text

But what happens if our initial text doesn't have a date at the start of every line?

In [None]:
messy_text = """
*** Start ***
2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse
*** End ***
"""

If we re-run the above approach on this text, we run into some issues...

In [None]:
messy_list = messy_text.split('\n')
messy_dates = [element[0:10] for element in messy_list]
messy_dates

We have a list of 7 elements, but only 3 of them are dates. Two are empty strings, and two hold the first 10 characters of the markers for the beginning and end of the file.

### Pattern matching with regex

This is where pattern matching with regex can be helpful. Instead of just getting the first 10 characters, we can get those characters that match a specific regex pattern.

One possibility is a pattern that:
1. Starts a line
2. Has 4 digits, then a full stop, then 2 digits, then a full stop, then 2 digits

The regex for matching the start of a string is `^`, a single digit is matched by `[0-9]`, and if we want a certain number of something we can put the count in `{}`, like `{3}`.
So if we want to match the first 4 digits of a string, we could use the regex `^[0-9]{4}`.
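
To see what the `^` contributes, here is a minimal sketch on two made-up strings (they are not part of our text, and the variable names are illustrative only):

```python
import re

# Made-up example strings (not from the text above)
starts_with_year = "2020 was a long year"
year_in_middle = "The year 2020 was long"

# Without ^ the four digits can be found anywhere in the string
anywhere = re.search(r"[0-9]{4}", year_in_middle)

# With ^ the four digits must be at the very start of the string
anchored_miss = re.search(r"^[0-9]{4}", year_in_middle)   # no match
anchored_hit = re.search(r"^[0-9]{4}", starts_with_year)

print(anywhere)
print(anchored_miss)
print(anchored_hit)
```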

Let's try this with our messy_list...

In [None]:
print(messy_list[0]) # Print first line
match = re.search(r"^[0-9]{4}", messy_list[0]) # Pattern match first line
print(match) # Print the match

Now we try with the third line which we know has a date...

In [None]:
print(messy_list[2]) # Print third line
match = re.search(r"^[0-9]{4}", messy_list[2]) # Pattern match third line
print(match) # Print the match

We can get the actual value of the match with the `group()` method:

In [None]:
match.group()
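
A match object holds more than the matched text. As a brief sketch (using a copy of one line, with the illustrative names `line`, `year_match`, `start` and `end`), its `span()` method reports the same kind of positions we used earlier when slicing:

```python
import re

line = "2020.03.18 Andrew eats avocado"
year_match = re.search(r"^[0-9]{4}", line)

# The text that was matched
print(year_match.group())

# The start and end positions of the match, as a tuple
print(year_match.span())

# Slicing the line with those positions gives the same text
start, end = year_match.span()
print(line[start:end])
```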

We can see that it found a match, which was the first 4 characters of the date. Let's try this approach on every line in `messy_list` instead of our original approach of counting characters...

In [None]:
for element in messy_list:
    match = re.search(r"^[0-9]{4}", element)
    print(match)


### Getting match values

So we can see that the dates have been matched, and those lines without dates come back as `None`. We can only use the `group()` method on actual matches, so in our loop we'll need to check whether the match has a value in it...

In [None]:
for element in messy_list:
    match = re.search(r"^[0-9]{4}", element)
    if match:
        print(match.group())



Instead of printing, let's add the results to a new `date_list`:

In [None]:
# create empty list
date_list = []

# loop over lines and match dates and put in new list
for element in messy_list:
    match = re.search(r"^[0-9]{4}", element)
    if match:
        date_list.append(match.group())

# look at the resulting list
date_list

### Pattern match whole date with regex

Great, but we still need to get the whole date, not just the year. So we need to modify the regex.

Unfortunately, the `.` in regex means "match any character", so we need to tell regex that we actually want to match a full stop. We can do this by using a backslash to tell regex to treat the next character as an ordinary character, like `\.`

We also need to match the month, another full stop and the day of month. So our full regex will be: `^[0-9]{4}\.[0-9]{2}\.[0-9]{2}` (four numbers, stop, two numbers, stop, two numbers).
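
The difference the backslash makes can be sketched with a made-up string that uses dashes instead of full stops (the string and the names `loose` and `strict` are for illustration only):

```python
import re

# A made-up string that looks like a date but uses dashes
not_a_dotted_date = "2020-03-18"

# Unescaped . matches any character, so the dashes are accepted
loose = re.search(r"^[0-9]{4}.[0-9]{2}.[0-9]{2}", not_a_dotted_date)

# Escaped \. matches only a literal full stop, so there is no match
strict = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", not_a_dotted_date)

print(loose)
print(strict)
```

The unescaped pattern would accept text we did not intend to treat as a date, which is why we escape the full stops.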

Let's try that in our code:


In [None]:
# create empty list
date_list = []

# loop over lines and match dates and put in new list
for element in messy_list:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", element)
    if match:
        date_list.append(match.group())

# look at the resulting list
date_list

### Bring it together

Here, we bring it all together and tidy up the code using some more meaningful variable names.

In [None]:
# the text which we want to obtain data from
text = """
*** Start ***
2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse
*** End ***
"""
# split into list of lines
lines = text.split('\n')

# create empty list for dates
dates = []

# loop over lines and match dates and put in new list
for line in lines:
    match = re.search(r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}", line)
    if match:
        dates.append(match.group())
        
# look at the resulting list
for date in dates:
    print(date)

In [None]:
dates
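
As a possible alternative, not the approach used in this notebook, the `re` module can also search the whole text at once. In the sketch below, the `re.MULTILINE` flag makes `^` match at the start of every line rather than only at the start of the string, and `re.findall` returns all matches as a list. The names `date_pattern` and `dates_found` are chosen for this illustration.

```python
import re

# the text which we want to obtain data from
text = """
*** Start ***
2020.03.18 Andrew eats avocado
2020.03.19 Catarina eats coconut
2020.03.19 Prime Minister eats pineapple mousse
*** End ***
"""

# same date pattern as above
date_pattern = r"^[0-9]{4}\.[0-9]{2}\.[0-9]{2}"

# with re.MULTILINE, ^ matches at the start of each line,
# so there is no need to split the text into lines first
dates_found = re.findall(date_pattern, text, flags=re.MULTILINE)

print(dates_found)
```

Splitting into lines first, as we did above, keeps each step visible, which can make it easier to extend the loop later.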