# Using Regex in Python

In this example we're going to use regular expressions to extract data from free text. This is a common use of regular expressions, particularly when web scraping lists *etc.* Download the *student_details.txt* file from Brightspace and drop it into the same folder as this notebook to follow along.

Before we can get our hands dirty with regular expressions we need to read the text into Python, so our first step is to read this file into a variable. Regular expressions are part of the Python standard library, but we need to import the module before we can use it in our scripts. The regular expression module is named **re**.

In [1]:
import re

# Read the whole file into a single string
with open("./student_details.txt") as f:
    text = f.read()

text

'John Doe has student number D12345665, is enrolled on TU953 and was born on 01/02/1990\nJane Doe has student number D44563217, is enrolled on TU256 and was born on 02/02/1979\nLucas Rizzo has student number D12345678, is enrolled on TU256 and was born on 01/01/1960\nBojan Bozick has student number C87654321, is enrolled on TU44 and was born on 08/08/2002\nRichard Boyd Barrett has student number D77553321, is enrolled on TU953 and was born on 01/03/1988\n'

We've successfully read in the text file and stored it in a variable called **text**. Notice the **\n** characters in the text above. This is an escape sequence meaning *new line*. If we opened the file in a text editor, each \n would be shown as a line break.
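
To see the difference between the escaped form and the rendered form, a small sketch like the following may help. The `sample` string is a shortened, hypothetical stand-in for the contents of **text**, not the real file.

```python
# A shortened stand-in for the file contents (an assumption for illustration)
sample = "John Doe was born on 01/02/1990\nJane Doe was born on 02/02/1979\n"

# repr() shows the escape sequence \n, just like the notebook output above
print(repr(sample))

# print() renders each \n as an actual line break
print(sample)

# splitlines() breaks the string into one entry per line
lines = sample.splitlines()
print(len(lines))
```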

Our next step is to define the goal. First we need to work out which values can be extracted from this text. Every line seems to contain four pieces of information: *name*, *student number*, *course code* and *date of birth*. We'll start small and work our way up. Our first task is to extract the name from each line (first and last). Remember, to extract data we need to use *capturing groups*.
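
As a quick reminder of what a capturing group does, here is a minimal sketch on a single name taken from the first record. Each pair of parentheses in the pattern becomes a numbered group on the match object.

```python
import re

# One name from the first record, used only to illustrate capturing groups
match = re.search(r"(\w+) (\w+)", "John Doe")

whole = match.group(0)   # the entire matched text
first = match.group(1)   # contents of the first pair of parentheses
last = match.group(2)    # contents of the second pair of parentheses
print(whole, first, last)
```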

In [2]:
pattern = r"(\w+) (\w+)" # raw string, so Python leaves the backslashes alone
regex = re.compile(pattern, re.MULTILINE) # re.MULTILINE makes ^ and $ match at the start and end of every line
matches = regex.findall(text)
matches

[('John', 'Doe'),
 ('has', 'student'),
 ('number', 'D12345665'),
 ('is', 'enrolled'),
 ('on', 'TU953'),
 ('and', 'was'),
 ('born', 'on'),
 ('Jane', 'Doe'),
 ('has', 'student'),
 ('number', 'D44563217'),
 ('is', 'enrolled'),
 ('on', 'TU256'),
 ('and', 'was'),
 ('born', 'on'),
 ('Lucas', 'Rizzo'),
 ('has', 'student'),
 ('number', 'D12345678'),
 ('is', 'enrolled'),
 ('on', 'TU256'),
 ('and', 'was'),
 ('born', 'on'),
 ('Bojan', 'Bozick'),
 ('has', 'student'),
 ('number', 'C87654321'),
 ('is', 'enrolled'),
 ('on', 'TU44'),
 ('and', 'was'),
 ('born', 'on'),
 ('Richard', 'Boyd'),
 ('Barrett', 'has'),
 ('student', 'number'),
 ('is', 'enrolled'),
 ('on', 'TU953'),
 ('and', 'was'),
 ('born', 'on')]

Not a bad start: we've successfully matched *John* and *Doe*. Unfortunately, we've matched pretty much every other word, too. Each line should be a new record, so we can use the $ character to match the end of the line. This is what the re.MULTILINE flag is for: without it, $ only matches at the end of the whole string. We're going to capture the first name, then the last name, and then match (but not capture) everything up to the end of the line.

In [3]:
pattern = r"(\w+) (\w+).*$"
regex = re.compile(pattern, re.MULTILINE) # with re.MULTILINE, $ matches at the end of every line
matches = regex.findall(text)
matches

[('John', 'Doe'),
 ('Jane', 'Doe'),
 ('Lucas', 'Rizzo'),
 ('Bojan', 'Bozick'),
 ('Richard', 'Boyd')]
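
The role of re.MULTILINE here can be checked on a small sample. The sketch below uses two shortened lines in the same shape as the file (the `sample` string is an assumption, not the real data) and runs the same pattern with and without the flag.

```python
import re

# Two shortened lines in the same shape as student_details.txt (an assumption)
sample = "John Doe has student number D12345665\nJane Doe has student number D44563217\n"
pattern = r"(\w+) (\w+).*$"

# Without re.MULTILINE, $ only matches at the end of the string
# (or just before a trailing newline), so only the last line can match
without_flag = re.findall(pattern, sample)

# With re.MULTILINE, $ also matches just before every line break
with_flag = re.findall(pattern, sample, re.MULTILINE)

print(without_flag)
print(with_flag)
```

Because `.` does not match a line break by default, the `.*` cannot run past the end of a line, which is why each line yields exactly one match when the flag is set.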

We have to write these same four lines of code every time we want to check an updated pattern. Let's define a function to make our lives easier. 

In [4]:
def test_pattern(pattern):
    """Return every match of pattern in the global text, one record per line."""
    regex = re.compile(pattern, re.MULTILINE)
    matches = regex.findall(text)
    return matches

test_pattern(r"(\w+) (\w+).*$")

[('John', 'Doe'),
 ('Jane', 'Doe'),
 ('Lucas', 'Rizzo'),
 ('Bojan', 'Bozick'),
 ('Richard', 'Boyd')]

That seems fine, but Richard Boyd Barrett lost the second half of his double-barrelled surname. How do we know whether a student has two or three names? If we look closely at the text we can see it follows a regular pattern:

```
<student name> has student number...
```

If we include the *has student number* text in our match then we'll be able to tell whether the student has 2 or 3 names. A student may have multiple surnames, so we'll add the space character to our surname character class to allow Boyd Barrett to match. This second capturing group will match any combination of word characters (letters, digits and underscores), spaces and apostrophes until it finds the text **has student number**.

In [5]:
test_pattern(r"(\w+) ([\w' ]+) has student number.*$")

[('John', 'Doe'),
 ('Jane', 'Doe'),
 ('Lucas', 'Rizzo'),
 ('Bojan', 'Bozick'),
 ('Richard', 'Boyd Barrett')]
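
None of the names in the file actually contains an apostrophe, so here is a minimal sketch that exercises that part of the character class. The first record below is hypothetical and only there for illustration.

```python
import re

# The first record is made up; the second is shortened from the file
sample = (
    "Mary O'Brien has student number D11111111\n"
    "Richard Boyd Barrett has student number D77553321\n"
)

# [\w' ]+ is greedy, but it backtracks until " has student number" can follow it
names = re.findall(r"(\w+) ([\w' ]+) has student number.*$", sample, re.MULTILINE)
print(names)
```

Note that this class still would not match a hyphenated name; a hyphen would have to be added to the class for that.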

The next part of our entry is the student number itself. We want to capture this. How do we define a student number?

A student number is a letter (capital C or capital D) followed by 8 digits. Let's add this to our pattern in a capturing group.

In [6]:
test_pattern(r"(\w+) ([\w' ]+) has student number ([CD][\d]{8}).*$")

[('John', 'Doe', 'D12345665'),
 ('Jane', 'Doe', 'D44563217'),
 ('Lucas', 'Rizzo', 'D12345678'),
 ('Bojan', 'Bozick', 'C87654321'),
 ('Richard', 'Boyd Barrett', 'D77553321')]
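
The student number rule can also be tried on its own. The sketch below uses `re.fullmatch`, which only succeeds if the whole string fits the pattern. The first two candidates appear in the file; the other three are invalid values invented for illustration.

```python
import re

student_number = re.compile(r"[CD]\d{8}")

# Two real student numbers followed by three made-up invalid ones:
# wrong letter, too few digits, lower-case letter
candidates = ["D12345665", "C87654321", "E12345678", "D1234567", "d12345665"]

valid = [c for c in candidates if student_number.fullmatch(c)]
print(valid)
```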

Now we've got the student number, let's extract the course enrollment:

```
John Doe has student number D12345665, is enrolled on TU953 and was born on 01/02/1990
```

We know that the student number will be followed by **, is enrolled on**. We can match that text and then capture our course code. The course code is going to be the letters *TU* followed by 3 digits.

In [7]:
test_pattern(r"(\w+) ([\w' ]+) has student number ([CD][\d]{8}), is enrolled on (TU\d\d\d).*$")

[('John', 'Doe', 'D12345665', 'TU953'),
 ('Jane', 'Doe', 'D44563217', 'TU256'),
 ('Lucas', 'Rizzo', 'D12345678', 'TU256'),
 ('Richard', 'Boyd Barrett', 'D77553321', 'TU953')]

Have you noticed anything about the output? We seem to have lost Bojan.

```
Bojan Bozick has student number C87654321, is enrolled on TU44 and was born on 08/08/2002
```

On closer inspection, we can see that Bojan's course code only has 2 digits. We can fix this with a repetition quantifier: {2,3} means "between 2 and 3 of the preceding item".

In [8]:
test_pattern(r"(\w+) ([\w' ]+) has student number ([CD][\d]{8}), is enrolled on (TU[\d]{2,3}).*$")

[('John', 'Doe', 'D12345665', 'TU953'),
 ('Jane', 'Doe', 'D44563217', 'TU256'),
 ('Lucas', 'Rizzo', 'D12345678', 'TU256'),
 ('Bojan', 'Bozick', 'C87654321', 'TU44'),
 ('Richard', 'Boyd Barrett', 'D77553321', 'TU953')]
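
To see exactly what `{2,3}` accepts, here is a small sketch on some made-up course codes (only TU953 and TU44 come from the file). It also shows that, without anything anchoring the end of the code, a 4-digit code is partially matched.

```python
import re

# TU7 and TU1234 are invented to probe the limits of {2,3}
sample = "TU953 TU44 TU7 TU1234"

# Matches 2 or 3 digits, but happily takes the first 3 digits of TU1234
loose = re.findall(r"TU\d{2,3}", sample)

# \b (word boundary) on both sides rejects codes that are too long
strict = re.findall(r"\bTU\d{2,3}\b", sample)

print(loose)
print(strict)
```

In the full pattern this is less of a concern, because the text around the course code is matched as well.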

Finally, we want to match the date of birth, which is in the format dd/mm/yyyy. The date of birth should be the last part of each line, so we put our $ right after the final capturing group. The dot-star will match the *and was born on* part of the text for us.

In [9]:
test_pattern(r"(\w+) ([\w' ]+) has student number ([CD][\d]{8}), is enrolled on (TU[\d]{2,3}).* (\d{2}/\d{2}/\d{4})$")

[('John', 'Doe', 'D12345665', 'TU953', '01/02/1990'),
 ('Jane', 'Doe', 'D44563217', 'TU256', '02/02/1979'),
 ('Lucas', 'Rizzo', 'D12345678', 'TU256', '01/01/1960'),
 ('Bojan', 'Bozick', 'C87654321', 'TU44', '08/08/2002'),
 ('Richard', 'Boyd Barrett', 'D77553321', 'TU953', '01/03/1988')]
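
The regex only checks the *shape* of the date, not whether it is a real calendar date. If real dates are needed, one possible follow-up step (not part of the original exercise) is to pass the captured string to `datetime.strptime`. The `impossible` value below is invented for illustration.

```python
import re
from datetime import datetime

date_pattern = re.compile(r"\d{2}/\d{2}/\d{4}")

# A date of birth from the file, parsed as day/month/year
dob = datetime.strptime("01/02/1990", "%d/%m/%Y").date()
print(dob.isoformat())

# A made-up value that has the right shape but is not a real date
impossible = "31/02/1990"
looks_like_date = bool(date_pattern.fullmatch(impossible))

try:
    datetime.strptime(impossible, "%d/%m/%Y")
    is_real_date = True
except ValueError:
    # strptime rejects days that do not exist in the given month
    is_real_date = False

print(looks_like_date, is_real_date)
```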

We've done it. Now we can pull out the values we matched using capturing groups. I'll rewrite the code from test_pattern below just for clarity.

In [10]:
pattern = r"(\w+) ([\w' ]+) has student number ([CD][\d]{8}), is enrolled on (TU[\d]{2,3}).* (\d{2}/\d{2}/\d{4})$"
regex = re.compile(pattern, re.MULTILINE)
matches = regex.findall(text)

for match in matches:
    first, last, std_no, course, dob = match
    print("first: " + first)
    print("last: " + last)
    print("std_no " + std_no)
    print("course " + course)
    print("dob " + dob)
    print()

first: John
last: Doe
std_no D12345665
course TU953
dob 01/02/1990

first: Jane
last: Doe
std_no D44563217
course TU256
dob 02/02/1979

first: Lucas
last: Rizzo
std_no D12345678
course TU256
dob 01/01/1960

first: Bojan
last: Bozick
std_no C87654321
course TU44
dob 08/08/2002

first: Richard
last: Boyd Barrett
std_no D77553321
course TU953
dob 01/03/1988



That's not very neat. If we use a format string (f-string) it'll be easier to print each record nicely on one line.

In [11]:
for match in matches:
    first, last, std_no, course, dob = match
    print(f"first: {first}, last: {last}, std_no: {std_no}, course: {course}, dob: {dob}")

first: John, last: Doe, std_no: D12345665, course: TU953, dob: 01/02/1990
first: Jane, last: Doe, std_no: D44563217, course: TU256, dob: 02/02/1979
first: Lucas, last: Rizzo, std_no: D12345678, course: TU256, dob: 01/01/1960
first: Bojan, last: Bozick, std_no: C87654321, course: TU44, dob: 08/08/2002
first: Richard, last: Boyd Barrett, std_no: D77553321, course: TU953, dob: 01/03/1988
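
Unpacking a tuple works, but it relies on remembering the order of the groups. As an alternative sketch (a different technique from the one used above, not a replacement for it), named groups with `(?P<name>...)` let each match be read as a dictionary. The group names are chosen here for illustration.

```python
import re

# One line copied from student_details.txt as shown above
line = "Richard Boyd Barrett has student number D77553321, is enrolled on TU953 and was born on 01/03/1988"

# Same pattern as before, with each capturing group given a name
pattern = re.compile(
    r"(?P<first>\w+) (?P<last>[\w' ]+) has student number (?P<std_no>[CD]\d{8}), "
    r"is enrolled on (?P<course>TU\d{2,3}).* (?P<dob>\d{2}/\d{2}/\d{4})$",
    re.MULTILINE,
)

# finditer yields match objects; groupdict() maps group names to captured text
records = [match.groupdict() for match in pattern.finditer(line)]

for record in records:
    print(f"{record['first']} {record['last']} ({record['std_no']}) is on {record['course']}")
```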


## Your Turn

Download the file *ftse_100_salaries.txt*. This file contains the name, company and salary of the 20 highest-paid CEOs of FTSE 100 companies. Some of the lines are credits for photographs and should be ignored (if they don't match your regex they'll be skipped automatically).

Your task is to extract the relevant information from each entry in this file and output a string for each in the following format:

```
<name> is <rank>th on the list, working for <company> and making <amount>
```

*Bonus: Output the correct suffix for 2nd and 1st (this isn't really regex specific but can be done through regular Python).*
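
For the bonus, one possible approach is a small helper that picks the suffix from the last digit of the rank. The function name `ordinal` is made up for this sketch, and it handles 3rd and the 11th to 13th exceptions as well, which goes slightly beyond what the bonus asks for.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 end in 1, 2 and 3 but still take "th"
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Ranks in this exercise run from 1 to 20
labels = [ordinal(rank) for rank in (1, 2, 3, 4, 11, 12, 13, 20)]
print(labels)
```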

In [20]:
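# import re  (already imported above)

with open("./ftse_100_salaries.txt", 'rb') as f: # read raw bytes to avoid the 'utf-8' codec error
    text = f.read()

# str() on a bytes object returns its repr, so non-ASCII bytes and line breaks
# show up as literal escapes such as \xa3 (the pound sign) and \n
text = str(text)
text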

'b"Data Scraped from https://www.businessinsider.com/the-20-best-paid-ceos-of-the-ftse-100-2017-8?r=US&IR=T#20-antonio-horta-osorio-lloyds-group-55-million-1\\n--------------------------------------------------------------------------------------------------------\\n20. Antonio Horta Osorio, Lloyds Group - \\xa35.5 million\\nLloyds Banking Group CEO Antonio Horta Osorio poses outside the bank\'s headquarters on his first day back at work after taking a leave of absence due to exhaustion, in the City of London, January 9, 2012. \\nReuters/Andrew Winning\\n19. Stuart Gulliver, HSBC - \\xa35.7 million\\nGonzalo Fuentes/Reuters\\n18. Xavier Rolet, London Stock Exchange Group - \\xa35.7 million\\nOli Scarff/Getty\\n17. Simon Borrows, 3i Group - \\xa35.8 milliion\\n3i/PA Archive/PA Images\\n16. Richard Cousins, Compass Group - \\xa35.8 million\\nCompass Group\\n15. Peter Harrison, Schroders - \\xa36.3 million\\nSchroders\\n14. Peter Crook, Provident Financial - \\xa36.3 million\\nProvident F
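
The doubled backslashes in the output above come from calling `str()` on a bytes object. The sketch below compares that with decoding the bytes. It assumes the file is Latin-1 encoded, which is a guess based on the single byte `\xa3` (the pound sign in Latin-1); the `raw` value is a short stand-in for the file contents.

```python
# A short byte string mimicking one entry (0xA3 is the pound sign in Latin-1)
raw = b"20. Antonio Horta Osorio, Lloyds Group - \xa35.5 million\n"

# str() gives the repr of the bytes: the b'...' wrapper and escapes become plain text
as_repr = str(raw)

# decode() turns the bytes into real characters (assuming Latin-1 encoding)
as_text = raw.decode("latin-1")

print(as_repr)
print(as_text)
```

Under the same assumption, passing `encoding="latin-1"` to `open()` in text mode would give the decoded form directly. Note that a pattern written for one form will not match the other, since the pound sign is either the four characters `\xa3` or the single character `£`.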

In [29]:
# The pound sign appears in text as the literal characters \xa3, so match all of
# "\xa3" and capture only the digits after it (otherwise the 3 ends up in the salary)
pattern = r"\. (\w+) ([\w' ]+), ([\w' ]+) - \\xa3(\d+\.\d)" # from https://regex101.com/
regex = re.compile(pattern) # no ^ or $ in the pattern, so re.MULTILINE is not needed
matches = regex.findall(text)
for match in matches:
    first, last, company, salary = match
    print(f"name: {first} {last}, company: {company}, salary: £{salary} million")

name: Antonio Horta Osorio, company: Lloyds Group, salary: £5.5 million
name: Stuart Gulliver, company: HSBC, salary: £5.7 million
name: Xavier Rolet, company: London Stock Exchange Group, salary: £5.7 million
name: Simon Borrows, company: 3i Group, salary: £5.8 million
name: Richard Cousins, company: Compass Group, salary: £5.8 million
name: Peter Harrison, company: Schroders, salary: £6.3 million
name: Peter Crook, company: Provident Financial, salary: £6.3 million
name: Paul Polman, company: Unilever, salary: £6.7 million
name: Andrew Witty, company: GlaxoSmithKline, salary: £6.8 million
name: Mike Wells, company: Prudential, salary: £6.9 million
name: Ben Van Beurden, company: Royal Dutch Shell, salary: £6.9 million
name: Flemming Ornskov, company: Shire, salary: £7.5 million
name: Nicandro Durante, company: British American Tobacco, salary: £7.6 million
name: Bob Dudley, company: BP, salary: £8.4 million
name: Erik Engstrom, company: RELX, salary: £10.6 million
name