# Chapter 25 - Regular Expressions

-------------------------------

When you are facing a problem with text mining, data processing, finding patterns in large collections of data, or scraping data from web pages, and you explore the Internet to find a solution, you will find that often the very first answer given to such questions is "Why don't you use regular expressions?" or even "Just use regex", without further explanation. These are rather smug answers, as many people have never heard of regular expressions, and those who have might find them scary and incomprehensible. In fact, at first glance they come across as so arcane and confusing that most people would rather shy away from them than delve into them. That is a pity, as regular expressions are a powerful tool that should not be missing from the toolbox of anyone who deals with unstructured data on a regular basis.

In this chapter I will explain how to write and use basic regular expressions with Python. You will find them indeed a powerful way to quickly express and discover complex and diverse patterns in data, providing access to functionalities that would be very hard to implement in vanilla Python. While this chapter does not contain a complete overview of regular expressions, after studying it you will be able to understand and use regular expressions for most, if not all, pattern-matching problems that you encounter in practice, and be confident in telling the uninitiated: "You should use regular expressions to solve your problems." <i>Now you can feel smug too!</i> 

---

## Regular expressions with Python

Regular expressions are text strings that describe a "pattern" that can be found in textual data. For example, the regular expression `a+` describes a pattern that consists of a sequence of one or more times the letter "a". In the string "aardvark" this pattern can be found twice, namely as the double "a" at the start of the string, and the single "a" in the second half of the string. 

A regular expression always consists of a string, which may contain any character. Some characters are "meta-characters" which have a special meaning in regular expressions. You should be careful when using them (how you should use them will be discussed later). The meta-characters are:

<b>`. ^ $ * + ? { } [ ] \ | ( )`</b>

I will discuss how to write regular expressions later in this chapter. First, I need to discuss how to use regular expressions in Python code.

### The `re` module

To use regular expressions in Python, you must import the `re` module.

A regular expression can be considered a piece of code. That code can be "compiled" by the `re` module to produce a "pattern object". That pattern object can then be used to search for the pattern in data. For instance, in the following code, the regular expression `a+` is compiled to produce a pattern stored as `pAplus`, which is then used to search for the pattern in the string "aardvark". It stores the occurrences of the pattern as a list, and prints that list.

In [None]:
import re

pAplus = re.compile( r"a+" )
lAplus = pAplus.findall( "aardvark" )
print( lAplus )

<b>Exercise:</b> You can change the word "aardvark" into something else, and see how that affects the output.

You might be wondering what that letter "`r`" is doing in front of the regular expression string. Why did I write `r"a+"` instead of just `"a+"`? This letter "`r`" tells Python that it should consider the string as "raw data", i.e., it should not try to convert parts of the string according to standard Python string interpretations. This is mainly necessary when the regular expression contains "`\b`", which for regular expressions means "word boundary" (I will get to that later in this chapter), but for Python means "backspace". So it is good practice to always put that "`r`" in front of a regular expression, to avoid problems.
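The effect of the "`r`" can be made visible with a small sketch. The sample sentence is made up for illustration; the point is only that, without the "`r`", Python has already turned `\b` into a backspace character before the `re` module gets to see it.

```python
import re

# Without the r, Python converts \b into one backspace character.
print( len( "\b" ) )     # one character
print( len( r"\b" ) )    # two characters: a backslash and a "b"

text = "the cat sat on the catalog"

# Searches for backspace-cat-backspace, which is not in the text.
plain = re.findall( "\bcat\b", text )
# Searches for "cat" as a whole word.
raw = re.findall( r"\bcat\b", text )

print( plain )
print( raw )
```

The first search comes up empty, while the second one finds the single word "cat" and skips "catalog".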

While it is seldom done in practice, you may add an optional second parameter (a so-called "flag") to the `compile()` call, which indicates a special way to use the created pattern. The parameter `re.I` indicates that the pattern should be used case-insensitively, `re.S` indicates that the period meta-character should also match newlines (which it normally does not), and `re.M` indicates that the meta-characters `^` and `$` should match at the start and end of every line of the text, and not just of the text as a whole. You may combine flags by putting pipe-lines (`|`) between them.
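As a rough sketch of what these flags do, the code below runs a few patterns with and without a flag. The text and the variable names are invented for this example, and some of the meta-characters used (`^`, `.`, and `+`) are only explained later in this chapter.

```python
import re

text = "Aardvarks eat ants.\nants fear aardvarks."

# re.I: upper case and lower case are treated as the same.
lSensitive = re.compile( r"aardvark" ).findall( text )
lInsensitive = re.compile( r"aardvark", re.I ).findall( text )

# re.M: ^ matches at the start of every line, not only at the start of the text.
lStart = re.compile( r"^ants" ).findall( text )
lStartM = re.compile( r"^ants", re.M ).findall( text )

# re.S: the period also matches the newline between the two lines.
lDot = re.compile( r"ants.+ants" ).findall( text )
lDotS = re.compile( r"ants.+ants", re.S ).findall( text )

# Two flags combined with a pipe-line.
lCombined = re.compile( r"^a", re.I | re.M ).findall( text )

print( lSensitive, lInsensitive )
print( lStart, lStartM )
print( lDot, lDotS )
print( lCombined )
```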

### Shorthand

You are allowed to skip the compile-step, and call the pattern search using a function of the `re` module itself. Instead of calling methods of the pattern object that the compilation produced, I can directly call the function of the same name in `re`, and use the regular expression as the first parameter. The code above then becomes:

In [None]:
import re

lAplus = re.findall( r"a+", "aardvark" )
print( lAplus )

If you run this code, you will notice that the output is exactly the same as for the first bit of code. The second approach still compiles the regular expression, but does not store the pattern in a variable of your own. If a pattern is only needed a few times in a program, the second approach is fine. If it is used many times, the first approach is preferred: the pattern is compiled once and you reuse the result. The `re` module does keep a cache of recently compiled patterns, so the time lost with the second approach is usually small, but with the first approach you do not depend on that cache, and the pattern gets a descriptive name.

### Match objects

The `findall()` method used above returns the occurrences of the pattern in the target string. Often you need more information than just the actual patterns; for instance, you might want to know where the pattern occurs in the target string. The `re` module has methods that result in so-called "match objects", which are objects that contain, besides the textual result, more information, such as the index where the result is found in the target string. For example, the `search()` method returns a match object for the first occurrence of a pattern in a string.

In [None]:
import re

m = re.search( r"a+", "Look out for the aardvark!" )
print( "{} is found at index {}".format( m.group(), m.start() ) )

As you can see, the match object has several useful methods. These are:

- `group()` to return the found pattern
- `start()` to return the index at which the pattern starts
- `end()` to return the index just after the last character of the found pattern

The `group()` method has some handy applications which you can control with parameters, which I will get to later.
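A small sketch of how these three methods relate to each other: because `end()` is the index just past the match, slicing the target string from `start()` to `end()` gives back the same text as `group()`. The variable `text` is only introduced here for the illustration.

```python
import re

text = "Look out for the aardvark!"
m = re.search( r"a+", text )
if m:
    print( "group:", m.group() )
    print( "start:", m.start(), "end:", m.end() )
    # The slice below covers exactly the matched text.
    print( "slice:", text[m.start():m.end()] )
```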

The `match()` method is similar to the `search()` method, but checks if the pattern exists at the very start of a string. Both methods will return `None` if the pattern is not found, which as a condition is processed by Python as `False`. 

In [None]:
import re

m = re.match( r"a+", "Look out for the aardvark!" )
if m:
    print( "{} is found at the start of the string".format( m.group() ) )
else:
    print( "The pattern is not found at the start of the string" )    

### Lists of matches

I already showed that the `findall()` method creates a list of occurrences of a pattern in a string. The `finditer()` method is its complement, which produces match objects for the places where the pattern occurs in a string. It does not return an actual list, but an iterator, which hands out the match objects one by one. The best way to process such an iterator is by using the `for m in` approach. For example:

In [None]:
import re

mlist = re.finditer( r"a+", "Look out! A dangerous aardvark is on the loose!" )
for m in mlist:
    print( "{} is found at index {} and ends at index {}.".format( m.group(), m.start(), m.end() ) )

---

## Writing regular expressions

Now that the basics of using regular expressions in Python via the `re` module have been explained, I can get into the actual writing of regular expressions.

### Regular expressions with square brackets

The simplest regular expression is a string of characters, which describes a pattern consisting of exactly that string of characters. You may also describe a set of alternative characters using square brackets `[` and `]`. For instance, the regular expression `[aeiou]` describes any of the characters "a", "e", "i", "o", or "u". This means that if `[aeiou]` is part of a regular expression, at that location in the pattern one of these letters must reside (note: exactly one of them, so not multiple). For instance, to search for the words "ball", "bell", "bill", "boll" and "bull", the regular expression `b[aeiou]ll` can be used.

In [None]:
import re

slist = re.findall( r"b[aeiou]ll", "Bill Gates and Uwe Boll drank Red Bull at a football match in Campbell." )
print( slist )

<b>Exercise:</b> Change the regular expression above so that it not only finds the words "ball" and "bell", but also "Bill", "Boll", and "Bull".

You can use a dash within the square brackets between two characters to indicate that they represent not only these two characters, but also all the characters in between. For instance, the regular expression `[a-dqx-z]` is equivalent to `[abcdqxyz]`. To describe any of the letters of the alphabet, either as capital or lower case, you can use `[A-Za-z]`.

Moreover, if you place a caret (`^`) right next to the opening square bracket, that means that you want the opposite of what is within the square brackets. For instance, `[^0-9]` indicates any character <i>except</i> for a digit.
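The dash and the caret can be tried out with a short sketch like the one below. The test strings are arbitrary and only serve as an illustration.

```python
import re

# A range with dashes finds the same characters as the spelled-out version.
sample = "abcdefqrsxyz"
withDashes = re.findall( r"[a-dqx-z]", sample )
spelledOut = re.findall( r"[abcdqxyz]", sample )
print( withDashes == spelledOut )

text = "Room 42b, Floor 3"
# Any letter, upper case or lower case.
letters = re.findall( r"[A-Za-z]", text )
# Any character that is not a digit (this includes spaces and the comma).
nondigits = re.findall( r"[^0-9]", text )
print( letters )
print( nondigits )
```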

### Special sequences

In a regular expression, just like in strings, the backslash character (`\`) indicates that the character that follows it has a special meaning. The special sequences that hold for strings also hold for regular expressions, but regular expressions have many more. There are also a few meta-characters that are interpreted in a particular way. The following special sequences are defined (there are more, but these are the most common ones):

    \b    Word boundary (zero-width)
    \B    Not a word boundary (zero-width)
    \d    Digit [0-9]
    \D    Not a digit [^0-9]
    \n    Newline
    \r    Carriage return
    \s    Whitespace
    \S    Not a whitespace
    \t    Tabulation
    \w    Alphanumeric character [A-Za-z0-9_]
    \W    Not an alphanumeric character [^A-Za-z0-9_]
    \/    Forward slash
    \\    Backslash
    \"    Double quote
    \'    Single quote
    ^     Start of a string (zero-width)
    $     End of a string (zero-width)
    .     Any character

Note that "zero-width" means that the sequence does not represent a character, but a position in the string between two characters. For instance, the regular expression `^A` represents a string that starts with the letter "A".
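To get a feeling for a few of these special sequences, you can experiment with a sketch such as the following. The text is made up, and the variable names are just chosen for this example.

```python
import re

text = "Call 555-0199 before 9 tonight"

# Three digits in a row.
triples = re.findall( r"\d\d\d", text )
# The first character of each word: \b marks the position, \w the character.
initials = re.findall( r"\b\w", text )
# Every whitespace character.
spaces = re.findall( r"\s", text )
# Zero-width anchors: a "C" at the very start, a "t" at the very end.
startsWithC = re.search( r"^C", text )
endsWithT = re.search( r"t$", text )

print( triples )
print( initials )
print( len( spaces ) )
print( bool( startsWithC ), bool( endsWithT ) )
```

Note that the word boundary between the dash and the "0" also counts, which is why "0" shows up among the initials.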

Moreover, you can place characters between parentheses, in which case the characters are "grouped". Within a group, you can indicate a choice between multiple (sequences of) characters using the pipe-line (`|`). For instance, the regular expression `(apple|banana|orange)` is the string "apple" or the string "banana" or the string "orange".

You should be aware that some of these special sequences (in particular those without a backslash, the parentheses, and the pipe-line) do not work like indicated here when placed within square brackets. For instance, a period within square brackets does not mean "any character", but an actual period.
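A minimal sketch of a choice within a group, and of the different meaning of the period inside and outside square brackets, could look as follows. The sentence is invented for the occasion.

```python
import re

text = "I ate an apple, a banana... and a pineapple."

# A choice between three strings; note that "pineapple" contains "apple".
fruits = re.findall( r"(apple|banana|orange)", text )
# Inside square brackets the period is just a period.
periods = re.findall( r"[.]", text )
# Outside square brackets the period stands for any character.
anything = re.findall( r".", text )

print( fruits )
print( len( periods ), len( anything ), len( text ) )
```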

### Repetition

Where regular expressions get really interesting is when repetitions are used. Several of the meta-characters are used to indicate that (part of) a regular expression is repeated multiple times. In particular, the following repetition operators are often used:

    *      Zero or more times
    +      One or more times
    ?      Zero or one time
    {p,q}  At least p and at most q times
    {p,}   At least p times
    {p}    Exactly p times
    
You place such an operator after the (part of the) expression it repeats. For instance, `ab*c` means the letter "a", followed by zero or more times the letter "b", followed by the letter "c". Thus, it matches the strings "ac", "abc", "abbc", "abbbc", "abbbbc", etc.

When you place a repetition operator after a group (between parentheses), it indicates the repetition of the whole group. For instance, `(ab)*c` matches the strings "c", "abc", "ababc", "abababc", "ababababc", etc.
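The difference between repeating a single character and repeating a group can be checked with a sketch like the one below. It assumes that you want each pattern to cover the whole string, which is why the patterns are wrapped in `^` and `$`; the candidate strings are arbitrary.

```python
import re

candidates = [ "ac", "abc", "abbbc", "ababc", "c" ]

singleMatches = []   # strings that match ab*c as a whole
groupMatches = []    # strings that match (ab)*c as a whole
for s in candidates:
    if re.search( r"^ab*c$", s ):
        singleMatches.append( s )
    if re.search( r"^(ab)*c$", s ):
        groupMatches.append( s )

print( singleMatches )
print( groupMatches )

# Curly brackets: at least two and at most three times the letter "a".
counted = []
for s in [ "a", "aa", "aaa", "aaaa" ]:
    if re.search( r"^a{2,3}$", s ):
        counted.append( s )
print( counted )
```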

Regular expression matching for repetitions is <i>greedy</i>. It will always try to match the earliest occurring pattern first, extended to its longest possible extension. Check out the following code:

In [None]:
import re

mlist = re.finditer( r"ba+", "A sheep says 'baaaaah' to Ali Baba." )
for m in mlist:
    print( "{} is found at {}.".format(m.group(), m.start()))

<b>Exercise:</b> Change the regular expression in the code above so that it finds any "b" followed by one or more "a"s, where the "b" might be capitalized. The output should be "baaaaa", "Ba" and "ba". 

<b>Exercise:</b> Once you have solved the previous exercise, change the regular expression so that it finds the pattern consisting of a "b" or "B" followed by a sequence of one or more "a"s, repeated one or more times. The output should be "baaaaa" and "Baba". You will need to use parentheses for this. When you think that your regular expression is correct, also test it on several other strings.

Here is another one, which searches for occurrences of one or more "a"s:

In [None]:
import re

mlist = re.finditer( r"a+", "A sheep says 'baaaaah' to Ali Baba." )
for m in mlist:
    print( "{} is found at {}.".format(m.group(), m.start()))

When you run this code, you see that it finds four occurrences of the pattern: three times a single "a", and one time a sequence of five "a"s. You might wonder why the pattern matching process does not also find the four "a"s starting at position 16, the three "a"s starting at position 17, the two "a"s starting at position 18, and the single "a" starting at position 19. The reason is that the `finditer()` and `findall()` methods, when they find a match, continue searching immediately after the end of the last found match. Normally, this is the behavior that you want.

<b>Exercise:</b> Now change the `r"a+"` in the code above to `r"a*"`, which changes it to searching for zero or more "a"s. Before running the code, think about what you expect the outcome to be. Then run the code and see if your prediction was correct. If it wasn't, do you now realize why the outcome is what it is? 

Note: You may have noticed that regular expressions tend to become overly complex fast. It is a good idea to comment them so that you can understand them even when you examine your code later.
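One possible way to comment a regular expression, which goes slightly beyond what this chapter covers, is the `re.VERBOSE` flag of the `re` module. With this flag, whitespace in the regular expression is ignored and everything from a `#` to the end of the line is treated as a comment. The sketch below writes the pattern `ab*c` in both styles; the names `pCompact` and `pVerbose` are only chosen for this example.

```python
import re

# The compact version, documented with a normal Python comment:
# an "a", followed by zero or more "b"s, followed by a "c".
pCompact = re.compile( r"ab*c" )

# The same pattern, spread over several lines with comments inside it.
pVerbose = re.compile( r"""
    a     # the letter "a"
    b*    # zero or more times the letter "b"
    c     # the letter "c"
    """, re.VERBOSE )

text = "ac abc abbbc abd"
print( pCompact.findall( text ) )
print( pVerbose.findall( text ) )
```

Both patterns should produce the same list. For a pattern this short a single Python comment is enough; the verbose style only starts to pay off when expressions get long.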

### Practice

<b>Exercise:</b> With all you learned until now, you should be able to do the following exercise. It is wise to solve this one before continuing with the remainder of this chapter. The exercise consists of a piece of code that you have to complete.

When you run the code below, it tries to search for all the regular expressions in `relist`, in all the strings in `slist`. It prints for each string the numbers of all the regular expressions for which matches are found.

Your goal is to fill in the regular expressions in `relist` according to the specification in the comments to the right of each expression. The first regular expression is already filled in. Note that it starts with a caret and ends in a dollar sign, which indicates that the expression should match the string from the start to the end. Several of the other expressions will also need such an indication.

In [None]:
import re

# List of strings used for testing.
slist = [
    "aaabbb",
    "aaaaaa",
    "abbaba",
    "aaa",
    "gErbil ottEr",
    "tango samba rumba",
    " hello world ",
    " Hello World "
]

# List of regular expressions to be completed by the student.
relist = [
    r"^a*b*$",          # 1. Only a's, followed by only b's, including empty string
    r"",                # 2. Only a's, including the empty string
    r"",                # 3. Only a's and b's, in any order, including the empty string
    r"",                # 4. Exactly three a's
    r"",                # 5. Neither a's nor b's, but empty string allowed
    r"",                # 6. An even number of a's (and nothing else)
    r"",                # 7. A string consisting of exactly two words, regardless of whitespaces
    r"",                # 8. A string that contains a word that ends in "ba"
    r""                 # 9. A string that contains a word that starts with a capital
]

for s in slist:
    print( s, ':', sep='', end=' ' )
    for i in range(len(relist)):
        m = re.search( relist[i], s )
        if m:
            print( i+1, end=' ')
    print()

The correct output is: 

    aaabbb: 1 3
    aaaaaa: 1 2 3 6
    abbaba: 3 8
    aaa: 1 2 3 4
    gErbil ottEr: 7
    tango samba rumba: 8
     hello world : 5 7
     Hello World : 5 7 9
     
Make sure that you can do all of these correctly before you continue!

---

## Grouping

As shown above, when parentheses are used in regular expressions, they create so-called "groups". Take for instance the regular expression `(\d{1,2})-(\d{1,2})-(\d{4})`, which describes a sequence that could represent a date: one or two digits, followed by a dash, followed by one or two digits, followed by a dash, followed by four digits (if you do not understand this regular expression, check back in previous sections of this chapter until you do understand it). This expression contains three groups: the first containing one or two digits, the second containing one or two digits, and the third one containing the four digits at the end. The code below searches for this pattern in a string.

In [None]:
import re

pDate = re.compile( r"(\d{1,2})-(\d{1,2})-(\d{4})" )
m = pDate.search( "In response to your letter of 25-3-2015, \
I decided to hire a hitman to get you." )
if m:
    print( "Date is {}; day is {}; month is {}; year is {}".format( 
            m.group(0), m.group(1), m.group(2), m.group(3) ) )

When you run the code, you see that it not only gets out the result as a whole (using the method `group()` or `group(0)`), but that you can also access each of the groups that is found in the result, using methods `group(1)` for the day, `group(2)` for the month, and `group(3)` for the year. You can also use the method `groups()` to get a tuple with all the groups.
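As a small illustration of `groups()`, the sketch below unpacks the tuple into three variables. The sentence and the variable names are made up; note that the groups are strings, so they have to be converted if you want to calculate with them.

```python
import re

m = re.search( r"(\d{1,2})-(\d{1,2})-(\d{4})", "Your letter of 25-3-2015 has arrived." )
if m:
    # groups() returns all the groups as a tuple of strings.
    print( m.groups() )
    day, month, year = m.groups()
    print( "Next year is {}".format( int( year ) + 1 ) )
```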

### `findall()` and groups

The `findall()` method returns a list of the occurrences of the pattern. In the examples where it was used until now, it returned a list of strings. Indeed, the items in the list are strings if there is at most one group in the regular expression (if there is exactly one group, each string holds just the contents of that group rather than the whole match). If there are multiple groups, the items are tuples that contain all the groups.

In [None]:
import re

pDate = re.compile( r"(\d{1,2})-(\d{1,2})-(\d{4})" )
datelist = pDate.findall( "In response to your letter of 25-3-2015, \
on 27-3-2015 I decided to hire a hitman to get you." )
for date in datelist:
    print( date )
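To see how the number of groups changes what `findall()` returns, you can compare three variants of the date pattern, as in the sketch below. The text and the variable names are chosen for this illustration only.

```python
import re

text = "Letters of 25-3-2015 and 27-3-2015."

# No groups: a list of strings, each string being a complete match.
noGroup = re.findall( r"\d{1,2}-\d{1,2}-\d{4}", text )
# One group: a list of strings, each string being the contents of that group.
oneGroup = re.findall( r"\d{1,2}-\d{1,2}-(\d{4})", text )
# Multiple groups: a list of tuples, one tuple per match.
threeGroups = re.findall( r"(\d{1,2})-(\d{1,2})-(\d{4})", text )

print( noGroup )
print( oneGroup )
print( threeGroups )
```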

### Named groups

It is possible to give each group a name, by placing the construct `?P<name>` (where you replace "name" with the name you want the group to have) immediately after the opening parenthesis. You can then refer to the groups by these names, instead of their index.

In [None]:
import re

pDate = re.compile( r"(?P<day>\d{1,2})-(?P<month>\d{1,2})-(?P<year>\d{4})" )
m = pDate.search( "In response to your letter of 25-3-2015, \
I decided to hire a singing telegram to get you." )
if m:
    print( "day is {}".format( m.group('day') ) )
    print( "month is {}".format( m.group('month') ) )
    print( "year is {}".format( m.group('year') ) )

### Referring within a regular expression

Suppose that you have to create a regular expression that represents a string that contains an arbitrary non-space character twice. For instance, the string "python" would not have a match, but the string "expression" would (as it contains two "e"s and two "s"s). This cannot be done with the regular expression features that we discussed until now. It can be solved, however, with groups, and special references within a regular expression, namely as follows: using the special sequence `\x`, whereby `x` is a number, you refer to the group with index `x` in the match. Thus, a regular expression that represents a string with an arbitrary non-space character twice is `(\S).*\1`.

Since at this point this regular expression might still be a bit hard to understand, let's look at it in depth. The `\S` is a special sequence that represents a non-space character. Putting it in parentheses turns it into a group, and since this is the first (and only) group in the expression, its index is 1. The `.*` represents a sequence of zero or more characters, which can be anything (the period is a meta-character that represents any character). Finally, the `\1` refers to the first group, and says that here you want to have exactly the same thing as the first group represents. If you are wondering why you do not need to represent anything that can be placed before the `\S`, or anything that can come after the `\1`, then the answer is that you are not specifying that this regular expression represents a string as a whole, so as long as it occurs anywhere in the string, it matches.
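The walk-through above can be followed step by step with a sketch like the one below, which tries the pattern on two words that I picked for illustration: one without any repeated character, and one with repeated characters.

```python
import re

results = {}
for word in [ "python", "expression" ]:
    m = re.search( r"(\S).*\1", word )
    if m:
        # group(1) is the repeated character, group() the whole stretch
        # from its first occurrence up to a later occurrence.
        results[word] = ( m.group( 1 ), m.group() )
    else:
        results[word] = None
    print( word, results[word] )
```

For "expression" the search starts at the first "e", and because `.*` is greedy it stretches to the last "e" in the word, so the whole match is "expre".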

Test this pattern with the code below, by replacing the string "Monty Python's Flying Circus" with different strings, and running the code to examine the results.

In [None]:
import re

m = re.search( r"(\S).*\1", "Monty Python's Flying Circus" )
if m:
    print( "The character {} occurs twice in the string".format( m.group(1) ) )
else:
    print( "No match was found." )

<b>Exercise:</b> Can you change the regular expression in the code above so that it checks if the string contains a character at least three times? 

<b>Optional exercise:</b> Can you change the regular expression so that it checks whether it contains at least two characters twice? This is quite hard and therefore optional, but if you try to do it, make sure that you test it with at least the strings "aaaa", "aabb", "abab" and "abba". These all should match, unless you also want the two repeated characters to be different, in which case "aaaa" should not match (but note that this makes the regular expression even harder to design).

---

## Replacing

While regular expressions are mainly used for searching, you can also use regular expressions to replace substrings in a string with different substrings. This is done with the `sub()` method. `sub()` gets as arguments the to-be-replaced pattern, the replacement, and the string. The `sub()` method returns the new string (remember that strings are immutable, so `sub()` will not actually change your original string, even if it is stored in a variable; you will have to store its return value if you want access to the new string). 

The replacement is usually just a string, but it may contain references to groups in the original pattern. Although the `\x` format shown before also works in a replacement, it is better to use a different format here. If you want to refer to group `x` in the pattern (`x` being a number), you write `\g<x>`. The reason for the difference is disambiguation; it allows you to distinguish a reference to, for instance, group 2 followed by a character zero, from a reference to group 20. Just like the pattern, the replacement is best written as a raw string.

In [None]:
import re

s = re.sub( r"([iy])se", r"\g<1>ze", "Whether you categorise, emphasise, or analyse, \
you should use American spelling!" )
print( s )
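The disambiguation that `\g<x>` offers can be illustrated with a sketch like the following, in which a group reference is immediately followed by a digit. The input strings and variable names are invented for the example.

```python
import re

# Put a zero after every digit: group 1, followed by the character "0".
# Written as \10 this would be read as a reference to group 10.
tens = re.sub( r"(\d)", r"\g<1>0", "1 2 3" )
print( tens )

# References may appear in any order, for instance to reorder a date.
reordered = re.sub( r"(\d{1,2})-(\d{1,2})-(\d{4})", r"\g<3>-\g<2>-\g<1>", "25-3-2015" )
print( reordered )
```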

---

## What you learned

In this chapter you learned about:

- What regular expressions are
- Which meta-characters can be used in regular expressions
- How to use regular expressions in Python, using the `re` module
- Compiling regular expressions with `re.compile()`
- What match objects are
- Searching for patterns using `match()`, `search()`, `findall()`, and `finditer()`
- Replacing patterns using the `sub()` method
- Using square brackets in regular expressions to represent different possibilities for characters
- Using special sequences in regular expressions, many of which use the backslash character
- Repeating sub-patterns in regular expressions using repetition operators
- Grouping of sub-patterns using parentheses
- Using groups to unravel results
- Referencing within patterns
- Smugly referring people with pattern mining problems to regular expressions

---

## Exercises

### Exercise 25.1

Assume that a word consists of only letters from the alphabet (upper case or lower case). Below, write some code that uses a regular expression to make a list of all the words in the sentence that is given.

In [None]:
# List words.
import re

sentence = "The price of a 2-room apartment in Manhattan starts at 1 \
million dollars, and may actually be the 10-fold of that on 42nd Street."


### Exercise 25.2

Using a regular expression and the `findall()` method, create a list of all the occurrences of the word "the" in the sentence given in the code block below. Print the number of items in the list. Your code should handle the problem in a case-insensitive way. Note: the outcome should be 2.

In [None]:
# List occurrences.
import re

sentence = "The word ether can be found in my thesaurus using the \
archaic spelling 'aether'."


### Exercise 25.3

A person's full name consists of two words, next to each other, consisting of only letters from the alphabet, all lower case except for the first one, which is upper case. Between the two words there should only be whitespaces. The words start and end at a word boundary. E.g., according to this specification `Cardinal Richelieu` is a name, but `Charles d'Artagnan` is not, and neither is `Gilbert duPrez`, `Joe DiMaggio`, or `Unit X1138`. Under this assumption, use a regular expression to list all the two-word combinations in the sentence below which are probably names of persons. Note: If you want to know the rest of the joke you'll have to ask me.

In [None]:
# List names.
import re

sentence = "Michael Jordan, Bill Gates, and the Dalai Lama decided to \
take a plane trip together, when they spotted a hippy next to the runway."


### Exercise 25.4

As a follow-up to the previous exercise, now assume that a person's name consists of two or more words that meet the criteria spelled out above. Use a regular expression to extract all names from the sentence below.

In [None]:
# List long names.
import re

sentence = "William Randolph Hearst attempted to destroy all copies of \
Orson Welles' masterpiece 'Citizen Kane', because he did not appreciate \
the fact that the protagonist Charles Foster Kane was a thinly \
disguised caricature of himself. I wonder whether William Henry Gates \
The Third would attempt to do the same."


### Exercise 25.5

When a person speaks in a piece of text, this is often represented by enclosing the spoken part within double quotes. Write a regular expression that extracts all the spoken parts from the sentence below. Hint: Use groups, and remember that regular expressions are greedy.

In [None]:
# Quoted strings.
import re

sentence = "Client: \"I wish to register a complaint! Hello miss!\"\n\
Shopkeeper: \"What do you mean, miss?\"\n\
Client: \"I am sorry, I have a cold.\"\n"


### Exercise 25.6

When scraping data from HTML pages, you can often find the items you are interested in by looking for mark-ups. Suppose that we have a page with data of persons, who have an ID and a name. The ID is a nine-digit number, and has a marker `<id>` in front of it, and a marker `</id>` after it. The name belonging to the ID will follow immediately after the ID, and has a marker `<name>` in front of it, and a marker `</name>` after it. Use a regular expression to extract all the IDs and corresponding names from the text below, and print them. There should be five of them.

In [None]:
# Scraping.
import re

text = "<html><head><title>List of persons with ids</title></head><body>\
<p><id>123123123</id><name>Groucho Marx</name>\
<p><id>123123124</id><name>Harpo Marx</name>\
<p><id>123123125</id><name>Chico Marx</name>\
<randomcrap>Etaoin<id>Shrdlu</id>qwerty</name></randomcrap>\
<nocrap><p><id>123123126</id><name>Zeppo Marx</name></nocrap>\
<address>Chicago</address>\
<morerandomcrap><id>999999999</id>nonametobeseen!</morerandomcrap>\
<p><id>123123127</id><name>Gummo Marx</name>\
<note>Look him up on <a href=\"http://www.google.com\">Google.</a></note>\
</body></html>"


---

End of Chapter 25. Version 1.2.