#Mangle Data Like A Pro

##Match with Regular Expressions

In text processing applications, one of the most common tasks we need to perform is finding out whether a string is present in a body of text.  We may then want to replace it with another string. Sometimes we also want to ensure that a piece of data follows a particular pattern. These tasks can be done quickly and efficiently using regular expressions, a powerful pattern matching toolset.  

Regular expressions are really a mathematical language.  You use a regular expression to describe a type of string you are interested in.  The expression is a pattern and it defines a class of strings that match that pattern.  Once you have a regular expression, you can then try to find the string that you're looking for in another, possibly much larger, string (the source). 

Because of their flexibility, regular expressions pop up in many areas of computer science, and are used in many programming languages.  Python provides the `re` module in its standard library to implement regular expressions and their many uses. Consider this piece of code: 

In [3]:
import re
source = "To be or not to be, that is the question"
pattern = "To be"

result = re.match(pattern, source)
print(result)

<_sre.SRE_Match object; span=(0, 5), match='To be'>


In this case, the regular expression is just 'To be', which is pretty much the simplest regular expression we can create. The match function looks at the start of the source to see if the pattern is there.  In other words, it checks whether the source string begins with the pattern.

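You can precompile a regular expression with `re.compile()`. If you use the same regular expression many times, this saves the module from looking the pattern up again on every call and gives you a pattern object you can pass around. The `re` module also caches recently used patterns, so the speedup is often modest.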

In [2]:
compiled_pattern = re.compile("To be")

You can then use the compiled pattern to execute the match query:

In [3]:
result_compiled = compiled_pattern.match(source)
print(result_compiled)

<_sre.SRE_Match object; span=(0, 5), match='To be'>
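As a rough sketch of where precompiling is convenient, the loop below reuses one compiled pattern against several strings. The `lines` list and the name `starts_with_to_be` are made up for illustration and are not part of the example above.

```python
import re

# Hypothetical sample data, not taken from the text above
lines = ["To be or not to be", "To be continued", "Not to be"]

# Compile the pattern once, outside the loop
starts_with_to_be = re.compile("To be")

# Reuse the compiled pattern for every line; match() only succeeds
# when the pattern is found at the start of the line
matching = [line for line in lines if starts_with_to_be.match(line)]
print(matching)
```

Only the first two lines are kept, because the third one does not begin with "To be".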


Besides `match()`, the `re` module offers a few other ways to find patterns in strings:

- search(): returns the first match
- findall(): returns a list of all non-overlapping matches
- split(): returns a list of strings that were split from the source string at the locations where the pattern was found
- sub(): returns a string that has replaced all instances of the pattern with a replacement string passed into the function

###Exact match with match()

In the previous example we used the `match()` function to determine whether a given pattern matches the beginning of another string. Let's look at a typical coding snippet that is often used in processing regular expressions: 

In [4]:
import re
source = "To be or not to be, that is the question"

# We are creating a compiled pattern that we can use in multiple searches
compiled_pattern = re.compile("To be") 

# Now we are searching to see if the compiled_pattern is in the front of the source text
m = compiled_pattern.match(source)

if m: # if there is a match
    print(m.group()) # let's print out what was matched

To be


What would happen if we tried to match using the pattern "that is"?

In [5]:
middle_pattern = re.compile("that is")
m = middle_pattern.match(source)

if m:
    print(m.group())

This time the snippet prints nothing, because the pattern "that is" is not at the beginning of the string. If we wanted to know whether "that is" appears anywhere in the string, we could make a small change to our regular expression:

In [6]:
middle_pattern_with_wildcard = re.compile(".*that is")
m = middle_pattern_with_wildcard.match(source)
if m: 
    print(m.group())

To be or not to be, that is


Notice the extra characters we added to the beginning of our pattern, `".*that is"`. Let's explain what these mean one step at a time.

- "." matches any single character
- "*" matches any number of the previous character
- So the combination ".*" means any number of any character (even zero)
- "that is" is the string we would like to match

So going through the source string "To be or not to be, that is the question", the pattern matches in this way:

"**To be or not to be, **that is the question" matches the **".\*"** part

"To be or not to be, **that is** the question" matches the **"that is"**

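One detail worth seeing in action is that ".\*" is greedy: it takes as many characters as it can while still letting the rest of the pattern match. The sketch below uses a made-up string, `repeated`, in which "that is" appears twice, and compares ".\*" with its non-greedy form ".\*?".

```python
import re

# Hypothetical source in which "that is" appears twice
repeated = "that is what that is"

greedy = re.match(".*that is", repeated)
lazy = re.match(".*?that is", repeated)

print(greedy.group())  # ".*" takes as much as it can: the whole string
print(lazy.group())    # ".*?" takes as little as it can: just the first "that is"
```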

###First match with search()

You can use search to find a pattern anywhere within a source string without needing wildcards:

In [7]:
middle_pattern = re.compile("that is")
m = middle_pattern.search(source)

if m:
    print(m.group())

that is
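The match object returned by `search()` also records where the match was found. As a small sketch, assuming the same `source` string used throughout this section, `start()` and `end()` give the position of the match:

```python
import re

source = "To be or not to be, that is the question"
m = re.search("that is", source)

if m:
    # Index where the match begins, and the index just past its end
    print(m.start(), m.end())
    # Slicing the source with those indexes gives back the matched text
    print(source[m.start():m.end()])
```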


###All matches with findall()

You can use findall() to find all of the instances where the source string  matches the pattern:

In [8]:
n_pattern = re.compile("n")  # Let's find all of the n's in the source string
m = n_pattern.findall(source)
print("Found", len(m), "matches")
print(m)

Found 2 matches
['n', 'n']


How about 'n' followed by any character?

In [9]:
n_and_character_pattern = re.compile("n.")
m = n_and_character_pattern.findall(source)
print("Found", len(m), "matches")

Found 1 matches


Note how `n_and_character_pattern` found only one match. That's because the last "n" in the source string does not have a character after it: the string just terminates. If we want to find every "n", whether or not a character follows it, we can do the following:

In [10]:
n_and_character_optional_pattern = re.compile("n.?")
m = n_and_character_optional_pattern.findall(source)
print("Found", len(m), "matches")

Found 2 matches


As you can see, the "?" means that the previous character is optional.
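A common use of "?" is to accept two spellings of the same word. The sample sentence below is made up for illustration:

```python
import re

# Hypothetical sample text containing both spellings
text = "The color red and the colour blue"

# "u?" makes the "u" optional, so both spellings match
spellings = re.findall("colou?r", text)
print(spellings)
```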

Here are some other common tasks you can perform with regular expressions.

###Split at matches with split()

You can split a string at a pattern's matches (like the normal split method would do):

In [11]:
n_pattern = re.compile("n")
m = n_pattern.split(source)
print(m)

['To be or ', 'ot to be, that is the questio', '']
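Where `re.split()` goes beyond the normal string split method is when the separator itself varies. The following is a minimal sketch, assuming a made-up input string `messy` with inconsistent separators:

```python
import re

# Hypothetical input with inconsistent separators
messy = "apples, oranges;bananas ,  pears"

# Split on a comma or a semicolon, together with any whitespace around it
fruits = re.split(r"\s*[,;]\s*", messy)
print(fruits)
```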


###Replace at matches with sub()

You can replace a match with another string, like the string replace method does:

In [12]:
m = n_pattern.sub('?', source)
print(m)

To be or ?ot to be, that is the questio?
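Unlike the string replace method, `sub()` can replace text that varies from match to match. As a small sketch with a made-up string `spaced`, the pattern below collapses every run of whitespace into a single space, and the optional `count` argument limits how many replacements are made:

```python
import re

# Hypothetical text with irregular spacing
spaced = "To   be  or not    to be"

# Replace each run of whitespace with a single space
collapsed = re.sub(r"\s+", " ", spaced)
print(collapsed)

# With count=1, only the first run of whitespace is replaced
first_only = re.sub(r"\s+", " ", spaced, count=1)
print(first_only)
```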


##Defining Patterns

We've seen some of the important functions in Python's `re` module, including match(), search(), findall(), split(), and sub(). We've seen simple patterns that used the following features:

- Matching literal strings like "that is"
- Matching any single character other than \n with "."
- Matching any number of the previous character with "*"
- Matching an optional previous character with "?"

Now we are going to learn how to build more powerful patterns that can help you match almost any type of string that you can think of.

###Special characters

The `re` module provides a set of character sequences that begin with a backslash for use in regular expressions.  Each of these matches a common set of useful characters.

| Pattern | Matches                                                  |
|---------|----------------------------------------------------------|
| \d      | a single digit                                           |
| \D      | a single non-digit                                       |
| \w      | an alphanumeric character or underscore                  |
| \W      | a character that is not alphanumeric or underscore       |
| \s      | a whitespace character                                   |
| \S      | a non-whitespace character                               |
| \b      | a word boundary (between a \w and a \W, in either order) |
| \B      | a non-word boundary                                      |

To put this to work we are going to use the printable variable in the string module.  This is a string that contains 100 printable ASCII characters.

In [8]:
import string
printable = string.printable
print(printable)

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	



We can use regular expressions to find all the digits, all the alphanumerics, and all the whitespace characters in this string.

In [11]:
re.findall(r"\d", printable)  # the r prefix keeps the backslash intact for re

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [10]:
print(re.findall(r"\w", printable))  # the r prefix keeps the backslash intact for re

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_']


In [15]:
re.findall(r"\s", printable)  # the r prefix keeps the backslash intact for re

[' ', '\t', '\n', '\r', '\x0b', '\x0c']
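The boundary sequences from the table are easier to appreciate on ordinary text. The sketch below reuses the `source` sentence from earlier in this section. The patterns are written with an `r` prefix so that Python passes the backslashes through to `re` unchanged.

```python
import re

source = "To be or not to be, that is the question"

# \b on both sides means "be" only matches as a whole word
whole_words = re.findall(r"\bbe\b", source)
print(whole_words)

# \W+ matches runs of non-word characters, so splitting on it leaves the words
words = re.split(r"\W+", source)
print(words)
```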

###Using specifiers

We've seen three characters that have a special meaning in regular expressions: "\*", ".", and "?".  These are examples of what we call specifiers, special characters that modify the meaning of a regular expression and help us create more flexible patterns.  There are several more specifiers that you should know about:

| Pattern          | Matches                                      |
|------------------|----------------------------------------------|
| abc              | literal abc                                  |
| ( expr )         | expr                                         |
| expr1 &#124; expr2    | expr1 or expr2                               |
| .                | any character except \n                      |
| ^                | start of source string                       |
| $                | end of source string                         |
| prev ?           | zero or one prev                             |
| prev *           | zero or more prev, as many as possible       |
| prev *?          | zero or more prev, as few as possible        |
| prev +           | one or more prev, as many as possible        |
| prev +?          | one or more prev, as few as possible         |
| prev { m }       | m consecutive prev                           |
| prev { m, n }    | m to n consecutive prev, as many as possible |
| prev { m, n }?   | m to n consecutive prev, as few as possible  |
| [ abc ]          | a or b or c (same as a&#124;b&#124;c)                  |
| [^ abc ]         | not (a or b or c)                            |
| prev (?= next )  | prev if followed by next                     |
| prev (?! next )  | prev if not followed by next                 |
| (?<= prev ) next | next if preceded by prev                     |
| (?<! prev ) next | next if not preceded by prev                 |
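To make a few rows of this table concrete, here is a short sketch that tries several specifiers against the `source` sentence used earlier. The particular patterns are chosen only for illustration.

```python
import re

source = "To be or not to be, that is the question"

print(re.findall(r"^To", source))         # "To", but only at the start of the source
print(re.findall(r"question$", source))   # "question", but only at the end
print(re.findall(r"not|that", source))    # either "not" or "that"
print(re.findall(r"[aeiou]{2}", source))  # two consecutive vowels
print(re.findall(r"be(?=,)", source))     # "be" only if followed by a comma
```

The last pattern finds a single "be": the lookahead `(?=,)` checks for the comma without including it in the match.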

Regular expressions are not for the faint of heart. They are extremely powerful and can make things like finding a US phone number in the middle of a large section of text easy. But you need to write the correct regular expression, which can sometimes be a difficult task.

Let's try a more realistic example.  Suppose you want to extract all of the US telephone numbers from a source of text. The source might look like this:

In [16]:
large_source = """
Hi Bianca,
It was great to talk to you about regular expressions. I really understand
them more than I ever had before. Would you like to work on the next project
together? My number is 650-555-3948. Thanks and talk to you soon!

-Mary
"""

What we want to do is build a pattern that would do the following: 

- match three digits
- followed by a dash
- then match three more digits
- followed by a dash
- and then match four digits

Here's a regular expression to do exactly that:

In [17]:
phone_number_pattern = re.compile(r'[0123456789]{3}-[0123456789]{3}-[0123456789]{4}')
m = phone_number_pattern.findall(large_source)
print(m)

['650-555-3948']


One important side note: notice that I prefixed the string containing the regular expression with `r`, as in `r'...'`. This is called a raw string literal. It is similar to a regular string literal, but Python does not process backslash escape sequences when it reads it.  For example, `r'\b'` stays as a backslash followed by a "b", which `re` treats as a word boundary, whereas the regular string literal `'\b'` is replaced by a single backspace character. It is safe practice to write all of your patterns as raw strings to prevent this type of substitution.
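The difference is easy to check for yourself. This sketch compares a regular and a raw string literal containing `\b`, and then shows what happens when each is used as a pattern on the `source` sentence from earlier:

```python
import re

source = "To be or not to be, that is the question"

# In a regular string literal, \b becomes one backspace character
print(len("\b"))

# In a raw string literal, the backslash and the "b" are both kept
print(len(r"\b"))

# The regular string searches for backspace characters, so nothing is found
print(re.findall("\bbe\b", source))

# The raw string reaches re as a word boundary, so both "be" words are found
print(re.findall(r"\bbe\b", source))
```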

In our phone number regular expression, we used the square bracket notation to mean 0 or 1 or 2, and so on, and the curly bracket to designate how many of that pattern we would like to match. We could have also written this regular expression more compactly as follows:

In [18]:
phone_number_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
m = phone_number_pattern.findall(large_source)
print(m)

['650-555-3948']


As you can see, using the `\d` special character makes this expression much easier to read. 
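Real text rarely sticks to one format. As a hedged sketch, the pattern below also accepts a dot or a space between the groups of digits. The `contacts` string and the name `flexible_pattern` are made up for illustration, and the pattern is not meant to cover every way a US phone number can be written.

```python
import re

# Hypothetical text with phone numbers written in a few formats
contacts = "Call 650-555-3948, 415.555.0134 or 408 555 0199."

# [-. ] accepts a dash, a dot, or a space between the digit groups
flexible_pattern = re.compile(r"\d{3}[-. ]\d{3}[-. ]\d{4}")
numbers = flexible_pattern.findall(contacts)
print(numbers)
```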

As you learn to write regular expressions, you can try using [Pythex](http://pythex.org/).  This handy website lets you test out your Python regular expressions on the fly so that you can be sure that your pattern is matching what you expect it to match.

###Specifying match output

By inserting parentheses into your regular expression, you can designate individual pieces of the string that you are interested in.  After you use the match() or search() functions, you can then retrieve these pieces using the `groups()` method.

As an example, let's improve our phone number regular expression by providing the ability to grab the area code and the rest of the number:

In [19]:
phone_number_pattern = re.compile(r'(\d{3})-(\d{3}-\d{4})')
m = phone_number_pattern.search(large_source)

if m:
    print(m.group())
    print(m.groups())

650-555-3948
('650', '555-3948')


`m.groups()` returns a tuple of all of the groups that were captured, based on where you placed the parentheses. You can also name the groups for easy retrieval:

In [20]:
phone_number_pattern = re.compile(r'(?P<areacode>\d{3})-(?P<number>\d{3}-\d{4})')
m = phone_number_pattern.search(large_source)

if m:
    print(m.group("areacode"))
    print(m.group("number"))

650
555-3948
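Groups also work when a text contains more than one match. The sketch below, which uses a made-up `text` string with two phone numbers, collects the named groups of every match with `finditer()` and `groupdict()`. It also shows that `findall()` returns tuples of the groups when the pattern contains them.

```python
import re

# Hypothetical text containing two phone numbers
text = "Mary: 650-555-3948, Bianca: 415-555-0134"

phone_number_pattern = re.compile(r"(?P<areacode>\d{3})-(?P<number>\d{3}-\d{4})")

# finditer() yields one match object per match; groupdict() maps
# each group name to the text it captured
found = [m.groupdict() for m in phone_number_pattern.finditer(text)]
print(found)

# When the pattern contains groups, findall() returns tuples of the groups
pairs = phone_number_pattern.findall(text)
print(pairs)
```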


Regular expressions are an entire language in themselves.  Because of their flexibility, they show up in many areas of computer science and feature in many programming languages and tools, including Bash scripts.  It takes a long time to become fluent in regular expressions, but a little bit of practice will help you the next time you encounter them.