#Mangle Data Like a Pro

##Match With Regular Expressions

In text processing we often need to find out whether a string is present in a text and, if so, replace it with another string. Sometimes we also want to ensure that a piece of data follows a particular pattern. Both can be done quickly and efficiently with regular expressions, a powerful pattern-matching tool set that lets you describe the kind of string you are looking for and then find it in a (usually) larger string, the source.

Python provides the `re` module to implement regular expressions and their many uses. You first need to define a pattern that you want to search for in the source. Consider the following code.

In [1]:
import re
source = "To be or not to be, that is the question"
pattern = "To be"

result = re.match(pattern, source)
print(result)

<_sre.SRE_Match object; span=(0, 5), match='To be'>


The `match()` function checks whether the source string begins with the pattern. Here it does, so `match()` returns a match object whose span, (0, 5), covers the first five characters of the source.

You can precompile a pattern to speed up the query. If you use the same regular expression many times, this approach can improve performance.

In [2]:
compiled_pattern = re.compile("To be")

You can then use the compiled pattern to execute the match query.

In [3]:
result_compiled = compiled_pattern.match(source)
print(result_compiled)

<_sre.SRE_Match object; span=(0, 5), match='To be'>

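As a rough sketch of where precompiling fits, the snippet below compiles a pattern once and reuses it across several strings. The list `lines` and the name `starts_with_to_be` are made up for this illustration; they are not part of the example above.

```python
import re

# Hypothetical input: several strings to test against the same pattern
lines = [
    "To be or not to be",
    "Not to be",
    "To be continued",
]

# Compile once, outside the loop
starts_with_to_be = re.compile("To be")

# Reuse the compiled pattern for every string
matching_lines = [line for line in lines if starts_with_to_be.match(line)]
print(matching_lines)
```

Keep in mind that the `re` module also caches recently used patterns internally, so the gain from compiling by hand may be modest in small programs.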

Besides `match()`, there are a few other ways to use the `re` package to find patterns in strings:

- search(): returns the first match
- findall(): returns a list of all nonoverlapping matches
- split(): returns a list of strings that were split from the source string at the locations where the pattern was found
- sub(): returns a string that has replaced all instances of the pattern with the replacement string passed into the function

###Exact match with match()

In the previous example we used the `match()` function to determine whether a given pattern matches the beginning of another string. Let's look at a typical snippet for working with regular expressions.

In [4]:
import re
source = "To be or not to be, that is the question"

# We are creating a compiled pattern that we can use in multiple searches
compiled_pattern = re.compile("To be") 

# Now we check whether the source text starts with compiled_pattern
m = compiled_pattern.match(source)

if m: # if there is a match
    print(m.group()) # let's print out what was matched

To be


What happens if we try to match the pattern "that is"?

In [5]:
middle_pattern = re.compile("that is")
m = middle_pattern.match(source)

if m:
    print(m.group())

This time the snippet printed nothing because the pattern "that is" is not at the beginning of the source string, so `match()` returned `None`. However, let's look at the following example.

In [6]:
middle_pattern_with_wildcard = re.compile(".*that is")
m = middle_pattern_with_wildcard.match(source)
if m: 
    print(m.group())

To be or not to be, that is


Why did the pattern `".*that is"` match with the source string? Let's examine our new pattern:

- "." matches any single character
- "*" matches any number of the previous character
- So the combination ".*" means any number of any character (even zero)
- "that is" is the string we would like to match

Going through the source string "To be or not to be, that is the question", the pattern matches in this way:

"**To be or not to be,** that is the question": the bold text, plus the space after it, matches the **".\*"** part

"To be or not to be, **that is** the question": the bold text matches the **"that is"** part

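To check this breakdown, we can ask the match object where the match starts and ends. The sketch below uses the match object's `span()` method. The parentheses around `.*` are an addition made here only so that we can see what that part consumed; capturing with parentheses is covered later in this chapter.

```python
import re

source = "To be or not to be, that is the question"

# Parentheses capture what ".*" matched so we can inspect it separately
m = re.match("(.*)that is", source)

if m:
    print(m.span())          # (start, end) of the whole match
    print(repr(m.group(1)))  # the text consumed by ".*"
```

The span starts at 0 because `match()` always begins at the start of the source, and the captured text includes the space after the comma.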

###First match with search()

You can use `search()` to find a pattern anywhere within a source string without needing wildcards.

In [7]:
middle_pattern = re.compile("that is")
m = middle_pattern.search(source)

if m:
    print(m.group())

that is
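
Because `search()` returns only the first match, it can be useful to see where that match sits in the source. Here is a small sketch using the pattern "be", which occurs twice in our source string; the variable name `first_be` is chosen just for this illustration.

```python
import re

source = "To be or not to be, that is the question"

# "be" appears twice, but search() stops at the first occurrence
first_be = re.search("be", source)

if first_be:
    print(first_be.span())  # position of the first "be" only
```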


###All matches with findall()

You can use findall() to find all of the places where the pattern matches the source string.

In [8]:
n_pattern = re.compile("n")  # Let's find all of the n's in the source string
m = n_pattern.findall(source)
print("Found", len(m), "matches")
print(m)

Found 2 matches
['n', 'n']


How about 'n' followed by any character?

In [9]:
n_and_character_pattern = re.compile("n.")
m = n_and_character_pattern.findall(source)
print("Found", len(m), "matches")

Found 1 matches


Note how `n_and_character_pattern` found only one match. This happened because the last "n" in the source string does not have a character after it; the string just terminates. If we want to find all n's, whether or not they have a character after them, we can make that character optional with "?", as shown below.

In [10]:
n_and_character_optional_pattern = re.compile("n.?")
m = n_and_character_optional_pattern.findall(source)
print("Found", len(m), "matches")

Found 2 matches
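
The two cells above print only the number of matches. To see what was actually matched, we can print the lists themselves. This is a small sketch that reuses the same source string and patterns, just without compiling them first.

```python
import re

source = "To be or not to be, that is the question"

# "n." needs a character after the n, so the final n is skipped
with_character = re.findall("n.", source)
print(with_character)

# "n.?" makes that character optional, so the final n matches alone
optional_character = re.findall("n.?", source)
print(optional_character)
```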


###Split at matches with split()

You can split a string at a pattern's matches, much as the string `split()` method does with a fixed separator.

In [11]:
n_pattern = re.compile("n")
m = n_pattern.split(source)
print(m)

['To be or ', 'ot to be, that is the questio', '']
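
Where a pattern-based split pays off is when the separator is not one fixed string. As a sketch, the pattern below splits on any run of commas and whitespace. It uses square brackets, `\s`, `+`, and a raw string (`r"..."`), all of which are explained later in this chapter, so treat it as a preview rather than something you should already be able to read.

```python
import re

source = "To be or not to be, that is the question"

# The string method splits on one fixed separator
halves = source.split(", ")
print(halves)

# A pattern can split on a comma or whitespace, in runs of any length
words = re.split(r"[,\s]+", source)
print(words)
```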


###Replace at matches with sub()

You can also replace every match with another string, much as the string `replace()` method does.

In [12]:
m = n_pattern.sub('?', source)
print(m)

To be or ?ot to be, that is the questio?
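
If you do not want to replace every match, `sub()` accepts an optional `count` argument, and the related `subn()` method also reports how many replacements were made. The sketch below shows both with the same pattern; the variable names are chosen for this illustration.

```python
import re

source = "To be or not to be, that is the question"
n_pattern = re.compile("n")

# count=1 limits the substitution to the first match
first_only = n_pattern.sub("?", source, count=1)
print(first_only)

# subn() returns the new string together with the number of replacements
replaced, how_many = n_pattern.subn("?", source)
print(replaced, how_many)
```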


##Defining Patterns

We jumped straight into the regular expression package that Python provides with its match(), search(), findall(), split(), and sub() functions. So far we have used simple patterns with the following features:

- Matching literal strings like "that is"
- Matching any single character other than \n with "."
- Matching any number of the previous character with "*"
- Matching an optional previous character with "?"

Now we are going to look in more depth at how to build powerful patterns that can match almost anything you can think of.

###Special characters

Python provides a series of special characters that match common kinds of characters and positions, and you will use them often in your regular expressions:

| Pattern | Matches                                                  |
|---------|----------------------------------------------------------|
| \d      | a single digit                                           |
| \D      | a single non-digit                                       |
| \w      | an alphanumeric character or underscore                  |
| \W      | a character other than alphanumeric or underscore        |
| \s      | a whitespace character                                   |
| \S      | a non-whitespace character                               |
| \b      | a word boundary (between a \w and a \W, in either order) |
| \B      | a non-word boundary                                      |

To put this to work, we are going to use the string module's `printable` variable, which contains 100 printable ASCII characters, to show the power of these special characters.

In [13]:
import string
printable = string.printable

re.findall(r"\d", printable)  # the r prefix makes a raw string, so the backslash is kept as-is

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [14]:
re.findall(r"\w", printable)

['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '_']

In [15]:
re.findall(r"\s", printable)

[' ', '\t', '\n', '\r', '\x0b', '\x0c']
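
The boundary patterns `\b` and `\B` match positions rather than characters, so they are easier to see with `sub()` than with a list of matches. The sketch below uses a made-up sample string, `text`, and marks each match by replacing it with "BE". The patterns are written with an `r` prefix (raw strings) so that Python leaves the backslashes alone.

```python
import re

# Hypothetical sample: "be" appears as a word and inside "maybe"
text = "to be or maybe not"

# No boundaries: both occurrences are replaced
anywhere = re.sub(r"be", "BE", text)
print(anywhere)

# \b on both sides: only the standalone word is replaced
whole_word = re.sub(r"\bbe\b", "BE", text)
print(whole_word)

# \B in front: only the "be" that continues a word is replaced
inside_word = re.sub(r"\Bbe\b", "BE", text)
print(inside_word)
```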

###Using specifiers

As you may have guessed, "\*", ".", and "?" are not the only characters that have special meaning to the pattern. These characters are called specifiers, and there are several of them to work with in building your powerful patterns.

| Pattern          | Matches                                      |
|------------------|----------------------------------------------|
| abc              | literal abc                                  |
| ( expr )         | expr                                         |
| expr1 &#124; expr2    | expr1 or expr2                               |
| .                | any character except \n                      |
| ^                | start of source string                       |
| $                | end of source string                         |
| prev ?           | zero or one prev                             |
| prev *           | zero or more prev, as many as possible       |
| prev *?          | zero or more prev, as few as possible        |
| prev +           | one or more prev, as many as possible        |
| prev +?          | one or more prev, as few as possible         |
| prev { m }       | m consecutive prev                           |
| prev { m, n }    | m to n consecutive prev, as many as possible |
| prev { m, n }?   | m to n consecutive prev, as few as possible  |
| [ abc ]          | a or b or c (same as a&#124;b&#124;c)                  |
| [^ abc ]         | not (a or b or c)                            |
| prev (?= next )  | prev if followed by next                     |
| prev (?! next )  | prev if not followed by next                 |
| (?<= prev ) next | next if preceded by prev                     |
| (?<! prev ) next | next if not preceded by prev                 |

Regular expressions are extremely powerful and can make things like finding a U.S. phone number in the middle of a large section of text trivial. But you need to write that regular expression, which can be tricky.

Let's try a real-life example. Let's say you would like to extract all U.S. telephone numbers from a source of text. The source would be like that shown below.

In [16]:
large_source = """
Hi Bianca,
It was great to talk to you about regular expressions. I really understand
them more than I ever had before. Would you like to work on the next project
together? My number is 650-555-3948. Thanks and talk to you soon!

-Mary
"""

What we want to do is build a pattern that would do the following:

- match three digits
- followed by a dash
- then match three more digits
- followed by a dash
- and then match four digits

Here's the regular expression you can create.

In [17]:
phone_number_pattern = re.compile(r'[0123456789]{3}-[0123456789]{3}-[0123456789]{4}')
m = phone_number_pattern.findall(large_source)
print(m)

['650-555-3948']


Notice that the pattern is written as a raw string, with an `r` in front of the opening quote: `r'...'`. A raw string prevents Python from interpreting backslash sequences as escape characters (otherwise `"\b"` would become a backspace instead of matching a word boundary). It is a good habit to write all of your patterns as raw strings so that you do not have any unintended side effects in the future.
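
To see the difference a raw string makes, the sketch below compares `"\b"` with `r"\b"` and then uses each form in a search for the word "is". The variable names `plain_result` and `raw_result` are chosen for this illustration.

```python
import re

source = "To be or not to be, that is the question"

# Without the r prefix, "\b" is one character: a backspace
print(len("\b"))
# With the r prefix, it is two characters: a backslash and the letter b
print(len(r"\b"))

# The plain string looks for literal backspace characters around "is"
plain_result = re.search("\bis\b", source)
print(plain_result)

# The raw string reaches the regular expression engine as a word boundary
raw_result = re.search(r"\bis\b", source)
print(raw_result.group())
```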

For the first implementation of our phone number regular expression, we used square brackets to mean 0 or 1 or 2, and so on, and curly brackets to specify how many repetitions of that pattern we would like to match. We could also have written this regular expression as shown below.

In [18]:
phone_number_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
m = phone_number_pattern.findall(large_source)
print(m)

['650-555-3948']


Using the special character `\d` makes the pattern much shorter and easier to read. I highly recommend checking out [Pythex](http://pythex.org/), which allows you to test your Python regular expressions so that you can be sure that your pattern matches what you expect it to match.
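
One thing worth testing is what the pattern does with digits that only look like a phone number. As a sketch, the made-up string `text` below contains an order number that happens to have a phone-shaped run of digits inside it. Adding `\b` on both ends is one possible way to require that the number stands on its own; whether that is the right rule depends on your data.

```python
import re

# Hypothetical text: the order number contains a phone-shaped digit run
text = "Order 1650-555-39485 shipped; call 650-555-3948."

# The plain pattern also matches inside the longer order number
plain = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(plain)

# Word boundaries reject matches that are glued to other digits
bounded = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', text)
print(bounded)
```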

###Specifying match output

You can also capture arbitrary parts of the match by using parentheses. When using the match() or search() functions, you can use the `group()` and `groups()` methods of the match object to gain access to the groups that you have designated.

For example, we can further improve our phone number regular expression by providing the ability to grab the area code and the rest of the number.

In [19]:
phone_number_pattern = re.compile(r'(\d{3})-(\d{3}-\d{4})')
m = phone_number_pattern.search(large_source)

if m:
    print(m.group())
    print(m.groups())

650-555-3948
('650', '555-3948')


`m.groups()` returns a tuple of all of the groups that were captured based on where you placed the parentheses. You can also name the groups for easy retrieval.

In [20]:
phone_number_pattern = re.compile(r'(?P<areacode>\d{3})-(?P<number>\d{3}-\d{4})')
m = phone_number_pattern.search(large_source)

if m:
    print(m.group("areacode"))
    print(m.group("number"))

650
555-3948
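
Named groups can also be collected in one step. The sketch below uses the match object's `groupdict()` method, and `finditer()` to walk over every match when the text holds more than one number. The string `text` with its two numbers is invented for this illustration.

```python
import re

# Hypothetical text with two phone numbers
text = "Mary: 650-555-3948, Bianca: 415-555-0199"

phone_number_pattern = re.compile(r'(?P<areacode>\d{3})-(?P<number>\d{3}-\d{4})')

# groupdict() returns the named groups of one match as a dictionary
first = phone_number_pattern.search(text)
print(first.groupdict())

# finditer() yields one match object per phone number
area_codes = [m.group("areacode") for m in phone_number_pattern.finditer(text)]
print(area_codes)

# With groups in the pattern, findall() returns one tuple per match
all_parts = phone_number_pattern.findall(text)
print(all_parts)
```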


Regular expressions are very powerful and expressive. They are also a language in their own right, so be sure to practice and try out your own regular expressions in the beginning. Also check out [Pythex](http://pythex.org/) for a quick way to test your Python regular expressions before you add them to your code.

Now that we have discussed manipulating and searching character data, we will cover binary data in the next section.