# Regular Expressions in Python

Based on https://developers.google.com/edu/python/regular-expressions

In Python, a regular expression search is typically written as `match = re.search(pat, str)`.

In [19]:
import re

str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)


# If-statement after search() tests if it succeeded
if match:
    print('found: ', match.group())  ## 'found:  word:cat'
else:
    print('did not find')

found:  word:cat


The code "`match = re.search(pat, str)`" stores the search result in a variable named "match". Then the if-statement tests the match -- if true the search succeeded and `match.group()` is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.

The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through without change. This is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit.
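
To make the raw-string point concrete, here is a minimal sketch comparing a plain string and a raw string that look the same in the source. The sample text `'a cat!'` and the variable names are made up for illustration.

```python
import re

# In a normal string literal, '\b' is the backspace character (one char).
# In a raw string, r'\b' is a backslash followed by 'b' (two chars),
# which is what the regex engine needs to see for "word boundary".
lengths = (len('\b'), len(r'\b'))
print(lengths)    # (1, 2)

plain = re.search('\bcat\b', 'a cat!')   # looks for backspace + 'cat' + backspace
raw = re.search(r'\bcat\b', 'a cat!')    # looks for 'cat' as a whole word

print(plain)          # None
print(raw.group())    # cat
```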

## Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

- **a, X, 9, <** -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)

- **. (a period)** -- matches any single character except newline '\n'

- **\w** -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.

- **\b** -- boundary between word and non-word

- **\s** -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.

- **\t, \n, \r** -- tab, newline, return

- **\d** -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

- **^** = start, **$** = end -- match the start or end of the string

- **\\** -- inhibit the "specialness" of a character. So, for example, use \\. to match a period or \\\\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \\@, to make sure it is treated just as a character.
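
A few of the patterns above that are not exercised elsewhere in these notes (escaping, the `^` and `$` anchors, and `\b`) can be sketched as follows. The sample strings and variable names are invented for illustration.

```python
import re

text = 'version 3.10 costs $5'

# Unescaped '.' matches any character; '\.' matches only a literal period.
any_char = re.search(r'3.1', 'a 3x1 b')
literal_dot = re.search(r'3\.1', 'a 3x1 b')
print(any_char.group(), literal_dot)    # 3x1 None

# '\$' matches a literal dollar sign instead of meaning "end of string".
price = re.search(r'\$\d', text)
print(price.group())                    # $5

# '^' and '$' anchor the pattern to the start and end of the string.
starts = re.search(r'^version', text)
ends = re.search(r'\d$', text)
print(starts.group(), ends.group())     # version 5

# '\b' keeps 'cat' from matching inside a longer word.
word = re.search(r'\bcat\b', 'concatenate')
print(word)                             # None
```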

## Basic Examples

The basic rules of regular expression search for a pattern within a string are:

- The search proceeds through the string from start to end, **stopping at the first match found**

- All of the pattern must be matched, but not all of the string

- If `match = re.search(pat, str)` is successful, match is not None and in particular `match.group()` is the matching text

In [20]:
def FUN_check_result(match):
    if match:
        print('found: ', match.group())  ## e.g. 'found:  iii'
    else:
        print('did not find')

## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig')
FUN_check_result(match)

match = re.search(r'igs', 'piiig')
FUN_check_result(match)


## . = any char but \n
match = re.search(r'..g', 'piiig')
FUN_check_result(match)

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')
FUN_check_result(match)

match = re.search(r'\w\w\w', '@@abcd!!')
FUN_check_result(match)

found:  iii
did not find
found:  iig
found:  123
found:  abc


## Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

- '+' -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's

- '*' -- 0 or more occurrences of the pattern to its left

- '?' -- match 0 or 1 occurrences of the pattern to its left
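
As a small illustration of the three repetition operators, and of `?` in particular, here is a sketch using made-up sample words:

```python
import re

# 'u?' makes the 'u' optional: zero or one occurrence, never two.
results = {}
for word in ['color', 'colour', 'colouur']:
    match = re.search(r'^colou?r$', word)
    results[word] = match.group() if match else None
print(results)

# '*' accepts zero occurrences of 'b', while '+' needs at least one.
star = re.search(r'ab*', 'a')
plus = re.search(r'ab+', 'a')
print(star.group(), plus)    # a None
```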

### Leftmost & Largest

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "**greedy**").

### Specify (Limit) the Repetition
Sometimes we may want to add more restrictions on the repetition, such as how many times the pattern is repeated. We can use `{m}` (exactly m times) or `{m,n}` (between m and n times) to do this.

For example, suppose we want to match all the potential IP addresses, like 101.203.120.187 or 211.182.3.12, but not 1234.21.93823.43. Then we can use the pattern `[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}`, in which `{1,3}` means the preceding pattern should repeat 1, 2, or 3 times.
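
One caveat worth sketching: because `re.search()` may match anywhere in the string, the pattern above can still match *inside* a longer run of digits. The example below uses `\b` word boundaries as one possible way to tighten it; the input strings are made up for illustration, and the pattern still does not check that each number is at most 255.

```python
import re

ip_pat = r'[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}'

# Unanchored, the pattern skips the leading '1' and matches the rest.
loose = re.search(ip_pat, '1234.21.93.43')
print(loose.group())    # 234.21.93.43

# With \b on both sides, the match must start and end at a word boundary.
strict = re.search(r'\b' + ip_pat + r'\b', '1234.21.93.43')
print(strict)           # None

# {m} with a single number means exactly m repetitions.
year = re.search(r'\b\d{4}\b', 'built in 1998, rebuilt 20 years later')
print(year.group())     # 1998
```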

### Repetition Examples

In [21]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig')
FUN_check_result(match)

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii')
FUN_check_result(match)

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')   # '\s' matches a single whitespace character -- space, newline, return, tab, form feed
FUN_check_result(match)

match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') 
FUN_check_result(match)

match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
FUN_check_result(match)


## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')
FUN_check_result(match)

## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')
FUN_check_result(match)


## Specify (limit) the repetitions
match = re.search(r'[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}', '101.203.120.187')
FUN_check_result(match)

match = re.search(r'[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}', '211.182.3.12')
FUN_check_result(match)

match = re.search(r'[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}', '1234.21.93823.43')
FUN_check_result(match)

found:  piii
found:  ii
found:  1 2   3
found:  12  3
found:  123
did not find
found:  bar
found:  101.203.120.187
found:  211.182.3.12
did not find


## Emails Example

The search below does not get the whole email address in this case because the \w does not match the '-' or '.' in the address.

In [22]:
str = 'purple alice-b@google.com monkey dishwasher'

match = re.search(r'\w+@\w+', str)
if match:
    print(match.group())  ## 'b@google'

b@google


**Square brackets** can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

In [23]:
match = re.search(r'[\w.-]+@[\w.-]+', str)

if match:
    print(match.group())  ## 'alice-b@google.com'

alice-b@google.com


You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. 

An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
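
The bracket rules above (ranges, a trailing literal dash, inversion with `^`, and the literal dot) can be checked with a short sketch. The sample strings are invented for illustration.

```python
import re

# [a-z] is a range: one or more lowercase letters.
lower = re.search(r'[a-z]+', 'ABCdefGHI').group()
print(lower)       # def

# A dash placed last is a literal dash, not a range.
dashed = re.findall(r'[abc-]+', 'a-b c-c xyz')
print(dashed)      # ['a-b', 'c-c']

# A leading ^ inverts the set: anything except 'a' or 'b'.
inverted = re.findall(r'[^ab]+', 'aabxyzbbq')
print(inverted)    # ['xyz', 'q']

# Inside brackets, a dot is just a literal dot.
dots = re.findall(r'[.]', 'a.b.c')
print(dots)        # ['.', '.']
```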

## Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses `( )` around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'.

In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

In [24]:
str = 'purple alice-b@google.com monkey dishwasher'

match = re.search(r'([\w.-]+)@([\w.-]+)', str)

if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com


A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.

## findall

`findall()` is probably the single most powerful function in the **re** module. Above we used `re.search()` to find the *first* match for a pattern. `findall()` finds **all** the matches and returns them as a list of strings, with each string representing one match.

In [25]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']

print(emails)

for email in emails:
    # do something with each found email string
    print(email)

['alice@google.com', 'bob@abc.com']
alice@google.com
bob@abc.com


## findall With Files

For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (`f.read()` returns the whole text of a file in a single string).

```
# Open the file; the with-statement closes it when the block ends
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'some pattern', f.read())
```
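
For a version that can be run end to end, the sketch below first writes a small throwaway file and then reads it back. The file contents and the use of `tempfile` are assumptions made only so the example is self-contained.

```python
import os
import re
import tempfile

# Write a small throwaway file so the example does not depend on test.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('alice@google.com\nnothing here\nbob@abc.com\n')
    path = tmp.name

# One findall() call covers every line; no explicit loop is needed.
with open(path, 'r') as f:
    emails = re.findall(r'[\w.-]+@[\w.-]+', f.read())

os.remove(path)
print(emails)    # ['alice@google.com', 'bob@abc.com']
```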

## findall and Groups

The parenthesis ( ) group mechanism can be combined with findall(). 

If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of *tuples*. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each length 2 containing the username and host, e.g. ('alice', 'google.com').

In [26]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)

print(tuples)

for pair in tuples:   # 'pair' avoids shadowing the built-in tuple type
    print(pair[0])  ## username
    print(pair[1])  ## host

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com
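
The case of exactly one group is not shown above: there, `findall()` returns a plain list of strings holding just that group's text. A minimal sketch comparing zero, one, and two groups on the same sample text (the variable names are illustrative):

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No groups: each list item is the whole match.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)
# One group: each list item is just that group's text.
users = re.findall(r'([\w.-]+)@[\w.-]+', text)
# Two groups: each list item is a (group 1, group 2) tuple.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(whole)    # ['alice@google.com', 'bob@abc.com']
print(users)    # ['alice', 'bob']
print(pairs)    # [('alice', 'google.com'), ('bob', 'abc.com')]
```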


## Greedy vs. Non-Greedy -  Where to Stop

This is an optional section which shows a more advanced regular expression technique not needed for the exercises.

Suppose you have text with tags in it: `<b>foo</b> and <i>so on</i>`

Suppose you are trying to match each tag with the pattern '(<.*>)' -- what does it match first?

The result is a little surprising, but the greedy aspect of the `.*` causes it to match the whole `<b>foo</b> and <i>so on</i>` as one big match. The problem is that the `.*` goes **as far as it can**, instead of stopping at the first > (aka it is "**greedy**").

There is an extension to regular expressions where you add a **?** at the end, such as `.*?` or `.+?`, changing them to be **non-greedy**. Now they stop **as soon as they can**. So the pattern `(<.*?>)` will get just `<b>` as the first match, `</b>` as the second match, and so on, getting each <..> pair in turn. The style is typically that you use a `.*?`, and then immediately to its right look for some concrete marker (> in this case) that forces the end of the `.*?` run.

The `*?` extension originated in *Perl*, and regular expressions that include Perl's extensions are known as Perl Compatible Regular Expressions (PCRE). Python's `re` module supports Perl-style syntax, including the non-greedy qualifiers. Many command-line utilities have a flag for accepting PCRE patterns.

An older but widely used technique to code this idea of "all of these chars except stopping at X" uses the square-bracket style. For the tag example above, you could write the same pattern but, instead of `.*` to get all the chars, use `[^>]*`, which skips over all characters which are not > (the leading ^ "inverts" the square bracket set, so it matches any char not in the brackets).

In [27]:
str = "<b>foo</b> and <i>so on</i>"

print("Greedy")
result = re.findall(r"(<.*>)", str)
for i in result:
    print(i)

print("\nNon-Greedy")
result = re.findall(r"(<.*?>)", str)
for i in result:
    print(i)

Greedy
<b>foo</b> and <i>so on</i>

Non-Greedy
<b>
</b>
<i>
</i>
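
The square-bracket alternative described above is not shown in the cell, so here is a minimal sketch of it on the same sample text. For this input it gives the same list as the non-greedy pattern.

```python
import re

text = '<b>foo</b> and <i>so on</i>'

# [^>]* matches any run of characters that are not '>', so the match
# cannot run past the first closing bracket.
tags = re.findall(r'<[^>]*>', text)
print(tags)    # ['<b>', '</b>', '<i>', '</i>']

# Compare with the non-greedy version of the pattern.
same = tags == re.findall(r'<.*?>', text)
print(same)    # True
```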


## Substitution 

The `re.sub(pat, replacement, str)` function searches for all the instances of pattern in the given string, and replaces them. 

**The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.**

A related function, `re.subn()`, returns a tuple: the first element is the new string (the same one `re.sub()` would return) and the second element is the number of matches that were found and replaced.

Here's an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.

In [28]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher


A simpler example:

In [29]:
str = "Work is bad. Life is bad. Everything is bad."

print(re.sub("bad", "GOOD", str))

Work is GOOD. Life is GOOD. Everything is GOOD.
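
The `re.subn()` function mentioned earlier, and a replacement that uses `\2`, can be sketched as follows. The "host: user" output format is just an illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# subn() returns (new_string, number_of_replacements).
new_text, count = re.subn(r'([\w.-]+)@([\w.-]+)', r'\1@yo-yo-dyne.com', text)
print(new_text)
print(count)      # 2

# \2 refers to the second group, so the two parts can be reordered.
swapped = re.sub(r'([\w.-]+)@([\w.-]+)', r'\2: \1', text)
print(swapped)
```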
