## Pattern Matching with Regular Expressions

In [1]:
import re

In [2]:
phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')

Now the phoneNumRegex variable contains a Regex object.

    Passing a raw string (the r prefix) to re.compile() keeps backslashes literal, so \d can be written as-is instead of \\d.

A Regex object’s search() method searches the string it is passed for any matches to the regex. The **search()** method returns None if the regex pattern is not found in the string. **If the pattern is found, the search() method returns a Match object.** Match objects have a **group()** method that **returns the actual matched text from the searched string**.

In [3]:
mo = phoneNumRegex.search('My number is 415-555-4242.')

In [4]:
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242
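
Because search() returns None when nothing matches, calling group() on the result without a check raises an AttributeError. The following is a minimal sketch of guarding against that. The helper name `find_phone` is hypothetical and is not part of the re module.

```python
import re

phone_num_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone(text):
    """Return the first phone number in text, or None if there is none.

    Hypothetical helper for illustration only.
    """
    mo = phone_num_regex.search(text)
    # search() returns None on failure, so check before calling group().
    if mo is None:
        return None
    return mo.group()

print(find_phone('My number is 415-555-4242.'))
print(find_phone('No number here.'))
```

The first call prints the matched number and the second prints None instead of raising an error.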


### Grouping with Parentheses

Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then we can use the group() match object method to grab the matching text from just one group.

    The first set of parentheses in a regex string will be group 1. The second set will be group 2.

Passing 0 or nothing to the group() method will return the entire matched text.

In [6]:
phoneNumRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print(mo.group(1))
print(mo.group(2))
print(mo.group(0))
print(mo.group())

415
555-4242
415-555-4242
415-555-4242


In [7]:
# groups() returns all the groups at once, as a tuple
mo.groups()

('415', '555-4242')

In [9]:
areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

415
555-4242


##### Special Case - Matching Parentheses in the expression

In [11]:
#The \( and \) escape characters in the raw string passed to re.compile() 
#will match actual parenthesis characters.
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
print(mo.group(1))
print(mo.group(2))

(415)
555-4242


## Matching Multiple Groups with the Pipe

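    The | character can be used anywhere we want to match one of many expressions. When more than one of them occurs in the searched text, the first occurrence found is returned as the Match object.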

In [12]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
mo1.group()

'Batman'

In [13]:
mo2 = heroRegex.search('Tina Fey and Batman.')
mo2.group()

'Tina Fey'

We can also use the pipe to match one of several patterns as part of our regex. 

In [14]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Batmobile
mobile
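
The sketch below looks at how the pipe picks a match. It suggests that the occurrence starting earliest in the text wins regardless of the order of the alternatives, and that the order only matters when two alternatives can match at the same position. The variable names are illustrative assumptions.

```python
import re

text = 'Batman and Tina Fey.'

# The order of alternatives does not change which occurrence is found:
# the match that starts earliest in the text wins.
earliest_a = re.compile(r'Batman|Tina Fey').search(text).group()
earliest_b = re.compile(r'Tina Fey|Batman').search(text).group()

# When two alternatives can match at the same position,
# the one listed first is tried first.
short_first = re.compile(r'Bat|Batman').search(text).group()
long_first = re.compile(r'Batman|Bat').search(text).group()

print(earliest_a, earliest_b)
print(short_first, long_first)
```

Both of the first two results are 'Batman'. The last two are 'Bat' and 'Batman', so it is usually safer to list the longer alternative first when one alternative is a prefix of another.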


#### Optional Matching with the Question Mark

Sometimes there is a pattern that we want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. The ? character flags the group that precedes it as an optional part of the pattern.

In [16]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [19]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [22]:
#Match zero or one of the group preceding this question mark.
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())
mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())

415-555-4242
555-4242
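
One detail worth checking is what an optional group returns when it did not take part in the match. A minimal sketch using the same phone regex; the variable names are illustrative.

```python
import re

phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

with_area = phone_regex.search('My number is 415-555-4242')
without_area = phone_regex.search('My number is 555-4242')

# Group 1 is the optional area code, including its trailing hyphen.
print(with_area.group(1))
# When the optional group is absent, group(1) is None, not an empty string.
print(without_area.group(1))
```

Code that uses group(1) from an optional group should therefore be prepared to receive None.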


#### Matching Zero or More with the Star
The * (called the star or asterisk) means “match zero or more”—the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again.

In [23]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [25]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [26]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'

#### Matching One or More with the Plus
While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional.

In [27]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())

mo2 = batRegex.search('The Adventures of Batwowowowoman')
print(mo2.group())

mo3 = batRegex.search('The Adventures of Batman')
mo3 is None

Batwoman
Batwowowowoman


True
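
To compare the three symbols side by side, the sketch below runs the question mark, star, and plus versions of the Batman regex against the same sample strings. The `patterns`, `samples`, and `results` names are assumptions made for this illustration.

```python
import re

patterns = {
    '?': re.compile(r'Bat(wo)?man'),
    '*': re.compile(r'Bat(wo)*man'),
    '+': re.compile(r'Bat(wo)+man'),
}
samples = ['Batman', 'Batwoman', 'Batwowoman']

results = {}
for symbol, regex in patterns.items():
    # Record whether each sample contains a match for this quantifier.
    results[symbol] = [regex.search(s) is not None for s in samples]
    print(symbol, results[symbol])
```

Only the star matches all three samples. The question mark rejects the repeated 'wowo', and the plus rejects plain 'Batman'.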

### Matching Specific Repetitions with Curly Brackets

If we have a group that we want to repeat a specific number of times, we follow the group in the regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

Instead of one number, we can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

We can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. 

These two regular expressions match identical patterns:

(Ha){3,5}

((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

In [28]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [29]:
mo2 = haRegex.search('Ha')
mo2 is None

True
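
The claims about ranges can be checked with a short sketch. It uses fullmatch(), which is not covered above and succeeds only when the entire string matches, to list how many repeats of 'Ha' each regex accepts. The variable names are illustrative.

```python
import re

ranged = re.compile(r'(Ha){3,5}')
spelled_out = re.compile(
    r'((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))'
)
at_most_five = re.compile(r'(Ha){,5}')

# For n = 0..7, keep the repeat counts whose whole string matches.
ranged_counts = [n for n in range(8) if ranged.fullmatch('Ha' * n)]
spelled_counts = [n for n in range(8) if spelled_out.fullmatch('Ha' * n)]
at_most_counts = [n for n in range(8) if at_most_five.fullmatch('Ha' * n)]

print(ranged_counts)
print(spelled_counts)
print(at_most_counts)
```

The first two lists are both [3, 4, 5], which supports the equivalence in terms of matched text (the group numbering differs between the two regexes). The last list runs from 0 to 5, so (Ha){,5} also accepts the empty string.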

**Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.**

In [30]:
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
mo1.group()

'HaHaHaHaHa'

In [31]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()

'HaHaHa'

**Note that the question mark can have two meanings in regular expressions: declaring a nongreedy match or flagging an optional group. These meanings are entirely unrelated.**
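
The two meanings of the question mark can be seen side by side in the sketch below, along with a third regex that uses both at once. The variable names are illustrative assumptions.

```python
import re

text = 'HaHaHaHaHa'

# Here ? makes the (Ha) group optional: zero or one occurrence.
optional = re.compile(r'(Ha)?')
# Here ? follows a quantifier and makes {3,5} nongreedy.
nongreedy = re.compile(r'(Ha){3,5}?')
# Both meanings together: an optional group, matched nongreedily.
nongreedy_optional = re.compile(r'(Ha)??')

optional_match = optional.search(text).group()
nongreedy_match = nongreedy.search(text).group()
nongreedy_optional_match = nongreedy_optional.search(text).group()

print(repr(optional_match))
print(repr(nongreedy_match))
print(repr(nongreedy_optional_match))
```

The optional group matches 'Ha', the nongreedy range stops at 'HaHaHa', and the nongreedy optional group prefers zero occurrences and matches the empty string.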

## The findall() Method
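In addition to the search() method, Regex objects also have a findall() method. While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string. When the regex has no groups, findall() returns a list of strings; when it has groups, it returns a list of tuples with one string per group.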

In [32]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
print(mo.group())
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

415-555-9999


['415-555-9999', '212-555-0000']

In [33]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]
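
When a regex has groups but the full matched text is still needed, one option is finditer(), a Regex method not covered above that yields a Match object for each match. A minimal sketch, with illustrative variable names:

```python
import re

phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
text = 'Cell: 415-555-9999 Work: 212-555-0000'

# finditer() yields one Match object per match, so both the full text
# and the individual groups stay available.
full_numbers = [mo.group() for mo in phone_num_regex.finditer(text)]
area_codes = [mo.group(1) for mo in phone_num_regex.finditer(text)]

print(full_numbers)
print(area_codes)
```

This prints the two complete phone numbers, then the two area codes.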

## Character Classes

    \d    Any numeric digit from 0 to 9.

    \D    Any character that is not a numeric digit from 0 to 9.

    \w    Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)

    \W    Any character that is not a letter, numeric digit, or the underscore character.

    \s    Any space, tab, or newline character. (Think of this as matching “space” characters.)

    \S    Any character that is not a space, tab, or newline.

In [36]:
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, \
                  6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']
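
The shorthand classes can be compared on a single string. The sample text below is made up for illustration and contains letters, digits, an underscore, spaces, and a tab.

```python
import re

text = 'Room 42_b\tis open'

digits = re.findall(r'\d', text)       # each digit on its own
whitespace = re.findall(r'\s', text)   # spaces and the tab
non_word = re.findall(r'\W', text)     # anything that is not \w
words = re.findall(r'\w+', text)       # runs of letters, digits, underscore
non_space = re.findall(r'\S+', text)   # runs of non-whitespace

print(digits)
print(whitespace)
print(non_word)
print(words)
print(non_space)
```

In this particular string the only non-word characters are whitespace, so \W and \s give the same result, as do \w+ and \S+. Adding punctuation to the text would make them differ.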

### Making Your Own Character Classes


In [37]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('Robocop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

**By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class.**

In [38]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonantRegex.findall('Robocop eats baby food. BABY FOOD.')

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

#### The Caret and Dollar Sign Characters

You can also use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ($) at the end of the regex to indicate the string must end with this regex pattern. And you can use the ^ and $ together to indicate that the entire string must match the regex; that is, it’s not enough for a match to be made on some subset of the string.

In [39]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello world!')

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [40]:
beginsWithHello.search('He said hello.') is None

True

In [41]:
endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your number is 42')

<_sre.SRE_Match object; span=(16, 17), match='2'>

In [42]:
endsWithNumber.search('Your number is forty two.') is None

True

In [43]:
wholeStringIsNum = re.compile(r'^\d+$')
wholeStringIsNum.search('1234567890')

<_sre.SRE_Match object; span=(0, 10), match='1234567890'>

In [44]:
wholeStringIsNum.search('12345xyz67890') is None

True

In [45]:
wholeStringIsNum.search('12 34567890') is None

True
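
An alternative to wrapping a pattern in ^ and $ is fullmatch(), which is not covered above and succeeds only when the entire string matches. The sketch below compares the two approaches. The helper name `is_all_digits` is hypothetical.

```python
import re

whole_string_is_num = re.compile(r'^\d+$')

def is_all_digits(text):
    """Return True if text consists only of digits (hypothetical helper)."""
    # fullmatch() requires the entire string to match, so no ^ or $ is needed.
    return re.fullmatch(r'\d+', text) is not None

print(whole_string_is_num.search('1234567890') is not None)
print(is_all_digits('1234567890'))
print(is_all_digits('12345xyz67890'))

# One difference: $ also matches just before a single trailing newline.
anchored_accepts_newline = whole_string_is_num.search('42\n') is not None
fullmatch_accepts_newline = is_all_digits('42\n')
print(anchored_accepts_newline)
print(fullmatch_accepts_newline)
```

The two approaches agree on the first inputs, but the anchored regex accepts '42\n' while fullmatch() rejects it. This matters when validating input that may carry a trailing newline.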

### The Wildcard Character
The . (or dot) character in a regular expression is called a wildcard and will match any character except for a newline.

In [47]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

### Matching Everything with Dot-Star
Sometimes you will want to match everything and anything. For example, say you want to match the string 'First Name:', followed by any and all text, followed by 'Last Name:', and then followed by anything again. You can use the dot-star (.*) to stand in for that “anything.” Remember that the dot character means “any single character except the newline,” and the star character means “zero or more of the preceding character.”

In [49]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
mo.group(1)

'Al'

In [50]:
mo.group(2)

'Sweigart'

The dot-star uses greedy mode: it will always try to match as much text as possible. To match any and all text in a nongreedy fashion, use the dot, star, and question mark (.*?). As with curly brackets, the question mark tells Python to match in a nongreedy way.

In [51]:
nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
mo.group()

'<To serve man>'

In [52]:
greedyRegex = re.compile(r'<.*>')
mo = greedyRegex.search('<To serve man> for dinner.>')
mo.group()

'<To serve man> for dinner.>'
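
The difference between the two modes is easier to see with findall() on a string that contains more than one bracketed span. The sample string below is a variation on the one above and is made up for illustration.

```python
import re

text = '<To serve man> for <dinner>.'

# Nongreedy: each match stops at the first closing bracket it can.
nongreedy_spans = re.findall(r'<.*?>', text)
# Greedy: one match runs from the first < to the last >.
greedy_spans = re.findall(r'<.*>', text)

print(nongreedy_spans)
print(greedy_spans)
```

The nongreedy regex finds the two bracketed spans separately, while the greedy regex returns a single match that swallows the text between them.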

#### Matching Newlines with the Dot Character
The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character.

In [54]:
noNewlineRegex = re.compile('.*')
noNewlineRegex.search('Serve the public trust.\nProtect the innocent. \nUphold the law.').group()

'Serve the public trust.'

In [55]:
newlineRegex = re.compile('.*', re.DOTALL)
newlineRegex.search('Serve the public trust.\nProtect the innocent. \nUphold the law.').group()

'Serve the public trust.\nProtect the innocent. \nUphold the law.'

## Review of Regex Symbols

The ? matches zero or one of the preceding group.

The * matches zero or more of the preceding group.

The + matches one or more of the preceding group.

The {n} matches exactly n of the preceding group.

The {n,} matches n or more of the preceding group.

The {,m} matches 0 to m of the preceding group.

The {n,m} matches at least n and at most m of the preceding group.

{n,m}? or *? or +? performs a nongreedy match of the preceding group.

^spam means the string must begin with spam.

spam$ means the string must end with spam.

The . matches any character, except newline characters.

\d, \w, and \s match a digit, word, or space character, respectively.

\D, \W, and \S match anything except a digit, word, or space character, respectively.

[abc] matches any character between the brackets (such as a, b, or c).

[^abc] matches any character that isn’t between the brackets.

## Case-Insensitive Matching

In [56]:
# By default, regexes match text with the exact casing given,
# so these four regexes match four different strings.
regex1 = re.compile('Robocop')
regex2 = re.compile('ROBOCOP')
regex3 = re.compile('robOcop')
regex4 = re.compile('RobocOp')

# To make your regex case-insensitive, you can pass
# re.IGNORECASE or re.I as a second argument to re.compile().
robocop = re.compile(r'robocop', re.I)
robocop.search('Robocop is part man, part machine, all cop.').group()

'Robocop'

In [57]:
robocop.search('ROBOCOP protects the innocent.').group()

'ROBOCOP'
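
To see the effect of the flag directly, the sketch below runs the same pattern with and without re.IGNORECASE over one string. The sample sentence and variable names are made up for illustration.

```python
import re

text = 'Robocop, ROBOCOP, and robOcop are all the same cop.'

# With the flag, every casing of the word matches.
robocop = re.compile(r'robocop', re.IGNORECASE)
insensitive_matches = robocop.findall(text)

# Without the flag, only the exact lowercase spelling would match.
strict = re.compile(r'robocop')
strict_matches = strict.findall(text)

print(insensitive_matches)
print(strict_matches)
```

The case-insensitive regex returns all three spellings as they appear in the text. The strict regex returns an empty list, because none of them is entirely lowercase.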