Regular expressions are complicated and powerful, and they can be a lot of fun once you see how much they can do.

First things first: in writing regexes, non-special characters just match themselves.

In [1]:
import re

sentence = 'Mary had a little lamb.'

In [2]:
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.



In [11]:
re.search(r'e', sentence)

<_sre.SRE_Match object; span=(16, 17), match='e'>

In [15]:
re.search(r'z', sentence)

In [13]:
re.search(r'a', sentence).group()

'a'

In [16]:
paragraph = '''
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table. I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before. It was a fine, thick piece of wood, bulbous-headed, of the sort which is known as a "Penang lawyer." Just under the head was a broad silver band nearly an inch across. "To James Mortimer, M.R.C.S., from his friends of the C.C.H.," was engraved upon it, with the date "1884." It was just such a stick as the old-fashioned family practitioner used to carry—dignified, solid, and reassuring.
'''

In [19]:
re.search(r'e', paragraph).group()

'e'

In [22]:
re.search(r'mornings', paragraph).group()

'mornings'

`re.search` returns a `Match` object, or `None` if nothing matches. The `Match` object has many methods, but `re.search` only finds the **first** occurrence of the pattern.
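
Because a failed search gives back `None`, calling `.group()` directly on the result can raise an `AttributeError`. A minimal sketch of checking first (the variable names `first_a` and `no_z` are only for illustration):

```python
import re

sentence = 'Mary had a little lamb.'

# re.search returns None when nothing matches, so test the result
# before calling methods such as .group() or .span() on it.
for pattern in (r'a', r'z'):
    match = re.search(pattern, sentence)
    if match:
        print(pattern, 'first found at', match.span())
    else:
        print(pattern, 'not found')

first_a = re.search(r'a', sentence)   # a Match object for the first 'a'
no_z = re.search(r'z', sentence)      # None, since there is no 'z'
```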

In [24]:
m = re.search(r'mornings', paragraph)
m.start()

55

In [29]:
re.findall(r'the', paragraph)

['the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the']

In [30]:
len(re.findall(r'the', paragraph))

10

In [35]:
re.findall(r'wh', paragraph)

['wh', 'wh', 'wh', 'wh']
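
`re.findall` returns plain strings, so the positions are lost. If the positions matter, `re.finditer` yields one `Match` object per occurrence. A small sketch on a made-up string:

```python
import re

text = 'when and where, who and why'

# findall returns plain strings, one per non-overlapping match.
hits = re.findall(r'wh', text)

# finditer yields a Match object per occurrence, so positions are available.
positions = [m.start() for m in re.finditer(r'wh', text)]
print(hits, positions)
```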

## Wildcard

One character can stand for "any" character: the `.`

If we want to find, for instance, "a double-o followed by any character":

In [36]:
re.search(r'oo.', paragraph)

<_sre.SRE_Match object; span=(171, 174), match='ood'>

In [38]:
re.findall(r'll.', paragraph)

['lly', 'll ']

Our wildcard matches EVERYTHING except for newlines.
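
A short sketch of the newline exception, using a made-up two-line string. The `re.DOTALL` flag is the standard way to let `.` match newlines too:

```python
import re

two_lines = 'Once upon\na time'

# 'upon' is followed by a newline, which '.' does not match by default.
default = re.search(r'upon.', two_lines)

# With re.DOTALL, '.' matches the newline as well.
dotall = re.search(r'upon.', two_lines, re.DOTALL)
print(default, repr(dotall.group()))
```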

In [41]:
long_story = '''
Once upon
a
time
'''

In [42]:
re.search(r'ce.', long_story)

<_sre.SRE_Match object; span=(3, 6), match='ce '>

To find an actual dot, use a backslash to *escape* it:

In [43]:
re.search(r'\.', paragraph)

<_sre.SRE_Match object; span=(3, 4), match='.'>

You can do case-insensitive searches by passing the `re.IGNORECASE` flag as the third argument (`flags`) to `findall`.

In [46]:
re.findall(r'it.', paragraph, re.IGNORECASE)

['ito', 'It ', 'it,', 'ith', 'It ', 'iti']

In [47]:
re.findall(r'it.', paragraph)

['ito', 'it,', 'ith', 'iti']

## Match objects


In [48]:
match = re.search('it', paragraph)

In [49]:
help(match)

Help on SRE_Match object:

class SRE_Match(builtins.object)
 |  The result of re.match() and re.search().
 |  Match objects always have a boolean value of True.
 |  
 |  Methods defined here:
 |  
 |  __copy__(...)
 |  
 |  __deepcopy__(...)
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  end(...)
 |      end([group=0]) -> int.
 |      Return index of the end of the substring matched by group.
 |  
 |  expand(...)
 |      expand(template) -> str.
 |      Return the string obtained by doing backslash substitution
 |      on the string template, as done by the sub() method.
 |  
 |  group(...)
 |      group([group1, ...]) -> str or tuple.
 |      Return subgroup(s) of the match by indices or names.
 |      For 0 returns the entire match.
 |  
 |  groupdict(...)
 |      groupdict([default=None]) -> dict.
 |      Return a dictionary containing all the named subgroups of the match,
 |      keyed by the subgroup name. The default argument is used for groups
 |      that did no
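
The methods used most often in practice are probably `group`, `start`, `end`, and `span`. A quick sketch of how they relate, using the sentence from earlier:

```python
import re

sentence = 'Mary had a little lamb.'
match = re.search(r'little', sentence)

whole = match.group()                     # the matched text (same as group(0))
start, end = match.start(), match.end()   # where the match begins and ends
print(whole, match.span())

# span() is (start, end), so slicing the original string reproduces the match.
assert sentence[start:end] == whole
```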

### Start/end matches

Sometimes you only want an expression to match at the beginning or the end of a string.

Use `^` to match the beginning and `$` to match the end.

In [50]:
sentence

'Mary had a little lamb.'

In [54]:
re.search(r'^l', sentence)

In [60]:
re.search(r'.$', sentence)

<_sre.SRE_Match object; span=(22, 23), match='.'>

Remember, `.` matches any character, so `.$` matches whatever the last character happens to be. If you're checking for a period, make sure to escape it: `\.$`
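
A sketch comparing the escaped and unescaped versions (the second test string is made up for illustration):

```python
import re

sentence = 'Mary had a little lamb.'

starts_with_mary = re.search(r'^Mary', sentence)     # anchored at the start
ends_with_period = re.search(r'\.$', sentence)       # a literal period at the end
ends_with_any = re.search(r'.$', 'No period here')   # unescaped '.' matches the final 'e'
literal_only = re.search(r'\.$', 'No period here')   # escaped '.' finds nothing
print(ends_with_period, ends_with_any, literal_only)
```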

## Matching multiple characters

Sometimes you want to match something only if it appears a certain number of times. It could be 0 or more, 1 or more, or a specific number. Regexes can handle all of these:

* `*` matches 0 or more
* `+` matches 1 or more
* `?` matches 0 or 1
* `{n}` matches exactly _n_ times
* `{m,n}` matches between _m_ and _n_ times. As with slicing, you can leave a number out: `{,n}` matches from 0 to _n_ times, and `{m,}` matches _m_ or more times

In [61]:
re.findall(r'o+', paragraph)

['o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'oo',
 'o',
 'o',
 'o',
 'o',
 'o',
 'oo',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o']

In [63]:
re.findall(r'n?g', paragraph)

['ng', 'g', 'g', 'g', 'ng', 'ng', 'g', 'ng']

In [64]:
re.findall(r'.?g', paragraph)

['ng', 'ig', 'ug', 'ig', 'ng', 'ng', 'ig', 'ng']

In [65]:
re.findall(r'.?g', 'go west and get riding')

['g', ' g', 'ng']

In [66]:
no_a = 'b'
one_a = 'ab'
lots_of_a = 'aaaaaaaaaaaab'

In [67]:
re.search(r'a*b', no_a)

<_sre.SRE_Match object; span=(0, 1), match='b'>

In [68]:
re.search(r'a*b', one_a)

<_sre.SRE_Match object; span=(0, 2), match='ab'>

In [69]:
re.search(r'a*b', lots_of_a)

<_sre.SRE_Match object; span=(0, 13), match='aaaaaaaaaaaab'>

In [70]:
other_string = 'baaaaaaaaaab'
re.search(r'a*b', other_string)

<_sre.SRE_Match object; span=(0, 1), match='b'>

In [71]:
re.findall(r'a*b', other_string)

['b', 'aaaaaaaaaab']

In [72]:
re.search(r'a+b', no_a)

In [73]:
re.search(r'a+b', one_a)

<_sre.SRE_Match object; span=(0, 2), match='ab'>

In [74]:
re.search(r'a+b', lots_of_a)

<_sre.SRE_Match object; span=(0, 13), match='aaaaaaaaaaaab'>

In [75]:
re.search(r'a{2}b', no_a)

In [76]:
re.search(r'a{2}b', one_a)

In [77]:
re.search(r'a{2}b', lots_of_a)

<_sre.SRE_Match object; span=(10, 13), match='aab'>

In [78]:
re.search(r'a{2,}b', lots_of_a)

<_sre.SRE_Match object; span=(0, 13), match='aaaaaaaaaaaab'>

In [79]:
re.search(r'a{,5}b', lots_of_a)

<_sre.SRE_Match object; span=(7, 13), match='aaaaab'>

In [80]:
re.search(r'a{,5}b', one_a)

<_sre.SRE_Match object; span=(0, 2), match='ab'>

In [84]:
re.search(r'(a+b){2}', 'abaaaabaaab')

<_sre.SRE_Match object; span=(0, 7), match='abaaaab'>

Parentheses mark subpatterns, and the quantifiers above can be applied to a whole subpattern, as in `(a+b){2}`.
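
A small sketch of the difference the parentheses make, on made-up strings:

```python
import re

# Without parentheses, '+' applies only to the 'b' just before it.
single = re.search(r'ab+', 'abbb').group()

# With parentheses, '+' repeats the whole 'ab' subpattern.
grouped = re.search(r'(ab)+', 'ababab').group()
print(single, grouped)
```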

## Matching sets of things

It's nice to be able to match multiple options for a particular sequence.

Square brackets define a set of characters, any one of which can match at that position.

`.` matches everything that's not a newline

* `[abc]` to match a, b, or c
* `[A-Z]` to match A, B, C.. X, Y, Z
* `[^m-q]` to match everything that is NOT m, n, o, p, or q
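
A quick sketch of each form on a couple of made-up words:

```python
import re

word = 'Quantum mechanics'

vowels = re.findall(r'[aeiou]', word)             # any one of the listed characters
capitals = re.findall(r'[A-Z]', word)             # any character in the range A-Z
not_m_to_q = re.findall(r'[^m-q]', 'monopoly')    # anything outside the range m-q
print(vowels, capitals, not_m_to_q)
```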

In [91]:
# find words 3-5 letters long
re.findall(r'^[a-z]{3,5}| [a-z]{3,5} | [a-z]{3,5}$', 'The apple tree', re.IGNORECASE)

['The', ' apple ']

In [95]:
re.findall(r' [a-z]{3,5} ', 'The apple tree', re.IGNORECASE)

[' apple ']

In [100]:
# find all the numbers in a string

re.findall(r'[0-9]+', 'I ate 100 ghost peppers')

['100']

In [106]:
# find all punctuation

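re.findall(r'[\.,;?!]', paragraph)         # option 1: list the punctuation marks explicitly
re.findall(r'[^a-zA-Z0-9 ]', paragraph)    # option 2: anything that isn't a letter, digit, or space (only this last result is displayed)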

['\n',
 '.',
 ',',
 ',',
 ',',
 '.',
 '-',
 '.',
 ',',
 ',',
 '-',
 ',',
 '"',
 '.',
 '"',
 '.',
 '"',
 ',',
 '.',
 '.',
 '.',
 '.',
 ',',
 '.',
 '.',
 '.',
 ',',
 '"',
 ',',
 '"',
 '.',
 '"',
 '-',
 '—',
 ',',
 ',',
 '.',
 '\n']

In [117]:
# find a phone number

print(re.search(r'[0-9]{3}-?[0-9]{3}-?[0-9]{4}', 'My phone number is 555-555-1234'))

print(re.search(r'[0-9]{3}-?[0-9]{3}-?[0-9]{4}', 'My phone number is 5555551234'))

print(re.search(r'\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}', 'My phone number is 555.555-1234'))

print(re.search(r'\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}', 'My phone number is (555) 555-1234'))


<_sre.SRE_Match object; span=(19, 31), match='555-555-1234'>
<_sre.SRE_Match object; span=(19, 29), match='5555551234'>
<_sre.SRE_Match object; span=(19, 31), match='555.555-1234'>
<_sre.SRE_Match object; span=(19, 33), match='(555) 555-1234'>


## Character Classes

Some common groupings of characters have shortcuts, known as _character classes_

* `\d` - matches digits (0-9)
* `\D` - matches _non_-digits
* `\w` - matches "word characters" (`[a-zA-Z0-9_]`, plus all other Unicode characters that can appear in words)
* `\W` - matches _non_-word characters
* `\s` - matches whitespace characters ([ \t\n\r\f\v])
* `\S` - matches _non_-whitespace characters
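
A brief sketch of these shortcuts on a made-up string that contains a tab:

```python
import re

entry = 'Room 221B,\tBaker St.'

digits = re.findall(r'\d+', entry)         # runs of digits
words = re.findall(r'\w+', entry)          # runs of word characters
spaces = re.findall(r'\s', entry)          # individual whitespace characters
non_word = re.findall(r'[^\w\s]', entry)   # neither word characters nor whitespace
print(digits, words, spaces, non_word)
```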

In [118]:
# phone number revisited
print(re.search(r'\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}', 'My phone number is (555) 555-1234'))


<_sre.SRE_Match object; span=(19, 33), match='(555) 555-1234'>


In [123]:
# find all punctuation

re.findall(r'[^\w\s]', paragraph)

['.',
 ',',
 ',',
 ',',
 '.',
 '-',
 '.',
 ',',
 ',',
 '-',
 ',',
 '"',
 '.',
 '"',
 '.',
 '"',
 ',',
 '.',
 '.',
 '.',
 '.',
 ',',
 '.',
 '.',
 '.',
 ',',
 '"',
 ',',
 '"',
 '.',
 '"',
 '-',
 '—',
 ',',
 ',',
 '.']

A couple of weirder ones. These match positions rather than characters:

* `\A` - matches the beginning of the string (like `^`, except that `^` can also match at the start of each line in multi-line mode)
* `\Z` - matches the end of the string
* `\b` - matches a word boundary: the empty string at the beginning or end of a word
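
A sketch of where these differ, assuming the `re.MULTILINE` flag is what "multi-line" refers to here (the strings are made up):

```python
import re

poem = 'one fish\ntwo fish'

# With re.MULTILINE, '^' also matches right after each newline...
caret = re.findall(r'^\w+', poem, re.MULTILINE)
# ...while '\A' still matches only at the very start of the string.
start_only = re.findall(r'\A\w+', poem, re.MULTILINE)

# '\b' keeps 'fish' from matching inside a longer word.
whole_words = re.findall(r'\bfish\b', 'fish, fishing, catfish, fish')
print(caret, start_only, whole_words)
```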

In [124]:
# words 3-5 letters long
re.findall(r'\b\w{3,5}\b', sentence)

['Mary', 'had', 'lamb']


## Capturing matches

We often want to hang onto the part of a string that matched a regex. Parentheses mark the parts of the pattern whose matches we want to capture.

In [128]:
locations = ['Atlanta, GA', 'Durham, NC', 'Little River, SC', 'Seattle', 'This is a long thing. Cleveland, OH', 'Winston-Salem, NC']

for location in locations:
    match = re.search(r'([\w\s-]+), ([A-Z]{2})', location)
    if match:
        print(match.groups())

('Atlanta', 'GA')
('Durham', 'NC')
('Little River', 'SC')
(' Cleveland', 'OH')
('Winston-Salem', 'NC')
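
Captured groups can also be read one at a time with `group(1)`, `group(2)`, and so on. Python's `(?P<name>...)` syntax gives a group a name, which is what `groupdict` in the help text above refers to. A sketch, with the names `city` and `state` chosen only for illustration:

```python
import re

match = re.search(r'(?P<city>[\w\s-]+), (?P<state>[A-Z]{2})', 'Winston-Salem, NC')

by_index = (match.group(1), match.group(2))   # groups are numbered from 1
by_name = match.groupdict()                   # named groups as a dictionary
print(by_index, by_name)
```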


In [133]:
phone_nums = [
    '999-555-1212',
    '(703) 440-5678',
    '800.555.1234',
    '3141355311',
    '1235',
]

for num in phone_nums:
    match = re.search(r'\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})', num)
    if match:
        print("({}) {}-{}".format(*match.groups()))


(999) 555-1212
(703) 440-5678
(800) 555-1234
(314) 135-5311
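
One way to package the loop above for reuse is a small helper. This is only a sketch: `PHONE` and `normalize_phone` are hypothetical names, not part of the `re` module.

```python
import re

# Compiling the pattern once gives it a name and lets it be reused.
PHONE = re.compile(r'\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})')

def normalize_phone(text):
    """Return the first phone number in text as '(AAA) PPP-NNNN', or None."""
    match = PHONE.search(text)
    if match is None:
        return None
    return '({}) {}-{}'.format(*match.groups())

print(normalize_phone('800.555.1234'))
print(normalize_phone('1235'))
```

Returning `None` for input like `'1235'` mirrors what `re.search` itself does when nothing matches.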
