In [2]:
import re

- What are the main functions of the re module?

The ‘re’ module provides several functions for querying and transforming an input string:

    - re.match()
    - re.search()
    - re.findall()
    - re.split()
    - re.sub()
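
Of the functions listed above, re.split() and re.sub() are the two that modify or break up a string rather than just report matches. The following is a minimal sketch of both; the variable names `sample`, `parts`, and `masked` are illustrative and not part of the original notebook.

```python
import re

sample = 'A flock of 120 quick brown foxes jumped over 30 lazy brown, bears.'

# re.split(): break the string wherever the pattern matches.
# Here the separator is a comma followed by optional whitespace.
parts = re.split(r',\s*', sample)
print(parts)

# re.sub(): replace every match of the pattern with a replacement string.
# Here each run of digits is replaced by a single '#'.
masked = re.sub(r'\d+', '#', sample)
print(masked)
```

Note that the separator matched by re.split() is dropped from the result unless the pattern wraps it in a capturing group.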

In [24]:
# Create some data
text = 'A flock of 120 quick brown foxes jumped over 30 lazy brown, bears.'

In [25]:
# ^ Matches beginning of line.
re.findall('^A', text)

['A']

In [26]:
# $ Matches end of line. The dot is escaped so it matches a literal period.
re.findall(r'bears\.$', text)

['bears.']

In [27]:
# . Matches any single character except newline.
re.findall('f..es', text)

['foxes']

In [28]:
# Find all vowels
re.findall('[aeiou]', text)

['o', 'o', 'u', 'i', 'o', 'o', 'e', 'u', 'e', 'o', 'e', 'a', 'o', 'e', 'a']

In [29]:
# Find all characters that are not lower-case vowels
re.findall('[^aeiou]', text)

['A',
 ' ',
 'f',
 'l',
 'c',
 'k',
 ' ',
 'f',
 ' ',
 '1',
 '2',
 '0',
 ' ',
 'q',
 'c',
 'k',
 ' ',
 'b',
 'r',
 'w',
 'n',
 ' ',
 'f',
 'x',
 's',
 ' ',
 'j',
 'm',
 'p',
 'd',
 ' ',
 'v',
 'r',
 ' ',
 '3',
 '0',
 ' ',
 'l',
 'z',
 'y',
 ' ',
 'b',
 'r',
 'w',
 'n',
 ',',
 ' ',
 'b',
 'r',
 's',
 '.']

In [30]:
# a | b Matches either a or b.
re.findall('a|A', text)

['A', 'a', 'a']

In [31]:
# (...) Groups a sub-pattern; findall() returns the text captured by the group
re.findall('(foxes)', text)

['foxes']

In [33]:
# \w Matches word characters.
# Find runs of six consecutive word characters
re.findall(r'\w\w\w\w\w\w', text)

['jumped']

In [34]:
# \W Matches nonword characters.
re.findall(r'\W\W', text)

[', ']

In [35]:
# \s Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v].
re.findall(r'\s', text)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

In [37]:
# \S Matches nonwhitespace.
re.findall(r'\S\S\S', text)

['flo',
 '120',
 'qui',
 'bro',
 'fox',
 'jum',
 'ped',
 'ove',
 'laz',
 'bro',
 'wn,',
 'bea',
 'rs.']

In [39]:
# \d Matches digits. For ASCII text, equivalent to [0-9].
re.findall(r'\d\d', text)

['12', '30']

In [40]:
# [Ff]oxes Match "Foxes" or "foxes"
re.findall('[Ff]oxes', 'foxes Foxes Doxes')

['foxes', 'Foxes']

In [41]:
# [0-9] Match any digit; same as [0123456789]
re.findall('[0-9]', text)

['1', '2', '0', '3', '0']

In [42]:
# [A-Z] Match any uppercase ASCII letter
re.findall('[A-Z]', 'foxes Foxes')

['F']

In [43]:
# [a-zA-Z0-9] Match any of the above
re.findall('[a-zA-Z0-9]', 'foxes Foxes')

['f', 'o', 'x', 'e', 's', 'F', 'o', 'x', 'e', 's']

In [44]:
# [^aeiou] Match anything other than a lowercase vowel
re.findall('[^aeiou]', 'foxes Foxes')

['f', 'x', 's', ' ', 'F', 'x', 's']

In [45]:
# [^0-9] Match anything other than a digit
re.findall('[^0-9]', 'foxes Foxes')

['f', 'o', 'x', 'e', 's', ' ', 'F', 'o', 'x', 'e', 's']

In [46]:
# \d{3} Match exactly 3 digits
re.findall(r'\d{3}', text)

['120']

In [47]:
# \d{2,} Match 2 or more digits
re.findall(r'\d{2,}', text)

['120', '30']

In [48]:
# \d{2,3} Match 2 or 3 digits
re.findall(r'\d{2,3}', text)

['120', '30']

# More examples

The __match()__ function returns a match object if the pattern matches at the beginning of the text. Otherwise it returns None.

In [17]:
pattern  = r"Cookie"
sequence = "Cookie is great"

result = re.match(pattern, sequence)

if result:
    print("Match!")
    # group(0) is the full text matched; only safe to call when result is not None
    print(result.group(0))
else:
    print("Not a match!")

Match!
Cookie
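
Because match() only looks at the start of the string, it can be useful to compare it with search() and fullmatch() on the same input. This is a small sketch reusing the "Cookie is great" string from the cell above; the variable names `at_start`, `found`, and `whole` are illustrative.

```python
import re

sequence = "Cookie is great"

# match() only succeeds if the pattern matches at position 0.
at_start = re.match(r"great", sequence)
print(at_start)

# search() scans the whole string and returns the first match it finds.
found = re.search(r"great", sequence)
print(found.group(0), found.start())

# fullmatch() requires the pattern to cover the entire string.
whole = re.fullmatch(r"Cookie", sequence)
print(whole)
```

Here only search() finds "great", since the word appears in the middle of the string and the other two functions constrain where the match may sit.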


#### Multiple Matches

The findall() function returns all of the substrings of the input that match the pattern without overlapping.

In [21]:
import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.findall(pattern, text):
    print ('Found "%s"' % match)

Found "ab"
Found "ab"


finditer() returns an iterator that produces Match instances instead of the strings returned by findall().

In [22]:
text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print ('Found "%s" at %d:%d' % (text[s:e], s, e))

Found "ab" at 0:2
Found "ab" at 5:7


# Finding Patterns in Text

The most common use for re is to search for patterns in text. This example looks for two literal strings, 'this' and 'that', in a text string.

In [15]:
patterns = [ 'this', 'that' ]
text     = 'Does this text match the pattern?'

for pattern in patterns:
    print ('Looking for "%s" in "%s" ->' % (pattern, text))

    if re.search(pattern,  text):
        print ('found a match!')
    else:
        print ('no match')

Looking for "this" in "Does this text match the pattern?" ->
found a match!
Looking for "that" in "Does this text match the pattern?" ->
no match


search() takes the pattern and text to scan, and returns a Match object when the pattern is found. If the pattern is not found, search() returns None.

In [16]:
pattern = 'this'
text    = 'Does this text match the pattern?'

match = re.search(pattern, text)

s = match.start()
e = match.end()

print ('Found "%s" in "%s" from %d to %d ("%s")' % (match.re.pattern, match.string, s, e, text[s:e]))

Found "this" in "Does this text match the pattern?" from 5 to 9 ("this")


The start() and end() methods give the integer indexes into the string showing where the text matched by the pattern occurs.
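
Since search() returns None when nothing matches, calling start() or end() on the result without a check raises an AttributeError. A possible way to guard against that is sketched below; `describe_match` is a hypothetical helper written for this illustration, and span() returns the same (start, end) pair in a single call.

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        return 'no match for "%s"' % pattern
    # span() returns (start, end) as a tuple
    start, end = match.span()
    return '"%s" found at %d:%d' % (match.group(0), start, end)

print(describe_match('this', 'Does this text match the pattern?'))
print(describe_match('that', 'Does this text match the pattern?'))
```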

# Compiling Expressions

In [20]:
# Pre-compile the patterns
regexes = [ re.compile(p) for p in [ 'this', 'that', ]]
text    = 'Does this text match the pattern?'

for regex in regexes:
    print ('Looking for "%s" in "%s" ->' % (regex.pattern, text))

    if regex.search(text):
        print ('found a match!')
    else:
        print ('no match')

Looking for "this" in "Does this text match the pattern?" ->
found a match!
Looking for "that" in "Does this text match the pattern?" ->
no match
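
A compiled pattern object offers the same operations as the module-level functions (search(), findall(), sub(), and so on), which makes it convenient to define a pattern once and reuse it. The sketch below assumes some made-up input lines; the names `number`, `fox`, and `lines` are illustrative.

```python
import re

# Compile once, then reuse the pattern object's methods.
number = re.compile(r'\d+')

lines = ['120 foxes', 'no digits here', '30 bears and 7 owls']

counts = [number.findall(line) for line in lines]
for line, found in zip(lines, counts):
    print(line, '->', found)

# Flags can be supplied at compile time instead of on every call.
fox = re.compile(r'fox', re.IGNORECASE)
replaced = fox.sub('cat', 'Fox and fox')
print(replaced)
```

The re module also caches recently used patterns internally, so for a handful of patterns the benefit of compiling is mostly clarity rather than speed.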


# Repetition

There are five ways to express repetition in a pattern.

- A pattern followed by the metacharacter * is repeated zero or more times (allowing a pattern to repeat zero times means it does not need to appear at all to match).

- Replace the * with + and the pattern must appear at least once.

- Using ? means the pattern appears zero or one time.

- For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat.

- Finally, to allow a variable but limited number of repetitions, use {m,n}, where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the pattern appears at least m times, with no maximum.
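
The repetition forms described above can be compared side by side by applying each one to the same input. This is a minimal sketch; the test string `sample` and the `results` dictionary are made up for illustration. In every pattern the quantifier applies only to the preceding `b`.

```python
import re

sample = 'a ab abb abbb abbbb'

patterns = [
    r'ab*',      # a followed by zero or more b
    r'ab+',      # a followed by one or more b
    r'ab?',      # a followed by zero or one b
    r'ab{3}',    # a followed by exactly three b
    r'ab{2,3}',  # a followed by two or three b
    r'ab{2,}',   # a followed by at least two b
]

results = {}
for pattern in patterns:
    results[pattern] = re.findall(pattern, sample)
    print('%-8s %s' % (pattern, results[pattern]))
```

All of these quantifiers are greedy, so each match consumes as many `b` characters as the quantifier allows.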

In [49]:
pattern     = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")

Search successful.


In [50]:
# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

['12', '89', '34']
