Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. 

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.


Matching Characters
Most letters and characters will simply match themselves. For example, the regular expression test will match the string test exactly.
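As a minimal sketch of literal matching, the cell below uses `re.fullmatch` and `re.search` from the standard library. The sample strings are made up for illustration.

```python
import re

# A pattern with no metacharacters matches only that exact sequence of characters.
literal = re.compile(r"test")

exact = literal.fullmatch("test")          # the whole string must equal "test"
inside = literal.search("a test string")   # "test" is found somewhere inside
wrong_case = literal.search("TEST")        # matching is case-sensitive by default

print(exact is not None)
print(inside.span())
print(wrong_case)
```

The last call finds nothing because no flag was passed to relax case sensitivity.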

There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. Much of this document is devoted to discussing various metacharacters and what they do.

Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.

. ^ $ * + ? { } [ ] \ | ( )
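Because these characters are special, matching one of them literally needs a backslash in front of it. The sketch below illustrates this with `.` and also uses `re.escape`, which adds the backslashes for an arbitrary literal string. The sample strings are invented for the example.

```python
import re

price = "Total: $5.00 (approx.)"

# Unescaped, '.' matches any character, so "5x00" is accepted as well.
loose = re.findall(r"5.00", "5.00 5x00")

# A backslash removes the special meaning of the metacharacter that follows it.
strict = re.findall(r"5\.00", "5.00 5x00")

# re.escape() escapes every special character in a literal string.
escaped = re.escape("$5.00")
literal_hit = re.search(escaped, price)

print(loose)
print(strict)
print(literal_hit.group())
```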

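Metacharacters (except \) are not active inside character classes, which are written between [ and ]. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature. (Within a class, a leading ^ negates the set and - between two characters denotes a range.)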
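The difference between `$` inside and outside a character class can be checked with a short sketch like the following. The test strings are hypothetical.

```python
import re

# Inside [...] the '$' is an ordinary character, not the end-of-string anchor.
char_class = re.compile(r"[akm$]")
found = char_class.findall("make $5 ask")

# Outside a class, '$' matches the empty position at the end of the string.
anchor = re.search(r"$", "cost$ now")

print(found)
print(anchor.span())
```

The anchor match is empty: its start and end are both the length of the string.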

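> '*' - zero or more repetitions of the preceding item

> '+' - one or more repetitions of the preceding item

> '?' - zero or one repetition (the preceding item is optional)

> '{m,n}' - at least m and at most n repetitions (written without a space after the comma)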
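To see how the four quantifiers differ, the sketch below runs each one against the same small set of strings. The helper `full_matches` and the sample list are assumptions introduced only for this illustration.

```python
import re

samples = ["a", "ab", "abb", "abbb", "abbbb"]

def full_matches(pattern):
    """Return the sample strings that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full_matches(r"ab*")         # 'a' followed by zero or more 'b'
plus = full_matches(r"ab+")         # 'a' followed by one or more 'b'
optional = full_matches(r"ab?")     # 'a' followed by zero or one 'b'
bounded = full_matches(r"ab{2,3}")  # 'a' followed by two or three 'b'

for name, result in [("*", star), ("+", plus), ("?", optional), ("{2,3}", bounded)]:
    print(name, result)
```

`re.fullmatch` is used here so that each pattern has to account for the whole string; with `re.match`, every pattern would succeed on a prefix of every sample.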


In [1]:
import re

In [2]:
p = re.compile('ab*')

In [3]:
p

re.compile(r'ab*', re.UNICODE)

In [11]:
print(p.match("ab"))

print(p.match("acbcd"))

<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 1), match='a'>


In [19]:
m = p.match("ab")
m

<re.Match object; span=(0, 2), match='ab'>

In [20]:
print( m.group(),
      m.start(),
      m.end(),
      m.span(), sep='\n')

ab
0
2
(0, 2)


In [21]:
print(p.match('::: message'))

None


In [22]:
m = p.search('::: message')
print(m)

<re.Match object; span=(8, 9), match='a'>


In [24]:
m.group(), m.span()

('a', (8, 9))

In [35]:
p = re.compile('ab*')
m = p.match('Starting goes here')

if m:
    print('Match found', m.group())
else:
    print('No Match')

No Match


In [38]:

p = re.compile(r'\d+')  # raw string, so \d is not treated as a string escape
m = p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
m

['12', '11', '10']

In [42]:
for match in m:
    print(match)

12
11
10


In [64]:
iterator = p.finditer(
    '12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
iterator


<callable_iterator at 0x1e145ad6380>

In [65]:
dict_digi = { }
for match in iterator:
    dict_digi[match.group()] = match.span()
    


In [66]:
dict_digi

{'12': (0, 2), '11': (22, 24), '10': (40, 42)}

In [67]:
re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')


<re.Match object; span=(0, 5), match='From '>

Compilation Flags

Flag                          Meaning
ASCII, A                      Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
DOTALL, S                     Make . match any character, including newlines.
IGNORECASE, I                 Do case-insensitive matches.
LOCALE, L                     Do a locale-aware match.
MULTILINE, M                  Multi-line matching, affecting ^ and $.
VERBOSE, X (for ‘extended’)   Enable verbose REs, which can be organized more cleanly and understandably.
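The effect of several of these flags can be sketched on one two-line string. The text and the number pattern below are invented for the example; the flags themselves are the ones from the `re` module.

```python
import re

text = "First line\nsecond LINE"

# IGNORECASE: 'line' also matches 'LINE'.
ignore_case = re.findall(r"line", text, flags=re.IGNORECASE)

# DOTALL: '.' also matches the newline, so the match can span both lines.
dotall = re.search(r"First.*second", text, flags=re.DOTALL)
no_dotall = re.search(r"First.*second", text)

# MULTILINE: '^' matches at the start of every line, not only of the string.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# VERBOSE: whitespace and comments inside the pattern are ignored.
number = re.compile(r"""
    \d+          # integer part
    (?:\.\d+)?   # optional fractional part
""", re.VERBOSE)
verbose = number.findall("pi is 3.14, e is about 2.718, n is 42")

print(ignore_case)
print(dotall.span(), no_dotall)
print(line_starts)
print(verbose)
```

Flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.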


In [68]:
import re

In [71]:
print(re.findall('abc', 'abcde'))

['abc']


In [73]:
p = re.compile('abc')
m = p.search('abcabcabc')

In [74]:
m

<re.Match object; span=(0, 3), match='abc'>

In [75]:
p.findall('abcabcabc')

['abc', 'abc', 'abc']

In [77]:
for i in p.finditer('abcabcabc'):
    print(i, i.span())

<re.Match object; span=(0, 3), match='abc'> (0, 3)
<re.Match object; span=(3, 6), match='abc'> (3, 6)
<re.Match object; span=(6, 9), match='abc'> (6, 9)


In [85]:
dir(re)

['A',
 'ASCII',
 'DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'Match',
 'NOFLAG',
 'Pattern',
 'RegexFlag',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_cache',
 '_casefix',
 '_compile',
 '_compile_repl',
 '_compiler',
 '_constants',
 '_expand',
 '_parser',
 '_pickle',
 '_special_chars_map',
 '_subx',
 'compile',
 'copyreg',
 'enum',
 'error',
 'escape',
 'findall',
 'finditer',
 'fullmatch',
 'functools',
 'match',
 'purge',
 'search',
 'split',
 'sub',
 'subn',
 'template']

In [87]:
re.split(r'\W+', 'I am Ravi Kumar')

['I', 'am', 'Ravi', 'Kumar']

In [88]:
re.split(r'(\W+)', 'I am Ravi Kumar')

['I', ' ', 'am', ' ', 'Ravi', ' ', 'Kumar']

In [110]:
re.split(r'\W+', '99 152 52564 658252 kfa sdfjl')

['99', '152', '52564', '658252', 'kfa', 'sdfjl']

In [115]:
# \W* can match the empty string, so the text is also split between every character
re.split(r'\W*', '99 152 52564 658252 kfa sdfjl')


['',
 '9',
 '9',
 '',
 '1',
 '5',
 '2',
 '',
 '5',
 '2',
 '5',
 '6',
 '4',
 '',
 '6',
 '5',
 '8',
 '2',
 '5',
 '2',
 '',
 'k',
 'f',
 'a',
 '',
 's',
 'd',
 'f',
 'j',
 'l',
 '']

In [118]:
re.sub(r'\bAND\b', '&', 'Baked Beans And Spam', flags=re.IGNORECASE)


'Baked Beans & Spam'

In [117]:
re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)

'Baked Beans & Spam'

In [119]:

# Making a Phonebook
text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""


In [121]:
entries = re.split("\n+", text)
entries

['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']

In [127]:
# maxsplit=4 keeps the street address together as the final field
[re.split(":? ", entry, maxsplit=4) for entry in entries]

[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

In [130]:
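# Text Munging
import random

def repl(m):
    # Shuffle the inner letters of a word, keeping the first and last in place.
    inner_word = list(m.group(2))
    print(inner_word)  # show the inner letters before they are shuffled
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)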


text = "Professor Abdolmalek, please report your absences promptly."
re.sub(r"(\w)(\w+)(\w)", repl, text)


['r', 'o', 'f', 'e', 's', 's', 'o']
['b', 'd', 'o', 'l', 'm', 'a', 'l', 'e']
['l', 'e', 'a', 's']
['e', 'p', 'o', 'r']
['o', 'u']
['b', 's', 'e', 'n', 'c', 'e']
['r', 'o', 'm', 'p', 't', 'l']


'Poersosfr Adboemlalk, pselae reoprt your asbeencs prpomlty.'

In [132]:
text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly\b", text)


['carefully', 'quickly']