# Regular Expressions in Python

By far, the best course I have found on Regular Expressions is:  
https://www.safaribooksonline.com/videos/understanding-regular-expressions/9781491996300  

The best website I have found for learning and experimenting with Regular Expressions is:  
https://regex101.com/ 

A good cookbook of practical regular expressions is:  
https://www.amazon.com/Regular-Expressions-Cookbook-Solutions-Programming/dp/1449319432/

## **re** module vs **regex** module

* **re** is part of standard library
* **regex** is a third-party module that supports more PCRE-style features, including variable-length lookbehind

In most cases, re and regex will do the same thing.  However, with advanced regular expressions, you may find yourself wanting to use regex over re.

PCRE means Perl Compatible Regular Expressions.
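
As a small illustration of the difference, the sketch below shows the standard re module rejecting a variable-length lookbehind. The pattern and variable names are made up for this example, and the regex module is only mentioned in a comment because it may not be installed.

```python
import re

# A lookbehind whose width is not fixed: "one or more a's" before the b.
variable_lookbehind = r'(?<=a+)b'

try:
    re.compile(variable_lookbehind)
    compiled_ok = True
except re.error as err:
    # The standard library requires lookbehind patterns to have a fixed width.
    compiled_ok = False
    print(f"re rejected the pattern: {err}")

# A fixed-width lookbehind compiles and matches as expected.
fixed_match = re.search(r'(?<=aa)b', 'xaab')
print(fixed_match.group())

# If the third-party regex package is installed, regex.compile(variable_lookbehind)
# is expected to accept the same pattern.
```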

## re.search() and re.finditer()

In practice, re.search() and re.finditer() may be the functions you use most.

re.search() finds the **first** match and returns it as a match object.  
re.finditer() returns an iterator over all non-overlapping matches.

re.search() returns None if there is no match.

The truth value of None is bool(None), which is False, so the result of re.search() can be tested directly in an if statement.

In [1]:
# The last expression in a Jupyter Notebook cell is displayed using __repr__
# For None, nothing is displayed
None

In [2]:
# The truth value of None is False
bool(None)

False

In [3]:
# you can print(None)
print(None)

None


In [4]:
import re

ml_text = """Machine learning is a method of data analysis that automates 
analytical model building. It is a branch of artificial intelligence 
based on the idea that systems can learn from data, identify patterns 
and make decisions with minimal human intervention.
"""

pattern = "minimal"
re.search(pattern, ml_text)

<re.Match object; span=(227, 234), match='minimal'>

In [5]:
# A Python pattern can be explicitly compiled before usage
compiled_pattern = re.compile(pattern)
compiled_pattern.search(ml_text)

<re.Match object; span=(227, 234), match='minimal'>

In [6]:
# None is returned when there is no match, so nothing is displayed
pattern = "xyz"
re.search(pattern, ml_text)
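
Because re.search() returns None when nothing matches, calling code usually guards against that before using the result. Here is a minimal sketch of that guard. The helper name `first_match_text` and the sample text are hypothetical, chosen only for this illustration.

```python
import re

def first_match_text(pattern, text, default=""):
    """Return the text of the first match, or default when nothing matches."""
    match = re.search(pattern, text)
    if match:                 # a match object is truthy; None is falsy
        return match.group()
    return default

sample = "systems can learn from data"
print(first_match_text("learn", sample))
print(first_match_text("xyz", sample, default="<no match>"))
```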

## Compiling Regular Expressions
If you do not explicitly compile the pattern, Python will compile it for you.  It will also cache many of the patterns it compiles so it does not have to recompile them when the same regex is used again.

### Software Engineering Note
In most cases, optimizing code to make it run faster makes it harder to read and maintain.  This incurs a software development cost that can be significant over the life cycle of the project.  As Knuth said, "premature optimization is the root of all evil".  It is usually best to write clear code.  Later the application can be profiled to see where time is being spent and that section of code can be optimized.  This approach reduces developer cost.

An exception to this is reducing the complexity of an algorithm, for example replacing a nested loop that runs in O(n^2) time with an approach that runs in O(n log n) time or faster.  Improving an operation that takes quadratic time to run will make a difference to overall application performance if the number of elements being processed is in the thousands or greater.  This is knowable without profiling.

In the case of regular expressions, the time being saved is just the compile time of the regular expression when it cannot be found in the cache or the lookup time to find it in the cache.  This is not much time.

In the case of regular expressions, it can be argued that manually compiling the regular expression does not make the code harder to read and maintain. If it's not harder to read and maintain, then you might as well manually compile the regular expression, *outside of all loops*.
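
A minimal sketch of compiling once, outside the loop, might look like the following. The log lines and the name `error_pattern` are invented for this example.

```python
import re

lines = ["error: disk full", "ok", "error: timeout", "warning: slow"]

# Compile once, before the loop, then reuse the compiled pattern inside it.
error_pattern = re.compile(r'^error: (.+)$')

messages = []
for line in lines:
    match = error_pattern.search(line)
    if match:
        messages.append(match.group(1))

print(messages)
```

The loop body reads almost the same as it would with re.search(), which is the argument for compiling by hand when a pattern is reused many times.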

For this notebook, I will argue that manually compiling the regex is ever so slightly harder to read, so most patterns are left uncompiled to make learning easier.

## Jupyter HTML Cell Magic

The following highlights the part of the string that is matched.

In [7]:
# IPython.display is the public import location for display and HTML
from IPython.display import display, HTML
import html
def highlight(match):
    """Display the matched part of the string with a different background color."""
    s = match.string
    start = match.start()
    end = match.end()
    span_start = '<span style="background-color:LightSalmon">'
    span_end = '</span>'
    html_string = html.escape(s[:start]) + span_start + \
        html.escape(s[start:end]) + span_end + html.escape(s[end:])
    display(HTML(html_string))

In [8]:
pattern = "minimal"
match = re.search(pattern, ml_text)
if match:
    highlight(match)

### re.search() Flags vs Regular Expression Flags

It is usually easier to read and maintain code that passes flags as arguments to re.search() than code that embeds flags inside the regular expression.

In rare situations, you may want a flag to be in effect for only part of the pattern.  To do this:
1. wrap that part of the pattern in a scoped flag group such as (?i:...), which the re module accepts in Python 3.6 and later
2. or use the regex module, which offers further control over where an inline flag takes effect
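
Here is a short sketch of a flag that applies to only part of a pattern. It assumes Python 3.6 or later, where the standard re module accepts the scoped form `(?i:...)`. The pattern and test strings are made up for illustration.

```python
import re

# Only "machine" is matched case-insensitively; " learning" must be lowercase.
scoped = r'(?i:machine) learning'

print(re.search(scoped, 'MACHINE learning'))
print(re.search(scoped, 'Machine LEARNING'))
```

The first search finds a match. The second returns None because the flag stops applying at the end of the group.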

In [9]:
# using flags to re.search()
pattern = "machine"
match = re.search(pattern, ml_text, flags=re.IGNORECASE)
if match:
    highlight(match)

In [10]:
# same as above, but put the flag inside the regex itself
# this is harder to read, especially when more metacharacters are used in the regex
pattern = "(?i)machine"
match = re.search(pattern, ml_text)
if match:
    highlight(match)

### Raw Strings

In Python, a raw string is a string literal prefixed with r, in which backslashes are taken literally instead of starting an escape sequence.

This is easiest to understand by example.

Usually a regular expression is written as a raw string in Python.

In [11]:
# \n represents new line
s = "a\nb"
print(s)

a
b


In [12]:
# r'\n' is taken literally as the characters \ and n
s = r"a\nb"
print(s)

a\nb
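
The following sketch shows why this matters for regular expressions. In a regular string literal, `\b` is turned into a backspace character before the regex engine ever sees it. The sample text is invented for the example.

```python
import re

text = 'a cat sat'

# In a regular string literal, '\b' is the backspace character (ASCII 8),
# so this pattern looks for backspace + "cat" + backspace.
plain_result = re.search('\bcat\b', text)

# In a raw string, the backslash reaches the regex engine,
# where \b means "word boundary".
raw_result = re.search(r'\bcat\b', text)

print(plain_result)
print(raw_result.group())
```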


## Quantifiers

```
r'ab*'      # a followed by zero or more b's
r'ab+'      # a followed by one or more b's
r'ab?'      # a followed by zero or one b's
r'ab{3}'    # a followed by three b's
r'ab{2,3}'  # a followed by two to three b's
r'ab{2,}'   # a followed by two or more b's
```
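
One way to check the table above is to test each pattern against a handful of candidate strings with re.fullmatch(), which only succeeds when the whole string matches. This is a small sketch. The candidate list is arbitrary.

```python
import re

patterns = [r'ab*', r'ab+', r'ab?', r'ab{3}', r'ab{2,3}', r'ab{2,}']
candidates = ['a', 'ab', 'abb', 'abbb', 'abbbb']

# For each pattern, list the candidates that it matches in full.
accepted = {
    p: [c for c in candidates if re.fullmatch(p, c)]
    for p in patterns
}

for p, matched in accepted.items():
    print(f'{p:10} {matched}')
```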

In [13]:
# a followed by 0+ b's
match = re.search(r'ab*', 'xax xabx ab aab')
if match:
    highlight(match)

In [14]:
# same as above, but find all matches
for match in re.finditer(r'ab*', 'xax xabx ab aab'):
    highlight(match)

In [15]:
# a followed by 0+ b's
# default is greedy: consume as many b's as possible
match = re.search(r'ab*', 'xabbx ab abbb')
if match:
    highlight(match)

In [16]:
# same as above, but find all matches
for match in re.finditer(r'ab*', 'xabbx ab abbb'):
    highlight(match)

### re.finditer()

As seen above, the first match returned by re.finditer() is the same match that re.search() returns.  
Subsequent matches are only available when using re.finditer().
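
Since the highlighted output only renders inside a notebook, here is a plain-text sketch of the same comparison. It collects the matched text and span of every match into a list.

```python
import re

text = 'xabbx ab abbb'

first = re.search(r'ab*', text)
all_matches = [(m.group(), m.span()) for m in re.finditer(r'ab*', text)]

print(first.group(), first.span())
print(all_matches)
```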

In [17]:
# 1+ of either a or b
# default is greedy, find the most a's or b's
for match in re.finditer(r'[ab]+', 'xabbx ab abbb'):
    highlight(match)

In [18]:
# 1+ of either a or b
# minimal match: place ? after quantifier
for match in re.finditer(r'[ab]+?', 'xabbx ab abbb'):
    highlight(match)

In [19]:
# a followed by 3 b's
for match in re.finditer(r'ab{3}', 'a ab abb abbbb abbb'):
    highlight(match)

In [20]:
# a followed by 3 or more b's
for match in re.finditer(r'ab{3,}', 'a ab abb abbbbbb abbb'):
    highlight(match)

## Anchors (aka Assertions)

In [21]:
# a followed by 2 b's and
# assert a word boundary (\b) immediately before the a and immediately after the last b
for match in re.finditer(r'\bab{2}\b', 'xabbx xabb abbx ab abbbbb abb abbbb abbb'):
    highlight(match)

In [22]:
# space followed by a followed by 2 b's followed by space
# Note: this match includes the spaces on either side of abb
for match in re.finditer(r' ab{2} ', 'xabbx xabb abbx ab abbbbb abb abbbb abbb'):
    highlight(match)
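
The difference between the last two cells can also be seen in the spans of the matches. The sketch below reuses the same string and adds one short string of my own (`'abb'`) to show the behavior at the edges of a string.

```python
import re

text = 'xabbx xabb abbx ab abbbbb abb abbbb abbb'

with_boundaries = re.search(r'\bab{2}\b', text)
with_spaces = re.search(r' ab{2} ', text)

# \b is zero-width, so only "abb" itself is part of the match.
print(repr(with_boundaries.group()), with_boundaries.span())
# Literal spaces are consumed, so the match is two characters longer.
print(repr(with_spaces.group()), with_spaces.span())

# Unlike a literal space, \b also succeeds at the very start or end of a string.
print(re.findall(r'\bab{2}\b', 'abb'))
print(re.findall(r' ab{2} ', 'abb'))
```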

In [23]:
# a or b 2 or more times
# this implies the match is of length 2 or greater
# note: inside [], | is a literal character, so the class is [ab] rather than [a|b]
for match in re.finditer(r'[ab]{2,}', 'a xbax baab abb abbbb abbb'):
    highlight(match)

In [24]:
# ab 2 or more times
for match in re.finditer(r'(ab){2,}', 'a ba baab xyabababz abbbb xababz'):
    highlight(match)

In [25]:
# ab 2 or more times with word boundaries on both sides
for match in re.finditer(r'\b(ab){2,}\b', 'a ba abab baab ab xyabababz ababab xababz'):
    highlight(match)

## Match any characters except ...

In [26]:
# match any character except '!' 1 or more times, greedy
for match in re.finditer(r'[^!]+', '!Hello World!'):
    highlight(match)

In [27]:
# match any character except '!' 1 or more times, minimal match
for match in re.finditer(r'[^!]+?', '!Hello World!'):
    highlight(match)
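
As a plain-text version of the two cells above, this sketch uses re.findall() to list what each pattern matches.

```python
import re

text = '!Hello World!'

greedy = re.findall(r'[^!]+', text)
minimal = re.findall(r'[^!]+?', text)

print(greedy)    # one long match
print(minimal)   # one match per character
```

With nothing after the quantifier to force it to grow, the minimal version stops after a single character each time.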

## Character Ranges

In [28]:
# match any lowercase character 1 or more times
for match in re.finditer(r'[a-z]+', 'Hello World!'):
    highlight(match)

In [29]:
# match any uppercase character 1 or more times
for match in re.finditer(r'[A-Z]+', 'Hello World!'):
    highlight(match)

## Sequences

In [30]:
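# \w matches any word character; with re.ASCII it is shorthand for: [a-zA-Z0-9_]
# (for str patterns it also matches Unicode letters and digits by default)
# match any word character 1 or more times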
for match in re.finditer(r'\w+', 'Hello World!'):
    highlight(match)
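
For str patterns, `\w` is wider than `[a-zA-Z0-9_]` unless the re.ASCII flag is given. This short sketch shows the difference. The sample text is my own and uses escapes for the accented letters.

```python
import re

text = 'caf\u00e9 na\u00efve'   # "café naïve", written with escapes to be explicit

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```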

In [31]:
# \d is shorthand for: [0-9] (plus other Unicode decimal digits for str patterns)
# match 1 or more digits
match = re.search(r'\d+', 'Hello 123 World!')
if match:
    highlight(match)

In [32]:
# \D is shorthand for: [^0-9]
# match 1 or more nondigits
for match in re.finditer(r'\D+', 'Hello 123 World!'):
    highlight(match)

In [33]:
# Match 1+ non-space characters
for match in re.finditer(r'\S+', 'This! is a string 1 2 345'):
    highlight(match)

In [34]:
# Match 1+ word characters followed by a space character
pattern = r'\w+ '
for match in re.finditer(pattern, 'This! is a string 1 2 345'):
    highlight(match)

In [35]:
# Match 1+ word characters and
# assert a word boundary at the next position
pattern = r'\w+\b'
for match in re.finditer(pattern, 'This! is a string 1 2 345'):
    highlight(match)

In [36]:
# match any character any number of times, greedily, then match a digit
match = re.search(r'.*\d', 'This! is a string 1 2 345')
if match:
    highlight(match)

In [37]:
# match any character any number of times, minimally, then match a digit
match = re.search(r'.*?\d', 'This! is a string 1 2 345')
if match:
    highlight(match)

In [38]:
# match any non-word character one or more times
for match in re.finditer(r'\W+', 'This! is a string 1 2 345'):
    highlight(match)

### RegEx Processing

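At each position in the string, starting from the left, the regex is tried.  If it does not match, the next position in the string is tried.

The first match found is always returned, so the leftmost match wins.  At a given position, alternatives are tried in the order they are written.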

In [39]:
# try each alternative at each position in the string
s = 'An anteater encountered an ant on an antelope'

pattern = r'ant|antelope|anteater'
match = re.search(pattern, s)
if match:
    highlight(match)
    
pattern = r'anteater|antelope|ant'
match = re.search(pattern, s)
if match:
    highlight(match)
    
pattern = r'antelope|anteater|ant'
match = re.search(pattern, s)
if match:
    highlight(match)    

In [40]:
# try each alternative at each position in the string, all matches
s = 'An anteater encountered an ant on an antelope'

pattern = r'ant|antelope|anteater'
for match in re.finditer(pattern, s):
    highlight(match) 

In [41]:
# try each alternative at each position in the string, all matches
s = 'An anteater encountered an ant on an antelope'

pattern = r'antelope|ant|anteater'
for match in re.finditer(pattern, s):
    highlight(match) 

In [42]:
# try each alternative at each position in the string, all matches
s = 'An anteater encountered an ant on an antelope'

pattern = r'antelope|anteater|ant'
for match in re.finditer(pattern, s):
    highlight(match) 

### Order of Matching Alternatives

As per above, if you want the longest match among alternatives that have the same starting characters, then sort the alternatives by length, longest first.
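
A minimal sketch of building such a pattern from a word list follows. The `words` list and the use of re.escape() are my additions. The escape step only matters if the words could contain regex metacharacters.

```python
import re

s = 'An anteater encountered an ant on an antelope'
words = ['ant', 'antelope', 'anteater']

# Longest alternatives first; re.escape guards against metacharacters in the words.
ordered = sorted(words, key=len, reverse=True)
pattern = '|'.join(re.escape(w) for w in ordered)

print(pattern)
print(re.findall(pattern, s))
print(re.findall('ant|antelope|anteater', s))
```

With the sorted alternatives, each word is matched in full. With 'ant' listed first, every match stops at 'ant'.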

## Match RegEx Metacharacters

In [43]:
# $ is the metacharacter for match at end of string
# to search for it directly, escape it
s = 'The 22nd item costs $999 or more'
pattern = r'\$\d+'

match = re.search(pattern, s)
if match:
    highlight(match)     

In [44]:
s = 'The 22nd item costs $999 or more'
pattern = r'\d+$'

# no match
match = re.search(pattern, s)
if match:
    highlight(match)

In [45]:
s = 'The 22nd item costs $999'
pattern = r'\d+$'

match = re.search(pattern, s)
if match:
    highlight(match)

In [46]:
# backtracking
# 1 or more alphanumeric characters, greedy, with 2 digits at the end
for match in re.finditer(r'\w+\d\d', 'A123 BC3456 78 D7'):
    highlight(match)

In [47]:
# backtracking
# 1 or more alphanumeric characters, minimal match, with 2 digits at the end
for match in re.finditer(r'\w+?\d\d', 'A123 BC3456 78 D7'):
    highlight(match)

In [48]:
# exactly 1 alphanumeric character, followed by exactly 2 digits
for match in re.finditer(r'\w\d\d', 'A123 BC3456 78 D7'):
    highlight(match)
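
The three backtracking cells above can be compared in plain text with re.findall(). This is just a sketch of the same patterns on the same string.

```python
import re

text = 'A123 BC3456 78 D7'

greedy = re.findall(r'\w+\d\d', text)    # \w+ takes all it can, then gives back 2 digits
minimal = re.findall(r'\w+?\d\d', text)  # \w+? takes as little as it can
exact = re.findall(r'\w\d\d', text)      # exactly one word character, then 2 digits

print(greedy)
print(minimal)
print(exact)
```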

In [49]:
# greedy search does not find what is inside each set of ""
# instead it matches from the first " to the last "
pattern = r'".*"'
for match in re.finditer(pattern, '"Hello World!" and some "more" text "" and more'):
    highlight(match)

In [50]:
# minimal search will find what is inside each set of quotes
# Note: this search includes the double quotes
for match in re.finditer(r'".*?"', '"Hello World!" and some "more" text "" and more'):
    highlight(match)
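
For reference, here is a text-only sketch of the greedy and minimal quote searches. The third pattern, a negated character class, is an alternative I am adding for comparison. It is not used elsewhere in this notebook.

```python
import re

text = '"Hello World!" and some "more" text "" and more'

greedy = re.findall(r'".*"', text)
minimal = re.findall(r'".*?"', text)
negated = re.findall(r'"[^"]*"', text)   # a quote, any non-quotes, a quote

print(greedy)
print(minimal)
print(negated)
```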

## Groups

In [51]:
def highlight_groups(match):
    """Display the matched part of each group, each with a different background color.
    """
    s = match.string
    num_groups = len(match.groups())
    
    if num_groups > 9:
        raise ValueError("Too Many Groups")
        
    colors = ['LightSalmon','GreenYellow', 'Orange'] * 3
    span_end = '</span>'
    
    # get start and end points
    starts = []
    ends = []
    for group_num in range(num_groups):
        start, end = match.span(group_num+1)
        starts.append(start)
        ends.append(end)
        
    starts.append(len(s))
    
    # escape html characters as this will be rendered as HTML to the Jupyter Notebook cell
    html_string = html.escape(s[:starts[0]])
    for n in range(num_groups):
        span_start = f'<span style="background-color:{colors[n]}">'
        html_string = html_string + span_start + \
            html.escape(s[starts[n]:ends[n]]) + span_end + html.escape(s[ends[n]:starts[n+1]])
    
    display(HTML(html_string))

In [52]:
# minimal search will find what is inside each set of ""
# note: this search does not include the double quotes
for match in re.finditer(r'"(.*?)"', '"Hello World!" and some "more" text "" and more'):
    highlight_groups(match)

In [53]:
# a non-capturing () begins with (?:
s = 'cp original dup and some text mv oldname newname'
pattern = r'(?:cp|mv|ln)\s+(\w+)\s+(\w+)'
for match in re.finditer(pattern, s):
    highlight_groups(match)

In [54]:
# parentheses create capture groups by default
s = 'cp original dup and some text mv oldname newname'
pattern = r'(cp|mv|ln)\s+(\w+)\s+(\w+)'
for match in re.finditer(pattern, s):
    highlight_groups(match)

In [55]:
# named captures make it easy for the code to use the capture groups
s = 'cp original dup and some text mv oldname newname'
pattern = r'(?P<op>cp|mv|ln)\s+(?P<source>\w+)\s+(?P<target>\w+)'
for match in re.finditer(pattern, s):
    print(match.groupdict())

{'op': 'cp', 'source': 'original', 'target': 'dup'}
{'op': 'mv', 'source': 'oldname', 'target': 'newname'}


In [56]:
s = '< stuff inside angle brackets > this is not inside <more stuff inside angle brackets>'
pattern = r'<(.*?)>'
for match in re.finditer(pattern, s):
    highlight_groups(match)

In [57]:
# by default, $ matches at the end of the string, or just before a newline that ends the string
s = """
This is string1
This is string2
"""

for match in re.finditer(r'\d$', s):
    highlight(match)
# Note: the highlight function uses HTML which causes \n to be ignored on display

In [58]:
# use flag to consider $ to match end of each line
s = """
This is string1
This is string2
"""

for match in re.finditer(r'\d$', s, flags=re.MULTILINE):
    highlight(match)
# Note: the highlight function uses HTML which causes \n to be ignored on display
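
Because the HTML display hides the newlines, a plain-text sketch makes the effect of re.MULTILINE easier to see. The `\Z` pattern is an addition of mine. It matches only at the very end of the string.

```python
import re

s = "\nThis is string1\nThis is string2\n"

default_matches = re.findall(r'\d$', s)
multiline_matches = re.findall(r'\d$', s, flags=re.MULTILINE)
strict_end_matches = re.findall(r'\d\Z', s)

print(default_matches)     # only the digit before the final newline
print(multiline_matches)   # the digit at the end of each line
print(strict_end_matches)  # nothing: the string ends with a newline, not a digit
```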

## Lookahead and Lookbehind

These are assertions about what is around the pattern.  They are not considered part of the match.

Lookahead means to assert what is beyond the end of the match.  
Lookbehind means to assert what is before the start of the match.  
Negative means to negate the logic of what is being asserted.  
Positive means to use the logic of what is being asserted.

In [59]:
# positive lookahead: ?=
s = 'end endwh1le endloop wh2le loop end'

# find e followed by n followed by d and assert that next position is whitespace
pattern = r'end(?=\s)'
for match in re.finditer(pattern, s):
    highlight(match)

In [60]:
# negative lookahead: ?!
s = 'end endwh1le endloop wh2le loop end'

# find e followed by n followed by d and assert that next position is not a word character
pattern = r'end(?!\w)'
for match in re.finditer(pattern, s):
    highlight(match)

In [61]:
# positive lookbehind: ?<=
s = 'end endwh1le endloop wh2le loop end'

# find w followed by h followed by any character followed by l followed by e and
# assert that preceding characters are e,n,d
pattern = r'(?<=end)wh.le'
for match in re.finditer(pattern, s):
    highlight(match)

In [62]:
# negative lookbehind: ?<!
s = 'end endwh1le endloop wh2le loop end'

# find w followed by h followed by any character followed by l followed by e and
# assert that preceding characters are not e,n,d
pattern = r'(?<!end)wh.le'
for match in re.finditer(pattern, s):
    highlight(match)
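
The four lookaround cells above can be compared side by side in plain text. The helper `spans` is a hypothetical convenience function written for this sketch.

```python
import re

s = 'end endwh1le endloop wh2le loop end'

def spans(pattern):
    """Return (text, span) pairs for every match of pattern in s."""
    return [(m.group(), m.span()) for m in re.finditer(pattern, s)]

print(spans(r'end(?=\s)'))       # positive lookahead
print(spans(r'end(?!\w)'))       # negative lookahead
print(spans(r'(?<=end)wh.le'))   # positive lookbehind
print(spans(r'(?<!end)wh.le'))   # negative lookbehind
```

Note that the final 'end' is found by the negative lookahead but not by the positive one. At the end of the string there is no whitespace character to see, but there is also no word character.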

In [63]:
# using previous capture group to detect repeated words
s = 'there are repeated repeated words in this string'
pattern = r'(\b\S+)\s+(\1)\b'
for match in re.finditer(pattern, s):
    highlight_groups(match)

In [64]:
# using previous capture group to detect repeated words
s = 'there are are repeated words in this string'
pattern = r'(\b\S+)\s+(\1)\b'
for match in re.finditer(pattern, s):
    highlight_groups(match)
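
A backreference can also be used in a replacement. The sketch below collapses repeated words with sub(). It uses `\w+` instead of `\S+` and a combined sample string. Both are choices made only for this example.

```python
import re

s = 'there are are repeated repeated words in this string'

# \1 refers back to whatever the first group captured.
repeated = re.compile(r'\b(\w+)\s+\1\b')

found = [m.group(1) for m in repeated.finditer(s)]
cleaned = repeated.sub(r'\1', s)

print(found)
print(cleaned)
```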

In [65]:
pattern_compiled = re.compile(r"""(?x)
(put|get)\s+
(files|buffers)\s+
in\s+
(\w+)
"""
                             )

In [66]:
s = 'put files in mydirectory'
for match in pattern_compiled.finditer(s):
    # finditer only yields match objects, so no truthiness check is needed
    highlight_groups(match)

In [67]:
pattern_compiled = re.compile(r"""(?x)
(?P<CMD>put|get)\s+
(?P<TYPE>files|buffers)\s+
in\s+
(?P<TARGET>\w+)
"""
                             )

In [68]:
s = 'put files in mydirectory'
match = pattern_compiled.search(s)
match.groupdict()

{'CMD': 'put', 'TYPE': 'files', 'TARGET': 'mydirectory'}

In [69]:
print(match.group("CMD"))
print(match.group("TYPE"))
print(match.group("TARGET"))

put
files
mydirectory
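
The `(?x)` at the start of these patterns turns on verbose mode, in which whitespace in the pattern is ignored and `#` starts a comment. Here is a sketch of the same pattern with the flag passed as an argument and a comment on each line. The name `command_pattern` and the input string are hypothetical.

```python
import re

command_pattern = re.compile(
    r"""
    (?P<CMD>put|get)\s+          # the operation
    (?P<TYPE>files|buffers)\s+   # what is being transferred
    in\s+                        # the literal word "in"
    (?P<TARGET>\w+)              # destination name
    """,
    flags=re.VERBOSE,
)

match = command_pattern.search('get buffers in cache01')
print(match.groupdict())
```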


In [70]:
pattern = re.compile(r'(?P<digits>\d+) (?P<letters>[a-zA-Z])')
s = '123 45 6789 a234'
for match in pattern.finditer(s):
    # finditer only yields match objects, so no truthiness check is needed
    highlight_groups(match)

In [71]:
print(match.group('digits'))

6789


In [72]:
print(match.group('letters'))

a


<a name="example"></a>