# Regular Expressions in Python

By far, the best course I have found on Regular Expressions is:  
https://www.safaribooksonline.com/videos/understanding-regular-expressions/9781491996300  

The best website I have found for learning and experimenting with Regular Expressions is:  
https://regex101.com/ 

## **re** module vs **regex** module

* **re** is part of standard library
* **regex** is a third-party module with newer PCRE-style features, such as variable-length lookbehind, and more

In most cases, re and regex will do the same thing.  However, with advanced regular expressions, you may find yourself wanting to use regex over re.

PCRE means Perl Compatible Regular Expressions
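
As a small sketch of the difference mentioned above, the standard re module rejects a lookbehind whose length can vary, while a fixed-width lookbehind is accepted. The pattern and variable names here are only illustrative; the regex module is expected to accept the first pattern, but that part is not run here because regex is a separate install.

```python
import re

# A lookbehind whose length can vary (one or more 'a' characters).
variable_lookbehind = r'(?<=a+)b'

try:
    re.compile(variable_lookbehind)
    compiled_ok = True
except re.error as err:
    # re only accepts fixed-width lookbehind patterns
    compiled_ok = False
    print("re rejected the pattern:", err)

# A fixed-width lookbehind (exactly one 'a') works in re.
fixed_match = re.search(r'(?<=a)b', 'aab')
print(fixed_match.span())
```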

## re.search()

In practice, re.search() may be the method you use most.

re.search() will find the **first** match and return that as a match object.

If there is no match, it will return None.

The truthiness of None is bool(None) which is False.

In [1]:
# The last expression in a Jupyter Notebook cell is displayed using __repr__
# For None, nothing is displayed
None

In [2]:
# The truth value of None is False
bool(None)

False

In [3]:
# you can print(None) though
print(None)

None


In [4]:
import re

ml_text = """Machine learning is a method of data analysis that automates 
analytical model building. It is a branch of artificial intelligence 
based on the idea that systems can learn from data, identify patterns 
and make decisions with minimal human intervention.
"""

pattern = "minimal"
re.search(pattern, ml_text)

<re.Match object; span=(227, 234), match='minimal'>

In [5]:
# A Python pattern can be explicitly compiled before usage
compiled_pattern = re.compile(pattern)
compiled_pattern.search(ml_text)

<re.Match object; span=(227, 234), match='minimal'>

In [6]:
# None is returned when there is no match, so nothing is displayed
pattern = "xyz"
re.search(pattern, ml_text)

## Compiling Regular Expressions
If you do not explicitly compile the pattern, Python will compile it for you.  It will also cache many of the patterns it compiles so that it does not have to recompile them when the same search is performed again.

### Software Engineering Note
In most cases, optimizing code to make it run faster makes it harder to read and maintain.  This incurs a software development cost that can be significant over the life cycle of the project.  As Knuth says, "premature optimization is the root of all evil".  It is usually best to write clear code first.  Later, the application can be profiled to see where time is being spent, and that section of code can be optimized.  This approach reduces developer cost.

An exception to this is optimizing a nested loop from, say, O(n^2) to O(n log(n)) or faster.  Reducing the order of growth of an operation that takes quadratic time to run will make a difference to overall application performance if the number of elements being processed is in the thousands or greater.  This is knowable without profiling.

In the case of regular expressions, the time being saved is just the compile time of the regular expression when it cannot be found in the cache or the lookup time to find it in the cache.  This is not much time.

However in this case, it can be argued that manually compiling the regular expression does not make the code harder to read and maintain. If it's not harder to read and maintain, then you might as well manually compile the regular expression, *outside of all loops*.
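
A minimal sketch of compiling outside of a loop follows. The sample lines and the names `error_pattern` and `messages` are made up for illustration.

```python
import re

lines = ["error: disk full", "ok", "error: timeout", "warning: slow"]

# Compile once, outside the loop, and reuse the compiled pattern.
error_pattern = re.compile(r'^error: (.+)$')

messages = []
for line in lines:
    match = error_pattern.search(line)
    if match:
        # group(1) is the text after "error: "
        messages.append(match.group(1))

print(messages)  # ['disk full', 'timeout']
```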

For this notebook, I will argue that manually compiling the regex is ever so slightly harder to read, so it is not done here, to make learning easier.

## Jupyter HTML Cell Magic

The following allows for highlighting the part of the string that is matched.

In [19]:
# IPython.display is the public import location for display and HTML
from IPython.display import display, HTML
def highlight(match):
    """Display the matched part of the string with a different background color."""
    s = match.string
    start = match.start()
    end = match.end()
    span_start = '<span style="background-color:LightSalmon">'
    span_end = '</span>'
    html_string = s[:start] + span_start + s[start:end] + span_end + s[end:]
    display(HTML(html_string))

In [20]:
pattern = "minimal"
match = re.search(pattern, ml_text)
if match:
    highlight(match)

### re.search() Flags vs Regular Expression Flags

It is easier to read and maintain code using arguments to re.search() than to use flags inside the regular expression.

In rare situations, you may want a flag to be in effect for only part of the pattern.  To do this:
1. use the regex module instead of the re module (since Python 3.6, re also accepts scoped groups such as `(?i:...)`)
2. specify the flag inside of the regular expression at the location you want it to take effect

In [21]:
# using flags to re.search()
pattern = "machine"
match = re.search(pattern, ml_text, flags=re.IGNORECASE)
if match:
    highlight(match)

In [23]:
# same as above, but put the flag inside the regex itself
# this is harder to read!
pattern = "(?i)machine"
match = re.search(pattern, ml_text)
if match:
    highlight(match)
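
For a flag that should cover only part of a pattern, here is a hedged sketch using the scoped group syntax `(?i:...)`, which the standard re module accepts from Python 3.6 onward. The sample strings are made up for illustration.

```python
import re

# (?i:...) ignores case only inside the group;
# the literal ' learning' after it stays case-sensitive.
pattern = r'(?i:machine) learning'

mixed = re.findall(pattern, "Machine learning and MACHINE vision")
upper_prefix = re.findall(pattern, "MACHINE learning")
upper_suffix = re.findall(pattern, "machine LEARNING")

print(mixed)         # ['Machine learning']
print(upper_prefix)  # ['MACHINE learning']
print(upper_suffix)  # []
```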

### Raw Strings

In Python, a raw string is a string literal in which backslashes are taken literally rather than treated as escape sequences.

This is easiest to understand by example.

Usually a regular expression is expressed as a raw string in Python.

In [24]:
# \n represents new line
s = "a\nb"
print(s)

a
b


In [25]:
# \n is taken literally as the characters \ and n
s = r"a\nb"
print(s)

a\nb
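
To see why this matters for regular expressions, here is a small sketch using `\b`. In a normal string `\b` is the backspace character, while in a raw string it reaches the regex engine as the word-boundary anchor. The variable names are only illustrative.

```python
import re

text = "a cat sat"

not_raw = '\bcat\b'   # '\b' is the backspace character (\x08) here
raw = r'\bcat\b'      # '\b' is passed through as a regex word boundary

no_match = re.search(not_raw, text)
raw_match = re.search(raw, text)

print(len(not_raw), len(raw))  # 5 7
print(no_match)                # None
print(raw_match.span())        # (2, 5)
```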


## Quantifiers

```
* r'ab*'      # a followed by zero or more b's
* r'ab+'      # a followed by one or more b's
* r'ab?'      # a followed by zero or one b's
* r'ab{3}'    # a followed by three b's
* r'ab{2,3}'  # a followed by two to three b's
* r'ab{2,}'   # a followed by two or more b's
```
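
As a text-only sketch of the table above (the highlighted cells in this notebook render as HTML), `re.findall()` can show what each quantifier matches. The sample string is made up for illustration.

```python
import re

s = 'a ab abb abbb abbbb'
patterns = [r'ab*', r'ab+', r'ab?', r'ab{3}', r'ab{2,3}', r'ab{2,}']

# Map each pattern to the list of all its matches in s.
results = {p: re.findall(p, s) for p in patterns}
for p, found in results.items():
    print(f'{p!r:12} {found}')

# 'ab*'      -> ['a', 'ab', 'abb', 'abbb', 'abbbb']
# 'ab+'      -> ['ab', 'abb', 'abbb', 'abbbb']
# 'ab?'      -> ['a', 'ab', 'ab', 'ab', 'ab']
# 'ab{3}'    -> ['abbb', 'abbb']
# 'ab{2,3}'  -> ['abb', 'abbb', 'abbb']
# 'ab{2,}'   -> ['abb', 'abbb', 'abbbb']
```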

In [26]:
# a followed by 0+ b's
match = re.search(r'ab*', 'xax xabx ab aab')
if match:
    highlight(match)

In [35]:
# same as above, but find all matches
for match in re.finditer(r'ab*', 'xax xabx ab aab'):
    highlight(match)

In [27]:
# a followed by 0+ b's
# default is greedy: consume as many b's as possible
match = re.search(r'ab*', 'xabbx ab abbb')
if match:
    highlight(match)

In [38]:
# same as above, but find all matches
for match in re.finditer(r'ab*', 'xabbx ab abbb'):
    if match:
        highlight(match)

### Next Examples Will Use re.finditer()

As seen above, the first match with re.finditer() is the same as using re.search().  
Subsequent matches are only available to re.finditer()

In [41]:
# specify minimal match by placing '?' after quantifier
for match in re.finditer(r'ab*?', 'xabbx ab abbb'):
    if match:
        highlight(match)

In [42]:
# same as above but clearer
# minimal match of 0 or more b's is 0, so no need to have b at all
for match in re.finditer(r'a', 'xabbx ab abbb'):
    if match:
        highlight(match)

In [43]:
# a followed by 1+ b's
for match in re.finditer(r'ab+', 'xabbx ab abbb'):
    if match:
        highlight(match)

In [44]:
# a followed by 1+ b's, minimal match
for match in re.finditer(r'ab+?', 'xabbx ab abbb'):
    if match:
        highlight(match)

In [45]:
# same as above, but clearer
# minimal match of 1 or more b's is 1, so just use 1 b
for match in re.finditer(r'ab', 'xabbx ab abbb'):
    if match:
        highlight(match)
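
The greedy and minimal variants above can also be compared as plain text. This sketch collects the matched substrings with `re.findall()` instead of highlighting them.

```python
import re

s = 'xabbx ab abbb'

greedy_star = re.findall(r'ab*', s)    # as many b's as possible
lazy_star = re.findall(r'ab*?', s)     # as few b's as possible (zero)
greedy_plus = re.findall(r'ab+', s)    # at least one b, as many as possible
lazy_plus = re.findall(r'ab+?', s)     # at least one b, as few as possible

print(greedy_star)  # ['abb', 'ab', 'abbb']
print(lazy_star)    # ['a', 'a', 'a']
print(greedy_plus)  # ['abb', 'ab', 'abbb']
print(lazy_plus)    # ['ab', 'ab', 'ab']
```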

In [47]:
# a followed by 3 b's
for match in re.finditer(r'ab{3}', 'a ab abb abbbb abbb'):
    if match:
        highlight(match)

In [49]:
# a followed by 3 or more b's
for match in re.finditer(r'ab{3,}', 'a ab abb abbbbbb abbb'):
    if match:
        highlight(match)

## Anchors (aka Assertions)

In [50]:
# a followed by 2 b's AND
# assert a word boundary before and after (\b consumes no characters)
for match in re.finditer(r'\bab{2}\b', 'xabbx xabb abbx ab abbbbb abb abbbb abbb'):
    if match:
        highlight(match)

In [51]:
# space followed by a followed by 2 b's followed by space
# Note: this match includes the spaces on either side of abb
for match in re.finditer(r' ab{2} ', 'xabbx xabb abbx ab abbbbb abb abbbb abbb'):
    if match:
        highlight(match)
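
A text-only sketch of the difference between the two cells above: `\b` asserts a position and consumes nothing, while a literal space is part of the match. `\b` also succeeds at the start and end of the string, where there is no space to match.

```python
import re

s = 'xabbx xabb abbx ab abbbbb abb abbbb abbb'

boundary = re.search(r'\bab{2}\b', s)
spaces = re.search(r' ab{2} ', s)

print(repr(boundary.group()), boundary.span())  # 'abb' (26, 29)
print(repr(spaces.group()), spaces.span())      # ' abb ' (25, 30)

# With no surrounding spaces, only the \b version matches.
alone_boundary = re.search(r'\bab{2}\b', 'abb')
alone_spaces = re.search(r' ab{2} ', 'abb')
print(alone_boundary.span(), alone_spaces)      # (0, 3) None
```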

In [52]:
# a or b 2 or more times
# Note: inside [...] the | character is a literal, so use [ab] rather than [a|b]
for match in re.finditer(r'[ab]{2,}', 'a xbax baab abb abbbb abbb'):
    if match:
        highlight(match)
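
A common pitfall with character classes is worth a quick sketch: inside `[...]`, the `|` character is not alternation but a literal pipe. The sample string is made up for illustration.

```python
import re

s = 'a|b'

# '|' inside the class is matched literally, so the whole string is one match.
with_pipe = re.findall(r'[a|b]+', s)

# Without the pipe, only the letters a and b are in the class.
without_pipe = re.findall(r'[ab]+', s)

print(with_pipe)     # ['a|b']
print(without_pipe)  # ['a', 'b']
```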

In [54]:
# ab 2 or more times
for match in re.finditer(r'(ab){2,}', 'a ba baab xyabababz abbbb xababz'):
    if match:
        highlight(match)

In [56]:
# ab 2 or more times, with word boundaries on both sides
for match in re.finditer(r'\b(ab){2,}\b', 'a ba abab baab ab xyabababz ababab xababz'):
    if match:
        highlight(match)

## Match any characters except ...

In [57]:
# match any character except '!' 1 or more times
match = re.search(r'[^!]+', '!Hello World!')
if match:
    highlight(match)

In [58]:
# match any character except '!' 1 or more times, minimal match
match = re.search(r'[^!]+?', '!Hello World!')
if match:
    highlight(match)
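
As a text-only sketch of the two cells above, the greedy version takes everything up to the next `!`, while the minimal version stops after a single character.

```python
import re

s = '!Hello World!'

greedy = re.search(r'[^!]+', s).group()
minimal = re.search(r'[^!]+?', s).group()

print(repr(greedy))   # 'Hello World'
print(repr(minimal))  # 'H'
```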

## Character Ranges

In [60]:
# match any lowercase character 1 or more times
for match in re.finditer(r'[a-z]+', 'Hello World!'):
    if match:
        highlight(match)

In [61]:
# match any uppercase character 1 or more times
for match in re.finditer(r'[A-Z]+', 'Hello World!'):
    if match:
        highlight(match)

## Sequences

In [62]:
# \w is shorthand for: [a-zA-Z0-9_]
# (for str patterns it also matches Unicode word characters unless re.ASCII is set)
# match any word character 1 or more times
for match in re.finditer(r'\w+', 'Hello World!'):
    if match:
        highlight(match)
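
One caveat, sketched below: with `str` patterns in Python 3, `\w` also matches non-ASCII letters unless the `re.ASCII` flag is given. The sample text is made up for illustration.

```python
import re

text = 'naïve café'

# Default (Unicode) matching: accented letters count as word characters.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(len(unicode_words), ascii_words)  # 2 ['na', 've', 'caf']
```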

In [63]:
# \d is shorthand for: [0-9]
# match any sequence of digits 1 or more times
match = re.search(r'\d+', 'Hello 123 World!')
if match:
    highlight(match)

In [64]:
# \D is shorthand for: [^0-9]
# match any sequence of nondigits 1 or more times
for match in re.finditer(r'\D+', 'Hello 123 World!'):
    if match:
        highlight(match)

In [66]:
# Match 1+ non-space characters
for match in re.finditer(r'\S+', 'This! is a string 1 2 345'):
    if match:                
        highlight(match)

In [69]:
# Match 1+ word characters followed by a space character
pattern = r'\w+ '
for match in re.finditer(pattern, 'This! is a string 1 2 345'):
    if match:
        highlight(match)

In [70]:
# Match 1+ word characters and
# assert next position is a word boundary
pattern = r'\w+\b'
for match in re.finditer(pattern, 'This! is a string 1 2 345'):
    if match:
        highlight(match)

In [75]:
# match any character any number of times, greedily, then match a digit
match = re.search(r'.*\d', 'This! is a string 1 2 345')
if match:
    highlight(match)

In [77]:
# match any character any number of times, minimally, then match a digit
match = re.search(r'.*?\d', 'This! is a string 1 2 345')
if match:
    highlight(match)

In [87]:
# match any non-word character one or more times
for match in re.finditer(r'\W+', 'This! is a string 1 2 345'):
    if match:
        highlight(match)

### RegEx Processing

At each position in the string, the regex is tried.  If it does not match, the next position in the string is tried.

The first match is always returned.

In [88]:
# try each alternative at each position in the string
s = 'An anteater encountered an ant on an antelope'

pattern = r'ant|antelope|anteater'
match = re.search(pattern, s)
if match:
    highlight(match)
    
pattern = r'anteater|antelope|ant'
match = re.search(pattern, s)
if match:
    highlight(match)
    
pattern = r'antelope|anteater|ant'
match = re.search(pattern, s)
if match:
    highlight(match)    

In [94]:
# try each alternative at each position in the string
# find all matches
s = 'An anteater encountered an ant on an antelope'

pattern = r'ant|antelope|anteater'
for match in re.finditer(pattern, s):
    if match:
        highlight(match) 

In [95]:
# try each alternative at each position in the string
# find all matches
s = 'An anteater encountered an ant on an antelope'

pattern = r'antelope|ant|anteater'
for match in re.finditer(pattern, s):
    if match:
        highlight(match) 

In [96]:
# try each alternative at each position in the string
# find all matches
s = 'An anteater encountered an ant on an antelope'

pattern = r'antelope|anteater|ant'
for match in re.finditer(pattern, s):
    if match:
        highlight(match) 
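
The highlighted cells above render as HTML. As a text-only sketch of the same rule: the leftmost position that matches wins, and at that position the alternatives are tried in the order written.

```python
import re

s = 'An anteater encountered an ant on an antelope'

# 'ant' is listed first, so it wins at every position where it matches.
first_short = re.search(r'ant|antelope|anteater', s).group()
all_short = re.findall(r'ant|antelope|anteater', s)

# Longer alternatives listed first get the chance to match before 'ant'.
first_long = re.search(r'antelope|anteater|ant', s).group()
all_long = re.findall(r'antelope|anteater|ant', s)

print(first_short, all_short)  # ant ['ant', 'ant', 'ant']
print(first_long, all_long)    # anteater ['anteater', 'ant', 'antelope']
```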

## Match RegEx Metacharacters

In [97]:
# Normally * and $ are special characters in a regex
# if you want to search for them, they need to be escaped
s = 'The 22nd item costs $999 or more'
pattern = r'\$\d+'

match = re.search(pattern, s)
if match:
    highlight(match)     
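
When the text to search for is a literal string, `re.escape()` can add the backslashes for you. This is a small sketch; the variable `literal` is only illustrative.

```python
import re

s = 'The 22nd item costs $999 or more'
literal = '$999'

# re.escape() backslash-escapes regex metacharacters such as $
pattern = re.escape(literal)
match = re.search(pattern, s)

print(pattern)                      # \$999
print(match.group(), match.span())  # $999 (20, 24)
```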

In [98]:
# $ as a metacharacter means match end of string
s = 'The 22nd item costs $999 or more'
pattern = r'\d+$'

# no match
match = re.search(pattern, s)
if match:
    highlight(match)

In [99]:
# $ as a metacharacter means match end of string
s = 'The 22nd item costs $999'
pattern = r'\d+$'

match = re.search(pattern, s)
if match:
    highlight(match)

In [103]:
# backtracking
# 1 or more alphanumeric characters, greedy, ending in 2 digits
for match in re.finditer(r'\w+\d\d', 'A123 BC3456 78 D7'):
    if match:
        highlight(match)

In [107]:
# backtracking
# 1 or more alphanumeric characters, minimal match, ending in 2 digits
for match in re.finditer(r'\w+?\d\d', 'A123 BC3456 78 D7'):
    if match:
        highlight(match)

In [105]:
# backtracking
# exactly 1 alphanumeric character, ending in 2 digits
for match in re.finditer(r'\w\d\d', 'A123 BC3456 78 D7'):
    if match:
        highlight(match)
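
The three backtracking cells above can be compared as plain text. In the greedy case, `\w+` first consumes the whole word and then gives characters back until `\d\d` can match. In the minimal case, it starts with one character and grows only as needed.

```python
import re

s = 'A123 BC3456 78 D7'

greedy = re.findall(r'\w+\d\d', s)    # \w+ takes as much as it can
minimal = re.findall(r'\w+?\d\d', s)  # \w+? takes as little as it can
exact = re.findall(r'\w\d\d', s)      # exactly one word character

print(greedy)   # ['A123', 'BC3456']
print(minimal)  # ['A12', 'BC34']
print(exact)    # ['A12', 'C34']
```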

In [109]:
# greedy search does not find what is inside each set of ""
# it matches from the first " to the last " in the string
pattern = r'".*"'
for match in re.finditer(pattern, '"Hello World!" and some "more" text "" and more'):
    if match:
        highlight(match)

In [135]:
# minimal search will find what is inside each set of ""
# Note: this search includes the double quotes
for match in re.finditer(r'".*?"', '"Hello World!" and some "more" text "" and more'):
    if match:
        highlight(match)

## Groups

In [133]:
from IPython.display import display, HTML
def highlight_one_group(match):
    """Display the matched part of the group with a different background color."""
    s = match.string
    start, end = match.span(1)
    span_start = '<span style="background-color:LightSalmon">'
    span_end = '</span>'
    html_string = s[:start] + span_start + s[start:end] + span_end + s[end:]
    display(HTML(html_string))

In [136]:
# minimal search will find what is inside each set of ""
# note: this search does not include the double quotes
for match in re.finditer(r'"(.*?)"', '"Hello World!" and some "more" text "" and more'):
    if match:
        highlight_one_group(match)
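
As a text-only sketch of the quoted-string cells above: greedy versus minimal matching, and a capture group to drop the quotes.

```python
import re

s = '"Hello World!" and some "more" text "" and more'

greedy = re.findall(r'".*"', s)     # first quote to last quote
minimal = re.findall(r'".*?"', s)   # each quoted section, quotes included
inner = re.findall(r'"(.*?)"', s)   # findall returns group 1 only

print(greedy)   # ['"Hello World!" and some "more" text ""']
print(minimal)  # ['"Hello World!"', '"more"', '""']
print(inner)    # ['Hello World!', 'more', '']
```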

In [78]:
s = 'cp file1 file2'
pattern = r'(?:cp|mv|ln)\s+(\w+)\s+(\w+)'
match = re.search(pattern, s)
match.groups()

('file1', 'file2')

In [79]:
re.findall(pattern, s)

[('file1', 'file2')]

In [88]:
s = 'cp file1 file2 and some stuff cp file1 file2'
pattern = r'(?P<op>cp|mv|ln)\s+(?P<file1>\w+)\s+(?P<file2>\w+)'
match = re.search(pattern, s)
match.groups()

('cp', 'file1', 'file2')

In [86]:
match.groupdict()

{'op': 'cp', 'file1': 'file1', 'file2': 'file2'}

In [89]:
re.findall(pattern, s)

[('cp', 'file1', 'file2'), ('cp', 'file1', 'file2')]

In [26]:
p = re.compile("[a-z]")

In [27]:
%%timeit
for m in p.finditer('a1b2c3d4'):
    a, b = (m.start(), m.group())

1.27 µs ± 3.55 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [28]:
%%timeit
p = r'[a-z]'
for m in re.finditer(p, 'a1b2c3d4'):
    a, b = (m.start(), m.group())

1.8 µs ± 4.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [81]:
match = re.search(pattern,s)
print(match.groups())
print(match.group(1)) # counting starts at 1!

('cp', 'file1', 'file2')
cp


In [76]:
# with multiple matches, use findall
s = '< stuff inside angle brakets > this is not inside <more stuff>'
pattern = r'<(.*?)>'
groups = re.findall(pattern,s)
groups

[' stuff inside angle brakets ', 'more stuff']

In [None]:
import regex  # third-party module: pip install regex

s = """
This is string1
This is string2
"""
pattern = r'\d$'
print(regex.findall(pattern,s))
print(regex.findall(pattern,s,regex.M))
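
The same multiline behavior can be sketched with the standard re module. Without `re.M`, `$` matches only at the end of the string (or just before a trailing newline); with `re.M`, it matches at the end of every line.

```python
import re

s = """
This is string1
This is string2
"""
pattern = r'\d$'

single_line = re.findall(pattern, s)
multi_line = re.findall(pattern, s, re.M)

print(single_line)  # ['2']
print(multi_line)   # ['1', '2']
```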

In [None]:
# negative (i.e. don't match) lookahead
s = 'end endwh1le endloop wh2le loop end'
pattern = r'end(?!\w)'
print(regex.findall(pattern,s))

In [None]:
# positive (i.e. match) lookahead
pattern = r'end(?=\s)'
print(regex.findall(pattern,s))

In [None]:
# negative (i.e. don't match) lookbehind
pattern = r'(?<!end)wh.le'
print(regex.findall(pattern,s))

In [None]:
pattern = r'\bwh.le\b'
print(regex.findall(pattern,s))
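
These lookaround patterns are all fixed-width, so they also work with the standard re module. Here is a text-only sketch of what each one finds.

```python
import re

s = 'end endwh1le endloop wh2le loop end'

# 'end' NOT followed by a word character (also matches at end of string)
neg_ahead = re.findall(r'end(?!\w)', s)

# 'end' followed by whitespace (the final 'end' has nothing after it)
pos_ahead = re.findall(r'end(?=\s)', s)

# 'wh.le' NOT preceded by 'end'
neg_behind = re.findall(r'(?<!end)wh.le', s)

# 'wh.le' as a whole word
whole_word = re.findall(r'\bwh.le\b', s)

print(neg_ahead)   # ['end', 'end']
print(pos_ahead)   # ['end']
print(neg_behind)  # ['wh2le']
print(whole_word)  # ['wh2le']
```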

In [None]:
s = """
~123
~~1234
~~~12345
~~~~123456
"""
pattern = r'^~(?:~~)*\K\d+'
print(regex.findall(pattern,s,regex.M))
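
`\K` is a feature of the regex module, and the standard re module does not accept it. As a hedged alternative sketch, a capture group in re can return the same digits: those on lines that start with an odd number of tildes.

```python
import re

s = """
~123
~~1234
~~~12345
~~~~123456
"""

# ^~(?:~~)* matches one ~ plus zero or more ~~ pairs at the start of a line;
# the digits are captured in group 1 instead of using \K.
pattern = r'^~(?:~~)*(\d+)'
odd_tilde_digits = re.findall(pattern, s, re.M)

print(odd_tilde_digits)  # ['123', '12345']
```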

In [None]:
s = 'there are repeated repeated words in this string'
pattern = r'(\b\S+)\s+(\1)\b'
print(regex.findall(pattern,s))
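
A backreference such as `\1` matches the same text that group 1 captured. As a sketch with the standard re module, the same idea can find a repeated word and then remove the repeat with `re.sub()`. The `\w+` variant of the pattern in the substitution is my own choice for illustration.

```python
import re

s = 'there are repeated repeated words in this string'

# Group 1 captures a word; (\1) must then match that same text again.
found = re.findall(r'(\b\S+)\s+(\1)\b', s)

# Replace "word word" with a single "word".
deduped = re.sub(r'\b(\w+)\s+\1\b', r'\1', s)

print(found)    # [('repeated', 'repeated')]
print(deduped)  # there are repeated words in this string
```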

In [None]:
s = """
~123
~~1234
~~~12345
~~~~123456
"""
pattern = r"""(?x)  # turn on extended (verbose) formatting
^~(?:~~)*  # start of line (in multiline mode), then ~, then zero or more ~~ pairs
\K\d+      # \K discards what was matched so far; then 1 or more digits
"""
print(regex.findall(pattern,s,regex.M))

In [None]:
pattern_compiled = re.compile(r"""(?x)
(put|get)\s+
(files|buffers)\s+
in\s+
(\w+)
"""
                             )

In [None]:
s = 'put files in mydirectory'
print(pattern_compiled.findall(s))
match = pattern_compiled.search(s)
print(match.groups())

In [None]:
pattern_compiled = re.compile(r"""(?x)
(?P<CMD>put|get)\s+
(?P<TYPE>files|buffers)\s+
in\s+
(?P<TARGET>\w+)
"""
                             )

In [None]:
s = 'put files in mydirectory'
print(pattern_compiled.findall(s))
match = pattern_compiled.search(s)
print(match.groups())

In [None]:
match.groupdict()

In [None]:
print(match.group("CMD"))
print(match.group("TYPE"))
print(match.group("TARGET"))

In [None]:
pattern_compiled = re.compile(r'(dog)')
match = pattern_compiled.fullmatch("dog")
match.groups()

In [None]:
s = 'dog'
pattern = r'(dog)'
pc = re.compile(pattern)
match = pc.search(s)
print(match)
print(match.groups())

In [None]:
print(match[0]) # the entire match (same as match.group(0))
print(match[1]) # group 1

### Regex Module

In [None]:
import regex
regex.DEFAULT_VERSION == regex.V0

In [None]:
s = 'abcdefg'
pattern = r'[[a-z]--[aeiou]](?V0)'
regex.findall(pattern, s)

In [None]:
s = 'abcdefg'
pattern = r'[[a-z]--[aeiou]](?V1)'
regex.findall(pattern, s)

In [None]:
pattern = r'[[a-z]--[aeiou]]'

In [None]:
print(re.findall(pattern, s))
print(regex.findall(pattern, s))
print(regex.findall(pattern, s, regex.V0))
print(regex.findall(pattern, s, regex.V1))

In [None]:
s = 'a-beautiful-day'

In [None]:
import sys
print(sys.version)

# new in 3.7: this is now the same as with regex.split() below
re.split('(?=-)', s)
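
As a sketch of what the cell above returns on Python 3.7 or later: the lookahead matches the empty position before each `-`, so the string is split there and each hyphen stays with the piece that follows it.

```python
import re

s = 'a-beautiful-day'

# Split at the zero-width position just before each '-'
parts = re.split(r'(?=-)', s)

print(parts)  # ['a', '-beautiful', '-day']
```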

In [None]:
regex.split('(?=-)', s, regex.V1)

In [None]:
regex.split('(?=-)', s)

In [None]:
s = "<aa> b <ccde>"
r = r'<.*>'
mo = regex.match(r, s)
mo.captures()

In [None]:
s = "<aa> b <ccde>"
r = r'<.*>'
mo = re.match(r, s)
mo.string

In [None]:
s = "<aa> b <ccde>"
r = r'<.*?>'
mo = regex.match(r, s)
mo.captures()

In [None]:
s = "<aa> b <ccde>"
r = r'<.*?>'
match_list = regex.findall(r, s)
match_list

In [None]:
s = "<aa> b <ccde>"
r = r'<.*>'
match_list = regex.findall(r, s)
match_list

In [None]:
re.split(r'\W+', 'Words, words, and more words.')

In [None]:
re.split(r'(?:\W+)', 'Words, words, and more words.')
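
The two cells above give the same result because a non-capturing group does not change what `re.split()` returns. A capturing group does: the separators are then included in the output. A short sketch follows; the second sample string is made up for illustration.

```python
import re

# No group (or a non-capturing group): separators are dropped.
plain = re.split(r'\W+', 'Words, words, and more words.')
non_capturing = re.split(r'(?:\W+)', 'Words, words, and more words.')

# Capturing group: separators are kept as separate list items.
capturing = re.split(r'(\W+)', 'Words, words.')

print(plain)      # ['Words', 'words', 'and', 'more', 'words', '']
print(capturing)  # ['Words', ', ', 'words', '.', '']
```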

In [None]:
pattern = re.compile(r'(?P<digits>\d+) (?P<letters>[a-zA-Z])')
mo = pattern.search('123 45 6789 a234')
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))
print(mo.groups())
print(mo.groupdict())

In [None]:
pattern = re.compile(r'(?P<digits>\d+)')
mo = pattern.search('123 45 6789 a234')
print(mo.group(0))
print(mo.groups())
print(mo.groupdict())

In [None]:
pattern = re.compile(r'(?P<digits>\d+)')
pattern.findall('123 45 6789 a234')

In [None]:
env = %env
env_list = [(key,value) for key,value in env.items()]
env_list.sort()
env_list[:5]

In [None]:
print(type(env.keys()))

In [None]:
env = %env
keys = list(env.keys())
keys.sort()
for key in keys:
    m = regex.search(r'(?<=XDG)(\w+)', key)
    if m:
        print(m.group(1))

In [None]:
[k for k in keys if k.startswith('XDG')]

In [None]:
[k for k in keys if regex.search(r'^XDG | ^PATH | ^LS_COLORS', k, flags=regex.X)]

In [None]:
env = %env
env_list = [(key,value) for key,value in env.items() if not regex.search(r'^XDG | ^PATH | ^LS_COLORS | ^DBUS', key, flags=regex.X)]
env_list.sort()
env_list
