# Regular Expressions in Python

By far, the best course I have found on Regular Expressions is:  
https://www.safaribooksonline.com/videos/understanding-regular-expressions/9781491996300  

The best website I have found for learning and experimenting with Regular Expressions is:  
https://regex101.com/ 

## **re** module vs **regex** module

* **re** is part of the standard library
* **regex** is a third-party module (`pip install regex`) that adds newer PCRE-style features, such as variable-length lookbehind

In most cases, re and regex will do the same thing.  However, with advanced regular expressions, you may find yourself wanting to use regex over re.

PCRE means Perl Compatible Regular Expressions
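
As a small illustration of the difference, the sketch below tries a variable-length lookbehind with both modules. The pattern and test string are made up for this example, and the regex part only runs if the third-party module happens to be installed.

```python
import re

# Variable-length lookbehind: "b" preceded by one or more "a" characters.
pattern = r'(?<=a+)b'

try:
    re.compile(pattern)
    re_supports_it = True
except re.error as exc:
    # The standard library rejects lookbehinds whose width is not fixed.
    re_supports_it = False
    print("re:", exc)

try:
    import regex  # third-party; assumed to be installed with `pip install regex`
except ImportError:
    regex = None

if regex is not None:
    print("regex:", regex.findall(pattern, 'aab ab b'))
```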

## re.search()

In practice, re.search() may be the method you use most.

re.search() will find the **first** match and return that as a match object.

If there is no match, it will return None.

bool(None) is False, so the result of a failed search can be tested directly with `if match:`.

In [1]:
# The last expression in a Jupyter Notebook cell is displayed using __repr__
# For None, nothing is displayed
None

In [2]:
# The truth value of None is False
bool(None)

False

In [3]:
# you can print(None) though
print(None)

None


In [4]:
import re

ml_text = """Machine learning is a method of data analysis that automates 
analytical model building. It is a branch of artificial intelligence 
based on the idea that systems can learn from data, identify patterns 
and make decisions with minimal human intervention.
"""

pattern = "minimal"
re.search(pattern, ml_text)

<re.Match object; span=(227, 234), match='minimal'>

In [5]:
# A Python pattern can be explicitly compiled before usage
compiled_pattern = re.compile(pattern)
compiled_pattern.search(ml_text)

<re.Match object; span=(227, 234), match='minimal'>

In [6]:
# None is returned when there is no match, so nothing is displayed
pattern = "xyz"
re.search(pattern, ml_text)

## Compiling Regular Expressions
If you do not explicitly compile the pattern, Python will do that for you.  It will also cache a few of the compiled patterns.

If you use a lot of patterns, and you use those patterns many times, it can be more efficient to compile each pattern once before first usage.
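
A minimal sketch of that idea: compile the pattern once, outside the loop, and reuse the compiled object. The pattern and the `lines` list are illustrative only.

```python
import re

# Compile once, outside the loop, then reuse the compiled object.
word_pattern = re.compile(r'\b\w+ing\b')

lines = [
    "Machine learning is a method of data analysis",
    "that automates analytical model building.",
]

ing_words = []
for line in lines:
    ing_words.extend(word_pattern.findall(line))

print(ing_words)   # ['learning', 'building']
```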

## Jupyter HTML Cell Magic

The following allows for highlighting the part of the string that is matched.

In [7]:
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML
def highlight(match):
    s = match.string
    start = match.start()
    end = match.end()
    span_start = '<span style="background-color:LightSalmon">'
    span_end = '</span>'
    html_string = s[:start] + span_start + s[start:end] + span_end + s[end:]
    display(HTML(html_string))

In [8]:
pattern = "minimal"
match = re.search(pattern, ml_text)
if match:
    highlight(match)
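
Outside of Jupyter, HTML highlighting is not available. A plain-text variant could look like the sketch below; `mark` is a hypothetical helper name, not part of the notebook, and it wraps the matched span in brackets instead of a colored span.

```python
import re

def mark(match, left='[', right=']'):
    """Return the searched string with the matched span wrapped in markers."""
    s = match.string
    start, end = match.span()
    return s[:start] + left + s[start:end] + right + s[end:]

match = re.search(r'ab*', 'xabbx ab abbb')
if match:
    marked = mark(match)
    print(marked)   # x[abb]x ab abbb
```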

### re.search() Flags vs Regular Expression Flags

Passing flags as arguments to re.search() is easier to read and maintain than embedding flags inside the regular expression.

In rare situations, you may want a flag to be in effect for only part of the pattern.  To do this:
1. with the re module (Python 3.6 or later), wrap that part of the pattern in a scoped flag group such as `(?i:...)`
2. with the regex module (version 1 behaviour), you can also specify the flag inside of the regular expression at the location you want it to take effect
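
The options can be compared side by side with the standard re module. This is a sketch with a made-up test string; the scoped form `(?i:...)` needs Python 3.6 or later.

```python
import re

s = 'Machine Learning'

# 1. Flag passed as an argument: applies to the whole pattern.
m1 = re.search(r'machine learning', s, flags=re.IGNORECASE)

# 2. Global inline flag: placed at the very start of the pattern.
m2 = re.search(r'(?i)machine learning', s)

# 3. Scoped inline flag: only "machine" ignores case.
m3 = re.search(r'(?i:machine) Learning', s)
m4 = re.search(r'(?i:machine) learning', s)   # "learning" stays case-sensitive

print(bool(m1), bool(m2), bool(m3), bool(m4))   # True True True False
```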

### Raw Strings

In Python, a raw string is a string literal in which backslashes are taken literally.

This is easiest to understand by example.

Usually a regular expression is expressed as a raw string in Python.

In [9]:
# \n represents new line
s = "a\nb"
print(s)

a
b


In [10]:
# \n is taken literally as the characters \ and n
s = r"a\nb"
print(s)

a\nb
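
The same effect matters for regex escapes. In the sketch below (the test string is made up), `\b` without the `r` prefix becomes a backspace character before the regex engine sees it, so the word-boundary pattern only works as a raw string.

```python
import re

s = 'This is a test'

# Without the r prefix, Python turns \b into a backspace character (\x08).
print(len('\bis'), len(r'\bis'))   # 3 4

plain = re.search('\bis', s)       # looks for backspace + "is"
raw = re.search(r'\bis', s)        # looks for "is" at a word boundary

print(plain)                       # None
print(raw.span())                  # (5, 7)
```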


## Quantifiers

```
* r'ab*'      # a followed by zero or more b's
* r'ab+'      # a followed by one or more b's
* r'ab?'      # a followed by zero or one b's
* r'ab{3}'    # a followed by three b's
* r'ab{2,3}'  # a followed by two to three b's
* r'ab{2,}'   # a followed by two or more b's
```

In [11]:
# a followed by 0+ b's
match = re.search(r'ab*', 'xax xabx ab aab')
if match:
    highlight(match)

In [12]:
# a followed by 0+ b's
# default is greedy: consume as many b's as possible
# stop searching when 1st match is encountered
match = re.search(r'ab*', 'xabbx ab abbb')
if match:
    highlight(match)

In [13]:
# same as above, but specify minimal match by placing '?' after quantifier
match = re.search(r'ab*?', 'xabbx ab abbb')
if match:
    highlight(match)

In [14]:
# same as above but clearer
# minimal match of 0 or more b's is 0, so no need to have b at all
match = re.search(r'a', 'xabbx ab abbb')
if match:
    highlight(match)

In [15]:
# a followed by 1+ b's
match = re.search(r'ab+', 'xabbx ab abbb')
if match:
    highlight(match)

In [16]:
# a followed by 1+ b's, minimal match
match = re.search(r'ab+?', 'xabbx ab abbb')
if match:
    highlight(match)

In [17]:
# same as above, but clearer
# minimal match of 1 or more b's is 1, so just use 1 b
match = re.search(r'ab', 'xabbx ab abbb')
if match:
    highlight(match)

In [18]:
# a followed by 3 b's
match = re.search(r'ab{3}', 'a ab abb abbbb abbb')
if match:
    highlight(match)

In [19]:
# a followed by 3 or more b's
match = re.search(r'ab{3,}', 'a ab abb abbbbbb abbb')
if match:
    highlight(match)

## Anchors (aka Assertions)

In [20]:
# a followed by 2 b's AND
# assert a word boundary (\b) immediately before the a and immediately after the last b
match = re.search(r'\bab{2}\b', 'xabbx xabb abbx ab abbbbb abb abbbb abbb')
if match:
    highlight(match)

In [21]:
# space followed by a followed by 2 b's followed by space
# Note: this match includes the spaces on either side of abb
match = re.search(r' ab{2} ', 'xabbx xabb abbx ab abbbbb abb abbbb abbb')
if match:
    highlight(match)

In [22]:
# a or b 2 or more times
# note: inside [...] the | is a literal character, so use [ab] rather than [a|b]
match = re.search(r'[ab]{2,}', 'a xbax baab abb abbbb abbb')
if match:
    highlight(match)

In [23]:
# ab 2 or more times
match = re.search(r'(ab){2,}', 'a ba baab xyabababz abbbb abbb')
if match:
    highlight(match)

In [24]:
# ab 2 or more times with word boundaries on both sides
match = re.search(r'\b(ab){2,}\b', 'a ba baab xyabababz ababab abbb')
if match:
    highlight(match)

## Match any characters except ...

In [25]:
# match any character except '!' 1 or more times
match = re.search(r'[^!]+', '!Hello World!')
if match:
    highlight(match)

In [26]:
# match any character except '!' 1 or more times, minimal match
match = re.search(r'[^!]+?', '!Hello World!')
if match:
    highlight(match)

## Character Ranges

In [27]:
# match any lowercase character 1 or more times
match = re.search(r'[a-z]+', 'Hello World!')
if match:
    highlight(match)

In [28]:
# match any uppercase character 1 or more times
match = re.search(r'[A-Z]+', 'Hello World!')
if match:
    highlight(match)

## Sequences

In [29]:
# \w is shorthand for: [a-zA-Z0-9_] (plus other Unicode word characters unless re.ASCII is used)
# match any word character 1 or more times
match = re.search(r'\w+', 'Hello World!')
if match:
    highlight(match)

In [30]:
# \d is shorthand for: [0-9]
# match any sequence of digits 1 or more times
match = re.search(r'\d+', 'Hello 123 World!')
if match:
    highlight(match)

In [31]:
# \D is shorthand for: [^0-9]
# match any sequence of nondigits 1 or more times
match = re.search(r'\D+', 'Hello 123 World!')
if match:
    highlight(match)
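
One caveat about these shorthands: for str patterns in Python 3 they are Unicode-aware by default, so \w matches more than [a-zA-Z0-9_]. A small sketch with a made-up string shows the difference the re.ASCII flag makes.

```python
import re

s = 'café 123'

unicode_words = re.findall(r'\w+', s)           # é counts as a word character
ascii_words = re.findall(r'\w+', s, re.ASCII)   # é does not

print(unicode_words)   # ['café', '123']
print(ascii_words)     # ['caf', '123']
```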

### RegEx Processing

At each position in the string, the regex is tried.  If it does not match, the next position in the string is tried.

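The leftmost match is always returned.  When several alternatives could match at the same position, the first alternative listed in the pattern wins.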

In [32]:
# try each alternative at each position in the string
s = 'An anteater encountered an ant on an antelope'

pattern = r'ant|antelope|anteater'
match = re.search(pattern, s)
if match:
    highlight(match)
    
pattern = r'anteater|antelope|ant'
match = re.search(pattern, s)
if match:
    highlight(match)
    
pattern = r'antelope|anteater|ant'
match = re.search(pattern, s)
if match:
    highlight(match)    
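
For readers without the HTML highlighting, the same three searches can be printed as plain text. This sketch reuses the string and patterns from the cell above; only the loop and the `results` list are new.

```python
import re

s = 'An anteater encountered an ant on an antelope'

patterns = (
    r'ant|antelope|anteater',
    r'anteater|antelope|ant',
    r'antelope|anteater|ant',
)

results = []
for pattern in patterns:
    match = re.search(pattern, s)
    results.append(match.group())
    print(f'{pattern!r:28} -> {match.group()!r} at {match.span()}')
```

All three searches match at position 3, inside "anteater". The first pattern returns only 'ant' because that alternative is listed first; the other two return 'anteater'.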

## Match RegEx Metacharacters

In [33]:
# Normally * and $ are special characters in a regex
# if you want to search for them, they need to be escaped
s = 'The 22nd item costs $999 or more'
pattern = r'\$\d+'

match = re.search(pattern, s)
if match:
    highlight(match)     

In [34]:
# $ means match at end of string
s = 'The 22nd item costs $999 or more'
pattern = r'\d+$'

# no match
match = re.search(pattern, s)
if match:
    highlight(match)

In [35]:
# $ means match at end of string
s = 'The 22nd item costs $999'
pattern = r'\d+$'

match = re.search(pattern, s)
if match:
    highlight(match)

In [36]:
# Match 1+ non-space characters
pattern = r'\S+'
match = re.search(pattern, 'This! is a test string 1 2 3 456')
if match:                
    highlight(match)

In [52]:
# Match 1+ word characters followed by a space character
s = 'This! is a test string 1 2 3 456'
pattern = r'\w+ '
match = re.search(pattern, s)
if match:
    highlight(match)

In [54]:
# Match 'is' and assert that the position before 'i' is a word boundary
s = "This! isn't a test string 1 2 3 456"
pattern = r'\bis'
match = re.search(pattern, s)
if match:
    highlight(match)

In [39]:
s = 'This is a test string 1 2 3 456'
pattern = r'.*?\d+'
match = re.search(pattern, s)
s[match.start():match.end()]

'This is a test string 1'

In [40]:
# [^\W\d] means: not (non-word or digit), i.e. a word character that is not a digit
s = 'This is some test string 1 2 3 456'
pattern = r'[^\W\d]'
match = re.search(pattern, s)
s[match.start():match.end()]

'T'

In [41]:
# the regex module is third-party (pip install regex) and must be imported before use
import regex
print(regex.findall(pattern,s))

In [42]:
# . by default does not match \n
s = 'This is \na test.'
pattern = r'.*'
print(regex.findall(pattern,s))

In [43]:
# with the DOTALL flag, . also matches \n
s = 'This is \na test.'
pattern = r'.*'
print(regex.findall(pattern,s, regex.DOTALL))
print(re.findall(pattern,s, re.DOTALL))

In [44]:
# findall returns the capturing group, not the whole match
# a repeated group keeps only its last repetition
s = 'This is not an isisis test.'
pattern = r'\s(is){1,3}\s'
print(regex.findall(pattern,s))

In [45]:
# wrap the repetition in an outer group to capture all of the repetitions
s = 'This is not an isisis test.'
pattern = r'\s((?:is){1,})\s'
match = regex.search(pattern,s)
print(match.groups())
print(regex.findall(pattern,s))

In [46]:
# backtracking
# \w* first takes all of 'A123', then gives back two characters so that \d\d can match
s = 'A123'
pattern = r'\w*\d\d'
print(regex.findall(pattern,s))

In [47]:
# backtracking cannot help here: 'A1' has only one digit, so there is no match
s = 'A1'
pattern = r'\w*\d\d'
print(regex.findall(pattern,s))

In [48]:
# default is greedy: .* runs to the end of the string, then backtracks to the last "
s = '"Hello World!" and some "more" text "" and more'
pattern = r'".*"'
print(regex.findall(pattern,s))

In [49]:
# ? after a quantifier makes it lazy (minimal match)
pattern = r'".*?"'
print(regex.findall(pattern,s))

In [50]:
# (?:...) is a non-capturing group, so only the two file names are captured
s = 'cp file1 file2'
pattern = r'(?:cp|mv|ln)\s+(\w+)\s+(\w+)'
print(regex.findall(pattern,s))

In [51]:
match = regex.search(pattern,s)
print(match.groups())
print(match.group(1)) # group numbering starts at 1; group(0) is the whole match

In [None]:
# with multiple matches, use findall
s = '< stuff inside angle brackets > this is not inside <more stuff>'
pattern = r'<(.*?)>'
groups = regex.findall(pattern,s)
groups

In [None]:
s = """
This is string1
This is string2
"""
pattern = r'\d$'
print(regex.findall(pattern,s))
print(regex.findall(pattern,s,regex.M))

In [None]:
# negative (i.e. don't match) lookahead
s = 'end endwh1le endloop wh2le loop end'
pattern = r'end(?!\w)'
print(regex.findall(pattern,s))

In [None]:
# positive (i.e. match) lookahead
pattern = r'end(?=\s)'
print(regex.findall(pattern,s))

In [None]:
# negative (i.e. don't match) lookbehind
pattern = r'(?<!end)wh.le'
print(regex.findall(pattern,s))

In [None]:
pattern = r'\bwh.le\b'
print(regex.findall(pattern,s))

In [None]:
s = """
~123
~~1234
~~~12345
~~~~123456
"""
# \K (regex module only) keeps everything matched before it out of the result
pattern = r'^~(?:~~)*\K\d+'
print(regex.findall(pattern,s,regex.M))

In [None]:
s = 'there are repeated repeated words in this string'
# \1 is a backreference: it must match the same text that group 1 captured
pattern = r'(\b\S+)\s+(\1)\b'
print(regex.findall(pattern,s))

In [None]:
s = """
~123
~~1234
~~~12345
~~~~123456
"""
pattern = r"""(?x)  # turn on extended (verbose) formatting
^~(?:~~)*  # start of line (in multiline mode), then ~, then zero or more ~~ pairs
\K\d+      # \K drops what was matched so far from the result, then 1 or more digits
"""
print(regex.findall(pattern,s,regex.M))
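
The standard re module does not support \K. A rough equivalent, sketched below with the same test string, is to capture the digits in a group so that findall returns only the captured part.

```python
import re

s = """
~123
~~1234
~~~12345
~~~~123456
"""

# Capture the digits instead of using \K to drop the tildes from the match.
pattern = r'^~(?:~~)*(\d+)'
odd_tilde_numbers = re.findall(pattern, s, re.M)
print(odd_tilde_numbers)   # ['123', '12345']
```

Only the lines that start with an odd number of tildes match, because the pattern requires one ~ followed by complete ~~ pairs.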

In [None]:
pattern_compiled = re.compile(r"""(?x)
(put|get)\s+
(files|buffers)\s+
in\s+
(\w+)
"""
                             )

In [None]:
s = 'put files in mydirectory'
print(pattern_compiled.findall(s))
match = pattern_compiled.search(s)
print(match.groups())

In [None]:
pattern_compiled = re.compile(r"""(?x)
(?P<CMD>put|get)\s+
(?P<TYPE>files|buffers)\s+
in\s+
(?P<TARGET>\w+)
"""
                             )

In [None]:
s = 'put files in mydirectory'
print(pattern_compiled.findall(s))
match = pattern_compiled.search(s)
print(match.groups())

In [None]:
match.groupdict()

In [None]:
print(match.group("CMD"))
print(match.group("TYPE"))
print(match.group("TARGET"))

In [None]:
pattern_compiled = re.compile(r'(dog)')
match = pattern_compiled.fullmatch("dog")
match.groups()

In [None]:
s = 'dog'
pattern = r'(dog)'
pc = re.compile(pattern)
match = pc.search(s)
print(match)
print(match.groups())

In [None]:
print(match[0]) # the whole match (same as match.group(0))
print(match[1]) # group 1

### Regex Module

In [None]:
import regex
regex.DEFAULT_VERSION == regex.V0

In [None]:
s = 'abcdefg'
pattern = r'[[a-z]--[aeiou]](?V0)'
regex.findall(pattern, s)

In [None]:
s = 'abcdefg'
pattern = r'[[a-z]--[aeiou]](?V1)'
regex.findall(pattern, s)

In [None]:
pattern = r'[[a-z]--[aeiou]]'

In [None]:
print(re.findall(pattern, s))
print(regex.findall(pattern, s))
print(regex.findall(pattern, s, regex.V0))
print(regex.findall(pattern, s, regex.V1))
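
Set subtraction such as `[[a-z]--[aeiou]]` is a regex module feature (version 1 behaviour). With the standard re module, a similar result can be sketched with a negative lookahead in front of the character class; this is an alternative technique, not what the regex module does internally.

```python
import re

s = 'abcdefg'

# The negative lookahead rejects vowels before [a-z] consumes the character.
consonants = re.findall(r'(?![aeiou])[a-z]', s)
print(consonants)   # ['b', 'c', 'd', 'f', 'g']
```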

In [None]:
s = 'a-beautiful-day'

In [None]:
import sys
print(sys.version)

# new in 3.7: this is now the same as with regex.split() below
re.split('(?=-)', s)

In [None]:
# the third positional argument of split() is maxsplit, so pass flags by keyword
regex.split('(?=-)', s, flags=regex.V1)

In [None]:
whos

In [None]:
regex.split('(?=-)', s)

In [None]:
s = "<aa> b <ccde>"
r = r'<.*>'
mo = regex.match(r, s)
mo.captures()

In [None]:
s = "<aa> b <ccde>"
r = r'<.*>'
mo = re.match(r, s)
mo.string

In [None]:
s = "<aa> b <ccde>"
r = r'<.*?>'
mo = regex.match(r, s)
mo.captures()

In [None]:
s = "<aa> b <ccde>"
r = r'<.*?>'
match_list = regex.findall(r, s)
match_list

In [None]:
s = "<aa> b <ccde>"
r = r'<.*>'
match_list = regex.findall(r, s)
match_list

In [None]:
re.split(r'\W+', 'Words, words, and more words.')

In [None]:
re.split(r'(?:\W+)', 'Words, words, and more words.')
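
The two calls above give the same result, because a non-capturing group does not change what split returns. A capturing group does: the separators are then kept in the result list. A short sketch using the same sentence:

```python
import re

text = 'Words, words, and more words.'

without_separators = re.split(r'\W+', text)    # separators dropped
with_separators = re.split(r'(\W+)', text)     # separators kept as list items

print(without_separators)
print(with_separators)
```

The trailing empty string in both results comes from the final '.', which is itself a separator with nothing after it.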

In [None]:
pattern = re.compile(r'(?P<digits>\d+) (?P<letters>[a-zA-Z])')
mo = pattern.search('123 45 6789 a234')
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))
print(mo.groups())
print(mo.groupdict())

In [None]:
pattern = re.compile(r'(?P<digits>\d+)')
mo = pattern.search('123 45 6789 a234')
print(mo.group(0))
print(mo.groups())
print(mo.groupdict())

In [None]:
pattern = re.compile(r'(?P<digits>\d+)')
pattern.findall('123 45 6789 a234')
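
findall returns plain strings, so the positions and group names are lost. When those are needed, finditer yields one match object per match. A minimal sketch with the same pattern and string:

```python
import re

pattern = re.compile(r'(?P<digits>\d+)')
s = '123 45 6789 a234'

found = pattern.findall(s)
print(found)   # ['123', '45', '6789', '234']

spans = []
for mo in pattern.finditer(s):
    spans.append(mo.span())
    print(mo.group('digits'), mo.span())
```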

In [None]:
env = %env
env_list = [(key,value) for key,value in env.items()]
env_list.sort()
env_list[:5]

In [None]:
print(type(env.keys()))

In [None]:
env = %env
keys = list(env.keys())
keys.sort()
for key in keys:
    m = regex.search(r'(?<=XDG)(\w+)', key)
    if m:
        print(m.group(1))

In [None]:
[k for k in keys if k.startswith('XDG')]

In [None]:
[k for k in keys if regex.search(r'^XDG | ^PATH | ^LS_COLORS', k, flags=regex.X)]

In [None]:
env = %env
env_list = [(key,value) for key,value in env.items() if not regex.search(r'^XDG | ^PATH | ^LS_COLORS | ^DBUS', key, flags=regex.X)]
env_list.sort()
env_list

In [None]:
# lsmagic
# pdoc regex

In [None]:
%psearch env*

In [None]:
timeit pass

In [None]:
who

In [None]:
who_ls