### <font color="brown">Regular Expressions</font>

Tutorials can be found at the following sites:

1. https://www.w3schools.com/python/python_regex.asp
2. https://developers.google.com/edu/python/regular-expressions#basic-patterns
3. https://docs.python.org/3/howto/regex.html?highlight=regular%20expressions

And the site https://regex101.com/ has a regular expression engine you can use to try things out.


---

#### <font color="brown">Import the re module</font>

In [None]:
import re

---

#### <font color="brown">Search for a pattern in a string using re.search function</font>

In [2]:
res = re.search('a','cat')
print(res)
# 'cat' has an 'a' in it at index 1; the end of the span (2) is exclusive, like a slice

<re.Match object; span=(1, 2), match='a'>
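The printed Match object can also be queried directly. The following is a minimal sketch using the standard `start()`, `end()`, `span()`, and `group()` methods of match objects; the variable name `m` is only illustrative.

```python
import re

m = re.search('a', 'cat')
# The end of a span is exclusive, as in slicing: 'cat'[1:2] == 'a'
print(m.start(), m.end())   # start index and (exclusive) end index
print(m.span())             # both values as a tuple
print(m.group())            # the matched text
```

Slicing the target string with `m.start()` and `m.end()` gives back the matched text.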


In [3]:
res = re.search('a','dog')
print(res)

None


In [4]:
print('matched' if re.search('a','dog') else 'not matched')

not matched


In [13]:
res = re.search('ar','barbaric')  # returns the first occurrence of a match
print(res)  

<re.Match object; span=(1, 3), match='ar'>


In [27]:
# a search can fail (and return None), so test the result in a condition
def searchit(pattern,astr): 
    if re.search(pattern,astr):   # same as: if re.search(pattern,astr) is not None
        return True
    else:
        return False 

print(searchit('a','cat'))
print(searchit('a','dog'))
print(searchit('ar','barbaric'))

True
False
True


**But matching literal strings is faster with a string method such as find**

In [20]:
def findit(astr,litstr):
    return astr.find(litstr) != -1
    
print(findit('cat','a'))
print(findit('dog','a'))
print(findit('barbaric','ar'))

True
False
True
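For a plain yes/no substring test, the `in` operator is another option besides `find`. The sketch below compares it with `re.search` on the same inputs; the helper name `containsit` is hypothetical, and the comparison only holds while the literal contains no metacharacters.

```python
import re

def containsit(astr, litstr):
    # 'in' tests substring membership; no regular expression is involved
    return litstr in astr

for astr, litstr in [('cat', 'a'), ('dog', 'a'), ('barbaric', 'ar')]:
    # both columns agree because these literals have no metacharacters
    print(containsit(astr, litstr), bool(re.search(litstr, astr)))
```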


---

#### <font color="brown">Writing regexp patterns with metacharacters</font>

**Metacharacter [ ] is used for a class of characters<br>
Metacharacter * means 0 or more of preceding character/class<br>
Metacharacter + means 1 or more of preceding character/class**

In [None]:
# search for any sequence that starts with a, ends with t,
# and has any number of letters or digits (zero included) between
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z0-9]*t',astr)  # uses metacharacters [] and *
    print('match', res) if res else print('no match')

In [29]:
# search for any sequence that starts with a, ends with t,
# and has any number of letters or digits (zero included) between
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z0-9]*t',astr)  # uses metacharacters [] and *
    print('match') if res else print('no match')

string? ('quit' to stop)  at


match


string? ('quit' to stop)  art


match


string? ('quit' to stop)  artistic


match


string? ('quit' to stop)  at&t


match


string? ('quit' to stop)  a&tt


no match


string? ('quit' to stop)  armada


no match


string? ('quit' to stop)  quit


In [30]:
# search for any sequence that starts with a, ends with t,
# and has any number of letters or digits (zero included) between
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z0-9]*t',astr)  # uses metacharacters [] and *
    print('match') if res else print('no match')

string? ('quit' to stop)  at


match


string? ('quit' to stop)  a2t


match


string? ('quit' to stop)  a23Aaatt


match


string? ('quit' to stop)  at&t


match


string? ('quit' to stop)  a&tt


no match


string? ('quit' to stop)  arm


no match


string? ('quit' to stop)  a1B2xyt


match


string? ('quit' to stop)  quit


In [31]:
# search for any sequence that starts with a, ends with t,
# and has AT LEAST one letter and one digit between, in that order
# i.e. between a and t, all letters must precede all digits
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z]+[0-9]+t',astr)  # uses metacharacters [] and +
    print('match') if res else print('no match')

string? ('quit' to stop)  at


no match


string? ('quit' to stop)  aat


no match


string? ('quit' to stop)  aa1t


match


string? ('quit' to stop)  a1at


no match


string? ('quit' to stop)  art1st


no match


string? ('quit' to stop)  quit
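The interactive sessions above can also be reproduced without typing each string. This is a small sketch, assuming a hypothetical helper `try_pattern` that returns the strings in which a pattern is found.

```python
import re

def try_pattern(pattern, strings):
    """Return the subset of strings in which pattern is found."""
    return [s for s in strings if re.search(pattern, s)]

# zero or more letters/digits between a and t
star_hits = try_pattern('a[a-zA-Z0-9]*t', ['at', 'a2t', 'at&t', 'a&tt', 'arm'])
# one or more letters, then one or more digits, between a and t
plus_hits = try_pattern('a[a-zA-Z]+[0-9]+t', ['at', 'aat', 'aa1t', 'a1at', 'art1st'])
print(star_hits)
print(plus_hits)
```

Collecting the test strings in a list makes it easy to rerun the same cases after changing a pattern.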


**Metacharacter . matches any character (except a newline, by default)**

In [22]:
# search for any sequence that starts with a, ends with t,
# and has any character any number of times (including zero) between
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a.*t',astr)  # uses metacharacters . and *
    print('match') if res else print('no match')

String?  at


match


String?  tart


match


String?  roast


match


String?  rat


match


String?  race


no match


String?  quit
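One detail the session above does not exercise: by default `.` does not match a newline character. The sketch below shows this with a hypothetical target string `text`, and uses the standard `re.DOTALL` flag to lift the restriction.

```python
import re

text = 'a\nt'   # 'a', a newline, then 't'
no_flag = re.search('a.*t', text)               # '.' stops at the newline
with_flag = re.search('a.*t', text, re.DOTALL)  # '.' now matches the newline too
print(no_flag)
print(with_flag)
```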


**Metacharacter ? matches zero or one occurrence of the preceding character/class**

In [24]:
res = re.search('ac?t','at')
print(res)
res = re.search('ac?t','act')
print(res)
res = re.search('ac?t','tractor')
print(res)
res = re.search('ac?t','art')
print(res)

<re.Match object; span=(0, 2), match='at'>
<re.Match object; span=(0, 3), match='act'>
<re.Match object; span=(2, 5), match='act'>
None


**Metacharacter ^ matches start of string when used outside of a [ ] class<br>
Metacharacter $ matches end of string**

In [25]:
# match all strings that start with ar, end with t, 
# and have at least one lowercase letter between
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('^ar[a-z]+t$',astr)  # uses metacharacters ^, [ ], +, and $
    print('match') if res else print('no match')

String?  at


no match


String?  art


no match


String?  arrest


match


String?  artist


match


String?  arid


no match


String?  aristocrat


match


String?  arrested


no match


String?  quit
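When the whole string must match, `re.fullmatch` is an alternative to wrapping the pattern in `^` and `$`. The sketch below compares the two; the loop variables are illustrative. The results differ in one case, because `$` also matches just before a newline at the very end of the string.

```python
import re

pattern = 'ar[a-z]+t'   # same pattern as above, without ^ and $
results = {}
for s in ['art', 'arrest', 'arrested', 'arrest\n']:
    anchored = bool(re.search('^' + pattern + '$', s))
    full = bool(re.fullmatch(pattern, s))
    results[s] = (anchored, full)
    print(repr(s), anchored, full)
```

Only the string with the trailing newline gets different answers: the anchored search accepts it, while `re.fullmatch` does not.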


**Metacharacter ^ negates when used as first character inside a class [ ]**

In [32]:
# match all strings that start with ar, end with t, 
# and do NOT have any digits between
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('^ar[^0-9]*t$',astr)  # uses metacharacters ^, [ ], *, and $
    print('match') if res else print('no match')

string? ('quit' to stop)  art


match


string? ('quit' to stop)  artist


match


string? ('quit' to stop)  cart


no match


string? ('quit' to stop)  at


no match


string? ('quit' to stop)  ar1st


no match


string? ('quit' to stop)  quit


In [70]:
# want string to not have any digits or upper case
res = re.search(r'^[^0-9A-Z]*$','abcXyz')   
print(res) 

None


**Metacharacter | is used for alternative match, usually used with metacharacters ( )**

In [33]:
# search for any sequence that starts with a, ends with t,
# and has exactly one letter and exactly one digit between, in EITHER order
# first attempt
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z0-9][a-zA-Z0-9]t',astr)  # uses metacharacters []
    print('match') if res else print('no match')

string? ('quit' to stop)  aa9t


match


string? ('quit' to stop)  a9at


match


string? ('quit' to stop)  aaat


match


string? ('quit' to stop)  a99t


match


string? ('quit' to stop)  quit


**The above does not work because each class accepts a letter or a digit, so two consecutive letters or two consecutive digits also match**

In [34]:
# search for any sequence that starts with a, ends with t,
# and has exactly one letter and exactly one digit between, in EITHER order
# second attempt
while True:
    astr = input('String? ')
    if astr == 'quit':
        break
    res = re.search('a([a-zA-Z][0-9]|[0-9][a-zA-Z])t',astr)  # uses metacharacters [], (), |
    print('match') if res else print('no match')

String?  aa9t


match


String?  a9at


match


String?  aaat


no match


String?  a99t


no match


String?  quit


In [4]:
# match any text that has airport or airplane
res = re.search('air(port|plane)','the air is cool')
print(res)
res = re.search('air(port|plane)','newark airport security')
print(res)
res = re.search('air(port|plane)','airplane')
print(res)
res = re.search('air(port|plane)','port')
print(res)

# match any text that has airport or airplane, or air
res = re.search('air(port|plane)|air','the air is cool')
print(res)

None
<re.Match object; span=(7, 14), match='airport'>
<re.Match object; span=(0, 8), match='airplane'>
None
<re.Match object; span=(4, 7), match='air'>
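Parentheses also capture what they matched, and a `?` can follow a whole group. The sketch below is one possible variant of the last pattern: it makes the `(port|plane)` group optional and reads the captured text with `group(1)`.

```python
import re

found = []
for text in ['the air is cool', 'newark airport security', 'airplane']:
    m = re.search('air(port|plane)?', text)   # the whole group is optional
    found.append((m.group(0), m.group(1)))    # full match, then the group's text
    print(m.group(0), m.group(1))
```

When the optional group does not take part in the match, `group(1)` is `None`.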


**Greedy and non-greedy matching**

In [35]:
# greedy matching
# * matches longest possible sequence 
res = re.search('<.*>', '<p class="para">This is a paragraph.</p>')
print(res)

<re.Match object; span=(0, 40), match='<p class="para">This is a paragraph.</p>'>


In [36]:
# non-greedy matching with ?
res = re.search('<.*?>', '<p class="para">This is a paragraph.</p>')
print(res)

<re.Match object; span=(0, 16), match='<p class="para">'>


In [71]:
# or, we can use negation to prevent < or > characters in between
print(re.search(r'<[^<>]*>','<p class="para">This is a paragraph.</p>'))

<re.Match object; span=(0, 16), match='<p class="para">'>


<font color="red">Note that the usage of ? following a quantifier such as * or + is different from<br>
the usage of ? following a single character (for one or zero occurrences of that character)</font>
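The difference between greedy and non-greedy matching is easy to see when all matches are collected. The sketch below uses `re.findall`, a standard function not used elsewhere in this notebook, which returns every non-overlapping match as a list; the variable names are illustrative.

```python
import re

html = '<p class="para">This is a paragraph.</p>'
greedy = re.findall('<.*>', html)    # one long match from the first < to the last >
lazy = re.findall('<.*?>', html)     # each tag separately
print(greedy)
print(lazy)
```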

**Metacharacter pair { } is used for a specific number of repetitions**

In [48]:
# search for three consecutive uppercase letters (a longer run also matches, on its first three)
print(re.search('[A-Z]{3}','ABC'))
print(re.search('[A-Z]{3}','xyABCdef'))
print(re.search('[A-Z]{3}','xyABdef'))

<re.Match object; span=(0, 3), match='ABC'>
<re.Match object; span=(2, 5), match='ABC'>
None


In [49]:
# search for a run of 2 to 4 consecutive uppercase letters (a longer run matches on its first 4)
print(re.search('[A-Z]{2,4}','A'))
print(re.search('[A-Z]{2,4}','AB'))
print(re.search('[A-Z]{2,4}','xyABCdef'))
print(re.search('[A-Z]{2,4}','xyABCDef'))
print(re.search('[A-Z]{2,4}','xyABCDZef'))

None
<re.Match object; span=(0, 2), match='AB'>
<re.Match object; span=(2, 5), match='ABC'>
<re.Match object; span=(2, 6), match='ABCD'>
<re.Match object; span=(2, 6), match='ABCD'>


In [50]:
# search for any sequence that has at least two consecutive uppercase letters 
print(re.search('[A-Z]{2,}','12CAR34'))
print(re.search('[A-Z]{2,}','12C34'))

<re.Match object; span=(2, 5), match='CAR'>
None


In [51]:
# search for any string that starts with at most two consecutive uppercase letters, 
# followed by a digit
print(re.search('^[A-Z]{,2}[0-9]','1234'))
print(re.search('^[A-Z]{,2}[0-9]','X1234'))
print(re.search('^[A-Z]{,2}[0-9]','CAR1234'))

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 2), match='X1'>
None
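A count in `{ }` limits how much the pattern consumes, not how long the surrounding run may be. If the string must contain exactly that many characters, the count can be combined with `^` and `$`, as in this small sketch with illustrative variable names.

```python
import re

inside_run = re.search('[A-Z]{3}', 'ABCD')      # finds 'ABC' inside a run of four
too_long = re.search('^[A-Z]{3}$', 'ABCD')      # whole string must be three letters
exact = re.search('^[A-Z]{3}$', 'ABC')
print(inside_run)
print(too_long)
print(exact)
```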


**Inside a class [ ], most metacharacters (such as $ . ? * +) lose their special meaning, and so does ^ if it's not the first character**

In [52]:
print(re.search('a[$.^?]*t','at'))
print(re.search('a[$.^?]*t','a$.t'))
print(re.search('a[$.^?]*t','a.^^t'))
print(re.search('a[$.^?]*t','a?$..t'))
print(re.search('a[$.^?]*t','aBCt'))

<re.Match object; span=(0, 2), match='at'>
<re.Match object; span=(0, 4), match='a$.t'>
<re.Match object; span=(0, 5), match='a.^^t'>
<re.Match object; span=(0, 6), match='a?$..t'>
None
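Position matters for the two characters that stay special inside a class, `^` and `-`. The sketch below illustrates this with single-character targets; the variable names are hypothetical.

```python
import re

negated = re.search('[^a]', 'a^')     # ^ first: negation, so the first non-'a' character matches
caret = re.search('[a^]', '^')        # ^ not first: a literal caret
in_range = re.search('[a-c]', '-')    # - between two characters: a range, so '-' itself fails
hyphen = re.search('[ac-]', '-')      # - last: a literal hyphen
print(negated, caret, in_range, hyphen, sep='\n')
```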


**To "defang" a metacharacter (make it lose its special meaning), put the metacharacter '\\' in front of it**

In [53]:
print(re.search('a+b','ab'))
print(re.search('a+b','aab'))
print(re.search('a+b','a+b')) 
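# newer Python versions warn about '\+' in an ordinary string; raw strings (covered below) avoid this
print(re.search('a\+b','a+b')) # to match + literally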
print(re.search('a\++b','a+b')) # to match + literally, one or more times
print(re.search('a\++b','a++b')) 
print(re.search('a\++b','a+cb')) # c won't match

<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='aab'>
None
<re.Match object; span=(0, 3), match='a+b'>
<re.Match object; span=(0, 3), match='a+b'>
<re.Match object; span=(0, 4), match='a++b'>
None
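When the text to be matched literally comes from a variable, the backslashes do not have to be added by hand. The sketch below uses the standard `re.escape` function for this; the names `literal` and `pattern` are illustrative.

```python
import re

literal = 'a+b'
pattern = re.escape(literal)   # puts a backslash in front of the +
print(pattern)
res = re.search(pattern, 'xa+by')
print(res)
```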


**Using '\\\\' to defang the '\\' itself**

In [58]:
# want to match either 'a\t' or 'at', i.e. one or zero occurrences of '\' between a and t
print(re.search('a\?t','at'))  

None


**Above doesn't work because the '\\' defangs the following metacharacter ?, which is then taken literally, see below**

In [56]:
print(re.search('a\?t','a?t')) 

<re.Match object; span=(0, 3), match='a?t'>


In [57]:
# how about this?
print(re.search('a\\?t','at')) # doesn't work either

None


**The above doesn't work because Python reads the '\\\\' in the string literal as the first '\\' escaping the second '\\', so the pattern that reaches the search function is 'a\\?t', which again matches a literal ?**

In [72]:
# So you will need to do a\\\\?t to have Python translate it to 'a\\?t'
# but this is cumbersome
print(re.search('a\\\\?t','at'))

<re.Match object; span=(0, 2), match='at'>
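Printing the pattern strings shows which characters actually reach the regular expression engine after Python has processed the string literal. This is a minimal sketch; the list name `shown` is illustrative.

```python
shown = []
for literal in ['a\\?t', 'a\\\\?t']:
    # print displays the processed string, not the source text of the literal
    print(literal, len(literal))
    shown.append(len(literal))
```

The first literal shrinks to four characters (one backslash), the second to five (two backslashes).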


**An easy workaround is to mark the pattern as a RAW string, with an 'r' in front of it, so Python leaves the backslashes alone and passes the pattern as is to the search function**

In [75]:
# use r in front of regular expression string
print(re.search(r'a\\?t','a\t')) # doesn't work, because '\t' in the ordinary target string is a single tab character

None


In [74]:
# Use r in front of target string as well
print(re.search(r'a\\?t',r'a\t'))

<re.Match object; span=(0, 3), match='a\\t'>


In [76]:
# with r in front of regexp string, single '\' still retains its special meaning
print(re.search(r'a\?t','a?t'))
print(re.search(r'a\?t','a??t'))

<re.Match object; span=(0, 3), match='a?t'>
None
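A raw string and an ordinary string with doubled backslashes are two spellings of the same characters. The short sketch below checks this, and shows the length difference between a tab and a raw `\t`; the variable names are illustrative.

```python
import re

same = (r'a\\?t' == 'a\\\\?t')          # the same five characters, written two ways
lengths = (len('a\t'), len(r'a\t'))     # 'a' + tab versus 'a', backslash, 't'
m = re.search(r'a\\?t', r'a\t')
print(same)
print(lengths)
print(m.group(), len(m.group()))
```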


#### It is safest to always use r'...' for the regular expression