# Regular Expressions Training

## Python Regex Library (`re`)

In [None]:
import re

## The Basics

    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a Pattern object.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.


## Construction

`re` is Python's module for regular expressions. A call takes the form:

    mymatches = re.<method>("pattern","string")
   
Alternatively, the pattern can be compiled into a pattern object, and the same methods can then be called on that object:

    mypattern = re.compile(r"pattern")
    mymatches = mypattern.<method>("string")
    
Depending on the method, the result is a string, a list of strings, or a match object. Strings can be used right away. If a match object is returned, you have to call a method on it (such as `group()`) to extract the desired item.
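As a minimal sketch of the two call styles (the sample text and variable names here are made up for illustration), both forms find the same thing:

```python
import re

text = "Order 66 shipped"  # hypothetical sample text

# Style 1: call the module-level function with the pattern and the string
direct = re.search(r"\d+", text)

# Style 2: compile the pattern once, then call the method on the pattern object
number = re.compile(r"\d+")
compiled = number.search(text)

# Both results are match objects; group() extracts the matched text
print(direct.group())    # 66
print(compiled.group())  # 66
```

Compiling is mostly a convenience when the same pattern is reused many times.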
    
    

### Using re.match

* Matches only at the beginning of the string
* The pattern can be a literal string or a set of characters in a specified format
* Returns a match object, or `None` if the beginning of the string does not match

\* Notice the use of matches.group() to extract match returns. More on this later

In [None]:
#Example of match

# A string to match against
matchingString = 'Yes, match'

# Return a match 
matches = re.match('Y',matchingString)
print('Matching ("Y") returns: ' + str(matches.group()))

matches = re.match('Ye',matchingString)
print('Matching ("Ye") returns: ' + str(matches.group()))

matches = re.match('Yes',matchingString)
print('Matching ("Yes") returns: ' + str(matches.group()))

matches = re.match('Y.',matchingString)
print('Matching ("Y.") returns: ' + str(matches.group()))

matches = re.match('Y.*',matchingString)
print('Matching ("Y.*") returns: ' + str(matches.group()))

matches = re.match('Y?.*',matchingString)
print('Matching ("Y?.*") returns: ' + str(matches.group()))

matches = re.match('Y.+',matchingString)
print('Matching ("Y.+") returns: ' + str(matches.group()))

Matching ("Y") returns: Y
Matching ("Ye") returns: Ye
Matching ("Yes") returns: Yes
Matching ("Y.") returns: Ye
Matching ("Y.*") returns: Yes, match
Matching ("Y?.*") returns: Yes, match
Matching ("Y.+") returns: Yes, match
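Because `re.match` only looks at the beginning of the string, it returns `None` when the pattern occurs later on, and calling `.group()` on `None` raises an `AttributeError`. A small sketch of guarding against that (the helper name `first_match_text` is hypothetical, not part of `re`):

```python
import re

def first_match_text(pattern, string):
    """Return the text matched at the start of string, or None if there is no match."""
    m = re.match(pattern, string)
    return m.group() if m else None

print(first_match_text('Yes', 'Yes, match'))    # Yes
print(first_match_text('match', 'Yes, match'))  # None: 'match' is not at the start
```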


### Using re.fullmatch

* Looks for matches of exactly the format specified, from the first character of the string to the last
* Returns a match object, or `None` if the whole string does not match
* Fullmatch is great when doing data validation or checking for a condition through a True/False statement


In [None]:
# This example DOES NOT work because the string is not an exact match of the specified format
mystring = 'Drop a line to learn more, 908-821-6865 for a direct line to the boss'
matches = re.fullmatch(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)   
print(matches) 

None


In [None]:
# This example DOES work
mystring = '908-821-6865'
mymatches = re.fullmatch(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)   
print(mymatches) 
if mymatches: print('\nMatch Condition Passed')

<re.Match object; span=(0, 12), match='908-821-6865'>

Match Condition Passed


In [None]:
mystring = 'blob'
mymatches = re.fullmatch(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)   
print(mymatches) 
if mymatches: print('\nMatch Condition Passed')

None
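The data-validation use mentioned above could be wrapped in a small function. This is only a sketch: the names `PHONE` and `is_phone_number` are assumptions for illustration, and the pattern is the same one used in the cells above.

```python
import re

# Same phone format as in the examples above, compiled once for reuse
PHONE = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')

def is_phone_number(value):
    """True only if the whole string is a phone number in the expected format."""
    return PHONE.fullmatch(value) is not None

for candidate in ['908-821-6865', '908.821.6865', 'call 908-821-6865', 'blob']:
    print(candidate, is_phone_number(candidate))
```

Only the first two candidates pass, since the other two contain characters outside the format.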


### Using re.search

* Scans the whole string, not just the beginning
* Returns a match object, or `None` if nothing matches
* Returns the first match only

In [None]:
mystring = 'Drop a line to learn more, 908-821-6865 for a direct line to the boss.'
matches = re.search(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)   
print(matches)

<re.Match object; span=(27, 39), match='908-821-6865'>
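To get at the matched text and its position, the match object offers `group()`, `start()`, `end()` and `span()`. A short sketch using the same string and pattern as above:

```python
import re

mystring = 'Drop a line to learn more, 908-821-6865 for a direct line to the boss.'
m = re.search(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d', mystring)

if m:  # search returns None when nothing matches
    print(m.group())                     # the matched text: 908-821-6865
    print(m.start(), m.end())            # the same two numbers as m.span(): 27 39
    print(mystring[m.start():m.end()])   # slicing with the span gives the match back
```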


### Using re.sub

In [None]:
mystring = '84635184 AEG-ABA-CBH'

mynewstring  = re.sub(r'84635184\s*','',mystring)

print(mynewstring)

AEG-ABA-CBH


### Using re.subn

In [None]:
mystring = '84635184 AEG-ABA-CBH 84635184 KNN-RNN-AI'
mynewstring  = re.subn(r'84635184\s*','',mystring)

print(mynewstring)

('AEG-ABA-CBH KNN-RNN-AI', 2)
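Two further options of `re.sub` are worth knowing: the `count` argument limits how many substitutions are made, and the replacement can refer back to captured groups with `\1`, `\2`, and so on. A sketch, where the second sample string is made up for illustration:

```python
import re

mystring = '84635184 AEG-ABA-CBH 84635184 KNN-RNN-AI'

# count=1 replaces only the first occurrence
first_only = re.sub(r'84635184\s*', '', mystring, count=1)
print(first_only)   # AEG-ABA-CBH 84635184 KNN-RNN-AI

# Groups captured with (...) can be reused in the replacement as \1, \2, \3
reformatted = re.sub(r'(\d\d\d)[-.](\d\d\d)[-.](\d\d\d\d)', r'(\1) \2-\3', 'Call 908-821-6865')
print(reformatted)  # Call (908) 821-6865
```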


### Using re.split

In [None]:
mystring = '84635184 AEG-ABA-CBH 84635184 KNN-RNN-AI'
mysplitlist = re.split(' ',mystring)
mysplitlist

['84635184', 'AEG-ABA-CBH', '84635184', 'KNN-RNN-AI']
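Splitting on a single space could also be done with `str.split`. `re.split` becomes useful when the separator varies. A sketch with a made-up string that mixes commas, semicolons and runs of spaces:

```python
import re

messy = 'AEG-ABA-CBH,  KNN-RNN-AI;84635184   XYZ'  # hypothetical sample

# Split on any run of commas, semicolons or whitespace
parts = re.split(r'[,;\s]+', messy)
print(parts)  # ['AEG-ABA-CBH', 'KNN-RNN-AI', '84635184', 'XYZ']
```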

### Using re.findall

* Returns a list of strings, not match objects
* Returns all non-overlapping matches

In [None]:
mystrings = [
    'Hindsight has created a new framework and delivery for native search for digital media companies to improve reader engagement and reader monetization',
    'Our Smart-Tagging technology pre-identifies search terms in an article and attaches internal related content, further context on the topic, and highly targeted advertising (native or video), all accessible to the reader with a single click (or hover) on the term.',
    'The type and search action is now a click or hover action call us at 908-821-6865']
mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = re.findall(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)
    print('Matches returned: ')
    if not mymatches:
        print('Is Empty')
        print('\n')
    else:
        for mymatch in mymatches:
            print(mymatch)
        print('')       

Search String: Hindsight has created a new framework and delivery for native search for digital media companies to improve reader engagement and reader monetization
Matches returned: 
Is Empty


Search String: Our Smart-Tagging technology pre-identifies search terms in an article and attaches internal related content, further context on the topic, and highly targeted advertising (native or video), all accessible to the reader with a single click (or hover) on the term.
Matches returned: 
Is Empty


Search String: The type and search action is now a click or hover action call us at 908-821-6865
Matches returned: 
908-821-6865
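One detail of `re.findall` that often surprises: if the pattern contains groups, the list holds the group contents (as tuples when there is more than one group) rather than the whole match. A sketch with a made-up sample string:

```python
import re

text = 'Call 908-821-6865 or 281.832.4848'  # hypothetical sample

# No groups: each item is the whole match
whole = re.findall(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d', text)
print(whole)  # ['908-821-6865', '281.832.4848']

# Three groups: each item is a tuple of the three captured parts
parts = re.findall(r'(\d\d\d)[-.](\d\d\d)[-.](\d\d\d\d)', text)
print(parts)  # [('908', '821', '6865'), ('281', '832', '4848')]
```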



### Using re.finditer

* Returns an iterator that yields a match object for each match

In [None]:
mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = re.finditer(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')

Search String: Hindsight has created a new framework and delivery for native search for digital media companies to improve reader engagement and reader monetization
Matches returned: 

Search String: Our Smart-Tagging technology pre-identifies search terms in an article and attaches internal related content, further context on the topic, and highly targeted advertising (native or video), all accessible to the reader with a single click (or hover) on the term.
Matches returned: 

Search String: The type and search action is now a click or hover action call us at 908-821-6865
Matches returned: 
<re.Match object; span=(69, 81), match='908-821-6865'>
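Since `re.finditer` returns an iterator, the matches are produced one at a time and can only be looped over once. A sketch with a made-up sample string, collecting text and span for each match:

```python
import re

text = 'Call 908-821-6865 or 281-832-4848'  # hypothetical sample
matches = re.finditer(r'\d\d\d-\d\d\d-\d\d\d\d', text)

# Collect the matched text and span of every match
found = [(m.group(), m.span()) for m in matches]
print(found)     # [('908-821-6865', (5, 17)), ('281-832-4848', (21, 33))]

# The iterator is now exhausted, so a second pass yields nothing
leftover = list(matches)
print(leftover)  # []
```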



* You can extract items from the match object

In [None]:
# Calling the span start and end
print('span starts at: ' + str(mymatch.span()[0]))
print('span ends at: ' + str(mymatch.span()[1]))

# Using span to slice the string the match came from (the third string, index 2)
print(mystrings[2][mymatch.span()[0]:mymatch.span()[1]])

span starts at: 69
span ends at: 81
908-821-6865


## Patterns

The following is a list of pattern syntax elements and their descriptions:

    "."      Matches any character except a newline.
    
    "^"      Matches the start of the string.
    
    "$"      Matches the end of the string or just before the newline at the end of the string.
    
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
             
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    
    *?,+?,?? Non-greedy versions of the previous three special characters.
    
    {m,n}    Matches from m to n repetitions of the preceding RE.
    
    {m,n}?   Non-greedy version of the above.
    
    "\\"     Either escapes special characters or signals a special sequence.
    
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
             
    "|"      A|B, creates an RE that will match either A or B.
    
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
                 
    (?:...)  Non-grouping version of regular parentheses.
    
    (?P<name>...) The substring matched by the group is accessible by name.
    
    (?P=name)     Matches the text matched earlier by the group named name.
    
    (?#...)  A comment; ignored.
    
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    
    (?!...)  Matches if ... doesn't match next.
    
    (?<=...) Matches if preceded by ... (must be fixed length).
    
    (?<!...) Matches if not preceded by ... (must be fixed length).
    
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched
    
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
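A few entries from the list above are easier to grasp with a quick experiment. The following sketch (all sample strings are made up) contrasts greedy and non-greedy repetition, reads a named group, and uses a lookahead:

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # hypothetical sample

# Greedy: ".+" grabs as much as possible, up to the last ">"
greedy = re.findall(r'<.+>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: ".+?" stops at the first ">" it can
lazy = re.findall(r'<.+?>', html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Named groups: parts of the match are accessible by name
m = re.search(r'(?P<area>\d\d\d)-(?P<exchange>\d\d\d)-(?P<line>\d\d\d\d)', '908-821-6865')
print(m.group('area'))  # 908
print(m.groupdict())    # {'area': '908', 'exchange': '821', 'line': '6865'}

# Lookahead: digits that are followed by "px", without consuming "px"
pixels = re.findall(r'\d+(?=px)', '12px 30em 7px')
print(pixels)  # ['12', '7']
```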



The following examples demonstrate different patterns using a **compiled** pattern object and its **finditer** method.

We start out with the most basic pattern. The **"."** matches any single character (except a newline), so each character in the string is returned as its own match.

In [None]:
mystring = 'aa?K,%5w'

pattern = re.compile(r'.')

mymatches = pattern.finditer(mystring)

print('Matches returned: ')
for mymatch in mymatches : print(mymatch)
print('')

Matches returned: 
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='?'>
<re.Match object; span=(3, 4), match='K'>
<re.Match object; span=(4, 5), match=','>
<re.Match object; span=(5, 6), match='%'>
<re.Match object; span=(6, 7), match='5'>
<re.Match object; span=(7, 8), match='w'>



Here we use **"^"** to find the string which has a matching pattern at the **beginning**, and return the exact pattern as the match object

In [None]:
mystrings = [
    'https://www.youtube.com/watch?c=K8cvcvL6KVGG-7o',
    'youtube.com/watch?cc=K8L6KVGG-7o'
]

In [None]:
pattern = re.compile(r'^you')

mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = pattern.finditer(mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')

Search String: https://www.youtube.com/watch?c=K8cvcvL6KVGG-7o
Matches returned: 

Search String: youtube.com/watch?cc=K8L6KVGG-7o
Matches returned: 
<re.Match object; span=(0, 3), match='you'>



In [None]:
mystring = 'https://www.youtube.com/watch?cc=K8L6KVGG-7o'

pattern = re.compile(r'.*')

mymatches = pattern.finditer(mystring)
for mymatch in mymatches: print(mymatch)

<re.Match object; span=(0, 44), match='https://www.youtube.com/watch?cc=K8L6KVGG-7o'>
<re.Match object; span=(44, 44), match=''>
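The second, empty match in the output above appears because **"\*"** allows zero repetitions: after `.*` has consumed the whole string, it can still match "nothing" at the very end. A sketch comparing it with **"+"**, which requires at least one character:

```python
import re

url = 'https://www.youtube.com/watch?cc=K8L6KVGG-7o'

# ".*" matches the whole string, then once more with zero characters at the end
star_spans = [m.span() for m in re.finditer(r'.*', url)]
print(star_spans)  # [(0, 44), (44, 44)]

# ".+" needs at least one character, so there is no empty match
plus_spans = [m.span() for m in re.finditer(r'.+', url)]
print(plus_spans)  # [(0, 44)]
```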


Here we use **"$"** to find the string which has a matching pattern at the **end**, and return the exact pattern as the match object

In [None]:
mystrings = [
    'Past participle verbs end in "-ed" (i.e.) finished',
    'Present continuous verbs end in "-ing" (i.e) running'
]

In [None]:
pattern = re.compile(r'ing$')

mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = pattern.finditer(mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')

Search String: Past participle verbs end in "-ed" (i.e.) finished
Matches returned: 

Search String: Present continuous verbs end in "-ing" (i.e) running
Matches returned: 
<re.Match object; span=(49, 52), match='ing'>



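Additionally, we can put a starting character and **".\*"** in front of the ending, so the returned match runs from the first occurrence of that character through the **end** of the string. Adding **"^"** in front (see the commented-out patterns) further requires that character to be at the **beginning** of the string.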

In [None]:
mystrings = [
    'Past participle verbs end in "-ed" (i.e.) finished',
    'Continuous verbs end in "-ing" (i.e) running'
]

# pattern = re.compile(r'^C.*ing$')
# pattern = re.compile(r'^o.*ing$')
pattern = re.compile(r'o.*ing$')



mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = pattern.finditer(mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')
    
mystrings[1][mymatch.span()[0]:mymatch.span()[1]]

Search String: Past participle verbs end in "-ed" (i.e.) finished
Matches returned: 

Search String: Continuous verbs end in "-ing" (i.e) running
Matches returned: 
<re.Match object; span=(1, 44), match='ontinuous verbs end in "-ing" (i.e) running'>



'ontinuous verbs end in "-ing" (i.e) running'

In [None]:
mystrings = [
    'Past participle verbs end in "-ed" (i.e.) finished',
    'Continuous verbs end in "-ing" (i.e) running',
    'ing'
]

# pattern = re.compile(r'^.+ing$')
pattern = re.compile(r'^.*ing$')


mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = pattern.finditer(mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')     

Search String: Past participle verbs end in "-ed" (i.e.) finished
Matches returned: 

Search String: Continuous verbs end in "-ing" (i.e) running
Matches returned: 
<re.Match object; span=(0, 44), match='Continuous verbs end in "-ing" (i.e) running'>

Search String: ing
Matches returned: 
<re.Match object; span=(0, 3), match='ing'>



* We use **"\\"** to escape special characters like periods, so they are matched literally
* We use **"?"** to make the preceding character optional (0 or 1 occurrences)

Find instances of Mr.

In [None]:
mystring = 'Mr. Patel, Mr Kara, Ms Ramkumar, Mr. Guridi, Mr. A, Mrs. Butterworth'

In [None]:
# We want to find instances of "Mr."
pattern = re.compile(r'Mr\.')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<re.Match object; span=(0, 3), match='Mr.'>
<re.Match object; span=(33, 36), match='Mr.'>
<re.Match object; span=(45, 48), match='Mr.'>


In [None]:
# We want to get every "Mr" whether it has a period or not (this also picks up the "Mr" in "Mrs."):
pattern = re.compile(r'Mr\.?')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<re.Match object; span=(0, 3), match='Mr.'>
<re.Match object; span=(11, 13), match='Mr'>
<re.Match object; span=(33, 36), match='Mr.'>
<re.Match object; span=(45, 48), match='Mr.'>
<re.Match object; span=(52, 54), match='Mr'>


#### Characters used for pattern construction

    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.

    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
                
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
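A quick way to see these three classes in action is to run them over one string. This is a sketch with made-up sample text. The last two lines show the effect of the ASCII flag on `\d`, using the Arabic-Indic digit three (`\u0663`) as an example of a non-ASCII digit:

```python
import re

sample = 'Room 42, floor_3\tEast wing'  # hypothetical sample

digits = re.findall(r'\d', sample)
print(digits)  # ['4', '2', '3']

words = re.findall(r'\w+', sample)
print(words)   # ['Room', '42', 'floor_3', 'East', 'wing']

spaces = re.findall(r'\s', sample)
print(spaces)  # [' ', ' ', '\t', ' ']

# Without the ASCII flag, \d also matches Unicode digits
mixed = '\u0663' + '5'
unicode_digits = re.findall(r'\d', mixed)
ascii_digits = re.findall(r'\d', mixed, re.ASCII)
print(len(unicode_digits), ascii_digits)  # 2 ['5']
```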

In [None]:
#Then we look for a space and a capital letter following the space
pattern = re.compile(r'Mr\.?\s[A-Z]') #Finds Mr. P, Mr K, Mr. G and Mr. A
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<re.Match object; span=(0, 5), match='Mr. P'>
<re.Match object; span=(11, 15), match='Mr K'>
<re.Match object; span=(33, 38), match='Mr. G'>
<re.Match object; span=(45, 50), match='Mr. A'>


In [None]:
#Add the search for the complete name
#After we have found the prefix followed by an optional period, a space and the first letter, we don't know how many letters follow

pattern = re.compile(r'Mr\.?\s[A-Z]\w+') #Finds Mr. Patel, Mr Kara and Mr. Guridi, but not Mr. A, because there is no word character after the A
# pattern = re.compile(r'Mr\.?\s\w+') # Also finds Mr. A, because \w+ can match the single letter A
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<re.Match object; span=(0, 9), match='Mr. Patel'>
<re.Match object; span=(11, 18), match='Mr Kara'>
<re.Match object; span=(33, 43), match='Mr. Guridi'>


In [None]:
#The asterisk quantifier matches zero or more of the preceding character

pattern = re.compile(r'Mr\.?\s[A-Z]\w*') #Finds Mr. Patel, Mr Kara, Mr. Guridi AND Mr. A (because the A is followed by zero word characters)
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<re.Match object; span=(0, 9), match='Mr. Patel'>
<re.Match object; span=(11, 18), match='Mr Kara'>
<re.Match object; span=(33, 43), match='Mr. Guridi'>
<re.Match object; span=(45, 50), match='Mr. A'>


In [None]:
#Add a group of alternatives so that Mr, Ms and Mrs are all matched

pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<re.Match object; span=(0, 9), match='Mr. Patel'>
<re.Match object; span=(11, 18), match='Mr Kara'>
<re.Match object; span=(20, 31), match='Ms Ramkumar'>
<re.Match object; span=(33, 43), match='Mr. Guridi'>
<re.Match object; span=(45, 50), match='Mr. A'>
<re.Match object; span=(52, 68), match='Mrs. Butterworth'>


In [None]:
# Easier to read, same result

pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<re.Match object; span=(0, 9), match='Mr. Patel'>
<re.Match object; span=(11, 18), match='Mr Kara'>
<re.Match object; span=(20, 31), match='Ms Ramkumar'>
<re.Match object; span=(33, 43), match='Mr. Guridi'>
<re.Match object; span=(45, 50), match='Mr. A'>
<re.Match object; span=(52, 68), match='Mrs. Butterworth'>


In [None]:
mystring = 'The email address workers@hindsightsolutions.net belongs to Hindsight Technology Solutions'

# Now we use a pattern to find multiple lower or upper until '@', then multiple lower or upper until '.', then 'com', 'org', 'net' or 'edu'

pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.(com|org|net|edu)')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<re.Match object; span=(18, 48), match='workers@hindsightsolutions.net'>


In [None]:
# We find multiple lower or upper (or periods) until '@', then multiple lower or upper until '.', then 'com' or 'edu'

mystring = 'The email address workers@umd.edu belongs to UMD'

pattern = re.compile(r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)') # Parentheses with | match 'com' or 'edu' as whole alternatives; square brackets would only match single characters
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)   

<re.Match object; span=(18, 33), match='workers@umd.edu'>
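Square brackets and parentheses behave very differently around **"|"**. Inside `[...]` the characters, including `|`, are just members of a set, so `[com|edu]*` accepts any mix of those letters. Inside `(...)` the `|` separates whole alternatives. A sketch comparing the two on a few made-up endings:

```python
import re

char_set = re.compile(r'\.[com|edu]*')      # a set of single characters, repeated
alternation = re.compile(r'\.(?:com|edu)')  # exactly 'com' or exactly 'edu'

for ending in ['.com', '.edu', '.moc', '.deco']:
    # Print whether each pattern accepts the ending as a whole
    print(ending, bool(char_set.fullmatch(ending)), bool(alternation.fullmatch(ending)))
```

The set version also accepts `.moc` and `.deco`, while the alternation accepts only `.com` and `.edu`.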


#### Quantifiers

Quantifiers:
*     \*     0 or more
*     \+     1 or more
*     ?     0 or one
*     {3}   exact number
*     {3,4} range of numbers

In [None]:
mystring = '281-832-4848'

pattern = re.compile(r'\d{3}[-]\d{3}[-]\d{4}')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)      

<re.Match object; span=(0, 12), match='281-832-4848'>
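The exact and range quantifiers can be compared side by side. This sketch uses a made-up list of numbers and also relies on `\b` (a word boundary, not covered above) so that a quantifier cannot match just part of a longer number:

```python
import re

text = 'ids: 7, 42, 123, 4567, 89012'  # hypothetical sample

exactly_three = re.findall(r'\b\d{3}\b', text)
print(exactly_three)   # ['123']

two_to_four = re.findall(r'\b\d{2,4}\b', text)
print(two_to_four)     # ['42', '123', '4567']

three_or_more = re.findall(r'\b\d{3,}\b', text)
print(three_or_more)   # ['123', '4567', '89012']
```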
