## Regular Expressions Training
### Samuel Gonzalez
### Jun 21, 2018

In [346]:
import re

## The Basics

    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.
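
Of the functions listed above, escape is the one whose purpose is least obvious from a one-line description. The following is a minimal sketch, using a made-up sample string, of how it makes a literal string safe to use as a pattern:

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "pi is 3.14, not 3x14"
literal = "3.14"

# Unescaped, the "." is a wildcard and also matches the "x" in "3x14"
unescaped_hits = re.findall(literal, text)

# Escaped, the "." is treated as a literal period
escaped_hits = re.findall(re.escape(literal), text)

print(unescaped_hits)  # ['3.14', '3x14']
print(escaped_hits)    # ['3.14']
```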


## Construction

re is Python's module for regular expressions. A call to one of its functions has the form:

    mymatches = re.<method>("pattern","string")
   
Alternatively, the pattern can be compiled into a regular expression object, and the same methods can then be called on that object:

    mypattern = re.compile(r"pattern")
    mymatches = mypattern.<method>("string")
    
Depending on the method, the result is either a string (or a list of strings) or a match object. Strings can be used right away; if a match object is returned, you have to call one of its methods, such as group(), to extract the desired item.
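
As a rough sketch of the two calling styles and the two kinds of result, assuming a made-up sample string:

```python
import re

text = "Part 23093762 shipped"  # hypothetical sample string

# Module-level call: the pattern is passed with every call
direct = re.search(r"\d+", text)

# Compiled pattern: build it once, then call its methods
digits = re.compile(r"\d+")
compiled = digits.search(text)

# search returns a match object; group() extracts the matched text
print(direct.group())    # 23093762
print(compiled.group())  # 23093762

# findall returns plain strings that can be used right away
found = digits.findall(text)
print(found)  # ['23093762']
```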
    
    

### Using re.match

* Matches the pattern only at the beginning of the string
* The pattern can be a literal string or a set of characters matching the specified format
* Returns a match object, or None if there is no match

\* Notice the use of matches.group() to extract the matched text from the match object. More on this later

In [347]:
#Example of match

# A string to match against
matchingString = 'Yes, match'

# Return a match 
matches = re.match('Y',matchingString)
print('Matching ("Y") returns: ' + str(matches.group()))

matches = re.match('Ye',matchingString)
print('Matching ("Ye") returns: ' + str(matches.group()))

matches = re.match('Yes',matchingString)
print('Matching ("Yes") returns: ' + str(matches.group()))

matches = re.match('Y.',matchingString)
print('Matching ("Y.") returns: ' + str(matches.group()))

matches = re.match('Y.*',matchingString)
print('Matching ("Y.*") returns: ' + str(matches.group()))

matches = re.match('Y?.*',matchingString)
print('Matching ("Y?.*") returns: ' + str(matches.group()))

matches = re.match('Y.+',matchingString)
print('Matching ("Y.+") returns: ' + str(matches.group()))

Matching ("Y") returns: Y
Matching ("Ye") returns: Ye
Matching ("Yes") returns: Yes
Matching ("Y.") returns: Ye
Matching ("Y.*") returns: Yes, match
Matching ("Y?.*") returns: Yes, match
Matching ("Y.+") returns: Yes, match
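
All of the patterns above match at the start of the string. When the pattern is not found at the beginning, re.match returns None, and calling group() on None raises an AttributeError. A short sketch with a hypothetical non-matching string:

```python
import re

# This string contains "Yes", but not at the beginning
nonMatchingString = 'Oh Yes, match'

matches = re.match('Yes', nonMatchingString)
print(matches)  # None

# Guard against None before calling group()
if matches is None:
    outcome = 'No match at the beginning of the string'
else:
    outcome = matches.group()
print(outcome)
```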


### Using re.fullmatch

* Matches only if the entire string fits the specified format
* Returns a match object, or None if there is no match
* fullmatch is great for data validation or for checking a condition through a True/False test


In [348]:
# This example DOES NOT work because the string is not an exact match of the specified format
mystring = 'Sed porttitor 281-617-4840 urna a quam hendrerit, 817-281-4840 ornare quam blandit.'
matches = re.fullmatch(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)   
print(matches) 

None


In [349]:
# This example DOES work
mystring = '281-617-4840'
mymatches = re.fullmatch(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)   
print(mymatches) 
if mymatches: print('\nMatch Condition Passed')

<_sre.SRE_Match object; span=(0, 12), match='281-617-4840'>

Match Condition Passed


In [350]:
mystring = 'blob'
mymatches = re.fullmatch(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)   
print(mymatches) 
if mymatches: print('\nMatch Condition Passed')

None
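
The validation idea can be wrapped in a small helper. This is only a sketch: the name is_phone_number is made up for illustration, and `\d{3}` is used as shorthand for `\d\d\d`.

```python
import re

# Same phone format as above, written with {n} quantifiers
PHONE_PATTERN = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')

def is_phone_number(candidate):
    """Return True only if the whole string looks like a phone number."""
    return PHONE_PATTERN.fullmatch(candidate) is not None

print(is_phone_number('281-617-4840'))       # True
print(is_phone_number('281.617.4840'))       # True
print(is_phone_number('call 281-617-4840'))  # False
print(is_phone_number('blob'))               # False
```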


In [351]:
mystrings = [
    'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
    'Sed porttitor 281-617-4840 urna a quam hendrerit, 817-281-4840 ornare quam blandit.',
    'Praesent consectetur rutrum orci nec lacinia.']

### Using re.search

* Returns a match object, or None if the pattern is not found anywhere in the string
* Returns the first match only

In [352]:
mystring = 'Sed porttitor 281-617-4840 urna a quam hendrerit, 817-281-4840 ornare quam blandit.'
matches = re.search(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)   
print(matches)

<_sre.SRE_Match object; span=(14, 26), match='281-617-4840'>


### Using re.sub

In [353]:
mystring = '23093762 REF-Tang-Profile-AE'

mynewstring  = re.sub(r'23093762\s*','',mystring)

print(mynewstring)

REF-Tang-Profile-AE
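
re.sub can also reuse parts of the match in the replacement and limit how many substitutions are made. A minimal sketch, assuming a made-up string of phone numbers:

```python
import re

mystring = 'Call 281-617-4840 or 817-281-4840'  # hypothetical sample string

# \1, \2 and \3 in the replacement refer to the three parenthesized groups
reformatted = re.sub(r'(\d{3})-(\d{3})-(\d{4})', r'(\1) \2-\3', mystring)
print(reformatted)  # Call (281) 617-4840 or (817) 281-4840

# count=1 limits the substitution to the first occurrence
first_only = re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', mystring, count=1)
print(first_only)  # Call XXX-XXX-XXXX or 817-281-4840
```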


### Using re.subn

In [354]:
mystring = '23093762 REF-Tang-Profile-AE 23093762 REF-Tang-Thickness'
mynewstring  = re.subn(r'23093762\s*','',mystring)

print(mynewstring)

('REF-Tang-Profile-AE REF-Tang-Thickness', 2)


### Using re.split

In [355]:
mystring = '23093762 REF-Tang-Profile-AE 23093762 REF-Tang-Thickness'
mysplitlist = re.split(' ',mystring)
mysplitlist

['23093762', 'REF-Tang-Profile-AE', '23093762', 'REF-Tang-Thickness']
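
Splitting on a single space could be done with str.split as well. re.split becomes useful when the separator varies. A sketch, assuming a made-up string with mixed separators (the name REF-Tang-Width is hypothetical):

```python
import re

mystring = 'REF-Tang-Profile-AE,  REF-Tang-Thickness;REF-Tang-Width'

# Split on a comma or semicolon followed by any amount of whitespace
parts = re.split(r'[,;]\s*', mystring)
print(parts)  # ['REF-Tang-Profile-AE', 'REF-Tang-Thickness', 'REF-Tang-Width']
```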

### Using re.findall

* Returns a list of strings, not match objects
* Returns all matches

In [356]:
mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = re.findall(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)
    print('Matches returned: ')
    if not mymatches:
        print('Is Empty')
        print('\n')
    else:
        for mymatch in mymatches:
            print(mymatch)
        print('')       

Search String: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Matches returned: 
Is Empty


Search String: Sed porttitor 281-617-4840 urna a quam hendrerit, 817-281-4840 ornare quam blandit.
Matches returned: 
281-617-4840
817-281-4840

Search String: Praesent consectetur rutrum orci nec lacinia.
Matches returned: 
Is Empty
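
One detail worth knowing: when the pattern contains parenthesized groups, findall returns the groups rather than the whole match. A short sketch using one of the strings above:

```python
import re

mystring = 'Sed porttitor 281-617-4840 urna a quam hendrerit, 817-281-4840 ornare quam blandit.'

# No groups: each item is the whole matched string
whole = re.findall(r'\d{3}[-.]\d{3}[-.]\d{4}', mystring)
print(whole)  # ['281-617-4840', '817-281-4840']

# Three groups: each item is a tuple with one string per group
parts = re.findall(r'(\d{3})[-.](\d{3})[-.](\d{4})', mystring)
print(parts)  # [('281', '617', '4840'), ('817', '281', '4840')]
```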




### Using re.finditer

* Returns an iterator that yields a match object for each match found

In [357]:
mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = re.finditer(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d',mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')

Search String: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Matches returned: 

Search String: Sed porttitor 281-617-4840 urna a quam hendrerit, 817-281-4840 ornare quam blandit.
Matches returned: 
<_sre.SRE_Match object; span=(14, 26), match='281-617-4840'>
<_sre.SRE_Match object; span=(50, 62), match='817-281-4840'>

Search String: Praesent consectetur rutrum orci nec lacinia.
Matches returned: 



* You can extract items from the match object

In [358]:
# mymatch still holds the last match found in the loop above (the second number in mystrings[1])
# Calling the span start and end
print('span starts at: ' + str(mymatch.span()[0]))
print('span ends at: ' + str(mymatch.span()[1]))

# Using span to slice the string
print(mystrings[1][mymatch.span()[0]:mymatch.span()[1]])

span starts at: 50
span ends at: 62
817-281-4840
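
Slicing with span() works, but the match object also offers more direct accessors. A minimal sketch of group(), groups(), start() and end(), with groups added to the phone pattern for illustration:

```python
import re

mystring = 'Sed porttitor 281-617-4840 urna a quam hendrerit.'
mymatch = re.search(r'(\d{3})[-.](\d{3})[-.](\d{4})', mystring)

print(mymatch.group())    # whole match: 281-617-4840
print(mymatch.group(1))   # first group: 281
print(mymatch.groups())   # all groups: ('281', '617', '4840')
print(mymatch.start(), mymatch.end())  # 14 26
```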


## Patterns

The following is a list of special characters and their descriptions:

    "."      Matches any character except a newline.
    
    "^"      Matches the start of the string.
    
    "$"      Matches the end of the string or just before the newline at the end of the string.
    
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
             
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    
    *?,+?,?? Non-greedy versions of the previous three special characters.
    
    {m,n}    Matches from m to n repetitions of the preceding RE.
    
    {m,n}?   Non-greedy version of the above.
    
    "\\"     Either escapes special characters or signals a special sequence.
    
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
             
    "|"      A|B, creates an RE that will match either A or B.
    
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
                 
    (?:...)  Non-grouping version of regular parentheses.
    
    (?P<name>...) The substring matched by the group is accessible by name.
    
    (?P=name)     Matches the text matched earlier by the group named name.
    
    (?#...)  A comment; ignored.
    
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    
    (?!...)  Matches if ... doesn't match next.
    
    (?<=...) Matches if preceded by ... (must be fixed length).
    
    (?<!...) Matches if not preceded by ... (must be fixed length).
    
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.
    
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE.
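
A few of the entries above are easier to grasp with an example. The following sketch, using made-up sample strings, illustrates greedy versus non-greedy matching, named groups, and lookahead:

```python
import re

# Greedy vs non-greedy: ".+" grabs as much as possible, ".+?" as little
html = '<b>bold</b>'
greedy = re.findall(r'<.+>', html)
lazy = re.findall(r'<.+?>', html)
print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']

# Named groups: retrieve parts of the match by name
phone = re.search(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})', '281-617-4840')
print(phone.group('area'))  # 281

# Lookahead: word characters followed by "@", without consuming the "@"
user = re.findall(r'\w+(?=@)', 'samuelgonzalez@tamu.edu')
print(user)  # ['samuelgonzalez']
```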



In the following examples, different patterns are demonstrated using a **compiled** pattern object and its **finditer** method.

We start with the most basic pattern. The **"."** matches any single character (except a newline), so every character is returned as its own match

In [359]:
mystring = 'aa?K,%5w'

pattern = re.compile(r'.')

mymatches = pattern.finditer(mystring)

print('Matches returned: ')
for mymatch in mymatches : print(mymatch)
print('')

Matches returned: 
<_sre.SRE_Match object; span=(0, 1), match='a'>
<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='?'>
<_sre.SRE_Match object; span=(3, 4), match='K'>
<_sre.SRE_Match object; span=(4, 5), match=','>
<_sre.SRE_Match object; span=(5, 6), match='%'>
<_sre.SRE_Match object; span=(6, 7), match='5'>
<_sre.SRE_Match object; span=(7, 8), match='w'>



Here we use **"^"** to find strings that have the pattern at the **beginning**, and return the matched text as the match object

In [360]:
mystrings = [
    'https://www.youtube.com/watch?v=K8L6KVGG-7o',
    'youtube.com/watch?v=K8L6KVGG-7o'
]

In [361]:
pattern = re.compile(r'^you')

mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = pattern.finditer(mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')

Search String: https://www.youtube.com/watch?v=K8L6KVGG-7o
Matches returned: 

Search String: youtube.com/watch?v=K8L6KVGG-7o
Matches returned: 
<_sre.SRE_Match object; span=(0, 3), match='you'>



In [378]:
mystring = 'https://www.youtube.com/watch?v=K8L6KVGG-7o'

pattern = re.compile(r'.*')

mymatches = pattern.finditer(mystring)
for mymatch in mymatches: print(mymatch)

<_sre.SRE_Match object; span=(0, 43), match='https://www.youtube.com/watch?v=K8L6KVGG-7o'>
<_sre.SRE_Match object; span=(43, 43), match=''>
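
The second, empty match in the output appears because "*" allows zero repetitions: after consuming the whole string, the pattern still matches the empty string at the very end. A small sketch on a shorter made-up string, compared with "+", which requires at least one character:

```python
import re

mystring = 'abc'

# ".*" matches the whole string, then an empty string at the end
star_spans = [m.span() for m in re.finditer(r'.*', mystring)]
print(star_spans)  # [(0, 3), (3, 3)]

# ".+" needs at least one character, so there is no empty match
plus_spans = [m.span() for m in re.finditer(r'.+', mystring)]
print(plus_spans)  # [(0, 3)]
```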


Here we use **"$"** to find strings that have the pattern at the **end**, and return the matched text as the match object

In [380]:
mystrings = [
    'Past participle verbs end in "-ed" (i.e.) finished',
    'Present continuous verbs end in "-ing" (i.e) running'
]

In [381]:
pattern = re.compile(r'ing$')

mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = pattern.finditer(mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')

Search String: Past participle verbs end in "-ed" (i.e.) finished
Matches returned: 

Search String: Present continuous verbs end in "-ing" (i.e) running
Matches returned: 
<_sre.SRE_Match object; span=(49, 52), match='ing'>
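
As the pattern list notes, "$" also matches just before a newline at the end of the string. With the M flag (re.MULTILINE, not otherwise used in this training) it matches at the end of every line. A sketch with made-up multi-line text:

```python
import re

# "$" matches just before the trailing newline
trailing = re.search(r'ing$', 'running\n')
print(trailing.span())  # (4, 7)

lines = 'running\nfinished\njumping'

# Without MULTILINE only the end of the whole string counts
end_only = re.findall(r'ing$', lines)
print(end_only)  # ['ing']

# With MULTILINE the end of each line counts
per_line = re.findall(r'ing$', lines, re.MULTILINE)
print(per_line)  # ['ing', 'ing']
```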



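Additionally, we can put a pattern followed by **".*"** in front of the ending so that the returned match runs from that earlier pattern up to the **end** of the string. Adding **"^"** (as in the commented-out alternatives) anchors the start of the match to the **beginning** of the string.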

In [365]:
mystrings = [
    'Past participle verbs end in "-ed" (i.e.) finished',
    'Continuous verbs end in "-ing" (i.e) running'
]

# pattern = re.compile(r'^C.*ing$')
# pattern = re.compile(r'^o.*ing$')
pattern = re.compile(r'o.*ing$')



mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = pattern.finditer(mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')
    
mystrings[1][mymatch.span()[0]:mymatch.span()[1]]

Search String: Past participle verbs end in "-ed" (i.e.) finished
Matches returned: 

Search String: Continuous verbs end in "-ing" (i.e) running
Matches returned: 
<_sre.SRE_Match object; span=(1, 44), match='ontinuous verbs end in "-ing" (i.e) running'>



'ontinuous verbs end in "-ing" (i.e) running'

In [386]:
mystrings = [
    'Past participle verbs end in "-ed" (i.e.) finished',
    'Continuous verbs end in "-ing" (i.e) running',
    'ing'
]

# pattern = re.compile(r'^.+ing$')
pattern = re.compile(r'^.*ing$')


mymatches = []
for mystring in mystrings:
    print('Search String: ' + mystring)
    mymatches = pattern.finditer(mystring)
    print('Matches returned: ')
    for mymatch in mymatches:
        print(mymatch)
    print('')     

Search String: Past participle verbs end in "-ed" (i.e.) finished
Matches returned: 

Search String: Continuous verbs end in "-ing" (i.e) running
Matches returned: 
<_sre.SRE_Match object; span=(0, 44), match='Continuous verbs end in "-ing" (i.e) running'>

Search String: ing
Matches returned: 
<_sre.SRE_Match object; span=(0, 3), match='ing'>



* We use **"\"** to escape special characters like periods
* We use **"?"** to make the preceding character optional (zero or one occurrence)

Find instances of Mr.

In [367]:
mystring = 'Mr. Schafer, Mr Smith, Ms Davis, Mrs. Robinson, Mr. T'

In [368]:
# We want to find instances of "Mr."
pattern = re.compile(r'Mr\.')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<_sre.SRE_Match object; span=(0, 3), match='Mr.'>
<_sre.SRE_Match object; span=(48, 51), match='Mr.'>


In [369]:
# We want to get every Mr whether it has a period or not. Note that this also matches the "Mr" in "Mrs.":
pattern = re.compile(r'Mr\.?')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<_sre.SRE_Match object; span=(0, 3), match='Mr.'>
<_sre.SRE_Match object; span=(13, 15), match='Mr'>
<_sre.SRE_Match object; span=(33, 35), match='Mr'>
<_sre.SRE_Match object; span=(48, 51), match='Mr.'>
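
The match at span (33, 35) is the "Mr" inside "Mrs.". One possible way to exclude it, sketched here as an option rather than the only approach, is the negative lookahead (?!...) from the pattern list:

```python
import re

mystring = 'Mr. Schafer, Mr Smith, Ms Davis, Mrs. Robinson, Mr. T'

# "Mr" not followed by "s", then an optional period
pattern = re.compile(r'Mr(?!s)\.?')
spans = [m.span() for m in pattern.finditer(mystring)]
print(spans)  # [(0, 3), (13, 15), (48, 51)]
```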


#### Characters used for pattern construction

    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.

    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
                
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.

In [370]:
#Then we look for a space and a capital letter following the space
pattern = re.compile(r'Mr\.?\s[A-Z]') #Finds Mr. S, Mr S and Mr. T
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<_sre.SRE_Match object; span=(0, 5), match='Mr. S'>
<_sre.SRE_Match object; span=(13, 17), match='Mr S'>
<_sre.SRE_Match object; span=(48, 53), match='Mr. T'>


In [371]:
#Add the search for the complete name
#After we have found the prefix followed by period, space and first letter we don't know how many letters to find, so we use \w+

pattern = re.compile(r'Mr\.?\s[A-Z]\w+') #Finds Mr. Schafer and Mr Smith, but not Mr. T because there is no word character after T
# pattern = re.compile(r'Mr\.?\s\w+') # Also finds Mr. T, because the T itself satisfies \w+
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<_sre.SRE_Match object; span=(0, 11), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(13, 21), match='Mr Smith'>


In [372]:
#The asterisk quantifier matches zero or more of the preceding character

pattern = re.compile(r'Mr\.?\s[A-Z]\w*') #Finds Mr. Schafer, Mr Smith AND Mr. T (because the T in Mr. T is followed by zero word characters)
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<_sre.SRE_Match object; span=(0, 11), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(13, 21), match='Mr Smith'>
<_sre.SRE_Match object; span=(48, 53), match='Mr. T'>


In [373]:
#Add a group of alternatives

pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<_sre.SRE_Match object; span=(0, 11), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(13, 21), match='Mr Smith'>
<_sre.SRE_Match object; span=(23, 31), match='Ms Davis'>
<_sre.SRE_Match object; span=(33, 46), match='Mrs. Robinson'>
<_sre.SRE_Match object; span=(48, 53), match='Mr. T'>


In [374]:
# Easier to read, same result

pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<_sre.SRE_Match object; span=(0, 11), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(13, 21), match='Mr Smith'>
<_sre.SRE_Match object; span=(23, 31), match='Ms Davis'>
<_sre.SRE_Match object; span=(33, 46), match='Mrs. Robinson'>
<_sre.SRE_Match object; span=(48, 53), match='Mr. T'>
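
Parentheses do more than group alternatives: each pair also captures what it matched. A minimal sketch that adds a second group around the name, so the title and the name can be pulled out separately:

```python
import re

mystring = 'Mr. Schafer, Mr Smith, Ms Davis, Mrs. Robinson, Mr. T'

# Group 1 captures the title, group 2 captures the name
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(mystring)]
for title, name in pairs:
    print(title, '->', name)
```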


In [375]:
mystring = 'The email address samuelgonzalez@controlsdata.com belongs to Rolls-Royce Controls Data Services'

# Now we use a character set to find one or more lowercase or uppercase letters up to '@', then one or more letters up to '.', then 'com'

pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)

<_sre.SRE_Match object; span=(18, 49), match='samuelgonzalez@controlsdata.com'>


In [382]:
# We find one or more letters (or periods) up to '@', then one or more letters up to '.', then 'com' or 'edu'

mystring = 'The email address samuelgonzalez@tamu.edu belongs to Texas A&M'

pattern = re.compile(r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)') #Parentheses group the alternatives; square brackets would match single characters instead
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)   

<_sre.SRE_Match object; span=(18, 41), match='samuelgonzalez@tamu.edu'>
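
Square brackets and parentheses are easy to confuse when listing alternatives such as com and edu. Inside square brackets, "|" is just another character and the set matches one character at a time. A small sketch of the difference:

```python
import re

# Character set: any run of the characters c, o, m, |, e, d, u
char_class = re.compile(r'[com|edu]*')

# Group with alternation: exactly "com" or exactly "edu"
group = re.compile(r'(com|edu)')

class_results = [bool(char_class.fullmatch(s)) for s in ['com', 'moc', 'e|d']]
group_results = [bool(group.fullmatch(s)) for s in ['com', 'moc', 'e|d']]

print(class_results)  # [True, True, True]
print(group_results)  # [True, False, False]
```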


#### Quantifiers

Quantifiers:

*     \*     0 or more
*     \+     1 or more
*     ?     0 or 1
*     {3}   exact number of repetitions
*     {3,4} range of repetitions (here 3 to 4)

In [377]:
mystring = '281-817-4848'

pattern = re.compile(r'\d{3}[-]\d{3}[-]\d{4}')
mymatches = pattern.finditer(mystring)
for mymatch in mymatches : print(mymatch)      

<_sre.SRE_Match object; span=(0, 12), match='281-817-4848'>
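
The range form {m,n} is not shown above. A brief sketch, using made-up digit strings and fullmatch so that the whole string has to fit:

```python
import re

candidates = ['12', '123', '1234', '12345']

# Three or four digits, nothing more and nothing less
pattern = re.compile(r'\d{3,4}')
results = [bool(pattern.fullmatch(c)) for c in candidates]
print(results)  # [False, True, True, False]
```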
