# Regex tutorial

You write regular expressions (regex) to match patterns in strings.  When you are processing text, you may want to extract a **substring** of some predictable structure: a phone number, an email address, or something more specific to your research or task.  You may also want to clean your text of some kind of junk: maybe there are repetitive formatting errors due to some transcription process that you need to remove.
        
In these cases and in many others like them, writing the right regex will be better than working by hand or using a magical third-party library/software that claims to do what you want.

Please refer back to the slides to see the building blocks of regex.

## Character classes

- Used to match any one of a specific set of characters
- Defined using the **[** and **]** metacharacters
- Within a character class, **^** and **-** can have special meaning (complement and range), depending on their position in the class
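
To make the position-dependent meaning of `^` and `-` concrete, here is a minimal sketch. The string `sample` and the variable names are made up for illustration and are not part of the tutorial's data.

```python
import re

sample = "A-B^C a-z 42"  # hypothetical test string

# '^' as the FIRST character negates the class: anything that is not A, B, or C.
not_abc = re.findall(r'[^ABC]', "ABCD")

# '^' anywhere else in the class is just a literal caret.
with_caret = re.findall(r'[ABC^]', sample)

# '-' between two characters defines a range (here: a through z).
lower_range = re.findall(r'[a-z]', sample)

# '-' as the first (or last) character is a literal hyphen.
literal_hyphen = re.findall(r'[-az]', sample)

print(not_abc)         # ['D']
print(with_caret)      # ['A', 'B', '^', 'C']
print(lower_range)     # ['a', 'z']
print(literal_hyphen)  # ['-', 'a', '-', 'z']
```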


In [None]:
import re #the regex module in the python standard library

#strings to be searched for matching regex patterns
str1 = "Aardvarks belong to the Captain"
str2 = "Albert's famous equation, E = mc^2."
str3 = "Located at 455 Serra Mall."
str4 = "Beware of the shape-shifters!"
test_strings = [str1, str2, str3, str4] #created a list of strings

In [None]:
for test_string in test_strings:
    print('The test string is "' + test_string + '"')
    match = re.search(r'[a-z]', test_string)
    if match:
        print('The first possible match is: ' + match.group())
    else:
        print('no match.')

Let's go through the code above line by line:

    for test_string in test_strings:

`test_strings` is a list, and so it is iterable in a for loop.  Every element in this list is a string.  So for the rest of the for loop, we will be referring to the current element as `test_string`

    print('The test string is "' + test_string + '"')
    
This just prints out the current object we're iterating over

    match = re.search(r'[a-z]', test_string)

Remember the basic approach to using regex in Python.  You give a searcher (in this case, the function `re.search()`) a pattern and a string in which to find matches.  That's exactly what this line does.  `re.search()` returns either a match object (type `re.Match` in current Python 3; older versions display it as `SRE_Match`) or `None`.
    
    if match:
        print('The first possible match is: ' + match.group())
    else:
        print('no match.')

`match` holds one of two things: a match object or `None`.  `None` evaluates as false in a logical test, while a match object evaluates as true.  In this `if` statement, we've basically told the Python interpreter to check whether `match` is `None` or not.  If it isn't, we print a string plus `match.group()`.  `group()` is a method that match objects have.  By default, it returns the 0th group; we'll get to what that means later.  For now, just know that it will return the substring that matched the pattern.

Note that since we are using `re.search`, only a single character is returned.  That's because of the following:

1. We only defined a single character pattern and
2. `re.search` finds the first possible match and then doesn't look for any more.

If you wanted to find **all** of the possible matches in a string, you can use `re.findall()`, which will return a list of all matches:

In [None]:
for string in test_strings:
    print(string)
    print(re.findall(r'[A-Z]', string))
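
To see the two functions side by side, here is a small sketch on a made-up string (`text` is a hypothetical example, not one of the tutorial's test strings). It also shows what each function gives back when nothing matches.

```python
import re

text = "Call Me Ishmael"  # hypothetical example string

first = re.search(r'[A-Z]', text)        # a match object, or None
everything = re.findall(r'[A-Z]', text)  # a list of strings, possibly empty

print(first.group())  # C
print(everything)     # ['C', 'M', 'I']

# With no uppercase letters, search gives None and findall gives an empty list.
no_first = re.search(r'[A-Z]', "lowercase only")
no_everything = re.findall(r'[A-Z]', "lowercase only")
print(no_first)       # None
print(no_everything)  # []
```

Both "no match" results evaluate as false in an `if` test, so the same kind of check works for either function.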

You can also compile your regex ahead of time with `re.compile()`.  This creates compiled pattern objects (type `re.Pattern` in current Python 3; older versions display it as `SRE_Pattern`).  Compiling once can save time when the same pattern is reused many times.  Additionally, you can create lists of these objects and iterate over both strings and patterns more easily.  Here's an example:

In [None]:
patterns = [re.compile(r'[ABC]'),
re.compile(r'[^ABC]'),
re.compile(r'[ABC^]'),
re.compile(r'[0123456789]'),
re.compile(r'[0-9]'),
re.compile(r'[0-4]'),
re.compile(r'[A-Z]'),
re.compile(r'[A-Za-z]'),
re.compile(r'[A-Za-z0-9]'),
re.compile(r'[-a-z]'),
re.compile(r'[- a-z]')]

def find_match(pattern, string):
    match = re.search(pattern, string)
    if match:
        return match.group()
    else:
        return 'no match.'
    
for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns]
  
    for pattern in patterns:
        print('The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns.index(pattern)])

Let's go over this code line by line: 

    patterns = [re.compile(r'[ABC]'),
    re.compile(r'[^ABC]'),
    re.compile(r'[ABC^]'),
    re.compile(r'[0123456789]'),
    re.compile(r'[0-9]'),
    re.compile(r'[0-4]'),
    re.compile(r'[A-Z]'),
    re.compile(r'[A-Za-z]'),
    re.compile(r'[A-Za-z0-9]'),
    re.compile(r'[-a-z]'),
    re.compile(r'[- a-z]')]

This creates a list of compiled pattern objects.
    
    def find_match(pattern, string):
        match = re.search(pattern, string)
        if match:
            return match.group()
        else:
            return 'no match.'

I defined a function `find_match` that takes two arguments, `pattern` and `string`.  Notice that this function is very similar to the `if`/`else` test from the code above.  Note also that this function returns either `match.group()` or the string `'no match.'`.
        
    for test_string in test_strings:
        matches = [find_match(pattern, test_string) for pattern in patterns]

By defining the `find_match()` function above, I can then call it from within a list comprehension.  In words, for each string `test_string` that is in `test_strings`, I want to compare against the list of patterns and return matches.  The resulting list of `matches` should be the same length as `patterns`; one match per pattern tested.  

        for pattern in patterns:
            print('The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns.index(pattern)])

Because I wanted to print some diagnostic output, I need to iterate over each `pattern` in `patterns` (a list and thus iterable) and print it out, along with the test string.  If you want to get the pattern out of a compiled pattern object, you can read its `.pattern` attribute, which holds the regex pattern as a string.  Since we are nesting this loop within the bigger loop above, this loop will go over every pattern in the `patterns` list for each string, and then repeat for the next string in the list `test_strings`.

However, note that I am dynamically referring to the index of the `matches` list.  By this, I mean the following code:

    matches[patterns.index(pattern)]

Make sure this makes sense to you.  Remember, `matches` and `patterns` are the same length.  That means that if I want to return the match that corresponds to the current pattern, I have to look up the match at the same index as the current pattern.  Every list has an `.index()` method, which returns the position of the first element equal to the argument you pass in.  So if I wanted to know *where* in `patterns` the regex `r'[^ABC]'` was, I could use `patterns.index(re.compile(r'[^ABC]'))`.  This will return an `int`, which corresponds to the position of `r'[^ABC]'` in `patterns`.

In [None]:
print(patterns.index(re.compile(r'[^ABC]')))
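
Looking up positions with `.index()` is one way to keep two lists in step.  As an alternative sketch (not the approach used in this tutorial), the built-in `zip` pairs the elements of two equal-length lists directly.  The names `demo_patterns`, `demo_string`, and `first_match` below are hypothetical, shortened stand-ins for the tutorial's objects.

```python
import re

# Hypothetical, shortened stand-ins for `patterns` and one test string.
demo_patterns = [re.compile(r'[ABC]'), re.compile(r'[0-9]')]
demo_string = "Located at 455 Serra Mall."

def first_match(pattern, string):
    """Return the first match as a string, or 'no match.' if there is none."""
    match = pattern.search(string)
    return match.group() if match else 'no match.'

demo_matches = [first_match(p, demo_string) for p in demo_patterns]

# zip walks both lists together, so no index lookup is needed.
for pattern, found in zip(demo_patterns, demo_matches):
    print(pattern.pattern, '->', found)
```

Because `.index()` returns the position of the *first* equal element, `zip` also behaves more predictably if the same pattern appears twice in the list.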

## Pre-defined character classes: shorthand

In [None]:
patterns2 = [re.compile(r'.'),
re.compile(r'\w'),
re.compile(r'\W'),
re.compile(r'\d'),
re.compile(r'\D'),
re.compile(r'\n'),
re.compile(r'\r'),
re.compile(r'\t'),
re.compile(r'\f'),
re.compile(r'\s')]

test_strings.append('Aardvarks belong to the Captain, capt_hook')

for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns2]

    for pattern in patterns2:
        print('The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns2.index(pattern)])
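
As a compact illustration of the shorthand classes: `\d` matches a digit, `\w` a word character (letters, digits, and the underscore), `\s` a whitespace character, and the uppercase versions match the opposite sets.  The string `sample` below is made up for this sketch.

```python
import re

sample = "capt_hook sails at 9 pm\t(maybe)"  # hypothetical string; \t is a tab

digits = re.findall(r'\d', sample)      # digit characters
spaces = re.findall(r'\s', sample)      # whitespace, including the tab
non_words = re.findall(r'\W', sample)   # everything that is NOT a word character
words = re.findall(r'\w+', sample)      # runs of word characters

print(digits)     # ['9']
print(spaces)     # [' ', ' ', ' ', ' ', '\t']
print(non_words)  # [' ', ' ', ' ', ' ', '\t', '(', ')']
print(words)      # ['capt_hook', 'sails', 'at', '9', 'pm', 'maybe']
```

Note that `capt_hook` stays in one piece because the underscore counts as a word character.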

## Matching character sequences


In [None]:
test_strings2 = ["The Aardvarks belong to the Captain.",
                 "Bitter butter won't make the batter better.",
                 "Hark, the pitter patter of little feet!"]

patterns3 = [re.compile(r'Aa'),
re.compile(r'[Aa][Aa]'),
re.compile(r'[aeiou][aeiou]'),
re.compile(r'[AaEeIiOoUu][aeiou]'),
re.compile(r'[Tt]he'),
re.compile(r'^[Tt]he'),
re.compile(r'n.'),
re.compile(r'n.$'),
re.compile(r'\W\w'),
re.compile(r'\w[aeiou]tter'),
re.compile(r'\w[aeiou]tter'),
re.compile(r'..tt..')]

for test_string in test_strings2:
    matches = [find_match(pattern, test_string) for pattern in patterns3]

    for pattern in patterns3:
        print('The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns3.index(pattern)])
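
Several of the patterns above differ only by an anchor: `^` pins the match to the start of the string and `$` pins it to the end.  A minimal sketch using the first test string (the variable names are chosen for this example only):

```python
import re

sentence = "The Aardvarks belong to the Captain."

anywhere = re.findall(r'[Tt]he', sentence)      # "the" anywhere in the string
at_start = re.findall(r'^[Tt]he', sentence)     # only at the very beginning
first_n = re.search(r'n.', sentence).group()    # first "n" plus any one character
last_n = re.search(r'n.$', sentence).group()    # "n" plus one character at the end

print(anywhere)  # ['The', 'the']
print(at_start)  # ['The']
print(first_n)   # ng
print(last_n)    # n.
```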

## Matching character sequences: finding all matches

In [None]:
def find_all_matches(pattern, string):
    matches = re.findall(pattern, string)
    if matches:
        return matches
    else:
        return None

for test_string in test_strings2:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns3]
    
    for pattern in patterns3:
        if matches[patterns3.index(pattern)]:
            print('All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns3.index(pattern)]))
        else:
            print('There were no matches for "' + pattern.pattern + '" in "' + test_string + '".')


We have a new function and some new code.  Let's go over it:

First, I wrote a function called `find_all_matches`:
    
    def find_all_matches(pattern, string):
        matches = re.findall(pattern, string)
        if matches:
            return matches
        else:
            return None

There are only two differences between `find_match` and `find_all_matches`.  First, `find_all_matches` uses `re.findall`, not `re.search`, so `matches` is a list of all possible matches.  Second, instead of returning a single string in either condition, `find_all_matches` returns either a list of strings or `None`.

    for test_string in test_strings2:
        matches = [find_all_matches(pattern, test_string) for pattern in patterns3]
        
        for pattern in patterns3:
            if matches[patterns3.index(pattern)]:

Remember the use of `.index()` from the previous code walkthrough.  Also, remember that `None` evaluates as false in a logical condition test.  In this `if` statement, I'm testing to see if there were any matches for the current pattern in the loop.  If there were any matches, the code will execute the next line.  Otherwise, it will go to the `else` block.
                
                print('All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns3.index(pattern)]))
        
If `matches` at the index of the current pattern is not `None`, it will be a list of strings.  Because I'm printing these results, I wanted to nicely format them for diagnostic purposes.  So we use the standard list-to-string Python expression `separator.join(list)`.  In this case, I wanted the results to be comma-separated, so the separator is `', '`.
        
            else:
                print('There were no matches for "' + pattern.pattern + '" in "' + test_string + '".')
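
The effect of the separator in `join` is easiest to see on a small list.  This sketch reuses one of the test strings and one of the patterns from above; the variable names are chosen for illustration.

```python
import re

found = re.findall(r'\w[aeiou]tter', "Bitter butter won't make the batter better.")

glued = ''.join(found)         # empty separator: the strings are run together
comma_sep = ', '.join(found)   # comma plus space between the strings

print(found)      # ['Bitter', 'butter', 'batter', 'better']
print(glued)      # Bitterbutterbatterbetter
print(comma_sep)  # Bitter, butter, batter, better
```

`join` only works on a list of strings, which is why the `None` case has to be handled separately in the `else` block.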


## Quantification and grouping

In [None]:
test_strings3 = ['Now Mr. N said, "Nooooooo!"',
                 'Then she told him he had to be quiet.']

patterns4 = [re.compile(r'No*'),
re.compile(r'No+'),
re.compile(r'No?'),
re.compile(r'No{7}'),
re.compile(r's?he'),
re.compile(r'(she|he)')]

for test_string in test_strings3:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns4]
    
    for pattern in patterns4:
        if matches[patterns4.index(pattern)]:
            print('All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns4.index(pattern)]))
        else:
            print('There were no matches for "' + pattern.pattern + '" in "' + test_string + '".')
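
The quantifiers are easier to compare when their results are lined up for the same string.  In this sketch the string is rebuilt with `'o' * 7` so the number of o's is explicit; the names `shout` and `results` are assumptions made for this example.

```python
import re

# Same idea as the first test string, with exactly seven o's in the shout.
shout = 'Now Mr. N said, "N' + 'o' * 7 + '!"'

results = {
    'No*': re.findall(r'No*', shout),       # N followed by zero or more o's
    'No+': re.findall(r'No+', shout),       # N followed by one or more o's
    'No?': re.findall(r'No?', shout),       # N followed by zero or one o
    'No{7}': re.findall(r'No{7}', shout),   # N followed by exactly seven o's
}

for regex, found in results.items():
    print(regex, found)
```

Here `*` and `?` both pick up the lone `N`, while `+` and `{7}` skip it because they require at least one `o`.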

## Capturing groups

In Python, match objects have `.groups()` and `.group()` methods.  These correspond to the capturing groups established in the regex, if you chose to indicate groups.  Group 0 is always the entire match to the whole regex, and `.group()` with no argument returns it.  To access the result for a capturing group, you pass the capturing group index to the `.group()` method; `.groups()` returns all the capturing groups (from 1 up) as a tuple.

In [None]:
test_strings4 = ['The benefit is being held for Mr. Kite and Mr. Henderson.',
                 'Tickets cost $5.00 for adults, $3.50 for children.',
                 'Over 9000 attendees are expected, up from 900 attendees last year.',
                 'Over 9,123,456 attendees are expected, up from 900 attendees last year.']

patterns5 = [re.compile(r'Mr\. (\w+)'),
re.compile(r'\$(\d+\.\d\d)'),
re.compile(r'(\d+) attendees'),
re.compile(r'((\d+,)*\d+) attendees')]

In [None]:
# simple example
print('string=', test_strings4[3])
print('pattern', patterns5[3].pattern)
matches = re.search(patterns5[3], test_strings4[3])
print('groups:',matches.groups())
print('Group 0: ' + matches.group(0))
print('Group 1: ' + matches.group(1))
print('Group 2: ' + matches.group(2))
#print('Group 3: ' + matches.group(3)) # what happens if you uncomment this?

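This example searched for `r'((\d+,)*\d+) attendees'` in the string `'Over 9,123,456 attendees are expected, up from 900 attendees last year.'`  There are two groups, one nested inside the other.  Groups are numbered by the position of their opening parenthesis, from left to right, so the outer group comes first.  This is why Group 1 is `9,123,456` and Group 2 is `123,`: the inner group is repeated by `*`, and a repeated group keeps only its last repetition.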
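
One detail worth seeing in isolation is what a group returns when it takes no part in the match.  This sketch runs the same pattern against a shortened, made-up string that contains a number without commas.

```python
import re

pattern = re.compile(r'((\d+,)*\d+) attendees')

# "900" has no commas, so the inner group (\d+,) never matches anything.
match = pattern.search('up from 900 attendees last year.')

outer = match.group(1)
inner = match.group(2)

print(outer)           # 900
print(inner)           # None
print(match.groups())  # ('900', None)
```

Since a non-participating group is `None` rather than a string, concatenating it with `+` raises a `TypeError`; wrapping the value in `str()` before printing avoids that.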

In [None]:
for test_string in test_strings4:
    for pattern in patterns5:
        for result in re.finditer(pattern, test_string):
            for i in range(pattern.groups+1):
                
                print('In "' + test_string + '", '  + 'given pattern "' + pattern.pattern + '", the group ' +str(i)+ ' match is ' + str(result.group(i)))

Before we go over this code block, let's establish the purpose of the code.  I wanted to return all the matches for each group.  But there are a few concerns:

1. The *number* of groups is different for each pattern.  So I can't hardcode the number of times to loop over.  In other words, the number of times my loop should iterate has to be *dynamically* assigned, conditioning on *which regex pattern* is the comparison regex in the loop.
2. `re.findall` returns a list of matches.  With no groups or a single group it is a list of strings; with two or more groups it is a list of tuples, where each tuple holds one string per capturing group.

In [None]:
matches = re.findall(patterns5[3], test_strings4[3])
matches

You can pull a single group out of this list of tuples by indexing twice: the first index selects the match (a tuple) and the second selects the group within it:

In [None]:
matches[0][0]
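
How the shape of the `re.findall` result depends on the number of capturing groups can be sketched with one of the test strings.  The third pattern, with two groups, is a variation written for this example and is not one of the tutorial's patterns.

```python
import re

text = 'Tickets cost $5.00 for adults, $3.50 for children.'

no_groups = re.findall(r'\$\d+\.\d\d', text)       # whole matches
one_group = re.findall(r'\$(\d+\.\d\d)', text)     # just the group's text
two_groups = re.findall(r'\$(\d+)\.(\d\d)', text)  # one tuple per match

print(no_groups)   # ['$5.00', '$3.50']
print(one_group)   # ['5.00', '3.50']
print(two_groups)  # [('5', '00'), ('3', '50')]
```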

But there are other ways of constructing this kind of loop.

    for test_string in test_strings4:
        for pattern in patterns5:
            for result in re.finditer(pattern, test_string):
                
`re.finditer` returns an iterator, which may be a new Python concept to you.  This loop means that for every pattern and for each string we're testing, instead of creating a list of matches, we're going to create an iterator object that hands us the match objects one at a time.
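
A minimal sketch of how such an iterator behaves, using one of the test strings (the names `results` and `numbers` are chosen for this example):

```python
import re

text = 'Over 9000 attendees are expected, up from 900 attendees last year.'
results = re.finditer(r'(\d+) attendees', text)

numbers = []
for result in results:               # each result is a full match object
    numbers.append(result.group(1))

leftover = list(results)             # the iterator has already been used up

print(numbers)   # ['9000', '900']
print(leftover)  # []
```

Unlike the strings or tuples in a `re.findall` list, each item here is a match object, so `.group()` and `.groups()` are available for every match.  An iterator can only be walked once, which is why the second pass finds nothing.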
                
                for i in range(pattern.groups+1):

The `.groups` attribute of a compiled pattern holds the number of capturing groups in the regular expression.  `range` produces a sequence of integers from a start value up to (but not including) a stop value, advancing by a step value.  If you just give it a single int, it treats that value as the stop value and starts from 0.  We add 1 because the end point is omitted by `range`: if we want to return *all* the groups, including group 0, we have to add that end point back.
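
The off-by-one reasoning can be checked directly.  This sketch compiles one of the patterns from above and compares the two ranges (the variable names are chosen for this example):

```python
import re

pattern = re.compile(r'((\d+,)*\d+) attendees')

group_count = pattern.groups                    # number of capturing groups
without_extra = list(range(group_count))        # stops before the last group
with_extra = list(range(group_count + 1))       # covers group 0 through the last group

print(group_count)    # 2
print(without_extra)  # [0, 1]
print(with_extra)     # [0, 1, 2]
```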

                    print('In "' + test_string + '", '  + 'given pattern "' + pattern.pattern + '", the group ' +str(i)+ ' match is ' + str(result.group(i)))

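Because `i` runs from 0 up to the number of capturing groups, we can use `i` as the index of the group we'd like to return from `result`, the current match object produced by the iterator.  That's why we can call `result.group(i)`.  We wrap it in `str()` because a group that took no part in the match returns `None`, which can't be concatenated to a string.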

In no way was this the *only* way to accomplish this task!  I wanted to show you a few different functions in this tutorial, as well as introduce you to more examples where typical "Pythonic" code constructions are useful, such as list comprehensions and `join`.  There are many ways of replicating all of these diagnostic printout examples.

## Final exercise

Let's see how much you've learned.  We're going to give you three strings that have a phone number in them.  Your job is to write a regex that will return the telephone numbers in each string.

In [None]:
phone_strings = ['Call Empire Carpets at 588-2300',
'Does Jenny live at 867 5309?',
'You can reach Mr. Plow at 636-555-3226']