This tutorial is meant to help with HW5: regular expressions practice with Shakespeare.

## Lookbehinds and Lookaheads

In many cases, you want to match some specific text, but only when it comes **right before** another pattern or **right after** another pattern. Lookaheads and lookbehinds let us do this. For instance, we can use them to match multi-line dialogue:

### Task: Get all of Yu Chen's Dialogue

In [42]:
dialogue = '''
YUCHEN: You can reach me at ychen220@marshall.usc.edu.
If you cannot find me there; you can try my other email.
You can also call my assistant Todd.
MIKE: Sounds good. Well then,
it's been a pleasure.
YUCHEN: Nice.
MIKE: Good talk.
JIMMY: Yeah, it's been a great experience all around.
YUCHEN: Okay, good bye.
'''

import re

My first attempt uses only what we have already learned. The `[A-Za-z0-9 ;@\.,]+` says we want to match any letter or digit (`A-Za-z0-9`), a space (` `), or a semicolon, @, period, or comma (`;@\.,`), one or more times. The shorthand `\w`, used in the later patterns, matches these letters and digits as well as the underscore.

In [49]:
re.findall(r'YUCHEN: ([A-Za-z0-9 ;@\.,]+)', dialogue)

['You can reach me at ychen220@marshall.usc.edu.', 'Nice.', 'Okay, good bye.']

Notice that right now, this regular expression captures almost all of my dialogue, but not everything. It misses `If you cannot find me there; you can try my other email.` and `You can also call my assistant Todd.` on the second and third lines, since it stops at the first character that is not in the brackets, and a new line is not one of them. I can add `\n` to my brackets so it knows to match beyond a new line:

In [50]:
re.findall(r'YUCHEN: ([\w @;\n\.,]+)', dialogue)

['You can reach me at ychen220@marshall.usc.edu.\nIf you cannot find me there; you can try my other email.\nYou can also call my assistant Todd.\nMIKE',
 'Nice.\nMIKE',
 'Okay, good bye.\n']

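Notice that the problem now is that `MIKE`, the name of the next speaker, gets dragged into the match. The regex does not know that `MIKE: Sounds good.` is another person speaking; it keeps going as if the text were still part of `YUCHEN`'s dialogue, and stops only at the colon, which is not in our brackets. To fix this, we need a way to say where a match must start and stop without including those markers in the match itself: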

In [51]:
re.findall(r'(?<=YUCHEN: )([\w @;\n\.,]+)(?=\nMIKE: )', dialogue)

['You can reach me at ychen220@marshall.usc.edu.\nIf you cannot find me there; you can try my other email.\nYou can also call my assistant Todd.',
 'Nice.']

The `(?<=YUCHEN: )` is called a **positive lookbehind**, and it states that whatever we match must come immediately after (but not include) `YUCHEN: `. The `(?=\nMIKE: )` is a **positive lookahead**, and it states that whatever we match must be immediately followed by (but not include) `\nMIKE: ` (a new line, then Mike beginning to speak).
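
To make the "but not include" part concrete, here is a small sketch on a two-line string made up for illustration (it is not the full dialogue). It compares a plain pattern, which consumes the speaker labels, with the lookaround version, which only checks that they are there:

```python
import re

# Two-line sample invented for this illustration.
text = "YUCHEN: Nice.\nMIKE: Good talk."

# Plain pattern: the speaker labels are consumed and show up in the match.
plain = re.search(r'YUCHEN: Nice\.\nMIKE: ', text)

# Lookarounds: the labels must be present, but they are only checked,
# not consumed, so the match is just the text in between.
around = re.search(r'(?<=YUCHEN: )Nice\.(?=\nMIKE: )', text)

print(repr(plain.group()))   # 'YUCHEN: Nice.\nMIKE: '
print(repr(around.group()))  # 'Nice.'
print(around.span())         # (8, 13)
```

The span starts at index 8 because the eight characters of `YUCHEN: ` sit in front of the match without being part of it.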

The lookbehind/lookahead pattern is close, but we are still missing the final line in the dialogue (`Okay, good bye.`). It no longer matches, since it is not followed by `\nMIKE: `. We can add an alternative to our positive lookahead to account for the end of the text: `(?=\nMIKE: |$)`. This states that our match must be followed either by `\nMIKE: ` or by the end of the string:

In [56]:
re.findall(r'(?<=YUCHEN: )([\w @;\n\.,]+)(?=\nMIKE: |$)', dialogue)

['You can reach me at ychen220@marshall.usc.edu.\nIf you cannot find me there; you can try my other email.\nYou can also call my assistant Todd.',
 'Nice.',
 'Okay, good bye.\n']
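
The pattern above works here because `MIKE` happens to speak after every one of these speeches. As a rough sketch of how the same idea might generalize, the lookahead can accept any speaker label instead of only `MIKE`. The helper name `lines_for`, the shortened dialogue, the apostrophe added to the brackets, and the assumption that every label is all capitals followed by `: ` are all illustrative choices, not part of the assignment:

```python
import re

# Shortened dialogue, invented for this illustration.
dialogue = '''
YUCHEN: You can also call
my assistant Todd.
MIKE: Sounds good.
YUCHEN: Nice.
JIMMY: Yeah, it's been great.
YUCHEN: Okay, good bye.
'''

def lines_for(speaker, text):
    """Hypothetical helper: return each speech by `speaker` as a string."""
    pattern = (
        rf"(?<={re.escape(speaker)}: )"   # preceded by "SPEAKER: "
        r"[\w @;\n.,']+"                  # the tutorial's brackets, plus an apostrophe
        r"(?=\n[A-Z]+: |$)"               # followed by any all-caps label, or the end
    )
    # strip() removes the trailing newline that the final speech picks up.
    return [speech.strip() for speech in re.findall(pattern, text)]

print(lines_for("YUCHEN", dialogue))
# ['You can also call\nmy assistant Todd.', 'Nice.', 'Okay, good bye.']
print(lines_for("JIMMY", dialogue))
# ["Yeah, it's been great."]
```

A speech whose continuation line itself starts with capitals and a colon would be cut short by this lookahead, so treat it as a starting point rather than a complete parser.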

## Greedy versus Lazy Evaluation

By default, regular expressions try to match as much as possible; this is called **greedy evaluation**. Sometimes, we want to match as little as possible; this is called **lazy evaluation**. We can do this by putting a `?` after our quantifier (remember, a quantifier is `+`, `*`, or `{3,5}`: anything that tells regex how many times to match something). For instance:

In [59]:
sentence = "Hello"
import re

re.findall(r"H.*l", sentence)   # greedy: matches 'Hell' (runs to the last l)
re.findall(r"H.*?l", sentence)  # lazy: matches 'Hel' (stops at the first l)

['Hel']
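
The same `?` works after any quantifier. A minimal sketch, using a made-up string of four `a` characters and `re.match` (which only tries to match at the start of the string) so that each result is a single string:

```python
import re

word = "aaaa"  # sample string invented for this illustration

# Greedy quantifiers take as many characters as they are allowed.
greedy_plus = re.match(r'a+', word).group()
greedy_range = re.match(r'a{2,4}', word).group()

# Adding ? makes each quantifier lazy: it takes as few as it is allowed.
lazy_plus = re.match(r'a+?', word).group()
lazy_star = re.match(r'a*?', word).group()
lazy_range = re.match(r'a{2,4}?', word).group()

print(greedy_plus, greedy_range)         # aaaa aaaa
print(repr(lazy_plus), repr(lazy_star))  # 'a' ''
print(lazy_range)                        # aa
```

Note that `a*?` on its own matches the empty string, since zero repetitions is the minimum it is allowed. Lazy quantifiers are mostly useful when something after them, such as the `l` in `H.*?l`, forces the match to grow.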

If we apply lazy evaluation to our original dialogue pattern, `(?<=YUCHEN: )([\w @;\n\.,]+?)(?=\nMIKE: |$)`, we notice that the last match no longer has the `\n` attached to it. A lazy quantifier matches only the minimum text sufficient, and the `\n` is not necessary to complete the match: in Python, `$` also matches just before a newline at the very end of the string.

In [63]:
re.findall(r'(?<=YUCHEN: )([\w @;\n\.,]+?)(?=\nMIKE: |$)', dialogue) # notice the ? after the plus sign

['You can reach me at ychen220@marshall.usc.edu.\nIf you cannot find me there; you can try my other email.\nYou can also call my assistant Todd.',
 'Nice.',
 'Okay, good bye.']
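
The dropped `\n` depends on how `$` behaves in Python. The sketch below isolates that behavior on the last line of the dialogue alone, with a simplified set of brackets chosen for illustration:

```python
import re

text = "Okay, good bye.\n"

# Without flags, $ matches at the very end of the string and also
# just before a newline that is the final character.
dollar_positions = [m.start() for m in re.finditer(r'$', text)]

greedy = re.search(r'[\w ,.\n]+$', text).group()   # runs to the true end
lazy = re.search(r'[\w ,.\n]+?$', text).group()    # stops at the first place $ matches

print(dollar_positions)  # [15, 16]
print(repr(greedy))      # 'Okay, good bye.\n'
print(repr(lazy))        # 'Okay, good bye.'
```

Both versions satisfy `$`; they differ only in which of the two valid end positions they reach first.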

Another example: let's say we are trying to find all HTML tags on a particular web page:

In [66]:
html = '''
<HTML>
        <HEAD>
            <TITLE>Your Title Here</TITLE>
        </HEAD>
    <a href="http://somegreatsite.com">Link Name</a>
    <H1>This is a Header</H1>
    <H2>This is a Medium Header</H2>
<</HTML>
'''

In [67]:
import re

re.findall(r'<.+>', html)

['<HTML>',
 '<HEAD>',
 '<TITLE>Your Title Here</TITLE>',
 '</HEAD>',
 '<a href="http://somegreatsite.com">Link Name</a>',
 '<H1>This is a Header</H1>',
 '<H2>This is a Medium Header</H2>',
 '<</HTML>']

This drags in everything between the tags, including the text `Your Title Here` and `This is a Header`, which we don't want. It happens because the `.+` in `<.+>` is greedy: it matches as much of the line as possible, up to the last `>`. (It stops at the end of each line because `.` does not match a newline by default.) Using a lazy quantifier, we can get the regex to match only the minimum necessary:

In [68]:
re.findall(r'<.+?>', html)

['<HTML>',
 '<HEAD>',
 '<TITLE>',
 '</TITLE>',
 '</HEAD>',
 '<a href="http://somegreatsite.com">',
 '</a>',
 '<H1>',
 '</H1>',
 '<H2>',
 '</H2>',
 '<</HTML>']
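
As a side note, a lazy quantifier is not the only way to get this result. The sketch below compares it with a different technique, a negated character class (`[^>]` means "any character except `>`"), on one line of the sample. This is offered as an alternative to try, not as the approach the homework expects:

```python
import re

line = '<H1>This is a Header</H1>'

greedy = re.findall(r'<.+>', line)       # one long match
lazy = re.findall(r'<.+?>', line)        # each tag separately
negated = re.findall(r'<[^>]+>', line)   # same result without a lazy quantifier

# Excluding "<" as well skips the stray "<" in the sample's last line.
stray = re.findall(r'<[^<>]+>', '<</HTML>')

print(greedy)   # ['<H1>This is a Header</H1>']
print(lazy)     # ['<H1>', '</H1>']
print(negated)  # ['<H1>', '</H1>']
print(stray)    # ['</HTML>']
```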