## What is Regular Expression?

Working with text is HARD. Do we consider "dog" to be different from "dogs"? How do we explain to a computer if we want those words to be considered the same word? In written language, most rules are not really rules, but patterns within text. Regular expression is a paradigm for describing patterns in text in order to accommodate the flexibility of written language and our attempts to find information within larger bodies of text.

For example, I might want to find a series of keywords within a body of text in order to determine whether or not the topic of the text has been correctly identified (or even in order to determine what the topic of the text actually is!). Regular expression has the ability to help express patterns in text that we may want to examine, and the flexibility to determine exactly how we want to capture it. Some common uses of regular expression:

- Validating emails, phone numbers, addresses, and other text blocks
- Finding subsets of a document
- Determining the presence or count of particular words, phrases or patterns
- Web Scraping
- Grammar suggestions

There are, of course, many more ways to use regular expression. Let's start looking at how we can use it in our personal data analysis.


## Basics of Regular Expression

Remember, regular expression is a way to describe patterns. Below, we will explore how to create various kinds of patterns. To get started, we need to import our regular expression library, called `re`. We can import `re` just like any other library by including the following line of code:

In [None]:
import re

The first function that we will utilize is `re.search`. This function will allow us to search through a string of text in order to try and find patterns of interest. Let's begin by importing some text. We will utilize two chapters (44 and 45) from the Jane Austen novel "Pride and Prejudice" (one of my favorites). Below is the code to import these chapters as a (very long) string:

In [None]:
import requests
url = "https://raw.githubusercontent.com/dustywhite7/pythonMikkeli/master/exampleData/prideAndPrejudiceChapters.txt"
document = requests.get(url).text

print(document) # only run this last line if you want to see the whole (very long) string!

>     Chapter 44
>     
>     Elizabeth had settled it that Mr. Darcy would bring his sister to visit
>     her the very day after her reaching Pemberley; and was consequently
>     resolved not to be out of sight of the inn the whole of that morning.
>     But her conclusion was false; for on the very morning after their
>     arrival at Lambton, these visitors came. They had been walking about the
>     place with some of their new friends, and were just returning to the inn
>     to dress themselves for dining with the same family, when the sound of a
>     carriage drew them to a window, and they saw a gentleman and a lady in
>     a curricle driving up the street. Elizabeth immediately recognizing
>     the livery, guessed what it meant, and imparted no small degree of her
>     surprise to her relations by acquainting them with the honour which she
> 
>     ...yadda yadda yadda... (not really part of the text)

### Finding text

Now that we have our text imported, we can begin to explore it. We will start by looking for the word "Elizabeth" (she is the main character of the novel, and is also sometimes called "Lizzy" or "Eliza"). In order to look for the word "Elizabeth", we will use the `re.search` function mentioned earlier. This function takes two arguments: a regular expression (or pattern), and a string in which to search for matches to that pattern. Regular expression patterns will be stored as strings, but we will use **raw strings**. To create a raw string, we prepend the string with a single letter `r`, so that our string looks something like this: `r''`. 

By using raw strings, we tell Python not to interpret backslashes itself. This lets us write character sequences beginning with `\` (such as `\n` or `\d`) and have them passed unchanged to the regular expression engine, where they describe possible patterns that we would like to match.
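
To see what the `r` prefix changes, here is a minimal sketch comparing a regular string with a raw string. The variable names are illustrative only and are not used elsewhere in this lesson.

```python
import re

# In a regular string, Python turns "\n" into a single newline character
regular = '\n'
# In a raw string, the backslash is kept: two characters, "\" and "n"
raw = r'\n'

print(len(regular))  # 1
print(len(raw))      # 2

# The regex engine receives the backslash and reads "\d" as "any digit"
first_digit = re.search(r'\d', 'Chapter 44').group()
print(first_digit)   # 4
```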

Once we have created our regular expression pattern, we can call the `re.search` function. It will look for the first match to your pattern. Here is what happens when we look for a match to "Elizabeth":

In [None]:
import re

re.search(r'Elizabeth', document)

The result of this command is


    <re.Match object; span=(13, 22), match='Elizabeth'>



This is a successful search! The document does mention Elizabeth (to the surprise of absolutely nobody), and we see that a match to the expression `r'Elizabeth'` happened. The result of the function is an `re.Match` object, which contains information about where the match was found, as well as the text that matched our stated pattern. When we include a string like `r'Elizabeth'` as our pattern, we are telling regular expression to find a **capital** "E", followed by lower case "l", followed by lower case "i", followed by lower case "z", followed by lower case "a", followed by lower case "b", followed by lower case "e", followed by lower case "t", followed by lower case "h". No other text will match this pattern.

We will almost always want to be more expressive than this. For example, if we read the document, we can see that some characters mention Elizabeth as Eliza (or Lizzy). We can of course search for Eliza separately, or we can try to combine Eliza into our expression pattern, so that we can detect all matches of either Elizabeth or Eliza. To see the difference, we will use the `re.finditer` function to look for all matches to our pattern:

In [None]:
# Number of results for only Elizabeth 
print("Number of results for only Elizabeth")
print(len([i for i in re.finditer(r'Elizabeth', document)]))

# Number of results for Elizabeth OR Eliza
print("Number of results for Elizabeth OR Eliza")
print(len([i for i in re.finditer(r'Eliza(beth)?', document)]))

    Number of results for only Elizabeth
    32
    Number of results for Elizabeth OR Eliza
    34


What did we just do? A couple things. First, we used the `re.finditer` function, which generates an **iterable** object. In order to get back all of our results, we need to create a list by iterating over the results that are generated by our iterable object. We used a list comprehension to do that. (Remember that list comprehensions are statements like `[i for i in some_iterable]`). At that point, we had a list of results. We then checked the length of the list for each of two different regular expressions. One for just Elizabeth, and one for Eliza/Elizabeth. 
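
As a small illustration on a made-up sample string (not the novel text), the sketch below shows that `re.finditer` hands back matches one at a time, and that the iterator is used up once we have looped over it. Calling `list(...)` on the iterator is equivalent to the list comprehension used above.

```python
import re

sample = "Eliza wrote to Elizabeth, and Elizabeth replied."

# finditer returns an iterator, not a list
matches = re.finditer(r'Elizabeth', sample)

# Converting to a list pulls every Match object out of the iterator
found = list(matches)
print(len(found))                   # 2
print([m.start() for m in found])   # [15, 30]

# The iterator is now exhausted, so a second pass yields nothing
leftover = list(matches)
print(len(leftover))                # 0
```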

Now let's break down the difference in the regular expressions. There are two differences between our first and second expressions. First, we used `()` to encapsulate a portion of the word Elizabeth that may not be present in some cases. Parens (`()`) create **groups** in regular expression. Sometimes we will want to capture groups from within an expression, and other times we will only want to indicate that a group may or may not be present. Second, we placed a `?` character after the group to indicate that the sequence `beth` is not strictly required for a match. This doesn't mean that our text needs to include a `?`. The `?` symbol says "I want either 1 or 0 matches to the `beth` sequence."

In other words, "beth" may be present (have one match) or not present (have zero matches) while still allowing us to have a successful match to our regular expression pattern. There are several similar expressions:

| Symbol | Meaning |
| --- | :---: |
| `?` | 1 or 0 repetitions|
| `+` | 1 or more repetitions | 
| `*` | 0 or more repetitions |
|`{n}` | exactly `n` repetitions|
|`{m,n}` | between `m` and `n` repetitions|
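
The rows of this table can be tried out on a few tiny strings. This is only a sketch: the sample patterns and strings are invented for illustration, and it relies on `re.fullmatch`, a standard `re` function (not otherwise used in this lesson) that succeeds only when the entire string fits the pattern.

```python
import re

# (pattern, text) pairs to test against each other
checks = [
    (r'colou?r', 'color'),     # ? -> the "u" may appear 0 or 1 times
    (r'colou?r', 'colouur'),   # two "u"s is too many for ?
    (r'a+', ''),               # + needs at least one "a"
    (r'a*', ''),               # * accepts zero "a"s
    (r'(ab){2}', 'abab'),      # the group "ab" exactly twice
    (r'a{2,3}', 'aaaa'),       # two or three "a"s, so four fails
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text in checks]
print(results)  # [True, False, False, True, True, False]
```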

We can use these expressions with groups, or with single characters. What if we wanted to find the end of sentences in our text? Let's try and express that:

In [None]:
[i for i in re.finditer(r'[.?!][ ]*', document)]

    [<re.Match object; span=(45, 47), match='. '>,
     <re.Match object; span=(221, 222), match='.'>,
     <re.Match object; span=(328, 330), match='. '>,
     <re.Match object; span=(611, 613), match='. '>,
     <re.Match object; span=(798, 800), match='. '>,
     <re.Match object; span=(1017, 1019), match='. '>,
     <re.Match object; span=(1199, 1201), match='. '>,
     <re.Match object; span=(1332, 1334), match='. '>,
     <re.Match object; span=(1616, 1617), match='.'>,
     <re.Match object; span=(1829, 1830), match='.'>,
     <re.Match object; span=(1912, 1914), match='. '>,
     <re.Match object; span=(2019, 2021), match='. '>,
     <re.Match object; span=(2191, 2193), match='. '>,
     <re.Match object; span=(2268, 2269), match='.'>,
     <re.Match object; span=(2430, 2432), match='. '>,
     <re.Match object; span=(2573, 2575), match='. '>,
     <re.Match object; span=(2667, 2669), match='. '>,
     <re.Match object; span=(2740, 2741), match='.'>,
     <re.Match object; span=(2784, 2786), match='. '>,
     <re.Match object; span=(3017, 3019), match='. '>,
     <re.Match object; span=(3229, 3231), match='. '>,
     <re.Match object; span=(3371, 3372), match='.'>,
     <re.Match object; span=(3379, 3381), match='. '>,
     <re.Match object; span=(3388, 3390), match='. '>,
     <re.Match object; span=(3459, 3461), match='. '>,
     <re.Match object; span=(3492, 3494), match='. '>,
     <re.Match object; span=(3557, 3559), match='. '>,
     <re.Match object; span=(3601, 3603), match='. '>,
     <re.Match object; span=(3815, 3817), match='. '>,
     <re.Match object; span=(3948, 3949), match='.'>,
     <re.Match object; span=(3989, 3991), match='. '>,
     <re.Match object; span=(4291, 4293), match='. '>,
     <re.Match object; span=(4368, 4369), match='.'>,
     <re.Match object; span=(4440, 4441), match='!'>,
     <re.Match object; span=(4525, 4527), match='. '>,
     <re.Match object; span=(4712, 4714), match='. '>,
     <re.Match object; span=(4850, 4852), match='. '>,
     <re.Match object; span=(4912, 4914), match='. '>,
     <re.Match object; span=(4986, 4988), match='. '>,
     <re.Match object; span=(5267, 5269), match='. '>,
     <re.Match object; span=(5530, 5532), match='. '>,
     <re.Match object; span=(5624, 5625), match='.'>,
     <re.Match object; span=(5805, 5807), match='. '>,
     <re.Match object; span=(5929, 5930), match='.'>,
     <re.Match object; span=(5983, 5985), match='. '>,
     <re.Match object; span=(6343, 6345), match='. '>,
     <re.Match object; span=(6815, 6817), match='. '>,
     <re.Match object; span=(7251, 7252), match='.'>,
     <re.Match object; span=(7339, 7341), match='. '>,
     <re.Match object; span=(7417, 7419), match='. '>,
     <re.Match object; span=(7426, 7428), match='. '>,
     <re.Match object; span=(7507, 7509), match='. '>,
     <re.Match object; span=(7620, 7622), match='. '>,
     <re.Match object; span=(7625, 7627), match='. '>,
     <re.Match object; span=(7798, 7800), match='. '>,
     <re.Match object; span=(8090, 8091), match='.'>,
     <re.Match object; span=(8275, 8277), match='. '>,
     <re.Match object; span=(8590, 8592), match='. '>,
     <re.Match object; span=(8783, 8784), match='.'>,
     <re.Match object; span=(8818, 8820), match='. '>,
     <re.Match object; span=(8827, 8829), match='. '>,
     <re.Match object; span=(8899, 8901), match='. '>,
     <re.Match object; span=(8959, 8961), match='. '>,
     <re.Match object; span=(9054, 9056), match='. '>,
     <re.Match object; span=(9113, 9114), match='.'>,
     <re.Match object; span=(9121, 9123), match='. '>,
     <re.Match object; span=(9244, 9246), match='. '>,
     <re.Match object; span=(9505, 9507), match='. '>,
     <re.Match object; span=(9512, 9514), match='. '>,
     <re.Match object; span=(9767, 9769), match='. '>,
     <re.Match object; span=(9883, 9885), match='. '>,
     <re.Match object; span=(10067, 10069), match='. '>,
     <re.Match object; span=(10158, 10159), match='.'>,
     <re.Match object; span=(10458, 10460), match='. '>,
     <re.Match object; span=(10487, 10488), match='.'>,
     <re.Match object; span=(10765, 10767), match='. '>,
     <re.Match object; span=(10797, 10799), match='. '>,
     <re.Match object; span=(10935, 10937), match='. '>,
     <re.Match object; span=(11291, 11293), match='. '>,
     <re.Match object; span=(11405, 11406), match='.'>,
     <re.Match object; span=(11642, 11644), match='. '>,
     <re.Match object; span=(12002, 12004), match='. '>,
     <re.Match object; span=(12270, 12272), match='. '>,
     <re.Match object; span=(12628, 12629), match='.'>,
     <re.Match object; span=(13063, 13065), match='. '>,
     <re.Match object; span=(13092, 13094), match='. '>,
     <re.Match object; span=(13194, 13195), match='.'>,
     <re.Match object; span=(13199, 13201), match='. '>,
     <re.Match object; span=(13240, 13242), match='. '>,
     <re.Match object; span=(13386, 13387), match='.'>,
     <re.Match object; span=(13687, 13688), match='.'>,
     <re.Match object; span=(13818, 13820), match='. '>,
     <re.Match object; span=(14024, 14025), match='.'>,
     <re.Match object; span=(14105, 14107), match='. '>,
     <re.Match object; span=(14173, 14175), match='. '>,
     <re.Match object; span=(14426, 14428), match='. '>,
     <re.Match object; span=(14431, 14433), match='. '>,
     <re.Match object; span=(14497, 14498), match='.'>,
     <re.Match object; span=(14506, 14508), match='. '>,
     <re.Match object; span=(14671, 14673), match='. '>,
     <re.Match object; span=(14699, 14701), match='. '>,
     <re.Match object; span=(14887, 14889), match='. '>,
     <re.Match object; span=(14967, 14969), match='. '>,
     <re.Match object; span=(15127, 15128), match='.'>,
     <re.Match object; span=(15295, 15297), match='. '>,
     <re.Match object; span=(15495, 15497), match='. '>,
     <re.Match object; span=(15532, 15534), match='. '>,
     <re.Match object; span=(15607, 15609), match='. '>,
     <re.Match object; span=(15754, 15756), match='. '>,
     <re.Match object; span=(15932, 15934), match='. '>,
     <re.Match object; span=(16010, 16011), match='.'>,
     <re.Match object; span=(16253, 16255), match='. '>,
     <re.Match object; span=(16319, 16321), match='. '>,
     <re.Match object; span=(16520, 16521), match='.'>,
     <re.Match object; span=(16646, 16648), match='. '>,
     <re.Match object; span=(16826, 16827), match='.'>,
     <re.Match object; span=(16858, 16860), match='. '>,
     <re.Match object; span=(17057, 17059), match='. '>,
     <re.Match object; span=(17420, 17422), match='. '>,
     <re.Match object; span=(17665, 17667), match='. '>,
     <re.Match object; span=(17694, 17696), match='. '>,
     <re.Match object; span=(17933, 17935), match='. '>,
     <re.Match object; span=(18136, 18137), match='?'>,
     <re.Match object; span=(18180, 18181), match='.'>,
     <re.Match object; span=(18526, 18528), match='. '>,
     <re.Match object; span=(18711, 18713), match='. '>,
     <re.Match object; span=(19171, 19173), match='. '>,
     <re.Match object; span=(19244, 19246), match='. '>,
     <re.Match object; span=(19521, 19523), match='. '>,
     <re.Match object; span=(19744, 19745), match='.'>,
     <re.Match object; span=(19978, 19980), match='. '>,
     <re.Match object; span=(20215, 20216), match='.'>,
     <re.Match object; span=(20311, 20313), match='. '>,
     <re.Match object; span=(20451, 20453), match='. '>,
     <re.Match object; span=(20485, 20487), match='. '>,
     <re.Match object; span=(20576, 20578), match='. '>,
     <re.Match object; span=(20711, 20713), match='. '>,
     <re.Match object; span=(20844, 20845), match='.'>,
     <re.Match object; span=(20901, 20903), match='. '>,
     <re.Match object; span=(20995, 20997), match='. '>,
     <re.Match object; span=(21029, 21031), match='! '>,
     <re.Match object; span=(21097, 21098), match='.'>,
     <re.Match object; span=(21118, 21120), match='. '>,
     <re.Match object; span=(21323, 21324), match='.'>,
     <re.Match object; span=(21415, 21417), match='. '>,
     <re.Match object; span=(21513, 21515), match='. '>,
     <re.Match object; span=(21577, 21579), match='. '>,
     <re.Match object; span=(21746, 21748), match='. '>,
     <re.Match object; span=(21901, 21902), match='.'>,
     <re.Match object; span=(22142, 22144), match='. '>,
     <re.Match object; span=(22468, 22469), match='!'>,
     <re.Match object; span=(22509, 22510), match='.'>,
     <re.Match object; span=(22612, 22613), match='.'>,
     <re.Match object; span=(22817, 22818), match='.'>,
     <re.Match object; span=(22955, 22956), match='.'>,
     <re.Match object; span=(22961, 22963), match='. '>,
     <re.Match object; span=(23105, 23107), match='. '>,
     <re.Match object; span=(23232, 23234), match='. '>,
     <re.Match object; span=(23365, 23367), match='. '>,
     <re.Match object; span=(23399, 23401), match='. '>,
     <re.Match object; span=(23479, 23480), match='.'>]



This is a good start! We know that sentences can end with "?", "!", or ".", so we create a set of characters that can be matched: `[.?!]`. This expression says that a character can be a period, question mark, or exclamation mark and still match our pattern. We then use a character set including only the space character followed by a star (`[ ]*`) to indicate that there should be 0 or more spaces following each punctuation mark.

We can use character sets where spelling is uncertain, or where there is a possibility of letters being upper or lower case. Let's look for the word "on", both with and without a character set to accommodate case ambiguity:

In [None]:
# Number of results for "on"
print("Number of results for 'on'")
print(len([i for i in re.finditer(r'on', document)]))

# Number of results for "on" or "On"
print("Number of results for 'on' or 'On'")
print(len([i for i in re.finditer(r'[Oo]n', document)]))

    Number of results for 'on'
    219
    Number of results for 'on' or 'On'
    221


If we weren't careful, we would miss some matches to the word we are looking for!

## Finding numbers

We won't always be looking for words. Sometimes, we want to find or validate numbers. In the United States, phone numbers are commonly expressed with a three-digit area code, a hyphen (`-`), and a seven-digit phone number with the first three numbers separated from the final four by another hyphen. Let's try to detect this pattern in text that is provided by the user. Before we get started, let's talk about how we can identify numbers.

When we look for numbers, we typically are not looking for a single specific number (like 42). We are instead looking for **some** number. It might be 2, it might be 142, or it might be 1000. Let's look for a number in the following sentence:

In [None]:
sentence = "My favorite number is 7"

re.search(r'7', sentence)

    <re.Match object; span=(22, 23), match='7'>



Okay, we can find an exact number. We can even see that we can slice it out of the string using `sentence[22:23]` based on the `span` reported by our match. Can you think of a way that we have discussed above to find **any** single-digit number?


**Solve it!**

The user will provide their favorite single-digit number in a string called `input_string`. You need to make sure that there is a number in the sentence. Store the results of your `re.search` function in the variable `hasNumber` in the cell labeled `#si-favorite` found below.

In [None]:
#si-favorite

import re

hasNumber = re.search(r"(?<!\d)\d(?!\d)", input_string)

When we want to find a number, we could use character sets: `[0123456789]`. This character set specifies that we will accept any number character as a match to our pattern. If we want to specify that it be a number that has more than one character, we could provide a pattern such as `[0123456789]+`. It will get really old really fast to keep typing `[0123456789]`. Instead, we can use the shorthand `\d` representation for a **digit** or number. Now, if we want to find a number character, we can just write `\d` in place of `[0123456789]`. For one or more number characters, we write `\d+`.
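
As a quick sketch of the difference between `\d` and `\d+`, the example below uses a made-up sentence and `re.findall`, a standard `re` function (not otherwise used in this lesson) that returns the matched text of every match as a list of strings.

```python
import re

sentence = "I have 2 dogs, 14 chickens, and 1000 bees"

# \d matches one digit at a time
singles = re.findall(r'\d', sentence)
print(singles)   # ['2', '1', '4', '1', '0', '0', '0']

# \d+ matches each run of one or more digits as a single number
numbers = re.findall(r'\d+', sentence)
print(numbers)   # ['2', '14', '1000']
```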

Let's get back to trying to break down phone number validation. (Look at us factoring our problem! :D) Using what we know about digit representation, combined with our knowledge of ways to express repeated patterns from before, we can express a phone number! If we rewrite our goal as words, a US phone number would result in the following:

- Three numbers (if we include an area code)
- A hyphen (again, assuming an area code is included)
- Three numbers
- A hyphen
- Four numbers

Let's write this as a regular expression assuming that we want an area code:

In [None]:
# Ask user for phone number
myNumber = input("Please enter your phone number: ")

# Check if the phone number is valid
valid = re.search(r'\d{3}-\d{3}-\d{4}', myNumber)

# Print a boolean value indicating whether or not the string contains a valid phone number
print(bool(valid))

    Please enter your phone number: 123-456-7890
    True


Awesome! Try different valid and invalid phone numbers in the cell above to see if they work with our code.

### More general patterns

What other patterns might come in handy? Here is a short table of helpers for making your expressions easier to write:

| Pattern | Meaning | Pattern | Meaning |
| --- | --- | --- | --- |
| `\s` | any whitespace character | `\S` | any non-whitespace character |
| `\d` | any digit character | `\D` | any non-digit character | 
| `\w` | any word character (a number, a letter, or a `_`) | `\W` | non-word characters|
| `\b` | word boundaries (the start or end of a sequence of word characters) | | |
| `^` | start of a string | `$` | end of a string |
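
To illustrate one of these helpers, the sketch below applies `\b` to the earlier search for the word "on". The sample sentence is invented for illustration, and `re.findall` is a standard `re` function that returns the matched text of each match.

```python
import re

text = "On reflection, she sat on the wagon."

# Without boundaries, "on" is also found inside longer words
loose = re.findall(r'[Oo]n', text)
print(len(loose))   # 4

# With \b on both sides, only the standalone word matches
whole_words = re.findall(r'\b[Oo]n\b', text)
print(whole_words)  # ['On', 'on']
```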

Going back to our phone number exercise, what happens if you provide the string `123-456-7890 turtles` as a phone number? Is it considered valid? **It is!** We can use our new patterns to protect against this:

In [None]:
# Add start and end of string characters
valid = re.search(r'^\d{3}-\d{3}-\d{4}$', "123-456-7890 turtles")

# Print an improved boolean value indicating whether or not the string contains a valid phone number
print(bool(valid))

    False


Now we won't be fooled so easily! Our number has to be formatted correctly, AND our string cannot contain any other characters. Play with the code above and check for yourself!

When we think about how to create an expression to validate text or numbers, we should be particularly careful to think of any **possible** ways that a value might be invalid. It is not sufficient to think only of ways in which text is **likely** to be invalid. If we are not careful, our validation will not truly validate, and any work that we do with the "validated" text may be broken by cases we have not accounted for.
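
One case that is easy to overlook: in Python's `re` module, `$` also matches just before a single newline at the very end of the string. The sketch below illustrates this and shows `re.fullmatch` (a standard `re` function not otherwise used in this lesson) as one possible stricter check.

```python
import re

pattern = r'\d{3}-\d{3}-\d{4}'

# "$" tolerates a single trailing newline, so this still validates
loose = bool(re.search('^' + pattern + '$', "123-456-7890\n"))
print(loose)           # True

# re.fullmatch requires the entire string to match, newline included
strict_newline = bool(re.fullmatch(pattern, "123-456-7890\n"))
strict_clean = bool(re.fullmatch(pattern, "123-456-7890"))
print(strict_newline)  # False
print(strict_clean)    # True
```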

### Looking Behind

Speaking of not accounting for all the possibilities, we made this mistake earlier when we were looking for the ends of sentences. We looked for all periods (`.`), question marks (`?`), and exclamation marks (`!`). Unfortunately, not all periods come at the end of sentences! In *Pride and Prejudice*, many characters are referred to as "Mr." or "Mrs.". Those periods are followed by spaces, and so would have counted as the end of a sentence.

#### Negative Lookbehind

In order to solve this problem, we need a new tool: the **negative lookbehind**. A negative lookbehind is a pattern that must be matched while simultaneously NOT being preceded by another pattern. In this case, we would want `[.?!]` wherever that character set is NOT preceded by `Mr` or `Mrs`. First, let's make sure that we can find `Mr` and `Mrs` with a single expression:

In [None]:
[i for i in re.finditer(r'Mrs*', document)]

    [<re.Match object; span=(43, 45), match='Mr'>,
     <re.Match object; span=(2665, 2667), match='Mr'>,
     <re.Match object; span=(2782, 2784), match='Mr'>,
     <re.Match object; span=(3377, 3379), match='Mr'>,
     <re.Match object; span=(3385, 3388), match='Mrs'>,
     <re.Match object; span=(3599, 3601), match='Mr'>,
     <re.Match object; span=(5981, 5983), match='Mr'>,
     <re.Match object; span=(7337, 7339), match='Mr'>,
     <re.Match object; span=(7415, 7417), match='Mr'>,
     <re.Match object; span=(7423, 7426), match='Mrs'>,
     <re.Match object; span=(7622, 7625), match='Mrs'>,
     <re.Match object; span=(8816, 8818), match='Mr'>,
     <re.Match object; span=(8824, 8827), match='Mrs'>,
     <re.Match object; span=(8957, 8959), match='Mr'>,
     <re.Match object; span=(9119, 9121), match='Mr'>,
     <re.Match object; span=(9503, 9505), match='Mr'>,
     <re.Match object; span=(10456, 10458), match='Mr'>,
     <re.Match object; span=(13197, 13199), match='Mr'>,
     <re.Match object; span=(14102, 14105), match='Mrs'>,
     <re.Match object; span=(14428, 14431), match='Mrs'>,
     <re.Match object; span=(14503, 14506), match='Mrs'>,
     <re.Match object; span=(14696, 14699), match='Mrs'>,
     <re.Match object; span=(14884, 14887), match='Mrs'>,
     <re.Match object; span=(16250, 16253), match='Mrs'>,
     <re.Match object; span=(16644, 16646), match='Mr'>,
     <re.Match object; span=(16856, 16858), match='Mr'>,
     <re.Match object; span=(17663, 17665), match='Mr'>,
     <re.Match object; span=(20309, 20311), match='Mr'>,
     <re.Match object; span=(20899, 20901), match='Mr'>,
     <re.Match object; span=(21116, 21118), match='Mr'>,
     <re.Match object; span=(22958, 22961), match='Mrs'>,
     <re.Match object; span=(23362, 23365), match='Mrs'>,
     <re.Match object; span=(23396, 23399), match='Mrs'>]



So far, so good! Now that we can find all of the cases that we **don't** want to count as ends of sentences, let's use a negative lookbehind. The expression we want to avoid will be put in a group (between `()`), and we will put the character sequence `?<!` at the start of the group:

In [None]:
# Count of sentences without negative lookbehind (incorrect)
print("Count of sentences without negative lookbehind (incorrect)")
print(len([i for i in re.finditer(r'[.!?]', document)]))

# Count of sentences with negative lookbehind (correct)
print("Count of sentences with negative lookbehind (correct)")
print(len([i for i in re.finditer(r'(?<!Mr)(?<!Mrs)[.!?]', document)]))

    Count of sentences without negative lookbehind (incorrect)
    161
    Count of sentences with negative lookbehind (correct)
    128


We have to use **two** negative lookbehinds to solve this problem, because neither `Mr` nor `Mrs` may precede our punctuation. You might have tried `(?<!Mrs*)` (as I did at first). Unfortunately, Python's `re` module does not permit a lookbehind of uncertain length (meaning that inside a lookbehind we cannot use `*`, `+`, `?`, or a range such as `{m,n}`, although a fixed count like `{3}` is fine). Instead, we simply check for the first string, then check for the second string.
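
The sketch below, run on an invented sample sentence rather than the novel, shows both halves of this: the variable-length lookbehind is rejected when the pattern is compiled, while the pair of fixed-length lookbehinds skips the periods after `Mr` and `Mrs`.

```python
import re

sample = "Mr. Darcy bowed. Mrs. Gardiner smiled! Was it over?"

# A variable-length lookbehind raises re.error at compile time
try:
    re.compile(r'(?<!Mrs*)[.!?]')
    compiled = True
except re.error as err:
    compiled = False
    print("Pattern rejected:", err)

# Two fixed-length lookbehinds are accepted
ends = [m.start() for m in re.finditer(r'(?<!Mr)(?<!Mrs)[.!?]', sample)]
print(ends)  # [15, 37, 50]
```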

#### Positive Lookbehind

We can also use a positive lookbehind to **require** that our pattern be preceded by another pattern. For example, we may wish to determine which chapters are covered in a text document (we know it is Chapters 44 and 45, but we can use this as an exercise anyway). A positive lookbehind is indicated within parens (`()`) just like the negative lookbehind, but uses the character sequence `?<=`. Let's try to find numbers that are preceded by the pattern `Chapter `:

In [None]:
[i for i in re.finditer(r'(?<=Chapter )\d+', document)]

    [<re.Match object; span=(8, 10), match='44'>,
     <re.Match object; span=(13399, 13401), match='45'>]



Excellent! Just as we know *should* be the case, we can see that we found the beginnings of both chapters 44 and 45.

## Options

When we are creating patterns, we sometimes will encounter multiple acceptable patterns that we wish to account for (just like the example with Eliza vs Elizabeth earlier). With phone numbers, area codes in the United States are often written in parens rather than with hyphens. Thus, we might need to account for two patterns:
- `xxx-xxx-xxxx`
- `(xxx) xxx-xxxx`

Let's try to write an expression for the second option (since we can already accommodate the first):

In [None]:
# Ask user for phone number
myNumber = input("Please enter your phone number: ")

# Check if the phone number is valid
valid = re.search(r'^[(]\d{3}[)][ ]\d{3}-\d{4}$', myNumber)

# Print a boolean value indicating whether or not the string contains a valid phone number
print(bool(valid))

    Please enter your phone number: (425) 443-1880
    True


Our new rule is written `[(]\d{3}[)][ ]`. There are a lot of square brackets! In order to look for the literal `(` and `)` characters, we wrap each of them in a character set, so that our program knows that they are not there to indicate a group, but are instead literal characters. (Escaping them with a backslash, as in `\(` and `\)`, works too.) In between the parens, we have our `\d{3}` three-digit number set. Finally, at the end, we have a single space character `[ ]` that must separate the area code `(xxx)` from the rest of the phone number `xxx-xxxx`.

You can try various inputs to see if the code continues to validate phone numbers correctly. It should work on our new format with parens, but not on our original phone number format. Luckily, now that we have a validity check for both phone number formats, we can incorporate them into a single expression to accept either valid format:

In [None]:
# Ask user for phone number
myNumber = input("Please enter your phone number: ")

# Check if the phone number is valid
valid = re.search(r'^([(]\d{3}[)][ ]|\d{3}-)\d{3}-\d{4}$', myNumber)

# Print a boolean value indicating whether or not the string contains a valid phone number
print(bool(valid))

    Please enter your phone number: (123) 456-7890
    True


In this case, we have taken the original area code rule `\d{3}-` as well as the new area code rule `[(]\d{3}[)][ ]`, and we have put them inside of a group using parens `()`. They are separated inside this group by the `|` character (called a "pipe" when we are speaking). Just like it does in our boolean conditions, the `|` character represents the logical statement "or". Essentially, we are saying that our pattern must match `\d{3}-` OR `[(]\d{3}[)][ ]`, followed by `\d{3}-\d{4}`.

For good measure, we have wrapped the whole expression in string start and end characters (`^` and `$`) in order to disallow any other text in our valid phone numbers.
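
Rather than typing inputs one at a time, the combined pattern can be checked against a handful of candidates in a loop. This is only a sketch: the candidate strings are invented, and `re.compile` (a standard `re` function) is used so the pattern can be reused.

```python
import re

# Same combined pattern as above, compiled once so it can be reused
phone_pattern = re.compile(r'^([(]\d{3}[)][ ]|\d{3}-)\d{3}-\d{4}$')

candidates = [
    "123-456-7890",      # hyphenated area code
    "(123) 456-7890",    # area code in parens
    "(123)456-7890",     # missing the space after the parens
    "(123-456-7890",     # mixes the two styles
    "456-7890",          # no area code at all
]

results = [bool(phone_pattern.search(text)) for text in candidates]
for text, ok in zip(candidates, results):
    print(text, ok)
```

Only the first two candidates should be accepted; the others break one of the two area code rules.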

## Extracting Results



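When we use the `re.search` or `re.finditer` functions, matches are returned to us through an `re.Match` object. We can extract the matched text from a match object by using the `.group()` method. (The `.string` attribute is different: it holds the entire string that was searched, which only equals the matched text when the pattern covers the whole string, as it does here.)

In [None]:
valid.group()

    '(123) 456-7890'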



Also useful is the `.span()` method of the `re.Match` object, which returns to us the start and end points (within the original string) of the match to our pattern:

In [None]:
valid.span()

    (0, 14)
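
A match object exposes a few more accessors that work together. The sketch below uses an invented sentence and wraps each part of the phone number in a group, so that `.group(1)` and `.groups()` can pull the pieces out individually.

```python
import re

text = "Call (123) 456-7890 tomorrow"
match = re.search(r'[(](\d{3})[)][ ](\d{3})-(\d{4})', text)

print(match.group())    # (123) 456-7890 -> the full matched text
print(match.group(1))   # 123 -> the first parenthesized group
print(match.groups())   # ('123', '456', '7890')
print(match.span())     # (5, 19)

# The span can be used to slice the match out of the original string
start, end = match.span()
print(text[start:end])  # (123) 456-7890
```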



Finally, we can also convert an `re.Match` object into a `bool`, which is useful in various contexts such as text validation. A match object always evaluates to `True`, so we can simply pass the variable that holds it to `bool`:

In [None]:
bool(valid)

    True



If, on the other hand, we fail to find a match to our pattern, `re.search` will return `None`, and its boolean value will be `False`. (`re.finditer` behaves differently: it still returns an iterator, but one that yields no matches.)

In [None]:
value = re.search(r'happy', "I am sad")

print(value)
print(bool(value))

    None
    False
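
A quick check of what each function gives back when nothing matches, followed by a common way to guard against a missing match before using it. The variable names here are illustrative only.

```python
import re

# re.search gives back None when nothing matches
missing = re.search(r'happy', "I am sad")
print(missing is None)   # True

# re.finditer still gives back an iterator; it simply yields no matches
no_matches = list(re.finditer(r'happy', "I am sad"))
print(no_matches)        # []

# Test the result before calling methods on it
result = re.search(r'sad', "I am sad")
if result:
    print(result.group())  # sad
```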


**Solve it!**

Write a function called `dateCheck` in the cell labeled `#si-date-check` found below that will accept a string as an argument to determine whether or not that string contains ONLY a valid date. The date should be of the format DD-MM-YYYY. In other words, the day of the month should come first, followed by a hyphen (`-`), followed by the month itself, followed by another hyphen (`-`), and finally the year. Days and months should always have two digits, and years should have four digits.

For the purpose of this exercise, pretend that all months have 31 days, and remember that there are twelve months in any year.

When the string is a correctly formatted date, your function should return `True`. Where the string is not a valid date, your function should return `False`.

The `#si-date-check` file should include everything needed, including import statements, for your function to work properly.


In [None]:
#si-date-check
import re

def dateCheck(date):
    # Regular expression pattern for DD-MM-YYYY format
    pattern = r'^(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[0-2])-\d{4}$'
    
    # Check if the date matches the pattern
    return bool(re.match(pattern, date))

# Example test cases
print(dateCheck("15-08-2023"))  # True
print(dateCheck("31-12-1999"))  # True
print(dateCheck("32-01-2022"))  # False (Invalid day)
print(dateCheck("15-13-2023"))  # False (Invalid month)
print(dateCheck("5-08-2023"))   # False (Day should be two digits)
print(dateCheck("15-8-2023"))   # False (Month should be two digits)
print(dateCheck("15/08/2023"))  # False (Wrong separator)

**Solve it!**

Write a function called `emailCheck` in the cell labeled `#si-email-check` below that will accept a string as an argument, and determine whether or not that string is a valid email. Emails take the following form: a string that contains numbers, letters, and periods followed by the `@` symbol and a domain. We will accept domains that contain numbers and letters, ending in `.com`, `.edu`, or `.org`.

When the string is a correctly formatted email address, your function should return `True`. Where the string is not a valid email address, your function should return `False`.

The file should contain ALL NECESSARY CODE, including import statements, for your function to work properly.

In [None]:
#si-email-check

import re

def emailCheck(email):
    pattern = r"^[a-zA-Z0-9.]+@[a-zA-Z0-9]+\.(com|edu|org)$"

    if re.match(pattern, email):
        return True
    return False
