# Regular Expressions (Regexes)

### Introduction

In the following lessons we will learn how to create basic **Regular Expressions** in Python. Regular expressions, also known as regexes, are used to find patterns in text. In general, regexes work by first specifying the rules for the set of possible strings that you want to match and then making queries such as "Is this pattern found at the beginning of this string?" or "Is there a match for this pattern anywhere in this string?". We will learn, for example, how to write regular expressions to find phone numbers, names, and email addresses.

By the end of this lesson you should be able to read and write basic regular expressions in Python and know how to apply them to get useful financial information from 10-Ks.

# Raw Strings

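We will be using raw strings to create our regular expressions because regular expressions themselves also use the backslash character (`\`) to indicate their own special characters. In a raw string (a string literal prefixed with `r`), Python leaves backslashes as they are instead of treating them as the start of an escape sequence such as `\n` or `\t`. Therefore, by using raw strings, we avoid the problem of Python interpreting the special characters in regexes in the wrong way.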

In [1]:
print(r'Hello\n\tWorld')

Hello\n\tWorld
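To make the difference concrete, here is a minimal sketch that compares a regular string literal with a raw one. The variable names `regular` and `raw` are only illustrative:

```python
# Compare a regular string literal with a raw string literal
regular = 'Hello\n\tWorld'   # \n and \t are interpreted as a newline and a tab
raw = r'Hello\n\tWorld'      # the backslashes are kept as literal characters

print(len(regular))  # 12
print(len(raw))      # 14

# The raw string is the same as a regular string with doubled backslashes
print(raw == 'Hello\\n\\tWorld')  # True
```

The raw string is two characters longer because each backslash counts as a character of its own.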


# Finding Words Using Regexes

In this notebook we will learn how to find letters and words in a string using regular expressions. Throughout these lessons, we will use the `re` module from Python's standard library to work with regular expressions. The `re` module not only contains functions that allow us to check if a given regular expression matches a particular string, but also contains functions that allow us to modify strings in various ways. 

Let’s begin by using a regular expression to find all the locations of a single letter in a given string. To do this, we will use the `re.compile()` function from the `re` module. The `re.compile(pattern)` function converts a regular expression `pattern` into a regular expression object. This allows us to save our regular expressions into objects that can be used later to perform pattern matching using various methods, such as `.match()`, `.search()`, `.findall()`, and `.finditer()`. Let’s see how this works.

In the code below, we will find all the locations of the letter `a` in a string named `sample_text`. In this case, our regular expression pattern will just be `'a'` and we will pass it to the `re.compile()` function as a raw string. We will save the regular expression object returned by the `re.compile()` function in a variable called `regex`. We will then use the `.finditer()` method to search our `sample_text` for the given regular expression contained in the `regex` object. The `.finditer()` method returns an iterator over all the non-overlapping matches of our regular expression pattern in the string. The `.finditer()` method scans the string from left to right and returns the matches in the order found. Since the `.finditer()` method returns an iterator, we can loop through it to print all the matches, as shown below:

In [2]:
# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'a'
regex = re.compile(r'a')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(11, 12), match='a'>
<_sre.SRE_Match object; span=(17, 18), match='a'>
<_sre.SRE_Match object; span=(22, 23), match='a'>


We can see that each match corresponds to a Match Object with a given `span` and corresponding `match`. The `span=(start,end)` is a tuple that indicates the `start` and `end` indices of the given `match` in the string `sample_text`. For example, if we look at the `span` of the first match, we can see that the first `a` starts at index `6` and ends just before index `7` (the `end` index is exclusive, as in Python slicing). Therefore, if we print the slice `sample_text[6:7]`, we will see that it corresponds to the letter `a`:

In [3]:
# Print the sample_text string from index 6 through 7
print(sample_text[6:7])

a
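The same information can also be read directly from the Match Object. The sketch below uses the `.span()` and `.group()` methods of the match; the variable name `first_match` is only illustrative:

```python
import re

sample_text = 'Alice and Walter are walking to the store.'

# Take the first match of the letter 'a' from the iterator
first_match = next(re.compile(r'a').finditer(sample_text))

# .span() returns the (start, end) tuple; end is exclusive
start, end = first_match.span()
print(start, end)               # 6 7

# .group() returns the matched text itself
print(first_match.group())      # a
print(sample_text[start:end])   # a
```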


# Finding MetaCharacters

Here’s a complete list of the metacharacters used in regular expressions:

```python
. ^ $ * + ? { } [ ] \ | ( )
```

As we mentioned in the previous lesson, these metacharacters are used to give special instructions and can't be searched for directly. If we want to search for these metacharacters directly in strings we need to escape them first. Just like with Python string literals, we can use the backslash (`\`) to escape all the metacharacters. Let’s see an example.

Let's try to find the period (`.`) at the end of our `sample_text` again, but this time we will use a backslash (`\`) in our regular expression to remove the period's special meaning, as shown in the code below:

In [4]:
# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression '\.'
regex = re.compile(r'\.')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(41, 42), match='.'>
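To see why the escape matters, the following sketch compares the unescaped and escaped patterns on the same string. It also uses `re.escape()`, which adds the backslashes for us; the variable names are only illustrative:

```python
import re

sample_text = 'Alice and Walter are walking to the store.'

# Unescaped: '.' is a metacharacter and matches every character in this string
unescaped = re.findall(r'.', sample_text)

# Escaped: '\.' matches only a literal period
escaped = re.findall(r'\.', sample_text)

print(len(unescaped))  # 42
print(escaped)         # ['.']

# re.escape() returns the string with its metacharacters escaped
print(re.escape('$25.99'))  # \$25\.99
```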


### Find The Price

In the cell below, write a regular expression that matches the price of the coat bought by John and save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Then, write a loop to print all the `matches` found by the `.finditer()` method. Finally, use the `match.span()` method to print the match from the `sample_text` string.

In [5]:
# Import re module
import re

# Sample text
sample_text = 'John bought a winter coat for $25.99 dollars.'

# Create a regular expression object with the regular expression
regex = re.compile(r'\$\d*\.\d*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)
    
    # Using the span information from the match, print the match from the original string
    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])

<_sre.SRE_Match object; span=(30, 36), match='$25.99'>

Match from the original text: $25.99
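The pattern above is one possible answer. Because `\d*` also accepts zero digits, it would match a bare `$.` as well. The sketch below compares it with a slightly stricter variant, assuming that a price always has at least one digit before the decimal point and exactly two after it. The names `loose` and `strict` and the second test sentence are hypothetical:

```python
import re

loose = re.compile(r'\$\d*\.\d*')    # the pattern used in the cell above
strict = re.compile(r'\$\d+\.\d\d')  # assumed stricter variant

price_text = 'John bought a winter coat for $25.99 dollars.'
odd_text = 'It costs $. Really.'     # hypothetical text without a real price

print(loose.findall(price_text))   # ['$25.99']
print(strict.findall(price_text))  # ['$25.99']
print(loose.findall(odd_text))     # ['$.']
print(strict.findall(odd_text))    # []
```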


# Searching For Simple Patterns

Being able to match letters and metacharacters is the simplest task that regular expressions can do. In this section we will see how we can use regular expressions to perform more complex pattern matching. We can form any pattern we want by using the metacharacters mentioned in the previous lesson.

The first metacharacter we are going to look at is the backslash (`\`). We already saw that the backslash can be used to escape all the metacharacters, so that you can search for them directly. However, the backslash can also be followed by various characters to signal various special sequences. Here is a list of the special sequences we are going to look at in this notebook:

* `\d` - Matches any decimal digit; this is equivalent to the set [0-9]


* `\D` - Matches any non-digit character; this is equivalent to the set [^0-9]


* `\s` - Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v]


* `\S` - Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v]


* `\w` - Matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]


* `\W` - Matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]

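We can see that there is a difference between lowercase and uppercase sequences. For example, while `\d` matches any digit, `\D` matches everything that is **not** a digit. Similarly, while `\s` matches any whitespace character, `\S` matches everything that is **not** a whitespace character; and while `\w` matches any alphanumeric character, `\W` matches everything that is **not** an alphanumeric character. Note that the equivalent sets listed above describe how these sequences behave on ASCII text; with Python 3 `str` patterns they are Unicode-aware by default, so `\d`, for example, also matches digits from other scripts unless the `re.ASCII` flag is used.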

Let's start by learning how to use `\d` to search for decimal digits.
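As a minimal sketch of these sequences in action, the code below uses `re.findall()` on a short string. The `sample_text` here is a hypothetical example chosen only for this illustration:

```python
import re

# Hypothetical sample text used only for this illustration
sample_text = 'Room 42, Floor 7'

# \d matches each decimal digit
print(re.findall(r'\d', sample_text))       # ['4', '2', '7']

# \W matches each character that is not alphanumeric or an underscore
print(re.findall(r'\W', sample_text))       # [' ', ',', ' ', ' ']

# \d and \D together account for every character in the string
digits = re.findall(r'\d', sample_text)
non_digits = re.findall(r'\D', sample_text)
print(len(digits) + len(non_digits) == len(sample_text))  # True
```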

# Word Boundaries

We will now learn about another special sequence that you can create using the backslash:

* `\b`

This special sequence doesn't match a particular set of characters; instead, it matches a position, namely a word boundary. A word in this context is defined as a sequence of word characters (alphanumeric characters and the underscore, the same characters matched by `\w`). A boundary occurs wherever a word character sits next to a white space, any other non-word character, or the beginning or end of the string. We can have boundaries either before or after a word. Let's see how this works with an example.

In the code below, our `sample_text` string contains the following sentence:

```
The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.
```

As we can see, the word `class` appears in three different positions:

1. As a stand-alone word: The word `class` has white spaces both before and after it.


2. At the beginning of a word: The word `class` in `classroom` has a white space before it.


3. At the end of a word: The word `class` in `subclass` has a white space after it.

If we use `class` as our regular expression, we will match the word `class` in all three positions as shown in the code below:

In [6]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class'
regex = re.compile(r'class')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>
<_sre.SRE_Match object; span=(47, 52), match='class'>
<_sre.SRE_Match object; span=(85, 90), match='class'>


We can see that we have three matches, corresponding to all the instances of the word `class` in our `sample_text` string.

Now, let's use word boundaries to only find the word `class` when it appears in particular positions. Let’s start by using `\b` to only find the word `class` when it appears at the beginning of a word. We can do this by adding `\b` before the word `class` in our regular expression as shown below:

In [7]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\bclass'
regex = re.compile(r'\bclass')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>
<_sre.SRE_Match object; span=(47, 52), match='class'>


We can see that now we only have two matches because it's only matching the stand-alone word, `class`, and the `class` in `classroom` since both of them have a word boundary (in this case a white space) directly before them. We can also see that it is not matching the `class` in `subclass` because there is no word boundary directly before it. 

Now, let's use `\b` to only find the word `class` when it appears at the end of a word. We can do this by adding `\b` after the word `class` in our regular expression as shown below:

In [8]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class\b'
regex = re.compile(r'class\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>
<_sre.SRE_Match object; span=(85, 90), match='class'>


We can see that in this case we have two matches as well because it's matching the stand-alone word, `class` again, and the `class` in `subclass` since both of them have a word boundary (in this case a white space) directly after them. We can also see that it is not matching the `class` in `classroom` because there is no word boundary directly after it.

Now, let's use `\b` to only find the word `class` when it appears as a stand-alone word. We can do this by adding `\b` both before and after the word `class` in our regular expression as shown below:

In [9]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\bclass\b'
regex = re.compile(r'\bclass\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>


We can see that now we only have one match because the stand-alone word, `class`, is the only one that has a word boundary (in this case a white space) directly before and after it.

# Not A Word Boundary

As with the other special sequences that we saw before, we also have the uppercase version of `\b`, namely:

* `\B`

As with the other special sequences, `\B` indicates the opposite of `\b`. So if `\b` is used to indicate a word boundary, `\B` is used to indicate **not** a word boundary. Let's see how this works:

Let's use `\B` to only find the word `class` when it **doesn't** have a word boundary directly before it. We can do this by adding `\B` before the word `class` in our regular expression as shown below:

In [10]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\Bclass'
regex = re.compile(r'\Bclass')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(85, 90), match='class'>


We can see that we only get one match because the `class` in `subclass` is the only one that **doesn't** have a word boundary directly before it. 

Now, let's use `\B` to only find the word `class` when it **doesn't** have a word boundary directly after it. We can do this by adding `\B` after the word `class` in our regular expression as shown below:
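A minimal sketch of this step, following the same pattern as the cells above but printing the span and the matched text of each match instead of the Match Object itself:

```python
import re

sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class\B'
regex = re.compile(r'class\B')

# Print the span and the matched text of every match
for match in regex.finditer(sample_text):
    print(match.span(), match.group())   # (47, 52) class
```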

We can see that again we only have one match because the `class` in `classroom` is the only one that **doesn't** have a boundary directly after it. 

Finally, let's use `\B` to only find the word `class` when it **doesn't** have a word boundary directly before or after it. We can do this by adding `\B` both before and after the word `class` in our regular expression as shown below:

In [11]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\Bclass\B'
regex = re.compile(r'\Bclass\B')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

In this case, we get no matches. This is because every instance of the word `class` in our `sample_text` string has a boundary either before or after it. To get a match, `class` has to appear in the middle of a word, such as in the word `declassified`. Let's see an example:

In [12]:
# Import re module
import re

# Sample text
sample_text = 'declassified'

# Create a regular expression object with the regular expression '\Bclass\B'
regex = re.compile(r'\Bclass\B')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(2, 7), match='class'>


# Simple MetaCharacters

As we indicated in a previous lesson, regular expressions use metacharacters to give special instructions. Here again is a complete list of all the metacharacters used in regular expressions:

```python
. ^ $ * + ? { } [ ] \ | ( )
```
We already learned how to use one of these metacharacters, the backslash (`\`), to create special sequences. In the following lessons we will learn how to use the remaining metacharacters to create more complicated regular expressions. 

In this notebook, we will take a look at the following metacharacters:

```python
. ^ $
```

Let’s start by looking at the dot (`.`) metacharacter.

### The Dot (`.`)

As we saw in a previous lesson, the dot (`.`) matches any character except for newline (`\n`) characters. In the code below, we will use `.` as our regular expression to find all the characters in our multi-line `sample_text` string:
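A minimal sketch of this behavior, using a short hypothetical multi-line string so that the result is easy to check by eye. The last line uses the `re.DOTALL` flag, which makes the dot match newline characters as well:

```python
import re

# Hypothetical multi-line sample text
sample_text = 'Hi!\nBye'

# The dot matches every character except the newline
matches = [m.group() for m in re.finditer(r'.', sample_text)]
print(matches)  # ['H', 'i', '!', 'B', 'y', 'e']

# With re.DOTALL the newline is matched too
print(len(re.findall(r'.', sample_text, re.DOTALL)))  # 7
```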

### The Caret (`^`)

The caret (`^`) is used to match a sequence of characters when they appear at the beginning of a string. Let's take a look at an example.

In the code below, our `sample_text` string has the word `this` written twice:

```
this watch belongs in this box.
```

As we can see, the first instance of the word `this` occurs at the beginning of the string; while the second instance of the word `this` occurs towards the end of the string.

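If we used `this` as our regular expression, we would match both instances of the word. To match only the instance at the beginning of the string, we add a caret before it, giving the regular expression `^this`, as shown in the code below: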

In [13]:
# Import re module
import re

# Sample text
sample_text = 'this watch belongs in this box.'

# Create a regular expression object with the regular expression '^this'
regex = re.compile(r'^this')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 4), match='this'>


### The Dollar Sign (`$`)

The dollar sign (`$`) is used to match a sequence of characters when they appear at the end of a string. Let's take a look at an example.

In the code below, our `sample_text` string has the word `watch` written twice:

```
this watch is better than this watch
```

As we can see, the first instance of the word `watch` occurs towards the beginning of the string; while the second instance of the word `watch` occurs at the end of the string.

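If we used `watch` as our regular expression, we would match both instances of the word. To match only the instance at the end of the string, we add a dollar sign after it, giving the regular expression `watch$`, as shown in the code below: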

In [14]:
# Import re module
import re

# Sample text
sample_text = 'this watch is better than this watch'

# Create a regular expression object with the regular expression 'watch$'
regex = re.compile(r'watch$')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(31, 36), match='watch'>
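To compare the plain and anchored patterns side by side, here is a minimal sketch covering both the caret and the dollar sign. The helper `spans` and the variable names are hypothetical and introduced only for this illustration:

```python
import re

caret_text = 'this watch belongs in this box.'
dollar_text = 'this watch is better than this watch'

def spans(pattern, text):
    """Return the span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# Without the caret, both instances of 'this' are matched
print(spans(r'this', caret_text))     # [(0, 4), (22, 26)]
print(spans(r'^this', caret_text))    # [(0, 4)]

# Without the dollar sign, both instances of 'watch' are matched
print(spans(r'watch', dollar_text))   # [(5, 10), (31, 36)]
print(spans(r'watch$', dollar_text))  # [(31, 36)]
```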


# Character Sets

In this lesson, we will continue to look at metacharacters. In particular, we will learn how to look for phone numbers by employing the following metacharacters:

```python
{} []
```

### Finding Phone Numbers

In the code below, our `sample_text` consists of a multi-line string that mimics a phone book:

```
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
```

We can notice that even though all the phone numbers have different digits, they all have the same pattern, namely, 3 digits followed by a single character, followed by 3 more digits, followed by another single character, followed by 4 digits. We will take advantage of this pattern to create a regular expression that can match all these phone numbers. To do this, we will use the special sequence `\d` and the dot (`.`) in our regular expression, as shown in the code below:
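A minimal sketch of this first version of the pattern, with `\d` written out once per digit and a dot for each separator:

```python
import re

sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# 3 digits, any character, 3 digits, any character, 4 digits
regex = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

# Collect the matched text of every match
numbers = [match.group() for match in regex.finditer(sample_text)]
print(numbers)
# ['555-123-4567', '455 555 4549', '655-777-7346', '555)999-8464']
```

Notice that the dot also matches the `)` in the last phone number, so that match starts after the opening parenthesis.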

The sequence `{m}` specifies that exactly `m` copies of the previous regular expression should be matched. For example, the sequence `\d{3}` specifies that exactly `3` copies of the `\d` regular expression should be matched. Therefore, the sequence `\d{3}` is equivalent to the sequence `\d\d\d`.

In [15]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers in our sample_text using the {} metacharacters
regex = re.compile(r'\d{3}.\d{3}.\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>
<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>


### Finding Phone Numbers With Specific Separators

Now let's suppose we only wanted to find phone numbers in which the groups of digits were separated by either a dash (`-`) or a white space (` `). In this case we can use what is known as a **Character Set**. Character sets are specified using the `[]` metacharacters and are used to indicate a set of characters that you wish to match. Let’s see an example.

In the code below, we employ the character set `[- ]` (notice that there is a white space after the dash) in our regular expression to only match phone numbers whose groups of numbers are separated by either a dash (`-`) or a white space (` `):

In [16]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


We can clearly see that now we only match the phone numbers that have either a dash (`-`) or a white space (` `) as a separator. Notice that the last phone number is not matched because, even though its last group of numbers is preceded by a dash (`-`), its first group of numbers is followed by a closing parenthesis `)`, which is not in our character set.

It is important to note that even though a character set can have many characters, it only matches one of those characters at a time. For example, suppose we added a white space after each dash in Mr. Brown's phone number, as shown below:

In [17]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555- 123- 4567
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

We can see that now we get no matches. This is because the character set `[- ]`, used in our regular expression, only matches one of those characters at a time. In other words, in order to get a match there must be either a dash **or** a white space separating the groups of numbers, but **not** both.

### Finding Phone Numbers With Specific Separators and Area Codes

Let's see another example of a character set. Now, let's suppose we only wanted to find phone numbers in which the groups of digits were separated by either a dash or a white space, and that have area code `455` or `655`. Since all the area codes in our `sample_text` end in 55:

```
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
```

Then, in order to find all the phone numbers that have area code `455` or `655`, we only need to indicate that the first digit in the area code must be either a `4` or a `6`. 

To do this, we can use the character set `[46]` in our regular expression to indicate that the first number should be either a `4` or a `6`, as shown in the code below:

In [18]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator and have area
# code 455 or 655
regex = re.compile(r'[46]55[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


We can see that we only get the two phone numbers that have area code `455` and `655`; and that have either a dash or a white space as a separator.

### Finding Phone Numbers With Specific Last Digits

Now let's suppose we wanted to look for phone numbers that end on the numbers `6`, `7`, `8`, or `9`. In this case, we could use the character set `[6789]`. However, there is a more compact way of doing this. **Within** a character set, when a dash (`-`) is placed **between** digits or letters, it is used to specify a range. For example, the character set `[6-9]` is equivalent to the character set `[6789]` and the character set `[a-f]` is equivalent to the character set `[abcdef]`. It is important to note that when a dash is placed at the **beginning** of a character set, as we did in the previous example, the dash is taken **literally**. Let’s see how this works.

In the code below, we will use the character set `[6-9]` in our regular expression to find all the phone numbers that end on the numbers `6`, `7`, `8`, or `9`:

In [19]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that end on the numbers 6, 7, 8, or 9.
regex = re.compile(r'\d{3}.\d{3}.\d{3}[6-9]')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


As we can see, we get all the phone numbers that end on the numbers `6`, `7`, `8`, or `9`. Notice, that the last phone number is not matched because its last digit is a `4`.

Now let's suppose we wanted to find the phone numbers that **do not** end on the numbers `6`, `7`, `8`, or `9`. In this case we could use the character set `[0-5]`. However, we could also use the character set `[^6-9]` (notice the caret (`^`) at the beginning). We already learned that **outside** of a character set, the caret matches a sequence of characters when they are located at the beginning of a string. However, when the caret (`^`) appears at the **beginning** of a character set it **negates** the set. This means it matches everything that is **not** in that character set. For example, the regular expression `[^6-9]` will match any character that is **not** a `6`, `7`, `8`, or `9`, including characters that are not digits at all. Similarly, the regular expression `[^a-zA-Z]` will match any character that is **not** a lowercase or uppercase letter. Let’s see how this works.

In the code below, we will use the character set `[^6-9]` in our regular expression to find all the phone numbers that **do not** end on the numbers `6`, `7`, `8`, or `9`:

In [20]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that do not end on the numbers 6, 7, 8, or 9.
regex = re.compile(r'\d{3}.\d{3}.\d{3}[^6-9]')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>


As we can see, we only get one match since there is only one phone number that doesn't end with the numbers `6`, `7`, `8`, or `9`.
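The rules for ranges, literal dashes, and negated sets can be checked on small strings. The sketch below uses hypothetical test strings chosen only so the results are easy to verify:

```python
import re

digits = '0123456789'

# A range and the equivalent explicit set match the same characters
print(re.findall(r'[6-9]', digits))    # ['6', '7', '8', '9']
print(re.findall(r'[6789]', digits))   # ['6', '7', '8', '9']

# A caret at the beginning negates the set
print(re.findall(r'[^6-9]', digits))   # ['0', '1', '2', '3', '4', '5']

# A dash at the beginning of the set is taken literally
print(re.findall(r'[-6]', '5-6-7'))    # ['-', '6', '-']

# A negated set also matches characters that are not digits
print(re.findall(r'[^6-9]', 'a-7'))    # ['a', '-']
```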

# Finding Complicated Patterns

In this lesson, we will learn how to use the remaining metacharacters in our list, namely:

```python
* + ? | ( )
```
We will employ these metacharacters to find more complicated patterns of text. 

### Finding Names

In the code below, our `sample_text` consists of a multi-line string that contains the names and heights of the 4 highest mountains in the world according to Wikipedia:

```
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
```

Let's create a regular expression that will allow us to find the names of these mountains. The first thing to notice is that the word mountain has been abbreviated in two different ways, as `Mt.` and as `Mt` (without the period). Therefore, if we want to find all the names of the mountains we need to indicate in our regular expression that the period (`.`) in the abbreviation is optional. We can do this by using the `?` metacharacter in our regular expression. The `?` will match 0 or 1 repetitions of the preceding regular expression. For example, the regular expression `ab?` will match either `a` or `ab`. In other words, the `?` after the `b` indicates that the `b` after the `a` is optional. Let’s see how this works.

In the code below, we employ the `?` metacharacter to indicate that the period (`.`) after `Mt` is optional by using the regular expression `Mt\.?`:

In [21]:
# Import re module
import re

# Sample text
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
'''

# Create a regular expression object with a regular expression 'Mt\.?'
regex = re.compile(r'Mt\.?')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 3), match='Mt'>
<_sre.SRE_Match object; span=(28, 31), match='Mt.'>
<_sre.SRE_Match object; span=(51, 53), match='Mt'>
<_sre.SRE_Match object; span=(84, 87), match='Mt.'>


We can clearly see that the regular expression `Mt\.?` was able to match either `Mt` or `Mt.`.

Now let's continue creating our regular expression so that it can match all the mountain names. We continue by matching the next character after the abbreviation. We notice that after each abbreviation there is a white space; therefore, we will use the special sequence `\s` to match it.

After that white space, we have the name of the mountain. We can see that the first letter in all the names is an uppercase letter, so we will use the character set `[A-Z]` to match any possible uppercase letter.

Now comes the tricky part. We can see that the mountain names have different lengths. For example, the third mountain has a long name, `Kangchenjunga`, but the second mountain has a very short name, `K2`. We can get around this problem by noticing that all the names are composed of only alphanumeric characters.

To match any alphanumeric character we will use the special sequence `\w`, and to help us match names of any length we will use the `*` metacharacter. The `*` metacharacter matches 0 or more repetitions of the preceding regular expression. In other words, it matches 0 or as many repetitions as possible of the preceding regular expression. For example, the regular expression `ab*` will match `a` or `a` followed by any number of `b`'s, such as `ab` or `abbbbb`. Let's see how this works.

In the code below, we employ the `*` metacharacter to find the names of the mountains regardless of their length:

In [22]:
# Import re module
import re

# Sample text
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
'''

# Create a regular expression object with a regular expression that can match all the
# mountain names
regex = re.compile(r'Mt\.?\s[A-Z]\w*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 11), match='Mt Everest'>
<_sre.SRE_Match object; span=(28, 34), match='Mt. K2'>
<_sre.SRE_Match object; span=(51, 67), match='Mt Kangchenjunga'>
<_sre.SRE_Match object; span=(84, 94), match='Mt. Lhotse'>


We can see that we managed to match all the mountain names regardless of their length or abbreviation.
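
The behavior of `*` can also be checked in isolation. The sketch below is only an illustration: it uses `fullmatch()` from the standard `re` module, which succeeds only when the whole string fits the pattern, and the test strings are made up for this example.

```python
import re

# ab* means: an 'a' followed by zero or more 'b' characters
star = re.compile(r'ab*')

# fullmatch() returns a match object only if the entire string fits the pattern
results = {s: bool(star.fullmatch(s)) for s in ['a', 'ab', 'abbbbb', 'b']}
print(results)  # {'a': True, 'ab': True, 'abbbbb': True, 'b': False}

# \w* takes as many word characters as it can, then stops at the ':'
longest = re.search(r'K\w*', 'Mt Kangchenjunga: Height').group()
print(longest)  # Kangchenjunga
```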

# Groups

In the code below, we have added a new mountain to our `sample_text` string:

```
Mnt makalu: Height 8,485 m
```

As we can see, the name of this mountain has two differences from the other ones. The first difference is that the word mountain has been abbreviated as `Mnt` instead of `Mt` or `Mt.`. The second difference is that the first letter of the name is lowercase not uppercase. 

To be able to match `Mnt` as well as `Mt` or `Mt.`, we will use the `( )` metacharacters to define a **Group**. As their name suggests, **groups** group together the expressions contained inside them. For example, we saw before that `ab*` will match `a` or `a` followed by any number of `b`'s, such as `ab` or `abbbbb`. But if you put `ab` inside parentheses to define the **group** `(ab)`, then `(ab)*` will match zero or more repetitions of `ab`, for example `ab` or `abababab`. You can repeat the contents of a group with any repeating qualifier, such as `*`, `?`, or `{m}`, that we have seen before. We can also use the OR `|` metacharacter within the group to select between two expressions. Let’s see how this works.

In the code below, we will use the group `(Mt|Mnt)` in our regular expression to be able to match either `Mnt` or `Mt`:

In [23]:
# Import re module
import re

# Sample text
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
Mnt makalu: Height 8,485 m
'''

# Create a regular expression object with a regular expression that can match all the
# mountain names
regex = re.compile(r'(Mt|Mnt)\.?\s[a-zA-Z]\w*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 11), match='Mt Everest'>
<_sre.SRE_Match object; span=(28, 34), match='Mt. K2'>
<_sre.SRE_Match object; span=(51, 67), match='Mt Kangchenjunga'>
<_sre.SRE_Match object; span=(84, 94), match='Mt. Lhotse'>
<_sre.SRE_Match object; span=(111, 121), match='Mnt makalu'>


As we can see, we were able to match all the mountain names, including the new one. Also, notice that we added lowercase letters, `[a-zA-Z]`, to our previous character set in our regular expression. This was done in order to be able to match the first lowercase letter of the new name. 
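
The two ideas used above, repeating a whole group and choosing between alternatives with `|`, can also be checked on their own. This is a small sketch with made-up test strings; `fullmatch()` from the standard `re` module is used so that only complete matches count.

```python
import re

# Without a group, * applies only to the 'b'
no_group = bool(re.fullmatch(r'ab*', 'abab'))
# With a group, * applies to the whole 'ab'
with_group = bool(re.fullmatch(r'(ab)*', 'abab'))
print(no_group, with_group)  # False True

# | inside a group selects between the alternatives
candidates = ['Mt', 'Mnt', 'Mount']
accepted = [c for c in candidates if re.fullmatch(r'(Mt|Mnt)', c)]
print(accepted)  # ['Mt', 'Mnt']
```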

We should point out that, since the first letter in both abbreviations is an `M`, we could have put the `M` outside of the group and gotten the same result, as shown below:

In [24]:
# Import re module
import re

# Sample text
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
Mnt makalu: Height 8,485 m
'''

# Create a regular expression object with a regular expression that can match all the
# mountain names
regex = re.compile(r'M(t|nt)\.?\s[a-zA-Z]\w*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 11), match='Mt Everest'>
<_sre.SRE_Match object; span=(28, 34), match='Mt. K2'>
<_sre.SRE_Match object; span=(51, 67), match='Mt Kangchenjunga'>
<_sre.SRE_Match object; span=(84, 94), match='Mt. Lhotse'>
<_sre.SRE_Match object; span=(111, 121), match='Mnt makalu'>


### Finding Email Addresses Revisited

In the cell below, our `sample_text` consists of a multi-line string with four different email addresses. Write a regular expression that is able to find all these email addresses. As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

**HINTS:** Notice that the part before the `@` symbol contains only lowercase letters, underscores, and numbers. To match this part of the email address we can use the character set `[a-z_0-9]` followed by the `+` metacharacter, to account for the fact that all email addresses must have at least one character before the `@` symbol. The `+` metacharacter matches 1 or more repetitions of the preceding regular expression. For example, `ab+` will match `a` followed by any non-zero number of `b`’s, such as `ab` or `abb`, but it will not match just `a`.

The `@` symbol is not a metacharacter, so we can match it directly without escaping it. Also, notice that the domain names contain lowercase letters, uppercase letters, underscores, and dashes. Again we can use a character set, `[a-zA-Z_-]`, followed by the `+` metacharacter, to account for the fact that all domains must have at least one character after the `@` symbol. To match a literal dot (`.`), we need to escape it with a backslash (`\.`) because the dot is a metacharacter. You can use the character set `[a-z]+` to match either `com`, `edu`, or `gov`.

To match the last email address you need to add an optional dot followed by another character set of only lowercase letters.
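
Before writing the full pattern, the two building blocks from the hints can be tried separately. The sketch below is only one possible way to express them: the test strings are hypothetical, and wrapping the optional suffix in a group followed by `?` is an assumption about how to make the extra dot and letters optional together.

```python
import re

# ab+ needs at least one 'b', so a bare 'a' is rejected
plus = [bool(re.fullmatch(r'ab+', s)) for s in ['a', 'ab', 'abb']]
print(plus)  # [False, True, True]

# (...)? makes a second dot-separated suffix optional
suffix = re.compile(r'\.[a-z]+(\.[a-z]+)?')
one_part = suffix.search('example.com').group()
two_parts = suffix.search('example.com.nl').group()
print(one_part)   # .com
print(two_parts)  # .com.nl
```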

In [25]:
# Import re module
import re

# Sample text
sample_text = '''
fake_email@fake-email.edu
fakeemail43@fake_email.com
fake891_email@fakemail.gov
52fake_email@FAKE_email.com.nl
'''

# Create a regular expression object with a regular expression that can match all
# the email addresses. The dash is allowed in the domain name, and the optional
# group at the end matches a second suffix such as '.nl'
regex = re.compile(r'[a-z_0-9]+@[a-zA-Z_-]+\.[a-z]+(\.[a-z]+)?')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 26), match='fake_email@fake-email.edu'>
<_sre.SRE_Match object; span=(27, 53), match='fakeemail43@fake_email.com'>
<_sre.SRE_Match object; span=(54, 80), match='fake891_email@fakemail.gov'>
<_sre.SRE_Match object; span=(81, 111), match='52fake_email@FAKE_email.com.nl'>


# Substitutions

As we mentioned at the beginning of this lesson, the `re` module also has functions that allow us to modify strings. Regex objects have the `.sub()` method, which allows us to replace patterns within a string. Let's see an example.

In the code below we have a multi-line string that contains two instances of the ampersand character, `&`. Let's use the `.sub()` method to replace these ampersands with the word `and`. First we will create a regular expression that matches all the `&` characters in our string. Then we will use `regex.sub(r'and', sample_text)` to replace every match of the `regex` expression in the `sample_text` with the raw string `and`. Let's see this in action:

In [26]:
# Import re module
import re

# Sample text
sample_text = '''
Ben & Jerry
Jack & Jill
'''

# Create a regular expression object with the regular expression '&'
regex = re.compile(r'&')

# Substitute all & in the sample_text with 'and'
new_text = regex.sub(r'and', sample_text)

# Print Original and Modified texts
print('Original text:', sample_text)
print('Modified text:', new_text)

Original text: 
Ben & Jerry
Jack & Jill

Modified text: 
Ben and Jerry
Jack and Jill



We can see that we have successfully replaced all the `&` characters with the word `and`. Being able to make these kinds of substitutions can be really useful and can save you a lot of time if you are working with large documents that you need to reformat.
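
Two details of `.sub()` are worth checking with a small sketch: it returns a new string and leaves the original untouched, and its optional `count` argument limits how many matches are replaced. The sample string below is made up for this illustration.

```python
import re

# Hypothetical sample with three ampersands
text = 'Ben & Jerry & Jack & Jill'
regex = re.compile(r'&')

# Replace every match (the default, count=0)
replaced_all = regex.sub('and', text)

# Replace only the first match
replaced_first = regex.sub('and', text, count=1)

print(replaced_all)    # Ben and Jerry and Jack and Jill
print(replaced_first)  # Ben and Jerry & Jack & Jill
print(text)            # Ben & Jerry & Jack & Jill (unchanged)
```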

# Substitutions with Groups

We can do more sophisticated substitutions by using groups. Let's see an example. In the code below we have a multi-line string that contains the names of 4 people. As we can see, some people have middle names but others don't. Let's use the `.sub()` method to replace all names in the string with just the first and last name. For example, the name `John David Smith` should be replaced by `John Smith` and `Alice Jackson` should stay the same.

The first step is to create a regular expression that matches all the names in the list. Keeping in mind that we need to make replacements later, we will use groups to distinguish between the first name, the middle name, and the last name. Since all names have a first name, we can use the group `([a-zA-Z]+)` to match the first names. Not all names have a middle name, so the middle name is optional. Since the first and middle names are separated by a space, that space must be optional as well. To indicate that the space and the middle name are optional, we add the `?` metacharacter after the space and after the second group: `[ ]?([a-zA-Z]+)?`. After the first or middle name there is a space that we can match with `[ ]`. Notice that in this case we didn't use the special sequence `\s`, since it also matches newlines and we don't want to match those. Finally, we add a third group to match the last name. Since all names have a last name, we don't need the `?` metacharacter here. Putting it all together we get:

In [27]:
# Import re module
import re

# Sample text
sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''

# Create a regular expression object with a regular expression that can find all
# the names in the sample_text and group the first, middle, and
# last names separately
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 17), match='John David Smith'>
<_sre.SRE_Match object; span=(18, 31), match='Alice Jackson'>
<_sre.SRE_Match object; span=(32, 53), match='Mary Elizabeth Wilson'>
<_sre.SRE_Match object; span=(54, 64), match='Mike Brown'>


We can clearly see that we matched all four names in our list. Now, the cool thing about using groups is that we can reference them individually from the Match Objects using the `.group()` method. The `.group(N)` method selects the `N`th group in the match. Therefore, in our particular case, for each match, `.group(1)` will select the first name, `.group(2)` will select the middle name, and `.group(3)` will select the last name. Let's see how this works in the code below:

In [28]:
# Import re module
import re

# Sample text
sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''

# Create a regular expression object with a regular expression that can find all
# the names in the sample_text and group the first, middle, and
# last names separately
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# For each match print the first, middle, and last name separately
for match in matches:
    print('\nFirst Name: '+ match.group(1))
    
    if match.group(2) is None:
        print('Middle Name: None')
    else:
        print('Middle Name: '+ match.group(2))
    print('Last Name: '+ match.group(3))


First Name: John
Middle Name: David
Last Name: Smith

First Name: Alice
Middle Name: None
Last Name: Jackson

First Name: Mary
Middle Name: Elizabeth
Last Name: Wilson

First Name: Mike
Middle Name: None
Last Name: Brown


We can see that for each of the four matches we can select the first, middle, or last name individually. We should also mention that `.group(0)` (or equivalently `.group()`) returns the entire match, that is, the whole text matched by the regular expression.

Now that we know how to select groups individually for each match, we are ready to use the `.sub()` method to make substitutions. Remember, `regex.sub(r'string', sample_text)` will replace every match of the `regex` expression in the `sample_text` with the raw string `string`. In our case, we want to replace every match with only the first and last names, or equivalently, with the first and third groups. We can refer to a group inside the `string` by using a backslash followed by the group number. For example, `regex.sub(r'\1', sample_text)` will replace every match with the first group. Here we have referenced the first group by using `\1` inside the `string`. Let's put it all together to see how it works:
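
The difference between `.group(0)` and the numbered groups is easy to confirm on a single name. The sketch below is an illustration only: it also uses the `.groups()` method of match objects, which is not covered in this lesson and returns all the groups as a tuple.

```python
import re

regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')
match = regex.search('Alice Jackson')

# group(0) and group() both return the entire matched text
whole = match.group(0)
print(whole)                     # Alice Jackson
print(match.group() == whole)    # True

# Numbered groups return the individual pieces
print(match.group(1), match.group(3))  # Alice Jackson

# groups() returns every group; an unmatched optional group is None
parts = match.groups()
print(parts)  # ('Alice', None, 'Jackson')
```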

In [29]:
# Import re module
import re

# Sample text
sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''

# Create a regular expression object with a regular expression that can find all
# the names in the sample_text and group the first, middle, and
# last names separately
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Substitute all names in the sample_text with the first and last name
new_text = regex.sub(r'\1 \3', sample_text)

# Print the modified text
print(new_text)


John Smith
Alice Jackson
Mary Wilson
Mike Brown
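
Group references in the replacement string do not have to appear in their original order. As a small sketch under that assumption, the same pattern can rewrite each name as `Last, First`; the two-name sample is made up for this illustration.

```python
import re

regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Hypothetical two-line sample
names = 'John David Smith\nAlice Jackson'

# \3 is the last name and \1 is the first name
reordered = regex.sub(r'\3, \1', names)
print(reordered)
# Smith, John
# Jackson, Alice
```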



# Flags

We saw at the beginning of this lesson that regexes are case sensitive, so we often have to write regexes that cover both uppercase and lowercase letters. However, the `re.compile(pattern, flags)` function has a `flags` argument that allows more flexibility. For example, the `re.IGNORECASE` flag can be used to perform **case-insensitive** matching. In the code below we have a string that contains the name Walter written in two different combinations of uppercase and lowercase letters. To find these two renditions of Walter without a flag, we would have to use a character set for every letter to account for all possible combinations of uppercase and lowercase letters. Instead, we can use the `re.IGNORECASE` flag to indicate that we don't care about the case of the letters; we just want to find the name Walter no matter how it is written. Let's see how this works:

In [30]:
# Import re module
import re

# Sample text
sample_text = 'Alice and WaLtEr Brown are talking with wAlTer Jackson.'

# Create a regular expression object with the regular expression 'walter'
# that ignores the case of the letters
regex = re.compile(r'walter', re.IGNORECASE)

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(10, 16), match='WaLtEr'>
<_sre.SRE_Match object; span=(40, 46), match='wAlTer'>


We can clearly see that we were able to match both renditions of `walter` without any fancy regular expression. 
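
For comparison, the sketch below runs the same pattern with and without the flag, using `re.findall()` to collect the matched text. The variable names are only for illustration.

```python
import re

sample_text = 'Alice and WaLtEr Brown are talking with wAlTer Jackson.'

# Without the flag, the lowercase pattern matches neither spelling
case_sensitive = re.findall(r'walter', sample_text)

# With re.IGNORECASE, both spellings are found
case_insensitive = re.findall(r'walter', sample_text, re.IGNORECASE)

print(case_sensitive)    # []
print(case_insensitive)  # ['WaLtEr', 'wAlTer']
```

The shorter alias `re.I` refers to the same flag.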

We have seen a lot in this lesson, and we have just begun to scratch the surface of regular expressions. For more information on regexes, make sure to check out the Python [Regex Documentation](https://docs.python.org/2/library/re.html#module-re).