# Regular Expressions in Python

Regular expressions give us a language for writing patterns that search through strings. They are useful when we want to find certain kinds of text, such as phone numbers, email addresses, or URLs. Most programming languages support regular expressions, either built in or through a standard library. Once you have learned the syntax, you can use it on many platforms (e.g., operating systems, programming languages, search engines), although the details vary slightly between them.

**Table of Contents**

1. [Raw Strings](#raw)
2. [Create a pattern](#pattern)
3. [An example of a metacharacter](#meta)
4. [Introducing more metacharacters](#meta2)
5. [Introducing anchors](#anchors)
6. [Example: Matching phone numbers](#phone)
7. [Using a character set](#set)
8. [Character ranges with dash](#dash)
9. [The ^ character for negating](#carot)
10. [Quantifiers](#quant) 
11. [Groups](#groups)
12. [Replace using groups](#replace)
13. [Python flags](#flags)
14. [Your turn: Solve simple problems](#turn)

In [None]:
import re

In [None]:
someText = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789

wendy@wellesley.edu
www.wellesley.edu

oh ohohoh

Metacharacters that need to be escaped

. ^ $ * + ? { } [ ] \ | ( )

781-283-3190
800.255.4398
781 305 0000

Mrs. Robinson
Ms Gardner
Mr Potter
Mr. Bond
Mr T

e-mail for up-to-date news
our_values
'''

sentence = "Two roads diverged in a yellow wood, And sorry I could not travel both"

<a id="raw"></a>
## 1. Raw strings

Escape sequences such as `\t` or `\n` are interpreted by Python as special characters (a tab and a newline). With raw string notation, the backslash is kept as a literal character and the sequence is not interpreted. The notation is simply the letter `r` preceding the string. Below we show how the `\n` sequence is treated differently when printing a normal string and a raw string.

In [None]:
line = "this is a line.\n"
print(line)

In [None]:
line = r"this is a line.\n"
print(line)
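
Comparing lengths is another way to see the difference. The sketch below uses two illustrative variables, `normal` and `raw`, which are not part of the notebook's data.

```python
normal = "a\nb"   # \n is interpreted as one newline character
raw = r"a\nb"     # the backslash and the letter n stay as two characters

print(len(normal))      # 3
print(len(raw))         # 4
print(raw == "a\\nb")   # True: a raw string is a shorthand for doubled backslashes
```

This is why regular expression patterns are usually written as raw strings: the backslashes reach the `re` module unchanged.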

<a id="pattern"></a>
## 2. Create a pattern

To create a pattern of characters to find in a string, we use the function `compile`.

In [None]:
pattern = re.compile("edu")
pattern

In the output, notice that the pattern is displayed as a "raw" string inside the compiled object, even though we didn't use the `r` prefix in our input.
Now that we have a pattern, we can try to find matches for it in the text. There are two methods we could use: `findall` and `finditer`.

In [None]:
pattern.findall(someText)

In [None]:
for p in pattern.finditer(someText):
    print(p)

Notice that, unlike `findall`, `finditer` returns match objects that contain the "span" of the found text, that is, its start and end indices, which can be used for string slicing.

We can use the methods `start`, `end`, and `group` to access the start and end indices of the slice, as well as the text that was found.

In [None]:
for p in pattern.finditer(someText):
    print(p.start(), p.end(), p.group())
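
To make the link between spans and slicing concrete, here is a minimal sketch. The string `text` is an illustrative stand-in for `someText`, and the names `spans` and `slices` are introduced only for this example.

```python
import re

text = "wendy@wellesley.edu and www.wellesley.edu"  # illustrative string
pattern = re.compile("edu")

# Collect the (start, end) pair of every match
spans = [(m.start(), m.end()) for m in pattern.finditer(text)]

# Slicing the original string with those indices recovers the matched text
slices = [text[start:end] for start, end in spans]

print(spans)
print(slices)   # ['edu', 'edu']
```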

<a id="meta"></a>
## 3. An example of a metacharacter

Let's assume we want to search for the period character.

In [None]:
pattern = re.compile(".")
pattern

In [None]:
for p in pattern.finditer(someText):
    print(p)

Wow, we got everything! This is because the period is a special character in regular expressions: it matches any character except a newline. If we want to search for a literal period, we need to escape it with a backslash.

In [None]:
pattern = re.compile(r"\.")  # raw string, so the backslash reaches the regex engine unchanged
pattern

In [None]:
for p in pattern.finditer(someText):
    print(p)

Notice that this time we only got the strings that match the period and not every other character.
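
When a longer piece of text should be matched literally, escaping each metacharacter by hand gets tedious. One option, not used elsewhere in this notebook, is the standard library function `re.escape`. The sketch below uses an illustrative `sample` string.

```python
import re

literal = "www.wellesley.edu"     # text we want to match exactly as written
escaped = re.escape(literal)      # adds a backslash before each period
pattern = re.compile(escaped)

sample = "www.wellesley.edu wwwXwellesleyYedu"  # illustrative text

print(escaped)
print(pattern.findall(sample))        # only the literal address
print(re.findall(literal, sample))    # unescaped: '.' also matches X and Y
```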

<a id="meta2"></a>
## 4. Introducing more metacharacters

Here is a list of some of the most commonly used metacharacters. Remember, a metacharacter is a character that has a special meaning during pattern processing.

```
.  - any character but the new line
\d - digits 0 to 9
\D - not a digit
\w - word characters (a-z, A-Z, 0-9, _)
\W - not a word character
\s - whitespace (tab, space, newline)
\S - not whitespace
```

In [None]:
pattern = re.compile(r'\d')
for p in pattern.finditer(someText):
    print(p)

We can see that `\d`  matches every digit, but `\D` does the opposite and matches every non-digit:

In [None]:
pattern = re.compile(r'\D')
for p in pattern.finditer(someText):
    print(p)

Meanwhile, `\w` matches every word character (letters, digits, and underscore):

In [None]:
pattern = re.compile(r'\w')
for p in pattern.finditer(someText):
    print(p)

While `\W` matches all non-word characters:

In [None]:
pattern = re.compile(r'\W')
for p in pattern.finditer(someText):
    print(p)

We can also match all space characters with `\s`:

In [None]:
pattern = re.compile(r'\s')
for p in pattern.finditer(someText):
    print(p)

While `\S` will do the opposite and match all non-space characters.

In [None]:
pattern = re.compile(r'\S')
for p in pattern.finditer(someText):
    print(p)
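
The output on `someText` is long, so a compact comparison on a short string may help. The string `sample` and the variable names below are illustrative and not part of the notebook's data.

```python
import re

sample = "Room 42, floor_3!"  # illustrative string

digits = re.findall(r'\d', sample)       # ['4', '2', '3']
spaces = re.findall(r'\s', sample)       # [' ', ' ']
non_word = re.findall(r'\W', sample)     # [' ', ',', ' ', '!']
word_chars = re.findall(r'\w', sample)   # letters, digits, and the underscore

print(digits, spaces, non_word)
print(len(word_chars))   # 13
```

Every character is either a word character or a non-word character, so the two counts add up to the length of the string.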

<a id="anchors"></a>
## 5. Introducing Anchors

In addition to the metacharacters we saw, there are some special characters that match invisible positions before or after characters. They are always used in conjunction with other patterns and are known as **anchors**:

```
\b - word boundary
\B - not a word boundary
^ - start of a string
$ - end of a string
```

In [None]:
pattern = re.compile(r'\boh')
for p in pattern.finditer(someText):
    print(p)

It matched the two "oh" strings that start at a word boundary, but didn't match the other two "oh"s. Meanwhile, `\B` does the opposite and matches the `oh`s that do not start at a word boundary.

In [None]:
someText[106:115]

In [None]:
pattern = re.compile(r'\Boh')
for p in pattern.finditer(someText):
    print(p)

Let's look at `^` that finds patterns at the beginning of text.

In [None]:
pattern = re.compile(r'^Two')
for p in pattern.finditer(sentence):
    print(p)

It found the pattern "Two" at the beginning. If we look for a fragment that is not at the start of the sentence, there is no match:

In [None]:
pattern = re.compile(r'^road')
for p in pattern.finditer(sentence):
    print(p)

We know that "road" is in the sentence, but it's not at the beginning:

In [None]:
pattern = re.compile(r'road')
for p in pattern.finditer(sentence):
    print(p)

Similarly, we can use `$` to search for a pattern at the end of a string. Notice that `$` is placed at the end of the pattern.

In [None]:
pattern = re.compile(r'both$')
for p in pattern.finditer(sentence):
    print(p)

In the same way, if we search for something that's in the string but not at the end, it will not be found:

In [None]:
pattern = re.compile(r'travel$')
for p in pattern.finditer(sentence):
    print(p)
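
A common use of `\b` is to place it on both sides of a pattern so that only whole words match. The following is a small sketch with an illustrative string, not part of the notebook's data.

```python
import re

text = "cat concatenate cats cat."  # illustrative string

anywhere = re.findall(r'cat', text)         # also matches inside other words
whole_word = re.findall(r'\bcat\b', text)   # only the standalone word "cat"

print(len(anywhere))     # 4
print(len(whole_word))   # 2
```

The final "cat." still counts, because the period is not a word character and therefore creates a boundary after the "t".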

<a id="phone"></a>
## 6. Matching phone numbers

There are three phone numbers in our text, each written with a different separator:

```
781-283-3190
800.255.4398
781 305 0000
```

We can start by matching the first three digits:

In [None]:
pattern = re.compile(r'\d\d\d')
for p in pattern.finditer(someText):
    print(p)

We can use `.` to match any character, in this case the hyphen, the period itself, or the space. Then we match the other groups of digits as above. The final result looks like this:

In [None]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
for p in pattern.finditer(someText):
    print(p)

A more succinct way of writing the pattern for phone numbers is shown in the section about Quantifiers further down in this notebook.

<a id="set"></a>
## 7. Using a character set

The period we used above can match any character as a separator for the phone numbers. If we want to restrict what separators to accept, we can use a character set, denoted by the use of square brackets.

In [None]:
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')
for p in pattern.finditer(someText):
    print(p)

If the number uses a different separator, for example a space as in the cell below, then the pattern will not match anything:

In [None]:
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')
for p in pattern.finditer("781 283 3190"):
    print(p)

**Question:** How would you modify the pattern to also accept a space as a separator in the phone numbers? Try it out in the cell above.

Let's see another use for character sets, finding 800 and 900 numbers:

In [None]:
numbers = """800-200-4000
900.234.5678
300-211-9087"""

In [None]:
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')
for p in pattern.finditer(numbers):
    print(p)

**Note:** A character set matches exactly one character, which must be one of the characters in the set.
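
A short sketch of that rule, using an illustrative string with spelling variants:

```python
import re

text = "gray grey graey gry"  # illustrative string

matches = re.findall(r'gr[ae]y', text)
print(matches)   # ['gray', 'grey']
```

"graey" is rejected because the set consumes only one of the two vowels, and "gry" is rejected because the set must consume exactly one character.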

<a id="dash"></a>
## 8. Character ranges with dash

The special character dash, when in between two other characters in a character set, serves to create a range.

In [None]:
# find all digits between 0 and 5
pattern = re.compile(r'[0-5]')
for p in pattern.finditer(numbers):
    print(p)

Notice that it's matching only the digits 0 to 5. Let's try range with letter characters. As a reminder, here is our sentence:

In [None]:
sentence

In [None]:
pattern = re.compile(r'[a-e]')
for p in pattern.finditer(sentence):
    print(p)

We can use character range sets for uppercase letters too:

In [None]:
pattern = re.compile(r'[A-Z]')
for p in pattern.finditer(sentence):
    print(p)

<a id="carot"></a>
## 9. The ^ character for negating

The character ^ within a character set behaves differently from when it is outside. In this case, it will negate the content of the set, so that the pattern matches everything that is not in the set.

In [None]:
phrase = "My number is: 555"

In [None]:
pattern = re.compile(r'[^a-zA-Z]')
for p in pattern.finditer(phrase):
    print(p)

For the phrase above, the pattern matches only the white space, the colon, and the digits, which are non-letter characters.

Below is another use of ^ for negating a character set:

In [None]:
words = """cat
bat
mat
pat
tat"""

pattern = re.compile(r'[^bp]at') # find three letter words ending with at, but that don't start with b or p
for p in pattern.finditer(words):
    print(p)
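
To contrast the two roles of `^`, here is a minimal sketch on an illustrative word. The variable names are introduced only for this example.

```python
import re

word = "banana"  # illustrative string

as_anchor = re.findall(r'^b', word)      # anchor: "b" only at the start of the string
as_negation = re.findall(r'[^b]', word)  # negated set: every character except "b"
combined = re.findall(r'^[^b]', word)    # a first character that is not "b"

print(as_anchor)     # ['b']
print(as_negation)   # ['a', 'n', 'a', 'n', 'a']
print(combined)      # []
```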

<a id="quant"></a>
## 10. Quantifiers

All the examples so far find one character at a time. Even when we found the phone number, we used a complex pattern. But, that doesn't need to be the case. We can use other special characters to look for repeating patterns.

```
*       - 0 or more
+       - 1 or more
?       - 0 or 1
{3}     - exact number
{3,4}   - range of numbers (min,max), with no space after the comma
```

Let's rewrite the pattern for phone numbers, by using one of these quantifiers, that specifies an exact number of digits:

In [None]:
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
for p in pattern.finditer(someText):
    print(p.group())
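
To see how the quantifiers differ from each other, the sketch below applies each one to the same illustrative string, in which only the number of "a" characters varies.

```python
import re

text = "ct cat caat caaat caaaat"  # illustrative string

zero_or_more = re.findall(r'ca*t', text)      # any number of "a", including none
one_or_more = re.findall(r'ca+t', text)       # at least one "a"
zero_or_one = re.findall(r'ca?t', text)       # at most one "a"
exactly_three = re.findall(r'ca{3}t', text)   # exactly three "a"
two_to_three = re.findall(r'ca{2,3}t', text)  # between two and three "a"

print(zero_or_more)    # ['ct', 'cat', 'caat', 'caaat', 'caaaat']
print(one_or_more)     # ['cat', 'caat', 'caaat', 'caaaat']
print(zero_or_one)     # ['ct', 'cat']
print(exactly_three)   # ['caaat']
print(two_to_three)    # ['caat', 'caaat']
```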


Often, we don't know the length of a sequence as we do with phone numbers, thus, we need to use the other quantifiers. Let's do that to try to get all Misters from this string of names:

In [None]:
names = """Mrs. Robinson
Ms Gardner
Mr Potter
Mr. Bond
Mr T"""

**Step 1: Find Mr.** 

In [None]:
pattern = re.compile(r'Mr\.') # escape the period
for p in pattern.finditer(names):
    print(p)

This finds Mr. but misses Mr when it is not followed by a period. By using the question mark we can specify that the period may appear 0 or 1 times.

**Step 2: Find Mr. and Mr**

In [None]:
pattern = re.compile(r'Mr\.?')
for p in pattern.finditer(names):
    print(p)

This catches all of them, but also Mr in Mrs. Robinson. Now, let's get the rest of the name:

**Step 3: Find name**

In [None]:
pattern = re.compile(r'Mr\.?\s[A-Z]')
for p in pattern.finditer(names):
    print(p)

We followed the question mark with the whitespace character `\s`, and then a character set for uppercase letters. Finally, to get the whole names, we can add `\w*` to match zero or more word characters.

**Step 4: Find the complete names**

In [None]:
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')
for p in pattern.finditer(names):
    print(p)

It worked! We found all strings about misters, despite their different structure.
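
If only the matched text is needed, the same pattern can be passed to `findall`. The sketch below repeats the `names` string so that it runs on its own; the variable `misters` is introduced only for this example.

```python
import re

names = """Mrs. Robinson
Ms Gardner
Mr Potter
Mr. Bond
Mr T"""

misters = re.findall(r'Mr\.?\s[A-Z]\w*', names)
print(misters)   # ['Mr Potter', 'Mr. Bond', 'Mr T']
```

"Mrs. Robinson" is no longer matched, because after "Mr" and the optional period the pattern requires whitespace, and it finds an "s" instead.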

<a id="groups"></a>

## 11. Groups

The special characters `( )` are used to create groups, and often the pipe character `|` (that means OR), is used together with them. Groups allow for the definition of more complex patterns. For example, we can modify the code from **Step 4** above to also include female titles:

In [None]:
pattern = re.compile(r'(Mrs|Ms|Mr)\.?\s[A-Z]\w*') # match Mrs, Ms, or Mr, with or without a period
for p in pattern.finditer(names):
    print(p)
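
One thing to keep in mind once a pattern contains a group: `findall` then returns the content of the group rather than the whole match. The sketch below shows this with a shortened copy of `names`; the variables `only_titles` and `full_matches` are illustrative.

```python
import re

names = """Mrs. Robinson
Ms Gardner
Mr Potter"""

pattern = re.compile(r'(Mrs|Ms|Mr)\.?\s[A-Z]\w*')

only_titles = pattern.findall(names)                        # content of the group
full_matches = [m.group() for m in pattern.finditer(names)] # the whole match

print(only_titles)    # ['Mrs', 'Ms', 'Mr']
print(full_matches)   # ['Mrs. Robinson', 'Ms Gardner', 'Mr Potter']
```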

Let's look at another example, matching email addresses:

In [None]:
emails = """harry.potter@hogwards.edu
hgranger@gryffindor-house.info
ron_weasley@theburrow.com
"""

We can start by thinking of the simplest pattern, some characters, the @ symbol, and some more characters. If we use `\w` to match word characters, we can write:

In [None]:
pattern = re.compile(r'\w+@\w+')
for p in pattern.finditer(emails):
    print(p)

We got all the emails, but they are truncated at characters such as the period or dash. Here is another try:

In [None]:
pattern = re.compile(r'\w+@\w+\.(edu|info|com)')
for p in pattern.finditer(emails):
    print(p)

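We got only two addresses, and the first one is still cut off at the period in `harry.potter`. Hermione's address is missing because its domain contains a dash, which we have not accounted for. We can allow each of these separators with a character set that appears 0 or 1 times.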

In [None]:
pattern = re.compile(r'\w+[-.]?\w+@\w+[-]?\w+\.(edu|info|com)')
for p in pattern.finditer(emails):
    print(p)

Finally, to cover more email domain endings, we can replace the group with repeated characters:

In [None]:
pattern = re.compile(r'\w+[-.]?\w+@\w+[-]?\w+\.\w+')
for p in pattern.finditer(emails):
    print(p)

By grouping the various elements of the email structure, we can access each of them separately with the method `group`. Without an argument, this method returns the entire match:

In [None]:
pattern = re.compile(r'(\w+[-.]?\w+)@(\w+[-]?\w+)(\.\w+)') # notice three groups
for p in pattern.finditer(emails):
    print(p.group())

By providing the indices of the groups: 1, 2, 3, we can access each group separately:

In [None]:
for p in pattern.finditer(emails):
    print(p.group(1)) # 1st group is the email account

In [None]:
for p in pattern.finditer(emails):
    print(p.group(2)) # 2nd group is the email domain server

In [None]:
for p in pattern.finditer(emails):
    print(p.group(3)) # 3rd group is the domain ending
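
All groups can also be retrieved at once with the match method `groups`, which returns them as a tuple. The sketch below repeats the `emails` string so that it runs on its own, and uses a version of the pattern without a quantifier on the first group; `parts` is an illustrative variable name.

```python
import re

emails = """harry.potter@hogwards.edu
hgranger@gryffindor-house.info
ron_weasley@theburrow.com
"""

pattern = re.compile(r'(\w+[-.]?\w+)@(\w+[-]?\w+)(\.\w+)')

# One tuple per match: (account, domain server, domain ending)
parts = [m.groups() for m in pattern.finditer(emails)]

for account, server, ending in parts:
    print(account, server, ending)
```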

<a id="replace"></a>
## 12. Replace using groups

Until now we have only used regular expressions to find patterns in text, but often, we are interested in replacing something we find. Groups can be very useful to do this, because they serve as indices to access parts of a matched pattern. This is done through a mechanism known as back references.

For this example, we will use a few URLs. We want to identify irrelevant parts and "remove" them, so that we only have the domain names such as __google.com__, etc.

In [None]:
urls = """
https://www.google.com
https://nytimes.com
https://www.wellesley.edu
http://facebook.com
"""

Below we use the group syntax, `( )`, to define three groups for the parts that usually make up a URL's domain: an optional `www.` prefix, the domain name, and the domain ending.

In [None]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)') # notice three groups, one for each part of the domain
pattern

We'll use the function `sub`, which works on one string at a time, so we'll first split the text into a list of URLs:

In [None]:
urls = urls.split()
urls

The function `sub` takes three arguments: the pattern, a replacement string, and the original string. In the code below, our replacement string is composed of **back references**, that is, indices that refer to groups in the pattern. The command says: if you find the pattern in the string, replace the matched text with the content of the replacement string.

In [None]:
for url in urls:
    print(re.sub(pattern, r'\2\3', url))

As we can see, the regex matches the whole URL, but replaces it with the content of groups 2 and 3, which contain the domain name and ending.
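
Back references can also reorder the groups, not just drop some of them. Here is a minimal sketch with illustrative data (`full_names`) that swaps first and last names:

```python
import re

full_names = ["Harry Potter", "Hermione Granger"]  # illustrative data
pattern = re.compile(r'(\w+)\s(\w+)')              # group 1: first name, group 2: last name

# \2 and \1 in the replacement string refer to the groups above
swapped = [pattern.sub(r'\2, \1', name) for name in full_names]
print(swapped)   # ['Potter, Harry', 'Granger, Hermione']
```

The replacement string is written as a raw string so that `\1` and `\2` are passed through as back references instead of being read by Python as escape sequences.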

To learn more about the syntax of the `sub` function, check out its documentation with `help`:

In [None]:
help(re.sub)

<a id="flags"></a>
## 13. Python flags

The Python module `re` has so-called flags that can be passed to functions such as `re.compile`, `re.findall`, or `re.finditer` to change how a pattern behaves. Here are some flags:

```
re.IGNORECASE or re.I
re.ASCII or re.A
re.LOCALE or re.L
```

Each of them does something different. For example, `re.ASCII` makes character classes such as `\w`, `\d`, and `\s` match only ASCII characters (by default they match Unicode characters). `re.IGNORECASE` makes the pattern ignore the difference between uppercase and lowercase letters. Here is an example: find all instances of a word, independently of case.

In [None]:
pattern = re.compile("python", re.IGNORECASE)
text = "Python is fun. python is powerful. I love PYTHON!"

for p in pattern.finditer(text):
    print(p)

Remember that if you only want the text, you can always use the function `findall`:

In [None]:
matches = re.findall(r'python', text, re.IGNORECASE)
matches
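
The effect of `re.ASCII` is easiest to see on a word with a non-ASCII letter. The sketch below uses an illustrative string; the names `word`, `unicode_match`, and `ascii_match` are introduced only for this example.

```python
import re

word = "caf\u00e9"  # "café", written with an escape for the accented letter

unicode_match = re.findall(r'\w+', word)            # default: \w matches Unicode letters
ascii_match = re.findall(r'\w+', word, re.ASCII)    # \w limited to a-z, A-Z, 0-9, _

print(unicode_match)   # ['café']
print(ascii_match)     # ['caf']
```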

<a id="turn"></a>
## 14. Your Turn: Solve simple problems

Put to action the things you learned above:  

**Ex. 1:** Find all the words that start with "a" and end with "e" (independently of case). Here is a sentence to try: "Alice asked for an apple, an envelope, and an artichoke. What an astute girl!"  The expected result is: ['Alice', 'apple', 'artichoke', 'astute'].

**Ex. 2:** Replace all occurrences of whitespace (space, tab, newline) with a single space. Here is an example text: "This\nstring\tcontains multiple\n\tspaces." You need to use the function `sub`.

**Ex. 3:** Find the year in every date of the format "YYYY-MM-DD". Here is some text:
"The school year started in 2023-09-05. More than 2300 students were enrolled. 2023-2024 is going to be a great year! The ending ceremony is on 2024-05-17." Your code will use groups to find the dates, for example, [('2023', '09', '05'), ('2024', '05', '17')], and then index the year.