# Regular expressions

A regular expression (regex) is a way of recognizing patterns in text and, often, of extracting data from the text that fits those patterns.

A regex that recognizes a piece of text or a string is said to match that text or string.

A regex is defined by a string in which certain characters (the so-called metacharacters) can have a special meaning, which enables a single regex to match many different specific strings.

The following snippet is the simplest of examples: the regex contains no metacharacters, only regular characters. We count the lines of the file in which the search string `"hello"` is found. Note that a line containing the search string more than once is counted only once.

In [None]:
import re
from pathlib import Path

search_str = "hello"
regex = re.compile(search_str)
count = 0
with Path("datafiles/01_textfile.txt").open("r") as file:
    for line in file.readlines():
        if regex.search(line):
            count += 1
print(f"{search_str!r} was found within {count} line(s) in the file.")

'hello' was found within 2 line(s) in the file.


Note the line in the example:

```python
regex = re.compile(search_str)
```

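This compilation isn't strictly necessary, because functions such as `re.search()` accept the pattern string directly. However, compiling once and reusing the compiled object avoids a pattern lookup on every call, which can speed up a program that applies the same regex many times (for example, once per line of a large file).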
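
As a minimal sketch of the two styles, the module-level `re.search()` function and the `search()` method of a compiled pattern find the same match. The sample text is made up for illustration.

```python
import re

line = "well, hello there"  # made-up sample text

# Module-level function: the pattern string is passed on every call.
direct = re.search("hello", line)

# Compiled pattern: compile once, then reuse the object.
regex = re.compile("hello")
compiled = regex.search(line)

# Both approaches report the same match position.
assert direct.span() == compiled.span() == (6, 11)
```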

## Regex with special characters

Special characters let you write more flexible regular expressions, so that a single pattern matches several variations of a string.

For example, you can look for either `"Hello"` or `"hello"` using:

```python
# option 1
regexp = re.compile("hello|Hello")

# option 2
regexp = re.compile("(h|H)ello")

# option 3
regexp = re.compile("[hH]ello")
```

In [None]:
import re
from pathlib import Path


def get_occurrence_count(filename: str, search_regexp: str) -> int:
    regex = re.compile(search_regexp)
    count = 0
    with Path(filename).open("r") as file:
        for line in file.readlines():
            if regex.search(line):
                count += 1
    return count

filename = "datafiles/01_textfile.txt"

assert get_occurrence_count(filename, "hello") == 2
assert get_occurrence_count(filename, "hello|Hello") == 3
assert get_occurrence_count(filename, "(h|H)ello") == 3
assert get_occurrence_count(filename, "[hH]ello") == 3


The special characters `[` and `]` take a string of characters between them and match any single character in that string, as in `[Hh]ello` to match `Hello` and `hello`.

There's a shorthand to denote a range of characters: `[a-z]` matches a single character between `a` and `z`. It can be used in the following situations:

+ Any numeric character: `[0-9]`
+ Any lowercase alphanumeric character: `[0-9a-z]`
+ Any uppercase alphanumeric character: `[0-9A-Z]`
+ ...
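
The ranges listed above can be tried out on a short string. This is only a sketch: the sample text is made up, and it uses `re.findall()`, which returns every non-overlapping match as a list.

```python
import re

text = "Room 42b, Floor 7"  # made-up sample text

digits = re.findall("[0-9]", text)
lower_alnum = re.findall("[0-9a-z]", text)
upper_alnum = re.findall("[0-9A-Z]", text)

print(digits)       # ['4', '2', '7']
print(upper_alnum)  # ['R', '4', '2', 'F', '7']
```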

Sometimes you might need to match a hyphen character `"-"`. In that case, the hyphen must be placed at the beginning (or the end) of the set, or escaped as `\-`, so that it isn't read as part of a range:
+ `[-012]`: either `"-"`, 0, 1, or 2.
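
A small sketch of how the position of the hyphen changes its meaning inside a set. The variable names are illustrative only.

```python
import re

first = re.compile("[-012]")     # hyphen first: a literal "-"
last = re.compile("[012-]")      # hyphen last: also a literal "-"
escaped = re.compile(r"[0\-2]")  # escaped hyphen: matches "0", "-" or "2"
as_range = re.compile("[0-2]")   # hyphen between characters: the range 0 to 2

assert first.match("-") and last.match("-") and escaped.match("-")
assert not escaped.match("1")
assert as_range.match("1") and not as_range.match("-")
```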


### Exercise

Write a regular expression that matches the numbers from -5 through 5.
Note: assume that you will only be matching one digit numbers (from -9 to 9).

In [11]:
import re

search_regexp = "(-[1-5])|([0-5])"
regexp = re.compile(search_regexp)

for i in range(-9, 10):
    if i < -5:
        assert not regexp.match(str(i))
    elif i <= 5:
        assert regexp.match(str(i))
    else:
        assert not regexp.match(str(i))


Alternatively:

In [12]:
import re

search_regexp = "-?[0-5]"
regexp = re.compile(search_regexp)

for i in range(-9, 10):
    if i < -5:
        assert not regexp.match(str(i))
    elif i <= 5:
        assert regexp.match(str(i))
    else:
        assert not regexp.match(str(i))
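
Both solutions rely on the stated assumption of one-digit numbers, because `match()` only anchors the pattern at the start of the string. As a sketch of how that assumption could be dropped, `fullmatch()` requires the whole string to match:

```python
import re

regexp = re.compile("-?[0-5]")

# match() only checks the beginning, so "57" is accepted because of its "5".
assert regexp.match("57")

# fullmatch() requires the entire string to match the pattern.
assert not regexp.fullmatch("57")
assert regexp.fullmatch("-5")
assert regexp.fullmatch("5")
```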

### Exercise

What regular expression would you use to match a hexadecimal digit?

In [16]:
import re

search_regexp = "[0-9a-fA-F]"
regexp = re.compile(search_regexp)

for i in range(16):
    hex_digit = hex(i)[2:]  # Remove the 0x prefix
    assert regexp.match(hex_digit)

assert not regexp.match("G")
assert not regexp.match("Z")
assert not regexp.match("X")

## Regular expressions and raw strings

Python itself recognizes certain character sequences in string literals as special (`\n` for newline, `\t` for tab, `\\` for a single backslash). When you're dealing with regular expressions, you therefore need to pay special attention when building the pattern string.

Assume for example that you need to match the occurrence of the string "\ten" in some text found in a file.

In [18]:
import re
from pathlib import Path

regex = re.compile("\ten")
with Path("datafiles/02_textfile.txt").open("r") as file:
    for line in file.readlines():
        print(regex.search(line))


None


You can see that Python interprets `"\ten"` as a tab followed by `"en"` which is not what we need. Therefore, we might be tempted to change the search string:

In [19]:
import re
from pathlib import Path

regex = re.compile("\\ten")
with Path("datafiles/02_textfile.txt").open("r") as file:
    for line in file.readlines():
        print(regex.search(line))

None


But that still doesn't work, because `re` also interprets `\t` as tab.

As a result, you need to double the backslash twice, ending up with four backslashes:

In [20]:
import re
from pathlib import Path

regex = re.compile("\\\\ten")
with Path("datafiles/02_textfile.txt").open("r") as file:
    for line in file.readlines():
        print(regex.search(line))

<re.Match object; span=(0, 4), match='\\ten'>


Now it is found, but the pattern has become hard to read.

That's why, in most cases, it is recommended to write regular expressions as raw strings, as in `r"Hello"`. Raw strings tell Python not to process backslash escape sequences in the string:

In [26]:
assert r"Hello" == "Hello"  # noqa: PLR0133
assert r"\the" == "\\the"  # noqa: PLR0133
assert "\\the" != "\the"  # noqa: PLR0133

print(r"\the")
print("\the")


\the
	he


Raw strings really simplify the expression that has to be used for the matching:

In [29]:
import re
from pathlib import Path

regex = re.compile(r"\\ten")
with Path("datafiles/02_textfile.txt").open("r") as file:
    for line in file.readlines():
        print(regex.search(line))

<re.Match object; span=(0, 4), match='\\ten'>


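Note that even with raw strings, the pattern must contain `\\t` rather than `\t`: the raw string only stops Python from interpreting the backslash, while the regex engine still reads `\t` as a tab. Doubling the backslash makes the regex match a literal backslash followed by the string "ten".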
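
The difference can be checked without a data file. The following sketch uses two in-memory strings (made up for illustration) and also shows `re.escape()`, which builds a pattern that matches a given text literally.

```python
import re

tab_text = "\ten"         # a tab character followed by "en"
backslash_text = r"\ten"  # a backslash followed by "ten"

# The regex engine reads \t in the pattern as a tab.
assert re.search(r"\ten", tab_text)
assert not re.search(r"\ten", backslash_text)

# Escaping the backslash makes the pattern match a literal backslash.
assert re.search(r"\\ten", backslash_text)

# re.escape() produces the same escaped pattern from the literal text.
assert re.escape(r"\ten") == r"\\ten"
```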

## Extracting matched text from strings

A common use case for regular expressions is to perform simple pattern-based parsing on text to extract portions of such text.

For example, assume that you have a file with a list of people and phone numbers with the format:

```
surname, firstname middlename: phonenumber
```

Taking into account that:
+ a middle name may or may not exist
+ phone-numbers follow the format:
    + 3 digit area code (optional)
    + 3 digit exchange code
    + 4 digit number

Therefore, you might find phone numbers such as 800-123-4567 or 123-4567.

The way to deal with this *parsing* problem is to use the divide and conquer approach. Let's start with the parsing of surnames, firstnames, and middle names: those will be letters and possibly hyphens: `[-a-zA-Z]`

The previous regular expression will match a single character in the name. Therefore, we need to add the `+` metacharacter so that we can match names with *one or more characters*: `[-a-zA-Z]+`.

Note that names such as `---` will also be valid, but that's OK for this example.
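
A quick sketch of what the name pattern accepts and rejects, using `fullmatch()` so that the whole string has to fit the pattern. The name with an apostrophe is a made-up counterexample.

```python
import re

name = re.compile("[-a-zA-Z]+")

assert name.fullmatch("Pugh")
assert name.fullmatch("first-name")
assert name.fullmatch("---")         # accepted, as noted above
assert not name.fullmatch("")        # "+" requires at least one character
assert not name.fullmatch("O'Neil")  # the apostrophe is not in the set
```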

For the phone numbers, we can use the `\d` special sequence that identifies digits: `\d\d\d-\d\d\d-\d\d\d\d`.

We can also make the area code optional by wrapping it in a group and adding the `?` metacharacter: `(\d\d\d-)?\d\d\d-\d\d\d\d`.

The previous regular expression can also be written in a more compact way using `{}` to indicate the number of times a pattern should repeat: `(\d{3}-)?\d{3}-\d{4}`.
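
As a sanity check, the sketch below compares the long and the compact forms on a few sample numbers (the two invalid ones are made up) and confirms that they agree.

```python
import re

long_form = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")
compact = re.compile(r"(\d{3}-)?\d{3}-\d{4}")

numbers = ["800-123-4567", "123-4567", "12-4567", "800-123-456"]

# Both patterns accept or reject every sample in the same way.
for number in numbers:
    assert bool(long_form.fullmatch(number)) == bool(compact.fullmatch(number))

valid = [number for number in numbers if compact.fullmatch(number)]
print(valid)  # ['800-123-4567', '123-4567']
```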

As a result, we can write the whole regular expression as:

```
[-a-zA-Z]+, [-a-zA-Z]+( [-a-zA-Z]+)?: (\d{3}-)?\d{3}-\d{4}
```

Note the space in the specification of the optional middlename.

While the above pattern will let us validate that the lines conform to the expected format, it won't help us extract the individual components (surname, first name, middle name, phone number).

In [40]:
import re
from pathlib import Path

regex = re.compile(r"[-a-zA-Z]+, [-a-zA-Z]+( [-a-zA-Z]+)?: (\d{3}-)?\d{3}-\d{4}")
with Path("datafiles/03_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        print(f"Line {i}: conforms to pattern: {regex.search(line) is not None}")


Line 0: conforms to pattern: True
Line 1: conforms to pattern: True
Line 2: conforms to pattern: True
Line 3: conforms to pattern: True


We can also benefit from the fact that Python automatically concatenates adjacent string literals, so we can write the regexp in portions:

In [41]:
import re
from pathlib import Path

regex = re.compile(
    r"[-a-zA-Z]+,"
    r" [-a-zA-Z]+"
    r"( [-a-zA-Z]+)?: "
    r"(\d{3}-)?\d{3}-\d{4}",
)
with Path("datafiles/03_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        print(f"Line {i}: conforms to pattern: {regex.search(line) is not None}")


Line 0: conforms to pattern: True
Line 1: conforms to pattern: True
Line 2: conforms to pattern: True
Line 3: conforms to pattern: True
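
An alternative way of splitting a long pattern, not used in the rest of this section, is the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows comments. The sketch below assumes the same line format; since spaces are ignored in this mode, each literal space is written as `[ ]`. The name `PHONE_LINE` and the sample lines are illustrative.

```python
import re

PHONE_LINE = re.compile(
    r"""
    [-a-zA-Z]+,            # surname followed by a comma
    [ ][-a-zA-Z]+          # a space, then the first name
    ([ ][-a-zA-Z]+)?       # optional middle name
    :[ ]                   # colon and space
    (\d{3}-)?\d{3}-\d{4}   # phone number with optional area code
    """,
    re.VERBOSE,
)

assert PHONE_LINE.search("Isaacs, Jason: 123-4567")
assert PHONE_LINE.search("Pugh, Florence Mary: 123-4567")
assert not PHONE_LINE.search("# This is comment line, won't be parsed")
```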


Extracting the components requires modifying the regex a bit to include `()` to group each subpattern corresponding to the piece of text we want to extract, as well as using `?P<name>` to give each matched subpattern a name.

Once you do so, you'll be able to extract those pieces using `result.group("name")` as seen below:

In [46]:
import re
from pathlib import Path

regex = re.compile(
    r"(?P<lastname>[-a-zA-Z]+),"
    r" (?P<firstname>[-a-zA-Z]+)"
    r"( (?P<middlename>[-a-zA-Z]+))?: "
    r"(?P<phonenumber>(\d{3}-)?\d{3}-\d{4})",
)
with Path("datafiles/03_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        if result := regex.search(line):
            print(
                f"Name={result.group("firstname")!r} "
                f"MiddleName={result.group("middlename")!r} "
                f"Last Name={result.group("lastname")!r} "
                f"Phone={result.group("phonenumber")!r}",
            )
        else:
            print(f"line {i}: {line!r} could not be interpreted")


line 0: "# This is comment line, won't be parsed\n" could not be interpreted
Name='first-name' MiddleName='middle-name' Last Name='surname' Phone='555-123-4567'
Name='Jason' MiddleName=None Last Name='Isaacs' Phone='123-4567'
Name='Florence' MiddleName='Mary' Last Name='Pugh' Phone='123-4567'
Name='Eugene' MiddleName=None Last Name='Krabs' Phone='800-123-4567'
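
When all named groups are needed at once, the match object's `groupdict()` method returns them as a dictionary. A minimal sketch on two in-memory lines written in the same format as the file:

```python
import re

regex = re.compile(
    r"(?P<lastname>[-a-zA-Z]+),"
    r" (?P<firstname>[-a-zA-Z]+)"
    r"( (?P<middlename>[-a-zA-Z]+))?: "
    r"(?P<phonenumber>(\d{3}-)?\d{3}-\d{4})",
)

with_middle = regex.search("Pugh, Florence Mary: 123-4567").groupdict()
without_middle = regex.search("Isaacs, Jason: 123-4567").groupdict()

print(with_middle)
# A named group that did not take part in the match is reported as None.
print(without_middle["middlename"])
```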


### Exercise

Making international calls requires a `+` and the country code. Assuming that the country code is two digits, how would you modify the previous snippet to extract the `+` and the country code as part of the number?
Assume that the country code will be optional.

In [57]:
import re
from pathlib import Path

regex = re.compile(
    r"(?P<lastname>[-a-zA-Z]+),"
    r" (?P<firstname>[-a-zA-Z]+)"
    r"( (?P<middlename>[-a-zA-Z]+))?: "
    r"(\(\+(?P<countrycode>\d{2})\))?"
    r"(?P<phonenumber>(\d{3}-)?\d{3}-\d{4})",
)
with Path("datafiles/04_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        if result := regex.search(line):
            print(
                f"Name={result.group("firstname")!r} "
                f"MiddleName={result.group("middlename")!r} "
                f"Last Name={result.group("lastname")!r} "
                f"Country Code={result.group("countrycode")} "
                f"Phone={result.group("phonenumber")!r}",
            )
        else:
            print(f"line {i}: {line!r} could not be interpreted")


line 0: "# This is comment line, won't be parsed\n" could not be interpreted
Name='first-name' MiddleName='middle-name' Last Name='surname' Country Code=None Phone='555-123-4567'
Name='Jason' MiddleName=None Last Name='Isaacs' Country Code=None Phone='123-4567'
Name='Florence' MiddleName='Mary' Last Name='Pugh' Country Code=None Phone='123-4567'
Name='Eugene' MiddleName=None Last Name='Krabs' Country Code=None Phone='800-123-4567'
Name='Penelope' MiddleName=None Last Name='Cruz' Country Code=34 Phone='555-321-4321'


### Exercise

How would you make the code handle country codes of one to three digits?

In [3]:
import re
from pathlib import Path

regex = re.compile(
    r"(?P<lastname>[-a-zA-Z]+),"
    r" (?P<firstname>[-a-zA-Z]+)"
    r"( (?P<middlename>[-a-zA-Z]+))?: "
    r"(\(\+(?P<countrycode>\d{1,3})\))?"
    r"(?P<phonenumber>(\d{3}-)?\d{3}-\d{4})",
)
with Path("datafiles/05_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        if result := regex.search(line):
            print(
                f"Name={result.group("firstname")!r} "
                f"MiddleName={result.group("middlename")!r} "
                f"Last Name={result.group("lastname")!r} "
                f"Country Code={result.group("countrycode")} "
                f"Phone={result.group("phonenumber")!r}",
            )
        else:
            print(f"line {i}: {line!r} could not be interpreted")


line 0: "# This is comment line, won't be parsed\n" could not be interpreted
Name='first-name' MiddleName='middle-name' Last Name='surname' Country Code=None Phone='555-123-4567'
Name='Jason' MiddleName=None Last Name='Isaacs' Country Code=None Phone='123-4567'
Name='Florence' MiddleName='Mary' Last Name='Pugh' Country Code=None Phone='123-4567'
Name='Eugene' MiddleName=None Last Name='Krabs' Country Code=None Phone='800-123-4567'
Name='Penelope' MiddleName=None Last Name='Cruz' Country Code=34 Phone='555-321-4321'
Name='Ahmed' MiddleName=None Last Name='Riz' Country Code=1 Phone='800-321-4321'
Name='Charles' MiddleName=None Last Name='Leclerc' Country Code=377 Phone='765-4321'


Note that to specify a number of one to three digits, you use the syntax `\d{1,3}` (with no space after the comma), that is, you specify the minimum and maximum number of digits.
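
A short sketch of the bounded quantifier on its own. The last assertion illustrates that Python's `re` does not treat the braces as a quantifier when a space follows the comma.

```python
import re

country_code = re.compile(r"\d{1,3}")

assert country_code.fullmatch("1")
assert country_code.fullmatch("34")
assert country_code.fullmatch("377")
assert not country_code.fullmatch("")      # at least one digit is required
assert not country_code.fullmatch("1234")  # at most three digits are allowed

# With a space inside the braces, the pattern no longer means "1 to 3 digits".
assert not re.fullmatch(r"\d{1, 3}", "34")
```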

## Substituting text with regular expressions

Regular expressions are also useful when you need to find strings in text and replace them with other strings.

The following snippet illustrates how to do so using the `sub()` method, which replaces all matching substrings with the value of the first argument.

In [7]:
import re

string = "If the the problem is textual, use the the re module"
pattern = r"the the"
regexp = re.compile(pattern)
result = regexp.sub("the", string)

print(f"result={result!r}")


result='If the problem is textual, use the re module'
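
The pattern above is specific to the word "the". As a sketch of a more general variant, which goes beyond the original snippet, a backreference (`\1`) can match any word that is immediately repeated:

```python
import re

string = "If the the problem is textual, use the the re module"

# \b marks a word boundary, (\w+) captures a word, and \1 must repeat it.
repeated_word = re.compile(r"\b(\w+) \1\b")
result = repeated_word.sub(r"\1", string)

print(f"result={result!r}")
```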


The `sub()` method also accepts a function as its first argument, which will be called with each match object. The function can then decide what to do with the match and return the replacement string.

As an example, consider the following snippet that takes a string containing integer values and returns a string with the same numerical values, but as floating point numbers with a decimal point and zero:

In [8]:
import re


def matched_int_to_float(match_obj):
    return match_obj.group("int_num") + ".0"

string = "1, 2, 3 count with me, that's how the number goes, 4, 5, 6, 7, 8, 9"
pattern = r"(?P<int_num>\d+)"
regexp = re.compile(pattern)
result = regexp.sub(matched_int_to_float, string)

print(f"result={result!r}")

result="1.0, 2.0, 3.0 count with me, that's how the number goes, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0"
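
With a replacement function, the quantifier decides what each call receives. The sketch below, on a made-up string containing a two-digit number, compares matching one digit at a time with matching whole numbers. The helper name `to_float` is illustrative.

```python
import re


def to_float(match_obj: re.Match) -> str:
    # group(0) is the whole matched text.
    return match_obj.group(0) + ".0"


string = "10 apples and 3 pears"  # made-up sample text

single_digit = re.sub(r"\d", to_float, string)
whole_number = re.sub(r"\d+", to_float, string)

print(single_digit)  # 1.00.0 apples and 3.0 pears
print(whole_number)  # 10.0 apples and 3.0 pears
```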


### Exercise

Modify the phone number parsing exercise to assume that any phone number without country code should be understood as +1 (for United States and Canada).

In [5]:
import re
from pathlib import Path


def default_country_code(match_obj) -> str:
    # Fall back to +1 when the line has no country code.
    country_code = match_obj.group("countrycode") or "1"
    # Keep the middle name only when the line has one.
    first_names = match_obj.group("firstname")
    if middle_name := match_obj.group("middlename"):
        first_names += f" {middle_name}"
    return (
        f"{match_obj.group("lastname")}, "
        f"{first_names}: "
        f"(+{country_code})"
        f"{match_obj.group("phonenumber")}"
    )

regex = re.compile(
    r"(?P<lastname>[-a-zA-Z]+),"
    r" (?P<firstname>[-a-zA-Z]+)"
    r"( (?P<middlename>[-a-zA-Z]+))?: "
    r"(\(\+(?P<countrycode>\d{1,3})\))?"
    r"(?P<phonenumber>(\d{3}-)?\d{3}-\d{4})",
)
with Path("datafiles/05_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        mod_line = regex.sub(default_country_code, line)
        if result := regex.search(mod_line):
            print(
                f"Name={result.group("firstname")!r} "
                f"MiddleName={result.group("middlename")!r} "
                f"Last Name={result.group("lastname")!r} "
                f"Country Code={result.group("countrycode")} "
                f"Phone={result.group("phonenumber")!r}",
            )
        else:
            print(f"line {i}: {mod_line!r} could not be interpreted")


line 0: "# This is comment line, won't be parsed\n" could not be interpreted
Name='first-name' MiddleName='middle-name' Last Name='surname' Country Code=1 Phone='555-123-4567'
Name='Jason' MiddleName=None Last Name='Isaacs' Country Code=1 Phone='123-4567'
Name='Florence' MiddleName='Mary' Last Name='Pugh' Country Code=1 Phone='123-4567'
Name='Eugene' MiddleName=None Last Name='Krabs' Country Code=1 Phone='800-123-4567'
Name='Penelope' MiddleName=None Last Name='Cruz' Country Code=34 Phone='555-321-4321'
Name='Ahmed' MiddleName=None Last Name='Riz' Country Code=1 Phone='800-321-4321'
Name='Charles' MiddleName=None Last Name='Leclerc' Country Code=377 Phone='765-4321'


### Exercise: Phone number normalizer

In the USA and Canada, phone numbers consist of ten digits, usually separated into:
+ a three-digit area code
+ a three-digit exchange code
+ a four-digit station code

They may or may not be preceded by the country code +1.

As a result, you might find the following possible formats for a phone number in the US and Canada:

+ `+1 223-456-7890`
+ `1-223-456-7890`
+ `+1 223 456-7890`
+ `(223) 456-7890`
+ `1 223 456 7890`
+ `223.456.7890`

Create a phone-number normalizer that takes any of the formats above and returns a normalized phone number of the form: `1-NNN-NNN-NNNN`.

Bonus:
+ The first digit of the area code and the exchange code can only be 2-9, and the second digit of an area code can't be 9.

First we work on the individual components (divide and conquer):

Country code:
+ "+1 "
+ "1-"
+ "+1 "
+ "1 "
+ (absent)

Area code:
+ "223-"
+ "223-"
+ "223 "
+ "(223)"
+ "223."

Exchange code:
+ "456-"
+ "456 "
+ "456."

Station code:
+ 7890


In [46]:
import re
from pathlib import Path

regex = re.compile(
        r"(\+?(?P<country_code>(1))(-| )?)?"
        r"\(?(?P<area_code>\d{3})(\) |-| |\.)"
        r"(?P<exchange_code>\d{3})(-| |\.)"
        r"(?P<station_code>\d{4})",
)

with Path("datafiles/06_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        if result := regex.search(line):
            print(
                f"Country Code={result.group("country_code")!r} "
                f"Area Code={result.group("area_code")!r} "
                f"Exchange Code={result.group("exchange_code")!r} "
                f"Station Code={result.group("station_code")!r} ",
            )
        else:
            print(f"line {i}: {line.strip()!r} could not be interpreted")


Country Code='1' Area Code='223' Exchange Code='456' Station Code='7890' 
Country Code='1' Area Code='223' Exchange Code='456' Station Code='7890' 
Country Code='1' Area Code='223' Exchange Code='456' Station Code='7890' 
Country Code=None Area Code='223' Exchange Code='456' Station Code='7890' 
Country Code='1' Area Code='223' Exchange Code='456' Station Code='7890' 
Country Code=None Area Code='223' Exchange Code='456' Station Code='7890' 
Country Code=None Area Code='999' Exchange Code='456' Station Code='7890' 
line 7: '1-989-111-222' could not be interpreted


In [47]:
import re
from pathlib import Path


def normalize_phone_number(phone_number) -> str:
    regex = re.compile(
        r"(\+?(?P<country_code>(1))(-| )?)?"
        r"\(?(?P<area_code>\d{3})(\) |-| |\.)"
        r"(?P<exchange_code>\d{3})(-| |\.)"
        r"(?P<station_code>\d{4})",
    )
    if not (match_obj := regex.search(phone_number)):
        raise ValueError("invalid phone number format for US/Canada")
    return (
        f"{match_obj.group("country_code") if match_obj.group("country_code") else "1"}-"  # noqa: E501
        f"{match_obj.group("area_code")}-"
        f"{match_obj.group("exchange_code")}-"
        f"{match_obj.group("station_code")}"
    )



with Path("datafiles/06_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        try:
            print(f"{i}: {line.strip()} --> {normalize_phone_number(line)!r}")
        except ValueError as e:
            print(f"{i}: {line.strip()}: {e}")



0: +1 223-456-7890 --> '1-223-456-7890'
1: 1-223-456-7890 --> '1-223-456-7890'
2: +1 223 456-7890 --> '1-223-456-7890'
3: (223) 456-7890 --> '1-223-456-7890'
4: 1 223 456 7890 --> '1-223-456-7890'
5: 223.456.7890 --> '1-223-456-7890'
6: 999.456.7890 --> '1-999-456-7890'
7: 1-989-111-222: invalid phone number format for US/Canada


For the bonus part:
+ The first digit of the area code and the exchange code can only be 2-9
+ The second digit of an area code can't be 9.

In [49]:
import re
from pathlib import Path


def normalize_phone_number(phone_number) -> str:
    regex = re.compile(
        r"(\+?(?P<country_code>(1))(-| )?)?"
        r"\(?(?P<area_code>[2-9][0-8]\d)(\) |-| |\.)"
        r"(?P<exchange_code>[2-9]\d{2})(-| |\.)"
        r"(?P<station_code>\d{4})",
    )
    if not (match_obj := regex.search(phone_number)):
        raise ValueError("invalid phone number format for US/Canada")
    return (
        f"{match_obj.group("country_code") if match_obj.group("country_code") else "1"}-"  # noqa: E501
        f"{match_obj.group("area_code")}-"
        f"{match_obj.group("exchange_code")}-"
        f"{match_obj.group("station_code")}"
    )



with Path("datafiles/06_textfile.txt").open("r") as file:
    for i, line in enumerate(file.readlines()):
        try:
            print(f"{i}: {line.strip()} --> {normalize_phone_number(line)!r}")
        except ValueError as e:
            print(f"{i}: {line.strip()}: {e}")



0: +1 223-456-7890 --> '1-223-456-7890'
1: 1-223-456-7890 --> '1-223-456-7890'
2: +1 223 456-7890 --> '1-223-456-7890'
3: (223) 456-7890 --> '1-223-456-7890'
4: 1 223 456 7890 --> '1-223-456-7890'
5: 223.456.7890 --> '1-223-456-7890'
6: 999.456.7890: invalid phone number format for US/Canada
7: 1-989-111-222: invalid phone number format for US/Canada
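
The normalizer can also be checked without the data file. The following sketch reuses the bonus pattern, compiles it once at module level (the name `PHONE_REGEX` is illustrative), and verifies that the six formats listed in the exercise all normalize to the same value.

```python
import re

# Same pattern as the bonus solution, compiled once instead of on every call.
PHONE_REGEX = re.compile(
    r"(\+?(?P<country_code>(1))(-| )?)?"
    r"\(?(?P<area_code>[2-9][0-8]\d)(\) |-| |\.)"
    r"(?P<exchange_code>[2-9]\d{2})(-| |\.)"
    r"(?P<station_code>\d{4})",
)


def normalize_phone_number(phone_number: str) -> str:
    if not (match_obj := PHONE_REGEX.search(phone_number)):
        raise ValueError("invalid phone number format for US/Canada")
    return "-".join(
        [
            match_obj.group("country_code") or "1",  # default country code
            match_obj.group("area_code"),
            match_obj.group("exchange_code"),
            match_obj.group("station_code"),
        ],
    )


formats = [
    "+1 223-456-7890",
    "1-223-456-7890",
    "+1 223 456-7890",
    "(223) 456-7890",
    "1 223 456 7890",
    "223.456.7890",
]
normalized = {normalize_phone_number(number) for number in formats}
assert normalized == {"1-223-456-7890"}
```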
