# Regular Expressions

----

Regular expressions are a very powerful tool for text processing.

This chapter is an overview of the most common regular expression functions. Since this is a Python class and not necessarily a regular expression class, we may not cover everything you need.

There is a good online regular expression primer [here](https://regexone.com).

In [1]:
import re

**NOTE:** In this module we will mostly make use of *raw* strings, e.g. `r'foo'`, because regular expressions make use of a wide variety of backslash escape sequences.
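
As a small illustration of why raw strings matter, the sketch below compares the same pattern written with and without the `r` prefix. The sample text `'the cat sat'` is made up for this example.

```python
import re

# In a regular string literal, Python turns '\b' into a backspace character
# before the regex engine ever sees the pattern.
plain = '\bcat\b'

# In a raw string literal the backslashes are kept, so the regex engine
# receives \b and treats it as a word boundary.
raw = r'\bcat\b'

print(len(plain))  # 5
print(len(raw))    # 7

print(re.search(plain, 'the cat sat'))         # None
print(re.search(raw, 'the cat sat').group(0))  # cat
```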

# Regular Expression Functions

## Exact Matches

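The most basic regular expression operation is to test whether an input string matches a pattern. The `re.match` function requires the entire pattern to match, starting at the beginning of the string.

When the regex matches, `re.match` returns a *match object*.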

In [2]:
re.match(r'Star Wars \d+', "Star Wars 5")

<_sre.SRE_Match object; span=(0, 11), match='Star Wars 5'>

Otherwise, it returns `None`.

In [3]:
print(re.match(r'Star Wars \d+', "Star Wars VI"))

None


Regular expressions are case sensitive by default.

In [4]:
print(re.match(r'Star Wars \d+', "star wars 5"))

None


But you can use flags to ignore case.

In [5]:
re.match(r'Star Wars \d+', "star wars 5", flags=re.IGNORECASE)

<_sre.SRE_Match object; span=(0, 11), match='star wars 5'>
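
As a brief sketch of other ways to pass flags, the example below uses the short alias `re.I`, the inline `(?i)` form, and a combination of two flags joined with `|`. The variable names are chosen only for this illustration.

```python
import re

text = "star wars 5"

# re.I is the short alias for re.IGNORECASE.
m1 = re.match(r'Star Wars \d+', text, flags=re.I)

# The inline flag (?i) at the start of the pattern has the same effect.
m2 = re.match(r'(?i)Star Wars \d+', text)

# Flags can be combined with |. With re.VERBOSE, whitespace inside the
# pattern is ignored, so spaces in the input must be matched with \s.
m3 = re.match(r'star \s wars \s \d+', "Star Wars 5", flags=re.I | re.VERBOSE)

print(m1.group(0))  # star wars 5
print(m2.group(0))  # star wars 5
print(m3.group(0))  # Star Wars 5
```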

The `match` function starts matching at the beginning of the string.

In [6]:
print(re.match(r'Luke', 'Darth Vader, Obi-wan, Luke, Leia')) # No match

None


In [7]:
re.match(r'.*Luke', 'Darth Vader, Obi-wan, Luke, Leia') # Match

<_sre.SRE_Match object; span=(0, 26), match='Darth Vader, Obi-wan, Luke'>

To find a match anywhere in the string, use `search` instead.

In [8]:
re.search(r'Luke', 'Darth Vader, Obi-wan, Luke, Leia')

<_sre.SRE_Match object; span=(22, 26), match='Luke'>
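
To summarize the difference, here is a short side-by-side sketch of `re.match`, `re.search`, and `re.fullmatch` on the same string. `re.fullmatch` is not used elsewhere in this chapter; it is included here only for contrast, since it requires the pattern to cover the whole string.

```python
import re

names = 'Darth Vader, Obi-wan, Luke, Leia'

# match: the pattern must match starting at position 0.
print(re.match(r'Luke', names))            # None
print(re.match(r'Darth', names).span())    # (0, 5)

# search: the pattern may match anywhere in the string.
print(re.search(r'Luke', names).span())    # (22, 26)

# fullmatch: the pattern must cover the whole string.
print(re.fullmatch(r'Darth', names))             # None
print(re.fullmatch(r'[\w, -]+', names).span())   # (0, 32)
```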

# Match Objects

A `match` or `search` call returns either a match object or `None`.

Match objects always have a boolean value of `True`, so you can do this:

In [9]:
m = re.match(r'(\d{3})-(\d{4})', '867-5309')
if m:
    print(m.groups())

('867', '5309')


The most common operation on a match object is to extract values from the regular expression *groups*.

In [10]:
m.group(1)

'867'

In [11]:
m.group(2)

'5309'

Group number zero is always the entire match:

In [12]:
m.group(0)

'867-5309'
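
Beyond numbered groups, match objects also report where each group matched, and groups can be given names with the `(?P<name>...)` syntax. The sketch below reuses the phone number example; the group names `prefix` and `line` are invented for this illustration.

```python
import re

m = re.match(r'(?P<prefix>\d{3})-(?P<line>\d{4})', '867-5309')
if m:
    # Named groups can be read by name or by number.
    print(m.group('prefix'))   # 867
    print(m.group(2))          # 5309
    # groupdict() returns all named groups as a dictionary.
    print(m.groupdict())       # {'prefix': '867', 'line': '5309'}
    # span(), start() and end() give positions in the input string.
    print(m.span(2))           # (4, 8)
    print(m.start(), m.end())  # 0 8
```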

# Regular Expression Objects

The following code is inefficient:

```python
import re
with open(filename) as fin:
    for line in fin:
        m = re.match(r'foo', line)
        if m:
            foo()
```

Every time the loop reaches the `re.match` call, the `re` module must look up the pattern string in its internal cache of recently compiled patterns, and compile it again if it is no longer cached.

Rewrite the code using `re.compile`:

```python
import re
REGEX = re.compile(r'foo')
with open(filename) as fin:
    for line in fin:
        m = REGEX.match(line)
        if m:
            foo()
```

Now the regular expression is compiled once, before the loop, and the loop body calls the compiled object's `match` method directly.
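
For a version that can be run without a file on disk, here is a minimal sketch of the same loop. The `io.StringIO` object and its contents are hypothetical stand-ins for the open file, and the lines are collected into a list instead of calling `foo()`.

```python
import io
import re

# Hypothetical stand-in for the open file in the example above.
fin = io.StringIO("foo one\nbar two\nfoo three\n")

# Compile the pattern once, outside the loop.
REGEX = re.compile(r'foo')

matching_lines = []
for line in fin:
    if REGEX.match(line):
        matching_lines.append(line.rstrip('\n'))

print(matching_lines)                  # ['foo one', 'foo three']

# The compiled object keeps its pattern string and offers the same
# operations as the module-level functions, such as search.
print(REGEX.pattern)                   # foo
print(REGEX.search('a foo b').span())  # (2, 5)
```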

## Exercise 1

The file "`data/baby1996.html`" is an HTML document listing the top 1000 boys' and girls' names for 1996 in the United States. Use regular expressions to parse the names into a list of dictionaries. Each dictionary should contain the keys 'rank', 'name', and 'sex' (where sex is 'male' or 'female'). Print out how many names start with the letter `'S'`.

In [13]:
def parse_names():
    """Your code here"""
            
def print_name_count(names, startswith):
    subnames = [n for n in names if n['name'].startswith(startswith)]
    for name in subnames:
        print(f'{name["name"]}')
    print(f'{len(subnames)} names start with "{startswith}"')

# Parse the names and print
names = parse_names()
print_name_count(names, 'S')

TypeError: 'NoneType' object is not iterable

In [15]:
# Show the answer
! cat answers/regex_1.py

def parse_names():
    names = []
    with open('data/baby1996.html') as fin:
        # <tr align="right"><td>148</td><td>Edgar</td><td>Kristina</td>
        rex = re.compile(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>')
        for line in fin:
            m = rex.match(line)
            if m:
                names.append({
                    'rank': m.group(1),
                    'name': m.group(2),
                    'sex': 'male',
                })
                names.append({
                    'rank': m.group(1),
                    'name': m.group(3),
                    'sex': 'female',
                })
    return names
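
To try the answer's regular expression without the data file, the sketch below applies it to a few sample lines. The first row is the one quoted in the answer's comment; the second row and the non-matching line are invented for this illustration, and the real file's layout may differ.

```python
import re

# Sample input: row 1 comes from the comment in the answer above,
# the other two lines are hypothetical.
sample = '''<tr align="right"><td>148</td><td>Edgar</td><td>Kristina</td>
<tr align="right"><td>999</td><td>Sam</td><td>Sue</td>
<p>not a data row</p>
'''

rex = re.compile(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>')

names = []
for line in sample.splitlines():
    m = rex.match(line)
    if m:
        # Each matching row yields one male and one female entry.
        rank, boy, girl = m.groups()
        names.append({'rank': rank, 'name': boy, 'sex': 'male'})
        names.append({'rank': rank, 'name': girl, 'sex': 'female'})

print(len(names))  # 4
print([n['name'] for n in names if n['name'].startswith('S')])  # ['Sam', 'Sue']
```

Note that the captured rank is a string; convert it with `int()` if you need to sort or compare ranks numerically.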
