# Regular Expressions

In [1]:
import sys
from pathlib import Path

current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent  # ← Add project root, not chapters
        break
else:
    project_root = Path.cwd().parent.parent

sys.path.insert(0, str(project_root))

from shared import thinkpython, diagram, jupyturtle
from shared.download import download

# Register as top-level modules so direct imports work in subsequent cells
sys.modules['thinkpython'] = thinkpython
sys.modules['diagram'] = diagram
sys.modules['jupyturtle'] = jupyturtle

If we know exactly what sequence of characters we're looking for, we can use the **`in`** operator to find it and the `replace` method to replace it.
However, there is another tool, called a **regular expression**, that can also perform these operations—and a lot more.

To demonstrate, I'll start with a simple example and we'll work our way up.
Suppose, again, that we want to find all lines that contain a particular word.
For a change, let's look for references to the titular character of the book, Count Dracula.
Here's a line that mentions him.

In [2]:
text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."

And here's the **pattern** we'll use to search.

In [3]:
pattern = 'Dracula'

A Python module called **`re`** provides functions related to regular expressions.
We can import it like this and use the `search` function to check whether the pattern appears in the text.

In [4]:
import re

result = re.search(pattern, text)     # look for 'Dracula' in the example line
result

<re.Match object; span=(5, 12), match='Dracula'>

If the pattern appears in the text, `search` returns a `Match` object that contains the results of the search.
Among other information, it has an attribute named `string` that contains the text that was searched.

In [5]:
result.string

'I am Dracula; and I bid you welcome, Mr. Harker, to my house.'

It also provides a method called `group` that returns the part of the text that **matched** the pattern.

In [6]:
result.group()

'Dracula'

And it provides a method called `span` that returns the indices in the text where the match starts and ends.

In [7]:
result.span()

(5, 12)
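
As a quick check, the two numbers returned by `span` can be used as slice indices into the searched string. The sketch below repeats the example line so that it runs on its own.

```python
import re

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
result = re.search('Dracula', text)

# span() returns (start, end); end is one past the last matched character
start, end = result.span()
matched = text[start:end]
print(matched)  # the same text that result.group() returns
```

Slicing with the span gives the same string as `group`, which is a handy way to confirm what the indices mean.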

If the pattern doesn't appear in the text, the return value from `search` is `None`.

In [8]:
result = re.search('Count', text)
print(result)

None


So we can check whether the search was successful by checking whether the result is `None`.

In [9]:
result is None

True
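
This check matters because calling a method such as `group` on `None` raises an `AttributeError`. Here is a minimal sketch of the usual guard; `matched_text` is a hypothetical helper written for this illustration, not part of the `re` module.

```python
import re

def matched_text(pattern, text):
    """Return the matched text, or None if the pattern is not found.

    Hypothetical helper, for illustration only.
    """
    result = re.search(pattern, text)
    if result is None:
        return None
    return result.group()

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
print(matched_text('Dracula', text))
print(matched_text('Count', text))
```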

## Download and clean the text

Before we can search the text of *Dracula*, we need to download it from Project Gutenberg and clean it.
We'll save the downloaded file to the `data` folder, write a cleaned copy without the header and footer marker lines or blank lines to the same folder, and use these files for all subsequent analysis.

In [10]:
import os
from urllib.request import urlretrieve

# Ensure the data directory used by local_path below exists
os.makedirs('../../data', exist_ok=True)

# Download Dracula text to data folder
url = 'https://www.gutenberg.org/files/345/345-0.txt'
local_path = '../../data/pg345.txt'
if not os.path.exists(local_path):
    urlretrieve(url, local_path)
    print('Downloaded Dracula to', local_path)
else:
    print('Dracula already downloaded:', local_path)

Dracula already downloaded: ../../data/pg345.txt


In [11]:
# download('https://www.gutenberg.org/cache/epub/345/pg345.txt');

In [12]:
def clean_file(infile, outfile):
    """Read infile, write to outfile skipping special lines."""
    with open(infile, encoding='utf8') as fin, open(outfile, 'w', encoding='utf8') as fout:
        for line in fin:
            if not is_special_line(line):
                fout.write(line)

In [13]:
# def clean_file(input_file, output_file):
#     reader = open(input_file, encoding='utf-8')
#     writer = open(output_file, 'w')

#     for line in reader:
#         if is_special_line(line):
#             break

#     for line in reader:
#         if is_special_line(line):
#             break
#         writer.write(line)
        
#     reader.close()
#     writer.close()

In [14]:
def is_special_line(line):
    """Return True for Project Gutenberg marker lines and blank lines."""
    return (line.startswith('***') or line.strip() == '' or
            line.startswith('End of the Project Gutenberg'))

In [15]:
# def is_special_line(line):
#     return line.strip().startswith('*** ')

In [16]:
# Clean the Dracula text and save to data/pg345_cleaned.txt
clean_file('../../data/pg345.txt', '../../data/pg345_cleaned.txt')
print('Cleaned file saved to data/pg345_cleaned.txt')

Cleaned file saved to data/pg345_cleaned.txt


Putting all that together, here's a function that loops through the lines in the book until it finds one that matches the given pattern, and returns the `Match` object.

In [17]:
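def find_first(pattern):
    """Return the Match object for the first matching line, or None."""
    with open('../../data/pg345_cleaned.txt', encoding='utf8') as fin:
        for line in fin:
            result = re.search(pattern, line)
            if result is not None:
                return result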

We can use it to find the first mention of a character.

In [18]:
result = find_first('Harker')
result.string

'CHAPTER I. Jonathan Harker’s Journal\n'

For this example, we didn't have to use regular expressions -- we could have done the same thing more easily with the `in` operator.
But regular expressions can do things the `in` operator cannot.
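
For a plain sequence of characters with no special characters, the two approaches give the same answer. A small sketch, using the line found above:

```python
import re

line = 'CHAPTER I. Jonathan Harker’s Journal\n'

# The in operator returns a boolean directly
found_with_in = 'Harker' in line

# re.search returns a Match object or None, so we compare with None
found_with_re = re.search('Harker', line) is not None

print(found_with_in, found_with_re)
```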

For example, if the pattern includes the vertical bar character, `'|'`, it can match either the sequence on the left or the sequence on the right.
Suppose we want to find the first mention of Mina Murray in the book, but we are not sure whether she is referred to by first name or last.
We can use the following pattern, which matches either name.

In [19]:
pattern = 'Mina|Murray'
result = find_first(pattern)
result.string

'CHAPTER V. Letters—Lucy and Mina\n'

We can use a pattern like this to see how many times a character is mentioned by either name.
Here's a function that loops through the book and counts the number of lines that match the given pattern.

In [20]:
def count_matches(pattern):
    """Count the lines in the book that match the pattern."""
    count = 0
    with open('../../data/pg345_cleaned.txt', encoding='utf8') as fin:
        for line in fin:
            if re.search(pattern, line) is not None:
                count += 1
    return count

Now let's see how many lines mention Mina.

In [21]:
count_matches('Mina|Murray')

229
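
Note that this number counts lines, not individual mentions: a line that contains a name twice is counted once. As a sketch of the difference, the following uses `re.findall`, which returns a list of all non-overlapping matches in a string. The sample lines are made up for illustration and are not quoted from the book.

```python
import re

# Made-up sample lines, not quoted from the book
lines = [
    "Mina wrote to Lucy, and Lucy wrote back to Mina.",
    "Miss Murray was a schoolmistress.",
    "The Count stayed in his castle.",
]

pattern = 'Mina|Murray'

# Lines with at least one match (what count_matches counts)
line_count = sum(1 for line in lines if re.search(pattern, line))

# Every individual match, including repeats on the same line
occurrence_count = sum(len(re.findall(pattern, line)) for line in lines)

print(line_count, occurrence_count)
```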

The special character `'^'` matches the beginning of a string, so we can find a line that starts with a given pattern.

In [22]:
result = find_first('^Dracula')
result.string

'Dracula, jumping to his feet, said:--\n'

And the special character `'$'` matches the end of a string, so we can find a line that ends with a given pattern (ignoring the newline at the end).

In [23]:
result = find_first('Harker$')
result.string

'by five o’clock, we must start off; for it won’t do to leave Mrs. Harker\n'
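
To see both anchors at work without reading the file, here is a small sketch on a made-up line that ends with a newline, as lines read from a file do.

```python
import re

# Made-up line for illustration
line = 'Dracula met Harker\n'

starts_with_dracula = re.search('^Dracula', line) is not None
starts_with_harker = re.search('^Harker', line) is not None
ends_with_harker = re.search('Harker$', line) is not None   # '$' matches before the final '\n'
ends_with_dracula = re.search('Dracula$', line) is not None

print(starts_with_dracula, starts_with_harker)
print(ends_with_harker, ends_with_dracula)
```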

## String substitution

Bram Stoker was born in Ireland, and when *Dracula* was published in 1897, he was living in England.
So we would expect him to use the British spelling of words like "centre" and "colour".
To check, we can use the following pattern, which matches either "centre" or the American spelling "center".

In [24]:
pattern = 'cent(er|re)'

In this pattern, the parentheses enclose the part of the pattern the vertical bar applies to.
So this pattern matches a sequence that starts with `'cent'` and ends with either `'er'` or `'re'`.

In [25]:
result = find_first(pattern)
result.string

'horseshoe of the Carpathians, as if it were the centre of some sort of\n'

As expected, he used the British spelling.
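
The parentheses matter. Without them, the vertical bar splits the whole pattern, so `'center|re'` means "either `'center'` or `'re'`". The following sketch compares the two patterns on a few test words chosen for illustration.

```python
import re

with_parens = 'cent(er|re)'
without_parens = 'center|re'   # means: 'center' OR 're'

for word in ['center', 'centre', 'there']:
    a = re.search(with_parens, word)
    b = re.search(without_parens, word)
    print(word, a is not None, b is not None)
```

The pattern without parentheses matches `'there'` because that word contains `'re'`, which is probably not what we want.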

We can also check whether he used the British spelling of "colour".
The following pattern uses the special character `'?'`, which means that the previous character is optional.

In [26]:
pattern = 'colou?r'

This pattern matches either "colour" with the `'u'` or "color" without it.

In [27]:
result = find_first(pattern)
line = result.string
line

'undergarment with long double apron, front, and back, of coloured stuff\n'

Again, as expected, he used the British spelling.
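
Here is a short sketch of how the `'?'` behaves on a few test words. The last one, `'colouur'`, is a deliberate misspelling included to show that `'?'` allows at most one `'u'`.

```python
import re

pattern = 'colou?r'
matches = {}
for word in ['color', 'colour', 'coloured', 'colouur']:
    result = re.search(pattern, word)
    matches[word] = result.group() if result else None

print(matches)
```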

Now suppose we want to produce an edition of the book with American spellings.
We can use the `sub` function in the `re` module, which does **string substitution**.

In [28]:
re.sub(pattern, 'color', line)

'undergarment with long double apron, front, and back, of colored stuff\n'

The first argument is the pattern we want to find and replace, the second is what we want to replace it with, and the third is the string we want to search.
In the result, you can see that "colour" has been replaced with "color".
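
Two details are worth checking: `sub` replaces every match in the string, not just the first, and it returns a new string rather than modifying the original. The sentence in this sketch is made up for illustration and is not from the book.

```python
import re

# Made-up sentence for illustration, not from the book
line = 'The colour of the sky matched the colours of the sea.'
american = re.sub('colou?r', 'color', line)

print(american)
print(line)  # unchanged: strings are immutable, so sub returns a new string
```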

In [29]:
# I used this function to search for lines to use as examples

def all_matches(pattern):
    """Print every line in the book that matches the pattern."""
    with open('../../data/pg345_cleaned.txt', encoding='utf8') as fin:
        for line in fin:
            if re.search(pattern, line):
                print(line.strip())

In [30]:
# For example:

all_matches('weather')

weather. As I stood, the driver jumped again into his seat and shook the
weatherworn, was still complete; but it was evidently many a day since
it is a buoy with a bell, which swings in bad weather, and sends in a
am awakened by her moving about the room. Fortunately, the weather is so
learn the weather signs. To-day is a grey day, and the sun as I write is
experienced here, with results both strange and unique. The weather had
kept watch on weather signs from the East Cliff, foretold in an emphatic
_22 July_.--Rough weather last three days, and all hands busy with
weather. Passed Gibralter and out through Straits. All well.
and entering on the Bay of Biscay with wild weather ahead, and yet last
weather influences as we know that the Count can bring to bear; and if
that I am fully armed as there may be wolves; the weather is getting


In [31]:
# Here's the pattern I used (which uses some features we haven't seen)

# names = r'(?<!\.\s)[A-Z][a-zA-Z]+'

# all_matches(names)