In [None]:
import re

### Introduction and Encyclopedia

This is an introduction to **regular expressions**, commonly called **regex**, a powerful method for finding patterns in text.  It has a steep learning curve, but once mastered it can save enormous amounts of time.  There are two things that need to be learned.

1) The regex syntax for defining a pattern.  This is nearly constant across languages that implement regular expressions, and the differences tend to be in the edge cases.  We will study the Python implementation.
2) A library, language, or tool that provides regex functionality.  We will use the `re` package, but `pandas` also has some functionality, and `grep` can be used at a Linux terminal.

The file begins with an encyclopedic reference so you'll have a cheatsheet once you know some regex fundamentals.  But for the purpose of the presentation, we will skip past it and introduce the concepts systematically.  It's not completely comprehensive, but should cover the great majority of situations you might need a regex for.


#### Special Characters and Sequences


| Symbol            | Purpose                                                                       | Other                                                                     |
| ----              | ----                                                                          | ----                                                                      |
|`.`                | Wildcard that matches anything except a newline.                              | If inside a character class, matches only a literal `.`.                  |
|`^`                | Matches the start of the string; with `re.MULTILINE`, also matches immediately after each newline. | If first character inside a character class, negates (complements) the class. |
|`$`                | Matches the end of the string or just before a newline at the end of the string; with `re.MULTILINE`, also matches just before each newline. |                     |
|`-`                | If inside a character class, can be used to define a range.                   |                                                                           |
|`?`                | Quantifier matching 0 or 1 repetitions of the preceding segment.              | If used after another quantifier, makes it non-greedy.  If used after `(`, begins an extension such as a non-capturing group, named group, or lookaround. |
|`+`                | Quantifier matching 1 or more repetitions of the preceding segment.           | If used after another quantifier, makes it possessive: greedy without backtracking (Python 3.11+). |
|`*`                | Quantifier matching 0 or more repetitions of the preceding segment.           |                                                                           |
|`{m}`              | Quantifier matching exactly `m` repetitions of the preceding segment.         |                                                                           |
|`{m,n}`            | Quantifier matching between `m` and `n` repetitions of the preceding segment. | Either of `m` or `n` can be left blank to get "at most `n`" or "at least `m`" behavior. |
|`\`                | Escapes special characters (when necessary).                                  | References one of many built-in character classes.                        |
|`\|`               | Alternation; that is, "or" logic.                                             |                                                                           |
|`[...]`            | Creates a character class.                                                    |                                                                           |
|`(...)`            | Captures a group for later reference.                                         |                                                                           |
|`\n`               | Backreference to group number `n` (a positive integer), such as `\1`.         |                                                                           |
|`(?P<name>...)`    | Captures a group that can be referenced by `name`.                            |                                                                           |
|`(?P=name)`        | Backreference to the group called `name`.                                     |                                                                           |
|`(?:...)`          | A non-capturing group.                                                        |                                                                           |
|`(?=...)`          | Positive lookahead.                                                           |                                                                           |
|`(?!...)`          | Negative lookahead.                                                           |                                                                           |
|`(?<=...)`         | Positive lookbehind.                                                          |                                                                           |
|`(?<!...)`         | Negative lookbehind.                                                          |                                                                           |

#### Built-in Character Classes


| Symbol        | Purpose   |
| ----          | ----      |
| `\d`          | A digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
| `\D`          | Anything that is not a digit.
| `\w`          | A word character, which confusingly includes numbers and the underscore.  That is, alphanumerics (regardless of case) plus `_`.
| `\W`          | Anything that is not a word character.
| `\s`          | A space (or whitespace, if you prefer). Spaces, tabs, newlines, etc.
| `\S`          | Anything that is not whitespace.
| `\b`          | A word boundary, matching the empty string but only before or after word characters.
| `\B`          | Matches the empty string except when it is a word boundary.



#### `re` Functions

The order is a mix of complexity and usefulness according to my subjective and fallible opinion.  Again, this is most but not all of the functions.

| Function                              | Description |
| ----                                  | ----        |
|`re.findall(pattern, text)`            | Looks for all occurrences of the pattern, returning a list of strings (or of tuples, if the pattern has several capturing groups).|
|`re.sub(pattern, replacement, text)`   | Replaces every occurrence of the pattern with the replacement, which can be a function.|
|`re.split(pattern, text)`              | Splits the string wherever the pattern is found.|
|`re.compile(pattern)`                  | Creates an `re.Pattern` object.|
|`re.search(pattern, text)`             | Looks for the first occurrence of the pattern, returning an `re.Match` object.|
|`re.finditer(pattern, text)`           | Similar to `findall()` but returns an iterator over `re.Match` objects.|
|`re.match(pattern, text)`              | Looks for a match at the beginning of the string.|
|`re.fullmatch(pattern, text)`          | Looks to see if the entire string is a match.|
|`re.escape(pattern)`                   | Escapes all special characters, yielding a literal string to be matched.|
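
The differences between `re.search()`, `re.match()`, `re.fullmatch()`, and `re.escape()` are easy to mix up, so here is a minimal sketch. The sample strings are invented for illustration and are not part of the notebook's data.

```python
import re

sample = "cat scatter"  # hypothetical sample text

# search() scans the whole string; match() only tries the very beginning
search_hit = re.search(r"scat", sample)
match_hit = re.match(r"scat", sample)
print(search_hit.span())
print(match_hit)

# fullmatch() succeeds only if the pattern accounts for the entire string
full_partial = re.fullmatch(r"cat", sample)
full_whole = re.fullmatch(r"cat scatter", sample)
print(full_partial, full_whole is not None)

# escape() backslashes special characters so they are matched literally
price_pattern = re.escape("3.50")
literal_hits = re.findall(price_pattern, "3.50 3x50")
print(literal_hits)
```

Without `re.escape()`, the `.` in `3.50` would act as a wildcard and `3x50` would match as well.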


#### `Pattern` Methods and Attributes

| Method / Attribute                    | Description |
| ----                                  | ----        |
|`Pattern.search(string)`               | Similar to `re.search()`, but called as a method from the pattern instead of passing it in as an argument. |
|`Pattern.match(string)`                | Similar to `re.match()`.
|`Pattern.fullmatch(string)`            | Similar to `re.fullmatch()`.
|`Pattern.split()`                      | Similar to `re.split()`.
|`Pattern.findall()`                    | Similar to `re.findall()`.
|`Pattern.finditer()`                   | Similar to `re.finditer()`.
|`Pattern.sub()`                        | Similar to `re.sub()`.
|`Pattern.groups`                       | The number of capturing groups in the pattern (an attribute, not a method).
|`Pattern.groupindex`                   | A dictionary-like mapping with group names as keys and group numbers as values (an attribute, not a method).
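
As a small sketch of the last two rows: in the `re` module, `groups` and `groupindex` are read as attributes, without parentheses. The date pattern below is a made-up example.

```python
import re

# Hypothetical date pattern with two named groups and one unnamed group
pattern_date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

group_count = pattern_date.groups               # number of capturing groups
name_to_number = dict(pattern_date.groupindex)  # group names mapped to group numbers

print(group_count)
print(name_to_number)
```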

#### `Match` Methods and Attributes

| Method / Attribute                    | Description |
| ----                                  | ----        |
|`Match.start()`        |Gives index of original string where the match starts.|
|`Match.end()`          |Similar, but the end.|
|`Match.span()`         |A tuple with the start and end indices.|
|`Match.pos`            |The beginning index of the search that found the match.|
|`Match.endpos`         |The end index of the search that found the match.|
|`Match.group()`        |Returns a subgroup of the match.|
|`Match.groups()`       |Returns a tuple of all groups.|
|`Match.groupdict()`    |Returns a dictionary with named groups.|
|`Match.re`             |Returns the pattern the match was found from.|
|`Match.string`         |Returns the original string the match was found in.|
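
`Match.pos` and `Match.endpos` only become interesting when the search is restricted to part of the string. A compiled pattern's `search()` accepts an optional starting index, which the sketch below uses; the sentence is a made-up example.

```python
import re

pattern_is = re.compile(r"is")
sentence = "This is it"  # hypothetical sample text

first = pattern_is.search(sentence)      # finds the "is" inside "This"
later = pattern_is.search(sentence, 4)   # start scanning at index 4 instead

print(first.span(), first.pos, first.endpos)
print(later.span(), later.pos, later.endpos)
print(later.string[later.start():later.end()])
```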

### Basic literals


A regex is a sequence of characters that describes a textual pattern.  The most basic type of pattern is a literal expression.  If you just want to find all occurrences, use `re.findall(pattern, text)`.

In [None]:
text = "My name is Steve Carden, I was born in 1985, and my email address is stephen.carden.1@us.af.mil. I used to teach statistics, and now I am learning to be a software engineer."

# This only shows up once
print(re.findall("Steve", text))

# This appears multiple times
print(re.findall("is", text))

By default, regular expressions are case sensitive.

In [None]:
print(re.findall("steve", text))
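
If case should not matter, the standard `re.IGNORECASE` flag can be passed as an extra argument. A minimal sketch, using a made-up string rather than the one above:

```python
import re

greeting = "Steve, steve, and STEVE"  # hypothetical sample text

case_sensitive = re.findall("steve", greeting)
case_insensitive = re.findall("steve", greeting, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```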

`re.sub()` is nice for replacing one pattern with something else.  For now the replacement will be a static string, but later we'll learn how to make it a function of the match.

In [None]:
# Suppose I decided to go by my middle name, Wesley.
print(re.sub("Steve", "Wesley", text))

# Or change the tense (mostly, anyway)
print(re.sub("is", "was", text))

# See how this is different?  More clever solutions are later.
print(re.sub(" is ", " was ", text))

`re.split()` splits the string wherever the pattern is found.  We can use this with simple whitespace to create a janky tokenizer (improvable with forthcoming techniques).

In [None]:
print(re.split(" ", text))

### `Match` objects

`re.search()` only finds the first occurrence, but it returns a `Match` object that contains more information.

In [None]:
match = re.search("Steve", text)
print(match)
print(type(match))
print(match.start())
print(match.end())
print(match.span())
print(match.re)
print(match.string)

If you want to be able to access a `Match` object for all occurrences, `re.finditer()` is your friend.

In [None]:
# Where does the pattern "is" occur?
matches = re.finditer("is", text)
print(matches)
for m in matches:
    print(m.span())

### Exercise 1

In [None]:
text = "My name is Steve Carden, I was born in 1985, and my email address is stephen.carden.1@us.af.mil. I used to teach statistics, and now I am learning to be a software engineer."

# Make substitutions in the above string so that it applies to you.

# At what index does the "@" in your email address appear?

# What are all the locations where a comma appears?

### Special Characters and Backslash Hell

The following characters have special meanings to the regex engine: `\^$.|?*+()[{` .  Some have more than one use, and their behavior depends on the context in which they are used.  If you want to include one as a literal in a pattern, escape it with a backslash `\`.

In [None]:
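# Attempt to split text into sentences.  What if there was no backslash?  Do you see an easy immediate improvement?
# (Recent Python versions warn about "\." in an ordinary string; raw strings, covered below, avoid that.)
re.split("\.", text)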

In [None]:
# We don't know what "$" means to the regex engine yet, but whatever it does, it needs to be escaped.
print(re.findall("$", "I need about $3.50."))
print(re.findall("\$", "I need about $3.50."))


But remember that backslashes can also have special meaning to the Python interpreter.  Let's step away from regular expressions for the moment and think about Python strings and how backslashes are interpreted.  Consider the following attempts to construct a string.

In [None]:
# Intentionally broken: the inner double quotes end the string early, so this raises a SyntaxError
albert = "Einstein said, "God does not play dice with the universe.""

There are several ways around this.  You can alternate single and double quotes, use triple quotes (this allows multiline strings which will be handy later), or use backslashes to escape their ability to terminate a string and treat them as simple, literal quotes.

In [None]:
albert1 = 'Einstein said, "God does not play dice with the universe."'
# The trailing space keeps the final quote from running into the closing delimiter.  Delimiting with ''' instead would avoid it.
albert2 = """Einstein said, "God does not play dice with the universe." """
albert3 = "Einstein said, \"God does not play dice with the universe.\""
print(albert1)
print(albert2)
print(albert3)


Strings and regex patterns that include backslashes are tricky because both Python and the regex engine see backslashes as special characters.  First, just making a string that includes a literal backslash can be hazardous.

In [None]:
slash_explanation = "Forward slashes lean right /, and backslashes lean left \"
print(slash_explanation)

You need an extra backslash to tell the second one not to be special, and leave its grubby paws off the quotes.

In [None]:
slash_explanation = "Forward slashes lean right /, and backslashes lean left \\"
print(slash_explanation)
# More confusion: echoing a string with a backslash at the console shows its repr with two backslashes, even though printing shows one (and only one exists!)
slash_explanation

Now I would like to match occurrences of a backslash.  Here's where my head starts to spin.  If you need to escape escapi-ness at both the Python and regex level, how many are needed to end up with one backslash?

In [None]:
# Four: Python reduces them to two, and the regex engine reduces those two to one literal backslash
m = re.search("\\\\", slash_explanation)
# Keep in mind the echoed match and text each show two backslashes, even though each contains only one
print(m)
print(m.string)
m.string

This is one of the more terrible things I've ever encountered.  Fortunately, it's not quite so bad if you use raw strings, in which the Python interpreter treats all backslashes as literal (unless they are escaping quotes).  Prefix a string with `r` to make it a raw string.  I've observed many users default to always using raw strings to define regex patterns.

In [None]:
# Backslashes lose most of their special meaning in a raw string
print("Hello\nWorld")
print(r"Hello\nWorld")

In [None]:
# But can still escape quotes.  Try with two backslashes, and see that both are kept
print(r"Hello \\")
r"Hello \\"


Return to the example of matching a single backslash in a string, but use a raw string to create the pattern.

In [None]:
# Doesn't help to make this one raw, as the backslash would still escape the quote
slash_explanation = "Forward slashes lean right /, and backslashes lean left \\"
# But the pattern can be simplified down to two backslashes: escaping at the regex level, but not the Python level
re.search(r"\\", slash_explanation)
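
To double-check the backslash counting, the sketch below compares the ordinary and raw spellings of the same pattern and counts characters directly. The `windows_path` string is a made-up example.

```python
import re

regular_pattern = "\\\\"   # four typed backslashes, two reach the regex engine
raw_pattern = r"\\"        # two typed backslashes, two reach the regex engine

print(regular_pattern == raw_pattern)
print(len(raw_pattern))

windows_path = "C:\\temp\\notes"  # hypothetical string containing two real backslashes
backslashes = re.findall(raw_pattern, windows_path)
print(len(backslashes))
```

Both spellings produce the same two-character pattern, which the regex engine reads as one literal backslash.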

Here are a few of the simplest special characters and their meanings.

- `^` doesn't match a character, but the position at the **beginning** of a string.  It's called an **anchor**.  If `re.MULTILINE` is passed as an argument, it also matches positions after new lines.
- `$` is the other anchor but matches the **ending** position.  It interacts similarly with `re.MULTILINE`.
- `|` is **alternation**, the regex term for "or" logic.
- `.` is a wildcard that matches everything except a line break.  Use with caution, because it usually matches more than you want and results in unintended patterns, especially combined with quantifiers (coming soon).  Every time in my experience that I've combined this wildcard with quantifiers, it's been a mistake.

In [None]:
text = """This is a multiline string.
Is this the third line?
No, it was the second. Doh!
"""

# What are the first characters of each line?
print(re.findall(r"^.", text))
print(re.findall(r"^.", text, re.MULTILINE))

# What punctuation terminates each line?
print(re.findall(r".$", text))
print(re.findall(r".$", text, re.MULTILINE))

text = "The boy threw the ball at the girl.  He missed and it nearly hit a man, but a woman warned him in time."
# Find anything representing a person entity
print(re.findall(r"boy|girl|man|woman", text))
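
The wildcard's refusal to match a line break can be lifted with the standard `re.DOTALL` flag. A minimal sketch on a made-up two-line string:

```python
import re

two_lines = "ab\ncd"  # hypothetical sample text with one newline

default_hits = re.findall(r".", two_lines)            # the newline is skipped
dotall_hits = re.findall(r".", two_lines, re.DOTALL)  # the newline is matched too

print(default_hits)
print(dotall_hits)
```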


### Exercise 2

In [None]:
homework = """
1 + 2 = ?
3 * 4 = ?
2 ^ 3 = ?
4 / 2 = ?
3 ^ 2 = ?
"""

# Find occurrences of "2", but only if they are at the beginning of a line.

# For all exponentiation problems, extract the expression to the left of the equals sign.
# For example, if the text is "a ^ b = ?", the pattern should extract "a ^ b"

# Find all the symbols representing the basic arithmetic operations.
# Addition, subtraction, multiplication, and division.  Do not include exponentiation.  
# Hint:  Some are special characters that need to be escaped, and others are not.

### `Pattern` objects

Until now we've been handling our regex pattern as a (perhaps raw) string.  Under the hood, the regex engine compiles it into a `re.Pattern` object.  We can compile it beforehand with `re.compile()`, which has a few advantages.

First, you can call methods from the pattern object, which is often more efficient.  Scroll up to the cheatsheet tables at the beginning of the notebook.  In fact, most of the `re` functions work by compiling the pattern and calling its equivalent method.

In [None]:
# In the first 5 million natural numbers, how many times does an even digit appear?
# And what's the most efficient way to find out?

import time

start = time.time()
number_of_even_digits = 0
for i in range(5000000):
    number_of_even_digits += len(re.findall(r"2|4|6|8|0", str(i)))
end = time.time()
print(number_of_even_digits)
print("Without a compiled pattern, it took this long: ", end - start)

start = time.time()
number_of_even_digits = 0
pattern = re.compile(r"2|4|6|8|0")
for i in range(5000000):
    number_of_even_digits += len(pattern.findall(str(i)))
end = time.time()
print(number_of_even_digits)
print("With a compiled pattern, it took this long: ", end - start)
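
For a more repeatable comparison, the same experiment can be sketched with the standard `timeit` module. The helper names `count_with_module` and `count_with_compiled` and the smaller limit are assumptions made for this example. Note that `re` caches recently compiled patterns, so the module-level function mostly pays for a cache lookup on each call rather than a full recompilation.

```python
import re
import timeit

EVEN_DIGIT = re.compile(r"2|4|6|8|0")

def count_with_module(limit):
    """Count even digits using the module-level function."""
    return sum(len(re.findall(r"2|4|6|8|0", str(i))) for i in range(limit))

def count_with_compiled(limit):
    """Count even digits using the precompiled pattern's method."""
    return sum(len(EVEN_DIGIT.findall(str(i))) for i in range(limit))

limit = 100_000  # smaller than the cell above so the timing finishes quickly
print("module-level:", timeit.timeit(lambda: count_with_module(limit), number=1))
print("compiled:    ", timeit.timeit(lambda: count_with_compiled(limit), number=1))
```

The exact timings depend on the machine, so treat the output as a rough comparison only.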

Second, it can reduce cognitive load on the programmer.  Assigning a descriptive name to the pattern will communicate what it does faster than mentally parsing out the expression.  Furthermore, with the `re.VERBOSE` flag to `re.compile()`, you can break the pattern into multiple lines and comment each part.  This is very helpful as advanced expressions can get pretty gnarly.

In [None]:
homework = """
1 + 2 = ?
3 * 4 = ?
2 ^ 3 = ?
4 / 2 = ?
3 ^ 2 = ?
"""

# Tip: add the MULTILINE flag to the pattern, not just the method.  I got stuck on this example a long time because of that!
# This is our first time needing multiple flags, so notice they are combined with an "or" | between them.
pattern_arithm_oper = re.compile(r"""
                                 \+     # Literal +
                                 |      # or
                                 \*     # Literal *
                                 |      # or                                 
                                 /      # Literal /
                                 |      # or
                                 -      # Literal -
                                 """, re.VERBOSE|re.MULTILINE)

print(pattern_arithm_oper.findall(homework))
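
One catch with `re.VERBOSE` is that whitespace inside the pattern is ignored, so a literal space has to be written explicitly, for example as `[ ]` or `\ `. The sketch below contrasts the two cases; the pattern names are invented for illustration.

```python
import re

# A literal space must be spelled out in VERBOSE mode
pattern_name = re.compile(r"""
    Steve   # First name
    [ ]     # One literal space, protected inside a character class
    Carden  # Last name
    """, re.VERBOSE)

# Here the space between the words is ignored, so this means "SteveCarden"
pattern_squashed = re.compile(r"""
    Steve Carden
    """, re.VERBOSE)

intro = "My name is Steve Carden"
print(pattern_name.findall(intro))
print(pattern_squashed.findall(intro))
print(pattern_squashed.findall("SteveCarden"))
```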

### Character Classes

Alternation gets clunky if there are many possible characters you would like to match.  For example, if you were looking for any digit, who wants to deal with `"0|1|2|3...|8|9"`?  **Character classes** are a briefer alternative.  Here are some of the common built-in classes.

| Symbol        | Purpose   |
| ----          | ----      |
| `\d`          | A digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
| `\D`          | Anything that is not a digit.
| `\w`          | A word character, which confusingly includes numbers and the underscore.  That is, alphanumerics (regardless of case) plus `_`.
| `\W`          | Anything that is not a word character.
| `\s`          | A space (or whitespace, if you prefer). Spaces, tabs, newlines, etc.
| `\S`          | Anything that is not whitespace.
| `\b`          | A word boundary, matching the empty string but only before or after word characters.
| `\B`          | Matches the empty string except when it is a word boundary.

In [None]:
text = """I was born in 1985, and my daughter was born in 2011.
At time of writing these notes in 2024, I am 39 years old.
My daughter recently celebrated her 13th birthday.  Teenagers are terrible!"""

# Extract all years, that is, sequences of four consecutive digits (there's an even more elegant method we'll learn soon)
pattern_year = re.compile(r"\d\d\d\d")
print(pattern_year.findall(text))

# Find all four letter alphanumeric sequences that form a distinct token
pattern_4_len_seq = re.compile(r"\b\w\w\w\w\b")
print(pattern_4_len_seq.findall(text))

# Remember our first attempt at a tokenizer?  Do you see a particular failing that can be improved with one of these character classes?
print(re.split(" ", text))
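
Word boundaries also give the more clever fix promised earlier for replacing "is" without touching words that merely contain it. A minimal sketch on a made-up sentence:

```python
import re

sentence = "My name is Steve and this is his thesis."  # hypothetical sample text

loose = re.findall(r"is", sentence)        # also hits "this", "his", and "thesis"
strict = re.findall(r"\bis\b", sentence)   # only the standalone word
fixed = re.sub(r"\bis\b", "was", sentence)

print(len(loose), len(strict))
print(fixed)
```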

To make your own character class, enclose everything you want to include in square brackets `[]`. Remember that it's still matching only one character from the text to the class; repetition will arrive later.

- You can use a `-` as a shortcut to specify a range of digits or letters of either case.  If you want a literal `-` in the class, escape it with a backslash or place it first or last.
- You can complement the class by using `^` as the first character.  Notice it has different uses when inside a character class (complementing) than outside (beginning of string or line).  If you want a literal `^` in the class, make it the second character or later.
- Most other special characters lose their meaning when inside a character class and do not need to be escaped (the exceptions are `\` and `]`).  For example, `.` is no longer a wildcard, and now represents a literal dot.

In [None]:
text = "Americans spell it gray, the British spell it grey, but nobody spells it graey."

# Retrieve any valid spelling
print(re.findall("gr[ae]y", text))

# Pull out lower-case vowels.
print(re.findall("[aeiouy]", text))

# Pull out anything that isn't a lower-case vowel
print(re.findall("[^aeiouy]", text))

# Pull out only lowercase first five letters of alphabet
print(re.findall("[a-e]", text))

# Pull out first five letters regardless of case
print(re.findall("[a-eA-E]", text))

text = "My name is Steve Carden, I was born in 1985, and my email address is stephen.carden.1@us.af.mil. I used to teach statistics, and now I am learning to be a software engineer."

# Pull out the common punctuation symbols.  
# What if we didn't want the periods in the email?  We'll be able to deal with that later.
print(re.findall("[,.!?]", text))
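
The rules for literal `-`, `^`, and `]` inside a character class can be checked with a short sketch. The `symbols` string is made up for this example.

```python
import re

symbols = "a-b ^c [d] e.f"  # hypothetical sample text

# "-" placed first is literal, "^" placed later is literal, "." is literal, "]" needs a backslash
special_literals = re.findall(r"[-^.\]]", symbols)
print(special_literals)

# A "-" between two characters defines a range; a "-" at the end is just a dash
as_range = re.findall(r"[a-c]", "abc-")
as_literal = re.findall(r"[ac-]", "abc-")
print(as_range)
print(as_literal)
```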

### Exercise 3

In [None]:
# In hexadecimal (base 16), digits are represented by digits from 0 to 9 and the first six letters of the alphabet, a through f, possibly upper case
# Construct and apply a pattern that can find the four cases that could represent a single hexadecimal digit
# Hint: What can you use as a "boundary" to ensure you're only picking up singular symbols?
text = "A o 5 55 14 0 b 99 Y"

# Construct and apply a pattern that can find the four cases representing 
# the seconds portion when time is communicated in a format like 08:34:45
# Basically, two-digit sequences from 00 to 60.
text = "89 00 5 59 60 61 0 34 000 600 65 222"

# Construct and apply a pattern that can find the three sequences that represent a valid telephone number.
# Area code is included, and the three sections must be connected by a dash.
# For example, 123-456-7890
text = "912-584-0018 584-0018 123-45-6789 2350-1111 912-283-8411 555-867-5309 555555555 123-456-78900"

### Repetition and Quantifiers; Non-Capturing Groups

I bet you're already seeing the need to include repetition in a pattern, and there is indeed a robust system.

- `?` Matches 0 or 1 occurrences of the preceding segment. Basically a way to designate the segment is optional.
- `+` Matches 1 or more occurrences of the preceding segment. Use for non-optional, potentially repeating segments.
- `*` Matches 0 or more occurrences of the preceding segment. Use for optional, potentially repeating segments.
- `{m}` Matches exactly `m` repetitions of the preceding segment.
- `{m,n}` Matches between `m` and `n` repetitions of the preceding segment. Either of `m` or `n` can be left blank to get "at most `n`" or "at least `m`" behavior.

In [None]:
text = "A non-secure link looks like this: http://www.google.com.  A secure link looks like this: https://www.google.com."
# The "?" only applies to the "s"
pattern_protocol = re.compile(r"https?")
print(pattern_protocol.findall(text))

text = "Some abbreviate it PhD, and some say Ph.D., but it means the same thing."
pattern_phd = re.compile(r"Ph\.?D\.?")
print(pattern_phd.findall(text))

# A more compact way to search for phone numbers
text = "912-584-0018 584-0018 123-45-6789 2350-1111 912-283-8411 555-867-5309 555555555 123-456-78900"
pattern_phone = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
print(pattern_phone.findall(text))

# I don't allow my checking account to drop below $100, and any amount greater than $9999 overflows into an investment account.
# Which values might I see on my bank statement?  Assume it always displays to the nearest cent.
text = "$10.00 $5432.10 $5432 5432.10 $22222 $99.99 $100.00 $100 $9999.99 $9999.999 $9999.9"
pattern_cash = re.compile(r"""                          
                          \$        # Literal $
                          \d{3,4}   # At least 3 but no more than 4 digits                          
                          \.        # A literal .
                          \d{2}     # Exactly two digits
                          \b        # End with a word boundary
                          """, re.VERBOSE)
print(pattern_cash.findall(text))
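
The open-ended forms of `{m,n}` are not used in the cell above, so here is a minimal sketch of them. The `numbers` string is made up for illustration.

```python
import re

numbers = "1 22 333 4444 55555"  # hypothetical sample text

# {3,} means at least three digits, with no upper limit
three_or_more = re.findall(r"\b\d{3,}\b", numbers)
# {1,2} means one or two digits
one_or_two = re.findall(r"\b\d{1,2}\b", numbers)
print(three_or_more)
print(one_or_two)

# {,2} means zero, one, or two repetitions, so even the empty string qualifies
empty_ok = re.fullmatch(r"a{,2}", "") is not None
two_ok = re.fullmatch(r"a{,2}", "aa") is not None
three_ok = re.fullmatch(r"a{,2}", "aaa") is not None
print(empty_ok, two_ok, three_ok)
```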


By default the quantifier only applies to the immediately preceding symbol.  If you want it to apply to a larger part of the pattern, your first instinct might be to create a subpattern with `(...)`.  This does work, but it also **captures** the contents, which is a concept we haven't studied yet.  Capturing changes the behavior of some functions such as `findall()`, so for the moment let's use a **non-capturing group** with `(?:...)`.

In [None]:
# Modify this problem so that the area code is optional
# It should find one additional occurrence
# For example, 123-456-7890
text = "912-584-0018 584-0018 123-45-6789 2350-1111 912-283-8411 555-867-5309 555555555 123-456-78900"
pattern_phone = re.compile(r"""
                           \b           # Boundary at beginning
                           (?:          # Begin non-capturing group
                           \d{3}-       # Area code and dash
                           )?           # End non-capturing group, and make it optional
                           \d{3}-\d{4}  # The rest of the pattern
                           \b           # End with a boundary
                           """, re.VERBOSE)
print(pattern_phone.findall(text))
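
Quantifiers are greedy by default, which is why combining them with the `.` wildcard so often matches too much. The sketch below compares a greedy quantifier, a non-greedy one (a `?` placed after the quantifier), and a negated character class. The `html` string is a made-up example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

greedy = re.findall(r"<.+>", html)        # runs from the first "<" to the last ">"
lazy = re.findall(r"<.+?>", html)         # stops at the earliest possible ">"
negated = re.findall(r"<[^>]+>", html)    # avoids the wildcard altogether

print(greedy)
print(lazy)
print(negated)
```

The negated class is often the safer choice, because it states exactly which characters may appear between the angle brackets.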

### Exercise 4

In [None]:
# Extract all numbers.
text = """I was born in 1985, and my daughter was born in 2011.
At time of writing these notes, I am 39 years old.  How am I still alive?
My daughter recently celebrated her 13th birthday.  Teenagers are terrible!"""

# Extract the last alphanumeric sequence of each line, including punctuation.

# Extract all words (not including numbers this time) that have at least 4 letters.


### Capturing Groups

Often within an overall pattern, there are one or more subpatterns of interest.  With parentheses `(...)` we can capture these subpatterns into **groups** for later reference and processing.

- If your pattern contains any capturing groups, `findall()` will change behavior and return only the groups rather than the entire match.

- If you need to work extensively with groups, use `search()` or `finditer()` to get one or more `Match` objects, then `Match.group()` to refer to them.  `Match.group(0)` returns the entire match, and `Match.group(n)` refers to the `n`th group in your pattern.

- Groups can be named with `(?P<name>...)`, and later referenced with `(?P=name)`.

In [None]:
text = "Personnel on the NITMRE project include Dalton Walker, Chase Perry, Michael Fox, Steve Carden, Leslie Douglas, and Alex Sommers."
# Assuming that two consecutive capitalized words represent first and last names, I'm going to extract names and group the first and last parts
pattern_names = re.compile(r"""
                           ([A-Z][a-z]+)    # Capital letter followed by at least one lowercase letter, grouped
                           \s               # Whitespace between
                           ([A-Z][a-z]+)    # Same pattern for the last name
                           """, re.VERBOSE)

# See how we get a list of tuples of strings
print(pattern_names.findall(text))
# We could get lists of first or last names from that structure, but let's work with Match objects
all_matches = pattern_names.finditer(text)
# The groups() method is similar to findall() for this case, giving tuples with the groups
# Edit to group(), and try indices 0, 1, 2.
for m in all_matches:
    print(m.groups())
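
How the return value of `findall()` depends on the number of capturing groups can be seen side by side in this minimal sketch, which uses a shortened, made-up list of names:

```python
import re

names = "Dalton Walker, Chase Perry"  # shortened sample for illustration

# No groups: a list of whole matches
no_groups = re.findall(r"[A-Z][a-z]+\s[A-Z][a-z]+", names)
# One group: a list of that group's text only
one_group = re.findall(r"([A-Z][a-z]+)\s[A-Z][a-z]+", names)
# Two groups: a list of tuples, one entry per group
two_groups = re.findall(r"([A-Z][a-z]+)\s([A-Z][a-z]+)", names)

print(no_groups)
print(one_group)
print(two_groups)
```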

We can then extract information systematically from each group, and recombine to do things like create usernames for each person.

In [None]:
all_matches = pattern_names.finditer(text)
usernames = [m.group(1)[0].lower() + m.group(2).lower() for m in all_matches]
usernames

Here is a recent use-case of capturing groups.  In markdown formatting, here is the syntax for a url.  Double click the cell to see the raw text.  A label is in brackets, and the url is in parentheses.

For the cheapest, greasiest pizza in town, head over to [Little Caesar's](www.littlecaesars.com)!

An NLP application might want only the label from this, so let's match the entire pattern, capture the label in a group, and substitute the entire expression with just the captured label.  This is a difficult pattern because of the need to escape special characters and the variety of text that can be in the label.

In [None]:
text = """For the cheapest, greasiest pizza in town, head over to [Little Caesar's](www.littlecaesars.com)!
I start my PC building research at [PC Part Picker](www.pcpartpicker.com).
For learning how to be a software engineer, the courses at [Udemy](digitalu.udemy.com) have been invaluable."""
pattern_markdown = re.compile(r"""
                              \[            # Literal open bracket
                              (             # Start of group
                              [^\]]+        # Anything that is not a closing bracket
                              )             # Close group
                              \]            # Literal close bracket
                              \(            # Literal open parenthesis
                              [^)]+         # Anything that is not a closing parenthesis
                              \)            # Literal closing parenthesis
                              """, re.VERBOSE)
# Without comments, I love how indecipherable the pattern is: \[([^\]]+)\]\([^)]+\)

# Remember, this is just showing the group
print(pattern_markdown.findall(text))
# This lets us see the entire match
all_matches = pattern_markdown.finditer(text)
for m in all_matches:
    print(m.group())

Try substitution. So far we only know how to substitute a static string, like so.

In [None]:
text2 = pattern_markdown.sub("<LABEL HERE>", text)
print(text2)

But we can pass in a function of the match instead, and whatever the function returns will be the replacement.  This one can be briefly accomplished with a lambda function that returns the first extracted group.

In [None]:
text3 = pattern_markdown.sub(lambda match: match.group(1), text)
print(text3)
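
When the replacement is just a captured group, `re.sub()` also accepts a replacement string that refers to the group by number, such as `r"\1"`. A function is still needed when the group has to be transformed. The sketch below rebuilds the markdown pattern in compact form so it is self-contained; `pattern_link` and `line` are names made up for this example.

```python
import re

# Compact form of the markdown link pattern: [label](url), capturing the label
pattern_link = re.compile(r"\[([^\]]+)\]\([^)]+\)")
line = "Courses at [Udemy](digitalu.udemy.com) have been invaluable."

# A replacement string can point at group 1 directly
label_only = pattern_link.sub(r"\1", line)
# A function is needed when the group must be modified first
label_upper = pattern_link.sub(lambda m: m.group(1).upper(), line)

print(label_only)
print(label_upper)
```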

You can even refer back to an earlier group later in the same pattern with a backslash and the group number, such as `\1`.  Let's do that to identify accidental repetition of words.

In [None]:
text = "The quick brown brown fox jumped over the the lazy dog."

pattern_repeat = re.compile(r"""
                            (\w+)   # Capture an alphanumeric sequence of any length
                            \s      # Look for a space
                            \1      # Look for whatever was previously captured
                            """, re.VERBOSE)

all_matches = pattern_repeat.finditer(text)
for m in all_matches:
    print(m.group())

# Correct the error
pattern_repeat.sub(lambda match: match.group(1), text)

If you don't want the cognitive burden of remembering the order of groups in a complicated pattern, you can name them.

Everyone knows the first part of a phone number is the area code.  Did you know the next parts are called the "exchange" and "subscriber"?

In [None]:
text = "912-584-0018 584-0018 123-45-6789 2350-1111 912-283-8411 555-867-5309 555555555 123-456-78900"
pattern_phone = re.compile(r"\b(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<subscriber>\d{4})\b")

# When groups are named, you can get a dictionary with names as keys and groups as values.  This is a method on a match object.
all_matches = pattern_phone.finditer(text)
for m in all_matches:
    print(m.groupdict())
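
Named groups can also be used for backreferences. Inside the pattern the syntax is `(?P=name)`, and inside a replacement string it is `\g<name>`. The sketch below revisits the repeated-word example with a group called `word`, a name chosen for this illustration, and adds word boundaries so that only whole words are compared.

```python
import re

sentence = "The quick brown brown fox jumped over the the lazy dog."

pattern_repeat_named = re.compile(r"""
    \b(?P<word>\w+)   # Capture a whole word and call it "word"
    \s                # A single whitespace character
    (?P=word)\b       # The same text again, as a whole word
    """, re.VERBOSE)

repeats = [m.group("word") for m in pattern_repeat_named.finditer(sentence)]
corrected = pattern_repeat_named.sub(r"\g<word>", sentence)

print(repeats)
print(corrected)
```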


### Exercise 5

In [None]:
# Below is the output from "ls" in one of my directories.  
# For each file, extract the filename and extension, and place them in a dictionary.
# Keys should be "name" and "extension"
text = """
date_parsing.ipynb               learning_hdbscan.ipynb   sawyer.txt              text_for_staging_test.txt
bin_contents.txt                 gatsby.txt               power_tests.csv         shapiro.ipynb
communicating_with_triton.ipynb  hyperlink_removal.ipynb  psql_data.txt
"""

# I mixed up my list of monitor resolutions with memory addresses and random values. Agh!
# Help me fix it by extracting the entries that can be interpreted as width by height,
# and making a list of strings that I can place in a CSS element.
# for example, "height: 100px; width: 200px;"
text = """
1280x720
0x6fed4
0x7ffe5367e044
1920x1080
640x480
100x200x300
1024x768
"""
pattern_resolution = re.compile(r"\b(?P<width>\d+)x(?P<height>\d+)\b")

### Lookarounds next?