# Lab 3 Skills: Regular Expressions

You can optionally do this notebook or 03_eda_spacy. This lab is more basic, though regular expressions are very important in NLP. Do this lab if you don't have much experience with regular expressions, or feel like Lab 2 was moving too fast.

The goal of this notebook is to familiarize you with regular expressions, including:

- What types of applications use regexes.
- How to write regular expressions: syntax. 
- Types of programming tasks you can use them for.

## Setup


In [None]:
#| eval: false

# import some libraries
import pandas as pd
import regex
import re
import os
import data401_nlp

from data401_nlp.helpers.env import load_env

In [None]:
#| eval: false

# This checks to make sure your SUBMIT_API_KEY is present

print("SUBMIT_API_KEY present:", bool(os.getenv("SUBMIT_API_KEY")))

In [None]:
#| eval: false

# load some data to test against
data_file = 'data/train-sentiment.csv'
df = pd.read_csv(data_file,sep=',')
print("headers: ",list(df.columns))
print("rows: ",len(df))
all_tweets =  ' '.join(df['Tweet'].tolist())

headers:  ['Tweet', 'Sentiment']
rows:  2914


## Introduction

Regular expressions are a type of formal language for describing patterns.

Some points to know about regular expressions:

- Used in many applications
  - Concordancers
  - Key Word in Context (KWIC) tools
  - File and database search tools
  - Programming languages such as Perl, Python, Java…
  - Grep 


- Not standardized
  - Different applications use different "flavors"
  - You should learn the basics of what you *can* do
      - details of *how* to do it in different regex flavors will follow from that
  - We will focus on Python re and regex libraries
  
  
- Flexible
    - There are usually multiple regexes that can be used to describe or capture a pattern
    - Some tradeoffs:
        - readability
        - precision
        - speed
        - backtracking, etc.
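
As a small illustration of that flexibility, here is a sketch that spells the same pattern three ways. The sample sentence is made up for this example and is not part of the lab data; the patterns differ only in readability, not in what they match on this input.

```python
import re

# Hypothetical sample text, not taken from the lab data.
sample = "Call 555 or 911, not 12 or 4."

# Three spellings of "exactly three digits, standing alone as a word".
patterns = [
    r"\b\d{3}\b",            # built-in digit class plus a counted quantifier
    r"\b[0-9]{3}\b",         # explicit range plus a counted quantifier
    r"\b[0-9][0-9][0-9]\b",  # explicit range, repeated by hand
]

results = [re.findall(p, sample) for p in patterns]
for p, r in zip(patterns, results):
    print(p, "->", r)
```

All three find the same two numbers here. Which one you prefer is mostly a readability choice.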


## Note on the Use of AI

This class may be part of the "tweener" generation between those that have had to learn regular expressions and practice them, and those that may use LLMs to write them.

LLMs are very good at writing regular expressions when the intent is clear, constrained, and testable — and noticeably worse when intent is vague, underspecified, or linguistic. 

There are lots of regexes in training data, so LLMs should be very good at matching canonical patterns. But if you are the least bit unclear, you can get a poor but confident answer. 

My best advice is:
- Practice writing regular expressions so you are comfortable working with them. You want to be able to confidently identify when there is a problem.
- When you practice, start with the smallest, simplest expression in a tool like [https://regexr.com](https://regexr.com) and then gradually build up, while visually confirming with test text that it's doing what you want. (This is more fun than Wordle.)
- Get your LLM code assistant to write tests, before you get help with writing the regex. LLMs are good at writing tests, though you might not want to entirely rely on them.
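
One way to put the "tests first" advice into practice is sketched below. The task (matching a Twitter-style `@handle`), the helper name `check`, and the test strings are all assumptions made up for illustration; the point is only that the positive and negative cases exist before the pattern does.

```python
import re

# Hypothetical task: match a Twitter-style @handle (letters, digits, underscore).
CANDIDATE = r"@\w+"

# Write these lists first, before writing or requesting the regex.
should_match = ["@carly_senatore", "@nlp401"]
should_not_match = ["carly_senatore", "@", "user@"]

def check(pattern, positives, negatives):
    """Return a list of (problem, text) pairs the pattern gets wrong."""
    failures = []
    for text in positives:
        if re.fullmatch(pattern, text) is None:
            failures.append(("missed", text))
    for text in negatives:
        if re.fullmatch(pattern, text) is not None:
            failures.append(("false match", text))
    return failures

failures = check(CANDIDATE, should_match, should_not_match)
print(failures)  # an empty list means every case passed
```

If an LLM hands you a different pattern, you can drop it into `CANDIDATE` and rerun the same cases.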


## Terminology
    
A little common *jargon* will help our discussion.

| Term     | Definition |
| ----------- | ----------- |
| character | a symbol used in writing; it may be a letter, logograph, punctuation mark, diacritic mark, space, or invisible character used to control formatting |
| regex | short for regular expression, often used interchangeably with "pattern" |
| string | a sequence of characters |
| string literal | or just "literal", a sequence of characters to be interpreted exactly as it is |
| quantifier | a special character in a regular expression language that conveys how many times the item it modifies should match |
| operator | a special character in a regular expression language that allows for the creation and manipulation of groups, character classes and other regex functions |
| character class | a set of characters, which can be either predefined by the regex interpreter or created using operators and groups |
| backtracking | the power and danger of regexes is in their ability to traverse the finite state machine representing the pattern. Backtracking happens when the engine consumes characters from the input, gets to a point where it can't match any more, and goes backward to see if it can find another path to a successful match. |
| regex flags | options you can pass to a regex engine to alter its default behavior, e.g. regex.VERSION1, regex.IGNORECASE, or regex.MULTILINE. For examples, see https://pypi.org/project/regex/ |

## Reading and Exercises

### Introduction

Try out each of the following regexes.  For this we'll use the Twitter text and the re package.

In [None]:
#| eval: false

# find all instances of the string 'car' anywhere in the text
print(re.findall(r"car", all_tweets))

['car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car']


In [None]:
#| eval: false

# find car followed by an optional 't'
print(set(re.findall(r"cart?", all_tweets)))

{'cart', 'car'}


In [None]:
#| eval: false

# find 'car' starting at a word boundary, followed by exactly one of any character,
# and then another word boundary
print(set(re.findall(r"\bcar.\b", all_tweets)))

{'care', 'car '}


In [None]:
#| eval: false

print(set(re.findall(r"car.", all_tweets)))

{'care', 'car,', 'carb', 'caro', 'card', 'carm', 'cart', 'carr', 'carl', 'cary', 'carc', 'cari', 'car '}


In [None]:
#| eval: false

# find all instances of 'car' followed by one or more of any character
# yikes, this is a greedy operator and matches the first instance of 'car'
# and then all of the rest of the text
print(set(re.findall(r"\bcar.+\b", all_tweets)))



In [None]:
#| eval: false

# we can fix this by converting it to use a reluctant operator  .+?
print(set(re.findall(r"car.+?\b", all_tweets)))

{'carly_senatore', 'cards', 'caring', 'carcertaion', 'carmen', 'carol_holman', 'carbon', 'carrying', 'care', 'car, ', 'cared', 'carbonemissions', 'carribean', 'carbontaxscam', 'car . #', 'careless', 'carthur', 'carriage', 'car ', 'cares', 'cartography', 'carries', 'careBot', 'career', 'cary', 'carry', 'careful', 'carefully'}


In [None]:
#| eval: false

# a more precise technique than using the dot operator is to define your own
# character class.  Here we search for 'car' followed by one or more of any lowercase
# Latin letter in the range from a through z
#
# Note that this set does not include the underscore, nor allow other punctuation
print(set(re.findall(r"car[a-z]+", all_tweets)))

{'cards', 'caring', 'carcertaion', 'carmen', 'carbon', 'carrying', 'care', 'cared', 'carbonemissions', 'carribean', 'carbontaxscam', 'careless', 'carthur', 'carol', 'carriage', 'cartography', 'cares', 'carries', 'career', 'cary', 'carry', 'careful', 'carly', 'carefully'}


Here are a couple of exercises for you! For the first few, design and test your regular expression using an online regex tester like [https://regexr.com](https://regexr.com) before you put it into code. Later, we'll dispense with the code (since you'll be comfortable with it), and just focus on the regular expressions themselves.

#### Exercise 1
Design the regex, and paste this into the test string area: `Dogs bark. Cats meow. Dogs run. Birds fly.`

Write a regular expression that matches every occurrence of the word `Dogs`. 

What should be matched:

- ✅ Dogs (first occurrence)
- ✅ Dogs (second occurrence)

What should not be matched:

- ❌ Dog
- ❌ dogs (lowercase)
- ❌ Dogs (with trailing space)
- ❌ Any surrounding punctuation

Write down your final regex pattern. Then test it below.

In [None]:
#| eval: false

text = "Dogs bark. Cats meow. Dogs run. Birds fly."
re.findall(r"<YOUR_REGEX_HERE>", text)

[]

Confirm that:

- The result is a list
- The list length matches the number of highlighted matches you saw on regexr.com

If it does, congratulations!! You are on your way to mastering regular expressions! 

We're going to follow the same process for the other regular expressions below. Copy the test sentence to regexr.com, then find the matches from the exercise.

In [None]:
q1_answer = "Type your regex here after it works"

#### Exercise 2

You are practicing the `+` quantifier, which means “**one or more of the preceding character**,” and being introduced to **word boundaries** (`\b`).

Write a regular expression that matches each run of one or more `o` characters, only when the `o`'s occur inside a word. Confirm that each group of repeated `o`'s is highlighted as a single match. 

What should be matched:
- ✅ soooo → oooo
- ✅ cool → oo
- ✅ nooo → ooo

What should not be matched

- ❌ The letters s, c, or n
- ❌ Single letters outside the repeated os
- ❌ Any punctuation or spaces

In [None]:
#| eval: false

text = "soooo cool!!! nooo way!"
re.findall(r"<YOUR_REGEX_HERE>", text)

[]

Confirm that:

- The result is a list
- Each list element corresponds to exactly one highlighted o sequence
- The number of matches equals the number of highlighted regions on regexr.com

If it does, congratulations — you’ve now used quantifiers and word boundaries to precisely control what counts as a match!

In [None]:
q2_answer = "Type your regex here after it works"

#### Exercise 3

In Exercise 2, you used word boundaries (\b) to control what counts as a match.

In this exercise, you will remove the word boundaries and observe how the matches change.

Now write a regular expression that matches each run of one or more o characters, but do not use word boundaries.

Confirm what is highlighted.

What should be matched:
- ✅ soooo → oooo
- ✅ cool → oo
- ✅ nooo → ooo

(You should still see the same highlighted o sequences as before.)

What should surprise you:

- Nothing new appears in this example
- The regex engine has less information about where words begin and end
- The pattern now relies entirely on character repetition, not word structure

In [None]:
#| eval: false 

text = "soooo cool!!! nooo way!" 
re.findall(r"<YOUR_REGEX_WITHOUT_WORD_BOUNDARIES>", text)

[]

In [None]:
q3_answer = "Type your regex here after it works"

Word boundaries don’t always change the result — but they make your intent explicit and your regex safer. Why do you think this is?

A note on word boundaries (why they matter)

In the previous exercises, removing word boundaries didn’t change the result — but that was accidental.

Let’s look at a case where word boundaries do matter.

Go to https://regexr.com and paste the following text into the test string area:

`car care career scared cartoon`

Now test two patterns, one at a time.

Pattern 1 (no word boundaries):

`car`

What you will see highlighted:

- car
- care
- career
- scared
- cartoon

The regex engine is matching any occurrence of the substring car, no matter where it appears.

Pattern 2 (with word boundaries):

`\bcar\b`

What you will see highlighted:

- car (only)

Now the match must:

- start at a word boundary
- end at a word boundary

This tells the regex engine: “match the word car, not car as part of a larger word.”
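
The same comparison can be run in Python. This is a minimal sketch using the standard-library `re` module and the test string from above:

```python
import re

text = "car care career scared cartoon"

# Without boundaries: every occurrence of the substring, one per word here.
substring_hits = re.findall(r"car", text)

# With boundaries: only the standalone word.
whole_word_hits = re.findall(r"\bcar\b", text)
whole_word_spans = [m.span() for m in re.finditer(r"\bcar\b", text)]

print(len(substring_hits))
print(whole_word_hits, whole_word_spans)
```

`finditer` is used for the spans because `findall` returns only the matched text, not where it was found.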

All of the code blocks you just executed used *regular expressions* to search through our giant Tweet string. Regular expressions match substrings, not “words” or “concepts”, unless you explicitly tell them to. The basic components of regular expressions, with some examples, include:

  1. ***quantifiers***:   `+`  `?`  `*`  `{n,m}`
  2. ***operators***:  `.`  `|`  `[]`  `()`  `&`  `-`
  3. ***character classes***:   `[a-z]`  `[0-9]`  `\d`  `\s`
  

### Quantifiers


![regex_quantifiers](images/regexes/re_quantifiers.png)

Quantifiers tell the regex engine how many times the preceding element is allowed to match.

In [None]:
#| eval: false

test_text = "aaaaaaaaaaaaaargh!"

In [None]:
#| eval: false

# find each character 'a' - one at a time
print(re.findall(r"a", test_text))

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']


In [None]:
#| eval: false

# find one or more of the character 'a'
print(re.findall(r"a+", test_text))

['aaaaaaaaaaaaaa']


In [None]:
#| eval: false

# find one or more of the character a followed optionally by one of any character
print(re.findall(r"a+.?", test_text))

['aaaaaaaaaaaaaar']


In [None]:
#| eval: false

# find one or more of the character 'a' followed by zero or more of any character
print(re.findall(r"a+.*", test_text))

['aaaaaaaaaaaaaargh!']


In [None]:
#| eval: false

# find two to four of the character 'a' followed by an 'r'
print(re.findall(r"a{2,4}r", test_text))

['aaaar']


As you can see, quantifiers are really powerful.  However, with great power comes great responsibility.

![Sample preprocessing pipeline](images/regexes/re_greed_1.png)

In [None]:
#| eval: false

# This greedy operator matches 'bad, bad, greed' - probably not what we intended
test_phrase = 'You are a bad, bad, greedy operator.'
print(set(re.findall(r"b.+d", test_phrase)))

{'bad, bad, greed'}


In [None]:
#| eval: false

# But if we convert this to a reluctant operator, it behaves much better
print(set(re.findall(r"b.+?d", test_phrase)))

{'bad'}


In [None]:
#| eval: false

# This can seem a lot worse if you run the regex over a very large
# string, such as our all_tweets text
#
# This regular expression finds the first word starting with 'car' 
# AND the entire rest of the text that follows it, despite that \b
# telling it to stop at a word boundary!  Whoops!
print(set(re.findall(r"\bcar.+\b", all_tweets)))



In [None]:
#| eval: false

# Now if we constrain it to match reluctantly, we get something more reasonable
print(set(re.findall(r"\bcar.+?\b", all_tweets)))

{'carly_senatore', 'cards', 'caring', 'carol_holman', 'carbon', 'carrying', 'care', 'car, ', 'carbonemissions', 'carbontaxscam', 'car . #', 'careless', 'car ', 'cares', 'cartography', 'carries', 'career', 'carry', 'careful', 'carefully'}


![Sample preprocessing pipeline](images/regexes/re_greed_4.png)

The *dot operator* matches one of any character, and the *Kleene star operator* matches the element preceding it zero or more times. Together, `.*` will keep matching for as long as it can, and only then look for the next element of the regex. If the rest of the pattern cannot match at that point, the engine backtracks, giving back one character at a time and trying again. Depending on how far it got, this can become extremely inefficient. It's usually better to write a more specific regular expression describing exactly what you want to match.
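
Here is a small sketch of what "more specific" can look like. The sentence is a made-up sample, and the task (pulling out quoted phrases) is only an illustration; the second pattern swaps the dot for a negated character class so the match cannot run past the closing quote.

```python
import re

# Hypothetical sample, not from the tweet data.
line = 'She said "hello" and then "goodbye" to everyone.'

# Greedy dot-star: runs to the LAST quote on the line, then stops.
greedy = re.findall(r'".*"', line)

# More specific: "anything except a quote", so it stops at the next quote.
specific = re.findall(r'"[^"]*"', line)

print(greedy)
print(specific)
```

The specific version also gives the engine nothing to backtrack over, because `[^"]*` can never consume a closing quote in the first place.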


Not all regular expression engines enforce this type of greediness. However, it's best to keep in mind the following characterization of the greediness of operators.

![Sample preprocessing pipeline](images/regexes/re_greed_3.png)

#### Exercise 4

You are practicing the Kleene star (*) and greedy matching.

Design the regex, and paste the following text into the test string area on https://regexr.com:

`I love my car, but my carbon footprint worries me.`

Write a regular expression that matches car followed by any characters until the end of the word.

What should be matched:

- ✅ car,
- ✅ carbon

What should not be matched:

- ❌ The word care (it does not appear here)
- ❌ Anything before car
- ❌ Entire sentences or multiple words

On regexr.com, you should see the match extend as far as possible after car.

Write down your final regex pattern. Then test it below.

In [None]:
#| eval: false 

text = "I love my car, but my carbon footprint worries me." 
re.findall(r"<YOUR_GREEDY_REGEX>", text)

[]

In [None]:
q4_answer = "Type your regex here after it works"

Confirm that:

- The match after car extends as far as the regex allows
- The regex engine prefers the longest possible match

#### Exercise 5

Now you will make the same pattern reluctant (non-greedy).

Use the same text on https://regexr.com:

`I love my car, but my carbon footprint worries me.`

Modify your regex from Exercise 4 so that it matches the shortest possible sequence after car.

What should be matched:

- ✅ car
- ✅ carbon

(Only the minimal necessary characters should be highlighted.)

What should not be matched:

- ❌ Long stretches of text after car
- ❌ Multiple words after car

On regexr.com, you should see much shorter highlighted matches than in Exercise 4.

Write down your final regex pattern. Then test it below.

In [None]:
#| eval: false 

text = "I love my car, but my carbon footprint worries me." 
re.findall(r"<YOUR_RELUCTANT_REGEX>", text)

[]

In [None]:
q5_answer = "Type your regex here after it works"

Confirm that:

- The reluctant quantifier stops matching as soon as it can
- The greedy version matched more text than the reluctant version

Key takeaway:

- Greedy quantifiers match as much as possible.
- Reluctant quantifiers match as little as possible.

Both follow the same pattern—only the ? changes their behavior.
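
The takeaway applies to every quantifier, not just `+`. The sketch below uses a short made-up string (an assumption for this example) and `re.search`, which returns the first match, to show what each variant grabs.

```python
import re

# Hypothetical sample string, shorter than the one used earlier.
text = "aaaargh!"

greedy_plus = re.search(r"a+", text).group()           # as many a's as possible
reluctant_plus = re.search(r"a+?", text).group()       # as few as possible: one
reluctant_star = re.search(r"a*?", text).group()       # zero is allowed, so none
reluctant_range = re.search(r"a{2,4}?", text).group()  # the minimum of the range

# A reluctant quantifier still grows until the rest of the pattern can match.
reluctant_then_r = re.search(r"a+?r", text).group()

print(repr(greedy_plus), repr(reluctant_plus), repr(reluctant_star))
print(repr(reluctant_range), repr(reluctant_then_r))
```

The last line is worth a second look: "as little as possible" means as little as still lets the whole pattern succeed.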

### Character Classes

In order to write a more precise pattern, you'll need to use character classes.  Regular expression languages provide the syntax necessary to do so. Some character classes come built in.  Others can be constructed by the writer of the regex.  We'll take a look at both.

![Sample preprocessing pipeline](images/regexes/re_char_classes.png)

Practically all regex engines provide this basic set of built-in character classes.  Let's try using them in some regexes.

In [None]:
#| eval: false

# find all sequences of one or more digits
# \d is a character class for digits
print(set(re.findall(r"\d+", all_tweets)))

{'9009', '100', '724', '106', '57', '9000', '44', '600', '605', '00144071', '48', '61', '107', '160', '7', '1999', '850', '74', '200', '10', '292', '92', '0018', '38', '62727983', '5', '13', '116', '730', '25', '1500', '615', '0000023', '27', '876', '1937', '68', '334', '88', '61032', '847', '83', '451', '4', '19', '2010', '1982', '1995', '67', '45', '47', '230', '3000', '52', '35', '11', '143', '411', '51', '2000', '027', '70', '361', '884', '707', '36', '29', '1181', '180', '26', '1791', '007', '0', '75', '12566', '24', '109', '101', '37', '3015', '41', '23', '167', '2016', '140', '242', '96', '300', '86660132', '14', '2013', '3', '666', '604', '1964', '9', '250', '72', '86', '293', '1927', '130', '1973', '911', '8', '216', '28', '1988', '21', '54', '000', '2700', '40', '118', '32', '277', '1996', '49', '15', '9975', '3220', '15544', '120', '14215', '223', '1462', '775', '65', '59', '73', '1376', '193', '69', '720', '003', '866', '316', '312', '01', '1960', '39', '390', '231', '1992'

In [None]:
#| eval: false

# find all sequences of one or more word chars
# this basically gives you alphanumeric words possibly containing underscores
print(set(re.findall(r"\b\w+?\b", all_tweets)))





Many regular expression engines have expanded the scope of ***\w*** to include characters from other writing systems.  This is the case in Python's *re* library: for `str` patterns, `\w` matches Unicode word characters by default.  Later in this lab, we'll take a look at Unicode regular expressions and how we can control matching based on the writing system we are interested in matching against.  We'll also introduce an expanded set of built-in character classes available in the *regex* package.  

In [None]:
#| eval: false

# named sample_text rather than input, so we don't shadow Python's built-in input()
sample_text = "أنا أحب القراءة.  我喜欢阅读。"
print(set(re.findall(r"\b\w+?\b", sample_text)))

{'أنا', '我喜欢阅读', 'أحب', 'القراءة'}
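
If you want the narrower, ASCII-only meaning of `\w`, the standard-library `re` module has an `re.ASCII` flag. The sketch below is a minimal comparison on a made-up sample; the accented letters are written as escapes only so the example does not depend on how the file is encoded.

```python
import re

# Hypothetical sample: "café naïve word_1" (é is \u00e9, ï is \u00ef).
sample = "caf\u00e9 na\u00efve word_1"

# Default for str patterns: \w is Unicode-aware.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII: \w means only [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", sample, re.ASCII)

print(unicode_words)
print(ascii_words)
```

Neither setting is "right" in general. It depends on whether the text you are matching is expected to contain non-ASCII letters.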


Character classes can also be defined using regular expression *operators*.  The simplest way to describe a character class is using a *range* defined inside of square brackets.

In [None]:
#| eval: false

# find any lowercase character from a to z, one or more times
# Note that, unlike \w, this includes no uppercase characters, digits or underscores
print(set(re.findall(r"\b[a-z]+?\b", all_tweets)))



Other operators provide even more expressiveness in defining character classes, allowing you to define sets, including intersection and subtraction of sets.

![Sample preprocessing pipeline](images/regexes/re_char_classes_advanced.png)

In [None]:
#| eval: false

# find a sequence of one or more characters not in the set {space, a,e,i,o,u}
# notice that this matches numbers, punctuation and capital letters 
# (and a lot of other things not present in our test data)
test_data = "This will be 1 striking way do look at A SET 42 non-vowel groups!"
print(set(re.findall(r"[^aeiou\s]+", test_data)))

{'k', 'y', 'Th', 'b', 'SET', 'gr', 'n-v', 'str', 'l', 'A', 'n', 's', '42', 'll', 'd', 'w', '1', 'ps!', 't', 'ng'}


In [None]:
#| eval: false

# Let's just add an ignore case flag to this regular expression 
# to avoid an ugly set definition such as [^aAeEiIoOuU]
print(set(re.findall(r"[^aeiou\s]+", test_data, re.IGNORECASE)))

{'k', 'y', 'Th', 'b', 'gr', 'n-v', 'str', 'l', 'n', 'S', 's', '42', 'll', 'd', 'w', '1', 'ps!', 't', 'ng', 'T'}


In [None]:
#| eval: false

# Set operations with the regex package require the regex.VERSION1 flag.
# Here && (intersection) with a negated set works as set subtraction.
#
# find one or more of any character between a and z except a,e,i,o and u.
print(set(regex.findall(r"[[a-z]&&[^aeiou]]+", test_data, regex.IGNORECASE|regex.VERSION1)))

{'n', 'd', 'w', 'S', 'v', 'gr', 't', 'k', 'Th', 'ng', 's', 'str', 'y', 'l', 'T', 'ps', 'll', 'b'}


In [None]:
#| eval: false

# Set subtraction does not work in the re package.
# It was supposed to be coming, but seems to have been abandoned.
print(set(re.findall(r"[[eiou\s]--[a]]+", test_data, re.IGNORECASE)))

set()
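
If you are limited to the standard-library `re` module, one workaround (a sketch, not the only option) is to spell the subtracted set out by hand as explicit ranges. The class below is "a through z with the vowels left out", and it is run on the same `test_data` string used above.

```python
import re

test_data = "This will be 1 striking way do look at A SET 42 non-vowel groups!"

# a-z minus a, e, i, o, u, written as the ranges that remain:
# b-d, f-h, j-n, p-t, v-z
consonants = r"[b-df-hj-np-tv-z]+"

result = set(re.findall(consonants, test_data, re.IGNORECASE))
print(sorted(result))
```

This gives the same set of matches as the `regex` version with `&&`, at the cost of a class that is harder to read and easy to get wrong by one letter.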


From now on, we'll use the regex package, because not only does it allow for richer set definitions, but it also offers more complete handling of Unicode regular expressions. You do not need to do anything different on regexr.com, because it's PCRE compatible and has a superset of `re` functionality. 

#### Exercise 6

Earlier, you saw that this pattern:

`car.`

matches any character after car, including letters, spaces, and punctuation.

Now try something more controlled.

Go to https://regexr.com and paste the following text into the test string area:

`car care cart carb car! car?`

Test a pattern that uses character classes to meet the following conditions:

What should be matched:

- ✅ care
- ✅ cart
- ✅ carb

What should not be matched:

- ❌ car!
- ❌ car?
- ❌ car (with a space)

You’ve now restricted the match to cases where car is followed by a lowercase letter.

In [None]:
q6_answer = "Type your regex here after it works"

#### Exercise 7

Character classes can be written in different ways.

Compare these two patterns on regexr.com with the same test text:

- `car[aeiou]`
- `car[a-z]`

What changes:

- car[aeiou] only matches words where car is followed by a vowel
- car[a-z] matches any lowercase letter

This lets you be precise about what kind of character is allowed to appear next.

Key takeaway

- `.` means “any character.”
- Character classes mean “one character from this specific set.”

This is one of the most important tools for making regexes safer and more readable.
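
To check the comparison from Exercise 7 in code, here is a minimal sketch using the standard-library `re` module and the test text from Exercise 6:

```python
import re

text = "car care cart carb car! car?"

any_char = re.findall(r"car.", text)          # dot: any one character
vowel_only = re.findall(r"car[aeiou]", text)  # one character from a small set
lowercase = re.findall(r"car[a-z]", text)     # one character from a range

print(any_char)
print(vowel_only)
print(lowercase)
```

The dot version picks up the space and the punctuation as well, which is exactly what the character classes rule out.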

#### Exercise 8


At first glance, `[a-z]` and `\w` can look similar — but they mean **very different things**.

Go to **[https://regexr.com](https://regexr.com)** and paste the following text into the test string area:

```
word WORD word_1 café naïve 123 _hidden
```

Now test a pattern that meets the following conditions:


**What is matched:**

* `word`

**What is not matched:**

* `WORD` (uppercase)
* `word_1` (numbers and `_`)
* `café`, `naïve` (non-ASCII letters)
* `123`
* `_hidden`

Your pattern should mean: *one lowercase ASCII letter only*.

---

Now try a pattern built on `\w`.

**What is matched:**

* `word`
* `WORD`
* `word_1`
* `123`
* `_hidden`

**What may surprise you:**

* Depending on the regex engine and flags, accented characters like `é` may **not** be included
* `_` (underscore) **is** considered a “word character”

This pattern means: *letters, digits, and underscore* — not “linguistic word.”

### Logical Operators

The set notation we've just looked at reveals another useful tool for defining patterns.  ***Logical operators*** allow you to express Boolean logic with the operators `AND`, `OR` and `NOT`.

![Sample preprocessing pipeline](images/regexes/re_logical_ops.png)

Let's get a little more practice using logical operators.  

In [None]:
#| eval: false

# Find all the a or z characters 
print(re.findall(r"(a|z)", all_tweets))

['z', 'a', 'a', 'a', 'a', 'a', 'z', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'z', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'z', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'z', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'z', 'a', 'z', 'z', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',

In [None]:
#| eval: false

# Find all the characters that are not a or z 
# just print the set, so we can see the unique characters
print(set(regex.findall(r"[^az]", all_tweets, regex.IGNORECASE|regex.VERSION1)))

{'v', 'R', 'X', '7', 'u', '-', '*', '<', 'P', '5', "'", ';', 'F', 'D', '_', 'i', '"', '4', '~', 'q', 'T', 'W', 'B', ',', '0', 'b', 'Y', 'p', 'J', ':', '?', '^', 'n', '$', 'N', '3', 'm', 'Q', '9', 'g', ')', 'O', 'K', 'U', 'c', 'd', '8', '\\', 'H', 't', 'f', '.', '#', '+', 'e', 'j', 'y', '!', 'G', 'o', ' ', '|', '>', 'h', '/', ']', 's', 'x', '%', '2', 'w', 'C', '@', '(', 'I', 'E', 'k', '&', 'L', 'l', '6', '=', 'S', '[', 'M', 'V', '1', 'r'}


In [None]:
#| eval: false

# This might be more interesting:
# Find all the non space characters
print(sorted(set(regex.findall(r"[^\s]", all_tweets, regex.VERSION1))))

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~']


In [None]:
#| eval: false

# How about finding all the non-space, non-letter characters
print(sorted(set(regex.findall(r"[^\sa-zA-Z]", all_tweets, regex.VERSION1))))

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '|', '~']


Notice that we introduced a third argument into the call to `findall()`.  The behavior of the regex matcher can be controlled using predefined *flags* that you pass to it.  The flags available for use in the regex package are described at its [PyPI site](https://pypi.org/project/regex/).  The flags can be combined together with a vertical bar (i.e. they can be *OR*'ed together).  In the following regular expression, `regex.DOTALL` tells the regex engine that the dot is allowed to match the newline character, '\n', and `regex.IGNORECASE` asks the engine to ignore orthographic case.

In [None]:
#| eval: false

search_me = 'I like the red and\nThe blue and the green AND THE PURPLE!'
pattern = regex.compile(r"d.t",regex.DOTALL|regex.IGNORECASE)
print(regex.findall(pattern, search_me ))

['d\nT', 'd t', 'D T']


#### Exercise 9

Regular expressions support simple **logical operations** that let you say:

- **OR**: match *this* **or** *that*
- **NOT**: match anything *except* these characters

These operators let you widen or restrict matches **without writing multiple regexes**.

---

The OR operator is written using a vertical bar:

```
pattern1|pattern2
```

This means: *match `pattern1` OR match `pattern2`.*

Go to **https://regexr.com** and paste the following text into the test string area:

```
I drive a car. She rides a bike. They take the bus.
```

Test a pattern that meets the following conditions:

---

**What should be matched:**

- ✅ `car`
- ✅ `bike`

---

**What should not be matched:**

- ❌ `bus`
- ❌ Any surrounding punctuation or spaces

On regexr.com, you should see **only the words `car` and `bike` highlighted**, wherever they appear.

---

**Key idea:**

> OR combines *entire patterns*, not individual characters.
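
A quick sketch of what that means in practice. The words are a made-up sample, and `(?:...)` is a non-capturing group, used here as an assumption of convenience so that `findall` returns the whole match rather than just the group's contents.

```python
import re

# Hypothetical sample text.
text = "gray grey gravy"

# Without grouping, | splits the ENTIRE pattern: "gra" OR "ey".
loose = re.findall(r"gra|ey", text)

# With grouping, only the middle letter is an alternative:
# "gr", then a or e, then "y".
grouped = re.findall(r"gr(?:a|e)y", text)

print(loose)
print(grouped)
```

When an alternation matches more (or less) than you expected, missing parentheses are a good first thing to check.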

---

A **negated character class** lets you say *“match any character except these.”*

Negation is written by placing `^` **inside** square brackets:

```
[^abc]
```

This matches **any single character that is NOT** `a`, `b`, or `c`.

Go to **https://regexr.com** and paste this text into the test string area:

```
car cab cap cat
```

Now test this pattern:

```
ca[^bpt]
```

---

**What should be matched:**

- ✅ `car`

---

**What should not be matched:**

- ❌ `cab`
- ❌ `cap`
- ❌ `cat`

The pattern means:
- `ca`
- followed by **one character**
- that is *not* `b`, `p`, or `t`

---

Important distinction

- `^` **inside** `[]` means **NOT**
- `^` **outside** `[]` means **start of string** (you’ll see this later)

Context matters.

---

Key takeaway

> `|` lets you express alternatives.  
> `[^ ]` lets you express exclusion.

Together, these operators let you control *what is allowed* and *what is forbidden* in a match.

#### Exercise 10

A **negated character class** lets you say:

> “Match any character *except* these.”

Negation is written by placing `^` **inside** square brackets.

```
[^abc]
```

This matches **any single character that is not** `a`, `b`, or `c`.

---

Go to **https://regexr.com** and paste the following text into the test string area:

```
car cab cap cat
```

Now test a pattern with negation:

**What is matched:**

- ✅ `car`

---

**What is not matched:**

- ❌ `cab`
- ❌ `cap`
- ❌ `cat`

---

**How to read this pattern:**

- `ca`  
- followed by **one character**  
- that is **not** `b`, `p`, or `t`

---

Important distinction:

- `^` **inside** `[]` → **NOT**
- `^` **outside** `[]` → **start of string**

Same symbol, different meaning depending on position.



### Boundary characters and anchors


Now let's move on to another useful piece of regex syntax:  boundaries and anchors.

![Sample preprocessing pipeline](images/regexes/re_boundaries_1.png)

Boundary characters are very convenient: rather than matching a character, they match a position, which lets you control where in your input text a match is allowed to occur.  Imagine that you want to convert an uppercase letter to lowercase, but only at the beginning of a line.

In [None]:
#| eval: false

# replace capital T at the beginning of a line with lowercase t
# but don't replace it anywhere else, please

thisthis = "This is This and that is that!"
result = regex.sub(r"^T","t", thisthis)
print(result)

this is This and that is that!
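
By default `^` matches only at the very start of the string. If your text has several lines and you want `^` to match at the start of each one, you need the `MULTILINE` flag. The sketch below uses the standard-library `re` module and a made-up two-line string.

```python
import re

# Hypothetical two-line sample.
two_lines = "This is This\nThat is That"

# Default: ^ anchors to the start of the whole string only.
string_start_only = re.sub(r"^T", "t", two_lines)

# MULTILINE: ^ also anchors to the position right after each newline.
every_line_start = re.sub(r"^T", "t", two_lines, flags=re.MULTILINE)

print(string_start_only)
print(every_line_start)
```

Note that `flags` is passed by keyword here. The fourth positional argument of `re.sub` is `count`, so passing a flag positionally would silently be treated as a replacement limit.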


***Note:*** Don't confuse the `^` anchor, which marks the beginning of the string, with the `^` *not operator*.  The *not operator* occurs inside of the square brackets used to define character classes, e.g. `[^aeiou]` means *a character that is not in the set [aeiou]*.

![Sample preprocessing pipeline](images/regexes/re_boundaries_2.png)

In [None]:
#| eval: false

# print whatever character is after each word boundary '\b'
print(regex.findall(r"\b(.)", "<noun>big_fish_42</noun>"))

['n', '>', 'b', '<', 'n', '>']
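
A common use of `\b` is matching whole words only. Here is a small sketch with the standard-library `re` module and a made-up test string. Notice that the underscore counts as a word character, which is why there was no boundary inside `big_fish_42` above.

```python
import re

text = "cat concat catalog cat. big_fish"

# "cat" as a whole word: a boundary is required on both sides
whole_word = re.findall(r"\bcat\b", text)
print(whole_word)    # ['cat', 'cat']

# words that merely START with "cat"
starts_with = re.findall(r"\bcat\w*", text)
print(starts_with)   # ['cat', 'catalog', 'cat']

# "_" is a word character, so there is no boundary between "_" and "f"
fish = re.findall(r"\bfish\b", text)
print(fish)          # []
```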


### Match groups

Another useful aspect of regular expressions is the ability to define groups and to access them, both from within the regex itself, e.g. through back-reference escapes such as `\1`, and from your code after the regex has found its matches.

![Sample preprocessing pipeline](images/regexes/re_groups_1.png)

In [None]:
#| eval: false

# find any non-space character, then find the same character one or more times
# this lets you find repeated characters! Aaaah!  SOoooooo cooOOOool! <-- better ignore
# case if you want to find that
print(set(regex.findall(r"(([^\s])\2+)", all_tweets )))

{('......', '.'), ('$$$', '$'), ('666', '6'), ('yyy', 'y'), ('HH', 'H'), ('999', '9'), ('dd', 'd'), ('LL', 'L'), ('ii', 'i'), ('aa', 'a'), ('00000', '0'), ('DD', 'D'), ('OO', 'O'), ('xxx', 'x'), ('FFF', 'F'), ('YY', 'Y'), ('AAA', 'A'), ('eee', 'e'), ("''", "'"), ('CC', 'C'), ('^^', '^'), ('ww', 'w'), ('AA', 'A'), ('lll', 'l'), ('00', '0'), ('ff', 'f'), ('111', '1'), ('hh', 'h'), ('OOOO', 'O'), ('_____', '_'), ('!!', '!'), ('bb', 'b'), ('kk', 'k'), ('gg', 'g'), ('11', '1'), ('nn', 'n'), ('ooo', 'o'), ('XX', 'X'), ('EEEEEEEEEEE', 'E'), ('???', '?'), ('BB', 'B'), ('??', '?'), ('CCC', 'C'), ('000', '0'), ('!!!!!!', '!'), ('___', '_'), ('FF', 'F'), ('MMM', 'M'), ('NNN', 'N'), ('MM', 'M'), ('hhh', 'h'), ('aaa', 'a'), ('mmm', 'm'), ('77', '7'), ('____', '_'), ('99', '9'), ('))))', ')'), ('AAAAA', 'A'), ('RR', 'R'), ('....', '.'), ('!!!', '!'), ('22', '2'), ('cc', 'c'), ('mmmm', 'm'), ('tt', 't'), ('..', '.'), ('NN', 'N'), ('hhhh', 'h'), ('uuu', 'u'), ('44', '4'), ('55', '5'), ('ccc', 'c'), ('

Notice that in the cool repeated character sample above, you get a set of tuples.  That is because in this case the regex engine is giving you match-group one and match-group two, which are, again, numbered from left to right by the location of the parentheses.
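
To see the numbering rule in isolation, here is a minimal sketch using the standard-library `re` module and a made-up string. Groups are numbered by the position of their opening parenthesis, and group 0 is always the whole match.

```python
import re

m = re.search(r"((\d+)-(\d+))", "pages 12-15")

whole = m.group(0)     # the entire match
outer = m.group(1)     # first "(" from the left: the outer group
first = m.group(2)     # second "(": digits before the hyphen
second = m.group(3)    # third "(": digits after the hyphen
print(whole, outer, first, second)   # 12-15 12-15 12 15
print(m.groups())                    # ('12-15', '12', '15')

# findall returns plain strings when the pattern has exactly one group,
# and tuples when it has two or more
one_group = re.findall(r"(\d+)-\d+", "12-15 20-30")
print(one_group)                     # ['12', '20']
```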

Now let's see how we can use the back-reference in a replacement function.

In [None]:
#| eval: false

test_silliness = "aaaAAAAaaah!  sweeEEeeet!  ooOOOoh!"
print("before: ",test_silliness)

# find a vowel, then one or more of whichever vowel you found, and replace it with
# just two of that vowel
pattern = regex.compile(r"([aeiou])(\1+)", regex.IGNORECASE)
down_to_two = regex.sub(pattern,r"\1\1", test_silliness)
print("after: ",down_to_two)

before:  aaaAAAAaaah!  sweeEEeeet!  ooOOOoh!
after:  aah!  sweet!  ooh!


In this *sweet* example, we replaced each run of a repeated vowel with `\1\1`, meaning two of whichever vowel we matched, rather than just one. That avoids ruining words that legitimately contain a doubled vowel, such as 'sweet'!
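
Back-references in the replacement string can also reorder groups. The sketch below uses the standard-library `re` module with made-up strings. It also shows the `\g<1>` form, which is needed when a digit follows the reference, since `\10` would be read as group 10.

```python
import re

names = "Lovelace, Ada; Turing, Alan"

# swap "Last, First" into "First Last" by reordering the two groups
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", names)
print(swapped)   # Ada Lovelace; Alan Turing

# \g<1> is an unambiguous way to write group 1 when a digit comes next
padded = re.sub(r"(\d)", r"\g<1>0", "a1 b2")
print(padded)    # a10 b20
```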

Another interesting use of match groups is to find just the piece or pieces that you want from within a larger match. In this example, the set of all matches is returned as a list of tuples, where each tuple holds the ordered match groups for one match.

In [None]:
#| eval: false

# write a regular expression to find the two sides of an equation
# group 1: left side (digits, + or -, digits); group 2: right side (digits)
math_thoughts = "The answer to 1+1=2 is too easy. Let's try 5+5=10"
math_pattern = regex.compile(r"(\d+[+-]\d+)=(\d+)")
equations = regex.findall(math_pattern, math_thoughts)

print(equations)
print("The answer to ",equations[0][0]," is:  ",equations[0][1])

[('1+1', '2'), ('5+5', '10')]
The answer to  1+1  is:   2
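
Indexing tuples by position gets hard to read as patterns grow. One alternative is *named* capturing groups, written `(?P<name>...)` in Python. This is a sketch with the standard-library `re` module; the group names `lhs` and `rhs` are made up for this illustration.

```python
import re

math_thoughts = "The answer to 1+1=2 is too easy. Let's try 5+5=10"

# same idea as above, but each group gets a name instead of a number
named_pattern = re.compile(r"(?P<lhs>\d+[+-]\d+)=(?P<rhs>\d+)")

# finditer yields match objects, so groups can be looked up by name
pairs = [(m.group("lhs"), m.group("rhs")) for m in named_pattern.finditer(math_thoughts)]
print(pairs)        # [('1+1', '2'), ('5+5', '10')]

# groupdict() returns all named groups of one match as a dictionary
first_match = named_pattern.search(math_thoughts).groupdict()
print(first_match)  # {'lhs': '1+1', 'rhs': '2'}
```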


### Escaping special characters

This example brings us to another feature of regular expression languages.  The fact that there are special, reserved characters that serve as *operators* and *quantifiers* means that there also needs to be a mechanism for matching those same characters in your text via a pattern.  The way to do that is by *escaping* them.  In regex, that mainly means preceding them with a backslash.

![Sample preprocessing pipeline](images/regexes/re_escaping.png)

![Sample preprocessing pipeline](images/regexes/re_escape_chars.png)

![Sample preprocessing pipeline](images/regexes/re_escape_probs.png)

In [None]:
#| eval: false

# let's just prove that to ourselves
test_noescape = "will it find 1+2 or 111111112?"
print(regex.findall(r"1+2", test_noescape))

['111111112']


In [None]:
#| eval: false

print(regex.findall(r"1\+2", test_noescape))

['1+2']
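
When the text you want to match literally comes from a variable, you don't have to add the backslashes by hand. The sketch below uses `re.escape` from the standard-library `re` module, and also shows that a character class is another way to match a literal `+`. The test string is made up.

```python
import re

text = "will it find 1+2 or 111111112?"
literal = "1+2"

# re.escape adds backslashes in front of regex special characters
escaped = re.escape(literal)
print(escaped)          # 1\+2

found = re.findall(escaped, text)
print(found)            # ['1+2']

# inside square brackets, + is just a plus sign
found_in_class = re.findall(r"1[+]2", text)
print(found_in_class)   # ['1+2']
```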


### Non-capturing groups

Now, returning for a moment to the question of match groups, what if you wanted to group part of a pattern without capturing it? For example, you might need a quantifier such as the *Kleene star* to apply to a group, but not want that group to show up in the match object. In this case, there is another special piece of syntax for non-grouping groups (yes, *really*): `(?:blah)`.

In [None]:
#| eval: false

float_thoughts = "The answer to 1.10+1.00=2.10 is too easy. Let's try 5.6+5.3=10.9"

# find the two sides of an equation involving floats
float_pattern_too_many_groups = regex.compile(r"(\d+(\.\d+)?[+-]\d+(\.\d+)?)=(\d+(\.\d+)?)")
equations = regex.findall(float_pattern_too_many_groups, float_thoughts)

print(equations)
print("The answer to ",equations[0][0]," is:  ",equations[0][1])
print("\n That's too many match groups for my liking,\n let's try again with non-grouping groups for the decimal places:\n")
float_pattern_noncapturing = regex.compile(r"(\d+(?:\.\d+)?[+-]\d+(?:\.\d+)?)=(\d+(?:\.\d+)?)")
equations = regex.findall(float_pattern_noncapturing, float_thoughts)
print(equations)
print("The answer to ",equations[0][0]," is:  ",equations[0][1])


[('1.10+1.00', '.10', '.00', '2.10', '.10'), ('5.6+5.3', '.6', '.3', '10.9', '.9')]
The answer to  1.10+1.00  is:   .10

 That's too many match groups for my liking,
 let's try again with non-grouping groups for the decimal places:

[('1.10+1.00', '2.10'), ('5.6+5.3', '10.9')]
The answer to  1.10+1.00  is:   2.10


So, putting `?:` right after the opening parenthesis, as in `(?:\.\d+)?`, lets you make the decimal point followed by one or more digits optional as a group, without adding that group to the match results. The official term for this is a *non-capturing group*.
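
The difference is easiest to see when a quantifier is applied to the group. This is a minimal sketch with the standard-library `re` module and a made-up string.

```python
import re

laughs = "ha haha hahaha"

# capturing group: findall returns only the group (its last repetition)
capturing = re.findall(r"(ha)+", laughs)
print(capturing)       # ['ha', 'ha', 'ha']

# non-capturing group: findall returns the whole match
non_capturing = re.findall(r"(?:ha)+", laughs)
print(non_capturing)   # ['ha', 'haha', 'hahaha']
```

In both patterns the `+` applies to the whole of `ha`; only what gets reported back changes.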

### Lookahead and lookbehind

Another way of grouping elements is to use lookahead or lookbehind assertions. Like non-capturing groups, they are not output as groups in the match. In addition, they do not "consume" the input text, meaning that the text is still available for matching against subsequent portions of the regex.

Examples will help illustrate this.

In [None]:
#| eval: false

a_string = "an artist a1 a$"

# match a, but only if you find a letter from a-z when you look ahead, 
# but don't include that letter in the match
print("positive lookahead after a: ",regex.findall(r"(a(?=[a-z]))", a_string))

# match a, but only if you find a letter from a-z after it
# and include that letter in the match, but don't make a group for it
print("non-capturing group after a:",regex.findall(r"(a(?:[a-z]))", a_string))
      
# match a, but only if you find a letter from a-z after it
# and include that letter in the match, and make a group for the second letter     
print("regular group after a:",regex.findall(r"(a([a-z]))", a_string))

positive lookahead after a:  ['a', 'a']
non-capturing group after a: ['an', 'ar']
regular group after a: [('an', 'n'), ('ar', 'r')]


Three related concepts are *positive lookbehind*, *negative lookahead*, and *negative lookbehind*.  Think about how those might be similar to and/or different from *positive lookahead*. 
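
Once you have thought about it, here is one way to check your intuition. This is a sketch using the standard-library `re` module: `(?!...)` is negative lookahead, `(?<=...)` is positive lookbehind, and `(?<!...)` is negative lookbehind. The `prices` string is made up for this illustration.

```python
import re

a_string = "an artist a1 a$"

# negative lookahead: "a" only if NOT followed by a lowercase letter
neg_ahead = re.findall(r"a(?![a-z])", a_string)
print(neg_ahead)     # ['a', 'a']  (the a in "a1" and the a in "a$")

prices = "cost $42 or 17 units"

# positive lookbehind: digits only if preceded by a dollar sign
pos_behind = re.findall(r"(?<=\$)\d+", prices)
print(pos_behind)    # ['42']

# negative lookbehind: a number NOT preceded by a dollar sign
# (\b keeps the match from starting in the middle of "42")
neg_behind = re.findall(r"(?<!\$)\b\d+", prices)
print(neg_behind)    # ['17']
```

In every case the lookaround text itself is left out of the match, just as with positive lookahead.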

### Unicode regexes


Now that you have had an overview of the basic syntax of regular expressions, let's take a peek at some more modern aspects of regular expressions that are designed to take advantage of Unicode properties, which are described in great detail at [https://www.regular-expressions.info/unicode.html](https://www.regular-expressions.info/unicode.html) (the source of the following image):

![unicode character classes](images/regexes/re_unicode_char_classes.png)

Especially exciting for multilingual computing is the availability of character classes for recognizing different writing systems.

![unicode blocks](images/regexes/re_unicode_1_blocks.png)

Clickable sources of information about Unicode properties:
[https://www.regular-expressions.info/unicode.html](https://www.regular-expressions.info/unicode.html) and [Richard Ishida's Uniview tool](https://r12a.github.io/uniview/)

In [None]:
#| eval: false

# find the set of all punctuation sequences in the all_tweets text
punct_sequences = set(regex.findall(r"\p{Punct}+", all_tweets))
print(sorted(punct_sequences, key=len))

['-', '*', "'", ';', '_', '"', ',', ':', '?', ')', '\\', '.', '#', '!', '/', ']', '%', '@', '(', '&', '[', '?!', ',&', '".', ',"', '?"', '])', '"?', '/#', '([', '.!', '!"', '-@', '##', '.-', '-#', '@_', '(#', "',", ').', ',@', '"-', '__', "'-", '.(', "''", '.*', '??', ':#', ');', '!)', '!?', '--', '.#', ':(', "'.", ':)', '-,', '":', '!!', '(@', '!-', '."', '/@', ',#', '.@', ';)', ';"', ".'", '.)', '%!', '"@', '],', '..', '),', '):', '_:', '"#', '?-', '...', '???', '!!"', ']."', '?..', '!!!', '___', '..#', ".''", '."\\', '-_-', '@__', '--@', '..?', '].#', ';--', ';-)', ':-)', '!!!!', '...?', '..."', '!?..', '.../', '!!!-', '....', "'...", '"...', '____', '?...', '????', '!?!!', '..!!', '#...', '.....', '....#', '_____', '!!!!?', "',...", '?!?!?', '!!!!!!', '......', ':-))))', '.....#', '!!!!!!!', ')!!.....', '?!!??!????']
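
The standard-library `re` module does not support `\p{...}` properties, which is one reason this notebook uses `regex`. To get a feel for what a Unicode property is, the sketch below looks up each character's general category with the standard-library `unicodedata` module. It assumes that "punctuation" means a general category starting with `P`; the `is_punct` helper and the sample string are made up for this illustration and are not part of either regex library.

```python
import unicodedata
from itertools import groupby

def is_punct(ch):
    """True if ch is in a Unicode general category starting with 'P'."""
    return unicodedata.category(ch).startswith("P")

sample = "Wow!!! Really?! «Sí» … ok $5"

# collect each run of consecutive punctuation characters
punct_runs = ["".join(chars) for flag, chars in groupby(sample, key=is_punct) if flag]
print(punct_runs)                  # ['!!!', '?!', '«', '»', '…']

# "$" is a currency symbol (category Sc), not punctuation
print(unicodedata.category("$"))   # Sc
```

If `\p{Punct}` follows the same Unicode category, that would explain why `$` is missing from the output above even though it occurs in the tweets.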


## Optional Exercises 

These build on ideas from all of the sections above. Use the hints and give them a go. They are fun, like little puzzles to work out. If you are tuckered out, hang onto these exercises and play with them another time. Regular expressions take regular exercise - you only get better by practicing.

### Exercise 11 
Write a regular expression to match only the first "RIGHT" in the sentence "RIGHT IS RIGHT, MIGHT IS NOT RIGHT."


In [None]:
text11 =  "RIGHT IS RIGHT, MIGHT IS NOT RIGHT"

In [None]:
q11_answer = "Type your regex here after it works"

### Exercise 12 
Use back-references to find the following in the sample sentence:
- moo-moo
- bah-bah
- hoo hoo
- yipyip
- yip yip yip

but do not match black-board, moo-bah or yipyap. 

*Hint: For this, look up quantifiers, "named capturing groups", and "numeric reference" in the RegExr sidebar. Just try to get one to work, then another, etc. Build gradually!*


In [None]:
text12 = "I want to match moo-moo,bah-bah and hoo hoo and yipyip and yip yip yip but not black-board or moo-bah or yipyap."

In [None]:
q12_answer = "Type your regex here after it works"

### Exercise 13 

Write a regex to match words consisting of the letters of the English alphabet but that do not contain the letters a or e. <br>
*Hint: this is about defining character classes and negated sets.*

In [None]:
text13 = "Food is good but cake is great. He is angling to book a plate ticket"

In [None]:
q13_answer = "Type your regex here after it works"

### Exercise 14 
Write a regex to find `$42.42`, `£56,00.01`, and `60€` in `text14`.

*Hint: Character classes, optional and negated characters, and groupings. Start small and build-up.*

In [None]:
text14 = "4-1. I would like $42.42 for that, or maybe £56,00.01, or perhaps 60€, at least for part 60:3:9."

In [None]:
q14_answer = "Type your regex here after it works"

### Exercise 15 
Write a succinct regex to find the two dates in `text15`.

*Hint: Start with the expression above, and tweak.*

In [None]:
text15 = "I was born on 08/19/1991.  He was born on 9-24-1993."

In [None]:
q15_answer = "Type your regex here after it works"

### Exercise 16 
Replace all occurrences of `<3` in `text16` with ♥.

*Hint: For this, come back to Python and use `regex.sub()`.*

In [None]:
text16 = "Be my valentine <3 <3 <3"

In [None]:
q16_answer = "test 16"

NOTE:

If you restarted the kernel, re-run all exercise cells before reviewing or submitting.

In [None]:
# REVIEW ONLY — does not submit

from data401_nlp.helpers.submit import collect_answers, parse_answers, submit_answers

raw = collect_answers(
    show=True,
    namespace=globals(),  
)

answers = parse_answers(raw)

print(f"\nDetected {len(answers)} answers:")
for k in answers:
    print(" ", k)


In [None]:
#| eval: false

ALLOW_SUBMISSION = False   # ← student MUST change this

def submit_for_credit(student_id):
    if not ALLOW_SUBMISSION:
        raise RuntimeError(
            "⚠️ Submission is disabled.\n\n"
            "To submit:\n"
            "  1. Set ALLOW_SUBMISSION = True\n"
            "  2. Re-run this cell"
        )

    submit_answers(
        student_id=student_id,
        answers=answers,   # uses reviewed answers
    )

    print("✅ Submission complete.")

In [None]:
#| eval: false
# ✏️ EDIT WITH YOUR STUDENT ID
submit_for_credit("test_student")

RuntimeError: ⚠️ Submission is disabled.

To submit:
  1. Set ALLOW_SUBMISSION = True
  2. Re-run this cell