## Regex Lab

If you use notebooks, you can open this file and work directly in it. If you prefer to work in a REPL, you can copy and paste these short snippets into IPython as you go.

You may work with one partner, but you **may not** use AI. (Gen AI often writes very complex regex and tends to use JS syntax!)

Although writing regex is an area where you may well find yourself using GenAI in the near future, learning to read and write the basic syntax will remain valuable for understanding and debugging regular expressions.

#### Your Name:
#### (Optional) Partner Name:

To receive credit for this lab, work through all problems in this notebook. Both you and your partner should make a submission. (The submission form will be shared via Ed.)

In [4]:

# To make it easy to test your regexes, I've provided a helper class
# that you can use to test your regexes against a set of strings that
# should match and a set of strings that should not match.

# Note: the test cases are not exhaustive, but meant to provide some useful sample
# cases.
import re

class RegexProblem:
    def __init__(self, matches, non_matches):
        self.matches = matches
        self.non_matches = non_matches
    
    def try_regex(self, regex_str):
        regex = re.compile(regex_str)
        wrong = False
        for match in self.matches:
            if not regex.fullmatch(match):
                print(f"{match} should match but doesn't")
                wrong = True
        for non_match in self.non_matches:
            if regex.fullmatch(non_match):
                print(f"{non_match} should not match but does")
                wrong = True
        if not wrong:
            print("All tests passed!")


# example: match any string that only contains a's b's and/or c's
example_problem = RegexProblem(
    matches=["a", "bb", "ccc", "abc", "aaaaaaaa"], 
    non_matches=["d", "ace", "xyz"]
)

In [5]:
# you can experiment with it (in a REPL or in a notebook) like this:
example_problem.try_regex("[abc]")

bb should match but doesn't
ccc should match but doesn't
abc should match but doesn't
aaaaaaaa should match but doesn't


In [6]:
# oh right! [abc] only matches a single character, so it doesn't match "bb"
# let's try again:
example_problem.try_regex("[abc]+")

All tests passed!
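
The helper above relies on `fullmatch`, which only succeeds when the pattern accounts for the entire string. As a rough illustration (the sample string `"ace"` is taken from the example's non-matching cases; the variable names are only for this sketch), here is how it compares with `search` and `match`:

```python
import re

pattern = re.compile(r"[abc]+")

# search: succeeds if the pattern matches anywhere in the string
found_anywhere = pattern.search("ace") is not None
# match: succeeds if the pattern matches at the start of the string
found_at_start = pattern.match("ace") is not None
# fullmatch: succeeds only if the pattern matches the whole string
found_whole = pattern.fullmatch("ace") is not None

print(found_anywhere, found_at_start, found_whole)  # True True False
```

This is why `"ace"` is rejected by `try_regex` even though part of it is made of allowed characters.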


**For this lab, with each problem below, try finding a regex that satisfies the tests run by try_regex.**

Modify the calls to **try_regex** but not the **RegexProblem** objects, which contain the test cases.

## Problem 1: Warm-Up

Let's build a regular expression that matches floating point numbers.  For this problem we'll go one piece at a time:

In [8]:
# first start with a regex that only matches integers
digits_only = RegexProblem(
    ["1", "123", "456789", "094"], 
    ["a", "4.5", "-75.412", "43a4"]
)

In [10]:
digits_only.try_regex("")

1 should match but doesn't
123 should match but doesn't
456789 should match but doesn't
094 should match but doesn't


In [9]:
# the first pattern allowed numbers like 094; modify the regex so that zero is not allowed as a leading digit
integers_only = RegexProblem(
    ["1", "123", "456789"], 
    ["a", "4.5", "-75.412", "43a4", "094", "a94"]
)

In [None]:
integers_only.try_regex("")

In [12]:
# now let's allow negative numbers too: introduce a single (optional) leading dash
allow_negatives = RegexProblem(
    ["1", "123", "456789", "-7", "-90"],
    ["a", "4.5", "-75.412", "43a4", "094", "-04"]
)

In [17]:
allow_negatives.try_regex("")

1 should match but doesn't
123 should match but doesn't
456789 should match but doesn't
-7 should match but doesn't
-90 should match but doesn't


In [19]:
# alright, now allow a decimal point followed by digits; this entire part is optional
floats = RegexProblem(
    ["1", "123", "456789", "-7", "-90", "4.5", "-75.412"], 
    ["a", "43a4", "094", "-04"]
)

In [20]:
# does your regex allow anything strange? 
# try modifying the test cases to see if you can find any bugs in your regex
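
The optional pieces in this problem (a leading dash, a decimal part) can be expressed with `?`, which makes the preceding character or group optional. A minimal sketch on unrelated words, so it does not give the answer away (the patterns and sample strings here are made up for illustration):

```python
import re

# ? after a single character makes just that character optional
optional_char = re.compile(r"colou?r")
# ? after a parenthesized group makes the whole group optional
optional_group = re.compile(r"Nov(ember)?")

results = {
    "color": optional_char.fullmatch("color") is not None,
    "colour": optional_char.fullmatch("colour") is not None,
    "colouur": optional_char.fullmatch("colouur") is not None,
    "Nov": optional_group.fullmatch("Nov") is not None,
    "November": optional_group.fullmatch("November") is not None,
    "Novem": optional_group.fullmatch("Novem") is not None,
}
print(results)
```

Note that an optional group is all-or-nothing: `"Novem"` is rejected because only part of the group is present.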

## Problem 2: A practical pattern

Let's build something that matches URLs.

As we saw before, URLs take the form:

`protocol :// domain [:port] / [path]`

This time, you can test your own regexes as you go to build a regex that matches this entire pattern.

For our purposes we'll define each part as follows:

* protocol - 2-8 letters followed by `://` (e.g. "ws://", "https://")
* domain - a mixture of letters and numbers optionally separated by dots '.'  (e.g. "example.com", "101domain.com", "localhost", "cs.uchicago.edu") Don't worry about invalid endings, etc. 
* port - a colon followed by a number (e.g. :8000, :443)
* path - must begin with slash, then same rules as domain but also allow forward slashes  (e.g. "/", "/index", "/index.html", "/a/123")
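
One possible way to keep a multi-part pattern readable is to write each part as its own string and combine them. The sketch below does this for a made-up timestamp format (date, then an optional time), not for URLs; all names and the format itself are assumptions for illustration:

```python
import re

# Each piece is a small raw string; {4} and {2} mean "exactly this many"
year = r"\d{4}"
month = r"\d{2}"
day = r"\d{2}"
time = r"\d{2}:\d{2}"

# The time, together with the space in front of it, is one optional group
timestamp = re.compile(f"{year}-{month}-{day}( {time})?")

checks = {
    "2024-01-15": timestamp.fullmatch("2024-01-15") is not None,
    "2024-01-15 09:30": timestamp.fullmatch("2024-01-15 09:30") is not None,
    "2024-1-15": timestamp.fullmatch("2024-1-15") is not None,
    "2024-01-15 09:30:00": timestamp.fullmatch("2024-01-15 09:30:00") is not None,
}
print(checks)
```

Testing each piece on its own before combining them (as the `protocol` and `domain` problems below allow) tends to make mistakes easier to locate.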

In [10]:
# feel free to practice with these, or go straight to the URL test cases
protocol = RegexProblem(["ws://", "http://", "https://"], ["ws:", "ws//", "x://", "http:/", "99://"])
domain = RegexProblem(["example.com", "101domain.com", "localhost", "cs.uchicago.edu", "a.b.c.biz", "127.0.0.1"],
                      ["no spaces.com", "bad?com", ""])
# if you want to write your own test cases for port and path you can do so

In [11]:
protocol.try_regex("")

ws:// should match but doesn't
http:// should match but doesn't
https:// should match but doesn't


In [12]:
domain.try_regex("")

example.com should match but doesn't
101domain.com should match but doesn't
localhost should match but doesn't
cs.uchicago.edu should match but doesn't
a.b.c.biz should match but doesn't
127.0.0.1 should match but doesn't
 should not match but does


In [14]:
url = RegexProblem(
    ["http://example.com", "http://localhost:8000", "ftp://127.0.0.1", 
     "http://example.com:80/index.html", "ws://a.b.c.edu/fruit/apple/123"],
    ["http://", "example.com", "http://:0", "http://localhost:8000:8000", "http://test?biz",
     "http://example.com/art/238$!!@", "https://e$.com"]
)
# this list is far from exhaustive, but should provide a decent starting point

In [15]:
url.try_regex("")

http://example.com should match but doesn't
http://localhost:8000 should match but doesn't
ftp://127.0.0.1 should match but doesn't
http://example.com:80/index.html should match but doesn't
ws://a.b.c.edu/fruit/apple/123 should match but doesn't



### Note
The rules given here are simplified from the actual URL syntax. In practice, you should use `urllib.parse.urlparse` to parse URLs instead of trying to build your own regex.

https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse
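
For reference, a short sketch of what `urlparse` gives you for a URL similar to the test cases above (the query string is added here only to show that attribute):

```python
from urllib.parse import urlparse

parts = urlparse("http://example.com:80/index.html?q=hello")

print(parts.scheme)    # http
print(parts.hostname)  # example.com
print(parts.port)      # 80
print(parts.path)      # /index.html
print(parts.query)     # q=hello
```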

## Problem 3: Groups

Remember that a URL can end with a query string.

A query string looks like `?var=val&var2=other_val&var3=123` and comes at the end of the URL.

This is a good fit for `findall`.  We won't use `RegexProblem` anymore, but instead use `re.findall` to break up these query strings into their components. 

To keep things simple, let's use these rules:

Each segment of a query string is separated from the others by an `&` and consists of a `name` and `value`.

`names` must consist of letters, numbers, and underscores.

`values` must consist of letters, numbers, and any punctuation that isn't a `&`.

Use `re.findall` to extract these patterns.  (Hint: write your pattern to match *one* name=value pair)
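
A rule of the form "anything except one character" can be written with a negated character class, `[^...]`. A minimal sketch on a comma-separated string (the data is made up, and this is not the query-string pattern itself):

```python
import re

csv_line = "red,green,blue"

# [^,]+ means "one or more characters that are not a comma"
fields = re.findall(r"[^,]+", csv_line)
print(fields)  # ['red', 'green', 'blue']
```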


In [16]:
query_strings = [
    "?q=hello",
    "?q=hello&lang=en",
    "?q=plus+is+used+for+space+in+queries",
    "?a=1&b=2&c=3&d=4",
    "?first_name=Paul&last_name=Rust",
]

In [17]:
# write your regex here
qs_regex = re.compile("")

In [18]:
for query_string in query_strings:
    print(qs_regex.findall(query_string))

['', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


Your output should look like:
```
['q=hello']
['q=hello', 'lang=en']
['q=plus+is+used+for+space+in+queries']
['a=1', 'b=2', 'c=3', 'd=4']
['first_name=Paul', 'last_name=Rust']
```

**Now try using groups to capture the key and value separately so you instead get**

```
[('q', 'hello')]
[('q', 'hello'), ('lang', 'en')]
[('q', 'plus+is+used+for+space+in+queries')]
[('a', '1'), ('b', '2'), ('c', '3'), ('d', '4')]
[('first_name', 'Paul'), ('last_name', 'Rust')]
```
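
The change in output shape comes from how `re.findall` treats groups: with no groups it returns the whole match as a string, and with two or more groups it returns a tuple of the captured pieces. A small sketch on made-up clock times rather than query strings:

```python
import re

schedule = "9:30, 12:45, 18:05"

# No groups: each result is the full matched text
whole = re.findall(r"\d+:\d+", schedule)
# Two groups: each result is a (group 1, group 2) tuple
parts = re.findall(r"(\d+):(\d+)", schedule)

print(whole)  # ['9:30', '12:45', '18:05']
print(parts)  # [('9', '30'), ('12', '45'), ('18', '05')]
```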

In [39]:
qs_regex_improved = re.compile("")
for query_string in query_strings:
    print(qs_regex_improved.findall(query_string))

['', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


## Problem 4: Word Variations

We want to count how many times the word "mother" appears in shakespeare.txt.

In [19]:
with open("shakespeare.txt") as f:
    corpus = f.read()

corpus.count("mother")

In [20]:
corpus.count("mother")   # it is tempting to use str.count

441

In [21]:
corpus.count("Mother")   # we want to ignore case too (which we could do with lower() in theory)

35

In [22]:
corpus.count("smothered")  # but there are other words that contain "mother"

2

In [23]:
corpus.count(" mother ") # it is tempting to try something like this

143

In [24]:
corpus.count(" mother,") # but there are a lot of variations to account for

80

In [25]:
# try to find a regex to find all instances of the word "mother" regardless of case
# but excluding other words
# Hint: Take a look at flags and anchors in the notes.
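
As a hedged illustration of the hint, here is how a word-boundary anchor (`\b`) and the `re.IGNORECASE` flag behave on a made-up sentence with a different word:

```python
import re

sample = "Cat concatenate cat, cats CAT."

# \b matches the position between a word character and a non-word character,
# so "cat" inside "concatenate" or "cats" is skipped
cats = re.findall(r"\bcat\b", sample, re.IGNORECASE)
print(cats)  # ['Cat', 'cat', 'CAT']
```

Punctuation such as `,` and `.` counts as a non-word character, so `\b` handles those variations without listing them one by one.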

In [26]:
re.findall("write-your-regex-here", corpus)

[]

In [None]:
# you can call `len` on this to get the total count -- my count was 437

## Problem 5: Finding Ghosts

Let's find all of the ghosts in shakespeare.txt.

First, find 30 occurrences of the word "ghost" (excluding cases where it is part of another word).

In [68]:
ghosts = re.findall(r"FIXME", corpus) 
print("found", len(ghosts), "ghosts")

found 0 ghosts


Second, expand the regex to capture neighboring words:

In [73]:
# Expand the regex to find multiple words on either side -- 
# so you can see the context in which the word "ghost" appears
# as you expand, the number of ghosts should remain the same!
re.findall(r"FIXME", corpus) 

[]

Your output will likely contain strings like "poor mortal living ghost,\n Woe's scene",
but it can vary a lot depending on the rules you choose and how many words you include as context.
Don't worry about the exact matches; you can stop once you have found the names of 2-3 ghosts
by reading the expanded context.
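
One possible shape for a "word plus context" pattern, sketched on a made-up sentence with a different word. The choice of up to two words per side and the definition of a "word" as `\w+` are assumptions; other rules are equally reasonable:

```python
import re

text = "The wise old owl sat in the oak. An owlet hooted."

# Whole-word occurrences only ("owlet" is excluded by the \b anchors)
bare = re.findall(r"\bowl\b", text)

# Up to two words of context on each side; (?:...) groups without capturing,
# so findall still returns the whole matched string
with_context = re.findall(r"(?:\w+\W+){0,2}\bowl\b(?:\W+\w+){0,2}", text)

print(bare)          # ['owl']
print(with_context)  # ['wise old owl sat in']
```

Because `\W+` also matches newlines, context can span line breaks. One caveat with this approach: if two occurrences sit within each other's context window, a single match may consume both, so it is worth comparing the counts as you expand.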