# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# Part 2: Regular expressions

Part 1 hints at the general problem of finding patterns in text. A handy tool for this problem is Python's [regular expression module](https://docs.python.org/3/howto/regex.html).

A _regular expression_ is a specially formatted pattern, written as a string. Matching patterns with regular expressions takes three steps:

1. You come up with a pattern to find.
2. You compile it into a _pattern object_.
3. You apply the pattern object to a string, to find _matches_, i.e., instances of the pattern within the string.

> What follows is just a small sample of what is possible with regular expressions in Python; refer to the [regular expression documentation](https://docs.python.org/3/howto/regex.html) for many more examples and details.

## Basics

Let's see how this scheme works for the simplest case, in which the pattern is an exact substring.

In [None]:
import re   # Regular expression module: https://docs.python.org/3/howto/regex.html

pattern = 'fox'
pattern_matcher = re.compile (pattern)

input_1 = 'The quick brown fox jumps over the lazy dog'
matches_1 = pattern_matcher.search (input_1)
print (matches_1)

input_2 = '''Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Istam voluptatem perpetuam quis potest praestare sapienti?'''
matches_2 = pattern_matcher.search (input_2)
print (matches_2)

You can also query matches for more information.

In [None]:
print (matches_1.group ())
print (matches_1.start ())
print (matches_1.end ())
print (matches_1.span ())
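Note that `search()` returns `None` when the pattern does not occur, as with `matches_2` above, so calling `.group()` on such a result raises an `AttributeError`. Here is a minimal sketch of guarding against that case; the helper name `describe_match` is hypothetical and introduced only for illustration.

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a notice if there is none."""
    m = re.search(pattern, text)
    if m is None:  # no match, so there is no match object to query
        return "no match for {!r}".format(pattern)
    return "found {!r} at {}".format(m.group(), m.span())

print(describe_match('fox', 'The quick brown fox jumps over the lazy dog'))
print(describe_match('fox', 'Lorem ipsum dolor sit amet'))
```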

**Module-level searching.** For infrequently used patterns, you can also skip creating the pattern object and just call the module-level search function, `re.search()`.

In [None]:
matches_3 = re.search ('jump', input_1)
assert matches_3 is not None
print ("Found", matches_3.group (), "@", matches_3.span ())

## Building blocks

Beyond exact matches, there are many ways to build up more complex patterns. Here is a crash course by example.

In [None]:
# Choices: |

digit_words = 'zero|one|two|three|four|five|six|seven|eight|nine'
test_string = 'four score and seven years ago...'
re.findall (digit_words, test_string)

In [None]:
blurg = 'An onion under an Easter egg is odd'

# Character classes: [...]
vowel = '[aAeEiIoOuU]'
print (re.findall (vowel, blurg))

In [None]:
# Alternative: case-insensitive search
vowel_lower = '[aeiou]'
print (re.findall (vowel_lower, blurg, re.IGNORECASE))

In [None]:
pwn_str = '0x3nd3adb33f' # Some test string

# Ranges: [a-z]
digit = '[0-9]'
print (re.findall (digit, pwn_str))

In [None]:
# Complement of a character class: [^...]
non_digit = '[^0-9]'
for x in re.finditer (non_digit, pwn_str):
    print (x.group ())

In [None]:
# Special character classes:
#  \d = [0-9]
#  \D = [^0-9]
#  \s = spaces, including newlines and tabs
#  \S = non-spaces
#  \w = alphanumeric + underscore, i.e., [a-zA-Z0-9_]
#  \W = [^a-zA-Z0-9_]

digit2 = r'\d'   # Raw string, so Python passes the backslash through to `re` unchanged
print (pwn_str, '==>', re.findall (digit2, pwn_str))
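To see the other special classes in action, here is a small sketch on a made-up test string (`sample` is just an illustrative name). The patterns are written as raw strings (`r'...'`) so that backslashes reach the `re` module untouched. Keep in mind that for ordinary Python 3 strings, `\w` also matches non-ASCII letters and digits unless you pass `re.ASCII`.

```python
import re

sample = 'id_42:\tok\n'   # contains a tab and a newline

words = re.findall(r'\w+', sample)      # runs of word characters
spaces = re.findall(r'\s', sample)      # individual whitespace characters
non_words = re.findall(r'\W', sample)   # anything that is not a word character

print(words)
print(spaces)
print(non_words)
```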

In [None]:
# Match vs. search
# Note: '\.' matches a literal period; an unescaped '.' would match any character.
missus = r'Mrs\.'

assert re.search (missus, 'Judy Mrs. Dench') is not None
assert re.match (missus, 'Judy Mrs. Dench') is None
assert re.match (missus, 'Mrs. Judy Dench') is not None

# Alternative: ^ + search == match
missus_pre = r'^Mrs\.'
assert re.search (missus_pre, 'Mrs. Judy Dench') is not None
assert re.search (missus_pre, 'Judy Mrs. Dench') is None

# Any prefix: Use '(...)' to group
nompre2 = r'^(Mr\.|Ms\.|Mrs\.|Dr\.|Prof\.)'
assert re.search (nompre2, 'Prof. Judy Dench') is not None
assert re.search (nompre2, 'Judy Prof. Dench') is None

# Note: For suffixes, use '$'
nomsuf = r'(Jr\.|Sr\.|III|Esq\.|M\.D\.|Ph\.D\.)$'
assert re.search (nomsuf, 'Prof. Judy Dench, Ph.D.') is not None
assert re.search (nomsuf, 'Judy Ph.D. Dench') is None
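One subtlety when matching titles such as `Mrs.`: inside a pattern, a bare `.` is a wildcard that matches any single character, so a literal period has to be written as `\.`. The sketch below contrasts the two; the names `loose` and `strict` are illustrative only. It also uses `re.escape()`, which builds the escaped form from a literal string.

```python
import re

loose = 'Mrs.'      # '.' is a wildcard here: it matches any one character
strict = r'Mrs\.'   # '\.' matches only a literal period

assert re.match(loose, 'Mrsx Dench') is not None   # probably not what was intended
assert re.match(strict, 'Mrsx Dench') is None
assert re.match(strict, 'Mrs. Dench') is not None

# re.escape() escapes the special characters in a literal string for you
escaped = re.escape('Mrs.')
assert re.match(escaped, 'Mrsx Dench') is None
print(escaped)
```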

In [None]:
# Repetition and counting

# Zero or more: *
print ("Testing '*'...")
assert re.match ('co*w', 'cw') is not None
assert re.match ('co*w', 'cow') is not None
assert re.match ('co*w', 'coow') is not None
assert re.match ('co*w', 'cooow') is not None
assert re.match ('co*w', 'caw') is None

# One or more: +
print ("Testing '+'...")
print (re.findall ('[aeiou]+', 'How much wood could a woodchuck chuck?'))

# Zero or one: ?
print ("Testing '?'...")
assert re.match ('co?w', 'cw') is not None
assert re.match ('co?w', 'cow') is not None
assert re.match ('co?w', 'coow') is None

# Counts: {m,n} (no space after the comma, or the braces are matched literally)
print ("Testing '{m,n}' ...")
assert re.match ('co{3,5}w', 'coow') is None
assert re.match ('co{3,5}w', 'cooow') is not None
assert re.match ('co{3,5}w', 'coooow') is not None
assert re.match ('co{3,5}w', 'cooooow') is not None
assert re.match ('co{3,5}w', 'coooooow') is None
# Note: {,} == {0,} == *
#       {1,} == +
#       {0,1} == {,1} == ?
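The equivalences in the note above can be checked directly. This is only a quick sketch over a handful of test words (the names `words` and `equivalents` are illustrative), not a proof that the patterns agree on every input.

```python
import re

words = ['cw', 'cow', 'coow', 'cooow']

# Each pair of patterns should accept exactly the same words
equivalents = [('co*w', 'co{0,}w'),
               ('co+w', 'co{1,}w'),
               ('co?w', 'co{0,1}w')]

accepted = {}
for short, counted in equivalents:
    short_hits = [w for w in words if re.match(short, w)]
    counted_hits = [w for w in words if re.match(counted, w)]
    assert short_hits == counted_hits
    accepted[short] = short_hits
    print(short, '==', counted, '->', short_hits)
```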

In [None]:
# Counts + wildcards: {n} and .
print ("Testing counts '{n}' and wildcards (.) ...")
assert re.match ('c.{3}w', 'cw') is None
assert re.match ('c.{3}w', 'caw') is None
assert re.match ('c.{3}w', 'caew') is None
assert re.match ('c.{3}w', 'caeiw') is not None
assert re.match ('c.{3}w', 'caeiow') is None
assert re.match ('c.{3}w', 'caeiouw') is None
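One caveat about the wildcard: by default, `.` matches any character except a newline. Passing the `re.DOTALL` flag lifts that restriction. A brief sketch, with illustrative variable names:

```python
import re

same_line = re.match('c.w', 'caw')                    # ordinary character: matches
across_newline = re.match('c.w', 'c\nw')              # '.' does not match '\n' by default
with_dotall = re.match('c.w', 'c\nw', re.DOTALL)      # DOTALL lets '.' match '\n' as well

assert same_line is not None
assert across_newline is None
assert with_dotall is not None
```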

In [None]:
# Groups: (...)

# Match simple names of the form: First [Optional-middle ]Last
re_names = re.compile (r'^([a-zA-Z]+)\s([a-zA-Z]+\s)?\s*([a-zA-Z]+)$')

print (re_names.match ('Rich Vuduc').groups ())
print (re_names.match ('Rich S Vuduc').groups ())
print (re_names.match ('Rich Salamander Vuduc').groups ())

In [None]:
# Make the above more readable with a re.VERBOSE pattern
re_names2 = re.compile (r'''^              # Beginning of string
                            ([a-zA-Z]+)    # First name
                            \s             # One whitespace character
                            ([a-zA-Z]+\s)? # Optional middle name, plus the space after it
                            \s*            # Any additional whitespace
                            ([a-zA-Z]+)    # Last name
                            $              # End of string
                         ''',
                         re.VERBOSE)
print (re_names2.match ('Rich Vuduc').groups ())
print (re_names2.match ('Rich S Vuduc').groups ())
print (re_names2.match ('Rich Salamander Vuduc').groups ())

In [None]:
# Named groups
re_names3 = re.compile (r'''^
                           (?P<first>[a-zA-Z]+)
                           \s
                           (?P<middle>[a-zA-Z]+\s)?
                           \s*
                           (?P<last>[a-zA-Z]+)
                           $
                        ''',
                        re.VERBOSE)
print (re_names3.match ('Rich Vuduc').group ('first'))
print (re_names3.match ('Rich S Vuduc').group ('middle'))
print (re_names3.match ('Rich Salamander Vuduc').group ('last'))
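When a pattern has several named groups, `groupdict()` returns all of them at once as a dictionary. The sketch below repeats the named-group pattern under a fresh, hypothetical name (`re_names4`) so that it runs on its own. Two details are worth noticing: the `middle` group captures its trailing space, and an optional group that does not take part in the match maps to `None`.

```python
import re

re_names4 = re.compile(r'''^
                           (?P<first>[a-zA-Z]+)
                           \s
                           (?P<middle>[a-zA-Z]+\s)?
                           \s*
                           (?P<last>[a-zA-Z]+)
                           $
                        ''', re.VERBOSE)

with_middle = re_names4.match('Rich S Vuduc').groupdict()
without_middle = re_names4.match('Rich Vuduc').groupdict()

print(with_middle)      # 'middle' includes the space captured inside the group
print(without_middle)   # 'middle' is None because the optional group was skipped
```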

**Exercise 1.** Write a function `parse_email(s)` that, given an email address `s`, returns a tuple, `(user-id, domain)` corresponding to the user name and domain name.

For instance, given `'richie@cc.gatech.edu'` it should return `('richie', 'cc.gatech.edu')`. Leading and trailing spaces in `s` should be ignored.

If the input is not an email address, the function should raise a `ValueError`.
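The general "match, or else raise" idiom can be sketched on an unrelated toy format. The function `parse_time` and its `HH:MM` format are hypothetical, chosen only to illustrate the shape of such a parser; they are not part of the exercise.

```python
import re

def parse_time(s):
    """Parses 'HH:MM' (a made-up example format) into an (hour, minute) pair of strings."""
    # Allow surrounding spaces, then require exactly two digits, a colon, two digits
    m = re.match(r'^\s*(\d{2}):(\d{2})\s*$', s)
    if m is None:
        raise ValueError("not a time in HH:MM format: {!r}".format(s))
    return m.groups()

print(parse_time('  09:30 '))

try:
    parse_time('9:30')
except ValueError as err:
    print("Rejected:", err)
```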

In [None]:
def parse_email (s):
    """Parses a string as an email address, returning an (id, domain) pair."""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert parse_email ('richie@cc.gatech.edu') == ('richie', 'cc.gatech.edu')
assert parse_email ('   quiggy.smith38x@gmail.com') == ('quiggy.smith38x', 'gmail.com')
assert parse_email ('bertha_hugely@sampson.edu  ') == ('bertha_hugely', 'sampson.edu')
assert parse_email ('JKRowling@Huge-Books.org') == ('JKRowling', 'Huge-Books.org')

In [None]:
try:
    parse_email ('x @hpcgarage.org')
except ValueError:
    print ("Correctly throws an exception.")
else:
    raise AssertionError ("Did *not* throw an exception as required!")

**Exercise 2.** Write a function to parse US phone numbers written in the canonical "(404) 555-1212" format, i.e., a three-digit area code enclosed in parentheses, followed by a seven-digit local number written as three digits, a dash, and four digits. It should return a triple of strings, `(area-code, first-three, last-four)`. It should also ignore all leading and trailing spaces, as well as any spaces that appear between the area code and the local number.

If the input is not a valid phone number, it should raise a `ValueError`.

In [None]:
def parse_phone1 (s):
    """Parses a string as a phone number in `(XXX) XXX-XXXX` format,
    returning a `(area-code, three-digits, four-digits)` tuple of
    strings.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert parse_phone1 ('(404) 555-3355') == ('404', '555', '3355')
assert parse_phone1 ('  (404)   555-3355  ') == ('404', '555', '3355')
assert parse_phone1 ('(123)456-7890') == ('123', '456', '7890')

In [None]:
try:
    parse_phone1 ('404-555-3355')
except ValueError:
    print ("Correctly throws an exception.")
else:
    raise AssertionError ("Did *not* correctly raise an exception!")
    
try:
    parse_phone1 ('+1 (404) 555-3355')
except ValueError:
    print ("Correctly throws an exception.")
else:
    raise AssertionError ("Did *not* correctly raise an exception!")

**Exercise 3.** Implement an enhanced phone number parser that can handle any of these patterns.

* (404) 555-1212
* (404) 5551212
* 404-555-1212
* 404-5551212
* 404555-1212
* 4045551212

As before, it should not be sensitive to leading or trailing spaces. Also, for the patterns in which the area code is enclosed in parentheses, it should not be sensitive to the number of spaces separating the area code from the remainder of the number. If the input does not match any of these patterns, it should raise a `ValueError`.

In [None]:
def parse_phone2 (s):
    """Parses a string as a US phone number, returning a
    `(area-code, three-digits, four-digits)` tuple of strings.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert parse_phone2 ("  (404)   555-1212  ") == ('404', '555', '1212')
assert parse_phone2 ("(404)555-1212  ") == ('404', '555', '1212')
assert parse_phone2 ("  404-555-1212 ") == ('404', '555', '1212')
assert parse_phone2 ("  404-5551212 ") == ('404', '555', '1212')
assert parse_phone2 (" 4045551212") == ('404', '555', '1212')

In [None]:
failure_cases = ['+1 (404) 555-3355',
                 '404.555.3355',
                 '404 555-3355',
                 '404 555 3355'                 
                ]
for s in failure_cases:
    try:
        parse_phone2 (s)
    except ValueError:
        print ("'{}': Function correctly raised an exception.".format (s))
    else:
        raise AssertionError ("Function did *not* raise an exception as expected!")