# Regular expressions

[Cheatsheet](https://www.debuggex.com/cheatsheet/regex/python)

In [1]:
import re

## Part 0: Basics (email & phone)

The general problem here is finding patterns in text. A handy tool for this problem is Python's regular expression module, `re`.

A regular expression is a specially formatted pattern, written as a string. Matching patterns with regular expressions has three steps:

1. You come up with a pattern to find.
2. You compile it into a pattern object.
3. You apply the pattern object to a string, to find matches, i.e., instances of the pattern within the string.

### Basics

Let's see how this scheme works for the simplest case, in which the pattern is an exact substring.

In [2]:
pattern = 'fox'
pattern_matcher = re.compile(pattern)

input = 'The quick brown fox jumps over the lazy dog'
matches = pattern_matcher.search(input)
print(matches)
print(matches.group())
print(matches.start())
print(matches.end())
print(matches.span())

<re.Match object; span=(16, 19), match='fox'>
fox
16
19
(16, 19)


**Module-level searching.** For infrequently used patterns, you can also skip creating the pattern object and just call the module-level search function, `re.search()`.

In [3]:
matches_2 = re.search ('jump', input)
assert matches_2 is not None
print ("Found", matches_2.group (), "@", matches_2.span ())

Found jump @ (20, 24)


**Other search methods**

1. `match()`: Determines whether the RE matches at the beginning of the string.
2. `search()`: Scans through a string, looking for any location where this RE matches.
3. `findall()`: Finds all substrings where the RE matches and returns them as a list.
4. `finditer()`: Finds all substrings where the RE matches and returns them as an iterator.
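
To make the differences concrete, here is a minimal sketch that runs all four methods on the same compiled pattern. The sample text and the variable names (`digits`, `text`, `spans`) are made up for illustration.

```python
import re

digits = re.compile(r'\d+')
text = 'Order 66 shipped 3 items on day 12'

# match() only succeeds if the pattern matches at position 0.
print(digits.match(text))                        # None: the text starts with 'Order'
print(digits.match('42 is the answer').group())  # 42

# search() returns the first match anywhere in the string.
print(digits.search(text).group())               # 66

# findall() returns every non-overlapping match as a list of strings.
print(digits.findall(text))                      # ['66', '3', '12']

# finditer() yields match objects, so positions are available too.
spans = [(m.group(), m.span()) for m in digits.finditer(text)]
print(spans)
```

Note that when a pattern contains capturing groups, `findall()` returns the group contents rather than the whole match, so `finditer()` is often the safer choice once groups are involved.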

**Creating pattern groups** 

In [4]:
re_names2 = re.compile (r'''^              # Beginning of string
                            ([a-zA-Z]+)    # First name
                            \s+            # One or more whitespace characters
                            ([a-zA-Z]+\s)? # Optional middle name
                            ([a-zA-Z]+)    # Last name
                            $              # End of string
                         ''',
                         re.VERBOSE)
print (re_names2.match ('Rich Vuduc').groups ())
print (re_names2.match ('Rich S Vuduc').groups ())
print (re_names2.match ('Rich Salamander Vuduc').groups ())

('Rich', None, 'Vuduc')
('Rich', 'S ', 'Vuduc')
('Rich', 'Salamander ', 'Vuduc')


**Tagging pattern groups**

In [5]:
# Named groups
re_names3 = re.compile (r'''^
                            (?P<first>[a-zA-Z]+)
                            \s
                            (?P<middle>[a-zA-Z]+\s)?
                            \s*
                            (?P<last>[a-zA-Z]+)
                            $
                         ''',
                         re.VERBOSE)
print (re_names3.match ('Rich Vuduc').group ('first'))
print (re_names3.match ('Rich S Vuduc').group ('middle'))
print (re_names3.match ('Rich Salamander Vuduc').group ('last'))

Rich
S 
Vuduc
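
Named groups can also be retrieved all at once with `groupdict()`. The sketch below rebuilds a similar name pattern so that it runs on its own; `name_re` is a hypothetical name, and the non-capturing wrapper `(?:...)` is a small variation (not the pattern above) that keeps the trailing space out of the `middle` group.

```python
import re

name_re = re.compile(r'''^
    (?P<first>[a-zA-Z]+)
    \s+
    (?:(?P<middle>[a-zA-Z]+)\s+)?   # optional middle name; the space stays outside the group
    (?P<last>[a-zA-Z]+)
    $''', re.VERBOSE)

m = name_re.match('Rich S Vuduc')
print(m.groupdict())   # {'first': 'Rich', 'middle': 'S', 'last': 'Vuduc'}

# Groups that did not participate in the match come back as None by default,
# or as whatever you pass for `default`.
print(name_re.match('Rich Vuduc').groupdict())
print(name_re.match('Rich Vuduc').groupdict(default=''))
```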


### Email addresses

In the next exercise, you'll apply what you've learned about regular expressions to build a pattern matcher for email addresses.

Although there is a [formal specification of what constitutes a valid email address](https://tools.ietf.org/html/rfc5322#section-3.4.1), for this exercise, let's use the following simplified rules.

* We will restrict our attention to ASCII addresses and ignore Unicode. If you don't know what that means, don't worry about it---you shouldn't need to do anything special given our code templates, below.
* An email address has two parts, the username and the domain name. These are separated by an `@` character.
* A username **must begin with an alphabetic** character. It may be followed by any number of additional _alphanumeric_ characters or any of the following special characters: `.` (period), `-` (hyphen), `_` (underscore), or `+` (plus).
* A domain name **must end with an alphabetic** character. It may consist of any of the following characters: alphanumeric characters, `.` (period), `-` (hyphen), or `_` (underscore).
* Alphabetic characters may be uppercase or lowercase.
* No whitespace characters are allowed.

Valid domain names usually have additional restrictions, e.g., there are a limited number of endings, such as `.com`, `.edu`, and so on. However, for this exercise you may ignore this fact.

In [6]:
def parse_email (s):
    """Parses a string as an email address, returning an (id, domain) pair."""
    pattern = r'''
        ^
        (?P<user>[a-zA-Z][\w.\-+]*)
        @
        (?P<domain>[\w.\-]*[a-zA-Z])
        $
    '''
    matcher = re.compile(pattern, re.VERBOSE)
    matches = matcher.match(s)
    if matches:
        return matches.group('user'), matches.group('domain')
    else:
        raise ValueError("Bad email address")


In [7]:
# Test cell: `parse_email_test`

def pass_case(u, d):
    s = u + '@' + d
    msg = "Testing valid email: '{}'".format(s)
    print(msg)
    assert parse_email(s) == (u, d), msg
    
pass_case('richie', 'cc.gatech.edu')
pass_case('bertha_hugely', 'sampson.edu')
pass_case('JKRowling', 'Huge-Books.org')

def fail_case(s):
    msg = "Testing invalid email: '{}'".format(s)
    print(msg)
    try:
        parse_email(s)
    except ValueError:
        print("==> Correctly throws an exception!")
    else:
        raise AssertionError("Should have, but did not, throw an exception!")
        
fail_case('x @hpcgarage.org')
fail_case('   quiggy.smith38x@gmail.com')
fail_case('richie@cc.gatech.edu  ')
fail_case('4test@gmail.com')
fail_case('richie@cc.gatech.edu7')

Testing valid email: 'richie@cc.gatech.edu'
Testing valid email: 'bertha_hugely@sampson.edu'
Testing valid email: 'JKRowling@Huge-Books.org'
Testing invalid email: 'x @hpcgarage.org'
==> Correctly throws an exception!
Testing invalid email: '   quiggy.smith38x@gmail.com'
==> Correctly throws an exception!
Testing invalid email: 'richie@cc.gatech.edu  '
==> Correctly throws an exception!
Testing invalid email: '4test@gmail.com'
==> Correctly throws an exception!
Testing invalid email: 'richie@cc.gatech.edu7'
==> Correctly throws an exception!
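
One detail worth checking against the "ASCII only" rule: in Python 3, `\w` in a `str` pattern matches Unicode word characters by default, so an accented letter can slip through. The following is a minimal sketch, assuming you want strict ASCII matching, that compares the default behavior with the `re.ASCII` flag. The names `EMAIL_PATTERN`, `unicode_matcher`, and `ascii_matcher` are hypothetical and not part of the exercise.

```python
import re

EMAIL_PATTERN = r'''
    ^
    (?P<user>[a-zA-Z][\w.\-+]*)
    @
    (?P<domain>[\w.\-]*[a-zA-Z])
    $
'''

unicode_matcher = re.compile(EMAIL_PATTERN, re.VERBOSE)
ascii_matcher = re.compile(EMAIL_PATTERN, re.VERBOSE | re.ASCII)

candidate = 'ren\u00e9e@example.org'   # contains a non-ASCII 'e' with an acute accent

print(unicode_matcher.match(candidate) is not None)  # True: \w accepts the accented letter
print(ascii_matcher.match(candidate) is not None)    # False: \w is limited to [a-zA-Z0-9_]
```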


### Phone numbers

Write a function to parse US phone numbers written in the canonical "(404) 555-1212" format, i.e., a three-digit area code enclosed in parentheses followed by a seven-digit local number in three-hyphen-four digit format. It should also **ignore** all leading and trailing spaces, as well as any spaces that appear between the area code and the local number. However, it should **not** accept any spaces inside the area code (e.g., in '(404)') or inside the seven-digit local number.

It should return a triple of strings, `(area_code, first_three, last_four)`. 

If the input is not a valid phone number, it should raise a `ValueError`.

In [8]:
def parse_phone1 (s):
    pattern = r'''
        \s*
        \((?P<area>\d{3})\) # area code
        \s*
        (?P<local3>\d{3})
        -
        (?P<local4>\d{4})
        \s*$                # only trailing spaces may follow the number
    '''
    
    matcher = re.compile(pattern, re.VERBOSE)
    matches = matcher.match(s)
    if matches:
        return (matches.group('area'), 
                matches.group('local3'), 
                matches.group('local4'))
    else:
        raise ValueError("Bad phone number")

In [9]:
# Test cell: `parse_phone1_test`

def rand_spaces(m=5):
    from random import randint
    return ' ' * randint(0, m)

def asm_phone(a, l, r):
    return rand_spaces() + '(' + a + ')' + rand_spaces() + l + '-' + r + rand_spaces()

def gen_digits(k):
    from random import choice # 3.5 compatible; 3.6 has `choices()`
    DIGITS = '0123456789'
    return ''.join([choice(DIGITS) for _ in range(k)])

def pass_phone(p=None, a=None, l=None, r=None):
    if p is None:
        a = gen_digits(3)
        l = gen_digits(3)
        r = gen_digits(4)
        p = asm_phone(a, l, r)
    else:
        assert a is not None and l is not None and r is not None, "Need to supply sample solution."
    msg = "Should pass: '{}'".format(p)
    print(msg)
    p_you = parse_phone1(p)
    assert p_you == (a, l, r), "Got {} instead of ('{}', '{}', '{}')".format(p_you, a, l, r)
    
def fail_phone(s):
    msg = "Should fail: '{}'".format(s)
    print(msg)
    try:
        p_you = parse_phone1(s)
    except ValueError:
        print("==> Correctly throws an exception.")
    else:
        raise AssertionError("Failed to throw a `ValueError` exception!")


# Cases that should definitely pass:
pass_phone('(404) 121-2121', '404', '121', '2121')
pass_phone('(404)121-2121', '404', '121', '2121')
for _ in range(5):
    pass_phone()
    
fail_phone("404-121-2121")
fail_phone('(404)555 -1212')
fail_phone(" ( 404)121-2121")
fail_phone("(abc) def-ghij")

Should pass: '(404) 121-2121'
Should pass: '(404)121-2121'
Should pass: '  (876)     546-0975 '
Should pass: '(100)   033-1071'
Should pass: '(277)805-9992'
Should pass: '     (809)  750-4377  '
Should pass: '  (320) 524-5980 '
Should fail: '404-121-2121'
==> Correctly throws an exception.
Should fail: '(404)555 -1212'
==> Correctly throws an exception.
Should fail: ' ( 404)121-2121'
==> Correctly throws an exception.
Should fail: '(abc) def-ghij'
==> Correctly throws an exception.
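
A general point for parsers like this one: `match()` anchors the pattern only at the start of the string, so a pattern with no end anchor will accept extra text after the number. The sketch below, using a made-up pattern `phone_re` and made-up inputs, contrasts `match()` with `fullmatch()`, which requires the whole string to match.

```python
import re

phone_re = re.compile(r'\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*')

good = ' (404) 555-1212  '
bad = '(404) 555-1212 ext. 9'

print(phone_re.match(bad) is not None)      # True: match() ignores the trailing text
print(phone_re.fullmatch(bad) is not None)  # False: the whole string must match
print(phone_re.fullmatch(good).groups())    # ('404', '555', '1212')
```

Ending the pattern with `\s*$` and calling `match()` gives a similar effect, with the small difference that `$` also matches just before a final newline.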


**Implement an enhanced phone number parser that can handle any of these patterns.**

* (404) 555-1212
* (404) 5551212
* 404-555-1212
* 404-5551212
* 404555-1212
* 4045551212

As before, it should not be sensitive to leading or trailing spaces. Also, for the patterns in which the area code is enclosed in parentheses, it should not be sensitive to the number of spaces separating the area code from the remainder of the number.

In [10]:
def parse_phone2(s):
    pattern = r'''
        \s*
        (?P<area>  # area code
            \d{3}-?
            | \(\d{3}\)\s*
        )
        (?P<local3>\d{3})
        -?
        (?P<local4>\d{4})
        \s*$       # only trailing spaces may follow the number
    '''
    matcher = re.compile(pattern, re.VERBOSE)
    matches = matcher.match(s)
    if not matches:
        raise ValueError("Bad phone number")
    
    areacode = re.search(r'\d{3}', matches.group('area')).group()
    local3 = matches.group('local3')
    local4 = matches.group('local4')

    return areacode, local3, local4


In [11]:
# Test cell: `parse_phone2_test`

def asm_phone2(a, l, r):
    from random import random
    x = random()
    if x < 0.33:
        a2 = '(' + a + ')' + rand_spaces()
    elif x < 0.67:
        a2 = a + '-'
    else:
        a2 = a
    y = random()
    if y < 0.5:
        l2 = l + '-'
    else:
        l2 = l
    return rand_spaces() + a2 + l2 + r + rand_spaces()

def pass_phone2(p=None, a=None, l=None, r=None):
    if p is None:
        a = gen_digits(3)
        l = gen_digits(3)
        r = gen_digits(4)
        p = asm_phone2(a, l, r)
    else:
        assert a is not None and l is not None and r is not None, "Need to supply sample solution."
    msg = "Should pass: '{}'".format(p)
    print(msg)
    p_you = parse_phone2(p)
    assert p_you == (a, l, r), "Got {} instead of ('{}', '{}', '{}')".format(p_you, a, l, r)
    
pass_phone2("  (404)   555-1212  ", '404', '555', '1212')
pass_phone2("(404)555-1212  ", '404', '555', '1212')
pass_phone2("  404-555-1212 ", '404', '555', '1212')
pass_phone2("  404-5551212 ", '404', '555', '1212')
pass_phone2(" 4045551212", '404', '555', '1212')
    
for _ in range(5):
    pass_phone2()
    
    
def fail_phone2(s):
    msg = "Should fail: '{}'".format(s)
    print(msg)
    try:
        parse_phone2 (s)
    except ValueError:
        print ("==> Function correctly raised an exception.")
    else:
        raise AssertionError ("Function did *not* raise an exception as expected!")
        
failure_cases = ['+1 (404) 555-3355',
                 '404.555.3355',
                 '404 555-3355',
                 '404 555 3355',
                 '(404-555-1212'
                ]
for s in failure_cases:
    fail_phone2(s)
    
print("\n(Passed!)")

Should pass: '  (404)   555-1212  '
Should pass: '(404)555-1212  '
Should pass: '  404-555-1212 '
Should pass: '  404-5551212 '
Should pass: ' 4045551212'
Should pass: '     3255447355     '
Should pass: ' 000-4647826   '
Should pass: ' 5491367117  '
Should pass: '   3573108177    '
Should pass: '  352-3237936   '
Should fail: '+1 (404) 555-3355'
==> Function correctly raised an exception.
Should fail: '404.555.3355'
==> Function correctly raised an exception.
Should fail: '404 555-3355'
==> Function correctly raised an exception.
Should fail: '404 555 3355'
==> Function correctly raised an exception.
Should fail: '(404-555-1212'
==> Function correctly raised an exception.

(Passed!)
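
As an alternative to pulling the digits out of the area-code group with a second search, you could capture them directly by giving each branch of the alternation its own named group. This is only a sketch of that idea; `PHONE2_RE`, `parse_phone2_alt`, `area_paren`, and `area_plain` are hypothetical names, not part of the exercise.

```python
import re

PHONE2_RE = re.compile(r'''
    \s*
    (?:
        \((?P<area_paren>\d{3})\)\s*   # parenthesized area code, then optional spaces
      | (?P<area_plain>\d{3})-?        # bare area code, then an optional hyphen
    )
    (?P<local3>\d{3})
    -?
    (?P<local4>\d{4})
    \s*
''', re.VERBOSE)

def parse_phone2_alt(s):
    """Returns (area, first_three, last_four) or raises ValueError."""
    m = PHONE2_RE.fullmatch(s)
    if m is None:
        raise ValueError("Bad phone number")
    # Exactly one of the two area groups participates in a successful match.
    area = m.group('area_paren') or m.group('area_plain')
    return area, m.group('local3'), m.group('local4')

for sample in ['(404) 555-1212', '(404) 5551212', '404-555-1212',
               '404-5551212', '404555-1212', '4045551212']:
    print(sample, '->', parse_phone2_alt(sample))
```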


## Part 1: Processing an HTML file

One of the richest sources of information is [the Web](http://www.computerhistory.org/revolution/networking/19/314)!

We ask you to use string processing and regular expressions to mine a web page, which is stored in HTML format.

**The data: Yelp! reviews.** The data you will work with is a snapshot of a recent search on the [Yelp! site](https://yelp.com) for the best fried chicken restaurants in Atlanta. That snapshot is hosted here: https://cse6040.gatech.edu/datasets/yelp-example

If you go ahead and open that site, you'll see that it contains a ranked list of places:

![Top 10 Fried Chicken Spots in ATL as of September 12, 2017](https://cse6040.gatech.edu/datasets/yelp-example/ranked-list-snapshot.png)

### Getting the data

First things first: you need an HTML file. The following Python code will download a particular web page that we've prepared for this exercise and store it locally in a file.

> If the file exists, this command will not overwrite it. By not doing so, we can reduce accesses to the server that hosts the file. Also, if an error occurs during the download, this cell may report that the downloaded file is corrupt; in that case, you should try re-running the cell.

In [12]:
import requests
import os
import hashlib

if os.path.exists('.voc'):
    data_url = 'https://cse6040.gatech.edu/datasets/yelp-example/yelp.htm'
else:
    data_url = 'https://github.com/cse6040/labs-fa17/raw/master/datasets/yelp.htm'

if not os.path.exists('yelp.htm'):
    print("Downloading: {} ...".format(data_url))
    r = requests.get(data_url)
    with open('yelp.htm', 'w', encoding=r.encoding) as f:
        f.write(r.text)

with open('yelp.htm', 'r', encoding='utf-8') as f:
    yelp_html = f.read().encode(encoding='utf-8')
    checksum = hashlib.md5(yelp_html).hexdigest()
    assert checksum == "4a74a0ee9cefee773e76a22a52d45a8e", "Downloaded file has incorrect checksum!"
    
print("'yelp.htm' is ready!")

'yelp.htm' is ready!


**Reading the HTML file into a Python string.** Let's also open the file in Python and read its contents into a string named `yelp_html`.

In [13]:
with open('yelp.htm', 'r', encoding='utf-8') as yelp_file:
    yelp_html = yelp_file.read()
    
# Print first few hundred characters of this string:
print("*** type(yelp_html) == {} ***".format(type(yelp_html)))
n = 1000
print("*** Contents (first {} characters) ***\n{} ...".format(n, yelp_html[:n]))

*** type(yelp_html) == <class 'str'> ***
*** Contents (first 1000 characters) ***
<!DOCTYPE html>
<!-- saved from url=(0079)https://www.yelp.com/search?find_desc=fried+chicken&find_loc=Atlanta%2C+GA&ns=1 -->
<html xmlns:fb="http://www.facebook.com/2008/fbml" class="js gr__yelp_com" lang="en"><!--<![endif]--><head data-component-bound="true"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><link type="text/css" rel="stylesheet" href="./Best Fried chicken in Atlanta, GA - Yelp_files/css"><style type="text/css">.gm-style .gm-style-cc span,.gm-style .gm-style-cc a,.gm-style .gm-style-mtc div{font-size:10px}
</style><style type="text/css">@media print {  .gm-style .gmnoprint, .gmnoprint {    display:none  }}@media screen {  .gm-style .gmnoscreen, .gmnoscreen {    display:none  }}</style><style type="text/css">.gm-style-pbc{transition:opacity ease-in-out;background-color:rgba(0,0,0,0.45);text-align:center}.gm-style-pbt{font-size:22px;color:white;font-family:Roboto,Arial,san

### Extracting the ranking

Write some Python code to create a variable named `rankings`, which is a list of dictionaries set up as follows:

* `rankings[i]` is a dictionary corresponding to the restaurant whose rank is `i+1`. For example, from the screenshot above, `rankings[0]` should be a dictionary with information about Gus's World Famous Fried Chicken.
* Each dictionary, `rankings[i]`, should have these keys:
    * `rankings[i]['name']`: The name of the restaurant, a string.
    * `rankings[i]['stars']`: The star rating, as a string, e.g., `'4.5'`, `'4.0'`
    * `rankings[i]['numrevs']`: The number of reviews, as an **integer.**
    * `rankings[i]['price']`: The price range, as dollar signs, e.g., `'$'`, `'$$'`, `'$$$'`, or `'$$$$'`.
    
Of course, since the current topic is regular expressions, you might try to apply them (possibly combined with other string manipulation methods) to find the particular patterns that yield the desired information.

In [31]:
sections = yelp_html.split('<span class="indexed-biz-name">')


patterns = {
    'name': r'''<a class="biz-name js-analytics-click" data-analytics-label="biz-name" href="[^"]*" data-hovercard-id="[^"]*"><span>(.+)</span></a>''',
    'stars': r'''title="([0-9.]+) star rating"''',
    'numrevs': r'''(\d+) reviews''',
    'price': r'''<span class="business-attribute price-range">(\$+)</span>'''
}

def get_field(s, key):
    match = re.search(patterns[key], s)
    if match is not None:
        return match.groups()[0]
    return None

rankings = []
for i, section in enumerate(sections[1:]):
    rankings.append({})
    for key in patterns.keys():
        rankings[i][key] = get_field(section, key)

for r in rankings:
    r['numrevs'] = int(r['numrevs'])
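
One caveat about the `name` pattern: its `(.+)` group is greedy. It works on this snapshot (the test cell passes), but on markup where two entries share a line, a greedy group can swallow everything between the first opening tag and the last closing tag. The sketch below uses a hypothetical one-line snippet, not taken from `yelp.htm`, to compare a greedy group, a lazy group, and a negated character class.

```python
import re

# Hypothetical snippet with two entries on one line.
snippet = ('<a href="/a"><span>Busy Bee Cafe</span></a> '
           '<a href="/b"><span>Colonnade Restaurant</span></a>')

greedy = re.findall(r'<span>(.+)</span>', snippet)      # one oversized match
lazy = re.findall(r'<span>(.+?)</span>', snippet)       # stops at the first </span>
negated = re.findall(r'<span>([^<]+)</span>', snippet)  # cannot cross a tag boundary

print(greedy)
print(lazy)     # ['Busy Bee Cafe', 'Colonnade Restaurant']
print(negated)  # ['Busy Bee Cafe', 'Colonnade Restaurant']
```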

In [32]:
# Test cell: `rankings_test`

assert type(rankings) is list, "`rankings` must be a list"
assert all([type(r) is dict for r in rankings]), "All `rankings[i]` must be dictionaries"

print("=== Rankings ===")
for i, r in enumerate(rankings):
    print("{}. {} ({}): {} stars based on {} reviews".format(i+1,
                                                             r['name'],
                                                             r['price'],
                                                             r['stars'],
                                                             r['numrevs']))

assert rankings[0] == {'numrevs': 549, 'name': 'Gus’s World Famous Fried Chicken', 'stars': '4.0', 'price': '$$'}
assert rankings[1] == {'numrevs': 1777, 'name': 'South City Kitchen - Midtown', 'stars': '4.5', 'price': '$$'}
assert rankings[2] == {'numrevs': 2241, 'name': 'Mary Mac’s Tea Room', 'stars': '4.0', 'price': '$$'}
assert rankings[3] == {'numrevs': 481, 'name': 'Busy Bee Cafe', 'stars': '4.0', 'price': '$$'}
assert rankings[4] == {'numrevs': 108, 'name': 'Richards’ Southern Fried', 'stars': '4.0', 'price': '$$'}
assert rankings[5] == {'numrevs': 93, 'name': 'Greens &amp; Gravy', 'stars': '3.5', 'price': '$$'}
assert rankings[6] == {'numrevs': 350, 'name': 'Colonnade Restaurant', 'stars': '4.0', 'price': '$$'}
assert rankings[7] == {'numrevs': 248, 'name': 'South City Kitchen Buckhead', 'stars': '4.5', 'price': '$$'}
assert rankings[8] == {'numrevs': 1558, 'name': 'Poor Calvin’s', 'stars': '4.5', 'price': '$$'}
assert rankings[9] == {'numrevs': 67, 'name': 'Rock’s Chicken &amp; Fries', 'stars': '4.0', 'price': '$'}

print("\n(Passed!)")

=== Rankings ===
1. Gus’s World Famous Fried Chicken ($$): 4.0 stars based on 549 reviews
2. South City Kitchen - Midtown ($$): 4.5 stars based on 1777 reviews
3. Mary Mac’s Tea Room ($$): 4.0 stars based on 2241 reviews
4. Busy Bee Cafe ($$): 4.0 stars based on 481 reviews
5. Richards’ Southern Fried ($$): 4.0 stars based on 108 reviews
6. Greens &amp; Gravy ($$): 3.5 stars based on 93 reviews
7. Colonnade Restaurant ($$): 4.0 stars based on 350 reviews
8. South City Kitchen Buckhead ($$): 4.5 stars based on 248 reviews
9. Poor Calvin’s ($$): 4.5 stars based on 1558 reviews
10. Rock’s Chicken &amp; Fries ($): 4.0 stars based on 67 reviews

(Passed!)
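
Notice that some extracted names still contain the HTML entity `&amp;`, because the pattern copies the raw markup. The test cell expects that raw form, so leave `rankings` as it is; but if you wanted display-ready names, one option is the standard library's `html.unescape()`. A minimal sketch on a few of the names above (`raw_names` and `display_names` are hypothetical variables):

```python
import html

raw_names = ['Greens &amp; Gravy', 'Rock’s Chicken &amp; Fries', 'Busy Bee Cafe']

# unescape() converts entities such as &amp; back to the characters they stand for.
display_names = [html.unescape(name) for name in raw_names]
print(display_names)   # ['Greens & Gravy', 'Rock’s Chicken & Fries', 'Busy Bee Cafe']
```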
