# Lab 5


In this lab, you will practice methods for getting data and working with regular expressions.

### Lab Setup

In [None]:
import re

import otter
grader = otter.Notebook()

# Python Regular Expressions

The Python `re` module provides many functions for regular expression support.  Here you will learn more about the different functions and complete exercises to practice their use. 

## `re.match` 

The `match(pattern, string)` function checks a pattern against some text.  It only tries to match at the beginning of the text; if the pattern appears later in the string, `match()` returns `None`.

`re.match` Documentation:  https://docs.python.org/3.10/library/re.html#re.match


*Reminder*: the `r` at the start of the pattern indicates that it is a "raw" string, which passes backslashes through unchanged (handy for regular expressions).
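
The reminder above can be sketched as follows. This is a minimal illustration, and the variable names and sample text are made up for this sketch rather than taken from the lab:

```python
import re

# In a normal string, '\n' is a single character (a newline).
plain = '\n'
# In a raw string, r'\n' is two characters: a backslash and the letter n.
raw = r'\n'
print(len(plain), len(raw))   # 1 2

# Raw strings let regex escapes such as \d and \s reach the re module unchanged.
found = re.search(r'\d+\s\w+', 'room 101 north')
print(found.group())          # 101 north
```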

### Example

In [None]:
tmpStr1 = 'Regular expressions are great'
tmpStr2 = 'It is fun learning about regular expressions'
match = re.match(r'[Rr]egular', tmpStr1)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

match = re.match(r'[Rr]egular', tmpStr2)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

## `re.search`

The `re.search(pat, str)` function takes two main arguments: `pat`, a regular expression pattern, and `str`, a string.  It searches for the first occurrence of the pattern within the string.  If successful, `search()` returns a match object; otherwise it returns `None`.

`re.search()` Documentation: https://docs.python.org/3.10/library/re.html#re.search
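
To see how `search()` differs from `match()`, here is a small sketch that runs both on the same string. The string and variable names are illustrative assumptions, not part of the lab:

```python
import re

text = 'learning regular expressions'   # illustrative string

at_start = re.match(r'regular', text)    # pattern is not at position 0
anywhere = re.search(r'regular', text)   # scans forward for the first hit

print(at_start)          # None
print(anywhere.span())   # (9, 16)
```

The `span()` method on a match object reports where in the string the match was found.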

### Example

In [None]:
match = re.search(r'[Rr]egular', tmpStr1)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

match = re.search(r'[Rr]egular', tmpStr2)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

### Example

In [None]:
tmpStr1 = 'I have a cat, Fido'
tmpStr2 = 'I have a cat, Felix'
tmpStr3 = 'I have a cat, It'
match = re.search(r'cat,\s\w\w\w\w', tmpStr1)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

Try running the expression above on the three test strings. 
 

<!-- BEGIN QUESTION -->

## Exercise 1 - Properties of search

Examine the following uses of the search function.

In [None]:
tmpStr1 = 'baa baaa black sheep'
match = re.search(r'ba+', tmpStr1)
if match: 
    print('found: ', match.group()) 
else: 
    print("did not find")

tmpStr2 = 'baa2 baaaa4 baaa3'
match = re.search(r'ba+\d', tmpStr2)
if match: 
    print('found: ', match.group()) 
else: 
    print("did not find")

**Q** Which of the "baa" words is returned in tmpStr2?  Will the function return the leftmost or rightmost occurrence in a string? 

**ANS**   

<!-- END QUESTION -->

### Example - Anchors

The exception to your answer above is when the pattern uses anchors, which require the match to occur at the beginning (`^`) or end (`$`) of the string. 

In [None]:
tmpStr1 = 'foobar1 foobar2 foobar3'
match = re.search(r'^f\w+\d', tmpStr1)
if match: 
    print('found: ', match.group()) 
else: 
    print("did not find")

match = re.search(r'f\w+\d$', tmpStr1)
if match: 
    print('found: ', match.group()) 
else: 
    print("did not find")

## Exercise 2 - Escape Character

Consider the following code that reads in each line of a file and prints those lines that match the pattern given. 

```python 
with open('example-text.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search(r'\$.+', line):
            print(line)
```

Which of the following lines could be printed when running the code? 

```
    1. It will cost you $1.00   
    2. You owe three dollars.  
    3. $2.50 is your change  
    4. From: anon@mtu.edu $a  
```

Your answer, `q2`, will be a list of the numbers of the lines that could be printed, e.g.,   
`q2 = list()` or `q2 = [1]` or `q2 = [3, 4]`, or `q2 = [1, 2, 3, 4]`.

Try answering without running the code.



In [None]:
q2 = ...

In [None]:
grader.check("q2")

## Exercise 3 - `+` 

Place your answer `True` or `False` in variable `q3` to the following question. 

Will the following code match only the first email (up to the `@` sign) in the string?

```python 
stri = 'From: Olivia.Rodrigo@yahoo.com, badbunny@hotmail.com, taylorswift@gmail.com'
stri = stri.rstrip()
print(re.findall('From:.+@', stri))
```


In [None]:
q3 = ...   # True or False

In [None]:
grader.check("q3")

<!-- BEGIN QUESTION -->

#### Exercise 3a 

Briefly explain (less than 12 words) your answer above.  

**A**. *Your answer here*

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Exercise 4 - Explanation 

Consider the regular expression below.  

**Q** Briefly explain what patterns it matches. 

**Q** Explain what the `?:` does.  

`re.findall(r'\$\d+(?:\.\d{2})?', x)`

**A**. The pattern matches ...

**A**.  The `?:` ...

<!-- END QUESTION -->

## Exercise 5 - Create a pattern 

Construct a regex that matches `abc` followed by zero or more digits (0-9).

In [None]:
q5 = '...'   # regex pattern, e.g., `a[b]*`

In [None]:
grader.check("q5")

## Exercise 6 - Create a pattern 

Construct a regex that matches one or more lowercase letters (a-z) followed by a space character and then two to four digits.

In [None]:
q6 = '...'   # regex pattern, e.g., `a[b]*`

In [None]:
grader.check("q6")

## Exercise 7 - Create a pattern 

Construct a regex that captures strings that have three digits followed by a period and then five letters from a to z.

In [None]:
q7 = '...'   # regex pattern, e.g., `a[b]*`

In [None]:
grader.check("q7")

## Exercise 8 - Create a Function 

Create a function called `match_word(word, strv)` that returns `True` if the string `strv` contains the word `word`, but not if it is part of another word. 

For example, `match_word('is', "This was bad")` would return `False` and `match_word('is', "This is good")` would return `True`.

In [None]:
def match_word(word, strv): 
    # Returns True if strv contains the word (but not as part of another word);
    # otherwise returns False.
    
    return 

print(match_word('is', 'This was bad'))
print(match_word('is', 'This is good'))

In [None]:
grader.check("q8")

## Exercise 9 - Pattern Match 

Create a regular expression pattern that matches all the positive examples below, but none of the negative examples.  You cannot simply list the positive strings "or"ed together. 

| Positive | Negative | 
|----------|----------|
| pit      | pt       | 
| spot     | Pot      |
| spate    | peat     | 
| slap two | part     | 
| respite  | SLIP ten |

In [None]:
cases = ['pit', 'spot', 'spate', 'slap two', 'respite', 'pt', 'Pot', 'peat', 
         'part', 'SLIP ten']
positive, negative = [], []
pat = r'...'      # Write regular expression pattern here 

# DO NOT CHANGE BELOW
print('Positive Cases: \n')
for ex in cases: 
    match = re.search(pat, ex)
    if ex=="pt": 
        print("\nNegative Cases: \n")
    if match: 
        print("%9s: found" % ex)
        positive.append(ex)
    else: 
        print("%9s: not found" % ex)
        negative.append(ex)

In [None]:
grader.check("q9")

## Exercise 10 - Pattern Match

Create a regular expression pattern that matches all the positive examples below, but none of the negative examples. You cannot simply list the positive strings "or"ed together.

| Positive | Negative | 
|----------|----------|
| rap them | aleht    | 
| tapeth   | happy them | 
| apth     | tarpth | 
| wrap/try | Apt | 
| sap tray | peth | 
| 87ap9th  | tarreth | 
| apothecary | ddapdg | 
|      | apples | 
|      | shape the |

In [None]:
cases_E10 = ['rap them', 'tapeth', 'apth', 'wrap/try', 'sap tray', '87ap9th', 'apothecary',
         'aleht', 'happy them', 'tarpth', 'Apt', 'peth', 'tarreth', 'ddapdg', 
         'apples', 'shape the']
positive_E10, negative_E10 = [], []
pattern_E10 = r'...'  # Fill in pattern

# DO NOT CHANGE BELOW
print('Positive Cases: \n')
for ex in cases_E10: 
    match = re.search(pattern_E10, ex)
    if ex=="aleht": 
        print("\nNegative Cases: \n")
    if match: 
        print("%11s: found" % ex)
        positive_E10.append(ex)
    else: 
        print("%11s: not found" % ex)
        negative_E10.append(ex)

In [None]:
grader.check("q10")

## Exercise 11 - Pattern Match 

Create a regular expression pattern that matches all the positive examples below, but none of the negative examples. You cannot simply list the positive strings "or"ed together.

| Positive | Negative | 
|----------|----------|
| affgfking | fgok | 
| rafgkahe | a fgk | 
| bafghk | affgm | 
| baffgkit | afffhk | 
| affgfking | fgik | 
| rafgkahe | afg.K | 
| bafghk | aff gm | 
| baffg kit | afffhgk | 









In [None]:
cases_E4 = ['affgfking', 'rafgkahe', 'bafghk', 'baffgkit', 'affgfking', 'rafgkahe', 
         'bafghk', 'baffg kit', 'fgok', 'a fgk', 'affgm', 'afffhk', 'fgik', 
         'afg.K', 'aff gm', 'afffhgk']
positive_E4, negative_E4 = [], []   
pattern_E4 = r'...'   # Fill in pattern

# DO NOT CHANGE BELOW
print('Positive Cases: \n')
for ex in cases_E4: 
    match = re.search(pattern_E4, ex)
    if ex=='fgok':
        print("\nNegative Cases: \n")
    if match: 
        print("%10s: found" % ex)
        positive_E4.append(ex)
    else: 
        print("%10s: not found" % ex)
        negative_E4.append(ex)

In [None]:
grader.check("q11")

### Example - Group Extraction 

The "group" feature of regular expressions allows part of the matching text to be selected out.  Let's say we want to extract an email from a string, but in addition to finding the email we want to extract the username and host separately, e.g., to pull out an MTU ISO login. 

The parentheses in the pattern identify the "groups" inside the text.  

In [None]:
tempStr = 'send an email to John, jdoe@mtu.edu, by tomorrow'
match = re.search(r'([\w]+)@([\w.]+)', tempStr)
if match: 
    print("Email:    ", match.group())
    print("username: ", match.group(1))
    print("hostname: ", match.group(2))
else: 
    print("no match")
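
As a complement to the cell above, this sketch shows that `group(0)` is the whole match and that `groups()` returns all captured groups at once as a tuple. The address and variable names are made up for illustration:

```python
import re

sample = 'contact: asmith@example.org'   # made-up address for illustration
m = re.search(r'(\w+)@([\w.]+)', sample)

print(m.group(0))   # asmith@example.org  (same as m.group())
print(m.groups())   # ('asmith', 'example.org')

# The tuple can be unpacked directly into separate names.
user, host = m.groups()
```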

## Exercise 12 - Groups

There are discussions on what is the best regular expression pattern to match emails (e.g., used to verify emails in forms).  But, let's think about how to extend the pattern above to handle the following cases: 

* Usernames can contain letters, numbers, and underscores, but will not start with a number, e.g., jdoe15@mtu.edu, sherlock24@gmail.com, tom_brady@gmail.com 
* An email may have a task-specific address (for example, Google allows this), where you can add additional identifiers after your username, e.g., harrypotter+news@gmail.com or jonstark+dragons@gmail.com.  Make sure you can separate out a username from the tasks. 
    * "harrypotter+news@gmail.com" has username "harrypotter" and task "news"



In [None]:
cases_E5 = ['jdoe@gmail.com', 'sherlock24@gmail.com', 'tom_brady@gmail.com', 
            'harrypotter+news@gmail.com', 'jonstark+dragons@gmail.com',
            'juliet_capulet+poison@gmail.com', 'Charles_Dickens@yahoo.com', 
            'AnakinSkywalker@hotmail.com']
email, username, hostname = [], [], []
pattern_E5 = r'...'

# DO NOT CHANGE BELOW
for ex in cases_E5: 
    match = re.search(pattern_E5, ex)
    if match: 
        print("Email: ", match.group(), end='')
        print("\tUsername: ", match.group(1), end='')
        print("\tHostname: ", match.group(2))
        email.append(match.group()) 
        username.append(match.group(1))
        hostname.append(match.group(2))
    else: 
        print("no match")

In [None]:
grader.check("q12")

## `re.findall()` 

The `re.findall()` function returns all non-overlapping occurrences of a pattern in a string. 

`re.findall()` Documentation: https://docs.python.org/3.7/library/re.html#re.findall
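
A short sketch of what "non-overlapping" means in practice, using made-up strings (none of them come from the lab files):

```python
import re

# Each match is consumed before the scan continues, so matches never overlap.
pairs = re.findall(r'aa', 'aaaaa')
print(pairs)      # ['aa', 'aa']

numbers = re.findall(r'\d+', 'lab 5 has 12 exercises and 3 examples')
print(numbers)    # ['5', '12', '3']

# With no matches, findall returns an empty list rather than None.
empty = re.findall(r'\d+', 'no digits here')
print(empty)      # []
```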

### Example - findall with Files 

In the `nb.week5.part2.ipynb` notebook, we saw examples of looping over the lines of a file and running the regular expression. 


In [None]:
with open('data/rime-intro.txt', 'r') as f:
    rime = f.readlines()

In [None]:
for elem in rime:
    #print (elem)
    m = re.search(r"Ship", elem)
    if m:
        print(m.group())
    else:
        print("No match")

Or, we could do this for each line within the file reader block. 

In [None]:
with open('data/rime-intro.txt', 'r') as f:
    for line in f:
        m = re.search(r"Ship", line)
        if m:
            print(m.group())
        else:
            print("No match")

Instead, we can let `findall()` do the iteration. 

In [None]:
with open("data/rime-intro.txt", 'r') as f:
    strs = re.findall(r'Ship', f.read())
strs

## `re.sub()` 

The `re.sub(pat, replacement, str)` function takes three arguments: the regular expression pattern, a replacement string, and the string to search.  The function searches for all instances of the pattern in the given string and replaces them.  



In [None]:
print(re.sub(r'benefits', 'advantages', 'Show the benefits of doing many examples'))
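
The sketch below illustrates that `re.sub()` replaces every instance by default. It also uses the optional `count` argument, which the lab does not cover, to limit the number of replacements. The strings and variable names are illustrative:

```python
import re

all_replaced = re.sub(r'a', 'o', 'banana')
print(all_replaced)     # bonono

# count=1 stops after the first replacement.
first_only = re.sub(r'a', 'o', 'banana', count=1)
print(first_only)       # bonana

# If the pattern never matches, the string comes back unchanged.
unchanged = re.sub(r'x', 'o', 'banana')
print(unchanged)        # banana
```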

### Example - Substitution

Replacement strings can make use of groups using `\1` and `\2`, to refer to `group(1)` and `group(2)`. 

For example, in the following text search for email addresses and replace the host with gmail.com. 

In [None]:
tempStr = 'testing abc@mtu.edu, other words. punctuation doe@foobar.org blah'
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@gmail.com', tempStr))
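
The cell above uses only `\1`. As a further sketch, both `\1` and `\2` can appear in the replacement, here to reorder "Last, First" names. The names and variables are made-up data for illustration:

```python
import re

names = 'Lovelace, Ada; Hopper, Grace'   # made-up data
# \1 is the last name, \2 is the first name; the replacement swaps them.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', names)
print(swapped)   # Ada Lovelace; Grace Hopper
```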

We will see more examples of regular expressions next week with respect to web scraping. 


## Congratulations! You have finished Lab 5! 

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Submit this file to the Lab 5 assignment on Gradescope. 


Make sure you have run all cells in your notebook **in order** before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

## Submission


In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)