# Lab - Week 4


In this lab, you will practice methods for getting data and using regular expressions.

### Autograder Setup

The next code cell should be uncommented to run the autograder tests when using Colab/DeepNote. If you are using an environment with `otter-grader` already installed (your own machine, lab machines), then do not uncomment the code.

In [56]:
#!pip install otter-grader

Gather a short text document for use later in the lab. 

In [57]:
#!wget https://pages.mtu.edu/~lebrown/un5550-f22/labs/lab4/lab4.files.zip
#!unzip lab4.files.zip 

**COMMENT out these past two cells before submitting**

### Lab Setup

In [58]:
import re

import otter
grader = otter.Notebook()

# Python Regular Expressions

The Python `re` module provides many functions for regular expression support.  Here you will learn more about the different functions and complete exercises to practice their use. 

## `re.match` 

The `match(pattern, string)` function checks a pattern against some text.  It only tries to find the pattern at the beginning of the text.

`re.match` Documentation:  https://docs.python.org/3.7/library/re.html#re.match


*Reminder*: the 'r' at the start of the pattern indicates that it is a "raw" string, which passes backslashes through unchanged (handy for regular expressions).
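
To make the reminder concrete, here is a small sketch comparing a normal string with a raw string. The variable names (`plain`, `raw`) and the sample text are illustrative only, not part of the lab.

```python
import re

# In a normal string literal, '\b' is the backspace character (one character).
# In a raw string literal, r'\b' is a backslash followed by 'b' (two characters),
# which the re module interprets as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'

plain_len = len(plain)   # backspace, c, a, t, backspace
raw_len = len(raw)       # \, b, c, a, t, \, b

plain_match = re.search(plain, 'my cat sat')   # looks for literal backspace characters
raw_match = re.search(raw, 'my cat sat')       # looks for the whole word "cat"

print(plain_len, raw_len)
print(plain_match)
print(raw_match.group())
```

Only the raw version finds `cat`, because the normal string never hands a `\b` escape to the regular expression engine.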

### Example

In [59]:
tmpStr1 = 'Regular expressions are great'
tmpStr2 = 'It is fun learning about regular expressions'
match = re.match(r'[Rr]egular', tmpStr1)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

match = re.match(r'[Rr]egular', tmpStr2)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

found  Regular
did not find


## `re.search`

The `re.search(pat, str)` function takes two main arguments: `pat`, a regular expression pattern, and `str`, a string.  It searches for the first occurrence of the pattern anywhere within the string.  If successful, `search()` returns a match object; otherwise it returns `None`.

`re.search()` Documentation: https://docs.python.org/3.7/library/re.html#re.search
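
The following sketch shows what the two possible return values look like in practice. The sample text and the names `found` and `missing` are assumptions made for illustration.

```python
import re

text = 'It is fun learning about regular expressions'

found = re.search(r'[Rr]egular', text)
missing = re.search(r'irregular', text)

# A successful search returns a match object that records what matched and where.
print(found.group())                      # the matched text
print(found.span())                       # (start, end) indices into text
print(text[found.start():found.end()])    # slicing with those indices gives the same text

# An unsuccessful search returns None, so test the result before calling .group().
print(missing is None)
```

Calling `.group()` on `None` raises an `AttributeError`, which is why the examples in this lab check `if match:` first.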

### Example

In [60]:
match = re.search(r'[Rr]egular', tmpStr1)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

match = re.search(r'[Rr]egular', tmpStr2)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

found  Regular
found  regular


### Example

In [61]:
tmpStr1 = 'I have a cat, Fido'
tmpStr2 = 'I have a cat, Felix'
tmpStr3 = 'I have a cat, It'
match = re.search(r'cat,\s\w\w\w\w', tmpStr1)
if match: 
    print('found ', match.group()) 
else: 
    print("did not find")

found  cat, Fido


Try running the expression above on the three test strings. 
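
One possible way to try all three strings at once is to loop over them. This is only a sketch; the names `pattern`, `samples`, and `results` are illustrative.

```python
import re

pattern = r'cat,\s\w\w\w\w'
samples = ['I have a cat, Fido', 'I have a cat, Felix', 'I have a cat, It']

results = []
for s in samples:
    m = re.search(pattern, s)
    # Store the matched text, or None when the pattern is not found.
    results.append(m.group() if m else None)
    print(repr(s), '->', results[-1])
```

Because the pattern asks for exactly four word characters after the space, a longer name is cut off after four letters and a shorter name does not match at all.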
 

<!-- BEGIN QUESTION -->

## Exercise 1 - Properties of search

Examine the following uses of the search function.

In [62]:
tmpStr1 = 'baa baaa black sheep'
match = re.search(r'ba+', tmpStr1)
if match: 
    print('found: ', match.group()) 
else: 
    print("did not find")

tmpStr2 = 'baa2 baaaa4 baaa3'
match = re.search(r'ba+\d', tmpStr2)
if match: 
    print('found: ', match.group()) 
else: 
    print("did not find")

found:  baa
found:  baa2


**Q** Which of the "baa" words is returned in tmpStr2?  Will the function return the leftmost or rightmost occurrence in a string?

**ANS**   
The first word 'baa2'. Leftmost one 

<!-- END QUESTION -->

### Example - Anchors

The exception to your answer above is if the pattern specifies anchors to find a match at the beginning `^` or end `$` of a string. 

In [63]:
tmpStr1 = 'foobar1 foobar2 foobar3'
match = re.search(r'^f\w+\d', tmpStr1)
if match: 
    print('found: ', match.group()) 
else: 
    print("did not find")

match = re.search(r'f\w+\d$', tmpStr1)
if match: 
    print('found: ', match.group()) 
else: 
    print("did not find")

found:  foobar1
found:  foobar3
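
As a further illustration, the sketch below relates anchors to `re.match`. It assumes the default flags (in particular, no `re.MULTILINE`); the variable names are illustrative.

```python
import re

text = 'foobar1 foobar2 foobar3'

# re.match only tries position 0, so here it behaves like re.search with a leading ^.
via_match = re.match(r'f\w+\d', text)
via_search = re.search(r'^f\w+\d', text)

# \w+ is greedy, but it cannot cross the space, so the match stops at 'foobar1'.
print(via_match.group(), via_search.group())

# An anchored pattern fails outright when the text at that position does not fit.
start_anchored = re.search(r'^bar\d', text)   # the string does not start with 'bar'
end_anchored = re.search(r'bar\d$', text)     # the string does end with 'bar3'
print(start_anchored)
print(end_anchored.group())
```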


## Exercise 2 - Create a pattern 

Create a regular expression pattern that matches all the positive examples below, but none of the negative examples.  You cannot simply list the positive strings "or"ed together.

| Positive | Negative | 
|----------|----------|
| pit      | pt       | 
| spot     | Pot      |
| spate    | peat     | 
| slap two | part     | 
| respite  | SLIP ten |

In [64]:
cases = ['pit', 'spot', 'spate', 'slap two', 'respite', 'pt', 'Pot', 'peat', 
         'part', 'SLIP ten']
positive, negative = [], []
pat = r'^[sr]|i'
print('Positive Cases: \n')
for ex in cases: 
    match = re.search(pat, ex)
    if ex=="pt": 
        print("\nNegative Cases: \n")
    if match: 
        print("%9s: found" % ex)
        positive.append(ex)
    else: 
        print("%9s: not found" % ex)
        negative.append(ex)

Positive Cases: 

      pit: found
     spot: found
    spate: found
 slap two: found
  respite: found

Negative Cases: 

       pt: not found
      Pot: not found
     peat: not found
     part: not found
 SLIP ten: not found


In [65]:
grader.check("q2")

## Exercise 3 - Create a Pattern 

Create a regular expression pattern that matches all the positive examples below, but none of the negative examples. You cannot simply list the positive strings "or"ed together.

| Positive | Negative | 
|----------|----------|
| rap them | aleht    | 
| tapeth   | happy them | 
| apth     | tarpth | 
| wrap/try | Apt | 
| sap tray | peth | 
| 87ap9th  | tarreth | 
| apothecary | ddapdg | 
|      | apples | 
|      | shape the |

In [66]:
cases_E3 = ['rap them', 'tapeth', 'apth', 'wrap/try', 'sap tray', '87ap9th', 'apothecary',
         'aleht', 'happy them', 'tarpth', 'Apt', 'peth', 'tarreth', 'ddapdg', 
         'apples', 'shape the']
positive_E3, negative_E3 = [], []
pattern_E3 = r'ap.t|apt'
print('Positive Cases: \n')
for ex in cases_E3: 
    match = re.search(pattern_E3, ex)
    if ex=="aleht": 
        print("\nNegative Cases: \n")
    if match: 
        print("%11s: found" % ex)
        positive_E3.append(ex)
    else: 
        print("%11s: not found" % ex)
        negative_E3.append(ex)

Positive Cases: 

   rap them: found
     tapeth: found
       apth: found
   wrap/try: found
   sap tray: found
    87ap9th: found
 apothecary: found

Negative Cases: 

      aleht: not found
 happy them: not found
     tarpth: not found
        Apt: not found
       peth: not found
    tarreth: not found
     ddapdg: not found
     apples: not found
  shape the: not found


In [67]:
grader.check("q3")

## Exercise 4 - Create a pattern 

Create a regular expression pattern that matches all the positive examples below, but none of the negative examples. You cannot simply list the positive strings "or"ed together.

| Positive | Negative | 
|----------|----------|
| affgfking | fgok | 
| rafgkahe | a fgk | 
| bafghk | affgm | 
| baffgkit | afffhk | 
| affgfking | fgik | 
| rafgkahe | afg.K | 
| bafghk | aff gm | 
| baffg kit | afffhgk | 









In [68]:
cases_E4 = ['affgfking', 'rafgkahe', 'bafghk', 'baffgkit', 'affgfking', 'rafgkahe', 
         'bafghk', 'baffg kit', 'fgok', 'a fgk', 'affgm', 'afffhk', 'fgik', 
         'afg.K', 'aff gm', 'afffhgk']
positive_E4, negative_E4 = [], []   
pattern_E4 = r'^[^af]|g$'
print('Positive Cases: \n')
for ex in cases_E4: 
    match = re.search(pattern_E4, ex)
    if ex=='fgok':
        print("\nNegative Cases: \n")
    if match: 
        print("%10s: found" % ex)
        positive_E4.append(ex)
    else: 
        print("%10s: not found" % ex)
        negative_E4.append(ex)

Positive Cases: 

 affgfking: found
  rafgkahe: found
    bafghk: found
  baffgkit: found
 affgfking: found
  rafgkahe: found
    bafghk: found
 baffg kit: found

Negative Cases: 

      fgok: not found
     a fgk: not found
     affgm: not found
    afffhk: not found
      fgik: not found
     afg.K: not found
    aff gm: not found
   afffhgk: not found


In [69]:
grader.check("q4")

### Example - Group Extraction 

The "group" feature of regular expressions allows parts of the matching text to be selected out.  Let's say we want to extract an email from a string, but in addition to finding the email we want to extract the username and host separately, e.g., to pull out an MTU ISO login.

The parentheses in the pattern identify the "groups" inside the text.

In [70]:
tempStr = 'send an email to John, jdoe@mtu.edu, by tomorrow'
match = re.search(r'([\w]+)@([\w.]+)', tempStr)
if match: 
    print("Email:    ", match.group())
    print("username: ", match.group(1))
    print("hostname: ", match.group(2))
else: 
    print("no match")

Email:     jdoe@mtu.edu
username:  jdoe
hostname:  mtu.edu
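
Besides `group(n)`, a match object offers a few other ways to reach the captured groups. A minimal sketch, reusing the same sample sentence (the variable names are illustrative):

```python
import re

text = 'send an email to John, jdoe@mtu.edu, by tomorrow'
match = re.search(r'(\w+)@([\w.]+)', text)

if match:
    # groups() returns every captured group as a tuple, in order.
    username, hostname = match.groups()
    print(username, hostname)
    # group(0) is the same as group(): the entire matched text.
    whole = match.group(0)
    print(whole)
    # span(n) reports where group n sits in the original string.
    start, end = match.span(1)
    print(text[start:end])
```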


## Exercise 5 - Groups

There are discussions on what is the best regular expression pattern to match emails (e.g., for verifying emails in forms).  But let's think about how to extend the pattern above to handle the following cases:

* usernames can contain letters, numbers, and underscores, but will not start with a number, e.g., jdoe15@mtu.edu, sherlock24@gmail.com, tom_brady@gmail.com
* an email may be a task-specific address (for example, Google allows this), where additional identifiers are added after the username, e.g., harrypotter+news@gmail.com or jonstark+dragons@gmail.com.  Make sure you can separate the username from the task.
    * "harrypotter+news@gmail.com" has username "harrypotter" and task "news"



In [71]:
cases_E5 = ['jdoe@gmail.com', 'sherlock24@gmail.com', 'tom_brady@gmail.com', 
            'harrypotter+news@gmail.com', 'jonstark+dragons@gmail.com',
            'juliet_capulet+poison@gmail.com', 'Charles_Dickons@yahoo.com', 
            'AnakinSkywalker@hotmail.com']
email, username, hostname = [], [], []
pattern_E5 = r'([\w]+)\+?[\w]*@([\w.]+)'     
for ex in cases_E5: 
    match = re.search(pattern_E5, ex)
    if match: 
        print("Email: ", match.group(), end='')
        print(" username: ", match.group(1), end='')
        print(" hostname: ", match.group(2))
        email.append(match.group()); username.append(match.group(1))
        hostname.append(match.group(2))
    else: 
        print("no match")

Email:  jdoe@gmail.com username:  jdoe hostname:  gmail.com
Email:  sherlock24@gmail.com username:  sherlock24 hostname:  gmail.com
Email:  tom_brady@gmail.com username:  tom_brady hostname:  gmail.com
Email:  harrypotter+news@gmail.com username:  harrypotter hostname:  gmail.com
Email:  jonstark+dragons@gmail.com username:  jonstark hostname:  gmail.com
Email:  juliet_capulet+poison@gmail.com username:  juliet_capulet hostname:  gmail.com
Email:  Charles_Dickons@yahoo.com username:  Charles_Dickons hostname:  yahoo.com
Email:  AnakinSkywalker@hotmail.com username:  AnakinSkywalker hostname:  hotmail.com
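
The pattern above discards the task. As a sketch of how an optional group could capture it as well, the example below uses named groups. The group names (`user`, `task`, `host`) and the helper `split_email` are hypothetical. This also changes the group numbering, so it is not a drop-in replacement for `pattern_E5`.

```python
import re

# Hypothetical pattern for illustration; the group names are not part of the lab.
pattern = re.compile(
    r'(?P<user>[A-Za-z_]\w*)'      # username: must not start with a digit
    r'(?:\+(?P<task>\w+))?'        # optional "+task" suffix; only the task is captured
    r'@(?P<host>[\w.]+)'           # hostname
)

def split_email(address):
    """Return (user, task, host), or None if the address does not fit the pattern."""
    m = pattern.fullmatch(address)
    if m is None:
        return None
    return m.group('user'), m.group('task'), m.group('host')

print(split_email('harrypotter+news@gmail.com'))
print(split_email('tom_brady@gmail.com'))
print(split_email('24sherlock@gmail.com'))
```

When the optional group does not take part in the match, `m.group('task')` is `None`. `fullmatch` is used here so that an address starting with a digit is rejected rather than matched from its second character.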


In [72]:
grader.check("q5")

## `re.findall()` 

The `re.findall()` function returns all non-overlapping occurrences of a pattern in a string, as a list.

`re.findall()` Documentation: https://docs.python.org/3.7/library/re.html#re.findall
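
The shape of the returned list depends on whether the pattern contains groups. A short sketch with made-up sample text:

```python
import re

text = 'contacts: abc@mtu.edu, doe@foobar.org'

# No groups: a list of the full matched strings.
whole = re.findall(r'\w+@[\w.]+', text)

# Two groups: a list of (group 1, group 2) tuples instead.
pairs = re.findall(r'(\w+)@([\w.]+)', text)

# Matches do not overlap: scanning resumes after the end of the previous match.
doubles = re.findall(r'aa', 'aaaaa')

print(whole)
print(pairs)
print(doubles)
```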

### Example - findall with Files 

In the `nb.week4.part2.ipynb` notebook, we saw examples of looping over the lines of a file and running a regular expression on each line.


In [73]:
# The with block closes the file automatically, so no f.close() is needed.
with open('rime-intro.txt', 'r') as f:
    rime = f.readlines()

In [74]:
# Search each line for the pattern and report whether it was found.
for elem in rime:
    m = re.search(r"Ship", elem)
    if m:
        print(m.group())
    else:
        print("No match")

No match
No match
No match
No match
Ship
No match
No match
No match
No match
No match
No match
No match
No match
No match
No match
No match


Or, we could do this for each line within the file reader block. 

In [75]:
with open('rime-intro.txt', 'r') as f:
    for line in f:
        m = re.search(r"Ship", line)
        if m:
            print(m.group())
        else:
            print("No match")

No match
No match
No match
No match
Ship
No match
No match
No match
No match
No match
No match
No match
No match
No match
No match
No match


Instead, we can let `findall()` do the iteration. 

In [76]:
# Read the whole file as one string and collect every match in a list.
with open('rime-intro.txt', 'r') as f:
    strs = re.findall(r'Ship', f.read())
strs

['Ship']

## `re.sub()` 

The `re.sub(pat, replacement, str)` function takes three arguments: the regular expression pattern, a replacement string, and the string to search.  The function finds all instances of the pattern in the given string and returns a new string with them replaced.



In [77]:
print(re.sub(r'benefits', 'advantages', 'Show the benefits of doing many examples'))

Show the advantages of doing many examples
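
By default every occurrence is replaced. The sketch below shows the optional `count` argument, which limits the number of replacements, and `re.subn`, which also reports how many were made. The sample text is made up for illustration.

```python
import re

text = 'red fish, red hat, red car'

# Default: replace every occurrence.
all_replaced = re.sub(r'red', 'blue', text)

# count=1: replace only the first (leftmost) occurrence.
first_replaced = re.sub(r'red', 'blue', text, count=1)

# subn returns a (new_string, number_of_replacements) tuple.
new_text, n = re.subn(r'red', 'blue', text)

print(all_replaced)
print(first_replaced)
print(n)
```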


### Example - Substitution

Replacement strings can make use of groups using `\1` and `\2`, to refer to `group(1)` and `group(2)`. 

For example, in the following text search for email addresses and replace the host with gmail.com. 

In [78]:
tempStr = 'testing abc@mtu.edu, other words. punctuation doe@foobar.org blah'
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@gmail.com', tempStr))

testing abc@gmail.com, other words. punctuation doe@gmail.com blah
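
Group references can also reorder the captured text, and the replacement may be a function that receives each match object. A minimal sketch with made-up sample strings:

```python
import re

# Swap "Last, First" into "First Last" by referring to the groups in reverse order.
names = 'Doe, John; Roe, Jane'
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', names)

# A function as the replacement: it is called once per match and returns
# the text to substitute. Here it upper-cases the host part of each email.
emails = 'abc@mtu.edu and doe@foobar.org'
shouted = re.sub(r'@([\w.]+)', lambda m: '@' + m.group(1).upper(), emails)

print(swapped)
print(shouted)
```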


## Exercise 6 - Tweets 

Later on in the course you will have a project getting data from Twitter and doing some analysis.  

Let's think about working on an individual tweet (string). 

For each tweet, replace the mention (`@`) and hashtag (`#`) symbols with just the word itself (hashtags and mentions), without adding any spaces.



In [79]:
tweet1 = """See the #Keweenaw anew through the eyes of the little ones you love! From the 
@michigantech Parade of Nations today, to cider pressing at Central - there's so 
much to experience together!
#PureMichigan"""

tweet2 = """A new three-year $594K NSF project will study geometric data structures 
    and their connections with other areas of theoretical computer science. Yakov 
    Nekrich, CS, is the PI. 
    http://blogs.mtu.edu/computing/2022/08/26/yakov-nekrich-cs-is-pi-of-new-594k-nsf/
    @michigantech @NSF #michigantech #computerscience"""

In [80]:
# Print the tweets with each '#' replaced by 'hashtags' and each '@' by 'mentions', using re.sub
replaced_tweet1 = re.sub(r'#', 'hashtags', tweet1)
replaced_tweet1 = re.sub(r'@', 'mentions', replaced_tweet1)
print(replaced_tweet1)
replaced_tweet2 = re.sub(r'#', 'hashtags', tweet2)
replaced_tweet2 = re.sub(r'@', 'mentions', replaced_tweet2)
print(replaced_tweet2)

See the hashtagsKeweenaw anew through the eyes of the little ones you love! From the 
mentionsmichigantech Parade of Nations today, to cider pressing at Central - there's so 
much to experience together!
hashtagsPureMichigan
A new three-year $594K NSF project will study geometric data structures 
    and their connections with other areas of theoretical computer science. Yakov 
    Nekrich, CS, is the PI. 
    http://blogs.mtu.edu/computing/2022/08/26/yakov-nekrich-cs-is-pi-of-new-594k-nsf/
    mentionsmichigantech mentionsNSF hashtagsmichigantech hashtagscomputerscience


In [81]:
grader.check("q6")

We will see more examples of regular expressions next week with respect to web scraping. 
