# Week 3
## Advanced Regular Expressions

### Advanced Regular Expressions Cheat-Sheet
Check out the following link for more information:

- https://regexcrossword.com/

### Capturing Groups

`( )` captures a specific part of the matched text so that it can be extracted afterwards
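
As a minimal sketch of how captured groups are read back (the sample string and the variable name `match` are illustrative, not from the course material), each pair of parentheses becomes a numbered group on the match object:

```python
import re

# Hypothetical sample input in "Surname, Given name" form
match = re.search(r"^(\w+), (\w+)$", "Lovelace, Ada")

if match is not None:
    print(match.groups())  # tuple holding every captured group
    print(match[0])        # index 0 is the entire match
    print(match[1])        # first group: the text before the comma
    print(match[2])        # second group: the text after the comma
```

Group numbering follows the order of the opening parentheses, which is why the quiz below can swap `result[2]` and `result[1]` to reorder the name.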

> Pop Quiz

Fix the regular expression used in the rearrange_name function so that it can match middle names, middle initials, as well as double surnames.

In [16]:
import re
def rearrange_name(name):
  result = re.search(r"^([\w \.-]*), ([\w \.-]*)$", name)
  if result is None:
    return name
  return "{} {}".format(result[2], result[1])

name=rearrange_name("Kennedy, John F.")
print(name)

John F. Kennedy


### Match on Repetition Qualifiers

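`{ }` specifies how many times the preceding element must repeat: `{n}` exactly n times, `{m,n}` between m and n times, `{m,}` at least m times, `{,n}` at most n times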

In [17]:
print(re.search(r'[a-zA-Z]{5}', 'a ghost'))

print(re.search(r'[a-zA-Z]{5}', 'a scary ghost appeared'))

print(re.findall(r'[a-zA-Z]{5}', 'a scary ghost appeared'))

print(re.findall(r'\b[a-zA-Z]{5}\b', 'a scary ghost appeared'))

print(re.findall(r'\w{5,10}', 'I really like strawberries'))

print(re.findall(r'\w{5,}', 'I really like strawberries'))

print(re.search(r's\w{,20}', 'I really like strawberries'))

<re.Match object; span=(2, 7), match='ghost'>
<re.Match object; span=(2, 7), match='scary'>
['scary', 'ghost', 'appea']
['scary', 'ghost']
['really', 'strawberri']
['really', 'strawberries']
<re.Match object; span=(14, 26), match='strawberries'>
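
The outputs above show that a bounded range such as `{5,10}` can return only part of a longer word. A minimal sketch of combining a range with `\b` anchors so that only whole words are returned; the helper name `words_of_length` and its parameters are hypothetical:

```python
import re

def words_of_length(text, low, high=None):
    """Return whole words with between low and high letters (inclusive).

    high=None means there is no upper bound. Hypothetical helper for illustration.
    """
    upper = "" if high is None else str(high)
    # \b on both sides prevents matching a slice of a longer word
    pattern = r"\b[a-zA-Z]{%d,%s}\b" % (low, upper)
    return re.findall(pattern, text)

print(words_of_length("I really like strawberries", 5, 10))  # ['really']
print(words_of_length("I really like strawberries", 5))      # ['really', 'strawberries']
```

With the anchors in place, `strawberries` (12 letters) is skipped by the 5 to 10 range instead of being truncated.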


> Pop Quiz

The long_words function returns all words that are at least 7 characters. Fill in the regular expression to complete this function.

In [18]:
import re
def long_words(text):
  pattern = r'\w{7,}'
  result = re.findall(pattern, text)
  return result

print(long_words("I like to drink coffee in the morning.")) # ['morning']
print(long_words("I also have a taste for hot chocolate in the afternoon.")) # ['chocolate', 'afternoon']
print(long_words("I never drink tea late at night.")) # []

['morning']
['chocolate', 'afternoon']
[]


### Extracting a PID (Process ID)

In [19]:
def extract_pid(log_line):
    regex = r'\[(\d+)\]'
    result = re.search(regex, log_line)
    if result is None:
        return ''
    return result[1]


print(extract_pid('this is going to be a log line[23451]'))

print(extract_pid('error code returns an empty [string]'))

23451



> Pop Quiz

Add to the regular expression used in the extract_pid function so that it returns the process ID followed by the uppercase message in parentheses.

In [20]:
import re
def extract_pid(log_line):
    regex = r"\[(\d+)\]:\s([A-Z]+)"
    result = re.search(regex, log_line)
    if result is None:
        return None
    return "{} ({})".format(result[1], result[2])

print(extract_pid("July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade")) # 12345 (ERROR)
print(extract_pid("99 elephants in a [cage]")) # None
print(extract_pid("A string that also has numbers [34567] but no uppercase message")) # None
print(extract_pid("July 31 08:08:08 mycomputer new_process[67890]: RUNNING Performing backup")) # 67890 (RUNNING)

12345 (ERROR)
None
None
67890 (RUNNING)
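
Once a pattern has more than one group, numbered indexes become harder to read. One possible variation, sketched here with Python's named-group syntax `(?P<name>...)`; the function name `extract_pid_named` and the group names `pid` and `level` are assumptions for illustration:

```python
import re

# Same pattern as the quiz answer, but each group is given a name
LOG_PATTERN = re.compile(r"\[(?P<pid>\d+)\]:\s(?P<level>[A-Z]+)")

def extract_pid_named(log_line):
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    # Groups are looked up by name instead of by position
    return "{} ({})".format(match.group("pid"), match.group("level"))

print(extract_pid_named("July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"))
print(extract_pid_named("99 elephants in a [cage]"))
```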


### Splitting and Replacing

In [21]:
print(re.split(r'[.?!]', 'One sentence. Another one? And the last one!'))

print(re.split(r'([.?!])', 'One sentence. Another one? And the last one!'))

['One sentence', ' Another one', ' And the last one', '']
['One sentence', '.', ' Another one', '?', ' And the last one', '!', '']
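
Wrapping the delimiter in a capturing group keeps it in the result, which makes it possible to rebuild each sentence with its own punctuation. A minimal sketch of that idea, with the variable names `parts` and `sentences` chosen for illustration:

```python
import re

text = "One sentence. Another one? And the last one!"
parts = re.split(r"([.?!])", text)

# parts alternates: text, delimiter, text, delimiter, ..., trailing ''
# Pair every text chunk with the delimiter that followed it.
sentences = [
    (body + mark).strip()
    for body, mark in zip(parts[0::2], parts[1::2])
]
print(sentences)  # ['One sentence.', 'Another one?', 'And the last one!']
```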


In [22]:
print(re.sub(r'[\w.%+-]+@[\w.-]+', '[REDACTED]', 'Received an email for go_nuts95@my.example.com'))

print(re.sub(r'^([\w .-]*), ([\w .-]*)$', r'\2 \1', "Lovelace, Ada"))

Received an email for [REDACTED]
Ada Lovelace


> Pop Quiz

We want to split a piece of text by either the word "a" or "the", as implemented in the following code. What is the resulting split list?

In [23]:
re.split(r"the|a", "One sentence. Another one? And the last one!")

['One sentence. Ano', 'r one? And ', ' l', 'st one!']
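
The result above shows that `the|a` matches those letters anywhere, including inside "Another" and "last". If the intent is to split only on the whole words, one option is to add `\b` word boundaries around a non-capturing group, as in this sketch:

```python
import re

text = "One sentence. Another one? And the last one!"

# \b requires a word boundary on each side; (?:...) groups without capturing,
# so the delimiters are not included in the result
print(re.split(r"\b(?:the|a)\b", text))
# ['One sentence. Another one? And ', ' last one!']
```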

## Practice Quiz

We're working with a CSV file, which contains employee information. Each record has a name field, followed by a phone number field, and a role field. The phone number field contains U.S. phone numbers, and needs to be modified to the international format, with "+1-" in front of the phone number. Fill in the regular expression, using groups, so that the transform_record function does that.

In [24]:
import re
def transform_record(record):
  new_record = re.sub(r'([\d-]+)', r'+1-\1', record)
  return new_record

print(transform_record("Sabrina Green,802-867-5309,System Administrator")) 
# Sabrina Green,+1-802-867-5309,System Administrator

print(transform_record("Eli Jones,684-3481127,IT specialist")) 
# Eli Jones,+1-684-3481127,IT specialist

print(transform_record("Melody Daniels,846-687-7436,Programmer")) 
# Melody Daniels,+1-846-687-7436,Programmer

print(transform_record("Charlie Rivera,698-746-3357,Web Developer")) 
# Charlie Rivera,+1-698-746-3357,Web Developer

Sabrina Green,+1-802-867-5309,System Administrator
Eli Jones,+1-684-3481127,IT specialist
Melody Daniels,+1-846-687-7436,Programmer
Charlie Rivera,+1-698-746-3357,Web Developer
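
The pattern `[\d-]+` matches any run of digits or dashes, so it would also rewrite digits that appear in the name or role field. A stricter sketch, assuming the phone number is always the field between two commas and starts with three digits and a dash; the function name `transform_record_strict` and the second sample record are made up for illustration:

```python
import re

def transform_record_strict(record):
    # Match only a comma-delimited field that starts with three digits and a
    # dash; the surrounding commas are put back by the replacement string.
    return re.sub(r",(\d{3}-[\d-]+),", r",+1-\1,", record)

print(transform_record_strict("Sabrina Green,802-867-5309,System Administrator"))
# Sabrina Green,+1-802-867-5309,System Administrator

print(transform_record_strict("Agent 99,684-3481127,IT specialist"))
# Agent 99,+1-684-3481127,IT specialist
```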


The multi_vowel_words function returns all words with 3 or more consecutive vowels (a, e, i, o, u). Fill in the regular expression to do that.

In [25]:
import re
def multi_vowel_words(text):
  pattern = r'\w*[aeiou]{3,}\w*'
  result = re.findall(pattern, text)
  return result

print(multi_vowel_words("Life is beautiful")) 
# ['beautiful']

print(multi_vowel_words("Obviously, the queen is courageous and gracious.")) 
# ['Obviously', 'queen', 'courageous', 'gracious']

print(multi_vowel_words("The rambunctious children had to sit quietly and await their delicious dinner.")) 
# ['rambunctious', 'quietly', 'delicious']

print(multi_vowel_words("The order of a data queue is First In First Out (FIFO)")) 
# ['queue']

print(multi_vowel_words("Hello world!")) 
# []

['beautiful']
['Obviously', 'queen', 'courageous', 'gracious']
['rambunctious', 'quietly', 'delicious']
['queue']
[]
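
The character class `[aeiou]` only covers lowercase vowels. A possible variation, sketched under the assumption that uppercase words should count too, passes `re.IGNORECASE` and uses a trailing `\w*` so that words ending in the vowel run also match; the function name and sample sentence are hypothetical:

```python
import re

def multi_vowel_words_any_case(text):
    # \w* on both sides lets the vowel run sit anywhere in the word,
    # including at the very start or the very end
    pattern = r"\w*[aeiou]{3,}\w*"
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(multi_vowel_words_any_case("The QUEUE moved past the plateau"))
# ['QUEUE', 'plateau']
```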


The transform_comments function converts comments in a Python script into those usable by a C compiler. This means looking for text that begins with a hash mark (#) and replacing it with double slashes (//), which is the C single-line comment indicator. For the purpose of this exercise, we'll ignore the possibility of a hash mark embedded inside a Python command, and assume that it's only used to indicate a comment. We also want to treat repeated hash marks (##, ###, etc.) as a single comment indicator, to be replaced with just (//) and not (#//) or (//#). Fill in the parameters of the substitution method to complete this function:

In [26]:
import re
def transform_comments(line_of_code):
  result = re.sub(r'#+', r'//', line_of_code)
  return result

print(transform_comments("### Start of program")) 
# Should be "// Start of program"
print(transform_comments("  number = 0   ## Initialize the variable")) 
# Should be "  number = 0   // Initialize the variable"
print(transform_comments("  number += 1   # Increment the variable")) 
# Should be "  number += 1   // Increment the variable"
print(transform_comments("  return(number)")) 
# Should be "  return(number)"

// Start of program
  number = 0   // Initialize the variable
  number += 1   // Increment the variable
  return(number)


The convert_phone_number function checks for a U.S. phone number format: XXX-XXX-XXXX (3 digits followed by a dash, 3 more digits followed by a dash, and 4 digits), and converts it to a more formal format that looks like this: (XXX) XXX-XXXX. Fill in the regular expression to complete this function.

In [27]:
import re
def convert_phone_number(phone):
  result = re.sub(r'\b(\d{3})-(\d{3})-(\d{4})\b', r'(\1) \2-\3', phone)
  return result

print(convert_phone_number("My number is 212-345-9999.")) # My number is (212) 345-9999.
print(convert_phone_number("Please call 888-555-1234")) # Please call (888) 555-1234
print(convert_phone_number("123-123-12345")) # 123-123-12345
print(convert_phone_number("Phone number of Buckingham Palace is +44 303 123 7300")) # Phone number of Buckingham Palace is +44 303 123 7300

My number is (212) 345-9999.
Please call (888) 555-1234
123-123-12345
Phone number of Buckingham Palace is +44 303 123 7300
