# Regular Expressions 

In this module, you’ll learn what a regular expression is and why you would use one. We’ll cover the basics of regular expressions, with examples of wildcards, repetition qualifiers, escape characters, and more. Next, we’ll explore advanced regular expressions and take a closer look at repetition qualifiers. You’ll tackle new exercises such as capturing groups and extracting PIDs using regexes. Finally, we’ll provide a cheat sheet to serve as your go-to guide for regular expressions.

## Learning Objectives
- Define what a regular expression is and describe why it is useful
- Use basic regular expressions, including simple matching, wildcards, and character classes
- Explain repetition qualifiers
- Use advanced regular expressions

In [38]:
import re

`[]` A set of characters "[a-m]"

In [37]:
txt = "The rain in Spain"

#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)
print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


`\` Signals a special sequence (can also be used to escape special characters)  "\d"

In [39]:
txt = "That will be 59 dollars"

#Find all digit characters:
x = re.findall(r"\d", txt)
print(x)


['5', '9']


`.` - Any character (except newline character)	"he..o"

In [40]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
x = re.findall("he..o", txt)
print(x)

['hello']


`$`  - Ends with  	"planet$"

In [41]:
txt = "hello planet"

#Check if the string ends with 'planet':
x = re.findall("planet$", txt)
if x:
  print("Yes, the string ends with 'planet'")
else:
  print("No match")


Yes, the string ends with 'planet'


`*` - Zero or more occurrences	"he.*o"

In [56]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or more  (any) characters, and an "o":
x = re.findall("he.*o", txt)
print(x)

['hello']


`+` - One or more occurrences	"he.+o"

In [52]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more  (any) characters, and an "o":
x = re.findall("he.+o", txt)
print(x)

['hello']


`?` - Zero or one occurrence	"he.?o"

In [61]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1  (any) character, and an "o":
x = re.findall("he.?o", txt)
print(x)

[]


#This time we got no match, because there were not zero, not one, but two characters between "he" and the "o"

`{}` - Exactly the specified number of occurrences	 `"he.{2}o"`

In [64]:

txt = "hello planet"

#Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":
x = re.findall("he.{2}o", txt)
print(x)

['hello']


`|` Either or	`"falls|stays"`

In [65]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", txt)
print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['falls']
Yes, there is at least one match!


In [66]:
# +	One or more occurrences (the () metacharacter for capturing groups is used in later sections)

txt = "The rain in Spain"

#Check if the string contains "ai" followed by 1 or more "x" characters:
x = re.findall("aix+", txt)
print(x)

[]
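
Parentheses do two jobs: they group part of a pattern and they capture the text that part matched. As a small illustrative sketch (the patterns and the variable name `match` are chosen here for illustration and are not part of the original notebook), the following captures the word that comes right before " in":

```python
import re

txt = "The rain in Spain"

# (\w+) captures the run of word characters that is followed by " in"
match = re.search(r"(\w+) in", txt)
print(match.group(0))  # the whole match
print(match.group(1))  # only the captured group

# When the pattern contains one capturing group,
# findall returns the captured text instead of the whole match
print(re.findall(r"(\w+)ain", txt))
```

Group 0 is always the full match; numbered groups start at 1, in the order their opening parentheses appear.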


## Basics of Regular Expressions

In [74]:
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"

index = log.index('[')
print(index)
print(log[index+1:index+6])

39
12345


In [77]:
# using the search method to find the pattern in the string
    # \[ and \] - match literal square brackets (escaped, because [] normally defines a character class)
    # \d - match any digit
    # + - match one or more of the preceding element
    # () - capture and group

regex = r"\[(\d+)\]"

result = re.search(regex, log)
print(result[1])

12345
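
The slicing approach works only because this PID happens to have five digits. A quick sketch with a made-up log line (the line and the name `short_log` are hypothetical, for illustration only) shows why the regex is the more robust choice:

```python
import re

# Hypothetical log line with a 3-digit PID
short_log = "July 31 07:52:10 mycomputer other_process[987]: INFO Started"

# Fixed-width slicing assumes five digits and picks up extra characters
index = short_log.index("[")
sliced = short_log[index + 1:index + 6]
print(sliced)

# The regex adapts to however many digits sit between the brackets
found = re.search(r"\[(\d+)\]", short_log)[1]
print(found)
```

The slice returns `987]:`, while the regex returns just `987`.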


### Basic Matching with Grep

In [81]:
# using grep to find the pattern in the string - grep is a command line tool


#!grep thon /usr/share/dict/words

# grep with i option to ignore case
!grep -i python /usr/share/dict/words

python
pythoness
pythonic
pythonical
pythonid
Pythonidae
pythoniform
Pythoninae
pythonine
pythonism
Pythonissa
pythonist
pythonize
pythonoid
pythonomorph
Pythonomorpha
pythonomorphic
pythonomorphous


### Reserved Characters

In [82]:
!grep l.rts /usr/share/dict/words

In [84]:
 # grep with ^ to match the beginning of the line
# !grep ^fruit /usr/share/dict/words

In [86]:
# the $ matches the end of the line
# !grep cat$ /usr/share/dict/words
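
The three grep patterns above behave the same way in Python's `re` module. The sketch below assumes a small hand-picked word list in place of `/usr/share/dict/words` (which may not exist on every system), and `grep` here is a hypothetical helper, not a real library function:

```python
import re

# Hypothetical word list standing in for /usr/share/dict/words
words = ["alerts", "blurts", "fruit", "fruitful", "grapefruit",
         "cat", "bobcat", "catalog"]

def grep(pattern, lines):
    """Return the lines that contain a match for pattern, like a bare grep."""
    return [line for line in lines if re.search(pattern, line)]

print(grep(r"l.rts", words))   # . matches any single character
print(grep(r"^fruit", words))  # ^ anchors the match at the start of the line
print(grep(r"cat$", words))    # $ anchors the match at the end of the line
```

Note how "grapefruit" and "catalog" are filtered out by the anchors even though they contain "fruit" and "cat".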

### Simple Matching in Python

In [87]:
import re

result = re.search(r"aza", "plaza")
print(result)

<re.Match object; span=(2, 5), match='aza'>


In [88]:
# the pattern can match anywhere in the string, here in the middle of "bazaar"
result = re.search(r"aza", "bazaar")
print(result)

<re.Match object; span=(1, 4), match='aza'>


In [89]:
# string that does not contain the pattern

result = re.search(r"aza", "maze")
print(result)

None


In [91]:
# ^ anchors the match at the start of the string, so this matches strings that begin with "x"
print(re.search(r"^x", "xenon"))

<re.Match object; span=(0, 1), match='x'>


In [93]:
# string contains "p", followed by any single character, followed by "ng"
print(re.search(r"p.ng", "penguin"))

<re.Match object; span=(0, 4), match='peng'>


In [95]:
print(re.search(r"p.ng", "clapping"))

<re.Match object; span=(4, 8), match='ping'>


In [94]:
print(re.search(r"p.ng", "sponge"))

<re.Match object; span=(1, 5), match='pong'>


In [96]:
print(re.search(r"p.ng", "Pangaea", re.IGNORECASE))

<re.Match object; span=(0, 4), match='Pang'>


### Wildcards and Character Classes

In [97]:
#  inside the square brackets, we can have a range of characters
    ## [a-z] - match any lowercase letter
    ## [A-Z] - match any uppercase letter

print(re.search(r"[Pp]ython", "Python"))

<re.Match object; span=(0, 6), match='Python'>


In [98]:
print(re.search(r"[a-z]way", "The end of the highway"))

<re.Match object; span=(18, 22), match='hway'>


In [99]:
print(re.search(r"[a-z]way", "What a way to go"))

None


In [100]:
print(re.search(r"cloud[a-zA-Z0-9]", "cloudy"))

<re.Match object; span=(0, 6), match='cloudy'>


In [101]:
print(re.search(r"[^a-zA-Z]", "This is a sentence with spaces."))

<re.Match object; span=(4, 5), match=' '>


In [102]:
print(re.search(r"[^a-zA-Z ]", "This is a sentence with spaces."))

<re.Match object; span=(30, 31), match='.'>


In [103]:
print(re.search(r"cat|dog", "I like cats."))

<re.Match object; span=(7, 10), match='cat'>


In [105]:
print(re.search(r"cat|dog", "I like dogs."))

<re.Match object; span=(7, 10), match='dog'>


In [106]:
print(re.search(r"cat|dog", "I like both dogs and cats."))

<re.Match object; span=(12, 15), match='dog'>


In [107]:
# use findall to find all the matches in the string

print(re.findall(r"cat|dog", "I like both dogs and cats."))

['dog', 'cat']


### Repetition Qualifiers

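Repetition qualifiers specify how many times the preceding element may occur. The asterisk (`*`) matches zero or more occurrences, so `o*` matches the single "o" in "Python" and the whole run of "o"s in "gooooooal", and it also matches where there is no "o" at all. The plus sign (`+`) requires at least one occurrence, and the question mark (`?`) allows zero or one. Let's try them out.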
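
To see the difference between "zero or more" and "one or more" on the two words just mentioned, here is a minimal sketch (the variable name `word` and the patterns are illustrative choices):

```python
import re

word = "gooooooal"

# o+ needs at least one "o", so it finds the actual letters
print(re.search(r"o+", "Python"))
print(re.findall(r"o+", word))

# o* also accepts zero "o"s, so search succeeds immediately with an
# empty match at position 0, before it ever reaches the "o"
print(re.search(r"o*", "Python"))

# Placing the repetition between other characters makes * more useful
print(re.search(r"go*al", word))
print(re.search(r"go*al", "gal"))
```

The empty match from `o*` is a common surprise: a pattern that can match nothing will always succeed at the first position it is tried.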

In [111]:
# .* matches zero or more of any character between "Py" and "n"
print(re.search(r"Py.*n", "Pygmalion"))

<re.Match object; span=(0, 9), match='Pygmalion'>


In [112]:
print(re.search(r"Py.*n", "Python Programming"))

<re.Match object; span=(0, 17), match='Python Programmin'>


In [114]:
# just match letters between "Py" and "n"
print(re.search(r"Py[a-z]*n", "Python Programming"))

<re.Match object; span=(0, 6), match='Python'>


In [115]:
print(re.search(r"Py[a-z]*n", "Pyn"))

<re.Match object; span=(0, 3), match='Pyn'>


In [116]:
print(re.search(r"o+l+", "goldfish"))

<re.Match object; span=(1, 3), match='ol'>


In [117]:
print(re.search(r"o+l+", "woolly"))

<re.Match object; span=(1, 5), match='ooll'>


In [118]:
print(re.search(r"o+l+", "boil"))

None


In [119]:
print(re.search(r"p?each", "To each their own"))

<re.Match object; span=(3, 7), match='each'>


In [120]:
print(re.search(r"p?each", "I like peaches"))

<re.Match object; span=(7, 12), match='peach'>


### Escaping Characters

In [121]:
print(re.search(r".com", "welcome"))

<re.Match object; span=(2, 6), match='lcom'>


In [124]:
# escape the dot with a backslash - \. to match the dot .com

print(re.search(r"\.com", "welcome"))

None


In [125]:
print(re.search(r"\.com", "mydomain.com"))

<re.Match object; span=(8, 12), match='.com'>


In [126]:
# \n - new line
# \t - tab character 
# \b - word boundary
# \d - digit
# \w - word character : any letter, digit or underscore
# \s - whitespace character : space, tab, newline
# \W - any character that is not a word character
# \D - any character that is not a digit
# \S - any character that is not a whitespace character


print(re.search(r"\w*", "This is an example"))

<re.Match object; span=(0, 4), match='This'>


In [128]:
print(re.findall(r"\w*", "This is an example"))

['This', '', 'is', '', 'an', '', 'example', '']
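
The empty strings appear because `\w*` is allowed to match zero characters at every position where no word character is found. A short sketch of the usual remedy, switching to `+` (the variable name `text` is just for illustration):

```python
import re

text = "This is an example"

# \w* can match zero characters, which produces the empty strings
print(re.findall(r"\w*", text))

# \w+ requires at least one word character, so only the words remain
words_only = re.findall(r"\w+", text)
print(words_only)
```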


In [127]:
print(re.search(r"\w*", "And_this_is_another"))

<re.Match object; span=(0, 19), match='And_this_is_another'>


### Regex in Action

In [129]:
print(re.search(r"A.*a", "Argentina"))

<re.Match object; span=(0, 9), match='Argentina'>


In [130]:
print(re.search(r"A.*a", "Azerbaijan"))

<re.Match object; span=(0, 9), match='Azerbaija'>


In [131]:
# more strict pattern match by adding beginning and end of line anchors
    ## ^ - beginning of the line
    ## $ - end of the line
print(re.search(r"^A.*a$", "Azerbaijan"))

None


In [132]:
print(re.search(r"^A.*a$", "Australia"))

<re.Match object; span=(0, 9), match='Australia'>


In [133]:
pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"

print(re.search(pattern, "_this_is_a_valid_variable_name"))

<re.Match object; span=(0, 30), match='_this_is_a_valid_variable_name'>


In [134]:
print(re.search(pattern, "this isn't a valid variable name"))

None


In [135]:
print(re.search(pattern, "my_variable1"))

<re.Match object; span=(0, 12), match='my_variable1'>


In [136]:
print(re.search(pattern, "2my_variable1"))

None
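
As an alternative to writing `^` and `$` by hand, `re.fullmatch` succeeds only when the whole string matches. The sketch below reuses the same character classes; the name `identifier` and the test strings are illustrative assumptions:

```python
import re

# Same character classes as the pattern above, without the ^ and $ anchors
identifier = r"[a-zA-Z_][a-zA-Z0-9_]*"

for candidate in ["my_variable1", "2my_variable1", "has space"]:
    # fullmatch succeeds only if the entire string matches the pattern
    print(candidate, bool(re.fullmatch(identifier, candidate)))

# $ also matches just before a trailing newline; fullmatch does not
print(bool(re.search(r"^" + identifier + r"$", "name\n")))
print(bool(re.fullmatch(identifier, "name\n")))
```

The last two lines print `True` and `False`, which matters when validating input that may carry a trailing newline.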


### Quiz

Question 1
```python
import re
def check_web_address(text):
  pattern = ___
  result = re.search(pattern, text)
  return result != None

print(check_web_address("gmail.com")) # True
print(check_web_address("www@google")) # False
print(check_web_address("www.Coursera.org")) # True
print(check_web_address("web-address.com/homepage")) # False
print(check_web_address("My_Favorite-Blog.US")) # True
```

Solution Question 1
```python
import re
def check_web_address(text):
  pattern = r"^[a-zA-Z0-9_\-.]*$"
  result = re.search(pattern, text)
  return result != None
```



Question 2
```python
import re
def check_time(text):
  pattern = ___
  result = re.search(pattern, text)
  return result != None

print(check_time("12:45pm")) # True
print(check_time("9:59 AM")) # True
print(check_time("6:60am")) # False
print(check_time("five o'clock")) # False
```

Solution Question 2
```python
import re
def check_time(text):
  # hour 10-12 or 1-9, minutes 00-59, optional space, AM/PM in any case
  pattern = r"^(1[0-2]|[1-9]):[0-5][0-9] ?[AaPp][Mm]$"
  result = re.search(pattern, text)
  return result != None
```

Question 3
```python
import re
def contains_acronym(text):
  pattern = ___ 
  result = re.search(pattern, text)
  return result != None

print(contains_acronym("Instant messaging (IM) is a set of communication technologies used for text-based communication")) # True
print(contains_acronym("American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication")) # True
print(contains_acronym("Please do NOT enter without permission!")) # False
print(contains_acronym("PostScript is a fourth-generation programming language (4GL)")) # True
print(contains_acronym("Have fun using a self-contained underwater breathing apparatus (Scuba)!")) # True
```

Solution Question 3
```python
import re
def contains_acronym(text):
  pattern = r"\([A-Za-z0-9]+\)"
  result = re.search(pattern, text)
  return result != None
```

Question 4
```python
import re
def check_zip_code (text):
  result = re.search(r"___", text)
  return result != None

print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False
```

Solution Question 4
```python
import re
def check_zip_code (text):
  # whitespace, five digits, an optional "-dddd" suffix, then a word boundary
  result = re.search(r"\s\d{5}(?:-\d{4})?\b", text)
  return result != None
```

In [137]:


def check_web_address(text):
  pattern = r"^[a-zA-Z0-9_\-.]*$"
  result = re.search(pattern, text)
  return result != None


print(check_web_address("gmail.com")) # True
print(check_web_address("www@google")) # False
print(check_web_address("www.Coursera.org")) # True
print(check_web_address("web-address.com/homepage")) # False
print(check_web_address("My_Favorite-Blog.US")) # True

True
False
True
False
True
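
The pattern above passes the five test cases, but it also accepts strings with no dot at all, and even the empty string, because `*` allows zero characters. A minimal sketch of a stricter check follows; the function name `check_web_address_strict` and the two extra test strings are illustrative assumptions, not part of the quiz:

```python
import re

def check_web_address_strict(text):
    # Hypothetical stricter variant: at least one allowed character,
    # then a literal dot, then a letters-only ending such as "com" or "US"
    pattern = r"^[\w.-]+\.[a-zA-Z]+$"
    return re.search(pattern, text) is not None

for address in ["gmail.com", "www@google", "My_Favorite-Blog.US", "", "no_dot_here"]:
    print(repr(address), check_web_address_strict(address))
```

This is still far from a full URL validator; it only tightens the quiz pattern so that a top-level domain is required.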


In [138]:
import re
def check_time(text):
  # hour 10-12 or 1-9, minutes 00-59, optional space, AM/PM in any case
  pattern = r"^(1[0-2]|[1-9]):[0-5][0-9] ?[AaPp][Mm]$"
  result = re.search(pattern, text)
  return result != None

print(check_time("12:45pm")) # True
print(check_time("9:59 AM")) # True
print(check_time("6:60am")) # False
print(check_time("five o'clock")) # False

True
True
False
False


In [139]:
import re
def contains_acronym(text):
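  # First attempt: without a repetition qualifier, the class matches exactly one character, so "(IM)" fails
  pattern = r"\([A-Za-z0-9]\)"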
  result = re.search(pattern, text)
  return result != None

print(contains_acronym("Instant messaging (IM) is a set of communication technologies used for text-based communication")) # True
print(contains_acronym("American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication")) # True
print(contains_acronym("Please do NOT enter without permission!")) # False
print(contains_acronym("PostScript is a fourth-generation programming language (4GL)")) # True
print(contains_acronym("Have fun using a self-contained underwater breathing apparatus (Scuba)!")) # True

False
False
False
False
False


In [141]:
import re
def contains_acronym(text):
  # Second attempt: + allows one or more characters between the parentheses
  pattern = r"\([A-Za-z0-9]+\)"
  result = re.search(pattern, text)
  return result != None

print(contains_acronym("Instant messaging (IM) is a set of communication technologies used for text-based communication")) # True
print(contains_acronym("American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication")) # True
print(contains_acronym("Please do NOT enter without permission!")) # False
print(contains_acronym("PostScript is a fourth-generation programming language (4GL)")) # True
print(contains_acronym("Have fun using a self-contained underwater breathing apparatus (Scuba)!")) # True

True
True
False
True
True


In [140]:
import re
def check_zip_code (text):
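  # First attempt: nothing is required before the digits, so "90210" at the start of the string also matches
  result = re.search(r"\d{5}[-\s]?(?:\d{4})?", text)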
  return result != None

print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False

True
True
True
False


In [142]:
def check_zip_code (text):
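  # Second attempt: whitespace is required on both sides, which rejects "85258-0001." because a period follows the digits
  result = re.search(r"\s\d{5}[-\s]?(?:\d{4})?\s", text)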
  return result != None

print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False

True
False
False
False
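
Neither attempt above passes all four cases: the first accepts "90210" at the start of the string, and the second rejects a ZIP code that is followed by a period. One possible fix, sketched under the assumption that a ZIP code is always preceded by whitespace, replaces the trailing `\s` with a word boundary:

```python
import re

def check_zip_code(text):
    # Sketch: whitespace, five digits, an optional "-dddd" suffix,
    # then \b so that punctuation right after the code is allowed
    result = re.search(r"\s\d{5}(?:-\d{4})?\b", text)
    return result is not None

print(check_zip_code("The zip codes for New York are 10001 thru 11104."))
print(check_zip_code("90210 is a TV show"))
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001."))
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9."))
```

This prints `True`, `False`, `True`, `False`, matching the expected results in the question.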


In [4]:
import re

result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result)

<re.Match object; span=(0, 2), match='AV'>
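
`re.match` succeeded here only because the string starts with "AV". A brief sketch of how it differs from `re.search` when the pattern appears later in the string (the shortened test string is an illustrative variation on the one above):

```python
import re

text = "Analytics Vidhya AV"

# re.match only tries the pattern at the very start of the string
at_start = re.match(r"AV", text)
print(at_start)

# re.search scans forward until it finds the first match
anywhere = re.search(r"AV", text)
print(anywhere)
```

`re.match` returns `None` here, while `re.search` finds "AV" at the end of the string.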


In [6]:
result = re.search(r"^(\w*), (\w*)$", "Lovelace, Ada")
print(result)

<re.Match object; span=(0, 13), match='Lovelace, Ada'>


In [7]:
print(result.groups())

('Lovelace', 'Ada')


In [8]:
print(result[0])

Lovelace, Ada


In [10]:
print(result[1])
print(result[2])

Lovelace
Ada


In [11]:
"{}, {}".format(result[2], result[1])

'Ada, Lovelace'

In [16]:
def rearrange_name(name):
    result = re.search(r"^([\w .-]*), ([\w .-]*)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])

rearrange_name("Lovelace, Ada")
rearrange_name("Ritchie, Dennis")

'Dennis Ritchie'

In [15]:
rearrange_name("Hopper, Grace M.")

'Grace M. Hopper'

### Repetition Quantifiers

In [17]:
print(re.search(r"Py.*n", "Pygmalion"))

<re.Match object; span=(0, 9), match='Pygmalion'>


In [18]:
print(re.search(r"[a-zA-Z]{5}", "a ghost"))

<re.Match object; span=(2, 7), match='ghost'>


In [19]:
print(re.search(r"[a-zA-Z]{5}", "a scary ghost appeared"))

<re.Match object; span=(2, 7), match='scary'>


In [20]:
# findall returns every non-overlapping run of exactly 5 letters, so "appeared" is cut down to "appea"
print(re.findall(r"[a-zA-Z]{5}", "a scary ghost appeared"))

['scary', 'ghost', 'appea']


In [21]:
# \b is the metacharacter for word boundary
# Find the 5 letter words in the sentence using \b
print(re.findall(r"\b[a-zA-Z]{5}\b", "a scary ghost appeared"))

['scary', 'ghost']


In [22]:
# {5,10} matches 5 to 10 word characters; "strawberries" has 12, so only its last 10 end at the word boundary
print(re.findall(r"\w{5,10}\b", "I really like strawberries"))  

['really', 'rawberries']


In [23]:
print(re.findall(r"\w{5,}\b", "I really like strawberries"))  

['really', 'strawberries']


In [24]:
# {,20} means zero up to 20 repetitions: "s" followed by at most 20 word characters
print(re.search(r"s\w{,20}", "I really like strawberries"))  

<re.Match object; span=(14, 26), match='strawberries'>


### Extracting PID Using Regex in Python

In [25]:
import re

log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
print(result[1])

12345


In [27]:
result = re.search(regex, "A completely different string that also has numbers [34567]") 
print(result[1])
   

34567


In [30]:
result = re.search(regex, "99 elephants in a [cage]")
# result is None because "[cage]" contains no digits, so result[1] would raise a TypeError
#print(result[1])

In [32]:
def extract_pid(log_line):
    regex = r"\[(\d+)\]"
    result = re.search(regex, log_line)
    if result is None:
        return ""
    return result[1]

print(extract_pid(log))


12345



In [None]:
print(extract_pid("99 elephants in a [cage]"))
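
The cell above was not executed, so here is a self-contained sketch of what it should produce. It repeats the `extract_pid` definition from the previous cell so that it can run on its own:

```python
import re

def extract_pid(log_line):
    # Same logic as the cell above: return "" when no [digits] group is found
    result = re.search(r"\[(\d+)\]", log_line)
    if result is None:
        return ""
    return result[1]

no_pid = extract_pid("99 elephants in a [cage]")
print(repr(no_pid))
```

Returning an empty string instead of indexing into `None` avoids the `TypeError` that the commented-out `print(result[1])` would have raised earlier.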

### Splitting and Replacement Using Regex in Python

In [33]:
# split the string on the regex pattern and return a list
    ## the pattern is a character class that matches a single period, question mark, or exclamation mark
re.split(r"[.?!]", "One sentence. Another one? And the last one!")

['One sentence', ' Another one', ' And the last one', '']

In [34]:
re.split(r"([.?!])", "One sentence. Another one? And the last one!")

['One sentence', '.', ' Another one', '?', ' And the last one', '!', '']
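
Both results keep the leading spaces and end with an empty string, because the text finishes with a separator. A small sketch of one way to tidy the pieces (the names `text` and `sentences` are illustrative):

```python
import re

text = "One sentence. Another one? And the last one!"

# Split on sentence-ending punctuation, then drop surrounding
# whitespace and the empty string left after the final "!"
sentences = [part.strip() for part in re.split(r"[.?!]", text) if part.strip()]
print(sentences)
```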

In [35]:
# Replace anything that looks like an email address with [REDACTED]
re.sub(r"[\w.%+-]+@[\w.-]+", "[REDACTED]", "Received an email for john.doe@example.com")

'Received an email for [REDACTED]'

In [36]:
# Creating new strings from our captured groups: \2 and \1 swap the two names
re.sub(r"^([\w .-]*), ([\w .-]*)$", r"\2 \1", "Lovelace, Ada")

'Ada Lovelace'

Question 1
We're working with a CSV file, which contains employee information. Each record has a name field, followed by a phone number field, and a role field. The phone number field contains U.S. phone numbers, and needs to be modified to the international format, with "+1-" in front of the phone number. Fill in the regular expression, using groups, to use the transform_record function to do that.
    
```python
import re
def transform_record(record):
  new_record = re.sub(___)
  return new_record

print(transform_record("Sabrina Green,802-867-5309,System Administrator")) 
# Sabrina Green,+1-802-867-5309,System Administrator

print(transform_record("Eli Jones,684-3481127,IT specialist")) 
# Eli Jones,+1-684-3481127,IT specialist

print(transform_record("Melody Daniels,846-687-7436,Programmer")) 
# Melody Daniels,+1-846-687-7436,Programmer

print(transform_record("Charlie Rivera,698-746-3357,Web Developer")) 
# Charlie Rivera,+1-698-746-3357,Web Developer
```

Solution Question 1
```python
import re
def transform_record(record):
  # the second dash is optional so that "684-3481127" is also matched
  new_record = re.sub(r"(\d{3}-\d{3}-?\d{4})", r"+1-\1", record)
  return new_record
```

Question 2
The multi_vowel_words function returns all words with 3 or more consecutive vowels (a, e, i, o, u). Fill in the regular expression to do that.
```python
import re
def multi_vowel_words(text):
  pattern = ___
  result = re.findall(pattern, text)
  return result

print(multi_vowel_words("Life is beautiful")) 
# ['beautiful']

print(multi_vowel_words("Obviously, the queen is courageous and gracious.")) 
# ['Obviously', 'queen', 'courageous', 'gracious']

print(multi_vowel_words("The rambunctious children had to sit quietly and await their delicious dinner.")) 
# ['rambunctious', 'quietly', 'delicious']

print(multi_vowel_words("The order of a data queue is First In First Out (FIFO)")) 
# ['queue']

print(multi_vowel_words("Hello world!")) 
# []
```

Solution Question 2
```python
import re
def multi_vowel_words(text):
  pattern = r"\w*[aeiou]{3,}\w*"
  result = re.findall(pattern, text)
  return result
```


Question 4
The transform_comments function converts comments in a Python script into those usable by a C compiler. This means looking for text that begins with a hash mark (#) and replacing it with double slashes (//), which is the C single-line comment indicator. For the purpose of this exercise, we'll ignore the possibility of a hash mark embedded inside of a Python command, and assume that it's only used to indicate a comment. We also want to treat repetitive hash marks (##), (###), etc., as a single comment indicator, to be replaced with just (//) and not (#//) or (//#). Fill in the parameters of the substitution method to complete this function: 

```python
import re
def transform_comments(line_of_code):
  result = re.sub(___)
  return result

print(transform_comments("### Start of program")) 
# Should be "// Start of program"
print(transform_comments("  number = 0   ## Initialize the variable")) 
# Should be "  number = 0   // Initialize the variable"
print(transform_comments("  number += 1   # Increment the variable")) 
# Should be "  number += 1   // Increment the variable"
print(transform_comments("  return(number)")) 
# Should be "  return(number)"
```

Solution Question 4
```python
import re
def transform_comments(line_of_code):
  result = re.sub(r"#+", "//", line_of_code)
  return result
```


Question 5
The convert_phone_number function checks for a U.S. phone number format: XXX-XXX-XXXX (3 digits followed by a dash, 3 more digits followed by a dash, and 4 digits), and converts it to a more formal format that looks like this: (XXX) XXX-XXXX. Fill in the regular expression to complete this function.
```python
import re
def convert_phone_number(phone):
  result = re.sub(___)
  return result

print(convert_phone_number("My number is 212-345-9999.")) # My number is (212) 345-9999.
print(convert_phone_number("Please call 888-555-1234")) # Please call (888) 555-1234
print(convert_phone_number("123-123-12345")) # 123-123-12345
print(convert_phone_number("Phone number of Buckingham Palace is +44 303 123 7300")) # Phone number of Buckingham Palace is +44 303 123 7300
```

Solution Question 5
```python
import re
def convert_phone_number(phone):
  result = re.sub(r"\b(\d{3})-(\d{3})-(\d{4})\b", r"(\1) \2-\3", phone)
  return result
```

