# Week 3
## Regular Expressions
Check out the following links for more information:

- https://docs.python.org/3/howto/regex.html

- https://docs.python.org/3/library/re.html

- https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy

Shout out to [regex101.com](https://regex101.com/), which will explain each stage of a regex. 

### Predefined sets of characters
- `\b` word boundary, marks the start of a word (also matches at the start of the string)
- `\w+` one or more word characters: letters, digits, or underscores (`[a-zA-Z0-9_]`)
- `\s+` one or more whitespace characters
- `\w+` one or more word characters again (the next word)
- `\b` word boundary, marks the end of a word (also matches at the end of the string)
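
Read top to bottom, the five pieces above appear to describe one pattern, `\b\w+\s+\w+\b`. A minimal sketch of how it behaves when joined together (the sample strings and the name `two_words` are made up for illustration):

```python
import re

# The five pieces from the list, joined into a single pattern.
two_words = r'\b\w+\s+\w+\b'

match = re.search(two_words, 'hello   world!')
print(match.group())  # two words plus the whitespace between them

# A single word has no whitespace followed by a second word, so no match.
print(re.search(two_words, 'hello'))
```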

### Cheat sheet for brackets
- `^[A-Z]+` : One or more capital letters at the start of the string
- `[a-z]+` : One or more small letters after the capital letters
- `[\s]+` : One or more whitespaces after the small letters
- `[a-z]+` : One or more small letters after the spaces
- `[\.!\?]+$` : One or more of the punctuations (., !, or ?) after the second run of small letters, and then the string ends.
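
The bracket pieces above can likewise be chained into one pattern. This is only a sketch, assuming the pieces are meant to be used in the order listed; the sample strings are invented:

```python
import re

# Pieces from the cheat sheet, concatenated in order.
pattern = r'^[A-Z]+[a-z]+[\s]+[a-z]+[\.!\?]+$'

print(re.search(pattern, 'Hello world!'))      # capital, lowercase, space, lowercase, punctuation
print(re.search(pattern, 'hello world!'))      # no capital at the start
print(re.search(pattern, 'Hello big world!'))  # three words, pattern allows only two
```

Only the first call should return a match object; the other two return `None`.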

### Example

In [71]:
import re

In [72]:
result = re.search(r'aza', 'plaza')
print(result)

<re.Match object; span=(2, 5), match='aza'>


In [73]:
result = re.search(r'aza', 'bazaar')
print(result)

<re.Match object; span=(1, 4), match='aza'>


In [74]:
result = re.search(r'bax', 'bazaar')
print(result)

None
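
Printing the result shows the match object's repr. To use the match in code, the object's `group()`, `span()`, `start()` and `end()` methods give the same information piece by piece. A small sketch:

```python
import re

result = re.search(r'aza', 'bazaar')

# re.search returns None when nothing matches, so check before using it.
if result is not None:
    print(result.group())                # the matched text
    print(result.span())                 # (start, end) indexes of the match
    print(result.start(), result.end())  # the same indexes, separately
```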


#### Special Characters

> `^` (circumflex, or caret) anchors the match to the start of the string

In [75]:
result = re.search(r'^x', 'xenon')
print(result)

<re.Match object; span=(0, 1), match='x'>


> `$` (dollar sign) anchors the match to the end of the string

In [76]:
result = re.search(r'cats$', 'anyone have a cats')
print(result)

<re.Match object; span=(14, 18), match='cats'>


> `.` matches any single character except a newline

In [77]:
result = re.search(r'p.ng', 'penguin')
print(result)

result = re.search(r'p.ng', 'ping')
print(result)

result = re.search(r'p.ng', 'sponge')
print(result)

<re.Match object; span=(0, 4), match='peng'>
<re.Match object; span=(0, 4), match='ping'>
<re.Match object; span=(1, 5), match='pong'>


> `re.IGNORECASE` makes the match case-insensitive

In [78]:
result = re.search(r'p.ng', 'PEnGae', re.IGNORECASE)
print(result)

<re.Match object; span=(0, 4), match='PEnG'>


> Pop Quiz

Fill in the code to check if the text passed contains the vowels a, e and i, with exactly one occurrence of any other character in between.

In [79]:
import re
def check_aei (text):
  result = re.search(r'a.e.i', text)
  return result is not None

print(check_aei("academia")) # True
print(check_aei("aerial")) # False
print(check_aei("paramedic")) # True

True
False
True


#### Wildcards and Character Classes

In [80]:
result = re.search(r'[Pp]ython', 'Python')
print(result)

result = re.search(r'[Pp]ython', 'python')
print(result)

<re.Match object; span=(0, 6), match='Python'>
<re.Match object; span=(0, 6), match='python'>


> `[ ]` matches any single character listed inside the brackets (ranges such as `a-z` are allowed)

In [81]:
result = re.search(r'[a-z]way', 'The end of the highway')
print(result)

result = re.search(r'[a-z]way', 'What a way to go')
print(result)

result = re.search(r'cloud[a-zA-Z0-9]', 'cloudy')
print(result)

result = re.search(r'cloud[a-zA-Z0-9]', 'cloud0')
print(result)

result = re.search(r'cloud[a-zA-Z0-9]', 'cloudD')
print(result)

<re.Match object; span=(18, 22), match='hway'>
None
<re.Match object; span=(0, 6), match='cloudy'>
<re.Match object; span=(0, 6), match='cloud0'>
<re.Match object; span=(0, 6), match='cloudD'>


> `|` matches either the expression on its left or the one on its right

In [82]:
result = re.search(r'cats|dogs', 'I like cats')
print(result)

result = re.search(r'cats|dogs', 'I like dogs')
print(result)

result = re.search(r'cats|dogs', 'I like cats both dogs')
print(result)

result = re.findall(r'cats|dogs', 'I like cats both dogs')
print(result)

<re.Match object; span=(7, 11), match='cats'>
<re.Match object; span=(7, 11), match='dogs'>
<re.Match object; span=(7, 11), match='cats'>
['cats', 'dogs']
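
`re.search` stops at the first match, and `re.findall` returns only the matched strings. When the positions of every match are needed as well, `re.finditer` yields one match object per hit. A short sketch using the same sentence (the name `hits` is just for illustration):

```python
import re

text = 'I like cats both dogs'

# Collect the matched text and its span for every occurrence.
hits = [(m.group(), m.span()) for m in re.finditer(r'cats|dogs', text)]
print(hits)
```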


> Pop Quiz

Fill in the code to check if the text passed contains punctuation symbols: commas, periods, colons, semicolons, question marks, and exclamation points.

In [83]:
import re
def check_punctuation (text):
  result = re.search(r"[,.:;?!]", text) # match any one of , . : ; ? !
  return result is not None

print(check_punctuation("This is a sentence that ends with a period.")) # True
print(check_punctuation("This is a sentence fragment without a period")) # False
print(check_punctuation("Aren't regular expressions awesome?")) # True
print(check_punctuation("Wow! We're really picking up some steam now!")) # True
print(check_punctuation("End of the line")) # False

True
False
True
True
False


### Repetition Qualifiers

> `*` matches zero or more repetitions of the preceding character or class (greedy: it takes as much as it can)

In [84]:
print(re.search(r'Py.*n', 'Pygmaliion'))

print(re.search(r'Py.*m', 'Python Programming'))

print(re.search(r'Py[a-z]*n', 'Python Programming'))

<re.Match object; span=(0, 10), match='Pygmaliion'>
<re.Match object; span=(0, 15), match='Python Programm'>
<re.Match object; span=(0, 6), match='Python'>
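
The second result above shows that `*` is greedy: it runs to the last possible `m`. Appending `?` to a quantifier makes it non-greedy, so it stops at the first opportunity instead. A minimal comparison (the `Py.*n` pattern is chosen here for illustration):

```python
import re

text = 'Python Programming'

greedy = re.search(r'Py.*n', text)   # .* grabs as much as possible
lazy = re.search(r'Py.*?n', text)    # .*? grabs as little as possible

print(greedy.group())
print(lazy.group())
```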


> Pop Quiz

The repeating_letter_a function checks if the text passed includes the letter "a" (lowercase or uppercase) at least twice. For example, repeating_letter_a("banana") is True, while repeating_letter_a("pineapple") is False. Fill in the code to make this work. 

In [85]:
import re
def repeating_letter_a(text):
  result = re.search(r"[aA].*[aA]", text)
  return result is not None

print(repeating_letter_a("banana")) # True
print(repeating_letter_a("pineapple")) # False
print(repeating_letter_a("Animal Kingdom")) # True
print(repeating_letter_a("A is for apple")) # True

True
False
True
True


> `+` matches one or more repetitions of the preceding character or class

In [86]:
print(re.search(r'o+l+', 'Doolly'))

print(re.search(r'o+l', 'Dexl'))

<re.Match object; span=(1, 5), match='ooll'>
None


> `?` makes the preceding character or class optional (zero or one occurrence)

In [87]:
print(re.search(r'p?each', 'To each their own'))

print(re.search(r'p?each', 'I like peaches'))

<re.Match object; span=(3, 7), match='each'>
<re.Match object; span=(7, 12), match='peach'>
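
Besides `*`, `+` and `?`, Python's `re` also supports counted repetition with braces: `{n}` for exactly n, `{m,n}` for between m and n, and `{m,}` for at least m. The zip code exercise further down uses this form (`\d{5}`). A small sketch with made-up strings:

```python
import re

exactly_three = re.search(r'a{3}', 'baaad')     # exactly three a's
two_to_three = re.search(r'a{2,3}', 'baaaad')   # greedy: takes three of the four
at_least_two = re.search(r'a{2,}', 'bad')       # only one a, so no match

print(exactly_three)
print(two_to_three)
print(at_least_two)
```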


### Escaping Characters

In [88]:
print(re.search(r'\.com', 'Welcome'))

print(re.search(r'\.com', 'domain.com'))

None
<re.Match object; span=(6, 10), match='.com'>
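
When the text to search for comes from a variable rather than being typed into the pattern, `re.escape` adds the backslashes automatically. A sketch of the same `.com` search written that way (the variable names are illustrative):

```python
import re

literal = '.com'              # text to find exactly as written
pattern = re.escape(literal)  # escapes the dot so it is not a wildcard

print(re.search(pattern, 'Welcome'))     # no literal ".com" here
print(re.search(pattern, 'domain.com'))
```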


> `\w` matches any word character: a letter, a digit, or an underscore (not whitespace or punctuation)

In [89]:
print(re.search(r'\w*', 'Welcome to the home'))

print(re.search(r'\w*', 'Welcome_1to_the_home'))

<re.Match object; span=(0, 7), match='Welcome'>
<re.Match object; span=(0, 20), match='Welcome_1to_the_home'>


> Pop Quiz

Fill in the code to check if the text passed has at least 2 groups of alphanumeric characters (including letters, numbers, and underscores) separated by one or more whitespace characters.

In [90]:
import re
def check_character_groups(text):
  result = re.search(r"\w+\s+\w+", text) # word, whitespace, then a second word
  return result is not None

print(check_character_groups("One")) # False
print(check_character_groups("123  Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False

False
True
True
False


### Case Studies with regex


Find countries whose names start with `A` and end with `a`.

In [91]:
# Bad solution
print(re.search(r'A.*a', 'Argentina'))
print(re.search(r'A.*a', 'Azerbaijan'))

# Good solution
print(re.search(r'^A.*a$', 'Azerbaijan'))
print(re.search(r'^A.*a$', 'Australia'))

<re.Match object; span=(0, 9), match='Argentina'>
<re.Match object; span=(0, 9), match='Azerbaija'>
None
<re.Match object; span=(0, 9), match='Australia'>
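
An alternative to wrapping the pattern in `^` and `$` is `re.fullmatch`, which only succeeds when the entire string matches. A sketch of the same check, including one case where the two approaches differ:

```python
import re

print(re.fullmatch(r'A.*a', 'Australia'))   # whole string matches
print(re.fullmatch(r'A.*a', 'Azerbaijan'))  # None: the string ends in "n"

# `$` also matches just before a trailing newline; fullmatch does not.
with_anchors = re.search(r'^A.*a$', 'Australia\n')
with_fullmatch = re.fullmatch(r'A.*a', 'Australia\n')
print(with_anchors is not None, with_fullmatch is not None)
```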


In [92]:
pattern = r'^[a-zA-Z_][a-zA-Z0-9_]*$'
print(re.search(pattern, '_this_is_a_valid_variable'))
print(re.search(pattern, "this isn't a valid variable"))
print(re.search(pattern, 'my_variable1'))
print(re.search(pattern, '2my_variable1'))

<re.Match object; span=(0, 25), match='_this_is_a_valid_variable'>
None
<re.Match object; span=(0, 12), match='my_variable1'>
None


> Pop Quiz

Fill in the code to check if the text passed looks like a standard sentence, meaning that it starts with an uppercase letter, followed by at least some lowercase letters or a space, and ends with a period, question mark, or exclamation point. 

In [93]:
import re
def check_sentence(text):
  result = re.search(r"^[A-Z][a-z\s]*[\.!\?]$", text)
  return result is not None

print(check_sentence("Is this is a sentence?")) # True
print(check_sentence("is this is a sentence?")) # False
print(check_sentence("Hello")) # False
print(check_sentence("1-2-3-GO!")) # False
print(check_sentence("A star is born.")) # True

True
False
False
False
True


## Practice Quiz

The check_web_address function checks if the text passed qualifies as a top-level web address, meaning that it contains alphanumeric characters (which includes letters, numbers, and underscores), as well as periods, dashes, and a plus sign, followed by a period and a character-only top-level domain such as ".com", ".info", ".edu", etc. Fill in the regular expression to do that, using escape characters, wildcards, repetition qualifiers, beginning and end-of-line characters, and character classes.

In [94]:
import re
def check_web_address(text):
  pattern = r'^[\w\-+.]+\.[a-zA-Z]+$'
  result = re.search(pattern, text)
  return result is not None

print(check_web_address("gmail.com")) # True
print(check_web_address("www@google")) # False
print(check_web_address("www.Coursera.org")) # True
print(check_web_address("web-address.com/homepage")) # False
print(check_web_address("My_Favorite-Blog.US")) # True

True
False
True
False
True


The check_time function checks for the time format of a 12-hour clock, as follows: the hour is between 1 and 12, with no leading zero, followed by a colon, then minutes between 00 and 59, then an optional space, and then AM or PM, in upper or lower case. Fill in the regular expression to do that. How many of the concepts that you just learned can you use here?

In [95]:
import re
def check_time(text):
  # Group the hour alternatives so ^ and the rest of the pattern apply to both
  pattern = r'^(1[0-2]|[1-9]):[0-5][0-9]\s?[APap][Mm]$'
  result = re.search(pattern, text)
  return result is not None

print(check_time("12:45pm")) # True
print(check_time("9:59 AM")) # True
print(check_time("6:60am")) # False
print(check_time("five o'clock")) # False

True
True
False
False
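
One thing to watch in patterns like this: `|` has the lowest precedence of any regex operator, so anchors and neighboring text bind to one alternative only unless the alternatives are wrapped in parentheses. A minimal illustration with invented patterns:

```python
import re

loose = r'^cat|dog$'    # read as: (^cat) OR (dog$)
tight = r'^(cat|dog)$'  # read as: the whole string is "cat" or "dog"

print(re.search(loose, 'catalog'))  # matches, because the string starts with "cat"
print(re.search(tight, 'catalog'))  # None
print(re.search(tight, 'dog'))      # matches
```

If the captured group is not needed, `(?:cat|dog)` groups without capturing.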


The contains_acronym function checks the text for the presence of 2 or more characters or digits surrounded by parentheses, with at least the first character in uppercase (if it's a letter), returning True if the condition is met, or False otherwise. For example, "Instant messaging (IM) is a set of communication technologies used for text-based communication" should return True since (IM) satisfies the match conditions. Fill in the regular expression in this function:

In [96]:
import re
def contains_acronym(text):
  # First character uppercase or digit, then at least one more letter or digit
  pattern = r'\([A-Z0-9][a-zA-Z0-9]+\)'
  result = re.search(pattern, text)
  return result is not None

print(contains_acronym("Instant messaging (IM) is a set of communication technologies used for text-based communication")) # True
print(contains_acronym("American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication")) # True
print(contains_acronym("Please do NOT enter without permission!")) # False
print(contains_acronym("PostScript is a fourth-generation programming language (4GL)")) # True
print(contains_acronym("Have fun using a self-contained underwater breathing apparatus (Scuba)!")) # True

True
True
False
True
True


Fill in the code to check if the text passed includes a possible U.S. zip code, formatted as follows: exactly 5 digits, and sometimes, but not always, followed by a dash with 4 more digits. The zip code needs to be preceded by at least one space, and cannot be at the start of the text.

In [97]:
import re
def check_zip_code (text):
  # \s requires whitespace before the digits, which also rules out the start of the text
  result = re.search(r"\s\d{5}(?:-\d{4})?\b", text)
  return result is not None

print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False

True
False
True
False
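
The function above only answers yes or no. To pull the zip codes out of the text, `re.findall` can be used with a similar pattern. This is a sketch only: `find_zip_codes` is a hypothetical helper, not part of the exercise, and it does not enforce the "not at the start of the text" rule.

```python
import re

# Hypothetical helper for illustration; not part of the original quiz.
def find_zip_codes(text):
    # (?:...) is non-capturing, so findall returns the whole match each time.
    return re.findall(r'\b\d{5}(?:-\d{4})?\b', text)

print(find_zip_codes("Their address is: 123 Main Street, Anytown, AZ 85258-0001."))
print(find_zip_codes("The zip codes for New York are 10001 thru 11104."))
```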
