# Regular Expressions

# Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurrences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {x,y} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) ? <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) * <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) + <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Store the String in a Variable <br>
10) <b>Regex Project</b> <br>

### Importing <br>
<p>Most programming languages support regular expressions in some form. In Python they are provided by the standard-library module 're', which must be imported before use.</p>

In [7]:
# import re
import re

### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, etc. Common uses include security measures (such as input validation), searching, filtering, and pattern recognition.</p>

##### re.compile()

In [5]:
# re.compile() builds a reusable pattern object; the regular expression methods below are called on it
pattern = re.compile('fox')
pattern

re.compile(r'fox', re.UNICODE)

##### re.match()

In [12]:
# match() checks whether the pattern matches at the very start of a string
# pattern.match(string)
a_match = pattern.match('fox is the mascot of this class')
print(a_match)
# Accessing the span of the match
a_match.span()
# a tuple containing the starting index and ending index (not inclusive) of our pattern in our larger string

<re.Match object; span=(0, 3), match='fox'>


(0, 3)

##### re.findall()

In [22]:
# pattern.findall(string) returns a list of all occurrences of the pattern
# returns an empty list if the pattern never occurs
a_findall = pattern.findall('fox is the mascot of the foxes cohort. fox.')
print(a_findall)

['fox', 'fox', 'fox']


##### re.search()

In [21]:
# just like match, but scans the entire string instead of only the start
# pattern.search(string) returns the first occurrence of the pattern
mascot = 'The Foxes cohort at (fox) Coding Temple\'s mascot is the fennec fox.'
a_search = pattern.search(mascot)
print(a_search)

<re.Match object; span=(21, 24), match='fox'>
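
The difference between the three lookups above is easiest to see side by side. The cell below is a minimal sketch on a made-up sentence (the `text` string and variable names are illustrative, not part of the lesson data); it also shows `fullmatch()`, which is not covered above and only succeeds when the whole string matches.

```python
import re

pattern = re.compile(r'fox')
text = 'The quick brown fox'  # illustrative sample string

# match() only succeeds if the pattern starts at index 0
match_result = pattern.match(text)

# search() scans forward and returns the first occurrence anywhere
search_result = pattern.search(text)

# fullmatch() requires the entire string to match the pattern
fullmatch_result = pattern.fullmatch('fox')

print(match_result)              # None
print(search_result.span())      # (16, 19)
print(fullmatch_result.group())  # fox
```

Because `match()` and `search()` return `None` on failure, check the result before calling `.span()` or `.group()` on it.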


### Sets <br>
<p>The following cells will allow you to use regular expressions to search for certain values within a range such as numbers 1 through 4.</p>

##### [a-z] or [A-Z] - any lowercase/uppercase letters from a to z<br/>[^2] - anything that's not 2

##### Integer Ranges

In [28]:
# inside a set, a leading ^ (caret) negates it: [^4-7] matches any character that is NOT 4, 5, 6, or 7
pattern_ints = re.compile('[^4-7][7-9][0-3]')

random_num = pattern_ints.search('67383')
print(random_num)

random_not_found = pattern_ints.search('88888')
print(random_not_found)

# findall() returns non-overlapping matches: characters consumed by one match are not reused by the next
random_findall = pattern_ints.findall('67383')
print(random_findall)

<re.Match object; span=(2, 5), match='383'>
None
['383']
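
To make the non-overlapping behaviour of `findall()` concrete, here is a small sketch on an illustrative digit string. The lookahead `(?=...)` used in the second call is a technique not covered in this lesson; it is shown only as one possible way to collect overlapping matches.

```python
import re

digits = '12345'  # illustrative sample string

# findall() consumes each match, so the next search starts after it
non_overlapping = re.findall(r'[0-9][0-9]', digits)

# a lookahead (?=...) matches without consuming characters,
# so every starting position gets tried
overlapping = re.findall(r'(?=([0-9][0-9]))', digits)

print(non_overlapping)  # ['12', '34']
print(overlapping)      # ['12', '23', '34', '45']
```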


##### Character Ranges

In [54]:
pattern_ul = re.compile('[A-Z][a-z]+')
# can specify a section of alphabet with a character range
# ex: [r-z] or [A-D]

found = pattern_ul.findall('Hello there, Mr. Burton Guster. I am Sam.')

print(found)

['Hello', 'Mr', 'Burton', 'Guster', 'Sam']


### Counting Occurrences

##### {x} - something that occurs {num_of_times}

In [32]:
pattern_as = re.compile('a{3}')

test = pattern_as.findall('Hello foxes aaa, test aa, a, whaat')
print(test)

['aaa']


##### {x,y} - something that occurs between x and y times (no space after the comma)

In [34]:
pattern_as = re.compile('a{1,3}')

test = pattern_as.findall('Hello aaaa foxes aaa, test aa, a, whaat')
# the four a's (aaaa) gets split into a match of three a's (aaa) and a single a (a)
print(test)

['aaa', 'a', 'aaa', 'aa', 'a', 'aa']
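
A few variations on the brace quantifier, as a hedged sketch on made-up strings: an open-ended upper bound, the lazy form (adding `?` after the braces), and what happens in Python's `re` when a space sneaks in after the comma.

```python
import re

sample = 'a aa aaa aaaa'  # illustrative sample string

# {2,} means "two or more", with no upper limit
at_least_two = re.findall(r'a{2,}', sample)

# quantifiers are greedy by default; a trailing ? makes them take as few as possible
lazy = re.findall(r'a{1,3}?', 'aaaa')

# with a space after the comma, Python treats the braces as literal text
with_space = re.findall(r'a{1, 3}', 'aaa')

print(at_least_two)  # ['aa', 'aaa', 'aaaa']
print(lazy)          # ['a', 'a', 'a', 'a']
print(with_space)    # []
```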


##### ? - something that occurs 0 or 1 time

In [45]:
pattern = re.compile('Mr?ss?')

found = pattern.findall('Hello Ms. there Mrs. Anderson. How are you Mrss?')
print(found)

['Ms', 'Mrs', 'Mrss']


##### * - something that occurs at least 0 times

In [52]:
# the dot is escaped so it matches a literal '.' instead of any character
pattern = re.compile(r'Dr\. [A-Za-z]*')

pattern.findall('Hello Dr. Krieger. I have heard that you are not actually a Dr.!')

['Dr. Krieger']

##### + - something that occurs at least once

In [55]:
pattern = re.compile('Mrss+')

found = pattern.findall('Hello there Mrs. Anderson. How are you Mrss? Mrsssssssssss')
print(found)

['Mrss', 'Mrsssssssssss']


##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [58]:
my_string = "This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day."
# Output: ['10909090','1', '2']

# re.findall()

nums = re.findall('[0-9]+', my_string)
nums

['10909090', '1', '2']

### Escaping Characters

In [59]:
# escaping a character is done with a backslash \
# placing a backslash before a character will either add special behavior
# or remove special behavior
# ex. \w adds special behavior to the letter w
# \? removes the special behavior from ? and makes the pattern just look for an actual ?
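
The comments above can be checked directly. This is a minimal sketch on illustrative strings showing how `?` and `.` behave with and without a backslash.

```python
import re

text = 'Is this a question? Yes.'  # illustrative sample string

# unescaped: ? makes the preceding 'n' optional, so the '?' itself is never matched
optional_n = re.findall(r'question?', text)

# escaped: \? looks for an actual question mark
literal_qmark = re.findall(r'question\?', text)

# unescaped . matches any character; \. matches only a period
any_char = re.findall(r'Yes.', 'Yes. Yes!')
literal_dot = re.findall(r'Yes\.', 'Yes. Yes!')

print(optional_n)     # ['question']
print(literal_qmark)  # ['question?']
print(any_char)       # ['Yes.', 'Yes!']
print(literal_dot)    # ['Yes.']
```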

##### \w - look for any word character (letters, digits, underscore)<br/>\W - look for anything that isn't a word character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [68]:
# raw strings (r'...') keep the backslash intact for the regex engine
pattern_1 = re.compile(r'[\w]+')
found_1 = pattern_1.findall('This is a sentence. With an, exclamation mark! At the end?')
print(found_1)

found_2 = re.findall(r'[\W]+', 'This is a sentence. With an, exclamation mark! At the end?')
print(found_2)

['This', 'is', 'a', 'sentence', 'With', 'an', 'exclamation', 'mark', 'At', 'the', 'end']
[' ', ' ', ' ', '. ', ' ', ', ', ' ', '! ', ' ', ' ', '?']


##### \d - look for any digit 0-9<br/>\D - look for anything that isn't a digit

In [72]:
pattern = re.compile(r'\d{1,2}\D{2}')

found_day = pattern.findall('Today is the 2nd. In 21 days it will be the 23rd. The 1st was yesterday. July 4th is a holiday.')
found_day

['2nd', '21 d', '23rd', '1st', '4th']

##### \s - look for any whitespace<br/>\S - look for anything that isn't whitespace

In [80]:
pattern_no_space = re.compile(r'\S+')
pattern_space = re.compile(r'[se]\s+[ir]')

found_spaces = pattern_space.findall('\nThere are some   random spaces \t\t in this string....  \n')
print(found_spaces)

found_other = pattern_no_space.findall('')

['e   r', 's \t\t i']


##### \b - look for boundaries or edges of a word<br/>\B - look for anything that isn't a boundary

In [86]:
# a simple example would be whether or not the pattern 'Fox' matches with 'Foxes'

# the boundary operator needs a raw string: in a normal string literal, '\b' is the backspace character
# so specify r'\b' in your pattern

pattern_bounds = re.compile(r'Fox\b')

pattern_no_bounds = re.compile(r'Fox\B')

print(pattern_bounds.findall('The Fox played with the other Foxes. Fox.'))
print(pattern_no_bounds.search('The Fox played with the other Foxes. Fox.'))
print('The Fox played with the other Foxes. Fox.'[30:35])

['Fox', 'Fox']
<re.Match object; span=(30, 33), match='Fox'>
Foxes
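
Why the raw string matters for `\b` can be shown in a couple of lines. This is a small sketch with illustrative variable names: the non-raw literal ends in a backspace character, so the pattern looks for that character instead of a word boundary.

```python
import re

plain = 'Fox\b'   # '\b' here is the backspace character (ASCII 8)
raw = r'Fox\b'    # backslash and 'b' are kept as two separate characters

print(len(plain), len(raw))  # 4 5

sentence = 'The Fox ran'  # illustrative sample string
no_boundary = re.findall(plain, sentence)
with_boundary = re.findall(raw, sentence)

print(no_boundary)    # []
print(with_boundary)  # ['Fox']
```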


### Grouping

In [91]:
# Allows us to split our expression into separate "capture groups"
# Our expression can include values inside or outside of the capture groups
# and the capture groups can be independently accessed
# values inside of capture groups are present in the results of a findall
# values outside of the capture groups in a pattern are not included in the results of a findall

string_of_names = 'Fennec       Fox, Water \tScorpion, angry bee, Polar \nBear, River Otter, GiantPanda'

name_pattern = re.compile(r'([A-Z][a-z]+)\s*([A-Z][a-z]+)')

names = name_pattern.findall(string_of_names)
print(names)

modifiers = [x[0] for x in names]
animal = [x[1] for x in names]
print(modifiers)
print(animal)

[('Fennec', 'Fox'), ('Water', 'Scorpion'), ('Polar', 'Bear'), ('River', 'Otter'), ('Giant', 'Panda')]
['Fennec', 'Water', 'Polar', 'River', 'Giant']
['Fox', 'Scorpion', 'Bear', 'Otter', 'Panda']
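
Capture groups can also be read from a single match object, and they can be given names. The sketch below uses `(?P<name>...)` named groups, which this lesson does not otherwise cover; the group names `modifier` and `animal` are made up for the example, and `\s+` is used here so the two words must be separated.

```python
import re

# named groups: (?P<name>...) labels a capture group
name_pattern = re.compile(r'(?P<modifier>[A-Z][a-z]+)\s+(?P<animal>[A-Z][a-z]+)')

text = 'angry bee, Polar \nBear, River Otter'  # illustrative sample string
first = name_pattern.search(text)

print(first.group(0))           # the whole match, including the whitespace between the words
print(first.group('modifier'))  # Polar
print(first.group(2))           # Bear (groups are still numbered from 1)

# finditer() yields one match object per occurrence
pairs = [m.groupdict() for m in name_pattern.finditer(text)]
print(pairs)
```

Named groups make the later indexing (`x[0]`, `x[1]`) self-documenting, at the cost of a slightly longer pattern.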


##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [133]:
my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# You can also put $ at the end of your pattern: it only matches at the end of the string

#.com OR .org => com|org

#Expected output:
#None
#pocohontas1776@gmail.com
#None
#yourfavoriteband@g6.org
#None

def validateEmail(email):
    # the dot is escaped so it only matches a literal '.'
    pattern = re.compile(r"([A-Za-z0-9]+)@([A-Za-z0-9]+)\.(com|org)\b")
    found = pattern.match(email)
    if found:
        # capture groups on a .match() or .search() can be accessed through the .groups() method
        print(found.groups())
        return email
    # invalid addresses return None, as the exercise asks
    return None

for email in my_emails:
    print(validateEmail(email))



None
('pocohontas1776', 'gmail', 'com')
pocohontas1776@gmail.com
None
('yourfavoriteband', 'g6', 'org')
yourfavoriteband@g6.org
None


In [134]:
# What if all of my emails are in a single string
my_emails_str = ', '.join(my_emails)
print(my_emails_str)
# goal return a list of the valid emails from this email string
# re.findall()
# dots are escaped so they match a literal '.' rather than any character
pattern = re.compile(r"[A-Za-z0-9]+@[A-Za-z0-9]+\.com\b|[A-Za-z0-9]+@[A-Za-z0-9]+\.org\b")
email_list = pattern.findall(my_emails_str)
email_list

jordanw@codingtemple.orgcom, pocohontas1776@gmail.com, helloworld@aol..com, yourfavoriteband@g6.org, @codingtemple.com


['pocohontas1776@gmail.com', 'yourfavoriteband@g6.org']
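
The `$` anchor mentioned in the exercise comments can replace `\b` when validating one address at a time. This is a hedged sketch, not the exercise's official solution; the names `strict` and `anchored` are illustrative. It also compares `$` with `fullmatch()`, since `$` still matches just before a trailing newline.

```python
import re

# same character classes as the exercise, with the dot escaped
strict = re.compile(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.(?:com|org)')
anchored = re.compile(strict.pattern + r'$')

# '.org' is followed by 'com', so the end-of-string anchor rejects it
bad = anchored.match('jordanw@codingtemple.orgcom')
good = anchored.match('pocohontas1776@gmail.com')

# $ tolerates one trailing newline; fullmatch() does not
dollar_newline = anchored.match('me@example.com\n')
full_newline = strict.fullmatch('me@example.com\n')

print(bad)                         # None
print(good.group())                # pocohontas1776@gmail.com
print(dollar_newline is not None)  # True
print(full_newline)                # None
```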

### Opening a File <br>
<p>Python gives us a couple of ways to open and read files; below are the two used most often.</p>

##### open()

In [136]:
file_ = open('names.txt')
# read in the data from the file
data = file_.read()
print(data, type(data))
# when opening a file in this manner we have to remember to close it, otherwise the file handle stays open
    # and keeps holding system resources
file_.close()

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Zhai, Mo	mozhai@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov
 <class 'str'>


##### with open()

In [5]:
# if you don't want to have to remember to close the file
# use with open() instead
with open('names.txt') as file_:
    data = file_.read()
print(data, type(data))

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Zhai, Mo	mozhai@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov
 <class 'str'>
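
A file object opened with `with open()` can also be looped over one line at a time, which pairs well with `re.search()`. The sketch below writes a two-line stand-in file to a temporary location so it runs without `names.txt`; the sample rows and the email pattern are simplified assumptions, not a full email validator.

```python
import os
import re
import tempfile

# hypothetical two-line sample standing in for names.txt
sample = 'Hawkins, Derek\tderek@codingtemple.com\nZhai, Mo\tmozhai@codingtemple.com\n'

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

emails = []
with open(path) as file_:
    # iterating over the file yields one line per loop
    for line in file_:
        found = re.search(r'[\w.+-]+@[\w.-]+', line)
        if found:
            emails.append(found.group())

os.remove(path)  # clean up the temporary file
print(emails)    # ['derek@codingtemple.com', 'mozhai@codingtemple.com']
```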


##### re.match()

In [139]:
re.match(r'Hawkins, Derek', data)

<re.Match object; span=(0, 14), match='Hawkins, Derek'>

##### re.search()

In [140]:
re.search(r'darth-vader@empire.gov', data)

<re.Match object; span=(665, 687), match='darth-vader@empire.gov'>

##### Store the String in a Variable

In [None]:
# see above -> data = file_.read()

In [15]:
# Interaction between input() and regex
# input() returns a string
# regex can use a string as a pattern
answer = input('What should I search for in my text?')

found = re.search(answer, data)
print(found)
if found:
    print(f'I found your data! At indexes {found.span()} {found.group()}')
else:
    print('Your requested word is not present.')

What should I search for in my text?Enchanter
<re.Match object; span=(320, 329), match='Enchanter'>
I found your data! At indexes (320, 329) Enchanter
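
Because the typed text is used as a pattern, any special characters in it keep their regex meaning. A minimal sketch of the issue, using `re.escape()` to search for the input literally; the `answer` value stands in for what a user might type at the `input()` prompt, and `data` is a shortened illustrative string.

```python
import re

data = 'doctor+companion@tardis.co.uk\t(555) 555-5555'  # illustrative sample
answer = 'doctor+companion'  # stands in for the value returned by input()

# as a pattern, 'r+' means "one or more r", so the literal '+' is never matched
as_pattern = re.search(answer, data)

# re.escape() backslash-escapes special characters so the text is matched literally
as_literal = re.search(re.escape(answer), data)

print(as_pattern)          # None
print(as_literal.group())  # doctor+companion
```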


### In-Class Exercise #3 <br>
<p>Print each person's name and Twitter handle using groups. The output should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Sven-Erik Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [54]:
# split it up - separating each line into a separate string will be far easier for this problem
    # .split() with a '\n' separator (newline)
    # .readlines() instead of .read()
    
with open('names.txt') as file:
    data = file.readlines()

print('==============\nFull Name / Twitter\n==============')
# goal is just to capture the last name, first name, and twitter handle as separate capture groups
for person in data:
    test = re.match(r'([\w]+), ([\w-]+)[\w\W]*(@[A-Za-z0-9]+)$', person)
    if test:
        print(f'{test.groups()[1]} {test.groups()[0]} / {test.groups()[2]}')

==============
Full Name / Twitter
==============
Derek Hawkins / @derekhawkins
Sven-Erik Osterberg / @sverik
Ryan Butz / @ryanbutz
Example Exampleson / @example
Ripal Pael / @ripalp
Darth Vader / @darthvader


In [41]:
data

['Hawkins, Derek\tderek@codingtemple.com\t(555) 555-5555\tTeacher, Coding Temple\t@derekhawkins\n',
 'Zhai, Mo\tmozhai@codingtemple.com\t(555) 555-5554\tTeacher, Coding Temple\n',
 'Johnson, Joe\tjoejohnson@codingtemple.com\t\tJohson, Joe\n',
 'Osterberg, Sven-Erik\tgovernor@norrbotten.co.se\t\tGovernor, Norrbotten\t@sverik\n',
 ', Tim\ttim@killerrabbit.com\t\tEnchanter, Killer Rabbit Cave\n',
 'Butz, Ryan\tryanb@codingtemple.com\t(555) 555-5543\tCEO, Coding Temple\t@ryanbutz\n',
 'Doctor, The\tdoctor+companion@tardis.co.uk\t\tTime Lord, Gallifrey\n',
 'Exampleson, Example\tme@example.com\t555-555-5552\tExample, Example Co.\t@example\n',
 'Pael, Ripal\tripalp@codingtemple.com\t(555) 555-5553\tTeacher, Coding Temple\t@ripalp\n',
 'Vader, Darth\tdarth-vader@empire.gov\t(555) 555-4444\tSith Lord, Galactic Empire\t@darthvader\n',
 'Fernandez de la Vega Sanz, Maria Teresa\tmtfvs@spain.gov\t\tFirst Deputy Prime Minister, Spanish Gov\n']

### Regex project

Use Python to read the file regex_test.txt and print the full name on each line using regular expressions and groups (return None for names with no first and last name, or names that aren't properly capitalized)
##### Hint: use with open() and readlines()

In [None]:
"""
Expected Output
Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
"""

In [12]:
with open('regex_test.txt') as file:
    data = file.read().splitlines()
    
def realName(name):
    return name if re.match(r'([A-Z][a-z]*) ([A-Z]?[a-z]*) ?([A-Z][a-z]*)', name) else None

for person in data:
    print(realName(person))

Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None

