# Regular Expressions

This notebook presents basic regular expression syntax and covers common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement.

We will cover Python’s standard `re` library that provides regular expression matching operations  
[docs.python.org/3.6/library/re.html](https://docs.python.org/3.6/library/re.html)

You may also want to look at this [*excellent* tutorial from Google](https://developers.google.com/edu/python/regular-expressions).

### Basic Patterns

* a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ \$ * + ? { [ ] \ | ( ) (details below)
* . (a period) -- matches any single character except newline '\n'
* \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
* \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
* \t, \n, \r -- tab, newline, return
* \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
* ^ = start, $ = end -- match the start or end of the string
* \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
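
As a quick illustration of the basic patterns listed above, here is a minimal sketch that applies a few of them to a made-up sample string. The string and the variable names are assumptions chosen only for demonstration.

```python
import re

sample = "Call 555-1234 at 9 a.m.\tor email me."  # made-up sample text

digits = re.findall(r'\d', sample)     # \d matches one digit at a time
spaces = re.findall(r'\s', sample)     # \s matches the spaces and the tab
periods = re.findall(r'\.', sample)    # escaped period: a literal '.'
any_char = re.findall(r'9.a', sample)  # unescaped period: any single character
start = re.findall(r'^Call', sample)   # ^ anchors to the start of the string
end = re.findall(r'me\.$', sample)     # $ anchors to the end of the string

print(digits, len(spaces), periods, any_char, start, end)
```

Writing the patterns as raw strings (the `r` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.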

In [None]:
import re

## Findall

In [None]:
#Return all non-overlapping matches of pattern in string, as a list of strings.

#re.findall(pattern, string)
re.findall('a','Data Care, Feeding & Cleaning')

### Examples

In [None]:
#The re module has other methods like 'match', 'search', 'sub', etc.
#re.findall(pattern, string, flags=0)
#The findall method will be used in subsequent exercises to introduce the patterns used for constructing regular expressions. 

re.findall('b', 'abc')

In [None]:
re.findall('cat','Cat dog frog cat dog')

In [None]:
re.findall('', 'abc')

### Exercise 1

In [None]:
# Fill in the appropriate 'pattern' and 'string' parameters for 'findall' in following exercises.
a = re.findall( 'e', 'telephone')  # len(a) = 3
b = re.findall( 'a', 'aaaa')  # len(b) = 4
c = re.findall( 's', 'regular expression')  # len(c) = 2

print('len(a)=%d'%len(a))
print('len(b)=%d'%len(b))
print('len(c)=%d'%len(c))

### Boundaries

#### Examples

In [None]:
re.findall('.', 'abcd')   # match any character

In [None]:
re.findall('^a', 'abcd abcd')  # match 'a' only at the start of the string

In [None]:
re.findall('d$', 'dcba abcd') # match 'd' only at the end of the string

In [None]:
re.findall('.abcd', '1abcd 2abcd') # match any single character followed by 'abcd'
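
By default `^` and `$` anchor to the whole string rather than to each line. The sketch below, which uses a made-up two-line string, shows how the standard `re.MULTILINE` flag changes that behavior.

```python
import re

text = "abcd\nabcd"  # two lines, made up for illustration

# Default: ^ matches only at the very start of the string.
default_starts = re.findall(r'^a', text)
# With re.MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r'^a', text, re.MULTILINE)
# Likewise, $ then matches before each newline as well as at the end.
default_ends = re.findall(r'd$', text)
multiline_ends = re.findall(r'd$', text, re.MULTILINE)

print(default_starts, multiline_starts)
print(default_ends, multiline_ends)
```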

### Exercise 2

In [None]:
# Fill in the appropriate 'pattern' parameter for 'findall' using ONLY '^','.' and '$'.
a = re.findall('.....$', 'aeroplane') # a =  ['plane']
b = re.findall('^....', 'computer') # b =  ['comp']

print('a = %s'%str(a))
print('b = %s'%str(b))

### Repetitions

#### Examples

In [None]:
re.findall('.*', 'abcd') # match any number of characters

In [None]:
# The result contains an additional '' because '*' can also match zero characters (the empty string).

In [None]:
re.findall('.+', 'abcd') # match 1 or more characters

In [None]:
re.findall('.?', 'abcd') # match 0 or 1 character

### Exercise 3

In [None]:
# Fill in the appropriate 'pattern' parameter for 'findall' using '*','+','?' and '.'.

#Find all characters between the first and the last occurrences of the letter 'e'
a = re.findall('e.*e', 'ae&e*eropjhjhjhlane')

#Find non-empty runs of characters between occurrences of the letter 'p'
b = re.findall('p.+p', 'abcpabcp pabcp')

#Find all occurrences of the letter 'e' and the next character if it is present
c = re.findall('e.?', 'aeroplane')

print('a = %s'%str(a))
print('b = %s'%str(b))
print('c = %s'%str(c))

### Greedy and Lazy

#### Examples

In [None]:
re.findall('<.*>','<H1>title</H1>')

In [None]:
re.findall('<.*?>','<H1>title</H1>')

### Exercise 4

In [None]:
# Fill in the appropriate 'pattern' parameter for 'findall' using ONLY '*','+','?' and '.'.

#Find all characters between occurrences of the letter 'e'
a = re.findall('e.*?e', 'ae&e*eroplane')

#Find non-empty runs of characters between occurrences of the letter 'p'
b = re.findall('p.+?p', 'abcpabcp pabcp')

#Find all occurrences of the letter 'e' and the next character if it is present
c = re.findall('e.??', 'aeroplane')

print('a = %s'%str(a))
print('b = %s'%str(b))
print('c = %s'%str(c))

### Matching a Specific Number of Times

#### Examples

In [None]:
re.findall('a{2}', 'aaabcd')    # match a substring of 'aa'

In [None]:
re.findall('.{1,3}', 'abbcccd') # match any substring of 1-3 characters

### Exercise 5

In [None]:
# Fill in the appropriate 'pattern' parameter for 'findall' so that it
# matches the description to the right.
print(re.findall('a{4}','aaaaaaa')) #  match exactly 4 characters
print(re.findall('a{0,}','aaaaaaaa'))   #  match 0 or more characters
print(re.findall('a{0,1}','aaaaaaaa'))   #  match 0 or 1 character
print(re.findall('a{1,}','aaaaaaaa'))   #  match 1 or more characters

### Greedy and Lazy

#### Examples

In [None]:
re.findall('a{3,5}','aaaaaa') # greedy: matches as many as possible, here 5 'a' characters

In [None]:
re.findall('a{3,5}?','aaaaaa') # lazy: a{3,5}? matches as few as possible, here 3 characters at a time

### Character Set: [ ]

In [None]:
re.findall('[aeiou]', 'hello')   # search for vowels

In [None]:
re.findall('[3-6]', '12345678')  # search for digits 3-6

In [None]:
re.findall('[^3-6]', '12345678jdfhsdjfh') # search for complement set of numbers 3-6

In [None]:
re.findall('[a-z]', 'gGHGjghjgGKJGkgKGkGiupoXHG')   # search for lowercase letters

In [None]:
re.findall('[A-Z]', 'gGHGjghjgGKJGkgKGkGiupoXHG')   # search for uppercase letters
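
Inside a character set most meta-characters lose their special meaning, and the position of `-` and `^` matters. The following is a small sketch on a made-up string; the variable names are illustrative only.

```python
import re

text = "a-b.c^d 1+2"  # made-up sample

# Inside [ ], '.' and '+' are literal characters.
literals = re.findall(r'[.+]', text)
# A hyphen placed first is literal, and '^' is literal when it is not first.
hyphen_caret = re.findall(r'[-^]', text)
# Several ranges can be combined in one set.
ranges = re.findall(r'[a-c1-2]', text)

print(literals, hyphen_caret, ranges)
```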

### Exercise 6

In [None]:
#Fill in the appropriate 'pattern' parameter for 'findall' using [].

#Find all cats and hats
print(re.findall('[ch]at','cat hat dog frog hat door'))

#Find all lowercase letters from a to z
print(re.findall('[a-z]', 'hjsJH#3jh#kh838(B#g9g(#G9)'))

#Find all letters from a to z both uppercase and lowercase
print(re.findall('[a-zA-Z]', 'hjsJH#3jh#kh838(B#g9g(#G9)'))

#Find all alphanumeric characters
print(re.findall('[a-zA-Z0-9]', 'hjsJH#3jh#kh838(B#g9g(#G9)'))

### Special Character: |

In [None]:
print(re.findall('cat|hat','cat hat dog frog hat door'))


#### Examples

In [None]:
re.findall('abc|123', '123def') # matches 'abc' or '123'

In [None]:
re.findall('abc|123', '456abc')

In [None]:
re.findall('abc|123', 'a1b2c3')

In [None]:
print(re.findall('org|edu|com', 'nick@mail.org, kbauman@temple.edu, john@gmail.com'))
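
Two details of `|` are easy to miss: it splits the entire pattern unless it is enclosed in a group, and the alternatives are tried from left to right. The sketch below uses made-up strings, and `(?:...)` is the standard non-capturing group syntax of the `re` module.

```python
import re

text = "gray grey greedy"  # made-up sample

# Without grouping, '|' splits the whole pattern: 'gra' OR 'ey'.
ungrouped = re.findall(r'gra|ey', text)
# A non-capturing group limits the alternation to 'a' or 'e'.
grouped = re.findall(r'gr(?:a|e)y', text)
# Alternatives are tried left to right, so 'cat' wins over 'category'.
first_wins = re.findall(r'cat|category', 'category')

print(ungrouped, grouped, first_wins)
```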

### Special Character: \

#### Examples

In [None]:
re.findall(r'\*','gjhgdsfjds*4f**43f43*&#&*')
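
When the text to be matched literally comes from a variable, escaping each meta-character by hand is error-prone. A minimal sketch using the standard `re.escape` function follows; the sample strings are made up.

```python
import re

literal = "1+1=2?"            # made-up text containing '+' and '?'
pattern = re.escape(literal)  # backslash-escapes the special characters
found = re.findall(pattern, "Is 1+1=2? Yes, 1+1=2.")

print(pattern)
print(found)
```

The exact escaped form printed for `pattern` can differ slightly between Python versions, but it always matches the original text literally.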

### Exercise 7

In [None]:
#Fill in the appropriate 'pattern' parameter for 'findall'.

#Find all runs of two or more question marks
re.findall(r'\?{2,}','hj?khjfds???jhjsd??bjh?njh????')

### Special Sequence: \d, \D

#### Examples

In [None]:
re.findall(r'\d', '1a2b3c')

In [None]:
re.findall(r'\D', '1a2b3c')

### Exercise 8

In [None]:
#Find the equivalent for the following regexes using []

a1 = re.findall(r'\d','gjhg4jh5g24jhgj2hb23423')
a2 = re.findall('[0-9]','gjhg4jh5g24jhgj2hb23423')

print(a1)
print(a2)
print(a1==a2)

In [None]:
a1 = re.findall(r'\D','gjhg4jh5g24jhgj2hb23423')
a2 = re.findall('[^0-9]','gjhg4jh5g24jhgj2hb23423')

print(a1)
print(a2)
print(a1==a2)

In [None]:
a1 = re.findall(r'\d+','gjhg4jh5g24jhgj2hb23423')
a2 = re.findall('[0-9]+','gjhg4jh5g24jhgj2hb23423')

print(a1)
print(a2)
print(a1==a2)

In [None]:
a1 = re.findall(r'\D?','gjhg4jh5g24jhgj2hb23423')
a2 = re.findall('[^0-9]?','gjhg4jh5g24jhgj2hb23423')

print(a1)
print(a2)
print(a1==a2)

### Special Sequence: \w, \W

#### Examples

In [None]:
re.findall(r'\w', '%*2&3c_')

In [None]:
re.findall(r'\W', '^12#3$_')
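
In Python 3, `\w` and `\d` applied to ordinary (Unicode) strings also match non-ASCII letters and digits, so they equal `[a-zA-Z0-9_]` and `[0-9]` only for ASCII text or when the `re.ASCII` flag is set. Here is a short sketch with a made-up string containing an accented letter and an Arabic-Indic digit:

```python
import re

text = "caf\u00e9 \u0663 x9"  # 'café', an Arabic-Indic digit three, 'x9'

word_unicode = re.findall(r'\w', text)          # includes 'é' and the Arabic-Indic digit
word_ascii = re.findall(r'\w', text, re.ASCII)  # ASCII word characters only
digit_unicode = re.findall(r'\d', text)
digit_ascii = re.findall(r'\d', text, re.ASCII)

print(len(word_unicode), len(word_ascii))
print(len(digit_unicode), len(digit_ascii))
print(word_ascii == re.findall(r'[0-9a-zA-Z_]', text))
```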

### Exercise 9

In [None]:
# Write down the equivalent RE for \w using character set [ ]. 
string = 'ad@#45_83kd_8^1t iuDU7  Fgewh' 

w1 = re.findall(r'\w',string)
w2 = re.findall('[0-9a-zA-Z_]',string)

print('w1 = %s'%str(w1))
print('w2 = %s'%str(w2))
print(w1==w2)

In [None]:
# Write down the equivalent RE for \W using character set [ ]. 
string = 'ad@#45_83kd_8^1t iuDU7  Fgewh' 

W1 = re.findall(r'\W',string)
W2 = re.findall('[^0-9a-zA-Z_]',string)
print('W1 = %s'%str(W1))
print('W2 = %s'%str(W2))
print(W1==W2)

In [None]:
#Find all groups of alphanumeric characters
print(re.findall(r'\w+', '%*2&3c_'))

# Advanced topics for self-study

### Parentheses ()

#### Examples

In [None]:
re.findall('([13579])([a-z])', '1a2b3c') # match an odd digit followed by a lowercase letter

In [None]:
re.findall('([0-9])\\1', '1233455')  # return a list of repeated digits. Without the 'r' prefix, the backslash must be doubled.

In [None]:
re.findall(r'([0-9])\1', '1233455')  # return a list of repeated digits.
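
Note that once a pattern contains capturing groups, `findall` returns the group contents rather than the whole match. The sketch below reuses the strings from the examples above and adds `finditer` to recover the full matched text.

```python
import re

text = '1233455'

# One capturing group: findall returns only the group's text.
groups_only = re.findall(r'([0-9])\1', text)
# finditer yields match objects; group() with no argument is the full match.
full_matches = [m.group() for m in re.finditer(r'([0-9])\1', text)]
# Two capturing groups: findall returns a list of tuples.
pairs = re.findall(r'([13579])([a-z])', '1a2b3c')

print(groups_only, full_matches, pairs)
```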

### Exercise 10

In [None]:
#find all repeated characters
string = '''Mini Quizzes:20% Home assignments:15% Midterm exam:15% Project:30% Final exam:15% Participation:5%'''
re.findall(r"([\w])\1",string)

### Exercise 11

A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward or forward.

In [None]:
#Fill in the appropriate 'pattern' parameter for 'findall' to identify
#all palindromes with length = 5.
re.findall(r'', 'abcba cffrc ccccc fghgf')
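
One possible approach, offered as a sketch rather than the intended answer, uses two capturing groups and backreferences in reverse order. Because `findall` would return the group contents, the sketch collects full matches with `finditer`. It assumes palindromes are made of word characters only.

```python
import re

text = 'abcba cffrc ccccc fghgf'

# (\w)(\w) capture the first two characters, \w is the middle one,
# and \2\1 require the captured characters again in reverse order.
palindromes = [m.group() for m in re.finditer(r'(\w)(\w)\w\2\1', text)]

print(palindromes)  # ['abcba', 'ccccc', 'fghgf']
```

This pattern would also match a 5-character palindrome embedded inside a longer word; restricting it to whole words needs additional anchoring.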

### Exercise 12

In [None]:
#get age by name from the text
string = 'Mary is 24 years old. Her mother Helen is 52 and her grandfather Joe is 76 years old.'
re.findall('',string)
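
A possible sketch for this exercise, under the assumption that every name is a capitalized word directly followed by " is " and the age:

```python
import re

string = 'Mary is 24 years old. Her mother Helen is 52 and her grandfather Joe is 76 years old.'

# Group 1: a capitalized word; group 2: the number after ' is '.
pairs = re.findall(r'([A-Z][a-z]+) is (\d+)', string)
ages = {name: int(age) for name, age in pairs}

print(ages)  # {'Mary': 24, 'Helen': 52, 'Joe': 76}
```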

### Exercise 13

In [None]:
#get tuples of (Grade, percent)
string = '''Mini Quizzes:20% Home assignments:15% Midterm exam:15% Project:30% Final exam:15% Participation:5%'''
re.findall(r'',string)
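
One way this could be approached, assuming each grade name consists only of letters and spaces and is followed by a colon, a number, and a percent sign:

```python
import re

string = '''Mini Quizzes:20% Home assignments:15% Midterm exam:15% Project:30% Final exam:15% Participation:5%'''

# Group 1 starts with a letter (so the separating space is skipped),
# then takes letters and spaces up to the colon; group 2 is the number.
grades = re.findall(r'([A-Za-z][A-Za-z ]*):(\d+)%', string)

print(grades)
```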

### Exercise 14

In [None]:
#Find all people mentioned in news
news = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
print(re.findall(r'', news))
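
A hedged sketch of one solution: the square brackets of the tags must be escaped, and a lazy group keeps each match from running on to the next closing tag.

```python
import re

news = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# \[P\] and \[/P\] match the literal tags; \s* absorbs the padding spaces;
# (.+?) lazily captures the name in between.
people = re.findall(r'\[P\]\s*(.+?)\s*\[/P\]', news)

print(people)  # ['Barack Obama', 'Bill Gates']
```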

## Search, Finditer

In [None]:
# The search method scans the string for the first location where the regex matches
# and stops as soon as it finds that match. For example, we will not get the second phone number
regex = re.compile('[0-9]{3}-[0-9]{3}-[0-9]{4}')
text = '''
Konstantin Bauman, Data Care Feeding & Cleaning, 
215-204-3750, kbauman@temple.edu, 978-555-5555
'''
search_result = regex.search(text)
if search_result:
    print(search_result.group())
else:
    print("not found")

In [None]:
# The finditer method scans the string and yields a match object for every non-overlapping match of the regex
regex = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')
text = '''
Konstantin Bauman, Data Care Feeding & Cleaning, 
215-204-3750, kbauman@temple.edu, 978-555-5555
'''
result = regex.finditer(text)
for m in result:
    print("Starts at:", m.start(),
          "Ends at:", m.end(),
          "Content:", m.group())

Example
-------

Find all the emails in a webpage. 

In [None]:
import requests
# fetch html
url = 'https://www.fox.temple.edu/posts/people-category/cms-faculty/'

emails = set()    

response = requests.get(url, verify=False)
html = response.text
# build the regular expression for emails
regex = re.compile(r'([a-z.]+)@\w+(\.\w+)+')
    
# find all parts of the text that match the regular expression above
matches = regex.finditer(html)
# go through all the matches and add them to a set of found emails
for m in matches:
    emails.add(m.group())

with open("emails.txt", "w") as f:
    for email in emails:
        f.write(email + "\n")
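
String replacement, the last use case named in the introduction, is handled by `re.sub`. The sketch below reuses the phone-number pattern from this section on a shortened sample text; the placeholder and the output format are assumptions chosen for illustration.

```python
import re

text = "215-204-3750, kbauman@temple.edu, 978-555-5555"
phone = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')

# Replace every phone number with a fixed placeholder.
masked = phone.sub('XXX-XXX-XXXX', text)
# Capture the three parts and reuse them in the replacement via \1, \2, \3.
reformatted = re.sub(r'([0-9]{3})-([0-9]{3})-([0-9]{4})', r'(\1) \2-\3', text)

print(masked)
print(reformatted)
```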

### Exercise 15

In [None]:
#Write a function that validates the password.
#It should be 6-8 characters long.
#It consists of alphanumeric characters with a mixture of digits and letters.

#template
def validate(passwd):
    if (False): #change this and add other conditions
        return 'Valid password.'
    else:
        return 'Invalid password.'

    
print(validate('1234'))      
print('Invalid password because it\'s too short\n')

print(validate('google'))      
print('Invalid password because there is no digit\n')

print(validate('passwd123'))
print('Invalid password because it\'s too long\n')

print(validate('passwd12'))      
print('Valid password.\n')

print(validate('pas@wd12'))
print('Invalid password because of the invalid character \'@\'')
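
One possible way to fill in the template, shown as a sketch: the function name `validate_sketch` is hypothetical, and the three checks are a direct reading of the stated rules (6-8 alphanumeric characters, at least one digit, at least one letter).

```python
import re

def validate_sketch(passwd):
    # fullmatch requires the whole string to be 6-8 alphanumeric characters;
    # the two searches require at least one digit and at least one letter.
    ok = (re.fullmatch(r'[a-zA-Z0-9]{6,8}', passwd)
          and re.search(r'[0-9]', passwd)
          and re.search(r'[a-zA-Z]', passwd))
    return 'Valid password.' if ok else 'Invalid password.'

for candidate in ['1234', 'google', 'passwd123', 'passwd12', 'pas@wd12']:
    print(candidate, '->', validate_sketch(candidate))
```

Splitting the rules into separate checks keeps each pattern simple; a single pattern would need lookahead assertions, which this notebook does not cover.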