**Regular Expression**

A regular expression is a powerful tool for matching text against a predefined pattern. It can detect whether a string contains text that fits the pattern, and it can also break a match into one or more sub-patterns (groups).

In [10]:
import re  
import pandas as pd

##### Regular expression rules

.       - Any Character Except New Line <br>
\d      - Digit (0-9)<br>
\D      - Not a Digit (0-9)<br>
\w      - Word Character (a-z, A-Z, 0-9, _)<br>
\W      - Not a Word Character<br>
\s      - Whitespace (space, tab, newline)<br>
\S      - Not Whitespace (space, tab, newline)<br>

\b      - Word Boundary<br>
\B      - Not a Word Boundary<br>
^       - Beginning of a String<br>
$       - End of a String<br>

[]      - Matches Characters in brackets <br>
[^ ]    - Matches Characters NOT in brackets<br>
|       - Either Or<br>
( )     - Group<br>

Quantifiers:<br>
    *       - 0 or More<br>
    +       - 1 or More<br>
    ?       - 0 or One<br>
    {3}     - Exact Number<br>
    {3,4}   - Range of Numbers (Minimum, Maximum)<br>
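
The quantifiers above can be compared side by side. The following is a minimal sketch; the sample strings and the use of `re.fullmatch` (which requires the whole string to match) are illustrative choices, not part of the original notebook.

```python
import re

# Illustrative samples: 'a', then zero to four 'b' characters, then 'c'
samples = ["ac", "abc", "abbc", "abbbc", "abbbbc"]
patterns = [r"ab*c", r"ab+c", r"ab?c", r"ab{3}c", r"ab{3,4}c"]

# For each pattern, collect the samples that it matches in full
results = {
    p: [s for s in samples if re.fullmatch(p, s)]
    for p in patterns
}

for p, matched in results.items():
    print(f"{p:10} -> {matched}")
```

Only the quantifier changes between patterns, so the differences in the printed lists come from `*`, `+`, `?`, `{3}` and `{3,4}` alone.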

#### Remember ####
Metacharacters (must be escaped with `\` to match them literally): <br>
. ^ $ * + ? { } [ ] \ | ( )
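
To see why escaping matters, here is a small sketch. The strings are made up for illustration; `re.escape` is a standard-library helper that escapes the metacharacters in a literal string.

```python
import re

# Unescaped '.' matches any character, so "3x50" is accepted
unescaped = bool(re.search(r"3.50", "3x50"))
# Escaped '\.' matches only a literal period, so "3x50" is rejected
escaped = bool(re.search(r"3\.50", "3x50"))
print(unescaped, escaped)

# re.escape() escapes every metacharacter in a literal string for you
literal = re.escape("(USD)")
found = re.search(literal, "Total: 3.50 (USD)").group()
print(found)
```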


#### Sample Regexes ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
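
The sample pattern above is meant to match email addresses. A quick sketch of how it might be used follows; the name `EMAIL` and the test strings are assumptions for illustration.

```python
import re

# The sample pattern from above, compiled once for reuse
EMAIL = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

line = "Contact davemartin@bogusemail.com or charlesharris@bogusemail.com"
emails = EMAIL.findall(line)
print(emails)

# The last character class also accepts digits, so trailing digits are kept
loose = EMAIL.findall("trinhtam@gmail.com16")
print(loose)
```

The pattern is deliberately permissive. It is a convenient starting point, not a full validator for email addresses.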


#### Create a sample text 

Task: Find every `.` in the text and print each one.

In [45]:
text=r"ha Van Tuyen. This is 30 years old & and % $ and # ? \ 90 |"  # raw string, so the backslash is kept literally

pattern =re.compile(r"\.")

matches = pattern.finditer(text)

for i in matches:
    print(i[0])

.


- What if we don't have the `re` library?

In [46]:
for item in text:
    if item==".":
        print(item)

.


In [126]:
text=r'''
Dave Martin
615-555-7164
173 Main St., Springfield RI 55924
davemartin@bogusemail.com

Charles Harris
800-555-5669
969 High St., Atlantis VA 34075
charlesharris@bogusemail.com

Eric Williams
560-555-5153
806 1st St., Faketown AK 86847
laurawilliams@bogusemail.com
Mr. Ha Van Tuyen
Ms. Nguyen Thi Ha
560.555.5153
Mrs. Le Thi Linh
trinhtam@gmail.com16
This is a good start point & and % and $ and . and (). Many other things \ and []
^ $ * + ? { } [] \ | ( )
'''


**Challenge:**

If you don't know regular expressions, how can you extract all the email addresses?

In [50]:
for i in text.split():
    if "@" in i:
        print(i)

davemartin@bogusemail.com
charlesharris@bogusemail.com
laurawilliams@bogusemail.com
trinhtam@gmail.com16


###### 1. Search for certain patterns

- Search for periods (.) only

In [51]:
pattern=re.compile(r"\.") # Search for literal periods only

matches=pattern.finditer(text)
count=0
for i in matches:
    print(i,":",i[0])
    count+=1
    if count==3:
        break

<re.Match object; span=(37, 38), match='.'> : .
<re.Match object; span=(82, 83), match='.'> : .
<re.Match object; span=(127, 128), match='.'> : .


In [144]:
# Similarly, search for special characters []

pattern=re.compile(r"\[]")

matches = pattern.finditer(text)

for i in matches:
    print(i.group())

[]
[]


- Search for only numbers

In [6]:
# Search for numbers
pattern=re.compile(r"\d") 

count=0
matches=pattern.finditer(text)
for i in matches:
    print(i)
    count+=1
    if count==4: # stop after the first four matches
        break

<re.Match object; span=(13, 14), match='6'>
<re.Match object; span=(14, 15), match='1'>
<re.Match object; span=(15, 16), match='5'>
<re.Match object; span=(17, 18), match='5'>


- Search for phone numbers

In [57]:
# Search for phone numbers

pattern=re.compile(r"\d\d\d.\d\d\d.\d\d\d\d") 

matches=pattern.finditer(text)
for i in matches:
    print(i[0])

615-555-7164
800-555-5669
560-555-5153
560.555.5153
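
Because an unescaped `.` matches any character, the pattern above also accepts separators other than `-` and `.`. A tighter variant might look like the sketch below; the sample string and the names `loose` and `strict` are illustrative assumptions.

```python
import re

# Illustrative sample: the last line uses letters as separators
sample = "615-555-7164\n560.555.5153\n123x456y7890\n"

# '.' accepts any separator character
loose = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")
# A character set limits the separator to '-' or '.'; {n} replaces repeated \d
strict = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)
print(loose_hits)
print(strict_hits)
```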


- Search for street numbers and names

In [68]:
# Search for street number and names
pattern=re.compile(r"\d\d\d.[a-zA-Z0-9]+.[a-zA-Z]+\.") # an unescaped '.' matches any character (letters, digits, spaces, tabs) except a newline
matches=pattern.finditer(text)

for i in matches:
    print(i[0])

173 Main St.
969 High St.
806 1st St.


In [69]:
# Search for street numbers and street names
pattern =re.compile(r"[0-9]+\s[a-zA-Z0-9]+\s[a-zA-Z]+\.")

matches=pattern.finditer(text)

for item in matches:
    print(item[0])

173 Main St.
969 High St.
806 1st St.


In [70]:
sentence="This is a good time to travel."

pattern=re.compile(r'^This') # '^' anchors the match to the start of the string

match=pattern.finditer(sentence)

for i in match:
    print(i)

<re.Match object; span=(0, 4), match='This'>


In [71]:
sentence="This is a good time to travel."

pattern=re.compile(r'travel\.$') # '$' anchors the match to the end of the string; '\.' is a literal period

match=pattern.finditer(sentence)

for i in match:
    print(i)

<re.Match object; span=(23, 30), match='travel.'>
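
By default `^` and `$` refer to the start and end of the whole string. The sketch below shows how the standard `re.MULTILINE` flag changes that for multi-line text; the two-line sample is made up for illustration.

```python
import re

lines = "This is line one.\nThis is line two."

# Without a flag, '^' matches only at the very start of the string
start_default = re.findall(r"^This", lines)
# With re.MULTILINE, '^' also matches right after each newline
start_multi = re.findall(r"^This", lines, re.MULTILINE)
print(start_default, start_multi)

# The same applies to '$' at the end of each line
end_default = re.findall(r"\.$", lines)
end_multi = re.findall(r"\.$", lines, re.MULTILINE)
print(end_default, end_multi)
```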


- Search for everything except letters and digits

In [72]:
# Search for every character that is not a letter or a digit

pattern=re.compile(r"[^a-zA-Z0-9]")

matches=pattern.finditer(text)
count=0

for item in matches:
    print(item)
    count+=1
    if count==5:
        break

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(12, 13), match='\n'>
<re.Match object; span=(16, 17), match='-'>
<re.Match object; span=(20, 21), match='-'>


- Search for emails

In [79]:
# Search for email
pattern=re.compile(r"[a-zA-Z0-9._]+@[a-zA-Z.-]+\.[a-zA-Z]+")

matches=pattern.finditer(text)

for item in matches:
    print(item[0])


davemartin@bogusemail.com
charlesharris@bogusemail.com
laurawilliams@bogusemail.com
trinhtam@gmail.com


- Search for runs of exactly four digits

In [83]:
pattern=re.compile(r"\d{4}")

matches = pattern.finditer(text)

for item in matches:
    print(item[0])

7164
5592
5669
3407
5153
8684
5153
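
Note that `\d{4}` also matches the first four digits of a longer number (for example `5592` from `55924`). If only standalone numbers of a given length are wanted, word boundaries (`\b`) can help. This is a sketch on a shortened sample string chosen for illustration.

```python
import re

sample = "173 Main St., Springfield RI 55924\n615-555-7164"

# Any run of four digits, even inside a longer number
any4 = re.findall(r"\d{4}", sample)
# Exactly four digits with a word boundary on each side
exact4 = re.findall(r"\b\d{4}\b", sample)
# Exactly five digits, e.g. the postal codes in the sample
zips = re.findall(r"\b\d{5}\b", sample)

print(any4)
print(exact4)
print(zips)
```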


In [88]:
# Search for a set of characters

number="""
8800-456-899
900.567.900
800*445*490
123*456-890
"""

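# First attempt (kept for reference): '.' accepts any separator and any leading digits
# pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d')

# Character sets restrict the first digit to 8 or 9 and the separators to -, . or *
pattern = re.compile(r'[8-9]00[-.*][0-9]+[-*.][0-9]+')
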

match=pattern.finditer(number)

for i in match:
    print(i[0])
    


800-456-899
900.567.900
800*445*490


- Find Mr. or Ms or Mrs. 

In [19]:
name='''
Mr. Ha Van
Ms Van Nguyen
Mrs. Kha Thuy
Mr. T
'''

In [89]:
# Find all the names above using a pattern.
# A plain space is used instead of \s so that a match cannot run across line breaks.
pattern = re.compile(r"M\w+\.?[a-zA-Z .]+")

matches = pattern.finditer(name)

for match in matches:
    print(match[0])

Mr. Ha Van
Ms Van Nguyen
Mrs. Kha Thuy
Mr. T



In [97]:
# Use a group
name='''
Mr. Ha Van
Ms Van Nguyen
Mrs. Kha Thuy
Mr. T
'''

pattern = re.compile(r"(Mr|Ms)s?\.?.+")

matches = pattern.finditer(name)

for match in matches:
    print(match[0])

Mr. Ha Van
Ms Van Nguyen
Mrs. Kha Thuy
Mr. T
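
Groups do more than alternation: each pair of parentheses also captures the text it matched. The sketch below separates the title from the name; the two-group pattern is an illustrative variation, not the pattern used above.

```python
import re

name = """
Mr. Ha Van
Ms Van Nguyen
Mrs. Kha Thuy
Mr. T
"""

# Group 1 captures the title, group 2 captures the rest of the line.
# 'Mrs' is listed first so it is tried before the shorter 'Mr'.
pattern = re.compile(r"(Mrs|Mr|Ms)\.?\s+(.+)")

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(name)]
for title, person in pairs:
    print(title, "|", person)
```

`m.group(0)` (or `m[0]`) is still the whole match, while `m.group(1)` and `m.group(2)` return the captured parts.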


**Other Handy Functions**

`re.findall()`, `re.search(pattern, string, flags)`, `re.match(pattern, string, flags)`, `re.split(pattern, string)`, `re.sub(pattern, repl, string)`

In [103]:
text ="""A regular expression is a powerful tool for matching text, 
based on a pre-defined pattern. It can detect the presence or absence of a text 
by matching with a particular pattern, 
and also can split a pattern into one or more sub-patterns.
regular express.
"""

**`re.search()` and `re.match()`**

In [104]:
search = re.search(r'regular express', text) # Search anywhere in string
# Print what is inside match
print(search)
# Print out the matching string
print(search.group())
# Get starting index
print('Start Index:', search.start())
# Get last index
print('End Index:', search.end())

<re.Match object; span=(2, 17), match='regular express'>
regular express
Start Index: 2
End Index: 17


In [109]:
# re.search returns only the first match

search=re.search(r"a","bddaabcada")

search

<re.Match object; span=(3, 4), match='a'>

In [24]:
# re.search returns only the first match; 'a+' takes as many consecutive 'a' characters as it can

search=re.search(r"a+","aabcada")

search

<re.Match object; span=(0, 2), match='aa'>

In [111]:
# re.match only matches at the beginning of the string
match=re.match(r"A", text,re.IGNORECASE) # re.IGNORECASE makes the match case-insensitive

match
# Check whether a match exists

bool(match)

True
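
The difference between `re.match` and `re.search` is easiest to see when the pattern does not occur at the start of the string. A minimal sketch on a shortened sample string follows; `re.fullmatch` is included for comparison.

```python
import re

s = "A regular expression"

# re.match only tries position 0, so this returns None
at_start = re.match(r"regular", s)
# re.search scans the whole string and finds the first occurrence
anywhere = re.search(r"regular", s)
# re.fullmatch succeeds only if the entire string matches the pattern
whole = re.fullmatch(r"A regular expression", s)

print(at_start)
print(anywhere.span())
print(whole is not None)
```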

**`re.findall()`**

In [116]:
info="Tuyen is 30 years old, Tam is 31 years old and Ha is 45 years old."
# Get name
name=re.findall(r'[A-Z][a-z]*',info)

print(name)

# Get age
age=re.findall(r"\d{1,3}",info)

age

['Tuyen', 'Tam', 'Ha']


['30', '31', '45']

In [27]:
# Find all instances of a certain string

re.findall("aa","aa,aahahaha")

['aa', 'aa']

In [118]:
# Split

re.split(r"\s", "This is a good package") # Split texts between spaces

['This', 'is', 'a', 'good', 'package']
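
`re.split` becomes more useful than `str.split` when there are several possible delimiters. The sketch below uses a made-up string and the standard `maxsplit` argument.

```python
import re

# Split on any run of commas, semicolons or whitespace
parts = re.split(r"[,;\s]+", "red, green;blue  yellow")
print(parts)

# maxsplit limits the number of splits performed
head_tail = re.split(r"\s", "This is a good package", maxsplit=1)
print(head_tail)
```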

In [121]:
name='''
Mr. Ha Van
Ms Van Nguyen
Mrs. Kha Thuy
Mr. T
'''

pattern=re.compile(r'(Mrs|Mr|Ms)\.?')
match=pattern.finditer(name)
for i in match:
    print(i[0])

Mr.
Ms
Mrs.
Mr.


In [118]:
s="""
havantuyen@gmail.com
havan@gmail.net
khanan@gmail.eu

"""

pattern=re.compile(r'[a-zA-Z]+@[a-z]+\.(com|net|eu)')

match=pattern.finditer(s)


for i in match:
    print(i)

<re.Match object; span=(1, 21), match='havantuyen@gmail.com'>
<re.Match object; span=(22, 37), match='havan@gmail.net'>
<re.Match object; span=(38, 53), match='khanan@gmail.eu'>
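
One thing to watch for: when a pattern contains a capturing group, `re.findall` returns the captured group rather than the whole match. The sketch below shows this on a one-line version of the addresses, and uses a non-capturing group `(?:...)` to get the full matches.

```python
import re

s = "havantuyen@gmail.com havan@gmail.net khanan@gmail.eu"

# With a capturing group, findall returns only the group contents
only_suffix = re.findall(r"[a-zA-Z]+@[a-z]+\.(com|net|eu)", s)
# With a non-capturing group, findall returns the whole match
full = re.findall(r"[a-zA-Z]+@[a-z]+\.(?:com|net|eu)", s)

print(only_suffix)
print(full)
```

`finditer` is not affected, because `match[0]` always refers to the whole match.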


In [124]:
text="havantuyenkhang"

re.sub("^ha","Okay",text)

'Okayvantuyenkhang'
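
`re.sub` can also reuse captured groups in the replacement through backreferences such as `\1`. The following sketch masks phone numbers; the masking format and sample strings are assumptions for illustration.

```python
import re

phones = "615-555-7164 and 560.555.5153"

# Group 1 captures the separator, group 2 the last four digits.
# \1 inside the pattern requires the second separator to equal the first.
masked = re.sub(r"\d{3}([-.])\d{3}\1(\d{4})", r"XXX\1XXX\1\2", phones)
print(masked)

# count limits how many replacements are made
limited = re.sub("a", "-", "banana", count=2)
print(limited)
```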

**Challenge 1**

Write a Python program that matches a word containing 'z', not at the start or end of the word.

**Challenge 2**

Write a Python program to remove leading zeros from an IP address.

sample: ip = "216.08.094.196"

result: ip = "216.8.94.196"

# Exercises 

1. Write a Python program to check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9).

2. Write a Python program that matches a string that has an 'a' followed by zero or one 'b'.

3. Write a Python program to find sequences of lowercase letters joined with an underscore.

4. Write a Python program that matches a string that has an 'a' followed by anything, ending in 'b'.

5. Write a Python program to check whether a string starts with a specific number.

6. Write a Python program to convert a date of yyyy-mm-dd format to dd-mm-yyyy format.

7. Write a program to match the following patterns from `text` below.

a. All USD amounts, such as 45$

b. All city, state and postal code combinations, such as Atlantis VA 34075

c. Everything after the @ in each email address, such as bogusemail.com

d. All first names, such as Dave

e. All special characters, such as *, &, etc.


In [None]:
text=r'''
Dave Martin
615-555-7164
Amount: 45$
173 Main St., Springfield RI 55924
davemartin@bogusemail.com

Charles Harris
800-555-5669
969 High St., Atlantis VA 34075
charlesharris@bogusemail.com
Amount: 23334$
Eric Williams
560-555-5153
806 1st St., Faketown AK 86847
laurawilliams@bogusemail.com
Mr. Ha Van Tuyen
Ms. Nguyen Thi Ha
560.555.5153
Amount: 456$
Mrs. Le Thi Linh
trinhtam@gmail.com16
This is a good start point & and % and $ and . and (). Many other things \ and []
^ $ * + ? { } [] \ | ( )
'''