# Examples of using regular expressions in Python

## Table of contents

- [Match character set](#match_character_set)
- [Match the phone numbers](#phone_numbers)
- [Match the last names (using group)](#last_names)
- [Advanced uses of group](#advance_group)
- [Other methods in `re` module](#other_methods)
- [Flags](#flags)

In [1]:
import re

In [2]:
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

<a name='match_character_set'></a>
## Match character set

Anything between `[` and `]` is a character set: it lists the individual characters, or ranges of characters, that are allowed at that position. A set matches exactly one character unless it is followed by a quantifier.

In [46]:
# Matches any single character that is ?, +, x, y, z, N, O, P, or 7
pattern = re.compile(r'[?+x-zN-P7]')
matches = pattern.finditer(text_to_search)

In [47]:
for match in matches:
    print(match)

<re.Match object; span=(24, 25), match='x'>
<re.Match object; span=(25, 26), match='y'>
<re.Match object; span=(26, 27), match='z'>
<re.Match object; span=(41, 42), match='N'>
<re.Match object; span=(42, 43), match='O'>
<re.Match object; span=(43, 44), match='P'>
<re.Match object; span=(61, 62), match='7'>
<re.Match object; span=(90, 91), match='N'>
<re.Match object; span=(119, 120), match='+'>
<re.Match object; span=(121, 122), match='?'>
<re.Match object; span=(143, 144), match='y'>
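
As a small supplementary sketch (the `sample` string and the variable names are made up for illustration, not part of the original notebook), a character set can also be negated with a leading `^`, and most metacharacters such as `+` and `?` lose their special meaning inside the brackets:

```python
import re

# Hypothetical sample text, chosen only for this illustration
sample = "abc XYZ 123 +?"

# A leading ^ inside the brackets negates the set: match any single
# character that is NOT a lowercase letter, a digit, or a space.
negated = re.compile(r'[^a-z0-9 ]')
negated_hits = negated.findall(sample)
print(negated_hits)  # ['X', 'Y', 'Z', '+', '?']

# Inside a set, + and ? are literal characters and need no escaping.
literal = re.compile(r'[+?]')
literal_hits = literal.findall(sample)
print(literal_hits)  # ['+', '?']
```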


<a name='phone_numbers'></a>
## Match the phone numbers

The pattern uses `\d` to match a digit (0-9) and the wildcard `.` to match any character except a newline. A quantifier in `{}` specifies how many times the preceding element must repeat.

In [5]:
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
matches = pattern.finditer(text_to_search)

In [6]:
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(177, 189), match='123*555*1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>
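
Because `.` accepts any separator, the pattern above also matches `123*555*1234`. One possible way to tighten it, sketched here on a shortened copy of the numbers (the `numbers` string and `strict` name are illustrative assumptions), is to replace the wildcard with a character set:

```python
import re

# A shortened copy of the phone numbers, used only for this illustration
numbers = '''
321-555-4321
123.555.1234
123*555*1234
800-555-1234
'''

# [-.] accepts only a hyphen or a dot as the separator, so the
# asterisk-separated line is skipped.
strict = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')
strict_matches = [m.group(0) for m in strict.finditer(numbers)]
print(strict_matches)
# ['321-555-4321', '123.555.1234', '800-555-1234']
```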


<a name='last_names'></a>
## Match the last names (using group)

The pattern uses a group in `()` to list the alternative character sequences allowed after `M`, separated by `|` (the "or" operator). `\.?` matches a literal `.` zero or one time, and after the space `\w+` matches one or more word characters.

In [11]:
pattern = re.compile(r'M(r|s|rs)\.? \w+')
matches = pattern.finditer(text_to_search)

In [12]:
for match in matches:
    print(match)

<re.Match object; span=(216, 227), match='Mr. Schafer'>
<re.Match object; span=(228, 236), match='Mr Smith'>
<re.Match object; span=(237, 245), match='Ms Davis'>
<re.Match object; span=(246, 259), match='Mrs. Robinson'>
<re.Match object; span=(260, 265), match='Mr. T'>
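
The matches above contain the title and the name together. If only the last name is wanted, one option is to wrap it in its own group and read it back with `match.group(...)`. The following is a sketch under that assumption; the `names` string repeats the relevant lines, and `title_and_name` is an illustrative name:

```python
import re

# The name lines from the text above, repeated so the example is self-contained
names = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

# Group 1 captures the title, group 2 captures the last name.
title_and_name = re.compile(r'(Mr|Ms|Mrs)\.? (\w+)')
pairs = [(m.group(1), m.group(2)) for m in title_and_name.finditer(names)]
print(pairs)
# [('Mr', 'Schafer'), ('Mr', 'Smith'), ('Ms', 'Davis'), ('Mrs', 'Robinson'), ('Mr', 'T')]
```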


<a name='advance_group'></a>
## Advanced uses of group

In [68]:
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

In [79]:
# The 's' after http is optional, and the group 'www.' is optional as a whole
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

In [80]:
matches = pattern.finditer(urls)
for match in matches:
    print(match)

<re.Match object; span=(1, 23), match='https://www.google.com'>
<re.Match object; span=(24, 42), match='http://coreyms.com'>
<re.Match object; span=(43, 62), match='https://youtube.com'>
<re.Match object; span=(63, 83), match='https://www.nasa.gov'>


In [82]:
# Print group 0 of each match, which is the entire matched text
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(0))

https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov


In [83]:
# Print group 1, which is None when the optional 'www.' is absent
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(1))

www.
None
None
www.


In [84]:
# Print group 2, the domain name
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(2))

google
coreyms
youtube
nasa


In [88]:
# Groups can be referenced in pattern.sub as \1, \2, ... to build the replacement from the captured text
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov
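
Numbered groups become hard to read as a pattern grows. As an optional variation, groups can be given names with `(?P<name>...)` and referenced in the replacement with `\g<name>`. In this sketch the group names `domain` and `tld` and the shortened `urls` string are assumptions made for illustration:

```python
import re

# Two of the URLs from above, repeated so the example is self-contained
urls = '''
https://www.google.com
http://coreyms.com
'''

# Same pattern as before, but groups 2 and 3 are given names.
named = re.compile(r'https?://(www\.)?(?P<domain>\w+)(?P<tld>\.\w+)')

parts = [(m.group('domain'), m.group('tld')) for m in named.finditer(urls)]
print(parts)  # [('google', '.com'), ('coreyms', '.com')]

# \g<name> refers to a named group inside the replacement string.
subbed = named.sub(r'\g<domain>\g<tld>', urls)
print(subbed)
```

Here `subbed` contains `google.com` and `coreyms.com` on separate lines, the same result that `\2\3` produces.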



<a name='other_methods'></a>
## Other methods in `re` module

### search

In [108]:
# Like finditer, but returns only the first match (or None if nothing matches)
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
match = pattern.search(text_to_search)
match

<re.Match object; span=(151, 163), match='321-555-4321'>

### findall

In [103]:
# Returns all matches as a list of strings
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
matches = pattern.findall(text_to_search)
matches

['321-555-4321',
 '123.555.1234',
 '123*555*1234',
 '800-555-1234',
 '900-555-1234']

In [105]:
# If the pattern contains a capturing group, findall returns only the group's text, not the whole match
pattern = re.compile(r'M(r|s|rs)\.? \w+')
matches = pattern.findall(text_to_search)
matches

['r', 'r', 's', 'rs', 'r']
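
The following sketch illustrates how the number of capturing groups changes what `findall` returns. The `names` string is a shortened, made-up sample; `(?:...)` is the standard non-capturing group syntax:

```python
import re

# Shortened sample, used only for this illustration
names = 'Mr. Schafer, Ms Davis, Mrs. Robinson'

# One capturing group: findall returns only that group's text.
one_group = re.findall(r'M(r|s|rs)\.? \w+', names)
print(one_group)  # ['r', 's', 'rs']

# Non-capturing group (?:...): findall returns the whole match.
no_group = re.findall(r'M(?:r|s|rs)\.? \w+', names)
print(no_group)  # ['Mr. Schafer', 'Ms Davis', 'Mrs. Robinson']

# Two capturing groups: findall returns a list of tuples.
two_groups = re.findall(r'M(r|s|rs)\.? (\w+)', names)
print(two_groups)  # [('r', 'Schafer'), ('s', 'Davis'), ('rs', 'Robinson')]
```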

### match

In [109]:
# Returns None because match only matches at the beginning of the string
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
match = pattern.match(text_to_search)
match

In [113]:
pattern = re.compile(r'\na')
match = pattern.match(text_to_search)
match

<re.Match object; span=(0, 2), match='\na'>

Since `^` in a pattern already anchors it to the beginning of the string, `search` with a `^`-prefixed pattern does the same job, so `match` is rarely essential.
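
To make the difference concrete, here is a small comparison of `match`, `search`, and the related `fullmatch` method on a made-up string (the `text` value and variable names are illustrative assumptions):

```python
import re

# Hypothetical text: the phone number is on the second line
text = 'first line\n123-555-1234\n'
phone = re.compile(r'\d{3}-\d{3}-\d{4}')

# match only tries position 0, where the text starts with 'f'.
at_start = phone.match(text)
print(at_start)  # None

# search scans forward and finds the number on the second line.
anywhere = phone.search(text)
print(anywhere.group(0))  # 123-555-1234

# fullmatch succeeds only if the whole string fits the pattern.
exact = phone.fullmatch('123-555-1234')
too_long = phone.fullmatch('123-555-1234 ext 5')
print(exact is not None, too_long)  # True None
```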

<a name='flags'></a>
## Flags

In [116]:
# re.IGNORECASE (or re.I) makes the match case-insensitive
pattern = re.compile(r'abcd', re.I)
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(1, 5), match='abcd'>
<re.Match object; span=(28, 32), match='ABCD'>
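
Flags can be combined with `|`. As a sketch on a made-up string (the `text` value and the `combined` name are assumptions for illustration), `re.MULTILINE` makes `^` match at the start of every line, and it can be used together with `re.IGNORECASE`:

```python
import re

# Hypothetical three-line text; only the first two lines START with abcd
text = 'abcd\nABCD\nxABCD\n'

# Without re.MULTILINE, ^ matches only at the very start of the string.
single = re.compile(r'^abcd', re.IGNORECASE)
single_hits = single.findall(text)
print(single_hits)  # ['abcd']

# With both flags, ^ matches at each line start, ignoring case.
combined = re.compile(r'^abcd', re.IGNORECASE | re.MULTILINE)
combined_hits = combined.findall(text)
print(combined_hits)  # ['abcd', 'ABCD']
```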
