# Regular Expressions in Python

## Introduction

### Reading Files

Regular expressions (regex) match patterns against text.

Problem: parse a text file of contacts and present the entries in a readable form.

Open the file with UTF-8 encoding:

In [1]:
open("names.txt", encoding="utf-8")

<_io.TextIOWrapper name='names.txt' mode='r' encoding='utf-8'>

Open the file and read the data:

In [2]:
names_file = open("names.txt", encoding="utf-8")
data = names_file.read()

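Prefer a with block, which closes the file automatically even if an error occurs. Note that .read() still loads the whole file into memory, so for a file of unknown size consider iterating line by line or calling .read(size) in a loop.

In [3]:
with open("names.txt", encoding="utf-8") as open_file:
    data2 = open_file.read()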
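
Reading a file a piece at a time could look like the sketch below. The helper name read_in_chunks and the chunk size are hypothetical, and an in-memory io.StringIO stands in for names.txt so the example runs on its own.

```python
import io

def read_in_chunks(file_obj, chunk_size=16):
    """Yield successive pieces of at most chunk_size characters (hypothetical helper)."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:          # an empty string signals end of file
            break
        yield chunk

# io.StringIO stands in for open("names.txt", encoding="utf-8") here.
with io.StringIO("Love, Kenneth\nMcFarland, Dave\n") as fake_file:
    chunks = list(read_in_chunks(fake_file))

print(chunks)
```

Each chunk holds at most 16 characters, and joining the chunks gives back the original text.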

Close the file:

In [4]:
names_file.close()

In [5]:
print(data)

Love, Kenneth	kenneth@teamtreehouse.com	(555) 555-5555	Teacher, Treehouse	@kennethlove
McFarland, Dave	dave@teamtreehouse.com	(555) 555-5554	Teacher, Treehouse
Arthur, King	king_arthur@camelot.co.uk		King, Camelot
Österberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Carson, Ryan	ryan@teamtreehouse.com	(555) 555-5543	CEO, Treehouse	@ryancarson
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Trump, Donald	president.44@us.gov	555 555-5551	President, United States of America	@potus44
Chalkley, Andrew	andrew@teamtreehouse.com	(555) 555-5553	Teacher, Treehouse	@chalkers
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernández de la Vega Sanz, María Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Govt.


Import regex library:

In [6]:
import re

The re.match function only matches at the beginning of the string. The r prefix marks a raw string literal, so backslashes reach the regex engine unchanged.

In [7]:
print(re.match(r'Love', data))

<re.Match object; span=(0, 4), match='Love'>


In [8]:
print(re.match(r'Kenneth', data))

None


The re.search function scans the whole string and returns the first match:

In [9]:
print(re.search(r'Kenneth', data))

<re.Match object; span=(6, 13), match='Kenneth'>


In [10]:
first_name = r'Tim'
print(re.search(first_name, data))

<re.Match object; span=(293, 296), match='Tim'>
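
As a compact comparison, here is a sketch on a hypothetical one-line sample. It also uses re.fullmatch, which is not covered above and only succeeds when the pattern covers the entire string.

```python
import re

text = "Love, Kenneth"   # hypothetical one-line sample

at_start = re.match(r'Kenneth', text)          # None: 'Kenneth' is not at index 0
anywhere = re.search(r'Kenneth', text)         # found further into the string
whole = re.fullmatch(r'Love, Kenneth', text)   # pattern must cover the whole string

print(at_start)          # None
print(anywhere.span())   # (6, 13)
print(whole.group())     # Love, Kenneth
```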


#### Code Challenge

In [11]:
import re

with open("basics.txt") as file_object:
    data3 = file_object.read()

first = re.match("Four", data3)
liberty = re.search("Liberty", data3)

### Escape Hatches

Escape Characters:
* \w: matches a Unicode word character. That's any letter, uppercase or lowercase, numbers, and the underscore character. In "new-releases-204", \w would match each of the letters in "new" and "releases" and the numbers 2, 0, and 4. It wouldn't match the hyphens.
* \W: is the opposite to \w and matches anything that isn't a Unicode word character. In "new-releases-204", \W would only match the hyphens.
* \s: matches whitespace, so spaces, tabs, newlines, etc.
* \S: matches everything that isn't whitespace.
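* \d: matches any single decimal digit, such as 0 to 9.
* \D: matches anything that isn't a digit.
* \b: matches a word boundary: the zero-width position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string.
* \B: matches any position that isn't a word boundary.
* \A: matches the beginning of the string. ^ does the same, except that with re.MULTILINE it also matches at the start of each line.
* \Z: matches the end of the string. $ also matches just before a trailing newline and, with re.MULTILINE, at the end of each line.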
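
The escapes can be tried on the "new-releases-204" string from the list. This is a small sketch; the variable names are only illustrative.

```python
import re

slug = "new-releases-204"   # the example string from the list

word_chars = re.findall(r'\w', slug)         # letters and digits, no hyphens
non_word = re.findall(r'\W', slug)           # only the two hyphens
digits = re.findall(r'\d', slug)             # the three digits
whole_words = re.findall(r'\b\w+\b', slug)   # hyphens separate the words

print(non_word)      # ['-', '-']
print(digits)        # ['2', '0', '4']
print(whole_words)   # ['new', 'releases', '204']
```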

In [12]:
print(re.match(r'\w, \w', data))

None


In [13]:
print(re.search(r'\d\d\d-\d\d\d\d', data))

<re.Match object; span=(46, 54), match='555-5555'>


The parentheses characters are special characters and must be escaped:

In [14]:
print(re.search(r'\(\d\d\d\) \d\d\d-\d\d\d\d', data))

<re.Match object; span=(40, 54), match='(555) 555-5555'>


#### Code Challenge

In [15]:
import re

def first_number(str1):
    return re.search(r'\d', str1)

def numbers(count, str2):
    return re.search(r'\d'*count, str2)

### Counts

Matching:
* \w{3}: matches exactly three word characters in a row.
* \w{,3}: matches 0, 1, 2, or 3 word characters in a row.
* \w{3,}: matches 3 or more word characters in a row. There's no upper limit.
* \w{3,5}: matches 3, 4, or 5 word characters in a row. Don't put a space after the comma: Python's re module reads {3, 5} as literal text rather than a count.
* \w?: matches 0 or 1 word characters.
* \w*: matches 0 or more word characters. There is no upper limit.
* \w+: matches 1 or more word characters. Like *, it has no upper limit, but it has to occur at least once.
* re.findall(pattern, text, flags): returns a list of all non-overlapping matches of the pattern in the text. If the pattern contains groups, the list holds the groups instead of the whole matches.
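
A short sketch of the counts in action, using hypothetical sample strings. The \b anchors are added so that each match has to be a whole word.

```python
import re

text = "a bb ccc dddd eeeee"   # hypothetical sample: words of 1 to 5 letters

exactly_three = re.findall(r'\b\w{3}\b', text)    # ['ccc']
three_or_more = re.findall(r'\b\w{3,}\b', text)   # ['ccc', 'dddd', 'eeeee']
three_to_four = re.findall(r'\b\w{3,4}\b', text)  # ['ccc', 'dddd']
at_most_two = re.findall(r'\b\w{1,2}\b', text)    # ['a', 'bb']

# ? makes the preceding item optional; + requires at least one occurrence.
spellings = re.findall(r'colou?r', "color colour")  # ['color', 'colour']
runs_of_o = re.findall(r'o+', "foo boo o")          # ['oo', 'oo', 'o']
```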

In [16]:
print(re.match(r'\w+, \w+', data))

<re.Match object; span=(0, 13), match='Love, Kenneth'>


In [17]:
print(re.search(r'\(\d{3}\) \d{3}-\d{4}', data))

<re.Match object; span=(40, 54), match='(555) 555-5555'>


In [18]:
print(re.findall(r'\(?\d{3}\)? \d{3}-\d{4}', data))

['(555) 555-5555', '(555) 555-5554', '(555) 555-5543', '555 555-5551', '(555) 555-5553', '(555) 555-4444']


In [19]:
print(re.findall(r'\(?\d{3}\)?-?\s?\d{3}-\d{4}', data))

['(555) 555-5555', '(555) 555-5554', '(555) 555-5543', '555-555-5552', '555 555-5551', '(555) 555-5553', '(555) 555-4444']


In [20]:
print(re.findall(r'\w+, \w+', data))

['Love, Kenneth', 'Teacher, Treehouse', 'McFarland, Dave', 'Teacher, Treehouse', 'Arthur, King', 'King, Camelot', 'Österberg, Sven', 'Governor, Norrbotten', 'Enchanter, Killer', 'Carson, Ryan', 'CEO, Treehouse', 'Doctor, The', 'Lord, Gallifrey', 'Exampleson, Example', 'Example, Example', 'Trump, Donald', 'President, United', 'Chalkley, Andrew', 'Teacher, Treehouse', 'Vader, Darth', 'Lord, Galactic', 'Sanz, María', 'Minister, Spanish']


In [21]:
print(re.findall(r'\w*, \w+', data))

['Love, Kenneth', 'Teacher, Treehouse', 'McFarland, Dave', 'Teacher, Treehouse', 'Arthur, King', 'King, Camelot', 'Österberg, Sven', 'Governor, Norrbotten', ', Tim', 'Enchanter, Killer', 'Carson, Ryan', 'CEO, Treehouse', 'Doctor, The', 'Lord, Gallifrey', 'Exampleson, Example', 'Example, Example', 'Trump, Donald', 'President, United', 'Chalkley, Andrew', 'Teacher, Treehouse', 'Vader, Darth', 'Lord, Galactic', 'Sanz, María', 'Minister, Spanish']


#### Code Challenge

In [22]:
import re

def phone_numbers(str1):
    return re.findall(r'\d{3}-\d{3}-\d{4}', str1)

In [23]:
import re

def find_words(count, str1):
    exp = r'\w{' + str(count) + ',}'
    return re.findall(exp, str1)

In [24]:
find_words(4, "dog, cat, baby, balloon, me")

['baby', 'balloon']

In [25]:
find_words(6, '123456, Treehouse, student, learn, Kenneth, Python, regex, match, Ryan, g0tcha')

['123456', 'Treehouse', 'student', 'Kenneth', 'Python', 'g0tcha']

## Sets

Sets let us combine explicit characters and escape patterns into pieces that can be repeated multiple times. They also let us specify pieces that should be left out of any matches.

Use a set when you know the exact characters you want to match, or need to make sure a certain character isn't there.

Matching:
* [abc]: a set of the characters 'a', 'b', and 'c'. It matches exactly one character that is any of those; add a count, as in [abc]+, to match several in a row in any order.
* [a-z], [A-Z], or [a-zA-Z]: match any one letter in the English alphabet in lowercase, uppercase, or both upper and lowercases.
* [0-9]: match any one digit from 0 to 9. You can change the ends to restrict the set, as in [0-4].
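
A set on its own consumes one character per match; a count after it repeats the set. The sketch below uses a hypothetical sample string to show the difference.

```python
import re

text = "cab 2024 Abba"   # hypothetical sample

single = re.findall(r'[abc]', text)        # one character per match, lowercase only
runs = re.findall(r'[abc]+', text)         # the count repeats the whole set
low_digits = re.findall(r'[0-4]+', text)   # a restricted digit range
letters = re.findall(r'[a-zA-Z]+', text)   # both cases

print(single)      # ['c', 'a', 'b', 'b', 'b', 'a']
print(runs)        # ['cab', 'bba']
print(low_digits)  # ['2024']
print(letters)     # ['cab', 'Abba']
```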

In [26]:
print(re.findall(r'[-\w\d+.]+@[-\w\d.]+', data))

['kenneth@teamtreehouse.com', 'dave@teamtreehouse.com', 'king_arthur@camelot.co.uk', 'governor@norrbotten.co.se', 'tim@killerrabbit.com', 'ryan@teamtreehouse.com', 'doctor+companion@tardis.co.uk', 'me@example.com', 'president.44@us.gov', 'andrew@teamtreehouse.com', 'darth-vader@empire.gov', 'mtfvs@spain.gov']


re.I is shorthand for re.IGNORECASE, which makes matching case-insensitive:

In [27]:
print(re.findall(r'\b[trehous]+\b', data, re.IGNORECASE))

['Treehouse', 'Treehouse', 'se', 'Treehouse', 'The', 'us', 'Treehouse']


#### Code Challenge

In [28]:
import re

def find_emails(str1):
    return re.findall(r'[\w+.]+@[\w.]+', str1)

In [29]:
find_emails("kenneth.love@teamtreehouse.com, @support, ryan@teamtreehouse.com, test+case@example.co.uk")
# Expected: ['kenneth.love@teamtreehouse.com', 'ryan@teamtreehouse.com', 'test+case@example.co.uk']

['kenneth.love@teamtreehouse.com',
 'ryan@teamtreehouse.com',
 'test+case@example.co.uk']

In [30]:
find_emails('kenneth@teamtreehouse.com, andrew+gotcha@teamtreehouse.com, exa.mple@example.co.uk')

['kenneth@teamtreehouse.com',
 'andrew+gotcha@teamtreehouse.com',
 'exa.mple@example.co.uk']

### Negation

Negated sets let us specify characters and sequences that should be left out of any matches.
* [^abc]: a negated set that matches any single character except 'a', 'b', and 'c'.
* re.IGNORECASE or re.I: flag to make a search case-insensitive. re.match('A', 'apple', re.I) would find the 'a' in 'apple'.
* re.VERBOSE or re.X: flag that allows regular expressions to span multiple lines and contain (ignored) whitespace and comments.
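
Before applying these to the contact data, here is a minimal sketch on hypothetical strings that combines a negated set with both flags.

```python
import re

# A negated set matches any character that is NOT listed.
without_567 = re.findall(r'[^567]+', "1234567890")

# re.I makes 'A' match the lowercase 'a' at the start of 'apple'.
first_letter = re.match('A', 'apple', re.I).group()

# re.VERBOSE lets the pattern span lines and carry comments.
ing_words = re.compile(r'''
    [^aeiou\s]+   # one or more characters that are neither vowels nor whitespace
    ing           # followed by the letters "ing"
''', re.VERBOSE | re.IGNORECASE).findall("String RING sing")

print(without_567)   # ['1234', '890']
print(first_letter)  # a
print(ing_words)     # ['String', 'RING', 'sing']
```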

In [31]:
# Find a word boundary, an @, and any number of hyphens, word characters, or periods
# Then one or more characters that are not 'g', 'o', 'v', or a tab
# End at another word boundary

print(re.findall(r'''
    \b@[-\w\d.]*
    [^gov\t]+
    \b
    ''', data, re.VERBOSE | re.I))

['@teamtreehouse.com', '@teamtreehouse.com', '@camelot.co.uk', '@norrbotten.co.se', '@killerrabbit.com', '@teamtreehouse.com', '@tardis.co.uk', '@example.com', '@us.', '@teamtreehouse.com', '@empire.', '@spain.']


re.VERBOSE (re.X) lets the pattern span multiple lines:

In [32]:
# Find a word boundary, 1+ hyphens or word characters, and a comma
# 1 whitespace character
# 1+ hyphens, word characters, or explicit spaces
# 1 character that is not a tab or newline

print(re.findall(r"""
    \b[-\w]+,
    \s
    [-\w ]+
    [^\t\n]
    """, data, re.X))

['Love, Kenneth', 'Teacher, Treehouse', 'McFarland, Dave', 'Teacher, Treehouse', 'Arthur, King', 'King, Camelot', 'Österberg, Sven-Erik', 'Governor, Norrbotten', 'Enchanter, Killer Rabbit Cave', 'Carson, Ryan', 'CEO, Treehouse', 'Doctor, The', 'Lord, Gallifrey', 'Exampleson, Example', 'Example, Example Co.', 'Trump, Donald', 'President, United States of America', 'Chalkley, Andrew', 'Teacher, Treehouse', 'Vader, Darth', 'Lord, Galactic Empire', 'Sanz, María Teresa', 'Minister, Spanish Govt.']


#### Code Challenge

In [33]:
# No need to re-import re

string = '1234567890'

good_numbers = re.findall(r'[^567]', string)

print(good_numbers)

['1', '2', '3', '4', '8', '9', '0']


### Groups

Regular expressions give us indexed and named groups to help organize things:
* ([abc]): creates a group that contains a set for the letters 'a', 'b', and 'c'. This could be later accessed from the Match object as .group(1)
* (?P<name>[abc]): creates a named group that contains a set for the letters 'a', 'b', and 'c'. This could later be accessed from the Match object as .group('name').
* .groups(): method to show all of the groups on a Match object.
* re.MULTILINE or re.M: flag that makes ^ and $ match at the beginning and end of each line, not only of the whole string.
* ^: specifies, in a pattern, the beginning of the string (or of each line with re.M).
* $: specifies, in a pattern, the end of the string (or of each line with re.M).
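
A small sketch of indexed groups, named groups, and the effect of re.M, using a hypothetical two-line sample rather than the full contact file.

```python
import re

text = "Love, Kenneth\nCarson, Ryan"   # hypothetical two-line sample

first = re.search(r'(?P<last>\w+), (?P<first>\w+)', text)
print(first.group(1))        # Love
print(first.group('first'))  # Kenneth
print(first.groups())        # ('Love', 'Kenneth')

# Without re.M, ^ only matches at the very start of the string.
only_start = re.findall(r'^\w+', text)
# With re.M, ^ also matches right after each newline.
each_line = re.findall(r'^\w+', text, re.M)
print(only_start)   # ['Love']
print(each_line)    # ['Love', 'Carson']
```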

Define groups with parentheses.

In [34]:
print(data)

Love, Kenneth	kenneth@teamtreehouse.com	(555) 555-5555	Teacher, Treehouse	@kennethlove
McFarland, Dave	dave@teamtreehouse.com	(555) 555-5554	Teacher, Treehouse
Arthur, King	king_arthur@camelot.co.uk		King, Camelot
Österberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Carson, Ryan	ryan@teamtreehouse.com	(555) 555-5543	CEO, Treehouse	@ryancarson
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Trump, Donald	president.44@us.gov	555 555-5551	President, United States of America	@potus44
Chalkley, Andrew	andrew@teamtreehouse.com	(555) 555-5553	Teacher, Treehouse	@chalkers
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernández de la Vega Sanz, María Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Govt.


In [35]:
# Last and first names
# Email addresses: hyphens, word characters, numbers, periods, plus signs
# Phone numbers
# Job and company
# Twitter

print(re.findall(r'''
    ([-\w ]+,\s[-\w ]+)\t
    ([-\w\d.+]+@[-\w\d.]+)\t
    (\(?\d{3}\)?-?\s?\d{3}-\d{4})\t
    ([\w\s]+,\s[\w\s]+)\t
    (@[\w\d]+)
    ''', data, re.X))

[('Love, Kenneth', 'kenneth@teamtreehouse.com', '(555) 555-5555', 'Teacher, Treehouse', '@kennethlove'), ('Carson, Ryan', 'ryan@teamtreehouse.com', '(555) 555-5543', 'CEO, Treehouse', '@ryancarson'), ('Trump, Donald', 'president.44@us.gov', '555 555-5551', 'President, United States of America', '@potus44'), ('Chalkley, Andrew', 'andrew@teamtreehouse.com', '(555) 555-5553', 'Teacher, Treehouse', '@chalkers'), ('Vader, Darth', 'darth-vader@empire.gov', '(555) 555-4444', 'Sith Lord, Galactic Empire', '@darthvader')]


Each item in the tuple is one of our groups.

In [36]:
# Mark beginning and end of string, make phone number optional, tab after job optional in case no twitter
# Dot as possible character in company name, twitter also optional
# Multiline: treat each line as end of string

print(re.findall(r'''
    ^([-\w ]*,\s[-\w ]+)\t
    ([-\w\d.+]+@[-\w\d.]+)\t
    (\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t
    ([\w\s]+,\s[\w\s.]+)\t?
    (@[\w\d]+)?$
    ''', data, re.X | re.M))

[('Love, Kenneth', 'kenneth@teamtreehouse.com', '(555) 555-5555', 'Teacher, Treehouse\t', '@kennethlove'), ('McFarland, Dave', 'dave@teamtreehouse.com', '(555) 555-5554', 'Teacher, Treehouse', ''), ('Arthur, King', 'king_arthur@camelot.co.uk', '', 'King, Camelot', ''), ('Österberg, Sven-Erik', 'governor@norrbotten.co.se', '', 'Governor, Norrbotten\t', '@sverik'), (', Tim', 'tim@killerrabbit.com', '', 'Enchanter, Killer Rabbit Cave', ''), ('Carson, Ryan', 'ryan@teamtreehouse.com', '(555) 555-5543', 'CEO, Treehouse\t', '@ryancarson'), ('Doctor, The', 'doctor+companion@tardis.co.uk', '', 'Time Lord, Gallifrey', ''), ('Exampleson, Example', 'me@example.com', '555-555-5552', 'Example, Example Co.\t', '@example'), ('Trump, Donald', 'president.44@us.gov', '555 555-5551', 'President, United States of America\t', '@potus44'), ('Chalkley, Andrew', 'andrew@teamtreehouse.com', '(555) 555-5553', 'Teacher, Treehouse\t', '@chalkers'), ('Vader, Darth', 'darth-vader@empire.gov', '(555) 555-4444', 'Sith

Use named groups to get a dictionary instead of tuples:

In [37]:
line = re.search(r'''
    ^(?P<name>[-\w ]*,\s[-\w ]+)\t
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t
    (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t
    (?P<job>[\w\s]+,\s[\w\s.]+)\t?
    (?P<twitter>@[\w\d]+)?$
    ''', data, re.X | re.M)
print(line)

<re.Match object; span=(0, 86), match='Love, Kenneth\tkenneth@teamtreehouse.com\t(555) 5>


In [38]:
print(line.groupdict())

{'name': 'Love, Kenneth', 'email': 'kenneth@teamtreehouse.com', 'phone': '(555) 555-5555', 'job': 'Teacher, Treehouse\t', 'twitter': '@kennethlove'}
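
In the output above, the job value ends with a tab ('Teacher, Treehouse\t') because \s inside the job set also matches tabs. One possible fix, sketched here on a hypothetical two-line sample, is to allow only literal spaces in that set. The redundant \d inside sets that already contain \w is dropped as well.

```python
import re

# Hypothetical two-line sample in the same tab-separated layout as names.txt.
sample = (
    "Love, Kenneth\tkenneth@teamtreehouse.com\t(555) 555-5555\tTeacher, Treehouse\t@kennethlove\n"
    "McFarland, Dave\tdave@teamtreehouse.com\t(555) 555-5554\tTeacher, Treehouse"
)

contact = re.compile(r'''
    ^(?P<name>[-\w ]*,\s[-\w ]+)\t
    (?P<email>[-\w.+]+@[-\w.]+)\t
    (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t
    (?P<job>[\w ]+,\s[\w .]+)\t?     # a literal space in the set keeps the tab out of the group
    (?P<twitter>@\w+)?$
''', re.X | re.M)

jobs = [m.group('job') for m in contact.finditer(sample)]
twitters = [m.group('twitter') for m in contact.finditer(sample)]

print(jobs)       # ['Teacher, Treehouse', 'Teacher, Treehouse']
print(twitters)   # ['@kennethlove', None]
```

An optional group that did not take part in the match comes back as None from .group() and .groupdict(), whereas findall reports it as an empty string.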


#### Code Challenge

In [54]:
string = 'Perotto, Pier Giorgio'

names = re.match(r'([\w]+),\s([\w ]+)', string)

print(names)

<re.Match object; span=(0, 21), match='Perotto, Pier Giorgio'>


#### Code Challenge 2

In [59]:
string = '''Love, Kenneth, kenneth+challenge@teamtreehouse.com, 555-555-5555, @kennethlove
Chalkley, Andrew, andrew@teamtreehouse.co.uk, 555-555-5556, @chalkers
McFarland, Dave, dave.mcfarland@teamtreehouse.com, 555-555-5557, @davemcfarland
Kesten, Joy, joy@teamtreehouse.com, 555-555-5558, @joykesten'''

# contacts = re.search(r'(?P<email>[-\w\d.+]+@[-\w\d.]+)', string)
# contacts = re.search(r'(?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})', string)
contacts = re.search(r'(?P<email>[-\w\d.+]+@[-\w\d.]+), (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})', string)

print(contacts)

<re.Match object; span=(15, 64), match='kenneth+challenge@teamtreehouse.com, 555-555-5555>


In [60]:
print(contacts.groupdict())

{'email': 'kenneth+challenge@teamtreehouse.com', 'phone': '555-555-5555'}


In [63]:
# twitters = re.search(r'(?P<twitter>@[\w\d]+)$', string, re.M)
twitters = re.search(r'@[\w\d]+$', string, re.M)

print(twitters)

<re.Match object; span=(66, 78), match='@kennethlove'>


In [64]:
print(twitters.groupdict())

{}
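
The dictionary is empty because the pattern used here has no named groups. A sketch of the named-group variant on a shortened, hypothetical copy of the challenge string, also collecting every handle with re.finditer:

```python
import re

# Shortened copy of the challenge string (two of the four lines).
text = '''Love, Kenneth, kenneth+challenge@teamtreehouse.com, 555-555-5555, @kennethlove
Chalkley, Andrew, andrew@teamtreehouse.co.uk, 555-555-5556, @chalkers'''

handle = r'(?P<twitter>@\w+)$'

named = re.search(handle, text, re.M)
print(named.groupdict())   # {'twitter': '@kennethlove'}

# finditer visits every line end, not only the first match.
handles = [m.group('twitter') for m in re.finditer(handle, text, re.M)]
print(handles)             # ['@kennethlove', '@chalkers']
```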


### Compiling and Loops

Methods to compile a pattern into an object to match against:
* re.compile(pattern, flags): method to pre-compile and save a regular expression pattern, and any associated flags, for later use.
* .groupdict(): method to generate a dictionary from a Match object's groups. The keys will be the group names. The values will be the results of the patterns in the group.
* re.finditer(): method to generate an iterable from the non-overlapping matches of a regular expression. Very handy for for loops.
* .group(): method to access the content of a group. 0 or none is the entire match. 1 through how ever many groups you have will get that group. Or use a group's name to get it if you're using named groups.
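
The different ways of reading groups from a compiled pattern can be sketched as follows. The pattern and the sample string are hypothetical.

```python
import re

# Hypothetical key=value pattern, compiled once for reuse.
pair = re.compile(r'(?P<key>\w+)=(?P<value>\d+)')

m = pair.search("width=80 height=24")

print(m.group())          # width=80  (no argument: the entire match)
print(m.group(0))         # width=80  (0 is also the entire match)
print(m.group(1))         # width     (first group by index)
print(m.group('value'))   # 80        (group by name)
print(m.groupdict())      # {'key': 'width', 'value': '80'}
```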

Compile a regex pattern to save it for reuse. The text to search is no longer part of the call; it is supplied later, when the compiled pattern is used:

In [69]:
line2 = re.compile(r'''
    ^(?P<name>[-\w ]*,\s[-\w ]+)\t
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t
    (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t
    (?P<job>[\w\s]+,\s[\w\s.]+)\t?
    (?P<twitter>@[\w\d]+)?$
    ''', re.X | re.M)

In [71]:
print(re.search(line2, data).groupdict())

{'name': 'Love, Kenneth', 'email': 'kenneth@teamtreehouse.com', 'phone': '(555) 555-5555', 'job': 'Teacher, Treehouse\t', 'twitter': '@kennethlove'}


Better yet, call .search directly on the compiled pattern:

In [73]:
print(line2.search(data).groupdict())

{'name': 'Love, Kenneth', 'email': 'kenneth@teamtreehouse.com', 'phone': '(555) 555-5555', 'job': 'Teacher, Treehouse\t', 'twitter': '@kennethlove'}


Method **finditer**:
* Returns an iterator over the non-overlapping matches.
* It works like a list in a for loop, but it is lazy and can only be consumed once.
* It's also like **findall**, but instead of strings or tuples it yields Match objects, like the ones re.match and re.search return.

In [75]:
for match in line2.finditer(data):
    print(match.group('name'))

Love, Kenneth
McFarland, Dave
Arthur, King
Österberg, Sven-Erik
, Tim
Carson, Ryan
Doctor, The
Exampleson, Example
Trump, Donald
Chalkley, Andrew
Vader, Darth
Fernández de la Vega Sanz, María Teresa


In [79]:
line3 = re.compile(r'''
    ^(?P<name>(?P<last>[-\w ]*),\s(?P<first>[-\w ]+))\t
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t
    (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t
    (?P<job>[\w\s]+,\s[\w\s.]+)\t?
    (?P<twitter>@[\w\d]+)?$
    ''', re.X | re.M)

In [80]:
for match in line3.finditer(data):
    print('{first} {last} <{email}>'.format(**match.groupdict()))

Kenneth Love <kenneth@teamtreehouse.com>
Dave McFarland <dave@teamtreehouse.com>
King Arthur <king_arthur@camelot.co.uk>
Sven-Erik Österberg <governor@norrbotten.co.se>
Tim  <tim@killerrabbit.com>
Ryan Carson <ryan@teamtreehouse.com>
The Doctor <doctor+companion@tardis.co.uk>
Example Exampleson <me@example.com>
Donald Trump <president.44@us.gov>
Andrew Chalkley <andrew@teamtreehouse.com>
Darth Vader <darth-vader@empire.gov>
María Teresa Fernández de la Vega Sanz <mtfvs@spain.gov>


Build dictionaries from strings with regular expressions.

#### Code Challenge

In [96]:
string3 = '''Love, Kenneth: 20
Chalkley, Andrew: 25
McFarland, Dave: 10
Kesten, Joy: 22
Stewart Pinchback, Pinckney Benton: 18'''

In [99]:
# players = re.search(r'''
#     ([\w ]+),\s([\w ]+):\s([\d]+)
#     ''', string3, re.M | re.X)

players = re.search(r'(?P<last_name>[\w ]+),\s(?P<first_name>[\w ]+):\s(?P<score>[\d]+)', string3, re.M)

# players = re.search(r'''
#     ([\w ]+), ([\w ]+): ([\d]+)
#     ''', string3, re.M | re.X)
print(players)

<re.Match object; span=(0, 17), match='Love, Kenneth: 20'>


In [100]:
Player = players.groupdict()
print(Player)

{'last_name': 'Love', 'first_name': 'Kenneth', 'score': '20'}


In [101]:
class Player2:
    def __init__(self, last_name, first_name, score):
        self.last_name = last_name
        self.first_name = first_name
        self.score = score
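
One way the class and the named groups might fit together is sketched below. The class is renamed Player, the score is converted to int, and the sample is a shortened copy of the challenge string; all of these are assumptions for illustration, not part of the original challenge.

```python
import re

class Player:
    def __init__(self, last_name, first_name, score):
        self.last_name = last_name
        self.first_name = first_name
        self.score = int(score)   # groupdict values are strings

# Shortened copy of the challenge string.
scores = '''Love, Kenneth: 20
Stewart Pinchback, Pinckney Benton: 18'''

line = re.compile(
    r'^(?P<last_name>[\w ]+),\s(?P<first_name>[\w ]+):\s(?P<score>\d+)$',
    re.M,
)

# The group names match the __init__ parameters, so ** unpacking works.
roster = [Player(**m.groupdict()) for m in line.finditer(scores)]

for player in roster:
    print(player.first_name, player.last_name, player.score)
```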

#### Quiz

* Start a set with **^** to negate it, so it matches any character that is not in the set.
* Reasons to compile a pattern:
    1. Use it multiple times.
    2. Pass it to functions.
    3. Use it directly.
    4. Provide multiple patterns as part of a library.
* re.MULTILINE: Each line is treated as its own string for ^ and $.
* Match 5 or more occurrences of a pattern: {5,}
* Flag to write patterns over multiple lines, ignoring whitespaces and comments: re.VERBOSE (re.X)
* Iterable full of match objects: .finditer()
* Match a digit in a string with an escape character: \d