# Review of Regex Symbols

The ? matches zero or one of the preceding group.

The * matches zero or more of the preceding group. 

The + matches one or more of the preceding group.

The {n} matches exactly n of the preceding group. 

The {n,} matches n or more of the preceding group. 

The {,m} matches 0 to m of the preceding group. 

The {n,m} matches at least n and at most m of the preceding group. 

{n,m}? or *? or +? performs a nongreedy match of the preceding
group.

^spam means the string must begin with spam.

spam$ means the string must end with spam.

The . matches any character, except newline characters. 

\d, \w, and \s match a digit, word, or space character, respectively. 

\D, \W, and \S match anything except a digit, word, or space character, respectively. 

[abc] matches any character between the brackets (such as a, b, or c).

[^abc] matches any character that isn’t between the brackets. 
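
The quantifier rules above can be spot-checked with a short sketch. It uses re.fullmatch(), a standard re function that these notes do not otherwise cover; the patterns and sample strings are made up for illustration.

```python
import re

# Each entry: (pattern, sample string, whether the whole string should match).
# All patterns and strings here are illustrative assumptions.
cases = [
    (r'ab?c', 'ac', True),       # ? allows zero b's
    (r'ab*c', 'abbbc', True),    # * allows any number of b's
    (r'ab+c', 'ac', False),      # + needs at least one b
    (r'a{2}', 'aa', True),       # exactly two a's
    (r'a{2,}', 'a', False),      # at least two a's
    (r'a{,3}', 'aaaa', False),   # at most three a's
    (r'[^abc]', 'd', True),      # any character other than a, b, or c
]

# fullmatch() succeeds only if the pattern covers the entire string.
results = [re.fullmatch(pattern, text) is not None for pattern, text, _ in cases]

for (pattern, text, expected), got in zip(cases, results):
    print(pattern, repr(text), 'ok' if got == expected else 'unexpected')
```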

# Finding Patterns of Text Without Regular Expressions 


In [1]:
def isPhoneNumber(text):
    if len(text) != 12: # check if string is exactly 12 characters long
        return False
    for i in range(0, 3): # check that the area code consists only of numeric characters
        if not text[i].isdecimal():
            return False
    if text[3] != '-': # check that the fourth character (index 3) is a hyphen
        return False
    for i in range(4, 7): # check for three more numeric characters
        if not text[i].isdecimal():
            return False
    if text[7] != '-': # check for a second hyphen
        return False
    for i in range(8, 12): # check that the last four characters are numeric
        if not text[i].isdecimal():
            return False
    return True

x = '415-555-4242'
print(isPhoneNumber(x))
len(x)
x[3]



True


'-'

In [2]:
# This script takes 576 steps in total to find the phone numbers in a small string.

def isPhoneNumber(text):
    if len(text) != 12: # check if string is exactly 12 characters long
        return False
    for i in range(0, 3): # check that the area code consists only of numeric characters
        if not text[i].isdecimal():
            return False
    if text[3] != '-': # check that the fourth character (index 3) is a hyphen
        return False
    for i in range(4, 7): # check for three more numeric characters
        if not text[i].isdecimal():
            return False
    if text[7] != '-': # check for a second hyphen
        return False
    for i in range(8, 12): # check that the last four characters are numeric
        if not text[i].isdecimal():
            return False
    return True

message = 'Call me at 415-555-1011. 415-555-9999 is my office.'
for i in range(len(message)): 
    chunk = message[i:i + 12] # takes 12 chars and checks all steps.
    if isPhoneNumber(chunk):
        print("Phone number found : " + chunk)

Phone number found : 415-555-1011
Phone number found : 415-555-9999


# Finding Patterns of Text with Regular Expressions 


Regular expressions, called *regexes* for short, are descriptions for a pattern of text.

 \d in a regex stands for a digit character — that is, any single numeral 0 to 9.
 
The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the
previous isPhoneNumber() function did: a string of three digits, a hyphen, three more
digits, another hyphen, and four digits. Any other string would not match the
\d\d\d-\d\d\d-\d\d\d\d regex.

Regexes can also be written more compactly. For example, adding a 3 in curly brackets ({3}) after a pattern is like saying, “Match this pattern three times.” So the slightly
shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format.
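
As a rough check that the two spellings behave the same, the sketch below compares them on a few sample strings. The variable names and samples are illustrative assumptions, and re.fullmatch() is used only so that each sample must match in its entirety.

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

# Illustrative samples: one valid number and two strings that should be rejected.
samples = ['415-555-4242', '415-55-4242', 'Moshi moshi']

long_results = [long_form.fullmatch(s) is not None for s in samples]
short_results = [short_form.fullmatch(s) is not None for s in samples]

for sample, matched in zip(samples, short_results):
    print(sample, matched)
print('Both forms agree:', long_results == short_results)
```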



### Creating regex objects
All the regex functions in Python are in the re module.

In [3]:
import re

Passing a string value representing your regular expression to re.compile() returns a
Regex pattern object (or simply, a Regex object). 


In [4]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# Now the phoneNumRegex variable contains a Regex object. 

By putting an r before the first
quote of the string value, you can mark the string as a raw string, which does not treat
backslashes as escape characters.
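
A small sketch of what the r prefix changes, assuming nothing beyond built-in strings: without it, each backslash in a regex has to be doubled by hand.

```python
import re

plain = '\\d+'   # ordinary string: the backslash is escaped by hand
raw = r'\d+'     # raw string: the backslash is kept exactly as typed

same_pattern = plain == raw        # both hold the same three characters: \ d +
newline_len = len('\n')            # 1: an actual newline character
raw_newline_len = len(r'\n')       # 2: a backslash followed by the letter n

print(same_pattern, newline_len, raw_newline_len)
print(re.compile(raw).search('abc 123').group())
```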


**Matching Regex Objects**

A Regex object’s search() method searches the string it is passed for any matches to the
regex. The search() method will return None if the regex pattern is not found in the
string. If the pattern is found, the search() method returns a Match object. Match objects
have a group() method that will return the actual matched text from the searched string.


In [5]:
# \d{3} means "match \d three times."

phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ', mo.group())

Phone number found:  415-555-4242
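
Because search() returns None when nothing matches, calling group() on the result without checking would raise an AttributeError. A minimal sketch of guarding against that follows; the helper name find_phone is hypothetical.

```python
import re

phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_regex.search(text)
    if mo is None:          # no match: there is no Match object to call group() on
        return None
    return mo.group()

print(find_phone('My number is 415-555-4242.'))
print(find_phone('No number here.'))
```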


**Grouping with Parentheses**


Say you want to separate the area code from the rest of the phone number. Adding
parentheses will create groups in the regex: 

**(\d\d\d)-(\d\d\d-\d\d\d\d)**. 

Then you can
use the group() match object method to grab the matching text from just one group. 

The first set of parentheses in a regex string will be group 1. The second set will be group 2.

In [6]:
phoneNumRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')

mo = phoneNumRegex.search('My number is 415-555-4242.')

print('Phone number found: ', mo.group())



Phone number found:  415-555-4242


In [7]:
mo.group(1)

'415'

In [8]:
mo.group(2)

'555-4242'

In [9]:
mo.groups()

('415', '555-4242')

In [10]:
areaCode, mainNumber = mo.groups()

In [11]:
print(areaCode)

415


In [12]:
print(mainNumber)

555-4242


For instance, maybe the phone numbers you are trying to match have the area code set in parentheses. In this case, you need to escape the ( and
) characters with a backslash.

In [13]:
# matches numbers written like (415) 555-4242
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
mo.group(1)

'(415)'

**Matching Multiple Groups with the Pipe**

The | character is called a pipe. You can use it anywhere you want to match one of many
expressions. For example, the regular expression *r'Batman|Tina Fey'* will match either
'Batman' or 'Tina Fey'.

When both Batman and Tina Fey occur in the searched string, the first occurrence of
matching text will be returned as the Match object. Enter the following into the interactive
shell: 



In [14]:
heroRegex = re.compile(r'Batman|Tina Fey')

In [15]:
mo1 = heroRegex.search('Batman and Tina Fey.')

In [16]:
mo1.group()


'Batman'

In [17]:
mo2 = heroRegex.search('Tina Fey and Batman.')

In [18]:
mo2.group()

'Tina Fey'

**By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match.**

In [19]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')

In [20]:
mo = batRegex.search('Batmobile lost a wheel.')

In [21]:
mo.group()

'Batmobile'

In [22]:
mo.group(1)

'mobile'

### Optional Matching with the Question Mark

Sometimes there is a pattern that you want to match only optionally. That is, the regex
should find a match whether or not that bit of text is there.

The (wo)? part of the regular expression means that the pattern wo is an optional group.


In [23]:
batRegex = re.compile(r'Bat(wo)?man')

In [24]:
mo1 = batRegex.search('The adventures of Batman')

In [25]:
mo1.group()

'Batman'

In [26]:
mo2 = batRegex.search('The Adventures of Batwoman')

In [27]:
mo2.group()

'Batwoman'

Using the earlier phone number example, you can make the regex look for phone numbers
that do or do not have an area code. Enter the following into the interactive shell: 


In [28]:
phoneRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')

In [29]:
mo1 = phoneRegex.search('My number is 415-555-4242')

In [30]:
mo1.group()

'415-555-4242'

In [31]:
mo2 = phoneRegex.search('My phone is 555-4242')

In [32]:
mo2.group()


'555-4242'

## Matching Zero or More with the Star 


The * (called the star or asterisk) means “match zero or more”: the group that precedes
the star can occur any number of times in the text. It can be completely absent or repeated
over and over again.

In [33]:
batRegex =  re.compile(r'Bat(wo)*man')

In [34]:
mo1 = batRegex.search('The Adventures of Batman')

In [35]:
mo1.group()

'Batman'

In [36]:
mo2 = batRegex.search('Adventures of Batwoman')

In [37]:
mo2.group()

'Batwoman'

In [38]:
mo3 = batRegex.search('Adventures of Batwowowowowoman')

In [39]:
mo3.group()

'Batwowowowowoman'

## Matching One or More with the Plus 


The + (or plus) means “match one or more.”

Unlike the
star, which does not require its group to appear in the matched string, the group preceding
a plus must appear at least once. It is not optional.

The regex Bat(wo)+man will not match the string 'The Adventures of Batman' because at least one wo is required by the plus sign.


In [40]:
batRegex = re.compile(r'Bat(wo)+man')

In [41]:
mo1 = batRegex.search('Adventures of Batwoman')

In [42]:
mo1.group()

'Batwoman'

In [43]:
mo2 = batRegex.search('The Adventures of Batwowoman')

In [44]:
mo2.group()

'Batwowoman'

In [45]:
mo3 = batRegex.search('The Adv of Batman')

In [46]:
mo3 == None

True

## Matching Specific Repetitions with Curly Brackets 


If you have a group that you want to repeat a specific number of times, follow the group in
your regex with a number in curly brackets.

*(Ha){3}*

Instead of one number, you can specify a range by writing a minimum, a comma, and a
maximum in between the curly brackets.

*(Ha){3,5}*

You can also leave out the first or second number in the curly brackets to leave the
minimum or maximum unbounded.

*(Ha){3,}* will match three or more instances.

*(Ha){,5}* will match zero to five instances.

In [47]:
haRegex = re.compile(r'(Ha){3}')

In [48]:
mo1 = haRegex.search('HaHaHa')

In [49]:
mo1.group()

'HaHaHa'

In [50]:
mo2 = haRegex.search('Ha')

In [51]:
mo2 == None

True
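
The open-ended forms {3,} and {,5} are not exercised in the cells above, so here is a small sketch of how they behave. The variable names and sample strings are illustrative assumptions.

```python
import re

at_least_three = re.compile(r'(Ha){3,}')
up_to_five = re.compile(r'(Ha){,5}')

four = at_least_three.search('HaHaHaHa').group()      # all four repetitions are taken
too_few = at_least_three.search('HaHa')               # only two repetitions: no match
capped = up_to_five.search('HaHaHaHaHaHaHa').group()  # stops after five repetitions
empty = up_to_five.search('Hello').group()            # zero repetitions is still a match

print(four)
print(too_few)
print(capped)
print(repr(empty))
```

Note that (Ha){,5} matches the empty string when no Ha is present, so a successful match does not by itself mean any text was found.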

## Greedy and Nongreedy Matching 


Python’s regular expressions are greedy by default, which means that in ambiguous
situations they will match the longest string possible.

The non-greedy version of the curly
brackets, which matches the shortest string possible, has the closing curly bracket followed
by a question mark.


In [52]:
greedyHaRegex = re.compile(r'(Ha){3,5}')

In [53]:
mo1 = greedyHaRegex.search('HaHaHaHaHaHa')

In [54]:
mo1.group()

'HaHaHaHaHa'

In [55]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')

In [56]:
mo2 = nongreedyHaRegex.search('HaHaHaHaHaHa')

In [57]:
mo2.group()

'HaHaHa'
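
The same question-mark suffix applies to the star and the plus. The sketch below uses the module-level re.search() function for brevity instead of compiling first; the sample text is an assumption.

```python
import re

text = 'HaHaHaHa'

greedy_plus = re.search(r'(Ha)+', text).group()     # takes as many repetitions as it can
lazy_plus = re.search(r'(Ha)+?', text).group()      # stops after the one required repetition
lazy_star = re.search(r'(Ha)*?', text).group()      # zero repetitions already satisfies *?

print(greedy_plus)
print(lazy_plus)
print(repr(lazy_star))
```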

## The findall() Method

1. When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, the
method findall() returns a list of string matches, such as ['415-555-9999',
'212-555-0000']. 


2. When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d),
the method findall() returns a list of tuples of strings (one string for each
group), such as [('415', '555', '1122'), ('415', '555', '8899')].


In [58]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

In [59]:
mo = phoneNumRegex.search('Cell: 415-555-9999 work: 215-555-0000')

In [60]:
mo.group()

'415-555-9999'

While search() returns a Match object for only the first match in the searched string, the findall() method returns the strings of every match.

Also, findall() does not return a Match object but a list of strings, as long
as there are no groups in the regular expression.

In [61]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups

In [62]:
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

If there are groups in the regular expression, then findall() will return a list of tuples.

In [63]:
phoneNumRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')

In [64]:
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555-9999'), ('212', '555-0000')]
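
If you want both the parts and the complete number from findall(), one option is to wrap the whole pattern in an extra outer group. This is a sketch of that idea, not the only way to do it; the variable names are illustrative.

```python
import re

# Group 1 is the outer group (whole number), group 2 the area code, group 3 the rest.
phone_regex = re.compile(r'((\d{3})-(\d{3}-\d{4}))')

matches = phone_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
full_numbers = [groups[0] for groups in matches]   # first tuple item is the outer group

print(matches)
print(full_numbers)
```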

# Character Classes 


**\d** Any numeric digit from 0 to 9.

**\D** Any character that IS NOT a numeric digit from 0 to 9.

**\w** Any letter, numeric digit, or the underscore character.

**\W** Any character that IS NOT a letter, numeric digit, or the underscore character.

**\s** Any space, tab, or newline character.

**\S** Any character that IS NOT a space, tab, or newline character.

In [65]:
# one or more digit, followed by space, followed by one or more letter/digit/underscore characters
xmasRegex = re.compile(r'\d+\s\w+') 

In [66]:
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, seven digit')

['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids']
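
The uppercase classes work the same way in reverse. A brief sketch, using a made-up sample string:

```python
import re

text = 'Room 101, floor_2!'   # illustrative sample text

non_digits = re.findall(r'\D+', text)   # runs of characters that are not digits
non_word = re.findall(r'\W', text)      # single characters that are not letters, digits, or _
non_space = re.findall(r'\S+', text)    # runs of characters that are not whitespace

print(non_digits)
print(non_word)
print(non_space)
```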

## Making Your Own Character Classes 


 You can define your own character class using
square brackets.

In [67]:
# will catch all the vowels, lowercase and uppercase
vowelRegex = re.compile(r'[aeiouAEIOU]')

In [68]:
vowelRegex.findall('Robocop eats baby food. BABY FOOD322.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

In [69]:
# matches the lowercase letters x-z, the uppercase letters X-Z, and the digits 0-9
customRegex = re.compile(r'[x-zX-Z0-9]')
# a space inside the brackets, as in [x-z X-Z 0-9], would make the class match spaces too

In [70]:
customRegex.findall('Robocop eats baby food. BABY FOOD322.')

['y', 'Y', '3', '2', '2']

**By placing a caret character (^) just after the character class’s opening bracket, you can
make a negative character class.**

In [71]:
# ^ (caret) makes a negative class: this matches every character that is not a vowel
consonantRegex = re.compile(r'[^aeiouAEIOU]')

In [72]:
consonantRegex.findall('Robocop eats baby food. BABY FOOD.')

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

## The Caret and Dollar Sign Characters

In [73]:
beginWithHello = re.compile(r'^Hello') # match when the string begins with Hello

In [74]:
beginWithHello.search('Hello, world!')

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [75]:
beginWithHello.search('He said Hello!') == None

True

In [76]:
endWithNumber = re.compile(r'\d\.$') # match when the string ends with a digit followed by a literal dot (the dot is escaped)


In [77]:
endWithNumber.search('Your number is 42.')

<_sre.SRE_Match object; span=(16, 18), match='2.'>

In [78]:
endWithNumber.search('You are forty second') == None

True

In [79]:
wholestringIsNum = re.compile(r'^\d+$')

In [80]:
wholestringIsNum.search('123456')

<_sre.SRE_Match object; span=(0, 6), match='123456'>

In [81]:
wholestringIsNum.search('123sdfd456') == None


True

In [82]:
wholestringIsNum.search('1 23456') == None

True
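
One detail worth knowing when using $ for validation: in Python, $ also matches just before a newline at the very end of the string. The sketch below contrasts that with re.fullmatch(), a standard re function not otherwise covered here; the variable names are illustrative.

```python
import re

whole_string_is_num = re.compile(r'^\d+$')

# $ matches at the end of the string or just before a final newline.
with_trailing_newline = whole_string_is_num.search('123456\n') is not None

# fullmatch() requires the pattern to cover every character, newline included.
strict = re.fullmatch(r'\d+', '123456\n') is not None

print(with_trailing_newline)
print(strict)
```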

### The Wildcard Character 


The . (or dot) character in a regular expression is called a wildcard and will match any
character except for a newline.

In [83]:
atRegex = re.compile(r'.at')

In [84]:
atRegex.findall('The cat in the hat sat on the flat mat at house.')

['cat', 'hat', 'sat', 'lat', 'mat', ' at']

Remember that the dot character will match just one character, which is why the match for
the text flat in the previous example matched only lat.

## Matching Everything with Dot-Star 


To match any run of text, you can use the dot-star (.*) to stand in for that
“anything.” Remember that the dot character means “any single character except the
newline,” and the star character means “zero or more of the preceding character.”


In [85]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')

In [86]:
mo = nameRegex.search('First Name: Al, Last Name: Sweigart')

In [87]:
mo.group()

'First Name: Al, Last Name: Sweigart'

In [88]:
mo.group(2)

'Sweigart'

The dot-star uses greedy mode: it will always try to match as much text as possible. To match in a nongreedy fashion, use the dot, star, and question mark (.*?).

In [89]:
nongreedyRegex = re.compile(r'<.*?>')

In [90]:
mo = nongreedyRegex.search('<To serve man> for dinner.>')

In [91]:
mo.group()

'<To serve man>'

In [92]:
greedyRegex = re.compile(r'<.*>')

In [93]:
mo = greedyRegex.search('<To serve man> for dinner.>')

In [94]:
mo.group()

'<To serve man> for dinner.>'

### Matching Newlines with the Dot Character

The dot-star will match everything except a newline. By passing re.DOTALL as the second
argument to re.compile(), you can make the dot character match all characters,
including the newline character. 


In [95]:
noNewlineRegex = re.compile('.*')

In [96]:
noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.'

In [97]:
newlineRegex = re.compile('.*', re.DOTALL)

In [98]:
newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent.\nUphold the law.'

# Case-Insensitive Matching 


To make your regex case-insensitive, you can pass
re.IGNORECASE or re.I as a second argument to re.compile().

In [99]:
robocop = re.compile(r'robocop', re.I)

In [100]:
robocop.search('Robocop is part man, part machine, all cop.').group()

'Robocop'

### Substituting Strings with the sub() Method 


Regular expressions can not only find text patterns but can also substitute new text in place
of those patterns. The sub() method for Regex objects is passed two arguments. 

The first
argument is a string to replace any matches. The second is the string to search, in which
the matches are replaced. The sub() method returns a string with the substitutions applied.


In [101]:
namesRegex = re.compile(r'Agent \w+')

In [102]:
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent John')

'CENSORED gave the secret documents to CENSORED'

Sometimes you may need to use the matched text itself as part of the substitution. 

In the
first argument to sub(), you can type \1, \2, \3, and so on, to mean “Enter the text of
group 1, 2, 3, and so on, in the substitution.” 


In [103]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')

In [104]:
agentNamesRegex.sub(r'\1*****', 'Agent Alice told Agent Carol ...')

'A***** told C***** ...'
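
Several group references can be combined in one replacement string. As a sketch, the following reformats phone numbers by rearranging three groups; the target format is chosen only for illustration.

```python
import re

phone_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# \1 is the area code, \2 the next three digits, \3 the last four digits.
reformatted = phone_regex.sub(r'(\1) \2-\3', 'Cell: 415-555-9999 Work: 212-555-0000')

print(reformatted)
```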

# Managing Complex Regexes 


Regular expressions are fine if the text pattern you need to match is simple. But matching
complicated text patterns might require long, convoluted regular expressions. 

You can
mitigate this by telling the re.compile() function to ignore whitespace and comments
inside the regular expression string. This “verbose mode” can be enabled by passing the
variable re.VERBOSE as the second argument to re.compile(). 

You can spread the regular expression over multiple lines with comments like this:


In [105]:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?           # area code
    (\s|-|\.)?                   # separator
    \d{3}                        # first 3 digits
    (\s|-|\.)                    # separator
    \d{4}                        # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''', re.VERBOSE)

Note how the previous example uses the triple-quote syntax (''') to create a multiline
string so that you can spread the regular expression definition over many lines, making it
much more legible.
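
A sketch of putting a verbose pattern like this to work. The pattern is repeated (with the dot in ext\. escaped) so that the block runs on its own, and the sample text is made up. Because the pattern contains groups, findall() returns tuples; the first item of each tuple is the outermost group, which holds the whole number.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''', re.VERBOSE)

text = 'Call 415-555-1011 or (212) 555-0000 x123.'   # illustrative sample text

# Keep only the outermost group (index 0) from each tuple of groups.
numbers = [groups[0] for groups in phone_regex.findall(text)]

print(numbers)
```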

## Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

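The re.compile() function takes only a single value as its second argument. So if you want a regular expression that’s case-insensitive and whose dot character also matches newlines,
combine the flags with the pipe character (|), which in this context is known as the bitwise or operator, and pass the result to re.compile():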
 


In [106]:
someRegexValue = re.compile('foo', re.I | re.DOTALL)

In [107]:
someRegexValue = re.compile('foo', re.I | re.DOTALL | re.VERBOSE)
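
To see the combined flags take effect, here is a small sketch with a made-up pattern and text. The match depends on both flags: re.I for the capital letters and re.DOTALL for the dot-star crossing the newline.

```python
import re

text = 'FIRST line\nand the LAST line'   # illustrative sample text

combined = re.compile(r'first.*last', re.I | re.DOTALL)
case_only = re.compile(r'first.*last', re.I)

combined_match = combined.search(text)    # matches across the newline, ignoring case
case_only_match = case_only.search(text)  # the dot cannot cross the newline: no match

print(repr(combined_match.group()))
print(case_only_match)
```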