### <font color="brown">Regular Expressions Continued</font>

In [1]:
import re

---

#### <font color="brown">Special regular expression sequences to match predefined sets of characters</font>
1. Whitespace: \\s, \\S
2. Word (alphanumeric, plus underscore) characters: \\w, \\W
3. Digits: \\d, \\D
4. Word Boundary: \\b

---

**Whitespace**
- \\s : matches any whitespace character (including tab and newline)
- \\S : matches any non-whitespace character 

In [12]:
# at least two of '.','?',or '!', followed by whitespace
def matchit(astr):
    res = re.search(r'[.?!]{2,}\s+',astr) 
    print(res) if res else print('no match')

In [13]:
matchit('...What the?')  

no match


In [15]:
matchit('... What the?') 

<re.Match object; span=(0, 4), match='... '>


**In the above, the space after ... is matched by \\s+**

In [14]:
matchit('...  What the?') 

<re.Match object; span=(0, 5), match='...  '>


In [17]:
matchit('What the?!! Next...')

<re.Match object; span=(8, 12), match='?!! '>


In [18]:
# at least 4 non-whitespace characters followed by at least one whitespace
res = re.search(r'\S{4,}\s+','The quick brown fox...')
print(res)

<re.Match object; span=(4, 10), match='quick '>


In [20]:
# can specify whitespace alternatively by using a [] class with space, tab, and newline
astr = '... What the?'
res = re.search(r'[.?!]{2,}[ \t\n]+',astr)  
print(res)

<re.Match object; span=(0, 4), match='... '>
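
The class [ \t\n] is narrower than \\s, which also covers other whitespace such as carriage return, form feed, and vertical tab. A minimal sketch of the difference (the variable names are only for illustration):

```python
import re

# Compare \s with the hand-written class [ \t\n] on other whitespace characters.
results = []
for ch in ['\r', '\f', '\v']:
    by_s = re.search(r'\s', ch) is not None
    by_class = re.search(r'[ \t\n]', ch) is not None
    results.append((by_s, by_class))
    print(repr(ch), by_s, by_class)
```

Each line prints True for \\s and False for the explicit class, so the two are interchangeable only when the input contains nothing but spaces, tabs, and newlines.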


---

**"Word" characters (alphanumeric plus underscore)**

- \\w : matches any word character (letter, digit, or underscore); for ASCII text this is [a-zA-Z0-9_]
- \\W : matches any non-word character; for ASCII text this is [^a-zA-Z0-9_]
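
In Python 3, \\w on a str pattern is Unicode-aware by default, so it also accepts accented letters. Passing the re.ASCII flag restricts it to [a-zA-Z0-9_]. A small sketch, with a sample word chosen only for illustration:

```python
import re

word = 'caf\u00e9'   # 'café', with the accented letter written as an escape

default_match = re.search(r'\w+', word).group()          # Unicode-aware \w
ascii_match = re.search(r'\w+', word, re.ASCII).group()  # \w limited to [a-zA-Z0-9_]

print(default_match)
print(ascii_match)
```

The first search returns the whole word; the second stops before the accented letter and returns 'caf'.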

In [21]:
# want at least 4 word characters followed by at least one whitespace
res = re.search(r'\w{4,}\s+',"Hey! What's up?")
print(res)

None


In [22]:
# want at least 4 word characters followed by at least one whitespace
res = re.search(r'\w{4,}\s+',"Hey! What's up with you?")
print(res)

<re.Match object; span=(15, 20), match='with '>


---

**Digits**

- \\d : matches any digit character => [0-9]
- \\D : matches any non-digit character => [^0-9]

##### Exercise:
Write a regular expression to determine if a given string is an acceptable phone number.<br>
Following are the acceptable phone number formats (d stands for digit):
- dddddddddd
- ddd-ddd-dddd
- (ddd)ddddddd
- (ddd)ddd-dddd

**First, let's handle the last two variants that have ()**

In [25]:
while True:
    astr = input("phone number? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search(r'^\(\d{3}\)\d{3}-?\d{4}$',astr)  # escape '(' and ')' because they are metachars
    print('match') if res else print('no match')

phone number? ('quit' to stop)  (848)445-2590


match


phone number? ('quit' to stop)  (848)4452590


match


phone number? ('quit' to stop)  8484452590


no match


phone number? ('quit' to stop)  84812


no match


phone number? ('quit' to stop)  (8480)445-259


no match


phone number? ('quit' to stop)  quit


**Next, let's strengthen the above with the ability to handle leading/trailing whitespaces**

In [13]:
while True:
    astr = input("phone number? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search(r'^\s*\(\d{3}\)\d{3}-?\d{4}\s*$',astr)  # \s* after ^ and before $
    print('match') if res else print('no match')

phone number? ('quit' to stop)    (848)123-4567  


match


phone number? ('quit' to stop)  quit


**For the variants without '( )', we can't simply use -? at each dash position: that pattern would also match a number that has only one of the two dashes**

In [26]:
# so, for instance, it will work for this string
res = re.search(r'^\s*\d{3}-?\d{3}-?\d{4}\s*$','  848-445-2790')
print(res)

<re.Match object; span=(0, 14), match='  848-445-2790'>


In [27]:
# but also for this string, which is not an acceptable variant
res = re.search(r'^\s*\d{3}-?\d{3}-?\d{4}\s*$','  848-4452790')
print(res)

<re.Match object; span=(0, 13), match='  848-4452790'>


**So let's do one pattern to catch both dashes**

In [28]:
# both dashes
print(re.search(r'^\s*\d{3}-\d{3}-\d{4}\s*$','  848-445-2790   '))

<re.Match object; span=(0, 17), match='  848-445-2790   '>


**And another pattern to catch a straight sequence of 10 digits**

In [29]:
# 10 digits in sequence
print(re.search(r'^\s*\d{10}\s*$','  8484452790   '))

<re.Match object; span=(0, 15), match='  8484452790   '>


**Final solution, single regexp to catch all variants**

In [18]:
# single pattern: alternation over the acceptable formats, anchored with ^ and $
while True:
    astr = input("phone number? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search(r'^\s*(\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-\d{4}|\d{10})\s*$',astr) 
    print('match') if res else print('no match')

phone number? ('quit' to stop)   8484452790 


match


phone number? ('quit' to stop)   848-445-2790 


match


phone number? ('quit' to stop)   (848)445-2790 


match


phone number? ('quit' to stop)   (848)4452790 


match


phone number? ('quit' to stop)   (848)-445-2790 


no match


phone number? ('quit' to stop)  quit
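
The same check can be wrapped in a function and run without input(), which makes it easier to try many strings at once. A sketch, assuming a hypothetical helper name is_phone and sample numbers chosen for illustration:

```python
import re

def is_phone(astr):
    """Return True if astr is one of the acceptable phone number formats."""
    pattern = r'^\s*(\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-\d{4}|\d{10})\s*$'
    return re.search(pattern, astr) is not None

samples = [' 8484452790 ', '848-445-2790', '(848)445-2790', '(848)4452790',
           '(848)-445-2790', '848-4452790', 'x8484452790']
for s in samples:
    print(repr(s), is_phone(s))
```

The last three samples are rejected: a dash right after ')', a single dash, and a stray leading character that the ^ anchor rules out.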


---

**Word boundary**
- \\b : matches only at a word boundary (it doesn't consume any character, it only asserts a position).
(Remember, a word is a sequence of alphanumeric characters plus underscore.)<br> 

In [51]:
# check if a string contains the word 'part'
res = re.search(r'\b[pP]art\b',"I'm going to a party tomorrow")
print(res)
res = re.search(r'\b[pP]art\b',"This is the best part of the movie.")
print(res)
res = re.search(r'\b[pP]art\b',"This is a big apartment.")
print(res)
res = re.search(r'\b[pP]art\b','Til death do us part') # end of string is also word boundary
print(res)

None
<re.Match object; span=(17, 21), match='part'>
None
<re.Match object; span=(16, 20), match='part'>


In [33]:
res = re.search(r'\b[eE]pisode\b',"Episode3 has a high rating.") 
print(res)

None


**In the above, since digits are word characters, there is no word boundary between 'Episode' and '3', so the trailing \\b fails**
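
A short sketch of how the boundary behaves next to digits, underscores, and punctuation. The extra test strings are made up for illustration:

```python
import re

text = 'Episode3 has a high rating.'

no_boundary = re.search(r'\b[eE]pisode\b', text)        # 'e' and '3' are both word characters
with_digits = re.search(r'\b[eE]pisode\d*\b', text)     # let the word include trailing digits
underscore = re.search(r'\b[eE]pisode\b', 'Episode_3')  # underscore is a word character too
hyphen = re.search(r'\b[eE]pisode\b', 'Episode-3')      # '-' is not, so a boundary exists

print(no_boundary)
print(with_digits)
print(underscore)
print(hyphen)
```

Only the second and fourth searches succeed, matching 'Episode3' and 'Episode' respectively.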

---

#### <font color="brown">Using the match function</font>
**The match function always starts matching from the beginning of string**

In [34]:
print(re.search('ar','barbaric')) # 'ar' is in 'barbaric'
print(re.match('ar','barbaric')) # but 'barbaric' doesn't begin with 'ar'

<re.Match object; span=(1, 3), match='ar'>
None


In [35]:
# match all strings that start with ar, end with t, 
# and have at least one lowercase letter between

res = re.search('^ar[a-z]+t$', 'arrest')  # version 1, using search
print(res)
res = re.match('ar[a-z]+t$', 'arrest')  # version 2, using match   
print(res)

<re.Match object; span=(0, 6), match='arrest'>
<re.Match object; span=(0, 6), match='arrest'>


**Note that if you want to match an entire string with the match function, you will still need to use $ at the end**
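
The re module also provides fullmatch, which succeeds only if the whole string matches, so no ^ or $ is needed. It is also slightly stricter than $, because $ accepts a position just before a trailing newline. A brief sketch with illustrative strings:

```python
import re

prefix_only = re.match('ar[a-z]+t', 'arresting')     # match: only a prefix has to fit
whole = re.fullmatch('ar[a-z]+t', 'arresting')       # None: the entire string must fit
exact = re.fullmatch('ar[a-z]+t', 'arrest')          # match

dollar_newline = re.match('ar[a-z]+t$', 'arrest\n')  # match: $ allows one trailing newline
full_newline = re.fullmatch('ar[a-z]+t', 'arrest\n') # None

print(prefix_only, whole, exact, dollar_newline, full_newline, sep='\n')
```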

---

#### <font color="brown">Using the match object returned by search/match</font>
**Applying the methods group(), span(), start(), end()**

In [39]:
res = re.search('at', 'catch')  # returned Match object is stored in res 

**group() returns the matched result string**

In [37]:
print(res.group())

at


**span() returns the tuple (start, end) of indices of the matching part of the original string**

In [40]:
print(res.span())

(1, 3)


**start() and end() return starting and ending indices of matching part of original string**

In [41]:
print(res.start()) 
print(res.end()) 

1
3


**Of course, you can get these same values from the tuple returned by span()**

In [42]:
start,end = res.span()
print(start,',',end)

1 , 3


**By definition, re.match() will always return a span that starts at 0 (if a match is found)**

In [45]:
res = re.match(r'<.*?>','<span>This is within a span tag in html</span>')  # non-greedy
print(res.group())
print(res.span())
print(res.start())
print(res.end())

<span>
(0, 6)
0
6


In [46]:
res = re.match(r'<.*>','<span>This is within a span tag in html</span>')  # greedy
print(res.group())
print(res.span())
print(res.start())
print(res.end())

<span>This is within a span tag in html</span>
(0, 46)
0
46


**Be careful to check for existence of returned match object before applying methods!**

In [47]:
res = re.match('bar','sandbar')
print(res.group())

AttributeError: 'NoneType' object has no attribute 'group'

In [48]:
# defend!
res = re.match('bar','sandbar')
print(res.group()) if res else print('No match')

No match


**Typical usage is to store the result of search/match, check that it exists (is not None), and then get the matched string with group()**

In [49]:
# find out if a string contains any sequence that starts with ar, ends with t, 
# and has at least one lowercase letter between
def substr(astr):
    res = re.search('ar[a-z]+t',astr)  
    print('Match:',res.group()) if res else print('No match')
        
substr('parasite')
substr('artist')
substr('part')

Match: arasit
Match: artist
No match
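
If the matched string is needed by the caller rather than printed, the same store-check-use sequence can return a value. A sketch using an assignment expression, which assumes Python 3.8 or later; the name first_match is hypothetical:

```python
import re

def first_match(astr):
    """Return the first substring matching ar[a-z]+t, or None if there is none."""
    if (res := re.search('ar[a-z]+t', astr)) is not None:
        return res.group()
    return None

for s in ['parasite', 'artist', 'part']:
    print(s, '->', first_match(s))
```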


---

#### <font color="brown">Splitting a string with split function</font>

In [4]:
# the variable is named text rather than str, which would shadow the built-in str type
text = 'ab;cd'
re.split(';',text)

['ab', 'cd']

In [5]:
text.split(';')

['ab', 'cd']

In [6]:
text = 'Really? I mean, really?!'
re.split('[?!]',text)

['Really', ' I mean, really', '', '']

**Regexp split will split separately on each of the characters in the given class.<br>
Also, notice the empty string returned between consecutive split characters,<br>
and between the last split character and the end of the string**

In [7]:
text.split('?!')

['Really? I mean, really', '']

**But str.split treats its argument as a single literal separator, so it splits only where the whole string '?!' occurs.<br>
As with regexp split, an empty string is returned between the separator and the end of the string**

In [8]:
# split into words, using \W (non-word character) as delimiter
res = re.split(r'\W+','This   is  a bunch of words!')  # raw string, so \W reaches the regexp engine as written
print(res)

['This', 'is', 'a', 'bunch', 'of', 'words', '']
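
The trailing empty string appears because the text ends with a delimiter. Two common ways to avoid it are to filter the split result or to match the words directly with findall. A minimal sketch:

```python
import re

text = 'This   is  a bunch of words!'

# Option 1: split on non-word runs, then drop empty pieces
filtered = [w for w in re.split(r'\W+', text) if w]

# Option 2: collect runs of word characters directly
found = re.findall(r'\w+', text)

print(filtered)
print(found)
```

Both print the six words with no empty string at the end.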


---

#### <font color="brown">Substituting in a string with sub function</font>

In [15]:
# substitute all digits in 'Account number 1223456789' with '#'
re.sub(r'\d','#','Account number 1223456789')

'Account number ##########'

In [67]:
# substitute last 3 digits with '#'
re.sub(r'\d{3}$','###','Account number 1223456789')

'Account number 1223456###'
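
sub replaces every match by default. The count keyword limits the number of replacements, and subn also reports how many were made. A short sketch on the same account string:

```python
import re

s = 'Account number 1223456789'

first_three = re.sub(r'\d', '#', s, count=3)   # replace only the first 3 digits
masked, n = re.subn(r'\d', '#', s)             # replace all, and count the replacements

print(first_three)
print(masked, n)
```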

In [17]:
# removing comments from html
# <!-- this is a comment -->

htmlstr = 'Before comment...<!-- This is a comment -->, and after comment'
res = re.sub('<!--.*-->','', htmlstr)  # replace comment with nothing
print(res)

Before comment..., and after comment


In [18]:
# warning, the regexp is greedy!
htmlstr = 'Before first... <!-- comment1 -->between first and second <!-- comment2--> ... after second'
res = re.sub('<!--.*-->','', htmlstr)  # replace comment with nothing
print(res)

Before first...  ... after second


**Since the regexp above does a greedy match, everything from the start of the first comment to the end of the last comment is matched,<br>
including the string between the two comment sections**

In [19]:
# make it non-greedy
htmlstr = 'Before first... <!-- comment1 -->between first and second <!-- comment2--> ... after second'
res = re.sub('<!--.*?-->','', htmlstr)
print(res)

Before first... between first and second  ... after second


In [20]:
# does not work with a multiline string
htmlstr2 = """<!-- first 
comment -->Not a comment<!-- comment2 -->"""
res = re.sub('<!--.*?-->','', htmlstr2)
print(res)

<!-- first 
comment -->Not a comment


**By default, the '.' metacharacter does not match a newline**

In [119]:
# change to either . or newline
res = re.sub(r'<!--(.|\n)*?-->','', htmlstr2)
print(res)

Not a comment
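
An alternative to (.|\n) is the re.DOTALL flag, which makes '.' match newlines as well. A sketch on the same multiline string:

```python
import re

htmlstr2 = """<!-- first 
comment -->Not a comment<!-- comment2 -->"""

# With re.DOTALL, '.' also matches '\n', so the non-greedy pattern spans lines
res = re.sub(r'<!--.*?-->', '', htmlstr2, flags=re.DOTALL)
print(res)
```

This removes both comments and leaves only 'Not a comment'.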


---

#### <font color="brown">Grouping/Capturing</font>

In [68]:
# want to extract ("capture") area code and local part from phone number
# assume format (ddd)ddd-dddd

res = re.match(r'\s*\((\d{3})\)(\d{3}-\d{4})', '(848)445-2790')

**Notice the grouping/capturing with parentheses around the area code part, as in (\d{3})
and likewise for the entire non-area code part**

In [23]:
print(res.group())  # for the whole thing
print(res.groups()) # for all parts captured with ( )
print(res.group(0)) # entire thing
print(res.group(1)) # first grouping with ( )
print(res.group(2)) # second grouping with ( )

(848)445-2790
('848', '445-2790')
(848)445-2790
848
445-2790


In [24]:
# equally, you can use search instead of match, just make sure to use ^ for start of string
res = re.search(r'^\s*\((\d{3})\)(\d{3}-\d{4})', '(848)445-2790')

In [25]:
print(res.group())  # for the whole thing
print(res.groups()) # for all parts grouped with ( )
print(res.group(0)) # entire thing
print(res.group(1)) # first grouping with ( )
print(res.group(2)) # second grouping with ( )

(848)445-2790
('848', '445-2790')
(848)445-2790
848
445-2790


In [26]:
# alternatively, you can index into the groups() tuple
print(res.groups()[0])
print(res.groups()[1])

848
445-2790


In [28]:
# iterate through all the groups
res = re.match(r'\s*\((\d{3})\)(\d{3}-\d{4})', '(848)445-2790')
if res:
    for gr in res.groups():
        print(gr)

848
445-2790
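
Groups can also be given names with the (?P<name>...) syntax, which reads better than positional numbers when a pattern has several captures. A sketch, where the group names area and local are chosen only for illustration:

```python
import re

res = re.match(r'\s*\((?P<area>\d{3})\)(?P<local>\d{3}-\d{4})', '(848)445-2790')
if res:
    print(res.group('area'))    # look up a capture by name
    print(res.group('local'))
    print(res.groupdict())      # all named captures as a dict
    print(res.group(1))         # numbered access still works
```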


**Numbering and back-referencing capture groups**

In [58]:
# captures can be numbered, and backreferenced using numbers
res = re.search(r'(air).*\1','cool air or hot air')
print(res)

<re.Match object; span=(5, 19), match='air or hot air'>


In [59]:
# the backreference \1 needs a second 'air', so this string does not match
res = re.search(r'(air).*\1','cool air or hot')
print(res)

None


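**When using back references, make sure to use a raw string for the regexp. In a regular string, Python reads '\\1' as an octal escape for the character with code 1, so the regexp never sees a backreference; see below**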

In [66]:
# same as 2 cells above, but without using raw string
res = re.search('(air).*\1','cool air or hot air')
print(res)

None
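
Printing the two strings with repr makes the difference visible. A small sketch; doubling the backslash in a regular string is shown as an equivalent alternative to the raw string:

```python
import re

plain = '(air).*\1'      # Python turns \1 into the single character chr(1)
raw = r'(air).*\1'       # backslash and '1' both reach the regexp engine
escaped = '(air).*\\1'   # same as the raw string, written with a doubled backslash

print(repr(plain))
print(repr(raw))
print(escaped == raw)
print(re.search(raw, 'cool air or hot air').group())
print(re.search(plain, 'cool air or hot air'))
```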


---

#### <font color="brown">Pre-compiling a regular expression</font>

**Sometimes it's easier to "compile" a regular expression and use it several times later**

In [2]:
pattrn = re.compile(r'\s*\((\d{3})\)(\d{3}-\d{4})')
res = pattrn.match('(848)445-2790')
print(res.groups())

('848', '445-2790')


In [3]:
patt = re.compile(r'\s*#?\s*(\d+)')
res = patt.match('#25 Infinite Loop,Cupertino,CA 12345')
print(res.groups())
res = patt.match(' # 25 Infinite Loop,Cupertino,CA 12345')
print(res.groups())
res = patt.match(' 25 Infinite Loop,Cupertino,CA 12345')
print(res.groups())

('25',)
('25',)
('25',)
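
A compiled pattern is most useful when the same regexp is applied to many strings, for example in a loop. A sketch with made-up address strings, including one that does not match:

```python
import re

patt = re.compile(r'\s*#?\s*(\d+)')

# hypothetical inputs, for illustration only
addresses = ['#25 Infinite Loop', ' # 7 Main St', 'No number here']

numbers = []
for addr in addresses:
    res = patt.match(addr)          # same methods as the module-level functions
    numbers.append(res.group(1) if res else None)

print(numbers)
```

The compiled object offers the same operations as the module functions (match, search, split, sub, and so on), without passing the pattern each time.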


---

**Exercise**
<pre>
Given a string of the form:
     '"&lt;last name>, &lt;first name>",&lt;netid>'

Output the string:
     '&lt;first name>,&lt;last name>,&lt;netid>@rutgers.edu'

e.g. '"  Venugopal,   Sesh ", sv123 ' => 'Sesh,Venugopal,sv123@rutgers.edu'
</pre>

In [69]:
# capture the last name, first name and netid and use the captures to construct result
# allow for leading and trailing whitespaces around name and netid, and whitespaces around comma separators 
student_str = '"  Venugopal ,   Sesh " , sv123 '
res = re.sub(r'"\s*(\S+)\s*,\s*(\S+)\s*"\s*,\s*(\w+)\s*',r'\2,\1,\3@rutgers.edu',student_str)
print(res)

Sesh,Venugopal,sv123@rutgers.edu


In [43]:
# what if we try pre-compiling both the strings?
student_str = '"  Venugopal,   Sesh ", sv123 '
target = re.compile(r'"\s*(\S*)\s*,\s*(\S*)\s*"\s*,\s*(\w*)')
repl = re.compile(r'\2,\1,\3@rutgers.edu')
res = re.sub(target,repl,student_str)
print(res)

error: invalid group reference 2 at position 1

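**The above doesn't work: re.compile treats its argument as a regexp, and as a regexp \\2 is a backreference to a capture group that this second pattern does not define, so compilation fails. The replacement must be passed to sub as a plain string (or a function), not as a compiled pattern**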
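
One way to still benefit from pre-compiling is to compile only the target pattern and pass the replacement as a raw string to the compiled object's sub method. A sketch, reusing the pattern from the working cell:

```python
import re

student_str = '"  Venugopal,   Sesh ", sv123 '

# compile the search pattern only; the replacement stays a plain raw string
target = re.compile(r'"\s*(\S+)\s*,\s*(\S+)\s*"\s*,\s*(\w+)\s*')
res = target.sub(r'\2,\1,\3@rutgers.edu', student_str)
print(res)
```

Here the group references in the replacement are resolved against the captures of target at substitution time.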