### <font color="brown">Regular Expressions Continued</font>

In [92]:
import re

---

#### Verbose, multiline regexp, capturing
Suppose we want to do some capturing in an address of the form<br>
`<optional #><apt num><whitespace><street name>,<city>,<2 character upper case state code><whitespace><zip>`

In [97]:
addr = re.compile(r"""
        \s*             # leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (\d+)           # capture apt number
        \s+             # at least one white space 
        (.*?),          # capture street name, non-greedy sequence until ','
        \s*             # possible whitespace
        (.*?),          # capture city name, non-greedy sequence until ','
        \s*             # possible white space
        ([A-Z]{2})      # capture state code
        \s*             # possible white space
        (\d{5})         # capture zip code
        """, re.VERBOSE)

In [98]:
res = addr.match(' # 25 Infinite Loop,Cupertino,CA 12345')
if res:
    for gr in res.groups():
        print(gr)

25
Infinite Loop
Cupertino
CA
12345


In [99]:
res = addr.match('#25 Infinite Loop,  Cupertino , CA 12345')
if res:
    for gr in res.groups():
        print(gr)

25
Infinite Loop
Cupertino 
CA
12345
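
A note on quantifiers: `(.*)?` is an optional group around a greedy `.*`, while `(.*?)` is the non-greedy form. On the two addresses above both give the same groups, because backtracking settles on the same commas. The difference shows up when the text has more commas than the pattern requires. A minimal sketch, using a made-up string (`text` is not part of the original example):

```python
import re

text = 'a,b,c'   # made-up sample containing two commas

greedy = re.match(r'(.*)?,', text)   # optional group around a greedy .*
lazy = re.match(r'(.*?),', text)     # non-greedy .*?

print(greedy.group(1))   # a,b  (runs to the LAST comma)
print(lazy.group(1))     # a    (stops at the FIRST comma)
```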


#### Naming captured fields

In [28]:
# Can give names to the captured fields for easier access, using ?P<name> at the start of the group
named_addr = re.compile(r"""
        \s*             # leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (?P<apt>\d+)    # capture apt number
        \s+             # at least one white space 
        (?P<street>.*?), # capture street name, non-greedy sequence until ','
        \s*             # possible whitespace
        (?P<city>.*?),  # capture city name, non-greedy sequence until ','
        \s*             # possible white space
        (?P<state>[A-Z]{2})      # capture state code
        \s*             # possible white space
        (?P<zip>\d{5})         # capture zip code
        """, re.VERBOSE)

In [29]:
res = named_addr.match(' # 10 California Avenue,Palo Alto,CA 94304')
res.groupdict()

{'apt': '10',
 'street': 'California Avenue',
 'city': 'Palo Alto',
 'state': 'CA',
 'zip': '94304'}
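
Besides `groupdict()`, a named group can be read individually. The sketch below uses a simplified stand-in pattern (`city_state` is a hypothetical name, not the full address pattern above) to show the access styles:

```python
import re

# Simplified stand-in for the address pattern: only city and state
city_state = re.compile(r'(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})')
m = city_state.search('Palo Alto, CA')

print(m.group('city'))   # Palo Alto
print(m['state'])        # CA  (Match objects support indexing since Python 3.6)
print(m.group(1))        # Palo Alto  (named groups are still numbered)
```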

#### Suppressing captures

In [30]:
# Can suppress capture using ?: inside group
named_addr = re.compile(r"""
        \s*             # leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (?:\d+)         # don't capture apt num
        \s+             # at least one white space 
        (?:.*?),        # don't capture street
        \s*             # possible whitespace
        (?P<city>.*?),  # capture city name, name it as 'city'
        \s*             # possible white space
        (?P<state>[A-Z]{2})      # capture state code, name it as 'state'
        \s*             # possible white space
        (?:\d{5})         # don't capture zip code
        """, re.VERBOSE)

In [31]:
res = named_addr.match(' #10 California Avenue,Palo Alto,CA 94304')
res.groupdict()

{'city': 'Palo Alto', 'state': 'CA'}
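
A non-capturing group still takes part in the match; it only leaves no entry in `groups()`. A small sketch on a made-up fragment makes this visible:

```python
import re

m_capturing = re.match(r'(\d+)\s+(\w+)', '10 California')
m_suppressed = re.match(r'(?:\d+)\s+(\w+)', '10 California')

print(m_capturing.groups())    # ('10', 'California')
print(m_suppressed.groups())   # ('California',)
print(m_suppressed.group(0))   # 10 California  (the digits were still matched)
```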

#### Back referencing captures

In [34]:
# Captured string can be back referenced
backref = re.compile(r"""
            (?P<air>air)     # capture the string 'air', name it as 'air'
            .*               # greedy
            (?P=air)         # backreference: match the same text the group named 'air' captured
            """, re.VERBOSE)
res = backref.search('cool air or hot air')
print(res)

<re.Match object; span=(5, 19), match='air or hot air'>


In [35]:
res = backref.search('cool air or hot?')
print(res)

None


In [36]:
# captures can be numbered, and backreferenced using numbers
res = re.search(r'(air).*\1','cool air or hot air')
print(res)

<re.Match object; span=(5, 19), match='air or hot air'>
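
A common use of back references is finding an accidentally doubled word. This is a sketch on a made-up sentence; the name `doubled` is only for illustration:

```python
import re

# \b(\w+) captures a whole word; \s+\1\b requires the same word to follow
doubled = re.compile(r'\b(\w+)\s+\1\b')

m = doubled.search('this is is a test')
print(m.group(0))   # is is
print(m.group(1))   # is
```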


---

#### Splitting a string with split function

In [41]:
str = 'ab;cd'
re.split(';',str)

['ab', 'cd']

In [42]:
str.split(';')

['ab', 'cd']

In [45]:
str = 'Really? I mean, really?!'
re.split('[?!]',str)

['Really', ' I mean, really', '', '']

**Regexp split splits on each character in the given class separately.<br>
Also, notice the empty strings returned between consecutive split characters,<br>
and between the last split character and the end of the string.**

In [46]:
str.split('?!')

['Really? I mean, really', '']

**But `str.split` treats its argument as a single delimiter: it splits only where the whole sequence `?!` occurs.<br>
An empty string is returned at the end, as in regexp split.**
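
To make the comparison concrete, the sketch below runs both kinds of split side by side and shows one way to drop the empty strings. The variable names (`s`, `by_class`, `by_sequence`) are chosen here for illustration:

```python
import re

s = 'Really? I mean, really?!'

by_class = re.split(r'[?!]', s)     # split on '?' OR '!'
by_sequence = re.split(r'\?!', s)   # split on the two-character sequence '?!'

print(by_class)                       # ['Really', ' I mean, really', '', '']
print(by_sequence)                    # ['Really? I mean, really', '']
print(by_sequence == s.split('?!'))   # True

# Drop the empty strings if they are not wanted
non_empty = [part for part in by_class if part]
print(non_empty)                      # ['Really', ' I mean, really']
```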

In [49]:
# split into words, using \W (non-word character) as delimiter
res = re.split(r'\W+','This   is  a bunch of words!')  # raw string avoids the invalid-escape warning for \W
print(res)

['This', 'is', 'a', 'bunch', 'of', 'words', '']
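
The trailing empty string appears because the text ends with a delimiter. If only the words are wanted, matching the words with `re.findall` is an alternative to splitting on what separates them:

```python
import re

text = 'This   is  a bunch of words!'

split_words = re.split(r'\W+', text)
found_words = re.findall(r'\w+', text)

print(split_words)   # ['This', 'is', 'a', 'bunch', 'of', 'words', '']
print(found_words)   # ['This', 'is', 'a', 'bunch', 'of', 'words']
```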


#### Substituting in a string with sub function

In [50]:
# substitute all digits with '#'
re.sub(r'\d','#','Account number 1223456789')

'Account number ##########'

In [52]:
# substitute the last 3 digits with '#', keeping the digits before them
re.sub(r'(\d*)\d{3}','\\1###','Account number 1223456789')

'Account number 1223456###'

In [54]:
# substitute the last 3 digits with '#' (using a raw string for the replacement)
re.sub(r'(\d*)\d{3}',r'\1###','Account number 1223456789')

'Account number 1223456###'
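
The calls above keep the leading digits and mask the last three. To go the other way and mask everything except the last three digits, one option is a lookahead. This is a sketch of that technique; it is not from the original notebook:

```python
import re

acct = 'Account number 1223456789'

# \d(?=\d{3}) matches a digit only when at least three more digits follow it
masked = re.sub(r'\d(?=\d{3})', '#', acct)
print(masked)   # Account number #######789
```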

In [55]:
# removing comments from html
# <!-- this is a comment -->

htmlstr = 'Before comment...<!-- This is a comment -->, and after comment'
res = re.sub(r'<!--.*-->','', htmlstr)  # replace comment with nothing
print(res)

Before comment..., and after comment


In [56]:
# warning, the regexp is greedy!
htmlstr = 'Before first... <!-- comment1 -->between first and second <!-- comment2--> ... after second'
res = re.sub(r'<!--.*-->','', htmlstr)  # replace comment with nothing
print(res)

Before first...  ... after second


In [57]:
# make it non-greedy
htmlstr = 'Before first... <!-- comment1 -->between first and second <!-- comment2--> ... after second'
res = re.sub('<!--.*?-->','', htmlstr)
print(res)

Before first... between first and second  ... after second


In [59]:
# does not work with a multiline string
htmlstr2 = """<!-- first 
comment -->Not a comment<!-- comment2 -->"""
res = re.sub('<!--.*?-->','', htmlstr2)
print(res)

<!-- first 
comment -->Not a comment


**By default, the '.' metacharacter does not match a newline**

In [62]:
# either . or newline
res = re.sub('<!--(.|\n)*?-->','', htmlstr2)
print(res)

Not a comment
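
An alternative to the `(.|\n)` alternation is the `re.DOTALL` flag, which makes `.` match newlines as well. A minimal sketch on the same string:

```python
import re

htmlstr2 = """<!-- first 
comment -->Not a comment<!-- comment2 -->"""

# With re.DOTALL, '.' also matches '\n', so the comment may span lines
res = re.sub(r'<!--.*?-->', '', htmlstr2, flags=re.DOTALL)
print(res)   # Not a comment
```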


In [64]:
# Given a string of the form:
#     '"<last name>, <first name>",<netid>'

# Output the string:
#     '<first name>,<last name>,<netid>@rutgers.edu'

# e.g. '"  Venugopal,   Sesh ", sv123 ' => 'Sesh,Venugopal,sv123@rutgers.edu'

student_str = '"  Venugopal,   Sesh ", sv123 '
res = re.sub(r'"\s*(.*?),\s*(\S*)\s*",\s*(\w*)',r'\2,\1,\3@rutgers.edu',student_str)
print(res)

Sesh,Venugopal,sv123@rutgers.edu 


In [65]:
# what if we try pre-compiling both strings?
student_str = '"  Venugopal,   Sesh ", sv123 '
target = re.compile(r'"\s*(\w*),\s*(\w*)\s*",\s*(\w*)')
repl = re.compile(r'\2,\1,\3@rutgers.edu')
res = re.sub(target,repl,student_str)
print(res)

error: invalid group reference 2 at position 1

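**Doesn't work! `re.compile` parses its argument as a pattern, and in a pattern `\2` is a back reference to a group that doesn't exist here. Only the target can be compiled; the replacement has to be passed to `sub` as a plain string so that its group references are resolved against the target's groups.**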
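
One arrangement that does work, sketched here with the same strings: compile only the target, then call the compiled pattern's `sub` method with the replacement as a raw string.

```python
import re

student_str = '"  Venugopal,   Sesh ", sv123 '

target = re.compile(r'"\s*(\w*),\s*(\w*)\s*",\s*(\w*)')   # compile the pattern only
repl = r'\2,\1,\3@rutgers.edu'                            # replacement stays a plain string

res = target.sub(repl, student_str)
print(res)   # Sesh,Venugopal,sv123@rutgers.edu  (followed by the original trailing space)
```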

---

### <font color="brown">Processing Datasets</font>

#### Example 1: UCI Auto MPG dataset - Plain Text File

In the text file auto-mpg-original.txt there are several fields. 
Of these we want the mpg (first field), cylinders (second field),
the model year (third to last), and car name (last). 
Using a regular expression to process each line, we want to write these 
fields out in the following format:<br>
"car name",year (19xx),cylinders (int),mpg

#### Solution 1: Using Regular Expressions

In [84]:
test_str='18.0   8.   307.0      130.0      3504.      12.0   70.  1.	"chevrolet chevelle malibu"'

car_reg = re.compile(r"""
                \s*                    # skip over whitespaces at start
                (?P<mpg>\d{2}\.\d)     # mpg field is of the form dd.d
                \s*                    # skip white spaces
                (?P<cyl>\d)\.          # cylinders field is of the form d., only want d
                .*                     # skip all intervening stuff
                (?P<yy>\d{2})\.        # year is of form dd., only want dd
                \s*                    # skip whitespaces
                \d\.                   # origin is of the form d.
                .*                     # skip intervening stuff
                (?P<name>".*")         # car name is in double quotes, want double quotes
            """, re.VERBOSE)

In [85]:
res = car_reg.match(test_str)
res.groupdict()

{'mpg': '18.0', 'cyl': '8', 'yy': '70', 'name': '"chevrolet chevelle malibu"'}

In [74]:
res = car_reg.match(test_str)
if res:
    car_dict = res.groupdict()
    keys = ['name','yy','cyl','mpg']       # output order
    values = [car_dict[k] for k in keys]
    values[1] = '19' + values[1]
    print(','.join(values))    # notice the join method 

"chevrolet chevelle malibu",1970,8,18.0


**print a few lines**

In [75]:
def my_filter(in_line):
    res = car_reg.match(in_line)
    if res:
        car_dict = res.groupdict()
        values = [car_dict[k] for k in ['name','yy','cyl','mpg']]
        values[1] = '19' + values[1]
        return ','.join(values) 
    return None

In [78]:
with open("auto-mpg-original.txt") as f:   # the with block closes the file when done
    for i,line in enumerate(f):
        out_line = my_filter(line)
        if out_line:
            print(out_line)                # reuse the result instead of filtering twice
        if i > 14:
            break

"chevrolet chevelle malibu",1970,8,18.0
"buick skylark 320",1970,8,15.0
"plymouth satellite",1970,8,18.0
"amc rebel sst",1970,8,16.0
"ford torino",1970,8,17.0
"ford galaxie 500",1970,8,15.0
"chevrolet impala",1970,8,14.0
"plymouth fury iii",1970,8,14.0
"pontiac catalina",1970,8,14.0
"amc ambassador dpl",1970,8,15.0
"dodge challenger se",1970,8,15.0


**The 5 lines immediately before dodge challenger se are rejected because their first field, '(NA)', 
doesn't match the `dd.d` form the regular expression requires for mpg**

#### Solution 2: Using String split

In [80]:
for i,line in enumerate(open("auto-mpg-original.txt")):
    flds = line.split()  # on white space
    out_flds = []
    out_flds.append(flds[0])
    out_flds.append(flds[1][:-1])
    out_flds.append('19' + flds[6][:-1])
    out_flds.append(flds[8])
    print(','.join(out_flds))
    if i > 0:
        break

18.0,8,1970,"chevrolet
15.0,8,1970,"buick


##### Hmmm, the car name doesn't work because it's got a space in it, and split will break it up into parts. How do we address this? We could simply grab all the remaining fields at the end, and concatenate them

In [81]:
for i,line in enumerate(open("auto-mpg-original.txt")):
    flds = line.split()  # on white space
    out_flds = []
    out_flds.append(flds[0])
    out_flds.append(flds[1][:-1])
    out_flds.append('19' + flds[6][:-1])
    out_flds.append(flds[8:])
    print(','.join(out_flds))
    if i > 0:
        break

TypeError: sequence item 3: expected str instance, list found

In [82]:
flds = line.split()  # on white space
out_flds = []
out_flds.append(flds[8:])
out_flds

[['"chevrolet', 'chevelle', 'malibu"']]

**The car name parts are broken up and gathered into a list. We don't want a list; we want a single string made of the parts. We can get this by joining the list items with a space.**

In [83]:
# need to join the flds[8:] list items using a space
for i,line in enumerate(open("auto-mpg-original.txt")):
    flds = line.split()  # on white space
    out_flds = []
    out_flds.append(flds[0])
    out_flds.append(flds[1][:-1])
    out_flds.append('19' + flds[6][:-1])
    out_flds.append(' '.join(flds[8:]))
    print(','.join(out_flds))
    if i > 0:
        break

18.0,8,1970,"chevrolet chevelle malibu"
15.0,8,1970,"buick skylark 320"


#### So why use regexp instead of string split? 

##### If fields are missing or incorrectly formatted, regexp is much easier because you specify the exact format of every field. With split, you need to read the line, check that it has the required number of fields, and then check each accepted field for the correct type
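
To give a feel for the extra checking that split needs, here is a rough sketch of a split-based filter. The function name `parse_with_split` and the two sample lines are made up for illustration (the second imitates a line with a missing mpg), and the checks shown are one possible choice rather than a complete validator.

```python
# Two made-up lines in the same layout: one well formed, one with a missing mpg
good = '18.0   8.   307.0   130.0   3504.   12.0   70.  1.\t"chevrolet chevelle malibu"'
bad = 'NA   8.   350.0   165.0   4142.   11.5   70.  1.\t"example car"'

def parse_with_split(line):
    """Return 'mpg,cyl,year,name', or None if a hand-written check fails."""
    flds = line.split()
    if len(flds) < 9:                    # 8 numeric fields plus at least one name part
        return None
    try:
        float(flds[0])                   # mpg must be numeric
    except ValueError:
        return None
    cyl, yy = flds[1].rstrip('.'), flds[6].rstrip('.')
    if not (cyl.isdigit() and yy.isdigit() and len(yy) == 2):
        return None
    name = ' '.join(flds[8:])            # re-join the name parts
    if not (name.startswith('"') and name.endswith('"')):
        return None
    return ','.join([flds[0], cyl, '19' + yy, name])

print(parse_with_split(good))   # 18.0,8,1970,"chevrolet chevelle malibu"
print(parse_with_split(bad))    # None
```

Each `if` here corresponds to something the verbose regular expression states once, in the pattern itself.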