---

### <font color="brown">Regular Expressions - Continued</font>

In [2]:
import re

#### <font color="brown">Verbose, multiline regexp</font>
Suppose we want to capture individual fields from an address of the form<br>
`<optional #><apt num><whitespace><street name>,<city>,<2 character upper case state code><whitespace><zip>`

In [70]:
addr = re.compile(r"""
        \s*             # possible leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (\d+)           # capture apt number
        \s+             # at least one white space
        (.*?),          # capture street name: non-greedy sequence up to the next ','
        \s*             # possible whitespace
        (.*?),          # capture city name: non-greedy sequence up to the next ','
        \s*             # possible white space
        ([A-Z]{2})      # capture state code
        \s*             # possible white space
        (\d{5})         # capture zip code
        \s*             # possible trailing whitespace
        $               # end of string
        """, re.VERBOSE)

In [71]:
res = addr.match(' # 25 Infinite Loop,Cupertino,CA 12345')
if res:
    for gr in res.groups():
        print(gr)

25
Infinite Loop
Cupertino
CA
12345
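
One consequence of `re.VERBOSE` is worth seeing in isolation: unescaped whitespace inside the pattern is ignored, so a literal space has to be written inside a character class (or escaped). The following is a minimal sketch; the names `loose` and `strict` are made up for illustration.

```python
import re

# In VERBOSE mode, unescaped whitespace in the pattern is ignored,
# so this pattern effectively becomes 'InfiniteLoop'
loose = re.compile(r"Infinite Loop", re.VERBOSE)

# A space inside a character class is kept as a literal space
strict = re.compile(r"Infinite[ ]Loop   # literal space via [ ]", re.VERBOSE)

print(loose.search("25 Infinite Loop"))            # None
print(strict.search("25 Infinite Loop").group())   # Infinite Loop
print(loose.search("InfiniteLoop").group())        # InfiniteLoop
```

This is why the address pattern spells out every gap with `\s*` or `\s+` instead of typing spaces directly.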


In [72]:
res = addr.match('#25 Infinite Loop,  Cupertino , CA 12345')
if res:
    for gr in res.groups():
        print(gr)

25
Infinite Loop
Cupertino 
CA
12345
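
To see what non-greedy matching changes, compare a greedy `.*` with a non-greedy `.*?` on a string that contains more than one comma. This is a small sketch with a made-up sample string. Note that the `?` must directly follow the quantifier: `(.*?)` is non-greedy, whereas `(.*)?` is a greedy group that has merely been made optional.

```python
import re

text = "25 Infinite Loop, Suite 4, Cupertino"   # made-up sample with two commas

greedy = re.match(r"(.*),", text)    # .* runs as far as it can, then backs up to the last ','
lazy = re.match(r"(.*?),", text)     # .*? stops at the first ','

print(greedy.group(1))   # 25 Infinite Loop, Suite 4
print(lazy.group(1))     # 25 Infinite Loop
```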


---

#### <font color="brown">Naming captured fields</font>

In [73]:
# Give names to the captured fields for easier access, using (?P<name>...) in the group
named_addr = re.compile(r"""
        \s*             # possible leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (?P<apt>\d+)    # capture apt number
        \s+             # at least one white space
        (?P<street>.*?), # capture street name: non-greedy sequence up to the next ','
        \s*             # possible whitespace
        (?P<city>.*?),  # capture city name: non-greedy sequence up to the next ','
        \s*             # possible white space
        (?P<state>[A-Z]{2})      # capture state code
        \s*             # possible white space
        (?P<zip>\d{5})  # capture zip code
        \s*             # possible trailing whitespace
        $               # end of string
        """, re.VERBOSE)

In [74]:
res = named_addr.match(' # 10 California Avenue,Palo Alto,CA 94304')
res.groupdict()

{'apt': '10',
 'street': 'California Avenue',
 'city': 'Palo Alto',
 'state': 'CA',
 'zip': '94304'}
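
Besides `groupdict()`, a named group can be read individually, either with `group(name)` or by indexing the Match object, and it still keeps its positional number. A short sketch, using a cut-down pattern (`zip_pat` is a name invented here) rather than the full address regexp:

```python
import re

# Cut-down pattern for illustration: only the state and zip fields
zip_pat = re.compile(r"(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})")
m = zip_pat.search("Palo Alto,CA 94304")

print(m.group("state"))         # CA
print(m["zip"])                 # 94304  (Match objects support indexing)
print(m.group(1), m.group(2))   # CA 94304  (named groups are still numbered)
print(m.span("zip"))            # (13, 18)
```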

---

#### <font color="brown">Suppressing captures</font>

In [75]:
# Can suppress capture using ?: inside group
named_addr = re.compile(r"""
        \s*             # possible leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (?:\d+)         # don't capture apt num
        \s+             # at least one white space
        (?:.*?),        # don't capture street (non-greedy, up to the next ',')
        \s*             # possible whitespace
        (?P<city>.*?),  # capture city name (non-greedy), name it as 'city'
        \s*             # possible white space
        (?P<state>[A-Z]{2})      # capture state code, name it as 'state'
        \s*             # possible white space
        (?:\d{5})       # don't capture zip code
        \s*             # possible trailing whitespace
        $               # end of string
        """, re.VERBOSE)

In [76]:
res = named_addr.match(' #10 California Avenue,Palo Alto,CA 94304')
res.groupdict()

{'city': 'Palo Alto', 'state': 'CA'}

**You can, of course, get rid of the () for capture altogether** 

In [78]:
# Drop the parentheses entirely for fields we don't want to capture
named_addr = re.compile(r"""
        \s*             # possible leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        \d+             # don't capture apt num
        \s+             # at least one white space 
        .*?,            # don't capture street 
        \s*             # possible whitespace
        (?P<city>.*?),  # capture city name (non-greedy), name it as 'city'
        \s*             # possible white space
        (?P<state>[A-Z]{2})      # capture state code, name it as 'state'
        \s*             # possible white space
        \d{5}           # don't capture zip code
        \s*             # possible trailing whitespace
        $               # end of string
        """, re.VERBOSE)

In [79]:
res = named_addr.match(' #10 California Avenue,Palo Alto,CA 94304')
res.groupdict()

{'city': 'Palo Alto', 'state': 'CA'}

**But the reason you may want to keep them is that you can then turn captures on and off as needed, simply by adding or removing the ?:**
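
The effect of toggling `?:` can be seen on a tiny pattern. In this sketch (sample text and variable names are illustrative only), the non-capturing group still takes part in the match but drops out of `groups()`, so the remaining groups are renumbered:

```python
import re

text = "10 California Avenue"

capturing = re.match(r"(\d+)\s+(.*)", text)        # both fields captured
non_capturing = re.match(r"(?:\d+)\s+(.*)", text)  # number matched but not captured

print(capturing.groups())       # ('10', 'California Avenue')
print(non_capturing.groups())   # ('California Avenue',)
print(non_capturing.group())    # 10 California Avenue  (the overall match is unchanged)
```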

---

#### <font color="brown">Back referencing captures using name</font>

In [60]:
# Captured string can be back referenced
backref = re.compile(r"""
            (?P<air>air)     # capture the string 'air', name it as 'air'
            .*               # greedy
            (?P=air)         # match the same text that the group named 'air' captured
            """, re.VERBOSE)
res = backref.search('cool air or hot air')
print(res)

<re.Match object; span=(5, 19), match='air or hot air'>


In [63]:
res = backref.search('cool air or hot')
print(res)

None
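
Back references also work by group number (`\1`, `\2`, ...) when the group is unnamed. As a rough illustration, here is a sketch that finds an accidentally doubled word; the sample sentence is made up.

```python
import re

# \1 must match exactly the text that group 1 captured
doubled = re.compile(r"\b(\w+)\s+\1\b")

res = doubled.search("this is is a test")
print(res.group())    # is is
print(res.group(1))   # is

print(doubled.search("this is a test"))   # None
```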


---

#### <font color="brown">Using findall and finditer functions to get all matches</font>
- findall builds the entire list of matches before returning it
- finditer returns an iterator that yields one Match object at a time, on demand

**<font color="brown">findall()</font>**

In [110]:
# Example 1
res = re.findall(r'\w+','These are the days of miracles and wonders!')
print(res)

['These', 'are', 'the', 'days', 'of', 'miracles', 'and', 'wonders']


In [111]:
# Example 2-a
res = re.findall(r'\w+',"I can't believe it")
print(res)

['I', 'can', 't', 'believe', 'it']


In [112]:
# Example 2-b
res = re.findall(r'\S+',"I can't believe it")
print(res)

['I', "can't", 'believe', 'it']
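
One thing to watch for with findall: when the pattern contains capture groups, it returns the captured text rather than the whole match (a tuple per match if there is more than one group). A small sketch with a made-up input string:

```python
import re

text = "mpg=18, cyl=8, year=70"   # made-up sample

whole = re.findall(r"\w+=\d+", text)        # no groups: whole matches
names = re.findall(r"(\w+)=\d+", text)      # one group: list of that group
pairs = re.findall(r"(\w+)=(\d+)", text)    # two groups: list of tuples

print(whole)   # ['mpg=18', 'cyl=8', 'year=70']
print(names)   # ['mpg', 'cyl', 'year']
print(pairs)   # [('mpg', '18'), ('cyl', '8'), ('year', '70')]
```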


**<font color="brown">finditer()</font>**

In [114]:
# Example 1
iterator = re.finditer(r'\w+','These are the days of miracles and wonders!')
print(iterator)
for match in iterator:
    print(match.group(),'@',match.span())

<callable_iterator object at 0x7fdd685e9d90>
These @ (0, 5)
are @ (6, 9)
the @ (10, 13)
days @ (14, 18)
of @ (19, 21)
miracles @ (22, 30)
and @ (31, 34)
wonders @ (35, 42)


In [115]:
# Example 2
iterator = re.finditer(r'\S+',"I can't believe it")
for match in iterator:
    print(match.group(),'@',match.span())

I @ (0, 1)
can't @ (2, 7)
believe @ (8, 15)
it @ (16, 18)
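
Because finditer produces matches on demand, you can take just the first one with `next()` and stop, or carry on later from where you left off. A minimal sketch on a made-up string:

```python
import re

matches = re.finditer(r"\d+", "a1 b22 c333")

first = next(matches)                 # only the first match is computed here
print(first.group(), first.span())    # 1 (1, 2)

rest = [m.group() for m in matches]   # the iterator resumes after the first match
print(rest)                           # ['22', '333']

print(next(matches, None))            # None  (the iterator is now exhausted)
```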


---

### <font color="brown">Working with Plain Text and CSV Datasets</font>

#### Example 1: UCI Auto MPG dataset - Plain Text File

In the text file auto-mpg-original.txt there are several fields in each line. 
Of these we want the mpg (first field), cylinders (second field),
the model year (third to last), and car name (last). 
We want to read lines from this file, and write these 
fields out in the following format:
<pre>
"car name",year (19xx),cylinders (int),mpg
</pre>

#### Solution 1: Using Regular Expressions

In [3]:
test_str='18.0   8.   307.0      130.0      3504.      12.0   70.  1.	"chevrolet chevelle malibu"'

car_reg = re.compile(r"""
                \s*                    # skip over leading whitespaces, if any
                (?P<mpg>\d{2}\.\d)     # mpg field is of the form dd.d
                \s*                    # skip white spaces
                (?P<cyl>\d)\.          # cylinders field is of the form d., only want d
                .*                     # skip all intervening stuff
                (?P<yy>\d{2})\.        # year is of form dd., only want dd
                \s*                    # skip whitespaces
                \d\.                   # origin is of the form d.
                .*                     # skip intervening stuff
                (?P<name>".*")         # car name is in double quotes, want double quotes
            """, re.VERBOSE)

In [4]:
res = car_reg.match(test_str)
res.groupdict()

{'mpg': '18.0', 'cyl': '8', 'yy': '70', 'name': '"chevrolet chevelle malibu"'}

In [6]:
res = car_reg.match(test_str)
if res:
    car_dict = res.groupdict()
    keys = ['name','yy','cyl','mpg']
    values = [car_dict[k] for k in keys]
    values[1] = '19' + values[1]
    print(','.join(values))    
    

"chevrolet chevelle malibu",1970,8,18.0


**Notice the string join method above.<br>
The iterable passed to join must contain only strings; otherwise join raises a TypeError**
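
In the cell above every value is already a string because it came out of a regexp match. If some values had been converted to numbers along the way, they would need to go through `str()` first. A quick sketch with made-up values:

```python
values = ['"ford torino"', 1970, 8, 17.0]   # made-up row mixing strings and numbers

try:
    ','.join(values)
except TypeError as err:
    message = str(err)
    print(message)    # sequence item 1: expected str instance, int found

# Convert each item to a string before joining
line = ','.join(str(v) for v in values)
print(line)           # "ford torino",1970,8,17.0
```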

**Print a few lines**

In [7]:
def my_filter(in_line):
    res = car_reg.match(in_line)
    if res:
        car_dict = res.groupdict()
        keys = ['name','yy','cyl','mpg']
        values = [car_dict[k] for k in keys]
        values[1] = '19' + values[1]
        return ','.join(values) 
    return None

In [8]:
for i,line in enumerate(open("auto-mpg-original.txt")):
    out_line = my_filter(line)
    if out_line:
        print(out_line)                  # reuse the result instead of filtering the line again
    if i > 14:
        break

"chevrolet chevelle malibu",1970,8,18.0
"buick skylark 320",1970,8,15.0
"plymouth satellite",1970,8,18.0
"amc rebel sst",1970,8,16.0
"ford torino",1970,8,17.0
"ford galaxie 500",1970,8,15.0
"chevrolet impala",1970,8,14.0
"plymouth fury iii",1970,8,14.0
"pontiac catalina",1970,8,14.0
"amc ambassador dpl",1970,8,15.0
"dodge challenger se",1970,8,15.0


**The 5 lines immediately before the "dodge challenger se" line in the file are rejected because their first field, '(NA)', doesn't match the dd.d format that the regular expression requires for mpg**

#### Solution 2: Using String split
**This alternative wasn't covered in class, but it's based on material we have covered before in basic Python, so I am leaving this as an exercise for you to go over.**

In [9]:
for i,line in enumerate(open("auto-mpg-original.txt")):
    flds = line.split()  # on white space
    out_flds = []
    out_flds.append(flds[0])
    out_flds.append(flds[1][:-1])
    out_flds.append('19' + flds[6][:-1])
    out_flds.append(flds[8])
    print(','.join(out_flds))
    if i > 0:
        break

18.0,8,1970,"chevrolet
15.0,8,1970,"buick


**Hmm, the car name gets truncated because it contains spaces, and split breaks it up into parts. So how do we address this? We could simply grab all the remaining fields at the end, and concatenate them**

In [11]:
for i,line in enumerate(open("auto-mpg-original.txt")):
    flds = line.split()  # on white space
    out_flds = []
    out_flds.append(flds[0])
    out_flds.append(flds[1][:-1])
    out_flds.append('19' + flds[6][:-1])
    out_flds.append(flds[8:])
    print(','.join(out_flds))
    if i > 0:
        break

TypeError: sequence item 3: expected str instance, list found

**Above, join expects strings in the iterable, but finds a list**

In [12]:
flds = line.split()  # on white space
out_flds = []
out_flds.append(flds[8:])
out_flds

[['"chevrolet', 'chevelle', 'malibu"']]

**The car name parts are broken up and gathered into a list. We don't want a list, instead we want a single string out of the parts. We can get this by joining the list items around a space.**

In [14]:
out_flds = []
out_flds.append(' '.join(flds[8:]))
out_flds

['"chevrolet chevelle malibu"']

In [15]:
# need to join the flds[8:] list items using a space
for i,line in enumerate(open("auto-mpg-original.txt")):
    flds = line.split()  # on white space
    out_flds = []
    out_flds.append(flds[0])
    out_flds.append(flds[1][:-1])
    out_flds.append('19' + flds[6][:-1])
    out_flds.append(' '.join(flds[8:]))
    print(','.join(out_flds))
    if i > 0:
        break

18.0,8,1970,"chevrolet chevelle malibu"
15.0,8,1970,"buick skylark 320"
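
Another option, shown here only as a sketch, is to limit the number of splits with the `maxsplit` argument so that the car name is never broken up in the first place. The sample line is copied from the test string used earlier; note that the last piece keeps its trailing newline, so it needs to be stripped.

```python
line = '18.0   8.   307.0      130.0      3504.      12.0   70.  1.\t"chevrolet chevelle malibu"\n'

# Split at most 8 times: 8 numeric fields, then everything else as one piece
flds = line.split(maxsplit=8)

out_flds = [
    flds[0],                 # mpg
    flds[1][:-1],            # cylinders, dropping the trailing '.'
    '19' + flds[6][:-1],     # year, dropping the trailing '.'
    flds[8].rstrip(),        # car name; remove the trailing newline
]
result = ','.join(out_flds)
print(result)                # 18.0,8,1970,"chevrolet chevelle malibu"
```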


#### <font color="brown">So why use regexp instead of string split?</font>

##### If fields are missing or incorrectly formatted, validation is much easier with a regexp because you specify the exact format of every field. With split, you need to read the line, check that it has the required number of fields, and then check each field you keep for the correct type
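
To make the comparison concrete, here is a rough sketch of what the split-based version looks like once those checks are added. The function name `parse_with_split` and the `bad` sample line are invented for illustration; the regexp performs the equivalent checks implicitly through its pattern.

```python
def parse_with_split(line):
    """Return '"name",19yy,cyl,mpg', or None if the line fails any check."""
    flds = line.split()
    if len(flds) < 9:                       # check 1: enough fields?
        return None
    try:                                    # check 2: numeric fields really numeric?
        mpg = float(flds[0])
        cyl = int(float(flds[1]))           # '8.'  -> 8
        year = int(float(flds[6]))          # '70.' -> 70
    except ValueError:
        return None
    name = ' '.join(flds[8:])               # re-assemble the car name
    if not (name.startswith('"') and name.endswith('"')):   # check 3: quoted name?
        return None
    return f'{name},{1900 + year},{cyl},{mpg}'

good = '18.0   8.   307.0      130.0      3504.      12.0   70.  1.\t"chevrolet chevelle malibu"\n'
bad = 'NA   4.   100.0      90.0      2500.      15.0   70.  1.\t"example car"\n'   # made-up line

print(parse_with_split(good))   # "chevrolet chevelle malibu",1970,8,18.0
print(parse_with_split(bad))    # None
```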

---

#### Example 2: UCI Iris Dataset - CSV File

This file (*iris-messy.csv*) has 5 columns (fields): sepal_length, sepal_width, petal_length, petal_width, iris_type

I have deliberately introduced errors in the dataset so you get a feel for what kinds of errors you might generally expect, and how to take corrective action. 

These are some of the kinds of errors you might see in datasets in general:
- Too many fields
- Too few fields
- Missing value for field
- Unknown value (e.g. ?,NA instead of actual value)
- Non-numeric value when numeric is expected

Other errors are possible (such as outlier values), and we will tackle some of them when we study the Pandas library.

**0. Import the csv module**

In [2]:
import csv

**1. Make sure there are exactly 5 fields in each row**

In [18]:
with open('iris-messy.csv') as irisfile:      # using the with statement
    
    reader = csv.reader(irisfile)             # set up CSV reader from file
    
    next(reader)                              # skip first line of column (field) names
    
    for num,row in enumerate(reader):         # row will be a list of all column (field) values
        if len(row) != 5:                     # lines that have too many or too few columns (fields)
            print(f'{(num+1):03} >>> {row}')  # pad row number with leading zeros as needed for 3 digits width

009 >>> ['4.4', '2', '9', '1.4', '0.2', 'Iris-setosa']
064 >>> ['6.1', '4.7', '1.4', 'Iris-versicolor']
078 >>> ['6.7', '3.0', '4.5', '1.7', '6.5', 'Iris-versicolor']
103 >>> ['7', '1', '3.0', '5.9', '2.1', 'Iris-virginica']
113 >>> ['6.8', '3.0', '5.5', '2.1']
152 >>> []


**2. Make sure all fields except last are real numbers**

In [25]:
with open('iris-messy.csv') as irisfile:
    reader = csv.reader(irisfile)
    next(reader)                           # skip first line of field names
    
    for num,row in enumerate(reader):
        if len(row) != 5:                  # lines that have too many or too few fields
            print(f'Row {(num+1):03}:',end='') 
            print(' Too few fields' if len(row) < 5 else ' Too many fields')
            print('\t',row,'\n')
        else:
            for val in row[:-1]:           # skip last field
                try:
                    float(val)
                except ValueError:         # float() raises ValueError for non-numeric text
                    print(f"Row {(num+1):03}: Non-numeric value '{val}'")
                    print('\t',row,'\n')

Row 009: Too many fields
	 ['4.4', '2', '9', '1.4', '0.2', 'Iris-setosa'] 

Row 013: Non-numeric value 'N/A'
	 ['4.8', 'N/A', '1.4', '0.1', 'Iris-setosa'] 

Row 035: Non-numeric value 'n/a'
	 ['4.9', '3.1', 'n/a', '0.1', 'Iris-setosa'] 

Row 036: Non-numeric value 'na'
	 ['5.0', 'na', '1.2', '0.2', 'Iris-setosa'] 

Row 043: Non-numeric value '?'
	 ['?', '3.2', '1.3', '0.2', 'Iris-setosa'] 

Row 064: Too few fields
	 ['6.1', '4.7', '1.4', 'Iris-versicolor'] 

Row 070: Non-numeric value 'NA'
	 ['5.6', '2.5', '3.9', 'NA', 'Iris-versicolor'] 

Row 077: Non-numeric value '?'
	 ['6.8', '2.8', '?', '1.4', 'Iris-versicolor'] 

Row 078: Too many fields
	 ['6.7', '3.0', '4.5', '1.7', '6.5', 'Iris-versicolor'] 

Row 103: Too many fields
	 ['7', '1', '3.0', '5.9', '2.1', 'Iris-virginica'] 

Row 113: Too few fields
	 ['6.8', '3.0', '5.5', '2.1'] 

Row 127: Non-numeric value '4x8'
	 ['6.2', '2.8', '4x8', '1.8', 'Iris-virginica'] 

Row 137: Non-numeric value '?'
	 ['6.3', '3.4', '5.6', '?', 'Iris-vir

**3. Finalize by writing out acceptable lines:**
- Skip lines that have too few or too many fields
- Replace non-numeric field with NA (standardize)

In [27]:
with open('iris-better.csv','w') as outfile:
    with open('iris-messy.csv') as irisfile:
        
        reader = csv.reader(irisfile)
        
        row = next(reader)                # read first line of field names
        outfile.write(','.join(row))
        outfile.write('\n')
    
        for num,row in enumerate(reader):
            if len(row) != 5:             # skip lines that have too many or too few fields
                continue
            
            outrow = []
            for val in row[:-1]:      # check all fields except last for numeric
                try:
                    float(val)
                    outrow.append(val)
                except ValueError:    # float() raises ValueError for non-numeric text
                    outrow.append('NA')

            outrow.append(row[-1])    # last field, non-numeric string
            outfile.write(','.join(outrow))
            outfile.write('\n')
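
A caveat about using `float()` as the numeric test: it accepts more than plain decimals. Strings such as 'nan' and 'inf' parse without error, so a bare try/float check would let them through as "numeric". The following is a sketch of a stricter test; `is_number` is a hypothetical helper, not part of the code above, and the sample values are made up.

```python
import math

def is_number(text):
    """Return True only if text parses as a finite float."""
    try:
        return math.isfinite(float(text))   # rejects nan, inf, -inf
    except ValueError:                      # not parseable as a float at all
        return False

for val in ['4.8', ' 4.8 ', '1e3', 'nan', 'inf', 'NA', '4x8', '']:
    print(repr(val), is_number(val))
```

Whether values like `1e3` should count as acceptable depends on the dataset, so treat this as a starting point.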

**Alternatively, you can use a CSV writer to write the rows out**

In [3]:
with open('iris-better.csv','w',newline='') as csvfile:  # note the newline='' parameter
    writer = csv.writer(csvfile, delimiter=',')          # set outfile column delimiter to comma, which is the default
   
    with open('iris-messy.csv') as irisfile:
        
        reader = csv.reader(irisfile)
        
        row = next(reader)                               # first line of column names
        writer.writerow(row)                             # use writerow method of writer with list of columns as param
    
        for num,row in enumerate(reader):
            if len(row) != 5:                            # lines that have too many or too few columns
                continue
            
            outrow = []
            for val in row[:-1]:                         # check all fields except last for numeric
                try:
                    float(val)
                    outrow.append(val)
                except ValueError:                       # float() raises ValueError for non-numeric text
                    outrow.append('NA')
            outrow.append(row[-1])                      # last field, non-numeric string
            writer.writerow(outrow)
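
One practical reason to prefer the CSV writer over `','.join(...)` is quoting: if a field ever contains a comma, the writer quotes it so that the row reads back with the same number of fields. The sketch below uses an in-memory buffer instead of a file, and the row is made up (the iris data itself has no commas inside fields).

```python
import csv
import io

row = ['5.1', '3.5', 'Iris, setosa']     # made-up row: last field contains a comma

joined = ','.join(row)
print(joined)                            # 5.1,3.5,Iris, setosa

buf = io.StringIO(newline='')            # in-memory stand-in for the output file
csv.writer(buf).writerow(row)
written = buf.getvalue()
print(repr(written))                     # '5.1,3.5,"Iris, setosa"\r\n'

# Reading both versions back shows the difference
from_join = next(csv.reader(io.StringIO(joined)))
from_writer = next(csv.reader(io.StringIO(written, newline='')))
print(len(from_join), len(from_writer))  # 4 3
```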