# Lab 4.2 -- Regular Expression Practice

Now that we have grouped the data blocks, we need to identify and correct problems in the data.  The go-to tool for this task is the regular expression.  In this lab, we will point out a number of problems in the grouped data and create regular expressions to perform various tasks (matching, splitting, and substitution).

## Problem 1 -- Reading in current progress

Recall that we saved the results of grouping the data in a file named `911_Deaths_Grouped.csv`.  Read in the content of this file and split the content into a list of lines.

In [2]:
# Your code here

#### Key

In [3]:
with open('911_Deaths_Grouped.csv') as f:
    content = f.read()
content[:500]

"Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.\nEdelmiro Abad, 54, Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.\nMarie Rose Abad, 49, Keefe, Bruyette&Woods, Inc., World Trade Center.\nAndrew Anthony Abate, 37, Melville, N.Y., Cantor Fitzgerald, World Trade Center.\nVincent Paul Abate, 40, Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.\nLaurence Christopher Abel, 37, New York City, Cantor Fitzgerald, World Trade Center.\nAlona Abraham, 3"

In [4]:
grouped_lines = content.split('\n')
grouped_lines

["Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 'Edelmiro Abad, 54, Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.',
 'Marie Rose Abad, 49, Keefe, Bruyette&Woods, Inc., World Trade Center.',
 'Andrew Anthony Abate, 37, Melville, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Vincent Paul Abate, 40, Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Laurence Christopher Abel, 37, New York City, Cantor Fitzgerald, World Trade Center.',
 'Alona Abraham, 30, Ashdod, Israel, Passenger, United 175, World Trade Center.',
 'William F. Abrahamson, 55, Westchester County, N.Y., Marsh&McLennan Companies, Inc., World Trade Center.',
 'Richard Anthony Aceto, 42, Marsh&McLennan Companies, Inc., World Trade Center.',
 'Heinrich Bernhard Ackermann, 38, Aon Corporation, World Trade Center.',
 'Paul Acquaviva, 29, Glen Rock, N.J., Cantor Fitzgerald, World Trade Center.',
 'Christian Adams, 37, Passenger, United 93, Shanksville, Pa.',


## Problem 2 -- Inspecting problem lines

Below are some example lines that illustrate problems found in this data set.  Inspect the lines and identify one or more problems with each line.

In [5]:
example_idx = (0, 33, 75, 76, 150, 232, 1304, 1305, 1343)
examples = [l for i, l in enumerate(grouped_lines) if i in example_idx]
examples

["Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 'Godwin O. Ajala, 33, Summit Security Services, Inc., World Trade Center, died 9/15/01.',
 'Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.',
 'Laura Angilletta, 23, Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.',
 'Canfield D. Boone, ??, United States Army, Pentagon.',
 'Albert Gunnis Joseph, 79, New York City, Morgan Stanley, World Trade Center, died 1/2/02.',
 'Ingeborg Joseph, 53, Marriott guest, World Trade Center, died 10/9/01.',
 'Brenda Kegler, ??, Capitol Heights, Md., United States Army Civilian, Pentagon.']

> Your answers here

#### Key

Some problems include:

1. Commas in names, companies, and locations.
2. Missing ages (represented as `??`).
3. Rows missing an entry for hometown, passenger type, flight, or date of death.
4. Non-uniform hometown/state entries.


## General procedure

1. Create the expression and test it against a positive case.
2. Match/search against all examples.
3. After you know it works, add the `groups` method for all examples.
4. Look for any non-matches in the full data set.
5. If all rows match, apply the expression to the full data set.
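
The steps above can be sketched as a small helper. This is only an illustration: `check_pattern` is a hypothetical name, not part of the lab, and the two sample rows are copied from the example lines in Problem 2.

```python
import re

def check_pattern(pattern, examples, all_lines):
    """Hypothetical helper that walks one pattern through the procedure."""
    regex = re.compile(pattern)
    # Steps 1-2: search every example and collect the ones that fail.
    failed_examples = [l for l in examples if regex.search(l) is None]
    if failed_examples:
        return {"failed_examples": failed_examples}
    # Step 4: look for rows in the full data set that do not match.
    non_matches = [(i, l) for i, l in enumerate(all_lines)
                   if regex.search(l) is None]
    if non_matches:
        return {"non_matches": non_matches}
    # Steps 3 and 5: every row matched, so calling groups() is now safe.
    return {"groups": [regex.search(l).groups() for l in all_lines]}

sample = [
    "Godwin O. Ajala, 33, Summit Security Services, Inc., World Trade Center, died 9/15/01.",
    "Canfield D. Boone, ??, United States Army, Pentagon.",
]
print(check_pattern(r", (\d\d),", sample, sample))          # reports a failed example
print(check_pattern(r", (\?\?|\d{1,3}),", sample, sample))  # reports the groups
```

Returning the failures instead of raising makes it easy to inspect the offending rows before refining the pattern.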

## What's the big deal?

So why are we being so careful to make sure everything matches? It turns out that `search` returns `None` for any row that fails to match, so calling `groups` on that result raises an `AttributeError` and crashes the code :/

In [48]:
import re
test = re.compile(r', \d\d,')
[test.search(l) for l in examples]

[<re.Match object; span=(21, 26), match=', 32,'>,
 <re.Match object; span=(15, 20), match=', 33,'>,
 <re.Match object; span=(24, 29), match=', 52,'>,
 <re.Match object; span=(16, 21), match=', 23,'>,
 <re.Match object; span=(15, 20), match=', 58,'>,
 None,
 <re.Match object; span=(20, 25), match=', 79,'>,
 <re.Match object; span=(15, 20), match=', 53,'>,
 None]

In [49]:
[test.search(l).groups() for l in examples]

AttributeError: 'NoneType' object has no attribute 'groups'
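
While a pattern is still under development, one way to avoid the crash is to guard each match before calling `groups`. The sketch below is a variation on the cell above: it adds a capture group around the digits (an assumption made so that `groups` has something to return) and uses two rows from the example lines.

```python
import re

two_digit_age = re.compile(r', (\d\d),')
lines = [
    'Laura Angilletta, 23, Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.',
    'Canfield D. Boone, ??, United States Army, Pentagon.',
]

# search returns None on a failed match, so check the result before
# calling groups(); failed rows are recorded as None instead of crashing.
guarded = [m.groups() if m else None for m in map(two_digit_age.search, lines)]
print(guarded)  # [('23',), None]
```

The `None` entries point directly at the rows the pattern still needs to handle.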

## Problem 3 -- Capturing the age field

Notice that every victim has an age field that contains either their age or `??` if the age is unknown.

In this problem, we will build a regular expression to match this field, which will ALSO allow us to capture the name field (even when there are problems with extra commas).

#### Task 1 - Capture the age field.

Write a regular expression that matches and captures the age field.  

**Hints:**

* Use `(pat)` to capture a pattern.
* Use `\d` to match digits.
* Use `(p1|p2)` to match `p1` or `p2`.
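
As a quick illustration of these three hints on a toy string (deliberately unrelated to the data set, and with a made-up pattern), consider:

```python
import re

# (pat) captures, \d matches a single digit, and (p1|p2) matches either
# alternative. The text and pattern here are invented for illustration.
toy = re.compile(r'room (\d\d|TBD)')

print(toy.search('Meet in room 42 at noon').groups())   # ('42',)
print(toy.search('Meet in room TBD at noon').groups())  # ('TBD',)
```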

In [6]:
# Your code here

#### Key

In [52]:
import re
age = re.compile(r', (\?\?|\d{1,3}),')
type(age.search(examples[0]))

re.Match

In [37]:
examples[0], age.search(examples[0]).groups()

("Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 ('32',))

In [38]:
[age.search(l) for l in examples]

[<re.Match object; span=(21, 26), match=', 32,'>,
 <re.Match object; span=(15, 20), match=', 33,'>,
 <re.Match object; span=(24, 29), match=', 52,'>,
 <re.Match object; span=(16, 21), match=', 23,'>,
 <re.Match object; span=(15, 20), match=', 58,'>,
 <re.Match object; span=(17, 22), match=', ??,'>,
 <re.Match object; span=(20, 25), match=', 79,'>,
 <re.Match object; span=(15, 20), match=', 53,'>,
 <re.Match object; span=(13, 18), match=', ??,'>]

In [39]:
[age.search(l).groups() for l in examples]

[('32',),
 ('33',),
 ('52',),
 ('23',),
 ('58',),
 ('??',),
 ('79',),
 ('53',),
 ('??',)]

In [40]:
[(i, l) for i, l in enumerate(grouped_lines) if not age.search(l)]

[]

In [41]:
[age.search(l).groups() for l in grouped_lines]

[('32',),
 ('54',),
 ('49',),
 ('37',),
 ('40',),
 ('37',),
 ('30',),
 ('55',),
 ('42',),
 ('38',),
 ('29',),
 ('37',),
 ('28',),
 ('61',),
 ('25',),
 ('51',),
 ('62',),
 ('28',),
 ('22',),
 ('36',),
 ('48',),
 ('32',),
 ('37',),
 ('36',),
 ('37',),
 ('35',),
 ('46',),
 ('30',),
 ('43',),
 ('74',),
 ('27',),
 ('47',),
 ('30',),
 ('33',),
 ('37',),
 ('37',),
 ('41',),
 ('39',),
 ('46',),
 ('25',),
 ('46',),
 ('57',),
 ('43',),
 ('51',),
 ('44',),
 ('39',),
 ('31',),
 ('30',),
 ('36',),
 ('48',),
 ('41',),
 ('31',),
 ('23',),
 ('38',),
 ('25',),
 ('60',),
 ('40',),
 ('60',),
 ('43',),
 ('41',),
 ('32',),
 ('29',),
 ('28',),
 ('42',),
 ('35',),
 ('26',),
 ('57',),
 ('53',),
 ('52',),
 ('34',),
 ('43',),
 ('37',),
 ('63',),
 ('38',),
 ('54',),
 ('52',),
 ('23',),
 ('44',),
 ('32',),
 ('48',),
 ('26',),
 ('55',),
 ('26',),
 ('26',),
 ('36',),
 ('45',),
 ('32',),
 ('38',),
 ('37',),
 ('34',),
 ('52',),
 ('29',),
 ('48',),
 ('50',),
 ('49',),
 ('37',),
 ('47',),
 ('53',),
 ('25',),
 ('21',),


#### Task 2 - Capture the age field, as well as everything before and after.

Adapt your work from the last task to capture not only the age field, but also everything before and after it.

**Hints:**

* Use greedy wild-cards `.*` and/or `.+` to grab as much as possible.
* Use commas to anchor the three parts, e.g. `(pat1), (pat2), (pat3)`.
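
To see what "grab as much as possible" means in practice, here is a small sketch on a made-up string with two candidate anchor fields. The greedy `.+` backs off only as far as it must, so it settles on the last field that fits, while the lazy `.+?` stops at the first one.

```python
import re

toy_line = 'a, 1, b, 2, c'  # invented input with two single-digit fields

greedy = re.compile(r'(.+), (\d), (.+)')
lazy = re.compile(r'(.+?), (\d), (.+)')

print(greedy.search(toy_line).groups())  # ('a, 1, b', '2', 'c')
print(lazy.search(toy_line).groups())    # ('a', '1', 'b, 2, c')
```

The literal `, ` on each side of the middle group is what pins it to a whole comma-separated field.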

In [42]:
# Your code here

#### Key

In [43]:
import re
age_plus = re.compile(r'(.+), (\?\?|\d{1,3}), (.+)')

In [44]:
[age_plus.search(l) for l in examples]

[<re.Match object; span=(0, 74), match="Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Part>,
 <re.Match object; span=(0, 86), match='Godwin O. Ajala, 33, Summit Security Services, In>,
 <re.Match object; span=(0, 109), match='Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and>,
 <re.Match object; span=(0, 81), match='Laura Angilletta, 23, Staten Island, N.Y., Cantor>,
 <re.Match object; span=(0, 81), match='Lorraine G. Bay, 58, East Windsor, N.J., Flight C>,
 <re.Match object; span=(0, 52), match='Canfield D. Boone, ??, United States Army, Pentag>,
 <re.Match object; span=(0, 89), match='Albert Gunnis Joseph, 79, New York City, Morgan S>,
 <re.Match object; span=(0, 70), match='Ingeborg Joseph, 53, Marriott guest, World Trade >,
 <re.Match object; span=(0, 79), match='Brenda Kegler, ??, Capitol Heights, Md., United S>]

In [45]:
[age_plus.search(l).groups() for l in examples]

[('Gordon M. Aamoth, Jr.',
  '32',
  "Sandler O'Neill + Partners, World Trade Center."),
 ('Godwin O. Ajala',
  '33',
  'Summit Security Services, Inc., World Trade Center, died 9/15/01.'),
 ('Mary Lynn Edwards Angell',
  '52',
  'Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.'),
 ('Laura Angilletta',
  '23',
  'Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.'),
 ('Lorraine G. Bay',
  '58',
  'East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.'),
 ('Canfield D. Boone', '??', 'United States Army, Pentagon.'),
 ('Albert Gunnis Joseph',
  '79',
  'New York City, Morgan Stanley, World Trade Center, died 1/2/02.'),
 ('Ingeborg Joseph',
  '53',
  'Marriott guest, World Trade Center, died 10/9/01.'),
 ('Brenda Kegler',
  '??',
  'Capitol Heights, Md., United States Army Civilian, Pentagon.')]

In [29]:
[(i, l) for i, l in enumerate(grouped_lines) if not age_plus.search(l)]

[]

In [31]:
[age_plus.search(l).groups() for l in grouped_lines]

[('Gordon M. Aamoth, Jr.',
  '32',
  "Sandler O'Neill + Partners, World Trade Center."),
 ('Edelmiro Abad',
  '54',
  'Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.'),
 ('Marie Rose Abad', '49', 'Keefe, Bruyette&Woods, Inc., World Trade Center.'),
 ('Andrew Anthony Abate',
  '37',
  'Melville, N.Y., Cantor Fitzgerald, World Trade Center.'),
 ('Vincent Paul Abate',
  '40',
  'Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.'),
 ('Laurence Christopher Abel',
  '37',
  'New York City, Cantor Fitzgerald, World Trade Center.'),
 ('Alona Abraham',
  '30',
  'Ashdod, Israel, Passenger, United 175, World Trade Center.'),
 ('William F. Abrahamson',
  '55',
  'Westchester County, N.Y., Marsh&McLennan Companies, Inc., World Trade Center.'),
 ('Richard Anthony Aceto',
  '42',
  'Marsh&McLennan Companies, Inc., World Trade Center.'),
 ('Heinrich Bernhard Ackermann', '38', 'Aon Corporation, World Trade Center.'),
 ('Paul Acquaviva',
  '29',
  'Glen Rock, N.J., Cant

## Problem 4 -- Capturing the date of death

While most victims of the attack died on 9/11, a few died at a later date.  Notice that those who died later have an additional field at the end of the line.

In this problem, we will build a regular expression to match this field.

In [14]:
examples[-2]

'Ingeborg Joseph, 53, Marriott guest, World Trade Center, died 10/9/01.'

#### Task 1 - Capture the date of death field.
  
Write a regular expression that matches and captures the date of death (e.g. `10/9/01`).  The captured group should be `None` when this field is missing.

**Hints:**

* Use `$` to match the end of the line.
* Escape periods to match them exactly, i.e. `\.`
* Use `\d{n,m}` to match between `n` and `m` digits.
* Use `?` to make a pattern optional.
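
A toy example (with an invented pattern and invented strings) shows how these four pieces interact:

```python
import re

# \. is a literal period, \d{1,2} is one or two digits, ? makes the
# group optional, and $ pins the match to the end of the string.
toy = re.compile(r'(v\d{1,2})?\.$')

print(toy.search('Released v12.').groups())  # ('v12',)
print(toy.search('Released.').groups())      # (None,)
print(toy.search('Released. Later'))         # None
```

Note the difference between the last two results: an optional group that is skipped shows up as `None` inside `groups()`, while a failed match makes `search` itself return `None`.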

In [32]:
# Your code here

#### Key

In [79]:
dod = re.compile(r'(, died \d{1,2}/\d{1,2}/\d{1,2})?(\.)?$')
dod.search(examples[-2]).groups()

(', died 10/9/01', '.')

In [80]:
[dod.search(l) for l in examples]

[<re.Match object; span=(73, 74), match='.'>,
 <re.Match object; span=(71, 86), match=', died 9/15/01.'>,
 <re.Match object; span=(108, 109), match='.'>,
 <re.Match object; span=(80, 81), match='.'>,
 <re.Match object; span=(80, 81), match='.'>,
 <re.Match object; span=(51, 52), match='.'>,
 <re.Match object; span=(75, 89), match=', died 1/2/02.'>,
 <re.Match object; span=(55, 70), match=', died 10/9/01.'>,
 <re.Match object; span=(78, 79), match='.'>]

In [81]:
[dod.search(l).groups() for l in examples]

[(None, '.'),
 (', died 9/15/01', '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (', died 1/2/02', '.'),
 (', died 10/9/01', '.'),
 (None, '.')]

In [83]:
[(i, l) for i, l in enumerate(grouped_lines) if not dod.search(l)]

[]

In [84]:
[dod.search(l).groups() for l in grouped_lines]

[(None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (', died 9/15/01', '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, 

#### Task 2 - Capture the date of death field, as well as everything before.

Adapt your work from the last task to capture not only the date of death field, but also everything before it.

**Hints:**

* Use greedy wild-cards `.*` and/or `.+` to grab as much as possible.
* Use commas to anchor the parts, e.g. `(pat1), (pat2)`.

In [11]:
# Your code here
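
One possible approach is sketched below; it is an assumption about how the task could be solved, not the lab's official key, and `dod_plus` is a hypothetical name. Because the date field is optional, a greedy prefix would swallow it, so this sketch uses a lazy prefix anchored with `^` and `$`.

```python
import re

# Assumed pattern: lazy prefix, optional ", died m/d/yy", optional final period.
dod_plus = re.compile(r'^(.+?)(?:, died (\d{1,2}/\d{1,2}/\d{1,2}))?\.?$')

rows = [
    'Ingeborg Joseph, 53, Marriott guest, World Trade Center, died 10/9/01.',
    'Canfield D. Boone, ??, United States Army, Pentagon.',
]
for row in rows:
    # Group 1 is everything before the date; group 2 is the date or None.
    print(dod_plus.search(row).groups())
```

One caveat of this design: the trailing period is always dropped from the first group, so a row ending in an abbreviation such as `Pa.` loses that period as well.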

## Problem 5 -- Working with passenger data

Notice that 

1. Passengers on the flights have two extra fields: passenger status and flight
2. Other victims are missing these fields.

In this problem, we will build a regular expression to match these fields and use this expression to split the data.  In the process, we will be able to add the missing fields to the other rows.

#### Task 1 - Make an expression that matches the passenger status.

Make a regular expression that matches and extracts the passenger status field.  This expression should match all lines, returning `None` for the other rows.

**Hints:**

* `(p1|p2)` allows you to match `p1` or `p2`.   
* `?` allows you to match optional patterns

In [None]:
# Your code here.
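
A possible sketch for this task follows. It assumes the only status values are `Passenger` and `Flight Crew`, the two that appear in the example lines; check the full file before relying on that. It also illustrates a pitfall: an unanchored optional group happily matches the empty string at position 0, so the pattern needs `^` and `$` to force the optional part to be tried at the right place.

```python
import re

rows = [
    'Alona Abraham, 30, Ashdod, Israel, Passenger, United 175, World Trade Center.',
    'Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.',
    'Canfield D. Boone, ??, United States Army, Pentagon.',
]

# Pitfall: without anchors, the optional group is skipped at position 0.
naive = re.compile(r'(Passenger|Flight Crew)?')
print(naive.search(rows[0]).group(1))  # None

# Assumed pattern: lazy prefix, then an optional status field, then the rest.
status = re.compile(r'^.*?(?:, (Passenger|Flight Crew),.*)?$')
print([status.search(r).group(1) for r in rows])
# ['Passenger', 'Flight Crew', None]
```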

#### Task 2 - Capture the flight field

Make a regular expression that matches and extracts the flight field.  This expression should match all lines, returning `None` for the other rows.

**Hints:**

* Look through the data file to identify the airlines.
* Use `\d` to match digits
* `(p1|p2)` allows you to match `p1` or `p2`.   
* `?` allows you to match optional patterns

In [None]:
# Your code here.
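
The same anchoring idea can be sketched for the flight field. The airline list here is an assumption: only `United` appears in the example lines, and `American` is included on the guess that it occurs elsewhere in the file, so verify both against the data. Requiring digits after the airline name keeps employers such as `United States Army` from matching.

```python
import re

rows = [
    'Alona Abraham, 30, Ashdod, Israel, Passenger, United 175, World Trade Center.',
    'Christian Adams, 37, Passenger, United 93, Shanksville, Pa.',
    'Canfield D. Boone, ??, United States Army, Pentagon.',
]

# Assumed pattern: an airline name, a space, and a one- to three-digit number.
flight = re.compile(r'^.*?(?:, ((?:United|American) \d{1,3}),.*)?$')
print([flight.search(r).group(1) for r in rows])
# ['United 175', 'United 93', None]
```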

#### Task 3 - Combine the last two expressions

Now combine the last two expressions to capture the two flight fields, as well as all content before and after these fields.

In [None]:
# Your code here.
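
One way the combination might look is sketched below, again as an assumption rather than the lab's key. The status values and airline names are the ones guessed from the example lines (`American` does not appear in them), and `flight_fields` is a hypothetical name.

```python
import re

# Assumed pattern: (before)(, status, flight, after)? with the whole
# flight block optional so that every row still matches.
flight_fields = re.compile(
    r'^(.*?)'
    r'(?:, (Passenger|Flight Crew), ((?:United|American) \d{1,3}), (.*))?$'
)

rows = [
    'Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.',
    'Canfield D. Boone, ??, United States Army, Pentagon.',
]
for row in rows:
    print(flight_fields.search(row).groups())
```

For rows without flight fields, the whole line lands in the first group and the other three groups are `None`, which marks exactly where the missing fields would need to be filled in.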