# Class 8 - Regular Expressions

+ Recall: regular expressions are a mini-language, supported by many programming languages, for working with strings; they are good for moving from unstructured data to structured data.

Last class, we encountered a **difficult problem** that we were attempting to solve using regular expressions.

In [1]:
from urllib.request import urlretrieve

# urlretrieve() allows us to download a file by passing it the URL where it is located.
# This function will save your downloaded file wherever you are running your Jupyter notebook

urlretrieve("https://raw.githubusercontent.com/ledeprogram/courses/master/databases/data/enronsubjects.txt", "enronsubjects.txt")

('enronsubjects.txt', <http.client.HTTPMessage at 0x10cce5470>)

In [2]:
# Remember, we can run our usual command line commands from Jupyter notebook if we add a '!' in front of them
# This reads the first 10 lines of our file:

!head -10 enronsubjects.txt

# This file contains the subject lines from every message in the EnronSent corpus.
# For more information, see http://verbs.colorado.edu/enronsent

Headcount
utilities roll
utilities roll
TIME SENSITIVE: Executive Impact & Influence Program Survey
TIME SENSITIVE: Executive Impact & Influence Program Survey
Wow
Wow


In [7]:
# open().readlines() opens the file that we pass it, and "reads" each line as a string
# It knows it has finished a line of text when it encounters the '\n' newline character
# .readlines() is awesome, but it keeps the newline character in your strings

# In our list comprehension, we ask it to "strip" each string of whitespace
# Because it's a list comprehension, it evaluates to a list which we are saving in our variable 'subjects'

subjects = [line.strip() for line in open("enronsubjects.txt").readlines()]
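
A slightly more defensive way to do the same thing, as a sketch: a `with` block closes the file automatically once we are done reading. The helper name `read_subjects` and the small demo file are illustrative assumptions, not part of the class notes.

```python
import os
import tempfile

def read_subjects(path):
    """Return the lines of a text file with surrounding whitespace stripped."""
    # The 'with' block closes the file for us, even if an error occurs.
    with open(path) as f:
        # Looping over a file object yields one line at a time (newline included).
        return [line.strip() for line in f]

# Demo on a small throwaway file so the example runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Headcount\nutilities roll\nWow\n")
    demo_path = tmp.name

demo_subjects = read_subjects(demo_path)
os.remove(demo_path)
print(demo_subjects)
```

With the real file, `read_subjects("enronsubjects.txt")` should give the same list as the comprehension above.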


In [8]:
subjects[:10]

['# This file contains the subject lines from every message in the EnronSent corpus.',
 '# For more information, see http://verbs.colorado.edu/enronsent',
 '',
 'Headcount',
 'utilities roll',
 'utilities roll',
 'TIME SENSITIVE: Executive Impact & Influence Program Survey',
 'TIME SENSITIVE: Executive Impact & Influence Program Survey',
 'Wow',
 'Wow']

In [9]:
# Evaluates to a list of items from our 'subjects' list that start with 'Hi!'

[line for line in subjects if line.startswith("Hi!")]

['Hi!',
 'Hi!',
 'Hi!!!!',
 'Hi!!!',
 'Hi!!!',
 'Hi!!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!!  How are you?',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!',
 'Hi!']

In [10]:
import re

In [11]:
# Evaluates to a list of items from our 'subjects' list that have the string "shipping" somewhere in the subject
# IMPORTANT: 'line' is a TEMPORARY VARIABLE
# Temporary variables only exist in their context (below, 'line' only exists inside this list comprehension)

[line for line in subjects if re.search("shipping", line)]

['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping']
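
Why does `re.search()` work inside the `if`? A minimal sketch on a few made-up sample strings (the list `sample_subjects` is an assumption for illustration): `re.search()` returns a match object when the pattern is found and `None` when it is not, and Python treats those as true and false respectively.

```python
import re

sample_subjects = ["Re: lng shipping", "Headcount", "Online shopping"]

hit = re.search("shipping", sample_subjects[0])
miss = re.search("shipping", sample_subjects[1])

print(hit is not None)  # a match object was returned, so the 'if' passes
print(miss)             # None, so the 'if' fails

# The same filter as above, on the small sample list
matching = [line for line in sample_subjects if re.search("shipping", line)]
print(matching)
```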

# Metacharacters

+ Special characters that you can use in regular expressions that have a special meaning.
+ They stand for multiple different characters

### Character Classes

+ `.` : any character (except a newline, by default)
+ `\w`: any alphanumeric character or underscore (a-z, A-Z, 0-9, _)
+ `\s`: any whitespace character (space, tab `\t`, or newline `\n`)
+ `\S`: any non-whitespace character
+ `\d`: any single digit from 0-9
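
To see these classes in action, here is a small sketch on made-up strings (the `samples` list and the pattern names are assumptions for illustration). The patterns are written as raw strings with an `r` prefix, which is explained further down in these notes.

```python
import re

samples = ["a1", "a b", "a\tb", "ab", "a_"]

patterns = {
    "dot": r"a.b",         # 'a', then any one character, then 'b'
    "digit": r"\d",        # any single digit
    "space": r"\s",        # any whitespace character
    "word_pair": r"\w\w",  # two word characters in a row
}

# For each pattern, collect the sample strings in which it is found.
found = {name: [s for s in samples if re.search(p, s)] for name, p in patterns.items()}

for name, hits in found.items():
    print(name, hits)
```

Note that `"a_"` is found by `\w\w`, because the underscore counts as a word character.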

In [12]:
# The dot in regular expressions represents ANY one single character. 

# For example, this would return strings with either "shipping" or "shopping"
[line for line in subjects if re.search("sh.pping", line)]

['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 "FW: We've been shopping!",
 'Re: Start shopping...',
 'Start shopping...',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'FW: Online shopping',
 'Online shopping']

In [13]:
# Subjects that contain a time, e.g., 5:52pm or 12:06am
# It's looking for this particular sequence of things: [digit] : [digit][digit][alphanumeric] m
# The ':' and 'm' are "hard-coded" in the sense that re.search() is literally searching for ':' and 'm'
# Remember that it's reading character by character!
# Note that this returns 12:06am as well as 2:06am even though we explicitly asked for only one `\d` before the ':'
# This occurs because '12:06am' contains '2:06am' (and '2:06am' matches what we requested)

[line for line in subjects if re.search("\d:\d\d\wm", line)]

['RE: 3:17pm',
 '3:17pm',
 "RE: It's On!!! - 2:00pm Today",
 "FW: It's On!!! - 2:00pm Today",
 "It's On!!! - 2:00pm Today",
 'Re: Registration Confirmation: Larry Summers on 12/6 at 1:45pm (was',
 'Re: Conference Call today 2/9/01 at 11:15am PST',
 'Conference Call today 2/9/01 at 11:15am PST',
 '5/24 1:00pm conference call.',
 '5/24 1:00pm conference call.',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 '07:33am EDT 15-Aug-01 Prudential Securities (C',
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Updated Mar'00 Requirements Received at 11:25am from CES",
 'Reminder: Legal Team Meeting -- Friday, 9:00am Houston time',
 'Thursday, March 7th 1:30-3:00pm: REORIENTATION',
 'Meeting at 2:00pm Friday',
 'Meeting at 2:00pm Friday',
 'Fw: 12:30pm Deadline for changes to letters or
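
To check which part of a string actually matched, we can look at the match object itself. This is a small sketch that goes slightly beyond the cell above: it uses the match object's `.group()` method on a made-up subject line.

```python
import re

time_pattern = r"\d:\d\d\wm"
match = re.search(time_pattern, "Call at 12:06am")

# .group() returns the exact piece of the string that the pattern matched
matched_text = match.group()
print(matched_text)
```

The leading `1` is not part of the match: the pattern only asks for one digit before the colon, so the match starts at the `2`.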

In [14]:
# How do we search for an actual period if that is one of our special characters?

# We use `\.` so that it knows we mean an actual period
# This returns strings that have at least 5 periods in a row
[line for line in subjects if re.search('\.\.\.\.\.', line)]

['Re: Hmmmmm........',
 'Hmmmmm........',
 'RE: hum......free fall in wti started a little early',
 'hum......free fall in wti started a little early',
 'RE: hum......free fall in wti started a little early',
 'RE: hum......free fall in wti started a little early',
 'RE: hum......free fall in wti started a little early',
 'hum......free fall in wti started a little early',
 'RE: Leaving Enron.....',
 'Leaving Enron.....',
 'Fwd: Football season is here.....this one is terrible, nonetheless,',
 'Fwd: Football season is here.....this one is terrible, nonetheless,',
 'Football season is here.....this one is terrible, nonetheless, it',
 'Re: Just a little something to make you smile.......',
 'Just a little something to make you smile.......',
 'Just a little something to make you smile.......',
 'Re: Congratulations, etc...................',
 'Congratulations, etc...................',
 "Re: Fw: it ain't easy.....",
 'FW: Message from Boeing.......',
 'FW: Message from Boeing.......',
 'FW

In [15]:
# Find all of the subject lines that have dates in them, e.g., 12/01/99

[line for line in subjects if re.search("\d\d/\d\d/\d\d", line)]

["Enron's December physical fixed price deals as of 11/28/00",
 "Enron's December physical fixed price deals as of 11/28/00",
 "FW: Enron' s August Baseload Physical Fixed Price Transactions as of 07/27/01",
 "Enron' s August Baseload Physical Fixed Price Transactions as of 07/27/01",
 "FW: Enron' s August Baseload Physical Fixed Price Transactions as of 07/27/01",
 "Enron' s August Baseload Physical Fixed Price Transactions as of 07/27/01",
 'FW: FERC Special Meetings on Friday 10/26/01 and Monday 10/29/01',
 'FERC Special Meetings on Friday 10/26/01 and Monday 10/29/01',
 'RE: Confirmation: Risk Management Simulation Meeting 10/30/01',
 'Confirmation: Risk Management Simulation Meeting 10/30/01',
 'RE: Confirmation: Risk Management Simulation Meeting 10/30/01',
 'RE: Confirmation: Risk Management Simulation Meeting 10/30/01',
 'ACCESS Trades for 11/09/00',
 'ACCESS Trades for 11/09/00',
 'ACCESS Trades 11/03/00',
 'ACCESS Trades 11/03/00',
 'FW: Enron Mentions - 06/04/01',
 'Enron Me

In [None]:
# Only things that refer to the 2000s
# Hard-coding a 0 into our MM/DD/YY structure as MM/DD/0Y

[line for line in subjects if re.search("\d\d/\d\d/0\d", line)]

In [16]:
# Only things that refer to June
[line for line in subjects if re.search("6/\d\d/\d\d", line)]

['FW: Enron Mentions - 06/04/01',
 'Enron Mentions - 06/04/01',
 'FW: Bullets for 6/28/01',
 '=09FW: Bullets for 6/28/01',
 '=09Bullets for 6/28/01',
 'Re: HR Floor Meeting - Friday, 6/30/00',
 'Hours - w/c 6/26/00',
 'Re: REQUEST FOR 6/30/00 STOCK HOLDINGS',
 'Doyle Update 6/23/00',
 'Doyle Update 6/23/00',
 'NG Resources Meeting 6/13/00',
 'NG Resources Meeting 6/13/00',
 'FW: Eastrans Nomination for 6/01/01',
 'Eastrans Nomination for 6/01/01',
 'FW: ERCOT 6/14/01',
 'ERCOT 6/14/01',
 'RE: ERCOT 6/20/01',
 'ERCOT 6/20/01',
 'RE: ERCOT 6/27/01',
 'ERCOT 6/27/01',
 'ISDA Press Report, 6/15/00',
 'Re: State Bar of Michigan e-Journal - 6/23/99',
 'RE: Calendar as of 6/26/01',
 'Calendar as of 6/26/01',
 'RE: Ryan Thomas Interview 6/19/01',
 'FW: Ryan Thomas Interview 6/19/01',
 'Ryan Thomas Interview 6/19/01',
 'EOL / Credit / GCP Responses 6/12/00',
 'EOL / Credit / GCP Responses 6/13/00',
 'EOL / Credit / GCP Responses 6/14/00',
 'EOL / Credit / GCP Responses 6/15/00',
 'EOL / Credit 

### Regular expressions allow us to define our own character classes (user-defined character classes).

+ Inside your regular expression, write your own "class" by listing the characters you want between square brackets
+ Character classes we've already encountered include digits, alphanumeric characters, etc.
+ For example, you can create a character class of vowels: `re.search("[aeiou][aeiou][aeiou][aeiou]", line)`
    + This says: search for something that has any one of the characters [aeiou], followed by [aeiou], followed by [aeiou], followed by [aeiou].
    + Each [aeiou] still searches for a SINGLE character.
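
A quick sketch of user-defined classes on made-up strings (the `samples` list is an assumption for illustration):

```python
import re

samples = ["Hi!", "hi!!", "Gooood Bye!", "Headcount"]

# [hH] matches exactly one character: either 'h' or 'H'
greetings = [s for s in samples if re.search(r"[hH]i!", s)]

# Four vowels in a row; each [aeiou] matches a single character
vowel_runs = [s for s in samples if re.search(r"[aeiou][aeiou][aeiou][aeiou]", s)]

print(greetings)
print(vowel_runs)
```

"Headcount" is not in `vowel_runs`: it contains vowels, but never four of them in a row.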

In [17]:
[line for line in subjects if re.search("[aeiou][aeiou][aeiou][aeiou]", line)]

['Re: Natural gas quote for Louiisiana-Pacific (L-P)',
 'WooooooHoooooo more Vacation',
 'Re: Clickpaper Counterparties waiting to clear the work queue',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'FW: The Osama Bin Laden Song ( Soooo Funny !! )',
 'Fw: The Osama Bin Laden Song ( Soooo Funny !! )',
 'The Osama Bin Laden Song ( Soooo Funny !! )',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: FPL Queue positions 1-15',
 'Re: FPL Queue positions 1-15',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'Re: yeeeeha',
 'yeeeeha',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo

In [18]:
# Our forwards have both "FW:" and "Fw:" in the subject line; we need to search for both to get the forwards
# We can make a user-defined character class `[wW]`

[line for line in subjects if re.search("F[wW]:", line)]

['Re: FW: Trading Track Program',
 'Re: FW: 2nd lien info. and private lien info - The Stage Coach',
 'Re: FW: SanJuan/SoCal spread prices',
 'FW: ALL 1099 TAX QUESTIONS - ANSWERED',
 'FW: ALL 1099 TAX QUESTIONS - ANSWERED',
 'FW: Cross Commodity',
 'FW: Cross Commodity',
 'Re: FW: Change in the agroup Cycling Schedule',
 'FW: fixed forward or other Collar floor gas price terms',
 'FW: fixed forward or other Collar floor gas price terms',
 'Re: FW: fixed forward or other Collar floor gas price terms',
 'FW: charts',
 'FW: charts',
 'FW: Bishops Corner',
 'FW: Western Wholesale Activities - Gas & Power Conf. Call',
 'FW: Western Wholesale Activities - Gas & Power Conf. Call',
 'FW: charts',
 'FW: NEWGen June Release',
 'FW: Crossroads Storage Project',
 'FW: Crossroads Storage Project',
 'FW: Meeting to discuss West gas desk "FERC messages"',
 'FW:',
 'FW:',
 'FW: The Stage',
 'FW: Goldman Comment re: Enron issued this morning - Revised Price',
 'RE: FW: The Stage',
 'Re: FW: The Stage'

In [19]:
# An updated version of our time regular expression
[line for line in subjects if re.search("\d:[012345]\d[apAP][mM]", line)]

['RE: 3:17pm',
 '3:17pm',
 "RE: It's On!!! - 2:00pm Today",
 "FW: It's On!!! - 2:00pm Today",
 "It's On!!! - 2:00pm Today",
 'Re: Registration Confirmation: Larry Summers on 12/6 at 1:45pm (was',
 'Re: Conference Call today 2/9/01 at 11:15am PST',
 'Conference Call today 2/9/01 at 11:15am PST',
 '5/24 1:00pm conference call.',
 '5/24 1:00pm conference call.',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 '07:33am EDT 15-Aug-01 Prudential Securities (C',
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Updated Mar'00 Requirements Received at 11:25am from CES",
 'Reminder: Legal Team Meeting -- Friday, 9:00am Houston time',
 'Re: Are you going to be back for your meeting w/M.Becker @ 3:30PM?',
 'Thursday, March 7th 1:30-3:00pm: REORIENTATION',
 'Meeting at 2:00pm Friday',
 'Meet

### Anchors

+ This is another type of metacharacter (character classes were one type)
+ They "anchor" the search to a particular part of a string
+ **Useful anchors**:

    + `^` : beginning of string
    + `$` : end of string 
    + `\b` : word boundary
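
Here is a small sketch of the three anchors on made-up strings (the `samples` list is an assumption for illustration). The patterns use raw strings with an `r` prefix, which matters for `\b` and is explained later in these notes.

```python
import re

samples = [
    "New York Details",
    "Talk in New York",
    "try this one...",
    "foil prices",
    "Forward oil prices",
]

# ^ anchors the pattern to the beginning of the string
starts_with_ny = [s for s in samples if re.search(r"^New York", s)]

# $ anchors the pattern to the end of the string
ends_with_dots = [s for s in samples if re.search(r"\.\.\.$", s)]

# \b requires a word boundary, so 'foil' does not count as 'oil'
whole_word_oil = [s for s in samples if re.search(r"\boil\b", s)]

print(starts_with_ny)
print(ends_with_dots)
print(whole_word_oil)
```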

In [20]:
# The old us:
[line for line in subjects if re.search("New York", line)]

['RE: New York Details',
 'New York Details',
 'Re: Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Re: Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Fwd: The New York Times - Governor Pledges to Save California From',
 'Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Re: Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Please Take a Look--Draft Slides for Monday Talk in New York on California Crisis',
 'Please Take a Look--Draft Slides for Monday Talk in New York on California Crisis',
 'Re: Please Take a Look--Draft Slides for Monday Talk in New York on',
 'Please Take a

In [21]:
# The new us:
# We want strings that BEGIN with New York

[line for line in subjects if re.search("^New York", line)]

['New York Details',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York',
 'New York',
 'New York',
 'New York, etc.',
 'New York, etc.',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York',
 'New York',
 'New York City Marathon Guaranteed Entry',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas ("NYSEG")',
 'New York regulatory restriccions',
 'New York regulatory restriccions',
 'New York Bar Numbers']

In [22]:
[line for line in subjects if re.search("^[nN]ew [yY]ork", line)]

['New York Details',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York',
 'New York',
 'New York',
 'New York, etc.',
 'New York, etc.',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York',
 'New York',
 'New York City Marathon Guaranteed Entry',
 'new york rest reviews',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas ("NYSEG")',
 'New York regulatory restriccions',
 'New York regulatory restriccions',
 'New York Bar Numbers']

In [23]:
# Lines that end with an ellipsis '...'
[line for line in subjects if re.search("\.\.\.$", line)]

['Re: Inquiry....',
 'Re: Inquiry....',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'Re: Hmmmmm........',
 'Hmmmmm........',
 'FW: Bumping into the husband....',
 'FW: Bumping into the husband....',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 '

In [24]:
[line for line in subjects if re.search("!!!!!$", line)]

['FW: The today show!!!!!',
 'FW: The today show!!!!!',
 'Re: Yeah Monkey!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 'Yeah Monkey!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 'RE: RE: Whats up!!!!!',
 'RE: RE: Whats up!!!!!',
 'RE: RE: Whats up!!!!!',
 'FW: RE: Whats up!!!!!',
 'RE:RE: Whats up!!!!!',
 'RE: RE: Whats up!!!!!',
 'FW: RE: Whats up!!!!!',
 'RE:RE: Whats up!!!!!',
 "RE: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "RE: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "Re: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "RE: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "Re: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "RE: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "RE: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "Re: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "RE: I don't know Brad Horn!!!!!!!!!!!!!!!",
 "Re: I don't know Brad Horn!!!!!!!!!!!!!!!",
 'Re: Mark Your Calendar!!!!!!!',
 'Re: Friday is the last day for purchasing Discount Ski Tickets!!!!!',
 'Re: lunch!!!!!',
 'lunch!!!!!',
 "FW: HELP!!! I'VE FAINTED AND I CAN'T COME TO!!

In [28]:
# Word boundaries exist between a word character (letter, digit, or underscore) and a non-word character (space, punctuation), or at the start/end of the string
# Find subject lines that contain the word "oil" (but we don't want foil, soil, boil, etc.)
# Why can't we just do:

[line for line in subjects if re.search(" oil ", line)]

# This wouldn't match if 'oil' were at the start or end of the string, or next to punctuation, e.g., '(oil)' or 'oil-power'

['Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'how to go forward in the oil markets',
 'how to go forward in the oil markets']

In [27]:
[line for line in subjects if re.search(r"\boil\b", line)]

['Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Re: exploration data as the root of the energy (oil) supply chain',
 'exploration data as the root of the energy (oil) supply chain and',
 'Draft term sheet for oil-power spread option pruchase from FPL',
 'Re: Draft term sheet for oil-power spread option pruchase from FPL',
 'Draft term sheet for oil-power spread option pruchase from FPL',
 'how to go forward in the oil markets

### Why did we add 'r' outside our regular expression above?

### Aside: metacharacters and escape characters

In [29]:
x = "this is\na test"
print(x)

this is
a test


In [30]:
x = "this is\t\tanother test"
print(x)

this is		another test


These are called **escape characters**!

Python interprets escape characters as: 

+ `\n` : new line character
+ `\t` : tab
+ `\\` : backslash

Whenever Python sees a backslash followed by a character that isn't on its escape characters [list](https://docs.python.org/2/reference/lexical_analysis.html), it assumes you actually mean to write a backslash and leaves it in the string.

`\b` happens to be both a metacharacter in the world of regular expressions AND a Python escape character (backspace). Therefore, we had to tell Python explicitly to leave the backslash alone by adding `r` to the beginning of the string that holds our regular expression.

In [31]:
# ASCII backspace
print("hello there\b\b\b\bhi")

hello therehi


In [32]:
[line for line in subjects if re.search("\boil\b", line)]

[]

In [35]:
# We can add an additional \ in front of `\b` to indicate that
# we mean a literal backslash followed by 'b' rather than the backspace escape character

[line for line in subjects if re.search("\\boil\\b", line)]

['Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Re: exploration data as the root of the energy (oil) supply chain',
 'exploration data as the root of the energy (oil) supply chain and',
 'Draft term sheet for oil-power spread option pruchase from FPL',
 'Re: Draft term sheet for oil-power spread option pruchase from FPL',
 'Draft term sheet for oil-power spread option pruchase from FPL',
 'how to go forward in the oil markets

In [33]:
# If you don't know whether something is a special character, you should probably add an extra \
# BUT Python has a fix: add 'r' in front of your string literal (Python interprets this as a 'raw string')
# This way, Python knows to leave it alone and not interpret anything inside it as an escape character

[line for line in subjects if re.search(r"\boil\b", line)]

['Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Period after commissioning on oil - PPA availability penalties',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Re: Forward oil prices',
 'Re: Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Forward oil prices',
 'Re: exploration data as the root of the energy (oil) supply chain',
 'exploration data as the root of the energy (oil) supply chain and',
 'Draft term sheet for oil-power spread option pruchase from FPL',
 'Re: Draft term sheet for oil-power spread option pruchase from FPL',
 'Draft term sheet for oil-power spread option pruchase from FPL',
 'how to go forward in the oil markets

So, for *r*egular expressions, put an `r` in front of the string.
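
To make the difference concrete, this short sketch compares the three ways of writing the pattern (the variable names are just for illustration):

```python
plain = "\boil\b"      # Python turns each \b into a single backspace character
raw = r"\boil\b"       # raw string: each backslash and 'b' stay as two characters
doubled = "\\boil\\b"  # each \\ becomes one literal backslash

print(len(plain))      # 5 characters: backspace, o, i, l, backspace
print(len(raw))        # 7 characters: \, b, o, i, l, \, b
print(raw == doubled)  # True: both hand the same text to re.search()
```

Only `raw` and `doubled` reach the regular expression engine with `\b` intact, which is why the plain version returned an empty list.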

In [36]:
# Finds exactly three periods in a row with a word character on each side (e.g., 'facility...finally'):
[line for line in subjects if re.search(r"\b\.\.\.\b", line)]

['Re: credit facility...finally',
 'credit facility...finally',
 'Re: AWESOME THANKS FOR INPUT 7...I AWAIT THE REST',
 'You Godfather is calling upon you for a favor...check your voice',
 'Trader did not press button to migrate add book...incl.',
 'Re: Virginia Natural Gas...Columbia Gas',
 'Re: Virginia Natural Gas...Columbia Gas',
 'Virginia Natural Gas...Columbia Gas',
 'Re: Virginia Natural Gas...Columbia Gas',
 'Re: Virginia Natural Gas...Columbia Gas',
 'Virginia Natural Gas...Columbia Gas',
 'RE: revised htl date...now Sept 13',
 'revised htl date...now Sept 13',
 'FW: FW: I am not ashamed to pass this on...Are you?',
 'FW: FW: I am not ashamed to pass this on...Are you?',
 'FW: FW: I am not ashamed to pass this on...Are you?',
 'FW: FW: I am not ashamed to pass this on...Are you?',
 "Fw: If u delete this...u seriously don't have a heart!",
 "Fw: If u delete this...u seriously don't have a heart!",
 "Fw: If u delete this...u seriously don't have a heart!",
 "Fw: If u delete this

In [38]:
[line for line in subjects if re.search(r"\bregulation\b", line)]

['Re: Analysis by Academics----Why De-regulation is the right policy',
 'De-regulation Project',
 'Messages Regarding Recent Interest in the History of De-regulation']

In [39]:
[line for line in subjects if re.search(r"\banti", line)]

["C.H. Guernsey & Company's antitrust links",
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'Re: Stopped anti-energy amendment!!!',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign',
 'FW: FW: Canadian Contribution to the anti-terrorist campaign']

### Quantifiers

+ Our third class of metacharacters. 

        {n}     matches exactly n times
        {n,m}   matches at least n times, but no more than m times
        {n,}    matches at least n times, with no upper limit
        +       matches at least once
        *       matches zero or more times
        ?       matches zero times or one time
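
A small sketch of each quantifier on made-up strings (the `samples` list and the helper `matching` are assumptions for illustration). The patterns are anchored with `^` and `$` so that the whole string has to fit; without the anchors, `re.search()` would be happy to find, say, two `!` inside a longer run of them.

```python
import re

samples = ["Hi", "Hi!", "Hi!!", "Hi!!!!"]

def matching(pattern):
    """Return the sample strings in which the pattern is found."""
    return [s for s in samples if re.search(pattern, s)]

exactly_two = matching(r"^Hi!{2}$")     # 'Hi' followed by exactly 2 '!'
two_to_four = matching(r"^Hi!{2,4}$")   # between 2 and 4 '!'
at_least_two = matching(r"^Hi!{2,}$")   # 2 or more '!'
one_or_more = matching(r"^Hi!+$")       # at least one '!'
zero_or_more = matching(r"^Hi!*$")      # any number of '!', including none
zero_or_one = matching(r"^Hi!?$")       # no '!' or exactly one

print(exactly_two)
print(two_to_four)
print(at_least_two)
print(one_or_more)
print(zero_or_more)
print(zero_or_one)
```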

In [40]:
[line for line in subjects if re.search(r"[A-Z]{15,}", line)]

['CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: FW: Fw: Fw: Fw: Fw: Fw: Fw: PLEEEEEEEEEEEEEEEASE READ!',
 'ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'Re: FW: FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'Re: CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: ORDER ACKNOWLEDGEMENT',
 'ORDER ACKNOWLEDGEMENT',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: CONGRATULATIONS !',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'Re: FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAA

In [41]:
# 4 vowels in a row
[line for line in subjects if re.search(r"[aeiou]{4}", line)]

['Re: Natural gas quote for Louiisiana-Pacific (L-P)',
 'WooooooHoooooo more Vacation',
 'Re: Clickpaper Counterparties waiting to clear the work queue',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'FW: The Osama Bin Laden Song ( Soooo Funny !! )',
 'Fw: The Osama Bin Laden Song ( Soooo Funny !! )',
 'The Osama Bin Laden Song ( Soooo Funny !! )',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: FPL Queue positions 1-15',
 'Re: FPL Queue positions 1-15',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'Re: yeeeeha',
 'yeeeeha',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo

In [44]:
# Some subjects have FW: (or Fw:) and others have Fwd:
# This expression searches, at the start of the string, for F followed by w or W, then either 0 or 1 d, then a colon.
[line for line in subjects if re.search(r"^F[wW]d?:", line)]

['FW: ALL 1099 TAX QUESTIONS - ANSWERED',
 'FW: ALL 1099 TAX QUESTIONS - ANSWERED',
 'FW: Cross Commodity',
 'FW: Cross Commodity',
 'FW: fixed forward or other Collar floor gas price terms',
 'FW: fixed forward or other Collar floor gas price terms',
 'FW: charts',
 'FW: charts',
 'FW: Bishops Corner',
 'FW: Western Wholesale Activities - Gas & Power Conf. Call',
 'FW: Western Wholesale Activities - Gas & Power Conf. Call',
 'FW: charts',
 'FW: NEWGen June Release',
 'FW: Crossroads Storage Project',
 'FW: Crossroads Storage Project',
 'FW: Meeting to discuss West gas desk "FERC messages"',
 'FW:',
 'FW:',
 'FW: The Stage',
 'FW: Goldman Comment re: Enron issued this morning - Revised Price',
 'FW: California gas intrastate matters',
 'FW: El Paso Announces Binding Open Season for Additional Capacity',
 'FW: California gas intrastate matters - July 11 conference call',
 'FW: West Power Strategy Briefing',
 'FW:',
 'FW: Party',
 'FW: CA Instrate Gas matters',
 'FW: American Express Let

**Quantifiers apply to the single character or character class immediately before them.**
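
A short sketch of what that means in practice, on made-up strings. The last line wraps two characters in a `(?:...)` group so the quantifier covers both; that grouping syntax is the same one used for alternation further down.

```python
import re

# '+' applies only to the 'a' right before it, not to 'ha' as a whole
single_char = re.search(r"^ha+$", "haaa") is not None
not_repeated_word = re.search(r"^ha+$", "haha") is not None

# Wrapping 'ha' in a group makes the quantifier apply to the whole group
grouped = re.search(r"^(?:ha)+$", "haha") is not None

print(single_char)        # True
print(not_repeated_word)  # False
print(grouped)            # True
```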

In [47]:
# Match all the lines that have "news" or "News" followed by any number of characters and that ends with ! 
# We say any number of characters using '.*'
[line for line in subjects if re.search(r"[nN]ews.*!$", line)]

['RE: Christmas Party News!',
 'FW: Christmas Party News!',
 'Christmas Party News!',
 'Good News!',
 'Good News--Twice!',
 'Re: VERY Interesting News!',
 'Great News!',
 'Re: Great News!',
 'News Flash!',
 'RE: News Flash!',
 'RE: News Flash!',
 'News Flash!',
 'RE: Good News!',
 'RE: Good News!',
 'RE: Good News!',
 'RE: Good News!',
 'Good News!',
 'RE: Good News!!!',
 'Good News!!!',
 'RE: Big News!',
 'Big News!',
 'Fw: Newspaper Articles -- Not About the Election!',
 'Fw: Newspaper Articles -- Not About the Election!',
 'FW: Newspaper Articles -- Not About the Election!',
 'Fw: Newspaper Articles -- Not About the Election!',
 'Fw: Newspaper Articles -- Not About the Election!',
 'FW: Newspaper Articles -- Not About the Election!',
 'Newsletter: EuroFlash!!',
 'Individual.com - News From a Friend!',
 'Individual.com - News From a Friend!',
 'Re: Individual.com - News From a Friend!',
 'RE: We need news!',
 '=09We need news!',
 'RE: Big News!',
 'FW: Big News!',
 'RE: Big News!',
 

In [48]:
# Subjects that start with Re: or RE: and somewhere in the string include investor or Investor
[line for line in subjects if re.search(r"^R[eE]:.*\b[iI]nvestor", line)]

['RE: Prudential\'s "Investor Weekly" for 10-24-01',
 'Re: Angelides Investor Memo - Timetable Update -- says CPUC vote',
 "RE:  Angelides' Memo to Investors 9/ 25/01",
 'Re: Angelides Investor Memo - Timetable Update -- says CPUC vote',
 'RE: Energy Companies Hit by Investor Fears of Illiquidity',
 'RE: Energy Companies Hit by Investor Fears of Illiquidity',
 'RE: Energy Companies Hit by Investor Fears of Illiquidity',
 'RE: Energy Companies Hit by Investor Fears of Illiquidity',
 'RE: A pleasant thought for long term investors...',
 'RE: A pleasant thought for long term investors...',
 'RE: ETS  - Investor Questions',
 'RE: ETS  - Investor Questions',
 "RE: Moody's Investors Service downgrads Enron",
 'Re: E2 Investor List.xls',
 'Re: E2 Investor List.xls',
 'RE: Investor Letter',
 'RE: Investor List',
 "Re: FW: H2FC Investors' Newsletter - Vol.3 No.1"]

### More metacharacters: alternation

        (?:x|y)     match either x or y
        (?:x|y|z)   match either x, y, or z
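The `(?:...)` wrapper matters because `|` otherwise splits the entire pattern in two. The sketch below, using a made-up `samples` list rather than the Enron data, compares a grouped and an ungrouped version of the same alternation.

```python
import re

# Hypothetical subject lines, invented for this example
samples = ["cat clip", "Amazing Kitten", "catalog update", "kittens!"]

# With the group, both \b anchors apply to every alternative
grouped = [s for s in samples if re.search(r"\b(?:[cC]at|[kK]itten)\b", s)]

# Without the group, the pattern reads as: `\b[cC]at` OR `[kK]itten\b`,
# so "cat" no longer needs a word boundary after it
ungrouped = [s for s in samples if re.search(r"\b[cC]at|[kK]itten\b", s)]

print(grouped)    # ['cat clip', 'Amazing Kitten']
print(ungrouped)  # ['cat clip', 'Amazing Kitten', 'catalog update']
```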

In [53]:
# Search for either cat/Cat or kitten/Kitten
[line for line in subjects if re.search(r"\b(?:[cC]at|[kK]itten)\b", line)]

['Re: FW: cat attack',
 'Re: FW: cat attack',
 'Re: FW: cat attack',
 'Re: FW: cat attack',
 'Fw: Cat clip',
 'Fw: Cat clip',
 'FW: Cat clip',
 'Re: Amazing Kitten',
 'RE: How To Tell Which Cat Ate Your Drugs',
 'FW: How To Tell Which Cat Ate Your Drugs',
 'FW: How To Tell Which Cat Ate Your Drugs',
 "FW: Fw: A cat's tale",
 "Fwd: Fw: A cat's tale",
 'Kim lost her cat this morning',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'Fw: cat clip............',
 'cat clip............',
 'Diary of a Cat',
 'Diary of a Cat',
 'Diary of a Cat

In [55]:
[line for line in subjects if re.search(r"\b(?:energy|oil|electricity)\b", line)]

['FW: Fishingtrip Nat gas /electricity',
 'FW: Fishingtrip Nat gas /electricity',
 'Re: Fwd: California: U.S. energy sec says FERC proposals not',
 'Re: Proposed initiative on energy issue',
 'Davis trying to spin out of energy crisis',
 'Officials criticize energy report',
 'FW: where does our energy come from?',
 'FW: where does our energy come from?',
 'where does our energy come from?',
 'updated energy timeline',
 'EPSA study attributes lower electricity prices to competition',
 "Dan Walters: Blame game over California's energy crisis will",
 'Re: retail competition in electricity',
 'retail competition in electricity',
 'retail competition in electricity',
 'Re: San Francisco Examiner: "California told it must solve energy',
 'Sac Bee, Tues 2/13 Editorial: "Lawmakers failed to respond to energy',
 "Davis' deadlines on energy much easier set than met",
 "Re: Davis' deadlines on energy much easier set than met",
 'Politicians seek shelter as energy Armageddon looms',
 'Re: Dan Walt

## Capturing

Regular expressions allow us to also "pluck" out the thing that matches our pattern (rather than just get a yes/no as above). 

In [56]:
# read the whole corpus as one big string
all_subjects = open("enronsubjects.txt").read()

In [57]:
all_subjects[:1000]

'# This file contains the subject lines from every message in the EnronSent corpus.\n# For more information, see http://verbs.colorado.edu/enronsent\n\nHeadcount\nutilities roll\nutilities roll\nTIME SENSITIVE: Executive Impact & Influence Program Survey\nTIME SENSITIVE: Executive Impact & Influence Program Survey\nWow\nWow\nWow\nWow\nRe:\nRe:  \nRe:\nRE: Receipt of Team Selection Form - Executive Impact & Influence\nRE: Receipt of Team Selection Form - Executive Impact & Influence \nReceipt of Team Selection Form - Executive Impact & Influence\nFYI\nFYI\nRe: Transportation Reports\nRe: Western Gas Market Report -- Draft\nReceipt of Team Selection Form - Executive Impact & Influence\nReceipt of Team Selection Form - Executive Impact & Influence Program\nRe: (No Subject)\nRe: Security Request: CLOG-4NNJEZ has been Denied.\nNew Generation\nNew Generation\nRe: Meeting to discuss 2001 direct expense plan?\nRe: regulatory filing summary\nRe: Evaluation for new trading application\nRe: recei

**`re.findall()` returns the parts of the string that match the regular expression we wrote.**

Search for domain names: `re.findall(r"\b\w+\.(?:com|net|org)\b", all_subjects)`
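To make the difference between the two functions concrete, here is a minimal sketch on a short, made-up string (the `text` variable is a stand-in for a few subject lines, not the real corpus). `re.search()` stops at the first match and hands back a match object, or `None` if nothing matches, while `re.findall()` collects every matching substring.

```python
import re

# Hypothetical text standing in for a few subject lines joined by newlines
text = "Forbes.com story\nYour Amazon.com order\nNo domain here\n"
pattern = r"\b\w+\.(?:com|net|org)\b"

# re.search() stops at the first match and returns a match object (or None)
first = re.search(pattern, text)
first_domain = first.group() if first else None

# re.findall() scans the whole string and returns every matching substring
all_domains = re.findall(pattern, text)

print(first_domain)  # Forbes.com
print(all_domains)   # ['Forbes.com', 'Amazon.com']
```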

In [61]:
# Our old way
# re.search() returns a match object or None, which our `if` treated as True or False
[line for line in subjects if re.search(r"\b\w+\.(?:com|net|org)\b", line)]

['Your Approval is Overdue: Access Request for paul.t.lucci@enron.com',
 'Your Approval is Overdue: Access Request for paul.t.lucci@enron.com',
 'Your Approval is Overdue: Access Request for paul.t.lucci@enron.com',
 'Request Submitted: Access Request for frank.ermis@enron.com',
 'Request Submitted: Access Request for frank.ermis@enron.com',
 'Your Approval is Overdue: Access Request for mike.grigsby@enron.com',
 'Your Approval is Overdue: Access Request for mike.grigsby@enron.com',
 'Your Approval is Overdue: Access Request for barry.tycholiz@enron.com',
 'Forbes.com story',
 'FW: [Cortlandtwines.com] 25% OFF Premium American Wine',
 '[Cortlandtwines.com] 25% OFF Premium American Wine',
 "RE: Match.com - You've Got Mail:",
 'FW: Your Amazon.com order (#002-4083380-7905653): your approval',
 'Your Amazon.com order (#002-4083380-7905653): your approval',
 'Your Order with Ticketmaster.com (6-22069/DAL)',
 'Your Order with Ticketmaster.com (6-22069/DAL)',
 'Concierge.com St. Thomas Overv

In [62]:
# Our new way
# Returns a list of strings of parts that match

re.findall(r"\b\w+\.(?:com|net|org)\b", all_subjects)

['enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'Forbes.com',
 'Cortlandtwines.com',
 'Cortlandtwines.com',
 'Match.com',
 'Amazon.com',
 'Amazon.com',
 'Ticketmaster.com',
 'Ticketmaster.com',
 'Concierge.com',
 'Concierge.com',
 'har.com',
 'har.com',
 'HoustonChronicle.com',
 'HoustonChronicle.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'har.com',
 'Concierge.com',
 'Concierge.com',
 'washingtonpost.com',
 'washingtonpost.com',
 'washingtonpost.com',
 'washingtonpost.com',
 'ESPN.com',
 'ESPN.com',
 'ESPN.com',
 'enron.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'EnronCredit.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'CommodityLogic.com',
 'CommodityLogic.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'enron.com',
 'MarketWatch.com',
 'MarketWatch.com',
 'MarketWatch.com',
 'INSIDER.com',


In [63]:
input_str = "12345-1234 asdf 2741212345 asdfdskghldf 123 asdj 98751"
re.search(r"\b\d{5}\b", input_str)

<_sre.SRE_Match object; span=(0, 5), match='12345'>

In [64]:
re.findall(r"\b\d{5}\b", input_str)

['12345', '98751']
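The `\b` anchors are what keep `2741212345` out of the results. A quick sketch on the same `input_str` shows what happens if they are left off:

```python
import re

input_str = "12345-1234 asdf 2741212345 asdfdskghldf 123 asdj 98751"

# With \b on both sides, only standalone five-digit runs match
with_boundaries = re.findall(r"\b\d{5}\b", input_str)

# Without \b, five-digit chunks inside longer numbers match too
without_boundaries = re.findall(r"\d{5}", input_str)

print(with_boundaries)     # ['12345', '98751']
print(without_boundaries)  # ['12345', '27412', '12345', '98751']
```

Note that `12345-1234` still matches with the anchors in place, because the hyphen is not a word character and so counts as a boundary.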

In [65]:
# All of the places where `New York` occurs, together with the word that follows `New York`

re.findall(r"New York \b\w+\b", all_subjects)

['New York Details',
 'New York Details',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York Times',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York on',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York Times',
 'New York City',
 'New York City',
 'New York City',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Power',
 'New York Mercantile',
 'New York Mercantile',
 'New York Branch',
 'New York City',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York Energy',
 'New York sites',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',


In [66]:
# Remove duplicates by converting the list to a set
set(re.findall(r"New York \b\w+\b", all_subjects))

{'New York Bar',
 'New York Branch',
 'New York City',
 'New York Details',
 'New York Energy',
 'New York Hotel',
 'New York Inc',
 'New York Mercantile',
 'New York Office',
 'New York Power',
 'New York State',
 'New York Times',
 'New York on',
 'New York regulatory',
 'New York sites',
 'New York voice'}

In [69]:
# How many different five-digit numbers (possible zip codes) are mentioned in the subject lines
len(set(re.findall(r"\b\d{5}\b", all_subjects)))

140

In [None]:
# Search for five-digit numbers starting with 9 (a rough proxy for California zip codes)
re.findall(r"\b9\d{4}\b", all_subjects)

In [71]:
# What if we only care about the part that FOLLOWS 'New York'
# Using () in this way is called GROUPING

re.findall(r"New York (\b\w+\b)", all_subjects)

['Details',
 'Details',
 'on',
 'on',
 'on',
 'on',
 'on',
 'on',
 'Times',
 'on',
 'on',
 'on',
 'on',
 'on',
 'on',
 'on',
 'on',
 'Times',
 'Times',
 'Times',
 'Times',
 'Times',
 'Times',
 'Times',
 'City',
 'City',
 'City',
 'Power',
 'Power',
 'Power',
 'Power',
 'Power',
 'Power',
 'Power',
 'Power',
 'Mercantile',
 'Mercantile',
 'Branch',
 'City',
 'Energy',
 'Energy',
 'Energy',
 'Energy',
 'Energy',
 'sites',
 'sites',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'Hotel',
 'City',
 'City',
 'City',
 'City',
 'voice',
 'State',
 'State',
 'State',
 'State',
 'State',
 'State',
 'Inc',
 'Office',
 'Office',
 'regulatory',
 'regulatory',
 'regulatory',
 'regulatory',
 'Bar',
 'Bar']

In [72]:
# Find the two words that follow:
re.findall(r"New York (\b\w+\b) (\b\w+\b)", all_subjects)

[('on', 'California'),
 ('on', 'California'),
 ('on', 'California'),
 ('on', 'California'),
 ('Times', 'Article'),
 ('Power', 'Authority'),
 ('Power', 'Authority'),
 ('Power', 'Authority'),
 ('Power', 'Authority'),
 ('Power', 'Authority'),
 ('Power', 'Authority'),
 ('Power', 'Authority'),
 ('Power', 'Authority'),
 ('Mercantile', 'Exchange'),
 ('Mercantile', 'Exchange'),
 ('Energy', 'Risk'),
 ('Energy', 'Risk'),
 ('Energy', 'Risk'),
 ('Energy', 'Risk'),
 ('Energy', 'Risk'),
 ('City', 'Gets'),
 ('City', 'Gets'),
 ('City', 'Marathon'),
 ('City', 'Marathon'),
 ('voice', 'recorder'),
 ('State', 'Electric'),
 ('State', 'Electric'),
 ('State', 'Electric'),
 ('State', 'Electric'),
 ('State', 'Electric'),
 ('State', 'Electric'),
 ('Office', 'Requests'),
 ('Office', 'Requests'),
 ('regulatory', 'restriccions'),
 ('regulatory', 'restriccions'),
 ('regulatory', 'restriccions'),
 ('regulatory', 'restriccions'),
 ('Bar', 'Numbers'),
 ('Bar', 'Numbers')]

In [75]:
re.findall(r"New York (\b\w+\b \b\w+\b)", all_subjects)

['on California',
 'on California',
 'on California',
 'on California',
 'Times Article',
 'Power Authority',
 'Power Authority',
 'Power Authority',
 'Power Authority',
 'Power Authority',
 'Power Authority',
 'Power Authority',
 'Power Authority',
 'Mercantile Exchange',
 'Mercantile Exchange',
 'Energy Risk',
 'Energy Risk',
 'Energy Risk',
 'Energy Risk',
 'Energy Risk',
 'City Gets',
 'City Gets',
 'City Marathon',
 'City Marathon',
 'voice recorder',
 'State Electric',
 'State Electric',
 'State Electric',
 'State Electric',
 'State Electric',
 'State Electric',
 'Office Requests',
 'Office Requests',
 'regulatory restriccions',
 'regulatory restriccions',
 'regulatory restriccions',
 'regulatory restriccions',
 'Bar Numbers',
 'Bar Numbers']
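The last few cells show that the shape of what `re.findall()` returns depends on how many groups the pattern has. A compact sketch of the three cases, on a made-up `text` string rather than the full corpus:

```python
import re

# Hypothetical text, invented for illustration
text = "New York Power Authority and New York City Marathon"

# No group: each result is the whole match
whole = re.findall(r"New York \b\w+\b", text)

# One group: each result is just the text captured by the group
one_group = re.findall(r"New York (\b\w+\b)", text)

# Two groups: each result is a tuple with one item per group
two_groups = re.findall(r"New York (\b\w+\b) (\b\w+\b)", text)

print(whole)       # ['New York Power', 'New York City']
print(one_group)   # ['Power', 'City']
print(two_groups)  # [('Power', 'Authority'), ('City', 'Marathon')]
```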

In [76]:
from collections import Counter
c = Counter(re.findall(r"\b9\d{4}\b", all_subjects))
c.most_common(10)

[('93836', 26),
 ('90593', 3),
 ('96731', 2),
 ('93481', 2),
 ('92886', 2),
 ('96724', 2),
 ('93394', 1),
 ('93871', 1),
 ('94074', 1)]

### Using re.search() to capture

In [77]:
src = "This example has been used 423 times"

# check to see whether this string matches a particular pattern
if re.search(r"\d\d\d", src):
    print("yep")
else:
    print("nope")

yep


In [78]:
# re.search() returns a match object (or None when nothing matches)
src = "This example has been used 423 times"
match = re.search(r"\d\d\d", src)
type(match)


_sre.SRE_Match

In [79]:
# Gives us the index in this string where the match starts
print(match.start())

27


In [80]:
# Gives us the index just past the last character of the match
print(match.end())

30


In [81]:
# Gives us the actual string that matched
print(match.group())

423
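One way to see how `.start()`, `.end()`, and `.group()` fit together: slicing the original string from `.start()` to `.end()` gives back the same text as `.group()`. The sketch below also uses `.span()`, which returns both indexes as a tuple.

```python
import re

src = "This example has been used 423 times"
match = re.search(r"\d\d\d", src)

# .end() is one past the last matched character, so a plain slice works
sliced = src[match.start():match.end()]

# .span() returns (start, end) in one call
span = match.span()

print(sliced)  # 423
print(span)    # (27, 30)
```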


In [82]:
# This is sort of like writing our own .findall(), except that re.search() only gives us the first match in each line

for line in subjects:
    match = re.search(r"[A-Z]{15,}", line)
    
    # if you found a match
    if match: 
        print(match.group())
        

CONGRATULATIONS
CONGRATULATIONS
PLEEEEEEEEEEEEEEEASE
ACCOMPLISHMENTS
ACCOMPLISHMENTS
CONFIDENTIALITY
CONFIDENTIALITY
CONGRATULATIONS
CONGRATULATIONS
ACKNOWLEDGEMENT
ACKNOWLEDGEMENT
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
INTERCONNECTION
INTERCONNECTION
INTERCONNECTION
INTERCONNECTION
INTERCONNECTION
CONGRATULATIONS
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
WASSSAAAAAAAAAAAAAABI
NOOOOOOOOOOOOOOOO
NOOOOOOOOOOOOOOOO
NOOOOOOOOOOOOOOOO
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONGRATULATIONS
CONFIDENTIALITY
CONFIDENTIALITY
ACCOMPLISHMENTS
ACCOMPLISHMENTS
CONGRATULATIONS
STANDARDIZATION
STANDARDIZATION
STANDARDIZATION
STANDARDIZATION
BRRRRRRRRRRRRRRRRRRRRR
CONGRATULATIONS
CONGRATULATIONS
NETCOTRANSMISSION
NETCOTRANSMISSION
NETCOTRANSMISSION
INTERCONTINENTAL
INTERCONTINENTAL


### An Example: Course Listing

In [83]:
courses = [
    "CSCI 105: Introductory Programming for Cat-Lovers",
    "LING 214: Pronouncing Things Backwards",
    "ANTHRO 342: Theory and Practice of Cheesemongery (Graduate Seminar)",
    "CSCI 205: Advanced Programming for Cat-Lovers",
    "ENGL 112: Speculative Travel Writing"
]

In [85]:
for item in courses:
    match = re.search(r"^(\w+) (\d+): (.*)$", item)
    print(match.group())

CSCI 105: Introductory Programming for Cat-Lovers
LING 214: Pronouncing Things Backwards
ANTHRO 342: Theory and Practice of Cheesemongery (Graduate Seminar)
CSCI 205: Advanced Programming for Cat-Lovers
ENGL 112: Speculative Travel Writing


In [86]:
for item in courses:
    match = re.search(r"^(\w+) (\d+): (.*)$", item)
    print(match.group(1))

CSCI
LING
ANTHRO
CSCI
ENGL


In [87]:
for item in courses:
    match = re.search(r"^(\w+) (\d+): (.*)$", item)
    print(match.group(3))

Introductory Programming for Cat-Lovers
Pronouncing Things Backwards
Theory and Practice of Cheesemongery (Graduate Seminar)
Advanced Programming for Cat-Lovers
Speculative Travel Writing


In [89]:
print("Course catalog reports:\n")
for item in courses:
    match = re.search(r"^(\w+) (\d+): (.*)$", item)
    print("Course dept", match.group(1))
    print("Course #", match.group(2))
    print("Course title", match.group(3))

Course catalog reports:

Course dept CSCI
Course # 105
Course title Introductory Programming for Cat-Lovers
Course dept LING
Course # 214
Course title Pronouncing Things Backwards
Course dept ANTHRO
Course # 342
Course title Theory and Practice of Cheesemongery (Graduate Seminar)
Course dept CSCI
Course # 205
Course title Advanced Programming for Cat-Lovers
Course dept ENGL
Course # 112
Course title Speculative Travel Writing
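The loops above assume every entry matches the pattern. If one did not, `re.search()` would return `None` and `match.group(1)` would raise an `AttributeError`. Below is a minimal sketch of one way to guard against that, using `match.groups()` to unpack all three captures at once. The `sample_courses` list, including its malformed third entry, is hypothetical.

```python
import re

# Two entries from the list above plus one hypothetical malformed entry
sample_courses = [
    "CSCI 105: Introductory Programming for Cat-Lovers",
    "LING 214: Pronouncing Things Backwards",
    "Independent Study (no number assigned)",
]

parsed = []
skipped = []
for item in sample_courses:
    match = re.search(r"^(\w+) (\d+): (.*)$", item)
    if match:
        # .groups() returns every captured group as a tuple
        dept, number, title = match.groups()
        parsed.append({"dept": dept, "number": number, "title": title})
    else:
        # no match: keep the line aside instead of crashing
        skipped.append(item)

print(parsed[0])  # {'dept': 'CSCI', 'number': '105', 'title': 'Introductory Programming for Cat-Lovers'}
print(skipped)    # ['Independent Study (no number assigned)']
```

Note that the captured course number is still a string; it would need `int(number)` before doing arithmetic with it.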
