## INVESTIGATING UNSTRUCTURED TEXT
As we saw last week, even the sometimes messy and unpredictable markup of HTML can give us clues to how data may be structured. But language as a system (as we saw in Borges) also comes with its own structures. Python provides many string methods for navigating basic linguistic patterns. Let's begin with repetition itself:

In [2]:
speech = '''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.'''

speech.lower()

"tomorrow, and tomorrow, and tomorrow,\ncreeps in this petty pace from day to day,\nto the last syllable of recorded time;\nand all our yesterdays have lighted fools\nthe way to dusty death. out, out, brief candle!\nlife's but a walking shadow, a poor player,\nthat struts and frets his hour upon the stage,\nand then is heard no more. it is a tale\ntold by an idiot, full of sound and fury,\nsignifying nothing."

There are various ways to investigate Macbeth's famous, very short speech. We begin with the obvious: searching through the whole speech.

In [3]:
'tomorrow' in speech

True

In [4]:
speech.find('tomorrow')

14

In [5]:
speech[14:14+len('tomorrow')]
#speech[14:22]


'tomorrow'

In [6]:
speech.count('tomorrow')

2

In [7]:
#speech.lower().count('tomorrow')
speech.lower().find('idiot')

In [8]:

speech.count('And')

2

In [9]:
speech.count('\nA')

2

In [10]:
speech.lower().count(' a ')
len(speech)

402
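
The counts above are case-sensitive, which is why `'tomorrow'` is found only twice. As a small sketch (using just the first line of the speech so the block runs on its own), lowercasing first also catches the capitalized opening word, and `find()` signals a miss with `-1` instead of an error:

```python
# First line of the speech, repeated here so the block is self-contained.
firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# count() is case-sensitive: the capitalized 'Tomorrow' is skipped.
exact = firstline.count('tomorrow')

# Lowercasing first makes the count case-insensitive.
any_case = firstline.lower().count('tomorrow')

# find() returns -1 rather than raising an error when nothing is found.
missing = firstline.find('yesterday')

print(exact, any_case, missing)
```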

Of course, there is already a structure to the speech that we are ignoring--it has lines. Let's get out those lines and put them into a list.

In [11]:
lines = speech.split('\n')
#lines = speech.splitlines()
lines

['Tomorrow, and tomorrow, and tomorrow,',
 'Creeps in this petty pace from day to day,',
 'To the last syllable of recorded time;',
 'And all our yesterdays have lighted fools',
 'The way to dusty death. Out, out, brief candle!',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,',
 'And then is heard no more. It is a tale',
 'Told by an idiot, full of sound and fury,',
 'Signifying nothing.']
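
For this speech, `split('\n')` and `splitlines()` return the same list, but they differ when the text ends with a newline. A quick sketch on a made-up two-line string (not part of the speech):

```python
# Hypothetical sample text that ends with a newline character.
text = 'first line\nsecond line\n'

# split('\n') keeps an empty string after the final newline.
by_split = text.split('\n')

# splitlines() drops that trailing empty piece.
by_splitlines = text.splitlines()

print(by_split)
print(by_splitlines)
```

This matters for any multi-line string whose closing quotes sit on their own line, because such a string ends with `\n`.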

In [12]:
firstline = lines[0]
firstline

Python has a handful of built-in ways to search a line. Here are just a few.

In [13]:
yest = firstline.lower().replace('tomorrow','yesterday',2)
yest

'yesterday, and yesterday, and tomorrow,'

In [14]:
firstline.lower().startswith('tomorrow')

True

In [15]:
firstline.endswith('tomorrow,')

True
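
Both `startswith()` and `endswith()` also accept a tuple of alternatives and an optional start position. A brief sketch, again on the first line only:

```python
firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# A tuple means "starts with any of these".
starts_either = firstline.startswith(('Today', 'Tomorrow'))

# The same works for endings, here a comma or a semicolon.
ends_punct = firstline.endswith((',', ';'))

# An optional start offset checks from that position onward.
morrow_at_2 = firstline.startswith('morrow', 2)

print(starts_either, ends_punct, morrow_at_2)
```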

## List comprehensions
What if we want to search through every line? The obvious way is to use a `for` loop.

In [16]:
for line in lines:
    if line.startswith('And'):
        print(line)

And all our yesterdays have lighted fools
And then is heard no more. It is a tale


That is a very simple loop, so simple that Python lets you loop through a list in a one-line statement called a **list comprehension**.

In [17]:
[line for line in lines if line.startswith('T')]


['Tomorrow, and tomorrow, and tomorrow,',
 'To the last syllable of recorded time;',
 'The way to dusty death. Out, out, brief candle!',
 'That struts and frets his hour upon the stage,',
 'Told by an idiot, full of sound and fury,']
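
To make the correspondence explicit, here is a sketch that builds the same list twice, once with a loop and once with a comprehension. It uses three lines of the speech and uppercases the matches, to show that the first part of a comprehension can transform each item as well as select it:

```python
lines = ['Tomorrow, and tomorrow, and tomorrow,',
         'Creeps in this petty pace from day to day,',
         'To the last syllable of recorded time;']

# The long way: start with an empty list and append inside a loop.
loop_result = []
for line in lines:
    if line.startswith('T'):
        loop_result.append(line.upper())

# The comprehension: [expression for item in list if condition]
comp_result = [line.upper() for line in lines if line.startswith('T')]

print(comp_result)
```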

Remember this: when we start using more robust ways of searching line by line (sentence by sentence, etc.), list comprehensions will come in handy. But before we jump to those special searching methods, let's take a little detour on sorting.

## Sorting!
Say we want to investigate the lines in the speech and order them from longest to shortest. We know how to get the length of each line using a loop, but how can we use those lengths to reorder our list?

In [18]:
for line in lines:
    print(len(line))

37
42
38
41
47
43
46
39
41
19


We could write a function that pairs these numbers with each line and then sorts everything--but sort functions are notoriously challenging to write, and Python has a built-in sorting method.

In [19]:
sortlines = lines.copy()
sortlines.sort()
sortlines

['And all our yesterdays have lighted fools',
 'And then is heard no more. It is a tale',
 'Creeps in this petty pace from day to day,',
 "Life's but a walking shadow, a poor player,",
 'Signifying nothing.',
 'That struts and frets his hour upon the stage,',
 'The way to dusty death. Out, out, brief candle!',
 'To the last syllable of recorded time;',
 'Told by an idiot, full of sound and fury,',
 'Tomorrow, and tomorrow, and tomorrow,']

But not only that, Python has a built-in way to write small anonymous functions, called `lambda`, that you can nest inside a sorting function.

In [20]:
sortlines = lines.copy()
sortlines.sort(key=lambda x: len(x), reverse=True)
# what is this one down here doing?
#sortlines.sort(key=lambda x: x.split()[-1], reverse=True)
sortlines
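
The `key` argument takes any function that accepts one item and returns the value to sort by. As a sketch on four lines of the speech, the built-in `sorted()` (which returns a new list, so no `copy()` is needed) can use `len` directly, and a `lambda` can pick out each line's final word:

```python
lines = ['Tomorrow, and tomorrow, and tomorrow,',
         'Creeps in this petty pace from day to day,',
         'To the last syllable of recorded time;',
         'Signifying nothing.']

# key=len is equivalent to key=lambda x: len(x).
by_length = sorted(lines, key=len, reverse=True)

# Sort alphabetically by the last whitespace-separated word,
# trailing punctuation included.
by_last_word = sorted(lines, key=lambda x: x.split()[-1])

print([len(line) for line in by_length])
print([line.split()[-1] for line in by_last_word])
```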

## Regular Expressions
The more you work with unstructured text, the more you will want the power that regular expressions give you. Regular expressions are a mini-language unto themselves (often sharing similarities across different programming languages). They allow you to search for a variety of patterns within text. The most obvious patterns you might find are telephone numbers, ZIP Codes, email addresses (social security numbers and credit card numbers for the more malicious)--and many regular expressions have been written to capture these with varying levels of accuracy. Today, however, our focus will be on exploring text.

First import the built-in regular expression library `re`.

In [21]:
import re

There are five main regular expression functions that we will work with:

**match()** & **search()**: these functions tell you whether or not they found a match, and where that match was located--although match() only looks at the very beginning of the string, so it is rarely useful.

**split()** & **sub()**: these two work just like split() & replace(), but they search for patterns and return a list or a substitute string respectively.

**findall()**: just as the name sounds, this method returns a list of matching patterns that were found throughout the entire string.
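
A minimal sketch of the difference between `match()` and `search()`, using the first line of the speech: the same pattern fails with `match()` because `'morrow'` does not sit at position 0, while `search()` finds it further in.

```python
import re

firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# match() only tries at the very start of the string, so this is None.
at_start = re.match('morrow', firstline)

# search() scans forward and returns the first match it finds.
anywhere = re.search('morrow', firstline)

print(at_start)
print(anywhere.span(), anywhere.group())
```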

In [22]:

#found = re.match("morrow",firstline,re.IGNORECASE)
#found
found = re.search("morrow",firstline,re.IGNORECASE)
#found.group()
found.end()


8

In [23]:
newlist = re.split("and",firstline,flags=re.IGNORECASE)
newstring = re.sub("tomorrow","yesterday",firstline,flags=re.IGNORECASE)
print(newlist,newstring)

['Tomorrow, ', ' tomorrow, ', ' tomorrow,'] yesterday, and yesterday, and yesterday,
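
Like `replace()`, `re.sub()` can be limited to a number of substitutions. In `re.sub()` and `re.split()` the fourth positional argument is a count (or maximum number of splits), not the flags, so it is safest to pass both by keyword. A short sketch:

```python
import re

firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# count=2 stops after two substitutions, like the third
# argument of str.replace().
two_only = re.sub('tomorrow', 'yesterday', firstline,
                  count=2, flags=re.IGNORECASE)

# maxsplit=1 splits on the first 'and' only.
one_split = re.split('and', firstline, maxsplit=1, flags=re.IGNORECASE)

print(two_only)
print(one_split)
```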


In [24]:
words = re.findall("to",firstline,re.IGNORECASE)
words

['To', 'to', 'to']
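
`findall()` returns only the matched text. When the positions matter as well, the related function `re.finditer()` yields match objects like the one `search()` returns. A sketch:

```python
import re

firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# Each item from finditer() is a match object with group() and start().
positions = [(m.group(), m.start())
             for m in re.finditer('to', firstline, flags=re.IGNORECASE)]

print(positions)
```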

## Special characters
While the search functions above are more flexible than Python's plain string methods, it is the pattern-seeking characters that--once you get used to them--do the most powerful work.

Here's a list of the most common pattern-seeking characters:

| special character | what it does |
|--------|---------|
| `.` | Match any character except newline |
| `^` | Match the beginning of the string |
| `$` | Match the end of the string (or just before a final `\n`) |
| `*` | Match 0 or more repetitions |
| `+` | Match 1 or more repetitions |
| `?` | Match 0 or 1 repetitions |
| `{m}` | Match exactly m repetitions |
| `{m,n}` | Match between m and n repetitions |
| `{m,}` | Match at least m repetitions |


In [25]:
all_ll = re.findall("..ll",speech)
re.search("^Tomorrow",firstline)
re.search("tomorrow,$",firstline)
#all_ll


<_sre.SRE_Match object; span=(28, 37), match='tomorrow,'>

In [26]:
#a list comprehension again!
#Note that match() would produce the same thing
[line for line in lines if re.search("^And",line)]

In [27]:
[line for line in lines if re.search(",$",line)]

['Tomorrow, and tomorrow, and tomorrow,',
 'Creeps in this petty pace from day to day,',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,',
 'Told by an idiot, full of sound and fury,']

In [28]:
th_plus = re.findall("the*..",speech)
th_plus

In [29]:
names = "Jon, John, Jonn, Johhhn, Joan"
find_names = re.findall(r"Joh?n\b",names)
find_names

['Jon', 'John']
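
The other repetition characters from the table can be tried on the same string of names. In this sketch only the quantifier after `h` changes between patterns:

```python
import re

names = "Jon, John, Jonn, Johhhn, Joan"

zero_or_one = re.findall(r"Joh?n\b", names)       # h is optional
zero_or_more = re.findall(r"Joh*n\b", names)      # any number of h, including none
one_or_more = re.findall(r"Joh+n\b", names)       # at least one h
exactly_three = re.findall(r"Joh{3}n\b", names)   # exactly three h
one_to_three = re.findall(r"Joh{1,3}n\b", names)  # between one and three h

print(zero_or_one, zero_or_more, one_or_more, exactly_three, one_to_three)
```

`Jonn` never matches because `\b` requires the word to end right after the first `n`.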

In [30]:
# see what happens if you replace the + with a *
l_plus = re.findall("..l+..",speech)
l_plus

In [31]:
# any character, then 'o', then an optional 'r'
or_opt = re.findall(".or?",speech)
or_opt

['To',
 'mor',
 'ro',
 'to',
 'mor',
 'ro',
 'to',
 'mor',
 'ro',
 'ro',
 'to',
 'To',
 ' o',
 'cor',
 ' o',
 'fo',
 'to',
 ' o',
 'do',
 'po',
 'ho',
 'po',
 'no',
 'mor',
 'To',
 'io',
 ' o',
 'so',
 'no']

In [32]:
o_2 = re.findall("..o{2}..",speech)
o_2

## Sets and Groups
**Sets**, which include `[]` and shortcuts like `\w`, allow you to search for certain types of characters. **Groups**, which are demarcated by `()`, allow you to specify important sub-patterns that you can access individually.

| enclosures | what it does |
|--------|---------|
| `[]` | A defined set of characters to search for |
| `()` | A group of characters to search for, can be accessed individually in the results. |


| Examples of sets | what it does |
|--------|---------|
| `[aeiou]` | Find any lowercase vowel |
| `[Tt]` | Find a lowercase or uppercase t |
| `[0-9]` | Find any digit, there is a shortcut for this |
| `[^0-9]` | Find anything that's not a digit, there is a shortcut for this |
| `[13579]` | Find any odd digit |
| `[A-Za-z]` | Find any unaccented letter |
| `[+.*]` | Find those literal characters; most special characters lose their meaning inside a set (shortcuts still work: see below) |


| Shortcut | what it does |
|--------|---------|
| `\b` | Word boundary: the position between a word character and a non-word character (space, comma, start or end of line) |
| `\B` | Not a word boundary |
| `\d` | digits [0-9] |
| `\D` | not digits |
| `\s` | whitespace characters: space, tab, newline... |
| `\S` | not whitespace |
| `\w` | word characters: letters, digits, and underscore |
| `\W` | not word characters |


In [33]:
words = re.findall(r"\b[TtOo]\w+",speech)
words

['Tomorrow',
 'tomorrow',
 'tomorrow',
 'this',
 'to',
 'To',
 'the',
 'of',
 'time',
 'our',
 'The',
 'to',
 'Out',
 'out',
 'That',
 'the',
 'then',
 'tale',
 'Told',
 'of']
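
The shortcuts are easiest to tell apart on a string that mixes letters, digits, and punctuation. The sample below is made up for illustration; note that `\w` matches digits and the underscore as well as letters:

```python
import re

# Hypothetical sample string mixing letters, digits, and punctuation.
sample = "Room 1517, tel (202) 225-3826; user_name"

digits = re.findall(r"\d+", sample)  # runs of digits
words = re.findall(r"\w+", sample)   # runs of letters, digits, or underscore
spaces = re.findall(r"\s", sample)   # individual whitespace characters

print(digits)
print(words)
print(len(spaces))
```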

In [34]:
words = re.findall(r"[tT]\w+",speech)
words

['Tomorrow',
 'tomorrow',
 'tomorrow',
 'this',
 'tty',
 'to',
 'To',
 'the',
 'time',
 'terdays',
 'ted',
 'The',
 'to',
 'ty',
 'th',
 'That',
 'truts',
 'ts',
 'the',
 'tage',
 'then',
 'tale',
 'Told',
 'thing']
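
The fragments such as `'tty'` appear because, without `\b`, the pattern may start in the middle of a word. A sketch on the second line of the speech shows the effect of adding a boundary at the start, and at both ends:

```python
import re

line = 'Creeps in this petty pace from day to day,'

# No boundary: a match may begin at any 't', even inside 'petty'.
anywhere = re.findall(r"[tT]\w+", line)

# \b at the start: only words that begin with t or T.
word_start = re.findall(r"\b[tT]\w+", line)

# \b on both sides: the whole word 'to' and nothing longer.
whole_word = re.findall(r"\bto\b", line)

print(anywhere, word_start, whole_word)
```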

Looking for phrases

In [35]:
# three-word phrases that begin with two-letter words
phrases = re.findall(r"\b\w{2}\W+\w+\W+\w+",speech)
#overlapping
#phrases = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))",speech) 
#groups
#phrases = re.findall(r"(\b\w{2})\W+(\w+)\W+(\w+)",speech)
#phrases = re.findall(r"(?=(\b\w{2})\W+(\w+)\W+(\w+))",speech)

phrases

['in this petty',
 'to day,\nTo',
 'of recorded time',
 'to dusty death',
 'is heard no',
 'It is a',
 'by an idiot',
 'of sound and']
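
The commented-out variants in the cell above change what `findall()` returns. As a sketch on one line of the speech: with groups, each result becomes a tuple of the captured words, and wrapping the whole pattern in a lookahead `(?=...)` lets matches overlap, because a lookahead consumes no text.

```python
import re

line = 'Told by an idiot, full of sound and fury,'

# No groups: each result is the whole matched phrase.
whole = re.findall(r"\b\w{2}\W+\w+\W+\w+", line)

# Three groups: each result is a tuple of the three words.
parts = re.findall(r"(\b\w{2})\W+(\w+)\W+(\w+)", line)

# Lookahead: phrases may overlap, so 'an idiot, full' is found too.
overlapping = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))", line)

print(whole)
print(parts)
print(overlapping)
```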

Searching a longer poem

In [36]:
# the file must exist at this path on your machine; adjust it to your own copy
with open('/Users/Jon/Documents/Columbia-2018/pythonNotebooks/weekFour/wasteland.txt', 'r') as f:
    wasteland = f.read()

FileNotFoundError: [Errno 2] No such file or directory: '/Users/Jon/Documents/Columbia-2018/pythonNotebooks/weekFour/wasteland.txt'
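
The error above means that Python could not find the file at that path. One way to guard against this is sketched below. The relative path `wasteland.txt` and the UTF-8 encoding are assumptions for illustration, so point the path at wherever your copy of the poem lives:

```python
from pathlib import Path

# Hypothetical location: change this to the path of your own copy.
poem_path = Path('wasteland.txt')

if poem_path.exists():
    # Assumes the file is saved as UTF-8.
    wasteland = poem_path.read_text(encoding='utf-8')
else:
    wasteland = ''
    print(f'Could not find {poem_path}; check the path before continuing.')

# An empty string simply gives an empty list of lines.
poemlines = wasteland.splitlines()
print(len(poemlines))
```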

In [37]:
poemlines = wasteland.split('\n')

NameError: name 'wasteland' is not defined

In [38]:

[line for line in poemlines if re.search("win.", line)]


NameError: name 'poemlines' is not defined

Searching a whole play

In [39]:
# hamlet.txt must sit in the same folder as this notebook
with open('hamlet.txt', 'r') as f:
    play = f.read()


FileNotFoundError: [Errno 2] No such file or directory: 'hamlet.txt'

In [40]:
type(play)
play

NameError: name 'play' is not defined

In [41]:
play[:500]

NameError: name 'play' is not defined

In [42]:
all_chars = re.findall(r"[\n]([A-Z]+)[\n]",play)
# all_chars = re.findall(r"[\n](ACT [IV]+)[\n]",play)
# all_chars = re.findall(r"[\n](ACT [IV]+[\n]+SCENE [IVX]+.)",play)
# all_chars = re.findall(r"(SCENE [IVX]+)",play)
# all_chars = re.findall(r"(ACT [IV]+)*[\n]+(SCENE [IVX]+)",play)
# all_chars = re.findall(r"(ACT [IV]+)*[\n]+(SCENE [IVX]+)(.*)",play)
# acts = re.split(r"(ACT [IV]+)*[\n]+(SCENE [IVX]+)",play)
all_chars
# acts[8]

NameError: name 'play' is not defined
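
Without the file we cannot show the real results, but the first pattern can be tried on a stand-in. The excerpt below is a hypothetical layout (speaker names in capitals on their own lines) and may not match how your `hamlet.txt` is formatted:

```python
import re

# Hypothetical excerpt; the layout of a real play file may differ.
excerpt = "\nBERNARDO\nWho's there?\n\nFRANCISCO\nNay, answer me.\n"

# A run of capitals with a newline on each side.
speakers = re.findall(r"\n([A-Z]+)\n", excerpt)

# An alternative with re.MULTILINE, where ^ and $ match at every line.
speakers_multiline = re.findall(r"^[A-Z]+$", excerpt, flags=re.MULTILINE)

print(speakers)
print(speakers_multiline)
```

The first pattern consumes the closing newline, so it would skip a name that directly follows another name; the `re.MULTILINE` version does not have that limitation.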

In [43]:
house_reps = '''1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs 
2nd King, Pete R 339 CHOB (202) 225-7896  Financial Services Homeland Security Intelligence 
3rd Suozzi, Thomas D 226 CHOB (202) 225-3335  Armed Services Foreign Affairs 
4th Rice, Kathleen D 1508 LHOB (202) 225-5516  Homeland Security Veterans' Affairs 
5th Meeks, Gregory W. D 2234 RHOB (202) 225-3461  Financial Services Foreign Affairs 
6th Meng, Grace D 1317 LHOB (202) 225-2601  Appropriations 
7th Velázquez, Nydia M. D 2302 RHOB (202) 225-2361  Financial Services Natural Resources Small Business 
8th Jeffries, Hakeem D 1607 LHOB (202) 225-5936  Budget Judiciary 
9th Clarke, Yvette D. D 2058 RHOB (202) 225-6231  Energy and Commerce Small Business Ethics 
10th Nadler, Jerrold D 2109 RHOB (202) 225-5635  Judiciary 
11th Donovan, Daniel R 1541 LHOB (202) 225-3371  Foreign Affairs Homeland Security 
12th Maloney, Carolyn D 2308 RHOB (202) 225-7944  Financial Services Oversight and Government Reform 
13th Espaillat, Adriano D 1630 LHOB (202) 225-4365  Education and the Workforce Foreign Affairs Small Business 
14th Crowley, Joseph D 1035 LHOB (202) 225-3965  Ways and Means 
15th Serrano, José E. D 2354 RHOB (202) 225-4361  Appropriations 
16th Engel, Eliot D 2462 RHOB (202) 225-2464  Foreign Affairs Energy and Commerce 
17th Lowey, Nita D 2365 RHOB (202) 225-6506  Appropriations Joint Select Committee on Budget and APPNs Process Reform 
18th Maloney, Sean Patrick D 1027 LHOB (202) 225-5441  Agriculture Transportation and Infrastructure 
19th Faso, John R 1616 LHOB (202) 225-5614  Agriculture Budget Transportation and Infrastructure 
20th Tonko, Paul D. D 2463 RHOB (202) 225-5076  Energy and Commerce Science, Space, and Technology 
21st Stefanik, Elise R 318 CHOB (202) 225-4611  Armed Services Education and the Workforce Intelligence 
22nd Tenney, Claudia R 512 CHOB (202) 225-3665  Financial Services 
23rd Reed, Tom R 2437 RHOB (202) 225-3161  Ways and Means 
24th Katko, John R 1620 LHOB (202) 225-3701  Homeland Security Transportation and Infrastructure 
25th Slaughter, Louise McIntosh - Vacancy D 2469 RHOB (202) 225-3615  
26th Higgins, Brian D 2459 RHOB (202) 225-3306  Budget Ways and Means 
27th Collins, Chris R 1117 LHOB (202) 225-5265  Energy and Commerce
'''

In [44]:
house_list = house_reps.splitlines()
house_list

['1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs ',
 '2nd King, Pete R 339 CHOB (202) 225-7896  Financial Services Homeland Security Intelligence ',
 '3rd Suozzi, Thomas D 226 CHOB (202) 225-3335  Armed Services Foreign Affairs ',
 "4th Rice, Kathleen D 1508 LHOB (202) 225-5516  Homeland Security Veterans' Affairs ",
 '5th Meeks, Gregory W. D 2234 RHOB (202) 225-3461  Financial Services Foreign Affairs ',
 '6th Meng, Grace D 1317 LHOB (202) 225-2601  Appropriations ',
 '7th Velázquez, Nydia M. D 2302 RHOB (202) 225-2361  Financial Services Natural Resources Small Business ',
 '8th Jeffries, Hakeem D 1607 LHOB (202) 225-5936  Budget Judiciary ',
 '9th Clarke, Yvette D. D 2058 RHOB (202) 225-6231  Energy and Commerce Small Business Ethics ',
 '10th Nadler, Jerrold D 2109 RHOB (202) 225-5635  Judiciary ',
 '11th Donovan, Daniel R 1541 LHOB (202) 225-3371  Foreign Affairs Homeland Security ',
 '12th Maloney, Carolyn D 2308 RHOB (202) 225-7944  Financial Servi

In [50]:
dists = re.findall(r"^\d{1,}\w{2}",house_reps) 
dists

['1st']
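
Only `'1st'` comes back because, by default, `^` anchors to the start of the whole string. A minimal sketch with the standard `re.MULTILINE` flag, which makes `^` match at the start of every line (only the first three rows of the data are repeated here so the block runs on its own):

```python
import re

house_sample = '''1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs
2nd King, Pete R 339 CHOB (202) 225-7896  Financial Services Homeland Security Intelligence
3rd Suozzi, Thomas D 226 CHOB (202) 225-3335  Armed Services Foreign Affairs
'''

# Default: ^ matches only at the very start of the string.
first_only = re.findall(r"^\d+\w{2}", house_sample)

# re.MULTILINE: ^ matches at the start of each line.
every_line = re.findall(r"^\d+\w{2}", house_sample, flags=re.MULTILINE)

print(first_only, every_line)
```

Splitting into lines first, as the next cell does, is the other way to get one result per row.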

In [61]:
# Finding the district
# [re.findall(r"^\d+\w{2}",line) for line in house_list]
# alternative, more exact
[re.findall(r"^\d+[nrst][dht]",line) for line in house_list]

# Finding the last name
# still has some problems
# [re.findall(r" [A-Z][\w]+, [A-Z][\w]+",line) for line in house_list]
# combine with previous search
[re.findall(r"^(\d+[nrst][dht]) ([A-Z][\w]+),", line) for line in house_list]

# Getting phone numbers
# (a notebook cell displays only its last expression, so this is the output shown below)
[re.findall(r"\(\d+\) \d{3}-\d{4}", line) for line in house_list]

[['(202) 225-3826'],
 ['(202) 225-7896'],
 ['(202) 225-3335'],
 ['(202) 225-5516'],
 ['(202) 225-3461'],
 ['(202) 225-2601'],
 ['(202) 225-2361'],
 ['(202) 225-5936'],
 ['(202) 225-6231'],
 ['(202) 225-5635'],
 ['(202) 225-3371'],
 ['(202) 225-7944'],
 ['(202) 225-4365'],
 ['(202) 225-3965'],
 ['(202) 225-4361'],
 ['(202) 225-2464'],
 ['(202) 225-6506'],
 ['(202) 225-5441'],
 ['(202) 225-5614'],
 ['(202) 225-5076'],
 ['(202) 225-4611'],
 ['(202) 225-3665'],
 ['(202) 225-3161'],
 ['(202) 225-3701'],
 ['(202) 225-3615'],
 ['(202) 225-3306'],
 ['(202) 225-5265']]
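
The separate searches can be combined into one pattern with named groups, so that each row becomes a small dictionary. This is a sketch on two rows of the data; the group names `district`, `last`, and `phone` are labels chosen here for illustration:

```python
import re

house_sample = [
    '1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs ',
    '7th Velázquez, Nydia M. D 2302 RHOB (202) 225-2361  Financial Services Natural Resources Small Business ',
]

# (?P<name>...) is a group that can be looked up by name.
pattern = re.compile(
    r"^(?P<district>\d+[nrst][dht]) "         # e.g. 1st, 2nd, 3rd, 7th
    r"(?P<last>[A-Z]\w+), "                   # last name, up to the comma
    r".*?(?P<phone>\(\d{3}\) \d{3}-\d{4})"    # skip ahead to the phone number
)

records = []
for line in house_sample:
    m = pattern.search(line)
    if m:
        # groupdict() returns {group name: matched text}.
        records.append(m.groupdict())

print(records)
```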