## Regular expressions
More ways to use regular expressions...

More ways to think about parsing and reading unstructured texts...

In [1]:
import re

**Parsing an entire text** with groups and split

Open this text file for Hamlet and take a look at it! The text is very basic. 
Below, I look for the patterns that mark off the parts of the play, so that we can start grouping those parts together.

Plays are well-structured texts. In Beautiful Soup or JavaScript Object Notation (JSON) terms, we could think of a play as being organized like this:

    play.act.scene.dialogue_stageDirection

These are the levels of organization of a play.



In [3]:
#Read the whole play into one string; the with block closes the file for us.
with open('hamlet.txt', 'r', encoding='utf8') as f:
    play = f.read()
#This last line just shows us the first 500 characters of the play.
play[:500]

"\nThe Tragedy of Hamlet, Prince of Denmark\n\nACT I\n\nSCENE I. Elsinore. A platform before the castle.\n\nFRANCISCO at his post. Enter to him BERNARDO\nBERNARDO\nWho's there?\nFRANCISCO\nNay, answer me: stand, and unfold yourself.\nBERNARDO\nLong live the king!\nFRANCISCO\nBernardo?\nBERNARDO\nHe.\nFRANCISCO\nYou come most carefully upon your hour.\nBERNARDO\n'Tis now struck twelve; get thee to bed, Francisco.\nFRANCISCO\nFor this relief much thanks: 'tis bitter cold,\nAnd I am sick at heart.\nBERNARDO\nHave you had qui"

**GROUPING PATTERNS**

We can isolate information types using regular expressions. Here are a bunch of different regular expressions using groups ( ) and re.findall() that pull out every instance of a pattern. Try them out!
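Before trying the patterns on the whole play, it may help to see how groups change what `re.findall()` hands back. This is a minimal sketch on a made-up three-line sample (the `sample` string is my own, not taken from hamlet.txt):

```python
import re

# Toy text, invented for illustration only.
sample = "\nACT I\n\nSCENE I. A platform.\n\nSCENE II. A room of state.\n"

# No group: findall returns each whole match as a string.
no_group = re.findall(r"SCENE [IVX]+\..+", sample)

# One group: findall returns only the text captured by that group.
one_group = re.findall(r"SCENE ([IVX]+)\.", sample)

# Several groups: findall returns one tuple per match, one item per group.
two_groups = re.findall(r"SCENE ([IVX]+)\. (.+)", sample)

print(no_group)    # ['SCENE I. A platform.', 'SCENE II. A room of state.']
print(one_group)   # ['I', 'II']
print(two_groups)  # [('I', 'A platform.'), ('II', 'A room of state.')]
```

So the parentheses decide which pieces of each match end up in the list.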

In [4]:
#This gets a list of characters
#all_chars = re.findall(r"[\n]([A-Z ]+)[\n]",play)
#all_chars

#Gets a list of act names
# act_names = re.findall(r"[\n](ACT [IV]+)[\n]",play)
# act_names

#Gets a list of act and scene names
# act_scene=re.findall(r"[\n](ACT [IV]+[\n]+SCENE [IVX]+\.)",play)
# act_scene

#List of scene names
# all_scenes = re.findall(r"(SCENE [IVX]+)",play)
# all_scenes

#List of all acts and all scenes with acts (with a blank when ACT doesn't appear)
# act_w_scene = re.findall(r"(ACT [IV]+)*[\n]+(SCENE [IVX]+)",play)
# act_w_scene

#List of all acts all scenes plus the scene description
act_w_scene_des = re.findall(r"(ACT [IV]+)*[\n]+(SCENE [IVX]+)(.+)[\n]",play)
act_w_scene_des



[('ACT I', 'SCENE I', '. Elsinore. A platform before the castle.'),
 ('', 'SCENE II', '. A room of state in the castle.'),
 ('', 'SCENE III', ". A room in Polonius' house."),
 ('', 'SCENE IV', '. The platform.'),
 ('', 'SCENE V', '. Another part of the platform.'),
 ('ACT II', 'SCENE I', ". A room in POLONIUS' house."),
 ('', 'SCENE II', '. A room in the castle.'),
 ('ACT III', 'SCENE I', '. A room in the castle.'),
 ('', 'SCENE II', '. A hall in the castle.'),
 ('', 'SCENE III', '. A room in the castle.'),
 ('', 'SCENE IV', ". The Queen's closet."),
 ('ACT IV', 'SCENE I', '. A room in the castle.'),
 ('', 'SCENE II', '. Another room in the castle.'),
 ('', 'SCENE III', '. Another room in the castle.'),
 ('', 'SCENE IV', '. A plain in Denmark.'),
 ('', 'SCENE V', '. Elsinore. A room in the castle.'),
 ('', 'SCENE VI', '. Another room in the castle.'),
 ('', 'SCENE VII', '. Another room in the castle.'),
 ('ACT V', 'SCENE I', '. A churchyard.'),
 ('', 'SCENE II', '. A hall in the castle

**regex split()**

If you use re.split() with groups ( ), the result keeps the text that matched each group, and it also isolates everything between those matches! (A group that takes no part in a match, like the optional ACT group, shows up as None.) So now you are getting an organized list of all of the components of this play.
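Here is a small sketch of that behavior on a throwaway string (the `text` value and the patterns are invented for the illustration), including what happens when an optional group does not match:

```python
import re

text = "one, two; three"

# Without a group, the separators are thrown away.
plain = re.split(r"[,;] ", text)

# With a group, each captured separator is kept in the list.
kept = re.split(r"([,;]) ", text)

# An optional group that takes no part in the match comes back as None.
optional = re.split(r"(!)?([,;]) ", text)

print(plain)     # ['one', 'two', 'three']
print(kept)      # ['one', ',', 'two', ';', 'three']
print(optional)  # ['one', None, ',', 'two', None, ';', 'three']
```

With two groups in the pattern, every split adds two extra elements, so the list follows a repeating pattern of three: text, group 1, group 2.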

In [5]:
#EVERYTHING!!! this one regular expression, using split, 
#parses the entire structure of the play.
#It gets a list that has the act, the scene, the scene description, 
#and the entirety of that scene--each as a separate list element
#so every fourth element contains the complete text of a scene.
acts = re.split(r"(ACT [IV]+)*[\n]+(SCENE [IVX]+)(.+)[\n]",play)

print(len(acts))
acts[1]

81


'ACT I'

**Making that list into a useful dictionary!**

A good thing to understand: **lists** are great for isolating and ordering a series of data.

Whereas **dictionaries** are great for grouping that data into units and making those units more meaningful by assigning keys to them.

The list I get from the regular expression above is extremely useful because it has every component part of the play isolated and in a series. But when I transform it below, I get a **list of every scene**: each *scene* is a **dictionary** with **keys** for the *act number*, the *scene number*, the *scene description*, and all of the *dialogue* (and stage directions) from that scene.
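To see why the dictionary form is convenient, here is a sketch with a hand-made list of three scene dictionaries. The values are invented toy data, not output from the play; only the key names follow the ones used in this notebook.

```python
# Toy data, invented for illustration only.
scenes = [
    {'act': 'ACT I', 'scene': 'SCENE I', 'setting': '. A platform.', 'dialogue': "Who's there?"},
    {'act': 'ACT I', 'scene': 'SCENE II', 'setting': '. A room of state.', 'dialogue': 'Welcome.'},
    {'act': 'ACT II', 'scene': 'SCENE I', 'setting': '. A house.', 'dialogue': 'Give him this money.'},
]

# The list keeps the scenes in order, so we can still pick one by position.
third_scene = scenes[2]

# The keys let us ask for exactly the field we want.
act_one_settings = [s['setting'] for s in scenes if s['act'] == 'ACT I']

# Group the scene names under each act.
scenes_per_act = {}
for s in scenes:
    scenes_per_act.setdefault(s['act'], []).append(s['scene'])

print(third_scene['dialogue'])  # Give him this money.
print(act_one_settings)         # ['. A platform.', '. A room of state.']
print(scenes_per_act)           # {'ACT I': ['SCENE I', 'SCENE II'], 'ACT II': ['SCENE I']}
```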

In [6]:
#The list we get from that regular expression gives us a pattern
#(after the first element [0], which is just the title of the play).

# The pattern is: act [1], scene [2], description [3], dialogue [4]
#                 act [5], scene [6], description [7], dialogue [8]
#                 act [9], scene [10], description [11], dialogue [12]
#                 ...and so forth..
# So starting at element 1, every 4 elements match that pattern.
# So this loop sets the range to start x at 1 and to jump every 4 ahead, 
# and it uses x in the loop to isolate each element 
# and to enter each element into a more meaningful dictionary by keys
# (one tricky part: the act element, e.g. "ACT I", is None for all subsequent scenes
#  until we get to the next act, so I control for that with the variable current_act)

hamlet_structure=[]
current_act = ""
for x in range(1,len(acts),4):
    if acts[x] is not None:
        current_act = acts[x]
    scene_dict = {}
    scene_dict['act'] = current_act
    scene_dict['scene'] = acts[x+1]
    scene_dict['setting'] = acts[x+2]
    scene_dict['dialogue'] = acts[x+3]
    hamlet_structure.append(scene_dict)
hamlet_structure[6]

{'act': 'ACT II',
 'scene': 'SCENE II',
 'setting': '. A room in the castle.',
 'dialogue': "\nEnter KING CLAUDIUS, QUEEN GERTRUDE, ROSENCRANTZ, GUILDENSTERN, and Attendants\nKING CLAUDIUS\nWelcome, dear Rosencrantz and Guildenstern!\nMoreover that we much did long to see you,\nThe need we have to use you did provoke\nOur hasty sending. Something have you heard\nOf Hamlet's transformation; so call it,\nSith nor the exterior nor the inward man\nResembles that it was. What it should be,\nMore than his father's death, that thus hath put him\nSo much from the understanding of himself,\nI cannot dream of: I entreat you both,\nThat, being of so young days brought up with him,\nAnd sith so neighbour'd to his youth and havior,\nThat you vouchsafe your rest here in our court\nSome little time: so by your companies\nTo draw him on to pleasures, and to gather,\nSo much as from occasion you may glean,\nWhether aught, to us unknown, afflicts him thus,\nThat, open'd, lies within our remedy.\nQUEEN G

**Groups for data type**

In [7]:
house_reps = '''1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs 
2nd King, Pete R 339 CHOB (202) 225-7896  Financial Services Homeland Security Intelligence 
3rd Suozzi, Thomas D 226 CHOB (202) 225-3335  Armed Services Foreign Affairs 
4th Rice, Kathleen D 1508 LHOB (202) 225-5516  Homeland Security Veterans' Affairs 
5th Meeks, Gregory W. D 2234 RHOB (202) 225-3461  Financial Services Foreign Affairs 
6th Meng, Grace D 1317 LHOB (202) 225-2601  Appropriations 
7th Velázquez, Nydia M. D 2302 RHOB (202) 225-2361  Financial Services Natural Resources Small Business 
8th Jeffries, Hakeem D 1607 LHOB (202) 225-5936  Budget Judiciary 
9th Clarke, Yvette D. D 2058 RHOB (202) 225-6231  Energy and Commerce Small Business Ethics 
10th Nadler, Jerrold D 2109 RHOB (202) 225-5635  Judiciary 
11th Donovan, Daniel R 1541 LHOB (202) 225-3371  Foreign Affairs Homeland Security 
12th Maloney, Carolyn D 2308 RHOB (202) 225-7944  Financial Services Oversight and Government Reform 
13th Espaillat, Adriano D 1630 LHOB (202) 225-4365  Education and the Workforce Foreign Affairs Small Business 
14th Crowley, Joseph D 1035 LHOB (202) 225-3965  Ways and Means 
15th Serrano, José E. D 2354 RHOB (202) 225-4361  Appropriations 
16th Engel, Eliot D 2462 RHOB (202) 225-2464  Foreign Affairs Energy and Commerce 
17th Lowey, Nita D 2365 RHOB (202) 225-6506  Appropriations Joint Select Committee on Budget and APPNs Process Reform 
18th Maloney, Sean Patrick D 1027 LHOB (202) 225-5441  Agriculture Transportation and Infrastructure 
19th Faso, John R 1616 LHOB (202) 225-5614  Agriculture Budget Transportation and Infrastructure 
20th Tonko, Paul D. D 2463 RHOB (202) 225-5076  Energy and Commerce Science, Space, and Technology 
21st Stefanik, Elise R 318 CHOB (202) 225-4611  Armed Services Education and the Workforce Intelligence 
22nd Tenney, Claudia R 512 CHOB (202) 225-3665  Financial Services 
23rd Reed, Tom R 2437 RHOB (202) 225-3161  Ways and Means 
24th Katko, John R 1620 LHOB (202) 225-3701  Homeland Security Transportation and Infrastructure 
25th Slaughter, Louise McIntosh - Vacancy D 2469 RHOB (202) 225-3615  
26th Higgins, Brian D 2459 RHOB (202) 225-3306  Budget Ways and Means 
27th Collins, Chris R 1117 LHOB (202) 225-5265  Energy and Commerce
'''

In [8]:
house_list = house_reps.splitlines()
house_list

['1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs ',
 '2nd King, Pete R 339 CHOB (202) 225-7896  Financial Services Homeland Security Intelligence ',
 '3rd Suozzi, Thomas D 226 CHOB (202) 225-3335  Armed Services Foreign Affairs ',
 "4th Rice, Kathleen D 1508 LHOB (202) 225-5516  Homeland Security Veterans' Affairs ",
 '5th Meeks, Gregory W. D 2234 RHOB (202) 225-3461  Financial Services Foreign Affairs ',
 '6th Meng, Grace D 1317 LHOB (202) 225-2601  Appropriations ',
 '7th Velázquez, Nydia M. D 2302 RHOB (202) 225-2361  Financial Services Natural Resources Small Business ',
 '8th Jeffries, Hakeem D 1607 LHOB (202) 225-5936  Budget Judiciary ',
 '9th Clarke, Yvette D. D 2058 RHOB (202) 225-6231  Energy and Commerce Small Business Ethics ',
 '10th Nadler, Jerrold D 2109 RHOB (202) 225-5635  Judiciary ',
 '11th Donovan, Daniel R 1541 LHOB (202) 225-3371  Foreign Affairs Homeland Security ',
 '12th Maloney, Carolyn D 2308 RHOB (202) 225-7944  Financial Servi

In [9]:
#The multiline flag re.M makes ^ (and $) match at the start (and end) of every line
#in the string, not just at the start and end of the whole string.
dists = re.findall(r"^\d\d?",house_reps,re.M) 
dists

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27']
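To check what `re.M` changes, here is a quick sketch on a three-line string (shortened, made-up lines in the style of `house_reps`):

```python
import re

# Shortened sample lines, for illustration only.
lines = "1st Zeldin\n2nd King\n10th Nadler"

# Without the flag, ^ only matches at the very start of the string.
without_flag = re.findall(r"^\d\d?", lines)

# With re.M, ^ also matches right after every newline.
with_flag = re.findall(r"^\d\d?", lines, re.M)

print(without_flag)  # ['1']
print(with_flag)     # ['1', '2', '10']
```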

In [10]:
#different searching goodies in here!

#[re.findall(r"^\d+",line) for line in house_list]
#[re.findall(r"(^\d+[nrst][dht])",line) for line in house_list]
#[re.findall(r" [A-Z][\w]+, [A-Z][\w]+",line) for line in house_list]
#[re.findall(r"^\d+[nrst][dht] [A-Z][\w]+,",line) for line in house_list]
#[re.findall(r"[(]\d+[)][ 0-9\-]+",line) for line in house_list]
#[re.findall(r"[(]\d+[)] \d{3}-\d{4}",line) for line in house_list]
#[re.findall(r"[(]\d+[)] \d+-*\d+",line) for line in house_list]
#[re.findall(r"\D+$",line) for line in house_list]
#[re.findall(r" [DR] \d",line) for line in house_list]
#[re.findall(r", ([A-Z]\w+) ([A-Z]\w+ )*([A-Z][.])*",line) for line in house_list]
[re.findall(r", Jo[hs]",line) for line in house_list]


[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [', Jos'],
 [', Jos'],
 [],
 [],
 [],
 [', Joh'],
 [],
 [],
 [],
 [],
 [', Joh'],
 [],
 [],
 []]
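The small patterns above can also be combined into one pattern whose groups each capture a different type of data. This is only a sketch: the combined pattern and the dictionary keys are my own assumptions about how the lines are laid out, and it is tried here on just two of the lines.

```python
import re

# Two lines copied from house_reps above.
lines = [
    "1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs ",
    "5th Meeks, Gregory W. D 2234 RHOB (202) 225-3461  Financial Services Foreign Affairs ",
]

field_pattern = re.compile(
    r"^(\d+)[a-z]{2} "           # district number (ordinal suffix skipped)
    r"([^,]+), (.+?) "           # last name, then first name(s) and initials
    r"([DR]) "                   # party
    r"(\d+ [A-Z]+) "             # office number and building
    r"(\(\d{3}\) \d{3}-\d{4})"   # phone number
    r"\s*(.*)$"                  # whatever is left: the committees
)

reps = []
for line in lines:
    m = field_pattern.match(line)
    if m:  # skip any line that does not fit the assumed layout
        district, last, first, party, office, phone, committees = m.groups()
        reps.append({'district': int(district), 'last': last, 'first': first,
                     'party': party, 'office': office, 'phone': phone,
                     'committees': committees.strip()})

print(reps[1]['first'], reps[1]['party'], reps[1]['phone'])
# Gregory W. D (202) 225-3461
```

The committee names run together without a separator in this data, so the sketch leaves them as one string rather than guessing where one ends and the next begins.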

**Phrase search and ?= (lookahead)**
Lookaheads, lookbehinds, and their negative versions are more advanced, and you can learn them when you need them. Basically, they check whether a pattern occurs (or does not occur) ahead of or behind the current position, without consuming any characters. Here is a very simple example. When the time comes to use these, you will know. This is a decent explanation online: https://www.rexegg.com/regex-lookarounds.html
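For reference, here is a sketch of the four kinds of lookaround on one made-up string (the `prices` text and the patterns are invented for this illustration):

```python
import re

prices = "cost: $40, weight: 40kg, discount: 15%"

# Positive lookahead (?=...): digits only when "kg" follows; "kg" is not consumed.
followed_by_kg = re.findall(r"\d+(?=kg)", prices)

# Negative lookahead (?!...): digits NOT followed by "kg".
# The extra |\d stops the regex from settling for just the "4" of "40kg".
not_kg = re.findall(r"\d+(?!kg|\d)", prices)

# Positive lookbehind (?<=...): digits only when a dollar sign comes right before.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind (?<!...): whole numbers NOT preceded by a dollar sign.
not_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(followed_by_kg)  # ['40']
print(not_kg)          # ['40', '15']
print(after_dollar)    # ['40']
print(not_dollar)      # ['40', '15']
```

In every case the lookaround text itself ("kg", "$") stays out of the result, because it is only checked, never matched.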

In [11]:
#phrases = re.findall(r"\w{2}","hello") 
phrases = re.findall(r"(?=(\w{2}))","hello") 
phrases

['he', 'el', 'll', 'lo']

In [12]:
speech = '''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.'''

Below is a search for three-word phrases that begin with two-letter words. See the difference between the lookahead version and the regular version:

With the lookahead, in the line `And then is heard no more. It is a tale` it finds all of the phrases that begin with two-letter words, even those that overlap:
```
'is heard no',
 'no more. It',
 'It is a',
 'is a tale',
```
Without the lookahead, once it finds a match it resumes searching right after the end of that match, so it never finds the two-letter words that are already included in a result:

```
'is heard no',
 'It is a',
```

In [13]:
# three-word phrases that begin with two-letter words (non-overlapping)
#phrases = re.findall(r"\b\w{2}\W+\w+\W+\w+",speech)
# the same search inside a lookahead, so overlapping phrases are found too
phrases = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))",speech) 


#groups: the same two searches, but using groups to isolate each word.
#phrases = re.findall(r"(\b\w{2})\W+(\w+)\W+(\w+)",speech)
#phrases = re.findall(r"(?=(\b\w{2})\W+(\w+)\W+(\w+))",speech)

phrases

['in this petty',
 'to day,\nTo',
 'To the last',
 'of recorded time',
 'to dusty death',
 'is heard no',
 'no more. It',
 'It is a',
 'is a tale',
 'by an idiot',
 'an idiot, full',
 'of sound and']

**Splitting a sentence**
This is tricky, and this isn't even the best or most robust regular expression. But it does work on an annoying group of sentences like this. Depending on the kind of text you are parsing, it can be virtually impossible to split by sentence accurately 100% of the time.

In [14]:
to_sentence = '''
Mr. Smith bought cheapsite.com for 1.5 million 
dollars, i.e. he paid a lot for it. Did he 
mind? Adam Jones Jr. thinks he didn't want to. In any 
case, this isn't true... Well, with a 
probability of .9 it isn't. Right?! Mr. Comey of 
the F.B.I. thinks not. 
'''

In [15]:
#Write a regex that accurately splits 
#the paragraph above into sentences

#this works ok!!
#First group looks for:
#     any three characters that are NOT capital letters, followed by . or ? or !
#           (this will not work for something that's written in all caps)
#           (this will not even work for a sentence ending: "said Jon.")
#The lookahead after that first group checks for:
#     one or more whitespace characters and a capital letter to start the next sentence.
#     Because it is a lookahead, that second part is not consumed or captured.
#split() splits by that first pattern as long as the lookahead is true. 

#So we get a list 'sents' that alternates between the body of a sentence
#and the captured ending of that sentence (its last three characters plus the punctuation).

sents = re.split(r"([^A-Z]{3}[.?!])(?=\s+[A-Z])",to_sentence)
sents



['\nMr. Smith bought cheapsite.com for 1.5 million \ndollars, i.e. he paid a lot for',
 ' it.',
 ' Did he \nm',
 'ind?',
 " Adam Jones Jr. thinks he didn't want",
 ' to.',
 " In any \ncase, this isn't tru",
 'e...',
 ' Well, with a \nprobability of .9 it is',
 "n't.",
 ' Rig',
 'ht?!',
 ' Mr. Comey of \nthe F.B.I. thinks not. \n']

In [16]:
#Here we join each sentence body to the captured ending that follows it
#to re-combine the full sentence. The last element is already a whole sentence.

join_sents = [sents[x] + sents[x+1] for x in range(0,len(sents)-2,2)]
join_sents.append(sents[-1])
join_sents

['\nMr. Smith bought cheapsite.com for 1.5 million \ndollars, i.e. he paid a lot for it.',
 ' Did he \nmind?',
 " Adam Jones Jr. thinks he didn't want to.",
 " In any \ncase, this isn't true...",
 " Well, with a \nprobability of .9 it isn't.",
 ' Right?!',
 ' Mr. Comey of \nthe F.B.I. thinks not. \n']
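The re-joining step can also be written by pairing up the even and odd elements with `zip()`. This is just an alternative sketch on a shorter, made-up paragraph (the `text` string is my own); it uses the same split pattern as above.

```python
import re

# Short made-up paragraph, for illustration only.
text = "It was late. Did he mind? Yes indeed! The end."

# Same pattern as above: the captured ending, then a lookahead for the next sentence.
parts = re.split(r"([^A-Z]{3}[.?!])(?=\s+[A-Z])", text)

# Even positions hold sentence bodies, odd positions hold the captured endings.
sentences = [body + ending for body, ending in zip(parts[0::2], parts[1::2])]

# The final element has no captured ending after it, so add it on its own.
sentences.append(parts[-1])

# Trim the leading whitespace left over from the split.
sentences = [s.strip() for s in sentences]

print(sentences)
# ['It was late.', 'Did he mind?', 'Yes indeed!', 'The end.']
```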

**More problems with text: The Waste Land**

In [17]:
#Read the whole poem into one string; the with block closes the file for us.
with open('wasteland.txt', 'r', encoding='utf8') as f:
    wasteland = f.read()

In [18]:
#(?!...) is a negative lookahead: it succeeds only when the pattern does NOT match ahead.
# Homework example of 2 occurrences "ow" 
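As one possible illustration of a negative lookahead (this is not the homework answer, and the `line` string is made up), here is a search for "of" followed by any word except "the":

```python
import re

# Made-up line, for illustration only.
line = "Out of the dead land, a heap of broken images, full of The night"

# \bof\W+ finds "of" plus the gap after it; (?!the\b) then refuses to go on
# if the next word is "the"; otherwise \w+ takes the next word.
of_not_the = re.findall(r"\bof\W+(?!the\b)\w+", line, re.IGNORECASE)

print(of_not_the)  # ['of broken']
```

Because of `re.IGNORECASE`, "of The night" is rejected as well.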

**Sometimes phrases are more useful than word searches**

In [19]:
#phrases = re.findall(r"\b\w{2}\W+\w+\W+\w+",wasteland)
phrases = re.findall(r"\bof\W+the\W+\w+",wasteland,re.IGNORECASE)
phrases

['of the Dead',
 'of the dead',
 'of the night',
 'of the Rocks',
 'of the bones',
 'of the window',
 'of the low',
 'of the dead',
 'of the key',
 'of the key']

**Writing a Most Frequent Words script**

In [20]:
waste_words = wasteland.lower()

#get a list of words
#waste_words1 = re.split(r"\W+",waste_words)
waste_words2 = re.findall(r"\b\w+\b",waste_words)
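The two ways of getting words are not quite equivalent, and both break words at apostrophes, which is probably where stray tokens like 's' and 't' in the list come from. A sketch on a made-up line (the third pattern is my own suggestion and assumes straight apostrophes, not curly ones):

```python
import re

# Made-up line, for illustration only.
line = "he didn't say: 'hurry up please!'"

# split() leaves an empty string when the text ends with non-word characters.
via_split = re.split(r"\W+", line)

# findall() only returns the words themselves.
via_findall = re.findall(r"\b\w+\b", line)

# Optionally allow an apostrophe inside a word so "didn't" stays whole.
with_apostrophes = re.findall(r"\b\w+(?:'\w+)*\b", line)

print(via_split)         # ['he', 'didn', 't', 'say', 'hurry', 'up', 'please', '']
print(via_findall)       # ['he', 'didn', 't', 'say', 'hurry', 'up', 'please']
print(with_apostrophes)  # ['he', "didn't", 'say', 'hurry', 'up', 'please']
```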

In [21]:
waste_words2

['the',
 'waste',
 'land',
 'related',
 'poem',
 'content',
 'details',
 'by',
 't',
 's',
 'eliot',
 'for',
 'ezra',
 'pound',
 'il',
 'miglior',
 'fabbro',
 'i',
 'the',
 'burial',
 'of',
 'the',
 'dead',
 'april',
 'is',
 'the',
 'cruellest',
 'month',
 'breeding',
 'lilacs',
 'out',
 'of',
 'the',
 'dead',
 'land',
 'mixing',
 'memory',
 'and',
 'desire',
 'stirring',
 'dull',
 'roots',
 'with',
 'spring',
 'rain',
 'winter',
 'kept',
 'us',
 'warm',
 'covering',
 'earth',
 'in',
 'forgetful',
 'snow',
 'feeding',
 'a',
 'little',
 'life',
 'with',
 'dried',
 'tubers',
 'summer',
 'surprised',
 'us',
 'coming',
 'over',
 'the',
 'starnbergersee',
 'with',
 'a',
 'shower',
 'of',
 'rain',
 'we',
 'stopped',
 'in',
 'the',
 'colonnade',
 'and',
 'went',
 'on',
 'in',
 'sunlight',
 'into',
 'the',
 'hofgarten',
 'and',
 'drank',
 'coffee',
 'and',
 'talked',
 'for',
 'an',
 'hour',
 'bin',
 'gar',
 'keine',
 'russin',
 'stamm',
 'aus',
 'litauen',
 'echt',
 'deutsch',
 'and',
 'when',

In [22]:
#sort words alphabetically
sortwords = waste_words2.copy()
sortwords.sort()
sortwords

['a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'abolie',
 'about',
 'about',
 'above',
 'above',
 'addresses',
 'aethereal',
 'affina',
 'after',
 'after',
 'after',
 'after',
 'again',
 'again',
 'again',
 'againe',
 'against',
 'age',
 'age',
 'agent',
 'ago',
 'agony',
 'ahead',
 'air',
 'air',
 'air',
 'air',
 'albert',
 'albert',
 'albert',
 'albert',
 'albert',
 'alexandria',
 'alive',
 'all',
 'all',
 'all',
 'all',
 'alley',
 'allows',
 'alone',
 'alone',
 'along',
 'already',
 'also',
 'always',
 'always',
 'am',
 'am',
 'among',
 'among',
 'among',
 'among',
 'amongst',
 'amongst',
 'an',
 'an',
 'and',
 'and',
 'and',
 'and',
 'and

In [23]:
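#Loop through the alphabetically arranged list,
#count each run of identical words, and make a dictionary for each word
all_words = []
counter = 0
this_word = None
for word in sortwords:
    if word != this_word:
        #a new word starts: store the previous word (if there was one)
        if this_word is not None:
            all_words.append({'word':this_word,'count':counter})
        counter = 1
        this_word = word
    else:
        counter +=1
#store the final word, which the loop itself never appends
if this_word is not None:
    all_words.append({'word':this_word,'count':counter})

In [24]:
all_words

[{'word': 'a', 'count': 66},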
 {'word': 'abolie', 'count': 1},
 {'word': 'about', 'count': 2},
 {'word': 'above', 'count': 2},
 {'word': 'addresses', 'count': 1},
 {'word': 'aethereal', 'count': 1},
 {'word': 'affina', 'count': 1},
 {'word': 'after', 'count': 4},
 {'word': 'again', 'count': 3},
 {'word': 'againe', 'count': 1},
 {'word': 'against', 'count': 1},
 {'word': 'age', 'count': 2},
 {'word': 'agent', 'count': 1},
 {'word': 'ago', 'count': 1},
 {'word': 'agony', 'count': 1},
 {'word': 'ahead', 'count': 1},
 {'word': 'air', 'count': 4},
 {'word': 'albert', 'count': 5},
 {'word': 'alexandria', 'count': 1},
 {'word': 'alive', 'count': 1},
 {'word': 'all', 'count': 4},
 {'word': 'alley', 'count': 1},
 {'word': 'allows', 'count': 1},
 {'word': 'alone', 'count': 2},
 {'word': 'along', 'count': 1},
 {'word': 'already', 'count': 1},
 {'word': 'also', 'count': 1},
 {'word': 'always', 'count': 2},
 {'word': 'am', 'count': 2},
 {'word': 'among', 'count': 4},
 {'wo

In [25]:
#sort the list of dictionaries by each dictionary's 'count' value, highest first
order_words = sorted(all_words, key=lambda d: d['count'], reverse=True)
order_words

[{'word': 'the', 'count': 206},
 {'word': 'and', 'count': 107},
 {'word': 'i', 'count': 68},
 {'word': 'a', 'count': 66},
 {'word': 'of', 'count': 66},
 {'word': 'in', 'count': 55},
 {'word': 'you', 'count': 40},
 {'word': 'is', 'count': 34},
 {'word': 'to', 'count': 33},
 {'word': 'on', 'count': 27},
 {'word': 'with', 'count': 26},
 {'word': 's', 'count': 23},
 {'word': 'at', 'count': 21},
 {'word': 'my', 'count': 20},
 {'word': 'there', 'count': 20},
 {'word': 'what', 'count': 20},
 {'word': 'are', 'count': 19},
 {'word': 'by', 'count': 18},
 {'word': 'he', 'count': 18},
 {'word': 'it', 'count': 18},
 {'word': 'said', 'count': 18},
 {'word': 'that', 'count': 18},
 {'word': 'her', 'count': 17},
 {'word': 'water', 'count': 16},
 {'word': 'no', 'count': 15},
 {'word': 'if', 'count': 14},
 {'word': 'me', 'count': 14},
 {'word': 'only', 'count': 14},
 {'word': 'or', 'count': 14},
 {'word': 'his', 'count': 13},
 {'word': 'out', 'count': 13},
 {'word': 'we', 'count': 13},
 {'word': 'were', 

In [26]:
#Or use the Built-in version!!!!!
from collections import Counter
wordcount = Counter(waste_words2)
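As a sanity check, the sort-and-count approach and `Counter` should agree. Here is a sketch on a tiny made-up word list; `itertools.groupby` stands in for the hand-written loop and is my substitution, not the method used above.

```python
from collections import Counter
from itertools import groupby

# Tiny made-up word list, for illustration only.
words = ['the', 'dead', 'land', 'the', 'dead', 'the']

# Manual route: sort, then count each run of identical neighbors.
manual = {word: len(list(run)) for word, run in groupby(sorted(words))}

# Built-in route.
counted = Counter(words)

print(manual == dict(counted))  # True
print(counted.most_common(2))   # [('the', 3), ('dead', 2)]
```

One difference to expect: `sorted()` is stable, so the manual list keeps words with the same count in alphabetical order, while `most_common()` keeps them in the order they were first encountered in the text. That is why ties such as 'a' and 'of' can appear in a different order in the two outputs.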

In [27]:
wordcount.most_common()

[('the', 206),
 ('and', 107),
 ('i', 68),
 ('of', 66),
 ('a', 66),
 ('in', 55),
 ('you', 40),
 ('is', 34),
 ('to', 33),
 ('on', 27),
 ('with', 26),
 ('s', 23),
 ('at', 21),
 ('my', 20),
 ('there', 20),
 ('what', 20),
 ('are', 19),
 ('by', 18),
 ('he', 18),
 ('said', 18),
 ('that', 18),
 ('it', 18),
 ('her', 17),
 ('water', 16),
 ('no', 15),
 ('me', 14),
 ('or', 14),
 ('only', 14),
 ('if', 14),
 ('out', 13),
 ('we', 13),
 ('were', 13),
 ('his', 13),
 ('one', 12),
 ('who', 12),
 ('but', 12),
 ('o', 12),
 ('can', 12),
 ('t', 11),
 ('was', 11),
 ('down', 11),
 ('rock', 11),
 ('from', 11),
 ('she', 11),
 ('dead', 10),
 ('this', 10),
 ('under', 10),
 ('not', 10),
 ('nothing', 10),
 ('do', 10),
 ('up', 10),
 ('time', 10),
 ('for', 9),
 ('over', 9),
 ('when', 9),
 ('which', 9),
 ('so', 9),
 ('have', 9),
 ('mountains', 8),
 ('man', 8),
 ('your', 8),
 ('jug', 8),
 ('shall', 8),
 ('where', 7),
 ('eyes', 7),
 ('be', 7),
 ('here', 7),
 ('its', 7),
 ('upon', 7),
 ('as', 7),
 ('now', 7),
 ('night', 6