# Hist 3368 - Week 5 - Working With Tabular Data in Pandas

Until now, in this class we have worked with lists of words. We have cleaned them and counted and compared them.

For the rest of the class, we will be working with data in tables. Tables let us keep track of information about each text, such as the date of a speech and who delivered it. With that information, we can compare word counts over time, compare word counts for different speakers, and so on.

We will need a few special commands to navigate tabular data.

In this notebook, we will learn to navigate tables:

   * how to call a column
   * how to move through a column, row by row, using a for loop
   * how to stopword
   * how to subset or 'filter' data by a column, for example, finding all the speeches of one speaker, or all the speeches on a certain date, using square brackets -- **[ ]** -- and the operators **.isin()**, **==**, and **!=**.
   * how to find the largest counts in a dataset using **.nlargest()**

We will also learn some basics of counting with tables:

   * how to count the words in a subset of data.

#### Learning Research Strategies

We will practice navigating around the tabular data for Congress, asking the kind of questions a researcher might want to know, such as:

   * given a set of years, who were the top speakers in Congress?
   * given a speaker, what was his or her longest speech?
   * given a certain set of words, who were the speakers who used those words the most?
   
The research questions profiled here are fairly simple, but if combined with strategies such as a *controlled vocabulary* they can result in a good deal of important information about which speakers were engaged with a particular topic -- for instance, the environment, crime, or women's health.  

These research strategies can also help the researcher to navigate to the longest speeches where a speaker invokes those topics, or the speeches where the speaker invokes the highest number of words related to a particular topic.  Those research strategies should form the basis for guided reading.


## Load some data

In [1]:
import pandas as pd
import csv

In [2]:
cd ~/digital-history/

/users/jguldi/digital-history


In [3]:
congress = pd.read_csv("congress1967-2010.csv")

TROUBLESHOOTING: if the line above doesn't work, you might have missed something earlier this week.

In [4]:
congress.head()

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
0,0,Those who do not enjoy the privilege of the fl...,1967-01-10,The VICE PRESIDENT,16,1967,1,1967-01-01
1,1,Mr. President. on the basis of an agreement re...,1967-01-10,Mr. MANSFIELD,35,1967,1,1967-01-01
2,2,The Members of the Senate have heard the remar...,1967-01-10,The VICE PRESIDENT,40,1967,1,1967-01-01
3,3,The Chair lays before the Senate the following...,1967-01-10,The VICE PRESIDENT,151,1967,1,1967-01-01
4,4,Secretary of State.,1967-01-10,Mrs. AGNES BAGGETT,3,1967,1,1967-01-01


The data you are looking at is 'tabular' -- meaning that it's in a table.  

The format used by the pandas software package, which is running our table, is called a "dataframe."  A dataframe is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).  "Heterogeneous" means that the dataframe can have some columns that hold strings, and other columns that hold numbers or dates.

#### Basic Navigation

We have met pandas data with an index before when we met the pandas Series.  A Series is a one-dimensional labeled array -- meaning that it has only one column, not many.  However, everything that we learned about navigating indices will apply to dataframes too.

In [5]:
congress.index[0]

0

In [6]:
congress.index[1000]

1000

We can call the pandas data with the **.loc** and **.iloc** indexers.  The formula for calling data is:

    dataFrame.loc[<ROWS RANGE> , <COLUMNS RANGE>] -- for calling rows or columns by name
    dataFrame.iloc[<ROWS RANGE> , <COLUMNS RANGE>] -- for calling rows or columns by number

Here are rows #1005-1007. Notice that the slice 1005:1008 stops just before row 1008:

In [7]:
congress.iloc[1005:1008, ]

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
1005,1005,Mr. President. for many years I have advocated...,1967-01-11,Mr. WILLIAMS of Delaware,184,1967,1,1967-01-01
1006,1006,I am delighted to have the Senator from Delawa...,1967-01-11,Mr. DIRKSEN,27,1967,1,1967-01-01
1007,1007,Mr. President. I submit a resolution to amend ...,1967-01-11,Mr. CANNON,449,1967,1,1967-01-01
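
To make the difference between **.loc** and **.iloc** concrete, here is a minimal sketch on a small made-up dataframe. The name `toy`, its speakers, and its index labels are hypothetical and chosen only for illustration; the index labels deliberately differ from the row positions so that the two indexers give visibly different results.

```python
import pandas as pd

# Hypothetical toy data standing in for the congress dataframe
toy = pd.DataFrame(
    {
        "speaker": ["Mr. A", "Ms. B", "Mr. C", "Ms. D"],
        "word_count": [16, 35, 40, 151],
    },
    index=[10, 11, 12, 13],  # labels deliberately differ from positions 0-3
)

# .iloc counts positions from 0 and leaves out the end of the slice
by_position = toy.iloc[1:3, :]        # rows at positions 1 and 2

# .loc uses index labels and column names, and includes the end of the slice
by_label = toy.loc[11:13, "speaker"]  # rows labeled 11, 12, and 13

print(by_position)
print(by_label)
```

In the congress data the index labels happen to match the row positions, which is why the two indexers can look interchangeable there.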


Here is the speaker column. Notice the use of ':' for 'everything':

In [8]:
congress.loc[:, 'speaker']

0                  The VICE PRESIDENT
1                       Mr. MANSFIELD
2                  The VICE PRESIDENT
3                  The VICE PRESIDENT
4                  Mrs. AGNES BAGGETT
                      ...            
5992063                   Ms. GRANGER
5992064    Ms. KILPATRICK of Michigan
5992065                    Mr. HELLER
5992066                   Mr. PAULSEN
5992067          Mr. HALL of New York
Name: speaker, Length: 5992068, dtype: object

We can also call columns by name using just square brackets.

In [9]:
congress['speaker']

0                  The VICE PRESIDENT
1                       Mr. MANSFIELD
2                  The VICE PRESIDENT
3                  The VICE PRESIDENT
4                  Mrs. AGNES BAGGETT
                      ...            
5992063                   Ms. GRANGER
5992064    Ms. KILPATRICK of Michigan
5992065                    Mr. HELLER
5992066                   Mr. PAULSEN
5992067          Mr. HALL of New York
Name: speaker, Length: 5992068, dtype: object

Notice that I can also call the column with double brackets.

  * The difference between the two methods of calling the column is that above, single brackets call the column as a pandas Series.  
  * Double brackets call the column as a pandas dataframe -- such that the column is labeled with its name.

In [10]:
congress[['speaker']]

Unnamed: 0,speaker
0,The VICE PRESIDENT
1,Mr. MANSFIELD
2,The VICE PRESIDENT
3,The VICE PRESIDENT
4,Mrs. AGNES BAGGETT
...,...
5992063,Ms. GRANGER
5992064,Ms. KILPATRICK of Michigan
5992065,Mr. HELLER
5992066,Mr. PAULSEN
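
A quick way to confirm the difference between single and double brackets is to check the type of each result. This is a small sketch on a made-up dataframe; the name `toy` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the congress dataframe
toy = pd.DataFrame({"speaker": ["Mr. A", "Ms. B"], "word_count": [16, 35]})

as_series = toy["speaker"]     # single brackets give a pandas Series
as_frame = toy[["speaker"]]    # double brackets give a one-column dataframe

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)          # (number of rows, number of columns)
```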


You can also see how many rows there are.

In [11]:
congress['speaker'].count()

5992068

Here is just the speaker and speech for row 3234:

In [12]:
congress.loc[:, ['speaker', 'speech']].iloc[3234, :]

speaker                                            Mr. TOWER
speech     Mr. President. on June 17. a starting gun will...
Name: 3234, dtype: object

Here is just the speech:

In [13]:
myspeech = congress.loc[:, ['speech']].iloc[3234, :]
myspeech

speech    Mr. President. on June 17. a starting gun will...
Name: 3234, dtype: object

We can use some familiar tools to print out the whole speech or any portion thereof:

In [14]:
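for word in myspeech:  # myspeech is a Series holding one item: the text of the speech
    print(word[:1000])  # slice the string itself to print only its first 1,000 characters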

Mr. President. on June 17. a starting gun will sound in San Marcos. Tex.. and the worlds toughest river race will be underway. The race is the Texas water safari. marking its fifth year in 1967 with a 538mile race from San Marcos. by way of the San Marcos and Guadalupe Rivers. along coastal bays and rivers. utilizing the Intracoastal Canal. to Freeport. Brave men from all over the countryand several entrants from foreign countries--will test their endurance. skill. equipment. plain physical stamina. and even luck as they brave logjams. rocks. white water. strong winds. and exhausting portages. on a journey through some of the most beautiful country in Texas. I am submitting today a concurrent resolution granting official recognition to the event. The race Is being sponsored by a nonprofit organization expressly set up for this purpose. Prizes approaching $6.500 in value are being donated. along with several fine trophies. I believe this outstanding sports event. emphasizing courage. sk
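
If you only need one cell of the table, one possible shortcut is to give **.loc** both a row label and a column name, which returns the plain Python string rather than a Series. The sketch below uses a made-up dataframe (`toy` and its speeches are hypothetical) to show slicing that string by characters and by words.

```python
import pandas as pd

# Hypothetical toy data standing in for the congress dataframe
toy = pd.DataFrame(
    {
        "speaker": ["Mr. A", "Ms. B"],
        "speech": ["Mr. President. I rise today.", "I thank the Senator from Texas."],
    }
)

one_speech = toy.loc[1, "speech"]  # row label 1, column 'speech': a plain string

print(type(one_speech).__name__)
print(one_speech[:10])             # first 10 characters
print(one_speech.split()[:3])      # first 3 words
```

Slicing the string counts characters; splitting first and then slicing counts words.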

### Navigating tabular data: column by column, rows within columns

In the current dataset, the words of each speech in Congress are stored together as one long string in the 'speech' column.  

You can call the column 'speech' with square brackets, e.g.

    congress['speech']

Many speeches form a column called 'speech.'  The column speech can be called and treated as a list.

You can call individual speeches with an additional set of square brackets after ['speech'], e.g. 

    congress['speech'][0]
    
-- which calls the first speech in the speech column.

In [15]:
congress['speech'][0]

'Those who do not enjoy the privilege of the floor will please retire from the Chamber.'

In [16]:
congress['speech'][1]

'Mr. President. on the basis of an agreement reached on both sides. it is suggested that the Chamber be cleared of all attaches. unless they have absolutely important business to attend to in the Chamber.'

In [17]:
congress['speech'][2]

'The Members of the Senate have heard the remarks of the distinguished majority leader. All attaches and staff members who are not vitally needed for the next few minutes of the deliberations of the Senate will tetire from the Chamber.'

We can work on the text -- for instance cleaning or counting -- by calling each row in a text column, one at a time, and executing a transformation, via a for-loop.

Here are the last hundred characters of the last ten speeches in the dataframe, in upper case:

In [18]:
for speech in congress['speech'][-10:]:
    speech = speech.upper()
    print(speech[-100:])

MADAM SPEAKER. I WOULD LIKE TO SUBMIT THE FOLLOWING:
EMENTS AND SERVICE OF AVIS GREEN TUCKER. AND IN EXTENDING OUR CONDOLENCES TO HER FAMILY AND FRIENDS.
CERNING THE CHINESE GOVERNMENTS APPALLING AND MASSIVE HUMAN RIGHTS VIOLATIONS SIMPLY ISNT AN OPTION.
662-"YEA". H.R. 6547. PROTECTING STUDENTS FROM SEXUAL AND VIOLENT PREDATORS ACTROLCALL NO. 663"YEA".
RS THAN DOMINATED THE FINANCIAL MEDIA. THEY MADE THEIR REPUTATIONS AND THEIR FORTUNES THROUGH FRAUD.
ROLLCALL NOS. 662 AND 661. I WAS ABSENT FROM THE HOUSE. HAD I BEEN PRESENT. I WOULD HAVE VOTED "NO."
UL TO PROTECTING THE CONSTITUTION OF THE UNITED STATES AND THE GOALS OF OUR GREAT NATION. GOD BLESS.
AKER. ON ROLICALL NO. 658. I WAS UNAVOIDABLY DETAINED. HAD I BEEN PRESENT. I WOULD HAVE VOTED "YES."
LCALL NO. 658 MY FLIGHT WAS DELAYED DUE TO WEATHER AND HAD I BEEN PRESENT. I WOULD HAVE VOTED "YES."
ME BEFORE THE HOUSE. AND DONATED MY RAISE TO LOCAL NONPROFIT ORGANIZATIONS RATHER THAN ACCEPTING IT.


## Basic Counting with Tabular Data 

We will use two commands that we have seen before to count tabular data.

    .count() -- produces a count of how many items are in a category.  Generally speaking this is the same as counting the number of rows.
    .value_counts() -- produces the subtotals for every subcategory listed in a column. We have used this command previously to get the word counts for every word in a list.  We will use value_counts() to get word counts for every word in a column in pandas.
    
We will also use one new command to count how many unique objects there are in a category.

    .unique() -- finds only the unique members of a list
    


It's easier to understand the difference between these commands in practice.

**.count()** on its own gives you the number of rows that are not missing a value.  For our data, that just means the total number of speeches. 

Even if .count() is applied to the column speaker, it's still measuring the total number of individual speeches -- not how many unique speakers there are.  Most speakers are responsible for more than one speech, so their name appears several times in the dataset.  The count() below counts all rows in the dataframe, regardless of how many speakers there are:

In [19]:
congress['speaker'].count()

5992068

**.value_counts()** organizes the data by unique values and then creates a count of each.  We will use it for word count, as we have in the past.  

Applied to the speaker column, value_counts() gives you a list of how many speeches each speaker gave.

In [20]:
congress['speaker'].value_counts()

The PRESIDING OFFICER      709041
The SPEAKER pro tempore    239201
The CHAIRMAN               137788
The SPEAKER                 86866
Mr. ROBERT C. BYRD          75733
                            ...  
MAJ. MICHAEL S. STEWART         1
Mr. BROOMPELD                   1
Mr. IISH                        1
Mr. STFNHOTM                    1
Tile I-COORDINATION             1
Name: speaker, Length: 56350, dtype: int64

What if you need to know how many unique speakers are represented in this dataframe?  You will use the **.unique()** function.

In [21]:
congress['speaker'].unique()

array(['The VICE PRESIDENT', 'Mr. MANSFIELD', 'Mrs. AGNES BAGGETT', ...,
       'The DERA', 'Mr. BAYHI', 'Mr. BAY-H'], dtype=object)

In [22]:
len(congress['speaker'].unique())

56350
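
The three counting commands are easiest to compare side by side on a tiny example. The sketch below uses a made-up Series of five speaker names (hypothetical data). It also shows **.nunique()**, a pandas method that is not used elsewhere in this notebook and that returns the number of unique values directly.

```python
import pandas as pd

# Hypothetical speaker column with five speeches by three speakers
speakers = pd.Series(["Mr. A", "Ms. B", "Mr. A", "Mr. A", "Ms. C"], name="speaker")

print(speakers.count())         # number of non-missing rows
print(speakers.value_counts())  # number of rows for each unique speaker
print(speakers.unique())        # the unique speakers themselves
print(speakers.nunique())       # how many unique speakers there are
```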

#### Counting particular words per cell.

Above we noted that .count(), applied to a column, will give you the number of rows in the column.

You can also use .str.count() to count how many times an individual string appears in each row of a text column. 

This line counts the string 'pineapple' for each speech in the 'speech' column.  

The resulting series has zeros in most rows.  But if we use .nlargest() we get a short list of the index numbers of the speeches where pineapples are mentioned the most:

In [23]:
congress['speech'].str.count('pineapple').nlargest(5)

1000851    65
1084391    51
1017189    29
2164092    26
1023495    24
Name: speech, dtype: int64

Here's how to print the results, using .loc and .iloc to call the speech by its index number.

In [24]:
for word in list(congress.loc[:, ['speech']].iloc[1000851, ]):
    print(word[:1000])

Mr. President. I am introducing legislation today to enable Hawaiian pineapple products to compete in the U.S. market with lowcost foreign canned pineapple which can easily undersell Hawaiian pineapple. One of the finest products in all America is the sweet. juicy. delectable pineapple grown in Hawaii. Since the turn of the century. pineapple has been a mainstay in Hawaiis economy. Today it is still my States second largest agricultural industay. second only to sugar. The processed value of Hawaiian pineapple last year was $137 million. The industry employs 6.200 yearround workers who earned $42 million in annual wages and another 12.000 seasonal workers who earn a total of $10 million a year. Hawaiis pineapple industry has been very energetic and progressive. investing millions of its own dollars in research to improve pineapple quality and production. The Hawaiian pineapple industry is the most highly mechanized In the world and its fleldworkers are the highest paid in the world. The

Note that we have searched here for the bare string 'pineapple,' which also matches longer words such as 'pineapples.' In future searches this method could create confusion unless we use a regular expression (regex) to look for an exact word. Here it does no harm, because 'pineapple' is unusual enough to produce good results as a free-standing string.
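
As a sketch of the regex approach, `.str.count()` treats its argument as a regular expression, so the word-boundary marker `\b` can be used to match whole words only. The two example speeches below are made up for illustration.

```python
import pandas as pd

# Hypothetical speeches for illustration
speeches = pd.Series(
    [
        "The pineapple is sweet. Many pineapples grow in Hawaii.",
        "We saw a pineapple and another pineapple.",
    ]
)

substring_hits = speeches.str.count("pineapple")         # also counts 'pineapples'
whole_word_hits = speeches.str.count(r"\bpineapple\b")   # counts the exact word only

print(substring_hits.tolist())
print(whole_word_hits.tolist())
```

The same caution applies to short strings such as 'art', which would otherwise also match 'party' or 'start'.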

## Subsetting Data

We can use Python's grammar of operators to ask Python to look only at certain parts of the data -- or 'subsets' of the complete dataset.

For instance, if we want *only* the data from the 1980s, we can use square brackets **[ ]** to tell Python to subset a dataframe according to the constraints inside the brackets.

The command to subset data is expressed with the grammar:

    df[df['columnname'].LIMITINGOPERATOR]


For instance, df[df['speaker']=='bob'] would tell Python to find only the rows of the dataframe where 'bob' was listed as the speaker.

Using square brackets to "filter" for particular rows is one of the major ways of navigating tabular data in pandas.


### The operators for filtering

The following 'operators' are the ones most frequently used to tell Python how to narrow down the data.  Each works a slightly different way: 

    .isin() -- tells Python to only look for values that are in another list
    == -- tells Python to only look for values that are equal to another value
    != -- tells Python to only look for values that are NOT equal to another value



In [25]:
congress[congress['speaker'] == 'Mr. DOLE'][:10]

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
7475,7475,"Mr. Speaker. the January 8. 1967. ""Doanes Agri...",1967-02-01,Mr. DOLE,201,1967,2,1967-02-01
8657,8657,Mr. Speaker. I ask unanimous consent to revise...,1967-02-02,Mr. DOLE,12,1967,2,1967-02-01
8659,8659,Mr. Speaker. I join in the statements made by ...,1967-02-02,Mr. DOLE,301,1967,2,1967-02-01
8767,8767,Mr. Speaker. it is my pleasure to join in the ...,1967-02-02,Mr. DOLE,878,1967,2,1967-02-01
12255,12255,Mr. Speaker. today I have introduced a joint r...,1967-02-09,Mr. DOLE,82,1967,2,1967-02-01
19034,19034,Mr. Speaker. it is my pleasure to join Mrs. BO...,1967-02-28,Mr. DOLE,258,1967,2,1967-02-01
20616,20616,Mr. Speaker. I wish to associate myself with t...,1967-03-02,Mr. DOLE,291,1967,3,1967-03-01
24507,24507,Mr. Speaker. during this year of 1967 the Fede...,1967-03-08,Mr. DOLE,122,1967,3,1967-03-01
25378,25378,Mr. Speaker. will the gentleman yield?,1967-03-09,Mr. DOLE,6,1967,3,1967-03-01
25380,25380,Mr. Speaker. permit me to say. first of all. t...,1967-03-09,Mr. DOLE,78,1967,3,1967-03-01
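
It can help to see what happens inside the square brackets. The comparison produces a Series of True/False values, one per row, and the outer brackets keep only the rows marked True. Here is a minimal sketch of all three operators on a made-up dataframe (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data standing in for the congress dataframe
toy = pd.DataFrame(
    {
        "speaker": ["Mr. A", "Ms. B", "Mr. A", "Ms. C"],
        "year": [1979, 1980, 1981, 1990],
    }
)

mask = toy["speaker"] == "Mr. A"   # a Series of True/False values, one per row
print(mask.tolist())

only_a = toy[mask]                                       # rows where the speaker IS Mr. A
not_a = toy[toy["speaker"] != "Mr. A"]                   # rows where the speaker is NOT Mr. A
b_or_c = toy[toy["speaker"].isin(["Ms. B", "Ms. C"])]    # rows where the speaker is in a list

print(only_a)
print(not_a)
print(b_or_c)
```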


#### Using .isin() to find data from the 1980s

In the following line of code, we'll use **.isin()** to tell Python to look for values in the 1980s.  We tell Python to look at the 'year' column. Then we select only the years that are in a list of years from the 1980s. 

    eighties_data = congress[congress['year'].isin(target_years)].copy()  # filter our dataset to just this decade

**.isin()** takes a list as its argument, for instance the *target_years* variable, which we will create to include every year from 1980 to 1989.


Before we apply .isin(), however, we need to format the data so that we can navigate for time.

First, we need to make a 'year' column.

Then we need to filter for years that are in our target.  Note the use of the .isin() function. 

In [26]:
import pandas as pd
import datetime

We convert the 'date' column to dates with pd.to_datetime() and then use the pandas datetime accessor
    
    .dt.year

to create a new column called 'year'

In [27]:
congress['year']=pd.to_datetime(congress['date']).dt.year # make a year column

congress.head()

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
0,0,Those who do not enjoy the privilege of the fl...,1967-01-10,The VICE PRESIDENT,16,1967,1,1967-01-01
1,1,Mr. President. on the basis of an agreement re...,1967-01-10,Mr. MANSFIELD,35,1967,1,1967-01-01
2,2,The Members of the Senate have heard the remar...,1967-01-10,The VICE PRESIDENT,40,1967,1,1967-01-01
3,3,The Chair lays before the Senate the following...,1967-01-10,The VICE PRESIDENT,151,1967,1,1967-01-01
4,4,Secretary of State.,1967-01-10,Mrs. AGNES BAGGETT,3,1967,1,1967-01-01
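
Here is the same step in miniature, as a sketch on three made-up dates (the dataframe `toy` is hypothetical). Once the strings have been converted with `pd.to_datetime()`, the `.dt` accessor exposes parts of each date, such as the year and the month.

```python
import pandas as pd

# Hypothetical dates stored as strings, as in the 'date' column
toy = pd.DataFrame({"date": ["1967-01-10", "1980-03-02", "1989-12-31"]})

dates = pd.to_datetime(toy["date"])  # convert strings to datetime values
toy["year"] = dates.dt.year          # pull out the year of each date
toy["month"] = dates.dt.month        # pull out the month of each date

print(toy)
```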


Using == to subset:

In [28]:
data1980 = congress[congress['year']== 1980].copy()  # filter our dataset to just this one year

data1980.head()

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
2329890,2329890,Mr. Speaker. we in Delaware are proud of the o...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01
2329891,2329891,Mr. Speaker. it is logical for Americans to be...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01
2329892,2329892,The Chair has examined the Journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01
2329893,2329893,Mr. Speaker. I ask unanimous consent that the ...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01
2329894,2329894,Is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01


Using .isin() to subset:

In [29]:
target_years = list(range(1980, 1989 + 1))  # List of the years 1980-1989

eighties_data = congress[congress['year'].isin(target_years)].copy()  # filter our dataset to just this decade

eighties_data.head()

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
2329890,2329890,Mr. Speaker. we in Delaware are proud of the o...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01
2329891,2329891,Mr. Speaker. it is logical for Americans to be...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01
2329892,2329892,The Chair has examined the Journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01
2329893,2329893,Mr. Speaker. I ask unanimous consent that the ...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01
2329894,2329894,Is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01
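
The `+ 1` in `range(1980, 1989 + 1)` matters because `range()` stops just before its second argument. The sketch below checks this on a made-up column of years (the dataframe `toy` is hypothetical).

```python
import pandas as pd

target_years = list(range(1980, 1989 + 1))  # range() leaves out its end value, hence the + 1
print(target_years[0], target_years[-1], len(target_years))

# Hypothetical years, including two that fall just outside the decade
toy = pd.DataFrame({"year": [1979, 1980, 1985, 1989, 1990]})

eighties = toy[toy["year"].isin(target_years)]  # keep only rows whose year is in the list
print(eighties)
```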


Let's save the results in case we want to use them again.

In [31]:
cd ~/digital-history

/users/jguldi/digital-history


In [None]:
data1980.to_csv("data1980.csv")
eighties_data.to_csv("eighties_data.csv")
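
A side note on saving: by default `.to_csv()` writes the index as an extra, unnamed column, and `pd.read_csv()` then loads it back as 'Unnamed: 0'. That is one plausible explanation for the 'Unnamed: 0' columns visible in the table above, though it is an assumption about how the file was made. The sketch below uses a made-up dataframe and an in-memory text buffer instead of a real file.

```python
import io
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"speaker": ["Mr. A"], "year": [1980]})

# Default behavior: the index is written as an extra column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)                      # rewind the buffer before reading it back
reloaded_with = pd.read_csv(with_index)

# index=False leaves the index out of the saved file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without = pd.read_csv(without_index)

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```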

## Cleaning tabular data

Next, we're going to break speeches into words and remove stopwords.  We get our stopwords list from the package NLTK (Natural Language Toolkit).

Let's load the stopwords as we have before:

In [204]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop = stopwords.words('english')
stop[:10]

[nltk_data] Downloading package stopwords to
[nltk_data]     /users/jguldi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Stopwording with a table is much like stopwording a list: we need a for-loop that cycles through each word in the list and asks if that word is "in" the stopwords list.

In a table, however, the text is in a column.

We work on text data in a table by calling each row of the column individually with a for loop and then changing the column as a whole in the table.  

Let's write a for-loop that cycles through each speech in data1980['speech']. For each speech, we will perform some familiar tasks:

  * We will .split() the speech into words
  * We will ask Python whether each word is in the stopwords list
  * We will save the results of that comparison.

After doing this for one speech, we will save the results of the stopworded speech in a new, clean list called new_column.

The for-loop will continue on to the next speech in the column data1980['speech'] until the whole thing has been cleaned, and the clean results comprise a list called new_column, which has as many entries as data1980 has rows.

**This may take a minute depending on how much memory you have.**

In [None]:
import string  # provides string.punctuation, a string of common punctuation marks

new_column = []

for speech in data1980['speech']: # cycle through each speech 
    for c in string.punctuation: # cycle through the punctuation marks
        speech = speech.replace(c,'') # remove punctuation and keep the result
    words = speech.lower().split() # lowercase and split into a list of words
    speech2 = [] # make an empty list to be filled in by the clean words in this individual speech
    for word in words: # cycle through each word in the speech 
        if word not in stop: # test if it's a stopword
            speech2.append(word) # save the good words
    new_column.append(speech2) # save each clean speech as a new item in the list, new_column

new_column[:5] # show us the first five clean speeches

Now that we have the clean results stored in new_column, we can use *new_column* as the basis for a new column in the dataframe *data1980*.

In [None]:
data1980['stopworded_speech'] = new_column

data1980.head()

Here is the play-by-play of what just happened.

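We removed the stopwords with a nested loop that moves through the dataframe above.  Here's a detailed account of a variant of that loop, which removes stopwords from the column 'speech' in a copy of the 1980s data called 'eighties_data2'.  Unlike the cell above, this variant joins the clean words back into a string and overwrites the 'speech' column.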

* Before we do anything else, we create a backup of eighties_data (just in case we mangle things!) and we make an empty list called new_column, which is going to be a dummy version of the real data in eighties_data['speech']


        eighties_data2 = eighties_data.copy()

        new_column = []
        
        
* The first part of the loop takes the 'speech' column, and moves one row at a time:
    
        for speech in eighties_data2['speech']:
        

* Next, the instructions split up each given speech into a list of separate words, which we call 'speech1'  
    
        speech1 = speech.split()


* Next, we create an empty dummy variable called 'speech2':

        speech2 = []


* We move through the words of speech1, asking of each one: Is this word in the stopwords list, 'stop'?

        for word in speech1:
            if word not in stop:
                
                
* If the word is NOT a stopword, we're going to keep it and add it to our dummy variable, speech2.  

                speech2.append(word)
                

* After that loop runs through every word in speech1, we will have a version of the first speech from eighties_data2['speech'] that has no stopwords, and we've called it speech2.  

    It's currently in the form of a list, so we're going to weld it back into a string with .join():
    
                speech3 = ' '.join(speech2)


* Then, at last, we'll tack our stopworded speech (speech3) onto the end of our dummy list, new_column.
                    
                 new_column.append(speech3)


The loop repeats this process for every speech in eighties_data2['speech'], and every word in every speech.  At the end of that process, new_column will be a list of clean speeches in the same order as the original column.

*  All there is to do now is to take our stopworded handiwork -- new_column -- and use it to replace the unstopworded original column, eighties_data2['speech']:
            
        eighties_data2['speech'] = new_column

The new output, eighties_data2, should be just like eighties_data, but with stopwords removed in the 'speech' column.

Please note that a loop like this takes a minute to run.  Also of interest for advanced coders: there are more compact ways to do this, for example using .apply(). We won't be covering them in this class, but you should feel free to look them up and implement them if you feel comfortable working on your own.
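
For the curious, here is a rough sketch of what an **.apply()** version might look like. It is one possible approach rather than the method used in this class. The function name `remove_stopwords`, the tiny stopword set `mini_stop`, and the dataframe `toy` are all hypothetical; in the notebook you would use the NLTK list `stop` and the real speeches instead.

```python
import string
import pandas as pd

# Hypothetical mini stopword list; in the notebook this would be the NLTK list `stop`
mini_stop = {"the", "of", "a", "is", "to"}

def remove_stopwords(speech):
    """Strip punctuation, lowercase, split, and drop stopwords from one speech."""
    no_punct = speech.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.lower().split() if word not in mini_stop]

# Hypothetical toy data standing in for the speeches
toy = pd.DataFrame({"speech": ["The Chair is ready.", "A point of order, Mr. Speaker."]})

# .apply() runs the function once per row and collects the results in a new column
toy["stopworded_speech"] = toy["speech"].apply(remove_stopwords)

print(toy["stopworded_speech"].tolist())
```

The function does the work of the inner loop for a single speech, and .apply() takes the place of the outer loop.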

## Wordcount with Tabular Data

We can use many of the tools we already know to count words.

     value_counts()


#### You need lists of words in a column to count them.

An important observation: in the stopwording loop above, we just changed the data type in which the words are stored. 

Originally, our 'speech' column was just long strings of words.  In order to stopword those strings, we .split() each speech into a list of individual words -- just like the lists we've been working on so far. Those lists are easy to stopword.

We could have glued the words back together into super-long strings again. But in fact, it's useful to keep the words in list form, because lists are easy to count.  

In [130]:
type(data1980['speech'])

pandas.core.series.Series

In [None]:
type(data1980['stopworded_speech'])

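How long is the first speech, in words?

In [None]:
len(data1980['speech'].iloc[0].split())  # .iloc[0] is the first row by position; .split() turns the string into a list of words

What are the top words in the first speech?

In [None]:
pd.Series(data1980['speech'].iloc[0].split()).value_counts()  # count each word in the first speech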

#### Count the top words overall in our dataset

Let's count the top words.  

Let's create a list called "words1980" from the words of each speech in the "stopworded_speech" column.

In [None]:
words1980 = []

for speech in data1980['stopworded_speech']:  # each entry is already a list of clean, lowercase words
    for word in speech:
        words1980.append(word)

In [None]:
words1980[:10]

In [None]:
topwords1980 = pd.Series(words1980).value_counts()[:20]  # turn the list into a Series, then count
topwords1980

Great -- but some of those words are pretty hollow!  

Let's create a new stopword list and clean it up.

In [None]:
stopworded_count = []
stopwords2 = ['ca', 'mr', 'many', 'without', 'last', 'way', 'programs', 'want', 'like', 'ask', 'could', 'year', 'american', 'country', 'well', 'made', 'say', 'members', 'million', 'must', 'percent', 'congress', 'federal', 'national', 'legislation', 'government', 'program', 'may', 'act', 'make', 'going', 'first', 'senator', 'senate', 'legislation', 'support', 'chairman', 'amendment', 'committee', 'united', 'today', 'state', 'one', 'us', 'gentleman', 'would', 'bill', '100th', 'house', 'states', 'new', 'speaker', 'years', 'also', 'time']

cleanwords1980 = []

for speech in data1980['stopworded_speech']:  # each entry is already a list of clean, lowercase words
    for word in speech:
        if word not in stopwords2: # notice the new line here!
            cleanwords1980.append(word)
        
topwords1980 = pd.Series(cleanwords1980).value_counts()[:20]  # turn the list into a Series, then count
topwords1980
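
When a column already holds lists of words, pandas also offers a way to count without writing the loop: `.explode()` gives every word its own row, and `.value_counts()` then counts them. This is an alternative sketch, not the method used above, and the dataframe `toy` with its three short "speeches" is made up for illustration.

```python
import pandas as pd

# Hypothetical column of word lists, like 'stopworded_speech'
toy = pd.DataFrame(
    {"stopworded_speech": [["tax", "reform", "tax"], ["reform", "budget"], []]}
)

# .explode() puts each word on its own row; an empty list becomes a missing value,
# which .value_counts() leaves out by default
word_counts = toy["stopworded_speech"].explode().value_counts()

print(word_counts)
```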

## Get the top speeches that mention the word "democracy" 

In [None]:
import numpy as np

Next, let's count the number of times that the word 'democracy' appears in each speech from 1980.  

Notice the use of .str.count():

In [None]:
data1980['democracy_count'] = data1980['speech'].str.count('democracy')  # Create a new column for the count of the word democracy
display(data1980)

Get the top 20 speeches that mentioned democracy.

In [None]:
democracyspeeches = data1980.nlargest(20, ['democracy_count'])
democracyspeeches

## Get the top speeches by length

What are the longest speeches in the database?

In [None]:
longest_speeches = congress.nlargest(n=5, columns=['word_count']) # Get the top 5 longest speeches by word_count
longest_speeches 
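
To see what **.nlargest()** is doing, it can help to compare it with sorting. On the made-up dataframe below (`toy` and its values are hypothetical), asking for the two largest word counts gives the same rows as sorting the whole table from largest to smallest and keeping the top two.

```python
import pandas as pd

# Hypothetical toy data standing in for the congress dataframe
toy = pd.DataFrame(
    {
        "speaker": ["Mr. A", "Ms. B", "Mr. C", "Ms. D"],
        "word_count": [16, 449, 35, 184],
    }
)

top2 = toy.nlargest(2, "word_count")                            # the two longest speeches
same = toy.sort_values("word_count", ascending=False).head(2)   # sort, then keep the top two

print(top2)
print(top2.equals(same))
```

Because .nlargest() returns a dataframe, it can be used on any subset created with square brackets as well as on the full table.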

## Assignment

1) Print out the first 500 words of the speech that mentions your favorite animal the highest number of times.  


2) Find the longest speech by Bob Dole. Show the last 500 words.  


3) Find the longest speech by Bob Dole in 1980. Show the first 500 words.  


4) Find the speech where Bob Dole mentions democracy the greatest number of times. Show the first 500 words.  


5) Find the longest speech where Bob Dole mentions democracy at least three times. Show the first 500 words.  


Take a screenshot of the code and the results and upload it. 