# Hist 3368 - Week 5 - Working With Tabular Data in Pandas

***by Jo Guldi***

Until now, in this class we have worked with lists of words. We have cleaned them and counted and compared them.

For the rest of the class, we will be working with data in tables. Tables allow us to keep track of information about each word, such as the date it was spoken and who spoke it. If we have time data, we can compare wordcounts over time, compare wordcounts for different speakers, and so on.

We will need a few special commands to navigate tabular data.

In this notebook, we will learn to navigate tables:

   * how to call a column
   * how to move through a column, row by row, using a for loop
   * how to stopword
   * how to subset or 'filter' data by a column, for example, finding all the speeches of one speaker, or all the speeches on a certain date, using square brackets -- **[ ]** -- and the operators **.isin()**, **==**, and **!=**.
   * how to find the largest counts in a dataset using **.nlargest()**

We will clean tabular data, with strategies we've seen before:
   * stripping punctuation
   * stopwording
   * lemmatizing
   * splitting into words (i.e. tokenization)

We will also learn some basics of counting with tables:

   * how to count the words in a subset of data.

#### Learning Research Strategies

We will practice navigating around the tabular data for Congress, asking the kind of questions a researcher might want to know, such as:

   * given a set of years, who were the top speakers in Congress?
   * given a speaker, what was his or her longest speech?
   * given a certain set of words, who were the speakers who used those words the most?
   
The research questions profiled here are fairly simple, but if combined with strategies such as a *controlled vocabulary* they can result in a good deal of important information about which speakers were engaged with a particular topic -- for instance, the environment, crime, or women's health.  

These research strategies can also help the researcher to navigate to the longest speeches where a speaker invokes those topics, or the speeches where the speaker invokes the highest number of words related to a particular topic.  Those research strategies should form the basis for guided reading.


## Load some data

In [1]:
import pandas as pd
import csv

In [2]:
cd ~/digital-history/

/users/jguldi/digital-history


***This might take a minute. Loading takes time -- please be patient.***

In [3]:
congress = pd.read_csv("congress1967-2010.csv")

TROUBLESHOOTING: if the line above doesn't work, you might have missed something earlier this week.

In [4]:
congress.head()

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
0,0,Those who do not enjoy the privilege of the fl...,1967-01-10,The VICE PRESIDENT,16,1967,1,1967-01-01
1,1,Mr. President. on the basis of an agreement re...,1967-01-10,Mr. MANSFIELD,35,1967,1,1967-01-01
2,2,The Members of the Senate have heard the remar...,1967-01-10,The VICE PRESIDENT,40,1967,1,1967-01-01
3,3,The Chair lays before the Senate the following...,1967-01-10,The VICE PRESIDENT,151,1967,1,1967-01-01
4,4,Secretary of State.,1967-01-10,Mrs. AGNES BAGGETT,3,1967,1,1967-01-01


The data you are looking at is 'tabular' -- meaning that it's in a table.  

The format used by the pandas software package, which is running our table, is called a "dataframe."  A dataframe is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).  "Heterogeneous" means that the dataframe can have some columns that hold strings, and other columns that hold numbers or dates.

#### Basic Navigation

We have met pandas data with an index before when we met the pandas Series.  A Series is a one-dimensional labeled array -- meaning that it has only one column, not many.  However, everything that we learned about navigating indices will apply to dataframes too.

In [5]:
congress.index[0]

0

In [6]:
congress.index[1000]

1000

We can call the pandas data with the **.loc** and **.iloc** indexers.  The formula for calling data is:

    dataFrame.loc[<ROWS RANGE> , <COLUMNS RANGE>] -- for calling rows or columns by name
    dataFrame.iloc[<ROWS RANGE> , <COLUMNS RANGE>] -- for calling rows or columns by number

Here are rows #1005-1007 (the slice stops just before 1008):

In [7]:
congress.iloc[1005:1008, ]

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
1005,1005,Mr. President. for many years I have advocated...,1967-01-11,Mr. WILLIAMS of Delaware,184,1967,1,1967-01-01
1006,1006,I am delighted to have the Senator from Delawa...,1967-01-11,Mr. DIRKSEN,27,1967,1,1967-01-01
1007,1007,Mr. President. I submit a resolution to amend ...,1967-01-11,Mr. CANNON,449,1967,1,1967-01-01
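
The difference between calling by name and calling by number is easiest to see on a tiny table. The sketch below uses a made-up dataframe called `toy` (not part of the Congress data) whose index labels deliberately differ from the row positions.

```python
import pandas as pd

# A toy table standing in for the congressional data (values are made up).
toy = pd.DataFrame(
    {
        "speaker": ["A", "B", "C", "D"],
        "word_count": [16, 35, 40, 151],
    },
    index=[10, 11, 12, 13],  # labels deliberately differ from positions
)

# .iloc counts positions from 0 and leaves out the end of the slice.
by_position = toy.iloc[1:3, :]

# .loc uses the index labels and includes the end of the slice.
by_label = toy.loc[11:13, ["speaker"]]

print(by_position)
print(by_label)
```

In our `congress` dataframe the labels and the positions happen to be the same numbers, which is why the two approaches look so similar there.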


Here is the speaker column. Notice the use of ':' for 'everything':

In [8]:
congress.loc[:, 'speaker']

0                  The VICE PRESIDENT
1                       Mr. MANSFIELD
2                  The VICE PRESIDENT
3                  The VICE PRESIDENT
4                  Mrs. AGNES BAGGETT
                      ...            
5992063                   Ms. GRANGER
5992064    Ms. KILPATRICK of Michigan
5992065                    Mr. HELLER
5992066                   Mr. PAULSEN
5992067          Mr. HALL of New York
Name: speaker, Length: 5992068, dtype: object

We can also call columns by name using just square brackets.

In [9]:
congress['speaker']

0                  The VICE PRESIDENT
1                       Mr. MANSFIELD
2                  The VICE PRESIDENT
3                  The VICE PRESIDENT
4                  Mrs. AGNES BAGGETT
                      ...            
5992063                   Ms. GRANGER
5992064    Ms. KILPATRICK of Michigan
5992065                    Mr. HELLER
5992066                   Mr. PAULSEN
5992067          Mr. HALL of New York
Name: speaker, Length: 5992068, dtype: object

Notice that I can also call the column with double brackets.

  * The difference between the two methods of calling the column is that above, single brackets call the column as a pandas Series.  
  * Double brackets call the column as a pandas dataframe -- such that the column is labeled with its name.

In [10]:
congress[['speaker']]

Unnamed: 0,speaker
0,The VICE PRESIDENT
1,Mr. MANSFIELD
2,The VICE PRESIDENT
3,The VICE PRESIDENT
4,Mrs. AGNES BAGGETT
...,...
5992063,Ms. GRANGER
5992064,Ms. KILPATRICK of Michigan
5992065,Mr. HELLER
5992066,Mr. PAULSEN
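
If you are ever unsure which of the two you are holding, you can ask Python for the type. This is a small sketch on a made-up two-row table; the name `toy` is only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"speaker": ["A", "B"], "word_count": [16, 35]})

as_series = toy["speaker"]    # single brackets
as_frame = toy[["speaker"]]   # double brackets

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # rows, columns
```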


You can also see how many rows there are.

In [11]:
congress['speaker'].count()

5992068

Here is just the speaker and speech for row 3234:

In [12]:
congress.loc[:, ['speaker', 'speech']].iloc[3234, :]

speaker                                            Mr. TOWER
speech     Mr. President. on June 17. a starting gun will...
Name: 3234, dtype: object

Here is just the speech:

In [13]:
myspeech = congress.loc[:, ['speech']].iloc[3234, :]
myspeech

speech    Mr. President. on June 17. a starting gun will...
Name: 3234, dtype: object

We can use some familiar tools to print out the speech. Here `myspeech` is a pandas Series with a single entry, so the loop below runs once and prints the text stored in that entry:

In [14]:
for word in myspeech[:500]:
    print(word)

Mr. President. on June 17. a starting gun will sound in San Marcos. Tex.. and the worlds toughest river race will be underway. The race is the Texas water safari. marking its fifth year in 1967 with a 538mile race from San Marcos. by way of the San Marcos and Guadalupe Rivers. along coastal bays and rivers. utilizing the Intracoastal Canal. to Freeport. Brave men from all over the countryand several entrants from foreign countries--will test their endurance. skill. equipment. plain physical stamina. and even luck as they brave logjams. rocks. white water. strong winds. and exhausting portages. on a journey through some of the most beautiful country in Texas. I am submitting today a concurrent resolution granting official recognition to the event. The race Is being sponsored by a nonprofit organization expressly set up for this purpose. Prizes approaching $6.500 in value are being donated. along with several fine trophies. I believe this outstanding sports event. emphasizing courage. sk

### Navigating tabular data: column by column, rows within columns

In the current dataset, many words are compiled into one long string that is a 'speech' in Congress.  

You can call the column 'speech' with square brackets, e.g.

    congress['speech']

Many speeches form a column called 'speech.'  The column speech can be called and treated as a list.

You can call individual speeches with an additional set of square brackets after ['speech'], e.g. 

    congress['speech'][0]
    
-- which calls the first speech in the speech column.

In [15]:
congress['speech'][0]

'Those who do not enjoy the privilege of the floor will please retire from the Chamber.'

In [16]:
congress['speech'][1]

'Mr. President. on the basis of an agreement reached on both sides. it is suggested that the Chamber be cleared of all attaches. unless they have absolutely important business to attend to in the Chamber.'

In [17]:
congress['speech'][2]

'The Members of the Senate have heard the remarks of the distinguished majority leader. All attaches and staff members who are not vitally needed for the next few minutes of the deliberations of the Senate will tetire from the Chamber.'

We can work on the text -- for instance cleaning or counting -- by calling each row in a text column, one at a time, and executing a transformation, via a for-loop.

Here are the last hundred characters of the last ten speeches in the dataframe, in upper case:

In [18]:
for speech in congress['speech'][-10:]:
    speech = speech.upper()
    print(speech[-100:])

MADAM SPEAKER. I WOULD LIKE TO SUBMIT THE FOLLOWING:
EMENTS AND SERVICE OF AVIS GREEN TUCKER. AND IN EXTENDING OUR CONDOLENCES TO HER FAMILY AND FRIENDS.
CERNING THE CHINESE GOVERNMENTS APPALLING AND MASSIVE HUMAN RIGHTS VIOLATIONS SIMPLY ISNT AN OPTION.
662-"YEA". H.R. 6547. PROTECTING STUDENTS FROM SEXUAL AND VIOLENT PREDATORS ACTROLCALL NO. 663"YEA".
RS THAN DOMINATED THE FINANCIAL MEDIA. THEY MADE THEIR REPUTATIONS AND THEIR FORTUNES THROUGH FRAUD.
ROLLCALL NOS. 662 AND 661. I WAS ABSENT FROM THE HOUSE. HAD I BEEN PRESENT. I WOULD HAVE VOTED "NO."
UL TO PROTECTING THE CONSTITUTION OF THE UNITED STATES AND THE GOALS OF OUR GREAT NATION. GOD BLESS.
AKER. ON ROLICALL NO. 658. I WAS UNAVOIDABLY DETAINED. HAD I BEEN PRESENT. I WOULD HAVE VOTED "YES."
LCALL NO. 658 MY FLIGHT WAS DELAYED DUE TO WEATHER AND HAD I BEEN PRESENT. I WOULD HAVE VOTED "YES."
ME BEFORE THE HOUSE. AND DONATED MY RAISE TO LOCAL NONPROFIT ORGANIZATIONS RATHER THAN ACCEPTING IT.


## Basic Counting with Tabular Data 

We will use two commands that we have seen before to count tabular data.

    .count() -- produces a count of how many non-missing items are in a column.  When nothing is missing, this is the same as counting the number of rows.
    .value_counts() -- produces the subtotals for every subcategory listed in a column. We have used this command previously to get the word counts for every word in a list.  We will use value_counts() to get word counts for every word in a column in pandas.
    
We will also use one new command to count how many unique objects there are in a category.

    .unique() -- finds only the unique members of a list
    


It's easier to understand the difference between these commands in practice.

**.count()** on its own gives you the number of rows in the dataframe as a whole.  For our data, that just means the total number of speeches. 

Even if .count() is applied to the column speaker, it's still measuring the total number of individual speeches -- not how many unique speakers there are.  Most speakers are responsible for more than one speech, so their name appears several times in the dataset.  The count() below counts all rows in the dataframe, regardless of how many speakers there are:

In [19]:
congress['speaker'].count()

5992068

**.value_counts()** organizes the data by unique values and then creates a count of each.  We will use it for word count, as we have in the past.  

Applied to the speaker column, value_counts() gives you a list of how many speeches each speaker gave.

In [20]:
congress['speaker'].value_counts()

The PRESIDING OFFICER                709041
The SPEAKER pro tempore              239201
The CHAIRMAN                         137788
The SPEAKER                           86866
Mr. ROBERT C. BYRD                    75733
                                      ...  
Mr. RTNALDO                               1
Mr. WAMIP                                 1
Ms. LORETTA SANCHEZ of Calitornia         1
MAJ. THELMA D. BURG                       1
ir. STENNIS                               1
Name: speaker, Length: 56350, dtype: int64

What if you need to know how many unique speakers are represented in this dataframe?  You will use the **.unique()** function.

In [21]:
congress['speaker'].unique()

array(['The VICE PRESIDENT', 'Mr. MANSFIELD', 'Mrs. AGNES BAGGETT', ...,
       'The DERA', 'Mr. BAYHI', 'Mr. BAY-H'], dtype=object)

In [22]:
len(congress['speaker'].unique())

56350
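
Here is a small sketch that puts the three commands side by side on a made-up Series of five speaker names, one of which is missing. It also uses **.nunique()**, a pandas shortcut for counting distinct values; note that .nunique() skips missing values by default, while .unique() lists a missing value as one of its members.

```python
import pandas as pd

# Five made-up rows; one speaker name is missing.
speakers = pd.Series(["Mr. DOLE", "Mr. BYRD", "Mr. DOLE", None, "Mr. DOLE"])

n_filled = speakers.count()             # how many non-missing entries
per_speaker = speakers.value_counts()   # how many rows each speaker has
n_distinct = speakers.nunique()         # how many distinct speakers
n_with_missing = len(speakers.unique()) # distinct values, missing included

print(n_filled, n_distinct, n_with_missing)
print(per_speaker)
```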

#### Counting particular words per cell.

Above we noted that .count(), applied to a column, will give you the number of rows in the column.

You can also use **.str.count()** to count how often a particular string appears in each row of a column. 

This line counts the string 'pineapple' for each speech in the 'speech' column.  

The resulting Series has zeros in most rows.  But if we use .nlargest() we get a short list of the index numbers of the speeches where pineapples are mentioned the most:

In [23]:
congress['speech'].str.count('pineapple').nlargest(5)

1000851    65
1084391    51
1017189    29
2164092    26
1023495    24
Name: speech, dtype: int64

Here's how to print the results, using .loc and .iloc to call the speech by its index number.

In [24]:
for word in list(congress.loc[:, ['speech']].iloc[1000851, ]):
    print(word[:1000])

Mr. President. I am introducing legislation today to enable Hawaiian pineapple products to compete in the U.S. market with lowcost foreign canned pineapple which can easily undersell Hawaiian pineapple. One of the finest products in all America is the sweet. juicy. delectable pineapple grown in Hawaii. Since the turn of the century. pineapple has been a mainstay in Hawaiis economy. Today it is still my States second largest agricultural industay. second only to sugar. The processed value of Hawaiian pineapple last year was $137 million. The industry employs 6.200 yearround workers who earned $42 million in annual wages and another 12.000 seasonal workers who earn a total of $10 million a year. Hawaiis pineapple industry has been very energetic and progressive. investing millions of its own dollars in research to improve pineapple quality and production. The Hawaiian pineapple industry is the most highly mechanized In the world and its fleldworkers are the highest paid in the world. The

Note that we have searched here just for the string 'pineapple,' which also matches inside longer words such as 'pineapples.' For most words, this method would inflate the counts unless we used regex to look for the exact word. 'Pineapple' is unusual enough to produce good results as a free-standing string.
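
To see the difference between a free-standing string and an exact word, here is a sketch on three made-up speeches. The second search wraps the word in the `\b` word-boundary symbol, so it no longer counts 'pineapples.'

```python
import pandas as pd

# Three made-up "speeches" for illustration.
speeches = pd.Series([
    "pineapple and pineapples",
    "no fruit here",
    "pineapple pineapple",
])

# Counts the string wherever it appears, even inside a longer word.
loose = speeches.str.count("pineapple")

# Counts only the exact word, using word boundaries.
exact = speeches.str.count(r"\bpineapple\b")

print(loose.tolist())
print(exact.tolist())
print(exact.nlargest(1))
```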

## Subsetting Data

We can use the python grammar of operators to ask Python to only look at certain parts of the data -- or 'subsets' of the complete dataset.

For instance, if we want *only* the data from the 1980s, we can use square brackets **[ ]** to tell python to subset a dataframe according to the constraints inside the brackets.

The command to subset data is expressed with the grammar:

    df[df['columnname'].LIMITINGOPERATOR]


For instance, df[df['speaker']=='bob'] would tell python to find only the rows of the dataframe where 'bob' was listed as the speaker.

Using square brackets to "filter" for particular rows is one of the major ways of navigating tabular data in pandas.


### The operators for filtering

The following 'operators' are the ones most frequently used to tell Python how to narrow down the data.  Each works a slightly different way: 

    .isin() -- tells Python to only look for values that are in another list
    == -- tells Python to only look for values that are equal to another value
    != -- tells Python to only look for values that are NOT equal to another value



In [25]:
congress[congress['speaker'] == 'Mr. DOLE'][:10]

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
7475,7475,"Mr. Speaker. the January 8. 1967. ""Doanes Agri...",1967-02-01,Mr. DOLE,201,1967,2,1967-02-01
8657,8657,Mr. Speaker. I ask unanimous consent to revise...,1967-02-02,Mr. DOLE,12,1967,2,1967-02-01
8659,8659,Mr. Speaker. I join in the statements made by ...,1967-02-02,Mr. DOLE,301,1967,2,1967-02-01
8767,8767,Mr. Speaker. it is my pleasure to join in the ...,1967-02-02,Mr. DOLE,878,1967,2,1967-02-01
12255,12255,Mr. Speaker. today I have introduced a joint r...,1967-02-09,Mr. DOLE,82,1967,2,1967-02-01
19034,19034,Mr. Speaker. it is my pleasure to join Mrs. BO...,1967-02-28,Mr. DOLE,258,1967,2,1967-02-01
20616,20616,Mr. Speaker. I wish to associate myself with t...,1967-03-02,Mr. DOLE,291,1967,3,1967-03-01
24507,24507,Mr. Speaker. during this year of 1967 the Fede...,1967-03-08,Mr. DOLE,122,1967,3,1967-03-01
25378,25378,Mr. Speaker. will the gentleman yield?,1967-03-09,Mr. DOLE,6,1967,3,1967-03-01
25380,25380,Mr. Speaker. permit me to say. first of all. t...,1967-03-09,Mr. DOLE,78,1967,3,1967-03-01
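
The three operators can be tried out on a tiny made-up table before running them on millions of rows. In the sketch below, `toy` and its values are invented for illustration. The last line combines two conditions with `&`, which is pandas' way of saying 'and'; each condition needs its own parentheses.

```python
import pandas as pd

toy = pd.DataFrame({
    "speaker": ["Mr. DOLE", "Mr. BYRD", "Mr. DOLE", "Mr. FOLEY"],
    "year": [1979, 1980, 1985, 1990],
})

only_dole = toy[toy["speaker"] == "Mr. DOLE"]             # equal to
not_dole = toy[toy["speaker"] != "Mr. DOLE"]              # NOT equal to
eighties = toy[toy["year"].isin(list(range(1980, 1990)))] # in a list

# Two conditions at once: Mr. DOLE, and only in the 1980s.
dole_eighties = toy[(toy["speaker"] == "Mr. DOLE")
                    & (toy["year"].isin(list(range(1980, 1990))))]

print(dole_eighties)
```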


#### Using .isin() to find data from the 1980s

In the following line of code, we'll use **.isin()** to tell Python to look for values in the 1980s.  We tell Python to look at the 'year' column. Then we select only the years that are in a list of years from the 1980s. 

    eighties_data = congress[congress['year'].isin(target_years)].copy()  # filter our dataset to just this decade

**.isin()**  takes as its object a list, for instance the *target_years* variable, which we will create to include every year from 1980 to 1989.


Before we apply .isin(), however, we need to format the data so that we can navigate for time.

First, we need to make a 'year' column.

Then we need to filter for years that are in our target.  Note the use of the .isin() function. 

In [26]:
import pandas as pd
import datetime

We use the pandas datetime accessor
    
    .dt.year

to create a new column called 'year'

In [27]:
congress['year']=pd.to_datetime(congress['date']).dt.year # make a year column

congress.head()

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
0,0,Those who do not enjoy the privilege of the fl...,1967-01-10,The VICE PRESIDENT,16,1967,1,1967-01-01
1,1,Mr. President. on the basis of an agreement re...,1967-01-10,Mr. MANSFIELD,35,1967,1,1967-01-01
2,2,The Members of the Senate have heard the remar...,1967-01-10,The VICE PRESIDENT,40,1967,1,1967-01-01
3,3,The Chair lays before the Senate the following...,1967-01-10,The VICE PRESIDENT,151,1967,1,1967-01-01
4,4,Secretary of State.,1967-01-10,Mrs. AGNES BAGGETT,3,1967,1,1967-01-01
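
The same approach works for other parts of a date. This is a sketch on three made-up dates: pd.to_datetime() converts the strings to dates, and then `.dt.year` and `.dt.month` pull out the pieces.

```python
import pandas as pd

toy = pd.DataFrame({"date": ["1967-01-10", "1980-01-03", "1989-12-31"]})

# Convert the text to real dates once, then reuse the result.
dates = pd.to_datetime(toy["date"])

toy["year"] = dates.dt.year     # make a year column
toy["month"] = dates.dt.month   # make a month column

print(toy)
```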


Using == to subset:

In [28]:
data1980 = congress[congress['year']== 1980].copy()  # filter our dataset to just this one year

data1980.head()

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
2329890,2329890,Mr. Speaker. we in Delaware are proud of the o...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01
2329891,2329891,Mr. Speaker. it is logical for Americans to be...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01
2329892,2329892,The Chair has examined the Journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01
2329893,2329893,Mr. Speaker. I ask unanimous consent that the ...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01
2329894,2329894,Is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01


Using .isin() to subset:

In [29]:
target_years = list(range(1980, 1989 + 1))  # List of the years 1980-1989

eighties_data = congress[congress['year'].isin(target_years)].copy().reset_index()  # filter our dataset to just this decade
eighties_data = eighties_data.drop(columns=['index', 'Unnamed: 0']) #minor reformatting - drop extra columns
eighties_data.head()

Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
0,Mr. Speaker. we in Delaware are proud of the o...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01
1,Mr. Speaker. it is logical for Americans to be...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01
2,The Chair has examined the Journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01
3,Mr. Speaker. I ask unanimous consent that the ...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01
4,Is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01


Let's save the results in case we want to use them again.

In [30]:
cd ~/digital-history

/users/jguldi/digital-history


In [31]:
data1980.to_csv("data1980.csv")
eighties_data.to_csv("eighties_data.csv")

## Cleaning tabular data

Next, we're going to break speeches into words and remove stopwords.  We get our stopwords list from the package NLTK (natural language toolkit):

Let's load stopwords as we have before

In [32]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop = stopwords.words('english')
stop[:10]

[nltk_data] Downloading package stopwords to
[nltk_data]     /users/jguldi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

We'll take a new and special preparation step here where we add some regex -- including the word boundary symbols you've seen before -- to make a list of stopwords that Python can search for with great ease.  Mainly you'll want to copy and paste the following line, rather than understanding it, but here are the components:

    r'': 'begin a raw string, so that the backslashes inside these quotation marks reach regex unchanged'
    \b: 'look for a word boundary'
    (?:{}): 'a group of alternatives; .format() fills the {} with the query that follows'
    '|'.join(stop): | means 'or', and .join() produces stopword1|stopword2|stopword3|etc. (where each stopword corresponds to 'i', 'me', 'my', etc.)
    
Basically we're just formatting the stopwords list so that Python can search for the whole series efficiently.

In [33]:
stopwords_regex = r'\b(?:{})\b'.format('|'.join(stop))
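
To see what this formatting does, it can help to build the same pattern from a very short list and try it on one sentence. The sketch below uses a made-up list, `toy_stop`, and Python's built-in `re` package rather than pandas. Notice that the word boundaries keep the 'i' inside 'committee' and the 'the' inside 'theories' from being removed.

```python
import re

# A made-up stopword list, much shorter than the NLTK one.
toy_stop = ["i", "me", "the", "of"]

toy_regex = r'\b(?:{})\b'.format('|'.join(toy_stop))
print(toy_regex)  # prints \b(?:i|me|the|of)\b

# Replace every whole-word match with nothing.
cleaned = re.sub(toy_regex, '', "i told the committee of theories")
print(cleaned.split())
```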

In [34]:
stopwords_regex

"\\b(?:i|me|my|myself|we|our|ours|ourselves|you|you're|you've|you'll|you'd|your|yours|yourself|yourselves|he|him|his|himself|she|she's|her|hers|herself|it|it's|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|that'll|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|s|t|can|will|just|don|don't|should|should've|now|d|ll|m|o|re|ve|y|ain|aren|aren't|couldn|couldn't|didn|didn't|doesn|doesn't|hadn|hadn't|hasn|hasn't|haven|haven't|isn|isn't|ma|mightn|mightn't|mustn|mustn't|needn|needn't|shan|shan't|shouldn|shouldn't|wasn|wasn't|weren|weren't|won|won't|wouldn|wouldn't)\\b"

To clean our text when our text is in tabular form, we can apply many commands that are familiar.  Technically, they are being applied over each row of the pandas dataframe.  But the pandas software makes it easier for us.

For each speech, we will perform some familiar tasks:

  * We will **.split()** the speech into words
  * We will use **.replace()** to get rid of punctuation
  * We will use **wn.morphy()** to get the lemma of each word


The only problem with tabular data is that we have to run splitting, clearing punctuation, stopwording, and other actions on entire **columns** of lists of data rather than just lists.

In theory, you might imagine writing a for-loop to deal with each cell one at a time.   However, that would take FOREVER.  

A more efficient approach is to work with the built-in pandas commands that operate on all the cells in an entire column at once.
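
The two approaches give the same answer; the difference is in how much typing and waiting they take. Here is a sketch on a made-up two-row table that lowercases and splits each speech both ways.

```python
import pandas as pd

toy = pd.DataFrame({
    "speech": ["Mr. Speaker. I rise today.", "Will the gentleman yield?"],
})

# Approach 1: visit each cell, one at a time, with a for-loop.
looped = []
for speech in toy["speech"]:
    looped.append(speech.lower().split())

# Approach 2: let pandas apply the same steps to the whole column.
column_wide = toy["speech"].str.lower().str.split()

print(looped)
print(column_wide.tolist())
```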



The pandas-native commands for working on columns in tabular data have familiar names:

    .str.replace()
    .str.lower()
    .str.split()
    
Let's see them in action.

Get rid of punctuation

In [35]:
eighties_data['speech'] = eighties_data['speech'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True: read the pattern as regex

Lowercase the text

In [36]:
eighties_data['speech'] = eighties_data['speech'].str.lower()

Eliminate stopwords using .replace() 

***This may take a minute*** -- notice the [*] in light gray to the left of the line of code. This means, 'the computer is thinking; please wait.' If your computer repeatedly crashes, you may need to allocate more memory when you next call up a session of JupyterLab.

In [37]:
eighties_data['stopworded'] = eighties_data['speech'].str.replace(stopwords_regex, '', regex=True)

Split each speech into a list of individual words

***This may take a minute***

In [38]:
eighties_data['words'] = eighties_data['stopworded'].str.split()

In [39]:
eighties_data.head()

Unnamed: 0,speech,date,speaker,word_count,year,month,month_year,stopworded,words
0,mr speaker we in delaware are proud of the out...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01,mr speaker delaware proud outstanding rec...,"[mr, speaker, delaware, proud, outstanding, re..."
1,mr speaker it is logical for americans to be u...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01,mr speaker logical americans upset hold...,"[mr, speaker, logical, americans, upset, holdi..."
2,the chair has examined the journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01,chair examined journal last days proceedi...,"[chair, examined, journal, last, days, proceed..."
3,mr speaker i ask unanimous consent that the co...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01,mr speaker ask unanimous consent committee ...,"[mr, speaker, ask, unanimous, consent, committ..."
4,is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01,objection request gentleman texas,"[objection, request, gentleman, texas]"


Note that with str.split() we have created a new kind of data in the 'words' column.  In 'speech' and 'stopworded,' each row holds one long string of text, to which we can apply commands such as .replace() and .lower().  In 'words,' each row holds a list of words. This is useful for counting -- which we'll do next -- but it makes using .replace() more difficult.  

***Bottom line***: when working with tabular data, use commands like .replace() before you .split() the strings of text into individual words. 

(NB: You can always use ' '.join(list) to weld those lists of words back together if you have to.)
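
Here is the whole sequence in miniature, as a sketch on one made-up speech: clean the string first, split it into a list second, and, if needed, weld the list back into a string with **.str.join()**, the column-wide version of ' '.join(list).

```python
import pandas as pd

toy = pd.DataFrame({"speech": ["Mr. Speaker. Will the gentleman yield?"]})

# String commands first: strip punctuation, then lowercase.
toy["clean"] = (toy["speech"]
                .str.replace(r"[^\w\s]", "", regex=True)
                .str.lower())

# Then split each string into a list of words.
toy["words"] = toy["clean"].str.split()

# Weld the lists back together into strings.
toy["rejoined"] = toy["words"].str.join(" ")

print(toy["words"][0])
print(toy["rejoined"][0])
```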

## Wordcount with Tabular Data

We can use many of the tools we already know to count words.

     value_counts()


#### You need lists of words in a column to count them.

An important observation: in the cleaning steps above, we created a column that stores the words as a different data type. 

Originally, our 'speech' column was just long strings of words.  After stopwording those strings, we used .split() to turn each speech into a list of individual words -- just like the lists we've been working on so far.

We could have glued the words back together into super-long strings again. But in fact, it's useful to keep the words in list form, because lists are easy to count.  

#### How many words in any speech?

How long is the first speech, in words (not including stopwords)?

In [48]:
len(eighties_data['words'][0])

73

What are the top words in the first speech (not including stopwords)?

In [51]:
pd.Series(eighties_data['words'][0]).value_counts()[:10]

shops          3
outstanding    3
wilmington     3
railroad       3
amtrak         2
mutual         2
achieved       2
service        2
pleasure       2
recognize      2
dtype: int64

Notice what happens when I set the parameter "normalize" for value_counts() as "True": it tells Python to report each word's share of the total (a proportion between 0 and 1) instead of a raw count.

In [50]:
pd.Series(eighties_data['words'][0]).value_counts(normalize=True)[:10]

shops          0.041096
outstanding    0.041096
wilmington     0.041096
railroad       0.041096
amtrak         0.027397
mutual         0.027397
achieved       0.027397
service        0.027397
pleasure       0.027397
recognize      0.027397
dtype: float64
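
The relationship between the raw counts and the normalized counts is easy to check by hand on a short list. In this sketch, four made-up words are counted both ways; 'railroad' appears twice out of four words, so its share is one half.

```python
import pandas as pd

words = pd.Series(["railroad", "shops", "railroad", "amtrak"])

counts = words.value_counts()                # raw counts
shares = words.value_counts(normalize=True)  # counts divided by the total

print(counts["railroad"])
print(shares["railroad"])
```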

#### Total word count for the dataset

Let's count the words for each speech in the dataset. We'll make a new column called 'wordcount.'

In [55]:
eighties_data['wordcount'] = eighties_data['words'].str.len()

In [57]:
eighties_data.head()

Unnamed: 0,speech,date,speaker,word_count,year,month,month_year,stopworded,words,wordcount
0,mr speaker we in delaware are proud of the out...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01,mr speaker delaware proud outstanding rec...,"[mr, speaker, delaware, proud, outstanding, re...",73
1,mr speaker it is logical for americans to be u...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01,mr speaker logical americans upset hold...,"[mr, speaker, logical, americans, upset, holdi...",39
2,the chair has examined the journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01,chair examined journal last days proceedi...,"[chair, examined, journal, last, days, proceed...",19
3,mr speaker i ask unanimous consent that the co...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01,mr speaker ask unanimous consent committee ...,"[mr, speaker, ask, unanimous, consent, committ...",19
4,is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01,objection request gentleman texas,"[objection, request, gentleman, texas]",4


Notice that we now have two wordcount columns -- 'word_count,' which was made before stopwording, and 'wordcount,' which we made after stopwording.

How many words are there in the dataframe as a whole? We can answer that question by adding up all the individual speech wordcounts using 

    .sum()

In [56]:
eighties_data['wordcount'].sum()

108044320
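
Here is the same pair of steps as a sketch on a made-up table with three rows, one of which is an empty list. When a column holds lists, .str.len() reports the number of items in each list, and .sum() adds those numbers up.

```python
import pandas as pd

toy = pd.DataFrame({
    "words": [["objection", "request"],
              ["chair", "examined", "journal"],
              []],
})

toy["wordcount"] = toy["words"].str.len()  # items in each list
total = toy["wordcount"].sum()             # grand total for the table

print(toy["wordcount"].tolist())
print(total)
```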

#### Get the longest speeches in the dataset

What are the longest speeches in the dataset?

In [59]:
longest_speeches = congress.nlargest(n=5, columns=['word_count']) # Get the top 5 longest speeches by word_count
longest_speeches 

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year
2118543,2118543,Mr. President. the House amendment to the Sena...,1978-10-05,Mr. DECONCINI,39517,1978,10,1978-10-01
2110541,2110541,Mr. Speaker. the amendment that I am offering ...,1978-09-28,Mr. EDWARDS of California,37188,1978,9,1978-09-01
1330561,1330561,00801. D. (6) $1.500. A. Arent. Fox. Kintner. ...,1974-11-25,as. V.I,35103,1974,11,1974-11-01
2571276,2571276,815 16th Street NW.. Washington. D.C. 20006. D...,1981-09-09,es. AFL-CIO,33507,1981,9,1981-09-01
1943615,1943615,Mr. Speaker. I would like compilation of my co...,1978-02-01,Mr. FOLEY,33013,1978,2,1978-02-01


#### The top words for the dataset

Let's count the top words overall.  

First, we need a list with all the words in the 'words' column in it. A simple for-loop can do that in a hurry.  Let's create a list called "words1980" from the content of each speech in the "words" column.

In [63]:
words1980 = []

for speech in eighties_data['words']:
    for word in speech:
        words1980.append(word)

words1980[:10]

topwords1980 = pd.Series(words1980).value_counts()[:20]

In [64]:
topwords1980

mr           1262448
would         922396
president     755471
bill          603331
amendment     543091
us            482365
senator       468393
time          465160
gentleman     437904
committee     422377
one           397414
speaker       378901
states        351176
new           336035
people        327221
years         324829
chairman      322768
senate        319095
house         312648
year          312510
dtype: int64

This line of code does the same counting in one step. The method **.explode()** gives each word in a list its own row, **.dropna()** discards the empty rows left by speeches with no words, and **.value_counts()** tallies the rows, most frequent first. Because we don't slice the result, it shows the counts for every distinct word rather than just the top 20.

In [75]:
eighties_data["words"].explode().dropna().value_counts()

mr                1262448
would              922396
president          755471
bill               603331
amendment          543091
                   ...   
pomeriggio              1
gagnost                 1
19c                     1
acceleratingi           1
countryfalling          1
Name: words, Length: 815067, dtype: int64
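
To check that the for-loop and **.explode()** really agree, here is a small sketch on a made-up 'words' column. The table `toy` is hypothetical; only the two counting approaches come from the notebook.

```python
import pandas as pd

# Hypothetical toy table: each row holds a list of words, like eighties_data['words']
toy = pd.DataFrame({"words": [["mr", "speaker", "bill"], ["mr", "bill"], ["mr"]]})

# Approach 1: nested for-loop that collects every word into one long list
all_words = []
for speech in toy["words"]:
    for word in speech:
        all_words.append(word)
loop_counts = pd.Series(all_words).value_counts()

# Approach 2: explode the lists into one word per row, then count
explode_counts = toy["words"].explode().dropna().value_counts()

print(loop_counts)
print(explode_counts)
```

On a large table the **.explode()** version is usually the more convenient of the two, because it avoids building an intermediate Python list by hand.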

Great -- but some of those words are still pretty hollow, despite having already been stopworded!  

Let's use our top words from the decade to create a new stopword list, format the list, and apply it to eighties_data.

In [80]:
maybestopwords = list(pd.Series(words1980).value_counts()[:200].index)  # the 200 most frequent words
maybestopwords[:50]

['mr',
 'would',
 'president',
 'bill',
 'amendment',
 'us',
 'senator',
 'time',
 'gentleman',
 'committee',
 'one',
 'speaker',
 'states',
 'new',
 'people',
 'years',
 'chairman',
 'senate',
 'house',
 'year',
 'congress',
 'federal',
 'program',
 'think',
 'many',
 'state',
 'united',
 'legislation',
 'support',
 'also',
 'act',
 'government',
 'may',
 'yield',
 'today',
 'budget',
 'national',
 'american',
 'percent',
 'make',
 'first',
 'country',
 'ask',
 'million',
 'could',
 'like',
 'going',
 'colleagues',
 'must',
 'resolution']

Ideally, we would edit this list by hand. But I'm just going to use the top 200 words as stopwords for now.

In [82]:
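stopwords_regex2 = r'\b(?:{})\b'.format('|'.join(maybestopwords))
# regex=True tells pandas to treat the pattern as a regular expression rather than literal text
eighties_data['stopworded'] = eighties_data['stopworded'].str.replace(stopwords_regex2, '', regex=True)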

In [84]:
eighties_data.head()

Unnamed: 0,speech,date,speaker,word_count,year,month,month_year,stopworded,words,wordcount
0,mr speaker we in delaware are proud of the out...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01,delaware proud outstanding achieved ...,"[mr, speaker, delaware, proud, outstanding, re...",73
1,mr speaker it is logical for americans to be u...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01,logical upset holding hostages ir...,"[mr, speaker, logical, americans, upset, holdi...",39
2,the chair has examined the journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01,chair examined journal days proceedings ...,"[chair, examined, journal, last, days, proceed...",19
3,mr speaker i ask unanimous consent that the co...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01,banking finance urban affairs disc...,"[mr, speaker, ask, unanimous, consent, committ...",19
4,is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01,texas,"[objection, request, gentleman, texas]",4
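
Here is a minimal sketch of the same stopwording step on a made-up table, so you can see each piece at work. The names `toy`, `extra_stopwords`, and `cleaned` are hypothetical. The sketch also adds two things the cell above does not do: it wraps each word in `re.escape()` in case a stopword contains punctuation, and it tidies up the extra spaces left behind.

```python
import re
import pandas as pd

# Hypothetical toy table and stopword list, invented for illustration
toy = pd.DataFrame({"stopworded": ["mr speaker delaware proud",
                                   "senator committee human rights"]})
extra_stopwords = ["mr", "speaker", "bill", "senator", "committee"]

# \b marks a word boundary, so 'bill' will not match inside 'billion'
pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in extra_stopwords))

cleaned = (
    toy["stopworded"]
    .str.replace(pattern, "", regex=True)   # delete the stopwords
    .str.replace(r"\s+", " ", regex=True)   # collapse runs of spaces into one
    .str.strip()                            # trim spaces at the ends
)

print(list(cleaned))
```

Tidying the whitespace is optional here, since **.str.split()** ignores extra spaces anyway, but it makes the column easier to read.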


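Now, count again. The 'stopworded' column holds each speech as a single string rather than a list of words, so we split each string into words with **.str.split()** before exploding and counting.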

In [83]:
topwords1980 = eighties_data["stopworded"].str.split().explode().dropna().value_counts()[:20]
topwords1980

within            77036
children          76906
agreement         76869
upon              75960
case              75881
motion            75698
international     75691
ordered           75636
office            73858
human             73439
already           73174
local             73046
passed            72516
special           72003
appropriations    71823
still             71683
benefits          71621
present           71580
political         71314
result            71215
Name: stopworded, dtype: int64

## Get the top speeches that mention the word "democracy" 

Next, let's count the number of times that the word 'democracy' appears in each speech from 1980.  

Notice the use of .str.count():

In [86]:
data1980['democracy_count'] = data1980['speech'].str.count('democracy')  # Create a new column for the count of the word democracy
display(data1980)

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year,democracy_count
2329890,2329890,Mr. Speaker. we in Delaware are proud of the o...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01,0
2329891,2329891,Mr. Speaker. it is logical for Americans to be...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01,0
2329892,2329892,The Chair has examined the Journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01,0
2329893,2329893,Mr. Speaker. I ask unanimous consent that the ...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01,0
2329894,2329894,Is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01,0
...,...,...,...,...,...,...,...,...,...
2497053,2497053,Mr. Speaker. as the 96th Congress draws to a c...,1980-12-16,Mr. DIXON,207,1980,12,1980-12-01,0
2497054,2497054,Mr. Speaker. I rise with great pride and pleas...,1980-12-16,Mr. FRENZEL,243,1980,12,1980-12-01,0
2497055,2497055,Mr. Speaker. in my friend. Congressman EDWARD ...,1980-12-16,Mr. McDADE,113,1980,12,1980-12-01,0
2497056,2497056,Mr. Speaker. although JOHN RHODES is voluntari...,1980-12-16,Mr. CAMPBELL,99,1980,12,1980-12-01,0
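
One thing to keep in mind: **.str.count()** is case-sensitive, and the 'speech' column in this table still has its original capitalization. The sketch below, on a made-up table called `toy`, shows the difference between the plain count and a count that ignores case. Whether to ignore case is a research choice, and the variable names here are hypothetical.

```python
import re
import pandas as pd

# Hypothetical toy speeches, invented for illustration
toy = pd.DataFrame({"speech": [
    "Democracy matters. Our democracy is strong.",
    "No mention here.",
    "Social democracy and democracy abroad.",
]})

# Plain count: matches lowercase 'democracy' only
exact = toy["speech"].str.count("democracy")

# Whole-word count that ignores capitalization
any_case = toy["speech"].str.count(r"\bdemocracy\b", flags=re.IGNORECASE)

print(list(exact))
print(list(any_case))
```

The first speech gets a count of 1 under the plain match and 2 once capitalization is ignored, because "Democracy" opens a sentence.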


Get the top 20 speeches that mentioned democracy.

In [88]:
democracyspeeches = data1980.nlargest(20, ['democracy_count'])
democracyspeeches

Unnamed: 0.1,Unnamed: 0,speech,date,speaker,word_count,year,month,month_year,democracy_count
2464930,2464930,Mr. President. I am disturbed by the continuin...,1980-09-30,Mr. HELMS,3079,1980,9,1980-09-01,6
2359798,2359798,Mr. Speaker. yesterday it was my honor to be a...,1980-03-28,Mr. ONEILL,2616,1980,3,1980-03-01,5
2430781,2430781,Treaty is in the U.S. interest because it is a...,1980-07-31,The SALT,7327,1980,7,1980-07-01,5
2450384,2450384,"Mr. Speaker. as I stated in my September 3 ""De...",1980-09-10,Mr. HARKIN,1585,1980,9,1980-09-01,5
2455704,2455704,Mr. President. this past Tuesday. the leading ...,1980-09-18,Mr. TSONGAS,1314,1980,9,1980-09-01,5
2483171,2483171,Mr. Speaker. next week Congressman HENRY S. RE...,1980-12-03,Ms. OAKAR,507,1980,12,1980-12-01,5
2330510,2330510,Mr. Speaker. George Meany spent a lifetime fig...,1980-01-22,Mr. OBERSTAR,477,1980,1,1980-01-01,4
2344758,2344758,Mr. Chairman. I oppose this legislation as it ...,1980-02-22,Mr. LIVINGSTON,560,1980,2,1980-02-01,4
2346604,2346604,Mr. President. the amendment I offer at this t...,1980-02-27,Mr. BOREN,3177,1980,2,1980-02-01,4
2358485,2358485,Mr. Speaker. Al Lowensteins involvement in the...,1980-03-26,Mr. McCLOSKEY,2424,1980,3,1980-03-01,4
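
Once we have the top rows, we usually want to read the speech itself. The sketch below is one possible way to pull the text out of the top row and show only its opening words. The table `toy` is made up, and the cutoff of two words is only for illustration.

```python
import pandas as pd

# Hypothetical toy table, invented for illustration
toy = pd.DataFrame({
    "speaker": ["A", "B"],
    "speech": ["one two three four five six", "democracy democracy now"],
    "democracy_count": [0, 2],
})

top = toy.nlargest(1, "democracy_count")     # the row with the highest count
text = top["speech"].iloc[0]                 # the full text of that speech, as a string
first_words = " ".join(text.split()[:2])     # keep only the first two words

print(first_words)
```

Changing the number inside the square brackets changes how many words are shown.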


## Assignment

1) Print out the first 500 words of the speech that mentions your favorite animal the highest number of times.  


2) Choose one of the top 100 speakers from congress['speaker'].value_counts().
   * Find the longest speech by that speaker. 
   * Show the first and last 500 words.  


3) Limit the dataframe to one year of your choosing in the 1980s.  
   * Find the longest speech by your speaker. 
   * Show the first 500 words.  


4) Call up a list of your speaker's top hundred words.  
   * Choose one word that you judge to be meaningful -- and possibly distinctive of that speaker (consult the list of overall top words for comparison).  
   * Find the speech where your speaker mentions that word the greatest number of times. 
   * Show the first 500 words.  


5) Find the longest speech where your speaker mentions the word of your choosing.  
   * The word should be mentioned at least three times. 
   * Show the first 500 words.  

For each part of the assignment, take a screenshot of the code and the results and upload it. 