# Hist 3368 - Week 5 - Working With Tabular Data in Pandas

***by Jo Guldi***

*Please note: This notebook requires at least **30GB** of memory. You may have to start a new HPC session.*

Until now, in this class we have worked with lists of words. We have cleaned them and counted and compared them.

For the rest of the class, we will be working with data in tables. Tables let us keep track of information about each speech, such as its date and its speaker. With date information, we can compare word counts over time; with speaker information, we can compare word counts for different speakers, and so on.

We will need a few special commands to navigate tabular data.

In this notebook, we will learn to navigate tables:

   * how to call a column
   * how to move through a column, row by row, using a for loop
   * how to subset or 'filter' data by a column, for example, finding all the speeches of one speaker:
       * how to filter using square brackets -- **[ ]** 
       * the use of the operators **.isin()**, **==**, and **!=**.
   * how to find the largest counts in a dataset using **.nlargest()**

We will clean tabular data, with strategies we've seen before:
   * stripping punctuation
   * stopwording
   * lemmatizing
   * splitting into words (i.e. tokenization)

We will also learn some basics of counting with tables:

   * how to count the words in a subset of data.

#### Learning Research Strategies

We will practice navigating around the tabular data for Congress, asking the kind of questions a researcher might want to know, such as:

   * given a set of years, who were the top speakers in Congress?
   * given a speaker, what was his or her longest speech?
   * given a certain set of words, who were the speakers who used those words the most?
   
The research questions profiled here are fairly simple, but if combined with strategies such as a *controlled vocabulary* they can result in a good deal of important information about which speakers were engaged with a particular topic -- for instance, the environment, crime, or women's health.  

These research strategies can also help the researcher to navigate to the longest speeches where a speaker invokes those topics, or the speeches where the speaker invokes the highest number of words related to a particular topic.  Those research strategies should form the basis for guided reading.


## Load some data

In [1]:
import pandas as pd
import csv

In [2]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


***This might take a minute. Loading takes time -- please be patient.***

In [3]:
congress = pd.read_csv("congress1967-2010.csv")

In [4]:
congress.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
0,0,0,Those who do not enjoy the privilege of the fl...,1967-01-10,The VICE PRESIDENT,16,1967,1,1967-01-01
1,1,1,Mr. President. on the basis of an agreement re...,1967-01-10,Mr. MANSFIELD,35,1967,1,1967-01-01
2,2,2,The Members of the Senate have heard the remar...,1967-01-10,The VICE PRESIDENT,40,1967,1,1967-01-01
3,3,3,The Chair lays before the Senate the following...,1967-01-10,The VICE PRESIDENT,151,1967,1,1967-01-01
4,4,4,Secretary of State.,1967-01-10,Mrs. AGNES BAGGETT,3,1967,1,1967-01-01


The data you are looking at is 'tabular' -- meaning that it's in a table.  

The format used by the pandas software package, which is running our table, is called a "dataframe."  A dataframe is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).  "Heterogeneous" means that the dataframe can have some columns that hold strings, and other columns that hold numbers or dates.

#### Basic Navigation

We have met pandas data with an index before when we met the pandas Series.  A Series is a one-dimensional labeled array -- meaning that it has only one column, not many.  However, everything that we learned about navigating indices will apply to dataframes too.

In [5]:
congress.index[0]

0

In [6]:
congress.index[1000]

1000

We can call the pandas data with the **.loc** and **.iloc** indexers.  The formula for calling data is:

    dataFrame.loc[<ROWS RANGE> , <COLUMNS RANGE>] -- for calling rows or columns by name
    dataFrame.iloc[<ROWS RANGE> , <COLUMNS RANGE>] -- for calling rows or columns by number

Here are rows #1005-1007. Note that an .iloc slice stops just before its end position, so 1005:1008 returns three rows:

In [7]:
congress.iloc[1005:1008, ]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
1005,1005,1005,Mr. President. for many years I have advocated...,1967-01-11,Mr. WILLIAMS of Delaware,184,1967,1,1967-01-01
1006,1006,1006,I am delighted to have the Senator from Delawa...,1967-01-11,Mr. DIRKSEN,27,1967,1,1967-01-01
1007,1007,1007,Mr. President. I submit a resolution to amend ...,1967-01-11,Mr. CANNON,449,1967,1,1967-01-01
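
The difference between calling rows by name and by number is easiest to see on a small table. The sketch below uses a hypothetical four-row dataframe (the names `toy`, `by_position`, and `by_label` are invented for illustration) whose row labels deliberately differ from the row positions.

```python
import pandas as pd

# Hypothetical toy table standing in for the congress data
toy = pd.DataFrame(
    {"speaker": ["A", "B", "C", "D"], "word_count": [16, 35, 40, 151]},
    index=[10, 11, 12, 13],  # row labels deliberately differ from row positions
)

by_position = toy.iloc[1:3, :]  # positions 1 and 2; the end position (3) is excluded
by_label = toy.loc[11:12, :]    # labels 11 and 12; the end label (12) is included

print(by_position)
print(by_label)
```

Both calls return the same two rows. In the congress data the row labels happen to match the row positions, which is why .loc and .iloc can seem interchangeable there.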


Here is the speaker column. Notice the use of ':' for 'everything':

In [8]:
congress.loc[:, 'speaker']

0                  The VICE PRESIDENT
1                       Mr. MANSFIELD
2                  The VICE PRESIDENT
3                  The VICE PRESIDENT
4                  Mrs. AGNES BAGGETT
                      ...            
5992063                   Ms. GRANGER
5992064    Ms. KILPATRICK of Michigan
5992065                    Mr. HELLER
5992066                   Mr. PAULSEN
5992067          Mr. HALL of New York
Name: speaker, Length: 5992068, dtype: object

We can also call columns by name using just square brackets.

In [9]:
congress['speaker']

0                  The VICE PRESIDENT
1                       Mr. MANSFIELD
2                  The VICE PRESIDENT
3                  The VICE PRESIDENT
4                  Mrs. AGNES BAGGETT
                      ...            
5992063                   Ms. GRANGER
5992064    Ms. KILPATRICK of Michigan
5992065                    Mr. HELLER
5992066                   Mr. PAULSEN
5992067          Mr. HALL of New York
Name: speaker, Length: 5992068, dtype: object

Notice that I can also call the column with double brackets.

  * The difference between the two methods of calling the column is that above, single brackets call the column as a pandas Series.  
  * Double brackets call the column as a pandas dataframe -- such that the column is labeled with its name.  
     * The chief difference between a dataframe and a Series is that with a dataframe you can add extra columns later if you want to.

In [10]:
congress[['speaker']]

Unnamed: 0,speaker
0,The VICE PRESIDENT
1,Mr. MANSFIELD
2,The VICE PRESIDENT
3,The VICE PRESIDENT
4,Mrs. AGNES BAGGETT
...,...
5992063,Ms. GRANGER
5992064,Ms. KILPATRICK of Michigan
5992065,Mr. HELLER
5992066,Mr. PAULSEN
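
As a quick check of the difference between single and double brackets, the following sketch asks Python for the type of each result. The two-row table `toy` is a hypothetical stand-in for the congress data.

```python
import pandas as pd

# Hypothetical two-row table
toy = pd.DataFrame({"speaker": ["Mr. A", "Ms. B"], "word_count": [16, 35]})

as_series = toy["speaker"]    # single brackets -> a pandas Series
as_frame = toy[["speaker"]]   # double brackets -> a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)  # (number of rows, number of columns)
```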


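You can also see how many rows there are. (Strictly speaking, .count() counts the non-missing entries in the column, which here is the same as the number of rows.)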

In [11]:
congress['speaker'].count()

5992068

We can call data from the dataframe by column name and by row number.  

Here is just the speaker and speech for row 3234:

In [12]:
congress.loc[:, ['speaker', 'speech']].iloc[3234, :]

speaker                                            Mr. TOWER
speech     Mr. President. on June 17. a starting gun will...
Name: 3234, dtype: object

Here is just the speech:

In [13]:
myspeech = congress.loc[:, ['speech']].iloc[3234, :]
myspeech

speech    Mr. President. on June 17. a starting gun will...
Name: 3234, dtype: object

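We can use some familiar tools to print out the whole speech or any portion thereof. Here are the first 1,000 characters:

In [14]:
for word in myspeech:
    print(word[:1000])  # slice the string, not the Series, to limit the characters printed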

Mr. President. on June 17. a starting gun will sound in San Marcos. Tex.. and the worlds toughest river race will be underway. The race is the Texas water safari. marking its fifth year in 1967 with a 538mile race from San Marcos. by way of the San Marcos and Guadalupe Rivers. along coastal bays and rivers. utilizing the Intracoastal Canal. to Freeport. Brave men from all over the countryand several entrants from foreign countries--will test their endurance. skill. equipment. plain physical stamina. and even luck as they brave logjams. rocks. white water. strong winds. and exhausting portages. on a journey through some of the most beautiful country in Texas. I am submitting today a concurrent resolution granting official recognition to the event. The race Is being sponsored by a nonprofit organization expressly set up for this purpose. Prizes approaching $6.500 in value are being donated. along with several fine trophies. I believe this outstanding sports event. emphasizing courage. sk

### Navigating tabular data: column by column, rows within columns

In the current dataset, each row holds one 'speech' in Congress, stored as a single string of many words.  

You can call the column 'speech' with square brackets, e.g.

    congress['speech']

Many speeches form a column called 'speech.'  The column can be called and treated much like a list.

You can call individual speeches with an additional set of square brackets after ['speech'], e.g. 

    congress['speech'][0]
    
-- which calls the first speech in the speech column.

In [15]:
congress['speech'][0]

'Those who do not enjoy the privilege of the floor will please retire from the Chamber.'

In [16]:
congress['speech'][1]

'Mr. President. on the basis of an agreement reached on both sides. it is suggested that the Chamber be cleared of all attaches. unless they have absolutely important business to attend to in the Chamber.'

In [17]:
congress['speech'][2]

'The Members of the Senate have heard the remarks of the distinguished majority leader. All attaches and staff members who are not vitally needed for the next few minutes of the deliberations of the Senate will tetire from the Chamber.'

We can work on the text -- for instance cleaning or counting -- by calling each row in a text column, one at a time, and executing a transformation, via a for-loop.

Here are the last hundred characters of the last five speeches in the dataframe, in upper case:

In [18]:
for speech in congress['speech'][-5:]:
    speech = speech.upper()
    print('***')
    print('here are the last 100 characters of a speech:')
    print(speech[-100:])

***
here are the last 100 characters of a speech:
ROLLCALL NOS. 662 AND 661. I WAS ABSENT FROM THE HOUSE. HAD I BEEN PRESENT. I WOULD HAVE VOTED "NO."
***
here are the last 100 characters of a speech:
UL TO PROTECTING THE CONSTITUTION OF THE UNITED STATES AND THE GOALS OF OUR GREAT NATION. GOD BLESS.
***
here are the last 100 characters of a speech:
AKER. ON ROLICALL NO. 658. I WAS UNAVOIDABLY DETAINED. HAD I BEEN PRESENT. I WOULD HAVE VOTED "YES."
***
here are the last 100 characters of a speech:
LCALL NO. 658 MY FLIGHT WAS DELAYED DUE TO WEATHER AND HAD I BEEN PRESENT. I WOULD HAVE VOTED "YES."
***
here are the last 100 characters of a speech:
ME BEFORE THE HOUSE. AND DONATED MY RAISE TO LOCAL NONPROFIT ORGANIZATIONS RATHER THAN ACCEPTING IT.


## Basic Counting with Tabular Data 

We will use two commands that we have seen before to count tabular data.

    .count() -- produces a count of how many items are in a category.  Generally speaking this is the same as counting the number of rows.
    .value_counts() -- produces the subtotals for every subcategory listed in a column. We have used this command previously to get the word counts for every word in a list.  We will use value_counts() to get word counts for every word in a column in pandas.
    
We will also use one new command to count how many unique objects there are in a category.

    .unique() -- finds only the unique members of a list
    


It's easier to understand the difference between these commands in practice.

**.count()** on its own gives you the number of rows in the dataframe as a whole.  For our data, that just means the total number of speeches. 

Even if .count() is applied to the column speaker, it's still measuring the total number of individual speeches -- not how many unique speakers there are.  Most speakers are responsible for more than one speech, so their name appears several times in the dataset.  The count() below counts all rows in the dataframe, regardless of how many speakers there are:

In [19]:
congress['speaker'].count()

5992068

**.value_counts()** organizes the data by unique values and then creates a count of each.  We will use it for word count, as we have in the past.  

Applied to the speaker column, value_counts() gives you a list of how many speeches each speaker gave.

In [20]:
congress['speaker'].value_counts()

The PRESIDING OFFICER      709041
The SPEAKER pro tempore    239201
The CHAIRMAN               137788
The SPEAKER                 86866
Mr. ROBERT C. BYRD          75733
                            ...  
Fr. JOHN MCDONNELL              1
Mr. oQTINGER                    1
Mr. MOYIHAN                     1
Mr. HOI.JNGS                    1
Miss BEULA EDMISTON             1
Name: speaker, Length: 56350, dtype: int64

What if you want the names of every speaker?  

In [21]:
congress['speaker'][:10]

0       The VICE PRESIDENT
1            Mr. MANSFIELD
2       The VICE PRESIDENT
3       The VICE PRESIDENT
4       Mrs. AGNES BAGGETT
5       The VICE PRESIDENT
6    The legislative clerk
7            Mr. MANSFIELD
8       The VICE PRESIDENT
9    Mr. LONG of Louisiana
Name: speaker, dtype: object

There's a lot of repetition in the 'speaker' column.  If you want to list the names of every speaker only ONCE, you need the "unique" values.

To get the unique values in the speaker column you will use the **.unique()** function.

In [22]:
congress['speaker'].unique()[:10]

array(['The VICE PRESIDENT', 'Mr. MANSFIELD', 'Mrs. AGNES BAGGETT',
       'The legislative clerk', 'Mr. LONG of Louisiana',
       'Mr. ROBERT A. BRENKWORTH', 'Mr. PASTORE', 'Mr. RUSSELL',
       'Mr. KUCHEL', 'Mr. CLARK'], dtype=object)

What if you need to know how many unique speakers are represented in this dataframe? You can use **len()** to give you the length -- that is the number of items -- in any list.  The length of a unique set is the number of unique answers.

In [23]:
len(congress['speaker'].unique())

56350
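
To see the three counting commands side by side, here is a minimal sketch on a hypothetical five-row speaker column (the names are invented). It also uses **.nunique()**, a pandas shortcut that was not introduced above; it returns the number of distinct non-missing values directly, which matches len() of .unique() when the column has no missing values.

```python
import pandas as pd

# Hypothetical speaker column: three distinct speakers, five speeches
speakers = pd.Series(["Mr. A", "Ms. B", "Mr. A", "Mr. A", "Ms. C"], name="speaker")

n_rows = speakers.count()              # number of non-missing entries
per_speaker = speakers.value_counts()  # how many speeches each speaker gave
names = speakers.unique()              # each name listed only once
n_unique = speakers.nunique()          # how many distinct names there are

print(n_rows)
print(per_speaker)
print(names)
print(n_unique)
```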

#### Counting particular words per cell.

Above we noted that .count(), applied to a column, will give you the number of rows in the column.

You can also use **.str.count()** to find the number of times any individual string occurs in each cell of a column. 

Here's how many times each of the first ten speeches includes the string 'the.' Note that this also counts 'the' when it appears inside a longer word, such as 'there.'

In [24]:
congress['speech'][:10].str.count('the')

0     3
1     4
2     7
3    18
4     0
5     5
6     9
7     1
8     1
9     1
Name: speech, dtype: int64

Here's how many times the first ten speeches include the word 'pineapple.'

In [95]:
congress['speech'][:10].str.count('pineapple')

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: speech, dtype: int64

The result of our search for 'pineapple' says that none of the first ten speeches mentions pineapples. Most speeches in Congress talk about pineapples zero times.  

***Note that we have searched here just for the string 'pineapple.' Searching for a bare string can create confusion with shorter or more common words, because the string also matches inside longer words; in those cases we would use regex to look for an exact word. 'Pineapple' is unusual enough to produce good results as a free-standing string. We will not go into using regex to improve searches here, because we have previously covered this material in another notebook.***
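
As a brief reminder of why exact-word searches matter, the sketch below compares a bare string search with a word-boundary search on two hypothetical sentences. It assumes only that .str.count() treats its argument as a regular expression, which is how pandas documents it.

```python
import pandas as pd

# Two hypothetical sentences in which 'the' hides inside longer words
speeches = pd.Series([
    "the other day there was a vote",
    "they gathered at the theater",
])

substring_counts = speeches.str.count("the")        # matches 'the' anywhere, even inside 'other'
whole_word_counts = speeches.str.count(r"\bthe\b")  # matches only the free-standing word 'the'

print(list(substring_counts))   # [3, 4]
print(list(whole_word_counts))  # [1, 1]
```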

What if we only want the speeches that talked about pineapples the most?

If we use the function **.nlargest()**, pandas will return the rows with the highest values of the count we have just made.  

The table below lists the row numbers of the speeches where pineapples are mentioned the most:

In [26]:
pineapplespeeches = congress['speech'].str.count('pineapple').nlargest(5)
pineapplespeeches

1000851    65
1084391    51
1017189    29
2164092    26
1023495    24
Name: speech, dtype: int64

Note that the list of speeches above is stripped down to just a **row number** and a **count**.  When we use **nlargest** on a single column, the result keeps only the count and the row number.  Fortunately, we can use **.iloc** to navigate from the row number to all the relevant information.

Here's how to print the results, using .loc and .iloc to call the speech by its index number.

In [27]:
for word in list(congress.loc[:, ['speech']].iloc[1000851, ]):
    print(word[:1000])

Mr. President. I am introducing legislation today to enable Hawaiian pineapple products to compete in the U.S. market with lowcost foreign canned pineapple which can easily undersell Hawaiian pineapple. One of the finest products in all America is the sweet. juicy. delectable pineapple grown in Hawaii. Since the turn of the century. pineapple has been a mainstay in Hawaiis economy. Today it is still my States second largest agricultural industay. second only to sugar. The processed value of Hawaiian pineapple last year was $137 million. The industry employs 6.200 yearround workers who earned $42 million in annual wages and another 12.000 seasonal workers who earn a total of $10 million a year. Hawaiis pineapple industry has been very energetic and progressive. investing millions of its own dollars in research to improve pineapple quality and production. The Hawaiian pineapple industry is the most highly mechanized In the world and its fleldworkers are the highest paid in the world. The

Here's how to call the speaker:

In [28]:
congress.loc[:, ['speaker']].iloc[1000851, ]

speaker    Mr. FONG
Name: 1000851, dtype: object

Here's how to call the date:

In [29]:
congress.loc[:, ['date']].iloc[1000851, ]

date    1973-02-01
Name: 1000851, dtype: object
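
Instead of looking up the speaker and the date one at a time, we could also hand the whole index of top speeches to .loc in a single call. The following is a minimal sketch on a hypothetical four-row table; the column name `mentions` is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the congress table
toy = pd.DataFrame({
    "speaker": ["Mr. A", "Ms. B", "Mr. C", "Ms. D"],
    "date": ["1973-02-01", "1973-03-05", "1974-01-10", "1975-06-30"],
    "speech": [
        "pineapple pineapple pineapple",
        "sugar and pineapple",
        "no fruit here",
        "pineapple growers and pineapple canners",
    ],
})

# The two speeches with the most mentions: a Series of counts, indexed by row label
top = toy["speech"].str.count("pineapple").nlargest(2)

# Use those row labels to pull the matching speaker and date, then attach the counts
details = toy.loc[top.index, ["speaker", "date"]].assign(mentions=top)
print(details)
```

Because the counts and the table share the same row labels, pandas lines them up automatically when the new column is attached.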

Here's how to call up a series of speeches from the list above, *pineapplespeeches*:

In [30]:
for speechnumber in pineapplespeeches.index[-3:]:
    # .loc with a row label and a column name returns a single value (here, a string)
    speech = congress.loc[speechnumber, 'speech']
    speaker = congress.loc[speechnumber, 'speaker']
    date = congress.loc[speechnumber, 'date']
    print('***')
    print('here is a speech about pineapples by ' + speaker + ':')
    print(speech[:1000])
    print(date)

***
here is a speech about pineapples by Mrs. MINK:
Mr. Speaker. the Hawaii pineapple industry as we know it is on the verge of extinction. Twenty years ago Hawaii had nine pineapple companies. but this number has dwindled at an accelerating pace. Today there are only four left. The demise of another within a year has already been announced. Most recently. one of the remaining three said it will discontinue its operation on the Island of Molokai by 1975 or 1976. Another of the three also announced it will terminate its activities on Molokai. leaving the island virtually without any industry of any kind. The handwriting is on the wall. and we must anticipate that in a very short time there will be no pineapple canning industry left. The only thing remaining will be fresh pineapple. which will be grown on a limited basis. In 1950. Hawaii had 72 percent of world pineapple production. Now we have less than half that figure. Simply stated. Hawaiis pin

## Subsetting Data

We can use the Python grammar of operators to ask Python to look only at certain parts of the data -- or 'subsets' of the complete dataset.

For instance, if we want *only* the data from the 1980s, we can use square brackets **[ ]** to tell Python to subset a dataframe according to the constraints inside the brackets.

The command to subset data is expressed with the grammar:

    df[df['columnname'].LIMITINGOPERATOR]


For instance, df[df['speaker'] == 'bob'] would tell Python to find only the rows of the dataframe where 'bob' was listed as the speaker.

Using square brackets to "filter" for particular rows is one of the major ways of navigating tabular data in pandas.


### The operators for filtering

The following 'operators' are the ones most frequently used to tell Python how to narrow down the data.  Each works a slightly different way: 

    .isin() -- tells Python to only look for values that are in another list
    == -- tells Python to only look for values that are equal to another value
    != -- tells Python to only look for values that are NOT equal to another value
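
The three operators can be compared on a small table. This is a minimal sketch with a hypothetical four-row dataframe; it also prints the True/False "mask" that the comparison produces, since it is this mask that the outer square brackets use to pick rows.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "speaker": ["Mr. A", "Ms. B", "Mr. A", "Ms. C"],
    "year": [1979, 1980, 1981, 1990],
})

mask = toy["speaker"] == "Mr. A"   # a Series of True/False values, one per row
print(list(mask))                  # [True, False, True, False]

only_a = toy[toy["speaker"] == "Mr. A"]                 # rows where the speaker is Mr. A
not_a = toy[toy["speaker"] != "Mr. A"]                  # rows where the speaker is anyone else
a_or_c = toy[toy["speaker"].isin(["Mr. A", "Ms. C"])]   # rows where the speaker is in the list

print(only_a)
print(not_a)
print(a_or_c)
```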



In [31]:
congress[congress['speaker'] == 'Mr. DOLE'][:10]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
7475,7475,7475,"Mr. Speaker. the January 8. 1967. ""Doanes Agri...",1967-02-01,Mr. DOLE,201,1967,2,1967-02-01
8657,8657,8657,Mr. Speaker. I ask unanimous consent to revise...,1967-02-02,Mr. DOLE,12,1967,2,1967-02-01
8659,8659,8659,Mr. Speaker. I join in the statements made by ...,1967-02-02,Mr. DOLE,301,1967,2,1967-02-01
8767,8767,8767,Mr. Speaker. it is my pleasure to join in the ...,1967-02-02,Mr. DOLE,878,1967,2,1967-02-01
12255,12255,12255,Mr. Speaker. today I have introduced a joint r...,1967-02-09,Mr. DOLE,82,1967,2,1967-02-01
19034,19034,19034,Mr. Speaker. it is my pleasure to join Mrs. BO...,1967-02-28,Mr. DOLE,258,1967,2,1967-02-01
20616,20616,20616,Mr. Speaker. I wish to associate myself with t...,1967-03-02,Mr. DOLE,291,1967,3,1967-03-01
24507,24507,24507,Mr. Speaker. during this year of 1967 the Fede...,1967-03-08,Mr. DOLE,122,1967,3,1967-03-01
25378,25378,25378,Mr. Speaker. will the gentleman yield?,1967-03-09,Mr. DOLE,6,1967,3,1967-03-01
25380,25380,25380,Mr. Speaker. permit me to say. first of all. t...,1967-03-09,Mr. DOLE,78,1967,3,1967-03-01


Here are the first ten speeches of Mr. Fong.

In [32]:
congress[congress['speaker'] == 'Mr. FONG'][:10]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
1907,1907,1907,Mr. President. the Senate is now considering t...,1967-01-17,Mr. FONG,873,1967,1,1967-01-01
7621,7621,7621,Mr. President. I introduce. for appropriate re...,1967-02-01,Mr. FONG,313,1967,2,1967-02-01
8303,8303,8303,Mr. President. will the distinguished Senator ...,1967-02-02,Mr. FONG,9,1967,2,1967-02-01
8305,8305,8305,Mr. President. I commend the distinguished sen...,1967-02-02,Mr. FONG,183,1967,2,1967-02-01
8309,8309,8309,I thank the distinguished Senator for his very...,1967-02-02,Mr. FONG,1361,1967,2,1967-02-01
9006,9006,9006,Mr. President. it is gratifying to call attent...,1967-02-03,Mr. FONG,417,1967,2,1967-02-01
9757,9757,9757,Mr. President. I am particularly pleased that ...,1967-02-06,Mr. FONG,173,1967,2,1967-02-01
10922,10922,10922,Mr. President. it was with deep grief and shoc...,1967-02-08,Mr. FONG,429,1967,2,1967-02-01
18460,18460,18460,Mr. President. the current unrest and chaos th...,1967-02-28,Mr. FONG,179,1967,2,1967-02-01
18461,18461,18461,Mr. President. during extensive hearings. cond...,1967-02-28,Mr. FONG,1206,1967,2,1967-02-01


#### Using .isin()

We can use the operator **.isin()** to filter our results via a certain list.

Here are all the speeches that took place in June and July, i.e. months 6 and 7. Note that range(6, 8) stops just before 8; to include August as well, we would write range(6, 9).

In [33]:
congress[congress['month'].isin(range(6, 8))]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
63624,63624,63624,Mr. President. on April 19 the United Auto Wor...,1967-06-01,Mr. HARTKE,227,1967,6,1967-06-01
63625,63625,63625,Without objection. it is so ordered. The resol...,1967-06-01,The PRESIDING OFFICER,17,1967,6,1967-06-01
63626,63626,63626,Mr. President. I ask unanimous consent that th...,1967-06-01,Mr. McCLELLAN,47,1967,6,1967-06-01
63627,63627,63627,Without objection. it is so ordered.,1967-06-01,The PRESIDING OFFICER,6,1967,6,1967-06-01
63628,63628,63628,Mr. President. I ask unanimous consent that th...,1967-06-01,Mr. DIRKSEN,23,1967,6,1967-06-01
...,...,...,...,...,...,...,...,...,...
5970495,5970495,5970495,Madam Speaker. I rise today to commend and con...,2010-07-30,Mr. RADANOVICH,239,2010,7,2010-07-01
5970496,5970496,5970496,"Madam Speaker. I rise today to introduce the ""...",2010-07-30,Ms. LINDA T. SANCHEZ of California,307,2010,7,2010-07-01
5970497,5970497,5970497,Madam Speaker. I wish to speak today about an ...,2010-07-30,Mr. SHERMAN,277,2010,7,2010-07-01
5970498,5970498,5970498,Madam Speaker. I rise today to pay tribute and...,2010-07-30,Mr. WALDEN,677,2010,7,2010-07-01
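
Because range() leaves out its end value, it is worth checking what a range contains before filtering with it. The sketch below uses a hypothetical twelve-row table with one row per month.

```python
import pandas as pd

# range() stops just before its second argument
june_july = list(range(6, 8))  # [6, 7]
summer = list(range(6, 9))     # [6, 7, 8]

# Hypothetical toy table with one row per month
toy = pd.DataFrame({"month": list(range(1, 13))})

summer_rows = toy[toy["month"].isin(summer)]
print(list(summer_rows["month"]))  # [6, 7, 8]
```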


Say we make a list of all the speakers who spoke the most about pineapples.

In [34]:
pineapplespeakers = []

for speechnumber in list(pineapplespeeches.index):
    speaker = list(congress.loc[:, ['speaker']].iloc[speechnumber, ])[0]
    if speaker not in pineapplespeakers:
        pineapplespeakers.append(speaker)
        
pineapplespeakers

['Mr. FONG', 'Mrs. MINK', 'Mr. MATSUNAGA']

Here are all the speeches by the speakers who spoke the most about pineapples.

In [35]:
congress[congress['speaker'].isin(pineapplespeakers)]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
1541,1541,1541,Mr. Speaker. it was my great privilege to list...,1967-01-16,Mr. MATSUNAGA,281,1967,1,1967-01-01
1907,1907,1907,Mr. President. the Senate is now considering t...,1967-01-17,Mr. FONG,873,1967,1,1967-01-01
2136,2136,2136,"Mr. Speaker. the editorial entitled ""Speaker o...",1967-01-17,Mr. MATSUNAGA,347,1967,1,1967-01-01
2276,2276,2276,Mr. Speaker. I rise to pay tribute to the memo...,1967-01-18,Mr. MATSUNAGA,18,1967,1,1967-01-01
2296,2296,2296,Mr. Speaker. I. along with all my colleagues. ...,1967-01-18,Mrs. MINK,200,1967,1,1967-01-01
...,...,...,...,...,...,...,...,...,...
4742933,4742933,4742933,o1 Hawai. Mr. Chairman. I rise In support of t...,1998-03-25,Mrs. MINK,620,1998,3,1998-03-01
4931395,4931395,4931395,Madam Chairman. I believe strongly that all ch...,1999-10-26,Mrs. MINK,484,1999,10,1999-10-01
5003189,5003189,5003189,"I would have voted ""yea."" On the amendment to ...",2000-07-13,Mrs. MINK,51,2000,7,2000-07-01
5145728,5145728,5145728,Mr. Speaker. today I am introducing a bill dir...,2002-02-14,Mrs. MINK,411,2002,2,2002-02-01


#### Using .isin() to find data from the 1980s

In the following line of code, we'll use **.isin()** to tell Python to look for values in the 1980s.  We tell Python to look at the 'year' column. Then we select only the years that are in a list of years from the 1980s. 

    eighties_data = congress[congress['year'].isin(target_years)].copy()  # filter our dataset to just this decade

**.isin()** takes as its object a list, for instance the *target_years* variable, which we will create to include every year from 1980 to 1989.


Before we apply .isin(), however, we need to format the data so that we can navigate for time.

First, we need to make a 'year' column.

Then we need to filter for years that are in our target.  Note the use of the .isin() function. 

In [36]:
import pandas as pd
import datetime

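We convert the 'date' column with pd.to_datetime() and then use the pandas datetime accessor
    
    .dt.year

to create a new column called 'year'. (Our table already has a 'year' column, so here the line simply re-creates it.)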

In [37]:
congress['year']=pd.to_datetime(congress['date']).dt.year # make a year column

congress.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
0,0,0,Those who do not enjoy the privilege of the fl...,1967-01-10,The VICE PRESIDENT,16,1967,1,1967-01-01
1,1,1,Mr. President. on the basis of an agreement re...,1967-01-10,Mr. MANSFIELD,35,1967,1,1967-01-01
2,2,2,The Members of the Senate have heard the remar...,1967-01-10,The VICE PRESIDENT,40,1967,1,1967-01-01
3,3,3,The Chair lays before the Senate the following...,1967-01-10,The VICE PRESIDENT,151,1967,1,1967-01-01
4,4,4,Secretary of State.,1967-01-10,Mrs. AGNES BAGGETT,3,1967,1,1967-01-01
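
The same accessor can pull out other parts of a date. Here is a minimal sketch on a hypothetical three-row table, showing both .dt.year and .dt.month.

```python
import pandas as pd

# Hypothetical dates stored as text, as they are when first read from a CSV file
toy = pd.DataFrame({"date": ["1967-01-10", "1980-06-01", "2010-07-30"]})

parsed = pd.to_datetime(toy["date"])  # convert the text into real dates
toy["year"] = parsed.dt.year          # extract the year of each date
toy["month"] = parsed.dt.month        # extract the month of each date

print(toy)
```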


Using == to subset:

In [91]:
data1980 = congress[congress['year'] == 1980].copy()  # filter our dataset to just this year

data1980.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
2329890,2329890,2329890,Mr. Speaker. we in Delaware are proud of the o...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01
2329891,2329891,2329891,Mr. Speaker. it is logical for Americans to be...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01
2329892,2329892,2329892,The Chair has examined the Journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01
2329893,2329893,2329893,Mr. Speaker. I ask unanimous consent that the ...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01
2329894,2329894,2329894,Is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01


Using .isin() to subset:

In [1]:
target_years = list(range(1980, 1989 + 1))  # List of the years 1980-1989

eighties_data = congress[congress['year'].isin(target_years)].copy().reset_index()  # filter our dataset to just this decade
eighties_data = eighties_data.drop(columns=['index', 'Unnamed: 0'])  # minor reformatting - drop extra columns
eighties_data.head()

Let's save the results in case we want to use them again.

In [40]:
cd ~/digital-history

/users/jguldi/digital-history


In [41]:
data1980.to_csv("data1980.csv")

In [42]:
eighties_data.to_csv("eighties_data.csv")

Subsetting your data and saving copies of clean data is a good way to start work on your dataset.
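
One thing to watch when saving: by default, .to_csv() writes the row index as an extra, unnamed column, and reading the file back turns it into a column called 'Unnamed: 0'. This is a likely source of the 'Unnamed' columns in our congress table, though that is an assumption about how the file was made. The sketch below shows the effect on a hypothetical two-row table, writing to an in-memory text buffer instead of a real file.

```python
import io
import pandas as pd

# Hypothetical two-row table
toy = pd.DataFrame({"speaker": ["Mr. A", "Ms. B"], "year": [1980, 1981]})

# Default behavior: the row index is saved as an extra unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the row index out of the saved file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'speaker', 'year']
print(list(reloaded_clean.columns))       # ['speaker', 'year']
```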

## Cleaning tabular data

Next, we're going to break speeches into words and remove stopwords.  We get our stopwords list from the package NLTK (natural language toolkit).

Let's load stopwords as we have before.

In [43]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop = stopwords.words('english')
stop[:10]

[nltk_data] Downloading package stopwords to
[nltk_data]     /users/jguldi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

We'll take a new and special preparation step here where we add some regex -- including the word boundary symbols you've seen before -- to make a list of stopwords that Python can search for with great ease.  Mainly you'll want to copy and paste the following line, rather than understanding it, but here are the components:

    r'': 'a raw string, so that the backslashes in the regex are passed along unchanged'
    \b: 'look for a word boundary'
    (?:{}): 'a group of alternatives; .format() fills the {} with the query words that follow'
    '|'.join(stop): | means 'or', and .join() produces stopword1|stopword2|stopword3|etc. (where each stopword corresponds to 'i', 'me', 'my', etc.)
    
Basically we're just formatting the stopwords list so that Python can search for the whole series efficiently.
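
To make the components concrete, here is the same formula applied to a short, hypothetical stopword list and a made-up sentence, using Python's built-in re module rather than pandas.

```python
import re

# Hypothetical short stopword list (the real NLTK list is much longer)
stop = ["i", "me", "my", "the", "of"]

# Same formula as in this notebook: word boundary, any one of the stopwords, word boundary
stopwords_regex = r"\b(?:{})\b".format("|".join(stop))
print(stopwords_regex)

sentence = "i thank the senator of my state for meeting me"
cleaned = re.sub(stopwords_regex, "", sentence)  # replace every stopword with nothing
words = cleaned.split()                          # split on whitespace, dropping the gaps left behind

print(words)  # ['thank', 'senator', 'state', 'for', 'meeting']
```

Notice that 'meeting' survives even though it begins with 'me': the word boundaries ensure that only free-standing stopwords are removed.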

In [44]:
stopwords_regex = r'\b(?:{})\b'.format('|'.join(stop))

In [45]:
stopwords_regex

"\\b(?:i|me|my|myself|we|our|ours|ourselves|you|you're|you've|you'll|you'd|your|yours|yourself|yourselves|he|him|his|himself|she|she's|her|hers|herself|it|it's|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|that'll|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|s|t|can|will|just|don|don't|should|should've|now|d|ll|m|o|re|ve|y|ain|aren|aren't|couldn|couldn't|didn|didn't|doesn|doesn't|hadn|hadn't|hasn|hasn't|haven|haven't|isn|isn't|ma|mightn|mightn't|mustn|mustn't|needn|needn't|shan|shan't|shouldn|shouldn't|wasn|wasn't|weren|weren't|won|won't|wouldn|wouldn't)\\b"

To clean our text when our text is in tabular form, we can apply many commands that are familiar.  Technically, they are being applied over each row of the pandas dataframe.  But the pandas software makes it easier for us.

For each speech, we will perform some familiar tasks:

  * We will **.split()** the speech into words
  * We will use **replace** to get rid of punctuation
  * We will use **wn.morphy()** to get the lemma of each word


The only problem with tabular data is that we have to run splitting, clearing punctuation, stopwording, and other actions on entire **columns** of data rather than on a single list.

In theory, you might imagine writing a for loop that deals with one cell at a time.   However, that would take FOREVER.  

A more efficient approach is to use the built-in pandas commands that work over all the cells in an entire column at once.
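Here is a small sketch of the difference, using a made-up two-row table (`toy`) rather than the Congress data. Both approaches give the same answer on a table this small. The benefit of the column-wide command shows up on large tables.

```python
import pandas as pd

# A made-up two-row table standing in for eighties_data
toy = pd.DataFrame({'speech': ['Mr. Speaker, I rise today.', 'Is there objection?']})

# The slow way: visit each cell yourself with a for loop
looped = []
for speech in toy['speech']:
    looped.append(speech.lower())

# The pandas way: one command applied to the whole column
vectorized = toy['speech'].str.lower()

# Both approaches produce the same lowercased text
print(looped == list(vectorized))
```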



The pandas-native commands for working on columns in tabular data have familiar names:

    .str.replace()
    .str.lower()
    .str.split()
    
Let's see them in action.

Get rid of punctuation

In [46]:
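# regex=True tells pandas to treat the pattern as a regular expression (recent versions of pandas require this)
eighties_data['speech'] = eighties_data['speech'].str.replace(r'[^\w\s]', '', regex=True)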

Lowercase the text

In [47]:
eighties_data['speech'] = eighties_data['speech'].str.lower()

Eliminate stopwords using .replace() 

***This may take a minute*** -- notice the [*] in light gray to the left of the line of code. This means, 'the computer is thinking; please wait.' If your computer repeatedly crashes, you may need to allocate more memory when you next call up a session of JupyterLab.

In [48]:
eighties_data['stopworded'] = eighties_data['speech'].str.replace(stopwords_regex, '', regex=True)

Split each speech into a list of individual words

***This may take a minute***

In [49]:
eighties_data['words'] = eighties_data['stopworded'].str.split()

In [50]:
eighties_data.head()

Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year,stopworded,words
0,2329890,mr speaker we in delaware are proud of the out...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01,mr speaker delaware proud outstanding rec...,"[mr, speaker, delaware, proud, outstanding, re..."
1,2329891,mr speaker it is logical for americans to be u...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01,mr speaker logical americans upset hold...,"[mr, speaker, logical, americans, upset, holdi..."
2,2329892,the chair has examined the journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01,chair examined journal last days proceedi...,"[chair, examined, journal, last, days, proceed..."
3,2329893,mr speaker i ask unanimous consent that the co...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01,mr speaker ask unanimous consent committee ...,"[mr, speaker, ask, unanimous, consent, committ..."
4,2329894,is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01,objection request gentleman texas,"[objection, request, gentleman, texas]"


Note that with .str.split() we have created a new column, 'words', that holds a different kind of data from the 'speech' and 'stopworded' columns.  Those columns hold one long string of text in each row, to which we can apply commands such as .str.replace() and .str.lower().  In 'words', each row holds a list of words. This is useful for counting -- which we'll do next -- but it makes using .str.replace() more difficult.  

***Bottom line***: when working with tabular data, use commands like .replace() before you .split() the strings of text into individual words. 

(NB: You can always use ' '.join(list) to weld those lists of words back together if you have to.)
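As a quick illustration of going back and forth, here is a sketch on a made-up one-row table. For a whole column of lists, **.str.join(' ')** plays the same role that ' '.join(list) plays for a single list.

```python
import pandas as pd

# A made-up one-row table
toy = pd.DataFrame({'speech': ['objection request gentleman texas']})

# .str.split() turns each string into a list of words
toy['words'] = toy['speech'].str.split()

# .str.join(' ') welds each list back into a single string
toy['rejoined'] = toy['words'].str.join(' ')

print(toy['words'][0])
print(toy['rejoined'][0])
```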

## Wordcount with Tabular Data

We can use many of the tools we already know to count words.

     value_counts()


#### You need lists of words in a column to count them.

An important observation: when we used .str.split() above, we changed the data type in which the words are stored. 

Originally, our 'speech' column was just long strings of words.  We stopworded those strings with a regular expression, and then we used .str.split() to turn each speech into a list of individual words -- just like the lists we've been working on so far.

We could have left the words as super-long strings. But in fact, it's useful to keep the words in list form, because lists are easy to count.  

#### How many words in any speech?

How long is the first speech, in words (not including stopwords)?

In [51]:
len(eighties_data['words'][0])

73

What are the top words in the first speech (not including stopwords)?

In [52]:
pd.Series(eighties_data['words'][0]).value_counts()[:10]  # turn the list into a Series, then count

railroad       3
wilmington     3
outstanding    3
shops          3
pleasure       2
recognize      2
mutual         2
amtrak         2
achieved       2
mr             2
dtype: int64

Notice what happens when I set the parameter "normalize" for value_counts() to "True": it tells pandas to report each word's share of the total (a proportion between 0 and 1) instead of a raw count.

In [53]:
pd.Series(eighties_data['words'][0]).value_counts(normalize=True)[:10]

railroad       0.041096
wilmington     0.041096
outstanding    0.041096
shops          0.041096
pleasure       0.027397
recognize      0.027397
mutual         0.027397
amtrak         0.027397
achieved       0.027397
mr             0.027397
dtype: float64
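In the output above, 'railroad' appears 3 times out of 73 words, and 3/73 is about 0.041096. The sketch below shows the same arithmetic on a made-up four-word list (`toy_words`), so you can check the numbers by hand.

```python
import pandas as pd

# A made-up list of four words
toy_words = ['railroad', 'railroad', 'shops', 'amtrak']

# Raw counts: how many times each word appears
counts = pd.Series(toy_words).value_counts()

# Proportions: each count divided by the total number of words
shares = pd.Series(toy_words).value_counts(normalize=True)

print(counts['railroad'])   # 2 of the 4 words
print(shares['railroad'])   # 2 / 4
print(shares.sum())         # the proportions always add up to 1
```

To turn a proportion into a percentage, multiply it by 100.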

#### Total word count for the dataset

Let's count the words for each speech in the dataset. We'll make a new column called 'wordcount.'

In [54]:
eighties_data['wordcount'] = eighties_data['words'].str.len()

In [55]:
eighties_data.head()

Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year,stopworded,words,wordcount
0,2329890,mr speaker we in delaware are proud of the out...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01,mr speaker delaware proud outstanding rec...,"[mr, speaker, delaware, proud, outstanding, re...",73
1,2329891,mr speaker it is logical for americans to be u...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01,mr speaker logical americans upset hold...,"[mr, speaker, logical, americans, upset, holdi...",39
2,2329892,the chair has examined the journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01,chair examined journal last days proceedi...,"[chair, examined, journal, last, days, proceed...",19
3,2329893,mr speaker i ask unanimous consent that the co...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01,mr speaker ask unanimous consent committee ...,"[mr, speaker, ask, unanimous, consent, committ...",19
4,2329894,is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01,objection request gentleman texas,"[objection, request, gentleman, texas]",4


Notice that we now have two word-count columns: 'word_count', which was made before stopwording, and 'wordcount', which we just made after stopwording.

How many words are there in the dataframe as a whole? We can answer that question by adding up all the individual speech wordcounts using 

    .sum()

In [56]:
eighties_data['wordcount'].sum()

108044320
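Here is the same two-step pattern on a made-up table with two short "speeches", as a sketch you can verify by counting on your fingers.

```python
import pandas as pd

# A made-up table where each row holds a list of words
toy = pd.DataFrame({'words': [['chair', 'examined', 'journal'],
                              ['objection', 'request']]})

# .str.len() measures the length of the list in each row
toy['wordcount'] = toy['words'].str.len()

# .sum() adds up the whole column
total = toy['wordcount'].sum()

print(list(toy['wordcount']))
print(total)
```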

#### Get the longest speeches in the dataset

What are the longest speeches in the dataset? Here we rank by 'word_count', the count made before stopwording.

In [70]:
longest_speeches = eighties_data.nlargest(n=5, columns=['word_count']) # Get the top 5 longest speeches by word_count
longest_speeches 

Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year,stopworded,words,wordcount
241386,2571276,815 16th street nw washington dc 20006 d 6 390...,1981-09-09,es. AFL-CIO,33507,1981,9,1981-09-01,815 16th street nw dc 20006 6 390649 c jame...,"[815, 16th, street, nw, washington, dc, 20006,...",29740
1217582,3547472,limits would be in effect until such time as t...,1988-06-01,The MAAC,30042,1988,6,1988-06-01,limits effect implements fee schedul...,"[limits, would, effect, time, secretary, imple...",17962
197843,2527733,box 269 elizabethtown pa d 6 63064 a american ...,1981-05-18,sors. P.O,21558,1981,5,1981-05-01,box 269 elizabethtown pa 6 63064 associatio...,"[box, 269, elizabethtown, pa, 6, 63064, americ...",19252
286393,2616283,box 269 elizabethtown pa d 6 8461 a american a...,1981-11-24,sors. P.O,21076,1981,11,1981-11-01,box 269 elizabethtown pa 6 8461 association...,"[box, 269, elizabethtown, pa, 6, 8461, america...",18817
13423,2343313,de cv balderas 36 mexico df mexico d 6 2400 e ...,1980-02-19,car. S.A,19922,1980,2,1980-02-01,de cv balderas 36 mexico df mexico 6 2400 e 2...,"[de, cv, balderas, 36, mexico, df, mexico, 6, ...",17793
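Notice that the row labels on the left of the result (241386, 1217582, and so on) are the speeches' original positions in eighties_data. **.nlargest()** keeps each row's label rather than renumbering from zero. The sketch below shows this on a made-up four-row table.

```python
import pandas as pd

# A made-up table of four speeches
toy = pd.DataFrame({'speaker': ['A', 'B', 'C', 'D'],
                    'word_count': [32, 122, 36, 82]})

# Keep the 2 rows with the largest values in 'word_count'
top2 = toy.nlargest(n=2, columns=['word_count'])

print(list(top2['speaker']))
print(list(top2.index))   # the original row labels are kept
```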


#### The top words for the dataset

Let's count the top words for the longest speeches.

First, we need a single list with every word from the 'words' column of longest_speeches in it. A simple for-loop can do that in a hurry.  Let's create a list called "longestspeecheswords" from the contents of each row in the 'words' column.

In [76]:
longestspeecheswords = []

for speech in longest_speeches['words']:
    for word in speech:
        longestspeecheswords.append(word)

longestspeecheswords[:30]

['815',
 '16th',
 'street',
 'nw',
 'washington',
 'dc',
 '20006',
 '6',
 '390649',
 'c',
 'james',
 'hacket',
 'american',
 'plywood',
 'association',
 'po',
 'box',
 '11700',
 'tacoma',
 'wash',
 'b',
 'american',
 'plywood',
 'assciation',
 'po',
 'box',
 '11700',
 'tacoma',
 'wash',
 '98411']

In [75]:
pd.Series(longestspeecheswords).value_counts()[:10]  # turn the list into a Series, then count

washington    3969
dc            3915
nw            3548
b             3363
street        3228
6             2203
e             1778
avenue        1609
9             1519
suite         1297
dtype: int64

Now here's a line of code that does exactly the same thing in another way.

It uses the method 

    .explode()

to give each word in each list its own row. We can use **.dropna()** at the end to tell pandas to drop any rows that are empty (for example, a row that came from a speech with no words left in it). Check it out.

In [77]:
longestspeecheswords = longest_speeches["words"].explode().dropna()
longestspeecheswords

241386           815
241386          16th
241386        street
241386            nw
241386    washington
             ...    
13423           19th
13423         street
13423             nw
13423     washington
13423             dc
Name: words, Length: 103564, dtype: object

In [78]:
longestspeecheswords.value_counts()[:10]

washington    3969
dc            3915
nw            3548
b             3363
street        3228
6             2203
e             1778
avenue        1609
9             1519
suite         1297
Name: words, dtype: int64
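If you want to convince yourself that the two approaches match, here is a sketch on a made-up three-row table. The third row is an empty list, which is the kind of row that **.dropna()** cleans up after **.explode()**.

```python
import pandas as pd

# A made-up column of word lists; the last "speech" has no words at all
toy = pd.DataFrame({'words': [['nw', 'washington', 'dc'],
                              ['street', 'nw'],
                              []]})

# Approach 1: nested for loops
looped = []
for speech in toy['words']:
    for word in speech:
        looped.append(word)

# Approach 2: .explode() gives each word its own row;
# the empty list becomes an empty (NaN) row, which .dropna() removes
exploded = toy['words'].explode().dropna()

print(looped == list(exploded))
print(exploded.value_counts()['nw'])
```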

Let's count the top words overall.  

In [80]:
topwords1980 = eighties_data["words"].explode().dropna().value_counts()
topwords1980

mr             1262448
would           922396
president       755471
bill            603331
amendment       543091
                ...   
rzcosn               1
ergens               1
tradegattwe          1
debitatng            1
1985thats            1
Name: words, Length: 815067, dtype: int64

Great -- but some of those words are still pretty hollow, despite having already been stopworded!  

Let's use our top words from the decade to create a new stopword list, format the list, and apply it to eighties_data.

In [61]:
maybestopwords = list(topwords1980[:200].index)  # topwords1980 is already sorted from most to least frequent
maybestopwords[:50]

['mr',
 'would',
 'president',
 'bill',
 'amendment',
 'us',
 'senator',
 'time',
 'gentleman',
 'committee',
 'one',
 'speaker',
 'states',
 'new',
 'people',
 'years',
 'chairman',
 'senate',
 'house',
 'year',
 'congress',
 'federal',
 'program',
 'think',
 'many',
 'state',
 'united',
 'legislation',
 'support',
 'also',
 'act',
 'government',
 'may',
 'yield',
 'today',
 'budget',
 'national',
 'american',
 'percent',
 'make',
 'first',
 'country',
 'ask',
 'million',
 'could',
 'like',
 'going',
 'colleagues',
 'must',
 'resolution']

Ideally, we would edit this list by hand. But I'm just going to use the top 200 words as stopwords for now.

In [62]:
stopwords_regex2 = r'\b(?:{})\b'.format('|'.join(maybestopwords))
eighties_data['stopworded'] = eighties_data['stopworded'].str.replace(stopwords_regex2, '', regex=True)

In [63]:
eighties_data.head()

Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year,stopworded,words,wordcount
0,2329890,mr speaker we in delaware are proud of the out...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01,delaware proud outstanding achieved ...,"[mr, speaker, delaware, proud, outstanding, re...",73
1,2329891,mr speaker it is logical for americans to be u...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01,logical upset holding hostages ir...,"[mr, speaker, logical, americans, upset, holdi...",39
2,2329892,the chair has examined the journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01,chair examined journal days proceedings ...,"[chair, examined, journal, last, days, proceed...",19
3,2329893,mr speaker i ask unanimous consent that the co...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01,banking finance urban affairs disc...,"[mr, speaker, ask, unanimous, consent, committ...",19
4,2329894,is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01,texas,"[objection, request, gentleman, texas]",4


Now, count. The 'stopworded' column holds strings, so we split each one into words before counting.

In [64]:
topwords1980 = eighties_data["stopworded"].str.split().explode().dropna().value_counts()[:20]
topwords1980

within            77036
children          76906
agreement         76869
upon              75960
case              75881
motion            75698
international     75691
ordered           75636
office            73858
human             73439
already           73174
local             73046
passed            72516
special           72003
appropriations    71823
still             71683
benefits          71621
present           71580
political         71314
result            71215
Name: stopworded, dtype: int64
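The whole "find the top words, then treat them as stopwords" routine can be sketched on a made-up three-row table. Here we take only the top 2 words rather than the top 200. The names `toy`, `extra_stop`, and `extra_regex` are invented for this illustration.

```python
import pandas as pd

# A made-up table of already-stopworded speeches
toy = pd.DataFrame({'stopworded': ['mr speaker delaware railroad',
                                   'mr speaker texas',
                                   'mr chairman amtrak']})

# Count every word in the column
counts = toy['stopworded'].str.split().explode().value_counts()

# Treat the 2 most frequent words as extra stopwords
extra_stop = list(counts[:2].index)
extra_regex = r'\b(?:{})\b'.format('|'.join(extra_stop))

# Remove them, then count again
toy['restopworded'] = toy['stopworded'].str.replace(extra_regex, '', regex=True)
remaining = toy['restopworded'].str.split().explode().dropna().value_counts()

print(extra_stop)
print(sorted(remaining.index))
```

On real data you would want to read through the candidate list before using it, since some frequent words may matter to your research question.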

## Get the top speeches that mention the word "democracy" 

Next, let's count the number of times that the word 'democracy' appears in each speech from the 1980s.  

Notice the use of .str.count():

In [99]:
democracy_speeches = eighties_data.copy()
democracy_speeches['democracy_count'] = democracy_speeches['speech'].str.count('democracy')  # Create a new column for the count of the word 'democracy'
democracy_speeches.head()

Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year,stopworded,words,wordcount,democracy_count
0,2329890,mr speaker we in delaware are proud of the out...,1980-01-03,Mr. EVANS of Delaware,122,1980,1,1980-01-01,delaware proud outstanding achieved ...,"[mr, speaker, delaware, proud, outstanding, re...",73,0
1,2329891,mr speaker it is logical for americans to be u...,1980-01-03,Mr. DERWINSKI,82,1980,1,1980-01-01,logical upset holding hostages ir...,"[mr, speaker, logical, americans, upset, holdi...",39,0
2,2329892,the chair has examined the journal of the last...,1980-01-03,The SPEAKER pro tempore,32,1980,1,1980-01-01,chair examined journal days proceedings ...,"[chair, examined, journal, last, days, proceed...",19,0
3,2329893,mr speaker i ask unanimous consent that the co...,1980-01-03,Mr. WHITE,36,1980,1,1980-01-01,banking finance urban affairs disc...,"[mr, speaker, ask, unanimous, consent, committ...",19,0
4,2329894,is there objection to the request of the gentl...,1980-01-03,The SPEAKER pro tempore,11,1980,1,1980-01-01,texas,"[objection, request, gentleman, texas]",4,0
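One thing to keep in mind: **.str.count()** counts every place where the pattern appears, including inside longer words. Because we stripped punctuation earlier, "democracy's" became "democracys", and that still counts as a match for 'democracy'. The sketch below, on two made-up speeches, compares the plain count with a whole-word count that uses the same \b word-boundary markers we used for stopwording. Whether you want the stricter count is a research decision, and the notebook itself uses the plain one.

```python
import pandas as pd

# Two made-up, already-cleaned speeches
toy = pd.Series(['democracy and democracys friends',
                 'the chair examined the journal'])

# Plain pattern: also matches the 'democracy' inside 'democracys'
substring_counts = toy.str.count('democracy')

# Whole-word pattern: \b requires a word boundary on each side
whole_word_counts = toy.str.count(r'\bdemocracy\b')

print(list(substring_counts))
print(list(whole_word_counts))
```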


Get the speeches that mentioned democracy most frequently.

In [115]:
top_democracy_speeches = democracy_speeches.nlargest(500, ['democracy_count'])
top_democracy_speeches.head()

Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year,stopworded,words,wordcount,democracy_count
1252707,3582597,mr president in listening to the speeches toda...,1988-08-10,Mr. PACKWOOD,5661,1988,8,1988-08-01,listening speeches find repeated ...,"[mr, president, listening, speeches, today, fi...",2654,27
1044031,3373921,mr speaker with its four core grantees the nat...,1987-04-09,Mr. CONIMS,1603,1987,4,1987-04-01,four core grantees endowment democracy ...,"[mr, speaker, four, core, grantees, national, ...",840,26
1383325,3713215,mr president benazir bhutto prime minister of ...,1989-10-13,Mr. KERRY,2162,1989,10,1989-10-01,benazir bhutto prime minister islamic repu...,"[mr, president, benazir, bhutto, prime, minist...",1226,24
867164,3197054,mr president on december 16 1983 president rea...,1985-12-06,Mr. HATCH,1862,1985,12,1985-12-01,december 16 1983 reagan speaking ceremon...,"[mr, president, december, 16, 1983, president,...",981,21
1309407,3639297,mr president i wanted to say to my friend from...,1989-04-13,Mr. WALLOP,1213,1989,4,1989-04-01,wanted friend south carolina feel ...,"[mr, president, wanted, say, friend, south, ca...",543,19


What words do the top speeches mentioning democracy use when they talk about democracy?

In [116]:
democracywordcount = top_democracy_speeches["stopworded"].str.split().explode().dropna().value_counts()
democracywordcount[:30]

democracy        3933
democratic       1767
nicaragua        1674
political        1338
freedom          1235
central          1168
aid              1096
contras           926
sandinistas       918
peace             878
human             720
salvador          685
el                680
free              656
countries         603
elections         592
chile             590
opposition        525
south             519
communist         517
power             510
forces            510
nicaraguan        507
endowment         487
philippines       457
election          451
international     440
sandinista        438
leaders           410
toward            399
Name: stopworded, dtype: int64

Does that give you something you could write about?

### Summing Up

Here is a reprise of the code you just learned.  

In the code that follows:
* We will use **.str.count()** and **.nlargest()** to find the speeches that mention "pineapples" the most.
* We will use **.loc** and **.iloc** to find the speakers who gave those pineapple speeches, calling that list *pineapplespeakers.*
* We will use square brackets -- **[]** -- and **.isin()** to find all the speeches given by speakers who are in the *pineapplespeakers* list, calling the result *pineapplespeakersspeeches.*
* We will narrow the *pineapplespeakersspeeches* to just the 1960s using **.isin(),** creating the dataframe *pineapple_speakers_sixties_data.*
* We will clean that data using **.str.replace()** and **.str.lower()**.
* We will add a column, *democracy_count*, to *pineapple_speakers_sixties_data*.  The new column uses **.str.count()** to count how many times the speakers who defended the American pineapple mentioned democracy. We will use **.nlargest()** to find the speeches where they mentioned democracy the most, calling the results *top_speeches.*
* We will then divide the 'stopworded' column of *top_speeches* into words. We will count the stopworded words using **.str.split().explode().dropna().value_counts()**, producing the final dataset, *pineapple_democracywordcount.*

Effectively, this seemingly silly research project will give us an alternative portrait of how American representatives talked about democracy -- one centered not on the anti-communism of the Reagan-era Congress of the 1980s, with its fears about communism spreading in El Salvador, but coming from the diverse, multi-ethnic Pacific islands of Hawaii in the 1960s.  

Such a project will allow us to answer a question like this: What did Hawaii's representatives (who sometimes defended its pineapple industry) talk about when they spoke about democracy in the 1960s? How did their view of American democracy differ from that of Reagan-era America in the fight against communism?


In [10]:
import nltk
import pandas as pd
import csv
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')
stopwords_regex = r'\b(?:{})\b'.format('|'.join(stop))

[nltk_data] Downloading package stopwords to
[nltk_data]     /users/jguldi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [32]:
# load congress data
congress = pd.read_csv("congress1967-2010.csv")

# get the speeches that mentioned pineapples the most
pineapplespeeches = congress['speech'].str.count('pineapple').nlargest(5) 

# get the names of speakers who gave the most pineapple-y speeches 
pineapplespeakers = []
for speechnumber in list(pineapplespeeches.index):
        speaker = list(congress.loc[:, ['speaker']].iloc[speechnumber, ])[0]
        if speaker not in pineapplespeakers:
            pineapplespeakers.append(speaker)
        
# get all speeches of speakers who mention pineapples the most
pineapplespeakersspeeches = congress[congress['speaker'].isin(pineapplespeakers)]

# narrow to just the 1960s
target_years = list(range(1960, 1969 + 1))  # List of the years 1960-1969
pineapple_speakers_sixties_data = pineapplespeakersspeeches[pineapplespeakersspeeches['year'].isin(target_years)].copy().reset_index()  # filter our dataset to just this decade
pineapple_speakers_sixties_data = pineapple_speakers_sixties_data.drop(columns=['index', 'Unnamed: 0'])  # minor reformatting - drop extra columns

# clean up the data
pineapple_speakers_sixties_data['speech'] = pineapple_speakers_sixties_data['speech'].str.replace(r'[^\w\s]', '', regex=True).str.lower()  # remove punctuation, lowercase
pineapple_speakers_sixties_data['stopworded'] = pineapple_speakers_sixties_data['speech'].str.replace(stopwords_regex, '', regex=True)  # stopwording

# count how many times the pineapple speaker mentioned democracy in each speech
pineapple_speakers_sixties_data['democracy_count'] = pineapple_speakers_sixties_data['speech'].str.count('democracy')  # Create a new column for the count of the word 'democracy'
top_speeches = pineapple_speakers_sixties_data.nlargest(10, ['democracy_count'])

# count other words that the pineapple speakers used when they mentioned democracy
pineapple_democracywordcount = top_speeches["stopworded"].str.split().explode().dropna().value_counts()
pineapple_democracywordcount[:30]

hawaii         64
congress       40
territory      37
state          36
statehood      34
united         31
states         30
pacific        29
islands        27
people         27
education      23
world          23
american       21
us             21
island         20
one            20
nation         19
years          19
status         18
house          18
would          17
trust          17
government     17
democracy      16
new            15
first          15
territories    15
vote           14
hawaiis        13
citizens       13
Name: stopworded, dtype: int64
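The filtering steps in that cell can be hard to follow on a dataset with millions of rows, so here is a sketch of the same chain on a made-up four-row table. It also shows one possible alternative to the speaker-lookup loop: handing all of the row labels to **.loc** at once and using **.unique()** to remove repeats. The table and its speaker names are invented for illustration.

```python
import pandas as pd

# A made-up table standing in for the congress data
toy = pd.DataFrame({
    'speaker': ['Mr. A', 'Mr. B', 'Mr. A', 'Ms. C'],
    'speech': ['pineapple pineapple growers', 'the budget',
               'pineapple exports', 'the journal'],
    'year': [1967, 1968, 1985, 1969],
})

# The 2 speeches that mention 'pineapple' the most (row labels 0 and 2)
top = toy['speech'].str.count('pineapple').nlargest(2)

# Look up the speakers of those rows, without repeats
speakers = list(toy.loc[top.index, 'speaker'].unique())

# All speeches by those speakers, then only the ones from the 1960s
their_speeches = toy[toy['speaker'].isin(speakers)]
sixties = their_speeches[their_speeches['year'].isin(list(range(1960, 1970)))]

print(speakers)
print(list(sixties.index))
```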

***One Interpretation:*** while 1980s legislators were most concerned with contrasting American democracy with what they perceived as the threat of Marxist violence in Latin America, Hawaiian legislators in the 1960s were mostly concerned with democracy as a reflection of Pacific peoples and islanders, depicting democracy as a system closely bound up with the advantages of public education and citizenship.

***Reflect on the Room for Other Interpretations.  What is fact here, and what is open to debate?*** *How might your interpretation differ from the interpretation I have given? Would you pay attention to the same words?  Would you interpret them the same way?  Would you measure the 'top words about democracy' in the same way? Would that produce different results?*

## Assignment

1) Print out the first 500 words of the speech that mentions your favorite animal the highest number of times.  


2) Choose one of the top 100 speakers from congress['speaker'].value_counts().
   * Find the longest speech by that speaker. 
   * Show the first and last 500 words.  


3) Limit the dataframe to one year of your choosing in the 1980s.  
   * Find the longest speech by your speaker. 
   * Show the first 500 words.  


4) Call up a list of your speaker's top hundred words.  
   * Choose one word that you judge to be meaningful -- and possibly distinctive of that speaker (consult the list of overall top words for comparison).  
   * Find the speech where your speaker mentions that word the greatest number of times. 
   * Show the first 500 words.  


5) Find the longest speech where your speaker mentions the word of your choosing.  
   * The word should be mentioned at least three times. 
   * Show the first 500 words.  

For each part of the assignment, take a screenshot of the code and the results and upload it. 