# Congressional Activity
<font size=4 color='blue'>Understand and Prep Data - General Activity</font>
***

**Project Summary:**  
The Resume of Congressional Activity has been published since 1947. It includes statistics on the number of measures introduced, bills passed, the outcome of confirmations, etc.  
This project analyzes activity trends and factors that affect the productivity of Congress.  

**Notebook Scope:**  
This notebook includes code to load and preview raw General Activity data from an Excel spreadsheet. This input file was compiled by copying and pasting content from the annual Resume in PDF format. Minimal cleanup and formatting were completed manually to support the data prep covered in this notebook.  

**Output:**  
An Excel file containing scrubbed General Activity data is generated.  
***

***
# Notebook Setup
***

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re

In [2]:
# Set display options
pd.options.display.multi_sparse = False

***  
# Read Data
***
Given the complexity of scraping data from PDFs, each resume's contents were manually copied and pasted into Microsoft Excel. Minimal formatting was completed for consistency. Individual resumes can be found on <a href="https://www.senate.gov/legislative/ResumesofCongressionalActivity1947present.htm">Senate.gov</a>.

<font color='red'>Note:</font> Not all of the resumes on the Senate.gov page referenced above are final. Where the resume was not final, additional .gov sites were searched to find the latest version. A copy of the PDFs used is available on <a href="https://github.com/tamimcm416/congressional_activity">GitHub</a>.


In [3]:
# Load all worksheets
file_name = '../Data/Resume Data - Raw.xlsx'
raw_data_dict = pd.read_excel(file_name, sheet_name=None, header=None, skiprows=3, usecols='A:D')

In [4]:
# View the names of each tab loaded from the Excel document
print(raw_data_dict.keys())

dict_keys(['1983 - 98.1', '1984 - 98.2', '1985 - 99.1', '1986 - 99.2', '1987 - 100.1', '1988 - 100.2', '1989 - 101.1', '1990 - 101.2', '1991 - 102.1', '1992 - 102.2', '1993 - 103.1', '1994 - 103.2', '1995 - 104.1', '1996 - 104.2', '1997 - 105.1', '1998 - 105.2', '1999 - 106.1', '2000 - 106.2', '2001 - 107.1', '2002 - 107.2', '2003 - 108.1', '2004 - 108.2', '2005 - 109.1', '2006 - 109.2', '2007 - 110.1', '2008 - 110.2', '2009 - 111.1', '2010 - 111.2', '2011 - 112.1', '2012 - 112.2', '2013 - 113.1', '2014 - 113.2', '2015 - 114.1', '2016 - 114.2', '2017 - 115.1', '2018 - 115.2', '2019 - 116.1', '2020 - 116.2', '2021 - 117.1', '2022 - 117.2'])


In [5]:
# View the contents of the first worksheet
pd.DataFrame(raw_data_dict['1983 - 98.1'])

Unnamed: 0,0,1,2,3
0,,Senate,House,Total
1,Days in session,150,146,
2,Time in session,"1,010 hrs., 47'","851 hrs., 45'",
3,Congressional Record:,,,
4,...Pages of proceedings,17224,10665,27889
5,...Extension of Remarks,,,5985
6,Public bills enacted into law,101,114,215
7,Private bills enacted into law,,6,6
8,Bills in conference,3,2,5
9,Bills through conference,4,29,33


***
# Create General Activity Dataframe
***
Consolidate general activity data into a single dataframe.

In [6]:
# Create an empty dataframe to hold the final data
gen_activity_df = pd.DataFrame(columns = ['Session', 'Label', 'Senate', 'House', 'Both'])

In [7]:
# Loop through the first column of each worksheet to tidy the label and pull the values for each label.
for key in raw_data_dict.keys():
    for i, row in raw_data_dict[key].iterrows():
        if pd.notna(row[0]):
            new_label = row[0].capitalize()
            if '...' not in new_label:
                section = new_label.replace(':', '')
            else:
                new_label = section + '...' + new_label[3:].capitalize()
            gen_activity_df.loc[len(gen_activity_df)] = [key, new_label, row[1], row[2], row[3]]
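
The labeling rule in the loop above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical three-row worksheet (`toy_sheet`) and a hypothetical helper (`tidy_labels`) that are not part of the original notebook. It shows how a `...` sub-item picks up the heading of the section it falls under.

```python
import pandas as pd

# Hypothetical miniature worksheet mirroring the layout of the raw data
toy_sheet = pd.DataFrame(
    [
        ["Days in session", 150, 146, None],
        ["Congressional Record:", None, None, None],
        ["...Pages of proceedings", 17224, 10665, 27889],
        [None, None, None, None],
    ]
)


def tidy_labels(sheet):
    """Return cleaned labels, prefixing '...' sub-items with their section heading."""
    labels = []
    section = ""
    for raw in sheet[0].dropna():
        label = raw.capitalize()
        if label.startswith("..."):
            # Sub-item: prepend the most recent section heading
            label = section + "..." + label[3:].capitalize()
        else:
            # Section heading or standalone item: remember it without the colon
            section = label.replace(":", "")
        labels.append(label)
    return labels


toy_labels = tidy_labels(toy_sheet)
print(toy_labels)
```

The blank row is skipped, and the sub-item becomes `Congressional record...Pages of proceedings`, matching the label format in the preview that follows.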

In [8]:
# Preview the general activity dataframe
gen_activity_df.head()

Unnamed: 0,Session,Label,Senate,House,Both
0,1983 - 98.1,Days in session,150,146,
1,1983 - 98.1,Time in session,"1,010 hrs., 47'","851 hrs., 45'",
2,1983 - 98.1,Congressional record:,,,
3,1983 - 98.1,Congressional record...Pages of proceedings,17224,10665,27889.0
4,1983 - 98.1,Congressional record...Extension of remarks,,,5985.0


***
# Tidy General Activity Dataframe
***
The current dataframe is not tidy. This can be addressed by moving variables to columns and observations to rows. For this dataset, an observation will be defined as a combination of Year, Congress, Session, and Chamber.

In [9]:
# Pivot the dataframe so that each column is a variable. To simplify, this is done by chamber and then concatenated
chamber_frames = []
for chamber in ['Senate', 'House', 'Both']:
    temp_df = gen_activity_df[['Session', 'Label', chamber]].copy()
    temp_df = temp_df.pivot(index=['Session'], columns=['Label'], values=chamber)
    temp_df['Chamber'] = chamber
    chamber_frames.append(temp_df)

# Stack the per-chamber frames in a single concat call
gen_activity_tidy_df = pd.concat(chamber_frames, axis=0)
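
To see the reshaping on a small scale, here is a sketch that assumes a hypothetical long-format sample (`toy_long`) with made-up values. It pivots once per chamber and stacks the results, so each row ends up as one Session and Chamber combination.

```python
import pandas as pd

# Hypothetical long-format sample shaped like gen_activity_df (values are made up)
toy_long = pd.DataFrame(
    {
        "Session": ["1983 - 98.1", "1983 - 98.1", "1984 - 98.2", "1984 - 98.2"],
        "Label": ["Days in session", "Quorum calls", "Days in session", "Quorum calls"],
        "Senate": [10, 1, 20, 2],
        "House": [30, 3, 40, 4],
    }
)

frames = []
for chamber in ["Senate", "House"]:
    # One row per session, one column per label, for this chamber only
    wide = toy_long.pivot(index="Session", columns="Label", values=chamber)
    wide["Chamber"] = chamber
    frames.append(wide)

toy_tidy = pd.concat(frames).reset_index()
toy_tidy.columns.name = None
print(toy_tidy)
```

Two sessions and two chambers give four rows, with one column per label plus `Session` and `Chamber`.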

In [10]:
# Split the Session value into Year, Congress and Session
gen_activity_tidy_df.reset_index(inplace=True)
gen_activity_tidy_df[['Year', 'Congress', 'Session']] = gen_activity_tidy_df['Session'].str.split(' - |[.]', expand=True)
cols = ['Year', 'Congress', 'Session', 'Chamber']
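# Append the label columns. The slice starts at position 2, so the label column at position 1 is not carried over;
# the last three positions hold Chamber, Year, and Congress.
cols.extend(gen_activity_tidy_df.columns.to_list()[2:-3])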
gen_activity_tidy_df = gen_activity_tidy_df[cols].copy()
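
The split pattern ` - |[.]` breaks a tab name on either the ` - ` separator or a literal period. A minimal sketch on two tab names taken from the worksheet list above (the variable names are illustrative only):

```python
import pandas as pd

# Two tab names in the "<year> - <congress>.<session>" format
sessions = pd.Series(["1983 - 98.1", "2022 - 117.2"])

# Split on " - " or on a literal period; expand=True returns one column per piece
parts = sessions.str.split(r" - |[.]", expand=True, regex=True)
parts.columns = ["Year", "Congress", "Session"]
parts = parts.astype(int)
print(parts)
```

Putting the period inside a character class (`[.]`) keeps it from acting as a regex wildcard.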

In [11]:
# Clear the column heading label
gen_activity_tidy_df.columns.name = None

***
# Preview General Activity Data
***

In [12]:
gen_activity_tidy_df.head()

Unnamed: 0,Year,Congress,Session,Chamber,Bills not signed,Bills through conference,Bills vetoed,Conference reports,Congressional record...Extension of remarks,Congressional record...Pages of proceedings,...,"Measures reported, total...Senate joint resolutions","Measures reported, total...Simple resolutions",Private bills enacted into law,Public bills enacted into law,Quorum calls,Recorded votes,Special reports,Time in session,Vetoes overridden,Yea-and-nay votes
0,1983,98,1,Senate,,4.0,3.0,4.0,,17224,...,87,139,,101,18,,25,"1,010 hrs., 47'",1,381
1,1984,98,2,Senate,,22.0,8.0,,,14650,...,99,122,17.0,166,19,,11,"940 hrs., 28'",1,292
2,1985,99,1,Senate,,8.0,,2.0,,18418,...,118,100,,110,20,,18,"1,252 hrs., 31'",1,381
3,1986,99,2,Senate,,,4.0,,,17426,...,111,63,7.0,187,16,,15,"1,278 hrs., 15'",1,359
4,1987,100,1,Senate,,,1.0,1.0,,18660,...,72,62,2.0,96,36,,28,"1,214 hrs., 52'",2,420


In [13]:
# Drop rows and columns that consist only of NaN data
gen_activity_tidy_df.dropna(axis = 0, how = 'all', inplace=True)
gen_activity_tidy_df.dropna(axis = 1, how = 'all', inplace=True)
gen_activity_tidy_df.reset_index(drop=True, inplace=True)
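
As a quick illustration of `how='all'`, the sketch below uses a hypothetical three-column frame (`toy`). Only the row and the column that are entirely NaN are removed; partially filled rows and columns survive.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 and column "B" contain nothing but NaN
toy = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0],
        "B": [np.nan, np.nan, np.nan],
        "C": [np.nan, np.nan, 6.0],
    }
)

trimmed = (
    toy.dropna(axis=0, how="all")   # drop rows that are entirely NaN
       .dropna(axis=1, how="all")   # drop columns that are entirely NaN
       .reset_index(drop=True)
)
print(trimmed)
```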

In [14]:
# View the number of rows and columns in the dataframe
gen_activity_tidy_df.shape

(120, 41)

***
# Variables
***

In [15]:
# Display variables (column headings) for the dataframe
gen_activity_tidy_df.columns.values

array(['Year', 'Congress', 'Session', 'Chamber', 'Bills not signed',
       'Bills through conference', 'Bills vetoed', 'Conference reports',
       'Congressional record...Extension of remarks',
       'Congressional record...Pages of proceedings', 'Days in session',
       'Measures introduced, total', 'Measures introduced, total...Bills',
       'Measures introduced, total...Concurrent resolutions',
       'Measures introduced, total...Joint resolutions',
       'Measures introduced, total...Simple resolutions',
       'Measures passed, total', 'Measures passed, total...House bills',
       'Measures passed, total...House concurrent resolutions',
       'Measures passed, total...House joint resolutions',
       'Measures passed, total...Senate bills',
       'Measures passed, total...Senate concurrent resolutions',
       'Measures passed, total...Senate joint resolutions',
       'Measures passed, total...Simple resolutions',
       'Measures pending on calendar', 'Measures reporte

***
**Variable Descriptions**  
-- Year, Congress, Session, Chamber: describes the timeframe and details for each row (observation)  
-- Bills not signed: the number of bills passed by Congress but not signed by the President (pocket vetoes)  
-- Bills through conference: the number of bills reconciled in conference and returned to both chambers for final approval  
-- Bills vetoed: the number of bills passed by Congress but vetoed by the President  
-- Conference reports: the number of reports issued by conference(s) regarding bill reconciliation  
-- Congressional record...Extension of remarks: the number of Congressional Record pages containing remarks from members of Congress  
-- Congressional record...Pages of proceedings: the number of Congressional Record pages containing proceedings of Congress  
-- Days in session: the number of calendar days spent in session    
-- Measures introduced, total: the number of measures introduced, categorized by measure type  
-- Measures passed, total: the number of measures passed, categorized by measure type  
-- Measures pending on calendar: the number of measures eligible for consideration by the full chamber  
-- Measures reported, total: the number of measures reported out of committee, categorized by measure type    
-- Private bills enacted into law: the number of laws enacted that benefit specific individuals  
-- Public bills enacted into law: the number of laws that apply generally  
-- Quorum calls: the number of votes taken to confirm enough members are present to conduct business  
-- Recorded votes: the number of roll call votes taken    
-- Special reports: the number of special reports produced      
-- Time in session: the number of hours and minutes spent in session    
-- Vetoes overridden: the number of Presidential vetoes overridden by Congress  
-- Yea-and-nay votes: the number of votes in which each member's yea or nay was individually recorded  


***
## Rename Variables for Clarity
***

In [16]:
# Remove the 'Congressional record' prefix from the Pages of proceedings and Extension of remarks labels
gen_activity_tidy_df.columns = gen_activity_tidy_df.columns.str.removeprefix('Congressional record...')

In [17]:
# Clean up the labels where we prepended the headings
# regex=False treats the dots as literal characters rather than regex wildcards
gen_activity_tidy_df.columns = gen_activity_tidy_df.columns.str.replace(' total...', ' ', regex=False)
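
The replacement pattern contains dots, which act as wildcards whenever the pattern is interpreted as a regular expression. The sketch below uses two hypothetical labels (the second does not occur in the dataset) to show how the regex and literal interpretations can differ.

```python
import pandas as pd

# Hypothetical labels; the second one is invented to expose the wildcard behavior
labels = pd.Index(["Measures passed, total...House bills", "Votes, total (all)"])

# As a regex, each '.' matches any character, so ' total (a' is also replaced
as_regex = labels.str.replace(" total...", " ", regex=True).tolist()

# As a literal string, only the exact text ' total...' is replaced
as_literal = labels.str.replace(" total...", " ", regex=False).tolist()

print(as_regex)
print(as_literal)
```

Both interpretations give the same result for the labels in this dataset; the literal form simply avoids accidental matches.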

***
## Update Datatypes and Formats
***

In [25]:
# Review current datatypes
gen_activity_tidy_df.dtypes

Year                                                         int32
Congress                                                     int32
Session                                                      int32
Chamber                                             string[python]
Bills not signed                                             Int64
Bills through conference                                     Int64
Bills vetoed                                                 Int64
Conference reports                                           Int64
Extension of remarks                                         Int64
Pages of proceedings                                         Int64
Days in session                                              Int64
Measures introduced, total                                   Int64
Measures introduced, Bills                                   Int64
Measures introduced, Concurrent resolutions                  Int64
Measures introduced, Joint resolutions                       I

In [19]:
# Let pandas infer the best datatypes
gen_activity_tidy_df = gen_activity_tidy_df.convert_dtypes()
gen_activity_tidy_df.dtypes

Year                                                string[python]
Congress                                            string[python]
Session                                             string[python]
Chamber                                             string[python]
Bills not signed                                             Int64
Bills through conference                                     Int64
Bills vetoed                                                 Int64
Conference reports                                           Int64
Extension of remarks                                         Int64
Pages of proceedings                                         Int64
Days in session                                              Int64
Measures introduced, total                                   Int64
Measures introduced, Bills                                   Int64
Measures introduced, Concurrent resolutions                  Int64
Measures introduced, Joint resolutions                       I

In [20]:
# Convert Year, Congress, and Session to int
int_cols = ['Year', 'Congress', 'Session']
gen_activity_tidy_df[int_cols] = gen_activity_tidy_df[int_cols].astype('int')
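
The two conversion steps can be sketched on a hypothetical two-column frame (`toy`). `convert_dtypes` moves whole-number floats with missing values to a nullable integer type but leaves digit strings as strings, which is why an explicit cast is still needed for Year, Congress, and Session.

```python
import pandas as pd

# Hypothetical frame: Year holds digit strings, Bills vetoed holds floats with a gap
toy = pd.DataFrame({"Year": ["1983", "1984"], "Bills vetoed": [3.0, None]})

converted = toy.convert_dtypes()
print(converted.dtypes)

# Digit strings are not parsed automatically, so cast them explicitly
converted["Year"] = converted["Year"].astype(int)
print(converted.dtypes)
```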

In [21]:
# Review Time in Session data
gen_activity_tidy_df['Time in session'].unique()

<StringArray>
["1,010 hrs., 47'",   "940 hrs., 28'", "1,252 hrs., 31'", "1,278 hrs., 15'",
 "1,214 hrs., 52'", "1,126 hrs., 52'", "1,003 hrs., 19'", "1,250 hrs., 14'",
 "1,200 hrs., 44'",  "1091 hrs., 09'", "1,269 hrs., 42'", "1,246 hrs., 33'",
 "1,839 hrs., 10'", "1,036 hrs., 45'", "1,093 hrs., 07'",  "1,095 hrs., 5'",
  "1,183 hrs., 0'", "1,017 hrs., 51'", "1,236 hrs., 15'", "1,043 hrs., 23'",
 "1,454 hrs., 05'", "1,031 hrs., 31'", "1,222 hrs., 26'", "1,027 hrs., 48'",
 "1,375 hrs., 54'",   "988 hrs., 30'", "1,420 hrs., 39'", "1,074 hrs., 40'",
 "1,101 hrs., 44'",   "930 hrs., 12'", "1,095 hrs., 12'",   "908 hrs., 15'",
 "1,073 hrs., 39'",   "780 hrs., 58'", "1,166 hrs., 34'", "1,015 hrs., 29'",
   "947 hrs., 46'",   "963 hrs., 52'", "1,038 hrs., 11'",   "958 hrs., 32'",
   "851 hrs., 45'",   "852 hrs., 59'",   "965 hrs., 16'",   "829 hrs., 11'",
   "909 hrs., 57'",   "749 hrs., 01'",   "748 hrs., 54'",   "939 hrs., 17'",
   "938 hrs., 34'",   "856 hrs., 58'",   "981 hrs., 55'",   "9

In [22]:
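# Simplify Time in session by dropping the minutes and converting the hours to a nullable integer
def conv_time(sess_time):
    """Return the whole hours from a value such as "1,010 hrs., 47'" as a float (NaN if missing)."""
    if pd.isna(sess_time):
        return np.nan
    # The hours are the first space-delimited token; strip the thousands separator
    hrs = sess_time.split(' ')[0]
    return float(hrs.replace(',', ''))

gen_activity_tidy_df['Time in session'] = gen_activity_tidy_df['Time in session'].apply(conv_time).astype('Int64')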
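
If the minutes were ever needed, one alternative (not the approach used in this notebook) would be to capture both parts with `str.extract`. This is a sketch under the assumption that every value follows the `"<hours> hrs., <minutes>'"` pattern seen above; the variable names are illustrative.

```python
import pandas as pd

# Sample values copied from the Time in session column, plus one missing entry
times = pd.Series(["1,010 hrs., 47'", "940 hrs., 28'", "1091 hrs., 09'", None])

# Capture the hours (digits and commas) and the minutes (digits before the apostrophe)
parts = times.str.extract(r"^([\d,]+) hrs\., (\d+)'$")

hours = pd.to_numeric(parts[0].str.replace(",", "", regex=False))
minutes = pd.to_numeric(parts[1])

whole_hours = hours.astype("Int64")     # same result as dropping the minutes
decimal_hours = hours + minutes / 60    # keeps the minutes as a fraction of an hour
print(whole_hours)
print(decimal_hours)
```

Values that do not match the pattern come back as missing, which makes irregular entries easy to spot.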

***
# Preview Final Dataset
***

In [23]:
gen_activity_tidy_df.head()

Unnamed: 0,Year,Congress,Session,Chamber,Bills not signed,Bills through conference,Bills vetoed,Conference reports,Extension of remarks,Pages of proceedings,...,"Measures reported, Senate joint resolutions","Measures reported, Simple resolutions",Private bills enacted into law,Public bills enacted into law,Quorum calls,Recorded votes,Special reports,Time in session,Vetoes overridden,Yea-and-nay votes
0,1983,98,1,Senate,,4.0,3.0,4.0,,17224,...,87,139,,101,18,,25,1010,1,381
1,1984,98,2,Senate,,22.0,8.0,,,14650,...,99,122,17.0,166,19,,11,940,1,292
2,1985,99,1,Senate,,8.0,,2.0,,18418,...,118,100,,110,20,,18,1252,1,381
3,1986,99,2,Senate,,,4.0,,,17426,...,111,63,7.0,187,16,,15,1278,1,359
4,1987,100,1,Senate,,,1.0,1.0,,18660,...,72,62,2.0,96,36,,28,1214,2,420


***
# Write to Excel
***

In [24]:
gen_activity_tidy_df.to_excel('../Data/General Activity Data - Scrubbed.xlsx', index=False)

***
**End**
***