# Congressional Activity
<font size=4 color='blue'>Understand and Prep Data - Confirmations</font>
***

**Project Summary:**  
The Resume of Congressional Activity has been published since 1947. It includes statistics on the number of measures introduced, bills passed, the outcome of confirmations, etc.  
This project analyzes activity trends and factors that affect the productivity of Congress.  

**Notebook Scope:**  
This notebook includes code to load and preview raw Confirmation data from an Excel spreadsheet. This input file was compiled by copying and pasting content from the annual Resume in PDF format. Minimal cleanup and formatting were completed manually to support the data prep covered in this notebook.  

**Output:**  
An Excel file containing scrubbed Confirmation data is generated.  
***

***
# Notebook Setup
***

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re

In [2]:
# Set display options
pd.options.display.multi_sparse = False

***  
# Read Data
***
Given the complexity of scraping data from PDF, each resume's contents were manually copied and pasted into Microsoft Excel. Minimal formatting was completed for consistency. Individual resumes can be found on <a href="https://www.senate.gov/legislative/ResumesofCongressionalActivity1947present.htm">Senate.gov</a> 

<font color='red'>Note:</font> Not all of the resumes on the Senate.gov page referenced above are final. Where the resume was not final, additional .gov sites were searched to find the latest version. A copy of the PDFs used is available on <a href="https://github.com/tamimcm416/congressional_activity">GitHub</a>.


In [7]:
# Read in all worksheets
file_name = '../Data/Resume Data - Raw.xlsx'
raw_data_dict = pd.read_excel(file_name, sheet_name=None, header=None, skiprows=3, usecols='F:G')

In [8]:
# View the naemes of each tab loaded from the Excel document
print(raw_data_dict.keys())

dict_keys(['1983 - 98.1', '1984 - 98.2', '1985 - 99.1', '1986 - 99.2', '1987 - 100.1', '1988 - 100.2', '1989 - 101.1', '1990 - 101.2', '1991 - 102.1', '1992 - 102.2', '1993 - 103.1', '1994 - 103.2', '1995 - 104.1', '1996 - 104.2', '1997 - 105.1', '1998 - 105.2', '1999 - 106.1', '2000 - 106.2', '2001 - 107.1', '2002 - 107.2', '2003 - 108.1', '2004 - 108.2', '2005 - 109.1', '2006 - 109.2', '2007 - 110.1', '2008 - 110.2', '2009 - 111.1', '2010 - 111.2', '2011 - 112.1', '2012 - 112.2', '2013 - 113.1', '2014 - 113.2', '2015 - 114.1', '2016 - 114.2', '2017 - 115.1', '2018 - 115.2', '2019 - 116.1', '2020 - 116.2', '2021 - 117.1', '2022 - 117.2'])


In [9]:
# View the contents of the first worksheet
pd.DataFrame(raw_data_dict['1983 - 98.1'])

Unnamed: 0,5,6
0,,
1,"Army nominations, totaling 14,784, disposed of...",
2,...Confirmed,14782.0
3,...Failed at August-September adjournment,1.0
4,...Failed at November 18 sine die adjournment,1.0
5,,
6,"Navy nominations, totaling 21,994, disposed of...",
7,...Confirmed,21994.0
8,,
9,"Air Force nominations, totaling 12,819, dispos...",


***
# Create Confirmation Dataframe
***
Consolidate confirmation data into a single dataframe.

In [6]:
# Create an empty dataframe to hold the final data
confirm_df = pd.DataFrame(columns = ['Session', 'Label', 'Value'])

In [15]:
# Loop through the data to pull labels and values for confirmations
for key in raw_data_dict.keys():
    for i, row in raw_data_dict[key].iterrows():
        if pd.isna(row[5]) == False:
            new_label = row[5].capitalize()
            if '...' not in new_label:
                section = new_label.replace(':', '')
            else:
                new_label = section.split(' nominations')[0] + '...' + new_label[3:].capitalize()
            confirm_df.loc[len(confirm_df)] = [key, new_label, row[6]]

In [16]:
# Let's preview the confirmation dataframe
with pd.option_context('display.max_colwidth', 400):
    display(confirm_df.head())

Unnamed: 0,Session,Label,Value
0,1983 - 98.1,"Army nominations, totaling 14,784, disposed of as follows:",
1,1983 - 98.1,Army...Confirmed,14782.0
2,1983 - 98.1,Army...Failed at august-september adjournment,1.0
3,1983 - 98.1,Army...Failed at november 18 sine die adjournment,1.0
4,1983 - 98.1,"Navy nominations, totaling 21,994, disposed of as follows:",


***
# Tidy Confirmation Dataframe
***
The current dataframe is not tidy. This can be addressed by moving variables to columns and observations to rows. For this dataset, an observation will be defined as a combination of Year, Congress and Session.

In [17]:
# Start by identifying variable names that contain values
labels_with_digits = []
for row in confirm_df['Label']:
    if any(char.isdigit() for char in row):
        labels_with_digits.append(row)
        continue
print(f'There are {len(labels_with_digits)} labels that contain values.')

There are 488 labels that contain values.


In [18]:
# Preview the list of labels that contain values:
labels_with_digits[0:10]

['Army nominations, totaling 14,784, disposed of as follows:',
 'Army...Failed at november 18 sine die adjournment',
 'Navy nominations, totaling 21,994, disposed of as follows:',
 'Air force nominations, totaling 12,819, disposed of as follows:',
 'Marine corps nominations, totaling 2,990, disposed of as follows:',
 'Civilian nominations, totaling 3,454, disposed of as follows:',
 'Civilian...Failed at november 18 sine die adjournment',
 'Summary...Failed at november 18 sine die adjournment',
 'Army nominations, totaling 14,031, disposed of as follows:',
 'Navy nominations, totaling 8,855, disposed of as follows:']

***  
<font color='red'>Note:</font> There are a number of labeling formats used in the confirmations section of the resumes, which creates complexity for splitting values from labels. For brevity, the iterative process of finding all formats is excluded, and just the code to split the labels and values follows.
***  

In [19]:
# Add a new row for each label with carryover values
carryovers_df = pd.DataFrame(columns=['Session', 'Label', 'Value'])

for i, row in confirm_df.iterrows():
    if 'carried' in row['Label'] and '...' not in row['Label']:
        pattern = re.compile(r'(?P<branch>.*) nominations.* \(and (?P<carryover>\d*,?\d*) nominations.*')
        result = pattern.search(row['Label'])
        if result == None:
            pattern = re.compile(r'(?P<branch>.*) nominations.* \(including (?P<carryover>\d*,?\d*)')
            result = pattern.search(row['Label'])
            if result == None:
                pass
            else:
                carryovers_df.loc[len(carryovers_df)] = row['Session'], result.group(1) + ', nominations, carryover', result.group(2)
        else:
            carryovers_df.loc[len(carryovers_df)] = row['Session'], result.group(1) + ', nominations, carryover', result.group(2)

confirm_df = pd.concat([confirm_df, carryovers_df])

In [20]:
# Separate the total nominations value from labels that contain them. While doing this, we need to standardize how carryovers are recorded.
# In some years, the carryover nominations are included in the total nominations, and in some years, the carryover value is in addition
# to the total nominations. For our analysis, we will always include the carryover nominations in the total.
def process_total_noms(row):
    if 'totaling' in row['Label'] and '...' not in row['Label']:
        pattern = re.compile(r'(?P<branch>.*) nominations.* totaling (?P<value>\d*,?\d*).*')
        result = pattern.search(row['Label'])
        if result != None:
            label = result.group(1) + ', nominations'
            value = int(result.group(2).replace(',', ''))
            if 'including' in row['Label']:
                pattern = re.compile(r'.* \(including (?P<carryover>\d*,?\d*).*')
                result = pattern.search(row['Label'])
                if result != None:
                    value -= int(result.group(1).replace(',', ''))
        return label, value
    else:
        return row['Label'], row['Value']

confirm_df[['Label', 'Value']] = confirm_df.apply(process_total_noms, axis=1, result_type='expand')
confirm_df.head()

Unnamed: 0,Session,Label,Value
0,1983 - 98.1,"Army, nominations",14784.0
1,1983 - 98.1,Army...Confirmed,14782.0
2,1983 - 98.1,Army...Failed at august-september adjournment,1.0
3,1983 - 98.1,Army...Failed at november 18 sine die adjournment,1.0
4,1983 - 98.1,"Navy, nominations",21994.0


In [22]:
confirm_df.dropna(inplace=True)
confirm_df['Value'] = confirm_df['Value'].replace(',', '', regex=True)
confirm_df['Value'] = confirm_df['Value'].astype(int)
confirm_df.head()

Unnamed: 0,Session,Label,Value
0,1983 - 98.1,"Army, nominations",14784
1,1983 - 98.1,Army...Confirmed,14782
2,1983 - 98.1,Army...Failed at august-september adjournment,1
3,1983 - 98.1,Army...Failed at november 18 sine die adjournment,1
4,1983 - 98.1,"Navy, nominations",21994


In [23]:
#Check for duplicate labels by session before transposing
counts = confirm_df[['Session', 'Label']].value_counts()
counts_df = pd.DataFrame(counts, columns=['Count']).reset_index()
counts_df[counts_df['Count'] > 1].head()

Unnamed: 0,index,Count


In [24]:
# Group duplicate labels to prevent errors when pivoting dataframe
confirm_df = confirm_df.groupby(['Session', 'Label']).sum()
confirm_df.reset_index(inplace=True)

In [25]:
confirm_df.head()

Unnamed: 0,Session,Label,Value
0,1983 - 98.1,"Air force, nominations",25638
1,1983 - 98.1,Air force...Confirmed,25584
2,1983 - 98.1,Air force...Failed at august-september adjourn...,2
3,1983 - 98.1,Air force...Unconfirmed,52
4,1983 - 98.1,"Army, nominations",29568


In [26]:
# Pivot dataframe so that each column represents a variable and each row represents an observation
confirm_tidy_df = confirm_df.pivot(index='Session', columns='Label', values='Value')
confirm_tidy_df.fillna(0, inplace=True)
confirm_tidy_df = confirm_tidy_df.astype(int)

In [27]:
# Split the Session value into Year, Congress and Session
confirm_tidy_df.reset_index(inplace=True)
confirm_tidy_df[['Year', 'Congress', 'Session']] = confirm_tidy_df['Session'].str.split(' - |[.]', expand=True).astype(int)

In [28]:
# Move the Session, Year, and Congress columns to the left
left_cols = ['Session', 'Congress', 'Year']
for col in left_cols:
    col_data = confirm_tidy_df.pop(col)
    confirm_tidy_df.insert(0, col, col_data)

***
# Preview Confirmation Data
***

In [29]:
confirm_tidy_df.head()

Label,Year,Congress,Session,"Air force, nominations","Air force, nominations, carryover",Air force...Confirmed,Air force...Failed at august-september adjournment,Air force...Returned,Air force...Returned to white house,Air force...Unconfirmed,...,Summary...Total nominations received this session,Summary...Total recess reappointments,Summary...Total rejected,Summary...Total returned,Summary...Total returned at sine die adjournment,Summary...Total returned to the white house,Summary...Total returned to white house,Summary...Total superseded by recess reappointments,Summary...Total unconfirmed,Summary...Total withdrawn
0,1983,98,1,25638,0,25584,2,0,0,52,...,0,0,0,0,0,0,0,0,52,4
1,1984,98,2,23636,52,23688,0,0,0,0,...,0,34,0,0,0,0,0,0,214,4
2,1985,99,1,42734,0,38026,0,0,0,4708,...,0,0,0,0,0,0,0,12,7354,16
3,1986,99,2,24492,4708,29200,0,0,0,0,...,0,0,0,0,0,0,0,0,140,16
4,1987,100,1,37334,0,31422,0,2,0,5910,...,0,0,2,40,0,0,0,0,10988,20


***
# Variables
***

In [31]:
# Display variables (column headings) for the dataframe
confirm_tidy_df.columns.values

array(['Year', 'Congress', 'Session', 'Air force, nominations',
       'Air force, nominations, carryover', 'Air force...Confirmed',
       'Air force...Failed at august-september adjournment',
       'Air force...Returned', 'Air force...Returned to white house',
       'Air force...Unconfirmed', 'Air force...Withdrawn',
       'Army, nominations', 'Army, nominations, carryover',
       'Army...Confirmed',
       'Army...Failed at august-september adjournment',
       'Army...Failed at november 18 sine die adjournment',
       'Army...Failed at sine die adjournment', 'Army...Returned',
       'Army...Returned to white house', 'Army...Unconfirmed',
       'Army...Withdrawn', 'Civilian, nominations',
       'Civilian, nominations, carryover', 'Civilian...Confirmed',
       'Civilian...Failed at adjournment',
       'Civilian...Failed at aug.-sept. adjournment',
       'Civilian...Failed at august-september adjournment',
       'Civilian...Failed at november 18 sine die adjournment',
    

***
**Variable Descriptions**  
Each observation (row) is described by Year, Congress, and Session.  
The remaining rows consist of the type of nominations followed by the disposition of the nomination.

***
## Rename Variables for Clarity
***

In [33]:
# Standardize outcome column names and merge duplicates
pattern = re.compile(r'(?i).*?(?P<outcome>(test|unconfirmed|confirmed|withdrawn|failed|recess reappointment|rejected|returned)).*')
org_col_names = confirm_tidy_df.columns.to_list()
for col in org_col_names:
    if '...' in col:
        result = pattern.search(col)
        if result != None:
            new_name = col.split('...')[0] + ', ' + result.group(1).lower()
            if 'Other' in new_name:
                new_name = new_name.lower().replace('other c', 'C')
            if new_name in confirm_tidy_df.columns.to_list():
                confirm_tidy_df[new_name] += confirm_tidy_df[col]
                confirm_tidy_df.drop(col, axis=1, inplace=True)
            else:
                confirm_tidy_df.rename(columns={col: new_name}, inplace=True)

In [34]:
confirm_tidy_df.columns

Index(['Year', 'Congress', 'Session', 'Air force, nominations',
       'Air force, nominations, carryover', 'Air force, confirmed',
       'Air force, failed', 'Air force, returned', 'Air force, unconfirmed',
       'Air force, withdrawn', 'Army, nominations',
       'Army, nominations, carryover', 'Army, confirmed', 'Army, failed',
       'Army, returned', 'Army, unconfirmed', 'Army, withdrawn',
       'Civilian, nominations', 'Civilian, nominations, carryover',
       'Civilian, confirmed', 'Civilian, failed',
       'Civilian, recess reappointment', 'Civilian, rejected',
       'Civilian, returned', 'Civilian, unconfirmed', 'Civilian, withdrawn',
       'Marine corps, nominations', 'Marine corps, nominations, carryover',
       'Marine corps, confirmed', 'Marine corps, returned',
       'Marine corps, unconfirmed', 'Marine corps, withdrawn',
       'Navy, nominations', 'Navy, nominations, carryover', 'Navy, confirmed',
       'Navy, returned', 'Navy, unconfirmed', 'Navy, withdrawn',

In [35]:
# Combine the Civilian and Other Civilian nominations columns
confirm_tidy_df['Civilian, nominations'] += confirm_tidy_df['Other civilian, nominations']
confirm_tidy_df['Civilian, nominations, carryover'] += confirm_tidy_df['Other civilian, nominations, carryover']
confirm_tidy_df.drop(['Other civilian, nominations', 'Other civilian, nominations, carryover'], axis=1, inplace=True)

In [36]:
# Cleanup Summary column labels and merge duplicates
nom_cols = ['Summary...Total nominations received', 'Summary...Total nominations received this session']
confirm_tidy_df['Total, nominations'] = confirm_tidy_df[nom_cols].sum(axis=1)
confirm_tidy_df.drop(nom_cols, axis=1, inplace=True)
carryover_cols = ['Summary...Nominations carried over from first session', 'Summary...Total nominations carried over from first session',
                  'Summary...Total nominations carried over from the first session', 'Summary...Total carried over from first session']
confirm_tidy_df['Total, nominations, carryover'] = confirm_tidy_df[carryover_cols].sum(axis=1)
confirm_tidy_df.drop(carryover_cols, axis=1, inplace=True)
confirm_tidy_df.columns = confirm_tidy_df.columns.str.replace('Summary', 'Total')

In [37]:
# Set Air Force, Space Force and Marine Corps labels to proper case
confirm_tidy_df.columns = confirm_tidy_df.columns.str.replace('Air force', 'Air Force')
confirm_tidy_df.columns = confirm_tidy_df.columns.str.replace('Marine corps', 'Marine Corps')
confirm_tidy_df.columns = confirm_tidy_df.columns.str.replace('Space force', 'Space Force')

In [38]:
confirm_tidy_df.columns

Index(['Year', 'Congress', 'Session', 'Air Force, nominations',
       'Air Force, nominations, carryover', 'Air Force, confirmed',
       'Air Force, failed', 'Air Force, returned', 'Air Force, unconfirmed',
       'Air Force, withdrawn', 'Army, nominations',
       'Army, nominations, carryover', 'Army, confirmed', 'Army, failed',
       'Army, returned', 'Army, unconfirmed', 'Army, withdrawn',
       'Civilian, nominations', 'Civilian, nominations, carryover',
       'Civilian, confirmed', 'Civilian, failed',
       'Civilian, recess reappointment', 'Civilian, rejected',
       'Civilian, returned', 'Civilian, unconfirmed', 'Civilian, withdrawn',
       'Marine Corps, nominations', 'Marine Corps, nominations, carryover',
       'Marine Corps, confirmed', 'Marine Corps, returned',
       'Marine Corps, unconfirmed', 'Marine Corps, withdrawn',
       'Navy, nominations', 'Navy, nominations, carryover', 'Navy, confirmed',
       'Navy, returned', 'Navy, unconfirmed', 'Navy, withdrawn',

***
## Update Datatypes and Formats
***

In [40]:
# Review datatypes
confirm_tidy_df.dtypes

Label
Year                                    int32
Congress                                int32
Session                                 int32
Air Force, nominations                  int32
Air Force, nominations, carryover       int32
Air Force, confirmed                    int32
Air Force, failed                       int32
Air Force, returned                     int32
Air Force, unconfirmed                  int32
Air Force, withdrawn                    int32
Army, nominations                       int32
Army, nominations, carryover            int32
Army, confirmed                         int32
Army, failed                            int32
Army, returned                          int32
Army, unconfirmed                       int32
Army, withdrawn                         int32
Civilian, nominations                   int32
Civilian, nominations, carryover        int32
Civilian, confirmed                     int32
Civilian, failed                        int32
Civilian, recess reappointme

***
# Preview Final Dataset
***

In [42]:
confirm_tidy_df.head()

Label,Year,Congress,Session,"Air Force, nominations","Air Force, nominations, carryover","Air Force, confirmed","Air Force, failed","Air Force, returned","Air Force, unconfirmed","Air Force, withdrawn",...,"Space Force, withdrawn","Total, failed","Total, returned","Total, confirmed","Total, recess reappointment","Total, rejected","Total, unconfirmed","Total, withdrawn","Total, nominations","Total, nominations, carryover"
0,1983,98,1,25638,0,25584,2,0,52,0,...,0,954,0,111072,0,0,52,4,112082,0
1,1984,98,2,23636,52,23688,0,0,0,0,...,0,0,0,83452,34,0,214,4,83652,52
2,1985,99,1,42734,0,38026,0,0,4708,0,...,0,68,0,111836,12,0,7354,16,119286,0
3,1986,99,2,24492,4708,29200,0,0,0,0,...,0,0,0,79786,0,0,140,16,72588,7354
4,1987,100,1,37334,0,31422,0,2,5910,0,...,0,0,40,92808,0,2,10988,20,103858,0


***
# Write to Excel
***

In [44]:
confirm_tidy_df.to_excel('../Data/Confirmation Data - Scrubbed.xlsx', index=False)

***
**End**
***