# Data Prep
<font size=4 color='blue'>Project: Congressional Data Scrape and Validation</font>
***

**Project Summary:**  
The Resume of Congressional Activity has been published annually since 1947. PDF versions of this document are available for download from several US government websites, including <a href="https://senate.gov">senate.gov</a>. The primary goal of this project is to scrape the data from these documents and create a dataset that can be used for analysis.


**Notebook Scope:**  
This notebook reads formatted data from Excel and prepares the data for validation. 

**Output:**  
Scrubbed data is saved to an Excel file for further validation and analysis.

***
# Notebook Setup
***

In [1]:
# Import libraries
import os
import pandas as pd
import re

In [2]:
%%html
<!-- Prevent text wrapping in dataframe displays for a cleaner print -->
<style> .dataframe td {white-space: nowrap;}</style>

***
# Read General Data
***

In [3]:
# Create a list of files to read
path = '../Data/ResumesScrubbed/LegislativeActivity/'
files = os.listdir(path)
print(files)

['98_1.xlsm', '98_2.xlsm', '99_1.xlsm', '99_2.xlsm']
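
`os.listdir` returns every entry in the folder in arbitrary order, so a stray temporary or non-Excel file would be passed to `pd.read_excel`. The sketch below is one possible way to filter and sort the listing. The helper name `list_workbooks` and the accepted suffixes are assumptions for illustration and are not part of the original notebook.

```python
from pathlib import Path
import tempfile

def list_workbooks(folder, suffixes=(".xlsm", ".xlsx")):
    """Return Excel workbooks in `folder`, sorted by name (hypothetical helper)."""
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        # Skip sub-folders, other file types, and Excel lock files ("~$...")
        if p.is_file() and p.suffix.lower() in suffixes and not p.name.startswith("~$")
    )

# Demonstrate on a throwaway folder so the example does not depend on project data
with tempfile.TemporaryDirectory() as tmp:
    for name in ["99_1.xlsm", "98_2.xlsm", "~$98_2.xlsm", "notes.txt"]:
        (Path(tmp) / name).touch()
    names = [p.name for p in list_workbooks(tmp)]

print(names)
```

The sort is alphabetical, so a file such as `100_1.xlsm` would be listed before `98_1.xlsm`. If chronological order matters, the names would need to be parsed into numbers first.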


In [4]:
# Create a list to contain the contents of each file
leg_act_list = []

In [5]:
# Read each file and append its contents to the list of legislative activity dataframes
for file_name in files:
    file_cont_df = pd.read_excel(os.path.join(path, file_name), index_col=0)
    # Strip stray whitespace from the row labels so they match the standard headings
    file_cont_df.index = file_cont_df.index.str.strip()
    leg_act_list.append(file_cont_df)

In [6]:
# Preview file contents
leg_act_list[0].head()

Unnamed: 0,Senate,House,Total
Congress,98,98,98
Session,1,1,1
Start Date,1983-01-03 00:00:00,1983-01-03 00:00:00,1983-01-03 00:00:00
End Date,1983-11-18 00:00:00,1983-11-18 00:00:00,1983-11-18 00:00:00
Days in session,150,146,


***
## Validate Variable Names
***

In [7]:
# Define standard variables (column headings). This list will be used to validate the dataframes before merging
std_headings = ['Congress', 'Session', 'Start Date', 'End Date', 'Days in session', 'Time in session', 'Pages of proceedings',
                'Extensions of remarks', 'Public bills enacted into law', 'Private bills enacted into law', 'Bills in conference',
                'Bills through conference', 'Measures passed, total', 'Measures passed, Senate bills', 'Measures passed, House bills',
                'Measures passed, Senate joint resolutions', 'Measures passed, House joint resolutions', 
                'Measures passed, Senate concurrent resolutions', 'Measures passed, House concurrent resolutions', 
                'Measures passed, Simple resolutions', 'Measures reported, total', 'Measures reported, Senate bills',
                'Measures reported, House bills', 'Measures reported, Senate joint resolutions', 'Measures reported, House joint resolutions',
                'Measures reported, Senate concurrent resolutions', 'Measures reported, House concurrent resolutions',
                'Measures reported, Simple resolutions', 'Special reports', 'Conference reports',
                'Measures pending on calendar', 'Measures introduced, total', 'Measures introduced, Bills', 
                'Measures introduced, Joint resolutions', 'Measures introduced, Concurrent resolutions', 
                'Measures introduced, Simple resolutions', 'Quorum calls', 'Yea-and-nay votes', 'Recorded votes', 'Bills vetoed', 
                'Vetoes overridden']

In [8]:
# Review labels from each file and flag any that do not exist in the std_headings list
for df in leg_act_list:
    for label in df.index:
        if label not in std_headings:
            # Double-quoted f-string so the single-quoted keys also parse on Python versions before 3.12
            print(f"{df.at['Congress', 'Senate']}, {df.at['Session', 'Senate']}: {label}")

***
<font color='blue'>**Note:**</font>  
Any typos flagged by this check were corrected directly in the Excel files.
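
The label check can also collect its findings instead of printing them, and suggest the closest standard heading for each unexpected label. The sketch below runs on a toy dataframe with a deliberate typo. The function name `find_unexpected_labels` and the toy data are assumptions for illustration; `difflib.get_close_matches` comes from the Python standard library.

```python
import difflib
import pandas as pd

def find_unexpected_labels(frames, expected):
    """Return (congress, session, label, suggestion) for labels not in `expected`."""
    problems = []
    for df in frames:
        congress = df.at["Congress", "Senate"]
        session = df.at["Session", "Senate"]
        for label in df.index:
            if label not in expected:
                # Closest standard heading, or None when nothing is similar enough
                close = difflib.get_close_matches(label, expected, n=1)
                problems.append((congress, session, label, close[0] if close else None))
    return problems

# Toy frame standing in for one workbook (not real project data)
toy = pd.DataFrame(
    {"Senate": [98, 1, 150], "House": [98, 1, 146]},
    index=["Congress", "Session", "Days in sesion"],  # deliberate typo
)
unexpected = find_unexpected_labels([toy], ["Congress", "Session", "Days in session"])
print(unexpected)
```

An empty result means every row label matches a standard heading and the files are ready to merge.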

***
## Merge Data
***

In [9]:
# Create dataframe for merged data
gen_activity_df = pd.DataFrame()

In [10]:
# Transpose the dataframe read from each file and add to the general activity dataframe
for df in leg_act_list:
    df = df.transpose().reset_index()
    gen_activity_df = pd.concat([gen_activity_df, df], ignore_index=True)
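
Calling `pd.concat` inside the loop copies the accumulated dataframe on every pass. An alternative, sketched below on toy data, is to transpose all of the frames first and concatenate once. The helper name `merge_transposed` and the two small frames are assumptions for illustration.

```python
import pandas as pd

def merge_transposed(frames):
    """Transpose each frame so row labels become columns, then concatenate once."""
    transposed = [df.transpose().reset_index() for df in frames]
    return pd.concat(transposed, ignore_index=True)

# Two toy frames shaped like the workbook contents (labels as rows, chambers as columns)
session_1 = pd.DataFrame({"Senate": [98, 1], "House": [98, 1]}, index=["Congress", "Session"])
session_2 = pd.DataFrame({"Senate": [98, 2], "House": [98, 2]}, index=["Congress", "Session"])

merged = merge_transposed([session_1, session_2])
print(merged)
```

The result has one row per chamber per session, with the chamber name in the `index` column, which matches the layout of the merged dataframe previewed in this notebook.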

In [11]:
# Preview dataframe
gen_activity_df.head()

Unnamed: 0,index,Congress,Session,Start Date,End Date,Days in session,Time in session,Pages of proceedings,Extensions of remarks,Public bills enacted into law,...,"Measures introduced, total","Measures introduced, Bills","Measures introduced, Joint resolutions","Measures introduced, Concurrent resolutions","Measures introduced, Simple resolutions",Quorum calls,Yea-and-nay votes,Recorded votes,Bills vetoed,Vetoes overridden
0,Senate,98,1,1983-01-03 00:00:00,1983-11-18 00:00:00,150.0,"1,010 hrs, 47'",17224,,101,...,2795,2198.0,209.0,86.0,302.0,18.0,381.0,,3,
1,House,98,1,1983-01-03 00:00:00,1983-11-18 00:00:00,146.0,"851 hrs., 45'",10665,,114,...,5642,4580.0,440.0,237.0,385.0,35.0,297.0,201.0,4,1.0
2,Total,98,1,1983-01-03 00:00:00,1983-11-18 00:00:00,,,27889,5985.0,215,...,8437,,,,,,,,7,
3,Senate,98,2,1984-01-23 00:00:00,1984-10-12 00:00:00,131.0,"940 hrs., 28'",14650,,166,...,1302,897.0,150.0,69.0,186.0,19.0,292.0,,8,
4,House,98,2,1984-01-23 00:00:00,1984-10-12 00:00:00,120.0,"852 hrs., 59'",1229,,242,...,2462,1862.0,223.0,142.0,235.0,55.0,227.0,181.0,9,1.0


***
## Validate Datatypes
***

In [12]:
# Review current datatypes
gen_activity_df.dtypes

index                                               object
Congress                                            object
Session                                             object
Start Date                                          object
End Date                                            object
Days in session                                     object
Time in session                                     object
Pages of proceedings                                object
Extensions of remarks                               object
Public bills enacted into law                       object
Private bills enacted into law                      object
Bills in conference                                 object
Bills through conference                            object
Measures passed, total                              object
Measures passed, Senate bills                       object
Measures passed, House bills                        object
Measures passed, Senate joint resolutions           obje

In [13]:
# Infer datatypes
gen_activity_df = gen_activity_df.infer_objects()

In [15]:
# Review updated datatypes
gen_activity_df.dtypes

index                                                       object
Congress                                                     int64
Session                                                      int64
Start Date                                          datetime64[ns]
End Date                                            datetime64[ns]
Days in session                                            float64
Time in session                                             object
Pages of proceedings                                        object
Extensions of remarks                                      float64
Public bills enacted into law                                int64
Private bills enacted into law                              object
Bills in conference                                        float64
Bills through conference                                   float64
Measures passed, total                                      object
Measures passed, Senate bills                              flo
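
Several columns, including `Time in session` and `Pages of proceedings`, remain `object` after `infer_objects`. The sketch below shows one possible way to convert them. It assumes that time values follow the pattern seen in the preview (for example `1,010 hrs, 47'`) and that numeric columns may hold comma-formatted strings. The second point is an assumption, since the actual cause is not visible in the output. Both helper names are hypothetical.

```python
import re
import pandas as pd

# Matches strings such as "1,010 hrs, 47'" and "851 hrs., 45'"
TIME_PATTERN = re.compile(r"(\d[\d,]*)\s*hrs\.?,?\s*(\d+)'")

def time_to_minutes(value):
    """Convert an hours-and-minutes string to total minutes; None if it does not parse."""
    if not isinstance(value, str):
        return None
    match = TIME_PATTERN.search(value)
    if match is None:
        return None
    hours = int(match.group(1).replace(",", ""))
    return hours * 60 + int(match.group(2))

def to_number(series):
    """Remove thousands separators from strings, then coerce to numbers (NaN on failure)."""
    cleaned = series.map(lambda v: v.replace(",", "").strip() if isinstance(v, str) else v)
    return pd.to_numeric(cleaned, errors="coerce")

# Toy values modelled on the preview; "n/a" stands in for an unparseable entry
times = pd.Series(["1,010 hrs, 47'", "851 hrs., 45'", None], dtype=object)
pages = pd.Series(["17,224", 10665, "n/a", None], dtype=object)

minutes = times.map(time_to_minutes)
page_counts = to_number(pages)
print(minutes.tolist())
print(page_counts.tolist())
```

Because `errors="coerce"` silently turns unparseable entries into `NaN`, it is worth comparing the count of missing values before and after conversion to catch entries that need manual correction.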

***
## Write to Excel
***

In [18]:
# Save the merged data for further validation and analysis
gen_activity_df.to_excel('../Data/GeneralLegislativeData.xlsx', index=False)

***
# Read Confirmation Data
***

In [3]:
# Create a list of files to read
path = '../Data/ResumesScrubbed/LegislativeActivity/'
files = os.listdir(path)
print(files)

['98_1.xlsm', '98_2.xlsm', '99_1.xlsm', '99_2.xlsm']


In [4]:
# Create a list to contain the contents of each file
leg_act_list = []

In [5]:
# Read each file and append its contents to the list of legislative activity dataframes
for file_name in files:
    file_cont_df = pd.read_excel(os.path.join(path, file_name), index_col=0)
    # Strip stray whitespace from the row labels so they match the standard headings
    file_cont_df.index = file_cont_df.index.str.strip()
    leg_act_list.append(file_cont_df)

In [6]:
# Preview file contents
leg_act_list[0].head()

Unnamed: 0,Senate,House,Total
Congress,98,98,98
Session,1,1,1
Start Date,1983-01-03 00:00:00,1983-01-03 00:00:00,1983-01-03 00:00:00
End Date,1983-11-18 00:00:00,1983-11-18 00:00:00,1983-11-18 00:00:00
Days in session,150,146,


***
**End**
***