#### This notebook examines Pennsylvania's campaign finance data from 2018 to 2023, although earlier years can also be loaded into the analysis. The dataset is relational: each year has five files (contributions, debt, expense, expenditures, and filer) linked through a unique filer ID.

In [7]:
import pandas as pd
import numpy as np
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
import sys
sys.path.append('/home/alankagiri/2023-fall-clinic-climate-cabinet')
from utils import PA_EDA_Functions as eda
from utils import PA_Data_Web_Scraper as scraper

In [2]:
# download the data
scraper.download_PA_data(2018,2023)

In [8]:
#initialize the datasets:
contrib_paths = [["../data/contrib_2018_03042019_2018.txt", 2018],
                 ["../data/contrib_2019.txt", 2019],
                 ["../data/contrib_2020_2020.txt",2020],
                 ["../data/contrib_2021_2021.txt",2021],
                 ["../data/contrib_2022_2022.txt",2022],
                 ["../data/2023/contrib_2023_2023.txt",2023]]

filer_paths = [["../data/filer_2018_03042019_2018.txt", 2018],
               ["../data/filer_2019.txt",2019],
               ["../data/filer_2020_2020.txt",2020],
               ["../data/filer_2021_2021.txt",2021],
               ["../data/filer_2022_2022.txt",2022],
               ["../data/2023/filer_2023_2023.txt",2023]]

expense_paths = [["../data/expense_2018_03042019_2018.txt",2018],
                 ["../data/expense_2019.txt", 2019],
                 ["../data/expense_2020_2020.txt", 2020],
                 ["../data/expense_2021_2021.txt",2021],
                 ["../data/expense_2022_2022.txt",2022],
                 ["../data/2023/expense_2023_2023.txt",2023]]

In [9]:
merged_datasets_per_year = []
merged_expense_dataset = []
for i in range(len(contrib_paths)):
    contrib_df = eda.initialize_PA_dataset(contrib_paths[i][0],contrib_paths[i][1])
    filer_df = eda.initialize_PA_dataset(filer_paths[i][0],filer_paths[i][1])
    expense_df = eda.initialize_PA_dataset(expense_paths[i][0],expense_paths[i][1])
    merged = eda.merge_same_year_datasets(contrib_df,filer_df)
    merged_datasets_per_year.append(merged)
    merged_expense_dataset.append(expense_df)

Skipping line 209819: expected 24 fields, saw 29

Skipping line 465906: expected 24 fields, saw 25

Skipping line 1334: expected 12 fields, saw 15
Skipping line 62726: expected 12 fields, saw 13

Skipping line 108099: expected 12 fields, saw 13

Skipping line 251552: expected 24 fields, saw 27

Skipping line 60173: expected 12 fields, saw 13

Skipping line 523329: expected 24 fields, saw 47
Skipping line 523330: expected 24 fields, saw 47

Skipping line 14404: expected 12 fields, saw 17

Skipping line 66486: expected 12 fields, saw 13
Skipping line 66487: expected 12 fields, saw 13
Skipping line 66488: expected 12 fields, saw 13
Skipping line 66489: expected 12 fields, saw 13
Skipping line 66490: expected 12 fields, saw 13
Skipping line 66491: expected 12 fields, saw 13
Skipping line 66492: expected 12 fields, saw 13

Skipping line 109048: expected 26 fields, saw 27

Skipping line 12499: expected 14 fields, saw 15
Skipping line 29777: expected 14 fields, saw 15
Skipping line 29778: exp

In [10]:
contrib_filer_info_2018_2023 = eda.merge_all_datasets(merged_datasets_per_year)
contrib_filer_info_2018_2023

Unnamed: 0,RECIPIENT_ID,YEAR,DONOR,TOTAL_CONT_AMT,DONOR_TYPE,RECIPIENT_TYPE,RECIPIENT,RECIPIENT_OFFICE,RECIPIENT_PARTY
0,2000081,2018,JOSEPH A RIBAS,25.00,INDIVIDUAL,Committee,FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE,,
1,2000081,2018,PAUL J KASHELLA,40.00,INDIVIDUAL,Committee,FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE,,
2,2000081,2018,VICKY C THIEL,25.00,INDIVIDUAL,Committee,FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE,,
3,2000081,2018,JOSEPH B HILDEBRANDT,20.00,INDIVIDUAL,Committee,FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE,,
4,2000081,2018,JACQUELINE A ESPINOZA,50.00,INDIVIDUAL,Committee,FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE,,
...,...,...,...,...,...,...,...,...,...
387459,393671,2023,ERIC J YARNELL,38.47,INDIVIDUAL,Committee,HIGHMARK PAC OF HIGHMARK INC.,,
387460,393671,2023,PATRICIA LAUGHLIN,116.00,INDIVIDUAL,Committee,HIGHMARK PAC OF HIGHMARK INC.,,
387461,393671,2023,MATTHEW J RHENISH,130.00,INDIVIDUAL,Committee,HIGHMARK PAC OF HIGHMARK INC.,,
387462,393671,2023,JAMES J BENEDICT,192.30,INDIVIDUAL,Committee,HIGHMARK PAC OF HIGHMARK INC.,,


##### 1.1 For each column, what are its contents? How many blanks or nulls are there? What is the format? If it is one of several types, what are those types?

In [6]:
# Build a per-column summary: dtype, null count, and null percentage.
# `col_types` is used instead of `type` to avoid shadowing the built-in.
cols, col_types, nulls, null_percent = [], [], [], []
for column in contrib_filer_info_2018_2023.columns:
    null_count = contrib_filer_info_2018_2023[column].isna().sum()
    cols.append(column)
    col_types.append(contrib_filer_info_2018_2023.dtypes[column])
    nulls.append(null_count)
    null_percent.append(round(null_count / len(contrib_filer_info_2018_2023) * 100, 2))

summary_df = pd.DataFrame({'columnName': cols, 'colType': col_types, 'numNulls': nulls, 'null_percent': null_percent})
summary_df


Unnamed: 0,columnName,colType,numNulls,null_percent
0,RECIPIENT_ID,object,0,0.0
1,YEAR,int64,0,0.0
2,DONOR,object,0,0.0
3,TOTAL_CONT_AMT,float64,0,0.0
4,DONOR_TYPE,object,0,0.0
5,RECIPIENT_TYPE,object,5573,0.09
6,RECIPIENT,object,0,0.0
7,RECIPIENT_OFFICE,object,6236824,95.83
8,RECIPIENT_PARTY,object,4289375,65.91


*Having reduced the contributor and filer datasets to the relevant columns, it is evident that, with the exception of the {CONT_DESCRIP, OFFICE, PARTY} columns, most values are reported and available. In the summary above, RECIPIENT_OFFICE is 95.83% null and RECIPIENT_PARTY is 65.91% null, while every other column is below 0.1%. As for data types, most columns are stored as objects (mainly strings), in part because of dirty or inconsistent data entry.*
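The per-column loop above can also be written without explicit list building. The following is a minimal sketch, assuming a pandas DataFrame as input; `summarize_columns` and the toy frame are hypothetical names introduced here for illustration, and the values are made up.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, and null percentage."""
    null_counts = df.isna().sum()
    return pd.DataFrame({
        "columnName": df.columns,
        "colType": df.dtypes.astype(str).to_numpy(),
        "numNulls": null_counts.to_numpy(),
        "null_percent": (null_counts / len(df) * 100).round(2).to_numpy(),
    })

# Toy frame standing in for contrib_filer_info_2018_2023 (made-up values).
toy = pd.DataFrame({
    "RECIPIENT": ["A", "B", "C", "D"],
    "RECIPIENT_OFFICE": [None, "STH", None, None],
})
summary = summarize_columns(toy)
print(summary)
```

Because the same helper works on any frame, it could in principle be reused for the expense data later in the notebook instead of repeating the loop.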

#####  2.1 Who are the top 10 contributors in your data? The top 10 recipients?

In [7]:
eda.top_n_contributors(contrib_filer_info_2018_2023,10)

Unnamed: 0_level_0,TOTAL_CONT_AMT
DONOR,Unnamed: 1_level_1
CHARLOTTE SWENSON,114202845.38
JEFFREY YASS,57205000.0
TOTAL OTHER CONTRIBUTIONS,39332908.48
COMMONWEALTH CHILDREN'S CHOICE FUND,34954611.22
STUDENTS FIRST PAC,31699924.71
COMMONWEALTH LEADERS FUND,27270100.22
STUDENT'S FIRST PAC,18500000.0
CONTRIBUTIONS FROM FEC REPORT,17850908.02
HOUSE DEMOCRATIC CAMPAIGN COMMITTEE,15784775.41
HOUSE REPUBLICAN CAMPAIGN COMMITTEE,13752286.61


In [7]:
eda.top_n_recipients(contrib_filer_info_2018_2023,10)

Unnamed: 0_level_0,TOTAL_CONT_AMT
FILER_NAME,Unnamed: 1_level_1
FRIENDS OF JENNIFER O'MARA,115522369.09
Shapiro for Pennsylvania,76829505.52
Students First PAC,58525000.0
COMMONWEALTH LEADERS FUND,39822241.33
COMMONWEALTH CHILDREN'S CHOICE FUND,36168500.0
PA Democratic Party,34933417.16
Pennsylvania House Democratic Campaign Committee,28272822.94
International Brotherhood of Electrical Workers Local 98 Committee on Political Education,26470647.64
HOUSE REPUBLICAN CAMPAIGN COMMITTEE,24083705.1
"AMERICAN FEDERATION OF TEACHERS, AFL-CIO COPE (AFT/COPE)",23304568.79


##### 3.1 Make a bar chart with plotly comparing contributions by donor type (PAC, individual, etc.) and one comparing recipients by the type of office they are running for

In [11]:
eda.compare_cont_by_donorType(contrib_filer_info_2018_2023)

Unnamed: 0,YEAR,RECIPIENT_TYPE,TOTAL_CONT_AMT
0,2018,Candidate,1568627.22
1,2018,Committee,315345576.08
2,2018,Lobbyist,21065.0
3,2019,Candidate,517345.52
4,2019,Committee,328678633.98
5,2019,Lobbyist,45923.0
6,2020,Candidate,590490.29
7,2020,Committee,354754859.47
8,2020,Lobbyist,18888.44
9,2021,Candidate,900134.08


*The dataset is organized from the perspective of the entity filing the finance report, which is a political committee, a lobbyist, or a candidate. This makes it somewhat difficult to classify the contributors (PAC, individual, corporation, and so on), since their names follow no consistent pattern. By dollar amount, however, the overwhelming majority of contributions went to committees, indicating that most entities donated to PACs or Super PACs.*

In [12]:
eda.plot_recipients_by_office(contrib_filer_info_2018_2023)

KeyError: 'OFFICE'

<span style="color:pink"> Not surprisingly, legislative races received the most contributions from 2018 to 2023, with a significant portion going to House races. This makes sense, since House elections are held more frequently than Senate elections. Although the PA campaign finance website offers an Office Code Table explaining what the abbreviated race codes represent (link attached at the end for reference), some abbreviations in the data do not match any entry in that table: {CPJA, CPJP, DSC, RSC, USC, USP, USS}. Reaching out to a PA election official yielded answers for {CPJA, CPJP, USC, USS, DSC, RSC}. The peculiar feature is that some of these codes apply to out-of-state (federal) races, namely races for the U.S. Senate and House as well as the presidency. The official explained these as filing errors made by the filing entities.
###### https://www.dos.pa.gov/VotingElections/CandidatesCommittees/CampaignFinance/Resources/Pages/Technical-Specifications.aspx </span>
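The `KeyError: 'OFFICE'` above suggests that the plotting helper looks up a column named `OFFICE`, while the merged frame displayed earlier names it `RECIPIENT_OFFICE`. The internals of `eda.plot_recipients_by_office` are not shown here, so the following is only a sketch of the aggregation such a chart needs, under the assumption that the column names match the displayed frame. `total_by_office` is a hypothetical helper, and the office codes and amounts in the toy data are made up.

```python
import pandas as pd

def total_by_office(df: pd.DataFrame, office_col: str = "RECIPIENT_OFFICE") -> pd.DataFrame:
    """Sum contributions per office code, skipping rows with no office."""
    if office_col not in df.columns:
        # Fail with a message that lists what is actually available.
        raise KeyError(f"{office_col!r} not found; available columns: {list(df.columns)}")
    return (
        df.dropna(subset=[office_col])
          .groupby(office_col, as_index=False)["TOTAL_CONT_AMT"]
          .sum()
          .sort_values("TOTAL_CONT_AMT", ascending=False)
          .reset_index(drop=True)
    )

# Toy data with illustrative office codes (not taken from the real dataset).
toy = pd.DataFrame({
    "RECIPIENT_OFFICE": ["STH", "STS", "STH", None],
    "TOTAL_CONT_AMT": [100.0, 250.0, 200.0, 75.0],
})
office_totals = total_by_office(toy)
print(office_totals)
```

The resulting frame has one row per office code, so it could be passed to something like `px.bar(office_totals, x="RECIPIENT_OFFICE", y="TOTAL_CONT_AMT")` to draw the chart.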

##### 4.1 If you have multiple years, are they all similar? If not, is the difference explicable (maybe by election schedules)?

*Thankfully, the years are largely similar. In 2022, however, additional columns were appended to the filer and contributor datasets; these columns are not relevant to our analysis.*
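One way to handle yearly files whose column sets differ is to report the extra columns and then stack only the shared ones. This is a minimal sketch, not the approach used inside `eda.merge_all_datasets` (whose implementation is not shown); the helper names and the `NEW_FIELD` column are hypothetical.

```python
import pandas as pd

def extra_columns(frames_by_year):
    """Report, per year, the columns that are not shared by all years."""
    shared = set.intersection(*(set(f.columns) for f in frames_by_year.values()))
    return {year: set(f.columns) - shared for year, f in frames_by_year.items()}

def stack_shared_columns(frames):
    """Concatenate yearly frames, keeping only columns present in every year."""
    return pd.concat(frames, join="inner", ignore_index=True)

# Toy yearly frames: the 2022 file carries one additional (made-up) column.
frames_by_year = {
    2021: pd.DataFrame({"FILER_ID": ["1", "2"], "AMOUNT": [10.0, 20.0]}),
    2022: pd.DataFrame({"FILER_ID": ["3"], "AMOUNT": [30.0], "NEW_FIELD": ["x"]}),
}
extras = extra_columns(frames_by_year)
stacked = stack_shared_columns(list(frames_by_year.values()))
print(extras)
print(stacked)
```

Checking `extras` first makes it explicit which columns are being dropped before the years are combined.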

#### This next portion repeats the EDA done on the contribution and filer data, this time on the expenditure datasets spanning 2018-2023. The expense dataset stores information from Schedule III of the campaign finance report, which details the services rendered to the filer by the recipient as well as the nature of the expenditure (contribution, service, phone-banking, etc.).

In [8]:
expense_info_2018_2023 = eda.merge_all_datasets(merged_expense_dataset)
expense_info_2018_2023

Unnamed: 0,DONOR_ID,YEAR,RECIPIENT,AMOUNT,PURPOSE
0,2001144,2018,MICHAEL TURZAI,931.92,REIMBURSEMENT
1,2001144,2018,ARMSTRONG,25.04,INTERNET
2,2001144,2018,COMCAST,421.50,INTERNET
3,2001144,2018,NAYLAX,250.00,AD
4,2002299,2018,FRIENDS OF TOM TOSTI,500.00,POLITICAL CONTRIBUTION
...,...,...,...,...,...
64297,387871,2023,ZELDA YODER,46.07,PLAQUE / SIGN
64298,389369,2023,ZEM ZEM SHRINE CLUB,250.00,DEPOSIT FOR CAMPAIGN FUNDRAISING EVENT
64299,394204,2023,ZERO DAY BREWERY,873.50,FOOD FOR FUNDRAISER
64300,392983,2023,ZIO BRICK OVEN PIZZA,84.07,MEETING EXPENSE


##### 1.2 For each column, what are its contents? How many blanks or nulls are there? What is the format? If it is one of several types, what are those types?

In [7]:
# Same per-column summary as in 1.1, applied to the expense data.
# `col_types` is used instead of `type` to avoid shadowing the built-in.
cols, col_types, nulls, null_percent = [], [], [], []
for column in expense_info_2018_2023.columns:
    null_count = expense_info_2018_2023[column].isna().sum()
    cols.append(column)
    col_types.append(expense_info_2018_2023.dtypes[column])
    nulls.append(null_count)
    null_percent.append(round(null_count / len(expense_info_2018_2023) * 100, 2))

summary_df = pd.DataFrame({'columnName': cols, 'colType': col_types, 'numNulls': nulls, 'null_percent': null_percent})
summary_df

Unnamed: 0,columnName,colType,numNulls,null_percent
0,FILER_ID,object,0,0.0
1,YEAR,int64,0,0.0
2,EXPENSE_NAME,object,0,0.0
3,EXPENSE_AMT,float64,0,0.0
4,EXPENSE_DESC,object,0,0.0


#####  2.2 What are the top 10 expenditure reasons in your data? The top 10 recipients?

In [8]:
pd.set_option("display.float_format", "{:.2f}".format)
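# Total spending per free-text expense description, largest first.
expenditure_reasons = (
    expense_info_2018_2023.groupby("EXPENSE_DESC")
    .agg({"EXPENSE_AMT": "sum"})  # string alias; newer pandas warns when the built-in sum is passed
    .sort_values(by="EXPENSE_AMT", ascending=False)
)
expenditure_reasons.head(10)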

Unnamed: 0_level_0,EXPENSE_AMT
EXPENSE_DESC,Unnamed: 1_level_1
NAN,702230230.2
POSTAGE,532741231.64
CONTRIBUTION,441939525.25
UNITEMIZED EXPENDITURES,212332926.31
DONATION,138815526.73
NON-PENNSYLVANIA EXPENDITURES,118542071.03
SEE FEC REPORT AT HTTPS://WWW.FEC.GOV/DATA/COMMITTEE/C00042366/? TAB=FILINGS,114236098.71
NON PA DISBURSEMENTS,103186134.6
SEE FEC REPORT AT HTTPS://WWW.FEC.GOV/DATA/COMMITTEE/C00042366/?TAB=FILINGS,57998961.58
ADVERTISING,51071086.44


*The description column is difficult to interpret, mainly because there is no standardized reporting format. Filers are free to describe an expenditure as they see fit, which makes grouping descriptions into categories uncertain. Some entries simply point to a URL on the Federal Election Commission's website. Expenditures with no description at all (the NAN row) account for the largest combined amount.*
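Some of the fragmentation in the table above is purely textual: the same FEC link appears twice, once with a space after the `?`, and missing descriptions show up as the literal string `NAN`. A light normalization pass before grouping can merge such variants. The sketch below uses only the standard library; the set of placeholder tokens treated as missing is an assumption, the helper names are hypothetical, and the example URL and amounts are made up.

```python
import re
from collections import defaultdict

# Assumption: these placeholder strings all mean "no description given".
MISSING_TOKENS = {"", "NAN", "NONE", "N/A"}

def normalize_description(raw):
    """Upper-case, collapse whitespace, and close the gap after '?' in URL queries."""
    text = re.sub(r"\s+", " ", str(raw)).strip().upper()
    text = re.sub(r"\?\s+(?=\w+=)", "?", text)  # '/? TAB=FILINGS' -> '/?TAB=FILINGS'
    return "MISSING" if text in MISSING_TOKENS else text

def totals_by_description(rows):
    """Sum amounts per normalized description; rows are (description, amount) pairs."""
    totals = defaultdict(float)
    for desc, amount in rows:
        totals[normalize_description(desc)] += amount
    return dict(totals)

rows = [
    ("SEE FEC REPORT AT HTTPS://EXAMPLE.ORG/? TAB=FILINGS", 10.0),
    ("SEE FEC REPORT AT HTTPS://EXAMPLE.ORG/?TAB=FILINGS", 5.0),
    ("nan", 2.0),
    ("  Postage ", 1.0),
]
totals = totals_by_description(rows)
print(totals)
```

This only merges spelling-level variants. Deciding whether, say, CONTRIBUTION and DONATION belong in one category is a separate judgment call that the data itself does not settle.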

In [10]:
pd.set_option("display.float_format", "{:.2f}".format)
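# Total spending per expenditure recipient, largest first.
expenditure_recipients = (
    expense_info_2018_2023.groupby("EXPENSE_NAME")
    .agg({"EXPENSE_AMT": "sum"})  # string alias; newer pandas warns when the built-in sum is passed
    .sort_values(by="EXPENSE_AMT", ascending=False)
)
expenditure_recipients.head(10)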

Unnamed: 0_level_0,EXPENSE_AMT
EXPENSE_NAME,Unnamed: 1_level_1
ACME MARKETS,517209160.05
NON-PENNSYLVANIA EXPENDITURES,352505553.6
PNC,304225694.63
DNC SERVICES/DEMOCRATIC NATIONAL COMMITTEE,210584210.93
DSCC,172269060.29
NON PA TRANSACTIONS,114481897.04
CONTRIBUTIONS TO FEDERAL AND NON-PA STATE AND LOCAL CANDIDATES AND COMMITTEES,106419119.55
THE BUSINESS CENTER FOR ENTREPRENEURSHIP &AMP; SOCIAL ENTERPRISE,103202985.45
COMMONWEALTH CHILDREN'S CHOICE FUND,44325110.15
GRASSROOTS MEDIA LLC,40759192.03


<span style="color:pink">It's very interesting that the highest recipient of expenditures is ACME Markets, a supermarket chain. More interesting still is that a PAC also appears among the top recipients, which raises a question: how legally clear is the line between money a PAC receives as a contribution and money it receives that the payer reports as an expenditure? If an organization seeks the "services" of a PAC and lists the payment as an expenditure, it is not obvious that the PAC would then report that payment as a contribution. If it does not, PACs could ostensibly receive funds to "help" campaigns they are already ideologically aligned with, without such "collaborations" being counted as donations.</span>
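A first step toward examining this question is to find names that appear both as expenditure recipients and as contribution recipients, and to compare the two totals. The sketch below assumes the column names shown in the outputs above (`RECIPIENT`, `TOTAL_CONT_AMT`, `EXPENSE_NAME`, `EXPENSE_AMT`) and matches on upper-cased, stripped names only. `overlap_report` is a hypothetical helper and the toy data are made up.

```python
import pandas as pd

def overlap_report(contrib: pd.DataFrame, expense: pd.DataFrame) -> pd.DataFrame:
    """List entities that receive both contributions and expenditure payments."""
    received = (
        contrib.assign(NAME=contrib["RECIPIENT"].str.upper().str.strip())
        .groupby("NAME", as_index=False)["TOTAL_CONT_AMT"].sum()
    )
    paid = (
        expense.assign(NAME=expense["EXPENSE_NAME"].str.upper().str.strip())
        .groupby("NAME", as_index=False)["EXPENSE_AMT"].sum()
    )
    # Inner join keeps only names present on both sides.
    return received.merge(paid, on="NAME", how="inner")

# Toy frames standing in for the contribution and expense data.
contrib = pd.DataFrame({
    "RECIPIENT": ["Example PAC", "EXAMPLE PAC", "Other Committee"],
    "TOTAL_CONT_AMT": [100.0, 50.0, 20.0],
})
expense = pd.DataFrame({
    "EXPENSE_NAME": ["EXAMPLE PAC ", "GROCERY STORE"],
    "EXPENSE_AMT": [30.0, 10.0],
})
report = overlap_report(contrib, expense)
print(report)
```

Exact name matching will miss variants such as differing punctuation, so a match here is a lead to investigate rather than evidence of how any payment was reported.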

##### 3.2 If you have multiple years, are they all similar? If not, is the difference explicable (maybe by election schedules)?

*The years are all similar*