## Harvard Dataset EDA
The Harvard dataset, which contains information on individual candidates, is intended to enrich the existing individual dataset (`individuals_table.csv`) compiled from various states. (However, as Sarah suggested, it is still uncertain whether a new table for candidates should be created. If so, it could include more variables.) For now, I've selected variables that could 1) help match individuals across the two datasets and 2) provide information beyond what the individuals table holds, such as election outcomes and candidate party affiliations.

This Markdown document is designed to explain the rationale behind the selection of variables within the Harvard dataset and to describe the basic nature of these variables.

In [2]:
import numpy as np
import pandas as pd
from utils.transform.constants import HV_FILEPATH



In [3]:
hv_df = pd.read_stata(HV_FILEPATH)


One or more strings in the dta file could not be decoded using utf-8, and
so the fallback encoding of latin-1 is being used.  This can happen when a file
has been incorrectly encoded by Stata or some other software. You should verify
the string values returned are correct.
  hv_df = pd.read_stata(HV_FILEPATH)


In [9]:
HV_INDIVIDUAL_COLS = [
    "caseid",
    "year",
    "month",
    "day",
    "sab",
    "cname",
    "candid",
    "cand",
    "sen",
    "partyz",
    "partyt",
    "outcome",
    "vote",
    "termz",
    "last",
    "first",
    "v19_20171211",
    "v19_20160217"

]
raw_df = hv_df[HV_INDIVIDUAL_COLS]

1. Name columns (`v19_20171211`, `v19_20160217`, `cand`, `last` and `first`)  
`cand`	Standardized Candidate Name  
`v19_20171211`	Standardized Candidate Name from December 11, 2017  
`v19_20160217`	Standardized Candidate Name from February 17, 2016  


In [10]:
# Do the two standardized candidate-name columns match?
# If not, what is the update pattern: corrections, or just values filled in where the older column was missing?
raw_df[
    (raw_df["v19_20171211"].str.strip() != raw_df["v19_20160217"].str.strip()) &
    ~raw_df["v19_20160217"].isna()
][["v19_20171211", "v19_20160217"]]


Unnamed: 0,v19_20171211,v19_20160217
8074,"KAWASAKI, SCOTT",
8075,SCATTERING,
8076,"HOLDAWAY, TRUNO N. L.",
8077,"THOMPSON, STEVE M.",
8078,SCATTERING,
...,...,...
378340,"ANDERSON, JAMES LEE",
378341,SCATTERING,
378342,"FORD, ROBERT",
378343,"SCOTT, CHARLES K.",


In every differing row shown above, `v19_20160217` is empty. We thus know that `v19_20171211` is an update of `v19_20160217`, and that the `v19_20160217` column contains no additional information.
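One way to double-check this conclusion is to count how many disagreements fall on rows where the older column actually holds a value. The sketch below runs on a small stand-in frame rather than `raw_df`, and it treats both NaN and empty strings as missing, which is an assumption about how blanks come through from the Stata file.

```python
import pandas as pd

# Toy stand-in for raw_df (illustrative rows only).
toy = pd.DataFrame({
    "v19_20171211": ["KAWASAKI, SCOTT", "SCATTERING", "DONLEY, DAVE"],
    "v19_20160217": ["", None, "DONLEY, DAVE "],
})

# Normalize: treat NaN as empty and ignore surrounding whitespace.
new = toy["v19_20171211"].fillna("").str.strip()
old = toy["v19_20160217"].fillna("").str.strip()

# Rows where the two snapshots disagree.
differs = new != old

# A disagreement is a real correction only if the older column had a value.
corrections = toy[differs & (old != "")]

print(f"{differs.sum()} differing rows, {len(corrections)} of them corrections")
```

If `corrections` is empty on the full data, the older column can be dropped without losing information.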

In [11]:
raw_df["v19_20171211"] = raw_df["v19_20171211"].str.lower()
raw_df[
    (raw_df["cand"].str.strip() != raw_df["v19_20171211"].str.strip()) & 
    raw_df["v19_20171211"].isna()
][["v19_20171211", "cand","first","last"]]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  raw_df["v19_20171211"] = raw_df["v19_20171211"].str.lower()


Unnamed: 0,v19_20171211,cand,first,last


So there are two things to do: 1) update the `cand` (full name) column by replacing it with the `v19_20171211` value wherever that value is non-empty; 2) delete all rows whose name is missing (values starting with `namemissing`).

In [12]:
raw_df[raw_df["cand"].isna()]

Unnamed: 0,caseid,year,month,day,sab,cname,candid,cand,sen,partyz,partyt,outcome,vote,termz,last,first,v19_20171211,v19_20160217


In [14]:
# Prefer the 2017 standardized name when present, then drop "namemissing" placeholder rows.
raw_df["cand"] = np.where(raw_df["v19_20171211"].notna(), raw_df["v19_20171211"], raw_df["cand"])
raw_df = raw_df[~raw_df["cand"].str.startswith("namemissing")]


In [15]:
raw_df[(raw_df["last"].isna() & raw_df["first"].isna())]

Unnamed: 0,caseid,year,month,day,sab,cname,candid,cand,sen,partyz,partyt,outcome,vote,termz,last,first,v19_20171211,v19_20160217


2. `sab` column (state)  
We only need to uppercase every value here.

In [16]:
raw_df["sab"] = raw_df["sab"].str.upper()
print(raw_df["sab"].unique())

['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN'
 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE' 'NV'
 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN'
 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY']


3. Candidate uniqueness?  
To merge the datasets, we need each individual to be uniquely identifiable. That is, each name/ID should correspond to a single observation.

In [17]:
# `candid` is not unique: the same ID appears in many rows
raw_df["candid"].value_counts()

candid
167521    100
168980     77
167044     73
165078     72
166643     66
         ... 
4421        1
4423        1
4603        1
4979        1
4978        1
Name: count, Length: 160405, dtype: int64

In [18]:
unique_party_counts = raw_df.groupby("candid")["partyt"].nunique().reset_index(name="unique_partyt_count")

unique_party_counts[unique_party_counts["unique_partyt_count"] > 1]

Unnamed: 0,candid,unique_partyt_count
7,8,2
20,21,2
29,30,2
52,53,2
58,60,2
...,...,...
153480,354159,2
153637,354938,2
155466,363848,2
155473,363876,2


In [19]:
raw_df[raw_df["candid"] == 8]

Unnamed: 0,caseid,year,month,day,sab,cname,candid,cand,sen,partyz,partyt,outcome,vote,termz,last,first,v19_20171211,v19_20160217
6730,779,1986,11.0,4.0,AK,,8,"donley, dave",0,d,d,w,2985.0,2.0,donley,dave,"donley, dave","DONLEY, DAVE"
6841,783,1988,11.0,8.0,AK,,8,"donley, dave",0,d,d,w,4234.0,2.0,donley,dave,"donley, dave","DONLEY, DAVE"
6946,732,1990,11.0,6.0,AK,,8,"donley, dave",0,d,d,w,4081.0,2.0,donley,dave,"donley, dave","DONLEY, DAVE"
8544,90,1992,11.0,3.0,AK,,8,"donley, dave",1,d,d,w,5731.0,2.0,donley,dave,"donley, dave","DONLEY, DAVE"
8580,160,1994,11.0,8.0,AK,,8,"donley, dave",1,d,d,w,5209.0,4.0,donley,dave,"donley, dave","DONLEY, DAVE"
8620,161,1998,11.0,3.0,AK,,8,"donley, dave",1,r,r,w,8003.0,4.0,donley,dave,"donley, dave","DONLEY, DAVE"
8685,8,2002,11.0,5.0,AK,,8,"donley, dave",1,r,r,l,4666.0,2.0,donley,dave,"donley, dave","DONLEY, DAVE"


A single candidate can have several election records, one for each election they ran in, and may have results reported for different counties. Their party can also change over their career: `candid` 8 above ran as `d` through 1994 and as `r` from 1998.
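Because `candid` repeats, any candidate-level table has to decide how to summarize these rows. One option, sketched below on a few illustrative rows, is to keep the party history per candidate and take the most recent `partyt` as the current party. Using the latest year is an assumption here, not a rule from the codebook, and `candid` 999 is a made-up ID.

```python
import pandas as pd

# Illustrative rows: candid 8 mirrors the pattern shown above, 999 is hypothetical.
toy = pd.DataFrame({
    "candid": [8, 8, 8, 999],
    "year": [1994, 1998, 2002, 2000],
    "partyt": ["d", "r", "r", "d"],
})

# Sort so that "last" within each group means "most recent election".
toy = toy.sort_values(["candid", "year"])

summary = toy.groupby("candid").agg(
    first_year=("year", "min"),
    last_year=("year", "max"),
    n_records=("year", "size"),
    parties=("partyt", lambda s: ", ".join(sorted(s.unique()))),
    latest_party=("partyt", "last"),
)
print(summary)
```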

4. Other variables

`partyt` assigns just one party to a candidate in one election season (i.e., the primary and general election in one year), using the same seven codes used in `partyz`.  
	For example, a candidate running in NY with fusion as a Democrat and a Republican is assigned the party they are expected to caucus with in the state legislature, measured by how they end up caucusing in the state legislature.  
	For example, a candidate who files in a state primary as a Democrat, and then is written in by voters in the Republican primary, has “d” designated as their “true” party.  


In [20]:
raw_df["partyt"].value_counts()

partyt
d            183902
r            151981
nonmaj        22189
writein       17574
nonpart        2431
partymiss       268
Name: count, dtype: int64

In [25]:
raw_df[raw_df["last"].isna()]

Unnamed: 0,caseid,year,month,day,sab,cname,candid,cand,sen,partyz,partyt,outcome,vote,termz,last,first,v19_20171211,v19_20160217


In [28]:
raw_df = raw_df[(raw_df["year"] <= 2017) & (raw_df["year"] >= 2014)]

In [29]:
mini_df = pd.read_csv("/project/data/transformed/inds_mini.csv")

In [None]:
mini_df["party"].value_counts()

party
republican    2
democratic    1
DELETE        1
Name: count, dtype: int64

We rename the party categories to align with the individual file.
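A minimal sketch of that renaming, assuming the individuals file uses lowercase labels such as `democratic` and `republican` (as in the `party` counts above). Only the targets for `d` and `r` come from the data; the labels for the remaining codes are placeholders that still need to be agreed on.

```python
import pandas as pd

# Mapping from Harvard `partyt` codes to individuals-file party labels.
PARTY_MAP = {
    "d": "democratic",
    "r": "republican",
    # The labels below are assumptions, not values seen in the individuals file.
    "nonmaj": "other",
    "writein": "write-in",
    "nonpart": "nonpartisan",
    "partymiss": None,  # treat as missing
}

# Toy stand-in for raw_df.
toy = pd.DataFrame({"partyt": ["d", "r", "nonmaj", "partymiss", "d"]})
toy["party"] = toy["partyt"].map(PARTY_MAP)

# Codes that appear in the data but are absent from the mapping.
unmapped = set(toy["partyt"].dropna()) - set(PARTY_MAP)
print(toy)
print("unmapped codes:", unmapped)
```

Checking `unmapped` before merging guards against codes that would otherwise silently turn into NaN.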

In [41]:
# Right join keeps every row of mini_df; rows without a Harvard match have NaN in `cand`.
test_df = raw_df.merge(mini_df, left_on="cand", right_on="full_name", how="right")
test_df[test_df["cand"].notna()]

Unnamed: 0,caseid,year,month,day,sab,cname,candid,cand,sen,partyz,...,last_name,full_name,entity_type,state,party,company,occupation,address,zip,city
0,321972.0,2014.0,8.0,26.0,AZ,,10358.0,"alston, lela",0.0,d,...,,"alston, lela",candidate,AZ,democratic,none (is a candidate),,,,
1,336363.0,2014.0,11.0,4.0,AZ,maricopa,10358.0,"alston, lela",0.0,d,...,,"alston, lela",candidate,AZ,democratic,none (is a candidate),,,,
2,361592.0,2016.0,11.0,8.0,AZ,,10358.0,"alston, lela",0.0,d,...,,"alston, lela",candidate,AZ,democratic,none (is a candidate),,,,
2537,361500.0,2016.0,11.0,8.0,AZ,,361500.0,"schmuck, frank",1.0,r,...,,"schmuck, frank",candidate,AZ,republican,none (is a candidate),,,,
3239,321937.0,2014.0,8.0,26.0,AZ,,295389.0,"carter, heather",0.0,r,...,,"carter, heather",candidate,AZ,republican,none (is a candidate),,,,
3240,336333.0,2014.0,11.0,4.0,AZ,maricopa,295389.0,"carter, heather",0.0,r,...,,"carter, heather",candidate,AZ,republican,none (is a candidate),,,,
3241,361565.0,2016.0,11.0,8.0,AZ,,295389.0,"carter, heather",0.0,r,...,,"carter, heather",candidate,AZ,republican,none (is a candidate),,,,


2015-2017 data  
Create a new table for elections (who was running), with a foreign key to the individuals table  
Which race they are running in  
District, vote, results, plus number of votes -> how many more votes, etc.
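As a rough sketch of what such an elections table could look like, the example below attaches an `individual_id` foreign key to each election row and computes how far each candidate finished behind the race leader. All rows are made up, `race_id` is a hypothetical column standing in for whatever identifies one contest, and matching on name plus state is an assumption.

```python
import pandas as pd

# Hypothetical individuals table; `id` is its primary key.
individuals = pd.DataFrame({
    "id": ["a1", "b2"],
    "full_name": ["doe, jane", "roe, richard"],
    "state": ["AZ", "AZ"],
})

# Hypothetical election rows with made-up vote counts.
results = pd.DataFrame({
    "race_id": [1, 1, 2],
    "year": [2016, 2016, 2016],
    "sab": ["AZ", "AZ", "AZ"],
    "cand": ["doe, jane", "poe, pat", "roe, richard"],
    "vote": [5000.0, 3000.0, 7000.0],
    "outcome": ["w", "l", "w"],
})

# Left join so unmatched candidates stay in the table with a missing key.
lookup = individuals.rename(columns={"id": "individual_id"})
elections = results.merge(
    lookup,
    left_on=["cand", "sab"],
    right_on=["full_name", "state"],
    how="left",
    validate="many_to_one",  # fail if a name/state pair is not unique
).drop(columns=["full_name", "state"])

# Votes separating each candidate from the top vote-getter in the same race.
leader_votes = elections.groupby("race_id")["vote"].transform("max")
elections["votes_behind_leader"] = leader_votes - elections["vote"]
print(elections)
```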


In [62]:
ind_df = pd.read_csv("/project/output/transformed/individuals_table.csv")



In [66]:
ind_df[ind_df["full_name"].isna()]

Unnamed: 0.1,Unnamed: 0,id,first_name,last_name,full_name,entity_type,state,party,company
2487042,0,3aefdcef-5322-456b-ba06-f798a7f02435,James,Schultz,,Individual,MN,,
2487043,1,8eb8c02d-ec89-4b13-8165-3f9997e76a88,James,Schultz,,Individual,MN,,
2487044,2,306ac8c6-131e-416e-b6b3-17276e83564e,James,Schultz,,Individual,MN,,
2487045,3,9e9b0667-9820-45b7-adf0-00545da2f1b5,Keith,Ellison,,Individual,MN,,
2487046,4,a1f08d62-9a7a-4e39-bbcc-4ee49ab0bbba,Keith,Ellison,,Individual,MN,,
...,...,...,...,...,...,...,...,...,...
2496058,9016,fac77dae-2df0-4f65-a8bf-7430a59ae6c7,Wallace,Swan,,Individual,MN,,
2496059,9017,5974ec19-e5cc-4a70-b720-09853e72d7c7,Wallace,Swan,,Individual,MN,,
2496060,9018,c883db2e-07fe-47dd-8eaf-3ee111a42f71,Wallace,Swan,,Individual,MN,,
2496061,9019,aecb8c05-7e67-4368-b7d6-252a5d592012,Wallace,Swan,,Individual,MN,,
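These rows have `first_name` and `last_name` but no `full_name`. One possible cleaning step, sketched on a few stand-in rows, is to rebuild `full_name` in the lowercase "last, first" form that `cand` uses. Whether the individuals table should follow that format is an assumption, and the third row is made up.

```python
import pandas as pd

# Toy stand-in for ind_df.
toy = pd.DataFrame({
    "first_name": ["James", "Keith", None],
    "last_name": ["Schultz", "Ellison", None],
    "full_name": [None, None, "doe, jane"],
})

# Build "last, first" in lowercase; stays NaN when either part is missing.
rebuilt = (
    toy["last_name"].str.strip() + ", " + toy["first_name"].str.strip()
).str.lower()

# Fill only the gaps so existing full names are left untouched.
cleaned = toy.copy()
cleaned["full_name"] = cleaned["full_name"].fillna(rebuilt)
print(cleaned)
```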


In [None]:
ind_df_cleaned = 

Linkage file: use it to find matching individuals for the election results -> look for potential matches -> include them in the election results.

statefinancetransformer -> new class -> election result transformer

1. Cleaning the data
2. Doing the matching -> reuse the function in linkage
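The two steps could be organized roughly as follows. This is only a sketch: `ElectionResultTransformer` and `exact_name_match` are hypothetical names, the data is made up, and the exact-name matcher is a placeholder for whichever function from the linkage module ends up being reused.

```python
import pandas as pd


def exact_name_match(results: pd.DataFrame, individuals: pd.DataFrame) -> pd.DataFrame:
    """Placeholder matcher: exact join on the full name (stand-in for the linkage function)."""
    merged = results.merge(
        individuals[["id", "full_name"]],
        left_on="cand",
        right_on="full_name",
        how="left",
    )
    return merged.drop(columns="full_name")


class ElectionResultTransformer:
    """Hypothetical transformer: clean the election rows, then match them to individuals."""

    def __init__(self, matcher=exact_name_match):
        self.matcher = matcher

    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()  # work on a copy to avoid SettingWithCopyWarning
        out["sab"] = out["sab"].str.upper()
        out["cand"] = out["cand"].str.strip().str.lower()
        # Drop placeholder rows whose name is missing.
        return out[~out["cand"].str.startswith("namemissing")]

    def transform(self, df: pd.DataFrame, individuals: pd.DataFrame) -> pd.DataFrame:
        return self.matcher(self.clean(df), individuals)


# Made-up inputs for illustration.
results = pd.DataFrame({
    "sab": ["az", "az", "az"],
    "cand": [" Doe, Jane", "namemissing1", "poe, pat"],
})
individuals = pd.DataFrame({"id": ["a1"], "full_name": ["doe, jane"]})

matched = ElectionResultTransformer().transform(results, individuals)
print(matched)
```

Passing the matcher in as an argument keeps the class independent of the linkage code, so the placeholder can be swapped for the real function later.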
