### ERD staging queries ###

We begin with the House2016 table, first counting its records.

In [None]:
dataset_id = "hdv_staging"

In [19]:
%%bigquery 
SELECT count(*)
FROM hdv_staging.House2016 hs1 

Unnamed: 0,f0_
0,16977


Query 1. The combination of fipscode (the numerical equivalent of jurisdiction) and candidate is the most intuitive choice for a primary key.

In [32]:
%%bigquery 
SELECT count(DISTINCT  concat(hs1.fipscode, hs1.candidate))
FROM hdv_staging.House2016 hs1;

Unnamed: 0,f0_
0,16408


Since 16408 < 16977, this combination cannot be the primary key: at least one (fipscode, candidate) pair occurs in more than one record. For example, in Blount County, Alabama, Write-In appears more than once under the same fipscode:

In [37]:
%%bigquery 
SELECT DISTINCT  hs1.fipscode, hs1.candidate, hs1.jurisdiction, hs1.total_votes, hs1.state
FROM hdv_staging.House2016 hs1
WHERE hs1.state = 'AL' AND hs1.jurisdiction = 'Blount'
LIMIT 11

Unnamed: 0,fipscode,candidate,jurisdiction,total_votes,state
0,100900000,Aderholt,Blount,1673,AL
1,100900000,Write-In,Blount,1673,AL
2,100900000,Putman,Blount,23131,AL
3,100900000,Palmer,Blount,23131,AL
4,100900000,Write-In,Blount,23131,AL
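
The repeated pair can also be located directly rather than by eye. The following is a minimal sketch, assuming an in-memory SQLite table (the hypothetical `house2016_sample`) as a stand-in for `hdv_staging.House2016` and loading only the five Blount rows shown above; the same `GROUP BY ... HAVING` pattern should carry over to BigQuery.

```python
import sqlite3

# The five Blount County rows shown above, loaded into an in-memory
# SQLite table. The table name is hypothetical and stands in for
# hdv_staging.House2016.
rows = [
    (100900000, "Aderholt", "Blount", 1673, "AL"),
    (100900000, "Write-In", "Blount", 1673, "AL"),
    (100900000, "Putman", "Blount", 23131, "AL"),
    (100900000, "Palmer", "Blount", 23131, "AL"),
    (100900000, "Write-In", "Blount", 23131, "AL"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE house2016_sample ("
    "fipscode INTEGER, candidate TEXT, jurisdiction TEXT, "
    "total_votes INTEGER, state TEXT)"
)
conn.executemany("INSERT INTO house2016_sample VALUES (?, ?, ?, ?, ?)", rows)

# Any group with more than one row violates uniqueness of the
# (fipscode, candidate) pair.
repeated_pairs = conn.execute(
    """
    SELECT fipscode, candidate, COUNT(*) AS n
    FROM house2016_sample
    GROUP BY fipscode, candidate
    HAVING COUNT(*) > 1
    """
).fetchall()

print(repeated_pairs)
```

On this sample, only the (100900000, Write-In) pair comes back, with a count of 2.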


Query 2. "Write-In" candidates occasionally appear more than once within the same fipscode. Adding total_votes to the candidate key could differentiate these records.

In [38]:
%%bigquery 
SELECT count(DISTINCT  concat(hs1.fipscode, hs1.candidate, hs1.total_votes))
FROM hdv_staging.House2016 hs1;

Unnamed: 0,f0_
0,16971


Since 16971 < 16977, fipscode, candidate, and total_votes together are still not a primary key. Some records must share a fipscode and candidate while also having the same total_votes. One such case is Rochester, New Hampshire, where Wards 3 and 5 both list the candidate 'Scatter' with a total_votes of 2514:

In [39]:
%%bigquery 
SELECT *
FROM hdv_staging.House2016 hs1
WHERE hs1.fipscode = 3301765140 and hs1.candidate = 'Scatter'

Unnamed: 0,state,jurisdiction,fipscode,office,year,party,candidate,votes,total_votes
0,NH,Rochester - Ward 1,3301765140,US House,2016,,Scatter,0,2598
1,NH,Rochester - Ward 2,3301765140,US House,2016,,Scatter,0,2674
2,NH,Rochester - Ward 3,3301765140,US House,2016,,Scatter,0,2514
3,NH,Rochester - Ward 4,3301765140,US House,2016,,Scatter,0,2360
4,NH,Rochester - Ward 5,3301765140,US House,2016,,Scatter,0,2514
5,NH,Rochester - Ward 6,3301765140,US House,2016,,Scatter,0,2066


Query 3. To resolve the problem discovered in Query 2, we add the jurisdiction name to the candidate key.

In [40]:
%%bigquery 
SELECT count(DISTINCT  concat(hs1.fipscode, hs1.candidate, hs1.total_votes, hs1.jurisdiction))
FROM hdv_staging.House2016 hs1;

Unnamed: 0,f0_
0,16977


Now the count for this key matches the number of records in the table, so the combination of fipscode, candidate, total_votes, and jurisdiction is a primary key for House2016.

Since every table has the same schema, we expect the same primary key to work for the other tables, and we check each one in turn. Next, we look at Senate2016:

In [47]:
%%bigquery 
SELECT count(*)
FROM hdv_staging.Senate2016 s1 

Unnamed: 0,f0_
0,15923


In [46]:
%%bigquery 
SELECT count(DISTINCT  concat(s1.fipscode, s1.candidate, s1.total_votes, s1.jurisdiction))
FROM hdv_staging.Senate2016 s1;

Unnamed: 0,f0_
0,15923
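
Because the same comparison (row count versus distinct key count) repeats for every table, it could be wrapped in a small helper. This is only a sketch: `is_candidate_key` and the `house2016_sample` table are hypothetical names, SQLite stands in for BigQuery, and the data is limited to the six Rochester rows shown earlier. Note that it compares column tuples with `SELECT DISTINCT` rather than concatenating fields.

```python
import sqlite3


def is_candidate_key(conn, table, columns):
    """Return True if no two rows of `table` share the same values in `columns`."""
    # Identifiers are interpolated into the SQL text, so pass trusted names only.
    column_list = ", ".join(columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {column_list} FROM {table})"
    ).fetchone()[0]
    return total == distinct


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE house2016_sample ("
    "jurisdiction TEXT, fipscode INTEGER, candidate TEXT, total_votes INTEGER)"
)
# The six Rochester, NH rows from the Query 2 investigation.
rochester = [
    ("Rochester - Ward 1", 3301765140, "Scatter", 2598),
    ("Rochester - Ward 2", 3301765140, "Scatter", 2674),
    ("Rochester - Ward 3", 3301765140, "Scatter", 2514),
    ("Rochester - Ward 4", 3301765140, "Scatter", 2360),
    ("Rochester - Ward 5", 3301765140, "Scatter", 2514),
    ("Rochester - Ward 6", 3301765140, "Scatter", 2066),
]
conn.executemany("INSERT INTO house2016_sample VALUES (?, ?, ?, ?)", rochester)

without_jurisdiction = is_candidate_key(
    conn, "house2016_sample", ["fipscode", "candidate", "total_votes"]
)
with_jurisdiction = is_candidate_key(
    conn, "house2016_sample", ["fipscode", "candidate", "total_votes", "jurisdiction"]
)
print(without_jurisdiction, with_jurisdiction)
```

On this sample the three-column key fails because Wards 3 and 5 collide, and adding jurisdiction makes every row distinct, mirroring Queries 2 and 3.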


Now, we turn to the 2018 House elections:

In [62]:
%%bigquery 
SELECT count(*)
FROM hdv_staging.House2018 

Unnamed: 0,f0_
0,17256


In [66]:
%%bigquery 
SELECT count(DISTINCT  concat(p.fipscode, p.candidate, p.jurisdiction, p.total_votes))
FROM hdv_staging.House2018 p;

Unnamed: 0,f0_
0,16665


Since 16665 < 17256, the key that worked for the 2016 tables does not work here, so we add votes to distinguish the remaining records.

In [68]:
%%bigquery 
SELECT count(DISTINCT  concat(p.fipscode, p.candidate, p.jurisdiction, p.total_votes, p.votes))
FROM hdv_staging.House2018 p;

Unnamed: 0,f0_
0,17256
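
One caveat about the concat-based counts: concat joins the fields with no delimiter, and in the query above total_votes and votes are adjacent numeric fields. Two different rows could therefore, in principle, produce the same string. Such collisions can only lower the distinct count, so a count that matches the row count (as it does here) still establishes uniqueness; a shortfall, however, is worth confirming with a GROUP BY. The sketch below uses made-up values purely to show the effect; the `|` delimiter is an assumption and would need to be a character that never occurs in the data.

```python
# Hypothetical (total_votes, votes) pairs, chosen only to show the collision.
records = [
    ("12", "3"),
    ("1", "23"),
]

# Joining with no delimiter maps both records to the string "123".
undelimited = {total_votes + votes for total_votes, votes in records}

# Joining with a delimiter that cannot occur in the data keeps them apart.
delimited = {total_votes + "|" + votes for total_votes, votes in records}

# Comparing tuples directly avoids string concatenation altogether.
as_tuples = set(records)

print(len(undelimited), len(delimited), len(as_tuples))
```

Here the undelimited form sees one distinct value while the delimited and tuple forms both see two.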


Now, we can turn to the 2018 Senate election:

In [69]:
%%bigquery 
SELECT count(*)
FROM hdv_staging.Senate2018 p 

Unnamed: 0,f0_
0,17305


In [72]:
%%bigquery 
SELECT count(DISTINCT  concat(p.fipscode, p.candidate, p.jurisdiction, p.total_votes, p.votes))
FROM hdv_staging.Senate2018 p;

Unnamed: 0,f0_
0,17305


Lastly, we have the 2016 presidential election:

In [73]:
%%bigquery 
SELECT count(*)
FROM hdv_staging.President2016 p 

Unnamed: 0,f0_
0,53768


In [74]:
%%bigquery 
SELECT count(DISTINCT  concat(p.fipscode, p.candidate, p.jurisdiction, p.total_votes, p.votes))
FROM hdv_staging.President2016 p;

Unnamed: 0,f0_
0,53766


Since 53766 < 53768 and there are no more distinguishing fields we can add to the key, we suspect that this table contains duplicate records. We investigate:

In [75]:
%%bigquery
SELECT p.state, p.jurisdiction, p.fipscode, p.candidate, p.votes, p.total_votes, count(*)
FROM hdv_staging.President2016 p
GROUP BY p.state, p.jurisdiction, p.fipscode, p.candidate, p.votes, p.total_votes
HAVING count(*) > 1

Unnamed: 0,state,jurisdiction,fipscode,candidate,votes,total_votes,f0_
0,IL,ALEXANDER,1700300000,Evan McMullin,0,2820,2
1,IL,BROWN,1700900000,Evan McMullin,0,2359,2


To be certain, we look at these specific rows:

In [80]:
%%bigquery
SELECT p.state, p.jurisdiction, p.fipscode, p.candidate, p.votes, p.total_votes
FROM hdv_staging.President2016 p
WHERE p.state = 'IL' AND p.candidate = 'Evan McMullin' AND (p.fipscode = 1700300000 OR p.fipscode = 1700900000)


Unnamed: 0,state,jurisdiction,fipscode,candidate,votes,total_votes
0,IL,ALEXANDER,1700300000,Evan McMullin,0,2820
1,IL,ALEXANDER,1700300000,Evan McMullin,0,2820
2,IL,BROWN,1700900000,Evan McMullin,0,2359
3,IL,BROWN,1700900000,Evan McMullin,0,2359


This confirms our suspicion: these rows are identical in every field we selected, so no combination of those fields can serve as a primary key for President2016 until the duplicates are removed. As we continue to clean our data and normalize on jurisdiction, we anticipate that the number of columns in each table's primary key will decrease, and that foreign key relationships will emerge. So far, since all of our tables follow the same schema, reporting election results for different years (2016, 2018) and types of elections (House, Senate, Presidential), we have not identified any reasonable foreign key relationships. We do not think there are any self-referencing relationships, either. Adding a secondary dataset containing information from the US Census should help us build foreign key relationships based on jurisdiction.
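
One possible way to handle the duplicated rows during cleaning is to keep a single copy of each. The following is a sketch only, not a step this notebook has taken: SQLite stands in for BigQuery, the table names `president2016_sample` and `president2016_dedup` are hypothetical, and the data covers just the four Illinois rows and six columns shown above (the real table also has office, year, and party).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE president2016_sample ("
    "state TEXT, jurisdiction TEXT, fipscode INTEGER, "
    "candidate TEXT, votes INTEGER, total_votes INTEGER)"
)
# The four rows returned by the investigation above.
duplicated_rows = [
    ("IL", "ALEXANDER", 1700300000, "Evan McMullin", 0, 2820),
    ("IL", "ALEXANDER", 1700300000, "Evan McMullin", 0, 2820),
    ("IL", "BROWN", 1700900000, "Evan McMullin", 0, 2359),
    ("IL", "BROWN", 1700900000, "Evan McMullin", 0, 2359),
]
conn.executemany(
    "INSERT INTO president2016_sample VALUES (?, ?, ?, ?, ?, ?)",
    duplicated_rows,
)

# SELECT DISTINCT keeps one copy of each fully identical row.
conn.execute(
    "CREATE TABLE president2016_dedup AS "
    "SELECT DISTINCT * FROM president2016_sample"
)
deduped = conn.execute(
    "SELECT * FROM president2016_dedup ORDER BY jurisdiction"
).fetchall()

print(len(duplicated_rows), "->", len(deduped))
```

After deduplication, the sample holds one row each for ALEXANDER and BROWN, at which point the key used for the 2018 tables would be unique on this sample.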