# Analysis of US Election Results and Simulation of Possible Election Outcomes

This notebook provides an exploratory data analysis (EDA) of the datasets used and simulates possible election outcomes by enumerating every combination of winners in the states where none of the presidential candidates received 50% of the votes.
As usual, I begin by importing the libraries used in this project.

In [None]:
import pandas as pd
pd.options.display.float_format = "{:,.4f}".format

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sys


# 1. Uploading Data

I will perform the EDA and simulation only for the presidential election, so the first step is to get the presidential election data.
I will also add a dataset with the electoral votes for each state and one with state population estimates.

In [None]:
president_county_data = '../input/us-election-2020/president_county.csv'

president_county_candidate_data = '../input/us-election-2020/president_county_candidate.csv'

president_state_data = '../input/us-election-2020/president_state.csv'

electortal_votes = '../input/electoralvotes/ElectoralVotes.csv'

population = '../input/population/nst-est2019-alldata.csv'


The imported libraries and the manipulations below are much easier to use when the datasets are in DataFrame format. All files are small enough to be read painlessly with pandas, so let's load them.

In [None]:
df_president_county = pd.read_csv(president_county_data)
df_president_county_candidate = pd.read_csv(president_county_candidate_data)
df_president_state = pd.read_csv(president_state_data)
df_electortal_votes = pd.read_csv(electortal_votes)
df_population = pd.read_csv(population)

I have loaded the electoral votes by state. Let's perform EDA on this dataset.
For later analysis it would be interesting to compare electoral votes and population by state.

# 2. Exploratory Data Analysis (EDA)

In [None]:
df_electortal_votes.info()

In [None]:
df_electortal_votes.describe()

Good news! There is no missing data and there are no surprising values.
The minimum number of electoral votes is 3.
The median number of electoral votes is 8 (bear in mind that the maximum is 55).
Most states have fewer than 12 electoral votes.

Let's see which state has the most electoral votes and how many states share the same number of electoral votes.

# 2.1. Electoral Votes Dataset Analysis

In [None]:
df_electortal_votes_orderd=df_electortal_votes.sort_values(by=['Electoral Votes'])

x = list(df_electortal_votes_orderd['US State'])
y = list(df_electortal_votes_orderd['Electoral Votes'])
#ax = df_electortal_votes.plot.bar(x,y,  color ='maroon', width = 0.4, figsize = (20,20), legend = False)
#ax.set_xlabel('States')
#ax.set_ylabel('Count')
plt.figure(figsize = (20,20))
ax = plt.bar(x,y, color ='maroon')
plt.xticks(x, df_electortal_votes_orderd['US State'], rotation='vertical')
plt.xlabel('State', fontweight ='bold') 
plt.ylabel('Electoral Votes', fontweight ='bold') 
#plt.savefig("ElectoralVotes.jpg")


It is very clear that California is a leader in electoral votes. 

In [None]:
df_electortal_votes_orderd.groupby(['Electoral Votes']).agg({'US State':'count'}).rename(columns={'US State':'US_State_Count'}).reset_index()

In [None]:
Votes_bins = pd.IntervalIndex.from_tuples([(0,6), (6, 12), (12, 18), (18, 55)])
df_electortal_votes_bin = pd.cut(df_electortal_votes['Electoral Votes'],Votes_bins)
df_electortal_votes_bin.value_counts()

The most common value is 3 electoral votes (8 states), followed by 6 electoral votes (6 states).
After grouping into bins, we can see that 22 of the 51 entries (50 states plus the District of Columbia) have 6 or fewer electoral votes.
So 3, 4, 5, and 6 are the most common numbers of electoral votes, and over 50% of the states have fewer than 13.

California has the highest number of electoral votes (55), followed by Texas (38).
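The binning step relies on the intervals being closed on the right, which is why a state with exactly 6 electoral votes lands in the first bin. A minimal sketch on a few made-up values (the `toy_votes` series is hypothetical and not taken from the dataset) illustrates this behaviour:

```python
import pandas as pd

# Hypothetical vote counts, chosen only to show which bin each value falls into
toy_votes = pd.Series([3, 6, 7, 12, 13, 55])

# Same bin edges as above; from_tuples builds right-closed intervals by default
bins = pd.IntervalIndex.from_tuples([(0, 6), (6, 12), (12, 18), (18, 55)])
binned = pd.cut(toy_votes, bins)

# Position of the bin assigned to each value (0 = first bin, 3 = last bin)
bin_positions = binned.cat.codes.tolist()
print(binned.value_counts().sort_index())
```

Values on a right edge (6 and 12 here) stay in the lower bin, which matches the "6 or fewer" reading of the first interval.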

The next step in the analysis of the electoral votes dataset is a comparison with population by state. I would expect a correlation between electoral votes and population.

Let's investigate the population dataset and join it with the electoral votes to see the correlation between the two.

The available dataset contains population estimates for 2019.

# 2.2. Electoral Votes and Population

In [None]:
df_population.info()

In [None]:
df_population.describe()

In [None]:
print(df_population.head())

In [None]:
df_population_2019 = df_population[["STATE", "NAME","POPESTIMATE2019"]]
df_population_2019 = df_population_2019[df_population_2019["STATE"] > 0].rename(columns = {'NAME':'state'})

df_population_2019.sort_values(['state'], ascending=True)

Next, I merge the electoral votes dataset with the population data. The population estimates I found are for 2019.

In [None]:
df_electortal_votes_population_2019 = pd.merge(df_electortal_votes,df_population_2019, how = 'left', left_on ='US State', right_on = 'state' )
df_electortal_votes_population_2019 = df_electortal_votes_population_2019.drop(['STATE', 'US State'], axis=1)
df_electortal_votes_population_2019

In [None]:
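# Correlate only the numeric columns; the 'state' column holds text
corr_matrix = df_electortal_votes_population_2019[['Electoral Votes', 'POPESTIMATE2019']].corr()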
corr_matrix


The table above is called a correlation matrix and shows the correlation between population and electoral votes.
It shows that population and electoral votes are very highly correlated.
Another way to visualize the relationship between two measured variables is a scatterplot.


In [None]:
ax = df_electortal_votes_population_2019.plot.scatter(x = 'POPESTIMATE2019', y = 'Electoral Votes', figsize = (4,4))
ax.set_ylabel('Electoral Votes')
ax.set_xlabel('Population 2019')
ax.axhline(0, color = 'grey', lw = 1)
ax.axvline(0, color = 'grey', lw = 1)
ax
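For reference, the coefficient reported by `DataFrame.corr` is the Pearson correlation, which is the pandas default. The sketch below computes it by hand on made-up numbers and compares it with `Series.corr`. The `toy` frame and all of its values are hypothetical and serve only to illustrate the formula.

```python
import numpy as np
import pandas as pd

# Invented values for illustration; they are not taken from the datasets
toy = pd.DataFrame({
    "population": [0.6, 1.8, 5.8, 12.8, 29.0],   # millions, invented
    "electoral_votes": [3, 5, 10, 20, 38],        # invented
})

x = toy["population"].to_numpy()
y = toy["electoral_votes"].to_numpy()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by pandas
r_pandas = toy["population"].corr(toy["electoral_votes"])
print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 means the points lie almost on a rising straight line, which is the pattern the scatterplot above suggests.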

# 2.3. Presidential Race Datasets Analysis

In [None]:
print("About dataset: General information about reporting votes to presidential race by county.\n")
df_president_county.info()

In [None]:
print(df_president_county.head())

In [None]:
df_president_county.describe()

Interesting: the maximum value of current votes is higher than the maximum of total votes.
Let's check where the largest discrepancies are located and what the percent column shows.

In [None]:
df_president_county_test = df_president_county.copy()
df_president_county_test['Dif'] = df_president_county_test['total_votes'] - df_president_county_test['current_votes']
df_president_county_test.sort_values(['Dif'], ascending=True)

In Los Angeles and Sacramento counties the current_votes count exceeds the total_votes count, yet the 'percent' column does not reflect that and shows 95%. I would not trust the 'percent' column, since it is not clear how it is calculated.
Let's calculate the percentage of current votes relative to total votes by state.

In [None]:
df_president_county_group= df_president_county.groupby(['state']).agg({'current_votes':'sum','total_votes':'sum'})
df_president_county_group['percent_calc'] = 100 * (df_president_county_group['current_votes']/df_president_county_group['total_votes'])
df_president_county_group['difference'] = df_president_county_group['total_votes'] - df_president_county_group['current_votes']
df_president_county_group.sort_values('percent_calc', ascending=False).reset_index()

In [None]:
df_president_county_group[df_president_county_group['percent_calc'] > 100].reset_index()

16 states have a current vote count higher than the total votes. There could be various reasons for this, including what the total_votes column actually represents, for example if it counts only residents registered to vote.
Either way, it seems very unusual for current_votes to exceed total_votes.
In California, current votes exceed total votes by over 1.6 million.
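The check above can be reproduced on a tiny example. In this sketch only the column names mirror the dataset. The `toy_counties` frame, its state labels, and its numbers are invented for illustration.

```python
import pandas as pd

# Invented county rows; only the column names mirror the dataset
toy_counties = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "current_votes": [120, 80, 50, 40],
    "total_votes": [100, 90, 60, 45],
})

# Sum both vote columns per state, then express current votes as a share of total votes
by_state = toy_counties.groupby("state")[["current_votes", "total_votes"]].sum()
by_state["percent_calc"] = 100 * by_state["current_votes"] / by_state["total_votes"]

# States whose reported count exceeds the stated total
over_reported = by_state[by_state["percent_calc"] > 100].index.tolist()
print(by_state)
```

Aggregating before dividing matters: averaging per-county percentages would weight small and large counties equally and could hide a state-level overshoot.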

# 3. Election Results Analysis

The rest of the analysis concerns the election results. I will work with 2 datasets: president_state and president_county_candidate.

In [None]:
print("About dataset: Described information about candidate votes to presidential race by county.\n")
print(df_president_county_candidate.head())

In [None]:
df_president_county_candidate.info()

In [None]:
df_president_county_candidate.describe()

I would like to explore the minimum and maximum values of total votes further.
Where do they occur, and what can I learn from them? I will start with the minimum value, total votes = 0.

In [None]:
df_president_county_candidate[df_president_county_candidate['total_votes'] == 0].count()

In [None]:
df_president_county_candidate[df_president_county_candidate['total_votes'] == 0].groupby('candidate').count()

There are 4,724 records with 0 votes. Some belong to Trump and Biden, some to other candidates, and the largest number to 'Write-ins'.
Let's check the statistics for candidate = 'Write-ins'.

In [None]:
df_president_county_candidate.info()

In [None]:
df_president_county_candidate['candidate'].unique().tolist()

In [None]:
df_president_county_candidate_write_ins= df_president_county_candidate[df_president_county_candidate['candidate'] ==' Write-ins']
df_president_county_candidate_write_ins

In [None]:
df_president_county_candidate_write_ins_state = df_president_county_candidate_write_ins.groupby('state').agg({'total_votes':'sum'}).sort_values(by=['total_votes'], ascending=False).reset_index()
df_president_county_candidate_write_ins_state

Let's check the states that do not have write-ins. 

In [None]:
df_president_county_candidate_state=df_president_county_candidate.groupby('state').agg({'total_votes':'sum'})
df_president_county_candidate_state

In [None]:
df_states_without_write_ins = pd.merge(df_president_county_candidate_state,df_president_county_candidate_write_ins_state,on= 'state', how='outer',indicator=True)
df_states_without_write_ins[df_states_without_write_ins['_merge']!='both']

Hence there are 10 states that do not have 'write_ins'.
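The technique used above is an outer merge with `indicator=True`, which adds a `_merge` column saying which side each row came from. A minimal sketch on invented data follows. The frames `all_states` and `write_in_states` and their values are hypothetical.

```python
import pandas as pd

# Invented stand-ins for the per-state totals and the per-state write-in totals
all_states = pd.DataFrame({"state": ["A", "B", "C"], "total_votes": [100, 200, 300]})
write_in_states = pd.DataFrame({"state": ["A", "C"], "write_in_votes": [5, 7]})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = pd.merge(all_states, write_in_states, on="state", how="outer", indicator=True)

# 'left_only' marks states present in the full list but absent from the write-in table
missing_write_ins = merged.loc[merged["_merge"] == "left_only", "state"].tolist()
print(missing_write_ins)
```

Filtering on `'left_only'` rather than on `!= 'both'` makes the direction of the mismatch explicit, which helps if the second table could ever contain states missing from the first.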

Let's analyze the data for our 2 main candidates, Joe Biden and Donald Trump.

# 3.1. Joe Biden vs Donald Trump

In [None]:
df_president_county_candidate_main = df_president_county_candidate[(df_president_county_candidate.candidate=='Joe Biden')| (df_president_county_candidate.candidate=='Donald Trump')]
df_president_county_candidate_main

In [None]:
group_col = ['state','candidate']
df_president_county_candidate_main_group= df_president_county_candidate_main.groupby(group_col).agg({'total_votes':'sum'}).reset_index()
df_president_county_candidate_main_group=df_president_county_candidate_main_group.set_index('state')
df_president_county_candidate_main_group


In [None]:
df_president_county_candidate_total = df_president_county_candidate.groupby(['candidate']).agg({'total_votes':'sum'})

In [None]:
df_president_county_candidate_total.plot(kind='bar', color ='lightblue',figsize= (20,10))

It is no surprise that Joe Biden and Donald Trump are the main candidates. Let's see who the other candidates are and which of them got the most votes.

In [None]:
df_president_county_candidate_not_main_total = df_president_county_candidate_total[(df_president_county_candidate_total.index != "Joe Biden") & (df_president_county_candidate_total.index != "Donald Trump")]
df_president_county_candidate_not_main_total.sort_values(by='total_votes', ascending=False)
#df_president_county_candidate_total.plot(kind='bar', color ='lightblue',figsize= (20,10))

In [None]:
df_president_county_candidate_not_main_total.plot(kind='bar', color ='lightgray',figsize= (20,10))

Jo Jorgensen clearly leads the other candidates with over 1 million votes. For comparison, Kanye West got about 66K.

In [None]:
df_president_county_candidate_main_total = df_president_county_candidate_total[(df_president_county_candidate_total.index == "Joe Biden") | (df_president_county_candidate_total.index == "Donald Trump")]
df_president_county_candidate_main_total

In [None]:
df_president_county_candidate_main_total.plot(kind='bar', color ='orange',figsize= (20,10))

In [None]:
print("About dataset: General information about reporting votes to presidential race by state.\n")
print(df_president_state.head())

In [None]:
df_president_state.info()

In [None]:
df_president_state.describe()

In [None]:
df_president_county_total_votes = df_president_county.groupby(['state']).agg({'total_votes':'sum'}).reset_index()
df_president_county_total_votes.head()

I want to compare total votes in two datasets : total votes by county and total votes by state.

In [None]:
df_total_votes_check = pd.merge(df_president_county_total_votes,df_president_state,how = 'inner', on = 'state')
df_total_votes_check['vote_diff'] = df_total_votes_check['total_votes_y'] - df_total_votes_check['total_votes_x']
df_total_votes_check

The largest difference in total votes is in Illinois. I am not sure whether this is an issue in the provided datasets or not.

In [None]:
df_total_votes_check.sort_values(by='vote_diff', ascending=True)

In [None]:
df_president_county_candidate_main_group.head()

In [None]:
df_president_state.head()

In [None]:
df_president_candidate_main = pd.merge(df_president_county_candidate_main_group, df_president_state, how='left', on ='state').rename(columns = {'total_votes_x':'candidate_votes','total_votes_y':'reg_votes'})
df_president_candidate_main.head()

I want to calculate percentage that our 2 main candidates received for each state.

In [None]:
df_president_candidate_main['percent'] = (df_president_candidate_main['candidate_votes'] / df_president_candidate_main['reg_votes']).astype('float').round(4)
df_president_candidate_main.head()

In [None]:
df_president_candidate_main.sort_values(by=['state','percent'], ascending = False)

In [None]:
# Flag rows where the candidate received more than 50% of the state's votes
df_president_candidate_main['winner'] = df_president_candidate_main['percent'] > 0.50

df_president_candidate_main

In [None]:
df_president_candidate_main_ev = pd.merge(df_president_candidate_main, df_electortal_votes, how='left',left_on='state', right_on ='US State')
df_president_candidate_main_ev

In [None]:
df_president_candidate_main_pivot = df_president_candidate_main.pivot(index='state', columns ='candidate', values ='candidate_votes')

df_president_candidate_main_pivot.head()

I want to explore two types of graphs to see which representation is visually better.

In [None]:
df_president_candidate_main_pivot.plot(kind='bar',stacked = True,figsize= (20,10))

In [None]:
df_president_candidate_main_pivot2 = df_president_candidate_main.pivot(index='state', columns ='candidate', values ='percent')


In [None]:
df_president_candidate_main_pivot2.plot(kind='bar',yticks =[0.0,0.2,0.3,0.4,0.5,0.6,0.8,1.0,1.2] , stacked = True,figsize= (30,20)).yaxis.grid(linestyle='--') # horizontal  lines

#df_president_candidate_main_pivot2.plot(kind='bar',stacked = True,figsize= (20,10))
#plt.grid(True, which = 'major',linestyle='--')


And Illinois again: its combined share appears to exceed 100%. I would definitely suspect a data issue behind this result.

In [None]:
df_president_candidate_main_Illinois = df_president_candidate_main[df_president_candidate_main['state']=='Illinois']
df_president_candidate_main_Illinois

In [None]:
df_president_candidate_main_California = df_president_candidate_main[df_president_candidate_main['state']=='California']
df_president_candidate_main_California

Below is the list of states where neither Trump nor Biden received at least 50%. There are 5 such states.

In [None]:
df_president_candidate_main_pivot2_states=df_president_candidate_main_pivot2[(df_president_candidate_main_pivot2['Donald Trump']< 0.50) & (df_president_candidate_main_pivot2['Joe Biden']< 0.50)]
df_president_candidate_main_pivot2_states


In [None]:
df_president_candidate_main_pivot2_states=df_president_candidate_main_pivot2_states.reset_index()
df_president_candidate_main_pivot2_states

# 3.2. Total electoral votes calculation.

In [None]:
merge_final =  pd.merge(df_president_candidate_main_pivot2,df_electortal_votes,how='inner',left_on='state', right_on='US State')

merge_final

In [None]:
merge_final = merge_final.set_index('US State')


In [None]:
merge_final_trump = merge_final[merge_final['Donald Trump']>0.50].agg({'Electoral Votes':'sum'})
merge_final_trump

In [None]:
merge_final_biden = merge_final[merge_final['Joe Biden']>0.50].agg({'Electoral Votes':'sum'})
merge_final_biden

In [None]:
merge_final_dispute_states=merge_final[(merge_final['Donald Trump']< 0.50) & (merge_final['Joe Biden']< 0.50)]
merge_final_dispute_states

In [None]:
merge_final_dispute_states[["Donald Trump","Joe Biden"]].plot(kind='bar',yticks =[0.0,0.3,0.5,0.7,1.0], figsize= (10,5)).yaxis.grid(linestyle='--') 

# 3.3. Simulation of Possible Outcomes for 5 States

Let's explore the chances of Biden or Trump winning based only on the 5 states where neither of them got 50%.

In [None]:
merge_final_dispute_states_total = merge_final_dispute_states['Electoral Votes'].sum()
print("Total electoral votes for states where no one got 50%: ")
print(merge_final_dispute_states_total)

In [None]:
print("Number of electoral votes Trump needs to win: ")
print(270 - int(merge_final_trump.iloc[0]))

In [None]:
print("Number of electoral votes Biden needs to win: ")
print(270 - int(merge_final_biden.iloc[0]))

To win the election, Biden needs any two of the five states with less than 50%, whereas Trump needs at least three. Even at this step it is evident that Biden has the better chance of winning.
The outcome depends on who wins each state.
A rough guess based on these state counts would be 3/5 (60%) for Biden and 2/5 (40%) for Trump, but this ignores that the states carry different numbers of electoral votes, so it should not be read as an exact probability.
Let's play with the final outcome.

**I want to find all possible combinations.**




In [None]:
#from itertools import combinations
import itertools as it

In [None]:
# Every way to assign a winner to each of the 5 states: 2**5 = 32 tuples
comb = it.product(['Trump','Biden'], repeat=5)
possible = []
for i in comb:
    possible.append(i)
    print(i)
    

In [None]:
print(possible)


In [None]:
columns=list(merge_final_dispute_states.index.values)
columns

The dataframe below indicates who won each state. There are 32 combinations in total.

In [None]:
df_combinations = pd.DataFrame(possible,columns=columns)
print(df_combinations) 

In [None]:
df_combinations['Total_Trump'] = 0
df_combinations['Total_Biden'] = 0
df_combinations['Elections Won'] = 'Default'
df_combinations

In [None]:
df_combinations_transposed = df_combinations.transpose()
df_combinations_transposed

In [None]:
final_states = merge_final_dispute_states.drop(['Donald Trump', 'Joe Biden'], axis=1)
final_states

In [None]:
df_combinations_votes = df_combinations_transposed.join(final_states, how='left').fillna(0)
df_combinations_votes

In [None]:
df_combinations_votes_copy = df_combinations_votes.copy()

# Electoral votes each candidate already holds from states won with more than 50%
trump_base = int(merge_final_trump.iloc[0])
biden_base = int(merge_final_biden.iloc[0])

# The first 5 rows are the disputed states; the remaining rows hold totals and the result
state_rows = df_combinations_votes_copy.index[:5]
state_votes = df_combinations_votes_copy.loc[state_rows, 'Electoral Votes']

# Each remaining column is one combination of state winners
for i in df_combinations_votes_copy.columns.drop('Electoral Votes'):
    picks = df_combinations_votes_copy.loc[state_rows, i]
    trump_total = int(state_votes[picks == 'Trump'].sum()) + trump_base
    biden_total = int(state_votes[picks == 'Biden'].sum()) + biden_base
    # Use .loc rather than chained indexing so the assignment reliably updates the frame
    df_combinations_votes_copy.loc['Total_Trump', i] = trump_total
    df_combinations_votes_copy.loc['Total_Biden', i] = biden_total
    if trump_total >= 270:
        df_combinations_votes_copy.loc['Elections Won', i] = 'Trump'
    if biden_total >= 270:
        df_combinations_votes_copy.loc['Elections Won', i] = 'Biden'
    if trump_total == biden_total:
        df_combinations_votes_copy.loc['Elections Won', i] = 'Draw'


In [None]:
df_combinations_votes_copy

In [None]:
df_combinations_votes_copy_drop = df_combinations_votes_copy.drop(['Electoral Votes'], axis=1)
df_combinations_votes_copy_drop

In [None]:
elections_combinations = df_combinations_votes_copy_drop.transpose()
elections_combinations

Interestingly, based on the table above, a draw would occur in exactly one case.

In [None]:
elections_combinations_graph = elections_combinations.drop(['Arizona', 'Georgia', 'North Carolina', 'Pennsylvania', 'Wisconsin'], axis = 1) 
elections_combinations_graph

In [None]:
import plotly.express as px

fig = px.scatter(elections_combinations_graph, y="Elections Won")

fig.update_traces(marker=dict(color='LightSkyBlue', size=20,
                              line=dict(width=1,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()

It is clear that Biden has the better chance of winning: he wins the election in 26 of the combinations, whereas Trump wins in only 5.

In [None]:
Biden_won = elections_combinations_graph[elections_combinations_graph['Elections Won']=='Biden'].count()['Elections Won']
Biden_won


In [None]:
Trump_won = elections_combinations_graph[elections_combinations_graph['Elections Won']=='Trump'].count()['Elections Won']
Trump_won
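The whole enumeration can also be condensed into one small function. This is only a sketch under stated assumptions: the per-state electoral votes are the 2020 figures for the five states, while `biden_base` and `trump_base` are placeholder totals chosen for illustration (the notebook derives its own values in `merge_final_biden` and `merge_final_trump`). The helper name `tally_outcomes` is hypothetical.

```python
import itertools as it
from collections import Counter

# Electoral votes of the five states in the 2020 election
state_votes = {"Arizona": 11, "Georgia": 16, "North Carolina": 15,
               "Pennsylvania": 20, "Wisconsin": 10}

# ASSUMPTION: placeholder base totals, not read from the datasets
biden_base, trump_base = 249, 217

def tally_outcomes(state_votes, biden_base, trump_base, needed=270):
    """Count how many winner assignments lead to each overall result."""
    results = Counter()
    votes = list(state_votes.values())
    for winners in it.product(["Trump", "Biden"], repeat=len(votes)):
        trump = trump_base + sum(v for w, v in zip(winners, votes) if w == "Trump")
        biden = biden_base + sum(v for w, v in zip(winners, votes) if w == "Biden")
        if trump == biden:
            results["Draw"] += 1
        elif biden >= needed:
            results["Biden"] += 1
        elif trump >= needed:
            results["Trump"] += 1
        else:
            results["Undecided"] += 1
    return results

outcomes = tally_outcomes(state_votes, biden_base, trump_base)
print(dict(outcomes))
```

With these placeholder totals the tally agrees with the counts reported above (26 for Biden, 5 for Trump, 1 draw). Reading 26 out of 32 as a probability would only be valid if every combination were equally likely, which is a strong simplification.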

*We all now know the election results. It was fun playing with the provided data and getting some new information out of it.*