***Disclaimer: this is a personal practice notebook and a learning project made public, not a major achievement in data analysis or a political pamphlet.***

In the 2020 U.S. presidential election, the voting procedure differed markedly from recent elections. Because of COVID-19, the Democratic candidate Joe Biden encouraged his supporters to rely on mail-in ballots, whereas the incumbent president Donald Trump openly campaigned for in-person voting on election day. As the vote count showed, people responded to both appeals. In many states, Donald Trump's early lead vanished after the mail-in ballots were counted.

The research question this notebook asks is whether all this might have had any effect on the COVID-19 surge in different parts of the United States. 
As county-level data on COVID-19 is now available on the CDC website, it was possible to compare it with the existing data on the vote count.

The data used in this notebook is available online on Kaggle.com with additional descriptive information. Since I did not require the most recent data on COVID-19 cases, I uploaded an older version of this data as a separate dataset of my own. The reason is the ongoing government transition in the U.S., which often leads to changes in how data is collected and used. I wanted to make sure the format of the dataset stays compatible with this notebook without having to check on it every day.

*The Kaggle datasets (thanks to Kaggle user **Raphael Fontes** for the US election data and user **Heads or Tails** for the COVID-19 county data) can be found at the following external links:*

__[U.S. county COVID-19 data](https://www.kaggle.com/headsortails/covid19-us-county-jhu-data-demographics)__

__[U.S. election 2020 data](https://www.kaggle.com/unanimad/us-election-2020)__

In this notebook, the timeframe is the first two weeks after the presidential election. This roughly matches the suspected incubation period of COVID-19, so had someone been infected on election day and later tested positive, this would have shown up at some point during the respective two weeks of COVID-19 data.

On election day, many Biden supporters who voted early allegedly stayed home. Conversely, the Trump supporters allegedly were out and about. <br>

*Did the counties that Donald Trump won have a larger increase in new COVID-19 cases in the two weeks following election day?*

*Let's find out.*

***December 1st, 2020***<br>
***Jari Peltola***

In [None]:
#import modules
import pandas as pd
import numpy as np

In [None]:
# enable showing all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# load dataframe
df = pd.read_csv("../input/us-election-2020/president_county_candidate.csv")

# change boolean columns (such as 'won') to integer for easier use
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)

# remove the word "County" from county names to sync dataframes
df['county'] = df['county'].str.replace('County', '')

First we divide the county results to Trump and Biden counties based on who won there.

In [None]:
# create two subsets based on who won an individual county
df_biden_won = df[(df['won'] == 1) & (df['candidate'] == 'Joe Biden')] 
df_trump_won = df[(df['won'] == 1) & (df['candidate'] == 'Donald Trump')]

In [None]:
df_biden_won.dtypes

In [None]:
df_biden_won = df_biden_won.sort_values(by = 'county', ascending = True) 

# reset index and discard the old one
df_biden_won = df_biden_won.reset_index(drop = True)

df_biden_won.head(20)

In [None]:
df_trump_won = df_trump_won.sort_values(by = 'county', ascending = True) 

# reset index and discard the old one
df_trump_won = df_trump_won.reset_index(drop = True)

df_trump_won.head(20)

In [None]:
# remove whitespace from the state and county values to enable the later merge
df_biden_won['state'] = df_biden_won['state'].str.replace(' ', '', regex=False)
df_biden_won['county'] = df_biden_won['county'].str.replace(' ', '', regex=False)

# create a 'CountyState' key column from the county and state names
# this prevents mixing up different counties that share a name
df_biden_won['CountyState'] = df_biden_won['county'] + '_' + df_biden_won['state']

In [None]:
# same cleanup and key column for the Trump counties
df_trump_won['state'] = df_trump_won['state'].str.replace(' ', '', regex=False)
df_trump_won['county'] = df_trump_won['county'].str.replace(' ', '', regex=False)
df_trump_won['CountyState'] = df_trump_won['county'] + '_' + df_trump_won['state']
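
As a side note, here is a minimal sketch of why the key combines county and state. The rows below are made up for illustration and are not read from the election dataset; only the cleanup steps mirror the ones used above.

```python
import pandas as pd

# toy rows for illustration only, not taken from the election dataset
toy = pd.DataFrame({
    "county": ["Washington ", "Washington ", "Salt Lake "],
    "state": ["Ohio", "Utah", "Utah"],
})

# strip all spaces, as the notebook does, then join county and state
toy["county"] = toy["county"].str.replace(" ", "", regex=False)
toy["state"] = toy["state"].str.replace(" ", "", regex=False)
toy["CountyState"] = toy["county"] + "_" + toy["state"]

# county names alone collide, the combined key does not
county_is_unique = toy["county"].is_unique
key_is_unique = toy["CountyState"].is_unique
print(toy["CountyState"].tolist())
```

With the county name alone, the two "Washington" rows would be indistinguishable in a merge; the combined key keeps them apart.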

In [None]:
df_biden_won = df_biden_won.sort_values(by = 'CountyState', ascending = True) 

# reset index and discard the old one
df_biden_won = df_biden_won.reset_index(drop = True)

In [None]:
df_trump_won = df_trump_won.sort_values(by = 'CountyState', ascending = True) 

# reset index and discard the old one
df_trump_won = df_trump_won.reset_index(drop = True)

In [None]:
df_biden_won.head(20)

In [None]:
df_trump_won.head(20)

From the dataframe shapes, we can see that by count Trump won twice as many individual counties as Joe Biden, whose supporters mostly live near large metro areas.

In [None]:
#get dataframe shape
shape = df_biden_won.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
#get dataframe shape
shape = df_trump_won.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
# select preferred columns
df_biden_counties = df_biden_won.loc[:,['CountyState', 'candidate', 'won']]

In [None]:
df_trump_counties = df_trump_won.loc[:,['CountyState', 'candidate', 'won']]

Next we bring in the county COVID-19 data to later match it with the election results.

In [None]:
# load dataframe
df_county_covid = pd.read_csv("../input/covid-19-us-counties-22-2020/covid_us_county.csv")

df_county_covid.head(10)

In [None]:
df_county_covid.dtypes

The 'date' column in the dataframe needs to be changed into DateTime format.

In [None]:
# change column to DateTime format (the format is inferred by default)
df_county_covid['date'] = pd.to_datetime(df_county_covid['date'])

In [None]:
# mask timeframe covering two weeks after election
start_date = '2020-11-03'
end_date = '2020-11-18'

# wear a mask
mask = (df_county_covid['date'] >= start_date) & (df_county_covid['date'] < end_date)
df_county_covid = df_county_covid.loc[mask]

# drop NaN rows with no date data (reassign instead of modifying a slice in place)
df_county_covid = df_county_covid.dropna(subset=['date'])

# reset index 
df_county_covid.reset_index(inplace = True) 
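
To make the window boundaries explicit, here is a minimal sketch on made-up rows (the toy dataframe is an assumption, only the two boundary dates come from the cell above). It shows that the start date is included and the end date is excluded.

```python
import pandas as pd

# toy rows for illustration only
toy = pd.DataFrame({
    "date": ["2020-11-02", "2020-11-03", "2020-11-17", "2020-11-18"],
    "cases": [10, 12, 20, 21],
})
toy["date"] = pd.to_datetime(toy["date"])

start_date = pd.Timestamp("2020-11-03")
end_date = pd.Timestamp("2020-11-18")

# half-open window: start date included, end date excluded
in_window = (toy["date"] >= start_date) & (toy["date"] < end_date)
window = toy.loc[in_window].reset_index(drop=True)

kept_dates = window["date"].dt.strftime("%Y-%m-%d").tolist()
print(kept_dates)
```

The last day kept by this kind of mask is therefore November 17th, which is the end date used later in the notebook.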

In [None]:
#get dataframe shape
shape = df_county_covid.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
# remove whitespace from the county, state and state code values
df_county_covid['county'] = df_county_covid['county'].str.replace(' ', '', regex=False)
df_county_covid['state'] = df_county_covid['state'].str.replace(' ', '', regex=False)
df_county_covid['state_code'] = df_county_covid['state_code'].str.replace(' ', '', regex=False)

# rename column
df_county_covid = df_county_covid.rename(columns = {'state_code':'StateCode'})

# create the same 'CountyState' key column as in the election result dataframes
df_county_covid['CountyState'] = df_county_covid['county'] + '_' + df_county_covid['state']

In [None]:
df_county_covid = df_county_covid.sort_values(by = 'CountyState', ascending = True) 

# reset index 
df_county_covid.reset_index(inplace = True) 

# drop the leftover index columns and the unused 'fips' column
col = ['index', 'level_0', 'fips']
df_county_covid = df_county_covid.drop(col, axis=1)

In [None]:
df_county_covid.head(20)

We need only two days of data from the COVID-19 dataset: the first and the last day of the two-week period under further analysis.

In [None]:
# selecting rows based on condition 
df_county_covid_0411 = df_county_covid.loc[df_county_covid['date'] == '2020-11-04'] 
df_county_covid_1711 = df_county_covid.loc[df_county_covid['date'] == '2020-11-17'] 

In [None]:
#get dataframe shape
shape = df_county_covid_0411.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
#get dataframe shape
shape = df_county_covid_1711.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
df_county_covid_0411 = df_county_covid_0411.sort_values(by = 'CountyState', ascending = True) 

# reset index and discard the old one
df_county_covid_0411 = df_county_covid_0411.reset_index(drop = True)

In [None]:
df_county_covid_0411.head(20)

In [None]:
df_county_covid_1711 = df_county_covid_1711.sort_values(by = 'CountyState', ascending = True) 

# reset index and discard the old one
df_county_covid_1711 = df_county_covid_1711.reset_index(drop = True)

In [None]:
df_county_covid_1711.head(20)

In [None]:
# select relevant columns for further use
df_county_covid_0411 = df_county_covid_0411.loc[:,['date', 'CountyState', 'cases', 'deaths', 'StateCode']]
df_county_covid_1711 = df_county_covid_1711.loc[:,['date', 'CountyState', 'cases', 'deaths', 'StateCode']]

In [None]:
df_county_covid_0411.head()

Now we can merge the datasets using a left join. The result is four new dataframes, two for Biden and two for Trump, each containing both county election data and COVID-19 case data.

In [None]:
# left join on the shared 'CountyState' key
df_biden_0411 = pd.merge(df_biden_counties, df_county_covid_0411, on='CountyState', how='left')
df_biden_1711 = pd.merge(df_biden_counties, df_county_covid_1711, on='CountyState', how='left')

In [None]:
# left join on the shared 'CountyState' key
df_trump_0411 = pd.merge(df_trump_counties, df_county_covid_0411, on='CountyState', how='left')
df_trump_1711 = pd.merge(df_trump_counties, df_county_covid_1711, on='CountyState', how='left')
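
A left join keeps every election row, so counties with no match in the COVID-19 data end up with NaN values. As a minimal sketch on made-up keys (the toy dataframes are assumptions, not the real data), the `indicator` option of `pd.merge` can show which rows failed to match before they are dropped.

```python
import pandas as pd

# toy data for illustration only
counties = pd.DataFrame({"CountyState": ["A_X", "B_X", "C_Y"], "won": [1, 1, 1]})
covid = pd.DataFrame({"CountyState": ["A_X", "C_Y"], "cases": [100.0, 250.0]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(counties, covid, on="CountyState", how="left", indicator=True)

# election rows without any COVID-19 match
unmatched = merged.loc[merged["_merge"] == "left_only", "CountyState"].tolist()

# what remains after dropping the unmatched rows
matched = merged.dropna(subset=["cases"]).drop(columns="_merge")
print(unmatched)
```

Inspecting the unmatched keys is a quick way to spot naming differences between the two datasets before they silently disappear in `dropna()`.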

In [None]:
df_biden_0411 = df_biden_0411.dropna()
df_biden_0411.head(20)

In [None]:
df_trump_0411 = df_trump_0411.dropna()
df_trump_0411.head(20)

In [None]:
df_biden_1711 = df_biden_1711.dropna()
df_biden_1711.head(20)

In [None]:
df_trump_1711 = df_trump_1711.dropna()
df_trump_1711.head(20)

In [None]:
#get dataframe shape
shape = df_biden_0411.shape
print('\nshape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
#get dataframe shape
shape = df_trump_0411.shape
print('\nshape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
#get dataframe shape
shape = df_biden_1711.shape
print('\nshape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
#get dataframe shape
shape = df_trump_1711.shape
print('\nshape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
# rename the case columns by date and reset the index after dropping rows
df_biden_0411 = df_biden_0411.rename(columns = {'cases':'cases_0411'}).reset_index(drop = True)
df_biden_1711 = df_biden_1711.rename(columns = {'cases':'cases_1711'}).reset_index(drop = True)
df_trump_0411 = df_trump_0411.rename(columns = {'cases':'cases_0411'}).reset_index(drop = True)
df_trump_1711 = df_trump_1711.rename(columns = {'cases':'cases_1711'}).reset_index(drop = True)

In [None]:
df_biden_0411.head(10)

In [None]:
df_biden_1711.head(10)

In [None]:
df_trump_0411.head(10)

In [None]:
df_trump_1711.head(10)

Next, the COVID-19 case data from the two dates we are interested in is collected into the same dataframe. As a result, we get one dataframe for Biden and one for Trump with all the data we need.

In [None]:
# copy the November 17th case counts into the November 4th dataframe
# note: this positional copy assumes both dataframes hold the same counties in the same order
df_biden_0411['cases_1711'] = df_biden_1711['cases_1711'].to_numpy()

df_biden_0411.head(10)

In [None]:
# copy the November 17th case counts into the November 4th dataframe
# note: this positional copy assumes both dataframes hold the same counties in the same order
df_trump_0411['cases_1711'] = df_trump_1711['cases_1711'].to_numpy()

df_trump_0411.head(10)
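
Copying a column over by position only works when both dataframes list the same counties in the same order. A more defensive alternative, shown here as a minimal sketch on made-up rows (the toy dataframes are assumptions), is to join the two dates on the `CountyState` key instead.

```python
import pandas as pd

# toy data for illustration only
first = pd.DataFrame({
    "CountyState": ["A_X", "B_X", "C_Y"],
    "cases_0411": [100, 50, 200],
})
# deliberately in a different order and with one county missing
last = pd.DataFrame({
    "CountyState": ["C_Y", "A_X"],
    "cases_1711": [260, 130],
})

# an inner join keeps only counties present on both dates, matched by key
combined = first.merge(last, on="CountyState", how="inner")
combined["CaseIncrease"] = combined["cases_1711"] - combined["cases_0411"]
print(combined)
```

Here the county missing on the second date simply drops out, and the remaining rows are matched correctly regardless of their order.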

In [None]:
#create new column for case increase between the two dates
df_biden_0411['CaseIncrease'] = df_biden_0411['cases_1711'] -  df_biden_0411['cases_0411']

df_biden_0411.head(10)

In [None]:
df_trump_0411['CaseIncrease'] = df_trump_0411['cases_1711'] -  df_trump_0411['cases_0411']

df_trump_0411.head(10)

In [None]:
# make case number integers
df_biden_0411.cases_0411= df_biden_0411.cases_0411.astype(int)
df_biden_0411.cases_1711= df_biden_0411.cases_1711.astype(int)
df_biden_0411.deaths = df_biden_0411.deaths.astype(int)
df_biden_0411.CaseIncrease= df_biden_0411.CaseIncrease.astype(int)

df_biden_0411.head(10)

In [None]:
df_trump_0411.cases_0411= df_trump_0411.cases_0411.astype(int)
df_trump_0411.cases_1711= df_trump_0411.cases_1711.astype(int)
df_trump_0411.deaths = df_trump_0411.deaths.astype(int)
df_trump_0411.CaseIncrease= df_trump_0411.CaseIncrease.astype(int)

df_trump_0411.head(10)

We must check whether there are zero values in the November 4th case data, since that column will be used as the divisor later. Counties with zero cases on that date are dropped.
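
To see why this matters, here is a minimal sketch on made-up numbers (the toy dataframe is an assumption). In pandas, dividing by a zero baseline does not raise an error; it quietly produces an infinite value, which would then distort the mean.

```python
import numpy as np
import pandas as pd

# toy data for illustration only
toy = pd.DataFrame({"cases_0411": [0, 50, 200], "CaseIncrease": [5, 10, 20]})

# dividing by a zero baseline gives inf rather than raising an error
raw = toy["CaseIncrease"] / toy["cases_0411"] * 100
has_inf = bool(np.isinf(raw).any())

# dropping zero baselines first keeps the mean finite
clean = toy[toy["cases_0411"] != 0]
clean_mean = (clean["CaseIncrease"] / clean["cases_0411"] * 100).mean()
print(has_inf, clean_mean)
```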

In [None]:
print(df_biden_0411.loc[df_biden_0411['cases_0411'] == 0])

In [None]:
df_biden_0411 = df_biden_0411[df_biden_0411.cases_0411 != 0]

In [None]:
#get dataframe shape
shape = df_biden_0411.shape
print('\nshape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

Next, case increase percentage will be calculated. The two columns used for this are the case numbers from November 4th and the number of increased COVID-19 cases (from November 4th to November 17th) calculated earlier.
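
As a worked example of the formula, here is a minimal sketch for a single county. The function name `percent_increase` and the numbers are hypothetical and only illustrate the arithmetic: new cases divided by the starting case count, times 100.

```python
def percent_increase(cases_start, cases_end):
    """Percentage growth from cases_start to cases_end, rounded to one decimal."""
    if cases_start == 0:
        raise ValueError("percentage increase is undefined for a zero baseline")
    return round((cases_end - cases_start) / cases_start * 100, 1)


# made-up county: 200 cases on November 4th, 250 cases on November 17th
example = percent_increase(200, 250)
print(example)
```

So 50 new cases on a baseline of 200 correspond to a 25 percent increase.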

In [None]:
# percentage increase relative to the November 4th case count,
# rounded to one decimal and stored in a new column
df_biden_0411 = df_biden_0411.assign(
    IncreasePerc=(df_biden_0411['CaseIncrease'] / df_biden_0411['cases_0411'] * 100).round(1)
)

In [None]:
df_biden_0411.head(10)

We can now see the mean and median percentage increase in COVID-19 cases for Biden counties in the two weeks following the election.

In [None]:
biden_mean = df_biden_0411['IncreasePerc'].mean()
biden_median = df_biden_0411['IncreasePerc'].median()

print('\nBiden county mean :', biden_mean)
print('\nBiden county median :', biden_median)

In the Trump data, there are zero values, especially in the Utah data. As we are not concentrating on state differences, those rows will be dropped.

In [None]:
print(df_trump_0411.loc[df_trump_0411['cases_0411'] == 0])

In [None]:
df_trump_0411 = df_trump_0411[df_trump_0411.cases_0411 != 0]

In [None]:
#get dataframe shape
shape = df_trump_0411.shape
print('\nshape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
# percentage increase relative to the November 4th case count,
# rounded to one decimal and stored in a new column
df_trump_0411 = df_trump_0411.assign(
    IncreasePerc=(df_trump_0411['CaseIncrease'] / df_trump_0411['cases_0411'] * 100).round(1)
)

In [None]:
df_trump_0411.head(10)

Now we can get the mean and median percentages for Trump counties as well.

In [None]:
trump_mean = df_trump_0411['IncreasePerc'].mean()
trump_median = df_trump_0411['IncreasePerc'].median()

print('\nTrump county mean :', trump_mean)
print('\nTrump county median :', trump_median)

On average, counties Donald Trump won had a 35 percent increase in COVID-19 cases in the two weeks following the election. By contrast, counties Joe Biden won had only about a 23 percent increase in cases. 

***This does not automatically mean that the increased COVID-19 cases are a direct consequence of Trump supporters' election day activity. Not all Biden voters stayed at home on November 3rd, nor did all Trump supporters actively mingle with other people without masks or social distancing. What we see is a broader trend, one fact among others, with numerous possible explanations.*** 

In [None]:
print('\naverage increase percent of new COVID-19 cases (04.11.2020-17.11.2020) in counties Joe Biden won :', biden_mean)
print('\naverage increase percent of new COVID-19 cases (04.11.2020-17.11.2020) in counties Donald Trump won :', trump_mean)
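
As a closing side note, the mean can be pulled upward by a few counties with a very small starting case count, which is why the median is reported alongside it. A minimal sketch on made-up percentages (the toy dataframe and its `winner` column are assumptions, not the notebook's results) puts both statistics side by side.

```python
import pandas as pd

# toy percentages for illustration only, not the notebook's results
toy = pd.DataFrame({
    "winner": ["Biden", "Biden", "Biden", "Trump", "Trump", "Trump"],
    "IncreasePerc": [10.0, 20.0, 30.0, 10.0, 20.0, 90.0],
})

# mean, median and county count per winner in one table
summary = toy.groupby("winner")["IncreasePerc"].agg(["mean", "median", "count"])
print(summary)
```

In this toy data a single outlier doubles one group's mean while both medians stay the same, so comparing the two statistics gives a hint of how much outliers drive the result.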