## 2020 US General Election Turnout Visualisation
Hello everyone! Today we will take a look at the 2020 US General Election Turnout dataset and visualise its features. We will also plot how the distribution of each feature changes under different transformations.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from scipy import stats
from seaborn import heatmap
from collections import Counter
from sklearn.preprocessing import StandardScaler as ss, MinMaxScaler as mms

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/2020-us-general-election-turnout-rates/2020 November General Election - Turnout Rates.csv')

In [None]:
df.head()

## Preparing the dataset
Firstly, the null values are replaced with zeros.

In [None]:
for col in df:
    df[col] = df[col].fillna(0)

Then, we save the 'State Abv' column in a variable 'state_abv' so the abbreviations can label the charts later, and drop the 'State', 'Source', 'Official/Unofficial', 'Overseas Eligible' and 'State Abv' columns from df.

In [None]:
state_abv = df['State Abv']
df = df.drop(['State', 'Source', 'Official/Unofficial', 'Overseas Eligible', 'State Abv'], axis=1)

In [None]:
cols = ['Total Ballots Counted (Estimate)', 'Vote for Highest Office (President)', 
        'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', 'Prison', 'Probation', 
        'Parole', 'Total Ineligible Felon']

Now, we convert the count columns to numerical format by removing the thousands-separator commas and converting each value from string to integer.

In [None]:
for col in cols:
    df[col] = [int(''.join(str(i).split(','))) for i in df[col]]
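
The same conversion can also be expressed with pandas string methods instead of a list comprehension. This is a minimal sketch on a made-up column, not the notebook's data; the variable names `raw` and `clean` are illustrative only.

```python
import pandas as pd

# Toy column with thousands separators; the values are made up for illustration
raw = pd.Series(['1,234', '56', '7,890,123', 0])

# Cast to str first so the integer zeros left by fillna(0) are handled too,
# then strip the commas and convert to integers
clean = raw.astype(str).str.replace(',', '', regex=False).astype(int)

print(clean.tolist())
```

Casting to `str` before stripping mirrors the `str(i)` call in the cell above, so filled-in zeros do not need special handling.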

Next, the two percentage columns are converted to floats by stripping the trailing '%' sign. We only wish to examine the individual states, so the 'United States' summary row (index 0) is then removed.

In [None]:
for col in ['VEP Turnout Rate', '% Non-citizen']:
    df[col] = [float(i[:-1]) for i in df[col]]

df = df.drop(0)
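
As a small illustration of the percentage conversion, the sketch below strips the '%' sign with pandas string methods. The values are made up, and `rates` and `as_float` are illustrative names rather than part of the notebook.

```python
import pandas as pd

# Toy percentage strings; the values are made up for illustration
rates = pd.Series(['66.8%', '70.1%', '0.5%'])

# Remove the trailing percent sign, then parse the remainder as a float
as_float = rates.str.rstrip('%').astype(float)

print(as_float.tolist())
```

Unlike slicing with `i[:-1]`, `rstrip('%')` only removes a character when it really is a percent sign.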

Here, the 'bar_charts' function is defined. It plots every state with a non-zero value, sorted in descending order, and displays above each bar the percentage that the state contributes to the column total.

In [None]:
def bar_charts(column, x, y, title):
    colors = np.array([['#EEE888', '#776699', '#DD111D', '#FFFF22', '#f8f8ff', '#F0EE82', '#BB69E4', 
                       '#BBBECC', '#CDD98C', 'b']*5]).reshape(1, -1)[0]

    fig, ax = plt.subplots(1, 1, figsize=(17, 9))
    values = pd.Series(dict(zip(state_abv[1:], 
                                df[column]))).sort_values(ascending=False)
    values = values[values!=0]
    
    rects = ax.bar(values.keys(), values, color=colors, edgecolor='black', linewidth=1.5)
    ax.set_title(title)
    ax.set_xlabel(x)
    ax.set_ylabel(y)

    total = df[column].sum()
    for rect in rects:
        height = rect.get_height()
        # Label each bar with its share of the column total, to one decimal place.
        # A format specifier avoids indexing into str(percent), which raises an
        # IndexError for short values such as 0.5
        label = f'{height / total * 100:.1f}'
        ax.text(rect.get_x() + rect.get_width()/2., height, label, ha='center', va='bottom',
                fontsize=10)

    plt.show()
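
The numbers printed above the bars are each state's share of the column total. A minimal sketch of that calculation, using made-up values and hypothetical state codes, might look like this:

```python
import pandas as pd

# Hypothetical state codes and made-up counts, for illustration only
values = pd.Series({'AA': 50, 'BB': 30, 'CC': 20})

# Share of the column total, in percent
shares = values / values.sum() * 100

# Format to one decimal place with a format specifier
labels = [f'{s:.1f}' for s in shares]

print(labels)
```

Formatting the share with a format specifier, rather than slicing characters out of its string form, keeps the label correct for short values such as 0.5 and for very small values that print in scientific notation.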

This 'pie_charts' function plots a pie chart of the twenty states with the largest values in the column, along with their percentages.

In [None]:
def pie_charts(column):
    num = 20
    values = pd.Series(dict(zip(state_abv[1:], df[column]))).sort_values(ascending=False)
    fig, ax = plt.subplots(1, 1, figsize=(9, 9))
    ax.pie(values[:num], labels=values.keys()[:num], autopct=lambda p:f'{p:.2f}%', labeldistance=0.75, 
           explode=[0.1]*num, shadow=True, textprops={'fontsize':10})
    plt.title(column)
    plt.show()
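
One detail worth keeping in mind when reading the pie charts: matplotlib normalises the values it is given, so the percentages are shares of the twenty plotted states rather than of all states. The sketch below illustrates the difference with made-up values, hypothetical state codes, and a top 2 instead of a top 20.

```python
import pandas as pd

# Hypothetical state codes and made-up values, for illustration only
values = pd.Series({'AA': 40, 'BB': 30, 'CC': 20, 'DD': 10})

# Keep only the largest entries, as pie_charts does with num = 20
top = values.sort_values(ascending=False)[:2]

# What the pie chart displays: share of the plotted slices only
share_of_slices = top / top.sum() * 100

# Share of the full column, for comparison
share_of_total = top / values.sum() * 100

print(share_of_slices.round(2).to_dict())
print(share_of_total.round(2).to_dict())
```

This is why a pie chart and a bar chart of similar columns can report different percentages for the same state.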

## Total Ballots Counted (Estimate)
The following pie chart shows that California has the most counted ballots, with a little under one seventh of the ballots across the twenty states shown. Texas follows with almost a tenth, and Florida has slightly less than Texas.

In [None]:
pie_charts('Total Ballots Counted (Estimate)')

## Vote for Highest Office (President)
The bar chart below visualises the number of votes for highest office. However, due to the significant number of missing values in this feature, it is not a fair reflection of the real votes for highest office. Among the states with data, Texas leads with over a fifth, followed by Michigan with over a tenth and Virginia with 8%.

In [None]:
bar_charts('Vote for Highest Office (President)', 'State', 'Number of votes for highest office', 
           'Number of votes for highest office per state (%)')

## VEP Turnout Rate
The subsequent pie chart shows the twenty states with the highest VEP (Voting-Eligible Population) turnout rates. Each slice takes up roughly 4-5% of the pie, which tells us that turnout is roughly the same across these states.

In [None]:
pie_charts('VEP Turnout Rate')

## Voting-Eligible Population
Now we will look at how much of the population in each state is eligible to vote. California leads with 11% of the total, followed by Texas with 8% and Florida with 6%. After that, the values decrease gradually.

In [None]:
bar_charts('Voting-Eligible Population (VEP)', 'State', 'Population', 'Voting-Eligible Population per state (%)')

## Voting-Age Population
The voting-age population has California in the lead with 15%, then Texas at 11% and Florida with 9%. These shares are relative to the twenty states in the pie chart.

In [None]:
pie_charts('Voting-Age Population (VAP)')

## % Non-citizens
The subsequent bar graph shows that the state with the highest non-citizen value is California, with 5%, followed by Texas with 4.5% and Nevada with 4.1%. Note that this column describes non-citizens in the population, not votes cast by non-citizens.

In [None]:
bar_charts('% Non-citizen', 'State', 'Non-citizen (%)', 'Non-citizens per state (%)')

## Prison
The next pie chart visualises how many people are in prison per state. Texas (16.5%) has the most, then California (11.2%) and Florida (9.8%).

In [None]:
pie_charts('Prison')

## Probation
The bar chart below shows how many people are on probation per state. Georgia has the most, with over a fifth, then Texas with a little under one fifth, followed by Florida with roughly a tenth.

In [None]:
bar_charts('Probation', 'State', 'Number of probations', 'Number of probations per state (%)')

## Parole
The final pie chart shows how many people are on parole per state. The state with the most people on parole is Texas, with over a fifth, followed by California with a little less than Texas. These two states make up almost half of the total across the twenty states shown!

In [None]:
pie_charts('Parole')

## Ineligible Felons
The final bar graph shows how many felons cannot vote in each state. Texas has the highest count (16.8%), followed by Georgia (11.2%) and Florida (7.6%).

In [None]:
bar_charts('Total Ineligible Felon', 'State', 'Ineligible Felons', 'Number of Ineligible Felons per state (%)')

## Visualising data transformations
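Now we will look at the distributions of our features and compare them with how they look after a log transform, a Box-Cox transform, standard scaling and min-max scaling. Each feature is shifted by 1 before the log and Box-Cox transforms, since both require strictly positive input. The vertical lines in each histogram mark the deciles.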

In [None]:
for col in df.columns.drop('VEP Turnout Rate'):
    fig, axes = plt.subplots(1, 5, figsize=(15, 3))
    
    f1 = df[col]
    f2 = (df[col]+1).transform(np.log)
    f3 = pd.DataFrame(stats.boxcox(df[col]+1)[0])
    f4 = pd.DataFrame(ss().fit_transform(np.array(df[col]).reshape(-1, 1)))
    f5 = pd.DataFrame(mms().fit_transform(np.array(df[col]).reshape(-1, 1)))
    
    for plot in [[axes[0], f1, 'lightgreen', 'Normal'], [axes[1], f2, 'pink', 'Log'], 
                 [axes[2], f3, 'yellow', 'Box Cox'], [axes[3], f4, 'orange', 'Standard Scaler'], 
                 [axes[4], f5, 'skyblue', 'Min Max Scaler']]:
        ax = plot[0]
        feature = plot[1]
        colour = plot[2]
        transform = plot[3]
        
        feature.hist(ax=ax, color=colour)
        ax.set_title(transform)
        ax.set_xlabel(col)
        
        deciles = feature.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
        for pos in np.array(deciles).reshape(1, -1)[0]:
            handle = ax.axvline(pos, color='darkblue', linewidth=1)
        ax.legend([handle], ['decile'])

plt.show()
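
To make the four transforms concrete, here is a NumPy-only sketch of what each one computes on a small made-up array. It is an illustration under stated assumptions rather than the notebook's method: `box_cox` is a hypothetical helper that takes a fixed lambda, whereas `stats.boxcox` estimates lambda from the data when none is given, and the scaler lines reproduce the default behaviour of `StandardScaler` and `MinMaxScaler` by hand.

```python
import numpy as np

# Made-up, right-skewed values for illustration
x = np.array([0.0, 9.0, 99.0, 999.0])

# Log transform with the same +1 shift used in the notebook
log_x = np.log(x + 1)

# Standard scaling: subtract the mean, divide by the population standard deviation
standard = (x - x.mean()) / x.std()

# Min-max scaling to the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

def box_cox(values, lam):
    """Box-Cox transform for a fixed lambda (hypothetical helper, not SciPy's API)."""
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1) / lam

# With lambda = 0 the Box-Cox transform reduces to the log transform
box_cox_x = box_cox(x + 1, 0.0)

print(log_x.round(3), standard.round(3), min_max.round(3), sep='\n')
```

Standard and min-max scaling are linear, so they change the axis but not the shape of the histogram; only the log and Box-Cox transforms reduce the skew.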

The final piece of visualisation that we will perform is binning seven of our features into equal-width bins (about 100 per feature) and plotting the distribution of the bin numbers.

In [None]:
j = 0
fig, axes1 = plt.subplots(1, 3, figsize=(15, 3))
fig, axes2 = plt.subplots(1, 4, figsize=(15, 3))
axes = pd.concat([pd.Series(axes1), pd.Series(axes2)]).reset_index(drop=True)

for i in [['orange', 'Total Ballots Counted (Estimate)'], ['pink', 'Voting-Eligible Population (VEP)'],
        ['skyblue', 'Voting-Age Population (VAP)'], ['lightblue', 'Prison'], ['lightgreen', 'Probation'],
        ['yellow', 'Parole'], ['orange', 'Total Ineligible Felon']]:
    name = i[1]
    colour = i[0]
    col = df[name]
    diff = (col.max() - col.min()) / 100
    bins = np.digitize(col, np.arange(col.min(), col.max(), diff))
    df[name+'_bin'] = bins
    pd.DataFrame(bins).hist(ax=axes[j], color=colour)
    axes[j].set_title(name)
        
    deciles = pd.DataFrame(bins).quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
    for pos in np.array(deciles).reshape(1, -1)[0]:
        handle = axes[j].axvline(pos, color='darkblue', linewidth=1)
    axes[j].legend([handle], ['decile'])
    j += 1
    
plt.show()
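
The binning step can be illustrated on a small made-up array. The sketch follows the same recipe as the cell above: build equal-width edges from the minimum to the maximum, then let `np.digitize` report which bin each value falls into.

```python
import numpy as np

# Made-up values for illustration
col = np.array([0.0, 10.0, 20.0, 100.0])

# Width of one bin when the range is split into 100 parts
diff = (col.max() - col.min()) / 100

# Left edges of the bins: 0, 1, ..., 99 (the maximum itself is excluded by arange)
edges = np.arange(col.min(), col.max(), diff)

# Bin number for each value; numbering starts at 1 for the first bin
bins = np.digitize(col, edges)

print(bins.tolist())
```

Because the maximum is not one of the edges, it lands in the last bin together with the values just below it.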

Lastly, the transforms for the features are added to our dataset.

In [None]:
orig = list(df.columns.drop('VEP Turnout Rate'))
# Give each transformed array df's own index: after df.drop(0) the index starts at 1, so a
# default 0-based index would shift the values by one row when they are assigned to df
cols = [[pd.Series(stats.boxcox(df[orig[0]]+1)[0], index=df.index), 'Total Ballots Counted (Estimate)', 'boxcox'], 
        [pd.Series(ss().fit_transform(np.array(df[orig[1]]).reshape(-1, 1)).ravel(), index=df.index), 'Vote for Highest Office (President)', 'standard'],
        [(df[orig[2]]+1).transform(np.log), 'Voting-Eligible Population (VEP)', 'log'], 
        [pd.Series(stats.boxcox(df[orig[3]]+1)[0], index=df.index), 'Voting-Age Population (VAP)', 'boxcox'], 
        [(df[orig[4]]+1).transform(np.log), '% Non-citizen', 'log'], 
        [pd.Series(stats.boxcox(df[orig[5]]+1)[0], index=df.index), 'Prison', 'boxcox'], 
        [pd.Series(stats.boxcox(df[orig[6]]+1)[0], index=df.index), 'Probation', 'boxcox'], 
        [(df[orig[7]]+1).transform(np.log), 'Parole', 'log'], 
        [pd.Series(stats.boxcox(df[orig[8]]+1)[0], index=df.index), 'Total Ineligible Felon', 'boxcox']]

for col in cols:
    transform = col[0]
    name = col[1]
    transform_name = col[2]
    df[name+'_'+transform_name] = transform
    
df = df.fillna(0)
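
One thing worth checking when transformed values are assigned back to a dataframe whose index no longer starts at 0 (as is the case here after the first row was dropped) is index alignment, because pandas matches rows by index label rather than by position. The sketch below uses a made-up three-row frame; `frame`, `shifted` and `aligned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up frame whose index starts at 1, like df after df.drop(0)
frame = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[1, 2, 3])

# A transform that returns a plain NumPy array loses the index
transformed = np.log(frame['x'].to_numpy())

# Wrapping the array in a Series gives it a fresh 0-based index, so on assignment
# the values shift by one row and the last row becomes NaN
frame['shifted'] = pd.Series(transformed)

# Reusing the frame's own index keeps each value on its original row
frame['aligned'] = pd.Series(transformed, index=frame.index)

print(frame)
```

A trailing NaN in a newly added column is a useful warning sign of this kind of shift, and filling it with 0 hides that sign.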

### Thank you for reading my notebook.
### If you enjoyed this notebook and found it helpful, please give it an upvote and provide feedback as it would help me make more of these.