# Exploratory Data Analysis - 2020 Democratic Party Endorsements
## By: Joseph Waugh
 
 #### The purpose of this document is to provide an exploratory analysis of the 2020 Democratic Party endorsements, based on a Kaggle dataset <a href="https://www.kaggle.com/fmejia21/2020-democratic-primary-endorsements/" target="_blank"> (link) </a>. The data is loaded into a dataframe and filtered on a variety of variables, and visualizations then allow for further understanding.

## Step I: Importing the Libraries
#### The first step in this data analysis is to import the below libraries to the Python console. These packages allow for the visualizations and dataframe to be created, and ultimately filtered depending on the need.

In [78]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline

import plotly.express as px

import plotly
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
import plotly.figure_factory as ff
from plotly import subplots
from plotly.subplots import make_subplots
init_notebook_mode(connected=True)

from datetime import date, datetime, timedelta
import time, re, os
from IPython.display import Markdown, display

## Step II: Importing the Data
#### Now that the libraries have been imported, the functionality exists to transfer the .csv file obtained from Kaggle into a readable dataframe that can be further explored. The pandas library is used to extract the data from the .csv file into the dataframe. 

In [39]:
#reading in the dataset
df = pd.read_csv(r'C:/Users/HP/Desktop/endorsements-2020.csv')

#determining which columns of data are available
print("The columns in the dataset are as follows:")
for column in df.columns:
    print (column)

The columns in the dataset are as follows:
date
position
city
state
endorser
endorsee
endorser party
source
order
category
body
district
points


In [40]:
#determining which columns of data have the greatest percentage of missing or null values
percent_missing = np.round(df.isnull().sum() * 100 / len(df),2)
percent_missing_df = pd.DataFrame({'column name': df.columns, 
                                   'percent_missing': percent_missing}).sort_values('percent_missing', ascending=False)
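
#### As a minimal sketch of the same missing-value calculation, the snippet below runs it on a small made-up dataframe (the `toy` values are hypothetical, not taken from the Kaggle file). It uses `isnull().mean()`, which gives the same fraction as dividing the null count by the number of rows.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the calculation
toy = pd.DataFrame({
    "endorser": ["A", "B", "C", "D"],
    "city": [None, None, None, "Boston"],
    "points": [3.0, None, 6.0, 1.0],
})

# Fraction of nulls per column, expressed as a percentage and sorted
toy_missing = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)

# city: 3 of 4 values missing -> 75.0
# points: 1 of 4 values missing -> 25.0
# endorser: nothing missing -> 0.0
print(toy_missing)
```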


#### The visualization below highlights the percentage of datapoints in each column that are null. From it, 7 columns appear to have a high null value percentage, with at least 75% of the data in each of them missing.

In [41]:
#plotly dynamic visualization highlighting the percentage of datapoints for each column that are null
fig = go.Figure()
fig.add_trace(
        go.Bar(x=percent_missing_df['column name'],
               y=percent_missing_df['percent_missing'],
               opacity=0.9,
               text=percent_missing_df['percent_missing'],
               textposition='inside',
               marker={'color':'blue'}
                   ))
fig.update_layout(
      title={'text': 'Data Missing %',
             'y':0.95, 'x':0.5,
            'xanchor': 'center', 'yanchor': 'top'},
      showlegend=False,
      xaxis_title_text='Columns',
      yaxis_title_text='Percentage',
      bargap=0.1
    )

fig.show()

## Step III: Data Handling

#### To improve the clarity of the data, a few of the columns with relatively high "data missing %" values will be removed. I'm using a threshold of >75% missing to remove the following columns: 

- city
- body
- order
- district

#### The date, endorsee, and source columns are also sparse, but they are not removed. The endorsee and source are vital for our understanding, and date is left in the dataframe (it still appears in the printed output below). We will be able to extract some conclusions from the data that is currently available. 

In [42]:
df.drop(['city','body','order','district'], axis=1, inplace=True)
print(df)

            date             position state               endorser  \
0     2017-07-28       representative    MD            David Trone   
1     2019-01-02             governor    NY           Andrew Cuomo   
2     2019-01-03              senator    CA       Dianne Feinstein   
3     2019-01-08              senator    DE       Thomas R. Carper   
4     2019-01-12                mayor    TX          Ron Nirenberg   
5     2019-01-21           DNC member    CA        Laphonza Butler   
6     2019-01-25           DNC member    DC         James J. Zogby   
7     2019-01-27  lieutenant governor    CA       Eleni Kounalakis   
8     2019-01-27                mayor    CA           Libby Schaaf   
9     2019-01-28       representative    CA               Ted Lieu   
10    2019-01-29       representative    CA  Nanette Diaz Barragán   
11    2019-02-01              senator    NJ        Robert Menendez   
12    2019-02-01             governor    NJ            Phil Murphy   
13    2019-02-04    
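
#### The threshold rule described above could also be applied programmatically rather than by listing the columns by hand. The sketch below is one way to do it on made-up data; the helper name `sparse_columns` and its `keep` parameter are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def sparse_columns(frame, threshold=75.0, keep=()):
    """Return columns whose missing percentage exceeds `threshold`,
    skipping any column named in `keep` (hypothetical helper)."""
    pct = frame.isnull().mean() * 100
    return [c for c in pct.index if pct[c] > threshold and c not in keep]

# Hypothetical toy data
toy = pd.DataFrame({
    "city": [None, None, None, None, "Austin"],     # 80% missing
    "endorsee": [None, None, None, None, "Biden"],  # 80% missing, but kept
    "state": ["TX", "CA", None, "NY", "TX"],        # 20% missing
})

to_drop = sparse_columns(toy, threshold=75.0, keep=("endorsee",))
trimmed = toy.drop(columns=to_drop)
print(to_drop)
print(list(trimmed.columns))
```

#### Passing the columns to protect through `keep` makes the exceptions explicit instead of leaving them implied by the hand-written list.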

#### An important phase of understanding the data is to understand the source. To do this, I will extract the primary source from the associated URL that's provided within the dataset. This will allow us to compare what the big news players have published with what smaller sources have published.

#### To accomplish this, the following sources are used to refine the origin of the individual endorsements. There are 9 primary sources, plus an "other" category for items that do not match any of them. Specifically, the 9 sources are as follows: 


- 4President <a href="http://www.4president.org/" target="_blank"> (Link) </a>
- AP News <a href="https://apnews.com/" target="_blank"> (Link) </a>
- CNN <a href="https://www.cnn.com/" target="_blank"> (Link) </a>
- FiveThirtyEight <a href="https://fivethirtyeight.com/" target="_blank"> (Link) </a>
- Politico <a href="https://www.politico.com/" target="_blank"> (Link) </a>
- Twitter <a href="https://twitter.com/" target="_blank"> (Link) </a>
- USA Today <a href="https://www.usatoday.com/" target="_blank"> (Link) </a>
- Washington Post <a href="https://www.washingtonpost.com/" target="_blank"> (Link) </a>
- YouTube <a href="https://www.youtube.com/" target="_blank"> (Link) </a>

In [50]:
#Change Column Name for source extraction
df.rename(columns={'source': 'raw_source'}, inplace=True)
df['raw_source'] = df.loc[:,'raw_source'].fillna('other')
df['source'] = 'other'

#Define list of 9 primary sources for extraction 
sources=['4president','apnews','cnn','fivethirtyeight','politico','twitter','usatoday','washingtonpost','youtube']

#Check list for each row of data in dataset and change as necessary
for k in sources:
    df['source'] =  np.where(df['raw_source'].str.contains(k), k,  df['source'])

df.drop('raw_source', axis=1, inplace=True)
df['endorsee'] = df.loc[:,'endorsee'].fillna('no_endorsee')
df['party'] = df.loc[:, 'endorser party'].fillna('None')
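
#### The matching logic in the cell above can be sketched as a plain function, which makes it easy to check on a few example URLs. The function name `classify_source` and the example URLs are hypothetical. As in the loop above, a URL containing none of the keywords falls back to "other", and if several keywords match, the last one in the list wins.

```python
SOURCES = ['4president', 'apnews', 'cnn', 'fivethirtyeight', 'politico',
           'twitter', 'usatoday', 'washingtonpost', 'youtube']

def classify_source(url, sources=SOURCES):
    """Map a raw source URL to one of the primary sources, else 'other'."""
    # Missing values (None/NaN) are not strings and count as 'other'
    if not isinstance(url, str):
        return 'other'
    matched = 'other'
    for k in sources:
        # Plain, case-sensitive substring check
        if k in url:
            matched = k
    return matched

# Hypothetical example URLs
print(classify_source("https://twitter.com/example/status/1"))
print(classify_source("https://www.politico.com/story/example"))
print(classify_source("https://example.org/news"))
```

#### A function like this could be applied to a column with `Series.map`, as an alternative to the `np.where` loop.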

In [54]:
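#Table that excludes rows without a clear endorsee (data that isn't refined enough)
#.copy() avoids pandas' SettingWithCopyWarning when the column is reassigned below
endorsee_df = df[df['endorsee']!='no_endorsee'].copy()
#Keep only the last word of each endorsee name (the surname)
endorsee_df['endorsee'] = endorsee_df['endorsee'].str.split(' ').str[-1]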

In [56]:
end_df = endorsee_df.groupby('endorsee').agg({'endorser': 'count', 'points': 'sum'})

end_df.rename(columns={'endorser': 'n_endorsements',
                       'points': 'tot_points'},
              inplace=True)

end_df['points_endorser_ratio'] = np.round(np.divide(np.array(end_df['tot_points']), np.array(end_df['n_endorsements'])), 2)
end_df.reset_index(inplace=True)
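
#### To make the aggregation above concrete, here is a small sketch on made-up rows (the endorsers and point values are hypothetical). It uses pandas named aggregation, which produces the renamed columns in one step, and then computes the points-per-endorsement ratio.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "endorsee": ["Joe Biden", "Joe Biden", "Amy Klobuchar"],
    "endorser": ["X", "Y", "Z"],
    "points": [6, 3, 8],
})

# Keep only the last word of the endorsee name
toy["endorsee"] = toy["endorsee"].str.split(" ").str[-1]

# Count endorsements and sum points per endorsee
summary = (
    toy.groupby("endorsee")
       .agg(n_endorsements=("endorser", "count"),
            tot_points=("points", "sum"))
       .reset_index()
)
summary["points_endorser_ratio"] = (
    summary["tot_points"] / summary["n_endorsements"]
).round(2)

# Biden: 2 endorsements, 9 points, ratio 4.5
# Klobuchar: 1 endorsement, 8 points, ratio 8.0
print(summary)
```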


### Endorsee Analysis:
#### To further understand the trends between endorsee and the number of points that are associated, the following scoring rubric was used for the data:

- 10 Points
 - Former presidents and vice presidents
 - Current national party leaders
- 8 Points
 - Governors
- 6 Points
 - U.S. Senators
- 5 Points
 - Former presidential and vice-presidential nominees
 - Former national party leaders
 - 2020 presidential candidates who have dropped out
- 3 Points
 - U.S. Representatives
 - Mayors of Large Cities
- 2 Points
 - Officials in statewide elected offices
 - State legislative leaders
- 1 Point
 - Other Democratic National Committee members
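
#### The rubric above can be written down as a simple lookup table. This is only a sketch: the dictionary keys are the rubric's own wording, and the labels actually stored in the dataset's `category` column may differ.

```python
# Points per endorser type, transcribed from the rubric above.
# The key strings are illustrative labels, not confirmed dataset values.
RUBRIC = {
    "Former presidents and vice presidents": 10,
    "Current national party leaders": 10,
    "Governors": 8,
    "U.S. Senators": 6,
    "Former presidential and vice-presidential nominees": 5,
    "Former national party leaders": 5,
    "2020 presidential candidates who have dropped out": 5,
    "U.S. Representatives": 3,
    "Mayors of Large Cities": 3,
    "Officials in statewide elected offices": 2,
    "State legislative leaders": 2,
    "Other Democratic National Committee members": 1,
}

def total_points(endorser_types, rubric=RUBRIC):
    """Sum the rubric points for a list of endorser types."""
    return sum(rubric[t] for t in endorser_types)

# A governor, a senator, and a representative: 8 + 6 + 3 = 17
example = ["Governors", "U.S. Senators", "U.S. Representatives"]
print(total_points(example))
```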
 
#### The visualization below can help us determine where each candidate stands in terms of overall points (based on the scale above) and number of endorsements overall. 


In [57]:
fig = go.Figure()

fig.add_trace( 
        go.Scatter(
            x=end_df['n_endorsements'], 
            y=end_df['tot_points'],
            mode='markers+text',
            marker=dict(
                size=(end_df['points_endorser_ratio']+3)**2,
                color=end_df["points_endorser_ratio"],
                colorscale='geyser',
                opacity = 0.7),
            text=end_df['endorsee'],
            textposition='bottom right'
    ))

fig.update_layout(
        xaxis_type="log",
        yaxis_type="log",
        title={'text': 'Total Points per Number of Endorsers',
               'y':0.95, 'x':0.5,
               'xanchor': 'center', 'yanchor': 'top'},
        showlegend=False,
        xaxis_title_text='Number of Endorsers',
        yaxis_title_text='Total Points',
        updatemenus = list([
            dict(active=0,
                 buttons=list([
                    dict(label='Log Scale',
                         method='update',
                         args=[{'visible': True},
                               {'title': 'Log scale',
                                'xaxis': {'type': 'log'},
                                'yaxis': {'type': 'log'}}]),
                    dict(label='Log X',
                         method='update',
                         args=[{'visible': True},
                               {'title': 'Log X',
                                'xaxis': {'type': 'log'},
                                'yaxis': {'type': 'linear'}}]),
                    dict(label='Log Y',
                        method='update',
                       args=[{'visible': True},
                              {'title': 'Log Y',
                               'xaxis': {'type': 'linear'},
                               'yaxis': {'type': 'log'}}]),
                    dict(label='Linear Scale',
                        method='update',
                       args=[{'visible': True},
                              {'title': 'Linear scale',
                               'xaxis': {'type': 'linear'},
                               'yaxis': {'type': 'linear'}}]),
                            ]),
                direction="down",
                pad={"r": 10, "t": 10},
                showactive=True,
                x=-0.2,
                xanchor="left",
                y=1.1,
                yanchor="top"
                )]),
        annotations=[
            go.layout.Annotation(text="Select Axis Scale", 
                                 x=-0.2, xref="paper", 
                                 y=1.13, yref="paper",
                                 align="left", showarrow=False),
        ])

fig.show()
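
#### The four axis-scale buttons above differ only in their label and axis types, so they could be generated from a small table rather than written out by hand. The sketch below builds the same kind of button dictionaries; the helper name `scale_buttons` is an assumption, and the result would be passed as `buttons=` inside `updatemenus`.

```python
def scale_buttons():
    """Build one dropdown button per (x, y) axis-type combination."""
    options = [
        ("Log Scale", "log", "log"),
        ("Log X", "log", "linear"),
        ("Log Y", "linear", "log"),
        ("Linear Scale", "linear", "linear"),
    ]
    buttons = []
    for label, x_type, y_type in options:
        buttons.append(
            dict(label=label,
                 method='update',
                 # First dict updates the traces, second dict updates the layout
                 args=[{'visible': True},
                       {'title': label,
                        'xaxis': {'type': x_type},
                        'yaxis': {'type': y_type}}]))
    return buttons

buttons_example = scale_buttons()
print([b['label'] for b in buttons_example])
```

#### Keeping the label and the title in one place avoids copy-and-paste mismatches between them.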

### Conclusions:
#### Based on the above chart, we can draw a few conclusions: 
- Joe Biden appears to be in the best position, in terms of number of endorsers and overall points
- Elizabeth Warren appears to be in a similar position, with the second-highest number of endorsers and overall points
- Amy Klobuchar appears to have a variety of valuable endorsements, given her endorsement count to overall points ratio (~3.78 points per endorsement)

### Presidential Candidate Summary
#### The visualization below summarizes the following data points for each candidate:
- Points received by category (e.g., Senators, U.S. Representatives, etc.)
- Points received by position (e.g., Attorney General, Secretary of State, etc.)
- Share of points received by endorser party 
- Number of endorsements by source
- Share of endorsements by state

In [87]:
#Map full state (or territory) name to abbreviation; the mapping is inverted below to look up names by abbreviation
state_to_s = {
 'Alabama': 'AL',
 'Alaska':'AK',
 'Arizona':'AZ',
 'Arkansas':'AR',
 'California':'CA',
 'Colorado':'CO',
 'Connecticut':'CT',
 'Delaware':'DE',
 'Florida':'FL',
 'Georgia':'GA',
 'Hawaii':'HI',
 'Idaho':'ID',
 'Illinois':'IL',
 'Indiana':'IN',
 'Iowa':'IA',
 'Kansas':'KS',
 'Kentucky':'KY',
 'Louisiana':'LA',
 'Maine':'ME',
 'Maryland':'MD',
 'Massachusetts':'MA',
 'Michigan':'MI',
 'Minnesota':'MN',
 'Mississippi':'MS',
 'Missouri':'MO',
 'Montana':'MT',
 'Nebraska':'NE',
 'Nevada':'NV',
 'New Hampshire':'NH',
 'New Jersey':'NJ',
 'New Mexico':'NM',
 'New York':'NY',
 'North Carolina' :'NC',
 'North Dakota':'ND',
 'Ohio':'OH',
 'Oklahoma':'OK',
 'Oregon':'OR',
 'Pennsylvania':'PA',
 'Rhode Island':'RI',
 'South Carolina':'SC',
 'South Dakota':'SD',
 'Tennessee':'TN',
 'Texas':'TX',
 'Utah':'UT',
 'Vermont':'VT',
 'Virginia':'VA',
 'Washington':'WA',
 'West Virginia':'WV',
 'Wisconsin':'WI',
 'Wyoming':'WY',
 'District of Columbia':'DC',
 'Marshall Islands':'MH'}

s_to_state = {}

for k,v in state_to_s.items():
    s_to_state[v]=k
    
df['full_state'] = df['state'].map(s_to_state)


#calculate counts based on the following columns 
cols = ['category', 'source', 'position', 'endorser party', 'state']
lc = len(cols)

d={}

for c in cols:
    tmp = endorsee_df.groupby(['endorsee', c]).agg({'points':'sum', 'endorser':'count'}).reset_index()
    #The same column names are used for every grouping; the plotting cell below relies on 'pt_by_' and 'votes_by_'
    tmp.rename(columns={'points': 'pt_by_', 'endorser': 'votes_by_'}, inplace=True)
    d[c] = tmp

cat_df = d['category']
source_df = d['source']
position_df = d['position']
party_df = d['endorser party']
state_df = d['state']
state_df['full_state'] = state_df['state'].map(s_to_state)

buttons=[]
l=endorsee_df['endorsee'].nunique()
n_plots=5
colors = ['cadetblue', 'indianred',  'goldenrod']
pie_colors = ['mediumpurple','beige']
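
#### One detail of the grouping loop above is worth spelling out: `'pt_by_'.format(c)` has no `{}` placeholder, so every table ends up with the same `pt_by_` and `votes_by_` column names. The sketch below, on made-up data, shows that behavior and a variant with per-column names built from f-strings. The variant is an alternative, not the notebook's method, and adopting it would mean renaming the columns used in the plotting cell as well.

```python
import pandas as pd

# str.format without a placeholder returns the string unchanged
print('pt_by_'.format('state'))

# Hypothetical toy data
toy = pd.DataFrame({
    "endorsee": ["Biden", "Biden", "Warren"],
    "endorser": ["X", "Y", "Z"],
    "state": ["DE", "DE", "MA"],
    "category": ["Senators", "Governors", "Senators"],
    "points": [6, 8, 6],
})

per_column = {}
for c in ["category", "state"]:
    # Named aggregation with column-specific output names
    per_column[c] = (
        toy.groupby(["endorsee", c])
           .agg(**{f"pt_by_{c}": ("points", "sum"),
                   f"votes_by_{c}": ("endorser", "count")})
           .reset_index()
    )

print(per_column["state"])
```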

In [94]:
#Creation of visualization plot
fig = make_subplots(
    rows=3, cols=2,
    specs=[[{'colspan':2}, None],
           [{}, {"type": "pie"}],
           [{}, {"type": 'pie'}]],
    subplot_titles=('Points by Endorser Category', 
                    'Points by Endorser Position', '% of Points by Endorser Party', 
                    'Number of Votes by Endorser Source', '% of Votes by Endorser State')
)


for i,e in enumerate(endorsee_df['endorsee'].unique()):
        
    visible = [False]*l*n_plots
    
    #Each endorsee contributes n_plots traces, so switch on that block of traces
    visible[i*n_plots:(i+1)*n_plots] = [True]*n_plots
        
    fig.add_trace(
            go.Bar(
                x=cat_df.loc[cat_df['endorsee']==e, 'category'],
                y=cat_df.loc[cat_df['endorsee']==e, 'pt_by_'],
                text=cat_df.loc[cat_df['endorsee']==e, 'pt_by_'],
                textposition='outside',
                opacity=0.9,
                marker={'color':colors[0],
                       'opacity':0.9},
                visible=False if i!=1 else True,
                showlegend=False),
        row=1, col=1)


    
    fig.add_trace(
            go.Bar(
                x=position_df.loc[position_df['endorsee']==e, 'position'],
                y=position_df.loc[position_df['endorsee']==e,'pt_by_'],
                text=position_df.loc[position_df['endorsee']==e,'pt_by_'],
                textposition='outside',
                opacity=0.9,
                marker={'color':colors[1],
                       'opacity':0.9},
                visible=False if i!=1 else True,
                showlegend=False),
        row=2, col=1)
    
    fig.add_trace(
            go.Pie(
                values=np.array(party_df.loc[party_df['endorsee']==e, 'pt_by_']),
                labels=np.array(party_df.loc[party_df['endorsee']==e, 'endorser party']),
                hole=0.4,
                visible=False if i!=1 else True,
                text=party_df.loc[party_df['endorsee']==e, 'endorser party'],
                hoverinfo='label+percent+name',
                textinfo= 'percent+label',
                textposition = 'inside',
                showlegend=False,
                marker = dict(colors = plotly.colors.diverging.Geyser)),
        row=2, col=2)
    
    fig.add_trace(
            go.Bar(
                x=source_df.loc[source_df['endorsee']==e, 'source'],
                y=source_df.loc[source_df['endorsee']==e,'votes_by_'],
                text=source_df.loc[source_df['endorsee']==e,'votes_by_'],
                textposition='outside',
                opacity=0.9,
                marker={'color':colors[2],
                       'opacity':0.9},
                visible=False if i!=1 else True,
                showlegend=False
                       ),
        row=3, col=1)
    
    fig.add_trace(
            go.Pie(
                values=np.array(state_df.loc[state_df['endorsee']==e, 'votes_by_']),
                labels=np.array(state_df.loc[state_df['endorsee']==e, 'state']),
                hole=0.4,
                visible=False if i!=1 else True,
                text=state_df.loc[state_df['endorsee']==e, 'full_state'],
                hoverinfo='label+percent+name',
                textinfo= 'percent+label',
                textposition = 'inside',
                showlegend=False,
                marker = dict(colors = plotly.colors.diverging.Geyser)),
        row=3, col=2)
    

    buttons.append(
        dict(label=e,
             method='update',
             args=[{'visible': visible},
                   #{'title': e}
                  ]))
    

fig.update_layout(
    title={'text': '<b> Endorsee Summary </b>', 'font':{'size':22},
            'y':0.95, 'x':0.5, 'xanchor': 'center', 'yanchor': 'top'},
    margin=dict(t=150),
    height=1350,
    xaxis1=dict(tickangle=45, tickvals=cat_df['category'].unique(), ticktext=cat_df['category'].unique()),
    yaxis1=dict(range=[0, np.max(cat_df['pt_by_']+15)]),
    
    xaxis2=dict(tickangle=45, tickvals=position_df['position'].unique(), ticktext=position_df['position'].unique()),
    yaxis2=dict(range=[0, np.max(position_df['pt_by_']+15)]),
    
    xaxis3=dict(tickangle=45, tickvals=source_df['source'].unique(), ticktext=source_df['source'].unique()), 
    yaxis3=dict(range=[0, np.max(source_df['votes_by_']+15)]), 
    
    bargap=0.1,
    showlegend=True,
    updatemenus = list([
        dict(active=1,
             buttons=buttons,
             direction="down",
             pad={"r": 10, "t": 10},
             showactive=True,
             x=-0.15,
             xanchor="left",
             y=1.04,
             yanchor="top"
         )
     ]))

#Label for the dropdown menu, added alongside the subplot titles
fig.add_annotation(text="Select Endorsee",
                   x=-0.15, xref="paper",
                   y=1.05, yref="paper",
                   align="left", showarrow=False)
    
    

fig.show()
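
#### The dropdown in the cell above works by giving each button a list of booleans, one per trace, that switches on only the traces belonging to the selected endorsee. That bookkeeping can be sketched without Plotly; the function name `visibility_masks` is hypothetical.

```python
def visibility_masks(n_groups, traces_per_group):
    """Return one visibility list per group.

    Traces are assumed to be added group by group, so group i owns the
    slice [i * traces_per_group, (i + 1) * traces_per_group).
    """
    total = n_groups * traces_per_group
    masks = []
    for i in range(n_groups):
        mask = [False] * total
        start = i * traces_per_group
        mask[start:start + traces_per_group] = [True] * traces_per_group
        masks.append(mask)
    return masks

masks = visibility_masks(n_groups=3, traces_per_group=2)
# Second group: only traces 2 and 3 are visible
print(masks[1])  # [False, False, True, True, False, False]
```

#### In the notebook, `n_groups` corresponds to the number of unique endorsees and `traces_per_group` to the five traces added per endorsee.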

### Conclusions: 
#### Here, based on the visualizations, we can begin to paint a picture of what the data is telling us. 
#### For example, we're able to see that Joe Biden's endorsers come from a wide variety of states, positions, parties, and categories. We are also able to pinpoint other factors that impact the flow of endorsements, like Elizabeth Warren's endorsers primarily coming from Massachusetts, her home state. This tool can be used to track the flow of endorsements as we get closer to the general election. 