# Historical Characteristics of Electoral Violence
*** 
#### Purpose:
This code uses machine learning to model the characteristics of elections that correlate with election-related violence.

#### Methodology:
Characteristics of historical elections worldwide are used as independent input variables to train a Random Forest model that predicts the level of election-related fatalities recorded in the one year preceding the date of a given election.

This analysis measures electoral violence by number of fatalities and classifies the level of electoral violence during an election cycle as one of three categories:
1. Non-fatal: no election-related fatalities recorded
2. Low-fatality: 1-3 election-related fatalities recorded
3. Mass-fatality: 4 or more election-related fatalities recorded
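
As a minimal sketch, the three categories above can be written as a small function. The name `fatality_level` is hypothetical and used only for illustration; the category labels follow the ones used later in this notebook.

```python
def fatality_level(num_fatalities: int) -> str:
    """Map a fatality count to one of the three categories listed above."""
    if num_fatalities == 0:
        return 'non-fatal'
    if 1 <= num_fatalities <= 3:
        return 'low fatality'
    return 'mass fatality'


# illustrative counts only, not values taken from the data
for count in (0, 1, 3, 4, 25):
    print(count, fatality_level(count))
```

Both boundaries of the low-fatality range are inclusive, so an election cycle with exactly one recorded fatality is classified as low fatality.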

The value of the machine learning model lies in its feature importances, which identify the characteristics of elections that are most informative in assessing the level of violence during an election period. Permutation importance is used to produce a list of the historical election characteristics that most influence the model's ability to predict the level of election violence.
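
The idea behind permutation importance can be illustrated with a hand-rolled sketch in plain Python. This is not scikit-learn's implementation (the notebook uses `sklearn.inspection.permutation_importance` further down); the toy data, `toy_model`, and `permutation_importance_sketch` are assumptions made up for this example.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance_sketch(model, rows, labels, n_repeats=5, seed=0):
    """Average drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # rebuild each row with the shuffled value in position `col`
            permuted = [row[:col] + (value,) + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# toy data: the label depends only on feature 0; feature 1 is noise
data_rng = random.Random(42)
rows = [(data_rng.randint(0, 1), data_rng.randint(0, 1)) for _ in range(200)]
labels = [row[0] for row in rows]


def toy_model(row):
    """A stand-in 'model' that predicts the label from feature 0 only."""
    return row[0]


importances = permutation_importance_sketch(toy_model, rows, labels)
print(importances)
```

Shuffling feature 0 breaks its link to the label, so accuracy drops; shuffling feature 1 changes nothing because the toy model never reads it.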

#### Data Sources:
1. **Dataset of National Elections Across Democracy and Autocracy (NELDA)**
    - A historical dataset of the national elections for all independent countries from 1945-2020
    - Feature types:
        - Election history of the country
        - Structure and quality of management of the election in question (e.g., whether opposition is allowed, delayed vote counting) 
        - Public perceptions of election fairness
        - The occurrence of protests
        - Economic and political state of the country (e.g., whether the country receives economic aid, impact of the election on US/international relations)
        - The presence of international monitors
    - Source: https://www.jstor.org/stable/23260172 or https://nelda.co/
    
    
2. **The Deadly Electoral Conflict Dataset (DECO)**
    - A georeferenced events dataset by the Uppsala Conflict Data Program (UCDP) that records incidents of electoral violence between 1989-2017 in which at least one election-related fatality occurred
    - Source: https://journals.sagepub.com/doi/full/10.1177/00220027211021620 or https://ucdp.uu.se/downloads/index.html#deco

In [None]:
# import libraries

import os
import pandas as pd
pd.set_option('display.max_columns',100)
from datetime import datetime, timedelta
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import GridSearchCV

# 1. Data Loading

In [None]:
# load data

# data directory
data_dir = 'https://raw.githubusercontent.com/wsimpso1/election-violence-risks/main/data'

deco = pd.read_csv(data_dir+'/DECO_v.1.0.csv', parse_dates=['date_start','date_end'])
nelda = pd.read_csv(data_dir+'/NELDA.csv', encoding='latin-1')
nelda_look_up = pd.read_csv(data_dir+'/nelda_look_up.csv', encoding='latin-1')


In [None]:
# inspect original data shapes prior to processing

print('DECO:', deco.shape)
print('NELDA:', nelda.shape)

In [None]:
# datetime conversion

def custom_nelda_date_format(year, mmdd):
    '''
    parse year and month-day columns for NELDA data
    
    Parameters:
    ———————————
    year: int
    mmdd: int
        month (1-12) and day (01-31)
    
    Returns:
    ————————
    combined year-month-day date format
    '''
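    try:
        # zero-pad mmdd to four digits so single-digit months parse unambiguously (e.g., 111 as January 11)
        date_string = f'{int(year)}{int(mmdd):04d}'
        return datetime.strptime(date_string, '%Y%m%d').date()
    # impute January 1 when the month-day value is missing or invalid
    except (ValueError, TypeError):
        return datetime.strptime(f'{year}0101', '%Y%m%d').date()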
        
nelda['date'] = pd.to_datetime([custom_nelda_date_format(year,date) for year, date in zip(nelda.year,nelda.mmdd)])

# 2. Exploratory Data Analysis (EDA)

### 2.1 EDA of the Dataset of National Elections Across Democracy and Autocracy (NELDA)

In [None]:
# types of elections recorded in dataset

election_types = nelda.types.value_counts()
sns.barplot(x=election_types.index, 
            y=election_types.values)
plt.title('TYPES OF NATIONAL ELECTIONS MEASURED BY NELDA')
plt.show()

In [None]:
# count of national elections measured in NELDA

# group elections by country
nelda_count = nelda.groupby(['country']).count()['electionid'].copy().reset_index()
nelda_count = nelda_count.rename(columns={'electionid':'Count of Elections'})

# plot
fig = px.choropleth(nelda_count,
                    locations='country', 
                    locationmode="country names",
                    projection='natural earth',
                    scope="world",
                    color='Count of Elections',
                    color_continuous_scale=px.colors.sequential.PuBu, 
                    title='National Elections Measured in NELDA (1945-2020)'
                    )

# display plot
fig.show('png')

In [None]:
# count of elections over time

# group elections by year and type 
nelda_year = nelda.groupby(['year','types']).count()['electionid'].reset_index()
nelda_year = nelda_year.rename(columns={'electionid':'Count of Elections'})


# plot count of election types over time
sns.lineplot(x=nelda_year.year,
             y=nelda_year['Count of Elections'],
            hue=nelda_year.types)
plt.title('Elections Have Increased Over Time')
plt.show()

### 2.2 EDA of the Deadly Electoral Conflict Dataset (DECO)

In [None]:
# count of election related fatalities in DECO

# group by country and sum the best estimate of election-related fatalities
# (select the numeric column before summing so non-numeric columns such as dates are not summed)
deco_count = deco.groupby('country')['best'].sum().reset_index()
deco_count = deco_count.rename(columns={'best':'Count of Fatalities'})

# plot map
fig = px.choropleth(deco_count,
                    locations='country', 
                    locationmode="country names",
                    projection='natural earth',
                    scope="world",
                    color='Count of Fatalities',
                    color_continuous_scale=px.colors.sequential.OrRd, 
                    title='Election-related Fatalities based on Events Coded in DECO (1989-2017)'
                    )
fig.show('png')

In [None]:
# examine density of election violence incidents in example country: KENYA

# select electoral violence data related to Kenya
deco_kenya = deco.loc[deco.country == 'Kenya'].copy()

# plot heatmap
fig = px.density_mapbox(deco_kenya, lat='latitude', lon='longitude', radius=10,
                        center=dict(lat=0.1, lon=38), zoom=5,
                        mapbox_style="stamen-terrain",
                        width=600, height=600)
fig.show('png')

In [None]:
# election violence over time

# group data of election fatalities by year
deco_year = deco.groupby(['year']).count()['id'].reset_index()

# create scatterplot and draw linear regression trendline
fig = px.scatter(deco_year,
          x='year',
          y='id',
          trendline='ols',
          title='GLOBAL ELECTION VIOLENCE IS INCREASING',
          labels={'id':'Number of Fatal Events', 'year':'Year'})

# display plot
fig.show('svg')

# 3. Data Cleaning and Preprocessing

### 3.1 Clean and Wrangle NELDA Data of Global Historical Elections

In [None]:
# clean NELDA data of historical elections for modeling

def process_nelda(nelda_data):
    '''
    data wrangling for NELDA data
    - filters relevant data to allow merging with DECO data
    - fill NA and reformat string features to numeric
    
    Parameters:
    ———————————
    nelda_data: pandas dataframe
        original NELDA data
        
    Returns:
    ————————
    nelda_data: pandas dataframe
        processed dataframe
    '''    
    # select only years covered by both datasets (1989-2017)
    nelda_data = nelda_data.loc[(nelda_data.year > 1988) & (nelda_data.year < 2018)]
    # select only countries that appear in both datasets
    deco_country_ids = list(deco.country_id.unique())
    nelda_data = nelda_data.loc[nelda_data.ccode.isin(deco_country_ids)] 
    
    # remove notes column
    nelda_data = nelda_data[[col for col in nelda_data.columns if not re.match(".+notes$", col)]]
    
    # exclude free text features of names and location
    free_txt_cols = ['nelda43', 'nelda44', 'nelda54']
    # copy so the column assignments below modify an independent dataframe rather than a slice
    nelda_data = nelda_data[[col for col in nelda_data.columns if col not in free_txt_cols]].copy()
    
    # fill NaN with 'n/a' for NELDA columns
    nelda_cols = [col[0] for col in [re.findall(r'nelda\d+$', col) for col in nelda_data.columns] if len(col)>=1]
    for col in nelda_cols:
        nelda_data[col] = nelda_data[col].fillna('n/a')
    
    # convert string features to numeric
    def str_to_num(string):
        if string == 'yes':
            return 2
        elif string == 'no':
            return 1
        elif string == 'n/a': 
            return 0
        else:  # string == 'unclear'
            return -1
        
    # apply string-to-numeric function    
    for col in nelda_cols:
        nelda_data[col] = [str_to_num(val) for val in nelda_data[col]]
    
    return nelda_data.reset_index(drop=True)
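
The string-to-numeric step above can also be expressed as a dictionary lookup. The sketch below is only an illustration of the encoding scheme; `NELDA_CODES` and `encode_answer` are hypothetical names, and the sample answers are made up.

```python
# numeric codes used for the categorical NELDA answers
NELDA_CODES = {'yes': 2, 'no': 1, 'n/a': 0}


def encode_answer(answer, unclear_code=-1):
    """Return the numeric code for an answer; anything else counts as 'unclear'."""
    return NELDA_CODES.get(answer, unclear_code)


# made-up answers; None stands in for a missing value
answers = ['yes', 'no', None, 'unclear']
filled = ['n/a' if answer is None else answer for answer in answers]
encoded = [encode_answer(answer) for answer in filled]
print(encoded)  # [2, 1, 0, -1]
```

These integers are placeholders rather than an ordinal scale: the features are later cast to strings and one-hot encoded, so the ordering of the codes does not influence the model.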

In [None]:
# apply cleaning function

nelda_clean = process_nelda(nelda)

In [None]:
# view cleaned data

nelda_clean.head()

### 3.2 Clean and Wrangle DECO Data of Electoral Violence 

In [None]:
# clean data of election violence (DECO) for modeling

def process_deco(deco_data):
    '''
    data wrangling for DECO data
    - filter columns
    - aggregate election violence data by country and date
    
    Parameters:
    ———————————
    deco_data: pandas dataframe
        original DECO data
        
    Returns:
    ————————
    deco_data: pandas dataframe
        processed and aggregated data
    '''
    # select relevant columns
    deco_data = deco_data[['country_id', 'best', 'date_end']]
    # rename columns
    deco_data = deco_data.rename(columns={'best':'num_fatalities', 'date_end':'date'})
    # sum number of fatalities by country and date
    deco_data = deco_data.groupby(by=['country_id', 'date']).sum()
    
    return deco_data.reset_index()

In [None]:
# apply cleaning function

deco_agg = process_deco(deco)

In [None]:
# view cleaned data

deco_agg.head()

### 3.2 Merge Preprocessed and Aggregated NELDA and DECO datasets

In [None]:
def fatalities_per_election(election_date, country_id):
    '''
    compute the total number of election related fatalities in 1 year leading up to election date
    
    Parameters:
    ———————————
    election_date: datetime object
        date of election
    country_id: int
        numeric country code used to match NELDA (ccode) with DECO (country_id)
    
    Returns:
    –———————
    sum_election_fatalities: int
        aggregated number of election-related fatalities in x country 1 year leading up to election
    '''
    deco_agg_country = deco_agg.loc[deco_agg.country_id == country_id].copy()
    start_date = election_date - timedelta(days=365)
    deco_agg_country_1_year = deco_agg_country.loc[(deco_agg_country.date >= start_date) & 
                                                   (deco_agg_country.date <= election_date)]
    sum_election_fatalities = deco_agg_country_1_year.num_fatalities.sum()
    return sum_election_fatalities

deco_election_fatalities = [fatalities_per_election(date, country) for date, country in zip(nelda_clean.date, nelda_clean.ccode)]
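
The one-year window logic can be checked in isolation with a small, pandas-free sketch. The events below are made-up numbers, not DECO records, and `fatalities_in_window` is a hypothetical helper that mirrors the inclusive date bounds used above.

```python
from datetime import date, timedelta


def fatalities_in_window(events, election_date, window_days=365):
    """Sum fatalities for events dated within `window_days` before the election (inclusive)."""
    start_date = election_date - timedelta(days=window_days)
    return sum(fatalities for event_date, fatalities in events
               if start_date <= event_date <= election_date)


# made-up (event date, fatalities) pairs for a single country
events = [
    (date(2007, 1, 15), 2),   # inside the window
    (date(2007, 12, 20), 5),  # inside the window
    (date(2008, 1, 3), 40),   # after the election, so excluded
]
election = date(2007, 12, 27)
total = fatalities_in_window(events, election)
print(total)  # 7
```

Under this definition, violence that occurs after election day does not count toward that election.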


In [None]:
# create final combined dataframe

nelda_deco = nelda_clean.copy()
nelda_deco['election_fatalities'] = deco_election_fatalities

In [None]:
# Drop unnecessary columns 

drop_cols = ['stateid','ccode', 'country', 'electionid', 'year', 'mmdd', 'types', 'notes', 'date']
nelda_deco = nelda_deco.drop(drop_cols, axis=1)

In [None]:
# view combined data that will be used in modeling

# all remaining NELDA variables are used as independent variables to predict the dependent variable election_fatalities,
# the best estimate of election-related fatalities in the year leading up to each election.

nelda_deco.reset_index(inplace=True, drop=True)
nelda_deco.head()

In [None]:
# visualization of fatalities distribution

sns.boxplot(x = nelda_deco.election_fatalities)
plt.show()

### 3.3 Examine Correlations for Possible Multicollinearity

In [None]:
# visualize correlation matrix of NELDA risk factors

# calculate correlations
corr_matrix = nelda_deco.corr().abs()
mask = np.triu(np.ones_like(corr_matrix))

# examine correlations 
plt.figure(figsize=(18,14))
sns.heatmap(corr_matrix[corr_matrix>=.85], cmap='YlGnBu', mask=mask)
plt.show()
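
Besides reading the heatmap, the highly correlated pairs can be listed explicitly. The sketch below uses plain Python and made-up columns to show the idea; `pearson` and `highly_correlated_pairs` are hypothetical helpers, and in the notebook the same information is contained in `corr_matrix`.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.85):
    """Return (name_a, name_b, |r|) for every column pair at or above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = abs(pearson(a, b))
        if r >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# made-up encoded features for illustration
toy_columns = {
    'feature_a': [2, 1, 2, 1, 0, 2],
    'feature_b': [2, 1, 2, 1, 0, 1],
    'feature_c': [1, 2, 0, 2, 1, 1],
}
pairs = highly_correlated_pairs(toy_columns)
print(pairs)
```

For each pair that is returned, one of the two features is a candidate for removal.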

### 3.4 Remove Multicollinearity

In [None]:
# of the highly correlated features, keep those that are potentially most informative

nelda_deco = nelda_deco.drop(columns=['nelda8', 'nelda21', 'nelda29',
                                      'nelda36', 'nelda37',
                                      'nelda40', 'nelda41'])

# 4. Model Building

### 4.1 Define features and target variable 

In [None]:
# convert target variable of fatalities to a categorical variable

def to_categorial(val):
    '''
    Discretizes numerical value of fatalities in a country-year
    
    Categories are defined according to a US Dept of Justice definition that an event
    with 4 or more fatalities constitutes mass murder
    https://www.ojp.gov/ncjrs/virtual-library/abstracts/serial-murder-multi-disciplinary-perspectives-investigators 
    
    Parameters:
    ———————————
    val: int
        number of fatalities
    
    Returns:
    ————————
    str: category of level of fatality
    '''
    if val == 0:
        return 'non-fatal'
    # 1 to 3 fatalities (inclusive) count as low fatality
    elif 1 <= val <= 3:
        return 'low fatality'
    else:
        return 'mass fatality'

In [None]:
# apply categorical conversion function

nelda_deco['election_fatalities'] = [to_categorial(row) for row in nelda_deco.election_fatalities]

In [None]:
# view the distribution of three classes of election violence intensity

vl_ct = nelda_deco.election_fatalities.value_counts()
sns.barplot(x=list(vl_ct.keys()),
           y=vl_ct.values)
plt.show()

In [None]:
# define target variable y (election fatalities) and features X (risk factors) 

y = nelda_deco.election_fatalities
X = nelda_deco.drop(['election_fatalities'], axis=1)

Drop columns that leak information about election fatalities, since they would let the model predict the target from features that directly encode it:
- nelda33 explicitly codes for the presence of fatalities
- nelda31 codes for the use of violence by the government against citizens

In [None]:
# drop columns as described above

X = X.drop(columns=['nelda33', 'nelda31'])

In [None]:
# transform all categorical NELDA features to one-hot encoding

# Creates list of all column headers
all_columns = list(X)
# change datatype
X[all_columns] = X[all_columns].astype(str)
# one hot encoding
X_one = pd.get_dummies(X)

In [None]:
# view final transformed dataframe for modeling

X_one.head()

In [None]:
# adjust class imbalance via under/oversampling to mitigate bias toward the majority class

# undersample the non-fatal class
undersample = RandomUnderSampler(sampling_strategy='majority', random_state=42)
X_under, y_under = undersample.fit_resample(X_one, y)

# oversample the other classes to eliminate class imbalance 
oversample = RandomOverSampler(sampling_strategy='all', random_state=42)
X_over, y_over = oversample.fit_resample(X_under, y_under)
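
The effect of the two resampling steps on class sizes can be traced with simple arithmetic. The sketch below assumes imbalanced-learn's documented behavior for `sampling_strategy='majority'` (shrink only the largest class to the size of the smallest) and `sampling_strategy='all'` (grow every class to the size of the largest); the starting counts are made up and are not the notebook's actual class counts.

```python
def undersample_majority(counts):
    """Shrink only the largest class down to the size of the smallest class."""
    smallest = min(counts.values())
    majority = max(counts, key=counts.get)
    return {label: (smallest if label == majority else n)
            for label, n in counts.items()}


def oversample_all(counts):
    """Grow every class up to the size of the largest class."""
    largest = max(counts.values())
    return {label: largest for label in counts}


# made-up class counts for illustration
original = {'non-fatal': 900, 'low fatality': 60, 'mass fatality': 140}
after_under = undersample_majority(original)
after_over = oversample_all(after_under)
print(after_under)  # {'non-fatal': 60, 'low fatality': 60, 'mass fatality': 140}
print(after_over)   # {'non-fatal': 140, 'low fatality': 140, 'mass fatality': 140}
```

In this example the final class size is set by the second-largest original class, because the majority class has already been cut down before oversampling.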

In [None]:
# view rebalanced classes

vl_ct = y_over.value_counts()
sns.barplot(x=list(vl_ct.keys()),
           y=vl_ct.values)
plt.show()

### 4.2 Train Test Split

In [None]:
# split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X_over, 
                                                    y_over, 
                                                    test_size=.2, 
                                                    random_state=42, 
                                                    stratify=y_over)

### 4.3 Train Model and Search for Optimal Parameters

In [None]:
# define grid search and cross validation

parameter_grid = {'n_estimators':[200, 500, 700, 1000],
              'max_depth':[3, 4, 5], 
              'criterion': ['gini', 'entropy']}

# instantiate model 
rf_model = RandomForestClassifier(random_state=42)
rf_grid_cv = GridSearchCV(rf_model, parameter_grid, verbose=1, cv=10)
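
To get a sense of the cost of this search before running it, the grid can be enumerated with the standard library. This sketch only counts parameter combinations and cross-validation fits; it does not train anything.

```python
from itertools import product

parameter_grid = {'n_estimators': [200, 500, 700, 1000],
                  'max_depth': [3, 4, 5],
                  'criterion': ['gini', 'entropy']}

# every combination of the listed values is one candidate model
candidates = [dict(zip(parameter_grid, values))
              for values in product(*parameter_grid.values())]
n_folds = 10

print(len(candidates))            # 24
print(len(candidates) * n_folds)  # 240
print(candidates[0])
```

With the default `refit=True`, `GridSearchCV` fits the best candidate once more on the full training set after the cross-validated fits.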

In [None]:
# train

rf_grid_cv.fit(X_train, y_train)

In [None]:
# best parameters 

rf_grid_cv.best_params_

In [None]:
# get predictions with best model

y_pred = rf_grid_cv.predict(X_test)

In [None]:
# training accuracy of best model

print('Train Accuracy:', round(rf_grid_cv.score(X_train, y_train)*100, 2), '%')

In [None]:
# testing accuracy of best model

print('Test Accuracy:', round(rf_grid_cv.score(X_test, y_test)*100, 2), '%')

In [None]:
# granular view of model performance via confusion matrix

cm = confusion_matrix(y_test, y_pred, labels=rf_grid_cv.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=rf_grid_cv.classes_)
disp.plot()
plt.show()

In [None]:
# examine precision and recall from classification report

clf_rpt = classification_report(y_test, y_pred, target_names=rf_grid_cv.classes_)
print(clf_rpt)
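
The precision and recall values in the classification report can be derived from the confusion matrix. The sketch below uses a made-up 3x3 matrix (rows are true classes, columns are predicted classes, following scikit-learn's convention); `per_class_precision_recall` is a hypothetical helper.

```python
def per_class_precision_recall(cm, labels):
    """Compute (precision, recall) per class from a square confusion matrix."""
    result = {}
    for i, label in enumerate(labels):
        true_positives = cm[i][i]
        predicted_as_class = sum(row[i] for row in cm)  # column total
        actually_class = sum(cm[i])                     # row total
        precision = true_positives / predicted_as_class if predicted_as_class else 0.0
        recall = true_positives / actually_class if actually_class else 0.0
        result[label] = (precision, recall)
    return result


# made-up confusion matrix for illustration
labels = ['low fatality', 'mass fatality', 'non-fatal']
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 2, 8]]
scores = per_class_precision_recall(cm, labels)
for label, (precision, recall) in scores.items():
    print(f'{label}: precision={precision:.2f}, recall={recall:.2f}')
```

Precision reads down a column (of everything predicted as a class, how much was correct), while recall reads across a row (of everything that truly belongs to a class, how much was found).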

### 4.4 Feature Importance

In [None]:
# determine feature importances 

'''
Permutation Importance: 
A strategy to measure the decrease in model performance as the result of 
randomly shuffling one feature at a time. More important features in the model’s final decision 
cause a larger drop in performance when shuffled.
'''

r = permutation_importance(rf_grid_cv, X_test, y_test,
                           n_repeats=5,
                           random_state=0)

perm_optimized = pd.DataFrame(columns=['AVG_Importance'], index=X_test.columns)
perm_optimized['AVG_Importance'] = r.importances_mean

perm_optimized = perm_optimized.sort_values('AVG_Importance', ascending=False)
perm_optimized[:10]

In [None]:
# visualize feature importances

plt.figure(figsize=(10,6))
sns.barplot(x=perm_optimized.AVG_Importance[:15],
            y=perm_optimized.index[:15])

plt.title('Feature Importances')
plt.show()

In [None]:
# select features with importance above threshold

importance_threshold = 0.010
perm_optimized = perm_optimized.loc[perm_optimized.AVG_Importance > importance_threshold]

# 5. Obtain Top Risk Factors of Historical Election Violence 

In [None]:
# get the risk factors most important for predicting historical election violence 

top_nelda_codes = [re.findall(r'nelda\d{1,2}',row) for row in perm_optimized.index]
top_nelda_codes = pd.DataFrame(np.concatenate(top_nelda_codes), columns=['nelda_feature'])

top_unique_nelda_codes = list(top_nelda_codes.nelda_feature.unique())
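
The regular expression above strips the one-hot suffix from each column name so that several dummy columns map back to a single NELDA code. The sketch below shows this on made-up column names of the form `<code>_<value>`, which is how `pd.get_dummies` names columns by default; `unique_nelda_codes` is a hypothetical helper.

```python
import re


def unique_nelda_codes(column_names):
    """Extract NELDA codes from one-hot column names, keeping first-seen order."""
    seen = []
    for name in column_names:
        for code in re.findall(r'nelda\d{1,2}', name):
            if code not in seen:
                seen.append(code)
    return seen


# made-up one-hot column names for illustration
columns = ['nelda11_2', 'nelda3_1', 'nelda11_1', 'nelda45_-1']
codes = unique_nelda_codes(columns)
print(codes)  # ['nelda11', 'nelda3', 'nelda45']
```

Because a code is kept if any of its dummy columns passes the importance threshold, the final list does not distinguish which answer (yes, no, unclear, or n/a) carried the signal.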

In [None]:
# look up text descriptions of top NELDA risk factors

top_nelda_code_descriptions = nelda_look_up.loc[nelda_look_up.nelda_code.isin(top_unique_nelda_codes)][['nelda_code', 'description_clean']]
top_nelda_code_descriptions = top_nelda_code_descriptions.rename(columns={'description_clean':'election_characteristic'})
top_nelda_code_descriptions.reset_index(inplace=True, drop=True)
# adjust column width to view data
pd.options.display.max_colwidth = 200
top_nelda_code_descriptions

In [None]:
# save list of election characteristics that correlate with election violence according to the RF model

top_nelda_code_descriptions.to_csv('FINAL_OUTPUT_characteristics_of_electoral_violence.csv')
