In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

A few weeks ago I was talking to a family member in the U.S. (I'm a U.S. citizen currently living in Germany) and we were discussing the recent spate of weather and other natural disasters that were hammering the States.  When we were done he said, “Well, as crazy as it is here, I’d take this any day over what you’re dealing with.”

I was a bit confused, and asked what disaster he was referring to.  He clarified, “No, I mean all of the terrorists driving trucks into crowds and setting off bombs on trains and stuff.”

Ah, right.  I’ve heard similar statements several times since I moved to Europe and never quite understood them - after all, while terrorism is horrific, I was pretty confident that the probability of being a victim of it is far lower than that of many other forms of violent crime or preventable death.  I replied, “You know, there are more gun deaths each day in the US than terrorism deaths in Europe every year.  What you should be afraid of is walking out your door.”

Not surprisingly, we agreed to disagree and the conversation ended cordially.  However, it got me thinking: was I right?  Is someone in Europe more at risk from an Islamic (or other radical) terrorist than an American is from another American with a gun?  If not, why the fear?

The first question sounded like a straightforward data analytics exercise, so I busted out a Jupyter notebook to explore, grabbed some data and challenged the hypothesis.

To analyze terrorism I chose the Global Terrorism Database (GTD), a very comprehensive collection of worldwide terrorism incidents over the last half century. Gun violence datasets were harder to come by, in part due to the successful lobbying efforts by the National Rifle Association (NRA), which block government research on gun violence, so I chose to work with the Centers for Disease Control (CDC) Multiple Causes of Mortality dataset, which classifies all deaths in the US, including deaths by firearms. The latest year in which the GTD and CDC datasets fully overlap is 2015, so I chose that as the year to focus on.


In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import json

In [3]:
us_population = 323100000.0 # Wikipedia 2016
eu_population = 743000000.0 # Wikipedia 2016 (population of Europe as a whole, not only the EU)
us_eu_pop_ratio = us_population/eu_population
one_million = 1000000

# Terrorism

Let’s start by looking at terrorism.  

In [4]:
df_t = pd.read_csv('../input/global-terrorism-db/globalterrorismdb_0617dist.csv', encoding='iso-8859-1')

In [5]:
df_t.info()

In [6]:
df_t['year'] = df_t['iyear']
df_t['month'] = df_t['imonth']

In [7]:
# Wow, that's a lot of columns.  Let's slim down to the columns most relevant for this analysis.
df_ts = df_t[['eventid', 'year', 'month', 'iday', 'country', 'country_txt', 'city', 'region', 'region_txt','attacktype1', 'summary', 'nkill', 'nkillus','nwound', 'nwoundus']]

In [8]:
df_ts.info()

In [9]:
df_ts.describe()

In [10]:
df_ts['nkill'].sum()

In [11]:
pd.pivot_table(data=df_ts, index='year', columns='region_txt', values='nkill', aggfunc='sum')\
    .plot.line(figsize=(15,5), colormap='tab20c').legend(title=None)

Worldwide, there was a significant spike in terrorism deaths over the most recent decade, with the vast majority of the increase coming from the Middle East, Africa, and South Asia.

If we zoom into this decade and look only at the US and Western Europe, this is what we see:

In [12]:
df_ts_eu = df_ts[df_ts['region_txt']=='Western Europe']
df_ts_us = df_ts[df_ts['country_txt']=='United States']
df_ts_comb = df_ts[(df_ts['country_txt']=='United States')|(df_ts['region_txt']=='Western Europe')]

df_ts_eu_2010 = df_ts_eu[df_ts_eu['year'] >= 2010]
df_ts_us_2010 = df_ts_us[df_ts_us['year'] >= 2010]
df_ts_comb_2010 = df_ts_comb[df_ts_comb['year'] >= 2010]

In [13]:
pd.pivot_table(data=df_ts_comb_2010, index=['year', 'month'], columns='region_txt', values='nkill', aggfunc='sum').fillna(0)\
    .plot.line(figsize=(12,5),colormap='tab20c').legend(title=None)

Look at the Y axis on both of the above graphs - it's clear that it's much safer to be in Europe or the US than in many other parts of the world (two orders of magnitude safer). While Europe has seen a relative spike in terrorism-related deaths since the end of 2015, it also has roughly double the population of the US, so to get a better picture of how this compares to US deaths we need to look at deaths per million residents. Here’s what we get:


In [14]:
eu_tot_terror_deaths = df_ts_eu_2010[df_ts_eu_2010['year'] == 2015]['nkill'].sum()
eu_terror_deaths_pm = eu_tot_terror_deaths * one_million/eu_population

print("2015 terror deaths EU: {0} total, or {1:1.2f} per million residents"\
      .format(
          eu_tot_terror_deaths, 
          eu_terror_deaths_pm)
     )

So in 2015 a European had roughly a 1 in 4,000,000 chance of dying in a terrorist attack. That sounds pretty small.  Just out of curiosity, I wonder how that compares to terrorist attacks on American soil:
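As a quick sanity check on the "1 in N" phrasing, a per-million rate converts to odds by dividing one million by the rate. The sketch below plugs in the rounded 0.23 per million figure that this notebook reports for Europe in 2015; `one_in_n` is a hypothetical helper name introduced here for illustration, not part of the original analysis.

```python
def one_in_n(rate_per_million):
    """Convert a rate per million residents into the N of '1 in N' odds."""
    if rate_per_million <= 0:
        raise ValueError("rate must be positive")
    return 1_000_000 / rate_per_million


eu_rate_pm = 0.23  # rounded 2015 figure for Europe, deaths per million residents
print("roughly 1 in {0:,.0f}".format(one_in_n(eu_rate_pm)))
```

With the rounded input this lands a little above 4.3 million, which is consistent with the "roughly 1 in 4,000,000" wording.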

In [15]:
us_tot_terror_deaths = df_ts_us_2010[df_ts_us_2010['year'] == 2015]['nkill'].sum()
us_terror_deaths_pm = us_tot_terror_deaths * one_million/us_population


print("2015 terror deaths US: {0} total, or {1:1.2f} per million residents"\
      .format(
          us_tot_terror_deaths, 
          us_terror_deaths_pm)
     )

I hate to write this because some knucklehead will quote it out of context, but on the surface Europeans have roughly twice the probability of being terror victims as Americans when adjusted for population (in 2015 at least). But that's like saying a person is twice as likely to be killed by a bear as by a shark - both numbers are so low that doubling either still gives a low number.  (In fact, the odds of dying in a shark or bear attack aren't too far off from those of death by a terrorist, but that's for another article.)

Let's look at the other side of the problem.

# Gun Deaths in the U.S. (via CDC)

In [16]:
# some convenience reference data regarding the CDC ICD codes
icd_gun_deaths = ['X72','X73','X74','X93','X94','X95','W32','W33','W34','Y22','Y23','Y24','Y35.0','Y36.4','U01.4']
icd_gun_homicides = ['X93','X94','X95']
icd_gun_suicides = ['X72','X73','X74']
icd_gun_accident = ['W32','W33','W34']
icd_gun_other = ['Y22','Y23','Y24','Y35.0','Y36.4','U01.4']

In [17]:
# use a context manager so the file handle is closed after loading
with open("../input/mortality/2015_codes.json", mode="r") as f:
    codes = json.load(f)

In [18]:
# import all of the CDC files needed and normalize them.  just set the start and end year below:
start_year = 2015
end_year = 2015
#########################
cdc_root = '../input/mortality/'
year_range = range(start_year, end_year+1)
print("concatenating files for years {0} to {1}".format(year_range[0], year_range[-1]))
df_yr = pd.read_csv('{0}/{1}_data.csv'.format(cdc_root,year_range[0]))
df_yr.rename(index=str, columns={'icd_code_10th_revision': 'icd_code_10'}, inplace=True)
df_g = df_yr[df_yr['icd_code_10'].isin(icd_gun_deaths)]
for yr in year_range[1:]:
    print('processing year={0}'.format(yr))
    df_yr = pd.read_csv('{0}/{1}_data.csv'.format(cdc_root, yr))
    # some files use the long column name; rename is a no-op when it is absent
    df_yr = df_yr.rename(columns={'icd_code_10th_revision': 'icd_code_10'})
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    df_g = pd.concat([df_g, df_yr[df_yr['icd_code_10'].isin(icd_gun_deaths)]])
        
df_g.shape
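For multi-year ranges, one common pattern is to collect the filtered per-year frames in a list and concatenate once at the end. The sketch below is only an illustration of that pattern: it uses tiny in-memory frames instead of the CDC CSV files, and `load_year`, the sample rows, and the assumption about which year uses which column name are all made up.

```python
import pandas as pd

icd_gun_homicides = ['X93', 'X94', 'X95']


def load_year(year):
    """Stand-in for pd.read_csv: returns a tiny made-up frame for one year."""
    # Assumption for illustration only: pre-2015 files use the long column name.
    col = 'icd_code_10' if year >= 2015 else 'icd_code_10th_revision'
    return pd.DataFrame({col: ['X95', 'I21', 'X93'], 'current_data_year': year})


frames = []
for yr in range(2014, 2016):
    df_yr = load_year(yr)
    # rename is a no-op when the long column name is absent
    df_yr = df_yr.rename(columns={'icd_code_10th_revision': 'icd_code_10'})
    # keep only the rows whose ICD code is in the list of interest
    frames.append(df_yr[df_yr['icd_code_10'].isin(icd_gun_homicides)])

# a single concat avoids repeatedly copying a growing frame
df_demo = pd.concat(frames, ignore_index=True)
print(df_demo.shape)
```

Concatenating once scales better than growing a frame inside the loop, which matters if `start_year` and `end_year` span many years.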


In [19]:
# drop columns we won't need
df_g = df_g.drop(df_g.columns[28:70], axis=1)

In [20]:
df_g.info()

In [21]:
# to reduce the icd_10 gun deaths to a few useful groups
def classify(icd_10):
    if icd_10 in icd_gun_homicides:
        return 'homicide'
    elif icd_10 in icd_gun_suicides:
        return 'suicide'
    elif icd_10 in icd_gun_accident:
        return 'accident'
    else:
        return 'other'
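An alternative to the if/elif chain is a single code-to-intent dictionary applied with `Series.map`. This is a minimal sketch on made-up sample codes; `intent_by_code` and `sample` are names introduced here for illustration, and the three lists are repeated so the snippet runs on its own.

```python
import pandas as pd

icd_gun_homicides = ['X93', 'X94', 'X95']
icd_gun_suicides = ['X72', 'X73', 'X74']
icd_gun_accident = ['W32', 'W33', 'W34']

# Build one code -> intent dictionary from the three lists.
intent_by_code = {}
for label, code_list in [('homicide', icd_gun_homicides),
                         ('suicide', icd_gun_suicides),
                         ('accident', icd_gun_accident)]:
    for code in code_list:
        intent_by_code[code] = label

sample = pd.Series(['X95', 'X74', 'W34', 'Y22'])  # made-up sample codes
# codes missing from the dictionary map to NaN, then fall back to 'other'
intent = sample.map(intent_by_code).fillna('other')
print(intent.tolist())
```

The result should match what `classify` returns for the same codes, with anything outside the three lists falling into `'other'`.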
    

In [22]:
# uses the hispanic origin code as hispanics are not identified in the "race" column
def race_reclassify(race, hisp_orig):
    if hisp_orig == '6':
        return 'White'
    elif hisp_orig == '7':
        return 'Black'
    elif hisp_orig == '8':
        if race == '03':
            return "Native American"
        else:
            return "Asian/Pacific Islander"
    else:
        return 'Hispanic'

In [23]:
df_g['education_2003_revision'] = df_g['education_2003_revision'].apply(lambda x: None if pd.isnull(x) else codes['education_2003_revision'][str(int(x))])

In [24]:
df_g['education_2003_revision'].value_counts()

In [25]:
# add some helpful aggregations - first whether the death was homicide, suicide or other
df_g['intent'] = df_g['icd_code_10'].apply(classify)

In [26]:
df_g['intent'].value_counts()

In [27]:
# now clean up the race classifications
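# access row values by column label; positional x[0]/x[1] lookups on a labelled Series are deprecated in recent pandas
df_g['race'] = df_g[['race','hispanic_originrace_recode']].apply(lambda x: race_reclassify('{0:02d}'.format(x['race']), str(x['hispanic_originrace_recode'])), axis=1)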

In [28]:
df_g['race'].value_counts()

In [29]:
df_g['age'] = df_g['detail_age']

How do, and how many, people die from guns in the US?

In [30]:
pie = df_g.groupby('intent').size()
pie = pd.DataFrame(index=pie.index, data=pie, columns=['fatalities'])
pie.plot.pie(
    y='fatalities',
    figsize=(5,5), 
    colormap='tab20c', 
    title='2015 Gun Deaths', 
    legend=None
)

In [31]:
df_g['intent'].value_counts()

The rough numbers/ratios above have been quoted quite a bit over recent years - roughly 35K gun deaths per year with ~1/3 homicides and ~2/3 suicides - so no big surprises there. 
 
Since terror attacks are essentially homicides, let's look at gun homicides per million so we can compare with the terrorist threat:

In [32]:
us_tot_gun_homicides = df_g[df_g['intent'] == 'homicide'].shape[0]
us_gun_homicides_pm = us_tot_gun_homicides * one_million/us_population

print("2015 gun homicides US: {0} total, or {1:1.2f} per million residents"\
      .format(
          us_tot_gun_homicides, 
          us_gun_homicides_pm)
     )

So, at ~40 gun homicides per million residents, **an American is ~175x more likely to die from a gun homicide in the US than a European is from a terrorist in Europe**.  Hmm. 
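The multiplier is simply the ratio of the two per-million rates. Here is a small sketch using the rounded figures quoted in the text (about 40 and 0.23 per million); `relative_risk` is a hypothetical helper name, and the exact value will shift slightly if the unrounded rates are used.

```python
def relative_risk(rate_a, rate_b):
    """Ratio of two rates expressed in the same unit (here: deaths per million)."""
    if rate_b <= 0:
        raise ValueError("the reference rate must be positive")
    return rate_a / rate_b


us_gun_homicides_pm_rounded = 40.0   # rounded figure from the text
eu_terror_deaths_pm_rounded = 0.23   # rounded figure from the text

ratio = relative_risk(us_gun_homicides_pm_rounded, eu_terror_deaths_pm_rounded)
print("about {0:.0f}x".format(ratio))
```

With these rounded inputs the ratio comes out just under 175, in line with the "~175x" quoted above.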

But... it could be argued that this isn’t a fair comparison.  I’ve heard several arguments that have gone something like this: “Terrorists tend to strike at random, killing innocent, unsuspecting victims.  U.S. gun violence mostly happens in Chicago, St. Louis and Detroit and involves gangs and criminals.”  In other words, U.S. gun violence is about “them”, and we’re not “them”.

So how can we whittle the dataset down to “not them”?

Let's see what we can find as we drill into the CDC data…


To get a more nuanced view, let’s break it down by sex, race and age.  First, sex:

In [33]:
df_piv = pd.pivot_table(data=df_g[['sex','intent']], index=['sex'], columns=['intent'], aggfunc=len)
df_piv.plot.bar(stacked=True, figsize=(5,5), colormap='tab20', title='Total Gun Deaths by Sex (2015)').legend(loc='center left', bbox_to_anchor=(1, 0.5))

In absolute terms, American men are ~6X more likely than women to die from a gun, while on a percentage basis the breakdown by intent is similar for men and women, with suicide being the largest contributor.

In [34]:
df_piv.div(df_piv.sum(1), axis=0)\
.plot.bar(stacked=True, figsize=(5,5),colormap='tab20',title='Relative Causes of Gun Deaths by Sex (2015)')\
.legend(loc='center left', bbox_to_anchor=(1, 0.5))
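The same counts and row-wise shares can also be produced with `pd.crosstab`, which counts index/column combinations directly and accepts `normalize='index'` to turn each row into proportions. This is a sketch on a made-up six-row frame (`toy` is not part of the CDC data), offered as an alternative to the `pivot_table(..., aggfunc=len)` plus `div` approach rather than a replacement for it.

```python
import pandas as pd

# made-up rows standing in for df_g[['sex', 'intent']]
toy = pd.DataFrame({
    'sex': ['M', 'M', 'M', 'F', 'F', 'M'],
    'intent': ['suicide', 'homicide', 'suicide', 'suicide', 'homicide', 'accident'],
})

# absolute counts per sex and intent
counts = pd.crosstab(toy['sex'], toy['intent'])

# share of each intent within a sex; every row sums to 1
shares = pd.crosstab(toy['sex'], toy['intent'], normalize='index')

print(counts)
print(shares)
```

Either table can then be passed to `.plot.bar(stacked=True)` in the same way as the pivot tables in this notebook.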


How about race?


In [35]:
df_piv = pd.pivot_table(data=df_g[['race','intent']], index=['race'], columns=['intent'], aggfunc=len)
df_piv.plot.bar(stacked=True, figsize=(5,5),colormap='tab20',title="Total Gun Deaths by Race (2015)")\
    .legend(loc='center left', bbox_to_anchor=(1, 0.5))

The differences here are striking - among gun deaths, Black and Hispanic victims are far more likely to have died from homicide, while white victims overwhelmingly took their own lives.  To get a different perspective, let’s look at this on a percentage basis:


In [36]:
df_piv.div(df_piv.sum(1), axis=0)\
    .plot.bar(stacked=True, figsize=(5,5),colormap='tab20', title="Relative Causes of Gun Deaths by Race (2015)")\
    .legend(loc='center left', bbox_to_anchor=(1, 0.5))

Again, some striking differences in intent between different racial groups.  (My gut tells me the homicide rate roughly correlates with average income level, but that’s an analysis for another day.)


Ok, maybe education plays a role, either directly or as a proxy for socio-economic status:

In [37]:
df_piv = pd.pivot_table(data=df_g[['education_2003_revision','intent']], index=['education_2003_revision'], columns=['intent'], aggfunc=len)
df_piv.sort_values('homicide', ascending=False).plot.bar(stacked=True, figsize=(5,5),colormap='tab20', title="Total Gun Deaths by Education Level (2015)").legend(loc='center left', bbox_to_anchor=(1, 0.5))

Ok, pretty clear correlation there.

And now let’s look at age.  Here are two views, one broken down by intent and the other by race:

In [38]:
pd.pivot_table(data=df_g[['age','intent']], index=['age'], columns=['intent'], aggfunc=len)\
    .plot.bar(stacked=True, figsize=(17,5),colormap='tab20', title="Total Gun Deaths by Age of Victim (2015)")

In [39]:
pd.pivot_table(data=df_g[['age','race']], index=['age'], columns=['race'], aggfunc=len)\
    .plot.bar(stacked=True, figsize=(17,5),colormap='tab20b',title="Total Gun Deaths by Age and Race of Victim (2015)")

*(Note: the bump around 50 is due to a spike in white male suicide...  Remember, remember the month of Movember...)*

While tragic, the suicides, accidents and undetermined-cause events aren't relevant to this analysis, so we'll exclude them to focus exclusively on homicides and revisit the age vs race graph in this light:

In [40]:
df_gunhom = df_g[df_g['intent'] == 'homicide']

In [41]:
# df_gunhom.groupby('age').size().plot.bar(figsize=(15,5))
pd.pivot_table(data=df_gunhom[['age','race']], index=['age'], columns=['race'], aggfunc=len)\
    .plot.bar(stacked=True, figsize=(17,5),colormap='tab20b', title="Total Gun Homicides by Age of Victim (2015)")

So, it appears that gun homicides skew heavily towards young Black and Hispanic males without college degrees.  It feels wrong removing men from the equation since most of the comments I’ve heard relating to this hypothesis have come from men, so let’s just filter on the other dimensions and look at whites over 30 with college degrees:

In [42]:
# select only white people with college degrees and older than 30 as a proxy for the people that I talk to who are most concerned
df_o = df_g[
    (df_g['intent'] =='homicide') &
    (df_g['race'] == 'White') &
    (df_g['current_data_year'] == 2015) & 
    (df_g['age'] > 30) & 
    (df_g['education_2003_revision'].isin(['Associate degree','Bachelor’s degree','Master’s degree','Doctorate or professional degree']))]

In [43]:
us_total_nt_gun_homicides = df_o.shape[0]
us_nt_gun_homicides_pm = us_total_nt_gun_homicides * one_million/us_population

print("2015 gun homicides US (white, over 30, college degree): {0} total, or {1:1.2f} per million residents"\
      .format(
          us_total_nt_gun_homicides, 
          us_nt_gun_homicides_pm)
     )

So even this limited demographic is still ~5X more likely to die from a gun homicide in the US than a European is from a terrorist attack.

Let's review:  
* In 2015 a person in Europe had a 0.23 chance in a million of being killed by a terrorist.  
* That same year, a person in the U.S. had a probability somewhere between 1.2 and 40 per million of being killed by another American with a gun.

At this point I think I can be pretty confident that my original hypothesis is true: 
**an American is at much higher risk (between 5 and 175x depending on the assumptions) of being killed by another American with a gun than a European is of being killed by a terrorist.**


So now we have the facts, but this still doesn’t shed any light on why the fear factor seems upside-down.  That'll be for a future analysis.
