# Global Terrorism EDA
_Hello, fellow Kagglers! This is my attempt to create an EDA. I'm trying to get stronger, and I would very much appreciate every comment, piece of advice and correction. This notebook is a sort of replica of the fantastic EDA [Terrorist Activities Around The World](https://www.kaggle.com/ash316/terrorism-around-the-world); I tried to reproduce it on my own and made some changes, but the kudos must go to [Ashwini Swain](https://www.kaggle.com/ash316)._

The primary goal of this notebook is to get a good understanding of the tendencies underlying the data. Some secondary goals are:
* To practice plotting and data wrangling
* To learn how to plot maps via different Python libraries

## Table of Contents <a id='TOC'></a>
1. [Questions for the Data](#Questions)
2. [Basic Descriptive Exploration](#Basics)
3. [Distribution by Region and Country](#Region)
4. [Distribution by Attack Type](#AttackType)
5. [Distribution by Target Type](#TargetType)
6. [Maps](#Maps)
7. [Notorious Terrorist Groups](#Groups)


In [None]:
# importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

# map libraries
import cartopy.crs as ccrs
import cartopy.feature as cf
import folium


from scipy.stats import zscore
from ipywidgets import interactive, IntSlider, Play
from IPython.display import display
from urllib.request import urlopen
from plotly.offline import init_notebook_mode, iplot

import warnings
import json

warnings.filterwarnings('ignore')
init_notebook_mode(connected=True)

In [None]:
# preparing the data
data = pd.read_csv('../input/gtd/globalterrorismdb_0718dist.csv', encoding="ISO-8859-1")

# we also take some demographic data to look at population density characteristics
population = pd.read_csv('../input/world-population-19602018/population_total_long.csv')
pop_density = pd.read_csv('../input/world-population-19602018/population_density_long.csv')

# make the data a little bit more tidy
# we will use 'Country' columns to merge dataframes, so it's useful for them to have same names
pop_density.rename(columns={'Count': 'Density', 'Country Name':'Country'}, inplace=True)
population.rename(columns={'Count':'Population', 'Country Name':'Country'}, inplace=True)
data.rename(columns={'iyear':'Year','imonth':'Month','iday':'Day','country_txt':'Country',
                     'region_txt':'Region','attacktype1_txt':'AttackType','target1':'Target',
                     'nkill':'Killed','nwound':'Wounded','summary':'Summary','gname':'Group',
                     'targtype1_txt':'TargetType','weaptype1_txt':'WeaponType','motive':'Motive',
                     'city':'City', 'latitude':'Latitude','longitude':'Longitude'},
            inplace=True)

# pop_density and population have other country names, so we wrangle this a little bit
countries_old = ['Bahamas, The', 'Bosnia and Herzegovina', 'Brunei Darussalam', 'Congo, Dem. Rep',
                 'Egypt, Arab Rep.', 'Gambia, The', 'Hong Kong SAR, China', 'Iran, Islamic Rep.', 
                 'Kyrgyz Republic', 'Lao PDR', 'Macao SAR, China', 'Congo, Rep.', 'Russian Federation',
                 'Syrian Arab Republic', 'West Bank and Gaza', 'Yemen, Rep.', 'Korea, Dem. People’s Rep.',
                 'Korea, Rep.']
countries_new = ['Bahamas', 'Bosnia-Herzegovina', 'Brunei', 'Democratic Republic of the Congo',
                'Egypt', 'Gambia', 'Hong Kong', 'Iran', 'Kyrgyzstan', 'Laos', 'Macau',
                'Republic of the Congo', 'Russia','Syria', 'West Bank and Gaza Strip', 'Yemen',
                'North Korea', 'South Korea']

for i, country in enumerate(countries_old):
    population.loc[population['Country'] == country, 'Country'] = countries_new[i]
    pop_density.loc[pop_density['Country'] == country, 'Country'] = countries_new[i]    
    
# our final dataset
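# .copy() makes terror an independent dataframe, so adding a column below
# does not trigger a SettingWithCopyWarning
terror = data[['Year','Month','Day','Country','Region','City','Latitude','Longitude','AttackType',
               'Killed','Wounded','Target','Summary','Group','TargetType','WeaponType','Motive']].copy()
# Casualties is NaN whenever Killed or Wounded is missing
terror['Casualties'] = terror['Killed'] + terror['Wounded']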

## Questions for the Data <a id='Questions'></a> [[⇧](#TOC)]

First we need to decide what we would like to know. Having a lot of data is no good if you don't know what you want to get out of it.
My first inquiry is __whether the overall terrorism situation in the world is getting better or worse__, i.e.:

* Are terrorist attacks becoming more or less frequent?
* Are the average casualties per attack getting bigger or smaller?
* Is the number of terrorist attacks per 100,000 people in the world getting bigger or smaller?

The last subquestion seems important to me, because it is logical that the absolute number of terrorist attacks should grow as the world population grows. So if the number of attacks per 100,000 people decreased while the absolute number increased, it would still mean that the situation is getting better overall.
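
To make the per-capita argument concrete, here is a minimal sketch with made-up numbers (they are not taken from the dataset): the absolute count rises while the rate per 100,000 people falls. The helper name `rate_per_100k` is my own.

```python
def rate_per_100k(events, population):
    """Return the number of events per 100,000 people."""
    return events / population * 100_000

# Hypothetical figures, chosen only to illustrate the point
early = rate_per_100k(events=1_000, population=4_000_000_000)
late = rate_per_100k(events=1_500, population=7_500_000_000)

print(f"early: {early:.4f}, late: {late:.4f}")
```

In this toy case there are 50% more attacks in the later period, yet the rate per 100,000 people is lower.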

My second question is: __What is the distribution of terrorist attacks by casualties?__ It seems that this distribution should have a fat tail. My prior is that the distribution is lognormal.

_<span style='color:darkgreen'>Here I have doubts about the distribution, because the lognormal PDF is defined only for $x>0$ and it seems that most terrorist attacks have casualties $x = 0$. Thus it seems more similar to an exponential distribution, but the exponential does not have a fat tail as far as I understand.</span>_
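
One possible way around the $x>0$ issue, sketched here on synthetic data rather than on the real dataset, is to treat zero-casualty attacks separately and look at the logarithm of the positive casualties only. If the positive part were lognormal, its logarithm would look roughly normal. The sample is generated, and the values of `mu`, `sigma` and the 50% share of zeros are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic casualties: about half of the attacks have zero casualties,
# the rest follow a lognormal distribution (arbitrary mu and sigma).
mu, sigma = 1.0, 1.5
n = 10_000
is_zero = rng.random(n) < 0.5
casualties = np.where(is_zero, 0.0, rng.lognormal(mean=mu, sigma=sigma, size=n))

# Part 1: share of zero-casualty attacks
zero_share = np.mean(casualties == 0)

# Part 2: lognormal parameters estimated from the positive part only
log_pos = np.log(casualties[casualties > 0])
mu_hat, sigma_hat = log_pos.mean(), log_pos.std(ddof=1)

print(f"zero share: {zero_share:.2f}, mu: {mu_hat:.2f}, sigma: {sigma_hat:.2f}")
```

Real casualty counts are integers, so this is only a rough check and not a proper fit.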

My third question is: __What is the distribution of terrorist attacks by country and region?__

* Are there any correlations with population density?
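
A minimal sketch of how such a correlation could be checked, assuming a per-country table with `Density` and `RelCount` (attacks per 100,000 people) columns. The values below are invented for illustration. I use a rank (Spearman) correlation here because population density is very skewed; that choice is my assumption, not something the data dictates.

```python
import pandas as pd

# Hypothetical per-country values, for illustration only
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'E'],
    'Density': [10, 50, 200, 1000, 7000],       # people per square km
    'RelCount': [0.5, 0.1, 0.3, 0.05, 0.01],    # attacks per 100,000 people
})

# Spearman correlation is the Pearson correlation of the ranks
spearman = toy['Density'].rank().corr(toy['RelCount'].rank())
print(round(spearman, 2))
```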

My fourth question is: __What are the primary targets of terrorists and are they changing over time?__

## Basic Descriptive Exploration <a id='Basics'></a> [[⇧](#TOC)]

### Missing Data
First of all, let's check whether all of our data is in place:

In [None]:
n1 = terror.isnull().sum()  # count missing cells
n2 = n1 / terror.shape[0]   # find relative missing count

miss_df = pd.concat((n1,n2),axis=1)
miss_df.columns = ['Missing', 'Percentage']
miss_df = miss_df.sort_values(by='Missing',ascending=False).round(2)

display(miss_df[miss_df['Missing']>0].T)

### Motives
It seems that most of the time we don't know the motive of the terrorist attacks, and a large part of the `Summary` column is missing as well. But what do we see when the motive is known?

In [None]:
# share of each stated motive among the non-missing values
# (naming the series and its index explicitly works across pandas versions)
motives_df = (terror['Motive'].value_counts(normalize=True)
                              .rename('Ratio')
                              .rename_axis('Motive')
                              .reset_index())

display(motives_df[['Motive','Ratio']].head())

This is somewhat interesting. `Motive` is missing in 72% of our data, and when it is not missing, about 60% of the time the "specific motive for the attack is unknown or was not reported". That strongly suggests that we don't know our enemy at all.
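
Putting the two figures together, roughly $0.72 + 0.28 \times 0.60 \approx 0.89$ of the motives would be effectively unknown. Below is a minimal sketch of how that share could be counted directly, on a toy stand-in for the `Motive` column (the strings are illustrative). Matching on the word "unknown" is a simplifying assumption and may over- or under-count on the real data.

```python
import pandas as pd

# Toy stand-in for the `Motive` column
motive = pd.Series([
    None,
    None,
    'Unknown',
    'The specific motive for the attack is unknown or was not reported.',
    'To protest government policy',
])

# Treat both missing values and "unknown" wording as unknown
says_unknown = motive.str.contains('unknown', case=False, na=False)
effectively_unknown = motive.isna() | says_unknown

print(f"{effectively_unknown.mean():.0%} of motives are effectively unknown")
```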

Now let's dig some basic facts out of our dataframe:

In [None]:
print('Country with maximum number of attacks: ', terror['Country'].value_counts().index[0])
print('Region with most number of attacks: ', terror['Region'].value_counts().index[0])
print('Maximum people killed in an attack are:', terror['Killed'].max(), 
      'that took place in ',terror.loc[terror['Killed'].idxmax()].Country,
      'in ', terror.loc[terror['Killed'].idxmax()].Year)

### Do terrorist attacks become more or less frequent?

In [None]:
# count the world population for each year (sum only the numeric column)
world_pop = population.groupby(by='Year', as_index=False)['Population'].sum()

# number of attacks and total casualties per year, merged with world population
# (aggregating before the merge keeps every value attached to its own year)
by_year = terror.groupby(by='Year', as_index=False).agg(Count=('Country', 'count'),
                                                        Casualties=('Casualties', 'sum'))
by_year = by_year.merge(world_pop, on='Year')

# create columns for variables of interest
by_year['RelCount'] = by_year['Count'] / by_year['Population'] * 100000
by_year['RelCasualties'] = by_year['Casualties'] / by_year['Population'] * 100000
by_year['CasPerAttack'] = by_year['Casualties'] / by_year['Count']

In [None]:
# make edges visible
plt.rcParams['patch.force_edgecolor'] = True

# create four plots
f, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(18,20), gridspec_kw={'hspace': 0.3})

sns.countplot(x='Year', data=terror, color='firebrick', ax=ax1)
ax1.tick_params(axis='x', labelrotation=90)
ax1.set_title('Number Of Terrorist Activities Each Year')

sns.barplot(x='Year', y='Casualties', data=by_year, color='darkorange', ax=ax2)
ax2.tick_params(axis='x', labelrotation=90)
ax2.set_title('Total Casualties from Terrorist Attacks by Year')

sns.barplot(x='Year', y='RelCount',data=by_year, color='salmon', ax=ax3)
ax3.tick_params(axis='x', labelrotation=90)
ax3.set_title('Terrorist Attacks per 100 000 people')

sns.barplot(x='Year', y='RelCasualties',data=by_year, color='wheat', ax=ax4)
ax4.tick_params(axis='x', labelrotation=90)
ax4.set_title('Casualties per 100 000 people')

plt.show()

Now this is disturbing. Not only is the absolute number of attacks increasing over the years, but so are the number of attacks and the casualties relative to the world population. And what about casualties per attack in each year?

In [None]:
bins = [0, 10, 100, 200, 400, 800, 1600, 3200, 6400, 12800]
labels = ['0-10', '10-100', '100-200', '200-400', '400-800', 
          '800-1600', '1600-3200', '3200-6400', '6400-12800']

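# create casualties distribution: number of attacks in each casualty bin
# include_lowest=True keeps attacks with zero casualties in the '0-10' bin
cas_distrib = pd.cut(terror['Casualties'], bins=bins, labels=labels,
                     include_lowest=True).value_counts()
cas_distrib = cas_distrib.sort_index().rename('Attacks').to_frame()

# convert the attack counts to a logarithmic scale
cas_distrib['LogCasualties'] = np.log1p(cas_distrib['Attacks'])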

# create two plots
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5), gridspec_kw={'width_ratios': [4,1]})

sns.barplot(x='Year', y='CasPerAttack',data=by_year, ec = 'firebrick', color='wheat', ax=ax1)
ax1.set_title('Casualties per one Attack')
ax1.tick_params(axis='x',labelrotation=90)

cas_distrib['LogCasualties'].plot(kind='barh', width=0.8, ec='firebrick', color='wheat', ax=ax2)
ax2.invert_yaxis()
ax2.set_title('Logarithmic Distribution of Attacks by Casualties')

f.tight_layout()
plt.show()

We can see that attacks became deadlier during the 2000s, but then the average casualties per attack went down, which may not be good news since the frequency of attacks has more than quintupled since 2001. We can also see that the logarithm of the number of attacks falls almost linearly from one casualty bin to the next. This really looks like a distribution with a fat tail. We can guess that an attack from the `12800-25600` bin has simply not happened yet, and we should be prepared for it.
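
Another common way to look at a fat tail, sketched here as an assumption-laden aside, is the empirical complementary CDF (the share of observations at or above each value) on log-log axes: a power-law tail shows up as a roughly straight line. The sample below is a synthetic Pareto sample, not the real casualties, and `empirical_ccdf` is a helper name of my own.

```python
import numpy as np

def empirical_ccdf(values):
    """Return sorted values and the share of observations >= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    # For the i-th smallest value (0-based), n - i observations are at least as large
    ccdf = 1.0 - np.arange(len(x)) / len(x)
    return x, ccdf

# Synthetic heavy-tailed sample (Pareto with minimum 1 and tail index 1.5)
rng = np.random.default_rng(1)
sample = rng.pareto(a=1.5, size=5_000) + 1.0

x, ccdf = empirical_ccdf(sample)

# On log-log axes a power-law tail is close to a straight line with slope -a
slope = np.polyfit(np.log(x), np.log(ccdf), deg=1)[0]
print(f"estimated tail slope: {slope:.2f}")
```

To apply this to the real data one would have to decide how to treat zero-casualty attacks first, since the logarithm is undefined at zero.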

## Distribution by Region and Country <a id='Region'></a> [[⇧](#TOC)]
My next question is how these activities are distributed across the globe and whether the distribution is changing over time. Let's start with a basic overview of the regions and see how active terrorists were in each of them.

In [None]:
plt.subplots(figsize=(18,6))

plt.subplot(121)
sns.countplot(x='Region', data=terror, palette='RdYlGn', edgecolor=sns.color_palette('dark',7),
              order = terror['Region'].value_counts().index)
plt.xticks(rotation=90)
plt.title('Number Of Terrorist Activities by Region')

plt.subplot(122)
# the 15 countries with the largest number of attacks
top_countries = terror['Country'].value_counts()[:15]
sns.barplot(x=top_countries.index, y=top_countries.values,
            palette='inferno')
plt.xticks(rotation=90)
plt.title('Top Affected Countries')

plt.show()

This plot gives us a little information. We can already see the regions that have suffered the most from terrorist attacks: Middle East & North Africa and South Asia. Let's look at the dynamics: how terrorist activity changed over time in different regions.

In [None]:
terror_region = pd.crosstab(terror['Year'], terror['Region'])
terror_region.plot(color=sns.color_palette('Set2', 12))
fig=plt.gcf()
fig.set_size_inches(18,6)
plt.show()

This plot is more vivid, although several lines are squeezed to the bottom. Still, we can see how the Middle East & North Africa became over time the most dangerous region in the world in terms of terrorism.

In [None]:
# create cross table
by_region = pd.crosstab(terror['Year'], terror['Region'])
by_region.plot(kind='bar', stacked=True, width=0.8, colormap='tab20c');
fig = plt.gcf()
fig.set_size_inches(15,7)
plt.title("Terrorist Activities Distribution by Region")
plt.show()

In [None]:
plt.rcParams['patch.force_edgecolor'] = False
by_region = pd.crosstab(terror['Year'], terror['Region'], normalize='index')
by_region.plot(kind='bar', stacked=True, width=1,colormap='tab20c');
fig = plt.gcf()
fig.set_size_inches(15,7)
plt.title("Terrorist Activities Distribution by Region")
plt.legend(bbox_to_anchor=(-0.01, -0.2), loc='upper left', ncol=6)
plt.show()

We can see a massive increase in terrorist attacks in South Asia since 1980. Terrorism has been present in the Middle East & North Africa ever since 1970, but both the absolute and the relative numbers there have greatly increased since the 1990s. Another salient region is Sub-Saharan Africa.
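
The two stacked charts above differ only in the `normalize='index'` argument of `pd.crosstab`. A small sketch on toy data (the rows are invented) shows what it does: without it each cell is a count, with it each row is divided by its total, so every year sums to 1.

```python
import pandas as pd

# Toy data: one row per attack (hypothetical values)
toy = pd.DataFrame({
    'Year':   [1970, 1970, 1970, 1971, 1971],
    'Region': ['South Asia', 'Western Europe', 'Western Europe',
               'South Asia', 'South Asia'],
})

counts = pd.crosstab(toy['Year'], toy['Region'])                     # absolute counts
shares = pd.crosstab(toy['Year'], toy['Region'], normalize='index')  # row shares

print(counts)
print(shares.round(2))
```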

In [None]:
# number of attacks by country and by year
by_country = terror.groupby(by=['Year','Country'], as_index=False)['Month'].count()
by_country.rename(columns={'Month':'Count'},inplace=True)
# casualties by country and by year
cas_coun_year_df = terror.groupby(by=['Year','Country'])['Casualties'].sum()
# to one df
by_country = by_country.merge(cas_coun_year_df, on=['Country','Year'])

# merge with pop_density and with population dataframes
by_country = pop_density.merge(by_country, on=['Country','Year'])
by_country = population.merge(by_country, on=['Country','Year'])

# create columns with relative variables:
by_country['RelCount'] = by_country['Count'] / by_country['Population'] * 100000
by_country['RelCasualties'] = by_country['Casualties'] / by_country['Population'] * 100000

# take only the most recent years in the data (2010 onward)
ten_years_df = by_country[by_country['Year'] >= 2010]
ten_years_df = ten_years_df.groupby(by='Country', as_index=False).agg('mean')
# get rid of outliers (for plotting)
# ten_years_df = ten_years_df[(np.abs(zscore(ten_years_df[['Density','Count']])) < 3).all(axis=1)]

## Distribution of Terrorist Activities by Attack Type <a id='AttackType'></a> [[⇧](#TOC)]
First, let's see what the distribution looks like overall, across all years since 1970. What are the most frequent attack methods?

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,8), gridspec_kw={'wspace':0.5})

sns.countplot(x='AttackType', data=terror, palette='inferno',
              order=terror['AttackType'].value_counts().index,
              ax=ax1)
ax1.set_title("Terrorists' Methods of Attack")
ax1.tick_params(axis='x', labelrotation=90)

pd.crosstab(terror.Region, terror.AttackType).plot.barh(stacked=True, width=1, color=sns.color_palette('RdYlGn', 9),
                                                       ax=ax2)
ax2.set_title('Terrorist Attack Types in Different Regions')
ax2.tick_params(axis='x', labelrotation=90)

plt.show()

And now let's see how this distribution changes over time.

In [None]:
by_attack = pd.crosstab(terror['Year'], terror['AttackType'], normalize='index')
by_attack.plot(kind='bar', stacked=True, width=1, colormap='tab10')
fig = plt.gcf()
fig.set_size_inches(18,7)
plt.title("Terrorist Activities Distribution by Attack Type")
plt.legend(bbox_to_anchor=(-0.01, -0.2), loc='upper left', ncol=5)
plt.show()

We can see that assassination became a much less frequent method of terrorism over the years as well as facility/infrastructure attack and armed assault became more frequent.

## Distribution of Terrorist Attacks by Target Type <a id='TargetType'></a> [[⇧](#TOC)]
Let's find out what the main targets for terrorists are:

In [None]:
sns.countplot(x='TargetType', data=terror, palette='tab20c',
              order=terror['TargetType'].value_counts().index)
fig = plt.gcf()
fig.set_size_inches(18,5)
plt.xticks(rotation=90)
plt.title("Terrorists' Primary Targets of Attack")
plt.show()

In [None]:
by_target = pd.crosstab(terror['Year'], terror['TargetType'], normalize='index')
by_target.plot(kind='bar', stacked=True, width=1, colormap='tab20c')
fig = plt.gcf()
fig.set_size_inches(18,7)
plt.title("Terrorist Activities Distribution by Target Type")
plt.legend(bbox_to_anchor=(0.01, -0.2), loc='upper left', ncol=5)
plt.show()

What I find interesting here is the increase in the __"Unknown"__ category. It is also interesting that transport and diplomatic targets seem to have almost disappeared as targets.

## Maps <a id='Maps'></a> [[⇧](#TOC)]

This section is primarily for exercises with Folium, Cartopy, ipywidgets and plotly libraries. However, we can still get some valuable insights looking at this data.

### Interactive Cartopy Map
This map shows the distribution of terrorist attacks around the world by year, so we can see where and when each attack happened.
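
A side note on coordinates: the Miller projection stretches latitudes, so raw latitude values are not the same thing as projected y coordinates. The sketch below uses the standard Miller cylindrical formula on a unit sphere, $y = \tfrac{5}{4}\ln\tan\left(\tfrac{\pi}{4} + \tfrac{2}{5}\varphi\right)$, converted back to degrees. The function name `miller_y_degrees` is my own. With Cartopy, longitude/latitude data is normally described with `transform=ccrs.PlateCarree()`, so that the library performs this conversion itself.

```python
import math

def miller_y_degrees(lat_deg):
    """Miller cylindrical y coordinate (unit sphere), converted to degrees."""
    phi = math.radians(lat_deg)
    y = 1.25 * math.log(math.tan(math.pi / 4 + 0.4 * phi))
    return math.degrees(y)

# Compare the latitude with its projected y coordinate
for lat in (0, 30, 45, 60):
    print(lat, round(miller_y_degrees(lat), 2))
```

The gap between the two numbers grows with latitude, which is why points far from the equator drift the most when the wrong coordinate system is assumed.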

In [None]:
def map_world(year):
    year_df = terror[terror['Year'] == year]
    # attacks with many casualties get a marker scaled by the casualty count
    big = year_df[year_df['Casualties'] >= 75]
    small = year_df[year_df['Casualties'] < 75]

    ax = plt.axes(projection=ccrs.Miller())
    # fix the extent; without these lines the map would be scaled to the data
    # ax.set_global()
    ax.set_ylim([-90,90])
    ax.set_xlim([-180,180])
    ax.coastlines()
    ax.add_feature(cf.LAND, color='green')
    ax.add_feature(cf.OCEAN, color='lightskyblue')
    ax.set_title("Terrorist Attacks around the World in {}".format(year))

    # the data are plain longitude/latitude degrees, so the transform is
    # PlateCarree (the coordinate system of the data), not the map projection
    plt.scatter(small['Longitude'], small['Latitude'], color='white', marker='o',
                s=2, transform=ccrs.PlateCarree())

    plt.scatter(big['Longitude'], big['Latitude'], color='salmon', marker='o',
                s=big['Casualties'] / 10, ec='firebrick',
                transform=ccrs.PlateCarree())

    plt.legend(loc='lower left', handles=[mpatches.Patch(color='white', label="<75 casualties"),
                                          mpatches.Patch(color='firebrick', label=">=75 casualties")])
    fig = plt.gcf()
    fig.set_size_inches((15,15))
    plt.show()
    
interactive_plot = interactive(map_world, year=IntSlider(value=1970, min=1970, max=2017,continuous_update=False))
display(interactive_plot)

### Interactive Folium Map
The two thousand most recent attacks are plotted on this map. Circle sizes correspond to the number of deaths in each attack.

In [None]:
terror_fol = terror.copy()
terror_fol.dropna(subset=['Latitude','Longitude'], inplace=True)
terror_fol.sort_index(ascending=False, inplace=True)
location_fol = terror_fol[['Latitude','Longitude']][:2000]
country_fol = terror_fol['Country'][:2000]
city_fol = terror_fol['City'][:2000]
year_fol = terror_fol['Year'][:2000]
month_fol = terror_fol['Month'][:2000]
day_fol = terror_fol['Day'][:2000]
killed_fol = terror_fol['Killed'][:2000]
wound_fol = terror_fol['Wounded'][:2000]

def color_point(x):
    # firebrick for 75+ deaths, navy for 1-74, green for zero or unknown
    if x >= 75:
        return 'firebrick'
    if x > 0:
        return 'navy'
    return 'green'

def point_size(x):
    # scale large attacks by the number of deaths; same threshold as color_point
    return x / 10 if x >= 75 else 2

map_fol = folium.Map(location=[30,0], tiles='OpenStreetMap', zoom_start=3,
                     min_zoom=3)
for point in location_fol.index:
    info = '<b>Country: </b>' + str(country_fol[point]) + \
           '<br><b>City: </b>' + str(city_fol[point]) + \
           '<br><b>Date: </b>' + str(year_fol[point]) + \
                           '-' + str(month_fol[point]) + \
                           '-' + str(day_fol[point]) + \
           '<br><b>Killed: </b>' + str(killed_fol[point]) + \
           '<br><b>Wounded: </b>' + str(wound_fol[point])
    iframe = folium.IFrame(html=info, width=200, height=200)
    folium.CircleMarker(list(location_fol.loc[point].values),
                       popup = folium.Popup(iframe),
                       radius = point_size(killed_fol[point]),
                       color = color_point(killed_fol[point]),
                       fill = True,
                       fillColor = 'salmon').add_to(map_fol)

map_fol

### Interactive Plotly Maps

The next three maps show different distributions of terrorist attacks. The first one shows total casualties in each country over the whole timeframe from 1970 to 2017. The second shows the number of attacks per 100 000 people in each country, and the third shows casualties per 100 000 people in each country. The second and third maps use yearly averages from 2010 onward.

Surprisingly enough, Ukraine, for example, seems to suffer more from terrorism than India: the relative number of attacks and the relative casualties are higher in Ukraine, although the absolute numbers are greater for India.

Norway surprised me as well, with relative numbers similar to those of Turkey and Egypt and more than ten times bigger than India's.
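
The relative figures only cover countries whose names match between the terrorism data and the population data, so it may be worth checking for names that fail to match. Here is a minimal sketch on toy frames (all values are invented) using the `indicator` option of `merge`:

```python
import pandas as pd

# Toy frames with invented values; only the naming mismatch matters here
attacks = pd.DataFrame({'Country': ['Russia', 'Iran', 'Norway'],
                        'Count': [10, 20, 1]})
people = pd.DataFrame({'Country': ['Russian Federation', 'Iran', 'Norway'],
                       'Population': [144_000_000, 80_000_000, 5_000_000]})

# how='left' keeps every attack row; `_merge` tells where each row was found
merged = attacks.merge(people, on='Country', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country'].tolist()

print(unmatched)
```

Any country that ends up in `unmatched` would silently drop out of an inner merge and therefore out of the per-capita maps.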

In [None]:
map_country_df = terror.groupby(by='Country', as_index=False)['Casualties'].sum()

data = dict(type='choropleth',
            locations = map_country_df['Country'],
            locationmode = 'country names', z = map_country_df['Casualties'],
            text = map_country_df['Country'], colorbar = {'title':'Casualties'},
            colorscale=[[0, 'aliceblue'],
                        [0.03, 'dodgerblue'],
                        [0.06, 'cadetblue'], [0.125, 'khaki'],
                        [0.25, 'gold'], [0.5, 'darkorange'],
                        [1, 'firebrick']],    
            reversescale = False)

layout = dict(title='World Casualties',
              geo = dict(showframe = True, projection={'type':'miller'}),
              autosize=False,
              width=750, height=600)

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)

In [None]:
data = dict(type='choropleth',
            locations = ten_years_df['Country'],
            locationmode = 'country names', z = ten_years_df['RelCount'],
            text = ten_years_df['Country'], colorbar = {'title':'RelCount'},
            colorscale=[[0, 'aliceblue'],
                        [0.003, 'dodgerblue'],
                        [0.007, 'cadetblue'], [0.02, 'khaki'],
                        [0.1, 'gold'], [0.5, 'darkorange'],
                        [1, 'firebrick']],    
            reversescale = False)

layout = dict(title='Terrorist Attacks per 100 000 people over the last 10 years',
              geo = dict(showframe = True, projection={'type':'miller'}),
              autosize=False,
              width=750, height=600)

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)

In [None]:
data = dict(type='choropleth',
            locations = ten_years_df['Country'],
            locationmode = 'country names', z = ten_years_df['RelCasualties'],
            text = ten_years_df['Country'], colorbar = {'title':'Rel. Casualties'},
            colorscale=[[0, 'aliceblue'], [0.0016, 'cadetblue'], 
                        [0.008, 'khaki'], [0.04, 'gold'], 
                        [0.2, 'darkorange'], [1, 'firebrick']],    
            reversescale = False)

layout = dict(title='Casualties per 100 000 people over the last 10 years',
              geo = dict(showframe = True, projection={'type':'miller'}),
              autosize=False,
              width=750, height=600)

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)

## Notorious Terrorist Groups <a id='Groups'></a> [[⇧](#TOC)]

In [None]:
# the first entry is skipped (in this dataset the most frequent label is 'Unknown')
top_groups = terror['Group'].value_counts()[1:15]
sns.barplot(x=top_groups.values, y=top_groups.index, palette='inferno')
plt.xticks(rotation=90)
fig = plt.gcf()
fig.set_size_inches(10, 8)
plt.title('Terrorist Groups with the Most Attacks')
plt.show()

In [None]:
top_groups10 = terror[terror['Group'].isin(terror['Group'].value_counts()[1:11].index)]
pd.crosstab(top_groups10['Year'],top_groups10['Group']).plot(color=sns.color_palette('Paired', 10))
fig = plt.gcf()
fig.set_size_inches(18,6)
plt.show()