## The descriptives in this kernel are drawn from:

### Kaggle Kernels:
* [https://www.kaggle.com/unstablematter/fork-of-global-terrorism-1970-2016](https://www.kaggle.com/unstablematter/fork-of-global-terrorism-1970-2016)
* [https://www.kaggle.com/jmanders/jads-happy-pears-eda](https://www.kaggle.com/jmanders/jads-happy-pears-eda)
* [https://www.kaggle.com/jmanders/the-happy-pears-predicting-terrorism-casualties](https://www.kaggle.com/jmanders/the-happy-pears-predicting-terrorism-casualties)

### Local Kernel of Daan:
* ![Local Kernel of Daan](https://i.imgur.com/iyea2PX.png)

In [None]:
import numpy as np
import pandas as pd


# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.basemap import Basemap
import seaborn as sns

# Prediction
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import itertools

# Classifiers
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.dummy import DummyClassifier # Validation

# Data prepping
from sklearn.preprocessing import LabelEncoder

print("Libraries imported.")

In [None]:
data = pd.read_csv('../input/gtd/globalterrorismdb_0617dist.csv', encoding='ISO-8859-1')
all_data = data.copy()  # Independent copy of the full dataset, untouched by the cleaning below.
print("Data loaded.")

## 2.2. Data selection and description
The exploratory data analysis examines the main characteristics of the dataset with respect to the research questions. The characteristics found there, such as trends, correlations and potential causal relations, will be used to predict future events.
For a more detailed description of the variables, consult [the codebook](http://start.umd.edu/gtd/downloads/Codebook.pdf).

In [None]:
data_columns = [
    
    'eventid', # Unique ID for a row. No analytic or predictive power, but used in some plotting functions.
    
    # ===== Spatio-Temporal Variables =====
    # The names of these variables speak for themselves;
    # where in time and space was the act of terrorism committed?
                'iyear', 'imonth', 'iday', 'latitude', 'longitude',
    
    # ===== Binary Variables (1 -> yes or 0 -> no) ===== 
                'extended', # Did the duration of the incident exceed 24 hours?
                'vicinity', # Did the incident occur in the immediate vicinity of the city? Is 0 for IN city.
                'crit1', 'crit2', 'crit3', # The incident meets the criterion (1, 2, 3), described in the introduction.
                'doubtterr', # Is there doubt as to whether the attack is an act of terrorism?
                'multiple', # Is this incident connected to other incident(s)? !! Consistently available since 1997 !!
                'success', # Has the attack reached its goal? Depends on type of attack.
                'suicide', # Was it a suicide attack, i.e. the perpetrator did not intend to escape alive?
                'claimed', # Was the attack claimed by an organised group?
                'property', # Is there evidence of property damage from the incident?
                'ishostkid', # Were there victims taken hostage or kidnapped?
    
    # ===== Continuous Variables =====
                'nkill', # Number of confirmed fatalities.
                'nwound', # Number of confirmed wounded.
    
    # ===== Categorical variables =====
                'country_txt', # Name of country.
                'region', # Region id.
                'region_txt', # Name of region.
                'attacktype1_txt', # Of what type was the attack? E.g. assassination, bombing or kidnapping.
                'targtype1_txt', # What target did the attack have? E.g. business, government or police.
                'natlty1_txt', # Nationality of the target.
                'weaptype1_txt', # What weapon was used?
    
    # ===== Descriptive Variables =====
                'target1', # Description of specific target, if applicable.
                'gname', # Name of the organized group, if applicable.
                'summary', # Summary of the attack.
    
]

In [None]:
data = data.loc[:, data_columns] # Only keep described columns.

# Random acts of violence and other outliers should not be part of the data.
# Thus, restrict the set to only those attacks where the terrorism motive is certain.
# .copy() makes the filtered frame independent, so the assignments below do not act on a view.
data = data[(data.crit1 == 1) & (data.crit2 == 1) & (data.crit3 == 1) & (data.doubtterr == 0)].copy()

# Weapontype column contains very long name for vehicle property -> shorten.
# Assign the result back: an inplace replace on a single column may act on a temporary copy.
data['weaptype1_txt'] = data['weaptype1_txt'].replace(
    'Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)',
    'Vehicle')

# Replace -9 (unknown) values with 0 (no). -9 values are much more likely to be false than true.
# Columns are addressed by name; these are the columns at positions 6, 15, 16 and 17.
unknown_cols = ['extended', 'claimed', 'property', 'ishostkid']
data[unknown_cols] = data[unknown_cols].replace(-9, 0)

# Some values in the claimed category are 2 (should be 0 or 1).
# Assume these were input mistakes and set 2 to 1.
data['claimed'] = data['claimed'].replace(2, 1)

# Ensure consistent values and make everything lowercase.
data.target1 = data.target1.str.lower()
data.gname = data.gname.str.lower()
data.summary = data.summary.str.lower()    
data.target1 = data.target1.fillna('unknown').replace('unk','unknown')

# Some nwound and nkill are NaN. Replace them with median.
data.nkill = np.round(data.nkill.fillna(data.nkill.median())).astype(int) 
data.nwound = np.round(data.nwound.fillna(data.nwound.median())).astype(int) 

# Database only reports victims as nkill and nwound. Combine these into ncasualties column.
# Also add has_casualties column.
data['ncasualties'] = data['nkill'] + data['nwound']
data['has_casualties'] = (data['ncasualties'] != 0).astype(int)

print("Data cleaned and prepared.")

# 3. Data exploration

## 3.1. Has the number of attacks increased during recent years?

In [None]:
barplot = data['iyear'].value_counts()\
.sort_index()\
.plot\
.bar(width=0.8, figsize=(10, 10), title="Number of terrorist attacks per year")

Looking at the graph, one might come to the shocking conclusion that the number of terrorist attacks has increased drastically during the last five years. However, it is important to take into account that data collection has become more effective since 2012.

To quote Michael Jensen, START, November 25, 2013: "*While there is no simple answer to this question, what is certain is that by the start of the 2012 collection effort, the staff working on the GTD had become better than ever at identifying terrorist attacks, regardless of where they happened to occur.*"

This implies that part of the increase in recorded attacks may be due to improved data collection rather than to an actual rise in terrorism. The same [article](http://www.start.umd.edu/news/discussion-point-benefits-and-drawbacks-methodological-advancements-data-collection-and-coding) puts this statement into perspective: “With that said, the GTD team believes that some portion of the observable increase in terrorist activity since 2011 is the result of new advancements in collection methodology.” In conclusion, results from this data analysis should be interpreted with care.
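
One way to respect this caveat is to read trends within each collection regime separately instead of across the 2012 break. The sketch below is only an illustration: the yearly counts are made up, and `METHOD_CHANGE_YEAR` is a hypothetical name for the 2012 cut-off mentioned in the quote.

```python
import pandas as pd

# Toy yearly attack counts (made-up numbers, for illustration only).
yearly = pd.Series(
    [100, 120, 110, 300, 330, 310],
    index=[2009, 2010, 2011, 2012, 2013, 2014],
    name="attacks",
)

METHOD_CHANGE_YEAR = 2012  # hypothetical constant: start of the new collection effort

# Label each year by collection regime and summarise within each regime.
regime = pd.Series(
    ["pre-2012" if year < METHOD_CHANGE_YEAR else "2012 onwards" for year in yearly.index],
    index=yearly.index,
    name="regime",
)
summary = yearly.groupby(regime).agg(["mean", "min", "max"])
print(summary)
```

Comparing the two rows of `summary` shows the size of the jump at the break, while any trend statement is made inside one regime only.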

## 3.2. Are attacks becoming more successful?

In [None]:
region_dictionary = {1: 'North America', 2: 'Central America & Caribbean', 3: 'South America',
                     4: 'East Asia', 5: 'Southeast Asia', 6: 'South Asia', 7: 'Central Asia',
                     8: 'Western Europe', 9: 'Eastern Europe', 10: 'Middle East and North Africa',
                     11: 'Sub-Saharan Africa', 12: 'Australasia and Oceania'}

def multi_graph(result, result_list, xmin, xmax, ymin, ymax):
    fig2, ax2 = plt.subplots(figsize = (15,8))
    # result_list holds one frame per region, in region-id order (1 to 12).
    for number, j in enumerate(result_list, start=1):
        ax2.plot(j.index, j.eventid, label = region_dictionary[number])

    plt.xlim([xmin,xmax])
    plt.ylim([ymin,ymax])
    plt.xlabel('year')
    plt.ylabel('number of attacks')
    plt.title(result)
    ax2.legend(loc = 'center', frameon = True, edgecolor = 'black',bbox_to_anchor =(1.2,0.4))


success_list = []
failure_list = []

for i in region_dictionary:
    region_data = data[(data.region == i)]
    region_data_success = region_data[(region_data.success == 1)]
    region_data_failure = region_data[(region_data.success == 0)]
    region_grouped_success = region_data_success.groupby('iyear').count()
    region_grouped_failure = region_data_failure.groupby('iyear').count()

    
    success_list.append(region_grouped_success)
    failure_list.append(region_grouped_failure)

multi_graph('Successes',success_list, 1970, 2011, 0, 2100)
multi_graph('Successes',success_list, 2012, 2016, 0, 6500)
multi_graph('Failures',failure_list, 1970, 2011, 0, 200)
multi_graph('Failures',failure_list, 2012, 2016, 0, 1300)

Immediately noticeable is the drop in both successful and failed attacks in 1998. This phenomenon is shared by all regions and should be investigated more closely. During the last 5 years (2011-2016) there is no clear increase in attacks, except in North America and to some extent in South Asia. The strong increase in failed attacks in North America could be due to the stricter security measures taken after 9/11. At the same time, the number of successful attacks increased as well, and it has been declining since 2014 in both North America and South Asia. Some regions, e.g. in Asia and Africa, display a strong increase in terrorist attacks starting around 2005. The question is whether this is a consequence of better documentation and communication, or whether the number of attacks has actually increased.

In conclusion, there is no worldwide increase in either successful or failed attacks in recent years (from 2012 onwards); from 2005 to 2011, however, there was a strong increase.
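
The plots above show absolute counts. To judge whether attacks are becoming more successful, the share of successful attacks per region and year is a useful complement. The following is a minimal sketch on a made-up toy frame; only the column names `iyear`, `region` and `success` are taken from this kernel.

```python
import pandas as pd

# Toy incidents (made-up); columns mirror the GTD names used in this kernel.
toy = pd.DataFrame({
    "iyear":   [2010, 2010, 2010, 2011, 2011, 2011, 2011],
    "region":  [6, 6, 10, 6, 6, 10, 10],
    "success": [1, 0, 1, 1, 1, 1, 0],
})

# Because success is coded 0/1, its mean within a group is the success rate.
success_rate = (
    toy.groupby(["region", "iyear"])["success"]
       .mean()
       .unstack("region")   # rows: years, columns: region ids
)
print(success_rate)
```

A rate is less sensitive to changes in how many incidents get recorded than a raw count, although it is still affected if successful and failed attacks are not documented equally well.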

## 3.3. How do successful attacks compare to failed attacks?

In [None]:
def generate_graph(by_region_list):
    fig = plt.figure(figsize=(20,70))
    i = 1
    
    for element in by_region_list:
        ax1 = fig.add_subplot(11,2,i)
        ax1.set(title = '#Attacks region %s ' % region_dictionary[element[2]],
                ylabel = 'Attack count', xlabel = 'year')

        #entering data
        ax1.plot(element[0].index, element[0].eventid, label = 'Successful attacks' )
        ax1.plot(element[1].index, element[1].eventid, label = 'Failed attacks' )
        
        i+=1
    
    #add legend
    ax1.legend(loc = 'upper center', frameon = True, edgecolor = 'black', bbox_to_anchor =(-0.1,-0.4))
    plt.show()  


def by_region():
    by_region_list = []
    for region_number in region_dictionary:
        region_data = data[(data.region == region_number)] #select the rows of this region
        region_grouped_success = region_data[(region_data.success == 1)].groupby('iyear').count() #filter on success and group by year
        region_grouped_failure = region_data[(region_data.success == 0)].groupby('iyear').count() #filter on failure and group by year

        by_region_list.append([region_grouped_success, region_grouped_failure, region_number])

    #create line plot for region grouped by year
    generate_graph(by_region_list)

by_region()

In the following regions, successful attacks clearly outnumber failed attacks, while the number of failed attacks stays roughly constant:
* During a limited period:
 - Central America & Caribbean (1978-1999)
 - South America (1980-1993)
* Increasing over the whole period:
 - Southeast Asia
 - South Asia
 - Eastern Europe
 - Middle East and North Africa
 - Sub-Saharan Africa
* Alternating, without a clear pattern:
 - The remaining regions

The two regions with a markedly higher number of successful attacks during a limited period probably owe this to a conflict between groups in that region. The regions with a steadily increasing number of successful attacks may owe this to increased communication with Asia and with Russia's sphere of influence.


## 3.4. Where do the terrorist attacks take place?

In [None]:
orange_palette = ((3, 0, '#FBBC00', '1 - 20'), (4, 20, '#FDA600', '21 - 50'), (5, 50, '#EE8904', '51 - 100'), \
                  (7, 100, '#ED9001', '101 - 250'), (9, 250, '#ED6210', '251 - 600'), \
                  (11, 600, '#DE6D0A', '601 - 1000'), (13, 1000, '#D8510F', '1001 - 2000'), \
                  (15, 2000, '#D23711', '2001 - 4000'), (18, 4000, '#F61119', '4001 - 7500'), \
                  (30, 7500, '#9C200A', '7501 - ∞')) #marker size, count size, color

plt.figure(figsize=(15,15))
# Rounds the longitude and latitude to whole degrees, groups on the rounded coordinates and counts the number of attacks.
df_coords = data.round({'longitude':0, 'latitude':0}).groupby(["longitude", "latitude"]).size().to_frame(name = 'count').reset_index()
m = Basemap(projection='mill',llcrnrlat=-80,urcrnrlat=80, llcrnrlon=-180,urcrnrlon=180,lat_ts=20,resolution='c')
m.drawcoastlines()
m.shadedrelief()
    
def plot_points(marker_size, count_size, colour, label_count):
    x, y = m(list(df_coords.longitude[df_coords['count'] >= count_size].astype("float")),\
                (list(df_coords.latitude[df_coords['count'] >= count_size].astype("float"))))
    points = m.plot(x, y, "o", markersize = marker_size, color = colour, label = label_count, alpha = .5)

for p in orange_palette:
    plot_points(p[0], p[1], p[2], p[3]) 
    
plt.title("Amount of terrorist attacks per rounded coordinates", fontsize=24)
plt.legend(title= 'Colour per counted attack', loc ='lower left', prop= {'size':11})
plt.show()

In [None]:
# Recent seaborn versions renamed `size` to `height` and dropped `stat_func`.
g = sns.jointplot(x='longitude', y='latitude', data=df_coords, kind="hex", color="#4CB391", height=15, edgecolor="#EAEAF2", linewidth=.2)
g.fig.suptitle('Amount of terrorist attacks per rounded coordinates')

The images above show where the documented attacks are concentrated. The world map gives a clearer view of where in the world most attacks occur, while the hexbin plot substantiates this and adds histograms for both longitude and latitude to indicate attack intensity.

The world map shows which countries are most troubled by terrorism. From the map it can be concluded that the Middle East is the most affected region. The countries in the Middle East that are often affected by terrorism are Iraq, Iran and Syria. Other countries that have suffered heavily from terrorism over the years are India, Pakistan and Ireland.

The hexbin plot shows for which ranges of longitude and latitude attacks are most common: longitude mainly between 5 and 40, and latitude roughly between 12 and 35.
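
Ranges like these can also be read off numerically instead of by eye. The sketch below, on made-up coordinates, uses quantiles to find the band that contains the central 80% of attacks on each axis; the 10% and 90% cut-offs are an arbitrary choice for illustration.

```python
import pandas as pd

# Toy coordinates (made-up), one row per attack.
coords = pd.DataFrame({
    "longitude": [44.0, 45.0, 36.0, 70.0, -74.0, 10.0, 33.0, 35.0],
    "latitude":  [33.0, 34.0, 34.0, 30.0, 40.0, 50.0, 15.0, 31.0],
})

# The 10th-90th percentile range covers the central 80% of attacks on each axis.
central_range = coords.quantile([0.10, 0.90])
print(central_range)
```

Applied to the unrounded `longitude` and `latitude` columns of `data`, each attack would count once, whereas `df_coords` holds one row per rounded coordinate pair.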

## 3.5. Which attack types are most common?

In [None]:
plt.figure(figsize=(9,7))
ax = sns.countplot(y="attacktype1_txt", data=data)
ax.set_xlabel("Amount of attacks")
ax.set_ylabel("Attack type")

The graph shows that bombings/explosions are by far the most common attack type, followed by armed assault and, surprisingly, assassination. An advantage of explosives, from the attacker's point of view, is that they cause a lot of fear as well as much damage, to both property and people.
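
To put a number on "by far", the counts can be turned into shares. A minimal sketch with a made-up series of attack types:

```python
import pandas as pd

# Toy attack types (made-up proportions, for illustration only).
attack_types = pd.Series(
    ["Bombing/Explosion"] * 5 + ["Armed Assault"] * 3 + ["Assassination"] * 2,
    name="attacktype1_txt",
)

# normalize=True returns fractions of the total instead of raw counts.
shares = attack_types.value_counts(normalize=True)
print(shares.round(2))
```

On the real data the same call would be `data['attacktype1_txt'].value_counts(normalize=True)`.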

## 3.6. Do attacks display seasonality?

In [None]:
df_day_coords = data[['imonth', 'iday', 'longitude', 'latitude', 'success']].copy()
df_day_coords = df_day_coords[df_day_coords['iday'] != 0]

fig, axs = plt.subplots(nrows=12)
fig.set_size_inches(15, 100, forward=True)

for i in range(1,13):
    monthly_data = df_day_coords[df_day_coords['imonth'] == i]
    sns.countplot(x="iday", data=monthly_data, hue="success", ax=axs[i-1])
    axs[i-1].set_xlabel('Day of the month')
    axs[i-1].set_ylabel('Amount of terrorist attacks')

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
sns.countplot(x="iday", data=df_day_coords, ax=ax, palette=sns.cubehelix_palette(15, start=.3, rot=.3))
ax.set_xlabel('Day of the month')
ax.set_ylabel('Amount of terrorist attacks')

In [None]:
df_day_coords.groupby("iday")["imonth"].count().mean()

The total number of attacks per day of the month does not display clear seasonality. The peak on the 15th of the month stands out, and the first of the month is also above average.
By contrast, the count for the end of the month is considerably lower than for the other days. There is a perfectly logical explanation: some months have 30 days and others 31, and February (mostly) has only 28.

Looking at the separate months (January - December), the higher count on the 15th matches the overall graph. The last graph, for December, shows a considerably lower number of attacks overall, as does the graph for September to some extent.
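
The month-length effect can be removed by dividing each day's count by the number of times that day number actually occurs in the calendar. The sketch below assumes the data spans 1970 to 2016 without gaps, which may not hold for the real dataset; the raw counts are made up and `day_occurrences` is a hypothetical helper.

```python
import calendar
from collections import Counter

def day_occurrences(first_year, last_year):
    """Count how often each day-of-month number occurs in the given years."""
    occurrences = Counter()
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            days_in_month = calendar.monthrange(year, month)[1]
            for day in range(1, days_in_month + 1):
                occurrences[day] += 1
    return occurrences

occ = day_occurrences(1970, 2016)

# Toy raw counts per day of the month (made-up numbers).
raw_counts = {1: 5640, 15: 6204, 30: 5170, 31: 3290}

# Attacks per calendar day on which that day number exists.
per_calendar_day = {day: raw_counts[day] / occ[day] for day in raw_counts}
print(per_calendar_day)
```

In this toy example the 31st looks much lower in raw counts, but after normalising it is on par with the 1st and the 30th, while the 15th remains above the rest.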


## 3.7. What are popular targets?

In [None]:
# Targets per region
df_rt = all_data.groupby(["region", "targtype1", "targtype1_txt"]).size().to_frame(name = 'counted_target_types').reset_index()

fig, axes = plt.subplots(nrows=6, ncols=2)
fig.set_size_inches(30, 100)
# One pie per region; axes.flat walks the 6x2 grid row by row (regions 1 to 12).
for k, ax in enumerate(axes.flat, start=1):
    region_targets = df_rt[df_rt.region == k]
    # Label each slice with its own target type, so labels cannot drift from the data.
    ax.pie(region_targets["counted_target_types"], labels=region_targets["targtype1_txt"],
           autopct='%1.1f%%', startangle=90)
    ax.set_title(region_dictionary[k])

The following target types account for about 50% of the targets of attacks in each region:
* Business
* Government (general)
* Police
* Military

This shows that government services and businesses are targeted more often than public facilities, private citizens and other targets. This holds for all regions, although the distribution among the most common target types differs per region.
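
The "about 50%" figure can be checked per region by computing the share of attacks aimed at one of the four listed target types. A minimal sketch on a made-up frame; the exact label strings in `main_types` are assumptions and should be compared with the values in `targtype1_txt`.

```python
import pandas as pd

# Toy incidents (made-up); target labels are assumed spellings.
toy = pd.DataFrame({
    "region": [1, 1, 1, 1, 2, 2, 2, 2],
    "targtype1_txt": [
        "Business", "Police", "Private Citizens & Property", "Business",
        "Military", "Educational Institution", "Private Citizens & Property", "Police",
    ],
})
main_types = ["Business", "Government (General)", "Police", "Military"]

# Mean of a boolean column = fraction of rows where it is True.
main_share = (
    toy["targtype1_txt"].isin(main_types)
       .groupby(toy["region"])
       .mean()
)
print(main_share)
```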


## 3.8. Are there many targeted attacks, that is, attacks aimed at specific persons rather than groups?

In [None]:
assassinations = all_data[all_data.attacktype1 == 1]
vc = assassinations['nkill'].value_counts()

# value_counts() sorts by frequency; sort by number of kills so slices and labels line up.
vc_common = vc[vc > 40].sort_index()
vc_common.plot.pie(
    title="Grouped number of kills for assassinations",
    labels=None,
    figsize=(10, 10),
    autopct="%.2f pct"
)
labels = [str(int(n)) for n in vc_common.index]
plt.ylabel('Number of killed')
plt.legend(labels)

Assassinations are generally targeted at one specific person. Therefore, the idea arose that assassinations should be excluded from the predictive analysis. The graph shows that the majority of assassinations result in zero kills. About one-fifth yield one (expected) kill, and in about twenty percent of the cases there are two or more. Although the number of assassinations with more than one kill might be negligible, these attacks can still satisfy the three criteria that define terrorism. For this reason, assassinations will not be excluded from the predictive dataset.

## 3.9. How does the number of attacks compare to the population density?

In [None]:
#load population density database
popdens = pd.read_csv('../input/world-population/API_EN.POP.DNST_DS2_en_csv_v2.csv', skiprows=[0,1,2], index_col='Country Name')
popdens.shape

In [None]:
#Transpose table to make plotting density over the years easier
popdensT = popdens.drop(popdens.columns[[0,1,2,3,59,60]], axis=1)
popdensT = popdensT.T

In [None]:
popdensT.Pakistan.plot(legend=True)

In [None]:
#plot terrorist attack data per country to compare with population density:
#plot specific attacktype for a specific country
#example: bombings in Pakistan
country_specific_attacks = all_data[(all_data.country_txt == 'Pakistan')]
country_specific_attacks = country_specific_attacks[(country_specific_attacks.attacktype1_txt == ('Bombing/Explosion'))]
country_specific_attacks = country_specific_attacks.set_index('country_txt')
plot_spec_att = country_specific_attacks.groupby(['iyear','country_txt']).size().reset_index(name="Count")
plot_spec_att.groupby('country_txt').plot(x='iyear', y='Count', legend=False)

Unfortunately, population density can at this moment only be compared per individual country. To determine whether there is a correlation between population density and the number of terrorist attacks over the years, the two data sets have to be merged. Until then, the two separate plots can provide some insight into specific cases.

Many of the countries examined display an increase in both population density and number of attacks. However, we cannot conclude that this is a causal relation; further data is required to answer the question.
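
The merge described above could look roughly as follows. This is a sketch on made-up numbers, not the actual analysis: it assumes the density table has one row per country and one column per year (as in `popdens`), and that country names match between the two sources, which would need checking on the real data. The names `density_wide`, `attacks` and `n_attacks` are hypothetical.

```python
import pandas as pd

# Toy World Bank-style table: one row per country, one column per year (made-up values).
density_wide = pd.DataFrame(
    {"2014": [240.0, 430.0], "2015": [245.0, 435.0], "2016": [250.0, 441.0]},
    index=pd.Index(["Pakistan", "India"], name="Country Name"),
)

# Toy attack counts per country and year (made-up values).
attacks = pd.DataFrame({
    "country_txt": ["Pakistan"] * 3 + ["India"] * 3,
    "iyear": [2014, 2015, 2016] * 2,
    "n_attacks": [30, 25, 20, 10, 12, 15],
})

# Wide -> long: one row per (country, year), with year as an integer.
density_long = (
    density_wide.reset_index()
    .melt(id_vars="Country Name", var_name="iyear", value_name="density")
    .astype({"iyear": int})
    .rename(columns={"Country Name": "country_txt"})
)

# Inner merge keeps only country-year pairs present in both sources.
merged = attacks.merge(density_long, on=["country_txt", "iyear"], how="inner")

# Pearson correlation between density and attack count, per country.
corr = {
    country: group["density"].corr(group["n_attacks"])
    for country, group in merged.groupby("country_txt")
}
print(corr)
```

Since population density rises almost monotonically in most countries, such a correlation largely measures whether attacks trend up or down over time, which is one more reason not to read it as causal.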

