# Predicting Terrorist Attacks
## Weapon Classification

**Author:** Thomas Skowronek

**Date:** April 07, 2018

### Notebook Configuration

In [262]:
import time
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import scale
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [263]:
# Display up to 150 rows and columns
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 150)

# Set the figure size for plots
mpl.rcParams['figure.figsize'] = (14.6, 9.0)

# Set the Seaborn default style for plots
sns.set()

# Set the color palette
sns.set_palette(sns.color_palette("muted"))

### Load the Datasets
Load the dataset created by the EDA notebook.

In [264]:
# Load the preprocessed GTD dataset
gtd_df = pd.read_csv('../data/gtd_eda_95t016.csv', low_memory=False, index_col = 0,
                      na_values=[''])

### Inspect the Structure
The cleansed data frame contains 48 attributes, one of which is used for the data frame index, and 110,844 observations.

In [265]:
# Display a summary of the data frame
gtd_df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110844 entries, 199501000001 to 201701270001
Data columns (total 48 columns):
iyear               110844 non-null int64
imonth              110844 non-null int64
iday                110844 non-null int64
country_txt         110844 non-null object
region_txt          110844 non-null object
provstate           110844 non-null object
city                110844 non-null object
latitude            110844 non-null float64
longitude           110844 non-null float64
specificity         110844 non-null float64
summary             110844 non-null object
attacktype1_txt     110844 non-null object
targtype1_txt       110844 non-null object
targsubtype1_txt    110844 non-null object
corp1               110844 non-null object
target1             110844 non-null object
natlty1_txt         110844 non-null object
gname               110844 non-null object
nperpcap            110844 non-null float64
weaptype1_txt       110844 non-null object
weapsubtype

### Convert Attributes to Correct Data Type
Convert a subset of the data frame attributes to categorical, datetime, and string types to align with the GTD codebook, as done previously in the EDA notebook.

In [266]:
# List of attributes that are categorical
cat_attrs = ['extended_txt', 'country_txt', 'region_txt', 'specificity', 'vicinity_txt',
             'crit1_txt', 'crit2_txt', 'crit3_txt', 'doubtterr_txt', 'multiple_txt',
             'success_txt', 'suicide_txt', 'attacktype1_txt', 'targtype1_txt', 
             'targsubtype1_txt', 'natlty1_txt', 'guncertain1_txt', 'individual_txt', 
             'claimed_txt', 'weaptype1_txt', 'weapsubtype1_txt', 'property_txt', 
             'ishostkid_txt', 'INT_LOG_txt', 'INT_IDEO_txt','INT_MISC_txt', 'INT_ANY_txt']

for cat in cat_attrs:
    gtd_df[cat] = gtd_df[cat].astype('category')

# Datetime feature added during EDA
gtd_df['incident_date'] = pd.to_datetime(gtd_df['incident_date'])

# Necessary for label encoding below
gtd_df['gname'] = gtd_df['gname'].astype('str')
    
gtd_df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110844 entries, 199501000001 to 201701270001
Data columns (total 48 columns):
iyear               110844 non-null int64
imonth              110844 non-null int64
iday                110844 non-null int64
country_txt         110844 non-null category
region_txt          110844 non-null category
provstate           110844 non-null object
city                110844 non-null object
latitude            110844 non-null float64
longitude           110844 non-null float64
specificity         110844 non-null category
summary             110844 non-null object
attacktype1_txt     110844 non-null category
targtype1_txt       110844 non-null category
targsubtype1_txt    110844 non-null category
corp1               110844 non-null object
target1             110844 non-null object
natlty1_txt         110844 non-null category
gname               110844 non-null object
nperpcap            110844 non-null float64
weaptype1_txt       110844 non-null categ
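
As a small illustration of what the `astype('category')` conversion does, the sketch below uses a made-up four-row frame; `toy_df` and its values are hypothetical and not taken from the GTD data. It shows that pandas stores each distinct label once and represents each row by an integer code.

```python
import pandas as pd

# Hypothetical toy frame standing in for one GTD text attribute
toy_df = pd.DataFrame(
    {'region_txt': ['South Asia', 'Middle East', 'South Asia', 'Africa']}
)

# Same conversion as applied to each entry of cat_attrs
toy_df['region_txt'] = toy_df['region_txt'].astype('category')

# Distinct labels are stored once; each row holds an integer code
toy_categories = list(toy_df['region_txt'].cat.categories)
toy_codes = list(toy_df['region_txt'].cat.codes)

print(toy_df['region_txt'].dtype)
print(toy_categories)
print(toy_codes)
```

The codes index into the category list, so a column with many repeated labels is typically more compact as a categorical than as an object column.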

### Find the Major Groups
Get the list of terrorist groups that have 100 or more attacks.

In [267]:
# Calculate the number of attacks by group
groups = gtd_df['gname'].value_counts()

# Include groups with at least 100 attacks
groups = groups[groups > 99]

# Exclude unknown groups
group_list = groups.index[groups.index != 'Unknown']

# Subset the data to major groups
major_groups = gtd_df[gtd_df['gname'].isin(group_list)]

# Display the number of attacks by group
major_groups['gname'].value_counts()

Taliban                                                        6558
Islamic State of Iraq and the Levant (ISIL)                    4261
Al-Shabaab                                                     2669
Boko Haram                                                     2067
Communist Party of India - Maoist (CPI-Maoist)                 1766
Revolutionary Armed Forces of Colombia (FARC)                  1529
New People's Army (NPA)                                        1444
Maoists                                                        1411
Kurdistan Workers' Party (PKK)                                 1255
Tehrik-i-Taliban Pakistan (TTP)                                1250
Al-Qaida in the Arabian Peninsula (AQAP)                        966
Liberation Tigers of Tamil Eelam (LTTE)                         950
Houthi extremists (Ansar Allah)                                 862
Al-Qaida in Iraq                                                633
Donetsk People's Republic                       
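
The filtering steps above can be sketched on a tiny made-up series. The names `toy_gname` and `min_attacks` are hypothetical, and the threshold of 2 is chosen only so the toy data has something to filter (the notebook uses 100).

```python
import pandas as pd

# Hypothetical group labels; 'Unknown' stands for unattributed attacks
toy_gname = pd.Series(
    ['A', 'A', 'A', 'B', 'B', 'Unknown', 'Unknown', 'Unknown', 'C']
)

# Illustrative threshold only; the notebook keeps groups with >= 100 attacks
min_attacks = 2

# Count attacks per group and keep the frequent ones
counts = toy_gname.value_counts()
frequent = counts[counts >= min_attacks]

# Drop the 'Unknown' label, then subset the original rows
keep = frequent.index[frequent.index != 'Unknown']
subset = toy_gname[toy_gname.isin(keep)]

print(subset.value_counts())
```

Filtering on the index of the counts, rather than on the rows directly, keeps the threshold and the exclusion of `'Unknown'` in one place.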

### Drop Text and Datetime Attributes
Remove the text and datetime attributes, which will not be used in the models.

In [268]:
# 'scite1' was listed twice in the original call; one entry is sufficient
major_groups = major_groups.drop(['provstate', 'city', 'corp1', 'target1', 'scite1',
                                  'dbsource', 'incident_date'], axis=1)

major_groups.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40323 entries, 199501020002 to 201612310043
Data columns (total 41 columns):
iyear               40323 non-null int64
imonth              40323 non-null int64
iday                40323 non-null int64
country_txt         40323 non-null category
region_txt          40323 non-null category
latitude            40323 non-null float64
longitude           40323 non-null float64
specificity         40323 non-null category
summary             40323 non-null object
attacktype1_txt     40323 non-null category
targtype1_txt       40323 non-null category
targsubtype1_txt    40323 non-null category
natlty1_txt         40323 non-null category
gname               40323 non-null object
nperpcap            40323 non-null float64
weaptype1_txt       40323 non-null category
weapsubtype1_txt    40323 non-null category
nkill               40323 non-null float64
nkillus             40323 non-null float64
nkillter            40323 non-null float64
nwound      

### Standardize the Numeric Attributes
Adjust for differences in the range of the numeric attributes.

In [269]:
scaler = preprocessing.StandardScaler()

# List of numeric attributes
scale_attrs = ['nperpcap', 'nkill', 'nkillus', 'nkillter', 'nwound', 'nwoundus', 'nwoundte']

# Standardize the attributes (zero mean, unit variance) in place
major_groups[scale_attrs] = scaler.fit_transform(major_groups[scale_attrs])

# View the transformation
major_groups[scale_attrs]

| eventid | nperpcap | nkill | nkillus | nkillter | nwound | nwoundus | nwoundte |
|---|---|---|---|---|---|---|---|
| 199501020002 | -0.079613 | -0.190562 | -0.028705 | 0.016505 | -0.157298 | -0.033921 | -0.098630 |
| 199501020008 | -0.079613 | 0.223322 | -0.028705 | -0.158771 | -0.212179 | -0.033921 | -0.098630 |
| 199501020009 | -0.079613 | 0.016380 | -0.028705 | -0.158771 | 0.007347 | -0.033921 | -0.098630 |
| 199501030003 | -0.079613 | -0.121581 | -0.028705 | 0.191781 | -0.212179 | -0.033921 | -0.098630 |
| 199501060002 | -0.079613 | -0.190562 | -0.028705 | -0.158771 | -0.212179 | -0.033921 | -0.098630 |
| 199501060003 | -0.079613 | -0.190562 | -0.028705 | -0.158771 | -0.212179 | -0.033921 | -0.098630 |
| 199501070006 | -0.079613 | -0.259543 | -0.028705 | -0.158771 | -0.157298 | -0.033921 | -0.098630 |
| 199501080002 | -0.079613 | -0.259543 | -0.028705 | -0.158771 | 0.007347 | -0.033921 | -0.098630 |
| 199501080003 | -0.079613 | -0.259543 | -0.028705 | -0.158771 | -0.212179 | -0.033921 | -0.098630 |
| 199501090004 | -0.079613 | -0.190562 | -0.028705 | -0.158771 | -0.212179 | -0.033921 | -0.098630 |
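
To make the transformation concrete, the following sketch checks `StandardScaler` against the z-score formula z = (x - mean) / std on a made-up column. The array `toy_nkill` is hypothetical. The comparison relies on `StandardScaler` using the population standard deviation, which matches the default of `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical casualty counts, one feature column
toy_nkill = np.array([[0.0], [1.0], [2.0], [5.0]])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(toy_nkill)

# Same computation by hand: subtract the column mean, divide by the
# population standard deviation (np.std defaults to ddof=0)
manual = (toy_nkill - toy_nkill.mean(axis=0)) / toy_nkill.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

After scaling, the column has mean 0 and standard deviation 1, which is why heavily skewed count attributes show many identical small negative values in the table above.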


### Encode the Target Attribute
Convert the terrorist group names to integer codes, which serve as the class labels for the random forest models.

In [270]:
# Create the encoder
le = preprocessing.LabelEncoder()

# Fit the encoder to the target
le.fit(major_groups['gname'])

LabelEncoder()

In [271]:
# View the labels
list(le.classes_)

['Abu Sayyaf Group (ASG)',
 'Al-Aqsa Martyrs Brigade',
 "Al-Gama'at al-Islamiyya (IG)",
 'Al-Nusrah Front',
 'Al-Qaida in Iraq',
 'Al-Qaida in the Arabian Peninsula (AQAP)',
 'Al-Qaida in the Islamic Maghreb (AQIM)',
 'Al-Shabaab',
 'Algerian Islamic Extremists',
 'Allied Democratic Forces (ADF)',
 'Armed Islamic Group (GIA)',
 'Baloch Liberation Army (BLA)',
 'Baloch Liberation Front (BLF)',
 'Baloch Republican Army (BRA)',
 'Bangsamoro Islamic Freedom Movement (BIFM)',
 'Barqa Province of the Islamic State',
 'Basque Fatherland and Freedom (ETA)',
 'Boko Haram',
 'Chechen Rebels',
 'Communist Party of India - Maoist (CPI-Maoist)',
 'Corsican National Liberation Front (FLNC)',
 "Donetsk People's Republic",
 'Free Aceh Movement (GAM)',
 'Free Syrian Army',
 'Fulani extremists',
 'Garo National Liberation Army',
 'Gunmen',
 'Hamas (Islamic Resistance Movement)',
 'Hezbollah',
 'Hizbul Mujahideen (HM)',
 'Houthi extremists (Ansar Allah)',
 'Hutu extremists',
 'Islamic State of Iraq (ISI)

In [272]:
# View the encoded values for the terrorist group names
label_codes = le.transform(major_groups['gname'])
label_codes

array([27,  2,  2, ..., 62, 17, 42])

In [273]:
# Convert some integers into their category names
list(le.inverse_transform([0, 1, 2, 27]))

['Abu Sayyaf Group (ASG)',
 'Al-Aqsa Martyrs Brigade',
 "Al-Gama'at al-Islamiyya (IG)",
 'Hamas (Islamic Resistance Movement)']
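
The encoder's behaviour can be seen on a short made-up list. The variable names below are hypothetical. The sketch shows that `LabelEncoder` assigns codes in sorted order of the class names and that `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical short list of group names (with one repeat)
toy_names = ['Taliban', 'Boko Haram', 'Al-Shabaab', 'Taliban']

toy_le = LabelEncoder()

# fit_transform learns the sorted class list and encodes in one step
toy_label_codes = toy_le.fit_transform(toy_names)

# classes_ holds the names in sorted order; position == integer code
toy_classes = list(toy_le.classes_)

# Round-trip back to the original names
toy_decoded = list(toy_le.inverse_transform(toy_label_codes))

print(toy_classes)
print(list(toy_label_codes))
print(toy_decoded)
```

The integer codes carry no ordinal meaning; they are only identifiers, which is acceptable here because they are used as class labels rather than as predictor values.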

### Create Training and Testing Datasets
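The major-groups dataset is split into 80% training and 20% testing data. The split is stratified on the group label so that both sets preserve the class proportions.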

In [274]:
# Seed for reproducible results
seed = 1009

# Predictor variables
X = pd.get_dummies(major_groups.drop(['gname'], axis=1), drop_first=True)

# Labels
y = label_codes

# Create an 80/20 split for training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = seed, stratify = y)
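
A minimal sketch of the two steps in the cell above, on made-up data: one-hot encoding with `drop_first=True`, then a stratified 80/20 split. The frame `toy`, its `weapon` column, and the 80/20 class balance are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical predictors: one numeric and one text attribute (100 rows)
toy = pd.DataFrame({
    'nkill': range(100),
    'weapon': ['Explosives', 'Firearms', 'Melee', 'Firearms'] * 25,
})

# drop_first=True drops the first level of each encoded attribute,
# so 3 weapon levels become 2 indicator columns
X_toy = pd.get_dummies(toy, drop_first=True)
toy_columns = list(X_toy.columns)

# Hypothetical imbalanced labels: 80 of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1009, stratify=y_toy
)

train_share = y_tr.mean()  # share of class 1 in the training set
test_share = y_te.mean()   # share of class 1 in the test set

print(toy_columns)
print(train_share, test_share)
```

Stratification matters most for the smaller groups: without it, a group with few attacks could be under-represented or missing in the test set.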

### Initial Random Forest Model
Create the initial model using 100 estimators (decision trees), all available CPU cores (`n_jobs = -1`), and the fixed seed for reproducibility.

In [275]:
# Create the model
rf1 = RandomForestClassifier(n_estimators = 100, n_jobs = -1, random_state = seed)

# Fit it to the training data
rf1.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=1009, verbose=0,
            warm_start=False)

In [276]:
# Predict labels on the test dataset
pred_labels1 = rf1.predict(X_test)

# Calculate the accuracy of the model
score1 = accuracy_score(y_test, pred_labels1)
print("\nAccuracy: {}".format(score1))


Accuracy: 0.9168009919404836
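
Accuracy is the fraction of test observations whose predicted label equals the true label. Because the group sizes in this dataset are uneven, one possible complement (not part of the original notebook) is balanced accuracy, which averages recall over the classes. The sketch below uses made-up labels; `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: class 0 dominates, class 1 is rare
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Accuracy as computed in the notebook
acc = accuracy_score(y_true_toy, y_pred_toy)

# Equivalent by hand: share of matching positions
manual_acc = np.mean(y_true_toy == y_pred_toy)

# Mean of per-class recall: class 0 recall is 8/8, class 1 recall is 1/2
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)

print(acc, manual_acc, bal_acc)
```

In this toy case, 9 of 10 predictions are correct, yet half of the rare class is misclassified, which the balanced figure makes visible.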


### References

Albon, C. (2017). Convert Pandas categorical data for scikit-learn. Retrieved from https://chrisalbon.com/machine_learning/preprocessing_structured_data/convert_pandas_categorical_column_into_integers_for_scikit-learn/

Keen, B. (2017). Feature scaling with scikit-learn. Retrieved from http://benalexkeen.com/feature-scaling-with-scikit-learn/