# Predicting Terrorist Attacks
## Data Preprocessing

**Author:** Thomas Skowronek

**Date:** March 20, 2018

### Notebook Configuration

In [60]:
import pandas as pd

In [61]:
# Configure notebook output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Number of rows and columns
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 150)

### Load the Datasets
For this project, the two most recent datasets are imported. The first covers the years 1995 to 2012, and the second spans 2013 to 2016.

In [62]:
# Load 1995-2012 GTD
gtd_df1 = pd.read_csv('../data/gtd_95to12_0617dist.csv', low_memory=False, index_col = 0,
                      na_values=[''])

# Load 2013-2016 GTD
gtd_df2 = pd.read_csv('../data/gtd_13to16_0617dist.csv', low_memory=False, index_col = 0, 
                      na_values=[''])

# Append the second data frame to the first
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
gtd_df = pd.concat([gtd_df1, gtd_df2])
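
After stacking two files row-wise, it can be worth confirming that no rows were lost and that no index value appears in both files. The following is a minimal sketch on toy data. The frames `toy_df1` and `toy_df2` and their event IDs are hypothetical stand-ins for the two GTD files, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-ins for the two GTD files, indexed by an event ID
toy_df1 = pd.DataFrame({"iyear": [1995, 2012]},
                       index=[199501000001, 201212310001])
toy_df2 = pd.DataFrame({"iyear": [2013, 2016]},
                       index=[201301010001, 201612310001])

# Stack the rows; the existing index values are kept unchanged
toy_df = pd.concat([toy_df1, toy_df2])

# Sanity checks: no rows lost, and no index value shared between files
rows_match = len(toy_df) == len(toy_df1) + len(toy_df2)
index_unique = toy_df.index.is_unique
```

If `index_unique` came back `False` on the real data, the duplicated IDs would need to be inspected before any further processing.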

### Inspect the Structure
The combined data frame contains 112,251 observations and 135 attributes, one of which serves as the data frame index, leaving 134 data columns.

In [63]:
# Display a summary of the data frame
gtd_df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112251 entries, 199501000001 to 201701270001
Data columns (total 134 columns):
iyear                 int64
imonth                int64
iday                  int64
approxdate            object
extended              int64
resolution            object
country               int64
country_txt           object
region                int64
region_txt            object
provstate             object
city                  object
latitude              float64
longitude             float64
specificity           float64
vicinity              int64
location              object
summary               object
crit1                 int64
crit2                 int64
crit3                 int64
doubtterr             int64
alternative           float64
alternative_txt       object
multiple              int64
success               int64
suicide               int64
attacktype1           int64
attacktype1_txt       object
attacktype2           float64
attacktype2

### View Missing Data
Calculate the number and percentage of null values for each attribute. As the results show, many attributes are missing more than 50% of their values.

In [64]:
# Check the number of missing values in each attribute
count = gtd_df.isnull().sum()
percent = round(count / len(gtd_df) * 100, 2)
series = [count, percent]
result = pd.concat(series, axis=1, keys=['Count','Percent'])
result.sort_values(by='Count', ascending=False)

| Attribute | Count | Percent |
|---|---|---|
| weaptype4_txt | 112245 | 99.99 |
| weaptype4 | 112245 | 99.99 |
| weapsubtype4_txt | 112244 | 99.99 |
| weapsubtype4 | 112244 | 99.99 |
| gsubname3 | 112238 | 99.99 |
| claimmode3 | 112139 | 99.9 |
| claimmode3_txt | 112139 | 99.9 |
| gsubname2 | 112103 | 99.87 |
| divert | 112092 | 99.86 |
| guncertain3 | 111997 | 99.77 |
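
The same count-and-percent summary can be tried on a small frame where the answer is easy to verify by hand. This sketch assumes a hypothetical four-row `toy_df`; the column names are borrowed from the GTD only for readability, and the values are invented. It uses `isnull().mean()` so that the row count never has to be hard-coded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: values are invented for illustration
toy_df = pd.DataFrame({
    "nkill": [1.0, np.nan, 0.0, 2.0],          # 1 of 4 missing
    "divert": [np.nan, np.nan, np.nan, "X"],   # 3 of 4 missing
    "iyear": [1995, 1996, 1997, 1998],         # none missing
})

# Count of nulls per column
missing_count = toy_df.isnull().sum()

# Mean of the boolean mask is the missing fraction; scale to percent
missing_percent = (toy_df.isnull().mean() * 100).round(2)

# Combine into one table, most-missing attributes first
missing_summary = pd.concat(
    [missing_count, missing_percent], axis=1, keys=["Count", "Percent"]
).sort_values(by="Count", ascending=False)
```

Here `divert` sorts to the top with a count of 3 and 75.0 percent, followed by `nkill` (1, 25.0) and `iyear` (0, 0.0).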


### Identify the First Pass of Target Attributes
Select the attributes with less than 20% missing values.

In [65]:
target_attrs = result[result['Percent'] < 20.0]
target_attrs.index.values

array(['iyear', 'imonth', 'iday', 'extended', 'country', 'country_txt',
       'region', 'region_txt', 'provstate', 'city', 'latitude',
       'longitude', 'specificity', 'vicinity', 'summary', 'crit1', 'crit2',
       'crit3', 'doubtterr', 'multiple', 'success', 'suicide',
       'attacktype1', 'attacktype1_txt', 'targtype1', 'targtype1_txt',
       'targsubtype1', 'targsubtype1_txt', 'corp1', 'target1', 'natlty1',
       'natlty1_txt', 'gname', 'guncertain1', 'individual', 'nperps',
       'nperpcap', 'claimed', 'weaptype1', 'weaptype1_txt', 'weapsubtype1',
       'weapsubtype1_txt', 'nkill', 'nkillus', 'nkillter', 'nwound',
       'nwoundus', 'nwoundte', 'property', 'ishostkid', 'scite1',
       'dbsource', 'INT_LOG', 'INT_IDEO', 'INT_MISC', 'INT_ANY'], dtype=object)
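
To see how the strict "less than 20%" cutoff behaves at the boundary, the selection can be sketched on a toy frame. The frame and its values below are hypothetical. The column `city` is missing exactly 20% of its values and is therefore excluded, because the comparison is `<` rather than `<=`.

```python
import pandas as pd

# Hypothetical five-row frame, so each missing value is worth 20%
toy_df = pd.DataFrame({
    "iyear": [1995, 1996, 1997, 1998, 1999],    # 0% missing
    "country": [4, 92, 95, 153, 200],           # 0% missing
    "city": ["A", None, "C", "D", "E"],         # exactly 20% missing
    "divert": [None, None, None, None, "X"],    # 80% missing
})

# Fraction of missing values per column
missing_fraction = toy_df.isnull().mean()

# Keep only columns strictly below the 20% threshold
kept_columns = missing_fraction[missing_fraction < 0.20].index.tolist()
toy_subset = toy_df.loc[:, kept_columns]
```

With this data, `kept_columns` contains only `iyear` and `country`.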

### Subset the Original Dataset
Keep only the columns in the target attribute set.

In [66]:
subset_df = gtd_df.loc[:, target_attrs.index.values]
subset_df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112251 entries, 199501000001 to 201701270001
Data columns (total 56 columns):
iyear               112251 non-null int64
imonth              112251 non-null int64
iday                112251 non-null int64
extended            112251 non-null int64
country             112251 non-null int64
country_txt         112251 non-null object
region              112251 non-null int64
region_txt          112251 non-null object
provstate           109653 non-null object
city                111805 non-null object
latitude            110844 non-null float64
longitude           110844 non-null float64
specificity         112247 non-null float64
vicinity            112251 non-null int64
summary             102988 non-null object
crit1               112251 non-null int64
crit2               112251 non-null int64
crit3               112251 non-null int64
doubtterr           112251 non-null int64
multiple            112251 non-null int64
success             1

### Save the Preprocessed Data
Output the new data frame to a CSV file.

In [67]:
subset_df.to_csv("../data/gtd_preprocessed_95t016.csv", sep = ",")
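
Because the index is written as the first CSV column, a later notebook would need `index_col=0` to restore it when reading the file back. The round trip can be sketched as follows, assuming a hypothetical toy frame, a temporary file, and an index name of `eventid`, which is an assumption for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame; the index name "eventid" is an assumption
toy_df = pd.DataFrame(
    {"iyear": [1995, 2016], "country_txt": ["Japan", "Iraq"]},
    index=pd.Index([199501000001, 201612310001], name="eventid"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_preprocessed.csv")

    # The comma is the default separator, so sep can be omitted
    toy_df.to_csv(csv_path)

    # Restore the first column as the index when reloading
    reloaded_df = pd.read_csv(csv_path, index_col=0)

# Check that the shape and the index survive the round trip
round_trip_ok = (
    reloaded_df.shape == toy_df.shape
    and reloaded_df.index.equals(toy_df.index)
)
```

Omitting `index_col=0` on reload would instead turn the IDs into an ordinary data column and add a fresh integer index.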