# GLOBAL TERRORISM DATABASE ANALYSIS
## Course: Data Preparation and Visualization - Final Project

---

## Project Overview

This analysis explores nearly five decades of global terrorism data (1970-2017) from the **Global Terrorism Database (GTD)**, maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland. The GTD is the most comprehensive unclassified database of terrorist events worldwide; the version used here records 181,691 attacks.

### Objectives

This project aims to:

1. **Temporal Analysis**: Identify patterns, trends, and turning points in terrorist activity over 48 years
2. **Geographic Mapping**: Visualize regional concentrations and country-level hotspots
3. **Tactical Evolution**: Analyze attack methods, weaponry, and target selection patterns
4. **Actor Profiling**: Use machine learning to cluster unknown perpetrators and analyze group longevity
5. **Human Impact**: Quantify casualties, hostage situations, and victim demographics

### Dataset Characteristics

- **Timespan**: 1970 - 2017 (48 years)
- **Observations**: 181,691 terrorist attacks
- **Variables**: 135 attributes per incident
- **Coverage**: 205 countries and territories across 12 world regions
- **Data Gap**: Complete absence of 1993 data (lost before compilation)

### Methodological Approach

- **Data Cleaning**: Convert missing-value codes to `NaN`, confirm the 1993 gap, drop rows with an unknown month or day, engineer features
- **Visualization**: White background style with consistent color themes for professional presentation
- **Machine Learning**: PCA + K-Means clustering for perpetrator profiling
- **Statistical Analysis**: Temporal aggregations, geographic distributions, and correlation studies

---

# Data Dictionary: Global Terrorism Database (GTD)

## I. Temporal Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `iyear` | Year of incident | Numeric |
| `imonth` | Month of incident (0 = Unknown, for pre-2011 data) | Numeric |
| `iday` | Day of incident (0 = Unknown, for pre-2011 data) | Numeric |
| `approxdate` | Text description if exact date is unknown (e.g., "Mid-June 1978") | Text |
| `extended` | Did the incident extend more than 24 hours? | Categorical (1 = Yes, 0 = No) |
| `resolution` | End date of incident (if `extended` = 1) | Date |

## II. Incident Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `summary` | Brief narrative summary of the incident. **(Note: Post-1997 only)** | Text |
| `crit1` | Criterion 1: Does the incident have political, economic, religious, or social goals? | Categorical (1 = Yes, 0 = No) |
| `crit2` | Criterion 2: Intent to coerce, intimidate, or convey message beyond immediate victims? | Categorical (1 = Yes, 0 = No) |
| `crit3` | Criterion 3: Outside the context of legitimate warfare/international humanitarian law? | Categorical (1 = Yes, 0 = No) |
| `doubtterr` | Doubt whether the incident is terrorism? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `alternative` / `alternative_txt` | If `doubtterr` = 1, most likely alternative classification | Categorical (1 = Insurgency/Guerilla, 2 = Other Crime, 3 = Inter-group Conflict, 4 = Lack of Intent, 5 = State Actor) |
| `multiple` | Part of a multiple/coordinated incident? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No) |
| `related` | Lists other related `eventid` if `multiple` = 1. **(Note: Post-1997 only)** | Text |
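
The three inclusion criteria and the `doubtterr` flag can be combined when an analysis should count only unambiguous incidents. The sketch below uses a small hand-made frame in place of the real data, and the helper name `strict_terrorism_filter` is hypothetical. Treating an unknown `doubtterr` (`-9`) as doubtful is an analyst's choice, not something the GTD prescribes. Because `doubtterr` is marked as post-1997 only, earlier rows may carry `-9` and would be dropped by this version.

```python
import pandas as pd

# Toy rows standing in for GTD records (illustrative values, not real data)
sample = pd.DataFrame({
    "eventid":   [1, 2, 3, 4],
    "crit1":     [1, 1, 1, 0],
    "crit2":     [1, 1, 1, 1],
    "crit3":     [1, 1, 0, 1],
    "doubtterr": [0, 1, 0, -9],
})

def strict_terrorism_filter(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep rows that meet all three criteria with no recorded doubt."""
    meets_all = (frame[["crit1", "crit2", "crit3"]] == 1).all(axis=1)
    no_doubt = frame["doubtterr"] == 0   # excludes 1 (doubt) and -9 (unknown)
    return frame[meets_all & no_doubt]

strict = strict_terrorism_filter(sample)
print(strict["eventid"].tolist())
```

Only the first toy row survives: the others fail a criterion or carry doubt.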

## III. Location Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `country` / `country_txt` | Country where incident occurred | Categorical (Country code) |
| `region` / `region_txt` | Geographic region (e.g., North America, Western Europe, Southeast Asia) | Categorical (1-12) |
| `provstate` | Province, state, or first-level administrative division | Text |
| `city` | City, village, or town where incident occurred | Text |
| `vicinity` | Did incident occur in the vicinity of the city (not within city limits)? | Categorical (1 = Yes, 0 = No) |
| `location` | Additional details about location (e.g., "near embassy", "on Highway 5") | Text |
| `latitude` | Latitude of the city | Numeric |
| `longitude` | Longitude of the city | Numeric |
| `specificity` | Precision level of geo-coding | Categorical (1 = City center, 2 = Regional center (city not found), 3 = Regional center (outside city), 4 = Province/state center, 5 = Unknown) |

## IV. Attack Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `attacktype1` / `attacktype1_txt` | Primary attack type | Categorical (1 = Assassination, 2 = Armed Assault, 3 = Bombing/Explosion, 4 = Hijacking, 5 = Hostage Taking (Barricade), 6 = Hostage Taking (Kidnapping), 7 = Facility/Infrastructure Attack, 8 = Unarmed Assault, 9 = Unknown) |
| `attacktype2` / `attacktype2_txt` | Secondary attack type (if applicable, in hierarchical order) | Categorical (Same as above) |
| `attacktype3` / `attacktype3_txt` | Tertiary attack type (if applicable) | Categorical (Same as above) |
| `success` | Was the attack successful (by attack type definition, e.g., did bomb detonate)? | Categorical (1 = Yes, 0 = No) |
| `suicide` | Was this a suicide attack? | Categorical (1 = Yes, 0 = No) |

## V. Weapon Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `weaptype1` / `weaptype1_txt` | General weapon type used | Categorical (1 = Biological, 2 = Chemical, 3 = Radiological, 4 = Nuclear, 5 = Firearms, 6 = Explosives, 7 = Fake Weapons, 8 = Incendiary, 9 = Melee, 10 = Vehicle, 11 = Sabotage Equipment, 12 = Other, 13 = Unknown) |
| `weapsubtype1` / `weapsubtype1_txt` | More specific weapon subtype (e.g., Handgun, Letter Bomb, Timed Bomb) | Categorical |
| `weaptype2` / `weapsubtype2` | Secondary weapon type and subtype | Categorical |
| `weaptype3` / `weapsubtype3` | Tertiary weapon type and subtype | Categorical |
| `weaptype4` / `weapsubtype4` | Fourth weapon type and subtype | Categorical |
| `weapdetail` | Additional notes about weapons (e.g., gun model, concealment method) | Text |

## VI. Target/Victim Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `targtype1` / `targtype1_txt` | General target type | Categorical (1 = Business, 2 = Government (General), 3 = Police, 4 = Military, 7 = Government (Diplomatic), 8 = Educational Institution, 10 = Journalists & Media, 14 = Private Citizens & Property, 15 = Religious Figures/Institutions, 19 = Transportation, etc.) |
| `targsubtype1` / `targsubtype1_txt` | More specific target subtype (e.g., Restaurant/Bar, Embassy, Patrol) | Categorical |
| `corp1` | Name of targeted organization/agency/company (if applicable) | Text |
| `target1` | Specific description of target (e.g., "US Embassy", "5 patrol soldiers", "President X") | Text |
| `natlty1` / `natlty1_txt` | Nationality of the target | Categorical (Country code) |
| `targtype2` / `targsubtype2` / `corp2` / `target2` / `natlty2` | Information for second target | Same as above |
| `targtype3` / `targsubtype3` / `corp3` / `target3` / `natlty3` | Information for third target | Same as above |

## VII. Perpetrator Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `gname` | Name of perpetrator group ("Unknown" if unclear) | Text |
| `gsubname` | Name of sub-group or specific faction (if applicable) | Text |
| `gname2` / `gsubname2` | Second perpetrator group name and sub-group | Text |
| `gname3` / `gsubname3` | Third perpetrator group name and sub-group | Text |
| `guncertain1` | Is attribution to perpetrator 1 doubtful/uncertain? | Categorical (1 = Yes, 0 = No) |
| `guncertain2` | Is attribution to perpetrator 2 doubtful/uncertain? | Categorical (1 = Yes, 0 = No) |
| `guncertain3` | Is attribution to perpetrator 3 doubtful/uncertain? | Categorical (1 = Yes, 0 = No) |
| `individual` | Attack carried out by unaffiliated individual(s)? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No) |
| `nperps` | Total number of perpetrators involved | Numeric (-99 = Unknown) |
| `nperpcap` | Number of perpetrators captured **(Note: Post-1997 only)** | Numeric (-99 = Unknown) |
| `claimed` | Did (group 1) claim responsibility? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No) |
| `claimmode` / `claimmode_txt` | Mode of claim (e.g., call, letter, video) **(Note: Post-1997 only)** | Categorical (1-10) |
| `claim2` / `claimmode2` | Group 2 claim and mode **(Note: Post-1997 only)** | Categorical |
| `claim3` / `claimmode3` | Group 3 claim and mode **(Note: Post-1997 only)** | Categorical |
| `compclaim` | Competing claims by multiple groups? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `motive` | Specific motive of the attack (if stated) **(Note: Post-1997 only)** | Text |

## VIII. Casualties & Consequences

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `nkill` | Total confirmed fatalities (including victims and perpetrators) | Numeric |
| `nkillus` | Number of US citizens killed | Numeric |
| `nkillter` | Number of perpetrators killed | Numeric |
| `nwound` | Total confirmed injured (including victims and perpetrators) | Numeric |
| `nwoundus` | Number of US citizens injured | Numeric |
| `nwoundte` | Number of perpetrators injured | Numeric |
| `property` | Was there property damage? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `propextent` / `propextent_txt` | Extent of property damage | Categorical (1 = Catastrophic (>= $1B), 2 = Major (>= $1M), 3 = Minor (< $1M), 4 = Unknown) |
| `propvalue` | Estimated property damage value (USD at time of incident) | Numeric |
| `propcomment` | Notes describing property damage | Text |
| `ishostkid` | Were there hostages taken or kidnapping? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `nhostkid` | Total number of hostages/kidnap victims | Numeric (-99 = Unknown) |
| `nhostkidus` | Number of hostages/victims who are US citizens | Numeric |
| `nhours` | Duration of kidnapping (in hours, if < 24 hours) | Numeric (-99 = Unknown) |
| `ndays` | Duration of kidnapping (in days, if > 24 hours) | Numeric |
| `divert` | Country where vehicle was diverted (hijacking) or victims taken (kidnapping) | Text |
| `kidhijcountry` | Country where incident was resolved | Text |
| `ransom` | Was ransom demanded? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `ransomamt` | Ransom amount demanded (USD) | Numeric (-99 = Unknown) |
| `ransomamtus` | Ransom amount demanded from US sources (USD) | Numeric (-99 = Unknown) |
| `ransompaid` | Ransom amount paid (USD) | Numeric (-99 = Unknown) |
| `ransompaidus` | Ransom amount paid by US sources (USD) | Numeric (-99 = Unknown) |
| `ransomnote` | Notes on ransom demands or non-monetary demands **(Note: Post-1997 only)** | Text |
| `hostkidoutcome` / `hostkidoutcome_txt` | Outcome of hostage/kidnapping situation | Categorical (1 = Rescue Attempt, 2 = Hostage(s) Released, 3 = Hostage(s) Escaped, 4 = Hostage(s) Killed, 5 = Successful Rescue, 6 = Combined, 7 = Unknown) |
| `nreleased` | Number of hostages/victims who survived (released, escaped, or rescued) | Numeric (-99 = Unknown) |
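
`nhours` and `ndays` describe the same quantity in different units, so a single duration column can be easier to analyze. The sketch below is one possible way to merge them, using toy values; the column name `duration_hours` is an assumption, not a GTD field.

```python
import numpy as np
import pandas as pd

# Toy hostage records (illustrative values only)
hostage = pd.DataFrame({
    "nhours": [5.0, np.nan, -99.0, np.nan],
    "ndays":  [np.nan, 3.0, np.nan, np.nan],
})

# Treat the -99 "unknown" code as missing before doing arithmetic
cleaned = hostage.replace(-99, np.nan)

# Prefer the hour count; fall back to days converted to hours.
# 'duration_hours' is a hypothetical derived column.
hostage["duration_hours"] = cleaned["nhours"].fillna(cleaned["ndays"] * 24)

print(hostage["duration_hours"].tolist())
```

Rows where both fields are missing or unknown stay `NaN` rather than being counted as zero-length incidents.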

## IX. Additional Information & Sources

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `addnotes` | Additional notes on relevant details not captured elsewhere **(Note: Post-1997 only)** | Text |
| `scite1` | First citation source **(Note: Post-1997 only)** | Text |
| `scite2` | Second citation source **(Note: Post-1997 only)** | Text |
| `scite3` | Third citation source **(Note: Post-1997 only)** | Text |
| `dbsource` | Data collection source origin (e.g., PGIS, CETIS, ISVG, START) | Text |
| `INT_LOG` | International (Logistics): Did perpetrator group cross borders to execute attack? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `INT_IDEO` | International (Ideological): Does perpetrator nationality differ from target nationality? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `INT_MISC` | International (Miscellaneous): Does attack location differ from target nationality? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `INT_ANY` | International (Any): Is attack international by any criterion (LOG, IDEO, MISC)? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |

---

**Dataset Scope:** 181,691 attacks | 205 countries | 12 regions | 135 variables | 1970-2017 (48 years)

---
# PART I: DATA PREPARATION
## Loading Libraries and Initial Setup

In [None]:
# Standard library
import re

# Core data manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.gridspec as gridspec
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize, LinearSegmentedColormap
from matplotlib.gridspec import GridSpec
import seaborn as sns
import squarify
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

# Geographic visualization
import geopandas as gpd

# Machine Learning & Preprocessing
from sklearn.cluster import MiniBatchKMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, QuantileTransformer

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("white")
plt.rcParams['figure.facecolor'] = 'white'
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.grid'] = False

print("All libraries loaded successfully!")

All libraries loaded successfully!


## Visualization Formatting

In [None]:
# Main theme colors for multi-category visualizations
theme_colors = [
    "#8C1B13",  # deep wine red
    "#D73030",  # strong red
    "#E7642E",  # burnt orange
    "#E6B01B",  # warm golden yellow
    "#C65C8A",  # muted pink-purple
    "#7E3FA4"   # controlled royal purple
]

# Single insight color (for highlighting key findings)
INSIGHT_COLOR = "#d62728"  # red

# Supporting colors
COLOR_CONTEXT = "#BDC3C7"  # light gray for context/background
COLOR_TEXT = "#1f2f40"     # dark text

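# Named aliases for individual theme colors
color_freq = "#8C1B13"      # deep wine red (first theme color)
color_deadly = "#D73030"    # strong red (second theme color)
color_context = "#BDC3C7"   # light gray, same value as COLOR_CONTEXT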

# Setting Style
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.left'] = False 
plt.rcParams['axes.spines.bottom'] = True
plt.rcParams['axes.grid'] = False 

print("Color theme established")

Color theme established


## Loading the Dataset

In [None]:
# Load the Global Terrorism Database
df = pd.read_csv('data/globalterrorismdb.csv', encoding='latin-1', low_memory=False)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Total attacks: {len(df):,}")
print(f"Features: {df.shape[1]}")
print(f"Time range: {df['iyear'].min()} - {df['iyear'].max()}")

Dataset loaded successfully!
Shape: (181691, 135)
Total attacks: 181,691
Features: 135
Time range: 1970 - 2017


## Data Cleaning and Preprocessing

### Step 1: Handle Missing Value Codes

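The GTD uses sentinel codes for unknown values:
- `-9` = Unknown (categorical yes/no fields such as `property` or `ransom`)
- `-99` = Unknown (numeric count and amount fields such as `nperps` or `ransomamt`)

We convert these codes to `NaN` in the numeric count and amount columns so they are not mistaken for real values in sums and averages.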

In [None]:
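df_clean = df.copy()

# Fix inconsistency: totals must not be smaller than their US subtotals
# (applied to df_clean so the fix reaches the cleaned data and the raw df stays untouched)
df_clean['nwound'] = np.where(df_clean['nwoundus'] > df_clean['nwound'], df_clean['nwoundus'], df_clean['nwound'])
df_clean['nkill'] = np.where(df_clean['nkillus'] > df_clean['nkill'], df_clean['nkillus'], df_clean['nkill'])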

# Replace GTD missing value codes with NaN
columns_to_replace = ['nperps', 'nperpcap', 'nhostkid', 'nhours', 'ndays', 
                      'ransomamt', 'ransomamtus', 'ransompaid', 'ransompaidus', 'nreleased']

for col in columns_to_replace:
    if col in df_clean.columns:
        # np.nan keeps the column numeric; pd.NA can turn a float column into object dtype
        df_clean[col] = df_clean[col].replace([-9, -99], np.nan)

print("Missing value codes handled")
print(f"Affected columns: {len([c for c in columns_to_replace if c in df_clean.columns])}")

Missing value codes handled
Affected columns: 10
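
The loop above covers count and amount columns only. The data dictionary also lists `-9` as "Unknown" for several yes/no flags, so a similar pass could be applied there. The sketch below shows the idea on a toy frame; the choice of `flag_cols` is an assumption drawn from the dictionary entries, not a list the GTD provides.

```python
import numpy as np
import pandas as pd

# Toy frame with a few flag columns from the data dictionary (illustrative values)
flags = pd.DataFrame({
    "property":  [1, 0, -9],
    "ishostkid": [0, -9, 1],
    "ransom":    [-9, 0, 1],
    "nkill":     [2, 0, 9],      # genuine count, must stay untouched
})

# Assumed selection: flags whose dictionary entry shows -9 = Unknown
flag_cols = ["property", "ishostkid", "ransom"]

# Replace the sentinel only in the selected columns
flags[flag_cols] = flags[flag_cols].replace(-9, np.nan)

print(flags.isna().sum().to_dict())
```

Restricting the replacement to named columns avoids touching fields where a negative number could be a legitimate value, such as coordinates.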


In [None]:
# Check missing value ratio for all columns
missing_ratio = df_clean.isna().mean().sort_values(ascending=False)
high_missing = missing_ratio[missing_ratio > 0.9]
print(f"Features >90% missing: {len(high_missing)}")
print(high_missing)

Features >90% missing: 62
gsubname3           0.999890
weapsubtype4_txt    0.999615
weapsubtype4        0.999615
weaptype4           0.999598
weaptype4_txt       0.999598
                      ...   
weapsubtype2_txt    0.936475
nhostkid            0.932110
weaptype2           0.927751
weaptype2_txt       0.927751
nhostkidus          0.925604
Length: 62, dtype: float64
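
The cell above reports the sparse columns but leaves them in place. If an analysis preferred to exclude them, one possible approach is sketched below on a toy frame. The helper name `drop_sparse_columns` is hypothetical, and the 0.9 default simply mirrors the threshold used in the check.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Hypothetical helper: return a copy without columns whose missing share exceeds `threshold`."""
    missing_share = frame.isna().mean()
    keep = missing_share[missing_share <= threshold].index
    return frame.loc[:, keep].copy()

# Toy frame: 'mostly_empty' is 95% missing, 'dense' has no gaps
toy = pd.DataFrame({
    "dense": np.arange(20),
    "mostly_empty": [1.0] + [np.nan] * 19,
})

print(drop_sparse_columns(toy).columns.tolist())
```

Whether to drop such columns depends on the question: fields like `nhostkid` are sparse because most incidents involve no hostages, not because the data is poor.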


### Step 2: Remove 1993 Data

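The GTD contains no incidents for 1993: the records were lost before compilation, leaving a gap in the time series. The filter below is a safeguard and confirms that no 1993 rows are present.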

In [None]:
# Remove 1993 (missing year)
records_1993 = len(df_clean[df_clean['iyear'] == 1993])
df_clean = df_clean[df_clean['iyear'] != 1993]

print(f"Year 1993 removed: {records_1993} records excluded")
print(f"New dataset size: {len(df_clean):,} attacks")

Year 1993 removed: 0 records excluded
New dataset size: 181,691 attacks
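
Since the filter removes nothing, the gap is easier to see by comparing the years present against the full range. The sketch below does this on a toy year column with a deliberate hole, then reindexes the yearly counts so the absent year shows up as `NaN`. The variable names are illustrative.

```python
import pandas as pd

# Toy year column with a deliberate hole at 1993 (illustrative values)
years = pd.Series([1991, 1991, 1992, 1994, 1994, 1994, 1995], name="iyear")

# Every year between the first and last observed year
full_range = list(range(int(years.min()), int(years.max()) + 1))
missing_years = sorted(set(full_range) - set(years.tolist()))
print(missing_years)

# Reindex yearly counts over the full range; absent years become NaN,
# so they are not silently read as "zero attacks"
yearly = years.value_counts().sort_index().reindex(full_range)
print(int(yearly.isna().sum()))
```

Keeping the missing year as `NaN` lets a line chart show a break at 1993 instead of a misleading drop to zero.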


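### Step 3: Check for Impossible Dates

A month or day of `0` is the GTD code for an unknown date component, so these rows cannot be placed on a calendar. The check below counts them.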

In [None]:
# Check for impossible dates
impossible_dates = df_clean[
    (df_clean['imonth'] < 1) | (df_clean['imonth'] > 12) |
    (df_clean['iday'] < 1) | (df_clean['iday'] > 31)
]

print(f"Impossible date rows: {len(impossible_dates)}")
print(impossible_dates[['iyear', 'imonth', 'iday']].head())

Impossible date rows: 891
    iyear  imonth  iday
1    1970       0     0
2    1970       1     0
3    1970       1     0
4    1970       1     0
96   1970       3     0


In [None]:
# Remove impossible dates from df_clean
df_clean = df_clean[
    (df_clean['imonth'] >= 1) & (df_clean['imonth'] <= 12) &
    (df_clean['iday'] >= 1) & (df_clean['iday'] <= 31)
].copy()

print(f"Impossible date rows removed. New dataset size: {len(df_clean):,}")

Impossible date rows removed. New dataset size: 180,800
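
The range check passes combinations such as 30 February, which sit inside the bounds but are not real dates. A sketch of catching those with `pd.to_datetime(..., errors="coerce")` follows, using toy rows rather than the actual data.

```python
import pandas as pd

# Toy rows: the second passes a 1-31 range check but is not a calendar date
parts = pd.DataFrame({
    "iyear":  [2001, 2001, 2004],
    "imonth": [9, 2, 2],
    "iday":   [11, 30, 29],
})

# Assemble dates from year/month/day columns; invalid combinations become NaT
dates = pd.to_datetime(
    parts.rename(columns={"iyear": "year", "imonth": "month", "iday": "day"}),
    errors="coerce",
)

# Rows whose components do not form a valid calendar date
invalid = parts[dates.isna()]
print(len(invalid))
```

Here 30 February 2001 is flagged, while 29 February 2004 is kept because 2004 is a leap year.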


### Step 4: Feature Engineering

Create derived variables for enhanced analysis:

In [None]:
# 1. Create full date column
df_clean['date'] = pd.to_datetime(
    df_clean[['iyear', 'imonth', 'iday']].rename(
        columns={'iyear': 'year', 'imonth': 'month', 'iday': 'day'}
    ),
    errors='coerce'
)

# 2. Total casualties (killed + wounded); missing counts are treated as 0
df_clean['Casualties'] = df_clean['nkill'].fillna(0) + df_clean['nwound'].fillna(0)

# 3. Decade grouping
df_clean['Decade'] = (df_clean['iyear'] // 10) * 10

# 4. Severity classification
def classify_severity(casualties):
    if casualties == 0:
        return 'No Casualties'
    elif casualties <= 5:
        return 'Low'
    elif casualties <= 25:
        return 'Medium'
    elif casualties <= 100:
        return 'High'
    else:
        return 'Extreme'

df_clean['severity'] = df_clean['Casualties'].apply(classify_severity)

# 5. Unknown group indicator
df_clean['is_unknown'] = (df_clean['gname'] == 'Unknown').astype(int)

# 6. Normalized region/country columns for consistent grouping
df_clean['region_norm'] = df_clean['region_txt'].str.strip().str.lower()
df_clean['country_norm'] = df_clean['country_txt'].str.strip().str.lower()

# 7. Normalized success column (for dashboards using success/failure)
df_clean['success_norm'] = df_clean['success'].apply(lambda x: 1 if x == 1 else (0 if x == 0 else np.nan))

# 8. Hostage/kidnapping filter for dashboards
df_h = df_clean[df_clean['ishostkid'] == 1].copy()

# 9. Standardize group names
isis_variants = [
    'Islamic State of Iraq and the Levant (ISIL)',
    'Islamic State of Iraq (ISI)',
    'Islamic State (IS)',
    'Islamic State of Iraq and Syria (ISIS)'
]
df_clean['gname_clean'] = df_clean['gname'].replace(isis_variants, 'ISIS/ISIL')

# 10. Civilian target indicator
df_clean['is_civilian_target'] = df_clean['targtype1_txt'].str.contains('Private Citizens', case=False, na=False).astype(int)
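
`classify_severity` is applied row by row, which works but can be slow on a frame of this size. A vectorized alternative with `pd.cut` is sketched below. It assumes casualty counts are non-negative and repeats the function only so the two results can be compared on toy values.

```python
import numpy as np
import pandas as pd

def classify_severity(casualties):
    # Same thresholds as the row-wise function above
    if casualties == 0:
        return 'No Casualties'
    elif casualties <= 5:
        return 'Low'
    elif casualties <= 25:
        return 'Medium'
    elif casualties <= 100:
        return 'High'
    else:
        return 'Extreme'

# Toy casualty counts placed on both sides of each threshold
casualties = pd.Series([0, 1, 5, 6, 25, 26, 100, 101, 1500], dtype=float)

# Right-closed bins: (-1, 0], (0, 5], (5, 25], (25, 100], (100, inf]
severity_cut = pd.cut(
    casualties,
    bins=[-1, 0, 5, 25, 100, np.inf],
    labels=['No Casualties', 'Low', 'Medium', 'High', 'Extreme'],
)

print(severity_cut.astype(str).tolist())
```

The result is an ordered categorical, so the levels sort from 'No Casualties' to 'Extreme' in charts without extra handling.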

### Step 5: Save Cleaned Dataset

In [None]:
# Save cleaned dataset
df_clean.to_csv('data/terrorism_cleaned.csv', index=False)
print("Cleaned dataset saved as 'terrorism_cleaned.csv'")

Cleaned dataset saved as 'terrorism_cleaned.csv'
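
CSV files do not store dtypes, so the engineered `date` column comes back as text unless it is parsed on load. The sketch below shows the round trip with an in-memory buffer and a toy frame instead of the real file.

```python
import io
import pandas as pd

# Toy stand-in for df_clean with a datetime column (illustrative values)
toy = pd.DataFrame({
    "eventid": [1, 2],
    "date": pd.to_datetime(["1970-07-02", "2017-12-31"]),
})

# Write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the column is read back as plain text
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates it is restored to a datetime column
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])

print(plain["date"].dtype, parsed["date"].dtype)
```

When reloading `terrorism_cleaned.csv`, passing `parse_dates=['date']` to `pd.read_csv` would restore the column in the same way.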
