# GLOBAL TERRORISM DATABASE: FROM CHAOS TO CLARITY
## The Power of Data Preprocessing & Unmasking the Unknown
## Course: Data Preparation and Visualization - Final Project

---

### PROJECT OVERVIEW
This project moves beyond standard descriptive statistics to investigate the **strategic behavior** of global terrorism over nearly five decades (1970-2017). Using the **Global Terrorism Database (GTD)**, we aim to shift the analytical focus from "What happened?" to "Why is it lethal?" and "Who is behind the silence?".

The analysis is structured into two critical investigations: deconstructing the **"Efficiency Paradox"** of attack tactics and performing a **"Ghost Hunt"** to profile unknown perpetrators using Machine Learning.

### STRATEGIC OBJECTIVES
* **The Efficiency Paradox:** Challenge the assumption that high-frequency attacks are the most dangerous by visualizing the trade-off between **Popularity** (e.g., Bombing) and **Lethality** (e.g., Hijacking).
* **Anatomy of Lethality:** Dissect the "Suicide Multiplier" effect and map "Regional Signatures" to understand how terrain and power dynamics dictate tactical choices.
* **"Ghost Hunting" (Clustering):** Address the critical data gap where roughly half of all attacks are attributed to "Unknown" groups. We apply **K-Means Clustering** to group these unattributed incidents into 5 distinct operational personas.

### DATASET CHARACTERISTICS
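* **Source:** Global Terrorism Database (GTD), maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland.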
* **Scope:** 181,691 incidents (1970-2017).
* **Key Features:** `AttackType`, `WeaponType`, `Suicide`, `Casualties` (nkill + nwound), `Region`, and `Group Name`.

---

# Data Dictionary: Global Terrorism Database (GTD)

## I. Temporal Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `iyear` | Year of incident | Numeric |
| `imonth` | Month of incident (0 = Unknown, for pre-2011 data) | Numeric |
| `iday` | Day of incident (0 = Unknown, for pre-2011 data) | Numeric |
| `approxdate` | Text description if exact date is unknown (e.g., "Mid-June 1978") | Text |
| `extended` | Did the incident extend more than 24 hours? | Categorical (1 = Yes, 0 = No) |
| `resolution` | End date of incident (if `extended` = 1) | Date |

## II. Incident Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `summary` | Brief narrative summary of the incident. **(Note: Post-1997 only)** | Text |
| `crit1` | Criterion 1: Does the incident have political, economic, religious, or social goals? | Categorical (1 = Yes, 0 = No) |
| `crit2` | Criterion 2: Intent to coerce, intimidate, or convey message beyond immediate victims? | Categorical (1 = Yes, 0 = No) |
| `crit3` | Criterion 3: Outside the context of legitimate warfare/international humanitarian law? | Categorical (1 = Yes, 0 = No) |
| `doubtterr` | Doubt whether the incident is terrorism? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `alternative` / `alternative_txt` | If `doubtterr` = 1, most likely alternative classification | Categorical (1 = Insurgency/Guerilla, 2 = Other Crime, 3 = Inter-group Conflict, 4 = Lack of Intent, 5 = State Actor) |
| `multiple` | Part of a multiple/coordinated incident? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No) |
| `related` | Lists other related `eventid` if `multiple` = 1. **(Note: Post-1997 only)** | Text |

## III. Location Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `country` / `country_txt` | Country where incident occurred | Categorical (Country code) |
| `region` / `region_txt` | Geographic region (e.g., North America, Western Europe, Southeast Asia) | Categorical (1-12) |
| `provstate` | Province, state, or first-level administrative division | Text |
| `city` | City, village, or town where incident occurred | Text |
| `vicinity` | Did incident occur in the vicinity of the city (not within city limits)? | Categorical (1 = Yes, 0 = No) |
| `location` | Additional details about location (e.g., "near embassy", "on Highway 5") | Text |
| `latitude` | Latitude of the city | Numeric |
| `longitude` | Longitude of the city | Numeric |
| `specificity` | Precision level of geo-coding | Categorical (1 = City center, 2 = Regional center (city not found), 3 = Regional center (outside city), 4 = Province/state center, 5 = Unknown) |

## IV. Attack Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `attacktype1` / `attacktype1_txt` | Primary attack type | Categorical (1 = Assassination, 2 = Armed Assault, 3 = Bombing/Explosion, 4 = Hijacking, 5 = Hostage Taking (Barricade), 6 = Hostage Taking (Kidnapping), 7 = Facility/Infrastructure Attack, 8 = Unarmed Assault, 9 = Unknown) |
| `attacktype2` / `attacktype2_txt` | Secondary attack type (if applicable, in hierarchical order) | Categorical (Same as above) |
| `attacktype3` / `attacktype3_txt` | Tertiary attack type (if applicable) | Categorical (Same as above) |
| `success` | Was the attack successful (by attack type definition, e.g., did bomb detonate)? | Categorical (1 = Yes, 0 = No) |
| `suicide` | Was this a suicide attack? | Categorical (1 = Yes, 0 = No) |

## V. Weapon Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `weaptype1` / `weaptype1_txt` | General weapon type used | Categorical (1 = Biological, 2 = Chemical, 3 = Radiological, 4 = Nuclear, 5 = Firearms, 6 = Explosives, 7 = Fake Weapons, 8 = Incendiary, 9 = Melee, 10 = Vehicle, 11 = Sabotage Equipment, 12 = Other, 13 = Unknown) |
| `weapsubtype1` / `weapsubtype1_txt` | More specific weapon subtype (e.g., Handgun, Letter Bomb, Timed Bomb) | Categorical |
| `weaptype2` / `weapsubtype2` | Secondary weapon type and subtype | Categorical |
| `weaptype3` / `weapsubtype3` | Tertiary weapon type and subtype | Categorical |
| `weaptype4` / `weapsubtype4` | Fourth weapon type and subtype | Categorical |
| `weapdetail` | Additional notes about weapons (e.g., gun model, concealment method) | Text |

## VI. Target/Victim Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `targtype1` / `targtype1_txt` | General target type | Categorical (1 = Business, 2 = Government (General), 3 = Police, 4 = Military, 7 = Government (Diplomatic), 8 = Educational Institution, 10 = Journalists & Media, 14 = Private Citizens & Property, 15 = Religious Figures/Institutions, 19 = Transportation, etc.) |
| `targsubtype1` / `targsubtype1_txt` | More specific target subtype (e.g., Restaurant/Bar, Embassy, Patrol) | Categorical |
| `corp1` | Name of targeted organization/agency/company (if applicable) | Text |
| `target1` | Specific description of target (e.g., "US Embassy", "5 patrol soldiers", "President X") | Text |
| `natlty1` / `natlty1_txt` | Nationality of the target | Categorical (Country code) |
| `targtype2` / `targsubtype2` / `corp2` / `target2` / `natlty2` | Information for second target | Same as above |
| `targtype3` / `targsubtype3` / `corp3` / `target3` / `natlty3` | Information for third target | Same as above |

## VII. Perpetrator Information

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `gname` | Name of perpetrator group ("Unknown" if unclear) | Text |
| `gsubname` | Name of sub-group or specific faction (if applicable) | Text |
| `gname2` / `gsubname2` | Second perpetrator group name and sub-group | Text |
| `gname3` / `gsubname3` | Third perpetrator group name and sub-group | Text |
| `guncertain1` | Is attribution to perpetrator 1 doubtful/uncertain? | Categorical (1 = Yes, 0 = No) |
| `guncertain2` | Is attribution to perpetrator 2 doubtful/uncertain? | Categorical (1 = Yes, 0 = No) |
| `guncertain3` | Is attribution to perpetrator 3 doubtful/uncertain? | Categorical (1 = Yes, 0 = No) |
| `individual` | Attack carried out by unaffiliated individual(s)? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No) |
| `nperps` | Total number of perpetrators involved | Numeric (-99 = Unknown) |
| `nperpcap` | Number of perpetrators captured **(Note: Post-1997 only)** | Numeric (-99 = Unknown) |
| `claimed` | Did (group 1) claim responsibility? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No) |
| `claimmode` / `claimmode_txt` | Mode of claim (e.g., call, letter, video) **(Note: Post-1997 only)** | Categorical (1-10) |
| `claim2` / `claimmode2` | Group 2 claim and mode **(Note: Post-1997 only)** | Categorical |
| `claim3` / `claimmode3` | Group 3 claim and mode **(Note: Post-1997 only)** | Categorical |
| `compclaim` | Competing claims by multiple groups? **(Note: Post-1997 only)** | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `motive` | Specific motive of the attack (if stated) **(Note: Post-1997 only)** | Text |

## VIII. Casualties & Consequences

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `nkill` | Total confirmed fatalities (including victims and perpetrators) | Numeric |
| `nkillus` | Number of US citizens killed | Numeric |
| `nkillter` | Number of perpetrators killed | Numeric |
| `nwound` | Total confirmed injured (including victims and perpetrators) | Numeric |
| `nwoundus` | Number of US citizens injured | Numeric |
| `nwoundte` | Number of perpetrators injured | Numeric |
| `property` | Was there property damage? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `propextent` / `propextent_txt` | Extent of property damage | Categorical (1 = Catastrophic (>= $1B), 2 = Major (>= $1M), 3 = Minor (< $1M), 4 = Unknown) |
| `propvalue` | Estimated property damage value (USD at time of incident) | Numeric |
| `propcomment` | Notes describing property damage | Text |
| `ishostkid` | Were there hostages taken or kidnapping? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `nhostkid` | Total number of hostages/kidnap victims | Numeric (-99 = Unknown) |
| `nhostkidus` | Number of hostages/victims who are US citizens | Numeric |
| `nhours` | Duration of kidnapping (in hours, if < 24 hours) | Numeric (-99 = Unknown) |
| `ndays` | Duration of kidnapping (in days, if > 24 hours) | Numeric |
| `divert` | Country where vehicle was diverted (hijacking) or victims taken (kidnapping) | Text |
| `kidhijcountry` | Country where incident was resolved | Text |
| `ransom` | Was ransom demanded? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `ransomamt` | Ransom amount demanded (USD) | Numeric (-99 = Unknown) |
| `ransomamtus` | Ransom amount demanded from US sources (USD) | Numeric (-99 = Unknown) |
| `ransompaid` | Ransom amount paid (USD) | Numeric (-99 = Unknown) |
| `ransompaidus` | Ransom amount paid by US sources (USD) | Numeric (-99 = Unknown) |
| `ransomnote` | Notes on ransom demands or non-monetary demands **(Note: Post-1997 only)** | Text |
| `hostkidoutcome` / `hostkidoutcome_txt` | Outcome of hostage/kidnapping situation | Categorical (1 = Rescue Attempt, 2 = Hostage(s) Released, 3 = Hostage(s) Escaped, 4 = Hostage(s) Killed, 5 = Successful Rescue, 6 = Combined, 7 = Unknown) |
| `nreleased` | Number of hostages/victims who survived (released, escaped, or rescued) | Numeric (-99 = Unknown) |

## IX. Additional Information & Sources

| Variable | Description | Data Type / Values |
| :--- | :--- | :--- |
| `addnotes` | Additional notes on relevant details not captured elsewhere **(Note: Post-1997 only)** | Text |
| `scite1` | First citation source **(Note: Post-1997 only)** | Text |
| `scite2` | Second citation source **(Note: Post-1997 only)** | Text |
| `scite3` | Third citation source **(Note: Post-1997 only)** | Text |
| `dbsource` | Data collection source origin (e.g., PGIS, CETIS, ISVG, START) | Text |
| `INT_LOG` | International (Logistics): Did perpetrator group cross borders to execute attack? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `INT_IDEO` | International (Ideological): Does perpetrator nationality differ from target nationality? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `INT_MISC` | International (Miscellaneous): Does attack location differ from target nationality? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |
| `INT_ANY` | International (Any): Is attack international by any criterion (LOG, IDEO, MISC)? | Categorical (1 = Yes, 0 = No, -9 = Unknown) |

---

**Dataset Scope:** 181,691 attacks | 205 countries | 12 regions | 135 variables | 1970-2017 (48 years)

# PART 1: INITIALIZATION & IMPORT LIBRARIES

## 1.1 Import Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# Setup style for visualizations
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully!")

## 1.2 Load Data

In [None]:
# Load data
df_raw = pd.read_csv('data/globalterrorismdb.csv', encoding='latin-1', low_memory=False)

print("✅ Data loaded successfully!")
print(f"📊 Data size: {df_raw.shape}")
print(f"📅 Time range: {df_raw['iyear'].min()} - {df_raw['iyear'].max()}")

# PART 2: SETTING - THE INVESTIGATION SETUP
**Objective:** Introduce context, problem (Unknown data) and research questions.

## Background
The Global Terrorism Database contains over 181,000 terrorist attacks from 1970-2017. However, a large proportion of incidents have no attributed perpetrator (`gname` is "Unknown").

## Problem
How can we better understand these "Unknown" groups? Do they share common characteristics or belong to different "personas"?

## Research Questions
- How many behavioral clusters exist in the "Unknown" data?
- What are the characteristics of each group in terms of tactics, weapons, targets, and danger level?
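
One way to approach the first question is to compare candidate cluster counts with a silhouette score. The sketch below runs on synthetic, well-separated blobs rather than GTD data; the blob centers and the candidate range of 2 to 6 are illustrative assumptions, and real incident features are unlikely to separate this cleanly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 true groups (a stand-in for scaled GTD features)
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=0,
)

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest score is the best-supported cluster count
best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

A higher silhouette score means points sit closer to their own cluster than to neighboring ones, so the score can serve as a sanity check on a cluster count chosen for narrative reasons.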

# PART 3: PROBLEM - "CLEANING" THE CRIME SCENE
**Objective:** Perform Data Cleaning and Preprocessing. Illustrate differences before and after processing.

## 3.1 First Clue - "Contradictory Statements" (Owner: Khánh)
**Problem:** Logical error `nkill < nkillus` (Total deaths < US citizen deaths).
**Solution:** Check and fix logic.
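
A minimal sketch of how this inconsistency could be detected and repaired, using a toy frame instead of the real GTD file. The repair rule (raising `nkill` to match `nkillus`) is an assumption made for illustration; dropping the rows or reviewing them manually may be more appropriate.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for GTD records (values are illustrative only)
toy = pd.DataFrame({
    "nkill":   [3.0, 1.0, np.nan, 0.0],
    "nkillus": [1.0, 2.0, 1.0, 0.0],
})

# Flag rows where US deaths exceed total deaths.
# Comparisons involving NaN evaluate to False, so missing totals are not flagged.
bad = toy["nkill"] < toy["nkillus"]
print(f"Inconsistent rows: {bad.sum()}")

# Assumed repair rule: total deaths cannot be smaller than US deaths
fixed = toy.copy()
fixed.loc[bad, "nkill"] = fixed.loc[bad, "nkillus"]
```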

In [None]:
# [KHÁNH] - Check and fix nkill < nkillus error
# TODO: Code to fix nkill < nkillus logic error

# Placeholder code
df_clean = df_raw.copy()
print("⏳ Waiting for code from Khánh...")

In [None]:
# [KHÁNH] - Visual comparison Before-After for Slide 3
# TODO: Code to create comparison visual

print("⏳ Waiting for code from Khánh...")

## 3.2 Second Clue - "Secret Language of Missing Values" (Owner: Trang)
**Problem:** Inconsistent missing-value codes (-9, -99, NaN, "Unknown").
**Solution:** Standardize all to one consistent value (e.g., `np.nan` or "Unknown").
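
The standardization step can be sketched as follows on a toy frame. Treating every numeric column as eligible is a simplifying assumption; in practice the replacement should be limited to columns whose data dictionary entry defines -9 or -99 as "Unknown", and columns where a negative number could be a legitimate value (for example `longitude`) should be excluded.

```python
import numpy as np
import pandas as pd

# Toy rows mixing both sentinel codes with a text "Unknown"
toy = pd.DataFrame({
    "ishostkid": [1, 0, -9, 1],
    "nperps":    [2, -99, 5, -99],
    "gname":     ["Unknown", "Group A", "Unknown", "Group B"],
})

# Count sentinel codes per numeric column before replacing them
numeric_cols = toy.select_dtypes(include="number").columns
sentinel_counts = toy[numeric_cols].isin([-9, -99]).sum()
print(sentinel_counts)

# Replace the numeric sentinels with NaN; text columns are left untouched
standardized = toy.copy()
standardized[numeric_cols] = standardized[numeric_cols].replace([-9, -99], np.nan)
```

The per-column counts in `sentinel_counts` could also feed a statistics table of missing-value codes.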

In [None]:
# [TRANG] - Replace values -9, -99 with NaN
# TODO: Code to handle error codes

print("⏳ Waiting for code from Trang...")

In [None]:
# [TRANG] - Visual Word Cloud or error code statistics table for Slide 4
# TODO: Code to create visual

print("⏳ Waiting for code from Trang...")

## 3.3 Third Clue - "Forgotten Cases" (Owner: Khánh)
**Problem:** Too many missing values (NaN) in features.
**Solution:** Remove columns with excessively high missing rates (> threshold).
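
A short sketch of the missing-ratio calculation and column drop, on toy data. The cutoff `MISSING_THRESHOLD = 0.5` is a hypothetical value chosen for the example; the project's actual threshold is not specified here.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one lightly missing, one mostly missing
toy = pd.DataFrame({
    "iyear":     [1970, 1980, 1990, 2000],
    "nkill":     [1.0, np.nan, 0.0, 2.0],
    "ransomamt": [np.nan, np.nan, np.nan, 500.0],
})

MISSING_THRESHOLD = 0.5  # hypothetical cutoff, not taken from the project

# Fraction of missing values per column, highest first
missing_ratio = toy.isna().mean().sort_values(ascending=False)

# Drop every column whose missing fraction exceeds the cutoff
cols_to_drop = missing_ratio[missing_ratio > MISSING_THRESHOLD].index.tolist()
reduced = toy.drop(columns=cols_to_drop)
print(f"Columns before: {toy.shape[1]}, after: {reduced.shape[1]}")
```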

In [None]:
# [KHÁNH] - Calculate missing ratio and drop columns
# TODO: Code to handle missing values

print("⏳ Waiting for code from Khánh...")

In [None]:
# [KHÁNH] - Visual comparison of features count Before-After for Slide 5
# TODO: Code to create visual

print("⏳ Waiting for code from Khánh...")

## 3.4 Case Study "Cleaning" #1: nhostkid (Owner: TAnh)
**Problem:** The sentinel value -99 (Unknown) appears as a negative count in column `nhostkid` (number of hostages).
**Solution:** Convert negative values to `NaN`, or remove the affected rows.
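
As a sketch of the conversion option, the snippet below masks negative hostage counts to `NaN` on a toy column and records the counts before and after. It assumes that every negative value in `nhostkid` is the -99 "Unknown" code rather than a data-entry error of another kind.

```python
import numpy as np
import pandas as pd

# Toy hostage counts containing the -99 "Unknown" code
toy = pd.DataFrame({"nhostkid": [3.0, -99.0, np.nan, 12.0, -99.0]})

before_negative = int((toy["nhostkid"] < 0).sum())

cleaned = toy.copy()
# Series.mask sets entries to NaN wherever the condition is True
cleaned["nhostkid"] = cleaned["nhostkid"].mask(cleaned["nhostkid"] < 0)

after_negative = int((cleaned["nhostkid"] < 0).sum())
print(f"Negative values before: {before_negative}, after: {after_negative}")
```

Converting to `NaN` keeps the rows available for other features, whereas dropping them would shrink the dataset.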

In [None]:
# [TANH] - Handle nhostkid column
# TODO: Code to handle negative values

print("⏳ Waiting for code from TAnh...")

In [None]:
# [TANH] - Draw 2 histograms comparing Before-After for Slide 6
# TODO: Code to create visual

print("⏳ Waiting for code from TAnh...")

## 3.5 Case Study "Cleaning" #2: nperps (Owner: TAnh)
**Problem:** Negative values (-99 = Unknown) in column `nperps` (number of perpetrators).
**Solution:** Apply the same treatment as for `nhostkid`.

In [None]:
# [TANH] - Handle nperps column
# TODO: Code to handle negative values

print("⏳ Waiting for code from TAnh...")

In [None]:
# [TANH] - Draw 2 histograms comparing Before-After for Slide 7
# TODO: Code to create visual

print("⏳ Waiting for code from TAnh...")

## 3.6 Data Prep Summary
**Result:** Create clean dataset `df_clean` ready for modeling.

In [None]:
# Consolidate clean dataset
print("Clean dataset is ready!")
print(f"Size: {df_clean.shape}")
print(f"Number of columns: {df_clean.shape[1]}")

# PART 4: RESOLUTION - UNVEILING THE GHOST (CLUSTERING)
**Objective:** Cluster "Unknown" data to discover behavioral groups (Personas).

## 4.1 Feature Engineering & Selection (Owner: Khánh)
**Task:** Select important features for the model (AttackType, WeaponType, Region, Lethality...).
**Solution:** Encoding (One-hot/Label), Scaling (StandardScaler).
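
The encoding and scaling steps could look roughly like this on toy data. The feature choice (`attacktype1_txt`, `region_txt`, and a derived `casualties` column) is an assumption for illustration, and `pd.get_dummies` is used here in place of scikit-learn's `OneHotEncoder` for brevity.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy incidents with two categorical features and two casualty counts
toy = pd.DataFrame({
    "attacktype1_txt": ["Bombing/Explosion", "Armed Assault",
                        "Bombing/Explosion", "Assassination"],
    "region_txt": ["South Asia", "South Asia",
                   "Western Europe", "South America"],
    "nkill":  [0.0, 4.0, 1.0, 1.0],
    "nwound": [2.0, 6.0, 0.0, 0.0],
})

# Casualties as defined in the dataset overview: nkill + nwound
toy["casualties"] = toy["nkill"] + toy["nwound"]

# One-hot encode the categorical features (one 0/1 column per category)
encoded = pd.get_dummies(toy[["attacktype1_txt", "region_txt"]], dtype=float)

# Standardize the numeric feature to zero mean and unit variance
scaler = StandardScaler()
scaled_num = pd.DataFrame(
    scaler.fit_transform(toy[["casualties"]]),
    columns=["casualties_scaled"],
    index=toy.index,
)

# Final feature matrix for clustering
X = pd.concat([encoded, scaled_num], axis=1)
print(X.shape)
```

Scaling matters for K-Means because the algorithm relies on Euclidean distance, so an unscaled casualty count would otherwise dominate the 0/1 indicator columns.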

In [None]:
# [KHÁNH] - Select features and scale data
# TODO: Code feature engineering

print("⏳ Waiting for code from Khánh...")

## 4.2 First Failure - When the Machine is "Haunted" (Owner: Hanh)
**Purpose:** Demo clustering results on poorly processed data or wrong feature selection (for Slide 10).

In [None]:
# [HANH] - Run KMeans on raw/incorrect data
# TODO: Code bad clustering for demo

print("⏳ Waiting for code from Hanh...")

In [None]:
# [HANH] - Draw poor/overlapping visual for Slide 10
# TODO: Code to create visual

print("⏳ Waiting for code from Hanh...")

## 4.3 Turning Point - 5 "Faces" of the Enemy (Owner: Hanh)
**Purpose:** Run KMeans on clean data (`df_clean`).
**Configuration:** `n_clusters = 5`.
**Labeling:** Assign cluster labels (0-4) to original dataframe.
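
A sketch of the configuration and labeling steps on randomly generated toy data. The feature columns and the `random_state` and `n_init` values are assumptions for the example; only `n_clusters = 5` comes from the configuration above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated toy incidents (not GTD data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "gname": rng.choice(["Unknown", "Group A"], size=200),
    "nkill": rng.poisson(2, size=200).astype(float),
    "nwound": rng.poisson(5, size=200).astype(float),
    "suicide": rng.integers(0, 2, size=200),
})

# Cluster only the incidents with an unknown perpetrator
unknown = toy[toy["gname"] == "Unknown"].copy()
X = StandardScaler().fit_transform(unknown[["nkill", "nwound", "suicide"]])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
unknown["cluster"] = kmeans.fit_predict(X)

# Write labels (0-4) back to the full frame; attributed incidents stay NaN
toy.loc[unknown.index, "cluster"] = unknown["cluster"]
print(toy["cluster"].value_counts(dropna=False))
```

Assigning through the original index keeps the labels aligned with the source rows even though clustering ran on a filtered subset.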

In [None]:
# [HANH] - Run proper KMeans
# TODO: Code correct clustering with n_clusters=5

print("⏳ Waiting for code from Hanh...")

In [None]:
# [HANH] - Assign cluster labels to df
# TODO: Code to assign labels

print("⏳ Waiting for code from Hanh...")

# PART 5: INSIGHTS & VISUALIZATION (CLUSTER ANALYSIS)
**Objective:** Visualize characteristics of the 5 clusters just discovered.

**Definition of 5 Clusters (According to scenario):**
* Cluster 0: The Agitator (Amateur/Low-level)
* Cluster 1: The Guerrilla
* Cluster 2: The Specialist (Kidnapping)
* Cluster 3: The Mass-Casualty Bomber
* Cluster 4: The Assassin

## 5.1 Power Map: Who Dominates by Volume? (Slide 12 - Owner: Trang)
**Visual:** Treemap or Bar chart comparing incident volume of 5 clusters.
**Highlight:** Cluster 0 (Amateur) dominates.

In [None]:
# [TRANG] - Draw Treemap for Slide 12
# TODO: Code to create treemap

print("⏳ Waiting for code from Trang...")

## 5.2 Who is the Most Dangerous? (Slide 13 - Owner: Đông)
**Visual:** Bar chart comparing average lethality (casualties) of 5 clusters.
**Highlight:** Cluster 3 (Bomber) and Cluster 1 (Guerrilla).
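
The average-lethality comparison can be sketched with a simple group-by. The toy values below are invented for illustration and say nothing about the real clusters; the `cluster` column is assumed to hold the labels assigned in Part 4.

```python
import pandas as pd

# Invented incidents for three of the five clusters
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 3, 3],
    "nkill":   [0, 1, 4, 6, 20, 10],
    "nwound":  [1, 0, 6, 4, 30, 40],
})

# Casualties = nkill + nwound (a missing value in either column yields NaN)
toy["casualties"] = toy["nkill"] + toy["nwound"]

# Mean casualties per cluster, most lethal first
mean_lethality = (
    toy.groupby("cluster")["casualties"].mean().sort_values(ascending=False)
)
print(mean_lethality)
```

The resulting Series can be passed directly to a bar plot, for example with `mean_lethality.plot(kind="bar")`.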

In [None]:
# [ĐÔNG] - Calculate mean lethality by cluster
# TODO: Code to calculate

print("⏳ Waiting for code from Đông...")

In [None]:
# [ĐÔNG] - Draw Bar chart for Slide 13
# TODO: Code to create visual

print("⏳ Waiting for code from Đông...")

## 5.3 Tactical Fingerprint (Slide 14 - Owner: Hanh)
**Visual:** 100% Stacked Bar Chart showing AttackType proportions in each cluster.
**Highlight:** Specialization (Cluster 2 is dominated by kidnapping, Cluster 3 by bombing).
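
The proportions behind a 100% stacked bar chart can be computed with a row-normalized cross-tabulation. This is a sketch on invented rows; the `cluster` column is assumed to hold the labels from Part 4.

```python
import pandas as pd

# Invented incidents: cluster label and primary attack type
toy = pd.DataFrame({
    "cluster": [2, 2, 2, 3, 3, 0, 0, 0, 0],
    "attacktype1_txt": (
        ["Hostage Taking (Kidnapping)"] * 3
        + ["Bombing/Explosion"] * 2
        + ["Armed Assault", "Armed Assault", "Bombing/Explosion", "Assassination"]
    ),
})

# normalize="index" makes each row (cluster) sum to 1; scale to percent
shares = pd.crosstab(
    toy["cluster"], toy["attacktype1_txt"], normalize="index"
) * 100
print(shares.round(1))
```

Calling `shares.plot(kind="bar", stacked=True)` on the result would draw one bar per cluster, each reaching 100%.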

In [None]:
# [HANH] - Calculate AttackType proportions by cluster
# TODO: Code to calculate

print("⏳ Waiting for code from Hanh...")

In [None]:
# [HANH] - Draw Stacked Bar Chart for Slide 14
# TODO: Code to create visual

print("⏳ Waiting for code from Hanh...")

## 5.4 Operational Efficiency (Slide 15 - Owner: TAnh)
**Visual:** Area chart or Bar chart comparing Success Rate between clusters.
**Highlight:** Compare each cluster with the overall average success rate.
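
Because `success` is coded 1/0, its mean within a group is that group's success rate. The sketch below computes the rate per cluster and its gap to the overall rate on invented rows; the `cluster` column is again assumed to come from Part 4.

```python
import pandas as pd

# Invented incidents: cluster label and success flag (1 = Yes, 0 = No)
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1, 4, 4],
    "success": [1, 0, 1, 1, 1, 1, 0, 1],
})

# Overall baseline and per-cluster success rates
overall_rate = toy["success"].mean()
by_cluster = toy.groupby("cluster")["success"].mean()

# Positive gap = cluster succeeds more often than the overall average
gap = (by_cluster - overall_rate).rename("gap_vs_overall")
print(pd.concat([by_cluster.rename("success_rate"), gap], axis=1))
```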

In [None]:
# [TANH] - Calculate success rate by cluster
# TODO: Code to calculate

print("⏳ Waiting for code from TAnh...")

In [None]:
# [TANH] - Draw chart for Slide 15
# TODO: Code to create visual

print("⏳ Waiting for code from TAnh...")

## 5.5 Territory Map (Slide 16 - Owner: Trang - Optional)
**Visual:** 100% Stacked Bar Chart showing Region distribution of each cluster.

In [None]:
# [TRANG] - Draw geographic distribution chart for Slide 16
# TODO: Code to create visual (optional)

print("⏳ Waiting for code from Trang (optional)...")