# Launch @ Forge — Data Science Challenge (Police Killings Dataset)

**Goal:**
Understand how socioeconomic and demographic factors relate to fatal police encounters in the U.S. (2015).

**Questions to answer:**

1. What are the affected demographic groups?
2. In which areas (low-income/higher-poverty) are the incidents concentrated?
3. Do certain states show higher per-capita rates?
4. If time allows: does “armed vs unarmed” correlate with any tract-level features?

**Framework & Tools:** Python (pandas, numpy, seaborn, matplotlib, plotly)

**1. Load and preview the data**

In [5]:
# setup & load data

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# define necessary paths
BASE = Path("..").resolve()
RAW = BASE / "data" / "raw"
PROC = BASE / "data" / "processed"
FIGS = BASE / "visualizations" / "figures"

# create folders if they don't exist
PROC.mkdir(parents=True, exist_ok=True)
FIGS.mkdir(parents=True, exist_ok=True)

# load dataset
df = pd.read_csv(RAW / "police_killings.csv", encoding="latin1")

# quick check
display(df.head(3))
df.info()

| | name | age | gender | raceethnicity | month | day | year | streetaddress | city | state | ... | share_hispanic | p_income | h_income | county_income | comp_income | county_bucket | nat_bucket | pov | urate | college |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A'donte Washington | 16 | Male | Black | February | 23 | 2015 | Clearview Ln | Millbrook | AL | ... | 5.6 | 28375 | 51367.0 | 54766 | 0.937936 | 3.0 | 3.0 | 14.1 | 0.097686 | 0.16851 |
| 1 | Aaron Rutledge | 27 | Male | White | April | 2 | 2015 | 300 block Iris Park Dr | Pineville | LA | ... | 0.5 | 14678 | 27972.0 | 40930 | 0.683411 | 2.0 | 1.0 | 28.8 | 0.065724 | 0.111402 |
| 2 | Aaron Siler | 26 | Male | White | March | 14 | 2015 | 22nd Ave and 56th St | Kenosha | WI | ... | 16.8 | 25286 | 45365.0 | 54930 | 0.825869 | 2.0 | 3.0 | 14.6 | 0.166293 | 0.147312 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467 entries, 0 to 466
Data columns (total 34 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  467 non-null    object 
 1   age                   467 non-null    object 
 2   gender                467 non-null    object 
 3   raceethnicity         467 non-null    object 
 4   month                 467 non-null    object 
 5   day                   467 non-null    int64  
 6   year                  467 non-null    int64  
 7   streetaddress         463 non-null    object 
 8   city                  467 non-null    object 
 9   state                 467 non-null    object 
 10  latitude              467 non-null    float64
 11  longitude             467 non-null    float64
 12  state_fp              467 non-null    int64  
 13  county_fp             467 non-null    int64  
 14  tract_ce              467 non-null    int64  
 15  geo_id                4

**2. Inspect and audit the data**

In [14]:
# initial audit
print(f"Rows: {df.shape[0]} - Columns: {df.shape[1]}")

# check for missing values
missing = df.isna().mean().sort_values(ascending=False) * 100
display(missing.head(15).to_frame("Missing percentage"))

# check for duplicate records
dupes = df.duplicated(subset=["name", "city", "state"]).sum()
print(f"Duplicate rows by (name, city, state): {dupes}")

# categorical previews
for c in ["raceethnicity", "gender", "armed", "cause", "state"]:
    if c in df.columns:
        print(f"\n{c.upper()} unique values:")
        print(df[c].value_counts(dropna=False).head(10))

Rows: 467 - Columns: 34


| | Missing percentage |
|---|---|
| county_bucket | 5.781585 |
| streetaddress | 0.856531 |
| college | 0.428266 |
| urate | 0.428266 |
| nat_bucket | 0.428266 |
| comp_income | 0.428266 |
| h_income | 0.428266 |
| share_black | 0.0 |
| armed | 0.0 |
| pop | 0.0 |


Duplicate rows by (name, city, state): 0

RACEETHNICITY unique values:
raceethnicity
White                     236
Black                     135
Hispanic/Latino            67
Unknown                    15
Asian/Pacific Islander     10
Native American             4
Name: count, dtype: int64

GENDER unique values:
gender
Male      445
Female     22
Name: count, dtype: int64

ARMED unique values:
armed
Firearm               230
No                    102
Knife                  68
Other                  26
Vehicle                18
Non-lethal firearm     14
Unknown                 7
Disputed                2
Name: count, dtype: int64

CAUSE unique values:
cause
Gunshot              411
Taser                 27
Death in custody      14
Struck by vehicle     12
Unknown                3
Name: count, dtype: int64

STATE unique values:
state
CA    74
TX    46
FL    29
AZ    25
OK    22
GA    16
NY    14
CO    12
WA    11
LA    11
Name: count, dtype: int64


## Initial observations

- The dataset has **467 records**, each with **34 columns** describing a fatal police encounter.

- Missing data is limited to seven columns (`county_bucket`, `streetaddress`, `college`, `urate`, `nat_bucket`, `comp_income`, and `h_income`), each with less than 6% missing. **No duplicates** were found on (`name`, `city`, `state`).

- The fields fall into three groups: demographic data (age, gender, race/ethnicity), geographic identifiers (city, state, county, tract), and socioeconomic variables (income, poverty, unemployment, education).

- `age` is stored as an *object* (string) and will need to be converted to a numeric type.

- The **race/ethnicity** column has six categories. Most victims are **White (236)** or **Black (135)**, followed by **Hispanic/Latino (67)**.

- **Gender** is overwhelmingly male (445 / 467 ≈ 95%).

- The **armed** field lists categories such as "Firearm", "Knife", and "No", which can later be grouped into "Armed" vs "Unarmed".

- **State distribution**: the highest counts are in **CA (74)** and **TX (46)**, which is expected given their population size; per-capita rates still need to be analyzed.

In general, the data is relatively clean but requires:
  - Numeric conversions (`age`, income and poverty columns, etc.)
  - Standardization of categories (uniform race/ethnicity labels)
  - Removal or merging of unused columns (such as `streetaddress`)
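
The per-capita comparison flagged in the state-distribution note could be sketched roughly as follows. This is only an illustration, not the notebook's method: `per_capita_rate` is a hypothetical helper, the state codes are invented, and the population figures are made-up round numbers standing in for real state population estimates, which are not part of this dataset.

```python
import pandas as pd


def per_capita_rate(counts: pd.Series, population: pd.Series,
                    per: int = 1_000_000) -> pd.Series:
    """Return incidents per `per` residents, highest first.

    Hypothetical helper: both Series are assumed to be indexed by state code.
    """
    # Index alignment leaves NaN for states missing from either Series
    rate = counts * per / population
    return rate.dropna().sort_values(ascending=False)


# Toy counts for two invented states (not taken from the dataset)
toy_counts = pd.Series({"AA": 50, "BB": 20})
# Placeholder populations (NOT real census figures)
toy_pop = pd.Series({"AA": 10_000_000, "BB": 2_000_000})

rates = per_capita_rate(toy_counts, toy_pop)
print(rates)
```

With the notebook's data, `counts` would presumably be `df["state"].value_counts()`, and `population` would have to come from an external source such as census estimates. In the toy example the state with fewer incidents ends up with the higher rate, which is why raw counts alone can mislead.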

**Next Step**: Start data cleaning: normalize column names, coerce numeric types, and prepare the data for exploratory visualizations.
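
A rough sketch of two of these cleaning steps is shown below. It rests on assumptions that the notebook has not made yet: unparseable `age` entries become missing values, only the `armed` value "No" counts as unarmed, and "Unknown"/"Disputed" are kept apart. The toy frame stands in for `df`, and `group_armed`, `age_num`, and `armed_group` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for df; the armed labels come from the audit output above,
# the "Unknown" age entry is an assumed example of a non-numeric value
toy = pd.DataFrame({
    "age": ["16", "27", "Unknown"],
    "armed": ["Firearm", "No", "Unknown"],
})

# Coerce age to numeric; anything that cannot be parsed becomes NaN
toy["age_num"] = pd.to_numeric(toy["age"], errors="coerce")


def group_armed(value: str) -> str:
    """Collapse the armed categories (assumed mapping, see lead-in)."""
    if value == "No":
        return "Unarmed"
    if value in {"Unknown", "Disputed"}:
        return "Undetermined"
    return "Armed"


toy["armed_group"] = toy["armed"].map(group_armed)

print(toy)
```

Keeping an "Undetermined" bucket avoids silently counting unknown or disputed cases as armed; whether categories such as "Non-lethal firearm" or "Vehicle" belong under "Armed" is a judgment call that should be stated in the analysis.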