# Urban Crime Analytics & Geospatial Intelligence System

## 1. Research Topic and Background
### 1.1. Introduction
This study examines arrest patterns across two major American metropolitan police departments, the New York Police Department (NYPD) and the Los Angeles Police Department (LAPD), from 2010 to 2019. Understanding temporal and spatial crime patterns is crucial for effective law enforcement resource allocation, community safety initiatives, and evidence-based policy development. Recent research by David Weisburd et al. [1] has highlighted the importance of examining micro-geographic concentrations of crime, while John MacDonald [3] demonstrated how policy changes have influenced arrest patterns in major urban centers. This analysis builds on these works by comparing how arrest patterns manifest differently in America's two largest cities.

### 1.2. Research Questions
- How do temporal arrest patterns differ between NYPD and LAPD, and what might explain these differences?
- What geographic clustering patterns emerge in arrests, and how do these reflect the different urban structures of New York and Los Angeles?
- How have arrest patterns evolved over the 2010-2019 period, and what policy implications might this suggest?

### 1.3. Key Criminological Concepts
1. **Temporal crime patterns**: Cyclical variations in criminal activity based on time of day, day of week, month, or year
2. **Crime hotspots**: Geographic areas with disproportionately high concentrations of criminal activity
3. **Enforcement discretion**: The latitude officers have in deciding whether to make arrests for certain offenses
4. **Broken windows policing**: Enforcement strategy targeting minor offenses to prevent more serious crime
5. **Density-crime relationship**: Theoretical frameworks linking population density to crime rates and patterns
6. **Enforcement density**: The concentration of police resources relative to population and geography

### 1.4. Data Sources
The datasets used in this analysis come from publicly available police arrest records from the NYPD and LAPD, standardized to allow for direct comparison. The NYPD dataset (https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u/about_data) contains more than 5 million records, of which 3,414,946 are relevant to our analysis, while the LAPD dataset (https://data.lacity.org/Public-Safety/Arrest-Data-from-2010-to-2019/yru6-6re4/about_data) contains 1,320,817 records over the same period.

These data were originally collected for administrative and operational purposes by each department, then released as part of open data initiatives to increase transparency in policing. While useful for research, these datasets reflect enforcement actions rather than actual crime rates, a crucial distinction when interpreting patterns.

### 1.5. Ethical and Legal Considerations
This analysis uses de-identified, aggregated data to protect individual privacy. However, historical biases in policing that have disproportionately affected minority communities must be acknowledged when interpreting these results. As Sampson and Loeffler [4] note, arrest concentrations often reflect a complex interplay of actual crime patterns, policy decisions, and potential enforcement biases.

### 1.6. Data Validity Assessment
While these datasets are comprehensive, several potential limitations exist:
1. Arrest data may reflect enforcement priorities rather than actual crime rates
2. Geocoding accuracy varies, particularly in densely populated areas
3. Cross-city comparisons require careful interpretation due to differences in legal codes, department policies, and reporting practices
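
The geocoding limitation can be screened for with a simple plausibility check. The sketch below flags records whose coordinates are missing or fall outside a rough bounding box for each city. The box values are approximate assumptions chosen for illustration, not official boundaries, and `is_plausible_location` is a hypothetical helper rather than part of the project code.

```python
# Approximate bounding boxes (assumed values for illustration, not official boundaries).
CITY_BOUNDS = {
    "NYPD": {"lat": (40.4, 41.0), "lon": (-74.3, -73.6)},
    "LAPD": {"lat": (33.6, 34.4), "lon": (-118.7, -118.1)},
}


def is_plausible_location(source, lat, lon):
    """Return True if the coordinate is present and inside the city's rough bounding box."""
    if lat is None or lon is None:
        return False
    lat_min, lat_max = CITY_BOUNDS[source]["lat"]
    lon_min, lon_max = CITY_BOUNDS[source]["lon"]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


records = [
    ("NYPD", 40.7128, -74.0060),   # lower Manhattan
    ("NYPD", 0.0, 0.0),            # zero coordinates, a typical sign of a failed geocode
    ("LAPD", 34.0522, -118.2437),  # downtown Los Angeles
    ("LAPD", None, None),          # missing coordinates
]
flags = [is_plausible_location(*record) for record in records]
print(flags)
```

A bounding box is deliberately coarse: it catches placeholder or missing coordinates but not points that are misplaced by a few blocks inside the city.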

In [None]:
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt

# Add the project root to the python path to import src modules
sys.path.append(os.path.abspath('..'))

from src.io import load_data
from src.standardize import rename_nypd_columns, rename_lapd_columns, clean_nypd_data, clean_lapd_data
from src.data_processing import (
    standardize_age_categories, 
    standardize_gender, 
    standardize_race_ethnicity, 
    standardize_offense_categories,
    create_aligned_datasets,
    filter_datasets_by_year_range
)
import src.visualization as viz

# Define paths
DATA_DIR = '../data'
NYPD_PATH = os.path.join(DATA_DIR, 'nypd_aligned.csv')
LAPD_PATH = os.path.join(DATA_DIR, 'lapd_aligned.csv')

# Load data
print("Loading datasets...")
nypd_df = load_data(NYPD_PATH)
lapd_df = load_data(LAPD_PATH)

nypd_final = None
lapd_final = None

if nypd_df is not None and lapd_df is not None:
    # Check if data is already aligned (has standardized columns)
    required_aligned_cols = {'Data_Source', 'Arrest_Year', 'Offense_Std'}
    is_nypd_aligned = required_aligned_cols.issubset(nypd_df.columns)
    is_lapd_aligned = required_aligned_cols.issubset(lapd_df.columns)

    if is_nypd_aligned and is_lapd_aligned:
        print("Data is already aligned. Skipping ETL pipeline.")
        nypd_final, lapd_final = filter_datasets_by_year_range(nypd_df, lapd_df)
    else:
        print("Running ETL pipeline on raw data...")
        # 1. Rename columns to standard names
        nypd_df = rename_nypd_columns(nypd_df)
        lapd_df = rename_lapd_columns(lapd_df)

        # 2. Clean individual datasets
        nypd_df = clean_nypd_data(nypd_df)
        lapd_df = clean_lapd_data(lapd_df)

        # 3. Standardize categories across datasets
        nypd_df, lapd_df = standardize_age_categories(nypd_df, lapd_df)
        nypd_df, lapd_df = standardize_gender(nypd_df, lapd_df)
        nypd_df, lapd_df = standardize_race_ethnicity(nypd_df, lapd_df)
        nypd_df, lapd_df = standardize_offense_categories(nypd_df, lapd_df)

        # 4. Create aligned datasets with common columns
        nypd_aligned, lapd_aligned, common_cols = create_aligned_datasets(nypd_df, lapd_df)

        # 5. Filter to overlapping years
        nypd_final, lapd_final = filter_datasets_by_year_range(nypd_aligned, lapd_aligned)
    
    print(f"Final NYPD shape: {nypd_final.shape}")
    print(f"Final LAPD shape: {lapd_final.shape}")
else:
    print("Failed to load one or both datasets.")
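
The cell above relies on `filter_datasets_by_year_range`, whose implementation lives in `src.data_processing` and is not shown here. As a rough illustration of what "filter to overlapping years" could mean, the following plain-Python sketch finds the year range covered by both datasets and keeps only rows inside it. The helper names and sample rows are hypothetical, and the real function may behave differently.

```python
REQUIRED_ALIGNED_COLS = {"Data_Source", "Arrest_Year", "Offense_Std"}


def is_aligned(columns):
    """True if every standardized column is present (same idea as the issubset check)."""
    return REQUIRED_ALIGNED_COLS.issubset(columns)


def overlapping_years(years_a, years_b):
    """Return the inclusive (start, end) range covered by both inputs, or None."""
    if not years_a or not years_b:
        return None
    start = max(min(years_a), min(years_b))
    end = min(max(years_a), max(years_b))
    return (start, end) if start <= end else None


def filter_to_overlap(rows_a, rows_b, year_key="Arrest_Year"):
    """Keep only rows whose year falls inside the range shared by both datasets."""
    span = overlapping_years(
        [row[year_key] for row in rows_a],
        [row[year_key] for row in rows_b],
    )
    if span is None:
        return [], []
    start, end = span

    def keep(rows):
        return [row for row in rows if start <= row[year_key] <= end]

    return keep(rows_a), keep(rows_b)


# Tiny invented samples: the first dataset starts earlier than the second.
nypd_rows = [{"Arrest_Year": year} for year in (2008, 2010, 2015, 2019)]
lapd_rows = [{"Arrest_Year": year} for year in (2010, 2012, 2019)]
nypd_kept, lapd_kept = filter_to_overlap(nypd_rows, lapd_rows)
print(len(nypd_kept), len(lapd_kept))
```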

## 3.1. Temporal Crime Pattern
The primary goal of this visualization is to identify how population density might influence arrest patterns in these contrasting urban environments. New York City, with its extremely high population density, consistently shows significantly higher arrest volumes than Los Angeles, despite Los Angeles's larger geographic area. This visualization helps explore whether the disparity is merely a function of overall population or whether density itself shapes arrest dynamics.

Additional aims of this visualization include:
- Identifying whether temporal crime patterns are universal across different urban environments or are influenced by local density factors
- Examining how arrest patterns have evolved over a decade in both cities
- Detecting specific time periods where enforcement activities peak in each jurisdiction
- Providing visual evidence for how urban structure (compact vs. sprawling) might influence not just the volume but also the timing of arrests
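
The temporal aims listed above all rest on the same basic operation: counting arrests per year, per month, and per day of week. The sketch below shows that aggregation in plain Python on a handful of invented dates. It is only an illustration of the idea; the actual plotting logic sits in `viz.create_temporal_analysis_plot` and is not reproduced here.

```python
from collections import Counter
from datetime import date

WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def temporal_profile(arrest_dates):
    """Count arrests by year, by calendar month, and by day of week."""
    return {
        "year": Counter(d.year for d in arrest_dates),
        "month": Counter(d.month for d in arrest_dates),
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        "weekday": Counter(WEEKDAY_NAMES[d.weekday()] for d in arrest_dates),
    }


# Invented arrest dates for illustration only.
sample_dates = [
    date(2010, 1, 4),
    date(2010, 1, 8),
    date(2015, 7, 4),
    date(2019, 12, 31),
]
profile = temporal_profile(sample_dates)
print(dict(profile["year"]))
print(dict(profile["weekday"]))
```

Computing one profile per department and dividing each count by that department's total makes the shapes comparable even though the raw volumes differ.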

In [None]:
if nypd_final is not None and lapd_final is not None:
    fig = viz.create_temporal_analysis_plot(nypd_final, lapd_final)
    plt.show()

## 3.2. Map Visualization
This visualization explores the geographic distribution and intensity of arrests in America's two largest cities, New York City and Los Angeles. New York City is the most populous city in the United States with 8.8 million residents (2020 census) concentrated in just 302 square miles, creating an exceptional population density of 29,302 people per square mile. In contrast, Los Angeles County is the most populous county in the United States with 9.8 million residents spread across 4,751 square miles, resulting in a much lower population density of 2,063 people per square mile. The City of Los Angeles itself, which is the LAPD's jurisdiction, contains 3.9 million residents across 503 square miles, a density of 7,754 people per square mile, still only about a quarter of NYC's density.
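
The density comparison can be checked with a few lines of arithmetic. The sketch below recomputes density from the rounded population and area figures quoted in this section. Because those inputs are rounded, the results differ slightly from the quoted densities (for example, roughly 29,100 rather than 29,302 for New York City), which presumably rest on unrounded inputs; the "about a quarter" relationship holds either way.

```python
# Rounded population and land-area figures as quoted in the text.
figures = {
    "New York City": (8_800_000, 302),
    "Los Angeles County": (9_800_000, 4_751),
    "City of Los Angeles": (3_900_000, 503),
}

# People per square mile for each jurisdiction.
density = {name: pop / area for name, (pop, area) in figures.items()}

city_ratio = density["City of Los Angeles"] / density["New York City"]
county_ratio = density["Los Angeles County"] / density["New York City"]

for name, value in density.items():
    print(f"{name}: {value:,.0f} people per square mile")
print(f"City of LA density relative to NYC: {city_ratio:.2f}")
print(f"LA County density relative to NYC: {county_ratio:.2f}")
```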

Building on our previous temporal analysis, which revealed significant differences in arrest volumes despite Los Angeles's larger geographic area, these heat maps illustrate how these dramatic population density differences and contrasting urban structures shape the spatial distribution of arrests.

The primary goal of this visualization is to demonstrate the "density effect" in arrest activity. While our earlier analysis showed that NYPD consistently records about 2.6 times as many arrests as LAPD (3,414,946 versus 1,320,817 records over the decade), these maps suggest one explanation: arrests in New York concentrate intensely in specific areas because of the city's compact geography and high population density, while Los Angeles's sprawling urban landscape disperses them across a much wider area.
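
The 2.6 figure follows directly from the record counts reported in Section 1.4. A short check, assuming both counts cover the same ten years (2010 through 2019):

```python
# Record counts as reported in the Data Sources section.
nypd_records = 3_414_946
lapd_records = 1_320_817
years = 10  # 2010 through 2019 inclusive

arrest_ratio = nypd_records / lapd_records
nypd_per_year = nypd_records / years
lapd_per_year = lapd_records / years

print(f"NYPD to LAPD arrest ratio: {arrest_ratio:.2f}")
print(f"Average arrests per year: NYPD {nypd_per_year:,.0f}, LAPD {lapd_per_year:,.0f}")
```

This is a ratio of totals, so it says nothing about whether the gap was stable from year to year; that requires the yearly breakdown from the temporal analysis.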

Additional aims of this visualization include:
- Identifying how different urban designs influence crime clustering patterns
- Examining the relationship between transportation infrastructure and crime hotspots
- Visualizing the stark contrast in crime density between a transit-oriented city (NYC) and a car-dependent city (LA)
- Providing visual evidence for why per-square-mile crime rates differ so dramatically between these cities
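
Raw arrest counts are hard to compare across cities of different size, so it helps to normalize them. The sketch below converts the decade totals into average yearly arrests per 1,000 residents and per square mile. It assumes the 2020 population and area figures quoted above apply unchanged across 2010 to 2019, which is a simplification, and `arrest_rates` is a hypothetical helper rather than a project function.

```python
def arrest_rates(total_arrests, years, population, area_sq_mi):
    """Return average yearly arrests per 1,000 residents and per square mile."""
    yearly = total_arrests / years
    return {
        "per_1000_residents": yearly / population * 1_000,
        "per_sq_mile": yearly / area_sq_mi,
    }


# Totals from the Data Sources section; population and area as quoted in this section.
nypd = arrest_rates(3_414_946, 10, 8_800_000, 302)
lapd = arrest_rates(1_320_817, 10, 3_900_000, 503)  # City of Los Angeles figures

area_gap = nypd["per_sq_mile"] / lapd["per_sq_mile"]
capita_gap = nypd["per_1000_residents"] / lapd["per_1000_residents"]

print(f"Per square mile: NYPD {nypd['per_sq_mile']:,.0f}, LAPD {lapd['per_sq_mile']:,.0f}")
print(f"Per 1,000 residents: NYPD {nypd['per_1000_residents']:.1f}, LAPD {lapd['per_1000_residents']:.1f}")
print(f"Area gap: {area_gap:.1f}x, per-capita gap: {capita_gap:.2f}x")
```

Under these assumptions the per-square-mile gap is far larger than the per-capita gap, which is consistent with density, rather than population alone, driving much of the visual contrast between the two maps.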

In [None]:
if nypd_final is not None and lapd_final is not None:
    fig = viz.create_crime_density_comparison(nypd_final, lapd_final)
    plt.show()

## 3.3. Categorical Data Visualization
This visualization examines the demographic patterns of arrests across NYC and LA, America's two most populous urban centers with dramatically different population densities and urban structures. Building on our previous temporal and spatial analyses, which revealed both higher arrest volumes and more concentrated hotspots in dense NYC compared to sprawling LA, this demographic breakdown explores whether these contrasting urban environments also produce different patterns in who gets arrested.

The primary goal of this visualization is to identify how population density and urban structure might influence not just where and when arrests occur, but who experiences them. While our previous analyses showed that NYPD makes more arrests than LAPD, both per capita and per square mile, these charts help us understand whether the demographic composition of those arrests differs in ways that might reflect each city's unique population distribution, enforcement priorities, or community-police dynamics.

Additional aims of this visualization include:
- Examining racial disparities in arrests between high-density and lower-density urban environments
- Identifying similarities and differences in age and gender patterns across these contrasting cities
- Comparing offense type distributions to understand different enforcement priorities or criminal opportunities in dense versus sprawling urban settings
- Providing context for how demographic factors intersect with the spatial patterns observed in our crime density maps
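
Because the two departments record very different arrest totals, demographic comparisons are more meaningful as within-city shares than as raw counts. The sketch below shows that conversion on invented category labels. The values are placeholders, not results from the datasets, and `category_shares` is a hypothetical helper, not part of `viz.create_demographic_dashboard`.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total, so cities of different size are comparable."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


# Invented labels for illustration only; real values come from the standardized columns.
nypd_gender = ["M"] * 8 + ["F"] * 2
lapd_gender = ["M"] * 3 + ["F"] * 1

nypd_shares = category_shares(nypd_gender)
lapd_shares = category_shares(lapd_gender)

# Percentage-point difference in each category's share (NYPD minus LAPD).
share_gap = {
    category: nypd_shares.get(category, 0.0) - lapd_shares.get(category, 0.0)
    for category in sorted(set(nypd_shares) | set(lapd_shares))
}
print(nypd_shares, lapd_shares, share_gap)
```

The same helper applies to age group, race and ethnicity, or offense category, as long as both datasets use the same standardized labels.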

In [None]:
if nypd_final is not None and lapd_final is not None:
    fig = viz.create_demographic_dashboard(nypd_final, lapd_final)
    plt.show()