# Cincinnati Crash Data Analysis

## Methodology
This notebook analyzes traffic crash data from the City of Cincinnati.
1. **Data Collection**: Data is pulled directly from the Open Data Portal API.
2. **Data Cleaning**: 
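   - Dates are converted to datetime objects.
   - Records are restricted to those reported in 2024 or later to keep the dataset manageable.
   - Rows with missing location data are dropped to ensure spatial accuracy.
   - Coordinates are filtered to latitudes between 39.0 and 39.3, a rough range for Cincinnati, which also removes placeholder points such as (0, 0).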
3. **Analysis**: We aggregate crashes by hour of day and map crash locations using GeoPandas, an open-source Python library that covers the same kind of workflow as ArcGIS.
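
The cleaning and aggregation steps above can be tried without downloading anything. The following is a minimal sketch, not the notebook's actual pipeline: the rows are invented for illustration, and `clean_crashes` is a hypothetical helper name. Only the column names and the latitude bounds are taken from this notebook.

```python
import pandas as pd

# Invented sample rows; column names match the fields used in this notebook.
sample = pd.DataFrame({
    "DATECRASHREPORTED": [
        "2024-03-01 08:15:00",
        "2024-03-01 17:40:00",
        "2024-03-02 17:05:00",
        "2024-03-03 23:30:00",
    ],
    "LATITUDE_X": [39.10, None, 39.15, 0.0],
    "LONGITUDE_X": [-84.51, -84.50, -84.48, 0.0],
})


def clean_crashes(df, lat_min=39.0, lat_max=39.3):
    """Hypothetical helper mirroring the cleaning steps listed above."""
    out = df.copy()
    # Step 1: parse the report timestamp.
    out["DATECRASHREPORTED"] = pd.to_datetime(out["DATECRASHREPORTED"])
    # Step 2: drop rows that lack either coordinate.
    out = out.dropna(subset=["LATITUDE_X", "LONGITUDE_X"])
    # Step 3: keep latitudes strictly inside the assumed bounds.
    in_bounds = (out["LATITUDE_X"] > lat_min) & (out["LATITUDE_X"] < lat_max)
    return out[in_bounds]


cleaned = clean_crashes(sample)

# Temporal aggregation: number of crash reports per hour of day.
hourly_counts = cleaned["DATECRASHREPORTED"].dt.hour.value_counts().sort_index()

print(f"Sample rows: {len(sample)}, rows kept: {len(cleaned)}")
print(hourly_counts)
```

In this sample, the row with a missing latitude and the (0, 0) placeholder row are both removed. A longitude check could be added in the same way, but the bounds would need to be chosen for the study area.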

In [None]:
# requirements.txt: pandas, matplotlib, geopandas, shapely

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

# --- 1. Data Collection (Direct Download) ---
# Read the CSV export straight from the Cincinnati Open Data Portal (requires network access)
url = "https://data.cincinnati-oh.gov/api/views/rvmt-pkmq/rows.csv?accessType=DOWNLOAD"
print("Downloading data...")
df = pd.read_csv(url)

# --- 2. Data Cleaning (Crucial Step) ---
# Each cleaning step is commented so the record counts printed below can be traced back to it.

# Convert date column to datetime objects
df['DATECRASHREPORTED'] = pd.to_datetime(df['DATECRASHREPORTED'])

# Keep only crashes reported in 2024 or later to make the file manageable
df_recent = df[df['DATECRASHREPORTED'].dt.year >= 2024].copy()

# Drop rows with missing location data (Common issue in crash data)
df_clean = df_recent.dropna(subset=['LATITUDE_X', 'LONGITUDE_X'])

# Filter out bad coordinates: keep latitudes between 39.0 and 39.3 (rough Cincinnati range).
# This also removes placeholder points such as (0, 0). Longitude is not checked here.
df_clean = df_clean[
    (df_clean['LATITUDE_X'] > 39.0) & (df_clean['LATITUDE_X'] < 39.3)
]

print(f"Original Records: {len(df)}")
print(f"Cleaned Records: {len(df_clean)}")

# --- 3. Analysis: Temporal Trends ---
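# Count crashes per hour of day. This uses the report timestamp (DATECRASHREPORTED),
# which may differ from the time the crash actually occurred.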
crashes_by_hour = df_clean['DATECRASHREPORTED'].dt.hour.value_counts().sort_index()

plt.figure(figsize=(10, 5))
crashes_by_hour.plot(kind='bar', color='#1f77b4')
plt.title("2024-2025 Cincinnati Crashes by Hour of Day")
plt.xlabel("Hour (24h)")
plt.ylabel("Number of Crashes")
plt.grid(axis='y', alpha=0.3)
plt.show()

# --- 4. Spatial Data (The "ArcGIS" Equivalent) ---
# Convert the DataFrame to a GeoDataFrame of points (x = longitude, y = latitude) in WGS 84 (EPSG:4326)
geometry = [Point(xy) for xy in zip(df_clean['LONGITUDE_X'], df_clean['LATITUDE_X'])]
gdf = gpd.GeoDataFrame(df_clean, geometry=geometry, crs="EPSG:4326")

# Simple map plot
fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, markersize=1, color='red', alpha=0.5, label='Crash Locations')
plt.title("Spatial Distribution of Crashes in Cincinnati")
plt.legend()
plt.show()

# --- 5. Export for "Future Updates" ---
# Save the cleaned points as GeoJSON so they can be reloaded or imported into ArcGIS later
gdf.to_file("cincinnati_crashes_cleaned.geojson", driver='GeoJSON')
print("Processed spatial data saved for ArcGIS usage.")