# Data Preparation: San Francisco SFPD Incidents

**Project:** A Tale of Two Cities - Comparative Public Safety Analysis

**Purpose:** This notebook handles data loading, cleaning, preprocessing, and feature engineering for the San Francisco dataset.

**Output:** Clean, analysis-ready dataset saved to `data/processed/sf_incidents_cleaned.csv`

---

## 1. Import Libraries

In [60]:
import pandas as pd
from pathlib import Path

## 2. Data Loading & Initial Inspection

We begin by loading the SFPD incident dataset and examining its structure, dimensions, and key features.

In [61]:
# Load CSV file into a DataFrame
file_path = '../data/raw/sfpd_incidents.csv'
df = pd.read_csv(file_path)

print(f"Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")
df.head()


Dataset loaded: 974,157 rows, 29 columns


,Row ID,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Incident ID,Incident Number,CAD Number,...,CNN,Police District,Analysis Neighborhood,Supervisor District,Supervisor District 2012,Latitude,Longitude,Point,data_as_of,data_loaded_at
0,150750507041,2025/08/26 11:17:00 PM,2025/08/26,23:17,2025,Tuesday,2025/08/26 11:17:00 PM,1507505,250333102,,...,,Out of SF,,,,,,,2025/08/28 09:38:07 AM,2025/08/29 09:53:03 AM
1,150752104134,2025/08/27 12:37:00 AM,2025/08/27,00:37,2025,Wednesday,2025/08/27 12:37:00 AM,1507521,250479881,252390049.0,...,33557000.0,Park,Lone Mountain/USF,1.0,1.0,37.780415,-122.449013,POINT (-122.449012756 37.780414581),2025/08/28 09:38:07 AM,2025/08/29 09:53:03 AM
2,150762309027,2025/07/17 03:00:00 PM,2025/07/17,15:00,2025,Thursday,2025/08/27 11:55:00 AM,1507623,250480775,252391585.0,...,26469000.0,Park,Lone Mountain/USF,1.0,1.0,37.775177,-122.451355,POINT (-122.45135498 37.775177002),2025/08/28 09:38:07 AM,2025/08/29 09:53:03 AM
3,150740506244,2025/08/23 09:30:00 PM,2025/08/23,21:30,2025,Saturday,2025/08/24 02:53:00 PM,1507405,256091227,,...,25905000.0,Northern,Hayes Valley,6.0,5.0,37.774551,-122.422501,POINT (-122.42250061 37.774551392),2025/08/27 09:38:07 AM,2025/08/28 09:53:00 AM
4,150723571000,2025/08/15 12:00:00 PM,2025/08/15,12:00,2025,Friday,2025/08/24 07:10:00 PM,1507235,256090348,,...,26412000.0,Park,Haight Ashbury,5.0,5.0,37.769661,-122.449646,POINT (-122.449645996 37.76966095),2025/08/27 09:38:07 AM,2025/08/28 09:53:00 AM
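
The same loading step can pin column types up front instead of leaving them to inference. The following is a minimal sketch, not the notebook's loading code: the three-row in-memory CSV is a hypothetical stand-in for the raw file, and the dtype choices (nullable `string` for `CAD Number`, `category` for `Police District`) are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for ../data/raw/sfpd_incidents.csv
sample_csv = io.StringIO(
    "Row ID,Incident Date,Incident Time,CAD Number,Police District\n"
    "150750507041,2025/08/26,23:17,,Out of SF\n"
    "150752104134,2025/08/27,00:37,252390049,Park\n"
    "150762309027,2025/07/17,15:00,252391585,Park\n"
)

# Declare dtypes explicitly (illustrative choices, not the notebook's settings)
sample = pd.read_csv(
    sample_csv,
    dtype={
        "Row ID": "int64",
        "CAD Number": "string",        # identifier, not a quantity
        "Police District": "category",  # small set of repeated labels
    },
)

print(sample.dtypes)
```

Reading `CAD Number` as a string keeps missing entries as `<NA>` instead of forcing the whole column to float, which is why the preview above shows values such as `252390049.0`.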


In [62]:
# Display dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 974157 entries, 0 to 974156
Data columns (total 29 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Row ID                    974157 non-null  int64  
 1   Incident Datetime         974157 non-null  object 
 2   Incident Date             974157 non-null  object 
 3   Incident Time             974157 non-null  object 
 4   Incident Year             974157 non-null  int64  
 5   Incident Day of Week      974157 non-null  object 
 6   Report Datetime           974157 non-null  object 
 7   Incident ID               974157 non-null  int64  
 8   Incident Number           974157 non-null  int64  
 9   CAD Number                754846 non-null  float64
 10  Report Type Code          974157 non-null  object 
 11  Report Type Description   974157 non-null  object 
 12  Filed Online              192457 non-null  object 
 13  Incident Code             974157 non-null  i

## 2. Data Preprocessing

### 2.1 DateTime Processing
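The dataset stores the incident date and time in separate columns, alongside a combined `Incident Datetime` string. We combine the separate date and time columns into a single `Incident DateTime` index (note the capital "T") for efficient time-series analysis.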

In [63]:
# Convert 'Incident Date' to datetime objects
df['Incident Date'] = pd.to_datetime(df['Incident Date'])

# Combine date and time into a single column
df['Incident DateTime'] = pd.to_datetime(
    df['Incident Date'].dt.strftime('%Y-%m-%d') + ' ' + df['Incident Time']
)

# Set as index
df.set_index('Incident DateTime', inplace=True)

print("DateTime index created successfully")
df.head()

DateTime index created successfully


Incident DateTime,Row ID,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Incident ID,Incident Number,CAD Number,...,CNN,Police District,Analysis Neighborhood,Supervisor District,Supervisor District 2012,Latitude,Longitude,Point,data_as_of,data_loaded_at
2025-08-26 23:17:00,150750507041,2025/08/26 11:17:00 PM,2025-08-26,23:17,2025,Tuesday,2025/08/26 11:17:00 PM,1507505,250333102,,...,,Out of SF,,,,,,,2025/08/28 09:38:07 AM,2025/08/29 09:53:03 AM
2025-08-27 00:37:00,150752104134,2025/08/27 12:37:00 AM,2025-08-27,00:37,2025,Wednesday,2025/08/27 12:37:00 AM,1507521,250479881,252390049.0,...,33557000.0,Park,Lone Mountain/USF,1.0,1.0,37.780415,-122.449013,POINT (-122.449012756 37.780414581),2025/08/28 09:38:07 AM,2025/08/29 09:53:03 AM
2025-07-17 15:00:00,150762309027,2025/07/17 03:00:00 PM,2025-07-17,15:00,2025,Thursday,2025/08/27 11:55:00 AM,1507623,250480775,252391585.0,...,26469000.0,Park,Lone Mountain/USF,1.0,1.0,37.775177,-122.451355,POINT (-122.45135498 37.775177002),2025/08/28 09:38:07 AM,2025/08/29 09:53:03 AM
2025-08-23 21:30:00,150740506244,2025/08/23 09:30:00 PM,2025-08-23,21:30,2025,Saturday,2025/08/24 02:53:00 PM,1507405,256091227,,...,25905000.0,Northern,Hayes Valley,6.0,5.0,37.774551,-122.422501,POINT (-122.42250061 37.774551392),2025/08/27 09:38:07 AM,2025/08/28 09:53:00 AM
2025-08-15 12:00:00,150723571000,2025/08/15 12:00:00 PM,2025-08-15,12:00,2025,Friday,2025/08/24 07:10:00 PM,1507235,256090348,,...,26412000.0,Park,Haight Ashbury,5.0,5.0,37.769661,-122.449646,POINT (-122.449645996 37.76966095),2025/08/27 09:38:07 AM,2025/08/28 09:53:00 AM
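
As a cross-check on the combined index, the raw `Incident Datetime` string can also be parsed directly. The sketch below uses two rows copied from the preview above and assumes the formats `%Y/%m/%d %H:%M` and `%Y/%m/%d %I:%M:%S %p` hold for the whole file, which has not been verified here. Passing an explicit `format` avoids per-value format inference.

```python
import pandas as pd

# Two rows copied from the preview above
sample = pd.DataFrame({
    "Incident Datetime": ["2025/08/26 11:17:00 PM", "2025/08/27 12:37:00 AM"],
    "Incident Date": ["2025/08/26", "2025/08/27"],
    "Incident Time": ["23:17", "00:37"],
})

# Route 1: join the separate date and 24-hour time strings, then parse
combined = pd.to_datetime(
    sample["Incident Date"] + " " + sample["Incident Time"],
    format="%Y/%m/%d %H:%M",
)

# Route 2: parse the 12-hour combined string directly
direct = pd.to_datetime(
    sample["Incident Datetime"],
    format="%Y/%m/%d %I:%M:%S %p",
)

# Both routes should agree row by row
routes_agree = bool((combined == direct).all())
print(routes_agree)
```

If the two routes ever disagree on the full dataset, that points to rows where the separate and combined columns are inconsistent.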


In [64]:
# Drop redundant date/time columns
df.drop(['Incident Datetime', 'Incident Date', 'Incident Time'], axis=1, inplace=True, errors='ignore')

print(f"Columns after cleanup: {len(df.columns)}")
df.info()

Columns after cleanup: 26
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 974157 entries, 2025-08-26 23:17:00 to 2019-06-28 10:00:00
Data columns (total 26 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Row ID                    974157 non-null  int64  
 1   Incident Year             974157 non-null  int64  
 2   Incident Day of Week      974157 non-null  object 
 3   Report Datetime           974157 non-null  object 
 4   Incident ID               974157 non-null  int64  
 5   Incident Number           974157 non-null  int64  
 6   CAD Number                754846 non-null  float64
 7   Report Type Code          974157 non-null  object 
 8   Report Type Description   974157 non-null  object 
 9   Filed Online              192457 non-null  object 
 10  Incident Code             974157 non-null  int64  
 11  Incident Category         972786 non-null  object 
 12  Incident Subcategory      972786 non-nul

### 2.2 Handling Missing Values
Before proceeding with analysis, we need to identify and handle missing data appropriately.

In [65]:
# Calculate missing value percentages
missing_percentage = (df.isnull().sum() / len(df)) * 100

print("Columns with missing values:")
print(missing_percentage[missing_percentage > 0].sort_values(ascending=False))

Columns with missing values:
Filed Online                80.243739
CAD Number                  22.512901
Supervisor District          5.610081
Analysis Neighborhood        5.577027
Supervisor District 2012     5.556497
Intersection                 5.547463
CNN                          5.547463
Latitude                     5.547463
Longitude                    5.547463
Point                        5.547463
Incident Category            0.140737
Incident Subcategory         0.140737
dtype: float64
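
To make the percentage calculation concrete, here is a small sketch on a hypothetical eight-row frame. The values are invented for illustration. It uses `isnull().mean()`, which gives the same fraction as dividing the null count by the row count, and then applies a 20% cutoff.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 6 of 8 values missing in one column, 1 of 8 in another
toy = pd.DataFrame({
    "Filed Online": [True, True] + [np.nan] * 6,
    "Latitude": [37.78, 37.77, 37.77, 37.76, 37.75, 37.74, 37.73, np.nan],
    "Resolution": ["Open or Active"] * 8,
})

# Mean of the boolean null mask is the fraction missing per column
missing_pct = toy.isnull().mean() * 100

# Columns whose missing share exceeds the 20% cutoff
columns_over_threshold = missing_pct[missing_pct > 20].index.tolist()

print(missing_pct)
print(columns_over_threshold)
```

Only `Filed Online` (75% missing) crosses the cutoff here. `Latitude` (12.5% missing) stays and is handled at the row level instead.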


In [66]:
# Drop columns with high percentage of missing values (>20%)
columns_to_drop = missing_percentage[missing_percentage > 20].index.tolist()

if columns_to_drop:
    df.drop(columns_to_drop, axis=1, inplace=True)
    print(f"Dropped {len(columns_to_drop)} columns with >20% missing values:")
    print(columns_to_drop)
else:
    print("No columns with >20% missing values")

print(f"\nColumns remaining: {len(df.columns)}")

Dropped 2 columns with >20% missing values:
['CAD Number', 'Filed Online']

Columns remaining: 24


In [67]:
# Drop rows with remaining missing values
rows_before = len(df)
df.dropna(inplace=True)
rows_after = len(df)

print(f"Rows dropped: {rows_before - rows_after:,} ({((rows_before - rows_after)/rows_before)*100:.2f}%)")
print(f"Clean dataset: {rows_after:,} rows")
print(f"Remaining missing values: {df.isnull().sum().sum()}")

Rows dropped: 56,039 (5.75%)
Clean dataset: 918,118 rows
Remaining missing values: 0
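
Calling `dropna()` with no arguments removes a row if any column is missing. The sketch below, on hypothetical toy data, contrasts that with `subset=`, which restricts the check to the columns an analysis needs. Which columns count as required is an assumption made here for illustration, not a choice made by this notebook.

```python
import pandas as pd

# Hypothetical toy rows, each missing a different field
toy = pd.DataFrame({
    "Incident Category": ["Larceny Theft", None, "Assault", "Burglary"],
    "Latitude": [37.78, 37.77, None, 37.76],
    "Supervisor District": [1.0, 6.0, 5.0, None],
})

# Default behaviour: any missing value removes the row
strict = toy.dropna()

# Only rows missing a required column are removed (assumed requirement)
required_columns = ["Incident Category", "Latitude"]
lenient = toy.dropna(subset=required_columns)

print(len(strict), len(lenient))
```

The lenient variant keeps the row that lacks only `Supervisor District`, at the cost of leaving some missing values in the output.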


## 5. Feature Engineering

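Extract temporal features (hour, day, month, year, quarter, day of week, and a weekend flag) from the DateTime index for downstream analysis.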

In [68]:
# Create temporal features from the DateTime index
df['Hour'] = df.index.hour
df['Day'] = df.index.day
df['Month'] = df.index.month
df['Year'] = df.index.year
df['Day of Week'] = df.index.dayofweek  # Monday=0, Sunday=6
df['Day of Week Name'] = df.index.day_name()
df['Month Name'] = df.index.month_name()
df['Quarter'] = df.index.quarter
df['Is Weekend'] = df['Day of Week'].isin([5, 6]).astype(int)

print("Temporal features created:")
print(df[['Hour', 'Day of Week Name', 'Month Name', 'Year', 'Is Weekend']].head())

Temporal features created:
                     Hour Day of Week Name Month Name  Year  Is Weekend
Incident DateTime                                                      
2025-08-27 00:37:00     0        Wednesday     August  2025           0
2025-07-17 15:00:00    15         Thursday       July  2025           0
2025-08-23 21:30:00    21         Saturday     August  2025           1
2025-08-15 12:00:00    12           Friday     August  2025           0
2025-08-15 21:45:00    21           Friday     August  2025           0
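
One way to sanity-check the derived features is to compare the computed day name against the dataset's own `Incident Day of Week` column. The sketch below uses three timestamp and weekday pairs copied from the previews above. Running the same comparison on the full DataFrame is a suggested check, not something this notebook performs.

```python
import pandas as pd

# Timestamp / weekday pairs copied from the previews above
toy = pd.DataFrame(
    {"Incident Day of Week": ["Saturday", "Friday", "Wednesday"]},
    index=pd.DatetimeIndex(
        ["2025-08-23 21:30:00", "2025-08-15 12:00:00", "2025-08-27 00:37:00"],
        name="Incident DateTime",
    ),
)

# Derive the same features as the cell above
toy["Day of Week"] = toy.index.dayofweek  # Monday=0, Sunday=6
toy["Day of Week Name"] = toy.index.day_name()
toy["Is Weekend"] = toy["Day of Week"].isin([5, 6]).astype(int)

# Count rows where the derived name disagrees with the source column
mismatches = int((toy["Day of Week Name"] != toy["Incident Day of Week"]).sum())
print(mismatches)
```

A non-zero count on the real data would suggest a parsing problem in the DateTime index.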


## 6. Data Quality Check

In [69]:
# Final data quality summary
print("=" * 60)
print("FINAL CLEAN DATASET SUMMARY")
print("=" * 60)
print(f"Total Rows: {len(df):,}")
print(f"Total Columns: {len(df.columns)}")
print(f"Date Range: {df.index.min()} to {df.index.max()}")
print(f"Missing Values: {df.isnull().sum().sum()}")
print(f"Duplicate Rows: {df.duplicated().sum()}")
print("\nColumn List:")
print(df.columns.tolist())

FINAL CLEAN DATASET SUMMARY
Total Rows: 918,118
Total Columns: 33
Date Range: 2018-01-01 00:00:00 to 2025-10-12 21:04:00
Missing Values: 0
Duplicate Rows: 0

Column List:
['Row ID', 'Incident Year', 'Incident Day of Week', 'Report Datetime', 'Incident ID', 'Incident Number', 'Report Type Code', 'Report Type Description', 'Incident Code', 'Incident Category', 'Incident Subcategory', 'Incident Description', 'Resolution', 'Intersection', 'CNN', 'Police District', 'Analysis Neighborhood', 'Supervisor District', 'Supervisor District 2012', 'Latitude', 'Longitude', 'Point', 'data_as_of', 'data_loaded_at', 'Hour', 'Day', 'Month', 'Year', 'Day of Week', 'Day of Week Name', 'Month Name', 'Quarter', 'Is Weekend']
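
Note that `DataFrame.duplicated()` compares column values only and ignores the index, so repeated timestamps in the index are not counted. The sketch below, on hypothetical rows, separates three different notions of "duplicate". Treating `Row ID` as a unique key is an assumption here.

```python
import pandas as pd

# Hypothetical rows: two distinct records that share one timestamp
toy = pd.DataFrame(
    {"Row ID": [1, 2], "Resolution": ["Open or Active", "Open or Active"]},
    index=pd.DatetimeIndex(
        ["2025-08-15 12:00:00", "2025-08-15 12:00:00"],
        name="Incident DateTime",
    ),
)

# Rows identical across all columns (index not considered)
full_row_duplicates = int(toy.duplicated().sum())

# Repeated values in the assumed key column
key_duplicates = int(toy["Row ID"].duplicated().sum())

# Repeated timestamps in the index
shared_timestamps = int(toy.index.duplicated().sum())

print(full_row_duplicates, key_duplicates, shared_timestamps)
```

Shared timestamps are expected in incident data, so a non-unique index is not by itself a data quality problem.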


## 7. Save Processed Data

Save the clean dataset for use by team members in downstream analysis.

In [70]:
# Create processed data directory if it doesn't exist
output_dir = Path('../data/processed')
output_dir.mkdir(parents=True, exist_ok=True)

# Save to CSV
output_path = output_dir / 'sf_incidents_cleaned.csv'
df.to_csv(output_path)

print(f"✅ Clean dataset saved to: {output_path}")
print(f"File size: {output_path.stat().st_size / (1024*1024):.2f} MB")

✅ Clean dataset saved to: ../data/processed/sf_incidents_cleaned.csv
File size: 350.57 MB


---

## Note:

**For Team Members:**
- Load this clean dataset using: `pd.read_csv('../data/processed/sf_incidents_cleaned.csv', index_col='Incident DateTime', parse_dates=True)`
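
The load call above relies on the index being written as the first CSV column under its name. A minimal round-trip sketch, using an in-memory buffer and a hypothetical two-row frame in place of the real file, shows that `index_col` together with `parse_dates=True` restores a `DatetimeIndex`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset
toy = pd.DataFrame(
    {"Hour": [21, 12]},
    index=pd.DatetimeIndex(
        ["2025-08-23 21:30:00", "2025-08-15 12:00:00"],
        name="Incident DateTime",
    ),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Same arguments as the load call for team members
reloaded = pd.read_csv(buffer, index_col="Incident DateTime", parse_dates=True)

print(type(reloaded.index).__name__)
```

Without `parse_dates=True` the index would come back as plain strings, and attributes such as `.hour` would not be available on it.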

