# Data Preparation: San Jose SJPD Incidents

**Project:** A Tale of Two Cities - Comparative Public Safety Analysis

**Purpose:** This notebook handles data loading, cleaning, preprocessing, and feature engineering for the San Jose dataset.

**Output:** Clean, analysis-ready dataset saved to `data/processed/sj_calls_cleaned.csv`

---

# 1. Import Libraries and Data Loading

In [8]:
import pandas as pd
import os
from pathlib import Path
import numpy as np

In [9]:
# Path to the directory containing the San Jose CSV files
sj_data_path = '../data/raw/sj_calls_for_service/'

# Get a list of all the CSV file names in the directory
csv_files = [f for f in os.listdir(sj_data_path) if f.endswith('.csv')]
csv_files.sort()  # Sort files for consistent loading order

print(f"Found {len(csv_files)} CSV files:")
for file in csv_files:
    print(f"  - {file}")

# Create an empty list to hold the individual DataFrames
df_list = []

# Loop through each CSV file, load it, and add it to the list
for file in csv_files:
    file_path = os.path.join(sj_data_path, file)
    print(f"Loading: {file}...")
    temp_df = pd.read_csv(file_path)
    print(f"  Rows: {len(temp_df):,}")
    df_list.append(temp_df)

# Concatenate all the DataFrames into a single master DataFrame
print("\nConcatenating all files...")
df = pd.concat(df_list, ignore_index=True)

print(f"\n✅ Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")
df.head()

Found 8 CSV files:
  - policecalls2018.csv
  - policecalls2019.csv
  - policecalls2020.csv
  - policecalls2021.csv
  - policecalls2022.csv
  - policecalls2023.csv
  - policecalls2024.csv
  - policecalls2025.csv
Loading: policecalls2018.csv...
  Rows: 322,365
Loading: policecalls2019.csv...
  Rows: 322,624
Loading: policecalls2020.csv...
  Rows: 297,458
Loading: policecalls2021.csv...
  Rows: 302,119
Loading: policecalls2022.csv...
  Rows: 294,515
Loading: policecalls2023.csv...
  Rows: 286,481
Loading: policecalls2024.csv...
  Rows: 276,069
Loading: policecalls2025.csv...
  Rows: 225,561

Concatenating all files...

✅ Dataset loaded: 2,327,192 rows, 15 columns


Unnamed: 0,CDTS,EID,START_DATE,CALL_NUMBER,PRIORITY,REPORT_DATE,OFFENSE_DATE,OFFENSE_TIME,CALLTYPE_CODE,CALL_TYPE,FINAL_DISPO_CODE,FINAL_DISPO,ADDRESS,CITY,STATE
0,20180101000426PS,7000038,5/14/2021 12:00:00 AM,P180010001,2,1/1/2018 12:00:00 AM,1/1/2018 12:00:00 AM,00:00:02,FDAID,FIRE DEPARTMENT REQUEST FOR PD,N,No report required; dispatch r,[2900]-[3000] ALUM ROCK AV,San Jose,CA
1,20180101000120PS,7000040,5/14/2021 12:00:00 AM,P180010003,3,1/1/2018 12:00:00 AM,1/1/2018 12:00:00 AM,00:00:15,1033A,"ALARM, AUDIBLE",NR,No Response,[4700]-[4800] CALENDULA CT,San Jose,CA
2,20180101003329PS,7000041,5/14/2021 12:00:00 AM,P180010004,4,1/1/2018 12:00:00 AM,1/1/2018 12:00:00 AM,00:00:32,415M,"DISTURBANCE, MUSIC",CAN,Canceled,[3100]-[3200] WILLIAMSBURG DR,San Jose,CA
3,20180101000358PS,7000043,5/14/2021 12:00:00 AM,P180010005,4,1/1/2018 12:00:00 AM,1/1/2018 12:00:00 AM,00:01:41,415FC,"DISTURBANCE, FIRECRACKERS",CAN,Canceled,[1300]-[1400] BOURET DR,San Jose,CA
4,20180101000535PS,7000044,5/14/2021 12:00:00 AM,P180010006,4,1/1/2018 12:00:00 AM,1/1/2018 12:00:00 AM,00:01:53,415M,"DISTURBANCE, MUSIC",CAN,Canceled,[400]-[500] WASHINGTON ST,San Jose,CA
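
As a variation on the loading loop above, the following sketch uses `pathlib` globbing and tags each row with the file it came from, which can help when tracing duplicate records back to a yearly export. The helper name `load_yearly_csvs` and the `source_file` column are hypothetical and not part of the original notebook; the demo files are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_yearly_csvs(data_dir, pattern="*.csv"):
    """Load every CSV matching `pattern` in `data_dir` into one DataFrame.

    Hypothetical helper: adds a `source_file` column (not in the original
    data) so each row can be traced back to the file it came from.
    """
    files = sorted(Path(data_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {data_dir}")
    frames = [pd.read_csv(f).assign(source_file=f.name) for f in files]
    return pd.concat(frames, ignore_index=True)


# Tiny self-contained demo with two made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "policecalls2018.csv").write_text("EID,PRIORITY\n1,2\n2,3\n")
    Path(tmp, "policecalls2019.csv").write_text("EID,PRIORITY\n3,4\n")
    demo = load_yearly_csvs(tmp)

print(demo)
```

Raising on an empty directory is a design choice for this sketch: it surfaces a wrong path immediately instead of failing later inside `pd.concat`.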


# 2. Initial Inspection

## 2.1 Checking for Duplicates 

In [10]:
print("Checking for duplicate IDs...")
eid_duplicates = df.duplicated(subset=['EID']).sum()

print(f"Duplicate EIDs: {eid_duplicates:,}")

if eid_duplicates > 0:
    print("Dropping duplicate rows based on 'EID'...")
    df.drop_duplicates(subset=['EID'], inplace=True)
    print(f"New row count: {len(df):,}")

Checking for duplicate IDs...
Duplicate EIDs: 45,812
Dropping duplicate rows based on 'EID'...
New row count: 2,281,380
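
Before dropping rows by `EID`, it can be useful to look at the duplicate groups themselves and check whether the repeated records are identical or disagree in other columns. The sketch below shows one way to do that on made-up rows; the values are illustrative only and say nothing about what the real duplicates look like.

```python
import pandas as pd

# Made-up rows: EID 7000040 appears twice with different dispositions
calls = pd.DataFrame({
    "EID": [7000038, 7000040, 7000040, 7000041],
    "FINAL_DISPO_CODE": ["N", "NR", "CAN", "CAN"],
})

# keep=False marks every member of a duplicate group, not just the repeats
dup_groups = calls[calls.duplicated(subset=["EID"], keep=False)]
print(dup_groups.sort_values("EID"))

# True if any repeated EID carries more than one distinct disposition code
conflicting = (
    dup_groups.groupby("EID")["FINAL_DISPO_CODE"].nunique().gt(1).any()
)
print(f"Conflicting duplicates present: {conflicting}")
```

If the duplicates conflict, the default `keep='first'` used by `drop_duplicates` silently picks whichever row was loaded first, so the file loading order matters.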


## 2.2 Checking for Missing Values

In [11]:
# Calculate missing value percentages
missing_percentage = (df.isnull().sum() / len(df)) * 100

print("Columns with missing values (%):")
print(missing_percentage[missing_percentage > 0].sort_values(ascending=False))

Columns with missing values (%):
ADDRESS      2.899298
CALL_TYPE    0.001271
dtype: float64
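
A small extension of the check above is to report absolute counts next to the percentages, since a tiny percentage of a multi-million-row table can still be a meaningful number of rows. This is a minimal sketch on a made-up frame; the `summary` layout is an assumption, not the notebook's output format.

```python
import pandas as pd

# Made-up frame with gaps in two columns
toy = pd.DataFrame({
    "ADDRESS": ["A ST", None, "C ST", None],
    "CALL_TYPE": ["X", "Y", None, "Z"],
    "PRIORITY": [1, 2, 3, 4],
})

is_missing = toy.isna()
summary = pd.DataFrame({
    "missing": is_missing.sum(),            # absolute count per column
    "percent": is_missing.mean() * 100,     # share of rows per column
})

# Keep only columns that actually have gaps, worst first
summary = summary[summary["missing"] > 0].sort_values("percent", ascending=False)
print(summary)
```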


# 3. Data Preprocessing

## 3.1 DateTime Processing
Combine OFFENSE_DATE and OFFENSE_TIME into a single Incident DateTime index.

In [12]:
date_str = pd.to_datetime(df['OFFENSE_DATE'], errors='coerce').dt.strftime('%Y-%m-%d')

time_str = pd.to_datetime(df['OFFENSE_TIME'], format='%H:%M:%S', errors='coerce').dt.strftime('%H:%M:%S')

# Combine OFFENSE_DATE and OFFENSE_TIME into a single datetime column
full_datetime_str = date_str + ' ' + time_str
df['Incident DateTime'] = pd.to_datetime(full_datetime_str, errors='coerce')

# Count and report any parsing errors
errors = df['Incident DateTime'].isna().sum()
total = len(df)
print(f"DateTime Parsing Complete.")
print(f"Total Rows: {total:,}")
print(f"Successful: {total - errors:,}")
print(f"Errors (NaT): {errors:,}")

# Show successful examples
print("\nSample of successfully parsed dates:")
print(df.head()[['OFFENSE_DATE', 'OFFENSE_TIME', 'Incident DateTime']])


DateTime Parsing Complete.
Total Rows: 2,281,380
Successful: 2,281,380
Errors (NaT): 0

Sample of successfully parsed dates:
           OFFENSE_DATE OFFENSE_TIME   Incident DateTime
0  1/1/2018 12:00:00 AM     00:00:02 2018-01-01 00:00:02
1  1/1/2018 12:00:00 AM     00:00:15 2018-01-01 00:00:15
2  1/1/2018 12:00:00 AM     00:00:32 2018-01-01 00:00:32
3  1/1/2018 12:00:00 AM     00:01:41 2018-01-01 00:01:41
4  1/1/2018 12:00:00 AM     00:01:53 2018-01-01 00:01:53
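
An alternative to round-tripping through strings is to parse the date with an explicit format and add the time as a timedelta. Passing `format=` means pandas does not have to infer the layout of `OFFENSE_DATE`. The sketch below assumes the date column always looks like `1/1/2018 12:00:00 AM`, as in the sample rows; the third row is deliberately malformed to show the `NaT` path.

```python
import pandas as pd

sample = pd.DataFrame({
    "OFFENSE_DATE": ["1/1/2018 12:00:00 AM", "11/8/2025 12:00:00 AM", "bad"],
    "OFFENSE_TIME": ["00:00:02", "23:56:34", "00:00:00"],
})

# Explicit format: month/day/year plus a 12-hour clock with AM/PM
date_part = pd.to_datetime(
    sample["OFFENSE_DATE"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# HH:MM:SS strings parse directly as offsets from midnight
time_part = pd.to_timedelta(sample["OFFENSE_TIME"], errors="coerce")

# normalize() zeroes any time-of-day carried by the date column
incident_dt = date_part.dt.normalize() + time_part
print(incident_dt)
```

A row becomes `NaT` if either the date or the time fails to parse, which matches the behaviour of the string-based approach above.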


In [13]:
# Set the new 'Incident DateTime' column as the DataFrame's index
df.set_index('Incident DateTime', inplace=True)

# Drop the original date and time columns as they are now redundant
date_time_cols = ['START_DATE', 'REPORT_DATE', 'OFFENSE_DATE', 'OFFENSE_TIME']
df.drop(date_time_cols, axis=1, inplace=True, errors='ignore')

# Drop columns not relevant for comparative analysis
# EID, CDTS, CALL_NUMBER are internal identifiers
# STATE is redundant (all data is from California)
identifier_cols = ['EID', 'CDTS', 'CALL_NUMBER', 'STATE']
df.drop(identifier_cols, axis=1, inplace=True, errors='ignore')

print(f"✅ Index set and redundant columns dropped")
print(f"Columns remaining: {len(df.columns)}")
df.info()

✅ Index set and redundant columns dropped
Columns remaining: 7
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2281380 entries, 2018-01-01 00:00:02 to 2025-11-08 23:56:34
Data columns (total 7 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   PRIORITY          int64 
 1   CALLTYPE_CODE     object
 2   CALL_TYPE         object
 3   FINAL_DISPO_CODE  object
 4   FINAL_DISPO       object
 5   ADDRESS           object
 6   CITY              object
dtypes: int64(1), object(6)
memory usage: 139.2+ MB


## 3.2 Handling Missing Values

Datetime parsing produced no `NaT` values (and `dropna()` inspects column values only, not the index), so the rows dropped here are those missing `ADDRESS` or `CALL_TYPE`.

In [14]:
# Drop rows with remaining missing values
rows_before = len(df)
df.dropna(inplace=True)
rows_after = len(df)

print(f"Rows dropped: {rows_before - rows_after:,} ({((rows_before - rows_after)/rows_before)*100:.2f}%)")
print(f"Clean dataset: {rows_after:,} rows")
print(f"Remaining missing values: {df.isnull().sum().sum()}")

Rows dropped: 66,173 (2.90%)
Clean dataset: 2,215,207 rows
Remaining missing values: 0
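
Because the timestamp now lives in the index, `dropna()` would not remove a row whose index is `NaT`. That did not matter here (parsing reported zero errors), but the sketch below shows how the two checks could be made explicit on a made-up frame. Restricting `dropna` to named columns via `subset=` is an assumption about intent, not what the notebook does.

```python
import pandas as pd

# Made-up frame: second row has a NaT index, third row lacks CALL_TYPE
idx = pd.to_datetime(
    ["2018-01-01 00:00:02", None, "2018-01-01 00:00:32"]
).rename("Incident DateTime")
toy = pd.DataFrame(
    {
        "CALL_TYPE": ["DISTURBANCE, MUSIC", "ALARM, AUDIBLE", None],
        "ADDRESS": ["[400]-[500] WASHINGTON ST", "[4700]-[4800] CALENDULA CT", "X"],
    },
    index=idx,
)

# Step 1: drop rows missing the columns the analysis depends on
after_dropna = toy.dropna(subset=["CALL_TYPE", "ADDRESS"])

# Step 2: the NaT-indexed row survived step 1, so filter the index separately
after_index_filter = after_dropna[after_dropna.index.notna()]

print(len(toy), len(after_dropna), len(after_index_filter))
```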


## 3.3 Handling Duplicate Values 

In [15]:
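# Drop remaining full-row duplicates.
# Note: drop_duplicates() compares column values only; the 'Incident DateTime'
# index is not part of the comparison, so calls that share every remaining
# column but occurred at different times are treated as duplicates here.
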
rows_before_final_drop = len(df)
df.drop_duplicates(inplace=True)
rows_after_final_drop = len(df)

print(f"Dropped {rows_before_final_drop - rows_after_final_drop:,} full-row duplicates.")

Dropped 1,046,383 full-row duplicates.
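
`DataFrame.drop_duplicates()` does not take the index into account, so with the timestamp in the index two calls of the same type at the same block count as duplicates even when they happened at different times. Whether that is the intended definition of a duplicate is an analysis decision. The sketch below, on made-up rows, contrasts the column-only comparison with one that includes the timestamp by temporarily resetting the index.

```python
import pandas as pd

# Made-up rows: same call type and address, two distinct timestamps
idx = pd.to_datetime(
    ["2018-01-01 00:00:32", "2018-01-01 00:01:53", "2018-01-01 00:01:53"]
).rename("Incident DateTime")
toy = pd.DataFrame(
    {
        "CALL_TYPE": ["DISTURBANCE, MUSIC"] * 3,
        "ADDRESS": ["[400]-[500] WASHINGTON ST"] * 3,
    },
    index=idx,
)

# Column-only comparison: the index is ignored, so one row remains
by_columns_only = toy.drop_duplicates()

# Timestamp-aware comparison: move the index into a column first
with_timestamp = (
    toy.reset_index()
       .drop_duplicates()
       .set_index("Incident DateTime")
)

print(f"Column-only: {len(by_columns_only)} row(s)")
print(f"With timestamp: {len(with_timestamp)} row(s)")
```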


# 4. Feature Engineering

## 4.1 Temporal Features

Extract temporal features for time-series analysis.

In [16]:
# Create temporal features from the DateTime index
df['Hour'] = df.index.hour
df['Day'] = df.index.day
df['Month'] = df.index.month
df['Year'] = df.index.year
df['Day of Week'] = df.index.dayofweek  # Monday=0, Sunday=6
df['Day of Week Name'] = df.index.day_name()
df['Month Name'] = df.index.month_name()
df['Quarter'] = df.index.quarter
df['Is Weekend'] = df['Day of Week'].isin([5, 6]).astype(int)

print("✅ Temporal features created:")
print(df[['Hour', 'Day of Week Name', 'Month Name', 'Year', 'Is Weekend']].head())

✅ Temporal features created:
                     Hour Day of Week Name Month Name  Year  Is Weekend
Incident DateTime                                                      
2018-01-01 00:00:02     0           Monday    January  2018           0
2018-01-01 00:00:15     0           Monday    January  2018           0
2018-01-01 00:00:32     0           Monday    January  2018           0
2018-01-01 00:01:41     0           Monday    January  2018           0
2018-01-01 00:01:53     0           Monday    January  2018           0


## 4.2 Location Features

Extract a clean street name from the ADDRESS field for grouping.

In [17]:
# Clean the ADDRESS field by removing patterns like '[number]-[number] '
df['Clean_Address'] = df['ADDRESS'].str.replace(r'\[\d+\]-\[\d+\]\s', '', regex=True)

print("Created 'Clean_Address' feature. Examples:")
print(df[['ADDRESS', 'Clean_Address']].head())

Created 'Clean_Address' feature. Examples:
                                           ADDRESS    Clean_Address
Incident DateTime                                                  
2018-01-01 00:00:02     [2900]-[3000] ALUM ROCK AV     ALUM ROCK AV
2018-01-01 00:00:15     [4700]-[4800] CALENDULA CT     CALENDULA CT
2018-01-01 00:00:32  [3100]-[3200] WILLIAMSBURG DR  WILLIAMSBURG DR
2018-01-01 00:01:41        [1300]-[1400] BOURET DR        BOURET DR
2018-01-01 00:01:53      [400]-[500] WASHINGTON ST    WASHINGTON ST
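
The same cleanup can be checked in isolation with the standard `re` module. The sketch below is a slight variation on the pattern above: it anchors the match to the start of the string and tolerates more than one space after the block range. The function name `strip_block_range` is hypothetical, and the intersection-style address is a made-up example of input that has no block range to strip.

```python
import re

# '[2900]-[3000] ' style prefix, anchored to the start of the string
BLOCK_RANGE = re.compile(r"^\[\d+\]-\[\d+\]\s+")


def strip_block_range(address: str) -> str:
    """Remove a leading '[from]-[to] ' block range, if present."""
    return BLOCK_RANGE.sub("", address).strip()


examples = [
    "[2900]-[3000] ALUM ROCK AV",
    "[400]-[500]  WASHINGTON ST",   # two spaces after the range
    "MONTEREY RD & CURTNER AV",     # made-up address with no block range
]
cleaned = [strip_block_range(a) for a in examples]

for raw, out in zip(examples, cleaned):
    print(f"{raw!r:35} -> {out!r}")
```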


## 4.3 Category Harmonization

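Map the granular `CALL_TYPE` values to a small set of high-level categories that can be compared with the San Francisco dataset. Matching is by substring on the upper-cased call type, and the first rule that matches wins.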

In [18]:
print(f"\nOriginal unique SJ Categories: {df['CALL_TYPE'].nunique()}")

def map_high_level_category(cat):
    cat = str(cat).upper()

    # Violent
    if any(s in cat for s in ['ASSAULT', 'ROBBERY', 'HOMICIDE', 'RAPE', 'BATTERY', 'WEAPON']):
        return 'Violent'
    
    # Theft/Property
    if any(s in cat for s in ['BURGLARY', 'THEFT', 'STOLEN VEHICLE', 'MALICIOUS MISCHIEF', 'VANDALISM']):
        return 'Theft/Property'
    
    # Disturbance/Suspicious
    if any(s in cat for s in ['DISTURBANCE', 'SUSPICIOUS', 'WELFARE CHECK', 'TRESPASSING']):
        return 'Disturbance/Suspicious'

    # Traffic/Vehicle
    if any(s in cat for s in ['TRAFFIC', 'PARKING VIOLATION', 'VEHICLE STOP', 'ACCIDENT', 'RECKLESS DRIVING', 'HIT AND RUN', 'PEDESTRIAN STOP']):
        return 'Traffic/Vehicle'

    # Alarm
    if 'ALARM' in cat:
        return 'Alarm'
    
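    # Non-Criminal / Admin
    # Note: 'RECOVERED STOLEN VEHICLE' contains 'STOLEN VEHICLE', so it is
    # already matched by the Theft/Property rule above and never reaches here.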
    if any(s in cat for s in ['MEET THE CITIZEN', 'UNK TYPE 911 CALL', 'RECOVERED STOLEN VEHICLE']):
        return 'Non-Criminal/Admin'
    
    # All others fall into a general 'Other'
    return 'Other'

# Apply the mapping
df['Incident_High_Level_Category'] = df['CALL_TYPE'].apply(map_high_level_category)

print("\nHigh-level categories created:")
print(df['Incident_High_Level_Category'].value_counts())


Original unique SJ Categories: 241

High-level categories created:
Incident_High_Level_Category
Disturbance/Suspicious    439807
Traffic/Vehicle           238636
Other                     192200
Theft/Property            127662
Alarm                      62575
Violent                    55084
Non-Criminal/Admin         52860
Name: count, dtype: int64
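
Because the first matching rule wins, the order of the checks is part of the mapping. One way to make that order explicit is to hold the rules in an ordered list, as in the sketch below. Note that this is a hypothetical restructuring: it deliberately moves the Non-Criminal/Admin rule first so that `RECOVERED STOLEN VEHICLE` is not captured by the `STOLEN VEHICLE` keyword, which would change the counts shown above. The names `CATEGORY_RULES` and `categorize` are not from the notebook.

```python
# Ordered (label, keywords) pairs; earlier rules take precedence.
# Hypothetical ordering: Non-Criminal/Admin is checked first here.
CATEGORY_RULES = [
    ("Non-Criminal/Admin", ["RECOVERED STOLEN VEHICLE", "MEET THE CITIZEN",
                            "UNK TYPE 911 CALL"]),
    ("Violent", ["ASSAULT", "ROBBERY", "HOMICIDE", "RAPE", "BATTERY", "WEAPON"]),
    ("Theft/Property", ["BURGLARY", "THEFT", "STOLEN VEHICLE",
                        "MALICIOUS MISCHIEF", "VANDALISM"]),
    ("Disturbance/Suspicious", ["DISTURBANCE", "SUSPICIOUS", "WELFARE CHECK",
                                "TRESPASSING"]),
    ("Traffic/Vehicle", ["TRAFFIC", "PARKING VIOLATION", "VEHICLE STOP",
                         "ACCIDENT", "RECKLESS DRIVING", "HIT AND RUN",
                         "PEDESTRIAN STOP"]),
    ("Alarm", ["ALARM"]),
]


def categorize(call_type, rules=CATEGORY_RULES):
    """Return the label of the first rule with a keyword found in call_type."""
    text = str(call_type).upper()
    for label, keywords in rules:
        if any(keyword in text for keyword in keywords):
            return label
    return "Other"


for example in ["RECOVERED STOLEN VEHICLE", "STOLEN VEHICLE",
                "DISTURBANCE, MUSIC", "ALARM, AUDIBLE",
                "FIRE DEPARTMENT REQUEST FOR PD"]:
    print(f"{example:35} -> {categorize(example)}")
```

With a pandas column this would be applied the same way as the original function, for example `df['CALL_TYPE'].apply(categorize)`.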


# 5. Final Schema Synchronization

To allow for direct comparison, we rename the key SJ columns to match the names and intent of the SF dataset.

In [19]:
print("Harmonizing column names for comparison...")

rename_map = {
    'CALL_TYPE': 'Incident_Category',
    'FINAL_DISPO': 'Resolution',
    'Clean_Address': 'Address_Clean',
    'CITY': 'City'
}

# Only rename columns that exist
cols_to_rename = {k: v for k, v in rename_map.items() if k in df.columns}
df.rename(columns=cols_to_rename, inplace=True)

print(f"SJ columns renamed.")

Harmonizing column names for comparison...
SJ columns renamed.


In [20]:
print("\nStarting final schema synchronization...")
print(f"Columns before cleanup: {len(df.columns)}")

# Define final columns to keep
columns_to_keep = {
    # Key Fields
    'Incident_High_Level_Category': 'Incident_High_Level_Category',
    'Resolution': 'Resolution',
    'Address_Clean': 'Neighborhood',
    'City': 'Police_District',   
    
    # Temporal Features
    'Hour': 'Hour',
    'Day': 'Day',
    'Month': 'Month',
    'Year': 'Year',
    'Day of Week': 'Day_of_Week',
    'Day of Week Name': 'Day_of_Week_Name',
    'Month Name': 'Month_Name',
    'Quarter': 'Quarter',
    'Is Weekend': 'Is_Weekend'
}

# Filter DataFrame
df_final = df[columns_to_keep.keys()].copy()

# Rename columns for perfect sync
df_final.rename(columns=columns_to_keep, inplace=True)

print(f"Columns after cleanup: {len(df_final.columns)}")
print("Final column list:")
print(df_final.columns.tolist())


Starting final schema synchronization...
Columns before cleanup: 18
Columns after cleanup: 13
Final column list:
['Incident_High_Level_Category', 'Resolution', 'Neighborhood', 'Police_District', 'Hour', 'Day', 'Month', 'Year', 'Day_of_Week', 'Day_of_Week_Name', 'Month_Name', 'Quarter', 'Is_Weekend']
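
Since the goal is a schema that lines up with the San Francisco dataset, a small check that reports missing or unexpected columns can catch drift between the two notebooks. The sketch below uses the column list printed above as the expected schema; the helper `schema_diff` and the `Latitude` column in the demo are hypothetical.

```python
import pandas as pd

# Column list taken from the notebook output above
EXPECTED_COLUMNS = [
    "Incident_High_Level_Category", "Resolution", "Neighborhood",
    "Police_District", "Hour", "Day", "Month", "Year", "Day_of_Week",
    "Day_of_Week_Name", "Month_Name", "Quarter", "Is_Weekend",
]


def schema_diff(frame, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to `expected`."""
    actual = list(frame.columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# Demo: an empty frame that lacks one expected column and has one extra
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1] + ["Latitude"])
missing, unexpected = schema_diff(toy)
print("Missing:", missing)
print("Unexpected:", unexpected)
```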


# 6. Data Quality Check

In [21]:
# Final data quality summary
print("=" * 60)
print("FINAL CLEAN DATASET SUMMARY")
print("=" * 60)
print(f"Total Rows: {len(df_final):,}")
print(f"Total Columns: {len(df_final.columns)}")
print(f"Date Range: {df_final.index.min()} to {df_final.index.max()}")
print(f"Missing Values: {df_final.isnull().sum().sum()}")  # Expected to be 0 after dropna()
print(f"Duplicate Rows: {df_final.duplicated().sum()}")
print("\nColumn List:")
print(df_final.columns.tolist())

FINAL CLEAN DATASET SUMMARY
Total Rows: 1,168,824
Total Columns: 13
Date Range: 2018-01-01 00:00:02 to 2025-11-08 23:55:37
Missing Values: 0
Duplicate Rows: 2214

Column List:
['Incident_High_Level_Category', 'Resolution', 'Neighborhood', 'Police_District', 'Hour', 'Day', 'Month', 'Year', 'Day_of_Week', 'Day_of_Week_Name', 'Month_Name', 'Quarter', 'Is_Weekend']


# 7. Save Processed Data

Save the clean dataset for use by team members in downstream analysis.

In [22]:
# Create processed data directory if it doesn't exist
output_dir = Path('../data/processed')
output_dir.mkdir(parents=True, exist_ok=True)

# Save to CSV
output_path = output_dir / 'sj_calls_cleaned.csv'
df_final.to_csv(output_path)

print(f"✅ Clean dataset saved to: {output_path}")
print(f"File size: {output_path.stat().st_size / (1024*1024):.2f} MB")

✅ Clean dataset saved to: ../data/processed/sj_calls_cleaned.csv
File size: 133.80 MB


---

## Note

**For Team Members:**
- Load this clean dataset using: `pd.read_csv('../data/processed/sj_calls_cleaned.csv', index_col='Incident DateTime', parse_dates=True)`
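
To illustrate what the load call above returns, the sketch below writes a tiny made-up frame with the same index name to a temporary file and reads it back. It only demonstrates the round trip; the two columns are placeholders, not the full schema.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up two-row frame using the same index name as the saved dataset
idx = pd.to_datetime(
    ["2018-01-01 00:00:02", "2018-01-01 00:00:15"]
).rename("Incident DateTime")
toy = pd.DataFrame({"Hour": [0, 0], "Is_Weekend": [0, 0]}, index=idx)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sj_calls_cleaned.csv"
    toy.to_csv(path)  # the index is written as the first column
    loaded = pd.read_csv(path, index_col="Incident DateTime", parse_dates=True)

# parse_dates=True parses the index column, restoring a DatetimeIndex
print(type(loaded.index).__name__)
print(loaded)
```

Without `parse_dates=True` the index would come back as plain strings, and datetime accessors such as `.index.hour` would not be available.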