# Initial Data Exploration - Seattle Crime Data

**Objective**: Load the Seattle Police Department (SPD) crime data and perform an initial exploration

**Author**: Seattle Crime Analysis Team  
**Date**: 2026-02-06  
**Data Source**: [Seattle Open Data Portal - SPD Crime Data](https://data.seattle.gov/Public-Safety/SPD-Crime-Data-2008-Present/tazs-3rd5)

## Notebook Outline
1. Load raw data
2. Initial data inspection
3. Missing values analysis
4. Basic statistical summary
5. Temporal analysis (date/time patterns)
6. Spatial analysis (coordinate validation)
7. Crime type distribution
8. Data quality assessment
9. Initial findings and next steps

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries loaded successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 1. Load Raw Data

Load the Seattle crime data from the raw data directory.

In [None]:
# Define data path
data_path = Path('../../data/raw/spd_crime_data.csv')

# Check if file exists
if not data_path.exists():
    print(f"❌ Data file not found at: {data_path}")
    print("\nPlease download the data using one of these methods:")
    print("1. Manual: https://data.seattle.gov/Public-Safety/SPD-Crime-Data-2008-Present/tazs-3rd5")
    print("2. Script: python scripts/download_data.py")
    print("3. API: See SETUP.md for instructions")
    # Stop here: every later cell assumes `df` exists
    raise FileNotFoundError(data_path)
else:
    print(f"✅ Data file found at: {data_path}")
    
    # Load data
    print("\nLoading data...")
    df = pd.read_csv(data_path)
    
    print("✅ Data loaded successfully!")
    print(f"   Records: {len(df):,}")
    print(f"   Columns: {len(df.columns)}")
    print(f"   Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

## 2. Initial Data Inspection

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Display column information
print("Dataset Information:")
df.info()

In [None]:
# Display column names
print("Column Names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

## 3. Missing Values Analysis

In [None]:
# Calculate missing values
missing_stats = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percent': (df.isnull().sum() / len(df) * 100).round(2),
    'Data_Type': df.dtypes
})

# Sort by missing percentage
missing_stats = missing_stats.sort_values('Missing_Percent', ascending=False)

print("Missing Values Summary:")
print(missing_stats[missing_stats['Missing_Count'] > 0])

# Visualize missing values
if missing_stats['Missing_Count'].sum() > 0:
    fig, ax = plt.subplots(figsize=(10, 6))
    missing_cols = missing_stats[missing_stats['Missing_Count'] > 0]
    ax.barh(missing_cols['Column'], missing_cols['Missing_Percent'])
    ax.set_xlabel('Missing Percentage (%)')
    ax.set_title('Missing Values by Column')
    plt.tight_layout()
    plt.show()

## 4. Basic Statistical Summary

In [None]:
# Numerical columns summary
print("Statistical Summary:")
df.describe()

In [None]:
# Categorical columns summary
print("Categorical Columns Summary:")
df.describe(include=['object'])

## 5. Temporal Analysis

Analyze date/time patterns in the crime data.

In [None]:
# Identify date/time columns (adjust based on actual column names)
date_columns = [col for col in df.columns if 'date' in col.lower() or 'time' in col.lower()]
print(f"Date/Time columns found: {date_columns}")

# TODO: Parse dates and analyze temporal patterns
# Example:
# df['offense_start_datetime'] = pd.to_datetime(df['offense_start_datetime'])
# df['year'] = df['offense_start_datetime'].dt.year
# df['month'] = df['offense_start_datetime'].dt.month
# df['day_of_week'] = df['offense_start_datetime'].dt.day_name()
# df['hour'] = df['offense_start_datetime'].dt.hour
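
The TODO above can be sketched as follows. This is a minimal illustration, not the final implementation: it assumes the timestamp column is named `offense_start_datetime` (as in the commented example), and both the sample rows and the helper `add_temporal_features` are hypothetical. Check the names against the output of the column listing before reusing it on `df`.

```python
import pandas as pd

# Hypothetical sample standing in for the real data; the column name
# 'offense_start_datetime' is an assumption taken from the TODO above.
sample = pd.DataFrame({
    'offense_start_datetime': [
        '2020-02-05 10:10:00',
        '2020-02-08 23:45:00',
        'not a date',  # malformed value
        None,          # missing value
    ]
})


def add_temporal_features(frame, column):
    """Return a copy with parsed datetimes and derived calendar columns."""
    out = frame.copy()
    # errors='coerce' converts unparseable values to NaT instead of raising
    out[column] = pd.to_datetime(out[column], errors='coerce')
    out['year'] = out[column].dt.year
    out['month'] = out[column].dt.month
    out['day_of_week'] = out[column].dt.day_name()
    out['hour'] = out[column].dt.hour
    return out


parsed = add_temporal_features(sample, 'offense_start_datetime')
n_unparsed = int(parsed['offense_start_datetime'].isna().sum())
print(parsed)
print(f"Unparseable or missing timestamps: {n_unparsed}")
```

Counting the `NaT` values after parsing gives a quick measure of how many timestamps were missing or malformed, which feeds directly into the data quality section.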

## 6. Spatial Analysis

Validate and explore geographic coordinates.

In [None]:
# Identify coordinate columns
coord_columns = [col for col in df.columns if 'lat' in col.lower() or 'lon' in col.lower()]
print(f"Coordinate columns found: {coord_columns}")

# TODO: Validate coordinates and check for outliers
# Seattle approximate bounds:
# Latitude: 47.4 to 47.8
# Longitude: -122.5 to -122.2
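
One way to approach the coordinate TODO is sketched below, using the approximate bounds listed above. The column names `latitude` and `longitude`, the sample rows, and the helper `flag_coordinates` are all assumptions made for illustration; adjust them to whatever `coord_columns` reports.

```python
import pandas as pd

# Approximate Seattle bounds taken from the notes above
LAT_MIN, LAT_MAX = 47.4, 47.8
LON_MIN, LON_MAX = -122.5, -122.2

# Hypothetical sample; 'latitude' and 'longitude' are assumed column names
sample = pd.DataFrame({
    'latitude': [47.61, 0.0, 47.95, None],
    'longitude': [-122.33, 0.0, -122.30, -122.31],
})


def flag_coordinates(frame, lat_col='latitude', lon_col='longitude'):
    """Return a copy with boolean flags for missing and out-of-bounds points."""
    out = frame.copy()
    out['coord_missing'] = out[lat_col].isna() | out[lon_col].isna()
    # Series.between is inclusive on both ends and returns False for NaN
    in_bounds = (
        out[lat_col].between(LAT_MIN, LAT_MAX)
        & out[lon_col].between(LON_MIN, LON_MAX)
    )
    # Report missing values separately from genuine out-of-bounds values
    out['coord_out_of_bounds'] = ~in_bounds & ~out['coord_missing']
    return out


flagged = flag_coordinates(sample)
print(flagged)
print(f"Missing coordinates: {flagged['coord_missing'].sum()}")
print(f"Out-of-bounds coordinates: {flagged['coord_out_of_bounds'].sum()}")
```

Flagging rows instead of dropping them keeps the decision about how to treat suspicious points (for example a (0, 0) pair, which may be a placeholder) for the cleaning notebook.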

## 7. Crime Type Distribution

In [None]:
# Identify crime type columns
crime_columns = [col for col in df.columns if 'offense' in col.lower() or 'crime' in col.lower()]
print(f"Crime-related columns found: {crime_columns}")

# TODO: Analyze distribution of crime types
# Example:
# crime_counts = df['offense'].value_counts()
# print(crime_counts.head(20))
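
The distribution analysis in the TODO could look roughly like this. The column name `offense` comes from the commented example and may not match the real data; the sample values and the helper `summarize_categories` are hypothetical.

```python
import pandas as pd

# Hypothetical sample; 'offense' is the column name assumed in the TODO above
sample = pd.DataFrame({
    'offense': ['Theft', 'Theft', 'Burglary', 'Theft', 'Assault', 'Burglary', None]
})


def summarize_categories(series, top_n=20):
    """Return counts and percentage share for the top_n most frequent values."""
    # dropna=False keeps missing values visible as their own category
    counts = series.value_counts(dropna=False)
    summary = counts.to_frame('count').assign(
        percent=lambda d: (d['count'] / d['count'].sum() * 100).round(2)
    )
    return summary.head(top_n)


summary = summarize_categories(sample['offense'])
print(summary)
```

Reporting the percentage share next to the raw count makes it easier to see whether a handful of categories dominates the dataset.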

## 8. Data Quality Assessment

Document any data quality issues found.

In [None]:
# Data quality checks
print("Data Quality Summary:")
print(f"\n1. Total Records: {len(df):,}")
print(f"2. Duplicate Records: {df.duplicated().sum():,}")
print(f"3. Columns with Missing Data: {(df.isnull().sum() > 0).sum()}")
print(f"4. Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# TODO: Add more quality checks based on data characteristics
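
A few additional checks that might be worth adding are sketched below. Which checks are relevant depends on the actual schema, so treat this as a starting point: the column names `offense_id`, `offense_start_datetime`, and `report_datetime`, the sample rows, and the helper `quality_report` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample; all three column names are assumptions
sample = pd.DataFrame({
    'offense_id': [1, 2, 2, 3],
    'offense_start_datetime': pd.to_datetime([
        '2020-01-01 08:00', '2020-01-02 09:00',
        '2020-01-02 09:00', '2020-01-05 12:00',
    ]),
    'report_datetime': pd.to_datetime([
        '2020-01-01 10:00', '2020-01-02 09:30',
        '2020-01-02 09:30', '2020-01-04 12:00',
    ]),
})


def quality_report(frame, id_col, start_col, report_col):
    """Collect a few simple consistency checks into a dictionary."""
    return {
        # Identifier values that appear more than once
        'duplicate_ids': int(frame[id_col].duplicated().sum()),
        # Reports dated earlier than the offense they describe
        'reported_before_start': int((frame[report_col] < frame[start_col]).sum()),
        # Columns holding a single value carry no information
        'constant_columns': [
            col for col in frame.columns
            if frame[col].nunique(dropna=False) <= 1
        ],
    }


report = quality_report(
    sample, 'offense_id', 'offense_start_datetime', 'report_datetime'
)
for check, result in report.items():
    print(f"{check}: {result}")
```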

## 9. Initial Findings

**Summary of Key Observations:**

1. **Data Volume**: [To be completed after analysis]
2. **Data Quality**: [To be completed after analysis]
3. **Temporal Coverage**: [To be completed after analysis]
4. **Spatial Coverage**: [To be completed after analysis]
5. **Crime Types**: [To be completed after analysis]

**Data Quality Issues Identified:**
- [List issues found]

**Next Steps:**
1. Clean and preprocess data based on findings
2. Handle missing values appropriately
3. Validate and correct spatial coordinates
4. Standardize date/time formats
5. Create processed dataset for analysis

---

**Notebook Status**: Template - Ready for data exploration  
**Next Notebook**: `02_data_cleaning_preprocessing.ipynb`