# Chicago Crime Data Exploration

This notebook explores the content and structure of the Chicago crime dataset to understand:
- Data quality and completeness
- Crime patterns and trends
- Geographic distribution
- Temporal patterns
- Data relationships and correlations

## Dataset Information
- Source: Chicago Data Portal (data.cityofchicago.org)
- Dataset: Crimes 2023
- URL: https://data.cityofchicago.org/Public-Safety/Crimes-2023/xguy-4ndq

## Exploration Goals
1. **Data Quality Assessment**: Identify missing values, duplicates, and data inconsistencies
2. **Crime Type Analysis**: Understand the distribution of different crime categories
3. **Temporal Patterns**: Analyze crime trends over time (monthly, daily, hourly)
4. **Geographic Analysis**: Explore crime distribution across Chicago districts and neighborhoods
5. **Spatial Analysis**: Examine coordinate data and geographic patterns
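
As a rough illustration of the first two goals, the sketch below runs the same kinds of checks on a tiny hand-made frame. The rows are invented for illustration and are not taken from the Crimes 2023 file; only the column names `ID`, `Primary Type`, and `Latitude` follow the dataset's schema.

```python
import pandas as pd

# Toy rows, invented for illustration only (not real records).
toy = pd.DataFrame({
    "ID": ["1", "2", "2", "3"],
    "Primary Type": ["THEFT", "BATTERY", "BATTERY", "THEFT"],
    "Latitude": ["41.88", "41.75", "41.75", None],
})

# Goal 1: missing values per column and fully duplicated rows
missing_per_column = toy.isna().sum()
duplicate_rows = int(toy.duplicated().sum())

# Goal 2: distribution of crime categories
type_counts = toy["Primary Type"].value_counts()

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
print(type_counts)
```

On the real data the same calls would run against the loaded frame instead of `toy`; whether duplicates should be judged on all columns or only on `ID` or `Case Number` is a choice left open here.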


## Step 1: Setup and Data Loading

Load the necessary libraries and import the Chicago crime dataset using Dask for efficient processing.


In [1]:
# Import necessary libraries
import dask.dataframe as dd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")


Libraries imported successfully!


In [4]:
# Load Chicago crime data
file_path = "/home/akash2016/dask-CSE255/chicago_crimes/Crimes_2023.csv"

# Lazy data load with Dask: nothing is read until .compute() is called
ddf = dd.read_csv(
    file_path,
    assume_missing=True,      # helps with mixed integer/float columns
    dtype=str,                # start with string to inspect columns safely
    blocksize="64MB"          # each partition ~64MB
)

# npartitions exists only on the lazy Dask DataFrame, so report it before computing
print(f"Number of partitions: {ddf.npartitions}")

# Materialize as an in-memory pandas DataFrame for the exploration below
df = ddf.compute()

print("Data loaded successfully!")
print(f"In-memory size: {df.memory_usage(deep=True).sum() / 1024**3:.2f} GB")

In [5]:

# Show all columns in the dataset
print("Columns in the dataset:")
print(df.columns.tolist())

# Map column names to their meanings.
# This mapping is based on the standard Chicago crime data schema.
chicago_crime_column_meanings = {
    "ID": "Unique identifier for the record",
    "Case Number": "Chicago Police Department case number",
    "Date": "Date and time of the incident",
    "Block": "Approximate address where the incident occurred",
    "IUCR": "Illinois Uniform Crime Reporting code",
    "Primary Type": "Primary description of the crime (e.g., THEFT, BATTERY)",
    "Description": "Detailed description of the crime",
    "Location Description": "Type of location where the crime occurred",
    "Arrest": "Whether an arrest was made (True/False)",
    "Domestic": "Whether the incident was domestic-related (True/False)",
    "Beat": "The police beat where the incident occurred",
    "District": "Police district where the incident occurred",
    "Ward": "City council ward where the incident occurred",
    "Community Area": "Chicago community area code",
    "FBI Code": "FBI crime classification code",
    "X Coordinate": "Spatial coordinate (state plane)",
    "Y Coordinate": "Spatial coordinate (state plane)",
    "Year": "Year of the incident",
    "Updated On": "Date and time the record was last updated",
    "Latitude": "Latitude of reported incident",
    "Longitude": "Longitude of reported incident",
    "Location": "Point object in (latitude, longitude) format"
}
print("\nColumn meanings (if present in dataset):")
for col in df.columns:
    if col in chicago_crime_column_meanings:
        print(f"{col}: {chicago_crime_column_meanings[col]}")
    else:
        print(f"{col}: (Description not available)")




Columns in the dataset:
['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location']

Column meanings (if present in dataset):
ID: Unique identifier for the record
Case Number: Chicago Police Department case number
Date: Date and time of the incident
Block: Approximate address where the incident occurred
IUCR: Illinois Uniform Crime Reporting code
Primary Type: Primary description of the crime (e.g., THEFT, BATTERY)
Description: Detailed description of the crime
Location Description: Type of location where the crime occurred
Arrest: Whether an arrest was made (True/False)
Domestic: Whether the incident was domestic-related (True/False)
Beat: The police beat where the incident occurred
District: Police district where the incident occurred
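Ward: City council ward where the incident occurred
Community Area: Chicago community area code
FBI Code: FBI crime classification code
X Coordinate: Spatial coordinate (state plane)
Y Coordinate: Spatial coordinate (state plane)
Year: Year of the incident
Updated On: Date and time the record was last updated
Latitude: Latitude of reported incident
Longitude: Longitude of reported incident
Location: Point object in (latitude, longitude) format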

In [6]:
df.shape

(263137, 22)
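
Because every column was read as a string, date, boolean, and coordinate fields need converting before any temporal or spatial analysis. The sketch below shows one way to do that on a small invented frame. The `%m/%d/%Y %I:%M:%S %p` timestamp format and the `true`/`false` spelling are assumptions about how the file encodes these fields and should be checked against the actual values first.

```python
import pandas as pd

# Toy rows, invented for illustration; all values are strings, as with dtype=str.
toy = pd.DataFrame({
    "Date": [
        "01/15/2023 02:30:00 PM",
        "01/15/2023 11:05:00 PM",
        "02/01/2023 02:45:00 PM",
    ],
    "Arrest": ["true", "false", "false"],
    "Latitude": ["41.88", None, "41.90"],
})

# Parse timestamps with an explicit format (assumed; adjust if the file differs).
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Map boolean-like strings to real booleans.
toy["Arrest"] = toy["Arrest"].str.lower().map({"true": True, "false": False})

# Convert coordinates to floats; missing or unparseable values become NaN.
toy["Latitude"] = pd.to_numeric(toy["Latitude"], errors="coerce")

# Incidents per hour of day, a first step toward the hourly pattern analysis.
incidents_per_hour = toy["Date"].dt.hour.value_counts().sort_index()

print(toy.dtypes)
print(incidents_per_hour)
```

Passing an explicit `format` avoids per-row format inference and makes a mismatched timestamp fail loudly instead of being parsed incorrectly.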