# Crime Hotspot Mapping and Prediction Accuracy Index (PAI)


> Chainey, S., Tompson, L., & Uhlig, S. (2008b). The utility of hotspot mapping for predicting spatial patterns of crime. Security Journal, 21(1–2), 4–28. [10.1057/palgrave.sj.8350066](https://doi.org/10.1057/palgrave.sj.8350066)


## 🎯 Objectives

- Explore a real crime data set with Python and GeoPandas.  
- Create different types of **hotspot maps** (districts, beats, hexagonal grid, and clusters).  
- Implement and interpret the **Prediction Accuracy Index (PAI)** as proposed in crime analysis research (Chainey, Tompson & Uhlig, 2008).  
- Reflect on how different hotspot methods perform at predicting where crime will occur next.

### Before you start

Answer these short questions in your own words (just a sentence each):

1. What do you think a **crime hotspot** is?  
2. Why might police care about **predicting** where crime will happen, rather than only mapping where it already happened?  
3. What is one potential **risk** of basing decisions on hotspot maps?

## Part 0: Setup

- (Optionally) install missing packages.  
- Import the Python libraries used throughout the notebook.  
- Define some **coordinate reference systems (CRS)** and file paths.

### Data

For reproducibility, use an extract of the **Chicago “Crimes – 2001 to Present”** dataset
from the City of Chicago Open Data Portal.

1. Open the City of Chicago Open Data Portal.  
2. Search for **[Crimes – 2001 to Present](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/about_data)**.  
3. Filter (for example) to the year 2025 and export as CSV.  
4. Save it locally as: `data/chicago_crimes_sample.csv`.
5. Also download the district and beat boundaries of Chicago.

> You may choose a different filename or time period; just remember to update the path
> and filters in the notebook.


In [None]:
# OPTIONAL: install missing libraries in your environment
# Run this cell **only if** you get ImportError messages below.
# Remove the leading `#` before %pip to actually install.

# %pip install pandas geopandas shapely numpy matplotlib plotly keplergl h3 scikit-learn pysal splot

In [None]:
# Core data handling
import pandas as pd        
import numpy as np         # numerical operations and arrays
import geopandas as gpd
from shapely import wkt
from shapely.geometry import Polygon

# Visualisation
from keplergl import KeplerGl  
import plotly.express as px    # bar charts and maps used in Part 2

# Spatial analysis
import h3 # Hexagonal grid indexing
from sklearn.cluster import DBSCAN # Clustering for hotspot detection

# Spatial statistics (need to install pysal)
import pysal.lib as ps # PySAL 
from pysal.explore import esda # Moran's I, Local Moran's I, and Local Getis-Ord G
from splot.esda import moran_scatterplot # visualize moran


In [None]:
# Define constants

# Global CRS settings
GLOBAL_CRS = "EPSG:4326"   # WGS84 geographic coordinates (lat/lon)
METRIC_CRS = "EPSG:3857"   # Web Mercator (units in metres, but distances and areas are inflated away from the equator) -> you can also use EPSG:5070 for a projected US Albers (equal-area)

# File paths (adapt if your files are in a subfolder like './data/...')
CRIMES_DATA_PATH = ___
DISTRICT_DATA_PATH = ___
BEATS_DATA_PATH = ___

## Part 1: Density and PAI estimation

### A. Load crime data and first exploration

We start by:

1. Loading the crime CSV file into a `pandas.DataFrame`.  
2. Dropping clearly redundant coordinate columns (if present).  
3. Removing records with missing coordinates.  
4. Converting the table to a `GeoDataFrame`.

In [None]:
# Load crime data
crimes = pd.read_csv(CRIMES_DATA_PATH)

print("Original number of rows:", ___)
print("Columns:", list(___))

In [None]:
# Drop redundant coordinate columns if present
cols_to_drop = [___]
df_crimes = ___.drop(columns=___)

In [None]:
# Drop rows with missing lat/lon
number_of_rows_before = ___
df_crimes = df_crimes.dropna(subset=[___])
number_of_rows_after = ___
print("Missing coordinates (%):", ___)

In [None]:
# Quick descriptive statistics
df_crimes.___()


In [None]:
# Display the interactive visualization with Kepler

___

> ⁉️ How might **data quality issues** (missing coordinates, mis-typed districts, wrong timestamps) influence your hotspot analysis and any decisions based on it?
>
> ⁉️ After exploring the map: do you see any **clearly visible clusters** or corridors of crime?  
>
> ⁉️ How might the **underlying urban structure** (roads, land use, transport) explain what you see?

### B. Train–test split: preparing for prediction evaluation

To evaluate hotspot maps as **predictors**, we follow the general idea from Chainey et al. (2008):

- Use **historical crime data** (training period) to build hotspot maps.  
- Use **future crime data** (test period) to evaluate how many test crimes fall inside the hotspots.

Here we use a **simple split**:

- Training set: all crimes **before** the last 7 days.  
- Test set: crimes in the **last 7 days** of the data.
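
The split described above can be sketched on toy data. This is only an illustration under stated assumptions: the `toy` DataFrame and its 30 daily timestamps are hypothetical and stand in for the Chicago extract.

```python
import pandas as pd

# Toy data: one hypothetical incident per day for 30 days (not the Chicago extract).
toy = pd.DataFrame({"Date": pd.date_range("2025-01-01", periods=30, freq="D")})

# Everything after this cutoff belongs to the test period.
cutoff = toy["Date"].max() - pd.Timedelta(days=7)

# Boolean mask: True for rows in the final 7 days.
is_test = toy["Date"] > cutoff

toy_test = toy.loc[is_test].copy()
toy_train = toy.loc[~is_test].copy()   # `~` inverts the mask

print(len(toy_train), "training rows,", len(toy_test), "test rows")
```

The important property is that every training record is older than every test record, so the hotspot maps never "see" the crimes they are evaluated on.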

In [None]:
# Ensure 'Date' is a proper datetime column
df_crimes["Date"] = pd.to_datetime(___)

max_date = ___
last_week_start_date = max_date - pd.Timedelta(days=7)

# To split the dataframe, we will use a mask. A mask is a boolean condition (test) evaluated for every row

mask_test = df_crimes["Date"] > last_week_start_date # here we want to keep all the rows whose date is later than last_week_start_date

df_test = df_crimes.loc[mask_test].copy()
df_train = df_crimes.loc[___].copy() # here we want the opposite of our test mask. In pandas you can use `~` to get the opposite of a boolean condition 

print(f"Training set: {___} crimes")
print(f"Test set: {___} crimes")

print("Training period:", ___, "to", ___)
print("Test period:", ___, "to", ___)

### C. Loading district and beat boundaries

We now load the polygon boundaries for:

- **Districts** (larger administrative areas)  
- **Beats** (smaller operational policing units)

Both are stored as CSVs with a WKT geometry column called `the_geom`.

---

##### 📚 Concept: administrative units and the MAUP

Administrative units (districts, beats) are **not neutral**:

- Their size and shape are products of history, politics, and operational needs.  
- Changing the boundaries can change the **appearance** of a hotspot map.  
- This relates to the **modifiable areal unit problem (MAUP)**: results can change when you change the zoning or aggregation.
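
A small one-dimensional sketch can make the MAUP concrete. The incident positions and both zonings below are hypothetical; the only point is that the same events produce different counts when the boundaries move.

```python
import numpy as np

# Hypothetical incident positions along a 12 km street (km from the start).
positions = np.array([0.5, 1.5, 2.5, 5.2, 5.6, 5.9, 6.1, 6.4, 6.8, 10.5])

# Zoning A: three 4 km zones. Zoning B: same zone width, boundaries shifted by 2 km.
zoning_a = np.array([0, 4, 8, 12])
zoning_b = np.array([-2, 2, 6, 10, 14])

counts_a, _ = np.histogram(positions, bins=zoning_a)
counts_b, _ = np.histogram(positions, bins=zoning_b)

print("Zoning A counts:", counts_a.tolist())   # [3, 6, 1]
print("Zoning B counts:", counts_b.tolist())   # [2, 4, 3, 1]
```

Under zoning A the concentration between 5 and 7 km appears as one clear hot zone; under zoning B the same concentration is split across two zones and looks less intense.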

In [None]:
# Load districts
df_districts = ___

# Convert WKT to geometry
df_districts["geometry"] = df_districts["the_geom"].apply(wkt.loads)
gdf_districts = gpd.GeoDataFrame(___, geometry=___, crs=___)

# A cool addition to make you work on your conflict resolution

print("Number of Districts:", ___)

# Quick plot
gdf_districts.___

In [None]:
# Load beats
___

### D. Thematic mapping of geographic boundary areas

One common hotspot method is to **count crimes per administrative unit** and shade polygons by that count.

We will:

1. Aggregate crimes to **districts** and map them.  
2. Aggregate crimes to **beats** and map them.

This corresponds to “thematic mapping of geographic boundary areas” in the paper.

--- 

##### 📚 Concept: choropleth maps

A **choropleth map** colours polygons based on a value (here: crime count).  

**Advantages:**

- Very simple and widely understood.  
- Easy to compute and update regularly.

**Disadvantages:**

- Sensitive to how boundaries are drawn (MAUP).  
- Does not show within-unit variation.
- Large polygons with small populations can look very "hot".  

Keep these pros and cons in mind when interpreting your maps.
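
One way to see the size effect is to compare raw counts with counts per unit area. The sketch below uses made-up units, counts, and areas; it is not part of the paper's method, just a quick sanity check you may find useful when reading a choropleth.

```python
import pandas as pd

# Hypothetical units: crime counts and areas are invented for illustration.
units = pd.DataFrame({
    "unit": ["A", "B", "C"],
    "crimes_count": [120, 100, 30],
    "area_km2": [12.0, 2.0, 1.5],
})

# Crimes per km²: a simple way to compare units of different sizes.
units["crimes_per_km2"] = units["crimes_count"] / units["area_km2"]

rank_by_count = units.sort_values("crimes_count", ascending=False)["unit"].tolist()
rank_by_density = units.sort_values("crimes_per_km2", ascending=False)["unit"].tolist()

print("Ranking by count:  ", rank_by_count)
print("Ranking by density:", rank_by_density)
```

Unit A is the "hottest" polygon by count but the least intense by density, simply because it is much larger than the others.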

In [None]:
# Aggregate training crimes by district
crime_dist_counts = ( # we write everything in parentheses to make it more readable and write over several lines
    ___ # training data
    .groupby(___) # group by district
    .size() # count crimes per district
    .reset_index(name="crimes_count") # reset index and name the count column
)

# What does the aggregated data look like?
___


In [None]:
# Join to district polygons 
gdf_districts_with_crimes = ___.merge(
    ___,
    how="left",
    left_on="___",
    right_on="___"
)

# Replace missing counts (districts with no crimes) with 0
gdf_districts_with_crimes["crimes_count"] = gdf_districts_with_crimes["crimes_count"].fillna(0)


In [None]:
# let's show the result
___

> ⁉️ Which districts appear as the most intense hotspots?  
>
> ⁉️ Do these districts cover **large areas** or relatively small ones?  
>
> ⁉️ How might this affect resource allocation decisions?

In [None]:
# Do the same operation with the beats (it is the one we will use to calculate the PAI)
___

In [None]:
# 💡 You can already add 2 layers in your kepler map: 
# 1. The beats layer with their crime counts
# 2. The point test layer
# Try to visualize both layers at the same time!


> ⁉️ Do the same areas appear as hotspots at both scales?  
>
> ⁉️ Where do the two maps **disagree**?  
>
> ⁉️ Which map do you find more **actionable** for operational policing, and why?

### E. Grid / hexagonal hotspot mapping with H3

Another common approach is to impose a **regular grid** over the city and count crimes in each cell.

We will use:

- The **H3 hexagonal grid system** (https://h3geo.org)  
- Each crime point will be assigned to a hexagon.  
- We will then map the count of crimes per hexagon.

---

##### 📚 Concept: what is H3?

H3 is a **global hierarchical hexagonal grid system** originally developed at Uber. Key ideas:

- The world is divided into **hexagonal cells** at different resolutions.  
- Each cell has a unique **index** (e.g. `"882a1072b9fffff"`).  
- Higher resolutions → smaller hexagons (finer detail).  
- Neighbouring cells have similar size and shape, which avoids some biases from irregular polygons.

Why hexagons?

- All 6 neighbours of a hexagon share an edge with it and lie at the same distance from its centre, whereas a square has 4 edge neighbours and 4 more distant corner neighbours. This reduces directional bias.  
- They cover space more smoothly than many other shapes.  
- They are popular in spatial statistics and ecological modelling.

In hotspot analysis, H3 offers:

- A **neutral, regular** spatial unit (not tied to administrative boundaries).  
- Easy multi-scale analysis by changing the resolution.

We will map crimes to hexagons at a single resolution as a first step.
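
The neighbour argument for hexagons can be checked with a few lines of plain Python. This sketch assumes idealised, perfectly regular grids on a flat plane with unit spacing; real H3 cells are only approximately regular, so treat it as intuition rather than a statement about H3 itself.

```python
import math

# Centre-to-centre offsets for a unit square grid (edge and corner neighbours).
square_offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

# Centre-to-centre offsets for a hexagonal grid with unit spacing: six directions, 60° apart.
hex_offsets = [(math.cos(math.radians(60 * k)), math.sin(math.radians(60 * k))) for k in range(6)]

# Distinct neighbour distances, rounded to avoid floating-point noise.
square_dists = sorted({round(math.hypot(dx, dy), 6) for dx, dy in square_offsets})
hex_dists = sorted({round(math.hypot(dx, dy), 6) for dx, dy in hex_offsets})

print("square neighbour distances: ", square_dists)
print("hexagon neighbour distances:", hex_dists)
```

The square grid has two different neighbour distances (edge and diagonal), while all six hexagon neighbours sit at a single distance, which is why counts in adjacent hexagons are easier to compare.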

In [None]:
H3_RESOLUTION = 8  # smaller number = bigger hexagons

In [None]:
# We will define 2 functions: one for assigning a hexagon to each geometry, and one to extract the geometry of the hexagons
def get_hex_id(geometry, resolution=H3_RESOLUTION):
    """Assign a hexagon ID to a geometry."""
    if geometry is None or geometry.is_empty:
        return None
    return h3.latlng_to_cell(geometry.___, geometry.___, resolution)


def get_hex_geometry(hex_id):
    """Get the geometry of a hexagon from its ID."""

    boundary = h3.cell_to_boundary(hex_id) # h3 returns a sequence of (lat, lng) pairs, not a shapely geometry

    # swap to (lon, lat) for shapely
    coords = [(lon, lat) for lat, lon in boundary]
    
    return Polygon(coords)
    


In [None]:
# make a copy of your dataframe to keep the original clean
df_h3_train = df_train.copy()  # df_train is the training DataFrame created in section B

# transform our df into a GeoDataFrame 
gdf_h3_train = gpd.GeoDataFrame(___, geometry=gpd.points_from_xy(___), crs=___)

# assign a hex_id in each row based on your geometry

gdf_h3_train["h3_cell"] = gdf_h3_train[___].apply(___)


In [None]:
# Group and Count crimes per cell
h3_crimes = (
    ___
) # we first group the hex to optimize the calculation of the geometry

In [None]:
# Build hexagon geometries
h3_crimes["geometry"] = h3_crimes[___].apply(___)

# recreate a geodataframe
gdf_h3_crimes = ___

In [None]:
# let's show the result (h3 grid and points from test layer)
___

# What geometry is taken by default by Kepler? 

> ⁉️ Visually, do the last-week crimes tend to fall inside or outside the **densest hexagons**?  
>
> ⁉️ How does your visual impression compare with the PAI values you will compute later?

### F. DBSCAN clustering

As a simple unsupervised method, we can use **DBSCAN** to find clusters of crime points.

Steps:

1. Take a random sample of training crimes (to keep computation light).  
2. Project to a metric CRS (so distances are in metres).  
3. Run DBSCAN with an `eps` value in metres.  
4. Visualise the clusters.

---

##### 📚 Concept: what is DBSCAN?

DBSCAN (**D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise) is a clustering algorithm with two key parameters:

- **`eps`**: radius of the neighbourhood (here, in metres).  
- **`min_samples`**: minimum number of points required to form a dense region.

The algorithm:

1. For each point, count how many neighbours it has within distance `eps`.  
2. If it has at least `min_samples` neighbours (scikit-learn counts the point itself), it is a **core point**.  
3. Clusters are formed by connecting core points and their nearby neighbours.  
4. Points that are not part of any cluster are labelled as **noise** (`-1`).

Why DBSCAN is useful for crime analysis:

- It discovers clusters of **arbitrary shape** (not just circular).  
- It can identify **noise** (isolated incidents).  
- You do not need to choose the **number of clusters** in advance.

Limitations:

- Results are sensitive to the choice of `eps` and `min_samples`.  
- A single pair of parameters may not work well across all areas (dense city centre vs sparse suburbs).
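
To make the four algorithm steps tangible, here is a deliberately simple NumPy sketch of the DBSCAN idea. It is not the scikit-learn implementation used in the notebook: the function name `dbscan_sketch` and the toy points are hypothetical, it builds a full pairwise distance matrix (so it only suits small inputs), and it counts the point itself towards `min_samples`.

```python
import numpy as np

def dbscan_sketch(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Step 1: neighbours of each point within distance eps (including itself).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # Step 2: core points have at least min_samples neighbours.
    is_core = np.array([len(nb) >= min_samples for nb in neighbours])
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Step 3: grow a cluster outwards from an unvisited core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            j = stack.pop()
            if not is_core[j]:
                continue  # border points join a cluster but do not extend it
            for k in neighbours[j]:
                if labels[k] == -1:
                    labels[k] = cluster_id
                    stack.append(k)
        cluster_id += 1
    # Step 4: anything still labelled -1 is noise.
    return labels

# Two tight groups of hypothetical points plus one isolated incident.
toy_points = [(0, 0), (1, 0), (0, 1), (1, 1),
              (10, 10), (11, 10), (10, 11), (11, 11),
              (50, 50)]
toy_labels = dbscan_sketch(toy_points, eps=1.5, min_samples=3)
print(toy_labels.tolist())
```

Changing `eps` or `min_samples` in this toy example is a quick way to build intuition before experimenting with the real data.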

In [None]:
# DBSCAN can be very heavy on computation for large datasets
# Let's reduce our dataset to make it manageable

# Extract a subsample of our dataframe
dbscan_crimes = gdf_h3_train.sample(frac=0.1, random_state=42) # sample the training GeoDataFrame built in section E; frac=0.1 takes 10% of the data, random_state ensures reproducibility

# Project to a metric CRS for distance calculations
___

# DBSCAN takes a NumPy array of coordinates, not a dataframe. 
# Extract the coordinates of our sample
coords = np.array( # create a numpy array
            list( # convert to list
                zip( # zip x and y coordinates
                    ___, 
                    ___
                )
            )
        )

In [None]:
# We imported DBSCAN from scikit-learn, which provides a plug-and-play implementation
db = DBSCAN(
    eps=500,          # neighbourhood radius in metres, try different values 
    min_samples=50,   # minimum points to form a cluster, try different values
    n_jobs=-1,        # use all available cores
    metric="euclidean" # distance metric to use
)

# Fit the DBSCAN model
labels = db.fit_predict(coords)

# Add cluster labels to the sample dataframe
___

print("Unique cluster IDs (-1 = noise):", ___.unique())


In [None]:
# let's show the result (DBSCAN and points from test layer)
___

> ⁉️ What happens if you **increase** `eps` while keeping `min_samples` constant?  
>
> ⁉️ What happens if you **decrease** `min_samples` while keeping `eps` constant?  
>
> ⁉️ How do your identified clusters compare to the choropleth maps or the H3 hexagon areas you created earlier?

### G. Prediction Accuracy Index (PAI)

We now want to **evaluate** how good our hotspot maps are at predicting where crime will occur next.

Following Chainey et al. (2008), we use the **Prediction Accuracy Index (PAI)**:


$$\mathrm{PAI} = \frac{n / N}{a / A}$$


Where:

- n = number of **test crimes** that fall inside hotspot areas  
- N = total number of test crimes in the study area  
- a = area of hotspots (e.g. total area of selected hexagons / beats / clusters)  
- A = total area of the study area

Interpretation:

- **PAI = 1** → hotspot performs like random selection.  
- **PAI > 1** → hotspot is better than random (good).  
- **Higher PAI** → more crimes captured in a smaller area.  

We will:

1. Compute the **study area** (union of all districts).  
2. Turn the test-period crimes into a GeoDataFrame in the metric CRS.  
3. Define a helper function to compute PAI.  
4. Apply it to:
   - Top 15% beats by crime count  
   - Top 15% hexagons by crime count  
   - DBSCAN clusters
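
Before computing PAI on real data, it may help to work through the formula once with round numbers. All four inputs below are hypothetical and chosen only to keep the arithmetic easy.

```python
# Hypothetical inputs (not results from the Chicago data).
n_hits = 50           # test crimes inside hotspots (n)
n_total = 100         # all test crimes (N)
hotspot_km2 = 25.0    # hotspot area (a)
study_km2 = 200.0     # study area (A)

hit_rate = n_hits / n_total             # n / N = 0.5
area_share = hotspot_km2 / study_km2    # a / A = 0.125
pai = hit_rate / area_share             # 0.5 / 0.125 = 4.0

# If the "hotspot" were the whole city, it would capture every crime but PAI would be 1.
pai_whole_city = (n_total / n_total) / (study_km2 / study_km2)

print(f"PAI = {pai:.1f}, whole-city PAI = {pai_whole_city:.1f}")
```

Here half of the test crimes fall in one eighth of the area, so the hotspots are four times denser in crime than the study area as a whole.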

In [None]:
# Total study area (in metric CRS) -> we can use an aggregation of the districts
# first to calculate a metric value we need to change the CRS of our dataframe
gdf_districts_metric = ___
study_area = gdf_districts_metric.geometry.___


print(f"Total study area: {study_area/1e6:.2f} km²")


In [None]:
# Transform our test df in gdf (and change the crs)
___

# count the total number of crimes in our test set
total_test_crimes = ___

print(f"Total number of test crimes: {total_test_crimes}")

In [None]:
# let's create a helper function to compute PAI
def compute_pai(hits, hotspot_area, total_test_crimes, study_area):
    """Compute Prediction Accuracy Index (PAI) for a given set of hotspot polygons.

    hits: number of test crimes within hotspots
    hotspot_area: total area of hotspot polygons (in m²)
    total_test_crimes: total number of test crimes
    study_area: total area of study region (in m²)
    """

    # Hit rate (n / N)
    hit_rate = hits / total_test_crimes

    # Area percentage (a / A)
    area_pct = hotspot_area / study_area

    pai = (hit_rate / area_pct) if area_pct > 0 else np.nan

    return pai

#### G.1 PAI for beat-based hotspots

We will:

1. Normalise crime counts by the maximum.  
2. Select the **top 15%** beats.  
3. Compute PAI.
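
The normalise-then-threshold pattern can be tried on a toy table first. The 20 areas and their counts below are hypothetical; the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical crime counts for 20 areas (1, 2, ..., 20).
toy = pd.DataFrame({"area_id": range(20), "crimes_count": range(1, 21)})

# Normalise by the maximum so values lie in (0, 1]; this does not change the ranking.
toy["norm_crime"] = toy["crimes_count"] / toy["crimes_count"].max()

# The 0.85 quantile is the value below which 85% of areas fall.
threshold = toy["norm_crime"].quantile(0.85)

# Boolean mask selecting the areas at or above the threshold.
top = toy.loc[toy["norm_crime"] >= threshold]

print(len(top), "areas selected:", top["crimes_count"].tolist())
```

With 20 areas, the top 15% corresponds to 3 areas. On real data, ties at the threshold can make the selection slightly larger than 15%.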

In [None]:
# Copy to avoid modifying earlier GeoDataFrame
beats_hotspots = gdf_beats_with_crimes.copy()

# Normalize crime counts by maximum
beats_hotspots["norm_crime"] = ___

# Top 15% beats by crime intensity
threshold_beat = beats_hotspots["norm_crime"].quantile(0.85) # you can get the value at a specific quantile with .quantile()
# Create a mask to select top beats
mask_top_beat = ___
top_beats = beats_hotspots.loc[mask_top_beat].copy()

print("Number of beat hotspots:", ___)

In [None]:
# compute area of top beats
# change crs to metric
top_beats = ___
top_beats_area = top_beats.geometry.___

print(f"Total area of top beats: {top_beats_area/1e6:.2f} km²")


In [None]:
# Calculate the hits within the top beats area
beat_hits = gdf_test_metric.within(top_beats.geometry.unary_union).sum() # .unary_union combines all geometries into one (recent GeoPandas versions offer .union_all() for the same purpose), within() checks if points are within the union

print(f"Hits within top beats: {___}")


In [None]:
# Compute PAI
pai_beats = compute_pai(____)

print(f"BEATS PAI: {___:.2f}")

#### G.2 PAI for H3 hexagon hotspots

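The procedure mirrors the one used for the beats: select the top 15% of hexagons by crime count, then compute their total area, the hits, and the PAI.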

In [None]:
# Normalize the top h3 hotspots by area
# identify and extract the top h3

top_h3 = ___

print("Number of h3 hotspots:", ___)

In [None]:
# Compute the area of the top h3 hotspots

top_h3_area = ___

print(f"Total area of top h3 hexagons: {top_h3_area/1e6:.2f} km²")


In [None]:
# Calculate the hits within the top h3 area

h3_hits = ___ 

print(f"Hits within top h3: {___}")

In [None]:
# Compute PAI
pai_h3 = compute_pai(____)

print(f"H3 PAI: {___:.2f}")

#### G.3 PAI for DBSCAN clusters

To evaluate DBSCAN clusters as hotspots, we:

1. Construct one polygon per cluster (convex hull).  
2. Treat these polygons as hotspot areas.  
3. Compute PAI.
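
To see what a convex hull does to the hotspot area, here is a pure-Python sketch using Andrew's monotone chain algorithm and the shoelace formula. The notebook itself relies on Shapely's `convex_hull`; the helper names and the L-shaped toy cluster below are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive if o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical L-shaped cluster of incidents, coordinates in metres.
cluster = [(0, 0), (100, 0), (200, 0), (0, 100), (0, 200), (50, 50)]
hull = convex_hull(cluster)
hull_area = polygon_area(hull)
print(hull, hull_area)
```

The hull of this L-shaped cluster is a triangle of 20,000 m², much of which contains no incidents at all. Because the hotspot area `a` appears in the denominator of the area share, hulls around elongated or curved clusters can lower the PAI.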

In [None]:
# Create a dataframe with the cluster geometry 

clusters = []

for cid, group in dbscan_crimes.groupby("cluster_id"): # Loop over each cluster 
    # cid is the cluster ID
    # group is the set of crimes belonging to that cluster
    
    if cid == -1:
        continue  # skip noise

    hull = group.unary_union.convex_hull  # create one polygon per cluster

    clusters.append({
        "cluster_id": ___,
        "n_points": ___,
        "geometry": ___
    })

clusters_gdf = gpd.GeoDataFrame(___, geometry=___, crs=___)

top_clusters = ___
print("Number of cluster hotspots:", top_clusters)

In [None]:
# Compute the area of the top cluster hotspots

top_cluster_area = ___
print(f"Hotspot area: {top_cluster_area/1e6:.2f} km²")


In [None]:
# Calculate the hits within the cluster hotspot area

cluster_hits = ___ 

print(f"Hits within cluster hotspots: {___}")

In [None]:
# Compute PAI
pai_cluster = compute_pai(____)

print(f"DBSCAN PAI: {___:.2f}")

## Part 2: Basic exploratory data analysis (EDA)


### A. When is crime most frequent? (Temporal pattern by month)

💡 **Tips**: extracting the month
- Use the `pd.to_datetime` function to convert the `Date` column to a datetime object.
- Use the `dt.month` attribute to extract the month from the datetime object.
- Use the `value_counts` function to count the number of crimes per month.


In [None]:
# make a copy of your crimes data
eda_crimes = ___

# Extract month from date
eda_crimes['month'] = ___

# create a plotly bar chart by counting crimes per month with .value_counts()
px.bar(eda_crimes['month']___)


> ⁉️ Which months show the highest and lowest crime counts?

### B. What are the main crime types? (Categorical distribution)


In [None]:
type_counts = ___.value_counts()

# extract the 20 top crime categories and their counts 
top_20 = ___.head(___)

px.bar(___)

> ⁉️ Which crime types dominate the dataset?
>
> ⁉️ Are there any crime types you expected to see but are rare or missing?

### C. Grouping detailed crime types into broader categories

Right now we have many detailed crime types (e.g. “THEFT”, “BURGLARY”, “NARCOTICS”).
For some analyses, it is helpful to group them into broader categories. Sometimes you also need to translate existing categories into a different classification scheme.

💡 **Tips**: Mapping
- Create a Python dictionary that maps each detailed type to a broader (or new) category label.
- Use `.map()` to apply this mapping to the `Primary Type` column.
In [None]:
# Mapping categories

# create mapping of crime types to main categories (keeping only the 12 most important crime types)
main_categories = {
    # "old category": "new category"
    # Violence Against Persons
    "ASSAULT": "violence_against_person",
    "BATTERY": "violence_against_person",
    # Residential Burglary
    "BURGLARY": "burglary",
    # Thefts
    "THEFT": "theft",  
    "MOTOR VEHICLE THEFT": "theft",    
    "DECEPTIVE PRACTICE": "theft",
    "ROBBERY": "theft",
    # Drugs
    "NARCOTICS": "drug_offense",
    # Property Environmental/Damage
    "CRIMINAL DAMAGE": "prop_env_damage",
    "CRIMINAL TRESPASS": "prop_env_damage",
    # Other
    "OTHER OFFENSE": "Other",
    "WEAPONS VIOLATION": "Other"
}

# Only create main_category with the mapping
eda_crimes[___] = eda_crimes[___].map(___).fillna("non_assigned")



In [None]:
# How many non-assigned values do you have?
___

# drop all non-assigned values (💡 you can create a mask)
___

In [None]:
# Visualize counts of the broad main categories
px.bar(___)

> ⁉️ What are the pros and cons of working with fewer, broader categories instead of detailed types?

### D. How do crime categories shift in space over time? (Mean centres)

##### 📚 Concept: Mean centres (spatial average)

A **mean centre** is the geographic equivalent of the *average* in 1D statistics.

- In ordinary statistics you might compute the **mean value** of a list of numbers.  
- In spatial analysis, we can compute the **mean x-coordinate** and **mean y-coordinate** of a set of points.

If each crime event has coordinates $(x_i, y_i)$, the mean centre of $n$ events is $(\bar{x}, \bar{y}) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i,\ \frac{1}{n}\sum_{i=1}^{n} y_i\right)$.

In our case, we use **mean latitude** and **mean longitude** for each group, for example:
- One mean centre per **month** and **crime category**.

---

##### What does it mean in spatial analysis?

- It represents the **“centre of gravity”** of a set of events.  
- If the mean centre moves over time (e.g. month to month), this suggests a **shift in the typical location** of that crime type.  
- Comparing mean centres between categories (e.g. burglary vs theft) shows whether different crimes tend to be concentrated in **similar or different parts of the city**.

---

##### What the results could mean

- If the mean centre of burglary is consistently **north** of the mean centre of theft, this suggests that **burglary risk** is more concentrated in northern areas.  
- If the mean centres for a category **wander a lot** over months, the crime type might be **spatially more mobile** or dispersed.  
- If they hardly move, it suggests a more **stable core area** of risk.

⚠️ **Limitations**:  
The mean centre is very sensitive to **outliers** and does **not** tell you about the *spread* or *shape* of the distribution (only its central tendency).
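
The sensitivity to outliers is easy to demonstrate numerically. The sketch below uses hypothetical planar coordinates in metres and also computes the standard distance (the root-mean-square distance from the mean centre), one common way to summarise spread; both helper functions are illustrative names, not part of any library.

```python
import numpy as np

def mean_centre(xy):
    """Average x and average y of an (n, 2) array of coordinates."""
    return xy.mean(axis=0)

def standard_distance(xy):
    """Root-mean-square distance of the points from their mean centre."""
    centre = mean_centre(xy)
    return np.sqrt(((xy - centre) ** 2).sum(axis=1).mean())

# Four hypothetical incidents on the corners of a 2 m square.
points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])

# The same incidents plus one far-away outlier.
with_outlier = np.vstack([points, [[100.0, 100.0]]])

print("mean centre:", mean_centre(points), "spread:", standard_distance(points))
print("with outlier:", mean_centre(with_outlier), "spread:", standard_distance(with_outlier))
```

A single distant point drags the mean centre from (1, 1) to (20.8, 20.8), a location where no incident occurred, and the standard distance grows accordingly.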

---

##### 🚀 Task: Compute mean centres for each month and 3 highest crime categories


In [None]:
# Group by Month and main_category,
# compute average Latitude and Longitude for each combination

df_mean_center = (
    ___
    .groupby([___])[[___]] # what columns do we group? What columns do we keep? 
    .___ # which aggregating function we want to do?
    .reset_index()
)

# Visualize the head of the dataframe
df_mean_center.___

In [None]:
# Visualise month-by-month mean centres of key categories in KeplerGl
# You can filter the value in a dataframe column by using df[df['column'] == value]

___ 

> ⁉️ Do the mean centres for theft, violence, and burglary overlap, or are they in different parts of the city?
>
> ⁉️ Does any category seem to wander more over months (higher dispersion)? How could you quantify this dispersion? (Hint: distance from mean or variance of coordinates.)

### E. Is the spatial distribution of crime random? (Global Moran’s I)

##### 📚 Concept: Global Moran’s I (overall spatial autocorrelation)

**Global Moran’s I** is a measure of **spatial autocorrelation**: it tells us whether areas with similar values tend to be **near each other**.

In our case, the value is the **crime count per hexagon**.

- If high-count hexagons tend to be near other high-count hexagons,  
  and low-count hexagons tend to be near low-count hexagons,  
  we say the pattern is **positively autocorrelated** or **clustered**.
- If high values tend to be near low values, the pattern is **negatively autocorrelated** or **checkerboard-like**.
- If there is no clear pattern, the spatial distribution is close to **random**.

Moran’s I is roughly interpreted as:

- $I > 0$: similar values cluster together (positive autocorrelation).  
- $I \approx 0$: no spatial pattern (random-like).  
- $I < 0$: neighbouring values tend to be dissimilar (negative autocorrelation).

We also look at a **p-value** (usually from permutation tests) to decide if the observed I is **unlikely to occur by chance**.
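
To connect the interpretation to the arithmetic, here is a NumPy sketch of Global Moran's I on a tiny 4×4 grid. It is an illustration only: it uses binary rook-contiguity weights rather than the K-nearest-neighbour weights of this notebook, the helper names are hypothetical, and the permutation test is a simple one-sided version that may differ in detail from what `esda` reports.

```python
import numpy as np

def rook_weights(n_rows, n_cols):
    """Binary rook-contiguity matrix for a regular grid, cells numbered row by row."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if c + 1 < n_cols:          # right-hand neighbour
                w[i, i + 1] = w[i + 1, i] = 1
            if r + 1 < n_rows:          # neighbour below
                w[i, i + n_cols] = w[i + n_cols, i] = 1
    return w

def morans_i(values, w):
    """I = (n / sum of weights) * (z' W z) / (z' z), with z the deviations from the mean."""
    z = np.asarray(values, dtype=float) - np.mean(values)
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)

def permutation_p_value(values, w, n_perm=999, seed=0):
    """One-sided pseudo p-value: share of shuffles with I at least as large as observed."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, w)
    sims = np.array([morans_i(rng.permutation(values), w) for _ in range(n_perm)])
    return (np.sum(sims >= observed) + 1) / (n_perm + 1)

w = rook_weights(4, 4)
clustered = np.array([[0, 0, 1, 1]] * 4).ravel()            # left half low, right half high
checkerboard = np.indices((4, 4)).sum(axis=0).ravel() % 2   # alternating 0/1

print("clustered I:   ", round(morans_i(clustered, w), 3))
print("checkerboard I:", round(morans_i(checkerboard, w), 3))
print("clustered p:   ", permutation_p_value(clustered, w))
```

The two-block pattern gives a clearly positive I, while the checkerboard gives the most negative value possible on this grid. Shuffling the values across cells breaks the spatial structure, which is what the permutation p-value exploits.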

---

##### What it means in the context of spatial crime analysis

- A **positive and significant** Moran’s I (e.g. I = 0.3, p < 0.01) suggests that crime counts are **clustered**:  
  high-crime cells are near other high-crime cells → **hot areas** and **cold areas** exist.  
- A value near zero with a high p-value suggests that the pattern is **not distinguishable from random**.  
- A negative Moran’s I would indicate an alternating pattern (rare in crime data).

---

##### What the results could mean

- A strong positive Moran’s I supports the common claim that **“crime is not randomly distributed”** but concentrated in certain neighbourhoods or street segments.  
- This justifies focusing more detailed analysis and interventions on **clusters of high crime**.  
- If Moran’s I is weak or non-significant, hotspot analysis might be less meaningful, or the chosen **spatial scale** (hex size) may not be appropriate.

Remember: Global Moran’s I is a **single number** summarising the **entire study area**; it does *not* tell you *where* the clusters are located.

---

##### 🚀 Tasks:
- Use the hex grid GeoDataFrame (here called `gdf_h3_crimes`) with its crime-count column.
- Build K-nearest neighbours weights (each hex uses its 5 nearest neighbours).
- Compute Moran’s I for the crime-count field. 

In [None]:
# copy your h3 dataframe
hex_for_moran = gdf_h3_crimes.copy()

# change the type of your column crime count to float
hex_for_moran['___'] = hex_for_moran['___'].astype("float64")

In [None]:
# 1. Create spatial weights: 5 nearest neighbours for each hexagon
# We are using pysal to create the weights, which is a library for spatial analysis
# ps.weights.KNN uses the centroids of geometries by default.
# You can also define your weight as Rook or Queen
w = ps.weights.KNN.from_dataframe(___, k=___) # try different k value, what do you observe?
# Replace KNN by Queen or Rook, what do you observe?

# 2. Row-standardise the weights so each row sums to 1
w.transform = "R"

In [None]:
# 3. Compute global Moran's I on the crime count
mi = esda.Moran(hex_for_moran["crimes"], w) # replace "crimes" with the name of your own crime-count column; esda is PySAL's exploratory spatial data analysis package

print("------ Global Moran's I on hexagon counts ------")
print(f"Moran's I: {mi.I:.3f}")
print(f"Expected I under randomness: {mi.EI:.3f}")
print(f"p-value (permutation): {mi.p_sim:.4f}")

In [None]:
# 4. Visualize the data as a scatter plot 
# splot (part of the PySAL ecosystem) can draw a Moran scatter plot from the result
moran_scatterplot(mi, p=0.05);

💡 **Interpretation**
- If Moran’s I is clearly positive and the p-value is small (e.g. < 0.05),
then crime counts are spatially clustered, not random.
- If Moran’s I is near zero and p-value is large, the pattern is closer to random.

> ⁉️ Is your observed Moran’s I value closer to +1, 0, or −1? What does that mean substantively?
>
> ⁉️ What does the p-value tell you about whether the clustering is statistically significant?

### F. Where exactly are the clusters? (Local Moran’s I)

##### 📚 Concept: Local Moran’s I (Anselin), where are the clusters?

While **Global Moran’s I** tells us whether there is clustering *overall*,  
**Local Moran’s I** (also called *Anselin Local Moran*) tells us **where** clusters and outliers are located.

For each spatial unit (each hexagon in our case), Local Moran’s I compares:

- The **value in the hex** (e.g. its crime count), and  
- The **average value of its neighbours**.

Based on this, each hexagon can be classified into:

1. **High–High (HH)**: high value, surrounded by high values → **hot spot cluster**.  
2. **Low–Low (LL)**: low value, surrounded by low values → **cold spot cluster**.  
3. **High–Low (HL)**: high value, surrounded by low values → **spatial outlier** (a stand-alone hot cell).  
4. **Low–High (LH)**: low value, surrounded by high values → **another type of outlier**.

The method also gives a **p-value** for each hexagon, indicating whether its local pattern is statistically significant.
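
The quadrant logic can be sketched without any spatial library. The example below assumes six hypothetical units arranged along a line, each neighbouring the adjacent ones; it classifies units by the sign of their own deviation and of their neighbours' mean deviation. It leaves out the permutation-based p-values that `esda.Moran_Local` provides, so every unit gets a label regardless of significance.

```python
import numpy as np

def local_moran_quadrants(values, w):
    """Label each unit HH, LL, HL or LH and return an unscaled local statistic."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()                            # deviation from the overall mean
    w_row = w / w.sum(axis=1, keepdims=True)    # row-standardise: each row sums to 1
    lag = w_row @ z                             # mean deviation of each unit's neighbours
    labels = np.where(z >= 0,
                      np.where(lag >= 0, "HH", "HL"),
                      np.where(lag >= 0, "LH", "LL"))
    return labels, z * lag                      # local statistic, up to a constant scaling

# Six hypothetical units on a line; unit i neighbours units i-1 and i+1.
n_units = 6
w = np.zeros((n_units, n_units))
for i in range(n_units - 1):
    w[i, i + 1] = w[i + 1, i] = 1

counts = [9, 10, 2, 1, 1, 7]
labels, local_i = local_moran_quadrants(counts, w)
print(labels.tolist())
print(local_i.tolist())
```

The local statistic is positive for units that resemble their neighbours (HH and LL) and negative for outliers (HL and LH), which is the sense in which local values decompose the global pattern.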

---

##### What it means in spatial crime analysis

- **High–High** (HH) hexagons highlight **concentrated hot spots**: areas where both the cell and its neighbourhood have high crime.  
- **Low–Low** (LL) hexagons indicate **clustered low-crime areas** (cool spots).  
- **High–Low** and **Low–High** show **spatial outliers** that might be interesting for diagnosis (e.g. a single problematic block in an otherwise quiet area).

Local Moran’s I is therefore a **localised decomposition** of global Moran’s I:  
it breaks down the overall pattern into **location-specific stories**.

---

##### What the results could mean

- Regions with many significant **High–High** cells could be priority zones for **targeted interventions** (e.g. focused patrols, environmental design changes).  
- **Low–Low** clusters might be interpreted as relatively **safe areas**, possibly offering lessons about what works (good lighting, mixed land use, etc.).  
- **Outliers** (HL or LH) might signal **special cases**, such as:
  - a new emerging hotspot,  
  - measurement issues,  
  - or a localised crime generator (e.g. one problematic venue).

⚠️ **Caution**: When many local tests are performed, some “significant” clusters may occur **by chance** (multiple testing problem). Interpretation should be careful and contextual.


In [None]:
# Compute Local Moran's I
moran_loc = esda.Moran_Local( 
    hex_for_moran[___],
    ___, # same weight as before
    geoda_quads=True  # automatically gives 1,2,3,4 for HH, LL, LH, HL
)

# what is present in moran_loc?
___

In [None]:
# Attach results to GeoDataFrame
___["moran_cat"] = moran_loc.q               # quadrant category
___["moran_zscore"] = moran_loc.z_sim        # z-score of Local Moran
___["moran_pvalue"] = moran_loc.p_sim        # p-value

In [None]:
# plot the result with moran_scatterplot
___(moran_loc, ___)

> ⁉️ What is the difference compared to your previous plot?

##### Classifying clusters in a simple categorical field

We create a new column `cluster_moran` with labels:
- "High-High_90" for significant HH clusters at the 10% level
- "Low-Low_90" for significant LL clusters at the 10% level
- "Low-High_90" for significant LH outliers at the 10% level
- "High-Low_90" for significant HL outliers at the 10% level
- "Not_Significant" for all other cells
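
The mask-and-assign pattern can be tried first on a tiny made-up table. This is only a sketch: the toy values, the `alpha` threshold, and the `labels` dictionary are assumptions for illustration, and the quadrant codes follow the 1=HH, 2=LL, 3=LH, 4=HL convention.

```python
import pandas as pd

# Made-up quadrant codes and p-values for six cells
toy = pd.DataFrame({
    "moran_cat":    [1, 1, 2, 3, 4, 2],
    "moran_pvalue": [0.01, 0.40, 0.03, 0.02, 0.08, 0.60],
})

alpha = 0.10  # assumed significance threshold; adjust to the level you need
labels = {1: "High-High_90", 2: "Low-Low_90", 3: "Low-High_90", 4: "High-Low_90"}

# Start from the default, then overwrite the rows selected by each boolean mask
toy["cluster_moran"] = "Not_Significant"
for code, label in labels.items():
    mask = (toy["moran_cat"] == code) & (toy["moran_pvalue"] < alpha)
    toy.loc[mask, "cluster_moran"] = label

print(toy)
```

Each mask combines two conditions with `&`, and the parentheses around each comparison are required. Rows that match no mask keep the default label.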


In [None]:
# define default value for new column
hex_for_moran["cluster_moran"] = "Not_Significant"

# Assign the other values by creating a mask and using .loc on a specific column
# mask for p-value < 0.10 (10% level, i.e. 90% confidence) and category "High-High"
mask = ___
hex_for_moran.loc[mask, "cluster_moran"] = "High-High_90"

# do the same for the other values
___

In [None]:
# Visualize the results using kepler 
___

In [None]:
# You can create a quick visualization with .plot()
ax = hex_for_moran.plot(
    column="cluster_moran", # column to plot
    figsize=(8, 8),
    legend=True,
    categorical=True # colors are categorical
)
ax.set_title("Local Moran's I cluster types (High-High, Low-Low, etc.)")
ax.set_axis_off()


In [None]:
# You can also visualize the data directly in Plotly with an interactive map, but it can be cumbersome

# define the color mapping 
color_map = {
    'Not_Significant': 'lightgray',
    'High-High_90': 'lightcoral',
    'Low-Low_90': 'lightblue',
    'Low-High_90': 'blue',
    'High-Low_90': 'red',
}

# plot with attributes
px.choropleth_map(
    hex_for_moran, # dataframe to plot
    geojson=hex_for_moran.geometry, # column geometry
    locations=hex_for_moran.index, # index of each polygon
    color='cluster_moran', # column to color
    color_discrete_map=color_map, # color to use
    hover_data=['moran_zscore', 'moran_pvalue', 'cluster_moran'], # data to show on hover
    zoom=10, # zoom level
    center={"lat": 41.8781, "lon": -87.6298}, # center of the map
    height=1000, # height of the map
    map_style="light", # style of the basemap
    title="Anselin Local Moran's I of crime counts" # title of the map
)

> ⁉️ Where are the most prominent High-High areas located?
>
> ⁉️ Can you identify any High-Low or Low-High outliers and think of possible explanations?
>
> ⁉️ What are benefits/drawbacks of the different visualization methods we used?

### G. Hot and cold spots with Getis–Ord Gi*

##### Concept: Getis–Ord Gi*, direct hot spot and cold spot detection

The **Getis–Ord Gi\*** statistic is another local measure used to identify **hot spots** and **cold spots**.

Instead of comparing “value vs. neighbours’ average” (as Local Moran does), Gi\*:

- takes the **sum of values** in a location and its neighbours, and  
- compares this sum to what would be expected **under spatial randomness**.

The output is:

- A **z-score** for each spatial unit (hexagon), and  
- A **p-value** indicating whether this z-score is statistically significant.

Interpretation of the z-score:

- **High positive z-score** + low p-value → significant **hot spot** (cluster of high values).  
- **Large negative z-score** + low p-value → significant **cold spot** (cluster of low values).  
- z-scores near zero with high p-values → not significantly different from random.
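
To see where such z-scores come from, the sketch below computes Gi\* by hand for a made-up strip of eight cells, using the standard analytical formula with binary weights and each cell counted as its own neighbour. The function `gi_star`, the toy counts, and the left/right neighbour definition are assumptions for illustration. It does not use `esda`, whose permutation-based p-values are not reproduced here.

```python
import numpy as np

def gi_star(values, neighbours):
    """Gi* z-scores with binary weights; each unit is included in its own neighbourhood."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean = x.mean()
    s = np.sqrt((x ** 2).mean() - mean ** 2)   # population standard deviation
    z = np.empty(n)
    for i in range(n):
        idx = [i] + list(neighbours[i])        # the cell itself plus its neighbours
        w_sum = len(idx)                       # binary weights: sum(w) == sum(w**2)
        observed = x[idx].sum()                # local sum of values
        expected = mean * w_sum                # local sum expected under randomness
        spread = s * np.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        z[i] = (observed - expected) / spread
    return z

# Made-up 1-D strip: high counts on the left, low counts on the right
counts = [9, 8, 9, 1, 1, 1, 0, 1]
neighbours = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

z = gi_star(counts, neighbours)
for i, (c, zi) in enumerate(zip(counts, z)):
    print(f"cell {i}: count={c}, Gi* z={zi:+.2f}")
```

The cell in the middle of the high run receives the largest positive z-score, the cells inside the low run receive negative ones, and the cell at the boundary between the two runs sits close to zero.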

---

##### What it means in spatial crime analysis

In a crime context, Gi\* tells you:

> “Is this hexagon part of a **local concentration of high crime counts** (hot spot)  
>  or a **local concentration of low counts** (cold spot)?”

It focuses strongly on areas where **high values reinforce each other** spatially.

- Hot spots (high z, low p) are where high crime counts **pile up** spatially.  
- Cold spots (large negative z, low p) are areas where low crime counts cluster (possibly safer zones).

---

##### What the results could mean

- **Significant hot spots** highlight locations that may deserve **high priority** for prevention, enforcement, or situational interventions.  
- **Significant cold spots** might be used as reference or **“control” areas** – places where crime is consistently low.  
- Comparing Gi\* results with Local Moran’s I:
  - If both methods flag the same area as a hot spot, this **strengthens the evidence**.  
  - If they disagree, it invites more reflection on **scale, neighbourhood definition, or data issues**.

⚠️ **Caution**: Results depend on how you define “neighbours” (distance band, k-nearest neighbours, etc.).  
Different choices can slightly change which areas are labelled as hot or cold spots.


In [None]:
# Compute and visualize local G (Gi*) on hexagon point counts
# use the function G_Local from esda

___

> ⁉️ Do the Gi* hot spots coincide with the Local Moran High-High clusters?

> ⁉️ Which method (Local Moran vs Gi*) do you find clearer to interpret for policing decisions, and why?

**Congratulations! 🎉 You have successfully completed the exercise.**

In [None]:
# this line is to clear the output of the notebook, so that when you commit it, it is clean
!jupyter nbconvert --clear-output --inplace crime_ex.ipynb