# Geographically Weighted Logistic Regression: Tokyo Mortality

This example applies `GWLogisticRegression` to a real public health question: **do the socioeconomic predictors of high mortality risk vary spatially across Tokyo municipalities?**

The standardised mortality ratio (SMR) compares observed deaths to the deaths expected given the age structure of each area. A municipality with SMR > 1 has more deaths than expected, a sign of excess mortality that may be linked to socioeconomic deprivation. But the relationship between deprivation indicators and mortality may not be spatially uniform: factors that drive excess mortality in the urban core may differ from those in suburban or rural areas.
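
As a rough illustration of how an SMR is built, the sketch below uses indirect standardisation: expected deaths are the sum, over age bands, of the local population times a reference death rate. All age bands, rates, and counts here are made up for illustration and are not taken from the Tokyo dataset, which already provides expected counts.

```python
# Hypothetical reference death rates (deaths per person per year) by age band.
reference_rates = {"25-44": 0.001, "45-64": 0.005}

# Hypothetical population of one municipality, using the same age bands.
population = {"25-44": 20_000, "45-64": 10_000}

observed_deaths = 84  # hypothetical observed count

# Expected deaths: what the area would see if it experienced the reference rates.
expected_deaths = sum(population[band] * reference_rates[band] for band in population)

smr = observed_deaths / expected_deaths
print(f"expected = {expected_deaths:.1f}, SMR = {smr:.2f}")
```

With these numbers the area expects 70 deaths and observes 84, so its SMR is 1.2 and it would count as having excess mortality.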

**Dataset:** Tokyo mortality data (`libpysal` `tokyo`), 262 municipalities in the Greater Tokyo region. Variables include unemployment, home ownership, elderly population share, and proportion in technical occupations.

## 1. Load data

In [None]:
import geopandas as gpd
import libpysal
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics

from gwlearn.linear_model import GWLogisticRegression

# Load shapefile and mortality CSV
gdf = gpd.read_file(libpysal.examples.get_path("tokyomet262.shp"))
df = pd.read_csv(libpysal.examples.get_path("Tokyomortality.csv"))

# Merge on area ID
gdf = gdf.merge(df, left_on="AreaID", right_on="IDnum0")

print(gdf.shape)
gdf[["AREANAME", "db2564", "eb2564", "OCC_TEC", "OWNH", "POP65", "UNEMP"]].head()

## 2. Construct the binary target

The standardised mortality ratio (SMR) is `db2564 / eb2564`: observed over expected deaths for ages 25–64. Municipalities with SMR > 1 have excess mortality. We threshold the SMR at 1 to obtain a binary classification target, `high_mortality`.

In [None]:
gdf["SMR"] = gdf["db2564"] / gdf["eb2564"]
gdf["high_mortality"] = (gdf["SMR"] > 1).astype(int)

print(gdf["high_mortality"].value_counts())
print("Proportion high mortality:", gdf["high_mortality"].mean().round(3))

About 40% of municipalities have excess mortality - a reasonably balanced binary target.

## 3. Explore the data

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

gdf.plot(
    column="SMR",
    cmap="RdYlGn_r",
    legend=True,
    legend_kwds={"label": "SMR (observed/expected deaths)", "shrink": 0.6},
    ax=axes[0],
)
axes[0].set_title("Standardised Mortality Ratio (SMR)")
axes[0].set_axis_off()

gdf.plot(
    column="UNEMP",
    cmap="YlOrRd",
    legend=True,
    legend_kwds={"label": "Unemployment rate (%)", "shrink": 0.6},
    ax=axes[1],
)
axes[1].set_title("Unemployment rate")
axes[1].set_axis_off()

plt.tight_layout()
plt.show()

There is visible spatial clustering in both SMR and unemployment, suggesting the relationship between deprivation and mortality may vary across the region.

## 4. Global logistic regression baseline

Before fitting GWLR, establish a global logistic regression baseline.

In [None]:
from sklearn.linear_model import LogisticRegression

X = gdf[["OCC_TEC", "OWNH", "POP65", "UNEMP"]]
y = gdf["high_mortality"]

glr = LogisticRegression(max_iter=1000).fit(X, y)
global_pred = glr.predict(X)

print(f"Global accuracy: {metrics.accuracy_score(y, global_pred):.3f}")
print(f"Global F1:       {metrics.f1_score(y, global_pred):.3f}")
print("\nGlobal coefficients:")
for name, coef in zip(X.columns, glr.coef_[0], strict=True):
    print(f"  {name}: {coef:.3f}")

## 5. Fit GWLogisticRegression

With 262 municipalities, an adaptive bandwidth of 80 neighbours gives each local model enough data while still capturing spatial variation.
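
To make the adaptive bisquare weighting concrete, here is a minimal sketch of how weights could be computed for a single focal point. It assumes the bandwidth is the distance to the k-th nearest point and that the focal point counts among its own neighbours; `adaptive_bisquare_weights` is a hypothetical helper, and gwlearn's implementation may differ in these details.

```python
import math


def adaptive_bisquare_weights(focal, points, k):
    """Bisquare weights for one focal point with a k-nearest-neighbour bandwidth."""
    distances = [math.dist(focal, p) for p in points]
    # Adaptive bandwidth: distance to the k-th nearest point (k >= 1).
    bandwidth = sorted(distances)[k - 1]
    weights = []
    for d in distances:
        if d < bandwidth:
            # Bisquare kernel: decays smoothly from 1 at the focal point to 0.
            weights.append((1 - (d / bandwidth) ** 2) ** 2)
        else:
            weights.append(0.0)  # at or beyond the bandwidth: no influence
    return weights


# Hypothetical coordinates; the focal point itself is included at distance 0.
points = [(0, 0), (1, 0), (0, 2), (3, 0), (0, 4)]
weights = adaptive_bisquare_weights(focal=(0, 0), points=points, k=4)
print([round(w, 3) for w in weights])
```

Because the bandwidth adapts to the k-th neighbour, local models in dense areas draw on a small geographic window while those in sparse areas reach further, which keeps the number of contributing observations roughly constant.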

In [None]:
model = GWLogisticRegression(bandwidth=80, fixed=False, kernel="bisquare")
model.fit(X, y, geometry=gdf.representative_point())

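# Share of focal locations for which a local model could be fitted
print(f"Prediction rate: {model.prediction_rate_:.3f}")
# Global baseline accuracy, repeated for reference (not a GWLR metric)
print(f"Global accuracy: {metrics.accuracy_score(y, global_pred):.3f}")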

## 6. Where does excess mortality cluster?

Map the predicted probability of high mortality at each focal municipality next to the observed high-mortality indicator.

In [None]:
gdf["pred"] = model.pred_
gdf["proba"] = model.proba_[1].values

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

gdf.plot(
    column="high_mortality",
    cmap="RdYlGn_r",
    legend=True,
    legend_kwds={"label": "High mortality (1=yes)", "shrink": 0.6},
    ax=axes[0],
)
axes[0].set_title("Observed: high mortality (SMR > 1)")
axes[0].set_axis_off()

gdf.plot(
    column="proba",
    cmap="RdYlGn_r",
    legend=True,
    legend_kwds={"label": "P(high mortality)", "shrink": 0.6},
    ax=axes[1],
)
axes[1].set_title("GWLR predicted probability of high mortality")
axes[1].set_axis_off()

plt.tight_layout()
plt.show()

## 7. Spatially varying coefficients

The local coefficients reveal which socioeconomic factors drive high mortality in different parts of the region. A positive local coefficient means that variable increases the log-odds of excess mortality locally.
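
Log-odds coefficients are easier to read as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of excess mortality per unit increase in the predictor. The sketch below uses made-up coefficient values and location names, not output from the fitted model.

```python
import math

# Hypothetical local coefficients (log-odds per unit increase) for one predictor
# at three locations; illustrative values only, not model output.
local_coefs = {"location_a": 0.7, "location_b": 0.0, "location_c": -0.4}

# exp(coefficient) is the odds ratio for a one-unit increase in the predictor.
odds_ratios = {loc: math.exp(beta) for loc, beta in local_coefs.items()}

for loc, ratio in odds_ratios.items():
    print(f"{loc}: odds ratio = {ratio:.2f}")
```

A coefficient of 0.7 roughly doubles the odds, a coefficient of 0 leaves them unchanged, and a coefficient of -0.4 cuts them by about a third.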

In [None]:
for col in X.columns:
    gdf[f"coef_{col}"] = model.local_coef_[col].values

model.local_coef_.describe()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for ax, col in zip(axes, X.columns, strict=True):
    gdf.plot(
        column=f"coef_{col}",
        cmap="RdBu_r",
        legend=True,
        legend_kwds={"label": "Local coef.", "shrink": 0.6},
        ax=ax,
    )
    ax.set_title(f"Local coef: {col}")
    ax.set_axis_off()

plt.suptitle("GWLR: Spatially varying predictors of high mortality - Tokyo", fontsize=13)
plt.tight_layout()
plt.show()

The coefficient maps reveal clear spatial heterogeneity:

- **`UNEMP`** has a strong positive effect (higher unemployment → higher mortality risk) in some areas but a weaker or negligible effect in others, suggesting unemployment is not a universal driver of excess mortality across the region.
- **`OWNH`** (home ownership) tends to be protective (negative coefficient) in the urban core but less so in outer municipalities.
- **`POP65`** and **`OCC_TEC`** also show spatially varying effects, reflecting the heterogeneous urban fabric of the Greater Tokyo region.

These patterns are invisible in the global logistic regression, which forces a single coefficient for each predictor.
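
A simple numeric companion to the maps is the share of locations where a local coefficient is positive or negative. The sketch below assumes missing values (NaN) mark locations without a fitted local model; `sign_shares` is a hypothetical helper and the coefficient values are made up for illustration.

```python
def sign_shares(coefs):
    """Return the shares of positive and negative values, ignoring NaN."""
    valid = [c for c in coefs if c == c]  # NaN is the only value not equal to itself
    positive = sum(c > 0 for c in valid) / len(valid)
    negative = sum(c < 0 for c in valid) / len(valid)
    return positive, negative


# Hypothetical local coefficients for one predictor (NaN = no local model fitted).
local_unemp = [0.8, 0.5, 0.1, -0.2, float("nan"), 0.3, -0.1, 0.0]
positive, negative = sign_shares(local_unemp)
print(f"positive: {positive:.2f}, negative: {negative:.2f}")
```

If a predictor's sign flips across a sizeable share of locations, a single global coefficient is averaging over opposing local effects.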

## 8. Pooled metrics

Use the pooled predictions to compute overall classification performance.
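
The classification report summarises precision, recall, and F1 for each class. As a reminder of what those numbers mean, the sketch below computes them by hand for the positive class on made-up labels; `binary_report` is a hypothetical helper, not part of scikit-learn or gwlearn.

```python
def binary_report(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical observed and predicted labels (1 = high mortality).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
report = binary_report(y_true, y_pred)
print(report)
```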

In [None]:
print(metrics.classification_report(model.y_pooled_, model.pred_pooled_))

## 9. Summary

This example used `GWLogisticRegression` to investigate spatial variation in the socioeconomic drivers of excess mortality across Tokyo:

- The **global logistic regression** identifies average relationships between deprivation indicators and high mortality risk across all 262 municipalities.
- **GWLR** reveals that these relationships are spatially non-stationary: the effects of unemployment, home ownership, and elderly population share vary with a municipality's location in the region.
- `local_coef_` maps the spatially varying log-odds coefficients, `proba_` gives the local predicted probability of excess mortality, and `pred_pooled_` enables pooled performance evaluation.

For guidance on handling class imbalance in geographically weighted classification, see [class imbalance](./imbalance.ipynb). For bandwidth selection, see [bandwidth search](./bandwidth_search.ipynb).