# Geographically Weighted Regression: US Income Convergence

This example applies `GWLinearRegression` to a real-world economic question: **does the relationship between a US state's per-capita income in 1929 and its income growth by 2009 vary across space?**

This is the classic **income convergence** hypothesis in regional economics. The global (OLS) version asks whether poorer states grew faster than richer ones, which would indicate convergence toward a common income level. But the relationship may not be uniform: the South, the Northeast, and the West have very different economic histories, and GWR lets us test whether the convergence effect itself varies spatially.

**Dataset:** US per-capita income by state, 1929–2009, from `libpysal.examples` (`us_income`). This is a real dataset used in spatial econometrics research.

## 1. Load data

In [None]:
import geopandas as gpd
import libpysal
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression

from gwlearn.linear_model import GWLinearRegression

# Load state geometries and income data
gdf = gpd.read_file(libpysal.examples.get_path("us48.shp"))
income = pd.read_csv(libpysal.examples.get_path("usjoin.csv"))

# Align FIPS codes and merge
income["STATE_FIPS"] = income["STATE_FIPS"].astype(str).str.zfill(2)
gdf["STATE_FIPS"] = gdf["STATE_FIPS"].astype(str).str.zfill(2)
gdf = gdf.merge(
    income[["STATE_FIPS", "Name"] + [str(y) for y in range(1929, 2010)]],
    on="STATE_FIPS",
)

gdf[["Name", "STATE_ABBR", "SUB_REGION", "1929", "2009"]].head()

## 2. Construct variables

We model **income growth** (percentage increase in per-capita income, 1929–2009) as a function of the **log of initial income** in 1929 and a mid-century income control. Log-transforming income is standard in convergence models to linearise the relationship.
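
Written out, the specification described above can be sketched as follows. The symbols $g_i$, $y_{i,t}$, $\beta_k$ and $\varepsilon_i$ are notation introduced here for illustration; they do not come from the dataset or from `gwlearn`.

$$
g_i = 100 \cdot \frac{y_{i,2009} - y_{i,1929}}{y_{i,1929}},
\qquad
g_i = \beta_0 + \beta_1 \ln y_{i,1929} + \beta_2 \ln y_{i,1969} + \varepsilon_i
$$

Here $y_{i,t}$ is per-capita income of state $i$ in year $t$. In this notation, $\beta_1 < 0$ is the convergence signature: states with lower 1929 income grew faster.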

In [None]:
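# Percentage growth in per-capita income between 1929 and 2009
gdf["growth"] = (gdf["2009"] - gdf["1929"]) / gdf["1929"] * 100
# Log-transformed income levels used as predictors
gdf["log_income_1929"] = np.log(gdf["1929"])
gdf["log_income_1969"] = np.log(gdf["1969"])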

gdf[["Name", "1929", "2009", "growth", "log_income_1929"]].sort_values("growth").head(8)

## 3. Explore the data

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

gdf.plot(
    column="1929",
    cmap="YlOrRd",
    legend=True,
    legend_kwds={"label": "Per-capita income 1929 (USD)", "shrink": 0.6},
    ax=axes[0],
)
axes[0].set_title("Per-capita income, 1929")
axes[0].set_axis_off()

gdf.plot(
    column="growth",
    cmap="RdYlGn",
    legend=True,
    legend_kwds={"label": "Income growth 1929-2009 (%)", "shrink": 0.6},
    ax=axes[1],
)
axes[1].set_title("Income growth 1929-2009 (%)")
axes[1].set_axis_off()

plt.tight_layout()
plt.show()

The South had very low initial incomes but shows some of the highest growth rates, while the Northeast started wealthy but grew more modestly. This spatial pattern motivates a geographically weighted model.

## 4. Global OLS baseline

Before fitting GWR, establish a global OLS baseline. OLS estimates one coefficient per predictor for the whole country; GWR will replace each with a set of spatially varying local estimates.

In [None]:
X = gdf[["log_income_1929", "log_income_1969"]]
y = gdf["growth"]

ols = LinearRegression().fit(X, y)
print(f"OLS R²: {metrics.r2_score(y, ols.predict(X)):.3f}")
print("OLS coefficients:")
for name, coef in zip(X.columns, ols.coef_, strict=True):
    print(f"  {name}: {coef:.2f}")

The negative coefficient on `log_income_1929` supports convergence globally — states with lower initial incomes tended to grow faster. But does this hold everywhere?
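
To see why a single global slope can hide regional differences, here is a minimal sketch on synthetic data (not the `us_income` data). The two regions, their slopes, and the helper `ols_slope` are all hypothetical and chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration only: two hypothetical regions with opposite slopes.
n = 200
x = rng.normal(size=n)                    # stand-in for centred log initial income
east = np.arange(n) >= n // 2             # hypothetical region indicator
true_slope = np.where(east, 0.5, -2.0)    # "convergence" only in the west
y = true_slope * x + rng.normal(scale=0.1, size=n)


def ols_slope(x, y):
    """Slope of a least-squares line fitted with an intercept."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1]


global_slope = ols_slope(x, y)
west_slope = ols_slope(x[~east], y[~east])
east_slope = ols_slope(x[east], y[east])
print(f"global: {global_slope:.2f}, west: {west_slope:.2f}, east: {east_slope:.2f}")
```

The pooled slope lands between the two regional slopes and describes neither region well, which is the situation GWR is designed to detect.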

## 5. Fit GWLinearRegression

With only 48 states, we use an adaptive bandwidth (`fixed=False`) of 24 nearest neighbours, so each local model draws on roughly half the observations: enough for a stable fit while still capturing regional variation.
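
As a rough sketch of what an adaptive bisquare bandwidth means, the snippet below computes kernel weights for one focal point from random coordinates. The function `adaptive_bisquare_weights` is hypothetical, and it assumes the local bandwidth is the distance to the k-th nearest observation, counting the focal point itself; `gwlearn` may handle this detail differently.

```python
import numpy as np


def adaptive_bisquare_weights(coords, focal_idx, k):
    """Bisquare weights around one focal point with a k-nearest-neighbour bandwidth.

    Assumption: the bandwidth is the distance to the k-th nearest observation
    (the focal point counts as the first); implementations differ here.
    """
    d = np.linalg.norm(coords - coords[focal_idx], axis=1)
    h = np.sort(d)[k - 1]              # distance to the k-th nearest observation
    w = (1.0 - (d / h) ** 2) ** 2      # bisquare kernel
    w[d >= h] = 0.0                    # no influence at or beyond the bandwidth
    return w


rng = np.random.default_rng(42)
coords = rng.uniform(0, 10, size=(48, 2))   # 48 hypothetical state centroids
w = adaptive_bisquare_weights(coords, focal_idx=0, k=24)
print(f"non-zero weights: {np.count_nonzero(w)} of {len(w)}")
```

The focal observation gets weight 1, weights decay smoothly with distance, and everything outside the neighbourhood is ignored, so the bandwidth adapts to how densely observations are packed around each location.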

In [None]:
model = GWLinearRegression(bandwidth=24, fixed=False, kernel="bisquare")
model.fit(X, y, geometry=gdf.representative_point())

# pred_ holds focal predictions: each state is predicted by the local model centred on it
print(f"GWR focal R²: {metrics.r2_score(y, model.pred_):.3f}")
print(f"OLS       R²: {metrics.r2_score(y, ols.predict(X)):.3f}")

## 6. Spatial variation in the convergence effect

The local coefficient on `log_income_1929` is the key output. Where it is strongly negative, poorer states grew faster (convergence). Where it is near zero or positive, the convergence effect breaks down.
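
Conceptually, each local coefficient comes from a weighted least-squares fit centred on one observation. The sketch below illustrates that idea on synthetic data whose true slope drifts from west to east. The helpers `bisquare` and `local_fit` are hypothetical and are not the `gwlearn` implementation.

```python
import numpy as np


def bisquare(d, h):
    """Bisquare kernel: positive inside bandwidth h, zero outside."""
    return np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)


def local_fit(X, y, coords, focal_idx, k):
    """Weighted least squares centred on one observation.

    Returns [intercept, slope_1, ...]. Assumes an adaptive bandwidth equal to
    the distance to the k-th nearest observation (focal point included).
    """
    d = np.linalg.norm(coords - coords[focal_idx], axis=1)
    h = np.sort(d)[k - 1]
    sw = np.sqrt(bisquare(d, h))               # square-root weights for WLS
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef


rng = np.random.default_rng(1)
n = 48
coords = np.column_stack([np.linspace(0, 10, n), np.zeros(n)])  # points on a line
x = rng.normal(size=n)
true_slope = -2.0 + 0.4 * coords[:, 0]     # -2 in the west, +2 in the east
y = true_slope * x + rng.normal(scale=0.1, size=n)

west = local_fit(x[:, None], y, coords, focal_idx=0, k=24)
east = local_fit(x[:, None], y, coords, focal_idx=n - 1, k=24)
print(f"local slope, west: {west[1]:.2f}; east: {east[1]:.2f}")
```

Repeating the fit at every observation yields one coefficient vector per location, which is the kind of table that `local_coef_` exposes.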

In [None]:
# Attach local estimates and diagnostics to the GeoDataFrame for mapping
gdf["coef_1929"] = model.local_coef_["log_income_1929"].values
gdf["coef_1969"] = model.local_coef_["log_income_1969"].values
gdf["local_r2"] = model.local_r2_.values
gdf["resid"] = model.resid_.values

model.local_coef_.describe()

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

gdf.plot(
    column="coef_1929",
    cmap="RdBu",
    legend=True,
    legend_kwds={"label": "Local coef.", "shrink": 0.6},
    ax=axes[0],
)
axes[0].set_title("Local coef: log income 1929\n(convergence effect)")
axes[0].set_axis_off()

gdf.plot(
    column="coef_1969",
    cmap="RdBu",
    legend=True,
    legend_kwds={"label": "Local coef.", "shrink": 0.6},
    ax=axes[1],
)
axes[1].set_title("Local coef: log income 1969\n(mid-century control)")
axes[1].set_axis_off()

gdf.plot(
    column="local_r2",
    cmap="YlGn",
    legend=True,
    legend_kwds={"label": "Local R²", "shrink": 0.6},
    ax=axes[2],
)
axes[2].set_title("Local R²")
axes[2].set_axis_off()

plt.suptitle("GWR: Spatial variation in US income convergence (1929-2009)", fontsize=13)
plt.tight_layout()
plt.show()

The convergence effect is strongest (most negative) across the South and parts of the Midwest — states that started poor grew fast relative to their initial income. It is weaker or reversed in the Northeast and Pacific states, where initial wealth did not predict slower growth. This regional heterogeneity is invisible in the global OLS.

The local R² is highest in the South and Plains states where the convergence story is clearest, and lower in the Mountain West where other factors (resource booms, migration) dominate.

## 7. Which states diverge most from the global trend?

In [None]:
print("States with strongest local convergence (most negative local coef.):")
print(
    gdf[["Name", "SUB_REGION", "coef_1929"]]
    .nsmallest(5, "coef_1929")
    .to_string(index=False)
)

print("\nStates where convergence is weakest or reversed:")
print(
    gdf[["Name", "SUB_REGION", "coef_1929"]]
    .nlargest(5, "coef_1929")
    .to_string(index=False)
)

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
gdf.plot(
    column="resid",
    cmap="PiYG",
    legend=True,
    legend_kwds={"label": "Residual", "shrink": 0.6},
    ax=ax,
)
ax.set_title("Focal residuals - GWR income convergence model")
ax.set_axis_off()
plt.tight_layout()
plt.show()

## 8. Summary

This example used `GWLinearRegression` to decompose a real economic relationship spatially:

- The **global OLS** model finds evidence of income convergence across US states — poorer states grew faster on average over 1929–2009.
- **GWR** reveals this is not uniform: convergence was strongest in the South and Midwest, and much weaker in the Northeast and Pacific states.
- `local_coef_` surfaces the spatially varying convergence effect, `local_r2_` shows where the model fits well, and `resid_` highlights states the model cannot fully explain.

This spatial decomposition is invisible in a standard regression and is the core motivation for geographically weighted models.

For bandwidth selection guidance, see [bandwidth search](./bandwidth_search.ipynb). For prediction at new locations, see the [prediction guide](./predict.ipynb).