# Exploration of geographically weighted random forest classification modelling

To-do:
- [x] global model
- [x] model evaluation
- [x] bandwidth optimisation
- [x] feature importances
- [x] golden section bandwidth selection
- [x] other metrics than accuracy
- [x] generic support (logistic regression, gradient boosting)
- [x] dedicated classes
- [ ] local performance of models that do not support OOB
    - [x] with logistic regression, we can call `predict_proba` and score the predictions on the full sample directly
    - [ ] with gradient boosting we cannot, as the model has already seen the data; we might need a train/test split to mimic OOB
- [x] logistic regression local coefficients
- [x] (optionally) predict method
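
The open item on gradient boosting could be approached roughly as follows. This is only a sketch under assumptions: it uses plain scikit-learn on synthetic data from `make_classification` as a stand-in for one local neighbourhood, it ignores the geographic weights entirely, and it is not how the `core.gw` classes are implemented.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the samples that fall within one local bandwidth.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Hold out a stratified share of the local sample to play the role of OOB data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

local_model = GradientBoostingClassifier(n_estimators=50, random_state=0)
local_model.fit(X_train, y_train)

# Accuracy on data the model has seen (optimistic) vs. on held-out data.
train_accuracy = local_model.score(X_train, y_train)
holdout_accuracy = local_model.score(X_test, y_test)
print(f"train: {train_accuracy:.3f}, hold-out: {holdout_accuracy:.3f}")
```

The hold-out accuracy is the quantity that would stand in for a local OOB score; the price is that each local model is trained on fewer samples.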

In [None]:
import geopandas as gpd
import pandas as pd
from geodatasets import get_path
from sklearn import preprocessing

from core.gw import BandwidthSearch
from core.gw.ensemble import GWGradientBoostingClassifier, GWRandomForestClassifier
from core.gw.linear_model import GWLogisticRegression

Get sample data

In [None]:
gdf = gpd.read_file(get_path("geoda.ncovr"))

In [None]:
# The data is in geographic coordinates (US) but we need to work with distances,
# so re-project to EPSG:5070. Use representative points only, as the graph
# builder requires point geometries anyway.
gdf = gdf.set_geometry(gdf.representative_point()).to_crs(5070)

### Random forest

In [None]:
gwrf = GWRandomForestClassifier(
    bandwidth=250, fixed=False, n_jobs=-1, keep_models=False
)
gwrf.fit(
    gdf.iloc[:, 9:15],
    gdf["STATE_NAME"],
    gdf.geometry,
)

Global OOB score (accuracy) of the GW model, measured on the OOB predictions from the individual local trees.

In [None]:
gwrf.oob_score_

Local OOB score.

In [None]:
gdf.plot(gwrf.local_oob_score_, legend=True, s=2)

Global score (accuracy) of the GW model, measured on the predictions for the focal observations.

In [None]:
gwrf.score_

F1 scores of the GW model, measured on the predictions for the focal observations.

In [None]:
gwrf.f1_macro, gwrf.f1_micro, gwrf.f1_weighted
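
The macro, micro and weighted F1 variants differ only in how per-class results are pooled. The sketch below computes all three by hand on a toy example; the helper `f1_averages` and the labels are hypothetical and only illustrate the standard definitions, not the internals of the GW classes.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Return macro, micro and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        # unweighted mean of per-class F1: every class counts equally
        "macro": sum(per_class.values()) / len(labels),
        # F1 from pooled counts: every observation counts equally
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # per-class F1 weighted by the number of true samples in each class
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }


y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]
scores = f1_averages(y_true, y_pred)
print(scores)
```

For single-label multiclass problems like this one, micro F1 coincides with accuracy, so the macro and weighted variants are the ones that add information about small classes.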

OOB score of the global model.

In [None]:
gwrf.global_model.oob_score_

Get local feature importances.

In [None]:
gwrf.feature_importances_

In [None]:
gdf.plot(gwrf.feature_importances_["HC60"], legend=True, s=2)

Compare to global feature importance.

In [None]:
gwrf.global_model.feature_importances_

### Gradient boosting

In [None]:
gwgb = GWGradientBoostingClassifier(
    bandwidth=250,
    fixed=False,
    n_jobs=-1,
    keep_models=False,
)
gwgb.fit(
    gdf.iloc[:, 9:15],
    gdf["STATE_NAME"],
    gdf.geometry,
)

Global score (accuracy) of the GW model, measured on the predictions for the focal observations.

In [None]:
gwgb.score_

F1 scores of the GW model, measured on the predictions for the focal observations.

In [None]:
gwgb.f1_macro, gwgb.f1_micro, gwgb.f1_weighted

Get local feature importances.

In [None]:
gwgb.feature_importances_

In [None]:
gdf.plot(gwgb.feature_importances_["HR90"], legend=True, s=2)

Compare to global feature importance.

In [None]:
gwgb.global_model.feature_importances_

### Logistic regression

In [None]:
gwlr = GWLogisticRegression(
    bandwidth=900_000,
    fixed=True,
    n_jobs=-1,
    keep_models=True,
    max_iter=500,
)
gwlr.fit(
    pd.DataFrame(
        preprocessing.scale(gdf.iloc[:, 9:15]), columns=gdf.iloc[:, 9:15].columns
    ),
    gdf["STATE_NAME"],
    gdf.geometry,
)

In [None]:
gwlr.score_

In [None]:
gdf.plot(gwlr.local_score_, legend=True, s=2)

In [None]:
gwlr.f1_macro, gwlr.f1_micro, gwlr.f1_weighted

Local coefficients

In [None]:
gwlr.local_coef_

In [None]:
gdf.plot(
    gwlr.local_coef_.xs("HR90", level=1)["Kansas"],
    missing_kwds=dict(color="lightgray"),
    legend=True,
)

Local intercepts

In [None]:
gwlr.local_intercept_

In [None]:
gdf.plot(
    gwlr.local_intercept_["Kansas"], missing_kwds=dict(color="lightgray"), legend=True
)

## Bandwidth search

Golden section search with a fixed distance bandwidth.

In [None]:
search = BandwidthSearch(
    GWRandomForestClassifier,
    fixed=True,
    n_jobs=-1,
    search_method="golden_section",
    criterion="aic",
    max_iterations=10,
    min_bandwidth=250_000,
    max_bandwidth=2_000_000,
    verbose=True,
)
search.fit(
    gdf.iloc[:, 9:15],
    gdf["STATE_NAME"],
    gdf.geometry,
)

Get the optimal bandwidth.

In [None]:
search.optimal_bandwidth
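
For intuition, a generic golden section search over a bandwidth interval can be sketched as below. This is a textbook version under assumptions: `golden_section_min` and the toy criterion are hypothetical, the criterion is assumed to be unimodal, and the stopping rule may differ from what `BandwidthSearch` does internally.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, roughly 0.618


def golden_section_min(criterion, lower, upper, tolerance=1.0, max_iterations=100):
    """Locate the minimiser of a unimodal `criterion` on [lower, upper]."""
    # Two interior probe points placed symmetrically in the interval.
    c = upper - INV_PHI * (upper - lower)
    d = lower + INV_PHI * (upper - lower)
    score_c, score_d = criterion(c), criterion(d)
    for _ in range(max_iterations):
        if upper - lower <= tolerance:
            break
        if score_c < score_d:
            # Minimum lies in [lower, d]; the old c becomes the new d.
            upper, d, score_d = d, c, score_c
            c = upper - INV_PHI * (upper - lower)
            score_c = criterion(c)
        else:
            # Minimum lies in [c, upper]; the old d becomes the new c.
            lower, c, score_c = c, d, score_d
            d = lower + INV_PHI * (upper - lower)
            score_d = criterion(d)
    return (lower + upper) / 2


def toy_criterion(bandwidth):
    # Toy stand-in for an AIC curve with its minimum at 900 km.
    return (bandwidth - 900_000) ** 2


best_bandwidth = golden_section_min(toy_criterion, 250_000, 2_000_000)
print(round(best_bandwidth))
```

Each iteration reuses one of the two previous evaluations, so only one new model fit is needed per step. That matters when every evaluation means fitting a full set of local models.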

Golden section search with an adaptive KNN bandwidth.

In [None]:
search = BandwidthSearch(
    GWLogisticRegression,
    fixed=False,
    n_jobs=-1,
    search_method="golden_section",
    criterion="aic",
    max_iterations=10,
    tolerance=0.1,
    verbose=True,
    max_iter=500,  # passed on to the logistic regression
)
search.fit(
    pd.DataFrame(
        preprocessing.scale(gdf.iloc[:, 9:15]), columns=gdf.iloc[:, 9:15].columns
    ),
    gdf["STATE_NAME"],
    gdf.geometry,
)

Get the optimal bandwidth.

In [None]:
search.optimal_bandwidth
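
The two searches above differ in what the bandwidth means: with `fixed=True` it is a distance in CRS units, with `fixed=False` it is a number of nearest neighbours. The sketch below illustrates the difference with a bisquare kernel. The kernel choice and the helper names are assumptions made for illustration; the kernel actually used by the GW classes is not shown in this notebook.

```python
def bisquare(distance, bandwidth):
    """Bisquare kernel: weight decays from 1 at the focal to 0 at the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    return (1 - (distance / bandwidth) ** 2) ** 2


def fixed_weights(distances, bandwidth):
    # Same distance threshold around every focal point.
    return [bisquare(d, bandwidth) for d in distances]


def adaptive_weights(distances, k):
    # Threshold is the distance to the k-th nearest neighbour of this focal,
    # so it widens in sparse areas and narrows in dense ones.
    bandwidth = sorted(distances)[k - 1]
    return [bisquare(d, bandwidth) for d in distances]


# Distances (in metres) from one focal point to five neighbours.
distances = [100.0, 200.0, 400.0, 800.0, 1600.0]

fixed = fixed_weights(distances, bandwidth=500.0)
adaptive = adaptive_weights(distances, k=4)
print(fixed)
print(adaptive)
```

In this version the k-th neighbour itself receives zero weight because it sits exactly on the bandwidth; implementations differ on that detail.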

## Prediction

To use the model for prediction, all local models need to be retained (`keep_models=True`). For random forests, that may require a significant amount of memory.

In [None]:
gwlr = GWLogisticRegression(
    bandwidth=900_000,
    fixed=True,
    n_jobs=-1,
    keep_models=True,
    max_iter=500,
)
gwlr.fit(
    pd.DataFrame(
        preprocessing.scale(gdf.iloc[:, 9:15]), columns=gdf.iloc[:, 9:15].columns
    ),
    gdf["STATE_NAME"],
    gdf.geometry,
)

Predict probabilities

In [None]:
all_data = pd.DataFrame(
    preprocessing.scale(gdf.iloc[:, 9:15]), columns=gdf.iloc[:, 9:15].columns
)

gwlr.predict_proba(all_data.iloc[:100], geometry=gdf.geometry.iloc[:100])

Predict labels (taking the class with the highest probability)

In [None]:
gwlr.predict(all_data.iloc[5:10], geometry=gdf.geometry.iloc[5:10])
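
Taking the maximum of the probabilities amounts to a row-wise argmax over the class columns. A minimal sketch on plain lists follows; `labels_from_proba` and the toy probabilities are hypothetical, and the actual return type of `predict_proba` in the GW classes may differ.

```python
def labels_from_proba(proba_rows, classes):
    """Pick, for each row, the class with the highest probability."""
    labels = []
    for row in proba_rows:
        # Index of the largest probability; ties resolve to the first class.
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best])
    return labels


classes = ["Kansas", "Nebraska", "Texas"]
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.4, 0.2],
]
predicted = labels_from_proba(proba, classes)
print(predicted)
```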