# `scikit-learn` Interface Demo

**By Tyler D. Hoffman**

This notebook demonstrates the `spreg.sklearn` submodule, which provides a `scikit-learn`-style interface to the spatial regression models in `spreg`. Importantly, you **must** import `spreg.sklearn` directly: for compatibility reasons, importing `spreg` does not automatically import anything from `spreg.sklearn`.

## Imports

In [1]:
import os
os.chdir("..")  # make sure to run the notebook in spreg/, not spreg/notebooks

In [2]:
import spreg.sklearn  # directly import spreg.sklearn
import numpy as np
import pandas as pd
import geopandas as gpd
from libpysal.examples import load_example
from libpysal.weights import Kernel, fill_diagonal
from sklearn.linear_model import LinearRegression

## Load data 

We'll use the Boston housing example for this demonstration.

In [3]:
boston = load_example("Bostonhsg")
boston_df = gpd.read_file(boston.get_path("boston.shp"))

Example not available: Bostonhsg
Example not downloaded: Chicago parcels
Example not downloaded: Chile Migration
Example not downloaded: Spirals


### Make weights matrix 

We'll use a kernel weights matrix and set its diagonal to 0, which is necessary for the lag model.

In [4]:
weights = Kernel(boston_df[["x", "y"]], k=50, fixed=False)
weights = fill_diagonal(weights, 0)
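To see what zeroing the diagonal changes, here is a minimal sketch that uses a small dense NumPy array in place of the `libpysal` weights object. The coordinates, the triangular kernel, and the bandwidth are made up for illustration; they are not the settings `Kernel` uses above. The point is only that a kernel gives each observation a nonzero weight on itself, so the spatial lag `W @ y` would otherwise contain each observation's own value.

```python
import numpy as np

# Five hypothetical points on a line (illustrative coordinates only).
coords = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])

# Pairwise distances between all points, shape (5, 5).
dist = np.abs(coords - coords.T)

# Triangular kernel with a made-up bandwidth: the weight decays linearly
# with distance and is zero beyond the bandwidth.
bandwidth = 2.5
W = np.clip(1.0 - dist / bandwidth, 0.0, None)

# Zero the diagonal so that no observation counts as its own neighbor.
W_zero = W.copy()
np.fill_diagonal(W_zero, 0.0)

y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
lag_with_self = W @ y          # each y_i contributes to its own lag
lag_without_self = W_zero @ y  # only neighboring values contribute

# With this kernel the self-weight is 1, so the two lags differ by exactly y.
print(lag_with_self - lag_without_self)
```

In a lag model the spatial lag of `y` appears on the right-hand side, so leaving the self-weights in place would put each `y_i` on both sides of its own equation.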

### Transform data

These variable transformations are inspired by the original paper (Harrison and Rubinfeld, 1978).

In [5]:
boston_df["NOXSQ"] = (10 * boston_df["NOX"])**2
boston_df["RMSQ"] = boston_df["RM"]**2
boston_df["LOGDIS"] = np.log(boston_df["DIS"].values)
boston_df["LOGRAD"] = np.log(boston_df["RAD"].values)
boston_df["TRANSB"] = boston_df["B"].values / 1000
boston_df["LOGSTAT"] = np.log(boston_df["LSTAT"].values)
boston_df["LCMEDV"] = np.log(boston_df["CMEDV"].values)

fields = ["RMSQ", "AGE", "LOGDIS", "LOGRAD", "TAX", "PTRATIO",
          "TRANSB", "LOGSTAT", "CRIM", "ZN", "INDUS", "CHAS", "NOXSQ"]

X = boston_df[fields].values
y = boston_df["LCMEDV"].values  # predict log corrected median house prices from covars

## Run regressions

The `scikit-learn` paradigm requires users to first instantiate a model object with all relevant hyperparameters (e.g., a flag to fit without an intercept) and then call the `.fit()` method on that object with the data. In `spreg.sklearn`, the spatial weights matrix is treated as a *hyperparameter*, so it belongs in the object instantiation, not in the fit method. Users who want to use a different weights matrix (e.g., to study a different spatial domain or to test a different weights construction) need to create a new model object. This reflects the idea that different weights matrices define fundamentally different models.

For more information about the design pattern, see [`scikit-learn`'s documentation](https://scikit-learn.org/dev/developers/develop.html#apis-of-scikit-learn-objects).
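The pattern described above can be sketched with a toy estimator. `ToySpatialOLS` is a hypothetical class written for this illustration, not part of `spreg.sklearn`, and it only runs ordinary least squares on `X` and `W @ X`. The ring-shaped weights matrices and the simulated data are also made up. It is a plain class that mimics the conventions; inheriting from `sklearn.base.BaseEstimator` would additionally provide `get_params()` and `set_params()`.

```python
import numpy as np


class ToySpatialOLS:
    """Hypothetical estimator: OLS on X plus the spatially lagged X (W @ X)."""

    def __init__(self, w, fit_intercept=True):
        # Hyperparameters only: stored as given, no data is touched here.
        self.w = w
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        # Data arrives in fit(); learned attributes get a trailing underscore.
        design = np.hstack([X, self.w @ X])
        if self.fit_intercept:
            design = np.hstack([np.ones((X.shape[0], 1)), design])
        params, *_ = np.linalg.lstsq(design, y, rcond=None)
        if self.fit_intercept:
            self.intercept_, self.coef_ = params[0], params[1:]
        else:
            self.intercept_, self.coef_ = 0.0, params
        return self  # returning self allows model = Model(...).fit(X, y)


# Simulated data (illustrative only).
rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=n)

# Two made-up weights matrices on a ring: neighbors one step away,
# and neighbors one or two steps away.
idx = np.arange(n)
w_near = np.zeros((n, n))
w_near[idx, (idx + 1) % n] = 1.0
w_near[idx, (idx - 1) % n] = 1.0
w_wide = w_near.copy()
w_wide[idx, (idx + 2) % n] = 1.0
w_wide[idx, (idx - 2) % n] = 1.0

# Different weights mean different model objects, fitted on the same data.
model_near = ToySpatialOLS(w=w_near).fit(X, y)
model_wide = ToySpatialOLS(w=w_wide).fit(X, y)
print(model_near.coef_)
print(model_wide.coef_)
```

Because the weights live on the object, swapping them requires constructing a new model rather than passing a different argument to `.fit()`.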

First, we'll fit an ordinary linear regression. There is no `OLS` model in `spreg.sklearn` as that functionality is already implemented in [`sklearn.linear_model.LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). 

In [6]:
ols_model = LinearRegression()
ols_model = ols_model.fit(X, y)

To examine the parameters, print the model's `coef_` attribute. To examine the intercept, print the model's `intercept_` attribute.

In [7]:
ols_model.coef_

array([ 6.25494507e-03,  7.09912518e-05, -1.97838956e-01,  8.95655637e-02,
       -4.19109258e-04, -2.95984861e-02,  3.61095092e-01, -3.74895205e-01,
       -1.17721086e-02,  9.16910022e-05,  1.78958133e-04,  9.21276564e-02,
       -6.37263848e-03])

In [8]:
ols_model.intercept_

4.5624910924128645

To check the $R^2$ value, call the model's `.score()` method after fitting it. This is the only built-in diagnostic.

In [9]:
ols_model.score(X, y)

0.8107607250154572
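For reference, `scikit-learn` regressors define `.score()` as the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on simulated data (the data and coefficients are made up, and the least-squares fit uses NumPy rather than `LinearRegression` to keep the example self-contained).

```python
import numpy as np

# Simulated regression data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([0.5, -1.0, 0.25]) + rng.normal(scale=0.5, size=100)

# Ordinary least squares with an intercept column.
design = np.hstack([np.ones((100, 1)), X])
params, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ params

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

Under this definition the score is at most 1 and can be negative when the predictions fit worse than the mean of `y`.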

### Spatial regressions: Error model

Next, we'll run each of the spatial regressions available in `spreg.sklearn`, beginning with a spatial error model. Note the weights matrix in the object instantiation -- it is required to fit the model.

In [10]:
err_model = spreg.sklearn.Error(w=weights)
err_model = err_model.fit(X, y)

Again, to examine the parameters, print `coef_`; to examine the intercept, print `intercept_`. To examine the indirect effects, print `indir_coef_`.

In [11]:
print(err_model.coef_)
print(err_model.intercept_)
print(err_model.indir_coef_)

[[ 7.35518674e-03 -4.29865292e-04 -2.01115258e-01  6.82776069e-02
  -3.30976561e-04 -1.93652673e-02  4.54408242e-01 -3.26286708e-01
  -9.68432158e-03  3.51129915e-04  9.19893727e-05  1.88359031e-02
  -5.96219189e-03]]
[4.23116752]
0.04353101847090626


The `.score()` method works as expected here.

In [12]:
err_model.score(X, y)

0.7939042454931853

### Spatial regressions: Lag model

The lag model has the same interface as the error model.

In [13]:
lag_model = spreg.sklearn.Lag(w=weights)
lag_model = lag_model.fit(X, y)

In [14]:
print(lag_model.coef_)
print(lag_model.intercept_)
print(lag_model.indir_coef_)

[[ 6.14339339e-03  2.26105887e-04 -1.84181072e-01  9.31509417e-02
  -4.55771906e-04 -2.95346123e-02  3.50112238e-01 -3.76173848e-01
  -1.18480437e-02  1.56371840e-05  4.55785100e-04  9.39152867e-02
  -6.23001773e-03]]
[4.60421033]
[-0.00114406]


In [15]:
lag_model.score(X, y)

0.8080272377791352

### Spatial regressions: Durbin Error and Durbin Lag

The module also supports spatial Durbin error models and spatial Durbin lag models, which are error and lag models that also include spatial lags of the covariates. The `coef_` vector is divided into two halves: the first half holds the coefficients on the covariates, and the second half holds the coefficients on the spatial lags of the covariates. The `indir_coef_` attribute remains the error or lag indirect effect.
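One way to work with the two halves is to split the vector and pair each half with the covariate names. The sketch below uses a placeholder array in place of a fitted model's `coef_`, shaped `(1, 26)` to match the 13 covariates in this notebook. It assumes the lagged coefficients appear in the same order as the covariates, which the description above suggests but which is worth checking against the model.

```python
import numpy as np

fields = ["RMSQ", "AGE", "LOGDIS", "LOGRAD", "TAX", "PTRATIO",
          "TRANSB", "LOGSTAT", "CRIM", "ZN", "INDUS", "CHAS", "NOXSQ"]
k = len(fields)

# Placeholder standing in for a fitted Durbin model's coef_.
# The values are dummies (0, 1, ..., 25), not estimates.
coef = np.arange(2 * k, dtype=float).reshape(1, 2 * k)

# First half: coefficients on the covariates.
# Second half: coefficients on their spatial lags (assumed to share the same order).
direct, lagged = np.split(coef.ravel(), 2)

by_field = {name: (b, g) for name, b, g in zip(fields, direct, lagged)}
for name, (b, g) in by_field.items():
    print(f"{name:8s} covariate={b:6.2f}  lagged={g:6.2f}")
```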

In [16]:
dbe_model = spreg.sklearn.DurbinError(w=weights)
dbe_model = dbe_model.fit(X, y)

In [17]:
print(dbe_model.coef_)
print(dbe_model.intercept_)
print(dbe_model.indir_coef_)

[[ 8.51880600e-03 -9.61310814e-04 -4.12041520e-01  6.87692988e-02
  -2.81979093e-04 -1.72742843e-02  5.72545729e-01 -2.69246399e-01
  -9.61633719e-03  4.74338313e-04 -2.20766438e-04 -2.45517804e-02
  -6.10938045e-03 -3.57820949e-04  3.62743548e-04  3.89233750e-03
  -3.09133014e-03  7.47187817e-05  1.99352024e-03 -8.94818943e-02
  -1.71318357e-02 -2.40669452e-03  4.18398242e-04 -4.49588142e-04
   3.93452810e-02 -4.08762373e-04]]
[4.45922345]
0.05251925600792307


In [18]:
dbe_model.score(X, y)

0.828800396792965

In [19]:
dbl_model = spreg.sklearn.DurbinLag(w=weights)
dbl_model = dbl_model.fit(X, y)

In [20]:
print(dbl_model.coef_)
print(dbl_model.intercept_)
print(dbl_model.indir_coef_)

[[ 7.75931400e-03 -8.85986639e-04 -4.41425611e-01  6.42999863e-02
  -2.58947268e-04 -1.95120176e-02  5.55489850e-01 -2.83361306e-01
  -9.41170489e-03  7.20905407e-04  1.26453644e-03 -3.16009229e-02
  -7.33345485e-03 -1.89275246e-03  4.40414080e-04  1.81035813e-02
  -9.16128156e-04  8.44565903e-05 -2.31942559e-04 -8.43331238e-02
  -1.37801401e-02 -1.85738593e-03  2.16608095e-04 -1.18221640e-03
   3.11502856e-02 -6.05074684e-05]]
[4.6268752]
[0.0203128]


In [21]:
dbl_model.score(X, y)

0.5644467981833367

## Formulas

Finally, the formula parser works for the `scikit-learn` models as well -- just be sure to use `spreg.sklearn.from_formula()`, not `spreg.from_formula()`. The behavior is the same (refer to `formula_example.ipynb` in this directory for more information), except that this version does not accept combinations of spatial error and spatial lag of y models, or skedastic errors. These have been intentionally left unimplemented in `spreg.sklearn` to streamline the submodule.