# A primer for working with the <u>*Sp*</u>atial <u>*Int*</u>eraction modeling (SpInt) module in the python spatial analysis library (PySAL)

## 1 Introduction

Spatial interaction modeling involves the analysis of flows from an origin to a destination, either over physical space (e.g., migration) or through abstract space (e.g., telecommunication). Despite being a fundamental technique in regional science (citation), there is relatively little software available to carry out spatial interaction modeling and the analysis of flow data. The purpose of this primer is therefore to provide an overview of the recently developed spatial interaction modeling (SpInt) module of the Python spatial analysis library (PySAL). First, the current framework of the module is highlighted. Next, the main functionality of the module is illustrated with a common regional science example, migration flows, using a dataset previously used for spatial interaction modeling tutorials in the R programming environment. Finally, some auxiliary functionality and future additions are discussed.


## 2 The SpInt Framework

### 2.1 Modeling framework

The core purpose of the SpInt module is to provide the functionality to calibrate spatial interaction models. Since the "family" of spatial interaction models put forth by Wilson (Wilson, 1971) is perhaps the most popular, it was chosen as the starting point of the module. Consider the basic gravity model (Fotheringham & O'Kelly, 1989),

$$T_{ij} = k\frac{V_{i}^\mu W_{j}^\alpha}{d_{ij}^\beta} \quad(1)$$
where 

$T_{ij}$ = an $n$ by $m$ matrix of flows from $n$ origins (subscripted by $i$) to $m$ destinations (subscripted by $j$)

$V$ = an $n$ by 1 vector of origin attributes

$W$ = an $m$ by 1 vector of destination attributes

$d$ = an $n$ by $m$ matrix of the costs to overcome the physical separation between  $i$ and $j$ (usually distance or time)

$k$ = a scaling factor to be estimated

$\mu$ = a vector of exponential parameters representing the effect of origin attributes on flows

$\alpha$ = a vector of exponential parameters representing the effect of destination attributes on flows

$\beta$ = an exponential parameter representing the effect of movement costs on flows. 

When data for $T$, $V$, $W$, and $d$ are available, we can estimate the model parameters (a process also called calibration), which summarize the effect that each model component contributes towards explaining the system of known flows ($T$). In contrast, known parameters can be used to predict unknown flows when there are deviations in model components ($V$, $W$, and $d$) or the set of locations in the system is altered.
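As a minimal sketch of the prediction case described above, the following plain-Python snippet evaluates equation (1) for a toy system. The function name `gravity_flows`, the toy data, and the parameter values are all hypothetical and chosen only for illustration; this is not the SpInt API.

```python
def gravity_flows(V, W, d, k, mu, alpha, beta):
    """Predict flows with the basic gravity model of equation (1):
    T[i][j] = k * V[i]**mu * W[j]**alpha / d[i][j]**beta
    """
    return [[k * V[i] ** mu * W[j] ** alpha / d[i][j] ** beta
             for j in range(len(W))]
            for i in range(len(V))]

# Hypothetical toy system: 2 origins and 3 destinations.
V = [100.0, 400.0]           # origin attributes (e.g., population)
W = [50.0, 200.0, 100.0]     # destination attributes (e.g., jobs)
d = [[10.0, 20.0, 40.0],     # cost of moving from origin i to destination j
     [20.0, 10.0, 5.0]]

# Assumed (not estimated) parameter values.
T = gravity_flows(V, W, d, k=0.5, mu=1.0, alpha=1.0, beta=2.0)

for row in T:
    print(row)
```

Changing an entry of `V`, `W`, or `d` and re-running the function corresponds to predicting flows under altered model components with the parameters held fixed.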

Using an entropy-maximizing framework, Wilson derives a more informative and flexible "family" of four spatial interaction models (Wilson, 1971). This framework seeks to assign flows between a set of origins and destinations by finding the most probable configuration of flows out of all possible configurations, without making any additional assumptions. By using a common optimization problem and including information about the total inflows and outflows at each location (also called constraints), the following "family" of models can be obtained:

$$
\begin{align}
&\text{Unconstrained} \\
&T_{ij} = V_{i}^\mu W_{j}^\alpha f(d_{ij}) \quad & (2) \\
\\
&\text{Production-constrained} \\
&T_{ij} = A_{i}O_{i}W_{j}^\alpha f(d_{ij}) \quad & (3) \\
&A_{i} = \frac{1}{\sum_{j} W_{j}^\alpha f(d_{ij})} \quad & (3a) \\
\\
&\text{Attraction-constrained} \\
&T_{ij} = B_{j}D_{j}V_{i}^\mu f(d_{ij}) \quad & (4) \\
&B_{j} = \frac{1}{\sum_{i} V_{i}^\mu f(d_{ij})} \quad & (4a) \\
\\
&\text{Doubly-constrained} \\
&T_{ij} = A_{i}B_{j}O_{i}D_{j}f(d_{ij}) \quad & (5) \\
&A_{i} = \frac{1}{\sum_{j} B_{j} D_{j} f(d_{ij})} \quad & (5a) \\
&B_{j} = \frac{1}{\sum_{i} A_{i} O_{i} f(d_{ij})} \quad & (5b)
\end{align}
$$


where 

$O_{i}$ = the total number of flows emanating from origin $i$

$D_{j}$ = the total number of flows terminating at destination $j$

$A_{i}$ = origin balancing factor that ensures the total out-flows are preserved in the predicted flows

$B_{j}$ = destination balancing factor that ensures the total in-flows are preserved in the predicted flows

$f(d_{ij})$ = a function of cost or distance, referred to as the distance-decay function. Most commonly this is an exponential or power function given by,

$$
\begin{align}
&\text{Power} \\
&f(d_{ij}) = d_{ij}^\beta \quad & (6) \\
\\
&\text{Exponential} \\
&f(d_{ij}) = \exp(\beta d_{ij}) \quad & (7)
\end{align}
$$

where $\beta$ is expected to take a negative value. Different distance-decay functions assume different responses to the higher costs associated with moving to more distant locations. Of note is that the unconstrained model with a power-function distance-decay is equivalent to the basic gravity model in equation (1), except that the scaling factor, $k$, is not included. In fact, there is no scaling factor in any of the members of the family of maximum entropy models because there is a total trip constraint implied in their derivation and subsequently incorporated into their calibration (Fotheringham & O'Kelly, 1989).
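The two distance-decay functions and the production-constrained model can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names and toy data are hypothetical (not part of SpInt), and the balancing factor $A_{i}$ is assumed to be the reciprocal of the sum over destinations, which is what makes the predicted outflows of each origin add up to $O_{i}$.

```python
import math

def power_decay(d, beta):
    """Power distance-decay, equation (6)."""
    return d ** beta

def exp_decay(d, beta):
    """Exponential distance-decay, equation (7)."""
    return math.exp(beta * d)

def production_constrained(O, W, d, alpha, beta, decay=power_decay):
    """Predict flows with the production-constrained model, equation (3).

    A_i is taken as 1 / sum_j(W_j**alpha * f(d_ij)), an assumption that
    forces each row of predicted flows to sum to the known outflow O_i.
    """
    T = []
    for i, O_i in enumerate(O):
        weights = [W[j] ** alpha * decay(d[i][j], beta) for j in range(len(W))]
        A_i = 1.0 / sum(weights)          # origin balancing factor
        T.append([A_i * O_i * w for w in weights])
    return T

# Hypothetical toy system: 2 origins and 3 destinations.
O = [300.0, 500.0]            # known total outflows
W = [50.0, 200.0, 100.0]      # destination attributes
d = [[10.0, 20.0, 40.0],
     [20.0, 10.0, 5.0]]

T = production_constrained(O, W, d, alpha=1.0, beta=-1.0)
row_sums = [sum(row) for row in T]
print(row_sums)  # each entry should match the corresponding O_i
```

Passing `decay=exp_decay` (with a suitably scaled negative `beta`) swaps in the exponential form without changing the rest of the calculation.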

The family provides different model structures depending on the available data or the research question at hand. The so-called unconstrained model does not conserve the total inflows or outflows during parameter estimation. The production-constrained and attraction-constrained models conserve either the total outflows or the total inflows, respectively, and are therefore useful for building models that allocate flows either to a set of destinations or to a set of origins. Finally, the doubly-constrained model conserves both the inflows and the outflows at each location during model calibration. The quantity of explanatory information provided by each model is given by the number of parameters it provides. As such, the unconstrained model provides the most information, followed by the two singly-constrained models, with the doubly-constrained model providing the least information. Conversely, a model's predictive power increases with higher quantities of built-in information (i.e., total inflows or outflows), so that the doubly-constrained model usually provides the most accurate predictions, followed by the two singly-constrained models, with the unconstrained model supplying the weakest predictions (Fotheringham & O'Kelly, 1989).
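To make the doubly-constrained case concrete, the sketch below solves for the two sets of balancing factors by alternating between them until they stop changing, and then checks that both the outflows and the inflows are conserved. It is a hedged illustration rather than the SpInt implementation: the function name, the convergence settings, and the toy data are hypothetical, each balancing factor is assumed to be the reciprocal of the sum involving the other, and the totals of `O` and `D` are assumed to be equal.

```python
def doubly_constrained(O, D, f, tol=1e-12, max_iter=1000):
    """Predict flows T_ij = A_i * B_j * O_i * D_j * f_ij (equation 5).

    A and B are found by alternating updates, assuming
    A_i = 1 / sum_j(B_j * D_j * f_ij) and B_j = 1 / sum_i(A_i * O_i * f_ij).
    `f` holds the distance-decay values f(d_ij), already evaluated.
    """
    n, m = len(O), len(D)
    A = [1.0] * n
    B = [1.0] * m
    for _ in range(max_iter):
        A_new = [1.0 / sum(B[j] * D[j] * f[i][j] for j in range(m))
                 for i in range(n)]
        B_new = [1.0 / sum(A_new[i] * O[i] * f[i][j] for i in range(n))
                 for j in range(m)]
        # Largest absolute change in any balancing factor this iteration.
        change = max(abs(old - new) for old, new in zip(A + B, A_new + B_new))
        A, B = A_new, B_new
        if change < tol:
            break
    return [[A[i] * B[j] * O[i] * D[j] * f[i][j] for j in range(m)]
            for i in range(n)]

# Hypothetical toy system; total outflows equal total inflows (800).
O = [300.0, 500.0]
D = [200.0, 400.0, 200.0]
d = [[10.0, 20.0, 40.0],
     [20.0, 10.0, 5.0]]
beta = -1.0                                    # assumed power-decay parameter
f = [[d_ij ** beta for d_ij in row] for row in d]

T = doubly_constrained(O, D, f)
print([sum(row) for row in T])                               # compare with O
print([sum(T[i][j] for i in range(len(O))) for j in range(len(D))])  # compare with D
```

Because both margins are fixed in advance, the only free parameter left to explain the flows is the distance-decay parameter, which is why this model carries the least explanatory information.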

### 2.2 Calibration framework

Spatial interaction models are often calibrated via linear programming, nonlinear optimization, or, increasingly, linear regression. Given the flexibility and extensibility of a regression framework, it was chosen as the primary model calibration technique within the SpInt module. By taking the log of both sides of a spatial interaction model, say the basic gravity model, it is possible to obtain the so-called log-linear or log-normal spatial interaction model:


$$\ln T_{ij} = \ln k + \mu \ln V_{i} + \alpha \ln W_{j} - \beta \ln d_{ij} + \epsilon$$


where $\epsilon$ is a normally distributed error term with a mean of 0. Parameters can then be obtained using either a log-normal linear model (like above) or a Poisson generalized linear model within a regression framework. In addition, known parameters can be used to predict unknown flows when there are deviations in model components ($V$, $W$, and $d$) or the set of locations in the system is altered.
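The log-normal route can be illustrated with ordinary least squares on logged variables. The sketch below is dependency-free Python written for illustration only and is not how SpInt itself is implemented: it generates noise-free flows from assumed gravity-model parameters, regresses $\ln T_{ij}$ on a constant, $\ln V_{i}$, $\ln W_{j}$, and $-\ln d_{ij}$, and recovers the parameters. All function names, data, and parameter values are hypothetical, and every flow is assumed to be strictly positive so that its log exists.

```python
import math

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def calibrate_log_linear(T, V, W, d):
    """OLS fit of ln T_ij = ln k + mu ln V_i + alpha ln W_j - beta ln d_ij."""
    X, y = [], []
    for i in range(len(V)):
        for j in range(len(W)):
            X.append([1.0, math.log(V[i]), math.log(W[j]), -math.log(d[i][j])])
            y.append(math.log(T[i][j]))
    p = len(X[0])
    # Normal equations: (X'X) b = X'y
    XtX = [[sum(row[r] * row[c] for row in X) for c in range(p)] for r in range(p)]
    Xty = [sum(row[r] * y_i for row, y_i in zip(X, y)) for r in range(p)]
    ln_k, mu, alpha, beta = solve_linear_system(XtX, Xty)
    return math.exp(ln_k), mu, alpha, beta

# Hypothetical data: flows generated without noise from assumed parameters.
V = [100.0, 400.0, 250.0]
W = [50.0, 200.0, 100.0]
d = [[10.0, 20.0, 40.0],
     [20.0, 10.0, 5.0],
     [30.0, 15.0, 25.0]]
true_k, true_mu, true_alpha, true_beta = 0.5, 0.8, 1.2, 1.5
T = [[true_k * V[i] ** true_mu * W[j] ** true_alpha / d[i][j] ** true_beta
      for j in range(3)] for i in range(3)]

k_hat, mu_hat, alpha_hat, beta_hat = calibrate_log_linear(T, V, W, d)
print(k_hat, mu_hat, alpha_hat, beta_hat)
```

With real flow data the fit would not be exact, and observed flows of zero cannot be logged, which is one practical reason a Poisson specification is often considered alongside the log-normal one.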



## 3 An Illustrative example: migration in Austria

### 3.1 The data

Finally, there are several paradigms for incorporating spatial effects into spatial interaction models (competing destinations, spatial autoregressive, eigenvector spatial filter). Therefore, the primary goals of this GSoC proposal are to:

- Implement new structures and algorithms to capture dependence in flow data.
- Develop Poisson count models that incorporate spatial effects.

A generalized linear model (GLM) approach will be adopted for modeling counts, so developing this framework would be the first outcome of this project, which could be used more widely throughout PySAL. While this functionality currently exists in another Python library (statsmodels), the newly developed GLM framework would a) accommodate a sparse data structure often utilized in PySAL's spatial regression module and b) be lightweight so that it would be simple to extend for various models with spatial effects. The other major outcome of this project would be a comprehensive module focused on spatial interaction modeling, which would include exploratory statistics, data structures, and models, as well as documentation and educational materials. In terms of exploratory statistics, the module will consist of tests of overdispersion and spatial dependence to detect potential problems when modeling counts of flows. New data structures will include origin-destination weights and network-based weights, which are needed to capture spatial dependence in flows. Finally, the Poisson GLM framework will be extended to include models that account for overdispersion or spatial dependence. The final module could exist within the core of PySAL or as a contributor module, depending on which dependencies are necessary.

