Spatial Interaction Modeling Package
===========================================

The **Sp**atial **Int**eraction Modeling (SpInt) package seeks to provide a collection of tools to study spatial interaction processes.

It currently supports the calibration of the 'family' of spatial interaction models (Wilson, 1971) which are derived using an entropy maximizing (EM) framework or the equivalent information minimizing (IM) framework. As such, it is able to derive parameters for the following models:

- unconstrained gravity model
- production-constrained model (origin-constrained)
- attraction-constrained model (destination-constrained)
- doubly-constrained model


Calibration is carried out using the maximum likelihood estimation routines outlined in Fotheringham and O'Kelly (1989) and Williams and Fotheringham (1984). Optimization is achieved using scipy.optimize.fsolve(). Overall, the package currently depends upon numpy, scipy, and pandas.

Fotheringham, A. S. and O'Kelly, M. E. (1989). Spatial Interaction Models: Formulations and Applications. London: Kluwer Academic Publishers.

Williams, P. A. and Fotheringham, A. S. (1984). The Calibration of Spatial Interaction Models by Maximum Likelihood Estimation with Program SIMODEL. Geographic Monograph Series, 7, Department of Geography, Indiana University.

Wilson, A. G. (1971). A family of spatial interaction models, and associated developments. Environment and Planning A, 3, 1–32.

#Basic Concepts and the Gravity Model

At their core, spatial interaction models weigh the cost of overcoming a physical separation against the benefits of doing so. Early models originated from an analogy to Newton's law of gravitational attraction between two bodies, where the number of flows between two locations is given by the product of the populations of the origin and destination, divided by the distance between them. This relationship can be generalized in the following manner:

$$T_{ij} = k\frac{V_{i}^\mu W_{j}^\alpha}{d_{ij}^\beta}$$
where 

$T_{ij}$ = an $n$ by $m$ matrix of flows from $n$ origins (subscripted by $i$) to $m$ destinations (subscripted by $j$)

$V$ = an $n$ by 1 vector of origin attributes

$W$ = an $m$ by 1 vector of destination attributes

$d$ is an $n$ by $m$ matrix of the costs to overcome the physical separation between  $i$ and $j$ (usually distance or time)

$k$ is a scaling factor to be estimated

$\mu$ = a vector of exponential parameters representing the effect of origin attributes on flows

$\alpha$ = a vector of exponential parameters representing the effect of destination attributes on flows

$\beta$ = an exponential parameter representing the effect of movement costs on flows. 
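As a minimal sketch of the formula above, the snippet below evaluates the generalized gravity model for a small made-up system of two origins and three destinations. The function name `gravity_flows` and all input values are illustrative assumptions and are not part of SpInt.

```python
import numpy as np

def gravity_flows(V, W, d, k=1.0, mu=1.0, alpha=1.0, beta=1.0):
    """Evaluate T_ij = k * V_i**mu * W_j**alpha / d_ij**beta (hypothetical helper)."""
    V = np.asarray(V, dtype=float).reshape(-1, 1)  # n x 1 column of origin attributes
    W = np.asarray(W, dtype=float).reshape(1, -1)  # 1 x m row of destination attributes
    d = np.asarray(d, dtype=float)                 # n x m matrix of separation costs
    # Broadcasting expands V and W to the n x m shape of d
    return k * (V ** mu) * (W ** alpha) / (d ** beta)

# Made-up example: 2 origins, 3 destinations
V = [100, 200]
W = [50, 80, 120]
d = [[10, 20, 40],
     [25, 10, 20]]

T = gravity_flows(V, W, d)
print(T)  # e.g. T[0, 0] = 100 * 50 / 10 = 500
```

With every exponent equal to 1, each flow is the product of the two masses divided by the distance. Raising `beta` makes flows fall off more steeply with distance.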



When data for $T$, $V$, $W$, and $d$ are available we can estimate the model parameters (also called calibration), which summarize the effect that each model component contributes towards explaining the system of known flows ($T$). This is often done via regression by taking the log of both sides of the gravity model to obtain the following linear model:


$$\log{T_{ij}} = \log{k} + \mu\log{V_{i}} + \alpha\log{W_{j}} - \beta\log{d_{ij}} + \epsilon$$


where $\epsilon$ is an error term. Parameters can then be obtained using either a log-normal linear model (as above) or a Poisson generalized linear model within a regression framework. In addition, known parameters can be used to predict unknown flows when there are deviations in model components ($V$, $W$, and $d$) or when the set of locations in the system is altered.
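The log-linear route can be sketched with ordinary least squares in NumPy. The data below are simulated with known parameters, so the example only shows that regressing log flows on the logs of $V$, $W$, and $d$ recovers those parameters approximately. It is not the maximum likelihood routine used by SpInt. The variable names, parameter values, and noise level are all assumptions, and the random generator call assumes a recent NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 200

# Simulated origin attributes, destination attributes, and distances
V = rng.uniform(10, 100, n_obs)
W = rng.uniform(10, 100, n_obs)
d = rng.uniform(5, 50, n_obs)

# Assumed "true" parameters used only to generate the synthetic flows
k_true, mu_true, alpha_true, beta_true = 2.0, 0.8, 1.2, 1.5
noise = rng.normal(0.0, 0.05, n_obs)  # multiplicative log-normal error
T = k_true * V ** mu_true * W ** alpha_true / d ** beta_true * np.exp(noise)

# Design matrix for: log T = log k + mu log V + alpha log W - beta log d + e
X = np.column_stack([np.ones(n_obs), np.log(V), np.log(W), -np.log(d)])
coef = np.linalg.lstsq(X, np.log(T), rcond=None)[0]
log_k_hat, mu_hat, alpha_hat, beta_hat = coef

print("k = %.3f, mu = %.3f, alpha = %.3f, beta = %.3f"
      % (np.exp(log_k_hat), mu_hat, alpha_hat, beta_hat))
```

Because the regression is on logged flows, observations with zero flows cannot be used directly. This is one common reason for preferring the Poisson formulation.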



A maximum entropy framework can be used to analytically derive the gravity model (Wilson, 1971). This framework seeks to assign flows between a set of origins and destinations by finding the most probable configuration of flows out of all possible configurations, without making any additional assumptions. By using a common optimization problem and including information about the total inflows and outflows at each location (also called constraints), the following "family" of models can be obtained:

###$$Unconstrained$$

$$T_{ij} = V_{i}^\mu W_{j}^\alpha  f(d_{ij})$$


###$$Production-Constrained$$

$$T_{ij} = A_{i}O_{i}W_{j}^\alpha f(d_{ij})$$

###$$Attraction-Constrained$$
$$T_{ij} = B_{j}D_{j}V_{i}^\mu f(d_{ij})$$

###$$Doubly-Constrained$$
$$T_{ij} = A_{i}B_{j}O_{i}D_{j}f(d_{ij})$$


where 

$O_{i}$ = the total number of flows emanating from origin $i$

$D_{j}$ = the total number of flows terminating at destination $j$

$A_{i}$ = origin balancing factor that ensures the total out-flows are preserved in the predicted flows

$B_{j}$ = destination balancing factor that ensures the total in-flows are preserved in the predicted flows

$f(d_{ij})$ = a function of cost or distance, referred to as the distance-decay function. Most commonly this is an exponential or power function

This set of models may collectively be referred to as gravity models. For details regarding the balancing factors, as well as the derivation and estimation of these models the reader is referred to Fotheringham and O'Kelly (1989). 
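To make the role of a balancing factor concrete, here is a small sketch of the production-constrained model. It uses the standard expression $A_{i} = 1 / \sum_{j} W_{j}^\alpha f(d_{ij})$ and assumes a power distance-decay function written as $f(d_{ij}) = d_{ij}^\beta$ with a negative $\beta$. The function name and all input values are illustrative and are not taken from SpInt.

```python
import numpy as np

def production_constrained_flows(O, W, d, alpha, beta):
    """Predict flows T_ij = A_i * O_i * W_j**alpha * d_ij**beta (hypothetical helper)."""
    O = np.asarray(O, dtype=float)  # total outflow of each origin (length n)
    W = np.asarray(W, dtype=float)  # destination attribute (length m)
    d = np.asarray(d, dtype=float)  # n x m cost matrix, strictly positive
    # Attractiveness of destination j as seen from origin i
    pull = W[np.newaxis, :] ** alpha * d ** beta
    # Balancing factor: one value per origin, the inverse of its total pull
    A = 1.0 / pull.sum(axis=1)
    T = (A * O)[:, np.newaxis] * pull
    return T, A

# Made-up system: 2 origins, 3 destinations
O = [100.0, 250.0]
W = [10.0, 20.0, 30.0]
d = [[1.0, 2.0, 4.0],
     [3.0, 1.0, 2.0]]

T, A = production_constrained_flows(O, W, d, alpha=1.0, beta=-1.0)
print(T.sum(axis=1))  # row sums reproduce the known outflows O
```

The balancing factor rescales each row so that the predicted outflows of an origin sum to its observed total. That is what "constrained" means in this family of models.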

The so-called unconstrained model (also called the total-flow-constrained model, since the total number of predicted flows is preserved to match the observed data) does not conserve the total inflows or outflows at each location during parameter estimation and is considered equivalent to the traditional gravity model. The production-constrained and attraction-constrained models conserve the total outflows from each origin and the total inflows to each destination, respectively, and are therefore useful for building models that allocate individuals either to a set of destinations or to a set of origins. Finally, the doubly-constrained model conserves both the inflows and the outflows at each location during model calibration. The quantity of explanatory information provided by each model is given by the number of parameters it provides. As such, the unconstrained model provides the most information, followed by the two singly-constrained models, with the doubly-constrained model providing the least information. Conversely, a model's predictive power increases with higher quantities of built-in information (i.e., total inflows or outflows), so that the doubly-constrained model usually provides the most accurate predictions, followed by the two singly-constrained models, with the unconstrained model supplying the weakest predictions (Fotheringham & O'Kelly, 1989). Of note is that the entire family of models can be calibrated either by using a custom optimization routine (Williams & Fotheringham, 1984) or by using a regression framework.

#Calibrating Spatial Interaction Models

Here we demonstrate how to use the SpInt module within PySAL to calibrate an unconstrained model, a singly-constrained model, and a doubly-constrained model.

###Let's prepare the proper imports

In [26]:
#from pysal.contrib.spint import gravity
#The above is not working because I never added spint to the init file, so I am importing it locally

import os
os.chdir('/Users/toshan/Dropbox/spint/pysal/pysal/contrib/spint')
import gravity
import numpy as np
import pandas as pd

###Next, we prepare the data

Here is a simple toy dataset that was generated using the unconstrained gravity model with all of the parameters set to 1 (-1 for $\beta$) and a power function for the distance-decay.

In [27]:


#Simulated flows with pre-set parameters and origin/destination/distance data below
flows = np.array([56, 100.8, 173.6, 235.2, 87.36,
                 28., 100.8, 69.44, 235.2, 145.6,
                 22., 26.4, 136.4, 123.2, 343.2,
                 14., 75.6, 130.2, 70.56, 163.8,
                 22, 59.4,  204.6,  110.88,  171.6])

#Origin populations
V = np.repeat(np.array([56, 56, 44, 42, 66]), 5)

#Origin labels
origins = np.repeat(np.array(range(1, 6)), 5)

#Destination populations
W = np.tile(np.array([10, 18, 62, 84, 78]), 5)

#Destination labels
destinations = np.tile(np.array(range(1, 6)), 5)

#Distances between all origins and destinations
dij = np.array([10, 10, 20, 20, 50,
                20, 10, 50, 20, 30,
                20, 30, 20, 30, 10,
                30, 10, 20, 50, 20,
                30, 20, 20, 50, 30])

#Create pandas DataFrame from vectors of input data
data = pd.DataFrame({'origins': origins,
                        'destinations': destinations,
                        'V': V,
                        'W': W,
                        'dij': dij,
                        'flows': flows})

#Let's examine the format of the data
data.head()

    V   W  destinations  dij   flows  origins
0  56  10             1   10    56.0        1
1  56  18             2   10   100.8        1
2  56  62             3   20   173.6        1
3  56  84             4   20   235.2        1
4  56  78             5   50   87.36        1


###Now let's calibrate the model

In [28]:
#Calibrate unconstrained gravity model using SpInt.gravity
#The first input is our DataFrame
#The second input is the column name containing our origins
#The third input is our column name containing our destinations
#The fourth input is the column name containing observed flows between an origin and destination
#The fifth input is a list of column names containing origin attributes
#The sixth input is a list of column names containing destination attributes
#The seventh input is the column name containing the cost (distance) measure
#The final input is the distance-decay function, which is either 'pow' or 'exp' - in this case a power function
model = gravity.Unconstrained(data, 'origins', 'destinations', 'flows', ['V'], ['W'], 'dij', 'pow')

#Let's examine the parameters derived from calibration and model fit
print model.p
print model.fit_stats

{'beta': -1.0, 'W': 1.0, 'V': 1.0}
{'r_squared': 1.0, 'srmse': 2.6524754783996632e-13}


Since calibration recovers the same parameters that were used to generate the data, the observed flows and the flows predicted by the calibrated model are identical. We therefore get an $r^2$ value of 1 and a standardized root mean square error (SRMSE) of essentially 0, indicating a perfect fit.
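For readers unfamiliar with these two fit statistics, the sketch below shows one common way to compute them. It assumes that SRMSE is the root mean square error divided by the mean observed flow and that $r^2$ is the squared Pearson correlation between observed and predicted flows. Whether SpInt computes `fit_stats` in exactly this way is an assumption. The helper names are hypothetical.

```python
import numpy as np

def srmse(observed, predicted):
    """RMSE divided by the mean observed flow (assumed definition)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((observed - predicted) ** 2))
    return rmse / observed.mean()

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted flows."""
    return np.corrcoef(observed, predicted)[0, 1] ** 2

# First five flows of the toy dataset
observed = np.array([56.0, 100.8, 173.6, 235.2, 87.36])
perfect = observed.copy()   # predictions that match exactly
biased = observed * 1.1     # predictions that are 10% too high everywhere

print("perfect: r2 = %.3f, srmse = %.3f"
      % (r_squared(observed, perfect), srmse(observed, perfect)))
print("biased:  r2 = %.3f, srmse = %.3f"
      % (r_squared(observed, biased), srmse(observed, biased)))
```

The biased case illustrates why both statistics are worth reporting. A correlation-based $r^2$ stays at 1 under uniform over-prediction, while SRMSE registers the error.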

###Production-Constrained Model of 1970 Airline passenger data

Now let's take a look at a dataset of 1970 airline trips from Atlanta to 25 other major U.S. cities using a production-constrained model. This particular data set is from one origin to 25 destinations, therefore we refer to it as an origin-specific, production-constrained model. Destinations are represented by 1970 census population estimates and distances are given by great-circle routes between cities in miles.

In [29]:
#Empirically observed flows
flows = np.array([0, 6469, 7629, 20036, 4690,
                  6194, 11688, 2243, 8857, 7248,
                  3559, 9221, 10099, 22866, 3388,
                  9986, 46618, 11639, 1380, 5261,
                  5985, 6731, 2704, 12250, 16132])

#Empirically observed populations for destination cities
pop = np.array([1596000, 2071000, 3376000, 6978000, 1345000,
                2064000, 2378000, 1239000, 4435000, 1999000,
                1274000, 7042000, 834000, 1268000, 1965000,
                1046000, 12131000, 4824000, 969000, 2401000,
                2410000, 2847000, 1425000, 1089000, 2909000])

#Origin labels - All flows leave from the same origin
origins = np.repeat(1, 25)

#Destination labels
destinations = np.array(range(1, 26))

#Great circle routes between the origin and each destination in miles
dij = np.array([0, 576, 946, 597, 373,
                559, 707, 1208, 602, 692,
                681, 1934, 332, 595, 906,
                425, 755, 672, 1587, 526,
                484, 2141, 2182, 410, 540])

#Create pandas DataFrame from vectors of input data
data = pd.DataFrame({'origins': origins,
                    'destinations': destinations,
                    'pop': pop,
                    'dij': dij,
                    'flows': flows})

#Calibrate production-constrained gravity model using SpInt.gravity
#The first input is our DataFrame
#The second input is the column name containing our origins
#The third input is our column name containing our destinations
#The fourth input is the column name containing observed flows between an origin and destination
#The fifth input is a list of column names containing destination attributes - now we don't include origin attributes
#The sixth input is the column name containing the cost (distance) measure
#The final input is the distance-decay function, which is either 'pow' or 'exp' - in this case we use a power function.
model = gravity.ProductionConstrained(data, 'origins', 'destinations', 'flows', ['pop'], 'dij', 'pow')

#Let's examine the parameters derived from calibration and model fit
print 'Model Parameters:'
print model.p
print 'Parameter standard errors:'
for parameter in model.p:
    print parameter, model.parameter_stats[parameter]['standard_error']
print 'Model fit:'
print model.fit_stats

Model Parameters:
{'beta': -0.7365098, 'pop': 0.7818262}
Parameter standard errors:
beta 0.00527344186143
pop 0.00276730528921
Model fit:
{'r_squared': 0.60516003720997413, 'srmse': 0.57873206718148507}


Here we can see in an empirical scenario that using only the destination population and the great-circle distance between two locations accounts for more than half of the variation in airline trips from Atlanta ($r^2 \approx 0.61$). If we had more than one origin, we could calibrate an origin-specific model for each origin, and then compare the parameter estimates across origins. Additionally, if we had more origins, we could calibrate a global model and then look at the origin balancing factors as a measure of accessibility.

In [30]:
model.dt.Ai

0     0.000049
1     0.000049
2     0.000049
3     0.000049
4     0.000049
5     0.000049
6     0.000049
7     0.000049
8     0.000049
9     0.000049
10    0.000049
11    0.000049
12    0.000049
13    0.000049
14    0.000049
15    0.000049
16    0.000049
17    0.000049
18    0.000049
19    0.000049
20    0.000049
21    0.000049
22    0.000049
23    0.000049
Name: Ai, dtype: float64

In this case, we only have one origin, and so the balancing factor will be the same for all of the OD pairs.

###Doubly-Constrained model of 1970 inter-regional migration in the U.S.

Next, we will look at a doubly-constrained model based on 1970 migration between 9 major census regions of the U.S. (New England, Mid-Atlantic, East North-Central, West North-Central, South Atlantic, East South-Central, West South-Central, Mountain, and Pacific). Distance is measured in miles between regional centroids.

In [31]:
#Empirically observed flows
flows = np.array([0, 180048, 79223, 26887, 198144, 17995, 35563, 30528, 110792,
                  283049, 0, 300345, 67280, 718673, 55094, 93434, 87987, 268458,
                  87267, 237229, 0, 281791, 551483, 230788, 178517, 172711, 394481,
                  29877, 60681, 286580, 0, 143860, 49892, 185618, 181868, 274629,
                  130830, 382565, 346407, 92308, 0, 252189, 192223, 89389, 279739,
                  21434, 53772, 287340, 49828, 316650, 0, 141679, 27409, 87938,
                  30287, 64645, 161645, 144980, 199466, 121366, 0, 134229, 289880,
                  21450, 43749, 97808, 113683, 89806, 25574, 158006, 0, 437255,
                  72114, 133122, 229764, 165405, 266305, 66324, 252039, 342948, 0])

#Origin labels
origins = np.repeat(np.array(range(1, 10)), 9)

#Destination labels
destinations = np.tile(np.array(range(1, 10)), 9)

#Distances - miles between regional centroids
dij = np.array([0, 219, 1009, 1514, 974, 1268, 1795, 2420, 3174,
                219, 0, 831, 1336, 755, 1049, 1576, 2242, 2996,
                1009, 831, 0, 505, 1019, 662, 933, 1451, 2205,
                1514, 1336, 505, 0, 1370, 888, 654, 946, 1700,
                974, 755, 1019, 1370, 0, 482, 1144, 2278, 2862,
                1268, 1049, 662, 888, 482, 0, 662, 1795, 2380,
                1795, 1576, 933, 654, 1144, 662, 0, 1287, 1779,
                2420, 2242, 1451, 946, 2278, 1795, 1287, 0, 754,
                3147, 2996, 2205, 1700, 2862, 2380, 1779, 754, 0])

#Create pandas DataFrame from vectors of input data
data = pd.DataFrame({'origins': origins,
                   'destinations': destinations,
                   'flows': flows,
                   'dij': dij})


#Calibrate doubly-constrained gravity model using SpInt.gravity
#The first input is our DataFrame
#The second input is the column name containing our origins
#The third input is our column name containing our destinations
#The fourth input is the column name containing observed flows between an origin and destination
#The fifth input is the column name containing the cost (distance) measure
#The final input is the distance-decay function, which is either 'pow' or 'exp' - in this case we use an exponential function.
#Note that now we do not include origin or destination attributes

#Drop intra-regional flows (origin == destination), which are all zero in this data set
data = data[data['origins'] != data['destinations']]

model = gravity.DoublyConstrained(data, 'origins', 'destinations', 'flows', 'dij', 'exp')
print 'Model Parameters:'
print model.p
print 'Parameter standard errors:'
for parameter in model.p:
    print parameter, model.parameter_stats[parameter]['standard_error']
print 'Model fit:'
print model.fit_stats

Model Parameters:
{'beta': -0.0007369}
Parameter standard errors:
beta 4.91774334184e-07
Model fit:
{'r_squared': 0.89682406680906979, 'srmse': 0.24804939821988789}


We can immediately see that the doubly-constrained model has a very good model fit (high $r^2$ and low SRMSE). We only get one parameter, the distance-decay parameter, so we cannot comment on how origin or destination attributes affect flow magnitudes. Further, in this example we used an exponential distance-decay function. This function is scale dependent, which means that different units of distance (e.g., miles versus feet) will result in different magnitudes of parameter estimates. Since this does not happen when using a power function, the power function is considered scale-free and is suggested for use when the goal is to compare distance-decay across different regions.
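The scale dependence of the exponential function can be checked numerically. The sketch below is a simplified illustration and not SpInt's calibration routine. It builds noise-free log flows under each decay function, then fits a straight line with distances expressed in miles and in feet. The distances are a handful of values borrowed from the migration example, and the two "true" decay parameters are assumptions.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least-squares line of y on x (hypothetical helper)."""
    return np.polyfit(x, y, 1)[0]

d_miles = np.array([219.0, 505.0, 754.0, 1009.0, 1514.0, 2205.0])
d_feet = d_miles * 5280.0  # same separations, different unit

# Noise-free log flows under each decay function (assumed parameters)
log_T_exp = -0.0007 * d_miles       # exponential decay: log T = beta * d
log_T_pow = -1.0 * np.log(d_miles)  # power decay: log T = beta * log d

beta_exp_miles = ols_slope(d_miles, log_T_exp)
beta_exp_feet = ols_slope(d_feet, log_T_exp)
beta_pow_miles = ols_slope(np.log(d_miles), log_T_pow)
beta_pow_feet = ols_slope(np.log(d_feet), log_T_pow)

print("exponential: miles %.3e, feet %.3e" % (beta_exp_miles, beta_exp_feet))
print("power:       miles %.3f, feet %.3f" % (beta_pow_miles, beta_pow_feet))
```

Under the exponential function the estimate shrinks by the unit conversion factor of 5280. Under the power function the change of unit only shifts the intercept, because $\log(5280\,d) = \log 5280 + \log d$, so the slope is unchanged.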

We can also look at balancing factors:



In [32]:
print model.dt.Ai.unique()

[  2.39483263e-07   2.35092001e-07   2.17669513e-07   1.92408732e-07
   2.89951493e-07   1.85045929e-07   2.14688091e-07   2.15225252e-07
   5.17919756e-07]


In [33]:
print model.dt.Bj.unique()

[ 0.94347802  0.84516698  0.7672086   0.99427312  0.72600121  0.84248238
  0.92510457  1.98316323  0.85263218]
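For intuition about where such balancing factors come from, the sketch below iterates the standard pair of equations $A_{i} = 1 / \sum_{j} B_{j} D_{j} f(d_{ij})$ and $B_{j} = 1 / \sum_{i} A_{i} O_{i} f(d_{ij})$ until they stop changing. It is a generic illustration with made-up totals and distances, not the routine used inside SpInt. The function name `balance` and its arguments are assumptions.

```python
import numpy as np

def balance(O, D, f, max_iter=1000, tol=1e-10):
    """Iterate origin (A) and destination (B) balancing factors (hypothetical helper)."""
    O = np.asarray(O, dtype=float)  # observed outflow totals (length n)
    D = np.asarray(D, dtype=float)  # observed inflow totals (length m)
    f = np.asarray(f, dtype=float)  # n x m distance-decay values f(d_ij)
    A = np.ones(len(O))
    B = np.ones(len(D))
    for _ in range(max_iter):
        A_new = 1.0 / (f * (B * D)[np.newaxis, :]).sum(axis=1)
        B_new = 1.0 / (f * (A_new * O)[:, np.newaxis]).sum(axis=0)
        converged = (np.allclose(A_new, A, rtol=tol, atol=0.0) and
                     np.allclose(B_new, B, rtol=tol, atol=0.0))
        A, B = A_new, B_new
        if converged:
            break
    # Predicted flows T_ij = A_i * B_j * O_i * D_j * f(d_ij)
    T = (A * O)[:, np.newaxis] * (B * D)[np.newaxis, :] * f
    return T, A, B

# Made-up system with matching totals (sum of O equals sum of D)
O = [100.0, 200.0, 300.0]
D = [150.0, 250.0, 200.0]
d = np.array([[10.0, 219.0, 1009.0],
              [219.0, 10.0, 831.0],
              [1009.0, 831.0, 10.0]])
f = np.exp(-0.001 * d)  # exponential distance-decay with an assumed parameter

T, A, B = balance(O, D, f)
print(T.sum(axis=1))  # matches O
print(T.sum(axis=0))  # matches D
```

Only the products $A_{i}B_{j}$ are pinned down by the constraints. Multiplying every $A_{i}$ by a constant and dividing every $B_{j}$ by the same constant leaves the predicted flows unchanged, so the absolute magnitudes of the two sets of factors depend on how the iteration is started.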


These are the very basics of spatial interaction modeling. These models can be used to analyze flows of many types, such as international trade, migration, commuting, public transportation use, retail revenue, international freight, and telecommunications. Many different variables can be used to describe origins, destinations, or the barriers to travel between locations. There are also a number of extensions to account for spatial structure, spatial dependency, and spatial heterogeneity. These include eigenvector spatial filtering, spatial regression using connectivity weight matrices, and geographically weighted regression techniques.